Research | Yao Zhang

Selected works along my research line on agentic system design, reasoning reliability, and scalable learning.

System Level — Agentic System Architecture and Scalable Multi-Agent Autonomy

SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence

Yao Zhang, Chenyang Lin, Shijie Tang, Haokun Chen, Shijie Zhou, Yunpu Ma, and Volker Tresp

EMNLP 2025 (Main)

SwarmAgentic is a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. It reformulates particle swarm optimization into interpretable text-symbol updates over agent roles and coordination structures, enabling efficient exploration of the agentic system design space.

Scalable Autonomy, Automated Agentic System Generation, Swarm Intelligence

The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated multi-agent generation.
@misc{zhang2025swarmagenticfullyautomatedagentic,
  title = {SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence},
  author = {Zhang, Yao and Lin, Chenyang and Tang, Shijie and Chen, Haokun and Zhou, Shijie and Ma, Yunpu and Tresp, Volker},
  year = {2025},
  eprint = {2506.15672},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {https://arxiv.org/abs/2506.15672},
  topic = {system},
}

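
For readers unfamiliar with the Particle Swarm Optimization that the SwarmAgentic abstract draws on, the following is a minimal sketch of the classical numeric algorithm only. SwarmAgentic itself applies text-symbol updates to agent roles and coordination structures, which this sketch does not attempt to reproduce. The function name `pso_minimize`, its default coefficients, and the `sphere` test objective are illustrative assumptions, not taken from the paper.

```python
import random


def pso_minimize(objective, dim, n_particles=20, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5,
                 bounds=(-5.0, 5.0), seed=0):
    """Classical PSO over real vectors (assumed hyperparameters).

    Returns (best_position, best_value) found by the swarm.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    positions = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position; the swarm shares one global best.
    personal_best = [p[:] for p in positions]
    personal_val = [objective(p) for p in positions]
    g = min(range(n_particles), key=personal_val.__getitem__)
    global_best, global_val = personal_best[g][:], personal_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull toward personal best + pull toward global best.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_personal * r1 * (personal_best[i][d] - positions[i][d])
                    + c_global * r2 * (global_best[d] - positions[i][d])
                )
                # Move the particle and clamp it to the search bounds.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            val = objective(positions[i])
            if val < personal_val[i]:
                personal_best[i], personal_val[i] = positions[i][:], val
                if val < global_val:
                    global_best, global_val = positions[i][:], val
    return global_best, global_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_pos, best_val = pso_minimize(sphere, dim=3)
```

In the numeric setting, a "position" is a vector and the feedback is the objective value; the abstract describes an analogous loop in which each candidate is an entire agentic system and the updates are expressed in language.
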
WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration

Yao Zhang, Zijian Ma, Yunpu Ma, Zhen Han, Yu Wu, and Volker Tresp

AAAI 2025

WebPilot is a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. It uses Global Optimization for high-level planning and Local Optimization for executing subtasks. On WebArena with GPT-4, it achieves SOTA performance, a 93% relative increase in success rate over the concurrent tree search-based method.

Web Agents, Monte Carlo Tree Search, Multi-Agent Systems, Reflection-Based Optimization

LLM-based autonomous agents often fail to execute complex web tasks that require dynamic interaction, largely due to the inherent uncertainty and complexity of these environments. Existing LLM-based web agents typically rely on rigid, expert-designed policies specific to certain states and actions, which lack the flexibility and generalizability needed to adapt to unseen tasks. In contrast, humans excel by exploring unknowns, continuously adapting strategies and actions based on new observations, and resolving ambiguities through exploration. To emulate human-like adaptability, web agents need strategic exploration and complex decision-making. Monte Carlo Tree Search (MCTS) is well-suited for this, but classical MCTS struggles with the vast action spaces, unpredictable state transitions, and incomplete information in web tasks. In light of this, we develop WebPilot, a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. Specifically, the Global Optimization phase involves generating a high-level plan by breaking down tasks into manageable subtasks, continuously refining this plan through reflective analysis of new observations and previous subtask attempts, thereby focusing the search process and mitigating the challenges posed by vast action spaces in classical MCTS. Subsequently, the Local Optimization phase executes each subtask using a tailored MCTS designed for complex environments, effectively addressing uncertainties and managing incomplete information by iteratively refining decisions based on new observations. Experimental results on WebArena and MiniWoB++ demonstrate the effectiveness of WebPilot. Notably, on WebArena, WebPilot achieves SOTA performance with GPT-4, with a 93% relative increase in success rate over the concurrent tree search-based method. WebPilot marks a significant advancement in general autonomous agent capabilities, paving the way for more advanced and reliable decision-making in practical environments.
@misc{zhang2024webpilotversatileautonomousmultiagent,
  title = {WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration},
  author = {Zhang, Yao and Ma, Zijian and Ma, Yunpu and Han, Zhen and Wu, Yu and Tresp, Volker},
  year = {2025},
  eprint = {2408.15978},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {https://arxiv.org/abs/2408.15978},
  topic = {system},
}
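
As background for the classical MCTS that the WebPilot abstract builds on, here is a minimal sketch of the textbook four-phase loop (selection with UCT, expansion, random rollout, backpropagation) on a toy problem. It is not WebPilot's tailored MCTS; the `Node` class, the `mcts` signature, the exploration constant, and the bit-string toy task are all assumptions chosen for illustration.

```python
import math
import random


class Node:
    """One search-tree node holding visit statistics."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.total_reward = 0.0


def uct_score(child, parent_visits, c=1.4):
    """UCT: average reward plus an exploration bonus for rarely visited children."""
    if child.visits == 0:
        return float("inf")
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def mcts(root_state, actions, step, is_terminal, reward, n_sim=1000, seed=0):
    """Run n_sim simulations and return the most visited root action."""
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_sim):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while (not is_terminal(node.state)
               and len(node.children) == len(actions(node.state))):
            node = max(node.children.values(),
                       key=lambda ch: uct_score(ch, node.visits))
        # 2. Expansion: add one untried child.
        if not is_terminal(node.state):
            untried = [a for a in actions(node.state) if a not in node.children]
            action = rng.choice(untried)
            child = Node(step(node.state, action), parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = step(state, rng.choice(actions(state)))
        r = reward(state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)


# Toy task (hypothetical): choose 4 bits; reward is the fraction of ones.
HORIZON = 4


def actions(state):
    return [0, 1] if len(state) < HORIZON else []


def step(state, action):
    return state + (action,)


def is_terminal(state):
    return len(state) == HORIZON


def reward(state):
    return sum(state) / HORIZON


best_first_action = mcts((), actions, step, is_terminal, reward)
```

The toy task has two actions per state and a known transition function. The abstract's point is that web tasks have neither, which is why the search is focused by a high-level plan and adapted for incomplete information.
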

Reasoning Level — Process-Level Reasoning and Policy Alignment for Reliable Decision-Making

GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, and Volker Tresp

NeurIPS 2025 Workshop LAW

GroundedPRM is a tree-guided and fidelity-aware framework for automatic process reward modeling that combines MCTS-guided path construction with tool-based step verification. Trained on only 40K automatically labeled samples, about 10% of the data used by the best-performing PRM trained with auto-labeled supervision, it achieves up to a 26% relative improvement in average performance on ProcessBench.

Process Reward Modeling, Multi-Step Reasoning, Monte Carlo Tree Search, Tool Verification

Process Reward Models (PRMs) aim to improve multi-step reasoning in Large Language Models (LLMs) by supervising intermediate steps and identifying errors throughout the reasoning process. However, building effective PRMs remains challenging due to the lack of scalable, high-quality annotations. Existing approaches rely on costly human labeling, LLM-based self-evaluation that is prone to hallucination, or Monte Carlo (MC) estimation, which infers step quality solely from rollout outcomes and often introduces noisy, misaligned supervision due to credit misattribution. These issues result in three core limitations: noisy rewards, low factual fidelity, and misalignment with step-level reasoning objectives. To address these challenges, we introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. To reduce reward noise and enable fine-grained credit assignment, we construct structured reasoning paths via Monte Carlo Tree Search (MCTS). To eliminate hallucinated supervision, we validate each intermediate step using an external tool, providing precise, execution-grounded correctness signals. To combine both step-level validation and global outcome assessment, we design a hybrid reward aggregation mechanism that fuses tool-based verification with MCTS-derived feedback. Finally, we format the reward signal into a rationale-enhanced, generative structure to promote interpretability and compatibility with instruction-tuned LLMs. GroundedPRM is trained on only 40K automatically labeled samples, amounting to just 10% of the data used by the best-performing PRM trained with auto-labeled supervision. Nevertheless, it achieves up to a 26% relative improvement in average performance on ProcessBench. When used for reward-guided greedy search, GroundedPRM outperforms even PRMs trained with human-labeled supervision, offering a scalable and verifiable path toward high-quality process-level reasoning.
@misc{zhang2025groundedprmtreeguidedfidelityawareprocess,
  title = {GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning},
  author = {Zhang, Yao and Wu, Yu and Zhang, Haowei and Li, Weiguo and Chen, Haokun and Wu, Jingpei and Li, Guohao and Han, Zhen and Tresp, Volker},
  year = {2025},
  eprint = {2510.14942},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {https://arxiv.org/abs/2510.14942},
  topic = {reasoning},
}
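
The GroundedPRM abstract mentions a hybrid reward that fuses per-step tool verification with outcome feedback, but it does not give the formula. The sketch below shows one simple way such a fusion could look: a fixed-weight blend of a step-level verdict and the final outcome. The function name, the `outcome_weight` parameter, and the +1/-1 encoding are assumptions made for illustration and should not be read as the paper's actual aggregation rule.

```python
def hybrid_step_rewards(tool_verdicts, outcome_correct, outcome_weight=0.5):
    """Blend step-level tool verdicts with the global outcome (assumed scheme).

    tool_verdicts: list of bools, True if an external tool verified the step.
    outcome_correct: bool, True if the final answer of the path was correct.
    Returns one reward per step in [-1.0, 1.0].
    """
    if not 0.0 <= outcome_weight <= 1.0:
        raise ValueError("outcome_weight must lie in [0, 1]")
    outcome_signal = 1.0 if outcome_correct else -1.0
    rewards = []
    for verified in tool_verdicts:
        step_signal = 1.0 if verified else -1.0
        # Step-level correctness and global outcome each contribute a share.
        rewards.append((1.0 - outcome_weight) * step_signal
                       + outcome_weight * outcome_signal)
    return rewards


# Two verified steps followed by a faulty one, on a path whose answer was wrong.
example = hybrid_step_rewards([True, True, False], outcome_correct=False)
# example == [0.0, 0.0, -1.0]
```

With a pure outcome signal, all three steps would receive the same negative label. The step-level term is what separates the two verified steps from the faulty one, which is the credit-misattribution problem the abstract describes.
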

Learning Level — Adaptive and Federated Learning for Scalable Multimodal Intelligence

CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering

Yao Zhang, Haokun Chen, Aymen Frikha, and Volker Tresp

WACV 2025

CL-CrossVQA is a benchmark for continual learning in cross-domain visual question answering. It evaluates how well vision-language models retain previously learned knowledge while adapting to new visual domains and question types, and it highlights key challenges in representation retention and cross-domain generalization.

Continual Learning, Multimodal Reasoning, Cross-Domain Robustness, Generalization

We introduce CL-CrossVQA, a benchmark for continual learning in cross-domain visual question answering. The benchmark evaluates the ability of vision-language models to retain knowledge while adapting to new domains, highlighting challenges in representation retention and cross-domain generalization.
@misc{zhang2025clcrossvqacontinuallearningbenchmark,
  title = {CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering},
  author = {Zhang, Yao and Chen, Haokun and Frikha, Aymen and Tresp, Volker},
  year = {2025},
  eprint = {},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {},
  topic = {learning}
}

FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models

Yao Zhang, Hewei Gao, Haokun Chen, and Volker Tresp

Under Review, 2025

FedNano is a lightweight federated tuning framework for pretrained multimodal large language models. It drastically reduces client-side computational cost while maintaining strong reasoning and adaptation performance, enabling scalable and privacy-preserving federated fine-tuning.

Federated Learning, Multimodal Adaptation, Efficient Fine-Tuning, Scalable Intelligence

FedNano proposes a lightweight federated tuning framework for pretrained multimodal large language models. The method drastically reduces client-side computational cost while maintaining strong reasoning and adaptation performance, enabling scalable and privacy-preserving multimodal intelligence.
@misc{zhang2025fednanotowardlightweightfederated,
  title = {FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models},
  author = {Zhang, Yao and Gao, Hewei and Chen, Haokun and Tresp, Volker},
  year = {2025},
  eprint = {2506.14824},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {https://arxiv.org/abs/2506.14824},
  topic = {learning}
}

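
To make the federated tuning setting concrete, the sketch below shows generic sample-weighted federated averaging (FedAvg) over a small set of tunable parameters. The summary above does not state which aggregation rule FedNano uses, so this is background only. The function name `fed_avg` and the parameter name `"adapter"` are hypothetical.

```python
def fed_avg(client_updates):
    """Sample-weighted average of client parameters (generic FedAvg).

    client_updates: list of (num_samples, params) pairs, where params maps a
    parameter name to a list of floats. All clients must share the same
    names and lengths.
    """
    if not client_updates:
        raise ValueError("need at least one client update")
    total_samples = sum(n for n, _ in client_updates)
    reference = client_updates[0][1]
    averaged = {}
    for name, values in reference.items():
        averaged[name] = [
            # Clients with more local data contribute proportionally more.
            sum(n * params[name][k] for n, params in client_updates) / total_samples
            for k in range(len(values))
        ]
    return averaged


# Two hypothetical clients sharing only a small trainable module.
updates = [
    (100, {"adapter": [1.0, 2.0]}),
    (300, {"adapter": [3.0, 6.0]}),
]
global_params = fed_avg(updates)
# global_params == {"adapter": [2.5, 5.0]}
```

In this setup, client cost scales with the number of parameters each device has to train and communicate, which is the overhead that lightweight federated tuning aims to keep small.
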
FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning

Haokun Chen, Yao Zhang, Daniel Krompass, and Volker Tresp

AAAI 2024

FedDAT is an approach for foundation model finetuning in multi-modal heterogeneous federated learning. It addresses the challenges of adapting large foundation models when both data modalities and client data distributions are heterogeneous, enabling scalable and privacy-preserving federated fine-tuning.

Heterogeneous Federated Learning, Multimodal Finetuning, Foundation Model Adaptation

FedDAT presents an approach for foundation model finetuning in multi-modal heterogeneous federated learning. The method addresses challenges in federated adaptation of large foundation models across diverse data modalities and client distributions.
@misc{chen2024feddatapproachfoundationmodel,
  title = {FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning},
  author = {Chen, Haokun and Zhang, Yao and Krompass, Daniel and Tresp, Volker},
  year = {2024},
  eprint = {},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
  url = {},
  topic = {learning}
}