Research

Selected works from my research on agentic system design, reasoning reliability, and scalable learning.

System Level — Agentic System Architecture and Scalable Multi-Agent Autonomy

  1. SwarmAgentic: Towards Fully Automated Agentic System Generation via Swarm Intelligence
    Yao Zhang, Chenyang Lin, Shijie Tang, and 4 more authors
    EMNLP 2025 (Main)
    SwarmAgentic is a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. It reformulates particle swarm optimization into interpretable text-symbol updates over agent roles and coordination structures, enabling efficient exploration of the agentic system design space.
    Scalable Autonomy · Automated Agentic System Generation · Swarm Intelligence
  2. WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration
    Yao Zhang, Zijian Ma, Yunpu Ma, and 3 more authors
    AAAI 2025
    WebPilot is a multi-agent system with a dual optimization strategy that improves MCTS to better handle complex web environments. It uses Global Optimization for high-level planning and Local Optimization for executing subtasks, achieving SOTA performance on WebArena with a 93% relative increase in success rate.
    Web Agents · Monte Carlo Tree Search · Multi-Agent Systems · Reflection-Based Optimization
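
    WebPilot's planning builds on Monte Carlo Tree Search. As a rough orientation for readers unfamiliar with the algorithm, here is a minimal, generic UCT-style skeleton; the `Node` class, the toy state interface, and all parameters are illustrative assumptions, not WebPilot's actual global/local optimization.

    ```python
    import math
    import random

    class Node:
        """One search-tree node: a state plus visit statistics."""
        def __init__(self, state, parent=None):
            self.state = state
            self.parent = parent
            self.children = []
            self.visits = 0
            self.value = 0.0

    def mcts(root_state, actions, apply_action, is_terminal, reward,
             iters=400, c=1.4):
        root = Node(root_state)
        for _ in range(iters):
            node = root
            # 1. Selection: descend by the UCT score until reaching a leaf.
            while node.children:
                node = max(node.children, key=lambda ch:
                           (ch.value / ch.visits if ch.visits else float("inf"))
                           + c * math.sqrt(math.log(node.visits + 1)
                                           / (ch.visits + 1)))
            # 2. Expansion: add one child per legal action, then pick one.
            if not is_terminal(node.state):
                node.children = [Node(apply_action(node.state, a), node)
                                 for a in actions(node.state)]
                node = random.choice(node.children)
            # 3. Simulation: random rollout from here to a terminal state.
            state = node.state
            while not is_terminal(state):
                state = apply_action(state, random.choice(actions(state)))
            r = reward(state)
            # 4. Backpropagation: credit the rollout reward along the path.
            while node is not None:
                node.visits += 1
                node.value += r
                node = node.parent
        # Recommend the most-visited first move.
        return max(root.children, key=lambda ch: ch.visits).state
    ```

    The random rollout in step 3 is the piece a web agent must replace, since web actions are expensive and not freely resettable; this is the kind of gap the paper's global/local optimization targets.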

Reasoning Level — Process-Level Reasoning and Policy Alignment for Reliable Decision-Making

  1. GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
    Yao Zhang, Yu Wu, Haowei Zhang, and 6 more authors
    NeurIPS 2025 Workshop LAW
    GroundedPRM is a tree-guided and fidelity-aware framework for automatic process reward modeling that combines MCTS-guided path construction with tool-based step verification. It achieves SOTA performance with only 10% of the training data compared to existing auto-labeled methods, demonstrating exceptional sample efficiency and superior reasoning quality.
    Process Reward Modeling · Multi-Step Reasoning · Monte Carlo Tree Search · Tool Verification
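
    The core idea of fidelity-aware auto-labeling — cross-checking a Monte Carlo estimate against a tool-based verdict before accepting a step label — can be sketched roughly as follows; the function names, the 0.5 threshold, and the agreement rule are illustrative assumptions, not the paper's exact procedure.

    ```python
    def label_step(step, rollout_success, tool_verify, n_rollouts=8):
        """Sketch of fidelity-aware step labeling: a reasoning step becomes
        a positive training example only when the Monte Carlo signal and an
        external tool check agree."""
        # Monte Carlo signal: fraction of rollouts continuing from this step
        # that reach a correct final answer (rollout_success returns 0 or 1).
        mc_score = sum(rollout_success(step)
                       for _ in range(n_rollouts)) / n_rollouts
        # Tool signal: a deterministic verifier (e.g., a calculator) checks
        # the step itself rather than only the eventual outcome.
        verified = tool_verify(step)
        label = 1 if (mc_score > 0.5 and verified) else 0
        return {"step": step, "mc": mc_score,
                "verified": verified, "label": label}
    ```

    Requiring both signals to agree is what keeps outcome-lucky but locally wrong steps out of the training data — a plausible reason such labeling needs far fewer samples than outcome-only auto-labeling.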

Learning Level — Adaptive and Federated Learning for Scalable Multimodal Intelligence

  1. CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering
    Yao Zhang, Haokun Chen, Aymen Frikha, and 1 more author
    WACV 2025
    CL-CrossVQA is a benchmark for continual learning in cross-domain visual question answering that evaluates the ability of vision-language models to retain knowledge while adapting to new domains. It highlights key challenges in representation retention and cross-domain generalization, providing a systematic framework for assessing how well models preserve previously learned knowledge when encountering new visual domains and question types.
    Continual Learning · Multimodal Reasoning · Cross-Domain Robustness · Generalization
  2. FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal Large Language Models
    Yao Zhang, Hewei Gao, Haokun Chen, and 1 more author
    Under Review, 2025
    FedNano is a lightweight federated tuning framework for pretrained multimodal large language models that drastically reduces client-side computational cost while maintaining strong reasoning and adaptation performance. The framework enables efficient federated fine-tuning of multimodal models, achieving scalable and privacy-preserving multimodal intelligence with minimal computational overhead on client devices.
    Federated Learning · Multimodal Adaptation · Efficient Fine-Tuning · Scalable Intelligence
  3. FedDAT: An Approach for Foundation Model Finetuning in Multi-Modal Heterogeneous Federated Learning
    Haokun Chen, Yao Zhang, Daniel Krompass, and 1 more author
    AAAI 2024
    FedDAT is an approach for foundation model finetuning in multi-modal heterogeneous federated learning that addresses the challenge of adapting large foundation models across diverse data modalities and client distributions. The method enables efficient federated fine-tuning while handling heterogeneity in both data modalities and client data distributions, yielding scalable and privacy-preserving adaptation of foundation models.
    Heterogeneous Federated Learning · Multimodal Finetuning · Foundation Model Adaptation
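
Both FedNano and FedDAT keep the pretrained backbone frozen and communicate only small tuned modules. A generic sketch of the server-side aggregation step, assuming each client returns its adapter parameters as a dict — the weighted FedAvg rule shown here is the standard baseline, not either paper's exact aggregation:

```python
def aggregate_adapters(client_adapters, client_sizes):
    """Server-side weighted average (FedAvg) over lightweight adapter
    parameters; the frozen pretrained backbone is never transmitted."""
    total = sum(client_sizes)
    weights = [n / total for n in client_sizes]
    # Average each parameter across clients, weighted by local dataset size.
    return {name: sum(w * adapter[name]
                      for w, adapter in zip(weights, client_adapters))
            for name in client_adapters[0]}
```

Because only adapter tensors cross the network, per-round communication scales with the adapter size rather than the multimodal backbone — the source of the client-side savings both papers target.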