Xunzhe Zhou

I am an undergraduate student at Fudan University, pursuing my Bachelor's degree in Computer Science and Technology.

Most recently, I was a research assistant at NUS, working with Prof. Lin Shao on robotic task and motion planning. Previously, I collaborated with Prof. Yanwei Fu and Prof. Xiangyang Xue on building a mobile manipulation system with 3D Semantic Fields and Vision-Language Models. Before that, I studied LLMs' real-world complex instruction-following capabilities with Prof. Yanghua Xiao, and the control of nonlinear dynamical systems with Prof. Siyang Leng.

I also had a wonderful semester at UC Berkeley in fall 2023, where I studied reinforcement learning, deep learning, optimization models, and artificial intelligence, earning a 4.00 / 4.00 GPA.

I am excited to join Shanghai AI Lab as a research intern!

Email  /  CV (2025.01)  /  GitHub  /  Scholar  /  LinkedIn  /  Twitter

News

NEW [Jan. 2025]
Our paper EMOS is accepted by ICLR 2025!
NEW [Dec. 2024]
Excited to join Shanghai AI Lab as a research intern!
[Dec. 2023]
Our paper CELLO is accepted by AAAI 2024!
[Aug. 2023]
Honored to visit UC Berkeley as an exchange student!
[Sep. 2020]
Excited to begin my undergraduate journey at Fudan University!

Publications

My research interest is in efficiently leveraging the generalization and commonsense knowledge of foundation models for robotic decision making and policy learning.

* denotes equal contribution. Representative papers are highlighted.

Bi-Adapt: Few-shot Bimanual Adaptation for Novel Categories of 3D Objects via Semantic Correspondence
Jinxian Zhou, Ruihai Wu, Xunzhe Zhou, Checheng Yu, Licheng Zhong, Lin Shao

In submission, 2024
abstract

Bimanual manipulation tasks require close coordination between two arms, creating a high demand for training data. We introduce Bi-Adapt, a novel framework for learning bimanual manipulation on novel object categories. By leveraging semantic correspondence from foundation models for generalization, together with few-shot adaptation on minimal data, Bi-Adapt achieves high success rates on cross-category bimanual manipulation tasks.
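To make the core idea concrete, here is a minimal sketch of point transfer via semantic correspondence, in the spirit of the paper: a manipulation point on a source object is mapped to a novel object by nearest-neighbor matching of foundation-model patch features (e.g., DINO-style embeddings). The feature extractor is stubbed with random arrays; the grid shapes and cosine-similarity matching are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def transfer_point(src_feats, tgt_feats, src_point):
    """Map a (row, col) point on the source feature grid to the target.

    src_feats, tgt_feats: (H, W, D) patch-feature grids.
    Returns the target cell whose feature is most cosine-similar to the
    source cell's feature.
    """
    h, w, d = tgt_feats.shape
    query = src_feats[src_point]                           # (D,)
    query = query / np.linalg.norm(query)
    flat = tgt_feats.reshape(-1, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ query                                    # cosine similarity
    idx = int(np.argmax(sims))
    return divmod(idx, w)                                  # back to (row, col)

# Toy usage with random "features" standing in for a real extractor.
rng = np.random.default_rng(0)
src = rng.normal(size=(16, 16, 384))
tgt = rng.normal(size=(16, 16, 384))
print(transfer_point(src, tgt, (4, 7)))
```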

EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents
Xunzhe Zhou*, Junting Chen*, Checheng Yu*, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao

International Conference on Learning Representations (ICLR), 2025
project page / abstract / paper / code / bibtex

In real-world robot environments, the capability of each agent in a multi-agent system (MAS) is tied to the physical composition of its robot. We introduce EMOS, a multi-agent framework that improves collaboration among heterogeneous robots with varying embodiment capabilities. To evaluate the MAS, we design the Habitat-MAS benchmark, which includes four tasks: 1) navigation, 2) perception, 3) manipulation, and 4) comprehensive multi-floor object rearrangement.

Can Large Language Models Understand Real-World Complex Instructions?
Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Lida Chen, Xintao Wang, Yuncheng Huang, Haoning Ye, Zihan Li, Shisong Chen, Yikai Zhang, Zhouhong Gu, Jiaqing Liang, Yanghua Xiao

AAAI Conference on Artificial Intelligence (AAAI), 2024
project page / abstract / paper / code / bibtex / video

Current LLMs often ignore semantic constraints, generate output in incorrect formats, violate length or count constraints, and are unfaithful to the input text. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex real-world instructions. We design a dataset carefully crafted by human experts, comprising 566 samples across 9 tasks, establish 4 criteria with corresponding metrics, and compare 18 Chinese-oriented and 15 English-oriented models.

Reservoir Computing as Digital Twins for Controlling Nonlinear Dynamical Systems
Xunzhe Zhou*, Ruizhi Cao*, Jiawen Hou, Chun Guan, Siyang Leng

In submission, 2023

Controlling a chaotic system is difficult without knowing its governing differential equations, and traditional control strategies require continuous adjustment of many parameters. To address this, we leverage an Echo State Network as a digital twin to predict and control the behavior of chaotic systems. We evaluate the model on three chaotic systems combined with three control strategies, and conduct extensive experiments to validate its prediction accuracy, control efficiency, and noise robustness.
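A minimal sketch of the reservoir-computing idea, assuming a standard Echo State Network: a fixed random recurrent reservoir is driven by the trajectory, and only a linear readout is trained by ridge regression to predict the next state. The hyperparameters (reservoir size, spectral radius, ridge strength) are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def lorenz_trajectory(n_steps, dt=0.01):
    """Integrate the Lorenz system with simple Euler steps."""
    x = np.array([1.0, 1.0, 1.0])
    traj = np.empty((n_steps, 3))
    for t in range(n_steps):
        dx = np.array([
            10.0 * (x[1] - x[0]),
            x[0] * (28.0 - x[2]) - x[1],
            x[0] * x[1] - (8.0 / 3.0) * x[2],
        ])
        x = x + dt * dx
        traj[t] = x
    return traj

# Fixed random reservoir; only the linear readout below is trained.
n_in, n_res = 3, 300
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

def run_reservoir(inputs):
    """Collect reservoir states r_{t+1} = tanh(W r_t + W_in u_t)."""
    r = np.zeros(n_res)
    states = np.empty((len(inputs), n_res))
    for t, u in enumerate(inputs):
        r = np.tanh(W @ r + W_in @ u)
        states[t] = r
    return states

data = lorenz_trajectory(5000)
data = (data - data.mean(axis=0)) / data.std(axis=0)  # normalize inputs
U, Y = data[:-1], data[1:]            # input u_t, one-step target u_{t+1}
R = run_reservoir(U)[200:]            # drop the washout transient
Y = Y[200:]

# Train the readout W_out by ridge regression (the only trained weights).
beta = 1e-6
W_out = np.linalg.solve(R.T @ R + beta * np.eye(n_res), R.T @ Y).T
print("one-step training MSE:", np.mean((R @ W_out.T - Y) ** 2))
```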

Selected Projects

Mobile Manipulation Based on 3D Semantic Fields and VLMs
Jiawei Hou, Xunzhe Zhou, Tongying Pan, Jie Zhang, Shanshan Li, Junyu Lin, Kuanning Wang, Jingshun Huang, Yanwei Fu, Xiangyang Xue

Construct a mobile manipulation and task planning system
project page / code (soon)

Aiming to build a service robot that generalizes across daily-life scenarios, we constructed a mobile manipulation system on a robot assembled from a Franka Panda arm and a Hermes mobile base. In this project, I was responsible for implementing 1) semantic grasping-pose estimation, 2) semantic mobile-base navigation, and 3) hierarchical task planning with 3D Semantic Fields and VLMs. The follow-up work, TaMMa (Hou et al.), was accepted to CoRL 2024.
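As an illustration of the hierarchical task-planning layer, here is a hedged sketch of a plan-then-dispatch loop: a VLM decomposes an instruction into skill calls, which are executed by low-level controllers. The `query_vlm` stub, the skill names, and the interfaces are hypothetical stand-ins, not the system's actual API.

```python
from typing import Callable

# Low-level skills exposed by the arm and mobile base (illustrative).
SKILLS: dict[str, Callable[[str], None]] = {
    "navigate_to": lambda target: print(f"[base] navigating to {target}"),
    "grasp":       lambda target: print(f"[arm]  grasping {target}"),
    "place_on":    lambda target: print(f"[arm]  placing on {target}"),
}

def query_vlm(instruction: str, scene_summary: str) -> list[tuple[str, str]]:
    """Stand-in for a VLM call returning (skill, argument) subtasks.

    A real system would prompt the VLM with the instruction plus objects
    retrieved from the 3D semantic field, then parse a structured reply.
    """
    return [
        ("navigate_to", "kitchen table"),
        ("grasp", "red mug"),
        ("navigate_to", "sink"),
        ("place_on", "sink"),
    ]

def run(instruction: str, scene_summary: str) -> None:
    for skill, arg in query_vlm(instruction, scene_summary):
        SKILLS[skill](arg)  # dispatch each subtask to a low-level skill

run("put the red mug in the sink", "objects: red mug, kitchen table, sink")
```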

Resolving Knowledge Conflicts in Vision-Language Models
Xunzhe Zhou, Xu Li, Yi Zheng, Xiangyang Xue

Construct a small VQA dataset for evaluation, resolve with contrastive decoding
project page / code

VLMs tend to hallucinate when the image input conflicts with the knowledge (common sense) stored in the LLM decoder. To study this, we constructed a small-scale VQA dataset of knowledge-conflict images collected from the Internet or generated with DALL·E 3, and evaluated 8 state-of-the-art VLMs on it. We also resolved the knowledge conflicts in LLaVA-1.5 with contrastive decoding.
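A minimal sketch of the contrastive-decoding idea, assuming the common formulation where the text-only prior is subtracted from the image-conditioned distribution. `vlm_logits` is a hypothetical callable and `alpha` an illustrative hyperparameter, not the project's exact recipe.

```python
import numpy as np

def contrastive_decode_step(vlm_logits, prompt_ids, image, alpha=1.0):
    """Pick the next token by amplifying image-conditioned evidence.

    logits_cd = (1 + alpha) * logits(image, text) - alpha * logits(text only)
    subtracts the text-only prior, so tokens supported mainly by the LLM's
    parametric knowledge rather than the image are down-weighted.
    """
    logits_img = vlm_logits(prompt_ids, image=image)  # image + text
    logits_txt = vlm_logits(prompt_ids, image=None)   # text-only prior
    logits_cd = (1.0 + alpha) * logits_img - alpha * logits_txt
    return int(np.argmax(logits_cd))                  # greedy choice

# Toy stand-in: a 5-token vocabulary where the text prior prefers token 2
# but the image evidence supports token 4.
def toy_logits(prompt_ids, image=None):
    base = np.array([0.0, 0.1, 2.0, 0.2, 1.5])               # parametric prior
    if image is not None:
        base = base + np.array([0.0, 0.0, -1.0, 0.0, 2.0])   # image evidence
    return base

print(contrastive_decode_step(toy_logits, [0], image="img"))  # -> 4
```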

Neural Style Transfer Based on Fine-Tuning Vision Transformer
Xunzhe Zhou, Yisi Liu, Yujie Zhao, Yuanteng Chen

UC Berkeley CS182/282A course project, Fall 2023
project page / essay / code

To improve Neural Style Transfer, we replaced the content and style encoders of StyTr2 with a pre-trained ViT. Constrained by practical compute limitations, we adopted a two-stage training strategy: we first froze the pre-trained ViT and trained only the decoders; we then wrapped the ViT with LoRA and fine-tuned it on the COCO dataset for joint training.
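A minimal PyTorch sketch of this two-stage strategy, under the assumption of a simple LoRA wrapper around the encoder's linear layers; the `ToyModel`, LoRA rank, and learning rates are illustrative stand-ins for the actual StyTr2-based model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

def stage_one(model):
    """Stage 1: freeze the ViT encoder; train only the decoders."""
    for p in model.vit_encoder.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

def stage_two(model):
    """Stage 2: wrap the encoder's linear layers with LoRA for joint training."""
    for name, module in list(model.vit_encoder.named_children()):
        if isinstance(module, nn.Linear):
            setattr(model.vit_encoder, name, LoRALinear(module))
    return [p for p in model.parameters() if p.requires_grad]

class ToyModel(nn.Module):
    """Toy stand-in for the StyTr2-style model (hypothetical shapes)."""
    def __init__(self):
        super().__init__()
        self.vit_encoder = nn.Sequential(nn.Linear(16, 16))
        self.decoder = nn.Linear(16, 16)

model = ToyModel()
opt1 = torch.optim.Adam(stage_one(model), lr=1e-4)  # decoders only
opt2 = torch.optim.Adam(stage_two(model), lr=1e-5)  # decoders + LoRA params
```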

Experiences

Shanghai AI Laboratory, China 2024.12 - Present

Research Intern

National University of Singapore, Singapore 2024.07 - 2024.12

Research Assistant (Advisor: Prof. Lin Shao)

University of California, Berkeley, USA 2023.08 - 2023.12

Exchange Student (GPA: 4.00 / 4.00)

Fudan University, China 2020.09 - 2025.06

B.S. in Computer Science and Technology (2021.09 - 2025.06)
Natural Science Experimental Class (2020.09 - 2021.06)