Xunzhe Zhou

I am an undergraduate student at Fudan University, pursuing my Bachelor's degree in Computer Science and Technology.

Most recently, I was a research assistant at NUS, working with Prof. Lin Shao on robotic task and motion planning. Previously, I collaborated with Prof. Yanwei Fu and Prof. Xiangyang Xue on building a mobile manipulation system with 3D Semantic Fields and Vision-Language Models. Before that, I studied LLMs' real-world complex instruction-following capabilities with Prof. Yanghua Xiao, and the control of nonlinear dynamical systems with Prof. Siyang Leng.

I also had a wonderful semester at UC Berkeley in fall 2023, where I studied reinforcement learning, deep learning, optimization models, and artificial intelligence, earning a 4.00 / 4.00 GPA.

I am excited to join Shanghai AI Lab as a research intern!

Email  /  CV (2025.01)  /  GitHub  /  Scholar  /  LinkedIn  /  Twitter

News

NEW [Jan. 2025]
Our paper EMOS is accepted by ICLR 2025!
NEW [Dec. 2024]
Excited to join Shanghai AI Lab as a research intern!
[Dec. 2023]
Our paper CELLO is accepted by AAAI 2024!
[Aug. 2023]
Honored to visit UC Berkeley as an exchange student!
[Sep. 2020]
Excited to begin my undergraduate journey at Fudan University!

Publications

My research interest is in efficiently leveraging the generalization and commonsense knowledge of foundation models for robotic decision making and policy learning.

* denotes equal contribution. Representative papers are highlighted.

Bi-Adapt: Few-shot Bimanual Adaptation for Novel Categories of 3D Objects via Semantic Correspondence
Jinxian Zhou, Ruihai Wu, Xunzhe Zhou, Checheng Yu, Licheng Zhong, Lin Shao

In submission, 2024
abstract

Bimanual manipulation tasks require close coordination between two arms, creating a high demand for training data. We introduce Bi-Adapt, a novel framework for learning bimanual manipulation on novel object categories. By leveraging semantic correspondence from foundation models for generalization, together with few-shot adaptation on minimal data, Bi-Adapt achieves high success rates on cross-category bimanual manipulation tasks.
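To make the core idea concrete, here is a minimal sketch of point transfer via semantic correspondence, in the spirit of the paper: a manipulation point on a source object is mapped to a novel object by nearest-neighbor matching of foundation-model patch features (e.g., DINO-style embeddings). The feature extractor is stubbed with random arrays; the grid shapes and cosine-similarity matching are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def transfer_point(src_feats, tgt_feats, src_point):
    """Map a (row, col) point on the source feature grid to the target.

    src_feats, tgt_feats: (H, W, D) patch-feature grids.
    Returns the target cell whose feature is most cosine-similar to the
    source cell's feature.
    """
    h, w, d = tgt_feats.shape
    query = src_feats[src_point]                           # (D,)
    query = query / np.linalg.norm(query)
    flat = tgt_feats.reshape(-1, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ query                                    # cosine similarity
    idx = int(np.argmax(sims))
    return divmod(idx, w)                                  # back to (row, col)

# Toy usage with random "features" standing in for a real extractor.
rng = np.random.default_rng(0)
src = rng.normal(size=(16, 16, 384))
tgt = rng.normal(size=(16, 16, 384))
print(transfer_point(src, tgt, (4, 7)))
```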

EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents
Xunzhe Zhou*, Junting Chen*, Checheng Yu*, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao

International Conference on Learning Representations (ICLR), 2025
project page / abstract / paper / code / bibtex

In real-world robot environments, the capability of each agent in a multi-agent system (MAS) is tied to the physical composition of its robot. We introduce EMOS, a multi-agent framework that improves collaboration among heterogeneous robots with varying embodiment capabilities. To evaluate the MAS, we design the Habitat-MAS benchmark, which includes four tasks: 1) navigation, 2) perception, 3) manipulation, and 4) comprehensive multi-floor object rearrangement.

Can Large Language Models Understand Real-World Complex Instructions?
Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Lida Chen, Xintao Wang, Yuncheng Huang, Haoning Ye, Zihan Li, Shisong Chen, Yikai Zhang, Zhouhong Gu, Jiaqing Liang, Yanghua Xiao

AAAI Conference on Artificial Intelligence (AAAI), 2024
project page / abstract / paper / code / bibtex / video

Current LLMs often ignore semantic constraints, generate output in incorrect formats, violate length or count constraints, and are unfaithful to the input text. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex real-world instructions. We design a dataset carefully crafted by human experts, comprising 566 samples across 9 tasks, establish 4 criteria with corresponding metrics, and compare 18 Chinese-oriented and 15 English-oriented models.

Reservoir Computing as Digital Twins for Controlling Nonlinear Dynamical Systems
Xunzhe Zhou*, Ruizhi Cao*, Jiawen Hou, Chun Guan, Siyang Leng

In submission, 2023

Controlling a chaotic system is difficult without knowing its governing differential equations, and traditional control strategies require continuous adjustment of many parameters. To address this, we leverage an Echo State Network as a digital twin to predict and control the behavior of chaotic systems. We evaluate the model on three chaotic systems combined with three control strategies, and conduct extensive experiments to validate its prediction accuracy, control efficiency, and noise robustness.
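A minimal sketch of the reservoir-computing idea, assuming a standard Echo State Network: a fixed random recurrent reservoir is driven by the trajectory, and only a linear readout is trained by ridge regression to predict the next state. The hyperparameters (reservoir size, spectral radius, ridge strength) are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def lorenz_trajectory(n_steps, dt=0.01):
    """Integrate the Lorenz system with simple Euler steps."""
    x = np.array([1.0, 1.0, 1.0])
    traj = np.empty((n_steps, 3))
    for t in range(n_steps):
        dx = np.array([
            10.0 * (x[1] - x[0]),
            x[0] * (28.0 - x[2]) - x[1],
            x[0] * x[1] - (8.0 / 3.0) * x[2],
        ])
        x = x + dt * dx
        traj[t] = x
    return traj

# Fixed random reservoir; only the linear readout below is trained.
n_in, n_res = 3, 300
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

def run_reservoir(inputs):
    """Collect reservoir states r_{t+1} = tanh(W r_t + W_in u_t)."""
    r = np.zeros(n_res)
    states = np.empty((len(inputs), n_res))
    for t, u in enumerate(inputs):
        r = np.tanh(W @ r + W_in @ u)
        states[t] = r
    return states

data = lorenz_trajectory(5000)
data = (data - data.mean(axis=0)) / data.std(axis=0)  # normalize inputs
U, Y = data[:-1], data[1:]            # input u_t, one-step target u_{t+1}
R = run_reservoir(U)[200:]            # drop the washout transient
Y = Y[200:]

# Train the readout W_out by ridge regression (the only trained weights).
beta = 1e-6
W_out = np.linalg.solve(R.T @ R + beta * np.eye(n_res), R.T @ Y).T
print("one-step training MSE:", np.mean((R @ W_out.T - Y) ** 2))
```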

Selected Projects

Mobile Manipulation Based on 3D Semantic Fields and VLMs
Jiawei Hou, Xunzhe Zhou, Tongying Pan, Jie Zhang, Shanshan Li, Junyu Lin, Kuanning Wang, Jingshun Huang, Yanwei Fu, Xiangyang Xue

Construct a mobile manipulation and task planning system
project page / code (soon)

Aiming to build a service robot that generalizes across daily-life scenarios, we constructed a mobile manipulation system on a robot assembled from a Franka Panda arm and a Hermes mobile base. In this project, I was responsible for implementing 1) semantic grasping-pose estimation, 2) semantic mobile-base navigation, and 3) hierarchical task planning with 3D Semantic Fields and VLMs. The follow-up work, TaMMa (Hou et al.), was accepted to CoRL 2024.
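As an illustration of the hierarchical task-planning layer, here is a hedged sketch of a plan-then-dispatch loop: a VLM decomposes an instruction into skill calls, which are executed by low-level controllers. The `query_vlm` stub, the skill names, and the interfaces are hypothetical stand-ins, not the system's actual API.

```python
from typing import Callable

# Low-level skills exposed by the arm and mobile base (illustrative).
SKILLS: dict[str, Callable[[str], None]] = {
    "navigate_to": lambda target: print(f"[base] navigating to {target}"),
    "grasp":       lambda target: print(f"[arm]  grasping {target}"),
    "place_on":    lambda target: print(f"[arm]  placing on {target}"),
}

def query_vlm(instruction: str, scene_summary: str) -> list[tuple[str, str]]:
    """Stand-in for a VLM call returning (skill, argument) subtasks.

    A real system would prompt the VLM with the instruction plus objects
    retrieved from the 3D semantic field, then parse a structured reply.
    """
    return [
        ("navigate_to", "kitchen table"),
        ("grasp", "red mug"),
        ("navigate_to", "sink"),
        ("place_on", "sink"),
    ]

def run(instruction: str, scene_summary: str) -> None:
    for skill, arg in query_vlm(instruction, scene_summary):
        SKILLS[skill](arg)  # dispatch each subtask to a low-level skill

run("put the red mug in the sink", "objects: red mug, kitchen table, sink")
```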

Resolving Knowledge Conflicts in Vision-Language Models
Xunzhe Zhou, Xu Li, Yi Zheng, Xiangyang Xue

Construct a small VQA dataset for evaluation, resolve with contrastive decoding
project page / code

VLMs tend to hallucinate when the image input conflicts with the knowledge (common sense) stored in the LLM decoder. To study this, we constructed a small-scale VQA dataset of knowledge-conflict images collected from the Internet or generated with DALL·E 3, and evaluated 8 state-of-the-art VLMs on it. We also resolved the knowledge conflicts in LLaVA-1.5 with contrastive decoding.
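A minimal sketch of the contrastive-decoding idea, assuming the common formulation where the text-only prior is subtracted from the image-conditioned distribution. `vlm_logits` is a hypothetical callable and `alpha` an illustrative hyperparameter, not the project's exact recipe.

```python
import numpy as np

def contrastive_decode_step(vlm_logits, prompt_ids, image, alpha=1.0):
    """Pick the next token by amplifying image-conditioned evidence.

    logits_cd = (1 + alpha) * logits(image, text) - alpha * logits(text only)
    subtracts the text-only prior, so tokens supported mainly by the LLM's
    parametric knowledge rather than the image are down-weighted.
    """
    logits_img = vlm_logits(prompt_ids, image=image)  # image + text
    logits_txt = vlm_logits(prompt_ids, image=None)   # text-only prior
    logits_cd = (1.0 + alpha) * logits_img - alpha * logits_txt
    return int(np.argmax(logits_cd))                  # greedy choice

# Toy stand-in: a 5-token vocabulary where the text prior prefers token 2
# but the image evidence supports token 4.
def toy_logits(prompt_ids, image=None):
    base = np.array([0.0, 0.1, 2.0, 0.2, 1.5])               # parametric prior
    if image is not None:
        base = base + np.array([0.0, 0.0, -1.0, 0.0, 2.0])   # image evidence
    return base

print(contrastive_decode_step(toy_logits, [0], image="img"))  # -> 4
```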

Neural Style Transfer Based on Fine-Tuning Vision Transformer
Xunzhe Zhou, Yisi Liu, Yujie Zhao, Yuanteng Chen

UC Berkeley CS182/282A course project, Fall 2023
project page / essay / code

To improve Neural Style Transfer, we replaced the content and style encoders of StyTr2 with a pre-trained ViT. Constrained by practical compute limitations, we adopted a two-stage training strategy: we first froze the pre-trained ViT and trained only the decoders; we then wrapped the ViT with LoRA and fine-tuned it on the COCO dataset for joint training.
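A minimal PyTorch sketch of this two-stage strategy, under the assumption of a simple LoRA wrapper around the encoder's linear layers; the `ToyModel`, LoRA rank, and learning rates are illustrative stand-ins for the actual StyTr2-based model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

def stage_one(model):
    """Stage 1: freeze the ViT encoder; train only the decoders."""
    for p in model.vit_encoder.parameters():
        p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

def stage_two(model):
    """Stage 2: wrap the encoder's linear layers with LoRA for joint training."""
    for name, module in list(model.vit_encoder.named_children()):
        if isinstance(module, nn.Linear):
            setattr(model.vit_encoder, name, LoRALinear(module))
    return [p for p in model.parameters() if p.requires_grad]

class ToyModel(nn.Module):
    """Toy stand-in for the StyTr2-style model (hypothetical shapes)."""
    def __init__(self):
        super().__init__()
        self.vit_encoder = nn.Sequential(nn.Linear(16, 16))
        self.decoder = nn.Linear(16, 16)

model = ToyModel()
opt1 = torch.optim.Adam(stage_one(model), lr=1e-4)  # decoders only
opt2 = torch.optim.Adam(stage_two(model), lr=1e-5)  # decoders + LoRA params
```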

Experiences

Shanghai AI Laboratory, China 2024.12 - Present

Research Intern

National University of Singapore, Singapore 2024.07 - 2024.12

Research Assistant (Advisor: Prof. Lin Shao)

University of California, Berkeley, USA 2023.08 - 2023.12

Exchange Student (GPA: 4.00 / 4.00)

Fudan University, China 2020.09 - 2025.06

B.S. in Computer Science and Technology (2021.09 - 2025.06)
Natural Science Experimental Class (2020.09 - 2021.06)