Xunzhe Zhou
I am an undergraduate student at Fudan University, pursuing my Bachelor's degree in Computer Science and Technology.
Recently, I have been a research assistant at NUS, working with Prof. Lin Shao on robotic task and motion planning.
Previously, I collaborated with Prof. Yanwei Fu and Prof. Xiangyang Xue on constructing a mobile manipulation system with 3D Semantic Fields and Vision-Language Models.
Before that, I studied the ability of LLMs to follow complex real-world instructions with Prof. Yanghua Xiao,
and nonlinear dynamical systems control with Prof. Siyang Leng.
I also had a wonderful semester at UC Berkeley during fall 2023, where I studied reinforcement learning, deep learning, optimization models, and artificial intelligence, earning a GPA of 4.00 / 4.00.
I am excited to join Shanghai AI Lab as a research intern!
Email /
CV(2025.01) /
GitHub /
Scholar /
LinkedIn /
Twitter
|
|
News
NEW [Jan. 2025]
Our paper EMOS is accepted by ICLR 2025!
NEW [Dec. 2024]
Excited to join Shanghai AI Lab as a research intern!
[Dec. 2023]
Our paper CELLO is accepted by AAAI 2024!
[Aug. 2023]
Honored to visit UC Berkeley as an exchange student!
[Sep. 2020]
Excited to begin my undergraduate journey at Fudan University!
Publications
My research interest lies in efficiently leveraging the generalization and common-sense knowledge of foundation models for robotic decision making and policy learning.
* denotes equal contribution. Representative papers are highlighted.
|
Bi-Adapt: Few-shot Bimanual Adaptation for Novel Categories of 3D Objects via Semantic Correspondence
Jinxian Zhou, Ruihai Wu, Xunzhe Zhou, Checheng Yu, Licheng Zhong, Lin Shao
In submission, 2024
abstract
Bimanual manipulation tasks require coordination between both arms, leading to a high demand for extensive training data.
We introduce Bi-Adapt, a novel framework for learning bimanual manipulation of novel object categories.
We leverage the generalizable semantic correspondence of foundation models, together with few-shot adaptation using minimal data, to achieve high success rates in cross-category bimanual manipulation tasks.
|
|
EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents
Xunzhe Zhou*, Junting Chen*, Checheng Yu*, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao
International Conference on Learning Representations (ICLR), 2025
project page / abstract / paper / code / bibtex
In real-world robot environments, the capability of an agent in a multi-agent system (MAS) is tied to the physical composition of its robot.
We introduce EMOS, a multi-agent framework that improves collaboration among heterogeneous robots with varying embodiment capabilities.
To evaluate how well our MAS performs, we design the Habitat-MAS benchmark, which includes four tasks: 1) navigation, 2) perception, 3) manipulation, and 4) comprehensive multi-floor object rearrangement.
|
|
Can Large Language Models Understand Real-World Complex Instructions?
Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Lida Chen, Xintao Wang, Yuncheng Huang, Haoning Ye, Zihan Li, Shisong Chen, Yikai Zhang, Zhouhong Gu, Jiaqing Liang, Yanghua Xiao
AAAI Conference on Artificial Intelligence (AAAI), 2024
project page / abstract / paper / code / bibtex / video
Current LLMs often ignore semantic constraints, generate incorrect formats, violate length or count constraints, and are unfaithful to the input text.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions. We design a real-world dataset carefully crafted by human experts, with 566 samples spanning 9 tasks. We also establish 4 criteria with corresponding metrics, and compare 18 Chinese-oriented models and 15 English-oriented models.
|
|
Reservoir Computing as Digital Twins for Controlling Nonlinear Dynamical Systems
Xunzhe Zhou*, Ruizhi Cao*, Jiawen Hou, Chun Guan, Siyang Leng
In submission, 2023
It is difficult to control a chaotic system without knowing its differential equations, and traditional control strategies require continuous adjustment of various parameters.
To address this issue, we leverage an Echo State Network as a digital twin to predict and control the behavior of chaotic systems.
We evaluate our model on three chaotic systems under three control strategies,
and conduct extensive experiments to validate its prediction accuracy, control efficiency, and noise robustness.
|
Selected Projects
|
Mobile Manipulation Based on 3D Semantic Fields and VLMs
Jiawei Hou, Xunzhe Zhou, Tongying Pan, Jie Zhang, Shanshan Li, Junyu Lin, Kuanning Wang, Jingshun Huang, Yanwei Fu, Xiangyang Xue
Construct a mobile manipulation and task planning system
project page / code (soon)
Aiming to build a service robot that generalizes to daily-life scenarios, we constructed a mobile manipulation system on a robot assembled from a Franka Panda arm and a Hermes mobile base.
In this project, I was responsible for implementing 1) semantic grasping pose estimation, 2) semantic mobile base navigation, and 3) hierarchical task planning with 3D Semantic Fields and VLMs. The follow-up work, TaMMa (Hou et al.), was accepted by CoRL 2024.
|
|
Resolving Knowledge Conflicts in Vision-Language Models
Xunzhe Zhou, Xu Li, Yi Zheng, Xiangyang Xue
Construct a small VQA dataset for evaluation, resolve with contrastive decoding
project page / code
VLMs tend to hallucinate when the image input conflicts with the knowledge (common sense) of the LLM decoder.
To resolve this issue, we constructed a small-scale VQA dataset with images involving knowledge conflicts, collected from the Internet or generated with DALL·E 3, and evaluated 8 state-of-the-art VLMs on it.
We also resolved the knowledge conflicts in LLaVA-1.5 with contrastive decoding.
|
|
Neural Style Transfer Based on Fine Tuning Vision Transformer
Xunzhe Zhou, Yisi Liu, Yujie Zhao, Yuanteng Chen
UC Berkeley CS182/282A course project, Fall 2023
project page / essay / code
To improve Neural Style Transfer, we replaced the content and style encoders of StyTr2 with pre-trained ViT.
Restricted by practical computation limits, we adopted a two-stage training strategy: we first froze the pre-trained ViT and trained only the decoders; we then applied LoRA to fine-tune the ViT on the COCO dataset for joint training.
|
Experiences
|
Shanghai AI Laboratory, China
2024.12 - Present
Research Intern
|
|
National University of Singapore, Singapore
2024.07 - 2024.12
Research Assistant (Advisor: Prof. Lin Shao)
|
|
University of California, Berkeley, USA
2023.08 - 2023.12
Exchange Student (GPA: 4.00 / 4.00)
|
|
Fudan University, China
2020.09 - 2025.06
B.S. in Computer Science and Technology (2021.09 - 2025.06)
Natural Science Experimental Class (2020.09 - 2021.06)
|
|