M^3-Verse: A “Spot the Difference” Challenge for Large Multimodal Models
Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia,
Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang
Arxiv, 2025
[Paper]
[Code]
[Dataset]
M^3-Verse enables evaluating Large Multimodal Models' ability to reason about object state changes in paired
indoor videos, and proposes the HCTR method to boost multi-state perception performance.
MovieChat+: Question-Aware Sparse Memory for Long Video Question Answering
Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, Gaoang Wang
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
[Paper]
[Code]
Song et al. (2024) propose MovieChat+, which uses question-aware sparse memory for long video QA. It outperforms
prior state-of-the-art methods, reduces VRAM cost, and is released together with the MovieChat-1K benchmark.
IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion
Wenhao Hu, Zesheng Li, Haonan Zhou, Liu Liu, Xuexiang Wen, Zhizhong Su,
Xi Li, Gaoang Wang
Association for the Advancement of Artificial Intelligence (AAAI), 2025
[Paper]
[Code]
[Website]
IGFuse enables high-fidelity rendering and object-level scene manipulation without dense observations or complex
pipelines.
Adaptive Graph Pruning for Multi-Agent Communication
Boyi Li, Zhonghan Zhao, Der-Horng Lee, Gaoang Wang
Arxiv, 2025
[Paper]
[Code]
[Dataset]
[Website]
Adaptive Graph Pruning (AGP), a novel task-adaptive multi-agent collaboration framework that jointly optimizes
agent quantity (hard-pruning) and communication topology (soft-pruning).
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, Gaoang Wang
Arxiv, 2025
[Paper]
[Code]
[Dataset]
[Website]
Video-MMLU is a benchmark for evaluating LMMs' multi-discipline lecture understanding. It covers math, physics,
and chemistry through captioning and QA tasks, and reveals limitations of current models.
Pointmap Association and Piecewise-Plane Constraint for Consistent and Compact 3D Gaussian Segmentation Field
Wenhao Hu, Wenhao Chai, Shengyu Hao, Xiaotong Cui, Xuexiang Wen, Jenq-Neng Hwang, Gaoang Wang
Arxiv, 2025
[Paper]
[Website]
A method designed to achieve both view-consistent 2D segmentation and a
compact 3D Gaussian segmentation field.
CityCraft: A Real Crafter for 3D City Generation
Jie Deng, Wenhao Chai, Junsheng Huang, Zhonghan Zhao, Mingyan Gao, Qixuan Huang, Jianshu Guo, Shengyu Hao, Wenhao
Hu, Jenq-Neng Hwang, Xi Li, Gaoang Wang
Arxiv, 2024
[Paper]
[Code]
CityCraft, an innovative framework designed to enhance both the
diversity and quality of urban scene generation.
Vision meets mmWave Radar: 3D Object Perception Benchmark for Autonomous Driving
Yizhou Wang, Jen-Hao Cheng, Jui-Te Huang, Sheng-Yao Kuan,
Qiqian Fu, Chiming Ni, Shengyu Hao, Gaoang Wang,
Guanbin Xing, Hui Liu, Jenq-Neng Hwang
IEEE Intelligent Vehicles Symposium (IV), 2024
[Paper]
[Dataset]
CRUW3D dataset, including 66K synchronized and well-calibrated camera, radar, and LiDAR frames in various driving
scenarios.
MovieChat: From Dense Token to Sparse Memory in Long Video Understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye,
Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
Computer Vision and Pattern Recognition (CVPR), 2024
[Paper]
[Code]
[Dataset]
[Website]
MovieChat achieves state-of-the-art performance in long video understanding by introducing a memory mechanism.
UniAP: Towards Universal Animal Perception in Vision via Few-Shot Learning
Meiqi Sun, Zhonghan Zhao, Wenhao Chai, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang
Association for the Advancement of Artificial Intelligence (AAAI), 2024
[Paper]
[Code]
[Website]
UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species
perception among various visual tasks.
DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes
Shengyu Hao, Peiyuan Liu, Yibing Zhan, Kaixun Jin, Zuozhu Liu, Mingli Song, Jenq-Neng Hwang, Gaoang Wang
International Journal of Computer Vision (IJCV), 2023
[Paper]
[Dataset]
[Code]
A new cross-view multi-object tracking dataset for DIVerse Open scenes with densely tracked pedestrians.
DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models
Shidong Cao, Wenhao Chai, Shengyu Hao, Yanting Zhang, Hangyue Chen, Gaoang Wang
IEEE Transactions on Multimedia (TMM), 2023
[Paper]
[Code]
We focus on a new fashion design task, where we aim to transfer a reference appearance image onto a clothing
image while preserving the structure of the clothing image.
StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu
International Conference on Computer Vision (ICCV), 2023
[Website]
[Paper]
[Demo]
[Code]
We introduce temporal dependency to existing text-driven diffusion models, which allows them to generate
consistent appearance for the new objects.
Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation
Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang, Gaoang Wang
International Conference on Computer Vision (ICCV), 2023
[Paper]
[Code]
A simple yet effective framework of unsupervised domain adaptation for 3D human pose estimation.
PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Enhanced 3D Human Pose Estimation
Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo,
Yifeng Geng, Xuansong Xie
ACM Multimedia (ACM MM), 2023
[Paper]
[Code]
PoSynDA offers a state-of-the-art domain adaptation solution for 3D pose estimation.
SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
Xuewei Li, Tao Wu, Zhongang Qi, Gaoang Wang, Ying Shan, Xi Li
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 2023
[Paper]
[Code]
To be more robust to 3D disturbances, we propose our Spherical Geometry-Aware Transformer for PAnoramic Semantic
Segmentation (SGAT4PASS), considering 3D spherical geometry knowledge.