🎩HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

Jiaxin Li1,2,*, Yuxiang Wu1,*, Zhenkai Zhang1, Xinrui Shi1, Haoyuan Wang1
Yichen Zhao1,†, Su Linxiang1,†, Chenyang Yu1, Mingyu Zhang1, Yifan Ding2
Boran Wen1,2, Li Zhang3, Ruiyang Liu4, Yong-Lu Li1,2,‡
ECCV 2026
1 Shanghai Jiao Tong University (SJTU) 2 Shanghai Innovation Institute (SII)
* Equal contribution † Internship ‡ Corresponding author
HAT-4D teaser

HAT-4D reconstructs the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single in-the-wild monocular video through human-agent collaboration.

Abstract

Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training Vision-Language-Action (VLA) models. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating Vision-Language Models (VLMs) with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multi-camera rigs. Functioning as a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments on MVOIK-4D demonstrate that HAT-4D achieves state-of-the-art performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback significantly improves interaction reconstruction. In addition, the data produced by HAT-4D effectively improves baseline performance when used for finetuning.

Method Overview

HAT-4D method overview
Overview of HAT-4D. The framework extracts an interaction knowledge graph from a monocular video, reconstructs and composes interacting objects, and uses memory-guided 4D propagation with agent and human feedback to refine failed cases.

Interaction Knowledge Graph (IKG)

Interaction Knowledge Graph formulation
Interaction Knowledge Graph formulation. IKG models physical interactions as a dynamic directed graph over event segments, object attributes, spatio-temporal relations, and physical constraints for downstream 4D generation.

Experiments

Human intervention ablation
Human intervention. Allowing a small number of human annotations during generation substantially improves reconstruction and interaction quality.
PSNR versus epoch results
View scaling. L4GM finetuning benefits from additional original, jittered, and randomly sampled views on HAT-4D sequences.

Citation

@inproceedings{Li2026hat4d,
  title     = {HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration},
  author    = {Li, Jiaxin and Author, Second and Author, Third},
  booktitle = {Computer Vision -- ECCV 2026},
  year      = {2026},
  note      = {Accepted, to appear}
}