HAT-4D

HAT-4D reconstructs the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single in-the-wild monocular video through human-agent collaboration.

Abstract

Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training Vision-Language-Action (VLA) models. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating Vision-Language Models (VLMs) with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multi-camera rigs. Functioning as a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments on MVOIK-4D demonstrate that HAT-4D achieves state-of-the-art performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback significantly improves interaction reconstruction. In addition, the data produced by HAT-4D effectively improves baseline performance when used for finetuning.

Method Overview

Overview of HAT-4D. The framework extracts an interaction knowledge graph from a monocular video, reconstructs and composes interacting objects, and uses memory-guided 4D propagation with agent and human feedback to refine failed cases.

Interaction Knowledge Graph (IKG)

Interaction Knowledge Graph formulation. IKG models physical interactions as a dynamic directed graph over event segments, object attributes, spatio-temporal relations, and physical constraints for downstream 4D generation.

Gallery

Diverse Interaction Scenarios

Interaction Challenges

Experiments

Human intervention ablation — **Human intervention.** Allowing a small number of human annotations during generation substantially improves reconstruction and interaction quality.

PSNR versus epoch results — **View scaling.** L4GM finetuning benefits from additional original, jittered, and randomly sampled views on HAT-4D sequences.

Citation

@inproceedings{Li2026hat4d,
  title     = {HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration},
  author    = {Li, Jiaxin and Author, Second and Author, Third},
  booktitle = {Computer Vision -- ECCV 2026},
  year      = {2026},
  note      = {Accepted, to appear}
}

🎩HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration