We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos captured with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, and thus fail to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints, and their performance degrades under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework consisting of a robust preprocessing pipeline built on recently developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scales to multiple objects. Experiments demonstrate that our method achieves state-of-the-art performance in W-HOI reconstruction.
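The abstract describes the test-time optimization stage only at a high level. As a rough illustration, the sketch below shows what a multi-objective test-time optimization over world-space hand and object trajectories could look like. Everything in it is a hedged assumption: the variable names, the loss terms (data, contact, smoothness), the weights, and the PyTorch formulation are illustrative only and are not the paper's actual objectives.

import torch

T, J = 16, 21                              # assumed: number of frames and hand joints

# Hypothetical free variables: per-frame world-space hand and object translations.
hand_trans = torch.zeros(T, 3, requires_grad=True)
obj_trans = torch.zeros(T, 3, requires_grad=True)

# Hypothetical fixed input from a preprocessing stage: observed hand joints per frame.
hand_joints_obs = torch.randn(T, J, 3)

def data_loss(hand_trans):
    # Placeholder data term: keep optimized joints close to the observations.
    joints = hand_joints_obs + hand_trans[:, None, :]
    return (joints - hand_joints_obs).pow(2).mean()

def contact_loss(hand_trans, obj_trans):
    # Placeholder HOI term: keep hand and object trajectories close during contact.
    return (hand_trans - obj_trans).pow(2).mean()

def smoothness_loss(x):
    # Temporal smoothness between consecutive frames.
    return (x[1:] - x[:-1]).pow(2).mean()

weights = {"data": 1.0, "contact": 0.5, "smooth": 0.1}   # assumed weights
opt = torch.optim.Adam([hand_trans, obj_trans], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    loss = (weights["data"] * data_loss(hand_trans)
            + weights["contact"] * contact_loss(hand_trans, obj_trans)
            + weights["smooth"] * (smoothness_loss(hand_trans) + smoothness_loss(obj_trans)))
    loss.backward()
    opt.step()

In the actual method, each term would be driven by the pipeline's own cues (estimated camera poses, the HOI prior, object pose evidence); the sketch only conveys the structure of a weighted, multi-term optimization run at test time.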
[Demo 1]
[Demo 1 caption]
[Demo 2]
[Demo 2 caption]
[Demo 3]
[Demo 3 caption]
[Result Image 1]
[Caption 1]
[Result Image 2]
[Caption 2]
[Result Image 3]
[Caption 3]
[Result Image 4]
[Caption 4]
@article{fu2026egograsp,
  title={EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos},
  author={Fu, Hongming and Wang, Wenjia and Qiao, Xiaozhen and Yang, Shuo and Liu, Zheng and Zhao, Bo},
  journal={arXiv preprint arXiv:2601.01050},
  year={2026}
}