Overview of the proposed LHM++. In 2D space, we extract image tokens T2D from input RGB images by DINOv2. In 3D space, geometric tokens T3D are derived from SMPL-X anchor points via an MLP.
Next, we design an Encoder-Decoder Point-Image Transformer (PIT) to hierarchically fuse 3D and 2D tokens, where the downsampled 3D tokens interact with 2D tokens via multi-modal attention in each layer.
The final 3D tokens are decoded into 3D Gaussian parameters, followed by a lightweight DPT head for photorealistic animation.
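To make the fusion step concrete, here is a minimal NumPy sketch of one multi-modal attention layer, where the downsampled 3D point tokens attend to the joint sequence of 3D and 2D tokens. The function name, token counts, and dimensions are illustrative assumptions, not the actual LHM++ implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention(t3d, t2d, d=64, seed=0):
    """One hypothetical fusion step: 3D point tokens (queries) attend
    to the concatenated [3D, 2D] token sequence (keys/values)."""
    rng = np.random.default_rng(seed)
    dim = t3d.shape[-1]
    # randomly initialized projection weights, for illustration only
    Wq, Wk, Wv = (rng.standard_normal((dim, d)) / np.sqrt(dim) for _ in range(3))
    ctx = np.concatenate([t3d, t2d], axis=0)   # joint 3D+2D context
    q = t3d @ Wq                               # (N3d, d)
    k = ctx @ Wk                               # (N3d+N2d, d)
    v = ctx @ Wv                               # (N3d+N2d, d)
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v                            # updated 3D tokens, (N3d, d)

# usage: 32 downsampled geometric tokens fuse with 196 image tokens
t3d = np.random.default_rng(1).standard_normal((32, 128))
t2d = np.random.default_rng(2).standard_normal((196, 128))
out = multimodal_attention(t3d, t2d)
print(out.shape)  # → (32, 64)
```

In the actual model this fusion is applied hierarchically in every PIT layer with learned weights; the sketch above only shows the attention pattern.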
Thanks to the proposed Encoder-Decoder Point-Image Transformer, LHM++ achieves dramatic speedups over LHM-0.7B across all configurations. LHM++ reconstructs a 160K-point avatar from a single image in under 0.79 s, compared to 74.31 s for LHM-0.7B, a 94× speedup. Even with 16 input views, LHM++ completes inference in only 2.13 s, making it highly practical for real-time applications.
Animation videos coming soon...
@article{qiu2025lhmpp,
  title={LHM++: An Efficient Large Human Reconstruction Model for Pose-free Images to 3D},
  author={Lingteng Qiu and Peihao Li and Heyuan Li and Qi Zuo and Xiaodong Gu and Yuan Dong and Weihao Yuan and Rui Peng and Siyu Zhu and Xiaoguang Han and Guanying Chen and Zilong Dong},
  journal={arXiv preprint arXiv:2503.10625},
  year={2025}
}
@inproceedings{qiu2025LHM,
  title={LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds},
  author={Lingteng Qiu and Xiaodong Gu and Peihao Li and Qi Zuo and Weichao Shen and Junfei Zhang and Kejie Qiu and Weihao Yuan and Guanying Chen and Zilong Dong and Liefeng Bo},
  booktitle={ICCV},
  year={2025}
}