Overview of the proposed LHM++. In 2D space, we extract image tokens T2D from input RGB images by DINOv2. In 3D space, geometric tokens T3D are derived from SMPL-X anchor points via an MLP.
Next, we design an Encoder-Decoder Point-Image Transformer (PIT) to hierarchically fuse 3D and 2D tokens, where the downsampled 3D tokens interact with 2D tokens via multi-modal attention in each layer.
The final 3D tokens are decoded into 3D Gaussian parameters, followed by a lightweight DPT head for photorealistic animation.
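To make the fusion step concrete, here is a minimal NumPy sketch of one multi-modal attention layer, where the downsampled 3D point tokens attend to the joint sequence of 3D and 2D tokens. The function name, token counts, and dimensions are illustrative assumptions, not the actual LHM++ implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_attention(t3d, t2d, d=64, seed=0):
    """One hypothetical fusion step: 3D point tokens (queries) attend
    to the concatenated [3D, 2D] token sequence (keys/values)."""
    rng = np.random.default_rng(seed)
    dim = t3d.shape[-1]
    # randomly initialized projection weights, for illustration only
    Wq, Wk, Wv = (rng.standard_normal((dim, d)) / np.sqrt(dim) for _ in range(3))
    ctx = np.concatenate([t3d, t2d], axis=0)   # joint 3D+2D context
    q = t3d @ Wq                               # (N3d, d)
    k = ctx @ Wk                               # (N3d+N2d, d)
    v = ctx @ Wv                               # (N3d+N2d, d)
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v                            # updated 3D tokens, (N3d, d)

# usage: 32 downsampled geometric tokens fuse with 196 image tokens
t3d = np.random.default_rng(1).standard_normal((32, 128))
t2d = np.random.default_rng(2).standard_normal((196, 128))
out = multimodal_attention(t3d, t2d)
print(out.shape)  # → (32, 64)
```

In the actual model this fusion is applied hierarchically in every PIT layer with learned weights; the sketch above only shows the attention pattern.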
Thanks to the proposed Encoder-Decoder Point-Image Transformer, LHM++ achieves dramatic speedups over LHM-0.7B across all configurations. LHM++ reconstructs a 160K-point avatar from a single image in under 0.79 s, compared to 74.31 s for LHM-0.7B, a 94× speedup. Even with 16 input views, LHM++ completes inference in only 2.13 s, making it highly practical for real-time applications.
Animation videos coming soon...
@article{qiu2025lhmpp,
  title={LHM++: An Efficient Large Human Reconstruction Model for Pose-free Images to 3D},
  author={Lingteng Qiu and Peihao Li and Heyuan Li and Qi Zuo and Xiaodong Gu and Yuan Dong and Weihao Yuan and Rui Peng and Siyu Zhu and Xiaoguang Han and Guanying Chen and Zilong Dong},
  journal={arXiv preprint arXiv:2503.10625},
  year={2025}
}
@inproceedings{qiu2025LHM,
  title={LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds},
  author={Lingteng Qiu and Xiaodong Gu and Peihao Li and Qi Zuo and Weichao Shen and Junfei Zhang and Kejie Qiu and Weihao Yuan and Guanying Chen and Zilong Dong and Liefeng Bo},
  booktitle={ICCV},
  year={2025}
}