Single camera 3D pose estimation is an ill-defined problem due to inherent ambiguities from depth, occlusion or keypoint noise. Multi-hypothesis pose estimation accounts for this uncertainty by providing multiple 3D poses consistent with the 2D measurements. Current research has predominantly concentrated on generating multiple hypotheses for single frame static pose estimation. In this study we focus on the new task of multi-hypothesis motion estimation. Motion estimation is not simply pose estimation applied to multiple frames, which would ignore temporal correlation across frames. Instead, it requires distributions which are capable of generating temporally consistent samples, which is significantly more challenging. To this end, we introduce Platypose, a framework that uses a diffusion model pretrained on 3D human motion sequences for zero-shot 3D pose sequence estimation. Platypose outperforms baseline methods on multi-hypothesis motion estimation. Platypose also achieves state-of-the-art calibration and competitive joint error when tested on static poses from Human3.6M, MPI-INF-3DHP and 3DPW. Finally, because it is zero-shot, our method generalizes flexibly to different settings such as multi-camera inference.
A noisy 3D motion $\mathbf{x}$ is denoised by a motion diffusion model trained on H36M. The denoised 3D motion samples $\hat{\mathbf{x}}_0$ are projected to 2D with a camera model. The reprojection error between the projections and 2D observations is minimized. The updated 3D motion is diffused to $t - n$ and passed back into the diffusion model.
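The loop described in this caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the Platypose implementation: the pretrained motion diffusion model is replaced by a toy stand-in (the posterior mean under a Gaussian prior, built by the hypothetical helper `make_gaussian_denoiser`), the camera is an ideal pinhole with a hypothetical focal length, the reprojection error is minimized with plain gradient descent, and the noise schedule, step size `lr`, skip `n` and `guide_steps` are all illustrative choices.

```python
import numpy as np


def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level of an assumed linear DDPM noise schedule."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, T))


def project(x3d, focal=1.0):
    """Assumed pinhole camera: (..., J, 3) camera-frame points -> (..., J, 2)."""
    return focal * x3d[..., :2] / x3d[..., 2:3]


def reprojection_grad(x3d, obs2d, focal=1.0):
    """Gradient of 0.5 * sum ||project(x3d) - obs2d||^2 with respect to x3d."""
    r = project(x3d, focal) - obs2d            # 2D residuals
    inv_z = 1.0 / x3d[..., 2]
    gx = r[..., 0] * focal * inv_z
    gy = r[..., 1] * focal * inv_z
    gz = -(r[..., 0] * x3d[..., 0] + r[..., 1] * x3d[..., 1]) * focal * inv_z**2
    return np.stack([gx, gy, gz], axis=-1)


def make_gaussian_denoiser(mu, sigma, alpha_bar):
    """Toy stand-in for the pretrained model: E[x_0 | x_t] for x_0 ~ N(mu, sigma^2 I)."""
    def denoise(x_t, t):
        a = alpha_bar[t]
        gain = np.sqrt(a) * sigma**2 / (a * sigma**2 + 1.0 - a)
        return mu + gain * (x_t - np.sqrt(a) * mu)
    return denoise


def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion: add noise to a clean motion up to timestep t."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * rng.standard_normal(x0.shape)


def estimate_motion(obs2d, denoise, alpha_bar, n=100, guide_steps=50, lr=5.0, rng=None):
    """Denoise -> project -> reduce reprojection error -> diffuse to t - n, repeated."""
    rng = np.random.default_rng() if rng is None else rng
    t = len(alpha_bar) - 1
    x = rng.standard_normal(obs2d.shape[:-1] + (3,))   # noisy 3D motion
    while True:
        x0 = denoise(x, t)                             # denoised 3D motion sample
        for _ in range(guide_steps):                   # fit the 2D observations
            x0 = x0 - lr * reprojection_grad(x0, obs2d)
        t_next = t - n
        if t_next <= 0:
            return x0                                  # final 3D motion hypothesis
        x = q_sample(x0, t_next, alpha_bar, rng)       # diffuse back to t - n
        t = t_next
```

Calling `estimate_motion` several times with different random generators yields different 3D motions that all reproject close to the same 2D observations, which is the sense in which the procedure produces multiple hypotheses. Because the camera model only enters through `project` and its gradient, summing reprojection gradients over several cameras would be one way to extend the sketch to multi-camera input.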
Motion estimation involves predicting a consistent sequence of poses based on 2D observations, as opposed to pose estimation which estimates a static pose for a single frame. We evaluate the generation of multiple hypotheses for sequences of different lengths (16, 64 and 128 frames) using the H36M dataset.
Walking
Directions
Eating
Discussion
SittingDown
Platypose can generate motions of different sequence lengths. We show an example where Platypose generates progressively longer sequences. Longer sequences are harder to generate, so accuracy visibly decreases as the length grows.
16 Frames
64 Frames
128 Frames
256 Frames
Multi-hypothesis pose estimation allows sampling from the posterior distribution of 3D poses conditioned on 2D observations. We show examples of such distributions below.
Here we show cases where Platypose fails to generate reasonable 3D hypotheses. These failures may stem from issues with 2D keypoint detection or unexplained ambiguities. We include videos showcasing these failure examples.
2D Keypoint Failure
2D Keypoint Failure
Bimodality
Bimodality
@inproceedings{pierzchlewicz2024platypose,
title="Platypose: Calibrated Zero-Shot Multi-Hypothesis 3D Human Motion Estimation",
author="Pawel A. Pierzchlewicz and Caio da Silva and James Cotton and Fabian H. Sinz",
year="2024",
eprint="2403.06164",
archivePrefix="arXiv"
}