Platypose: Calibrated Zero-Shot Multi-Hypothesis 3D Human Motion Estimation

Pawel A. Pierzchlewicz, Caio da Silva, James Cotton, Fabian H. Sinz

University of Tübingen, University of Göttingen, Shirley Ryan AbilityLab, Northwestern University, Baylor College of Medicine

Platypose uses a diffusion model pretrained on 3D human motion sequences for zero-shot 3D pose sequence estimation.

Abstract

Single-camera 3D pose estimation is an ill-defined problem due to inherent ambiguities from depth, occlusion or keypoint noise. Multi-hypothesis pose estimation accounts for this uncertainty by providing multiple 3D poses consistent with the 2D measurements. Current research has predominantly concentrated on generating multiple hypotheses for single-frame static pose estimation. In this study we focus on the new task of multi-hypothesis motion estimation. Motion estimation is not simply pose estimation applied to multiple frames, which would ignore temporal correlation across frames. Instead, it requires distributions that are capable of generating temporally consistent samples, which is significantly more challenging. To this end, we introduce Platypose, a framework that uses a diffusion model pretrained on 3D human motion sequences for zero-shot 3D pose sequence estimation. Platypose outperforms baseline methods on multiple hypotheses for motion estimation. Platypose also achieves state-of-the-art calibration and competitive joint error when tested on static poses from Human3.6M, MPI-INF-3DHP and 3DPW. Finally, because it is zero-shot, our method generalizes flexibly to different settings such as multi-camera inference.

Zero-Shot Sampling

A noisy 3D motion $\mathbf{x}_t$ is denoised by a motion diffusion model trained on H36M. The denoised 3D motion samples $\hat{\mathbf{x}}_0$ are projected to 2D with a camera model, and the reprojection error between these projections and the 2D observations is minimized. The updated 3D motion is then diffused back to timestep $t - n$ and passed into the diffusion model again.
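The loop described above can be sketched in a few lines of NumPy. This is only an illustrative toy under several assumptions, not the Platypose implementation: the pretrained motion diffusion model is replaced by the closed-form denoiser of a Gaussian prior around a mean pose, the camera is a pinhole with unit focal length, the noise schedule is a linear ramp of the cumulative signal level, and the reprojection error is minimized with plain gradient descent. All function and parameter names (`project`, `reprojection_grad`, `sample_hypothesis`, `prior_var`, `lr`, `inner_steps`) are hypothetical.

```python
import numpy as np

def project(x, focal=1.0):
    """Pinhole projection of 3D points (..., 3) to 2D (..., 2); assumes Z > 0."""
    return focal * x[..., :2] / x[..., 2:3]

def reprojection_grad(x, y2d, focal=1.0):
    """Gradient of sum ||project(x) - y2d||^2 with respect to the 3D points x."""
    X, Y, Z = x[..., 0], x[..., 1], x[..., 2]
    r = project(x, focal) - y2d                      # 2D residuals
    g = np.empty_like(x)
    g[..., 0] = 2.0 * r[..., 0] * focal / Z
    g[..., 1] = 2.0 * r[..., 1] * focal / Z
    g[..., 2] = -2.0 * focal * (r[..., 0] * X + r[..., 1] * Y) / Z**2
    return g

def sample_hypothesis(y2d, mean_pose, rng, T=50, n=5,
                      prior_var=0.25, lr=2.0, inner_steps=40):
    """Draw one 3D motion hypothesis consistent with 2D observations y2d.

    y2d has shape (frames, joints, 2); mean_pose has shape (joints, 3).
    """
    # Assumed schedule: abar[0] = 1 (clean signal), abar[T] = 0.01 (mostly noise).
    abar = np.linspace(1.0, 0.01, T + 1)
    x_t = rng.standard_normal(y2d.shape[:-1] + (3,))  # start from pure noise
    for t in range(T, 0, -n):
        a = abar[t]
        # 1) Denoise. Stand-in for the pretrained model: posterior mean of x_0
        #    given x_t under a Gaussian prior N(mean_pose, prior_var * I).
        gain = np.sqrt(a) * prior_var / (a * prior_var + 1.0 - a)
        x0 = mean_pose + gain * (x_t - np.sqrt(a) * mean_pose)
        # 2) Project to 2D and 3) reduce the reprojection error by gradient descent.
        for _ in range(inner_steps):
            x0 = x0 - lr * reprojection_grad(x0, y2d)
        # 4) Diffuse the updated motion to timestep t - n (forward process).
        a_next = abar[max(t - n, 0)]
        noise = rng.standard_normal(x0.shape)
        x_t = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * noise
    return x_t  # abar[0] = 1, so the last step adds no noise

# Usage: synthetic 8-frame, 5-joint motion about 5 units in front of the camera.
rng = np.random.default_rng(0)
mean_pose = rng.uniform(-0.5, 0.5, (5, 3)) + np.array([0.0, 0.0, 5.0])
truth = mean_pose + 0.2 * rng.standard_normal((8, 5, 3))
observations = project(truth)
hypotheses = np.stack([sample_hypothesis(observations, mean_pose, rng)
                       for _ in range(4)])            # shape (4, 8, 5, 3)
```

Because each call starts from different noise, repeated calls give different 3D motions that all reproject close to the same 2D observations. In this toy the hypotheses differ mainly in depth, which the 2D measurements leave unconstrained.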


Motion Estimation Gallery

Motion estimation involves predicting a consistent sequence of poses based on 2D observations, as opposed to pose estimation which estimates a static pose for a single frame. We evaluate the generation of multiple hypotheses for sequences of different lengths (16, 64 and 128 frames) using the H36M dataset.
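To illustrate why per-frame pose sampling is not the same as motion sampling, the toy below compares two ways of drawing hypotheses for a single ambiguous quantity, such as the depth offset of one joint. It is a sketch under stated assumptions rather than anything taken from Platypose: the temporally consistent case is modeled with an AR(1) process, and `rho`, `depth_std`, and `frame_to_frame_jitter` are hypothetical names chosen for illustration.

```python
import numpy as np

def frame_to_frame_jitter(samples):
    """Mean absolute change between consecutive frames.

    samples has shape (hypotheses, frames).
    """
    return np.abs(np.diff(samples, axis=1)).mean()

rng = np.random.default_rng(0)
n_hyp, n_frames, depth_std = 200, 64, 0.3

# Pose estimation applied frame by frame: the ambiguous depth offset is
# resampled independently in every frame.
independent = depth_std * rng.standard_normal((n_hyp, n_frames))

# Motion estimation: offsets are drawn jointly across frames. Here an AR(1)
# process with the same per-frame standard deviation stands in for a
# temporally consistent distribution.
rho = 0.95
correlated = np.empty((n_hyp, n_frames))
correlated[:, 0] = depth_std * rng.standard_normal(n_hyp)
for k in range(1, n_frames):
    innovation = rng.standard_normal(n_hyp)
    correlated[:, k] = (rho * correlated[:, k - 1]
                        + depth_std * np.sqrt(1.0 - rho**2) * innovation)

jitter_independent = frame_to_frame_jitter(independent)
jitter_correlated = frame_to_frame_jitter(correlated)
```

Both sets of samples have roughly the same spread in each individual frame, but the independently sampled sequences jump between frames while the jointly sampled ones vary smoothly. A per-frame metric cannot tell the two apart, which is why motion estimation needs a distribution over whole sequences.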

Flexible Sequence Lengths

Platypose can generate motions of different sequence lengths. The examples below show progressively longer sequences. Longer sequences are harder to generate, so accuracy visibly decreases as the length grows.

16 Frames


64 Frames


128 Frames


256 Frames


Multi-Hypothesis Estimation

Multi-hypothesis pose estimation allows sampling from the posterior distribution of 3D poses conditioned on 2D observations. We show examples of such distributions below.

Failure Cases

Here we show cases where Platypose fails to generate reasonable 3D hypotheses. These failures may stem from issues with 2D keypoint detection or unexplained ambiguities. We include videos showcasing these failure examples.

2D Keypoint Failure


2D Keypoint Failure


Bimodality


Bimodality


BibTeX

@inproceedings{pierzchlewicz2024platypose,
    title="Platypose: Calibrated Zero-Shot Multi-Hypothesis 3D Human Motion Estimation",
    author="Pawel A. Pierzchlewicz and Caio da Silva and James Cotton and Fabian H. Sinz",
    year="2024",
    eprint="2403.06164",
    archivePrefix="arXiv"
}