Diffusion Models as Masked Autoencoders

¹FAIR, Meta AI; ²Johns Hopkins University; ³UC Santa Cruz
Figure panels: ground-truth, DiffMAE, MAE.

DiffMAE gradually adds visual details by diffusion for masked autoencoding.


It has been a longstanding belief that generation can facilitate a true understanding of visual data.

In line with this, we revisit generative pre-training of visual representations in light of denoising diffusion models, and build a connection between diffusion models and masked autoencoders.

In particular, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach can:

  • Serve as a strong initialization for downstream recognition tasks;
  • Conduct generative image inpainting;
  • Be effortlessly extended to video.
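
As a rough illustration of what conditioning a diffusion model on masked input could look like, the NumPy sketch below splits an image into patches, masks a random subset, and applies the standard Gaussian forward-noising step to the masked patches only, leaving the visible patches clean as conditioning. The patch size, mask ratio, noise level `alpha_bar_t`, and all function names are illustrative assumptions rather than details taken from this page or from the released DiffMAE code; the denoising network itself is omitted.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector that is True for masked patches."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def noise_masked_patches(patches, mask, alpha_bar_t, rng):
    """Noise masked patches as sqrt(a)*x0 + sqrt(1-a)*eps; keep visible ones clean."""
    eps = rng.standard_normal(patches.shape)
    noised = np.sqrt(alpha_bar_t) * patches + np.sqrt(1.0 - alpha_bar_t) * eps
    return np.where(mask[:, None], noised, patches)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))          # stand-in for a real image
patches = patchify(img, patch=16)         # assumed 16x16 patches
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # assumed ratio
x_t = noise_masked_patches(patches, mask, alpha_bar_t=0.5, rng=rng)
```

In this reading, a model would receive `x_t` together with `mask` and be trained to recover the clean content of the masked patches; smaller values of `alpha_bar_t` correspond to noisier inputs.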

Random Mask Inpainting

The images are from the ImageNet-1K validation set.

Panels: masked input, ground-truth, DiffMAE (generations shown at different inference steps).

Center Mask Inpainting

The images are from the ImageNet-1K validation set.

Panels: ground-truth, DiffMAE.
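
For readers who want to reproduce a center-mask setting, the sketch below builds a square mask over the middle of a patch grid. The grid size (14x14) and the masked square (8x8) are assumptions chosen for illustration; this page does not state the values used for the results above, and `center_mask` is a hypothetical helper name.

```python
import numpy as np

def center_mask(grid, size):
    """Boolean (grid, grid) mask that is True on a centered size x size square."""
    mask = np.zeros((grid, grid), dtype=bool)
    start = (grid - size) // 2
    mask[start:start + size, start:start + size] = True
    return mask

# Assumed values: 14x14 patch grid, 8x8 masked center region.
mask = center_mask(grid=14, size=8)
masked_fraction = mask.mean()  # share of patches the model must inpaint
```

Flattening `mask` gives a per-patch vector that can stand in for a random mask wherever one is expected.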

Video Inpainting

The videos are from the Kinetics-400 validation set.

Panels: ground-truth, inputs, DiffMAE.

Fine-Tuning Generative Models

Besides generatively inpainting images, DiffMAE is also a strong self-supervised pre-training approach. Its fine-tuning performance is:

  • Comparable to leading self-supervised algorithms that focus solely on recognition;
  • Stronger than other generative algorithms by a large margin.

pre-train   architecture   params. (M)   fine-tuned acc. (%)
MAE         ViT-L          304           85.9
iGPT        iGPT-L         1362          72.6
ADM         U-Net          211           83.3
DDPM        ViT-L          304           83.4
DiffMAE     ViT-L          304           85.8

Table: Fine-tuning generative models on ImageNet-1K, a system-level comparison.


      author    = {Wei, Chen and Mangalam, Karttikeya and Huang, Po-Yao and Li, Yanghao and Fan, Haoqi and Xu, Hu and Wang, Huiyu and Xie, Cihang and Yuille, Alan and Feichtenhofer, Christoph},
      title     = {Diffusion Models as Masked Autoencoders},
      booktitle = {ICCV},
      year      = {2023},