It has been a longstanding belief that generation can facilitate a true understanding of visual data.
In line with this, we revisit generatively pre-training visual representations in light of denoising diffusion models, and build connections between diffusion models and masked autoencoders. In particular, we condition diffusion models on masked input, formulating them as masked autoencoders (DiffMAE). Our approach can (a) serve as a strong initialization for downstream recognition tasks, (b) conduct high-quality image inpainting, and (c) be readily extended to video, as shown below.
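The core recipe: split an image into patches, feed only a visible subset to the encoder, noise the masked patches according to the diffusion forward process, and train the decoder to recover them conditioned on the encoder output. The snippet below is a minimal, hypothetical PyTorch sketch of one such training step, not the released implementation; `encoder`, `decoder`, and `patchify` are assumed interfaces, and the cosine noise schedule and clean-patch (x0) prediction target are illustrative choices.

```python
import torch
import torch.nn.functional as F

def diffmae_loss(images, encoder, decoder, patchify, mask_ratio=0.75, T=1000):
    # Tokenize the image into patch embeddings: (B, N, D).
    patches = patchify(images)
    B, N, D = patches.shape
    num_mask = int(N * mask_ratio)

    # Random split into visible and masked patch sets (as in MAE).
    ids = torch.rand(B, N, device=patches.device).argsort(dim=1)
    ids_mask, ids_keep = ids[:, :num_mask], ids[:, num_mask:]
    gather = lambda x, idx: torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    visible, masked = gather(patches, ids_keep), gather(patches, ids_mask)

    # Diffusion forward process: add noise only to the masked patches.
    t = torch.randint(0, T, (B,), device=patches.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2  # assumed cosine schedule
    noise = torch.randn_like(masked)
    noised = (alpha_bar.sqrt().view(B, 1, 1) * masked
              + (1.0 - alpha_bar).sqrt().view(B, 1, 1) * noise)

    # Encoder sees only the visible patches; the decoder denoises the masked ones,
    # conditioned on the encoder output and the diffusion timestep.
    latent = encoder(visible, ids_keep)          # assumed signature
    pred = decoder(noised, latent, ids_mask, t)  # assumed signature

    # Regress the clean masked patches (x0 prediction), an assumed target choice.
    return F.mse_loss(pred, masked)
```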
Figure: Image inpainting on the ImageNet-1K validation set, showing the masked input, DiffMAE generations at different inference steps, and the ground truth.
Figure: Video inpainting on the Kinetics-400 validation set, showing the ground truth, the masked inputs, and DiffMAE reconstructions.
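The inpainting results above are produced by iteratively denoising the masked patches over a number of inference steps. Below is a rough, hypothetical sketch of such a sampling loop, assuming the `decoder` interface and cosine schedule from the training sketch and a deterministic DDIM-style update; it is not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def inpaint_masked(decoder, latent, ids_mask, shape, T=100):
    """Start from Gaussian noise and denoise the masked patches over T steps."""
    x = torch.randn(shape)  # (B, num_mask, D)
    B = shape[0]
    for step in reversed(range(T)):
        t = torch.full((B,), step, dtype=torch.long)
        # Assumed cosine schedule, shifted by one step so the final update
        # lands exactly on the predicted clean patches.
        alpha_bar = torch.cos(0.5 * torch.pi * (t.float() + 1) / T) ** 2
        alpha_bar_prev = torch.cos(0.5 * torch.pi * t.float() / T) ** 2
        x0_pred = decoder(x, latent, ids_mask, t)  # predicted clean patches
        # Recover the implied noise, then take a deterministic DDIM-style step.
        eps = (x - alpha_bar.sqrt().view(B, 1, 1) * x0_pred) / (1.0 - alpha_bar).sqrt().view(B, 1, 1)
        x = (alpha_bar_prev.sqrt().view(B, 1, 1) * x0_pred
             + (1.0 - alpha_bar_prev).sqrt().view(B, 1, 1) * eps)
    return x  # denoised masked patches, to be placed back into the image
```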
Beyond generative image inpainting, DiffMAE is a strong self-supervised pre-training approach. Fine-tuning results on ImageNet-1K classification (a minimal fine-tuning sketch follows the table):
| pre-train | architecture | params (M) | fine-tuned top-1 (%) |
|---|---|---|---|
| MAE | ViT-L | 304 | 85.9 |
| iGPT | iGPT-L | 1362 | 72.6 |
| ADM | U-Net | 211 | 83.3 |
| DDPM | ViT-L | 304 | 83.4 |
| DiffMAE | ViT-L | 304 | 85.8 |
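For recognition, the diffusion decoder is discarded after pre-training and the encoder is fine-tuned with a classification head. The snippet below is an assumed fine-tuning sketch using a stock timm ViT-L as the backbone; the checkpoint path, key names, and hyperparameters are placeholders, not the paper's recipe.

```python
import timm
import torch
import torch.nn as nn

# ViT-L/16 backbone as a stand-in for the pre-trained DiffMAE encoder.
model = timm.create_model("vit_large_patch16_224", num_classes=1000)

# Hypothetical checkpoint path and key layout; only encoder weights are loaded.
ckpt = torch.load("diffmae_vitl_pretrain.pth", map_location="cpu")
model.load_state_dict(ckpt.get("encoder", ckpt), strict=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_step(images, labels):
    """One supervised fine-tuning step on an ImageNet-1K batch."""
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```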
@inproceedings{wei2023diffusion,
  author    = {Wei, Chen and Mangalam, Karttikeya and Huang, Po-Yao and Li, Yanghao and Fan, Haoqi and Xu, Hu and Wang, Huiyu and Xie, Cihang and Yuille, Alan and Feichtenhofer, Christoph},
  title     = {Diffusion Models as Masked Autoencoders},
  booktitle = {ICCV},
  year      = {2023},
}