It has been a longstanding belief that generation can facilitate a true understanding of visual data.
In line with this, we revisit generatively pre-training visual representations in light of denoising diffusion models, and build connections between diffusion models and masked autoencoders. In particular, we condition diffusion models on masked input, formulating them as masked autoencoders (DiffMAE). Our approach can (a) serve as a strong initialization for downstream recognition tasks, (b) conduct high-quality image inpainting, and (c) be readily extended to video, as shown below.
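The core recipe: split an image into patches, feed only a visible subset to the encoder, noise the masked patches according to the diffusion forward process, and train the decoder to recover them conditioned on the encoder output. The snippet below is a minimal, hypothetical PyTorch sketch of one such training step, not the released implementation; `encoder`, `decoder`, and `patchify` are assumed interfaces, and the cosine noise schedule and clean-patch (x0) prediction target are illustrative choices.

```python
import torch
import torch.nn.functional as F

def diffmae_loss(images, encoder, decoder, patchify, mask_ratio=0.75, T=1000):
    # Tokenize the image into patch embeddings: (B, N, D).
    patches = patchify(images)
    B, N, D = patches.shape
    num_mask = int(N * mask_ratio)

    # Random split into visible and masked patch sets (as in MAE).
    ids = torch.rand(B, N, device=patches.device).argsort(dim=1)
    ids_mask, ids_keep = ids[:, :num_mask], ids[:, num_mask:]
    gather = lambda x, idx: torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    visible, masked = gather(patches, ids_keep), gather(patches, ids_mask)

    # Diffusion forward process: add noise only to the masked patches.
    t = torch.randint(0, T, (B,), device=patches.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2  # assumed cosine schedule
    noise = torch.randn_like(masked)
    noised = (alpha_bar.sqrt().view(B, 1, 1) * masked
              + (1.0 - alpha_bar).sqrt().view(B, 1, 1) * noise)

    # Encoder sees only the visible patches; the decoder denoises the masked ones,
    # conditioned on the encoder output and the diffusion timestep.
    latent = encoder(visible, ids_keep)          # assumed signature
    pred = decoder(noised, latent, ids_mask, t)  # assumed signature

    # Regress the clean masked patches (x0 prediction), an assumed target choice.
    return F.mse_loss(pred, masked)
```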
Figure: Image inpainting on the ImageNet-1K validation set, showing the masked input, DiffMAE generations at different inference steps, and the ground truth.
Figure: Video inpainting on the Kinetics-400 validation set, showing the ground truth, the masked inputs, and DiffMAE reconstructions.
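The inpainting results above are produced by iteratively denoising the masked patches over a number of inference steps. Below is a rough, hypothetical sketch of such a sampling loop, assuming the `decoder` interface and cosine schedule from the training sketch and a deterministic DDIM-style update; it is not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def inpaint_masked(decoder, latent, ids_mask, shape, T=100):
    """Start from Gaussian noise and denoise the masked patches over T steps."""
    x = torch.randn(shape)  # (B, num_mask, D)
    B = shape[0]
    for step in reversed(range(T)):
        t = torch.full((B,), step, dtype=torch.long)
        # Assumed cosine schedule, shifted by one step so the final update
        # lands exactly on the predicted clean patches.
        alpha_bar = torch.cos(0.5 * torch.pi * (t.float() + 1) / T) ** 2
        alpha_bar_prev = torch.cos(0.5 * torch.pi * t.float() / T) ** 2
        x0_pred = decoder(x, latent, ids_mask, t)  # predicted clean patches
        # Recover the implied noise, then take a deterministic DDIM-style step.
        eps = (x - alpha_bar.sqrt().view(B, 1, 1) * x0_pred) / (1.0 - alpha_bar).sqrt().view(B, 1, 1)
        x = (alpha_bar_prev.sqrt().view(B, 1, 1) * x0_pred
             + (1.0 - alpha_bar_prev).sqrt().view(B, 1, 1) * eps)
    return x  # denoised masked patches, to be placed back into the image
```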
Beyond generative image inpainting, DiffMAE is a strong self-supervised pre-training approach. Fine-tuning results on ImageNet-1K classification (a minimal fine-tuning sketch follows the table):
| pre-train | architecture | params (M) | fine-tuned top-1 (%) |
|---|---|---|---|
| MAE | ViT-L | 304 | 85.9 |
| iGPT | iGPT-L | 1362 | 72.6 |
| ADM | U-Net | 211 | 83.3 |
| DDPM | ViT-L | 304 | 83.4 |
| DiffMAE | ViT-L | 304 | 85.8 |
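For recognition, the diffusion decoder is discarded after pre-training and the encoder is fine-tuned with a classification head. The snippet below is an assumed fine-tuning sketch using a stock timm ViT-L as the backbone; the checkpoint path, key names, and hyperparameters are placeholders, not the paper's recipe.

```python
import timm
import torch
import torch.nn as nn

# ViT-L/16 backbone as a stand-in for the pre-trained DiffMAE encoder.
model = timm.create_model("vit_large_patch16_224", num_classes=1000)

# Hypothetical checkpoint path and key layout; only encoder weights are loaded.
ckpt = torch.load("diffmae_vitl_pretrain.pth", map_location="cpu")
model.load_state_dict(ckpt.get("encoder", ckpt), strict=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_step(images, labels):
    """One supervised fine-tuning step on an ImageNet-1K batch."""
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```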
@inproceedings{wei2023diffusion,
  author    = {Wei, Chen and Mangalam, Karttikeya and Huang, Po-Yao and Li, Yanghao and Fan, Haoqi and Xu, Hu and Wang, Huiyu and Xie, Cihang and Yuille, Alan and Feichtenhofer, Christoph},
  title     = {Diffusion Models as Masked Autoencoders},
  booktitle = {ICCV},
  year      = {2023},
}