Generative Image Inpainting with Contextual Attention

  • 1 University of Illinois at Urbana-Champaign
  • 2 Adobe Research

Note: the second version, DeepFillv2, can be found here.

Update (Jun 2018):
1. The tech report of our new image inpainting system, DeepFillv2, is released. ArXiv | Project
2. We also released a recorded demo video (YouTube) based on DeepFillv1 (CVPR 2018), as well as a video (YouTube) of DeepFillv2. Best viewed at the highest resolution (1080p).
3. DeepFillv1 is trained on, and mainly works with, rectangular masks, while DeepFillv2 can complete images with free-form masks, optionally with user guidance.
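The two mask styles mentioned above can be sketched in a few lines of numpy. This is a hedged illustration only: the function names and sampling parameters below are mine, not the repo's actual mask-generation code.

```python
import numpy as np

def random_rect_mask(h, w, max_frac=0.5, rng=None):
    """Rectangular hole mask in the style DeepFillv1 is trained on.
    Returns a binary (h, w) array: 1 = missing, 0 = known."""
    rng = rng or np.random.default_rng()
    mh = rng.integers(1, int(h * max_frac))      # hole height
    mw = rng.integers(1, int(w * max_frac))      # hole width
    top = rng.integers(0, h - mh)
    left = rng.integers(0, w - mw)
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[top:top + mh, left:left + mw] = 1
    return mask

def random_stroke_mask(h, w, strokes=4, brush=5, steps=20, rng=None):
    """Free-form brush-stroke mask of the kind DeepFillv2 can handle."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.uint8)
    for _ in range(strokes):
        # Random walk of a square brush from a random start point.
        y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
        for _ in range(steps):
            y = int(np.clip(y + rng.integers(-3, 4), 0, h - 1))
            x = int(np.clip(x + rng.integers(-3, 4), 0, w - 1))
            y0, y1 = max(0, y - brush), min(h, y + brush)
            x0, x1 = max(0, x - brush), min(w, x + brush)
            mask[y0:y1, x0:x1] = 1
    return mask
```

Either mask can then be applied to an image (setting the masked region to a placeholder value) before feeding it to the corresponding model.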

DeepFill (v1) Demo:

Some notes:
1. Results are direct outputs of the trained generative neural networks; no post-processing is applied.
2. The model is trained on CelebA-HQ (with 2k images randomly sampled as a validation set for the demo).
3. The demo is for research purposes only.

Some examples are shown here. Tag #deepfill.

Demo examples: "Now I am watching you", smile, remove watermark, edit bangs, swap eyes, remove mustache.

YouTube Video Demo

Best viewed at the highest resolution (1080p).

Example inpainting results of our method on natural scenes (Places2), faces (CelebA), and objects (ImageNet). Missing regions are shown in white. In each pair, the left is the input image and the right is the direct output of our trained generative neural network, without any post-processing.


Recent deep-learning-based approaches have shown promising results for the challenging task of inpainting large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with the surrounding areas. This is mainly due to the ineffectiveness of convolutional neural networks at explicitly borrowing or copying information from distant spatial locations. Traditional texture- and patch-synthesis approaches, on the other hand, are particularly suitable when textures need to be borrowed from surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach that can not only synthesize novel image structures but also explicitly use surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network that can process images with multiple holes at arbitrary locations and of variable sizes at test time. Experiments on multiple datasets, including faces, textures, and natural images, demonstrate that the proposed approach generates higher-quality inpainting results than existing methods. Code and trained models will be released.

Contextual Attention

The contextual attention layer learns where to borrow or copy feature information from known background patches (orange pixels) to generate missing patches (blue pixels). First, convolution is used to compute matching scores between foreground patches and background patches (used as convolutional filters). Softmax is then applied to compare scores and obtain an attention score for each pixel. Finally, foreground patches are reconstructed from background patches by applying deconvolution to the attention score map. The contextual attention layer is differentiable and fully convolutional.

Model Architecture

Results (input, baseline model output, full model output and attention map) on Places2:

Visual attention interpretation examples. The visualization (highlighted regions) shows which parts of the input image are most attended. Each triad, from left to right, shows the input image, the result, and the attention visualization.


More results (input, output and attention map) on CelebA:

More results (input, output and attention map) on ImageNet:


Citation

@article{yu2018generative,
  title={Generative Image Inpainting with Contextual Attention},
  author={Yu, Jiahui and Lin, Zhe and Yang, Jimei and Shen, Xiaohui and Lu, Xin and Huang, Thomas S},
  journal={arXiv preprint arXiv:1801.07892},
  year={2018}
}

@article{yu2018free,
  title={Free-Form Image Inpainting with Gated Convolution},
  author={Yu, Jiahui and Lin, Zhe and Yang, Jimei and Shen, Xiaohui and Lu, Xin and Huang, Thomas S},
  journal={arXiv preprint arXiv:1806.03589},
  year={2018}
}