Understanding Pix2Pix GAN


The name itself says “pixel to pixel”, meaning the model takes the pixels of one image and converts them into the pixels of another image.

The goal of this model is to convert one image into another; in other words, to learn the mapping from an input image to an output image.

But why, and what applications can we think of?

Well, there are tons of applications we can think of:

[Figure: examples of Pix2Pix GAN image-to-image translations]

The Pix2Pix GAN has been demonstrated on a range of image-to-image translation tasks such as converting maps to satellite photographs, black and white photographs to color, and sketches of products to product photographs.

And the reason we use GANs for this is their ability to synthesize realistic photos when translating from one space to another.

Pix2Pix is a Generative Adversarial Network, or GAN, model designed for general purpose image-to-image translation.

The approach was introduced by Phillip Isola et al. in their 2016 paper titled “Image-to-Image Translation with Conditional Adversarial Networks” and presented at CVPR in 2017.

Introduction to GANs

The GAN architecture is comprised of two models:

1. Generator model for outputting new plausible synthetic images, and a

2. Discriminator model that classifies images as real (from the dataset) or fake (generated).

The discriminator model is updated directly, whereas the generator model is updated via the discriminator model. As such, the two models are trained simultaneously in an adversarial process where the generator seeks to better fool the discriminator and the discriminator seeks to better identify the counterfeit images.
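To make the two roles concrete, here is a minimal PyTorch-style sketch of one adversarial training step for a plain (unconditional) GAN. The generator G, discriminator D, their optimizers, and the batch of real images are assumed to be defined elsewhere, and the names and sizes are purely illustrative; this is not the paper's code.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of one adversarial training step (not the paper's code).
# G, D, opt_G, opt_D and real_images are assumed to be defined elsewhere;
# D is assumed to end with a sigmoid, so it outputs a "probability of real".
def gan_train_step(G, D, real_images, opt_G, opt_D, z_dim=100):
    n = real_images.size(0)
    real_labels = torch.ones(n, 1)
    fake_labels = torch.zeros(n, 1)

    # 1) Update the discriminator directly: push real images towards 1, fakes towards 0.
    fake_images = G(torch.randn(n, z_dim)).detach()   # detach: no gradient flows into G here
    d_loss = (F.binary_cross_entropy(D(real_images), real_labels)
              + F.binary_cross_entropy(D(fake_images), fake_labels))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Update the generator *through* the discriminator: try to make D say "real".
    g_loss = F.binary_cross_entropy(D(G(torch.randn(n, z_dim))), real_labels)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

    return d_loss.item(), g_loss.item()
```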

How does the Pix2Pix GAN work?

Heard about GANs (Generative Adversarial Networks) that generate realistic synthetic images? Pix2Pix belongs to one such variant called the conditional GAN, or cGAN. A conditional GAN is given a conditioning input (here, an image) and learns the image-to-image mapping under that condition, whereas a basic GAN generates images from a random noise vector with no condition applied. Confused? The steps below spell it out.

Steps Involved:

1. Training data pairs (x and y, where x is the input image and y is the output image); see the loading sketch after this list.

2. Pix2Pix uses the conditional GAN (cGAN) formulation → G: {x, z} → y (z → noise vector, x → input image, y → output image).

3. A Generator network (encoder-decoder architecture): the input is an image, so we want to encode it into a deep representation and then decode that back into the output image; and a Discriminator network (PatchGAN).

4. The cGAN loss function combined with an L1 or L2 distance.
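For step 1, the training data is simply pairs of aligned images. A common convention in the public pix2pix datasets (e.g., facades or maps) is to store each pair side by side in one combined file; the small sketch below, with a hypothetical load_pair helper, shows how such a file could be split back into the input x and target y. Which half is the input depends on the dataset and the translation direction.

```python
from PIL import Image
import numpy as np

# Hypothetical helper for step 1: building (x, y) training pairs.
# Assumes the common convention where the input and target images are
# stored side by side in a single combined file.
def load_pair(path):
    combined = np.array(Image.open(path).convert("RGB"))
    h, w, _ = combined.shape
    x = combined[:, : w // 2]    # input image, e.g. a sketch or a map tile
    y = combined[:, w // 2 :]    # target image, e.g. the corresponding photo
    return x, y
```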

Let’s talk about the network architectures.

U-Net Generator

As we know, the generator is an encoder-decoder network: first a series of downsampling layers, then a bottleneck layer, then a series of upsampling layers.

The authors used the “U-Net” architecture with skip connections as this encoder-decoder network. The U-Net skip connections are also interesting because they do not require any resizing or projections, since the spatial resolutions of the layers being connected already match each other.
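As a rough sketch of that idea, here is a deliberately shallow PyTorch U-Net-style generator: a few downsampling layers, a bottleneck, and upsampling layers whose inputs are concatenated with the matching encoder features (the skip connections). The real Pix2Pix generator is much deeper (eight downsampling steps for 256×256 images); the class name and channel counts here are only illustrative.

```python
import torch
import torch.nn as nn

class TinyUNetGenerator(nn.Module):
    """Illustrative U-Net-style generator: downsample, bottleneck, upsample,
    with skip connections that concatenate matching encoder and decoder features.
    Much shallower than the 8-level U-Net used in the paper; sizes are examples."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2))
        self.bottleneck = nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(128 + 128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU())
        self.final = nn.Sequential(nn.ConvTranspose2d(64 + 64, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        d1 = self.down1(x)                         # 64 channels,  H/2 x W/2
        d2 = self.down2(d1)                        # 128 channels, H/4 x W/4
        b  = self.bottleneck(d2)                   # 256 channels, H/8 x W/8
        u1 = self.up1(b)                           # back to H/4 x W/4
        u2 = self.up2(torch.cat([u1, d2], dim=1))  # skip connection: same resolution, just concatenate
        return self.final(torch.cat([u2, d1], dim=1))  # second skip connection, then Tanh output
```

For a 256×256 RGB input, TinyUNetGenerator()(torch.randn(1, 3, 256, 256)) returns a 1×3×256×256 tensor in [-1, 1], which is why the concatenations need no resizing.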

PatchGAN Discriminator

The discriminator uses the PatchGAN architecture. Instead of predicting whether the whole image is real or fake, the discriminator looks at N×N patches of the image and predicts, for each patch, whether it is real or fake; the per-patch predictions are then averaged. The authors reason that this enforces more constraints that encourage sharp high-frequency detail. Additionally, the PatchGAN has fewer parameters and runs faster than a discriminator that classifies the entire image at once.
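Here is a minimal PyTorch sketch of that idea, conditioned on the input image as in the paper (the discriminator sees the input and the candidate output together). The class name, layer count and receptive field are smaller than the 70×70 PatchGAN used by the authors and are only illustrative.

```python
import torch
import torch.nn as nn

class TinyPatchDiscriminator(nn.Module):
    """Illustrative PatchGAN discriminator for the conditional setting:
    it takes the input image and a (real or generated) output image concatenated
    channel-wise, and returns a grid of real/fake scores, one per image patch,
    rather than a single scalar for the whole image."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + out_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, 1, 1),   # 1-channel map of per-patch logits
        )

    def forward(self, x, y):
        # Each spatial location of the output judges one patch of the (x, y) pair.
        return self.net(torch.cat([x, y], dim=1))
```

Every value in the output map is the real/fake score for one patch of the (input, output) pair; those scores are averaged inside the loss.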

Generator Loss

The paper also includes an L1 loss, which is the MAE (mean absolute error) between the generated image and the target image. This term pushes the generated image to be structurally similar to the target image, while the adversarial loss takes care of realism and sharp detail (L1 on its own tends to produce blurry outputs).

Combining the two terms, the loss function of the generator network results in:

G* = arg min_G max_D L_cGAN(G, D) + λ · L_L1(G), where L_L1(G) = E[ ||y − G(x, z)||_1 ]

In the experiments, the authors report that they found the most success with the lambda parameter equal to 100.
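Putting those two terms together, a minimal PyTorch-style sketch of the generator's loss might look like the following. It assumes a conditional PatchGAN discriminator D that returns raw patch logits (as in the sketch above) and that the generator has already produced y_fake from the input x; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

LAMBDA = 100  # weight on the L1 term, the value the authors report working best

def generator_loss(D, x, y_real, y_fake):
    """Sketch of the Pix2Pix generator objective:
    adversarial (cGAN) term + LAMBDA * L1 distance to the target image.
    D is assumed to return raw patch logits for the pair (input, output)."""
    patch_logits = D(x, y_fake)
    adv = F.binary_cross_entropy_with_logits(
        patch_logits, torch.ones_like(patch_logits))  # try to make D call every patch "real"
    l1 = F.l1_loss(y_fake, y_real)                     # MAE between generated and target image
    return adv + LAMBDA * l1
```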

Conclusions

Pix2Pix is a very interesting strategy for image-to-image translation, combining an L1 distance with an adversarial loss, together with additional novelties in the design of the generator and the discriminator.

Thanks for reading!

“Want me to write a tech blog? Contact me here.”
