In this report we explore the possibility of using cGAN (Conditional Generative Adversarial Networks) for performing automatic graphic operations on the photographs or videos of human faces, similar to those typically done manually using a software tool such as Photoshop or After Effects, by learning from examples.
A good part of my research in Machine Learning has to do with images, videos, and 3D objects (e.g., Monocular Depth Perception, Generate photo from sketch, Generate Photo-realistic Avatars, How to build a Holodeck series), so I find myself constantly in need of an artist for the tedious task of create a large number of suitable image/video training datasets for such researches. Given that cGAN is a versatile tool for learning some sort of image-to-image mapping, then the natural question is whether cGAN is useful for automating some of these tasks.
And more importantly, we seek to find out whether we can use cGAN for performing an atypical type of supervised learning that does not involve category labels assigned to the training samples, nor the goal is about learning categorization. Instead, in our experiments the image pairs used for training embody the non-textual intention of the teacher, and the goal is for the system to learn the image mapping operations that achieve the intended goal, so that such operations can be applied successfully to unseen data samples.
For this report we will focus on dealing with human facial images. The image operations investigated include erasing background, image level adjustment, patching small flaws, as well as translation, scaling, and alignment. Experiments regarding removing outlier video frames will be reported in a separated post.
This report is part of a series of studies on the possibility of using GAN/cGAN as the latent representation for representing human faces, which is why the datasets used here are mostly images of human faces.
It should be noted that we have used relatively small datasets in the experiments here, mainly because this is only an exploratory study. For a more complete study, some of the more promising ones should be followed up with larger-scale experiments.
The setup for the experiments is as follows:
- Hardware: (unless noted otherwise) Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
- Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind.
- Torch 7, Python 2.7, Cuda 8
- cGAN implementation: pix2pix, a Torch implementation for cGAN based on the paper Image-to-Image Translation with Conditional Generative Adversarial Networks by Isola, et al.
- Training parameters. Unless noted otherwise, all training sessions use the following parameters: batch size: 1, L1 regularization, beta1: 0.5, learning rate: 0.0002, images are flipped horizontally to augment the training. All training images are scaled to 286 pixel width, then cropped to 256 pixel width, with a small random jitter applied during the process.
Experiment #1: erasing image background
The purpose of this experiment to see whether cGAN (using the pix2pix implementation) can be used to learn to erase the background of photos, in particular those with human facial images. Figure 1 shows a typical training sample.
Datasets: Photos from previous experiments are recycled for use here, augmented with additional facial images scrapped from the Internet. Such photos are then paired with target images that have the photo background manually erased. Since the manual preparation of the target images is labor intensive, we started with a very small training dataset of only 118 sample image pairs.
Training: The training session took 10 hours of computing using the setup described above.
Evaluating test result: A test dataset of 75 samples are used. The output images generated by the trained cGAN are subjectively ranked into the following five categories (see Figure 2.a-2.e for samples):
- 5-stars: almost perfect. Total: 18 samples.
- 4-stars: pretty close, but with some minor problems. Total: 17 samples.
- 3-stars: so-so result. Total: 24 samples.
- 2-stars: not good, but still within bounds. Total: 13 samples.
- 1-star: complete disaster. Total: 3 samples.
The test result above may seem unimpressive, but the following facts should be taken into perspective:
- The training dataset size is actually quite tiny, considering that typical cGAN training requires a very large training dataset in order for it to properly learn the probability distribution in the dataset. It is in fact quite impressive that such result can be achieved with so little training, which leads one to conclude that the direction is quite promising.
- The training dataset is also not very diverse (aside from being small). For example, it does not contain photos that are either black-and-white, or much off-center, or close-up with no background, as such it naturally do not test well against those types of photos (which was in fact intentionally included in the test dataset). Adding more training photos of those types has shown to almost always improve the result.
Overall we judge that cGAN could work well for background erasure, provided that a large and sufficiently diverse training dataset (relative to the expected test samples) is used.
Experiment #2: image alignment
In this experiment we try to get cGAN to learn how to align images from examples, which involves translation, scaling, and cropping.
We actually do not expect this to work, but ran the experiment anyways just so that we can set it up as a goal post for others to explore it further.
Datasets: Photos from previous experiments are re-used here (which are manually cropped), augmented with the original un-cropped version. See Figure 3a for a sample training image pair.
Since the manual preparation of the target images is labor intensive, we started with a very small training dataset of only 18 sample image pairs to probe the possibilities.
Training: The training session took 4 hours, which shows that cGAN was able to learn to map the image pairs in the training to near perfection. However, testing result is an entirely different matter.
All test results look like Figure 3b, where the output image (at right) looks like a jumble of many faces, roughly at the right position, but totally unrecognizable. In other words, this experiment has failed miserably as expected.
So why wouldn't cGAN work for this task? To human eyes the translation and scaling of an image are fairly simple operations, but it is not the case for cGAN. The successive convolutional layers in cGAN is very good at capturing local dependency among nearby pixels, but global operations such as scaling or translation, which affects all pixels equally, is not what cGAN is designed for. The cGAN design is still in its infancy, and as it is right now it does not handle translation, scaling, and rotation well.
So what would it take to make this work? One approach is to incorporate something like the Spatial Transformer Networks to see whether it makes any difference, which we shall explore in a future post.
Experiment #3: video processing
In this experiment we want to find out if cGAN can be used for some simple video processing, including background erasure, image tone level adjustments, and making small repairs.
In other words, we seek to find out whether cGAN can be used to learn the intended image operations from a small number of training samples, and then apply the operations to an entire video to achieve satisfactory result.
In this experiment we treat a video pretty much as a collection of images, without taking advantage of its sequential nature (which we shall explore in another report). While it is similar to Experiment #1 for background erasure, there are some differences:
- We also want to try incorporating other image operations at the same time, such as image tone level adjustment, as well as the patching of minor flaws.
- The video sequence is essentially images of the same person, which allows us to explore more efficient training methods. Here we apply the technique of drill training to get good result from very few training samples. Drill training refers to the technique of using small number of images for intense training, with the goal of getting good results for this particular training set, but possibly increases test errors against a wider test dataset.
Experimental Setup The setup is the same as Experiments #1 and #2, except the following:
- Hardware: a standard laptop (Intel i7 CPU with 8GB RAM) is used instead due to resource constraints. This hardware runs the experiment about 10-15 times slower than using an AWS/EC2 g2.2xlarge GPU instance.
- Model: a pix2pix cGAN model trained in Experiment #1 is used as the initial model.
- Dataset: video segment of a celebrity interview is used for the test. The video is sampled at 10 fps and cropped to 400x400 pixels at the image center. 1185 frames are selected for this test, most of which (1128 frames) have the same person as the main subject in the image. No manual alignment or color adjustment are applied. Out of these 1185 frames 22 are selected and manually modified for use as the training dataset, with the rest used as the test dataset. The manual modifications done on the training dataset are as follows:
- Background are erased to show pure white.
- Image are adjusted using the Photoshop Levels tool for better brightness and contrast.
- Minor intrusion of other people in the image are erased and patched up as appropriate (see Figure 4a).
- For trainings we apply the technique of drill training to reduce training time and number of samples required. Overall the training took several days using the non-GPU setup.
Figure 4b shows a 10-second segment of the test result. Our observations of the result are:
- The background erasure has worked remarkably well, even with only 20 training samples. The outline of the clothing appears a bit wavy, which is due to the difficulty in guessing the outline of the dark clothing over dark background, and the current method provides no continuity between frames.
- The levels adjustment applied in the training samples, which brightens up the images, are successfully transferred to the test result and makes the resulting video brighter.
- cGAN can be seen patching up the intruding fingers problem with some of the test samples (see Figure 4c), where only two such patching examples (see Figure 4a) were provided in the training dataset. The result is by no means satisfactory, but it points to the possibility of getting much better result if more training is applied.
This is a preliminary study using very small datasets to demonstrate the possibilities. Further comprehensive experimentation is definitely needed.
The experiments conducted in this report are not meant just as fun applications of the cGAN method. The experiments above show that as an atypical type of supervised learning, cGAN can be used to perform certain types of image operations for achieving practical purposes.
Overall in our limited experiments we have shown that operations such as background erasure and image levels adjustment worked well. For such image operations just training 2% of the frames in a video is sufficient to transfer the image operations to the entire video with good result. The operation of patching up minor flaws has worked to some limited degree.
The operations of scaling and alignment did not work at all, which was expected. This actually shows the limitations of the current cGAN architecture. We may conduct a more detailed study on this in a separate post later.
It is worth noting that the background erasure operation may seem to bear some surface resemblance to semantic segmentation (e.g., as described in this paper), in the sense that both can be used to separate certain recognizable targets out of an image. They are in fact very different, because cGAN is generative, and the method here does not require any training on category labels.
Following are some planned follow-up studies:
- As extensions to Experiment #3, explore how to take advantage of the sequential nature of a video, where adjacent frames are similar, in order achieve better test quality or faster training time.
- Use cGAN to synthesize missing frames in a video, or for creating smooth slow-motion replay.
- Detect outlier video frames that are substantially different from the training dataset. This can be used for carrying out semi-automatic cleanup of a video in order to remove wanted frames, which is quite useful for my own research since processing video for machine learning is a very tedious process.
The idea here is that it is generally believed that GAN is able to learn a meaningful latent representation, and this implies the possibility that the unwanted data samples can be easily detected as some sort of outliers far apart from the training dataset in this latent representation (see Figure 5). It should be interesting to find out if it fact works well in the context of the automatic removal of unwanted video frames.
- Use cGAN for image/video indexing and retrieval. The idea here is related to the last point regarding the latent representation learned by cGAN, since a good latent representation should make it easier to do indexing and retrieval.
Last but not least, I want to show my gratitude to Fonchin Chen for helping with the unending process of collecting and processing the images needed for the project.
- Isola et al,Image-to-Image Translation with Conditional Generative Adversarial Networks, 2016.
- pix2pix, a Torch implementation for cGAN