Evolution of Style Transfer techniques

Anmol
7 min read · Dec 19, 2020
Our tech inspiration Tony Stark fused with Van Gogh’s Starry Night

What is style transfer?

Neural Style Transfer refers to a class of software algorithms that manipulate digital images or videos so that they adopt the appearance or visual style of another image. Neural Style Transfer algorithms are characterized by their use of deep neural networks for the image transformation. Simply put, the output image retains the core elements of the content image but appears to be painted in the style of the style reference image.

The Deep Learning magic

Our starting point was the original algorithm for neural style transfer by Gatys et al. (2015), which optimizes an output image to match the content of one image and the style of another by minimizing a pair of carefully designed losses.

In summary, the general idea is as follows (a minimal code sketch of this setup is given right after the list):

  • Take an image for style (style_img)
  • Take an image for content (content_img)
  • Take a pre-trained neural net
  • Create an “output” image where each pixel is a parameter to optimize
  • Optimize the “output” image so that the content loss and style loss are minimized
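
To make these steps concrete, here is a minimal PyTorch sketch of the setup: a frozen pre-trained VGG-19 from torchvision, the two input images, and an output image whose pixels are the parameters we optimize. The file names, image size, and optimizer settings are illustrative assumptions rather than our exact configuration, and the loss functions are sketched in the Losses section further down.

```python
# Minimal setup sketch (assumptions: file names, 512x512 resize, Adam with lr=0.01).
import torch
import torch.optim as optim
from PIL import Image
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pre-trained VGG-19 used only as a frozen feature extractor.
vgg = models.vgg19(pretrained=True).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def load_image(path):
    return preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)

content_img = load_image("content.jpg")  # hypothetical file names
style_img = load_image("style.jpg")

# The "output" image: each pixel is a parameter to optimize.
target = content_img.clone().requires_grad_(True)
optimizer = optim.Adam([target], lr=0.01)
```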

Networks

Convolutional neural network

In deep learning, a convolutional neural network (CNN) is a class of deep neural networks most commonly applied to analyzing visual imagery. They are also known as shift-invariant or space-invariant artificial neural networks, based on their shared-weights architecture and translation-invariance characteristics. CNNs work well for image feature representation, and content and style features can be separated within the same network: lower-level layers are more concerned with pixel values and image textures, while higher-level layer features are more abstract. We therefore use the VGG network for feature representation, taking lower-level layer features as the style representation and higher-level layer features as the content representation.
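
As an illustration, here is one common way to pull style features from the lower layers and content features from a higher layer of a pre-trained VGG-19. The layer indices below are a popular choice for VGG-19, used here as an assumption rather than our exact configuration.

```python
# Extract style/content features from different depths of VGG-19.
import torch
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
vgg = models.vgg19(pretrained=True).features.to(device).eval()

# Lower-level layers for style, one higher-level layer for content
# (indices follow torchvision's VGG-19 "features" module).
STYLE_LAYERS = {"0": "conv1_1", "5": "conv2_1", "10": "conv3_1",
                "19": "conv4_1", "28": "conv5_1"}
CONTENT_LAYER = {"21": "conv4_2"}

def extract_features(image, layers):
    """Run the image through VGG and keep the activations of selected layers."""
    features, x = {}, image
    for name, module in vgg._modules.items():
        x = module(x)
        if name in layers:
            features[layers[name]] = x
    return features

# Example with a dummy preprocessed 1x3xHxW tensor.
dummy = torch.randn(1, 3, 224, 224, device=device)
style_feats = extract_features(dummy, STYLE_LAYERS)
content_feats = extract_features(dummy, CONTENT_LAYER)
```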

VGG-16 Network

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual computer vision competition. Each year, teams compete on two tasks. The first is to detect objects within an image, drawn from 200 classes, which is called object localization. The second is to classify images, each labeled with one of 1000 categories, which is called image classification. VGG-16 was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group Lab at Oxford University in 2014, in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. This model placed first in the localization task and second in the classification task of the 2014 ILSVRC challenge.

VGG-16

VGG-19 Network

VGG-19 is a convolutional neural network that is 19 layers deep. You can load a pre-trained version of the network trained on more than a million images from the ImageNet database. The pre-trained network can classify images into 1000 object categories, such as a keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224.

VGG-19

CycleGAN

CycleGAN is a technique for automatically training image-to-image translation models without paired examples. The models are trained in an unsupervised manner using collections of images from the source and target domains that do not need to be related in any way.
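
The core idea can be sketched as follows: two generators, one mapping the source domain to the target domain and one mapping back, are trained with adversarial losses plus a cycle-consistency loss that pushes an image translated forward and back towards the original. The snippet below only illustrates the cycle-consistency term; the generators here are placeholders, not our actual networks.

```python
# Cycle-consistency loss sketch (G_xy, G_yx are placeholder generators).
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_xy, G_yx, real_x, real_y, lam=10.0):
    forward_cycle = l1(G_yx(G_xy(real_x)), real_x)   # X -> Y -> X should recover X
    backward_cycle = l1(G_xy(G_yx(real_y)), real_y)  # Y -> X -> Y should recover Y
    return lam * (forward_cycle + backward_cycle)

# Dummy identity "generators" just to show the call signature.
G_xy, G_yx = nn.Identity(), nn.Identity()
x = torch.randn(1, 3, 256, 256)
y = torch.randn(1, 3, 256, 256)
print(cycle_consistency_loss(G_xy, G_yx, x, y))  # 0.0 for identity generators
```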

CycleGAN

Losses

Content Loss

The content loss is calculated using the target and content images. First, we select one of the layers of our model. Then we feed the content image forward through the pre-trained model and take the feature maps produced by the selected layer. After that, we feed the target image forward and take the feature maps produced by the same layer.

Then we calculate the loss using these feature maps (content feature maps and target feature maps). For example, we can use the mean squared distance between the content feature maps and the target feature maps as a loss function.
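
A minimal content loss along these lines might look like the sketch below, assuming the feature maps come from a single higher-level layer such as conv4_2.

```python
# Content loss: mean squared distance between two sets of feature maps.
import torch
import torch.nn.functional as F

def content_loss(target_features, content_features):
    # Both tensors come from the same (higher-level) VGG layer.
    return F.mse_loss(target_features, content_features)

# Example with dummy feature maps of shape (batch, channels, height, width).
t = torch.randn(1, 512, 28, 28)
c = torch.randn(1, 512, 28, 28)
print(content_loss(t, c))
```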

Style Loss

In order to calculate the style loss, we need to define a style representation for the images. In the paper, the style representation is defined as the Gram matrix of the feature maps. The Gram matrix is a statistic of the features: it measures the correlations between the feature maps of a layer.

The style loss is calculated using the style representations (Gram matrix) for the style and target images. In the paper, the mean squared distance is used as a style loss. So the style loss will be the mean squared distance between the two Gram matrices.
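
A minimal sketch of the Gram matrix and the resulting per-layer style loss is given below; in practice this loss is summed over several lower-level layers, and the normalization is one common convention rather than the only choice.

```python
# Gram-matrix style representation and per-layer style loss.
import torch
import torch.nn.functional as F

def gram_matrix(features):
    # features: (batch, channels, height, width) activations of one layer.
    b, c, h, w = features.size()
    flat = features.view(b, c, h * w)
    # Correlations between feature maps, normalized by their size.
    return torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)

def style_loss(target_features, style_features):
    # Mean squared distance between the two Gram matrices.
    return F.mse_loss(gram_matrix(target_features), gram_matrix(style_features))

# Example with dummy feature maps from one lower-level layer.
t = torch.randn(1, 64, 224, 224)
s = torch.randn(1, 64, 224, 224)
print(style_loss(t, s))
```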

Total Loss = content weight * content loss + style weight * style loss

By varying the values of content weight and style weight, we can control the amount of information retained from the content image and style image.

Image style transfer

To apply style transfer to images, we load the weights of the pre-trained model, extract the content and style features, and minimize the loss between the target image and the content/style images. The initial target image is a mix of the lower-level layer features of the style image and the higher-level layer features of the content image. Before training, we assign the layer weights used in the loss in decreasing order from lower- to higher-level layers. Once the optimization converges, it produces the style-transferred output image.
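
Putting the pieces together, the optimization loop might look roughly like the sketch below, reusing the frozen VGG, extract_features, content_loss, and style_loss from the earlier snippets. The per-layer weights, loss weights, and iteration count are illustrative assumptions, chosen only to show the decreasing weighting from lower- to higher-level style layers.

```python
# Optimization loop sketch (reuses vgg, extract_features, content_loss,
# style_loss, target, content_img, style_img, optimizer from the snippets above).
# All numeric values below are illustrative, not our exact settings.
style_layer_weights = {"conv1_1": 1.0, "conv2_1": 0.75, "conv3_1": 0.5,
                       "conv4_1": 0.25, "conv5_1": 0.1}
content_weight, style_weight = 1.0, 1e6

content_feats = extract_features(content_img, CONTENT_LAYER)
style_feats = extract_features(style_img, STYLE_LAYERS)
all_layers = {**STYLE_LAYERS, **CONTENT_LAYER}

for step in range(2000):
    optimizer.zero_grad()
    target_feats = extract_features(target, all_layers)
    c_loss = content_loss(target_feats["conv4_2"], content_feats["conv4_2"])
    s_loss = sum(w * style_loss(target_feats[name], style_feats[name])
                 for name, w in style_layer_weights.items())
    loss = content_weight * c_loss + style_weight * s_loss
    loss.backward()
    optimizer.step()
```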

The above strategy works well when we want to impose the style over the entire content image. To target a specific area in the content image, the CycleGAN method is used.

Image Style Transfer
Style Transfer using VGG-16
Style Transfer using VGG-19
Style Transfer using CycleGAN

Model Evaluation

To quantify how well the content of the original image is preserved, we stylize the Google Image Search dataset with the different methods (VGG-16 backbone network, VGG-19 backbone network, and CycleGAN) and compute the classification accuracy using a pre-trained ResNet-50, averaged across all the artists’ images. The results show that the best classification score is achieved on stylizations produced by CycleGAN, since this method barely alters the content of the image. This strategy is known as content-retention evaluation.
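
A sketch of how such a content-retention score can be computed is given below. Since the text above only states that a pre-trained ResNet-50 is used, the snippet makes the simplifying assumption that the reference label is the ResNet-50 prediction on the original (unstylized) image; the batches themselves are placeholders.

```python
# Content-retention sketch: does ResNet-50 still assign the same class to the
# stylized image as it does to the original? (Label definition is an assumption.)
import torch
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
resnet = models.resnet50(pretrained=True).to(device).eval()

@torch.no_grad()
def content_retention_accuracy(original_batch, stylized_batch):
    # Both batches: preprocessed Nx3x224x224 tensors of matching images.
    orig_pred = resnet(original_batch.to(device)).argmax(dim=1)
    styl_pred = resnet(stylized_batch.to(device)).argmax(dim=1)
    return (orig_pred == styl_pred).float().mean().item()

# Dummy usage with random tensors standing in for real image batches.
originals = torch.randn(8, 3, 224, 224)
stylized = torch.randn(8, 3, 224, 224)
print(content_retention_accuracy(originals, stylized))
```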

Video style transfer

For video, we can apply the same method as for images: first we extract the frames, transfer the style frame by frame, and then merge the frames back together. This works well if we want to transfer only a single style over the entire video, because as we increase the number of style images applied across the video frames, the behavior of the model changes significantly. The reason is that each time we load the weights of the pre-trained model, extract the content and style features, and start training, the weights of the model change. So, to make the model more consistent, we introduced a two-model framework: one model is shared across all the frames (during the entire training), and a separate model is used for each frame. We then calculate the loss between the target image and the content/style features for both models, minimize the overall loss, and update both models. This makes the model more consistent as we increase the number of style images; in other words, the model’s behavior approaches linearity, which keeps the changes in style across the output frame sequence subtle even when the style image changes.
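
As a rough illustration of the per-frame pipeline (frame extraction, stylization, and merging), here is an OpenCV sketch. The `stylize` callable stands in for the image style transfer described earlier, and the shared-versus-per-frame model details of our framework are omitted.

```python
# Per-frame video pipeline sketch: read frames, stylize each, write them back.
import cv2

def stylize_video(in_path, out_path, stylize):
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(stylize(frame))  # stylize: BGR frame -> stylized BGR frame
    cap.release()
    writer.release()

# Example call (paths and the identity "stylizer" are placeholders).
# stylize_video("input.mp4", "output.mp4", stylize=lambda frame: frame)
```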

Video Style Transfer

Let’s DiGGiN !!!!!!!

Step 1
Step 2
Step 3

Acknowledgment

This article is based on a project in the Machine Learning course at IIIT Delhi. We are truly grateful for the guidance of our professor, Dr. Tanmoy Chakraborty, and all the TAs of the course for suggesting the project and guiding us in achieving these results.

I also want to thank my teammates Mukul and Shubham for equally contributing to this project.

Contribution

Anmol Jain (MT19005): CycleGAN for style transfer to specific areas, and model evaluation.

Shubham Verma (MT19017): VGG-19 for Image and Video Style Transfer.

Mukul Sharma (MT19042): VGG-16 for Image and VGG-19 for Video Style Transfer.
