Using Deep Neural Networks to mimic the human right brain

deepakvraghavan
9 min read · Apr 29, 2018

It is said that human beings have two parts of the brain — left and right. The left side of the brain is known to be the center for analytical and problem-solving skills. The right side of the brain is known to be the center for artistic abilities. The ability to create “original” artistic content differentiates human beings from machines.

Human Brain — Left and Right

Some examples of this content are:

  • Generation of Art
  • Generation of Music
  • Generation of Prose/Text

For hundreds of years, artists have mastered these skills and set benchmarks in their respective mediums that we continue to admire today. Every generation has looked up to these experts, and the benchmarks have been high and unattainable even for the best of us. For instance, it is extremely rare to match the genius of a painting by Vincent van Gogh, a musical piece by Beethoven, or a writing sample by Shakespeare. Malcolm Gladwell popularized the notion that one needs to spend at least 10,000 hours to be considered an expert in an area, be it science, art, or engineering. For human beings, this means there are physical limits on how much time we can spend on an area to gain expertise.

But can this be challenged by a machine or A.I.? The answer is a promising yes! We are not completely there yet, but the advances in Neural Networks and Artificial Intelligence/Machine Learning look encouraging enough to get us to that place in the near future. In this article, we will look at some of the underlying principles that make this possible, and also explore a few examples to see this in action.

Generative Adversarial Networks (GANs) have become increasingly popular in the last two years, especially for expressing art through images. Other algorithms are used on sequence data, such as Restricted Boltzmann Machines (RBM), Long Short-Term Memory (LSTM) networks, and Recurrent Neural Networks (RNN). Use cases for these networks, mapped to the three categories above, are shown here:

  • Image Generation: Creating novel images from training images or just text descriptions, improving image resolution, Style Translation, etc. We will look at the first two and explore Style Translation in a future article.
  • Music Generation: Creating music in a given genre using training music samples. This uses LSTM because plain Recurrent Neural Networks (RNN) suffer from the vanishing gradient problem, which makes it hard for them to learn long-range patterns in a sequence.
  • Text/Prose Generation: Creating news articles or stories using certain keywords or themes.
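
The vanishing gradient problem mentioned for RNNs can be illustrated with a minimal sketch. It assumes a scalar toy RNN, h_t = tanh(w * h_{t-1} + x_t), and a hypothetical recurrent weight of 0.9; neither comes from a real music model. The gradient of the last hidden state with respect to the first is a product of one factor per time step, and that product shrinks quickly as the sequence gets longer.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad


if __name__ == "__main__":
    w = 0.9  # hypothetical recurrent weight, chosen only for illustration
    for steps in (5, 20, 50):
        inputs = [0.5] * steps  # placeholder input sequence
        print(steps, abs(gradient_through_time(w, inputs)))
```

Because each factor is smaller than 1 in magnitude here, early time steps receive almost no learning signal in long sequences. LSTM cells were designed to ease this by carrying information through a gated cell state.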

Generative Adversarial Network (GAN)

To understand GANs, we need to realize that Machine Learning models are mainly of two types: generative and discriminative.

Generative models generate new samples from a training sample distribution.

Discriminative models map a given sample to one of many possible outputs (for example, classifying an input image as a digit from 0 to 9 when using the MNIST dataset) or to one of two (for example, labeling an image as a cat or a dog).

Ian Goodfellow proposed using these two models in tandem and coined the name GAN for the architecture. The model works in the following way.

  • The generator uses random noise to create an output.
  • The discriminator takes the generator's output and compares it with the real training data, classifying it as real or fake.
  • The generator uses this feedback to keep improving the quality of its output images (using backpropagation to adjust the weights and biases of the underlying neural network).
  • This process continues until the generator creates images so realistic that the discriminator can no longer tell the difference (at this point the model reaches a Nash equilibrium).
GAN Architecture
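
The generator/discriminator loop described above can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not Goodfellow's implementation: the "images" are single numbers drawn from a Gaussian, the generator is the linear map G(z) = a * z + b, the discriminator is a logistic unit D(x) = sigmoid(w * x + c), and the gradients are derived by hand instead of by a deep learning framework. The function name `train_toy_gan` and all hyperparameters are illustrative.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, real_mean=4.0, real_std=1.0, seed=0):
    """Train a 1-D GAN with hand-derived gradients.

    Generator:      G(z) = a * z + b,  z ~ N(0, 1)
    Discriminator:  D(x) = sigmoid(w * x + c)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.0, 0.0  # discriminator parameters

    for _ in range(steps):
        x_real = rng.gauss(real_mean, real_std)  # sample of "training data"
        z = rng.gauss(0.0, 1.0)                  # random noise
        x_fake = a * z + b                       # generator output

        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)

        # Discriminator step: minimise -[log D(x_real) + log(1 - D(x_fake))]
        grad_w = -(1.0 - d_real) * x_real + d_fake * x_fake
        grad_c = -(1.0 - d_real) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimise -log D(G(z)) (the non-saturating loss),
        # back-propagating through the freshly updated discriminator.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = -(1.0 - d_fake) * w
        a -= lr * grad_x * z
        b -= lr * grad_x

    return {"a": a, "b": b, "w": w, "c": c}


if __name__ == "__main__":
    print(train_toy_gan())
```

The discriminator's feedback is what moves the generator's offset `b` toward the real data. Even in this tiny setting the two players can oscillate rather than settle, which hints at why GAN training is considered delicate.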

A brief dose of Math/Statistics behind the model:

  • If the given sample space and the generated output are two distributions, our goal is to minimize the divergence of the latter from the former. The architecture proposed by Ian Goodfellow uses a flavor of this divergence referred to as the Jensen-Shannon (JS) divergence (other measures have been proposed for GANs too, for example Kullback-Leibler aka KL divergence, or the Wasserstein distance as used in Wasserstein GAN aka WGAN).
  • Without going into a lot of probability theory for the scope of this article, just know that GAN became popular with JS divergence because it is symmetric, whereas KL is not. The architecture still had a challenge when using JS divergence: the network was difficult to optimize and train. It was noted that the generator typically learns only a very small subset of the true distribution (as you will see later in some of the examples, where the generated output does not “feel” similar to the training input).
  • In order to address these limitations, WGAN suggested using a statistical measure known as Earth Mover’s distance (EMD) as a way to calculate the distance between the distributions. Using this measure, we can reflect the distance of the two distributions even if they do not overlap, and thus can provide meaningful gradients.
  • You can see in the graph below that both JS divergence and Wasserstein distance are able to mark the distributions as separate, but the latter does not suffer from the vanishing gradient issue.
WGAN performing better than traditional GAN
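
The difference between these measures can be checked numerically. The sketch below is an illustration on discrete distributions over a unit-spaced 1-D grid, not code from the WGAN paper; the helper names (`kl`, `js`, `wasserstein_1d`, `point_mass`) are made up for this example. It compares a point mass at position 0 with a point mass shifted further and further away.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if q lacks support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """Earth Mover's distance on a unit-spaced grid (sum of CDF gaps)."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def point_mass(position, size=10):
    """All probability concentrated on a single grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]


if __name__ == "__main__":
    p = point_mass(0)
    for shift in (1, 5, 9):
        q = point_mass(shift)
        print(shift, kl(p, q), js(p, q), wasserstein_1d(p, q))
```

For non-overlapping distributions, KL is infinite and JS is stuck at log(2) no matter how far apart they are, so neither tells the generator which direction to move. The Earth Mover's distance grows with the shift, which is the property that gives WGAN a usable gradient.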

Now that we have had a chance to see GAN and the underlying math at a high level, let us take a look at some use cases on how this is leveraged.

Practical Examples of generating content (Images, Music, Text) using Deep Neural Networks

  • Images — Novel content creation — Pokemon Image Generation

Yota Ishidi used a PokemonGo dataset and showed how GANs can be used to create novel Pokemon characters. Note that this did not involve ANY human intervention: no artist or graphics developer suggested what needed to be generated. The neural network learnt the patterns in the given set of training images, and the generator progressively got better at creating novel or “never seen before” Pokemon characters. This is mighty powerful! I ran the code on my laptop and did not train the model for very long. The two pictures and the gif show the progress made during this brief training process, which started from a blank slate.

Epoch 1 blank slate for Pokemon Generation
Emerging shapes and colors after brief training

The second image shows new Pokemon characters in the process of being generated as the model gets better through the training process. Although it looks like blobs, there are features which we can make out if we look closely.

The image below shows output generated after training on a powerful GPU machine. These characters are closer to the Pokemon that we are used to seeing in the media.

Pokemon characters generated using GAN after training
  • Images — Novel content creation — Generation of flowers and birds

We looked at some of the constraints of the standard GAN earlier in this article, along with a high-level overview of the underlying math. A research team at Rutgers and Baidu stacked two GANs, called the result StackGAN, and used it to generate high-quality images from just text descriptions.

The team compared this proposed model with the previous model and the image below shows the difference between the vanilla GAN and the outputs from the two stages of this proposed architecture.

Comparison of StackGAN output with vanilla GAN

Architecture for the StackGAN model is presented below.

  • The first GAN (called Stage 1 GAN) takes a text input and generates primitive shapes and colors to form a low-resolution image
  • The second GAN (called Stage 2 GAN) takes this low-resolution image and the text input and generates a much higher resolution image.
StackGAN Architecture

The image below shows the novel generated images that are created by mere text descriptions across the two stages in the proposed architecture.

Sample outputs generated from StackGAN
  • Images — Image Enhancement — Improve Resolution

The highly challenging task of estimating a high-resolution (HR) image from a low-resolution (LR) counterpart is referred to as super resolution (SR). The problem is especially pronounced at high upscaling factors, where texture detail is typically absent from the reconstructed SR images. Researchers at Twitter have found a way to use a GAN for this problem. The two images on the left show traditional approaches to super resolution, and the third shows the output of the proposed model. It looks pretty close to the original HR image shown on the far right.

Using GAN for improving Image Resolution
  • Music Generation

There is a strong connection between music and mathematics. The fundamental characteristics of music, such as notes, chords, pitch, and tempo, follow repeatable patterns that can be expressed and analyzed mathematically. From a Machine Learning perspective, music falls under the category of sequence data. LSTM networks have proven a good fit for generating music of this kind when trained on a sample set of musical inputs. The example below shows how we can generate a jazz output from a training set of jazz samples.

  • Text Generation

RNNs and LSTMs are used for analyzing sequence data in neural networks. We can provide a text corpus and train the network to model the probability distribution of the next character given the sequence of previous characters. Text is then generated one character at a time. Researchers at Stanford University showed that we can not only generate text using samples from a corpus but also capture the structure and style of the input in the output prose.

This sample output shows “novel” (never seen before) text that was created by the model by using Shakespearean works as a training set.

VIOLA:

Why, Salisbury must find his flesh and thought

That which I am not aps, not a man and in fire,

To show the reining of the raven and the wars

To grace my hand reproach within, and not a fair are hand,

That Caesar and my goodly father’s world;

When I was heaven of presence and our fleets,

We spare with hours, but cut thy council I am great,

Murdered and by thy master’s ready there

My power to give thee but so much as hell:

Some service in the noble bondman here,

Would show him to her wine.

KING LEAR:

O, if you were a feeble sight, the courtesy of your law,

Your sight and several breath, will wear the gods

With his heads, and my hands are wonder’d at the deeds,

So drop upon your lordship’s head, and your opinion

Shall be against your honour.
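
The character-by-character idea from the Text Generation section can be sketched without a neural network. The example below swaps the RNN/LSTM for a much simpler character n-gram (Markov) model: it counts which character follows each short context in a corpus and then samples one character at a time from those counts. It is only a stand-in that shows the sampling loop; the function names and the tiny placeholder corpus are assumptions for illustration, and a real model would be trained on the full Shakespeare text.

```python
import random
from collections import Counter, defaultdict


def fit_char_model(corpus, order=3):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - order):
        context = corpus[i:i + order]
        counts[context][corpus[i + order]] += 1
    return counts


def generate(counts, seed_text, order=3, length=80, rng=None):
    """Sample one character at a time from the estimated distribution."""
    rng = rng or random.Random(0)
    out = seed_text
    for _ in range(length):
        followers = counts.get(out[-order:])
        if not followers:  # context never seen in the corpus: stop early
            break
        chars = list(followers)
        weights = [followers[ch] for ch in chars]
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out


if __name__ == "__main__":
    # Placeholder corpus; substitute a large text file for meaningful output.
    corpus = "to be or not to be that is the question " * 3
    model = fit_char_model(corpus, order=3)
    print(generate(model, "to ", order=3, length=60))
```

An RNN or LSTM plays the same role as the count table here, but it can condition on a much longer history, which is what lets it pick up structure such as speaker names and line breaks.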

Conclusion:

As we have seen in these examples, the ability of a machine to create “original” content in various mediums of expression is extremely powerful. As research in this area advances, distinguishing human-created art from computer-generated art (using AI/Machine Learning) will become increasingly difficult. To establish the difference in these cases, we could leverage Blockchain (as an example) to digitally fingerprint machine-generated content and uniquely distinguish it from the training set.
