Calculate Value-at-Risk Using Wasserstein Generative Adversarial Networks (WGAN-GP) for Risk Management System

Santanu Khan · Published in Chatbots Life · Mar 17, 2019 · 7 min read


This article focuses on the Wasserstein Generative Adversarial Network with gradient penalty (WGAN-GP), a pure artificial-intelligence method for a Risk Management System (RMS). The claim for WGAN-GP is that it is more powerful than the three standard methods for calculating risk in an RMS: the historical method, the variance-covariance method, and the Monte Carlo method.

Specifically, WGAN-GP allows us to deal with potentially complex financial-services data without having to explicitly specify a distribution, such as the multidimensional Gaussian distribution used in Monte Carlo simulation.

When training my model, I found that an actor-critic framework worked best for generating synthetic data from our training set. At first I trained using a binary generator-discriminator approach, but found that the GAN suffered from “mode collapse” (when the generator learns only a small subset of the possible realistic modes), specifically in a range of values where the discriminator did a poor job of classifying the data as real or synthetic. The actor-critic framework solved this problem by evaluating the Wasserstein distance between the real and synthetic data rather than a binary cross-entropy.

Wasserstein Distance

Instead of adding noise, the Wasserstein GAN (WGAN) proposes a new cost function based on the Wasserstein distance, which has a smoother gradient everywhere, so the model keeps learning whether or not the generator is currently performing well. Plotting the critic value D(X) for both architectures shows the difference: the GAN curve is full of regions with vanishing or exploding gradients, while the WGAN gradient is smooth everywhere, so the generator learns even when it is not yet producing good data.
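To make the distance itself concrete, here is a minimal sketch using SciPy’s wasserstein_distance; the arrays real_returns and fake_returns are hypothetical stand-ins for real and generated daily returns:

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical daily returns: "real" market data vs. generator output.
real_returns = rng.normal(loc=0.0005, scale=0.01, size=1000)
fake_returns = rng.normal(loc=0.0, scale=0.02, size=1000)

# Wasserstein-1 distance: the minimum cost of transporting one empirical
# distribution onto the other. Unlike a binary real/fake score, it shrinks
# smoothly as the two distributions approach each other.
print(wasserstein_distance(real_returns, fake_returns))

As training progresses, this distance between real and generated returns should fall smoothly, which is exactly the well-behaved signal the critic provides.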


Wasserstein GAN (WGAN)

GANs are described as adversarial networks, but the generator and critic should be learning from each other.

If a GAN discriminator is too good at detecting manufactured information, the generator has no direction for improvement. Conversely, if the generator is always able to fool the critic, this also leaves the generator with no opportunity for development.

In the Wasserstein GAN, the discriminator doesn’t just return a positive/negative answer. Instead, it provides the generator with the underlying information it would use to decide.

This gives the generator much smoother feedback to work with, and the hope is that the generator could always use this output to progress, thereby avoiding mode collapse.

The idea of WGAN is to replace the loss function so that a non-zero gradient is guaranteed to exist everywhere. It turns out that this can be done with the Wasserstein distance between the generator distribution and the data distribution.

This is the WGAN discriminator’s loss function:

disc_loss_base = -tf.reduce_mean(x_out) + tf.reduce_mean(z_out)

As the discriminator learns to correctly identify the training data as real, x_out should increase. This is good for the discriminator, but we multiply it by -1 because we are minimizing this measure of skill. (“In mathematics, conventional optimization problems are usually stated in terms of minimization.”)

As the discriminator learns to identify the generated samples as fake, z_out should decrease. We keep a plus sign in front of this term since it’s going in the direction we want: a lower value means the discriminator is better at its job.

The WGAN generator’s loss function is:

gen_loss_base = -tf.reduce_mean(z_out)

If the generator is fooling the critic, it means that the critic is classifying the generated data with a higher value, and z_out is larger. A higher z_out is therefore good for the generator, and we multiply it by -1 since we are optimizing in the downhill direction.
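Putting the two losses together, here is a minimal TensorFlow 2 sketch of how they might be wired up; the critic and generator models and the tensors x_real and z_noise are assumed placeholders for illustration:

import tensorflow as tf

def wgan_base_losses(critic, generator, x_real, z_noise):
    # Generate synthetic samples and score both batches with the critic.
    x_fake = generator(z_noise, training=True)
    x_out = critic(x_real, training=True)  # critic score on real data
    z_out = critic(x_fake, training=True)  # critic score on generated data

    # Critic: reward high scores on real data, low scores on fakes.
    disc_loss_base = -tf.reduce_mean(x_out) + tf.reduce_mean(z_out)
    # Generator: reward fooling the critic into scoring fakes highly.
    gen_loss_base = -tf.reduce_mean(z_out)
    return disc_loss_base, gen_loss_base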

Wasserstein GAN + Gradient Penalty (WGAN-GP)

In the WGAN, the relationship between the discriminator’s input and output can’t be too steep or jagged: the function must be 1-Lipschitz, i.e. its slope can be at most 1 everywhere. The gradient penalty enforces this.

The gradient penalty used here adds a cost term to the discriminator that increases as the norm of the discriminator’s gradient moves away from 1. This follows the design of the WGAN-GP paper, in which the authors penalize deviations from 1 in either direction:

We encourage the norm of the gradient to go towards 1 (two-sided penalty) instead of just staying below 1 (one-sided penalty). Empirically this seems not to constrain the critic too much…
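A minimal TensorFlow 2 sketch of this two-sided penalty (the straight-line interpolation between real and generated samples follows the WGAN-GP paper; critic, x_real, and x_fake are the hypothetical objects from the loss sketch above, assumed to be 2-D [batch, features] tensors):

def gradient_penalty(critic, x_real, x_fake, gp_weight=10.0):
    batch_size = tf.shape(x_real)[0]
    # Sample random points on straight lines between real and fake samples.
    eps = tf.random.uniform([batch_size, 1], 0.0, 1.0)
    x_hat = eps * x_real + (1.0 - eps) * x_fake

    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        critic_out = critic(x_hat, training=True)
    grads = tape.gradient(critic_out, x_hat)

    # Two-sided penalty: push the gradient norm towards 1 from both sides.
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=1) + 1e-12)
    return gp_weight * tf.reduce_mean(tf.square(grad_norm - 1.0))

The full critic objective is then disc_loss_base plus this penalty term (the paper’s default penalty weight is 10).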

Critic vs Discriminator

WGAN introduces a new concept called the ‘critic’, which corresponds to the discriminator in a GAN. As briefly mentioned above, the discriminator in a GAN only tells whether the incoming data is fake or real, and over the epochs it becomes more accurate at making that series of decisions. In contrast, the critic in WGAN tries to measure the Wasserstein distance by approximating a Lipschitz function as tightly as possible, giving a more accurate distance. The approximation is achieved by updating the critic network under the implicit constraint that it satisfies the Lipschitz continuity condition.
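For reference, the original WGAN paper enforced that implicit constraint by clipping the critic’s weights after every update; a minimal sketch of that step (the clip value 0.01 is the paper’s default, and critic is the hypothetical Keras-style model from the sketches above):

def clip_critic_weights(critic, clip_value=0.01):
    # Crude Lipschitz enforcement from the original WGAN paper:
    # keep every critic weight inside [-clip_value, clip_value].
    for w in critic.trainable_weights:
        w.assign(tf.clip_by_value(w, -clip_value, clip_value))

WGAN-GP replaces this clipping with the gradient penalty above, which constrains the critic far less crudely.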

If you look at the final algorithms, GAN and WGAN look very similar from an algorithmic point of view, but their intuitions differ as much as a variational autoencoder differs from a plain autoencoder. One fascinating thing is that the derived loss function is even simpler than that of the original GAN algorithm: it is just the difference between two averages.

What is the relationship between reinforcement learning and adversarial learning (e.g. GAN)?

There are two types of Reinforcement learning approaches:

i. Model-based RL

ii. Model-free RL

Model-based approaches are the ones that contain a generative model.

The critic in actor-critic (AC) methods is like the discriminator in GANs, and the actor in AC methods is like the generator in GANs. In both systems, a game is being played between the actor (generator) and the critic (discriminator). Each starts out knowing very little: the actor begins by bumbling around the state space, and the critic has no clue how to evaluate the actor’s more-or-less random behavior.

Both generative adversarial networks (GAN) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being difficult to optimize.

Value-at-Risk (VaR)

VaR is a measure of portfolio risk: the maximum expected loss at a given confidence level. For instance, a 1% VaR (i.e. 99% confidence) of -5% means that there is a 1% chance that we lose more than 5%.

The use of a neural network in the Risk Management System is essentially to train a model on the calculated daily returns; calculating VaR itself is a purely mathematical function. After training the model, we test it by feeding it random noise for, say, 1,000 simulations. The predicted output is a distribution of simulated returns, the WGAN-GP returns, and from these we calculate the VaR using a percentile.

That is, for a 1% VaR (99% confidence), the percentile is set to 1.
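A minimal sketch of that final step (the array simulated_returns is a hypothetical stand-in for the 1,000 WGAN-GP outputs):

import numpy as np

# Hypothetical output of 1,000 WGAN-GP simulations of daily returns.
simulated_returns = np.random.default_rng(42).normal(0.0, 0.02, size=1000)

# 1% VaR (99% confidence): the 1st percentile of the simulated returns.
var_99 = np.percentile(simulated_returns, 1)
print(f"99% VaR: {var_99:.2%}")

If this prints, say, -4.6%, it reads as: on 99% of days we should not lose more than 4.6%.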

VaR is applicable to the equity, forex, and commodity markets.

Keep in mind that there are many other techniques, such as exit conditions (e.g. stop losses), that will help you handle risk better.

In this article, I just wanted to show market risk based on 5 companies. The code is available on my GitHub profile. In a future article, I will explain time-series prediction of VaR, which is more efficient, along with some reinforcement-learning techniques applied to it.

References

i. https://hergott.github.io/finance_gan/

ii. https://deepmind.com/research/publications/connecting-generative-adversarial-networks-and-actor-critic-method/

iii. https://medium.com/@jonathan_hui/gan-wasserstein-gan-wgan-gp-6a1a2aa1b490

iv. https://medium.com/@kion.kim/wgan-and-wgan-gp-2798d065a2db

v. https://aldousbirchall.com/2017/03/25/deep-learnings-killer-app/
