Generative Adversarial Networks (GANs)

A Generative Adversarial Network (GAN) is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in 2014. It consists of two neural networks, the Generator and the Discriminator, which are trained simultaneously in an adversarial process.

1. Components of GAN:

  • Generator (G):
    • The Generator takes random noise (often sampled from a simple distribution like a Gaussian distribution) as input and generates fake data.
    • Its goal is to create synthetic data that is as close to real data as possible.
  • Discriminator (D):
    • The Discriminator receives both real data (from the training dataset) and fake data (generated by the Generator).
    • Its goal is to correctly classify data as real or fake.
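To make these two components concrete, here is a minimal PyTorch sketch of a Generator and a Discriminator as simple fully connected networks; the layer sizes, the noise dimension, and the flattened-image data dimension are illustrative assumptions, not part of any standard architecture.

  import torch
  import torch.nn as nn

  NOISE_DIM = 64    # size of the random noise vector fed to the Generator (assumed)
  DATA_DIM = 784    # e.g. a flattened 28x28 image (assumed)

  # Generator: maps random noise z to a synthetic data sample
  generator = nn.Sequential(
      nn.Linear(NOISE_DIM, 256),
      nn.ReLU(),
      nn.Linear(256, DATA_DIM),
      nn.Tanh(),          # outputs scaled to [-1, 1]
  )

  # Discriminator: maps a data sample to the probability that it is real
  discriminator = nn.Sequential(
      nn.Linear(DATA_DIM, 256),
      nn.LeakyReLU(0.2),
      nn.Linear(256, 1),
      nn.Sigmoid(),       # probability of "real"
  )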

2. Training Process:

The training of a GAN is like a two-player game:

  • Generator’s Objective: To generate data that is indistinguishable from the real data so that the Discriminator cannot differentiate between real and fake.

  • Discriminator’s Objective: To distinguish between real and fake data as accurately as possible.

Adversarial Training:

  • The Generator and Discriminator are trained alternately.
  • The Generator tries to fool the Discriminator by generating realistic samples.
  • The Discriminator tries to correctly identify real samples and fake samples.
  • This competition drives both networks to improve, with the Generator producing increasingly realistic data over time.
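One common way to implement these alternating updates is sketched below in PyTorch, continuing the generator, discriminator, and NOISE_DIM from the earlier sketch; the optimizers, learning rates, and label convention (1 = real, 0 = fake) are assumptions for illustration.

  import torch
  import torch.nn as nn

  bce = nn.BCELoss()
  opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
  opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

  def train_step(real_batch):
      batch_size = real_batch.size(0)
      real_labels = torch.ones(batch_size, 1)
      fake_labels = torch.zeros(batch_size, 1)

      # 1) Discriminator step: classify real samples as real and fakes as fake
      z = torch.randn(batch_size, NOISE_DIM)
      fake_batch = generator(z).detach()   # detach so this step does not update G
      d_loss = bce(discriminator(real_batch), real_labels) + \
               bce(discriminator(fake_batch), fake_labels)
      opt_d.zero_grad()
      d_loss.backward()
      opt_d.step()

      # 2) Generator step: try to make the Discriminator label fakes as real
      z = torch.randn(batch_size, NOISE_DIM)
      g_loss = bce(discriminator(generator(z)), real_labels)
      opt_g.zero_grad()
      g_loss.backward()
      opt_g.step()

      return d_loss.item(), g_loss.item()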

3. Loss Function:

The overall loss function of the GAN can be described as a minimax game:

  • The Discriminator tries to maximize the probability of correctly classifying real and fake samples.
  • The Generator tries to minimize the probability that the Discriminator correctly classifies the generated samples as fake.
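In the original 2014 formulation, this game is written as the value function

  min_G max_D V(D, G) = E_{x ~ p_data}[ log D(x) ] + E_{z ~ p_z}[ log(1 - D(G(z))) ]

where D(x) is the Discriminator's estimated probability that x is real data and G(z) is the Generator's output for noise z. In practice the Generator is often trained to maximize log D(G(z)) instead of minimizing log(1 - D(G(z))), since that variant gives stronger gradients early in training.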

4. Challenges of GANs:

  • Mode Collapse: The Generator may find a small set of outputs that successfully fool the Discriminator, resulting in a lack of diversity in generated samples.
  • Training Instability: GANs are notoriously difficult to train, and maintaining the balance between the Generator and Discriminator is challenging.
  • Convergence Issues: There is no reliable way to tell from the training losses alone whether the model has converged to a good solution, because the adversarial losses do not directly measure sample quality the way an ordinary training loss does.

5. Applications of GANs:

  • Image Generation: GANs are widely used for generating high-quality images, such as generating realistic human faces.
  • Image-to-Image Translation: GANs can transform images from one domain to another, such as converting sketches to full-color images.
  • Text-to-Image Generation: GANs can be used to generate images based on textual descriptions.
  • Super-Resolution: GANs are used to enhance image resolution, creating high-resolution images from low-resolution inputs.
  • Data Augmentation: GANs can be used to generate additional training data for improving the performance of machine learning models.

GANs are a powerful and versatile tool in generative modeling, offering a wide range of creative and practical applications.


Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that is well-suited for sequential data, such as time series, text, or video data. It was designed to address the vanishing and exploding gradient problems that traditional RNNs face during backpropagation, especially when dealing with long sequences.

1. Structure of LSTM:

An LSTM unit consists of a cell state and three gates:

  • Cell State (c_t): It is the memory part of the LSTM, which carries information across time steps.
  • Gates: The gates control the flow of information into and out of the memory cell.
    • Forget Gate (f_t): Decides what information to discard from the cell state.
    • Input Gate (i_t): Determines what new information to store in the cell state.
    • Output Gate (o_t): Controls what part of the cell state should be output as the next hidden state.

Each gate in the LSTM uses a sigmoid activation function (outputs values between 0 and 1), allowing it to control how much information should pass through.

2. Key Equations of LSTM:

At each time step t, the LSTM unit computes the following (σ denotes the sigmoid function, ⊙ element-wise multiplication, and [h_{t-1}, x_t] the concatenation of the previous hidden state with the current input):

  • Forget Gate:

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

    The forget gate determines which part of the previous cell state (c_{t-1}) should be kept.

  • Input Gate:

    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
    c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)

    The input gate controls how much of the candidate information (c̃_t) will be added to the cell state.

  • Cell State Update:

    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

    The cell state is updated by combining the old cell state (c_{t-1}), scaled by the forget gate, with new information selected by the input gate.

  • Output Gate:

    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
    h_t = o_t ⊙ tanh(c_t)

    The output gate determines what information from the current cell state should be output as the hidden state (h_t), which will be used as input for the next time step.
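The following NumPy sketch mirrors these equations for a single time step; the function name, argument order, and weight shapes are illustrative assumptions rather than any library's API.

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
      # x_t: (input_dim,)   h_prev, c_prev: (hidden_dim,)
      # each W_*: (hidden_dim, hidden_dim + input_dim)   each b_*: (hidden_dim,)
      z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
      f_t = sigmoid(W_f @ z + b_f)           # forget gate
      i_t = sigmoid(W_i @ z + b_i)           # input gate
      c_tilde = np.tanh(W_c @ z + b_c)       # candidate cell state
      c_t = f_t * c_prev + i_t * c_tilde     # cell state update (element-wise)
      o_t = sigmoid(W_o @ z + b_o)           # output gate
      h_t = o_t * np.tanh(c_t)               # new hidden state
      return h_t, c_t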

3. Flow of Information in LSTM:

  • Forget Gate: Controls what portion of the previous memory to keep. If f_t is close to 1, the previous cell state is preserved; if it is close to 0, the previous cell state is forgotten.

  • Input Gate: Decides what new information will be added to the cell state based on the current input x_t and the previous hidden state h_{t-1}.

  • Output Gate: Controls what part of the updated cell state should be passed to the next hidden state. This updated hidden state is passed to the next LSTM cell and can also be used for predictions.

4. Advantages of LSTM:

  • Memory Capability: LSTMs can learn long-term dependencies by effectively controlling the flow of information using gates.
  • Solving the Vanishing Gradient Problem: The gated, additive cell-state update keeps gradients from vanishing during backpropagation through time, allowing the network to retain important information over long sequences.
  • Flexible Time Dependencies: LSTM can capture both short-term and long-term dependencies in sequence data.

5. Applications of LSTM:

  • Time Series Forecasting: LSTMs are widely used for predicting future values in time series data, such as stock prices, weather, and electricity prices.
  • Natural Language Processing: Tasks like machine translation, text generation, and sentiment analysis often use LSTMs to capture long-term dependencies in text sequences.
  • Speech Recognition: LSTMs are used in recognizing patterns in audio data, enabling more accurate speech recognition.
  • Video Analysis: LSTMs are useful in tasks that require analyzing sequences of frames, such as activity recognition in videos.
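As a small illustration of the time-series use case, here is a hedged PyTorch sketch of a one-step-ahead forecaster built on nn.LSTM; the hidden size, window length, and final linear head are arbitrary choices for the example.

  import torch
  import torch.nn as nn

  class Forecaster(nn.Module):
      # Predicts the next value of a univariate series from a window of past values.
      def __init__(self, hidden_dim=32):
          super().__init__()
          self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, batch_first=True)
          self.head = nn.Linear(hidden_dim, 1)

      def forward(self, x):                  # x: (batch, seq_len, 1)
          out, _ = self.lstm(x)              # out: (batch, seq_len, hidden_dim)
          return self.head(out[:, -1, :])    # use the last hidden state

  model = Forecaster()
  window = torch.randn(8, 50, 1)             # 8 sequences of 50 past values
  prediction = model(window)                 # shape: (8, 1)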

LSTMs are powerful models for handling sequential data, offering significant improvements over traditional RNNs, especially when modeling long-term dependencies.


The Vanishing Gradient Problem

The vanishing gradient problem is an issue that occurs during the training of deep neural networks, particularly in networks with many layers. It arises when the gradients used to update the weights become exceedingly small, effectively “vanishing” as they are propagated back through the network during backpropagation.

Why it Happens:

In backpropagation, gradients (partial derivatives of the loss with respect to the model parameters) are computed starting from the output layer and moving backward toward the input layer. If the activation functions used in the network (such as the sigmoid or tanh functions) squash their inputs into a small range (e.g., between 0 and 1 for sigmoid), the gradients of these activations can become very small. As a result, the gradient signal diminishes as it moves backward through the network’s layers.
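For the sigmoid in particular, the derivative is σ'(x) = σ(x)(1 - σ(x)), which never exceeds 0.25 (its value at x = 0). Each additional sigmoid layer therefore multiplies the backpropagated gradient by a factor of at most 0.25 times the corresponding weight term, so the signal can shrink geometrically with depth.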

Effects:

  • Slow or no learning in deeper layers: The weights in the earlier layers of the network receive very small updates, causing the network to learn very slowly or sometimes not at all.
  • Difficulty in training deep networks: This problem makes training deep networks particularly challenging, as the earlier layers fail to capture useful representations due to the small weight updates.

Example:

Consider a deep network with many layers, and let’s assume the activation function is the sigmoid function:

  • The gradient of the sigmoid function is small for inputs that are very large or very small. This means when a gradient passes through many sigmoid layers, it is multiplied by small numbers repeatedly, causing it to shrink exponentially. Eventually, it becomes so small that it effectively stops updating the weights of the early layers.
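A tiny Python sketch makes the shrinkage concrete; the depth of 30 layers is an arbitrary assumption, and the weights are ignored to isolate the activation's effect.

  import numpy as np

  def sigmoid_grad(x):
      s = 1.0 / (1.0 + np.exp(-x))
      return s * (1.0 - s)

  # Best case for the sigmoid: its gradient peaks at 0.25 when the input is 0.
  grad = 1.0
  for _ in range(30):               # 30 stacked sigmoid layers
      grad *= sigmoid_grad(0.0)     # multiply by 0.25 at every layer
  print(grad)                       # 0.25**30 ≈ 8.7e-19 -- effectively zero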

Solutions to the Vanishing Gradient Problem:

  • Use of ReLU Activation Functions: The ReLU (Rectified Linear Unit) activation function does not suffer from the vanishing gradient problem as severely as sigmoid or tanh because its gradient is constant (1) for positive inputs.
  • Batch Normalization: This normalizes the input of each layer, helping to maintain the gradient’s scale throughout the network.
  • Residual Networks (ResNets): These networks use skip connections (residual connections), allowing the gradient to flow more easily through the network and helping to mitigate the vanishing gradient problem (a minimal sketch of a residual block follows this list).
  • Gradient Clipping: This technique caps the magnitude of the gradients to keep them from growing too large; it mainly addresses the related exploding gradient problem rather than vanishing gradients.
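For the residual-connection idea in particular, here is a minimal PyTorch sketch of a block with a skip connection; the block structure is an illustrative assumption, not the exact ResNet architecture.

  import torch
  import torch.nn as nn

  class ResidualBlock(nn.Module):
      def __init__(self, dim=64):
          super().__init__()
          self.body = nn.Sequential(
              nn.Linear(dim, dim),
              nn.ReLU(),
              nn.Linear(dim, dim),
          )

      def forward(self, x):
          # The identity skip connection gives gradients a direct path backward,
          # so they are not forced through every layer's nonlinearity.
          return torch.relu(x + self.body(x))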

Summary:

The vanishing gradient problem is a significant challenge when training deep neural networks, especially those using activation functions like sigmoid or tanh. It results in gradients that become too small to effectively update weights, particularly in the earlier layers of the network, slowing down or halting learning. Solutions such as ReLU activations, batch normalization, and residual connections are often used to address this issue.