Neural Networks and Deep Learning#
1. Neural Networks: Architecture and Components#
Basic Structure of a Neural Network:#
Input Layer:
Receives the input features. The number of neurons in this layer equals the number of features in the input data.
Hidden Layers:
These are intermediate layers where the actual processing takes place through weighted connections and activation functions. A network with a single hidden layer is considered shallow, while a network with many hidden layers is a deep neural network.
Output Layer:
Produces the final prediction, with the number of neurons matching the prediction target (for example, a single neuron for binary classification, one neuron per class for multi-class classification).
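As a concrete illustration, here is a minimal PyTorch sketch of this structure. The sizes (4 input features, 16 hidden units, 1 output neuron) are hypothetical choices for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 4 input features, 16 hidden units, 1 output neuron.
model = nn.Sequential(
    nn.Linear(4, 16),   # input layer -> hidden layer (weighted connections)
    nn.ReLU(),          # non-linear activation in the hidden layer
    nn.Linear(16, 1),   # hidden layer -> output layer
    nn.Sigmoid(),       # squash to (0, 1) for binary classification
)

x = torch.randn(8, 4)   # a batch of 8 examples with 4 features each
print(model(x).shape)   # torch.Size([8, 1])
```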
Activation Functions:#
ReLU (Rectified Linear Unit):
Formula: \( f(x) = \max(0, x) \)
Commonly used in hidden layers because it helps address the vanishing gradient problem; it introduces non-linearity while remaining computationally simple.
Sigmoid:
Formula: \( f(x) = \frac{1}{1 + e^{-x}} \)
Used in the output layer for binary classification, as it squashes the output into the range (0, 1).
Tanh (Hyperbolic Tangent):
Formula: \( f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)
Outputs between -1 and 1, often used in recurrent networks.
Softmax:
Formula: \( f(x_i) = \frac{e^{x_i}}{\sum_{j}e^{x_j}} \)
Typically used in the output layer for multi-class classification, providing probabilities for each class.
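The four formulas above translate directly into NumPy; a minimal sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softmax(x):
    # Subtract the max for numerical stability; the exponentials then sum to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z), tanh(z), softmax(z))
```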
2. Backpropagation and Gradient Descent#
Backpropagation:#
Definition: Backpropagation is the algorithm used to compute the gradient of the loss function with respect to each weight by applying the chain rule of calculus.
Steps:
Forward Pass: Calculate the output by propagating the input through the network and compute the loss.
Backward Pass: Compute gradients of the loss with respect to each parameter using the chain rule.
Parameter Update: Update weights using an optimization algorithm like gradient descent.
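To make the chain rule concrete, here is a minimal NumPy sketch of the forward and backward pass for a single sigmoid neuron with squared-error loss (the input, target, and initial parameters are toy values chosen for the example):

```python
import numpy as np

# Toy data: one input, one target.
x, y_true = 1.5, 1.0
w, b = 0.4, 0.1                      # initial parameters

# Forward pass: compute the prediction and the loss.
z = w * x + b
y_hat = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation
loss = 0.5 * (y_hat - y_true) ** 2   # squared-error loss

# Backward pass: chain rule, factor by factor.
dloss_dyhat = y_hat - y_true         # dL/dy_hat
dyhat_dz = y_hat * (1.0 - y_hat)     # sigmoid derivative
dz_dw, dz_db = x, 1.0
grad_w = dloss_dyhat * dyhat_dz * dz_dw
grad_b = dloss_dyhat * dyhat_dz * dz_db
print(grad_w, grad_b)
```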
Gradient Descent:#
Gradient Descent Algorithm:
Weights are updated by moving in the opposite direction of the gradient of the loss function with respect to the weights.
Formula for weight update: \( w = w - \eta \cdot \nabla J(w) \)
Where \( \eta \) is the learning rate and \( \nabla J(w) \) is the gradient of the loss function.
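Plugging the gradients into the update rule \( w = w - \eta \cdot \nabla J(w) \) gives a minimal gradient-descent loop. This continues the toy single-neuron example above, with an assumed learning rate of 0.1:

```python
import numpy as np

x, y_true = 1.5, 1.0
w, b = 0.4, 0.1
eta = 0.1                                 # learning rate (assumed)

for step in range(100):
    # Forward pass.
    y_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    # Gradients via the chain rule (see the backward pass above).
    common = (y_hat - y_true) * y_hat * (1.0 - y_hat)
    # Update: move against the gradient.
    w -= eta * common * x
    b -= eta * common
print(w, b)
```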
3. Common Deep Learning Architectures#
Feedforward Neural Networks (FNNs):#
Definition: The simplest type of neural network, where connections flow in one direction only, from input to output, with no cycles or loops.
Use Case: Basic classification or regression tasks.
Convolutional Neural Networks (CNNs):#
Definition: CNNs are specialized for processing grid-like data, such as images.
Components:
Convolutional Layers: Apply convolutional filters to capture spatial hierarchies in the data.
Pooling Layers: Downsample feature maps to reduce dimensionality and computation.
Fully Connected Layers: Connect every neuron to every neuron in the next layer, typically after several convolutional layers.
Use Case: Image classification (e.g., ImageNet), object detection (e.g., YOLO), and computer vision tasks.
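A minimal PyTorch sketch combining the three component types; the input size (3-channel 32×32 images) and number of classes (10) are hypothetical:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: 16 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer: halves spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # fully connected layer -> class scores
)

images = torch.randn(4, 3, 32, 32)               # batch of 4 RGB 32x32 images
print(cnn(images).shape)                         # torch.Size([4, 10])
```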
Recurrent Neural Networks (RNNs):#
Definition: RNNs are designed to handle sequential data; the output at each step depends on a hidden state carried over from previous steps, as sketched below.
Challenge: RNNs suffer from the vanishing gradient problem, making it difficult to learn long-term dependencies.
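The recurrence itself fits in a few lines of NumPy. A minimal sketch with hypothetical dimensions (3-dimensional inputs, 5-dimensional hidden state, sequence length 4):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 5, 4        # hypothetical sizes
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))      # a toy input sequence
h = np.zeros(hidden_dim)                        # initial hidden state

for x_t in xs:
    # Each step mixes the current input with the previous hidden state.
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h)
```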
Long Short-Term Memory (LSTM) Networks:#
Definition: LSTM is a type of RNN that introduces memory cells and gating mechanisms to better capture long-term dependencies in sequences.
Components:
Forget Gate: Controls what portion of the previous memory is carried forward.
Input Gate: Decides what new information is added to the memory.
Output Gate: Determines the output based on the memory cell.
Use Case: Natural language processing (NLP), time series prediction, speech recognition.
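In PyTorch the gating machinery is bundled inside nn.LSTM; a minimal sketch with hypothetical sizes (8-dimensional inputs, 16-dimensional hidden state):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(2, 10, 8)          # batch of 2 sequences, 10 steps, 8 features each
out, (h_n, c_n) = lstm(x)          # out: hidden state at every step
print(out.shape)                   # torch.Size([2, 10, 16])
print(h_n.shape, c_n.shape)        # final hidden state and memory cell: [1, 2, 16]
```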
4. Transformer Architecture (Crucial for LLMs)#
Definition:#
Transformers are now the backbone of many state-of-the-art models in NLP (e.g., BERT, GPT). They replace recurrence with attention mechanisms, making them more efficient in processing long sequences.
Key Components:#
Self-Attention Mechanism:
The attention mechanism allows the model to focus on different parts of the input sequence when predicting each token. It is computed from the query, key, and value matrices (a minimal sketch follows this list of components).
Positional Encoding:
Since transformers don’t have recurrence, they rely on positional encodings to inject information about the position of each token in the sequence.
Multi-Head Attention:
Instead of having a single attention mechanism, multi-head attention runs several attention mechanisms in parallel to capture different relationships in the sequence.
Feedforward Layers:
After the attention layer, there is a fully connected feedforward network, applied to each position separately.
Layer Normalization and Residual Connections:
These help stabilize and speed up training.
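A minimal NumPy sketch of single-head scaled dot-product attention, assuming the query, key, and value matrices have already been computed (shapes are toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row becomes a probability distribution over positions.
    weights = softmax(scores, axis=-1)
    # Output: a weighted sum of the values.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # hypothetical sizes
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))
print(self_attention(Q, K, V).shape)         # (4, 8)
```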
Use Case:
BERT (Bidirectional Encoder Representations from Transformers): Pre-trained using masked language modeling, useful for tasks like question answering and named entity recognition.
GPT (Generative Pre-trained Transformer): A generative model pre-trained on a large corpus of text to generate human-like text.
5. Transfer Learning and Fine-Tuning#
Transfer Learning:#
Definition: Using a pre-trained model (often trained on a large dataset like ImageNet or text corpora) as a starting point for a new task.
Fine-Tuning:#
Definition: The process of taking a pre-trained model and training it on a smaller, task-specific dataset. You might freeze the earlier layers and only fine-tune the last few layers or fine-tune the entire network.
Use Case: Models like BERT, GPT, and ResNet are often pre-trained on large datasets and fine-tuned for specific tasks such as sentiment analysis or object detection.
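A minimal PyTorch sketch of the freeze-and-replace-the-head pattern, assuming torchvision's ResNet-18 with ImageNet weights and a hypothetical 5-class target task:

```python
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet (weights string per recent torchvision versions).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the earlier layers so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the hypothetical 5-class task.
# The new layer's parameters are trainable by default.
model.fc = nn.Linear(model.fc.in_features, 5)
```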
6. Regularization in Deep Networks#
Dropout:#
Definition: During training, randomly set a fraction of the neurons to zero at each forward pass, preventing co-adaptation of neurons and reducing overfitting.
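A minimal NumPy sketch of inverted dropout, the form most frameworks use, assuming a drop probability of 0.5:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                      # dropout is disabled at inference
    # Randomly zero each unit with probability p, then rescale the survivors
    # so the expected activation is unchanged (inverted dropout).
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask

h = np.ones((2, 6))
print(dropout(h, p=0.5))
```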
Batch Normalization:#
Definition: Normalize each layer's inputs over the mini-batch to have mean 0 and variance 1 (followed by a learnable scale and shift), improving convergence and reducing the problem of vanishing gradients.
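A minimal NumPy sketch of the training-time normalization step (per-feature statistics over the mini-batch, then the learnable scale and shift):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Per-feature mean and variance over the mini-batch dimension.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)     # zero mean, unit variance
    return gamma * x_hat + beta                 # learnable scale and shift

x = np.random.randn(32, 4) * 3.0 + 10.0         # a batch of 32 examples, 4 features
gamma, beta = np.ones(4), np.zeros(4)
print(batch_norm(x, gamma, beta).mean(axis=0))  # roughly 0 for each feature
```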
Data Augmentation:#
Definition: Applying random transformations like rotation, flipping, and scaling to training data to artificially increase the size of the dataset and reduce overfitting.
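A minimal sketch using torchvision.transforms; the specific parameters (rotation angle, crop size) are arbitrary choices for the example:

```python
from torchvision import transforms

# Each training image is passed through a random pipeline of transformations,
# so the model rarely sees exactly the same input twice.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),        # random flipping
    transforms.RandomRotation(degrees=15),    # random rotation (arbitrary angle)
    transforms.RandomResizedCrop(224),        # random scaling and cropping
    transforms.ToTensor(),
])
```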
7. Optimization in Deep Learning#
Adam Optimizer:#
Definition: A popular optimization algorithm that combines momentum and RMSProp. It adapts the learning rate for each parameter and uses first and second moments of the gradient to improve convergence.
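A minimal NumPy sketch of the Adam update for a single parameter vector, using the standard default hyperparameters \( \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8} \):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of the gradient (momentum).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of the squared gradient (RMSProp-style).
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Per-parameter adaptive update.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):
    grad = 2 * w                       # toy gradient of J(w) = ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```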
Learning Rate Scheduling:#
Definition: Gradually reducing the learning rate during training can improve convergence and prevent overshooting the optimal minimum.
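A minimal PyTorch sketch using a step-decay schedule; the step size and decay factor are arbitrary choices, and the training loop body is elided:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Multiply the learning rate by 0.1 every 30 epochs (arbitrary schedule).
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... training loop: forward pass, loss, backward pass ...
    optimizer.step()                # placeholder step so the example runs end to end
    scheduler.step()                # advance the schedule once per epoch
print(scheduler.get_last_lr())      # roughly [1e-06] after three decays
```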