I've been writing code in Python for about a decade now (7 years in academia, 3+ years in industry). I've also read a ton of Python code written by others. Python programmers who come from Java are often influenced by the Java style of programming (variable names, comment style, etc.). The same goes for C/C++, PHP, Swift, etc. programmers who code in Python.

Your programming style greatly influences the readability of the Python code you write. Since Python places strong emphasis on readability, it is important to properly vet your code (especially when you want to contribute it to a large project).

Passionate Python programmers need no reminder of PEP 20 (the Zen of Python) and PEP 8. These provide important guidelines on how to write Python code with a focus on simplicity and readability.

To properly incorporate these standards into your writing style, you can use linting (static code checking). I use pycodestyle frequently. Pylint is another linting tool, and it also rates your code (out of 10). These tools can be configured to enforce standard Python coding style practices in your code.

You can also configure them to run on your code from within your IDE/editor (such as VS Code or PyCharm).
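As a made-up before-and-after illustration of the kind of style these tools nudge you toward, here is the same tiny class written in a Java-influenced style and then in idiomatic PEP 8 style (the class and method names are my own examples, not from any real codebase):

```python
# Java-flavored version: camelCase names, getter methods, index-based loops.
class TemperatureLog:
    def __init__(self):
        self.readingsList = []

    def addReading(self, newReading):
        self.readingsList.append(newReading)

    def getMaxReading(self):
        # Manual loop where a built-in would do
        maxVal = self.readingsList[0]
        for i in range(len(self.readingsList)):
            if self.readingsList[i] > maxVal:
                maxVal = self.readingsList[i]
        return maxVal


# PEP 8 version: snake_case names, a property instead of a getter,
# and built-ins instead of manual index loops.
class TemperatureLogPep8:
    def __init__(self):
        self.readings = []

    def add_reading(self, reading):
        self.readings.append(reading)

    @property
    def max_reading(self):
        return max(self.readings)
```

Both behave identically; the second simply reads like Python rather than translated Java, which is exactly what a linter configured for PEP 8 will push you toward.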

# DeveloperStation.ORG

Learn n Share!!

## Thursday, January 16, 2020

## Wednesday, February 14, 2018

### Neural Network weight optimization algorithms

In this post, I write about different ways of updating neural network weights.

I'm writing this post from the notes I took for the class "Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization".

Specifically, I discuss optimizing neural network weights using non-traditional gradient descent algorithms.

Applying proper optimization techniques has a direct effect on the performance of the neural network.

Prerequisites for understanding this content:

- Neural Networks
- Gradient Descent
- Back Propagation
- Cost functions

In what follows, I assume the neural network weights θ are a 1D/2D vector.

The different ways to update the neural network weights are described below.

## Traditional Gradient Descent

- It is prone to large oscillations during its descent to the global/local optimum of the cost function, which hurts performance.
- To reduce these oscillations, we use (exponentially weighted) moving-average-like approaches to damp them out.
- This ensures a smoother transition to the global/local optimum while optimizing the cost function.

## Exponentially Weighted Moving Averages

Assume we have a 1D signal **V** that changes over time (like temperature) and is prone to large local variations. To correct for those quick and abrupt changes, we perform the following update at each time step:

V_{corrected, t} = (β * V_{corrected, t-1}) + ((1 - β) * V_{original, t})

In other words, V_{corrected} is approximately an average over the last (1.0 / (1.0 - β)) observations.

For example:

If β = 0.9, it is approximately an average over the last 10 observations.

If β = 0.98, it is approximately an average over the last 50 observations.

- A larger value of β gives smoother curves (as opposed to the zig-zag/abrupt movement observed in pure gradient descent).
- Bias correction for exponentially weighted moving averages does not significantly affect gradient-descent-style algorithms; it can be implemented if needed. It is, however, required in the Adam algorithm.

## Momentum

Now that we understand the basics of exponentially weighted moving averages, we can apply them in the weight update step of neural network optimization.

We update the weights by performing the following steps:

V_{dw} = (β * V_{dw}) + ((1 - β) * dw)

V_{db} = (β * V_{db}) + ((1 - β) * db)

Weight update step:

W = W - (α * V_{dw}) instead of W = W - (α * dw)

b = b - (α * V_{db}) instead of b = b - (α * db)

Usually β = 0.9.

- This weight update procedure gives us smoother convergence to the global/local minimum.
- The V_{dw}, V_{db} terms are derived from the exponentially weighted moving average equations.

## RMSProp

Root Mean Squared Propagation. Interestingly, this was proposed by Geoffrey Hinton in his Coursera course back in 2011-2012. While applying gradient descent, we update the weights by performing the following:

S_{dw} = (β * S_{dw}) + ((1 - β) * (dw)^{2})

S_{db} = (β * S_{db}) + ((1 - β) * (db)^{2})

Weight update step:

W = W - (α * (dw / (sqrt(S_{dw}) + ε))) instead of W = W - (α * dw)

b = b - (α * (db / (sqrt(S_{db}) + ε))) instead of b = b - (α * db)

The S_{dw}, S_{db} terms are derived from the exponentially weighted moving average equations.

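The RMSProp update divides each step by a running estimate of the gradient's magnitude. A minimal sketch on the same toy cost f(w) = (w - 3)^2 (cost function and hyperparameters are my own illustrative choices):

```python
import math

def grad(w):
    # Gradient of the toy cost f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

alpha, beta, eps = 0.1, 0.9, 1e-8
w, s_dw = 0.0, 0.0

for _ in range(200):
    dw = grad(w)
    # S_dw = beta * S_dw + (1 - beta) * dw^2  (EWMA of squared gradients)
    s_dw = beta * s_dw + (1.0 - beta) * dw ** 2
    # W = W - alpha * dw / (sqrt(S_dw) + eps)
    w = w - alpha * dw / (math.sqrt(s_dw) + eps)

print(round(w, 4))  # settles near the minimum at w = 3
```

Dividing by sqrt(S_dw) shrinks steps along directions with consistently large gradients and enlarges them where gradients are small, which is what damps the oscillations.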
## Adam (Adaptive Moment Estimation)

This is the most commonly used and popular optimization algorithm in the computer vision community.

The update algorithm is slightly changed compared to Momentum.

While applying gradient descent, we update the weights by performing the following:

From Momentum we have:

V_{dw} = (β1 * V_{dw}) + ((1.0 - β1) * dw)

V_{db} = (β1 * V_{db}) + ((1.0 - β1) * db)

We then perform bias correction:

V_{dw_corrected} = V_{dw} / (1.0 - (β1)^{t})

V_{db_corrected} = V_{db} / (1.0 - (β1)^{t})

From RMSProp we have:

S_{dw} = (β2 * S_{dw}) + ((1.0 - β2) * (dw)^{2})

S_{db} = (β2 * S_{db}) + ((1.0 - β2) * (db)^{2})

We again perform bias correction:

S_{dw_corrected} = S_{dw} / (1.0 - (β2)^{t})

S_{db_corrected} = S_{db} / (1.0 - (β2)^{t})

Finally, we perform the weight updates using the corrected terms:

W = W - (α * (V_{dw_corrected} / (sqrt(S_{dw_corrected}) + ε))) instead of W = W - (α * dw)

b = b - (α * (V_{db_corrected} / (sqrt(S_{db_corrected}) + ε))) instead of b = b - (α * db)
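Putting the two pieces together, Adam can be sketched on the same toy cost f(w) = (w - 3)^2 (cost function and hyperparameters are my own illustrative choices; β1 = 0.9, β2 = 0.999 are the commonly used defaults):

```python
import math

def grad(w):
    # Gradient of the toy cost f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, v_dw, s_dw = 0.0, 0.0, 0.0

for t in range(1, 301):                            # t starts at 1 for bias correction
    dw = grad(w)
    v_dw = beta1 * v_dw + (1.0 - beta1) * dw       # first moment (Momentum part)
    s_dw = beta2 * s_dw + (1.0 - beta2) * dw ** 2  # second moment (RMSProp part)
    v_hat = v_dw / (1.0 - beta1 ** t)              # bias correction
    s_hat = s_dw / (1.0 - beta2 ** t)
    w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)

print(round(w, 4))  # settles near the minimum at w = 3
```

The bias correction matters early on: at small t, v_dw and s_dw are biased toward their zero initialization, and dividing by (1 - β^t) rescales them.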
All of this reminds me of Kalman filters, where the observed signal value is not exactly correct: we model process noise and observation noise, and try to incorporate them to obtain an updated/corrected signal value from the sensor.

## Sunday, March 13, 2016

### Simple Autoencoder on MNIST dataset

So, I had fun with Theano and trained an autoencoder on the MNIST dataset.

An autoencoder is a simple neural network (with one hidden layer) that reproduces the input passed to it. By controlling the number of hidden neurons, we can learn interesting features from the input, and the data can be compressed as well (much like PCA). Autoencoders can be used for unsupervised feature learning. Data transformed using autoencoders can be used for supervised classification of datasets.

More about autoencoders is available here. More variants of autoencoders (sparse, contractive, etc.) exist, with different constraints on the hidden layer representation.

Here are the figures for digit 7 with hidden sizes 10 and 20 (the original data was the MNIST training dataset with 784-dimensional feature vectors). Each of the digits (whose value was 8) was passed to the autoencoder, and each of the hidden units was visualized (after computing the mean).


**I trained a vanilla autoencoder for 100 epochs with a mini-batch size of 16 and a learning rate of 0.01.**
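The original training used Theano. As a rough illustration of the idea (not the original code), here is a tiny one-hidden-layer autoencoder in plain Python, trained with backpropagation on a made-up 4-dimensional dataset instead of MNIST (all sizes and hyperparameters here are my own toy choices):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
n_in, n_hidden, lr = 4, 2, 0.5

# Encoder weights W1 (n_hidden x n_in), decoder weights W2 (n_in x n_hidden)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]

# Toy dataset: two repeating binary patterns (stand-in for MNIST digits)
data = [[1, 0, 1, 0], [0, 1, 0, 1]] * 50

def forward(x):
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(n_in))) for j in range(n_hidden)]
    y = [sigmoid(sum(W2[i][j] * h[j] for j in range(n_hidden))) for i in range(n_in)]
    return h, y

def mse(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

before = sum(mse(x, forward(x)[1]) for x in data) / len(data)

for epoch in range(100):
    for x in data:
        h, y = forward(x)
        # Output deltas: reconstruction error times sigmoid derivative
        dy = [(y[i] - x[i]) * y[i] * (1 - y[i]) for i in range(n_in)]
        # Hidden deltas, backpropagated through the decoder weights
        dh = [sum(dy[i] * W2[i][j] for i in range(n_in)) * h[j] * (1 - h[j])
              for j in range(n_hidden)]
        # Plain gradient descent updates
        for i in range(n_in):
            for j in range(n_hidden):
                W2[i][j] -= lr * dy[i] * h[j]
        for j in range(n_hidden):
            for i in range(n_in):
                W1[j][i] -= lr * dh[j] * x[i]

after = sum(mse(x, forward(x)[1]) for x in data) / len(data)
print(before, "->", after)  # reconstruction error drops with training
```

The 2-unit hidden layer is the "compression": the network must squeeze the 4-dimensional patterns through 2 numbers and reconstruct them, which is the same bottleneck idea as the 10- and 20-unit hidden layers on 784-dimensional MNIST above.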
