Training a machine learning model is not only about choosing the right architecture or features. It is also about how efficiently you can minimise the loss function so the model learns useful patterns. This is where optimisation algorithms come in. Among modern optimisers, Adam is one of the most widely used because it tends to converge quickly and works well across many deep learning and classical ML setups. If you are exploring optimisation as part of a data science course in Kolkata, understanding Adam will help you reason about model behaviour, training stability, and performance.
Why Optimisation Needs More Than Plain Gradient Descent
Vanilla gradient descent updates parameters by moving in the negative direction of the gradient. In practice, this approach struggles with real-world training conditions:
- Different features learn at different speeds: A single learning rate may be too large for some parameters and too small for others.
- Sparse gradients: In NLP or recommender systems, many parameters receive gradients infrequently.
- Noisy mini-batch updates: Stochastic training introduces variance, causing updates to zig-zag or overshoot.
- Ill-conditioned curvature: Loss landscapes can have steep directions and flat directions at the same time, slowing convergence.
These issues motivated optimisers that adapt learning rates per parameter and smooth out update directions.
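To make the baseline concrete, here is a minimal NumPy sketch (illustrative only) of plain gradient descent on a hypothetical ill-conditioned quadratic loss, where one shared learning rate has to serve both a steep and a flat direction:

```python
import numpy as np

# Plain gradient descent on f(theta) = 0.5 * theta^T A theta,
# where A is deliberately ill-conditioned (one steep, one flat direction).
A = np.diag([100.0, 1.0])

def loss_grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
lr = 0.01   # one shared learning rate: much larger and the steep direction diverges
for _ in range(100):
    theta -= lr * loss_grad(theta)

print(theta)  # the steep coordinate converges fast; the flat one is still far from 0
```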
Adagrad and RMSProp: The Building Blocks Adam Learns From
Adam is best understood as a combination of two influential ideas:
Adagrad: Per-Parameter Learning Rates for Sparse Features
Adagrad scales the learning rate for each parameter by the accumulated sum of squared gradients. Parameters that receive large gradients get smaller future steps, while parameters with small or rare gradients maintain larger steps. This is very useful when gradients are sparse.
Limitation: the accumulated sum keeps growing, so the effective learning rate keeps shrinking. Over long training runs, steps can become so small that learning nearly stops.
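A minimal Adagrad sketch, assuming a hypothetical grad_fn that returns the gradient at the current parameters, makes both the per-parameter scaling and the shrinking-step limitation visible:

```python
import numpy as np

def adagrad(grad_fn, theta, lr=0.1, eps=1e-8, steps=1000):
    """Illustrative Adagrad sketch; grad_fn(theta) is assumed to return the gradient."""
    accum = np.zeros_like(theta)                        # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        accum += g ** 2                                 # grows without bound...
        theta = theta - lr * g / (np.sqrt(accum) + eps) # ...so effective steps keep shrinking
    return theta
```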
RMSProp: Fixing Adagrad’s “Shrinking Too Much” Problem
RMSProp replaces Adagrad’s full accumulation with an exponential moving average of squared gradients. Instead of adding squared gradients forever, it “forgets” older gradients gradually. This keeps learning rates adaptive without decaying to near zero.
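The change relative to Adagrad is essentially one line: replace the growing sum with an exponential moving average. A hedged sketch, reusing the same hypothetical grad_fn as above:

```python
import numpy as np

def rmsprop(grad_fn, theta, lr=0.001, decay=0.9, eps=1e-8, steps=1000):
    """Illustrative RMSProp sketch; grad_fn(theta) is assumed to return the gradient."""
    avg_sq = np.zeros_like(theta)                        # exponential moving average of g^2
    for _ in range(steps):
        g = grad_fn(theta)
        avg_sq = decay * avg_sq + (1 - decay) * g ** 2   # gradually "forgets" old gradients
        theta = theta - lr * g / (np.sqrt(avg_sq) + eps)
    return theta
```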
In many practical training scenarios discussed in a data science course in Kolkata, this distinction matters: Adagrad may work well early, but RMSProp often remains stable for longer training.
How Adam Works: Momentum + RMSProp in One Optimiser
Adam stands for Adaptive Moment Estimation. It combines:
- Momentum-like behaviour via the moving average of gradients (the first moment), helping updates move consistently in promising directions.
- RMSProp-like scaling via the moving average of squared gradients (the second moment), adapting step sizes per parameter.
At each step $t$, given the gradient $g_t$:
- Compute the first moment estimate: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
- Compute the second moment estimate: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
Because $m_t$ and $v_t$ start at zero, they are biased towards zero early in training. Adam corrects this using bias correction:
- $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
- $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
Finally, the parameter update is:
$\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
Intuition:
- $\hat{m}_t$ acts like a smoothed direction (less noisy than raw gradients).
- $\sqrt{\hat{v}_t}$ acts like a per-parameter scale factor (smaller steps where gradients are consistently large).
- $\epsilon$ prevents division by zero and improves numerical stability.
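The update rules above translate almost line for line into code. The following NumPy sketch is illustrative (not a library implementation) and again assumes a hypothetical grad_fn:

```python
import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Illustrative Adam sketch following the equations above."""
    m = np.zeros_like(theta)          # first moment: moving average of gradients
    v = np.zeros_like(theta)          # second moment: moving average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Example: minimising f(theta) = sum(theta^2), whose gradient is 2 * theta.
print(adam(lambda th: 2 * th, np.array([5.0, -3.0]), steps=5000))  # approaches [0, 0]
```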
Practical Tips: Hyperparameters and Training Stability
Adam is popular partly because its defaults often work well:
- Learning rate $\alpha$: commonly 0.001
- $\beta_1$: commonly 0.9 (controls momentum strength)
- $\beta_2$: commonly 0.999 (controls smoothing of squared gradients)
- $\epsilon$: commonly 1e-8
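In a framework such as PyTorch, these defaults map directly onto the optimiser's constructor arguments; the linear model below is just a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)              # placeholder model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                          # alpha
    betas=(0.9, 0.999),               # beta_1 and beta_2
    eps=1e-8,
)
```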
Still, practical training benefits from a few best practices:
- Tune the learning rate first: If training diverges, reduce $\alpha$. If training is slow, increase it carefully.
- Use learning-rate schedules: Warm-up and decay can improve stability and final accuracy.
- Consider AdamW for regularisation: Adam’s original weight-decay behaviour can be unintuitive; AdamW decouples weight decay from gradient updates and often generalises better.
- Apply gradient clipping when needed: For RNNs or unstable deep networks, clipping can prevent exploding gradients. A short sketch combining these practices follows this list.
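As a sketch of how these tips fit together (PyTorch assumed; the model, data, and schedule lengths are placeholders), AdamW with a linear warm-up and gradient clipping might look like this:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

warmup_steps = 100                    # placeholder schedule length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear warm-up
)

loss_fn = nn.MSELoss()
for step in range(1000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # placeholder batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
```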
These are the kinds of configuration details learners often practise hands-on in a data science course in Kolkata, because they affect real model outcomes more than minor architectural tweaks.
When Adam Is (and Isn’t) the Best Choice
Adam is a strong default for:
- Deep neural networks (CNNs, transformers, MLPs)
- Problems with sparse or noisy gradients
- Rapid experimentation where quick convergence matters
However, there are caveats:
- In some tasks, SGD with momentum can yield better final generalisation, especially in vision benchmarks.
- Certain theoretical convergence guarantees are weaker than for simpler methods, which has led to variants such as AMSGrad.
A practical approach is to start with Adam for quick progress, then compare against SGD (or AdamW) when optimising final performance.
Conclusion
Adam is widely used because it blends two powerful ideas: Adagrad’s per-parameter adaptivity and RMSProp’s stable scaling, while also adding momentum and bias correction for smoother early training. Understanding how Adam builds on gradient history, in both direction and magnitude, helps you troubleshoot training issues and choose better hyperparameters. Whether you are experimenting with neural networks or tuning classical models, this knowledge is a useful foundation for anyone learning optimisation through a data science course in Kolkata.