# Reducing the variance in online optimization by transporting past gradients

@inproceedings{Arnold2019ReducingTV, title={Reducing the variance in online optimization by transporting past gradients}, author={S{\'e}bastien M. R. Arnold and Pierre-Antoine Manzagol and Reza Babanezhad and Ioannis Mitliagkas and Nicolas Le Roux}, booktitle={NeurIPS}, year={2019} }

Most stochastic optimization methods use gradients once before discarding them. [...] In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online…
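The abstract above can be made concrete with a short sketch. The following is our reading of the IGT estimator on a noisy quadratic, assuming the anytime averaging weight γ_t = t/(t+1); the function names (`igt_sgd`, `grad_fn`) are illustrative, not from any released code.

```python
import numpy as np

def igt_sgd(grad_fn, theta0, lr=0.1, steps=200):
    """Sketch of SGD with the Implicit Gradient Transport (IGT) estimate.

    grad_fn(theta) returns a stochastic gradient. Each step evaluates the
    gradient at an extrapolated point so that the running average of past
    gradients remains an unbiased estimate at the current iterate for
    quadratic objectives.
    """
    theta = np.asarray(theta0, dtype=float)
    theta_prev = theta.copy()
    v = np.zeros_like(theta)
    for t in range(steps):
        gamma = t / (t + 1.0)  # anytime averaging weight gamma_t = t/(t+1)
        # Extrapolation that "transports" the averaged past gradients
        # forward to the current iterate (zero at t = 0).
        shift = (gamma / (1.0 - gamma)) * (theta - theta_prev)
        g = grad_fn(theta + shift)          # gradient at the shifted point
        v = gamma * v + (1.0 - gamma) * g   # transported-gradient average
        theta_prev = theta.copy()
        theta = theta - lr * v              # plain SGD step on the estimate
    return theta

# Usage: minimize f(theta) = 0.5 * ||theta||^2 from noisy gradients.
rng = np.random.default_rng(0)
noisy_grad = lambda th: th + 0.5 * rng.standard_normal(th.shape)
theta_final = igt_sgd(noisy_grad, [3.0, -2.0])
```

Because the averaging weights decay like 1/t, the noise in the estimate shrinks over time, which is the variance-reduction effect the paper describes; heavy-ball or Adam variants would simply feed `v` into their usual update instead of the plain step.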


#### 15 Citations

Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering

- Computer Science
- ArXiv
- 2020

It is proved that the model-based procedure converges in the noisy quadratic setting and can match the performance of well-tuned optimizers; ultimately, this is an interesting step toward constructing self-tuning optimizers.

Variance reduced optimization with implicit gradient transport

- Computer Science
- Knowl. Based Syst.
- 2021

The modified form of the implicit gradient transport (MIGT) technique is introduced into a classical stochastic variance reduced method, SVRG, which leads to a new method called SVRG-MIGT; it exploits second-order information to provide faster variance reduction, but without computing the Hessian explicitly.

AG-SGD: Angle-Based Stochastic Gradient Descent

- Computer Science
- IEEE Access
- 2021

An algorithm is proposed that quantifies this deviation based on the angle between the past and the current gradients; the angle is then used to calibrate the two gradients, generating a more accurate new gradient.

Understanding Accelerated Stochastic Gradient Descent via the Growth Condition

- Mathematics
- 2020

We study accelerated stochastic gradient descent through the lens of the growth condition. Stochastic gradient methods (SGD) with momentum, such as heavy ball (HB) and Nesterov's accelerated methods…

Variance Reduction in Deep Learning: More Momentum is All You Need

- Computer Science
- ArXiv
- 2021

The ubiquitous clustering structure of rich datasets used in deep learning is exploited to design a family of scalable variance reduced optimization procedures by combining existing optimizers with a multi-momentum strategy (Yuan et al., 2019).

Demon: Momentum Decay for Improved Neural Network Training

Momentum is a popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (DEMON) rule, motivated by decaying the total contribution of a gradient to all future…
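The decay rule described in this abstract can be written out explicitly. The sketch below is our reading of it: a gradient's total contribution to all future updates under momentum β is β/(1−β), and the schedule shrinks that total linearly to zero over training. The formula and name `demon_beta` are assumptions derived from that description, not taken from the paper's code.

```python
def demon_beta(t, total_steps, beta_init=0.9):
    """Momentum at step t under a linearly decaying total-contribution rule.

    A gradient contributes beta / (1 - beta) in total to all future updates;
    this schedule decays that total linearly from its initial value to 0.
    """
    frac = 1.0 - t / total_steps          # remaining fraction of training
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)

# The schedule starts at beta_init and reaches exactly 0 at the final step.
betas = [demon_beta(t, 100) for t in range(0, 101, 25)]
```

Note that the decay is not linear in β itself: β stays close to β_init for much of training and drops sharply near the end, which is the intended effect of decaying the *total* contribution linearly.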

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

- Computer Science, Mathematics
- ArXiv
- 2021

It is shown that after a suitable “burn-in” period, the objective value will monotonically decrease whenever the current iterate is not a critical point, which provides intuition behind the popular practice of learning rate “warm-up” and also yields a last-iterate guarantee.

ErrorCompensatedX: error compensation for variance reduced algorithms

- Computer Science
- ArXiv
- 2021

ErrorCompensatedX is proposed, which uses the compression error from the previous two steps to achieve the same asymptotic convergence rate as training without compression; a unified theoretical analysis framework is provided for this class of variance reduced algorithms, with or without error compensation.

Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent

- Computer Science
- ArXiv
- 2021

The proposed TSGD algorithm achieves faster training, higher accuracy, and better stability; a learning rate that decreases linearly with the iterations is used instead of a constant learning rate.

Randomized Iterative Methods for Linear Systems: Momentum, Inexactness and Gossip

- Computer Science, Mathematics
- ArXiv
- 2019

This thesis is interested in the development of randomized iterative methods for solving large-scale linear systems, stochastic quadratic optimization problems, the best approximation problem, and quadratic optimization problems.

#### References

Showing 1–10 of 54 references

Variance Reduced Stochastic Gradient Descent with Neighbors

- Computer Science, Mathematics
- NIPS
- 2015

This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase.

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics
- ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
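Since the IGT paper uses Adam as one of the drop-in hosts for its gradient estimate, the standard Adam step is worth recalling: bias-corrected estimates of the first and second moments scale each coordinate's step. A minimal NumPy sketch (not the reference implementation):

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Minimal Adam: adaptive steps from estimates of the first two moments."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)   # first-moment (mean) estimate
    v = np.zeros_like(theta)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)        # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Usage: deterministic quadratic f(theta) = 0.5 * theta^2, gradient theta.
theta_final = adam(lambda th: th, [3.0])
```

Replacing `g` with a transported-gradient estimate, as IGT proposes, leaves the rest of this update unchanged, which is what makes it a drop-in replacement.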

The Marginal Value of Adaptive Gradient Methods in Machine Learning

- Computer Science, Mathematics
- NIPS
- 2017

It is observed that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance, suggesting that practitioners should reconsider the use of adaptive methods to train neural networks.

Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2017

A novel analysis is developed in bounding these operators to characterize the excess risk of communication efficient parallelization schemes such as model-averaging/parameter mixing methods, which are of broader interest in analyzing computational aspects of stochastic approximation.

YellowFin and the Art of Momentum Tuning

- Mathematics, Computer Science
- MLSys
- 2019

This work revisits the momentum SGD algorithm and shows that hand-tuning a single learning rate and momentum makes it competitive with Adam, and designs YellowFin, an automatic tuner for momentum and learning rate in SGD.

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2017

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite…

Stop Wasting My Gradients: Practical SVRG

- Computer Science, Mathematics
- NIPS
- 2015

This work shows how to exploit support vectors to reduce the number of gradient computations in the later iterations of stochastic variance-reduced gradient methods and proves that the commonly-used regularized SVRG iteration is justified and improves the convergence rate.

On the importance of initialization and momentum in deep learning

- Computer Science
- ICML
- 2013

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
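The classical (heavy-ball) momentum update this paper studies, with a slowly increasing momentum coefficient, can be sketched as follows; the linear ramp used here is only an illustrative stand-in for the schedules discussed in the paper.

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, lr=0.05, mu_start=0.5, mu_end=0.9, steps=300):
    """Heavy-ball SGD with a momentum coefficient ramped up during training."""
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for t in range(steps):
        mu = mu_start + (mu_end - mu_start) * t / (steps - 1)  # linear ramp
        velocity = mu * velocity - lr * grad_fn(theta)  # accumulated descent direction
        theta = theta + velocity
    return theta

# Usage: deterministic quadratic f(theta) = 0.5 * theta^2, gradient theta.
theta_final = momentum_sgd(lambda th: th, [3.0])
```

Starting with a smaller momentum and ramping it up avoids large early oscillations while still getting the acceleration benefit of high momentum later, which is the intuition the paper's schedule is built on.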

Linear Convergence with Condition Number Independent Access of Full Gradients

- Mathematics, Computer Science
- NIPS
- 2013

This paper proposes to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function, and presents a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients.

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

- Computer Science, Mathematics
- AISTATS
- 2019

It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.