On the different regimes of stochastic gradient descent.

Abstract

Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size $B$, and the step size, or learning rate $\eta$. For small $B$ and large $\eta$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" $T \equiv \eta/B$. Yet this description is observed to break down for sufficiently large batches $B \geq B^*$, or to simplify to gradient descent (GD) when the temperature is sufficiently small. Understanding where these crossovers take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$\eta$ plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD, and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes (i) and (ii) scales with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.
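To make the role of $B$ and $\eta$ concrete, here is a minimal sketch of mini-batch SGD in a teacher-student perceptron setting of the kind described in the abstract. It is not the authors' code: the hinge loss, the dimensions, and all numerical values below are assumptions chosen for illustration. Varying $B$ and $\eta$ at fixed ratio $T = \eta/B$ keeps the SGD noise scale fixed in the noise-dominated regime.

```python
# Hypothetical illustration of mini-batch SGD on a teacher-student perceptron.
# Not the paper's implementation; loss, dimensions, and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

d = 100          # input dimension (assumed)
P = 10_000       # training-set size
eta = 0.05       # learning rate (step size)
B = 16           # batch size; effective temperature T = eta / B
steps = 2_000

# Teacher perceptron defines the labels; the student starts from random weights.
teacher = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((P, d))
y = np.sign(X @ teacher)

w = rng.standard_normal(d) / np.sqrt(d)

def hinge_grad(w, xb, yb):
    """Gradient of the batch-averaged hinge loss max(0, 1 - y * w.x)."""
    margins = yb * (xb @ w)
    active = margins < 1.0                      # samples inside the margin
    return -(yb[active, None] * xb[active]).sum(axis=0) / len(yb)

for t in range(steps):
    idx = rng.choice(P, size=B, replace=False)  # draw a mini-batch of size B
    w -= eta * hinge_grad(w, X[idx], y[idx])    # SGD step; noise amplitude ~ T = eta/B

overlap = w @ teacher / (np.linalg.norm(w) * np.linalg.norm(teacher))
print(f"T = eta/B = {eta/B:.4f}, student-teacher overlap = {overlap:.3f}")
```

Rerunning this sketch with, say, $(B, \eta)$ and $(2B, 2\eta)$ probes whether the dynamics depend only on the temperature $T = \eta/B$, which is expected to hold in the noise-dominated phase but to break down once $B$ exceeds the crossover scale $B^*$.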
