Adaptive Subgradient Method¶
The adaptive subgradient method (AdaGrad) [Duchi2011] follows the algorithmic framework of an iterative solver with the algorithm-specific transformation \(T\), the set of intrinsic parameters \(S_t\) defined by the learning rate \(\eta\), and the algorithm-specific vector \(U\) and power \(d\) of Lebesgue space defined as follows:
\(T(\theta_{t - 1}, g(\theta_{t - 1}), S_{t - 1})\):
\(G_{t, i} = G_{t - 1, i} + g_i^2(\theta_{t - 1})\), where \(g_i(\theta_{t - 1})\) is the \(i\)-th coordinate of the gradient \(g(\theta_{t - 1})\)
\(\theta_t = \theta_{t - 1} - \frac {\eta}{\sqrt{G_t + \varepsilon}} g(\theta_{t - 1})\), where
\[\frac {\eta}{\sqrt{G_t + \varepsilon}} g(\theta_{t - 1}) = \left\{\frac {\eta}{\sqrt{G_{t, 1} + \varepsilon}} g_1(\theta_{t - 1}), \ldots, \frac {\eta}{\sqrt{G_{t, p} + \varepsilon}} g_p(\theta_{t - 1})\right\}\]
Convergence check: \(U = g(\theta_{t - 1}), d = 2\)
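To make the transformation \(T\) concrete, here is a minimal standalone C++ sketch (independent of the library) that applies the update above to a toy quadratic objective; the objective, starting point, and constants are chosen for illustration only:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Minimize f(theta) = theta_1^2 + 10 * theta_2^2 with the AdaGrad update.
    std::vector<double> theta = {5.0, 5.0};   // starting point theta_0
    std::vector<double> G(theta.size(), 0.0); // accumulated squared gradients G_t
    const double eta = 1.0;                   // learning rate
    const double eps = 1e-8;                  // degenerate-cases threshold

    for (int t = 0; t < 500; ++t) {
        // Gradient of f at the current point.
        std::vector<double> g = {2.0 * theta[0], 20.0 * theta[1]};
        for (std::size_t i = 0; i < theta.size(); ++i) {
            G[i] += g[i] * g[i];                            // G_{t,i} = G_{t-1,i} + g_i^2
            theta[i] -= eta / std::sqrt(G[i] + eps) * g[i]; // coordinate-wise step
        }
    }
    std::printf("theta = (%g, %g)\n", theta[0], theta[1]); // approaches the minimum (0, 0)
    return 0;
}
```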
Computation¶
The adaptive subgradient (AdaGrad) method is a special case of an iterative solver. For parameters, input, and output of iterative solvers, see Computation for Iterative Solver.
Algorithm Input¶
In addition to the input of the iterative solver, the AdaGrad method accepts the following optional input:
OptionalDataID | Input
---|---
`gradientSquareSum` | A numeric table of size \(p \times 1\) with the values of \(G_t\). Each value is an accumulated sum of squares of coordinate values of a corresponding gradient.
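When the accumulated sums \(G_t\) from a previous run are supplied through this optional input, the solver continues the accumulation instead of starting from zero. Below is a hedged fragment of how this might look through the iterative solver's optional-result mechanism; the identifiers `optionalArgument`, `optionalResult`, and `ResultPtr` are assumed from that interface and may differ between library versions:

```cpp
#include "daal.h"

using namespace daal::algorithms;

// Assumed helper (not part of the library): restart AdaGrad from a previous
// result so that the accumulated squared gradients G_t travel along as the
// gradientSquareSum optional data.
void resumeAdagrad(optimization_solver::adagrad::Batch<>& solver,
                   const optimization_solver::iterative_solver::ResultPtr& previous) {
    // Start from the minimum found so far...
    solver.input.set(optimization_solver::iterative_solver::inputArgument,
                     previous->get(optimization_solver::iterative_solver::minimum));
    // ...and carry over the solver state, including gradientSquareSum.
    solver.input.set(optimization_solver::iterative_solver::optionalArgument,
                     previous->get(optimization_solver::iterative_solver::optionalResult));
    solver.compute();
}
```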
Algorithm Parameters¶
In addition to parameters of the iterative solver, the AdaGrad method has the following parameters:
Parameter | Default Value | Description
---|---|---
`algorithmFPType` | `float` | The floating-point type that the algorithm uses for intermediate computations. Can be `float` or `double`.
`method` | `defaultDense` | Default performance-oriented computation method.
`batchIndices` | `NULL` | A numeric table of size \(\text{nIterations} \times \text{batchSize}\) for the `defaultDense` method that represents 32-bit integer indices of terms in the objective function. If no indices are provided, the algorithm generates random indices.
`batchSize` | \(128\) | The number of batch indices used to compute the stochastic gradient. If `batchSize` equals the number of terms in the objective function, no random sampling is performed, and all terms are used to compute the gradient. The algorithm ignores this parameter if the `batchIndices` parameter is provided.
`learningRate` | A numeric table of size \(1 \times 1\) that contains the default step length equal to \(0.01\). | A numeric table of size \(1 \times 1\) that contains the value of learning rate \(\eta\). Note: This parameter can be an object of any class derived from `NumericTable`, except for `PackedTriangularMatrix`, `PackedSymmetricMatrix`, and `CSRNumericTable`.
`degenerateCasesThreshold` | \(1\mathrm{e}{-08}\) | Value \(\varepsilon\) needed to avoid degenerate cases when computing square roots.
`engine` | `SharedPtr<engines::mt19937::Batch>()` | Pointer to the random number generator engine that is used internally for generation of 32-bit integer indices of terms in the objective function.
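A short, hedged sketch of overriding some of these defaults on a constructed solver follows; the field names match the table above, and the `HomogenNumericTable` constructor pattern mirrors the one used in the library's samples, but exact signatures should be verified against the installed release:

```cpp
#include "daal.h"

using namespace daal::algorithms;
using namespace daal::data_management;

// Sketch only: adjust the solver parameters listed in the table above.
void configureAdagrad(optimization_solver::adagrad::Batch<>& solver) {
    solver.parameter.batchSize = 256;                 // mini-batch size for the stochastic gradient
    solver.parameter.degenerateCasesThreshold = 1e-6; // epsilon under the square root
    // learningRate is a 1 x 1 numeric table holding eta (default 0.01).
    solver.parameter.learningRate = NumericTablePtr(
        new HomogenNumericTable<>(1, 1, NumericTable::doAllocate, 0.05f));
}
```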
Algorithm Output¶
In addition to the output of the iterative solver, the AdaGrad method calculates the following optional result:
OptionalDataID | Output
---|---
`gradientSquareSum` | A numeric table of size \(p \times 1\) with the values of \(G_t\). Each value is an accumulated sum of squares of coordinate values of a corresponding gradient.
Examples¶
Note
There is no support for Java on GPU.
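As a condensed illustration, below is a hedged sketch of a batch AdaGrad run over an MSE objective, adapted from the pattern of the library's C++ examples; the in-memory data, starting point, and parameter values are invented for illustration, and constructor signatures and identifier names may differ between library versions:

```cpp
#include "daal.h"

using namespace daal;
using namespace daal::algorithms;
using namespace daal::data_management;

int main() {
    const size_t nFeatures = 3, nVectors = 4;

    // Toy in-memory data: nVectors observations with nFeatures features each,
    // plus one dependent variable per observation.
    float x[] = {1, 2, 3,  4, 5, 6,  7, 8, 9,  10, 11, 12};
    float y[] = {1, 2, 3, 4};
    NumericTablePtr data(new HomogenNumericTable<float>(x, nFeatures, nVectors));
    NumericTablePtr dependents(new HomogenNumericTable<float>(y, 1, nVectors));

    // MSE objective function over nVectors terms.
    services::SharedPtr<optimization_solver::mse::Batch<> > objective(
        new optimization_solver::mse::Batch<>(nVectors));
    objective->input.set(optimization_solver::mse::data, data);
    objective->input.set(optimization_solver::mse::dependentVariables, dependents);

    // AdaGrad solver minimizing that objective from a zero starting point
    // (intercept plus nFeatures coefficients).
    optimization_solver::adagrad::Batch<> solver(objective);
    float start[] = {0, 0, 0, 0};
    solver.input.set(optimization_solver::iterative_solver::inputArgument,
                     NumericTablePtr(new HomogenNumericTable<float>(start, 1, nFeatures + 1)));
    solver.parameter.nIterations = 1000;
    solver.parameter.accuracyThreshold = 1e-5;
    solver.compute();

    // Retrieve the minimum found; gradientSquareSum is available through the
    // solver's optional result, as described in Algorithm Output above.
    NumericTablePtr minimum =
        solver.getResult()->get(optimization_solver::iterative_solver::minimum);
    return 0;
}
```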