Reference papers:
- 2015: Stage-wise training: An improved feature learning strategy for deep models
- 2018: Why does stagewise training accelerate convergence of testing error over SGD
Deep neural networks currently stand at the state of the art for many machine learning applications, yet there still remain limitations in the training of such networks because of their very high parameter dimensionality.
In brief: deep neural networks are currently state of the art for many machine learning applications, but their very high parameter dimensionality still makes them hard to train.
In this paper we show that network training performance can be improved using a stage-wise learning strategy, in which the learning process is broken down into a number of related sub-tasks that are completed stage-by-stage. The idea is to inject the information to the network gradually so that in the early stages of training the “coarse-scale” properties of the data are captured while the “finer-scale” characteristics are learned in later stages. Moreover, the solution found in each stage serves as a prior to the next stage, which produces a regularization effect and enhances the generalization of the learned representations.
In brief: the paper improves network training with a stage-wise training strategy that breaks learning into a number of related sub-tasks completed stage by stage. Information is injected into the network gradually, so the early stages capture the “coarse-scale” properties of the data while later stages learn the “finer-scale” characteristics; the result of each stage also serves as a prior for the next stage, which produces a regularization effect and improves generalization.
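A minimal sketch of this coarse-to-fine, stage-by-stage idea (my own illustration, not the paper's exact procedure); `build_model`, `make_loader`, the blur levels and the epoch counts are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def train_stagewise(build_model, make_loader, blur_levels=(4, 2, 1), epochs_per_stage=5):
    """Train one network in stages, injecting information gradually."""
    model = build_model()                       # the same network is reused across stages
    for stage, blur in enumerate(blur_levels):  # coarse -> fine ordering of the data
        loader = make_loader(blur=blur)         # stage-specific "scale" of the data
        opt = torch.optim.SGD(model.parameters(), lr=0.1 / (stage + 1))
        for _ in range(epochs_per_stage):
            for x, y in loader:
                opt.zero_grad()
                loss = F.cross_entropy(model(x), y)
                loss.backward()
                opt.step()
        # the solution found in this stage is kept as the starting point
        # (a prior) for the next, finer-scale stage
    return model
```

Each stage only changes what the network sees and how large the steps are; the weights are never reset, so earlier stages act as a prior on later ones.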
We show that decoupling the classifier layer from the feature extraction layers of the network is necessary, as it alleviates the diffusion of gradient and over-fitting problems. Experimental results in the context of image classification support these claims.
In brief: image-classification experiments show that decoupling the classifier layer from the feature-extraction layers of the network is necessary, as it alleviates the gradient diffusion (vanishing gradient) and over-fitting problems.
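One hedged illustration of what this decoupling can look like in code, assuming the network is already split into two hypothetical sub-modules, `features` and `classifier`: the classifier gets its own (larger) learning rate and can be re-initialized between stages, while the feature layers carry over.

```python
import torch.nn as nn
import torch.optim as optim

def decoupled_optimizers(features: nn.Module, classifier: nn.Module,
                         lr_features: float = 0.01, lr_classifier: float = 0.1):
    # Separate optimizers for the two parts of the network: the feature
    # extractor moves slowly, the classifier moves fast.
    opt_features = optim.SGD(features.parameters(), lr=lr_features)
    opt_classifier = optim.SGD(classifier.parameters(), lr=lr_classifier)
    return opt_features, opt_classifier

def reset_classifier(classifier: nn.Linear) -> None:
    # Re-initialize only the classifier layer at the start of a new stage,
    # leaving the learned feature layers untouched.
    nn.init.xavier_uniform_(classifier.weight)
    nn.init.zeros_(classifier.bias)
```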
Background notes:
- Gradient diffusion / vanishing and exploding gradients: as network depth increases, the error signal computed at each layer shrinks step by step, which is mainly due to the activation function. After the chain-rule products, even a large error at the output layer has very little effect on the first few layers, so their gradient updates become tiny and the gradients tend toward zero, i.e., gradient vanishing, which makes training ineffective. Gradient explosion is the opposite phenomenon.
- Gradient instability: in deep neural networks this refers to the vanishing or exploding gradients of the first few layers. The cause is that the gradient of an early layer is the product of the gradients of all later layers, so with too many layers the instability is intrinsic.
- Remedies (a small demo follows this list):
  ① replace sigmoid with activations such as ReLU;
  ② add BN (Batch Normalization) layers, which standardize each layer's inputs so that they concentrate around the mean with a small variance.
- Advantages of BN: (1) faster convergence; (2) controls over-fitting, so Dropout and weight regularization can be reduced or omitted; (3) lower sensitivity to weight initialization; (4) allows larger learning rates.
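A small self-contained demo (my own illustration, not taken from either paper) of the two points above: the gradient norm that reaches the first layer of a deep sigmoid MLP is typically many orders of magnitude smaller than with ReLU + BatchNorm.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth: int = 20, width: int = 64, act: str = "sigmoid") -> float:
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if act == "sigmoid":
            layers.append(nn.Sigmoid())
        else:                                   # ReLU + BatchNorm variant
            layers.append(nn.BatchNorm1d(width))
            layers.append(nn.ReLU())
    net = nn.Sequential(*layers)
    x = torch.randn(32, width)
    net(x).sum().backward()                     # back-propagate a dummy loss
    return net[0].weight.grad.norm().item()     # gradient reaching the first layer

print("sigmoid :", first_layer_grad_norm(act="sigmoid"))   # usually vanishingly small
print("relu+bn :", first_layer_grad_norm(act="relu_bn"))   # orders of magnitude larger
```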
Stagewise training strategy is commonly used for learning neural networks, which uses a stochastic algorithm (e.g., SGD) starting with a relatively large step size (aka learning rate) and geometrically decreasing the step size after a number of iterations. It has been observed that the stagewise SGD has much faster convergence than the vanilla SGD with a continuously decreasing step size in terms of both training error and testing error. But how to explain this phenomenon has been largely ignored by existing studies. This paper provides theoretical evidence for explaining this faster convergence.
In brief: stage-wise training is commonly used for training deep neural networks: run a stochastic algorithm (e.g., SGD) with a relatively large learning rate at first, then decrease the learning rate geometrically in later stages. Empirically, stagewise SGD converges much faster than vanilla SGD with a continuously decreasing step size, in both training error and testing error, yet existing studies have largely ignored why; this paper provides a theoretical explanation.
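A tiny sketch of the two step-size schedules being contrasted (all constants are placeholders): stagewise SGD keeps the step size constant within each stage and shrinks it geometrically between stages, while the vanilla baseline decreases it continuously at every iteration.

```python
def stagewise_lr(t: int, eta0: float = 0.1, stage_len: int = 1000, decay: float = 0.5) -> float:
    # constant within a stage, geometric decrease across stages
    return eta0 * decay ** (t // stage_len)

def vanilla_lr(t: int, eta0: float = 0.1) -> float:
    # one common continuously decreasing schedule
    return eta0 / (1 + t) ** 0.5
```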
In particular, we consider the stagewise training strategy for minimizing empirical risk that satisfies the Polyak-Łojasiewicz condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and “nice-behaviored” non-convex loss functions that are close to a convex function (namely weakly convex functions), we establish that the faster convergence of stagewise training than the vanilla SGD under the same condition on both training error and testing error lies on better dependence on the condition number of the problem.
In brief: the analysis considers stage-wise training for minimizing an empirical risk that satisfies the Polyak-Łojasiewicz condition, which has been observed/proved for neural networks and also holds for a broad family of convex functions. For convex loss functions and “well-behaved” non-convex losses that are close to a convex function (i.e., weakly convex functions), stage-wise training converges faster than vanilla SGD under the same conditions, for both training and testing error, and the speed-up comes from a better dependence on the condition number of the problem.
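For reference, one standard way of writing the Polyak-Łojasiewicz (PL) condition for a differentiable objective F with minimum value F_* (the paper's exact formulation may differ); the ratio L/μ of the smoothness constant to the PL constant plays the role of the condition number mentioned above:

```latex
\frac{1}{2}\,\bigl\lVert \nabla F(\mathbf{w}) \bigr\rVert^{2} \;\ge\; \mu \,\bigl( F(\mathbf{w}) - F_{*} \bigr)
\qquad \text{for all } \mathbf{w}, \quad \mu > 0 .
```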
Indeed, the proposed algorithm has additional favorable features that come with theoretical guarantee for the considered non-convex optimization problems, including using explicit algorithmic regularization at each stage, using stagewise averaged solution for restarting, and returning the last stagewise averaged solution as the final solution.
In brief: the proposed algorithm has additional favorable features that come with theoretical guarantees for the non-convex optimization problems considered, including explicit algorithmic regularization at each stage, restarting from the stage-wise averaged solution, and returning the last stage-wise averaged solution as the final result.
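A rough reconstruction of such a stagewise regularized procedure, pieced together from the description above rather than from the paper's published pseudocode; the stage length, step sizes and regularization weight `gamma` are placeholders:

```python
import torch

def start_train(model, loader, loss_fn, num_stages=5, iters_per_stage=1000,
                eta0=0.1, gamma=1.0):
    ref = [p.detach().clone() for p in model.parameters()]     # stage reference point
    for s in range(num_stages):
        eta = eta0 * 0.5 ** s                                   # geometric step-size decay
        avg = [p.detach().clone() for p in model.parameters()]  # stagewise average
        data = iter(loader)
        for t in range(iters_per_stage):
            try:
                x, y = next(data)
            except StopIteration:
                data = iter(loader)
                x, y = next(data)
            # explicit algorithmic regularization toward the stage reference point
            reg = sum(((p - r) ** 2).sum() for p, r in zip(model.parameters(), ref))
            loss = loss_fn(model(x), y) + 0.5 * gamma * reg
            model.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, a in zip(model.parameters(), avg):
                    p -= eta * p.grad                           # plain SGD step
                    a.mul_((t + 1) / (t + 2)).add_(p, alpha=1.0 / (t + 2))  # running average
        with torch.no_grad():
            for p, a in zip(model.parameters(), avg):
                p.copy_(a)                                      # restart from the stage average
        ref = [p.detach().clone() for p in model.parameters()]
    return model                                                # last stagewise averaged solution
```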
To differentiate from commonly used stagewise SGD, we refer to our algorithm as stagewise regularized training algorithm or Start. Of independent interest, the proved testing error bound of Start for a family of non-convex loss functions is dimensionality and norm independent.
In brief: to distinguish it from the commonly used stagewise SGD, the authors call their algorithm the “stagewise regularized training algorithm”, or Start. Also of independent interest, the proved testing-error bound of Start for a family of non-convex loss functions does not depend on the dimensionality or on the norm.