Main reference: Mu Li et al., Dive into Deep Learning (the Berkeley course textbook)



Abstract: MXNet in practice: reproducing classic convolutional neural network models with Gluon's Hybrid programming, including AlexNet, VGG-{11, 13, 16, 19}, ResNet-{50, 101, 152}, and DenseNet-{121, 161, 169, 201}. A complete training example on the Fashion-MNIST dataset is given at the end.


import mxnet as mx
from mxnet import nd, init
from mxnet.gluon import nn
from mxnet.gluon.block import HybridBlock

A few notes on the convolution/pooling operations:

  • Output size of a feature map: with input size $n$, kernel size $k$, padding $p$, and stride $s$, $n_{out} = \lfloor (n-k+2p)/s \rfloor + 1$ (verified in the sketch below).
  • A convolution kernel has shape $(N\times C\times K\times K)$: its channel count $C$ equals that of the input, the per-channel results at each position are summed during convolution, and the number of kernels $N$ gives the number of output channels.
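
A quick sanity check of the output-size formula (a minimal sketch; the layer is AlexNet's first convolution, the 3-channel input is an arbitrary example):

from mxnet import nd
from mxnet.gluon import nn

n, k, p, s = 224, 11, 2, 4   # input size, kernel, padding, stride
conv = nn.Conv2D(channels=64, kernel_size=k, strides=s, padding=p)
conv.initialize()
y = conv(nd.random.uniform(shape=(1, 3, n, n)))
print(y.shape)                    # (1, 64, 55, 55)
print((n - k + 2 * p) // s + 1)   # 55, matching the formula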

AlexNet

AlexNet made its debut in 2012. The model is named after the first author of the paper, Alex Krizhevsky [1]. AlexNet is an 8-layer convolutional neural network that won the ImageNet 2012 image-recognition challenge by a large margin. It demonstrated for the first time that learned features can surpass hand-designed ones, upending the prevailing approach in computer vision research.

class AlexNet(HybridBlock):
    def __init__(self, classes=1000, **kwargs):
        super(AlexNet, self).__init__(**kwargs)
        with self.name_scope():
            self.features = self._make_features()
            self.output = nn.Dense(classes)
    
    def hybrid_forward(self, F, x):
        print("F:", F)
        x = self.features(x)
        x = self.output(x)
        return x
    
    def _make_features(self):
        featurizer = nn.HybridSequential(prefix='')
        # Convolution block 1
        featurizer.add(nn.Conv2D(64, kernel_size=11, strides=4, padding=2, activation='relu'))
        featurizer.add(nn.MaxPool2D(pool_size=3, strides=2))
        # Convolution block 2
        featurizer.add(nn.Conv2D(192, kernel_size=5, padding=2, activation='relu'))
        featurizer.add(nn.MaxPool2D(pool_size=3, strides=2))
        # Convolution blocks 3, 4, 5
        featurizer.add(nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'))
        featurizer.add(nn.Conv2D(256, kernel_size=3, padding=1, activation='relu'))
        featurizer.add(nn.Conv2D(256, kernel_size=3, padding=1, activation='relu'))
        featurizer.add(nn.MaxPool2D(pool_size=3, strides=2))
        # Fully connected layers 6, 7
        featurizer.add(nn.Dense(4096, activation='relu'))
        featurizer.add(nn.Dropout(0.5))
        featurizer.add(nn.Dense(4096, activation='relu'))
        featurizer.add(nn.Dropout(0.5))
        return featurizer
net = AlexNet(classes=10)
net.initialize(init.Normal(sigma=0.01), force_reinit=True)
net.hybridize()
X = nd.random.uniform(shape=(1,1,224,224))
net(X)

Note: AlexNet expects 224×224 input images.
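
One payoff of hybridization is that the compiled graph can be saved and reloaded without the Python class definition (a minimal sketch; the "alexnet" file prefix is an arbitrary choice):

# After hybridize() and at least one forward pass, the symbolic graph exists:
net.export("alexnet", epoch=0)  # writes alexnet-symbol.json and alexnet-0000.params

# Reload as a SymbolBlock; the AlexNet class is no longer needed:
from mxnet.gluon import SymbolBlock
net2 = SymbolBlock.imports("alexnet-symbol.json", ['data'], "alexnet-0000.params")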


VGG

Although AlexNet showed that deep convolutional networks can achieve excellent results, it offered no simple rules to guide later researchers in designing new networks. VGG introduced the idea of building deep models by repeating simple basic blocks.

  • For a detailed introduction to VGGNet, see the blog post: 深度学习经典卷积神经网络之VGGNet
  • VGG-16 corresponds to configuration D in the table of the original paper.
  • The input is a $224\times224\times3$ image; every convolution uses a $3\times3$ kernel with stride=1 and padding=1, and pooling is $2\times2$ max pooling:
class VGG(HybridBlock):
    def __init__(self, layers=[2, 2, 3, 3, 3], filters=[64, 128, 256, 512, 512], classes=1000,**kwargs):
        super(VGG, self).__init__(**kwargs)
        assert len(layers) == len(filters)
        with self.name_scope():
            self.features = self._make_features(layers, filters)
            self.output = nn.Dense(classes)
    
    def hybrid_forward(self, F, x):
        print("F:", F)
        x = self.features(x)
        x = self.output(x)
        return x
    
    def _make_features(self, layers, filters):
        featurizer = nn.HybridSequential(prefix='')
        block_size = len(layers)
        # Convolution blocks: each stage stacks 3x3 convs, then halves the size with pooling
        for i in range(block_size):
            for j in range(layers[i]):
                featurizer.add(nn.Conv2D(filters[i], kernel_size=3, padding=1, activation="relu"))
            featurizer.add(nn.MaxPool2D(pool_size=2, strides=2))
        # Fully connected layers
        featurizer.add(nn.Dense(4096, activation='relu'))
        featurizer.add(nn.Dropout(0.5))
        featurizer.add(nn.Dense(4096, activation='relu'))
        featurizer.add(nn.Dropout(0.5))
        return featurizer
vgg_spec = {11: ([1, 1, 2, 2, 2], [64, 128, 256, 512, 512]),
            13: ([2, 2, 2, 2, 2], [64, 128, 256, 512, 512]),
            16: ([2, 2, 3, 3, 3], [64, 128, 256, 512, 512]),
            19: ([2, 2, 4, 4, 4], [64, 128, 256, 512, 512]),}
layers, filters = vgg_spec[16]
net = VGG(layers, filters, classes=1000)
net.initialize(init=init.Xavier(), force_reinit=True, ctx=mx.gpu())
# net.hybridize()  # left off: summary() must be called on a non-hybridized net
X = nd.random.uniform(shape=(1, 1, 224, 224), ctx=mx.gpu())
print(net(X))   # predictions
net.summary(X)  # layer-by-layer structure
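
To see the repeated-block design at work, walk the (un-hybridized) feature extractor layer by layer and watch the spatial size halve after each pooling stage (a minimal sketch, reusing the X above):

x = X
for layer in net.features:
    x = layer(x)
    if isinstance(layer, nn.MaxPool2D):
        print(layer.name, x.shape)  # height/width shrink 224 -> 112 -> 56 -> 28 -> 14 -> 7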

ResNet

If we add new layers to a neural network, does a fully trained model always achieve a lower training error? In theory, the solution space of the original model is a subspace of that of the new one: if the newly added layers could be trained into an identity mapping $f(x)=x$, the new model would be at least as effective as the original. Since the new model may find an even better fit to the training data, adding layers should, in principle, make it easier to reduce training error. In practice, however, adding too many layers often makes training error go up rather than down, and the problem persists even when batch normalization makes deep models numerically easier to train. To address this, Kaiming He et al. proposed the residual network (ResNet). It won the ImageNet 2015 image-recognition challenge and deeply influenced the design of later deep neural networks.


class Residual(HybridBlock):
    """Bottleneck residual block: 1x1 reduce, 3x3, 1x1 expand, plus a shortcut."""
    def __init__(self, num_channels, strides=1, use_1x1conv=False, **kwargs):
        super(Residual, self).__init__(**kwargs)
        self.body = nn.HybridSequential(prefix='')
        # 1x1 conv reduces channels (and downsamples when strides > 1)
        self.body.add(nn.Conv2D(num_channels//4, kernel_size=1, strides=strides),
                      nn.BatchNorm(),
                      nn.Activation('relu'),)
        self.body.add(nn.Conv2D(num_channels//4, kernel_size=3, strides=1, padding=1),
                      nn.BatchNorm(),
                      nn.Activation('relu'),)
        # 1x1 conv restores the full channel count
        self.body.add(nn.Conv2D(num_channels, kernel_size=1, strides=1),
                      nn.BatchNorm(),)
        if use_1x1conv:
            # Projection shortcut: match the body's output channels and resolution
            self.shortcut = nn.HybridSequential(prefix='')
            self.shortcut.add(nn.Conv2D(num_channels, kernel_size=1, strides=strides),
                              nn.BatchNorm(),)
        else:
            self.shortcut = None

    def hybrid_forward(self, F, x):
        y = self.body(x)
        if self.shortcut:
            x = self.shortcut(x)
        x = F.Activation(y+x, act_type='relu')
        return x
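
A quick shape check on the block (a minimal sketch; the channel counts and 56×56 input are arbitrary examples):

blk = Residual(64)
blk.initialize()
X = nd.random.uniform(shape=(1, 64, 56, 56))
print(blk(X).shape)   # (1, 64, 56, 56): identity shortcut, shape preserved

blk = Residual(128, strides=2, use_1x1conv=True)
blk.initialize()
print(blk(X).shape)   # (1, 128, 28, 28): projection shortcut halves the resolution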

class ResNet(HybridBlock):
    def __init__(self, layers=[3, 4, 6, 3], filters=[256, 512, 1024, 2048], classes=1000, **kwargs):
        super(ResNet, self).__init__(**kwargs)
        assert len(layers) == len(filters)
        with self.name_scope():
            self.features = self._make_features(layers, filters)
            self.output = nn.Dense(classes)

    def hybrid_forward(self, F, x):
        print("F:", F)
        x = self.features(x)
        x = self.output(x)
        return x

    def _make_features(self, layers, filters):
        featurizer = nn.HybridSequential(prefix='')
        # Stem: 7x7 conv, BN, ReLU, max pooling
        featurizer.add(nn.Conv2D(channels=64, kernel_size=7, strides=2, padding=3),
                       nn.BatchNorm(),
                       nn.Activation('relu'),
                       nn.MaxPool2D(pool_size=3, strides=2, padding=1),)
        # Stages of bottleneck residual blocks; the first block of each stage
        # projects the shortcut (and downsamples from stage 2 onward)
        for i in range(len(layers)):
            strides = 2 if i else 1
            featurizer.add(Residual(filters[i], strides=strides, use_1x1conv=True))
            for _ in range(layers[i]-1):
                featurizer.add(Residual(filters[i], strides=1))
        # Global average pooling
        featurizer.add(nn.GlobalAvgPool2D())
        return featurizer
resnet_spec = { 50: ([3, 4,  6, 3], [256, 512, 1024, 2048]),
               101: ([3, 4, 23, 3], [256, 512, 1024, 2048]),
               152: ([3, 8, 36, 3], [256, 512, 1024, 2048]),}
layers, filters = resnet_spec[50]
net = ResNet(layers, filters, classes=10)
net.initialize(init.Normal(sigma=0.01), force_reinit=True)
net.hybridize()
X = nd.random.uniform(shape=(1,1,224,224))
net(X)

The networks are getting deep, so the full layer-by-layer structure is no longer shown here; a parameter count (see the sketch below) gives a quick sense of model size.
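
A minimal parameter-count sketch (it assumes the forward pass above has already triggered shape inference):

num_params = sum(p.data().size for p in net.collect_params().values())
print(f"{num_params/1e6:.1f}M parameters")  # ResNet-50 with 1000 classes is commonly quoted at ~25.6M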


DenseNet-BC

ResNet is a milestone in the history of CNNs. Its core idea is the "shortcut" (skip connection) between earlier and later layers; this cross-layer design helps gradients propagate backward during training, which makes deeper networks trainable and yields higher accuracy. It spawned a number of follow-up works, one of which is DenseNet.
The main difference from ResNet is that in DenseNet the output of a block is concatenated (concat) with its input along the channel dimension instead of being added element-wise. Moreover, each layer takes as input the channel-wise concatenation of the outputs of all preceding layers (whose feature maps share the same spatial size), which enables feature reuse. Compared with ResNet this is a "dense connection" pattern, hence the name DenseNet. These properties let DenseNet outperform ResNet with fewer parameters and less computation, and the paper won the CVPR 2017 Best Paper Award.
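
The difference between concatenation and addition is easy to see on toy tensors (a minimal sketch):

x = nd.ones((1, 64, 8, 8))
y = nd.ones((1, 32, 8, 8))
print(nd.concat(x, y, dim=1).shape)  # (1, 96, 8, 8): DenseNet-style, channels accumulate
print((x + x).shape)                 # (1, 64, 8, 8): ResNet-style, channels unchanged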


DenseNet's dense connectivity requires the concatenated feature maps to have the same spatial size. To still allow downsampling, the network is built from dense blocks (DenseBlock) joined by transition layers (Transition): a DenseBlock is a module of many densely connected layers whose feature maps share one size, while a Transition connects two adjacent DenseBlocks and shrinks the feature maps via pooling. A typical DenseNet contains 4 DenseBlocks, linked one to the next by Transitions.

For a detailed walkthrough of DenseNet, see the blog post: DenseNet比ResNet更优的CNN模型


Every layer in a DenseBlock outputs $k$ feature maps, i.e. its output has $k$ channels (equivalently, it uses $k$ convolution kernels). This $k$ is called the growth rate in DenseNet and is a hyperparameter; a fairly small value (e.g. $k=12$) usually already gives good performance. If the block's input has $k_0$ channels, then the $l$-th dense layer receives $k_0 + k(l-1)$ input channels, so even with a small $k$ the inputs deep inside a DenseBlock become very wide.
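
For instance, in the first DenseBlock of DenseNet-121 ($k_0=64$, $k=32$, 6 layers), the input widths work out as follows (a minimal sketch):

k0, k, num_layers = 64, 32, 6
for l in range(1, num_layers + 1):
    print(f"layer {l}: {k0 + k * (l - 1)} input channels")
# 64, 96, 128, 160, 192, 224; the block's final output has 64 + 6*32 = 256 channels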


Because these later inputs grow so large, a bottleneck structure inserts a 1x1 Conv to reduce the number of features and improve efficiency: each dense layer becomes BN + ReLU + 1x1 Conv + BN + ReLU + 3x3 Conv, a variant called DenseNet-B. The Transition layer becomes BN + ReLU + 1x1 Conv + 2x2 AvgPooling and can additionally compress the model: if a transition's input has $m$ channels, its convolution outputs $\theta m$ channels, where $\theta \in (0,1]$ is the compression rate; this variant is DenseNet-C, and the paper uses $\theta = 0.5$.
Combining bottleneck dense layers with transitions whose compression rate is below 1 yields the DenseNet-BC structure.
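
Tracking channels across DenseNet-121's four blocks with $\theta = 0.5$ (a minimal sketch, mirroring the channel bookkeeping in _make_features below):

channels, k, theta = 64, 32, 0.5
for i, num_layers in enumerate([6, 12, 24, 16]):
    channels += num_layers * k   # widened by the DenseBlock
    if i < 3:                    # no transition after the last block
        channels = int(channels * theta)
print(channels)                  # 1024: the width DenseNet-121 feeds into its final pooling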

class DenseLayer(HybridBlock):
    """Basic unit of a DenseBlock (with bottleneck):
    BN + ReLU + 1x1 Conv + BN + ReLU + 3x3 Conv.

    growth_rate : number of output channels (the growth rate k).
    bn_size     : multiplicative factor for the bottleneck width; the 1x1 conv
                  outputs bn_size * growth_rate channels (the paper uses 4).
    """
    def __init__(self, growth_rate, bn_size, **kwargs):
        super(DenseLayer, self).__init__(**kwargs)
        self.body = nn.HybridSequential(prefix='')
        self.body.add(nn.BatchNorm(),
                      nn.Activation('relu'),
                      nn.Conv2D(bn_size*growth_rate, kernel_size=1, strides=1),
                      nn.BatchNorm(),
                      nn.Activation('relu'),
                      nn.Conv2D(growth_rate, kernel_size=3, strides=1, padding=1),)

    def hybrid_forward(self, F, x):
        y = self.body(x)
        # Concatenate the new features with the input along the channel dimension
        x = F.concat(y, x, dim=1)
        return x
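
A quick shape check (a minimal sketch; the 64-channel 56×56 input is an arbitrary example): each call adds growth_rate channels while the spatial size stays fixed.

layer = DenseLayer(growth_rate=32, bn_size=4)
layer.initialize()
X = nd.random.uniform(shape=(1, 64, 56, 56))
print(layer(X).shape)   # (1, 96, 56, 56): 64 input channels + 32 new ones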

def transition_layer(num_channels):
    """ BN + ReLU + 1x1Conv + 2x2AvgPooling """
    layer = nn.HybridSequential()
    layer.add(nn.BatchNorm(), 
              nn.Activation('relu'),
              nn.Conv2D(num_channels, kernel_size=1, strides=1),
              nn.AvgPool2D(pool_size=2, strides=2),)
    return layer
    

class DenseNet(HybridBlock):
    def __init__(self, num_channels_input=64, growth_rate=32, layers=(6, 12, 24, 16), compression_rate=0.5, bn_size=4, classes=1000, **kwargs):
        super(DenseNet, self).__init__(**kwargs)
        with self.name_scope():
            self.features = self._make_features(num_channels_input, growth_rate, compression_rate, layers, bn_size=bn_size)
            self.output = nn.Dense(classes)

    def hybrid_forward(self, F, x):
        print("F:", F)
        x = self.features(x)
        x = self.output(x)
        return x

    def _make_features(self, num_channels_input, growth_rate, compression_rate, layers, bn_size=4):
        featurizer = nn.HybridSequential(prefix='')
        # Stem: 7x7 conv, BN, ReLU, max pooling
        featurizer.add(nn.Conv2D(num_channels_input, kernel_size=7, strides=2, padding=3),
                       nn.BatchNorm(),
                       nn.Activation('relu'),
                       nn.MaxPool2D(pool_size=3, strides=2, padding=1),)
        # Alternate DenseBlocks and transition layers; the last DenseBlock has no transition
        num_channels = num_channels_input
        group_num = len(layers)
        for i in range(group_num):
            for j in range(layers[i]):
                featurizer.add(DenseLayer(growth_rate, bn_size))
            if i + 1 < group_num:
                num_channels += layers[i] * growth_rate
                num_channels = int(num_channels * compression_rate)
                featurizer.add(transition_layer(num_channels))
        # Final BN + ReLU and 7x7 average pooling
        featurizer.add(nn.BatchNorm(),
                       nn.Activation('relu'),
                       nn.AvgPool2D(pool_size=7),)
        return featurizer
densenet_spec = {121: (64, 32, [6, 12, 24, 16]),
                 161: (96, 48, [6, 12, 36, 24]),
                 169: (64, 32, [6, 12, 32, 32]),
                 201: (64, 32, [6, 12, 48, 32]),}
num_channels_input, growth_rate, layers = densenet_spec[121]
net = DenseNet(num_channels_input, growth_rate, layers, classes=10)
net.initialize(init.Normal(sigma=0.01), force_reinit=True)
net.hybridize()
X = nd.random.uniform(shape=(1,1,224,224))
net(X)

A complete hybrid-programming example

import os,sys
import mxnet as mx
from mxnet import nd, autograd, init, gluon
from mxnet.gluon import nn
from mxnet.gluon.block import HybridBlock
from mxnet.gluon import utils as gutils
from mxnet.gluon import data as gdata
from mxnet.gluon import loss as gloss
from time import time

# 构造模型
class AlexNet(HybridBlock):
    def __init__(self, classes=1000, **kwargs):
        super(AlexNet, self).__init__(**kwargs)
        with self.name_scope():
            self.features = self._make_features()
            self.output = nn.Dense(classes)
    
    def hybrid_forward(self, F, x):
        print("F:", F)
        x = self.features(x)
        x = self.output(x)
        return x
    
    def _make_features(self):
        featurizer = nn.HybridSequential(prefix='')
        # Convolution block 1
        featurizer.add(nn.Conv2D(64, kernel_size=11, strides=4, padding=2, activation='relu'))
        featurizer.add(nn.MaxPool2D(pool_size=3, strides=2))
        # Convolution block 2
        featurizer.add(nn.Conv2D(192, kernel_size=5, padding=2, activation='relu'))
        featurizer.add(nn.MaxPool2D(pool_size=3, strides=2))
        # Convolution blocks 3, 4, 5
        featurizer.add(nn.Conv2D(384, kernel_size=3, padding=1, activation='relu'))
        featurizer.add(nn.Conv2D(256, kernel_size=3, padding=1, activation='relu'))
        featurizer.add(nn.Conv2D(256, kernel_size=3, padding=1, activation='relu'))
        featurizer.add(nn.MaxPool2D(pool_size=3, strides=2))
        # Fully connected layers 6, 7
        featurizer.add(nn.Dense(4096, activation='relu'))
        featurizer.add(nn.Dropout(0.5))
        featurizer.add(nn.Dense(4096, activation='relu'))
        featurizer.add(nn.Dropout(0.5))
        return featurizer
    
def try_gpu():
    """If GPU is available, return mx.gpu(0); else return mx.cpu()."""
    try:
        ctx = mx.gpu()
        _ = nd.array([0], ctx=ctx)
    except mx.base.MXNetError:
        ctx = mx.cpu()
    return ctx
    
# Load the dataset (assumes Fashion-MNIST has been extracted into train/ and test/ image folders)
dataset_dir = "~/.mxnet/datasets/fashion-mnist"
trainset = gdata.vision.ImageFolderDataset(os.path.join(dataset_dir, 'train'))
testset = gdata.vision.ImageFolderDataset(os.path.join(dataset_dir, 'test'))

# Image preprocessing: resize to the 224x224 the model expects, convert, normalize
transformer_train = gdata.vision.transforms.Compose([
    gdata.vision.transforms.Resize(224),
    gdata.vision.transforms.ToTensor(),
    gdata.vision.transforms.Normalize(),])
trainset = trainset.transform_first(transformer_train)
transformer_test = gdata.vision.transforms.Compose([
    gdata.vision.transforms.Resize(224),  # test images must be resized as well
    gdata.vision.transforms.ToTensor(),
    gdata.vision.transforms.Normalize(),])
testset = testset.transform_first(transformer_test)

# Hyperparameters
batch_size = 128
num_epochs = 100
lr = 0.01
lr_period = 20
lr_decay = 0.1
ctx = try_gpu()

# Model initialization
net = AlexNet(classes=10)
net.initialize(init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)
net.hybridize()

# Training: build mini-batch loaders, a trainer, and a loss function, then iterate over epochs
num_workers = 0 if sys.platform.startswith('win') else 4
train_iter = gdata.DataLoader(trainset, batch_size, shuffle=True, num_workers=num_workers)
test_iter = gdata.DataLoader(testset, batch_size, shuffle=False, num_workers=num_workers)
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gloss.SoftmaxCrossEntropyLoss()
for epoch in range(num_epochs):
    loss_sum, accu_sum, n, start = 0.0, 0.0, 0, time()
    if epoch > 0 and epoch % lr_period == 0:
        trainer.set_learning_rate(trainer.learning_rate * lr_decay)
    for X, y in train_iter:
        X, y = X.as_in_context(ctx), y.as_in_context(ctx)
        with autograd.record():
            y = y.astype("float32")
            output = net(X)
            l = loss(output, y).sum()
        l.backward()
        trainer.step(batch_size)
        loss_sum += l.asscalar()
        accu_sum += (y==output.argmax(axis=1)).sum().asscalar()
        n += y.size
    print(f"epoch {epoch+1:2d}, lr = {trainer.learning_rate}, loss = {loss_sum/n:.3f}, train accuracy = {accu_sum/n:.3f}, {time()-start:.1f} sec")

Training results:
F: <module 'mxnet.symbol' from 'miniconda3/lib/python3.7/site-packages/mxnet/symbol/__init__.py'>
epoch 1, lr 0.01, loss 886.381, train accuracy 0.518, 2838.7 sec
epoch 2, lr 0.01, loss 0.733, train accuracy 0.729, 3185.1 sec
epoch 3, lr 0.01, loss 0.644, train accuracy 0.765, 3276.1 sec
epoch 4, lr 0.01, loss 0.592, train accuracy 0.785, 3267.6 sec
epoch 5, lr 0.01, loss 0.556, train accuracy 0.798, 3264.2 sec
epoch 6, lr 0.01, loss 0.549, train accuracy 0.802, 3266.0 sec
epoch 7, lr 0.01, loss 0.550, train accuracy 0.801, 3263.2 sec
epoch 8, lr 0.01, loss 0.548, train accuracy 0.802, 3274.2 sec
epoch 9, lr 0.01, loss 0.523, train accuracy 0.809, 3268.4 sec
epoch 10, lr 0.01, loss 0.523, train accuracy 0.809, 3275.2 sec
epoch 11, lr 0.01, loss 0.556, train accuracy 0.804, 3273.2 sec
epoch 12, lr 0.01, loss 0.590, train accuracy 0.793, 3267.5 sec
epoch 13, lr 0.01, loss 0.673, train accuracy 0.761, 3266.4 sec