Following the previous section's introduction to the Kubernetes Device Plugin, this section walks through several community Device Plugin implementations, including

  • NVIDIA Device Plugin
  • Intel SRIOV Network Device Plugin
  • AliyunContainerService GPUShare

to explore how Device Plugins are designed, used, and structured in code.

Review

To recap the previous section's introduction to Device Plugins: the Device Manager is the Kubernetes community's unified, pluggable mechanism offered to device vendors, available as a beta feature since Kubernetes v1.11.
Through its two building blocks, Extended Resources and the Device Plugin, it lets users define how a given device resource is managed on Kubernetes; that management covers only resource discovery, health checking, and allocation, and does not include topology management of heterogeneous nodes or the collection of monitoring data.

In terms of implementation, a Device Plugin is a gRPC server running on the same Node as the Kubelet. It communicates with the Kubelet's gRPC server over a Unix socket using the (simplified) API below, and handles registration, discovery, allocation, and removal of the corresponding device resource on that Node.
ListAndWatch() is responsible for discovery and watching of the device resource; Allocate() is responsible for allocating it.

service DevicePlugin {
    // returns a stream of []Device
    rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
    rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}
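
To make the shape of this interface more concrete, here is a minimal plugin-side sketch of the two RPCs in Go, written against the kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1 package that is discussed later in this section. The resource name example.com/foo, the fooPlugin type, the update channel, and the FOO_VISIBLE_DEVICES variable are illustrative assumptions, not part of any real plugin.

package fooplugin

import (
	"context"
	"strings"

	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// fooPlugin is a hypothetical plugin advertising a made-up resource
// "example.com/foo"; devices and update are filled in elsewhere.
type fooPlugin struct {
	devices []*pluginapi.Device
	update  chan []*pluginapi.Device
}

// ListAndWatch streams the current device list once, then again whenever
// the health of a device changes.
func (p *fooPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devices}); err != nil {
		return err
	}
	for devs := range p.update {
		if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
			return err
		}
	}
	return nil
}

// Allocate tells the Kubelet what to inject into each container that was
// granted devices: here just one hypothetical environment variable.
func (p *fooPlugin) Allocate(_ context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, req := range reqs.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				"FOO_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
			},
		})
	}
	return resp, nil
}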

Installation

Environment:

The community-recommended approach is to run the Device Plugin inside a container orchestrated by Kubernetes, so that Kubernetes can restart it if it fails; nothing prevents you from running the Device Plugin on bare metal, though.

Deployment:

For containerized deployment of a Device Plugin, a DaemonSet is the obvious choice. Deploying via a DaemonSet requires no extra changes to the Kubernetes cluster itself: with affinity-based scheduling plus the DaemonSet's own semantics, the Device Plugin lands exactly on the Nodes that need it, along the lines of the manifest below.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: device-plugin
spec:
  selector:
    matchLabels:
      name: device-plugin
  template:
    metadata:
      labels:
        name: device-plugin
    spec:
      nodeSelector:
        nvidia-gpu: "true"
      containers:
        - name: device-plugin-ctr
          image: nvidia/device-plugin:1.0
          volumeMounts:
            - name: device-plugin
              mountPath: /device-plugin
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

In addition, some Kubernetes deployment tools such as kubeadm already integrate community-validated Device Plugin deployments.

NVIDIA Device

Overview

The Device Plugin provided by NVIDIA naturally serves GPUs; it is probably the most classic Device Plugin in the community, so the analysis below uses it as the example.

It runs on the Kubernetes cluster as a DaemonSet and:

  • exposes the number of GPUs on each Node;
  • tracks the health of those GPUs;
  • helps run GPU-enabled containers.

Usage

  1. Before installing the NVIDIA Device Plugin, prepare the required versions of the NVIDIA driver, nvidia-docker, nvidia-container-runtime, and a Kubernetes cluster.

  2. Configure nvidia-container-runtime as the default Docker runtime on the relevant Nodes.

# /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
  3. Deploy the NVIDIA Device Plugin.

The officially recommended way is to deploy the Device Plugin onto the Kubernetes cluster as a DaemonSet, but per the community guidance above, running it locally as a plain Docker container, or running it in bare-metal mode, both work as well.

Deploy with a Kubernetes DaemonSet:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

Deploy with Docker:

$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.0.0-beta4

Deploy on bare metal:

$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
$ ./k8s-device-plugin
  4. Create a GPU workload on the Kubernetes cluster.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

Insight

The analysis below is based on commit f737ebb9aab0d50b24b1ce513236410fcfd0dee1 of the project's main branch.

  1. First, look at the DaemonSet YAML and the container's Dockerfile.

nvidia-device-plugin.yml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta4
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Dockerfile:

FROM ubuntu:16.04 as build

RUN apt-get update && apt-get install -y --no-install-recommends \
        g++ \
        ca-certificates \
        wget && \
    rm -rf /var/lib/apt/lists/*

ENV GOLANG_VERSION 1.10.3
RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \
    | tar -C /usr/local -xz
ENV GOPATH /go
ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH

WORKDIR /go/src/nvidia-device-plugin
COPY . .

RUN export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' && \
    go install -ldflags="-s -w" -v nvidia-device-plugin

FROM debian:stretch-slim

ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=utility

COPY --from=build /go/bin/nvidia-device-plugin /usr/bin/nvidia-device-plugin

CMD ["nvidia-device-plugin"]

The YAML shows that the Device Plugin workload runs as a DaemonSet, and that the host's /var/lib/kubelet/device-plugins directory is mounted into the container at the same path; this mount clearly exists for the Unix-socket communication described in the Registration part of the previous section.
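
As a hedged illustration of the registration that this shared socket directory enables, the sketch below dials the Kubelet's kubelet.sock through the mounted path and announces the plugin's own socket and resource name. registerWithKubelet and its arguments are made-up names, and real plugins (including NVIDIA's) differ in details such as dial options and error handling.

package fooplugin

import (
	"context"
	"net"
	"time"

	"google.golang.org/grpc"
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// registerWithKubelet dials the Kubelet's registration socket inside the shared
// hostPath directory (pluginapi.KubeletSocket, i.e.
// /var/lib/kubelet/device-plugins/kubelet.sock) and registers this plugin.
func registerWithKubelet(resourceName, pluginSocketName string) error {
	conn, err := grpc.Dial(pluginapi.KubeletSocket,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		grpc.WithTimeout(10*time.Second),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			// The address is a Unix socket path, not a TCP endpoint.
			return net.DialTimeout("unix", addr, timeout)
		}),
	)
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     pluginSocketName, // socket file name relative to DevicePluginPath
		ResourceName: resourceName,     // e.g. "nvidia.com/gpu"
	})
	return err
}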

Looking at the Dockerfile, it uses a multi-stage build. Apart from the build/runtime dependencies and the environment variables worth noting, there is nothing special: the compiled Go binary is installed under GOPATH (which is on PATH) in the build stage and then copied into the slim runtime image. This also shows that the Device Plugin itself is quite small.

  2. Next, the packages it depends on.

The NVIDIA/k8s-device-plugin project depends on the following Go packages.

  • NVIDIA/gpu-monitoring-tools/bindings/go/nvml: a C++/Go API library for monitoring and managing NVIDIA GPU devices. The control-plane side of GPU virtualization is presumably handled mostly by this library.
  • notify/fsnotify: a cross-platform filesystem-notifications library, used to watch for changes under DevicePluginPath so that the Device Plugin can detect Kubelet restarts.
  • kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1: the gRPC API of the Kubernetes Device Plugin, defining the Device Plugin gRPC interface and client.
  3. Finally, the source code: NVIDIA/k8s-device-plugin/tree/f737ebb9aab0d50b24b1ce513236410fcfd0dee1

The main logic of the NVIDIA Device Plugin splits into three parts: Device Initialization, Running Loop, and Plugin Events.

Device Initialization:

func main() {
    log.Println("Loading NVML")
    if err := nvml.Init(); err != nil {
        log.Printf("Failed to initialize NVML: %s.", err)
        log.Printf("If this is a GPU node, did you set the docker default runtime to `nvidia`?")
        log.Printf("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")
        log.Printf("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")

        select {}
    }
    defer func() { log.Println("Shutdown of NVML returned:", nvml.Shutdown()) }()

    log.Println("Starting FS watcher.")
    watcher, err := newFSWatcher(pluginapi.DevicePluginPath)
    if err != nil {
        log.Println("Failed to created FS watcher.")
        os.Exit(1)
    }
    defer watcher.Close()

    log.Println("Starting OS watcher.")
    sigs := newOSWatcher(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)

    log.Println("Retreiving plugins.")
    plugins := getAllPlugins()
...

On startup, the Device Plugin initializes NVML via nvml.Init() from NVIDIA/gpu-monitoring-tools; if that fails, the plugin logs the hints above and then blocks forever in select {} instead of serving.
Next, it uses notify/fsnotify to watch for changes made by the Kubelet under DevicePluginPath (by default /var/lib/kubelet/device-plugins), and os/signal to watch the signals received by the process.
Finally, it collects all Plugins registered in the code, each carrying its resource name and socket path; this list is what the Running Loop stage later restarts plugin by plugin. At the moment it contains only the NVIDIA Device Plugin itself.
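
The newFSWatcher and newOSWatcher helpers used in main() are thin wrappers around fsnotify and os/signal; a plausible sketch of them, written from those libraries' public APIs rather than copied from the repository, looks like this:

package fooplugin

import (
	"os"
	"os/signal"

	"github.com/fsnotify/fsnotify"
)

// newFSWatcher watches the given directories (here pluginapi.DevicePluginPath)
// so the plugin notices the Kubelet recreating kubelet.sock after a restart.
func newFSWatcher(paths ...string) (*fsnotify.Watcher, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, err
	}
	for _, p := range paths {
		if err := watcher.Add(p); err != nil {
			watcher.Close()
			return nil, err
		}
	}
	return watcher, nil
}

// newOSWatcher relays the listed OS signals into a buffered channel consumed
// by the plugin's event loop.
func newOSWatcher(sigs ...os.Signal) chan os.Signal {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, sigs...)
	return ch
}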

Running Loop:

restart:
    // Loop through all plugins, idempotently stopping them, and then starting
    // them if they have any devices to serve. If even one plugin fails to
    // start properly, try starting them all again.
    started := 0
    pluginStartError := make(chan struct{})
    for _, p := range plugins {
        p.Stop()

        // Just continue if there are no devices to serve for plugin p.
        if len(p.Devices()) == 0 {
            continue
        }

        // Start the gRPC server for plugin p and connect it with the kubelet.
        if err := p.Start(); err != nil {
            log.Println("Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?")
            log.Printf("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")
            log.Printf("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")
            close(pluginStartError)
            goto events
        }
        started++
    }

    if started == 0 {
        log.Println("No devices found. Waiting indefinitely.")
    }
...

The Running Loop is the logic inside the "restart" block.
The for loop here iterates over the result of getAllPlugins() mentioned above; currently that contains only the nvidia-device-plugin itself.
Although the goto label is named restart, implying a restart of the Device Plugin, p.Stop() is a no-op when the plugin has only just started, so reading "restart" as "start" is equally valid on the first pass.
If the Device Plugin has already been running for a while, p.Stop() stops the gRPC server and cleans up the socket as well as some of the plugin's internal state (the device list, the server pointer, and the Go channels used for health checking and stopping).

In the normal case, execution reaches p.Start(), which brings up the Device Plugin's gRPC server (via net.Listen()) and registers it with the Kubelet. After registration, the plugin's main goroutine spawns a goroutine that runs the GPU health check and communicates back over a channel.

Inside the health check, the nvml library's CGO bindings are used to create an event set for the devices. The underlying implementation has not been investigated here; the guess is that it registers watch events for the corresponding GPUs so that their health can be monitored continuously.
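
To show how such a health-check goroutine could be wired up, here is a rough sketch. The nvml calls (NewEventSet, RegisterEventForDevice, WaitForEvent, XidCriticalError) and the event fields are assumptions about the NVIDIA/gpu-monitoring-tools Go binding and have not been verified against it, so treat this only as an illustration of the flow: subscribe to critical Xid events per GPU, then block on the event set and report the affected device as unhealthy.

package fooplugin

import (
	"github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

// healthCheck reports devices that raise critical Xid events on the unhealthy
// channel until stop is closed. All nvml identifiers below are assumed, not
// verified, binding APIs.
func healthCheck(stop <-chan struct{}, devs []*pluginapi.Device, unhealthy chan<- *pluginapi.Device) {
	eventSet := nvml.NewEventSet() // assumed binding call
	defer nvml.DeleteEventSet(eventSet)

	for _, d := range devs {
		// Assumed: subscribe to critical Xid errors for this GPU's UUID (d.ID).
		_ = nvml.RegisterEventForDevice(eventSet, nvml.XidCriticalError, d.ID)
	}

	for {
		select {
		case <-stop:
			return
		default:
		}
		// Assumed: block up to 5s for the next event from any registered GPU.
		e, err := nvml.WaitForEvent(eventSet, 5000)
		if err != nil {
			continue
		}
		for _, d := range devs {
			if e.UUID != nil && d.ID == *e.UUID {
				unhealthy <- d // the main loop then re-advertises the device as Unhealthy
			}
		}
	}
}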

Back in the main flow: when p.Start() fails, pluginStartError is closed before execution jumps to the events block. Since nothing is ever sent on this channel, a <-pluginStartError receive can only unblock because of that close, which is exactly what drives the behavior right after the goto to events.
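
As a side note, the close-to-unblock idiom that pluginStartError relies on can be demonstrated with a tiny standalone Go program (the names here are made up):

package main

import "fmt"

// A receive on an empty channel blocks until the channel is closed, at which
// point it returns immediately with the zero value: this is how the closed
// pluginStartError wakes up the events loop.
func main() {
	startError := make(chan struct{})

	go func() {
		// Simulate p.Start() failing somewhere else in the program.
		fmt.Println("start failed, closing channel")
		close(startError)
	}()

	<-startError // blocks until close(startError); no value is ever sent
	fmt.Println("receiver unblocked, would goto restart here")
}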

Plugin Events:

events:
    // Start an infinite loop, waiting for several indicators to either log
    // some messages, trigger a restart of the plugins, or exit the program.
    for {
        select {
        // If there was an error starting any plugins, restart them all.
        case <-pluginStartError:
            goto restart

        // Detect a kubelet restart by watching for a newly created
        // 'pluginapi.KubeletSocket' file. When this occurs, restart this loop,
        // restarting all of the plugins in the process.
        case event := <-watcher.Events:
            if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
                log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
                goto restart
            }

        // Watch for any other fs errors and log them.
        case err := <-watcher.Errors:
            log.Printf("inotify: %s", err)

        // Watch for any signals from the OS. On SIGHUP, restart this loop,
        // restarting all of the plugins in the process. On all other
        // signals, exit the loop and exit the program.
        case s := <-sigs:
            switch s {
            case syscall.SIGHUP:
                log.Println("Received SIGHUP, restarting.")
                goto restart
            default:
                log.Printf("Received signal \"%v\", shutting down.", s)
                for _, p := range plugins {
                    p.Stop()
                }
                break events
            }
        }
    }
}

"events"代码块主要是处理Device Plugin运行中的一些事件:

  • A p.Start() failure wakes up case <-pluginStartError, which re-runs the plugin's Running Loop, i.e. the "restart" block, to clean up and restart the gRPC server, re-register the Device Plugin, and so on.
  • When the KubeletSocket is seen being recreated, a log line about the Kubelet restart is printed and execution enters "restart"; the handling is essentially the same as the previous case, skipping NVML initialization and restarting the Device Plugin.
  • Any other filesystem watch errors are only logged.
  • When the main process receives a signal: on SIGHUP it enters "restart" and restarts the server; on any other signal it stops the Device Plugin and exits.

Discussion

The logic of NVIDIA's GPU Device Plugin tracks the Kubernetes community proposal very closely. Although I have not dug further, it is a fair guess that the proposal itself was put forward together with this Device Plugin.

As you can see, the Device Plugin's own code is fairly simple: it initializes the relevant device driver bindings, maintains a gRPC server that the Kubelet calls, and handles events such as Kubelet restarts. The device-driver and filesystem-level work is delegated to other APIs rather than implemented by the Device Plugin itself, which keeps the barrier to developing one low.

Intel SRIOV Network Device

AliyunContainerService GPUShare

References

  1. Device Plugin 入门笔记(一)
  2. Device Manager Proposal - kubernetes/community
  3. NVIDIA/k8s-device-plugin
  4. intel/sriov-network-device-plugin
  5. intel/sriov-cni
  6. AliyunContainerService/gpushare-device-plugin
  7. AliyunContainerService/gpushare-scheduler-extender