This section follows up on the previous one's introduction to the Kubernetes Device Plugin and looks at several community Device Plugin implementations, including
- NVIDIA/k8s-device-plugin
- intel/sriov-network-device-plugin
- AliyunContainerService/gpushare-device-plugin
to explore how Device Plugins are designed, used, and structured in code.
Review
To recap the previous section: the Device Manager is the unified, pluggable mechanism that the K8s community offers device vendors, a beta feature since Kubernetes v1.11.
Through its two pieces, Extended Resources and Device Plugins, it lets users define how a given kind of device resource is managed on K8s; this management, however, covers only resource discovery, health checking, and allocation, not maintaining the topology of heterogeneous nodes or collecting monitoring data.
In terms of implementation, a Device Plugin is a gRPC server running on the same Node as the Kubelet. It talks to the Kubelet's gRPC server over a Unix socket using the (simplified) API below, and maintains the registration, discovery, allocation, and removal of the corresponding device resource on that Node.
ListAndWatch() handles discovery and watching of the device resource; Allocate() handles its allocation.
```protobuf
service DevicePlugin {
    // returns a stream of []Device
    rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
    rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}
```
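To make the shape of this interface concrete, here is a minimal, hypothetical Device Plugin skeleton in Go. It is not any vendor's implementation: the resource socket name `demo.sock` and the single fake device are made up, and it uses the `deviceplugin/v1beta1` package discussed later in this article. Registration with the Kubelet is omitted here (a registration sketch appears further below).

```go
// A minimal, hypothetical Device Plugin skeleton (not any vendor's code).
// Socket name and fake device are illustrative only.
package main

import (
	"context"
	"log"
	"net"
	"os"
	"path"

	"google.golang.org/grpc"
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

type demoPlugin struct {
	devices []*pluginapi.Device
}

func (p *demoPlugin) GetDevicePluginOptions(ctx context.Context, _ *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	return &pluginapi.DevicePluginOptions{}, nil
}

// ListAndWatch streams the current device list to the kubelet; a real plugin
// re-sends it whenever a device appears, disappears, or becomes unhealthy.
func (p *demoPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: p.devices}); err != nil {
		return err
	}
	select {} // block forever; real code waits on health-check/stop channels
}

// Allocate tells the kubelet how to expose the assigned devices to the
// container; a real plugin fills in env vars, mounts, or device specs here.
func (p *demoPlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range req.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{})
	}
	return resp, nil
}

func (p *demoPlugin) PreStartContainer(ctx context.Context, _ *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
	return &pluginapi.PreStartContainerResponse{}, nil
}

func main() {
	// Serve on a Unix socket under the kubelet's device-plugin directory.
	sock := path.Join(pluginapi.DevicePluginPath, "demo.sock")
	os.Remove(sock)
	l, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	pluginapi.RegisterDevicePluginServer(srv, &demoPlugin{
		devices: []*pluginapi.Device{{ID: "dev-0", Health: pluginapi.Healthy}},
	})
	// Registration with the kubelet is omitted in this sketch.
	log.Fatal(srv.Serve(l))
}
```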
Installation
Environment:
The community's recommended approach is to run Device Plugins inside containers orchestrated by K8s, so that K8s can restart them when they fail; that said, nothing stops you from running a Device Plugin on bare metal.
Deployment:
For containerized deployment of a Device Plugin, a DaemonSet is the natural choice. Deploying with a DaemonSet requires no extra changes to the K8s cluster itself: affinity-based scheduling plus the DaemonSet's own behavior is enough to land the Device Plugin on exactly the Nodes that need it, along the lines of the manifest below.
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
spec:
  template:
    metadata:
      labels:
        name: device-plugin
    spec:
      nodeSelector:
        nvidia-gpu: "true"
      containers:
      - name: device-plugin-ctr
        image: NVIDIA/device-plugin:1.0
        volumeMounts:
        - name: device-plugin
          mountPath: /device-plugin
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```
Beyond that, some K8s deployment tools such as kubeadm already integrate community-validated Device Plugin deployments.
NVIDIA Device Plugin
Overview
The Device Plugin that NVIDIA provides naturally serves GPUs; it is probably the most classic Device Plugin in the community, so the analysis below uses it as the example.
Running as a DaemonSet on the K8s cluster, it:
- exposes the number of GPUs on each Node;
- tracks the health of those GPUs;
- makes it possible to run GPU-enabled containers.
Usage
Before installing the NVIDIA Device Plugin, you need the required versions of the NVIDIA driver, nvidia-docker, nvidia-container-runtime, and a K8s cluster.
- Configure nvidia-container-runtime as the default Docker runtime on the relevant Nodes.
```json
# /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```
- Deploy the NVIDIA Device Plugin.
The officially recommended way is to deploy the Device Plugin to the K8s cluster as a DaemonSet, but per the community guidance above, running it locally as a plain Docker container or in bare-metal mode also works.
Deploying with a K8s DaemonSet:
```bash
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```
Deploying with Docker:
```bash
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.0.0-beta4
```
Deploying on bare metal:
```bash
$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
$ ./k8s-device-plugin
```
- Create a GPU workload on the K8s cluster.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
```
Insight
The analysis below is based on commit f737ebb9aab0d50b24b1ce513236410fcfd0dee1 of the project's main branch.
- First, look at the DaemonSet YAML and the container's Dockerfile.
nvidia-device-plugin.yml:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta4
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
Dockerfile:
```dockerfile
FROM ubuntu:16.04 as build

RUN apt-get update && apt-get install -y --no-install-recommends \
        g++ \
        ca-certificates \
        wget && \
    rm -rf /var/lib/apt/lists/*

ENV GOLANG_VERSION 1.10.3
RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \
    | tar -C /usr/local -xz
ENV GOPATH /go
ENV PATH $GOPATH/bin:/usr/local/go/bin:$PATH

WORKDIR /go/src/nvidia-device-plugin
COPY . .

RUN export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' && \
    go install -ldflags="-s -w" -v nvidia-device-plugin

FROM debian:stretch-slim

ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=utility

COPY --from=build /go/bin/nvidia-device-plugin /usr/bin/nvidia-device-plugin

CMD ["nvidia-device-plugin"]
```
The YAML shows that the Device Plugin workload runs as a DaemonSet, and that the /var/lib/kubelet/device-plugins directory is mounted into the container at the same path; this mount obviously exists for the Unix socket communication described in the Registration part of the previous section.
The Dockerfile uses a multi-stage build. Apart from the build and runtime dependencies and the environment variables worth noting, there is nothing unusual: the compiled Go binary is installed under GOPATH and placed on PATH. This also shows how small the Device Plugin itself is.
- Next, look at the packages it depends on.
The NVIDIA/k8s-device-plugin project depends on the following Go packages.
- NVIDIA/gpu-monitoring-tools/bindings/go/nvml: a C++/Go API library for monitoring and managing NVIDIA GPU devices. The control-plane work for GPU virtualization is presumably handled mainly by this library.
- fsnotify/fsnotify: a cross-platform filesystem notifications library, used to watch for changes under DevicePluginPath so that the Device Plugin can notice Kubelet restarts (a minimal usage sketch follows this list).
- kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1: the gRPC API of the K8s Device Plugin, defining the Device Plugin's gRPC interface and client.
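As an aside, the fsnotify usage is a standard pattern. The sketch below is not the plugin's exact code; the logging is illustrative, but it shows how watching DevicePluginPath lets a plugin notice that kubelet.sock has been re-created.

```go
// A minimal sketch of watching DevicePluginPath with fsnotify so that a
// re-created kubelet.sock (i.e. a kubelet restart) can be detected.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory that holds kubelet.sock and all plugin sockets.
	if err := watcher.Add(pluginapi.DevicePluginPath); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			// A newly created kubelet.sock means the kubelet restarted and
			// every plugin has to re-register.
			if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
				log.Printf("%s re-created, plugin should restart and re-register", pluginapi.KubeletSocket)
			}
		case err := <-watcher.Errors:
			log.Printf("fsnotify error: %v", err)
		}
	}
}
```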
The main logic of the NVIDIA Device Plugin splits into three parts: Device Initialization, the Running Loop, and Plugin Events.
Device Initialization:
```go
func main() {
	log.Println("Loading NVML")
	if err := nvml.Init(); err != nil {
		log.Printf("Failed to initialize NVML: %s.", err)
		log.Printf("If this is a GPU node, did you set the docker default runtime to `nvidia`?")
		log.Printf("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")
		log.Printf("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")
		select {}
	}
	defer func() { log.Println("Shutdown of NVML returned:", nvml.Shutdown()) }()

	log.Println("Starting FS watcher.")
	watcher, err := newFSWatcher(pluginapi.DevicePluginPath)
	if err != nil {
		log.Println("Failed to created FS watcher.")
		os.Exit(1)
	}
	defer watcher.Close()

	log.Println("Starting OS watcher.")
	sigs := newOSWatcher(syscall.SIGHUP, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)

	log.Println("Retreiving plugins.")
	plugins := getAllPlugins()
	...
```
On startup, the Device Plugin calls nvml.Init() from NVIDIA/gpu-monitoring-tools to initialize NVML; if that fails, the Plugin gets nowhere (note in the code above that it blocks on an empty select rather than exiting).
Next, the Device Plugin uses fsnotify to watch for changes under the Kubelet's DevicePluginPath (by default /var/lib/kubelet/device-plugins); in parallel, it uses os/signal to watch for signals delivered to the process.
Finally, the Device Plugin collects every Plugin recorded in the code, each carrying its resource name and socket path; this list is what the Running Loop phase iterates over to restart the Plugins one by one. For now only the NVIDIA Device Plugin itself is recorded, as sketched below.
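For orientation, here is a simplified, hypothetical sketch of what that "resource name plus socket path" bookkeeping amounts to; the function and struct names mirror the description above but are not the project's actual types, and the socket file name is only an example.

```go
// A simplified sketch (hypothetical types) of what getAllPlugins() conceptually
// returns: each plugin pairs an extended-resource name with the Unix socket it
// serves on under DevicePluginPath.
package main

import (
	"fmt"
	"path"

	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

type devicePlugin struct {
	resourceName string // advertised to the kubelet, e.g. "nvidia.com/gpu"
	socket       string // gRPC endpoint under /var/lib/kubelet/device-plugins
}

func getAllPlugins() []devicePlugin {
	// Today only the NVIDIA GPU plugin itself is listed here.
	return []devicePlugin{
		{
			resourceName: "nvidia.com/gpu",
			socket:       path.Join(pluginapi.DevicePluginPath, "nvidia.sock"), // example name
		},
	}
}

func main() {
	for _, p := range getAllPlugins() {
		fmt.Printf("%s -> %s\n", p.resourceName, p.socket)
	}
}
```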
Running Loop:
```go
restart:
	// Loop through all plugins, idempotently stopping them, and then starting
	// them if they have any devices to serve. If even one plugin fails to
	// start properly, try starting them all again.
	started := 0
	pluginStartError := make(chan struct{})
	for _, p := range plugins {
		p.Stop()

		// Just continue if there are no devices to serve for plugin p.
		if len(p.Devices()) == 0 {
			continue
		}

		// Start the gRPC server for plugin p and connect it with the kubelet.
		if err := p.Start(); err != nil {
			log.Println("Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?")
			log.Printf("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")
			log.Printf("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")
			close(pluginStartError)
			goto events
		}
		started++
	}

	if started == 0 {
		log.Println("No devices found. Waiting indefinitely.")
	}
	...
```
The Running Loop is the logic inside the "restart" block.
The for loop here iterates over the result of getAllPlugins() mentioned earlier; at the moment the code only contains the nvidia-device-plugin itself.
The goto target is named restart because it restarts the Device Plugin; in practice, since p.Stop() does nothing when the Plugin has only just started, reading "restart" as "start" is equally valid.
If the Device Plugin has been running for a while, p.Stop() stops the gRPC server and cleans up the socket and some of the Plugin's internal state (the device list, the server pointer, and the go channels used for health checking and stopping).
In the normal case execution reaches p.Start(), which brings up the Device Plugin's gRPC server (via net.Listen()) and registers it with the Kubelet. Once registration is done, the Plugin's main process spawns a goroutine that runs the GPU health check (communicating with the main process over a channel).
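The registration step is part of the kubelet's device plugin API rather than anything NVIDIA-specific, so a condensed sketch may help. It is simplified from what the real Start() does; the socket path and resource name passed in main are just examples.

```go
// A condensed sketch of registering a plugin with the kubelet over its Unix
// socket, using the deviceplugin/v1beta1 Registration service.
package main

import (
	"context"
	"log"
	"net"
	"path"
	"time"

	"google.golang.org/grpc"
	pluginapi "k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1"
)

func registerWithKubelet(pluginSock, resourceName string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Dial the kubelet's Unix socket (WithInsecure/WithDialer were the
	// idiomatic gRPC options at the time of this code).
	conn, err := grpc.DialContext(ctx, pluginapi.KubeletSocket,
		grpc.WithInsecure(), grpc.WithBlock(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}))
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pluginapi.NewRegistrationClient(conn)
	_, err = client.Register(context.Background(), &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     path.Base(pluginSock), // socket file name relative to DevicePluginPath
		ResourceName: resourceName,          // e.g. "nvidia.com/gpu"
	})
	return err
}

func main() {
	if err := registerWithKubelet("/var/lib/kubelet/device-plugins/nvidia.sock", "nvidia.com/gpu"); err != nil {
		log.Fatalf("registration failed: %v", err)
	}
	log.Println("registered with the kubelet")
}
```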
Inside the health check, the CGO interface of the nvml library is used to create an EventSet for the devices; the underlying implementation has not been examined here, but presumably it registers watch events for the corresponding GPU devices so that their health can be monitored continuously.
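To make the idea more concrete, below is a hedged sketch of that health-check pattern. It assumes the gpu-monitoring-tools bindings expose NewEventSet, RegisterEventForDevice, WaitForEvent, and an XidCriticalError event type, which is what the plugin's health-check code appears to use; the placeholder UUID, the timeout, and the channel-based reporting are illustrative.

```go
// A hedged sketch of GPU health checking via NVML events: subscribe to
// critical Xid errors per device and report a device as unhealthy when an
// event (or a registration failure) occurs.
package main

import (
	"log"

	"github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

func watchGPUHealth(uuids []string, unhealthy chan<- string) {
	eventSet := nvml.NewEventSet()
	defer nvml.DeleteEventSet(eventSet)

	for _, uuid := range uuids {
		// Subscribe to critical Xid errors for this GPU; if that fails,
		// treat the device as unhealthy right away.
		if err := nvml.RegisterEventForDevice(eventSet, nvml.XidCriticalError, uuid); err != nil {
			log.Printf("could not watch %s, marking unhealthy: %v", uuid, err)
			unhealthy <- uuid
		}
	}

	for {
		// Block (with a timeout in ms) until NVML reports an event.
		e, err := nvml.WaitForEvent(eventSet, 5000)
		if err != nil {
			continue // timeout or transient error; keep polling
		}
		if e.UUID != nil {
			unhealthy <- *e.UUID
		}
	}
}

func main() {
	if err := nvml.Init(); err != nil {
		log.Fatalf("failed to initialize NVML: %v", err)
	}
	defer nvml.Shutdown()

	unhealthy := make(chan string)
	// The UUID below is a placeholder; real code gets UUIDs from device discovery.
	go watchGPUHealth([]string{"GPU-00000000-0000-0000-0000-000000000000"}, unhealthy)
	for uuid := range unhealthy {
		log.Printf("device %s became unhealthy", uuid)
	}
}
```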
Back in the main process: when p.Start() fails, pluginStartError is closed. Since nothing is ever written to this channel, the <-pluginStartError receive (reached after the goto into the "events" block) only unblocks once the channel has been closed.
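As a standalone illustration of the Go idiom at work here: a receive from a channel that never gets a value unblocks the moment the channel is closed, which is exactly how close(pluginStartError) wakes up the events loop. The names below are illustrative.

```go
// Closing a channel releases every goroutine blocked on receiving from it,
// even though no value was ever sent; this is the signalling pattern the
// plugin uses with pluginStartError.
package main

import "fmt"

func main() {
	startError := make(chan struct{})
	done := make(chan struct{})

	go func() {
		<-startError // blocks until the channel is closed (no value is ever sent)
		fmt.Println("start failed somewhere, time to restart the plugins")
		close(done)
	}()

	close(startError) // signal failure by closing, not by sending
	<-done
}
```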
Plugin Events:
```go
events:
	// Start an infinite loop, waiting for several indicators to either log
	// some messages, trigger a restart of the plugins, or exit the program.
	for {
		select {
		// If there was an error starting any plugins, restart them all.
		case <-pluginStartError:
			goto restart

		// Detect a kubelet restart by watching for a newly created
		// 'pluginapi.KubeletSocket' file. When this occurs, restart this loop,
		// restarting all of the plugins in the process.
		case event := <-watcher.Events:
			if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
				log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
				goto restart
			}

		// Watch for any other fs errors and log them.
		case err := <-watcher.Errors:
			log.Printf("inotify: %s", err)

		// Watch for any signals from the OS. On SIGHUP, restart this loop,
		// restarting all of the plugins in the process. On all other
		// signals, exit the loop and exit the program.
		case s := <-sigs:
			switch s {
			case syscall.SIGHUP:
				log.Println("Received SIGHUP, restarting.")
				goto restart
			default:
				log.Printf("Received signal \"%v\", shutting down.", s)
				for _, p := range plugins {
					p.Stop()
				}
				break events
			}
		}
	}
}
```
The "events" block mainly handles events that occur while the Device Plugin is running:
- A failing p.Start() wakes up case <-pluginStartError, which re-runs the Plugin's Running Loop, i.e. the "restart" block, to clean up and restart the gRPC server, re-register the Device Plugin, and so on.
- When the watcher sees KubeletSocket being re-created, the Kubelet restart is logged and execution jumps into "restart"; the handling is essentially the same as in the previous case, skipping the nvml initialization and restarting the Device Plugin.
- Any other filesystem errors seen by the watcher are simply logged.
- When the main process receives a signal: on SIGHUP it enters "restart" and restarts the server; on any other signal it stops the Device Plugins and exits.
Discussion
The logic of NVIDIA's GPU Device Plugin follows the K8s community proposal very closely; the author has not investigated further, but it is a reasonable guess that the proposal itself was put forward together with this Device Plugin.
As we can see, the code of a Device Plugin itself is fairly simple: it performs the necessary initialization for the device driver, maintains a gRPC server that the Kubelet calls, and handles events such as Kubelet restarts. The low-level device-driver and filesystem work is delegated to other APIs rather than implemented by the Device Plugin itself, which also lowers the barrier to writing one.