The Kernel Module Management (KMM) operator manages the deployment and lifecycle of out-of-tree kernel modules on RHOCP.
In this release, the KMM operator is used to manage and deploy the Intel® Data Center GPU driver container image on the RHOCP cluster.
Intel Data Center GPU driver container images are released from the Intel Data Center GPU Driver for OpenShift project.
Follow the installation guide below to install the KMM operator via the CLI or the web console.
Canary deployment is enabled by default: the driver container image is first deployed only on specific node(s) to verify that the initial deployment succeeds before it is rolled out to all eligible nodes in the cluster. This safety mechanism reduces risk and prevents a faulty deployment from adversely affecting the entire cluster.
Follow the steps below to set the alternative firmware path at runtime.

Patch the kmm-operator-manager-config ConfigMap to set worker.setFirmwareClassPath to /var/lib/firmware:
$ oc patch configmap kmm-operator-manager-config -n openshift-kmm --type='json' -p='[{"op": "add", "path": "/data/controller_config.yaml", "value": "healthProbeBindAddress: :8081\nmetricsBindAddress: 127.0.0.1:8080\nleaderElection:\n enabled: true\n resourceID: kmm.sigs.x-k8s.io\nwebhook:\n disableHTTP2: true\n port: 9443\nworker:\n runAsUser: 0\n seLinuxType: spc_t\n setFirmwareClassPath: /var/lib/firmware"}]'
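For readability, the data/controller_config.yaml payload embedded in the patch command above corresponds to the following YAML (shown here only as an illustration of what the patch writes; the patch command remains the authoritative form):

```yaml
healthProbeBindAddress: :8081
metricsBindAddress: 127.0.0.1:8080
leaderElection:
  enabled: true
  resourceID: kmm.sigs.x-k8s.io
webhook:
  disableHTTP2: true
  port: 9443
worker:
  runAsUser: 0
  seLinuxType: spc_t
  # Alternative firmware lookup path used by the KMM worker
  setFirmwareClassPath: /var/lib/firmware
```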
Restart the KMM operator controller pods for the ConfigMap changes to take effect:
$ oc get pods -n openshift-kmm | grep -i "kmm-operator-controller-" | awk '{print $1}' | xargs oc delete pod -n openshift-kmm
For more details, see link.
Follow the steps below to deploy the driver container image in pre-build mode.
Find the node(s) with an Intel Data Center GPU:

$ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true
Example output:
NAME STATUS ROLES AGE VERSION
icx-dgpu-1 Ready worker 30d v1.25.4+18eadca
Label the selected node(s) for the canary deployment:

$ oc label node <node_name> intel.feature.node.kubernetes.io/dgpu-canary=true
Deploy the driver container image by applying intel-dgpu.yaml:

$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/main/kmmo/intel-dgpu.yaml
After the initial canary deployment succeeds, remove the line shown below from the intel-dgpu.yaml file and reapply the yaml file to deploy the driver to the entire cluster. As a cluster administrator, you can also select another deployment policy.

intel.feature.node.kubernetes.io/dgpu-canary: 'true'
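As an illustration of how the canary label is consumed, a KMM Module restricts where the driver is loaded through its node selector. The sketch below is an assumption for clarity, not the contents of intel-dgpu.yaml; only the two label keys are taken from this guide, and the elided fields stand in for the real image/moduleLoader configuration:

```yaml
apiVersion: kmm.sigs.x-k8s.io/v1beta1
kind: Module
metadata:
  name: intel-dgpu
  namespace: openshift-kmm
spec:
  # ... moduleLoader/image configuration elided ...
  selector:
    # Canary deployment: only nodes carrying this label receive the driver.
    # Removing this line and reapplying rolls the driver out to every node
    # that matches the remaining selector label(s).
    intel.feature.node.kubernetes.io/dgpu-canary: 'true'
    intel.feature.node.kubernetes.io/gpu: 'true'
```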
To verify that the drivers have been loaded, follow the steps below:

a. Check that the node is labeled with kmm.node.kubernetes.io/openshift-kmm.intel-dgpu.ready using the command shown below:

$ oc get nodes -l kmm.node.kubernetes.io/openshift-kmm.intel-dgpu.ready
Example output:
NAME STATUS ROLES AGE VERSION
icx-dgpu-1 Ready worker 30d v1.25.4+18eadca
The label shown above indicates that the KMM operator has successfully deployed the drivers and firmware on the node.
b. Open a debug shell on the node (for example, with oc debug node/<node_name>), then check the loaded kernel modules:

$ chroot /host
$ lsmod | grep i915
Ensure i915 and intel_vsec are loaded in the kernel, as shown in the output below:
i915 3633152 0
i915_compat 16384 1 i915
intel_vsec 16384 1 i915
intel_gtt 20480 1 i915
video 49152 1 i915
i2c_algo_bit 16384 1 i915
drm_kms_helper 290816 1 i915
drm 589824 3 drm_kms_helper,i915
dmabuf 77824 4 drm_kms_helper,i915,i915_compat,dr
c. Run dmesg to ensure there are no errors in the kernel message log.