Design and architecture

Design

Architecture and Operation

The PMEM-CSI driver can operate in two different device modes: LVM and direct. This table contains an overview and comparison of those modes. There is a more detailed explanation in the following paragraphs.

	`LVM`	`direct`
Main advantage	avoids free space fragmentation¹	simpler, somewhat faster, but free space may get fragmented¹
What is served	LVM logical volume	pmem block device
Region affinity²	yes: one LVM volume group is created per region, and a volume has to be in one volume group	yes: namespace can belong to one region only
Startup	two extra stages: pmem-ns-init (creates namespaces), vgm (creates volume groups)	no extra steps at startup
Namespace modes	`fsdax` mode³ namespaces pre-created as pools	namespace in `fsdax` mode created directly, no need to pre-create pools
Limiting space usage	can leave part of device unused during pools creation	no limits, creates namespaces on device until runs out of space
Name field in namespace	Name gets set to 'pmem-csi' to achieve own vs. foreign marking	Name gets set to VolumeID, without attempting own vs. foreign marking
Minimum volume size	4 MB	1 GB (see also alignment adjustment below)
Alignment requirements	LVM creation aligns size up to next 4MB boundary	driver aligns size up to next alignment boundary. The default alignment step is 1 GB. Device(s) in interleaved mode will require larger minimum as size has to be at least one alignment step. The possibly bigger alignment step is calculated as interleave-set-size multiplied by 1 GB

¹ Free space fragmentation is a problem when there appears to be enough free capacity for a new namespace, but there isn’t a contiguous region big enough to allocate it. The PMEM-CSI driver is only capable of allocating continguous memory to a namespace and cannot de-fragment or combine smaller blocks. For example, this could happen when you create a 63 GB namespace, followed by a 1 GB namespace, and then delete the 63 GB namespace. Eventhough there is 127 GB available, the driver cannot create a namespace larger than 64 GB.

---------------------------------------------------------------------
|         63 GB free         | 1GB used |         64 GB free        |
---------------------------------------------------------------------

² Region affinity means that all parts of a provisioned file system are physically located on device(s) that belong to same PMEM region. This is important on multi-socket systems where media access time may vary based on where the storage device(s) are physically attached.

³ fsdax mode is required for NVDIMM namespaces. See Persistent Memory Programming for details. devdax mode is not supported. Though a raw block volume would be useful when a filesystem isn’t needed, Kubernetes cannot handle binding a character device to a loop device.

LVM device mode

In Logical Volume Management (LVM) mode the PMEM-CSI driver uses LVM for logical volume Management to avoid the risk of fragmentation. The LVM logical volumes are served to satisfy API requests. There is one volume group created per region, ensuring the region-affinity of served volumes.

devicemode-lvm diagram

The driver consists of three separate binaries that form two initialization stages and a third API-serving stage.

During startup, the driver scans persistent memory for regions and namespaces, and tries to create more namespaces using all or part (selectable via option) of the remaining available space. This first stage is performed by a separate entity pmem-ns-init.

The second stage of initialization arranges physical volumes provided by namespaces into LVM volume groups. This is performed by a separate binary pmem-vgm.

After two initialization stages, the third binary pmem-csi-driver starts serving CSI API requests.

Namespace modes in LVM device mode

The PMEM-CSI driver pre-creates namespaces in fsdax mode forming the corresponding LVM volume group. The amount of space to be used is determined using the option -useforfsdax given to pmem-ns-init. This options specifies an integer presenting limit as percentage. The default value is useforfsdax=100.

Using limited amount of total space in LVM device mode

The PMEM-CSI driver can leave space on devices for others, and recognize “own” namespaces. Leaving space for others can be achieved by specifying lower-than-100 value to -useforfsdax options The distinction “own” vs. “foreign” is implemented by setting the Name field in namespace to a static string “pmem-csi” during namespace creation. When adding physical volumes to volume groups, only those physical volumes that are based on namespaces with the name “pmem-csi” are considered.

Direct device mode

The following diagram illustrates the operation in Direct device mode: devicemode-direct diagram

In direct device mode PMEM-CSI driver allocates namespaces directly from the storage device. This creates device space fragmentation risk, but reduces complexity and run-time overhead by avoiding additional device mapping layer. Direct mode also ensures the region-affinity of served volumes, because provisioned volume can belong to one region only.

In Direct mode, the two preparation stages used in LVM mode, are not needed.

Namespace modes in direct device mode

The PMEM-CSI driver creates a namespace directly in the mode which is asked by volume creation request, thus bypassing the complexity of pre-allocated pools that are used in LVM device mode.

Using limited amount of total space in direct device mode

In direct device mode, the driver does not attempt to limit space use. It also does not mark “own” namespaces. The Name field of a namespace gets value of the VolumeID.

Kata Containers support

Kata Containers runs applications inside a virtual machine. This poses a problem for App Direct mode, because access to the filesystem prepared by PMEM-CSI is provided inside the virtual machine by the 9p or virtio-fs filesystems. Both do not support App Direct mode:

9p does not support mmap at all.
virtio-fs only supports it when not using MAP_SYNC, i.e. without dax semantic.

This gets solved as follows:

PMEM-CSI creates a volume as usual, either in direct mode or LVM mode.
Inside that volume it sets up an ext4 or xfs filesystem.
Inside that filesystem it creates a pmem-csi-vm.img file that contains partition tables, dax metadata and a partition that takes up most of the space available in the volume.
That partition is bound to a /dev/loop device and the formatted with the requested filesystem type for the volume.
When an application needs access to the volume, PMEM-CSI mounts that /dev/loop device.
An application not running under Kata Containers then uses that filesystem normally but due to limitations in the Linux kernel, mounting might have to be done without -odax and thus App Direct access does not work.
When the Kata Containers runtime is asked to provide access to that filesystem, it will instead pass the underlying pmem-csi-vm.img file into QEMU as a nvdimm device and inside the VM mount the /dev/pmem0p1 partition that the Linux kernel sets up based on the dax meta data that was placed in the file by PMEM-CSI. Inside the VM, the App Direct semantic is fully supported.

Such volumes can be used with full dax semantic only inside Kata Containers. They are still usable with other runtimes, just not with dax semantic. Because of that and the additional space overhead, Kata Containers support has to be enabled explicitly via a storage class parameter and Kata Containers must be set up appropriately.

Driver modes

The PMEM-CSI driver supports running in different modes, which can be controlled by passing one of the below options to the driver’s ‘-mode’ command line option. In each mode, it starts a different set of open source Remote Procedure Call (gRPC) servers on given driver endpoint(s).

Controller should run as a single instance in cluster level. When the driver is running in Controller mode, it forwards the pmem volume create/delete requests to the registered node controller servers running on the worker node. In this mode, the driver starts the following gRPC servers:
One Node instance should run on each worker node that has persistent memory devices installed. When the driver starts in such mode, it registers with the Controller driver running on a given -registryEndpoint. In this mode, the driver starts the following servers:

Driver Components

Identity Server

This gRPC server operates on a given endpoint in all driver modes and implements the CSI Identity interface.

Node Registry Server

When the PMEM-CSI driver runs in Controller mode, it starts a gRPC server on a given endpoint(-registryEndpoint) and serves the RegistryServer interface. The driver(s) running in Node mode can register themselves with node specific information such as node id, NodeControllerServer endpoint, and their available persistent memory capacity.

Master Controller Server

This gRPC server is started by the PMEM-CSI driver running in Controller mode and serves the Controller interface defined by the CSI specification. The server responds to CreateVolume(), DeleteVolume(), ControllerPublishVolume(), ControllerUnpublishVolume(), and ListVolumes() calls coming from external-provisioner() and external-attacher() sidecars. It forwards the publish and unpublish volume requests to the appropriate Node controller server running on a worker node that was registered with the driver.

Node Controller Server

This gRPC server is started by the PMEM-CSI driver running in Node mode and implements the ControllerPublishVolume and ControllerUnpublishVolume methods of the Controller service interface defined by the CSI specification. It serves the ControllerPublishVolume() and ControllerUnpublish() requests coming from the Master controller server and creates/deletes persistent memory devices.

Node Server

This gRPC server is started by the driver running in Node mode and implements the Node service interface defined in the CSI specification. It serves the NodeStageVolume(), NodeUnstageVolume(), NodePublishVolume(), and NodeUnpublishVolume() requests coming from the Container Orchestrator (CO).

Communication between components

The following diagram illustrates the communication channels between driver components: communication diagram

Security

All PMEM-CSI specific communication shown in above section between Master Controller(RegistryServer, MasterControllerServer) and NodeControllers(NodeControllerServer) is protected by mutual TLS. Both client and server must identify themselves and the certificate they present must be trusted. The common name in each certificate is used to identify the different components. The following common names have a special meaning:

pmem-registry is used by the RegistryServer.
pmem-node-controller is used by NodeControllerServers

The test/setup-ca.sh script shows how to generate self-signed certificates. The test cluster is set up using certificates created by that script, with secrets prepared by test/setup-deployment.sh before deploying the driver using the provided deployment files.

Beware that these are just examples. Administrators of a cluster must ensure that they choose key lengths and algorithms of sufficient strength for their purposes and manage certificate distribution.

A production deployment can improve upon that by using some other key delivery mechanism, like for example Vault.

Volume Persistency

In a typical CSI deployment, volumes are provided by a storage backend that is independent of a particular node. When a node goes offline, the volume can be mounted elsewhere. But PMEM volumes are local to node and thus can only be used on the node where they were created. This means the applications using PMEM volume cannot freely move between nodes. This limitation needs to be considered when designing and deploying applications that are to use local storage.

These are the volume persistency models considered for implementation in PMEM-CSI to serve different application use cases:

Persistent volumes
A volume gets created independently of the application, on some node where there is enough free space. Applications using such a volume are then forced to run on that node and cannot run when the node is down. Data is retained until the volume gets deleted.
Ephemeral volumes
Each time an application starts to run on a node, a new volume is created for it on that node. When the application stops, the volume is deleted. The volume cannot be shared with other applications. Data on this volume is retained only while the application runs.
Cache volumes
Volumes are pre-created on a certain set of nodes, each with its own local data. Applications are started on those nodes and then get to use the volume on their node. Data persists across application restarts. This is useful when the data is only cached information that can be discarded and reconstructed at any time and the application can reuse existing local data when restarting.

Volume	Kubernetes	PMEM-CSI	Limitations
Persistent	supported	supported	topology aware scheduling¹
Ephemeral	supported²	supported	resource constraints³
Cache	supported	supported	topology aware scheduling¹

¹Topology aware scheduling ensures that an application runs on a node where the volume was created. For CSI-based drivers like PMEM-CSI, Kubernetes >= 1.13 is needed. On older Kubernetes releases, pods must be scheduled manually onto the right node(s).

² CSI ephemeral volumes feature support is alpha in Kubernetes v1.15, and beta in v1.16.

³The upstream design for ephemeral volumes currently does not take resource constraints into account. If an application gets scheduled onto a node and then creating the ephemeral volume on that node fails, the application on the node cannot start until resources become available.

See exposing persistent and cache volumes for configuration information.

Capacity-aware pod scheduling

PMEM-CSI implements the CSI GetCapacity call, but Kubernetes currently doesn’t call that and schedules pods onto nodes without being aware of available storage capacity on the nodes. The effect is that pods using volumes with late binding may get tentatively assigned to a node and then get stuck because that decision is not reconsidered when the volume cannot be created there (a bug). Even if that decision is reconsidered, the same node may get selected again because Kubernetes does not get informed about the insufficient storage. Pods with ephemeral inline volumes always get stuck because the decision to use the node is final.

Work is under way to enhance scheduling in Kubernetes. In the meantime, PMEM-CSI provides two components that help with pod scheduling:

Scheduler extender

When a pod requests the special extended resource called pmem-csi.intel.com/scheduler, the Kubernetes scheduler calls a scheduler extender provided by PMEM-CSI with a list of nodes that a pod might run on. This extender is implemented in the master controller and thus can connect to the controller on each of these nodes to check for capacity. PMEM-CSI then filters out all nodes which currently do not have enough storage left for the volumes that still need to be created. This considers inline ephemeral volumes and all unbound volumes, regardless whether they use late binding or immediate binding.

This special scheduling can be requested manually by adding this snippet to one container in the pod spec:

containers:
- name: some-container
  ...
  resources:
    limits:
      pmem-csi.intel.com/scheduler: "1"
    requests:
      pmem-csi.intel.com/scheduler: "1"

      
    

This scheduler extender is optional and not necessarily installed in all clusters that have PMEM-CSI. Don’t add this extended resource unless the scheduler extender is installed, otherwise the pod won’t start!

See our implementation of a scheduler extender.

Pod admission webhook

Having to add pmem-csi.intel.com/scheduler manually is not user-friendly. To simplify this, PMEM-CSI provides a mutating admission webhook which intercepts the creation of all pods. If that pod uses inline ephemeral volumes or volumes with late binding that are provided by PMEM-CSI, the webhook transparently adds the extended resource request. PMEM-CSI volumes with immediate binding are ignored because for those the normal topology support ensures that unsuitable nodes are filtered out.

The webhook can only do that if the persistent volume claim (PVC) and its storage class have been created already. This is normally not required: it’s okay to create the pod first, then later add the PVC. The pod simply won’t start in the meantime.

The webhook deals with this uncertainty by allowing the creation of the pod without adding the extended resource when it lacks the necessary information. The alternative would be to reject the pod, but that would be a change of behavior of the cluster that may affect also pods that don’t use PMEM-CSI at all.

Users must take care to create PVCs first, then the pods if they want to use the webhook. In practice, that is often already done because it is more natural, so it is not a big limitation.

PMEM-CSI Operator

PMEM-CSI operator facilitates deploying and managing the PMEM-CSI driver on a Kubernetes cluster. This operator is based on the CoreOS operator-sdk tools and APIs.

The driver deployment is controlled by a cluster-scoped custom resource named Deployment in the pmem-csi.intel.com/v1alpha1 API group. The operator runs inside the cluster and listens for deployment changes. It makes sure that the required Kubernetes objects are created for a driver deployment. Refer to Deployment CRD for details.