Heartbeat
Under some circumstances, firmware in the Intel® QAT devices could become unresponsive, requiring a device reset to recover. The Intel® QAT Heartbeat feature provides a mechanism for the customer application to detect and reset unresponsive devices. It also notifies the application processes of the start and end of the reset operation and suspends all Intel® QAT instances between the events.
Heartbeat Operation
A Heartbeat-enabled Intel® QAT device firmware periodically writes counters to a specified physical memory location. A pair of counters per thread is incremented at the start and end of the main processing loop within the firmware. Checking for Heartbeat consists of checking the validity of the pair of counter values for each thread. Stagnant counters indicate a firmware hang.
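The counter check described above can be illustrated with a minimal sketch. The struct layout, the 16-bit counter width, and the rule "a pair that has not changed since the previous snapshot is stagnant" are assumptions made for illustration; they are not taken from the driver source, and the names `hb_pair` and `hb_all_threads_alive` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one start/end counter pair per firmware thread. */
struct hb_pair {
    uint16_t start; /* incremented at the start of the firmware main loop */
    uint16_t end;   /* incremented at the end of the firmware main loop */
};

/*
 * Compare the current snapshot with the previous one.
 * Returns true if every thread changed at least one of its counters,
 * false if any thread's pair is stagnant (assumed here to mean a hang).
 * The current snapshot is copied into prev for the next check.
 */
bool hb_all_threads_alive(struct hb_pair *prev,
                          const struct hb_pair *cur,
                          size_t num_threads)
{
    bool alive = true;

    assert(prev != NULL && cur != NULL);
    for (size_t i = 0; i < num_threads; i++) {
        if (cur[i].start == prev[i].start && cur[i].end == prev[i].end)
            alive = false; /* no progress since the last check */
        prev[i] = cur[i];
    }
    return alive;
}
```

In this sketch a single stagnant thread is enough to report the whole device as unresponsive, which mirrors the "stagnant counters indicate a firmware hang" rule in the text.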
Initialization
At startup, the Intel® QAT device driver allocates memory for the counter pairs to be written by the firmware and then sends a message to the firmware to start the Heartbeat functionality.
Heartbeat Monitoring
Heartbeat check/monitoring refers to the invocation of one of the two API calls that check whether the device is responsive. Heartbeat failure refers to the API returning failure.
The Intel® QAT driver does not monitor for Heartbeat. It should be initiated by a Heartbeat management thread calling one of the following APIs periodically:
icp_sal_check_device(Cpa32U packageId)
icp_sal_check_all_devices(void)
A failure return code implies the device has failed or hung.
The Heartbeat management thread should satisfy the following conditions:
For any given device, only one such process/thread should monitor.
One process can monitor one or more devices.
It can be a user application that uses Intel® QAT services, or a separate management/control plane process.
In a virtualized environment, the monitoring process(es)/thread(s) must run in the context of the host or hypervisor.
Important
When performing heartbeat checks, such as executing cat /sys/kernel/debug/<device>/heartbeat or calling APIs like icp_sal_check_all_devices() or icp_sal_check_device(), be aware that these actions may result in the silent clearing of the BusMaster, which disables QAT DMA operations. Disabling PCIe Bus Master is done to put the QAT device in a safe state to prevent the possibility of additional data corruption. This behavior can lead to unexpected QAT device hangs if a simulated or incorrectly detected heartbeat failure triggers BusMaster clearing.
Users should ensure that heartbeat monitoring is performed with an understanding of the potential impact on BusMaster status, especially when configuring monitoring frequencies or using failure simulation tools.
Resetting a Failed Device
A device can be configured for automatic reset by the Intel® QAT framework or manually reset by the application. The configuration differs based on whether the device is using the out-of-tree (OOT) driver or the in-tree kernel driver.
Out-of-Tree (OOT) Driver
For devices using the OOT driver, the AutoResetOnError field in the device configuration file /etc/<device>.conf can be used to configure automatic reset, as shown below.
AutoResetOnError Value | Action on Heartbeat Failure |
---|---|
0 (default) | Do not reset the device |
1 | Reset the device automatically |
If an Intel® QAT device is not configured for automatic reset, the management thread should reset it using the icp_sal_reset_device() API.
The icp_sal_reset_device() function starts an asynchronous reset sequence and returns immediately. To avoid a reset storm, the reset function should not be called again until the device has completed the reset. The icp_sal_check_device() function can be called in a loop to check whether the device reset is still in progress.
If all of the application's devices are configured for automatic reset, the icp_sal_check_all_devices() function can be used. Otherwise, it should not be used, because it does not return the identity of the failed device, which is a required parameter of the icp_sal_reset_device() function.
In-Tree Kernel Driver
For devices using the in-tree kernel driver, automatic reset configuration is managed through the sysfs interface. The path will be similar to:
/sys/bus/pci/devices/0000:05:00.0/qat/auto_reset
The automatic reset applies to PCIe errors as well as Heartbeat errors. For more detailed information, refer to the official documentation at: sysfs-driver-qat
Function Signatures
The details of the above functions, parameters, and return values can be found in Supported APIs > Additional APIs.
Incorporating Heartbeat into Intel® QAT Applications
A typical Intel® QAT user application consists of two tasks:
The first task is typically an application thread that initializes Intel® QAT instances and sessions, and then submits service requests for Intel® QAT crypto or compression.
The second task receives Intel® QAT service responses. If the application employs polling, this task is also an application thread. Otherwise, responses are received in an interrupt handler.
Two more tasks are required to support Heartbeat:
The first is a management task that monitors the devices for failure or hang and resets them when required. As discussed earlier, this could be an application thread or an independent management process.
The second task is an application thread that polls for device reset events:
Device encountered error:
CPA_INSTANCE_EVENT_FATAL_ERROR
Device is restarting:
CPA_INSTANCE_EVENT_RESTARTING
Device restart is complete:
CPA_INSTANCE_EVENT_RESTARTED
If the application employs polling to receive Intel® QAT service responses, then this task could be included in the same polling loop.
Device events are polled using the icp_sal_poll_device_events() API.
The two callback functions for crypto and compression are registered using the following APIs. If an event occurs, the poll will trigger callbacks registered by these APIs:
cpaCyInstanceSetNotificationCb
cpaDcInstanceSetNotificationCb
The details of the above functions, parameters, and return values can be found in Supported APIs > Additional APIs.
Restart Sequence
During the restart sequence, the user space library releases the memory used for rings and other data structures as part of the shutdown and reallocates them when the restart is completed. This is transparent to the user application, so it can continue to use the same logical instance after reset to submit Intel® QAT service requests. Any memory allocated by the user application for the Intel® QAT service is untouched during device reset.
A typical Heartbeat error use-case is as follows:
The driver and the firmware are loaded, initialized, and started.
The user-space application registers to receive instance notifications by calling cpaCyInstanceSetNotificationCb and cpaDcInstanceSetNotificationCb.
The management thread monitors the device’s heartbeat. When a device is unresponsive, a device reset is initiated by this thread or by the Intel® QAT framework, depending on the device configuration.
The kernel-space process sends the restarting event to the user-space process.
The user-space driver passes the device restarting event to all the registered application instances. It also frees memory and rings associated with the registered instances.
The kernel-space driver triggers the device reset.
During reset, the Intel® QAT service request made by the user application returns one of:
CPA_STATUS_FAIL
CPA_STATUS_RETRY
CPA_STATUS_RESTARTING
When the device reset is complete, the kernel-space driver sends a device restarted event to the user space driver.
The user space driver allocates the memory and rings and then forwards the device Restarted event to each of the registered instances.
Status of Packets in Flight
When a device has fatal errors, the application ordinarily cannot determine whether or not inflight requests have been processed successfully.
The Intel® QAT release includes a dummy response feature that creates mock responses to all requests submitted during a fatal error condition, so the application can detect them and, therefore, know which requests need to be resubmitted to the available devices or to the software.
Note
The sequence of dummy responses will match the sending request sequence for all requests submitted during a fatal error.
Crypto Applications
The dummy response feature supports both Public Key Encryption (PKE) and symmetric crypto. Dummy responses may be generated when the icp_sal_CyPollInstance() function is called, since it is the polling function for crypto services.
The icp_sal_poll_device_events() function should also be called by the application, so that the application gets a notification when the device encounters a failure and dummy responses are generated when calling icp_sal_CyPollInstance() for the inflight requests.
Compression Applications
The dummy response feature also supports compression services. Dummy responses may be generated when the icp_sal_DcPollInstance() function is called, as it handles compression services.
The icp_sal_poll_device_events() function remains essential for applications to receive notifications of device failures and the generation of dummy responses for inflight requests.
Determining Device ID
The <device id> that is passed as a parameter to several Heartbeat APIs is the numeric suffix of the device name (for example, 0 for qat_dev0) displayed by the following command:
service qat_service status
The output will look like:
There is 1 QAT acceleration device(s) in the system:
qat_dev0 - type: 4xxx, inst_id: 0, node_id: 0, bsf: 0000:76:00.0, #accel: 1 #engines: 9 state: up
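Extracting the numeric suffix from a device name such as qat_dev0 can be sketched as follows. The helper name `qat_dev_name_to_id` is hypothetical and is not part of the Intel® QAT API. It only parses a name that the caller has already obtained, for example from the status output.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the numeric suffix of a name of the form "qat_dev<N>",
 * or -1 if the name does not have that form.
 */
long qat_dev_name_to_id(const char *name)
{
    static const char prefix[] = "qat_dev";
    const size_t plen = sizeof(prefix) - 1;
    char *end = NULL;
    long id;

    assert(name != NULL);
    /* Require the exact prefix followed by at least one digit. */
    if (strncmp(name, prefix, plen) != 0 ||
        !isdigit((unsigned char)name[plen]))
        return -1;

    id = strtol(name + plen, &end, 10);
    /* Reject trailing characters such as "qat_dev3x". */
    return (*end == '\0') ? id : -1;
}
```

The returned value would then be passed as the device ID to calls such as icp_sal_check_device().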
The Intel® QAT library has no API to discover the device number easily. However, an application can use the IOCTLs IOCTL_GET_NUM_DEVICES and IOCTL_STATUS_ACCEL_DEV to find the device_id of a particular device if it knows the Bus Device Function (BDF). Refer to perform_query_dev() in ./adf_ctl.cpp.
Testing Heartbeat
Two debug capabilities are available to assist the developers incorporating Heartbeat into their applications:
Simulation of Heartbeat failure.
System virtual files under /sys/kernel/debug/.
Simulated Heartbeat Failure Configuration
The Heartbeat feature is always enabled in the package. However, debug capabilities that simulate device failure can be configured differently for Out-Of-Tree and In-Tree drivers.
Out-Of-Tree Driver
For the Out-Of-Tree driver, enable the simulation of device failure during the configure step as follows:
./configure --enable-icp-hb-fail-sim
Simulating Heartbeat Failure
Simulating Heartbeat failure for the Out-Of-Tree driver can be accomplished using two methods:
Using the API icp_sal_heartbeat_simulate_failure(<device id>).
Executing the command:
cat /sys/kernel/debug/<device>/heartbeat_sim_fail
In-Tree Driver
For the In-Tree driver, configure QATlib to support Heartbeat failure simulation during the configure step as follows:
./configure --enable-hb-error-simulation
Simulating Heartbeat Failure
Simulating Heartbeat failure for the In-Tree driver can be accomplished using the following method:
Execute the command:
echo 1 > /sys/kernel/debug/qat_<device_id>/heartbeat/inject_error
System Virtual Files
Note
The heartbeat /sys/kernel/debug files are associated with the QAT Physical Function (PF).
The files are located in the /sys/kernel/debug/qat_<device>_<your_device_BDF>/ directory.
File | Content |
---|---|
 | 0: Device is responsive. -1: Device is NOT responsive. |
 | Number of times the device became unresponsive. |
 | Number of times the control process checked if the device is responsive. |
A developer could simulate the Heartbeat management process by running the following script in the background:
#!/bin/bash
# Read the heartbeat file once per second to trigger a Heartbeat check.
while :; do
    cat /sys/kernel/debug/<device>/heartbeat > /dev/null
    sleep 1
done
Heartbeat Polling Frequencies
The application developer should decide on the following two Heartbeat polling frequencies:
Device Heartbeat monitoring.
Checking for device reset events.
Device Heartbeat Monitoring
Consider the following points when determining the frequency of Heartbeat monitoring:
Increasing Heartbeat monitoring frequency will minimize the customer’s system downtime.
However, since device unresponsiveness should be an infrequent event, high frequency Heartbeat monitoring wastes CPU cycles.
Also, if there are large Intel® QAT service requests that take some time to complete, high frequency Heartbeat monitoring could result in false reports of unresponsiveness.
With QAT Gen4 devices, the Heartbeat update timer in the firmware is a constant 200 ms and is not configurable. With QAT Gen2 devices, this value is set with the HeartbeatTimer configuration item (the default value is 500 ms and the minimum allowed value is 200 ms).
For both QAT Gen2 and Gen4, the monitoring interval should be greater than or equal to the Heartbeat update interval. For example, if the user configures HeartbeatTimer=300, the polling interval should be at least 300 ms.
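The interval rule above can be written as a small helper that never returns a monitoring interval shorter than the firmware update interval. This is a sketch under stated assumptions: the function and enum names are hypothetical, a configured value of 0 is used here to mean "HeartbeatTimer not set", and clamping a too-small Gen2 value up to 200 ms is a choice made for the example rather than documented driver behavior.

```c
#include <assert.h>
#include <stdint.h>

enum qat_gen { QAT_GEN2, QAT_GEN4 };

#define HB_GEN4_TIMER_MS   200u /* fixed in Gen4 firmware */
#define HB_GEN2_DEFAULT_MS 500u /* HeartbeatTimer default */
#define HB_GEN2_MIN_MS     200u /* smallest allowed HeartbeatTimer */

/* Firmware Heartbeat update interval for the given generation. */
uint32_t hb_update_interval_ms(enum qat_gen gen, uint32_t configured_ms)
{
    assert(gen == QAT_GEN2 || gen == QAT_GEN4);
    if (gen == QAT_GEN4)
        return HB_GEN4_TIMER_MS; /* configured value is ignored */
    if (configured_ms == 0)
        return HB_GEN2_DEFAULT_MS; /* assumption: 0 means "not set" */
    return configured_ms < HB_GEN2_MIN_MS ? HB_GEN2_MIN_MS : configured_ms;
}

/* Monitoring interval: the desired value, raised to the update interval. */
uint32_t hb_poll_interval_ms(enum qat_gen gen, uint32_t configured_ms,
                             uint32_t desired_ms)
{
    uint32_t floor_ms = hb_update_interval_ms(gen, configured_ms);

    return desired_ms < floor_ms ? floor_ms : desired_ms;
}
```

For example, with HeartbeatTimer=300 on a Gen2 device, a desired interval of 100 ms is raised to 300 ms, while a desired interval of 1000 ms is left unchanged.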
Checking for Device Reset Events
If the application uses polling to read Intel® QAT service responses, there is no value in checking for reset events more frequently than it polls for responses. Since device unresponsiveness is an infrequent occurrence, the frequency of checking for reset events could be a fraction of the frequency of polling for Intel® QAT service responses.
Handling Device Failures in a Virtualized Environment
The Heartbeat feature in the acceleration software can be used in a virtualized environment. Refer to the Using Intel® Virtualization Technology (Intel® VT) with Intel® QuickAssist Technology Application Note for more details on enabling SR-IOV and the creation of Virtual Functions (VFs) from a single Intel® QuickAssist Technology acceleration device to support acceleration for multiple Virtual Machines (VMs).
The following sequence describes a possible use case for using the Heartbeat feature in a virtualized environment.
The Intel® QAT Physical Function driver (PF driver) is loaded, initialized and started.
The Intel® QAT Virtual Function driver (VF driver) is loaded, initialized and started in the Guest OS in the VM.
The PF driver detects that the firmware is unresponsive (using either of the following methods: User Proc Entry Read (not Enabled by Default) or User Application Heartbeat APIs (not Enabled by Default)).
The PF driver sends the “Restarting” event message to the VF via the internal PF-to-VF communication messaging mechanism.
The VF driver sends the “Restarting” event to the application’s registered callback. The callback is registered using either of the Intel® QAT API functions cpaDcInstanceSetNotificationCb() or cpaCyInstanceSetNotificationCb() in the Guest OS. (The application’s callback function may perform any application-level cleanup.)
The PF driver starts the reset sequence (save state, initiate reset, and restore state).
The user restarts the Guest OS and loads the VF driver and application in the Guest OS.
Note
If the Heartbeat feature in the acceleration software is not enabled, the PF driver will not notify the VF driver that the firmware is unresponsive.
The error detection mechanisms are not available on the VF driver in the VM, but device errors caused by any of the software running on the VM will be detected by the PF driver using the above mechanisms.
Incorporating Dummy Responses into an Intel® QAT Application
The dummy response feature has been incorporated in a scenario with the Intel® QAT engine and Nginx*. The figure below illustrates how it works. This can be used as a reference for so-called “software fallback.”
The Intel® QAT engine is a shim layer between OpenSSL* libcrypto* and Intel® QAT Library. The Intel® QAT Library will generate failover responses.
The Heartbeat Monitoring Daemon is a single process that periodically checks the device status and triggers the driver to reset the device when a Heartbeat failure occurs. Its only activity is calling icp_sal_check_device() or icp_sal_check_all_devices() periodically.
The Intel® QAT Engine polls for and handles “device error” and “device ok” events (via udev). It keeps track of the number of devices which are active.
If some, but not all, Intel® QAT devices encounter errors, switch to the remaining available devices: the inflight requests that received dummy responses are resubmitted as new requests to those devices.
If the number of active Intel® QAT devices drops to zero, switch to software: the inflight requests that received dummy responses are resubmitted as new requests to the software implementation.
If the number of active Intel® QAT devices goes positive again, switch back to hardware.
