Heartbeat

Under some circumstances, firmware in the Intel® QAT devices could become unresponsive, requiring a device reset to recover. The Intel® QAT Heartbeat feature provides a mechanism for the customer application to detect and reset unresponsive devices. It also notifies the application processes of the start and end of the reset operation and suspends all Intel® QAT instances between the events.

Heartbeat Operation

A Heartbeat-enabled Intel® QAT device firmware periodically writes counters to a specified physical memory location. A pair of counters per thread is incremented at the start and end of the main processing loop within the firmware. Checking for Heartbeat consists of checking the validity of the pair of counter values for each thread. Stagnant counters indicate a firmware hang.
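The stagnant-counter idea can be illustrated with a minimal sketch. The structure layout, the field names, and the rule "a thread is alive if either counter moved since the last check" are assumptions made for illustration only; the actual validity check is internal to the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one pair of counters per firmware thread. */
struct hb_pair {
    uint16_t loop_start; /* incremented at the start of the main loop */
    uint16_t loop_end;   /* incremented at the end of the main loop */
};

/*
 * Compare the current counters with the previous snapshot, then refresh
 * the snapshot. Returns false if any thread's pair did not move at all
 * since the previous call (a stagnant pair).
 */
bool hb_all_threads_alive(const struct hb_pair *now, struct hb_pair *prev,
                          size_t n_threads)
{
    bool alive = true;

    assert(now != NULL && prev != NULL);
    for (size_t i = 0; i < n_threads; i++) {
        if (now[i].loop_start == prev[i].loop_start &&
            now[i].loop_end == prev[i].loop_end)
            alive = false; /* no progress on this thread */
        prev[i] = now[i];  /* keep the snapshot for the next check */
    }
    return alive;
}
```

Calling the function twice with unchanged counters reports a hang on the second call, which is the behavior the paragraph above describes.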

Initialization

At startup, the Intel® QAT device driver allocates memory for the counter pairs to be written by the firmware and then sends a message to the firmware to start the Heartbeat functionality.

Heartbeat Monitoring

Heartbeat check/monitoring refers to the invocation of one of the two API calls that check whether the device is responsive. Heartbeat failure refers to the API returning a failure code.

The Intel® QAT driver does not monitor Heartbeat itself. Monitoring should be initiated by a Heartbeat management thread that periodically calls one of the following APIs:

  • icp_sal_check_device(Cpa32U accelId)

  • icp_sal_check_all_devices(void)

A failure return code implies the device has failed or hung.

The Heartbeat management thread should satisfy the following conditions:

  • For any given device, only one such process/thread should monitor.

  • One process can monitor one or more devices.

  • It can be a user application that uses Intel® QAT services, or a separate management/control plane process.

  • In a virtualized environment, the monitoring process(es)/thread(s) must run in the context of the host or hypervisor.

Resetting a Failed Device

The AutoResetOnError field in the device configuration file /etc/<device>.conf selects whether a failed device is reset automatically by the Intel® QAT framework or manually by the application, as shown below.

AutoResetOnError Values

AutoResetOnError Value    Action on Heartbeat Failure
0 (default)               Do not reset the device
1                         Reset the device automatically

If an Intel® QAT device is not configured for automatic reset, the management thread should reset it using the icp_sal_reset_device(Cpa32U accelId) API.

The icp_sal_reset_device() function starts an asynchronous reset sequence and returns immediately. To avoid a reset storm, the reset function should not be called again until the device has completed the reset. The icp_sal_check_device(<device id>) function can be called in a loop to check whether the device reset is still in progress.

If the application devices are all configured for automatic reset then the icp_sal_check_all_devices() function could be used; otherwise, the function should not be used because it does not return the identity of the failed device, which is a required parameter for the icp_sal_reset_device() function.
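The manual-reset rules above (reset once, then only check until the device recovers) can be sketched as a small state machine. This is an illustrative sketch, not the library's logic: the type and function names are hypothetical, and check_status stands for the result of one periodic check, with 0 meaning the device is responsive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical actions for the management thread. */
enum hb_action {
    HB_NONE,        /* device is responsive, nothing to do */
    HB_START_RESET, /* first failure seen: start the asynchronous reset */
    HB_WAIT         /* reset already started: keep checking, do not reset */
};

/* Per-device monitor state (hypothetical). */
struct hb_monitor {
    bool reset_in_progress;
};

/*
 * Feed in the result of one periodic check (0 = responsive, non-zero =
 * failed or still resetting) and get back what to do next.
 */
enum hb_action hb_on_check(struct hb_monitor *m, int check_status)
{
    assert(m != NULL);
    if (check_status == 0) {
        m->reset_in_progress = false; /* healthy again */
        return HB_NONE;
    }
    if (m->reset_in_progress)
        return HB_WAIT; /* avoids a reset storm */
    m->reset_in_progress = true;
    return HB_START_RESET;
}
```

In a real management thread, check_status would come from icp_sal_check_device() for the device being monitored, and HB_START_RESET would be the only case in which icp_sal_reset_device() is called.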

Function Signatures

The details of the above functions, parameters, and return values can be found in Supported APIs > Additional APIs.

Incorporating Heartbeat into Intel® QAT Applications

A typical Intel® QAT user application consists of two tasks:

  • The first task is typically an application thread that initializes Intel® QAT instances and sessions, and then submits service requests for Intel® QAT crypto or compression.

  • The second task receives Intel® QAT service responses. If the application employs polling to receive responses, this task is also an application thread. Alternatively, responses are received in an interrupt handler.

Two more tasks are required to support Heartbeat:

  • The first is a management task that monitors the devices for failure or hang and resets them when required. As discussed earlier, this could be an application thread or an independent management process.

  • The second task is an application thread that polls for device reset events:

    • Device is restarting: CPA_INSTANCE_EVENT_RESTARTING

    • Device restart is complete: CPA_INSTANCE_EVENT_RESTARTED

If the application employs polling to receive Intel® QAT service responses, then this task could be included in the same polling loop.

The polling for device events is done using the API: icp_sal_poll_device_events().

The two callback functions for crypto and compression are registered using the following APIs:

  • cpaCyInstanceSetNotificationCb

  • cpaDcInstanceSetNotificationCb

The details of the above functions, parameters, and return values can be found in Supported APIs > Additional APIs.
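One way an application might track the two reset events is to keep a per-instance flag that its notification callback updates. The sketch below is an assumption-laden illustration: the enum is a local stand-in for CPA_INSTANCE_EVENT_RESTARTING and CPA_INSTANCE_EVENT_RESTARTED, and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the two device reset events (hypothetical). */
enum app_event {
    APP_EVENT_RESTARTING,
    APP_EVENT_RESTARTED
};

/* Application-side bookkeeping for one QAT instance (hypothetical). */
struct app_instance {
    bool suspended;    /* true between RESTARTING and RESTARTED */
    unsigned restarts; /* number of completed restart cycles */
};

/* Body of the work a notification callback could do for each event. */
void app_on_event(struct app_instance *inst, enum app_event ev)
{
    assert(inst != NULL);
    switch (ev) {
    case APP_EVENT_RESTARTING:
        inst->suspended = true; /* stop submitting to this instance */
        break;
    case APP_EVENT_RESTARTED:
        if (inst->suspended)
            inst->restarts++;
        inst->suspended = false; /* safe to submit again */
        break;
    }
}

/* The submitting task checks this before sending a request. */
bool app_can_submit(const struct app_instance *inst)
{
    assert(inst != NULL);
    return !inst->suspended;
}
```

The callbacks registered through cpaCyInstanceSetNotificationCb and cpaDcInstanceSetNotificationCb would translate the event they receive and call app_on_event(); they run when the application calls icp_sal_poll_device_events().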

Restart Sequence

During the restart sequence, the user space library releases the memory used for rings and other data structures as part of the shutdown and reallocates them when the restart is completed. This is transparent to the user application, so it can continue to use the same logical instance after reset to submit Intel® QAT service requests. Any memory allocated by the user application for the Intel® QAT service is untouched during device reset.

A typical Heartbeat error use-case is as follows:

  1. The driver and the firmware are loaded, initialized and started.

  2. The user-space application registers to receive instance notifications by calling cpaCyInstanceSetNotificationCb and cpaDcInstanceSetNotificationCb.

  3. The management thread monitors the device’s heartbeat. When a device is unresponsive, a device reset is initiated by this thread or by the Intel® QAT framework, depending on the device configuration.

  4. The kernel-space process sends the restarting event to the user-space process.

  5. The user-space driver passes the device restarting event to all the registered application instances. It also frees memory and rings associated with the registered instances.

  6. The kernel-space driver triggers the device reset.

  7. During reset, the Intel® QAT service request made by the user application returns one of:

    • CPA_STATUS_FAIL

    • CPA_STATUS_RETRY

    • CPA_STATUS_RESTARTING

  8. When the device reset is complete, the kernel-space driver sends a device restarted event to the user space driver.

  9. The user space driver allocates the memory and rings and then forwards the device Restarted event to each of the registered instances.
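Step 7 of the sequence above means the submitting task has to decide what to do with each non-success status. The following sketch shows one possible policy; the policy itself, the enum values, and the function name are assumptions for illustration. The local req_status values stand in for CPA_STATUS_FAIL, CPA_STATUS_RETRY, and CPA_STATUS_RESTARTING.

```c
#include <assert.h>

/* Local stand-ins for the statuses a submit call can return (hypothetical). */
enum req_status { REQ_OK, REQ_FAIL, REQ_RETRY, REQ_RESTARTING };

/* What the application does next (hypothetical policy). */
enum req_action {
    REQ_DONE,               /* request accepted */
    REQ_RESUBMIT_LATER,     /* back off briefly and submit again */
    REQ_WAIT_FOR_RESTARTED, /* hold the request until the restarted event */
    REQ_REPORT_ERROR        /* give up and surface the error */
};

/*
 * Map a submit status to an action. attempts is the number of retries
 * already made for this request; max_attempts bounds the retry loop.
 */
enum req_action req_next_action(enum req_status status, unsigned attempts,
                                unsigned max_attempts)
{
    assert(max_attempts > 0);
    switch (status) {
    case REQ_OK:
        return REQ_DONE;
    case REQ_RETRY:
        return attempts < max_attempts ? REQ_RESUBMIT_LATER
                                       : REQ_REPORT_ERROR;
    case REQ_RESTARTING:
        return REQ_WAIT_FOR_RESTARTED;
    case REQ_FAIL:
        break;
    }
    return REQ_REPORT_ERROR;
}
```

Bounding the retries keeps the submitting task from spinning for the whole duration of a reset; holding requests on the restarting status relies on the restarted event described in step 9.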

Status of Packets in Flight (Crypto Applications Only)

When a device has fatal errors, the application ordinarily cannot determine whether or not inflight requests have been processed successfully.

The current Intel® QAT release includes a dummy response feature that creates mock responses to all requests submitted during a fatal error condition, so the application can detect them and, therefore, know which requests need to be resubmitted to the available devices or to the software.

Note

The sequence of dummy responses will match the sending request sequence for all requests submitted during a fatal error.

The dummy response feature supports only Public Key Encryption (PKE), so dummy responses are generated only when icp_sal_CyPollInstance(), the polling function for crypto services, is called.

The application should also call the icp_sal_poll_device_events() function, so that it is notified when the device encounters a failure and so that dummy responses are generated for the inflight requests when it calls icp_sal_CyPollInstance().

Determining Device ID

The <device id> that is passed as a parameter to several Heartbeat APIs is the numeric suffix of the device name displayed by the following command (for example, 0 for the device name qat_dev0):

service qat_service status

The output will look like:

There is 1 QAT acceleration device(s) in the system:
 qat_dev0 - type: c3xxx, inst_id: 0, node_id: 0, bsf: 01:00.0, #accel: 3 #engines: 6 state: up

The Intel® QAT library has no API to discover the device number easily. However, an application that knows the Bus Device Function (BDF) of a particular device can use the IOCTLs IOCTL_GET_NUM_DEVICES and IOCTL_STATUS_ACCEL_DEV to find its device_id. Refer to perform_query_dev() in ./adf_ctl.cpp.
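For tools that already have the device name from the status output, extracting the numeric suffix is straightforward. The helper below is a hypothetical convenience function, not part of the Intel® QAT library; it assumes names of the form qat_dev<N> as in the output above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the numeric suffix of a device name such as "qat_dev12",
 * or -1 if the name does not have the form qat_dev<digits>.
 */
long qat_dev_id_from_name(const char *name)
{
    static const char prefix[] = "qat_dev";
    const size_t plen = sizeof(prefix) - 1;
    char *end;
    long id;

    assert(name != NULL);
    if (strncmp(name, prefix, plen) != 0 ||
        !isdigit((unsigned char)name[plen]))
        return -1;
    id = strtol(name + plen, &end, 10);
    return (*end == '\0') ? id : -1; /* reject trailing characters */
}
```

For the example output above, qat_dev_id_from_name("qat_dev0") yields 0, which is the value to pass as <device id>.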

Testing Heartbeat

Two debug capabilities are available to assist developers in incorporating Heartbeat into their applications:

  • Simulation of Heartbeat failure.

  • System virtual files under /sys/kernel/debug/.

Simulated Heartbeat Failure Configuration

The Heartbeat feature is always enabled in the package. However, a debug capability that simulates device failure can be enabled during the configure step as follows:

./configure --enable-icp-hb-fail-sim

Simulating Heartbeat Failure

Simulating Heartbeat failure can be accomplished using two methods:

  • Using the API icp_sal_heartbeat_simulate_failure(<device id>).

  • Executing the command:

    cat /sys/kernel/debug/<device>/heartbeat_sim_fail
    

System Virtual Files

Note

The heartbeat /sys/kernel/debug files are associated with the QAT Physical Function (PF).

The Heartbeat feature implements the following system virtual files under the /sys/kernel/debug/qat_<device>_<your_device_BDF>/ directory.

Heartbeat System Virtual Files

File               Content
heartbeat          0: Device is responsive. -1: Device is NOT responsive.
heartbeat_failed   Number of times the device became unresponsive.
heartbeat_sent     Number of times the control process checked if the device is responsive.

A developer could simulate the Heartbeat management process by running the following script in the background:

#!/bin/bash
# Read the heartbeat file once per second; each read triggers a Heartbeat check.
while :; do
   cat /sys/kernel/debug/<device>/heartbeat > /dev/null
   sleep 1
done

Heartbeat Polling Frequencies

The application developer should decide on the following two Heartbeat polling frequencies:

  • Device Heartbeat monitoring.

  • Checking for device reset events.

Device Heartbeat Monitoring

Consider the following points when determining the frequency of Heartbeat monitoring:

  • Increasing Heartbeat monitoring frequency will minimize the customer’s system downtime.

  • However, since device unresponsiveness should be an infrequent event, high frequency Heartbeat monitoring wastes CPU cycles.

  • Also, if there are large Intel® QAT service requests that take some time to complete, high frequency Heartbeat monitoring could result in false reports of unresponsiveness.

  • With QAT Gen4 devices, the heartbeat update timer in firmware is fixed at 200 ms and cannot be configured. With QAT Gen2 devices, this value is set with the HeartbeatTimer configuration item (the default value is 500 ms and the minimum allowed value is 200 ms).

  • For both QAT Gen2 and Gen4, the monitoring interval should be greater than or equal to the Heartbeat update interval (for example, if the user configures HeartbeatTimer=300, the polling interval should be >= 300 ms).
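The interval rules in the last two points can be captured in a small helper that never returns a monitoring interval shorter than the firmware update interval. The function and constant names are hypothetical. Two behaviors are assumptions of this sketch: a HeartbeatTimer of 0 is treated as "not configured", and a configured value below 200 ms is raised to 200 ms rather than rejected.

```c
#include <assert.h>

#define HB_TIMER_MIN_MS          200u /* minimum allowed HeartbeatTimer */
#define HB_TIMER_GEN2_DEFAULT_MS 500u /* Gen2 default */
#define HB_TIMER_GEN4_MS         200u /* Gen4 fixed value */

enum qat_gen { QAT_GEN2, QAT_GEN4 };

/*
 * Firmware heartbeat update interval. heartbeat_timer_ms is the configured
 * HeartbeatTimer (0 if not set); it is ignored for Gen4.
 */
unsigned hb_update_interval_ms(enum qat_gen gen, unsigned heartbeat_timer_ms)
{
    if (gen == QAT_GEN4)
        return HB_TIMER_GEN4_MS;
    if (heartbeat_timer_ms == 0)
        return HB_TIMER_GEN2_DEFAULT_MS;
    return heartbeat_timer_ms < HB_TIMER_MIN_MS ? HB_TIMER_MIN_MS
                                                : heartbeat_timer_ms;
}

/* Monitoring interval to use: the requested one, raised if it is too short. */
unsigned hb_monitor_interval_ms(enum qat_gen gen, unsigned heartbeat_timer_ms,
                                unsigned requested_ms)
{
    unsigned floor_ms = hb_update_interval_ms(gen, heartbeat_timer_ms);

    assert(requested_ms > 0); /* a zero interval would busy-loop */
    return requested_ms < floor_ms ? floor_ms : requested_ms;
}
```

With HeartbeatTimer=300 on a Gen2 device, a requested interval of 100 ms is raised to 300 ms, matching the example in the text.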

Checking for Device Reset Events

If the application uses polling to read Intel® QAT service responses, there is no value in checking for reset events more frequently than it polls for responses. Since device unresponsiveness is an infrequent occurrence, the frequency of checking for reset events could be a fraction of the frequency of polling for Intel® QAT service responses.
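Checking for reset events at a fraction of the response polling rate can be done with a simple divider inside the polling loop. This is a minimal sketch with hypothetical names; the divisor is an application choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts polling-loop iterations (hypothetical helper). */
struct event_poll_divider {
    unsigned every; /* check for reset events once per this many iterations */
    unsigned count; /* iterations since the last check */
};

/* Call once per polling-loop iteration; returns true when a check is due. */
bool event_poll_due(struct event_poll_divider *d)
{
    assert(d != NULL && d->every > 0);
    if (++d->count < d->every)
        return false;
    d->count = 0;
    return true;
}
```

The response polling loop would call icp_sal_poll_device_events() only on the iterations where event_poll_due() returns true.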

Handling Device Failures in a Virtualized Environment

The Heartbeat feature in the acceleration software can be used in a virtualized environment. Refer to the Using Intel® Virtualization Technology (Intel® VT) with Intel® QuickAssist Technology Application Note for more details on enabling SR-IOV and the creation of Virtual Functions (VFs) from a single Intel® QuickAssist Technology acceleration device to support acceleration for multiple Virtual Machines (VMs).

The following sequence describes a possible use case for using the Heartbeat feature in a virtualized environment.

  1. The Intel® QAT Physical Function driver (PF driver) is loaded, initialized and started.

  2. The Intel® QAT Virtual Function driver (VF driver) is loaded, initialized and started in the Guest OS in the VM.

  3. The PF driver detects that the firmware is unresponsive, using either of the following methods: User Proc Entry Read (not Enabled by Default) or User Application Heartbeat APIs (not Enabled by Default).

  4. The PF driver sends the “Restarting” event message to the VF via the internal PF-to-VF communication messaging mechanism.

  5. The VF driver sends the “Restarting” event to the application’s registered callback. The callback is registered using either of the Intel® QAT API functions cpaDcInstanceSetNotificationCb() or cpaCyInstanceSetNotificationCb() in the Guest OS. (The application’s callback function may perform any application-level cleanup.)

  6. The PF driver starts the reset sequence (save state, initiate reset, and restore state).

  7. The user restarts the Guest OS and loads the VF driver and application in the Guest OS.

Note

  • If the Heartbeat feature in the acceleration software is not enabled, the PF driver will not notify the VF driver that the firmware is unresponsive.

  • The error detection mechanisms are not available on the VF driver in the VM, but device errors caused by any of the software running on the VM will be detected by the PF driver using the above mechanisms.

Incorporating Dummy Responses into an Intel® QAT Application

The dummy response feature has been incorporated in a scenario with the Intel® QAT engine and Nginx*. The figure below illustrates how it works. This can be used as a reference for so-called “software fallback.”

The Intel® QAT engine is a shim layer between OpenSSL* libcrypto* and Intel® QAT Library. The Intel® QAT Library will generate failover responses.

The Heartbeat Monitoring Daemon is a single process that checks the device status periodically and triggers the driver to reset the device when a heartbeat failure occurs. Its only activity is calling icp_sal_check_device() or icp_sal_check_all_devices() periodically.

The Intel® QAT Engine polls for and handles “device error” and “device ok” events (via udev). It keeps track of the number of devices which are active.

  • If some, but not all, Intel® QAT devices encounter errors, switch to the remaining available devices: resubmit the inflight requests that received dummy responses, and submit new requests, to the available devices.

  • If the number of active Intel® QAT devices goes to zero, switch to software: resubmit the inflight requests that received dummy responses, and submit new requests, to the software.

  • If the number of active Intel® QAT devices goes positive again, switch back to hardware.

../_images/Incorporating_Dummy_Responses_updated.png
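The switching rules above reduce to tracking how many devices are active and choosing a backend from that count. The sketch below is an illustration only, not the Intel® QAT Engine's code; all names are hypothetical, and it assumes each "device error" and "device ok" event refers to a distinct device.

```c
#include <assert.h>
#include <stddef.h>

enum backend { BACKEND_HARDWARE, BACKEND_SOFTWARE };

/* Count of devices currently usable (hypothetical bookkeeping). */
struct fallback_state {
    unsigned total_devices;
    unsigned active_devices;
};

/* A "device error" event: one fewer device is available. */
void fb_on_device_error(struct fallback_state *s)
{
    assert(s != NULL);
    if (s->active_devices > 0)
        s->active_devices--;
}

/* A "device ok" event: one more device is available again. */
void fb_on_device_ok(struct fallback_state *s)
{
    assert(s != NULL);
    if (s->active_devices < s->total_devices)
        s->active_devices++;
}

/* Where to send new requests and resubmitted inflight requests. */
enum backend fb_select(const struct fallback_state *s)
{
    assert(s != NULL);
    return s->active_devices > 0 ? BACKEND_HARDWARE : BACKEND_SOFTWARE;
}
```

With two devices, the first error leaves the selection on hardware, the second switches it to software, and the first "device ok" event switches it back to hardware.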