Reliability, Availability, and Serviceability (RAS)
The goal of the RAS feature is to support the acceleration device's Reliability, Availability, and Serviceability by handling the error interrupts raised by the device.
Additionally, errors are counted by type, and the counters are made available via sysfs.
RAS Usage
Following PCIe specifications, errors are categorized as follows:
Error Type | Description |
---|---|
Correctable | Device can recover on its own, no software involvement. |
Uncorrectable | Software intervention is needed to resolve the error. This may require the application to reset the session or resend the request to the device. |
Fatal | Device unable to recover on its own even with software help. Restarting the device is required. |
The RAS sysfs files are located at /sys/devices/pciAAAA:BB/AAAA:BB:CC.D/ras_X
where:
AAAA:BB:CC.D is the Domain:BDF of the target Intel® QAT Endpoint.
ras_X is the error type (ras_correctable/ras_uncorrectable/ras_fatal).
Example:
cat /sys/bus/pci/devices/0000\:6b\:00.0/ras_fatal
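The single-file example above can be extended to read all three counters at once. The following is a minimal sketch, not part of the QAT tooling: the function name `read_ras_counters` is hypothetical, the device address is the one used in the example, and each file's content is printed as-is without assuming a particular format.

```shell
# Print the three RAS counters for one device.
# $1: sysfs device directory, e.g. /sys/bus/pci/devices/0000:6b:00.0
# (read_ras_counters is a hypothetical helper name, not a QAT command.)
read_ras_counters() {
    dev_dir="$1"
    for kind in correctable uncorrectable fatal; do
        file="$dev_dir/ras_$kind"
        if [ -r "$file" ]; then
            # Print the file content unchanged next to its name.
            printf '%s: %s\n' "ras_$kind" "$(cat "$file")"
        else
            # File missing or unreadable (e.g. driver not loaded).
            printf '%s: unavailable\n' "ras_$kind"
        fi
    done
}
```

Usage: `read_ras_counters /sys/bus/pci/devices/0000:6b:00.0`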
Note
RAS is enabled by default when the device is initialised.
AER Errors
The Linux kernel's PCIe Advanced Error Reporting (AER) driver handles errors that PCIe devices report through the AER mechanism.
AER error counters for each device are exposed through sysfs files categorized as follows:
Error Type | Description |
---|---|
AER Correctable | Device can recover on its own, no software involvement. |
AER Uncorrectable | Software intervention is needed to resolve the error. If the error is caused by a transaction failure or, for instance, by a packet memory buffer that ECC cannot restore, the device must be reset so that the transaction can be retried and recovery attempted. |
AER Fatal | In the case of a fatal error, the AER driver will additionally reset the PCIe link in an attempt to recover. |
AER error counters are exposed at /sys/bus/pci/devices/AAAA:BB:CC.D/aer_dev_X
where:
AAAA:BB:CC.D is the Domain:BDF of the target Intel® QAT Endpoint.
aer_dev_X is the error type (aer_dev_correctable/aer_dev_nonfatal/aer_dev_fatal). The kernel reports uncorrectable, non-fatal errors in aer_dev_nonfatal.
Example:
cat /sys/bus/pci/devices/0000\:6b\:00.0/aer_dev_correctable
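Each AER file lists one counter per error cause, followed by a `TOTAL_ERR_*` summary line. To see only the totals for a device, something like the sketch below may help. The function name `print_aer_totals` is hypothetical, and the script assumes the kernel's standard `aer_dev_*` file layout, where the summary line has the form `TOTAL_ERR_<type> <count>`.

```shell
# Print the TOTAL_ERR_* count from every aer_dev_* file of one device.
# $1: sysfs device directory, e.g. /sys/bus/pci/devices/0000:6b:00.0
# (print_aer_totals is a hypothetical helper name.)
print_aer_totals() {
    dev_dir="$1"
    for file in "$dev_dir"/aer_dev_*; do
        # If the glob matched nothing, the pattern stays unexpanded; skip it.
        [ -r "$file" ] || continue
        # The summary line is assumed to be "TOTAL_ERR_<type> <count>".
        total="$(awk '/^TOTAL_ERR_/ { print $2 }' "$file")"
        printf '%s: %s\n' "${file##*/}" "${total:-unknown}"
    done
}
```

Usage: `print_aer_totals /sys/bus/pci/devices/0000:6b:00.0`. Globbing on `aer_dev_*` avoids hard-coding the file names, so the helper prints nothing if AER statistics are not exposed for the device.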
Important
AER reporting must be enabled in the BIOS to have errors reported through AER.