Reliability, Availability, and Serviceability (RAS)
The goal of the RAS feature is to support the reliability, availability, and serviceability of the acceleration device by handling the error interrupts that the device raises.
RAS Usage
Following PCIe specifications, errors are categorized as follows:
Error Type | Description |
---|---|
Correctable | Device can recover on its own, no software involvement. The |
Uncorrectable | Software intervention is needed to resolve the error. This may require the application to reset the session or resend the request to the device. The |
Fatal | Device unable to recover on its own even with software help. Restarting the device is required. The |
The RAS sysfs files are located at /sys/bus/pci/devices/AAAA:BB:CC.D/ras_X
where:
AAAA:BB:CC.D is the Domain:BDF of the target Intel® QAT Endpoint.
ras_X is the error type (ras_correctable/ras_uncorrectable/ras_fatal).
Example:
cat /sys/bus/pci/devices/0000\:6b\:00.0/ras_fatal
Note
RAS is enabled by default when the device is initialized.
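To read all three RAS counters in one go, a small helper along the following lines may be convenient. This is only a sketch: the function name `print_ras_counters` and the `SYSFS_PCI_ROOT` override are hypothetical names introduced here, and the sketch assumes each `ras_X` file can simply be read as text.

```shell
# Print the three QAT RAS counters for one device.
# SYSFS_PCI_ROOT is a hypothetical override (handy for dry runs);
# it defaults to the real sysfs location described above.
print_ras_counters() {
    bdf="$1"
    root="${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}"
    for kind in correctable uncorrectable fatal; do
        file="$root/$bdf/ras_$kind"
        if [ -r "$file" ]; then
            printf 'ras_%s: %s\n' "$kind" "$(cat "$file")"
        else
            # File missing or unreadable (for example, driver not loaded).
            printf 'ras_%s: unavailable\n' "$kind"
        fi
    done
}

# Usage: print_ras_counters 0000:6b:00.0
```

Inside a function argument the colons in the address need no backslash escaping, as long as the argument is quoted or contains no shell metacharacters.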
AER Errors
The Linux kernel implements a PCIe Advanced Error Reporting (AER) driver that handles the errors PCIe devices report through the AER mechanism.
AER error counters for each device are exposed through sysfs files categorized as follows:
Error Type | Description |
---|---|
AER Correctable | Device can recover on its own; no software involvement is needed. The |
AER Non-Fatal | Software intervention is needed to resolve the error. If the error is caused by a transaction failure or, for instance, by a packet memory buffer that ECC cannot restore, the device must be reset so that the transaction can be retried and recovery attempted. The |
AER Fatal | For a fatal error, the AER driver additionally resets the PCIe link in an attempt to recover. The |
AER error counters are exposed at /sys/bus/pci/devices/AAAA:BB:CC.D/aer_dev_X
where:
AAAA:BB:CC.D is the Domain:BDF of the target Intel® QAT Endpoint.
aer_dev_X is the error type (aer_dev_correctable/aer_dev_nonfatal/aer_dev_fatal).
Example:
cat /sys/bus/pci/devices/0000\:6b\:00.0/aer_dev_correctable
Important
AER reporting must be enabled in the BIOS to have errors reported through AER.
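Each AER statistics file typically holds one counter per line rather than a single number. The sketch below prints only the summary count from each file. It assumes the mainline kernel layout, in which the files are named aer_dev_correctable, aer_dev_nonfatal, and aer_dev_fatal and each contains a line whose first field starts with `TOTAL_`. The function name `print_aer_totals` and the `SYSFS_PCI_ROOT` override are hypothetical.

```shell
# Print the TOTAL_* count from each AER statistics file of one device.
# SYSFS_PCI_ROOT is a hypothetical override; it defaults to real sysfs.
print_aer_totals() {
    bdf="$1"
    root="${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}"
    for kind in correctable nonfatal fatal; do
        file="$root/$bdf/aer_dev_$kind"
        if [ -r "$file" ]; then
            # The last field of the TOTAL_* line is the summary count.
            awk -v k="$kind" '$1 ~ /^TOTAL_/ { print "aer_dev_" k ": " $NF }' "$file"
        else
            echo "aer_dev_$kind: unavailable"
        fi
    done
}

# Usage: print_aer_totals 0000:6b:00.0
```

If a file has no `TOTAL_` line, nothing is printed for that error type, which makes a format mismatch easy to spot.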
AER Error Injection Testing
To conduct PCI Error Injection with AER, follow these steps:
Kernel Configuration:
Ensure the kernel is compiled with CONFIG_PCIEAER_INJECT=m to obtain the aer_inject.ko module.
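One way to check an already installed kernel is to look at its config file. The sketch below assumes the distribution installs the config as /boot/config-$(uname -r), which is common but not universal. The function name `aer_inject_config` is hypothetical.

```shell
# Report how CONFIG_PCIEAER_INJECT is set in a kernel config file.
# Defaults to /boot/config-<running kernel>; pass another path to override.
aer_inject_config() {
    config="${1:-/boot/config-$(uname -r)}"
    if grep -q '^CONFIG_PCIEAER_INJECT=m$' "$config" 2>/dev/null; then
        echo module      # aer_inject.ko should be available
    elif grep -q '^CONFIG_PCIEAER_INJECT=y$' "$config" 2>/dev/null; then
        echo builtin     # compiled into the kernel, no module to load
    else
        echo missing     # not set, or the config file was not found
    fi
}

# Usage: aer_inject_config
```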
Enable Linux Native PCIe Interrupt Handler:
Update the kernel parameters to use native PCIe interrupt handling:
sudo grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="pcie_ports=native"
sudo reboot
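After the reboot, it can be worth confirming that the parameter actually reached the running kernel. A minimal sketch, assuming the kernel command line is readable from /proc/cmdline; the function name `has_native_pcie_ports` is hypothetical:

```shell
# Succeed if pcie_ports=native is present on the kernel command line.
# Defaults to /proc/cmdline; pass another file to override.
has_native_pcie_ports() {
    cmdline="${1:-/proc/cmdline}"
    # Put each parameter on its own line, then require an exact match.
    tr ' ' '\n' < "$cmdline" | grep -qx 'pcie_ports=native'
}

# Usage: has_native_pcie_ports && echo "native PCIe port handling requested"
```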
Load AER Inject Kernel Module:
Load the aer_inject module to enable the /dev/aer_inject interface:
sudo modprobe aer_inject
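A quick way to confirm that the module created its interface is to test for the character device. This is a sketch only, and the function name `aer_inject_ready` is hypothetical:

```shell
# Succeed if the aer_inject character device exists.
# Defaults to /dev/aer_inject; pass another path to override.
aer_inject_ready() {
    dev="${1:-/dev/aer_inject}"
    [ -c "$dev" ]
}

# Usage: sudo modprobe aer_inject && aer_inject_ready && echo "ready"
```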
Clone and Compile User Space Tool:
Clone the aer-inject repository and compile the user space tool to inject errors:
git clone https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
cd aer-inject
make -j
Inject Errors:
Assuming the QAT driver is loaded, use the aer-inject tool to inject errors. Adjust the PCIe address to match your environment:
sudo ./aer-inject --id 6b:00.0 ./examples/nonfatal
After error injection, only the PCIe AER error counters will increase (e.g., /sys/bus/pci/devices/0000:6b:00.0/aer_dev_correctable or aer_dev_nonfatal). QAT RAS error counters will not increase.
You may observe a message in dmesg such as:
AER: Uncorrected (Non-Fatal) error received: 0000:6b:00.0
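To confirm that an injection moved a counter, the summary count can be compared before and after the command runs. The following is a rough sketch with hypothetical helper names (`aer_total`, `aer_delta`); it assumes the counter file contains a line whose first field starts with `TOTAL_`, as in the mainline kernel AER statistics format.

```shell
# Extract the TOTAL_* count from one AER statistics file (0 if absent).
aer_total() {
    [ -r "$1" ] || { echo 0; return; }
    awk '$1 ~ /^TOTAL_/ { n = $NF } END { print n + 0 }' "$1"
}

# Run a command and print how much the counter in FILE grew.
# Usage: aer_delta FILE COMMAND [ARGS...]
aer_delta() {
    file="$1"; shift
    before=$(aer_total "$file")
    "$@" >/dev/null 2>&1          # the injection command; output discarded
    after=$(aer_total "$file")
    echo $((after - before))
}

# Usage:
#   aer_delta /sys/bus/pci/devices/0000:6b:00.0/aer_dev_nonfatal \
#       sudo ./aer-inject --id 6b:00.0 ./examples/nonfatal
```

Error handling in the kernel may complete slightly after the injection command returns, so a reported difference of 0 is worth rechecking after a short delay before concluding that nothing was counted.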
Important
The AER injection test focuses on two levels of errors:
PCIe Interface Errors: These involve testing the PCIe interface through PCIe AER testing, including fatal, non-fatal, correctable, and uncorrectable errors. This level of testing can be conducted by customers using the AER injection tools.
Low-Level Errors: These target specific hardware components such as slices, bus, and memory parity. This type of testing may require specialized hardware configurations or access to specific CPU parts that are not generally available outside of Intel.
The PCIe AER testing is designed to simulate errors at the PCIe interface level, which can help identify and address issues related to PCIe communication and error handling.