Reliability, Availability, and Serviceability (RAS)

The RAS feature supports the Reliability, Availability, and Serviceability of the acceleration device by handling the error interrupts that the device raises.

In addition, errors are counted by type, and the counters are made available via sysfs.
For additional details, refer to the sysfs-driver-qat_ras documentation.

RAS Usage

Following PCIe specifications, errors are categorized as follows:

RAS Error Types

Error Type      Description
Correctable     Device can recover on its own, no software involvement. The ras_correctable counter is incremented in sysfs.
Uncorrectable   Software intervention is needed to resolve the error. This may require the application to reset the session or resend the request to the device. The ras_uncorrectable counter is incremented in sysfs.
Fatal           Device unable to recover on its own even with software help. Restarting the device is required. The ras_fatal counter is incremented in sysfs.

The RAS sysfs files are located at /sys/bus/pci/devices/AAAA:BB:CC.D/ras_X where:

  • AAAA:BB:CC.D is the Domain:BDF of the target Intel® QAT Endpoint.

  • ras_X is the error type (ras_correctable/ras_uncorrectable/ras_fatal).

Example:

cat /sys/bus/pci/devices/0000\:6b\:00.0/ras_fatal
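
To read all three RAS counters for one device in a single step, a small shell helper along the following lines may be convenient. This is a sketch only: the function name qat_ras_counters and its optional second argument (an override for the sysfs root, useful for dry runs) are illustrative and not part of the QAT driver or its tools. It assumes each ras_X file holds a single counter value.

```shell
#!/bin/sh
# Print the three QAT RAS counters for one device.
# Usage: qat_ras_counters <Domain:BDF> [sysfs-devices-root]
# The second argument is a hypothetical override for dry runs; it defaults
# to the sysfs location described above.
qat_ras_counters() {
    bdf=$1
    root=${2:-/sys/bus/pci/devices}
    for kind in correctable uncorrectable fatal; do
        file="$root/$bdf/ras_$kind"
        if [ -r "$file" ]; then
            printf 'ras_%s: %s\n' "$kind" "$(cat "$file")"
        else
            # Missing file: the device is absent or RAS is not exposed.
            printf 'ras_%s: unavailable\n' "$kind"
        fi
    done
}

# Example (adjust the address to match your system):
# qat_ras_counters 0000:6b:00.0
```

Because the address is passed as a quoted argument, the colons do not need to be escaped as in the cat example.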

Note

RAS is enabled by default when the device is initialized.

AER Errors

The Linux kernel includes an AER driver that handles errors reported by PCIe devices through the Advanced Error Reporting (AER) mechanism.

AER error counters for each device are exposed through sysfs files categorized as follows:

RAS AER Errors

Error Type        Description
AER Correctable   Device can recover on its own, no software involvement. The aer_dev_correctable counter is incremented in sysfs.
AER Non-Fatal     Software intervention is needed to resolve the error. If the error is caused by a transaction failure or, for instance, by a packet memory buffer that cannot be restored by ECC, the device must be reset in order to retry the transaction and attempt recovery. The aer_dev_nonfatal counter is incremented in sysfs.
AER Fatal         In the case of a fatal error, the AER driver will additionally reset the PCIe link in an attempt to recover. The aer_dev_fatal counter is incremented in sysfs.

AER error counters are exposed at /sys/bus/pci/devices/AAAA:BB:CC.D/aer_dev_X where:

  • AAAA:BB:CC.D is the Domain:BDF of the target Intel® QAT Endpoint.

  • aer_dev_X is the error type (aer_dev_correctable/aer_dev_nonfatal/aer_dev_fatal).

Example:

cat /sys/bus/pci/devices/0000\:6b\:00.0/aer_dev_correctable
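
Each aer_dev_X file lists one line per error source followed by a summary line whose name starts with TOTAL_ERR_ (for example TOTAL_ERR_COR). The helper below is a minimal sketch that prints only those totals for the three AER categories. The function name aer_totals and the optional sysfs root argument are illustrative assumptions, not part of the kernel or the QAT tooling.

```shell
#!/bin/sh
# Print the total AER error count per category for one device.
# Usage: aer_totals <Domain:BDF> [sysfs-devices-root]
# The second argument is a hypothetical override for dry runs.
aer_totals() {
    bdf=$1
    root=${2:-/sys/bus/pci/devices}
    for kind in correctable nonfatal fatal; do
        file="$root/$bdf/aer_dev_$kind"
        if [ ! -r "$file" ]; then
            # No file usually means the device does not expose AER statistics.
            printf 'aer_dev_%s: unavailable\n' "$kind"
            continue
        fi
        # The summary line starts with TOTAL_ERR_ and ends with the count.
        total=$(awk '/^TOTAL_ERR_/ { print $NF }' "$file")
        printf 'aer_dev_%s: %s\n' "$kind" "${total:-unknown}"
    done
}

# Example (adjust the address to match your system):
# aer_totals 0000:6b:00.0
```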

Important

AER reporting must be enabled in the BIOS to have errors reported through AER.

AER Error Injection Testing

To conduct PCI Error Injection with AER, follow these steps:

  1. Kernel Configuration:

    Ensure the kernel is compiled with CONFIG_PCIEAER_INJECT=m to obtain the aer_inject.ko module.

  2. Enable Linux Native PCIe Interrupt Handler:

    Update the kernel parameters to use native PCIe interrupt handling:

    sudo grubby --update-kernel=/boot/vmlinuz-$(uname -r) --args="pcie_ports=native"
    sudo reboot
    
  3. Load AER Inject Kernel Module:

    Load the aer_inject module to enable the /dev/aer_inject interface:

    sudo modprobe aer_inject
    
  4. Clone and Compile User Space Tool:

    Clone the aer-inject repository and compile the user space tool to inject errors:

    git clone https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
    cd aer-inject
    make -j
    
  5. Inject Errors:

    Assuming the QAT driver is loaded, use the aer-inject tool to inject errors. Adjust the PCIe address to match your environment:

    sudo ./aer-inject --id 6b:00.0 ./examples/nonfatal
    

    After error injection, only the PCIe AER error counters will increase (e.g., /sys/bus/pci/devices/0000:6b:00.0/aer_dev_correctable or aer_dev_nonfatal). QAT RAS error counters will not increase.

    You may observe a message in dmesg such as:

    AER: Uncorrected (Non-Fatal) error received: 0000:6b:00.0
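
One way to confirm the behavior described in step 5 is to snapshot the counters before and after the injection and compare the two. The sketch below assumes that the AER files end with their TOTAL_ERR_ summary line and that each RAS file holds a single value. The function name snapshot_counters, the optional sysfs root argument, and the before.txt/after.txt file names are illustrative only.

```shell
#!/bin/sh
# Print one "name value" line per counter file (AER totals and RAS counters).
# Usage: snapshot_counters <Domain:BDF> [sysfs-devices-root]
# The second argument is a hypothetical override for dry runs.
snapshot_counters() {
    bdf=$1
    root=${2:-/sys/bus/pci/devices}
    for name in aer_dev_correctable aer_dev_nonfatal aer_dev_fatal \
                ras_correctable ras_uncorrectable ras_fatal; do
        file="$root/$bdf/$name"
        [ -r "$file" ] || continue
        # Take the last field of the last line: the TOTAL_ERR_* count for
        # AER files, or the single value assumed for RAS files.
        printf '%s %s\n' "$name" "$(awk '{ last = $NF } END { print last }' "$file")"
    done
}

# Example flow (requires the setup from the steps above):
# snapshot_counters 0000:6b:00.0 > before.txt
# sudo ./aer-inject --id 6b:00.0 ./examples/nonfatal
# snapshot_counters 0000:6b:00.0 > after.txt
# diff before.txt after.txt   # only aer_dev_* lines are expected to differ
```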
    

Important

The AER injection test focuses on two levels of errors:

  1. PCIe Interface Errors: These involve testing the PCIe interface through PCIe AER testing, covering correctable and uncorrectable (non-fatal and fatal) errors. This level of testing can be conducted by customers using the AER injection tools.

  2. Low-Level Errors: These target specific hardware components such as slices, bus, and memory parity. This type of testing may require specialized hardware configurations or access to specific CPU parts that are not generally available outside of Intel.

The PCIe AER testing is designed to simulate errors at the PCIe interface level, which can help identify and address issues related to PCIe communication and error handling.