Summary of Checksum Algorithms

Each checksum algorithm utilizes distinct methods for error detection. The Adler32 uses an additive process. The CRC32 uses a polynomial-based division operation. The XXHash32 checksum employs addition, multiplication, rotate, shift and xor operations.

Despite their different techniques, all checksum algorithms are, to varying extents, vulnerable to collisions—instances where distinct inputs produce the same hash output. This susceptibility impacts their performance and reliability, although the degree of vulnerability varies across the algorithms

Empirical testing, which involved over 500 billion error-injected dataset tests (detailed in the section The Empirical Study), supports the notion that each algorithm’s collision rate differs based on its design.

It was noted that when the dataset was sourced from the printable ASCII range (32 to 128), Adler32 exhibited a higher collision rate compared to when the dataset used the full ASCII character set. CRC32 and XXHash32, however, showed no significant difference in collision rates between these two datasets.

The increased collision rate for Adler32 when using printable ASCII characters can be attributed to the simplicity of its additive process. The nature of the data perturbations, combined with the smaller dataset size, makes Adler32 more prone to hash collisions—particularly when the modulo operation used in the addition step does not adequately account for the smaller dataset.

This paper suggests that the differing mechanisms behind Adler32, CRC32, and XXHash32 provide complementary checks for dataset integrity. This idea is further corroborated by the empirical data, supporting the rationale for using multiple checksums in the QAT CnV process.

From the testing, the following p-values for collisions were obtained:

  • CRC32: Pc = 2.22196E-10

  • Adler32: Pa = 1.75051E-07

  • XXHash32: Px = 2.87547E-10

Because the algorithms are independent of each other, calculating the probability of an error inducing collisions with two checks is straightforward.

For the Deflate algorithm, using both CRC32 and Adler32 as integrity checks produces a P value of:

Pdef = Pc x Pa = 2.22196E-10 * 1.75051E-07 = 5.03354E-17

For LZ4 and LZ4s, using CRC32 and XXHash32:

Plz4 = Pc x Px = 2.22196E-10 * 2.87547E-10 = 6.38918E-20