The CRC64 Values Used in End-To-End mode

Compress and Verify (CnV) was developed before the end-to-end (e2e) feature with CRC64 checksums was introduced. The e2e feature allows customers to supply the polynomial. While this is useful for some customers, it presents two problems for CnV.

If CnV included CRC64 with a customer-supplied polynomial, the following situations may arise.

The polynomial may be poorly chosen, resulting in higher levels of collisions.
A CnV solution with different polynomials may result in failure to detect an error based solely on the polynomials, resulting in inconsistent behavior.

The Empirical Study

In order to examine the existence or lack of existence of checksum collisions for more than one checksum at the same time.

Stated differently, can a single intentionally corrupted dataset show a checksum collision for more than one checksum?

Dataset Generation

The datasets were created with the following parameters:

Characters	Dataset Length
Full ASCII in pseudo-random order	4096, 8192, 32767, 65536, 40960, 81920, 32767, 655360
Printable ASCII in pseudo-random order	4096, 8192, 32767, 65536, 40960, 81920, 32767, 655360

Dataset Modification

A second dataset was created with the following changes:

do
{
   copy_start = (rand()%(bff_sz - 1024))+10;
   copy_len = (rand()%(256))+1;
   copy_offset = rand()%(copy_start)+10;
   for ( i = 0; i < copy_len; i++ )
   {
      bff2[i+copy_offset] = bff[i+copy_start];
   }
} while (memcmp( bff, bff2, bff_sz ) == 0 );

This provides a second buffer with the following properties:

A subset of the dataset has been copied from one location to another pseudorandom location.
The length of the copied data is limited to be within the range of 1 to 256.
The start and offset are set to ensure the length of the second modified buffer matches the original unmodified buffer.

Pictorially, this looks like:

This dataset loosely resembles an error in a “distance” code in the compressed data. Mimicking errors in length codes were not considered.

A change in a length code implies a change in the overall dataset length. This error would be detected by the test for length matching.

Scope of Experiment

A total of 548,459,000,000 datasets were created, duplicated with errors, and checksums were calculated for both the original dataset and modified datasets.

The checksums from the two datasets were compared to check for checksum collisions.

Collisions

Checksum	Number of Collisions	Rate of Collisions (per billion)	P
Adler32	115458	< 176	1.7 5051E-07
CRC32	122	< 0.3	2.2 2196E-10
xxhash	157	< 0.3	2.8 7547E-10

The number of datasets with collisions had collisions with more than one algorithm.

Details - Total collisions by dataset size

Size (octets)	Adler32 collision count	CRC32 collision count	XXHash32 collision count
4096	28898	25	32
8192	27907	34	44
32768	27281	32	31
64536	27208	22	30
40960	1067	3	5
81920	1028	2	4
327680	1034	3	6
645360	1035	1	5

Details - Total datasets by dataset size

Dataset size (octets)	Datasets
4096	124,224,000,000
8192	122,679,000,000
32768	122,324,000,000
64536	122,084,000,000
40960	14,536,000,000
81920	14,320,000,000
327680	14,261,000,000
645360	14,031,000,000