Footnotes

MIT FreeBSD Project: CRC32 Implementation in C

https://web.mit.edu/freebsd/head/sys/libkern/crc32.c

/*
 * COPYRIGHT (C) 1986 Gary S. Brown. You may use this program, or
 * code or tables extracted from it, as desired without restriction.
 */

/*
 * First, the polynomial itself and its table of feedback terms. The
 * polynomial is
 * X^32+X^26+X^23+X^22+X^16+X^12+X^11+X^10+X^8+X^7+X^5+X^4+X^2+X^1+X^0
 *
 * Note that we take it "backwards" and put the highest-order term in
 * the lowest-order bit. The X^32 term is "implied"; the LSB is the
 * X^31 term, etc. The X^0 term (usually shown as "+1") results in
 * the MSB being 1
 …

xxHash Specification Documentation on GitHub

https://github.com/Cyan4973/xxHash/blob/dev/doc/xxhash_spec.md

Note

For brevity, the Version, Introduction and Operation notations sections have been removed.

Notices

Copyright (c) Yann Collet

Permission is granted to copy and distribute this document for any purpose and without charge, including translations into other languages and incorporation into compilations, provided that the copyright notice and this notice are preserved, and that any substantive changes or deletions from the original are clearly marked. Distribution of this document is unlimited.

XXH32 Algorithm Description

Overview

We begin by supposing that we have a message of any length L as input, and that we wish to find its digest. Here L is an arbitrary nonnegative integer; L may be zero. The following steps are performed to compute the digest of the message.

The algorithm collects and transforms input in stripes of 16 bytes. The transforms are stored inside 4 “accumulators”, each one storing an unsigned 32-bit value. Each accumulator can be processed independently in parallel, speeding up processing for CPUs with multiple execution units.

The algorithm uses 32-bits addition, multiplication, rotate, shift, and xor operations. Many operations require some 32-bits prime number constants, all defined below:

static const u32 PRIME32_1 = 0x9E3779B1U;  // 0b10011110001101110111100110110001
static const u32 PRIME32_2 = 0x85EBCA77U;  // 0b10000101111010111100101001110111
static const u32 PRIME32_3 = 0xC2B2AE3DU;  // 0b11000010101100101010111000111101
static const u32 PRIME32_4 = 0x27D4EB2FU;  // 0b00100111110101001110101100101111
static const u32 PRIME32_5 = 0x165667B1U;  // 0b00010110010101100110011110110001

These constants are prime numbers and feature a good mix of bits 1 and 0, neither too regular nor too dissymmetric. These properties help dispersion capabilities.

Step 1. Initialize internal accumulators

Each accumulator gets an initial value based on optional seed input. Since the seed is optional, it can be 0.

u32 acc1 = seed + PRIME32_1 + PRIME32_2;
u32 acc2 = seed + PRIME32_2;
u32 acc3 = seed + 0;
u32 acc4 = seed - PRIME32_1;

Special case: input is less than 16 bytes

When the input is too small (< 16 bytes), the algorithm will not process any stripes. Consequently, it will not make use of parallel accumulators.

In this case, a simplified initialization is performed, using a single accumulator:

u32 acc  = seed + PRIME32_5;

The algorithm then proceeds directly to step 4.

Step 2. Process stripes

A stripe is a contiguous segment of 16 bytes. It is evenly divided into 4 lanes, of 4 bytes each. The first lane is used to update accumulator 1, the second lane is used to update accumulator 2, and so on.

Each lane reads its associated 32-bit value using little-endian convention.

For each {lane, accumulator}, the update process is called a round, and applies the following formula:

accN = accN + (laneN * PRIME32_2);
accN = accN <<< 13;
accN = accN * PRIME32_1;

This shuffles the bits so that any bit from input lane impacts several bits in output accumulator. All operations are performed modulo 2^32.

Input is consumed one full stripe at a time. Step 2 is looped as many times as necessary to consume the whole input, except for the last remaining bytes which cannot form a stripe (< 16 bytes). When that happens, move to step 3.

Step 3. Accumulator convergence

All 4 lane accumulators from the previous steps are merged to produce a single remaining accumulator of the same width (32-bit). The associated formula is as follows:

acc = (acc1 <<< 1) + (acc2 <<< 7) + (acc3 <<< 12) + (acc4 <<< 18);

Step 4. Add input length

The input total length is presumed known at this stage. This step is just about adding the length to the accumulator, so that it participates in the final mixing.

acc = acc + (u32)inputLength;

Note that, if the input length is so large that it requires more than 32-bits, only the lower 32-bits are added to the accumulator.

Step 5. Consume remaining input

There may be up to 15 bytes remaining to consume from the input. The final stage will digest them according to the following pseudo-code:

while (remainingLength >= 4) {
    lane = read_32bit_little_endian(input_ptr);
    acc = acc + lane * PRIME32_3;
    acc = (acc <<< 17) * PRIME32_4;
    input_ptr += 4; remainingLength -= 4;
}

while (remainingLength >= 1) {
    lane = read_byte(input_ptr);
    acc = acc + lane * PRIME32_5;
    acc = (acc <<< 11) * PRIME32_1;
    input_ptr += 1; remainingLength -= 1;
}

This process ensures that all input bytes are present in the final mix.

Step 6. Final mix (avalanche)

The final mix ensures that all input bits have a chance to impact any bit in the output digest, resulting in an unbiased distribution. This is also called the avalanche effect.

acc = acc xor (acc >> 15);
acc = acc * PRIME32_2;
acc = acc xor (acc >> 13);
acc = acc * PRIME32_3;
acc = acc xor (acc >> 16);

Step 7. Output

The XXH32() function produces an unsigned 32-bit value as output.

For systems which require to store and/or display the result in binary or hexadecimal format, the canonical format is defined to reproduce the same value as the natural decimal format, hence follows big-endian convention (most significant byte first).