11 minute read

As a new way of describing hardware programmatically, ROHD opens up the world of hardware design to people with background in software algorithms. Today, we see a lot of hardware design in the space of hardware accelerators which are being used to speed up algorithms that already exist in software. It is therefore interesting to consider design patterns that help us transform traditional software techniques into hardware: a key software technique to consider is recursion which is often the most natural way to describe a given computation.

Trees are a key data structure in both software and hardware upon which recursive algorithms can operate to achieve fast computation. Yet it is challenging to write recursive tree computation in traditional hardware description languages. We will consider a simple recursive algorithm and then a more general hardware tree generator as two techniques for capturing recursive computation easily in hardware using ROHD.

A Pseudo-LRU Algorithm

Set-associative caching requires a replacement policy to figure out which ‘way’ in the cache from which to evict data in order to store new data. Least Recently Used (LRU) algorithms require a very complex history mechanism to compute exactly, so an approximation is often used called ‘pseudo-LRU’.

To compute the least-recently-used ‘way’, a pseudo-LRU algorithm uses a binary tree with ‘ways’ as leaves, where a ‘0’ at an intermediate node indicates the LRU ‘way’ is to the right and a ‘1’ indicates it is to the left. In the figure below, we can see that ‘way’ ‘5’ is LRU according to the current settings at each node of the tree.

plru

Two routines are needed to maintain the PLRU tree above in order to do pseudo-LRU computation, first for finding the LRU way to ‘allocate’ a new entry and a second for updating the LRU state as ways are ‘hit’ by reads or writes.

PLRU Allocation

Software PLRU Allocation

It is quite natural to describe allocation in the PLRU tree as a recursive routine. Let us assume the PLRU tree is represented as a 0/1 state vector of the tree nodes as seen from left-to-right. In the example above, the state vector would be [1,0,1,0,0,1,0]. Below is the routine written in software that, given the state of the PLRU tree as a List, returns the integer LRU way.

 1 int allocPLRU(List<int> v, {int base = 0}) {
 2   final mid = v.length ~/ 2;
 3   return v.length == 1
 4       ? v[0] == 1
 5           ? base
 6           : base + 1
 7       : v[mid] == 1
 8           ? allocPLRU(v.sublist(0, mid), base: base)
 9           : allocPLRU(v.sublist(mid + 1, v.length), base: mid + 1 + base);
10 }

Here you can see that the recursion splits on the middle element of the List at line 7, searching left for the LRU if the node has a ‘1’ otherwise searching right. At the leaf (line 4) if the node is ‘1’ it returns the left element, otherwise the right, as the LRU way.

Hardware PLRU Allocation

To implement a tree-based algorithm like PLRU allocation in traditional HDLs, a designer usually has to encode the tree and write a decoder, limited to a specific number of ways. This is not unlike when software developers had to write assembly – using a bit-level abstraction in which to embed algorithms in either software or hardware is painful and error-prone.

Here is an example in Verilog HDL for a fixed 8-way allocation. It is very hard to determine this algorithm is correct as this description obfuscates the algorithm almost completely. Looking at this code a designer would likely wish this to be compiled from a higher level description than have to write and maintain this kind of bit-level code (there could easily be an error in this code that cannot be seen without exhaustive testing of what is an approximate algorithm!).

    always_comb begin
        if (plru_bits[0] == 0) begin
            if (plru_bits[2] == 0) begin
                if (plru_bits[6] == 0) lru_way = 3'b111;
                else lru_way = 3'b110;
            end else begin
                if (plru_bits[5] == 0) lru_way = 3'b101;
                else lru_way = 3'b100;
            end
        end else begin
            if (plru_bits[1] == 0) begin
                if (plru_bits[4] == 0) lru_way = 3'b011;
                else lru_way = 3'b010;
            end else begin
                if (plru_bits[3] == 0) lru_way = 3'b001;
                else lru_way = 3'b000;
            end
        end
    end

ROHD Recursive Hardware PLRU Allocation

ROHD allows us to describe the LRU allocation algorithm recursively in hardware as well because it enables us to generate hardware recursively for a PLRU tree of an arbitrary number of ways.

In the hardware algorithm below, we see that we pass in a bitvector in the form of Logic and again split on the middle element of the vector (line 5 below). Instead of a ternary operation (lines 4-6 of the software algorithm above), we can use a multiplexor or mux (line 8 below) on the middle element value to return the result of the left recursion or right recursion. At the leaf (line 7), we can mux on the node to return the left element in case of a ‘1’ otherwise the right element.

 1 Logic allocPLRU(Logic v, {int base = 0, int sz = 0}) {
 2   final lsz = sz == 0 ? log2Ceil(v.width) : sz;
 3   Logic convertInt(int i) => Const(i, width: lsz);
 4
 5   final mid = v.width ~/ 2;
 6   return v.width == 1
 7       ? mux(v[0], convertInt(base), convertInt(base + 1))
 8       : mux(
 9            v[mid],
10            allocPLRU(v.slice(mid - 1, 0), base: base, sz: lsz),
11            allocPLRU(v.getRange(mid + 1),
12                      base: mid + 1 + base, sz: lsz));
13 }

This algorithm is incredibly similar to the software recursive algorithm, replacing conditionals with muxes.

PLRU Hit/Invalidate

A second algorithm needed to maintain the PLRU tree is managing ‘hits’ to a given way, making that way ‘recently used’, or to manage ‘invalidates’ of a way, making it available for use and marking it as LRU. Given a way to hit or invalidate, this algorithm needs to update the state of the PLRU tree and return that state vector.

Software PLRU Hit/Invalidate

For the software form of the hit algorithm, the PLRU state is again traversed recursively, splitting on the middle element (line 6) and processing the lower (line 9) and upper (line 12) portions of the state vector. If the way being hit is in the left portion, then that portion of the tree is processed, otherwise it is simply returned as is – similarly for the right.

 1 List<int> hitPLRU(List<int> v, int way,
 2     {int base = 0, bool invalidate = false}) {
 3   if (v.length == 1) {
 4     return [if ((way == base) == invalidate) 1 else 0];
 5   } else {
 6     final mid = v.length ~/ 2;
 7     var lower = v.sublist(0, mid);
 8     var upper = v.sublist(mid + 1, v.length);
 9     lower = (way <= mid + base)
10         ? hitPLRU(lower, way, base: base, invalidate: invalidate)
11         : lower;
12     upper = (way > mid + base)
13         ? hitPLRU(upper, way, base: mid + base + 1, invalidate: invalidate)
14         : upper;
15     final midVal = ((way <= mid + base) == invalidate) ? 1 : 0;
16     return [...lower, midVal, ...upper];
17   }
18 }

Let’s consider the ‘hit’ case where invalidate=false, so we are marking the way as recently used. At the splitting point, the middle value (line 15) is set to ‘1’ if the way is to the right of middle, otherwise it is set to ‘0’, indicating the LRU is in the opposite direction of the hit way. The leaf processing code (line 4) is similar: if we match the way at the node, then set the node to ‘0’ to switch LRU away from this leaf, otherwise set to ‘1’ (remember ‘0’ indicates the LRU direction is to the right).

If we consider the invalidate=true case, where we are marking the way as available or LRU, the logic is simply reversed: a hit means LRU is going in the same direction of the hit way and wherever we would have set a ‘0’ for a ‘hit’, we would set a ‘1’ instead to invalidate and a ‘1’ instead of a ‘0’. Invalidate marks each node along the path to the invalidated way with a ‘0’ to designate the way as LRU.

ROHD Recursive Hardware PLRU Hit/Invalidate

The hardware recursion for PLRU hit/invalidate follows the same pattern as software. Yet because we need to pass back the entire state vector, not knowing the actual value of way at generation time, we need to fully process each element of the vector with the Logic signal way in the recursion. That can be seen most clearly in the lower and upper recursions (lines 13 and 14) where we must invoke the recursive routine to pass the way for both. Thes makes the way signal available at each node to be compared appropriately in hardware since way is a dynamic signal that changes value after hardware generation. Remember that in the software algorithm we knew the value of way so we could just recurse into the appropriate subtree and return the other subtree unprocessed.

 1 Logic hitPLRU(Logic v, Logic way,
 2     {int base = 0, Logic? invalidate}) {
 3   Logic convertInt(int i) => Const(i, width: way.width);
 4 
 5   invalidate ??= Const(0);
 6   if (v.width == 1) {
 7     return mux(way.eq(convertInt(base)), invalidate,
 8         mux(way.eq(convertInt(base + 1)), ~invalidate, v[0]));
 9   } else {
10     final mid = v.width ~/ 2;
11     var lower = v.slice(mid - 1, 0);
12     var upper = v.getRange(mid + 1);
13     lower = hitPLRU(lower, way, base: base, invalidate: invalidate);
14     upper = hitPLRU(upper, way, base: mid + base + 1, invalidate: invalidate);
15     final midVal = mux(
16         way.lt(convertInt(base)) | way.gt(convertInt(base + v.width)),
17         v[mid],
18         mux(way.lte(convertInt(mid + base)), invalidate, ~invalidate));
19     return [lower, midVal, upper].rswizzle();
20   }
21 }

At the current node, we compare the way to the range of the current subtree. If the way is not in range (line 16), we simply return the midpoint value (line 17) unchanged. This is an example where hardware recursion differs from software in that we need to check this condition.

At line 18, we can see that when the way falls within this subtree and when processing the ‘hit’ case (invalidate=false), we return ‘1’ if the way is in the right subtree and ‘0’ otherwise. Finally, we return a concatenation of the computed PLRU state vector (line 19). Note that the invalidate=true case simply reverses the meaning of ‘1’ and ‘0’.

We can see that maintaining a PLRU tree to support pseudo-LRU computation is quite similar in both software and hardware recursive forms. ROHD makes it simple to follow software patterns to generate hardware when the computation is easily described recursively.

A Reduction Tree Component

A parallel reduction is an efficient arrangement of computation of associative operations (like sum or maximum) to compute a single result from an array of inputs.

Reductions are common in both software and hardware and improve the latency of computation logarithmically by arranging a tree of computation to perform parallel operations even when the output is a single result.

In Hardware Description Languages (HDL)s, there is even operator syntax to perform common bit-wise reductions like or-reduction and and-reduction. It is more difficult to describe reductions on complex inputs or operations in traditional HDLs.

Yet in software, more complex reduction trees are possible because of the compositional nature of software languages. For example the C++ Standard Template Library (STL) contains a template operator std::reduce to perform reduction generically on STL containers. Indeed, these operations are threaded to ensure fastest possible execution in reduction tree arrangements within STL.

To perform more complex reductions in hardware, the ROHD Hardware Component Library (HCL) provides a ReductionTree component which takes an associative operation and populates a hardware tree arrangement to compute a hardware reduction with a latency that is logarithmic with the number of inputs. In hardware we need to consider pipelining as well for a reduction of significant length.

Add Reduction Tree

Here is a simple example of a reduction tree using the native add operations of SystemVerilog, but written using a ROHD generator class.

The method addReduce is an operation to be instantiated by the reduction tree generator. In this case it is simply a native addition of two inputs, but could be a more complex generator or Module instance of an associative 2-input computation. Because the tree can handle lengths that are not powers of 2, the operation must take care of the length=1 case when the tree is not balanced perfectly.

  Logic addReduce(List<Logic> inputs,
          {int depth, Logic? control, String name = ''}) =>
       inputs.length < 2 ? inputs[0] : inputs[0] + inputs[1];

Then we pass this reduction operation during an instantiation of ReductionTree, producing a binary tree with 79 13-bit inputs and a tree of adders (inserted at each node by the addReduce method), along with adding pipelining at every other level of the tree, starting from the leaves.

 main () {
     const width = 13;
     const length = 79;
     final clk = Logic();
     final vec = <Logic>[];

     final reductionTree = ReductionTree(
         vec, addReduce, clk: clk, depthBetweenFlops: 2);
}

It is quite simple to do other operations like max or min or operate on other datatypes like FloatingPoint or FixedPoint replacing the operator addReduce with more complex hardware generators, including Module instances. The ReductionTree can also manage data widening and sign extension for arithmetic operation reductions.

Mux Reduction Tree

Here is a more complex example (similar to, but not technically reduction) that passes in a control line that can be indexed by the depth of the tree to perform a multiplexing (or ‘mux’ing) operation, an operation quite useful in hardware. Here you see the operation is now the muxReduce method which injects a mux at each node of the tree.

Note that the ReductionTree component does not know apriori what the node hardware will be, giving it infinite extensibility.

main() {
    const length = 1024;
    final width = log2Ceil(length);
    final clk = Logic();
    final vec = <Logic>[];
    final control = Logic(width: log2Ceil(vec.length)));

    Logic muxReduce(List<Logic> inputs,
        {int depth, Logic? control, String name = 'mux'}) =>
      mux(control![depth], inputs[1], inputs[0]);

    final muxTree = ReductionTree(vec, muxReduce,
        clk: clk, depthBetweenFlops: 2, control: control, name: 'mux');
}

Being able to quickly generate tree computations is another way to map what are common software design patterns into hardware with a high degree of control over the performance of that hardware.

Updated: