5.2 Simulating a Simple Cache

Caches models are not currently well integrated with the component system that Simics uses for other devices. For that reason, users are typically required to create caches and connect them by hand. This approach offers, on the other hand, total control on the configuration process. There are however some help commands that can be used to add simple cache hierarchies, and we will study them below.

Here is an example on how to create three level caches by using the instrumentation framework.

First you need to create a cache tool object with the following command:


  new-simple-cache-tool name = cachetool -connect-all

If you want to simulate simple timing you can add a cycle staller object to the cache tool as well:


  new-cycle-staller name = cs0 stall-interval = 10000
  new-simple-cache-tool name = cachetool cycle-staller = cs0 -connect-all

The cycle-staller object adds extra stalls from the caches to the processors at a given interval (here 10000 cycles). The purpose of this method compared to stalling the processor at each memory access is that is much more efficient to do it. The staller will sum up all penalty cycles from the caches, and then apply them as a single stall time at the end of the interval.

If only cache statistics such as hit rates etc. are requested the cycle staller can be omitted.

The new-simple-cache-tool command creates an instrumentation tool that works as a handler for all processors added to the tool. The -connect-all flag tells the command to connect to all the processors in the configuration. Simics usually does not make any distinction between single cores and cores with hardware threads. Each thread will be regarded a "processor", and thus you will get a connection for each hardware threads present in the configuration. A connection is small Simics object that handles the association between the caches and the processor hardware threads.

You can also use the processors argument to the command to list only a certain group of processors (threads) to connect to. This can be useful for heterogeneous setups where different core have different cache settings. In such scenario you should create more than one cache tool.

The cache tool installs instrumentation callbacks for every memory operation and redirects them to the caches. The callbacks can be installed on different access types such as data operation and instruction fetches, and this is used to dispatch the access to the correct first level cache, the instruction cache or the data cache.

Now we are going to add the cache models to the tool. We use the commands add-l1d-cache (level one data cache), add-l1i-cache (level one instruction cache), add-l2-cache (level two cache for both data and instructions), and then add-l3-cache (shared level tree cache among the cores).


(cachetool.add-l1d-cache name = l1d line-size = 64 sets = 64 ways = 12
 -ip-read-prefetcher prefetch-additional = 1)

(cachetool.add-l1i-cache name = l1i line-size = 64 sets = 64 ways = 8)

(cachetool.add-l2-cache name = l2 line-size = 64 sets = 1024 ways = 20
 -prefetch-adjacent prefetch-additional = 4)

(cachetool.add-l3-cache name = l3 line-size = 64 sets = 8192 ways = 12)

This is an example of added caches to a QSP (quick start platform) that consists of 1 cpu socket (cpu0), with 2 cores and 2 threads per core. Before the caches are added the processor hierarchy looks like:

(Notice, the following examples shows just selected output from the list-objects -tree command.)


 cpu0 ┐
      ├ core[0][0]
      ├ core[0][1]
      ├ core[1][0]
      ├ core[1][1]

and after when the caches has been added:


 cpu0 ┐
      ├ cache[0] ┐
      │          ├ l1d 
      │          ├ l1i 
      │          └ l2
      ├ cache[1] ┐
      │          ├ l1d 
      │          ├ l1i 
      │          └ l2
      ├ core[0][0]
      ├ core[0][1]
      ├ core[1][0]
      ├ core[1][1]
      ├ directory_l1 
      ├ directory_l2 
      ├ directory_l3 
      ├ l3

As you can see, there are added caches for all the cores. The cache[0] namespace keeps the caches for core 0, and cache[1] for core 1 respectively. The two hardware threads core[0][0] and core[0][1] will share the caches under cache[0], and core[1][0] and core[1][1] will share the caches under cache[1]. All accesses from the threads go first to l1d/l1i and then to l2. The l3 cache is shared between both cores. The directory objects keeps a cache directory for each cache level that keeps track of the cache coherency. The caches models a simple MESI protocol for each level. The directories also talk to each other to keep the consistency for all levels.

You can also list the cache objects created in a table by the following command:


simics> list-objects class = simple_cache -all
┌──────────────┬──────────────────────────┐
│    Class     │          Object          │
├──────────────┼──────────────────────────┤
│<simple_cache>│board.mb.cpu0.cache[0].l1d│
│<simple_cache>│board.mb.cpu0.cache[0].l1i│
│<simple_cache>│board.mb.cpu0.cache[0].l2 │
│<simple_cache>│board.mb.cpu0.l3          │
└──────────────┴──────────────────────────┘

There is also a command, <simple_cache_tool>.list-caches, to list the caches connected to a specific simple_cache_tool. For example:


    simics> cachetool.list-caches
┌─────┬──────────────────────────┬────┬────┬─────────┬──────────┐
│Row #│       Cache Object       │Sets│Ways│Line Size│Total Size│
├─────┼──────────────────────────┼────┼────┼─────────┼──────────┤
│    1│board.mb.cpu0.cache[0].l1d│  64│  12│       64│ 48.00 kiB│
│    2│board.mb.cpu0.cache[0].l1i│  64│   8│       64│ 32.00 kiB│
│    3│board.mb.cpu0.cache[0].l2 │1024│  20│       64│  1.25 MiB│
│    4│board.mb.cpu0.l3          │4096│  12│       64│  3.00 MiB│
└─────┴──────────────────────────┴────┴────┴─────────┴──────────┘

Here some size properties of the caches is also displayed.

All the configuration parameters to the <simple_cache>.add-{l1d,l1i,l2,l3}-cache commands are listed here:

line-size the cache line size, default 64 (bytes).
sets the number of cache sets, i.e., number of indices.
ways sets the cache associativity, i.e., the total number of cache lines will be sets * ways, default number of ways is 1.
-write-through if the cache should be a write through cache, i.e., all writes will be passed through to the next level (even cache hits). Default is not to write through.
-no-write-allocate if the cache should not allocate lines upon a cache write miss. If no write allocate the cache will write through on write misses. Default is to do write allocate, by first reading the cache line.
read-penalty sets the time in cycle it takes to read from the cache.
read-miss-penalty sets the time in cycles it takes to miss in the cache. So for a miss both the read penalty and read miss penalty will be added. Usually this is only set for the last cache to set the time it takes to reach memory.
write-penalty sets the time in cycle it takes to write to the cache.
write-miss-penalty sets the time in cycle it takes to write and miss in the cache. So for a miss both the write penalty and write miss penalty will be added. Usually this is only set for the last cache to set the time it takes to reach memory.
prefetch-additional sets how many consecutive cache lines to fetch additionally to the one that missed.
-prefetch-adjacent means that the cache will, on a miss, prefetch the adjacent cache line as well, so the total fetch region is cache line size times 2, naturally aligned.
-ip-read-prefetcher adds an instruction pointer stride prefetcher for reads
-ip-write-prefetcher adds an instruction pointer stride prefetcher for writes
-no-issue this is a special flag for the add-l1i-cache command which prevents the CPU to do any instruction fetch accesses to the instruction cache. This is useful if the instruction cache should be called from another tool, such as a branch predictor tool that drives the instructions cache. If not set, the instruction cache will be called for each new cache block that the CPU fetches.

The penalties from above is only relevant if you add the cycle staller object.

If you run the simulation for a while, say 1 000 000 000 cycles, with the run-cycles command, you will get information about a cache with the cache.print-statistics command.

So, from the table above choose one cache object, and do for instance:


simics> board.mb.cpu0.cache[0].l1d.print-statistics

The output will be something like the following. See chapter Understanding the Statistics of Simple Cache below for more information about the statistics.



┌─────┬───────────────────────────────────┬────────┬─────┐
│Row #│              Counter              │ Value  │  %  │
├─────┼───────────────────────────────────┼────────┼─────┤
│    1│read accesses                      │10571090│     │
│    2│read misses                        │  194075│ 1.84│
│    3│write accesses                     │ 3704615│     │
│    4│write misses                       │  152557│ 4.12│
│    5│prefetch accesses                  │  323980│     │
│    6│prefetch misses                    │  253378│78.21│
│    7│prefetched lines used              │  139516│43.06│
│    8│evicted lines (total)              │  599242│     │
│    9│evicted modified lines             │  172067│28.71│
│   10│entire cache flushes (invd, wbinvd)│       8│     │
│   11│uncachable read accesses           │56531126│     │
│   12│uncachable write accesses          │38655295│     │
└─────┴───────────────────────────────────┴────────┴─────┘

The table can also be exported to a comma separated value file (csv), by using the .export-table-csv command, e.g.,


simics> board.mb.cpu0.cache[0].l1d.export-table-csv file = my-stats.csv

You can also view the content of a cache with the .print-cache-content command, e.g.,


simics> board.mb.cpu0.cache[0].l1d.print-cache-content -no-row-column

┌─────┬────────────────┬────────────────┬────────────────┬────────────────┐
│Index│      Way0      │      Way1      │      Way2      │      Way3      │
├─────┼────────────────┼────────────────┼────────────────┼────────────────┤
│    0│M:0xdffd4b00:2:-│S:0xdffef000:1:-│S:0xdffce000:0:-│M:0xdffcf200:3:-│
│    1│E:0xdffedf40:2:-│E:0xdffcf240:1:P│M:0xdffcf040:3:-│S:0xdffce040:0:-│
│    2│S:0xdffce080:1:-│S:0xdffcf280:2:-│M:0xdffcf180:3:-│E:0xdffedf80:0:-│
│    3│M:0xdffd4ac0:1:-│E:0xdffcf2c0:3:P│E:0xdffcf1c0:0:S│E:0xdffedfc0:2:-│
└─────┴────────────────┴────────────────┴────────────────┴────────────────┘

The -no-row-column removes the default Row column from the table, which is useful since it reduces possible confusion with the Index column.

This example shows a small cache with only 4 sets and 4 ways. The first letter of each cache line shows the state of the cache line in the MESI cache protocol. M is modified, E is Exclusive, S is shared, and I is Invalid. The next field is the tag, or rather the physical address of the cache line. The next number tells the age of the cache line among the ways in the set. 0 means the most recently used (MRU) and higher number means older up to the highest number representing the leased recently used (LRU) line. The last letter shows the prefetch status. P means the line has been prefetched but not used yet, and S means that line is currently part of a stride prefetching scheme.

There is a flag to the command, -no-invalid-sets, that filters out sets with only invalid lines.

Also, the table printed will default to only show a maximum of 40 sets. To show more or fewer of the sets use the max argument to the command to set the limit.

5.1 Introduction to Cache Simulation with Simics 5.3 Programmatically constructing a cache hierarchy