Cache models are not currently well integrated with the component system that Simics uses for other devices. For that reason, users are typically required to create caches and connect them by hand. On the other hand, this approach offers total control over the configuration process. There are, however, some helper commands that can be used to add simple cache hierarchies, and we will study them below.
Here is an example of how to create a three-level cache hierarchy by using the instrumentation framework.
First you need to create a cache tool object with the following command:
new-simple-cache-tool name = cachetool -connect-all
If you want to simulate simple timing you can add a cycle staller object to the cache tool as well:
new-cycle-staller name = cs0 stall-interval = 10000
new-simple-cache-tool name = cachetool cycle-staller = cs0 -connect-all
The cycle-staller object adds extra stalls from the caches to the processors at a given interval (here 10000 cycles). Compared to stalling the processor at each memory access, this method is much more efficient. The staller sums up all penalty cycles from the caches and then applies them as a single stall at the end of the interval.
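The accumulate-then-stall idea can be sketched in a few lines of Python. This is only a toy model of the behavior described above, not the Simics implementation: the class name `CycleStallerSketch`, its methods, and the per-access penalty values are all hypothetical.

```python
class CycleStallerSketch:
    """Toy model: collect penalty cycles, apply them once per interval."""

    def __init__(self, stall_interval):
        self.stall_interval = stall_interval  # cycles between stalls
        self.pending = 0                      # penalties collected so far
        self.applied = []                     # one entry per applied stall

    def add_penalty(self, cycles):
        # Called for each cache access; only a cheap addition happens here.
        self.pending += cycles

    def end_of_interval(self):
        # Called once per stall_interval: apply the sum as a single stall.
        stall = self.pending
        self.pending = 0
        if stall:
            self.applied.append(stall)
        return stall


staller = CycleStallerSketch(stall_interval=10000)
for penalty in (3, 3, 60, 3):  # hypothetical per-access penalties
    staller.add_penalty(penalty)
total_stall = staller.end_of_interval()
```

In this sketch four accesses cause one stall instead of four, which is where the efficiency gain comes from.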
If only cache statistics, such as hit rates, are needed, the cycle staller can be omitted.
The new-simple-cache-tool command creates an instrumentation tool that works as a handler for all processors added to the tool. The -connect-all flag tells the command to connect to all the processors in the configuration. Simics usually does not make any distinction between single cores and cores with hardware threads. Each thread is regarded as a "processor", so you get one connection for each hardware thread present in the configuration. A connection is a small Simics object that handles the association between the caches and the processor hardware threads.
You can also use the processors argument to the command to list only a certain group of processors (threads) to connect to. This can be useful for heterogeneous setups where different cores have different cache settings. In such a scenario you should create more than one cache tool.
The cache tool installs instrumentation callbacks for every memory operation and redirects them to the caches. The callbacks can be installed on different access types, such as data operations and instruction fetches, and this is used to dispatch each access to the correct first level cache: the instruction cache or the data cache.
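As a rough illustration of this dispatch, the following Python sketch routes an access to a first level cache based on its type. The `AccessType` enum and the `first_level_cache` function are hypothetical names made up for this example; the real tool does this through instrumentation callbacks.

```python
from enum import Enum


class AccessType(Enum):
    INSTRUCTION_FETCH = "fetch"
    DATA_READ = "read"
    DATA_WRITE = "write"


def first_level_cache(access_type):
    """Return which first level cache should see the access."""
    if access_type is AccessType.INSTRUCTION_FETCH:
        return "instruction cache"
    # Both data reads and data writes go to the data cache.
    return "data cache"


routed = {a.name: first_level_cache(a) for a in AccessType}
```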
Now we are going to add the cache models to the tool. We use the commands add-l1d-cache (level one data cache), add-l1i-cache (level one instruction cache), add-l2-cache (level two cache for both data and instructions), and then add-l3-cache (level three cache shared among the cores).
(cachetool.add-l1d-cache name = l1d line-size = 64 sets = 64 ways = 12
-ip-read-prefetcher prefetch-additional = 1)
(cachetool.add-l1i-cache name = l1i line-size = 64 sets = 64 ways = 8)
(cachetool.add-l2-cache name = l2 line-size = 64 sets = 1024 ways = 20
-prefetch-adjacent prefetch-additional = 4)
(cachetool.add-l3-cache name = l3 line-size = 64 sets = 8192 ways = 12)
This example adds caches to a QSP (Quick Start Platform) that consists of 1 CPU socket (cpu0), with 2 cores and 2 threads per core. The following examples show only selected output from the list-objects -tree command. Before the caches are added, the processor hierarchy looks like this:
cpu0 ┐
├ core[0][0]
├ core[0][1]
├ core[1][0]
├ core[1][1]
and after the caches have been added:
cpu0 ┐
├ cache[0] ┐
│ ├ l1d
│ ├ l1i
│ └ l2
├ cache[1] ┐
│ ├ l1d
│ ├ l1i
│ └ l2
├ core[0][0]
├ core[0][1]
├ core[1][0]
├ core[1][1]
├ directory_l1
├ directory_l2
├ directory_l3
├ l3
As you can see, caches have been added for all the cores. The cache[0] namespace holds the caches for core 0, and cache[1] those for core 1. The two hardware threads core[0][0] and core[0][1] share the caches under cache[0], and core[1][0] and core[1][1] share the caches under cache[1]. All accesses from the threads go first to l1d/l1i and then to l2. The l3 cache is shared between both cores. The directory objects keep a cache directory for each cache level that tracks cache coherency. The caches model a simple MESI protocol for each level. The directories also talk to each other to keep all levels consistent.
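The sharing described above can be summarized with a short Python sketch that maps a hardware thread to the caches it goes through. The function `caches_for_thread` is a hypothetical helper written for this illustration; only the object names are taken from the tree output.

```python
def caches_for_thread(core, thread):
    """Return the lookup order for hardware thread core[core][thread].

    The thread index does not affect the result: both threads of a core
    share the same private caches.
    """
    private = f"cpu0.cache[{core}]"
    return {
        "data": [f"{private}.l1d", f"{private}.l2", "cpu0.l3"],
        "instruction": [f"{private}.l1i", f"{private}.l2", "cpu0.l3"],
    }


# 2 cores with 2 threads each, as in the QSP example.
paths = {(c, t): caches_for_thread(c, t) for c in range(2) for t in range(2)}
```

Here `paths[(0, 0)]` and `paths[(0, 1)]` are identical, while every path ends in the shared `cpu0.l3`.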
You can also list the created cache objects in a table with the following command:
simics> list-objects class = simple_cache -all
┌──────────────┬──────────────────────────┐
│ Class │ Object │
├──────────────┼──────────────────────────┤
│<simple_cache>│board.mb.cpu0.cache[0].l1d│
│<simple_cache>│board.mb.cpu0.cache[0].l1i│
│<simple_cache>│board.mb.cpu0.cache[0].l2 │
│<simple_cache>│board.mb.cpu0.l3 │
└──────────────┴──────────────────────────┘
There is also a command, <simple_cache_tool>.list-caches, to list the caches connected to a specific simple_cache_tool. For example:
simics> cachetool.list-caches
┌─────┬──────────────────────────┬────┬────┬─────────┬──────────┐
│Row #│ Cache Object │Sets│Ways│Line Size│Total Size│
├─────┼──────────────────────────┼────┼────┼─────────┼──────────┤
│ 1│board.mb.cpu0.cache[0].l1d│ 64│ 12│ 64│ 48.00 kiB│
│ 2│board.mb.cpu0.cache[0].l1i│ 64│ 8│ 64│ 32.00 kiB│
│ 3│board.mb.cpu0.cache[0].l2 │1024│ 20│ 64│ 1.25 MiB│
│ 4│board.mb.cpu0.l3 │4096│ 12│ 64│ 3.00 MiB│
└─────┴──────────────────────────┴────┴────┴─────────┴──────────┘
Some size properties of the caches are also displayed here.
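The Total Size column follows from the other columns as sets * ways * line size. A small Python sketch reproduces the values in the table; the helper names `cache_size_bytes` and `format_size` are made up for this example.

```python
def cache_size_bytes(sets, ways, line_size=64):
    """Total capacity of a cache: every set holds `ways` lines."""
    return sets * ways * line_size


def format_size(num_bytes):
    """Format a size the way the table does (kiB below 1 MiB, else MiB)."""
    if num_bytes >= 1024 * 1024:
        return f"{num_bytes / (1024 * 1024):.2f} MiB"
    return f"{num_bytes / 1024:.2f} kiB"


# Sets and ways taken from the rows of the table above.
l1d_size = format_size(cache_size_bytes(sets=64, ways=12))
l1i_size = format_size(cache_size_bytes(sets=64, ways=8))
l2_size = format_size(cache_size_bytes(sets=1024, ways=20))
```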
All the configuration parameters to the <simple_cache_tool>.add-{l1d,l1i,l2,l3}-cache commands are listed here:
line-size
the cache line size, default 64 (bytes).
sets
the number of cache sets, i.e., the number of indices.
ways
sets the cache associativity; the total number of cache lines is sets * ways. The default number of ways is 1.
-write-through
if the cache should be a write-through cache, i.e., all writes will be passed through to the next level (even cache hits). The default is not to write through.
-no-write-allocate
if the cache should not allocate lines upon a cache write miss. Without write allocation, the cache writes through on write misses. The default is to write allocate, by first reading the cache line.
read-penalty
sets the time in cycles it takes to read from the cache.
read-miss-penalty
sets the time in cycles it takes to miss in the cache. So for a miss both the read penalty and the read miss penalty will be added. Usually this is only set for the last cache, to set the time it takes to reach memory.
write-penalty
sets the time in cycles it takes to write to the cache.
write-miss-penalty
sets the time in cycles it takes to write and miss in the cache. So for a miss both the write penalty and the write miss penalty will be added. Usually this is only set for the last cache, to set the time it takes to reach memory.
prefetch-additional
sets how many consecutive cache lines to fetch in addition to the one that missed.
-prefetch-adjacent
means that the cache will, on a miss, prefetch the adjacent cache line as well, so the total fetch region is the cache line size times 2, naturally aligned.
-ip-read-prefetcher
adds an instruction pointer stride prefetcher for reads.
-ip-write-prefetcher
adds an instruction pointer stride prefetcher for writes.
-no-issue
a special flag for the add-l1i-cache command that prevents the CPU from issuing any instruction fetch accesses to the instruction cache. This is useful if the instruction cache should be called from another tool, such as a branch predictor tool that drives the instruction cache. If not set, the instruction cache will be called for each new cache block that the CPU fetches.
The penalties above are only relevant if you add the cycle staller object.
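To make the read penalty parameters more concrete, here is a minimal Python sketch of how the cost of one read could add up through a hierarchy. It assumes that penalties simply accumulate level by level, which the text does not state explicitly. The `LevelPenalties` class, the `read_cost` function, and all cycle values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LevelPenalties:
    name: str
    read_penalty: int = 0       # cycles to read from this cache
    read_miss_penalty: int = 0  # extra cycles when the read misses here


def read_cost(levels, hit_level):
    """Cycles for a read that hits in levels[hit_level].

    hit_level=None means the read misses in every cache.
    """
    cycles = 0
    for index, level in enumerate(levels):
        cycles += level.read_penalty
        if index == hit_level:
            return cycles
        # A miss adds the miss penalty on top of the read penalty.
        cycles += level.read_miss_penalty
    return cycles


hierarchy = [
    LevelPenalties("l1d", read_penalty=4),
    LevelPenalties("l2", read_penalty=12),
    # Only the last level sets a miss penalty: the time to reach memory.
    LevelPenalties("l3", read_penalty=40, read_miss_penalty=200),
]
l1_hit = read_cost(hierarchy, 0)
l3_hit = read_cost(hierarchy, 2)
memory_access = read_cost(hierarchy, None)
```

With these made-up values an l1d hit costs 4 cycles, an l3 hit costs 4 + 12 + 40 = 56 cycles, and a read that goes all the way to memory costs 256 cycles.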
If you run the simulation for a while, say 1 000 000 000 cycles, with the run-cycles command, you can then get information about a cache with that cache's print-statistics command.
So, from the table above choose one cache object, and do for instance:
simics> board.mb.cpu0.cache[0].l1d.print-statistics
The output will be something like the following. See chapter Understanding the Statistics of Simple Cache below for more information about the statistics.
┌─────┬───────────────────────────────────┬────────┬─────┐
│Row #│ Counter │ Value │ % │
├─────┼───────────────────────────────────┼────────┼─────┤
│ 1│read accesses │10571090│ │
│ 2│read misses │ 194075│ 1.84│
│ 3│write accesses │ 3704615│ │
│ 4│write misses │ 152557│ 4.12│
│ 5│prefetch accesses │ 323980│ │
│ 6│prefetch misses │ 253378│78.21│
│ 7│prefetched lines used │ 139516│43.06│
│ 8│evicted lines (total) │ 599242│ │
│ 9│evicted modified lines │ 172067│28.71│
│ 10│entire cache flushes (invd, wbinvd)│ 8│ │
│ 11│uncachable read accesses │56531126│ │
│ 12│uncachable write accesses │38655295│ │
└─────┴───────────────────────────────────┴────────┴─────┘
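The percentages in the table are consistent with each miss counter being divided by the matching access counter, prefetched lines used being relative to prefetch accesses, and evicted modified lines being relative to the total number of evicted lines. The following Python sketch recomputes them from the values above. The choice of denominators is inferred from the numbers, not taken from the Simics documentation, and the `percent` helper is made up for this example.

```python
counters = {
    "read accesses": 10571090,
    "read misses": 194075,
    "write accesses": 3704615,
    "write misses": 152557,
    "prefetch accesses": 323980,
    "prefetch misses": 253378,
    "prefetched lines used": 139516,
    "evicted lines (total)": 599242,
    "evicted modified lines": 172067,
}


def percent(part, whole):
    """Return counters[part] as a percentage of counters[whole]."""
    return f"{100.0 * counters[part] / counters[whole]:.2f}"


read_miss_rate = percent("read misses", "read accesses")
write_miss_rate = percent("write misses", "write accesses")
prefetch_miss_rate = percent("prefetch misses", "prefetch accesses")
prefetch_used = percent("prefetched lines used", "prefetch accesses")
evicted_modified = percent("evicted modified lines", "evicted lines (total)")
```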
The table can also be exported to a comma-separated values (CSV) file by using the .export-table-csv command, e.g.,
simics> board.mb.cpu0.cache[0].l1d.export-table-csv file = my-stats.csv
You can also view the content of a cache with the .print-cache-content command, e.g.,
simics> board.mb.cpu0.cache[0].l1d.print-cache-content -no-row-column
┌─────┬────────────────┬────────────────┬────────────────┬────────────────┐
│Index│ Way0 │ Way1 │ Way2 │ Way3 │
├─────┼────────────────┼────────────────┼────────────────┼────────────────┤
│ 0│M:0xdffd4b00:2:-│S:0xdffef000:1:-│S:0xdffce000:0:-│M:0xdffcf200:3:-│
│ 1│E:0xdffedf40:2:-│E:0xdffcf240:1:P│M:0xdffcf040:3:-│S:0xdffce040:0:-│
│ 2│S:0xdffce080:1:-│S:0xdffcf280:2:-│M:0xdffcf180:3:-│E:0xdffedf80:0:-│
│ 3│M:0xdffd4ac0:1:-│E:0xdffcf2c0:3:P│E:0xdffcf1c0:0:S│E:0xdffedfc0:2:-│
└─────┴────────────────┴────────────────┴────────────────┴────────────────┘
The -no-row-column flag removes the default Row column from the table, which is useful since it reduces possible confusion with the Index column.
This example shows a small cache with only 4 sets and 4 ways. The first letter of each cache line shows the state of the cache line in the MESI cache protocol: M is Modified, E is Exclusive, S is Shared, and I is Invalid. The next field is the tag, or rather the physical address of the cache line. The next number gives the age of the cache line among the ways in the set: 0 means the most recently used (MRU) line, and higher numbers mean older lines, up to the highest number, which represents the least recently used (LRU) line. The last letter shows the prefetch status: P means the line has been prefetched but not used yet, and S means that the line is currently part of a stride prefetching scheme.
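The cell format can be decoded with a few lines of Python. This is a sketch only: the names `parse_cell`, `set_index`, and `CacheLine` are made up for this example, the meaning of "-" as "no prefetch status" is an assumption, and so is the 64-byte line size used for the index calculation. That calculation does agree with every address in the table above.

```python
from typing import NamedTuple

MESI = {"M": "Modified", "E": "Exclusive", "S": "Shared", "I": "Invalid"}
PREFETCH = {
    "-": "none",  # assumed meaning of "-"
    "P": "prefetched, not used yet",
    "S": "part of a stride prefetching scheme",
}


class CacheLine(NamedTuple):
    state: str
    address: int
    age: int       # 0 = most recently used
    prefetch: str


def parse_cell(cell):
    """Decode one table cell such as 'E:0xdffcf240:1:P'."""
    state, address, age, prefetch = cell.split(":")
    return CacheLine(MESI[state], int(address, 16), int(age), PREFETCH[prefetch])


def set_index(address, line_size=64, sets=4):
    """Set index of an address, assuming simple modulo indexing."""
    return (address // line_size) % sets


line = parse_cell("E:0xdffcf240:1:P")
index = set_index(line.address)
```

For this cell, taken from row 1 of the table, `index` is 1, which matches the Index column.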
There is a flag to the command, -no-invalid-sets, that filters out sets with only invalid lines.
Also, the printed table shows a maximum of 40 sets by default. To show more or fewer sets, use the max argument to the command to set the limit.