From Newsgroup: comp.arch
On 2/9/26 8:56 PM, Robert Finch wrote:
On 2026-02-09 5:27 p.m., BGB wrote:
[snip]
My usual thing here was:
In mixed page-size configurations, the MMU is set to the
smallest in-use page size;
When set to larger page sizes, a few of the bits that were
low-order bits with a smaller page size change to being
high-order bits with a larger page size (all other bits
remaining in the same place).
This makes sense if one knows (or can predict) the page size
early enough.
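A minimal sketch of one reading of this bit repositioning,
assuming a 64-entry TLB with 4 KiB base pages that is indexed
just above the page offset (the sizes and the exact bit
selection here are assumptions, not the described design):

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 6    /* 64-entry TLB index (assumed) */
#define BASE_SHIFT 12   /* 4 KiB base pages (assumed) */

/* Index as if every page were base-size; for a larger page, the
   index bits displaced by the larger offset are refilled from
   higher VA bits, while all other index bits stay in place. */
uint32_t tlb_index(uint64_t va, unsigned log2_page)
{
    uint32_t mask  = (1u << INDEX_BITS) - 1;
    uint32_t small = (uint32_t)(va >> BASE_SHIFT) & mask;
    unsigned moved = log2_page - BASE_SHIFT;  /* displaced bits */

    if (moved == 0)
        return small;

    uint32_t low_mask = (1u << moved) - 1;
    uint32_t high =
        (uint32_t)(va >> (BASE_SHIFT + INDEX_BITS)) & low_mask;
    return (small & ~low_mask) | high;
}
```

For a 4 KiB page this uses VA[17:12]; for a 64 KiB page the four
displaced bits are refilled from VA[21:18] while VA[17:16] keep
their positions.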
Itanium allowed different base page sizes for different
"segments" which were looked up in a small table, so this would
be similar to (if a bit slower than) per ASID base page sizes.
One could also do prediction based on the register used for the
base address (e.g., recording the last page size encountered
when a register, or register group, was used).
I suspect a simple predictor could be highly accurate (100% for
programs that only use base-size pages, if the predictor is
initialized on thread switches). The stack is likely to use the
base page size, and many other accesses would have temporal
locality with respect to the page size.
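A register-indexed predictor of the kind described might be
sketched as follows (the table size, reset value, and training
interface are all assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS   32
#define BASE_SHIFT 12   /* 4 KiB base pages (assumed) */

/* Last log2(page size) seen per base register (hypothetical
   predictor state; a real design might hash register groups). */
static uint8_t pred[NUM_REGS];

/* Reset on thread switch: predict base-size pages everywhere,
   giving 100% accuracy for base-size-only programs. */
void pred_reset(void)
{
    for (int i = 0; i < NUM_REGS; i++)
        pred[i] = BASE_SHIFT;
}

unsigned pred_lookup(unsigned base_reg)
{
    return pred[base_reg];
}

/* After the TLB resolves the access, train with the real size. */
void pred_train(unsigned base_reg, unsigned log2_page)
{
    pred[base_reg] = (uint8_t)log2_page;
}
```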
[snip]
Seemed "obvious enough".
Yes, such a scheme is obviously workable when the page size is
known in advance.
The paper seems to propose subdividing the associativity when
multiple page sizes are in use at the same time. I guess this
works, if one has enough associativity.
The skewed associativity helps to reduce conflict misses, so the
cost of lower associativity is reduced.
[snip]
{On 2/9/26 8:56 PM, Robert Finch wrote:}
I found a slide presentation on the idea.
I am wondering how an LRU system would work with the skewed entries.
I assume the entries would be shifting between the ways for LRU.
So, if there is a different-size page in one of the ways, would
it still work?
With ordinary skewed associativity, LRU would use compressed/
truncated timestamps, but an alternative is bit-per-entry NRU,
with the interesting aspect being that skewing groups different
NRU bits. The original skewed associativity paper ("Skewed
Associative Caches", Seznec and Bodin, 1992) proposed using
recently-used bits with periodic clearing of all of the bits,
choosing victims randomly first among not-recently-used entries
and then among not-modified ones. With a TLB, 'prefer keeping
larger pages' might be part of the policy.
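A sketch of such an NRU victim choice, with the larger-page
preference folded in as an assumed first-pass filter (the field
names and the exact pass ordering are assumptions):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define WAYS 4

struct entry { uint8_t used, modified, large; };

/* Victim choice in the spirit of Seznec and Bodin (1992):
   randomly among not-recently-used entries first, then among
   not-modified ones, then any; small pages are preferred as
   victims in the first pass (an assumed TLB-specific twist). */
int pick_victim(struct entry e[WAYS])
{
    int cand[WAYS], n;

    n = 0;                /* pass 1: not used, not a large page */
    for (int w = 0; w < WAYS; w++)
        if (!e[w].used && !e[w].large) cand[n++] = w;
    if (n) return cand[rand() % n];

    n = 0;                /* pass 2: any not-recently-used entry */
    for (int w = 0; w < WAYS; w++)
        if (!e[w].used) cand[n++] = w;
    if (n) return cand[rand() % n];

    n = 0;                /* pass 3: not modified */
    for (int w = 0; w < WAYS; w++)
        if (!e[w].modified) cand[n++] = w;
    if (n) return cand[rand() % n];

    return rand() % WAYS; /* everything used and modified */
}

/* Periodically clear all used bits so staleness decays. */
void clear_used(struct entry e[WAYS])
{
    for (int w = 0; w < WAYS; w++) e[w].used = 0;
}
```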
(Timestamps could be miss counts. One might also have different
offsets for different banks or sections of banks; I am guessing
that such might reduce wraparound issues. For an L1 TLB, this
might not require many bits to provide a close approximation of
LRU.)
A slightly later paper ("A case for two-way skewed-associative
caches", Seznec, 1993) proposed one bit per pair of cache lines,
with the bit set when bank zero is accessed and cleared when
bank one is accessed.
Sharing the indexing with two ways (e.g., two address hashing
functions in a four-way associative cache) would allow the pair
to use true LRU, limiting the damage from something like random
replacement. Combining this with a use bit might further improve
replacement choices.
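The one-bit-per-pair policy above can be sketched as follows
(the set count and interface are assumptions):

```c
#include <assert.h>
#include <stdbool.h>

#define SETS 64   /* sets per bank (assumed) */

/* One bit per pair of lines (Seznec, 1993): set on a bank-0
   access, cleared on a bank-1 access. The bit then directly
   names the less recently used bank of the pair. */
static bool bank0_mru[SETS];

void on_access(unsigned set, unsigned bank)
{
    bank0_mru[set] = (bank == 0);
}

/* Replace in the bank that was not the most recently used. */
unsigned victim_bank(unsigned set)
{
    return bank0_mru[set] ? 1 : 0;
}
```

Extending each pair with a use bit, as suggested, would only
add one more bit of state per pair.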
One might also track L1 evictions in an L2 TLB and set a
"sticky/second chance" bit when such entries soon return to L1,
so that they are less likely to be evicted (since all entries
eligible for replacement might have sticky bits set, one would
still have to be chosen as a victim, though one might fall back
to cuckoo replacement then). A victim cache would serve a
similar purpose. With a non-clustered TLB, the moving overhead
is lower than with a data cache because the blocks are small.
One interesting aspect of skewed associative caches is that one
can implement cuckoo (a.k.a. elbow) caches that choose a victim
in one group of ways and look for a victim in the victim's other
ways (potentially recursing to some depth). This can approach
the behavior of a fully associative cache. Of course, it also
depends on having a replacement policy that works well with
skewed associativity.
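A toy two-way sketch of the elbow idea, with a bounded
relocation chain (the hash functions, sizes, and give-up policy
are assumptions; a real skewed cache would use stronger skewing
functions and relocate among a victim's several other ways):

```c
#include <assert.h>
#include <stdint.h>

#define SETS      16
#define MAX_DEPTH 4

static uint64_t slot[2][SETS];   /* tag 0 = empty slot */

/* Two toy skewing hashes (assumptions for illustration). */
static unsigned h0(uint64_t tag) { return (unsigned)(tag % SETS); }
static unsigned h1(uint64_t tag) { return (unsigned)((tag / SETS) % SETS); }

/* Cuckoo/elbow insertion: if both candidate slots are full,
   displace the way-1 occupant and retry it at its own alternate
   location, recursing to a bounded depth. Returns 1 on success;
   on failure the tail of the displacement chain is dropped. */
int cuckoo_insert(uint64_t tag, int depth)
{
    unsigned s0 = h0(tag), s1 = h1(tag);
    if (!slot[0][s0]) { slot[0][s0] = tag; return 1; }
    if (!slot[1][s1]) { slot[1][s1] = tag; return 1; }
    if (depth >= MAX_DEPTH)
        return 0;                /* bound the relocation chain */
    uint64_t evicted = slot[1][s1];
    slot[1][s1] = tag;           /* elbow the occupant out */
    return cuckoo_insert(evicted, depth + 1);
}

int cuckoo_lookup(uint64_t tag)
{
    return slot[0][h0(tag)] == tag || slot[1][h1(tag)] == tag;
}
```

With tags that collide in both hashes the chain cannot resolve
the conflict and the depth limit stops the ping-pong, which is
why a good pairing of skewing functions matters.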
The use of overlaid skewed associativity to support multiple
page sizes may not be especially practical since one would
typically want different page sizes to be much larger (16 times
or more, probably), so one is less likely to have shared bits
for indexing. Yet I still consider it a clever and potentially
useful mechanism. For an L2 TLB with a modest number of page
sizes used by a thread, hash-rehash probably works well enough.
With huge pages, there is also more benefit to a separate TLB,
since the tag and translation address will be smaller, though
one could dynamically share storage and use sign extension to
allow smaller tags for some pages.
With page size prediction, it might be practical to only index
the two most likely page sizes.
--- Synchronet 3.21b-Linux NewsLink 1.2