• Re: Inverted Page Tables

    From Paul Clayton@paaronclayton@gmail.com to comp.arch on Mon Feb 16 19:29:45 2026
    From Newsgroup: comp.arch

    On 2/9/26 8:56 PM, Robert Finch wrote:
    On 2026-02-09 5:27 p.m., BGB wrote:
    [snip]

    My usual thing here was:
    In mixed page-size configurations, the MMU is set to the
    smallest in-use page size;
    When set to larger page sizes, a few of the bits that were
    low-order bits with a smaller page size change to being
    high-order bits with a larger page size (all other bits
    remaining in the same place).

    This makes sense if one knows (or can predict) the page size
    early enough.
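The bit movement described above can be sketched concretely
(this is an illustration, not any specific MMU's encoding):
the bits between two page-size shifts are exactly the ones
that change role from offset bits to page-number bits as the
page grows.

```python
# Sketch: splitting a virtual address into (virtual page
# number, page offset) for a given page-size shift.

def va_split(va: int, page_shift: int) -> tuple[int, int]:
    """Return (virtual page number, page offset)."""
    return va >> page_shift, va & ((1 << page_shift) - 1)

va = 0x12345678
vpn_4k, off_4k = va_split(va, 12)    # 4 KiB pages
vpn_64k, off_64k = va_split(va, 16)  # 64 KiB pages

# Bits 12..15 are VPN bits under 4 KiB pages but offset bits
# under 64 KiB pages; all other bits keep their roles.
assert vpn_4k & 0xF == off_64k >> 12
```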

    Itanium allowed different base page sizes for different
    "segments" which were looked up in a small table, so this would
    be similar to (if a bit slower than) per ASID base page sizes.

    One could also do prediction based on the register used for the
    base address (e.g., recording the last page size encountered
    when a register, or register group, was used).

    I suspect a simple predictor could be highly accurate (100% for
    programs that only use base size pages, if the predictor is
    initialized on thread switches). The stack is likely to use base
    page size and many other accesses would have temporal locality
    with respect to the page size.
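A minimal sketch of such a predictor, indexed by base-register
group (the table size and the base page shift of 12 are
illustrative assumptions, not from any real design):

```python
BASE_SHIFT = 12  # assumed smallest page size: 4 KiB

class PageSizePredictor:
    def __init__(self, num_groups: int = 8):
        # Initialize every group to the base page size,
        # e.g. on a thread switch.
        self.table = [BASE_SHIFT] * num_groups

    def predict(self, base_reg: int) -> int:
        # Predict the last page-size shift seen for this
        # register group.
        return self.table[base_reg % len(self.table)]

    def update(self, base_reg: int, actual_shift: int) -> None:
        # Record the page size actually encountered.
        self.table[base_reg % len(self.table)] = actual_shift
```

A program that only ever touches base-size pages never updates
away from BASE_SHIFT, giving the 100% accuracy noted above.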

    [snip]
    Seemed "obvious enough".

    Yes, such is obviously workable with page size known in advance.

    The paper seems to have the idea of subdividing associativity
    if multiple page sizes are in use at the same time. I guess
    this works, if one has enough associativity.

    The skewed associativity helps to reduce conflict misses, so the
    cost of lower associativity is reduced.

    [snip]
    {On 2/9/26 8:56 PM, Robert Finch wrote:}
    I found a slide presentation on the idea.

    I am wondering how an LRU system would work with the skewed
    entries. I assume the entries would be shifting between the
    ways for LRU. So, if there is a different-size page in one of
    the ways, would it still work?

    With ordinary skewed associativity, LRU would use compressed/
    truncated timestamps, but an alternative is bit per entry NRU
    with the interesting aspect being that skewing groups different
    NRU bits. The original skewed associativity paper ("Skewed
    Associative Caches", Seznec and Bodin, 1992) proposed using
    recently used bits with periodic clearing of all of the bits and
    choosing victims randomly first among not recently used, then
    among not modified. With a TLB 'prefer keeping larger pages'
    might be part of the policy.
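A sketch of that NRU-style policy: victims come from entries
whose recently-used bit is clear, with all bits cleared when no
candidate remains. Here a deterministic smallest-page
preference stands in for the paper's random choice, as the
'prefer keeping larger pages' tweak suggested above; that tweak
is not part of the original proposal.

```python
def choose_victim(entries):
    """entries: list of {'ru': bool, 'page_shift': int};
    returns the index of the chosen victim."""
    candidates = [i for i, e in enumerate(entries) if not e["ru"]]
    if not candidates:
        # Periodic clearing, forced here when everything is
        # marked recently used.
        for e in entries:
            e["ru"] = False
        candidates = list(range(len(entries)))
    # Prefer evicting the smallest page among the candidates.
    return min(candidates, key=lambda i: entries[i]["page_shift"])
```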

    (Timestamps could be a miss count. One might also have different
    offsets for different banks or sections of banks; I am guessing
    that such might reduce wraparound issues. For an L1 TLB, this
    might not require many bits to provide a close approximation of
    LRU.)

    A slightly later paper ("A case for two-way skewed-associative
    caches", Seznec, 1993) proposed one bit per pair of cache
    lines, with the bit set when bank zero is accessed and cleared
    when bank one is accessed.
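That per-pair bit amounts to the following sketch: the bit
records which bank was touched last, and the victim is the
other bank.

```python
class PairLRU:
    def __init__(self, num_sets: int):
        # One bit per set-pair; False means bank one (or
        # nothing) was accessed last.
        self.bit = [False] * num_sets

    def touch(self, set_index: int, bank: int) -> None:
        # Set on a bank-0 access, clear on a bank-1 access.
        self.bit[set_index] = (bank == 0)

    def victim(self, set_index: int) -> int:
        # Evict from the bank that was not accessed most recently.
        return 1 if self.bit[set_index] else 0
```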

    Sharing the indexing with two ways (e.g., two address hashing
    functions in a four-way associative cache) would allow the pair
    to use true LRU, limiting the damage from something like random
    replacement. Combining this with a use bit might further improve
    replacement choices.

    One might also track L1 evictions in an L2 TLB and use a
    "sticky/second chance" bit when such are soon returned to L1 so
    that they are less likely to be evicted (since all entries
    eligible for replacement might have sticky bits set, one would
    have to be a victim, though one might choose to use cuckoo
    replacement then). A victim cache would have similar use. With a
    non-clustered TLB, the moving overhead is lower than with a data
    cache because the blocks are small.

    One interesting aspect of skewed associative caches is that one
    can implement cuckoo (a.k.a. elbow) caches that choose a victim
    in one group of ways and look for a victim in the victim's other
    ways (potentially recursive to some depth). This can approach a
    fully associative cache. Of course, it also depends on having a
    replacement policy that works well with skewed associativity.
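A sketch of that elbow/cuckoo insertion for a two-way skewed
cache. The two hash functions here are toy stand-ins for real
skewing functions, and the recursion depth is an arbitrary
bound; on failure a real design would fall back to ordinary
replacement.

```python
NUM_SETS = 8

def h0(tag): return tag % NUM_SETS
def h1(tag): return (tag // NUM_SETS) % NUM_SETS  # illustrative only

def insert(banks, tag, depth=4):
    """banks: [dict, dict], each mapping set index -> tag.
    Returns True on success."""
    hs = (h0, h1)
    for bank in (0, 1):              # fast path: either way free
        s = hs[bank](tag)
        if s not in banks[bank]:
            banks[bank][s] = tag
            return True
    bank = 0
    while depth > 0:                 # elbow: displace and relocate
        s = hs[bank](tag)
        victim = banks[bank][s]
        banks[bank][s] = tag
        tag, bank = victim, 1 - bank  # try the victim's other bank
        s = hs[bank](tag)
        if s not in banks[bank]:
            banks[bank][s] = tag
            return True
        depth -= 1
    return False
```

With toy tags 9, 3, 67, 1 (chosen to collide), the final insert
displaces 9 from bank 0 into its free bank-1 slot, so all four
entries survive despite only two ways per set, which is the
sense in which this approaches full associativity.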

    The use of overlaid skewed associativity to support multiple
    page sizes may not be especially practical since one would
    typically want different page sizes to be much larger (16 times
    or more, probably), so one is less likely to have shared bits
    for indexing. Yet I still consider it a clever and potentially
    useful mechanism. For an L2 TLB with a modest number of page
    sizes used by a thread, hash-rehash probably works well enough.

    With huge pages, there is also more benefit to a separate TLB,
    since the tag and translation address will be smaller, though
    one could dynamically share storage and use sign-extension to
    use smaller tags for some pages.

    With page size prediction, it might be practical to only index
    the two most likely page sizes.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Wed Feb 18 14:03:36 2026
    From Newsgroup: comp.arch

    On 2/9/26 12:36 AM, Robert Finch wrote:
    There used to be two separate bits in my MMU project PTE to
    encode the type and a shortcut page indicator. They were
    combined into a single two-bit field: 0=PTE, 1=PTP, and
    2=shortcut. That leaves an extra unused code. I am wondering
    what to use the code for. The only thing I can think of ATM is a
    meta-data record indicator.

    By "meta-data record indicator" do you mean 'invalid' (i.e.,
    whatever information is in the other bits is managed by the OS)?

    I have also been considering PTEs that are not a power-of-two in
    size. For instance, 48-bits or possibly 56 bits. The PTEs would
    be stored in the last part of a page with a page header and
    other information at the beginning of the page. For an 8kB page
    the last (or middle) 6kB would be PTEs. IDK what the header
    would be used for, other than perhaps some sort of error
    management. That is, if the first part or the last part of the
    page was overwritten, it could generate an error. Same as a
    header on a block of memory.

    I think one might want blocks of PTEs to be within a minimum
    memory access chunk (cache block or smaller). This avoids having
    to fetch two cache blocks for a PTE (if the memory system is
    limited to aligned fetches) and allows the extra bits in the
    cache block to be applied to the PTEs in that chunk (without
    having to fetch separately from the top of the table).

    With 32-byte chunks and 48-bit PTEs, one would have 16 bits of
    chunk-local metadata in most chunks. (Special casing one chunk
    feels extra funky somehow.) 16 bits does not seem like much.
    Using all the bits saved by a smaller PTE for chunk-local
    information — with a power-of-two number of PTEs per chunk —
    might be interesting. It is not clear whether it would be
    useful to have
    policies applied to 4- or 8-page chunks.
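The chunk arithmetic above can be checked directly (chunk and
PTE sizes as given in the text):

```python
def metadata_bits(chunk_bytes: int, pte_bits: int,
                  ptes_per_chunk: int) -> int:
    """Spare bits left in an aligned chunk after packing PTEs."""
    spare = chunk_bytes * 8 - ptes_per_chunk * pte_bits
    assert spare >= 0, "PTEs do not fit in the chunk"
    return spare

# Five 48-bit PTEs in a 32-byte chunk leave 16 metadata bits;
# dropping to a power-of-two four PTEs leaves 64 bits.
assert metadata_bits(32, 48, 5) == 16
assert metadata_bits(32, 48, 4) == 64
```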

    When a chunk could be coalesced into a single page in terms of
    translation, the address portions of all but one PTE could be
    repurposed (in theory). I do not know what information one would
    want to pack into such space. One interesting possibility would
    be to increase the physical address space, though that does not
    seem that useful (because of the awkwardness of applying only to
    "large" pages) and would not use that many bits.

    Having some information be common to a group of pages could save
    a few bits, but managing the extra complexity is probably
    something OS developers would avoid. E.g., if a group of pages
    shared the most significant N bits of translation — sort of inverted-from-usual page coloring — most OSes would just have
    the allowed physical address space be N bits smaller.
    --- Synchronet 3.21b-Linux NewsLink 1.2