• Should an ISA contain

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 8 23:34:21 2026
    From Newsgroup: comp.arch


    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri May 8 19:52:53 2026

    On 2026-May-08 19:34, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Do you mean changes a single line from using write-invalidate
    protocol to write-update so any remote writes are forwarded
    by the home directory to the current line owner?
    In effect, blocks line movement but not updates.

    Or something else?

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Not unprivileged, or applications could un-zero fields that had
    been intentionally zeroed out but were still held in cache.



  • From BGB@cr88192@gmail.com to comp.arch on Fri May 8 21:36:56 2026

    On 5/8/2026 6:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}


    At present, this sounds pretty close to my default/weak memory semantics.


    I have been considering adding something for temporary line locking
    on volatile write operations, for stronger ordering constraints, but
    this is not currently the default (so, for volatile, it is more just
    "hope that access timing works in one's favor").

    In this case, when doing a volatile operation, it would first load the
    line with a locking flag, and then when the operation completes it
    either writes back or sends a message to release the line. When a line
    is locked, the L2 cache or similar will not give it over to another core
    that tries to request the same line for volatile access (but may still
    allow it for non-volatile access).
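    The locking rule described above can be sketched as a small state
    check at the L2 arbiter. This is a hypothetical model of the scheme
    as described, not BGB's actual implementation; all names here are
    invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the line-locking rule: a line grabbed for a
 * volatile access is flagged locked in the L2; another core's
 * *volatile* request for the same line is deferred until release,
 * while non-volatile requests are still granted. */
typedef struct {
    unsigned long tag;
    bool locked;        /* set while a volatile op holds the line */
    int  owner_core;
} L2Line;

typedef enum { L2_GRANT, L2_DEFER } L2Resp;

L2Resp l2_request(L2Line *ln, int core, bool is_volatile) {
    if (ln->locked && is_volatile && core != ln->owner_core)
        return L2_DEFER;           /* wait for the release message */
    if (is_volatile) {             /* take the lock for this op    */
        ln->locked = true;
        ln->owner_core = core;
    }
    return L2_GRANT;               /* non-volatile always passes   */
}

void l2_release(L2Line *ln, int core) {
    if (ln->locked && ln->owner_core == core)
        ln->locked = false;        /* writeback or release message */
}
```

    A second core's non-volatile access still sees the line, matching
    the "may still allow it for non-volatile access" rule above.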


    Though, if multiple cores were to write to the same area of memory
    without explicit synchronization/flushing, then memory coherence issues
    could result. Current rule is mostly "don't do this".

    But, arguably, this is one of the faster/cheaper ways to do memory, even
    if the one most likely to result in unintended memory coherence issues.



    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    I wouldn't expect so. If it existed it would likely be very niche.




    Otherwise, most of my most recent efforts have been going more into
    working on my documentation, and finding/fixing some bugs in BGBCC.

    For a long time, there were demo desync issues in ROTT, for which I
    eventually found the cause:
    If a local array was initialized, it would only initialize the items
    from the initializer list and leave the rest of the array uninitialized.


    Say:
    int arr[10]={1,2,3,4,5};
    Would initialize items 0..4 but leave 5..9 holding garbage.

    I went and fixed this, and now ROTT seems to behave more consistently.


    Ironically, this came just after doing some work to make the memset
    builtin mechanism more effective (adding a memset slide similar to
    the existing memcpy slide).

    At present:
    1..95 bytes: Handled inline (reduced from 128);
    96..512 bytes: Uses a newly added memset slide;
    513+: call the generic memset function.

    Memset of 1..512 will use the slide if the value is non-zero.
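    Reading the rules above together (inline only for small zero fills;
    the slide for anything non-zero up to 512 bytes), the dispatch
    amounts to something like the following sketch (the names are
    hypothetical, not BGBCC's internals):

```c
/* Strategy selection for the builtin memset, following the
 * thresholds described above; the names here are invented. */
typedef enum { MS_INLINE, MS_SLIDE, MS_CALL } MemsetStrategy;

MemsetStrategy pick_memset(unsigned len, int value) {
    if (len <= 95 && value == 0)   /* small zero fill: inline stores   */
        return MS_INLINE;
    if (len <= 512)                /* 96..512 zero, or 1..512 non-zero */
        return MS_SLIDE;
    return MS_CALL;                /* 513+: generic memset() call      */
}
```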

    The memset slide is analogous to the memcpy slide: the compiler can
    encode a branch into somewhere in the slide (for coarse memset), and
    the landing location in the slide controls how much memory gets
    filled. Then, there are finer entry points, which generally fill in
    the final fractional bytes before branching into the main slide (for
    the bulk filling).

    In this case, the built-in memset mechanism is being used to
    zero-initialize local arrays before setting up the other members.

    The compiler isn't currently smart enough to do bulk initialization
    though:
    char arr[16]="SOMESTRING";
    will currently initialize the array using a series of byte stores
    (setting each array member individually). It might arguably be
    faster to transform this case into the equivalent of an inline
    memcpy from the string literal, but this particular situation is
    infrequent.
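    The suggested transformation would amount to something like the
    following (a sketch; the compiler would emit this inline rather
    than as library calls):

```c
#include <string.h>

/* Equivalent of `char arr[16] = "SOMESTRING";` done as a bulk copy
 * of the literal plus zeroing the tail, instead of per-byte stores.
 * C semantics require the remaining bytes to be zero. */
void init_by_copy(char dst[16]) {
    static const char lit[] = "SOMESTRING";   /* 10 chars + NUL = 11 */
    memcpy(dst, lit, sizeof lit);
    memset(dst + sizeof lit, 0, 16 - sizeof lit);
}
```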

    ...


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 9 18:44:55 2026


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    On 2026-May-08 19:34, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Do you mean changes a single line from using write-invalidate
    protocol to write-update so any remote writes are forwarded
    by the home directory to the current line owner?

    The line in a Exclusive or Modified state downgrades to a line
    in the Shared state {while the line remains resident}. If the
    line is no longer present, the instruction does nothing.

    In effect, blocks line movement but not updates.

    In a directory system, the directory knows that the line
    is shared in every cache it is present in.

    Or something else?

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Not unprivileged, or applications could un-zero fields that had
    been intentionally zeroed out but were still held in cache.

    Allowing optimistic SW updates that can be reverted.



  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sat May 9 23:27:56 2026

    MitchAlsup [2026-05-08 23:34:21] wrote:
    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    I can't see how you can add that without introducing undefined
    behavior into your ISA, so that's a clear no for me.


    === Stefan
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun May 10 02:04:51 2026

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?


    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Q+ has several I$ and D$ cache operations wrapped up in a single
    instruction called ‘CACHE’. I thought it best to put these in one instruction since they are infrequently used. The instruction has the
    same format as a load/store but the source/dest register is replaced by
    a command code. It uses the supplied address (if an address is needed).

    Turn cache on/off (D$ only)
    Invalidate entire cache (I$ or D$ or both)
    Invalidate cache line (I$ or D$ or both)
    Invalidate TLB
    Invalidate TLB entry


    Both the I$ and D$ caches can be invalidated with a single instruction.
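    The command list above might be encoded as something like the
    following; the names and numeric values are invented here, since
    the post does not give Q+'s actual encoding:

```c
/* Hypothetical command codes for Q+'s CACHE instruction, covering
 * the operations listed above (actual encodings not given). */
enum cache_cmd {
    CC_DCACHE_ON,       /* turn D$ on                      */
    CC_DCACHE_OFF,      /* turn D$ off                     */
    CC_INV_ALL_I,       /* invalidate entire I$            */
    CC_INV_ALL_D,       /* invalidate entire D$            */
    CC_INV_ALL_BOTH,    /* invalidate entire I$ and D$     */
    CC_INV_LINE_I,      /* invalidate I$ line at address   */
    CC_INV_LINE_D,      /* invalidate D$ line at address   */
    CC_INV_LINE_BOTH,   /* invalidate both lines           */
    CC_INV_TLB,         /* invalidate whole TLB            */
    CC_INV_TLB_ENTRY    /* invalidate TLB entry at address */
};
```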
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 10 17:26:11 2026


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?

    Consider the stack: after adding a number to SP, there are now a
    bunch of lines that are neither accessible nor hold a useful
    value.


    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Q+ has several I$ and D$ cache operations wrapped up in a single
    instruction called ‘CACHE’. I thought it best to put these in one instruction since they are infrequently used. The instruction has the
    same format as a load/store but the source/dest register is replaced by
    a command code. It uses the supplied address (if an address is needed).

    I have the same format, a memory reference that does not need a DST register specifier, so it becomes the OpCode.

    Turn cache on/off (D$ only)

    Why would you want the cache turned off??

    Invalidate entire cache (I$ or D$ or both)

    What if the cache is 1GB in size ??? This could take a long time.

    Invalidate cache line (I$ or D$ or both)
    Invalidate TLB

    With a coherent TLB this is unnecessary.

    Invalidate TLB entry


    Both the I$ and D$ caches can be invalidated with a single instruction.

    That may take a long time !
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 10 21:03:48 2026


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?

    Consider the stack: after adding a number to SP, there are now a
    bunch of lines that are neither accessible nor hold a useful
    value.


    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    Q+ has several I$ and D$ cache operations wrapped up in a single instruction called ‘CACHE’. I thought it best to put these in one instruction since they are infrequently used. The instruction has the
    same format as a load/store but the source/dest register is replaced by
    a command code. It uses the supplied address (if an address is needed).

    I have the same format, a memory reference that does not need a DST register specifier, so it becomes the OpCode.

    I broke the instruction into 3 sub-groups::
    a) prefetch
    b) invalidate
    c) post-push

    Prefetch brings data closer to the CPU (caches) and provides a specifier
    to which cache {I$, D$, L2, L3} and whether one wants write permission
    (or not).

    Invalidate gets rid of cached data without writing back.

    Post-Push pushes modified data farther from the CPU caches.
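    One way to picture this: with the DST-register field repurposed as
    the sub-opcode, the three sub-groups and their specifiers could be
    packed roughly as follows. The field layout is purely illustrative,
    a guess rather than My 66000's actual encoding:

```c
#include <stdint.h>

/* Illustrative packing of a cache-management sub-opcode:
 * [4:3] group, [2:1] target cache level, [0] want write permission.
 * This is a plausible layout, not the real ISA encoding. */
enum cm_group { CM_PREFETCH = 0, CM_INVALIDATE = 1, CM_POSTPUSH = 2 };
enum cm_level { CM_L1I = 0, CM_L1D = 1, CM_L2 = 2, CM_L3 = 3 };

static inline uint8_t cm_encode(enum cm_group g, enum cm_level lvl,
                                int want_write) {
    return (uint8_t)(((unsigned)g << 3) | ((unsigned)lvl << 1) |
                     (want_write & 1));
}
```

    A 5-bit field like this holds 32 combinations, matching the "as
    many as 32 instructions in this sub-group" remark below.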

    I launched this topic because I can put as many as 32 instructions
    in this sub-group, and after months of thinking, I only found 19 to
    put there.
    {yes, this violates the 'R' in RISC meaning Reduced}

    Turn cache on/off (D$ only)

    Why would you want the cache turned off??

    Invalidate entire cache (I$ or D$ or both)

    What if the cache is 1GB in size ??? This could take a long time.

    Invalidate cache line (I$ or D$ or both)
    Invalidate TLB

    With a coherent TLB this is unnecessary.

    Invalidate TLB entry


    Both the I$ and D$ caches can be invalidated with a single instruction.

    That may take a long time !
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 10 18:45:34 2026

    On 5/8/26 7:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Intel added Cache Line WriteBack to memory to help with memory
    persistence (IIRC), which can be viewed as a reliability
    assertion (data will not be lost on power failure). There could
    also be performance reasons for pushing data outward while
    retaining it locally in a clean (shared) state; a remote request
    for the data might have lower latency (sourcing directly from
    L3, e.g., rather than an L3 coherence directory indicating where
    the data is and then having to request the data from, and a state
    change by, the current owner).

    For L1 to L2, cache line granularity might be too fine for
    'checkpointing' data from a merely parity protected L1 to an
    ECC-protected L2, though My 66000's VVM (with appropriate
    acceleration) might make substantial blocks fast/low overhead.
    On the other hand, assigning reliability factor at a page level
    might be awkward from PTE bit starvation, granularity
    inflexibility, and timing.

    Would this also ensure data presence in outer cache/memory on a
    clean line? E.g., if applied with an L2 target when L2 is non-
    inclusive (but possibly tag inclusive or at least snoop
    filtering) and the line is clean, would the line be written back
    if not present in L2?

    If one had a mode that disallowed escape of dirty lines, this
    might be used as a means to commit temporary, local values. This
    seems somewhat similar to a transactional memory mechanism,
    though transactional memory would typically distinguish old
    dirty lines (and perhaps clean ones) allowing them to be written
    back on replacement.

    I also wonder if this might be used to assist in determining
    what cache indexes have been replaced in L2. With lazy writeback
    the timing factors may be fuzzed more. My mind does not work
    well for this type of problem.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    EricP pointed out a possible security issue if OS page zeroing
    could be thwarted. This could be worked around by having such
    page (or cache line) zeroing use special cases that act as if
    the zeroed memory was written back to memory. Forcing a
    distinction between explicit zeroing to provide a base value
    and zeroing to remove access to old data may invite software
    bugs when the difference is not recognized/remembered.

    This is similar to the problem that data cache block allocate
    had where old data (that the current thread was not permitted to
    read) of a possibly different address could be read. This was
    generally "solved" by defining allocation as either no-op on a
    cache hit and cache block zero on a miss. Since doing nothing
    on a cache hit may not have been very beneficial (one might save
    a bit of cache bandwidth) and zeroing provides other benefits,
    block zeroing seems to be preferred (though I still like
    allocation).

    (This also is reminiscent of the Mill's unbacked memory, which
    was memory that reads as zero [providing an implicit data cache
    block zero] and has no physical memory address until evicted
    from last level cache. For highly temporary data, the data
    would never leave the cache; this could also allow cache as
    memory as long as no cache was forced to be written back. I do
    not know if unbacked memory allowed an application to release
    the memory, which would be like an invalidate without
    writeback.)

    Optimistic updates sounds similar to transactional memory or
    versioned memory.

    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model. I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    I suspect one would have to be very careful about defining how
    such would interact with ESM (and perhaps other memory
    interaction methods).

    If the memory so cleared is thread local, I do not _think_
    there would be consistency issues. (I think IBM defined "local"
    memory transactions which supported speculation but not system
    atomicity.) Yet I feel that there might be uses for value
    checkpointing (versioning) where the address is shared by
    multiple threads.

    Obviously, hardware could in some cases interweave versions into
    a consistent order, but forcing software to handle the cases
    when hardware fails sounds problematic. Explicit checkpoints
    like with transactional memory, might be easier for programmers
    to use correctly than a fully flexible handling of speculation.
    On the other hand, finer-grained control could allow software
    to exploit knowledge that is not easily observed by (or
    communicated to) hardware.

    I think there are opportunities for versioned memory and/or
    other timing/speculation manipulation, but I do not have a clue
    about what interface should be presented to software. A RISC-
    like approach of cache line control instructions could provide
    flexibility, but the overhead for idiom recognition should also
    be considered.

    Modal operation (like transactional memory or ASM) simplifies
    some aspects and complicates others.

    I tend to favor complexity (flexibility), so my opinion is
    dangerous.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 00:33:55 2026


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/8/26 7:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Intel added Cache Line WriteBack to memory to help with memory
    persistence (IIRC), which can be viewed as a reliability
    assertion (data will not be lost on power failure).

    Thank you Paul! An interesting rationale.

    There could
    also be performance reasons for pushing data outward while
    retaining it locally in a clean (shared) state; a remote request
    for the data might have lower latency (sourcing directly from
    L3, e.g., rather than an L3 coherence directory indicating where
    the data is and having to request the data from and a state
    change for the owner).

    In a directory-based caching system, the directory is in a position
    where ANY shared cache line can be granted the Exclusive state
    {minimizing transfer distance}.

    For L1 to L2, cache line granularity might be too fine for
    'checkpointing' data from a merely parity protected L1 to an
    ECC-protected L2, though My 66000's VVM (with appropriate
    acceleration) might make substantial blocks fast/low overhead.

    If you care about RAS, you cannot have write back L1 caches
    with that property.

    On the other hand, assigning reliability factor at a page level
    might be awkward from PTE bit starvation, granularity
    inflexibility, and timing.

    Would this also ensure data presence in outer cache/memory on a
    clean line? E.g., if applied with an L2 target when L2 is non-
    inclusive (but possibly tag inclusive or at least snoop
    filtering) and the line is clean, would the line be written back
    if not present in L2?

    A whole different can of worms.....

    If one had a mode that disallowed escape of dirty lines, this
    might be used as a means to commit temporary, local values. This
    seems somewhat similar to a transactional memory mechanism,
    though transactional memory would typically distinguish old
    dirty lines (and perhaps clean ones) allowing them to be written
    back on replacement.

    Luckily, I have a fundamental disagreement with ISA extensions that
    provide SW the illusion that "lots of places" can be in intermediate
    states (i.e., TM).

    I also wonder if this might be used to assist in determining
    what cache indexes have been replaced in L2. With lazy writeback
    the timing factors may be fuzzed more. My mind does not work
    well for this type of problem.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    EricP pointed out a possible security issue if OS page zeroing
    could be thwarted.

    Why is the OS zeroing a page that has already been mapped into
    unprivileged VAS ???

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This could be worked around by having such
    page (or cache line) zeroing use special cases that act as if
    the zeroed memory was written back to memory. Forcing a
    distinction between explicit zeroing to provide a base value
    and zeroing to remove access to old data may invite software
    bugs when the difference is not recognized/remembered.

    This is similar to the problem that data cache block allocate
    had where old data (that the current thread was not permitted to
    read) of a possibly different address could be read. This was
    generally "solved" by defining allocation as either no-op on a
    cache hit and cache block zero on a miss.

    VVM is allowed to 'allocate' cache lines (CI without Read) when
    a line boundary is crossed and more than 1 complete line remains
    in the loop--saving interconnect BW and coherence messages.

    Since doing nothing on a cache hit may not have been very
    beneficial (one might save a bit of cache bandwidth) and zeroing
    provides other benefits, block zeroing seems to be preferred
    (though I still like allocation).

    (This also is reminiscent of the Mill's unbacked memory, which
    was memory that reads as zero [providing an implicit data cache
    block zero] and has no physical memory address until evicted
    from last level cache.

    I always liked that feature. I could not work it into a more
    conventional architecture, except for the 'known' program stack.

    For highly temporary data, the data
    would never leave the cache; this could also allow cache as
    memory as long as no cache was forced to be written back. I do
    not know if unbacked memory allowed an application to release
    the memory, which would be like an invalidate without
    writeback.)

    Known stack.

    Optimistic updates sounds similar to transactional memory or
    versioned memory.

    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model. I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    SW would consider this undefined--SW depends (way too much) on
    a read returning exactly the last thing written.

    I suspect one would have to be very careful about defining how
    such would interact with ESM (and perhaps other memory
    interaction methods).

    For any kind of ATOMIC thing, it is WAY better to do it correctly
    and SLOW than to take ANY chance of doing it wrong.

    If the memory so cleared is thread local, I do not _think_
    there would be consistency issues. (I think IBM defined "local"
    memory transactions which supported speculation but not system
    atomicity.) Yet I feel that there might be uses for value
    checkpointing (versioning) where the address is shared by
    multiple threads.

    Obviously, hardware could in some cases interweave versions into
    a consistent order, but forcing software to handle the cases
    when hardware fails sounds problematic. Explicit checkpoints
    like with transactional memory, might be easier for programmers
    to use correctly than a fully flexible handling of speculation.
    On the other hand, finer-grained control could allow software
    to exploit knowledge that is not easily observed by (or
    communicated to) hardware.

    I think there are opportunities for versioned memory and/or
    other timing/speculation manipulation, but I do not have a clue
    about what interface should be presented to software. A RISC-
    like approach of cache line control instructions could provide
    flexibility, but the overhead for idiom recognition should also
    be considered.

    Modal operation (like transactional memory or ASM) simplifies
    some aspects and complicates others.

    I tend to favor complexity (flexibility), so my opinion is
    dangerous.
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 10 21:43:42 2026

    On 5/10/26 8:33 PM, MitchAlsup wrote:

    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/8/26 7:34 PM, MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Intel added Cache Line WriteBack to memory to help with memory
    persistence (IIRC), which can be viewed as a reliability
    assertion (data will not be lost on power failure).

    Thank you Paul! An interesting rationale.

    Oops. I think the Optane-inspired instruction may have been
    CLFLUSHOPT, which may have been intended to allow faster
    controlled power down. (Even that guess may be wrong.)

    There could
    also be performance reasons for pushing data outward while
    retaining it locally in a clean (shared) state; a remote request
    for the data might have lower latency (sourcing directly from
    L3, e.g., rather than an L3 coherence directory indicating where
    the data is and then having to request the data from, and a state
    change by, the current owner).

    In a directory-based caching system, the directory is in a position
    where ANY shared cache line can be granted the Exclusive state
    {minimizing transfer distance}.

    For L1 to L2, cache line granularity might be too fine for
    'checkpointing' data from a merely parity protected L1 to an
    ECC-protected L2, though My 66000's VVM (with appropriate
    acceleration) might make substantial blocks fast/low overhead.

    If you care about RAS, you cannot have write back L1 caches
    with that property.

    Different customers may have different preferences. I seem to
    recall that Intel offered the option to replicate parity-
    protected L1 data cache to allow recovery (at the cost of half
    the capacity).

    It might be possible to pay the area penalty for ECC but have
    modal configuration of whether ECC or parity is used. If the
    SRAM cells were made less reliable to improve density (I think I
    read that Intel used more reliable L1 cells to mitigate the
    parity-only effect), then turning off ECC would result in more
    frequent flakiness but one could avoid read-modify-write on sub-
    word accesses. Design choices that improve reliability will
    tend to hurt performance; acceptable reliability can be fairly
    low when software (and no-ECC DRAM) will provide more failures.
    (With newer DRAM standards seeming likely to add ECC, software
    and CPU memory system reliability may become more important.)

    On the other hand, assigning reliability factor at a page level
    might be awkward from PTE bit starvation, granularity
    inflexibility, and timing.

    Would this also ensure data presence in outer cache/memory on a
    clean line? E.g., if applied with an L2 target when L2 is non-
    inclusive (but possibly tag inclusive or at least snoop
    filtering) and the line is clean, would the line be written back
    if not present in L2?

    A whole different can of worms.....

    If one had a mode that disallowed escape of dirty lines, this
    might be used as a means to commit temporary, local values. This
    seems somewhat similar to a transactional memory mechanism,
    though transactional memory would typically distinguish old
    dirty lines (and perhaps clean ones) allowing them to be written
    back on replacement.

    Luckily, I have a fundamental disagreement with ISA extensions that
    provide SW the illusion that "lots of places" can be in intermediate
    states (i.e., TM).

    I also wonder if this might be used to assist in determining
    what cache indexes have been replaced in L2. With lazy writeback
    the timing factors may be fuzzed more. My mind does not work
    well for this type of problem.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    EricP pointed out a possible security issue if OS page zeroing
    could be thwarted.

    Why is the OS zeroing a page that has already been mapped into
    unprivileged VAS ???

    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page). If the zeroed data is
    still present in cache, an invalidate without writeback would
    preserve the data in main memory with access by the new context.
    Forcing the OS to writeback (or writethrough at zeroing) such
    zeroing to memory would be bad for performance (especially with
    copy-on-write when the data will be dirtied again).

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    This could be worked around by having such
    page (or cache line) zeroing use special cases that act as if
    the zeroed memory was written back to memory. Forcing a
    distinction between explicit zeroing to provide a base value
    and zeroing to remove access to old data may invite software
    bugs when the difference is not recognized/remembered.

    This is similar to the problem that data cache block allocate
    had where old data (that the current thread was not permitted to
    read) of a possibly different address could be read. This was
    generally "solved" by defining allocation as either no-op on a
    cache hit and cache block zero on a miss.

    VVM is allowed to 'allocate' cache lines (CI without Read) when
    a line boundary is crossed and more than 1 complete line remains
    in the loop--saving interconnect BW and coherence messages.

    That may be the most common use for avoiding read-to-own, but it
    is not the only use.

    Since doing nothing on a cache hit may not have been very
    beneficial (one might save a bit of cache bandwidth) and zeroing
    provides other benefits, block zeroing seems to be preferred
    (though I still like allocation).

    (This also is reminiscent of the Mill's unbacked memory, which
    was memory that reads as zero [providing an implicit data cache
    block zero] and has no physical memory address until evicted
    from last level cache.

    I always liked that feature. I could not work it into a more
    conventional architecture, except for the 'known' program stack.

    I think one could by defining a read-only physical page as read-
    as-zero and "allocate" on write. One would have to have a
    background process to maintain a free list (perhaps similar to a
    hypervisor with extremely limited functionality) and an
    interface for that free list manager to request new pages (so
    conventional OSes would need more porting effort).

    One could also use placeholder physical page addresses (a.k.a.,
    shadow memory; "Increasing TLB Reach Using Superpages Backed by
    Shadow Memory", Mark Swanson, Leigh Stoller, and John Carter,
    1998; the paper uses a large virtual physical address page,
    whose address is treated as physical by caches and TLBs and is
    translated before cache eviction into multiple smaller physical
    pages), effectively introducing another layer of address
    virtualization for a modest subset of the physical address
    space. This would allow "physical addresses" in the caches with
    a TLB for the relatively few unbacked pages to allow them to be
    in cache without having a memory address allocated.

    (Having this additional TLB layer near the memory controller or
    shared cache slice would (I think) either require replication or
    page-size constraints on address distribution.)

    Avoiding software (OS) copy-on-write of zero pages might not be
    a significant benefit given My 66000's relatively fast context
    switches, but other ISAs might benefit just from low cost
    secure page zeroing copy-on-write.

    (Large pages might be a problem. The Mill had the advantage of
    supporting a larger variety of page sizes than My 66000. Having
    hardware change address translations to merge pages would also
    violate traditional OS assumptions.)

    There might be tension in deciding when a page zeroing
    instruction should write to the cache or write to the TLB. (The
    Mill's use of virtually addressed caches and delayed TLB helped,
    but I do not think that was essential.)

    Maybe I am missing something or maybe the issues I mentioned are
    larger than I suspected.

    For highly temporary data, the data
    would never leave the cache; this could also allow cache as
    memory as long as no cache was forced to be written back. I do
    not know if unbacked memory allowed an application to release
    the memory, which would be like an invalidate without
    writeback.)

    Known stack.

    For an activation record stack, this is somewhat straightforward
    (and such also presents security-improving opportunities, which
    you mentioned My 66000 also added).

    Optimistic updates sounds similar to transactional memory or
    versioned memory.

    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model. I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    SW would consider this undefined--SW depends (way too much) on
    a read returning exactly the last thing written.

    I suspect one would have to be very careful about defining how
    such would interact with ESM (and perhaps other memory
    interaction methods).

    Any kind of ATOMIC thing is WAY better to do it correct and
    SLOW than to take ANY chance of doing it wrong.

    Yes, though slow can also motivate incorrect software. Being
    able to clearly communicate the dangers also seems important
    (which argues for simplicity/orthogonality).

    If the memory so cleared is thread local, I do not _think_
    there would be consistency issues. (I think IBM defined "local"
    memory transactions which supported speculation but not system
    atomicity.) Yet I feel that there might be uses for value
    checkpointing (versioning) where the address is shared by
    multiple threads.

    Obviously, hardware could in some cases interweave versions into
    a consistent order, but forcing software to handle the cases
    when hardware fails sounds problematic. Explicit checkpoints
    like with transactional memory, might be easier for programmers
    to use correctly than a fully flexible handling of speculation.
    On the other hand, finer-grained control could allow software
    to exploit knowledge that is not easily observed by (or
    communicated to) hardware.

    I think there are opportunities for versioned memory and/or
    other timing/speculation manipulation, but I do not have a clue
    about what interface should be presented to software. A RISC-
    like approach of cache line control instructions could provide
    flexibility, but the overhead for idiom recognition should also
    be considered.

    Modal operation (like transactional memory or ESM) simplifies
    some aspects and complicates others.

    I tend to favor complexity (flexibility), so my opinion is
    dangerous.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 11 06:07:42 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Write permission is usually done at page granularity.

    But maybe your question is: Should there be an instruction that is a
    hint that a 64B block is not going to be written to in the foreseeable
    future (so a microarchitecture with 64B cache lines would write that
    line back to main memory now instead of later)?

    The answer to that question is: maybe. The case for such an
    instruction seems weaker than for prefetch instructions, and the
    results from using prefetch instructions have been disappointing.

    There are "architectures" like Power where "data memory" and
    "instruction memory" are not coherent, even when they are the same
    memory. Upon updating instructions (e.g., from a JIT compiler), they
    require that the modifying thread(s) write the lines back from the
    data cache to a shared cache or main memory, and that the executing
    threads invalidate these cache lines and flush their pipeline. I
    think that that's a bad idea, not just because it exposes
    microarchitectural concepts like cache and pipeline to the
    architecture, and leads to unpredictable results in some usage
    scenarios (see my signature), but also because the requirements on the executing threads are extremely difficult to meet if the executing
    threads run independently of the modifying thread(s). Or, in short,
    IA-32 and AMD64 did the right architecture for that.

    In any case, "architectures" with the deficiency described above
    necessarily have instructions that write data cache lines back to
    shared storage. In the case of Power this instruction is dcbst.
    While this instruction is documented, it refers to non-architected and implementation-dependent concepts like "cache line", i.e., it is not a
    properly architected instruction, and the cache synchronization code
    on Power is implementation-dependent.

    Should an ISA contain an instruction that invalidates (without
    writing back) a Data Cache (or L2) line ?? {Discard}

    No! What would the architectural meaning of such an instruction be?
    "Maybe restore some previous contents of this memory"? Does not sound
    useful at all. Not everything that can be done should be done.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 11 06:39:42 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    With respect to Stefan Monnier's seeing this as undefined
    behavior, I think this might be presented similarly to memory
    ordering with a weaker memory model.

    Is that supposed to be a defense of the discard instruction? It
    isn't. Descriptions of weak memory models are full of "undefined
    behaviour".

    Weak memory models are a bad idea like many supercomputer ideas (e.g.,
    division with wrong results, or imprecise exceptions), but unlike
    other bad ideas they have made it almost to general-purpose computing.
    And they are exactly bad ideas because:

    1) The unpredictable results if their restrictions are not heeded.

    2) The difficulty of heeding these restrictions by adding close to the
    minimum necessary strongifying instructions (e.g., memory barriers and
    atomic instructions). In particular, thanks to 1 there is no way to
    check the correctness of the placement of these instructions by
    testing.

    3) The extreme performance cost of the strongifying instructions, so
    when you use some simple scheme that guarantees correctness (e.g.,
    inserting a write barrier after every store and a read barrier before
    every load), the resulting program is extremely slow.

    In the case of weak memory models the hardware designers have the
    excuse that they are too lazy to implement a strong memory model
    efficiently (although they typically frame it by showing the
    inefficiency of some lazy implementation of a strong memory model),
    and that only a small part of the software actually communicates
    with other threads.

    But I think that the chilling effects of difficulties in inter-thread
    communication have kept that part small. Such difficulties already
    exist with sequential consistency; transactional memory looked like it
    might come to the rescue, but after the hype of about 20 years ago it
    is now in the valley of disappointment.

    I.e., the result of a read
    would still return a previously held value, but the "version"
    might be unexpected. The result is not "undefined" but timing
    dependent.

    "Undefined behaviour" typically originally means something where the
    people specifying it have a good idea what can happen, but where it is
    too complex and has too little benefit to actually specify it. E.g.,
    an out-of-bounds access to an object resulted in actually accessing
    that memory, but what is there is not specified and is implementation-dependent, and it might be that the address is not
    accessible, and results in a trap (and it seems to me that everything
    that may result in a trap on some machines has been labeled "undefined behaviour" in C; note the fine differences between
    implementation-defined and undefined behaviour for shifts in C). Only
    later compiler shenanigans like "optimizing" a loop that performs an out-of-bounds access into an endless loop were introduced and
    justified with the specified undefined behaviour.

    If My 66000 ever becomes a popular architecture with many
    implementations, and if discard's effect has been specified as making
    any read access before a write access to any memory in the cache line
    "undefined behaviour", we may see implementations that implement
    discard with effects that do not reflect your expectations at all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon May 11 07:29:32 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/10/26 8:33 PM, MitchAlsup wrote:
    [...]
    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page).

    Assigning a zero page for reading is a good idea. Copying that page
    on writing appears inefficient to me, because it needs to read the
    zero page into cache and write it to a newly allocated page.

    A better approach is to do just the writes. I think that zeroing the
    page on demand is a good approach, because then it is already in the
    D-cache, but AFAIK Linux actually zeros physical pages ahead of time
    typically on a separate (otherwise idle) core, and just maps one of
    those pages to the virtual page that needs to be written to. I wonder
    why Linux does that.

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    In what way is it more flexible? It is a page-clearing instruction.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon May 11 14:27:53 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with keeping
    the cache coherent? Sounds like the D$ cache line was write-protected
    and now it is to be made writable?

    Consider the stack, and after adding a number to SP there are now
    a bunch of lines that are neither accessible nor containing a useful
    value.

    Seems to me that the code will certainly call another function
    almost immediately that will simply reuse the already
    present stack cache line; prematurely invalidating it will
    actually slow things down.

    I see no benefit in invalidating it pre-emptively.

    It would certainly cause problems for code that intentionally
    uses the soi-disant "free" stack space in legal but unusual ways.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon May 11 14:32:17 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/10/26 8:33 PM, MitchAlsup wrote:
    [...]
    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page).

    Assigning a zero page for reading is a good idea. Copying that page
    on writing appears inefficient to me, because it needs to read the
    zero page into cache and write it to a newly allocated page.

    ArmV8 has the DC ZVA instruction to zero blocks of cache,
    specifically for this purpose.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 18:00:29 2026
    From Newsgroup: comp.arch


    Paul Clayton <paaronclayton@gmail.com> posted:

    On 5/10/26 8:33 PM, MitchAlsup wrote:
    --------------------------
    If you care about RAS, you cannot have write back L1 caches
    with that property.

    Different customers may have different preferences. I seem to
    recall that Intel offered the option to replicate parity-
    protected L1 data cache to allow recovery (at the cost of half
    the capacity).

    Yet, the design team can afford 1 design and the FAB can afford
    1 mask set--regardless of the number of customers.


    ---------------
    vVM is allowed to 'allocate' cache lines (CI without Read) when
    a line boundary is crossed and more than 1 complete line remains
    in the loop--saving interconnect BW and coherence messages.

    That may be the most common use for avoiding read-to-own, but it
    is not the only use.

    It is the only one easy to recognize.
    ------------
    Any kind of ATOMIC thing is WAY better to do it correct and
    SLOW than to take ANY chance of doing it wrong.

    Yes, though slow can also motivate incorrect software. Being
    able to clearly communicate the dangers also seems important
    (which argues for simplicity/orthogonality).

    As do most CPU things.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 18:09:59 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/10/26 8:33 PM, MitchAlsup wrote:
    [...]
    The OS zeros the physical page before assigning it to the new
    context (or more likely assigns a zero page and does copy on
    write, which is just zeroing the page).

    Assigning a zero page for reading is a good idea. Copying that page
    on writing appears inefficient to me, because it needs to read the
    zero page into cache and write it to a newly allocated page.

    A better approach is to do just the writes. I think that zeroing the
    page on demand is a good approach, because then it is already in the
    D-cache,

    Because there is a pool of already zeroed pages (for COW) it may be in
    some other CPUs cache.

    but AFAIK Linux actually zeros physical pages ahead of time typically on a separate (otherwise idle) core, and just maps one of
    those pages to the virtual page that needs to be written to. I wonder
    why Linux does that.

    Luckily, in My 66000, this zeroing is 1 instruction {MS #0,[&page]}
    and the interconnect is designed to transport the page zero in one
    transaction.

    This is more flexible than having cache line and page clearing
    instructions.

    In what way is it more flexible? It is a page-clearing instruction.

    It is a memset() function as an instruction. Any size is acceptable.

    - anton
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 11 18:18:27 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:

    On 2026-05-08 7:34 p.m., MitchAlsup wrote:

    Should an ISA contain an instruction that gives Write-Permission
    from the Data Cache (or L2) line outward in the memory hierarchy,
    while keeping the <now shared> line resident ?? {allow}

    Trying to fathom what is going on with this. Is it an issue with
    keeping the cache coherent? Sounds like the D$ cache line was
    write-protected and now it is to be made writable?

    Consider the stack, and after adding a number to SP there are now
    a bunch of lines that are neither accessible nor containing a useful
    value.

    Seems to me that the code will certainly call another function
    almost immediately that will simply reuse the already
    present stack cache line; prematurely invalidating it will
    actually slow things down.

    I did not invalidate those lines, I just marked them that if they are
    replaced before becoming "in stack" again they can be dropped without
    being pushed farther out the memory hierarchy.

    I see no benefit in invalidating it pre-emptively.

    It would certainly cause problems for code that intentionally
    uses the soi disant "free" stack space in legal but unusual ways.


    --- Synchronet 3.22a-Linux NewsLink 1.2