• ARM CAS vs LL/SC

    From jseigh@jseigh_es00@xemaps.com to comp.arch on Tue May 5 17:08:48 2026
    From Newsgroup: comp.arch

Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue May 5 15:01:29 2026
    From Newsgroup: comp.arch

    On 5/5/2026 2:08 PM, jseigh wrote:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Get rid of possible livelock? Reservation granule poison pills?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue May 5 22:32:11 2026
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability. Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 6 01:29:51 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
allowing other cores to touch that data, too. Then, whoever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.
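The single-winner property described above can be sketched in C11 atomics (an illustration, not from the post; loop iterations stand in for racing cores, and `cas_race` is a hypothetical name):

```c
#include <stdatomic.h>

/* Illustrative sketch: several contenders that all read the same old
 * value race to CAS the same word; exactly one CAS in the conflicting
 * group succeeds.  Loop iterations stand in for racing cores. */
static int cas_race(atomic_int *word, int contenders) {
    int winners = 0;
    for (int core = 1; core <= contenders; core++) {
        int expected = 0;   /* every "core" saw the old value 0 */
        if (atomic_compare_exchange_strong(word, &expected, core))
            winners++;      /* only the first CAS lands */
    }
    return winners;
}
```

Run against a word initialized to 0, `cas_race` reports a single winner no matter how many contenders joined the conflicting group.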

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

Given that, one can make LL/SC with the same scaling properties
as CAS.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 6 02:03:12 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue May 5 21:38:38 2026
    From Newsgroup: comp.arch

    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

    I think there is an argument for both, though I am not sure how valid it
    is. LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was essentially test and set, i.e. before CPUs had atomic fetch and add instructions. If you wanted the functionality of atomic fetch and add,
    you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC. It would be LL, add,
    SC - three instructions. Of course, if you had the atomics, it would be
    one instruction. Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.
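The instruction-count argument above can be sketched in C11 (a minimal sketch; `fetch_add_via_cas` is a hypothetical name, and the CAS retry loop plays the role of the LL, add, SC sequence):

```c
#include <stdatomic.h>

/* Illustrative sketch: atomic fetch-and-add built from a CAS retry
 * loop, which mirrors the LL, add, SC pattern -- reload and retry
 * whenever another thread intervened.  With a native atomic,
 * atomic_fetch_add(p, v) is the whole operation in one instruction
 * (e.g. x86 XADD or an ARMv8.1 LSE atomic). */
static int fetch_add_via_cas(atomic_int *p, int v) {
    int old = atomic_load(p);                      /* "LL" */
    while (!atomic_compare_exchange_weak(p, &old, old + v))
        ;  /* "SC" failed: old was refreshed, retry with the new value */
    return old;                                    /* previous value */
}
```

Both paths return the previous value; the difference is purely how many round trips the hardware may need under contention.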

    So, having both gives you better efficiency for the cases where your requirements are met by the atomic instructions, but the LL/SC gives you better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost of having both is a separate discussion.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Kevin Bowling@kevin.bowling@kev009.com to comp.arch on Tue May 5 22:54:34 2026
    From Newsgroup: comp.arch

    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



Moving the contention detection to the cache
is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia
    instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was essentially test and set, i.e. before CPUs had atomic fetch and add instructions.  If you wanted the functionality of atomic fetch and add,
    you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC.  It would be LL, add,
    SC - three instructions.  Of course, if you had the atomics, it would be one instruction.  Of course, a similar argument applies to a later generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your requirements are met by the atomic instructions, but the LL/SC gives you better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost of having both is a separate discussion.



Cavium cnMIPS (OCTEON II/III) implements both. The LL/SC has lower
latency for the uncontended path. The CAS hits the L2. I'd guess the
reasoning was the same: CAS wins for higher core counts and guaranteed
progress?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed May 6 15:40:53 2026
    From Newsgroup: comp.arch

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for
implementing whatever atomic operation seems right for a particular
application, but the atomic operations are more efficient if you want
to do exactly what they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
    instructions.  If you wanted the functionality of atomic fetch and
    add, you would have had to do a TS instruction, followed by a load,
    then an add, then a store and finally a clear test and set - five
    instructions. Now think about the same thing if you had LL/SC.  It
    would be LL, add, SC - three instructions.  Of course, if you had the
    atomics, it would be one instruction.  Of course, a similar argument
    applies to a later generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
    requirements are met by the atomic instructions, but the LL/SC gives
    you better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost
of having both is a separate discussion.



Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
latency for the uncontested path.  The CAS hit the L2.  I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
progress?
    XADD is the better version of CAS, imho.
    It allows actual progress for multiple contending threads, with just the minimum-possible delay for transferring ownership.
To improve on it, I think you need a distributed arbiter that can handle
multiple incoming requests and send back the correct response to all of
them, with zero actual memory transfers. I.e. all such semaphore memory
addresses would actually end up inside the arbiter, so that it could
respond as if it was just RAM but with far lower latency.
    The question is who makes that kind of memory controller?
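A minimal sketch of why XADD gives that progress guarantee: a ticket lock (illustrative names, C11 atomics). Every contender's one fetch-and-add succeeds and hands back a unique ticket, so ownership transfers in FIFO order with no retry loop for the losers:

```c
#include <stdatomic.h>

/* Illustrative sketch: a ticket lock built on fetch_add (x86 XADD /
 * ARMv8.1 LSE).  Each contender's single atomic add succeeds, so every
 * thread is guaranteed to acquire the lock eventually, in FIFO order --
 * unlike a CAS loop, where losing threads must retry indefinitely. */
typedef struct {
    atomic_uint next;     /* next ticket to hand out */
    atomic_uint serving;  /* ticket currently being served */
} ticket_lock;

void ticket_init(ticket_lock *l) {
    atomic_init(&l->next, 0);
    atomic_init(&l->serving, 0);
}

void ticket_acquire(ticket_lock *l) {
    unsigned t = atomic_fetch_add(&l->next, 1);  /* one XADD, no retry */
    while (atomic_load(&l->serving) != t)
        ;  /* spin until it's our turn */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->serving, 1);  /* pass to the next ticket holder */
}
```

The spin on `serving` is still a read of a shared line, but only the holder's release writes it, which is the "minimum-possible delay for transferring ownership" noted above.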
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 6 14:19:20 2026
    From Newsgroup: comp.arch

    Kevin Bowling <kevin.bowling@kev009.com> writes:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



Moving the contention detection to the cache
is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
    instructions.  If you wanted the functionality of atomic fetch and add,
    you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC.  It would be LL, add,
SC - three instructions.  Of course, if you had the atomics, it would be
one instruction.  Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
    requirements are met by the atomic instructions, but the LL/SC gives you
    better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost of
    having both is a separate discussion.



    Cavium cnMIPS (OCTEONII/III) implement both. The LL/SC has lower
latency for the uncontested path. The CAS hit the L2. I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
progress?

The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
CN8800.  Both supported cache coherency across multiple sockets (up to
4 for the CN7800).  The CN8800 implemented the new ARMv8.1 Large
Systems Extension (i.e. atomic instructions) from the start, as it was
realized that the load/store exclusive paradigm in ARMv8.0 was a
performance limitation with multisocket CN8800 processors.

    Subsequent ARMv8/9 processors in the Octeon family have only supported single socket implementations, with large on-chip core counts.

    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel) processor chips.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 6 18:31:29 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.
    ------------------
    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hit the L2.  I'd guess the reasoning was the same, CAS wins for higher core counts and guaranteed progress?

    XADD is the better version of CAS, imho.

    It allows actual progress for multiple contending threads, with just the minimum-possible delay for transferring ownership.

    Under heavy contention, yes;
    under light contention, no.

    XADD has DRAM+coherence latency minimum.
    LL/SC has L1 (or L2) latency.

    To improve on it, I think you need a distributed arbiter that can handle multiple incoming requests and send back the correct response to all of them, with zero actual memory transfers.

    Both ASF and ESM have this functionality--it is effectively a TLB
    (used differently).

    I.e. all such semaphor memory addresses would actually end up inside the arbiter, so that it could
    respond as if it was just RAM but with far lower latency.

    You send it a cache line of addresses, and if none of the addresses
    is already in the arbiter, then you get exclusive access to all of
    them and are in a state where you can NAK accesses to those addresses.

    The question is who makes that kind of memory controller?

    Terje

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 6 18:54:03 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.
    ------------------
    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
latency for the uncontested path.  The CAS hit the L2.  I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
    progress?

    XADD is the better version of CAS, imho.

    It allows actual progress for multiple contending threads, with just the
    minimum-possible delay for transferring ownership.

    Under heavy contention, yes;
    under light contention, no.

    XADD has DRAM+coherence latency minimum.

    Surely the XADD can be handled by the L2
    or L3 (PoC - Point of Coherency) that owns
    the line.

    Only when caching is disabled for a
    particular address, will the XADD need to
    be done at the DRAM controller.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 6 19:48:51 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:


    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability.
    ------------------
    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hit the L2.  I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
progress?

    XADD is the better version of CAS, imho.

It allows actual progress for multiple contending threads, with just
the minimum-possible delay for transferring ownership.

    Under heavy contention, yes;
    under light contention, no.

    XADD has DRAM+coherence latency minimum.

    Surely the XADD can be handled by the L2
    or L3 (PoC - Point of Coherency) that owns
    the line.

I totally agree. XADD has to be done at the
place that owns write permission. The problem
is the latency to find where write permission
is.

    Only when caching is disabled for a
    particular address, will the XADD need to
    be done at the DRAM controller.

    There will be lots of times where the PoC
    is near maximal latency, and as the size
    of the coherent area increases, that latency
    goes up.

The real question is what kinds of memory
accesses are taking place in the millisecond
leading up to the XADD--because that is what
determines which lines are where.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Wed May 6 18:30:20 2026
    From Newsgroup: comp.arch

    On 5/6/26 10:19, Scott Lurndal wrote:
    Kevin Bowling <kevin.bowling@kev009.com> writes:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



                   Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
instructions.  If you wanted the functionality of atomic fetch and add,
you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC.  It would be LL, add,
SC - three instructions.  Of course, if you had the atomics, it would be
one instruction.  Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
requirements are met by the atomic instructions, but the LL/SC gives you
better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost of
having both is a separate discussion.



    Cavium cnMIPS (OCTEONII/III) implement both. The LL/SC has lower
    latency for the uncontested path. The CAS hit the L2. I'd guess the
    reasoning was the same, CAS wins for higher core counts and guaranteed
    progress?

The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
CN8800.  Both supported cache coherency across multiple sockets (up to
4 for the CN7800).  The CN8800 implemented the new ARMv8.1 Large
Systems Extension (i.e. atomic instructions) from the start, as it was
realized that the load/store exclusive paradigm in ARMv8.0 was a
performance limitation with multisocket CN8800 processors.

    Subsequent ARMv8/9 processors in the Octeon family have only supported single socket implementations, with large on-chip core counts.

    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel) processor chips.


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down. Plus if reclamation stops because of a stalled thread, is your queue still
    lock free?
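For contrast, here is a minimal sketch of the CAS version of the Michael-Scott queue (C11 atomics, illustrative names). The commented load of `head->next` in the dequeue is exactly the step that obliges CAS implementations to defer reclamation: in a multithreaded run, another thread may already have dequeued and freed `head`:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the CAS-based Michael-Scott queue.  Correct as shown only
 * single-threaded; a real MT version needs deferred reclamation
 * (hazard pointers, epochs, ...) before free() is safe. */
typedef struct node { int val; _Atomic(struct node *) next; } node;
typedef struct { _Atomic(node *) head, tail; } msqueue;

static node *new_node(int v) {
    node *n = malloc(sizeof *n);
    n->val = v;
    atomic_init(&n->next, NULL);
    return n;
}

void q_init(msqueue *q) {
    node *dummy = new_node(0);            /* head always points at a dummy */
    atomic_init(&q->head, dummy);
    atomic_init(&q->tail, dummy);
}

void q_enqueue(msqueue *q, int v) {
    node *n = new_node(v);
    for (;;) {
        node *tail = atomic_load(&q->tail);
        node *next = atomic_load(&tail->next);
        if (next == NULL) {               /* tail really is last: link n */
            if (atomic_compare_exchange_weak(&tail->next, &next, n)) {
                atomic_compare_exchange_weak(&q->tail, &tail, n); /* swing */
                return;
            }
        } else {                          /* tail lagging: help swing it */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
        }
    }
}

int q_dequeue(msqueue *q, int *out) {
    for (;;) {
        node *head = atomic_load(&q->head);
        node *tail = atomic_load(&q->tail);
        node *next = atomic_load(&head->next); /* hazard: head may already
                                                  be freed in a real MT run */
        if (next == NULL)
            return 0;                          /* queue empty */
        if (head == tail) {                    /* tail lagging: help */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
            continue;
        }
        if (atomic_compare_exchange_weak(&q->head, &head, next)) {
            *out = next->val;
            free(head);   /* only safe with deferred reclamation */
            return 1;
        }
    }
}
```

An LL/SC version can make the `head->next` read part of the reservation on `head`, which is the property being pointed at above.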


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 6 21:50:49 2026
    From Newsgroup: comp.arch

    On 5/6/2026 3:30 PM, jseigh wrote:
    On 5/6/26 10:19, Scott Lurndal wrote:
    Kevin Bowling <kevin.bowling@kev009.com> writes:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



                    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
    et alia
    instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
whatever atomic operation seems right for a particular application, but
the atomic operations are more efficient if you want to do exactly what
they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
instructions.  If you wanted the functionality of atomic fetch and add,
you would have had to do a TS instruction, followed by a load, then an
add, then a store and finally a clear test and set - five instructions.
Now think about the same thing if you had LL/SC.  It would be LL, add,
SC - three instructions.  Of course, if you had the atomics, it would be
one instruction.  Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.

So, having both gives you better efficiency for the cases where your
requirements are met by the atomic instructions, but the LL/SC gives you
better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost of
having both is a separate discussion.



    Cavium cnMIPS (OCTEON II/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hits the L2.  I'd guess the
    reasoning was the same: CAS wins for higher core counts and guaranteed
    progress?

    The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
    CN8800.  Both supported cache coherency across multiple sockets (up
    to 4 for the CN7800).  The CN8800 implemented the new ARMv8.1 Large
    System Extensions (i.e. atomic instructions) from the start, as it was
    realized that the load/store exclusive paradigm in ARMv8.0 was a
    performance limitation with multisocket CN8800 processors.

    Subsequent ARMv8/9 processors in the Octeon family have only supported
    single
    socket implementations, with large on-chip core counts.

    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel)
    processor chips.


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.



    Plus if reclamation stops because of a stalled thread, is your queue
    still lock-free?



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Thu May 7 17:34:21 2026
    From Newsgroup: comp.arch

    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward. I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
    store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it. The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri May 8 13:43:55 2026
    From Newsgroup: comp.arch

    On 5/7/2026 2:34 PM, jseigh wrote:
    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward.  I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
       store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it.  The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.



    You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
    DCAS. I try to separate those in my mind. The only way the Microsoft
    SLIST can get away with accessing deleted memory is SEH. What about
    accessing the next node from deleted memory? LL/SC does not need ABA
    counters, last time I checked, but it can have an issue with deleted
    memory? What am I missing here Joe? I guess it depends on what the
    underlying allocator actually does with the deleted node...
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 8 22:18:50 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 5/7/2026 2:34 PM, jseigh wrote:
    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward.  I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
       store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it.  The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.



    You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
    DCAS. I try to separate those in my mind. The only way the Microsoft
    SLIST can get away with accessing deleted memory is SEH. What about
    accessing the next node from deleted memory? LL/SC does not need ABA
    counters, last time I checked, but it can have an issue with deleted
    memory? What am I missing here Joe? I guess it depends on what the
    underlying allocator actually does with the deleted node...

    a) I can't seem to find a definition of SEH instruction

    b) LL/SC is not subject to ABA when SC fails on any control transfer
    out of thread {between LL and SC}; and this is one primary reason
    that LL/SC can fail spuriously.

    c) if deleted memory has been freed--all bets are off.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri May 8 22:06:47 2026
    From Newsgroup: comp.arch

    On 5/8/2026 3:18 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 5/7/2026 2:34 PM, jseigh wrote:
    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward.  I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
       store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it.  The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.



    You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
    DCAS. I try to separate those in my mind. The only way the Microsoft
    SLIST can get away with accessing deleted memory is SEH. What about
    accessing the next node from deleted memory? LL/SC does not need ABA
    counters, last time I checked, but it can have an issue with deleted
    memory? What am I missing here Joe? I guess it depends on what the
    underlying allocator actually does with the deleted node...

    a) I can't seem to find a definition of SEH instruction

    Oh, SEH is Microsoft's structured exception handling thing.


    b) LL/SC is not subject to ABA when SC fails on any control transfer
    out of thread {between LL and SC}; and this is one primary reason
    that LL/SC can fail spuriously.

    Yeah. But, some of them can even fail on loading from a reservation granule?


    c) if deleted memory has been freed--all bets are off.

    Oh yeah. Indeed!
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri May 8 22:11:42 2026
    From Newsgroup: comp.arch

    On 5/6/2026 6:40 AM, Terje Mathisen wrote:
    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the
    specific
    reason they add CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can give LL/SC the same scaling properties
    as CAS.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et
    alia
    instructions).

    And why implement both atomics and LL/SC in a new architecture?

    I think there is an argument for both, though I am not sure how valid
    it is.  LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application,
    but the atomic operations are more efficient if you want to do exactly
    what they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
    instructions.  If you wanted the functionality of atomic fetch and
    add, you would have had to do a TS instruction, followed by a load,
    then an add, then a store and finally a clear test and set - five
    instructions. Now think about the same thing if you had LL/SC.  It
    would be LL, add, SC - three instructions.  Of course, if you had
    the atomics, it would be one instruction.  Of course, a similar
    argument applies to a later generation of systems with respect to
    CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
    requirements are met by the atomic instructions, but the LL/SC gives
    you better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost
    of having both is a separate discussion.



    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hit the L2.  I'd guess the
    reasoning was the same, CAS wins for higher core counts and guaranteed
    progress?

    XADD is the better version of CAS, imho.

    But, XADD is NOT CAS. Very different things: one can fail, the other
    cannot. But you already know all of that.



    It allows actual progress for multiple contending threads, with just the minimum-possible delay for transferring ownership.

    If you can get an algo to use XADD, use that big time, good times. XADD
    can be used in wait-free loopless algos. CAS can too, but in different
    ways. Think of weak CAS (ala LL/SC like) vs strong CAS (ala x86 style).

    strong CAS can be used for interesting state machines.



    To improve on it, I think you need a distributed arbiter that can
    handle multiple incoming requests and send back the correct response
    to all of them, with zero actual memory transfers. I.e. all such
    semaphore memory addresses would actually end up inside the arbiter,
    so that it could respond as if it were just RAM but with far lower
    latency.

    The question is who makes that kind of memory controller?



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 10 20:19:47 2026
    From Newsgroup: comp.arch

    On 5/6/26 10:19 AM, Scott Lurndal wrote:
    [snip]
    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel) processor chips.

    Since idiom recognition allows a software LL/op/SC to be
    translated into an internal atomic operation, I do not see
    specific atomic instructions as "definitely a better choice".

    One could argue that code density, decode simplicity, and
    simpler detection of non-support (illegal instruction exception)
    give a significant advantage to atomic instructions, but I
    do not perceive atomic instructions as obviously better.

    Non-support of idiom recognition could be handled either by
    architectural requirement (for a version if not so defined
    initially) or software probing for the feature. That may be
    a sufficient solution for that issue.

    Code density may not be a huge issue if such atomic operations
    are infrequent. I am skeptical that providing small (16-
    bit) LL/SC instructions would be worthwhile (a 16-bit branch
    instruction might be worthwhile).

    It might be practical to provide a single load-and-store
    instruction that (in Mitch Alsup's term) casts a shadow on one
    operation instruction such that the store is delayed until after
    the operation. This might simplify decode a little (the idiom
    recognition check would be explicitly activated by the
    instruction encoding) and remove the need for a branch for retry
    (ESM does not require a branch for automatic retry). Such would
    also provide the same flexibility as LL/SC in terms of what the
    operation is. It may even be practical to specify the length of
    the "shadow" such that non-primitive operations could still be
    expressed as atomic within one store.

    (It might be useful to explicitly declare single _cache block_
    atomic operations. Such might be more easily ordered flexibly as
    the "transaction signature" is a single address. Hardware
    knowing this as soon as the first instruction is decoded might
    be simpler and able to act more quickly on the information.)

    Atomic operations also have the issue of which operations (and
    addressing modes and operands) are supported. Either interface
    might also need to be extended to provide additional
    performance-hinting or possibly even semantic information. As a
    new feature, atomic instructions might be better positioned to
    either provide space for such extensions or include them in the
    original design, though with fixed 32-bit instructions that
    might require a prefix instruction.

    Having both an optimistic concurrency mechanism (LL/SC, ESM,
    transactional memory) and atomic instructions requires defining
    how the two interact. For ESM, where a transaction can send a
    NAK to delay a conflicting access, an atomic instruction would
    have to be retryable even if the hardware would otherwise know
    that the operation would complete (or the otherwise NAKing node
    might have to handle the atomic operation providing the result
    to the node executing it).

    Idiom recognition is a powerful technique for translating
    a sequence of primitives into a solution, but sometimes an
    explicit approach is better. Yet I object to rejecting a
    sequence of primitives interface for semantic reasons when idiom
    recognition (and architectural guarantees) could provide the
    semantics. One could argue that the complexity budget could
    be better spent or that code density is important enough. (I do
    feel that RISC-V's emphasis on using idiom recognition is
    flawed.)

    Idiom recognition can also have the advantage of supporting
    legacy software that naturally uses the idiom. With LL/SC the
    expression of atomic operations is natural, so legacy software
    could use a new translate-to-atomic feature. (Legacy software
    might use simple atomic operations less often given the
    availability of larger "locked" code sequences.) If software
    translation was more widespread, this advantage could disappear.

    (Nearly universal software translation would provide other
    advantages — and, of course, the disadvantage of reducing lock-
    in to a particular architecture.)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon May 11 14:38:52 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/6/26 10:19 AM, Scott Lurndal wrote:
    [snip]
    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel)
    processor chips.

    Since idiom recognition allows a software LL/op/SC to be
    translated into an internal atomic operation, I do not see
    specific atomic instructions as "definitely a better choice".

    Please elaborate. There are few restrictions on the instructions
    that lie between the LL and SC instructions - I don't see how
    any CPU could translate an arbitrary sequence of instructions
    between the LL and SC into an atomic bus operation efficiently.


    One could argue that code density, decode simplicity, and
    simpler detection of non-support (illegal instruction exception)
    give a significant advantage to atomic instructions, but I
    do not perceive atomic instructions as obviously better.


    Non-support of idiom recognition could be handled either by
    architectural requirement (for a version if not so defined
    initially) or software probing for the feature. That may be
    a sufficient solution for that issue.

    Atomic instructions are already available for all existing
    mainstream CPUs, so your idea only applies to new CPU designs.

    IMO, LL/SC is an obsolete artifact of the past.
    --- Synchronet 3.22a-Linux NewsLink 1.2