• ARM CAS vs LL/SC

    From jseigh@jseigh_es00@xemaps.com to comp.arch on Tue May 5 17:08:48 2026
    From Newsgroup: comp.arch

Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Tue May 5 15:01:29 2026
    From Newsgroup: comp.arch

    On 5/5/2026 2:08 PM, jseigh wrote:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Get rid of possible livelock? Reservation granule poison pills?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue May 5 22:32:11 2026
    From Newsgroup: comp.arch

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability. Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 6 01:29:51 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
allowing other cores to touch that data, too. Then, whoever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.
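The single-winner property described above can be sketched in C11 atomics (an illustration, not from the post; loop iterations stand in for racing cores, and `cas_race` is a hypothetical name):

```c
#include <stdatomic.h>

/* Illustrative sketch: several contenders that all read the same old
 * value race to CAS the same word; exactly one CAS in the conflicting
 * group succeeds.  Loop iterations stand in for racing cores. */
static int cas_race(atomic_int *word, int contenders) {
    int winners = 0;
    for (int core = 1; core <= contenders; core++) {
        int expected = 0;   /* every "core" saw the old value 0 */
        if (atomic_compare_exchange_strong(word, &expected, core))
            winners++;      /* only the first CAS lands */
    }
    return winners;
}
```

Run against a word initialized to 0, `cas_race` reports a single winner no matter how many contenders joined the conflicting group.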

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

Given that, one can make LL/SC with the same scaling properties
as CAS.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 6 02:03:12 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue May 5 21:38:38 2026
    From Newsgroup: comp.arch

    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

    I think there is an argument for both, though I am not sure how valid it
    is. LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was essentially test and set, i.e. before CPUs had atomic fetch and add instructions. If you wanted the functionality of atomic fetch and add,
    you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC. It would be LL, add,
    SC - three instructions. Of course, if you had the atomics, it would be
    one instruction. Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.
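The instruction-count argument above can be sketched in C11 (a minimal sketch; `fetch_add_via_cas` is a hypothetical name, and the CAS retry loop plays the role of the LL, add, SC sequence):

```c
#include <stdatomic.h>

/* Illustrative sketch: atomic fetch-and-add built from a CAS retry
 * loop, which mirrors the LL, add, SC pattern -- reload and retry
 * whenever another thread intervened.  With a native atomic,
 * atomic_fetch_add(p, v) is the whole operation in one instruction
 * (e.g. x86 XADD or an ARMv8.1 LSE atomic). */
static int fetch_add_via_cas(atomic_int *p, int v) {
    int old = atomic_load(p);                      /* "LL" */
    while (!atomic_compare_exchange_weak(p, &old, old + v))
        ;  /* "SC" failed: old was refreshed, retry with the new value */
    return old;                                    /* previous value */
}
```

Both paths return the previous value; the difference is purely how many round trips the hardware may need under contention.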

    So, having both gives you better efficiency for the cases where your requirements are met by the atomic instructions, but the LL/SC gives you better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost of having both is a separate discussion.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Kevin Bowling@kevin.bowling@kev009.com to comp.arch on Tue May 5 22:54:34 2026
    From Newsgroup: comp.arch

    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



Moving the contention detection to the cache
is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia
    instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was essentially test and set, i.e. before CPUs had atomic fetch and add instructions.  If you wanted the functionality of atomic fetch and add,
    you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC.  It would be LL, add,
    SC - three instructions.  Of course, if you had the atomics, it would be one instruction.  Of course, a similar argument applies to a later generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your requirements are met by the atomic instructions, but the LL/SC gives you better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost of having both is a separate discussion.



Cavium cnMIPS (OCTEON II/III) implements both. The LL/SC has lower
latency for the uncontended path. The CAS hits the L2. I'd guess the
reasoning was the same: CAS wins for higher core counts and guaranteed
progress?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed May 6 15:40:53 2026
    From Newsgroup: comp.arch

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for
implementing whatever atomic operation seems right for a particular
application, but the atomic operations are more efficient if you want
to do exactly what they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
    instructions.  If you wanted the functionality of atomic fetch and
    add, you would have had to do a TS instruction, followed by a load,
    then an add, then a store and finally a clear test and set - five
    instructions. Now think about the same thing if you had LL/SC.  It
    would be LL, add, SC - three instructions.  Of course, if you had the
    atomics, it would be one instruction.  Of course, a similar argument
    applies to a later generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
    requirements are met by the atomic instructions, but the LL/SC gives
    you better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost
of having both is a separate discussion.



Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
latency for the uncontested path.  The CAS hit the L2.  I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
progress?
    XADD is the better version of CAS, imho.
    It allows actual progress for multiple contending threads, with just the minimum-possible delay for transferring ownership.
To improve on it, I think you need a distributed arbiter that can handle
multiple incoming requests and send back the correct response to all of
them, with zero actual memory transfers. I.e. all such semaphore memory
addresses would actually end up inside the arbiter, so that it could
respond as if it was just RAM but with far lower latency.
    The question is who makes that kind of memory controller?
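A minimal sketch of why XADD gives that progress guarantee: a ticket lock (illustrative names, C11 atomics). Every contender's one fetch-and-add succeeds and hands back a unique ticket, so ownership transfers in FIFO order with no retry loop for the losers:

```c
#include <stdatomic.h>

/* Illustrative sketch: a ticket lock built on fetch_add (x86 XADD /
 * ARMv8.1 LSE).  Each contender's single atomic add succeeds, so every
 * thread is guaranteed to acquire the lock eventually, in FIFO order --
 * unlike a CAS loop, where losing threads must retry indefinitely. */
typedef struct {
    atomic_uint next;     /* next ticket to hand out */
    atomic_uint serving;  /* ticket currently being served */
} ticket_lock;

void ticket_init(ticket_lock *l) {
    atomic_init(&l->next, 0);
    atomic_init(&l->serving, 0);
}

void ticket_acquire(ticket_lock *l) {
    unsigned t = atomic_fetch_add(&l->next, 1);  /* one XADD, no retry */
    while (atomic_load(&l->serving) != t)
        ;  /* spin until it's our turn */
}

void ticket_release(ticket_lock *l) {
    atomic_fetch_add(&l->serving, 1);  /* pass to the next ticket holder */
}
```

The spin on `serving` is still a read of a shared line, but only the holder's release writes it, which is the "minimum-possible delay for transferring ownership" noted above.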
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 6 14:19:20 2026
    From Newsgroup: comp.arch

    Kevin Bowling <kevin.bowling@kev009.com> writes:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



Moving the contention detection to the cache
is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
    instructions.  If you wanted the functionality of atomic fetch and add,
    you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC.  It would be LL, add,
SC - three instructions.  Of course, if you had the atomics, it would be
one instruction.  Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
    requirements are met by the atomic instructions, but the LL/SC gives you
    better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost of
    having both is a separate discussion.



    Cavium cnMIPS (OCTEONII/III) implement both. The LL/SC has lower
latency for the uncontested path. The CAS hit the L2. I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
progress?

The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
CN8800.  Both supported cache coherency across multiple sockets (up to
4 for the CN7800).  The CN8800 implemented the new ARMv8.1 Large
Systems Extension (i.e. atomic instructions) from the start, as it was
realized that the load/store exclusive paradigm in ARMv8.0 was a
performance limitation with multisocket CN8800 processors.

    Subsequent ARMv8/9 processors in the Octeon family have only supported single socket implementations, with large on-chip core counts.

    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel) processor chips.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 6 18:31:29 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.
    ------------------
    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hit the L2.  I'd guess the reasoning was the same, CAS wins for higher core counts and guaranteed progress?

    XADD is the better version of CAS, imho.

    It allows actual progress for multiple contending threads, with just the minimum-possible delay for transferring ownership.

    Under heavy contention, yes;
    under light contention, no.

    XADD has DRAM+coherence latency minimum.
    LL/SC has L1 (or L2) latency.

    To improve on it, I think you need a distributed arbiter that can handle multiple incoming requests and send back the correct response to all of them, with zero actual memory transfers.

    Both ASF and ESM have this functionality--it is effectively a TLB
    (used differently).

    I.e. all such semaphor memory addresses would actually end up inside the arbiter, so that it could
    respond as if it was just RAM but with far lower latency.

    You send it a cache line of addresses, and if none of the addresses
    is already in the arbiter, then you get exclusive access to all of
    them and are in a state where you can NAK accesses to those addresses.

    The question is who makes that kind of memory controller?

    Terje

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed May 6 18:54:03 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.
    ------------------
    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
latency for the uncontested path.  The CAS hit the L2.  I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
    progress?

    XADD is the better version of CAS, imho.

    It allows actual progress for multiple contending threads, with just the
    minimum-possible delay for transferring ownership.

    Under heavy contention, yes;
    under light contention, no.

    XADD has DRAM+coherence latency minimum.

    Surely the XADD can be handled by the L2
    or L3 (PoC - Point of Coherency) that owns
    the line.

    Only when caching is disabled for a
    particular address, will the XADD need to
    be done at the DRAM controller.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed May 6 19:48:51 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:


    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the specific
    reason they add CAS to the architecture?

    Scalability.
    ------------------
    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hit the L2.  I'd guess the
reasoning was the same, CAS wins for higher core counts and guaranteed
progress?

    XADD is the better version of CAS, imho.

It allows actual progress for multiple contending threads, with just
the minimum-possible delay for transferring ownership.

    Under heavy contention, yes;
    under light contention, no.

    XADD has DRAM+coherence latency minimum.

    Surely the XADD can be handled by the L2
    or L3 (PoC - Point of Coherency) that owns
    the line.

I totally agree. XADD has to be done at the
place that owns write permission. The problem
is the latency to find where write permission
is.

    Only when caching is disabled for a
    particular address, will the XADD need to
    be done at the DRAM controller.

    There will be lots of times where the PoC
    is near maximal latency, and as the size
    of the coherent area increases, that latency
    goes up.

The real question is what kinds of memory
accesses are taking place in the millisecond
leading up to the XADD--because that is what
determines which lines are where.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Wed May 6 18:30:20 2026
    From Newsgroup: comp.arch

    On 5/6/26 10:19, Scott Lurndal wrote:
    Kevin Bowling <kevin.bowling@kev009.com> writes:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



                   Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
et alia instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application, but
    the atomic operations are more efficient if you want to do exactly what
    they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
instructions.  If you wanted the functionality of atomic fetch and add,
you would have had to do a TS instruction, followed by a load, then an
    add, then a store and finally a clear test and set - five instructions.
    Now think about the same thing if you had LL/SC.  It would be LL, add,
SC - three instructions.  Of course, if you had the atomics, it would be
one instruction.  Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
requirements are met by the atomic instructions, but the LL/SC gives you
better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost of
having both is a separate discussion.



    Cavium cnMIPS (OCTEONII/III) implement both. The LL/SC has lower
    latency for the uncontested path. The CAS hit the L2. I'd guess the
    reasoning was the same, CAS wins for higher core counts and guaranteed
    progress?

The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
CN8800.  Both supported cache coherency across multiple sockets (up to
4 for the CN7800).  The CN8800 implemented the new ARMv8.1 Large
Systems Extension (i.e. atomic instructions) from the start, as it was
realized that the load/store exclusive paradigm in ARMv8.0 was a
performance limitation with multisocket CN8800 processors.

    Subsequent ARMv8/9 processors in the Octeon family have only supported single socket implementations, with large on-chip core counts.

    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel) processor chips.


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down. Plus if reclamation stops because of a stalled thread, is your queue still
    lock free?
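For contrast, here is a minimal sketch of the CAS version of the Michael-Scott queue (C11 atomics, illustrative names). The commented load of `head->next` in the dequeue is exactly the step that obliges CAS implementations to defer reclamation: in a multithreaded run, another thread may already have dequeued and freed `head`:

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch of the CAS-based Michael-Scott queue.  Correct as shown only
 * single-threaded; a real MT version needs deferred reclamation
 * (hazard pointers, epochs, ...) before free() is safe. */
typedef struct node { int val; _Atomic(struct node *) next; } node;
typedef struct { _Atomic(node *) head, tail; } msqueue;

static node *new_node(int v) {
    node *n = malloc(sizeof *n);
    n->val = v;
    atomic_init(&n->next, NULL);
    return n;
}

void q_init(msqueue *q) {
    node *dummy = new_node(0);            /* head always points at a dummy */
    atomic_init(&q->head, dummy);
    atomic_init(&q->tail, dummy);
}

void q_enqueue(msqueue *q, int v) {
    node *n = new_node(v);
    for (;;) {
        node *tail = atomic_load(&q->tail);
        node *next = atomic_load(&tail->next);
        if (next == NULL) {               /* tail really is last: link n */
            if (atomic_compare_exchange_weak(&tail->next, &next, n)) {
                atomic_compare_exchange_weak(&q->tail, &tail, n); /* swing */
                return;
            }
        } else {                          /* tail lagging: help swing it */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
        }
    }
}

int q_dequeue(msqueue *q, int *out) {
    for (;;) {
        node *head = atomic_load(&q->head);
        node *tail = atomic_load(&q->tail);
        node *next = atomic_load(&head->next); /* hazard: head may already
                                                  be freed in a real MT run */
        if (next == NULL)
            return 0;                          /* queue empty */
        if (head == tail) {                    /* tail lagging: help */
            atomic_compare_exchange_weak(&q->tail, &tail, next);
            continue;
        }
        if (atomic_compare_exchange_weak(&q->head, &head, next)) {
            *out = next->val;
            free(head);   /* only safe with deferred reclamation */
            return 1;
        }
    }
}
```

An LL/SC version can make the `head->next` read part of the reservation on `head`, which is the property being pointed at above.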


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Wed May 6 21:50:49 2026
    From Newsgroup: comp.arch

    On 5/6/2026 3:30 PM, jseigh wrote:
    On 5/6/26 10:19, Scott Lurndal wrote:
    Kevin Bowling <kevin.bowling@kev009.com> writes:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?

    Scalability.



                    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can make LL/SC with that same scaling properties
    as CASs.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear,
    et alia
    instructions).

    And why implement both atomics and LL/SC in a new architecture?

I think there is an argument for both, though I am not sure how valid
it is.  LL/SC provides a very flexible "framework" for implementing
whatever atomic operation seems right for a particular application, but
the atomic operations are more efficient if you want to do exactly what
they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
instructions.  If you wanted the functionality of atomic fetch and add,
you would have had to do a TS instruction, followed by a load, then an
add, then a store and finally a clear test and set - five instructions.
Now think about the same thing if you had LL/SC.  It would be LL, add,
SC - three instructions.  Of course, if you had the atomics, it would be
one instruction.  Of course, a similar argument applies to a later
    generation of systems with respect to CAS/DCAS.

So, having both gives you better efficiency for the cases where your
requirements are met by the atomic instructions, but the LL/SC gives you
better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost of
having both is a separate discussion.



    Cavium cnMIPS (OCTEON II/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hits the L2.  I'd guess the
    reasoning was the same: CAS wins for higher core counts and guaranteed
    progress?

    The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
    CN8800.  Both supported cache coherency across multiple sockets (up
    to 4 for the CN7800).  The CN8800 implemented the new ARMv8.1 Large
    System Extensions (i.e. atomic instructions) from the start, as it was
    realized that the load/store exclusive paradigm in ARMv8.0 was a
    performance limitation with multisocket CN8800 processors.

    Subsequent ARMv8/9 processors in the Octeon family have only supported
    single
    socket implementations, with large on-chip core counts.

    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel)
    processor chips.


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.



    Plus if reclamation stops because of a stalled thread, is your queue
    still lock-free?



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From jseigh@jseigh_es00@xemaps.com to comp.arch on Thu May 7 17:34:21 2026
    From Newsgroup: comp.arch

    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward. I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
    store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it. The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri May 8 13:43:55 2026
    From Newsgroup: comp.arch

    On 5/7/2026 2:34 PM, jseigh wrote:
    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward.  I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
       store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it.  The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.



    You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
    DCAS. I try to separate those in my mind. The only way the Microsoft
    SLIST can get away with accessing deleted memory is SEH. What about
    accessing the next node from deleted memory? LL/SC does not need ABA
    counters, last time I checked, but it can have an issue with deleted
    memory? What am I missing here Joe? I guess it depends on what the
    underlying allocator actually does with the deleted node...
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 8 22:18:50 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 5/7/2026 2:34 PM, jseigh wrote:
    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward.  I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
       store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it.  The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.



    You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
    DCAS. I try to separate those in my mind. The only way the Microsoft
    SLIST can get away with accessing deleted memory is SEH. What about
    accessing the next node from deleted memory? LL/SC does not need ABA
    counters, last time I checked, but it can have an issue with deleted
    memory? What am I missing here Joe? I guess it depends on what the
    underlying allocator actually does with the deleted node...

    a) I can't seem to find a definition of SEH instruction

    b) LL/SC is not subject to ABA when SC fails on any control transfer
    out of thread {between LL and SC}; and this is one primary reason
    that LL/SC can fail spuriously.

    c) if deleted memory has been freed--all bets are off.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri May 8 22:06:47 2026
    From Newsgroup: comp.arch

    On 5/8/2026 3:18 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 5/7/2026 2:34 PM, jseigh wrote:
    On 5/7/26 00:50, Chris M. Thomasson wrote:
    On 5/6/2026 3:30 PM, jseigh wrote:


    I was messing about with lock-free queue implementations and realized
    you can do a Michael-Scott lock-free queue implementation using LL/SC,
    which is an advantage since you don't need deferred reclamation that
    CAS implementations require and which can slow things down.

    How? Say using it raw with dynamic nodes. The user deletes a node. How
    does that work without hitting deleted memory? I know that Microsoft
    uses SEH in its SLIST impl.


    Dequeuing from head is straightforward.  I don't think I need to
    describe that.

    Enqueuing uses the old Double Tap trick.
    1) load locked from tail node next pointer.
    2) if null and tail still (double tap) points to node,
       store conditional new node address into the next pointer.

    Updating tail by doing a load locked on it,
    loading the next pointer from tail node, and
    updating tail.

    It's ok to access deleted memory as long as you don't
    actually use it.  The old lock-free stack using
    DCAS did that when unsuccessfully popping the stack.
    It may have had an invalid value but the DCAS will fail.



    You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
    DCAS. I try to separate those in my mind. The only way the Microsoft
    SLIST can get away with accessing deleted memory is SEH. What about
    accessing the next node from deleted memory? LL/SC does not need ABA
    counters, last time I checked, but it can have an issue with deleted
    memory? What am I missing here Joe? I guess it depends on what the
    underlying allocator actually does with the deleted node...

    a) I can't seem to find a definition of SEH instruction

    Oh, SEH is Microsoft's structured exception handling thing.


    b) LL/SC is not subject to ABA when SC fails on any control transfer
    out of thread {between LL and SC}; and this is one primary reason
    that LL/SC can fail spuriously.

    Yeah. But, some of them can even fail on loading from a reservation granule?


    c) if deleted memory has been freed--all bets are off.

    Oh yeah. Indeed!
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri May 8 22:11:42 2026
    From Newsgroup: comp.arch

    On 5/6/2026 6:40 AM, Terje Mathisen wrote:
    Kevin Bowling wrote:
    On 5/5/26 21:38, Stephen Fuld wrote:
    On 5/5/2026 7:03 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    jseigh <jseigh_es00@xemaps.com> writes:
    Of the possible issues LL/SC might have, did ARM mention the
    specific
    reason they add CAS to the architecture?

    Scalability.



    Moving the contention detection to the cache
    is much more bandwidth efficient than swapping cache lines
    between a hundred cores.

    CASs touch the modifiable data fields without write permission,
    allowing other cores to touch that data, too. Then, whomever
    gets to CAS first {and then gets their CAS addresses to LLC/DRC
    first} wins. But you still have the property that only 1 CAS
    {in a conflicting group} succeeds.

    I think this second point is dependent on your cache coherent
    protocol.

    With this in mind, My 66000 CCP has the ability to request write
    permission on a cache line request, but the other end of the
    transaction can refuse to send write permission. So, LL requests
    write permission, but the 'system' can send the line read-only.
    A core can refuse to pass write permission when it has performed
    one or more LLs without having run into the SC.

    Given that, one can give LL/SC the same scaling properties
    as CAS.

    It's still far less convenient to actually use (particularly
    when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et
    alia
    instructions).

    And why implement both atomics and LL/SC in a new architecture?

    I think there is an argument for both, though I am not sure how valid
    it is.  LL/SC provides a very flexible "framework" for implementing
    whatever atomic operation seems right for a particular application,
    but the atomic operations are more efficient if you want to do exactly
    what they do.

    Think back to the time when the only atomic operation supported was
    essentially test and set, i.e. before CPUs had atomic fetch and add
    instructions.  If you wanted the functionality of atomic fetch and
    add, you would have had to do a TS instruction, followed by a load,
    then an add, then a store and finally a clear test and set - five
    instructions. Now think about the same thing if you had LL/SC.  It
    would be LL, add, SC - three instructions.  Of course, if you had
    the atomics, it would be one instruction.  Of course, a similar
    argument applies to a later generation of systems with respect to
    CAS/DCAS.

    So, having both gives you better efficiency for the cases where your
    requirements are met by the atomic instructions, but the LL/SC gives
    you better efficiency when they are not.

    Of course, YMMV, and whether it is worth the hardware and design cost
    of having both is a separate discussion.



    Cavium cnMIPS (OCTEONII/III) implement both.  The LL/SC has lower
    latency for the uncontested path.  The CAS hit the L2.  I'd guess the
    reasoning was the same, CAS wins for higher core counts and guaranteed
    progress?

    XADD is the better version of CAS, imho.

    But, XADD is NOT CAS. Very different things: one can fail, the other
    cannot. But you already know all of that.



    It allows actual progress for multiple contending threads, with just the minimum-possible delay for transferring ownership.

    If you can get an algo to use XADD, use that big time, good times. XADD
    can be used in wait-free loopless algos. CAS can too, but in different
    ways. Think of weak CAS (ala LL/SC like) vs strong CAS (ala x86 style).

    strong CAS can be used for interesting state machines.



    To improve on it, I think you need a distributed arbiter that can
    handle multiple incoming requests and send back the correct response
    to all of them, with zero actual memory transfers. I.e. all such
    semaphore memory addresses would actually end up inside the arbiter,
    so that it could respond as if it were just RAM but with far lower
    latency.

    The question is who makes that kind of memory controller?



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun May 10 20:19:47 2026
    From Newsgroup: comp.arch

    On 5/6/26 10:19 AM, Scott Lurndal wrote:
    [snip]
    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel) processor chips.

    Since idiom recognition allows a software LL/op/SC to be
    translated into an internal atomic operation, I do not see
    specific atomic instructions as "definitely a better choice".

    One could argue that code density, decode simplicity, and
    simpler detection of non-support (illegal instruction exception)
    give a significant advantage to atomic instructions, but I
    do not perceive atomic instructions as obviously better.

    Non-support of idiom recognition could be handled either by
    architectural requirement (for a version if not so defined
    initially) or software probing for the feature. That may be
    a sufficient solution for that issue.

    Code density may not be a huge issue if such atomic operations
    are infrequent. I am skeptical that providing small (16-
    bit) LL/SC instructions would be worthwhile (a 16-bit branch
    instruction might be worthwhile).

    It might be practical to provide a single load-and-store
    instruction that (in Mitch Alsup's term) casts a shadow on one
    operation instruction such that the store is delayed until after
    the operation. This might simplify decode a little (the idiom
    recognition check would be explicitly activated by the
    instruction encoding) and remove the need for a branch for retry
    (ESM does not require a branch for automatic retry). Such would
    also provide the same flexibility as LL/SC in terms of what the
    operation is. It may even be practical to specify the length of
    the "shadow" such that non-primitive operations could still be
    expressed as atomic within one store.

    (It might be useful to explicitly declare single _cache block_
    atomic operations. Such might be more easily ordered flexibly as
    the "transaction signature" is a single address. Hardware
    knowing this as soon as the first instruction is decoded might
    be simpler and able to act more quickly on the information.)

    Atomic operations also have the issue of which operations (and
    addressing modes and operands) are supported. Either interface
    might also need to be extended to provide additional
    performance-hinting or possibly even semantic information. As a
    new feature, atomic instructions might be better positioned to
    either provide space for such extensions or include them in the
    original design, though with fixed 32-bit instructions that
    might require a prefix instruction.

    Having both an optimistic concurrency mechanism (LL/SC, ESM,
    transactional memory) and atomic instructions requires defining
    how the two interact. For ESM, where a transaction can send a
    NAK to delay a conflicting access, an atomic instruction would
    have to be retryable even if the hardware would otherwise know
    that the operation would complete (or the otherwise NAKing node
    might have to handle the atomic operation providing the result
    to the node executing it).

    Idiom recognition is a powerful technique for translating
    a sequence of primitives into a solution, but sometimes an
    explicit approach is better. Yet I object to rejecting a
    sequence of primitives interface for semantic reasons when idiom
    recognition (and architectural guarantees) could provide the
    semantics. One could argue that the complexity budget could
    be better spent or that code density is important enough. (I do
    feel that RISC-V's emphasis on using idiom recognition is
    flawed.)

    Idiom recognition can also have the advantage of supporting
    legacy software that naturally uses the idiom. With LL/SC the
    expression of atomic operations is natural, so legacy software
    could use a new translate-to-atomic feature. (Legacy software
    might use simple atomic operations less often given the
    availability of larger "locked" code sequences.) If software
    translation was more widespread, this advantage could disappear.

    (Nearly universal software translation would provide other
    advantages — and, of course, the disadvantage of reducing lock-
    in to a particular architecture.)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon May 11 14:38:52 2026
    From Newsgroup: comp.arch

    Paul Clayton <paaronclayton@gmail.com> writes:
    On 5/6/26 10:19 AM, Scott Lurndal wrote:
    [snip]
    CXL seems to be the wave of the future, and supports the standard PCI
    Express atomic operations, so atomic CPU instructions are definitely a
    better choice than LDEX/STREX on ARM-based (and for that matter Intel)
    processor chips.

    Since idiom recognition allows a software LL/op/SC to be
    translated into an internal atomic operation, I do not see
    specific atomic instructions as "definitely a better choice".

    Please elaborate. There are few restrictions on the instructions
    that lie between the LL and SC instructions - I don't see how
    any CPU could translate an arbitrary sequence of instructions
    between the LL and SC into an atomic bus operation efficiently.


    One could argue that code density, decode simplicity, and
    simpler detection of non-support (illegal instruction exception)
    give a significant advantage to atomic instructions, but I
    do not perceive atomic instructions as obviously better.


    Non-support of idiom recognition could be handled either by
    architectural requirement (for a version if not so defined
    initially) or software probing for the feature. That may be
    a sufficient solution for that issue.

    Atomic instructions are already available for all existing
    mainstream CPUs, so your idea only applies to new CPU designs.

    IMO, LL/SC is an obsolete artifact of the past.
    --- Synchronet 3.22a-Linux NewsLink 1.2