Of the possible issues LL/SC might have, did ARM mention the specific
reason they added CAS to the architecture?
jseigh <jseigh_es00@xemaps.com> writes:
>Of the possible issues LL/SC might have, did ARM mention the specific
>reason they added CAS to the architecture?

Scalability. Moving the contention detection to the cache is much more
bandwidth efficient than swapping cache lines between a hundred cores.
scott@slp53.sl.home (Scott Lurndal) posted:
>Scalability. Moving the contention detection to the cache is much more
>bandwidth efficient than swapping cache lines between a hundred cores.
CASs touch the modifiable data fields without write permission,
allowing other cores to touch that data, too. Then, whoever gets to
CAS first {and then gets their CAS addresses to LLC/DRC first} wins.
But you still have the property that only one CAS {in a conflicting
group} succeeds.

I think this second point depends on your cache coherence protocol.

With this in mind, the My 66000 CCP has the ability to request write
permission on a cache line request, but the other end of the
transaction can refuse to send write permission. So, LL requests
write permission, but the 'system' can send the line read-only.
A core can refuse to pass write permission while it has performed
one or more LLs without yet having reached the SC.

Given that, one can give LL/SC the same scaling properties as CAS.
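
As a toy illustration of that refusal policy (a sketch in C with
invented names; line_state and grant_write are illustrative, not any
real CCP interface):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical model: a line whose holder may refuse to pass
       write permission while its LL/SC sequence is still in flight. */
    struct line_state {
        int  owner_core;   /* core currently holding the line */
        bool ll_pending;   /* owner did an LL, has not reached its SC */
    };

    /* Another core requests the line with write permission; the
       holder may downgrade the grant to read-only, preserving its
       exclusive path to a successful SC. */
    static bool grant_write(struct line_state *line, int requester)
    {
        if (line->ll_pending && requester != line->owner_core)
            return false;              /* line is sent read-only */
        line->owner_core = requester;
        return true;                   /* line is sent writable */
    }

    int main(void)
    {
        struct line_state l = { .owner_core = 0, .ll_pending = true };
        printf("core 1 granted write: %s\n",
               grant_write(&l, 1) ? "yes" : "no (read-only)");
        return 0;
    }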
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
>Given that, one can give LL/SC the same scaling properties as CAS.
It's still far less convenient to actually use (particularly
when CAS is paired with atomic fetch-and-add, bit-set, bit-clear, et alia instructions).
And why implement both atomics and LL/SC in a new architecture?
On 5/5/2026 7:03 PM, Scott Lurndal wrote:
>It's still far less convenient to actually use (particularly when CAS
>is paired with atomic fetch-and-add, bit-set, bit-clear, et alia
>instructions).
>
>And why implement both atomics and LL/SC in a new architecture?
I think there is an argument for both, though I am not sure how valid
it is. LL/SC provides a very flexible "framework" for implementing
whatever atomic operation seems right for a particular application,
but the dedicated atomic operations are more efficient if you want to
do exactly what they do.

Think back to the time when the only atomic operation supported was
essentially test-and-set, i.e. before CPUs had atomic fetch-and-add
instructions. If you wanted the functionality of atomic fetch-and-add,
you would have had to do a TS instruction, followed by a load, then an
add, then a store, and finally a clear of the test-and-set flag: five
instructions. Now think about the same thing if you had LL/SC. It
would be LL, add, SC: three instructions. Of course, if you had the
atomics, it would be one instruction. A similar argument applies to a
later generation of systems with respect to CAS/DCAS.
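
A rough C11 sketch of the three generations (the TS-era sequence is
modeled here with a spinlock flag; on an LL/SC machine the
compare-exchange loop is what a compiler lowers to LL/add/SC):

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;
    static _Atomic int counter;

    /* TS era: test-and-set, load, add, store, clear -- five steps. */
    int fetch_add_ts(int v) {
        while (atomic_flag_test_and_set(&lock)) ;  /* TS */
        int old = counter;                         /* load */
        counter = old + v;                         /* add + store */
        atomic_flag_clear(&lock);                  /* clear */
        return old;
    }

    /* LL/SC era: on such machines this compiles to LL, add, SC. */
    int fetch_add_llsc(int v) {
        int old = atomic_load(&counter);
        while (!atomic_compare_exchange_weak(&counter, &old, old + v)) ;
        return old;
    }

    /* Atomic era: one instruction (XADD, LDADD, and the like). */
    int fetch_add_atomic(int v) {
        return atomic_fetch_add(&counter, v);
    }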
So, having both gives you better efficiency for the cases where your
requirements are met by the atomic instructions, but LL/SC gives you
better efficiency when they are not.

Of course, YMMV, and whether it is worth the hardware and design cost
of having both is a separate discussion.
On 5/5/26 21:38, Stephen Fuld wrote:
>So, having both gives you better efficiency for the cases where your
>requirements are met by the atomic instructions, but LL/SC gives you
>better efficiency when they are not.
Cavium cnMIPS (OCTEON II/III) implement both. The LL/SC has lower
latency for the uncontested path. The CAS hits the L2. I'd guess the
reasoning was the same: CAS wins for higher core counts and guaranteed
progress?
Kevin Bowling wrote:
>Cavium cnMIPS (OCTEON II/III) implement both. The LL/SC has lower
>latency for the uncontested path. The CAS hits the L2. I'd guess the
>reasoning was the same: CAS wins for higher core counts and guaranteed
>progress?
XADD is the better version of CAS, imho. It allows actual progress
for multiple contending threads, with just the minimum-possible delay
for transferring ownership.

To improve on it, I think you need a distributed arbiter that can
handle multiple incoming requests and send back the correct response
to all of them, with zero actual memory transfers. I.e. all such
semaphore memory addresses would actually end up inside the arbiter,
so that it could respond as if it were just RAM but with far lower
latency.

The question is who makes that kind of memory controller?
Terje
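
As a concrete instance of that progress property, a ticket lock is
the standard trick: one XADD hands every contender a unique ticket,
so arbitration is by arrival order rather than CAS-retry luck (a
minimal C11 sketch):

    #include <stdatomic.h>

    struct ticket_lock {
        _Atomic unsigned next;     /* next ticket to hand out */
        _Atomic unsigned serving;  /* ticket currently being served */
    };

    static void tl_lock(struct ticket_lock *l)
    {
        /* one XADD; every caller gets a distinct ticket */
        unsigned me = atomic_fetch_add(&l->next, 1);
        while (atomic_load(&l->serving) != me)
            ;  /* spin until our turn */
    }

    static void tl_unlock(struct ticket_lock *l)
    {
        atomic_fetch_add(&l->serving, 1);
    }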
Terje Mathisen <terje.mathisen@tmsw.no> posted:
>XADD is the better version of CAS, imho. It allows actual progress
>for multiple contending threads, with just the minimum-possible delay
>for transferring ownership.
Under heavy contention, yes; under light contention, no. XADD has
DRAM+coherence latency at minimum.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
>Under heavy contention, yes; under light contention, no. XADD has
>DRAM+coherence latency at minimum.
Surely the XADD can be handled by the L2 or L3 (PoC, Point of
Coherency) that owns the line. Only when caching is disabled for a
particular address will the XADD need to be done at the DRAM
controller.
Kevin Bowling <kevin.bowling@kev009.com> writes:
>Cavium cnMIPS (OCTEON II/III) implement both. The LL/SC has lower
>latency for the uncontested path. The CAS hits the L2. I'd guess the
>reasoning was the same: CAS wins for higher core counts and guaranteed
>progress?
The cnMIPS-based CN7800 supported up to 48 cores, as did the ARMv8
CN8800. Both supported cache coherency across multiple sockets (up to
4 for the CN7800). The CN8800 implemented the new ARMv8.1 Large System
Extensions (i.e. the atomic instructions) from the start, as it was
realized that the load/store-exclusive paradigm in ARMv8.0 was a
performance limitation with multisocket CN8800 processors.

Subsequent ARMv8/9 processors in the Octeon family have only supported
single-socket implementations, with large on-chip core counts.

CXL seems to be the wave of the future, and supports the standard PCI
Express atomic operations, so atomic CPU instructions are definitely a
better choice than LDREX/STREX on ARM-based (and for that matter
Intel) processor chips.
On 5/6/26 10:19, Scott Lurndal wrote:
>CXL seems to be the wave of the future, and supports the standard PCI
>Express atomic operations, so atomic CPU instructions are definitely a
>better choice than LDREX/STREX on ARM-based (and for that matter
>Intel) processor chips.
I was messing about with lock-free queue implementations and realized
you can do a Michael-Scott lock-free queue implementation using LL/SC,
which is an advantage since you don't need the deferred reclamation
that CAS implementations require, and which can slow things down.
Plus, if reclamation stops because of a stalled thread, is your queue
still lock-free?
On 5/6/2026 3:30 PM, jseigh wrote:
>you can do a Michael-Scott lock-free queue implementation using LL/SC,
>which is an advantage since you don't need the deferred reclamation
>that CAS implementations require
How? Say using it raw with dynamic nodes. The user deletes a node. How
does that work without hitting deleted memory? I know that Microsoft
uses SEH in its SLIST impl.
On 5/7/26 00:50, Chris M. Thomasson wrote:
>How? Say using it raw with dynamic nodes. The user deletes a node. How
>does that work without hitting deleted memory?
Dequeuing from head is straightforward. I don't think I need to
describe that.

Enqueuing uses the old Double Tap trick:

1) Load-locked the tail node's next pointer.
2) If it is null and tail still (double tap) points to that node,
   store-conditional the new node's address into the next pointer.

Tail is updated by doing a load-locked on it, loading the next pointer
from the tail node, and updating tail.

It's OK to access deleted memory as long as you don't actually use it.
The old lock-free stack using DCAS did that when unsuccessfully
popping the stack. It may have read an invalid value, but the DCAS
will fail.
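
A minimal C11 sketch of that enqueue path (compare-exchange stands in
for the SC; on LL/SC hardware that is what the compiler emits, and the
node/queue layout here is invented for illustration):

    #include <stdatomic.h>
    #include <stddef.h>

    struct node  { _Atomic(struct node *) next; /* payload omitted */ };
    struct queue { _Atomic(struct node *) head, tail; };

    void enqueue(struct queue *q, struct node *n)
    {
        struct node *t;
        atomic_store(&n->next, NULL);
        for (;;) {
            t = atomic_load(&q->tail);
            struct node *next = atomic_load(&t->next);   /* "LL" */
            if (t != atomic_load(&q->tail))              /* double tap */
                continue;
            if (next == NULL) {
                struct node *expect = NULL;
                /* "SC": link the new node after the current tail */
                if (atomic_compare_exchange_weak(&t->next, &expect, n))
                    break;
            } else {
                /* tail is lagging; help swing it forward */
                atomic_compare_exchange_weak(&q->tail, &t, next);
            }
        }
        /* swing tail to the new node; fails harmlessly if helped */
        atomic_compare_exchange_strong(&q->tail, &t, n);
    }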
On 5/7/2026 2:34 PM, jseigh wrote:
>It's OK to access deleted memory as long as you don't actually use it.
>The old lock-free stack using DCAS did that when unsuccessfully
>popping the stack. It may have read an invalid value, but the DCAS
>will fail.
You mean DWCAS? The Michael-Scott lock-free queue did not need DCAS;
I try to separate those in my mind. The only way the Microsoft SLIST
can get away with accessing deleted memory is SEH. What about
accessing the next node from deleted memory? LL/SC does not need ABA
counters, last time I checked, but it can have an issue with deleted
memory? What am I missing here, Joe? I guess it depends on what the
underlying allocator actually does with the deleted node...
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:
On 5/7/2026 2:34 PM, jseigh wrote:
On 5/7/26 00:50, Chris M. Thomasson wrote:
On 5/6/2026 3:30 PM, jseigh wrote:
I was messing about with lock-free queue implementations and realized >>>>> you can do a Michael-Scott lock-free queue implementation using LL/SC, >>>>> which is an advantage since you don't need deferred reclamation that >>>>> CAS implementations require and which can slow things down.
How? Say using it raw with dynamic nodes. The user deletes a node. How >>>> does that work without hitting deleted memory? I know that Microsoft
uses SEH in its SLIST impl.
Dequeuing from head is straightforward. I don't think I need to
describe that.
Enqueuing uses the old Double Tap trick.
1) load locked from tail node next pointer.
2) if null and tail still (double tap) points to node,
store conditional new node address into the next pointer.
Updating tail by doing a load locked on it,
loading the next pointer from tail node, and
updating tail.
It's ok to access deleted memory and long as you don't
actually use it. The old lock-free stack using
DCAS did that when unsuccessfully popping the stack.
It may have had an invalid value but the DCAS will fail.
You mean DWCAS? For the Michael-Scott lock-free queue; it did not need
DCAS. I try to separate those in my mind. The only way the Microsoft
SLIST can get away with accessing deleted memory is SEH. What about
accessing the next node from deleted memory? LL/SC does not need ABA
counters, last time I checked, but it can have an issue with deleted
memory? What am I missing here Joe? I guess it depends on what the
underlying allocator actually does with the deleted node...
a) I can't seem to find a definition of an SEH instruction.
b) LL/SC is not subject to ABA when SC fails on any control transfer
   out of the thread {between LL and SC}; this is one primary reason
   that LL/SC can fail spuriously.
c) If deleted memory has been freed, all bets are off.
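
For reference, the classic CAS-side fix Chris alludes to is a counted
pointer, so a recycled node no longer compares equal (a sketch;
whether the double-width compare-exchange is actually lock-free
depends on the target):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    struct node { struct node *next; };

    /* pointer + generation count, compared as one unit (DWCAS) */
    struct counted_ptr {
        struct node *ptr;
        uintptr_t    count;
    };

    static _Atomic struct counted_ptr top;   /* stack head */

    struct node *pop(void)
    {
        struct counted_ptr old = atomic_load(&top);
        for (;;) {
            if (old.ptr == NULL)
                return NULL;
            /* reading old.ptr->next may touch a node another thread
               already popped -- exactly the hazard under discussion */
            struct counted_ptr nw = { old.ptr->next, old.count + 1 };
            /* ABA: a matching ptr with a stale count still fails */
            if (atomic_compare_exchange_weak(&top, &old, nw))
                return old.ptr;
        }
    }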
On 5/6/26 10:19 AM, Scott Lurndal wrote:
[snip]
>CXL seems to be the wave of the future, and supports the standard PCI
>Express atomic operations, so atomic CPU instructions are definitely a
>better choice than LDREX/STREX on ARM-based (and for that matter
>Intel) processor chips.
Since idiom recognition allows a software LL/op/SC sequence to be
translated into an internal atomic operation, I do not see specific
atomic instructions as "definitely a better choice". One could argue
that code density, decode simplicity, and simpler detection of
non-support (illegal instruction exception) give a significant
advantage to atomic instructions, but I do not perceive atomic
instructions as obviously better.

Non-support of idiom recognition could be handled either by
architectural requirement (for a later version, if not so defined
initially) or by software probing for the feature. That may be a
sufficient solution for that issue.
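
For example (assuming ARMv8 codegen; the assembly is approximate),
this one source-level operation has two lowerings, and the question is
only whether the hardware recognizes the loop or the ISA names the
operation directly:

    #include <stdatomic.h>

    /* One source-level operation, two lowerings (roughly):

       ARMv8.0 LL/SC loop:          ARMv8.1 LSE atomic:
         retry:                        ldadd w1, w0, [x0]
           ldxr  w1, [x0]
           add   w2, w1, #1
           stxr  w3, w2, [x0]
           cbnz  w3, retry

       Idiom recognition would let hardware treat the left-hand
       loop as the single atomic on the right. */
    int inc(_Atomic int *p)
    {
        return atomic_fetch_add(p, 1);
    }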