From Newsgroup: comp.arch
On 5/13/26 9:36 PM, MitchAlsup wrote:
Paul Clayton <paaronclayton@gmail.com> posted:
[big snip]
I suspect that HW TM will never take hold of the CPU industry.
I suspect you are correct. I still think an optimistic
concurrency mechanism with at least a very large read set could
be useful.
[snip]
(The issue I have with limited optimistic concurrency mechanisms
like AMD's Advanced Synchronization Facility and My 66000's
Exotic Synchronization Mechanism is not the initial limits but
that there seems to be little presentation of an interface that
can be extended.
For example:: what ??
Increasing the size of the transaction would not (I think)
require any new instructions but it would require documentation.
The Principles of Operation version that I have seems to imply
that Architecturally ESM transactions are limited to six 64-byte
chunks rather than that such is a minimum guaranteed by the
Architecture (like ASF does).
I imagined that there might be some use for providing a scope
identifier for transactions that is denser than a list of
addresses. This might be used to facilitate ordering
optimization and perhaps even false conflict avoidance (e.g., a
large read set might be guarded by a conservative filter based
on used addresses and by the transaction scope identifier or
false sharing within a cache block might be ignored). Obviously,
this would be dangerous, comparable to not acquiring a necessary
lock, if software got the scope naming wrong.
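As a rough illustration of the filter-plus-scope idea above, here
is a minimal Python sketch of a conservative (Bloom-style)
read-set signature. Everything in it is an assumption made for
illustration: the names ReadSetFilter, record_read and
may_conflict, the 64-byte block size, the 256-bit signature, the
two hash functions, and the rule that a write in a different
scope is ignored. It is not how ASF or ESM works.

```python
BLOCK_BYTES = 64      # assumed cache block size
FILTER_BITS = 256     # assumed signature size (hypothetical)


def _hashes(block):
    """Two cheap, deterministic hash functions over a block number."""
    h1 = (block * 0x9E3779B1) & 0xFFFFFFFF
    h2 = (block * 0x85EBCA6B + 0x27D4EB2F) & 0xFFFFFFFF
    return ((h1 >> 8) % FILTER_BITS, (h2 >> 8) % FILTER_BITS)


class ReadSetFilter:
    """Conservative summary of the blocks a transaction has read."""

    def __init__(self, scope_id):
        self.scope_id = scope_id   # software-chosen scope name
        self.bits = 0              # signature held as a bit vector

    def record_read(self, addr):
        for i in _hashes(addr // BLOCK_BYTES):
            self.bits |= 1 << i

    def may_conflict(self, write_addr, writer_scope):
        # Assumed rule: a write from another scope is ignored. This is
        # exactly the dangerous part if software names scopes wrongly.
        if writer_scope != self.scope_id:
            return False
        # False positives are possible; false negatives are not.
        return all((self.bits >> i) & 1
                   for i in _hashes(write_addr // BLOCK_BYTES))
```

A write to any recorded block in the same scope is always
reported, so the filter can cause a spurious abort but cannot
hide a real conflict on its own. Only the scope shortcut can
miss one.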
Exporting atomic operations also seems not to be discussed.
While such is technically not Architectural, there are other
quality of implementation matters that seem important for My
66000. Knowing that certain idioms will always be recognized
and optimized to at least a defined degree (which can vary
across time) is important.
Quoting Scott Lurndal
(Message-ID: <qQJPR.1144299$U733.266313@fx16.iad>):
| Functionality guarantees, yes. Performance has to suffer,
| unless the hardware can analyze all the instructions between
| the LL/SC and abstract them into a single bus operation; which
| I don't see as feasible.
(The virtual vector method has similar issues of the performance
profile not being clear. Documenting that a vec loop will never
perform more than 2% worse than any semantically equivalent code
may not be sufficient if a different algorithm would perform
50% better on the same hardware and give an acceptable result.
This is not really Architectural, but benchmarking every
implementation seems unattractive. Even Intel's early frequency
throttling for AVX-512 made performance choices for software
developers more difficult than ideal. I know this is not a
trivial issue for hardware developers and computer architects,
but I think it deserves more attention than it seems to get.)
If my notion of exporting part of a transaction makes sense
(like atomically incrementing a counter if a complex transaction
succeeds), how this should be expressed would need to be
documented (the same way zeroing and nop idioms are defined not
because of semantics but because of performance).
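To make the "exported" increment concrete, here is a minimal
Python sketch of a toy software model, not of ESM itself. The
names ToyMemory, ToyTransaction and export_add are hypothetical,
and the per-address version check merely stands in for hardware
interference detection. The point is only that the counter is
updated at commit time without ever joining the read set.

```python
class ToyMemory:
    """Word-addressed memory with a version number per address."""

    def __init__(self):
        self.words = {}
        self.versions = {}

    def load(self, addr):
        return self.words.get(addr, 0)

    def version(self, addr):
        return self.versions.get(addr, 0)

    def store(self, addr, value):
        self.words[addr] = value
        self.versions[addr] = self.version(addr) + 1


class ToyTransaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_versions = {}   # addr -> version seen at first read
        self.writes = {}          # buffered stores
        self.exported_adds = []   # (addr, delta) applied only on success

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        self.read_versions.setdefault(addr, self.mem.version(addr))
        return self.mem.load(addr)

    def write(self, addr, value):
        self.writes[addr] = value

    def export_add(self, addr, delta):
        # The target is never read inside the transaction, so it
        # cannot cause a conflict with another transaction.
        self.exported_adds.append((addr, delta))

    def commit(self):
        # Fail if anything in the read set changed since it was read.
        for addr, seen in self.read_versions.items():
            if self.mem.version(addr) != seen:
                return False
        for addr, value in self.writes.items():
            self.mem.store(addr, value)
        for addr, delta in self.exported_adds:
            self.mem.store(addr, self.mem.load(addr) + delta)
        return True
```

In this toy, two transactions that touch disjoint data and both
call export_add on the same counter can both commit, whereas a
transaction that fails leaves the counter untouched.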
Since copying of a page can be atomic (at least for I/O?), it
seems to me that this could be integrated with ESM.
Other possible future developments may not belong in an
Architecture defining document, but having some documentation
of developmental intent and mere possibilities seems
potentially beneficial.
There may also be interactions between ESM and thread scheduling
and power management. (I do not have even an early version of
the system documentation for power management and such.) Just as
Intel introduced the PAUSE instruction in part to save power
when waiting for a lock to be released, ESM might interact with
thread priority (which I think you mentioned before) and power
use.
There is also a similarity between ESM and x86's MONITOR/MWAIT.
(The former monitors interference hoping for none so an
operation can be atomic; the latter monitors interference hoping
for some so that another chunk of work can be started.) I do not
think My 66000 defined anything like MWAIT.
I would guess that others would have some ideas for ways that
ESM might be extended.
[snip]
Of course, just as early broad software
abstractions present the risk of choosing the wrong abstraction
from lack of experience, having too many exceptional cases, and
delaying release, an ISA can be designed with excessive
flexibility that is not exploited much later and has immediate
costs.)
That is the problem when you have only been working on it for 22 years----------------alone---------------without feedback
Sigh. Sometimes I wish I had more computer architecture
expertise. Even if I could not help develop a better
synchronization/communication interface or mechanism, I might at
least contribute to the state of the art in some way.
[snip]
It never ceased to amaze me that Solaris would not boot without a
real TLB in the simulator. Just referencing all the right memory
where the tables were stored (using the CR holding said pointer)
was not enough--you had to have a TLB with at least 5 FA entries.
I wish all the experience of people like you was gathered
together for future generations. Yes, the Computer History
Museum has a lot of oral histories, and a.f.c. and other parts
of USENET are probably archived, but it seems a lot of lore
is lost.
[snip]
Mitch considers TM to be a SW problem and My 66000 ISA supports SW
by allowing multiple lines to participate in a TM transaction,
without over constraining how SW gets its job done, and with enough
HW defined behavior that SW can make a robust system with it. Other
than that TM is a SW problem.
I agree that general transactional memory is a software problem,
but I think a lot of aspects can be assisted by hardware. E.g., a
conservative filter of read addresses is harder to do in
software. Read-Copy-Update methods (which seem to present a
limited form of versioned memory) may also be amenable to
hardware assistance of some kind.
Cliff Click's "IWannaBit!" (2008) opens with:
| Just One Lousy Bit! I want to know if any memory operation
| misses or any line in my L1 cache gets evicted. Why? Because
| with this one Bit I can write any number of lock-free
| algorithms easily. This Bit gives me an N-word atomic read
| set, and with a typical Store Conditional instruction a 1-word
| atomic write set. The algorithm writing community has begged
| for D-CAS or Hardware Transactional Memory for years, but
| proposals far out-strip implementations: neither are available
| on any commodity system. With this Bit I hope to lower the
| hardware costs as low as possible while still being useful.
That proposal was in my opinion too small in that it failed a
transaction on any cache miss (so the cache had to be warmed up
before a transaction could succeed). At minimum the cache block
of the starting instruction could be a non-failing cache miss,
allowing fast single-block atomics. Yet it is more powerful
than ESM in one very limited way: the capacity of the read set
can be much larger.
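The behavior described above can be sketched with a toy Python
model of the single sticky bit. This is only an assumed reading
of the proposal: ToyL1, try_atomic_sum, the fully associative
LRU cache and the treatment of a remote invalidation as an
eviction are all hypothetical choices made for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 64   # assumed cache line size


class ToyL1:
    """Fully associative LRU cache with one sticky bit."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()   # line number -> None, in LRU order
        self.bit = False             # set on any miss or eviction
        self.memory = {}

    def _touch(self, addr):
        line = addr // LINE_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        self.bit = True                      # any miss sets the bit
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict least recently used
        self.lines[line] = None

    def clear_bit(self):
        self.bit = False

    def load(self, addr):
        self._touch(addr)
        return self.memory.get(addr, 0)

    def store_conditional(self, addr, value):
        self._touch(addr)
        if self.bit:
            return False     # something missed or was evicted: fail
        self.memory[addr] = value
        return True

    def remote_invalidate(self, addr):
        # Assumed: another agent's write removes the line from this L1.
        line = addr // LINE_BYTES
        if line in self.lines:
            del self.lines[line]
            self.bit = True


def try_atomic_sum(cache, src_addrs, dst_addr):
    """N-word atomic read set, one-word conditional write."""
    cache.clear_bit()
    total = sum(cache.load(a) for a in src_addrs)
    return cache.store_conditional(dst_addr, total)
```

In this model the first attempt on a cold cache always fails
because of the misses, and a retry succeeds once the lines are
resident, which is the warm-up cost noted above. The read set is
bounded only by what the cache can hold at once.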
As a side note, Cliff Click worked at Azul Systems, which
sold a Java-targeted processor that supported transactional
memory. A lot of software work was required to take software
counters out of atomic sections because such produced
interference from the counter being shared, but a lot of
software was not changed and that hurt the performance of
transactional memory. With locks, atomic performance counters
are nearly free; with transactional memory, this design choice
was sub-optimal.
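The counter problem can be shown with a very small Python
sketch. The conflicts helper, the block addresses, and the
read-set/write-set representation are assumptions made for
illustration. They do not describe Azul's hardware.

```python
def conflicts(tx_a, tx_b):
    """Each tx is (read_set, write_set) of block addresses.

    Two transactions conflict if either one writes a block that the
    other reads or writes.
    """
    reads_a, writes_a = tx_a
    reads_b, writes_b = tx_b
    return bool(writes_a & (reads_b | writes_b)
                or writes_b & (reads_a | writes_a))


COUNTER = 0x1000   # hypothetical shared statistics counter

# Counter updated inside the atomic section: it joins both sets.
with_counter_1 = ({0x10, COUNTER}, {0x10, COUNTER})
with_counter_2 = ({0x20, COUNTER}, {0x20, COUNTER})

# Counter moved out of the atomic section: the data are disjoint.
without_counter_1 = ({0x10}, {0x10})
without_counter_2 = ({0x20}, {0x20})

counter_inside_conflicts = conflicts(with_counter_1, with_counter_2)
counter_outside_conflicts = conflicts(without_counter_1, without_counter_2)
```

With the counter inside, two otherwise independent transactions
always conflict. Under a lock the same update is already
serialized by the lock, so it adds almost nothing.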