Forum: War Ensemble BBS

VLIW Architecture of Google TPUs

From John Levine@johnl@taugh.com to comp.arch on Wed Jun 17 02:07:46 2026

From Newsgroup: comp.arch

Here's a preprint of an IEEE Micro article:

Google's Training Supercomputers from TPU v2 to Ironwood:
Architectural Stability, Scale, Resilience, Power Efficiency,
and Sustainability Across Five Generations

They call the architecture VLIW which I don't think is quite
right -- they do indeed have wide instruction words but I don't
think they do speculative execution.

https://arxiv.org/abs/2606.15870
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 17 15:12:44 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

They call the architecture VLIW which I don't think is quite
right -- they do indeed have wide instruction words but I don't
think they do speculative execution.

Speculative execution in a VLIW (and EPIC) architecture with in-order implementation (the usual case) depends on the compiler reordering the instructions. Some architectures have architectural support for
speculative execution, e.g., the speculative load of IA-64, but these architectures tend to go by the label EPIC rather than VLIW. IIRC
classic VLIWs like the Cydrome Cydra 5 or the Multiflow machines do
not have architectural support for speculation.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jun 17 16:05:55 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

Here's a preprint of an IEEE Micro article:

Google's Training Supercomputers from TPU v2 to Ironwood:
Architectural Stability, Scale, Resilience, Power Efficiency,
and Sustainability Across Five Generations

They call the architecture VLIW which I don't think is quite
right -- they do indeed have wide instruction words but I don't
think they do speculative execution.

It is more like clocked (or lock-step) data flow architectures,
without the natural asynchronicity in data flow.

From BGB@cr88192@gmail.com to comp.arch on Wed Jun 17 15:22:35 2026

From Newsgroup: comp.arch

On 6/17/2026 11:05 AM, Scott Lurndal wrote:

John Levine <johnl@taugh.com> writes:

Here's a preprint of an IEEE Micro article:

Google's Training Supercomputers from TPU v2 to Ironwood:
Architectural Stability, Scale, Resilience, Power Efficiency,
and Sustainability Across Five Generations

They call the architecture VLIW which I don't think is quite
right -- they do indeed have wide instruction words but I don't
think they do speculative execution.

It is more like clocked (or lock-step) data flow architectures,
without the natural asynchronicity in data flow.

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

If limited to strictly one operation per cycle (or less), there could potentially be another option, like, say:
Mem/Mem/Mem
Or, basically:
Load, Load, Op, Store
Within a single instruction.
Per-cycle, with any additional logic mostly for addressing, but could be reduced by having more addressing modes (with double-indirect and double-indirect auto-increment, could potentially eliminate the need for
CPU registers).

But, say, would fall on its face as soon as one allows superscalar, and
would be nearly impossible to scale. Even getting 1 IPC could end up
being a feat of engineering.

Any advantage would also be lost in the face of memory-access latency.

But, once the possibility of superscalar is in play, the balance would
shift highly towards register machines.

Then the tradeoff becomes more one of "register width" vs "machine width".

Traditional SIMD: Favors wider registers to reduce instruction counts,
but the "ever wider SIMD" inclination seems to follow the implicit
assumption that one can't do superscalar on the SIMD ops.

Like, the advantage of 256 or 512 bit SIMD would mostly evaporate if you assume that 2 or 4 128-bit SIMD operations could potentially be done in
a single clock-cycle (and, if the wider SIMD doesn't match the native
data width well, it could actually become a liability).

Though, one could argue about the relative cost of register ports, say,
for example, if each register port remains the same maximum width, then pushing larger SIMD vectors would necessarily require more register
ports which may scale more steeply than increasing register-port width.

Well, and/or one ends up with a CPU where the port-count (and max ILP)
remains fixed, so then using narrower operations comes at a cost of only
using part of the registers (and there may be penalties if multiple instructions use virtual registers that correspond to the same machine register).

Say, for example:
Machine is 6R3W with native 128-bit ports, which are divided in half for 64-bit ops.
So, say:
ADD R10, R23, R8
ADD R11, R29, R9
Can't co-execute because, at the HW register level, both are accessing
the same logical registers (well, unless additional logic existed to
subdivide access to these 128-bit ports).

For reads, an additional bit of the register specifier would select which
half to use; for writes, a 2-bit field would select which part to update.

Say:
00: Update the low 64-bits (64b op);
01: Update the high 64-bits (64b op);
10: Update both (128b op);
11: -

Or, could instead do "multi-ganging" in the decoder:
ADD R12, R24, R16
ADD R13, R25, R17
ADD R14, R26, R18
ADD R15, R27, R19
Being special-case recognized as all mapping the same operation onto the
same machine registers (and thus behaving as a single-cycle 256-bit SIMD op to the CPU).

Though, can note that a similar trick has been considered for implied 128-bit SIMD ops in my ISA, but the relative merit is lower when it
already has explicit 128-bit SIMD ops that would cover mostly the same
cases (except for 64b integer SIMD, which could still use such pairing
in this scheme; and even when unpaired is still typically the most
efficient way to use conventional superscalar).

In the 2-wide case, this mostly means detecting cases where both ops
map to appropriate FUs, and where the registers follow the correct
patterns (high-bits equal but low-bits differ, and no output-input dependencies on the same physical machine register).
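
A minimal sketch of that 2-wide check, in Python rather than decoder logic. The `Op` tuple, the helper names, and the assumption that the two 64-bit halves of a 128-bit machine register are the even/odd register numbers sharing all upper bits are illustrative only, not taken from any actual implementation:

```python
from typing import NamedTuple

class Op(NamedTuple):
    mnemonic: str
    rd: int  # destination register number
    rs: int  # first source register number
    rt: int  # second source register number

def is_pair(lo: int, hi: int) -> bool:
    """True if lo/hi are the even/odd halves of one 128-bit machine register
    (assumed layout: high bits equal, low bit 0 then 1)."""
    return (lo >> 1) == (hi >> 1) and (lo & 1) == 0 and (hi & 1) == 1

def can_gang(a: Op, b: Op) -> bool:
    """Could two adjacent 64-bit ops be issued as a single 128-bit op?"""
    if a.mnemonic != b.mnemonic:
        return False
    # Every operand slot (rd, rs, rt) must form an even/odd pair.
    if not all(is_pair(x, y) for x, y in zip(a[1:], b[1:])):
        return False
    # The second op must not read the first op's result. Under strict
    # even/odd pairing this cannot trigger; it is kept to mirror the
    # stated "no output-input dependencies" condition.
    return a.rd not in (b.rs, b.rt)

print(can_gang(Op("ADD", 12, 24, 16), Op("ADD", 13, 25, 17)))
print(can_gang(Op("ADD", 10, 23, 8), Op("ADD", 11, 29, 9)))
```

Under these assumptions the R12/R13 pair from the ganging example passes, while the earlier R10/R11 example fails because R23 and R29 do not share a machine register.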

But, a 6R3W 128-bit machine could do 3R1W for 256-bit (or 4R2W could do
2R1W ops); or an 8R4W machine could potentially do native 512-bit vector
ops.

Though, wider SIMD registers do also avoid the drawback that every
time the effective register width is increased by pairing, it reduces
the number of available registers at the larger width (so, if you wanted
32x 256-bit registers, would need 128x 64-bit; or end up with a skewed
space where part of the register space that exists at wider widths is
not as easily accessible at the narrower widths).

Though, in practice, this doesn't seem like a huge loss (since it
doesn't actually necessarily increase the number of physical registers
or register ports, just means that patterns which don't fit the HW
resources come at an ILP cost).

Either way, the relative cost at a 64b/128b scale is small (and 256b
seems to be diminishing returns relative to 128b).

But, alas, my approach does still allow:
PMULX.F R40, R48, R52
PMULX.F R42, R50, R24
To potentially behave as a 256-bit vector if the hardware allows it.

...

Dunno how well any of this would map to OoO.
But, if one is doing OoO, they can presumably still afford the logic to
detect "the next two or four instructions represent a single SIMD
operation at a larger width"; which could likely be handled during IF
(so, before it hits the ROB and so-on).

But, not sure why this approach isn't more popular (vs, say, "ever wider
SIMD registers" or vector ISAs).

But, yeah, can note that recently I did get around to getting ALUX
support re-implemented for XG3, but ended up adding a new set of 128-bit compare ops (to match the more RV-like patterns that XG3 uses).

In this case, ended up using 64-bit encodings for the 128-bit CMPxx ops
as they aren't used often enough to justify burning 32-bit encoding
space on them (but are still common enough in live execution to make it undesirable to fake them via multi-op sequences).

Doing 128-bit branch-compare still requires at least 2 ops though, as
128b branch-compare is very rare and would be hard-pressed to justify
the added cost (of encoding or allowing 128-bit inputs to the Bcc ops).

Or, say:
CMPEQ.X R10, R12, R5
BNEZ R5, label
Rather than, say:
BEQ.X R10, R12, label

Though, debatable, could consider allowing BEQ.X and similar as pseudo-instructions (likely decaying as above).

Or, I re-allow the XG2 ops if PRED is enabled, so it turns into,
essentially:
CMPEQ.X R10, R12
BRA?T label

Which has the merit of larger branch displacement and more compact encoding.

Where "BRA?T label" is equivalent to "BT label", but more explicit about
the encoding.

Well, or one could use CMPXEQ, but this is mostly just a notation/ASM-level difference. Decided to start treating X as a type-suffix, rather than
some of the infix-type notation sometimes used. But, this was itself a side-effect of dropping the "/" used in SH op naming patterns.
"CMP/EQ" and "CMP.Q/EQ" (BJX1-64C) which collapsed to "CMPQEQ".

But, at some point I decided that '/' in instruction names was ugly.

Otherwise, almost half tempting to consider tweaking the Reg6->Imm33s
case in XG3 to have a bit-pattern more consistent with the Imm10->Imm33s
case, but, this would be another breaking change (and would also break
from XG2).

Say, as-is:
Imm10 -> Imm33s: ckkkkkkkkjjjjjjjjjjjjjjjjiiiiiiii
Reg6 -> Imm33s: daaackkkkkkkkjjjjjjjjjjjjjjjjiiii
But, say:
Reg6 -> Imm33s: ckkkkkkkkjjjjjjjjjjjjjjjjaaadiiii
Meaning that fewer bits need to move around when gluing on the prefix (potentially saving some MUXing here).

Then again, this would break consistency between the Reg6->Imm33s and Reg6->Imm17s patterns, which are consistent for the bits that overlap,
etc, so likely better to not poke at it for now.

...


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 18 01:19:12 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 6/17/2026 11:05 AM, Scott Lurndal wrote:

John Levine <johnl@taugh.com> writes:

Here's a preprint of an IEEE Micro article:

Google's Training Supercomputers from TPU v2 to Ironwood:
Architectural Stability, Scale, Resilience, Power Efficiency,
and Sustainability Across Five Generations

They call the architecture VLIW which I don't think is quite
right -- they do indeed have wide instruction words but I don't
think they do speculative execution.

It is more like clocked (or lock-step) data flow architectures,
without the natural asynchronicity in data flow.

An idle thought here is whether there is any "better" option than conventional register-machine designs.

Consider a machine designed to crunch AI-like vector data.

A vector can contain between 768-and-8192 datums.

One has an HNSW tree of vectors in a vector data base.

A query takes one <query> vector and compares it with a number of
vectors contained in the DB. One comparison is sum {of vector-length multiplies}. So a <small> comparison requires 768×s and 767+s.

Given that the VDB is going to be stored in some kind of persistent
store (say FLASH), and given that the FLASH data rate is 5 GHz and
4 beats per FP32, one can assign a FMAC per FLASH bus and do all the
arithmetic on-line, never needing a single register, and consuming
the data as fast as it streams out of the store. One simply builds
the HW to perform this kind of work.

Power savings:
1) don't have to run the data up PCIe tree and into DRAM
2) don't have to take an interrupt per vector
3) don't have to read vector from DRAM into core <SRAM>
4) don't have to read data from SRAM (cache) into register(s)
5) perform FMACs at 5GHz

In effect one converts 3072 bytes into 4 bytes as the data returned to the point of query, and converts 200-500 interrupts into 1.
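
The arithmetic above can be sketched as follows. This is only an illustration: `dot_stream` is a hypothetical name, and reading "5 GHz, 4 beats per FP32" as 5e9 beats per second with one byte per beat is an assumption.

```python
def dot_stream(query, stored_stream):
    """Accumulate a dot product as stored elements arrive.
    Only the running sum is kept; no vector is ever buffered."""
    acc = 0.0
    for q, s in zip(query, stored_stream):
        acc += q * s  # one FMAC per arriving element
    return acc

N = 768                          # smallest vector length mentioned
BYTES_PER_FP32 = 4
bytes_in = N * BYTES_PER_FP32    # 3072 bytes streamed out of the store
bytes_out = BYTES_PER_FP32       # 4 bytes returned to the point of query
multiplies, adds = N, N - 1      # 768 multiplies, 767 adds

BEATS_PER_SEC = 5e9              # assumed meaning of "5 GHz"
BEATS_PER_FP32 = 4
ns_per_vector = N * BEATS_PER_FP32 / BEATS_PER_SEC * 1e9  # roughly 614 ns

print(bytes_in, bytes_out, multiplies, adds, round(ns_per_vector, 1))
```

Under that reading, one FMAC per bus keeps pace with the stream at 1.25 G elements per second, and a 768-element comparison finishes in well under a microsecond.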

Sometimes the proper point of doing a calculation is not in the core.

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Jun 17 23:17:07 2026

From Newsgroup: comp.arch

On 6/17/2026 1:22 PM, BGB wrote:

snip

An idle thought here is whether there is any "better" option than conventional register-machine designs.

There has been a fair amount of work on, and several working, analog
neural network chips. I believe there has also been some work on
digital chips which contain lots of "neurons" each with a little
hardwired logic to sum the weighted inputs, compare to a threshold and
output a signal if the threshold has been exceeded, sort of like real
neurons do. The issue with these is the interconnect.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Thu Jun 18 07:45:02 2026

From Newsgroup: comp.arch

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than conventional register-machine designs.

Back in the 1980s and earlier, there were many architectural
possibilities being considered.

Consider the Transputer: this consisted of a lot of CPU nodes, each
with its own local memory. I guess they envisioned scaling up both
numbers of CPUs as well as amount of memory in more powerful
configurations, so the amount of memory per CPU stayed roughly
constant.

Why didn’t that work? Obviously, because CPU speeds increased disproportionately more than memory speeds. And so the total amount of
memory has in reality been increasing much faster than the number of
CPUs.

And I don’t think you see NUMA in consumer machines; maybe in
supercomputers.

From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 03:50:52 2026

From Newsgroup: comp.arch

On 6/18/2026 2:45 AM, Lawrence D’Oliveiro wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

Back in the 1980s and earlier, there were many architectural
possibilities being considered.

Consider the Transputer: this consisted of a lot of CPU nodes, each
with its own local memory. I guess they envisioned scaling up both
numbers of CPUs as well as amount of memory in more powerful
configurations, so the amount of memory per CPU stayed roughly
constant.

Why didn’t that work? Obviously, because CPU speeds increased disproportionately more than memory speeds. And so the total amount of
memory has in reality been increasing much faster than the number of
CPUs.

And I don’t think you see NUMA in consumer machines; maybe in supercomputers.

Well, such is the problem.
Seemingly computational throughput is easier to achieve than memory
bandwidth.

Well, even with an FPGA:
It wouldn't be that hard to make a SIMD unit that could pull off 400 or
800 MFLOP at 50 MHz...

Keeping it fed is another problem entirely.
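
A rough back-of-envelope for why feeding it is the hard part. The lane count, two FP32 operands per lane, and the "every operand comes from memory" worst case are assumptions chosen for illustration:

```python
CLOCK_HZ = 50e6  # the 50 MHz FPGA clock from the text

def flops_per_cycle(mflops: float) -> float:
    """FLOPs that must complete each clock to reach a given MFLOP rate."""
    return mflops * 1e6 / CLOCK_HZ

def streaming_bandwidth(lanes: int, operands_per_lane: int = 2,
                        bytes_per_operand: int = 4) -> float:
    """Bytes per second needed if every operand is fetched from memory
    each cycle (assumed worst case, no reuse in registers)."""
    return lanes * operands_per_lane * bytes_per_operand * CLOCK_HZ

print(flops_per_cycle(400))           # 8 FLOPs per clock
print(flops_per_cycle(800))           # 16 FLOPs per clock
print(streaming_bandwidth(8) / 1e9)   # GB/s of input for 8 lanes
```

With these assumed numbers, eight lanes already want about 3.2 GB/s of input, which is where keeping operands in registers starts to pay off.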

I guess one relative merit of register machines is that one can try to static-schedule around memory access patterns to some extent.

At least, more so than with hardware that streams to/from memory. Well,
and memory streaming would introduce unavoidable round-trips to memory, whereas operating within registers could potentially do more stuff
locally in registers before needing to hit back to memory.

I guess a big what if is, say, rather than having a 64-bit or 128-bit
pipe to a relatively large RAM, you could have a whole lot of pipes to
smaller and narrower RAM modules.

Say, for example, 64x 16b LPDDR?...

As opposed to say, two 128-bit channels covering 4 DIMMs?...


From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Jun 18 16:52:28 2026

From Newsgroup: comp.arch

On 2026-06-18 10:45, Lawrence D’Oliveiro wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

Back in the 1980s and earlier, there were many architectural
possibilities being considered.

Consider the Transputer: this consisted of a lot of CPU nodes, each
with its own local memory. I guess they envisioned scaling up both
numbers of CPUs as well as amount of memory in more powerful
configurations, so the amount of memory per CPU stayed roughly
constant.

Why didn’t that work? Obviously, because CPU speeds increased disproportionately more than memory speeds. And so the total amount of
memory has in reality been increasing much faster than the number of
CPUs.

The Transputer (from Inmos) led to the XCORE many-core chips from XMOS (https://www.xmos.com/) which seem to be somewhat successful today.

The Transputer was successful for a while, but IMO then waned because
Inmos focused on making each single-core processor chip more powerful,
which put the transputer into direct competition with conventional processors and made many-processor transputer systems expensive, instead
of making many-core chips, which is what XMOS does, making the cost per
core small.


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 18 14:39:07 2026

From Newsgroup: comp.arch

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 16:54:57 2026

From Newsgroup: comp.arch

On 18/06/2026 15:52, Niklas Holsti wrote:

On 2026-06-18 10:45, Lawrence D’Oliveiro wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

Back in the 1980s and earlier, there were many architectural
possibilities being considered.

Consider the Transputer: this consisted of a lot of CPU nodes, each
with its own local memory. I guess they envisioned scaling up both
numbers of CPUs as well as amount of memory in more powerful
configurations, so the amount of memory per CPU stayed roughly
constant.

Why didn’t that work? Obviously, because CPU speeds increased
disproportionately more than memory speeds. And so the total amount of
memory has in reality been increasing much faster than the number of
CPUs.

The Transputer (from Inmos) led to the XCORE many-core chips from XMOS (https://www.xmos.com/) which seem to be somewhat successful today.

The Transputer was successful for a while, but IMO then waned because
Inmos focused on making each single-core processor chip more powerful,
which put the transputer into direct competition with conventional processors and made many-processor transputer systems expensive, instead
of making many-core chips, which is what XMOS does, making the cost per
core small.

The XCORE core in the XMOS devices is not actually multi-core - it is a
single core with a very deterministic multi-threading system. You get 8 hardware threads, with stepping between threads on every clock tick.
The original XCOREs ran at a fixed 500 MHz (I believe they are faster
now) with a 5 stage pipeline, with pipeline overlap between hardware
threads but not within a hardware thread. So no virtual cpu would run
faster than 100 MHz (but it could be slower if more than 5 of the 8
hardware threads are active). Since the individual virtual cpus do not
see the pipelining, everything is very deterministic and predictable -
there are no delays for branches, no pipeline stalls, no waits for
memory (using the onboard static ram), etc.
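
The per-thread rate described above can be sketched as a one-line model. The function name and the assumption of strict round-robin issue among active threads are illustrative, not taken from XMOS documentation:

```python
def thread_rate_mhz(core_mhz: float = 500, pipeline_stages: int = 5,
                    active_threads: int = 1) -> float:
    """Instruction rate seen by one hardware thread.

    A thread issues at most once per `pipeline_stages` clocks (no overlap
    within a thread); with more active threads than stages, the threads
    share the issue slots round-robin (assumed)."""
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    return core_mhz / max(pipeline_stages, active_threads)

for n in (1, 5, 8):
    print(n, thread_rate_mhz(active_threads=n))
```

With the figures from the text this gives 100 MHz per thread for up to five active threads and 62.5 MHz each when all eight are running.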

There are some XMOS devices with more than one XCORE on the same die, so
they are multi-core. All communication between cores - real or virtual
- is supposed to be via message passing, which is supported by hardware (including dedicated processor instructions).

The whole idea was to take the principle of having many communicating
cores, but implement it efficiently in hardware - reusing the same ALU
and other hardware for multiple virtual cores to save space and cost,
while simultaneously eliminating the timing complications from running a pipelined core quickly.

Another massively multi-core device I read about was the GreenArray
GA144 <https://www.greenarraychips.com/>. In theory, the 144 processing elements means it can do a massive number of operations per second with
very little power and cost - in practice, the tiny amount of ram for
code and data on each element means it can do almost nothing. It is programmed in a type of Forth (I know there are Forth experts in this
group, who might have more informed opinions on the chip and development
for it), but it is an obscure and limited Forth. Combined with the complication of splitting tasks between many elements and communicating
and synchronising between them, I think making use of these devices is a very
niche skill.


From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 16:59:11 2026

From Newsgroup: comp.arch

On 18/06/2026 16:39, Scott Lurndal wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

There have been plenty of microcontrollers where there is only one or
very few actual registers - everything else is ram. The 8-bit PIC
family works like that, and has been hugely popular. There are also
"stack machine" architectures where you have, at most, a register for
the top-of-stack (along with at least one stack pointer register, a
program counter, and perhaps a flag/status register). Pretty much all
4-bit processors work like that, AFAIK.

I think there's a lot to be said for stack machine type designs,
possibly with more than one stack.


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 18 16:21:04 2026

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) writes:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

Registers in high-performance CPUs give you several benefits:

1) The addresses are hard-coded in the instructions. This means that
read access can start early, and that dependencies (read-after-write, write-after-write, write-after-read) can be determined early and used
for forwarding, for renaming registers, and for reducing port requirements.

2) They have many read and write ports.

3) Fast access time. Well, maybe. Thanks to 1) fast access time is
actually not necessary, it just means that you need fewer forwarding
paths.

Let's look at your thought experiment:

Advantage 1 is missing. Some AMD64 implementations still manage to
implement 0-cycle store-to-load-forwarding in many cases, but AFAIK
not as reliably as for registers.

Advantage 2 tends to be missing. E.g., the most extreme I have seen
up to now is 3 reads and 2 writes per cycle, and IIRC <5 total memory
accesses per cycle, on a machine that can do 8 or 10 instructions per
cycle, i.e. at least 16 register reads and 8 register writes per cycle
(maybe limited to less, but with advantage 1 mitigating that to some
extent).

Advantage 3: What would single-cycle memory access mean for d=a+b+c? It
would be compiled to

t=b+c
d=a+t

With registers this has a latency of typically 2 cycles. With
single-cycle memory access this typically has a latency of 6 cycles.
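
One plausible accounting for those two numbers, as a small sketch. The function name and the per-op breakdown (one cycle each for operand load, ALU, and result store, fully serialized along the dependence chain) are assumptions made to reproduce the figures, not a statement about any particular implementation:

```python
def chain_latency(n_dependent_ops: int, alu: int = 1,
                  load: int = 0, store: int = 0) -> int:
    """Latency in cycles of a chain of dependent ops, where each op pays
    for an operand load, the ALU operation, and a result store."""
    return n_dependent_ops * (load + alu + store)

# d = a + b + c compiles to two dependent adds: t = b + c; d = a + t
print(chain_latency(2))                    # registers: ALU cycles only
print(chain_latency(2, load=1, store=1))   # single-cycle memory operands
```

The register version costs 2 cycles because forwarding hides the register file access, while the memory version pays the load and store on every link of the chain, giving 6.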

BTW, it's not just a thought experiment:

A number of IA-64 implementations have had single-cycle D-cache
access. It still had registers.

Processors like the 6502 and the 6809 have single-cycle memory access.
They still have registers (actually, accumulators and index
registers).

- anton

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 18 16:53:42 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

On 18/06/2026 16:39, Scott Lurndal wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

There have been plenty of microcontrollers where there is only one or
very few actual registers - everything else is ram. The 8-bit PIC
family works like that, and has been hugely popular. There are also
"stack machine" architectures where you have, at most, a register for
the top-of-stack (along with at least one stack pointer register, a
program counter, and perhaps a flag/status register). Pretty much all
4-bit processors work like that, AFAIK.

The Burroughs B3500 and IBM 1401 were memory-to-memory
architectures and were popular in the day. In both cases
they supported index registers mapped to memory addresses.

The B3500 TOS (Top Of Stack) was stored in a reserved memory
address (address 40), although the original parameter passing
mechanism was not reentrant (parameters were stored in the
code space immediately following the enter (NTR) instruction).

A later enhancement to the architecture (the VEN - Virtual Enter)
instruction was re-entrant.

I think there's a lot to be said for stack machine type designs,
possibly with more than one stack.

The B6500 family lives on today (albeit in emulation). The HP-3000
was also a stack-based architecture influenced by the Burroughs
large systems.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 18 16:49:40 2026

From Newsgroup: comp.arch

Niklas Holsti <niklas.holsti@tidorum.invalid> writes:

The Transputer was successful for a while, but IMO then waned because
Inmos focused on making each single-core processor chip more powerful,
which put the transputer into direct competition with conventional processors and made many-processor transputer systems expensive, instead
of making many-core chips, which is what XMOS does, making the cost per
core small.

The multiprocessor Transputer systems were never particularly
successful. I have read that the focus on programming the system in
Occam and the limitations of that programming language resulted in a
lack of adoption. However, thanks to the possibility to use the
transputers with few support chips, they were successful as
single-processors in high-end embedded systems.

The early transputers were fast for their time, e.g., with the T414 at
up to 20MHz (and single-cycle instruction execution) in 1985, while
the 80386 was introduced in 1985 at 12.5MHz (with at least two cycles
per instruction), and the MIPS R2000 introduced in 1986 was available
at up to 15MHz.

Later they tried to follow up on that with the T9000 with multiple
instructions per cycle and higher clock rate, but that ran into long
delays, deadly in the 1990s with its extreme clock rate increases
every year, and so the T9000 was eventually cancelled.

As for multi-processing, in addition to Occam the Transputer concept
suffers from distributed memory. As for the idea that the work on the
T9000 prevented success of having a many-processor with cheap CPUs,
that's not the case because the T9000 never appeared, and you could
buy various cheaper transputers. So if a transputer system made up of
lots of T212s would have been a winner, nobody (and certainly not the
T9000) stopped that from happening. It was not a winner, because
programming a distributed-memory machine is harder than a sequential
machine.

- anton

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 18 17:38:06 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

Another massively multi-core device I read about was the GreenArray
GA144 <https://www.greenarraychips.com/>. In theory, the 144 processing elements means it can do a massive number of operations per second with
very little power and cost - in practice, the tiny amount of ram for
code and data on each element means it can do almost nothing. It is programmed in a type of Forth (I know there are Forth experts in this
group, who might have more informed opinions on the chip and development
for it), but it is an obscure and limited Forth.

That is not its major problem IMO.

The Greenarrays chips have IIRC 64 18-bit words per core. That's
really little for a general-purpose computer, and too little to be of
any use in that capacity. A number of people in the Forth community
were fascinated by these chips and ordered some to play around with
them, but I rarely heard of any actual uses, much less production
uses. Greenarrays apparently is still around, so maybe someone has
found some use for it.

One suggestion I have read is that it would be useful for bit-banging
on I/O lines. 64 words might be enough for that (as long as the
protocol is not too complex), and at 700MHz these chips might outdo
FPGAs in some of these applications. But I have not heard much about
such applications, either.

Combined with the
complication of splitting tasks between many elements and communicating
and synchronising between them, I think making use of these devices is a very niche skill.

One interesting aspect is the synchronization. AFAIK you can send
over a word to a neighboring core. As long as that word is not
consumed, sending another word will block. Reading a word from a
neighbor if there is none available will block, too. Not sure if it
has a way to check whether something is in the buffer before trying to
read from or write to it.
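
The blocking behaviour described above can be modelled with a one-slot queue. This is a Python analogy using threads, not GA144 code; `run` and `producer` are hypothetical names, and the non-blocking probe at the end is a feature of Python's `queue` module, not a claim about what the chip offers:

```python
import queue
import threading

def producer(port: queue.Queue, words) -> None:
    for w in words:
        port.put(w)  # blocks while the previous word is still unconsumed

def run(words):
    """Send words to a 'neighbor' through a port holding at most one word."""
    port = queue.Queue(maxsize=1)
    t = threading.Thread(target=producer, args=(port, words))
    t.start()
    received = [port.get() for _ in words]  # blocks until a word arrives
    t.join()
    return received

print(run([1, 2, 3]))

# A full port refuses a second word when asked not to block.
probe = queue.Queue(maxsize=1)
probe.put(0x3FFFF)       # largest 18-bit word
try:
    probe.put_nowait(0)
    would_block = False
except queue.Full:
    would_block = True
print(would_block)
```

Sender and receiver need no other synchronization: the port itself stalls whichever side gets ahead.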

- anton

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 20:01:34 2026

From Newsgroup: comp.arch

On 18/06/2026 19:38, Anton Ertl wrote:

David Brown <david.brown@hesbynett.no> writes:

Another massively multi-core device I read about was the GreenArray
GA144 <https://www.greenarraychips.com/>. In theory, the 144 processing
elements means it can do a massive number of operations per second with
very little power and cost - in practice, the tiny amount of ram for
code and data on each element means it can do almost nothing. It is
programmed in a type of Forth (I know there are Forth experts in this
group, who might have more informed opinions on the chip and development
for it), but it is an obscure and limited Forth.

That is not its major problem IMO.

The Greenarrays chips have IIRC 64 18-bit words per core. That's
really little for a general-purpose computer, and too little to be of
any use in that capacity. A number of people in the Forth community
were fascinated by these chips and ordered some to play around with
them, but I rarely heard of any actual uses, much less production
uses. Greenarrays apparently is still around, so maybe someone has
found some use for it.

Yes, it is the small program size that is the big limit. If there were
more code space, it would be straightforward to simply add new Forth
words as needed until you had something that was more directly practical.

One suggestion I have read is that it would be useful for bit-banging
on I/O lines. 64 words might be enough for that (as long as the
protocol is not too complex), and at 700MHz these chips might outdo
FPGAs in some of these applications. But I have not heard much about
such applications, either.

That is something that is done with XMOS devices (at 100 MHz per virtual
cpu). But to make it work well, they also have a large number of
hardware timers and parallel-to-serial and serial-to-parallel shift
registers. This lets you make things like a 100 Mbps Ethernet MAC in
only a few hardware threads, or multiple UARTs on one thread. One of
the GA144 example applications is an Ethernet MAC, but IIRC it takes
over half the chip. And even though the XMOS devices could do Ethernet
and USB (480 Mbps) in software, they quickly realised that they are much
more efficient in dedicated hardware blocks.

Combined with the
complication of splitting tasks between many elements and communicating
and synchronising between them, I think making use of these devices is a very
niche skill.

One interesting aspect is the synchronization. AFAIK you can send
over a word to a neighboring core. As long as that word is not
consumed, sending another word will block. Reading a word from a
neighbor if there is none available will block, too. Not sure if it
has a way to check whether something is in the buffer before trying to
read from or write to it.

That's a nice idea. But can you have a bit of flexibility here, with a FIFO
of size greater than 1 ? That would provide many other uses. I suppose
it is possible to use one of the cores as a FIFO.
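
As a rough illustration of the difference between a one-word blocking port and a deeper FIFO, here is a minimal Python sketch. It is a software model only, not GA144 code: the names make_port and run are invented for this example, and threads stand in for neighbouring cores. The non-blocking check at the end is a property of Python's queue module, and says nothing about whether the real chip offers one.

```python
import queue
import threading

def make_port(depth=1):
    """Model a link between two neighbouring cores.
    put() blocks while the port is full, get() blocks while it is empty."""
    return queue.Queue(maxsize=depth)

def producer(port, words):
    for w in words:
        port.put(w)              # with depth 1: blocks until the previous word is consumed

def consumer(port, count, out):
    for _ in range(count):
        out.append(port.get())   # blocks until a word is available

def run(depth, words):
    """Send all words through a port of the given depth and return what arrived."""
    port = make_port(depth)
    received = []
    t1 = threading.Thread(target=producer, args=(port, words))
    t2 = threading.Thread(target=consumer, args=(port, len(words), received))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return received

# Non-blocking probe of a depth-1 port (a feature of this model only).
probe = make_port(1)
probe.put_nowait(42)
port_was_full = probe.full()
try:
    probe.put_nowait(43)
    second_put_blocked = False
except queue.Full:
    second_put_blocked = True
```

With depth 1 the producer and consumer run in lock-step; with a larger depth the producer can run ahead by that many words before it stalls, which is the extra flexibility a FIFO would buy.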


From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Jun 18 11:08:25 2026

From Newsgroup: comp.arch

On 6/18/2026 7:59 AM, David Brown wrote:

On 18/06/2026 16:39, Scott Lurndal wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

There have been plenty of microcontrollers where there is only one or
very few actual registers - everything else is ram. The 8-bit PIC
family works like that, and has been hugely popular. There are also
"stack machine" architectures where you have, at most, a register for
the top-of-stack (along with at least one stack pointer register, a
program counter, and perhaps a flag/status register). Pretty much all
4-bit processors work like that, AFAIK.

I think there's a lot to be said for stack machine type designs,
possibly with more than one stack.

Yes, for some applications. As you noted, many/most of the successful
stack-architecture CPUs are in the small embedded space. The
advantages of stack architectures, besides the ones you mentioned,
include smaller code footprint and faster context switch.

The downside becomes more problematic when you get to more powerful
systems and try to do superscalar operations. For example, it is easy
to see how to perform simultaneously two adds that involve different
registers, but since essentially all operations in a stack machine have
top of stack as the destination, it gets more tricky.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 21:08:09 2026

From Newsgroup: comp.arch

On 18/06/2026 20:08, Stephen Fuld wrote:

On 6/18/2026 7:59 AM, David Brown wrote:

On 18/06/2026 16:39, Scott Lurndal wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

There have been plenty of microcontrollers where there is only one or
very few actual registers - everything else is ram. The 8-bit PIC
family works like that, and has been hugely popular. There are also
"stack machine" architectures where you have, at most, a register for
the top-of-stack (along with at least one stack pointer register, a
program counter, and perhaps a flag/status register). Pretty much all
4-bit processors work like that, AFAIK.

I think there's a lot to be said for stack machine type designs,
possibly with more than one stack.

Yes, for some applications. As you noted, many/most of the successful
stack-architecture CPUs are in the small embedded space. The
advantages of stack architectures, besides the ones you mentioned,
include smaller code footprint and faster context switch.

The downside becomes more problematic when you get to more powerful
systems and try to do superscalar operations. For example, it is easy
to see how to perform simultaneously two adds that involve different
registers, but since essentially all operations in a stack machine have
top of stack as the destination, it gets more tricky.

While a lot of problems benefit most from fast performance per thread,
some can spread across many threads, and performance per Watt is key.
Then it doesn't matter if you can't perform multiple simultaneous
additions if you can switch to a new thread in a single cycle.

Of course, the challenge here is that programming is significantly
different from what we are used to, and you need a new type of OS as
well as new applications - and ideally, new programming languages.
That's a lot of big hurdles to clear even if the result is theoretically
more efficient. (And you still need to keep the fast single-threaded
cpus, and the fast SIMD / vector processing systems, for other kinds of
tasks.)

Possibly the biggest millstone around the neck of computing
architectures is the C language. For every processor that is not just
for highly niche code (like gpus), what matters is how fast C code can
run on it. Most other languages either use a similar model, or run on
VMs written in C. Why bother with support for multiple stacks, or other
interesting hardware innovations, if it doesn't support faster C? With
all due respect to Anton and other Forth enthusiasts, "fastest Forth
benchmarks" is not going to attract much investment money.

I'd love to see new architectures and new hardware features that are
genuinely different, but they rarely turn up. Even with C programming,
there are so many things that could be made more efficient with a bit of
interesting hardware. (I say this with little knowledge of the
complications implied.) A lot of time in C code is spent in memory
allocation work - that's got to be a prime candidate for hardware
acceleration, especially if we can get away from the brutish malloc/free
approach. (Stack-based allocators are one possibility for a lot of
allocations.) There could be hardware support for threading, locking,
and inter-process communication. Separate data stacks and return stacks
would make things faster and more secure. Fat pointers that can track
access modes and range limits, at least in common cases, would aid
reliability and security.


From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 14:16:46 2026

From Newsgroup: comp.arch

On 6/18/2026 9:39 AM, Scott Lurndal wrote:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

This is what I was debating. But, trying to push registers out of the
mix adds a new complexity:
Either need for very complex addressing mode, or a section of memory
whose sole purpose is to behave like registers.

Could mostly eliminate GPRs, and assume only a few SPRs, say:
ZP: Zero-Page Base
SP: Stack Pointer
GP: Global Pointer
PC: Program Counter

Then, you can use one of these as a base register, and access an offset.
Say:
(ZP,Disp) //location at ZP+Disp
Indirect:
@(ZP,Disp) //location pointed to by pointer at ZP+Disp
Increment:
@(ZP,Disp)+, @-(ZP,Disp)
Uses pointer, then increments.

This makes a problem for structs and arrays though...

You either need a multi-op sequence to access a struct member, or the
ability to double-up the addressing modes:
((ZP,Disp),Disp2)
((ZP,Disp),(ZP,Disp2))

And, if one allows:
ADD.L
((ZP,Disp),(ZP,Disp2)),
((ZP,Disp),(ZP,Disp2)),
((ZP,Disp),(ZP,Disp2))

This would effectively be 9 memory accesses in a single instruction...
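
To make the count concrete, here is a small Python tally of data-memory accesses per operand when every "register" actually lives in RAM. The mode names and the function names are made up for this sketch, and instruction fetch is deliberately not counted.

```python
def operand_accesses(mode):
    """Data-memory accesses needed to reach one operand when all
    'registers' are really zero-page memory locations."""
    costs = {
        "direct": 1,       # (ZP,Disp): access the location itself
        "indirect": 2,     # @(ZP,Disp): read the pointer, then access the target
        "base+disp": 2,    # ((ZP,Disp),Disp2): read the pointer, then access the target
        "base+index": 3,   # ((ZP,Disp),(ZP,Disp2)): read pointer, read index, access target
    }
    return costs[mode]

def instruction_accesses(*operand_modes):
    """Total data-memory accesses for one instruction with the given operand modes."""
    return sum(operand_accesses(m) for m in operand_modes)

# ADD.L with three doubly-indirect operands, as in the example above.
worst_case = instruction_accesses("base+index", "base+index", "base+index")
# The plain zero-page form of the same three-operand add.
simple_case = instruction_accesses("direct", "direct", "direct")
```

Under these assumptions the doubled-up form costs 9 accesses against 3 for the plain zero-page form, so even with 1-cycle memory the instruction would need a lot of ports or a lot of cycles.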

Another option is a stack-machine, like Forth or PostScript (or JVM).

Naively this works, but as noted similar work in a stack machine tends
to need around 60% more instructions than a register machine (works OK
as a compiler IR though mostly because many of the ops can evaporate
when translating to 3AC/SSA).

But:
PUSH.L (ZP,Disp)
PUSH.L (ZP,Disp)
ADD
POP.L (ZP,Disp)
4 ops to do "c=a+b;"

Or, to access an array:
PUSH.A (ZP,Disp)
PUSH.L (ZP,Disp)
LOADINDEX.L
POP.L (ZP,Disp)
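
The two sequences above can be tried out with a toy interpreter. This is a minimal sketch in Python, not a model of any real ISA: memory is a dict keyed by zero-page displacement, operand sizes are ignored, and the op names PUSH, PUSHA, ADD, LOADINDEX and POP stand in for PUSH.L, PUSH.A, ADD, LOADINDEX.L and POP.L.

```python
def run(program, mem):
    """Execute a list of (op, *args) tuples against a dict-based memory."""
    stack = []
    for op, *args in program:
        if op == "PUSH":           # push the value stored at a displacement
            stack.append(mem[args[0]])
        elif op == "PUSHA":        # push the address (displacement) itself
            stack.append(args[0])
        elif op == "ADD":          # pop two values, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "LOADINDEX":    # pop index and base address, push mem[base + index]
            index = stack.pop()
            base = stack.pop()
            stack.append(mem[base + index])
        elif op == "POP":          # pop the top of stack into a displacement
            mem[args[0]] = stack.pop()
        else:
            raise ValueError("unknown op: " + op)
    return mem

# c = a + b, with a at 0, b at 1, c at 2
add_prog = [("PUSH", 0), ("PUSH", 1), ("ADD",), ("POP", 2)]
add_mem = run(add_prog, {0: 2, 1: 3, 2: 0})

# x = arr[i], with arr at 10..12, i at 0, x at 1
idx_prog = [("PUSHA", 10), ("PUSH", 0), ("LOADINDEX",), ("POP", 1)]
idx_mem = run(idx_prog, {0: 2, 1: 0, 10: 7, 11: 8, 12: 9})
```

Both statements take 4 stack ops, where a three-address register machine would typically need one ADD (or one indexed load) once the operands are in registers.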

Also to be effective if compiling a language like C, would likely need a
frame-pointer to allow effective access to stack-based local variables,
but the relative merit of using ZP for global state would be reduced.

Does all seem to work out to being a disadvantage vs a register machine...

The other option being register machine with some Load-Op and Op-Store
instructions and maybe some more advanced addressing modes.

But, can note (if running Doom):
(Rb, Disp): ~ 60%
(Rb, Index): ~ 36% (*1)
(Rb, Index, Disp): ~ 2%
(Rb)+ / -(Rb): ~ 2%
*1: Live stats, closer to (76, 21, 1, 2) for static counts.

While one could think of specific cases like:
[SP+Ix*4+Ofs]

This mostly becomes moot if the address of the array ends up pinned in a
register. This leaves array-inside-struct as the primary use-case, but
not common enough to really justify it.

This leaves Load-Op and Op-Store:
ADD.L (SP, Disp), Rn //Rn=Rn+(SP,Disp)
ADD.L Rn, (SP, Disp) //(SP,Disp)=(SP,Disp)+Rn

Which can help if the variable is not in a register and only accessed
once. This is theoretically spec'ed, but not really implemented for XG3
in BGBCC yet. The emulator and CPU core should support it though (though
is an optional feature, mostly overlaps with the same mechanism as the
RV AMO ops).

Relative gains are small but non-zero.

The main useful instruction from this cohort is mostly:
XCHGV.L (Rb, Disp), Rn //Rp<=Rn, LD=>Rn
Which can do:
Rn=(Rb,Disp)
(Rb,Disp)=Rp
As a single operation, with volatile access to main RAM (for the V variant).

This happens to map over to the:
InterlockedExchange()
Intrinsic from MSVC, which can be used to implement spinlocks and
similar. Granted, this mostly defeats the use-case for CAS or LR/SC
(which are both more expensive and have little obvious advantage in this case).
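
For reference, a spinlock built on nothing but an atomic exchange looks roughly like this. It is a Python sketch, not XG3 or MSVC code: the AtomicCell class is invented here, and its internal threading.Lock only models the atomicity that the hardware exchange would provide.

```python
import threading
import time

class AtomicCell:
    """Stand-in for a memory word with an atomic exchange instruction."""
    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()   # models hardware atomicity only

    def exchange(self, new):
        """Store new and return the previous value as one indivisible step."""
        with self._guard:
            old, self._value = self._value, new
            return old

class SpinLock:
    def __init__(self):
        self._cell = AtomicCell(0)       # 0 = free, 1 = held

    def acquire(self):
        # Keep swapping in 1; reading back 0 means we were the one who took it.
        while self._cell.exchange(1) == 1:
            time.sleep(0)                # yield, roughly like a pause hint

    def release(self):
        self._cell.exchange(0)

def demo(threads=4, iterations=200):
    """Increment a shared counter from several threads under the spinlock."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            counter[0] += 1
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```

The acquire loop needs no compare step: an unconditional exchange of 1 is harmless when the lock is already held, which is why a plain exchange is enough for this case.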

...

Otherwise, did another test:
Modeled what the result would be if the CPU imposes register-aliasing
between registers within the same pair (say, assume a core with 64-bit
logical registers but 128-bit physical registers):
Current shuffling: ~ 4% penalty
Pairing aware shuffling: ~ 1%
Paired shuffling vs current on current cores: ~ 1%

So, in effect, the overall cost of 128-bit physical registers would be
around a 2% penalty vs 64-bit physical registers (absent maybe also
making the register allocator aware of such shenanigans, and avoiding
allocating registers within the same pair except when register pressure
demands it).

Or, in effect:
6R3W register file with 30x 128-bit physical registers;
Each physical register containing a logical pair of GPRs.
WB would only write the low or high-half for 64b ops.
This would be for exploring an option that could support 256-bit SIMD.
Could make a tricked-out SIMD monster...
But not enough memory bandwidth to make effective use of it.
...

Though, this is still a fair bit less than the penalty from using RV-C
(closer to around 10-15% vs non RV-C). This penalty could be avoided
mostly by having superscalar handling that can deal with RV-C, but this
is easier said than done. Well, or have a compiler that gets "clever"
and tries to constrain RV-C to behave like pair-packing rather than
free-form (trading some code-density for higher performance by not
wrecking things as badly).

Well, and as-is, XG3 is beating plain RV64GC on both code density and
performance (RV64GC+Jx puts up a stronger fight in that at least
binaries can be smaller, *).

*: Though, to put RV64GC+Jx ahead, ended up adding stuff like 32-bit
encodings for "LD/LW/LWU Disp17s*{4|8}(GP)" and similar (along with
jumbo-prefixes, etc). Though, XG3 still remains faster.

...


From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 15:37:39 2026

From Newsgroup: comp.arch

On 6/18/2026 11:21 AM, Anton Ertl wrote:

scott@slp53.sl.home (Scott Lurndal) writes:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

Registers in high-performance CPUs give you several benefits:

1) The addresses are hard-coded in the instructions. This means that
read access can start early, and that dependencies (read-after-write,
write-after-write, write-after-read) can be determined early and used
for forwarding, for renaming registers, and for reducing port
requirements.

2) They have many read and write ports.

3) Fast access time. Well, maybe. Thanks to 1) fast access time is
actually not necessary, it just means that you need fewer forwarding
paths.

Let's look at your thought experiment:

Advantage 1 is missing. Some AMD64 implementations still manage to
implement 0-cycle store-to-load-forwarding in many cases, but AFAIK
not as reliably as for registers.

Advantage 2 tends to be missing. E.g., the most extreme I have seen
up to now is 3 reads and 2 writes per cycle, and IIRC <5 total memory
accesses per cycle, on a machine that can do 8 or 10 instructions per
cycle, i.e. at least 16 register reads and 8 register writes per cycle
(maybe limited to less, but with advantage 1 mitigating that to some
extent).

Advantage 3: What would single-cycle memory access mean for d=a+b+c?
It would be compiled to

t=b+c
d=a+t

With registers this has a latency of typically 2 cycles. With
single-cycle memory access this typically has a latency of 6 cycles.
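
One way to arrive at those two numbers is a simple critical-path count, sketched below in Python. The model is an assumption of this example: each dependent step is load, then ALU op, then store, the loads of one step happen in parallel, and nothing is forwarded from store to load.

```python
def chain_latency(dependent_ops, op_latency=1, load_latency=0, store_latency=0):
    """Critical-path latency in cycles of a chain of dependent ALU ops,
    where each op must load its inputs and store its result first."""
    return dependent_ops * (load_latency + op_latency + store_latency)

# t=b+c ; d=a+t  is a chain of two dependent adds.
with_registers = chain_latency(2)                                  # operands forwarded directly
with_memory = chain_latency(2, load_latency=1, store_latency=1)    # 1-cycle memory, no forwarding
```

Under this model the register version takes 2 cycles and the memory version 6, matching the figures above; store-to-load forwarding would shrink the gap, but only where the hardware can detect the dependency early enough.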

BTW, it's not just a thought experiment:

A number of IA-64 implementations have had single-cycle D-cache
access. It still had registers.

Processors like the 6502 and the 6809 have single-cycle memory access.
They still have registers (actually, accumulators and index
registers).

I can try various ideas in my head, but almost invariably, "have
registers, and a good number of them" comes out as the best-case
option.

One other possibility could be some sort of dynamically reconfigurable
systolic array, but this would be "not so great" on area cost. And,
would likely still make sense to drive them using big SIMD registers or
similar as inputs.

Say:
Shove SIMD vectors into specialized reconfigurable unit;
Wait for an N cycle latency;
Deal with results coming out the other side.
If exposed in an ISA, would likely make sense to violate the assumption
that the instruction produces an immediate result, or alternatively the
input and output ports are decoupled.

SYSARR_IN_A0 Rv1, Rv2, Rv3 //Feed inputs to unit A0 (3R)
...
SYSARR_OUT_A0 Rv4 //get results from A0 (1W)

With pipelining:
SYSARR_IN_A0 Rv1, Rv2, Rv3
SYSARR_IN_A0 Rv4, Rv5, Rv6
...
SYSARR_OUT_A0 Rv10
SYSARR_OUT_A0 Rv11
...

But, this would be a bit messy, as the instructions would effectively
have their timing latency as part of the operation. Likely each unit
would need control registers to describe the operation to be performed.
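
A rough software model of the decoupled in/out idea, as a Python sketch. The class name FixedLatencyUnit and its methods are invented for illustration; feed() plays the role of SYSARR_IN_A0 and take() the role of SYSARR_OUT_A0, each costing one cycle, with the operation itself passed in as a plain function instead of control registers.

```python
from collections import deque

class FixedLatencyUnit:
    """Pipelined unit: results become visible a fixed number of cycles after issue."""
    def __init__(self, latency, func):
        self.latency = latency
        self.func = func
        self.cycle = 0
        self.in_flight = deque()    # entries of (ready_cycle, result), in issue order

    def tick(self, n=1):
        self.cycle += n

    def feed(self, *inputs):
        """Issue one set of inputs; takes one cycle, result is ready 'latency' cycles later."""
        self.in_flight.append((self.cycle + self.latency, self.func(*inputs)))
        self.tick()

    def result_ready(self):
        return bool(self.in_flight) and self.in_flight[0][0] <= self.cycle

    def take(self):
        """Fetch the oldest result; it is an error to ask before the latency has elapsed."""
        if not self.result_ready():
            raise RuntimeError("result not ready; caller must wait out the latency")
        _, result = self.in_flight.popleft()
        self.tick()
        return result

unit = FixedLatencyUnit(4, lambda a, b, c: a * b + c)
unit.feed(1, 2, 3)                 # issued at cycle 0, ready at cycle 4
unit.feed(4, 5, 6)                 # issued at cycle 1, ready at cycle 5
too_early = unit.result_ready()    # cycle 2: nothing ready yet
unit.tick(2)                       # other work (or stalls) until cycle 4
first = unit.take()
second = unit.take()
```

The awkward part shows up in take(): either the hardware stalls until the result is ready, or the latency becomes part of the instruction's contract and the compiler has to schedule around it.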

But, still the same fundamental issue:
Feeding data in and dealing with the data out is likely to be a bigger
bottleneck than the computation itself.

Well, and that when memory access stops being as big of a dominant
bottleneck, things like going superscalar or OoO on the memory
operations opens up as an option (well, and the main reason to have
memory operations as single-ported is mostly due to the higher cost of
multi-ported access to memory, which isn't really solvable in any
obvious way until one already has the resource budget to address it).

Well, and "why not just make load/store and the bus wider?" is
moderately effective (if naive) limited mostly by how much one can
reasonably deal with, or the costs one can pay (say, major reason I am
using 16-byte cache-lines being because it is more expensive to use
32-byte cache lines, even if the per-line overheads would be lower at 32
bytes).

But, say, if the core could do 256-bit load/store, would make sense to
use a bus design with 256 or 512 bit cache lines. Well, and then 256b
load/store would likely require 128-bit alignment.

But, not there at the moment.

One could map systolic arrays directly to memory, which does at least
seem less awkward and works so long as the whole operation can be mapped
to the input/output buffers, working like a niche co-processor. But,
logic complexity and flexibility would be fairly limited in this case.

Such a unit would still be limited to whatever the memory subsystem can
deliver (so not necessarily much faster than a CPU running SIMD, which
is limited to the same underlying constraint).

Then, the only merit of the systolic array is if it can use less area or
energy than a comparable CPU core (which isn't as likely to be true in a
bandwidth-limited scenario; unless it is very narrow so that the bus can
keep up, but then almost may as well just use a CPU).

But, such an approach could make sense assuming there is enough memory
bandwidth to keep it fed (and thus able to spend nearly every clock
cycle doing useful work).

...

But, does bring up an idle thought (unrelated):
Could probably do a faster 3D renderer for my project if a lot of the
work, like the dynamic tessellation and transforms, could be cached
between frames.

I guess could maybe try to bolt it onto OpenGL by hashing the vertex
arrays, then building a cached tessellated array if the array's contents
and modelview matrix are stable for N frames. Could then use a cheaper path
to feed all the (pre-tessellated) primitives through the projection
matrix and rasterizer.
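
The caching idea could look something like the following Python sketch. Everything here is hypothetical: TessellationCache and its methods are made-up names, the tessellate callback stands in for the expensive path, and "stable for N frames" is approximated by counting how often the same hash has been seen, assuming one lookup per frame. Eviction of stale entries is left out.

```python
import hashlib

class TessellationCache:
    """Cache tessellated output once the same input has been seen for N frames."""
    def __init__(self, tessellate, stable_frames=3):
        self.tessellate = tessellate        # expensive function: (vertices, modelview) -> result
        self.stable_frames = stable_frames
        self.entries = {}                   # key -> [times_seen, cached_result_or_None]
        self.tessellate_calls = 0

    @staticmethod
    def key(vertices, modelview):
        """Hash the vertex array contents together with the modelview matrix."""
        h = hashlib.sha1()
        h.update(repr(vertices).encode())
        h.update(repr(modelview).encode())
        return h.hexdigest()

    def get(self, vertices, modelview):
        entry = self.entries.setdefault(self.key(vertices, modelview), [0, None])
        entry[0] += 1
        if entry[1] is not None:
            return entry[1]                 # cheap path: reuse the cached tessellation
        self.tessellate_calls += 1
        result = self.tessellate(vertices, modelview)
        if entry[0] >= self.stable_frames:
            entry[1] = result               # input looks stable: keep the result
        return result

def double_all(verts, modelview):
    """Toy stand-in for tessellation + transform."""
    return [(x * 2, y * 2) for x, y in verts]

cache = TessellationCache(double_all, stable_frames=3)
tri = [(0, 0), (1, 0), (0, 1)]
identity = (1, 0, 0, 1)
frames = [cache.get(tri, identity) for _ in range(5)]
calls_after_five_frames = cache.tessellate_calls
```

Whether this wins in practice depends on the hash being cheaper than the tessellation it avoids, which is not shown here.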

...


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 18 22:37:29 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> posted:
-----------------------

Of course, the challenge here is that programming is significantly
different from what we are used to, and you need a new type of OS as
well as new applications - and ideally, new programming languages.
That's a lot of big hurdles to clear even if the result is theoretically
more efficient. (And you still need to keep the fast single-threaded
cpus, and the fast SIMD / vector processing systems, for other kinds of
tasks.)

Possibly the biggest millstone around the neck of computing
architectures is the C language.

C is not an albatross !! it is a standard to which one designs--
exactly like air (our atmosphere) provides the standard to which
airplanes have to be designed.

C ended up being this model because its floor supports almost all other
programming languages: certainly {Fortran, C++, Algol68, Pascal, Jovial,
...} and is not all that bad when doing {LISP, RPG, COBOL, APL, SNOBOL}.

For every processor that is not just
for highly niche code (like gpus), what matters is how fast C code can
run on it. Most other languages either use a similar model, or run on
VMs written in C.

You state that like it was a BAD thing--it is not. I just wish we had all
chosen the same standards to which to design: {BE or LE} is like having the
steering wheel on the {left or right}, ...

Why bother with support for multiple stacks, or other interesting
hardware innovations, if it doesn't support faster C?

I can argue that having 2 stacks {one for the preserved state from
caller to callee, the other for data} enables ever so slightly faster
C--but that is not the point--the point is robustness in the face of
threats (buffer overruns, ROP, malicious use of memory).

The speed advantage is by knowing that registers written to the call-
stack do not need to be written to L2 (or farther) if RET has occurred
and the cache line replaced. Saves a trifling of power, too. There is
another advantage when an EXIT instruction is still reading from the
stack and an ENTER instruction starts writing to the stack. We taught
the compiler a prescribed order to utilize the registers, so that when
an EXIT is running and an ENTER is decoded, the EXIT can be short
circuited and some of the ENTER short circuited, eliding cycle waste.

With all due respect to Anton and other Forth enthusiasts, "fastest Forth
benchmarks" is not going to attract much investment money.

I'd love to see new architectures and new hardware features that are
genuinely different, but they rarely turn up.

My 66000 is replete with those features--and it is argued here daily
that it (my 66000 ISA) has gone too far !!

[Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere in memory at any
time.

Jump-Through-Table (switch) making PIC standard; while making the tables smaller {1/8th to 1/4th}

Load IP instructions (CALX, JMPX, CALA, JMPA) enable control transfers
directly through GOT (or other SW table).

Multi-line multi-instruction ATOMIC sequences freely available to SW.

Transcendental instructions that take FDIV number of cycles.

Context Switches performed without instruction execution--as if the
state of a thread was treated like a write-back cache.

Interrupt tables that can be used as a low level scheduler built into
the priority (and privilege) model with support for vVMs monitoring vMs.
One can schedule a DPC/softIRQ in 1 instruction that never fails
(excepting when the interrupt message takes an unrecoverable ECC failure
between core and table).

Even with C programming, there are so many things that could be made more
efficient with a bit of interesting hardware. (I say this with little
knowledge of the complications implied.) A lot of time in C code is spent
in memory allocation work - that's got to be a prime candidate for
hardware acceleration, especially if we can get away from the brutish
malloc/free approach.

And then C++ goes all 'new' on using memory...

(Stack-based allocators are one possibility for a lot of allocations.)
There could be hardware support for threading, locking,
and inter-process communication.

My 66000 can switch threads in a single instruction.
My 66000 ESM provides unrealized synchronization capabilities.
My 66000 Interrupt tables provide single instruction message send and
single instruction message receives.
SW determines what the messages mean.

Separate data stacks and return stacks would make things faster and more secure.

Already present. But in addition, My 66000 DRAM controller is free from
RowHammer-like attack vectors--By Architecture of the memory hierarchy.

Plus, the code sees a 64-bit Virtual Address Space, while the system has
four 64-bit physical address spaces. The spaces are used to determine
the consistency model:: {
Cacheable DRAM is causally ordered and coherent and cached
unCacheable DRAM is sequentially consistent incoherent
ROM is unordered incoherent but cached
MMI/O is sequentially consistent incoherent
Config is strongly ordered incoherent
}

Fat pointers that can track access modes and range limits, at least in
common cases, would aid reliability and security.

A bit Too "Cheri" for me.


From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Thu Jun 18 22:58:26 2026

From Newsgroup: comp.arch

On Thu, 18 Jun 2026 03:50:52 -0500, BGB wrote:

I guess a big what if is, say, rather than having a 64-bit or
128-bit pipe to a relatively large RAM, you could have a whole lot
of pipes to smaller and narrower RAM modules.

Say, for example, 64x 16b LPDDR?...

16-bit ... wasn’t that the bus width of Rambus?

Was Rambus trying to do anything like this? Remember, Intel invested
heavily in it ... only for just about the entire rest of the industry
to bring out DDR.

Yes, pipelining RAM would seem an obvious answer to try to keep up
with faster and faster CPUs.

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jun 19 00:26:42 2026

From Newsgroup: comp.arch

Lawrence D’Oliveiro <ldo@nz.invalid> posted:

On Thu, 18 Jun 2026 03:50:52 -0500, BGB wrote:

I guess a big what if is, say, rather than having a 64-bit or
128-bit pipe to a relatively large RAM, you could have a whole lot
of pipes to smaller and narrower RAM modules.

Say, for example, 64x 16b LPDDR?...

16-bit ... wasn’t that the bus width of Rambus?

Was Rambus trying to do anything like this? Remember, Intel invested
heavily in it ... only for just about the entire rest of the industry
to bring out DDR.

Yes, pipelining RAM would seem an obvious answer to try to keep up
with faster and faster CPUs.

See my patent 5,367,494

Where there were 3 busses, an address bus, a read-out bus and a
write-in bus, each bus had/has independent timing.

Basically a mainframe multi-banked memory system in a single chip.

From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 23:48:10 2026

From Newsgroup: comp.arch

On 6/18/2026 5:58 PM, Lawrence D’Oliveiro wrote:

On Thu, 18 Jun 2026 03:50:52 -0500, BGB wrote:

I guess a big what if is, say, rather than having a 64-bit or
128-bit pipe to a relatively large RAM, you could have a whole lot
of pipes to smaller and narrower RAM modules.

Say, for example, 64x 16b LPDDR?...

16-bit ... wasn’t that the bus width of Rambus?

Was Rambus trying to do anything like this? Remember, Intel invested
heavily in it ... only for just about the entire rest of the industry
to bring out DDR.

Yes, pipelining RAM would seem an obvious answer to try to keep up
with faster and faster CPUs.

The idea here is partly:
LPDDR vs DDR:
LPDDR has lower pin count due to multiplexing;
Narrower interface (typically 16 bit);
Commonly used in cellphones or similar;
...

So, for normal DDR, there are normally:
Command pins;
Address pins;
Data Pins.
The command/address is sent in parallel, typically SDR.

On a DIMM, typically the C/A pins would be shared across all of the
chips, so each chip would individually provide an 8 or 16 bit interface,
but then they are ganged up for a 64 bit interface.

If each DDR chip were addressed individually, the C/A pins would
greatly outweigh the data pins.

Whereas with individual addressing, LPDDR wouldn't have nearly as big of
an impact on pin count. The C/A pins are more heavily multiplexed and
driven using DDR signals.

Why go narrower?...
Mostly, each memory access has a certain RAS and CAS latency, which
needs to be paid for every access. To get the most efficient use of
bandwidth, one effectively needs to perform relatively large burst
transfers (say, would need to transfer around 512 bytes or so to get
peak efficiency from a DIMM; can do 128 or 256 byte bursts, but then one
is wasting more of the time on CAS latency).

But, then a more subtle problem emerges:
The bigger the block you transfer, the lower the probability that all of
the data in that block will actually be used.

Say, block size:
16B: Very likely all of it is relevant;
32B: Also likely that all of it is relevant;
64B: Meh;
128B: Chances are half the block is wasted;
256B/512B: Likely only part of the block will be accessed in the near-term.

If the RAM interface is pushed narrower, this block width is pushed
down, so a higher percentage is likely to be useful (within the time
window of however long it is in the cache).
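
The trade-off can be put into a back-of-envelope model, sketched here in Python. All numbers are hypothetical: a fixed 30-cycle RAS/CAS overhead per access, 16 bytes per clock for a 64-bit DDR interface and 4 bytes per clock for a 16-bit one. The model ignores bank overlap, refresh and command scheduling.

```python
def bus_efficiency(burst_bytes, bus_bytes_per_cycle, overhead_cycles):
    """Fraction of bus time spent moving data rather than waiting on RAS/CAS."""
    transfer_cycles = burst_bytes / bus_bytes_per_cycle
    return transfer_cycles / (transfer_cycles + overhead_cycles)

def useful_fraction(burst_bytes, bus_bytes_per_cycle, overhead_cycles, used_fraction):
    """Bus efficiency scaled by how much of each block actually gets used."""
    return bus_efficiency(burst_bytes, bus_bytes_per_cycle, overhead_cycles) * used_fraction

OVERHEAD = 30                                      # hypothetical fixed cost per access, in clocks

wide_512 = bus_efficiency(512, 16, OVERHEAD)       # 64-bit interface, 512-byte burst
wide_64 = bus_efficiency(64, 16, OVERHEAD)         # 64-bit interface, 64-byte burst
narrow_128 = bus_efficiency(128, 4, OVERHEAD)      # 16-bit interface, 128-byte burst

# If only half of a 512-byte block is ever touched (a made-up figure),
# the wide interface loses half of its apparent efficiency.
wide_512_half_used = useful_fraction(512, 16, OVERHEAD, 0.5)
```

In this model a 16-bit interface reaches the same bus efficiency with a 128-byte burst that a 64-bit interface needs 512 bytes for, so the block size at which the latency is amortized drops by the same factor as the width.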

The pin budget can then instead be used to access more blocks. And, if
you have a lot of cores, more likely they are going to be accessing RAM
in a fairly scatter-shot pattern rather than anything resembling a
sequential access pattern.

Likewise, more of the total RAM's bandwidth budget is likely to go
towards useful work vs accessing RAM that just happened to be nearby the
data that was being accessed.

Though, the correlation pattern is different than the one affecting MMU
page size, but MMU page size applies over a few orders of magnitude
larger time scale (where, say, data in the L1 and L2 caches tends to be
much shorter lived vs stuff in the TLB).

But, yeah, you wouldn't likely want 256B or 512B burst transfers for
similar reasons to why you wouldn't want a 64K or 128K page size.

...

But, then again, probably if mainstream systems designers felt there
were good reason to stick a bunch of cellphone style RAM chips onto the
DIMMs, they would have probably done it...


From David Brown@david.brown@hesbynett.no to comp.arch on Fri Jun 19 09:24:43 2026

From Newsgroup: comp.arch

On 19/06/2026 00:37, MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:
-----------------------

Of course, the challenge here is that programming is significantly
different from what we are used to, and you need a new type of OS as
well as new applications - and ideally, new programming languages.
That's a lot of big hurdles to clear even if the result is theoretically
more efficient. (And you still need to keep the fast single-threaded
cpus, and the fast SIMD / vector processing systems, for other kinds of
tasks.)

Possibly the biggest millstone around the neck of computing
architectures is the C language.

C is not an albatross !! it is a standard to which one designs--
exactly like air (our atmosphere) provides the standard to which
airplanes have to be designed.

C ended up being this model because its floor supports almost all other
programming languages: certainly {Fortran, C++, Algol68, Pascal, Jovial,
...} and is not all that bad when doing {LISP, RPG, COBOL, APL, SNOBOL}.

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

This is not necessarily a bad thing - it lets everyone get stuff done.
But it means that we are stuck on a local maximum. If there is a better
way out there somewhere, it would be a long and arduous journey to get
there.

For every processor that is not just
for highly niche code (like gpus), what matters is how fast C code can
run on it. Most other languages either use a similar model, or run on
VMs written in C.

You state that like it was a BAD thing--it is not. I just wish we had all chosen the same standards at which to design; {BE or LE} is like having the steering wheel on the {left or right}, ...

Having a consistency here makes a lot of things easier and more
efficient. But it also makes change harder, even if the end results
would perhaps be better. (I don't know if there are other models that
really are better - this thread is titled "Thought experiment", after all!)

Why bother with support for multiple stacks, or other
interesting hardware innovations, if it doesn't support faster C?

I can argue that having 2 stacks {one for the preserved state from
caller to callee, the other for data} enables ever so slightly faster
C--but that is not the point--the point is robustness in the face of
threats (buffer overruns, ROP, malicious use of memory).

Those could be handled in hardware too.

Intel "Control-flow Enforcement Technology" sounds fancy and innovative,
but it is really nothing more than having a second stack for return
addresses. Having two or more stacks, with hardware protection for what
can be done with them, should be very positive for robustness and security.

The speed advantage is by knowing that registers written to the
call-stack do not need to be written to L2 (or farther) if RET has occurred
and the cache line replaced. Saves a trifling of power, too. There is
another advantage when an EXIT instruction is still reading from the
stack and an ENTER instruction starts writing to the stack. We taught
the compiler a prescribed order to utilize the registers, so that when
an EXIT is running and an ENTER is decoded, the EXIT can be short
circuited and some of the ENTER short circuited, eliding cycle waste.

That is one benefit, yes - things above the stack line (for any of the
stacks) can be discarded without being pushed back to main memory. But
you can do better. They can not only be discarded, but cleared,
improving security. They can use specialised cpu-local caches for
different purposes - return stacks with just addresses will be much
smaller than data stacks, and can fit tightly together with prefetches,
speculative execution, etc. Since you know that the different types of
data don't overlap, there is no need to worry about data accesses addressing
things on the return stack, and so on. (That would add restrictions
limiting some kinds of self-modifying program - good riddance!)

With
all due respect to Anton and other Forth enthusiasts, "fastest Forth
benchmarks" is not going to attract much investment money.

I'd love to see new architectures and new hardware features that are
genuinely different, but they rarely turn up.

My 66000 is replete with those features--and it is argued here daily
that it (my 66000 ISA) has gone too far !!

[Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere in memory at any time.

Jump-Through-Table (switch) making PIC standard; while making the tables smaller {1/8th to 1/4th}

Load IP instructions (CALX, JMPX, CALA, JMPA) enable control transfers directly through GOT (or other SW table).

Those might all be useful at times, but are not game-changers as far as
I can see.

Multi-line multi-instruction ATOMIC sequences freely available to SW.

That's more fun.

Transcendental instructions that take FDIV number of cycles.

That's "just" efficiency.

Context Switches performed without instruction execution--as if the
state of a thread was treated like a write-back cache.

/Now/ we are getting somewhere. That sounds like the kind of feature I
am talking about - something that changes the way people design code,
not just making existing stuff faster.

Interrupt tables that can be used as a low level scheduler built into
the priority (and privilege) model with support for vVMs monitoring vMs.
One can schedule a DPC/softIRQ in 1 instruction that never fails (excepting
when the interrupt message takes an unrecoverable ECC failure between core
and table.)

Even with C programming,
there are so many things that could be made more efficient with a bit of
interesting hardware. (I say this with little knowledge of the
complications implied.) A lot of time in C code is spent in memory
allocation work - that's got to be a prime candidate for hardware
acceleration, especially if we can get away from the brutish malloc/free
approach.

And then C++ goes all 'new' on using memory...

"new" is mostly built on top of malloc/free. While you /can/ make your
own overrides for "new", either globally or for specific types, in the
great majority of cases it boils down to calling "malloc" then calling
the type's constructor.

(Stack-based allocators are one possibility for a lot of
allocations.)
There could be hardware support for threading, locking,
and inter-process communication.

My 66000 can switch threads in a single instruction.
My 66000 ESM provides unrealized synchronization capabilities.
My 66000 Interrupt tables provide single instruction message send and
single instruction message receives.
SW determines what the messages mean.

That gives me confidence that not all my ideas are crazy ! I hope you
succeed with your design - these are features I would love to be able to
use in my daily work.

Separate data stacks and return stacks
would make things faster and more secure.

Already present. But in addition, My 66000 DRAM controller is free from RowHammer-like attack vectors--By Architecture of the memory hierarchy.

Plus, the code sees a 64-bit Virtual Address Space, while the system has
four 64-bit physical address spaces. The spaces are used to determine
the consistency model:: {
Cacheable DRAM is causally ordered and coherent and cached
unCacheable DRAM is sequentially consistent incoherent
ROM is unordered incoherent but cached
MMI/O is sequentially consistent incoherent
Config is strongly ordered incoherent
}

Fat pointers that can track
access modes and range limits, at least in common cases, would aid
reliability and security.

A bit Too "Cheri" for me.

I think "Cheri" took it too far. I believe there is scope for tagging a
bit of information onto pointers without trying to do everything.

I also think a lot can be done on the side of programming languages and
tools, which could catch far more possible pointer mistakes. That won't
stop the bad guys, of course, but I think more bad accesses are from
bugs than hackers.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 19 06:02:16 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

Possibly the biggest millstone around the neck of computing
architectures is the C language. For every processor that is not just
for highly niche code (like gpus), what matters is how fast C code can
run on it. Most other languages either use a similar model, or run on
VMs written in C. Why bother with support for multiple stacks, or other
interesting hardware innovations, if it doesn't support faster C? With
all due respect to Anton and other Forth enthusiasts, "fastest Forth
benchmarks" is not going to attract much investment money.

That is probably the case, but "fast Java performance", "fast Python performance", or "fast JavaScript performance" might be a different
issue. And indeed, once upon a time Sun boasted architectural
features to support Java. AFAIK these were features to improve the indirect-branch performance in the Java VM interpreter, to improve the
startup performance. However, other CPU manufacturers (in particular
Intel and AMD, and, more recently, ARM and Apple and probably
Nuvia/Qualcomm) made indirect branches fast without architectural
support.

Concerning multiple stacks, existing architectures and their
implementations support multiple stacks just fine. While AMD64, which
is descended from an architecture that is older than C, has special architectural support for one stack, based on the call/return stack of
its ancestor, implementing a stack using a different register as stack
pointer works just fine and is efficient. In newer architectures like
ARM A64 and RISC-V it's just software convention that one register is designated as stack pointer (the Compression extension of RISC-V has
compressed instructions for dealing with that register, but apart from
a code size advantage with C the stack pointer is just another
register on RISC-V).

Most Forth implementations on AMD64 implement the data stack, the most
heavily used stack in Forth, using a different register than %RSP.
There is one high-performance implementation that uses %RSP as data
stack, but it does not perform generally better than other
high-performance implementations that use a different register.

Of course, high-performance implementations tend to keep data stack
items in registers and access them in memory as rarely as possible, so
the question of having stack pointers with different performance characteristics would have less influence than some might expect, but
I have not observed or read about different performance
characteristics. A long time ago I read about consecutive
pushes and pops being optimized to avoid the dependency due to
stack-pointer updates, but for a stack implemented using stores,
loads, and additions to a register, any even moderately sophisticated
compiler combines the additions of consecutive stack accesses, too.

One case where C and Forth have a difference and where a preference
for C may show in some architectures is in reifying comparison results
as integer values. Most comparison results are only used by
conditional branches, and compilers can easily avoid having to reify
them in this context, but sometimes this optimization does not happen.
There C produces 1 for true, while Forth produces -1 (all bits set)
for true.
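
To make the two conventions concrete, here is a minimal C sketch. The function names are made up for illustration; the only point is that the Forth-style flag costs one extra negation on top of the C-style 0/1 result.

```c
#include <stdint.h>

/* C convention: a comparison yields 1 for true, 0 for false. */
static inline intptr_t c_flag_less(intptr_t a, intptr_t b)
{
    return a < b;
}

/* Forth convention: true is all bits set (-1), false is 0.
   Negating the C result is the one extra step. */
static inline intptr_t forth_flag_less(intptr_t a, intptr_t b)
{
    return -(intptr_t)(a < b);
}
```

When the result feeds a conditional branch directly, a compiler normally never materializes either value, so the difference only shows up when the flag is stored or used in arithmetic.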

RISC-V clearly is in C land, and its comparison instructions produce 1
for true, and Forth needs an additional instruction for its reified
flag.

AMD64 inherits the SETcc instruction that produces a byte result of 1
for true, and needs another instruction to produce an integer result.
This tends to result in one additional instruction for Forth, but in
some cases shorter sequences are possible. E.g., for "5 u<" VFX Forth
produces the code:

CMP RBX, # 05
SBB RBX, RBX

In AVX2 the comparison instructions produce all-bits-set for true.
Did they design it for Forth? Probably not, but they designed it for
use of the values with bitwise operations (and, or, xor), just like
the designers of Forth-83 did. In any case, this is a counterexample
to the theory that everything is designed to accommodate C.
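
As an illustration of why all-bits-set is convenient with bitwise operations, here is a sketch in plain C that models the SIMD lanes as arrays. It does not use any AVX2 intrinsics, and the function names are hypothetical; it only shows the mask-and-select pattern.

```c
#include <stdint.h>

/* Element-wise "a < b" in the SIMD style: each lane becomes all bits set
   for true and 0 for false (modelled here on plain arrays). */
static void mask_less(const int32_t *a, const int32_t *b, uint32_t *mask, int n)
{
    for (int i = 0; i < n; i++)
        mask[i] = (a[i] < b[i]) ? UINT32_MAX : 0;
}

/* Branch-free select: takes x where the mask is all ones, y where it is 0. */
static void select_lanes(const uint32_t *mask, const int32_t *x,
                         const int32_t *y, int32_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (int32_t)(((uint32_t)x[i] & mask[i]) |
                           ((uint32_t)y[i] & ~mask[i]));
}
```

Calling mask_less(a, b, m, n) followed by select_lanes(m, a, b, out, n) gives a lane-wise minimum without any branch. With a 0/1 result the mask would first have to be widened to all ones.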

ARM A64 supports reifying flags in the Forth way just as well as it
does reifying them in the C way. E.g., in Gforth the code for < is:

cmp x28, x20
csetm x28, lt // lt = tstop

Actually, cmp and csetm are aliases for specific uses of more
versatile instructions. The same code, decoded as the more versatile instructions:

subs xzr, x28, x20 // set the flags, throw subtraction result away
csinv x28, xzr, xzr, ge // select either xzr or xzr-1, depending on flag

It's interesting that ARM A64 has not just the instruction, but also a
separate mnemonic for the 0-or-all-bits-set case. The ARM A64
architects obviously had more than just C in mind.

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp), and
SPARC contains some support for that, but as one of the IIRC Franz
Lisp developers has explained in this newsgroup, they actually did not
use this feature, because the performance benefit was not big enough
to justify the complications of modifying their tagging architecture
to make use of that. However, in recent years AMD, ARM, and Intel
have added features to ignore the top (7,8, or 16) bits in every
address (how many depends on the feature and the selected variant of
the feature), probably to support pointer tagging in such programming languages. I am sure that no C need is behind this feature addition.

I'd love to see new architectures and new hardware features that are
genuinely different, but they rarely turn up.

I see lots of architectural features that are not or badly supported
by C, and so obviously are not designed for C.

A prominent example is SIMD. Standard C does not have language
features that map to SIMD instructions, and much as C compiler writers
want to make use of them with auto-vectorization, the result is
hit-or-miss. Admittedly, SIMD has existed for more than 50 years, so
it's not a new architectural feature, but the fact that it has been
added to many architectures after C became prominent is another
indication that architects do not restrain themselves to things that C supports.

Another example is the ADX extension for AMD64 (introduced with
Broadwell (released 2014)), which does not correspond to a language
feature of C before C23's _BitInt, and which existing C compilers do
not support at all AFAICS. Read all about it in <https://repositum.tuwien.at/bitstream/20.500.12708/226349/5/Ertl-2025-Multi-precision%20integer%20arithmetics-vor.pdf>
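
For readers who have not written this kind of code, here is a sketch of multi-limb addition in portable C. It is not taken from the paper; the name mp_add and the limb layout (least significant limb first) are assumptions for illustration. The carry has to be recovered with comparisons because the language has no carry flag, whereas ADX adds ADCX and ADOX, which carry through CF and OF respectively and so allow two independent carry chains.

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n 64-bit limbs, least significant limb first.
   Returns the carry out of the top limb (0 or 1). */
static uint64_t mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;   /* wrapped while adding the carry in */
        s += b[i];
        uint64_t c2 = s < b[i];    /* wrapped while adding b[i] */
        r[i] = s;
        carry = c1 | c2;           /* at most one of the two can be set */
    }
    return carry;
}
```

Whether a given compiler turns the two comparisons back into a single add-with-carry instruction varies, which is part of the hit-or-miss character described above.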

Even with C programming,
there are so many things that could be made more efficient with a bit of
interesting hardware. (I say this with little knowledge of the
complications implied.) A lot of time in C code is spent in memory
allocation work - that's got to be a prime candidate for hardware
acceleration, especially if we can get away from the brutish malloc/free
approach. (Stack-based allocators are one possibility for a lot of
allocations.)

What architectural features do you have in mind?

There could be hardware support for threading, locking,
and inter-process communication.

There is, some better, some worse. First of all, we have shared
memory rather than distributed memory. Next, we have cache-coherent
shared memory. The cache coherence architectures are generally
deficient, with the deficiencies described as "memory model".

Separate data stacks and return stacks
would make things faster and more secure.

It's not clear that more architectural support for that would make
things faster, see above; plus, one of the results of my work on PAF <https://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-paf.pdf>
is that for the majority of Forth code, one stack pointer is enough;
that's not because the architectural support for several stack
pointers is so bad, but because registers are a limited resource, and
being stingy with them helps.

Concerning "more secure", the idea is probably that buffer overflows
then cannot overwrite return addresses. Interestingly, some
architectures now support an additional return stack to thwart such
attacks. Anyway, buffer overflows are still a security problem even
with separate return stacks, because the data contains code pointers,
sometimes with some indirections, such as C function pointers,
pointers to virtual function tables in object-oriented programs etc.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 18 23:57:22 2026

From Newsgroup: comp.arch

Anton Ertl [2026-06-18 16:49:40] wrote:

The early transputers were fast for their time, e.g., with the T414 at
up to 20MHz (and single-cycle instruction execution) in 1985, while
the 80386 was introduced in 1985 at 12.5MHz (with at least two cycles
per instruction), and the MIPS R2000 introduced in 1986 was available
at up to 15MHz.

Another interesting project in that ballpark was the [iWarp](https://en.wikipedia.org/wiki/IWarp)

I never really played with one, but I worked next to one during
one summer. 🙂

=== Stefan
From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Fri Jun 19 11:20:10 2026

From Newsgroup: comp.arch

On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp), and
SPARC contains some support for that, but as one of the IIRC Franz Lisp developers has explained in this newsgroup, they actually did not use
this feature, because the performance benefit was not big enough to
justify the complications of modifying their tagging architecture to
make use of that. However, in recent years AMD, ARM, and Intel have
added features to ignore the top (7,8, or 16) bits in every address (how
many depends on the feature and the selected variant of the feature), probably to support pointer tagging in such programming languages. I am
sure that no C need is behind this feature addition.

The architectural support for tagging in SPARC only avoided the need to
untag and tag integers in compiled code.

David Ungar's thesis on SOAR provided measurements of the impact of this
on benchmarks for Smalltalk.

The layout of having the tags in the bottom 2 bits of a 32 bit word works
fine without architectural support; being able to turn on traps for
unaligned data access helps, though.
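
A minimal sketch of that layout in C, assuming a hypothetical scheme where tag 00 marks a small integer and the other three tag values are left for pointer types. The names are made up for illustration and do not come from any particular Lisp system.

```c
#include <stdint.h>

/* Hypothetical layout: low 2 bits are the tag, tag 0 means
   "small integer" (fixnum), the value sits in the upper 30 bits. */
enum { TAG_MASK = 3, TAG_FIXNUM = 0 };

typedef int32_t lispobj;

static inline lispobj make_fixnum(int32_t n) { return n * 4; }  /* caller keeps n within 30 bits */
static inline int is_fixnum(lispobj v) { return (v & TAG_MASK) == TAG_FIXNUM; }
static inline int32_t fixnum_value(lispobj v) { return v / 4; } /* exact: v is a multiple of 4 */

/* Because the fixnum tag is 00, tagged values add without untagging:
   4*a + 4*b == 4*(a+b). Only the tag check remains in software.
   Returns 1 on success, 0 if either operand is not a fixnum. */
static inline int fixnum_add(lispobj a, lispobj b, lispobj *out)
{
    if (!is_fixnum(a) || !is_fixnum(b))
        return 0;
    *out = a + b;   /* no overflow check in this sketch */
    return 1;
}
```

The tag check in fixnum_add is the part that tagged-arithmetic instructions fold into the add itself. The unaligned-access trap helps on the pointer side: if a load subtracts the expected tag from the pointer, an object with the wrong tag yields a misaligned address and traps without an explicit check.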

In 64 bit machines you can use the bottom three bits for Lisp tags but
SPARC64 didn't provide instructions to work with this.

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

In previous discussions, I had tried to press Mitch to see if he could
remember what kind of benchmarks they had run on the 88100 that showed
it running Lisp faster than SPARC.

To me, the old SPEC li benchmark was a test of the speed of an interpreter written in C and doesn't say anything useful about how well a system would
run Lisp that had been compiled to machine code.

There were well known (non SPEC) Lisp benchmarks at the time.

I'd love to see new architectures and new hardware features that are
genuinely different, but they rarely turn up.

I see lots of architectural features that are not or badly supported by
C, and so obviously are not designed for C.

The one architectural feature of Lisp Machines that I don't think has been carried forward was a multi-way switch instruction.

The rest of the MIT Lisp Machine microarchitecture was just a pipelined,
three address, load/store one that provides another data point for the discussion from a few months ago on whether VAX could have been a RISC
using TTL chips.

From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Fri Jun 19 15:08:40 2026

From Newsgroup: comp.arch

In article <2026Jun19.080216@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

once upon a time Sun boasted architectural features to support
Java. AFAIK these were features to improve the indirect-branch
performance in the Java VM interpreter, to improve the startup
performance.

Did this become obsolete when Java runtime environments switched to
JITing to native code?

Admittedly, SIMD has existed for more than 50 years, so it's not
a new architectural feature, but the fact that it has been added
to many architectures after C became prominent is another
indication that architects do not restrain themselves to things
that C supports.

They don't, but they don't do a good job of making those features usable either. Support for new instructions is readily provided via intrinsics,
but those aren't portable. Back when I had a close relationship with
Intel, it seemed that they assumed all software would be built for a
specific combination of manufacturer and chip generation, and software suppliers would happily maintain several such versions simultaneously.

John
From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Jun 19 16:16:05 2026

From Newsgroup: comp.arch

MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:
-----------------------

Of course, the challenge here is that programming is significantly
different from what we are used to, and you need a new type of OS as
well as new applications - and ideally, new programming languages.
That's a lot of big hurdles to clear even if the result is theoretically
more efficient. (And you still need to keep the fast single-threaded
cpus, and the fast SIMD / vector processing systems, for other kinds of
tasks.)

Possibly the biggest millstone around the neck of computing
architectures is the C language.

C is not an albatross !! it is a standard to which one designs--
exactly like air (our atmosphere) provides the standard to which
airplanes have to be designed.

C ended up being this model because its floor supports almost all other programming languages: certainly {Fortran, C++, Algol68, Pascal, Jovial,
...} and is not all that bad when doing {LISP, RPG, COBOL, APL, SNOBOL}.

For every processor that is not just
for highly niche code (like gpus), what matters is how fast C code can
run on it. Most other languages either use a similar model, or run on
VMs written in C.

You state that like it was a BAD thing--it is not. I just wish we had all chosen the same standards at which to design; {BE or LE} is like having the steering wheel on the {left or right}, ...

Why bother with support for multiple stacks, or other
interesting hardware innovations, if it doesn't support faster C?

I can argue that having 2 stacks {one for the preserved state from
caller to callee, the other for data} enables ever so slightly faster
C--but that is not the point--the point is robustness in the face of
threats (buffer overruns, ROP, malicious use of memory).

The speed advantage is by knowing that registers written to the
call-stack do not need to be written to L2 (or farther) if RET has occurred
and the cache line replaced. Saves a trifling of power, too. There is
another advantage when an EXIT instruction is still reading from the
stack and an ENTER instruction starts writing to the stack. We taught
the compiler a prescribed order to utilize the registers, so that when
an EXIT is running and an ENTER is decoded, the EXIT can be short
circuited and some of the ENTER short circuited, eliding cycle waste.

With
all due respect to Anton and other Forth enthusiasts, "fastest Forth
benchmarks" is not going to attract much investment money.

I'd love to see new architectures and new hardware features that are
genuinely different, but they rarely turn up.

My 66000 is replete with those features--and it is argued here daily
that it (my 66000 ISA) has gone too far !!

[Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere in memory at any time.

Nice to have.

Jump-Through-Table (switch) making PIC standard; while making the tables smaller {1/8th to 1/4th}

I used this idea, in its extreme form, when my 486 Word Count code had
the state variable in BL and loaded the next byte into BH: At this point
I could jump directly to the code BX was pointing to, so a 256*number of
main states (=2, inside or outside a word) => a 512-entry jump table.
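
For anyone who wants to see the shape of it, here is a C sketch of the same 2 states * 256 bytes = 512-entry idea. It is not the original 486 code: it uses a data table instead of a jump table, indexes by state*256+byte rather than packing the two into BL/BH, and all names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { OUTSIDE = 0, INSIDE = 1 };

/* One entry per (state, byte) pair: 2 states * 256 bytes = 512 entries. */
typedef struct {
    uint8_t next_state;
    uint8_t word_started;   /* 1 when this byte begins a new word */
} wc_entry;

static wc_entry wc_table[512];

static int is_space_byte(unsigned c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\v' || c == '\f';
}

/* Fill the table once before counting. */
static void wc_init(void)
{
    for (unsigned state = 0; state < 2; state++) {
        for (unsigned byte = 0; byte < 256; byte++) {
            wc_entry *e = &wc_table[state * 256 + byte];
            int space = is_space_byte(byte);
            e->next_state = space ? OUTSIDE : INSIDE;
            e->word_started = (uint8_t)(state == OUTSIDE && !space);
        }
    }
}

/* The inner loop has no data-dependent branch: one lookup per byte. */
static size_t wc_count_words(const unsigned char *buf, size_t len)
{
    size_t words = 0;
    unsigned state = OUTSIDE;
    for (size_t i = 0; i < len; i++) {
        const wc_entry *e = &wc_table[state * 256 + buf[i]];
        words += e->word_started;
        state = e->next_state;
    }
    return words;
}
```

Call wc_init() once, then wc_count_words() on each buffer. A jump-table version would replace the lookup with an indirect jump to one of 512 code fragments.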

When the Pentium turned up a few years later, branching got relatively
even costlier, so I got rid of every branch inside the 256-byte main processing loop.

Load IP instructions (CALX, JMPX, CALA, JMPA) enable control transfers directly through GOT (or other SW table).

Also nice to have.

Multi-line multi-instruction ATOMIC sequences freely available to SW.

Transcendental instructions that take FDIV number of cycles.

:-)

Context Switches performed without instruction execution--as if the
state of a thread was treated like a write-back cache.

Interrupt tables that can be used as a low level scheduler built into
the priority (and privilege) model with support for vVMs monitoring vMs.
One can schedule a DPC/softIRQ in 1 instruction that never fails (excepting
when the interrupt message takes an unrecoverable ECC failure between core
and table.)

Even with C programming,
there are so many things that could be made more efficient with a bit of
interesting hardware. (I say this with little knowledge of the
complications implied.) A lot of time in C code is spent in memory
allocation work - that's got to be a prime candidate for hardware
acceleration, especially if we can get away from the brutish malloc/free
approach.

And then C++ goes all 'new' on using memory...

(Stack-based allocators are one possibility for a lot of
allocations.)
There could be hardware support for threading, locking,
and inter-process communication.

My 66000 can switch threads in a single instruction.
My 66000 ESM provides unrealized synchronization capabilities.
My 66000 Interrupt tables provide single instruction message send and
single instruction message receives.
SW determines what the messages mean.

Separate data stacks and return stacks
would make things faster and more secure.

Already present. But in addition, My 66000 DRAM controller is free from RowHammer-like attack vectors--By Architecture of the memory hierarchy.

Plus, the code sees a 64-bit Virtual Address Space, while the system has
four 64-bit physical address spaces. The spaces are used to determine
the consistency model:: {
Cacheable DRAM is causally ordered and coherent and cached
unCacheable DRAM is sequentially consistent incoherent
ROM is unordered incoherent but cached
MMI/O is sequentially consistent incoherent
Config is strongly ordered incoherent
}

Fat pointers that can track
access modes and range limits, at least in common cases, would aid
reliability and security.

A bit Too "Cheri" for me.

:-)

The Mill is probably the closest to Cheri that is still in active
development.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jun 19 14:34:10 2026

From Newsgroup: comp.arch

Robert Swindells <rjs@fdy2.co.uk> posted:

On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp), and
SPARC contains some support for that, but as one of the IIRC Franz Lisp developers has explained in this newsgroup, they actually did not use
this feature, because the performance benefit was not big enough to
justify the complications of modifying their tagging architecture to
make use of that. However, in recent years AMD, ARM, and Intel have
added features to ignore the top (7,8, or 16) bits in every address (how many depends on the feature and the selected variant of the feature), probably to support pointer tagging in such programming languages. I am sure that no C need is behind this feature addition.

The architectural support for tagging in SPARC only avoided the need to
untag and tag integers in compiled code.

David Ungar's thesis on SOAR provided measurements of the impact of this
on benchmarks for Smalltalk.

The layout of having the tags in the bottom 2 bits of a 32 bit word works fine without architectural support, being able to turn on traps for unaligned data access helps though.

In 64 bit machines you can use the bottom three bits for Lisp tags but SPARC64 didn't provide instructions to work with this.

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

In previous discussions, I had tried to press Mitch to see if he could remember what kind of benchmarks they had run on the 88100 that showed
it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to
use 2 shifts to perform an extract; indexing was scaled:: both helped interpreters.

To me, the old SPEC li benchmark was a test of the speed of an interpreter written in C and doesn't say anything useful about how well a system would run Lisp that had been compiled to machine code.

There were well known (non SPEC) Lisp benchmarks at the time.

I'd love to see new architectures and new hardware features that are
genuinely different, but they rarely turn up.

I see lots of architectural features that are not or badly supported by
C, and so obviously are not designed for C.

The one architectural feature of Lisp Machines that I don't think has been carried forward was a multi-way switch instruction.

My 66000 has a jump-through-table instruction which performs the C-switch including range checks and default substitutions--doing the work of 4-5
normal instructions and allowing the jump table to be <short> integer data instead of pointers.
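
A rough C model of what such a dispatch does in software, for comparison. This is only a sketch and not the instruction's definition: C cannot express code offsets, so 1-byte indices into a small target array stand in for the short table entries, and the handlers and case range are made up for illustration.

```c
#include <stdint.h>

typedef int (*handler_fn)(int);

static int on_zero(int x)    { return x; }
static int on_small(int x)   { return x * 10; }
static int on_default(int x) { (void)x; return -1; }

/* The few distinct targets; index 0 is the default. */
static const handler_fn targets[] = { on_default, on_zero, on_small };

/* The per-case table holds 1-byte entries rather than 8-byte pointers.
   It covers cases 10..15; gaps point at the default (index 0). */
enum { CASE_LO = 10, CASE_COUNT = 6 };
static const uint8_t case_table[CASE_COUNT] = { 1, 2, 2, 0, 2, 0 };

static int dispatch(int x)
{
    /* One unsigned compare covers both x < CASE_LO and x > CASE_LO + CASE_COUNT - 1. */
    unsigned idx = (unsigned)x - (unsigned)CASE_LO;
    /* Out-of-range values get the default substituted. */
    uint8_t t = (idx < (unsigned)CASE_COUNT) ? case_table[idx] : 0;
    return targets[t](x);
}
```

The subtract, compare, default substitution, table load, and indirect transfer are the separate steps that the text above describes as being folded into one instruction.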

The rest of the MIT Lisp Machine microarchitecture was just a pipelined, three address, load/store one that provides another data point for the discussion from a few months ago on whether VAX could have been a RISC
using TTL chips.

From Michael S@already5chosen@yahoo.com to comp.arch on Fri Jun 19 18:16:16 2026

From Newsgroup: comp.arch

On Fri, 19 Jun 2026 16:16:05 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:
-----------------------

Of course, the challenge here is that programming is significantly
different from what we are used to, and you need a new type of OS
as well as new applications - and ideally, new programming
languages. That's a lot of big hurdles to clear even if the result
is theoretically more efficient. (And you still need to keep the
fast single-threaded cpus, and the fast SIMD / vector processing
systems, for other kinds of tasks.)

Possibly the biggest millstone around the neck of computing
architectures is the C language.

C is not an albatross !! it is a standard to which one designs--
exactly like air (our atmosphere) provides the standard to which
airplanes have to be designed.

C ended up being this model because its floor supports almost all
other programming languages: certainly {Fortran, C++, Algol68,
Pascal, Jovial, ...} and is not all that bad when doing {LISP, RPG,
COBOL, APL, SNOBOL}.

For every processor that is not
just for highly niche code (like gpus), what matters is how fast C
code can run on it. Most other languages either use a similar
model, or run on VMs written in C.

You state that like it was a BAD thing--it is not. I just wish we had
all chosen the same standards at which to design; {BE or LE} is like
having the steering wheel on the {left or right}, ...

Why bother with support for multiple stacks,
or other interesting hardware innovations, if it doesn't support
faster C?

I can argue that having 2 stacks {one for the preserved state from
caller to callee, the other for data} enables ever so slightly
faster C--but that is not the point--the point is robustness in the
face of threats (buffer overruns, ROP, malicious use of memory).

The speed advantage is by knowing that registers written to the
call-stack do not need to be written to L2 (or farther) if RET has
occurred and the cache line replaced. Saves a trifling of power,
too. There is another advantage when an EXIT instruction is
still reading from the stack and an ENTER instruction starts writing to
the stack. We taught the compiler a prescribed order to utilize the
registers, so that when an EXIT is running and an ENTER is decoded,
the EXIT can be short circuited and some of the ENTER short
circuited, eliding cycle waste.

With all due respect to Anton and other Forth enthusiasts, "fastest
Forth benchmarks" is not going to attract much investment money.

I'd love to see new architectures and new hardware features that
are genuinely different, but they rarely turn up.

My 66000 is replete with those features--and it is argued here daily
that it (my 66000 ISA) has gone too far !!

[Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere
in memory at any time.

Nice to have.

Jump-Through-Table (switch) making PIC standard; while making the
tables smaller {1/8th to 1/4th}

I used this idea, in its extreme form when my 486 Word Count code had
the state variable in BL and loaded the next byte into BH: At this
point I could jump directly to the code BX was pointing to, so a
256*number of main states (=2, inside or outside a word) => a
512-entry jump table.

When the Pentium turned up a few years later, branching got
relatively even costlier, so I got rid of every branch inside the
256-byte main processing loop.
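
A minimal sketch of the table-driven idea in portable C rather than 486 assembly. Instead of jumping through a 512-entry code table, it indexes a 2 x 256 data table with (state, byte), so the loop body has no data-dependent branch. The names `wc_init` and `wc_count` and the choice of separator characters are assumptions made for illustration; this is not the original Word Count code.

```c
#include <stddef.h>
#include <stdint.h>

/* next_state[state][byte]: state 0 = outside a word, 1 = inside a word.
   2 states * 256 byte values = 512 entries, as in the description above. */
static uint8_t next_state[2][256];

/* Fill the table once. Treating space, tab, CR, LF, VT and FF as
   separators is an assumption of this sketch. With only these two states
   both rows come out identical; the state index is kept to mirror the
   512-entry layout. */
static void wc_init(void)
{
    for (int s = 0; s < 2; s++) {
        for (int b = 0; b < 256; b++) {
            int sep = (b == ' ' || b == '\t' || b == '\n' ||
                       b == '\r' || b == '\v' || b == '\f');
            next_state[s][b] = (uint8_t)!sep;
        }
    }
}

/* Count words: a word starts on each 0 -> 1 state transition. */
static size_t wc_count(const unsigned char *buf, size_t len)
{
    size_t words = 0;
    unsigned state = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned next = next_state[state][buf[i]];
        words += next & (state ^ 1u);  /* 1 only when going from state 0 to 1 */
        state = next;
    }
    return words;
}
```

Call `wc_init()` once before the first `wc_count()`. Whether a compiler turns the loop into branch-free machine code depends on the target and optimization level.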

Load IP instructions (CALX, JMPX, CALA, JMPA) enable control
transfers directly through GOT (or other SW table).

Also nice to have.

Multi-line multi-instruction ATOMIC sequences freely available to
SW.

Transcendental instructions that take FDIV number of cycles.

:-)

"FDIV number of cycles" is a moving target. Mitch has a tendency of
using "Opteron" as his measurement stick. The question of what is
"number of cycles" is also not obvious. Single or double precision?
Latency or throughput?

Apple has single-cycle FDIV throughput since ~2019. That applies to
both scalar and 128bit SIMD variants of instruction.
So, for single-precision vector variant the throughput is 4 FDIV per
clock.

Intel has 4-cycle SP FDIV throughput (or 5-cycle by other sources) for
256-bit vectors since 2015. That's 2 SP FDIV per clock.

AMD started with 6-cycle 256-bit SP FDIV on Zen1.
It progressed to 3-3.5 cycles on Zen 2/3/4. Then on Zen5 they
progressed to 3 cycles per 512bit vector register. So, by now they are
at 5.33 SP FDIV per clock - ahead of Apple of 6 years ago.
I don't know where Apple stands right now.
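
The per-clock figures above are just lanes divided by reciprocal throughput. A small sketch of that arithmetic in C, with a made-up helper name (`fdiv_per_clock`); the inputs are the numbers quoted in this post, not independent measurements.

```c
/* Elements completed per clock for a packed divide:
   number of lanes in the vector divided by the reciprocal
   throughput (cycles between successive vector divides). */
static double fdiv_per_clock(int vector_bits, int element_bits,
                             double cycles_per_vector)
{
    int lanes = vector_bits / element_bits;
    return lanes / cycles_per_vector;
}
```

With 32-bit elements, `fdiv_per_clock(128, 32, 1.0)` is 4, `fdiv_per_clock(256, 32, 4.0)` is 2, and `fdiv_per_clock(512, 32, 3.0)` is 16/3, about 5.33.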

Somehow, I suspect that when Mitch says that his transcendental
instructions "take FDIV number of cycles" he does not mean that he can
run 5.33 transcendental instructions per clock.

Against DP rather than SP and latency rather than throughput, Mitch's
claim is probably closer to reality. But still...
Apple of 6 years ago had latency of DP FDIV = 10.
AMD has worst case latency = 13 since Zen1 (best latency used to be
faster, but on newer chips worst and best are the same).
Intel has worst case DP FDIV latency = 14 since Ivy Bridge (2012-04).
That's probably close to the date when Mitch started to consider His
66000.

Context Switches performed without instruction execution--as if the
state of a thread was treated like a write-back cache.

Interrupt tables that can be used as a low level scheduler built
into the priority (and privilege) model with support for vVMs
monitoring vMs. One can schedule a DPC/softIRQ in 1 instruction that
never fails (excepting when the interrupt message takes an
unrecoverable ECC failure between core and table.)

Even with C
programming, there are so many things that could be made more
efficient with a bit of interesting hardware. (I say this with
little knowledge of the complications implied.) A lot of time in
C code is spent in memory allocation work - that's got to be a
prime candidate for hardware acceleration, especially if we can
get away from the brutish malloc/free approach.

And then C++ goes all 'new' on using memory...

(Stack-based allocators are one possibility for a lot
of allocations.)
There could be hardware support for threading,
locking, and inter-process communication.

My 66000 can switch threads in a single instruction.
My 66000 ESM provides unrealized synchronization capabilities.
My 66000 Interrupt tables provide single instruction message send
and single instruction message receives.
SW determines what the messages mean.

Separate data stacks and return
stacks would make things faster and more secure.

Already present. But in addition, My 66000 DRAM controller is free
from RowHammer-like attack vectors--By Architecture of the memory hierarchy.

Plus, the code sees a 64-bit Virtual Address Space, while the
system has four 64-bit physical address spaces. The spaces are used
to determine the consistency model:: {
Cacheable DRAM is causally ordered and coherent and cached
unCacheable DRAM is sequentially consistent incoherent
ROM is unordered incoherent but cached
MMI/O is sequentially consistent incoherent
Config is strongly ordered incoherent
}

Fat pointers that can
track access modes and range limits, at least in common cases,
would aid reliability and security.

A bit Too "Cheri" for me.

:-)

The Mill is probably the closest to Cheri that is still in active development.

Terje

Didn't Ivan say that he likes capabilities, but decided that the Mill
already has too many innovative concepts as it is, and that including
capabilities would be too much?
Also, claiming that the Mill is still in active development sounds to me
like a stretch of the word "active".


From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Fri Jun 19 20:20:31 2026

From Newsgroup: comp.arch

On 2026-06-19 18:16, Michael S wrote:

On Fri, 19 Jun 2026 16:16:05 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:

[snip]

Fat pointers that can
track access modes and range limits, at least in common cases,
would aid reliability and security.

A bit Too "Cheri" for me.

:-)

The Mill is probably the closest to Cheri that is still in active
development.

Terje

Didn't Ivan say that he likes capabilities, but decided that the Mill
already has too many innovative concepts as it is, and that including
capabilities would be too much?

As I recall, Ivan used to say that he knew how to /build/ a capability machine, but did not know how to /sell/ it. I believe he meant that such
a machine would not run "normal" C/C++ code, at least not very well,
which would turn away many potential customers.


From John Levine@johnl@taugh.com to comp.arch on Fri Jun 19 18:59:47 2026

From Newsgroup: comp.arch

According to David Brown <david.brown@hesbynett.no>:

Possibly the biggest millstone around the neck of computing
architectures is the C language. ...

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

C killed off every memory model other than flat byte addressed memory.
Pointers are sort of typed, but any real C program does stuff like this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

or worse

typedef struct { // string with explicit length
    int len;
    char str[0];
} varstr;

varstr *p;
char *s = "swordfish";

// initialize p from s
p = (varstr *)malloc(sizeof(varstr)+strlen(s));
p->len = strlen(s);
strncpy(p->str, s, p->len);

so in practice pointers all have to be pointers to bytes or something
that can losslessly be converted to and from them.
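
For comparison, a sketch of the same idiom in standard C99, using a flexible array member (`char str[]`) in place of the zero-length array, which is a GNU extension. The helper name `varstr_new` is made up for this example. The point stands either way: `malloc` hands back untyped bytes, and the program decides afterwards what they are.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {          /* string with explicit length, no terminating NUL */
    size_t len;
    char   str[];         /* C99 flexible array member */
} varstr;

/* Allocate a varstr holding a copy of the NUL-terminated string s.
   Returns NULL if allocation fails; the caller releases it with free(). */
static varstr *varstr_new(const char *s)
{
    size_t n = strlen(s);
    varstr *p = malloc(sizeof *p + n);   /* header plus n payload bytes */
    if (p == NULL)
        return NULL;
    p->len = n;
    memcpy(p->str, s, n);                /* raw byte copy into the tail */
    return p;
}
```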

This evolution was certainly helped along by the horrible implementation
of segmented memory in the Intel 8086 and 286, which persuaded people
that segments are a plague to be avoided rather than a tool to make
programs more reliable.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 19 13:09:04 2026

From Newsgroup: comp.arch

On 6/19/2026 11:59 AM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

Possibly the biggest millstone around the neck of computing
architectures is the C language. ...

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

C killed off every memory model other than flat byte addressed memory. Pointers are sort of typed, but any real C program does stuff like this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

Fwiw, why all of the casts?
__________________
#include <stdio.h>
#include <stdlib.h>

struct foo
{
    int m_bar;
};

int main()
{
    struct foo* foo = malloc(sizeof(*foo));

    printf("foo = %p", (void*)foo); // cast needed for %p

    free(foo);

    return 0;
}
__________________

or worse

typedef struct { // string with explicit length
    int len;
    char str[0];
} varstr;

varstr *p;
char *s = "swordfish";

// initialize p from s
p = (varstr *)malloc(sizeof(varstr)+strlen(s));
p->len = strlen(s);
strncpy(p->str, s, p->len);

so in practice pointers all have to be pointers to bytes or something
that can losslessly be converted to and from them.

This evolution was certainly helped along by the horrible implementation
of segmented memory in the Intel 8086 and 286, which persuaded people
that segments are a plague to be avoided rather than a tool to make
programs more reliable.


From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 19 13:10:37 2026

From Newsgroup: comp.arch

On 6/19/2026 7:16 AM, Terje Mathisen wrote:
[...]

The Mill is probably the closest to Cheri that is still in active development.

How close are you guys to making a Mill processor?


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jun 19 22:03:38 2026

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> posted:

On Fri, 19 Jun 2026 16:16:05 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

MitchAlsup wrote:

---------------------

Transcendental instructions that take FDIV number of cycles.

:-)

"FDIV number of cycles" is a moving target. Mitch has a tendency of
using "Opteron" as his measurement stick. The question of what is
"number of cycles" is also not obvious. Single or double precision?
Latency or throughput?

One can use more/less HoBs of the fraction to index coefficient tables.
This enables a tradeoff of FMAC cycles with table size. Given one has
64-bit constant fractions of 2/e, 10/e, and 1072 bits of 2/pi::
{
For a coefficient table with size equal to that of Goldschmidt FDIV+
SQRT tables:: one can implement {Ln2, Ln, LOG, Exp2, Exp, Exp10, SIN,
COS, TAN, ASIN, ACOS, ATAN, POW, POWI, ATAN2} that use 9 multiplies
and 8 adds to achieve 17-cycle Transcendentals. This number is similar
to FDIV at Opteron times.

POW is basically 2 transcendentals with a 64-bit fraction multiply in
the middle. I see 38-cycles as typical.

ATAN2 is ½ the time 1 transcendental and a 64-bit fraction multiply, and
the other ½ the time 2 transcendentals and a 64-bit fraction multiply.
So, typically ~26-cycles.

And ROUGHLY::
For every doubling (2×) of table size one saves 1 multiply and 1 add.
So, a quad-sized table could perform these in 13-cycles, or a quarter
(¼×) sized table could perform at 21-cycles.
}
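
The table-size tradeoff above reduces to simple arithmetic. A toy restatement in C, using only the figures from this post (17 cycles at the baseline table size, 2 cycles saved per doubling); the function name is made up, and this is not a model of any actual hardware.

```c
/* Rule-of-thumb latency for one FP64 transcendental, per the figures in
   this post: 17 cycles (9 multiplies + 8 adds) at the baseline table size,
   and one multiply plus one add (2 cycles) saved per doubling of the table.
   log2_scale = 0 is the baseline, 2 is a 4x table, -2 is a 1/4x table.
   Only meaningful for small |log2_scale|. */
static int transcendental_cycles(int log2_scale)
{
    const int baseline = 17;
    const int saved_per_doubling = 2;
    return baseline - saved_per_doubling * log2_scale;
}
```

`transcendental_cycles(2)` gives the 13 cycles quoted for a quad-sized table, and `transcendental_cycles(-2)` the 21 cycles for a quarter-sized one.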

Apple has single-cycle FDIV throughput since ~2019. That applies to
both scalar and 128bit SIMD variants of instruction.

What is the latency of each step in:
FDIV Rd,---,--
FDIV Rd,Rs1,Rd
FDIV Rd,Rs2,Rd
// until blue in the face
???

So, for single-precision vector variant the throughput is 4 FDIV per
clock.

Intel has 4-cycle SP FDIV throughput (or 5-cycle by other sources) for 256-bit vectors since 2015. That's 2 SP FDIV per clock.

All of the numbers I have talked about (above and before) are FP64.
FP32 numbers are roughly 2/3rds the cycle count.

AMD started with 6-cycle 256-bit SP FDIV on Zen1.
It progressed to 3-3.5 cycles on Zen 2/3/4. Then on Zen5 they
progressed to 3 cycles per 512bit vector register. So, by now they are
at 5.33 SP FDIV per clock - ahead of Apple of 6 years ago.
I don't know where Apple stands right now.

Somehow, I suspect that when Mitch says that his transcendental
instructions "take FDIV number of cycles" he does not mean that he can
run 5.33 transcendental instructions per clock.

Given K SIMD vector of FMAC units, I can run {K double, 2×K single, 4×K
half} transcendentals in {quoted, 2/3×quoted, 4/9×quoted}-cycles.

Against DP rather than SP and latency rather than throughput, Mitch's
claim is probably closer to reality. But still...

Apple of 6 years ago had latency of DP FDIV = 10.
AMD has worst case latency = 13 since Zen1 (best latency used to be
faster, but on newer chips worst and best are the same).
Intel has worst case DP FDIV latency = 14 since Ivy Bridge (2012-04).
That's probably close to the date when Mitch started to consider His
66000.

It's all a tradeoff between table size and loop iterations.

-------------

Fat pointers that can
track access modes and range limits, at least in common cases,
would aid reliability and security.

A bit Too "Cheri" for me.

:-)

The Mill is probably the closest to Cheri that is still in active development.

Terje

Didn't Ivan say that he likes capabilities, but decided that the Mill
already has too many innovative concepts as it is, and that including
capabilities would be too much?

Given some thought, adding capabilities is acceptable, so long as one
can support a flat VAS of at least 64-bits. In effect, capability
pointers add the capability stuff in another 64-bit container attached
to the pointer.
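
One way to picture this in C: an ordinary 64-bit flat virtual address paired with a second 64-bit word carrying the capability metadata. The field layout below (16 permission bits, 48 bits of remaining length, upper bound only) and every name in it are invented for illustration; this is not CHERI's encoding, nor the Mill's, nor My 66000's.

```c
#include <stdbool.h>
#include <stdint.h>

enum { CAP_READ = 1u << 0, CAP_WRITE = 1u << 1 };  /* hypothetical permission bits */

typedef struct {
    uint64_t addr;   /* ordinary 64-bit flat virtual address */
    uint64_t meta;   /* [63:48] permissions, [47:0] bytes remaining from addr */
} cap_ptr;

#define CAP_LEN_MASK ((UINT64_C(1) << 48) - 1)

static cap_ptr cap_make(uint64_t addr, uint64_t len, unsigned perms)
{
    cap_ptr c = { addr,
                  ((uint64_t)(perms & 0xFFFFu) << 48) | (len & CAP_LEN_MASK) };
    return c;
}

static uint64_t cap_remaining(cap_ptr c) { return c.meta & CAP_LEN_MASK; }
static unsigned cap_perms(cap_ptr c)     { return (unsigned)(c.meta >> 48); }

/* May nbytes at c.addr be accessed with the permissions in `need`? */
static bool cap_allows(cap_ptr c, uint64_t nbytes, unsigned need)
{
    return (cap_perms(c) & need) == need && nbytes <= cap_remaining(c);
}

/* Advance the pointer by delta bytes; the remaining range shrinks to match.
   Stepping past the end leaves zero bytes accessible. */
static cap_ptr cap_advance(cap_ptr c, uint64_t delta)
{
    uint64_t rem  = cap_remaining(c);
    uint64_t left = delta <= rem ? rem - delta : 0;
    return cap_make(c.addr + delta, left, cap_perms(c));
}
```

The address half stays a plain 64-bit value that existing code can treat as a flat pointer; only the checks consult the second word. A real design would also need a lower bound and a way to keep the metadata from being forged, which this sketch leaves out.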

Also, claiming that the Mill is still in active development sounds to me
like a stretch of the word "active".


From quadibloc@quadibloc@invalid.com (John Savard) to comp.arch on Fri Jun 19 22:43:28 2026

From Newsgroup: comp.arch

On Fri, 19 Jun 2026 09:24:43 +0200, David Brown
<david.brown@hesbynett.no> wrote:

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

This is not necessarily a bad thing - it lets everyone get stuff done.
But it means that we are stuck on a local maximum. If there is a better
way out there somewhere, it would be a long and arduous journey to get
there.

I might well be willing to concede that C does have its flaws. But
these are flaws it has _as a programming language_, and not flaws that
have affected the design of computers. Why do I say this?

Because while C, as a procedural language, took a lot from its
predecessors - its punctuation came from PL/I - it was designed to not
be too different from the underlying hardware. It had pointers to
memory, which most programming languages up until then did not bother
with, in order to be able to substitute for assembler language.

If, instead, a "better" language, like ALGOL or Pascal, had become the "standard", we might have ended up with computers like the Burroughs
B6500 or the Intel 432. That would indeed have been a situation where
computers were less efficient and powerful because they were designed
around the peculiarities of the most-used higher-level language.

John Savard

From quadibloc@quadibloc@invalid.com (John Savard) to comp.arch on Fri Jun 19 22:49:53 2026

From Newsgroup: comp.arch

On Thu, 18 Jun 2026 14:39:07 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

Back when logic and memory were more evenly matched, computers still
had one register - the accumulator. And instructions basically did
arithmetic between memory and the accumulator. Of course,
memory-to-memory operations were also possible without even an
accumulator.
But since memory isn't likely to get that fast, it's not really useful
to think of how to design for something that can't happen.

John Savard

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Jun 20 00:07:03 2026

From Newsgroup: comp.arch

quadibloc@invalid.com (John Savard) writes:

On Thu, 18 Jun 2026 14:39:07 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

Back when logic and memory were more evenly matched, computers still
had one register - the accumulator. And instructions basically did
arithmetic between memory and the accumulator. Of course,
memory-to-memory operations were also possible without even an
accumulator.

And some computers in those days simply used memory to memory operations exclusively without needing an accumulator.

Memory superscalar/OoO require a ROB that works with addresses rather than registers (perhaps CAM based); the size of the ROB is still limited
to the degree of OoO.

That noted, it seems to me that if access to all of memory costs
the same as access to a register, the need for OoO support in
the core becomes less interesting when the normal delays for which
instruction-level parallelism helps don't apply
(e.g. cache misses, NUMA latency, etc).

But since memory isn't likely to get that fast, it's not really useful
to think of how to design for something that can't happen.

Never say never.

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Jun 20 00:09:48 2026

From Newsgroup: comp.arch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/19/2026 11:59 AM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

Possibly the biggest millstone around the neck of computing
architectures is the C language. ...

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

C killed off every memory model other than flat byte addressed memory.
Pointers are sort of typed, but any real C program does stuff like this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

Fwiw, why all of the casts?

C and C++ handle void* conversions differently. You must cast
the malloc result to a pointer of the declared type when using C++.

It doesn't hurt to add the cast in C, and may help with documenting
the intention of the programmer who wrote the code.


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 01:01:29 2026

From Newsgroup: comp.arch

quadibloc@invalid.com (John Savard) posted:

On Fri, 19 Jun 2026 09:24:43 +0200, David Brown
<david.brown@hesbynett.no> wrote:

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

This is not necessarily a bad thing - it lets everyone get stuff done.
But it means that we are stuck on a local maximum. If there is a better
way out there somewhere, it would be a long and arduous journey to get
there.

I might well be willing to concede that C does have its flaws. But
these are flaws it has _as a programming language_, and not flaws that
have affected the design of computers. Why do I say this?

Because while C, as a procedural language, took a lot from its
predecessors - its punctuation came from PL/I

I think it comes closer to the Algol line of languages; but more
like BCPL, Bliss, and B; than contemporaneous others.

PL/1 has things like <well> like::

struct {type var, var1;
like other_struct}; // so you don't have to type it all in again

- it was designed to not
be too different from the underlying hardware. It had pointers to
memory,

PL/1's most useful memory trick was using an area. So, one could 'malloc'
a bunch of data, and then free it all with one free! Nothing prevents C
from doing this, but C++ has new and new is not compatible with area.
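
A minimal sketch of that idea in C: a bump-pointer arena where many allocations are released by a single `free`. The names (`arena`, `arena_init`, `arena_alloc`, `arena_free_all`) are invented for this example, and it makes no claim to match PL/1 AREA semantics beyond "allocate many, free once". It assumes a C11 compiler for `max_align_t` and `_Alignof`.

```c
#include <stddef.h>
#include <stdlib.h>

/* A bump-pointer arena: many allocations, one free. */
typedef struct {
    unsigned char *base;   /* start of the backing block */
    size_t         size;   /* total bytes in the block */
    size_t         used;   /* bytes handed out so far */
} arena;

/* Returns 0 on success, -1 if the backing malloc fails. */
static int arena_init(arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Carve n bytes out of the arena, aligned for any object type.
   Returns NULL when the arena is exhausted. */
static void *arena_alloc(arena *a, size_t n)
{
    const size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) / align * align;
    if (start > a->size || n > a->size - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* Release every allocation at once. */
static void arena_free_all(arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

Individual objects are never freed on their own, which is what makes the allocator this small; it suits data whose lifetimes all end together.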

which most programming languages up until then did not bother
with, in order to be able to substitute for assembler language.

If, instead, a "better" language, like ALGOL

Algol was ruined with its parameter passing in 'thunks' and strict
1-file compilation.

or Pascal, had become the "standard", we might have ended up with computers like the Burroughs
B6500 or the Intel 432.

That is one bullet we dodged!!

That would indeed have been a situation where computers were less efficient and powerful because they were designed
around the peculiarities of the most-used higher-level language.

Instead, C presents a programming model way down at the vonNeumann level::
1 unit of work (step) at a time.

John Savard


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 01:06:10 2026

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) posted:

quadibloc@invalid.com (John Savard) writes:

On Thu, 18 Jun 2026 14:39:07 GMT, scott@slp53.sl.home (Scott Lurndal)
wrote:

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

Back when logic and memory were more evenly matched, computers still
had one register - the accumulator. And instructions basically did
arithmetic between memory and the accumulator. Of course,
memory-to-memory operations were also possible without even an
accumulator.

And some computers in those days simply used memory to memory operations exclusively without needing an accumulator.

Memory superscalar/OoO require a ROB that works with addresses rather than registers (perhaps CAM based); the size of the ROB is still limited
to the degree of OoO.

That noted, it seems to me that if access to all of memory costs
the same as access to a register, the need for OoO support in
the core becomes less interesting when the normal delays for which
instruction-level parallelism helps don't apply
(e.g. cache misses, NUMA latency, etc).

With the execution window being architected to absorb latency
AND
memory being longer latency than registers,

The size of the RoB would have to be larger, and the size of each field
being compared grows from 8-bits (256 dynamic names) to <what> 64-bits::
the RoB, RAT, and other structures get "out of hand" quickly.

But since memory isn't likely to get that fast, it's not really useful
to think of how to design for something that can't happen.

Never say never.

"When your Register file is as big as your cache,
register access will be as slow as your cache" Andy Glew.

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 01:09:28 2026

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) posted:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/19/2026 11:59 AM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

Possibly the biggest millstone around the neck of computing
architectures is the C language. ...

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

C killed off every memory model other than flat byte addressed memory.
Pointers are sort of typed, but any real C program does stuff like this:
p = (struct foo *) malloc(42 * sizeof(struct foo));

Fwiw, why all of the casts?

C and C++ handle void* conversions differently. You must cast
the malloc result to a pointer of the declared type when using C++.

This reminds me of the obfuscated C contest way back::
There was an entry that one could compile as {C, Fortran, Pascal,
and some other language}, and the source would compile into the same
object module from each.

It doesn't hurt to add the cast in C, and may help with documenting
the intention of the programmer who wrote the code.


From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sat Jun 20 11:09:46 2026

From Newsgroup: comp.arch

On 2026-06-20 4:01, MitchAlsup wrote:

quadibloc@invalid.com (John Savard) posted:

On Fri, 19 Jun 2026 09:24:43 +0200, David Brown
<david.brown@hesbynett.no> wrote:

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

This is not necessarily a bad thing - it lets everyone get stuff done.
But it means that we are stuck on a local maximum. If there is a better
way out there somewhere, it would be a long and arduous journey to get
there.

I might well be willing to concede that C does have its flaws. But
these are flaws it has _as a programming language_, and not flaws that
have affected the design of computers. Why do I say this?

Because while C, as a procedural language, took a lot from its
predecessors - its punctuation came from PL/I

[snip irrelevant syntax discussion]

- it was designed to not
be too different from the underlying hardware.

The underlying hardware *of that time*. Therefore, it may have
contributed to "locking in" that style of hardware. But I do not pretend
to know that a different style of hardware would be better today.

PL/1's most useful memory trick was using an area. So, one could 'malloc'
a bunch of data, and then free it all with one free! Nothing prevents C
from doing this, but C++ has new and new is not compatible with area.

(Mostly irrelevant to the suggestion that C is an "albatross", but Ada provides such areas in the form of "storage pools" which can also
contain "subpools". One can allocate objects in a pool or subpool and
then deallocate the whole pool or subpool at once.)

If, instead, a "better" language, like ALGOL

Algol was ruined with its parameter passing in 'thunks' and strict
1-file compilation.

The 1-file compilation was an implementation issue. For example,
Burroughs Algol supported (and no doubt still supports) separate
compilation of subprograms. (IIRC, even the paper-tape-based HP Algol
for the HP2100 series did that.) Algol can do pass-by-value, and the alternative pass-by-name method could have been deprecated and removed
as the language evolved, or reduced to pass-by-reference, if it was
judged to be an obstruction.

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

That is one bullet we dodged!!

Well, who knows what would have resulted if a similar amount of effort
had been made to improve such computers, as has been made to improve the currently conventional style of computers.

Algol, Pascal, Ada, etc. can be compiled as well for the currently conventional style of computers as for B6500-style computers, while
compiling C for a B6500-style computer limits the compiler to use a
subset of the hardware functions, as I have understood it, because of
the usual assumption in C programs of a single flat memory space.


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 20 10:33:58 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

"When your Register file is as big as your cache,
register access will be as slow as your cache" Andy Glew.

Looking at https://www.guru3d.com/data/publish/223/54520bdd20560bcbc963979637025fd69682f6/afnviogfwsce6yxo.webp

It seems that the vector register file is about as large as the
D-cache ("L1d$"). The I-cache seems to be only a little larger than
the integer register file, and is smaller than 1/4 of the vector
register file.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 20 10:46:59 2026

From Newsgroup: comp.arch

jgd@cix.co.uk (John Dallman) writes:

In article <2026Jun19.080216@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

once upon a time Sun boasted architectural features to support
Java. AFAIK these were features to improve the indirect-branch
performance in the Java VM interpreter, to improve the startup
performance.

Did this become obsolete when Java runtime environments switched to
JITing to native code?

I don't remember in which SPARC this feature was added, but IIRC it
was after the introduction of the HotSpot Java VM implementation.
Note that HotSpot uses an interpreter on startup and on the cold
methods, and only compiles hot methods to native code after executing
them for a while and collecting execution statistics.

They don't, but they don't do a good job of making those features usable
either. Support for new instructions is readily provided via intrinsics,
but those aren't portable.

Yes. But with a programming language like C, what is the alternative?

GNU C supports a vector extension; I don't know how fast Intel added
support for AVX, AVX2, and the various AVX-512 variants to gcc and
llvm (which also supports this extension); certainly recent gcc and
clang versions use AVX-512 if you press the right buttons.
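
For readers who have not seen the GNU C vector extension, a minimal sketch follows. It needs GCC or Clang, since `vector_size` is not ISO C; the function name `saxpy4` is made up for the example.

```c
/* Requires GCC or Clang: vector_size is a GNU C extension, not ISO C. */
typedef float v4sf __attribute__((vector_size(16)));   /* 4 x float */

/* r = a*x + y on four lanes at once. The source names no instruction set;
   the compiler chooses SIMD or scalar instructions for the selected target. */
static v4sf saxpy4(float a, v4sf x, v4sf y)
{
    v4sf va = { a, a, a, a };   /* broadcast the scalar into all lanes */
    return va * x + y;          /* element-wise multiply and add */
}
```

Wider types are declared the same way, e.g. `vector_size(32)` or `vector_size(64)`. Whether the compiler then emits AVX2 or AVX-512 instructions for them depends on target options such as `-mavx2` or `-mavx512f`; without them it splits the operation into narrower pieces.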

Fortran supports the array sublanguage which AFAIU makes vectorization
easy within an expression. But as Thomas Koenig tells us, gcc's
Fortran front end translates that into scalar code and relies on auto-vectorization in the back end to produce vectorized code.
Intel's Fortran compiler ifort has been replaced by something that
uses IIRC LLVM as back end.

And finally we have auto-vectorization, where it is a matter of luck
whether the scalar code is vectorized or not (e.g., I have code that
one compiler auto-vectorizes with -Os, but not -O3, and another
compiler that autovectorizes with -O3, but not -Os).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sat Jun 20 14:25:41 2026

From Newsgroup: comp.arch

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he could
remember what kind of benchmarks they had run on the 88100 that showed
it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to use
2 shifts to perform an extract; indexing was scaled:: both helped interpreters.
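
As a rough illustration of the extract point, here is a sketch in C. The function names are made up for this example, and which instructions a compiler actually emits depends on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned extract of `width` bits starting at bit `offset` (bit 0 = LSB),
   done the two-shift way: move the field to the top, then back down. */
static uint32_t extract_two_shifts(uint32_t x, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 32);
    return (x << (32 - offset - width)) >> (32 - width);
}

/* The same result written as one conceptual "extract" operation, which an
   ISA with a bit-field extract can perform in a single instruction. */
static uint32_t extract_field(uint32_t x, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 32);
    uint32_t mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
    return (x >> offset) & mask;
}
```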

Production Lisp environments are not interpreters, even back then.


From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Jun 20 17:26:43 2026

From Newsgroup: comp.arch

Niklas Holsti wrote:

On 2026-06-19 18:16, Michael S wrote:

On Fri, 19 Jun 2026 16:16:05 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:

[snip]

Fat pointers that can
track access modes and range limits, at least in common cases,
would aid reliability and security.

A bit Too "Cheri" for me.

:-)

The Mill is probably the closest to Cheri that is still in active
development.

Terje

Didn't Ivan say that he likes capabilities, but decided that the Mill
already has too many innovative concepts as it is, so including
capabilities would be too much?

As I recall, Ivan used to say that he knew how to /build/ a capability
machine, but did not know how to /sell/ it. I believe he meant that such
a machine would not run "normal" C/C++ code, at least not very well,
which would turn away many potential customers.

The most Cheri-like feature of the Mill is probably the hardware
byte-granularity security, with the user's ability to create subsets in
size and/or access rights.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Jun 20 17:27:59 2026

From Newsgroup: comp.arch

Chris M. Thomasson wrote:

On 6/19/2026 7:16 AM, Terje Mathisen wrote:
[...]

The Mill is probably the closest to Cheri that is still in active
development.

How close are you guys to making a Mill processor?

I don't know, I'm just the FP emulation guy in the project. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 20 17:33:13 2026

From Newsgroup: comp.arch

Robert Swindells <rjs@fdy2.co.uk> writes:

The architectural support for tagging in SPARC only avoided the need to
untag and tag integers in compiled code.

David Ungar's thesis on SOAR provided measurements of the impact of this
on benchmarks for Smalltalk.

The layout of having the tags in the bottom 2 bits of a 32 bit word works
fine without architectural support, being able to turn on traps for
unaligned data access helps though.
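
A minimal sketch of the low-bit tagging layout described above, assuming tag value 0 means "fixnum" and a two's complement int32_t. The tag assignments and function names are hypothetical and not taken from any particular Lisp system.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the low 2 bits of a 32-bit word are the tag, and
   tag 0 means "fixnum", so a fixnum n is stored as n * 4. */
typedef uint32_t lispval;
enum { TAG_MASK = 3, TAG_FIXNUM = 0, TAG_CONS = 1 };

static lispval make_fixnum(int32_t n)  { return (uint32_t)n << 2; }
static int     is_fixnum(lispval v)    { return (v & TAG_MASK) == TAG_FIXNUM; }
/* Exact because the low bits are zero; assumes two's complement int32_t. */
static int32_t fixnum_value(lispval v) { return (int32_t)v / 4; }

/* Adding two tagged fixnums needs no untag/retag: 4a + 4b == 4(a + b). */
static lispval fixnum_add(lispval a, lispval b)
{
    assert(is_fixnum(a) && is_fixnum(b));
    return a + b;
}

/* With a nonzero pointer tag such as TAG_CONS, a field access subtracts the
   expected tag from the address. If the value carries a different tag, the
   address is misaligned, so an alignment trap doubles as a type check. */
```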

Actually IA-32 (since the 486) and AMD64 have a bit for turning on
traps on unaligned accesses, but unfortunately there is too much software in
libraries that performs unaligned accesses.

On IA-32 that's the result of aligning FP numbers to 4-byte boundaries
in the ABI, while the hardware requires 8-byte alignment (both
hardware and ABI come from Intel).

On AMD64 that's a result of statement-level auto-vectorization; e.g.,
two bytes get loaded as one 16-bit value even if the resulting access
is not naturally aligned.

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

Then I misremembered the Lisp system's name. But it was one whose name
I had already read before.

To me, the old SPEC li benchmark was a test of the speed of an interpreter
written in C and doesn't say anything useful about how well a system would
run Lisp that had been compiled to machine code.

It also does not say anything useful about the speed of a
high-performance interpreter. See Figure 5 of <https://jilp.org/vol5/v5paper12.pdf>.

The one architectural feature of Lisp Machines that I don't think has been
carried forward was a multi-way switch instruction.

What you get today is indirect branches with good predictors, and
conditional branches that take 0 cycles if not taken and correctly
predicted, and 1 cycle if taken and correctly predicted. I expect they are pretty good for implementing multi-way switches.

The rest of the MIT Lisp Machine microarchitecture was just a pipelined,
three address, load/store one that provides another data point for the
discussion from a few months ago on whether VAX could have been a RISC
using TTL chips.

Hmm, but IIRC the architecture was more in the direction of "closing
the semantic gap".

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 14:02:56 2026

From Newsgroup: comp.arch

On 6/19/2026 2:24 AM, David Brown wrote:

On 19/06/2026 00:37, MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:
-----------------------

<snip>

Fat pointers that can track
access modes and range limits, at least in common cases, would aid
reliability and security.

A bit Too "Cheri" for me.

I think "Cheri" took it too far. I believe there is scope for tagging a bit of information onto pointers without trying to do everything.

I also think a lot can be done on the side of programming languages and tools, which could catch far more possible pointer mistakes. That won't stop the bad guys, of course, but I think more bad accesses are from
bugs than hackers.

Agreed, this is a route I experimented with.

A basic bounds-checking mechanism can help with debugging and security.
One option here is, say, using pointer tagging bits to encode
bounds-check information and then having the compiler emit instructions
to detect (roughly) when an access has gone out-of-bounds.
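
A minimal sketch of one way such a scheme could look. It assumes a 48-bit address space so that the upper 16 bits of a 64-bit pointer are free, and objects that are power-of-two sized and size-aligned. This encoding and the tp_ names are assumptions for illustration only, not the encoding actually used in the post; addresses are modelled as plain integers so the example runs anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 48 address bits, with the upper 16 bits holding log2 of
   the size of a power-of-two-sized, size-aligned object. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

typedef uint64_t tagged_ptr;

static tagged_ptr tp_make(uint64_t base, unsigned log2_size)
{
    assert(log2_size < ADDR_BITS);
    assert((base & ~ADDR_MASK) == 0);
    assert((base & ((UINT64_C(1) << log2_size) - 1)) == 0); /* size-aligned */
    return base | ((uint64_t)log2_size << ADDR_BITS);
}

/* The check a compiler could emit before an access at p + offset: the
   address bits above the size field must not change. */
static int tp_in_bounds(tagged_ptr p, int64_t offset)
{
    unsigned log2_size = (unsigned)(p >> ADDR_BITS) & 63u;
    uint64_t addr   = p & ADDR_MASK;
    uint64_t target = addr + (uint64_t)offset;
    return ((addr ^ target) >> log2_size) == 0;
}
```

Because sizes are rounded up to a power of two, the check is only approximate for objects of other sizes, which matches the "(roughly)" above.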

Extending it to the scope CHERI did adds new problems:
Adds significant implementation overhead;
Interferes with C programming practices;
...

And with a glaring weakness:
By its design, it is theoretically incapable of by itself forming a
sandbox capable of stopping actively hostile code.

It *could* still make it a PITA for human programmers to break out of,
but if a determined human programmer (or an AI assisted one) could put
in the work and break out of it via convoluted pointer de-referencing
(and if this break-ability is likely necessary for things like the C
runtime to be able to work), this is a weak point.

And, if it can't lock down security against actively hostile code, then
its more heavy-handed aspects are no longer justifiable.

Meanwhile, if the task is subdivided, some similar benefits can be
realized more cheaply:
Bounds checked pointers to trap on out-of-bounds access;
ASLR to make it much harder for shell-code to know where anything is;
Tagging to make it harder to stomp the link register;
Any attack attempt will need to know an N-bit magic number.
...

Or, if code sandboxing is needed, MMU based trickery is known to be effective.
Code can't access something if it is not accessible within the current
address space;
It can't gain additional access unless either CPU instructions exist or SYSCALL trickery would allow for some level of privilege escalation.

This is sort of the route I had been (gradually) investigating (though,
using the VUGID/ACLID system within a SAS rather than separate address spaces).

A similar mechanism can be used both for intra-process calls across
security domains, and for inter-process communication:
Wrapping interfaces in COM-like objects and routing the method calls
over the SYSCALL mechanism.

Had considered the possibility of some lower overhead mechanisms (such
as a "Secure Execute" mode that could allow for very localized CPU
privilege escalation), but demoted these ideas, mostly because I
realized that they were still exploitable (and by nature using them
would leave a security risk).

Like, even if the untrusted code can't itself get access to privileged
instructions, that doesn't mean it can't try to use the security-transfer
mechanism as a widget to try to grant itself access to other parts of
the address space (such as by forging cross-domain COM-like objects
that transfer control back to itself).

To enforce things, the privilege transfer can't exist within the scope
of the memory that the unprivileged code has access to, which in this
case, effectively means it needs to be abstract handles that only the
kernel or similar can deal with.

...

Ironically, all this has less visible surface area at the ISA level.
Like, still works with just plain RISC-V code or similar. Like, ideally,
the code should not need to be able to see the bars of the jail in which
it exists (or that, just outside of its reach, there exists a register
that if it could somehow zero-out, it would gain access to the entire
address space; but that it is limited only to its own code and data
sections and any local heap, say, with even its own C runtime access
being possibly via a concealed COM-like object).

For untrusted/sandboxed code, could maybe also make sense for there to
be an OS feature to disallow true OS syscalls (so, unlike the normal user-mode; it isn't allowed to talk to the OS directly, but only to
objects supplied by the host application).

Granted, one could maybe argue that a mechanism where
cross-security-domain calls need to go across COM objects or similar is
a bit heavy-handed. Or, for that matter, there is the question of what
exactly to call these things (which are mostly like COM objects, but do
not involve MS's tools or APIs).

Well, or if it is more the:
(*pVt)->SomeMethod(pVt, ...);
Aspect that matters.
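
A minimal in-process sketch of that calling convention in C. The ICounter interface and all function names are hypothetical; in the design described above the call would be routed through a proxy over the SYSCALL mechanism, which this example does not attempt to show.

```c
#include <assert.h>
#include <stdlib.h>

/* An interface is a table of function pointers; an object handle points at
   a slot that holds the table pointer. All names here are hypothetical. */
typedef struct ICounterVt ICounterVt;
typedef const ICounterVt *ICounter;       /* the handle type is ICounter * */

struct ICounterVt {
    int  (*Add)(ICounter *self, int n);   /* returns the new total */
    void (*Release)(ICounter *self);
};

/* Private implementation; the vtable pointer must be the first member. */
typedef struct { const ICounterVt *vt; int total; } CounterImpl;

static int counter_add(ICounter *self, int n)
{
    assert(self != NULL);
    CounterImpl *c = (CounterImpl *)self;
    return c->total += n;
}

static void counter_release(ICounter *self) { free(self); }

static const ICounterVt counter_vt = { counter_add, counter_release };

static ICounter *counter_create(void)
{
    CounterImpl *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->vt = &counter_vt;
    c->total = 0;
    return &c->vt;                        /* caller only sees the vtable slot */
}
```

The caller never sees CounterImpl, only the handle, so the object behind it could just as well be a stub that forwards the call elsewhere.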

Side note:
Could in theory route it over individual function pointers; it is just
that the mechanisms in question would in effect be too heavyweight to
justify using them to map function pointers (an object can effectively
map a whole big group of function pointers for the same cost as a single
pointer in a callback-marshaling approach; and objects are also somewhat
less of a pain to deal with than a "WhateverGetProcAddress()" style
mechanism).

...


From John Levine@johnl@taugh.com to comp.arch on Sat Jun 20 19:07:15 2026

From Newsgroup: comp.arch

According to Niklas Holsti <niklas.holsti@tidorum.invalid>:

- [C] was designed to not
be too different from the underlying hardware.

The underlying hardware *of that time*. Therefore, it may have
contributed to "locking in" that style of hardware. But I do not pretend
to know that a different style of hardware would be better today.

C evolved from B which had a memory model that addressed words, which made sense for a lot of the computers of the 1960s. I gather the earliest
versions of C were on the GE 635 which was a 36 bit word addressed machine
but when it moved to the byte addressed PDP-11, dmr had to add typed pointers so it could do something reasonable with pointers to character strings vs pointers to words.

I think that with or without C, flat byte addressed memory would have won out due to the success of S/360 and the PDP-11, both of which were programmed
in lots of languages other than C.

Algol was ruined with its parameter passing in 'thunks' and strict
1-file compilation.

The 1-file compilation was an implementation issue. For example,
Burroughs Algol supported (and no doubt still supports) separate
compilation of subprograms. (IIRC, even the paper-tape-based HP Algol
for the HP2100 series did that.) Algol can do pass-by-value, and the
alternative pass-by-name method could have been deprecated and removed
as the language evolved, or reduced to pass-by-reference, if it was
judged to be an obstruction.

Call by name and thunks were a mistake. The Algol committee was trying
to write an elegant description of call by reference and only when
Jensen's device came along did they realize what they'd done. Alan
Perlis, who was on the Algol committee, told me this. Then when they
tried to fix Algol 60, the committee was hijacked by people who
produced Algol 68 which was quite a good language, but was defined so
obscurely that people wrongly assumed it was hard to learn and use.
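
For readers who have not met Jensen's device: with call by name, an argument expression is re-evaluated every time the callee uses it, so a routine that changes i sees a fresh value of a term such as a[i]*b[i] on each use. C has no call by name, so the sketch below spells out the thunk by hand as a function pointer plus an environment. All names (sum_by_name, dot_env, dot_term) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A thunk: code plus the environment it needs, re-run on every use. */
typedef double (*thunk_fn)(void *env);

/* Algol-style sum(i, lo, hi, term) with i and term passed "by name":
   i is reached through a pointer, term is re-evaluated via its thunk. */
static double sum_by_name(int *i, int lo, int hi, thunk_fn term, void *env)
{
    double s = 0.0;
    assert(i != NULL && term != NULL);
    for (*i = lo; *i <= hi; ++*i)
        s += term(env);
    return s;
}

/* Environment and thunk for the expression a[i] * b[i]. */
struct dot_env { int i; const double *a, *b; };

static double dot_term(void *env)
{
    struct dot_env *e = env;
    return e->a[e->i] * e->b[e->i];
}
```

Calling sum_by_name(&e.i, 0, n - 1, dot_term, &e) then computes a dot product, even though the summation routine knows nothing about arrays.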

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

That is one bullet we dodged!!

I doubt it. Several parallel strands of RISC research independently found
that moving complexity from the hardware into the compiler made computers faster and cheaper. IBM's PL.8 compiler had excellent error checking even though it was originally targeted at the RISC 801, but somehow people always want to turn off the error checks in the production build of their code.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From John Levine@johnl@taugh.com to comp.arch on Sat Jun 20 19:17:45 2026

From Newsgroup: comp.arch

According to Robert Swindells <rjs@fdy2.co.uk>:

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he could
remember what kind of benchmarks they had run on the 88100 that showed
it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to use
2 shifts to perform an extract; indexing was scaled:: both helped
interpreters.

Production Lisp environments are not interpreters, even back then.

Quite right. Lisp 1.5 on the 7094 had a compiler. See the Lisp 1.5
manual published in 1962. Appendix D describes the compiler:

https://archive.org/details/bitsavers_mitrlelisprammersManual2ed1985_9279667
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 14:51:46 2026

From Newsgroup: comp.arch

On 6/20/2026 5:46 AM, Anton Ertl wrote:

jgd@cix.co.uk (John Dallman) writes:

In article <2026Jun19.080216@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

once upon a time Sun boasted architectural features to support
Java. AFAIK these were features to improve the indirect-branch
performance in the Java VM interpreter, to improve the startup
performance.

Did this become obsolete when Java runtime environments switched to
JITing to native code?

I don't remember in which SPARC this feature was added, but IIRC it
was after the introduction of the HotSpot Java VM implementation.
Note that HotSpot uses an interpreter on startup and on the cold
methods, and only compiles hot methods to native code after executing
them for a while and collecting execution statistics.

They don't, but they don't do a good job of making those features usable
either. Support for new instructions is readily provided via intrinsics,
but those aren't portable.

Yes. But with a programming language like C, what is the alternative?

Define a language extension for vectors that "doesn't suck";
Get widespread enough support that code can use it without it turning
into a mess;
Make it work whether or not the target has native architectural support
for a given feature.

GNU C supports a vector extension; I don't know how fast Intel added
support for AVX, AVX2, and the various AVX-512 variants to gcc and
llvm (which also supports this extension); certainly recent gcc and
clang versions use AVX-512 if you press the right buttons.

You don't want to enable this stuff too fast though.

AVX-512 is far from universal.

In my case, I am still running a CPU where enabling the use of the
256-bit AVX instructions comes with a significant performance penalty.

OTOH, GCC still also tends to fall into the trap of only supporting
certain features on certain targets if the specific ISA (or target configuration) can support them natively.

Having C code that can either compile or fail to compile depending on
the specific combination of target machine and compiler feature flags is
a poor situation.

Fortran supports the array sublanguage which AFAIU makes vectorization
easy within an expression. But as Thomas Koenig tells us, gcc's
Fortran front end translates that into scalar code and relies on auto-vectorization in the back end to produce vectorized code.
Intel's Fortran compiler ifort has been replaced by something that
uses IIRC LLVM as back end.

And finally we have auto-vectorization, where it is a matter of luck
whether the scalar code is vectorized or not (e.g., I have code that
one compiler auto-vectorizes with -Os, but not -O3, and another
compiler that autovectorizes with -O3, but not -Os).

IMO auto-vectorization is another mess:
Sometimes effective, but sometimes makes it worse.

With BGBCC at least, have not gone that route.

I probably would not, unless it could get consistently positive gains;
the current scattershot of "maybe faster, maybe shoot oneself in the
foot, almost invariably makes binary bigger" situation just kinda sucks.
Doesn't exactly give me strong optimism in this area.

Though, MSVC is kind of a bigger offender in some ways (kinda almost
wish for a "Have SIMD enabled, but do not attempt auto-vectorization"
option). Kinda ironically the closest it does have to this is "/O1", but
this is more of a "just half-ass the optimizations" option rather than explicitly disallowing vectorization (or, contrast with "/O0" which is
"Hope you like every operation to be 'Load Load Op Store', and no
constant folding or similar"; and then VS debugger only really debugs
stuff effectively when it is built in this "performs like epic dog crap" mode).

Ironically, BGBCC doesn't go this far, as there isn't any direct
equivalent of "/O0" or "-O0" behavior, as I would actually have to go
out of my way to make the compiler generate code that poorly...

Then again:
RV64G/RV64GC doesn't really have SIMD;
XG1/XG2/XG3 have SIMD, but it is based on some fundamentally different assumptions from either SSE/AVX or RV-V, and does not have separate SIMD/Vector registers.

...


From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 15:39:06 2026

From Newsgroup: comp.arch

On 6/20/2026 9:25 AM, Robert Swindells wrote:

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he could
remember what kind of benchmarks they had run on the 88100 that showed
it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to use
2 shifts to perform an extract; indexing was scaled:: both helped
interpreters.

Production Lisp environments are not interpreters, even back then.

Lisp is a funny language:
Big promises in the design;
But, only deliver them poorly (and can't improve on the delivery of any
given thing without eroding the original promises).

Simplicity and elegance of an interpreter,
But only if already operating within a Lisp environment...
Clean and elegant syntax,
That actually sucks real hard to use in practice.
Performance, but only if compiled to something else...

A naive Lisp interpreter being almost the slowest style of interpreter...

Well, but I did find one that is slower: a naive walk of XML DOM trees.
Back in the 2000s, managed to make a script interpreter that was "so
damn slow" that when used in a makeshift 3D engine, could actually
"feel" whenever the short script fragments fired off...

Though, this was a design where basically something as little as a
constant expression would require:
Walk along and repeatedly check the node-tag against each tag string;
Fetch the attribute holding the value as a string;
"atof()" or similar;
Allocate a memory box to hold the value;
Put the value in a memory box.
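
The steps above can be sketched roughly as follows. The XNode and Box types, the tag names, and eval_constant are stand-ins invented for this example, not the original interpreter's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a DOM element: a tag name and one "value" attribute. */
typedef struct { const char *tag; const char *value; } XNode;
typedef struct { double d; } Box;

/* Evaluate a constant-expression node the slow way described above. */
static Box *eval_constant(const XNode *n)
{
    static const char *const tags[] = { "if", "call", "binary", "number" };
    size_t i, ntags = sizeof tags / sizeof tags[0];

    assert(n && n->tag && n->value);
    for (i = 0; i < ntags; i++)          /* string-compare against each tag */
        if (strcmp(n->tag, tags[i]) == 0)
            break;
    if (i == ntags || strcmp(tags[i], "number") != 0)
        return NULL;                     /* not a constant node */

    double v = atof(n->value);           /* re-parse the attribute text */
    Box *b = malloc(sizeof *b);          /* heap-allocate a box per result */
    if (b != NULL)
        b->d = v;
    return b;
}
```

Every evaluation of even a literal pays for several string compares, a text-to-number conversion, and a heap allocation.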

This VM didn't live long...

But, ironically, part of this code is what eventually became BGBCC.
Though BGBCC's node system is still far more optimized than what this interpreter originally used (because early on, BGBCC's compilation
process was also painfully slow for similar reasons).

Ironically, what did the script-language interpreter do?
Got rewritten to re-use part of a Scheme interpreter I had written in high-school as the backend.

Which later got rewritten to target a bytecode, and then to have a JIT.
Which was not exactly all that dissimilar from the RIL bytecode (used by BGBCC) and general structure in JX2VM and similar.

Though, JX2VM ended up mostly not JIT'ing, as JIT is unnecessary when
limited to emulating a 50 MHz target CPU on a modern desktop PC.

...

But, one can be like, what is faster (and less LOC) than a Lisp or
Scheme interpreter?...

A pre-tokenized BASIC interpreter.
But, then, arguably, BASIC is a cruftier language; but one that is
ironically still more readily usable for non-toy programming tasks.

Though, then I fairly recently ended up using BASIC as a basis for limited
CSG tasks, and the result caused it to start to mutate into something
resembling a BASIC/Emacs-Lisp hybrid.

If intentionally going in this direction, though, it would make more sense
to just go for something more closely resembling Emacs-Lisp.

Well, as trying to hybridize Lisp and BASIC gives something that sucks
worse than either Lisp or BASIC would on their own (like a "Beware, this
path leads to dog crap" thing).

...


From George Neuner@gneuner2@comcast.net to comp.arch on Sat Jun 20 16:52:44 2026

From Newsgroup: comp.arch

On Sat, 20 Jun 2026 01:01:29 GMT, MitchAlsup
<user5857@newsgrouper.org.invalid> wrote:

PL/1's most useful memory trick was using an area. So, one could 'malloc'
a bunch of data, and then free it all with one free!

Aka "Mark-Release".

Nothing prevents C from doing this, but C++ has new and new is not
compatible with area.

If by "compatible" you mean being able to select an area on a
per-allocation basis, then C's default malloc can't do that either.

You certainly can write a custom malloc, but for compatibility with
the standard function it would have to work from a "default" heap. You
could change the default heap at will, but not in the "compatible"
malloc call itself.
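
A minimal sketch of an area (arena) allocator with mark/release in C11. The Arena type and the arena_ functions are hypothetical names. Note that the area is selected by an explicit argument, which is exactly what the standard malloc(size) signature has no room for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { unsigned char *base; size_t cap, used; } Arena;

static int arena_init(Arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap  = a->base ? cap : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Bump allocation, rounded up to the strictest fundamental alignment. */
static void *arena_alloc(Arena *a, size_t n)
{
    size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->cap || n > a->cap - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* "Mark" remembers the current top; "release" frees everything after it. */
static size_t arena_mark(const Arena *a) { return a->used; }

static void arena_release(Arena *a, size_t mark)
{
    assert(mark <= a->used);
    a->used = mark;
}

static void arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}
```

Releasing to a mark taken before a batch of allocations frees the whole batch in one step, without touching the individual blocks.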

You can do the same in C++: although AFAIK there is no standard way to
do it, every compiler I have seen provided some way to replace the
standard allocator. If you do this, then all new/delete calls will
use it.

Also constructors for all the standard containers take an extra
parameter which tells them which allocator to use [and defaults to the
standard allocator].

And new()/delete() can be overloaded per class to do whatever you
want.

Few programmers in C++ ever delve so deeply into the mysteries of
allocation. IME too many have trouble even understanding the working
of C++'s "placement" new.

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 22:01:54 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 6/19/2026 2:24 AM, David Brown wrote:

On 19/06/2026 00:37, MitchAlsup wrote:

--------------------------

A bit Too "Cheri" for me.

I think "Cheri" took it too far. I believe there is scope for tagging a bit of information onto pointers without trying to do everything.

I also think a lot can be done on the side of programming languages and tools, which could catch far more possible pointer mistakes. That won't stop the bad guys, of course, but I think more bad accesses are from
bugs than hackers.

Agreed, this is a route I experimented with.

A basic bounds-checking mechanism can help with debugging and security.
One option here is, say, using pointer tagging bits to encode
bounds-check information and then have the compiler emit instructions to detect (roughly) when an access has gone out-of-bounds.

How do you do this with 64-bit registers and a 64-bit virtual address space ??!!

Extending it to the scope CHERI did adds new problems:
Adds significant implementation overhead;
Interferes with C programming practices;

Which means it is not going to fly.....
...

And with a glaring weakness:
By its design, it is theoretically incapable of by itself forming a
sandbox capable of stopping actively hostile code.

It *could* still make it a PITA for human programmers to break out of,
but if a determined human programmer (or an AI assisted one) could put
in the work and break out of it via convoluted pointer de-referencing
(and if this break-ability is likely necessary for things like the C
runtime to be able to work), this is a weak point.

And, if it can't lock down security against actively hostile code, then
its more heavy-handed aspects are no longer justifiable.

Meanwhile, if the task is subdivided, some similar benefits can be
realized more cheaply:
Bounds checked pointers to trap on out-of-bounds access;
ASLR to make it much harder for shell-code to know where anything is;

With pure PIC coding practices, and the link-pointer being stored directly
onto the call stack (not the data stack), one has no particular reason to
know where they currently are (IP) and few ways of seeing where they are.

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page to which
the application has no access permissions.

Any attack attempt will need to know an N-bit magic number.

No need for a number, it cannot be accessed outside of calling and
returning. {{Which is not being done with a series of instructions, but by
one instruction designed for the task at hand}}


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 22:08:23 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 6/20/2026 9:25 AM, Robert Swindells wrote:

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he could
remember what kind of benchmarks they had run on the 88100 that showed
it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to use
2 shifts to perform an extract; indexing was scaled:: both helped
interpreters.

Production Lisp environments are not interpreters, even back then.

Lisp is a funny language:
Big promises in the design;
But, only deliver them poorly (and can't improve on the delivery of any given thing without eroding the original promises).

Simplicity and elegance of an interpreter,
But only if already operating within a Lisp environment...
Clean and elegant syntax,
That actually sucks real hard to use in practice.
Performance, but only if compiled to something else...

A naive Lisp interpreter being almost the slowest style of interpreter...

Given that one can create a data structure in LISP, and then execute it;
how would you do this without an interpreter or a JIT ??

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sat Jun 20 10:15:41 2026

From Newsgroup: comp.arch

Robert Swindells [2026-06-19 11:20:10] wrote:

On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp), and
SPARC contains some support for that, but as one of the IIRC Franz Lisp
developers has explained in this newsgroup, they actually did not use
this feature, because the performance benefit was not big enough to

[...]

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

I guess you two aren't talking about the same "Franz Lisp".
AFAIK Anton is referring to the commercial Common Lisp compiler
associated with the Franz Inc company, marketed under the name
"Allegro".

=== Stefan

From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 20:42:03 2026

From Newsgroup: comp.arch

On 6/20/2026 5:08 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 9:25 AM, Robert Swindells wrote:

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he could
remember what kind of benchmarks they had run on the 88100 that showed
it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to use
2 shifts to perform an extract; indexing was scaled:: both helped
interpreters.

Production Lisp environments are not interpreters, even back then.

Lisp is a funny language:
Big promises in the design;
But, only deliver them poorly (and can't improve on the delivery of any
given thing without eroding the original promises).

Simplicity and elegance of an interpreter,
But only if already operating within a Lisp environment...
Clean and elegant syntax,
That actually sucks real hard to use in practice.
Performance, but only if compiled to something else...

A naive Lisp interpreter being almost the slowest style of interpreter...

Given that one can create a data structure in LISP, and then execute it;
how would you do this without an interpreter or a JIT ??

Eval can be implemented; it just isn't necessarily fast.

It is common to implement eval by using a recursive tree walk:
  Is list:
    Fetch first item in list:
      Is function or builtin?
        Eval each argument;
        Apply function;
      Is a macro:
        Do macro-handling stuff.
  Is fixnum or flonum or similar:
    Evaluates to itself.
  ...

There is a penalty due to things like type-checks, etc.
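
A minimal sketch of such a tree-walking eval in C, restricted to fixnums and the builtins + and * (no macros, no environments). The Node layout and helper names are assumptions for illustration, and nodes are never freed in this sketch.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { T_FIXNUM, T_SYMBOL, T_CONS } Tag;
typedef struct Node {
    Tag tag;
    long num;                 /* T_FIXNUM */
    char op;                  /* T_SYMBOL: only '+' or '*' in this sketch */
    struct Node *car, *cdr;   /* T_CONS */
} Node;

static Node *mk(Tag t, long n, char op, Node *car, Node *cdr)
{
    Node *p = malloc(sizeof *p);
    assert(p != NULL);
    p->tag = t; p->num = n; p->op = op; p->car = car; p->cdr = cdr;
    return p;
}
static Node *fix(long n)            { return mk(T_FIXNUM, n, 0, NULL, NULL); }
static Node *sym(char op)           { return mk(T_SYMBOL, 0, op, NULL, NULL); }
static Node *cons(Node *a, Node *d) { return mk(T_CONS, 0, 0, a, d); }

/* Recursive tree walk: every step starts with a run-time type check. */
static long eval(const Node *n)
{
    if (n->tag == T_FIXNUM)
        return n->num;                          /* evaluates to itself */
    assert(n->tag == T_CONS && n->car->tag == T_SYMBOL);
    char op = n->car->op;
    assert(op == '+' || op == '*');
    long acc = (op == '+') ? 0 : 1;
    for (const Node *a = n->cdr; a != NULL; a = a->cdr) {
        long v = eval(a->car);                  /* eval each argument */
        acc = (op == '+') ? acc + v : acc * v;  /* apply the builtin */
    }
    return acc;
}
```

Even in this stripped-down form, each node visit costs a tag test, a pointer chase, and a recursive call.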

Contrast, say, a naive bytecode:
Go through a switch;
Goes fairly directly to logic.
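
A minimal sketch of such a switch-dispatched stack bytecode loop in C. The opcode set and the run function are made up for this example.

```c
#include <assert.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };
enum { STACK_MAX = 64 };

/* Stack-machine loop: one switch per instruction, no per-value type checks. */
static long run(const long *code)
{
    long stack[STACK_MAX];
    int sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = *code++;      /* operand follows the opcode */
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

For 1 + 2 * 3 the program is PUSH 1, PUSH 2, PUSH 3, MUL, ADD, HALT. A compiler will typically turn the switch into a jump table, i.e. one indirect branch per instruction.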

A language like Forth or PostScript can be pretokenized fairly directly
to a stack-based bytecode.

In a pretokenized BASIC, one could use a dispatch based on the first token.

Granted, the "per-eval" cost for a simple Lisp expression (such as in a
REPL) would be lower than that of a JavaScript interpreter (which will typically need to parse and translate to some internal format before
running it; such as a stack-machine bytecode, which is then often taken further).

And, admittedly, this was back to my first BGBScript, where:
  First version:
    Parse to a DOM-based AST;
    Tree walk the AST.
  Second version:
    Parse into a Scheme-based format;
    Run a modified Scheme interpreter (older).
  Then became:
    Translate to a stack based bytecode;
    Translate stack bytecode into 3AC ops in a "trace graph";
    Optional JIT compile the trace graph.
    Could push close to native C speeds with this at the time.

As over time it went from a dynamic to static typed core, and the VM
became very big and complicated (though, later looked at V8 and
SpiderMonkey, which were basically in a similar weight class; and had seemingly ended up crossing into a lot of similar domain here).

Then I did a reboot, turning it into a Java style language (using JSON
based ASTs), which was faster still, but lost the ability to be used for scripting tasks or to eval anything. It also still failed to compete
well with C on C's home turf.

Well, and the original reason I had written my own scheme interpreter
was due to frustrations with SCM and GNU Guile (I had looked at the code
and was like, "I could just do this stuff myself").

And, as noted, another offshoot of this first VM became BGBCC, which has outlived the descendants of the second VM.

Ironically, now I am back where I started, still lacking a particularly
good general-purpose scripting VM.

In the minimal case, 80s style BASIC can be implemented in minimal LOC,
but doesn't scale very well.

A minimal JS-like language needs more LOC, but is more complicated.

Ideally, one wants something small and self-contained enough that it can
be casually copied from one project or context to another without having
a bunch of extra "plumbing pain", which is notoriously harder to pull
off for scripting VMs than, say, file-format importers/exporters or data-compression code.

Something like a small Lisp/Scheme almost starts looking tempting again.
Slower in a naive implementation than BASIC;
But, scales a little better to some tasks;
Smaller implementation cost than a minimal JS dialect.

One might also end up with goals like, say:
Whole VM needs to fit within a single source file with no external dependencies beyond the basic C library;
Also whole VM shouldn't be more than a few kLOC;
...

Well, and I am lacking a good way to implement something like a GLSL
compiler in a way that isn't overly heavyweight. But, GLSL poses a very different problem space from a light duty scripting language (would need
to implement this to get TKRA-GL beyond 1.x territory).

Then again, could ironically make a case for going back to S-Expressions,
but say:
Parse GLSL into S-Expressions;
Translate S-Expressions into a modified ARB-like IR;
Map ARB-like commands to blobs of XG3 machine instructions or similar.
Possibly using a vaguely similar approach to the Quake3 QVM;
Naive/sucks, but kinda works;
Could map IR registers ~ 1:1 to CPU registers;
This basically means the reg-alloc happens when compiling the AST.
...

Despite front-end similarities, the backend for a GLSL compiler would
not be a good fit for scripting tasks.

Was able to get a basic JS like interpreter into a few kLOC, but
extending this into a GLSL compiler (while trying to keep code footprint small) would be a harder problem.

Well, and is likely to end about like how my effort to implement a new lighter-weight C compiler ended. Previously I had wanted to try to
implement a C compiler in a smaller code footprint than the Doom engine,
but kinda failed at this. And then it fizzled, like unless it fits into
Doom's code size and memory footprint, there was little reason to keep
working on it; vs just continuing to use BGBCC; which, granted, weighs
in at around 10x the size of Doom; or on-par with the size of Quake 3 Arena.

Can't really use BGBCC to compile anything within TestKern itself mostly because it uses too much RAM (would need a fairly large pagefile, and
then it would spend most of its time paging).

Though, BGBCC does work very different from 80s/90s style compilers, effectively loading the whole target program into RAM and dealing with everything, rather than compiling each translation unit sequentially
from top to bottom (and then running a linker).

Not entirely sure how linkers worked on very old computers though, as seemingly the working set of the linker would likely end up needing to
be larger than the VAS on something like a PDP-11.

...


From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 21:04:41 2026

From Newsgroup: comp.arch

On 6/20/2026 5:01 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/19/2026 2:24 AM, David Brown wrote:

On 19/06/2026 00:37, MitchAlsup wrote:

--------------------------

A bit Too "Cheri" for me.

I think "Cheri" took it too far. I believe there is scope for tagging a bit of information onto pointers without trying to do everything.

I also think a lot can be done on the side of programming languages and
tools, which could catch far more possible pointer mistakes. That won't stop the bad guys, of course, but I think more bad accesses are from
bugs than hackers.

Agreed, this is a route I experimented with.

A basic bounds-checking mechanism can help with debugging and security.
One option here is, say, using pointer tagging bits to encode
bounds-check information and then have the compiler emit instructions to
detect (roughly) when an access has gone out-of-bounds.

How do you do this with 64-bit registers and a 64-bit virtual address space ??!!

You can't...

It is doable at least with a 48-bit VAS as there are just enough HOBs
left over to sort of shove in an approximate bounds-check scheme.

Exponent, Range_Mantissa, Range_Bias.

Extending it to the scope CHERI did adds new problems:
Adds significant implementation overhead;
Interferes with C programming practices;

Which means it is not going to fly.....
...

Yeah.

It is in this awkward area of "sorta works" but far from being entirely transparent.

And with a glaring weakness:
By its design, it is theoretically incapable of by itself forming a
sandbox capable of stopping actively hostile code.

It *could* still make it a PITA for human programmers to break out of,
but if a determined human programmer (or an AI assisted one) could put
in the work and break out of it via convoluted pointer de-referencing
(and if this break-ability is likely necessary for things like the C
runtime to be able to work), this is a weak point.

And, if it can't lock down security against actively hostile code, then
its more heavy-handed aspects are no longer justifiable.

Meanwhile, if the task is subdivided, some similar benefits can be
realized more cheaply:
Bounds checked pointers to trap on out-of-bounds access;
ASLR to make it much harder for shell-code to know where anything is;

With pure PIC coding practices, and the link-pointer being stored directly onto the call stack (not the data stack), one has no particular reason to know where they currently are (IP) and few ways of seeing where they are.

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Can at least reduce success rate (for stomped LR being able to redirect control without immediate CPU fault) from 100% down to 0.4%, ...

Stack canaries can also help, but then compilers (and programmers) like
to disable them for sake of the "usually fraction of a percent"
performance overhead.

Can also be defeated if the attacker can guess the stack canary (could
be very well possible if they know enough to be able to defeat the ASLR).

If the start-up is predictable enough, an attacker could likely guess
the ASLR, stack-canary value and XOR mask, and the magic number that
goes in the link register, ... then we have a harder problem.

Well, and then again, there is still the problem of all the potential weak-points that don't depend on injecting shell code. Like, it is
rendered moot if a program can use a malformed string to escape commands
into a script interpreter or use an insecure interface to get at the C library's "system()" function or similar...

Any attack attempt will need to know an N-bit magic number.

No need for a number, it cannot be accessed outside of calling and returning. {{Which is not being done with a series of instructions--but by 1 designed for the task at hand}}

OK.


From quadibloc@quadibloc@invalid.com (John Savard) to comp.arch on Sun Jun 21 05:10:00 2026

From Newsgroup: comp.arch

On Sat, 20 Jun 2026 01:01:29 GMT, MitchAlsup
<user5857@newsgrouper.org.invalid> wrote:

Instead, C presents a programming model way down at the vonNeumann level::
1 unit of work (step) at a time.

That could be considered a flaw.

Of course, there are languages that address that flaw.

APL has mathematical operators that act directly on vectors and
matrices without loops.

Modula-2, ADA, and some other languages include constructs for
parallel execution.

But then, even C has fork().

John Savard

From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Jun 21 10:45:53 2026

From Newsgroup: comp.arch

On 2026-Jun-20 15:07, John Levine wrote:

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

That is one bullet we dodged!!

I doubt it. Several parallel strands of RISC research independently found that moving complexity from the hardware into the compiler made computers faster and cheaper. IBM's PL.8 compiler had excellent error checking even though it was originally targeted at the RISC 801, but somehow people always want to turn off the error checks in the production build of their code.

I suspect that is because error checks were so badly designed.
e.g. the x86 BOUND instruction costs more to set up than it saves
because it requires 2 bounds to be set up in memory and then
read every time.

If checks are designed from a risc point of view
they should have little to no runtime costs.

For example, almost all arrays are 1-dimension, base-0 or base-1
and most array bounds are constants, so one only needs to check,
- for base-0 a single index unsigned < register or constant limit,
- for base-1 a single index != 0 and unsigned <= register or constant limit. (It uses an unsigned compare because signed negative integers
will be treated as large positive unsigned integers and fault.)

Since the index will already be in a register, this is just a
reg-reg or reg-imm compare and possibly fault.
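
A minimal C sketch of the single-compare checks described above. The function names are made up for illustration, and abort() merely stands in for whatever fault the hardware check would raise:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Base-0 array: valid indices are 0 .. limit-1.
   Reinterpreting the index as unsigned turns any negative value into
   a very large one, so one unsigned compare covers both ends. */
static bool in_bounds_base0(int32_t index, uint32_t limit)
{
    return (uint32_t)index < limit;
}

/* Base-1 array: valid indices are 1 .. limit. */
static bool in_bounds_base1(int32_t index, uint32_t limit)
{
    return index != 0 && (uint32_t)index <= limit;
}

/* Hypothetical checked load; abort() models the index fault. */
static int load_checked(const int *arr, uint32_t n, int32_t i)
{
    assert(arr != NULL);
    if (!in_bounds_base0(i, n))
        abort();
    return arr[i];
}
```

Since the index is already in a register, each predicate is one compare plus a rarely-taken branch.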

There are two forms of conditional check ChkCC.
With the standard check, the following LD or ST is not dependent on
the check success and could be speculatively executed before an
index fault was thrown. It is therefore slightly faster but not
Spectre safe, suitable for secure environments.

The second form, a sequential check ChkSeqCC, has 3 operands:
the source index register, the limit imm or register, and a dest register.
ChkSeqCC rd_index, rs1_index, imm_limit
ChkSeqCC rd_index, rs1_index, rs2_limit
When the check succeeds the rs1_index is copied into rd_index register,
and the rd_index register is then used in the LD or ST instruction.
This creates a sequential dependency of the LD/ST on the check having
been passed and thus blocks Spectre style speculative indexing.


From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sun Jun 21 15:16:29 2026

From Newsgroup: comp.arch

On Sat, 20 Jun 2026 22:08:23 GMT, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 9:25 AM, Robert Swindells wrote:

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he
could remember what kind of benchmarks they had run on the 88100
that showed it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to
use 2 shifts to perform an extract; indexing was scaled:: both
helped interpreters.
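
To make the operation count concrete, here is a rough C sketch of unsigned bit-field extraction done both ways. It is not M88K or SPARC code, and the function names are invented; it only shows why a shift-only machine needs two dependent shifts where an extract-style operation needs one:

```c
#include <assert.h>
#include <stdint.h>

/* Shift-only style: shift left to drop the bits above the field,
   then shift right to drop the bits below it. */
static uint32_t extract_two_shifts(uint32_t x, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 32);
    return (x << (32 - offset - width)) >> (32 - width);
}

/* Extract style: offset and width are given together, which a single
   instruction with a width/offset field could do in one step. */
static uint32_t extract_one_op(uint32_t x, unsigned offset, unsigned width)
{
    assert(width >= 1 && offset + width <= 32);
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (x >> offset) & mask;
}
```

Pulling a tag field out of a tagged pointer or word is exactly this kind of operation, which is presumably why it matters for Lisp-style workloads.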

Production Lisp environments are not interpreters, even back then.

Lisp is a funny language:
Big promises in the design;
But, only deliver them poorly (and can't improve on the delivery of any
given thing without eroding the original promises).

Simplicity and elegance of an interpreter,
But only if already operating within a Lisp environment...
Clean and elegant syntax,
That actually sucks real hard to use in practice.
Performance, but only if compiled to something else...

A naive Lisp interpreter being almost the slowest style of
interpreter...

Given that one can create a data structure in LISP, and then execute it;
how would you do this without an interpreter or a JIT ??

You don't do that for anything serious.

You have an Ahead Of Time compiler that takes a file of source code and generates a file of machine code equivalent to it just as you do for any
other Algol-family language. You also have the option of compiling an individual function to RAM but not saving it out to a file.

A good overview of the SoTA back then is this:

<https://dreamsongs.com/Files/Timrep.pdf>

I would expect that some PDP-10s at CMU had MacLisp installed on them.

The Franz Lisp source code was available from UCB including the compiler.

The commercial Common Lisp implementations from Franz Inc., Lucid Inc. and Harlequin came after this point. I don't know if any of them were ported
to the M88K.

The Lisp system developed by the SPICE Project at CMU initially targeted
the PERQ but later switched to running on conventional CPUs and is still
in use today in SBCL and CMUCL variants.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jun 21 14:52:05 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that requirement.

typedef struct { // string with explicit length
int len;
char str[0];
} varstr;

varstr *p;
char *s = "swordfish";

// initialize p from s
p = (varstr *)malloc(sizeof(varstr)+strlen(s));

p->len = strlen(s);

strncpy(p->str, s, p->len);

so in practice pointers all have to be pointers to bytes or something
that can losslessly be converted to and from them.

So you want typed pointers. Other languages have more type safety.

What kind of segmentation do you have in mind that would provide type
safety?

This evolution was certainly helped along by the horrible implementation
of segmented memory in the Intel 8086 and 286, which persuaded people
that segments are a plague to be avoided rather than a tool to make
programs more reliable.

The 286 provides segments that fit the C standard. It seems that what
people found horrible about them was that they are limited to 64KB,
and that using them is slow, and that the 80286 protected mode was
completely at odds with real mode instead of an upwards-compatible
thing.

The limit could be fixed, and they would require more hardware
resources and be even slower. The limit on the number of segments is
probably also a problem if you want to use them for C-standard
objects.

Concerning the performance, one can probably improve that, at the cost
of additional hardware, but I fail to see how any segment-based
hardware could be as fast (or at least close to) as flat memory with
software bounds checking.

One issue that segments as on the 80286 do not fix is dangling
references (C memory safety checkers go to considerable lengths to
deal with that problem). So the language implementation of a language
with explicit deallocation (e.g., C or Pascal) would deallocate the
segment, but, given the finite number of segment numbers, pass out the
segment number again, and an access through a dangling reference to
the old segment could wreak havoc.

One problem connected to C and the 286 segments is that each pointer
would require a segment number and an offset within the segment. For
a language like Java, the segment number would be sufficient. But,
e.g., Pascal reference parameters can reference a specific field in a
record or an array element, so they would have to be represented by segment+offset, too.

Do you have any example of non-horrible segmentation that provides
memory safety. If not, do you have an idea what that would look like?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jun 21 15:38:52 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

According to Niklas Holsti <niklas.holsti@tidorum.invalid>:

- [C] was designed to not
be too different from the underlying hardware.

The underlying hardware *of that time*. Therefore, it may have
contributed to "locking in" that style of hardware. But I do not pretend to know that a different style of hardware would be better today.

C evolved from B which had a memory model that addressed words, which made sense for a lot of the computers of the 1960s. I gather the earliest versions of C were on the GE 635 which was a 36 bit word addressed machine but when it moved to the byte addressed PDP-11, dmr had to add typed pointers so it could do something reasonable with pointers to character strings vs pointers to words.

My understanding is that Ritchie designed C to address B's
deficiencies for byte-addressed machines; so C for a word-addressed
machine makes no sense.

Looking at "The Development of the C Language" <https://9p.io/cm/cs/who/dmr/chist.html>, Ritchie writes:

|In 1971 I began to extend the B language by adding a character type
|and also rewrote its compiler to generate PDP-11 machine instructions
|instead of threaded code. Thus the transition from B to C was
|contemporaneous with the creation of a compiler capable of producing
|programs fast and small enough to compete with assembly language. I
|called the slightly-extended language NB, for `new B.'

There are three mentions of the GE-635 in this paper, none of them about
a C compiler.

I think that with or without C, flat byte addressed memory would have won out due to the success of S/360 and the PDP-11, both of which were programmed
in lots of languages other than C.

I agree. The IBM 704 also has flat memory.

From the Datapoint 2200 up to and including the 8085 Intel used flat
memory. The segments of the 8086 look more like a way to support more
than 64KB than anything else:

Memory safety? No.

Rearrange memory? No, because the segment number directly specifies
the memory.

The 80286 protected mode was a serious attempt to provide memory
safety and to make memory rearrangeable, but it failed in the market
before C became dominant. Most software used real mode on the 286.

Then they developed the 386 and knew that people want flat memory, so
they found a way to extend 286 protected mode to the 386, but in a way
that allows using a flat memory model; they also provided decent ways
to run 8086 software (in particular, virtual-8086 mode); and the 386
added paging.

The 6800, 6502 and 68000 were designed before C became dominant and
have flat addressing. And of course ARM also has flat addressing,
because of the 6502 heritage, and because they could not afford
segmentation frills.

The other kind of "segmentation" I encountered is HPPA's and Power's
address handling, but that has nothing to do with memory safety or
other language concepts.

Call by name and thunks were a mistake. The Algol committee was trying
to write an elegant description of call by reference and only when
Jensen's device came along did they realize what they'd done. Alan
Perlis, who was on the Algol committee, told me this. Then when they
tried to fix Algol 60, the committee was hijacked by people who
produced Algol 68 which was quite a good language, but was defined so obscurely that people wrongly assumed it was hard to learn and use.

The Van Wijngaarden grammars of Algol 68 can be seen as second-systems
effect. After the success of BNF for Algol 60, they wanted to
increase the reach of formal specifications, so they developed vW
grammars and specified Algol 68 in it. No successful language
followed that step (and I think most languages where formalism extends
beyond the context-free grammar have not become popular).

Was that the main reason why Algol 68 never became popular? Maybe,
but others were possible. Algol 60, which did not have this problem,
also was more popular as inspiration for other programming languages
than for writing programs (Burroughs machines excepted).

GCC now supports Algol 68, so one can try it out relatively easily.

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

AFAIK the B6500 leaves all security and safety to the compiler, so in
a way it is the exact opposite of the iAPX432. We have settled on an in-between approach, where safety inside a process is the job of the
compiler (or the programmer), while security between processes is
managed by hardware.

Concerning the 432: The 286 protected mode was inspired by the 432,
but did Pascal compilers make use of it? Not that I know of.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:20:56 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000 ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjump().

In addition, code does not need to access a GOT entry and then call
the address of an entry, one can LD directly into IP and at the
same time deposit the return address where it can't be stomped.

Only EXIT and RET can access the return address and 99% of the
time it goes directly into IP.

Can at least reduce success rate (for stomped LR being able to redirect control without immediate CPU fault) from 100% down to 0.4%, ...

My way gets it down to 0%.

Stack canaries can also help, but then compilers (and programmers) like
to disable them for sake of the "usually fraction of a percent"
performance overhead.

Stack canaries are <unnecessary> added work to the instruction stream.


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:29:55 2026

From Newsgroup: comp.arch

quadibloc@invalid.com (John Savard) posted:

On Sat, 20 Jun 2026 01:01:29 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Instead, C presents a programming model way down at the vonNeumann level:: 1 unit of work (step) at a time.

That could be considered a flaw.

I assume you do not want to render your ISA code where each instruction performs 1/3 units of work. So, we are left with how many units of work
should a single instruction do (incrementing IP is not considered a UoW).

Indeed, compared to RISC-V, My 66000 performs 1.4 UoW per RISC-V UoW.

Of course, there are languages that address that flaw.

APL has mathematical operators that act directly on vectors and
matrices without loops.

By stating that this {vector, matrix} calculation is 1 step, you address
the vast majority of the 'flaw' mentioned above.

Modula-2, ADA, and some other languages include constructs for
parallel execution.

But then, even C has fork().

John Savard


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:37:45 2026

From Newsgroup: comp.arch

EricP <ThatWouldBeTelling@thevillage.com> posted:

On 2026-Jun-20 15:07, John Levine wrote:

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

That is one bullet we dodged!!

I doubt it. Several parallel strands of RISC research independently found that moving complexity from the hardware into the compiler made computers faster and cheaper. IBM's PL.8 compiler had excellent error checking even though it was originally targeted at the RISC 801, but somehow people always
want to turn off the error checks in the production build of their code.

I suspect that is because error checks were so badly designed.
e.g. the x86 BOUND instruction costs more to set up than it saves
because it requires 2 bounds to be set up in memory and then
read every time.

If checks are designed from a risc point of view
they should have little to no runtime costs.

For example, almost all arrays are 1-dimension, base-0 or base-1
and most array bounds are constants, so one only needs to check,
- for base-0 a single index unsigned < register or constant limit,
- for base-1 a single index != 0 and unsigned <= register or constant limit. (It uses an unsigned compare because signed negative integers
will be treated as large positive unsigned integers and fault.)

Since the index will already be in a register, this is just a
reg-reg or reg-imm compare and possibly fault.

My 66000 has bounds checks built into the CMP instruction.
C would use the CIN check (0<=Rindex<Rcomparand)
Fortran would use the FIN check (0<Rindex<=Rcomparand)
An advantage of condition-code-less comparisons.

There are two forms of conditional check ChkCC.
With the standard check, the following LD or ST is not dependent on
the check success and could be speculatively executed before an
index fault was thrown. It is therefore slightly faster but not
Spectre safe, suitable for secure environments.

Everything becomes Spectre safe when caches are not updated until
the causing instruction retires, AND when memory indirect* cannot
access the cache a second time, until the first access passes its
RWE permission checks.

(*) LD Ri,[address] // get pointer
MEM Rd,[Ri,...,...] // touch memory indirectly

The second form, a sequential check ChkSeqCC, has 3 operands:
the source index register, the limit imm or register, and a dest register.
ChkSeqCC rd_index, rs1_index, imm_limit
ChkSeqCC rd_index, rs1_index, rs2_limit
When the check succeeds the rs1_index is copied into rd_index register,
and the rd_index register is then used in the LD or ST instruction.
This creates a sequential dependency of the LD/ST on the check having
been passed and thus blocks Spectre style speculative indexing.


From BGB@cr88192@gmail.com to comp.arch on Sun Jun 21 13:55:59 2026

From Newsgroup: comp.arch

On 6/21/2026 10:16 AM, Robert Swindells wrote:

On Sat, 20 Jun 2026 22:08:23 GMT, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 9:25 AM, Robert Swindells wrote:

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he
could remember what kind of benchmarks they had run on the 88100
that showed it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to use 2 shifts to perform an extract; indexing was scaled:: both
helped interpreters.

Production Lisp environments are not interpreters, even back then.

Lisp is a funny language:
Big promises in the design;
But, only deliver them poorly (and can't improve on the delivery of any
given thing without eroding the original promises).

Simplicity and elegance of an interpreter,
But only if already operating within a Lisp environment...
Clean and elegant syntax,
That actually sucks real hard to use in practice.
Performance, but only if compiled to something else...

A naive Lisp interpreter being almost the slowest style of
interpreter...

Given that one can create a data structure in LISP, and then execute it;
how would you do this without an interpreter or a JIT ??

You don't do that for anything serious.

You have an Ahead Of Time compiler that takes a file of source code and generates a file of machine code equivalent to it just as you do for any other Algol-family language. You also have the option of compiling an individual function to RAM but not saving it out to a file.

A good overview of the SoTA back then is this:

<https://dreamsongs.com/Files/Timrep.pdf>

I would expect that some PDP-10s at CMU had MacLisp installed on them.

The Franz Lisp source code was available from UCB including the compiler.

The commercial Common Lisp implementations from Franz Inc., Lucid Inc. and Harlequin came after this point. I don't know if any of them were ported
to the M88K.

The Lisp system developed by the SPICE Project at CMU initially targeted
the PERQ but later switched to running on conventional CPUs and is still
in use today in SBCL and CMUCL variants.

Yeah.

Though, I guess one merit of a Lisp like language is that it is a lot
easier to parse, and it could be possible to implement a fairly cheap
compiler for it (in the basic case).

Usual downside is that the excessive parentheses tend to turn into a
usability issue.

One other major hassle was typically a lack of C style loops (with break
or continue), but this could be addressed in theory.

There is the pros/cons aspect of being typically dynamically typed, but
could be possible to define a statically-typed dialect. Pure dynamic
typing has unavoidable performance costs, etc.

Hmm...
(let (((:int x) 1) ((:int y) 2)))
(defun (:int foo) ((:int x) (:int y)) (+ x y))

Where, say, replacing a symbol with (:keyword symbol) being understood
to declare it as a typed symbol.

Likely (:keyword expr) could in other contexts be understood as a
contextual attribute, with (cast :type expr) for casts.

Could make sense to require all non-primitive types to effectively be typedef'ed rather than declared by an inline notation.

(typedef pvoid (:ptr :void))
(typedef ppvoid (:ptr :pvoid))

The default type could maybe be a sort of auto-variant:
Make an attempt to infer the type if possible;
Failing this, fall back to variant (dynamic types).

Could turn this into a more C-like language by throwing a full parser on
the front (this was how my second implementation of the BGBScript VM
worked). Initially, it remained fully dynamically typed, but the final
version (before this project/language had died) had moved over to a static-typed core (with the top-level language kinda resembling
ActionScript3 or HaXE).

Could likely go from S expressions to a stack based bytecode, but could
make sense to store the bytecode in a format where separate random
access is more possible. This was a downside of BGBCC's bytecode; it
basically required "all at once" processing. To be more friendly to a
compiler with a more limited memory budget, would ideally want to be
able to read in the bytecode and walk the reach-ability graph without
needing to convert the whole thing to 3AC.

Maybe keep the format still able to support C and similar.

FWIW: BGBCC's RIL format was itself partly derived from the BGBScript
VM's bytecode format (they had originally sort of co-evolved; and used
very similar approaches).

As noted, the BSVM backend didn't interpret the bytecode directly, but
rather unpacked it into a 3AC format which was what it actually ran (vs
BGBCC where the 3AC is instead used to drive machine-code generation).

Both were developed in the era when RAM was assumed plentiful and cheap
(or, at least, normal desktop PC doesn't care if one loads something the
size of Doom or Quake entirely into decoded 3AC ops).

Well, ironically, this is what JX2VM does as well for emulating stuff,
but JX2VM couldn't really self host...

A compromise option would be to move the outer layers to a TLV format,
but not try to turn it into a structure resembling the RIFF/AVI format
(or .NET style table-driven metadata), but instead maybe stay with local tagging.

Say:
BYTE tag; BYTE nlen; BYTE data[~nlen]; //small tag
WORD tag; WORD nlen; BYTE data[~nlen]; //medium tag
DWORD tag; DWORD nlen; BYTE data[~nlen]; //long tag
Where, TAG bytes are required to stay in the range 01..7F (or 20..7E),
and lengths are stored as a complement, so the HOB is always set.

This format makes it possible to detect the tag format, without more heavy-handed tagging (like in ASN.1 or MKV). Had used this sort of TLV
format in some of my other formats (and ironically, contains a RIFF-like format as a subset case; differing primarily in that the length is complemented, and the data's stored length is not padded up to an even
byte).
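
A rough C sketch of how the tag width could be detected when reading one record. This is only an illustration: the struct and function names are made up, and it ASSUMES the length field is stored most-significant byte first, so that its first byte carries the set high-order bit (byte order is not stated above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded record; field names are illustrative only. */
typedef struct {
    uint32_t tag;         /* tag bytes packed first-byte-highest */
    int tag_bytes;        /* 1, 2 or 4 */
    uint32_t len;         /* payload length with the complement undone */
    const uint8_t *data;  /* points into the input buffer */
} tlv_rec;

/* Returns the number of bytes consumed, or 0 on a malformed record. */
static size_t tlv_read(const uint8_t *p, size_t avail, tlv_rec *out)
{
    assert(p != NULL && out != NULL);
    size_t n = 0;
    /* Tag bytes stay in 01..7F, so the first byte with the high bit
       set must be the start of the complemented length field. */
    while (n < 4 && n < avail && p[n] >= 0x01 && p[n] <= 0x7F)
        n++;
    if (n != 1 && n != 2 && n != 4)
        return 0;
    if (avail < 2 * n || (p[n] & 0x80) == 0)
        return 0;
    uint32_t tag = 0, raw = 0;
    for (size_t i = 0; i < n; i++) {
        tag = (tag << 8) | p[i];
        raw = (raw << 8) | p[n + i];  /* ASSUMPTION: big-endian length */
    }
    uint32_t mask = (n == 4) ? 0xFFFFFFFFu : ((1u << (8 * n)) - 1u);
    uint32_t len = ~raw & mask;       /* undo the complement */
    if (len > avail - 2 * n)
        return 0;
    out->tag = tag;
    out->tag_bytes = (int)n;
    out->len = len;
    out->data = p + 2 * n;
    return 2 * n + len;
}
```

With a little-endian length the same idea would need the reader to know the tag width some other way, so the byte-order assumption is doing real work here.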

So, one can have tags like:
'FN': Function
'VR': Variable
'LV': Local Variables (functions)
'LA': Local Arguments (functions)
'N': Name (String Index, Name)
'T': Type (String Index, Sig)
'F': Flags (Maybe String Index, Sig-like)
'B': Body (functions, bytecode)
'V': Value (variables, maybe bytecode)
'G': Global Dependencies List (likely global indices)

May or may not make sense (for a compiler IL) to still use bytecode for declaring values.

RIL had used inline strings, but arguably it is better to use a string
table (with string literals as offsets). May make sense to support both
normal (raw) strings, and LZ compressed strings (for larger text/data
blobs).

Might make sense to have both global and local (per-function) string
tables, with the literals able to encode which table they are pulling from.

Possibly each declaration would be required to give a list of global dependencies (so that the dependency graph-walk can be done without
needing to process the bytecode).

Ideally, want to be able to use a process where one can read the image
into RAM, and pull out bits when needed, rather than needing to up-front
parse the whole thing (which was a design limitation with RIL, which
wasn't really designed with the idea of needing to compile stuff with a
more limited memory footprint).

Likely the bytecode could use a similar format to RIL, say:
Opcodes:
00..DF: Single Byte
E0..EF: Two Byte (0000..0FFF)
(Longer, probably unneeded)
...
Integer Numbers:
00..7F: One Byte (00..7F)
80..BF: Two Byte (0000..3FFF)
C0..DF: Three Byte (...)
...
Signed integers are stored sign-folded.
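
A small sketch of the one-byte and two-byte integer forms listed above, plus sign folding. Two things are assumptions here: that "sign-folded" means the usual zigzag mapping (0, -1, 1, -2, ... to 0, 1, 2, 3, ...), and that the prefix byte carries the high bits of the value. The three-byte and longer forms are left out since the list elides them; sign_fold, sign_unfold and put_uint are illustrative names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sign folding, assumed to be the zigzag mapping:
 * 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ... */
static uint64_t sign_fold(int64_t v)
{
    uint64_t u = (uint64_t)v << 1;
    return v < 0 ? ~u : u;
}

/* Inverse of sign_fold. */
static int64_t sign_unfold(uint64_t u)
{
    return (int64_t)(u >> 1) ^ -(int64_t)(u & 1);
}

/* Writes the one-byte (00..7F) or two-byte (80..BF prefix, 14-bit value)
 * form. Returns the number of bytes written, or 0 if the value would need
 * one of the longer forms that this sketch does not cover. */
static size_t put_uint(uint8_t *dst, uint64_t u)
{
    if (u <= 0x7F) {
        dst[0] = (uint8_t)u;
        return 1;
    }
    if (u <= 0x3FFF) {
        dst[0] = (uint8_t)(0x80 | (u >> 8)); /* assumed: high bits first */
        dst[1] = (uint8_t)(u & 0xFF);
        return 2;
    }
    return 0;
}
```

Under these assumptions, signed values from -64 to 63 fit in one byte and -8192 to 8191 in two.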

The format isn't likely to be really optimized for interpreters, and
would probably stick with leaving off opcode types (type information
would be carried along the operand stack).

If compiling a language like C, would need to decide between one
bytecode blob per translation unit, or merging all TU's into a single
big blob for libraries.

If aiming for a RAM-conserving compiler, per-TU blobs likely make more
sense (with libraries either storing compound blobs, or something like
WAD or "!<arch>" libraries). Almost more tempting to just use a WAD
variant for libraries (the traditional ".a" format just kinda sucks;
would almost rather just use the ".tar" format than this).

...


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:57:56 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it through pointers derived from p. And programs usually satisfy that requirement.

{
p = (struct foo *) malloc(42 * sizeof(struct foo));
fprintf( stream, "0x16,", p );
...
if( fscanf( stream, "x16", q ) ) {
use q
}
}

is q "derived" through p ??
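
For concreteness, a compilable version of that round trip might look like the sketch below. It swaps the stream for a string buffer (snprintf/sscanf instead of fprintf/fscanf) so it needs no file, and round_trip is a made-up name. The C standard does say that a "%p" value printed and scanned back within the same program execution compares equal to the original; whether q then counts as "derived" from p is exactly the open question:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct foo { int a; };

/* Print a pointer as text, then read the text back into a fresh pointer.
 * Returns NULL if the text could not be parsed. */
static struct foo *round_trip(struct foo *p)
{
    char text[64];
    void *q = NULL;

    snprintf(text, sizeof text, "%p", (void *)p);
    if (sscanf(text, "%p", &q) != 1)
        return NULL;
    return q;   /* same address as p, but obtained from text, not from p */
}
```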

typedef struct { // string with explicit length
int len;
char str[0];
} varstr;

varstr *p;
char *s = "swordfish";

// initialize p from s
p = (varstr *)malloc(sizeof(varstr)+strlen(s));

p->len = strlen(s);

strncpy(p->str, s, p->len);

so in practice pointers all have to be pointers to bytes or something
that can losslessly be converted to and from them.

So you want typed pointers. Other languages have more type safety.

What kind of segmentation do you have in mind that would provide type
safety?

This evolution was certainly helped along by the horrible implementation
of segmented memory in the Intel 8086 and 286, which persuaded people
that segments are a plague to be avoided rather than a tool to make
programs more reliable.

The 286 provides segments that fit the C standard. It seems that what
people found horrible about them was that they are limited to 64KB,
and that using them is slow, and that the 80286 protected mode was
completely at odds with real mode instead of an upwards-compatible
thing.

That did not help--but the problem goes much deeper--especially with
modern software adding DLLs as the <static> application ages.

The limit could be fixed, and they would require more hardware
resources and be even slower. The limit on the number of segments is
probably also a problem if you want to use them for C-standard
objects.

Yes, for a 64-bit VAS, you want a 64-bit pointer to the start, no less
than a 40-bit value for its size, and then 2 (or 3) sets of permissions--
each permission being 7-8-bits long. You can pack all this into a 128-
bit descriptor if you accept that no segment can be larger than 2^40
bytes. A 40-bit size is a bit on the restrictive side.
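
As a sanity check on the arithmetic (64 + 40 + 3*8 = 128 bits), here is one way such a descriptor could be packed. The field order, the names SegDesc, seg_make, seg_size, seg_perms and seg_in_bounds, and the choice of three 8-bit permission sets are all assumptions for illustration, not My 66000's actual layout. Storing the raw size also caps segments at 2^40 - 1 bytes in this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit segment descriptor. */
typedef struct {
    uint64_t base;       /* 64-bit pointer to the start of the segment */
    uint64_t size_perms; /* bits 0..39: size; bits 40..63: three 8-bit permission sets */
} SegDesc;

#define SEG_SIZE_BITS 40
#define SEG_SIZE_MASK ((UINT64_C(1) << SEG_SIZE_BITS) - 1)

static SegDesc seg_make(uint64_t base, uint64_t size,
                        uint8_t p0, uint8_t p1, uint8_t p2)
{
    SegDesc d;
    d.base = base;
    d.size_perms = (size & SEG_SIZE_MASK)
                 | ((uint64_t)p0 << 40)
                 | ((uint64_t)p1 << 48)
                 | ((uint64_t)p2 << 56);
    return d;
}

static uint64_t seg_size(const SegDesc *d)
{
    return d->size_perms & SEG_SIZE_MASK;
}

/* set is 0, 1 or 2 */
static uint8_t seg_perms(const SegDesc *d, unsigned set)
{
    return (uint8_t)(d->size_perms >> (SEG_SIZE_BITS + 8 * set));
}

/* True if [offset, offset + nbytes) lies inside the segment.
 * Written so that offset + nbytes is never computed and cannot wrap. */
static bool seg_in_bounds(const SegDesc *d, uint64_t offset, uint64_t nbytes)
{
    uint64_t size = seg_size(d);
    return nbytes <= size && offset <= size - nbytes;
}
```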

Concerning the performance, one can probably improve that, at the cost
of additional hardware, but I fail to see how any segment-based
hardware could be as fast (or at least close to) as flat memory with
software bounds checking.

Neither does anyone else in the RISC camp who architects instructions.

Yes, you can throw a bunch of HW at the problem and almost make the
performance degradation vanish--but at what cost {area, power, cycles}??

One issue that segments as on the 80286 do not fix is dangling
references (C memory safety checkers go to considerable lengths to
deal with that problem). So the language implementation of a language
with explicit deallocation (e.g., C or Pascal) would deallocate the
segment, but, given the finite number of segment numbers, pass out the segment number again, and an access through a dangling reference to
the old segment could wreak havoc.

Yes, you are going to want at least 2^32 segments...

One problem connected to C and the 286 segments is that each pointer
would require a segment number and an offset within the segment. For
a language like Java, the segment number would be sufficient. But,
e.g., Pascal reference parameters can reference a specific field in a
record or an array element, so they would have to be represented by segment+offset, too.

And an ability to create, restrict, pass, and receive segment-descriptors across some kind of interface that takes very few instructions to set up
and use--to other threads which share only part of your VAS {memmap()}.

Do you have any example of non-horrible segmentation that provides
memory safety? If not, do you have an idea what that would look like?

- anton


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 19:02:32 2026

From Newsgroup: comp.arch

Robert Swindells <rjs@fdy2.co.uk> posted:

On Sat, 20 Jun 2026 22:08:23 GMT, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 9:25 AM, Robert Swindells wrote:

On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

Robert Swindells <rjs@fdy2.co.uk> posted:

In previous discussions, I had tried to press Mitch to see if he
could remember what kind of benchmarks they had run on the 88100
that showed it running Lisp faster than SPARC.

M88K shift instructions could perform extracts, whereas SPARC had to
use 2 shifts to perform an extract; indexing was scaled:: both
helped interpreters.

Production Lisp environments are not interpreters, even back then.

Lisp is a funny language:
Big promises in the design;
But, only deliver them poorly (and can't improve on the delivery of any
given thing without eroding the original promises).

Simplicity and elegance of an interpreter,
But only if already operating within a Lisp environment...
Clean and elegant syntax,
That actually sucks real hard to use in practice.
Performance, but only if compiled to something else...

A naive Lisp interpreter being almost the slowest style of
interpreter...

Given that one can create a data structure in LISP, and then execute it; how would you do this without an interpreter or a JIT ??

You don't do that for anything serious.

You have an Ahead Of Time compiler that takes a file of source code and generates a file of machine code equivalent to it just as you do for any other Algol-family language. You also have the option of compiling an individual function to RAM but not saving it out to a file.

A good overview of the SoTA back then is this:

<https://dreamsongs.com/Files/Timrep.pdf>

I would expect that some PDP-10s at CMU had MacLisp installed on them.

The Franz Lisp source code was available from UCB including the compiler.

The commercial Common Lisp implementations from Franz Inc., Lucid Inc. and Harlequin came after this point. I don't know if any of them were ported
to the M88K.

I was told that the prolog application on M88K was faster than competing
RISC processors. I remember that SPECint XLISP and M88Ksim were higher performing than several other competitors. I was told that the bit-field extract instructions had a lot to do with that.

The Lisp system developed by the SPICE Project at CMU initially targeted
the PERQ but later switched to running on conventional CPUs and is still
in use today in SBCL and CMUCL variants.


From David Brown@david.brown@hesbynett.no to comp.arch on Sun Jun 21 21:15:22 2026

From Newsgroup: comp.arch

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like this:
p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

{
p = (struct foo *) malloc(42 * sizeof(struct foo));
fprintf( stream, "0x16,", p );
...
if( fscanf( stream, "x16", q ) ) {
use q
}
}

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"
and when a compiler can know pointers definitely alias, definitely do
not alias, can be assumed not to alias or must be assumed to possibly
alias. I haven't followed the details enough to say what would be the
case here, or if a consensus has been reached about such situations.

However, Anton did say that programs /usually/ satisfy that requirement.
It is possible to play silly buggers with pointers in C, but few
people would do something as "creative" as you are suggesting here.

There can be good reasons for doing some weird things with pointers in
C, and it's not always clear what is entirely allowed or not. Sometimes
it is safer to use unsigned character pointers (such as with "memcpy")
to be sure - there's a risk to efficiency, but good compilers will often generate efficient results.

C++23 introduced the "start_lifetime_as" which lets you be a bit clearer
about some things.

I doubt if anyone will claim that C is a "perfect" language here, or
that there are no risks of misunderstandings between the programmer and
the compiler if you try to do very strange things. But usually you
don't need to do such strange things in code.


From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sun Jun 21 19:52:10 2026

From Newsgroup: comp.arch

On Sun, 21 Jun 2026 19:02:32 GMT, MitchAlsup wrote:

I was told that the prolog application on M88K was faster than competing
RISC processors. I remember that SPECint XLISP and M88Ksim were higher performing than several other competitors. I was told that the bit-field extract instructions had a lot to do with that.

XLisp doesn't use tags to encode types, it just has a C union of structs
with a one byte field for the type. You can still find the source to the version used by SPECint. It doesn't provide a compiler.

It isn't a useful benchmark for anyone who was interested in running Lisp
back then.

I'm not trying to defend SPARC and am happy to take your word for it that
M88K was fast for the time.

From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sun Jun 21 19:56:40 2026

From Newsgroup: comp.arch

On Sun, 21 Jun 2026 13:55:59 -0500, BGB wrote:

Though, I guess one merit of a Lisp like language is that it is a lot
easier to parse, and it could be possible to implement a fairly cheap compiler for it (in the basic case).

Usual downside is that the excessive parentheses tend to turn into a usability issue.

You use an editor that keeps track of them.

One other major hassle was typically a lack of C style loops (with break
or continue), but this could be addressed in theory.

It is addressed in practice.

You could run SBCL on your CPU, it has a RISC-V backend to the compiler.

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jun 21 13:11:25 2026

From Newsgroup: comp.arch

On 6/21/2026 11:37 AM, MitchAlsup wrote:

EricP <ThatWouldBeTelling@thevillage.com> posted:

On 2026-Jun-20 15:07, John Levine wrote:

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

That is one bullet we dodged!!

I doubt it. Several parallel strands of RISC research independently found
that moving complexity from the hardware into the compiler made computers
faster and cheaper. IBM's PL.8 compiler had excellent error checking even
though it was originally targeted at the RISC 801, but somehow people always
want to turn off the error checks in the production build of their code.

I suspect that is because error checks were so badly designed.
e.g. the x86 BOUND instruction costs more to set up than it saves
because it requires 2 bounds to be set up in memory and then
read every time.

If checks are designed from a risc point of view
they should have little to no runtime costs.

For example, almost all arrays are 1-dimension, base-0 or base-1
and most array bounds are constants, so one only needs to check,
- for base-0 a single index unsigned < register or constant limit,
- for base-1 a single index != 0 and unsigned <= register or constant limit.
(It uses an unsigned compare because signed negative integers
will be treated as large positive unsigned integers and fault.)

Since the index will already be in a register, this is just a
reg-reg or reg-imm compare and possibly fault.
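
The two checks described above, written out in C as a sketch. The function names are made up, and a real compiler would emit a compare-and-trap rather than return a flag. The trick relies on the limit being below 2^63, so that every negative index lands above it once reinterpreted as unsigned:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* base-0 arrays: valid indices are 0 .. limit-1.
 * A negative index becomes a huge unsigned value and fails the compare,
 * so one unsigned compare covers both ends of the range. */
static bool in_bounds_base0(int64_t index, uint64_t limit)
{
    return (uint64_t)index < limit;
}

/* base-1 arrays: valid indices are 1 .. limit.
 * Same unsigned compare, plus an explicit rejection of index 0. */
static bool in_bounds_base1(int64_t index, uint64_t limit)
{
    return index != 0 && (uint64_t)index <= limit;
}
```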

My 66000 has bounds checks built into the CMP instruction.
C would use the CIN check (0<=Rindex<Rcomparand)
Fortran would use the FIN check (0<Rindex<=Rcomparand)
An advantage of condition-code-less comparisons.

Yes, although a perhaps minor quibble. You would use the compare
followed presumably by a branch on bit instruction. I believe Eric's
proposal would generate a fault if the comparison failed, so a single instruction versus two for your solution. I am not sure how much the
extra instruction costs, but if it occurs on every array reference, it
might be an issue.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From BGB@cr88192@gmail.com to comp.arch on Sun Jun 21 16:22:43 2026

From Newsgroup: comp.arch

On 6/21/2026 1:20 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000 ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

In addition, code does not need to access a GOT entry and then call
the address of an entry, one can LD directly into IP and at the
same time deposit the return address where it can't be stomped.

Only EXIT and RET can access the return address and 99% of the
time it goes directly into IP.

This requires a CPU that can deal with PUSH/POP mechanics in hardware.
With a Link Register, HW doesn't need to deal with this.

Then again, did see a video recently about a new interrupt-handling and
system call mechanism for x86-64 (called FRED).

And the big apparent change:
Mostly makes SYSCALL behave like a normal interrupt, but drops the IDT
in favor of BaseRegister + Disp and similar.

So, seemingly:
IDT:
Push RIP and RFLAGS and similar;
Jump to entry point loaded from IDT;
Specific behavior depends on interrupt type, etc.
SYSCALL:
Copies RIP and RFLAGS to different registers;
Jump to entry point from an MSR.
New mechanism (FRED):
Pushes stuff to stack again, but more stuff to the stack;
Jump to fixed entry point with a per-category displacement;
Stack contents are more consistent.

Well, contrast the interrupt mechanism used in my ISA:
Saves SR (flags) and PC and similar to special registers;
Branches to VBR + Disp (category);
Loads some mode-state from VBR;
VBR can encode which ISA mode handles the interrupt (similar to LR).
Mode flag causes SP and SSP to swap places in the decoder.
Debatable, but stack-swapping avoids a bunch of PITA...

There is a difference in that in my ISA designs, ISR needs to manually save/restore GPRs, whereas traditionally x86 / x86-64 dumps them to the
TSS. Well, and ironically it seems FRED drops the TSS, so apparently now
the ISR needs to save/restore all the GPRs itself.

Like, they are seemingly (almost) converging towards a mechanism similar
to what I was using, just with more pushing stuff to the stack, and 4
stack pointers (one for each ring), rather than 2.

Well, if they would have skipped the stack PUSH'ing and just copied the
RIP and RFLAGS and similar to MSRs, it would have been "almost the same thing".

Almost funny in a way, as my mechanism was basically designed around
trying to do the bare minimum of what could have been done and still
allow stuff to work.

Could probably LOL pretty hard if x86-64 then went and added an explicit "Branch-with-Link" instruction (as a replacement for CALL).

...

Can at least reduce success rate (for stomped LR being able to redirect
control without immediate CPU fault) from 100% down to 0.4%, ...

My way gets it down to 0%.

One could also make a case for having a separate Stack and Hunk, with
things like arrays/structs going onto the Hunk rather than the Stack.

But, this would be like 2 pointers and some extra mechanics when the
Hunk needs to be used.

This would also eliminate the same basic issue, because buffer overflows
can't stomp the saved registers (and if the Hunk moves upwards, then OOB
is far less likely to stomp anything meaningful).

Where, the Hunk allocator in this case could be functionally similar to
the ones in Quake 2/3 (Quake 1 also used one).
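
A minimal sketch of that kind of upward-growing hunk, assuming a simple bump pointer with mark/release. The names (Hunk, hunk_alloc, hunk_mark, hunk_release) and the 16-byte alignment are illustrative choices, not taken from Quake or BGBCC:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bump allocator over a caller-supplied block; grows upward. */
typedef struct {
    uint8_t *base;
    size_t   size;
    size_t   used;
} Hunk;

static void hunk_init(Hunk *h, void *mem, size_t size)
{
    h->base = mem;
    h->size = size;
    h->used = 0;
}

/* Allocate n bytes (rounded up to 16); NULL if the hunk is exhausted. */
static void *hunk_alloc(Hunk *h, size_t n)
{
    size_t aligned = (n + 15) & ~(size_t)15;
    if (aligned < n || aligned > h->size - h->used)
        return NULL;
    void *p = h->base + h->used;
    h->used += aligned;
    return p;
}

/* A function would take a mark on entry and release it on exit,
 * freeing every array/struct it put on the hunk in one step. */
static size_t hunk_mark(const Hunk *h)
{
    return h->used;
}

static void hunk_release(Hunk *h, size_t mark)
{
    if (mark <= h->used)
        h->used = mark;
}
```

Since saved registers and return addresses would stay on the ordinary stack, an overrun of a hunk-allocated buffer runs into later hunk data or unused space rather than into them.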

Stack canaries can also help, but then compilers (and programmers) like
to disable them for sake of the "usually fraction of a percent"
performance overhead.

Stack canaries are <unnecessary> added work to the instruction stream.

They are pretty effective though against the specific case of
out-of-bounds stomping saved registers or the return address...

Also MSVC uses them by default (as does BGBCC), but GCC and Clang
generally don't use them.


From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jun 22 00:24:07 2026

From Newsgroup: comp.arch

On Sun, 21 Jun 2026 14:52:05 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

Do you have any example of non-horrible segmentation that provides
memory safety? If not, do you have an idea what that would look like?

- anton

Nick Maclaren used to say that he knows how to do segments right.
But it was impossible to press him into providing even minimally
detailed description of his ideas.


From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Mon Jun 22 00:26:09 2026

From Newsgroup: comp.arch

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

     {
         p = (struct foo *) malloc(42 * sizeof(struct foo));
         fprintf( stream, "0x16,", p );
         ...
         if( fscanf( stream, "x16", q ) ) {
             use q
         }
     }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...

(Meanings of "providence" quoted from Wiktionary.)


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:02:05 2026

From Newsgroup: comp.arch

Robert Swindells <rjs@fdy2.co.uk> posted:

On Sun, 21 Jun 2026 19:02:32 GMT, MitchAlsup wrote:

I was told that the prolog application on M88K was faster than competing RISC processors. I remember that SPECint XLISP and M88Ksim were higher performing than several other competitors. I was told that the bit-field extract instructions had a lot to do with that.

XLisp doesn't use tags to encode types, it just has a C union of structs with a one byte field for the type. You can still find the source to the version used by SPECint. It doesn't provide a compiler.

It isn't a useful benchmark for anyone who was interested in running Lisp back then.

I'm not trying to defend SPARC and am happy to take your word for it that M88K was fast for the time.

I think it is more appropriate to say the M88K was peaky--some things
it did quite well, and others "not so much".

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:04:11 2026

From Newsgroup: comp.arch

Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

On 6/21/2026 11:37 AM, MitchAlsup wrote:

EricP <ThatWouldBeTelling@thevillage.com> posted:

On 2026-Jun-20 15:07, John Levine wrote:

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

That is one bullet we dodged!!

I doubt it. Several parallel strands of RISC research independently found
that moving complexity from the hardware into the compiler made computers
faster and cheaper. IBM's PL.8 compiler had excellent error checking even
though it was originally targeted at the RISC 801, but somehow people always
want to turn off the error checks in the production build of their code.

I suspect that is because error checks were so badly designed.
e.g. the x86 BOUND instruction costs more to set up than it saves
because it requires 2 bounds to be set up in memory and then
read every time.

If checks are designed from a risc point of view
they should have little to no runtime costs.

For example, almost all arrays are 1-dimension, base-0 or base-1
and most array bounds are constants, so one only needs to check,
- for base-0 a single index unsigned < register or constant limit,
- for base-1 a single index != 0 and unsigned <= register or constant limit.
(It uses an unsigned compare because signed negative integers
will be treated as large positive unsigned integers and fault.)

Since the index will already be in a register, this is just a
reg-reg or reg-imm compare and possibly fault.

My 66000 has bounds checks built into the CMP instruction.
C would use the CIN check (0<=Rindex<Rcomparand)
Fortran would use the FIN check (0<Rindex<=Rcomparand)
An advantage of condition-code-less comparisons.

Yes, although a perhaps minor quibble. You would use the compare
followed presumably by a branch on bit instruction. I believe Eric's proposal would generate a fault if the comparison failed, so a single instruction versus two for your solution. I am not sure how much the
extra instruction costs, but if it occurs on every array reference, it
might be an issue.

In the domain of GBOoO implementations, as long as the extra instruction
does not add to the latency to a critical path, the only degradation is
found in delta-ICache-performance.

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:23:04 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 6/21/2026 1:20 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000 ISA--except for the case where one wants to walk the stack back on a THROW() and its unstructured equivalent longjmp().

In addition, code does not need to access a GOT entry and then call
the address of an entry, one can LD directly into IP and at the
same time deposit the return address where it can't be stomped.

Only EXIT and RET can access the return address and 99% of the
time it goes directly into IP.

This requires a CPU that can deal with PUSH/POP mechanics in hardware.
With a Link Register, HW doesn't need to deal with this.

Consider the Push/Pop mechanics in HW compared to FMAC in HW--which
do you think is easier ???

Now consider 16 pushes in a row versus a single instruction that performs
the same amount of work. Which one needs to translate an address more
often, which one needs to AGEN more often, and which one can access the
cache once for up to 8 registers ???

Now given that the whole push sequence is dumped onto a MEMory unit,
and the other FUs are available, how much easier is it to find work
after the 16 pushes (or 1 ENTER) has been moved to the MEM FU ???
---------------------

Then again, did see a video recently about a new interrupt-handling and system call mechanism for x86-64 (called FRED).

And the big apparent change:
Mostly makes SYSCALL behave like a normal interrupt, but drops the IDT
in favor of BaseRegister + Disp and similar.

So, seemingly:
IDT:
Push RIP and RFLAGS and similar;
Jump to entry point loaded from IDT;
Specific behavior depends on interrupt type, etc.
SYSCALL:
Copies RIP and RFLAGS to different registers;
Jump to entry point from an MSR.
New mechanism (FRED):
Pushes stuff to stack again, but more stuff to the stack;
Jump to fixed entry point with a per-category displacement;
Stack contents are more consistent.

Well, contrast the interrupt mechanism used in my ISA:
Saves SR (flags) and PC and similar to special registers;
Branches to VBR + Disp (category);
Loads some mode-state from VBR;
VBR can encode which ISA mode handles the interrupt (similar to LR).
Mode flag causes SP and SSP to swap places in the decoder.
Debatable, but stack-swapping avoids a bunch of PITA...

And My 66000::

The SVC instruction uses the 16-bit immediate to specify what service
is being requested, and uses the SRC1 field as a 5-bit immediate to
specify which application Registers are passed to the service routine.
SVC pushes 1 DW (64-bits) onto the call stack. HW saves application thread.state and R[16:31] in permanent memory; and reads in thread.state
for the service routine to be run.

When control arrives, the registers are already present for the service
routine to get about its business, interrupts were never disabled, it
has a stack, and other useful pointers and local data in its VAS along
with its ASID, ... SW needs to do nothing to get here.

SVR undoes the state-changing control transfer. SVR uses the SRC1 field
to specify how many supervisor registers are being copied back as one
or more results {Linux uses 0, 1, 2 results}, clears the registers that
are not reloaded R[2:15] while reloading the preserved registers.
------------------

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:23:57 2026

From Newsgroup: comp.arch

Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

     {
         p = (struct foo *) malloc(42 * sizeof(struct foo));
         fprintf( stream, "0x16,", p );
         ...
         if( fscanf( stream, "x16", q ) ) {
             use q
         }
     }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...

Although many do .....

(Meanings of "providence" quoted from Wiktionary.)


From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Jun 21 20:49:38 2026

From Newsgroup: comp.arch

On 6/18/26 12:21 PM, Anton Ertl wrote:

scott@slp53.sl.home (Scott Lurndal) writes:

On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

An idle thought here is whether there is any "better" option than
conventional register-machine designs.

As an interesting thought experiment, let's assume that a vast
amount of memory is available with access times better than
SRAM (let's suppose 1-cycle for the purposes of this thread).

Would registers even be needed in such an architecture?

Registers in high-performance CPUs give you several benefits:

1) The addresses are hard-coded in the instructions. This means that
read access can start early, that dependencies (read-after-write,
write-after-write, write-after-read) can be determined early and used
for forwarding and for renaming registers, and for reducing port
requirements.

2) They have many read and write ports.

3) Fast access time. Well, maybe. Thanks to 1) fast access time is
actually not necessary, it just means that you need fewer forwarding
paths.

I would rather write that the three (typical) advantages of
registers are:

1) [Same] The address is hard-coded in the instruction.
2) They are relatively few in number (typically less than 256) and
smallish in storage.
3) The access is "word" aligned (typically).

Your number 2 derives mostly from my number 2. My number 2
also helps with latency and access power. Code density may
also be an advantage here, though variable length encoding
could support diverse address sizes which could avoid explicit
moves into the smaller address space (registers).

My number 3 (alignment) helps with latency and access power at
some cost in effective capacity. Partial reads do not seem that
expensive (especially for operations where it would be practical
to treat small operands as SIMD with the other lane(s)
suppressed — carry suppression instead of alignment shifting).
Partial writes introduce complications for out-of-order
execution (with some similarity to conditional move).

The direct addressing also facilitates use of compiler-assisted
banking, though this may be limited to a renaming table (RAT) in
an out-of-order implementation. (It _may_ be practical for a
microarchitecture to support two banks based on the compiler-
intended banking, i.e., some of the compiler's work may be
useful. I suspect a greater degree of static banking even as
hints would introduce utilization balance issues with out-of-
order execution and even variable execution width.)

As Mitch Alsup noted, a smaller address space also reduces the
overhead of dependency tracking. (There may be some potential
for exploiting spatial locality with accesses in a large address
space, e.g., sharing most significant bits. It may also be
possible to benefit from a conservative filter for dependency
checking; such was proposed for Itanium's advanced loads.)

Queue-based storage (like the Mill's Belt) further simplifies
dependency tracking but introduces issues for long-lived values
(they have to be "moved" to be preserved). One could provide two
queues (one for longer-lived items) to reduce the number of
preservation actions required.

More persistent values might benefit from special handling since
they would tend not to be retrieved by a forwarding path.

There may also be benefits to special handling of values that
are only slightly adjusted such as counters. If the modification
is only dependent on the previous value and the instruction,
including immediate, then reversal may be cheaper than storage.
Even being able to share the storage of most significant bits
could be useful.

Completely avoiding named storage for temporary values that are
known to be consumed immediately (effectively quasi-explicit
instruction fusion) might be useful as well.

Let's look at your thought experiment:

Advantage 1 is missing. Some AMD64 implementations still manage to
implement 0-cycle store-to-load-forwarding in many cases, but AFAIK
not as reliably as for registers.

Advantage 2 tends to be missing. E.g., the most extreme I have seen
up to now is 3 reads and 2 writes per cycle, and IIRC <5 total memory
accesses per cycle, on a machine that can do 8 or 10 instructions per
cycle, i.e. at least 16 register reads and 8 register writes per cycle
(maybe limited to less, but with advantage 1 mitigating that to some
extent).

A large storage area does reduce the overheads of banking.
Subarrays are naturally used, so "only" routing overhead would
be added.

*General* memory accesses also introduce tag and permission
check overhead. The former is a consequence of the huge address
space (too large to store with cheap access — the proposal
assumed small enough for fast access, so tags might be limited
to something like ASID and such a direct-mapped cache can
speculate on a hit and so cover tag checking latency).

Since temporal locality is a major justification for memory
hierarchy, a system designed for workloads lacking such might
not have a substantial set of registers. (Spatial locality also
justifies caching but does not typically apply to registers —
software may sometimes load multiple values into a single
register but that seems uncommon. Prefetching or other diverse
ordering of accessing can also justify caching/buffering.)

Advantage 3: What would single-cycle memory access mean for d=a+b+c? It
would be compiled to

t=b+c
d=a+t

With registers this has a latency of typically 2 cycles. With
single-cycle memory access this typically has a latency of 6 cycles.
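One way to arrive at the 2-cycle and 6-cycle figures is sketched below in C. The accounting is an assumption, not something spelled out in the post: each add takes 1 cycle; with registers and forwarding, a dependent add can start the cycle after its producer; with memory operands and no forwarding, each add pays a 1-cycle read, the add itself, and a 1-cycle write before its consumer can proceed. The names are hypothetical.

```c
/* Cycle counts are illustrative assumptions, not measurements. */
enum { ALU_CYCLES = 1, MEM_READ_CYCLES = 1, MEM_WRITE_CYCLES = 1 };

/* Latency of a chain of dependent operations whose operands live in
   registers, assuming results are forwarded to the next operation. */
static int chain_latency_registers(int dependent_ops)
{
    return dependent_ops * ALU_CYCLES;
}

/* Latency of the same chain when every operand lives in single-cycle
   memory and nothing is forwarded: read, compute, write per operation. */
static int chain_latency_memory(int dependent_ops)
{
    return dependent_ops *
           (MEM_READ_CYCLES + ALU_CYCLES + MEM_WRITE_CYCLES);
}
```

For the two dependent adds of t=b+c; d=a+t, chain_latency_registers(2) gives 2 and chain_latency_memory(2) gives 6 under these assumptions.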

BTW, it's not just a thought experiment:

A number of IA-64 implementations have had single-cycle D-cache
access. It still had registers.

A cache supports indirection and spatial locality and is
expected to be managed primarily by hardware. I doubt anyone
would try to implement a cache with 12 wordlines per entry,
which Itanium 2 had (to provide 12 read and 8 write ports).

If there is no indirection, such might be considered just a
large register file. (Note that some stack cache proposals and
other direct-mapped specialized caches such as the Knapsack
cache could provide register-like access characteristics. The
CRISP architecture specifically used a multi-ported stack cache
for "registers"; this cache was accessed for ordinary memory
accesses in the 32-word range cached, so some indirection was
possible.)

Processors like the 6502 and the 6809 have single-cycle memory access.
They still have registers (actually, accumulators and index
registers).

- anton


From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jun 22 06:28:16 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Robert Swindells <rjs@fdy2.co.uk> posted:
I was told that the prolog application on M88K was faster than competing
RISC processors.

I contributed to a Prolog implementation in 1990. We developed on DG
Aviion machines that had an 88100 CPU, but wrote the Prolog system in
C. It is based on the WAM, and IIRC we used the 4 most significant
bits of every word for type tags. The bitfield instructions of the
88k did not provide an advantage, because the top 4 bits can just as
easily be accessed with shifts that every architecture has (and the C
code used shifts).
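A minimal C sketch of that kind of tagging, assuming 32-bit words with the tag in the top 4 bits as described above. The helper names are made up for illustration and are not taken from the Prolog system in question.

```c
#include <stdint.h>

#define TAG_BITS   4u
#define VALUE_BITS (32u - TAG_BITS)
#define VALUE_MASK ((UINT32_C(1) << VALUE_BITS) - 1u)

/* Build a tagged word: 4-bit tag on top, 28-bit payload below. */
static uint32_t make_word(uint32_t tag, uint32_t value)
{
    return (tag << VALUE_BITS) | (value & VALUE_MASK);
}

/* Extracting the tag is a single logical shift right. */
static uint32_t word_tag(uint32_t word)
{
    return word >> VALUE_BITS;
}

/* Extracting the payload is a single AND with a constant mask. */
static uint32_t word_value(uint32_t word)
{
    return word & VALUE_MASK;
}
```

Since the tag sits at the very top of the word, one shift isolates it, which is why a bitfield-extract instruction has nothing to add here.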

I also ran this system on a DecStation (MIPS CPU), and IIRC the
DecStation was faster per MHz, and I attributed that to the larger
caches of the DecStation, but the performance on the DecStation was
pretty brittle (probably due to direct mapped caches). Inserting some
code caused a 20% slowdown on a benchmark that did not execute the new
code.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From David Brown@david.brown@hesbynett.no to comp.arch on Mon Jun 22 10:05:25 2026

From Newsgroup: comp.arch

On 21/06/2026 23:26, Niklas Holsti wrote:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

     {
         p = (struct foo *) malloc(42 * sizeof(struct foo));
         fprintf( stream, "0x16,", p );
         ...
         if( fscanf( stream, "x16", q ) ) {
             use q
         }
     }

is q "derived" through p ??
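For reference, a compilable C sketch of the round trip in question, using the standard %p conversion and an in-memory buffer instead of a stream. The helper name roundtrip_pointer is hypothetical. The C standard says a pointer read back with %p during the same program execution compares equal to the value that was printed; whether the result counts as "derived from" p for provenance purposes is exactly the open question, and the sketch does not settle it.

```c
#include <stdio.h>

/* Print a pointer as text with %p, then parse it back.
   Returns the parsed pointer, or NULL if either step fails. */
static void *roundtrip_pointer(void *p)
{
    char text[64];
    void *q = NULL;

    int n = snprintf(text, sizeof text, "%p", p);
    if (n < 0 || (size_t)n >= sizeof text)
        return NULL;                 /* formatting failed or truncated */
    if (sscanf(text, "%p", &q) != 1)
        return NULL;                 /* text did not parse as a pointer */
    return q;
}
```

Calling roundtrip_pointer(p) on a live malloc'ed block should return a pointer that compares equal to p; what the abstract machine permits you to do with it is the provenance debate.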

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...

I thought that's how most C programming was done? :-)

I probably rely too much on spell chequers - if there are no wiggly red
lines, my post is ready to send!

(Meanings of "providence" quoted from Wiktionary.)


From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Jun 22 10:44:42 2026

From Newsgroup: comp.arch

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

     {
         p = (struct foo *) malloc(42 * sizeof(struct foo));
         fprintf( stream, "0x16,", p );
         ...
         if( fscanf( stream, "x16", q ) ) {
             use q
         }
     }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Mon Jun 22 15:38:30 2026

From Newsgroup: comp.arch

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

     {
         p = (struct foo *) malloc(42 * sizeof(struct foo));
         fprintf( stream, "0x16,", p );
         ...
         if( fscanf( stream, "x16", q ) ) {
             use q
         }
     }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I would assume it is undecidable, for unrestricted programs. The aim of pointer provenance is no doubt to restrict programs to make it decidable
to some extent.

I am reminded of the person, apparently very religious, who some decades
ago posted to solicit help for reimplementing all of computing (gcc,
GNU, et cetera) on Biblical principles, because he thought Richard
Stallman was too atheistic and had tainted his products. I have not
heard how that went.

From David Brown@david.brown@hesbynett.no to comp.arch on Mon Jun 22 15:19:09 2026

From Newsgroup: comp.arch

On 22/06/2026 14:38, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed
memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

   p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

      {
          p = (struct foo *) malloc(42 * sizeof(struct foo));
          fprintf( stream, "0x16,", p );
          ...
          if( fscanf( stream, "x16", q ) ) {
              use q
          }
      }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I would assume it is undecidable, for unrestricted programs. The aim of pointer provenance is no doubt to restrict programs to make it decidable
to some extent.

I am not sure if "pointer safety" has any specific defined meaning - but
I strongly suspect you are right for any reasonable definition, with
pointers as unrestricted as in C. But being undecidable does not mean equivalent to the halting problem - after all, Turing machines only have
to cope with defined behaviour, while C programs can have undefined
behaviour. That may mean that even if you have an oracle that solves
the halting problem, it still could not be used in an algorithm to
determine the pointer safety of a given C program.

I am reminded of the person, apparently very religious, who some decades
ago posted to solicit help for reimplementing all of computing (gcc,
GNU, et cetera) on Biblical principles, because he thought Richard
Stallman was too atheistic and had tainted his products. I have not
heard how that went.

I believe I know which person you are referring to, and have not heard
from him for a long time. But I can't say I have made any effort to
contact him after he stopped posting in comp.lang.c. I hope he got the
help he /really/ needed, or at least found somewhere where he could be
happily crazy rather than in constant conflict with reality and everyone
he was in contact with.


From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jun 22 09:26:15 2026

From Newsgroup: comp.arch

On 2026-Jun-21 16:11, Stephen Fuld wrote:

On 6/21/2026 11:37 AM, MitchAlsup wrote:

EricP <ThatWouldBeTelling@thevillage.com> posted:

On 2026-Jun-20 15:07, John Levine wrote:

or Pascal, had become the "standard", we might have ended up with
computers like the Burroughs B6500 or the Intel 432.

That is one bullet we dodged!!

I doubt it. Several parallel strands of RISC research independently found
that moving complexity from the hardware into the compiler made computers
faster and cheaper. IBM's PL.8 compiler had excellent error checking even
though it was originally targeted at the RISC 801, but somehow people always
want to turn off the error checks in the production build of their code.

I suspect that is because error checks were so badly designed.
e.g. the x86 BOUND instruction costs more to set up than it saves
because it requires 2 bounds to be set up in memory and then
read every time.

If checks are designed from a RISC point of view
they should have little to no runtime costs.

For example, almost all arrays are 1-dimension, base-0 or base-1
and most array bounds are constants, so one only needs to check,
- for base-0 a single index unsigned < register or constant limit,
- for base-1 a single index != 0 and unsigned <= register or constant limit.
(It uses an unsigned compare because signed negative integers
will be treated as large positive unsigned integers and fault.)

Since the index will already be in a register, this is just a
reg-reg or reg-imm compare and possibly fault.
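In C terms the two checks look roughly like the sketch below. The function names are hypothetical; a hardware check instruction would fault instead of returning a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base-0 array: valid indexes are 0 .. limit-1.
   A negative index converts to a huge unsigned value and fails. */
static bool index_ok_base0(int64_t index, uint64_t limit)
{
    return (uint64_t)index < limit;
}

/* Base-1 array: valid indexes are 1 .. limit.
   Zero is rejected explicitly; negatives fail the unsigned compare. */
static bool index_ok_base1(int64_t index, uint64_t limit)
{
    return index != 0 && (uint64_t)index <= limit;
}
```

The unsigned conversion is what folds the "index >= 0" test into the single upper-bound compare.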

My 66000 has bounds checks built into the CMP instruction.
C would use the CIN check (0<=Rindex<Rcomparand)
Fortran would use the FIN check (0<Rindex<=Rcomparand)
An advantage of condition-code-less comparisons.

Yes, although a perhaps minor quibble. You would use the compare followed presumably by a branch on bit instruction. I believe Eric's proposal would generate a fault if the comparison failed, so a single instruction versus two for your solution. I am not sure how much the extra instruction costs, but if it occurs on every array reference, it might be an issue.

Yes, CHKcc has a general set of condition codes to test like CMPcc
and generates a fault exception if the test fails.
Plus the special CC for handling base-1 array has the "and index != 0".
These could be used for all kinds of range checks, not just array indexes.

(A fault exception is precise and leaves the instruction pointer
pointing at the instruction that triggered the exception.
That combined with a stack trace-back, ideally with source routine
names and line numbers, facilitates problem diagnosis.
Yes, I do miss those VMS exception trace-back stacks.)

There is also CHKVcc to check a single value with tests same as
BRcc reg, offset but which generates a fault exception if the test fails.
For example, in a checked language, a cast of a signed integer to unsigned could check that the value was >= 0.

I also have integer down-size conversion check instructions.
An unchecked conversion of int64 to int8 would truncate to 8 bits,
but a checked conversion tests if the high order bits [63:8] are
all the same as the sign bit [7], and throws an overflow exception if not.
This ensures you are not trying to put 10 pounds into a 5 pound bag.
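A C sketch of the checked int64-to-int8 narrowing described above, written as the bit test (bits [63:8] must all equal bit [7], i.e. bits [63:7] are all zeros or all ones). The function name is hypothetical, and it returns false where the instruction would throw an overflow exception.

```c
#include <stdbool.h>
#include <stdint.h>

/* Narrow a 64-bit signed value to 8 bits, refusing if information
   would be lost. On success stores the result in *out. */
static bool narrow_i64_to_i8(int64_t value, int8_t *out)
{
    uint64_t high = (uint64_t)value >> 7;      /* bits [63:7] */

    /* In range only if the sign bit and everything above it agree. */
    if (high != 0 && high != (UINT64_MAX >> 7))
        return false;

    *out = (int8_t)value;                      /* value fits, so exact */
    return true;
}
```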

In addition to the usual modulo arithmetic and shift instructions
there are ones that check for signed and unsigned integer overflows
and throw an exception if there is.
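For comparison, this is roughly what a software-only signed overflow check costs in portable C when there is no checking instruction. The function name is hypothetical. GCC and Clang offer __builtin_add_overflow, and C23 adds ckd_add in <stdckdint.h>, for the same purpose.

```c
#include <stdbool.h>
#include <stdint.h>

/* Add two signed 64-bit integers, reporting overflow instead of
   invoking undefined behaviour. On success stores the sum in *sum. */
static bool checked_add_i64(int64_t a, int64_t b, int64_t *sum)
{
    if ((b > 0 && a > INT64_MAX - b) ||
        (b < 0 && a < INT64_MIN - b))
        return false;                /* result would not fit */

    *sum = a + b;
    return true;
}
```

The pre-check is two compares and a subtract per add, the sort of overhead that a checking add instruction would make unnecessary.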

All of this makes the cost of verifying the original code's
design assumptions at runtime nearly or actually zero.
Programmers don't have to use these but it makes their lives easier
to have code debug itself, and removes the "cost too much" excuse.


From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 22 07:59:12 2026

From Newsgroup: comp.arch

On 6/22/2026 3:44 AM, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

     {
         p = (struct foo *) malloc(42 * sizeof(struct foo));
         fprintf( stream, "0x16,", p );
         ...
         if( fscanf( stream, "x16", q ) ) {
             use q
         }
     }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I don't know the answer to your question, but presumably we can do
better than C does. Isn't that one of the, at least claimed, advantages
of Rust, and perhaps even Ada? Also, I believe that had the originators
of C not allowed arithmetic on pointers (comparisons for equality would
still be allowed, and array addressing would have to use subscripts)
many of the problems with C pointers wouldn't have occurred. Of course,
that horse has left the barn a long time ago.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Andy Valencia@vandys@vsta.org to comp.arch on Mon Jun 22 08:33:29 2026

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Actually IA-32 (since the 486) and AMD64 have a bit for turning on
unaligned traps, but unfortunately there is too much software in
libraries that performs unaligned accesses.

Interesting... what's this bit called? In which register does
it live?

Thanks!

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message

From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Mon Jun 22 19:26:03 2026

From Newsgroup: comp.arch

On 2026-06-22 17:59, Stephen Fuld wrote:

On 6/22/2026 3:44 AM, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed
memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

   p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

      {
          p = (struct foo *) malloc(42 * sizeof(struct foo));
          fprintf( stream, "0x16,", p );
          ...
          if( fscanf( stream, "x16", q ) ) {
              use q
          }
      }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I don't know the answer to your question, but presumably we can do
better than C does. Isn't that one of the, at least claimed, advantages
of Rust, and perhaps even Ada?

Both Rust and Ada have to be restricted in certain ways in order to
ensure absence of pointer errors: Rust has to avoid "unsafe" code, and
Ada has to avoid pointer-related "unchecked" constructs and certain
undefined behavior (which does exist in Ada, but less so than in C). The
Ada subset called SPARK, together with its proof tools, is meant for
such programming, and has a feature similar to Rust "ownership" though standard Ada does not.

Also, I believe that had the originators
of C not allowed arithmetic on pointers (comparisons for equality would still be allowed, and array addressing would have to use subscripts)
many of the problems with C pointers wouldn't have occurred. Of course, that horse has left the barn a long time ago.

I recently helped to debug an Ada program that now and then, but not
often, was overwriting some buffers. At one point in that program I had *cough* used pointer arithmetic *blush* instead of array indexing, for
what I felt were good reasons at the time. But it bit me. An amusing
clue to the error was that the bug happened more often when the
satellite running the program was above Russia's borders. Perhaps you
can guess reasons for that :-)


From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 22 09:50:16 2026

From Newsgroup: comp.arch

On 6/22/2026 9:26 AM, Niklas Holsti wrote:

On 2026-06-22 17:59, Stephen Fuld wrote:

On 6/22/2026 3:44 AM, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed
memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

   p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

      {
          p = (struct foo *) malloc(42 * sizeof(struct foo));
          fprintf( stream, "0x16,", p );
          ...
          if( fscanf( stream, "x16", q ) ) {
              use q
          }
      }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer
providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I don't know the answer to your question, but presumably we can do
better than C does. Isn't that one of the, at least claimed,
advantages of Rust, and perhaps even Ada?

Both Rust and Ada have to be restricted in certain ways in order to
ensure absence of pointer errors: Rust has to avoid "unsafe" code,

I like Rust's solution. You can do unsafe things - sometimes they are
just necessary - but they are not the default way of doing things, and
you have to notate them in the source code, which serves to discourage
them and points people debugging errors to certain areas of the code
that are more likely to be problematic.

and
Ada has to avoid pointer-related "unchecked" constructs and certain undefined behavior (which does exist in Ada, but less so than in C). The
Ada subset called SPARK, together with its proof tools, is meant for
such programming, and has a feature similar to Rust "ownership" though standard Ada does not.

Is programming under SPARK rules significantly harder than under
non-SPARK Ada?

Also, I believe that had the originators of C not allowed arithmetic
on pointers (comparisons for equality would still be allowed, and
array addressing would have to use subscripts) many of the problems
with C pointers wouldn't have occurred. Of course, that horse has
left the barn a long time ago.

I recently helped to debug an Ada program that now and then, but not
often, was overwriting some buffers. At one point in that program I had *cough* used pointer arithmetic *blush* instead of array indexing, for
what I felt were good reasons at the time. But it bit me. An amusing
clue to the error was that the bug happened more often when the
satellite running the program was above Russia's borders. Perhaps you
can guess reasons for that :-)

Interesting. Perhaps it is because Russia has less "careful governance
and guidance from God" :-)
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From John Levine@johnl@taugh.com to comp.arch on Mon Jun 22 17:10:45 2026

From Newsgroup: comp.arch

According to Andy Valencia <vandys@vsta.org>:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Actually IA-32 (since the 486) and AMD64 have a bit for turning on
unaligned traps, but unfortunately there is too much software in
libraries that performs unaligned accesses.

Interesting... what's this bit called? In which register does
it live?

A quick grep of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual finds the AC bit in the EFLAGS register. I happen
to have an old 486 manual and it's there, too.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jun 22 17:09:25 2026

From Newsgroup: comp.arch

Andy Valencia <vandys@vsta.org> writes:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Actually IA-32 (since the 486) and AMD64 have a bit for turning on
unaligned traps, but unfortunately there is too much software in
libraries that performs unaligned accesses.

Interesting... what's this bit called? In which register does
it live?

The bit is called "alignment check (AC)" and is bit 18 of EFLAGS.
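For anyone who wants to refer to the bit in code, a minimal C sketch of the mask follows. The macro and function names are hypothetical; only the bit position (18) comes from the text above. Per the Intel manual, alignment checking only takes effect when CR0.AM is also set and the code runs at privilege level 3. The sketch manipulates a saved flags value and does not touch the real register.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_AC_BIT  18u
#define EFLAGS_AC_MASK (UINT32_C(1) << EFLAGS_AC_BIT)   /* 0x00040000 */

/* Return a copy of a saved EFLAGS image with AC set or cleared. */
static uint32_t eflags_set_ac(uint32_t eflags, bool enable)
{
    return enable ? (eflags | EFLAGS_AC_MASK)
                  : (eflags & ~EFLAGS_AC_MASK);
}

/* Report whether AC is set in a saved EFLAGS image. */
static bool eflags_ac_is_set(uint32_t eflags)
{
    return (eflags & EFLAGS_AC_MASK) != 0;
}
```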

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 17:47:02 2026

From Newsgroup: comp.arch

Andy Valencia <vandys@vsta.org> posted:

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Actually IA-32 (since the 486) and AMD64 have a bit for turning on unaligned traps, but unfortunately there is too much software in
libraries that performs unaligned accesses.

Even My 66000 has a bit to turn on misaligned checks {for debugging}.
Added late last year.

Interesting... what's this bit called? In which register does
it live?

Thanks!

Andy Valencia
Home page: https://www.vsta.org/andy/


From George Neuner@gneuner2@comcast.net to comp.arch on Mon Jun 22 18:49:40 2026

From Newsgroup: comp.arch

On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:

Robert Swindells [2026-06-19 11:20:10] wrote:

On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp), and
SPARC contains some support for that, but as one of the IIRC Franz Lisp
developers has explained in this newsgroup, they actually did not use
this feature, because the performance benefit was not big enough to

[...]

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

I guess you two aren't talking about the same "Franz Lisp".
AFAIK Anton is referring to the commercial Common Lisp compiler
associated with the Franz Inc company, marketed under the name
"Allegro".

=== Stefan

ISTM there were at least a couple of Lisps available for the Vax. I
can't speak to Franz, but I do know at least one Vax Lisp was a
BIBOP[1] system that (generally) did not use tags.

In BIBOP, memory "pages"[2] are dedicated to a single data type. The
base address of the page is mapped to the type of the objects the page contains, and so the objects (and pointers to them) need no type
information themselves. This allowed for full width pointers, fixnums
and floats, and for conses, boxes, and other fixed sized data types
(including user types) to avoid tagging.

Obviously there has to be a way to identify which "slots" on each page
are in use. This typically is done using a bitmap kept with the type information. If a page becomes empty it can be repurposed to host
another type.

Also large and/or variably sized objects that don't fit into a single
page or do not cleanly divide the chosen page size had to be handled separately. These objects need to be tagged (though their pointers do
not) and allocated from a general heap.

[1] BIg Bag Of Pages
[2] BIBoP "pages" could be actual VMM pages, or just same-sized
contiguous blocks defined by software.
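A small C sketch of the BIBOP idea as described above: software-defined pages, one type per page, a per-page in-use bitmap, and a type lookup that needs only the object's address. All names, sizes, and type codes are made up for illustration and are not taken from any particular Lisp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes; a real system would have many more. */
enum obj_type { TYPE_UNUSED = 0, TYPE_CONS, TYPE_FIXNUM, TYPE_FLOAT };

enum { BIBOP_PAGE_SIZE = 4096, BIBOP_NUM_PAGES = 16, BIBOP_SLOTS = 64 };

static unsigned char heap[BIBOP_NUM_PAGES * BIBOP_PAGE_SIZE];
static enum obj_type page_type[BIBOP_NUM_PAGES];  /* one type per page */
static uint64_t page_in_use[BIBOP_NUM_PAGES];     /* one bit per slot  */

/* Dedicate an empty page to a single data type. */
static void dedicate_page(size_t page, enum obj_type type)
{
    page_type[page] = type;
    page_in_use[page] = 0;
}

/* Hand out the first free slot on a page, or NULL if the page is full. */
static void *alloc_slot(size_t page)
{
    size_t slot_size = BIBOP_PAGE_SIZE / BIBOP_SLOTS;
    for (size_t slot = 0; slot < BIBOP_SLOTS; slot++) {
        uint64_t bit = UINT64_C(1) << slot;
        if ((page_in_use[page] & bit) == 0) {
            page_in_use[page] |= bit;
            return heap + page * BIBOP_PAGE_SIZE + slot * slot_size;
        }
    }
    return NULL;
}

/* Recover an object's type from its address alone: no tag bits in the
   pointer and no header in the object. obj must point into heap. */
static enum obj_type type_of(const void *obj)
{
    size_t offset = (size_t)((const unsigned char *)obj - heap);
    return page_type[offset / BIBOP_PAGE_SIZE];
}
```

After dedicate_page(2, TYPE_CONS), every pointer returned by alloc_slot(2) reports TYPE_CONS from type_of while remaining a plain full-width pointer.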

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jun 22 08:46:55 2026

From Newsgroup: comp.arch

MitchAlsup [2026-06-22 00:02:05] wrote:

I think it is more appropriate to say the M88K was peaky--some things
it did quite well, and others "not so much".

I'd be interested to hear of the cases where it shone and the cases
where it had more difficulty. Same for other architectures of the time.

=== Stefan

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jun 22 09:02:11 2026

From Newsgroup: comp.arch

BGB [2026-06-21 16:22:43] wrote:

This requires a CPU that can deal with PUSH/POP mechanics in hardware.
With a Link Register, HW doesn't need to deal with this.

I don't think that's a very useful way to look at it: it's very easy
for a CPU to handle PUSH/POP in hardware (as evidenced by the fact that
many early CPUs did it).

I think the downside of "hardware-managed stack" is not the hardware
cost but the fact that it may not quite fit the needs of the compiler
(e.g. it may be tricky to use if your compiler uses heap allocation for
the stack frames).

IOW, you need to make sure your CPU can be used efficiently without
using the hardware-managed-stack, which should be easy to do, since it
doesn't require much more than "jump-and-link" (all the rest is
standard LD/ST/ADD/SUB).

=== Stefan
--- Synchronet 3.22a-Linux NewsLink 1.2

From Andy Valencia@vandys@vsta.org to comp.arch on Mon Jun 22 16:28:17 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

A quick grep of the Intel® 64 and IA-32 Architectures Software
Developer's Manual finds the AC bit in the EFLAGS register. I happen
to have an old 486 manual and it's there, too.

Thank you! I read with interest its interaction with SMEP/SMAP as well.

I have been out of the kernel game for many dog-years.

-- Andy
--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 01:13:13 2026

From Newsgroup: comp.arch

On 6/21/2026 2:56 PM, Robert Swindells wrote:

On Sun, 21 Jun 2026 13:55:59 -0500, BGB wrote:

Though, I guess one merit of a Lisp like language is that it is a lot
easier to parse, and it could be possible to implement a fairly cheap
compiler for it (in the basic case).

Usual downside is that the excessive parentheses tend to turn into a
usability issue.

You use an editor that keeps track of them.

Probably.
The main editor I use on Windows, Notepad2, has syntax highlighting and parenthesis matching.

Normal Notepad does not.

Though, would seem that these features have become fairly common in text-editors in Linux land.

One other major hassle was typically a lack of C style loops (with break
or continue), but this could be addressed in theory.

It is addressed in practice.

You could run SBCL on your CPU, it has a RISC-V backend to the compiler.

Probably, would need to look into it.

But, yeah, if it could target RV64G or similar and doesn't have too many platform dependencies, could work.

...

Well, even with the annoyance of being unable to self-host with my C
compiler due to RAM footprint (ideally, would want it to be able to
compile something like Doom, or itself, in under around 30MB of RAM).

As-is, compiling Doom in BGBCC needs ~ 100 MB (though, having fiddled with
it some more, got the footprint down a little).

Looks like main things eating RAM are (according to some internal
mem-use stats via BGBCC's allocator, descending ranking):
Register Info structs (used for describing variables), ~ 12MB;
VirtOp structures (used for 3AC ops), ~ 9MB;
Arrays of Pointers to RegInfo, ~ 7MB;
Section Buffers, ~ 6MB;
AST nodes, ~4MB;
...

The internal memory manager lists around 64MB used, but Visual Studio
says 100MB, though currently the internal stats don't list allocator
overheads or the amount of memory eaten by the big interned-string table.

Checking, string table is currently around 50K strings and using around
920K of RAM, so an average of around 18 bytes per string.

RegInfo's are using ~ 12MB, each struct is 320 bytes ATM, so around 40K RegInfo structs...

Struct is kinda bulky as it has the combined fields needed to express variables, functions, structs/unions, ...

It has a lot of pointers, difficult to avoid in this case. Did end up
moving some of the string members from raw pointers to interned string indices, which maybe at least saved something...
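
For readers unfamiliar with the trick, a minimal sketch of an interned-string table that hands out 32-bit indices in place of 64-bit pointers might look like the following. This is not BGBCC's actual code; the class name StringTable and its interface are assumptions made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical interned-string table: each distinct string is stored once
// and referred to by a 32-bit index rather than a pointer.
class StringTable {
public:
    using Index = std::uint32_t;

    // Returns the existing index if the string was seen before,
    // otherwise stores it and returns a fresh index.
    Index intern(std::string_view s) {
        auto it = index_.find(std::string(s));
        if (it != index_.end()) return it->second;
        const Index id = static_cast<Index>(strings_.size());
        strings_.emplace_back(s);
        index_.emplace(strings_.back(), id);
        return id;
    }

    // Maps an index back to its string; throws std::out_of_range if invalid.
    const std::string& lookup(Index id) const { return strings_.at(id); }

    std::size_t size() const { return strings_.size(); }

private:
    std::vector<std::string> strings_;                 // index -> string
    std::unordered_map<std::string, Index> index_;     // string -> index
};
```

On a 64-bit host, a struct member that was a raw string pointer shrinks from 8 bytes to 4 this way, at the cost of an extra table lookup whenever the text is needed.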

Checking:
~ 6K global declarations
~ 9K structs / typedefs / array-initializers / ...
~ 25K internal (non-global) variables
Divided as some combination of: locals, args, struct fields, ...

Of which, 1200 are functions that make it into the final binary.

Looks like AST nodes aren't being too unreasonable ATM (~ 4MB) but these
get reused between TUs.

As for speed:
~ 14s with debug dumping;
~ 9s with no debug dumping;
~ 6s if I compile the compiler with optimizations enabled.

...

Checking:
~ 5MB for .text
~ 300K for data+bss.

Granted, maybe still moderately fast and lightweight by modern desktop
PC standards...

Though unclear how I could significantly reduce the memory footprint of
the compiler if still doing whole-program builds, as it would appear a
few significant memory consumers are things I would need even if I were
doing the 3AC translation incrementally.

Like, would likely still need to build a view of the global toplevel
(and only real way to make the struct smaller would be to try to
separate off the members needed for structs/functions from those needed
for plain variables).

...

--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jun 23 05:54:47 2026

From Newsgroup: comp.arch

Andy Valencia <vandys@vsta.org> writes:

I read with interest its interaction with SMEP/SMAP as well.

That's gross. Because it's so cumbersome to change the SMAP bit
(which prevents access to user pages from supervisor mode),
they repurposed the AC flag to mean, in supervisor mode, that
user-memory accesses are allowed even if the SMAP bit is set. The AC
bit is easy to set with the STAC and clear with the CLAC instruction
(these instructions do not exist in my Pentium manual, so they were
added later). I guess that a set AC does not trap unaligned accesses
in supervisor mode.

I have been out of the kernel game for many dog-years.

The alignment trap functionality can be switched on or off from user
mode, and apparently, after reading the stuff about SMAP, I expect
that this functionality exists only in user mode.
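
To spell out the two meanings of AC described above, here is a small sketch that models them as predicates. It is a simplification based on my reading of the Intel SDM (alignment checking applies only with CR0.AM = 1, EFLAGS.AC = 1 and CPL = 3; SMAP blocks explicit supervisor data accesses to user pages unless EFLAGS.AC = 1). The struct and function names are made up for illustration, and implicit supervisor accesses and the rest of the paging checks are ignored.

```cpp
// Simplified model of the two roles of EFLAGS.AC; not a full description
// of the x86 access checks. All names here are illustrative.
struct CpuState {
    bool cr4_smap;    // CR4.SMAP: supervisor-mode access prevention enabled
    bool cr0_am;      // CR0.AM: alignment mask
    bool eflags_ac;   // EFLAGS.AC: set by STAC, cleared by CLAC
    int  cpl;         // current privilege level, 0 (kernel) .. 3 (user)
};

// May an explicit supervisor-mode data access touch a user-mode page?
constexpr bool supervisor_may_touch_user_page(const CpuState& s) {
    return !s.cr4_smap || s.eflags_ac;
}

// Does an unaligned data access raise an alignment-check exception (#AC)?
constexpr bool unaligned_access_traps(const CpuState& s) {
    return s.cr0_am && s.eflags_ac && s.cpl == 3;
}
```

Under this model a kernel that has executed STAC can reach user memory, but does not start taking alignment traps, since those require CPL = 3.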

One interesting aspect is that at some point, instructions for setting
and clearing AC were added, making that operation look cheap, but they
actually are serializing instructions on AMD processors up to and
including Zen4, and probably also on some or all Intel processors.
Thinking about the fact that the bit enables or disables trapping some conditions, yes, that's a way to implement these instructions.
According to <https://github.com/advisories/ghsa-mhcq-hvgj-xm2h>, Zen5
renames AC, which would mean that every memory access gets the renamed
value of the AC flag as input.

Another implementation that comes to my mind would be to implement
STAC and CLAC purely in the front end without serialization, and
provide the value of the AC bit as extra bit in every instruction
coming out of the decoder; POPFD and possibly other instructions that
change AC would be serializing, however.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Jun 23 10:34:44 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Tue Jun 23 16:37:58 2026

From Newsgroup: comp.arch

On 2026-06-22 19:50, Stephen Fuld wrote:

On 6/22/2026 9:26 AM, Niklas Holsti wrote:

On 2026-06-22 17:59, Stephen Fuld wrote:

On 6/22/2026 3:44 AM, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

[snip]

There is a discussion going on at the moment about "pointer
providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I don't know the answer to your question, but presumably we can do
better than C does. Isn't that one of the, at least claimed,
advantages of Rust, and perhaps even Ada?

Both Rust and Ada have to be restricted in certain ways in order to
ensure absence of pointer errors: Rust has to avoid "unsafe" code,

I like Rust's solution. You can do unsafe things - sometimes they are
just necessary - but they are not the default way of doing things, and
you have to notate them in the source code which serves to discourage
them and points people debugging errors to certain areas of the code
that are more likely to be problematic.

Same in Ada, mostly: some unsafe things are named "Unchecked_Xxx",
others are available only if some specific predefined packages are used,
which are not needed for most safe things.

and Ada has to avoid pointer-related "unchecked" constructs and
certain undefined behavior (which does exist in Ada, but less so than
in C). The Ada subset called SPARK, together with its proof tools, is
meant for such programming, and has a feature similar to Rust
"ownership" though standard Ada does not.

Is programming under SPARK rules significantly harder than under
nonSPARK Ada?

I don't have personal experience, but my impression is that it does not
make it markedly harder than the usual restrictions on embedded,
more-or-less critical software do. SPARK is defined and supported by the AdaCore company, not a standards group, and is evolving. The
documentation is at https://www.adacore.com/documentation?tab=spark; the
main restrictions are (quoted from https://docs.adacore.com/live/wave/spark2014/html/spark2014_rm/introduction.html#principal-language-restrictions,
with my comments in []):

--- quote:

To facilitate formal analyses and verification, SPARK enforces a number
of global restrictions to Ada. While these are covered in more detail in
the remaining chapters of this document, the most notable restrictions are:

- Restrictions on the use of access types and values [pointers], similar
in some ways to the ownership model of the programming language Rust.

- All expressions (including function calls) are free of side effects.

- Aliasing of names is not permitted in general but the renaming of
entities is permitted as there is a static relationship between the two
names. In analysis all names introduced by a renaming declaration are
replaced by the name of the renamed entity. This replacement is applied recursively when there are multiple renames of an entity.

- Backward goto statements are not permitted.

- The use of controlled types is not currently permitted. [These are
types with automatic invocation of user-defined initialization and finalization operations on object creation, copying, and deletion.]

- Tasks and protected objects are permitted only if the Ravenscar
profile (or the Jorvik profile) is specified. [The main limitation in
these profiles is that the set of tasks (threads) is static, no task
ever terminates, and inter-task communication is by protected objects (monitors, synchronized objects) and not by rendez-vous.]

- Raising and handling of exceptions is not currently permitted
(exceptions can be included in a program but proof must be used to show
that they cannot be raised).

--- end quote.

Also, I believe that had the originators of C not allowed arithmetic
on pointers (comparisons for equality would still be allowed, and
array addressing would have to use subscripts) many of the problems
with C pointers wouldn't have occurred. Of course, that horse has
left the barn a long time ago.

I recently helped to debug an Ada program that now and then, but not
often, was overwriting some buffers. At one point in that program I
had *cough* used pointer arithmetic *blush* instead of array indexing,
for what I felt were good reasons at the time. But it bit me. An
amusing clue to the error was that the bug happened more often when
the satellite running the program was above Russia's borders. Perhaps
you can guess reasons for that :-)

Interesting. Perhaps it is because Russia has less "careful governance
and guidance from God" :-)

One could indeed say so, because the reason is Putin's attack on
Ukraine, as you may have guessed.

The Ada program runs a satellite-based GNSS receiver that acquires
(finds) and then tracks GNSS signals from GNSS satellites (GPS, Galileo,
and others) as those satellites rise or set. The purpose is to measure atmospheric properties from the way the atmosphere refracts the signal.

The design and/or coding error was in the transition between two stages
of the multi-stage procedure for finding and starting to track a GNSS
signal from a GNSS satellite.

So then: Russia attacks Ukraine => Ukraine defends itself with
long-distance drones => Russia jams and perturbs GNSS signals along its borders => the satellite software often loses track of a signal it is
tracking => the satellite software often has to re-acquire signals =>
the bug manifests more often over Russia's borders.

If one favours the Ukrainian Orthodox church, which objects to this war, Russia is going against God's guidance. If one favours the Russian
Orthodox church, which blesses this war, Russia is following God's guidance.

(The bug was not found in testing because it did not manifest on every transition between the two acquisition stages -- it manifested only when
two other dynamic program states occurred together, at the same time as
the transition, and one of these states is rather rare, at least in test conditions.)

--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Jun 23 14:32:22 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> writes:

On 6/21/2026 2:56 PM, Robert Swindells wrote:

On Sun, 21 Jun 2026 13:55:59 -0500, BGB wrote:

Though, I guess one merit of a Lisp like language is that it is a lot
easier to parse, and it could be possible to implement a fairly cheap
compiler for it (in the basic case).

Usual downside is that the excessive parentheses tend to turn into a
usability issue.

You use an editor that keeps track of them.

Probably.
The main editor I use on Windows, Notepad2, has syntax highlighting and
parenthesis matching.

Normal Notepad does not.

Though, would seem that these features have become fairly common in
text-editors in Linux land.

That feature has been common in Unix and Linux land for close to three decades.

One might even note that color syntax highlighting predates
Windows completely in one form or another (e.g., the 1969 Emily editor).

--- Synchronet 3.22a-Linux NewsLink 1.2

From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Jun 23 07:37:12 2026

From Newsgroup: comp.arch

On 6/23/2026 6:37 AM, Niklas Holsti wrote:

On 2026-06-22 19:50, Stephen Fuld wrote:

On 6/22/2026 9:26 AM, Niklas Holsti wrote:

On 2026-06-22 17:59, Stephen Fuld wrote:

On 6/22/2026 3:44 AM, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

[snip]

There is a discussion going on at the moment about "pointer
providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I don't know the answer to your question, but presumably we can do
better than C does. Isn't that one of the, at least claimed,
advantages of Rust, and perhaps even Ada?

Both Rust and Ada have to be restricted in certain ways in order to
ensure absence of pointer errors: Rust has to avoid "unsafe" code,

I like Rust's solution. You can do unsafe things - sometimes they are
just necessary - but they are not the default way of doing things, and
you have to notate them in the source code which serves to discourage
them and points people debugging errors to certain areas of the code
that are more likely to be problematic.

Same in Ada, mostly: some unsafe things are named "Unchecked_Xxx",
others are available only if some specific predefined packages are used, which are not needed for most safe things.

and Ada has to avoid pointer-related "unchecked" constructs and
certain undefined behavior (which does exist in Ada, but less so than
in C). The Ada subset called SPARK, together with its proof tools, is
meant for such programming, and has a feature similar to Rust
"ownership" though standard Ada does not.

Is programming under SPARK rules significantly harder than under
nonSPARK Ada?

I don't have personal experience, but my impression is that it does not
make it markedly harder than the usual restrictions on embedded,
more-or-less critical software do. SPARK is defined and supported by the
AdaCore company, not a standards group, and is evolving. The
documentation is at https://www.adacore.com/documentation?tab=spark; the
main restrictions are (quoted from https://docs.adacore.com/live/wave/spark2014/html/spark2014_rm/introduction.html#principal-language-restrictions,
with my comments in []):

--- quote:

To facilitate formal analyses and verification, SPARK enforces a number
of global restrictions to Ada. While these are covered in more detail in
the remaining chapters of this document, the most notable restrictions are:

- Restrictions on the use of access types and values [pointers], similar
in some ways to the ownership model of the programming language Rust.

- All expressions (including function calls) are free of side effects.

- Aliasing of names is not permitted in general but the renaming of
entities is permitted as there is a static relationship between the two names. In analysis all names introduced by a renaming declaration are replaced by the name of the renamed entity. This replacement is applied recursively when there are multiple renames of an entity.

- Backward goto statements are not permitted.

- The use of controlled types is not currently permitted. [These are
types with automatic invocation of user-defined initialization and finalization operations on object creation, copying, and deletion.]

- Tasks and protected objects are permitted only if the Ravenscar
profile (or the Jorvik profile) is specified. [The main limitation in
these profiles is that the set of tasks (threads) is static, no task
ever terminates, and inter-task communication is by protected objects (monitors, synchronized objects) and not by rendez-vous.]

- Raising and handling of exceptions is not currently permitted
(exceptions can be included in a program but proof must be used to show
that they cannot be raised).

--- end quote.

Also, I believe that had the originators of C not allowed
arithmetic on pointers (comparisons for equality would still be
allowed, and array addressing would have to use subscripts) many of
the problems with C pointers wouldn't have occurred. Of course,
that horse has left the barn a long time ago.

I recently helped to debug an Ada program that now and then, but not
often, was overwriting some buffers. At one point in that program I
had *cough* used pointer arithmetic *blush* instead of array
indexing, for what I felt were good reasons at the time. But it bit
me. An amusing clue to the error was that the bug happened more often
when the satellite running the program was above Russia's borders.
Perhaps you can guess reasons for that :-)

Interesting. Perhaps it is because Russia has less "careful
governance and guidance from God" :-)

One could indeed say so, because the reason is Putin's attack on
Ukraine, as you may have guessed.

The Ada program runs a satellite-based GNSS receiver that acquires
(finds) and then tracks GNSS signals from GNSS satellites (GPS, Galileo,
and others) as those satellites rise or set. The purpose is to measure atmospheric properties from the way the atmosphere refracts the signal.

The design and/or coding error was in the transition between two stages
of the multi-stage procedure for finding and starting to track a GNSS
signal from a GNSS satellite.

So then: Russia attacks Ukraine => Ukraine defends itself with
long-distance drones => Russia jams and perturbs GNSS signals along its
borders => the satellite software often loses track of a signal it is tracking => the satellite software often has to re-acquire signals =>
the bug manifests more often over Russia's borders.

If one favours the Ukrainian Orthodox church, which objects to this war, Russia is going against God's guidance. If one favours the Russian
Orthodox church, which blesses this war, Russia is following God's
guidance.

(The bug was not found in testing because it did not manifest on every transition between the two acquisition stages -- it manifested only when
two other dynamic program states occurred together, at the same time as
the transition, and one of these states is rather rare, at least in test conditions.)

For both the above and the discussion about SPARK, Thanks Niklas, quite interesting.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 23 17:33:49 2026

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?

The debugger runs in a separate process with access to application
Root pointer and ASID. In that process, Call-stack is RW-.
--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Jun 23 17:43:38 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?

The debugger runs in a separate process with access to application
Root pointer and ASID. In that process, Call-stack is RW-.

GLIBC has a function to obtain a backtrace at a current point
in time. This is called in the context of the thread that invokes
the call. It requires access to the call records on the stack
in the context of the thread (the glibc functions are backtrace(3)
and backtrace_symbols(3)).

/**
 * Log a simulator stack traceback.
 */
void
c_osdep::backtrace(c_logger *lp)
{
    int num_frames;
    void *framelist[100];
    char **strings;

    num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));
    strings = ::backtrace_symbols(framelist, num_frames);
    if (strings == NULL) {
        lp->log("Unable to obtain simulator stack traceback: %s\n",
                strerror(errno));
        return;
    }
    for(int frame=0; frame < num_frames; frame++) {
        lp->log("[%2.2d] %s\n", frame, strings[frame]);
    }
    ::free(strings);
}

--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 15:15:49 2026

From Newsgroup: comp.arch

On 6/23/2026 12:43 PM, Scott Lurndal wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?

The debugger runs in a separate process with access to application
Root pointer and ASID. In that process, Call-stack is RW-.

GLIBC has a function to obtain a backtrace at a current point
in time. This is called in the context of the thread that invokes
the call. It requires access to the call records on the stack
in the context of the thread (the glibc functions are backtrace(3)
and backtrace_symbols(3)).

/**
 * Log a simulator stack traceback.
 */
void
c_osdep::backtrace(c_logger *lp)
{
    int num_frames;
    void *framelist[100];
    char **strings;

    num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));
    strings = ::backtrace_symbols(framelist, num_frames);
    if (strings == NULL) {
        lp->log("Unable to obtain simulator stack traceback: %s\n",
                strerror(errno));
        return;
    }
    for(int frame=0; frame < num_frames; frame++) {
        lp->log("[%2.2d] %s\n", frame, strings[frame]);
    }
    ::free(strings);
}

Yeah, for what arguable benefits separate call / data stacks could
bring, or making the call stack inaccessible to the program, this
doesn't fit with the vibe of either RISC philosophy, or for sake of
practical things like implementing C++ style throw/catch, or mechanisms
like C's longjmp, ...

One would likely need to defy minimalism by having additional hardware mechanisms to support these kinda things.

Or, at least more than the damage already done in my case by putting
mode tag bits and similar in the link register, which could
potentially affect code which messes with the link register value
directly and assumes the link register represents a bare address value.
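
As a rough illustration of why such code breaks, here is a sketch that assumes a purely hypothetical layout with the return address in the low 48 bits of a 64-bit link register and the mode tag in the high 16 bits. The split and the helper names are assumptions for the example, not a description of any particular ISA.

```cpp
#include <cstdint>

// Hypothetical layout: low 48 bits hold the return address, high 16 bits
// hold mode tag bits. The real split in any given ISA may differ.
constexpr unsigned kAddrBits = 48;
constexpr std::uint64_t kAddrMask = (std::uint64_t{1} << kAddrBits) - 1;

// Pack an address and a mode tag into one link register value.
constexpr std::uint64_t make_lr(std::uint64_t addr, std::uint16_t tag) {
    return (static_cast<std::uint64_t>(tag) << kAddrBits) | (addr & kAddrMask);
}

// Code that wants the bare return address has to mask the tag off first.
constexpr std::uint64_t lr_address(std::uint64_t lr) { return lr & kAddrMask; }

// Extract the mode tag bits.
constexpr std::uint16_t lr_tag(std::uint64_t lr) {
    return static_cast<std::uint16_t>(lr >> kAddrBits);
}
```

Anything that compares or does arithmetic on the raw register value without the masking step sees the tag bits as part of the address.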

...

--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 17:48:22 2026

From Newsgroup: comp.arch

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

John Levine <johnl@taugh.com> writes:

C killed off every memory model other than flat byte addressed
memory.

At least in the C standard the memory is segmented into objects.

Pointers are sort of typed, but any real C program does stuff like
this:

   p = (struct foo *) malloc(42 * sizeof(struct foo));

That produces an object of a certain size, and you must only access it
through pointers derived from p. And programs usually satisfy that
requirement.

      {
          p = (struct foo *) malloc(42 * sizeof(struct foo));
          fprintf( stream, "0x16,", p );
          ...
          if( fscanf( stream, "x16", q ) ) {
              use q
          }
      }

is q "derived" through p ??

There is a discussion going on at the moment about "pointer providence"

Perhaps you meant pointer "provenance"? I hope we will not rely on the
"careful governance and guidance of God", or on an "instance of divine
intervention" to ensure pointer safety...

Has pointer safety been shown to be equivalent to the halting
problem? If so, "careful governance and guidance from God" may
indeed be required.

I would assume it is undecidable, for unrestricted programs. The aim of pointer provenance is no doubt to restrict programs to make it decidable
to some extent.

I didn't really understand it myself.

In my case, I tended to use more conservative approaches and then only optimize based on what can be verified by the compiler within certain fundamental assumptions.

Say:
Pointer 1 points at a stack array in the local function;
Pointer 2 was derived from taking the address of a global array;
Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume they
don't alias with a pointer to a local array;
...
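
A minimal sketch of that kind of conservative check, written as a standalone function, could look like this. The Origin categories and the PtrInfo record are hypothetical, and the sketch additionally assumes pointers stay within the array they were derived from. It answers "no alias" only in the cases listed above and says "may alias" for everything else.

```cpp
// Where a pointer value came from, as far as the compiler can verify.
enum class Origin { LocalArray, GlobalArray, Parameter, Unknown };

struct PtrInfo {
    Origin origin;
    int object_id;   // which array it was derived from; only meaningful
                     // for LocalArray and GlobalArray (hypothetical id)
};

// Returns true only when the two pointers provably refer to different
// objects; false means "may alias", which is always the safe answer.
bool provably_no_alias(const PtrInfo& a, const PtrInfo& b) {
    const bool a_named = a.origin == Origin::LocalArray || a.origin == Origin::GlobalArray;
    const bool b_named = b.origin == Origin::LocalArray || b.origin == Origin::GlobalArray;
    if (a_named && b_named) {
        // Stack array vs global array, or two distinct arrays of one kind.
        return a.origin != b.origin || a.object_id != b.object_id;
    }
    // A pointer passed in by the caller cannot point at an array that only
    // came into existence in this function's own frame.
    const bool param_vs_local =
        (a.origin == Origin::Parameter && b.origin == Origin::LocalArray) ||
        (a.origin == Origin::LocalArray && b.origin == Origin::Parameter);
    return param_vs_local;
}
```

Note that two Parameter pointers still fall through to "may alias", as do a Parameter and a GlobalArray, since the caller could have passed the global's address.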

Another option is a sort of "selective TBAA":
Enable TBAA, but only if the current function doesn't contain any
obvious pointer casts or similar.

...

Then had noted that in my compiler (while working on it to try to reduce memory use), that there was a feature to walk the call-flow graph and
mark off whichever global variables may be modified and similar as a
result of calling some function.

Had sort of forgot this existed, but is sometimes useful to know (can
keep a global cached in a register if one knows the called function will
not modify it, otherwise spill/reload is necessary).
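
The general idea of such a walk can be sketched as follows. This is not the actual BGBCC code, just a guess at the shape of it: the Program and FuncInfo types are invented for the example, and calls through function pointers or to functions outside the table are simply skipped here, where a real compiler would have to assume the worst.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Per-function summary (hypothetical representation).
struct FuncInfo {
    std::set<std::string> writes;     // globals the function writes directly
    std::vector<std::string> calls;   // functions it calls directly
};
using Program = std::map<std::string, FuncInfo>;

// Collect every global that may be modified as a result of calling `root`,
// following the call graph transitively. Recursion is handled by the
// visited set.
std::set<std::string> may_modify(const Program& prog, const std::string& root) {
    std::set<std::string> result;
    std::set<std::string> visited;
    std::vector<std::string> work{root};
    while (!work.empty()) {
        const std::string f = work.back();
        work.pop_back();
        if (!visited.insert(f).second) continue;   // already processed
        const auto it = prog.find(f);
        if (it == prog.end()) continue;            // unknown callee: skipped in this sketch
        result.insert(it->second.writes.begin(), it->second.writes.end());
        for (const std::string& callee : it->second.calls) work.push_back(callee);
    }
    return result;
}
```

A global that is not in may_modify(prog, callee) can stay cached in a register across the call; anything in the set needs the spill/reload.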

...

I am reminded of the person, apparently very religious, who some decades
ago posted to solicit help for reimplementing all of computing (gcc,
GNU, et cetera) on Biblical principles, because he thought Richard
Stallman was too atheistic and had tainted his products. I have not
heard how that went.

There were a few people like that...

There is seemingly a fine line though between being overly religious and
being insane. A few of the people who I had seen who were like that, had
been a bit of the latter.

There is a lot of complexity with things like doctrine and theology,
etc, but there is a characteristic difference IME.

Well, and a leaning towards "reality defying" views; more emphasis on supernatural events and experiences, defiance of things like basic
physics or rules of mathematics; and often pairing the outward
religiosity with rather unstable or inconsistent adherence to moral or
ethical behavior (or, applying it only to other people, while giving themselves free rein to indulge in whatever they feel like doing); ...

Well, and seemingly, the "more genuine" thing being to express restraint
in one's own behavior in these areas, not to worry about or try to
control what anyone else is doing.

Well, and then there is sorta the cultural expectation that one
evangelize to others, etc, but this doesn't make as much sense in
contexts where everyone likely already knows and/or has already made up
their mind.

Or, one ends up getting on others' bad sides, say, if one admits that
they don't personally buy into the "Young Earth Creationist" mindset,
and feel that (as a society) people have mostly been interpreting
Genesis incorrectly (and making themselves look stupid in the process,
by insisting that everyone adopt an overly particular and somewhat
nonsensical interpretation).

But, alas, ...

Though, this doesn't mean that I can claim to always have a 100% stable
hold on what constitutes "reality" (and my own experiences do include
things that seem to deviate from normal expectations).

Though, most in my experience seem to be things like seeming time-flow
and causality breaks: experiences where normal linear time-flow seems to
break down; where events happen in ways that seem to break forwards
causal order; or where sometimes stuff just "changes around" for no
particular reason.

Though, I guess I differ by not claiming to have any higher explanation
for stuff like this...

Also often more like "bad Sci-Fi tropes" than particularly religious
though (like one seemingly encounters weirdness that more seems like
something out of Star Trek or something...).

Well, like high-level examples:
  Seeming delayed-choice key-ring color instability;
    Choose one color of keyring, it flip-flops and changes later.
  Several instances/variations of:
    Go to the bathroom, seemingly experience time displacement.
    Like, go into bathroom, may emerge with an unexpected time delta.
  Times where events seemingly tie into "time knots";
    Event sequence becomes paradoxical, causes/effects are reversed;
    Or occasional time-loops (reliving the same events multiple times);
  Unexpected changes appearing in one's code;
    Like, the code was one way, then it was different.
    Or, in CNC, an event where some M01's turned into M00's somehow.
  Or, one remembers documenting something,
    but then what they wrote is nowhere to be found.
  ...

Could try to come up with some sort of explanation, but purely
observational, one might just claim "if one goes to the bathroom or
similar, they may sometimes somehow initiate temporal anomalies". Well,
and/or attribute it to neurological factors.

Though, many of these sorts of events do seem oddly correlated with
"went to the bathroom, then some weirdness happens...".

Well, and realizing that some bigger mysteries from earlier in my life
had more mundane explanations:
  Weird Mac style computer with external magneto-optical drive, etc:
    Apparently was actually a thing at the time...
    I just don't know why anyone would have shown it to me.
    Like, not actually "alien", just absurdly expensive.
    But, then, I question if I really saw it.
    But, why would I remember such a setup if I didn't see it?...
    But, why subject a random 3rd grader to Pascal and MPW?...
    Like, a story with plausible explanations, technically.
    But, the "why" aspect doesn't make sense...
  LaserDisc disappearance in the early 2000s:
    Apparently people mostly just got rid of them...
  The rare purple LaserDiscs:
    Apparently recordable LaserDisc was just a market flop,
    not some weird alien tech.
    A one-off incident in a school A/V setup.
    Like, sometimes they used LD, and not just VCRs.
  ...

Though, it is still odd sometimes to have maybe encountered weird tech,
to then have it disappear and never see it again (like, where one can question whether their current self is still living in the same timeline
they existed in during their childhood).

Not like there is anything particularly religious about tech though, and
if I saw this stuff as an adult would probably have not thought as much
about it.

Does sometimes seem like life could have gone differently in some areas,
I was just sort of an epic fail at everything.

...

--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jun 24 00:54:32 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

-------------

In my case, I tended to use more conservative approaches and then only optimize based on what can be verified by the compiler within certain fundamental assumptions.

Say:
Pointer 1 points at a stack array in the local function;
Pointer 2 was derived from taking the address of a global array;
Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume they don't alias with a pointer to a local array;

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming error.

-----------------

I am reminded of the person, apparently very religious, who some decades ago posted to solicit help for reimplementing all of computing (gcc,
GNU, et cetera) on Biblical principles, because he thought Richard Stallman was too atheistic and had tainted his products. I have not
heard how that went.

Rick...

--------
--- Synchronet 3.22a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jun 24 00:59:12 2026

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?

The debugger runs in a separate process with access to application
Root pointer and ASID. In that process, Call-stack is RW-.

GLIBC has a function to obtain a backtrace at a current point
in time. This is called in the context of the thread that invokes
the call. It requires access to the call records on the stack
in the context of the thread (the glibc functions are backtrace(3)
and backtrace_symbols(3)).

When Thread is unExceptional it cannot access Call Stack,
when Thread is Exceptional it can.

ENTER, EXIT, and RET are exempt from the protection check.
Call Stack Pointer is not accessible to unprivileged code.

Don't see how one gets from a running application into debugger without
taking an exception !?! or from running in the debugger to running in application without returning from an exception !!!
--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 21:01:35 2026

From Newsgroup: comp.arch

On 6/23/2026 7:54 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

-------------

In my case, I tended to use more conservative approaches and then only
optimize based on what can be verified by the compiler within certain
fundamental assumptions.

Say:
Pointer 1 points at a stack array in the local function;
Pointer 2 was derived from taking the address of a global array;
Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume they
don't alias with a pointer to a local array;

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming error.

Hard proof that alias is impossible is harder to achieve in practice...

A softer "there is no reasonable possibility of alias" is easier to achieve.

Like, one can assume that each independent memory object exists in its
own local void, and that there is no reasonable way to reach from one to another.

Like, even if you can potentially go out of bounds to reach from one independent memory object to another, for a compiler it may be
sufficient merely to prove that the origins reflect two independent
memory objects (and not two pointers within the same object, or a
parent/child relationship).

Likewise, global variables can be seen as separate objects, along with independent local variables.

Say:
int arra[16];
int arrb[16];
With arra and arrb being assumed independent, even if in-memory they are
right next to each other, but excluding arrays within a common struct
(where the containing struct can be seen as a common origin point).

Everything passed in can go into an "unknown" category; where unknown
pointers may be assumed to alias with each other.

Otherwise, one would need to make assumptions about "every possible
caller", which is unreasonable (caller behavior can be assumed to fall
into an open-ended set, even in cases where callee behavior can be
reasoned about via graph walks).

Though, could still be done when one assumes that the callers form a
closed set.

...

-----------------

I am reminded of the person, apparently very religious, who some decades ago posted to solicit help for reimplementing all of computing (gcc,
GNU, et cetera) on Biblical principles, because he thought Richard
Stallman was too atheistic and had tainted his products. I have not
heard how that went.

Rick...

That was one of them...

--- Synchronet 3.22a-Linux NewsLink 1.2

From John Levine@johnl@taugh.com to comp.arch on Wed Jun 24 02:25:30 2026

From Newsgroup: comp.arch

According to BGB <cr88192@gmail.com>:

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming error.

Hard proof that alias is impossible is harder to achieve in practice...

A softer "there is no reasonable possibility of alias" is easier to achieve.

Sort of. The standard says that the compiler can assume no type punning, so that
if pointers are of different types, they can't point at the same thing (with an exception for pointers to unions.)

Even so, C has "restrict" to tell the compiler to assume that pointers never alias, and "volatile" to assume they always do.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 22:41:19 2026

From Newsgroup: comp.arch

On 6/23/2026 9:25 PM, John Levine wrote:

According to BGB <cr88192@gmail.com>:

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming error.

Hard proof that alias is impossible is harder to achieve in practice...

A softer "there is no reasonable possibility of alias" is easier to achieve.

Sort of. The standard says that the compiler can assume no type punning, so that
if pointers are of different types, they can't point at the same thing (with an
exception for pointers to unions.)

Even so, C has "restrict" to tell the compiler to assume that pointers never alias, and "volatile" to assume they always do.

Possibly, though traditional type-based aliasing rules run into a
problem in that pointer casting can break its assumptions, and a lot of
code doesn't respect these rules (which taken purely at face value, are
overly limiting).

One option though is "if enabled, assume the rules are followed unless
the compiler sees them being broken", in which case it disables TBAA
when faced with TBAA violations. This approach seems to be moderately effective, and allows benefiting from some of the performance advantages
of TBAA while also being more friendly to code that goes "wild west"
with things like pointer casts and "cast and dereference" patterns.

So, say, a nicer compromise (even if still breakable).
int foo1(char *s, int *t)
{
    *s=*t+1;
    return *t;
}
//assume not directly visible within same context:
int foo2()
{
    int i, j;
    i=4;
    j=foo1((char *)(&i), &i);
    return j;
}
What is the result of calling foo2?...
Here, foo2 breaks TBAA but in a way invisible to foo1.

Though, in theory, one workaround is that the compiler can see that foo2 breaks TBAA and can then flag foo1 that its operands may not safely
assume TBAA.

Though, this poses a problem for my current compiler design, as some of
the alias handling stuff happens before the compiler will have a
complete view of the call-graph.

Would in effect need to add an additional internal compiler pass to
detect and mark all the TBAA violations within the call-graph.

But, for now, seems "mostly good enough".

For volatile, one typically needs to go a little further:
Every load and store needs to be performed explicitly;
There is a need to disallow load/store reordering;
...
Mostly because volatile may be used to access MMIO, and MMIO is more
strict than normal RAM in this area.

Though, could maybe be better if "volatile" could be broken into several subtypes depending on which particular behaviors are needed:
Weaker case: Assume aliasing happens.
    May still prune non-aliasing load/store or reorder;
Normal case:
    Every load/store needs to happen;
    No reordering allowed.
Stronger case:
    Like the above, but also needs to be synchronous between cores;
    Though, this role overlaps with _Atomic.

There is also ambiguity as to how far the volatile-ness extends, but
this can be avoided by doing it at the point of cast-and-deref:
(*(volatile uint64_t *)ptr)
In this case, it applying explicitly to the deref operation rather than
the handling of the pointer before this point.

...

--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 05:34:30 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

Consider the Push/Pop mechanics in HW compared to FMAC in HW--which
do you think is easier ???

Now consider 16 pushes in a row versus a single instruction that performs
the same amount of work. Which one needs to translate an address more
often, which one needs to AGEN more often, and which one can access the
cache once for up to 8 registers ???

Modern x86 processors have a "stack engine" to address this
problem. Multiple push or pop instructions, respectively,
are split into two microops (one memory access, one decrement or
increment), and the decrement/increment microops are then merged.

This probably costs an extra cycle of pipeline depth or so, but
I haven't been able (after cursory looking) to find a number for
newer architectures.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 24 05:48:44 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 06:06:14 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?

The debugger runs in a separate process with access to application
Root pointer and ASID. In that process, Call-stack is RW-.

But no error backtrace from an error occurring in a normal program?

Example (Fortran reading from a non-opened file, compiled with
-g -static-libgfortran):

program memain
call foo(a)
print *,a
end program memain

subroutine foo(a)
read (10) a
end subroutine foo

$ ./a.out
At line 7 of file foo.f90 (unit = 10, file = 'fort.10')
Fortran runtime error: End of file

Error termination. Backtrace:
#0 0x407d27 in us_read
at ../../../dump/libgfortran/io/transfer.c:2983
#1 0x407e24 in pre_position
at ../../../dump/libgfortran/io/transfer.c:3109
#2 0x40ae34 in data_transfer_init
at ../../../dump/libgfortran/io/transfer.c:3562
#3 0x40392f in foo_
at /tmp/foo.f90:7
#4 0x403976 in memain
at /tmp/foo.f90:2
#5 0x403a0f in main
at /tmp/foo.f90:4
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 06:08:15 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

scott@slp53.sl.home (Scott Lurndal) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?

The debugger runs in a separate process with access to application
Root pointer and ASID. In that process, Call-stack is RW-.

GLIBC has a function to obtain a backtrace at a current point
in time. This is called in the context of the thread that invokes
the call. It requires access to the call records on the stack
in the context of the thread (the glibc functions are backtrace(3)
and backtrace_symbols(3)).

When Thread is unExceptional it cannot access Call Stack,
when Thread is Exceptional it can.

ENTER, EXIT, and RET are exempt from the protection check.
Call Stack Pointer is not accessible to unprivileged code.

Don't see how one gets from a running application into debugger without taking an exception !?! or from running in the debugger to running in application without returning from an exception !!!

Issuing an application error for which one might want to look at
a backtrace (see previous Fortran example).
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.22a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 06:20:45 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> schrieb:

[LISP]

Usual downside is that the excessive parentheses tend to turn into a usability issue.

Ample fun has been made of this over time.

Example: https://xkcd.com/297/

Or, from the priceless "A Brief, Incomplete, and Mostly Wrong History of Programming Languages":

# 1958 - John McCarthy and Paul Graham invent LISP. Due to high
# costs caused by a post-war depletion of the strategic parentheses
# reserve LISP never becomes popular... Fortunately for computer
# science the supply of curly braces and angle brackets remains high.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.22a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Wed Jun 24 08:50:18 2026

From Newsgroup: comp.arch

On 24/06/2026 02:54, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

-------------

In my case, I tended to use more conservative approaches and then only
optimize based on what can be verified by the compiler within certain
fundamental assumptions.

Say:
Pointer 1 points at a stack array in the local function;
Pointer 2 was derived from taking the address of a global array;
Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume they
don't alias with a pointer to a local array;

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming error.

C lets the compiler assume that things do not alias, under certain circumstances. If you have a local array (pointer 1) and its address
does not "escape", and a global array (pointer 2), the compiler can
assume they do not alias, as any aliasing could only be the result of UB.

For pointers passed into functions, the compiler won't have any such
knowledge (unless it happens to be able to see the calling and called
code at the same time - if they are in the same file, or you are using
some kind of link-time or whole-program optimisation). But you can tell
the compiler that pointers don't alias, with the "restrict" qualifier.
This can make a significant difference in some code, and means that the "Fortran is faster than C because pointer parameters can't alias"
argument has not been true since 1999. (Fortran code may be faster for
other reasons - such as "C programmers don't know how to use restrict".)
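
As a rough sketch of the difference (the function names here are made up for illustration, they are not from any real code base): both versions compute the same thing, but only the second one tells the compiler that the store target does not overlap the inputs.

```c
#include <stddef.h>

/* No promise about overlap: dst, src and scale all point at float, so the
   compiler has to allow for a store to dst[i] changing *scale or a later
   src element, and must reload them accordingly. */
void scale_may_alias(float *dst, const float *src, const float *scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *scale;
}

/* With restrict the caller promises that the objects written through dst
   are not accessed through src or scale, so *scale can be loaded once and
   the loop is easier to vectorise. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    const float *restrict scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *scale;
}
```

The qualifier is a promise made by the caller, not something the compiler checks: calling scale_no_alias() with dst overlapping src or scale is UB.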

--- Synchronet 3.22a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Wed Jun 24 02:01:40 2026

From Newsgroup: comp.arch

On 6/24/2026 12:48 AM, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

Yes, this is one place where I disagree with GCC.
I decided to go with "more sane" default behavior (no TBAA by default,
it is opt-in).

Goal is to find rules that are "mostly sane" while still being effective.

Localized approaches can work OK, but necessarily need to be conservative.

Something like full provenance poses a harder problem though, as to know
a solid answer requires tracing the flow of a variable across multiple control-flow frames (or maybe going further, into reasoning about things
like objects and linked lists).

Decided not to go too much into it, but this is not the first time I
have encountered a variation of this problem. It is doable in theory,
but actually doing it in a compiler is a bit more of a pain...

...

--- Synchronet 3.22a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Wed Jun 24 09:58:35 2026

From Newsgroup: comp.arch

On 24/06/2026 05:41, BGB wrote:

On 6/23/2026 9:25 PM, John Levine wrote:

According to BGB <cr88192@gmail.com>:

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming
error.

Hard proof that alias is impossible is harder to achieve in practice...

A softer "there is no reasonable possibility of alias" is easier to
achieve.

Sort of. The standard says that the compiler can assume no type
punning, so that
if pointers are of different types, they can't point at the same thing
(with an
exception for pointers to unions.)

Even so, C has "restrict" to tell the compiler to assume that pointers
never
alias, and "volatile" to assume they always do.

Possibly, though traditional type-based aliasing rules run into a
problem in that pointer casting can break its assumptions, and a lot of
code doesn't respect these rules (which taken purely at face value, are overly limiting).

It's true that some programmers seem to think you can do whatever you
like with pointers converted between different types. A lot of use of converted pointer types will be UB in C, but C does not make it at all difficult to write code with these conversions. There's a fair argument
to be made that type-based alias analysis rarely gives good optimisation opportunities, restricts programmers, and lets people write code that
they think is correct, but is not. Quite a number of C compilers
specifically do not do any type-based aliasing analysis, or let you turn
it off (gcc -fno-strict-aliasing).

One key point is that in C++, type-based alias analysis is much more
useful as you generally use far more different types (typedef in C does
not make different types), and code is generally much more careful about accessing them with correct pointer types (or better, references,
containers, smart pointers, etc.).

Maybe things could be helped by attributes that give you better control
over aliasing - gcc has a "may_alias" type attribute that can be used to
give a type the "aliasing superpowers" of character types.

One option though is "if enabled, assume the rules are followed unless
the compiler sees them being broken", in which case it disables TBAA
when faced with TBAA violations.

That sounds /really/ bad. You can't have the behaviour - the semantics
- dependent on whether or not a compiler is able to find an error in
your code!

An option to say TBAA is enabled or not makes sense. Even better, is
having it as a pragma. (I always use gcc optimise pragmas if I need to disable a particular optimisation, to keep it safe regardless of command
line options.) Standardising this in some way could be useful. And it
is probably a good idea to have TBAA off by default - let those who
understand it and want it, enable it. (But it should probably be on by default for C++.)

And when a compiler has TBAA enabled, and it spots a violation, that's
time for an error message - not silently disabling it!

This approach seems to be moderately
effective, and allows benefiting from some of the performance advantages
of TBAA while also being more friendly to code that goes "wild west"
with things like pointer casts and "cast and dereference" patterns.

So, say, a nicer compromise (even if still breakable).

It's better to use something other than "char" pointers, since character pointers can be used to access any data. There is no UB in your example
here, that I can see - "*s" is allowed to access data pointed to by
"*t". Let's pretend you use "short * s" or "float * s" instead.

int foo1(char *s, int *t)
{
    *s=*t+1;
    return *t;
}
//assume not directly visible within same context:
int foo2()
{
    int i, j;
    i=4;
    j=foo1((char *)(&i), &i);
    return j;
}
What is the result of calling foo2?...
Here, foo2 breaks TBAA but in a way invisible to foo1.

With the proviso mentioned above, there are countless ways in which you calling a function with unexpected or inappropriate parameters leads to
UB. You always have to know the requirements for the parameters before calling a function. This situation is a drop in the ocean, and not
really worth worrying about specifically IMHO.
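
To make the "short * s" variant concrete, here is a sketch (foo1_short is just an illustrative name, not from the quoted code). With type-based alias analysis enabled, the compiler may assume the store through s cannot modify *t, so it is free to return the value of *t that it loaded before the store.

```c
/* Variant of the quoted foo1 with short* in place of char*.
   short and int are different, non-character types, so under the C
   aliasing rules *s and *t are assumed to be distinct objects. */
int foo1_short(short *s, int *t)
{
    *s = (short)(*t + 1);   /* store through the short lvalue */
    return *t;              /* may reuse the value loaded above */
}
```

Called with two distinct objects this is well defined. Passing (short *)&i and &i for the same int i is what makes it UB, and then the result depends on whether the compiler reloads *t.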

For volatile, one typically needs to go a little further:
Every load and store needs to be performed explicitly;
There is a need to disallow load/store reordering;

"volatile" does not affect hardware ordering - it only affects the
ordering within the program. It cannot see things that are at a level
below the generated code.

If you want to influence hardware ordering, use atomics and fences (from
C11, or implementation extensions).
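
A minimal sketch of the C11 route, assuming a simple "publish one value, consume it elsewhere" pattern (payload, ready, publish and try_consume are hypothetical names for illustration): the release store and acquire load give the ordering that "volatile" alone does not.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary data, written before publishing */
static atomic_bool ready;    /* flag that carries the ordering */

/* Producer side: the release store orders the payload write before the
   flag becomes visible to another thread. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: the acquire load orders the payload read after the flag
   has been seen as set. Returns false if nothing has been published. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```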

...
Mostly because volatile may be used to access MMIO, and MMIO is more
strict than normal RAM in this area.

In the microcontroller world at least, that is done by the memory
management unit or memory protection unit, specifying which address
areas are accessible in different ways, which are cacheable, which can
be buffered or re-ordered. That is all well below the level visible in
a programming language - and once the MPU is set up correctly, it all
"just works".

Though, could maybe be better if "volatile" could be broken into several subtypes depending on which particular behaviors are needed:
Weaker case: Assume aliasing happens.
    May still prune non-aliasing load/store or reorder;

C does not need volatile for that - you've got aliasing superpower
character types. In practice, you have memcpy() / memmove() to read or
write data that might be aliased, or different types. All you need is
for compilers to handle small memcpy's with fixed sizes efficiently (as
gcc and clang do). Often this is more efficient than using volatiles -
using "float f = 12.3; uint32_t x; memcpy(&x, &f, 4); return x;" will typically result in a float register to integer register move instruction.
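
Written out as a function, that might look like the following sketch (float_bits is an illustrative name, and it assumes a 32-bit float, which the static_assert checks):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

/* Reinterpret the bytes of a float as a 32-bit integer without any
   pointer casts, so no aliasing rule is involved. */
uint32_t float_bits(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);   /* fixed small size: typically one move */
    return x;
}
```

On a platform with IEEE 754 binary32 floats, float_bits(1.0f) gives 0x3F800000.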

Normal case:
    Every load/store needs to happen;
    No reordering allowed.

No re-ordering with respect to other volatiles, you mean. That is what volatile does today.

Stronger case:
    Like the above, but also needs to be synchronous between cores;
    Though, this role overlaps with _Atomic.

As you say, that is the job for atomics - or volatile atomics.

So we already have all the features you want. It is fair to say,
however, that some programmers misunderstand "volatile" and think it
means one of the other cases you list. (Or the case that you didn't
list - the assumption that volatile forces an order on non-volatile
accesses or operations.)

There is also ambiguity as to how far the volatile-ness extends, but
this can be avoided by doing it at the point of cast-and-deref:
(*(volatile uint64_t *)ptr)
In this case, it applying explicitly to the deref operation rather than
the handling of the pointer before this point.

I don't know what you mean by an "ambiguity" here. There is no
ambiguity in C about where "volatile" applies. There might be confusion
or misunderstanding amongst some programmers, but not an ambiguity in
the semantics.

It is, IME, helpful to remember that "volatile" is primarily about
/accesses/, rather than objects. This was somewhat unclear in the C
standards until C17.
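
A small sketch of that "volatile access" view, using the cast-and-deref form from the quoted text (load_once_u64 and store_once_u64 are hypothetical helper names): the object itself is not declared volatile, only the individual access is.

```c
#include <stdint.h>

/* Perform exactly one volatile load of a 64-bit value. The pointer must
   be suitably aligned and point at a uint64_t object. */
static inline uint64_t load_once_u64(const void *ptr)
{
    return *(const volatile uint64_t *)ptr;
}

/* Perform exactly one volatile store of a 64-bit value. */
static inline void store_once_u64(void *ptr, uint64_t value)
{
    *(volatile uint64_t *)ptr = value;
}
```

This constrains the compiler for those accesses only. It says nothing about ordering against non-volatile accesses or about what the hardware does below the generated code.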

--- Synchronet 3.22a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Wed Jun 24 10:30:12 2026

From Newsgroup: comp.arch

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal (I don't think they are, but the changes I'd
make might be different from the ones you would like). But it is
entirely appropriate for a compiler to follow the C rules unless you
specify otherwise.

--- Synchronet 3.22a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jun 24 14:30:17 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

scott@slp53.sl.home (Scott Lurndal) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Thomas Koenig <tkoenig@netcologne.de> posted:

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

BGB <cr88192@gmail.com> posted:

On 6/20/2026 5:01 PM, MitchAlsup wrote:

---------------

Tagging to make it harder to stomp the link register;

Put it somewhere it can't be stomped on !! like in memory on a page the
application has no access permissions.

Multiple stacks is a big ask, and non-accessible memory is not so good
when dealing with an ISA where user code needs to handle the Link-Register.

Code does not need to access or look at the return address in My 66000
ISA--except for the case where one wants to walk the stack back on a
THROW() and its unstructured equivalent longjmp().

What about a debugging stack trace?

The debugger runs in a separate process with access to application
Root pointer and ASID. In that process, Call-stack is RW-.

GLIBC has a function to obtain a backtrace at a current point
in time. This is called in the context of the thread that invokes
the call. It requires access to the call records on the stack
in the context of the thread (the glibc functions are backtrace(3)
and backtrace_symbols(3)).

When Thread is unExceptional it cannot access Call Stack,
when Thread is Exceptional it can.

ENTER, EXIT, and RET are exempt from the protection check.
Call Stack Pointer is not accessible to unprivileged code.

Don't see how one gets from a running application into debugger without
taking an exception !?! or from running in the debugger to running in
application without returning from an exception !!!

The glibc function ::backtrace can be called at any time, in any context.

Then there are the unix context functions that also allow access to
resources not normally visible to an application - getcontext(2), makecontext(3) and the setjmp/sigsetjmp functions which also
gather the thread context, including the current stack pointer.
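
For reference, a minimal sketch of the in-process usage in question, using the glibc <execinfo.h> calls backtrace() and backtrace_symbols(). The wrapper name print_backtrace and the 32-frame limit are arbitrary choices for this example.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: print the calling thread's backtrace to 'out'.
 * Returns the number of frames captured, or -1 if symbolization failed. */
static int print_backtrace(FILE *out)
{
    void *frames[32];
    assert(out != NULL);

    /* Walks the current thread's own stack; no debugger or separate
     * process is involved. */
    int n = backtrace(frames, 32);

    char **names = backtrace_symbols(frames, n);
    if (names == NULL)
        return -1;

    for (int i = 0; i < n; i++)
        fprintf(out, "%s\n", names[i]);

    free(names);   /* one allocation holds all the strings */
    return n;
}
```

The call runs entirely in the context of the invoking thread, which is why it needs read access to the call records.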

From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Wed Jun 24 14:38:02 2026

From Newsgroup: comp.arch

On Mon, 22 Jun 2026 18:49:40 -0400, George Neuner wrote:

On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:

Robert Swindells [2026-06-19 11:20:10] wrote:

On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp), and
SPARC contains some support for that, but as one of the IIRC Franz
Lisp developers has explained in this newsgroup, they actually did
not use this feature, because the performance benefit was not big
enough to

[...]

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

I guess you two aren't talking about the same "Franz Lisp". AFAIK Anton
is referring to the commercial Common Lisp compiler associated with the
Franz Inc company, marketed under the name "Allegro".

=== Stefan

ISTM there were at least a couple of Lisps available for the Vax. I
can't speak to Franz, but I do know at least one Vax Lisp was a BIBOP[1] system that (generally) did not use tags.

Franz Lisp used BiBOP.

I posted a link earlier in the thread to the PDF of "Performance and Evaluation of Lisp Systems", Chapter 2 contains descriptions of the
various implementations available at that time.

<https://dreamsongs.com/Files/Timrep.pdf>

The following chapter lists the benchmarks used and results for each implementation.

For some reason, later versions of SPECint li ran these Lisp benchmarks
in the XLisp interpreter that had been compiled for the CPU under test.

The benchmark results reported in the book are for fully compiled code.

In BIBOP, memory "pages"[2] are dedicated to a single data type. The
base address of the page is mapped to the type of the objects the page contains, and so the objects (and pointers to them) need no type
information themselves. This allowed for full width pointers, fixnums
and floats, and for conses, boxes, and other fixed sized data types (including user types) to avoid tagging.

I ran Franz Lisp on the Atari ST, not having tags made it easy to
interface to the GEM GUI.

I also made a start on a hardware accelerator for BiBOP type checking for
the ST. The expansion connector on it provided access to the full 68k bus including the function pins.

The idea was to look for data reads within a defined range then use the
page number as an address for a small SRAM holding the BiBOP table and
latch the value stored at that address. Would tweak the compiler slightly
to read a value into a CPU register before needing to read the latched
type of it.

A variant of this idea could be to store the type value in spare bits in a PTE, then define an instruction that treats the contents of a register as
an address and returns the matching "type" bits for it.
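
In software terms, the lookup the SRAM would have done is roughly the following. This is only a sketch: the page size, the heap size, the type names and the type_of function are all invented for illustration and are not taken from Franz Lisp.

```c
#include <assert.h>
#include <stddef.h>

enum { PAGE_SHIFT = 12, NUM_PAGES = 16 };       /* assumed sizes */

typedef enum { T_FREE, T_CONS, T_FIXNUM, T_FLOAT } obj_type;

static unsigned char heap[NUM_PAGES << PAGE_SHIFT];
static obj_type page_type[NUM_PAGES];           /* the BiBOP table */

/* The type comes from the page number alone; the object and the
 * pointer carry no tag bits. */
static obj_type type_of(const void *p)
{
    size_t off = (size_t)((const unsigned char *)p - heap);
    assert(off < sizeof heap);
    return page_type[off >> PAGE_SHIFT];
}
```

The table lookup is the extra memory access that the hardware latch, or type bits in a PTE, would hide.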

From John Levine@johnl@taugh.com to comp.arch on Wed Jun 24 20:17:45 2026

From Newsgroup: comp.arch

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.
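
To make the difference concrete without invoking undefined behaviour, here is a sketch that spells out the naive forward byte copy as an ordinary loop (naive_copy is a name made up for this example) so it can be compared with memmove() on the same overlapping arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Forward byte-at-a-time copy. With dst == src + 1 each byte read was
 * written by the previous iteration, so the first byte is smeared over
 * the whole destination. Plain char accesses, so overlap is well defined. */
static void naive_copy(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

With a[0] = 42 and the rest zero, naive_copy(a+1, a, 99) leaves every element 42, whereas memmove(a+1, a, 99) just shifts the old contents up by one, leaving a[0] and a[1] as 42 and the rest zero.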
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From BGB@cr88192@gmail.com to comp.arch on Wed Jun 24 16:34:55 2026

From Newsgroup: comp.arch

On 6/24/2026 3:17 PM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library, whose
main purpose is to give this sort of self-overlapping copy behavior (and
to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

In the case of a short backwards copy, it will call "memmove()", but as
noted the behavior in the case of a short forwards copy is different.

For non-overlap cases it can just invoke "memcpy()".

Otherwise, was sitting around trying to fiddle with memory usage in my compiler, and there still seems to be around 16MB unaccounted for (after tracking basically all of the memory allocation and freeing in the
compiler, VS debugger reports around 16MB more memory being used than
the internal memory-use tracking does).

This also seems larger than easily explained by the binary itself...

Would estimate ~ 7MB for the EXE's sections + OS stack (1MB in Windows).

But, in the past few days of fiddling I have gotten it from ~ 250 MB to compile Doom down to around 64MB (in VS), or 48MB (according to the
internal allocation tracking).

Though seemingly in the process, the build times have gotten ~ 2 seconds longer.

Though, there were some changes affecting significant compiler
structures (changing some raw strings to string handles, and breaking
one larger structure into multiple parts), so this isn't entirely unreasonable. Changing the structure was a bit annoying as it was one of
the most heavily used in the compiler, so involved touching a lot of code.

One annoyance is that the 3AC opcode structure has a few fields that are minority use, but it likely isn't worth the pain of messing with it.

Unlike the other struct, the Op struct is small enough that splitting it
via a pointer would likely end up being net-negative for memory use.
Would likely need to get a bit tricky and use two different-sized
structs depending on sub-type and putting the lesser-used fields on the
end, but this would be awkward and ugly. Not likely worth it.

Well, because to address the full range of operations, it effectively has:
operation tags
2 types;
2 destinations;
4 sources;
a 24-byte tagged-union immediate-value field
needed for calls, member load/store, ...

The 2-size strategy would likely be to have a full version with all the fields, and a subset version with:
operation tags
1 type
1 destination
3 sources
And, then questioning whether it would be worth it to shave 48 bytes off
an 88 byte struct (and maybe save a few MB at most).

Well, and then I can see that the compiler is burning around 2MB on the
data for figuring out whether called functions may have touched
particular global variables (though this affects what optimizations the
compiler can do, so isn't purely waste), ...

...


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 00:46:17 2026

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:

On 6/24/2026 3:17 PM, John Levine wrote:

-------------------

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library, whose
main purpose is to give this sort of self-overlapping copy behavior (and
to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

Instead, I added MM instruction to ISA. MM is memmove() ! LLVM is happy to
use MM as a struct copy (sa = sb;) independent of where sa or sb are.

In the case of a short backwards copy, it will call "memmove()", but as noted the behavior in the case of a short forwards copy are different.

HW is really good at pointer compares and loop inversions.

For non-overlap cases it can just invoke "memcpy()".

Unnecessary with MM.

Plus, while MM is doing its thing, non-memory ref instructions can make
forward progress, and non-aliasing memory refs can use the 'other' Memory Units..

From Andy Valencia@vandys@vsta.org to comp.arch on Wed Jun 24 19:17:39 2026

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> writes:

BGB <cr88192@gmail.com> schrieb:

Usual downside is that the excessive parentheses tend to turn into a
usability issue.

Ample fun has been made of this over time.

From rec.humor.funny:

From: jasmerb@mist.cs.orst.edu (Bryce Jasmer)
Newsgroups: rec.humor.funny
Subject: The Strategic Defense Initiative (SDI/Star Wars)
Keywords: computer, funny
Message-ID: <137457@looking.on.ca>
Date: 23 Apr 90 10:30:08 GMT
Sender: funnyr@looking.on.ca
Posted: Mon Apr 23 11:30:08 1990
Reply-Path: mist.cs.orst.edu!jasmerb

Through some clever security hole manipulation I have been able to
break into all of the government's computers and acquire the Lisp code
to SDI. Here is the last page (tail -10) of it to prove that I actually
have the code:

)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message

From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 25 09:18:38 2026

From Newsgroup: comp.arch

On 24/06/2026 22:17, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

I think it is perhaps better to say that one of the less fortunate
things about C is that people make assumptions without learning the
language properly or looking up the details. And code with these
incorrect assumptions is then propagated.

"memcpy" is a fine example of this. It says on the tin that using it
for objects that overlap is undefined behaviour - C standard code for
"don't do that".

But lots of people hammer away at their keyboards without reading the
manuals or instructions, or paying much attention to their tutorial
books or courses. Some languages are much more forgiving there - Python
aims to accept as wide a range of inputs as possible for any operation
or library function, and aims to give you as much feedback about your
mistakes at runtime. C aims for maximal efficiency on the assumption
that you have read the specifications for the language, and follow the
rules. You can do a lot of Python programming by combining trial and
error with a bit of "how do I do this in Python" googling. In C, that's
going to lead to tears sooner rather than later.

In the case of "memcpy", a lot of people think it is defined - specified
- by the naïve implementation (I believe it is shown as an example in
K&R). And so they use "memcpy" everywhere, even in cases where
"memmove" is the appropriate choice. Not long ago, a glibc developer discovered that on some Intel processors, in some circumstances, running
the memory copy backwards led to a noticeable speedup for memcpy(), and
thus implemented that. The backlash of people who said the change
"broke" their code was overwhelming, and the change was reverted.

I really do think that things like this should be /easy/ to get right. Parameters to "memcpy" are not allowed to overlap, so that the copying
can be as efficient as possible. "memmove" allows the parameters to
overlap, but is likely to be less efficient. Use the one that suits
your requirements.
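
A typical case where "memmove" is the right choice is closing a gap inside a single array, where source and destination necessarily overlap. A minimal sketch (the function name remove_at is just for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove element 'idx' from arr[0..*count-1] by sliding the tail down
 * one slot. Source and destination overlap, so memmove is required;
 * memcpy here would be undefined behaviour. */
static void remove_at(int *arr, size_t *count, size_t idx)
{
    assert(idx < *count);
    memmove(&arr[idx], &arr[idx + 1],
            (*count - idx - 1) * sizeof arr[0]);
    (*count)--;
}
```

Copying between two distinct buffers, by contrast, is what "memcpy" is for.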

But people get it wrong. There's a lot of people who sit alone,
programming in C, who should not be programming in C - they should be
using different languages, or learning C better before using it. Or
they should have better guidance and help, code reviews from people more experienced. Many people programming in C don't even enable warnings on
their compiler. C is not a language for people who program "by
intuition", it requires more discipline in developers than many other languages.

You are right that there are more subtle possibilities for errors in C,
and I know of no one who thinks the rules of C and the standard library
are all ideal. But a huge percentage of the code bugs in C (as distinct
from logical errors, specification errors, etc., that plague all
programming in all languages) could be avoided by better development practices.


From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 25 09:22:49 2026

From Newsgroup: comp.arch

On 24/06/2026 23:34, BGB wrote:

On 6/24/2026 3:17 PM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write
code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library, whose
main purpose is to give this sort of self-overlapping copy behavior (and
to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

In the case of a short backwards copy, it will call "memmove()", but as noted the behavior in the case of a short forwards copy are different.

For non-overlap cases it can just invoke "memcpy()".

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that
temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

How is your "_memlzcpy" defined that is different from that?


From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 03:23:27 2026

From Newsgroup: comp.arch

On 6/25/2026 2:22 AM, David Brown wrote:

On 24/06/2026 23:34, BGB wrote:

On 6/24/2026 3:17 PM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the
rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write
code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library,
whose main purpose is to give this sort of self-overlapping copy
behavior (and to consolidate nearly every LZ77 style decompressor
otherwise needing to supply their own version).

In the case of a short backwards copy, it will call "memmove()", but
as noted the behavior in the case of a short forwards copy are different.

For non-overlap cases it can just invoke "memcpy()".

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

Yeah, this is why I created "_memlzcpy()", because the defined behavior
for "memmove()" is not what one wants for self-overlapping forward copy.

How is your "_memlzcpy" defined that is different from that?

Here:
_memlzcpy(dst+1, dst, len);
Is functionally equivalent to:
memset(dst+1, *dst, len);

But, it can do more:
_memlzcpy(dst+2, dst, len); //repeating 2-byte pattern
_memlzcpy(dst+3, dst, len); //repeating 3-byte pattern
...

So, required to work for every self-overlap distance.

Or, in the case as commonly used in an LZ77 style decompressor:
_memlzcpy(dest, dest-distance, length);
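
For anyone wanting the semantics spelled out, here is one way to get that behavior using only non-overlapping memcpy() calls. This is a sketch, not the actual library code: lz_copy and its (dst, dist, len) signature are invented for the example. It copies the pattern once, then doubles the amount of already-replicated data on each step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: write 'len' bytes at dst, repeating the 'dist'
 * bytes that immediately precede dst (LZ77 match-copy semantics). */
static void lz_copy(unsigned char *dst, size_t dist, size_t len)
{
    assert(dist > 0);
    const unsigned char *src = dst - dist;   /* start of the pattern */

    /* Invariant: [src, dst) holds whole repetitions of the pattern and
     * dst - src == dist, so each memcpy below never overlaps. */
    while (len > dist) {
        memcpy(dst, src, dist);
        dst += dist;
        len -= dist;
        dist *= 2;
    }
    memcpy(dst, src, len);
}
```

With dist == 1 this degenerates to a memset of the preceding byte, and for larger distances it repeats the 2-byte, 3-byte, ... pattern as described.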

Though, there are also:
_memcpyf()
_memmovef()
_memlzcpyf()

Where the 'f' in this case means:
Allowed to be a little faster by potentially going up to 32 bytes extra.

Where, in some cases it is faster to overshoot the end than to give an
exact length, but it would not be valid to overshoot the copy for the
normal versions (exact length even if it is a little slower).


From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Jun 24 19:50:09 2026

From Newsgroup: comp.arch

Robert Swindells [2026-06-24 14:38:02] wrote:

On Mon, 22 Jun 2026 18:49:40 -0400, George Neuner wrote:

On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:

Robert Swindells [2026-06-19 11:20:10] wrote:

On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp), and
SPARC contains some support for that, but as one of the IIRC Franz
Lisp developers has explained in this newsgroup, they actually did
not use this feature, because the performance benefit was not big
enough to

[...]

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

I guess you two aren't talking about the same "Franz Lisp". AFAIK Anton
is referring to the commercial Common Lisp compiler associated with the
Franz Inc company, marketed under the name "Allegro".

=== Stefan

ISTM there were at least a couple of Lisps available for the Vax. I
can't speak to Franz, but I do know at least one Vax Lisp was a BIBOP[1]
system that (generally) did not use tags.

Franz Lisp used BiBOP.

Side note: the BiBoP technique is largely orthogonal to the
architectural support for pointer tagging, because usually BiBoP is used
to "eliminate" the tags present inside the heap representation of
objects rather than the few tagbits stolen from pointers: the purpose of
those tagbits is usually to be able to determine the type of the object *without* any memory access whereas BiBoP stores the corresponding info
in memory.

E.g. tagbits are most commonly used to distinguish between an immediate
small integer value and a pointer. BiBoP wouldn't help with that,
forcing the small integer to be stored in some "page of small integers"
which could have a very serious performance impact.
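
For concreteness, a sketch of the usual low-bit scheme. The names and the choice of bit 0 as the fixnum tag are just one common convention picked for this example, not any particular Lisp's layout.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t value;   /* a tagged machine word */

/* Low bit 1 = immediate small integer, low bit 0 = aligned pointer. */
static inline int is_fixnum(value v)
{
    return (v & 1u) != 0;
}

static inline value make_fixnum(intptr_t n)
{
    return ((uintptr_t)n << 1) | 1u;
}

static inline intptr_t fixnum_value(value v)
{
    /* Clear the tag, then halve. The unsigned-to-signed conversion is
     * implementation-defined but wraps on the usual compilers. */
    return (intptr_t)(v - 1u) / 2;
}

static inline value make_pointer(void *p)
{
    assert(((uintptr_t)p & 1u) == 0);   /* needs 2-byte alignment */
    return (uintptr_t)p;
}
```

The type test is a single AND on the register contents, with no memory access at all, which is the part BiBoP cannot provide.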

=== Stefan

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jun 25 14:39:54 2026

From Newsgroup: comp.arch

BGB wrote:

On 6/25/2026 2:22 AM, David Brown wrote:

On 24/06/2026 23:34, BGB wrote:

On 6/24/2026 3:17 PM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the
rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write
code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library,
whose main purpose is to give this sort of self-overlapping copy
behavior (and to consolidate nearly every LZ77 style decompressor
otherwise needing to supply their own version).

In the case of a short backwards copy, it will call "memmove()", but
as noted the behavior in the case of a short forwards copy are
different.

For non-overlap cases it can just invoke "memcpy()".

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that
temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

Yeah, this is why I created "_memlzcpy()", because the defined behavior
for "memmove()" is not what one wants for self-overlapping forward copy.

How is your "_memlzcpy" defined that is different from that?

Here:
_memlzcpy(dst+1, dst, len);
Is functionally equivalent to:
memset(dst+1, *dst, len);

But, it can do more:
_memlzcpy(dst+2, dst, len); //repeating 2-byte pattern
_memlzcpy(dst+3, dst, len); //repeating 3-byte pattern
...

So, required to work for every self-overlap distance.

Or, in the case as commonly used in an LZ77 style decompressor:
_memlzcpy(dest, dest-distance, length);

Though, there are also:
_memcpyf()
_memmovef()
_memlzcpyf()

Where the 'f' in this case means:
Allowed to be a little faster by potentially going up to 32 bytes extra.

I'm guessing you really meant up to 31 bytes extra?

This is what my own (faster than Google's version) LZ4 decompressor uses
internally.

I am using either a pair of SSE or a single AVX register (so 32 bytes in
both cases) as the copy granule. For the specific, very common, case of
an overlapping copy that unrolls RLL-encoded data, I start by loading
the starting pattern into the bottom of a register, then use the pattern
length to index into a table of swizzle patterns that will generate the
required results, for any pattern up to 32 bytes long.

swizzle_table:

[0,0,0,0,0,0,0,...
[0,1,0,1,0,1,0,1,...
[0,1,2,0,1,2,0,1,2,...
[0,1,2,3,0,1,2,3,...
[0,1,2,3,4,0,1,2,3,..

etc.

Note that having 31 entries of 32 bytes each means that I'm allocating
almost a KB of $L1 cache space just for this table, but when you're decompressing lots of data it pays off.
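
A scalar sketch of the table idea, for readers who want to see the indexing: row p-1 holds the byte indices i mod p, and applying a row to the pattern produces one 32-byte granule. Everything here (the names, using 32 rows rather than 31, the plain loop standing in for a SIMD byte shuffle) is an assumption made for illustration and not the actual decompressor code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GRANULE = 32 };   /* bytes per copy granule */

/* Row p-1 is the swizzle for a pattern of length p. */
static uint8_t swizzle_table[GRANULE][GRANULE];

static void build_swizzle_table(void)
{
    for (size_t period = 1; period <= GRANULE; period++)
        for (size_t i = 0; i < GRANULE; i++)
            swizzle_table[period - 1][i] = (uint8_t)(i % period);
}

/* Scalar emulation of the shuffle: out[i] = pattern[i mod period]. */
static void expand_pattern(uint8_t out[GRANULE],
                           const uint8_t *pattern, size_t period)
{
    assert(period >= 1 && period <= GRANULE);
    const uint8_t *idx = swizzle_table[period - 1];
    for (size_t i = 0; i < GRANULE; i++)
        out[i] = pattern[idx[i]];
}
```

A real implementation would replace the inner loop with a vector shuffle, subject to whatever lane restrictions that instruction has.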
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Thu Jun 25 13:24:45 2026

From Newsgroup: comp.arch

On Wed, 24 Jun 2026 19:50:09 -0400, Stefan Monnier wrote:

Robert Swindells [2026-06-24 14:38:02] wrote:

On Mon, 22 Jun 2026 18:49:40 -0400, George Neuner wrote:

On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
<monnier@iro.umontreal.ca> wrote:

Robert Swindells [2026-06-19 11:20:10] wrote:

On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

Another architectural feature: One might think that tagging support
would help dynamically typed programming languages (e.g., Lisp),
and SPARC contains some support for that, but as one of the IIRC
Franz Lisp developers has explained in this newsgroup, they
actually did not use this feature, because the performance benefit
was not big enough to

[...]

Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

I guess you two aren't talking about the same "Franz Lisp". AFAIK Anton
is referring to the commercial Common Lisp compiler associated with
the Franz Inc company, marketed under the name "Allegro".

=== Stefan

ISTM there were at least a couple of Lisps available for the Vax. I
can't speak to Franz, but I do know at least one Vax Lisp was a
BIBOP[1]
system that (generally) did not use tags.

Franz Lisp used BiBOP.

Side note: the BiBoP technique is largely orthogonal to the
architectural support for pointer tagging, because usually BiBoP is used
to "eliminate" the tags present inside the heap representation of
objects rather than the few tagbits stolen from pointers: the purpose of those tagbits is usually to be able to determine the type of the object *without* any memory access whereas BiBoP stores the corresponding info
in memory.

E.g. tagbits are most commonly used to distinguish between an immediate
small integer value and a pointer. BiBoP wouldn't help with that,
forcing the small integer to be stored in some "page of small integers"
which could have a very serious performance impact.

But if you know that you have a "page of small integers" then you can just
do address comparisons between them, the Franz Lisp compiler did this.

=== Stefan


From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Jun 25 09:38:22 2026

From Newsgroup: comp.arch

On 2026-Jun-25 08:39, Terje Mathisen wrote:

BGB wrote:

On 6/25/2026 2:22 AM, David Brown wrote:

On 24/06/2026 23:34, BGB wrote:

On 6/24/2026 3:17 PM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library, whose main purpose is to give this sort of self-overlapping copy behavior (and to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

In the case of a short backwards copy, it will call "memmove()", but as noted the behavior in the case of a short forwards copy are different.

For non-overlap cases it can just invoke "memcpy()".

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that
temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

Yeah, this is why I created "_memlzcpy()", because the defined behavior for "memmove()" is not what one wants for self-overlapping forward copy.

How is your "_memlzcpy" defined that is different from that?

Here:
   _memlzcpy(dst+1, dst, len);
Is functionally equivalent to:
   memset(dst+1, *dst, len);

But, it can do more:
   _memlzcpy(dst+2, dst, len); //repeating 2-byte pattern
   _memlzcpy(dst+3, dst, len); //repeating 3-byte pattern
   ...

So, required to work for every self-overlap distance.

Or, in the case as commonly used in an LZ77 style decompressor:
   _memlzcpy(dest, dest-distance, length);

Though, there are also:
   _memcpyf()
   _memmovef()
   _memlzcpyf()

Where the 'f' in this case means:
Allowed to be a little faster by potentially going up to 32 bytes extra.

I'm guessing you really meant up to 31 bytes extra?

This is what my own (faster than Google's version) LZ4 decompressor uses internally.

I am using either a pair of SSE or a single AVX register (so 32 bytes in both cases) as the copy granule. For the specific, very common, case of an overlapping copy that unrolls RLL-encoded data, I start by loading the starting pattern into the bottom of a register, then use the pattern length to index into a table of swizzle patterns that will generate the required results, for any pattern up to 32 bytes long.

swizzle_table:

[0,0,0,0,0,0,0,...
[0,1,0,1,0,1,0,1,...
[0,1,2,0,1,2,0,1,2,...
[0,1,2,3,0,1,2,3,...
[0,1,2,3,4,0,1,2,3,..

etc.

Note that having 31 entries of 32 bytes each means that I'm allocating almost a KB of $L1 cache space just for this table, but when you're decompressing lots of data it pays off.

Terje

If I had 256b,32B registers I would like to have LDV Load Variable and STV Store Variable
instructions, which take an address, a src/dst simd register, and either a scalar register
or immediate byte count in the range 0..32. LDV loads the specified number of bytes into
the simd starting at the least significant byte and zero-fills any unread ones. These should be relatively easy to implement if one already has unaligned SIMD LD/ST.

One might also consider LDBV/STBV variable length bit vectors 0 to 256b,


From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 25 15:17:52 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> writes:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.
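A minimal sketch of the difference being described, with a few assumptions labeled: `naive_copy` and `count_42` are hypothetical helper names, and the array is zero-initialized (the original leaves it uninitialized) so that the outcome is deterministic. Calling `memcpy` itself on overlapping regions is undefined, so the sketch uses an explicit byte loop to stand in for the "naive byte copy".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a naive forward byte copy with no overlap handling. */
static void naive_copy(char *dst, const char *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];    /* a later iteration re-reads bytes written here */
}

/* Count how many of the 100 bytes hold 42 after the overlapping copy. */
static size_t count_42(int use_memmove)
{
    char a[100] = {0};      /* zeroed (an assumption) so the result is defined */
    size_t count = 0;

    a[0] = 42;
    if (use_memmove)
        memmove(a + 1, a, 99);      /* well defined for overlapping regions */
    else
        naive_copy(a + 1, a, 99);   /* smears a[0] across the whole array */

    for (size_t i = 0; i < sizeof a; i++)
        count += (a[i] == 42);
    return count;
}
```

With these assumptions, `count_42(0)` gives 100 (every byte is 42) while `count_42(1)` gives 2 (only a[0] and a[1]), which is the gap between what the naive loop does and what memmove() is defined to do.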

The Burroughs B3500 and its medium systems successors, a
memory-to-memory architecture, had a number of move instructions,
several of which had architecturally defined semantics for
overlapping source and destination fields, which included
functionality similar to that you describe above.

MVR (Move Repeat) was the one most commonly used to fill
a single value into multiple memory locations (where
the value could be from one to 100 digits and the repeat count
between 1 and 100).

The remaining move instructions MVA (Move Alpha - i.e. bytes)
MVD (Move Data), MVW (Move Words) and
MVN (Move Numeric) had defined semantics
for some cases of overlapping operands that could result in
"smearing" a store over a large region of memory or
repeating a digit throughout the receiving operand.

For example, the overlap behavior for MVW was

"When the final B address is less than the final A
address and the fields partially overlap, the source
data field will be shifted by that number of digits
to the left. When the B data field partially overlaps
the A data field and B is greater than A, repeat the data from
the A address to the B address throughout the destination
data field. The B data field may totally overlap the A
data field"

The overlap behavior for MVC (Move and Clear) could be
used to right justify the A data in the B field with
the destination filled with leading zeros or shift
the data to the left depending on the relationship
between A and B.

All other overlap results were dependent upon the
generation of processor and could not be relied upon
between generations.

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Jun 25 16:26:43 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> schrieb:

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

This is what Fortran does for array assignment. From the language
definition, the right-hand side of an assignment is evaluated
completely, then the value is assigned to the left-hand side.
So, from the language definition,

a = a + 1.0

is something like, assuming a suitable declaration for tmp,

allocate (tmp(size(a)))
tmp = a + 1.0
a = tmp
deallocate (tmp)

and a compiler is free to do that. However, for efficiency
reasons, a compiler writer is well-advised to detect this
case and make it into a simple loop.

A lot of tricks can be played with dependency checking, loop
reversal etc.
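As a rough C rendering of the two strategies (the thread's example is Fortran; the function names here are hypothetical): the first follows the language-definition model with a temporary, the second is the simple loop. For `a = a + 1.0` each element depends only on itself, so both give the same result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Language-definition model: evaluate the whole right-hand side into a
 * temporary, then assign. Returns 0 on success, -1 if allocation fails. */
static int add_one_via_temp(float *a, size_t n)
{
    assert(n == 0 || a != NULL);
    if (n == 0)
        return 0;

    float *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)
        tmp[i] = a[i] + 1.0f;       /* tmp = a + 1.0 */
    memcpy(a, tmp, n * sizeof *tmp); /* a = tmp */
    free(tmp);
    return 0;
}

/* What a compiler can emit once it sees that element i of the result
 * depends only on element i of the source. */
static void add_one_in_place(float *a, size_t n)
{
    assert(n == 0 || a != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + 1.0f;
}
```

An assignment with a shifted section on the right-hand side would not have this property, which is where the dependency checking and loop reversal come in.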
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jun 25 18:50:27 2026

From Newsgroup: comp.arch

EricP wrote:

On 2026-Jun-25 08:39, Terje Mathisen wrote:

BGB wrote:

On 6/25/2026 2:22 AM, David Brown wrote:

On 24/06/2026 23:34, BGB wrote:

On 6/24/2026 3:17 PM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library,
whose main purpose is to give this sort of self-overlapping copy
behavior (and to consolidate nearly every LZ77 style decompressor
otherwise needing to supply their own version).

In the case of a short backwards copy, it will call "memmove()",
but as noted the behavior in the case of a short forwards copy is
different.

For non-overlap cases it can just invoke "memcpy()".

"memmove" will not fill the array above with 42. "memmove" acts
as though it copies the source to a temporary buffer, then copies
that temporary buffer to the destination. (If you want to fill
the buffer with the value 42, "memset" is the function to use.)

Yeah, this is why I created "_memlzcpy()", because the defined
behavior for "memmove()" is not what one wants for self-overlapping
forward copy.

How is your "_memlzcpy" defined that is different from that?

Here:
   _memlzcpy(dst+1, dst, len);
Is functionally equivalent to:
   memset(dst+1, *dst, len);

But, it can do more:
   _memlzcpy(dst+2, dst, len); //repeating 2-byte pattern
   _memlzcpy(dst+3, dst, len); //repeating 3-byte pattern
   ...

So, required to work for every self-overlap distance.

Or, in the case as commonly used in an LZ77 style decompressor:
   _memlzcpy(dest, dest-distance, length);

Though, there are also:
Â Â _memcpyf()
Â Â _memmovef()
Â Â _memlzcpyf()

Where the 'f' in this case means:
Allowed to be a little faster by potentially going up to 32 bytes extra.

I'm guessing you really meant up to 31 bytes extra?

This is what my own (faster than Google's version) LZ4 decompressor
uses internally.

I am using either a pair of SSE or a single AVX register (so 32 bytes
in both cases) as the copy granule. For the specific, very common, case
of an overlapping copy that unrolls RLL-encoded data, I start by
loading the starting pattern into the bottom of a register, then use
the pattern length to index into a table of swizzle patterns that will
generate the required results, for any pattern up to 32 bytes long.

swizzle_table:

[0,0,0,0,0,0,0,...
[0,1,0,1,0,1,0,1,...
[0,1,2,0,1,2,0,1,2,...
[0,1,2,3,0,1,2,3,...
[0,1,2,3,4,0,1,2,3,..

etc.

Note that having 31 entries of 32 bytes each means that I'm allocating
almost a KB of $L1 cache space just for this table, but when you're
decompressing lots of data it pays off.

If I had 256b,32B registers I would like to have LDV Load Variable and
STV Store Variable instructions, which take an address, a src/dst simd
register, and either a scalar register or immediate byte count in the
range 0..32. LDV loads the specified number of bytes into the simd
starting at the least significant byte and zero-fills any unread ones.
These should be relatively easy to implement if one already has
unaligned SIMD LD/ST.

One might also consider LDBV/STBV variable length bit vectors 0 to 256b,

We do have that, in the form of a masked move, but it is more efficient
to simply use the regular unaligned store op.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 17:13:04 2026

From Newsgroup: comp.arch

Andy Valencia <vandys@vsta.org> posted:

Thomas Koenig <tkoenig@netcologne.de> writes:

BGB <cr88192@gmail.com> schrieb:

Usual downside is that the excessive parentheses tend to turn into a usability issue.

Ample fun has been made of this over time.

From rec.humor.funny:

From: jasmerb@mist.cs.orst.edu (Bryce Jasmer)
Newsgroups: rec.humor.funny
Subject: The Strategic Defense Initiative (SDI/Star Wars)
Keywords: computer, funny
Message-ID: <137457@looking.on.ca>
Date: 23 Apr 90 10:30:08 GMT
Sender: funnyr@looking.on.ca
Posted: Mon Apr 23 11:30:08 1990
Reply-Path: mist.cs.orst.edu!jasmerb

Through some clever security hole manipulation I have been able to
break into all of the government's computers and acquire the Lisp code
to SDI. Here is the last page (tail -10) of it to prove that I actually
have the code:

)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))

I remember the LISP on PDP-8. One could use the character ] to mean as many
)s as needed to close the lambda.

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
No AI was used in the composition of this message


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 17:14:28 2026

From Newsgroup: comp.arch

John Levine <johnl@taugh.com> posted:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

Why not::

memset( a, 42, 100 );

?????

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.


From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 17:20:01 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> posted:

On 24/06/2026 23:34, BGB wrote:

---------------------

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

Act as though it copies twice is utterly unnecessary as overlapping
memory can simply be performed back-to-front instead of front-to-back.

How is your "_memlzcpy" defined that is different from that?


From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 25 20:45:26 2026

From Newsgroup: comp.arch

On 25/06/2026 19:20, MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:

On 24/06/2026 23:34, BGB wrote:

---------------------

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that
temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

Act as though it copies twice is utterly unnecessary as overlapping
memory can simply be performed back-to-front instead of front-to-back.

You are mixing up "act as though" it does something, and implementing it
that way. memmove implementations will typically figure out if they can
work as a forwards loop or a backwards loop, and do that. For moves
that are big enough to be worth the effort, they'll do it using big
lumps (64 bit, or bigger if that is more efficient) and then handle any
last few bytes individually. If the overlap is closer than a "lump",
more effort is needed. As Thomas said in reference to Fortran array assignment, there are lots of tricks possible that give faster results
than the simple forward or backwards byte copying.

But however it is implemented, the result is the same as you would get
by copying to a temporary area.
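A byte-at-a-time sketch of the direction choice, assuming nothing about any particular libc (`byte_memmove` is a hypothetical name, and real implementations move larger lumps as described above). The observable result matches the copy-through-a-temporary model without ever allocating one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version: pick the loop direction that never
 * overwrites a source byte before it has been read. */
static void *byte_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    if ((uintptr_t)d <= (uintptr_t)s) {
        /* Destination starts at or below the source: copy front-to-back. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Destination starts above the source: copy back-to-front. */
        while (n--)
            d[n] = s[n];
    }
    return dst;
}
```

The pointers are compared as `uintptr_t` because relational comparison of pointers into unrelated objects is not defined by the C standard; a libc is free to rely on its platform here.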

How is your "_memlzcpy" defined that is different from that?


From John Levine@johnl@taugh.com to comp.arch on Thu Jun 25 19:19:47 2026

From Newsgroup: comp.arch

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

John Levine <johnl@taugh.com> posted:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

Why not::

memset( a, 42, 100 );

Jeez, it's an example.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 14:55:28 2026

From Newsgroup: comp.arch

On 6/25/2026 7:39 AM, Terje Mathisen wrote:

BGB wrote:

On 6/25/2026 2:22 AM, David Brown wrote:

On 24/06/2026 23:34, BGB wrote:

On 6/24/2026 3:17 PM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to
write code that
is intuitively reasonable and sometimes works but isn't portable,
e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

This one is why I added a "_memlzcpy()" function to my C library,
whose main purpose is to give this sort of self-overlapping copy
behavior (and to consolidate nearly every LZ77 style decompressor
otherwise needing to supply their own version).

In the case of a short backwards copy, it will call "memmove()", but
as noted the behavior in the case of a short forwards copy is
different.

For non-overlap cases it can just invoke "memcpy()".

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that
temporary buffer to the destination. (If you want to fill the
buffer with the value 42, "memset" is the function to use.)

Yeah, this is why I created "_memlzcpy()", because the defined
behavior for "memmove()" is not what one wants for self-overlapping
forward copy.

How is your "_memlzcpy" defined that is different from that?

Here:
   _memlzcpy(dst+1, dst, len);
Is functionally equivalent to:
   memset(dst+1, *dst, len);

But, it can do more:
   _memlzcpy(dst+2, dst, len); //repeating 2-byte pattern
   _memlzcpy(dst+3, dst, len); //repeating 3-byte pattern
   ...

So, required to work for every self-overlap distance.

Or, in the case as commonly used in an LZ77 style decompressor:
   _memlzcpy(dest, dest-distance, length);

Though, there are also:
   _memcpyf()
   _memmovef()
   _memlzcpyf()

Where the 'f' in this case means:
Allowed to be a little faster by potentially going up to 32 bytes extra.

I'm guessing you really meant up to 31 bytes extra?

Yeah, off by 1 error.

This is what my own (faster than Google's version) LZ4 decompressor uses internally.

I am using either a pair of SSE or a single AVX register (so 32 bytes in both cases) as the copy granule. For the specific, very common, case of
an overlapping copy that unrolls RLL-encoded data, I start by loading
the starting pattern into the bottom of a register, then use the pattern length to index into a table of swizzle patterns that will generate the required results, for any pattern up to 32 bytes long.

swizzle_table:

[0,0,0,0,0,0,0,...
[0,1,0,1,0,1,0,1,...
[0,1,2,0,1,2,0,1,2,...
[0,1,2,3,0,1,2,3,...
[0,1,2,3,4,0,1,2,3,..

etc.

Note that having 31 entries of 32 bytes each means that I'm allocating almost a KB of $L1 cache space just for this table, but when you're decompressing lots of data it pays off.
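A scalar sketch of how such a table could be built and applied. This is an illustration only, not the decompressor's actual code: the names are hypothetical, the row layout (row for pattern length n holds i mod n) is inferred from the rows shown above, and a plain loop stands in for the SIMD byte shuffle.

```c
#include <assert.h>
#include <stdint.h>

enum { GRANULE = 32, MAX_PATTERN = 31 };

/* 31 rows of 32 bytes = 992 bytes, i.e. "almost a KB". */
static uint8_t swizzle_table[MAX_PATTERN][GRANULE];

/* Row for pattern length len: output byte i takes pattern byte i % len. */
static void build_swizzle_table(void)
{
    for (int len = 1; len <= MAX_PATTERN; len++)
        for (int i = 0; i < GRANULE; i++)
            swizzle_table[len - 1][i] = (uint8_t)(i % len);
}

/* Scalar stand-in for the shuffle: expand a len-byte pattern to fill one
 * 32-byte granule using the row selected by the pattern length. */
static void expand_pattern(uint8_t out[GRANULE], const uint8_t *pattern, int len)
{
    assert(len >= 1 && len <= MAX_PATTERN);
    const uint8_t *row = swizzle_table[len - 1];
    for (int i = 0; i < GRANULE; i++)
        out[i] = pattern[row[i]];
}
```

After `build_swizzle_table()`, expanding a 3-byte pattern "xyz" fills the granule with "xyzxyzxyz...", after which the decompressor can store whole granules.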

Not using my own C library on x86-64, ... usually (depending on context) MSVCRT or glibc or similar.

In my case it is typically using 64-bit copies, but for better pipeline utilization it is better to copy in groups of 4x 64-bit loads/stores, or
32 bytes.

Well, and a 0,2,1,3 order on my core; but this is mostly a benefit if
the pointer is aligned on a 16-byte boundary (can potentially avoid some internal penalties within the L1 cache).

But, yeah, I can use this for both LZ4 and RP2 compression, which are typically the main two that I use.

Some common properties:
  Both byte-oriented designs that allow for fast decompressors.
Some different properties:
  LZ4 usually does slightly better for program binaries;
  RP2 usually does better for general data;
  LZ4 is usually faster on OoO machines;
  RP2 is usually faster on in-order.
LZ4 limits:
  Distance: 64K
  Literal Length: Unbounded
  Match Length: Unbounded
RP2 limits (typical):
  Distance: 128K
  Literal Length: Unbounded
  Match Length: 516 bytes

There exist variants of RP2 which allow larger limits, but for many
use-cases, the version with these limits makes the most sense. A newer
variant added a different mechanism for long-distance matching (the
original mechanism wasn't ideal in some ways), and some special cases to
help with long RLE runs (a long-match / very-short-distance case), but
these aren't really used much.

As noted, the design of RP2 was a mutation of the EA RefPack design, but
with bits moved around to work better for a little endian (the original
design seemingly assumed big endian). Also typically traded 1 bit of
match length for 1 bit of literal length (literal runs typically 0..7 vs 0..3).

Did investigate at one point whether it might have been better to trade
1 bit on distance instead, but testing seemed to confirm taking it from
match length as the correct choice.

In both cases, there is a presumed positive correlation between match
length and distance, unlike LZ4 where they are uncorrelated (distance
always 16 bits).

...

Had also experimented with bolt-on post-compressors for RP2:
  STF + AdRice:
    Can push compression similar to that of Deflate
    It tends to have an advantage for small payloads.
    But, relative gains diminish with payload size.
    Speed drops into Deflate-like areas.
  Range Coder:
    Compression increases to LZMA like areas;
    Speed decreases to LZMA like areas.

No Huffman or ANS, but:
Huffman would be a more complex mechanism to apply as a bolt-on post compressor, and isn't likely to see a significant compression delta.

ANS seems very weird / convoluted;
Extant implementations I have looked at don't seem to live up to claims
of extreme compression at high speeds; most have both speed and
compression that seem to fall between Huffman and Range-Coding;
Has quirks that would pose problems for use as a simple post-compressor.

The idea being that one first encodes as RP2 normally, and then checks
if a post compressor could give an acceptable level of additional
compression. This avoids wasting the speed-cost of more expensive post-encoders on data which does not significantly benefit (and entropy
coding isn't always the win one might think it is).

The post-compressor effectively sorts out bytes into various categories
and runs them through the corresponding entropy context.

Decompression could be done 2-pass, but usually faster to do a combined decoder.

Might seem like an overly jank approach, but worked well in testing...

STF + AdRice is used in some of my compressors:

STF: Each time a symbol is encoded, swap it towards the front of the
list. The symbols are encoded as their positions in the list.
Typical swapping is with the value 7/8 or 15/16 the current index (7/8
is better for short payloads, 15/16 or 31/32 for long payloads).
Initial state is typically the bytes from 00 to FF in order, but
sometimes differs for some uses. For a byte context, needs 256 bytes of storage per context for decoding (encoding typically needs 512, for a reverse-lookup table). Main premise is to turn raw bytes into something
that Rice coding can use.
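A minimal sketch of the STF step as described, using the 7/8 swap target. The names are hypothetical, the exact rounding of "7/8 the current index" is an assumption, and a linear search stands in for the encoder's reverse-lookup table.

```c
#include <assert.h>
#include <stdint.h>

/* One STF context: a permutation of the 256 byte values. */
typedef struct {
    uint8_t sym[256];
} StfCtx;

/* Initial state: bytes 00..FF in order. */
static void stf_init(StfCtx *c)
{
    for (int i = 0; i < 256; i++)
        c->sym[i] = (uint8_t)i;
}

/* Swap the entry at index i with the entry at 7/8 of that index. */
static void stf_swap_forward(StfCtx *c, unsigned i)
{
    unsigned j = (i * 7) / 8;       /* assumed rounding: truncate */
    uint8_t t = c->sym[i];
    c->sym[i] = c->sym[j];
    c->sym[j] = t;
}

/* Encoder: emit the symbol's current list position, then update. */
static unsigned stf_encode(StfCtx *c, uint8_t s)
{
    unsigned i = 0;
    while (c->sym[i] != s)          /* always terminates: list is a permutation */
        i++;
    stf_swap_forward(c, i);
    return i;
}

/* Decoder: map a position back to its symbol, applying the same update. */
static uint8_t stf_decode(StfCtx *c, unsigned i)
{
    assert(i < 256);
    uint8_t s = c->sym[i];
    stf_swap_forward(c, i);
    return s;
}
```

Encoder and decoder stay in sync because both apply the identical swap after every symbol; frequently used bytes drift toward small indices, which is what the Rice stage wants to see.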

AdRice:
  Adaptive variant of Rice coding.
  Encodes Q as a unary coded prefix, followed by K bit suffix;
    Q=Val>>K, Suf=Val&((1<<K)-1)
  Except for an escape-case, where if Q>7:
    Encode an 8-bit prefix, and raw 8-bit index.
  State update:
    Typical variant:
      Q=0, if K>0, decrement K
      Q=1: Leave K as-is
      Q>1: If K<7, increment K
    Alt variant:
      K is understood as having a 3 bit fraction.
      Update instead increments the fraction,
        so major K update happens more slowly.

The alternate variant's update rules can help for compression in some use-cases at the expense of others. It comes at a minor speed cost in
some cases, as the typical variant can move all of the short cases into
a lookup table, operating more like a table-driven Huffman decoder, but
the fractional K doesn't map well to expressing the updated K via a
combined lookup table.

The merit is that slowing down the K adaptation causes K to more often
be at the optimal value, whereas with the simple case it is typically
off by 1.
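A sketch of the typical variant at the symbol level, assuming hypothetical names throughout. It only splits a value into prefix and suffix and applies the K update; the actual bit output is left out because the unary polarity and bit order are not stated above.

```c
#include <assert.h>

/* Result of splitting one value for the current K. */
typedef struct {
    unsigned q;         /* unary-coded prefix value */
    unsigned suffix;    /* low K bits */
    int escape;         /* nonzero when Q>7: 8-bit prefix plus raw 8-bit index */
} RiceParts;

/* Q=Val>>K, Suf=Val&((1<<K)-1), with the Q>7 escape flagged. */
static RiceParts adrice_split(unsigned val, int k)
{
    RiceParts p;
    assert(k >= 0 && k <= 7);
    p.q = val >> k;
    p.suffix = val & ((1u << k) - 1u);
    p.escape = p.q > 7;
    return p;
}

/* Typical-variant state update: shrink K on Q=0, grow it on Q>1. */
static int adrice_update_k(int k, unsigned q)
{
    assert(k >= 0 && k <= 7);
    if (q == 0) {
        if (k > 0)
            k--;
    } else if (q > 1) {
        if (k < 7)
            k++;
    }
    return k;       /* Q=1 leaves K as-is */
}
```

Because the next K depends only on the current K and Q, the short cases can be folded into one lookup table indexed by K and the next few input bits, which is the table-driven decoding mentioned above.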

As for Huffman, it poses a frequent problem:
At a 15 or 16 bit symbol length, the needed lookup table to do a whole
symbol at once does not fit in L1 cache and often has a poor L1 hit rate;
At 12 or 13 bits, a single lookup strategy is faster, but compression
suffers (and mostly loses the compression advantage over STF+AdRice,
while still having higher cache pressure).

A partial table, say the first 8 bits, with fallback for the rest, can
work, but is also a speed penalty.
It also poses a problem for small payloads in that filling the lookup
table is often a significant time penalty.
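The table-size arithmetic can be sketched as follows. The 2-byte entry size (symbol plus code length) is an assumption for illustration, as is the helper name; actual decoders vary.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a single-lookup decode table that resolves a whole
 * symbol from max_code_bits of input. entry_bytes is an assumed size. */
static size_t huff_table_bytes(unsigned max_code_bits, size_t entry_bytes)
{
    assert(max_code_bits <= 24);
    return ((size_t)1 << max_code_bits) * entry_bytes;
}
```

Under that assumption, `huff_table_bytes(15, 2)` is 65536 bytes (64K) and `huff_table_bytes(12, 2)` is 8192 bytes (8K); against an L1 data cache in the tens of KB, the first cannot fit and the second does, before counting the data being decoded.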

...

There are some claims of formats with "extreme compression at amazing
speeds" from some companies (like, Deflate-like compression at LZ4 like speeds), but given I don't have them, I can't test anything myself.


From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 14:58:36 2026

From Newsgroup: comp.arch

On 6/25/2026 12:20 PM, MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:

On 24/06/2026 23:34, BGB wrote:

---------------------

"memmove" will not fill the array above with 42. "memmove" acts as
though it copies the source to a temporary buffer, then copies that
temporary buffer to the destination. (If you want to fill the buffer
with the value 42, "memset" is the function to use.)

Act as though it copies twice is utterly unnecessary as overlapping
memory can simply be performed back-to-front instead of front-to-back.

Yes, this is how memmove is done in practice IME.
  Forwards copy:
    Do it end-to-front (backwards)
  Backwards copy:
    Do it front-to-end (forwards)

For what memmove is intended to do, it works well, but is not always
what someone needs.

And, for whatever reason, the people developing the C standards
seemingly didn't feel "a copy function that explicitly produces
repeating N byte patterns in the case of self-overlap" to be a priority...
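For reference, a sketch of the repeating-pattern semantics in the LZ77 form (destination plus back-distance). This is not the actual "_memlzcpy()"; `lz_copy_ref` is a hypothetical name, and the chunked memcpy is just one simple way to get the defined result without overlapping any single memcpy call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy len bytes to dst from dst - dist, so that a distance shorter than
 * the length repeats the dist-byte pattern that precedes dst. */
static void lz_copy_ref(unsigned char *dst, size_t dist, size_t len)
{
    assert(dst != NULL && dist > 0);
    const unsigned char *src = dst - dist;

    while (len > 0) {
        /* At most dist bytes per step, so src and dst never overlap
         * within one memcpy call. */
        size_t chunk = len < dist ? len : dist;
        memcpy(dst, src, chunk);
        dst += chunk;
        src += chunk;
        len -= chunk;
    }
}
```

With a distance of 1 this behaves like memset() with the preceding byte; with a distance of 3 after "abc" it produces "abcabcabc...".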

How is your "_memlzcpy" defined that is different from that?


From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 18:07:15 2026

From Newsgroup: comp.arch

On 6/25/2026 12:13 PM, MitchAlsup wrote:

Andy Valencia <vandys@vsta.org> posted:

Thomas Koenig <tkoenig@netcologne.de> writes:

BGB <cr88192@gmail.com> schrieb:

Usual downside is that the excessive parentheses tend to turn into a
usability issue.

Ample fun has been made of this over time.

From rec.humor.funny:

From: jasmerb@mist.cs.orst.edu (Bryce Jasmer)
Newsgroups: rec.humor.funny
Subject: The Strategic Defense Initiative (SDI/Star Wars)
Keywords: computer, funny
Message-ID: <137457@looking.on.ca>
Date: 23 Apr 90 10:30:08 GMT
Sender: funnyr@looking.on.ca
Posted: Mon Apr 23 11:30:08 1990
Reply-Path: mist.cs.orst.edu!jasmerb

Through some clever security hole manipulation I have been able to
break into all of the government's computers and acquire the Lisp code
to SDI. Here is the last page (tail -10) of it to prove that I actually
have the code:

))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))

I remember the LISP on PDP-8. One could use the character ] to mean as many )s as needed to close the lambda.

I had ended up in a custom dialect using [] and {} for other things, say:
(+ x y) //ye olde
[1 2 3] //sorta like #<1 2 3>
{:x 1 :y 2} //associative dictionary/object
{x: 1 y: 2} //basically the same

":x" and "x:" were both understood as keywords (like in CL), sometimes
but not always interchangeable (both were treated the same for most
runtime tasks, but "(eq? :x x:)" would be false, as they were considered
as distinct sub-types of keyword, mostly differentiated by remembering
which side the colon went on).

Though, these would evaluate its operands, so more like:
[1 2 3] => (vector 1 2 3)
{x: 1 y: 2} => (object x: 1 y: 2)
Where, say:
(define obj {x: 1 y: 2})
(obj :z (+ (obj :x) (obj :y)))
Would add a "z" member holding 3.

But, say:
(obj foo: 1 2 3)
Would call a 'foo' method on the object with 1, 2, 3.

(define obj2 {x: 1 y: 2 bar: (lambda () (+ x y))})

Or, something to this effect...

Also, IIRC, members starting with '$' would delegate, so like:
(define obj1 {x: 1 y: 2})
(define obj2 {z: 3 $up: obj1})

(obj2 :y) => 2 (via the $up) member (there could be multiple).

Though, this language wasn't pure by any means, sort of a Scheme /
Common Lisp / Self hybrid.

The original BGBScript language continued on with a similar model (*1),
just replacing "$up" with "_up_" or similar. Also the lookup process
would track where it had been, so cycles would not explode the lookup.

Also there was a big hash table to track object/member lookups, so often lookups could remain (moderately) fast (by the standards of that era for
a script VM).

*1: Well, after the disastrously slow first BGBScript VM the second
reused the first language (and its VM) as the core (throwing the JS like syntax on top).

Seemed like a cool/nifty idea at the time, as did using this as the
logical basis of the entire scoping model. But, when later wanting to
move to a static-typed core (with type-inference, etc), this stuff came
back to bite.

Also went between different tagref formats.

Say: Early VM:
  (31:3): Address
  ( 2:0): Tag
  Tag:
    000: Object Reference
    100: Cons Cell
    110: Various literal value types.
    x01: Fixnum (30 bits)
    x11: Flonum (30 bits, Binary32 with 2b cut off)

First BS VM:
Went over to bare pointers (more C friendly);
Crammed fixnum and flonum into 24 bit address ranges (sucked).
Second BSVM:
Went back to the tagrefs.
Also a precise GC, but this was a pain.
Third VM:
Mostly Went back to bare pointers for objects and cons cells;
Went back to a conservative GC.

Later, eg:
Went 64-bit, then ended up moving the VM over to a 64-bit format.
(63:48): Tag Bits
(47: 0): Bare Address
With a tag in the HOBs:
0000: Object/Etc
0001: Literal values/etc
001x: Misc (bounded pointers ATM)
01xx: Fixnum
10xx: Flonum
11xx: ...
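
As a rough illustration of the 64-bit layout, here is a sketch that classifies a reference by its top four bits and packs a 62-bit fixnum. The names are hypothetical, and the fixnum payload encoding (two's complement in the low 62 bits) is an assumption, not something stated above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t tagref64;

enum tr64_kind { TR64_OBJECT, TR64_LITERAL, TR64_MISC,
                 TR64_FIXNUM, TR64_FLONUM, TR64_OTHER };

#define TR64_PAYLOAD62 0x3FFFFFFFFFFFFFFFull   /* low 62 bits */
#define TR64_ADDR48    0x0000FFFFFFFFFFFFull   /* bits 47:0   */

static enum tr64_kind tr64_classify(tagref64 r)
{
    unsigned hob = (unsigned)(r >> 60);            /* top four bits */
    if (hob == 0x0)         return TR64_OBJECT;    /* 0000 */
    if (hob == 0x1)         return TR64_LITERAL;   /* 0001 */
    if ((hob & 0xE) == 0x2) return TR64_MISC;      /* 001x */
    if ((hob & 0xC) == 0x4) return TR64_FIXNUM;    /* 01xx */
    if ((hob & 0xC) == 0x8) return TR64_FLONUM;    /* 10xx */
    return TR64_OTHER;                             /* 11xx */
}

static tagref64 tr64_from_fixnum(int64_t v)
{
    assert(v >= -((int64_t)1 << 61) && v < ((int64_t)1 << 61));
    return ((uint64_t)v & TR64_PAYLOAD62) | (1ull << 62);
}

static int64_t tr64_to_fixnum(tagref64 r)
{
    int64_t v = (int64_t)(r & TR64_PAYLOAD62);
    assert(tr64_classify(r) == TR64_FIXNUM);
    if (v & ((int64_t)1 << 61))                    /* sign bit of the 62-bit field */
        v -= ((int64_t)1 << 62);
    return v;
}

/* Bare 48-bit address of an object reference (tag bits 63:48 stripped). */
static uint64_t tr64_address(tagref64 r)
{
    assert(tr64_classify(r) == TR64_OBJECT);
    return r & TR64_ADDR48;
}
```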

Some may or may not recognize this as the format my ISA project is using...

But, the basic scheme itself originated when the BGBScript VM went 64-bit.

Well, also the type-tagging notation that BGBCC uses was also partly
shared with the BGBScript VM.

Ironically, cons cells and cons lists still exist, sorta, but are not
really widely used in TestKern.

Ironically, I had also partly used this typesystem internally in the
makeshift BASIC dialect, which in another offshoot, started to gain some Lisp-like appendages.

But, not entirely sure that "Weird mix of Lisp and 1980s style
unstructured BASIC" is really a direction I want to go in.

Though, could revive a Lisp style dialect, but with C style loops:
(while (cond...) (begin
(if (something) (break))
...
))
(let-for (i 0) (< i 10) (++ i) (println "Yeah " i))
With (break) and (continue).

Well, and maybe in certain contexts bring back 32-bit tagrefs as a way
to save memory (though, operating within the limits of a constrained
heap, rather than public memory, and possibly referencing any external
objects as handles).

Could go further (eg, 16 bits), though most non-toy examples are likely
to either run out of addressable cons-cells, or not actually have enough
going on to benefit from the 16-bit handles.

Say (if 16b):
00: Object Handle
01: CONS Handle
10: Fixnum
11: Misc
00: Symbols (4K unique symbols)
01: Keywords
10: Magic values
11: ?

...

Though, could make sense for a GLSL compiler, as one is not as likely to
run out of cons cells when compiling a shader. A fully populated CONS heap
would be 64K, vs 256K for the same number of cons cells with the normal
memory management.

Then again, the number of AST nodes that BGBCC uses when compiling Doom
isn't wildly larger than this, so it is very possible that 16K cons
cells could be enough to compile even a fairly complex GLSL shader...

Well, or get wacky and use 48-bit cons cells with 24-bit tagrefs (4M
cons cells, could compile full on C programs with this), hrrm...

Well, also C would need more than 4K unique symbols, but one does
generally stay under a 64K symbol limit (for something in a Doom'ish
size range).
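
The sizes quoted above can be checked with a little arithmetic. A minimal sketch, assuming a 2-bit type tag on every reference and two references (car and cdr) per cons cell; the helper names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct handles when 'tag_bits' of a 'ref_bits'-wide
   tagref are spent on type tags. */
static unsigned long handle_count(unsigned ref_bits, unsigned tag_bits)
{
    assert(tag_bits < ref_bits && ref_bits - tag_bits < 32);
    return 1ul << (ref_bits - tag_bits);
}

/* Bytes for a fully populated cons heap: each cell holds two
   references (car and cdr) of 'ref_bits' bits each. */
static unsigned long cons_heap_bytes(unsigned long cells, unsigned ref_bits)
{
    assert(ref_bits % 8 == 0);
    return cells * 2ul * (ref_bits / 8);
}
```

With 16-bit tagrefs, handle_count(16, 2) gives 16K cons handles, cons_heap_bytes(16384, 16) gives 64K, and the same 16K cells with 64-bit references come to 256K. A 24-bit tagref leaves 22 bits, hence 4M cells, and spending four bits (tag plus sub-tag) of a 16-bit reference leaves 4K symbols.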

...

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 25 10:33:38 2026

From Newsgroup: comp.arch

The glibc function ::backtrace can be called at any time, in any context.

I guess, internally, that backtrace function can signal an exception
after setting up an appropriate "debugger" that will collect the
backtrace and return it to the application.

Then there are the unix context functions that also allow access to
resources not normally visible to an application - getcontext(2), makecontext(3) and the setjmp/sigsetjmp functions which also
gather the thread context, including the current stack pointer.

IIRC these don't need to *look* at the stack, they can limit their work
to manipulating the stack pointer.

=== Stefan

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 23:46:20 2026

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:
/**
* Log a simulator stack traceback.
*/
void
c_osdep::backtrace(c_logger *lp)
{
int num_frames;
void *framelist[100];
char **strings;

num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));

Where does ::backtrace get access to the number of preserved registers
on the stack and where the return address is on a per subroutine basis ??

That is: each stack frame is of a different size with return address at a different spot per subroutine.

strings = ::backtrace_symbols(framelist, num_frames);
if (strings == NULL) {
lp->log("Unable to obtain simulator stack traceback: %s\n",
strerror(errno));
return;
}
for(int frame=0; frame < num_frames; frame++) {
lp->log("[%2.2d] %s\n", frame, strings[frame]);
}
::free(strings);
}

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 25 20:29:42 2026

From Newsgroup: comp.arch

On 6/19/2026 5:09 PM, Scott Lurndal wrote:

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:

On 6/19/2026 11:59 AM, John Levine wrote:

According to David Brown <david.brown@hesbynett.no>:

Possibly the biggest millstone around the neck of computing
architectures is the C language. ...

De-facto standards are /always/ albatrosses to some extent. Things are
done that way because things are done that way - processors are designed
to run C (or C-model languages, if you like) because that's what
existing code is written in, and code is written in C (or similar
languages, or languages with a VM written in C) because that's how
existing processors work.

C killed off every memory model other than flat byte addressed memory.
Pointers are sort of typed, but any real C program does stuff like this:
p = (struct foo *) malloc(42 * sizeof(struct foo));

Fwiw, why all of the casts?

C and C++ handle void* conversions differently. You must cast
the malloc result to a pointer of the declared type when using C++.

Oh well, yeah. I had C on the brain. :^o

It doesn't hurt to add the cast in C, and may help with documenting
the intention of the programmer who wrote the code.

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Jun 26 06:08:15 2026

From Newsgroup: comp.arch

Scott Lurndal <scott@slp53.sl.home> schrieb:

GLIBC has a function to obtain a backtrace at a current point
in time. This is called in the context of the thread that invokes
the call. It requires access to the call records on the stack
in the context of the thread (the glibc functions are backtrace(3)
and backtrace_symbols(3)).

/**
* Log a simulator stack traceback.
*/
void
c_osdep::backtrace(c_logger *lp)

Nit: That is not glibc code, glibc code is C (it would be strange to
have a C++ runtime library for C...)

The glibc code can be seen, for example, at

https://github.com/bminor/glibc/blob/master/debug/backtrace.c
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Jun 26 06:38:48 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

scott@slp53.sl.home (Scott Lurndal) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:
/**
* Log a simulator stack traceback.
*/
void
c_osdep::backtrace(c_logger *lp)
{
int num_frames;
void *framelist[100];
char **strings;

num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));

Where does ::backtrace get access to the number of preserved registers
on the stack and where the return address is on a per subroutine basis ??

That is: each stack frame is of a different size with return address at a different spot per subroutine.

There are several methods.

Rolling back via the frame pointer is one method, which of course
incurs overhead.
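
The frame-pointer method amounts to walking a linked list. The sketch below models it on a simulated chain rather than a real stack: the frame_record layout and walk_frames are hypothetical, and this is not glibc's implementation, only the shape of the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame record, as pushed by a prologue that maintains a
   frame pointer: a link to the caller's record plus the return address. */
struct frame_record {
    const struct frame_record *caller_fp;
    const void                *return_addr;
};

/* Follow the chain starting at 'fp', storing up to 'max' return
   addresses in 'out'. Returns the number of frames visited. */
static int walk_frames(const struct frame_record *fp,
                       const void **out, int max)
{
    int n = 0;
    assert(out != NULL && max >= 0);
    while (fp != NULL && n < max) {
        out[n++] = fp->return_addr;
        fp = fp->caller_fp;      /* step to the caller's frame */
    }
    return n;
}
```

The overhead mentioned above comes from every function having to save and link its frame record, and from tying up a register to hold the current one.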

Then there's EH frame based stack tracing, which uses DWARF debug
info that is also used for exception handling. To use this,
you need to interpret DWARF opcodes. (You also need to interpret
DWARF opcodes for exception handling. An exception will usually
cost you thousands of cycles, which is HUGE).

The latest and greatest for backtrace is probably SFrame (used in
the Linux kernel, for example). This uses a lookup table to locate
the current function. See https://sourceware.org/binutils/wiki/sframe .

Hmm... one other question. Would EH frame-based stack unwinding
(which is now the standard) work with My 66000's safe stack?
I think not, because it needs the return address, but I may
be wrong.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 03:24:25 2026

From Newsgroup: comp.arch

On 6/23/2026 5:54 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

-------------

In my case, I tended to use more conservative approaches and then only
optimize based on what can be verified by the compiler within certain
fundamental assumptions.

Say:
Pointer 1 points at a stack array in the local function;
Pointer 2 was derived from taking the address of a global array;
Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume they
don't alias with a pointer to a local array;

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming error.

-----------------

I am reminded of the person, apparently very religious, who some decades
ago posted to solicit help for reimplementing all of computing (gcc,
GNU, et cetera) on Biblical principles, because he thought Richard
Stallman was too atheistic and had tainted his products. I have not
heard how that went.

Rick...

--------

Sorry if this is way off base, but well...

What about container_of, or CONTAINING_RECORD?

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 25 22:18:18 2026

From Newsgroup: comp.arch

But if you know that you have a "page of small integers" then you can just
do address comparisons between them, the Franz Lisp compiler did this.

Ah, so you're using the leading bits that correspond to that "page of
integers" as tagbits. You don't really need BiBoP allocation to do
that, tho.

=== Stefan

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 25 22:23:35 2026

From Newsgroup: comp.arch

John Levine [2026-06-25 19:19:47] wrote:

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

John Levine <johnl@taugh.com> posted:

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

Why not::

memset( a, 42, 100 );

Jeez, it's an example.

It's an example, indeed, but it's a pretty bad one since using `memset`
is more clear, more concise, and actually works, whereas your example
seems very contrived.

=== Stefan

From David Brown@david.brown@hesbynett.no to comp.arch on Fri Jun 26 14:10:55 2026

From Newsgroup: comp.arch

On 26/06/2026 04:23, Stefan Monnier wrote:

John Levine [2026-06-25 19:19:47] wrote:

According to MitchAlsup <user5857@newsgrouper.org.invalid>:

John Levine <johnl@taugh.com> posted:

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:
char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

Why not::

memset( a, 42, 100 );

Jeez, it's an example.

It's an example, indeed, but it's a pretty bad one since using `memset`
is more clear, more concise, and actually works, whereas your example
seems very contrived.

While a memset would be much better in this case (assuming that is the
effect the author was aiming for), it is certainly the case that people
have used memcpy() with overlapping regions and an assumption that it
copies forward in some way. John's point that "it is easy to write code
that is intuitively reasonable and sometimes works but isn't portable"
is to a fair extent independent of the quick example he wrote to
demonstrate it.

However, the example does show how John is somewhat inaccurate - and it demonstrates how difficult things are when people write code with
undefined behaviour.

It is /not/ intuitively reasonable to write code like that example. But
it /looks/ like it is reasonable. The critical issue is that it is not
clear from the code whether the author wants the memset-like behaviour
that some memcpy implementations would give, where a[] is filled with
42, or if the author wants the memmove-like behaviour that many other
memcpy implementations would give (where a[0] remains 42, a[1] gets 42, a[2..99] gets whatever was previously in a[1..98]). What those values
were depends on any initialisation there was of the rest of a[] - if
they were not initialised, then memmove() here would also have been UB.
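
The two outcomes can be reproduced with well-defined code. This is only a sketch: the function names are invented, and the forward byte loop stands in for "a memcpy that happens to copy one byte at a time", it is not a claim about any real memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Forward byte-at-a-time copy of a[0..n-2] onto a[1..n-1]. Each byte read
   was written by the previous step, so a[0] is replicated (memset-like). */
static void replicate_first_byte(unsigned char *a, size_t n)
{
    size_t i;
    assert(a != NULL && n >= 1);
    for (i = 1; i < n; i++)
        a[i] = a[i - 1];
}

/* memmove() copies as if through a temporary buffer, so the old contents
   shift up by one position (memmove-like). */
static void shift_up_by_one(unsigned char *a, size_t n)
{
    assert(a != NULL && n >= 1);
    memmove(a + 1, a, n - 1);
}
```

Starting from a fully initialised a[] holding 42, 1, 2, 3, ..., the first function leaves every element at 42, while the second leaves 42, 42, 1, 2, ...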

John is also a bit inaccurate in writing "sometimes works but isn't
portable" - when you have UB in the code, it's just luck if the end
results meet your intentions. "Non-portable" code, as I see it, is code
that does what you want on one target or compiler, but might not do so
for other targets or compilers. Code with UB is worse - even if your
code "works" at the moment, apparently unconnected changes to other
parts of your code, changes to compiler flags, small updates to your
tools, can all give you an end result that no longer fits your
intentions. There's nothing wrong with writing non-portable code,
though it is best to be aware that you are doing so - people do that all
the time. There is always something wrong with writing code with UB - unfortunately, people do that a lot too.

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Jun 26 15:08:22 2026

From Newsgroup: comp.arch

MitchAlsup wrote:

John Levine <johnl@taugh.com> posted:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

Why not::

memset( a, 42, 100 );

In the case of a single repeating byte, memset is of course optimal, but
the same LZ4 encoding is used to encode any repeating pattern, of
lengths from 1 and up. There is no indexed memset where the pattern is
of arbitrary length.
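
A sketch of that kind of pattern-replicating copy, as a generic LZ77-style match copy; it is not taken from the LZ4 sources, and the name lz_match_copy is made up. The byte-at-a-time forward order is what makes the overlapping case (len > dist) repeat the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Append 'len' bytes at out[pos..] by copying from 'dist' bytes back.
   When len > dist the loop re-reads bytes it has just written, which
   repeats the last 'dist' bytes as a pattern. Returns the new position. */
static size_t lz_match_copy(unsigned char *out, size_t pos,
                            size_t dist, size_t len)
{
    size_t i;
    assert(out != NULL && dist >= 1 && dist <= pos);
    for (i = 0; i < len; i++)
        out[pos + i] = out[pos + i - dist];
    return pos + len;
}
```

With dist == 1 this degenerates into the single-byte fill from the example; with dist == 3 after "abc" it produces "abcabcabc...".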

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From David Brown@david.brown@hesbynett.no to comp.arch on Fri Jun 26 15:50:49 2026

From Newsgroup: comp.arch

On 26/06/2026 15:08, Terje Mathisen wrote:

MitchAlsup wrote:

John Levine <johnl@taugh.com> posted:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the
rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write
code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

Why not::

        memset( a, 42, 100 );

In the case of a single repeating byte, memset is of course optimal, but
the same LZ4 encoding is used to encode any repeating pattern, of
lengths from 1 and up. There is no indexed memset where the pattern is
of arbitrary length.

I don't think memset is necessarily "optimal", because the optimal
solution will depend on the number of bytes to fill, and possibly
alignments, and details of the exact processor. A particular memset implementation could be close to optimal for large blocks, where it is
worth picking the best algorithm at runtime. And a compiler could pick
the algorithm details at compile time if it knows the size of the target block. "Optimal" is a strong word.

I would think that the kind of copying you need for LZ4 is quite
specialised for that task - a function to do that belongs in LZ4 implementation code rather than as a standard function. Trying to use memcpy() for the task is, however, a recipe for having your name cursed
by future maintainers!

From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Jun 26 10:11:05 2026

From Newsgroup: comp.arch

On 2026-Jun-26 06:24, Chris M. Thomasson wrote:

On 6/23/2026 5:54 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

-------------

In my case, I tended to use more conservative approaches and then only
optimize based on what can be verified by the compiler within certain
fundamental assumptions.

Say:
    Pointer 1 points at a stack array in the local function;
    Pointer 2 was derived from taking the address of a global array;
    Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume they
don't alias with a pointer to a local array;

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming error.

Sorry if this is way off base, but well...

What about container_of, or CONTAINING_RECORD?

If that is what I think it is, where it casts from
a pointer to a field inside a struct back to the containing struct
by subtracting the field byte offset and changing the pointer type,
irrespective of programming language that mechanism has been used
by operating systems at least since RSX days.
It is a compact way of having structs linked to many other structures.

That macro is just a variant of the mechanism for C.
The method is used by WinNT and Linux, and I believe also by the BSD's.

GCC has a compile option, -fno-strict-aliasing, that anyone
using it and doing "illegal" pointer casting must use.
In Windows land, pointer casting at least used to be Microsoft's
recommended method and is supported by their compiler because
they use it too, extensively.

I have used it when I had complex multiple linkages between data structures.
Say an object is in multiple doubly linked lists and an index tree and I
need to cast from a pointer to a list link field back to the object
containing that link field. I also often put a validity check marker for
each object type at the start of the container and Assert its correctness.
The marker is zeroed when the container is destroyed to catch
any dangling references.
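
A minimal sketch of the mechanism, including the validity marker. CONTAINER_OF here is a plain offsetof-based variant written for this example (the Linux and Windows macros differ in detail), and the widget struct, its field names and WIDGET_MAGIC are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Recover a pointer to the enclosing struct from a pointer to one of
   its members by subtracting the member's byte offset. */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_link { struct list_link *next, *prev; };

#define WIDGET_MAGIC 0x57494447u   /* hypothetical validity marker */

struct widget {
    unsigned         magic;        /* set on creation, zeroed on destruction */
    int              id;
    struct list_link by_name;      /* membership in one list */
    struct list_link by_age;       /* membership in a second list */
};

/* Map a link taken from the "by_age" list back to its widget. */
static struct widget *widget_from_age_link(struct list_link *l)
{
    struct widget *w = CONTAINER_OF(l, struct widget, by_age);
    assert(w->magic == WIDGET_MAGIC);   /* catches dangling or mistyped links */
    return w;
}
```

Because each list has its own link field, one object can sit on several lists at once, and each list's code only needs the right member name to get back to the object.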

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Jun 26 14:21:10 2026

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

scott@slp53.sl.home (Scott Lurndal) posted:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:
/**
* Log a simulator stack traceback.
*/
void
c_osdep::backtrace(c_logger *lp)
{
int num_frames;
void *framelist[100];
char **strings;

num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));

Where does ::backtrace get access to the number of preserved registers
on the stack and where the return address is on a per subroutine basis ??

https://elixir.bootlin.com/glibc/glibc-2.43.9000/A/ident/backtrace

It is processor dependent, of course.

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Jun 26 14:22:29 2026

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> writes:

Scott Lurndal <scott@slp53.sl.home> schrieb:

GLIBC has a function to obtain a backtrace at a current point
in time. This is called in the context of the thread that invokes
the call. It requires access to the call records on the stack
in the context of the thread (the glibc functions are backtrace(3)
and backtrace_symbols(3)).

/**
* Log a simulator stack traceback.
*/
void
c_osdep::backtrace(c_logger *lp)

Nit: That is not glibc code, glibc code is C (it would be strange to
have a C++ runtime library for C...)

Indeed it is C++ code calling a GLIBC function.

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 26 16:15:06 2026

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> writes:

it is certainly the case that people
have used memcpy() with overlapping regions and an assumption that it
copies forward in some way.

More precisely, in 2010 there was a big flamewar because a newer glibc
used backwards stride on some processors for some combinations of
source and destination addresses, and this broke a pre-existing binary
(of a Flash player IIRC). The "solution" was to use memmove for
memcpy for existing binaries, and use the processor-dependent memcpy
for new binaries. I heard no complaints about the solution for
existing binaries.

This shows that no binaries that link to glibc assumed that dest can
overlap source with dest>src, and get some kind of replicating
behaviour (probably because glibc stopped using byte-by-byte copying
much earlier, if it ever had it at all).

What the Flash player apparently used is overlapping memcpy with
dest<src. It worked like memmove before the glibc release that caused
the flame war, and actually used memmove once the solution was
implemented.

A better solution might have been to implement memcpy on the funny
processors as follows:

if (prefer_forward_stride(dest,src) || (((uintptr_t)src)-((uintptr_t)dest))<n )
return memcpy_forward_stride(dest, src, n);
else
return memcpy_backward_stride(dest, src, n);

This would have covered the Flash player usage (and any like it). Of
course, memmove is only slightly more expensive to implement (you also
have to cover the case where src<dest<src+n).
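
Fleshed out into something compilable, the idea could look like the sketch below. The stride helpers are plain byte loops standing in for the real optimised routines, prefer_forward_stride() is a hypothetical tuning hook that always answers "no" here to model a processor that favours backward stride, and a flat address space is assumed for the pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void *copy_forward_stride(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t i;
    for (i = 0; i < n; i++)        /* lowest address first */
        d[i] = s[i];
    return dest;
}

static void *copy_backward_stride(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n-- > 0)                /* highest address first */
        d[n] = s[n];
    return dest;
}

/* Hypothetical hook: nonzero when forward stride is faster for these
   addresses. Always 0 here, i.e. backward stride is preferred. */
static int prefer_forward_stride(const void *dest, const void *src)
{
    (void)dest; (void)src;
    return 0;
}

static void *memcpy_sketch(void *dest, const void *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    /* Unsigned subtraction: the difference is below n exactly when
       dest <= src < dest + n, the overlap a forward copy handles. */
    if (prefer_forward_stride(dest, src) ||
        ((uintptr_t)src - (uintptr_t)dest) < n)
        return copy_forward_stride(dest, src, n);
    else
        return copy_backward_stride(dest, src, n);
}
```

With the hook returning 0 everywhere, this happens to behave like memmove, since the backward copy also covers src<dest<src+n; with a hook that sometimes prefers forward stride, that case would still need the extra check mentioned above.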

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Jun 26 13:51:01 2026

From Newsgroup: comp.arch

On 2026-Jun-26 12:15, Anton Ertl wrote:

David Brown <david.brown@hesbynett.no> writes:

it is certainly the case that people
have used memcpy() with overlapping regions and an assumption that it
copies forward in some way.

More precisely, in 2010 there was a big flamewar because a newer glibc
used backwards stride on some processors for some combinations of
source and destination addresses, and this broke a pre-existing binary
(of a Flash player IIRC).

Unbroken parts of Flash player existed?

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 11:33:21 2026

From Newsgroup: comp.arch

On 6/25/2026 10:14 AM, MitchAlsup wrote:

John Levine <johnl@taugh.com> posted:

According to David Brown <david.brown@hesbynett.no>:

On 24/06/2026 07:48, Anton Ertl wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

C requires the compiler to prove that the pointers cannot alias.

I wish. Actually, by default gcc assumes (i.e., it does not prove)
that pointers to different types (except char) do not point to the
same address. One has to turn that off with -fno-strict-aliasing.
Other C compilers use the same assumption.

That's the way C is defined. It is debatable as to whether the rules in
the C standard are ideal ...

One of the less fortunate things about C is that it is easy to write code that
is intuitively reasonable and sometimes works but isn't portable, e.g.:

char a[100];

a[0] = 42;
memcpy(a+1, a, 99);

Why not::

memset( a, 42, 100 );

?????

Or something akin to, pseudo code:

struct buffer
{
char a[100];
};

struct buffer b0 = { '\0' };

or

struct buffer b0 = { };

Try to hold the flames for a little while. Typed it in as is from
memory. ;^)

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 11:36:30 2026

From Newsgroup: comp.arch

On 6/26/2026 7:11 AM, EricP wrote:

On 2026-Jun-26 06:24, Chris M. Thomasson wrote:

On 6/23/2026 5:54 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

-------------

In my case, I tended to use more conservative approaches and then only
optimize based on what can be verified by the compiler within certain
fundamental assumptions.

Say:
Pointer 1 points at a stack array in the local function;
Pointer 2 was derived from taking the address of a global array;
Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume they
don't alias with a pointer to a local array;

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming
error.

Sorry if this is way off base, but well...

What about container_of, or CONTAINING_RECORD?

If that is what I think it is, where it casts from
a pointer to a field inside a struct back to the containing struct
by subtracting the field byte offset and changing the pointer type,
irrespective of programming language that mechanism has been used
by operating systems at least since RSX days.
It is a compact way of having structs linked to many other structures.

That macro is just a variant of the mechanism for C.
The method is used by WinNT and Linux, and I believe also by the BSD's.

GCC has a compile option, -fno-strict-aliasing, that anyone
using it and doing "illegal" pointer casting must use.
In Windows land, pointer casting at least used to be Microsoft's
recommended method and is supported by their compiler because
they use it too, extensively.

I have used it when I had complex multiple linkages between data
structures.
Say an object is in multiple double linked lists and an index tree and I
need to cast from a pointer to a list link field back to the object containing that link field. I also often put a validity check marker for
each object type at the start of the container and Assert its correctness. The marker is zeroed when the container is destroyed to catch
any dangling references.

Yup. You got it and basically had to use it the same way I have in the
past. It's really cool. Also, check this shit out:

#define RALLOC_ALIGN_OF(mp_type) \
offsetof( \
struct { \
char pad_RALLOC_ALIGN_OF; \
mp_type type_RALLOC_ALIGN_OF; \
}, \
type_RALLOC_ALIGN_OF \
)

;^D
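
For anyone who wants to try the trick, here is a sketch of the same idea with a named probe struct instead of a struct defined inside offsetof(), since some compilers complain about the latter. The macro names and align_up() are made up for this example; on C11 and later, _Alignof(T) gives the value directly.

```c
#include <assert.h>
#include <stddef.h>

/* Declare a named probe struct for type T: one char, then a T. */
#define DECLARE_ALIGN_PROBE(T, tag) \
    struct tag { char pad; T member; }

/* The offset of 'member' is 1 plus the padding the compiler inserted
   after 'pad', which in practice equals the alignment it uses for T. */
#define ALIGN_OF_PROBE(tag) offsetof(struct tag, member)

DECLARE_ALIGN_PROBE(double, probe_double);
DECLARE_ALIGN_PROBE(int, probe_int);
DECLARE_ALIGN_PROBE(char, probe_char);

/* Round 'n' up to the next multiple of 'align' (a power of two), as a
   region allocator would before handing out a block. */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}
```

Usage would be along the lines of align_up(offset, ALIGN_OF_PROBE(probe_double)) before carving a double out of a raw buffer.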

From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 11:40:07 2026

From Newsgroup: comp.arch

On 6/26/2026 11:36 AM, Chris M. Thomasson wrote:

On 6/26/2026 7:11 AM, EricP wrote:

On 2026-Jun-26 06:24, Chris M. Thomasson wrote:

On 6/23/2026 5:54 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:

On 6/22/2026 7:38 AM, Niklas Holsti wrote:

On 2026-06-22 13:44, Thomas Koenig wrote:

Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:

On 2026-06-21 22:15, David Brown wrote:

On 21/06/2026 20:57, MitchAlsup wrote:

-------------

In my case, I tended to use more conservative approaches and then only
optimize based on what can be verified by the compiler within certain
fundamental assumptions.

Say:
    Pointer 1 points at a stack array in the local function;
    Pointer 2 was derived from taking the address of a global array;
    Compiler can safely assume no-alias.

Also, if two pointers were passed into a function, can also assume
they
don't alias with a pointer to a local array;

C requires the compiler to prove that the pointers cannot alias.
Fortran specifies that if the two arguments alias, it is a programming
error.

Sorry if this is way off base, but well...

What about container_of, or CONTAINING_RECORD?

If that is what I think it is, where it casts from
a pointer to a field inside a struct back to the containing struct
by subtracting the field byte offset and changing the pointer type,
irrespective of programming language that mechanism has been used
by operating systems at least since RSX days.
It is a compact way of having structs linked to many other structures.

That macro is just a variant of the mechanism for C.
The method is used by WinNT and Linux, and I believe also by the BSD's.

GCC has a compile option, -fno-strict-aliasing, that anyone
using it and doing "illegal" pointer casting must use.
In Windows land, pointer casting at least used to be Microsoft's
recommended method and is supported by their compiler because
they use it too, extensively.

I have used it when I had complex multiple linkages between data
structures.
Say an object is in multiple double linked lists and an index tree and I
need to cast from a pointer to a list link field back to the object
containing that link field. I also often put a validity check marker for
each object type at the start of the container and Assert its
correctness.
The marker is zeroed when the container is destroyed to catch
any dangling references.

Yup. You got it and basically had to use it the same way I have in the
past. It's really cool. Also, check this shit out:

#define RALLOC_ALIGN_OF(mp_type) \
offsetof( \
    struct { \
      char pad_RALLOC_ALIGN_OF; \
      mp_type type_RALLOC_ALIGN_OF; \
    }, \
    type_RALLOC_ALIGN_OF \
)

;^D

fwiw, https://groups.google.com/g/comp.lang.c/c/7oaJFWKVCTw/m/sSWYU9BUS_QJ

From John Levine@johnl@taugh.com to comp.arch on Fri Jun 26 18:47:05 2026

From Newsgroup: comp.arch

According to Scott Lurndal <slp53@pacbell.net>:

a[0] = 42;
memcpy(a+1, a, 99);

A naive byte copy will fill a[] with 42, a more typical version that
moves larger blocks won't. This example is really obvious (it's
why there's also memmove()) but there's plenty of more subtle ones.

The Burroughs B3500 and medium systems successors, which is a
memory-to-memory architecture had a number of move instructions,
several of which had architecturally defined semantics for
overlapping source and destination fields, which included
functionality similar to that you describe above. ...

So does S/360 and its successors. The original 360 had MVC which takes two
addresses and a length, and the spec says it acts as if it copies a byte at a
time, so the a -> a+1 hack is the usual way to set a block to a specific value.
S/370 added MOVE LONG with separate lengths for the two operands and an explicit
padding byte, so you fill memory with the padding byte by setting the source
length to zero. If the operands have "destructive overlap", it sets a condition
code and moves nothing. They also added MOVE INVERSE which reverses a byte string
and has explicitly undefined results if the operands overlap by more than one byte.

S/390 added the confusingly named MOVE LONG UNICODE which moves and pads pairs of
bytes. (That works OK for UTF-16, not any other Unicode encoding.) It also has
MOVE PAGE which blats a 4K page at a time, MOVE STRING which does a C-style copy
up to a delimiter byte, and MOVE LONG EXTENDED which puts the padding byte in an
operand (typically immediate in the instruction) rather than a register and
doesn't check for destructive overlap; you get what you get.

z/Series adds MOVE RIGHT TO LEFT which is similar to MVC except it's specified to move bytes right to left rather than left to right, with the example being to
add a hole in the middle of an array.

Copying strings is surprisingly complicated.
--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
