• Re: Tonights Tradeoff

    From Robert Finch@robfi680@gmail.com to comp.arch on Tue Oct 28 23:52:53 2025
    From Newsgroup: comp.arch

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.
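
    (I.e., 10 bits plus 40 bits per additional instruction word used for
    the constant; as a trivial sketch of that relationship:)

      /* sketch, assuming 10 bits in the base word plus whole 40-bit
         trailing words for the rest of the constant */
      static int const_bits(int ext_words) { return 10 + 40 * ext_words; }
      /* 0..3 extension words -> 10, 50, 90, 130 bits */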

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 00:14:08 2025
    From Newsgroup: comp.arch

    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name. Do you have 5 or 6
    bit register numbers in the instructions? Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste. If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method better
    than automatically using r(x+1)?



    GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 04:29:15 2025
    From Newsgroup: comp.arch

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, with the drawback of being non-power-of-2.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.


    Well, that and when co-existing with RV64G, it gives somewhere to put
    the FPRs. But, in turn this was initially motivated by me failing to
    figure out how to get GCC configured to target Zfinx/Zdinx.


    Had ended up going with the Even/Odd pairing scheme as it is less wonky
    IMO to deal with R5:R4 than R36:R4.


    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.


    BT/BF works well. I otherwise also ended up using RISC-V style branches,
    which I originally disliked due to higher implementation cost, but they
    do technically allow for higher performance than just BT/BF or Branch-Compare-with-Zero in 2-R cases.

    So, it becomes harder to complain about a feature that does technically
    help with performance.


    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.


    Hmm...

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    There are some unresolved bugs, but I haven't been able to fully hunt
    them down. A lot was in relation to RISC-V's C extension, but at least
    it seems like at this point the C extension is likely fully working.

    There haven't been many features left that can usefully increase general-case performance. So, it is starting to seem like XG2 and XG3 may be fairly
    stable at this point.

    The longer term future is uncertain.


    My ISAs can beat RISC-V in terms of code-density and performance, but
    when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.



    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    Had noted in Cinebench that my main PC is actually performing a little
    slower than is typical for the 2700X, but then again, it is effectively
    a 2700X running with DDR4-2133 rather than DDR4-2933. Partly this
    was a case of the RAM I have being unstable if run that fast (and in
    this case, more RAM but slightly slower seemed preferable to less RAM
    but slightly faster, or running it slightly faster but having the
    computer be crash-prone).

    They sold the RAM with its on-the-box speed being the XMP2 settings
    rather than the baseline settings, but the RAM in question didn't run
    reliably at the XMP or XMP2 settings (and I wasn't inclined to spend more;
    more so when there was already the annoyance that my MOBO chipset
    apparently doesn't deal with a full 128GB, but can tolerate 112GB, which is
    maybe not an ideal setup for perf).

    So, yeah, it seems that I have a setup where the 2700X is getting worse single-threaded performance than the i7 8650U in the laptop.

    Apparently, going by Cinebench scores, my PC's single threaded
    performance is mostly hanging out with a bunch of Xeons (getting a score
    in R23 of around 700 vs 950).

    Well, could be addressed, in theory, but would need some RAM that
    actually runs reliably at 2933 or 3200 MT/s and is also cheap...


    In both cases, they are CPUs originally released in 2018.

    Had noted, in a few tests:
    LZ4 benchmark (same file):
    Main PC: 3.3 GB/s
    Laptop: 3.9 GB/s
    memcpy (single threaded):
    Main PC: 3.8 GB/s
    Laptop : 5.6 GB/s
    memcpy (all threads):
    Main PC: ~ 15 GB/s
    Laptop : ~ 24 GB/s
    ( Like, what; thing only has 1 stick of RAM... *1 )

    *1: Also, how is a laptop with 1 stick of RAM matching a dual-socket
    Xeon E5410 with like 8 sticks of RAM...

    or, maybe it was just weak that my main PC was failing to beat the Xeon
    at this?... My main PC does at least beat the Xeon at single-threaded performance (was less true of my older Piledriver based PC).
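
    FWIW, the memcpy numbers above correspond to the sort of trivial loop
    sketched below; buffer size and repetition count here are arbitrary,
    and this is not the exact benchmark code:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      int main(void)
      {
          size_t sz = (size_t)64 << 20;          /* 64MB buffers */
          int reps = 32, i;
          char *src = malloc(sz), *dst = malloc(sz);
          if (!src || !dst) return 1;
          memset(src, 1, sz);
          clock_t t0 = clock();
          for (i = 0; i < reps; i++)
              memcpy(dst, src, sz);
          double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
          printf("%.2f GB/s\n", (double)sz * reps / secs / 1e9);
          free(src); free(dst);
          return 0;
      }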


    Granted, then again, I am using (almost) the cheapest MOBO I could find
    at the time (that had an OK number of RAM slots and SATA connectors).
    Can't quite identify the MOBO or chipset as I lost the box (and not
    clearly labeled on the MOBO itself); except that it is a
    something-or-another ASUS board.

    Like, at the time, IIRC:
    Went on Newegg;
    Pick mostly the cheapest parts on the site;
    Say, a Zen+ CPU being a lot cheaper than Zen 2,
    or pretty much anything from Intel.
    ...


    Did get a slightly fancy/beefy case, but partly this was because I was
    annoyed with the late-90s-era beige tower case I had been using, into
    which I had ended up hot-gluing a bunch of extra PC fans in an
    attempt to keep airflow good enough that it didn't melt. And
    under-clocking the CPU so that it could run reliably.

    Like, 4GHz Piledriver ran too hot and was unreliable, but was far more
    stable at 3.4 GHz. Was technically faster than a Phenom II underclocked
    to 2.8 GHz (for similar reasons).

    Where, at least the Zen+ doesn't overheat at stock settings (but, they
    also supplied the thing with a comparably much bigger stock CPU cooler).

    The case I got is slightly more traditional, with 5.25" bays and similar
    and mostly sheet-steel construction, Vs the "new" trend of mostly glass-covered-box PC cases. Sadly, it seems like companies have mostly
    stopped selling the traditional sheet-steel PC cases with open 5.25"
    bays. Like, where exactly is someone supposed to put their DVD-RW drive,
    or hot-swap HDD trays ?...

    Well, in the past we also had floppy drives, but MOBOs removed the connectors, forcing one to go the USB route if they want a floppy
    drive (now mostly moot, as relatively few other computers still have floppy drives either).




    Well, in theory could build a PC with newer components and a bigger
    budget for parts. Still wouldn't want to go over to Win11, so now it is a
    choice between jumping to Linux or "Windows Server" or similar (like, at
    least they didn't pollute Windows Server with a bunch of random
    pointless crap).

    For now, the inertia option is to just keep using Win10.


    As for the laptop, had noted:
    Can run Minecraft:
    Yes; though best results at an 8-chunk draw distance.
    Much more than this, and the "Intel UHD" graphics struggle.
    At 12 chunks, there is obvious chug.
    At 16 chunks, it starts dropping into single digit territory.
    Can run Doom3:
    Yes: Mostly gets 40-50 fps in Doom 3.

    My main PC can manage a 16-chunk draw distance in Minecraft and mostly
    gets a constant 63 fps in Doom3.

    Don't have many other newer games to test, as I mostly lost interest in
    modern "AAA" games. And, stuff like Doom+RTX, I already know this won't
    work. I can mostly just be happy that Minecraft works and is playable
    (and that its GPU is solidly faster than just using a software renderer...).


    On both fronts, this is a significant improvement over the older laptop.
    For the price, I sort of worried that it would be dead slow, but it significantly outperforms its Vista-era predecessor.

    This is mostly because I had noticed that, right now (unlike a few years
    ago), there are actually OK laptops at cheap prices (along with all the
    $80 Dell OptiPlex computers and similar on Amazon...).



    Otherwise, went and recently wrote up a spec partly based on a BASIC
    dialect I had used in one of my 3D engines, with some design cleanup: https://pastebin.com/2pEE7VE8

    Where I was able to get a usable implementation for something similar in
    a little over 1000 lines of C.

    Though, this was for an Unstructured BASIC dialect.


    Decided then to try something a little harder:
    Doing a small JavaScript like language, and trying to keep the
    interpreter small.

    I don't yet have the full language implemented, but for a partial JS
    like language, I currently have something in around 2500 lines of C.

    I had initially set a target estimate of 4-8 kLOC.
    Unless the remaining functionality ends up eating a lot of code, I am on target towards hitting the lower end of this range (need to get most of
    the rest of the core-language implemented within around 1.5 kLOC or so).

    Note: No 3rd party libraries allowed, only the normal C runtime library.
    Did end up using a few C99 features, but mostly still C95.


    For now, I was calling the language BS3L, where:
    Dynamically typed;
    Supports: Integers, Floating-Point, Strings, Objects, Arrays, ...
    JS style syntax;
    Nothing too exciting here.
    Still has JS style arrays and objects;
    Dynamically scoped (see the sketch after this list).
    Where, dynamic scoping needs less code than lexical scoping;
    But, dynamic scoping is also a potential foot-gun.
    Not sure if too much of a foot-gun.
    Vs going to C-style scoping;
    Or, biting the bullet and properly implementing lexical scoping.
    Leaving out most advanced features.
    will be fairly minimal even vs early versions of JS.
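
    As a toy sketch of the dynamic-scoping tradeoff (one global binding
    stack, lookup walks from the newest binding; names here are invented,
    not from the actual BS3L code):

      #include <stdio.h>
      #include <string.h>

      typedef struct { const char *name; double val; } Binding;
      static Binding bind_stack[256];
      static int bind_top;

      static void push_var(const char *name, double val)
      {
          bind_stack[bind_top].name = name;
          bind_stack[bind_top].val = val;
          bind_top++;
      }

      static double get_var(const char *name)
      {
          int i;
          for (i = bind_top - 1; i >= 0; i--)
              if (!strcmp(bind_stack[i].name, name))
                  return bind_stack[i].val;
          return 0;                      /* undefined: just give 0 */
      }

      int main(void)
      {
          push_var("x", 1);              /* caller binds x */
          int mark = bind_top;
          push_var("x", 2);              /* callee rebinds x; the foot-gun is that
                                            anything called from here now sees 2 */
          printf("%g\n", get_var("x"));  /* 2 */
          bind_top = mark;               /* "return": pop the callee's bindings */
          printf("%g\n", get_var("x"));  /* 1 */
          return 0;
      }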

    But, in some cases, was borrowing some design ideas from the BASIC interpreter. There were some unavoidable costs, such as in this case
    needing a full parser (that builds an AST) and an AST-walking
    interpreter. Unlike BASIC, it wouldn't be possible to implement an
    interpreter by directly walking and pattern matching lists of tokens.
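
    (As a toy sketch of the contrast, the BASIC style was roughly: look at
    the first token of the line and pattern-match the rest directly, no AST.
    Names and the statement set here are made up for illustration:)

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      typedef struct Token { const char *text; struct Token *next; } Token;

      /* executes one line; only "PRINT <number>" and "REM ..." are handled */
      static void exec_line(Token *t)
      {
          if (t && !strcmp(t->text, "PRINT") && t->next)
              printf("%g\n", atof(t->next->text));
          else if (t && !strcmp(t->text, "REM"))
              ;                                  /* comment: ignore the line */
      }

      int main(void)
      {
          Token num = { "42", NULL }, kw = { "PRINT", &num };
          exec_line(&kw);                        /* prints 42 */
          return 0;
      }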

    And, a parser that builds an AST, and code to walk said AST, necessarily
    needs more code.
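
    (Roughly, the extra machinery looks like this: a toy tagged AST node and
    a recursive evaluator, with names and layout invented rather than taken
    from the actual interpreter:)

      #include <stdio.h>

      typedef struct Ast Ast;
      struct Ast { int op; double num; Ast *lhs, *rhs; };  /* op: '#', '+', '*' */

      static double eval(Ast *n)
      {
          switch (n->op) {
          case '#': return n->num;                         /* numeric literal */
          case '+': return eval(n->lhs) + eval(n->rhs);
          case '*': return eval(n->lhs) * eval(n->rhs);
          default:  return 0;
          }
      }

      int main(void)
      {
          Ast two = { '#', 2 }, three = { '#', 3 }, four = { '#', 4 };
          Ast mul = { '*', 0, &three, &four };
          Ast add = { '+', 0, &two, &mul };
          printf("%g\n", eval(&add));                      /* 2 + 3*4 = 14 */
          return 0;
      }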

    I guess it is a question of whether someone else could manage to implement a JavaScript style language in under 1000 lines of C while also writing "relatively normal" C (no huge blocks of obfuscated code or rampant abuse of
    the preprocessor). Or, basically, where one has to stick to similar C
    coding conventions to those used in Doom and Quake.


    I am not sure if this would be possible. Both the dynamic type-system
    and parser have eaten up a fair chunk of the code budget. A sub 1000
    line parser is also a little novel; but the parser itself got a little
    wonky and doesn't fully abstract over what it parses (as there is still
    a fair bit of bleed-over from the token stream). And, it sorta ended up abusing binary operators a little.

    For example, it has wonk like dealing with lists of statements as-if
    there were a right-associative semicolon operator (allowing it to be
    walked like a linked list).
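
    (Sketch of what is meant, with an invented node layout: "a; b; c" parses
    as (a ; (b ; c)), so the walker just loops down the right spine:)

      #include <stdio.h>

      typedef struct Node Node;
      struct Node { int op; int id; Node *lhs, *rhs; };

      static void exec_stmt(Node *n) { printf("stmt %d\n", n->id); }

      static void exec_block(Node *n)
      {
          while (n && n->op == ';') {
              exec_stmt(n->lhs);         /* statement on the left */
              n = n->rhs;                /* rest of the "list" */
          }
          if (n) exec_stmt(n);           /* trailing statement, if any */
      }

      int main(void)
      {
          Node a = { 0, 1 }, b = { 0, 2 }, c = { 0, 3 };
          Node s2 = { ';', 0, &b, &c }, s1 = { ';', 0, &a, &s2 };
          exec_block(&s1);               /* stmt 1, stmt 2, stmt 3 */
          return 0;
      }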

    There is slightly wonky operator tokenization again to save code:
    Separately matching every possible operator pattern is a bunch of extra
    logic. Was using rules that mostly give the correct operators, but with
    the possibility of non-sense operators. Also the precedence levels don't
    match up exactly, but this is a lower priority issue.
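
    (A sketch of the general idea, not the actual tokenizer: treat any run of
    operator characters as a single token, which avoids matching each operator
    pattern separately but will happily emit nonsense like "=+*":)

      #include <stdio.h>
      #include <string.h>

      static int is_op_char(char c)
          { return c && strchr("+-*/%<>=!&|^~?", c) != NULL; }

      /* copies a maximal run of operator chars into out, returns chars consumed */
      static int scan_op(const char *s, char *out, int outsz)
      {
          int n = 0;
          while (is_op_char(s[n]) && n < outsz - 1) { out[n] = s[n]; n++; }
          out[n] = 0;
          return n;
      }

      int main(void)
      {
          char tok[8];
          scan_op("<<=rest", tok, sizeof tok); printf("%s\n", tok);  /* "<<="      */
          scan_op("=+*x",    tok, sizeof tok); printf("%s\n", tok);  /* "=+*" junk */
          return 0;
      }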


    I guess, if someone thinks they can do so in significantly less code,
    they can try.

    Note that while a language like Lua sort of resembles an intermediate
    between BASIC and JavaScript, I wouldn't expect Lua to save that much
    here (it would still have the cost of needing to build an AST and similar).

    Going from an AST to a bytecode or 3AC IR would allow for higher
    performance.

    But, I decided to go for an AST walking interpreter in this case as it
    would be the least LOC.


    Actually takes more effort trying to keep the code small. Rather than
    just copy-pasting stuff a bunch of times, one spends more time needing
    to try to factor out and reuse common patterns.


    Though, in a way, some of this is revisiting stuff I did 20+ years ago,
    but from a different perspective.

    Like, 20+ years ago, my first interpreters also used AST walkers.

    As for where I will go with this, I don't know.
    Some of it could make sense as a starting point for a GLSL compiler;
    Or maybe adapted into parsing the SCAD language;

    Or, as a cheaper alternative to what my first script VM became.
    By the end of its span, it had become quite massive...
    Though, still not too bad if compared with SpiderMonkey or V8.

    Ironically, my jump to a Java + JVM/.NET like design was actually to
    make it simpler.

    For a simple but slow language, JS works, but if you want it fast it
    quickly turns worse (and simpler to jump to a more traditional
    statically typed language). Like, there was this thing, known as "Hindley-Milner Type Inference", which on one hand, could be used to
    make a JavaScript style language fast (by turning it transparently into
    a statically-typed language), but also, was a huge PITA to deal with
    (this was combined in my VM with optional explicit type declarations;
    with a syntax inspired by ActionScript).


    Well, and when something gets big and complicated enough that one almost
    may as well just use SpiderMonkey or similar to run their JS code, this
    is a problem...

    Still less bad than LLVM, not sure why anyone would willingly submit to
    this.


    Well, there is still a surviving descendant of the original VM (although branching off from an earlier form) in the form of BGBCC.

    Though, makes more sense to do a clean interpreter in this case, than to
    try to build one by copy-pasting the parser from BGBCC or my old VM and
    trying to build a new lighter-weight VM.

    In some of these cases, it is easier to scale up than to scale back down.
    Easier to take simpler code and add features or improve performance,
    than to take more complex code and try to trim it down.


    And, sometimes it does make more sense to just write something starting
    from a clean slate.

    Well, except for my attempt at a clean-slate C compiler, but this was
    more a case of realizing I wouldn't undershoot BGBCC by enough to be worthwhile, and there were some new problem points that were emerging in
    the design. Partly as I was trying to follow a model more like that used
    by GCC and binutils, which I was then left to suspect is not the right approach (and in some ways, the approach I had used in BGBCC seemed to
    make more sense than trying to imitate how GCC does things).

    Might still make sense at some point to try for another clean-slate C
    compiler, though if I would still end up taking a similar general
    approach to BGBCC (or .NET), there isn't a huge incentive (vs continuing
    to use BGBCC).

    Where, say, the main things that would ideally need improvement are
    BGBCC's compile speed and memory footprint. As-is, compiling with BGBCC is about as slow as compiling with GCC, which isn't great.

    Comparably, MSVC is typically a bit faster at compiling stuff IME.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:41:46 2025
    From Newsgroup: comp.arch

    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.  If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be
    passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:50:35 2025
    From Newsgroup: comp.arch

    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.  If you have six bits, you can use
    all 64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app
    mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the
    design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 17:44:14 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.

    These days, that's not so clear. E.g., Zen4 has 192 physical 512-bit
    SIMD registers, despite having only 256-bit wide FUs. The way I
    understand it, a 512-bit operation comes as one uop to the FU,
    occupies it for two cycles (and of course the result latency is
    extra), and then has a 512-bit result.

    The alternative would be to do as AMD did in some earlier cores,
    starting with (I think) K8: have registers that are half as wide and
    split each 512-bit operation into 2 256-bit uops that go through the
    OoO engine individually. This approach would allow more physical
    256-bit registers, and waste less on 32-bit, 64-bit, 128-bit and
    256-bit operations, but would cost more decoding bandwidth,
    renaming bandwidth, renaming checkpoint size (a little), and scheduler
    space than the approach AMD has taken. Apparently the cost of this
    approach is higher than the benefit.

    Doubling the logical register size doubles the renamer checkpoint
    size, no? This way of avoiding "waste" looks quite a bit more
    expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 13:04:42 2025
    From Newsgroup: comp.arch

    On 10/29/2025 7:50 AM, Robert Finch wrote:
    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or
    6 bit register numbers in the instructions.  Five allows you to use
    the high registers for 128 bit operations without needing another
    register specifier, but then the high registers can only be used for
    128 bit operations, which seems a waste.  If you have six bits, you
    can use all 64 registers for any operation, but how is the "upper"
    method that better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would
    be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But
    since it should be using mostly compiled code, it does not make much
    difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    I am not as sure about this approach...

    Well, Low 32=GPR, High 32=FPR, makes sense, I did this.

    But, pairing a GPR and FPR for the 128-bit cases seems wonky; or
    subsetting registers on context switch seems like it could turn into a problem.


    Or, if a goal is to allow for encodings with a 5-bit register field,
    would make sense to use 32-bit encodings.

    Where, granted, 6b register fields in a 32-bit instruction does have the drawback of limiting how much encoding space exists for opcode and
    immediate (and one has to be more careful not to "waste" the encoding
    space as badly as RISC-V had done).

    Though, can note that both:
    R6+R6+Imm10
    R5+R5+Imm12
    Use the same amount of encoding space.
    But, R6+R6+R6 uses 3 bits more than R5+R5+R5.


    Though, one could debate my case, as I did effectively end up burning
    1/4 of the total encoding space mostly on Jumbo prefixes.

    ...



    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a
    branch on bit-set/clear for conditional branches. Might also include
    branch true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.


    FWIW: I have gotten by OK with 128 internal registers:
    00..3F: Array-Mapped Registers (mostly the GPRs)
    40..7F: CRs and SPRs

    Mostly sufficient.

    For the array-mapped registers, these ones use LUTRAM, with a logical
    copy of the array per write port, and some control bits to encode which
    array currently holds the up-to-date copy of the register.

    All this gets internally replicated for each read port.

    So, roughly 18 internal copies of all of the registers with 6R3W, but
    this is unavoidable (since LUTRAMs are 1R1W).


    The other option is using flip-flops, which is the strategy mostly used
    for the writable CRs and SPRs. This is done sparingly as the resource
    cost is higher in this case (at least on xilinx, *).

    *: Things went amiss when I tried to build on Altera, and I needed
    to use FFs for all the GPRs as well, as these FPGAs lack a direct
    equivalent of LUTRAMs and instead have smaller Block RAMs. The
    Lattice FPGAs also lack LUTRAM IIRC (but, my core doesn't map as well to Lattice FPGAs either).


    As for the CR/SPR space:
    Some of it is used for writable registers;
    A big chunk is used for internal read-only registers.
    ZZR, IMM, IMMB, JIMM, etc.
    ZZR: Zero Register / Null Register (Write)
    IMM: Immediate for current lane (33-bit, sign-ext).
    IMMB: Immediate from Lane 3.
    JIMM: 64-bit immediate spanning Lanes A and B.
    ...

    Could also be seen as C0..C63 (or, all control registers) except that
    much of C32..C63 is used for internal read-only SPRs, and a few other
    SPRs (DLR, DHR, and SP).

    Originally, the CRs and SPRs were handled as separate, but now things
    have gotten fuzzy (and, for RISC-V, some of the CRs need to be accessed
    in GPR like ways).

    There is some wonk as they were handled as separate modules, but with
    the current way things are done it would almost make more sense to fold
    all of the CRs into the GPR file module.

    The module might also continue to deal with forwarding, but might also
    make sense to have a RegisterFile module, possibly with a disjoint
    "Register Forwarding And Interlocks" style module (which forwards
    registers if the value is available and signals pipeline stalls as
    needed; this logic currently partly handled by the existing
    register-file module).



    Did experiment with a mechanism to allow bank-swapped registers. This
    would have added an internal 2-bit mode for the registers, and would
    stall the pipeline to swap the current registers with their bank-swapped versions if needed (with the registers internally backed to Block-RAM).
    Ended up mostly not using this though (at best, it wouldn't gain much
    over the existing "Load and Store everything to RAM" strategy; and would
    make context switching slower than it is already).

    It is more likely that a practical mechanism for fast bank swapping
    would need a mechanism to bank-swap the registers to external RAM. Or
    maybe a special "Stall and dump all the registers to this RAM Address" instruction.


    For the RISC-V CSRs:
    Part of the space maps to the CRs, and part maps to CPUID;
    For pretty much everything else, it traps.
    So, pretty much all of the normal RISC-V CSRs will trap.

    Ended up trapping for the RISC-V FPU CSRs as well:
    Rarely accessed;
    Rather than just one CSR for the FPU status, they broke it up into
    multiple sub-registers for parts of the register (like, there is a
    special CSR just for the rounding-mode, ...).

    Also the hardware only supports moving to/from a CR, so any more complex scenarios will also trap. They had gotten a little too fancy with this
    stuff IMO.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 29 18:15:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment so that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Having 64 registers and 64 bit registers makes life easier for that
    particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 11:29:54 2025
    From Newsgroup: comp.arch

    On 10/29/2025 10:44 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions. But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that. e.g.

    Add A1,A2,A3 would be a 64 bit add on those registers but
    Add128 A1,A2,A3 would be a 128 bit add using A1H for the high order
    bits of the destination, etc. So the question becomes how is using
    Rn+32 better than using Rn+1?

    That being said, your points are well taken for a different implementation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:33:46 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly
    for 64-bit and smaller.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.
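
    In C terms the old sequence corresponds roughly to the function below,
    which rounds once at the double multiply and again at the narrowing
    conversion; the fused FMULf rounds only once, from the full product
    (which plain portable C has no direct way to express):

      float scale_twice_rounded(float f)
      {
          return (float)((double)f * 1.425);   /* CVTfd ; FMUL ; CVTdf */
      }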
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:47:09 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% of RISC-V's instruction count.

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and/or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.

    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 14:02:32 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:15 PM, Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.


    Yeah, and from the compiler POV, would likely prefer having Even+Odd pairs.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?


    Agreed.

    From what I have seen, the vast bulk of constants tend to come in
    several major clusters:
    0 to 511: The bulk of all constants (peaks near 0, geometric fall-off)
    -64 to -1: Much of what falls outside 0 to 511.
    -32768 to 65535: Second major group
    -2G to +4G: Third group (smaller than second)
    64-bit: Another smaller spike.

    For values between 512 and 16384: Sparsely populated.
    Mostly the continued geometric fall-off from the near-0 peak.
    Likewise for values between 65536 and 1G.
    Values between 4G and 4E tend to be mostly unused.

    Like, in the sense of, if you have 33-bit vs 52 or 56-bit for a
    constant, the larger constants would have very little advantage (in
    terms of statistical hit rate) over the 33 bit constant (and, it isn't
    until you reach 64 bits that it suddenly becomes worthwhile again).
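
    (In code-generator terms, the bucketing sketched above amounts to picking
    the smallest tier that covers a constant; the function name and exact
    cutoffs here are just for illustration of the 17/33/64 split:)

      #include <stdio.h>
      #include <stdint.h>

      static int imm_tier(int64_t v)
      {
          if (v >= -32768 && v <= 65535)
              return 17;    /* fits sign- or zero-extended 16 bits */
          if (v >= INT32_MIN && v <= (int64_t)UINT32_MAX)
              return 33;    /* fits sign- or zero-extended 32 bits (-2G..+4G group) */
          return 64;        /* everything else */
      }

      int main(void)
      {
          printf("%d %d %d\n",
                 imm_tier(300), imm_tier(100000), imm_tier(1LL << 40));  /* 17 33 64 */
          return 0;
      }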


    Partly why I go with 33 bit immediate fields in the pipeline in my core,
    but nothing much bigger or smaller:
    Slightly smaller misses out on a lot, so almost may as well drop back to
    17 in this case;
    Going slightly bigger would gain pretty much nothing.

    Like, in the latter case, does sort of almost turn into a "go all the
    way to 64 bits or don't bother" thing.


    That said, I do use a 48-bit address space, so while in concept 48-bits
    could be useful for pointers: This is statistically insignificant in an
    ISA which doesn't encode absolute addresses in instructions.

    So, ironically, there are a lot of 48-bit values around, just pretty
    much none of them being encoded via instructions.


    Kind of a similar situation to function argument counts:
    8 arguments: Most of the functions;
    12: Vast majority of them;
    16: Often only a few stragglers remain.

    So, 16 gets like 99.95% of the functions, but maybe there are a few
    isolated ones taking 20+ arguments lurking somewhere in the code. One
    would then need to go up to 32 arguments to have reasonable confidence
    of "100%" coverage.

    Or, impose an arbitrary limit, where the stragglers would need to be
    modified to pass arguments using a struct or something.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 13:05:08 2025
    From Newsgroup: comp.arch

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 15:58:40 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:47 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.


    OK.

    In my case, a lot of the 128-bit operations are a single 32-bit
    instruction, which splits (in decode) to spanning multiple lanes (using
    the 6R3w register file as a virtual 3R1W 128-bit register file).

    In some cases, pairs of 64-bit SIMD instructions may be merged to send
    both through the SIMD unit at the same time. Say, as a special-case
    co-issue for 2x Binary32 ops (which can basically be handled the same as
    the 4x Binary32 scenario by the SIMD unit).

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.


    The reason 17 and 33 ended up slightly preferable is that both
    zero-extended and sign-extended 16 and 32 bit values are fairly common.

    And, if one has both a zero and sign extended immediate, this eats the
    same encoding space as having a 17-bit immediate, or a separate
    zero-extended and one-extended variant.

    There are a few 5/6 bit immediate instructions, but I didn't really
    count them.

    XG3's equivalent of SLTI and similar only has Imm6 encodings (can be
    extended to 33 bits with a jumbo prefix).



    There isn't much need for a direct 128-bit immediate though:
    This case is exceedingly rare;
    Register-pairs basically make it a non-issue;
    Even if it were supported:
    This would still require a 24-byte encoding...
    Which, doesn't save anything over 2x 12-bytes.
    And doesn't gain much, apart from making CPU more expensive.

    Someone could maybe do 20 bytes by using a 128-bit memory load, but with
    the usual drawbacks of using a memory load (BGBCC doesn't usually do
    this). The memory load will have a higher latency than a pair of
    immediate instructions.





    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% RISC-Vs instruction count.


    Basically similar.

    XG3 also uses only 70% as many instructions as RV64G.

    But, if you throw Indexed Load/Store, Load/Store Pair, Jumbo Prefixes,
    etc, at the problem (on top of RISC-V), suddenly RISC-V becomes a lot
    more competitive (30% smaller and 50% faster).

    Not found a good way to improve much over this though...


    But, yeah, if comparing against RV64G as it exists in its standard form,
    there is a bit of room for improvement.



    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.


    Granted.

    The main thing it benefits is things like TKRA-GL, ...

    Doom basically sees no real difference between 32 and 64 GPRs (nor does SW-Quake).


    Mostly matters for code where one has functions with around 100+ local variables... which are uncommon much outside of TKRA-GL or similar.


    As-is, SW-Quake is one of the cases that does well with RISC-V, though GL-Quake performs like hot dog-crap; mostly as TKRA-GL gets wrecked if
    it is limited to 32 registers and doesn't have SIMD.


    Only real saving point is when running with TKRA-GL over system calls in
    which case it runs in the kernel (as XG1) which is slightly less bad.
    For reasons, TestKern kinda still needs to be built as XG1.


    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
    8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
    Rarely reaches turbo
    pretty much only happens if just running a single thread...
    With all cores running stuff in the background:
    Idles around 3.6 to 3.8.

    Laptop:
    4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
    If power set to performance, reaches turbo a lot more easily,
    and with multi-core workloads.
    But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    For $240 I was paranoid it still might not have been fast enough to run Minecraft...


    Still annoyed as the RAM claimed like DDR4-3200 on the box, but doesn't
    run reliably at more than DDR4-2133... Like, you can try 3200 if you
    don't mind the computer blue-screening after a few minutes I guess...



    But, without much RAM, nor enough SSD space to set up a huge pagefile,
    not going to try compiling LLVM on the thing.

    Even with all the RAM, a full rebuild of LLVM still takes several hours
    on my main PC (though, trying to build LLVM or GCC is at least slightly
    faster if one tells the AV software to stop grinding the CPU by looking
    at every file accessed).


    Vs the $80 OptiPlex that came with a 2C/4T Core i3 variant, which wasn't particularly snappy (seemed on-par with the Vista era laptop; though
    this has a 2C/2T CPU).

    Basically, was a small PC that was using mostly laptop-style parts
    internally (laptop DVD-RW drive and laptop style HDD); some sort of ITX
    MOBO layout I think.

    I don't remember there being any card slots; so like if you want to
    install a PCIe card or similar, basically SOL.

    But, it was either this or an off-brand NUC clone...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 21:52:54 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    It is a 64-bit machine that provides a small modicum of support for
    larger sizes. It is not and never will be a 128-bit machine--that is
    what vVM is for.

    Key words "small modicum"

    DBLE simply supplies registers to the pipeline and width to decode.

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.

    DBLE supports 128-bits in the ISA at the total cost of 1 instruction
    added per use. In many situations (especially integer) CARRY is the
    better option because it throws a shadow of width over a number of
    instructions and thereby has lower code foot print costs. So, a 256
    bit shift is only 5 instructions instead of 8. And realistically, if
    you want wider than that, you have already run out of registers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:01:17 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted, though, instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned, OR routines could be required to be 32-bit aligned, for instance.
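
    (In software terms, the fetch amounts to roughly the following byte-offset
    extraction; just a sketch of the addressing, assuming a little-endian code
    image, not of the actual align stage:)

      #include <stdint.h>
      #include <string.h>

      /* fetch the 40-bit instruction starting at byte offset 'pc' */
      static uint64_t fetch40(const uint8_t *code, size_t pc)
      {
          uint64_t w = 0;
          memcpy(&w, code + pc, 5);              /* 5 bytes = 40 bits */
          return w & ((1ULL << 40) - 1);
      }
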
    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops, except possibly atomic memory ops.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.
    So, now there are 16, 56, 96 and 136-bit constants possible. The
    56-bit constant likely has enough range for most 64-bit ops. Otherwise,
    using a 96-bit constant for 64-bit ops would leave the upper 32 bits of
    the constant unused. 136-bit constants may not be implemented, but a
    size code is reserved for that size.
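    A 56-bit immediate covers any signed value in roughly +/- 2^55, which is why
    it reaches most 64-bit operands. A minimal C sketch of the widening step
    (assuming the usual two's-complement arithmetic right shift):

      #include <stdint.h>

      /* Sign-extend a 56-bit immediate to 64 bits: push the field to the top
         of the register, then arithmetic-shift it back down. */
      static int64_t sext56(uint64_t imm56)
      {
          return (int64_t)(imm56 << 8) >> 8;
      }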


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:20:51 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the
    branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.
    Are there other instruction sequences where double-rounding would be good
    to avoid? Seems like HW could detect the sequence and fuse the instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:26:05 2025
    From Newsgroup: comp.arch

    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
      8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
        Rarely reaches turbo
          pretty much only happens if just running a single thread...
        With all cores running stuff in the background:
          Idles around 3.6 to 3.8.

    Laptop:
      4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
        If power set to performance, reaches turbo a lot more easily,
          and with multi-core workloads.
        But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores),
    32 GB RAM, 16GB graphics RAM, 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed; my last machine only had 16GB and I found it using about 20GB.
    I did not want to spring for a machine with even more RAM, as those tended
    to be high-end machines.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 22:31:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 18:48:56 2025
    From Newsgroup: comp.arch

    On 10/29/2025 5:26 PM, Robert Finch wrote:
    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
       8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
         Rarely reaches turbo
           pretty much only happens if just running a single thread...
         With all cores running stuff in the background:
           Idles around 3.6 to 3.8.

    Laptop:
       4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
         If power set to performance, reaches turbo a lot more easily,
           and with multi-core workloads.
         But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.


    IIRC, current PC was something like:
    CPU: $80 (Zen+; Zen 2 and 3 were around, but more expensive)
    MOBO: $60
    Case: $50
    ...

    Spent around $200 for 128GB of RAM.
    Could have gotten a cheaper 64GB kit had I known my MOBO would not
    accept a full 128GB (then could have had 96 GB).


    The RTX card I have (RTX 3060) has 12 GB of VRAM.

    IIRC, it was also about the cheapest semi-modern graphics card I could
    find at the time. Like, while I could have bought an RTX 4090 or similar
    at the time, I am not made of money.

    Like, a prior-generation mid-range card being the cheaper option.
    And, still newer than the GTX980 that had died on me (where the GTX980
    was itself second-hand).


    Before this, had been running a GTX 460, and before that, a Radeon HD
    4850 (IIRC).

    I think it was a case of:
    Had a Phenom II box, with the HD 4850;
    Switched to GTX 460, as I got one second-hand for free, slightly better;
    Replaced Phenom II board+CPU with FX-8350;
    Got GTX 980 (also second hand);
    Got Ryzen 7 2700X and new MOBO;
    Got RTX 3060 (as the 980 was failing).

    With the RTX 3060, had to go single-monitor, mostly as it only has
    DisplayPort outputs, and DP->HDMI->DVI via adapters doesn't seem to work (whereas HDMI->DVI did work via adapters).

    Well, also the RTX 3060 doesn't have a VGA output either (monitor would
    also accept VGA).

    Though, the current monitor I am using is newer and does support
    DisplayPort.


    I also managed to get a MultiSync CRT a while ago, but it only really
    gives good results at 640x480 and 800x600, 1024x768 sorta-works (but
    1280x1024 does not work), has a roughly 16" CRT or so; VGA input.

    I also have an LCD that goes up to 1280x1024, although it looks like
    garbage if set above 1024x768. Only accepts VGA.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 07:13:54 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that does not fit 40 bits.

    If you have that many bits available, do you still go for a load-store
    architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops excepting possibly atomic memory ops.>

    OK. Starting with 40 vs 32 bits, you have a factor of 1.25 disadvantage
    in code density to start with. Having memory operations could offset
    that by a certain factor; that was why I was asking.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.

    OK. That should also help for offsets in load/store.

    So, now there are 16, 56, 96 and 136-bit constants possible. The 56-bit constant likely has enough range for most 64-bit ops.

    For addresses, it will take some time for this to overflow :-)
    For floating point constants, that will be hard.

    I have done some analysis on frequency of floating point constants
    in different programs, and what I found was that there are a few
    floating point constants that keep coming up, like a few integers
    around zero (biased towards the positive side), plus a few more
    golden oldies like 0.5, 1.5 and pi. Apart from that, I found that
    different programs have wildly different floating point constants,
    which is not surprising. (I based that analysis on the grand
    total of three packages, namely Perl, gnuplot and GSL, so coverage
    is not really extensive).

    Otherwise using
    a 96-bit constant for 64-bit ops would leave the upper 32-bit of the constant unused.

    There are also 32-bit floating point constants, and 32-bit integers
    as constants. There are also very many small integer constants, but
    of course there also could be others.

    136 bit constants may not be implemented, but a size
    code is reserved for that size.

    I'm still hoping for good 128-bit IEEE hardware float support.
    POWER has this, but stuck it on their decimal float
    arithmetic, which is not high-performance...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 30 13:53:04 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:09:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on >> bit-set/clear for conditional branches. Might also include branch true / >> false.

    I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11 more bits
    of precision and thus took a chance of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    The problem arises due to a cross product of various {machine,
    language, compiler} features not working "all ends towards the middle".

    LLVM promotes FP calculations with a constant to 64-bits whenever the
    constant cannot be represented exactly in 32-bits. {Strike one}

    C makes no <useful> statements about precision of calculation control.
    {strike two}

    HW almost never provides mixed mode calculations which provide the
    means to avoid the double rounding. {strike three}

    So, technically, My 66000 does not provide general-mixed-mode FP,
    but I wrote a special rule to allow for larger constants used with
    narrower registers to cover exactly this case. {It also saves 2 CVT instructions (latency and footprint).}

    Seems like HW could detect the sequence and fuse the instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:10:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 30 12:29:39 2025
    From Newsgroup: comp.arch

    On 10/30/2025 11:10 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.


    Only really makes sense if one assumes these resources are "borderline
    free".

    If you are also paying for logic complexity and wires/routing, then
    having bigger registers just to typically waste most of them is not ideal.


    Granted, one could argue that most of the register is wasted when, say:
    Most integer values could easily fit into 16 bits;
    We have 64-bit registers.

    But, there is enough that actually uses the 64-bits of a 64-bit register
    to make it worthwhile. Would be harder to say the same for 128-bit
    registers.

    It is common on many 32-bit machines to use register pairs for 64-bit operations.


    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.


    Also questionable to read, as someone lacking much hardware that actually
    supports 256- or 512-bit AVX at the actual HW level. And, both AVX and
    AVX-512 had not exactly had clean roll-outs.


    Checks, and ironically, my recent super-cheap laptop was the first thing
    I got that apparently has proper 256-bit AVX support (still no AVX-512 though...).


    Still some oddities though:
    RAM that appears to be faster than it should be;
    The MHz and CAS latency appear abnormally high.
    They do not match the values for DDR4-2400.
    (Nor, even DDR4 in general).
    Appears to exceed expected bandwidth on memcpy test;
    ...
    Windows 11 on an unsupported CPU model;
    More so, Windows 11 Professional, also on something cheap.
    (Listing said it would come with Win10, got Win11 instead, OK).

    So, technically seems good, but also slightly sus...


    Differs slightly from what I was expecting:
    Something kinda old and not super fast;
    Listing said Windows 10, kinda expected Windows 10;
    ...

    Like, something non-standard may have been done here.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 16:46:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not IRSC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC, starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register into the double number of 32-bit
    registers; this idea can be extended to eliminate waste by having the
    quadruple number of 16-bit registers that can be joined into 32-bit
    and 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit
    registers. We can even resurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP. In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    convenient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 or 56 bits on a byte load by zero-extending or sign-extending the byte.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 17:58:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) at which icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 30 23:39:28 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995), which makes it only ~6 months
    ahead of VIS in UltraSPARC.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:00:50 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register in the double number of 32-bit registers; this idea can be extended to eliminate waste by having the quadruple number of 16-bit registers that can be joined into 32-bit
    anbd 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit registers. We can even ressurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Consider that being able to address every 2^(3+n) field of a register
    is far from free. Take a simple add of 2 bytes::

    ADDB R8[7], R6[3], R19[4]

    One has to individually align each of the bytes, which is going to blow
    out your timing for forwarding by at least 3 gates of delay (operands)
    and 4 gates for the result (register). The only way it makes "timing"
    sense is if you restrict the patterns to::

    ADDB R8[7], R6[7], R19[7]

    Where there is no "vertical" routing in obtaining operands and delivering results. {{OR you could always just eat a latency cycle when all fields
    are not the same.}}

    I also suspect that you would find few compiler writers willing to support random fields in registers.
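    Spelled out as ordinary 64-bit operations (a sketch of the data movement
    only, not a proposed encoding; field indices 0..7 assumed), the per-field
    alignment being discussed looks like:

      #include <stdint.h>

      /* What an ADDB Rd[d], Rs1[s1], Rs2[s2] style operation has to do:
         align each source byte field, add, then deposit the result byte
         into the destination -- the "vertical" routing in question. */
      static uint64_t addb(uint64_t rd, int d, uint64_t rs1, int s1,
                           uint64_t rs2, int s2)
      {
          uint8_t  a   = (uint8_t)(rs1 >> (8 * s1));     /* align operand 1 */
          uint8_t  b   = (uint8_t)(rs2 >> (8 * s2));     /* align operand 2 */
          uint8_t  sum = (uint8_t)(a + b);
          uint64_t m   = 0xFFull << (8 * d);
          return (rd & ~m) | ((uint64_t)sum << (8 * d)); /* deposit result  */
      }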

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    There are vanishingly few useful manipulations on parts of pointers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    conventient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 of 56 bits on a byte load by zero-extending or sign-extending the byte.

    But it gains the property that the whole register contains 1 proper value
    {range-limited to the container size whence it came}. This in turn makes
    tracking values easy--in fact, placing several different-sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    If your ISA has excellent support for statically positioned bit-fields
    (or, even better, dynamically positioned bit-fields), fetching the
    fields and depositing them back into containers does not add significant
    latency {volatile notwithstanding}, while poor ISA support does add
    significant latency.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:06:35 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some >>>> alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off >>> from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch >>target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?

    Consider trying to invalidate an ICache line--this requires looking
    at 2 DCache lines to see if they, too, need invalidation.

    Consider self-modifying code: the data stream overwrites an instruction,
    then later the FETCH engine runs over the modified line, but the modified
    line covers 64 bytes of the needed 80 bytes, so you take a hit and a miss on
    a single fetch.

    It also prevents SNARFing updates to ICache instructions, unless the
    SNARFed data is entirely retained in a single ICache line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 22:19:18 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4. The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructions in
    hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 31 00:57:42 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.


    Yes, those were in EV4.

    Alpha 21064 and Alpha 21064A HRM is here: https://github.com/JonathanBelanger/DECaxp/blob/master/ExternalDocumentation

    I didn't consider these instructions as SIMD. Maybe I should have.
    Looks like these instructions are intended to accelerate string
    processing. That's unusual for the first wave of SIMD extensions.

    The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 31 14:48:41 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.
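    For context, the kind of byte-parallel test such instructions accelerate can
    be written in portable C with the well-known SWAR zero-byte idiom (this is
    the generic trick, not the actual Alpha CMPBGE/ZAP sequence):

      #include <stdint.h>

      /* Nonzero if any byte of the 64-bit word w is zero. */
      static int has_zero_byte(uint64_t w)
      {
          return ((w - 0x0101010101010101ull) & ~w
                  & 0x8080808080808080ull) != 0;
      }

    A strlen or strcmp built on this scans 8 bytes per iteration and only drops
    to a byte loop when the test fires, which is presumably where the benchmark
    advantage came from.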

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 13:21:45 2025
    From Newsgroup: comp.arch

    On 10/31/2025 9:48 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.


    Most likely Dhrystone:
    It shows disproportionate impact from the relative speed of things like "strcmp()" and integer divide.


    I had experimented with special instructions for packed search, which
    could be used to help with either string compare or implementing
    dictionary objects in my usual way.


    Though, had later fallen back to a more generic way of implementing
    "strcmp()" that could allow more fair comparison between my own ISA and RISC-V. Where, say, one instead makes the determination based on how efficiently the ISA can handle various pieces of C code (rather than the
    use of niche instructions that typically require hand-written ASM or
    similar).



    Generally, it makes more sense to use helper instructions that have a
    general impact on performance, say for example, affecting how quickly a
    new image can be drawn into VRAM.

    For example, in my GUI experiments:
    Most of the programs are redrawing the screens as, say, 320x200 RGB555.

    Well, except ROTT, which uses 384x200 8-bit, on top of a bunch of code
    to mimic planar VGA behavior. In this case, for the port it was easier
    to write wrapper code to fake the VGA weirdness than to try to rewrite
    the whole renderer to work with a normal linear framebuffer (like what
    Doom and similar had used).


    In a lot of the cases, I was using an 8-bit indexed color or color-cell
    mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
    320x200: 16 bpp
    640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
    800x600: 2 or 4 bpp color-cell
    1024x768: 1 bpp monochrome, other experiments (*1)
    Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.

    Though, thus far the 1024x768 mode is still mostly untested on real
    hardware.

    Had experimented some with special instructions to speed up the indexed
    color conversion and color-cell encoding, but had mostly gone back and
    forth between using helper instructions and normal plain C logic, undecided
    on which exact route to take.

    Had at one point had a helper instruction for the "convert 4 RGB555
    colors to 4 indexed colors using a hardware palette", but this broke
    when I later ended up modifying the system palette for better results
    (which was a critical weakness of this approach). Also the naive
    strategy of using a 32K lookup table isn't great, as this doesn't fit
    into the L1 cache.
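    For scale, a hypothetical sketch of that naive table (pal_map and convert4
    are made-up names for illustration):

      #include <stdint.h>

      /* One palette index per RGB555 value: 32768 entries = 32 KB,
         which competes with everything else for a 16-32 KB L1 D-cache. */
      static uint8_t pal_map[1 << 15];

      static void convert4(const uint16_t src[4], uint8_t dst[4])
      {
          for (int i = 0; i < 4; i++)
              dst[i] = pal_map[src[i] & 0x7FFF];   /* drop any high bit */
      }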


    So, for 4 bpp color cell:
    Generally, each block of 4x4 pixels is understood as 2 RGB555 endpoints,
    and 2 selector bits per pixel. Though, in VRAM, 4 of these are packed
    into a logical 8x8 pixel block; rather than a linear ordering like in
    DXT1 or similar (specifics differ, but general concept is similar to DXT1/S3TC).

    The 2bpp mode generally has 8x8 pixels encoded as 1bpp in raster order
    (same order as a character cell, with MSB in top-left corner and LSB in lower-right corner). And, then typically 2x RGB555 over the 8x8 block.
    IIRC, had also experimented with having each 4x4 sub-block able to use a
    pair of RGB232 colors, but was harder to get good results.

    But, to help with this process, it was useful to have helper operations
    for, say (roughly sketched in C after this list):
    Map RGB555 values to a luma value;
    Select minimum and maximum RGB555 values for block;
    Map luma values to 1 or 2 bit selectors;
    ...
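    A rough C sketch of that per-block flow (hypothetical code, not the actual
    helper instructions; the luma weights and selector math are arbitrary
    illustrative choices):

      #include <stdint.h>

      struct cell4x4 { uint16_t c0, c1; uint32_t sel; };  /* 2 bits/pixel */

      static int luma555(uint16_t c)     /* rough luma from RGB555 fields */
      {
          int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
          return 2 * r + 5 * g + b;
      }

      static struct cell4x4 encode_cell(const uint16_t px[16])
      {
          struct cell4x4 cc;
          int ymin = 999, ymax = -1, imin = 0, imax = 0;

          for (int i = 0; i < 16; i++) {  /* endpoints = min/max luma pixels */
              int y = luma555(px[i]);
              if (y < ymin) { ymin = y; imin = i; }
              if (y > ymax) { ymax = y; imax = i; }
          }
          cc.c0 = px[imin];
          cc.c1 = px[imax];

          cc.sel = 0;
          for (int i = 0; i < 16; i++) {  /* map luma to a 2-bit selector */
              int y = luma555(px[i]);
              int s = (ymax > ymin) ? (4 * (y - ymin) - 1) / (ymax - ymin) : 0;
              if (s < 0) s = 0;
              if (s > 3) s = 3;
              cc.sel |= (uint32_t)s << (2 * i);
          }
          return cc;
      }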


    Internally, the GUI mode had worked by drawing everything to an RGB555
    framebuffer (~ 512K or 1MB) and then using a bitmap to track which
    blocks had been modified and need to be re-encoded and sent over to VRAM
    (partly by first flagging blocks during window redraw, then comparing with a
    previous version of the framebuffer to track which pixel-blocks actually
    differ, refining the selection of blocks that need re-encoding, and copying
    over blocks as needed to keep these buffers in sync).
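    A compact sketch of that tracking step (hypothetical names and sizes,
    assuming a 320x200 RGB555 shadow buffer and 8x8 blocks; the caller would
    clear dirty[] each frame):

      #include <stdint.h>
      #include <string.h>

      #define BLK_W 40                 /* 320/8 blocks per row  */
      #define BLK_H 25                 /* 200/8 block rows      */

      static uint8_t dirty[BLK_H][BLK_W];

      /* Mark blocks whose pixels differ from the previous frame; only these
         get re-encoded and sent over to VRAM. */
      static void mark_dirty(const uint16_t *cur, const uint16_t *prev, int pitch)
      {
          for (int by = 0; by < BLK_H; by++)
              for (int bx = 0; bx < BLK_W; bx++)
                  for (int y = 0; y < 8; y++) {
                      const uint16_t *c = cur  + (by * 8 + y) * pitch + bx * 8;
                      const uint16_t *p = prev + (by * 8 + y) * pitch + bx * 8;
                      if (memcmp(c, p, 8 * sizeof(uint16_t)) != 0) {
                          dirty[by][bx] = 1;
                          break;
                      }
                  }
      }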

    Process wasn't particularly efficient (and performance is considerably
    worse than what Win3.x or Win9x seemed to give).



    As for the packed-search instructions, there were 16-bit versions as
    well, which could be used either to help with UTF-16 operations or for dictionary objects.

    Where, a common way I implement dictionary objects is to use arrays of
    16-bit keys with 64-bit values (often tagged values or similar).
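    A minimal sketch of that layout (hypothetical names; the fixed capacity is
    arbitrary), where the inner loop is the scan a packed 16-bit search
    instruction would accelerate:

      #include <stdint.h>

      typedef struct {
          int      count;
          uint16_t key[64];   /* interned 16-bit symbol IDs (<= 65536 total) */
          uint64_t val[64];   /* 64-bit tagged values                        */
      } ObjDict;

      static int dict_lookup(const ObjDict *d, uint16_t k, uint64_t *out)
      {
          for (int i = 0; i < d->count; i++)
              if (d->key[i] == k) { *out = d->val[i]; return 1; }
          return 0;
      }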

    Though, this does put a limit on the maximum number of unique symbols
    that can be used as dictionary keys, but it is not often an issue in practice.
    Generally these are not QNames or C function names, which reduces the issue
    of running out of symbol names somewhat.

    One can also differ, though, on how much sense it makes to have
    ISA-level helpers for working with tagrefs and similar (or, getting the
    ABI involved with these matters, like defining in the ABI the encodings
    for things like fixnum/flonum/etc).

    ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 14:32:00 2025
    From Newsgroup: comp.arch

    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
       320x200: 16 bpp
       640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
       800x600: 2 or 4 bpp color-cell
      1024x768: 1 bpp monochrome, other experiments (*1)
        Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
    G R
    B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
    G R G B
    B G R G
    G R G B
    B G R G
    If >= 4 of the G bits are set, G is high.
    If >= 2 of the R bits are set, R is high.
    If >= 2 of the B bits are set, B is high.
    If > 8 bits in total are set, I is high.

    The non-set pixels are usually assumed to be either 0000 (Black) or 1000 (Dark
    Grey) depending on the I bit. Or, a low-intensity version of the main color
    if over 75% of a given bit are set in a given way (say, for mostly flat
    color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
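    A small C sketch of the recovery rule above (the IRGB bit assignment here is
    just an assumption for illustration, not the actual hardware behavior):

      #include <stdint.h>

      /* bits[y][x] is the 1bpp block; the G/R/B role of each position follows
         the 4x4 tiling shown above. Returns a 4-bit IRGB color for the block. */
      static uint8_t recover4x4(const uint8_t bits[4][4])
      {
          static const char role[4][4] = {
              {'G','R','G','B'},
              {'B','G','R','G'},
              {'G','R','G','B'},
              {'B','G','R','G'},
          };
          int g = 0, r = 0, b = 0, total = 0;

          for (int y = 0; y < 4; y++)
              for (int x = 0; x < 4; x++)
                  if (bits[y][x]) {
                      total++;
                      if      (role[y][x] == 'G') g++;
                      else if (role[y][x] == 'R') r++;
                      else                        b++;
                  }

          uint8_t irgb = 0;
          if (g >= 4)    irgb |= 0x4;   /* G high */
          if (r >= 2)    irgb |= 0x2;   /* R high */
          if (b >= 2)    irgb |= 0x1;   /* B high */
          if (total > 8) irgb |= 0x8;   /* I high */
          return irgb;
      }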


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:09:23 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)
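    A quick empirical illustration of the binary32-in-binary64 case (53 >= 2*24+2):
    a sketch, not a proof, assuming strict SSE-style float/double arithmetic with
    no x87 excess precision; the helper name is made up.

      #include <stdio.h>
      #include <stdlib.h>

      static float rand_float(void)     /* arbitrary float bit patterns */
      {
          union { unsigned u; float f; } v;
          v.u = ((unsigned)rand() << 16) ^ (unsigned)rand();
          return v.f;
      }

      int main(void)
      {
          for (long i = 0; i < 10000000; i++) {
              float a = rand_float(), b = rand_float();
              float d = a + b;                          /* one rounding     */
              float t = (float)((double)a + (double)b); /* round twice      */
              if (d != t && d == d && t == t)           /* ignore NaN cases */
                  printf("mismatch: %a + %a\n", a, b);
          }
          return 0;   /* nothing should ever print */
      }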

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that
    would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:12:45 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD
    extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.

    The original (v1?) Alpha had instructions intended to make it "easy" to process character data in 8-byte chunks inside a register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 1 18:19:48 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good >> to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 1 19:18:39 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    The 8086 is no 68000. The [BX] addressing mode makes it obvious that
    that's not the case.

    What is actually the case: AL-DL, AH-DH correspond to 8-bit registers
    of the 8080, some of AX-DX correspond to register pairs. SI, DI, BP
    are new, SP corresponds to the 8080 SP, which does not have 8-bit
    components. That's why SI, DI, BP, SP have no low or high
    sub-registers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    I think that we can learn a lot from earlier architectures, some
    things to adopt and some things to avoid. Concerning subregisters, I
    lean towards avoiding.

    That's also another reason to avoid load-and-op and RMW instructions.
    With a load/store architecture, load can sign/zero extend as
    necessary, and then most operations can be done at full width.

    But gains the property that the whole register contains 1 proper value >{range-limited to the container size whence it came} This in turn makes >tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value >analysis in the compiler.

    I don't think it's impossible or particularly hard for the compiler. Implementing it in OoO hardware causes complications, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 1 21:08:35 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 02:21:18 2025
    From Newsgroup: comp.arch

    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8,
    allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
      G R
      B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
      G R G B
      B G R G
      G R G B
      B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels are usually assumed to be either 0000 (Black) or 1000
    (Dark Grey) depending on the I bit. Or, a low-intensity version of the main
    color if over 75% of a channel's bits are set the same way (say, for mostly
    flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
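
    As a concrete illustration, here is a minimal C sketch of the threshold
    rule above (my reading of the description, not the original code; the bit
    numbering of the 16-bit block and the IRGB output nibble are assumptions,
    with bit 0 as the top-left pixel, scanning left-to-right, top-to-bottom):

    #include <stdint.h>

    static int popcnt16(uint16_t v)
    {
        int n = 0;
        while (v) { n += v & 1; v >>= 1; }
        return n;
    }

    /* Recover a 4-bit IRGB color for a 4x4 block dithered in the
       G R G B / B G R G pattern described above. */
    static uint8_t recover_rgbi_4x4(uint16_t blk)
    {
        int g = popcnt16(blk & 0xA5A5);   /* the 8 G positions */
        int r = popcnt16(blk & 0x4242);   /* the 4 R positions */
        int b = popcnt16(blk & 0x1818);   /* the 4 B positions */
        int total = popcnt16(blk);

        uint8_t out = 0;
        if (g >= 4)    out |= 0x4;        /* G high */
        if (r >= 2)    out |= 0x2;        /* R high */
        if (b >= 2)    out |= 0x1;        /* B high */
        if (total > 8) out |= 0x8;        /* I high */
        return out;
    }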


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
    Y R
    B G

    In this case tiling as:
    Y R Y R ...
    B G B G ...
    Y R Y R ...
    B G B G ...
    ...

    Where, Y is a pure luma value.
    May or may not use this, or:
    Y R B G Y R B G
    B G Y R B G Y R
    ...
    But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    This used a different (slightly more complicated) color recovery algorithm,
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma;
    ...


    Dealing with chroma does have the effect of making the dithering process
    more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The later image was mostly after I realized the issue with the
    dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).
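
    To make the dither tweak above concrete, a minimal C sketch of a 4x4
    ordered dither with the threshold matrix rotated per channel (an
    assumption about how the rotation is done; the actual encoder may
    handle the matrix differently):

    #include <stdint.h>

    /* Classic 4x4 Bayer threshold indices (0..15). */
    static const uint8_t bayer4[4][4] = {
        {  0,  8,  2, 10 },
        { 12,  4, 14,  6 },
        {  3, 11,  1,  9 },
        { 15,  7, 13,  5 },
    };

    /* Returns 1 if the bit for channel 'chan' (0..3) should be set at
       pixel (x,y), given an 8-bit channel value 'v'.  The matrix is
       rotated 90 degrees per channel so the patterns are decorrelated. */
    static int dither_bit(uint8_t v, int x, int y, int chan)
    {
        int dx = x & 3, dy = y & 3, t;
        switch (chan & 3) {
        case 0:  t = bayer4[dy][dx];          break;
        case 1:  t = bayer4[dx][3 - dy];      break;  /*  90 deg */
        case 2:  t = bayer4[3 - dy][3 - dx];  break;  /* 180 deg */
        default: t = bayer4[3 - dx][dy];      break;  /* 270 deg */
        }
        return v > t * 16;
    }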


    Image quality isn't great, but then again, not sure how to do that much
    better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible approach could be:
    Use LUT4 to map 4b -> 2b (as a count)
    Then, map 2x2b -> 3b (adder)
    Then, map 2x3b -> 4b (adder), then discard LSB.
    Then, select max or R/G/B/Y;
    This is used as an inverse normalization scale.
    Feed each value and scale through a LUT (for R/G/B)
    Getting a 5-bit scaled RGB;
    Roughly: (Val<<5)/Max
    Compose an RGB555 value (5 bits per channel) used for each pixel that is set.

    The actual pixel decoding process works the same as with 8x8 blocks of 1-bit monochrome, selecting the minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic
    complexity.
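
    A C model of the recovery pipeline sketched above (an assumption about
    intent, not the actual Verilog; the RGB555 bit layout is assumed, and
    the division would be a small LUT in hardware):

    #include <stdint.h>

    static int clamp31(int v) { return v > 31 ? 31 : v; }

    /* cnt_* are the per-channel set-bit counts over the block (0..16
       for 8x8).  The largest channel acts as the inverse normalization
       scale, so the recovered color is a direction scaled to RGB555. */
    static uint16_t recover_block_rgb555(int cnt_r, int cnt_g, int cnt_b, int cnt_y)
    {
        int max = cnt_r;
        if (cnt_g > max) max = cnt_g;
        if (cnt_b > max) max = cnt_b;
        if (cnt_y > max) max = cnt_y;
        if (max == 0)
            return 0;                         /* all-black block */

        int r = clamp31((cnt_r << 5) / max);  /* roughly (Val<<5)/Max */
        int g = clamp31((cnt_g << 5) / max);
        int b = clamp31((cnt_b << 5) / max);

        return (uint16_t)((r << 10) | (g << 5) | b);
    }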


    Pros/Cons:
    +: Looks better than per-pixel Bayer-RGB
    +: Looks better than 4x4 RGBI
    -: Would require more complex decoder logic;
    -: Requires specialized dither logic to not look like broken crap.
    -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    The RGBI approach seems intermediate, being more likely to decode grayscale
    patterns as gray.



    I guess a more open question is whether such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non-randomized) dither patterns, it can LZ compress OK (depending on the
    image, one can get 0.1 to 0.8 bpp, which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 11:36:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e. if
    there are any trailing 1 bits in the exact result, then a 1 is put in the
    ulp position.

    We have known since before the 1978 ieee754 standard that guard+sticky
    (plus sign and ulp) is enough to get the rounding correct in all modes.

    The single exception is when rounding up from the maximum-magnitude
    value to inf should be suppressed; there you do in fact need to check
    all the bits.
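
    A minimal C sketch of round-to-odd on an integer significand (an added
    illustration, not anyone's production code; the caller is assumed to
    keep the extra bits below the target precision in the low end of a
    64-bit value):

    #include <stdint.h>

    /* Drop 'extra_bits' low bits; if any of them were set, jam a 1 into
       the ULP of the kept value.  A later round-to-nearest of the
       shorter result then cannot double-round incorrectly. */
    static uint64_t round_to_odd(uint64_t sig, int extra_bits)
    {
        uint64_t kept    = sig >> extra_bits;
        uint64_t dropped = sig & ((1ULL << extra_bits) - 1);
        if (dropped != 0)
            kept |= 1;
        return kept;
    }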

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 15:56:12 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits, and sometimes also rounding
    bit (e.g. in the Wikipedia article), without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agrees.

    Shockingly, the absence of strict definitions applies even to the most
    widely referenced article, David Goldberg's "What Every Computer Scientist
    Should Know About Floating-Point Arithmetic". It seems people copy the name
    of the article from one another, but only a very small fraction of them
    have bothered to actually read it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 09:39:27 2025
    From Newsgroup: comp.arch

    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4, there is not
    enough room in the instruction for a large branch displacement (only 10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 10:06:42 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
      Y R
      B G

    In this case tiling as:
      Y R Y R ...
      B G B G ...
      Y R Y R ...
      B G B G ...
      ...

    Where, Y is a pure luma value.
      May or may not use this, or:
        Y R B G Y R B G
        B G Y R B G Y R
        ...
      But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma; ...


    Dealing with chroma does have the effect of making the dithering process more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color shifts). The later image was mostly after I realized the issue with the dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
      Use LUT4 to map 4b -> 2b (as a count)
      Then, map 2x2b -> 3b (adder)
      Then, map 2x3b -> 4b (adder), then discard LSB.
      Then, select max or R/G/B/Y;
        This is used as an inverse normalization scale.
      Feed each value and scale through a LUT (for R/G/B)
        Getting a 5-bit scaled RGB;
        Roughly: (Val<<5)/Max
      Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1 bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic complexity.


    Pros/Cons:
      +: Looks better than per-pixel Bayer-RGB
      +: Looks better than 4x4 RGBI
      -: Would require more complex decoder logic;
      -: Requires specialized dither logic to not look like broken crap.
      -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale patterns as gray.



    I guess a more open question is if such a thing could be useful (it is pretty far down the image-quality scale). But, OTOH, with simpler (non- randomized) dither patterns; it can LZ compress OK (depending on image,
    can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    The color component widths (up to 10 bits per component) are specified
    in CRs. Colors are passed around as 32-bit values for video processing.
    Using the colors directly is much easier than dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 16:09:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also rounding
    bit (e.g. in Wikipedia article) without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agree.

    Within the 754 working group the definition is totally clear:

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit; it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e. if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place) is the final mantissa bit.

    Sign is of course the sign in the Sign-Magnitude format used for all fp numbers.

    This means that those four bits in combination suffice to separate
    between rounding directions:

    Default rounding is round-to-nearest-even: (In this case Sign does not matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
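
    The table collapses to a one-line decision; a minimal C sketch matching
    the table above (an added illustration, not working-group text):

    /* ulp    = last kept mantissa bit
       guard  = first discarded bit
       sticky = OR of all bits after guard
       Returns 1 if round-to-nearest-even should increment the mantissa. */
    static int rne_round_up(int ulp, int guard, int sticky)
    {
        /* Round up when strictly above halfway (guard && sticky), or
           exactly halfway (guard && !sticky) with an odd ULP. */
        return guard && (sticky || ulp);
    }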

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 18:14:54 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names between
    current members of 754 working group. But nothing of that sort is
    mentioned in the text of the Standard. Which among other things means
    that you can not rely on being understood even by new members of 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 20:19:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if
    everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names between
    current members of 754 working group. But nothing of that sort is
    mentioned in the text of the Standard. Which among other things means
    that you can not rely on being understood even by new members of 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5. If you work with the binary
    representation for decimal, then you just need two extra bits, just like
    BFP.

    Correct rounding also works when Guard temporarily contains more than one
    bit, possibly due to normalization, but you would normally squash this
    down to (Guard, Sticky) by OR'ing any secondary guard bits into Sticky.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 14:58:36 2025
    From Newsgroup: comp.arch

    On 11/2/2025 9:06 AM, Robert Finch wrote:
    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the
    bits per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
       Y R
       B G

    In this case tiling as:
       Y R Y R ...
       B G B G ...
       Y R Y R ...
       B G B G ...
       ...

    Where, Y is a pure luma value.
       May or may not use this, or:
         Y R B G Y R B G
         B G Y R B G Y R
         ...
       But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits
    per channel, allowing for roughly a RGB333 color space (though, the
    vectors are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of
    chroma;
    ...


    Dealing with chroma does have the effect of making the dithering
    process more complicated. As noted, reliable recovery of the color
    vector is itself a bit fiddly (and is very sensitive to the encoder
    side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The later image was mostly after I realized the issue with
    the dither pattern, and modified how it was being handled (replacing
    the use of an 8x8 ordered dither with a 4x4 ordered dither, and then
    rotating the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that
    much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
       Use LUT4 to map 4b -> 2b (as a count)
       Then, map 2x2b -> 3b (adder)
       Then, map 2x3b -> 4b (adder), then discard LSB.
       Then, select max or R/G/B/Y;
         This is used as an inverse normalization scale.
       Feed each value and scale through a LUT (for R/G/B)
         Getting a 5-bit scaled RGB;
         Roughly: (Val<<5)/Max
       Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1
    bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and
    maximum values, vs full intensity and black, but this would add more
    logic complexity.


    Pros/Cons:
       +: Looks better than per-pixel Bayer-RGB
       +: Looks better than 4x4 RGBI
       -: Would require more complex decoder logic;
       -: Requires specialized dither logic to not look like broken crap.
       -: Doesn't give passable results if handed naive grayscale dithering.
    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale
    patterns as gray.



    I guess a more open question is if such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non- randomized) dither patterns; it can LZ compress OK (depending on
    image, can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted
    into a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more
    computationally expensive than just using a CRAM style codec (while
    also giving worse image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    Colors are broken into components specifying the number of bits per component (up to 10) in CRs. Colors are passed around as 32-bit values
    for video processing. Using the colors directly is much easier than
    dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.



    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (Eg: 4 bit/channel on the Nexys A7). And,
    2-bit for many VGA PMods (PMod allowing 8 IO pins, so RGB222+H/V Sync;
    or needing to use 2 PMOD connections for the VGA). The usual workaround
    was also to perform dithering while driving the VGA output (with ordered dither in the Verilog).

    But, yeah, even the theoretical framebuffer images generally look better
    than what one sees on actual monitors.

    Even then, modern LCD panels mostly can't display even full RGB24 color
    depth; more often it is 6-bit / channel or similar (then the panels
    dither for full 24). But, IIRC a lot of OLEDs are back up to full
    color-depth (but, OLEDs are more expensive and have often had
    notoriously short lifespans, ...).

    But, yeah, my current monitor seems to be LCD based.



    In my case, the video HW uses prefetch requests along a ring-bus, which
    goes to the L2 cache, and then to external RAM. It then works on the hope
    that the requests get around the bus and can be resolved in time.

    In this case, the memory works in a vaguely similar way to the CPU's L1
    caches (although with line-oriented access), and a module that
    translates this to color-values during screen refresh. General access
    pattern was built around "character cells".


    It can give stable results at 8MB/s to 16MB/s (with more glitches as it
    goes higher), but breaks down too much past this point.

    So, switching to a RAM-backed framebuffer didn't significantly increase
    the usable screen resolutions or color depths.

    Also, I am mostly limited to using either a 25 or 50 MHz pixel
    clock, so some timings were tweaked to fit this. These don't really fit
    standard VESA timings, but it seems like monitors can tolerate
    nonstandard timings, and are more limited by operating range.

    So, say:
    320x200 70Hz, 25MHz; 9MB/s @ 16bpp (hi-color)
    640x400 70Hz, 25MHz; 9MB/s @ 4bpp, 18 MB/s @ 8bpp
    640x480 60Hz, 50Mhz; 9MB/s @ 4bpp, 18 MB/s @ 8bpp
    800x600 72Hz, 50Mhz; 8.6 MB/s @ 2bpp, 17 MB/s @ 4bpp
    1024x768 48Hz, 50Mhz, 5MB/s @ 1bpp, 10MB/s @ 2bpp
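
    As a sanity check of the figures above, the refresh bandwidth is just
    width x height x bpp/8 x refresh (framebuffer traffic only, ignoring
    blanking and any cell/palette overhead); a trivial C helper:

    static unsigned long fb_bandwidth(unsigned w, unsigned h,
                                      unsigned bpp, unsigned hz)
    {
        return (unsigned long)w * h * bpp / 8 * hz;
    }
    /* e.g. fb_bandwidth(320, 200, 16, 70) == 8960000, about 9 MB/s. */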

    So, this implies that just running 1024x768 at 2bpp should be acceptable
    (even if it exceeds the usual 128K limit).


    Earlier on, I had 800x600/36Hz and 1024x768/25Hz modes; these would have
    allowed 8bpp color, but are below the minimum refresh rate of most
    monitors (it seems like VGA monitors don't like going below around 40Hz).


    Of these modes, 8bpp (Indexed color) is technically newest.
    Originally the graphics hardware was written for color-cell.

    Earliest design had 32-bit cells (for 8x8 pixels):
    10 bits: Glyph
    2x 6b color + Attrib (RGB222)
    2x 9b Color: RGB333

    It was later expanded first to 64b cells, then to 128b and 256b.
    Some control bits affect cell size.
    Also with the ability to specify 8x8 or 4x4 cells.
    Where, 4x4 cells reduce the effective resolution.
    In the bitmap modes:
    4x4 + 256b: 16bpp Hicolor
    4x4 + 128b: 8bpp Indexed
    4x4 + 64b: 4bpp RGBI (Alt2)
    8x8 + 256b: 4bpp RGBI (Alt1)
    8x8 + 128b: 2bpp (4-color, CGA-like)
    With a range of color palettes available (more than CGA).
    Black/White/Cyan/Magenta, Black/White/Red/Green, ...
    Black/White/DarkGray/LightGray, also with Green and Amber, ...
    8x8 + 64b: 1bpp (Monochrome)
    Can select between RGBI colors and some special sub-modes.
    The recent idea, if added to HW, would slot into this mode.
    The color-cell modes:
    8x8 + 256b: 4bpp (DXT1 like, 4x 4x4 cells per 256-bit cell)
    8x8 + 128b: 2bpp (2bpp cells)
    Each cell has 2x RGB555 colors, and 8x8x1 for pixel data
    Had experimented with 8x RGB232,
    didn't catch on (looked terrible).
    8x8 + 64b: Text-Mode + Early Graphics (4x4 cells)


    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    The 640x200 mode is the same as 640x400 (for VGA) but with the vertical resolution halved. The 320x200 mode also halves the horizontal
    resolution (so 40x25 cells).


    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to
    320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
    16bpp: pixels in raster order.
    8bpp: raster order, 32-bits per row
    4bpp: Raster order, 16-bits per row
    And, 8x8:
    4bpp: Takes 16bpp layout, splits each pixel into 2x2.
    2bpp: Takes 8bpp layout, splits each pixel into 2x2.
    1bpp: Raster order, 1bpp, but same order as text glyphs.
    With MSB in upper left, LSB in lower right.

    Can note that the 8x8x1b cells have the upper-left corner in the MSB.
    This differs from most other modes where the upper left corner is in the
    LSB (so, pixels flipped both horizontally and vertically).
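
    In other words, for the 8x8x1 cells the pixel-to-bit mapping is just
    (a minimal sketch of my reading, not the actual HW/Verilog):

    #include <stdint.h>

    /* x,y in 0..7; bit 63 of the 64-bit cell is the upper-left pixel,
       bit 0 the lower-right (reversed relative to the other modes). */
    static int cell1bpp_get(uint64_t cell, int x, int y)
    {
        return (int)((cell >> (63 - (y * 8 + x))) & 1);
    }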


    Can note that in this case, the video memory had several parts:
    VRAM / Framebuffer
    Note: Uses 64-bit word addressing.
    Font RAM: Stores character glyphs as 8x8 patterns.
    Originally, there was a FontROM, but I dropped this feature.
    This means BootROM needs to supply the initial glyph set.
    I went with 5x6 pixel cells in the ROM to save space.
    Where 5x6 does ASCII well enough.
    Palette RAM: Stores 256x 16-bits (as RGB555).

    Though, TestKern typically uses what is effectively color-cell graphics
    for the text mode (so, just draws 8x8 pixel blocks for the character
    glyphs).


    All this differs notably from CGA/EGA/VGA, which had used mostly raster-ordered modes. Except for the oddity of bit-planes for 16 color
    modes in EGA and VGA.


    I did experiment with raster-ordered modes, which worked by effectively
    stretching the character cell horizontally while reducing the vertical
    height to 1 pixel. Ended up not going with this as it was prone to a lot
    more glitches with the screen refresh (turned out to be a lot more
    sensitive to timing than the use of 8x8 or 4x4 cells).

    But, since generally programs don't draw directly into VRAM, the use of non-raster VRAM is mostly less of an issue.


    Well, apart from the computational cost of converting from internal
    RGB555 frame-buffers. Though, part of the reason RGB555 ended up being
    used so often was that it was faster to do RGB555 -> ColorCell encoding
    than 8-bit indexed color to color-cell, as indexed color typically also
    requires a bunch of palette lookups (which could end up more expensive
    than the additional RAM bandwidth from the RGB555).

    Also, there isn't really a "good and simple" way to generalize 8-bit
    colors in a way that leads to acceptable image quality. Invariably, one
    ends up needing palettes or encoding schemes that are slightly irregular.



    For color-cell, there are different approaches depending on how fast it
    needs to be:
    Faster: Simply select minimum and maximum luma (see the sketch after this list);
    Selector encoding is often via comparing against thresholds.
    Except on x86, where multiply+bias+shift is faster.
    Medium: Calculate along 4 axes in parallel;
    Select axis which gives highest contrast;
    Usually: Luma, Cyan, Magenta, Yellow.
    Adjust endpoints to better reflect standard deviation.
    Vs simply min/max.
    Slower:
    Calculate centroid and mass distribution and similar;
    Better quality, more for offline / batch encoding.
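
    A minimal C sketch of the "Faster" path above for a 4x4 cell (an
    illustration under assumptions, not the project's encoder; the 2/5/1
    luma weights and the selector bit order are made up for the example):

    #include <stdint.h>

    static int luma555(uint16_t c)
    {
        int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
        return 2 * r + 5 * g + b;            /* cheap integer luma weight */
    }

    /* Pick the min/max-luma pixels as the two RGB555 endpoints, then a
       1-bit selector per pixel by comparing against the midpoint luma. */
    static void encode_cell_4x4(const uint16_t px[16],
                                uint16_t *c_min, uint16_t *c_max,
                                uint16_t *selectors)
    {
        int i, lo = 0, hi = 0;
        for (i = 1; i < 16; i++) {
            if (luma555(px[i]) < luma555(px[lo])) lo = i;
            if (luma555(px[i]) > luma555(px[hi])) hi = i;
        }
        *c_min = px[lo];
        *c_max = px[hi];

        int mid = (luma555(px[lo]) + luma555(px[hi])) / 2;
        uint16_t sel = 0;
        for (i = 0; i < 16; i++)
            if (luma555(px[i]) > mid)
                sel |= (uint16_t)(1u << i);  /* bit i -> use c_max */
        *selectors = sel;
    }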


    As noted, early on, I was mostly using real-time color-cell encoders for
    Doom and Quake and similar (hence part of why they were modified to use RGB555).

    Some of this is also related to the existence of a lot of RGB555-related
    helper ops. Though, early on, I had also used YUV655 as well, but RGB555
    mostly won out over YUV655 (even if it is easier to get a luma from
    YUV655 vs RGB555).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 16:56:05 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (Eg: 4 bit/channel on the Nexys A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so RGB222+H/V Sync;
    or needing to use 2 PMOD connections for the VGA). The usual workaround
    was also to perform dithering while driving the VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.

    Have you tried dithering based on the frame (temporal dithering vs
    spatial dithering)? The first frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high
    enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.

    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode 800x600 mode is used on my system, with 12x18 cells so
    that I can read the display at a distance (64x32 characters).

    The font then has 64 block graphic characters of 2x3 block. Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.
    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
      16bpp: pixels in raster order.
       8bpp: raster order, 32-bits per row
       4bpp: Raster order, 16-bits per row
    And, 8x8:
       4bpp: Takes 16bpp layout, splits each pixel into 2x2.
       2bpp: Takes  8bpp layout, splits each pixel into 2x2.
       1bpp: Raster order, 1bpp, but same order as text glyphs.
         With MSB in upper left, LSB in lower right.


    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 17:21:52 2025
    From Newsgroup: comp.arch

    On 11/2/2025 3:56 PM, Robert Finch wrote:
    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA
    DACs that generally fall below this (Eg: 4 bit/channel on the Nexys
    A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so
    RGB222+H/V Sync; or needing to use 2 PMOD connections for the VGA).
    The usual workaround was also to perform dithering while driving the
    VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.


    Never went up the learning curve for HDMI.
    Would likely need to drive the monitor outputs with SERDES or similar
    though.


    Have you tried dithering based on the frame (temporal dithering vs
    space-al dithering)? First frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.


    Temporal dithering seems to generate annoying artifacts on the monitors
    I tried it on; it tended to result in wavy/rippling artifacts.

    Likewise, PWM'ing the pixels also makes LCD monitors unhappy (rainbow
    banding artifacts), but seems to work OK on CRTs. I suspect it is an
    issue that the monitors expect a 25MHz pixel clock (when using 640x400
    or 640x480 timing) with an ADC that doesn't like sudden changes in level
    (say, if updating the pixels at 50MHz internally).


    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode 800x600 mode is used on my system, with 12x18 cells so that I can read the display at a distance (64x32 characters).

    The font then has 64 block graphic characters of 2x3 block. Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.

    I went with 80x25 as it is pretty standard;
    80x50 is also possible, but less standard.

    Though, Linux seems to often like using high-res text modes rather than
    the usual 80x25 or similar.

    As for 8x8 character cells:
    Also pretty standard, and fit nicely into 64 bits.



    In theory, for a text mode, could drive a monitor at 1280x400 with
    640x400 timings for 16x16 character cells, but LCD monitors don't like
    this sort of thing.


    Even at 640x400/70Hz timings, the monitor didn't consistently recognize
    it as 640x400, and would sometimes try to detect it as 720x400 or
    similar (which would look wonky).

    The other option being to output 640x480 and simply black-fill the extra
    lines (so, add 20 lines of black-fill at the top and bottom of the
    screen). Where, the monitors were able to more reliably detect 640x480/60Hz


    The main tradeoff is that mostly I have a limited selection of pixel
    clocks available:
    25, 50, maybe 100.

    Mostly because the pixel clocks are high enough and clock-edges
    sensitive enough where accumulation timers don't really work.

    Though, accumulation timers do work for driving an NTSC composite
    output. But, NTSC composite looks poor, can't even really do an 80x25
    text mode acceptably (if using colorburst); but can do 80x25 if one can
    accept black-and-white.

    Well, there was also component video, but this is basically the same as driving VGA (just with it being able to accept both NTSC and VGA
    timings; eg, 15 to 70 kHz for horizontal refresh, 40 to 90 Hz vertical,
    ...).

    Though, I no longer have the display that had component video inputs.


    In contrast, there is generally a very limited range of timings for
    composite or S-Video (generally, these don't accept VGA-like timings).
    Whereas VGA only really accepts VGA-like timings, and is unhappy if
    given NTSC timings (eg: 15 kHz horizontal refresh).


    Not sure why component video is seemingly the only "accepts whatever you
    throw at it" analog input (say, on a display with multiple input types
    and presumably similar hardware internally).


    Checking around, it is annoyingly hard to find plain LCD monitors with a
    component video input that are not also a full TV with a TV tuner (but, a
    little easier to find ones with both VGA and composite). The closest I can
    find are apparently intended mostly as CCTV monitors.


    But, mostly using VGA anyways, so...


    ...




    In this case, a 40x25 color-cell mode (with 256-bit cells) could be
    used for graphics (32K). Early on, this was used as the graphics mode
    for Doom and similar, before I later expanded VRAM to 128K and
    switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
       16bpp: pixels in raster order.
        8bpp: raster order, 32-bits per row
        4bpp: Raster order, 16-bits per row
    And, 8x8:
        4bpp: Takes 16bpp layout, splits each pixel into 2x2.
        2bpp: Takes  8bpp layout, splits each pixel into 2x2.
        1bpp: Raster order, 1bpp, but same order as text glyphs.
          With MSB in upper left, LSB in lower right.


    <snip>


    ...

    But, yeah, my makeshift graphics hardware is a little wonky.
    And, works in an almost entirely different way from the VGA style hardware.

    Ironically, software doesn't configure timings itself, but rather uses selector bits to control various properties:
    Base Resolution (640x400, 640x480, 800x600, ...);
    Character cell size in pixels (4x4 or 8x8);
    Settings to modify the number of horizontal and vertical cells relative
    to the base resolution;
    ...

    But, for the most part, had been using 640x400 or similar; with 800x600
    as more experimental (and doesn't look great with 2bpp cells).

    The 1024x768 mode had gone mostly unused, and is still untested on real hardware.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 3 15:22:44 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 3 11:53:48 2025
    From Newsgroup: comp.arch

    On 11/3/2025 9:22 AM, Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?


    I would assume he meant something like either the newer IEEE-754 decimal formats, or a decimal-FP format that MS had used in .NET, ...

    The IEEE formats generally use one of:
    A linear (binary integer) mantissa understood as decimal (BID); or
    Groups of 10 bits, each used to encode 3 digits (Densely Packed Decimal).
    Both use a power-of-10 exponent.

    The .NET format was similar, except using groups of 32 bits as linear
    values representing 9 digits.

    When I looked at it before, the most practical way for me to support
    something like this seemed to be to not do it directly in hardware, but
    to support a subset of operations:
    Operations to pack and unpack DPD into BCD;
    Say: 64 bit value holds 15 BCD digits, mapped to 50 bits of DPD.
    Some basic operations to help with arithmetic on BCD.

    I partly implemented these as an experiment before, but then noted I
    have basically no use case for Decimal-FP in my project.

    And, ironically, the main benefit the helpers would have provided would
    be to allow for faster Binary<->Decimal conversion. But, even that is
    debatable, as Binary<->Decimal conversion doesn't itself take enough CPU
    time to justify making it faster at the cost of needing to drag around
    BCD helper instructions.

    One downside is that there was no multiplier, so the BCD helpers would
    need to be used to effectively implement a Radix-10 Shift-and-Add.

    ...


    Though, it is debatable, something more like the .NET approach could
    make more sense for a SW implementation.

    If one wants to make the encoding use the bits more efficiently, a
    hybrid approach could make sense, say:
    Use 3 groups of 30 bits, plus another group of 20 bits (6 digits);
    Use a 17-bit linear exponent and a sign bit (3x30 + 20 + 17 + 1 = 128 bits).

    This would be slightly cheaper to implement vs what is defined in the
    standard (for the BID variant), and could achieve a similar effect
    (though, with 33 digits rather than 34).

    Internally, it could work similarly to the .NET approach, just with a
    little more up-front work to pack/unpack the 30-bit components. The merit
    of 30-bit groups is that they map internally onto 32-bit integer
    operations (which would also provide space internally for carry/borrow signaling in operations).

    Most CPUs at least have native support for 32-bit integer math, and for
    SW (on a 32/64 bit machine) this could be an easier chunking size than
    10 bits. Someone could argue for 60 bit chunking on a 64-bit machine
    (or, one 60 bit chunk, and a 50 bit chunk), but likely this wouldn't
    save much over 30 bit chunking.

    Also, 60-bit chunking would imply access to a 64*64->128 bit widening multiply; which is asking more than 32*32->64. And, also precludes some
    ways to more cheaply implement the divide/modulo step for each chunk
    (*). So, it is likely in this sense 30 bit chunks could still be preferable.

    *:
    high=product>>30;
    low=product-(high*1000000000LL);
    if(low>=1000000000)
    { high++; low-=1000000000; }
    Where, 60 bit chunking would require 128-bit math here.

    Where, effectively, the multiply step is operating in radix-1-billion.
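
    For comparison, a minimal C sketch of one radix-1-billion multiply step
    using 32-bit chunks (a generic illustration, not the code discussed
    above; here the per-chunk split uses an exact /10^9 and %10^9, which
    compilers turn into a multiply-by-reciprocal rather than the shift
    approximation in the footnote):

    #include <stdint.h>

    #define RADIX 1000000000ULL

    /* dst[0..n] = a[0..n-1] * b, where every chunk holds 9 decimal
       digits (i.e. is in [0, 10^9)).  Products fit in 64 bits since
       (10^9-1)^2 + carry < 2^63. */
    static void mul_chunks(uint32_t *dst, const uint32_t *a, int n, uint32_t b)
    {
        uint64_t carry = 0;
        for (int i = 0; i < n; i++) {
            uint64_t t = (uint64_t)a[i] * b + carry;
            dst[i] = (uint32_t)(t % RADIX);
            carry  = t / RADIX;
        }
        dst[n] = (uint32_t)carry;    /* carry < 10^9, fits one chunk */
    }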

    ...



    Still don't have much of a use-case though.

    In general, Decimal-FP seems more like a solution in search of a problem.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 3 18:47:36 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2 ; compare R1 with R2, result bits to R3
    BLT R3,displacement ; branch on the less-than bit, IP-relative
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:03:13 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely. My 66000 also has RNO; Round Nearest Random is defined but not
    yet available; Round Away from Zero is also defined and available.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:13:50 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4 there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    The VEC instruction (My 66000) provides a register that is used for
    the address of the top of the loop and the address of the VEC inst
    itself. So, when running in the loop, the LOOP instruction branches
    to the register value, and when taking an exception in the loop,
    the register leads back to the VEC instruction after the exception
    has been serviced.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    VEC-{ }-LOOP always saves at least 1 instruction per iteration.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    VEC does its own predictions. LOOP does not overrun the loop-count,
    so loop termination is not a pipeline flush.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)

    LDA Rd,[IP,displacement]

    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement

    But if you create "R3" from your VEC instruction, you KNOW that
    the compiler is only allowed to use "r3" as a branch target, and
    that "R3" is static over the duration of the loop, so you can get
    the reservation stations moving faster/easier.

    I have a "special" RS for the VEC-LOOP brackets.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 3 23:04:53 2025
    From Newsgroup: comp.arch

    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I
    use them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd
    rather not use the term 'guard' at all. Names like 'rounding bit'
    or 'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit
    if the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding the mantissa: IBM's DPD, which
    is a clever variation on base 1000, and Intel's binary (BID).
    DPD encoding is considered preferable for hardware implementations,
    while binary encoding is easier for software implementations.
    BCD is not an option; its information density is insufficient to
    supply the required semantics in the given size of container.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 08:50:25 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    No, I meant ieee754 DFP, where you either store the decimal digits in
    packed modulo-1000 groups, or as a binary mantissa with a decimal exponent/scaling value.

    When you do math with these you have to handle all the required
    (financial?) rounding modes.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 4 07:50:33 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the
    prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         9.6     8.0     9.5    23.1   38.6          Alpha 21264B 800MHz         ~2000
         4.7     8.1     9.5    19.0   21.3          Pentium III 1000MHz          2000
        18.4     8.5    10.3    24.5   29.0          Athlon 1200MHz               2000
         8.6    14.2    15.3    23.4   30.2          Pentium 4 2.26GHz            2002
        13.3    10.3    12.3    15.7   18.7          Itanium 2 (McKinley) 900MHz  2002
         5.7     9.2    12.3    16.3   17.9          PPC 7447A 1066MHz            2004
         7.8    12.8    12.9    30.2   39.0          PPC 970 2000MHz              2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with
    varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         4.9     5.6     4.3     5.1   7.64          Pentium M 755 2000MHz         2004
         4.4     2.2     2.0    20.3   18.6     3.3  Xeon E3-1220 3100MHz          2011
         4.0     2.3     2.3     4.0    5.1     3.5  Core i7-4790K 4400MHz         2013
         4.2     2.1     2.0     4.9    5.2     2.7  Core i5-6600K 4000MHz         2015
         5.7     3.2     3.9     7.0    8.6     3.7  Cortex-A73 1800MHz            2016
         4.2     3.3     3.2    17.9   23.1     4.2  Ryzen 5 1600X 3600MHz         2017
         6.9    24.5    27.3    37.1   33.5    36.6  Power9 3800MHz                2017
         3.8     1.0     1.1     3.8    6.2     2.2  Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in
    slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be a bad idea.

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 15:19:08 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 17:41:07 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    Decimal32 and Decimal64 would suffer from a similar mismatch, but those
    formats are probably not important. IMHO, IEEE defined them for the
    sake of completeness rather than because they are useful in the real
    world.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 07:47:50 2025
    From Newsgroup: comp.arch

    On 11/4/2025 7:19 AM, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    By "information density" I think he means that for almost any (I won't
    say any because there might be some edge cases where the isn't true)
    value, it takes fewer bits to represent in the IEEE scheme than in your beloved Burroughs Medium system's scheme. :-) Fewer bits per value
    means higher information density.

    Fewer bits means less less hardware, thus lower cost, less power
    required, etc.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 16:52:18 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It needs to be comparable to binary FP:

    A 64-bit double provides 53 mantissa bits, which corresponds to nearly
    16 decimal digits (53*log10(2) = 15.95), while fp128 gives us 113 bits,
    or a smidgen over 34 digits.

    The corresponding 128-bit DFP format also provides 34 decimal digits,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 cover roughly 2^-16k to 2^16k, corresponding
    to about 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than BFP.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 18:54:58 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15 exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 17:12:54 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:13:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to
    5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.

    Agreed.

    It is somewhat similar to the very old hex FP, which had a wider
    exponent range but more variable precision.

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 19:15:31 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers-playing program in 1974.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    We threw HW at the problem.

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         9.6     8.0     9.5    23.1   38.6          Alpha 21264B 800MHz         ~2000
         4.7     8.1     9.5    19.0   21.3          Pentium III 1000MHz          2000
        18.4     8.5    10.3    24.5   29.0          Athlon 1200MHz               2000
         8.6    14.2    15.3    23.4   30.2          Pentium 4 2.26GHz            2002
        13.3    10.3    12.3    15.7   18.7          Itanium 2 (McKinley) 900MHz  2002
         5.7     9.2    12.3    16.3   17.9          PPC 7447A 1066MHz            2004
         7.8    12.8    12.9    30.2   39.0          PPC 970 2000MHz              2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         4.9     5.6     4.3     5.1   7.64          Pentium M 755 2000MHz         2004
         4.4     2.2     2.0    20.3   18.6     3.3  Xeon E3-1220 3100MHz          2011
         4.0     2.3     2.3     4.0    5.1     3.5  Core i7-4790K 4400MHz         2013
         4.2     2.1     2.0     4.9    5.2     2.7  Core i5-6600K 4000MHz         2015
         5.7     3.2     3.9     7.0    8.6     3.7  Cortex-A73 1800MHz            2016
         4.2     3.3     3.2    17.9   23.1     4.2  Ryzen 5 1600X 3600MHz         2017
         6.9    24.5    27.3    37.1   33.5    36.6  Power9 3800MHz                2017
         3.8     1.0     1.1     3.8    6.2     2.2  Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    VEC is the bracket at the top of a loop. VEC supplies a register
    which will contain the address of the instruction at the top of
    the loop, and a 21-bit vector used to specify those registers which
    are "Live" out of the loop. VEC is "executed" as the loop is entered
    and then not again until the loop is entered again.

    The LOOP instruction is the bottom bracket of the loop and performs
    the ADD-CMP-BC sequence as a single instruction. There are 3 flavors
    {counted, value terminated, counter value terminated} that use the
    3 registers similarly but differently.

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    With VEC-LOOP you are guaranteed that the branch and its target are
    100% correlated.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    Greater than 1 branch per FETCH latency.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    Agreed.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:16:59 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    I thought that was obvious:

    When you learned how to do decimal rounding back in your pen & paper
    math classes, you probably realized that for any calculation which could
    not be done exactly, you had to generate enough extra digits to be sure
    how to round.

    Those extra digits play exactly the same role as Guard + Sticky do in
    binary FP.
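
    A tiny sketch (mine, not Terje's code) of that idea for decimal
    round-to-nearest-even: one guard digit plus a sticky flag is all the
    extra state needed to decide whether the kept digits get incremented.

      #include <stdint.h>

      /* kept   = the decimal digits being retained, as an integer
         guard  = the first discarded digit, 0..9
         sticky = nonzero if any later discarded digit was nonzero */
      static uint64_t dec_round_nearest_even(uint64_t kept, unsigned guard, int sticky)
      {
          if (guard > 5 || (guard == 5 && (sticky || (kept % 10) & 1)))
              kept++;        /* caller renormalizes if this carries out */
          return kept;
      }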

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 4 21:07:43 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:44:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of
    1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium, I
    told them that I expected the P6 to employ eager execution, i.e. execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:52:46 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Several options, the easiest is of course a set of full forward/reverse
    lookup tables, but you can take advantage of the regularities by using
    smaller tables together with a little bit of logic.

    You also need a way to extract one or two digits from the top/bottom of
    each mod1000 container in order to handle normalization.

    For the Intel binary mantissa dfp128, normalization is the hard issue;
    Michael S has figured out some really nice tricks to speed it up, but
    when you have a (worst case) temporary 220+ bit product mantissa,
    scaling is not that easy.

    The saving grace is that almost all DFP calculations tend to employ
    relatively small numbers, mostly dfadd/dfsub/dfmul operations with fixed precision, and those will always be faster (in software) using the
    binary mantissa.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 22:51:28 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.
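
    A software rendering of exactly that lookup pattern (a sketch; the
    table contents themselves are assumed to be generated elsewhere from
    the IEEE 754 declet definition):

      #include <stdint.h>

      extern const uint16_t dpd2bcd[1024];   /* 10-bit declet -> 12-bit BCD    */
      extern const uint16_t bcd2dpd[4096];   /* 12-bit BCD    -> 10-bit declet */

      /* 5 declets (50 bits of DPD) -> 15 packed-BCD digits, and back. */
      static uint64_t dpd50_to_bcd(uint64_t dpd)
      {
          uint64_t bcd = 0;
          for (int i = 0; i < 5; i++)
              bcd |= (uint64_t)dpd2bcd[(dpd >> (10 * i)) & 0x3FF] << (12 * i);
          return bcd;
      }

      static uint64_t bcd_to_dpd50(uint64_t bcd)
      {
          uint64_t dpd = 0;
          for (int i = 0; i < 5; i++)
              dpd |= (uint64_t)bcd2dpd[(bcd >> (12 * i)) & 0xFFF] << (10 * i);
          return dpd;
      }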

    You still need to build 12-bit decimal ALUs to string together
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 15:46:06 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 00:44:18 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    I remember you relating this story about 6-8 years ago.

    As you said: "Never bet against branch prediction".

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 02:51:10 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    But you can be sure COBOL got them from assembly language programmers.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 4 23:43:48 2025
    From Newsgroup: comp.arch

    On 11/4/2025 4:51 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.


    In SW, you would still need to burn 16 bits per entry on the table, and possibly have code to fill in the tables (well, unless the numbers are expressed in code).


    A similar strategy is often used for sin/cos in many 90s era games,
    though the table is big enough that it would likely be impractical to
    type out by hand (or calculate using mental math).

    It is likely someone at ID Software or similar wrote out code at one
    point to spit out the sin+cos lookup table as a big blob of C (say,
    because an 8192 entry table is likely too big to be reasonable to type
    out by hand).


    Sometimes it is a question of where exactly the tradeoff lies in these
    cases: between typing plus mental math, and writing some code to spit
    out a table.

    For me, the tradeoff is often somewhere around 256 numbers, or less if
    the calculation is mentally difficult (namely, whether typing or
    calculating is the bottleneck).


    For DPD<->BCD, I would most likely resort to using code to generate
    the lookup table.

    Then again, it might depend a lot on the person...



    You still need to build 12-bit decimal ALUs to string together

    When I did it experimentally, I had done 16 BCD digits in 64 bits...

    The cost was slightly higher than that of a 64-bit ADD/SUB unit.

    Generally, it was combining the normal 4-bit CARRY4 style logic with
    some LUTs on the output side to turn it into a sort of BCD equivalent of
    a CARRY4.

    Granted, doing it with 3/6/9 digits would be cheaper than with 16 digits.


    Though, if doing it purely in software, it may make sense to go a different route:
    Map DPD to a linear integer between 0 and 999;
    Combine groups of 3 values into a 32 bit value;
    Work 32 bits at a time;
    Split back up to groups of 3 digits, and map back to DPD.

    Though, depends on the operation, for some it may be faster to operate
    in groups of 3 digits at a time (and sidestep the costs of combining or splitting the values).
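
    A sketch (mine) of the combine/split steps listed above, with each
    group already a linear 0..999 value:

      #include <stdint.h>

      /* three 0..999 groups <-> one 0..999999999 limb */
      static uint32_t groups3_to_limb(const uint16_t g[3])
      {
          return g[0] + 1000u * g[1] + 1000000u * g[2];
      }

      static void limb_to_groups3(uint32_t limb, uint16_t g[3])
      {
          g[0] = (uint16_t)(limb % 1000);  limb /= 1000;
          g[1] = (uint16_t)(limb % 1000);
          g[2] = (uint16_t)(limb / 1000);
      }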


    Then again, thinking about it, it is possible that for the Decimal128
    BID format, the mantissa could be broken up into smaller chunks (say, 9 digits) without the need for a full-width 128-bit multiply.

    In this case, could use a narrower multiply, and the "error" from the
    overflow would exist outside of the range of digits that are being
    worked on, so effectively becomes irrelevant for the operation in
    question (so, may be able to use 32 or 64 bit multiply, and 128-bit ADD).

    Granted, this is untested.

    Well, apart from how to recombine the parts without the need for wide multiply.

    In theory, could turn it into a big pile of shifts-and-add. Not sure if
    there is a good way to limit the number of shifts-and-adds needed. Well, unless turned into multiply-by-100 (3 shift 2 add) 4x times followed by multiply by 10 (1 shift 1 add), to implement multiply by 1 billion, but
    this also sucks (vs 13 shift 12 add).

    Hmm...


    Ironically, the DPD option almost looks preferable...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 05:17:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and
    switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.
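
    For reference, a minimal sketch of that labels-as-values dispatch in
    GNU C (my example, not code from the page): each virtual instruction
    is simply the address of a label, and NEXT jumps through the next one.

      #include <stdio.h>

      int main(void)
      {
          /* a tiny "program": two increments, a print, then halt */
          static void *prog[] = { &&op_inc, &&op_inc, &&op_print, &&op_halt };
          void **ip = prog;                /* threaded-code instruction pointer */
          long acc = 0;

      #define NEXT goto **ip++
          NEXT;

      op_inc:   acc++;                  NEXT;
      op_print: printf("%ld\n", acc);   NEXT;
      op_halt:  return 0;
      }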

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values
    feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:41:30 2025
    From Newsgroup: comp.arch

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random? How about round externally guided (RXG) by an
    input signal? For instance, the rounding could come from a feedback
    filter of some sort.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 06:44:54 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:47:56 2025
    From Newsgroup: comp.arch

    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

That was the name of the architecture, but I am being fickle and scrapping it, restarting from the Qupls2024 architecture and updating it to Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

Using 48-bit instructions now, so there is enough room for an 18-bit displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
AIPSI R3,1234       ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3        ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 22:53:49 2025
    From Newsgroup: comp.arch

    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is doing, you come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where, and who knows to what, and programmers just aren't able to follow code like that. BTDT.

    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    I am not saying it couldn't be used well. Just that it was often not,
    and when not, it caused a lot of problems.




    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto".

As did COBOL, where it is called GO TO ... DEPENDING ON, but those features didn't suffer the problems of assigned/alter gotos.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 06:55:49 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:00:32 2025
    From Newsgroup: comp.arch

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

Where, the idea was that the state machine is updated with the current state and branch direction, giving the next state and the next predicted branch direction (for this state).
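
As a rough C sketch of that indexing scheme (a single shared history
register stands in for the per-branch local history, and the plain
2-bit saturating counter stands in for the 5/6-bit FSM; the table size
is an arbitrary choice here):

  #include <stdint.h>

  #define PRED_BITS 12
  #define PRED_SIZE (1u << PRED_BITS)

  static uint8_t  ctr[PRED_SIZE];   /* 2-bit saturating counters, 0..3 */
  static uint32_t hist;             /* recent taken/not-taken history  */

  int predict(uint32_t pc)
  {
      uint32_t idx = (pc ^ hist) & (PRED_SIZE - 1);  /* history XOR PC */
      return ctr[idx] >= 2;                          /* 2,3 => taken   */
  }

  void update(uint32_t pc, int taken)
  {
      uint32_t idx = (pc ^ hist) & (PRED_SIZE - 1);
      if (taken) { if (ctr[idx] < 3) ctr[idx]++; }
      else       { if (ctr[idx] > 0) ctr[idx]--; }
      hist = (hist << 1) | (taken & 1);              /* shift in outcome */
  }

Swapping the 2-bit counter for the 5/6-bit FSM described above only
changes the per-entry update; the hashing stays the same.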


Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes with 2-bit state were preferable to a simpler lookup scheme with 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (is a bit easier if limiting the patterns to 3 bits).



    Then again, had before noted that the LLMs are seemingly also not really
    able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could
    have done similar.

Well, unless maybe the argument is that 5 or 6 bits of storage would cost more than 2 bits, but then presumably the significantly larger tables needed (to compensate for the relative predictive weakness of 2-bit state) would have cost more than smaller tables of 6-bit state?...

    Say, for example, 2b:
    00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
    00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
    01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
    01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
    10_0 => 10_0 //strongly not taken, dir=0
    10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
    11_0 => 01_1 //strongly taken, dir=0
    11_1 => 11_1 //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
    As above, and 4-more alternating states
    And slightly different transition logic.
    Say (abbreviated):
    000 weak, not taken
    001 weak, taken
    010 strong, not taken
    011 strong, taken
    100 weak, alternating, not-taken
    101 weak, alternating, taken
    110 strong, alternating, not-taken
    111 strong, alternating, taken
The alternating states just flip-flop between taken and not taken.
The weak states can move between any of the 4.
The strong states are used if the pattern is reinforced.

Going up to 3-bit patterns is more of the same (add another bit, doubling the number of states). Something seems to go wrong when getting to 4-bit patterns though (one can't fit both weak and strong states for the longer patterns, so the 4-bit patterns effectively only exist as weak states which partly overlap with the weak states for the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit
    state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 02:06:50 2025
    From Newsgroup: comp.arch

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible.  A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction?  (That
    may have been mentioned upthread, in that case I don't remember).

That was the name of the architecture, but I am being fickle and scrapping it, restarting from the Qupls2024 architecture and updating it to Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

Using 48-bit instructions now, so there is enough room for an 18-bit displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
AIPSI R3,1234       ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3        ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

Qupls2024 also used 8-bit register specs. This was a bit of overkill and not really needed. Register specs are reduced to 6 bits. Right away, that reduced most instructions by eight bits.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 07:13:46 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:38:30 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring
    of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of
    MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its
    architecture.

    Before the start of that briefing I suggested that I should start off
    on the blackboard by showing what I had been able to figure out on my
    own, then I proceeded to pretty much exactly cover every single
    feature on the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I
    told them that I expected the P6 to employ eager execution, i.e
    execute both ways of one or two layers of branches, discarding the
    non-taken paths as the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power
    viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
      weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.

Where, the idea was that the state machine is updated with the current state and branch direction, giving the next state and the next predicted branch direction (for this state).


Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes with 2-bit state were preferable to a simpler lookup scheme with 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (is a bit easier if limiting the patterns to 3 bits).



    Then again, had before noted that the LLMs are seemingly also not really able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.



    Errm...

    I just decided to test it, and it appears Grok was able to figure it out
    (more or less).

This is concerning: either the AIs are getting smart enough to deal with semi-difficult problems, or in fact it is not difficult and I was just dumb for thinking there was any difficulty in working out the state tables for the longer patterns.

    I tried before with DeepSeek R1 and similar, which had failed.



    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could have done similar.

Well, unless maybe the argument is that 5 or 6 bits of storage would cost more than 2 bits, but then presumably the significantly larger tables needed (to compensate for the relative predictive weakness of 2-bit state) would have cost more than smaller tables of 6-bit state?...

    Say, for example, 2b:
     00_0 => 10_0  //Weakly not-taken, dir=0, goes strong not-taken
     00_1 => 01_0  //Weakly not-taken, dir=1, goes weakly taken
     01_0 => 00_1  //Weakly taken, dir=0, goes weakly not-taken
     01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken
     10_0 => 10_0  //strongly not taken, dir=0
     10_1 => 00_0  //strongly not taken, dir=1 (goes weak)
     11_0 => 01_1  //strongly taken, dir=0
     11_1 => 11_1  //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
      As above, and 4-more alternating states
      And slightly different transition logic.
    Say (abbreviated):
      000   weak, not taken
      001   weak, taken
      010   strong, not taken
      011   strong, taken
      100   weak, alternating, not-taken
      101   weak, alternating, taken
      110   strong, alternating, not-taken
      111   strong, alternating, taken
The alternating states just flip-flop between taken and not taken.
The weak states can move between any of the 4.
The strong states are used if the pattern is reinforced.

Going up to 3-bit patterns is more of the same (add another bit, doubling the number of states). Something seems to go wrong when getting to 4-bit patterns though (one can't fit both weak and strong states for the longer patterns, so the 4-bit patterns effectively only exist as weak states which partly overlap with the weak states for the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient state-
    space for the looping 5-bit patterns, there may not be sufficient state- space to distinguish whether to move from a mismatched 4-bit pattern to
    a 3 or 5 bit pattern. Whereas, at least with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc. One needs to be
    able to express decay both to shorter patterns and to longer patterns,
    and I suspect at this point, the pattern breaks down (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    But, alas, sometimes I wonder if I am just kinda stupid and everyone
    else has already kinda figured this out, but doesn't say much...

    Like, just smart enough to do the things that I do, but not so much otherwise... In theory, I am kinda OK, but often it mostly seems like I
    mostly just suck at everything.



    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 02:01:35 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.


I usually used call threading, because:
  In my testing it was one of the faster options
    (at least if excluding 32-bit x86, which often has slow function
    calls, because pretty much every function needs a stack frame, ...);
  It is usable in standard C.

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.
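
Roughly the shape of the trace-based call threading described above
(names and details are made up here, just to illustrate the idea):

  typedef struct VM VM;
  typedef struct Trace Trace;
  typedef Trace *(*TraceFn)(VM *vm, Trace *self);

  struct Trace {
      TraceFn run;     /* unrolled body: calls each opcode function,
                          then returns the next trace to execute */
  };

  struct VM { long regs[32]; Trace *entry; };

  /* main dispatch loop: one indirect call per trace */
  void interp(VM *vm)
  {
      Trace *t = vm->entry;
      while (t != NULL)
          t = t->run(vm, t);
  }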


    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    But, if you use it, you are basically stuck with GCC...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:18:50 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128 normalization is the hard issue, Michael S have figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    How they say? Sometimes a division is just a division.
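
For illustration (my sketch, not Michael's code): since 10**k = 5**k *
2**k and 5**27 still fits in 64 bits, scaling down by up to 27 decimal
digits needs only a division by a 64-bit constant plus a shift (this
uses the unsigned __int128 compiler extension):

  #include <stdint.h>

  /* 5^k for k = 0..27; 5^27 = 7450580596923828125 < 2^64 */
  static const uint64_t pow5[28] = {
      1ull, 5ull, 25ull, 125ull, 625ull, 3125ull, 15625ull, 78125ull,
      390625ull, 1953125ull, 9765625ull, 48828125ull, 244140625ull,
      1220703125ull, 6103515625ull, 30517578125ull, 152587890625ull,
      762939453125ull, 3814697265625ull, 19073486328125ull,
      95367431640625ull, 476837158203125ull, 2384185791015625ull,
      11920928955078125ull, 59604644775390625ull, 298023223876953125ull,
      1490116119384765625ull, 7450580596923828125ull
  };

  /* x / 10^k for 0 <= k <= 27: divide by the 64-bit constant 5^k,
     then shift right by k (floor(floor(x/a)/b) == floor(x/(a*b))). */
  unsigned __int128 div_pow10(unsigned __int128 x, unsigned k)
  {
      return (x / pow5[k]) >> k;
  }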

    but when you have a (worst case) temporary 220+ bit product mantissa, scaling is not that easy.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:21:32 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

Are you talking about hardware or software?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:25:45 2025
    From Newsgroup: comp.arch

    On 2025-11-05 2:13 a.m., Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)

I like that IBM packing method.

    I have some RTL code to pack and unpack modulo 1000 to BCD. I think it
    is fast and small enough that it can be used inline at the input and
    output of DFP operations. The DFP values can then be passed around in
    the CPU as 128-bit values instead of the expanded BCD value.

Only 128-bit DFP is supported on my machine, under the assumption that one wants the extended decimal precision for engineering / finance. Otherwise, why would one use it? Better to use BFP.

One headache I have not yet worked out is how to convert between DFP and BFP in a sensible fashion. I have tried a couple of approaches using log/exp type functions, but the results are way off. I suppose I could rely on conversions to and from text strings.
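
A minimal sketch of the text-string route, assuming the DFP value has
already been unpacked into sign, integer coefficient and decimal
exponent (and, purely for illustration, that the coefficient fits in 64
bits; a real Decimal128 coefficient of up to 34 digits needs a wider
to-decimal-string step):

  #include <stdio.h>
  #include <stdlib.h>
  #include <inttypes.h>

  double dfp_to_double(int sign, uint64_t coeff, int dexp)
  {
      char buf[64];
      /* print the value as "[-]coeffEexp" ... */
      snprintf(buf, sizeof buf, "%s%" PRIu64 "E%d",
               sign ? "-" : "", coeff, dexp);
      /* ... and let strtod do the decimal-to-binary rounding
         (correctly rounded on good libc implementations) */
      return strtod(buf, NULL);
  }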


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:27:48 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC >>>> acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    But you can be sure COBOL got them from assembly language programmers.

Back before caches and branch predictors, my fastest word count (wc) asm program employed runtime code generation: it started by filling a 64kB segment with code snippets aligned every 128 bytes. The even blocks were for scanning outside a word and the odd blocks were used when a word start had been found; each snippet would load the next byte into BH and jump to BX. (BL contained the outside/inside flag value as 0/128.)

    Fast forward a few years and a branchless data state machine ran far
    faster, culminating at (a measured) 1.5 clock cycles/byte on a Pentium.
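
(For readers who haven't seen the trick: the C shape of a branchless
word counter looks something like the sketch below; this is not Terje's
asm or his exact state-machine formulation, just the general idea of
keeping the in-word/out-of-word state as data.)

  #include <ctype.h>
  #include <stddef.h>

  size_t word_count(const unsigned char *buf, size_t len)
  {
      size_t words = 0;
      int prev = 0;                        /* 1 if previous byte was in a word */
      for (size_t i = 0; i < len; i++) {
          int cur = !isspace(buf[i]);      /* 1 if this byte is in a word */
          words += (size_t)(cur & ~prev & 1);  /* count word starts, no branch */
          prev = cur;
      }
      return words;
  }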

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:42:37 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128 normalization is the hard issue,
    Michael S have figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    How they say? Sometimes a division is just a division.

    I suspect that a model using pre-calculated reciprocals which generate
    ~10+ approximate digits, back-multiply and subtract, repeat once or
    twice, could perform OK.

Having full ~225-bit reciprocals in order to generate the exact result in a single iteration would require 256-bit storage for each of them, and the 256x256->512 MUL would use 16 64x64->128 MULs. But here we do have the possibility of starting from the top: as soon as you get the high 128 bits of the mantissa fixed (modulo any propagating carries from lower down), you could inspect the preliminary result and see that it would usually be far enough away from a tipping point that you could stop there.
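
A 64-bit toy of the back-multiply-and-subtract step (just to show the
shape; the dfp128 case would do this with multi-word values and a much
wider reciprocal):

  #include <stdint.h>

  /* q = x / d for d >= 2, given a precomputed recip = floor(2^64 / d).
     The estimate undershoots by at most a small amount, so a short
     correction loop (the back-multiply and subtract) finishes it. */
  uint64_t div_by_recip(uint64_t x, uint64_t d, uint64_t recip)
  {
      uint64_t q = (uint64_t)(((unsigned __int128)x * recip) >> 64);
      uint64_t r = x - q * d;           /* back-multiply and subtract */
      while (r >= d) { r -= d; q++; }   /* at most a couple of rounds */
      return q;
  }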

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:56:12 2025
    From Newsgroup: comp.arch

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 17:26:44 2025
    From Newsgroup: comp.arch

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement
    where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Nov 5 10:49:10 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton

    For a code analysis, an assigned goto, aka label variables,
    looks equivalent to:
    - make a list of all the target labels assigned to each label variable
    - at each "goto variable" substitute a switch statement with that list

    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:15:00 2025
    From Newsgroup: comp.arch

    On 11/5/2025 3:21 AM, Michael S wrote:
    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

Are you talking about hardware or software?


    I had interpreted it as being about software with BCD helper ops.

    Otherwise, would probably go a different route.

    One other tradeoff is whether to go for Decimal128 in DPD or BID.

Stuff online says BID is better for a software implementation, but I am having doubts. It is possible that DPD could make more sense in both cases, though in the absence of BCD helpers it likely makes sense to map DPD to linear 10-bit values.

While BID could make sense, it has the drawback of assuming some way of quickly performing power-of-10 multiplies on large integer values. If you have a CPU where the fastest way to perform a generic 128-bit multiply is to break it down into 32-bit multiplies, and/or use shift-and-add, it is not a particularly attractive option.

    Contrast, working with 16-bit chunks holding 10 bit values is likely to
    work out being cheaper.
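
For example, once the declets are unpacked to plain 0..999 values (one
per 16-bit lane), per-lane arithmetic is just mod-1000 with a ripple
carry; a minimal sketch (lane count chosen for illustration, Decimal128
proper being 11 declets plus one leading digit):

  #include <stdint.h>

  #define NLANES 12

  /* add two coefficients held as little-endian arrays of 0..999 lanes;
     returns the carry out of the top lane */
  int coeff_add(uint16_t dst[NLANES],
                const uint16_t a[NLANES], const uint16_t b[NLANES])
  {
      unsigned carry = 0;
      for (int i = 0; i < NLANES; i++) {
          unsigned s = a[i] + b[i] + carry;     /* at most 1999 */
          carry = (s >= 1000);
          dst[i] = (uint16_t)(s - (carry ? 1000 : 0));
      }
      return (int)carry;
  }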

    Despite BID being more conceptually similar to Binary128, they differ in
    that Binary128 would only need to use large-integer multiply sparingly (namely, for multiply operations).



    Though, likely fastest option would be to map the DPD values to 30-bit
    linear values, then internally use the 30-bit linear values, and convert
    back to DPD at the end. Though, the performance of this is likely to
    depend on the operation.

    A non-standard variant, representing the value as packed 30 bit fields,
    could likely be the fastest option. Could use the same basic layout as
    the existing Decimal128 format.


So, my guess for a performance ranking, fast to slow:
    1: Dense packed, 30b linear, 30+30+30+20+digit
    2: DPD
    3: BID


    As for whether or not to support Decimal128 (in either form), dunno.

    Closest I have to a use-case is that well, technically there is a
    _Decimal128 type in C, and it might make sense for it to be usable.

    But, then one needs to decide on which possible format to use here.
    And, whether to aim for performance or compatibility.


    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:23:16 2025
    From Newsgroup: comp.arch

    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

       [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be
    said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.


    So, yeah, most likely UB, of a "particularly destructive" / "unlikely to
    be useful" kind.


    FWIW:
    This was not a feature that I feel inclined to support in BGBCC...


    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 5 17:22:48 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

    <computed goto>

    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    In my experience, longjmp is far faster than e.g. C++ exceptions.

    Granted, the code needs to be designed to allow longjmp without
    orphaning or leaking memory (i.e. in a context where there isn't any
    dynamic memory allocation) for the best speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 18:03:31 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too?

    Yes.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer, and possibly
    (if you want to catch GOTO when no variable has been assigned)
    a second variable.

But it adds work for compiler writers - additional effort, warnings, testing, ...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 21:30:11 2025
    From Newsgroup: comp.arch

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:30:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random?

    Another unbiased rounding mode. Not yet available because I don't have
    a truly random source to guide the rounding.

    How about round externally guided (RXG) by an
    input signal?

    I guess that would be OK, but you could not make the statement that
    the rounding mode was unbiased.

    For instance, the rounding could come from a feedback
    filter of some sort.

Sure, it is just that you can't state "unbiased".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:43:58 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:
    ---------------

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

    Where the idea was that the state machine is updated with the current
    state and branch direction, giving the next state and the next predicted
    branch direction (for this state).


    This could model slightly more complex patterns than the 2-bit saturating
    counters, but it is something of a mystery why (for mainstream processors)
    more complex lookup schemes with 2-bit state were preferable
    to a simpler lookup scheme with 5-bit state.
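
    (For concreteness, a minimal sketch of that indexing scheme, with the
    per-entry state shown as the familiar 2-bit saturating counter; the 5/6-bit
    FSM variant would replace the counter update with a lookup in a hand-built
    transition table.  Table sizes and history width here are illustrative.)

    #include <stdint.h>
    #include <stdbool.h>

    #define PRED_BITS 12
    #define PRED_SIZE (1u << PRED_BITS)
    #define HIST_BITS 8

    static uint8_t state[PRED_SIZE];     /* 2-bit saturating counters, 0..3  */
    static uint8_t history[PRED_SIZE];   /* per-branch local taken/not-taken */

    static uint32_t pred_index(uint32_t pc)
    {
        uint32_t slot = (pc >> 2) & (PRED_SIZE - 1);
        return (slot ^ history[slot]) & (PRED_SIZE - 1);
    }

    bool predict(uint32_t pc)
    {
        return state[pred_index(pc)] >= 2;   /* 2,3 = taken; 0,1 = not taken */
    }

    void update(uint32_t pc, bool taken)
    {
        uint32_t slot = (pc >> 2) & (PRED_SIZE - 1);
        uint32_t idx  = (slot ^ history[slot]) & (PRED_SIZE - 1);

        if (taken  && state[idx] < 3) state[idx]++;   /* saturate upward     */
        if (!taken && state[idx] > 0) state[idx]--;   /* saturate downward   */

        history[slot] = (uint8_t)(((history[slot] << 1) | taken)
                                  & ((1u << HIST_BITS) - 1));
    }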

    In 1991 Mike Shebanow, Tse-Yu Yeh, and I tried out a Correlation predictor where strings of {T, !T}** were pattern matched to create a prediction.
    While it was somewhat competitive with Global History Table, it ultimately failed.

    I am now working on predictors for a 6-wide My 66000 machine--which is a bit different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.
    c) Jump Through Table is not predicted through jump-indirect, table-like
    prediction; what is predicted is the value (the switch variable), and this
    is used to index the table (early)
    d) CMOV gets rid of another 8%

    These strip out about 40% of branches from needing prediction, leaving
    the remaining branches harder to predict but with less total
    latency in execution.

    -----------------
    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).

    Tried some of these (1991) mostly with little to no success.
    Be my guest and try again.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:52:22 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right-away that reduced most instructions eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported, which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    May I humbly suggest this is the wrong direction.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:53:59 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    I strongly suspect that IBM is doing something similar :-)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:04:57 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform these others, sometimes by a lot.


    I usually used call threading, because:
    In my testing it was one of the faster options;
    At least if excluding 32-bit x86,
    which often has slow function calls.
    Because pretty much every function needs a stack frame, ...
    It is usable in standard C.

    I have converged on call-threading as a way to eliminate "if-statements":
    -----------------------
    // coreStack, Context, Major, OpRoute, iorTable and the uadd..smin, or,
    // xor, and, cmp operations are assumed to come from the surrounding
    // interpreter, all ops having the same signature as operation().
    extern uint64_t operation( uint64_t src1, uint64_t src2, uint8_t size );

    static uint64_t (*int2op[32])( uint64_t src1, uint64_t src2, uint8_t size ) =
    {   // integer 2-operand decoding table
    /* 00 */ operation,
    /* 01 */ operation,
    /* 02 */ uadd,
    /* 03 */ sadd,
    /* 04 */ umul,
    /* 05 */ smul,
    /* 06 */ udiv,
    /* 07 */ sdiv,
    /* 10 */ cmp,
    /* 11 */ operation,
    /* 12 */ operation,
    /* 13 */ operation,
    /* 14 */ umax,
    /* 15 */ smax,
    /* 16 */ umin,
    /* 17 */ smin,
    /* 20 */ or,
    /* 21 */ operation,
    /* 22 */ xor,
    /* 23 */ operation,
    /* 24 */ and,
    /* 25 */ operation,
    /* 26 */ operation,
    /* 27 */ operation,
    /* 30 */ operation,
    /* 31 */ operation,
    /* 32 */ operation,
    /* 33 */ operation,
    /* 34 */ operation,
    /* 35 */ operation,
    /* 36 */ operation,
    /* 37 */ operation
    };

    /*
     * Integer 16-bit-Immediate Table Caller
     */
    bool intimm16( coreStack *cpu, Context *c, Major I )
    {
        uint64_t  src1 = c->ctx.reg[ I.src1 ],
                  src2 = c->ctx.reg[ I.src2 ],
                 *dst  = &c->ctx.reg[ I.dst ];
        *dst = int2op[ (I.major&15)<<1 ]( src1, src2, 0 );
        return true;
    }

    /*
     * Integer 2-Operand Table Caller
     */
    bool int2oper( coreStack *cpu, Context *c, OpRoute I )
    {
        uint8_t   or   = I.or,                 // operand routing field
                  s    = I.size;
        uint64_t  src1 = c->ctx.reg[ I.src1 ],
                  src2 = c->ctx.reg[ I.src2 ],
                 *dst  = &c->ctx.reg[ I.dst ];
        iorTable[ or ]( *c, I, &src1, &src2 ); // routing may substitute constants
        *dst = int2op[ I.minor ]( src1, src2, s );
        return true;
    }
    -----------------------

    One does not have to check for unimplemented instructions; just place
    a call to the operation() subroutine where they are not defined. The
    operation() subroutine raises an exception which is caught at the
    next instruction fetch.

    I show that both 16-bit-immediate and general 2-operand instructions use
    the same table (with a trifling of bit twiddling).

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).
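
    (For reference, a minimal standard-C sketch of that trace-style call
    threading; the type names and the fixed-size op array are made up for the
    illustration, not taken from the actual interpreter.)

    #include <stddef.h>

    typedef struct Vm    Vm;
    typedef struct Trace Trace;
    typedef void (*OpFn)(Vm *vm);

    struct Trace {
        OpFn     ops[8];                /* unrolled list of opcode handlers  */
        size_t   nops;
        Trace *(*next)(Vm *vm);         /* each trace picks its successor    */
    };

    struct Vm {
        long   acc;                     /* whatever machine state is needed  */
        Trace *entry;
    };

    void run(Vm *vm)
    {
        for (Trace *t = vm->entry; t != NULL; t = t->next(vm))
            for (size_t i = 0; i < t->nops; i++)
                t->ops[i](vm);          /* one indirect call per VM op       */
    }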

    Table-calls are faster than many switches unless you can demonstrate
    the switch is dense and there are no missing cases.

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.

    Just not a fast one...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:06:16 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:21:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction, another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range, instruction<8:5> specifies the routing according to the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants; 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.
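
    (A literal transcription of the table above into C, as an illustrative
    decode sketch: the register file, the 5-bit specifier fields and the
    trailing-word constants are assumed inputs, and only the integer reading
    of Imm5 is shown, not the floating-point half-steps.)

    #include <stdint.h>

    void route_operands(uint8_t route,              /* instruction<8:5>       */
                        const int64_t regs[32],
                        uint8_t src1, uint8_t src2, /* 5-bit specifier fields */
                        int64_t imm32, int64_t imm64,
                        int64_t *s1, int64_t *s2)
    {
        int64_t r1 = regs[src1], r2 = regs[src2];

        switch (route & 0xF) {
        case 0x0: *s1 =  r1;    *s2 =  r2;    break; /* +Src1 +Src2 */
        case 0x1: *s1 =  r1;    *s2 = -r2;    break; /* +Src1 -Src2 */
        case 0x2: *s1 = -r1;    *s2 =  r2;    break; /* -Src1 +Src2 */
        case 0x3: *s1 = -r1;    *s2 = -r2;    break; /* -Src1 -Src2 */
        case 0x4: *s1 =  r1;    *s2 =  src2;  break; /* +Src1 +Imm5 */
        case 0x5: *s1 =  src1;  *s2 =  r2;    break; /* +Imm5 +Src2 */
        case 0x6: *s1 = -r1;    *s2 = -src2;  break; /* -Src1 -Imm5 */
        case 0x7: *s1 =  src1;  *s2 = -r2;    break; /* +Imm5 -Src2 */
        case 0x8: *s1 =  r1;    *s2 =  imm32; break; /* +Src1 Imm32 */
        case 0x9: *s1 =  imm32; *s2 =  r2;    break; /* Imm32 +Src2 */
        case 0xA: *s1 = -r1;    *s2 =  imm32; break; /* -Src1 Imm32 */
        case 0xB: *s1 =  imm32; *s2 = -r2;    break; /* Imm32 -Src2 */
        case 0xC: *s1 =  r1;    *s2 =  imm64; break; /* +Src1 Imm64 */
        case 0xD: *s1 =  imm64; *s2 =  r2;    break; /* Imm64 +Src2 */
        case 0xE: *s1 = -r1;    *s2 =  imm64; break; /* -Src1 Imm64 */
        case 0xF: *s1 =  imm64; *s2 = -r2;    break; /* Imm64 -Src2 */
        }
    }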
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:24:07 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time?

    This is where the call-table approach works better--the scope is well
    defined.

    If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:28:16 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 00:45:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is whether the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined
    Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by
    requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In
    addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 20:41:18 2025
    From Newsgroup: comp.arch

    On 2025-11-05 3:52 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right-away that
    reduced most instructions eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For
    Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit
    instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift
    instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a
    load-immediate instruction.

    May I humbly suggest this is the wrong direction.

    Agree.

    Taking heed of the motto, I have scrapped a bunch of shifted-immediate
    instructions and load immediate. These were present as an alternate means
    to work with large constants. They were really redundant with the ability
    to specify constant overrides (routing) for registers, and they would
    increase the dynamic instruction count (bad!). Scrapping the extra
    instructions will also make writing a compiler simpler.

    One instruction scrapped was an add to IP. So another means of forming
    relative addresses was required: sacrificing a register code (code 32)
    to represent the instruction pointer. This will allow the easy formation
    of IP-relative addresses.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 21:49:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
    I just realized that Qupls2026 does not accommodate small constants very
    well except for a few instructions like shift and bitfield instructions
    which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions
    with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free
    opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

    Qupls strives to be the low-cost processor.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Nov 5 19:20:57 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:24:24 2025
    From Newsgroup: comp.arch

    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. The format of interest is Decimal128: https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
    software implementation, one comparable in speed (say, within a
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy, if it is possible at all.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in the BCD domain.
    Instead, I'd convert the 11 DPD declets to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).
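
    (A sketch of that first conversion step, with dpd_to_bin[] standing in for
    the 1024-entry declet-to-binary table mentioned earlier, to be filled from
    the IEEE 754-2008 DPD decode rules; the function name is made up, and
    GCC/Clang's unsigned __int128 carries the 34-digit accumulator.)

    #include <stdint.h>

    extern const uint16_t dpd_to_bin[1024];   /* declet -> 0..999            */

    /* Convert the 110-bit trailing significand of a Decimal128 (passed as two
     * 64-bit halves) plus the leading digit from the combination field into
     * two base-1e18 "digits".                                               */
    void dfp128_sig_to_base1e18(uint64_t sig_hi, uint64_t sig_lo, unsigned msd,
                                uint64_t *hi18, uint64_t *lo18)
    {
        unsigned __int128 acc = msd;          /* at most 34 decimal digits   */

        for (int i = 10; i >= 0; i--) {       /* declet 10 is most significant */
            int bit = i * 10;                 /* offset within the 110 bits  */
            unsigned declet;
            if (bit >= 64)
                declet = (unsigned)(sig_hi >> (bit - 64)) & 0x3FF;
            else if (bit > 54)                /* declet straddles the halves */
                declet = (unsigned)((sig_hi << (64 - bit)) | (sig_lo >> bit)) & 0x3FF;
            else
                declet = (unsigned)(sig_lo >> bit) & 0x3FF;
            acc = acc * 1000 + dpd_to_bin[declet];
        }
        *hi18 = (uint64_t)(acc / 1000000000000000000ULL);
        *lo18 = (uint64_t)(acc % 1000000000000000000ULL);
    }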

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to the fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shifts/ORs seem inevitable.
    Maybe at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with the performance.

    Overall, even with a seemingly decent plan like the one sketched above,
    I'd expect DPD multiplication to be 2.5x to 3x slower than BID. But, then
    again, in the past my early performance estimates were wrong quite often.






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 08:46:40 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, >>>> Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (there called GO TO ... DEPENDING ON), but those features
    didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:43:57 2025
    From Newsgroup: comp.arch

    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer. An inter-procedural assigned goto is
    no different from an out-of-bounds array access or from an attempt to use
    a pointer to a local variable when the block/function that originally
    declared the variable is no longer active.
    But the compiler should try to detect as many cases of such misuse as it can.


    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an
    extension would be accepted for the Ada standard.)

    Niklas


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:11:54 2025
    From Newsgroup: comp.arch

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer.

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C. I doubt that
    Stallman had any better answer for gcc, but he did not care.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:37:16 2025
    From Newsgroup: comp.arch

    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    The point is that without a deep analysis of the program you cannot be
    sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
    other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many
    functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Do you know any better which of the "..." is executed next?

    You know, without any deep analysis or understanding, that the execution
    goes to one of the cases in the switch, and /not/ into the wild blue yonder.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 13:14:55 2025
    From Newsgroup: comp.arch

    On Thu, 6 Nov 2025 12:11:54 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went
    away between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of
    the) label to which the value refers", which is machine-level
    semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out
    when nothing useful can be said.

    Yes, UB sounnds as the best answer..

    The point is that Ritchie was not satisfied with that answer, which
    is why he removed labels-as-values from his version of C. I doubt
    that Stallman had any better answer for gcc, but he did not care.


    I suspect that the reason was different: DMR had no satisfying answer
    even for some of the intra-procedural cases.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Nov 6 07:44:38 2025
    From Newsgroup: comp.arch

    Taking direction from the VAX’s AOB? (add-one-and-branch) instruction
    and the DBcc instruction of the 68k, the Qupls Rs1 register of a
    compare-and-branch instruction may be incremented or decremented. This
    is really a form of instruction fusion, folding the op performed on the
    branch register into the branch instruction.

    I was thinking of modifying this to support additional ops and constant values. Why just add, if one can shift right or XOR as well? It may be
    useful to increment by a structure size. Also, a ring counter might be
    handy which could be implemented as a right shift. This could be
    supported by adding a postfix word to the branch instruction. It would
    make the instruction wider but it would not increase the dynamic
    instruction count.

    Not sure about the syntax to use for coding such instructions.

    BEQ Rs1,Rs2,label:ADD Rs1,256
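
    (For illustration, the C-level loop shape such a fused operation targets;
    the struct and function here are made up, and the BEQ/ADD spelling above is
    only the proposed syntax.)

    typedef struct { int key; int pad[3]; } item;

    long sum_keys(const item *p, const item *end)
    {
        long sum = 0;
        for (; p != end; p++)       /* p++ steps by sizeof(item) bytes; the   */
            sum += p->key;          /* compare+increment+branch back edge is  */
        return sum;                 /* what the fused instruction expresses   */
    }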


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 07:57:23 2025
    From Newsgroup: comp.arch

    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be discributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.


    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can be, and I understand usually is, implemented
    via an index into a jump table. No self-modifying code required.


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have. BTW, I can accept the argument for keeping
    it in C on the grounds that C is "lower level" than, say, Fortran, COBOL
    or PL/1, and people using it are used to the language allowing "risky"
    constructs.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.


    I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (there called GO TO ... DEPENDING ON), but those features
    didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do.

    No, they are implemented as an indexed jump table.


    And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 17:44:32 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    Agreed. Unfortunately, I have a hard time (i.e. "have not managed")
    convincing abc that both signals are available, and asserting that
    exactly one of them is 1 at any given time, without completely
    blowing up the optimization routines. It also does not handle
    external don't cares. But as I use it purely to play around with
    things, that is not too bad :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 17:52:32 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-05 7:17, Anton Ertl wrote:
    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    You can look at his specification in the documentation of, say, 7th
    edition Unix (where Ritchie apparently took the effort to document
    semantics), and see how he specified that. I doubt he specified
    "semantics in the abstract C machine", but I expect that he specified
    semantics at the C level.

    Concerning how Stallman documented it, you can look at the gcc
    documentation from 2.0 until Stallman passed maintainership on
    (gcc-2.7?).

    If you look at the current documentation <https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html>, it talks
    about the "address of a label" and "jump to one", which you might
    consider to be a machine-level description. You can also describe
    this at a C source level or "C abstract machine" level, but I don't
    expect the description to become any clearer.

The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    The gcc documentation says:

    |You may not use this mechanism to jump to code in a different
    |function. If you do that, totally unpredictable things happen.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:14:54 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:28:19 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:17:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit
    (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that. Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?
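For illustration, option 2 maps naturally onto GNU C labels-as-values;
here is a minimal sketch (my own, not gfortran's or ifort's actual
implementation) that stores a label as a 32-bit offset from a
per-function anchor label:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t assigned;                           /* the 32-bit "INTEGER"    */

    assigned = (int32_t)(&&target - &&anchor);  /* ASSIGN: label as offset */

    goto *(&&anchor + assigned);                /* assigned GOTO           */

anchor:
    puts("fell through anchor");                /* not reached here        */
    return 1;
target:
    puts("reached target");
    return 0;
}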

    How does ifort deal with this problem?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:36:33 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    For 2-operands and 3-operand instructions, they are all present.
    For 1-Operand instructions, only the ones targeting Src2 are
    available and if you use one not allowed you take an OPERATION
    exception.

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    I just realized that Qupls2026 does not accommodate small constants very well except for a few instructions like shift and bitfield instructions which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    1<<const // performed at compile time
    1<<var // 1-instruction {1-word in My 66000}

    17/var // 1-instruction {1-word}

    You might notice My 66000 does not even HAVE a SUB instruction,
    instead:

    ADD Rd,Rs1,-Rs2

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

Out of the 64-slot Major OpCode space, 23 slots are left over, 6 reserved
    in perpetuity to catch random jumps into integer or fp data.

    Qupls strives to be the low-cost processor.

    My 66000 strives to be the low-instruction-count processor.

    But remember, ISA is only the first 1/3rd of an architecture.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:39:55 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses. In My case, one always
has access to larger constants at the same instruction-count price,
just with a larger code footprint.
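For illustration only, here is a minimal C sketch of one way such a
decode could look, assuming (my guess, not the actual My 66000 ROM
contents) that the 5-bit specifier selects a magnitude in steps of 0.5
and the operand-routing row supplies the sign:

#include <stdio.h>

/* Stand-in for a 32-entry constant ROM: entry i would hold i * 0.5,
   giving magnitudes 0.0 .. 15.5; the sign comes from the routing row
   (+Imm5 vs -Imm5).  This is a guess at the scheme, not the real ROM. */
static double fp_imm5(unsigned imm5, int negate)
{
    double v = (imm5 & 31) * 0.5;
    return negate ? -v : v;
}

int main(void)
{
    printf("%g %g\n", fp_imm5(31, 0), fp_imm5(1, 1));  /* 15.5 -0.5 */
    return 0;
}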

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:45:41 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
case add:
    ...
    break;
case load:
    ...
    break;
case store:
    ...
    break;
    ...
    }
    }
    }

    Now let us look at it with tabularized functions:: {Ignore the
    interrupt and exception stuff at your peril}

    bool RunInst( Chip chip )
    {
    for( uint64_t i = 0; i < cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t = cpu->context[cs];
    Inst I;

if( cpu->interrupt & ((((int64_t)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
t = cpu->context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
else if( uint16_t raised = cpu->raised & cpu->enabled )
    { // take an exception
    cpu->cs--;
t = cpu->context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I.inst;
    t->reg[2] = I.src1;
    t->reg[3] = I.src2;
    t->reg[4] = I.src3;
    }
    else
    { // run an instruction
    t->ip += memory( FETCH, t->ip, &I.inst );
    t->raised |= majorTable[ I.major ]( cpu, t, &I );
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e. change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

As did COBOL, called goto depending on, but those features didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 6 13:11:10 2025
    From Newsgroup: comp.arch

    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128. https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed (say, within
factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



BTW, at least for multiplication, I would probably not do my
arithmetic in the BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
    I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
    4x 32-bit values each holding 9 digits
    Except the top one generally holding 7 digits.
    16-bit exponent, sign byte.
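For illustration, a rough C sketch of such an unpacked working form
(field names are mine, not the mockup's):

#include <stdint.h>
#include <stdio.h>

/* 34 decimal digits held as four base-1e9 limbs (the top limb normally
   holding only 7 digits), plus a decimal exponent and a sign byte. */
typedef struct {
    uint32_t limb[4];   /* limb[0] = least-significant 9 digits */
    int16_t  exponent;  /* decimal exponent */
    uint8_t  sign;      /* 0 = positive, 1 = negative */
} dec128_unpacked;

int main(void)
{
    dec128_unpacked x = { { 999999999u, 999999999u, 999999999u, 9999999u },
                          0, 0 };   /* largest 34-digit significand */
    printf("top limb %u, exponent %d\n", x.limb[3], x.exponent);
    return 0;
}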

    Then wrote a few pack/unpack scenarios:
    X30: Directly packing 20/30 bit chunks, non-standard;
    DPD: Use the DPD format;
    BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
    X30 is around 10x faster than either DPD or BID;
    Both DPD and BID need a similar amount of time.
    BID needs a bunch of 128-bit arithmetic handlers.
    DPD needs a bunch of merge/split and table lookups.
    Seems to mostly balance out in this case.


    For DPD, merge is effectively:
    Do the table lookups;
    v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
    v0=v;
    v1=v/1000;
    v0-=v1*1000;
    v2=v1/1000;
    v1-=v2*1000;
    Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).
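As a concrete illustration of the merge/split just described, a minimal
C sketch follows; dpd_to_bin() and bin_to_dpd() stand in for the real
declet lookup tables and are stubbed with an identity mapping purely so
the sketch runs:

#include <stdint.h>
#include <stdio.h>

/* Placeholders: the real tables map a 10-bit DPD declet to its 3-digit
   value (0..999) and back. */
static uint32_t dpd_to_bin(uint32_t declet) { return declet % 1000u; }
static uint32_t bin_to_dpd(uint32_t v)      { return v; }

/* Merge three declets into one base-1e9 chunk. */
static uint32_t merge3(uint32_t d0, uint32_t d1, uint32_t d2)
{
    return dpd_to_bin(d0) + dpd_to_bin(d1) * 1000u
                          + dpd_to_bin(d2) * 1000000u;
}

/* Split a base-1e9 chunk back into three declets; the divides by a
   constant become multiplies by a reciprocal, as noted above. */
static void split3(uint32_t v, uint32_t *d0, uint32_t *d1, uint32_t *d2)
{
    uint32_t v1 = v / 1000u;
    uint32_t v0 = v - v1 * 1000u;
    uint32_t v2 = v1 / 1000u;
    v1 -= v2 * 1000u;
    *d0 = bin_to_dpd(v0); *d1 = bin_to_dpd(v1); *d2 = bin_to_dpd(v2);
}

int main(void)
{
    uint32_t d0, d1, d2, v = merge3(123, 456, 789);   /* 789456123 */
    split3(v, &d0, &d1, &d2);
    printf("%u -> %u %u %u\n", v, d0, d1, d2);
    return 0;
}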


    At first it seemed like a strong reason to favor X30 over either DPD or
BID. Except that the cost of the ADD and MUL operations effectively
    dwarf that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


As is, it seems MUL and ADD cost roughly 6x more than the cost of the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that
    would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to
    less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
    12x 3 digits (16b chunk)
    4x 9 digits (32b chunk)
    2x 18 digits (64b chunk)
    3x 12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


The 3x 12 digit option, while not exactly the densest scheme, leaves a little
    more "working space" so would reduce cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to
    working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 19:38:54 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:04:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that?

    No, that would be beyond horrible.

    What about regular goto and
    computed goto?

    Neither; according to F77, it must be "defined in the same program
    unit".

    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:07:16 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    Compiler writers should never box themselves in like that.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    That would make jumps very inefficient.

    Of course, if Fortran
    assigns labels between shared libraries and the main program,

    It does not.

    How does ifort deal with this problem?

    I have no idea, and no inclination to find out; check out
    assembly code at godbolt if you are really interested.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 12:14:33 2025
    From Newsgroup: comp.arch

    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point
    instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction
    stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 20:24:23 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?

    How does ifort deal with this problem?

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 16:24:28 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,
    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 21:59:31 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that

    There is a space between the y and the 6 in My 66000.

    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Nov 6 22:09:25 2025
    From Newsgroup: comp.arch

    It appears that MitchAlsup <user5857@newsgrouper.org.invalid> said:
    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Relatively speaking, yeah. In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy program, 13 lines of Algol60 horror for which Knuth himself got the results wrong.
--
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 22:53:09 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}
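For what it is worth, a minimal sketch of that jumpbuf trick (my own
illustration, not actual code from this thread; it is only well-defined
while the function that called setjmp() is still active):

#include <setjmp.h>
#include <stdio.h>

int main(void)
{
    jmp_buf label_var;          /* plays the role of a label variable  */
    volatile int visits = 0;    /* volatile so it survives the longjmp */

    setjmp(label_var);          /* "ASSIGN target TO label_var"        */
    printf("at target, visit %d\n", visits);

    if (++visits < 3)
        longjmp(label_var, 1);  /* "GOTO label_var"                    */

    return 0;
}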

    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 22:21:05 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    The discussion above shows that whether a label value is implemented as
a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    Ritchie designed lots of features into C for which the C
    standardization committee later decided that some cases are undefined behaviour. I don't think that Ritchie had any qualms at designing
    something like labels-as-values with unchecked limitations (what would
    later become undefined or implementation-defined behaviour), or
    documenting these limitations.

    Here is my attempt (from 1999) at a specification for
    labels-as-values:

|"goto *<expr>" [or whatever the syntax was] is equivalent to "goto <label>"
|if <expr> evaluates to the same value as the expression "&&<label>" [or
|whatever the syntax was]. If <expr> does not evaluate to a label of the
|function that contains the "goto *<expr>", the result is undefined.

The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results.

    Gforth certainly passes the labels out, for use by the compiler that
    generates the VM code.

    In
addition, the use of an uninitialized label-valued variable should be prevented or detected.

    Using an uninitialized variable is undefined behaviour in C, but not
    prevented, and not always detected (compilers emit warnings in some
    cases when they detect a use of an uninitialized variable). Why
should it be any different for an uninitialized variable used with
    "goto *"?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 20:10:19 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.
    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}

    Or worse... shoot yourself in the foot and then step in a cow pie.
    I hate when that happens.

and it was up to me to make sure the registers were all handled correctly.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 06:55:08 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Some people use auto-generated code (for example from computer
    algebra systems), which generate really, really long procedures.
    A good stress-test for compilers, too; they tend to expose
    O(n^2) or worse behavior where nobody looked. So it is good that
    branch instructions within functions are expanded by the assembler
    if needed :-)

    Even having 64-bit offsets like My 66000 can lead into a trap (and will
    require future optimization work on the compiler). This is a simplified version of something that came up in a PR.

    SUBROUTINE FOO
    DOUBLE PRECISION A,B,C,D,E
COMMON A,B,C,D,E
    C very many statements involving A,B,C,D,E

    If you load and store each access to one of the variables via its
64-bit address, you can end up using very many 96-bit instructions,
    where a single load of the base address of the COMMON block would
    save a lot of code space at the expense of a single instruction
    at the beginning.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:06:41 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:
    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.
    ...
Yes, UB sounds like the best answer.

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C.

    He did not write that, and given the rest of C, I very much doubt that
    this was the reason.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:08:42 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    [Fortran's assigned goto]
Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is
doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    This is the problem that Stephen Fuld mentioned, and that is actually
    a practical problem that I have experience in some cases when
    debugging programs with indirect control flow, usually with various
    forms of indirect calls, e.g., method calls. I have not experienced
    it for threaded-code interpreters that use labels-as-values (as
    outlined above), because there I can always look at ip[0], ip[1]
    etc. to see where the next executions of goto *ip will go.

    The point is that without a deep analysis of the program you cannot be
sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    That has never been a problem in my experience, and I have been using labels-as-values since 1992. Up to gforth-0.6 (2003), all instances
    of &&label and all instances of goto *expr were in the same function,
    so if labels had a separate type, that could not be converted by
    casts, the analysis would be trivial, at least if GNU C was an
    Ada-like language, where labels have their own type that cannot be
    converted to other types. As it is, Fortran's assigned goto uses
    integer numbers, and labels-as-values uses void *, so if anybody was
    really interested in performing such an analysis, they would have a
    lot of work to do. But the design of these features with using
    existing types makes it obvious that performing such an analysis was
    not intended.

    Interestingly, if somebody wanted to work in that direction, checking
    at run-time that the target of a goto is inside the function that
    contains the goto is easy and not particularly expensive. With the
    newfangled "control-flow integrity" features in hardware, you could
    even check relatively cheaply that only &&label instances are targets
    of goto *.
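For illustration, a minimal GNU C sketch of such a run-time check
(names are hypothetical, not Gforth's code): the function hands out
its label table, and later verifies a target against that table before
performing the computed goto.

#include <stdio.h>
#include <stdlib.h>

#define N_LABELS 3

static void **engine(void *target)
{
    static void *labels[N_LABELS] = { &&add, &&load, &&store };

    if (target == NULL)          /* hand the label table out, as a VM */
        return labels;           /* compiler would want               */

    int ok = 0;
    for (int i = 0; i < N_LABELS; i++)
        ok |= (target == labels[i]);   /* cheap membership check */
    if (!ok) {
        fprintf(stderr, "goto target is not a label of engine()\n");
        abort();
    }
    goto *target;

add:   puts("add");   return NULL;
load:  puts("load");  return NULL;
store: puts("store"); return NULL;
}

int main(void)
{
    void **labels = engine(NULL);
    engine(labels[1]);             /* prints "load"                   */
    engine((void *)main);          /* aborts: not a label of engine() */
    return 0;
}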

    Ok, so what about gforth-0.6 (2003) and later? First of all, they
    contain two functions with goto * and &&label instances, so the
    trivial analysis would no longer work. Has there ever been any mixup
    where a goto * jumped to a label in the other function? Not that I
    know of; if it happened, it would actually work, because the two
    functions are identical apart from some code-space padding.

    What's more relevant is that gforth-0.6 added code-copying dynamic
    native code generation: It copies code snippets (using the addresses
    gotten with &&label to determine where they start and where they end)
    to some RWX data region, concatenating the snippets in this way,
    resulting in a compiled program in the RWX region. It then uses one
    of the goto * in one of the functions to actually start executing this dynamically-generated code.

    This is probably outside of what Stallman had in mind for
    labels-as-values, but fortunately Stallman did not try to limit what
    can be done to what he had in mind, the way that many programming
    language designers do, and the way that many people discussing
    programming languages think. This is a feature that Ritchie's C also
    has, which cannot be said about the C of people who think that
    "undefined behaviour" is enough justification to declare a program
    "buggy".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:09:02 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
it isn't true. As I said, in the hands of a good programmer, you might assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you mentioned method calls as
'just a more expensive "label"'; there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas
    Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?
    I bet that it ends up in self-modifying code, too, because these
    architectures usually don't have indirect jumps through jump tables,
    either. If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs?

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

Does the IBM 704 (for which FORTRAN was originally designed)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:32:08 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.

    The benefit I see from that is that data-flow analysis must only
    consider the control flows from the assigned goto to these targets and
    not to all assigned labels (in contrast to labels-as-values), and
    conversely, if every assigned goto has such a list, data-flow analysis
    knows more precisely which gotos can actually jump to a given label.

    This would make a small difference in Gforth since 0.6, which has
    introduced hybrid direc/indirect-threaded code, and where some goto *
    are for indirect-threaded dispatches, and some labels are only reached
    from these goto * instances, and a certain variable is only alive
    across these jumps. GNU C does not have this option, so what we did
    instead is to kill the variable right before all the gotos that do not
    jump to these labels.

    It might also help with static stack caching: There are stack states
    with 0-n stack items in registers, and a particular VM instruction
    code snippet starts in a particular state (say, 2 stack items in a
    register) and ends with another state S (say, 1 stack item in a
    register). It will jump to code that expects the same state S. All
    variables that contain stack items beyond what S has are dead at that
    point. If we could tell that the goto * from state S only goes to
    targets in state S, the data-flow analysis could determine that.
    Instead, what we do is to kill these additional variables in a subset
    of uses. When we tried to kill them at all uses, the quality of the
    code produced by gcc deteriorated significantly.

    This variable-killing happens by having empty asm statements that
    claim to write to these variables, so if this is used incorrectly, the
    produced code will be incorrect. So the benefit of this assigned-goto
    feature would be to replace a dangerous feature with another dangerous
    one: if you fail to list all the jumped-to labels, the data-flow
    analysis would be wrong, too. It seems more elegant to describe the
    actual control flow, and then let the data-flow analysis do its work
    than the heavy-handed direct influence on the data-flow analysis that
    our variable-killing does.
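
    As a minimal sketch (not Gforth's actual code) of the kind of
    variable-killing described above: a toy three-opcode VM using GNU C
    labels-as-values, where an empty asm claims to write the hypothetical
    stack-cache variable tos2, so its previous value is treated as dead
    right before the following goto *:

        #include <stdio.h>

        static int run(const int *prog)
        {
            long tos = 0, tos2 = 0;              /* hypothetical stack cache */
            void *dispatch[] = { &&op_lit, &&op_add, &&op_end };
            const int *ip = prog;

            goto *dispatch[*ip++];
        op_lit:
            tos2 = tos; tos = *ip++;
            goto *dispatch[*ip++];
        op_add:
            tos += tos2;
            asm ("" : "=r"(tos2));               /* "kill" tos2 here */
            goto *dispatch[*ip++];
        op_end:
            return (int)tos;
        }

        int main(void)
        {
            int prog[] = { 0, 2, 0, 3, 1, 2 };   /* lit 2, lit 3, add, end */
            printf("%d\n", run(prog));           /* prints 5 */
            return 0;
        }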

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 15:26:38 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy program, 13 lines of Algol60 horror that Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    The main horror in the original version is that for some of the Algol
    60 syntax that is used, it is not obvious without studying the Algol
    60 report what it means. <https://rosettacode.org/wiki/Man_or_boy_test#ALGOL_60> contains some discussion, and one can find it in various other programming
    languages, more or (often) less close to the original. The discussion
    at <https://rosettacode.org/wiki/Man_or_boy_test#TXR> and the
    difference between the "proper job" version and the "crib the Common
    Lisp or Scheme solution" version gives some insight.

    The fact that "less close" also produces the correct result suggests
    that the man-or-boy test is less discerning than Knuth probably
    intended. That's a common problem with testing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 08:26:41 2025
    From Newsgroup: comp.arch

    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you

    I think the attributions are messed up, as I didn't say what you next
    say I said.


    mentioned method calls as
    'just a more expensive "label"', there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I =1, 20 if I = 2, etc

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    No code modification nor indirection required .

    Yes, it does require execution of an "extra" jump instruction.
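
    (As an aside, a rough modern analog, as a hedged C sketch with made-up
    case bodies: a dense switch like the one below is a classic candidate
    for jump-table lowering by gcc/clang, i.e. a bounds check, an indexed
    load of a code address, and one indirect jump -- the moral equivalent
    of the instruction sequence above, still with no code modification.)

        int computed_goto(int i)      /* i in 1..4, as in the Fortran example */
        {
            switch (i) {
            case 1: return 10;        /* stands in for the code at label 10 */
            case 2: return 20;
            case 3: return 30;
            case 4: return 40;
            default: return -1;       /* out-of-range check */
            }
        }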


    I bet that it ends up in self-modifying code, too, because these architectures usually don't have indirect jumps through jump tables,
    either.

    Not required.


    If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.


    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

    Does the IBM 704 (for which FORTRAN has been designed originally)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    I don't know what the 704 implemented, but I have shown above self
    modifying code is not necessary for computed goto, and I suspect
    assigned goto was implemented with self modifying code. But as I said,
    back then self modifying code was not considered as bad as it is now.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 17:29:07 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:

    [...]

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?

    If you look closely, you'll see pi/180 in that list. But pi is
    also there (I cut it off the list), it occurs 11 times. And the
    large numbers are +/- DBL_MAX*0.5, I don't know what they are
    used for.

    By comparison, here are the values which are most frequently
    contained in GSL:

    5-bit constants: 5148
    32-bit constants: 3769
    64-bit constants: 3140
    2678 1
    1518 0
    687 -1
    424 2
    329 0.5
    298 -2
    291 2.22044604925031e-16
    275 4.44089209850063e-16
    273 3
    132 -3
    131 -0.5
    131 3.14159265358979
    88 4
    86 1.34078079299426e+154
    77 6
    70 0.25
    70 5
    68 2.2250738585072e-308
    66 10
    64 -4
    50 -6
    46 0.1
    45 5.87747175411144e-39
    43 0.333333333333333
    42 1e+50
    38 6.28318530717959
    35 9
    31 0.2
    30 7
    30 -0.25

    [...]

    So, having values between -15.5 and +15.5 is a choice that will
    cover quite a few floating point constants. For different packages,
    FP constant distributions probably vary too much to create something
    that is much more useful.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 17:15:59 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I =1, 20 if I = 2, etc

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    No code modification nor indirection required .

    The "Jump $,R1" is an indirect jump. With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>> architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bill Findlay@findlaybill@blueyonder.co.uk to comp.arch on Fri Nov 7 17:54:33 2025
    From Newsgroup: comp.arch

    On 7 Nov 2025, Anton Ertl wrote
    (in article<2025Nov7.162638@mips.complang.tuwien.ac.at>):

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy
    program, 13 lines of Algol60 horror that Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    I append a run of MANORBOY in Pascal for the KDF9.
    No display was used.
    A static frame pointer as part of the functional parameter
    suffices logically and gives better performance.

    Paskal : the KDF9 Pascal cross-compiler V19.2a, compiled ... on 2025-11-07.
    1 u | %storage = 32767
    2 u | %ystores = 30100
    3 u |
    4 u | program MAN_OR_BOY;
    5 u |
    6 u | { See: }
    7 u | { "Man or boy?", }
    8 u | { by Donald Knuth, }
    9 u | { ALGOL Bulletin 17.2.4, p7; July 1964. }
    10 u |
    11 u | var
    12 u | i : integer;
    13 u | function A (
    14 u | k : integer;
    15 u | function x1 : integer;
    16 u | function x2 : integer;
    17 u | function x3 : integer;
    18 u | function x4 : integer;
    19 u | function x5 : integer
    20 u | ) : integer;
    21 u |
    22 u | function B : integer;
    23 u 1b| begin
    24 u | k := k - 1;
    25 u | B := A (k, B, x1, x2, x3, x4);
    26 u 1e| end { B };
    27 u |
    28 u 1b| begin { A }
    29 u | if k <= 0 then
    30 u | A := x4 + x5
    31 u | else
    32 u | A := B;
    33 u 1e| end { A };
    34 u |
    35 u | function pos_one : integer;
    36 u | begin pos_one := 1 end;
    37 u |
    38 u | function neg_one : integer;
    39 u | begin neg_one := -1 end;
    40 u |
    41 u | function zero : integer;
    42 u | begin zero := 0 end;
    43 u |
    44 u 1b| begin { MAN_OR_BOY }
    45 u | rewrite(1, 3);
    46 u | for i := 0 to 11 do
    47 u | write(A(i, pos_one, neg_one, neg_one, pos_one, zero):6);
    48 u | writeln;
    49 u 1e| end { MAN_OR_BOY }.

    Compilation complete : 0 error(s) and 0 warning(s) were reported.
    ...
    This is ee9 17.0a, compiled by GNAT ... on 2025-11-07.
    Running the KDF9 problem program Binary/MANORBOY
    ...
    Final State: Normal end of run.
    ...
    LP0 on buffer #05 printed 1 line.

    LP0:
    ===
    1 0 -2 0 1 0 1 -1 -10 -30 -67 -138
    ===
    --
    Bill Findlay

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 10:45:39 2025
    From Newsgroup: comp.arch

    On 11/7/2025 9:15 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I =1, 20 if I = 2, etc
    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    It is generic enough that it could be lots of architectures, but the one
    I know best is the Univac 1100.



    No code modification nor indirection required .

    The "Jump $,R1" is an indirect jump.

    Perhaps we just have a terminology disagreement. I don't call that
    indirect addressing. The 1100 architecture supports indirect addressing
    in the hardware. An indirect reference was represented in the assembler
    by an asterisk preceding the label, which set a bit in the instruction
    that told the hardware to go to the address specified in the instruction
    and treat what it found there as the address of the operand for the instruction.

    So, for example:

    J *tag

    tag finaladdress

    would cause the hardware to fetch the address at tag and use that as the operand, thus causing a jump to "final address".

    This is what I call indirect addressing.

    So to use this in an assigned goto, the assign statement would store the desired address at tag such that when the jump was executed, it would
    jump to the desired address.

    I call the construct with several consecutive jump instructions an
    indexed jump, not an indirect one.



    With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1


    Yes.
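
    For what it is worth, the modern equivalent of this pattern in GNU C
    (a hedged sketch, labels made up) stores a code address in a variable
    and then performs exactly one indirect jump through it, which is
    typically what compilers emit for goto * today:

        void assigned_goto_demo(int cond)
        {
            void *x = cond ? &&l10 : &&l20;   /* "ASSIGN 10/20 TO X" */
            goto *x;                          /* "GOTO X"            */
        l10:
            /* ... code for label 10 ... */
            return;
        l20:
            /* ... code for label 20 ... */
            return;
        }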


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>>> architectures using self-modifying code are bad by association, then >>>>> we have to get rid of all of these language features ASAP.

    And, by an large they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was
    referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    True. But we still have my original argument, better expressed by
    Niklas about code readability/followability.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 14:28:48 2025
    From Newsgroup: comp.arch

    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever.  It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed  (say, within
    factor of 1,5-2) to competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
      I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
      4x 32-bit values each holding 9 digits
        Except the top one generally holding 7 digits.
      16-bit exponent, sign byte.
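
    A minimal sketch of that sort of unpacked working format (field names
    are my own guess, not the actual code):

        #include <stdint.h>

        typedef struct {
            uint32_t dig[4];   /* dig[0]=low 9 digits .. dig[3]=top 7 digits */
            int16_t  exp;      /* decimal exponent */
            uint8_t  sign;     /* 0 = positive, 1 = negative */
        } dfp128_unpacked;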

    Then wrote a few pack/unpack scenarios:
      X30: Directly packing 20/30 bit chunks, non-standard;
      DPD: Use the DPD format;
      BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
      X30 is around 10x faster than either DPD or BID;
      Both DPD and BID need a similar amount of time.
        BID needs a bunch of 128-bit arithmetic handlers.
        DPD needs a bunch of merge/split and table lookups.
        Seems to mostly balance out in this case.


    For DPD, merge is effectively:
      Do the table lookups;
      v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
      v0=v;
      v1=v/1000;
      v0-=v1*1000;
      v2=v1/1000;
      v1-=v2*1000;
      Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).


    At first it seemed like a strong reason to favor X30 over either DPD or
    BID. Except that the cost of the ADD and MUL operations effectively
    dwarfs that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to less clutter, etc. Though, this part would be less bad if C had had widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
      12x  3 digits (16b chunk)
      4x   9 digits (32b chunk)
      2x  18 digits (64b chunk)
      3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still sometimes exceed it; and isn't that much more than 9 digits (but would reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being notably slower.

    However, if running on RV64G with the standard ABI, it is likely the 9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a value).


    The 3x 12 digit scheme, while not exactly the densest, leaves a little
    more "working space" so would reduce cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
    2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
    Any examples of hard-coded numbers in this format on the internet;
    Any obvious way to generate them involving "stuff I already have".
    As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's
    "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go
    down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful; I would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, rates in MHz (millions
    of times per second), on my desktop PC:
    DPD Pack/Unpack: 63.7 MHz (58 cycles)
    X30 Pack/Unpack: 567 MHz ( 7 cycles) ?...

    FMUL (unwrap) : 21.0 MHz (176 cycles)
    FADD (unwrap) : 11.9 MHz (311 cycles)

    FDIV : 0.4 MHz (very slow; Newton Raphson)

    FMUL (DPD) : 11.2 MHz (330 cycles)
    FADD (DPD) : 8.6 MHz (430 cycles)
    FMUL (X30) : 12.4 MHz (298 cycles)
    FADD (X30) : 9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly
    related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the
    input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
    DPD cost: 51 cycles.
    X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
    DPD case does a whole lot of stuff;
    X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning
    structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
    S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
    MUL and ADD use double-width internal mantissa, so should be accurate;
    Current test doesn't implement rounding modes though, could do so.
    Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't
    close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for
    Binary-FP isn't quite as effective.
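
    For reference, the Newton-Raphson update both cases rely on, as a
    one-line sketch in plain binary double (the DFP version does the same
    multiply and subtract in the decimal format): given an estimate r of
    1/d, the error roughly squares each step once the seed is close enough.

        static double nr_recip_step(double d, double r)
        {
            return r * (2.0 - d * r);   /* r' = r*(2 - d*r) ~ 1/d */
        }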


    ...


    Still don't have a use-case, mostly just messing around with this...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 7 22:57:14 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10 bits in could be the packed DFP representation (it's denser and
    has smaller tables). This way, table lookup overlaps unpacking.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 20:23:40 2025
    From Newsgroup: comp.arch

    On 11/7/2025 4:57 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10 bits in could be the packed DFP representation (it's denser and
    has smaller tables). This way, table lookup overlaps unpacking.


    FWIW: Dump of the test code as it exists...
    https://pastebin.com/NcvCi5gD

    I had since found the decNumber library, and with this was able to
    confirm that I had in fact figured out the specifics of the format (I
    was unsure whether or not my version was correct, as I had implemented
    it based mostly on descriptions of the format on Wikipedia, which were
    not entirely consistent).

    Otherwise, experiment / proof of concept.
    Unlikely to actually be useful.



    Way I had usually started out with binary FDIV/reciprocal:
    Turn the reciprocal into a modified integer subtract;
    Or, subtract for HOB's, everything else is a bitwise inversion.
    Can often get within the top 4 bits of the mantissa or so.
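
    A minimal sketch of that binary-FP "modified integer subtract" seed
    (the magic constant below is an untuned assumption of mine, chosen so
    the exponent is roughly negated around the bias and the mantissa bits
    roughly inverted; it is not the actual value used):

        #include <stdint.h>
        #include <string.h>

        static double recip_seed(double x)
        {
            uint64_t u;
            memcpy(&u, &x, sizeof u);          /* reinterpret the bits      */
            u = 0x7FDE000000000000ull - u;     /* assumed, untuned constant */
            memcpy(&x, &u, sizeof u);
            return x;                          /* rough estimate of 1/x     */
        }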

    Way I had tried to do so for decimal:
    Invert the exponent in a similar way as binary FP;
    Set the mantissa to the 9s complement value.


    Issue:
    The 9s complement method doesn't give a value particularly close to the
    actual target value.

    For example:
    Taking the reciprocal of 3.14159x, I get 0.685840x, but actual target is 0.318309x.

    Like, I almost may as well just leave the mantissa as-is, or fill it
    with all 5s or something.


    Granted, feeding the high 3 digits through a lookup table and just
    setting all the low digits to whatever is probably also an option, and probably faster than using an initial coarse convergence to try to get
    it somewhere in the right general area.


    I realized after finding decNumber and using it to generate a test
    number, that it seems to use the format in a very different way,
    effectively keeping the value right-aligned and normalized, rather than left-aligned and normalized.

    My code sort of assumed keeping values normalized (as with traditional floating point).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:18:08 2025
    From Newsgroup: comp.arch

    On 2025-11-07 3:28 p.m., BGB wrote:
    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever.  It is relatively cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed  (say, within
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
       I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
       4x 32-bit values each holding 9 digits
         Except the top one generally holding 7 digits.
       16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
       X30: Directly packing 20/30 bit chunks, non-standard;
       DPD: Use the DPD format;
       BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
       X30 is around 10x faster than either DPD or BID;
       Both DPD and BID need a similar amount of time.
         BID needs a bunch of 128-bit arithmetic handlers.
         DPD needs a bunch of merge/split and table lookups.
         Seems to mostly balance out in this case.


    For DPD, merge is effectively:
       Do the table lookups;
       v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
       v0=v;
       v1=v/1000;
       v0-=v1*1000;
       v2=v1/1000;
       v1-=v2*1000;
       Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by
    constant into multiply by reciprocal).


    At first it seemed like a strong reason to favor X30 over either DPD
    or BID. Except that the cost of the ADD and MUL operations
    effectively dwarfs that of the pack/unpack operations, so the relative
    cost difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something
    that would lead to X30 being a decisive win either in terms of
    performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a
    128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due
    to less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after
    normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
       12x  3 digits (16b chunk)
       4x   9 digits (32b chunk)
       2x  18 digits (64b chunk)
       3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations
    fully fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


    The 3x 12 digit scheme, while not exactly the densest, leaves a
    little more "working space" so would reduce cases which exceed the
    limits of 64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays
    within the limits of 64-bit arithmetic (where multiply temporarily
    widens to working with 18 digits, but then narrows back to 9 digit
    chunks).

    Also 9 digit chunking may be preferable when one has a faster
    32*32=>64 bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
      Any examples of hard-coded numbers in this format on the internet;
      Any obvious way to generate them involving "stuff I already have".
        As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful; I would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, rates in MHz (millions
    of times per second), on my desktop PC:
      DPD Pack/Unpack: 63.7 MHz (58 cycles)
      X30 Pack/Unpack: 567 MHz  ( 7 cycles) ?...

      FMUL (unwrap)  : 21.0 MHz (176 cycles)
      FADD (unwrap)  : 11.9 MHz (311 cycles)

      FDIV           :  0.4 MHz (very slow; Newton Raphson)

      FMUL (DPD)     : 11.2 MHz (330 cycles)
      FADD (DPD)     :  8.6 MHz (430 cycles)
      FMUL (X30)     : 12.4 MHz (298 cycles)
      FADD (X30)     :  9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
      DPD cost: 51 cycles.
      X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
      DPD case does a whole lot of stuff;
      X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
      S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
      MUL and ADD use double-width internal mantissa, so should be accurate;
      Current test doesn't implement rounding modes though, could do so.
        Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
      Exponent is wrong:
        Scale by factors of 2 until correct;
      Off by more than 50%, scale by +/- 25%;
      Off by more than 25%, scale by +/- 12.5%;
      Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for Binary-FP isn't quite as effective.


    ...


    Still don't have a use-case, mostly just messing around with this...



    When I built my decimal float code I ran into the same issue. There are
    not really examples on the web. I built integer to decimal-float and decimal-float to integer converters then compared results.

    Some DFP encodings for 1,10,100,1000,1000000,12345678, and 2 (I hope these are
    right, no guarantees).
    Integer decimal-float
    u 00000000000000000000000000000001 25ffc000000000000000000000000000
    u 0000000000000000000000000000000a 26000000000000000000000000000000
    u 00000000000000000000000000000064 26004000000000000000000000000000
    u 000000000000000000000000000003e8 26008000000000000000000000000000
    u 000000000000000000000000000f4240 26014000000000000000000000000000
    u 00000000000000000000000000bc614e 2601934b9c0c00000000000000000000
    u 00000000000000000000000000000002 29ffc000000000000000000000000000


    I have used the decimal float code (96 bit version) with Tiny BASIC and
    it seems to work.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:30:36 2025
    From Newsgroup: comp.arch

    Cache-line constants were tried with the StarkCPU and seemed to work
    fine, but wasted cache-line space when constants and instructions could
    not be packed evenly into the cache-line.

    However, for Qupls2026, storing constants on the cache-line might be
    just as efficient storage-wise as having the constants follow the
    instruction words, because of the 48-bit word width. Constants typically
    do not need to be multiples of 48 bits. If stored on the cache-line they
    could be multiples of 16 bits. There are potentially 32 bits of wasted
    space if an instruction is not able to be packed onto the cache-line.
    There may be just as much wasted space due to the support of over-sized
    constants in-line with 48-bit parcels. A 32-bit constant uses 48 bits,
    wasting 16 bits of storage.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 8 00:34:37 2025
    From Newsgroup: comp.arch

    <snip>
    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    <snip>

    I multiplied PI by 10^31 and ran it through the int to decimal-float converter. It should give the same sequence of digits although the
    exponent may be off.

    2e078c2aeb53b3fbb4e262d0dab5e680

    The sequence of digits is the same, except it begins C2 instead of C1.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2