Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers.
GPRs may contain either integer or
floating-point values.
Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.
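A minimal C sketch of how such a compare bit vector might behave; the
bit assignments here are illustrative, not the actual Qupls4 encoding:

#include <stdint.h>
/* Compare writes a vector of condition bits into an ordinary GPR. */
uint64_t cmp_bits(int64_t a, int64_t b) {
    uint64_t r = 0;
    r |= (uint64_t)(a == b) << 0;                     /* EQ             */
    r |= (uint64_t)(a <  b) << 1;                     /* LT (signed)    */
    r |= (uint64_t)((uint64_t)a < (uint64_t)b) << 2;  /* LTU (unsigned) */
    return r;
}
/* A conditional branch then just tests one bit of that GPR:
   branch-on-bit-set ~= if (r & (1ULL << n)) goto target; */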
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10, 50,
90, or 130 bits.
On 10/28/2025 8:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit
instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
Registers are named as if there were 32 GPRs, A0 (arg 0 register is
r1) and A0H (arg 0 high is r33). Same for other registers.
I assume the "high" registers are for handling 128 bit operations
without the need to specify another register name. Do you have 5 or 6
bit register numbers in the instructions. Five allows you to use the
high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste. If you have six bits, you can use all
64 registers for any operation, but how is the "upper" method that
better than automatically using r(x+1)?
GPRs may contain either integer or floating-point values.
Going with a bit result vector in any GPR for compares, then a branch
on bit-set/clear for conditional branches. Might also include branch
true / false.
Using operand routing for immediate constants and an operation size
for the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be
10,50,90 or 130 bits.
Those seem like a call from the My 66000 playbook, which I like.
On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
On 10/28/2025 8:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit
instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
Registers are named as if there were 32 GPRs, A0 (arg 0 register is
r1) and A0H (arg 0 high is r33). Sameo for other registers.
I assume the "high" registers are for handling 128 bit operations
without the need to specify another register name. Do you have 5 or 6
bit register numbers in the instructions. Five allows you to use the
high registers for 128 bit operations without needing another register
specifier, but then the high registers can only be used for 128 bit
operations, which seems a waste. If you have six bits, you can use
all 64 registers for any operation, but how is the "upper" method that
better than automatically using r(x+1)?
Yes, but it is just a suggested usage. The registers are GPRs that can
be used for anything, specified using a six-bit register number. I
suggested it that way because most of the time register values would be passed around as 64-bit quantities and it keeps the same set of
registers for the same register type (argument, temp, saved). But since
it should be using mostly compiled code, it does not make much difference.
Also, the high registers could be used as FP registers. Maybe allowing
for saving only the low order 32 regs during a context switch.
Yup.
GPRs may contain either integer or floating-point values.
Going with a bit result vector in any GPR for compares, then a branch
on bit-set/clear for conditional branches. Might also include branch
true / false.
Using operand routing for immediate constants and an operation size
for the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be
10,50,90 or 130 bits.
Those seem like a call from the My 66000 playbook, which I like.
On 2025-10-29 8:41 a.m., Robert Finch wrote:
On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
On 10/28/2025 8:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit
instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
Registers are named as if there were 32 GPRs, A0 (arg 0 register is
r1) and A0H (arg 0 high is r33). Sameo for other registers.
I assume the "high" registers are for handling 128 bit operations
without the need to specify another register name. Do you have 5 or
6 bit register numbers in the instructions. Five allows you to use
the high registers for 128 bit operations without needing another
register specifier, but then the high registers can only be used for
128 bit operations, which seems a waste. If you have six bits, you
can use all 64 registers for any operation, but how is the "upper"
method that better than automatically using r(x+1)?
Yes, but it is just a suggested usage. The registers are GPRs that can
be used for anything, specified using a six-bit register number. I
suggested it that way because most of the time register values would
be passed around as 64-bit quantities and it keeps the same set of
registers for the same register type (argument, temp, saved). But
since it should be using mostly compiled code, it does not make much
difference.
Also, the high registers could be used as FP registers. Maybe allowing
for saving only the low order 32 regs during a context switch.
Yup.
GPRs may contain either integer or floating-point values.
Going with a bit result vector in any GPR for compares, then a
branch on bit-set/clear for conditional branches. Might also include
branch true / false.
Using operand routing for immediate constants and an operation size
for the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be
10,50,90 or 130 bits.
Those seem like a call from the My 66000 playbook, which I like.
I should mention that the high registers are available only in user/app
mode. For other modes of operation, only the low order 32 registers are
available. I did this to reduce the number of logical registers in the
design. There are about 160 (64+32+32+32) logical registers then. They
are supported by 512 physical registers. My previous design had 224
logical registers, which eats up more hardware, probably for little
benefit.
On 10/28/2025 10:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or floating-point values.
OK.
I mostly stuck with 32-bit encodings, but 40 could maybe allow more
encoding space, with the drawback of being non-power-of-2.
But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired registers.
My case: 10/33/64.
No direct 128-bit constant, but can use two 64-bit constants whenever
128 bits is needed.
Otherwise, goings on in my land:
<snip>
ISA development is slow, and had mostly turned into bug hunting;
The longer term future is uncertain.
My ISAs can beat RISC-V in terms of code density and performance, but
when RISC-V is extended with similar features, it is harder to make
a case that it is "enough".
Doesn't seem like (within the ISA) there are many obvious ways left to
grab large general-case performance gains over what I have done already.
Some code benefits from lots of GPRs, but harder to make the case that
it reflects the general case.
Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
$240), made some curious observations:
It seems to slightly outperform my main PC in single-threaded performance;
Its RAM timings don't seem to match the expected values.
My main PC still wins at multi-threaded performance, and has the
advantage of 7x more RAM.
Robert Finch <robfi680@gmail.com> schrieb:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
I understand the temptation to go for more bits :-) What is your
instruction alignment? Bytewise, so 40 bits fit, or do you have some
alignment such that the first instruction of a cache line is always
aligned?
Having register pairs does not make the compiler writer's life easier, unfortunately.
Going with a bit result vector in any GPR for compares, then a branch on
bit-set/clear for conditional branches. Might also include branch true /
false.
Having 64 registers and 64-bit registers makes life easier for that particular task :-)
If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
larger size of your instructions.
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
Those sizes are not really a good fit for constants from programs,
where quite a few constants tend to be 32 or 64 bits. Would a
64-bit FP constant leave 26 bits empty?
BGB <cr88192@gmail.com> posted:
On 10/28/2025 10:52 PM, Robert Finch wrote:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
OK.
I mostly stuck with 32-bit encodings, but 40 could maybe allow more
encoding space, with the drawback of being non-power-of-2.
It is definitely an issue.
But, yeah, occasionally dealing with 128-bit data is a major case for 64
GPRs and paired registers.
There is always the DBLE pseudo-instruction.
DBLE Rd,Rs1,Rs2,Rs3
All DBLE does is to provide more registers for the wide computation
in such a way that compiler is not forced to pair or share any reg-
isters. The other thing DBLE does is to tell the decoder that the
next instruction is 2× as wide as its OpCode states. In lower end
machines (and in GPUs) DBLE is sequenced as if it were an instruction.
In higher end machines, DBLE would be CoIssued with its mate.
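My reading of the above, in C terms: the DBLE/ADD pair behaves like one
128-bit add (the register numbers in the comment are made up for the
example, not an actual My 66000 encoding):

#include <stdint.h>
/*   DBLE r10,r20,r30,--   ; names the high halves for Rd, Rs1, Rs2
     ADD  r11,r21,r31      ; pair now means r10:r11 = r20:r21 + r30:r31 */
typedef struct { uint64_t lo, hi; } u128;
u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
    return r;
}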
----------
My case: 10/33/64.
No direct 128-bit constant, but can use two 64-bit constants whenever
128 bits is needed.
{5, 16, 32, 64}-bit immediates.
<snip>
Otherwise, goings on in my land:
ISA development is slow, and had mostly turned into bug hunting;
The longer term future is uncertain.
My ISAs can beat RISC-V in terms of code density and performance, but
when RISC-V is extended with similar features, it is harder to make
a case that it is "enough".
I am still running at 70% of RISC-V's instruction count.
Doesn't seem like (within the ISA) there are many obvious ways left to
grab large general-case performance gains over what I have done already.
Fewer instructions, and/or instructions that take fewer cycles to execute.
For example, ENTER and EXIT instructions move 4 registers per cycle
to/from cache in a pipeline that has 1 result per cycle.
Some code benefits from lots of GPRs, but harder to make the case that
it reflects the general case.
There is very little to be gained with that many registers.
Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
$240), made some curious observations:
It seems to slightly outperform my main PC in single-threaded performance;
Its RAM timings don't seem to match the expected values.
My main PC still wins at multi-threaded performance, and has the
advantage of 7x more RAM.
My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
On 10/29/2025 11:47 AM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
snip
But, yeah, occasionally dealing with 128-bit data is a major case for
64 GPRs and paired registers.
There is always the DBLE pseudo-instruction.
DBLE Rd,Rs1,Rs2,Rs3
All DBLE does is to provide more registers for the wide computation
in such a way that compiler is not forced to pair or share any reg-
isters. The other thing DBLE does is to tell the decoder that the
next instruction is 2× as wide as its OpCode states. In lower end
machines (and in GPUs) DBLE is sequenced as if it were an instruction.
In higher end machines, DBLE would be CoIssued with its mate.
So if DBLE says the next instruction is double width, does that mean
that all "128-bit instructions" require 64 bits in the instruction
stream? So a sequence of, say, four 128-bit arithmetic instructions
would require the I space of 8 instructions?
If so, I guess it is a tradeoff for not requiring register pairing, e.g.
Rn and Rn+1.
Robert Finch <robfi680@gmail.com> posted:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
Going with a bit result vector in any GPR for compares, then a branch on
bit-set/clear for conditional branches. Might also include branch true /
false.
I have both the bit-vector compare and branch, but also a compare to zero
and branch as a single instruction. I suggest you should too, if for no
other reason than:
if( p && p->next )
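Both tests in that chain are against zero, so each folds into a single
compare-to-zero-and-branch; a C rendering (the sequence in the comment
is illustrative, not an actual My 66000 encoding):

#include <stddef.h>
struct node { struct node *next; };
/* With compare-to-zero-and-branch, roughly:
     BEQ0 Rp,L_skip ; LD Rq,[Rp,#next] ; BEQ0 Rq,L_skip
   Without it, each test costs an extra compare into a bit vector first. */
int chain_ok(const struct node *p) {
    return p != NULL && p->next != NULL;
}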
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.
With 32-bit instructions, I provide {5, 16, 32, and 64}-bit constants.
Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:
CVTfd Rt,Rf
FMUL Rt,Rt,#1.425D0
CVTdf Rd,Rt
Which is subject to double rounding once at the FMUL and again at the
down conversion. I thought about the problem and it seems fairly easy
to gate the 24-bit fraction into the multiplier tree along with the
53-bit fraction of the constant, and then normalize and round the
result dropping out of the tree--avoiding the double rounding case.
Now, the compiler emits:
FMULf Rd,Rf,#1.425D0
saving 2 instructions along with the higher precision.
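The hazard in miniature, using integers so the effect is exact (a
hypothetical example, not the FMULf data path itself): rounding the same
value in two steps can land one ulp away from rounding it once.

#include <stdio.h>
/* Round-to-nearest-even, dropping 'shift' low bits. */
static unsigned rne(unsigned v, int shift) {
    unsigned half = 1u << (shift - 1);
    unsigned mask = (1u << shift) - 1;
    unsigned frac = v & mask, ip = v >> shift;
    if (frac > half || (frac == half && (ip & 1)))
        ip++;
    return ip;
}
int main(void) {
    unsigned x = 0x17;                        /* 0b10111 */
    printf("once:  %u\n", rne(x, 4));         /* 1: 0.0111 is below halfway   */
    printf("twice: %u\n", rne(rne(x, 2), 2)); /* 2: first rounding made a tie */
    return 0;
}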
Desktop PC:
8C/16T: 3.7 GHz base, 4.3 GHz turbo, 112GB RAM (just not very fast RAM)
Rarely reaches turbo
pretty much only happens if just running a single thread...
With all cores running stuff in the background:
Idles around 3.6 to 3.8.
Laptop:
4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
If power set to performance, reaches turbo a lot more easily,
and with multi-core workloads.
But, puts out a lot of heat while doing so...
If set to Efficiency, mostly stays below 3 GHz.
As noted, the laptop is surprisingly speedy for how cheap it was.
At this point, the discussion is academic, as Robert has said he has
6-bit register specifiers in the instructions.
But my issue had nothing
to do with SIMD registers, as he said he supported 128-bit arithmetic
and the "high" registers were used for that.
<snip>
My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
<snip>
Desktop PC:
8C/16T: 3.7 GHz base, 4.3 GHz turbo, 112GB RAM (just not very fast RAM)
Rarely reaches turbo
pretty much only happens if just running a single thread...
With all cores running stuff in the background:
Idles around 3.6 to 3.8.
Laptop:
4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
If power set to performance, reaches turbo a lot more easily,
and with multi-core workloads.
But, puts out a lot of heat while doing so...
If set to Efficiency, mostly stays below 3 GHz.
As noted, the laptop is surprisingly speedy for how cheap it was.
For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores),
32 GB RAM, 16GB graphics RAM, 3.4 GHz (5.6 GHz in turbo mode). More RAM
was needed; my last machine only had 16GB, and I found it using about
20GB. I did not want to spring for a machine with even more RAM, as
those tended to be high-end machines.
On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
I understand the temptation to go for more bits :-) What is your
instruction alignment? Bytewise so 40 bits fit, or do you have some
alignment that the first instruction of a cache line is always aligned?
The 40-bit instructions are byte aligned. This does add more shifting in
the align stage. Once shifted, though, instructions are easily peeled off
from fixed positions. One consequence is that jump targets must be byte
aligned, or routines could be required to be 32-bit aligned, for instance.
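A C sketch of that byte-aligned fetch (the details are assumptions; the
real align stage is of course hardware, not a memcpy):

#include <stdint.h>
#include <string.h>
/* Peel one 40-bit instruction off a byte-aligned code stream. */
uint64_t fetch40(const uint8_t *code, uint64_t pc) {
    uint64_t w = 0;
    memcpy(&w, code + pc, 5);          /* 5 bytes = 40 bits (little-endian) */
    return w & ((1ULL << 40) - 1);     /* next instruction is at pc + 5 */
}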
If you have that many bits available, do you still go for a load-store
architecture, or do you have memory operations? This could offset the
larger size of your instructions.
It is load/store with no memory ops, except possibly atomic memory ops.
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
Those sizes are not really a good fit for constants from programs,
where quite a few constants tend to be 32 or 64 bits. Would a
64-bit FP constant leave 26 bits empty?
I found that 16-bit immediates could be encoded instead of 10-bit. So
now there are 16, 56, 96, and 136-bit constants possible (presumably a
16-bit immediate field extended by zero to three extra 40-bit words).
The 56-bit constant likely has enough range for most 64-bit ops.
Otherwise, using a 96-bit constant for 64-bit ops would leave the upper
32 bits of the constant unused. 136-bit constants may not be
implemented, but a size code is reserved for that size.
Robert Finch <robfi680@gmail.com> schrieb:
On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
I understand the temptation to go for more bits :-) What is your
instruction alignment? Bytewise so 40 bits fit, or do you have some
alignment that the first instruction of a cache line is always aligned?
The 40-bit instructions are byte aligned. This does add more shifting in
the align stage. Once shifted, though, instructions are easily peeled off
from fixed positions. One consequence is that jump targets must be byte
aligned, or routines could be required to be 32-bit aligned, for instance.
That raises an interesting question. If you want to align a branch
target on a 32-bit boundary, or even a cache line, how do you fill
up the rest? If all instructions are 40 bits, you cannot have a
NOP that is not 40 bits, so there would need to be a jump before
a gap that does not fit 40 bits.
On 2025-10-29 2:33 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
is r33). Same for other registers. GPRs may contain either integer or
floating-point values.
Going with a bit result vector in any GPR for compares, then a branch on
bit-set/clear for conditional branches. Might also include branch true /
false.
I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:
if( p && p->next )
Yes, I was going to have at least branch on register 0 (false) / 1
(true), as there is encoding room to support it. It does add more cases
in the branch eval, but is probably well worth it.
Using operand routing for immediate constants and an operation size for
the instruction. Constants and operation size may be specified
independently. With 40-bit instruction words, constants may be 10,50,90
or 130 bits.
My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.
Following the same philosophy. Expecting only some use for 128-bit
floats. Integers can only handle 8, 16, 32, or 64 bits.
With 32-bit instructions, I provide {5, 16, 32, and 64}-bit constants.
Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:
CVTfd Rt,Rf
FMUL Rt,Rt,#1.425D0
CVTdf Rd,Rt
Which is subject to double rounding once at the FMUL and again at the
down conversion. I thought about the problem and it seems fairly easy
to gate the 24-bit fraction into the multiplier tree along with the
53-bit fraction of the constant, and then normalize and round the
result dropping out of the tree--avoiding the double rounding case.
Now, the compiler emits:
FMULf Rd,Rf,#1.425D0
saving 2 instructions along with the higher precision.
Improves the accuracy(?) of algorithms, but seems a bit specific to me.
Are there other instruction sequences where double rounding would be good
to avoid?
Seems like HW could detect the sequence and fuse the instructions.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
At this point, the discussion is academic, as Robert has said he has
6-bit register specifiers in the instructions.
He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.
But my issue had nothing
to do with SIMD registers, as he said he supported 128-bit arithmetic
and the "high" registers were used for that.
As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.
Intel designed SSE with scalar instructions that use only 32 bits out
of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
(and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
register, and various AVX-512 variants with 32-bit and 64-bit scalars,
and 128-bit and 256-bit operations in addition to the 512-bit ones.
They are obviously not worried about waste.
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
At this point, the discussion is academic, as Robert has said he has
6-bit register specifiers in the instructions.
He could still make these registers have 128 bits rather than pairing
registers for 128-bit operation.
But my issue had nothing
to do with SIMD registers, as he said he supported 128-bit arithmetic
and the "high" registers were used for that.
As far as waste etc. is concerned, it does not matter if the 128-bit
operation is a SIMD operation or a scalar 128-bit operation.
Intel designed SSE with scalar instructions that use only 32 bits out
of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
(and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
register, and various AVX-512 variants with 32-bit and 64-bit scalars,
and 128-bit and 256-bit operations in addition to the 512-bit ones.
They are obviously not worried about waste.
Which only goes to prove that x86 is not RISC.
Thomas Koenig <tkoenig@netcologne.de> writes:
Robert Finch <robfi680@gmail.com> schrieb:
The 40-bit instructions are byte aligned. This does add more shifting in
the align stage. Once shifted, though, instructions are easily peeled off
from fixed positions. One consequence is that jump targets must be byte
aligned, or routines could be required to be 32-bit aligned, for instance.
That raises an interesting question. If you want to align a branch
target on a 32-bit boundary, or even a cache line, how do you fill
up the rest? If all instructions are 40 bits, you cannot have a
NOP that is not 40 bits, so there would need to be a jump before
a gap that does not fit 40 bits.
iCache lines could be a multiple of 5 bytes in size (e.g. 80 bytes
instead of 64).
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Intel designed SSE with scalar instructions that use only 32 bits out
of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
(and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
register, and various AVX-512 variants with 32-bit and 64-bit scalars,
and 128-bit and 256-bit operations in addition to the 512-bit ones.
They are obviously not worried about waste.
Which only goes to prove that x86 is not RISC.
I don't see that following at all, but it inspired a closer look at
the usage/waste of register bits in RISCs:
Every 64-bit RISC, starting with MIPS-IV and Alpha, wastes a lot of
precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
64-bit registers rather than following the idea of Intel and Robert
Finch of splitting the 64-bit register into double the number of 32-bit
registers; this idea can be extended to eliminate waste by having
quadruple the number of 16-bit registers that can be joined into 32-bit
and 64-bit registers when needed, or even better, octuple the number
of 8-bit registers that can be joined into 16-bit, 32-bit, and 64-bit
registers. We can even resurrect the character-oriented or
digit-oriented architectures of the 1950s.
Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
SI, DI, BP, and SP.
In the 32-bit extension, they did not add ways to
access the third and fourth byte, or the second wyde (16-bit value).
In the 64-bit extension, AMD added ways to access the low byte of
every register (in addition to AH-DH), but no way to access the second
byte of other registers than RAX-RDX, nor ways to access higher wydes,
or 32-bit units. Apparently they were not concerned about this kind
of waste. For the 8086 the explanation is not trying to avoid waste,
but an easy automatic mapping from 8080 code to 8086 code.
Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
32-bit register alone, which one can consider to be useful for storing
data in those bits (and in case of AL, AH actually provides a
convenient way to access some of the bits, and vice versa), but leads
to partial-register stalls. The hardware contains fast paths for some
common cases of partial-register writes, but AFAIK AH-DH do not get
fast paths in most CPUs.
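The classic two-instruction case (x86 syntax, illustrative):

    mov al, [mem]    ; writes only bits 0..7 of EAX
    add ebx, eax     ; reads all of EAX, so old and new bits must merge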
By contrast, RISCs waste the other 24 or 56 bits on a byte load by
zero-extending or sign-extending the byte.
Alpha avoids wasting register bits for some idioms by keeping up to 8
bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
the individual bytes of a register.
IIRC the original HPPA has 32 or so 64-bit FP registers, which they
then split into 58? 32-bit FP registers. I don't know how they
further evolved that feature.
- anton
Scott Lurndal <scott@slp53.sl.home> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
That raises an interesting question. If you want to align a branch
target on a 32-bit boundary, or even a cache line, how do you fill
up the rest? If all instructions are 40 bits, you cannot have a
NOP that is not 40 bits, so there would need to be a jump before
a gap that does not fit 40 bits.
iCache lines could be a multiple of 5 bytes in size (e.g. 80 bytes
instead of 64).
There is a cache level (L2 usually, I believe) at which icache and
dcache are no longer separate. Wouldn't this cause problems
or inefficiencies?
Michael S <already5chosen@yahoo.com> writes:
According to my understanding, EV4 had no SIMD-style instructions.
My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
instructions, were already present in EV4.
The architecture
description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
not say that some implementations don't include these instructions in hardware, whereas for the Multimedia support instructions (Section
4.13), the reference does say that.
- anton
On Thu, 30 Oct 2025 22:19:18 GMT...
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
instructions, were already present in EV4.
I didn't consider these instructions as SIMD. Maybe I should have.
Looks like these instructions are intended to accelerate string
processing. That's unusual for the first wave of SIMD extensions.
Michael S <already5chosen@yahoo.com> writes:
On Thu, 30 Oct 2025 22:19:18 GMT...
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
instructions, were already present in EV4.
I didn't consider these instructions as SIMD. Maybe I should have.
They definitely are, but they were not touted as such at the time, and
they use the GPRs, unlike most SIMD extensions to instruction sets.
Looks like these instructions are intended to accelerate string
processing. That's unusual for the first wave of SIMD extensions.
Yes. This was pre-first-wave. The Alpha architects just wanted to
speed up some common operations that would otherwise have been
relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
benchmark (maybe Dhrystone), someone claimed that these string
instructions gave Alpha an unfair advantage.
- anton
In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going bigger;
so higher-resolutions had typically worked to reduce the bits per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
the color);
One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.
Robert Finch <robfi680@gmail.com> posted:
Improves the accuracy? of algorithms, but seems a bit specific to me.
It is down in the 1% footprint area.
Are there other instruction sequence where double-rounding would be good
to avoid?
Back when I joined Moto (1983) there was a lot of talk about double
roundings and how it could screw up various algorithms but mainly in
the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
of precision and thus took a chance of 2/2^10 of a double rounding.
Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
problem is greatly ameliorated although technically still present.
On Thu, 30 Oct 2025 16:46:14 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Alpha avoids wasting register bits for some idioms by keeping up to 8
bytes in a register in SIMD style (a few years before the wave of SIMD
extensions across the industry), but still provides no direct name for
the individual bytes of a register.
According to my understanding, EV4 had no SIMD-style instructions.
They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
ahead of VIS in UltraSPARC.
MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Improves the accuracy? of algorithms, but seems a bit specific to me.
It is down in the 1% footprint area.
Are there other instruction sequences where double rounding would be good
to avoid?
Back when I joined Moto (1983) there was a lot of talk about double roundings and how it could screw up various algorithms but mainly in
the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
of precision and thus took a chance of 2/2^10 of a double rounding.
Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
problem is greatly ameliorated although technically still present.
Actually, for the five required basic operations, you can always do the
op in the next higher precision, then round again down to the target,
and get exactly the same result.
This is because the mantissa lengths (including the hidden bit) increase
to at least 2n+2:
f16  1:5:10   (1+10 = 11,  11*2+2 = 24,  and f32 has 24)
f32  1:8:23   (1+23 = 24,  24*2+2 = 50,  and f64 has 53)
f64  1:11:52  (1+52 = 53,  53*2+2 = 108, and f128 has 113)
f128 1:15:112 (1+112 = 113)
You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that
would require a triple-sized mantissa.
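A quick C spot-check of the 2n+2 property for the f32-via-f64 case
(random sampling, not a proof; assumes FLT_EVAL_METHOD == 0 so the
float multiply below is single-rounded):

#include <stdio.h>
#include <stdlib.h>
int main(void) {
    srand(1);
    for (int i = 0; i < 1000000; i++) {
        float a = (float)(rand() % 100000) / (float)(rand() % 100000 + 1);
        float b = (float)(rand() % 100000) / (float)(rand() % 100000 + 1);
        float via_f64 = (float)((double)a * (double)b);  /* rounds twice */
        if (via_f64 != a * b) {                          /* rounds once  */
            printf("mismatch at %g * %g\n", a, b);
            return 1;
        }
    }
    printf("no mismatch: 24*2+2 = 50 <= 53, as claimed\n");
    return 0;
}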
The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
was set to 64-bit precision.
Terje
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
SI, DI, BP, and SP.
{ABCD}X registers were data.
{SDBS} registers were pointer registers.
Oh and BTW:: using x86-history as justification for an architectural
feature is "bad style".
But it gains the property that the whole register contains one proper
value {range-limited to the container size whence it came}. This in turn
makes tracking values easy--in fact, placing several different-sized
values in a single register makes it essentially impossible to perform
value analysis in the compiler.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the
op in the next higher precision, then round again down to the target,
and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
On 10/31/2025 1:21 PM, BGB wrote:
...
In a lot of the cases, I was using an 8-bit indexed color or color-
cell mode. For indexed color, one needs to send each image through a
palette conversion (to the OS color palette); or run a color-cell
encoder. Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going
bigger; so higher-resolutions had typically worked to reduce the bits
per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
the color);
One possibility also being to use an indexed color pair for every 8x8,
allowing for a 1.25 bpp color cell mode.
Expanding on this:
Idea 1, original:
Each group of 2x2 pixels understood as:
G R
B G
With each pixel alternating color.
But, slightly better for quality is to operate on blocks of 4x4 pixels,
with the pixel bits encoding color indirectly for the whole 4x4 block:
G R G B
B G R G
G R G B
B G R G
So, if >= 4 G bits are set, G is High.
So, if >= 2 R bits are set, R is High.
So, if >= 2 B bits are set, B is High.
If > 8 bits are set, I is high.
The non-set pixels usually assume either 0000 (black) or 1000 (dark
grey) depending on the I bit, or a low-intensity version of the main
color if over 75% of a given bit are set in a given way (say, for mostly
flat color blocks).
Still kinda sucks, but allows a crude approximation of 16 color graphics
at 1 bpp...
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the
op in the next higher precision, then round again down to the target,
and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).
Rounding to odd is basically the same as rounding to sticky, i.e. if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
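Round-to-odd in miniature (integers again; a sketch of the idea, not of
PowerISA's implementation):

#include <stdint.h>
/* Drop 'shift' low bits; if anything was lost, force the new ulp to 1.
   The odd ulp then survives as a sticky indicator for a later rounding. */
uint64_t round_to_odd(uint64_t v, int shift) {
    uint64_t kept = v >> shift;
    uint64_t lost = v & ((1ULL << shift) - 1);
    return lost ? (kept | 1) : kept;
}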
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed; there you do in fact need to check
all the bits.
Terje
On 10/31/2025 2:32 PM, BGB wrote:
On 10/31/2025 1:21 PM, BGB wrote:
...
In a lot of the cases, I was using an 8-bit indexed color or color-
cell mode. For indexed color, one needs to send each image through a
palette conversion (to the OS color palette); or run a color-cell
encoder. Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going
bigger; so higher-resolutions had typically worked to reduce the bits
per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also
encodes the color);
One possibility also being to use an indexed color pair for every
8x8, allowing for a 1.25 bpp color cell mode.
Expanding on this:
Idea 1, original:
Each group of 2x2 pixels understood as:
G R
B G
With each pixel alternating color.
But, slightly better for quality is to operate on blocks of 4x4
pixels, with the pixel bits encoding color indirectly for the whole
4x4 block:
G R G B
B G R G
G R G B
B G R G
So, if >= 4 G bits are set, G is High.
So, if >= 2 R bits are set, R is High.
So, if >= 2 B bits are set, B is High.
If > 8 bits are set, I is high.
The non-set pixels usually assume either 0000 (black) or 1000 (dark
grey) depending on the I bit, or a low-intensity version of the main
color if over 75% of a given bit are set in a given way (say, for
mostly flat color blocks).
Still kinda sucks, but allows a crude approximation of 16 color
graphics at 1 bpp...
Well, anyways, here is me testing with another variation of the idea
(after thinking about it again).
Using a joke image as a test case here...
https://x.com/cr88192/status/1984694932666261839
This variation uses:
Y R
B G
In this case tiling as:
Y R Y R ...
B G B G ...
Y R Y R ...
B G B G ...
...
Where, Y is a pure luma value.
May or may not use this, or:
Y R B G Y R B G
B G Y R B G Y R
...
But, prior pattern is simpler to deal with.
Note that having every line follow the same pattern (with no
alternation) would lead to obvious vertical lines in the output.
This used a different (slightly more complicated) color recovery
algorithm, and was operating on 8x8 pixel blocks.
With 4x4, there are effectively 4 bits per channel, which is enough to
With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
are normalized here).
Having both a Y and G channel slightly helps with the color-recovery process; and allows a way to signal a monochrome block (if Y==G, the
block is assumed to be monochrome, and the R/B bits can be used more
freely for expressing luma).
Where:
Chroma accuracy comes at the expense of luma accuracy;
An increased colorspace comes at the cost of spatial resolution of chroma; ...
Dealing with chroma does have the effect of making the dithering process more complicated. As noted, reliable recovery of the color vector is
itself a bit fiddly (and is very sensitive to the encoder side dither process).
The former image was itself an example of an artifact caused by the dithering process, which in this case was over-boosting the green
channel (and rotating the dither matrix would result in drastic color shifts). The latter image was mostly after I realized the issue with the dither pattern, and modified how it was being handled (replacing the use
of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
the matrix for each channel).
Image quality isn't great, but then again, not sure how to do that much better with a naive 1 bit/pixel encoding.
I guess, an open question here is whether the color-recovery algorithm
would be practical for hardware / FPGA.
One possible approach could be:
Use LUT4 to map 4b -> 2b (as a count)
Then, map 2x2b -> 3b (adder)
Then, map 2x3b -> 4b (adder), then discard LSB.
Then, select the max of R/G/B/Y;
This is used as an inverse normalization scale.
Feed each value and scale through a LUT (for R/G/B)
Getting a 5-bit scaled RGB;
Roughly: (Val<<5)/Max
Compose a 5-bit RGB555 value used for each pixel that is set.
Actual pixel decoding process works the same as with 8x8 blocks of 1-bit monochrome, selecting minimum or maximum color based on each bit.
Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic complexity.
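A rough C model of that pipeline (my reading of the steps above; the
LUT4/adder stages become a popcount here, and a clamp stands in for the
scaling LUT):

#include <stdint.h>
static int pop16(uint16_t v) {        /* the LUT4 + adder tree, in C */
    int n = 0;
    while (v) { n += v & 1u; v >>= 1; }
    return n;
}
/* Recover an RGB555 "max" color for one 8x8 block from its R/G/B/Y bits. */
void recover_rgb555(uint16_t rb, uint16_t gb, uint16_t bb, uint16_t yb,
                    int *r5, int *g5, int *b5) {
    int r = pop16(rb), g = pop16(gb), b = pop16(bb), y = pop16(yb);
    int max = r;                      /* inverse normalization scale */
    if (g > max) max = g;
    if (b > max) max = b;
    if (y > max) max = y;
    if (max == 0) { *r5 = *g5 = *b5 = 0; return; }
    *r5 = (r << 5) / max; if (*r5 > 31) *r5 = 31;  /* roughly (Val<<5)/Max */
    *g5 = (g << 5) / max; if (*g5 > 31) *g5 = 31;
    *b5 = (b << 5) / max; if (*b5 > 31) *b5 = 31;
}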
Pros/Cons:
+: Looks better than per-pixel Bayer-RGB
+: Looks better than 4x4 RGBI
-: Would require more complex decoder logic;
-: Requires specialized dither logic to not look like broken crap.
-: Doesn't give passable results if handed naive grayscale dithering.
Per-Pixel RGB still holds up OK with naive grayscale dither.
But, this approach is a lot more particular.
The RGBI approach seems intermediate, more likely to decode grayscale patterns as gray.
I guess a more open question is if such a thing could be useful (it is
pretty far down the image-quality scale). But, OTOH, with simpler
(non-randomized) dither patterns, it can LZ compress OK (depending on
image, can get 0.1 to 0.8 bpp, which is generally JPEG territory).
If combined with delta encoding or similar, it could almost be adapted
into a very crappy video codec.
Well, or LZ4, where (at 320x200) one could potentially hold several
frames of video in a 64K sliding window.
But, image quality might be unacceptably poor. Also if decoded in
software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
image quality).
More just interesting that I was able to get things "almost half-way passable" from 1 bpp monochrome.
...
On Sun, 2 Nov 2025 11:36:36 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Rounding to odd is basically the same as rounding to sticky, i.e if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed, there you do in fact need to check
all the bits.
Terje
People use names like guard and sticky bits, and sometimes also rounding
bit (e.g. in the Wikipedia article), without explanation, as if everybody
had agreed about what they mean. But I don't think that everybody
really agrees.
Michael S wrote:
On Sun, 2 Nov 2025 11:36:36 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its
128-bit floating point arithmetic, for that very reason (I
assume).
Rounding to odd is basically the same as rounding to sticky, i.e if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed, there you do in fact need to
check all the bits.
Terje
People use names like guard and sticky bits and sometimes also
rounding bit (e.g. in Wikipedia article) without explanation, as if everybody had agreed about what they mean. But I don't think that
everybody really agree.
Within the 754 working group the definition is totally clear:
Guard is the first bit after the normal mantissa.
Sticky is the bit following the guard bit; it is generated by OR'ing
together all subsequent bits in the exact/infinitely precise result.
I.e if an exact result is exactly halfway between two representable
numbers, the Guard bit will be set and Sticky unset.
Ulp (Unit in Last Place) is the final mantissa bit.
Sign is of course the sign in the Sign-Magnitude format used for all
fp numbers.
This means that those four bits in combination suffice to separate
between rounding directions:
Default rounding is nearest or even: (In this case Sign does not
matter.)
Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
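That table collapses to a single gate's worth of logic; in C (nearest-even
only, directed modes would also look at Sign):

/* Round up iff Guard & (Sticky | Ulp) -- matches the table above. */
static inline unsigned rne_round_up(unsigned ulp, unsigned guard,
                                    unsigned sticky) {
    return guard & (sticky | ulp);
}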
Terje
On Sun, 2 Nov 2025 16:09:10 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Sun, 2 Nov 2025 11:36:36 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always
do the op in the next higher precision, then round again down to
the target, and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its
128-bit floating point arithmetic, for that very reason (I
assume).
Rounding to odd is basically the same as rounding to sticky, i.e if
there are any trailing 1 bits in the exact result, then put that in
the ulp position.
We have known since before the 1978 ieee754 standard that
guard+sticky (plus sign and ulp) is enough to get the rounding
correct in all modes.
The single exception is when rounding up from the maximum magnitude
value to inf should be suppressed, there you do in fact need to
check all the bits.
Terje
People use names like guard and sticky bits and sometimes also
rounding bit (e.g. in Wikipedia article) without explanation, as if
everybody had agreed about what they mean. But I don't think that
everybody really agree.
Within the 754 working group the definition is totally clear:
I could believe that there is consensus about these names between
current members of the 754 working group. But nothing of that sort is
mentioned in the text of the Standard. Which among other things means
that you cannot rely on being understood even by new members of the 754
working group.
Guard is the first bit after the normal mantissa.
Sticky is the bit following the guard bit, it is generated by OR'ing
together all subsequent bits in the exact/infinitely precise result.
I.e if an exact result is exactly halfway between two representable
numbers, the Guard bit will be set and Sticky unset.
Ulp (Unit in Last Place)) is the final mantissa bit
Sign is of course the sign in the Sign-Magnitude format used for all
fp numbers.
This means that those four bits in combination suffices to separate
between rounding directions:
Default rounding is nearest or even: (In this case Sign does not
matter.)
Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
Terje
I mostly use ULP/Guard/Sticky with the same meaning. Except when I use
them, esp. Guard, differently.
Given the choice, [in the context of binary floating point] I'd rather
not use the term 'guard' at all. Names like 'rounding bit' or
'half-ULP' are far more self-describing.
On 2025-11-02 3:21 a.m., BGB wrote:
I think your support for graphics is interesting; something to keep in
mind.
On 10/31/2025 2:32 PM, BGB wrote:
On 10/31/2025 1:21 PM, BGB wrote:
...
In a lot of the cases, I was using an 8-bit indexed color or color-
cell mode. For indexed color, one needs to send each image through a
palette conversion (to the OS color palette); or run a color-cell
encoder. Mostly because the display HW used 128K of VRAM.
And, even if RAM backed, there are bandwidth problems with going
bigger; so higher-resolutions had typically worked to reduce the
bits per pixel:
320x200: 16 bpp
640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
800x600: 2 or 4 bpp color-cell
1024x768: 1 bpp monochrome, other experiments (*1)
Or, use the 2 bpp mode, for 192K.
*1: Bayer Pattern Mode/Logic (where the pattern of pixels also
encodes the color);
One possibility also being to use an indexed color pair for every
8x8, allowing for a 1.25 bpp color cell mode.
Expanding on this:
Idea 1, original:
Each group of 2x2 pixels understood as:
G R
B G
With each pixel alternating color.
But, slightly better for quality is to operate on blocks of 4x4
pixels, with the pixel bits encoding color indirectly for the whole
4x4 block:
G R G B
B G R G
G R G B
B G R G
So: if >= 4 G bits are set, G is High; if >= 2 R bits are set, R is
High; if >= 2 B bits are set, B is High; if > 8 bits (of the 16 total)
are set, I is High.
The non-set pixels are usually assumed to be either 0000 (Black) or 1000
(Dark Grey) depending on the I bit; or a low-intensity version of the
main color if over 75% of a given bit are set in a given way (say, for
mostly flat color blocks).
Still kinda sucks, but allows a crude approximation of 16 color
graphics at 1 bpp...
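A rough C sketch of that recovery rule (the thresholds are as described above; the masks encode my assumption that the 4x4 pattern is stored row-major, with pixel (x,y) at bit y*4+x):

  #include <stdint.h>

  static int popcount16(uint16_t v)
  {
      int n = 0;
      while (v) { n += v & 1; v >>= 1; }
      return n;
  }

  /* Recover a 4-bit IRGB color for a 4x4 1bpp block laid out in the
     G/R/B pattern above. */
  static uint8_t block_rgbi(uint16_t blk)
  {
      int g = popcount16(blk & 0xA5A5);   /* 8 G positions */
      int r = popcount16(blk & 0x4242);   /* 4 R positions */
      int b = popcount16(blk & 0x1818);   /* 4 B positions */
      int i = popcount16(blk);            /* all 16 pixels */
      return (uint8_t)(((i > 8) << 3) | ((r >= 2) << 2) |
                       ((g >= 4) << 1) | ((b >= 2) << 0));
  }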
Well, anyways, here is me testing with another variation of the idea
(after thinking about it again).
Using a joke image as a test case here...
https://x.com/cr88192/status/1984694932666261839
This variation uses:
Y R
B G
In this case tiling as:
Y R Y R ...
B G B G ...
Y R Y R ...
B G B G ...
...
Where, Y is a pure luma value.
May or may not use this, or:
Y R B G Y R B G
B G Y R B G Y R
...
But, prior pattern is simpler to deal with.
Note that having every line follow the same pattern (with no
alternation) would lead to obvious vertical lines in the output.
This variant uses a different (slightly more complicated) color-recovery
algorithm, and operates on 8x8 pixel blocks.
With 4x4, there are effectively 4 bits per channel, which is enough to
recover 1 bit of color per channel.
With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits
per channel, allowing for roughly a RGB333 color space (though, the
vectors are normalized here).
Having both a Y and G channel slightly helps with the color-recovery
process; and allows a way to signal a monochrome block (if Y==G, the
block is assumed to be monochrome, and the R/B bits can be used more
freely for expressing luma).
Where:
Chroma accuracy comes at the expense of luma accuracy;
An increased colorspace comes at the cost of spatial resolution of
chroma;
...
Dealing with chroma does have the effect of making the dithering
process more complicated. As noted, reliable recovery of the color
vector is itself a bit fiddly (and is very sensitive to the encoder
side dither process).
The former image was itself an example of an artifact caused by the
dithering process, which in this case was over-boosting the green
channel (and rotating the dither matrix would result in drastic color
shifts). The latter image was mostly after I realized the issue with
the dither pattern, and modified how it was being handled (replacing
the use of an 8x8 ordered dither with a 4x4 ordered dither, and then
rotating the matrix for each channel).
Image quality isn't great, but then again, not sure how to do that
much better with a naive 1 bit/pixel encoding.
I guess, an open question here is whether the color-recovery algorithm
would be practical for hardware / FPGA.
One possible approach could be:
Use LUT4 to map 4b -> 2b (as a count)
Then, map 2x2b -> 3b (adder)
Then, map 2x3b -> 4b (adder), then discard LSB.
Then, select the max of R/G/B/Y;
This is used as an inverse normalization scale.
Feed each value and scale through a LUT (for R/G/B)
Getting a 5-bit scaled RGB;
Roughly: (Val<<5)/Max
Compose a 5-bit RGB555 value used for each pixel that is set.
The actual pixel decoding process works the same as with 8x8 blocks of
1-bit monochrome, selecting the minimum or maximum color based on each bit.
Possibly, Y could also be used to select "relative" minimum and
maximum values, vs full intensity and black, but this would add more
logic complexity.
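As a rough C version of the recovery stage (illustrative only: extracting the per-channel bits from the block is omitted, and *31/max stands in for the (Val<<5)/Max LUT so the result stays within 5 bits):

  #include <stdint.h>

  static int count16(uint16_t v)
  {
      int n = 0;
      while (v) { n += v & 1; v >>= 1; }
      return n;
  }

  /* Each channel contributes 16 pattern bits; count them, drop the
     LSB as in the hardware sketch, then normalize so the largest
     channel maps to full 5-bit scale. */
  static uint16_t recover_rgb555(uint16_t ybits, uint16_t rbits,
                                 uint16_t gbits, uint16_t bbits)
  {
      int y = count16(ybits) >> 1;     /* 0..8 after dropping LSB */
      int r = count16(rbits) >> 1;
      int g = count16(gbits) >> 1;
      int b = count16(bbits) >> 1;
      int max = y;
      if (r > max) max = r;
      if (g > max) max = g;
      if (b > max) max = b;
      if (max == 0) return 0;          /* all-dark block */
      int r5 = (r * 31) / max;
      int g5 = (g * 31) / max;
      int b5 = (b * 31) / max;
      return (uint16_t)((r5 << 10) | (g5 << 5) | b5);
  }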
Pros/Cons:
+: Looks better than per-pixel Bayer-RGB
+: Looks better than 4x4 RGBI
-: Would require more complex decoder logic;
-: Requires specialized dither logic to not look like broken crap.
-: Doesn't give passable results if handed naive grayscale dithering.
Per-Pixel RGB still holds up OK with naive grayscale dither.
But, this approach is a lot more particular.
The RGBI approach seems intermediate, more likely to decode grayscale
patterns as gray.
I guess a more open question is if such a thing could be useful (it is
pretty far down the image-quality scale). But, OTOH, with simpler
(non-randomized) dither patterns; it can LZ compress OK (depending on
image, can get 0.1 to 0.8 bpp; which is generally JPEG territory).
If combined with delta encoding or similar; could almost be adapted
into a very crappy video codec.
Well, or LZ4, where (at 320x200) one could potentially hold several
frames of video in a 64K sliding window.
But, image quality might be unacceptably poor. Also if decoded in
software, the color-reconstruction is likely to be more
computationally expensive than just using a CRAM style codec (while
also giving worse image quality).
More just interesting that I was able to get things "almost half-way
passable" from 1 bpp monochrome.
...
I think your support for graphics is interesting; something to keep in
mind for displays with limited RAM.
I use a high-speed DDR memory interface and video fifo (line cache).
Colors are broken into components, with the number of bits per component (up to 10) specified in CRs. Colors are passed around as 32-bit values
for video processing. Using the colors directly is much easier than
dealing with dithered colors.
The graphics accelerator just spits out colors to the frame buffer
without needing to go through a dithering stage.
No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
that generally fall below this (e.g., 4 bits/channel on the Nexys A7),
and 2 bits for many VGA PMods (a PMod allows 8 IO pins, so RGB222 + H/V
sync; otherwise two PMod connections are needed for the VGA). The usual
workaround was also to perform dithering while driving the VGA output
(with an ordered dither in the Verilog).
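A minimal sketch of that sort of ordered dither, in C rather than Verilog (the 4x4 Bayer matrix is the standard one; the 4-bit target depth is just an example):

  #include <stdint.h>

  static const uint8_t bayer4[4][4] = {
      {  0,  8,  2, 10 },
      { 12,  4, 14,  6 },
      {  3, 11,  1,  9 },
      { 15,  7, 13,  5 },
  };

  /* Quantize an 8-bit channel to 4 bits by adding a position-
     dependent threshold (scaled to the 4-bit step size of 16),
     then truncating. */
  static uint8_t dither_to4(uint8_t v, int x, int y)
  {
      int t = v + bayer4[y & 3][x & 3];
      if (t > 255) t = 255;
      return (uint8_t)(t >> 4);
  }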
Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
cells, so 32K of VRAM used (for 80x25 cells).
In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
for graphics (32K). Early on, this was used as the graphics mode for
Doom and similar, before I later expanded VRAM to 128K and switched to 320x200 Hicolor.
The bitmap modes are non-raster, generally with pixels packed into 8x8
or 4x4 blocks.
4x4:
16bpp: pixels in raster order.
8bpp: raster order, 32-bits per row
4bpp: Raster order, 16-bits per row
And, 8x8:
4bpp: Takes 16bpp layout, splits each pixel into 2x2.
2bpp: Takes 8bpp layout, splits each pixel into 2x2.
1bpp: Raster order, 1bpp, but same order as text glyphs.
With MSB in upper left, LSB in lower right.
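As a sketch of what addressing looks like in, say, the 4x4 16bpp mode (assuming blocks are stored left to right, top to bottom; that ordering is my assumption, not stated above):

  #include <stdint.h>

  /* Byte offset of pixel (x,y) in a 4x4-block 16bpp surface:
     32 bytes per block, pixels within a block in raster order. */
  static uint32_t pixel_offset_4x4_16bpp(uint32_t x, uint32_t y,
                                         uint32_t width_px)
  {
      uint32_t blocks_per_row = width_px / 4;
      uint32_t blk    = (y / 4) * blocks_per_row + (x / 4);
      uint32_t within = (y & 3) * 4 + (x & 3);
      return (blk * 16 + within) * 2;
  }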
On 2025-11-02 3:58 p.m., BGB wrote:
<snip>
No real need to go much beyond RGB555, as the FPGA boards have VGA
DACs that generally fall below this (Eg: 4 bit/channel on the Nexys
A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so
RGB222+H/V Sync; or needing to use 2 PMOD connections for the VGA).
The usual workaround was also to perform dithering while driving the
VGA output (with ordered dither in the Verilog).
I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
I tried to get a display channel interface working but no luck. VGA is
so old.
Have you tried dithering based on the frame (temporal dithering vs
spatial dithering)? First frame is one set of colors, the next frame is
a second set of colors. I think it may work if the refresh rate is high enough (120 Hz). IIRC I tried this a while ago and was not happy with
the results. I also tried rotating the dithering pattern around each frame.
<snip>
Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
cells, so 32K of VRAM used (for 80x25 cells).
For the text mode, an 800x600 mode is used on my system, with 12x18
cells so that I can read the display at a distance (64x32 characters).
The font then has 64 block graphic characters of 2x3 block. Low-res
graphics can be done in text mode with the appropriate font size and
block graphics characters. Color selection is limited though.
In this case, a 40x25 color-cell mode (with 256-bit cells) could be
used for graphics (32K). Early on, this was used as the graphics mode
for Doom and similar, before I later expanded VRAM to 128K and
switched to 320x200 Hicolor.
The bitmap modes are non-raster, generally with pixels packed into 8x8
or 4x4 blocks.
4x4:
16bpp: pixels in raster order.
8bpp: raster order, 32-bits per row
4bpp: Raster order, 16-bits per row
And, 8x8:
4bpp: Takes 16bpp layout, splits each pixel into 2x2.
2bpp: Takes 8bpp layout, splits each pixel into 2x2.
1bpp: Raster order, 1bpp, but same order as text glyphs.
With MSB in upper left, LSB in lower right.
<snip>
Michael S wrote:
I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
them, esp. Guard, differently.
Given the choice, [in the context of binary floating point] I'd rather
not use the term 'guard' at all. Names like 'rounding bit' or
'half-ULP' are far more self-describing.
Guard also works for decimal FP, where you need a single Sticky bit if
the Guard digit is equal to 5.
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Michael S wrote:
I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
them, esp. Guard, differently.
Given the choice, [in the context of binary floating point] I'd rather
not use the term 'guard' at all. Names like 'rounding bit' or
'half-ULP' are far more self-describing.
Guard also works for decimal FP, where you need a single Sticky bit if
the Guard digit is equal to 5.
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.
I think this has about the same code density as having a branch to a displacement from the IP.
Using a fused compare-and-branch instruction for Qupls4, there is not
enough room in the instruction for a large branch displacement (10
bits). So, my thought is to branch to a register value instead.
There is already an add-to-instruction-pointer instruction that can be
used to generate relative addresses.
By moving the register load outside of a loop, the dynamic instruction
count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
set for future support, but I think it is best to start with just a
single format.
Code:
AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3 ; branch to R3 if R1 < R2
Versus:
CMP R3,R1,R2
BLT R3,displacement
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
part of the main standard).
IEEE 754 has two options for encoding the mantissa: IBM's DPD, which is
a clever variation on base 1000, and Intel's binary.
DPD encoding is considered preferable for hardware implementations,
while binary encoding is easier for software implementations.
BCD is not an option; its information density is insufficient to
supply the required semantics in the given size of container.
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, it's information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported a 100-digit (400-bit) signed mantissa and
a two-digit signed exponent using a BCD representation.
Scott Lurndal wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which
is a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, it's information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
It needs to be comparable to binary FP:
A 64-bit double provides 53 mantissa bits, which corresponds to almost
16 decimal digits, while fp128 gives us 113 bits or a smidgen over 34
digits.
The corresponding 128-bit DFP format also provides 34 decimal digits,
with an exponent range which covers 10^-6143 to 10^6144, while the 15
exponent bits in binary128 cover 2^-16k to 2^16k, corresponding to
5.9e(+/-)4931.
I.e. the DFP format has the same precision and a larger range than
BFP.
Terje
On Tue, 04 Nov 2025 15:19:08 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, it's information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
What is not clear about 'in given size of container'?
Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
to be contained within 111 bits.
With BCD encoding one would need 133 bits.
On Tue, 4 Nov 2025 16:52:18 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Scott Lurndal wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which
is a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, it's information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
It needs to be comparable to binary FP:
A 64-bit double provides 53 mantissa bits, which corresponds to almost
16 decimal digits, while fp128 gives us 113 bits or a smidgen over 34
digits.
The corresponding 128-bit DFP format also provides 34 decimal digits,
with an exponent range which covers 10^-6143 to 10^6144, while the 15
exponent bits in binary128 cover 2^-16k to 2^16k, corresponding to
5.9e(+/-)4931.
I.e. the DFP format has the same precision and a larger range than
BFP.
Terje
Nitpick:
In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
digit of DFP =9, [relative] precision is indeed almost identical.
But in the worst case, i.e. cases where mantissa of BFP is close to 1
and MS digit of DFP =1, [relative] precision of BFP is 5 times better.
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction. The
prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
But in practice, it turned out that Intel and AMD processors had much
better performance on indirect-branch intensive workloads in the early
2000s without this architectural feature. What happened?
The IA-32 and AMD64 microarchitects implemented indirect-branch
prediction; in the early 2000s it was based on the BTB, which these
CPUs need for fast direct branching anyway. They were not content
with that, and have implemented history-based indirect branch
predictors in the meantime, which improve the performance even more.
By contrast, Power and IA-64 implementations apparently rely on
getting the target-address early enough, and typically predict that
the indirect branch will go to the current contents of the
branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
instruction stream just before the take-branch instruction, it takes
several cycles until the prepare-to-branch actually can move the
target to the branch-target register. In case of an OoO
implementation, the number of cycles tends to be longer. It's
essentially a similar latency as in a branch misprediction.
All that would not be so bad if the compilers moved the
prepare-to-branch instructions sufficiently far away from the
take-branch instruction. But gcc certainly has not done so whenever I
looked at code it generated for PowerPC or IA-64.
Here is some data for code that focusses on indirect-branch
performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:
Numbers are cycles per indirect branch, smaller is faster, the years
are the release dates of the CPUs:
First, machines from the early 2000s:
sub- in- repl.
routine direct direct switch call switch CPU year
9.6 8.0 9.5 23.1 38.6 Alpha 21264B 800MHz ~2000
4.7 8.1 9.5 19.0 21.3 Pentium III 1000MHz 2000
18.4 8.5 10.3 24.5 29.0 Athlon 1200MHz 2000
8.6 14.2 15.3 23.4 30.2 Pentium 4 2.26 2002
13.3 10.3 12.3 15.7 18.7 Itanium 2 (McKinley) 900MHz 2002
5.7 9.2 12.3 16.3 17.9 PPC 7447A 1066MHz 2004
7.8 12.8 12.9 30.2 39.0 PPC 970 2000MHz 2002
Ignore the first column (it uses call and return); the others all need
an indirect branch or indirect call ("call" column) per dispatch, with
varying amounts of other instructions; "direct" needs the fewest
instructions.
And here are results with some newer machines:
sub- in- repl.
routine direct direct switch call switch CPU year
4.9 5.6 4.3 5.1 7.64 Pentium M 755 2000MHz 2004
4.4 2.2 2.0 20.3 18.6 3.3 Xeon E3-1220 3100MHz 2011
4.0 2.3 2.3 4.0 5.1 3.5 Core i7-4790K 4400MHz 2013
4.2 2.1 2.0 4.9 5.2 2.7 Core i5-6600K 4000MHz 2015
5.7 3.2 3.9 7.0 8.6 3.7 Cortex-A73 1800MHz 2016
4.2 3.3 3.2 17.9 23.1 4.2 Ryzen 5 1600X 3600MHz 2017
6.9 24.5 27.3 37.1 33.5 36.6 Power9 3800MHz 2017
3.8 1.0 1.1 3.8 6.2 2.2 Core i5-1135G7 4200MHz 2020
The age of the Pentium M would suggest putting it into the earlier
table, but given its clear performance-per-clock advantage over the
other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
to have a history-based indirect-branch predictor.
It seems that, while the AMD64 microarchitectures improved not just in
clock rate, but also in performance per clock for this microbenchmark
(thanks to history-based indirect-branch predictors), the Power 9
still relies on its split-branch architectural feature, resulting in slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
other CPUs.
Particularly notable is the Core i5-1135G7, which takes one indirect
branch per cycle.
I have to take additional measurements with other Power and AMD64
processors.
Couldn't the Power and IA-64 CPUs use history-based branch prediction,
too? Of course, but then it would be even more obvious that the
split-branch architecture provides no benefit.
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
tell the CPU
what the target is (like VEC in My66000)
I have no idea what VEC does, but all indirect-branch architectures
are about telling the CPU what the target is.
just use a general
purpose register with a general-purpose instruction.
That turns out to be the winner.
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
If you want to be able to perform one taken branch per cycle (or
more), you always need prediction.
If you use a link register or a special instruction, the CPU could
do that.
It turns out that this does not work well in practice.
- anton
Michael S <already5chosen@yahoo.com> writes:
On Tue, 04 Nov 2025 15:19:08 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 03 Nov 2025 15:22:44 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
By decimal FP, do you mean BCD? I.e. a format where
you have a BCD exponent sign digit (BCD 'C' or 'D')
followed by two BCD exponent digits, followed by a
mantissa sign digit ('C' or 'D') followed by a variable
number of mantissa digits (1 to 100)?
I am pretty sure that by decimal FP Terje means decimal FP :-). As
defined in IEEE 754 (formerly it was in 854, but since 2008 it
became a part of the main standard).
IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
a clever variation of Base 1000 and Intel's binary.
DPD encoding is considered preferable for hardware implementations
while binary encoding is easier for software implementations.
BCD is not an option, it's information density is insufficient to
supply required semantics in given size of container.
How so? The B3500 supported 100 digit (400 bit) signed mantissa and
a two digit signed exponent using a BCD representation.
What is not clear about 'in given size of container' ?
Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
to be contained within 111 bits.
With BCD encoding one would need 133 bits.
I guess it wasn't clear that my question was regarding
the necessity of providing 'hidden' bits for BCD floating
point.
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
Or "Never bet against branch prediction".
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction. The
I first heard about this in 1982 from Burton Smith.
prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC
acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is
certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
PL/1 allows for label variables, so one can build one's own
switches (and state machines with variable paths). I used
these in a checkers-playing program in 1974.
MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
Or "Never bet against branch prediction".
I have probably mentioned this before, once or twice, but I'm actually
quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:
I had (accidentally) written the first public mention of the FDIV bug
(on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)
Due to this Intel invited me to receive an early engineering prototype
of the PentiumPro, together with an NDA-covered briefing about its architecture.
Before the start of that briefing I suggested that I should start off on
the blackboard by showing what I had been able to figure out on my own,
then I proceeded to pretty much exactly cover every single feature on
the cpu, with one glaring exception:
Based on the useful but not great branch predictor on the Pentium, I told them that I expected the P6 to employ eager execution, i.e. execute both
ways of one or two layers of branches, discarding the non-taken paths as
the branch direction info became available.
That's the point when they got to brag about how having a much, much
better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
any eager execution would have the resources for.
As you said: "Never bet against branch prediction".
Terje
On 11/4/2025 11:15 AM, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction. The
I first heard about this 1982 from Burton Smith.
prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC
acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is
certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together
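The software side of that scheme, as a sketch (dpd2bcd[] stands in for the 1024-entry decode table, whose contents come from the DPD tables in the spec / POWER manual and are not reproduced here):

  #include <stdint.h>

  static const uint16_t dpd2bcd[1024] = {
      0 /* ... the 1024 declet -> 12-bit BCD entries, omitted ... */
  };

  /* Convert n declets (3 digits each) to packed BCD; the least
     significant declet sits in the low bits of 'dpd'. */
  static uint64_t dpd_to_bcd(uint64_t dpd, int n)
  {
      uint64_t bcd = 0;
      for (int i = 0; i < n; i++) {
          uint64_t declet = (dpd >> (10 * i)) & 0x3FF;
          bcd |= (uint64_t)dpd2bcd[declet] << (12 * i);
      }
      return bcd;
  }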
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target,
and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Likely. My 66000 also has RNO; Round Nearest Random is defined but not
yet available, and Round Away from Zero is also defined and available.
Robert Finch <robfi680@gmail.com> schrieb:
Contemplating having conditional branch instructions branch to a target
value in a register instead of using a displacement.
I think this has about the same code density as having a branch to a
displacement from the IP.
Should be possible. A question is if you want to have a special
register for that (like POWER's link register), tell the CPU
what the target is (like VEC in My66000) or just use a general
purpose register with a general-purpose instruction.
Using a fused compare-and-branch instruction for Qupls4
Is that the name of your architecture, or an instruction? (That
may have been mentioned upthread, in that case I don't remember).
there is not
enough room in the instruction for a large branch displacement (10
bits). So, my thought is to branch to a register value instead.
There is already an add-to-instruction-pointer instruction that can be
used to generate relative addresses.
That makes sense.
By moving the register load outside of a loop, the dynamic instruction
count can be reduced. I think this solution is a bit better than having
compare and branch as two separate instructions, or having an extended
constant added to the branch instruction.
Are you talking about a normal loop condition or a jump out of
a loop?
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
If you use a link register or a special instruction, the CPU could
do that.
The 10-bit displacement format could also be supported, but it is yet
another branch instruction format. I may leave holes in the instruction
set for future support, but I think it is best to start with just a
single format.
Code:
AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3 ; branch to R3 if R1 < R2
Versus:
CMP R3,R1,R2
BLT R3,displacement
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
That is the problem with deleted features - compiler writers have
to support them forever, and interaction with other features can
lead to problems.
On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
Robert Finch <robfi680@gmail.com> schrieb:
Contemplating having conditional branch instructions branch to a target
value in a register instead of using a displacement.
I think this has about the same code density as having a branch to a
displacement from the IP.
Should be possible. A question is if you want to have a special
register for that (like POWER's link register), tell the CPU
what the target is (like VEC in My66000) or just use a general
purpose register with a general-purpose instruction.
Using a fused compare-and-branch instruction for Qupls4
Is that the name of your architecture, or an instruction? (That
may have been mentioned upthread, in that case I don't remember).
That was the name of the architecture, but I am being fickle and
scrapping it, restarting from the Qupls2024 architecture, updated to
Qupls2026.
there is not
enough room in the instruction for a large branch displacement (10
bits). So, my thought is to branch to a register value instead.
There is already an add-to-instruction-pointer instruction that can be
used to generate relative addresses.
That makes sense.
Using 48-bit instructions now, so there is enough room for an 18-bit displacement. Still having branch to register as well.
By moving the register load outside of a loop, the dynamic instruction
count can be reduced. I think this solution is a bit better than having
compare and branch as two separate instructions, or having an extended
constant added to the branch instruction.
Are you talking about a normal loop condition or a jump out of
a loop?
Any loop condition that needs a displacement constant. The constant
being loaded into a register.
One gotcha may be that the branch target needs to be predicted as it
cannot be calculated earlier in the pipeline.
If you use a link register or a special instruction, the CPU could
do that.
The 10-bit displacement format could also be supported, but it is yet
another branch instruction format. I may leave holes in the instruction
set for future support, but I think it is best to start with just a
single format.
Code:
AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3 ; branch to R3 if R1 < R2
Versus:
CMP R3,R1,R2
BLT R3,displacement
On 11/4/2025 3:44 PM, Terje Mathisen wrote:
MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Bottom line: History-based branch prediction has won, any kind of
delayed branches (including split-branch designs) turn out to be
a bad idea.
Or "Never bet against branch prediction".
I have probably mentioned this before, once or twice, but I'm actually
quite proud of the meeting I had with Intel Santa Clara in the spring
of 1995:
I had (accidentally) written the first public mention of the FDIV bug
(on comp.sys.intel) in Oct 1994, then together with Cleve Moler of
MathWorks/MatLab fame led the effort to develop a minimum cost sw
workaround for the bug. (My code became part of all/most x86 compiler
runtimes for the next few years.)
Due to this Intel invited me to receive an early engineering prototype
of the PentiumPro, together with an NDA-covered briefing about its
architecture.
Before the start of that briefing I suggested that I should start off
on the blackboard by showing what I had been able to figure out on my
own, then I proceeded to pretty much exactly cover every single
feature on the cpu, with one glaring exception:
Based on the useful but not great branch predictor on the Pentium I
told them that I expected the P6 to employ eager execution, i.e
execute both ways of one or two layers of branches, discarding the
non-taken paths as the branch direction info became available.
That's the point when they got to brag about how having a much, much
better branch predictor was better both from a performance and a power
viewpoint, since out of order execution could predict much deeper than
any eager execution would have the resources for.
As you said: "Never bet against branch prediction".
Branch prediction is fun.
When I looked around online before, a lot of stuff about branch
prediction was talking about fairly large and convoluted schemes for the branch predictors.
But, then always at the end of it using 2-bit saturating counters:
weakly taken, weakly not-taken, strongly taken, strongly not taken.
But, in my fiddling, there was seemingly a simple but moderately
effective strategy:
Keep a local history of taken/not-taken;
XOR this with the low-order-bits of PC for the table index;
Use a 5/6-bit finite-state-machine or similar.
Can model repeating patterns up to ~ 4 bits.
Where, the idea was that the state machine is updated with the current
state and branch direction, giving the next state and next predicted
branch direction (for this state).
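A rough sketch of that lookup in C (all sizes are arbitrary placeholders; the FSM step itself is shown for the 2-bit case further down):

  #include <stdint.h>

  #define HIST_SIZE 256u
  #define PRED_SIZE 4096u

  static uint8_t hist[HIST_SIZE];    /* recent T/NT bits per branch */
  static uint8_t state[PRED_SIZE];   /* 5/6-bit FSM state per entry */

  static unsigned pred_index(uint32_t pc)
  {
      uint8_t h = hist[(pc >> 2) & (HIST_SIZE - 1)];
      return ((pc >> 2) ^ h) & (PRED_SIZE - 1);
  }

  static void note_outcome(uint32_t pc, int taken)
  {
      /* state[pred_index(pc)] would be stepped through the FSM here,
         before the history shifts. */
      unsigned i = (pc >> 2) & (HIST_SIZE - 1);
      hist[i] = (uint8_t)((hist[i] << 1) | (taken & 1));
  }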
It could model slightly more complex patterns than the 2-bit saturating
counters, but it is sort of a partial mystery why (for mainstream
processors) more complex lookup schemes and 2-bit state were preferable
to a simpler lookup scheme and 5-bit state.
Well, apart from the relative "dark arts" needed to cram 4-bit patterns
into a 5-bit FSM (it is a bit easier if limiting the patterns to 3 bits).
Then again, I had noted before that LLMs are seemingly also not really
able to figure out how to make a 5-bit FSM to model a full set of 4-bit
patterns.
Then again, I wouldn't expect it to be all that difficult of a problem
for someone who is "actually smart"; so presumably chip designers could
have done similar.
Well, unless maybe the argument is that 5 or 6 bits of storage would
cost more than 2 bits; but then, presumably, needing significantly
larger tables (to compensate for the relative predictive weakness of
2-bit state) would have cost more than smaller tables of 6-bit state?...
Say, for example, 2b:
00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
10_0 => 10_0 //strongly not taken, dir=0
10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
11_0 => 01_1 //strongly taken, dir=0 (goes weak)
11_1 => 11_1 //strongly taken, dir=1
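Reading the prediction as the low bit of the state (which matches the right-hand suffixes above), that table comes out directly in C as:

  #include <stdint.h>

  /* next2[state][dir]: 00 weakly NT, 01 weakly T, 10 strongly NT,
     11 strongly T; the prediction is the state's low bit. */
  static const uint8_t next2[4][2] = {
      { 2, 1 },   /* 00: dir=0 -> 10, dir=1 -> 01 */
      { 0, 3 },   /* 01: dir=0 -> 00, dir=1 -> 11 */
      { 2, 0 },   /* 10: dir=0 -> 10, dir=1 -> 00 */
      { 1, 3 },   /* 11: dir=0 -> 01, dir=1 -> 11 */
  };

  static int predict2(uint8_t s)           { return s & 1; }
  static uint8_t update2(uint8_t s, int d) { return next2[s][d & 1]; }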
Can expand it to 3-bits, for 2-bit patterns
As above, and 4-more alternating states
And slightly different transition logic.
Say (abbreviated):
000 weak, not taken
001 weak, taken
010 strong, not taken
011 strong, taken
100 weak, alternating, not-taken
101 weak, alternating, taken
110 strong, alternating, not-taken
111 strong, alternating, taken
The alternating states just flip-flop between taken and not-taken.
The weak states can move between any of the 4.
The strong states are used if the pattern is reinforced.
Going up to 3-bit patterns is more of the same (add another bit,
doubling the number of states). Seemingly something goes nasty when
getting to 4-bit patterns though (one can't fit both weak and strong
states for the longer patterns, so the 4-bit patterns effectively only
exist as weak states which partly overlap with the weak states for the
3-bit patterns).
But, yeah, not going to type out state tables for these ones.
Not proven, but I suspect that an arbitrary 5-bit pattern within a 6-bit
state might be impossible. Although there would be sufficient
state-space for the looping 5-bit patterns, there may not be sufficient
state-space to distinguish whether to move from a mismatched 4-bit
pattern to a 3- or 5-bit pattern. Whereas, at least with 4-bit, any
mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
One needs to be able to express decay both to shorter patterns and to
longer patterns, and I suspect at this point the pattern breaks down
(but I can't easily confirm; it is either this or the pattern extends
indefinitely, I don't know...).
Could almost have this sort of thing as a "brain teaser" puzzle or something...
Then again, maybe other people would not find any particular difficulty
in these sorts of tasks.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this feature?
<https://riptutorial.com/fortran/example/11872/assigned-goto> says:
|It can be avoided in modern code by using procedures, internal
|procedures, procedure pointers and other features.
I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
and "indirect" use labels-as-values, whereas "switch", "call" and
"repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
these others, sometimes by a lot.
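For illustration, a minimal sketch of threaded dispatch using the GNU C labels-as-values extension (the bytecode and operations here are made up):

  #include <stdio.h>

  enum { OP_INC, OP_DEC, OP_HALT };

  static int run_threaded(const unsigned char *ip)
  {
      /* One label per opcode; dispatch is a single indirect goto. */
      static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
      int acc = 0;
      goto *dispatch[*ip];
  op_inc:  acc++; goto *dispatch[*++ip];
  op_dec:  acc--; goto *dispatch[*++ip];
  op_halt: return acc;
  }

  int main(void)
  {
      const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
      printf("%d\n", run_threaded(prog));   /* prints 1 */
      return 0;
  }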
I also find it amusing that the backbone of modern software is
a static version of label variables -- we call them switch state-
ments.
I am not sure if it's "the" backbone. Fortran has (had?) a feature
called "computed goto" that's closer to C's switch than "assigned
goto". Ironically, the gcc people usually call their labels-as-values feature "computed goto" rather than "labels as values" or "assigned
goto".
But you can be sure COBOL got them from assembly language programmers.
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
- anton
For the Intel binary-mantissa dfp128, normalization is the hard issue.
Michael S has figured out some really nice tricks to speed it up,
but when you have a (worst case) temporary 220+ bit product mantissa,
scaling is not that easy.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
I played around with the formulas from the POWER manual a bit,
using Berkeley abc for logic optimization, for the conversion
of the packed modulo 1000 to three BCD digits.
Without spending too much effort, I arrived at four gate delays
(INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
for speed, or five gate delays optimizing for space.
I strongly suspect that IBM is doing something similar :-)
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Thomas Koenig <tkoenig@netcologne.de> writes:
Should be possible. A question is if you want to have a special
register for that (like POWER's link register),
There is this idea of splitting an (indirect) branch into a
prepare-to-branch instruction and a take-branch instruction.
(I first heard about this in 1982 from Burton Smith.)
The prepare-to-branch instruction announces the branch target to the CPU,
and Power's mtlr and mtctr are examples of that (somewhat muddled by
the fact that the ctr register can also be used for counted loops as
well as for indirect branches), and IA-64's branch-target registers
and the instructions that move there are another example. AFAIK SPARC
acquired something in this direction (touted as good for accelerating
Java) in the early 2000s. The take-branch instruction on Power is
blr/bctr.
I used to think that this kind of splitting is a good idea, and it is
certainly better than a branch-delay slot or a branch with a fixed
number of delay slots.
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers-playing program in 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
Alter/Goto in COBOL and Assigned GOTO in Fortran?
Probably.
I find it somewhat amusing that modern languages moved away from
label variables and into method calls -- which if you look at it
from 5,000 feet/metres -- is just a more expensive "label".
I also find it amusing that the backbone of modern software is
a static version of label variables -- we call them switch statements.
But you can be sure COBOL got them from assembly language programmers.
On Tue, 4 Nov 2025 22:52:46 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
For the Intel binary mantissa dfp128 normalization is the hard issue,
Michael S have figured out some really nice tricks to speed it up,
I remember that I played with that, but don't remember what I did
exactly. I dimly recollect that the fastest solution was relatively
straightforward. The point was to minimize the length of the dependency
chains rather than the total number of multiplications.
An important point here is that I played on relatively old x86-64
hardware. My solution is not necessarily optimal for newer hardware.
The differences between old and new are two-fold, and they push the
optimal solution in different directions.
1. Increase in throughput of integer multiplier
2. Decrease in latency of integer division
The first factor suggests even more intense push toward "eager"
solutions.
The second factor suggests, possibly, much simpler code, especially in
the common case of division by 1 to 27 decimal digits (5**27 < 2**64).
As they say: sometimes a division is just a division.
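One reading of that last remark, as a sketch: for up to 27 digits the
odd part 5**k of the divisor still fits in 64 bits (5**27 < 2**64), so
a truncating division of a wide intermediate by 10**k is one shift plus
one division (unsigned __int128 is the gcc/clang extension):

#include <stdint.h>

static uint64_t pow5[28];              /* 5**0 .. 5**27 */

static void init_pow5(void)
{
    pow5[0] = 1;
    for (int i = 1; i < 28; i++)
        pow5[i] = pow5[i - 1] * 5;
}

/* floor(x / 10**k) == floor(floor(x / 2**k) / 5**k) */
static unsigned __int128 div_pow10(unsigned __int128 x, unsigned k)
{
    return (x >> k) / pow5[k];
}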
Thomas Koenig <tkoenig@netcologne.de> writes:
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it
<https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
That is the problem with deleted features - compiler writers have
to support them forever, and interaction with other features can
lead to problems.
So does gfortran support assigned goto, too? What problems in
interaction with other features do you see?
- anton
On Tue, 04 Nov 2025 22:51:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job
packing that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together
Are we talking about hardware or software?
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th
edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
gcc prevent that at compile time? If not, I would expect the semantics
to be Undefined Behavior, the usual cop-out when nothing useful can be
said.
(In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)
Niklas
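For reference, gcc does not reject the cross-function case at compile
time; the following (contrived) GNU C compiles, and the gcc manual only
warns that jumping to a label in a different function makes "totally
unpredictable things happen":

static void *escaped;       /* a label value outliving its function */

void f(void)
{
    escaped = &&inside;     /* GNU C: unary && takes a label's address */
    return;
inside:
    __builtin_printf("inside f\n");
}

void g(void)
{
    goto *escaped;          /* jumps into another function's body */
}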
On 11/5/2025 9:26 AM, Niklas Holsti wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th
edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
(In an earlier discussion on this group, some years ago, I explained
how labels-as-values could be added to Ada, using the type system to
ensure safe and defined semantics. But I don't think such an extension
would be accepted for the Ada standard.)
My guess here:
It is an "oh crap" situation and program either immediately or (maybe
not as immediately) explodes...
Otherwise, it would need to function more like a longjmp, which would
mean that it would likely be painfully slow.
On 2025-11-03 2:03 p.m., MitchAlsup wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Actually, for the five required basic operations, you can always do the
op in the next higher precision, then round again down to the target,
and get exactly the same result.
https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf
The PowerISA version 3.0 introduced rounding to odd for its 128-bit
floating point arithmetic, for that very reason (I assume).
Likely. My 66000 also has RNO, and
Round Nearest Random is defined but not yet available;
Round Away from Zero is also defined and available.
Round nearest random?
How about round externally guided (RXG) by an
input signal?
For instance, the rounding could come from a feedback
filter of some sort.
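Round-to-odd itself is tiny; the trick (proved in the paper linked
above) is that ORing the sticky information into the LSB leaves enough
behind that a later round-to-nearest cannot double-round. On a raw
significand, a sketch:

#include <stdint.h>

/* Drop 'drop' low bits of a significand with round-to-odd (RNO):
   if anything nonzero was discarded, force the result's LSB to 1. */
static uint64_t round_to_odd(uint64_t sig, unsigned drop)
{
    uint64_t kept = sig >> drop;
    uint64_t lost = sig & ((1ull << drop) - 1);
    return kept | (lost != 0);
}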
On 11/4/2025 3:44 PM, Terje Mathisen wrote:
MitchAlsup wrote:
As you said: "Never bet against branch prediction".
Branch prediction is fun.
When I looked around online before, a lot of stuff about branch
prediction was talking about fairly large and convoluted schemes for the branch predictors.
But, then always at the end of it using 2-bit saturating counters:
weakly taken, weakly not-taken, strongly taken, strongly not taken.
But, in my fiddling, there was seemingly a simple but moderately
effective strategy:
Keep a local history of taken/not-taken;
XOR this with the low-order-bits of PC for the table index;
Use a 5/6-bit finite-state-machine or similar.
Can model repeating patterns up to ~ 4 bits.
Where, the idea was that the state machine is updated with the current
state and branch direction, giving the next state and next predicted
branch direction (for this state).
This could model slightly more complex patterns than the 2-bit
saturating counters, but it is a partial mystery why, for mainstream
processors, more complex lookup schemes with 2-bit state were
preferable to a simpler lookup scheme with 5-bit state.
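A sketch of that scheme in C, with guessed table sizes and hashing, and
with a plain 2-bit saturating counter standing in for the 5/6-bit FSM
(whose transition table is the interesting, implementation-specific
part):

#include <stdint.h>
#include <stdbool.h>

#define BITS 12
#define MASK ((1u << BITS) - 1)

static uint8_t state[1 << BITS];   /* per-entry predictor state */
static uint8_t hist[1 << BITS];    /* per-branch local history  */

static bool predict(uint32_t pc)
{
    uint32_t idx = ((pc >> 2) ^ hist[(pc >> 2) & MASK]) & MASK;
    return state[idx] >= 2;        /* states 2,3 = predict taken */
}

static void update(uint32_t pc, bool taken)
{
    uint32_t h   = (pc >> 2) & MASK;
    uint32_t idx = ((pc >> 2) ^ hist[h]) & MASK;
    if (taken  && state[idx] < 3) state[idx]++;
    if (!taken && state[idx] > 0) state[idx]--;
    hist[h] = (uint8_t)((hist[h] << 1) | taken);  /* shift in outcome */
}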
Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient
state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
One needs to be able to express decay both to shorter patterns and to
longer patterns, and I suspect at this point, the pattern breaks down
(but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).
On 2025-11-05 1:47 a.m., Robert Finch wrote:
I am now modifying Qupls2024 into Qupls2026 rather than starting a completely new ISA. The big difference is Qupls2024 uses 64-bit
instructions and Qupls2026 uses 48-bit instructions making the code 25%
more compact with no real loss of operations.
Qupls2024 also used 8-bit register specs. This was a bit of overkill and
not really needed. Register specs are reduced to 6 bits. Right away that
reduced most instructions by eight bits.
I decided I liked the dual operations that some instructions supported, which need a wide instruction format.
One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
word. I may maintain this for Qupls2026, but that means that a max
constant override of 48-bits would be supported. A 64-bit constant can
still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.
I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.
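Written out as C arithmetic (the chunk widths here are invented; the
post does not give them), the shift-and-add build amounts to:

#include <stdint.h>

/* Three-instruction 64-bit constant build, add-immediate-with-shift
   style, with hypothetical 21/22/21-bit immediate chunks. */
static uint64_t build64(uint32_t hi21, uint32_t mid22, uint32_t lo21)
{
    uint64_t r = hi21;           /* 1: load top chunk         */
    r = (r << 22) | mid22;       /* 2: shift, add next chunk  */
    r = (r << 21) | lo21;        /* 3: shift, add low chunk   */
    return r;
}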
On 11/4/2025 11:17 PM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers-playing program in 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this feature?
<https://riptutorial.com/fortran/example/11872/assigned-goto> says:
|It can be avoided in modern code by using procedures, internal
|procedures, procedure pointers and other features.
I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
and "indirect" use labels-as-values, whereas "switch", "call" and
"repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform these others, sometimes by a lot.
I usually used call threading, because:
In my testing it was one of the faster options;
At least if excluding 32-bit x86,
which often has slow function calls.
Because pretty much every function needs a stack frame, ...
It is usable in standard C.
Often "while loop and switch()" was notably slower than using unrolled
lists of indirect function calls (usually with the main dispatch loop
based on "traces", which would call each of the opcode functions and
then return the next trace to be run).
Granted, "while loop and switch" is the more traditional way of writing
an interpreter.
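A sketch of that trace-based call threading (the names are mine, not
BGB's actual code): each trace is an unrolled list of opcode function
pointers, and running a trace returns the next one, so the outer
dispatch loop is trivial:

typedef struct VM VM;
typedef struct Trace Trace;
typedef void (*OpFn)(VM *);

struct VM    { long acc; Trace *next; };
struct Trace { int n; OpFn ops[16]; };

static void op_inc(VM *vm) { vm->acc++; }      /* example opcode     */
static void op_end(VM *vm) { vm->next = 0; }   /* example terminator */

static Trace *run_trace(VM *vm, Trace *t)
{
    for (int i = 0; i < t->n; i++)
        t->ops[i](vm);     /* unrolled indirect calls          */
    return vm->next;       /* branch/end opcodes set vm->next  */
}

static void interp(VM *vm, Trace *entry)
{
    for (Trace *t = entry; t; t = run_trace(vm, t))
        ;                  /* the entire dispatch loop */
}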
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41-bits due to the 48-bit instruction word size. The larger constants would require more instruction words to be available to
be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction, another couple of opcodes are used to represent constant extensions.
On 2025-11-05 18:23, BGB wrote:
On 11/5/2025 9:26 AM, Niklas Holsti wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th
edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
(In an earlier discussion on this group, some years ago, I explained
how labels-as-values could be added to Ada, using the type system to
ensure safe and defined semantics. But I don't think such an extension
would be accepted for the Ada standard.)
My guess here:
It is an "oh crap" situation and program either immediately or (maybe
not as immediately) explodes...
Or silently produces wrong results.
Otherwise, it would need to function more like a longjmp, which would
mean that it would likely be painfully slow.
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
Niklas
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
On 2025-11-05 18:23, BGB wrote:
On 11/5/2025 9:26 AM, Niklas Holsti wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the 6th
edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
(In an earlier discussion on this group, some years ago, I explained
how labels-as-values could be added to Ada, using the type system to
ensure safe and defined semantics. But I don't think such an extension
would be accepted for the Ada standard.)
My guess here:
It is an "oh crap" situation and program either immediately or (maybe
not as immediately) explodes...
Or silently produces wrong results.
Otherwise, it would need to function more like a longjmp, which would
mean that it would likely be painfully slow.
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
But YOU had to pass the jumpbuf out of the setjump() scope.
Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.
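The foot-gun, for the record (C11 7.13.2.1 makes a longjmp to a
function that has already returned undefined behavior):

#include <setjmp.h>

static jmp_buf saved;      /* the jumpbuf escapes the setjmp scope */

static int f(void)
{
    return setjmp(saved);  /* f's frame dies when f returns */
}

static void g(void)
{
    if (f() == 0)
        longjmp(saved, 1); /* target frame no longer exists: UB */
}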
Robert Finch <robfi680@gmail.com> posted:
On 2025-11-05 1:47 a.m., Robert Finch wrote:
I am now modifying Qupls2024 into Qupls2026 rather than starting a
completely new ISA. The big difference is Qupls2024 uses 64-bit
instructions and Qupls2026 uses 48-bit instructions making the code 25%
more compact with no real loss of operations.
Qupls2024 also used 8-bit register specs. This was a bit of overkill and
not really needed. Register specs are reduced to 6-bits. Right-away that
reduced most instructions eight bits.
4 register specifiers: check.
I decided I liked the dual operations that some instructions supported,
which need a wide instruction format.
With 48-bits, if you can get 2 instructions 50% of the time, you are only
12% bigger than a 32-bit ISA.
One gotcha is that 64-bit constant overrides need to be modified. For
Qupls2024 a 64-bit constant override could be specified using only a
single additional instruction word. This is not possible with 48-bit
instruction words. Qupls2024 only allowed a single additional constant
word. I may maintain this for Qupls2026, but that means that a max
constant override of 48-bits would be supported. A 64-bit constant can
still be built up in a register using the add-immediate with shift
instruction. It is ugly and takes about three instructions.
It was that sticky problem of constants that drove most of My 66000
ISA style--variable length and how to encode access to these constants
and routing thereof.
Motto: never execute any instructions fetching or building constants.
I could reduce the 64-bit constant build to two instructions by adding a
load-immediate instruction.
May I humbly suggest this is the wrong direction.
Robert Finch <robfi680@gmail.com> posted:
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to
be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction,
another couple of opcodes are used to represent constant extensions.
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range; instruction<8:5> specify the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
Here we have access to {5, 32, 64}-bit constants, 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
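Read as decode logic, the table is a 16-way mux. A C sketch, with the
field extraction and the fetching of the trailing 32/64-bit constant
words assumed to be done elsewhere (immN stands for whichever constant
the row selects):

#include <stdint.h>

static void route(unsigned bits8_5, int64_t rs1, int64_t rs2,
                  int64_t imm5, int64_t immN,
                  int64_t *s1, int64_t *s2)
{
    switch (bits8_5 & 15) {
    case  0: *s1 =  rs1;  *s2 =  rs2;  break;
    case  1: *s1 =  rs1;  *s2 = -rs2;  break;
    case  2: *s1 = -rs1;  *s2 =  rs2;  break;
    case  3: *s1 = -rs1;  *s2 = -rs2;  break;
    case  4: *s1 =  rs1;  *s2 =  imm5; break;
    case  5: *s1 =  imm5; *s2 =  rs2;  break;
    case  6: *s1 = -rs1;  *s2 = -imm5; break;
    case  7: *s1 =  imm5; *s2 = -rs2;  break;
    case  8: *s1 =  rs1;  *s2 =  immN; break;   /* Imm32 rows */
    case  9: *s1 =  immN; *s2 =  rs2;  break;
    case 10: *s1 = -rs1;  *s2 =  immN; break;
    case 11: *s1 =  immN; *s2 = -rs2;  break;
    case 12: *s1 =  rs1;  *s2 =  immN; break;   /* Imm64 rows */
    case 13: *s1 =  immN; *s2 =  rs2;  break;
    case 14: *s1 = -rs1;  *s2 =  immN; break;
    default: *s1 =  immN; *s2 = -rs2;  break;
    }
}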
Michael S <already5chosen@yahoo.com> posted:
On Tue, 04 Nov 2025 22:51:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job
packing that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together
Are we talking about hardware or software?
A SW solution based on how it would be done in HW.
On 11/4/2025 9:17 PM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers playing program 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it
<https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this
feature?
Because it could, and often did, make the code "unfollowable". That is,
you are reading the code, following it to try to figure out what it is
doing, and come to an assigned/alter goto, and you don't know where to
go next. The value was set some place else in the code, who knows where,
and thus what value it was set to, and people/programmers just aren't
used to being able to follow code like that.
BTW, you mentioned that it could be implemented as an indirect jump. It
could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self-modifying
code is just bad.
Fortran also had the computed GOTO, as did COBOL (GO TO ... DEPENDING
ON), but those features didn't suffer the problems of assigned/alter
gotos.
On Wed, 5 Nov 2025 17:26:44 +0200
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the
6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics
and not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out when
nothing useful can be said.
Yes, UB sounds like the best answer.
On 2025-11-06 11:43, Michael S wrote:
On Wed, 5 Nov 2025 17:26:44 +0200
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
[ snip ]
Yes, assigned goto and labels-as-values (and probably the Cobol
alter/goto and PL/1 label variables) are there because computer
architectures have indirect branches and the programming language
designer wanted to give the programmers a way to express what they
would otherwise have to express in assembly language.
Why does standard C not have it? C had it up to and including the
6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went
away between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of
the) label to which the value refers", which is machine-level
semantics and not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value"
statement where the label-value refers to a label in a different
function. Does gcc prevent that at compile time? If not, I would
expect the semantics to be Undefined Behavior, the usual cop-out
when nothing useful can be said.
Yes, UB sounds like the best answer.
The point is that Ritchie was not satisfied with that answer, which
is why he removed labels-as-values from his version of C. I doubt
that Stallman had any better answer for gcc, but he did not care.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/4/2025 9:17 PM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 11/4/2025 11:15 AM, MitchAlsup wrote:
PL/1 allows for Label variables so one can build their own
switches (and state machines with variable paths). I used
these in a checkers-playing program in 1974.
Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
Alter/Goto in COBOL and Assigned GOTO in Fortran?
Assigned GOTO has been deleted from the Fortran standard (in Fortran
95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it
<https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
What makes you think that it is "rightly" to deprecate or delete this
feature?
Because it could, and often did, make the code "unfollowable". That is,
you are reading the code, following it to try to figure out what it is
doing and come to an assigned/alter goto, and you don't know where to go
next. The value was set some place else in the code, who knows where,
and thus what value it was set to, and people/programmers just aren't
used to being able to follow code like that.
Take an example use: A VM interpreter. With labels-as-values it looks
like this:
void engine(char *source)
{
void *insts[] = {&&add, &&load, &&store, ...};
void **ip=compile_to_vm_code(source,insts);
goto *ip++;
add:
...
goto *ip++;
load:
...
goto *ip++;
store:
...
goto *ip++;
...
}
So of course you don't know where one of the gotos goes to, because
that depends on the VM code, which depends on the source code.
Now let's see how it looks with switch:
void engine(char *source)
{
typedef enum {add, load, store,...} inst;
inst *ip=compile_to_vm_code(source);
for (;;) {
switch (*ip++) {
case add:
...
break;
case load:
...
break;
case store:
...
break;
...
}
}
}
Do you know any better which of the "..." is executed next? Of course
not, for the same reason. Likewise for call threading, but there the
VM instruction implementations can be distributed across many source
files. With the replicated switch, the problem of predictability is
the same, but there is lots of extra code, with many direct gotos.
If you implement, say, a state machine using labels-as-values, or
switch, again, the logic behind it is the same and the predictability
is the same between the two implementations.
BTW, you mentioned that it could be implemented as an indirect jump. It
could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self modifying
code is just bad.
On such architectures switch would also be implemented by modifying
the code,
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
One interesting aspect here is that the Fortran assigned goto and GNU
C's goto * (to go with labels-as-values) look more like something that
may have been inspired by a modern indirect branch than by
self-modifying code.
I only dimly remember the Cobol thing, but IIRC
this looked more like something that's intended to be implemented by
self-modifying code. I don't know what the PL/I solution looked like.
Fortran also had the computed GOTO, as did COBOL (GO TO ... DEPENDING
ON), but those features didn't suffer the problems of assigned/alter
gotos.
As demonstrated above, they do.
And if you fall back to using ifs, it
does not get any better, either.
- anton
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job packing
that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
I played around with the formulas from the POWER manual a bit,
using Berkeley abc for logic optimization, for the conversion
of the packed modulo 1000 to three BCD digits.
Without spending too much effort, I arrived at four gate delays
(INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
for speed, or five gate delays optimizing for space.
Since the gates hang off flip-flops, you don't need the inv gate
at the front. Flip-flops can easily give both true and complement
outputs.
On 2025-11-05 7:17, Anton Ertl wrote:
Stallman obviously knew what to say about their semantics when he
added labels-as-values to GNU C with gcc 2.0.
I don't know what Stallman said, or would have said if asked, but I
guess something like "the semantics is a jump to the (address of the)
label to which the value refers", which is machine-level semantics and
not semantics in the abstract C machine.
The problem in the abstract C machine is a "goto label-value" statement
where the label-value refers to a label in a different function. Does
gcc prevent that at compile time? If not, I would expect the semantics
to be Undefined Behavior, the usual cop-out when nothing useful can be said.
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
On 2025-11-05 23:28, MitchAlsup wrote:
Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.
But YOU had to pass the jumpbuf out of the setjump() scope.
Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.
That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
The discussion above shows that whether a label value is implemented as
a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
cases can be excluded at compile time.
The undefined cases could be excluded at compile-time, even in C, by
requiring all label-valued variables to be local to some function and
forbidding passing such values as parameters or function results. In
addition, the use of an uninitialized label-valued variable should be
prevented or detected. Perhaps Anton could accept such restrictions.
Niklas
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
So does gfortran support assigned goto, too?
Yes.
What problems in
interaction with other features do you see?
In this case, it is more the problem of modern architectures.
On 32-bit architectures, it might have been possible to stash
the address of a jump target in an actual INTEGER variable and
GO TO there. On a 64-bit architecture, this is not possible, so
you need to have a shadow variable for the pointer.
On 2025-11-05 4:21 p.m., MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to
be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction,
another couple of opcodes are used to represent constant extensions.
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range; instruction<8:5> specify the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
What happens if one tries to use an unsupported combination?
Here we have access to {5, 32, 64}-bit constants, 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
I just realized that Qupls2026 does not accommodate small constants very
well, except for a few instructions like shift and bitfield instructions,
which have special formats. Sure, constants can be made to override
register specs, but they take up a whole additional word. I am not sure
how big a deal this is as there are also immediate forms of instructions with the constant encoded in the instruction, but these do not allow
operand routing. There is a dedicated subtract from immediate
instruction. A lot of other instructions are commutative, so operand
routing is not needed.
Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
are available for shifts and bitfield ops. Leaving the 130-bit constants
out for now. They may be useful for 128-bit SIMD against constant operands.
The constant routing issue could maybe be fixed, as there are still 30+
free opcodes. But there need to be more routing bits with three source
operands. All the permutations may get complicated to encode and allow
for in the compiler. One may want to permute two registers and a
constant, or two constants and a register, and then three or four
different sizes.
Qupls strives to be the low-cost processor.
On 11/5/2025 1:21 PM, MitchAlsup wrote:
Robert Finch <robfi680@gmail.com> posted:
Qupls2026 currently supports 48-bit inline constants. I am debating
whether to support 89 and 130-bit inline constants as well. Constant
sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to
be processed in decode. Not sure if it is even possible to pass a
constant larger than 64-bits in the machine.
I just realized that constant operand routing was already in Qupls, I
had just not specifically identified it. The operand routing bits are
just moved into a postfix instruction word rather than the first
instruction word. This gives more bits available in the instruction
word. Rather than burn a couple of bits in every R3 type instruction,
another couple of opcodes are used to represent constant extensions.
My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
that can supply constants and perform operand routing. Within this
range; instruction<8:5> specify the following table:
0 0 0 0 +Src1 +Src2
0 0 0 1 +Src1 -Src2
0 0 1 0 -Src1 +Src2
0 0 1 1 -Src1 -Src2
0 1 0 0 +Src1 +imm5
0 1 0 1 +Imm5 +Src2
0 1 1 0 -Src1 -Imm5
0 1 1 1 +Imm5 -Src2
1 0 0 0 +Src1 Imm32
1 0 0 1 Imm32 +Src2
1 0 1 0 -Src1 Imm32
1 0 1 1 Imm32 -Src2
1 1 0 0 +Src1 Imm64
1 1 0 1 Imm64 +Src2
1 1 1 0 -Src1 Imm64
1 1 1 1 Imm64 -Src2
Here we have access to {5, 32, 64}-bit constants, 16-bit constants
come from different OpCodes.
Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
On Wed, 05 Nov 2025 21:06:16 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Tue, 04 Nov 2025 22:51:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Thomas Koenig <tkoenig@netcologne.de> posted:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
I still think the IBM DFP people did an impressively good job
packing that much data into a decimal representation. :-)
Yes, that modulo 1000 packing is quite clever. It is relatively
cheap to implement in hardware (which is the point, of course).
Not sure how easy it would be in software.
Brain dead easy: 1 table of 1024 entries each 12-bits wide,
1 table of 4096 entries each 10-bits wide,
isolate the 10-bit field, LD the converted value.
isolate the 12-bit field, LD the converted value.
Other than "crap loads" of {deMorganizing and gate optimization}
that is essentially what HW actually does.
You still need to build 12-bit decimal ALUs to string together
Are we talking about hardware or software?
A SW solution based on how it would be done in HW.
Then I suspect that you didn't understand Thomas Koenig's objection.
1. The format of interest is Decimal128: https://en.wikipedia.org/wiki/Decimal128_floating-point_format
2. According to my understanding, Thomas didn't suggest that a *slow*
software implementation of DPD-encoded DFP, i.e. an implementation that
only cares about correctness, is hard.
3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
software implementation, one comparable in speed (say, within a factor
of 1.5-2) to a competent implementation of the same DFP operations in
BID format, is not easy, if possible at all.
4. All said above assumes an absence of HW assists.
BTW, at least for multiplication, I would probably not do my
arithmetic in the BCD domain.
Instead, I'd convert the DPD declets to two Base_1e18 digits (11
lookups per operand, 22 lookups total, plus ~40 shifts, ~20 ANDs and
~20 additions).
Then I'd do multiplication and normalization and rounding in Base_1e18.
Then I'd convert from Base_1e18 to Base_1000. The ideas of such a
conversion are similar to the fast binary-to-BCD conversion that I
demonstrated here a decade or so ago. AVX2 could be quite helpful at
that stage.
Then I'd have to convert the result from Base_1000 to DPD. Here, again,
11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
Maybe, at that stage, SIMD gather can be of help, but I have my doubts.
So far, every time I tried gather I was disappointed with performance.
Overall, even with a seemingly decent plan like the one sketched above, I'd
expect DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
in the past my early performance estimates were wrong quite often.
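For concreteness, a sketch of the unpacking step described above (the 11
lookups plus shifts/ANDs/adds), assuming a hypothetical dpd_to_bin[] table
mapping a declet to its binary value 0..999, and taking the 11 declets and
the leading combination-field digit as already extracted:

    #include <stdint.h>

    extern const uint16_t dpd_to_bin[1024];   /* declet -> 0..999, assumed */

    /* Decimal128 significand as two base-1e18 limbs: 16 + 18 = 34 digits. */
    typedef struct { uint64_t hi, lo; } sig_1e18;

    static sig_1e18 dpd_unpack(const uint16_t declet[11], unsigned msd)
    {
        sig_1e18 s = { msd, 0 };
        for (int i = 10; i >= 6; i--)   /* MSD + top 5 declets: 16 digits */
            s.hi = s.hi * 1000 + dpd_to_bin[declet[i]];
        for (int i = 5; i >= 0; i--)    /* low 6 declets: 18 digits */
            s.lo = s.lo * 1000 + dpd_to_bin[declet[i]];
        return s;                       /* 11 lookups, as estimated above */
    }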
Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
EricP <ThatWouldBeTelling@thevillage.com> writes:
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
Does the assigned goto support that?
What about regular goto and
computed goto?
Thomas Koenig <tkoenig@netcologne.de> writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
So does gfortran support assigned goto, too?
Yes.
Cool.
What problems in
interaction with other features do you see?
In this case, it is more a problem of modern architectures.
On 32-bit architectures, it might have been possible to stash
the address of a jump target in an actual INTEGER variable and
GO TO there. On a 64-bit architecture, this is not possible, so
you need to have a shadow variable for the pointer.
Implementation options that come to my mind are:
1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
variable is sufficient. AFAIK on some 64-bit architectures the
default memory model puts the code in the bottom 4GB or 2GB.
2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
variable. 32 bits should be enough for that.
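A sketch of option 2 in GNU C, closely mirroring the labels-as-values
idiom from the GCC manual: the "label variable" holds a 32-bit offset
from a base label rather than a full 64-bit address:

    static int dispatch(int which)
    {
        static const int off[] = {   /* offsets from the base label */
            &&L10 - &&L10,
            &&L20 - &&L10,
            &&L30 - &&L10,
        };
        int target = off[which];     /* fits a 32-bit INTEGER variable */
        goto *(&&L10 + target);
    L10: return 10;
    L20: return 20;
    L30: return 30;
    }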
Of course, if Fortran
assigns labels between shared libraries and the main program, that
approach probably does not work, but does anybody really do that?
How does ifort deal with this problem?
- anton
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
Some time ago, we discussed using the 5 bit immediates in floating point
instructions as an index to an internal ROM with frequently used
constants. The idea is that it would save some space in the instruction
stream. Are you implementing that, and if not, why not?
I did some statistics on which floating point constants occurred how
often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
constants you are not likely to find often in a random sample of
other packages :-) Perl has very little floating point. gnuplot
is also special in its own way, of course.
A few constants occur quite often, but there are a lot of
differences between the floating point constants for different
programs, to nobody's surprise (presumably).
Here is the head of an output of a little script I wrote to count
all floating-point constants from My66000 assembler. Note that
the compiler is for the version that does not yet do 0.5 etc as
floating point. The first number is the number of occurrences,
the second one is the constant itself.
5-bit constants: 886
32-bit constants: 566
64-bit constants: 597
303 0
290 1
96 0.5
81 6
58 -1
58 1e-14
49 2
46 -2
45 -8.98846567431158e+307
44 10
44 255
37 8.98846567431158e+307
29 -0.5
28 3
27 90
27 360
26 -1e-05
21 0.0174532925199433
20 0.9
18 -3
17 180
17 0.1
17 0.01
[...]
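If one did implement the 5-bit-index ROM Stephen asked about, the counts
above suggest contents along these lines; the selection and ordering here
are purely illustrative:

    /* Illustrative only: a 32-entry constant ROM filled with the most
       frequent values from the counts above. */
    static const double fp_rom[32] = {
        0.0, 1.0, 0.5, 6.0, -1.0, 1e-14, 2.0, -2.0,
        -8.98846567431158e+307, 10.0, 255.0, 8.98846567431158e+307,
        -0.5, 3.0, 90.0, 360.0, -1e-05, 0.0174532925199433,
        0.9, -3.0, 180.0, 0.1, 0.01,
        /* remaining entries free for other common values */
    };

    /* Decode side: the 5-bit immediate selects the constant. */
    static inline double fp_imm5(unsigned imm5) { return fp_rom[imm5 & 31]; }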
That is not the issue. The question is whether the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
So, label-variables are hard to define, but function-variables are not ?!?
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
Does the assigned goto support that? What about regular goto and
computed goto?
- anton
I didn't mean to imply that it did.
As far as I remember, Fortran 77 does not allow it.
I never used later Fortrans.
I hadn't given the dynamic branch topic any thought until you raised it
and this was just me working through the things a compiler might have
to deal with.
I have written jump dispatch table code myself where the destinations
came from symbols external to the routine, but I had to switch to
inline assembler for this as MS C does not support goto variables,
and it was up to me to make sure the registers were all handled correctly.
That is not the issue. The question is whether the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.
The discussion above shows that whether a label value is implemented as
a bare code address, or as a jumpbuf, some cases will have Undefined
Behavior semantics. So I think Ritchie was right, unless the undefined
cases can be excluded at compile time.
The undefined cases could be excluded at compile time, even in C, by
requiring all label-valued variables to be local to some function and
forbidding passing such values as parameters or function results. In
addition, the use of an uninitialized label-valued variable should be
prevented or detected.
EricP <ThatWouldBeTelling@thevillage.com> posted:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Where this might be a problem is if the label variable was a
global symbol and the target labels were in other name spaces.
At that point it could treat it like a pointer to a function and
have to spill all live register variables to memory.
Does the assigned goto support that? What about regular goto and
computed goto?
- anton
I didn't mean to imply that it did.
As far as I remember, Fortran 77 does not allow it.
I never used later Fortrans.
I hadn't given the dynamic branch topic any thought until you raised it
and this was just me working through the things a compiler might have
to deal with.
I have written jump dispatch table code myself where the destinations
came from symbols external to the routine, but I had to switch to
inline assembler for this as MS C does not support goto variables,
Oh sure it does--it is called Return-Oriented-Programming.
You take the return address off the stack and insert your
go-to label on the stack and then just return.
Or you could do some "foul play" on a jumpbuf and longjump.
{{Be careful not to shoot yourself in the foot.}}
and it was up to me to make sure the registers were all handled correctly.
After 4 years of looking, we are still waiting for a single function
that needs more than a scaled 16-bit displacement from current IP
{±17-bits} to reach all labels within the function.
On 2025-11-06 11:43, Michael S wrote:...
On Wed, 5 Nov 2025 17:26:44 +0200
Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:
On 2025-11-05 7:17, Anton Ertl wrote:
Why does standard C not have it? C had it up to and including the
6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
between 6th and 7th edition. Ritchie wrote
<37178013.A1EE3D4F@bell-labs.com>:
| I eliminated them because I didn't know what to say about their
| semantics.
Yes, UB sounds like the best answer.
The point is that Ritchie was not satisfied with that answer, which is
why he removed labels-as-values from his version of C.
On 2025-11-06 10:46, Anton Ertl wrote:
[Fortran's assigned goto]
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
Because it could, and often did, make the code "unfollowable". That is,
you are reading the code, following it to try to figure out what it is
doing and come to an assigned/alter goto, and you don't know where to go
next. The value was set some place else in the code, who knows where,
and thus what value it was set to, and people/programmers just aren't
used to being able to follow code like that.
Take an example use: A VM interpreter. With labels-as-values it looks
like this:
void engine(char *source)
{
  void *insts[] = {&&add, &&load, &&store, ...};
  void **ip = compile_to_vm_code(source, insts);
  goto **ip++;
add:
  ...
  goto **ip++;
load:
  ...
  goto **ip++;
store:
  ...
  goto **ip++;
  ...
}
So of course you don't know where one of the gotos goes to, because
that depends on the VM code, which depends on the source code.
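Filling in the elided parts with a trivial hard-coded program (standing
in for compile_to_vm_code(), which is not shown here) gives a
self-contained GNU C version of the same technique:

    #include <stdio.h>

    /* The VM "program" computes 2 + 3; casts assume an LP64-ish target. */
    static long run(void)
    {
        void *prog[6];
        void *insts[] = { &&lit, &&add, &&halt };
        prog[0] = insts[0]; prog[1] = (void *)2;   /* lit 2 */
        prog[2] = insts[0]; prog[3] = (void *)3;   /* lit 3 */
        prog[4] = insts[1];                        /* add   */
        prog[5] = insts[2];                        /* halt  */

        void **ip = prog;
        long tos = 0, nos = 0;          /* a tiny two-element stack */
        goto **ip++;
    lit:
        nos = tos; tos = (long)*ip++;   /* fetch inline literal */
        goto **ip++;
    add:
        tos = nos + tos;
        goto **ip++;
    halt:
        return tos;
    }

    int main(void) { printf("%ld\n", run()); return 0; }   /* prints 5 */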
I'm not sure if you are trolling or serious, but I will assume the latter.
The point is that without a deep analysis of the program you cannot be
sure that these goto's actually go to one of the labels in the engine()
function, and not to some other location in the code, perhaps in some
other function. That analysis would have to discover that the
compile_to_vm_code() function returns a pointer to a vector of addresses
picked from the insts[] vector. That could need an analysis of many
functions called from compile_to_vm_code(), the history of the whole
program execution, and so on. NOT easy.
On 11/6/2025 12:46 AM, Anton Ertl wrote:
If you implement, say, a state machine using labels-as-values, or
switch, again, the logic behind it is the same and the predictability
is the same between the two implementations.
Nick responded better than I could to this argument, demonstrating how
it isn't true. As I said, in the hands of a good programmer, you might
assume that the goto goes to one of those labels, but you can't be sure
of it.
BTW, you mentioned that it could be implemented as an indirect jump. It
could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self modifying
code is just bad.
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone
architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
And, by and large, they have.
One interesting aspect here is that the Fortran assigned goto and GNU
C's goto * (to go with labels-as-values) look more like something that
may have been inspired by a modern indirect branch than by
self-modifying code.
Well, the Fortran feature was designed in what, the late 1950s? Back
then, self modifying code wasn't considered as bad as it now is.
An extra feature: when using a GOTO variable, you can also supply a
list of labels that it may jump to; if the jump target is not
in the list, the GOTO is illegal.
In languages with nested scopes, label gotos
can jump to an outer scope so they have to unwind some frames. Back when
people used such things, a common use was on an error to jump out to some
recovery code.
Function pointers have a sort of similar problem in that they need to carry
along pointers to all of the enclosing frames the function can see. That is
reasonably well solved by displays, give or take the infamous Knuth man-or-boy
program, 13 lines of Algol 60 horror for which Knuth himself got the results
wrong.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/6/2025 12:46 AM, Anton Ertl wrote:
If you implement, say, a state machine using labels-as-values, or
switch, again, the logic behind it is the same and the predictability
is the same between the two implementations.
Nick responded better than I could to this argument, demonstrating how
it isn't true. As I said, in the hands of a good programmer, you might
assume that the goto goes to one of those labels, but you can't be sure
of it.
In <1762311070-5857@newsgrouper.org> you mentioned method calls as
'just a more expensive "label"'; there you know that the method call
calls one of the implementations of the method with that name, as
with the switch. You did not find that satisfying in
<1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
large number of switch targets is good enough for you, whereas Niklas
Holsti's problem (which does not occur in my practical experience with
labels-as-values) has become your problem?
BTW, you mentioned that it could be implemented as an indirect jump. It
could for those architectures that supported that feature, but it could
also be implemented by having the Alter/Assign modify the code (i.e.
change the address in the jump/branch instruction), and self modifying
code is just bad.
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
What does "index into a jump table" mean in one of those architectures
that did not have indirect jumps and used self-modifying code instead?
I bet that it ends up in self-modifying code, too, because these architectures usually don't have indirect jumps through jump tables,
either.
If they had, the easy way to implement indirect branches
without self-modifying code would be to have a one-entry jump table,
store the target in that entry, and then perform an indirect jump
through that jump table.
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone
architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
And, by and large, they have.
We have gotten rid of indirect calls, e.g., in higher-order functions
in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs?
Thinking about the things that self-modifying code has been used for
on some architecture, IIRC that also includes array indexing. So have
we gotten rid of array indexing in programming languages?
One interesting aspect here is that the Fortran assigned goto and GNU
C's goto * (to go with labels-as-values) look more like something that
may have been inspired by a modern indirect branch than by
self-modifying code.
Well, the Fortran feature was designed in what, the late 1950s? Back
then, self modifying code wasn't considered as bad as it now is.
Did you read what you are replying to?
Does the IBM 704 (for which FORTRAN was originally designed)
support indirect branches, or was it necessary to implement the
assigned goto (and computed goto) with self-modifying code on that
architecture?
On 11/6/2025 11:38 AM, Thomas Koenig wrote:
--------------snip---------------
Interesting! No values related to pi? And what are the ...e+307 used for?
On 11/7/2025 2:09 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/6/2025 12:46 AM, Anton Ertl wrote:
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
What does "index into a jump table" mean in one of those architectures
that did not have indirect jumps and used self-modifying code instead?
For example, the following Fortran code
goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc
would be compiled to something like (add any required "bounds checking"
for I)
load R1,I
Jump $,R1
Jump 10
Jump 20
Jump 30
Jump 40
No code modification nor indirection required.
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone
architectures using self-modifying code are bad by association, then
we have to get rid of all of these language features ASAP.
And, by and large, they have.
We have gotten rid of indirect calls, e.g., in higher-order functions
in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs.
No, and I defer to you, or others here, on how these features are
implemented, specifically whether code modification is required. I was
referring to features such as assigned goto in Fortran, and Alter goto
in Cobol.
Thinking about the things that self-modifying code has been used for
on some architecture, IIRC that also includes array indexing. So have
we gotten rid of array indexing in programming languages?
Of course not. But I suspect that we have "gotten rid of" any
architecture that *requires* code modification for array indexing.
John Levine <johnl@taugh.com> writes:
In languages with nested scopes, label gotos
can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.
Pascal has that feature. Concerning error handling, jumping to an
error handler in a statically enclosing scope has fallen out of
favour, but throwing an exception to the next dynamically enclosing
exception handler is supported in a number of languages.
Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy
program, 13 lines of Algol 60 horror for which Knuth himself got the results wrong.
Displays and static link chains are among the techniques that can be
used to implement static scoping correctly, i.e., where the man-or-boy
test produces the correct result. Knuth initially got the result
wrong, because he only had boy compilers, and the computation is too
involved to do it by hand.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/7/2025 2:09 AM, Anton Ertl wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 11/6/2025 12:46 AM, Anton Ertl wrote:
On such architectures switch would also be implemented by modifying
the code,
I don't think so. Switch can, and I understand usually is, implemented
via an index into a jump table. No self modifying code required.
What does "index into a jump table" mean in one of those architectures
that did not have indirect jumps and used self-modifying code instead?
For example, the following Fortran code
goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc
would be compiled to something like (add any required "bounds checking"
for I)
load R1,I
Jump $,R1
Jump 10
Jump 20
Jump 30
Jump 40
Which architecture is that?
No code modification nor indirection required .
The "Jump $,R1" is an indirect jump.
With that the assigned goto can
be implemented as (for "GOTO X")
load R1,X
Jump 0,R1
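The same point in GNU C, for architectures with indirect jumps: both the
computed goto above and a switch can be lowered to an indexed load from a
label table plus one indirect jump, with no code modification (a sketch,
with the label bodies elided):

    void dispatch(int i)
    {
        static void *tab[] = { &&L10, &&L20, &&L30, &&L40 };
        if (i < 1 || i > 4)        /* the "bounds checking" */
            return;
        goto *tab[i - 1];          /* one indirect jump */
    L10: /* ... */ return;
    L20: /* ... */ return;
    L30: /* ... */ return;
    L40: /* ... */ return;
    }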
and indirect calls and method dispatch would also be
implemented by modifying the code. If self-modifying code is "just
bad", and any language features that are implemented on some long-gone >>>>> architectures using self-modifying code are bad by association, then >>>>> we have to get rid of all of these language features ASAP.
And, by and large, they have.
We have gotten rid of indirect calls, e.g., in higher-order functions
in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs?
No, and I defer to you, or others here, on how these features are
implemented, specifically whether code modification is required. I was
referring to features such as assigned goto in Fortran, and Alter goto
in Cobol.
On modern architectures higher-order functions are implemented with
indirect branches or indirect calls (depending on whether it's a
tail-call or not); likewise for method dispatch.
I do not know how Lisp, FORTRAN, Algol 60 and other early languages
with higher-order functions were implemented on architectures that do
not have indirect branches; but if the assigned goto was implemented
with self-modifying code, the call to a function in a variable was
probably implemented like that, too.
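For illustration, the modern lowering described here: method dispatch and
higher-order functions become an indirect call through a pointer (the
names below are made up, not from any poster's code):

    #include <stdio.h>

    typedef struct Shape Shape;
    struct Shape {
        double (*area)(const Shape *);   /* one-slot "vtable" */
        double w, h;
    };

    static double rect_area(const Shape *s) { return s->w * s->h; }

    int main(void)
    {
        Shape r = { rect_area, 3.0, 4.0 };
        printf("%f\n", r.area(&r));      /* compiles to an indirect call */
        return 0;
    }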
Thinking about the things that self-modifying code has been used for
on some architecture, IIRC that also includes array indexing. So have
we gotten rid of array indexing in programming languages?
Of course not. But I suspect that we have "gotten rid of" any
architecture that *requires* code modification for array indexing.
We have also gotten rid of any architecture that requires
self-modifying code for implementing the assigned goto.
On 11/6/2025 3:24 AM, Michael S wrote:
--------------snip---------------
I decided to start working on a mockup (quickly thrown together).
I don't expect to have much use for it, but meh.
It works by packing/unpacking the values into an internal format along vaguely similar lines to the .NET format, just bigger to accommodate
more digits:
4x 32-bit values each holding 9 digits
Except the top one generally holding 7 digits.
16-bit exponent, sign byte.
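One plausible shape for that internal format (these are not BGB's actual
declarations, just a guess at the layout described):

    #include <stdint.h>

    typedef struct {
        uint32_t m[4];   /* four base-1e9 limbs; m[3] holds the top 7 digits */
        int16_t  exp;    /* decimal exponent */
        uint8_t  sign;   /* sign byte */
    } dfp128_unpacked;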
Then wrote a few pack/unpack scenarios:
X30: Directly packing 20/30 bit chunks, non-standard;
DPD: Use the DPD format;
BID: Use the BID format.
For the pack/unpack step (taken in isolation):
X30 is around 10x faster than either DPD or BID;
Both DPD and BID need a similar amount of time.
BID needs a bunch of 128-bit arithmetic handlers.
DPD needs a bunch of merge/split and table lookups.
Seems to mostly balance out in this case.
For DPD, merge is effectively:
  Do the table lookups;
  v = v0 + (v1 * 1000) + (v2 * 1000000);
With a split step like:
  v0 = v;
  v1 = v / 1000;
  v0 -= v1 * 1000;
  v2 = v1 / 1000;
  v1 -= v2 * 1000;
Then, use table lookups to go back to DPD.
Did look into possible faster ways of doing the splitting, but then
noted that I have not yet found a faster way that gives correct results
(where one can assume the compiler already knows how to turn divide by
constant into multiply by reciprocal).
At first it seemed like a strong reason to favor X30 over either DPD or
BID. Except that the cost of the ADD and MUL operations effectively
dwarfs that of the pack/unpack operations, so the relative cost
difference between X30 and DPD may not matter much.
As is, it seems MUL and ADD cost roughly 6x more than the
DPD pack/unpack steps.
So, while DPD pack/unpack isn't free, it is not something that would
make X30 a decisive win in terms of performance, either.
It might make more sense, if supporting BID, to just do it as its own
thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
case ends up needing a lot more clutter, mostly again because C lacks
native support for 128-bit arithmetic.
If working based on digit chunks, likely better to stick with DPD due to
less clutter, etc. Though, this part would be less bad if C had
widespread support for 128-bit integers.
Though, in this case, the ADD and MUL operations currently work by internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.
Though, still not complete nor confirmed to produce correct results.
But, yeah, might be more worthwhile to look into digit chunking:
12x 3 digits (16b chunk)
4x 9 digits (32b chunk)
2x 18 digits (64b chunk)
3x 12 digits (64b chunk)
Likely I think:
3 digits, likely slower because of needing significantly more operations;
9 digits, seemed sensible, option I went with, internal operations fully
fit within the limits of 64 bit arithmetic;
18 digits, possible, but runs into many cases internally that would
require using 128-bit arithmetic.
12 digits, fits more easily into 64-bit arithmetic, but would still sometimes exceed it; and isn't that much more than 9 digits (but would reduce the number of chunks needed from 4 to 3).
While 18 digits conceptually needs fewer abstract operations than 9
digits, it would suffer the drawback of many of these operations being notably slower.
However, if running on RV64G with the standard ABI, it is likely the
9-digit case would also take a performance hit due to sign-extended
unsigned int (and needing to spend 2 shifts whenever zero-extending a
value).
With 3x 12 digits, while not exactly the densest scheme, there is a little
more "working space", which would reduce cases that exceed the limits of
64-bit arithmetic. Well, except multiply, where 24 > 18 ...
The main merit of 9 digit chunking here being that it fully stays within
the limits of 64-bit arithmetic (where multiply temporarily widens to working with 18 digits, but then narrows back to 9 digit chunks).
Also 9 digit chunking may be preferable when one has a faster 32*32=>64
bit multiplier, but 64*64=>128 is slower.
One other possibility could be to use BCD rather than chunking, but I
expect BCD emulation to be painfully slow in the absence of ISA level helpers.
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less accurate.
BGB <cr88192@gmail.com> posted:
--------------snip---------------
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less
accurate.
Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
an 8-bit accurate first iteration result.
Other than DFP not being normalized, once you find the HoD, you should
be able to use something like a 10-bit in 13-bit out table to get the
first 2 decimal digits correct, and N-R from there.
That 10-bits in could be the packed DFP representation (its denser and
has smaller tables). This way, table lookup overlaps unpacking.
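A sketch of table-seeded Newton-Raphson reciprocal for binary FP along
those lines; for self-containment the 9-bit-in/11-bit-out seed table is
replaced by a crude power-of-two seed, so the loop needs about 6
iterations instead of the ~3 a table seed would allow:

    #include <stdint.h>
    #include <string.h>

    static double nr_recip(double d)    /* assumes d is positive and normal */
    {
        uint64_t bits, e;
        memcpy(&bits, &d, sizeof bits);
        e = bits & 0x7FF0000000000000ull;   /* exponent field */
        bits = (0x7FDull << 52) - e;        /* seed = 2^(1022-E): within 2x */
        double x;
        memcpy(&x, &bits, sizeof x);
        for (int i = 0; i < 6; i++)         /* error squares on each step */
            x = x * (2.0 - d * x);
        return x;
    }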
On 11/6/2025 1:11 PM, BGB wrote:
--------------snip---------------
I don't know yet if my implementation of DPD is actually correct.
Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.
Here is an example value:
2DFFCC1AEB53B3FB_B4E262D0DAB5E680
Which, in theory, should resemble PI.
Annoyingly, it seems like pretty much everyone else either went with
BID, or with other non-standard Decimal encodings.
Can't seem to find:
Any examples of hard-coded numbers in this format on the internet;
Any obvious way to generate them involving "stuff I already have".
As in, not going and using some proprietary IBM library or similar.
Also Grok wasn't much help here; it just keeps trying to use Python's
"decimal", which, it quickly becomes obvious, is not using Decimal128
(much less DPD), but seemingly some other 256-bit format.
And Grok fails to notice that what it is saying is nowhere close to
correct in this case.
Neither DeepSeek nor QWen was much help either... Both just sort of go
down a rabbit hole, and eventually fall back to "Here is how you might
go about trying to decode this format...".
Not helpful; I really just want some way to confirm whether or not I
got the format correct.
Which is easier if one has some example numbers or something that they
can decode and verify the value, or something that is able to decode
these numbers (which isn't just trying to stupidly shove it into
Python's Decimal class...).
Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
and Boost C++, but in these cases, less helpful because they went with BID.
...
Checking, after things are a little more complete, in MHz (millions of
operations per second), on my desktop PC:
times per second), on my desktop PC:
DPD Pack/Unpack: 63.7 MHz (58 cycles)
X30 Pack/Unpack: 567 MHz ( 7 cycles) ?...
FMUL (unwrap) : 21.0 MHz (176 cycles)
FADD (unwrap) : 11.9 MHz (311 cycles)
FDIV : 0.4 MHz (very slow; Newton Raphson)
FMUL (DPD) : 11.2 MHz (330 cycles)
FADD (DPD) : 8.6 MHz (430 cycles)
FMUL (X30) : 12.4 MHz (298 cycles)
FADD (X30) : 9.8 MHz (378 cycles)
The relative performance impact of the wrap/unwrap step is somewhat
larger than expected (vs the unwrapped case).
Though, there seems to only be a small difference here between DPD and
X30 (so, likely whatever is affecting performance here is not directly
related to the cost of the pack/unpack process).
The wrapped cases basically just add a wrapper function that unpacks the input values to the internal format, and then re-packs the result.
For using the wrapped functions to estimate pack/unpack cost:
DPD cost: 51 cycles.
X30 cost: 41 cycles.
Not really a good way to make X30 much faster; it still pays the cost
of dealing with the combination field.
Not sure why they would be so close:
DPD case does a whole lot of stuff;
X30 case is mostly some shifts and similar.
Though, in this case, it does use these functions by passing/returning structs by value. It is possible a by-reference design might be faster
in this case.
This could possibly be cheapened slightly by going to, say:
S.E13.M114
In effect trading off some exponent range for cheaper handling of the exponent.
Can note:
MUL and ADD use double-width internal mantissa, so should be accurate;
Current test doesn't implement rounding modes though, could do so.
Currently hard-wired at Round-Nearest-Even.
DIV uses Newton-Raphson
The process of converging is a lot more fiddly than with Binary FP.
Partly as the strategy for generating the initial guess is far less accurate.
So, it first uses a loop with hard-coded checks and scales to get it in
the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.
Namely (see the sketch after this list):
Exponent is wrong:
Scale by factors of 2 until correct;
Off by more than 50%, scale by +/- 25%;
Off by more than 25%, scale by +/- 12.5%;
Else: Good enough, let normal N-R take over.
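A control-flow sketch of that preconditioning loop, with double standing
in for the decimal type (the real code would use the DFP multiply/compare
ops, and the exact thresholds here are guesses):

    static double precondition(double d, double x)
    {
        for (;;) {
            double p = d * x;              /* how close is x to 1/d? */
            if (p >= 2.0)      x *= 0.5;   /* exponent still wrong */
            else if (p < 0.5)  x *= 2.0;
            else if (p > 1.5)  x *= 0.75;  /* off by more than 50% */
            else if (p < 0.75) x *= 1.25;
            else if (p > 1.25) x *= 0.875; /* off by more than 25% */
            else return x;                 /* good enough; hand off to N-R */
        }
    }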
The precondition step is usually simpler with Binary-FP, as the initial guess
is usually within the correct range. So, one can use a single modified
N-R step (that undershoots) followed by letting N-R take over.
More of an issue though when the initial guess is "maybe within a factor
of 10" because the usual reciprocal-approximation strategy used for Binary-FP isn't quite as effective.
...
Still don't have a use-case, mostly just messing around with this...