• Re: floating pain, Combining Practicality with Perfection

    From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 01:16:33 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:
    quadi <quadibloc@ca.invalid> wrote:

I remember having read one article in a computer magazine where someone mentioned that an unfortunate result of the transition from the IBM 7090 to the IBM System/360 was that a lot of FORTRAN programs that were able to use ordinary real numbers had to be switched over to double precision to yield acceptable results.

Note that IBM floating point format effectively lost about 3 bits of accuracy compared to the modern 32-bit format. I am not sure how much was lost compared to the IBM 7090, but it looks like it was at least 5 bits. Assuming that accuracy requirements are uniformly distributed between 20 and, say, 60 bits, we can estimate that a loss of 5 bits affected about 25% (or more) of applications that could run using 36 bits. That is "a lot" of programs.

But it does not mean that 36 bits are somehow magical. Simply, given a 36-bit machine, the original author had extra motivation to make sure that the program ran in 36-bit floating point.

    It's worse than that, because the 360's floating point had wobbling precision.
    Depending on the number of leading zero bits in the fraction it could lose anywhere from 1 to 5 bits of precision compared to a rounded binary format. Hence the badness of the result depended more than usual on the input
    data.

Well, IBM format had twice the range of IEEE format, so effectively one bit moved from mantissa to exponent. Looking at representable values, except at the low end of the range only normalized values matter. In hex format 15/16 of values are normalized, which is better than binary without a hidden bit and marginally worse than binary with a hidden bit. One hex order of magnitude has 15/16 representable values compared to binary without a hidden bit and with IEEE range, and 15/32 representable values compared to IEEE. This hex order of magnitude corresponds to 4 binary orders of magnitude, and each binary order of magnitude has the same number of values. So the hex block beginning with 1 has 1/16 of the values compared to all bit patterns of the given hex order of magnitude, while the corresponding IEEE binary order of magnitude has 1/2 of the bit patterns compared to the given hex order of magnitude. Which gives 8 times bigger density for IEEE binary, that is 3 bits of accuracy. IBM truncated, which loses one extra bit, so AFAICS the worst case for IBM hex is a loss of 4 bits. At the high end of a hex order of magnitude the density is the same, but there is still a one-bit loss due to truncation. So actually, the loss varies between 1 and 4 bits. The simple average is a 2.5-bit loss, but 3 bits is more realistic, because once you lose a bit, performing the following operations with better accuracy will not compensate for the loss.
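The density argument is easy to check numerically. A quick sketch (an editor's toy model, not from the original post), comparing the spacing of a 24-bit-fraction hex format against a binary format with 23 fraction bits plus a hidden bit:

```python
import math

def hex_ulp(v, frac_bits=24):
    # Spacing of IBM-style hex floats near v: the exponent steps in
    # powers of 16 and the fraction holds frac_bits bits in [1/16, 1).
    m = math.floor(math.log(v, 16)) + 1      # v lies in [16**(m-1), 16**m)
    return 16.0 ** m * 2.0 ** -frac_bits

def bin_ulp(v, frac_bits=23):
    # Spacing of a binary format with a hidden bit near v.
    e = math.floor(math.log2(v))             # v lies in [2**e, 2**(e+1))
    return 2.0 ** e * 2.0 ** -frac_bits

# Just above a power of 16 the hex spacing is 8x coarser (3 bits lost);
# just below the next power of 16 the spacings match (0 bits lost).
print(math.log2(hex_ulp(1.5) / bin_ulp(1.5)))    # 3.0
print(math.log2(hex_ulp(15.0) / bin_ulp(15.0)))  # 0.0
```

Truncation then costs one more bit on top of whatever the representation loses, giving the 1-to-4-bit range.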

    Note that 1 bit is due to using truncation in arithmetic, which is
    indepedent of format. 1 bit is due to exponent range. Hex makes
    IBM choice of range natural, but if they really wanted they could
    halve exponent range and add one bit to mantissa. So, compared
    to binary machine using truncation, no hidden bit and the same
    range as IBM hex one looses 1 bit in worst case and gains 2 bits
    in best case. So, IBM choice was bad, but at that time other
    made bad choices too.
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Feb 17 01:24:17 2026
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
Well, IBM format had twice the range of IEEE format, so effectively one
bit moved from mantissa to exponent. Looking at representable values
except at low end of the range only normalized values matter. In
hex format 15/16 of values are normalized, ...

    That's the same mistake IBM made when they designed the 360's FP.
    Leading fraction digits are geometrically distributed, not linearly.
    (Look at a slide rule to see what I mean.)

    There are on average two leading zeros so only half of the values are normalized.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 16:21:44 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:
Well, IBM format had twice the range of IEEE format, so effectively one
bit moved from mantissa to exponent. Looking at representable values
except at low end of the range only normalized values matter. In
hex format 15/16 of values are normalized, ...

    That's the same mistake IBM made when they designed the 360's FP.
    Leading fraction digits are geometrically distributed, not linearly.
    (Look at a slide rule to see what I mean.)

If you had read and understood what I wrote (and what you snipped), you
would see that I handled the distribution of numbers. Hint: the point of
talking about hex orders of magnitude and binary orders of magnitude
is to compare both distributions.

    There are on average two leading zeros so only half of the values are normalized.

No. By _definition_ a hex floating point number is normalized if and
only if its leading hex digit is different from zero. It is easy
to check that different normalized hex bit patterns produce different
values.
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Feb 17 18:57:18 2026
    From Newsgroup: comp.arch

    On Sun, 15 Feb 2026 14:37:00 +0000, John Dallman wrote:

    Quadi, have your computer architectures included IBM 360 floating point support? There is probably more demand for that than for 36-bit these
    days.

    Yes, in fact they have. The goal there is to facilitate data interchange
    and emulation, not to provide better quality floating-point arithmetic... since, of course, it provides rather the opposite, as has been discussed
    in this thread.

    The original CISC Concertina I architecture went further; it had the goal
    of being able to natively emulate the floating-point of just about every computer ever made.

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Feb 17 19:09:53 2026
    From Newsgroup: comp.arch

    On Tue, 17 Feb 2026 01:24:17 +0000, John Levine wrote:
    According to Waldek Hebisch <antispam@fricas.org>:

Well, IBM format had twice the range of IEEE format, so effectively one
bit moved from mantissa to exponent. Looking at representable values
except at low end of the range only normalized values matter. In hex
format 15/16 of values are normalized, ...

    That's the same mistake IBM made when they designed the 360's FP.
    Leading fraction digits are geometrically distributed, not linearly.
    (Look at a slide rule to see what I mean.)

    This is Benford's Law, and there was an interesting discussion of it in
    the December, 1969 issue of _Scientific American_ - in an article, not in Martin Gardner's _Mathematical Games_ column, as I would have expected.
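For what it's worth, the base-16 version of Benford's law makes the expected number of leading zero bits easy to compute. A quick sketch (the editor's, not from the article):

```python
import math

# Benford's law in base 16: P(leading hex digit = d) = log16(1 + 1/d)
probs = {d: math.log(1 + 1 / d, 16) for d in range(1, 16)}

def leading_zero_bits(d):
    # Leading zero bits of a normalized hex fraction whose top digit is d.
    return 4 - d.bit_length()

expected = sum(p * leading_zero_bits(d) for d, p in probs.items())
print(round(expected, 6))  # 1.5
```

So under a logarithmic distribution a normalized hex fraction carries 1.5 leading zero bits on average, on top of the missing hidden bit.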

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Feb 17 19:20:33 2026
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    There are on average two leading zeros so only half of the values are
    normalized.

    No. By _definition_ hex floating point number is normalized if and
    only if its leading hex digit is different than zero.

    I wrote sloppily. On average a normalized hex FP number has two leading
    zeros so you lose another bit compared to binary, in addition to what you
    lose by no hidden bit and no rounding.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 19:52:46 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:
    There are on average two leading zeros so only half of the values are
    normalized.

    No. By _definition_ hex floating point number is normalized if and
    only if its leading hex digit is different than zero.

    I wrote sloppily. On average a normalized hex FP number has two leading zeros so you lose another bit compared to binary, in addition to what you lose by no hidden bit and no rounding.

That is almost what I wrote, except that I sketched a proof that
hex FP loses that one bit _in the worst case_, and the average is better. In
the case of IBM hex float, the tradeoff between range and mantissa bits leads to
another bit lost from accuracy, so 4 bits in the worst case (but the range
is twice as large as IEEE floats). To summarize: 1 bit of loss (compared
to binary with no hidden bit) due to the uneven distribution of hex, 1 bit of
loss due to the impossibility of using a hidden bit in hex, 1 bit of loss due to
the larger range, and 1 bit of loss due to the lack of rounding.
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 20:43:35 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> wrote:
    On Sun, 15 Feb 2026 14:37:00 +0000, John Dallman wrote:

    Quadi, have your computer architectures included IBM 360 floating point
    support? There is probably more demand for that than for 36-bit these
    days.

    Yes, in fact they have. The goal there is to facilitate data interchange
    and emulation, not to provide better quality floating-point arithmetic... since, of course, it provides rather the opposite, as has been discussed
    in this thread.

    The original CISC Concertina I architecture went further; it had the goal
    of being able to natively emulate the floating-point of just about every computer ever made.

That was probably already written, but since you are revising your
design it may be worth stating some facts. If you have a 64-bit
machine with convenient access to 32-bit, 16-bit and 8-bit parts,
you can store any number of bits between 4 and 64 wasting at most
50% of storage and have simple access to each item. So in terms
of memory use you are trying to avoid this 50% loss. In practice
the loss will be much smaller because:

- power of 2 quantities are quite popular
- when a program needs a large number of items of some other size,
the programmer is likely to use packing/unpacking routines, keeping
data in a space-efficient packed format most of the time and unpacking
it for processing
- a machine with fast bit-extract/bit-insert instructions can perform
most operations quite fast even on packed data

so the possible gain in memory consumption is quite low. Given that
non-standard memory modules and support chips tend to be much more
expensive than standard ones, economically attempting such savings
makes no sense.
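Such packing/unpacking routines are short. A toy sketch (the editor's, with made-up helper names) for arbitrary-width items in a byte buffer:

```python
def insert_bits(buf, bitpos, width, value):
    # Write the low `width` bits of value into bytearray buf,
    # most-significant bit first, starting at bit offset bitpos.
    for i in range(width):
        bit = (value >> (width - 1 - i)) & 1
        byte, off = divmod(bitpos + i, 8)
        if bit:
            buf[byte] |= 1 << (7 - off)
        else:
            buf[byte] &= ~(1 << (7 - off)) & 0xFF

def extract_bits(buf, bitpos, width):
    # Read `width` bits back, starting at bit offset bitpos.
    v = 0
    for i in range(width):
        byte, off = divmod(bitpos + i, 8)
        v = (v << 1) | ((buf[byte] >> (7 - off)) & 1)
    return v

# Two 36-bit items fit exactly in 9 bytes -- no storage wasted.
buf = bytearray(9)
insert_bits(buf, 0, 36, 0xDEADBEEF5)
insert_bits(buf, 36, 36, 0x123456789)
print(hex(extract_bits(buf, 0, 36)), hex(extract_bits(buf, 36, 36)))
```

A machine with bit-extract/bit-insert instructions does each of these loops in one operation.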

Of course, there is also the question of speed. The argument above shows
that the loss of speed on access itself can be quite small. So what
remains is the speed of processing data. As long as you do processing
on power-of-2 sized items (that is, unusual sizes are limited to
storage), the loss of speed can be modest; basically, a dedicated 36-bit
machine probably can do 2 times as many 36-bit float operations
as a standard machine can do 64-bit operations. Practically, this
loss will be smaller than the loss of storage, but still does not look significant
enough to warrant development of a special machine.

Things are somewhat different when you want bit-accurate results
using old formats. Here already one's complement arithmetic has
significant overhead on a two's complement machine. And emulating
old floating point formats is more expensive. OTOH, modern
machines are much faster than old ones. For example, a modern CPU
seems to be more than 1000 times faster than a real CDC-6600, so
even slow emulation is likely to be faster than the real machine,
which means that the emulated machine can do the work of the original
one.

So to summarize: practical considerations leave rather small space
for a machine using non-power-of-two formats, and it is rather
unlikely that any design can fit there.

Of course, there is a very good reason to explore non-mainstream
approaches, namely having fun. But once you realize that
mainstream designs make their choices for good reasons,
exploring alternatives gets less fun (at least for me).
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 18 00:50:52 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    quadi <quadibloc@ca.invalid> wrote:
    On Sun, 15 Feb 2026 14:37:00 +0000, John Dallman wrote:

    Quadi, have your computer architectures included IBM 360 floating point
    support? There is probably more demand for that than for 36-bit these
    days.

    Yes, in fact they have. The goal there is to facilitate data interchange and emulation, not to provide better quality floating-point arithmetic... since, of course, it provides rather the opposite, as has been discussed in this thread.

    The original CISC Concertina I architecture went further; it had the goal of being able to natively emulate the floating-point of just about every computer ever made.

That was probably already written, but since you are revising your
design it may be worth stating some facts. If you have a 64-bit
machine with convenient access to 32-bit, 16-bit and 8-bit parts,
you can store any number of bits between 4 and 64 wasting at most
50% of storage and have simple access to each item. So in terms
of memory use you are trying to avoid this 50% loss. In practice
the loss will be much smaller because:

- power of 2 quantities are quite popular
- when a program needs a large number of items of some other size,
the programmer is likely to use packing/unpacking routines, keeping
data in a space-efficient packed format most of the time and unpacking
it for processing
- a machine with fast bit-extract/bit-insert instructions can perform
most operations quite fast even on packed data

so the possible gain in memory consumption is quite low. Given that
non-standard memory modules and support chips tend to be much more
expensive than standard ones, economically attempting such savings
makes no sense.

Of course, there is also the question of speed. The argument above shows
that the loss of speed on access itself can be quite small. So what
remains is the speed of processing data. As long as you do processing
on power-of-2 sized items (that is, unusual sizes are limited to
storage), the loss of speed can be modest; basically, a dedicated 36-bit
machine probably can do 2 times as many 36-bit float operations
as a standard machine can do 64-bit operations. Practically, this
loss will be smaller than the loss of storage, but still does not look significant
enough to warrant development of a special machine.

Things are somewhat different when you want bit-accurate results
using old formats. Here already one's complement arithmetic has
significant overhead on a two's complement machine.

The only useful difference between 1's complement and 2's complement in
ADD is the end-around carry, and the adder will have the same number
of gates and the same gates of delay. So, in theory, one could make
a {1's or 2's} complement adder at the cost of 1 gate of delay and
one logic gate.
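The end-around carry is easy to show in a few lines. A toy model (the editor's, not any particular machine's adder):

```python
def ones_complement_add(a, b, bits=16):
    # Add two ones'-complement values: a carry out of the top bit
    # wraps around and is added back in at the bottom.
    mask = (1 << bits) - 1
    s = a + b
    if s > mask:              # end-around carry
        s = (s + 1) & mask
    return s

minus3 = 0xFFFF ^ 3          # ones'-complement encoding of -3 in 16 bits
print(ones_complement_add(5, minus3))  # 2
```

Emulating this on a two's complement machine needs the extra carry test and fix-up on every add, which is where the software overhead comes from.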

And emulating
old floating point formats is more expensive. OTOH, modern
machines are much faster than old ones. For example, a modern CPU
seems to be more than 1000 times faster than a real CDC-6600, so
even slow emulation is likely to be faster than the real machine,
which means that the emulated machine can do the work of the original
one.

    Access to 64×64->128 is the key unit of processing.

So to summarize: practical considerations leave rather small space
for a machine using non-power-of-two formats, and it is rather
unlikely that any design can fit there.

Of course, there is a very good reason to explore non-mainstream
approaches, namely having fun. But once you realize that
mainstream designs make their choices for good reasons,
exploring alternatives gets less fun (at least for me).

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed Feb 18 08:52:27 2026
    From Newsgroup: comp.arch

    On Tue, 17 Feb 2026 20:43:35 +0000, Waldek Hebisch wrote:

    But once you realize that mainstream
    designs make their choices for good reasons,
    exploring alternatives gets less funny (at least for me).

    At one time, back in the past, the mainstream computers had word lengths
    such as 12 bits, 18 bits, 24 bits, 30 bits, 36 bits, 48 bits, 60 bits...
    all multiples of 6 bits.

    The reason for this was that computers needed a character set with
    letters, numbers, and various special characters - and a six-bit
    character, with 64 possibilities, was adequate for that.

As technology advanced, and computer power became cheaper, it became
possible to think of using computers for more applications. Using an
eight-bit character allowed the use of lower-case characters, getting rid
of a limitation of the older computers that could possibly become annoying
in the future. Of course, a 7-bit character would also be enough for that -
and at least one company, ASI, actually made computers with word lengths
that were multiples of 7 bits.

Even before System/360, IBM made a computer built around a 64-bit word,
the STRETCH. It was intended to be a very powerful scientific computer,
but it also had the very rare feature of bit addressing - which a
power-of-two word length made much more practical.

    Hardly any architectures provide bit addressing these days, though.

    None the less, a character set that includes lower-case is a good reason. Since a 36-bit word works better with 9-bit characters instead of 6-bit characters being addressable, nothing is really lost by going to 36 bits.

    Of course, there's another good reason for sticking with 32-bit or 64-bit designs: because that's what everyone else is using, standard memory
    modules have data buses corresponding to such widths, possibly with extra
    bits for ECC.

To me, those don't seem to be enough "good reasons" to absolutely preclude
different word lengths. But there would definitely have to be a real
benefit to justify the cost and effort to use a different length. It seems
to me there is a real benefit, in that the available data sizes in the
32-bit world aren't optimized to the needs of scientific computation.

    But it's quite correct to feel this real benefit isn't enough to make
    machines oriented around the 36-bit word length likely.

    John Savard

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Feb 18 08:40:38 2026
    From Newsgroup: comp.arch

    On 2026-02-18 3:52 a.m., quadi wrote:
    On Tue, 17 Feb 2026 20:43:35 +0000, Waldek Hebisch wrote:

    But once you realize that mainstream
    designs make their choices for good reasons,
    exploring alternatives gets less funny (at least for me).

    At one time, back in the past, the mainstream computers had word lengths
    such as 12 bits, 18 bits, 24 bits, 30 bits, 36 bits, 48 bits, 60 bits...
    all multiples of 6 bits.

    The reason for this was that computers needed a character set with
    letters, numbers, and various special characters - and a six-bit
    character, with 64 possibilities, was adequate for that.

    As technology advanced, and computer power became cheaper, it became
possible to think of using computers for more applications. Using an eight-bit character allowed the use of lower-case characters, getting rid of a limitation of the older computers that could possibly become annoying in
    the future. Of course, a 7-bit character would also be enough for that -
    and at least one company, ASI, actually made computers with word lengths
    that were multiples of 7 bits.

    Even before System/360, IBM made a computer built around a 64-bit word,
    the STRETCH. It was intended to be a very powerful scientific computer,
but it also had the very rare feature of bit addressing - which a power-of-two word length made much more practical.

    Hardly any architectures provide bit addressing these days, though.

    None the less, a character set that includes lower-case is a good reason. Since a 36-bit word works better with 9-bit characters instead of 6-bit characters being addressable, nothing is really lost by going to 36 bits.

    Of course, there's another good reason for sticking with 32-bit or 64-bit designs: because that's what everyone else is using, standard memory
    modules have data buses corresponding to such widths, possibly with extra bits for ECC.

    To me, those don't seem to be enough "good reasons" to absolutely preclude different word lengths. But there would definitely have to be a real
    benefit to justify the cost and effort to use a different length. It seems
to me there is a real benefit, in that the available data sizes in the 32-bit world aren't optimized to the needs of scientific computation.

    But it's quite correct to feel this real benefit isn't enough to make machines oriented around the 36-bit word length likely.

    John Savard

    Maybe we should switch to 18-bit bytes to support UNICODE.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Feb 19 02:10:07 2026
    From Newsgroup: comp.arch

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.


    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.



    FMAC does suddenly get a bit cheaper if its scope is limited to
    FP8*FP8+FP16, but this operation is a bit niche.


This one makes a lot of sense for NN's, but I haven't gotten my NN tech
working well enough to make a strong use-case for it.

    Where, in terms of algorithmic or behavioral complexity relative to computational efficiency, NN's are significantly behind what is possible
    with genetic algorithms or genetic programming.


    So, for computational efficiency of the result:
    Hand-written native code, best efficiency;
    Genetic algorithm, moderate efficiency;
    Neural Net, very inefficient.

    The merit of NNs could then be if one could make them adaptive in some practical way:
    Native code: No adaptation apart from specific algos;
    Genetic algorithms: Only when running the evolver, static otherwise;
    NN's: Could be made adaptable in theory, usually fixed in practice.

    And, adaptation process:
    Native: None, maybe manual fiddling by programmer;
    Genetic algo: Initially very slow, gradually converges on answer;
NNs, via genetic algorithm: Slow, but converges toward an answer;
    NNs, via backprop: Rapid adaptation initially, then hits a plateau.

    Backprop is seemingly prone to get stuck at a non-optimal solution, and
    then is hard pressed to make any further progress. Seemingly isn't
    really able to "fix" any obvious structural defects once it hits a
    plateau, but can sometimes jump up or down between various nearby
    options (when obvious suboptimal patterns persist).

    Some tricks that work with GA-NN's don't really work with backprop, and
    my initial attempts to glue GA handling onto backprop have not been
    effective. Also it seems to need at least FP16 weights for training to
    work effectively (though, one other option being FP8 with a bias
    counter; but this is effectively analogous to using a non-standard
    S.E4.M11 format).


    Seemingly, my own efforts are getting stuck at the level of very
    inefficiently solving very mundane issues, nowhere near the success
    being seen by more mainstream efforts.

    Nor, as of yet, even anything particularly interesting...



    Had started making some progress in other types of areas though, for
    example:
    Figured out a practical way to get below 16kbps for audio...

    By using 8kHz ADPCM and then using lookup table and reversed LZ search trickery to make the audio more LZ compressible (without changing the
    storage format).

    Or, basically, ADPCM encoding strategy like:
    Lookup a match for the last 4 bytes;
    Look for the longest backwards match (last N bytes);
    Evaluate if the next byte for pattern is within an error limit;
    Select based on combination of error and length
    Longer matches permit more error than shorter ones.
    Check a pattern table,
    seeing if anything is within an acceptable error limit;
    Use pattern if so.
    Else:
    Figure out best-match for next 6 samples,
    using this to encode next 4 samples (1 byte).
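For readers without an ADPCM background, a minimal 2-bit adaptive-delta core (the editor's toy, with an invented step table; the actual encoder adds the LZ-match selection sketched above on top of something like this):

```python
STEPS = [16, 32, 64, 128, 256, 512, 1024, 2048]   # invented step table
QUANT = (-3, -1, 1, 3)                            # 2-bit code -> step multiplier

def encode(samples):
    pred, si, codes = 0, 3, []
    for s in samples:
        # Pick the 2-bit code whose reconstruction lands closest.
        c = min(range(4), key=lambda k: abs(s - pred - QUANT[k] * STEPS[si] // 2))
        codes.append(c)
        pred += QUANT[c] * STEPS[si] // 2
        si = max(0, min(7, si + (1 if c in (0, 3) else -1)))  # adapt step size
    return codes

def decode(codes):
    # Mirrors the encoder's predictor, so reconstruction is bit-exact.
    pred, si, out = 0, 3, []
    for c in codes:
        pred += QUANT[c] * STEPS[si] // 2
        out.append(pred)
        si = max(0, min(7, si + (1 if c in (0, 3) else -1)))
    return out

print(decode(encode([0, 100, 200, 150])))
```

The "super-compression" trick above keeps this storage format but biases the encoder's code choices, within an error budget, toward byte sequences the LZ stage can match.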

    Was able to get around a 20-30% reduction in bitrate, or around 12 kbps typical, before loss of audio quality becomes unacceptable (starts
    breaking down in obvious ways).


    Did version for 4-bit ADPCM, which can get a roughly similar reduction,
    or around 24 kbps, though trying to push it much lower makes 2-bit ADPCM preferable.

    A slightly higher reduction rate is possible if the baseline sample-rate
    is increased to 16kHz, but still doesn't get as low as when using 8 kHz.



    Note that it is possible to just use a pattern table directly to give an equivalent of 8 kbps ADPCM (each byte encoding an index into an 8-sample table, which is then decoded as 2-bit ADPCM), but the audio quality is unacceptably poor (for much of any use-case).


    Though, all this was mostly dusting off an experiment from last year,
    and putting it to use in my packaging tool (inside BGBCC).

    Mostly it is a case of:
    It is "good enough" to at least allow for optional super-compression of
    ADPCM without breaking the existing decoders.


    ...




    - anton

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 17:30:50 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.
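The single-rounding requirement is easy to demonstrate. A sketch (the editor's) that emulates a correctly rounded FMAC with exact rationals:

```python
from fractions import Fraction

a = b = float(2**27 + 1)       # exactly representable in binary64
c = -float(2**54)

separate = a * b + c           # product rounded to binary64 first, then added
fused = float(Fraction(a) * Fraction(b) + Fraction(c))  # one final rounding

print(separate, fused)   # 268435456.0 268435457.0
```

Separate FMUL-then-FADD rounds twice and loses the low bit of the product; a true FMAC keeps the exact double-width product and rounds once, which is why software emulation of IEEE-quality results is so painful.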

    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    The add stage after the multiplication tree is <essentially> 2× as wide.
    FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Feb 19 20:15:29 2026
    From Newsgroup: comp.arch

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

The add stage after the multiplication tree is <essentially> 2× as
wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4 cycles (occasionally 5).

    Arm Inc. application processor cores have FMAC latency = 4 for the
    multiplicands, but 2 for the accumulator.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Feb 19 18:49:22 2026
    From Newsgroup: comp.arch

    John Dallman <jgd@cix.co.uk> schrieb:

    Quadi, have your computer architectures included IBM 360 floating point support? There is probably more demand for that than for 36-bit these
    days.

    It has been quite a few decades since the last large-scale
    scientific calculations in IBM hex float; I believe it must have
    been the Japanese vector computers (one of which I worked on in
    the mid to late 1990s). It is probably safe to say that any
    hex float these days is embedded firmly in the z ecosystem.

    Since every laptop these days has more performance than the old
    vector computers, I very much doubt that there is significant data
    saved in that format. Same thing for VAX floating point formats.

    Big- vs. little-endian data is a more recent issue. Around 20 years
    ago, I wrote code to convert between big- and little-endian data
    for gfortran. This, too, is quite irrelevant today.

    The last conversion issue I had a hand in was for IBM's "double
    double" 128-bit real. POWER now supports IEEE 128-bit reals in
    hardware (if not very fast), but the ABI change has been very painful.

    There could, however, be a niche for 36-bit reals - graphics cards.
    I have recently discovered a GPU solver in a commercial package that
    I use, and it has an option for using 32-bit reals. 36-bit reals
    could extend the usefulness of such a solver.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 19:55:40 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

    The add stage after the multiplication tree is <essentially> 2× as
    wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4 cycles (occasionally 5).


    Arm Inc. application processor cores have FMAC latency = 4 for the
    multiplicands, but 2 for the accumulator.

    Thank you for that tidbit of information.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Feb 20 08:14:46 2026
    From Newsgroup: comp.arch

    On Wed, 18 Feb 2026 08:40:38 -0500, Robert Finch wrote:

    Maybe we should switch to 18-bit bytes to support UNICODE.

    It's true that Unicode has expanded beyond the old 16-bit Basic
    Multilingual Plane. But while all currently-defined characters would fit
    in 18 bits, code points as large as 31 bits were once envisaged; that is
    what the original UTF-8 design supports.

    If 9-bit bytes are used for simple applications, it certainly will be true that 18-bit halfwords will be an available data type.

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 05:08:28 2026
    From Newsgroup: comp.arch

    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The >>>> additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it as Binary128.

    Can special case the "Binary64 * Binary64 => Binary128" case to save
    cost over using a native Binary128 multiply.

    For Binary128 multiply, it can also make sense to detect and
    special-case the "low order bits are zero" case:
    If the low-order bits are zero, one can use a multiply that only
    produces the high 128 bits;
    Vs doing a transient 128*128=>256-bit multiply and then needing to
    round.
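
    For a single FMUL, the widening trick is provably safe at the
    Binary32-in-Binary64 level: two 24-bit significands multiply to at
    most 48 bits, which binary64's 53-bit significand holds exactly, so
    the one narrowing is the one rounding (the fused a*b+c case is
    subtler). A quick sanity check in Python (to_f32 is an ad-hoc helper,
    not a library routine):

```python
import struct
from fractions import Fraction

def to_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (kept as a float)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

a = to_f32(1.0 + 2.0 ** -23)    # exactly representable binary32 inputs
b = to_f32(1.5 + 2.0 ** -23)

prod64 = a * b                  # binary64 multiply: exact, 48 bits fit in 53
assert Fraction(prod64) == Fraction(a) * Fraction(b)   # no rounding happened yet
result = to_f32(prod64)         # the single, correct rounding to binary32
```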



    Relative cost is lower if one is already paying the cost of a trap
    handler or similar (except that if the ISA supports it, you really don't
    want the compiler to combine these operations).

    So, one can maybe document that, if using a compiler like GCC, one should use "-ffp-contract=off -fno-fdiv", ...



    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already. >>
    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    The add stage after the multiplication tree is <essentially> 2× as wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    A 160-bit 3-way adder happening "quickly" is still kinda asking a lot though...


    Though, granted, the first step is deciding to do a full-width
    multiply, and not discard the low-order results.

    Granted, discarding the low results reduces rounding accuracy, but a
    way to fake full IEEE rounding is to detect this case and have the FMUL
    raise a fault (similar to denormal/underflow handling). Though, this
    does mean there is a performance penalty when multiplying numbers where
    the low-order bits in both values are non-zero.
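
    The detect-and-trap condition can be sketched as an integer-significand
    predicate (a hypothetical helper illustrating the idea, not the actual
    hardware check):

```python
def needs_slow_path(sig_a: int, sig_b: int, keep: int = 53) -> bool:
    """True if the full significand product has nonzero bits below the top
    `keep` bits retained by a truncating multiply -- i.e. the fast path
    would lose sticky information needed for correct IEEE rounding."""
    prod = sig_a * sig_b
    lost = max(0, prod.bit_length() - keep)
    return (prod & ((1 << lost) - 1)) != 0

# Product exact in the high half: the truncating fast path suffices.
print(needs_slow_path(1 << 52, 1 << 52))               # False
# Both operands have low-order bits set: fall back / trap.
print(needs_slow_path((1 << 52) | 1, (1 << 52) | 1))   # True
```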


    In my ISA, the exact behavior depends on the instruction and rounding
    mode. In the RISC-V mode, it is partly based on the instruction's
    rounding mode and on flag settings.

    For various reasons, full IEEE emulation can't safely be enabled until
    after setting up virtual memory and similar.


    The handling of the RISC-V F/D extensions was non-standard in my case,
    though not in a way that affects GCC output (it seems to exclusively use
    the DYN rounding mode in instructions, assuming the rounding mode to be
    handled via CSRs). Also, ironically, and contrasting with the seeming
    design of these extensions, these registers are so rarely accessed in
    practice that it seemed most sensible to use trap-and-emulate for the
    CSRs.


    Granted, there are limits to corner cutting:
    If a design does not produce exact results in cases where it is trivial
    to verify that an exact answer exists (i.e., cases requiring no
    rounding), IMO it is below the minimum bar for a usable general-purpose
    FPU.



    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4 cycles (occasionally 5).


    Trying to push the latency down would be pretty bad for timing, unless
    there is some cheaper way to implement FPUs that I am not aware of.

    In my case:
    FMADD.D, RM=DYN: Trap
    FMADD.D, RM=RNE, 10-cycle, double-rounded (non-standard)
    FMADD.S, RM=DYN, 10-cycle (mimics single rounding, *)
    *: Happens internally at Binary64 precision.

    It could be possible to handle FMADD.D RM=DYN the same way as RNE
    internally, but then trap if the inputs would potentially give a
    non-IEEE result. Though, for now, trapping is the cheaper solution in
    terms of HW cost.




    The one exception is FP8*FP8 + FP16, but mostly because it is possible
    to do FP8*FP8 in under 1 cycle.

    But, still not free here; and overly niche. Ended up going with a
    cheaper option of simply having an SIMD FP8*FP8=>FP16 multiply op (which
    still ends up as a 2-cycle op, because...).
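
    The reason FP8*FP8=>FP16 can skip rounding entirely: with a 4-bit
    stored mantissa (E4M3-style), the product of two 5-bit significands
    needs at most 10 bits, and FP16's 11-bit significand holds that
    exactly. A brute-force check over an E4M3-like subset (assumed format;
    exponents restricted here so every product stays normal in FP16):

```python
import struct
from fractions import Fraction

def to_f16(x: float) -> float:
    """Round a binary64 value to the nearest binary16 value (kept as a float)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Enumerate positive normal FP8-style values 1.mmmm * 2^e, 4 mantissa bits.
vals = [(1 + m / 16) * 2.0 ** e for e in range(-2, 3) for m in range(16)]

all_exact = all(
    Fraction(to_f16(a * b)) == Fraction(a) * Fraction(b)
    for a in vals for b in vals
)
print(all_exact)   # True: every such FP8*FP8 product is exact in FP16
```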


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Feb 20 15:22:05 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware.  The >>>>> additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever>>>>> encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it as Binary128.
    Neither of those works!
    I believed this to be true, but I was shown the error of my thinking by
    more knowledgeable people in the 754 working group. I.e., they had a
    very simple/small example where doing the calculation in the next
    higher precision would still cause double rounding errors.
    Also note that Mitch has stated multiple times that you need ~160
    mantissa bits during FMAC double calculations.
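
    The double-rounding failure Terje describes can be reproduced with
    exact rational arithmetic: rounding once to 24 bits and rounding via
    53 bits first give different answers (a sketch; round_nearest_even is
    a hand-rolled helper, not a library routine):

```python
from fractions import Fraction

def round_nearest_even(x: Fraction, p: int) -> Fraction:
    """Round positive x to p significand bits (round-to-nearest, ties-to-even)."""
    e = 0
    while x >= 2:               # normalize x into [1, 2), tracking the exponent
        x /= 2; e += 1
    while x < 1:
        x *= 2; e -= 1
    scaled = x * 2 ** (p - 1)   # significand scaled so the kept bits are integral
    n = scaled.numerator // scaled.denominator
    rem = scaled - n
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and n % 2 == 1):
        n += 1
    return Fraction(n, 2 ** (p - 1)) * Fraction(2) ** e

# A value whose sticky bit sits exactly where binary64 rounds it away:
x = Fraction(1) + Fraction(1, 2 ** 24) + Fraction(1, 2 ** 53)

once  = round_nearest_even(x, 24)                          # straight to 24 bits
twice = round_nearest_even(round_nearest_even(x, 53), 24)  # via 53 bits first

print(once == Fraction(1) + Fraction(1, 2 ** 23))   # True: rounds up
print(twice == Fraction(1))                         # True: double rounding rounds down
```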
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 15:26:24 2026
    From Newsgroup: comp.arch

    On 2/20/2026 8:22 AM, Terje Mathisen wrote:
    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. >>>>>> The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever >>>>>> encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it
    as Binary128.

    Neither of those works!

    I believed this to be true, but I was shown the error of my thinking by
    more knowledgeable people in the 754 working group. I.e., they had a
    very simple/small example where doing the calculation in the next
    higher precision would still cause double rounding errors.

    Also note that Mitch has stated multiple times that you need ~160
    mantissa bits during FMAC double calculations.


    Could look into this, the next option being to use a makeshift 192-bit
    FP format with a 176-bit mantissa (likely cheaper than going all the
    way to 224 bits).

    This is slow/annoying, but not really likely a "hard" problem (when one
    is already doing this stuff in software in a trap handler).


    So, potentially:
    Binary32 -> FP96 (truncated Binary128, still stored as Binary128)
    Binary64 -> FP192 (extended Binary128)
    Binary128 -> FP384 (likewise)
    Big/ugly, but no one says this needs to be fast...


    Might end up on a sort of "TODO list"...


    In any case, actual native hardware support for single-rounded FMA is
    unlikely to happen in my case.


    ...

    Terje


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sat Feb 21 20:27:25 2026
    From Newsgroup: comp.arch

    On 2/19/26 12:30 PM, MitchAlsup wrote:
    [snip]
    The add stage after the multiplication tree is <essentially> 2× as wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.

    How much of that applies to a double-rounded FMADD? Double-
    rounding would have the advantage of giving bit-identical
    results.

    While single rounding presents more algorithmic opportunities,
    double-rounded FMADD would still save decode bandwidth (and
    issue bandwidth if a pair of FMUL and FADD instructions were not
    fused by idiom recognition) at the cost of supporting three-
    input operations (with reduced forwarding per unit work).

    Getting bit-identical results whether FMUL and FADD are executed
    separately or as a fused operation via an ISA-extension FMADD
    might have some practical benefit.

    (I still wonder if an FP execution model that only calculated to
    the integer size precision (64-bit integer for 64-bit FP),
    ignoring carry-in, might have been acceptable for 99% of uses
    and saved a little bit of power and area as well as potentially
    facilitated software FP — to implement FMUL without carry-in one
    would have to have an integer multiply high result that did not
    use carry-in (a mirror of multiply low execution) which would
    probably only be useful for multiplication by reciprocal as
    integer results are expected to be exact.)
    --- Synchronet 3.21b-Linux NewsLink 1.2