• Re: floating pain, Combining Practicality with Perfection

    From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 01:16:33 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:
    quadi <quadibloc@ca.invalid> wrote:

I remember having read one article in a computer magazine where someone mentioned that an unfortunate result of the transition from the IBM 7090 to the IBM System/360 was that a lot of FORTRAN programs that were able to use ordinary real numbers had to be switched over to double precision to yield acceptable results.

Note that IBM floating point format effectively lost about 3 bits of accuracy compared to the modern 32-bit format. I am not sure how much was lost compared to the IBM 7090, but it looks like it was at least 5 bits. Assuming that accuracy requirements are uniformly distributed between 20 and, say, 60 bits, we can estimate that a loss of 5 bits affected about 25% (or more) of applications that could run using 36 bits. That is "a lot" of programs.

But it does not mean that 36 bits are somehow magical. Simply, given a 36-bit machine, the original author had extra motivation to make sure that the program ran in 36-bit floating point.

    It's worse than that, because the 360's floating point had wobbling precision.
    Depending on the number of leading zero bits in the fraction it could lose anywhere from 1 to 5 bits of precision compared to a rounded binary format. Hence the badness of the result depended more than usual on the input
    data.

Well, IBM format had twice the range of IEEE format, so effectively one bit moved from mantissa to exponent. Looking at representable values, except at the low end of the range only normalized values matter. In hex format 15/16 of values are normalized, which is better than binary without a hidden bit and marginally worse than binary with a hidden bit. One hex order of magnitude has 15/16 representable values compared to binary without a hidden bit and with IEEE range, and 15/32 representable values compared to IEEE. This hex order of magnitude corresponds to 4 binary orders of magnitude, and each binary order of magnitude has the same number of values. So the hex block beginning with 1 has 1/16 of the values compared to all bit patterns of the given hex order of magnitude, while the corresponding IEEE binary order of magnitude has 1/2 of the bit patterns compared to the given hex order of magnitude. Which gives 8 times bigger density for IEEE binary, that is 3 bits of accuracy. IBM truncated, which loses one extra bit, so AFAICS the worst case for IBM hex is a loss of 4 bits. At the high end of a hex order of magnitude the density is the same, but there is still a one-bit loss due to truncation. So actually, the loss varies between 1 and 4 bits. The simple average is a 2.5-bit loss, but 3 bits is more realistic, because once you lose a bit, performing the following operations with better accuracy will not compensate for the loss.
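The density argument is easy to check numerically. A quick sketch (an editor's toy model, not from the original post), comparing the spacing of a 24-bit-fraction hex format against a binary format with 23 fraction bits plus a hidden bit:

```python
import math

def hex_ulp(v, frac_bits=24):
    # Spacing of IBM-style hex floats near v: the exponent steps in
    # powers of 16 and the fraction holds frac_bits bits in [1/16, 1).
    m = math.floor(math.log(v, 16)) + 1      # v lies in [16**(m-1), 16**m)
    return 16.0 ** m * 2.0 ** -frac_bits

def bin_ulp(v, frac_bits=23):
    # Spacing of a binary format with a hidden bit near v.
    e = math.floor(math.log2(v))             # v lies in [2**e, 2**(e+1))
    return 2.0 ** e * 2.0 ** -frac_bits

# Just above a power of 16 the hex spacing is 8x coarser (3 bits lost);
# just below the next power of 16 the spacings match (0 bits lost).
print(math.log2(hex_ulp(1.5) / bin_ulp(1.5)))    # 3.0
print(math.log2(hex_ulp(15.0) / bin_ulp(15.0)))  # 0.0
```

Truncation then costs one more bit on top of whatever the representation loses, giving the 1-to-4-bit range.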

    Note that 1 bit is due to using truncation in arithmetic, which is
    indepedent of format. 1 bit is due to exponent range. Hex makes
    IBM choice of range natural, but if they really wanted they could
    halve exponent range and add one bit to mantissa. So, compared
    to binary machine using truncation, no hidden bit and the same
    range as IBM hex one looses 1 bit in worst case and gains 2 bits
    in best case. So, IBM choice was bad, but at that time other
    made bad choices too.
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Feb 17 01:24:17 2026
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
Well, IBM format had twice the range of IEEE format, so effectively one
bit moved from mantissa to exponent. Looking at representable values
except at low end of the range only normalized values matter. In
hex format 15/16 of values are normalized, ...

    That's the same mistake IBM made when they designed the 360's FP.
    Leading fraction digits are geometrically distributed, not linearly.
    (Look at a slide rule to see what I mean.)

    There are on average two leading zeros so only half of the values are normalized.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 16:21:44 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:
Well, IBM format had twice the range of IEEE format, so effectively one
bit moved from mantissa to exponent. Looking at representable values
except at low end of the range only normalized values matter. In
hex format 15/16 of values are normalized, ...

    That's the same mistake IBM made when they designed the 360's FP.
    Leading fraction digits are geometrically distributed, not linearly.
    (Look at a slide rule to see what I mean.)

If you had read and understood what I wrote (and what you snipped), you
would see that I handled the distribution of numbers. Hint: the point of
talking about hex orders of magnitude and binary orders of magnitude
is to compare both distributions.

    There are on average two leading zeros so only half of the values are normalized.

No. By _definition_ a hex floating point number is normalized if and
only if its leading hex digit is different from zero. It is easy
to check that different normalized hex bit patterns produce different
values.
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Feb 17 18:57:18 2026
    From Newsgroup: comp.arch

    On Sun, 15 Feb 2026 14:37:00 +0000, John Dallman wrote:

    Quadi, have your computer architectures included IBM 360 floating point support? There is probably more demand for that than for 36-bit these
    days.

    Yes, in fact they have. The goal there is to facilitate data interchange
    and emulation, not to provide better quality floating-point arithmetic... since, of course, it provides rather the opposite, as has been discussed
    in this thread.

    The original CISC Concertina I architecture went further; it had the goal
    of being able to natively emulate the floating-point of just about every computer ever made.

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Feb 17 19:09:53 2026
    From Newsgroup: comp.arch

    On Tue, 17 Feb 2026 01:24:17 +0000, John Levine wrote:
    According to Waldek Hebisch <antispam@fricas.org>:

Well, IBM format had twice the range of IEEE format, so effectively one
bit moved from mantissa to exponent. Looking at representable values
except at low end of the range only normalized values matter. In hex
format 15/16 of values are normalized, ...

    That's the same mistake IBM made when they designed the 360's FP.
    Leading fraction digits are geometrically distributed, not linearly.
    (Look at a slide rule to see what I mean.)

    This is Benford's Law, and there was an interesting discussion of it in
    the December, 1969 issue of _Scientific American_ - in an article, not in Martin Gardner's _Mathematical Games_ column, as I would have expected.
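For what it's worth, the base-16 version of Benford's law makes the expected number of leading zero bits easy to compute. A quick sketch (the editor's, not from the article):

```python
import math

# Benford's law in base 16: P(leading hex digit = d) = log16(1 + 1/d)
probs = {d: math.log(1 + 1 / d, 16) for d in range(1, 16)}

def leading_zero_bits(d):
    # Leading zero bits of a normalized hex fraction whose top digit is d.
    return 4 - d.bit_length()

expected = sum(p * leading_zero_bits(d) for d, p in probs.items())
print(round(expected, 6))  # 1.5
```

So under a logarithmic distribution a normalized hex fraction carries 1.5 leading zero bits on average, on top of the missing hidden bit.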

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Feb 17 19:20:33 2026
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    There are on average two leading zeros so only half of the values are
    normalized.

    No. By _definition_ hex floating point number is normalized if and
    only if its leading hex digit is different than zero.

    I wrote sloppily. On average a normalized hex FP number has two leading
    zeros so you lose another bit compared to binary, in addition to what you
    lose by no hidden bit and no rounding.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 19:52:46 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:
    There are on average two leading zeros so only half of the values are
    normalized.

    No. By _definition_ hex floating point number is normalized if and
    only if its leading hex digit is different than zero.

    I wrote sloppily. On average a normalized hex FP number has two leading zeros so you lose another bit compared to binary, in addition to what you lose by no hidden bit and no rounding.

That is almost what I wrote, except that I sketched a proof that
hex FP loses that one bit _in the worst case_, and the average is better. In
the case of IBM hex float, the tradeoff between range and mantissa bits leads to
another bit lost from accuracy, so 4 bits in the worst case (but the range
is twice as large as IEEE floats). To summarize: 1 bit of loss (compared
to binary with no hidden bit) due to the uneven distribution of hex, 1 bit of
loss due to the impossibility of using a hidden bit in hex, 1 bit of loss due to
the larger range, and 1 bit of loss due to the lack of rounding.
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Feb 17 20:43:35 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> wrote:
    On Sun, 15 Feb 2026 14:37:00 +0000, John Dallman wrote:

    Quadi, have your computer architectures included IBM 360 floating point
    support? There is probably more demand for that than for 36-bit these
    days.

    Yes, in fact they have. The goal there is to facilitate data interchange
    and emulation, not to provide better quality floating-point arithmetic... since, of course, it provides rather the opposite, as has been discussed
    in this thread.

    The original CISC Concertina I architecture went further; it had the goal
    of being able to natively emulate the floating-point of just about every computer ever made.

That was probably already written, but since you are revising your
design it may be worth stating some facts. If you have a 64-bit
machine with convenient access to 32-bit, 16-bit and 8-bit parts,
you can store any number of bits between 4 and 64 wasting at most
50% of storage and have simple access to each item. So in terms
of memory use you are trying to avoid this 50% loss. In practice
the loss will be much smaller because:

- power of 2 quantities are quite popular
- when a program needs a large number of items of some other size,
the programmer is likely to use packing/unpacking routines, keeping
data in a space-efficient packed format most of the time and unpacking
it for processing
- a machine with fast bit-extract/bit-insert instructions can perform
most operations quite fast even on packed data

so the possible gain in memory consumption is quite low. Given that
non-standard memory modules and support chips tend to be much more
expensive than standard ones, economically attempting such savings
makes no sense.
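Such packing/unpacking routines are short. A toy sketch (the editor's, with made-up helper names) for arbitrary-width items in a byte buffer:

```python
def insert_bits(buf, bitpos, width, value):
    # Write the low `width` bits of value into bytearray buf,
    # most-significant bit first, starting at bit offset bitpos.
    for i in range(width):
        bit = (value >> (width - 1 - i)) & 1
        byte, off = divmod(bitpos + i, 8)
        if bit:
            buf[byte] |= 1 << (7 - off)
        else:
            buf[byte] &= ~(1 << (7 - off)) & 0xFF

def extract_bits(buf, bitpos, width):
    # Read `width` bits back, starting at bit offset bitpos.
    v = 0
    for i in range(width):
        byte, off = divmod(bitpos + i, 8)
        v = (v << 1) | ((buf[byte] >> (7 - off)) & 1)
    return v

# Two 36-bit items fit exactly in 9 bytes -- no storage wasted.
buf = bytearray(9)
insert_bits(buf, 0, 36, 0xDEADBEEF5)
insert_bits(buf, 36, 36, 0x123456789)
print(hex(extract_bits(buf, 0, 36)), hex(extract_bits(buf, 36, 36)))
```

A machine with bit-extract/bit-insert instructions does each of these loops in one operation.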

Of course, there is also the question of speed. The argument above shows
that the loss of speed on access itself can be quite small. So what
remains is the speed of processing data. As long as you do processing
on power-of-2 sized items (that is, unusual sizes are limited to
storage), the loss of speed can be modest; basically, a dedicated 36-bit
machine probably can do 2 times as many 36-bit float operations
as a standard machine can do 64-bit operations. Practically, this
loss will be smaller than the loss of storage, but still does not look significant
enough to warrant development of a special machine.

Things are somewhat different when you want bit-accurate results
using old formats. Here already one's complement arithmetic has
significant overhead on a two's complement machine. And emulating
old floating point formats is more expensive. OTOH, modern
machines are much faster than old ones. For example, a modern CPU
seems to be more than 1000 times faster than a real CDC-6600, so
even slow emulation is likely to be faster than the real machine,
which means that the emulated machine can do the work of the original
one.

So to summarize: practical considerations leave rather small space
for a machine using non-power-of-two formats, and it is rather
unlikely that any design can fit there.

Of course, there is a very good reason to explore non-mainstream
approaches, namely having fun. But once you realize that
mainstream designs make their choices for good reasons,
exploring alternatives gets less fun (at least for me).
    --
    Waldek Hebisch
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Feb 18 00:50:52 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    quadi <quadibloc@ca.invalid> wrote:
    On Sun, 15 Feb 2026 14:37:00 +0000, John Dallman wrote:

    Quadi, have your computer architectures included IBM 360 floating point
    support? There is probably more demand for that than for 36-bit these
    days.

    Yes, in fact they have. The goal there is to facilitate data interchange and emulation, not to provide better quality floating-point arithmetic... since, of course, it provides rather the opposite, as has been discussed in this thread.

    The original CISC Concertina I architecture went further; it had the goal of being able to natively emulate the floating-point of just about every computer ever made.

That was probably already written, but since you are revising your
design it may be worth stating some facts. If you have a 64-bit
machine with convenient access to 32-bit, 16-bit and 8-bit parts,
you can store any number of bits between 4 and 64 wasting at most
50% of storage and have simple access to each item. So in terms
of memory use you are trying to avoid this 50% loss. In practice
the loss will be much smaller because:

- power of 2 quantities are quite popular
- when a program needs a large number of items of some other size,
the programmer is likely to use packing/unpacking routines, keeping
data in a space-efficient packed format most of the time and unpacking
it for processing
- a machine with fast bit-extract/bit-insert instructions can perform
most operations quite fast even on packed data

so the possible gain in memory consumption is quite low. Given that
non-standard memory modules and support chips tend to be much more
expensive than standard ones, economically attempting such savings
makes no sense.

Of course, there is also the question of speed. The argument above shows
that the loss of speed on access itself can be quite small. So what
remains is the speed of processing data. As long as you do processing
on power-of-2 sized items (that is, unusual sizes are limited to
storage), the loss of speed can be modest; basically, a dedicated 36-bit
machine probably can do 2 times as many 36-bit float operations
as a standard machine can do 64-bit operations. Practically, this
loss will be smaller than the loss of storage, but still does not look significant
enough to warrant development of a special machine.

Things are somewhat different when you want bit-accurate results
using old formats. Here already one's complement arithmetic has
significant overhead on a two's complement machine.

The only useful difference between 1's complement and 2's complement in
ADD is the end-around carry, and the adder will have the same number
of gates and the same gates of delay. So, in theory, one could make
a {1's or 2's} complement adder at the cost of 1 gate of delay and
one logic gate.
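The end-around carry is easy to show in a few lines. A toy model (the editor's, not any particular machine's adder):

```python
def ones_complement_add(a, b, bits=16):
    # Add two ones'-complement values: a carry out of the top bit
    # wraps around and is added back in at the bottom.
    mask = (1 << bits) - 1
    s = a + b
    if s > mask:              # end-around carry
        s = (s + 1) & mask
    return s

minus3 = 0xFFFF ^ 3          # ones'-complement encoding of -3 in 16 bits
print(ones_complement_add(5, minus3))  # 2
```

Emulating this on a two's complement machine needs the extra carry test and fix-up on every add, which is where the software overhead comes from.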

And emulating
old floating point formats is more expensive. OTOH, modern
machines are much faster than old ones. For example, a modern CPU
seems to be more than 1000 times faster than a real CDC-6600, so
even slow emulation is likely to be faster than the real machine,
which means that the emulated machine can do the work of the original
one.

    Access to 64×64->128 is the key unit of processing.

So to summarize: practical considerations leave rather small space
for a machine using non-power-of-two formats, and it is rather
unlikely that any design can fit there.

Of course, there is a very good reason to explore non-mainstream
approaches, namely having fun. But once you realize that
mainstream designs make their choices for good reasons,
exploring alternatives gets less fun (at least for me).

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed Feb 18 08:52:27 2026
    From Newsgroup: comp.arch

    On Tue, 17 Feb 2026 20:43:35 +0000, Waldek Hebisch wrote:

    But once you realize that mainstream
    designs make their choices for good reasons,
    exploring alternatives gets less funny (at least for me).

    At one time, back in the past, the mainstream computers had word lengths
    such as 12 bits, 18 bits, 24 bits, 30 bits, 36 bits, 48 bits, 60 bits...
    all multiples of 6 bits.

    The reason for this was that computers needed a character set with
    letters, numbers, and various special characters - and a six-bit
    character, with 64 possibilities, was adequate for that.

As technology advanced, and computer power became cheaper, it became
possible to think of using computers for more applications. Using an
eight-bit character allowed the use of lower-case characters, getting rid
of a limitation of the older computers that could possibly become annoying
in the future. Of course, a 7-bit character would also be enough for that -
and at least one company, ASI, actually made computers with word lengths
that were multiples of 7 bits.

Even before System/360, IBM made a computer built around a 64-bit word,
the STRETCH. It was intended to be a very powerful scientific computer,
but it also had the very rare feature of bit addressing - which a
power-of-two word length made much more practical.

    Hardly any architectures provide bit addressing these days, though.

    None the less, a character set that includes lower-case is a good reason. Since a 36-bit word works better with 9-bit characters instead of 6-bit characters being addressable, nothing is really lost by going to 36 bits.

    Of course, there's another good reason for sticking with 32-bit or 64-bit designs: because that's what everyone else is using, standard memory
    modules have data buses corresponding to such widths, possibly with extra
    bits for ECC.

To me, those don't seem to be enough "good reasons" to absolutely preclude
different word lengths. But there would definitely have to be a real
benefit to justify the cost and effort to use a different length. It seems
to me there is a real benefit, in that the available data sizes in the
32-bit world aren't optimized to the needs of scientific computation.

    But it's quite correct to feel this real benefit isn't enough to make
    machines oriented around the 36-bit word length likely.

    John Savard

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Feb 18 08:40:38 2026
    From Newsgroup: comp.arch

    On 2026-02-18 3:52 a.m., quadi wrote:
    On Tue, 17 Feb 2026 20:43:35 +0000, Waldek Hebisch wrote:

    But once you realize that mainstream
    designs make their choices for good reasons,
    exploring alternatives gets less funny (at least for me).

    At one time, back in the past, the mainstream computers had word lengths
    such as 12 bits, 18 bits, 24 bits, 30 bits, 36 bits, 48 bits, 60 bits...
    all multiples of 6 bits.

    The reason for this was that computers needed a character set with
    letters, numbers, and various special characters - and a six-bit
    character, with 64 possibilities, was adequate for that.

    As technology advanced, and computer power became cheaper, it became
possible to think of using computers for more applications. Using an eight-bit character allowed the use of lower-case characters, getting rid of a limitation of the older computers that could possibly become annoying in
    the future. Of course, a 7-bit character would also be enough for that -
    and at least one company, ASI, actually made computers with word lengths
    that were multiples of 7 bits.

    Even before System/360, IBM made a computer built around a 64-bit word,
    the STRETCH. It was intended to be a very powerful scientific computer,
but it also had the very rare feature of bit addressing - which a power-of-two word length made much more practical.

    Hardly any architectures provide bit addressing these days, though.

    None the less, a character set that includes lower-case is a good reason. Since a 36-bit word works better with 9-bit characters instead of 6-bit characters being addressable, nothing is really lost by going to 36 bits.

    Of course, there's another good reason for sticking with 32-bit or 64-bit designs: because that's what everyone else is using, standard memory
    modules have data buses corresponding to such widths, possibly with extra bits for ECC.

    To me, those don't seem to be enough "good reasons" to absolutely preclude different word lengths. But there would definitely have to be a real
    benefit to justify the cost and effort to use a different length. It seems
to me there is a real benefit, in that the available data sizes in the 32-bit world aren't optimized to the needs of scientific computation.

    But it's quite correct to feel this real benefit isn't enough to make machines oriented around the 36-bit word length likely.

    John Savard

    Maybe we should switch to 18-bit bytes to support UNICODE.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Feb 19 02:10:07 2026
    From Newsgroup: comp.arch

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.


    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.



    FMAC does suddenly get a bit cheaper if its scope is limited to
    FP8*FP8+FP16, but this operation is a bit niche.


This one makes a lot of sense for NN's, but I haven't gotten my NN tech
working well enough to make a strong use-case for it.

    Where, in terms of algorithmic or behavioral complexity relative to computational efficiency, NN's are significantly behind what is possible
    with genetic algorithms or genetic programming.


    So, for computational efficiency of the result:
    Hand-written native code, best efficiency;
    Genetic algorithm, moderate efficiency;
    Neural Net, very inefficient.

    The merit of NNs could then be if one could make them adaptive in some practical way:
    Native code: No adaptation apart from specific algos;
    Genetic algorithms: Only when running the evolver, static otherwise;
    NN's: Could be made adaptable in theory, usually fixed in practice.

    And, adaptation process:
    Native: None, maybe manual fiddling by programmer;
    Genetic algo: Initially very slow, gradually converges on answer;
NNs, via genetic algorithm: Slow, but converges toward an answer;
    NNs, via backprop: Rapid adaptation initially, then hits a plateau.

    Backprop is seemingly prone to get stuck at a non-optimal solution, and
    then is hard pressed to make any further progress. Seemingly isn't
    really able to "fix" any obvious structural defects once it hits a
    plateau, but can sometimes jump up or down between various nearby
    options (when obvious suboptimal patterns persist).

    Some tricks that work with GA-NN's don't really work with backprop, and
    my initial attempts to glue GA handling onto backprop have not been
    effective. Also it seems to need at least FP16 weights for training to
    work effectively (though, one other option being FP8 with a bias
    counter; but this is effectively analogous to using a non-standard
    S.E4.M11 format).


    Seemingly, my own efforts are getting stuck at the level of very
    inefficiently solving very mundane issues, nowhere near the success
    being seen by more mainstream efforts.

    Nor, as of yet, even anything particularly interesting...



    Had started making some progress in other types of areas though, for
    example:
    Figured out a practical way to get below 16kbps for audio...

    By using 8kHz ADPCM and then using lookup table and reversed LZ search trickery to make the audio more LZ compressible (without changing the
    storage format).

    Or, basically, ADPCM encoding strategy like:
    Lookup a match for the last 4 bytes;
    Look for the longest backwards match (last N bytes);
    Evaluate if the next byte for pattern is within an error limit;
    Select based on combination of error and length
    Longer matches permit more error than shorter ones.
    Check a pattern table,
    seeing if anything is within an acceptable error limit;
    Use pattern if so.
    Else:
    Figure out best-match for next 6 samples,
    using this to encode next 4 samples (1 byte).
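For readers without an ADPCM background, a minimal 2-bit adaptive-delta core (the editor's toy, with an invented step table; the actual encoder adds the LZ-match selection sketched above on top of something like this):

```python
STEPS = [16, 32, 64, 128, 256, 512, 1024, 2048]   # invented step table
QUANT = (-3, -1, 1, 3)                            # 2-bit code -> step multiplier

def encode(samples):
    pred, si, codes = 0, 3, []
    for s in samples:
        # Pick the 2-bit code whose reconstruction lands closest.
        c = min(range(4), key=lambda k: abs(s - pred - QUANT[k] * STEPS[si] // 2))
        codes.append(c)
        pred += QUANT[c] * STEPS[si] // 2
        si = max(0, min(7, si + (1 if c in (0, 3) else -1)))  # adapt step size
    return codes

def decode(codes):
    # Mirrors the encoder's predictor, so reconstruction is bit-exact.
    pred, si, out = 0, 3, []
    for c in codes:
        pred += QUANT[c] * STEPS[si] // 2
        out.append(pred)
        si = max(0, min(7, si + (1 if c in (0, 3) else -1)))
    return out

print(decode(encode([0, 100, 200, 150])))
```

The "super-compression" trick above keeps this storage format but biases the encoder's code choices, within an error budget, toward byte sequences the LZ stage can match.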

    Was able to get around a 20-30% reduction in bitrate, or around 12 kbps typical, before loss of audio quality becomes unacceptable (starts
    breaking down in obvious ways).


    Did version for 4-bit ADPCM, which can get a roughly similar reduction,
    or around 24 kbps, though trying to push it much lower makes 2-bit ADPCM preferable.

    A slightly higher reduction rate is possible if the baseline sample-rate
    is increased to 16kHz, but still doesn't get as low as when using 8 kHz.



    Note that it is possible to just use a pattern table directly to give an equivalent of 8 kbps ADPCM (each byte encoding an index into an 8-sample table, which is then decoded as 2-bit ADPCM), but the audio quality is unacceptably poor (for much of any use-case).


    Though, all this was mostly dusting off an experiment from last year,
    and putting it to use in my packaging tool (inside BGBCC).

    Mostly it is a case of:
    It is "good enough" to at least allow for optional super-compression of
    ADPCM without breaking the existing decoders.


    ...




    - anton

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 17:30:50 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.
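The single-rounding requirement is easy to demonstrate. A sketch (the editor's) that emulates a correctly rounded FMAC with exact rationals:

```python
from fractions import Fraction

a = b = float(2**27 + 1)       # exactly representable in binary64
c = -float(2**54)

separate = a * b + c           # product rounded to binary64 first, then added
fused = float(Fraction(a) * Fraction(b) + Fraction(c))  # one final rounding

print(separate, fused)   # 268435456.0 268435457.0
```

Separate FMUL-then-FADD rounds twice and loses the low bit of the product; a true FMAC keeps the exact double-width product and rounds once, which is why software emulation of IEEE-quality results is so painful.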

    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    The add stage after the multiplication tree is <essentially> 2× as wide.
    FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4-cycles (occasionally 5).

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Feb 19 20:15:29 2026
    From Newsgroup: comp.arch

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

The add stage after the multiplication tree is <essentially> 2× as
wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4 cycles (occasionally 5).

    Arm Inc. application processor cores have FMAC latency = 4 for the
    multiplicands, but 2 for the accumulator.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Feb 19 18:49:22 2026
    From Newsgroup: comp.arch

    John Dallman <jgd@cix.co.uk> schrieb:

    Quadi, have your computer architectures included IBM 360 floating point support? There is probably more demand for that than for 36-bit these
    days.

    It has been quite a few decades since the last large-scale
    scientific calculations in IBM hex float; I believe it must have
    been the Japanese vector computers (one of which I worked on in
    the mid to late 1990s). It is probably safe to say that any
    hex float these days is embedded firmly in the z ecosystem.

    Since every laptop these days has more performance than the old
    vector computers, I very much doubt that there is significant data
    saved in that format. Same thing for VAX floating point formats.

    Big- vs. little-endian data is a more recent issue. Around 20 years
    ago, I wrote code to convert between big- and little-endian data
    for gfortran. This, too, is quite irrelevant today.

    The last conversion issue I had a hand in was for IBM's "double
    double" 128-bit real. POWER now supports IEEE 128-bit reals in
    hardware (if not very fast), but the ABI change has been very painful.

    There could, however, be a niche for 36-bit reals - graphics cards.
    I have recently discovered a GPU solver in a commercial package that
    I use, and it has an option for using 32-bit reals. 36-bit reals
    could extend the usefulness of such a solver.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Feb 19 19:55:40 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 19 Feb 2026 17:30:50 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in
    hardware. The additional hardware cost (or the cost of trapping
    and software emulation) has been the only argument against
    denormals that I ever encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals
    became a low cost addition. {And that has been my point--you seem
    to have forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Well, and the secondary irony that it is mainly cost-added for
    FMUL, whereas FADD almost invariably has the necessary support
    hardware already.

    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some
    cases.

    The add stage after the multiplication tree is <essentially> 2× as
    wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    So, FMAC is a single unit that costs more than both units taken separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4 cycles (occasionally 5).


    Arm Inc. application processor cores have FMAC latency = 4 for the
    multiplicands, but 2 for the accumulator.

    Thank you for that tidbit of information.

    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Feb 20 08:14:46 2026
    From Newsgroup: comp.arch

    On Wed, 18 Feb 2026 08:40:38 -0500, Robert Finch wrote:

    Maybe we should switch to 18-bit bytes to support UNICODE.

    It's true that Unicode has expanded beyond the old 16-bit Basic
    Multilingual Plane. But while all currently-defined characters would fit
    in 18 bits, code points as large as 31 bits were once envisaged; that is
    what the original UTF-8 design supports.

    If 9-bit bytes are used for simple applications, it certainly will be true that 18-bit halfwords will be an available data type.

    John Savard
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 05:08:28 2026
    From Newsgroup: comp.arch

    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. The >>>> additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever
    encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it as Binary128.

    Can special case the "Binary64 * Binary64 => Binary128" case to save
    cost over using a native Binary128 multiply.

    For Binary128 multiply, it can also make sense to detect and
    special-case the "low order bits are zero" case:
    If the low-order bits are zero, one can use a multiply that only
    produces the high 128 bits;
    Vs doing a transient 128*128=>256-bit multiply and then needing to
    round.
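
    For a single FMUL, the widening trick is provably safe at the
    Binary32-in-Binary64 level: two 24-bit significands multiply to at
    most 48 bits, which binary64's 53-bit significand holds exactly, so
    the one narrowing is the one rounding (the fused a*b+c case is
    subtler). A quick sanity check in Python (to_f32 is an ad-hoc helper,
    not a library routine):

```python
import struct
from fractions import Fraction

def to_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (kept as a float)."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

a = to_f32(1.0 + 2.0 ** -23)    # exactly representable binary32 inputs
b = to_f32(1.5 + 2.0 ** -23)

prod64 = a * b                  # binary64 multiply: exact, 48 bits fit in 53
assert Fraction(prod64) == Fraction(a) * Fraction(b)   # no rounding happened yet
result = to_f32(prod64)         # the single, correct rounding to binary32
```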



    Relative cost is lower if one is already paying the cost of a trap
    handler or similar (except that if the ISA supports it, you really don't
    want the compiler to combine these operations).

    So, one can maybe document that, if using a compiler like GCC, one should use "-ffp-contract=off -fno-fdiv", ...



    Well, and the secondary irony that it is mainly cost-added for FMUL,
    whereas FADD almost invariably has the necessary support hardware already. >>
    But:
    FMUL is expensive operation + cheap normalizer (if no denormals);
    FADD is cheap operation with expensive normalizer.

    FMAC then is gluing the costs of the two units together, but:
    With roughly the latency of both;
    The need to be significantly wider internally to deal with some cases.

    The add stage after the multiplication tree is <essentially> 2× as wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.


    A 160-bit 3-way adder happening "quickly" is still kinda asking a lot though...


    Though, granted, the first step is deciding to do a full-width
    multiply, and not discard the low-order results.

    Granted, discarding the low results reduces rounding accuracy, but a
    way to fake full IEEE rounding is to detect this case and have the FMUL
    raise a fault (similar to denormal/underflow handling). Though, this
    does mean there is a performance penalty when multiplying numbers where
    the low-order bits in both values are non-zero.
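
    The detect-and-trap condition can be sketched as an integer-significand
    predicate (a hypothetical helper illustrating the idea, not the actual
    hardware check):

```python
def needs_slow_path(sig_a: int, sig_b: int, keep: int = 53) -> bool:
    """True if the full significand product has nonzero bits below the top
    `keep` bits retained by a truncating multiply -- i.e. the fast path
    would lose sticky information needed for correct IEEE rounding."""
    prod = sig_a * sig_b
    lost = max(0, prod.bit_length() - keep)
    return (prod & ((1 << lost) - 1)) != 0

# Product exact in the high half: the truncating fast path suffices.
print(needs_slow_path(1 << 52, 1 << 52))               # False
# Both operands have low-order bits set: fall back / trap.
print(needs_slow_path((1 << 52) | 1, (1 << 52) | 1))   # True
```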


    In my ISA, the exact behavior depends on the instruction and rounding
    mode. In the RISC-V mode, it is partly based on the instruction's
    rounding mode and on flag settings.

    For various reasons, full IEEE emulation can't safely be enabled until
    after setting up virtual memory and similar.


    The handling of the RISC-V F/D extensions was non-standard in my case,
    though not in a way that affects GCC output (it seems to exclusively use
    the DYN rounding mode in instructions, assuming the rounding mode to be
    handled via CSRs). Also, ironically, and contrasting with the seeming
    design of these extensions, these registers are so rarely accessed in
    practice that it seemed most sensible to use trap-and-emulate for the
    CSRs.


    Granted, there are limits to corner cutting:
    If a design does not produce exact results in cases where it is trivial
    to verify that an exact answer exists (i.e., cases requiring no
    rounding), IMO it is below the minimum bar for a usable general-purpose
    FPU.



    So, FMAC is a single unit that costs more than both units taken
    separately, and with a higher latency.

    Prior RISC processors did FMUL in 3-4 cycles (mostly 4).
    Later RISC processors and x86 did FMAC in 4 cycles (occasionally 5).


    Trying to push the latency down would be pretty bad for timing, unless
    there is some cheaper way to implement FPUs that I am not aware of.

    In my case:
    FMADD.D, RM=DYN: Trap
    FMADD.D, RM=RNE, 10-cycle, double-rounded (non-standard)
    FMADD.S, RM=DYN, 10-cycle (mimics single rounding, *)
    *: Happens internally at Binary64 precision.

    It could be possible to handle FMADD.D RM=DYN the same way as RNE
    internally, but then trap if the inputs would potentially give a
    non-IEEE result. Though, for now, trapping is the cheaper solution in
    terms of HW cost.




    The one exception is FP8*FP8 + FP16, but mostly because it is possible
    to do FP8*FP8 in under 1 cycle.

    But, still not free here; and overly niche. Ended up going with a
    cheaper option of simply having an SIMD FP8*FP8=>FP16 multiply op (which
    still ends up as a 2-cycle op, because...).
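
    The reason FP8*FP8=>FP16 can skip rounding entirely: with a 4-bit
    stored mantissa (E4M3-style), the product of two 5-bit significands
    needs at most 10 bits, and FP16's 11-bit significand holds that
    exactly. A brute-force check over an E4M3-like subset (assumed format;
    exponents restricted here so every product stays normal in FP16):

```python
import struct
from fractions import Fraction

def to_f16(x: float) -> float:
    """Round a binary64 value to the nearest binary16 value (kept as a float)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Enumerate positive normal FP8-style values 1.mmmm * 2^e, 4 mantissa bits.
vals = [(1 + m / 16) * 2.0 ** e for e in range(-2, 3) for m in range(16)]

all_exact = all(
    Fraction(to_f16(a * b)) == Fraction(a) * Fraction(b)
    for a in vals for b in vals
)
print(all_exact)   # True: every such FP8*FP8 product is exact in FP16
```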


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Feb 20 15:22:05 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware.  The >>>>> additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever>>>>> encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it as Binary128.
    Neither of those works!
    I believed this to be true, but I was shown the error of my thinking by
    more knowledgeable people in the 754 working group. I.e., they had a
    very simple/small example where doing the calculation in the next
    higher precision would still cause double rounding errors.
    Also note that Mitch has stated multiple times that you need ~160
    mantissa bits during FMAC double calculations.
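
    The double-rounding failure Terje describes can be reproduced with
    exact rational arithmetic: rounding once to 24 bits and rounding via
    53 bits first give different answers (a sketch; round_nearest_even is
    a hand-rolled helper, not a library routine):

```python
from fractions import Fraction

def round_nearest_even(x: Fraction, p: int) -> Fraction:
    """Round positive x to p significand bits (round-to-nearest, ties-to-even)."""
    e = 0
    while x >= 2:               # normalize x into [1, 2), tracking the exponent
        x /= 2; e += 1
    while x < 1:
        x *= 2; e -= 1
    scaled = x * 2 ** (p - 1)   # significand scaled so the kept bits are integral
    n = scaled.numerator // scaled.denominator
    rem = scaled - n
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and n % 2 == 1):
        n += 1
    return Fraction(n, 2 ** (p - 1)) * Fraction(2) ** e

# A value whose sticky bit sits exactly where binary64 rounds it away:
x = Fraction(1) + Fraction(1, 2 ** 24) + Fraction(1, 2 ** 53)

once  = round_nearest_even(x, 24)                          # straight to 24 bits
twice = round_nearest_even(round_nearest_even(x, 53), 24)  # via 53 bits first

print(once == Fraction(1) + Fraction(1, 2 ** 23))   # True: rounds up
print(twice == Fraction(1))                         # True: double rounding rounds down
```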
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Feb 20 15:26:24 2026
    From Newsgroup: comp.arch

    On 2/20/2026 8:22 AM, Terje Mathisen wrote:
    BGB wrote:
    On 2/19/2026 11:30 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 2/12/2026 11:09 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    {{One can STILL argue whether
    deNormals were a plus or a minus in IEEE}}

    I am surprised to read that from you, who has always written that
    denormals can be implemented cheaply and efficiently in hardware. >>>>>> The
    additional hardware cost (or the cost of trapping and software
    emulation) has been the only argument against denormals that I ever >>>>>> encountered.

    It is only after IEEE 754-2008 came with FMAC that deNormals became
    a low cost addition. {And that has been my point--you seem to have
    forgotten the -2008 part or the argument}

    And, can note, this is assuming that one actually pays the cost of
    native hardware FMAC.

    It is exceedingly difficult to get an IEEE quality rounded result if
    not done in HW.

    Likely depends.


    Can use the trick of bumping to the next size up and use that for
    computation.

    So, for Binary32 compute it as Binary64, and for Binary64 compute it
    as Binary128.

    Neither of those works!

    I believed this to be true, but I was shown the error of my thinking by
    more knowledgeable people in the 754 working group. I.e., they had a
    very simple/small example where doing the calculation in the next
    higher precision would still cause double rounding errors.

    Also note that Mitch has stated multiple times that you need ~160
    mantissa bits during FMAC double calculations.


    Could look into this, the next option being to use a makeshift 192-bit
    FP format with a 176-bit mantissa (likely cheaper than going all the
    way to 224 bits).

    This is slow/annoying, but not really likely a "hard" problem (when one
    is already doing this stuff in software in a trap handler).


    So, potentially:
    Binary32 -> FP96 (truncated Binary128, still stored as Binary128)
    Binary64 -> FP192 (extended Binary128)
    Binary128 -> FP384 (likewise)
    Big/ugly, but no one says this needs to be fast...


    Might end up on a sort of "TODO list"...


    In any case, actual native hardware support for single-rounded FMA is
    unlikely to happen in my case.


    ...

    Terje


    --- Synchronet 3.21b-Linux NewsLink 1.2
  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sat Feb 21 20:27:25 2026
    From Newsgroup: comp.arch

    On 2/19/26 12:30 PM, MitchAlsup wrote:
    [snip]
    The add stage after the multiplication tree is <essentially> 2× as wide. FMUL needs a 108-bit 2-input adder
    FMAC needs a 160-bit 3-input adder and a 52-bit incrementor.
    The multiplication tree is the same, normalizer is larger.

    How much of that applies to a double-rounded FMADD? Double-
    rounding would have the advantage of giving bit-identical
    results.

    While single rounding presents more algorithmic opportunities,
    double-rounded FMADD would still save decode bandwidth (and
    issue bandwidth if a pair of FMUL and FADD instructions were not
    fused by idiom recognition) at the cost of supporting three-
    input operations (with reduced forwarding per unit work).

    Getting bit-identical results whether FMUL and FADD are executed
    separately or as a fused operation via an ISA-extension FMADD
    might have some practical benefit.

    (I still wonder if an FP execution model that only calculated to
    the integer size precision (64-bit integer for 64-bit FP),
    ignoring carry-in, might have been acceptable for 99% of uses
    and saved a little bit of power and area as well as potentially
    facilitated software FP — to implement FMUL without carry-in one
    would have to have an integer multiply high result that did not
    use carry-in (a mirror of multiply low execution) which would
    probably only be useful for multiplication by reciprocal as
    integer results are expected to be exact.)
    --- Synchronet 3.21b-Linux NewsLink 1.2