• Re: Tonights Tradeoff

    From Robert Finch@robfi680@gmail.com to comp.arch on Tue Oct 28 23:52:53 2025
    From Newsgroup: comp.arch

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.
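
    (I.e., 10 bits plus 40 bits per additional instruction word used for
    the constant; as a trivial sketch of that relationship:)

      /* sketch, assuming 10 bits in the base word plus whole 40-bit
         trailing words for the rest of the constant */
      static int const_bits(int ext_words) { return 10 + 40 * ext_words; }
      /* 0..3 extension words -> 10, 50, 90, 130 bits */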

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 00:14:08 2025
    From Newsgroup: comp.arch

    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name. Do you have 5 or 6
    bit register numbers in the instructions? Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste. If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method better
    than automatically using r(x+1)?



    GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 04:29:15 2025
    From Newsgroup: comp.arch

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, with the drawback of being non-power-of-2.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.


    Well, that and when co-existing with RV64G, it gives somewhere to put
    the FPRs. But, in turn this was initially motivated by me failing to
    figure out how to get GCC configured to target Zfinx/Zdinx.


    Had ended up going with the Even/Odd pairing scheme as it is less wonky
    IMO to deal with R5:R4 than R36:R4.


    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.


    BT/BF works well. I otherwise also ended up using RISC-V style branches,
    which I originally disliked due to higher implementation cost, but they
    do technically allow for higher performance than just BT/BF or Branch-Compare-with-Zero in 2-R cases.

    So, it becomes harder to complain about a feature that does technically
    help with performance.


    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.


    Hmm...

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    There are some unresolved bugs, but I haven't been able to fully hunt
    them down. A lot was in relation to RISC-V's C extension, but at least
    it seems like at this point the C extension is likely fully working.

    There haven't been many features left that can usefully increase general-case performance. So, it is starting to seem like XG2 and XG3 may be fairly
    stable at this point.

    The longer term future is uncertain.


    My ISAs can beat RISC-V in terms of code-density and performance, but
    when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.



    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    Had noted in Cinebench that my main PC is actually performing a little
    slower than is typical for the 2700X, but then again, it is effectively
    a 2700X running with DDR4-2133 rather than DDR4-2933. Partly this
    was a case of the RAM I have being unstable if run that fast (and in
    this case, more RAM but slightly slower seemed preferable to less RAM
    but slightly faster, or running it slightly faster but having the
    computer be crash-prone).

    They sold the RAM with its on-the-box speed being the XMP2 settings
    rather than the baseline settings, but the RAM in question didn't run
    reliably at the XMP or XMP2 settings (and I wasn't inclined to spend more;
    more so when there was already the annoyance that my MOBO chipset
    apparently doesn't deal with a full 128GB, but can tolerate 112GB, which is
    maybe not an ideal setup for perf).

    So, yeah, it seems that I have a setup where the 2700X is getting worse single-threaded performance than the i7 8650U in the laptop.

    Apparently, going by Cinebench scores, my PC's single threaded
    performance is mostly hanging out with a bunch of Xeons (getting a score
    in R23 of around 700 vs 950).

    Well, could be addressed, in theory, but would need some RAM that
    actually runs reliably at 2933 or 3200 MT/s and is also cheap...


    In both cases, they are CPUs originally released in 2018.

    Had noted, in a few tests:
    LZ4 benchmark (same file):
    Main PC: 3.3 GB/s
    Laptop: 3.9 GB/s
    memcpy (single threaded):
    Main PC: 3.8 GB/s
    Laptop : 5.6 GB/s
    memcpy (all threads):
    Main PC: ~ 15 GB/s
    Laptop : ~ 24 GB/s
    ( Like, what; thing only has 1 stick of RAM... *1 )

    *1: Also, how is a laptop with 1 stick of RAM matching a dual-socket
    Xeon E5410 with like 8 sticks of RAM...

    or, maybe it was just weak that my main PC was failing to beat the Xeon
    at this?... My main PC does at least beat the Xeon at single-threaded performance (was less true of my older Piledriver based PC).
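
    FWIW, the memcpy numbers above correspond to the sort of trivial loop
    sketched below; buffer size and repetition count here are arbitrary,
    and this is not the exact benchmark code:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      int main(void)
      {
          size_t sz = (size_t)64 << 20;          /* 64MB buffers */
          int reps = 32, i;
          char *src = malloc(sz), *dst = malloc(sz);
          if (!src || !dst) return 1;
          memset(src, 1, sz);
          clock_t t0 = clock();
          for (i = 0; i < reps; i++)
              memcpy(dst, src, sz);
          double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
          printf("%.2f GB/s\n", (double)sz * reps / secs / 1e9);
          free(src); free(dst);
          return 0;
      }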


    Granted, then again, I am using (almost) the cheapest MOBO I could find
    at the time (that had an OK number of RAM slots and SATA connectors).
    Can't quite identify the MOBO or chipset as I lost the box (and not
    clearly labeled on the MOBO itself); except that it is a
    something-or-another ASUS board.

    Like, at the time, IIRC:
    Went on Newegg;
    Pick mostly the cheapest parts on the site;
    Say, a Zen+ CPU being a lot cheaper than Zen 2,
    or pretty much anything from Intel.
    ...


    Did get a slightly fancy/beefy case, but partly this was because I was
    annoyed with the late-90s-era beige tower case I had been using, into
    which I had ended up hot-gluing a bunch of extra PC fans in an
    attempt to keep airflow good enough that it didn't melt. And
    under-clocking the CPU so that it could run reliably.

    Like, 4GHz Piledriver ran too hot and was unreliable, but was far more
    stable at 3.4 GHz. Was technically faster than a Phenom II underclocked
    to 2.8 GHz (for similar reasons).

    Where, at least the Zen+ doesn't overheat at stock settings (but, they
    also supplied the thing with a comparably much bigger stock CPU cooler).

    The case I got is slightly more traditional, with 5.25" bays and similar
    and mostly sheet-steel construction, Vs the "new" trend of mostly glass-covered-box PC cases. Sadly, it seems like companies have mostly
    stopped selling the traditional sheet-steel PC cases with open 5.25"
    bays. Like, where exactly is someone supposed to put their DVD-RW drive,
    or hot-swap HDD trays ?...

    Well, in the past we also had floppy drives, but MOBOs removed the connectors, forcing one to go the USB route if they want a floppy
    drive (now mostly moot, as relatively few other computers still have floppy drives either).




    Well, in theory could build a PC with newer components and a bigger
    budget for parts. Still wouldn't want to go over to Win11, so now it is a
    choice between jumping to Linux or "Windows Server" or similar (like, at
    least they didn't pollute Windows Server with a bunch of random
    pointless crap).

    For now, the inertia option is to just keep using Win10.


    As for the laptop, had noted:
    Can run Minecraft:
    Yes; though best results at an 8-chunk draw distance.
    Much more than this, and the "Intel UHD" graphics struggle.
    At 12 chunks, there is obvious chug.
    At 16 chunks, it starts dropping into single digit territory.
    Can run Doom3:
    Yes: Mostly gets 40-50 fps in Doom 3.

    My main PC can manage a 16-chunk draw distance in Minecraft and mostly
    gets a constant 63 fps in Doom3.

    Don't have many other newer games to test, as I mostly lost interest in
    modern "AAA" games. And, stuff like Doom+RTX, I already know this won't
    work. I can mostly just be happy that Minecraft works and is playable
    (and that its GPU is solidly faster than just using a software renderer...).


    On both fronts, this is a significant improvement over the older laptop.
    For the price, I sort of worried that it would be dead slow, but it significantly outperforms its Vista-era predecessor.

    This is mostly because I had noticed that, right now (unlike a few years
    ago), there are actually OK laptops at cheap prices (along with all the
    $80 Dell OptiPlex computers and similar on Amazon...).



    Otherwise, went and recently wrote up a spec partly based on a BASIC
    dialect I had used in one of my 3D engines, with some design cleanup: https://pastebin.com/2pEE7VE8

    Where I was able to get a usable implementation for something similar in
    a little over 1000 lines of C.

    Though, this was for an Unstructured BASIC dialect.


    Decided then to try something a little harder:
    Doing a small JavaScript like language, and trying to keep the
    interpreter small.

    I don't yet have the full language implemented, but for a partial JS
    like language, I currently have something in around 2500 lines of C.

    I had initially set a target estimate of 4-8 kLOC.
    Unless the remaining functionality ends up eating a lot of code, I am on target towards hitting the lower end of this range (need to get most of
    the rest of the core-language implemented within around 1.5 kLOC or so).

    Note: No 3rd party libraries allowed, only the normal C runtime library.
    Did end up using a few C99 features, but mostly still C95.


    For now, I was calling the language BS3L, where:
    Dynamically typed;
    Supports: Integers, Floating-Point, Strings, Objects, Arrays, ...
    JS style syntax;
    Nothing too exciting here.
    Still has JS style arrays and objects;
    Dynamically scoped (see the sketch after this list).
    Where, dynamic scoping needs less code than lexical scoping;
    But, dynamic scoping is also a potential foot-gun.
    Not sure if too much of a foot-gun.
    Vs going to C-style scoping;
    Or, biting the bullet and properly implementing lexical scoping.
    Leaving out most advanced features.
    will be fairly minimal even vs early versions of JS.
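
    As a toy sketch of the dynamic-scoping tradeoff (one global binding
    stack, lookup walks from the newest binding; names here are invented,
    not from the actual BS3L code):

      #include <stdio.h>
      #include <string.h>

      typedef struct { const char *name; double val; } Binding;
      static Binding bind_stack[256];
      static int bind_top;

      static void push_var(const char *name, double val)
      {
          bind_stack[bind_top].name = name;
          bind_stack[bind_top].val = val;
          bind_top++;
      }

      static double get_var(const char *name)
      {
          int i;
          for (i = bind_top - 1; i >= 0; i--)
              if (!strcmp(bind_stack[i].name, name))
                  return bind_stack[i].val;
          return 0;                      /* undefined: just give 0 */
      }

      int main(void)
      {
          push_var("x", 1);              /* caller binds x */
          int mark = bind_top;
          push_var("x", 2);              /* callee rebinds x; the foot-gun is that
                                            anything called from here now sees 2 */
          printf("%g\n", get_var("x"));  /* 2 */
          bind_top = mark;               /* "return": pop the callee's bindings */
          printf("%g\n", get_var("x"));  /* 1 */
          return 0;
      }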

    But, in some cases, was borrowing some design ideas from the BASIC interpreter. There were some unavoidable costs, such as in this case
    needing a full parser (that builds an AST) and an AST-walking
    interpreter. Unlike BASIC, it wouldn't be possible to implement an
    interpreter by directly walking and pattern matching lists of tokens.
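
    (As a toy sketch of the contrast, the BASIC style was roughly: look at
    the first token of the line and pattern-match the rest directly, no AST.
    Names and the statement set here are made up for illustration:)

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      typedef struct Token { const char *text; struct Token *next; } Token;

      /* executes one line; only "PRINT <number>" and "REM ..." are handled */
      static void exec_line(Token *t)
      {
          if (t && !strcmp(t->text, "PRINT") && t->next)
              printf("%g\n", atof(t->next->text));
          else if (t && !strcmp(t->text, "REM"))
              ;                                  /* comment: ignore the line */
      }

      int main(void)
      {
          Token num = { "42", NULL }, kw = { "PRINT", &num };
          exec_line(&kw);                        /* prints 42 */
          return 0;
      }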

    And, a parser that builds an AST, and code to walk said AST, necessarily
    needs more code.
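
    (Roughly, the extra machinery looks like this: a toy tagged AST node and
    a recursive evaluator, with names and layout invented rather than taken
    from the actual interpreter:)

      #include <stdio.h>

      typedef struct Ast Ast;
      struct Ast { int op; double num; Ast *lhs, *rhs; };  /* op: '#', '+', '*' */

      static double eval(Ast *n)
      {
          switch (n->op) {
          case '#': return n->num;                         /* numeric literal */
          case '+': return eval(n->lhs) + eval(n->rhs);
          case '*': return eval(n->lhs) * eval(n->rhs);
          default:  return 0;
          }
      }

      int main(void)
      {
          Ast two = { '#', 2 }, three = { '#', 3 }, four = { '#', 4 };
          Ast mul = { '*', 0, &three, &four };
          Ast add = { '+', 0, &two, &mul };
          printf("%g\n", eval(&add));                      /* 2 + 3*4 = 14 */
          return 0;
      }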

    I guess it is a question of whether someone else could manage to implement a JavaScript style language in under 1000 lines of C while also writing "relatively normal" C (no huge blocks of obfuscated code or rampant abuse of
    the preprocessor). Or, basically, where one has to stick to similar C
    coding conventions to those used in Doom and Quake.


    I am not sure if this would be possible. Both the dynamic type-system
    and parser have eaten up a fair chunk of the code budget. A sub 1000
    line parser is also a little novel; but the parser itself got a little
    wonky and doesn't fully abstract over what it parses (as there is still
    a fair bit of bleed-over from the token stream). And, it sorta ended up abusing binary operators a little.

    For example, it has wonk like dealing with lists of statements as-if
    there were a right-associative semicolon operator (allowing it to be
    walked like a linked list).
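
    (Sketch of what is meant, with an invented node layout: "a; b; c" parses
    as (a ; (b ; c)), so the walker just loops down the right spine:)

      #include <stdio.h>

      typedef struct Node Node;
      struct Node { int op; int id; Node *lhs, *rhs; };

      static void exec_stmt(Node *n) { printf("stmt %d\n", n->id); }

      static void exec_block(Node *n)
      {
          while (n && n->op == ';') {
              exec_stmt(n->lhs);         /* statement on the left */
              n = n->rhs;                /* rest of the "list" */
          }
          if (n) exec_stmt(n);           /* trailing statement, if any */
      }

      int main(void)
      {
          Node a = { 0, 1 }, b = { 0, 2 }, c = { 0, 3 };
          Node s2 = { ';', 0, &b, &c }, s1 = { ';', 0, &a, &s2 };
          exec_block(&s1);               /* stmt 1, stmt 2, stmt 3 */
          return 0;
      }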

    There is slightly wonky operator tokenization again to save code:
    Separately matching every possible operator pattern is a bunch of extra
    logic. Was using rules that mostly give the correct operators, but with
    the possibility of non-sense operators. Also the precedence levels don't
    match up exactly, but this is a lower priority issue.
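
    (A sketch of the general idea, not the actual tokenizer: treat any run of
    operator characters as a single token, which avoids matching each operator
    pattern separately but will happily emit nonsense like "=+*":)

      #include <stdio.h>
      #include <string.h>

      static int is_op_char(char c)
          { return c && strchr("+-*/%<>=!&|^~?", c) != NULL; }

      /* copies a maximal run of operator chars into out, returns chars consumed */
      static int scan_op(const char *s, char *out, int outsz)
      {
          int n = 0;
          while (is_op_char(s[n]) && n < outsz - 1) { out[n] = s[n]; n++; }
          out[n] = 0;
          return n;
      }

      int main(void)
      {
          char tok[8];
          scan_op("<<=rest", tok, sizeof tok); printf("%s\n", tok);  /* "<<="      */
          scan_op("=+*x",    tok, sizeof tok); printf("%s\n", tok);  /* "=+*" junk */
          return 0;
      }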


    I guess, if someone thinks they can do so in significantly less code,
    they can try.

    Note that while a language like Lua sort of resembles an intermediate
    between BASIC and JavaScript, I wouldn't expect Lua to save that much
    here (it would still have the cost of needing to build an AST and similar).

    Going from an AST to a bytecode or 3AC IR would allow for higher
    performance.

    But, I decided to go for an AST walking interpreter in this case as it
    would be the least LOC.


    Actually takes more effort trying to keep the code small. Rather than
    just copy-pasting stuff a bunch of times, one spends more time needing
    to try to factor out and reuse common patterns.


    Though, in a way, some of this is revisiting stuff I did 20+ years ago,
    but from a different perspective.

    Like, 20+ years ago, my first interpreters also used AST walkers.

    As for where I will go with this, I don't know.
    Some of it could make sense as a starting point for a GLSL compiler;
    Or maybe adapted into parsing the SCAD language;

    Or, as a cheaper alternative to what my first script VM became.
    By the end of its span, it had become quite massive...
    Though, still not too bad if compared with SpiderMonkey or V8.

    Ironically, my jump to a Java + JVM/.NET like design was actually to
    make it simpler.

    For a simple but slow language, JS works, but if you want it fast it
    quickly turns worse (and simpler to jump to a more traditional
    statically typed language). Like, there was this thing, known as "Hindley-Milner Type Inference", which on one hand, could be used to
    make a JavaScript style language fast (by turning it transparently into
    a statically-typed language), but also, was a huge PITA to deal with
    (this was combined in my VM with optional explicit type declarations;
    with a syntax inspired by ActionScript).


    Well, and when something gets big and complicated enough that one almost
    may as well just use SpiderMonkey or similar to run their JS code, this
    is a problem...

    Still less bad than LLVM, not sure why anyone would willingly submit to
    this.


    Well, there is still a surviving descendant of the original VM (although branching off from an earlier form) in the form of BGBCC.

    Though, makes more sense to do a clean interpreter in this case, than to
    try to build one by copy-pasting the parser from BGBCC or my old VM and
    trying to build a new lighter-weight VM.

    In some of these cases, it is easier to scale up than to scale back down.
    Easier to take simpler code and add features or improve performance,
    than to take more complex code and try to trim it down.


    And, sometimes it does make more sense to just write something starting
    from a clean slate.

    Well, except for my attempt at a clean-slate C compiler, but this was
    more a case of realizing I wouldn't undershoot BGBCC by enough to be worthwhile, and there were some new problem points that were emerging in
    the design. Partly as I was trying to follow a model more like that used
    by GCC and binutils, which I was then left to suspect is not the right approach (and in some ways, the approach I had used in BGBCC seemed to
    make more sense than trying to imitate how GCC does things).

    Might still make sense at some point to try for another clean-slate C
    compiler, though if I would still end up taking a similar general
    approach to BGBCC (or .NET), there isn't a huge incentive (vs continuing
    to use BGBCC).

    Where, say, the main things that would ideally need improvement are
    BGBCC's compile speed and memory footprint. As-is, compiling with BGBCC is about as slow as compiling with GCC, which isn't great.

    Comparably, MSVC is typically a bit faster at compiling stuff IME.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:41:46 2025
    From Newsgroup: comp.arch

    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.  If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be
    passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:50:35 2025
    From Newsgroup: comp.arch

    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.  If you have six bits, you can use
    all 64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app
    mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the
    design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 17:44:14 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.

    These days, that's not so clear. E.g., Zen4 has 192 physical 512-bit
    SIMD registers, despite having only 256-bit wide FUs. The way I
    understand it, a 512-bit operation comes as one uop to the FU,
    occupies it for two cycles (and of course the result latency is
    extra), and then has a 512-bit result.

    The alternative would be to do as AMD did in some earlier cores,
    starting with (I think) K8: have registers that are half as wide and
    split each 512-bit operation into 2 256-bit uops that go through the
    OoO engine individually. This approach would allow more physical
    256-bit registers, and waste less on 32-bit, 64-bit, 128-bit and
    256-bit operations, but would cost more decoding bandwidth,
    renaming bandwidth, renaming checkpoint size (a little), and scheduler
    space than the approach AMD has taken. Apparently the cost of this
    approach is higher than the benefit.

    Doubling the logical register size doubles the renamer checkpoint
    size, no? This way of avoiding "waste" looks quite a bit more
    expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 13:04:42 2025
    From Newsgroup: comp.arch

    On 10/29/2025 7:50 AM, Robert Finch wrote:
    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Sameo for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or
    6 bit register numbers in the instructions.  Five allows you to use
    the high registers for 128 bit operations without needing another
    register specifier, but then the high registers can only be used for
    128 bit operations, which seems a waste.  If you have six bits, you
    can use all 64 registers for any operation, but how is the "upper"
    method that better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would
    be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But
    since it should be using mostly compiled code, it does not make much
    difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    I am not as sure about this approach...

    Well, Low 32=GPR, High 32=FPR, makes sense, I did this.

    But, pairing a GPR and FPR for the 128-bit cases seems wonky; or
    subsetting registers on context switch seems like it could turn into a problem.


    Or, if a goal is to allow for encodings with a 5-bit register field,
    would make sense to use 32-bit encodings.

    Where, granted, 6b register fields in a 32-bit instruction does have the drawback of limiting how much encoding space exists for opcode and
    immediate (and one has to be more careful not to "waste" the encoding
    space as badly as RISC-V had done).

    Though, can note that both:
    R6+R6+Imm10
    R5+R5+Imm12
    Use the same amount of encoding space.
    But, R6+R6+R6 uses 3 bits more than R5+R5+R5.


    Though, one could debate my case, as I did effectively end up burning
    1/4 of the total encoding space mostly on Jumbo prefixes.

    ...



    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a
    branch on bit-set/clear for conditional branches. Might also include
    branch true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.


    FWIW: I have gotten by OK with 128 internal registers:
    00..3F: Array-Mapped Registers (mostly the GPRs)
    40..7F: CRs and SPRs

    Mostly sufficient.

    For the array-mapped registers, these ones use LUTRAM, with a logical
    copy of the array per write port, and some control bits to encode which
    array currently holds the up-to-date copy of the register.

    All this gets internally replicated for each read port.

    So, roughly 18 internal copies of all of the registers with 6R3W, but
    this is unavoidable (since LUTRAMs are 1R1W).


    The other option is using flip-flops, which is the strategy mostly used
    for the writable CRs and SPRs. This is done sparingly as the resource
    cost is higher in this case (at least on xilinx, *).

    *: Things went amiss when I tried to build on Altera, and I needed
    to use FFs for all the GPRs as well, as these FPGAs lack a direct
    equivalent of LUTRAMs and instead have smaller Block RAMs. The
    Lattice FPGAs also lack LUTRAM IIRC (but, my core doesn't map as well to Lattice FPGAs either).


    As for the CR/SPR space:
    Some of it is used for writable registers;
    A big chunk is used for internal read-only registers.
    ZZR, IMM, IMMB, JIMM, etc.
    ZZR: Zero Register / Null Register (Write)
    IMM: Immediate for current lane (33-bit, sign-ext).
    IMMB: Immediate from Lane 3.
    JIMM: 64-bit immediate spanning Lanes A and B.
    ...

    Could also be seen as C0..C63 (or, all control registers) except that
    much of C32..C63 is used for internal read-only SPRs, and a few other
    SPRs (DLR, DHR, and SP).

    Originally, the CRs and SPRs were handled as separate, but now things
    have gotten fuzzy (and, for RISC-V, some of the CRs need to be accessed
    in GPR like ways).

    There is some wonk as they were handled as separate modules, but with
    the current way things are done it would almost make more sense to fold
    all of the CRs into the GPR file module.

    The module might also continue to deal with forwarding, but might also
    make sense to have a RegisterFile module, possibly with a disjoint
    "Register Forwarding And Interlocks" style module (which forwards
    registers if the value is available and signals pipeline stalls as
    needed; this logic currently partly handled by the existing
    register-file module).



    Did experiment with a mechanism to allow bank-swapped registers. This
    would have added an internal 2-bit mode for the registers, and would
    stall the pipeline to swap the current registers with their bank-swapped versions if needed (with the registers internally backed to Block-RAM).
    Ended up mostly not using this though (at best, it wouldn't gain much
    over the existing "Load and Store everything to RAM" strategy; and would
    make context switching slower than it is already).

    It is more likely that a practical mechanism for fast bank swapping
    would need a mechanism to bank-swap the registers to external RAM. Or
    maybe a special "Stall and dump all the registers to this RAM Address" instruction.


    For the RISC-V CSRs:
    Part of the space maps to the CRs, and part maps to CPUID;
    For pretty much everything else, it traps.
    So, pretty much all of the normal RISC-V CSRs will trap.

    Ended up trapping for the RISC-V FPU CSRs as well:
    Rarely accessed;
    Rather than just one CSR for the FPU status, they broke it up into
    multiple sub-registers for parts of the register (like, there is a
    special CSR just for the rounding-mode, ...).

    Also the hardware only supports moving to/from a CR, so any more complex scenarios will also trap. They had gotten a little too fancy with this
    stuff IMO.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 29 18:15:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment so that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Having 64 registers and 64 bit registers makes life easier for that
    particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 11:29:54 2025
    From Newsgroup: comp.arch

    On 10/29/2025 10:44 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions. But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that. e.g.

    Add A1,A2,A3 would be a 64 bit add on those registers but
    Add128 A1,A2,A3 would be a 128 bit add using A1H for the high order
    bits of the destination, etc. So the question becomes how is using
    Rn+32 better than using Rn+1?

    That being said, your points are well taken for a different implementation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:33:46 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly
    for 64-bit and smaller.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.
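
    In C terms the old sequence corresponds roughly to the function below,
    which rounds once at the double multiply and again at the narrowing
    conversion; the fused FMULf rounds only once, from the full product
    (which plain portable C has no direct way to express):

      float scale_twice_rounded(float f)
      {
          return (float)((double)f * 1.425);   /* CVTfd ; FMUL ; CVTdf */
      }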
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:47:09 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% of RISC-V's instruction count.

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and/or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.

    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 14:02:32 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:15 PM, Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.


    Yeah, and from the compiler POV, would likely prefer having Even+Odd pairs.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?


    Agreed.

    From what I have seen, the vast bulk of constants tend to come in
    several major clusters:
    0 to 511: The bulk of all constants (peaks near 0, geometric fall-off)
    -64 to -1: Much of what falls outside 0 to 511.
    -32768 to 65535: Second major group
    -2G to +4G: Third group (smaller than second)
    64-bit: Another smaller spike.

    For values between 512 and 16384: Sparsely populated.
    Mostly the continued geometric fall-off from the near-0 peak.
    Likewise for values between 65536 and 1G.
    Values between 4G and 4E tend to be mostly unused.

    Like, in the sense of, if you have 33-bit vs 52 or 56-bit for a
    constant, the larger constants would have very little advantage (in
    terms of statistical hit rate) over the 33 bit constant (and, it isn't
    until you reach 64 bits that it suddenly becomes worthwhile again).
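
    (In code-generator terms, the bucketing sketched above amounts to picking
    the smallest tier that covers a constant; the function name and exact
    cutoffs here are just for illustration of the 17/33/64 split:)

      #include <stdio.h>
      #include <stdint.h>

      static int imm_tier(int64_t v)
      {
          if (v >= -32768 && v <= 65535)
              return 17;    /* fits sign- or zero-extended 16 bits */
          if (v >= INT32_MIN && v <= (int64_t)UINT32_MAX)
              return 33;    /* fits sign- or zero-extended 32 bits (-2G..+4G group) */
          return 64;        /* everything else */
      }

      int main(void)
      {
          printf("%d %d %d\n",
                 imm_tier(300), imm_tier(100000), imm_tier(1LL << 40));  /* 17 33 64 */
          return 0;
      }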


    Partly why I go with 33 bit immediate fields in the pipeline in my core,
    but nothing much bigger or smaller:
    Slightly smaller misses out on a lot, so almost may as well drop back to
    17 in this case;
    Going slightly bigger would gain pretty much nothing.

    Like, in the latter case, does sort of almost turn into a "go all the
    way to 64 bits or don't bother" thing.


    That said, I do use a 48-bit address space, so while in concept 48-bits
    could be useful for pointers: This is statistically insignificant in an
    ISA which doesn't encode absolute addresses in instructions.

    So, ironically, there are a lot of 48-bit values around, just pretty
    much none of them being encoded via instructions.


    Kind of a similar situation to function argument counts:
    8 arguments: Most of the functions;
    12: Vast majority of them;
    16: Often only a few stragglers remain.

    So, 16 gets like 99.95% of the functions, but maybe there are a few
    isolated ones taking 20+ arguments lurking somewhere in the code. One
    would then need to go up to 32 arguments to have reasonable confidence
    of "100%" coverage.

    Or, impose an arbitrary limit, where the stragglers would need to be
    modified to pass arguments using a struct or something.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 13:05:08 2025
    From Newsgroup: comp.arch

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 15:58:40 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:47 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.


    OK.

    In my case, a lot of the 128-bit operations are a single 32-bit
    instruction, which splits (in decode) to spanning multiple lanes (using
    the 6R3w register file as a virtual 3R1W 128-bit register file).

    In some cases, pairs of 64-bit SIMD instructions may be merged to send
    both through the SIMD unit at the same time. Say, as a special-case
    co-issue for 2x Binary32 ops (which can basically be handled the same as
    the 4x Binary32 scenario by the SIMD unit).

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.


    The reason 17 and 33 ended up slightly preferable is that both
    zero-extended and sign-extended 16 and 32 bit values are fairly common.

    And, if one has both a zero and sign extended immediate, this eats the
    same encoding space as having a 17-bit immediate, or a separate
    zero-extended and one-extended variant.

    There are a few 5/6 bit immediate instructions, but I didn't really
    count them.

    XG3's equivalent of SLTI and similar only has Imm6 encodings (can be
    extended to 33 bits with a jumbo prefix).



    There isn't much need for a direct 128-bit immediate though:
    This case is exceedingly rare;
    Register-pairs basically make it a non-issue;
    Even if it were supported:
    This would still require a 24-byte encoding...
    Which, doesn't save anything over 2x 12-bytes.
    And doesn't gain much, apart from making CPU more expensive.

    Someone could maybe do 20 bytes by using a 128-bit memory load, but with
    the usual drawbacks of using a memory load (BGBCC doesn't usually do
    this). The memory load will have a higher latency than a pair of
    immediate instructions.





    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% RISC-Vs instruction count.


    Basically similar.

    XG3 also uses only 70% as many instructions as RV64G.

    But, if you throw Indexed Load/Store, Load/Store Pair, Jumbo Prefixes,
    etc, at the problem (on top of RISC-V), suddenly RISC-V becomes a lot
    more competitive (30% smaller and 50% faster).

    Not found a good way to improve much over this though...


    But, yeah, if comparing against RV64G as it exists in its standard form,
    there is a bit of room for improvement.



    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.


    Granted.

    The main thing it benefits is things like TKRA-GL, ...

    Doom basically sees no real difference between 32 and 64 GPRs (nor does SW-Quake).


    Mostly matters for code where one has functions with around 100+ local variables... which are uncommon much outside of TKRA-GL or similar.


    As-is, SW-Quake is one of the cases that does well with RISC-V, though GL-Quake performs like hot dog-crap; mostly as TKRA-GL gets wrecked if
    it is limited to 32 registers and doesn't have SIMD.


    Only real saving point is when running with TKRA-GL over system calls in
    which case it runs in the kernel (as XG1) which is slightly less bad.
    For reasons, TestKern kinda still needs to be built as XG1.


    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
    8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
    Rarely reaches turbo
    pretty much only happens if just running a single thread...
    With all cores running stuff in the background:
    Idles around 3.6 to 3.8.

    Laptop:
    4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
    If power set to performance, reaches turbo a lot more easily,
    and with multi-core workloads.
    But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    For $240 I was paranoid it still might not have been fast enough to run Minecraft...


    Still annoyed as the RAM claimed like DDR4-3200 on the box, but doesn't
    run reliably at more than DDR4-2133... Like, you can try 3200 if you
    don't mind the computer blue-screening after a few minutes I guess...



    But, without much RAM, nor enough SSD space to set up a huge pagefile,
    not going to try compiling LLVM on the thing.

    Even with all the RAM, a full rebuild of LLVM still takes several hours
    on my main PC (though, trying to build LLVM or GCC is at least slightly
    faster if one tells the AV software to stop grinding the CPU by looking
    at every file accessed).


    Vs the $80 OptiPlex that came with a 2C/4T Core i3 variant, which wasn't particularly snappy (seemed on-par with the Vista era laptop; though
    this has a 2C/2T CPU).

    Basically, was a small PC that was using mostly laptop-style parts
    internally (laptop DVD-RW drive and laptop style HDD); some sort of ITX
    MOBO layout I think.

    I don't remember there being any card slots; so like if you want to
    install a PCIe card or similar, basically SOL.

    But, it was either this or an off-brand NUC clone...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 21:52:54 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    It is a 64-bit machine that provides a small modicum of support for
    larger sizes. It is not and never will be a 128-bit machine--that is
    what vVM is for.

    Key words "small modicum"

    DBLE simply supplies registers to the pipeline and width to decode.

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.

    DBLE supports 128-bits in the ISA at the total cost of 1 instruction
    added per use. In many situations (especially integer) CARRY is the
    better option because it throws a shadow of width over a number of
    instructions and thereby has lower code foot print costs. So, a 256
    bit shift is only 5 instructions instead of 8. And realistically, if
    you want wider than that, you have already run out of registers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:01:17 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted, though, instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned, OR routines could be required to be 32-bit aligned, for instance.
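
    (In software terms, the fetch amounts to roughly the following byte-offset
    extraction; just a sketch of the addressing, assuming a little-endian code
    image, not of the actual align stage:)

      #include <stdint.h>
      #include <string.h>

      /* fetch the 40-bit instruction starting at byte offset 'pc' */
      static uint64_t fetch40(const uint8_t *code, size_t pc)
      {
          uint64_t w = 0;
          memcpy(&w, code + pc, 5);              /* 5 bytes = 40 bits */
          return w & ((1ULL << 40) - 1);
      }
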
    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops, except possibly atomic memory ops.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.
    So, now there are 16, 56, 96 and 136-bit constants possible. The
    56-bit constant likely has enough range for most 64-bit ops. Otherwise,
    using a 96-bit constant for 64-bit ops would leave the upper 32 bits of
    the constant unused. 136-bit constants may not be implemented, but a
    size code is reserved for that size.
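    A 56-bit immediate covers any signed value in roughly +/- 2^55, which is why
    it reaches most 64-bit operands. A minimal C sketch of the widening step
    (assuming the usual two's-complement arithmetic right shift):

      #include <stdint.h>

      /* Sign-extend a 56-bit immediate to 64 bits: push the field to the top
         of the register, then arithmetic-shift it back down. */
      static int64_t sext56(uint64_t imm56)
      {
          return (int64_t)(imm56 << 8) >> 8;
      }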


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:20:51 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the
    branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.
    Are there other instruction sequences where double-rounding would be good
    to avoid? Seems like HW could detect the sequence and fuse the instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:26:05 2025
    From Newsgroup: comp.arch

    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
      8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
        Rarely reaches turbo
          pretty much only happens if just running a single thread...
        With all cores running stuff in the background:
          Idles around 3.6 to 3.8.

    Laptop:
      4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
        If power set to performance, reaches turbo a lot more easily,
          and with multi-core workloads.
        But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores),
    32 GB RAM, 16GB graphics RAM, 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed; my last machine only had 16GB and I found it using about 20GB.
    I did not want to spring for a machine with even more RAM, as those tended
    to be high-end machines.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 22:31:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 18:48:56 2025
    From Newsgroup: comp.arch

    On 10/29/2025 5:26 PM, Robert Finch wrote:
    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
       8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
         Rarely reaches turbo
           pretty much only happens if just running a single thread...
         With all cores running stuff in the background:
           Idles around 3.6 to 3.8.

    Laptop:
       4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
         If power set to performance, reaches turbo a lot more easily,
           and with multi-core workloads.
         But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.


    IIRC, current PC was something like:
    CPU: $80 (Zen+; Zen 2 and 3 were around, but more expensive)
    MOBO: $60
    Case: $50
    ...

    Spent around $200 for 128GB of RAM.
    Could have gotten a cheaper 64GB kit had I known my MOBO would not
    accept a full 128GB (then could have had 96 GB).


    The RTX card I have (RTX 3060) has 12 GB of VRAM.

    IIRC, it was also about the cheapest semi-modern graphics card I could
    find at the time. Like, while I could have bought an RTX 4090 or similar
    at the time, I am not made of money.

    Like, a prior-generation mid-range card being the cheaper option.
    And, still newer than the GTX980 that had died on me (where the GTX980
    was itself second-hand).


    Before this, had been running a GTX 460, and before that, a Radeon HD
    4850 (IIRC).

    I think it was a case of:
    Had a Phenom II box, with the HD 4850;
    Switched to GTX 460, as I got one second-hand for free, slightly better;
    Replaced Phenom II board+CPU with FX-8350;
    Got GTX 980 (also second hand);
    Got Ryzen 7 2700X and new MOBO;
    Got RTX 3060 (as the 980 was failing).

    With the RTX 3060, had to go single-monitor, mostly as it only has
    DisplayPort outputs, and DP->HDMI->DVI via adapters doesn't seem to work (whereas HDMI->DVI did work via adapters).

    Well, also the RTX 3060 doesn't have a VGA output either (monitor would
    also accept VGA).

    Though, the current monitor I am using is newer and does support
    DisplayPort.


    I also managed to get a MultiSync CRT a while ago, but it only really
    gives good results at 640x480 and 800x600, 1024x768 sorta-works (but
    1280x1024 does not work), has a roughly 16" CRT or so; VGA input.

    I also have an LCD that goes up to 1280x1024, although it looks like
    garbage if set above 1024x768. Only accepts VGA.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 07:13:54 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that does not fit 40 bits.

    If you have that many bits available, do you still go for a load-store
    architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops excepting possibly atomic memory ops.>

    OK. Starting with 40 vs 32 bits, you have a factor of 1.25 disadvantage
    in code density to start with. Having memory operations could offset
    that by a certain factor; that was why I was asking.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.

    OK. That should also help for offsets in load/store.

    So, now there are 16, 56, 96 and 136-bit constants possible. The 56-bit constant likely has enough range for most 64-bit ops.

    For addresses, it will take some time for this to overflow :-)
    For floating point constants, that will be hard.

    I have done some analysis on frequency of floating point constants
    in different programs, and what I found was that there are a few
    floating point constants that keep coming up, like a few integers
    around zero (biased towards the positive side), plus a few more
    golden oldies like 0.5, 1.5 and pi. Apart from that, I found that
    different programs have wildly different floating point constants,
    which is not surprising. (I based that analysis on the grand
    total of three packages, namely Perl, gnuplot and GSL, so coverage
    is not really extensive).

    Otherwise using
    a 96-bit constant for 64-bit ops would leave the upper 32-bit of the constant unused.

    There are also 32-bit floating point constants, and 32-bit integers
    as constants. There are also very many small integer constants, but
    of course there also could be others.

    136 bit constants may not be implemented, but a size
    code is reserved for that size.

    I'm still hoping for good 128-bit IEEE hardware float support.
    POWER has this, but stuck it on their decimal float
    arithmetic, which is not high-performance...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 30 13:53:04 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:09:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on >> bit-set/clear for conditional branches. Might also include branch true / >> false.

    I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11 more bits
    of precision and thus took a chance of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    The problem arises due to a cross product of various {machine,
    language, compiler} features not working "all ends towards the middle".

    LLVM promotes FP calculations with a constant to 64-bits whenever the
    constant cannot be represented exactly in 32-bits. {Strike one}

    C makes no <useful> statements about precision of calculation control.
    {strike two}

    HW almost never provides mixed mode calculations which provide the
    means to avoid the double rounding. {strike three}

    So, technically, My 66000 does not provide general-mixed-mode FP,
    but I wrote a special rule to allow for larger constants used with
    narrower registers to cover exactly this case. {It also saves 2 CVT instructions (latency and footprint).}

    Seems like HW could detect the sequence and fuse the instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:10:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 30 12:29:39 2025
    From Newsgroup: comp.arch

    On 10/30/2025 11:10 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.


    Only really makes sense if one assumes these resources are "borderline
    free".

    If you are also paying for logic complexity and wires/routing, then
    having bigger registers just to typically waste most of them is not ideal.


    Granted, one could argue that most of the register is wasted when, say:
    Most integer values could easily fit into 16 bits;
    We have 64-bit registers.

    But, there is enough that actually uses the 64-bits of a 64-bit register
    to make it worthwhile. Would be harder to say the same for 128-bit
    registers.

    It is common on many 32-bit machines to use register pairs for 64-bit operations.


    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.


    Also questionable to read, as someone lacking much hardware that actually
    supports 256- or 512-bit AVX at the actual HW level. And, both AVX and
    AVX-512 had not exactly had clean roll-outs.


    Checks, and ironically, my recent super-cheap laptop was the first thing
    I got that apparently has proper 256-bit AVX support (still no AVX-512 though...).


    Still some oddities though:
    RAM that appears to be faster than it should be;
    The MHz and CAS latency appear abnormally high.
    They do not match the values for DDR4-2400.
    (Nor, even DDR4 in general).
    Appears to exceed expected bandwidth on memcpy test;
    ...
    Windows 11 on an unsupported CPU model;
    More so, Windows 11 Professional, also on something cheap.
    (Listing said it would come with Win10, got Win11 instead, OK).

    So, technically seems good, but also slightly sus...


    Differs slightly from what I was expecting:
    Something kinda old and not super fast;
    Listing said Windows 10, kinda expected Windows 10;
    ...

    Like, something non-standard may have been done here.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 16:46:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not IRSC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC, starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register into the double number of 32-bit
    registers; this idea can be extended to eliminate waste by having the
    quadruple number of 16-bit registers that can be joined into 32-bit
    and 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit
    registers. We can even resurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP. In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    convenient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 or 56 bits on a byte load by zero-extending or sign-extending the byte.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 17:58:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) at which icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 30 23:39:28 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995), which makes it only ~6 months
    ahead of VIS in UltraSPARC.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:00:50 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register in the double number of 32-bit registers; this idea can be extended to eliminate waste by having the quadruple number of 16-bit registers that can be joined into 32-bit
    anbd 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit registers. We can even ressurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Consider that being able to address every 2^(3+n) field of a register
    is far from free. Take a simple add of 2 bytes::

    ADDB R8[7], R6[3], R19[4]

    One has to individually align each of the bytes, which is going to blow
    out your timing for forwarding by at least 3 gates of delay (operands)
    and 4 gates for the result (register). The only way it makes "timing"
    sense is if you restrict the patterns to::

    ADDB R8[7], R6[7], R19[7]

    Where there is no "vertical" routing in obtaining operands and delivering results. {{OR you could always just eat a latency cycle when all fields
    are not the same.}}

    I also suspect that you would find few compiler writers willing to support random fields in registers.
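    Spelled out as ordinary 64-bit operations (a sketch of the data movement
    only, not a proposed encoding; field indices 0..7 assumed), the per-field
    alignment being discussed looks like:

      #include <stdint.h>

      /* What an ADDB Rd[d], Rs1[s1], Rs2[s2] style operation has to do:
         align each source byte field, add, then deposit the result byte
         into the destination -- the "vertical" routing in question. */
      static uint64_t addb(uint64_t rd, int d, uint64_t rs1, int s1,
                           uint64_t rs2, int s2)
      {
          uint8_t  a   = (uint8_t)(rs1 >> (8 * s1));     /* align operand 1 */
          uint8_t  b   = (uint8_t)(rs2 >> (8 * s2));     /* align operand 2 */
          uint8_t  sum = (uint8_t)(a + b);
          uint64_t m   = 0xFFull << (8 * d);
          return (rd & ~m) | ((uint64_t)sum << (8 * d)); /* deposit result  */
      }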

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    There are vanishingly few useful manipulations on parts of pointers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    conventient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 of 56 bits on a byte load by zero-extending or sign-extending the byte.

    But it gains the property that the whole register contains 1 proper value
    {range-limited to the container size whence it came}. This in turn makes
    tracking values easy--in fact, placing several different-sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    If your ISA has excellent support for statically positioned bit-fields
    (or, even better, dynamically positioned bit-fields), fetching the
    fields and depositing them back into containers does not add significant
    latency {volatile notwithstanding}, while poor ISA support does add
    significant latency.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:06:35 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some >>>> alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off >>> from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch >>target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?

    Consider trying to invalidate an ICache line--this requires looking
    at 2 DCache lines to see if they, too, need invalidation.

    Consider self-modifying code: the data stream overwrites an instruction,
    then later the FETCH engine runs over the modified line, but the modified
    line covers 64 bytes of the needed 80 bytes, so you take a hit and a miss on
    a single fetch.

    It also prevents SNARFing updates to ICache instructions, unless the
    SNARFed data is entirely retained in a single ICache line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 22:19:18 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4. The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructions in
    hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 31 00:57:42 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.


    Yes, those were in EV4.

    Alpha 21064 and Alpha 21064A HRM is here: https://github.com/JonathanBelanger/DECaxp/blob/master/ExternalDocumentation

    I didn't consider these instructions as SIMD. Maybe I should have.
    Looks like these instructions are intended to accelerate string
    processing. That's unusual for the first wave of SIMD extensions.

    The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 31 14:48:41 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.
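    For context, the kind of byte-parallel test such instructions accelerate can
    be written in portable C with the well-known SWAR zero-byte idiom (this is
    the generic trick, not the actual Alpha CMPBGE/ZAP sequence):

      #include <stdint.h>

      /* Nonzero if any byte of the 64-bit word w is zero. */
      static int has_zero_byte(uint64_t w)
      {
          return ((w - 0x0101010101010101ull) & ~w
                  & 0x8080808080808080ull) != 0;
      }

    A strlen or strcmp built on this scans 8 bytes per iteration and only drops
    to a byte loop when the test fires, which is presumably where the benchmark
    advantage came from.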

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 13:21:45 2025
    From Newsgroup: comp.arch

    On 10/31/2025 9:48 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.


    Most likely Dhrystone:
    It shows disproportionate impact from the relative speed of things like "strcmp()" and integer divide.


    I had experimented with special instructions for packed search, which
    could be used to help with either string compare or implementing
    dictionary objects in my usual way.


    Though, had later fallen back to a more generic way of implementing
    "strcmp()" that could allow more fair comparison between my own ISA and RISC-V. Where, say, one instead makes the determination based on how efficiently the ISA can handle various pieces of C code (rather than the
    use of niche instructions that typically require hand-written ASM or
    similar).



    Generally, it makes more sense to use helper instructions that have a
    general impact on performance, say for example, affecting how quickly a
    new image can be drawn into VRAM.

    For example, in my GUI experiments:
    Most of the programs are redrawing the screens as, say, 320x200 RGB555.

    Well, except ROTT, which uses 384x200 8-bit, on top of a bunch of code
    to mimic planar VGA behavior. In this case, for the port it was easier
    to write wrapper code to fake the VGA weirdness than to try to rewrite
    the whole renderer to work with a normal linear framebuffer (like what
    Doom and similar had used).


    In a lot of the cases, I was using an 8-bit indexed color or color-cell
    mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
    320x200: 16 bpp
    640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
    800x600: 2 or 4 bpp color-cell
    1024x768: 1 bpp monochrome, other experiments (*1)
    Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.

    Though, thus far the 1024x768 mode is still mostly untested on real
    hardware.

    Had experimented some with special instructions to speed up the indexed
    color conversion and color-cell encoding, but had mostly gone back and
    forth between using helper instructions and normal plain C logic, undecided
    on which exact route to take.

    Had at one point had a helper instruction for the "convert 4 RGB555
    colors to 4 indexed colors using a hardware palette", but this broke
    when I later ended up modifying the system palette for better results
    (which was a critical weakness of this approach). Also the naive
    strategy of using a 32K lookup table isn't great, as this doesn't fit
    into the L1 cache.
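    For scale, a hypothetical sketch of that naive table (pal_map and convert4
    are made-up names for illustration):

      #include <stdint.h>

      /* One palette index per RGB555 value: 32768 entries = 32 KB,
         which competes with everything else for a 16-32 KB L1 D-cache. */
      static uint8_t pal_map[1 << 15];

      static void convert4(const uint16_t src[4], uint8_t dst[4])
      {
          for (int i = 0; i < 4; i++)
              dst[i] = pal_map[src[i] & 0x7FFF];   /* drop any high bit */
      }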


    So, for 4 bpp color cell:
    Generally, each block of 4x4 pixels is understood as 2 RGB555 endpoints,
    and 2 selector bits per pixel. Though, in VRAM, 4 of these are packed
    into a logical 8x8 pixel block; rather than a linear ordering like in
    DXT1 or similar (specifics differ, but general concept is similar to DXT1/S3TC).

    The 2bpp mode generally has 8x8 pixels encoded as 1bpp in raster order
    (same order as a character cell, with MSB in top-left corner and LSB in lower-right corner). And, then typically 2x RGB555 over the 8x8 block.
    IIRC, had also experimented with having each 4x4 sub-block able to use a
    pair of RGB232 colors, but was harder to get good results.

    But, to help with this process, it was useful to have helper operations
    for, say (roughly sketched in C after this list):
    Map RGB555 values to a luma value;
    Select minimum and maximum RGB555 values for block;
    Map luma values to 1 or 2 bit selectors;
    ...
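    A rough C sketch of that per-block flow (hypothetical code, not the actual
    helper instructions; the luma weights and selector math are arbitrary
    illustrative choices):

      #include <stdint.h>

      struct cell4x4 { uint16_t c0, c1; uint32_t sel; };  /* 2 bits/pixel */

      static int luma555(uint16_t c)     /* rough luma from RGB555 fields */
      {
          int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
          return 2 * r + 5 * g + b;
      }

      static struct cell4x4 encode_cell(const uint16_t px[16])
      {
          struct cell4x4 cc;
          int ymin = 999, ymax = -1, imin = 0, imax = 0;

          for (int i = 0; i < 16; i++) {  /* endpoints = min/max luma pixels */
              int y = luma555(px[i]);
              if (y < ymin) { ymin = y; imin = i; }
              if (y > ymax) { ymax = y; imax = i; }
          }
          cc.c0 = px[imin];
          cc.c1 = px[imax];

          cc.sel = 0;
          for (int i = 0; i < 16; i++) {  /* map luma to a 2-bit selector */
              int y = luma555(px[i]);
              int s = (ymax > ymin) ? (4 * (y - ymin) - 1) / (ymax - ymin) : 0;
              if (s < 0) s = 0;
              if (s > 3) s = 3;
              cc.sel |= (uint32_t)s << (2 * i);
          }
          return cc;
      }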


    Internally, the GUI mode had worked by drawing everything to an RGB555
    framebuffer (~ 512K or 1MB) and then using a bitmap to track which
    blocks had been modified and need to be re-encoded and sent over to VRAM
    (partly by first flagging blocks during window redraw, then comparing with a
    previous version of the framebuffer to track which pixel-blocks actually
    differ, refining the selection of blocks that need re-encoding, and copying
    over blocks as needed to keep these buffers in sync).
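    A compact sketch of that tracking step (hypothetical names and sizes,
    assuming a 320x200 RGB555 shadow buffer and 8x8 blocks; the caller would
    clear dirty[] each frame):

      #include <stdint.h>
      #include <string.h>

      #define BLK_W 40                 /* 320/8 blocks per row  */
      #define BLK_H 25                 /* 200/8 block rows      */

      static uint8_t dirty[BLK_H][BLK_W];

      /* Mark blocks whose pixels differ from the previous frame; only these
         get re-encoded and sent over to VRAM. */
      static void mark_dirty(const uint16_t *cur, const uint16_t *prev, int pitch)
      {
          for (int by = 0; by < BLK_H; by++)
              for (int bx = 0; bx < BLK_W; bx++)
                  for (int y = 0; y < 8; y++) {
                      const uint16_t *c = cur  + (by * 8 + y) * pitch + bx * 8;
                      const uint16_t *p = prev + (by * 8 + y) * pitch + bx * 8;
                      if (memcmp(c, p, 8 * sizeof(uint16_t)) != 0) {
                          dirty[by][bx] = 1;
                          break;
                      }
                  }
      }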

    Process wasn't particularly efficient (and performance is considerably
    worse than what Win3.x or Win9x seemed to give).



    As for the packed-search instructions, there were 16-bit versions as
    well, which could be used either to help with UTF-16 operations or for dictionary objects.

    Where, a common way I implement dictionary objects is to use arrays of
    16-bit keys with 64-bit values (often tagged values or similar).
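    A minimal sketch of that layout (hypothetical names; the fixed capacity is
    arbitrary), where the inner loop is the scan a packed 16-bit search
    instruction would accelerate:

      #include <stdint.h>

      typedef struct {
          int      count;
          uint16_t key[64];   /* interned 16-bit symbol IDs (<= 65536 total) */
          uint64_t val[64];   /* 64-bit tagged values                        */
      } ObjDict;

      static int dict_lookup(const ObjDict *d, uint16_t k, uint64_t *out)
      {
          for (int i = 0; i < d->count; i++)
              if (d->key[i] == k) { *out = d->val[i]; return 1; }
          return 0;
      }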

    Though, this does put a limit on the maximum number of unique symbols
    that can be used as dictionary keys, but it is not often an issue in practice.
    Generally these are not QNames or C function names, which reduces the issue
    of running out of symbol names somewhat.

    One can also differ, though, on how much sense it makes to have
    ISA-level helpers for working with tagrefs and similar (or, getting the
    ABI involved with these matters, like defining in the ABI the encodings
    for things like fixnum/flonum/etc).

    ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 14:32:00 2025
    From Newsgroup: comp.arch

    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
       320x200: 16 bpp
       640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
       800x600: 2 or 4 bpp color-cell
      1024x768: 1 bpp monochrome, other experiments (*1)
        Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
    G R
    B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
    G R G B
    B G R G
    G R G B
    B G R G
    If >= 4 of the G bits are set, G is high.
    If >= 2 of the R bits are set, R is high.
    If >= 2 of the B bits are set, B is high.
    If > 8 bits in total are set, I is high.

    The non-set pixels are usually assumed to be either 0000 (Black) or 1000 (Dark
    Grey) depending on the I bit. Or, a low-intensity version of the main color
    if over 75% of a given bit are set in a given way (say, for mostly flat
    color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
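    A small C sketch of the recovery rule above (the IRGB bit assignment here is
    just an assumption for illustration, not the actual hardware behavior):

      #include <stdint.h>

      /* bits[y][x] is the 1bpp block; the G/R/B role of each position follows
         the 4x4 tiling shown above. Returns a 4-bit IRGB color for the block. */
      static uint8_t recover4x4(const uint8_t bits[4][4])
      {
          static const char role[4][4] = {
              {'G','R','G','B'},
              {'B','G','R','G'},
              {'G','R','G','B'},
              {'B','G','R','G'},
          };
          int g = 0, r = 0, b = 0, total = 0;

          for (int y = 0; y < 4; y++)
              for (int x = 0; x < 4; x++)
                  if (bits[y][x]) {
                      total++;
                      if      (role[y][x] == 'G') g++;
                      else if (role[y][x] == 'R') r++;
                      else                        b++;
                  }

          uint8_t irgb = 0;
          if (g >= 4)    irgb |= 0x4;   /* G high */
          if (r >= 2)    irgb |= 0x2;   /* R high */
          if (b >= 2)    irgb |= 0x1;   /* B high */
          if (total > 8) irgb |= 0x8;   /* I high */
          return irgb;
      }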


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:09:23 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)
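    A quick empirical illustration of the binary32-in-binary64 case (53 >= 2*24+2):
    a sketch, not a proof, assuming strict SSE-style float/double arithmetic with
    no x87 excess precision; the helper name is made up.

      #include <stdio.h>
      #include <stdlib.h>

      static float rand_float(void)     /* arbitrary float bit patterns */
      {
          union { unsigned u; float f; } v;
          v.u = ((unsigned)rand() << 16) ^ (unsigned)rand();
          return v.f;
      }

      int main(void)
      {
          for (long i = 0; i < 10000000; i++) {
              float a = rand_float(), b = rand_float();
              float d = a + b;                          /* one rounding     */
              float t = (float)((double)a + (double)b); /* round twice      */
              if (d != t && d == d && t == t)           /* ignore NaN cases */
                  printf("mismatch: %a + %a\n", a, b);
          }
          return 0;   /* nothing should ever print */
      }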

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that
    would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:12:45 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD
    extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.

    The original (v1?) Alpha had instructions intended to make it "easy" to process character data in 8-byte chunks inside a register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 1 18:19:48 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good >> to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 22)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 1 19:18:39 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    The 8086 is no 68000. The [BX] addressing mode makes it obvious that
    that's not the case.

    What is actually the case: AL-DL, AH-DH correspond to 8-bit registers
    of the 8080, some of AX-DX correspond to register pairs. SI, DI, BP
    are new, SP corresponds to the 8080 SP, which does not have 8-bit
    components. That's why SI, DI, BP, SP have no low or high
    sub-registers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    I think that we can learn a lot from earlier architectures, some
    things to adopt and some things to avoid. Concerning subregisters, I
    lean towards avoiding.

    That's also another reason to avoid load-and-op and RMW instructions.
    With a load/store architecture, load can sign/zero extend as
    necessary, and then most operations can be done at full width.

    But gains the property that the whole register contains 1 proper value >{range-limited to the container size whence it came} This in turn makes >tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value >analysis in the compiler.

    I don't think it's impossible or particularly hard for the compiler. Implementing it in OoO hardware causes complications, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 1 21:08:35 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 02:21:18 2025
    From Newsgroup: comp.arch

    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8,
    allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
      G R
      B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
      G R G B
      B G R G
      G R G B
      B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels are usually assumed to be either 0000 (Black) or 1000
    (Dark Grey) depending on the I bit. Or, a low-intensity version of the main
    color if over 75% of a channel's bits are set the same way (say, for mostly
    flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
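
    As a concrete illustration, here is a minimal C sketch of the threshold
    rule above (my reading of the description, not the original code; the bit
    numbering of the 16-bit block and the IRGB output nibble are assumptions,
    with bit 0 as the top-left pixel, scanning left-to-right, top-to-bottom):

    #include <stdint.h>

    static int popcnt16(uint16_t v)
    {
        int n = 0;
        while (v) { n += v & 1; v >>= 1; }
        return n;
    }

    /* Recover a 4-bit IRGB color for a 4x4 block dithered in the
       G R G B / B G R G pattern described above. */
    static uint8_t recover_rgbi_4x4(uint16_t blk)
    {
        int g = popcnt16(blk & 0xA5A5);   /* the 8 G positions */
        int r = popcnt16(blk & 0x4242);   /* the 4 R positions */
        int b = popcnt16(blk & 0x1818);   /* the 4 B positions */
        int total = popcnt16(blk);

        uint8_t out = 0;
        if (g >= 4)    out |= 0x4;        /* G high */
        if (r >= 2)    out |= 0x2;        /* R high */
        if (b >= 2)    out |= 0x1;        /* B high */
        if (total > 8) out |= 0x8;        /* I high */
        return out;
    }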


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
    Y R
    B G

    In this case tiling as:
    Y R Y R ...
    B G B G ...
    Y R Y R ...
    B G B G ...
    ...

    Where, Y is a pure luma value.
    May or may not use this, or:
    Y R B G Y R B G
    B G Y R B G Y R
    ...
    But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    This used a different (slightly more complicated) color recovery algorithm,
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma;
    ...


    Dealing with chroma does have the effect of making the dithering process
    more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The later image was mostly after I realized the issue with the
    dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).
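
    To make the dither tweak above concrete, a minimal C sketch of a 4x4
    ordered dither with the threshold matrix rotated per channel (an
    assumption about how the rotation is done; the actual encoder may
    handle the matrix differently):

    #include <stdint.h>

    /* Classic 4x4 Bayer threshold indices (0..15). */
    static const uint8_t bayer4[4][4] = {
        {  0,  8,  2, 10 },
        { 12,  4, 14,  6 },
        {  3, 11,  1,  9 },
        { 15,  7, 13,  5 },
    };

    /* Returns 1 if the bit for channel 'chan' (0..3) should be set at
       pixel (x,y), given an 8-bit channel value 'v'.  The matrix is
       rotated 90 degrees per channel so the patterns are decorrelated. */
    static int dither_bit(uint8_t v, int x, int y, int chan)
    {
        int dx = x & 3, dy = y & 3, t;
        switch (chan & 3) {
        case 0:  t = bayer4[dy][dx];          break;
        case 1:  t = bayer4[dx][3 - dy];      break;  /*  90 deg */
        case 2:  t = bayer4[3 - dy][3 - dx];  break;  /* 180 deg */
        default: t = bayer4[3 - dx][dy];      break;  /* 270 deg */
        }
        return v > t * 16;
    }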


    Image quality isn't great, but then again, not sure how to do that much
    better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible approach could be:
    Use LUT4 to map 4b -> 2b (as a count)
    Then, map 2x2b -> 3b (adder)
    Then, map 2x3b -> 4b (adder), then discard LSB.
    Then, select max or R/G/B/Y;
    This is used as an inverse normalization scale.
    Feed each value and scale through a LUT (for R/G/B)
    Getting a 5-bit scaled RGB;
    Roughly: (Val<<5)/Max
    Compose an RGB555 value (5 bits per channel) used for each pixel that is set.

    The actual pixel decoding process works the same as with 8x8 blocks of 1-bit monochrome, selecting the minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic
    complexity.
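
    A C model of the recovery pipeline sketched above (an assumption about
    intent, not the actual Verilog; the RGB555 bit layout is assumed, and
    the division would be a small LUT in hardware):

    #include <stdint.h>

    static int clamp31(int v) { return v > 31 ? 31 : v; }

    /* cnt_* are the per-channel set-bit counts over the block (0..16
       for 8x8).  The largest channel acts as the inverse normalization
       scale, so the recovered color is a direction scaled to RGB555. */
    static uint16_t recover_block_rgb555(int cnt_r, int cnt_g, int cnt_b, int cnt_y)
    {
        int max = cnt_r;
        if (cnt_g > max) max = cnt_g;
        if (cnt_b > max) max = cnt_b;
        if (cnt_y > max) max = cnt_y;
        if (max == 0)
            return 0;                         /* all-black block */

        int r = clamp31((cnt_r << 5) / max);  /* roughly (Val<<5)/Max */
        int g = clamp31((cnt_g << 5) / max);
        int b = clamp31((cnt_b << 5) / max);

        return (uint16_t)((r << 10) | (g << 5) | b);
    }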


    Pros/Cons:
    +: Looks better than per-pixel Bayer-RGB
    +: Looks better than 4x4 RGBI
    -: Would require more complex decoder logic;
    -: Requires specialized dither logic to not look like broken crap.
    -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    The RGBI approach seems intermediate, being more likely to decode grayscale
    patterns as gray.



    I guess a more open question is whether such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non-randomized) dither patterns, it can LZ compress OK (depending on the
    image, one can get 0.1 to 0.8 bpp, which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 11:36:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e. if
    there are any trailing 1 bits in the exact result, then a 1 is put in the
    ulp position.

    We have known since before the 1978 ieee754 standard that guard+sticky
    (plus sign and ulp) is enough to get the rounding correct in all modes.

    The single exception is when rounding up from the maximum-magnitude
    value to inf should be suppressed; there you do in fact need to check
    all the bits.
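
    A minimal C sketch of round-to-odd on an integer significand (an added
    illustration, not anyone's production code; the caller is assumed to
    keep the extra bits below the target precision in the low end of a
    64-bit value):

    #include <stdint.h>

    /* Drop 'extra_bits' low bits; if any of them were set, jam a 1 into
       the ULP of the kept value.  A later round-to-nearest of the
       shorter result then cannot double-round incorrectly. */
    static uint64_t round_to_odd(uint64_t sig, int extra_bits)
    {
        uint64_t kept    = sig >> extra_bits;
        uint64_t dropped = sig & ((1ULL << extra_bits) - 1);
        if (dropped != 0)
            kept |= 1;
        return kept;
    }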

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 15:56:12 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits, and sometimes also rounding
    bit (e.g. in the Wikipedia article), without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agrees.

    Shockingly, the absence of strict definitions applies even to the most
    widely referenced article, David Goldberg's "What Every Computer Scientist
    Should Know About Floating-Point Arithmetic". It seems people copy the name
    of the article from one another, but only a very small fraction of them
    have bothered to actually read it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 09:39:27 2025
    From Newsgroup: comp.arch

    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4, there is not
    enough room in the instruction for a large branch displacement (only 10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 10:06:42 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
      Y R
      B G

    In this case tiling as:
      Y R Y R ...
      B G B G ...
      Y R Y R ...
      B G B G ...
      ...

    Where, Y is a pure luma value.
      May or may not use this, or:
        Y R B G Y R B G
        B G Y R B G Y R
        ...
      But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma; ...


    Dealing with chroma does have the effect of making the dithering process more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color shifts). The later image was mostly after I realized the issue with the dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
      Use LUT4 to map 4b -> 2b (as a count)
      Then, map 2x2b -> 3b (adder)
      Then, map 2x3b -> 4b (adder), then discard LSB.
      Then, select max or R/G/B/Y;
        This is used as an inverse normalization scale.
      Feed each value and scale through a LUT (for R/G/B)
        Getting a 5-bit scaled RGB;
        Roughly: (Val<<5)/Max
      Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1 bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic complexity.


    Pros/Cons:
      +: Looks better than per-pixel Bayer-RGB
      +: Looks better than 4x4 RGBI
      -: Would require more complex decoder logic;
      -: Requires specialized dither logic to not look like broken crap.
      -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale patterns as gray.



    I guess a more open question is if such a thing could be useful (it is pretty far down the image-quality scale). But, OTOH, with simpler (non- randomized) dither patterns; it can LZ compress OK (depending on image,
    can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    The color component widths (up to 10 bits per component) are specified
    in CRs. Colors are passed around as 32-bit values for video processing.
    Using the colors directly is much easier than dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 16:09:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also rounding
    bit (e.g. in Wikipedia article) without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agree.

    Within the 754 working group the definition is totally clear:

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit; it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e. if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place) is the final mantissa bit.

    Sign is of course the sign in the Sign-Magnitude format used for all fp numbers.

    This means that those four bits in combination suffice to separate
    between rounding directions:

    Default rounding is round-to-nearest-even: (In this case Sign does not matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
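
    The table collapses to a one-line decision; a minimal C sketch matching
    the table above (an added illustration, not working-group text):

    /* ulp    = last kept mantissa bit
       guard  = first discarded bit
       sticky = OR of all bits after guard
       Returns 1 if round-to-nearest-even should increment the mantissa. */
    static int rne_round_up(int ulp, int guard, int sticky)
    {
        /* Round up when strictly above halfway (guard && sticky), or
           exactly halfway (guard && !sticky) with an odd ULP. */
        return guard && (sticky || ulp);
    }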

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 18:14:54 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names between
    current members of 754 working group. But nothing of that sort is
    mentioned in the text of the Standard. Which among other things means
    that you can not rely on being understood even by new members of 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 20:19:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if
    everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names between
    current members of 754 working group. But nothing of that sort is
    mentioned in the text of the Standard. Which among other things means
    that you can not rely on being understood even by new members of 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5. If you work with the binary
    representation for decimal, then you just need two extra bits, just like
    BFP.

    Correct rounding also works when Guard temporarily contains more than one
    bit, possibly due to normalization, but you would normally squash this
    down to (Guard, Sticky) by OR'ing any secondary guard bits into Sticky.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 14:58:36 2025
    From Newsgroup: comp.arch

    On 11/2/2025 9:06 AM, Robert Finch wrote:
    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the
    bits per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
       Y R
       B G

    In this case tiling as:
       Y R Y R ...
       B G B G ...
       Y R Y R ...
       B G B G ...
       ...

    Where, Y is a pure luma value.
       May or may not use this, or:
         Y R B G Y R B G
         B G Y R B G Y R
         ...
       But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits
    per channel, allowing for roughly a RGB333 color space (though, the
    vectors are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of
    chroma;
    ...


    Dealing with chroma does have the effect of making the dithering
    process more complicated. As noted, reliable recovery of the color
    vector is itself a bit fiddly (and is very sensitive to the encoder
    side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The later image was mostly after I realized the issue with
    the dither pattern, and modified how it was being handled (replacing
    the use of an 8x8 ordered dither with a 4x4 ordered dither, and then
    rotating the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that
    much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
       Use LUT4 to map 4b -> 2b (as a count)
       Then, map 2x2b -> 3b (adder)
       Then, map 2x3b -> 4b (adder), then discard LSB.
       Then, select max or R/G/B/Y;
         This is used as an inverse normalization scale.
       Feed each value and scale through a LUT (for R/G/B)
         Getting a 5-bit scaled RGB;
         Roughly: (Val<<5)/Max
       Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1
    bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and
    maximum values, vs full intensity and black, but this would add more
    logic complexity.


    Pros/Cons:
       +: Looks better than per-pixel Bayer-RGB
       +: Looks better than 4x4 RGBI
       -: Would require more complex decoder logic;
       -: Requires specialized dither logic to not look like broken crap.
       -: Doesn't give passable results if handed naive grayscale dithering.
    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale
    patterns as gray.



    I guess a more open question is if such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non- randomized) dither patterns; it can LZ compress OK (depending on
    image, can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted
    into a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more
    computationally expensive than just using a CRAM style codec (while
    also giving worse image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    Colors are broken into components specifying the number of bits per component (up to 10) in CRs. Colors are passed around as 32-bit values
    for video processing. Using the colors directly is much easier than
    dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.



    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (Eg: 4 bit/channel on the Nexys A7). And,
    2-bit for many VGA PMods (PMod allowing 8 IO pins, so RGB222+H/V Sync;
    or needing to use 2 PMOD connections for the VGA). The usual workaround
    was also to perform dithering while driving the VGA output (with ordered dither in the Verilog).

    But, yeah, even the theoretical framebuffer images generally look better
    than what one sees on actual monitors.

    Even then, modern LCD panels mostly can't display even full RGB24 color
    depth; more often it is 6-bit / channel or similar (then the panels
    dither for full 24). But, IIRC a lot of OLEDs are back up to full
    color-depth (but, OLEDs are more expensive and have often had
    notoriously short lifespans, ...).

    But, yeah, my current monitor seems to be LCD based.



    In my case, the video HW uses prefetch requests along a ring-bus, which
    goes to the L2 cache, and then to external RAM. It then works on the hope
    that the requests get around the bus and can be resolved in time.

    In this case, the memory works in a vaguely similar way to the CPU's L1
    caches (although with line-oriented access), and a module that
    translates this to color-values during screen refresh. General access
    pattern was built around "character cells".


    It can give stable results at 8MB/s to 16MB/s (with more glitches as it
    goes higher), but breaks down too much past this point.

    So, switching to a RAM-backed framebuffer didn't significantly increase
    the usable screen resolutions or color depths.

    Also, I am mostly limited to using either a 25 or 50 MHz pixel
    clock, so some timings were tweaked to fit this. These don't really fit
    standard VESA timings, but it seems like monitors can tolerate
    nonstandard timings, and are more limited by operating range.

    So, say:
    320x200 70Hz, 25MHz; 9MB/s @ 16bpp (hi-color)
    640x400 70Hz, 25MHz; 9MB/s @ 4bpp, 18 MB/s @ 8bpp
    640x480 60Hz, 50Mhz; 9MB/s @ 4bpp, 18 MB/s @ 8bpp
    800x600 72Hz, 50Mhz; 8.6 MB/s @ 2bpp, 17 MB/s @ 4bpp
    1024x768 48Hz, 50Mhz, 5MB/s @ 1bpp, 10MB/s @ 2bpp
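
    As a sanity check of the figures above, the refresh bandwidth is just
    width x height x bpp/8 x refresh (framebuffer traffic only, ignoring
    blanking and any cell/palette overhead); a trivial C helper:

    static unsigned long fb_bandwidth(unsigned w, unsigned h,
                                      unsigned bpp, unsigned hz)
    {
        return (unsigned long)w * h * bpp / 8 * hz;
    }
    /* e.g. fb_bandwidth(320, 200, 16, 70) == 8960000, about 9 MB/s. */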

    So, this implies that just running 1024x768 at 2bpp should be acceptable
    (even if it exceeds the usual 128K limit).


    Earlier on, I had 800x600/36Hz and 1024x768/25Hz modes; these would have
    allowed 8bpp color, but are below the minimum refresh rate of most
    monitors (it seems like VGA monitors don't like going below around 40Hz).


    Of these modes, 8bpp (Indexed color) is technically newest.
    Originally the graphics hardware was written for color-cell.

    Earliest design had 32-bit cells (for 8x8 pixels):
    10 bits: Glyph
    2x 6b color + Attrib (RGB222)
    2x 9b Color: RGB333

    It was later expanded first to 64b cells, then to 128b and 256b.
    Some control bits affect cell size.
    Also with the ability to specify 8x8 or 4x4 cells.
    Where, 4x4 cells reduce the effective resolution.
    In the bitmap modes:
    4x4 + 256b: 16bpp Hicolor
    4x4 + 128b: 8bpp Indexed
    4x4 + 64b: 4bpp RGBI (Alt2)
    8x8 + 256b: 4bpp RGBI (Alt1)
    8x8 + 128b: 2bpp (4-color, CGA-like)
    With a range of color palettes available (more than CGA).
    Black/White/Cyan/Magenta, Black/White/Red/Green, ...
    Black/White/DarkGray/LightGray, also with Green and Amber, ...
    8x8 + 64b: 1bpp (Monochrome)
    Can select between RGBI colors and some special sub-modes.
    The recent idea, if added to HW, would slot into this mode.
    The color-cell modes:
    8x8 + 256b: 4bpp (DXT1 like, 4x 4x4 cells per 256-bit cell)
    8x8 + 128b: 2bpp (2bpp cells)
    Each cell has 2x RGB555 colors, and 8x8x1 for pixel data
    Had experimented with 8x RGB232,
    didn't catch on (looked terrible).
    8x8 + 64b: Text-Mode + Early Graphics (4x4 cells)


    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    The 640x200 mode is the same as 640x400 (for VGA) but with the vertical resolution halved. The 320x200 mode also halves the horizontal
    resolution (so 40x25 cells).


    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to
    320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
    16bpp: pixels in raster order.
    8bpp: raster order, 32-bits per row
    4bpp: Raster order, 16-bits per row
    And, 8x8:
    4bpp: Takes 16bpp layout, splits each pixel into 2x2.
    2bpp: Takes 8bpp layout, splits each pixel into 2x2.
    1bpp: Raster order, 1bpp, but same order as text glyphs.
    With MSB in upper left, LSB in lower right.

    Can note that the 8x8x1b cells have the upper-left corner in the MSB.
    This differs from most other modes where the upper left corner is in the
    LSB (so, pixels flipped both horizontally and vertically).
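
    In other words, for the 8x8x1 cells the pixel-to-bit mapping is just
    (a minimal sketch of my reading, not the actual HW/Verilog):

    #include <stdint.h>

    /* x,y in 0..7; bit 63 of the 64-bit cell is the upper-left pixel,
       bit 0 the lower-right (reversed relative to the other modes). */
    static int cell1bpp_get(uint64_t cell, int x, int y)
    {
        return (int)((cell >> (63 - (y * 8 + x))) & 1);
    }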


    Can note that in this case, the video memory had several parts:
    VRAM / Framebuffer
    Note: Uses 64-bit word addressing.
    Font RAM: Stores character glyphs as 8x8 patterns.
    Originally, there was a FontROM, but I dropped this feature.
    This means BootROM needs to supply the initial glyph set.
    I went with 5x6 pixel cells in the ROM to save space.
    Where 5x6 does ASCII well enough.
    Palette RAM: Stores 256x 16-bits (as RGB555).

    Though, TestKern typically uses what is effectively color-cell graphics
    for the text mode (so, just draws 8x8 pixel blocks for the character
    glyphs).


    All this differs notably from CGA/EGA/VGA, which had used mostly raster-ordered modes. Except for the oddity of bit-planes for 16 color
    modes in EGA and VGA.


    I did experiment with raster-ordered modes, which worked by effectively
    stretching the character cell horizontally while reducing the vertical
    height to 1 pixel. Ended up not going with this as it was prone to a lot
    more glitches with the screen refresh (turned out to be a lot more
    sensitive to timing than the use of 8x8 or 4x4 cells).

    But, since generally programs don't draw directly into VRAM, the use of non-raster VRAM is mostly less of an issue.


    Well, apart from the computational cost of converting from internal
    RGB555 frame-buffers. Though, part of the reason RGB555 ended up being
    used so often was that it was faster to do RGB555 -> ColorCell encoding
    than 8-bit indexed color to color-cell, as indexed color typically also
    requires a bunch of palette lookups (which could end up more expensive
    than the additional RAM bandwidth from the RGB555).

    Also, there isn't really a "good and simple" way to generalize 8-bit
    colors in a way that leads to acceptable image quality. Invariably, one
    ends up needing palettes or encoding schemes that are slightly irregular.



    For color-cell, there are different approaches depending on how fast it
    needs to be:
    Faster: Simply select minimum and maximum luma (see the sketch after this list);
    Selector encoding is often via comparing against thresholds.
    Except on x86, where multiply+bias+shift is faster.
    Medium: Calculate along 4 axes in parallel;
    Select axis which gives highest contrast;
    Usually: Luma, Cyan, Magenta, Yellow.
    Adjust endpoints to better reflect standard deviation.
    Vs simply min/max.
    Slower:
    Calculate centroid and mass distribution and similar;
    Better quality, more for offline / batch encoding.
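
    A minimal C sketch of the "Faster" path above for a 4x4 cell (an
    illustration under assumptions, not the project's encoder; the 2/5/1
    luma weights and the selector bit order are made up for the example):

    #include <stdint.h>

    static int luma555(uint16_t c)
    {
        int r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
        return 2 * r + 5 * g + b;            /* cheap integer luma weight */
    }

    /* Pick the min/max-luma pixels as the two RGB555 endpoints, then a
       1-bit selector per pixel by comparing against the midpoint luma. */
    static void encode_cell_4x4(const uint16_t px[16],
                                uint16_t *c_min, uint16_t *c_max,
                                uint16_t *selectors)
    {
        int i, lo = 0, hi = 0;
        for (i = 1; i < 16; i++) {
            if (luma555(px[i]) < luma555(px[lo])) lo = i;
            if (luma555(px[i]) > luma555(px[hi])) hi = i;
        }
        *c_min = px[lo];
        *c_max = px[hi];

        int mid = (luma555(px[lo]) + luma555(px[hi])) / 2;
        uint16_t sel = 0;
        for (i = 0; i < 16; i++)
            if (luma555(px[i]) > mid)
                sel |= (uint16_t)(1u << i);  /* bit i -> use c_max */
        *selectors = sel;
    }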


    As noted, early on, I was mostly using real-time color-cell encoders for
    Doom and Quake and similar (hence part of why they were modified to use RGB555).

    Some of this is also related to the existence of a lot of RGB555-related
    helper ops. Though, early on, I had also used YUV655 as well, but RGB555
    mostly won out over YUV655 (even if it is easier to get a luma from
    YUV655 vs RGB555).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 16:56:05 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (Eg: 4 bit/channel on the Nexys A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so RGB222+H/V Sync;
    or needing to use 2 PMOD connections for the VGA). The usual workaround
    was also to perform dithering while driving the VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.

    Have you tried dithering based on the frame (temporal dithering vs
    spatial dithering)? The first frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high
    enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.

    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode 800x600 mode is used on my system, with 12x18 cells so
    that I can read the display at a distance (64x32 characters).

    The font then has 64 block graphic characters of 2x3 block. Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.
    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
      16bpp: pixels in raster order.
       8bpp: raster order, 32-bits per row
       4bpp: Raster order, 16-bits per row
    And, 8x8:
       4bpp: Takes 16bpp layout, splits each pixel into 2x2.
       2bpp: Takes  8bpp layout, splits each pixel into 2x2.
       1bpp: Raster order, 1bpp, but same order as text glyphs.
         With MSB in upper left, LSB in lower right.


    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 17:21:52 2025
    From Newsgroup: comp.arch

    On 11/2/2025 3:56 PM, Robert Finch wrote:
    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA
    DACs that generally fall below this (Eg: 4 bit/channel on the Nexys
    A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so
    RGB222+H/V Sync; or needing to use 2 PMOD connections for the VGA).
    The usual workaround was also to perform dithering while driving the
    VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.


    Never went up the learning curve for HDMI.
    Would likely need to drive the monitor outputs with SERDES or similar
    though.


    Have you tried dithering based on the frame (temporal dithering vs
    space-al dithering)? First frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.


    Temporal dithering seems to generate annoying artifacts on the monitors
    I tried it on; it tended to result in wavy/rippling artifacts.

    Likewise, PWM'ing the pixels also makes LCD monitors unhappy (rainbow
    banding artifacts), but seems to work OK on CRTs. I suspect it is an
    issue that the monitors expect a 25MHz pixel clock (when using 640x400
    or 640x480 timing) with an ADC that doesn't like sudden changes in level
    (say, if updating the pixels at 50MHz internally).


    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode 800x600 mode is used on my system, with 12x18 cells so that I can read the display at a distance (64x32 characters).

    The font then has 64 block graphic characters of 2x3 block. Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.

    I went with 80x25 as it is pretty standard;
    80x50 is also possible, but less standard.

    Though, Linux seems to often like using high-res text modes rather than
    the usual 80x25 or similar.

    As for 8x8 character cells:
    Also pretty standard, and fit nicely into 64 bits.



    In theory, for a text mode, could drive a monitor at 1280x400 with
    640x400 timings for 16x16 character cells, but LCD monitors don't like
    this sort of thing.


    Even at 640x400/70Hz timings, the monitor didn't consistently recognize
    it as 640x400, and would sometimes try to detect it as 720x400 or
    similar (which would look wonky).

    The other option being to output 640x480 and simply black-fill the extra
    lines (so, add 20 lines of black-fill at the top and bottom of the
    screen). Where, the monitors were able to more reliably detect 640x480/60Hz


    The main tradeoff is that mostly I have a limited selection of pixel
    clocks available:
    25, 50, maybe 100.

    Mostly because the pixel clocks are high enough and clock-edges
    sensitive enough where accumulation timers don't really work.

    Though, accumulation timers do work for driving an NTSC composite
    output. But, NTSC composite looks poor, can't even really do an 80x25
    text mode acceptably (if using colorburst); but can do 80x25 if one can
    accept black-and-white.

    Well, there was also component video, but this is basically the same as driving VGA (just with it being able to accept both NTSC and VGA
    timings; eg, 15 to 70 kHz for horizontal refresh, 40 to 90 Hz vertical,
    ...).

    Though, I no longer have the display that had component video inputs.


    In contrast, there is generally a very limited range of timings for
    composite or S-Video (generally, these don't accept VGA-like timings).
    Whereas VGA only really accepts VGA-like timings, and is unhappy if
    given NTSC timings (eg: 15 kHz horizontal refresh).


    Not sure why component video is seemingly the only "accepts whatever you
    throw at it" analog input (say, on a display with multiple input types
    and presumably similar hardware internally).


    Checking around, it is annoyingly hard to find plain LCD monitors with a
    component video input that are not also a full TV with a TV tuner (but, a
    little easier to find ones with both VGA and composite). The closest I can
    find are apparently intended mostly as CCTV monitors.


    But, mostly using VGA anyways, so...


    ...




    In this case, a 40x25 color-cell mode (with 256-bit cells) could be
    used for graphics (32K). Early on, this was used as the graphics mode
    for Doom and similar, before I later expanded VRAM to 128K and
    switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
       16bpp: pixels in raster order.
        8bpp: raster order, 32-bits per row
        4bpp: Raster order, 16-bits per row
    And, 8x8:
        4bpp: Takes 16bpp layout, splits each pixel into 2x2.
        2bpp: Takes  8bpp layout, splits each pixel into 2x2.
        1bpp: Raster order, 1bpp, but same order as text glyphs.
          With MSB in upper left, LSB in lower right.


    <snip>


    ...

    But, yeah, my makeshift graphics hardware is a little wonky.
    And, works in an almost entirely different way from the VGA style hardware.

    Ironically, software doesn't configure timings itself, but rather uses selector bits to control various properties:
    Base Resolution (640x400, 640x480, 800x600, ...);
    Character cell size in pixels (4x4 or 8x8);
    Settings to modify the number of horizontal and vertical cells relative
    to the base resolution;
    ...

    But, for the most part, had been using 640x400 or similar; with 800x600
    as more experimental (and doesn't look great with 2bpp cells).

    The 1024x768 mode had gone mostly unused, and is still untested on real hardware.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 3 15:22:44 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 3 11:53:48 2025
    From Newsgroup: comp.arch

    On 11/3/2025 9:22 AM, Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?


    I would assume he meant something like either the newer IEEE-754 decimal formats, or a decimal-FP format that MS had used in .NET, ...

    The IEEE formats generally use one of:
    A linear (binary integer) mantissa understood as decimal (BID); or
    Groups of 10 bits, each used to encode 3 digits (Densely Packed Decimal).
    Both use a power-of-10 exponent.

    The .NET format was similar, except using groups of 32 bits as linear
    values representing 9 digits.

    When I looked at it before, the most practical way for me to support
    something like this seemed to be to not do it directly in hardware, but
    to support a subset of operations:
    Operations to pack and unpack DPD into BCD;
    Say: 64 bit value holds 15 BCD digits, mapped to 50 bits of DPD.
    Some basic operations to help with arithmetic on BCD.

    I partly implemented these as an experiment before, but then noted I
    have basically no use case for Decimal-FP in my project.

    And, ironically, the main benefit the helpers would have provided would
    be to allow for faster Binary<->Decimal conversion. But, even that is
    debatable, as Binary<->Decimal conversion doesn't itself take enough CPU
    time to justify making it faster at the cost of needing to drag around
    BCD helper instructions.

    One downside is that there was no multiplier, so the BCD helpers would
    need to be used to effectively implement a Radix-10 Shift-and-Add.

    ...


    Though, it is debatable, something more like the .NET approach could
    make more sense for a SW implementation.

    If one wants to make the encoding use the bits more efficiently, a
    hybrid approach could make sense, say:
    Use 3 groups of 30 bits, plus another group of 20 bits (6 digits);
    Use a 17-bit linear exponent and a sign bit (3x30 + 20 + 17 + 1 = 128 bits).

    This would be slightly cheaper to implement vs what is defined in the
    standard (for the BID variant), and could achieve a similar effect
    (though, with 33 digits rather than 34).

    Internally, it could work similarly to the .NET approach, just with a
    little more up-front work to pack/unpack the 30-bit components. The merit
    of 30-bit groups is that they map internally onto 32-bit integer
    operations (which would also provide space internally for carry/borrow signaling in operations).

    Most CPUs at least have native support for 32-bit integer math, and for
    SW (on a 32/64 bit machine) this could be an easier chunking size than
    10 bits. Someone could argue for 60 bit chunking on a 64-bit machine
    (or, one 60 bit chunk, and a 50 bit chunk), but likely this wouldn't
    save much over 30 bit chunking.

    Also, 60-bit chunking would imply access to a 64*64->128 bit widening multiply; which is asking more than 32*32->64. And, also precludes some
    ways to more cheaply implement the divide/modulo step for each chunk
    (*). So, it is likely in this sense 30 bit chunks could still be preferable.

    *:
    high=product>>30;
    low=product-(high*1000000000LL);
    if(low>=1000000000)
    { high++; low-=1000000000; }
    Where, 60 bit chunking would require 128-bit math here.

    Where, effectively, the multiply step is operating in radix-1-billion.
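
    For comparison, a minimal C sketch of one radix-1-billion multiply step
    using 32-bit chunks (a generic illustration, not the code discussed
    above; here the per-chunk split uses an exact /10^9 and %10^9, which
    compilers turn into a multiply-by-reciprocal rather than the shift
    approximation in the footnote):

    #include <stdint.h>

    #define RADIX 1000000000ULL

    /* dst[0..n] = a[0..n-1] * b, where every chunk holds 9 decimal
       digits (i.e. is in [0, 10^9)).  Products fit in 64 bits since
       (10^9-1)^2 + carry < 2^63. */
    static void mul_chunks(uint32_t *dst, const uint32_t *a, int n, uint32_t b)
    {
        uint64_t carry = 0;
        for (int i = 0; i < n; i++) {
            uint64_t t = (uint64_t)a[i] * b + carry;
            dst[i] = (uint32_t)(t % RADIX);
            carry  = t / RADIX;
        }
        dst[n] = (uint32_t)carry;    /* carry < 10^9, fits one chunk */
    }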

    ...



    Still don't have much of a use-case though.

    In general, Decimal-FP seems more like a solution in search of a problem.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 3 18:47:36 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2 ; compare R1 with R2, result bits to R3
    BLT R3,displacement ; branch on the less-than bit, IP-relative
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:03:13 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely. My 66000 also has RNO; Round Nearest Random is defined but not
    yet available; Round Away from Zero is also defined and available.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:13:50 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4 there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    The VEC instruction (My 66000) provides a register that is used for
    the address of the top of the loop and the address of the VEC inst
    itself. So, when running in the loop, the LOOP instruction branches
    to the register value, and when taking an exception in the loop,
    the register leads back to the VEC instruction after the exception
    has been serviced.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    VEC-{ }-LOOP always saves at least 1 instruction per iteration.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    VEC does its own predictions. LOOP does not overrun the loop-count,
    so loop termination is not a pipeline flush.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)

    LDA Rd,[IP,displacement]

    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement

    But if you create "R3" from your VEC instruction, you KNOW that
    the compiler is only allowed to use "r3" as a branch target, and
    that "R3" is static over the duration of the loop, so you can get
    the reservation stations moving faster/easier.

    I have a "special" RS for the VEC-LOOP brackets.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 3 23:04:53 2025
    From Newsgroup: comp.arch

    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I
    use them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd
    rather not use the term 'guard' at all. Names like 'rounding bit'
    or 'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit
    if the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding the mantissa: IBM's DPD, which
    is a clever variation on base 1000, and Intel's binary (BID).
    DPD encoding is considered preferable for hardware implementations,
    while binary encoding is easier for software implementations.
    BCD is not an option; its information density is insufficient to
    supply the required semantics in the given size of container.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 08:50:25 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    No, I meant ieee754 DFP, where you either store the decimal digits in
    packed modulo-1000 groups, or as a binary mantissa with a decimal exponent/scaling value.

    When you do math with these you have to handle all the required
    (financial?) rounding modes.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 4 07:50:33 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the
    prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         9.6     8.0     9.5    23.1   38.6          Alpha 21264B 800MHz         ~2000
         4.7     8.1     9.5    19.0   21.3          Pentium III 1000MHz          2000
        18.4     8.5    10.3    24.5   29.0          Athlon 1200MHz               2000
         8.6    14.2    15.3    23.4   30.2          Pentium 4 2.26GHz            2002
        13.3    10.3    12.3    15.7   18.7          Itanium 2 (McKinley) 900MHz  2002
         5.7     9.2    12.3    16.3   17.9          PPC 7447A 1066MHz            2004
         7.8    12.8    12.9    30.2   39.0          PPC 970 2000MHz              2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with
    varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         4.9     5.6     4.3     5.1   7.64          Pentium M 755 2000MHz         2004
         4.4     2.2     2.0    20.3   18.6     3.3  Xeon E3-1220 3100MHz          2011
         4.0     2.3     2.3     4.0    5.1     3.5  Core i7-4790K 4400MHz         2013
         4.2     2.1     2.0     4.9    5.2     2.7  Core i5-6600K 4000MHz         2015
         5.7     3.2     3.9     7.0    8.6     3.7  Cortex-A73 1800MHz            2016
         4.2     3.3     3.2    17.9   23.1     4.2  Ryzen 5 1600X 3600MHz         2017
         6.9    24.5    27.3    37.1   33.5    36.6  Power9 3800MHz                2017
         3.8     1.0     1.1     3.8    6.2     2.2  Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in
    slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be a bad idea.

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 15:19:08 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 17:41:07 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    Decimal32 and Decimal64 would suffer from a similar mismatch, but those
    formats are probably not important. IMHO, IEEE defined them for the
    sake of completeness rather than because they are useful in the real
    world.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 07:47:50 2025
    From Newsgroup: comp.arch

    On 11/4/2025 7:19 AM, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    By "information density" I think he means that for almost any (I won't
    say any because there might be some edge cases where the isn't true)
    value, it takes fewer bits to represent in the IEEE scheme than in your beloved Burroughs Medium system's scheme. :-) Fewer bits per value
    means higher information density.

    Fewer bits means less less hardware, thus lower cost, less power
    required, etc.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 16:52:18 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It needs to be comparable to binary FP:

    A 64-bit double provides 53 mantissa bits, which corresponds to nearly
    16 decimal digits (53*log10(2) = 15.95), while fp128 gives us 113 bits,
    or a smidgen over 34 digits.

    The corresponding 128-bit DFP format also provides 34 decimal digits,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 cover roughly 2^-16k to 2^16k, corresponding
    to about 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than BFP.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 18:54:58 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15 exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 17:12:54 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:13:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to
    5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.

    Agreed.

    It is somewhat similar to the very old hex FP, which had a wider
    exponent range but more variable precision.

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 19:15:31 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers-playing program in 1974.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    We threw HW at the problem.

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         9.6     8.0     9.5    23.1   38.6          Alpha 21264B 800MHz         ~2000
         4.7     8.1     9.5    19.0   21.3          Pentium III 1000MHz          2000
        18.4     8.5    10.3    24.5   29.0          Athlon 1200MHz               2000
         8.6    14.2    15.3    23.4   30.2          Pentium 4 2.26GHz            2002
        13.3    10.3    12.3    15.7   18.7          Itanium 2 (McKinley) 900MHz  2002
         5.7     9.2    12.3    16.3   17.9          PPC 7447A 1066MHz            2004
         7.8    12.8    12.9    30.2   39.0          PPC 970 2000MHz              2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

        sub-             in-                  repl.
     routine  direct  direct  switch   call  switch  CPU                          year
         4.9     5.6     4.3     5.1   7.64          Pentium M 755 2000MHz         2004
         4.4     2.2     2.0    20.3   18.6     3.3  Xeon E3-1220 3100MHz          2011
         4.0     2.3     2.3     4.0    5.1     3.5  Core i7-4790K 4400MHz         2013
         4.2     2.1     2.0     4.9    5.2     2.7  Core i5-6600K 4000MHz         2015
         5.7     3.2     3.9     7.0    8.6     3.7  Cortex-A73 1800MHz            2016
         4.2     3.3     3.2    17.9   23.1     4.2  Ryzen 5 1600X 3600MHz         2017
         6.9    24.5    27.3    37.1   33.5    36.6  Power9 3800MHz                2017
         3.8     1.0     1.1     3.8    6.2     2.2  Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    VEC is the bracket at the top of a loop. VEC supplies a register
    which will contain the address of the instruction at the top of
    the loop, and a 21-bit vector used to specify those registers which
    are "Live" out of the loop. VEC is "executed" as the loop is entered
    and then not again until the loop is entered again.

    The LOOP instruction is the bottom bracket of the loop and performs
    the ADD-CMP-BC sequence as a single instruction. There are 3 flavors
    {counted, value terminated, counter value terminated} that use the
    3 registers similarly but differently.

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    With VEC-LOOP you are guaranteed that the branch and its target are
    100% correlated.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    Greater than 1 branch per FETCH latency.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    Agreed.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:16:59 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    I thought that was obvious:

    When you learned how to do decimal rounding back in your pen & paper
    math classes, you probably realized that for any calculation which could
    not be done exactly, you had to generate enough extra digits to be sure
    how to round.

    Those extra digits play exactly the same role as Guard + Sticky do in
    binary FP.
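
    A tiny sketch (mine, not Terje's code) of that idea for decimal
    round-to-nearest-even: one guard digit plus a sticky flag is all the
    extra state needed to decide whether the kept digits get incremented.

      #include <stdint.h>

      /* kept   = the decimal digits being retained, as an integer
         guard  = the first discarded digit, 0..9
         sticky = nonzero if any later discarded digit was nonzero */
      static uint64_t dec_round_nearest_even(uint64_t kept, unsigned guard, int sticky)
      {
          if (guard > 5 || (guard == 5 && (sticky || (kept % 10) & 1)))
              kept++;        /* caller renormalizes if this carries out */
          return kept;
      }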

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 4 21:07:43 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:44:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of
    1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium, I
    told them that I expected the P6 to employ eager execution, i.e. execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:52:46 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Several options, the easiest is of course a set of full forward/reverse
    lookup tables, but you can take advantage of the regularities by using
    smaller tables together with a little bit of logic.

    You also need a way to extract one or two digits from the top/bottom of
    each mod1000 container in order to handle normalization.

    For the Intel binary mantissa dfp128, normalization is the hard issue;
    Michael S has figured out some really nice tricks to speed it up, but
    when you have a (worst case) temporary 220+ bit product mantissa,
    scaling is not that easy.

    The saving grace is that almost all DFP calculations tend to employ
    relatively small numbers, mostly dfadd/dfsub/dfmul operations with fixed precision, and those will always be faster (in software) using the
    binary mantissa.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 22:51:28 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.
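
    A software rendering of exactly that lookup pattern (a sketch; the
    table contents themselves are assumed to be generated elsewhere from
    the IEEE 754 declet definition):

      #include <stdint.h>

      extern const uint16_t dpd2bcd[1024];   /* 10-bit declet -> 12-bit BCD    */
      extern const uint16_t bcd2dpd[4096];   /* 12-bit BCD    -> 10-bit declet */

      /* 5 declets (50 bits of DPD) -> 15 packed-BCD digits, and back. */
      static uint64_t dpd50_to_bcd(uint64_t dpd)
      {
          uint64_t bcd = 0;
          for (int i = 0; i < 5; i++)
              bcd |= (uint64_t)dpd2bcd[(dpd >> (10 * i)) & 0x3FF] << (12 * i);
          return bcd;
      }

      static uint64_t bcd_to_dpd50(uint64_t bcd)
      {
          uint64_t dpd = 0;
          for (int i = 0; i < 5; i++)
              dpd |= (uint64_t)bcd2dpd[(bcd >> (12 * i)) & 0xFFF] << (10 * i);
          return dpd;
      }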

    You still need to build 12-bit decimal ALUs to string together
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 15:46:06 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 00:44:18 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    I remember you relating this story about 6-8 years ago.

    As you said: "Never bet against branch prediction".

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 02:51:10 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    But you can be sure COBOL got them from assembly language programmers.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 4 23:43:48 2025
    From Newsgroup: comp.arch

    On 11/4/2025 4:51 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.


    In SW, you would still need to burn 16 bits per entry on the table, and possibly have code to fill in the tables (well, unless the numbers are expressed in code).


    A similar strategy is often used for sin/cos in many 90s era games,
    though the table is big enough that it would likely be impractical to
    type out by hand (or calculate using mental math).

    It is likely someone at ID Software or similar wrote out code at one
    point to spit out the sin+cos lookup table as a big blob of C (say,
    because an 8192 entry table is likely too big to be reasonable to type
    out by hand).


    Sometimes it is a question of where exactly the tradeoff lies in these
    cases: between typing plus mental math, and writing some code to spit
    out a table.

    For me, the tradeoff is often somewhere around 256 numbers, or less if
    the calculation is mentally difficult (namely, whether typing or
    calculating is the bottleneck).


    For DPD<->BCD, I would most likely resort to using code to generate
    the lookup table.

    Then again, it might depend a lot on the person...



    You still need to build 12-bit decimal ALUs to string together

    When I did it experimentally, I had done 16 BCD digits in 64 bits...

    The cost was slightly higher than that of a 64-bit ADD/SUB unit.

    Generally, it was combining the normal 4-bit CARRY4 style logic with
    some LUTs on the output side to turn it into a sort of BCD equivalent of
    a CARRY4.

    Granted, doing it with 3/6/9 digits would be cheaper than with 16 digits.


    Though, if doing it purely in software, it may make sense to go a different route:
    Map DPD to a linear integer between 0 and 999;
    Combine groups of 3 values into a 32 bit value;
    Work 32 bits at a time;
    Split back up to groups of 3 digits, and map back to DPD.

    Though, depends on the operation, for some it may be faster to operate
    in groups of 3 digits at a time (and sidestep the costs of combining or splitting the values).
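
    A sketch (mine) of the combine/split steps listed above, with each
    group already a linear 0..999 value:

      #include <stdint.h>

      /* three 0..999 groups <-> one 0..999999999 limb */
      static uint32_t groups3_to_limb(const uint16_t g[3])
      {
          return g[0] + 1000u * g[1] + 1000000u * g[2];
      }

      static void limb_to_groups3(uint32_t limb, uint16_t g[3])
      {
          g[0] = (uint16_t)(limb % 1000);  limb /= 1000;
          g[1] = (uint16_t)(limb % 1000);
          g[2] = (uint16_t)(limb / 1000);
      }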


    Then again, thinking about it, it is possible that for the Decimal128
    BID format, the mantissa could be broken up into smaller chunks (say, 9 digits) without the need for a full-width 128-bit multiply.

    In this case, could use a narrower multiply, and the "error" from the
    overflow would exist outside of the range of digits that are being
    worked on, so effectively becomes irrelevant for the operation in
    question (so, may be able to use 32 or 64 bit multiply, and 128-bit ADD).

    Granted, this is untested.

    Well, apart from how to recombine the parts without the need for wide multiply.

    In theory, could turn it into a big pile of shifts-and-add. Not sure if
    there is a good way to limit the number of shifts-and-adds needed. Well, unless turned into multiply-by-100 (3 shift 2 add) 4x times followed by multiply by 10 (1 shift 1 add), to implement multiply by 1 billion, but
    this also sucks (vs 13 shift 12 add).

    Hmm...


    Ironically, the DPD option almost looks preferable...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 05:17:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and
    switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.
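
    For reference, a minimal sketch of that labels-as-values dispatch in
    GNU C (my example, not code from the page): each virtual instruction
    is simply the address of a label, and NEXT jumps through the next one.

      #include <stdio.h>

      int main(void)
      {
          /* a tiny "program": two increments, a print, then halt */
          static void *prog[] = { &&op_inc, &&op_inc, &&op_print, &&op_halt };
          void **ip = prog;                /* threaded-code instruction pointer */
          long acc = 0;

      #define NEXT goto **ip++
          NEXT;

      op_inc:   acc++;                  NEXT;
      op_print: printf("%ld\n", acc);   NEXT;
      op_halt:  return 0;
      }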

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values
    feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:41:30 2025
    From Newsgroup: comp.arch

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random? How about round externally guided (RXG) by an
    input signal? For instance, the rounding could come from a feedback
    filter of some sort.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 06:44:54 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:47:56 2025
    From Newsgroup: comp.arch

    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

That was the name of the architecture, but I am being fickle and scrapping it, restarting from the Qupls2024 architecture and updating it to Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

Using 48-bit instructions now, so there is enough room for an 18-bit displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
AIPSI R3,1234       ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3        ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 22:53:49 2025
    From Newsgroup: comp.arch

    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is doing, you come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where, and who knows to what, and programmers just aren't able to follow code like that. BTDT.

    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    I am not saying it couldn't be used well. Just that it was often not,
    and when not, it caused a lot of problems.




    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto".

As did COBOL, where it is called GO TO ... DEPENDING ON, but those features didn't suffer the problems of assigned/alter gotos.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 06:55:49 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:00:32 2025
    From Newsgroup: comp.arch

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

Where, the idea was that the state machine is updated with the current state and branch direction, giving the next state and the next predicted branch direction (for this state).
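
As a rough C sketch of that indexing scheme (a single shared history
register stands in for the per-branch local history, and the plain
2-bit saturating counter stands in for the 5/6-bit FSM; the table size
is an arbitrary choice here):

  #include <stdint.h>

  #define PRED_BITS 12
  #define PRED_SIZE (1u << PRED_BITS)

  static uint8_t  ctr[PRED_SIZE];   /* 2-bit saturating counters, 0..3 */
  static uint32_t hist;             /* recent taken/not-taken history  */

  int predict(uint32_t pc)
  {
      uint32_t idx = (pc ^ hist) & (PRED_SIZE - 1);  /* history XOR PC */
      return ctr[idx] >= 2;                          /* 2,3 => taken   */
  }

  void update(uint32_t pc, int taken)
  {
      uint32_t idx = (pc ^ hist) & (PRED_SIZE - 1);
      if (taken) { if (ctr[idx] < 3) ctr[idx]++; }
      else       { if (ctr[idx] > 0) ctr[idx]--; }
      hist = (hist << 1) | (taken & 1);              /* shift in outcome */
  }

Swapping the 2-bit counter for the 5/6-bit FSM described above only
changes the per-entry update; the hashing stays the same.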


Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes with 2-bit state were preferable to a simpler lookup scheme with 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (is a bit easier if limiting the patterns to 3 bits).



    Then again, had before noted that the LLMs are seemingly also not really
    able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could
    have done similar.

Well, unless maybe the argument is that 5 or 6 bits of storage would cost more than 2 bits, but then presumably the significantly larger tables needed (to compensate for the relative predictive weakness of 2-bit state) would have cost more than smaller tables of 6-bit state?...

    Say, for example, 2b:
    00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
    00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
    01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
    01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
    10_0 => 10_0 //strongly not taken, dir=0
    10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
    11_0 => 01_1 //strongly taken, dir=0
    11_1 => 11_1 //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
    As above, and 4-more alternating states
    And slightly different transition logic.
    Say (abbreviated):
    000 weak, not taken
    001 weak, taken
    010 strong, not taken
    011 strong, taken
    100 weak, alternating, not-taken
    101 weak, alternating, taken
    110 strong, alternating, not-taken
    111 strong, alternating, taken
The alternating states just flip-flop between taken and not taken.
The weak states can move between any of the 4.
The strong states are used if the pattern is reinforced.

Going up to 3-bit patterns is more of the same (add another bit, doubling the number of states). Something seems to go wrong when getting to 4-bit patterns though (one can't fit both weak and strong states for the longer patterns, so the 4-bit patterns effectively only exist as weak states which partly overlap with the weak states for the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit
    state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 02:06:50 2025
    From Newsgroup: comp.arch

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible.  A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction?  (That
    may have been mentioned upthread, in that case I don't remember).

That was the name of the architecture, but I am being fickle and scrapping it, restarting from the Qupls2024 architecture and updating it to Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

Using 48-bit instructions now, so there is enough room for an 18-bit displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
AIPSI R3,1234       ; add displacement to IP and store in R3 (hoist-able)
BLT R1,R2,R3        ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

Qupls2024 also used 8-bit register specs. This was a bit of overkill and not really needed. Register specs are reduced to 6 bits. Right away, that reduced most instructions by eight bits.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 07:13:46 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:38:30 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring
    of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of
    MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its
    architecture.

    Before the start of that briefing I suggested that I should start off
    on the blackboard by showing what I had been able to figure out on my
    own, then I proceeded to pretty much exactly cover every single
    feature on the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I
    told them that I expected the P6 to employ eager execution, i.e
    execute both ways of one or two layers of branches, discarding the
    non-taken paths as the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power
    viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
      weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.

Where, the idea was that the state machine is updated with the current state and branch direction, giving the next state and the next predicted branch direction (for this state).


Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes with 2-bit state were preferable to a simpler lookup scheme with 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (is a bit easier if limiting the patterns to 3 bits).



    Then again, had before noted that the LLMs are seemingly also not really able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.



    Errm...

    I just decided to test it, and it appears Grok was able to figure it out
    (more or less).

This is concerning: either the AIs are getting smart enough to deal with semi-difficult problems, or in fact it is not difficult and I was just dumb for thinking there was any difficulty in working out the state tables for the longer patterns.

    I tried before with DeepSeek R1 and similar, which had failed.



    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could have done similar.

Well, unless maybe the argument is that 5 or 6 bits of storage would cost more than 2 bits, but then presumably the significantly larger tables needed (to compensate for the relative predictive weakness of 2-bit state) would have cost more than smaller tables of 6-bit state?...

    Say, for example, 2b:
     00_0 => 10_0  //Weakly not-taken, dir=0, goes strong not-taken
     00_1 => 01_0  //Weakly not-taken, dir=1, goes weakly taken
     01_0 => 00_1  //Weakly taken, dir=0, goes weakly not-taken
     01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken
     10_0 => 10_0  //strongly not taken, dir=0
     10_1 => 00_0  //strongly not taken, dir=1 (goes weak)
     11_0 => 01_1  //strongly taken, dir=0
     11_1 => 11_1  //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
      As above, and 4-more alternating states
      And slightly different transition logic.
    Say (abbreviated):
      000   weak, not taken
      001   weak, taken
      010   strong, not taken
      011   strong, taken
      100   weak, alternating, not-taken
      101   weak, alternating, taken
      110   strong, alternating, not-taken
      111   strong, alternating, taken
The alternating states just flip-flop between taken and not taken.
The weak states can move between any of the 4.
The strong states are used if the pattern is reinforced.

Going up to 3-bit patterns is more of the same (add another bit, doubling the number of states). Something seems to go wrong when getting to 4-bit patterns though (one can't fit both weak and strong states for the longer patterns, so the 4-bit patterns effectively only exist as weak states which partly overlap with the weak states for the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient state-
    space for the looping 5-bit patterns, there may not be sufficient state- space to distinguish whether to move from a mismatched 4-bit pattern to
    a 3 or 5 bit pattern. Whereas, at least with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc. One needs to be
    able to express decay both to shorter patterns and to longer patterns,
    and I suspect at this point, the pattern breaks down (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    But, alas, sometimes I wonder if I am just kinda stupid and everyone
    else has already kinda figured this out, but doesn't say much...

    Like, just smart enough to do the things that I do, but not so much otherwise... In theory, I am kinda OK, but often it mostly seems like I
    mostly just suck at everything.



    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 02:01:35 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.


I usually used call threading, because:
  In my testing it was one of the faster options
    (at least if excluding 32-bit x86, which often has slow function
    calls, because pretty much every function needs a stack frame, ...);
  It is usable in standard C.

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.
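
Roughly the shape of the trace-based call threading described above
(names and details are made up here, just to illustrate the idea):

  typedef struct VM VM;
  typedef struct Trace Trace;
  typedef Trace *(*TraceFn)(VM *vm, Trace *self);

  struct Trace {
      TraceFn run;     /* unrolled body: calls each opcode function,
                          then returns the next trace to execute */
  };

  struct VM { long regs[32]; Trace *entry; };

  /* main dispatch loop: one indirect call per trace */
  void interp(VM *vm)
  {
      Trace *t = vm->entry;
      while (t != NULL)
          t = t->run(vm, t);
  }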


    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    But, if you use it, you are basically stuck with GCC...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:18:50 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128 normalization is the hard issue, Michael S have figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    How they say? Sometimes a division is just a division.
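
For illustration (my sketch, not Michael's code): since 10**k = 5**k *
2**k and 5**27 still fits in 64 bits, scaling down by up to 27 decimal
digits needs only a division by a 64-bit constant plus a shift (this
uses the unsigned __int128 compiler extension):

  #include <stdint.h>

  /* 5^k for k = 0..27; 5^27 = 7450580596923828125 < 2^64 */
  static const uint64_t pow5[28] = {
      1ull, 5ull, 25ull, 125ull, 625ull, 3125ull, 15625ull, 78125ull,
      390625ull, 1953125ull, 9765625ull, 48828125ull, 244140625ull,
      1220703125ull, 6103515625ull, 30517578125ull, 152587890625ull,
      762939453125ull, 3814697265625ull, 19073486328125ull,
      95367431640625ull, 476837158203125ull, 2384185791015625ull,
      11920928955078125ull, 59604644775390625ull, 298023223876953125ull,
      1490116119384765625ull, 7450580596923828125ull
  };

  /* x / 10^k for 0 <= k <= 27: divide by the 64-bit constant 5^k,
     then shift right by k (floor(floor(x/a)/b) == floor(x/(a*b))). */
  unsigned __int128 div_pow10(unsigned __int128 x, unsigned k)
  {
      return (x / pow5[k]) >> k;
  }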

    but when you have a (worst case) temporary 220+ bit product mantissa, scaling is not that easy.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:21:32 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

Are you talking about hardware or software?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:25:45 2025
    From Newsgroup: comp.arch

    On 2025-11-05 2:13 a.m., Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)

I like that IBM packing method.

    I have some RTL code to pack and unpack modulo 1000 to BCD. I think it
    is fast and small enough that it can be used inline at the input and
    output of DFP operations. The DFP values can then be passed around in
    the CPU as 128-bit values instead of the expanded BCD value.

Only 128-bit DFP is supported on my machine, under the assumption that one wants the extended decimal precision for engineering / finance. Otherwise, why would one use it? Better to use BFP.

One headache I have not yet worked out is how to convert between DFP and BFP in a sensible fashion. I have tried a couple of approaches using log/exp type functions, but the results are way off. I suppose I could rely on conversions to and from text strings.
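
A minimal sketch of the text-string route, assuming the DFP value has
already been unpacked into sign, integer coefficient and decimal
exponent (and, purely for illustration, that the coefficient fits in 64
bits; a real Decimal128 coefficient of up to 34 digits needs a wider
to-decimal-string step):

  #include <stdio.h>
  #include <stdlib.h>
  #include <inttypes.h>

  double dfp_to_double(int sign, uint64_t coeff, int dexp)
  {
      char buf[64];
      /* print the value as "[-]coeffEexp" ... */
      snprintf(buf, sizeof buf, "%s%" PRIu64 "E%d",
               sign ? "-" : "", coeff, dexp);
      /* ... and let strtod do the decimal-to-binary rounding
         (correctly rounded on good libc implementations) */
      return strtod(buf, NULL);
  }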


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:27:48 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC >>>> acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    But you can be sure COBOL got them from assembly language programmers.

Back before caches and branch predictors, my fastest word count (wc) asm program employed runtime code generation: it started by filling a 64kB segment with code snippets aligned every 128 bytes. The even blocks were for scanning outside a word and the odd blocks were used when a word start had been found; each snippet would load the next byte into BH and jump to BX. (BL contained the outside/inside flag value as 0/128.)

    Fast forward a few years and a branchless data state machine ran far
    faster, culminating at (a measured) 1.5 clock cycles/byte on a Pentium.
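
(For readers who haven't seen the trick: the C shape of a branchless
word counter looks something like the sketch below; this is not Terje's
asm or his exact state-machine formulation, just the general idea of
keeping the in-word/out-of-word state as data.)

  #include <ctype.h>
  #include <stddef.h>

  size_t word_count(const unsigned char *buf, size_t len)
  {
      size_t words = 0;
      int prev = 0;                        /* 1 if previous byte was in a word */
      for (size_t i = 0; i < len; i++) {
          int cur = !isspace(buf[i]);      /* 1 if this byte is in a word */
          words += (size_t)(cur & ~prev & 1);  /* count word starts, no branch */
          prev = cur;
      }
      return words;
  }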

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:42:37 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128 normalization is the hard issue,
    Michael S have figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    How they say? Sometimes a division is just a division.

    I suspect that a model using pre-calculated reciprocals which generate
    ~10+ approximate digits, back-multiply and subtract, repeat once or
    twice, could perform OK.

Having full ~225-bit reciprocals in order to generate the exact result in a single iteration would require 256-bit storage for each of them, and the 256x256->512 MUL would use 16 64x64->128 MULs. But here we do have the possibility of starting from the top: as soon as you get the high 128 bits of the mantissa fixed (modulo any propagating carries from lower down), you could inspect the preliminary result and see that it would usually be far enough away from a tipping point that you could stop there.
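
A 64-bit toy of the back-multiply-and-subtract step (just to show the
shape; the dfp128 case would do this with multi-word values and a much
wider reciprocal):

  #include <stdint.h>

  /* q = x / d for d >= 2, given a precomputed recip = floor(2^64 / d).
     The estimate undershoots by at most a small amount, so a short
     correction loop (the back-multiply and subtract) finishes it. */
  uint64_t div_by_recip(uint64_t x, uint64_t d, uint64_t recip)
  {
      uint64_t q = (uint64_t)(((unsigned __int128)x * recip) >> 64);
      uint64_t r = x - q * d;           /* back-multiply and subtract */
      while (r >= d) { r -= d; q++; }   /* at most a couple of rounds */
      return q;
  }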

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:56:12 2025
    From Newsgroup: comp.arch

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 17:26:44 2025
    From Newsgroup: comp.arch

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement
    where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Nov 5 10:49:10 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton

    For a code analysis, an assigned goto, aka label variables,
    looks equivalent to:
    - make a list of all the target labels assigned to each label variable
    - at each "goto variable" substitute a switch statement with that list

    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:15:00 2025
    From Newsgroup: comp.arch

    On 11/5/2025 3:21 AM, Michael S wrote:
    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

Are you talking about hardware or software?


    I had interpreted it as being about software with BCD helper ops.

    Otherwise, would probably go a different route.

    One other tradeoff is whether to go for Decimal128 in DPD or BID.

Stuff online says BID is better for a software implementation, but I am having doubts. It is possible that DPD could make more sense in both cases, though in the absence of BCD helpers it likely makes sense to map DPD to linear 10-bit values.

While BID could make sense, it has the drawback of assuming some way of quickly performing power-of-10 multiplies on large integer values. If you have a CPU where the fastest way to perform a generic 128-bit multiply is to break it down into 32-bit multiplies, and/or use shift-and-add, it is not a particularly attractive option.

    Contrast, working with 16-bit chunks holding 10 bit values is likely to
    work out being cheaper.
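
For example, once the declets are unpacked to plain 0..999 values (one
per 16-bit lane), per-lane arithmetic is just mod-1000 with a ripple
carry; a minimal sketch (lane count chosen for illustration, Decimal128
proper being 11 declets plus one leading digit):

  #include <stdint.h>

  #define NLANES 12

  /* add two coefficients held as little-endian arrays of 0..999 lanes;
     returns the carry out of the top lane */
  int coeff_add(uint16_t dst[NLANES],
                const uint16_t a[NLANES], const uint16_t b[NLANES])
  {
      unsigned carry = 0;
      for (int i = 0; i < NLANES; i++) {
          unsigned s = a[i] + b[i] + carry;     /* at most 1999 */
          carry = (s >= 1000);
          dst[i] = (uint16_t)(s - (carry ? 1000 : 0));
      }
      return (int)carry;
  }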

    Despite BID being more conceptually similar to Binary128, they differ in
    that Binary128 would only need to use large-integer multiply sparingly (namely, for multiply operations).



    Though, likely fastest option would be to map the DPD values to 30-bit
    linear values, then internally use the 30-bit linear values, and convert
    back to DPD at the end. Though, the performance of this is likely to
    depend on the operation.

    A non-standard variant, representing the value as packed 30 bit fields,
    could likely be the fastest option. Could use the same basic layout as
    the existing Decimal128 format.


So, my guess for a performance ranking, fast to slow:
    1: Dense packed, 30b linear, 30+30+30+20+digit
    2: DPD
    3: BID


    As for whether or not to support Decimal128 (in either form), dunno.

    Closest I have to a use-case is that well, technically there is a
    _Decimal128 type in C, and it might make sense for it to be usable.

    But, then one needs to decide on which possible format to use here.
    And, whether to aim for performance or compatibility.


    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:23:16 2025
    From Newsgroup: comp.arch

    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

       [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be
    said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.


    So, yeah, most likely UB, of a "particularly destructive" / "unlikely to
    be useful" kind.


    FWIW:
    This was not a feature that I feel inclined to support in BGBCC...


    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 5 17:22:48 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

    <computed goto>

    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    In my experience, longjmp is far faster than e.g. C++ exceptions.

    Granted, the code needs to be designed to allow longjmp without
    orphaning or leaking memory (i.e. in a context where there isn't any
    dynamic memory allocation) for the best speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 18:03:31 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too?

    Yes.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer, and possibly
    (if you want to catch GOTO when no variable has been assigned)
    a second variable.

But it adds work for compiler writers - additional effort, warnings, testing, ...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 21:30:11 2025
    From Newsgroup: comp.arch

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:30:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

Actually, for the five required basic operations, you can always do the op in the next higher precision, then round again down to the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random?

    Another unbiased rounding mode. Not yet available because I don't have
    a truly random source to guide the rounding.

    How about round externally guided (RXG) by an
    input signal?

    I guess that would be OK, but you could not make the statement that
    the rounding mode was unbiased.

    For instance, the rounding could come from a feedback
    filter of some sort.

Sure, it is just that you can't state "unbiased".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:43:58 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:
    ---------------

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

    Where the idea was that the state machine is updated with the current
    state and branch direction, giving the next state and the next predicted
    branch direction (for this state).


    This could model slightly more complex patterns than the 2-bit saturating
    counters, but it is something of a mystery why (for mainstream processors)
    more complex lookup schemes with 2-bit state were preferable
    to a simpler lookup scheme with 5-bit state.
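
    (For concreteness, a minimal sketch of that indexing scheme, with the
    per-entry state shown as the familiar 2-bit saturating counter; the 5/6-bit
    FSM variant would replace the counter update with a lookup in a hand-built
    transition table.  Table sizes and history width here are illustrative.)

    #include <stdint.h>
    #include <stdbool.h>

    #define PRED_BITS 12
    #define PRED_SIZE (1u << PRED_BITS)
    #define HIST_BITS 8

    static uint8_t state[PRED_SIZE];     /* 2-bit saturating counters, 0..3  */
    static uint8_t history[PRED_SIZE];   /* per-branch local taken/not-taken */

    static uint32_t pred_index(uint32_t pc)
    {
        uint32_t slot = (pc >> 2) & (PRED_SIZE - 1);
        return (slot ^ history[slot]) & (PRED_SIZE - 1);
    }

    bool predict(uint32_t pc)
    {
        return state[pred_index(pc)] >= 2;   /* 2,3 = taken; 0,1 = not taken */
    }

    void update(uint32_t pc, bool taken)
    {
        uint32_t slot = (pc >> 2) & (PRED_SIZE - 1);
        uint32_t idx  = (slot ^ history[slot]) & (PRED_SIZE - 1);

        if (taken  && state[idx] < 3) state[idx]++;   /* saturate upward     */
        if (!taken && state[idx] > 0) state[idx]--;   /* saturate downward   */

        history[slot] = (uint8_t)(((history[slot] << 1) | taken)
                                  & ((1u << HIST_BITS) - 1));
    }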

    In 1991 Mike Shebanow, Tse-Yu Yeh, and I tried out a Correlation predictor where strings of {T, !T}** were pattern matched to create a prediction.
    While it was somewhat competitive with Global History Table, it ultimately failed.

    I am now working on predictors for a 6-wide My 66000 machine--which is a bit different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.
    c) Jump Through Table is not predicted through jump-indirect, table-like
    prediction; what is predicted is the value (the switch variable), and this
    is used to index the table (early)
    d) CMOV gets rid of another 8%

    These strip out about 40% of branches from needing prediction, leaving
    the remaining branches harder to predict but with less total
    latency in execution.

    -----------------
    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).

    Tried some of these (1991) mostly with little to no success.
    Be my guest and try again.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:52:22 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right-away that reduced most instructions eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported, which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    May I humbly suggest this is the wrong direction.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:53:59 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    I strongly suspect that IBM is doing something similar :-)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:04:57 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform these others, sometimes by a lot.


    I usually used call threading, because:
    In my testing it was one of the faster options;
    At least if excluding 32-bit x86,
    which often has slow function calls.
    Because pretty much every function needs a stack frame, ...
    It is usable in standard C.

    I have converged on call-threading as a way to eliminate "if-statements":
    -----------------------
    // coreStack, Context, Major, OpRoute, iorTable and the uadd..smin, or,
    // xor, and, cmp operations are assumed to come from the surrounding
    // interpreter, all ops having the same signature as operation().
    extern uint64_t operation( uint64_t src1, uint64_t src2, uint8_t size );

    static uint64_t (*int2op[32])( uint64_t src1, uint64_t src2, uint8_t size ) =
    {   // integer 2-operand decoding table
    /* 00 */ operation,
    /* 01 */ operation,
    /* 02 */ uadd,
    /* 03 */ sadd,
    /* 04 */ umul,
    /* 05 */ smul,
    /* 06 */ udiv,
    /* 07 */ sdiv,
    /* 10 */ cmp,
    /* 11 */ operation,
    /* 12 */ operation,
    /* 13 */ operation,
    /* 14 */ umax,
    /* 15 */ smax,
    /* 16 */ umin,
    /* 17 */ smin,
    /* 20 */ or,
    /* 21 */ operation,
    /* 22 */ xor,
    /* 23 */ operation,
    /* 24 */ and,
    /* 25 */ operation,
    /* 26 */ operation,
    /* 27 */ operation,
    /* 30 */ operation,
    /* 31 */ operation,
    /* 32 */ operation,
    /* 33 */ operation,
    /* 34 */ operation,
    /* 35 */ operation,
    /* 36 */ operation,
    /* 37 */ operation
    };

    /*
     * Integer 16-bit-Immediate Table Caller
     */
    bool intimm16( coreStack *cpu, Context *c, Major I )
    {
        uint64_t  src1 = c->ctx.reg[ I.src1 ],
                  src2 = c->ctx.reg[ I.src2 ],
                 *dst  = &c->ctx.reg[ I.dst ];
        *dst = int2op[ (I.major&15)<<1 ]( src1, src2, 0 );
        return true;
    }

    /*
     * Integer 2-Operand Table Caller
     */
    bool int2oper( coreStack *cpu, Context *c, OpRoute I )
    {
        uint8_t   or   = I.or,                 // operand routing field
                  s    = I.size;
        uint64_t  src1 = c->ctx.reg[ I.src1 ],
                  src2 = c->ctx.reg[ I.src2 ],
                 *dst  = &c->ctx.reg[ I.dst ];
        iorTable[ or ]( *c, I, &src1, &src2 ); // routing may substitute constants
        *dst = int2op[ I.minor ]( src1, src2, s );
        return true;
    }
    -----------------------

    One does not have to check for unimplemented instructions; just place
    a call to the operation() subroutine where they are not defined. The
    operation() subroutine raises an exception which is caught at the
    next instruction fetch.

    I show that both 16-bit-immediate and general 2-operand instructions use
    the same table (with a trifling of bit twiddling).

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).
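
    (For reference, a minimal standard-C sketch of that trace-style call
    threading; the type names and the fixed-size op array are made up for the
    illustration, not taken from the actual interpreter.)

    #include <stddef.h>

    typedef struct Vm    Vm;
    typedef struct Trace Trace;
    typedef void (*OpFn)(Vm *vm);

    struct Trace {
        OpFn     ops[8];                /* unrolled list of opcode handlers  */
        size_t   nops;
        Trace *(*next)(Vm *vm);         /* each trace picks its successor    */
    };

    struct Vm {
        long   acc;                     /* whatever machine state is needed  */
        Trace *entry;
    };

    void run(Vm *vm)
    {
        for (Trace *t = vm->entry; t != NULL; t = t->next(vm))
            for (size_t i = 0; i < t->nops; i++)
                t->ops[i](vm);          /* one indirect call per VM op       */
    }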

    Table-calls are faster than many switches unless you can demonstrate
    the switch is dense and there are no missing cases.

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.

    Just not a fast one...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:06:16 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:21:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction, another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range, instruction<8:5> specifies the routing according to the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants; 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.
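
    (A literal transcription of the table above into C, as an illustrative
    decode sketch: the register file, the 5-bit specifier fields and the
    trailing-word constants are assumed inputs, and only the integer reading
    of Imm5 is shown, not the floating-point half-steps.)

    #include <stdint.h>

    void route_operands(uint8_t route,              /* instruction<8:5>       */
                        const int64_t regs[32],
                        uint8_t src1, uint8_t src2, /* 5-bit specifier fields */
                        int64_t imm32, int64_t imm64,
                        int64_t *s1, int64_t *s2)
    {
        int64_t r1 = regs[src1], r2 = regs[src2];

        switch (route & 0xF) {
        case 0x0: *s1 =  r1;    *s2 =  r2;    break; /* +Src1 +Src2 */
        case 0x1: *s1 =  r1;    *s2 = -r2;    break; /* +Src1 -Src2 */
        case 0x2: *s1 = -r1;    *s2 =  r2;    break; /* -Src1 +Src2 */
        case 0x3: *s1 = -r1;    *s2 = -r2;    break; /* -Src1 -Src2 */
        case 0x4: *s1 =  r1;    *s2 =  src2;  break; /* +Src1 +Imm5 */
        case 0x5: *s1 =  src1;  *s2 =  r2;    break; /* +Imm5 +Src2 */
        case 0x6: *s1 = -r1;    *s2 = -src2;  break; /* -Src1 -Imm5 */
        case 0x7: *s1 =  src1;  *s2 = -r2;    break; /* +Imm5 -Src2 */
        case 0x8: *s1 =  r1;    *s2 =  imm32; break; /* +Src1 Imm32 */
        case 0x9: *s1 =  imm32; *s2 =  r2;    break; /* Imm32 +Src2 */
        case 0xA: *s1 = -r1;    *s2 =  imm32; break; /* -Src1 Imm32 */
        case 0xB: *s1 =  imm32; *s2 = -r2;    break; /* Imm32 -Src2 */
        case 0xC: *s1 =  r1;    *s2 =  imm64; break; /* +Src1 Imm64 */
        case 0xD: *s1 =  imm64; *s2 =  r2;    break; /* Imm64 +Src2 */
        case 0xE: *s1 = -r1;    *s2 =  imm64; break; /* -Src1 Imm64 */
        case 0xF: *s1 =  imm64; *s2 = -r2;    break; /* Imm64 -Src2 */
        }
    }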
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:24:07 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time?

    This is where the call-table approach works better--the scope is well
    defined.

    If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:28:16 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 00:45:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is whether the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined
    Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by
    requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In
    addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 20:41:18 2025
    From Newsgroup: comp.arch

    On 2025-11-05 3:52 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right-away that
    reduced most instructions eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For
    Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit
    instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift
    instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a
    load-immediate instruction.

    May I humbly suggest this is the wrong direction.

    Agree.

    Taking heed of the motto, I have scrapped a bunch of shifted-immediate
    instructions and load immediate. These were present as an alternate means
    to work with large constants. They were really redundant with the ability
    to specify constant overrides (routing) for registers, and they would
    increase the dynamic instruction count (bad!). Scrapping the extra
    instructions will also make writing a compiler simpler.

    One instruction scrapped was an add to IP. So another means of forming
    relative addresses was required: sacrificing a register code (code 32)
    to represent the instruction pointer. This will allow the easy formation
    of IP-relative addresses.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 21:49:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.
    I just realized that Qupls2026 does not accommodate small constants very
    well except for a few instructions like shift and bitfield instructions
    which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions
    with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free
    opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

    Qupls strives to be the low-cost processor.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Nov 5 19:20:57 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:24:24 2025
    From Newsgroup: comp.arch

    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. The format of interest is Decimal128: https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
    software implementation, one comparable in speed (say, within a
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy, if it is possible at all.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in the BCD domain.
    Instead, I'd convert the 11 DPD declets to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).
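
    (A sketch of that first conversion step, with dpd_to_bin[] standing in for
    the 1024-entry declet-to-binary table mentioned earlier, to be filled from
    the IEEE 754-2008 DPD decode rules; the function name is made up, and
    GCC/Clang's unsigned __int128 carries the 34-digit accumulator.)

    #include <stdint.h>

    extern const uint16_t dpd_to_bin[1024];   /* declet -> 0..999            */

    /* Convert the 110-bit trailing significand of a Decimal128 (passed as two
     * 64-bit halves) plus the leading digit from the combination field into
     * two base-1e18 "digits".                                               */
    void dfp128_sig_to_base1e18(uint64_t sig_hi, uint64_t sig_lo, unsigned msd,
                                uint64_t *hi18, uint64_t *lo18)
    {
        unsigned __int128 acc = msd;          /* at most 34 decimal digits   */

        for (int i = 10; i >= 0; i--) {       /* declet 10 is most significant */
            int bit = i * 10;                 /* offset within the 110 bits  */
            unsigned declet;
            if (bit >= 64)
                declet = (unsigned)(sig_hi >> (bit - 64)) & 0x3FF;
            else if (bit > 54)                /* declet straddles the halves */
                declet = (unsigned)((sig_hi << (64 - bit)) | (sig_lo >> bit)) & 0x3FF;
            else
                declet = (unsigned)(sig_lo >> bit) & 0x3FF;
            acc = acc * 1000 + dpd_to_bin[declet];
        }
        *hi18 = (uint64_t)(acc / 1000000000000000000ULL);
        *lo18 = (uint64_t)(acc % 1000000000000000000ULL);
    }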

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to the fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shifts/ORs seem inevitable.
    Maybe at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with the performance.

    Overall, even with a seemingly decent plan like the one sketched above,
    I'd expect DPD multiplication to be 2.5x to 3x slower than BID. But, then
    again, in the past my early performance estimates were wrong quite often.






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 08:46:40 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, >>>> Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (there called GO TO ... DEPENDING ON), but those features
    didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:43:57 2025
    From Newsgroup: comp.arch

    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer. An inter-procedural assigned goto is
    no different from an out-of-bounds array access or from an attempt to use
    a pointer to a local variable when the block/function that originally
    declared the variable is no longer active.
    But the compiler should try to detect as many cases of such misuse as it can.


    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an
    extension would be accepted for the Ada standard.)

    Niklas


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:11:54 2025
    From Newsgroup: comp.arch

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer.

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C. I doubt that
    Stallman had any better answer for gcc, but he did not care.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:37:16 2025
    From Newsgroup: comp.arch

    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    The point is that without a deep analysis of the program you cannot be
    sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
    other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many
    functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Do you know any better which of the "..." is executed next?

    You know, without any deep analysis or understanding, that the execution
    goes to one of the cases in the switch, and /not/ into the wild blue yonder.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 13:14:55 2025
    From Newsgroup: comp.arch

    On Thu, 6 Nov 2025 12:11:54 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went
    away between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of
    the) label to which the value refers", which is machine-level
    semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out
    when nothing useful can be said.

    Yes, UB sounnds as the best answer..

    The point is that Ritchie was not satisfied with that answer, which
    is why he removed labels-as-values from his version of C. I doubt
    that Stallman had any better answer for gcc, but he did not care.


    I suspect that the reason was different: DMR had no satisfying answer
    even for some of the intra-procedural cases.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Nov 6 07:44:38 2025
    From Newsgroup: comp.arch

    Taking direction from the VAX’s AOB? (add-one-and-branch) instruction
    and the DBcc instruction of the 68k, the Qupls Rs1 register of a
    compare-and-branch instruction may be incremented or decremented. This
    is really a form of instruction fusion, folding the op performed on the
    branch register into the branch instruction.

    I was thinking of modifying this to support additional ops and constant values. Why just add, if one can shift right or XOR as well? It may be
    useful to increment by a structure size. Also, a ring counter might be
    handy which could be implemented as a right shift. This could be
    supported by adding a postfix word to the branch instruction. It would
    make the instruction wider but it would not increase the dynamic
    instruction count.

    Not sure about the syntax to use for coding such instructions.

    BEQ Rs1,Rs2,label:ADD Rs1,256
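
    (For illustration, the C-level loop shape such a fused operation targets;
    the struct and function here are made up, and the BEQ/ADD spelling above is
    only the proposed syntax.)

    typedef struct { int key; int pad[3]; } item;

    long sum_keys(const item *p, const item *end)
    {
        long sum = 0;
        for (; p != end; p++)       /* p++ steps by sizeof(item) bytes; the   */
            sum += p->key;          /* compare+increment+branch back edge is  */
        return sum;                 /* what the fused instruction expresses   */
    }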


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 07:57:23 2025
    From Newsgroup: comp.arch

    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be discributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.


    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can be, and I understand usually is, implemented
    via an index into a jump table. No self-modifying code required.


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have. BTW, I can accept the argument for keeping
    it in C on the grounds that C is "lower level" than, say, Fortran, COBOL
    or PL/1, and people using it are used to the language allowing "risky"
    constructs.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.


    I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (there called GO TO ... DEPENDING ON), but those features
    didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do.

    No, they are implemented as an indexed jump table.


    And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 17:44:32 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    Agreed. Unfortunately, I have a hard time (i.e. "have not managed")
    convincing abc that both signals are available, and asserting that
    exactly one of them is 1 at any given time, without completely
    blowing up the optimization routines. It also does not handle
    external don't cares. But as I use it purely to play around with
    things, that is not too bad :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 17:52:32 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-05 7:17, Anton Ertl wrote:
    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    You can look at his specification in the documentation of, say, 7th
    edition Unix (where Ritchie apparently took the effort to document
    semantics), and see how he specified that. I doubt he specified
    "semantics in the abstract C machine", but I expect that he specified
    semantics at the C level.

    Concerning how Stallman documented it, you can look at the gcc
    documentation from 2.0 until Stallman passed maintainership on
    (gcc-2.7?).

    If you look at the current documentation <https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html>, it talks
    about the "address of a label" and "jump to one", which you might
    consider to be a machine-level description. You can also describe
    this at a C source level or "C abstract machine" level, but I don't
    expect the description to become any clearer.

The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    The gcc documentation says:

    |You may not use this mechanism to jump to code in a different
    |function. If you do that, totally unpredictable things happen.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:14:54 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:28:19 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:17:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit
    (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that. Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?
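For illustration, option 2 maps naturally onto GNU C labels-as-values;
here is a minimal sketch (my own, not gfortran's or ifort's actual
implementation) that stores a label as a 32-bit offset from a
per-function anchor label:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t assigned;                           /* the 32-bit "INTEGER"    */

    assigned = (int32_t)(&&target - &&anchor);  /* ASSIGN: label as offset */

    goto *(&&anchor + assigned);                /* assigned GOTO           */

anchor:
    puts("fell through anchor");                /* not reached here        */
    return 1;
target:
    puts("reached target");
    return 0;
}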

    How does ifort deal with this problem?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:36:33 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    For 2-operands and 3-operand instructions, they are all present.
    For 1-Operand instructions, only the ones targeting Src2 are
    available and if you use one not allowed you take an OPERATION
    exception.

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    I just realized that Qupls2026 does not accommodate small constants very well except for a few instructions like shift and bitfield instructions which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    1<<const // performed at compile time
    1<<var // 1-instruction {1-word in My 66000}

    17/var // 1-instruction {1-word}

    You might notice My 66000 does not even HAVE a SUB instruction,
    instead:

    ADD Rd,Rs1,-Rs2

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

Out of the 64-slot Major OpCode space, 23 slots are left over, 6 reserved
    in perpetuity to catch random jumps into integer or fp data.

    Qupls strives to be the low-cost processor.

    My 66000 strives to be the low-instruction-count processor.

    But remember, ISA is only the first 1/3rd of an architecture.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:39:55 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses. In My case, one always
has access to larger constants at the same instruction-count price,
just with a larger code footprint.
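For illustration only, here is a minimal C sketch of one way such a
decode could look, assuming (my guess, not the actual My 66000 ROM
contents) that the 5-bit specifier selects a magnitude in steps of 0.5
and the operand-routing row supplies the sign:

#include <stdio.h>

/* Stand-in for a 32-entry constant ROM: entry i would hold i * 0.5,
   giving magnitudes 0.0 .. 15.5; the sign comes from the routing row
   (+Imm5 vs -Imm5).  This is a guess at the scheme, not the real ROM. */
static double fp_imm5(unsigned imm5, int negate)
{
    double v = (imm5 & 31) * 0.5;
    return negate ? -v : v;
}

int main(void)
{
    printf("%g %g\n", fp_imm5(31, 0), fp_imm5(1, 1));  /* 15.5 -0.5 */
    return 0;
}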

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:45:41 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
case add:
    ...
    break;
case load:
    ...
    break;
case store:
    ...
    break;
    ...
    }
    }
    }

    Now let us look at it with tabularized functions:: {Ignore the
    interrupt and exception stuff at your peril}

    bool RunInst( Chip chip )
    {
    for( uint64_t i = 0; i < cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t = cpu->context[cs];
    Inst I;

if( cpu->interrupt & ((((int64_t)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
t = cpu->context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
else if( uint16_t raised = cpu->raised & cpu->enabled )
    { // take an exception
    cpu->cs--;
t = cpu->context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I.inst;
    t->reg[2] = I.src1;
    t->reg[3] = I.src2;
    t->reg[4] = I.src3;
    }
    else
    { // run an instruction
    t->ip += memory( FETCH, t->ip, &I.inst );
    t->raised |= majorTable[ I.major ]( cpu, t, &I );
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e. change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

As did COBOL, called goto depending on, but those features didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 6 13:11:10 2025
    From Newsgroup: comp.arch

    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128. https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed (say, within
factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



BTW, at least for multiplication, I would probably not do my
arithmetic in the BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
    I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
    4x 32-bit values each holding 9 digits
    Except the top one generally holding 7 digits.
    16-bit exponent, sign byte.
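For illustration, a rough C sketch of such an unpacked working form
(field names are mine, not the mockup's):

#include <stdint.h>
#include <stdio.h>

/* 34 decimal digits held as four base-1e9 limbs (the top limb normally
   holding only 7 digits), plus a decimal exponent and a sign byte. */
typedef struct {
    uint32_t limb[4];   /* limb[0] = least-significant 9 digits */
    int16_t  exponent;  /* decimal exponent */
    uint8_t  sign;      /* 0 = positive, 1 = negative */
} dec128_unpacked;

int main(void)
{
    dec128_unpacked x = { { 999999999u, 999999999u, 999999999u, 9999999u },
                          0, 0 };   /* largest 34-digit significand */
    printf("top limb %u, exponent %d\n", x.limb[3], x.exponent);
    return 0;
}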

    Then wrote a few pack/unpack scenarios:
    X30: Directly packing 20/30 bit chunks, non-standard;
    DPD: Use the DPD format;
    BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
    X30 is around 10x faster than either DPD or BID;
    Both DPD and BID need a similar amount of time.
    BID needs a bunch of 128-bit arithmetic handlers.
    DPD needs a bunch of merge/split and table lookups.
    Seems to mostly balance out in this case.


    For DPD, merge is effectively:
    Do the table lookups;
    v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
    v0=v;
    v1=v/1000;
    v0-=v1*1000;
    v2=v1/1000;
    v1-=v2*1000;
    Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).
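As a concrete illustration of the merge/split just described, a minimal
C sketch follows; dpd_to_bin() and bin_to_dpd() stand in for the real
declet lookup tables and are stubbed with an identity mapping purely so
the sketch runs:

#include <stdint.h>
#include <stdio.h>

/* Placeholders: the real tables map a 10-bit DPD declet to its 3-digit
   value (0..999) and back. */
static uint32_t dpd_to_bin(uint32_t declet) { return declet % 1000u; }
static uint32_t bin_to_dpd(uint32_t v)      { return v; }

/* Merge three declets into one base-1e9 chunk. */
static uint32_t merge3(uint32_t d0, uint32_t d1, uint32_t d2)
{
    return dpd_to_bin(d0) + dpd_to_bin(d1) * 1000u
                          + dpd_to_bin(d2) * 1000000u;
}

/* Split a base-1e9 chunk back into three declets; the divides by a
   constant become multiplies by a reciprocal, as noted above. */
static void split3(uint32_t v, uint32_t *d0, uint32_t *d1, uint32_t *d2)
{
    uint32_t v1 = v / 1000u;
    uint32_t v0 = v - v1 * 1000u;
    uint32_t v2 = v1 / 1000u;
    v1 -= v2 * 1000u;
    *d0 = bin_to_dpd(v0); *d1 = bin_to_dpd(v1); *d2 = bin_to_dpd(v2);
}

int main(void)
{
    uint32_t d0, d1, d2, v = merge3(123, 456, 789);   /* 789456123 */
    split3(v, &d0, &d1, &d2);
    printf("%u -> %u %u %u\n", v, d0, d1, d2);
    return 0;
}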


    At first it seemed like a strong reason to favor X30 over either DPD or
BID. Except that the cost of the ADD and MUL operations effectively
    dwarf that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


As is, it seems MUL and ADD cost roughly 6x more than the cost of the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that
    would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to
    less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
    12x 3 digits (16b chunk)
    4x 9 digits (32b chunk)
    2x 18 digits (64b chunk)
    3x 12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


The 3x 12 digit option, while not exactly the densest scheme, leaves a little
    more "working space" so would reduce cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to
    working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 19:38:54 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:04:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that?

    No, that would be beyond horrible.

    What about regular goto and
    computed goto?

    Neither; according to F77, it must be "defined in the same program
    unit".

    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:07:16 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    Compiler writers should never box themselves in like that.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    That would make jumps very inefficient.

    Of course, if Fortran
    assigns labels between shared libraries and the main program,

    It does not.

    How does ifort deal with this problem?

    I have no idea, and no inclination to find out; check out
    assembly code at godbolt if you are really interested.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 12:14:33 2025
    From Newsgroup: comp.arch

    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point
    instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction
    stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 20:24:23 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?

    How does ifort deal with this problem?

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 16:24:28 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,
    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 21:59:31 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that

    There is a space between the y and the 6 in My 66000.

    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Nov 6 22:09:25 2025
    From Newsgroup: comp.arch

    It appears that MitchAlsup <user5857@newsgrouper.org.invalid> said:
    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Relatively speaking, yeah. In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy program, 13 lines of Algol60 horror for which Knuth himself got the results wrong.
--
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 22:53:09 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}
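For what it is worth, a minimal sketch of that jumpbuf trick (my own
illustration, not actual code from this thread; it is only well-defined
while the function that called setjmp() is still active):

#include <setjmp.h>
#include <stdio.h>

int main(void)
{
    jmp_buf label_var;          /* plays the role of a label variable  */
    volatile int visits = 0;    /* volatile so it survives the longjmp */

    setjmp(label_var);          /* "ASSIGN target TO label_var"        */
    printf("at target, visit %d\n", visits);

    if (++visits < 3)
        longjmp(label_var, 1);  /* "GOTO label_var"                    */

    return 0;
}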

    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 22:21:05 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    The discussion above shows that whether a label value is implemented as
a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    Ritchie designed lots of features into C for which the C
    standardization committee later decided that some cases are undefined behaviour. I don't think that Ritchie had any qualms at designing
    something like labels-as-values with unchecked limitations (what would
    later become undefined or implementation-defined behaviour), or
    documenting these limitations.

    Here is my attempt (from 1999) at a specification for
    labels-as-values:

|"goto *<expr>" [or whatever the syntax was] is equivalent to "goto <label>"
|if <expr> evaluates to the same value as the expression "&&<label>" [or
|whatever the syntax was]. If <expr> does not evaluate to a label of the
|function that contains the "goto *<expr>", the result is undefined.

The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results.

    Gforth certainly passes the labels out, for use by the compiler that
    generates the VM code.

    In
addition, the use of an uninitialized label-valued variable should be prevented or detected.

    Using an uninitialized variable is undefined behaviour in C, but not
    prevented, and not always detected (compilers emit warnings in some
    cases when they detect a use of an uninitialized variable). Why
should it be any different for an uninitialized variable used with
    "goto *"?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 20:10:19 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.
    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}

    Or worse... shoot yourself in the foot and then step in a cow pie.
    I hate when that happens.

and it was up to me to make sure the registers were all handled correctly.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 06:55:08 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Some people use auto-generated code (for example from computer
    algebra systems), which generate really, really long procedures.
    A good stress-test for compilers, too; they tend to expose
    O(n^2) or worse behavior where nobody looked. So it is good that
    branch instructions within functions are expanded by the assembler
    if needed :-)

    Even having 64-bit offsets like My 66000 can lead into a trap (and will
    require future optimization work on the compiler). This is a simplified version of something that came up in a PR.

    SUBROUTINE FOO
    DOUBLE PRECISION A,B,C,D,E
COMMON A,B,C,D,E
    C very many statements involving A,B,C,D,E

    If you load and store each access to one of the variables via its
64-bit address, you can end up using very many 96-bit instructions,
    where a single load of the base address of the COMMON block would
    save a lot of code space at the expense of a single instruction
    at the beginning.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:06:41 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:
    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.
    ...
Yes, UB sounds like the best answer.

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C.

    He did not write that, and given the rest of C, I very much doubt that
    this was the reason.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:08:42 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    [Fortran's assigned goto]
Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is
doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    This is the problem that Stephen Fuld mentioned, and that is actually
    a practical problem that I have experience in some cases when
    debugging programs with indirect control flow, usually with various
    forms of indirect calls, e.g., method calls. I have not experienced
    it for threaded-code interpreters that use labels-as-values (as
    outlined above), because there I can always look at ip[0], ip[1]
    etc. to see where the next executions of goto *ip will go.

    The point is that without a deep analysis of the program you cannot be
sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    That has never been a problem in my experience, and I have been using labels-as-values since 1992. Up to gforth-0.6 (2003), all instances
    of &&label and all instances of goto *expr were in the same function,
    so if labels had a separate type, that could not be converted by
    casts, the analysis would be trivial, at least if GNU C was an
    Ada-like language, where labels have their own type that cannot be
    converted to other types. As it is, Fortran's assigned goto uses
    integer numbers, and labels-as-values uses void *, so if anybody was
    really interested in performing such an analysis, they would have a
    lot of work to do. But the design of these features with using
    existing types makes it obvious that performing such an analysis was
    not intended.

    Interestingly, if somebody wanted to work in that direction, checking
    at run-time that the target of a goto is inside the function that
    contains the goto is easy and not particularly expensive. With the
    newfangled "control-flow integrity" features in hardware, you could
    even check relatively cheaply that only &&label instances are targets
    of goto *.
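For illustration, a minimal GNU C sketch of such a run-time check
(names are hypothetical, not Gforth's code): the function hands out
its label table, and later verifies a target against that table before
performing the computed goto.

#include <stdio.h>
#include <stdlib.h>

#define N_LABELS 3

static void **engine(void *target)
{
    static void *labels[N_LABELS] = { &&add, &&load, &&store };

    if (target == NULL)          /* hand the label table out, as a VM */
        return labels;           /* compiler would want               */

    int ok = 0;
    for (int i = 0; i < N_LABELS; i++)
        ok |= (target == labels[i]);   /* cheap membership check */
    if (!ok) {
        fprintf(stderr, "goto target is not a label of engine()\n");
        abort();
    }
    goto *target;

add:   puts("add");   return NULL;
load:  puts("load");  return NULL;
store: puts("store"); return NULL;
}

int main(void)
{
    void **labels = engine(NULL);
    engine(labels[1]);             /* prints "load"                   */
    engine((void *)main);          /* aborts: not a label of engine() */
    return 0;
}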

    Ok, so what about gforth-0.6 (2003) and later? First of all, they
    contain two functions with goto * and &&label instances, so the
    trivial analysis would no longer work. Has there ever been any mixup
    where a goto * jumped to a label in the other function? Not that I
    know of; if it happened, it would actually work, because the two
    functions are identical apart from some code-space padding.

    What's more relevant is that gforth-0.6 added code-copying dynamic
    native code generation: It copies code snippets (using the addresses
    gotten with &&label to determine where they start and where they end)
    to some RWX data region, concatenating the snippets in this way,
    resulting in a compiled program in the RWX region. It then uses one
    of the goto * in one of the functions to actually start executing this dynamically-generated code.

    This is probably outside of what Stallman had in mind for
    labels-as-values, but fortunately Stallman did not try to limit what
    can be done to what he had in mind, the way that many programming
    language designers do, and the way that many people discussing
    programming languages think. This is a feature that Ritchie's C also
    has, which cannot be said about the C of people who think that
    "undefined behaviour" is enough justification to declare a program
    "buggy".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:09:02 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
it isn't true. As I said, in the hands of a good programmer, you might assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you mentioned method calls as
'just a more expensive "label"'; there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas
    Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?
    I bet that it ends up in self-modifying code, too, because these
    architectures usually don't have indirect jumps through jump tables,
    either. If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
method dispatch in object-oriented programs?

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

Does the IBM 704 (for which FORTRAN was originally designed)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:32:08 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.

    The benefit I see from that is that data-flow analysis must only
    consider the control flows from the assigned goto to these targets and
    not to all assigned labels (in contrast to labels-as-values), and
    conversely, if every assigned goto has such a list, data-flow analysis
    knows more precisely which gotos can actually jump to a given label.

    This would make a small difference in Gforth since 0.6, which has
    introduced hybrid direc/indirect-threaded code, and where some goto *
    are for indirect-threaded dispatches, and some labels are only reached
    from these goto * instances, and a certain variable is only alive
    across these jumps. GNU C does not have this option, so what we did
    instead is to kill the variable right before all the gotos that do not
    jump to these labels.

    It might also help with static stack caching: There are stack states
    with 0-n stack items in registers, and a particular VM instruction
    code snippet starts in a particular state (say, 2 stack items in a
    register) and ends with another state S (say, 1 stack item in a
    register). It will jump to code that expects the same state S. All
    variables that contain stack items beyond what S has are dead at that
    point. If we could tell that the goto * from state S only goes to
    targets in state S, the data-flow analysis could determine that.
    Instead, what we do is to kill these additional variables in a subset
    of uses. When we tried to kill them at all uses, the quality of the
    code produced by gcc deteriorated significantly.

    This variable-killing happens by having empty asm statements that
    claim to write to these variables, so if this is used incorrectly, the
    produced code will be incorrect. So the benefit of this assigned-goto
    feature would be to replace a dangerous feature with another dangerous
    one: if you fail to list all the jumped-to labels, the data-flow
    analysis would be wrong, too. It seems more elegant to describe the
    actual control flow, and then let the data-flow analysis do its work
    than the heavy-handed direct influence on the data-flow analysis that
    our variable-killing does.
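
    As a minimal sketch (not Gforth's actual code) of the kind of
    variable-killing described above: a toy three-opcode VM using GNU C
    labels-as-values, where an empty asm claims to write the hypothetical
    stack-cache variable tos2, so its previous value is treated as dead
    right before the following goto *:

        #include <stdio.h>

        static int run(const int *prog)
        {
            long tos = 0, tos2 = 0;              /* hypothetical stack cache */
            void *dispatch[] = { &&op_lit, &&op_add, &&op_end };
            const int *ip = prog;

            goto *dispatch[*ip++];
        op_lit:
            tos2 = tos; tos = *ip++;
            goto *dispatch[*ip++];
        op_add:
            tos += tos2;
            asm ("" : "=r"(tos2));               /* "kill" tos2 here */
            goto *dispatch[*ip++];
        op_end:
            return (int)tos;
        }

        int main(void)
        {
            int prog[] = { 0, 2, 0, 3, 1, 2 };   /* lit 2, lit 3, add, end */
            printf("%d\n", run(prog));           /* prints 5 */
            return 0;
        }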

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 15:26:38 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy program, 13 lines of Algol60 horror that Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    The main horror in the original version is that for some of the Algol
    60 syntax that is used, it is not obvious without studying the Algol
    60 report what it means. <https://rosettacode.org/wiki/Man_or_boy_test#ALGOL_60> contains some discussion, and one can find it in various other programming
    languages, more or (often) less close to the original. The discussion
    at <https://rosettacode.org/wiki/Man_or_boy_test#TXR> and the
    difference between the "proper job" version and the "crib the Common
    Lisp or Scheme solution" version gives some insight.

    The fact that "less close" also produces the correct result suggests
    that the man-or-boy test is less discerning than Knuth probably
    intended. That's a common problem with testing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 08:26:41 2025
    From Newsgroup: comp.arch

    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you

    I think the attributions are messed up, as I didn't say what you next
    say I said.


    mentioned method calls as
    'just a more expensive "label"', there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I =1, 20 if I = 2, etc

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    No code modification nor indirection required .

    Yes, it does require execution of an "extra" jump instruction.
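
    (As an aside, a rough modern analog, as a hedged C sketch with made-up
    case bodies: a dense switch like the one below is a classic candidate
    for jump-table lowering by gcc/clang, i.e. a bounds check, an indexed
    load of a code address, and one indirect jump -- the moral equivalent
    of the instruction sequence above, still with no code modification.)

        int computed_goto(int i)      /* i in 1..4, as in the Fortran example */
        {
            switch (i) {
            case 1: return 10;        /* stands in for the code at label 10 */
            case 2: return 20;
            case 3: return 30;
            case 4: return 40;
            default: return -1;       /* out-of-range check */
            }
        }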


    I bet that it ends up in self-modifying code, too, because these architectures usually don't have indirect jumps through jump tables,
    either.

    Not required.


    If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.


    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

    Does the IBM 704 (for which FORTRAN has been designed originally)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    I don't know what the 704 implemented, but I have shown above self
    modifying code is not necessary for computed goto, and I suspect
    assigned goto was implemented with self modifying code. But as I said,
    back then self modifying code was not considered as bad as it is now.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 17:29:07 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:

    [...]

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?

    If you look closely, you'll see pi/180 in that list. But pi is
    also there (I cut it off the list), it occurs 11 times. And the
    large numbers are +/- DBL_MAX*0.5, I don't know what they are
    used for.

    By comparison, here are the values which are most frequently
    contained in GSL:

    5-bit constants: 5148
    32-bit constants: 3769
    64-bit constants: 3140
    2678 1
    1518 0
    687 -1
    424 2
    329 0.5
    298 -2
    291 2.22044604925031e-16
    275 4.44089209850063e-16
    273 3
    132 -3
    131 -0.5
    131 3.14159265358979
    88 4
    86 1.34078079299426e+154
    77 6
    70 0.25
    70 5
    68 2.2250738585072e-308
    66 10
    64 -4
    50 -6
    46 0.1
    45 5.87747175411144e-39
    43 0.333333333333333
    42 1e+50
    38 6.28318530717959
    35 9
    31 0.2
    30 7
    30 -0.25

    [...]

    So, having values between -15.5 and +15.5 is a choice that will
    cover quite a few floating point constants. For different packages,
    FP constant distributions probably vary too much to create something
    that is much more useful.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 17:15:59 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I =1, 20 if I = 2, etc

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    No code modification nor indirection required .

    The "Jump $,R1" is an indirect jump. With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>> architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bill Findlay@findlaybill@blueyonder.co.uk to comp.arch on Fri Nov 7 17:54:33 2025
    From Newsgroup: comp.arch

    On 7 Nov 2025, Anton Ertl wrote
    (in article<2025Nov7.162638@mips.complang.tuwien.ac.at>):

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy
    program, 13 lines of Algol60 horror that Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    I append a run of MANORBOY in Pascal for the KDF9.
    No display was used.
    A static frame pointer as part of the functional parameter
    suffices logically and gives better performance.

    Paskal : the KDF9 Pascal cross-compiler V19.2a, compiled ... on 2025-11-07.
    1 u | %storage = 32767
    2 u | %ystores = 30100
    3 u |
    4 u | program MAN_OR_BOY;
    5 u |
    6 u | { See: }
    7 u | { "Man or boy?", }
    8 u | { by Donald Knuth, }
    9 u | { ALGOL Bulletin 17.2.4, p7; July 1964. }
    10 u |
    11 u | var
    12 u | i : integer;
    13 u | function A (
    14 u | k : integer;
    15 u | function x1 : integer;
    16 u | function x2 : integer;
    17 u | function x3 : integer;
    18 u | function x4 : integer;
    19 u | function x5 : integer
    20 u | ) : integer;
    21 u |
    22 u | function B : integer;
    23 u 1b| begin
    24 u | k := k - 1;
    25 u | B := A (k, B, x1, x2, x3, x4);
    26 u 1e| end { B };
    27 u |
    28 u 1b| begin { A }
    29 u | if k <= 0 then
    30 u | A := x4 + x5
    31 u | else
    32 u | A := B;
    33 u 1e| end { A };
    34 u |
    35 u | function pos_one : integer;
    36 u | begin pos_one := 1 end;
    37 u |
    38 u | function neg_one : integer;
    39 u | begin neg_one := -1 end;
    40 u |
    41 u | function zero : integer;
    42 u | begin zero := 0 end;
    43 u |
    44 u 1b| begin { MAN_OR_BOY }
    45 u | rewrite(1, 3);
    46 u | for i := 0 to 11 do
    47 u | write(A(i, pos_one, neg_one, neg_one, pos_one, zero):6);
    48 u | writeln;
    49 u 1e| end { MAN_OR_BOY }.

    Compilation complete : 0 error(s) and 0 warning(s) were reported.
    ...
    This is ee9 17.0a, compiled by GNAT ... on 2025-11-07.
    Running the KDF9 problem program Binary/MANORBOY
    ...
    Final State: Normal end of run.
    ...
    LP0 on buffer #05 printed 1 line.

    LP0:
    ===
    1 0 -2 0 1 0 1 -1 -10 -30 -67 -138
    ===
    --
    Bill Findlay

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 10:45:39 2025
    From Newsgroup: comp.arch

    On 11/7/2025 9:15 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I =1, 20 if I = 2, etc
    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    It is generic enough that it could be lots of architectures, but the one
    I know best is the Univac 1100.



    No code modification nor indirection required .

    The "Jump $,R1" is an indirect jump.

    Perhaps we just have a terminology disagreement. I don't call that
    indirect addressing. The 1100 architecture supports indirect addressing
    in the hardware. An indirect reference was represented in the assembler
    by an asterisk preceding the label, which set a bit in the instruction
    that told the hardware to go to the address specified in the instruction
    and treat what it found there as the address of the operand for the instruction.

    So, for example:

    J *tag

    tag finaladdress

    would cause the hardware to fetch the address at tag and use that as the operand, thus causing a jump to "final address".

    This is what I call indirect addressing.

    So to use this in an assigned goto, the assign statement would store the desired address at tag such that when the jump was executed, it would
    jump to the desired address.

    I call the construct with several consecutive jump instructions an
    indexed jump, not an indirect one.



    With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1


    Yes.
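
    For what it is worth, the modern equivalent of this pattern in GNU C
    (a hedged sketch, labels made up) stores a code address in a variable
    and then performs exactly one indirect jump through it, which is
    typically what compilers emit for goto * today:

        void assigned_goto_demo(int cond)
        {
            void *x = cond ? &&l10 : &&l20;   /* "ASSIGN 10/20 TO X" */
            goto *x;                          /* "GOTO X"            */
        l10:
            /* ... code for label 10 ... */
            return;
        l20:
            /* ... code for label 20 ... */
            return;
        }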


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>>> architectures using self-modifying code are bad by association, then >>>>> we have to get rid of all of these language features ASAP.

    And, by an large they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was
    referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    True. But we still have my original argument, better expressed by
    Niklas about code readability/followability.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 14:28:48 2025
    From Newsgroup: comp.arch

    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever.  It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed  (say, within
    factor of 1,5-2) to competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
      I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
      4x 32-bit values each holding 9 digits
        Except the top one generally holding 7 digits.
      16-bit exponent, sign byte.
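
    A minimal sketch of that sort of unpacked working format (field names
    are my own guess, not the actual code):

        #include <stdint.h>

        typedef struct {
            uint32_t dig[4];   /* dig[0]=low 9 digits .. dig[3]=top 7 digits */
            int16_t  exp;      /* decimal exponent */
            uint8_t  sign;     /* 0 = positive, 1 = negative */
        } dfp128_unpacked;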

    Then wrote a few pack/unpack scenarios:
      X30: Directly packing 20/30 bit chunks, non-standard;
      DPD: Use the DPD format;
      BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
      X30 is around 10x faster than either DPD or BID;
      Both DPD and BID need a similar amount of time.
        BID needs a bunch of 128-bit arithmetic handlers.
        DPD needs a bunch of merge/split and table lookups.
        Seems to mostly balance out in this case.


    For DPD, merge is effectively:
      Do the table lookups;
      v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
      v0=v;
      v1=v/1000;
      v0-=v1*1000;
      v2=v1/1000;
      v1-=v2*1000;
      Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).


    At first it seemed like a strong reason to favor X30 over either DPD or
    BID. Except that the cost of the ADD and MUL operations effectively
    dwarfs that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to less clutter, etc. Though, this part would be less bad if C had had widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
      12x  3 digits (16b chunk)
      4x   9 digits (32b chunk)
      2x  18 digits (64b chunk)
      3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still sometimes exceed it; and isn't that much more than 9 digits (but would reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being notably slower.

    However, if running on RV64G with the standard ABI, it is likely the 9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a value).


    The 3x 12 digit scheme, while not exactly the densest, leaves a little
    more "working space" so would reduce cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
    2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
    Any examples of hard-coded numbers in this format on the internet;
    Any obvious way to generate them involving "stuff I already have".
    As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's
    "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go
    down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful; I would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, rates in MHz (millions
    of times per second), on my desktop PC:
    DPD Pack/Unpack: 63.7 MHz (58 cycles)
    X30 Pack/Unpack: 567 MHz ( 7 cycles) ?...

    FMUL (unwrap) : 21.0 MHz (176 cycles)
    FADD (unwrap) : 11.9 MHz (311 cycles)

    FDIV : 0.4 MHz (very slow; Newton Raphson)

    FMUL (DPD) : 11.2 MHz (330 cycles)
    FADD (DPD) : 8.6 MHz (430 cycles)
    FMUL (X30) : 12.4 MHz (298 cycles)
    FADD (X30) : 9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly
    related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the
    input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
    DPD cost: 51 cycles.
    X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
    DPD case does a whole lot of stuff;
    X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning
    structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
    S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
    MUL and ADD use double-width internal mantissa, so should be accurate;
    Current test doesn't implement rounding modes though, could do so.
    Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't
    close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for
    Binary-FP isn't quite as effective.
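
    For reference, the Newton-Raphson update both cases rely on, as a
    one-line sketch in plain binary double (the DFP version does the same
    multiply and subtract in the decimal format): given an estimate r of
    1/d, the error roughly squares each step once the seed is close enough.

        static double nr_recip_step(double d, double r)
        {
            return r * (2.0 - d * r);   /* r' = r*(2 - d*r) ~ 1/d */
        }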


    ...


    Still don't have a use-case, mostly just messing around with this...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 7 22:57:14 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10 bits in could be the packed DFP representation (it's denser and
    has smaller tables). This way, table lookup overlaps unpacking.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 20:23:40 2025
    From Newsgroup: comp.arch

    On 11/7/2025 4:57 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10 bits in could be the packed DFP representation (it's denser and
    has smaller tables). This way, table lookup overlaps unpacking.


    FWIW: Dump of the test code as it exists...
    https://pastebin.com/NcvCi5gD

    I had since found the decNumber library, and with this was able to
    confirm that I had in fact figured out the specifics of the format (I
    was unsure whether or not my version was correct, as I had implemented
    it based mostly on descriptions of the format on Wikipedia, which were
    not entirely consistent).

    Otherwise, experiment / proof of concept.
    Unlikely to actually be useful.



    Way I had usually started out with binary FDIV/reciprocal:
    Turn the reciprocal into a modified integer subtract;
    Or, subtract for HOB's, everything else is a bitwise inversion.
    Can often get within the top 4 bits of the mantissa or so.
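
    A minimal sketch of that binary-FP "modified integer subtract" seed
    (the magic constant below is an untuned assumption of mine, chosen so
    the exponent is roughly negated around the bias and the mantissa bits
    roughly inverted; it is not the actual value used):

        #include <stdint.h>
        #include <string.h>

        static double recip_seed(double x)
        {
            uint64_t u;
            memcpy(&u, &x, sizeof u);          /* reinterpret the bits      */
            u = 0x7FDE000000000000ull - u;     /* assumed, untuned constant */
            memcpy(&x, &u, sizeof u);
            return x;                          /* rough estimate of 1/x     */
        }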

    Way I had tried to do so for decimal:
    Invert the exponent in a similar way as binary FP;
    Set the mantissa to the 9s complement value.


    Issue:
    The 9s complement method doesn't give a value particularly close to the
    actual target value.

    For example:
    Taking the reciprocal of 3.14159x, I get 0.685840x, but actual target is 0.318309x.

    Like, I almost may as well just leave the mantissa as-is, or fill it
    with all 5s or something.


    Granted, feeding the high 3 digits through a lookup table and just
    setting all the low digits to whatever is probably also an option, and probably faster than using an initial coarse convergence to try to get
    it somewhere in the right general area.


    I realized after finding decNumber and using it to generate a test
    number, that it seems to use the format in a very different way,
    effectively keeping the value right-aligned and normalized, rather than left-aligned and normalized.

    My code sort of assumed keeping values normalized (as with traditional floating point).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:18:08 2025
    From Newsgroup: comp.arch

    On 2025-11-07 3:28 p.m., BGB wrote:
    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever.  It is relatively cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed  (say, within
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
       I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
       4x 32-bit values each holding 9 digits
         Except the top one generally holding 7 digits.
       16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
       X30: Directly packing 20/30 bit chunks, non-standard;
       DPD: Use the DPD format;
       BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
       X30 is around 10x faster than either DPD or BID;
       Both DPD and BID need a similar amount of time.
         BID needs a bunch of 128-bit arithmetic handlers.
         DPD needs a bunch of merge/split and table lookups.
         Seems to mostly balance out in this case.


    For DPD, merge is effectively:
       Do the table lookups;
       v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
       v0=v;
       v1=v/1000;
       v0-=v1*1000;
       v2=v1/1000;
       v1-=v2*1000;
       Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by
    constant into multiply by reciprocal).


    At first it seemed like a strong reason to favor X30 over either DPD
    or BID. Except that the cost of the ADD and MUL operations
    effectively dwarfs that of the pack/unpack operations, so the relative
    cost difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something
    that would lead to X30 being a decisive win either in terms of
    performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a
    128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due
    to less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after
    normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
       12x  3 digits (16b chunk)
       4x   9 digits (32b chunk)
       2x  18 digits (64b chunk)
       3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations
    fully fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


    The 3x 12 digit scheme, while not exactly the densest, leaves a
    little more "working space" so would reduce cases which exceed the
    limits of 64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays
    within the limits of 64-bit arithmetic (where multiply temporarily
    widens to working with 18 digits, but then narrows back to 9 digit
    chunks).

    Also 9 digit chunking may be preferable when one has a faster
    32*32=>64 bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
      Any examples of hard-coded numbers in this format on the internet;
      Any obvious way to generate them involving "stuff I already have".
        As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful; I would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, rates in MHz (millions
    of times per second), on my desktop PC:
      DPD Pack/Unpack: 63.7 MHz (58 cycles)
      X30 Pack/Unpack: 567 MHz  ( 7 cycles) ?...

      FMUL (unwrap)  : 21.0 MHz (176 cycles)
      FADD (unwrap)  : 11.9 MHz (311 cycles)

      FDIV           :  0.4 MHz (very slow; Newton Raphson)

      FMUL (DPD)     : 11.2 MHz (330 cycles)
      FADD (DPD)     :  8.6 MHz (430 cycles)
      FMUL (X30)     : 12.4 MHz (298 cycles)
      FADD (X30)     :  9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
      DPD cost: 51 cycles.
      X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
      DPD case does a whole lot of stuff;
      X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
      S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
      MUL and ADD use double-width internal mantissa, so should be accurate;
      Current test doesn't implement rounding modes though, could do so.
        Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
      Exponent is wrong:
        Scale by factors of 2 until correct;
      Off by more than 50%, scale by +/- 25%;
      Off by more than 25%, scale by +/- 12.5%;
      Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for Binary-FP isn't quite as effective.


    ...


    Still don't have a use-case, mostly just messing around with this...



    When I built my decimal float code I ran into the same issue. There are
    not really examples on the web. I built integer to decimal-float and decimal-float to integer converters then compared results.

    Some DFP encodings for 1,10,100,1000,1000000,12345678, and 2 (I hope these are
    right, no guarantees).
    Integer decimal-float
    u 00000000000000000000000000000001 25ffc000000000000000000000000000
    u 0000000000000000000000000000000a 26000000000000000000000000000000
    u 00000000000000000000000000000064 26004000000000000000000000000000
    u 000000000000000000000000000003e8 26008000000000000000000000000000
    u 000000000000000000000000000f4240 26014000000000000000000000000000
    u 00000000000000000000000000bc614e 2601934b9c0c00000000000000000000
    u 00000000000000000000000000000002 29ffc000000000000000000000000000


    I have used the decimal float code (96 bit version) with Tiny BASIC and
    it seems to work.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:30:36 2025
    From Newsgroup: comp.arch

    Cache-line constants were tried with the StarkCPU and seemed to work
    fine, but wasted cache-line space when constants and instructions could
    not be packed evenly into the cache-line.

    However, for Qupls2026, storing constants on the cache-line might be
    just as efficient storage-wise as having the constants follow the
    instruction words, because of the 48-bit word width. Constants typically
    do not need to be multiples of 48 bits. If stored on the cache-line they
    could be multiples of 16 bits. There are potentially 32 bits of wasted
    space if an instruction is not able to be packed onto the cache-line.
    There may be just as much wasted space due to the support of over-sized
    constants in-line with 48-bit parcels. A 32-bit constant uses 48 bits,
    wasting 16 bits of storage.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 8 00:34:37 2025
    From Newsgroup: comp.arch

    <snip>
    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    <snip>

    I multiplied PI by 10^31 and ran it through the int to decimal-float converter. It should give the same sequence of digits although the
    exponent may be off.

    2e078c2aeb53b3fbb4e262d0dab5e680

    The sequence of digits is the same, except it begins C2 instead of C1.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2