• Concertina II Instead

    From quadi@quadibloc@ca.invalid to comp.arch on Sat Mar 7 00:15:09 2026
    From Newsgroup: comp.arch

    I had considered proceeding on to a CISC Concertina III. However, after starting to look at that project, I found that there was a lack of opcode space in one spot.
    Just the other day, though, it occurred to me that there was a possibility
    of improving Concertina II.
    After a long period of changing it, because I was dissatisfied with the various options for shortening the memory-reference instructions by one
    bit, I decided to leave the memory-reference instructions at their full length, and claim opcode space from somewhere less important: I replaced 14-bit short instructions by 13-bit short instructions.
    What occurred to me was that I could instead keep the full-length memory-reference instructions and have 14-bit short instructions, and fit
    everything else into space left by unused opcodes for 14-bit short instructions.
    In order to make that work, I had to alter the 14-bit short instructions a bit. I re-ordered the fields in the shift instructions, so that I could
    put the supervisor call instruction in with them, and the 14-bit branch instructions now only came with a three-bit field to select the condition
    they could test.
    That let me fit all the instructions in.
    The opcode space for block headers, however, was reduced. Which is not
    really that much of a bad thing; it means that now the block headers will
    be pared down (to just one!) and thus greatly simplified.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Mar 7 19:00:02 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    I had considered proceeding on to a CISC Concertina III. However, after starting to look at that project, I found that there was a lack of opcode space in one spot.
    Just the other day, though, it occurred to me that there was a possibility of improving Concertina II.
    After a long period of changing it, because I was dissatisfied with the various options for shortening the memory-reference instructions by one
    bit, I decided to leave the memory-reference instructions at their full length, and claim opcode space from somewhere less important: I replaced 14-bit short instructions by 13-bit short instructions.
    What occurred to me was that I could instead keep the full-length memory-reference instructions and have 14-bit short instructions, and fit everything else into space left by unused opcodes for 14-bit short instructions.
    In order to make that work, I had to alter the 14-bit short instructions a bit. I re-ordered the fields in the shift instructions, so that I could
    put the supervisor call instruction in with them, and the 14-bit branch instructions now only came with a three-bit field to select the condition they could test.

    I admire your effort.

    That let me fit all the instructions in.

    I smell danger--running out of OpCode space early in the design.
    After the general conceptualization of the ISA you should have
    half of the OpCode space available for future additions !!.

    The opcode space for block headers, however, was reduced. Which is not really that much of a bad thing; it means that now the block headers will
    be pared down (to just one!) and thus greatly simplified.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 8 01:14:42 2026
    From Newsgroup: comp.arch

    On Sat, 07 Mar 2026 19:00:02 +0000, MitchAlsup wrote:

    I smell danger--running out of OpCode space early in the design.
    After the general conceptualization of the ISA you should have half of
    the OpCode space available for future additions !!.

    Well, while it is true that only 1/64th of the opcode space for 32-bit instructions is left, my current plan is to use 1/128th for headers, and
    the other 1/128th for instructions longer than 32 bits. Which means that
    there is still space for 511 times as many instructions as are already defined, even if I never went beyond 48 bits.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 8 06:08:14 2026
    From Newsgroup: comp.arch

    On Sun, 08 Mar 2026 01:14:42 +0000, quadi wrote:

    On Sat, 07 Mar 2026 19:00:02 +0000, MitchAlsup wrote:

    I smell danger--running out of OpCode space early in the design. After
    the general conceptualization of the ISA you should have half of the
    OpCode space available for future additions !!.

    Well, while it is true that only 1/64th of the opcode space for 32-bit instructions is left, my current plan is to use 1/128th for headers, and
    the other 1/128th for instructions longer than 32 bits. Which means that there is still space for 511 times as many instructions as are already defined, even if I never went beyond 48 bits.

    However, the lack of opcode space did cause me one problem. Previously, I
    had a type of header which started with the four bits 1111. This was
    followed by fourteen two-bit prefixes, which applied to every 16 bits remaining in the 256-bit code block.

    They indicated:

    00 - 17-bit instruction, starting with 0
    01 - 17-bit instruction, starting with 1
    10 - start of a 32-bit or longer instruction
    11 - not the start of an instruction, don't start decoding here.
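    The prefix walk above is simple enough to sketch. A minimal illustration in Python (the function names and packing details are mine, not from the post), assuming a header of four fixed bits plus fourteen 2-bit prefixes, one per remaining 16-bit unit of the 256-bit block:

```python
# Illustrative sketch, not the author's code: interpreting the fourteen
# two-bit prefixes of a block header, one per 16-bit unit of the block.

def decode_prefixes(prefix_bits: int):
    """prefix_bits holds fourteen 2-bit codes, most significant pair first.
    Returns a list of 14 codes, each in 0..3."""
    return [(prefix_bits >> (2 * (13 - i))) & 0b11 for i in range(14)]

def instruction_starts(codes):
    """16-bit-unit indices where an instruction begins: codes 00/01 are
    17-bit short instructions, 10 starts a 32-bit-or-longer instruction,
    and 11 marks a unit that is not the start of an instruction."""
    return [i for i, c in enumerate(codes) if c != 0b11]

# Example: two short instructions, then a 32-bit instruction (10 followed by 11)
codes = decode_prefixes(0b00_01_10_11_00_00_00_00_00_00_00_00_00_00)
assert codes[:4] == [0b00, 0b01, 0b10, 0b11]
assert instruction_starts(codes)[:3] == [0, 1, 2]
```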

    Now, with only 1/64th of the opcode space left, the block header starts
    with a minimum of six fixed bits.

    A 10 prefix has to be followed by an 11 prefix. I thought of perhaps coming
    up with a very elaborate coding scheme to take advantage of this to save
    the bits I needed.

    But I decided to go with a much simpler option instead. A bit of
    compressive coding is still needed, but now the scheme is simple. I just switched from 17-bit short instructions to 16-bit instructions for code
    with mixed-length instructions. In some respects, the limitations of 16-bit instructions are complementary to those of 15-bit instructions, the ones that can occur in pairs within 32-bit instruction code, and so the two
    types can be mixed in a block to somewhat mitigate their limitations.

    Three bits can encode a pair of prefixes that each have only three possibilities; since the start of a 32-bit or longer instruction can only be followed by a 16-bit extent that is not decoded, I only need seven values, not nine.
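    Counting the combinations makes the claim concrete. A sketch (my own enumeration, not code from the post), assuming each prefix of the pair takes one of three values, except that a start-of-32-bit prefix forces the following 16-bit extent:

```python
# Sketch of the seven-valued prefix pair described above. Each prefix has
# three possibilities, but when the first of a pair starts a 32-bit or
# longer instruction, the second half is forced, so only 7 of the 9
# combinations are distinct -- and 7 values fit in 3 bits.

PREFIXES = ["short0", "short1", "start32"]  # names are mine, illustrative

pairs = []
for first in PREFIXES:
    if first == "start32":
        pairs.append((first, "continuation"))  # second half is forced
    else:
        for second in PREFIXES:
            pairs.append((first, second))

assert len(pairs) == 7        # seven values, not nine
assert len(pairs) <= 2 ** 3   # fits in three bits
```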

    So now I can combine 15-bit short instructions (rather than the
    drastically limited 14-bit short instructions) in plain 32-bit instruction code, without having some kind of awkward restriction on the memory-
    reference instructions... and everything fits.

    Instead of having sixteen types of headers, I have retreated to just three types of headers: one for a zero-overhead header, one for variable-length instructions, and one for the VLIW features. There is, however, space for more, and I am notoriously bad at resisting temptation.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 8 05:13:18 2026
    From Newsgroup: comp.arch

    On 3/7/2026 1:00 PM, MitchAlsup wrote:

    quadi <quadibloc@ca.invalid> posted:

    I had considered proceeding on to a CISC Concertina III. However, after
    starting to look at that project, I found that there was a lack of opcode
    space in one spot.
    Just the other day, though, it occurred to me that there was a possibility of improving Concertina II.
    After a long period of changing it, because I was dissatisfied with the
    various options for shortening the memory-reference instructions by one
    bit, I decided to leave the memory-reference instructions at their full
    length, and claim opcode space from somewhere less important: I replaced
    14-bit short instructions by 13-bit short instructions.
    What occurred to me was that I could instead keep the full-length memory-
    reference instructions and have 14-bit short instructions, and fit
    everything else into space left by unused opcodes for 14-bit short
    instructions.
    In order to make that work, I had to alter the 14-bit short instructions a bit. I re-ordered the fields in the shift instructions, so that I could
    put the supervisor call instruction in with them, and the 14-bit branch
    instructions now only came with a three-bit field to select the condition
    they could test.

    I admire your effort.

    That let me fit all the instructions in.

    I smell danger--running out of OpCode space early in the design.
    After the general conceptualization of the ISA you should have
    half of the OpCode space available for future additions !!.


    In my case, I still have the F3 and F9 blocks unused.
    * F0, F1, F2, F8: In Use
    * F3, F9: Still Unused
    * FE, FF: Jumbo Prefixes

    N/E in XG3:
    F4..F7, F8..FD
    FE/FF: Remapped to FA/FB (with FA/FB becoming N/E).

    Where, otherwise (XG1/XG2):
    F4..F7: Repeat F0..F3, but with the "WEX" flag set.
    FC/FD: Repeat F8/F9, with WEX flag set.


    The EA/EB, and EE/EF blocks are effectively unused in XG3.
    XG1/XG2: Had encoded some PrWEX encodings.

    Where:
    E0..E3, E8/E9: Same as F0..F3, F8/F9, but ?T.
    E4..E7, EC/ED: Same as F0..F3, F8/F9, but ?F.

    Non-Ex/Fx:
    XG1: 16-bit ops go here.
    XG2: N/E, bits used to extend register fields to 6 bits.


    Had looked into using EA/EB and EE/EF for pair-encoded instructions in
    XG3, but the gains from the pair-encoded instructions wouldn't really be enough to justify the costs of having them. The constraints imposed by a
    pair encoding make the gains far less than with a 16/32 encoding scheme.


    The opcode space for block headers, however, was reduced. Which is not
    really that much of a bad thing; it means that now the block headers will
    be pared down (to just one!) and thus greatly simplified.

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 8 18:43:38 2026
    From Newsgroup: comp.arch

    I have now begun uploading the description of the revised Concertina II instruction set to my web site. The block headers, the 32-bit instruction formats, and the 16-bit and 15-bit instruction formats are now all present
    at

    http://www.quadibloc.com/arch/ct25int.htm

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Mar 8 20:44:55 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Sun, 08 Mar 2026 01:14:42 +0000, quadi wrote:

    On Sat, 07 Mar 2026 19:00:02 +0000, MitchAlsup wrote:

    I smell danger--running out of OpCode space early in the design. After
    the general conceptualization of the ISA you should have half of the
    OpCode space available for future additions !!.

    Well, while it is true that only 1/64th of the opcode space for 32-bit instructions is left, my current plan is to use 1/128th for headers, and the other 1/128th for instructions longer than 32 bits. Which means that there is still space for 511 times as many instructions as are already defined, even if I never went beyond 48 bits.

    However, the lack of opcode space did cause me one problem. Previously, I had a type of header which started with the four bits 1111. This was followed by fourteen two-bit prefixes, which applied to every 16 bits remaining in the 256-bit code block.

    They indicated:

    00 - 17-bit instruction, starting with 0
    01 - 17-bit instruction, starting with 1
    10 - begin instruction with 32-bit parcel
    11 - append another 32-bit instruction parcel.

    11 simply adds another 32-bits to the current instruction parcel.
    This gives access to {16, 32, 48, 64, 80, 96, ...}-bit instructions.

    This can be treeified rather easily for wide decode.
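    One possible reading of this variant (the per-16-bit granularity of the append code is an assumption on my part, chosen so the lengths match the {16, 32, 48, ...} list above):

```python
# Hedged sketch of the length-walk, not code from the post: one two-bit
# code per 16-bit unit. 00/01 mark a 16-bit instruction, 10 begins a
# longer instruction, and each following 11 extends the current
# instruction by another 16-bit unit.

def instruction_lengths(codes):
    lengths = []
    for c in codes:
        if c in (0b00, 0b01):
            lengths.append(16)        # 16-bit instruction
        elif c == 0b10:
            lengths.append(16)        # first 16 bits of a longer instruction
        elif c == 0b11:
            if not lengths:
                raise ValueError("11 cannot start a block")
            lengths[-1] += 16         # extend the current instruction
    return lengths

assert instruction_lengths([0b00, 0b10, 0b11, 0b10, 0b11, 0b11, 0b01]) \
       == [16, 32, 48, 16]
```

    Because each unit's code says only "start" or "continue", a wide decoder can compute all start points in parallel from the codes alone, which is the treeification being alluded to.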


    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 8 17:20:33 2026
    From Newsgroup: comp.arch

    On 3/8/2026 3:44 PM, MitchAlsup wrote:

    quadi <quadibloc@ca.invalid> posted:

    On Sun, 08 Mar 2026 01:14:42 +0000, quadi wrote:

    On Sat, 07 Mar 2026 19:00:02 +0000, MitchAlsup wrote:

    I smell danger--running out of OpCode space early in the design. After the general conceptualization of the ISA you should have half of the
    OpCode space available for future additions !!.

    Well, while it is true that only 1/64th of the opcode space for 32-bit
    instructions is left, my current plan is to use 1/128th for headers, and the other 1/128th for instructions longer than 32 bits. Which means that there is still space for 511 times as many instructions as are already
    defined, even if I never went beyond 48 bits.

    However, the lack of opcode space did cause me one problem. Previously, I
    had a type of header which started with the four bits 1111. This was
    followed by fourteen two-bit prefixes, which applied to every 16 bits
    remaining in the 256-bit code block.

    They indicated:

    00 - 17-bit instruction, starting with 0
    01 - 17-bit instruction, starting with 1
    10 - begin instruction with 32-bit parcel
    11 - append another 32-bit instruction parcel.

    11 simply adds another 32-bits to the current instruction parcel.
    This gives access to {16, 32, 48, 64, 80, 96, ...}-bit instructions.

    This can be treeified rather easily for wide decode.


    Hmm:
    xxx0: 16 bits
    xx01: 32 bits (final)
    xx11: 32 bits (non-final)

    But, still basically the same idea:
    16/32/48/64/80/...
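    A sketch of that low-bit test (illustrative only; the function name is mine):

```python
# Length determination from the low bits of a parcel, per the scheme
# quoted above: xxx0 -> 16-bit; xx01 -> 32-bit, final; xx11 -> 32-bit,
# non-final (another parcel follows).

def parcel_info(word: int):
    """Returns (parcel_length_in_bits, is_final)."""
    if (word & 0b1) == 0:
        return (16, True)     # 16-bit instruction, always final
    if (word & 0b11) == 0b01:
        return (32, True)     # 32-bit parcel, final
    return (32, False)        # 32-bit parcel, more parcels follow

assert parcel_info(0b0110) == (16, True)
assert parcel_info(0b0101) == (32, True)
assert parcel_info(0b0111) == (32, False)
```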

    Unlike the RV schemes, scales to larger sizes without consuming an ever
    larger percentage of the encoding bits.

    The 16-bit space would only be 2/3 the size of the RV encoding space,
    but the encoding space could go a little further if used more
    efficiently (namely, not burning it on needlessly large immediate and displacement fields).

    ...



    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 9 02:53:20 2026
    From Newsgroup: comp.arch

    On Sun, 08 Mar 2026 06:08:14 +0000, quadi wrote:

    But I decided to go with a much simpler option instead. A bit of
    compressive coding is still needed, but now the scheme is simple. I just switched from 17-bit short instructions to 16-bit instructions for code
    with mixed-length instructions. In some respects, the limitations of 16-bit instructions are complementary to those of 15-bit instructions, the
    ones that can occur in pairs within 32-bit instruction code, and so the
    two types can be mixed in a block to somewhat mitigate their
    limitations.

    Mixing these two types of short instructions in a single block is... an awkward and complicated workaround.

    I've decided to drop that capability, because doing so makes more opcode
    space available for 48-bit (and longer) instructions in the variable-
    length instruction blocks. I found that certain highly desirable classes
    of 48-bit instructions are made impossible otherwise.

    Basically, designing the basic instruction set so that memory-reference instructions are not compromised, and yet 15-bit short
    instructions (instead of the very cramped 14-bit short instructions) are available, has meant that everything else beyond the basic instruction
    set is hit with severe constraints.

    The most painful one was the loss of uncompromised 17-bit short
    instructions. But 16-bit short instructions still avoid the big problem of 15-bit short instructions that many here found objectionable.

    Why is the Concertina II instruction set so cramped for opcode space? I
    think I've answered that before. I'm trying to do what hasn't been
    attempted before - have an instruction set with 16-bit displacements,
    since that's what both CISC and RISC micros have, but with addressing
    options as found in CISC, but banks of 32 registers, instead of just eight
    (or maybe 16), like RISC designs have.

    It wouldn't be surprising if there wasn't room to include both what RISC
    had extra space for, and what CISC had extra space for. But what pleases
    me is that if one makes a little extra effort... such an instruction set
    *is* achievable.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 9 03:29:20 2026
    From Newsgroup: comp.arch

    On Mon, 09 Mar 2026 02:53:20 +0000, quadi wrote:

    Mixing these two types of short instructions in a single block is... an awkward and complicated workaround.

    I've decided to drop that capability, because doing so makes more opcode space available for 48-bit (and longer) instructions in the variable-
    length instruction blocks. I found that certain highly desirable classes
    of 48-bit instructions are made impossible otherwise.

    In recent previous versions of Concertina II, it was the opcode space used
    for paired short instructions, whether 15-bit or the more recent 14-bit
    ones, that was used for long instructions. I had thought of not doing it
    this time because that space would be somewhat more fragmented, as I was
    using the last part of it for a major chunk of the 32-bit instruction set.

    But on further examination, it was clear that this objection was not due
    to any real issue, and the space would be needed desperately. Another incidental consequence is doubling the opcode space available for block headers, but that doesn't increase it enough to allow a return to 17-bit
    short instructions in variable-length instruction blocks.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 10 01:00:19 2026
    From Newsgroup: comp.arch

    On Sun, 08 Mar 2026 18:43:38 +0000, quadi wrote:

    I have now begun uploading the description of the revised Concertina II instruction set to my web site. The block headers, the 32-bit
    instruction formats, and the 16-bit and 15-bit instruction formats are
    now all present at

    http://www.quadibloc.com/arch/ct25int.htm

    The instructions longer than 32 bits have also been uploaded. They follow
    what was included with the previous iteration; 15-bit short instructions
    are not available when 16-bit instructions are available, even though the 16-bit instructions aren't a strict superset of the 15-bit ones.

    This has also let me add a small group of additional 32-bit instructions
    when in a block where different instruction lengths can be freely mixed,
    one with a Type II header.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 10 13:21:15 2026
    From Newsgroup: comp.arch

    Given that I was able to reduce the prefix for paired short instructions
    from 1111 to 11, allowing the paired short instructions to return to being
    15 bits long...

    since 14-bit short instructions are possible, then 11 could be the prefix
    for a single short instruction.

    Thus, the squish of opcode space that made this iteration of Concertina II possible _also_ makes a CISC instruction set possible. However, the short instructions and the instructions longer than 32 bits are _both_
    *severely* constrained in opcode space in the CISC mode.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 10 15:15:36 2026
    From Newsgroup: comp.arch

    On Tue, 10 Mar 2026 13:21:15 +0000, quadi wrote:

    Thus, the squish of opcode space that made this iteration of Concertina
    II possible _also_ makes a CISC instruction set possible. However, the
    short instructions and the instructions longer than 32 bits are _both_ *severely* constrained in opcode space in the CISC mode.

    And thus I had to re-think the longer instructions in CISC mode, making a tweak to their definitions so that important functionality was not lost.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Mar 10 16:57:47 2026
    From Newsgroup: comp.arch

    On 3/10/2026 8:21 AM, quadi wrote:
    Given that I was able to reduce the prefix for paired short instructions
    from 1111 to 11, allowing the paired short instructions to return to being
    15 bits long...

    since 14-bit short instructions are possible, then 11 could be the prefix
    for a single short instruction.

    Thus, the squish of opcode space that made this iteration of Concertina II possible _also_ makes a CISC instruction set possible. However, the short instructions and the instructions longer than 32 bits are _both_
    *severely* constrained in opcode space in the CISC mode.


    FWIW:
    IME, while a pair-encoding scheme can yield space savings over a pure 32/64/96 coding scheme, and avoids the misalignment issues of a
    16/32 coding scheme, a downside of pair encoding is that the potential
    space savings are significantly reduced relative to a 16/32 scheme.

    Say, for example:
    An effective 16/32 scheme can get around a 20% space savings;
    An effective pair-encoding implicitly drops to around 8%.

    Mostly because it can only save space in cases when both instructions
    can be pair encoded, versus when either instruction could be 16-bit encoded.
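    A back-of-envelope model consistent with those figures (my own illustration, not BGB's: it assumes a fraction p of instructions can shrink to 16 bits and that the two slots of a pair qualify independently):

```python
# If a fraction p of instructions can be shortened from 32 to 16 bits,
# a free 16/32 mix saves p/2 of total code size (each shortened
# instruction halves), while a pair scheme saves only when both
# instructions of an aligned pair qualify: p**2 / 2.

p = 0.4                        # assumed fraction of shortenable instructions
free_mix_savings = p / 2       # any qualifying instruction shrinks
pair_savings = (p ** 2) / 2    # both slots of the pair must qualify

assert abs(free_mix_savings - 0.20) < 1e-9   # ~20%, as quoted
assert abs(pair_savings - 0.08) < 1e-9       # ~8%, as quoted
```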

    Though, that said, pair encoding is an attractive option when the main
    other option is 32-bit only, and one already has some mechanism in place
    to deal with cracking an instruction.


    As noted before, had considered this within the context of my XG3
    encoding scheme, but ended up deciding against it because the savings
    seemed like they were too small to make it worthwhile.

    ...

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed Mar 11 02:04:52 2026
    From Newsgroup: comp.arch

    On Tue, 10 Mar 2026 16:57:47 -0500, BGB wrote:

    FWIW:
    IME, while a pair-encoding scheme can yield space savings over a pure 32/64/96 coding scheme, and avoids the misalignment issues of a
    16/32 coding scheme, a downside of pair encoding is that the potential space savings are significantly reduced relative to a 16/32 scheme.

    Say, for example:
    An effective 16/32 scheme can get around a 20% space savings;
    An effective pair-encoding implicitly drops to around 8%.

    Mostly because it can only save space in cases when both instructions
    can be pair encoded, versus when either instruction could be 16-bit
    encoded.

    Though, that said, pair encoding is an attractive option when the main
    other option is 32-bit only, and one already has some mechanism in place
    to deal with cracking an instruction.

    I am aware of the issue you are raising here, and I certainly am aware
    that forcing the programmer to choose shorter instructions in pairs limits
    the potential space savings.

    So why did I use this mechanism?

    For one thing, I used it in order to simplify fetching and decoding instructions. If every instruction is 32 bits long, neither longer nor shorter, then it's very easy to fetch a block of memory, and decode all
    the instructions in it in parallel - because you already know where they
    begin and end.

    For another, look at the way I squeezed a short instruction - which I
    would prefer to be 17 bits long - into 15 bits. The register banks are
    divided into four groups, and the two operands must both be registers from
    the same group in a 15-bit instruction.

    That shows something about the type of code I expect to execute. Code
    where instructions belonging to multiple threads (of a sort, not real
    threads that execute asynchronously) are interleaved, so as to make it
    easier to execute the instructions simultaneously in a pipeline.

    That gives a bit of flexibility in instruction ordering, so it makes it
    easier to pair up short instructions.

    And, as further evidence that I'm aware that having to use short
    instructions in pairs is a disadvantage... this, along with the desire to
    use pseudo-immediates (because I do accept Mitch Alsup's reasoning that getting data almost for free from the instruction stream beats an
    additional memory access, with all the overhead that entails) led me to
    set up the block header mechanism (which Mitch Alsup rightly criticizes; I just felt it was the least bad way to achieve what I wanted) so that
    fetching instructions to decode _remained_ as naively straightforward _as
    if_ all the instructions were the same length... even when the
    instructions were allowed to vary in length.

    And so with what are currently the Type I, II, and IV headers, the
    instruction stream consists of variable length instructions; short instructions can be placed individually at any position in the instruction stream.

    There's even a CISC mode now, since I've squeezed things so much that this
    ISA is capable, with slight tweaking, of just having plain variable length instructions without blocks. But it's just barely capable of that; in that form, the short instructions only have 14 bits to play with, so the
    repertoire of those instructions is limited, and therefore the potential
    space savings they provide are smaller.

    Of course when block headers allow 17-bit instructions at arbitrary positions... that _would_ maximize space savings, but there's the overhead
    of the space the block header takes up. So any choice that is made
    involves tradeoffs.

    I also have a goal of making the ISA simple to implement, so in the CISC
    mode, instead of just saying "leave the register field all zeroes, and put
    the immediate right after the instruction", I have said that the pseudo-immediates aren't available in CISC mode. That avoids having to decode anything but the leading bits of the instruction in order to determine
    where the next instruction starts.

    It isn't the greatest variable-length instruction architecture; that capability is basically an afterthought appended to an architecture
    intended to be used with block headers.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Mar 13 05:02:03 2026
    From Newsgroup: comp.arch

    On Tue, 10 Mar 2026 15:15:36 +0000, quadi wrote:
    On Tue, 10 Mar 2026 13:21:15 +0000, quadi wrote:

    Thus, the squish of opcode space that made this iteration of Concertina
    II possible _also_ makes a CISC instruction set possible. However, the
    short instructions and the instructions longer than 32 bits are _both_
    *severely* constrained in opcode space in the CISC mode.

    And thus I had to re-think the longer instructions in CISC mode, making
    a tweak to their definitions so that important functionality was not
    lost.

    I've also tweaked the short instructions. The original 14-bit instructions were made to be part of the standard Concertina II instruction set. This
    time, they're part of CISC mode. So what are they competing with? Other
    CISC architectures!

    That insight led me to switch from a five-bit opcode field, providing
    only a restricted set of operate instructions, to including all the basic operate instructions - but having the short instructions in CISC
    mode work only with the first eight registers of each register bank.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Mar 13 16:08:33 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Tue, 10 Mar 2026 15:15:36 +0000, quadi wrote:
    On Tue, 10 Mar 2026 13:21:15 +0000, quadi wrote:

    Thus, the squish of opcode space that made this iteration of Concertina
    II possible _also_ makes a CISC instruction set possible. However, the
    short instructions and the instructions longer than 32 bits are _both_
    *severely* constrained in opcode space in the CISC mode.

    And thus I had to re-think the longer instructions in CISC mode, making
    a tweak to their definitions so that important functionality was not
    lost.

    I've also tweaked the short instructions. The original 14-bit instructions

    Why are you not counting the header overhead as part of the instruction ??

    It seems to me that a 14-bit instruction with a 2-bit header (descriptor)
    has a 16-bit footprint--and in the end that is what matters.

    were made to be part of the standard Concertina II instruction set. This time, they're part of CISC mode. So what are they competing with? Other
    CISC architectures!

    That insight led me to switch from a five-bit opcode field, providing only a restricted set of operate instructions, to including all the basic operate instructions - but having the short instructions in CISC
    mode work only with the first eight registers of each register bank.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Mar 13 16:28:39 2026
    From Newsgroup: comp.arch

    On Fri, 13 Mar 2026 16:08:33 +0000, MitchAlsup wrote:

    Why are you not counting the header overhead as part of the instruction
    ??

    It seems to me that a 14-bit instruction with a 2-bit header
    (descriptor) has a 16-bit footprint--and in the end that is what
    matters.

    I do count it, for some purposes. However, I make a distinction, as well, between several different types of short instruction, all of which have a 16-bit footprint, by the number of bits available to specify the
    instruction, since it is in respect of this attribute that the various
    short instruction formats differ.

    I mean, I could call them Type A, B, C, and D but that would be
    unnecessarily confusing.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Mar 13 16:33:43 2026
    From Newsgroup: comp.arch

    On Fri, 13 Mar 2026 16:28:39 +0000, quadi wrote:

    On Fri, 13 Mar 2026 16:08:33 +0000, MitchAlsup wrote:

    Why are you not counting the header overhead as part of the instruction
    ??

    It seems to me that a 14-bit instruction with a 2-bit header
    (descriptor) has a 16-bit footprint--and in the end that is what
    matters.

    I do count it, for some purposes. However, I make a distinction, as
    well,
    between several different types of short instruction, all of which have
    a 16-bit footprint, by the number of bits available to specify the instruction, since it is in respect of this attribute that the various
    short instruction formats differ.

    I mean, I could call them Type A, B, C, and D but that would be
    unnecessarily confusing.

    On top of that, in addition to 14-bit, 15-bit, and 16-bit short
    instructions, I have 17-bit short instructions. Their footprint, in the sequence of instructions, is 16 bits, because the first bit is in the
    header instead. Except that the fact that a short instruction is in that
    spot is indicated by another bit. So shall we give it an 18-bit footprint,
    and say the 16-bit instructions have a 17-bit footprint too, while the 15-bit and 14-bit instructions both have 16-bit footprints?

    Their nominal footprints are all 16 bits, since for purposes of branching
    to them, their location is deemed to be a halfword, even in the case of
    15-bit instructions, where the fields they're in within a 32-bit slot
    don't perfectly align with the 16-bit halfwords - the first one extends
    one bit into the second 16 bits.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Mar 13 16:43:48 2026
    From Newsgroup: comp.arch

    On Fri, 13 Mar 2026 16:33:43 +0000, quadi wrote:

    Their nominal footprints are all 16 bits, since for purposes of
    branching to them, their location is deemed to be a halfword, even in
    the case of 15-bit instructions, where the fields they're in within a
    32-bit slot don't perfectly align with the 16-bit halfwords - the first
    one extends one bit into the second 16 bits.

    This discussion, highlighting to me how the footprints of my short
    instructions are big, messy, and confusing, momentarily suggested to me
    that I could name the architecture something else.

    But I think I'll keep "Concertina II", rather than calling it Sasquatch
    (or Bigfoot, or Yeti).

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Mar 13 10:24:41 2026
    From Newsgroup: comp.arch

    On 3/13/2026 9:33 AM, quadi wrote:
    On Fri, 13 Mar 2026 16:28:39 +0000, quadi wrote:

    On Fri, 13 Mar 2026 16:08:33 +0000, MitchAlsup wrote:

    Why are you not counting the header overhead as part of the instruction
    ??

    It seems to me that a 14-bit instruction with a 2-bit header
    (descriptor) has a 16-bit footprint--and in the end that is what
    matters.

    I do count it, for some purposes. However, I make a distinction, as well,
    between several different types of short instruction, all of which have
    a 16-bit footprint, by the number of bits available to specify the
    instruction, since it is in respect of this attribute that the various
    short instruction formats differ.

    I mean, I could call them Type A, B, C, and D but that would be
    unnecessarily confusing.

    On top of that, in addition to 14-bit, 15-bit, and 16-bit short
    instructions, I have 17-bit short instructions. Their footprint, in the
    sequence of instructions, is 16 bits, because the first bit is in the
    header instead. Except that the fact that a short instruction is in that
    spot is indicated by another bit. So shall we give it an 18-bit footprint,
    and then the 16-bit instructions have a 17-bit footprint too, while the
    15-bit and 14-bit instructions both have 16-bit footprints?

    Do you really think a compiler writer will develop code to figure out
    which of the many code formats you have should be emitted in each source
    code situation?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Mar 13 22:59:47 2026
    From Newsgroup: comp.arch

    On Fri, 13 Mar 2026 10:24:41 -0700, Stephen Fuld wrote:

    Do you really think a compiler writer will develop code to figure out
    which of the many code formats you have should be emitted in each source
    code situation?

    Now that is a very good question.

    In general, the answer is "no", but with one exception.

    Instead, since the reason different code formats are provided is to offer
    a different fit for different application domains, what I picture, in
    general, is that a compiler writer will choose the most appropriate code
    format for a given programming language and stick to it.

    This is certainly true when it comes to using the break bits to provide
    an explicit indication of parallelism. Also, because of the danger of
    different modes changing how existing privileged code is interpreted, it
    won't be an option to switch easily back and forth from CISC mode in
    software.

    The exception is this: while the code formats don't differ in _optimality_
    in the sense that the execution time of the same instruction would differ between code formats, what does differ between them is overhead costs and
    code density.

    I don't think it would be beyond the current state of the art for a code generator to, by default, produce instructions in the code format with the smallest overhead, but then generate an instruction block in a different format when an instruction requiring that format appears in the generated instruction stream. But doing this is optional, not mandatory.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Mar 14 03:17:42 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 3/13/2026 9:33 AM, quadi wrote:
    On Fri, 13 Mar 2026 16:28:39 +0000, quadi wrote:

    On Fri, 13 Mar 2026 16:08:33 +0000, MitchAlsup wrote:

    Why are you not counting the header overhead as part of the instruction
    ??

    It seems to me that a 14-bit instruction with a 2-bit header
    (descriptor) has a 16-bit footprint--and in the end that is what
    matters.

    I do count it, for some purposes. However, I make a distinction, as
    well,
    between several different types of short instruction, all of which have
    a 16-bit footprint, by the number of bits available to specify the
    instruction, since it is in respect of this attribute that the various
    short instruction formats differ.

    I mean, I could call them Type A, B, C, and D but that would be
    unnecessarily confusing.

    On top of that, in addition to 14-bit, 15-bit, and 16-bit short
    instructions, I have 17-bit short instructions. Their footprint, in the
    sequence of instructions, is 16 bits, because the first bit is in the
    header instead. Except that the fact that a short instruction is in that
    spot is indicated by another bit. So shall we give it an 18-bit footprint,
    and then the 16-bit instructions have a 17-bit footprint too, while the
    15-bit and 14-bit instructions both have 16-bit footprints?

    Do you really think a compiler writer will develop code to figure out
    which of the many code formats you have should be emitted in each source code situation?

    In My 66000 case: it is the assembler that determines the size of the instruction, alleviating that compiler complexity.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Mar 14 14:55:08 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> schrieb:

    On top of that, in addition to 14-bit, 15-bit, and 16-bit short instructions, I have 17-bit short instructions.

    Can these follow each other in arbitrary order?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat Mar 14 22:55:04 2026
    From Newsgroup: comp.arch

    On Sat, 14 Mar 2026 14:55:08 +0000, Thomas Koenig wrote:

    quadi <quadibloc@ca.invalid> schrieb:

    On top of that, in addition to 14-bit, 15-bit, and 16-bit short
    instructions, I have 17-bit short instructions.

    Can these follow each other in arbitrary order?

    No, not at all.

    The different lengths of short instructions are each for a different type
    of block, or a different operating mode, where different amounts of opcode space are available for short instructions.

    For example:

    15-bit short instructions are for the normal RISC-like instruction set.
    The two bits 11 may begin a 32-bit instruction slot; the remaining 30
    bits are then split into two 15-bit short instructions.

    14-bit short instructions are for CISC mode, where 11 starts a 16-bit instruction standing alone.

    17-bit short instructions are for a block with a header that includes two
    bits of header information for each 16 bits in the block; two of the combinations indicate a short instruction and include an extra bit to add
    to it.

    So they can't be mixed, they're part of different categories of
    instruction stream.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat Mar 14 22:58:24 2026
    From Newsgroup: comp.arch

    On Sat, 14 Mar 2026 14:55:08 +0000, Thomas Koenig wrote:
    quadi <quadibloc@ca.invalid> schrieb:

    On top of that, in addition to 14-bit, 15-bit, and 16-bit short
    instructions, I have 17-bit short instructions.

    Can these follow each other in arbitrary order?

    Even I - who have indeed, in some iterations of Concertina II (and,
    sadly, again in the current one), designed some really weird ISAs - would
    shrink from designing an ISA in which the answer to that question would
    be "Yes".

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Mar 14 23:36:31 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> schrieb:
    On Sat, 14 Mar 2026 14:55:08 +0000, Thomas Koenig wrote:
    quadi <quadibloc@ca.invalid> schrieb:

    On top of that, in addition to 14-bit, 15-bit, and 16-bit short
    instructions, I have 17-bit short instructions.

    Can these follow each other in arbitrary order?

    Even I - who have indeed, in some iterations of Concertina II (and,
    sadly, again in the current one), designed some really weird ISAs - would
    shrink from designing an ISA in which the answer to that question would
    be "Yes".

    What you are doing makes no sense, then - compiler writers would
    have to use different instructions depending on what block they
    happen to be in. Those blocks will only coincide with basic
    blocks of the program in rare cases, and final placement of
    instructions happens quite late.

    This way lies madness (for the compiler writer, at least).
    The only viable solution would be to always use the same
    mode, and in that case all the block overhead would be wasted.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat Mar 14 23:48:08 2026
    From Newsgroup: comp.arch

    On Sat, 14 Mar 2026 22:58:24 +0000, quadi wrote:

    Even I - who have indeed, in some iterations of Concertina II (and,
    sadly, again in the current one), designed some really weird ISAs - would
    shrink from designing an ISA in which the answer to that question would
    be "Yes".

    What is the Concertina II architecture, and why has it gone through so
    many iterations?

    The basic idea behind the Concertina II architecture has been:

    Start from a 32-bit RISC-like architecture.

    The basic form of a memory-reference instruction is:

    (Opcode) 5 bits for a load-store operation
    (Destination Register) 5 bits for one of two 32-register register banks,
    one for integers, one for floats
    (Index Register) 3 bits - use only seven of the 32 integer registers, so
    the instruction will fit in 32 bits
    (Base Register) 3 bits - as above
    (Displacement) 16 bits - as is conventional for most CISC and RISC microprocessors

    There are 24 opcodes normally used, so 1/4 of the opcode space is
    available for everything else.
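    The field widths above sum exactly to 32 bits (5 + 5 + 3 + 3 + 16). As a
    sanity check, here is a minimal packer sketch for this format; the post
    gives only the widths, so the field ordering and bit positions below are
    assumptions made for illustration.

```python
# Hypothetical packer for the 32-bit memory-reference format described
# above. Field widths come from the post (5+5+3+3+16 = 32); the field
# ordering and bit positions are assumptions made for this sketch.
def encode_memref(opcode, dest, index, base, disp):
    assert 0 <= opcode < 2**5 and 0 <= dest < 2**5
    assert 0 <= index < 2**3 and 0 <= base < 2**3
    assert 0 <= disp < 2**16
    return (opcode << 27) | (dest << 22) | (index << 19) | (base << 16) | disp

word = encode_memref(opcode=5, dest=12, index=3, base=1, disp=0x1234)
assert word.bit_length() <= 32   # everything fits in one 32-bit slot
```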

    I had wanted to provide all the advantages and features of popular CISC architectures too, though.

    This meant trying to squeeze stuff into not enough opcode space. So, for example, on the IBM System/360, there are arithmetic instructions between
    two registers that only take up 16 bits.

    With 32-bit register banks, a short instruction ought to look like this:

    (Opcode) 7 bits
    (Destination Register) 5 bits
    (Source Register) 5 bits

    But that's 17 bits long.

    So what I usually did was place restrictions on the source and destination registers so that the short instructions could fit into 15 bits. A 32-bit instruction slot could start with 11 and then use 1/4 of the opcode space, containing a pair of these.
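    The pairing just described can be sketched as bit arithmetic. The 11
    prefix and the two 15-bit fields come from the post; which of the pair
    occupies the high half of the slot is an assumption.

```python
# Sketch of the paired slot described above: a 32-bit slot whose two
# leading bits are 11 carries two 15-bit short instructions
# (2 + 15 + 15 = 32). Placing "first" in the high half is an assumption.
def pack_pair(first, second):
    assert 0 <= first < 2**15 and 0 <= second < 2**15
    return (0b11 << 30) | (first << 15) | second

slot = pack_pair(0x1ABC, 0x2DEF)
assert slot >> 30 == 0b11                # slot is recognizably a pair
assert (slot >> 15) & 0x7FFF == 0x1ABC   # first instruction recovered
assert slot & 0x7FFF == 0x2DEF           # second instruction recovered
```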

    But the basic memory-reference instructions take up 3/4 of the opcode
    space. I still need several other 32-bit instructions!

    So what I had been doing was placing various restrictions on the basic memory-reference instructions to make them one bit shorter.

    And whenever I did that, I just wasn't quite satisfied. I tried different
    ways of squeezing those instructions, but each one I tried was just too annoying, I felt.

    Finally, in the iteration of Concertina II which just preceded the current one, I decided that the short instructions were of a lower priority than
    the other things I was compromising to squeeze out more opcode space.

    So instead of two 15-bit instructions plus 11, I went with two 14-bit instructions plus 1111.

    I wasn't really happy with what the 14-bit instructions could do, but I
    felt that I had to stop somewhere.

    But then I had a bright idea. If I took the 15-bit instructions, and just shaved off a little opcode space from them, then the pairs of instructions plus 11 as leading bits... would leave behind enough opcode space for all
    the extra 32-bit instructions besides the basic memory-reference
    instructions.

    I really could have both uncompromised memory-reference instructions and 15-bit short instructions at the same time in the basic instruction set without a header to extend it!

    That's what sparked the current iteration. And then I found I had made a mistake, and some groups of instructions overlapped, using the same part
    of opcode space - I was able to patch things up, but the result is ugly
    and messy. So I'm going to need to re-examine it and figure out where to
    go next.

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 15 01:14:55 2026
    From Newsgroup: comp.arch

    On Sat, 14 Mar 2026 23:36:31 +0000, Thomas Koenig wrote:

    This way lies madness (for the compiler writer, at least).
    The only viable solution would be to always use the same mode, and in
    that case all the block overhead would be wasted.

    Not really. A compiler for a language intended for general-purpose
    computation could use one mode, and one generating code for embedded
    systems could use another mode.

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 15 02:26:22 2026
    From Newsgroup: comp.arch

    On 3/10/2026 9:04 PM, quadi wrote:
    On Tue, 10 Mar 2026 16:57:47 -0500, BGB wrote:

    FWIW:
    IME, while a pair encoding scheme can result in space savings over a pure
    32/64/96 coding scheme, and avoids the misalignment issues of a
    16/32 coding scheme, a downside of pair encoding is that the potential
    space savings are significantly reduced relative to a 16/32 scheme.

    Say, for example:
    An effective 16/32 scheme can get around a 20% space savings;
    An effective pair-encoding implicitly drops to around 8%.

    Mostly because it can only save space in cases when both instructions
    can be pair encoded, versus when either instruction could be 16-bit
    encoded.
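    BGB's two figures are consistent with a simple model: if a compressed
    instruction saves half of a 32-bit slot and a fraction p of instructions
    can be shortened, a 16/32 scheme saves p/2 overall, while a pair scheme
    saves only when both members of a pair compress. The value p = 0.40 below
    is an assumed figure chosen to reproduce the numbers in the post, and the
    independence of adjacent instructions is also an assumption.

```python
# Quick model of the savings comparison above. p is a hypothetical
# compressibility rate; adjacent instructions are assumed independent.
p = 0.40
save_16_32 = p * 0.5         # each 16-bit op saves half a 32-bit slot
save_pair = (p * p) * 0.5    # both ops of a pair must be compressible
print(f"16/32: {save_16_32:.0%}, pair: {save_pair:.0%}")  # 16/32: 20%, pair: 8%
```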

    Though, that said, pair encoding is an attractive option when the main
    other option is 32-bit only, and one already has some mechanism in place
    to deal with cracking an instruction.

    I am aware of the issue you are raising here, and I certainly am aware
    that forcing the programmer to choose shorter instructions in pairs limits the potential space savings.

    So why did I use this mechanism?

    For one thing, I used it in order to simplify fetching and decoding instructions. If every instruction is 32 bits long, neither longer nor shorter, then it's very easy to fetch a block of memory, and decode all
    the instructions in it in parallel - because you already know where they begin and end.


    Similar in my case.

    I ended up with a superscalar that currently still only really works
    with 32-bit instructions, with both the 16 and 64/96 bit cases
    effectively dropping back to scalar operation.


    Pair encoding could allow for a faster option that doesn't involve some
    of the costs of full 16-bit ops.

    Issue being reduced effectiveness.


    For another, look at the way I squeezed a short instruction - which I
    would prefer to be 17 bits long - into 15 bits. The register banks are divided into four groups, and the two operands must both be registers from the same group in a 15-bit instruction.

    That shows something about the type of code I expect to execute: code
    where instructions belonging to multiple threads (of a sort, not real
    threads that execute asynchronously) are interleaved, so as to make it
    easier to execute the instructions simultaneously in a pipeline.

    That gives a bit of flexibility in instruction ordering, so it makes it easier to pair up short instructions.

    And, as further evidence that I'm aware that having to use short
    instructions in pairs is a disadvantage... this, along with the desire to
    use pseudo-immediates (because I do accept Mitch Alsup's reasoning that getting data almost for free from the instruction stream beats an
    additional memory access, with all the overhead that entails) led me to
    set up the block header mechanism (which Mitch Alsup rightly criticizes; I just felt it was the least bad way to achieve what I wanted) so that
    fetching instructions to decode _remained_ as naively straightforward _as
    if_ all the instructions were the same length... even when the
    instructions were allowed to vary in length.

    And so with what are currently the Type I, II, and IV headers, the instruction stream consists of variable length instructions; short instructions can be placed individually at any position in the instruction stream.

    There's even a CISC mode now, since I've squeezed things so much that this ISA is capable, with slight tweaking, of just having plain variable length instructions without blocks. But it's just barely capable of that; in that form, the short instructions only have 14 bits to play with, so the repertoire of those instructions is limited, and therefore the potential space savings they provide is less.

    Of course when block headers allow 17-bit instructions at arbitrary positions... that _would_ maximize space savings, but there's the overhead
    of the space the block header takes up. So any choice that is made
    involves tradeoffs.

    I also have a goal of making the ISA simple to implement, so in the CISC mode, instead of just saying "leave the register field all zeroes, and put the immediate right after the instruction", I have said that the pseudo- immediates aren't available in CISC mode. That avoids having to decode anything but the leading bits of the instruction in order to determine
    where the next instruction starts.

    It isn't the greatest variable-length instruction architecture; that capability is basically an afterthought appended to an architecture
    intended to be used with block headers.


    Man, I am not sure why "find something that works, and makes sense, and
    stick with it" poses such a problem...


    Oh well, in my case, I had recently been having issues with being
    depressed, and then ended up writing, among other things, an overly
    long and slightly over-elaborate Rainbow Brite AU fanfic: https://github.com/cr88192/bgbtech_fiction/blob/main/stories/2026-03-04_Murkwell0.txt

    Usual annoyance with GitHub's lack of word wrap as an option for viewing
    text files (with me recently going and making a dedicated repo
    for dumping my fiction stories into).


    Granted, it is possible some people might be like, "WTF? Why Rainbow Brite?" But, alas...

    Turned into some sort of weird take on the Science-Fantasy genre.
    Where I ended up sitting around trying to make the magic system
    "physically coherent". As in, it obeys similar rules to real-world
    physics, yet still tries to account for the sort of weirdness seen in
    the cartoons as if the cartoon could exist in a reality-like
    environment; though would still require a certain level of requisite hand-waving, but wanted something beyond the usual "well, string
    together a bunch of random words and call it good" (and, preferably, a
    system that won't just collapse in on itself in all of 5 seconds of
    looking at it). Trying to come up with an "actually coherent" fantasy
    magic system being a little harder than it may seem on face value (more
    so when wanting it to allow for "wooden space ships in space" levels of wackiness).


    Then the story is mostly related to the environmental effects of
    living in and using such a magic system as it was portrayed within the cartoons (with the story mostly based on the 1985 series), ...

    Though it isn't an exact match, it more or less does assume that the
    events of the 1985 series happened within the timeline.


    Did decide to leave something out, where trying to compare/contrast the physical realism in shows like "Star Trek", and it started coming off
    like I was trying to crap on "Trek", which wasn't really the point (even
    if at times it sits closer to the Fantasy end of the spectrum in terms
    of scientific realism, and is in some ways almost the more permissive
    setting here...).

    So, I guess I sort of ended up trying to go the opposite direction:
    Start with an obviously fantastical setting and try to nail down some
    rules and similar.

    Show is almost literal "rainbows and unicorns" apart from the ironic
    lack of unicorns (it was rainbows and talking horses instead...).
    Nothing says one couldn't add unicorns though.


    Well, my own sort of "pointless time waste", but the main role of
    writing this was more trying to distract myself from feeling depressed
    (and my usual stuff was failing to provide sufficient self-distraction).

    ...


    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Mar 15 08:35:41 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> schrieb:
    On Sat, 14 Mar 2026 23:36:31 +0000, Thomas Koenig wrote:

    This way lies madness (for the compiler writer, at least).
    The only viable solution would be to always use the same mode, and in
    that case all the block overhead would be wasted.

    Not really. A compiler for a language intended for general-purpose computation could use one mode, and one generating code for embedded
    systems could use another mode.

    So the space for the headers will be wasted. It can be assumed
    that the vast majority of programs are larger than 256 bits.

    Given that the vast majority of programs work very well on
    RISC, what programming languages did you have in mind for your
    CISC-mode, and why is your CISC mode supposed to be better at these
    particular programming languages including their runtime libraries?
    Runtime libraries need not be in the primary language; for example,
    gfortran's runtime library is written in C.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Sun Mar 15 14:35:00 2026
    From Newsgroup: comp.arch

    In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi)
    wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction
    bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 15 17:43:00 2026
    From Newsgroup: comp.arch

    On Sun, 15 Mar 2026 08:35:41 +0000, Thomas Koenig wrote:
    quadi <quadibloc@ca.invalid> schrieb:
    On Sat, 14 Mar 2026 23:36:31 +0000, Thomas Koenig wrote:

    This way lies madness (for the compiler writer, at least).
    The only viable solution would be to always use the same mode, and in
    that case all the block overhead would be wasted.

    Not really. A compiler for a language intended for general-purpose
    computation could use one mode, and one generating code for embedded
    systems could use another mode.

    So the space for the headers will be wasted. It can be assumed that the
    vast majority of programs are larger than 256 bits.

    I'm not sure that I understand that criticism. Since each of the possible block formats is Turing-complete, a compiler can indeed generate working programs by putting out block after block in the same format.

    Given that the vast majority of programs work very well on RISC, what programming languages did you have in mind for your CISC-mode, and why
    is your CISC mode supposed to be better at these particular programming languages including their runtime libraries?
    Runtime libraries need not be in the primary language; for example, gfortran's runtime library is written in C.

    While I dropped the CISC _mode_ from the ISA, because 14-bit short
    instructions just weren't usable, I do have a block format for
    variable-length instructions, which allows CISC-like code. So I'll
    address that.

    This provides 17-bit short instructions that can be placed anywhere. That means I am more likely to be able to use them, and so I get a gain in code compactness.

    Also, it provides instructions longer than 32 bits. I include in that set
    packed decimal and string instructions similar to those of the IBM
    System/360 architecture. So, right away, it suggests that a COBOL
    compiler might want to generate code in this kind of block.

    Other header formats stick with the pure 32-bit instruction set - but
    provide one thing not found in plain RISC: instruction slots marked as
    unused. These can contain constant values, which are referenced by
    short-range pointers that can replace source register specifications in
    operate instructions.

    Since I envisage code being fetched one 256-bit block at a time (the architecture is "really" a 256-bit long VLIW architecture, even though it looks like a sort of RISC/CISC hybrid) these constant values essentially
    have the same basic advantage as immediate operands, even though the instructions reference them with pointers. (The short-range pointers only point within the same 256-bit instruction block as the instruction is
    located in, which is why this is the case.)
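    A minimal sketch of the constant-slot idea, treating one 256-bit block
    as eight 32-bit slots. The 3-bit pointer width below is my inference
    from eight slots per block, not something the post states.

```python
# Sketch of in-block constants as described above: a 256-bit block viewed
# as eight 32-bit slots, with unused slots repurposed to hold constants.
# The 3-bit pointer width is an inference from 8 slots per block.
BLOCK_SLOTS = 8                  # 256 bits / 32 bits per slot
block = [0] * BLOCK_SLOTS
block[6] = 0x3F800000            # an unused slot holding a constant (1.0f bits)

def fetch_constant(block, ptr):
    # A short-range pointer can only name a slot within the current
    # block, so the constant arrives with the same 256-bit fetch as
    # the instruction that references it.
    return block[ptr & (BLOCK_SLOTS - 1)]

assert fetch_constant(block, 6) == 0x3F800000
```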

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 15 17:55:27 2026
    From Newsgroup: comp.arch

    On Sun, 15 Mar 2026 02:26:22 -0500, BGB wrote:

    Man, I am not sure why "find something that works, and makes sense, and
    stick with it" poses such a problem...

    If _that_ is your goal, may I suggest x86-64? It's highly popular, so
    there is a lot of software available to run on it. As well, the companies
    that make processors with this ISA have large economies of scale, and so
    their processors get to use the most advanced process nodes, and have a
    lot of effort put into optimizing their microarchitecture.

    That should make clear that my goal isn't to settle for anything that
    _works_, but rather to find something that _works better_.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Sun Mar 15 19:42:00 2026
    From Newsgroup: comp.arch

    In article <10p6rqf$1a0r2$1@dont-email.me>, quadibloc@ca.invalid (quadi)
    wrote:

    That should make clear that my goal isn't to settle for anything
    that _works_, but rather to find something that _works better_.

    I found a story recently that may entertain you, given you were
    interested in being able to run 36-bit software. It's from Fred Brooks'
    _The Design of Design_, in his chapter about the design of the System/360 hardware. This had been done without much thought for backwards
    compatibility, but potential customers wanted that.

    One of the engineers realised that since they were using 36-bit-wide
    memory and CPU data path, for 4x8-bit bytes, each with parity, it was
    possible to write an efficient emulator for the IBM 7090 in microcode on
    the 360 model 65. More emulators were written for other 7000-series
    machines, which had very varied architectures.

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Mar 15 19:47:20 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> wrote:
    On Sat, 14 Mar 2026 22:58:24 +0000, quadi wrote:

    Even I, who has indeed in Concertina II designed what has been in some
    of its iterations - and which has again become in the current iteration,
    sadly - some really weird ISAs would shrink from designing an ISA in
    which the answer to that question would be "Yes".

    What is the Concertina II architecture, and why has it gone through so
    many iterations?

    The basic idea behind the Concertina II architecture has been:

    Start from a 32-bit RISC-like architecture.

    The basic form of a memory-reference instruction is:

    (Opcode) 5 bits for a load-store operation
    (Destination Register) 5 bits for one of two 32-register register banks,
    one for integers, one for floats
    (Index Register) 3 bits - use only seven of the 32 integer registers, so
    the instruction will fit in 32 bits
    (Base Register) 3 bits - as above
    (Displacement) 16 bits - as is conventional for most CISC and RISC microprocessors

    You say that you have "short pointers" that provide constants from
    the same block. Why do you not provide the displacement in this way?
    AFAICS 9 bits would be enough to have either an 8-bit pointer or an
    8-bit displacement, for a saving of 7 bits. 4 bits could be used
    to get the full range of registers; the other 3 would give you room
    for other instructions.
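    The bit budget behind this suggestion can be laid out explicitly; all
    the figures below come from the post itself (with the 3-bit index and
    base fields widening to the full 5 bits).

```python
# The arithmetic behind the suggestion above, using figures from the post.
full_disp = 16              # bits in the original displacement field
short_field = 9             # selector + (8-bit pointer OR 8-bit displacement)
saved = full_disp - short_field        # 7 bits freed
widen_regs = (5 - 3) * 2               # index and base: 3 -> 5 bits each
spare = saved - widen_regs             # room left for other instructions
print(saved, widen_regs, spare)        # 7 4 3
```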
    --
    Waldek Hebisch
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Mar 15 15:47:34 2026
    From Newsgroup: comp.arch

    John Dallman wrote:
    In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current
    iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.

    John

    Yes, it had a bit-aligned instruction set.
    Unfortunately the paper describing the decoder is paywalled, but the
    432 was designed in the "put all logic in microcode" days, and they did.

    One author of the second paper, Colwell, then at Multiflow, later joined
    Intel as chief IA-32 architect on the Pentium Pro, Pentium II, Pentium
    III, and Pentium 4 microprocessors.

    [paywalled]
    The Instruction Decoding Unit for the VLSI 432 General Data Processor, 1981
    https://ieeexplore.ieee.org/abstract/document/1051633/

    Performance Effects of Architectural Complexity in the Intel 432, 1988
    RP Colwell, EF Gehringer, ED Jensen
    https://www.princeton.edu/~rblee/ELE572Papers/Fall04Readings/I432.pdf



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Mar 15 20:36:33 2026
    From Newsgroup: comp.arch

    According to John Dallman <jgd@cix.co.uk>:
    One of the engineers realised that since they were using 36-bit-wide
    memory and CPU data path, for 4x8-bit bytes, each with parity, it was
    possible to write an efficient emulator for the IBM 7090 in microcode on
    the 360 model 65. More emulators were written for other 7000-series
    machines, which had very varied architectures.

    Something is confused there. The 360 was a 32 bit machine, and the parity bits were parity bits, not usable for anything else. They did indeed write emulators for many of their previous machines. The one for the 709x put each 36 bit 709x word in a 360 doubleword, so the 32K words of 709x memory took 256K bytes on the
    360. That's one of the reasons you couldn't emulate a 709x on anything less than
    a 360/65.
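    The memory arithmetic in the paragraph above works out as stated: one
    36-bit 709x word per 64-bit (8-byte) System/360 doubleword, so 32K
    words occupy 256K bytes.

```python
# 32K 709x words, each stored in one 8-byte 360 doubleword.
words_709x = 32 * 1024
bytes_per_word = 8
total = words_709x * bytes_per_word
assert total == 256 * 1024   # 256K bytes of 360 memory
```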
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 15 20:42:28 2026
    From Newsgroup: comp.arch

    On Sun, 15 Mar 2026 19:42:00 +0000, John Dallman wrote:

    One of the engineers realised that since they were using 36-bit-wide
    memory and CPU data path, for 4x8-bit bytes, each with parity, it was possible to write an efficient emulator for the IBM 7090 in microcode on
    the 360 model 65.

    I remember reading the manuals on Bitsavers for several 360 emulation
    options. I don't recall any 7090 emulator which involved turning memory
    parity off in order for it to work!

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sun Mar 15 20:56:03 2026
    From Newsgroup: comp.arch

    On Sun, 15 Mar 2026 19:47:20 +0000, Waldek Hebisch wrote:

    You say that you have "short pointers" that provide constants from the
    same block. Why do not you provide displacement in this way? AFAICS 9
    bits would be enough to have either 8 bit pointer or 8 bit displacement,
    for saving of 7 bits. 4 bits could be used to get full range of
    registers, other 3 would give you room for other instructions.

    This is an interesting idea, I have to admit.

    But I consider the displacement part of the instruction.

    In some earlier iterations of Concertina II, quite some time ago, I did
    use short pointers also as a mechanism to point to the remainder of an instruction so as to allow instructions longer than 32 bits.

    A block header, which indicates part of the block as unused, is required
    to use these short pointers for pseudo-immediates. But that isn't really
    an objection, since a block header is also required to allow instructions longer than 32 bits.

    And treating displacements as immediate values is an idea that Mitch Alsup
    has talked about when he describes his My 66000 architecture, so I have
    been aware of the idea.

    I've put it aside, because I prefer to have instructions in one piece
    whenever possible. Displacements, unlike immediate values, come in just
    one size (well, not really in my architecture) and so they don't
    complicate length decoding - which is the issue that motivated me to go
    with pseudo-immediates.

    However, it's just possible that going this way could address the current issue I'm facing with Concertina II. The instruction header for blocks
    with variable-length instructions is directly conflicting with having
    pairs of 15-bit instructions available nicely and neatly in the no-header case. If I didn't need that block header, because reserving unused space
    for immediates also automatically let me have instructions longer than 32 bits, the problem would be solved.

    That problem would be solved - but I'd lose the 17-bit uncompromised short instructions, that can be freely placed singly. So it's not actually an attractive choice in all respects, even if it could solve part of my
    current issue.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Mar 15 16:53:51 2026
    From Newsgroup: comp.arch

    On 3/15/2026 12:55 PM, quadi wrote:
    On Sun, 15 Mar 2026 02:26:22 -0500, BGB wrote:

    Man, I am not sure why "find something that works, and makes sense, and
    stick with it" poses such a problem...

    If _that_ is your goal, may I suggest x86-64? It's highly popular, so
    there is a lot of software available to run on it. As well, the companies that make processors with this ISA have large economies of scale, and so their processors get to use the most advanced process nodes, and have a
    lot of effort put into optimizing their microarchitecture.

    That should make clear that my goal isn't to settle for anything that _works_, but rather to find something that _works better_.


    I meant more in the sense of:
    Pick an encoding scheme that makes sense;
    Don't just keep going back and forth with needlessly complex approaches
    and never settling on anything.


    Like, if one were to design something, assuming no predication, one
    option could be, say, low bits:
    x0: Pair Encodings (2x 15-bit);
    01: 32-bit ops
    11: 64/96 bit ops

    Then, say:
    Decide on 5 or 6 bit register fields;
    Leave around 10-12 or so bits for opcode;
    ...


    So, say:
    x001: Mostly 3R-Forms (may include 2R and 1R subsets)
    0101: 3RI Block
    1101: 2RI / 1I Block

    Assuming then, say, an Imm10+R6+R6 or 3x-R6 layout:
    xxxx-tttttt-ssssss-xxxx-dddddd-yy-y001 //3R
    iiii-iiiiii-ssssss-xxxx-dddddd-yy-0101 //3RI, Imm10
    iiii-iiiiii-iiiiii-xxxx-dddddd-yy-1101 //2RI, Imm16
    iiii-iiiiii-iiiiii-111x-iiiiii-yy-1101 //Imm22 (Direct Branch)

    Or, maybe move things around, or whatever...
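    A minimal sketch of the length decoding implied by the low-bit scheme
    above (x0 = a 32-bit word holding two 15-bit ops, 01 = a 32-bit op,
    11 = a 64/96-bit op). How 64 vs 96 bits would be distinguished is not
    specified here, so this stops at the two-bit split; this is BGB's
    hypothetical layout, not a finished ISA.

```python
# Classify a fetched 32-bit word by its low encoding bits.
def fetch_kind(word32):
    if (word32 & 0b1) == 0:
        return "pair"        # two 15-bit ops packed in one 32-bit word
    if (word32 & 0b11) == 0b01:
        return "32-bit"
    return "64/96-bit"       # low bits 11; longer encodings follow

assert fetch_kind(0b10) == "pair"
assert fetch_kind(0b01) == "32-bit"
assert fetch_kind(0b11) == "64/96-bit"
```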


    For 15-bit pair ops, maybe:
    xxssssyyddddyy0 //2R, Reg4
    xsssssyddddd001 //2R, Reg5
    ssssssdddddd101 //MOV, Reg6
    xxiiiiyydddd011 //2RI, Reg4+Imm4
    xxiiiidddddd111 //2RI, Reg6+Imm4
    MOV (Imm4s_NZ), ADD (Imm4s_NZ)
    LD/SD [SP+Disp4u*8]

    With a register space like, say:
    First few GPRs are SPRs;
    R16..R31 or similar map to the Reg4's
    R32..R63: GPRs

    Or, something...




    But, oh well, my efforts have been slowing down drastically as I have
    run into a wall of "mostly good enough".

    Within the realm of incremental evolutionary improvements, I am not
    finding much more...


    I could maybe define a "Pure" XG3 (as an XG3 subset without any RISC-V
    stuff).

    Simplest answer is that in XG3L:
    Only XG3 mode is assumed to exist;
    LR and PC may potentially decay back to being untagged;
    The tagging-related bits may be ignored.
    The RV encoding block (low 2 bits = 11) is simply disallowed;
    Continues leaving out the stuff not carried over from XG1/XG2;
    Maybe demotes more non-core instructions to optional.

    The core of XG3L would retain a roughly comparable feature-set to RV64I
    (with partial 'M', likely back to only requiring a 32-bit integer
    multiply, with 64-bit operations and DIV and similar implemented via traps); Would retain the compare-and-branch encodings, but for XG3L may make
    sense to allow constraining that either Rs or Rt must be the
    Zero-Register (demoting the full compare-and-branch to optional).

    FPU would be demoted to optional, but defined that (if absent) any FPU
    ops will be implemented with software traps.

    Arguable motivation:
    Such an XG3L core would be simpler and cheaper to implement than RV64I;
    Despite RISC-V having a shorter instruction listing, had noted that implementing an RV64 decoder tends to involve comparably more "hair";
    Such a subset would likely still beat RV64I on code-density and performance.


    But, dunno...


    Would mostly matter for XC7S25, but for an XC7S25 it still makes more
    sense to go with an RV32I or RV32IM based core (where RISC-V's design is
    a comparably better option for 32-bit targets). And, for the boards that
    use such an FPGA, typically hard-pressed to use them for much of
    anything beyond a small microcontroller.

    Granted, for most use-cases here, would be cheaper and higher
    performance to just buy a "Pi Pico" or similar (the "Raspberry Pi Pico"
    being under 1/10th the cost of a CMod-S7).

    Or, maybe could end up cheaper if one were already having PCBs made and
    used bare RP2040 chips.

    ...


    Well, and why my recent efforts had been so slow as of late:
    CPU Project: Not all that much obvious to do ATM;
    3D engine project: Yeah, keep losing direction.

    Image Compression:
    Currently my UPIC codec is the front-runner in my tests:
    Lossless compression is similar to Lossless WebP;
    Performance is faster than JPEG;
    Vs, WebP, where its lossless mode is kinda slow.
    Lossy: Mostly competitive with T.81 JPEG.
    What is lost to AdRice is regained by more efficient VLC.
    Realistically, hard to push much past 100-150 megapixel/sec on my PC,
    but, good enough...

    It remains annoyingly difficult to try to get high-speeds out of PNG
    like designs, and the closest competitor here is QOI, which is faster
    than UPIC, but not exactly the front-runner for compression ratio.
    Though, the main interesting merit of QOI is that it does what it does
    with a purely byte-based format (no entropy coding).



    Audio Compression:
    Current front-runner is 2b and 4b ADPCM with encoder tricks.
    The 2b ADPCM + tricks to improve LZ compression works well enough.
    It seems that on-average, RP2 is effective for the LZ audio trick.
    Entropy compression remains weak, so little benefit.
    LZMA or similar would be more effective, but is slow.
    Absent ADPCM encoder tricks to improve LZ,
    ADPCM audio is mostly non-compressible.

    Not found any "better" schemes (for bitrate) that don't also severely
    impact audio quality.

    Can note that it does seem possible (although potentially
    computationally expensive) to boost the perceptual audio quality using
    neural nets (essentially using the NN like a more convoluted FIR filter).

    It is potentially possible that paired NN's could be used to squeeze
    more perceptual quality through a 8kHz 2-bit ADPCM channel, or maybe
    drop further like 4kHz or 6kHz, but debatable if the high computational
    cost is worthwhile.

    Then again, not exactly like traditional higher-cost options (like MP3)
    or similar are being particularly effective in this space, and almost
    could make sense (all things being equal) to use audio-enhancing NNs.
    Though, paired-NN and low-sample rate would require the NN's to be fixed
    in the codec (apart from very long audio clips, invariably the NN is
    going to be bigger than the audio data).



    NNs:
    Had experimented with back-propagation, which generally converges
    towards an answer quickly and then gets stuck; it can seemingly never get
    past a certain error floor.

    Genetic algorithm training can often reach exact-answers, but scales
    poorly (increasing neural-net size makes it progressively worse at
    training it).

    My attempts at hybridizing back-prop and genetic algorithms have yet to
    give a "better" answer; nor have I come up with something that works
    well for "live" neural nets (which mimic natural learning).

    Seemingly the "closest to effective" answer for the latter being to
    train two sets of neural nets which run in opposite directions:
    One NN runs forwards, from input to output;
    One NN runs backwards, from output to input;
    The two sets of NN's have "bleed-over" for their respective error
    functions (vs pure backprop).

    Idea is that when the real and predicted inputs and outputs differ, the
    error delta in both directions can be added to the error delta from the
    normal backprop.

    Downside: Twice as big and twice as slow to run such an NN vs plain
    backprop, and still can't get past the "noise floor" and reach exact
    answers; but potentially could still have learning ability for
    free-running pairs of inputs and outputs.




    Word predictors nets:
    Had noted that there can be some interesting properties of going from
    English -> Pseudo-Chinese, running the NN in Pseudo-Chinese, then
    converting this back to English (mostly all operating via UCS-2 / UTF-16 internally).

    In effect, mapping English to Hanzi can serve a similar role to an
    embedding space, as the Hanzi for words tend to have semantic-clustering effects, and thus (ironically) the English->Chinese->NN->English route
    seems to have a higher level of semantic coherence than a simpler model
    based purely on assigning an index number to each English word.

    In effect, because the Chinese "words" tend to consist of prefix/suffix patterns that tend to cluster semantically around the thing being
    described; with a much stronger semantic-clustering effect if compared
    with either linearly assigning English words to index numbers or simpler hash-table based mapping schemes. This sort of semantic clustering being relevant to getting semantically-coherent output out of NNs.


    Was mostly using the CEDICT and ECDICT projects as data sources for the model-building (generally trying to do it algorithmically).


    I had a sub-goal of trying to develop a "decent" bijective mapping
    scheme, but got distracted. The goal here being to develop something
    that can do the English -> Pseudo-Chinese -> English mapping in a way that:
    Will (usually) get the same English text back out the other side;
    Uses the statistical minimum number of Hanzi needed to do so;
    ...

    Bonus points if an "actual" Chinese->English translator actually
    generates something coherent and resembling the original text.

    There are fudge factors, though, because English has some grammatical and word-form features that don't map well to Hanzi (such as articles, singular/plural distinction, gendered pronouns, ...).

    Well, along with English words often having a higher level of context-sensitive semantic mapping, which poses a problem for any sort of
    naive bijective scheme.

    The likely option being to use some explicit handling for English
    grammar rules and modifier characters added to preserve distinctions
    (picked from somewhere else in the Unicode space); though this would do
    little for machine-translation coherence other than confuse it (doesn't
    seem like there is a good way to do it within Hanzi, as, as noted, the language simply drops these distinctions and expects the reader/listener
    to infer them by context).


    Like, the mapping would need to have some level of redundancy and
    contextual mapping rather than a simpler bijective mapping (effectively
    using HMMs or similar to predict which of the possible mappings to use),
    then allowing a many-to-one mapping on the Hanzi -> English-Words side.

    Could fiddle more with it.

    More annoying as someone who doesn't read/understand Chinese;
    Almost debatable if I should instead build mapping tables between hanzi
    and pinyin as at least the pinyin is slightly less annoying to look at.

    Fiddled with it some, then quickly lost motivation.


    Also, whatever is done, is going to be bulky, like even a simple 1:1
    bijective mapping scheme for a moderate sized lexicon is already a good
    number of MB (plus however much space is needed for the NNs, etc). Well,
    and the denser (but more semantically-effective) schemes tending to not
    be bijective.

    Like, it may seem simple:
    Word -> Number(s) -> Word
    But it actually turns into a bit of a PITA.

    ...



    Not much else worthwhile going on, which is admittedly partly related to
    the whole thing of:
    Get depressed (well, partly related to my non-existent romantic-life);
    Well, like, even if I am seemingly ace, I still feel lonely...
    Then, proceeding to (among other things) write a Rainbow Brite fanfic;
    ...


    ...

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 02:09:01 2026
    From Newsgroup: comp.arch

    On Sun, 15 Mar 2026 16:53:51 -0500, BGB wrote:

    I meant more in the sense of:
    Pick an encoding scheme that makes sense;
    Don't just keep going back and forth with needlessly complex approaches
    and never settling on anything.

    That is definitely what I do want to end up doing.

    I have managed to find some additional opcode space that has let me make
    the paired 15-bit instructions just a bit less messy. But since I won't be satisfied until they're not messy at all - and I am so close to finally squeezing everything in that I want to include in the basic 32 bit
    instruction set - I am going to continue looking for some additional improvements for a while.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 02:37:33 2026
    From Newsgroup: comp.arch

    On Mon, 16 Mar 2026 02:09:01 +0000, quadi wrote:

    I have managed to find some additional opcode space that has let me make
    the paired 15-bit instructions just a bit less messy. But since I won't
    be satisfied until they're not messy at all - and I am so close to
    finally squeezing everything in that I want to include in the basic 32
    bit instruction set - I am going to continue looking for some additional improvements for a while.

    I didn't find a way to squeeze things yet further, but I found a
    disastrous mistake in one part of the instruction set that showed the same opcode space being allocated twice.

    But it turned out to be easily fixable; I had added an opcode bit to one instruction, so taking that back gave me the opcode space I needed to put everything right again.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 02:51:02 2026
    From Newsgroup: comp.arch

    On Mon, 16 Mar 2026 02:37:33 +0000, quadi wrote:

    I didn't find a way to squeeze things yet further, but I found a
    disastrous mistake in one part of the instruction set that showed the
    same opcode space being allocated twice.

    But it turned out to be easily fixable; I had added an opcode bit to one instruction, so taking that back gave me the opcode space I needed to
    put everything right again.

    And on further review, I found another error, which was corrected by regressing that part of the design.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 16 04:16:50 2026
    From Newsgroup: comp.arch

    On 3/15/2026 9:35 AM, John Dallman wrote:
    In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current
    iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...


    I guess, probably similar story as the Gen-Z people and the IA-64.

    But, being of an older generation, IA-64 hype was going on when I was in high-school, and x86-64 also came on the scene. Some other people were
    on Team IA-64, but I was on Team x86-64, mostly because as I could see
    it at the time, it seemed like the writing was already on the wall for
    IA-64 even back then.

    In this case, I ended up being right...

    IIRC, the thinking was, for IA-64:
    People already knew of its performance woes;
    It was also very much more expensive.

    On the other side of things, the Athlon64 was already on the horizon;
    and ended up getting one post graduation.


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...


    M68K started turning into a turd, but can't really be faulted that much,
    as its design was pretty close to a direct evolution of the PDP-11;
    which was quite influential.


    Well, and go back pretty far, and my current ISA design also follows an evolution tree that reaches back to the PDP-11, despite having almost
    nothing in common (and by sorta colliding with RISC-V, also sorta
    half-way hybridizes it with the MIPS family).

    Then again, maybe being a product of engineering rather than
    naturalistic evolution sorta excludes it from the normal rules of
    phylogeny. But, then again, can one prove that even nature obeys it?...

    Say, for example, what if you ended up with leaf slugs that could
    natively reproduce chloroplasts, would they still be purely animals, or
    would their then-stolen DNA and chloroplasts make them partially
    descended from algae (as IIRC they had gained some DNA from the algae
    via horizontal gene transfer or similar, but need to scavenge
    chloroplasts from plant cells rather than making their own).

    ...

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Mar 16 05:20:19 2026
    From Newsgroup: comp.arch

    On 3/15/2026 9:09 PM, quadi wrote:
    On Sun, 15 Mar 2026 16:53:51 -0500, BGB wrote:

    I meant more in the sense of:
    Pick an encoding scheme that makes sense;
    Don't just keep going back and forth with needlessly complex approaches
    and never settling on anything.

    That is definitely what I do want to end up doing.

    I have managed to find some additional opcode space that has let me make
    the paired 15-bit instructions just a bit less messy. But since I won't be satisfied until they're not messy at all - and I am so close to finally squeezing everything in that I want to include in the basic 32 bit instruction set - I am going to continue looking for some additional improvements for a while.


    I tried various ideas, but as noted, my project ended up on a more evolutionary path.

    Vs, say, "analysis paralysis".



    But, as noted, my project has slowed, but it is more a case of "running
    out of improvements within the current constraints space".

    Sometimes it makes sense to try different ideas, and then switch
    designs if/when something comes up with enough of an improvement to
    effectively usurp what one has already.

    Sometimes it does, sometimes it doesn't.


    Like, the XG3 thing:
    Vs XG2: Well, it is a partial subset and repack;
    Technically, XG2 is a simplification vs XG1.
    Vs RV64GC:
    Well, XG3 is faster than RV64GC, even if applying extensions.

    But, ends up in a cost/benefit thing:
    XG1: Had the best code density, and was oldest;
    XG2 was faster, but worse code density.
    XG3, another code density loss, but perf gain.
    RV64G: More popular.
    RV64GC: Similar code density to XG1, similar perf.
    RV64G+XG3: Works well.
    RV64GC+XG3: Could exist, but "opens a can of worms".


    I ended up not being able to eliminate XG1, but it is possible I
    eventually could cross some effort threshold and like switch from
    booting in XG1 mode to booting in RV64GC mode.

    Though, implicitly, while booting the kernel in XG1 mode as-is allows
    running binaries in all other modes, booting in RV64GC mode would only
    allow running RV64G/RV64GC and XG3.

    Ironically, ATM this is more a bigger issue for my software stack than
    either the CPU core or emulator though (it is already an option to boot
    an RV64GC+Jx Boot-ROM and then boot something in RV mode).

    But, mostly the Boot ROM needs to be as either XG1 or RV64GC though
    mostly because the Boot ROM is generally limited to 32K and thus makes
    code density a premium, and ATM, RV64G+XG3 does not beat RV64GC+Jx at
    code density.


    Does lead to some annoying "forks in the road".



    Well, and sort of like, LZ4 vs RP2:
    Both are about the same speed, but...
    LZ4 generally works better for binaries;
    RP2 compresses better for most other data.

    But, while RP2 hasn't been replaced by anything, it has suffered
    sub-splits as well:
    RP2 original (up to 4MB sliding window and 16K matches)
    RP2A: Limits to 128K sliding window and 515 byte matches.
    But, simpler and cheaper/faster to decode.
    Omits a match type (simplest case).
    RP2B: Variant that regains the 4MB window and 16K matches.
    But, uses a different encoding scheme for them,
    and has an RLE special case.

    Then, in some use cases:
    RP2 + STF+AdRice post-compressor:
    Lazily gets entropy coding after-the-fact;
    Generally only used if it crosses a size-reduction threshold.
    RP2 + Range-Coder post-compressor:
    Gets a range-coder after the fact;
    Uses a vaguely similar design to the one in LZMA;
    Requires a fairly substantial size reduction to be selected.

    Where, the post-compressors are not an entirely separate compression
    scheme, but more a hacky way of bolting the entropy coding on top of the
    LZ compressed byte-blob. Which probably seems like a stupid idea, but
    kinda works.

    Currently, the post-compressors assume the 2A or 2B variants (the
    original variant is not compatible).


    Though, in most contexts where I am using the RP2 post-compressors, LZ4
    ended up not used, as it typically gives worse compression.

    In some cases, Deflate is also excluded as generally Deflate is only
    effective past a certain payload size (and the STF+AdRice and
    Range-Coder post-compressors turn out to be effective at typically
    smaller payload sizes than what Deflate can touch).


    Though, there is a range of payload sizes where Deflate is effective:
    Data is big enough that Huffman becomes effective;
    Not so big that Deflate starts getting hurt by the smaller 32K sliding
    window.

    For bigger payload:
    Range-Coder compresses well but is painfully slow;
    As payload gets bigger, slowness gets more obvious.
    A Huffman-Coded post compressor could make sense here.
    At a certain point, Huffman starts to beat STF+AdRice.


    Never mind that first compressing data with a byte-oriented LZ compressor,
    and then optionally trying various entropy stages after the fact to see
    if they would give a compression advantage (falling back to the raw
    byte-oriented format as the default case), is maybe an absurd way to
    approach this.


    ...



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 13:24:57 2026
    From Newsgroup: comp.arch

    On Mon, 16 Mar 2026 02:09:01 +0000, quadi wrote:

    I have managed to find some additional opcode space that has let me make
    the paired 15-bit instructions just a bit less messy. But since I won't
    be satisfied until they're not messy at all - and I am so close to
    finally squeezing everything in that I want to include in the basic 32
    bit instruction set - I am going to continue looking for some additional improvements for a while.

    I managed to squeeze out an additional bit by modifying the header formats ever so slightly.

    If I then restrict the 15-bit instructions to what was required for the
    first instruction slot in the block, now I can put set flag and branch instructions back in.

    That let me make paired 15-bit instructions "not messy" at all, in the
    sense that now both of the two instructions can have the same format, and
    I don't have to change the format for the first instruction slot in a
    block. The fact that in other instruction slots, the instruction
    repertoire for 15-bit short instructions is unnecessarily restricted - a six-bit opcode for register-to-register operate instructions isn't needed,
    a seven-bit one could fit - is, I feel, worth it to remove location
    dependency in the headerless case.

    Finally, without headers, the ISA is once again a RISC architecture plus
    extra capabilities, with its true VLIW nature only rearing its head once
    one chooses to use the additional capabilities that using headers makes available.

    Having different formats for 15-bit instructions in different block
    formats isn't something I count as "messy", that's just par for the course
    in Concertina II.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 18:43:50 2026
    From Newsgroup: comp.arch

    On Mon, 16 Mar 2026 13:24:57 +0000, quadi wrote:

    Having different formats for 15-bit instructions in different block
    formats isn't something I count as "messy", that's just par for the
    course in Concertina II.

    But I knew that the second 15-bit instruction format, for the cases of
    Type II and Type III headers, didn't take back as much opcode space as was actually made available in that case. I knew I ought to be able to do
    better, and I've now managed to go back to seven-bit opcodes for the register-to-register operate instructions.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 21:44:56 2026
    From Newsgroup: comp.arch

    On Sun, 15 Mar 2026 16:53:51 -0500, BGB wrote:

    I meant more in the sense of:
    Pick an encoding scheme that makes sense;
    Don't just keep going back and forth with needlessly complex approaches
    and never settling on anything.

    I know; my x86-64 reply was intended to be gentle sarcasm that would give people an occasion to laugh.

    Well, I think that after much effort, I finally have gotten to a point
    where I am forced to admit that I can't improve on it further. I have
    squeezed out as much opcode space as possible to come as close to my goal
    as possible. Uncompromised memory-reference instructions, and fairly
    decent short instructions along with them.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 22:09:25 2026
    From Newsgroup: comp.arch

    On Sat, 07 Mar 2026 19:00:02 +0000, MitchAlsup wrote:

    I smell danger--running out of OpCode space early in the design.
    After the general conceptualization of the ISA you should have half of
    the OpCode space available for future additions !!.

    I'm not sure about "early". After the part I've done, the rest is filling
    in the blanks - here are the opcodes, here is their order.

    The basic structure has now indeed come close to running out of opcode
    space; I've been exerting myself to go into every nook and cranny to find extra opcode space so as to squeeze in bigger instructions than ought to
    be able to fit in 32 bits.

    But the 32-bit operate instructions have opcode fields big enough to allow
    all sorts of exotic operations on strange data types.

    The basic memory-reference instructions just handle the standard data
    types... so there are supplementary memory-reference instructions, with
    more limited address modes, to handle additional data types, so that just
    as I don't have to leave the basic 32-bit instruction set to operate on
    exotic data types, I also don't have to leave it to get those data types
    in and out of memory, in and out of the registers.

    And if you're willing to generate code that includes the headers, you
    get...

    full-length immediate values

    additional memory-reference instructions with a full selection of address modes

    In my opinion, the only *real* problem with having allocated the basic structure of the ISA so fully... is if someone were to come up with a data type never used before, one so important that it shouldn't be relegated
    to a special portion of the instruction set that might only be accessible
    with headers, that needs instructions over 32 bits, and so on.

    I can't say this will never happen. After all, MMX and its successors came along, reflected in my short vector data type. So integers and floating-
    point, the only two data types a lot of computers had, now have company.

    Of course, though, I also have character string instructions and packed decimal instructions. Packed decimal will come in both IBM's memory to
    memory kind and memory to register ones.

    But while I've tried to cover all the historic bases, what about the
    various short floating-point types that are increasing in popularity,
    because they're being used for the matrices that today's AI is based on?

    Right now, that field is so fast-moving that it's not clear which types I should support.

    Even though I got rid of complexity so as to have only three types of
    headers instead of seventeen... and just recently I squeezed the opcode
    space used by the headers... there is still some opcode space left. I
    could very easily define a Type IV header that lets me mix the standard 32-
    bit instruction set with a whole other set of 32-bit instructions that
    handle 16-bit, 8-bit, and even 4-bit floats...

    The type II and type III headers both offer 22 bits of data.

    So the type IV header could have...

    3 bits for a decode field, to let pseudo-immediates be used;
    7 bits to indicate whether each 32-bit instruction slot has a standard 32-
    bit instruction or a special instruction;
    8 bits to indicate *which of 256 different sets of special instructions is going to be used in this block*;
    and I've still got four bits left over, maybe to make fifteen other header types as good as this one.
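    (Editor's sketch: the 22-bit Type IV payload described above can be pictured as a simple pack/unpack. The field names and bit positions below are assumptions for illustration; the post only fixes the widths, 3 + 7 + 8 + 4 = 22 bits.)

```python
# Illustrative packing of the proposed Type IV header payload.
# Bit layout (high to low) is an assumption; only widths are from the post.

def pack_type_iv(decode, slot_mask, set_id, spare=0):
    # 3-bit decode field | 7-bit per-slot flag mask | 8-bit set selector | 4 spare bits
    return (decode << 19) | (slot_mask << 12) | (set_id << 4) | spare

def unpack_type_iv(payload: int) -> dict:
    assert 0 <= payload < (1 << 22)          # Type II/III/IV headers carry 22 data bits
    return {
        "decode":    (payload >> 19) & 0x7,  # lets pseudo-immediates be used
        "slot_mask": (payload >> 12) & 0x7F, # standard vs special, per 32-bit slot
        "set_id":    (payload >> 4)  & 0xFF, # which of 256 special instruction sets
        "spare":     payload         & 0xF,  # four bits left over
    }
```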

    So I don't need to sweat opcode space. Yes, anything new that comes along
    will be less privileged, less prominent than the old standard types.
    That's true for any ISA. While the degree to which I've squeezed out
    opcode space may take it to a new level... it's counteracted by the fact
    that this ISA is also designed for extensibility.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 22:36:56 2026
    From Newsgroup: comp.arch

    On Mon, 16 Mar 2026 22:09:25 +0000, quadi wrote:

    So the type IV header could have...

    3 bits for a decode field, to let pseudo-immediates be used;
    7 bits to indicate whether each 32-bit instruction slot has a standard
    32- bit instruction or a special instruction;
    8 bits to indicate *which of 256 different sets of special instructions
    is going to be used in this block*;
    and I've still got four bits left over, maybe to make fifteen other
    header types as good as this one.

    Well, all right, that's good enough for new data types like eight-bit
    floats that can be handled well by 32-bit instructions.

    But what if there are new data types that require longer instructions?

    For example, today's IBM zSeries mainframes have instructions that handle UTF-8 character strings natively in hardware.

    The 32-bit header that allows variable-length instructions is prefixed by
    only four bits in the beginning; I don't have the opcode space for another
    one like it. So I'm cooked, right?

    No. Because it's perfectly possible - as numerous earlier iterations of Concertina II demonstrated - to have a header that's 48 bits long (or, if
    need be, even 64 bits long) instead of 32 bits long for the more exotic
    cases.

    So I can have a header that allows 17-bit short instructions, the full
    normal instruction set of instructions including the ones longer than 32 bits... and additional instruction sets _also_ including instructions
    longer than 32 bits.

    There will be more overhead when going to the new exotic features; that
    isn't really avoidable, but the ISA is not debarred from growth.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Mon Mar 16 23:46:04 2026
    From Newsgroup: comp.arch

    On Mon, 16 Mar 2026 22:36:56 +0000, quadi wrote:

    But what if there are new data types that require longer instructions?

    For example, today's IBM zSeries mainframes have instructions that
    handle UTF-8 character strings natively in hardware.

    I've decided to work out the formats for the additional header types all
    this would need now, rather than leaving it for later. In order to avoid
    going to a 64-bit header, I've gone to an exotic length for the sixth
    header type.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 17 00:06:28 2026
    From Newsgroup: comp.arch

    On Mon, 16 Mar 2026 23:46:04 +0000, quadi wrote:

    I've decided to work out the formats for the additional header types all
    this would need now, rather than leaving it for later. In order to avoid going to a 64-bit header, I've gone to an exotic length for the sixth
    header type.

    Originally, I used radix encoding, but I realized Chen-Ho encoding was
    more appropriate.
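    (Editor's note: both schemes pack three decimal digits into 10 bits, which works because 10^3 = 1000 <= 1024 = 2^10. A minimal sketch of the plain radix encoding is below; Chen-Ho reaches the same density while letting each digit be recovered by simple boolean logic instead of a division.)

```python
def radix_pack(a, b, c):
    # Pack three decimal digits into one value in 0..999 (fits in 10 bits).
    assert all(0 <= d <= 9 for d in (a, b, c))
    return a * 100 + b * 10 + c

def radix_unpack(v):
    # Recover the three digits; note this needs division, unlike Chen-Ho.
    assert 0 <= v < 1000
    return v // 100, (v // 10) % 10, v % 10
```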

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 17 08:00:11 2026
    From Newsgroup: comp.arch

    On Tue, 17 Mar 2026 00:06:28 +0000, quadi wrote:

    Originally, I used radix encoding, but I realized Chen-Ho encoding was
    more appropriate.

    Anyways... see what happened?

    You mention the fact that, far too early in the design phase, I have used
    up almost all of the available opcode space.
    In a sense, that certainly is true. The opcode space for the basic
    instruction set of the computer, when there are no headers present to add longer instructions to the instruction set, is perhaps more than 99% allocated! Which _is_ pretty bad.
    But what did I do in response?
    I still have, despite recent changes to the block headers to make them consume
    less opcode space, some space left to define new types of headers. So what
    did I do? I defined three new types of header which have the effect of allowing the architecture to be modified... so as to add up to *five
    hundred and twenty-eight* additional instruction sets to the ISA. These headers allow instructions from any one of those auxiliary instruction
    sets to be combined with regular instructions in the same block.
    So if there's a need for instructions acting on short floating-point
    numbers, or UTF-8 strings, that the basic instruction set has not covered,
    it will be possible to extend the instruction set to deal with it.
    There should be enough room for it to meet the demands placed on it not
    merely in years to come, but even centuries or millennia. (Although, as
    Moore's Law peters out, it may not be possible to put circuitry for so
    many different kinds of instructions on a single die!)

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 17 15:36:32 2026
    From Newsgroup: comp.arch

    On Tue, 17 Mar 2026 08:00:11 +0000, quadi wrote:

    (Although, as
    Moore's Law peters out, it may not be possible to put circuitry for so
    many different kinds of instructions on a single die!)

    I lack imagination. If microcode is used, there shouldn't be a problem.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Mar 17 09:09:03 2026
    From Newsgroup: comp.arch

    On 3/17/2026 1:00 AM, quadi wrote:
    On Tue, 17 Mar 2026 00:06:28 +0000, quadi wrote:

    Originally, I used radix encoding, but I realized Chen-Ho encoding was
    more appropriate.

    Anyways... see what happened?

    You mention the fact that, far too early in the design phase, I have used
    up almost all of the available opcode space.
    In a sense, that certainly is true. The opcode space for the basic instruction set of the computer, when there are no headers present to add longer instructions to the instruction set, is perhaps more than 99% allocated! Which _is_ pretty bad.
    But what did I do in response?
    I still have, despite recent changes to the block headers to make them consume less opcode space, some space left to define new types of headers. So what did I do? I defined three new types of header which have the effect of allowing the architecture to be modified... so as to add up to *five
    hundred and twenty-eight* additional instruction sets to the ISA. These headers allow instructions from any one of those auxiliary instruction
    sets to be combined with regular instructions in the same block.
    So if there's a need for instructions acting on short floating-point
    numbers, or UTF-8 strings, that the basic instruction set has not covered,
    it will be possible to extend the instruction set to deal with it.
    There should be enough room for it to meet the demands placed on it not merely in years to come, but even centuries or millennia. (Although, as Moore's Law peters out, it may not be possible to put circuitry for so
    many different kinds of instructions on a single die!)

    So you have produced a larger, hence more expensive chip. And you then
    expect your chip's potential users to pay more for a chip which, you
    admit, has features that a particular application, particularly an
    embedded one, will never use.

    Doesn't seem like a recipe for success. :-(
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 17 17:53:14 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
    In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current
    iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...

    432 is worthy of study if only to figure out "what not to do".


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 17 18:16:16 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Sat, 14 Mar 2026 22:58:24 +0000, quadi wrote:
    -----------------
    What is the Concertina II architecture, and why has it gone through so
    many iterations?

    For a perspective:
    My 66000 has gone through 1 major revision after being essentially stable
    for 6 years (ISA) and 2.5 years (compiler). Sure, new instructions were
    added, and blocks of instructions moved around, but the template describing
    all instructions had not changed.

    The basic idea behind the Concertina II architecture has been:

    Start from a 32-bit RISC-like architecture.

    OK

    The basic form of a memory-reference instruction is:

    (Opcode) 5 bits for a load-store operation
    (Destination Register) 5 bits for one of two 32-register register banks,
    one for integers, one for floats
    (Index Register) 3 bits - use only seven of the 32 integer registers, so
    the instruction will fit in 32 bits
    (Base Register) 3 bits - as above
    (Displacement) 16 bits - as is conventional for most CISC and RISC microprocessors
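    (Editor's sketch: the quoted memory-reference format can be decoded mechanically. The field order below, opcode in the high bits down to the displacement in the low 16, is an assumption; the post only gives the widths, 5 + 5 + 3 + 3 + 16 = 32.)

```python
# Decode sketch for the 32-bit Concertina II memory-reference format
# quoted above. Bit positions are illustrative assumptions.

def decode_memref(word: int) -> dict:
    assert 0 <= word < (1 << 32)
    return {
        "opcode": (word >> 27) & 0x1F,   # 5 bits: load/store operation
        "dest":   (word >> 22) & 0x1F,   # 5 bits: one of 32 registers in a bank
        "index":  (word >> 19) & 0x7,    # 3 bits: only 7 usable index registers
        "base":   (word >> 16) & 0x7,    # 3 bits: base register, as above
        "disp":   word & 0xFFFF,         # 16-bit displacement
    }
```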

    Squeezed already (on your first step)
    a) only 1%-5% of instructions need indexing (scaled or not).
    if you used a 64-bit instruction to cover this small usage
    you gain a 6-bit OpCode and 5-bit base register.
    b) this 6-bit OpCode provides 2× the space

    There are 24 opcodes normally used, so 1/4 of the opcode space is
    available for everything else.

    I only need 7 LDs and 4 STs.

    I had wanted to provide all the advantages and features of popular CISC architectures too, though.

    We have continued to question this over the years.

    The constant struggle to settle on a single OpCode template is indicative
    of your struggle--the lack of convergence should be telling you that this
    goal is one thing holding your architecture back.

    This meant trying to squeeze stuff into not enough opcode space. So, for example, on the IBM System/360, there are arithmetic instructions between two registers that only take up 16 bits.

    360 is being supported by this huge $$$ base of few customers/machines.
    Will your architecture ever be able to gain a single customer from this
    base ?? If not, perhaps the goal is misdirecting the whole project.

    With 32-bit register banks, a short instruction ought to look like this:

    (Opcode) 7 bits
    (Destination Register) 5 bits
    (Source Register) 5 bits

    But that's 17 bits long.

    5 pounds of sand does not fit in a 4 pound bag !

    So what I usually did was place restrictions on the source and destination registers so that the short instructions could fit into 15 bits. A 32-bit instruction slot could start with 11 and then use 1/4 of the opcode space, containing a pair of these.
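    (Editor's sketch: the pairing just described amounts to a 2-bit '11' prefix followed by two 15-bit instructions, 2 + 15 + 15 = 32. The exact bit placement is an assumption; only the split is from the post.)

```python
# Split a 32-bit slot into a pair of 15-bit short instructions,
# per the scheme described above. Layout is an illustrative assumption.

def split_pair(slot: int):
    assert 0 <= slot < (1 << 32)
    if (slot >> 30) != 0b11:
        return None                    # slot holds something other than a pair
    first  = (slot >> 15) & 0x7FFF     # first 15-bit short instruction
    second = slot & 0x7FFF             # second 15-bit short instruction
    return first, second
```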

    But the basic memory-reference instructions take up 3/4 of the opcode
    space. I still need several other 32-bit instructions!

    See above at the top.
    ------------------

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 17 18:36:41 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 3/17/2026 1:00 AM, quadi wrote:
    On Tue, 17 Mar 2026 00:06:28 +0000, quadi wrote:

    Originally, I used radix encoding, but I realized Chen-Ho encoding was
    more appropriate.

    Anyways... see what happened?

    You mention the fact that, far too early in the design phase, I have used up almost all of the available opcode space.
    In a sense, that certainly is true. The opcode space for the basic instruction set of the computer, when there are no headers present to add longer instructions to the instruction set, is perhaps more than 99% allocated! Which _is_ pretty bad.
    But what did I do in response?
    I still have, despite recent changes to the block headers to make them consume less opcode space, some space left to define new types of headers. So what did I do? I defined three new types of header which have the effect of allowing the architecture to be modified... so as to add up to *five hundred and twenty-eight* additional instruction sets to the ISA. These headers allow instructions from any one of those auxiliary instruction sets to be combined with regular instructions in the same block.
    So if there's a need for instructions acting on short floating-point numbers, or UTF-8 strings, that the basic instruction set has not covered, it will be possible to extend the instruction set to deal with it.
    There should be enough room for it to meet the demands placed on it not merely in years to come, but even centuries or millennia. (Although, as Moore's Law peters out, it may not be possible to put circuitry for so
    many different kinds of instructions on a single die!)

    So you have produced a larger, hence more expensive chip.

    On top of never getting close to done.

    And you then
    expect your chip's potential users to pay more for a chip which, you
    admit, has features that a particular application, particularly an
    embedded one, will never use.

    This part does not materialize until he gets done.

    Doesn't seem like a recipe for success. :-(

    I think it's more than "tongue in cheek"


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 17 19:12:56 2026
    From Newsgroup: comp.arch


    jgd@cix.co.uk (John Dallman) posted:

    In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.

    Probably easier to describe 432 as bit-sized instructions--as in
    use as many or few bits as desired.

    As to subroutines > 64KB, that was before application generated ASCII
    code, and before massive function inlining.

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Mar 17 19:25:35 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:
    --------------
    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...
    Great: PDP-11 but died due to address space limitations
    Brilliant: VAX but died because complexity limits perf over time


    M68K started turning into a turd,

    by adding too much complexity between 010 and 020

    but can't really be faulted that much,
    as its design was pretty close to a direct evolution of the PDP-11;
    which was quite influential.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Mar 17 19:25:45 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
    In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi)
    wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current
    iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...

    432 is worthy of study if only to figure out "what not to do".


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint

    Indeed. MIPS had software TLB[*], SPARC had register windows [**]
    Alpha had a weak [***] memory ordering. All survive
    for backward compatibility and are generally
    eschewed for new designs.

    [*] Didn't perform well in large scale SMP without
    tricks like page coloring. The virtually tagged
    caches were troublesome.

    [**] Register windows. Need I say more?

    [***] Difficult to program multithreaded apps correctly, difficult to
    port software to from other more strongly ordered
    architectures.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Mar 17 20:18:00 2026
    From Newsgroup: comp.arch

    In article <10p75jk$1dmen$1@dont-email.me>, quadibloc@ca.invalid (quadi)
    wrote:

    I remember reading the manuals on Bitsavers for several 360
    emulation options. I don't recall any 7090 emulator which involved
    turning memory parity off in order for it to work!

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

    The 360s were built out of hundreds of small circuit cards with discrete components on them. Those included very low-integrated circuits, with
    maybe 10 transistors at most. The orders to the designers of the
    different models were to microcode everything, unless they could improve price:performance by 30% or more with dedicated circuitry.

    The microcoded world lasted until the RISC revolution of the 1980s, when integrated circuits were providing at least 10,000 times more transistors.


    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 17 20:28:03 2026
    From Newsgroup: comp.arch

    On Tue, 17 Mar 2026 20:18:00 +0000, John Dallman wrote:

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

    It is true that nearly all System/360 models, except the Model 75 and the 91/95/195, were microcoded.

    I would have expected memory parity to be done in hardware, however.
    Hardware features can be turned off, and emulation features did sometimes involve installing new hardware, not just new microcode.

    But in the 7090 emulation features for which I have read documentation,
    none did it this way.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Tue Mar 17 20:52:44 2026
    From Newsgroup: comp.arch

    On Tue, 17 Mar 2026 18:16:16 +0000, MitchAlsup wrote:

    The constant struggle to settle on a single OpCode template is
    indicative of your struggle--the lack of convergence should be telling
    you that this goal is one thing holding your architecture back.

    That is entirely true. I wasn't too worried about it, since it was not as
    if the world desperately needed a new ISA.
    If I had settled for perfectly conventional goals, and finished the architectural spec years ago, we still wouldn't be using Concertina II
    CPUs by now, after all.

    5 pounds of sand does not fit in a 4 pound bag !

    Yes, I am aware of the "pigeonhole principle".

    But while I seem to have sought to achieve an impossible goal... finally,
    I feel, I have really come as close to achieving that goal as is possible.
    No more thrashing around is needed, and the compromises this iteration has
    had to make are... not too painful to endure, and all clearly necessary.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Mar 17 17:09:01 2026
    From Newsgroup: comp.arch

    John Dallman wrote:
    In article <10p75jk$1dmen$1@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    I remember reading the manuals on Bitsavers for several 360
    emulation options. I don't recall any 7090 emulator which involved
    turning memory parity off in order for it to work!

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

    The 360s were built out of hundreds of small circuit cards with discrete components on them. Those included very low-integrated circuits, with
    maybe 10 transistors at most. The orders to the designers of the
    different models were to microcode everything, unless they could improve price:performance by 30% or more with dedicated circuitry.

    The microcoded world lasted until the RISC revolution of the 1980s, when integrated circuits were providing at least 10,000 times more transistors.


    John

    I was looking at the VAX-780 description of its cache, TLB,
    main memory bus called the Synchronous Backplane Interconnect (SBI),
    and memory controller, and was surprised to find them all directly
    controlled by microcode, not dedicated HW sequencers.
    I think this was for flexibility for easier handling of errors
    with microtraps (microcode exceptions).

    A consequence is that since there is only one microsequencer
    there is no HW concurrency and everything is sequential.

    "2.3.3.2 Microtraps During Memory Control Functions - During the execution
    of a memory control function, a microtrap may occur. Table 2-17 lists the possible microtraps for each memory control function. The conditions for
    each of these microtraps are given below.

    If a microtrap occurs during the execution of a memory control function,
    the reference is usually aborted. This is true for all microtraps except
    for the unaligned data microtrap and the Cache parity error microtrap.
    In the case of the unaligned data microtrap, the microtrap is executed
    as soon as all of the data of the aligned longword is accessed.
    For a Cache parity error microtrap, the reference is only aborted if it
    is a read reference. Otherwise, the function is executed regardless of
    the cache parity error."

    ... and it continues on to describe the various microtraps:
    TLB miss, protection violation, page crossing, unaligned data,
    odd address in PDP-11 mode, parity error in TLB, cache or SBI.

    It even uses a microtrap to handle setting the PTE's M or Modify bit
    and write the change back to memory.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Mar 17 21:43:57 2026
    From Newsgroup: comp.arch

    John Dallman <jgd@cix.co.uk> schrieb:
    In article <10p75jk$1dmen$1@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    I remember reading the manuals on Bitsavers for several 360
    emulation options. I don't recall any 7090 emulator which involved
    turning memory parity off in order for it to work!

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

    The 360s were built out of hundreds of small circuit cards with discrete components on them. Those included very low-integrated circuits, with
    maybe 10 transistors at most. The orders to the designers of the
    different models were to microcode everything, unless they could improve price:performance by 30% or more with dedicated circuitry.

    The microcoded world lasted until the RISC revolution of the 1980s, when integrated circuits were providing at least 10,000 times more transistors.

    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelining.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Mar 17 22:07:34 2026
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    The microcoded world lasted until the RISC revolution of the 1980s, when
    integrated circuits were providing at least 10,000 times more transistors.

    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelining.

    That's partly it but I think it was more that the 801 and the PL.8
    compiler were developed together. They had the insight that if you
    decomposed complicated instructions into simpler ones, the compiler
    now could optimize them and some of those instructions were
    optimized away. It certainly didn't hurt that the 801's cache could
    provide an instruction every cycle so it was as fast as microcode
    would be.

    While the early FORTRAN compilers did optimizations that are still
    quite respectable, the other 1960s compilers were not very
    sophisticated and the instruction sets reflected that. For example, the
    360's EDIT instructions are basically the COBOL picture formatter.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Mar 17 17:57:08 2026
    From Newsgroup: comp.arch

    On 3/17/2026 2:25 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------
    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...
    Great: PDP-11 but died due to address space limitations
    Brilliant: VAX but died because complexity limits perf over time

    PDP-11 was good for the time.
    Vs RISC-V or similar, would not be so good.

    Well, vs RV, at least excluding the RV hair that they don't fix.
    And the mess that is the encoding of the JAL instruction.



    M68K started turning into a turd,

    by adding too much complexity between 010 and 020


    Among other things.
    The split between A/D registers probably not a great idea.


    but can't really be faulted that much,
    as its design was pretty close to a direct evolution of the PDP-11;
    which was quite influential.


    It is like, from PDP-11:
    Split into A/D regs => M68K.
    Add a bunch of stuff and expand 16 regs and 32b => VAX;
    Switch to 16 regs and make more convoluted => MSP430;
    Switch to 16 regs, 32b, remove stuff, and make Load/Store => SuperH;
    Also the SuperH path mostly eliminated status flags, except S/T.
    ...


    The x86 line as its own path, very different.
    8080 -> 8086 -> i286 -> i386 -> x86-64
    Add 16 bit registers, 16b ops, and 20-bit address space -> 8086
    Add 16 bit protected mode -> 286
    Make 32-bit and add page tables -> 386
    Add registers via REX prefixes and make 64-bit: x86-64.
    8080 -> Z80 -> ...

    Then:
    6502 -> 65C816 -> mostly dead

There seems to be non-zero similarity between 6502 instruction names and
PDP-11, almost as if it were intended as a "what if we made it 8-bit
with fewer registers and 8-bit instructions" option.


    Along the PDP-11 to SuperH path:
    Eliminate ALU flags;
    Switch from pushing/popping PC to a link register;
    Switch to being mostly Load/Store;
    Switch from 8 to 16 registers;
    ...

    Along the SH-2 to SH-4 path:
    Switch interrupts from vector-table and stack-based exceptions, to
    computed address and special registers;
    Add bank-swap for scratch registers/etc in ISR;
    Add an FPU.

    Along the SH-4 to XG1 path:
    Switch to 64 bits;
    Switch to 32 GPRs;
    Drop auto-increment and various niche addressing modes;
    Mostly eliminate register bank swaps for ISRs;
    Glorified computed branch and SR trickery.
    Simplify the FPU design.

    XG1 to XG2:
    Switch from 16/32/64/96 bit instructions to 32/64/96.

    XG2 to XG3:
    Move instruction bits around, hot-glue ISA onto RV64G.

    ...


    With possible paths:
    Conclude RV path failed?
    Switch back to solely XG1 and XG2.
Conclude RV path is winner?
    Replace XG1 with RV64GC;
    Use XG3 as speed-oriented case.
    Or, an XG3L path:
    Conclude XG3 is the winner, but drop RV.


    The XG3L path, while good for performance, is less good for code density.



As can be noted, RV64GC+Jx is a good option for code density; using
RV64GC + Jumbo Prefixes works well on that front.

    A little more can be squeezed on the code density front by having a few
    48-bit ops:
    LI, ADDI, and SHORI, with an Imm32.

The rest of the 48-bit space is more debatable, and then one can note
that Huawei and Qualcomm apparently did their own mutually incompatible
versions of this; and neither matches how I would use this space if I
were to use it.

    Both companies went for using these mostly for Imm32 ops, but the space
    gets used up *very* quickly if one goes for Imm32 (only enough encoding
    space for a handful of 2RI ops).
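As a rough sketch of why the LI/SHORI pairing pays off (semantics assumed here to be SH-5-style: shift the register left and OR in the immediate; the Python model is illustrative, not any shipped encoding), building a 64-bit constant takes LI plus three 16-bit SHORIs, versus LI plus a single SHORI at Imm32 width:

```python
MASK64 = (1 << 64) - 1

def shori(reg: int, imm: int, width: int = 16) -> int:
    """Model of a SHORI-style op (assumed SH-5-like semantics):
    shift the register left by `width` and OR in an immediate
    of that width, truncating to 64 bits."""
    return ((reg << width) | (imm & ((1 << width) - 1))) & MASK64

target = 0x0123456789ABCDEF

# 16-bit chunks: LI + 3x SHORI (four instructions)
r = 0x0123                          # LI
for chunk in (0x4567, 0x89AB, 0xCDEF):
    r = shori(r, chunk)
assert r == target

# Imm32 chunks: LI + 1x SHORI (two 48-bit instructions)
r = shori(0x01234567, 0x89ABCDEF, width=32)
assert r == target
```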


Downside is that, while RV-C is good for code-density, as-is both RV-C
and XG1's 16-bit ops negatively affect performance when targeting
code-density.

    Best case for performance seems to be:
    32/64/96 bit encodings,
    with 32-bit as the priority case.

    ...



    I could in theory make a RISC-V Linux build work on my stuff, but it
looks like the big hassle would be to effectively write a UEFI BIOS and
    then run a bunch of stuff in "firmware".

    This is partly a hassle as what I would need for the firmware to pull
    off UEFI has little hope of fitting in ROM in an FPGA, so basically it
    needs a multi-stage boot process.

    Well, and also the annoyance of needing to go and implement UEFI.
Turns out though that the non-standard HW interfaces can be glossed over:
Linux on RISC-V apparently mostly leverages UEFI interfaces rather than
direct hardware interop (or, IOW, "We heard you like operating systems,
so how about an operating system to run your operating system?...").

Though, I may need to make the default CPU mode be RV64GC, partly to
work around an OS like Linux not being aware that any non-RV64GC modes
exist.


But, ATM, it is debatable if it is worth the hassle (or the respective
"loss of point" in trying to use all this to just run an RV64GC Linux
build).

    Well, except that maybe running Linux would be more "cred" than just
    running things like ports of Doom and similar (and in this sense maybe
    means there is more point in trying to get UEFI implemented).

    ...


  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Mar 18 00:20:36 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> wrote:

It is like, there are several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

People like to criticise x86, but IMO 386 was at least as sane
as the ones you call sane. Basically, the core 386 instruction set
is really nice: you can put "full word" constants in instructions,
you can do operations with memory operands, there is the SIB
addressing mode. On the minus side there are several legacy
instructions from 8086 which decrease code density, insane
shifts, and a small number of registers. The stack-based FPU is
less convenient than a register-based one, but can do the work.
There are segments, but if the OS wishes so they can be almost
forgotten. Compare this with the original ARM: it loses code
density due to condition bits, and has limited and strangely
encoded constants. It lacks operations on memory. Unaligned loads
produce insane results.

x86-64 has a funky instruction encoding and, due to the 64-bit word
length, constants no longer have full word range, but it still
has a better range of constants than its competitors. 16 registers
is a smaller number than most RISC competitors have, but given the
possibility of operations on memory it is adequate. And it
got a normal register-based FPU.

    People despise x86 segmentation, but several RISC-s had their
    own segmentation schemes. Such schemes were hidden from users
    by the OS but could have negative impact, like limitation of
    array size to 1GB. Special cases of floating point operations
    were frequently trapped and handled by interrupt handler,
    which is rather nasty for high performance software.

I would say that there are real architectures where designers
make justified compromises. They may not look super nice,
but they are the best things that can survive. I would count
386 and x86_64 in this camp. Then you have ones that go too
far in some direction, like the early "no compromise" RISC-s and
the VAX. RISC-s were able to compromise later; the VAX did not
manage that.
    --
    Waldek Hebisch
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Mar 17 20:30:26 2026
    From Newsgroup: comp.arch

    On 2026-03-16 5:16 a.m., BGB wrote:
    On 3/15/2026 9:35 AM, John Dallman wrote:
    In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi)
    wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current
    iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction
    bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.
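For a sense of what bit-granular instruction addressing costs the decoder, here is a hypothetical Python model of fetching an n-bit field at an arbitrary bit offset (the bit order is one plausible convention, not documented 432 behavior):

```python
def fetch_bits(code: bytes, bit_off: int, nbits: int) -> int:
    """Fetch `nbits` starting at absolute `bit_off` within a code
    segment, LSB-first within each byte (an assumed convention).
    With a 16-bit bit offset, a segment tops out at 64K bits = 8KB."""
    val = 0
    for i in range(nbits):
        pos = bit_off + i
        # every field fetch needs a byte index plus a shift within it
        bit = (code[pos >> 3] >> (pos & 7)) & 1
        val |= bit << i
    return val

code = bytes([0b10110100, 0b00000110])
# a 6-bit field starting at bit offset 5 straddles the byte boundary
print(fetch_bits(code, 5, 6))  # 53
```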


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...


    I guess, probably similar story as the Gen-Z people and the IA-64.

    But, being of an older generation, IA-64 hype was going on when I was in high-school, and x86-64 also came on the scene. Some other people were
    on Team IA-64, but I was on Team x86-64, mostly because as I could see
    it at the time, it seemed like the writing was already on the wall for
    IA-64 even back then.

    In this case, I ended up being right...

    IIRC, the thinking was, for IA-64:
      People already knew of its performance woes;
      It was also very much more expensive.

    On the other side of things, the Athlon64 was already on the horizon;
    and ended up getting one post graduation.


It is like, there are several ways things can go:
      Abysmal Turd: iAPX 432
      Turd: IA-64
      Turd that flies: x86, x86-64
      Sane: ARM, RISC-V, ...
      Sane, but still died: PowerPC, MIPS, SPARC, ...


    M68K started turning into a turd, but can't really be faulted that much,
    as its design was pretty close to a direct evolution of the PDP-11;
    which was quite influential.


Well, and go back pretty far, and my current ISA design also follows an
evolution tree that reaches back to the PDP-11, despite having almost
nothing in common (and by sorta colliding with RISC-V, it also sorta
half-way hybridizes with the MIPS family).

Then again, maybe being a product of engineering rather than
naturalistic evolution sorta excludes it from the normal rules of
phylogeny. But, then again, can one prove that even nature obeys them?...

Say, for example, what if you ended up with leaf slugs that could
natively reproduce chloroplasts: would they still be purely animals, or
would their stolen DNA and chloroplasts make them partially
descended from algae? (As IIRC they had gained some DNA from the algae
via horizontal gene transfer or similar, but need to scavenge
chloroplasts from plant cells rather than making their own.)

    ...

Some of those architectures are not really dead, but are just not as
popular as x86 or ARM. I would only consider some of the very earliest
machines "deceased", where the number of users can probably be counted
on one hand. At what count is an architecture deceased?

Last I looked PowerPC was still in use, the 65xx's are still in use,
and for 68k, have you seen the AC68080? I suspect there is a certain
user base.

    For the 68k I thought splitting the register file into A and D registers
    was a reasonable way to be able to encode 16 regs using only 3-bit
    register codes. Sorta like splitting the register files between integer
    and floats.

    Splitting register files is a headache for compilers. If one wants a lot
    of registers available it is better to use a wide opcode IMO.

    I wonder if it is possible to come up with a micro-op set that can
    encode a variety of instructions from earlier machines. ISA is somewhat
    less relevant when a micro-op engine is used.
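A toy model of that idea, with invented micro-op names and formats: a cracker that lowers a reg-mem ALU instruction into AGEN/LOAD/ALU micro-ops, which is roughly how x86-style cores handle memory operands.

```python
def crack(insn: dict) -> list:
    """Lower one architectural instruction into micro-ops.

    A memory-operand ALU op becomes AGEN + LOAD + ALU; a pure
    register op passes through unchanged. The instruction and
    micro-op formats here are invented for illustration.
    """
    if insn["mem"] is None:
        return [("alu", insn["op"], insn["dst"], insn["src"])]
    base, disp = insn["mem"]
    return [
        ("agen", "t0", base, disp),             # compute effective address
        ("load", "t1", "t0"),                   # read the memory operand
        ("alu", insn["op"], insn["dst"], "t1"), # do the actual ALU work
    ]

# x86-style `add eax, [ebx+8]`
print(crack({"op": "add", "dst": "eax", "src": None, "mem": ("ebx", 8)}))
```

Different front-end ISAs would then just be different `crack` tables feeding one micro-op back end.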



  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Mar 18 01:08:56 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    John Dallman <jgd@cix.co.uk> schrieb:
    In article <10p75jk$1dmen$1@dont-email.me>, quadibloc@ca.invalid (quadi)
    wrote:

    I remember reading the manuals on Bitsavers for several 360
    emulation options. I don't recall any 7090 emulator which involved
    turning memory parity off in order for it to work!

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

    The 360s were built out of hundreds of small circuit cards with discrete
    components on them. Those included very low-integrated circuits, with
    maybe 10 transistors at most. The orders to the designers of the
    different models were to microcode everything, unless they could improve
    price:performance by 30% or more with dedicated circuitry.

    The microcoded world lasted until the RISC revolution of the 1980s, when
    integrated circuits were providing at least 10,000 times more transistors.

    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
I and D cache also played a role, of course, as did pipelining.

Tanenbaum in his book about computer architecture mentions results
of Bell and Newell from 1971. IIUC Bell and Newell coded
some programs in the microcode of the 2025 (the microcode engine of
the 360/25). The claim is that such a program ran 45 times faster
than a program using the 360 instruction set. They also created a
Fortran compiler/interpreter combination with the interpreter coded
in 2025 microcode. They claimed that this Fortran ran at a speed
comparable to "native" Fortran on a 360/50.

One can draw different conclusions from this. Like Tanenbaum you
can claim that there is a need for more specialized microcode.
But you can also realize that by making a saner version of the
microcode level and allowing compilers to target it (possibly via a
specialized interpreter) one can gain a lot of performance. The second
approach leads to RISC-like designs.

I think that in 1970 designers knew that if a machine is simple
enough, then a hardwired design will be faster than a microcoded
one. But if for compatibility reasons one had to implement a
design that was too complex to directly implement in the available
hardware, then a microcoded design had the advantage. And the ability
to offer "the same" architecture on machines of varying sizes
was seen as a big advantage.

AFAICS at least part of the motivation for microcode was the
realization that hardware could be made simpler and perform better if
matched with appropriate software. But simpler and presumably cheaper
hardware would mean less money to the hardware folks and more to
software vendors. Microcode was a way for hardware vendors to
get a bigger part of the pie, by doing things that otherwise
software folks would do.

So, I think that technical people realized around 1971 that
a RISC-like approach could be technically superior, but for
several following years microcode had the business advantage.
    --
    Waldek Hebisch
  • From quadi@quadibloc@ca.invalid to comp.arch on Wed Mar 18 01:41:27 2026
    From Newsgroup: comp.arch

    On Tue, 17 Mar 2026 09:09:03 -0700, Stephen Fuld wrote:
    On 3/17/2026 1:00 AM, quadi wrote:

    So if there's a need for instructions acting on short floating-point
    numbers, or UTF-8 strings, that the basic instruction set has not
    covered,
    it will be possible to extend the instruction set to deal with it.
    There should be enough room for it to meet the demands placed on it not
merely in years to come, but even centuries or millennia. (Although, as
    Moore's Law peters out, it may not be possible to put circuitry for so
    many different kinds of instructions on a single die!)

So you have produced a larger, hence more expensive chip. And you then
expect your chip's potential users to pay more for a chip which, you
admit, has features that a particular application, particularly an
embedded one, will never use.

    Doesn't seem like a recipe for success. :-(

    Subset implementations are possible.

    But what _is_ the reasoning behind making the architecture so extensible
    as to allow for totally impractical implementations? What possible gain
    could come from it?

    Well, as I noted, new data types come into fashion.

    So if people start needing instructions for handling 8-bit floats, or
    UTF-8 strings... what's going to happen?
    Are they going to start using totally new ISAs which have instructions for these kinds of data?
    I want to avoid the need to do that. I want to provide an ISA with staying power, one that has room to grow. Despite my having squeezed the opcode
    space so much that there's hardly any of it left!
    This helps to diminish the validity of the criticism that I shouldn't have squeezed the opcode space that much.

    John Savard
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Mar 17 18:52:43 2026
    From Newsgroup: comp.arch

    On 3/17/2026 3:07 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
The microcoded world lasted until the RISC revolution of the 1980s, when
integrated circuits were providing at least 10,000 times more transistors.
    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelinging.

    That's partly it but I think it was more that the 801 and the PL.8
    compiler were developed together. They had the insight that if you decomposed complicated instructions into simpler ones, the compiler
    now could optimize them and some of those instructions were
    optimized away. It certainly didn't hurt that the 801's cache could
    provide an instruction every cycle so it was as fast as microcode
    would be.

    While the early FORTRAN compilers did optimizations that are still
    quite respectable, the other 1960s compilers were not very
    sophisticated and the instruction sets reflected that. For example, the
    360's EDIT instructions are basically the COBOL picture formatter.

    So the instruction set reflected the compiler's need for picture
    formatting, and optimized that. :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Mar 17 18:59:16 2026
    From Newsgroup: comp.arch

    On 3/17/2026 6:08 PM, Waldek Hebisch wrote:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    John Dallman <jgd@cix.co.uk> schrieb:
In article <10p75jk$1dmen$1@dont-email.me>, quadibloc@ca.invalid (quadi)
wrote:

    I remember reading the manuals on Bitsavers for several 360
    emulation options. I don't recall any 7090 emulator which involved
    turning memory parity off in order for it to work!

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

The 360s were built out of hundreds of small circuit cards with discrete
components on them. Those included very low-integrated circuits, with
maybe 10 transistors at most. The orders to the designers of the
different models were to microcode everything, unless they could improve
price:performance by 30% or more with dedicated circuitry.

The microcoded world lasted until the RISC revolution of the 1980s, when
integrated circuits were providing at least 10,000 times more transistors.
    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelinging.

Tanenbaum in his book about computer architecture mentions results
of Bell and Newell from 1971. IIUC Bell and Newell coded
some programs in the microcode of the 2025 (the microcode engine of
the 360/25). The claim is that such a program ran 45 times faster
than a program using the 360 instruction set. They also created a
Fortran compiler/interpreter combination with the interpreter coded
in 2025 microcode. They claimed that this Fortran ran at a speed
comparable to "native" Fortran on a 360/50.

One can draw different conclusions from this. Like Tanenbaum you
can claim that there is a need for more specialized microcode.
But you can also realize that by making a saner version of the
microcode level and allowing compilers to target it (possibly via a
specialized interpreter) one can gain a lot of performance. The second
approach leads to RISC-like designs.

I think that in 1970 designers knew that if a machine is simple
enough, then a hardwired design will be faster than a microcoded
one. But if for compatibility reasons one had to implement a
design that was too complex to directly implement in the available
hardware, then a microcoded design had the advantage. And the ability
to offer "the same" architecture on machines of varying sizes
was seen as a big advantage.

AFAICS at least part of the motivation for microcode was the
realization that hardware could be made simpler and perform better if
matched with appropriate software. But simpler and presumably cheaper
hardware would mean less money to the hardware folks and more to
software vendors.

But back in the late 1960s and 70s, the hardware vendors gave the basic
software, i.e. the OS, compilers, and utilities, away for free.



Microcode was a way for hardware vendors to
get a bigger part of the pie, by doing things that otherwise
software folks would do.

So, I think that technical people realized around 1971 that
a RISC-like approach could be technically superior, but for
several following years microcode had the business advantage.

    And in that time period, backward compatibility was becoming important,
    so new architectures had that hurdle to overcome.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Wed Mar 18 00:18:53 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:

    On Tue, 17 Mar 2026 20:18:00 +0000, John Dallman wrote:

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

    It is true that nearly all System/360 models, except the Model 75 and the 91/95/195, were microcoded.

    There was also the 360/44 if I am not mistaken.
  • From Tim Rentsch@tr.17687@z991.linuxsc.com to comp.arch on Wed Mar 18 00:21:05 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:

    [...]

    And in that time period, backward compatibility was becoming
    important, so new architectures had that hurdle to overcome.

    When I first saw this I misread it as hackward compatibility. :)
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Mar 18 15:01:40 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/17/2026 3:07 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
The microcoded world lasted until the RISC revolution of the 1980s, when
integrated circuits were providing at least 10,000 times more transistors.
    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelinging.

    That's partly it but I think it was more that the 801 and the PL.8
    compiler were developed together. They had the insight that if you
    decomposed complicated instructions into simpler ones, the compiler
    now could optimize them and some of those instructions were
    optimized away. It certainly didn't hurt that the 801's cache could
    provide an instruction every cycle so it was as fast as microcode
    would be.

    While the early FORTRAN compilers did optimizations that are still
    quite respectable, the other 1960s compilers were not very
    sophisticated and the instruction sets reflected that. For example, the
    360's EDIT instructions are basically the COBOL picture formatter.

    So the instruction set reflected the compiler's need for picture
    formatting, and optimized that. :-)

    The contemporaneous Burroughs B3500 also had an instruction (EDT)
    for picture formatting.

    "The Edit instruction moves digits or characters (depending on the **A**
    address controller) from the **A** field to the **C** field under control of
    the edit-operators in the **B** field. Characters may be moved, inserted or
    deleted according to the edit-operators. Data movement and editing are
    stopped by the exhaustion of edit-operators in the **B** field."

    The edit instruction uses an edit table that is located in memory locations **48**-**63** relative to Base #0. This table may be initialized to any desired set of insertion characters. By convention all compilers build
    a default insertion table containing:

Entry #   Base 0 Address   Character   Description
0         48               +           Positive Sign
1         50               -           Negative Sign
2         52               *           Check Suppress Character
3         54               .           Decimal Point
4         56               ,           Thousands Separator
5         58               $           Currency Symbol
6         60               0           Leading zero fill Character
7         62               <blank>     Blank Fill Character


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Mar 18 16:00:41 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Tue, 17 Mar 2026 09:09:03 -0700, Stephen Fuld wrote:
    On 3/17/2026 1:00 AM, quadi wrote:

    So if there's a need for instructions acting on short floating-point
    numbers, or UTF-8 strings, that the basic instruction set has not
    covered,
    it will be possible to extend the instruction set to deal with it.
    There should be enough room for it to meet the demands placed on it not
merely in years to come, but even centuries or millennia. (Although, as
    Moore's Law peters out, it may not be possible to put circuitry for so
    many different kinds of instructions on a single die!)

So you have produced a larger, hence more expensive chip. And you then
expect your chip's potential users to pay more for a chip which, you
admit, has features that a particular application, particularly an
embedded one, will never use.

    Doesn't seem like a recipe for success. :-(

    Subset implementations are possible.

    But what _is_ the reasoning behind making the architecture so extensible
    as to allow for totally impractical implementations? What possible gain could come from it?

    Well, as I noted, new data types come into fashion.

    So if people start needing instructions for handling 8-bit floats, or
    UTF-8 strings... what's going to happen?
    Are they going to start using totally new ISAs which have instructions for these kinds of data?

    My 66000 Rev 2.0 ISA encoding has direct support for 4-sizes of floats.
    Only 3 currently defined {Double, single, half}.

    I want to avoid the need to do that. I want to provide an ISA with staying power, one that has room to grow. Despite my having squeezed the opcode space so much that there's hardly any of it left!

    Second sentence is admirable, third says you are not on the path.

    This helps to diminish the validity of the criticism that I shouldn't have squeezed the opcode space that much.

    John Savard
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Mar 18 16:01:42 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 3/17/2026 3:07 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
The microcoded world lasted until the RISC revolution of the 1980s, when
integrated circuits were providing at least 10,000 times more transistors.

    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelinging.

    That's partly it but I think it was more that the 801 and the PL.8
    compiler were developed together. They had the insight that if you decomposed complicated instructions into simpler ones, the compiler
    now could optimize them and some of those instructions were
    optimized away. It certainly didn't hurt that the 801's cache could provide an instruction every cycle so it was as fast as microcode
    would be.

    While the early FORTRAN compilers did optimizations that are still
    quite respectable, the other 1960s compilers were not very
    sophisticated and the instruction sets reflected that. For example, the 360's EDIT instructions are basically the COBOL picture formatter.

    So the instruction set reflected the compiler's need for picture
    formatting, and optimized that. :-)

    s/compiler's/COBOL's/



  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Mar 18 16:56:17 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 3/17/2026 3:07 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
The microcoded world lasted until the RISC revolution of the 1980s, when
integrated circuits were providing at least 10,000 times more transistors.

    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelinging.

    That's partly it but I think it was more that the 801 and the PL.8
    compiler were developed together. They had the insight that if you
    decomposed complicated instructions into simpler ones, the compiler
    now could optimize them and some of those instructions were
    optimized away. It certainly didn't hurt that the 801's cache could
    provide an instruction every cycle so it was as fast as microcode
    would be.

    While the early FORTRAN compilers did optimizations that are still
    quite respectable, the other 1960s compilers were not very
    sophisticated and the instruction sets reflected that. For example, the
    360's EDIT instructions are basically the COBOL picture formatter.

    So the instruction set reflected the compiler's need for picture
    formatting, and optimized that. :-)

    s/compiler's/COBOL's/

    and RPG

  • From John Levine@johnl@taugh.com to comp.arch on Wed Mar 18 17:35:49 2026
    From Newsgroup: comp.arch

    According to Tim Rentsch <tr.17687@z991.linuxsc.com>:
    quadi <quadibloc@ca.invalid> writes:

    On Tue, 17 Mar 2026 20:18:00 +0000, John Dallman wrote:

    Think about it. If this is possible, the parity checks must be
    implemented by microcode.

    It is true that nearly all System/360 models, except the Model 75 and the
    91/95/195, were microcoded.

    There was also the 360/44 if I am not mistaken.

The 360/44 was an odd machine with a hardware implementation of a
    scientific subset of the 360's instruction set. It had optional
    priority interrupt and a high speed direct data channel intended for
    real-time data acquisition and control. It also had a knob to control
    floating point precision, so you could get less accurate answers
    faster, which I suppose was intended to help tight time budgets in
    real-time calculations.

    I don't think they made very many of them. The 16-bit rack mounted 1800
    was a lot more popular.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Mar 18 18:24:12 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 3/17/2026 3:07 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    The microcoded world lasted until the RISC revolution of the 1980s, when
    integrated circuits were providing at least 10,000 times more transistors.

    The 801 demonstrated (within IBM) that RISC was possible in the
    second half of the 1970s. The key there were fast caches which
    were fast enough to replace microcode storage. Separation of
    I and D cache also played a role, of course, as did pipelinging.

    That's partly it but I think it was more that the 801 and the PL.8
    compiler were developed together. They had the insight that if you
    decomposed complicated instructions into simpler ones, the compiler
    now could optimize them and some of those instructions were
    optimized away. It certainly didn't hurt that the 801's cache could
    provide an instruction every cycle so it was as fast as microcode
    would be.

    While the early FORTRAN compilers did optimizations that are still
    quite respectable, the other 1960s compilers were not very
sophisticated and the instruction sets reflected that. For example, the 360's EDIT instructions are basically the COBOL picture formatter.

    So the instruction set reflected the compiler's need for picture
    formatting, and optimized that. :-)

    s/compiler's/COBOL's/

    and RPG

    and PL/1
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Mar 18 13:51:34 2026
    From Newsgroup: comp.arch

    On 3/17/2026 2:25 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current
    iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...

    432 is worthy of study if only to figure out "what not to do".


    Yeah, make it bad enough, and it becomes an example piece.


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint

    Indeed. MIPS had software TLB[*], SPARC had register windows [**]
Alpha had a weak [***] memory ordering. All survive
    for backward compatibility and are generally
    eschewed for new designs.

    [*] Didn't perform well in large scale SMP without
    tricks like page coloring. The virtually tagged
    caches were troublesome.

    [**] Register windows. Need I say more?

    [***] Difficult to program multithreaded apps correctly, difficult to
    port software to from other more strongly ordered
    architectures.

    These features are non-ideal (except partly Software TLB, which does
    have some useful merits to go along with its drawbacks).


    They lack the obvious drawbacks of x86:
    Nightmarish encoding scheme;
    Mostly 2R / 2RI only;
    Not enough registers / absurd levels of spill-and-fill needed;
    While x86 had 8 registers, they weren't true GPRs;
    All had fixed behaviors in certain instructions;
    And, various edge-case limitations;
    ...
    ...

    So, "turd that flies".

    There was a strategy that sorta works OK on x86:
    Use EAX/ECX/EDX as temporary working registers;
    Use ESI/EDI as cached variables;
    ...
    If one tries to generate similar types of code for another target, it
    performs poorly.

    Generate this sort of code on x86, and it is merely ~ 40% slower than if
    one did an optimized build.

One of my earlier forays into non-x86 code generation (before my
    current projects) had involved trying to generate code for ARM on a RasPi;
    had generated it in a similar way to what I did on x86;
    Promptly observed that it performed horribly (around an order of
    magnitude slower than code compiled with GCC).


Likewise, if typical x86 code were run strictly in-order and with full pipeline stalls as needed, it would also perform horribly. The code has way
too many loads/stores, too many register dependencies, etc.

So, absent some heavy lifting, it would be solidly in CPI territory and
not IPC.



    Like, being limited to 2R and excessive spill/fill is likely a worse
    issue than, say:
    Register windows (SPARC/etc);
Weird bit-sliced branches (MIPS).

    Alpha initially lacked Byte/Half memory operations, etc, which also
    wasn't great.



    From a "design elegance" stance, IA-64 wasn't that bad;
    In terms of practical issues, its design was lacking.

    Code density was pretty bad; Its Inverted-Page-Table was almost "worst
    of both worlds", essentially just a set-associative software-managed
    TLB, but held in RAM. So you still need the miss-handler interrupts, and
    also the MMU needs to access RAM.
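The hashed-table lookup being described can be sketched in C. This is a toy, direct-mapped illustration of the general idea (not the actual IA-64 VHPT format or entry layout): the walker hashes the virtual page number into a table in RAM, and a tag mismatch still falls back to a software miss handler.

```c
#include <assert.h>
#include <stdint.h>

/* Toy sketch of an in-RAM hashed page table (hypothetical layout,
 * not IA-64's real VHPT).  A hit still costs a memory access; a
 * miss still costs a software fault -- the "worst of both worlds"
 * point made above. */
#define ENTRIES 1024

typedef struct {
    uint64_t vpn_tag;   /* full virtual page number, for tag check */
    uint64_t pfn;       /* physical frame number                   */
    int      valid;
} hpt_entry;

static hpt_entry hpt[ENTRIES];

static uint64_t hash_vpn(uint64_t vpn) { return vpn % ENTRIES; }

static void hpt_insert(uint64_t vpn, uint64_t pfn)
{
    hpt_entry *e = &hpt[hash_vpn(vpn)];
    e->vpn_tag = vpn;
    e->pfn     = pfn;
    e->valid   = 1;
}

/* Returns 1 and fills *pfn on a hit; 0 means "raise a TLB-miss
 * fault and let the software handler walk the real page tables". */
static int hpt_lookup(uint64_t vpn, uint64_t *pfn)
{
    hpt_entry *e = &hpt[hash_vpn(vpn)];
    if (e->valid && e->vpn_tag == vpn) {
        *pfn = e->pfn;
        return 1;
    }
    return 0;
}
```

Two virtual pages that hash to the same slot evict each other, which is why a real design adds associativity; but the miss path, and its interrupt, never goes away.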

    ...

    Contrast, say, major time wasters in RISC-V:
    Need to do arithmetic or similar for indexed loads in RV64G;
    Can be dealt with by indexed load;
    Dealing with "imm/disp doesn't fit in imm12/disp12" issues;
    Can be dealt with by jumbo prefixes;
    ...

    Addressing a few of these issues makes it around 30% or so faster for a
    naive in-order implementation when running things like Doom and similar.

    Or, ironically, a few things that x86 does have...



    But, then people are left debating if it matters for OoO when running
    the SPEC benchmark and similar, and there is currently a rising mindset
of "any CPU that cares about performance will be OoO, so the relative inefficiencies don't matter...".

    While, seemingly ignoring the "mid range" that exists mostly in the
    "budget cell-phone" space and similar, where they still care some about performance but in-order dominates.

    Like, it is not a hard split between big server chips, and small microcontrollers.


    But, I guess some people did some testing, and noted that for SPEC, the
    delta of indexed load/store dropped to closer to around 5% on an
    in-order CPU when supported natively, which is maybe "not quite big
    enough" for some people.

    Well, and also it has a negligible effect on Dhrystone.

    ...


    But, alas...

    ...

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Mar 18 19:55:25 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 3/17/2026 2:25 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    ---------------
    They lack the obvious drawbacks of x86:
    Nightmarish encoding scheme;

    As someone who worked on both SPARC v8 and V9 and x86 and x86-64,
    I found SPARC VIS worse than x86-64

    Mostly 2R / 2RI only;
    Not enough registers / absurd levels of spill-and-fill needed;

x86-64 fixed most of that with 8 more GPRs (16 in total).

    While x86 had 8 registers, they weren't true GPRs;
    All had fixed behaviors in certain instructions;
    And, various edge-case limitations;

    Lesson:: don't do that to your ISA.

    ------------
    Like, being limited to 2R and excessive spill/fill is likely a worse
    issue than, say:
    Register windows (SPARC/etc);

    Sometimes, sometimes not.

Weird bit-sliced branches (MIPS).

    Lesson: don't do this to your ISA.

    Alpha initially lacked Byte/Half memory operations, etc, which also
    wasn't great.



    From a "design elegance" stance, IA-64 wasn't that bad;
    In terms of practical issues, its design was lacking.

    Err, no.

    ------------
    Contrast, say, major time wasters in RISC-V:
    Need to do arithmetic or similar for indexed loads in RV64G;
    Can be dealt with by indexed load;
    Dealing with "imm/disp doesn't fit in imm12/disp12" issues;
    Can be dealt with by jumbo prefixes;
    ...

    Addressing a few of these issues makes it around 30% or so faster for a naive in-order implementation when running things like Doom and similar.

    I got closer to 40% by having the ISA designed "properly"

    Or, ironically, a few things that x86 does have...

    This is my experience, [Rbase+Rindex<<scale+Disp] is 1 gate delay slower
    than [Rbase+DISPsmall] and [Rbase+Rindex] saving an instruction every time
    it gets used in full form.
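The comparison can be made concrete in C (a hypothetical illustration of the address arithmetic, not anyone's actual codegen): a scaled-index mode folds into one memory operation what a base+displacement-only ISA must compute with separate instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address formed by a single scaled-index mode,
 * [Rbase + Rindex<<scale + Disp], as on x86 or My 66000. */
static uint64_t ea_scaled(uint64_t base, uint64_t index,
                          unsigned scale, int64_t disp)
{
    return base + (index << scale) + (uint64_t)disp;
}

/* The same address on a base+disp-only ISA such as RV64G:
 * a shift, an add, then the load's own small displacement. */
static uint64_t ea_two_step(uint64_t base, uint64_t index,
                            unsigned scale, int64_t disp)
{
    uint64_t tmp = index << scale;   /* e.g. slli               */
    uint64_t ptr = base + tmp;       /* e.g. add                */
    return ptr + (uint64_t)disp;     /* folded into the ld imm  */
}
```

Both compute the same address; the point at issue is that the first form spends zero extra instructions (at roughly one gate of extra delay in the AGU), while the second spends two per use unless the compiler can hoist them.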

    But, then people are left debating if it matters for OoO when running
    the SPEC benchmark and similar, and there is currently a rising mindset
of "any CPU that cares about performance will be OoO, so the relative inefficiencies don't matter...".

    RISC-V has a 30% longer code latency than My 66000 in 32-bit VAS, and
    when considering 64-bit VAS and needing both indexing and displacements
    My 66000 uses 6× fewer instructions; this kind of addressing is in play
    when accessing (or calling) through GOT.

    While, seemingly ignoring the "mid range" that exists mostly in the
    "budget cell-phone" space and similar, where they still care some about performance but in-order dominates.

    Like, it is not a hard split between big server chips, and small microcontrollers.


    But, I guess some people did some testing, and noted that for SPEC, the delta of indexed load/store dropped to closer to around 5% on an
    in-order CPU when supported natively, which is maybe "not quite big
    enough" for some people.

    The code footprint one saves pays for itself in the Icache, done right
    it also pays for itself in the DCache.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Mar 18 21:08:50 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 3/17/2026 2:25 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    Even I, who has indeed in Concertina II designed what has been in
some of its iterations - and which has again become in the current iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...

    432 is worthy of study if only to figure out "what not to do".


    Yeah, make it bad enough, and it becomes an example piece.


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint

    Indeed. MIPS had software TLB[*], SPARC had register windows [**]
Alpha had a weak [***] memory ordering. All survive
    for backward compatibility and are generally
    eschewed for new designs.

    [*] Didn't perform well in large scale SMP without
    tricks like page coloring. The virtually tagged
    caches were troublesome.

    [**] Register windows. Need I say more?

    [***] Difficult to program multithreaded apps correctly, difficult to
    port software to from other more strongly ordered
    architectures.

    These features are non-ideal (except partly Software TLB, which does
    have some useful merits to go along with its drawbacks).


    They lack the obvious drawbacks of x86:
    Nightmarish encoding scheme;

    Nightmarish from what perspective? The RTL designer?
    The OS developer? The end user?

    Mostly 2R / 2RI only;
    Not enough registers / absurd levels of spill-and-fill needed;

    Somewhat addressed by x86_64.

    It was, after all, a 1970s design.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Thu Mar 19 00:20:00 2026
    From Newsgroup: comp.arch

    In article <10p8i54$1rmt5$1@dont-email.me>, cr88192@gmail.com (BGB)
    wrote:

    Sane, but still died: PowerPC, MIPS, SPARC, ...

    PowerPC is not dead yet. IBM still sells a fair number of POWER systems,
    and IBM i (formerly AS/400, formerly System/38) runs on it.

    MIPS died of Itanium. SGI swallowed the Itanium hype and stopped MIPS development (they owned MIPS at the time). It never caught up once it was restarted.

    SPARC died by stages. Sun couldn't afford to keep development at a high
    pitch, and the register windows did not help. Once Oracle owned them, the
    high cost of chip development caused Oracle to cancel it. They claimed
    they would support running SPARC Solaris until 2030-something, but got
    annoyed when asked how this would be accomplished. You can't keep ISVs
    that way.

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu Mar 19 05:13:48 2026
    From Newsgroup: comp.arch

    On Wed, 18 Mar 2026 16:00:41 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    I want to avoid the need to do that. I want to provide an ISA with
    staying power, one that has room to grow. Despite my having squeezed
    the opcode space so much that there's hardly any of it left!

    Second sentence is admirable, third says you are not on the path.

    Since you are the expert, I can't really argue. But I will make some
    points in my own defense, even so.

    I've said that I've used up the opcode space. That's true of the basic
    opcode space when no header is used, so that there are only 32-bit instructions (and some pairs of 16-bit instructions).

    Except:

    1) This is true in so far as instruction formats are concerned. However
    some types of instructions, such as 32-bit two-operand register-to-
    register instructions, have long opcodes. So there likely will be space
    for new opcodes. But this, in itself, only allows the ISA to be expanded
    to new types of operations... on data that will fit into the existing
    register banks.

    2) Since it's only the no-header instructions - and with a header it's possible to add _hundreds_ of complete new instruction sets, then plenty
    of new stuff can be added, *but* at the price of being given higher
    overhead costs. So the new stuff has a lower priority.
    However, in this ISA, lower priority means, at worst, 25% overhead.
    Compare that to architectures similar to RISC-V, where the basic
    instructions are 32 bits long... but plenty of new stuff can be added, as
    long as you only use 64-bit instructions for it. That's 50% overhead.

    3) This ability to increase overhead costs in incremental steps, of
    course, comes from the header mechanism.
    Yes, I'll admit it's ugly to have to contend with a block organization of programs. But:
    - it's not without precedent. The Itanium and the TMS 320C2000 do this
    sort of thing. So people _have_ written compilers for such architectures; there's a body of experience to draw on.
    - it has the benefit of allowing what you have identified as a useful
    feature, making it easy for instructions to use full-length immediate
    values of all types, instead of having to load constants as data from
    memory, without making decoding the lengths of instructions overly complicated.
    - it also has the benefit of allowing the classic feature of VLIW architectures; explicitly indicating when instructions can be executed in parallel, and also allowing instruction predication as an alternative to branching. As you've correctly pointed out, though, while this addresses register hazards, it can't handle cache misses, which are unpredictable
    under many circumstances. I would argue, though, that this is still useful whenever, for some reason, available gates are so constrained that a full
    OoO implementation is not an option.
    If a micro-embedded CPU can run a subset of the ISA of a powerful desktop computer, debugging is more convenient.
Maybe this advantage is illusory; after all, a specialized ISA for micro-embedded CPUs can be emulated with just-in-time compilation fast enough
    for a development platform on today's CPUs.

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Mar 19 06:56:08 2026
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    x86-64 has funky instruction encoding and due to 64-bit word
    length constants no longer have full word range,

    $ cat > foo.c
    unsigned long foo()
    {
    return 0x1234567890abcdef;
    }
    $ gcc -S foo.c
    $ cat foo.s
    .file "foo.c"
    .text
    .globl foo
    .type foo, @function
    foo:
    .LFB0:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    movabsq $1311768467294899695, %rax
    popq %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
    .LFE0:
    .size foo, .-foo
    .ident "GCC: (GNU) 16.0.0 20260111 (experimental)"
    .section .note.GNU-stack,"",@progbits
    $
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Mar 19 12:06:53 2026
    From Newsgroup: comp.arch

    On Thu, 19 Mar 2026 00:20 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <10p8i54$1rmt5$1@dont-email.me>, cr88192@gmail.com (BGB)
    wrote:

    Sane, but still died: PowerPC, MIPS, SPARC, ...

    PowerPC is not dead yet. IBM still sells a fair number of POWER
    systems, and IBM i (formerly AS/400, formerly System/38) runs on it.


    Pay attention at cadence:
    POWER3 - 1998
    POWER4 - 2001
    POWER5 - 2004
    POWER6 - 2007
    POWER7 - 2010
    POWER8 - 2014
    POWER9 - 2017
    POWER10 - 2021
We are in 2026 now and I don't hear about 11.

    MIPS died of Itanium. SGI swallowed the Itanium hype and stopped MIPS development (they owned MIPS at the time). It never caught up once it
    was restarted.

    Microarchitecture development on the high end stopped much earlier
    - all high-end MIPS chips after R10K were reusing the same
    microarchitecture.
On the low end, there were interesting things going on until relatively recently, until MIPS was finally flattened by the RISC-V bandwagon.


    SPARC died by stages. Sun couldn't afford to keep development at a
    high pitch, and the register windows did not help. Once Oracle owned
    them, the high cost of chip development caused Oracle to cancel it.

    Considering Oracle's financial situation at the time (very very good)
    it does not look like they can't afford it. They just didn't see a
    point.

    They claimed they would support running SPARC Solaris until
    2030-something, but got annoyed when asked how this would be
    accomplished. You can't keep ISVs that way.

    John


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Thu Mar 19 10:46:00 2026
    From Newsgroup: comp.arch

    In article <20260319120653.0000778b@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    POWER9 - 2017
    POWER10 - 2021
We are in 2026 now and I don't hear about 11.

    Announced last July, though nobody has done a wikipedia page yet:

    <https://www.redbooks.ibm.com/redbooks/pdfs/sg248590.pdf>

<https://newsroom.ibm.com/2025-07-08-ibm-power11-raises-the-bar-for-enterprise-it>

    Microarchitecture development on the high end stopped much earlier
    - all high-end MIPS chips after R10K were reusing the same
    microarchitecture.

    OK, yes. I had misremembered the dates. R10K was 1995.

    Considering Oracle's financial situation at the time (very very
    good) it does not look like they can't afford it. They just didn't
    see a point.

    Yup. They didn't see a way to make back the money development would cost.


    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Thu Mar 19 16:15:36 2026
    From Newsgroup: comp.arch

    On Thu, 19 Mar 2026 10:46:00 +0000, John Dallman wrote:
    In article <20260319120653.0000778b@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

POWER9 - 2017 POWER10 - 2021 We are in 2026 now and I don't hear about
    11.

    Announced last July, though nobody has done a wikipedia page yet:

    One of your links didn't display clickably in my newsreader, so I just Googled. At first, it seemed as if Power 11 was a new generation of
    servers, but not necessarily chips, but eventually I did find that, yes,
    the chips were definitely changed.

    The chief distinguishing feature of Power 11 is that it offers a lot more memory bandwidth. There appear to be multiple data buses coming from the
    chips - apparently *sixteen*. Which is a figure usually associated with
    things like the NEC SX-7.

    John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Mar 19 16:22:09 2026
    From Newsgroup: comp.arch

    quadi <quadibloc@ca.invalid> writes:
    On Thu, 19 Mar 2026 10:46:00 +0000, John Dallman wrote:
    In article <20260319120653.0000778b@yahoo.com>, already5chosen@yahoo.com
    (Michael S) wrote:

POWER9 - 2017 POWER10 - 2021 We are in 2026 now and I don't hear about
    11.

    Announced last July, though nobody has done a wikipedia page yet:

One of your links didn't display clickably in my newsreader, so I just Googled. At first, it seemed as if Power 11 was a new generation of
    servers, but not necessarily chips, but eventually I did find that, yes,
    the chips were definitely changed.

The chief distinguishing feature of Power 11 is that it offers a lot more memory bandwidth. There appear to be multiple data buses coming from the chips - apparently *sixteen*. Which is a figure usually associated with things like the NEC SX-7.

    Most modern high-end processor chips have multiple memory controllers, specifically to provide more memory bandwidth (particularly since
    most memory controllers perform best with only a single attached DIMM)
as well as larger capacity. 12 isn't uncommon on high-end Xeons,
    for example.

    In most modern systems, they're directly attached to the internal
    high-speed routing infrastructure (mesh, crossbar, ring) rather
    than using point-to-point bus structures.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Mar 19 19:00:36 2026
    From Newsgroup: comp.arch

    On Thu, 19 Mar 2026 16:15:36 -0000 (UTC)
    quadi <quadibloc@ca.invalid> wrote:

    On Thu, 19 Mar 2026 10:46:00 +0000, John Dallman wrote:
    In article <20260319120653.0000778b@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

POWER9 - 2017 POWER10 - 2021 We are in 2026 now and I don't hear
    about 11.

    Announced last July, though nobody has done a wikipedia page yet:

    One of your links didn't display clickably in my newsreader, so I
    just Googled. At first, it seemed as if Power 11 was a new generation
    of servers, but not necessarily chips, but eventually I did find
    that, yes, the chips were definitely changed.

    The chief distinguishing feature of Power 11 is that it offers a lot
    more memory bandwidth. There appear to be multiple data buses coming
    from the chips - apparently *sixteen*. Which is a figure usually
    associated with things like the NEC SX-7.

    John Savard

I don't think that the number of buses was changed vs POWER10.
Both 10 and 11 seem to support 16 OMI buses.
    What is new is support for 4800 MT/s DIMMs on the other side of buffer
    chips. For POWER10 it was 3200 MT/s max.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Mar 19 21:15:05 2026
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Indeed. MIPS had software TLB[*], SPARC had register windows [**]
Alpha had a weak [***] memory ordering. All survive
    for backward compatibility and are generally
    eschewed for new designs.

    Is anybody still doing Alpha?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Mar 19 22:24:09 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Wed, 18 Mar 2026 16:00:41 +0000, MitchAlsup wrote:
    quadi <quadibloc@ca.invalid> posted:

    I want to avoid the need to do that. I want to provide an ISA with
    staying power, one that has room to grow. Despite my having squeezed
    the opcode space so much that there's hardly any of it left!

    Second sentence is admirable, third says you are not on the path.

    Since you are the expert, I can't really argue. But I will make some
    points in my own defense, even so.

    I've said that I've used up the opcode space. That's true of the basic opcode space when no header is used, so that there are only 32-bit instructions (and some pairs of 16-bit instructions).

    Except:

    Even given all the snipped below text--unless you can get ISA finished
    and get a compiler written, software ported and a microarchitecture
    built--it is nothing but a mental exercise.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Mar 19 22:26:32 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Waldek Hebisch <antispam@fricas.org> schrieb:

    x86-64 has funky instruction encoding and due to 64-bit word
    length constants no longer have full word range,

    $ cat > foo.c
    unsigned long foo()
    {
    return 0x1234567890abcdef;
    }

    Yes, what x86-64 does not have is universal constants:
    double foo1(double y)
    {
return 0x123456789.12345p0 * y;
    }

    $ gcc -S foo.c
    $ cat foo.s
    .file "foo.c"
    .text
    .globl foo
    .type foo, @function
    foo:
    .LFB0:
    .cfi_startproc
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    movabsq $1311768467294899695, %rax
    popq %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
    .LFE0:
    .size foo, .-foo
    .ident "GCC: (GNU) 16.0.0 20260111 (experimental)"
    .section .note.GNU-stack,"",@progbits
    $

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Mar 19 22:32:41 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    --------------------
    SPARC died by stages. Sun couldn't afford to keep development at a
    high pitch, and the register windows did not help. Once Oracle owned
    them, the high cost of chip development caused Oracle to cancel it.

    Considering Oracle's financial situation at the time (very very good)
    it does not look like they can't afford it. They just didn't see a
    point.

    It takes cubic dollars to fund a high end design team--something
like $200M/year just in development, more when considering building
    product.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Mar 19 15:48:41 2026
    From Newsgroup: comp.arch

    On 3/17/2026 12:25 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi) wrote:

    Even I, who has indeed in Concertina II designed what has been in
    some of its iterations - and which has again become in the current
    iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
addressed by bit offset in a segment. You could only have 64K instruction bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...

    432 is worthy of study if only to figure out "what not to do".


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint

    Indeed. MIPS had software TLB[*], SPARC had register windows [**]
Alpha had a weak [***] memory ordering.

    Fwiw, SPARC in RMO mode was weak, but not as weak as a damn Alpha. At
    least SPARC honored data dependent load dependencies, ala implied
    consume membars.

    https://en.cppreference.com/w/cpp/atomic/memory_order.html

I guess C++26 waves goodbye to it. But, still... if they make a consume actually emit, say, an acquire membar a la MEMBAR #LoadStore | #LoadLoad
    on SPARC, or an acquire on any other platform, I would be pissed. If at
    all, consume should give a warning and say C++26 does not like it
    anymore. It should be just a compiler barrier unless on something like
    an Alpha. Alpha, well shit out of luck? consume membar on Alpha would
    emit a mb instruction for data dependent loads. C++ says the Alpha can die?
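The pattern under discussion can be sketched with C11 atomics (the C++ `std::memory_order` names are analogous). This is a minimal illustration, not a portability claim: the consumer's dereference is data-dependent on the loaded pointer, which every mainstream ISA except early Alpha orders without a barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Publish/consume sketch.  The dependent load n->payload is what
 * consume ordering was meant to cover "for free"; Alpha alone
 * needed a real mb here.  Mainstream compilers today promote
 * consume to acquire, which is exactly the complaint above. */
typedef struct { int payload; } node;

static _Atomic(node *) shared = NULL;

static void publish(node *n)
{
    n->payload = 42;
    atomic_store_explicit(&shared, n, memory_order_release);
}

static int consume_read(void)
{
    node *n = atomic_load_explicit(&shared, memory_order_consume);
    return n ? n->payload : -1;   /* data-dependent load */
}
```

Run single-threaded this only exercises the API; the ordering guarantee matters when publish() and consume_read() race on different cores.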


    All survive
    for backward compatibility and are generally
    eschewed for new designs.

    [*] Didn't perform well in large scale SMP without
    tricks like page coloring. The virtually tagged
    caches were troublesome.

    [**] Register windows. Need I say more?

    [***] Difficult to program multithreaded apps correctly, difficult to
    port software to from other more strongly ordered
    architectures.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Mar 20 01:13:07 2026
    From Newsgroup: comp.arch

    On Thu, 19 Mar 2026 22:32:41 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    --------------------
    SPARC died by stages. Sun couldn't afford to keep development at a
    high pitch, and the register windows did not help. Once Oracle
    owned them, the high cost of chip development caused Oracle to
    cancel it.

    Considering Oracle's financial situation at the time (very very
    good) it does not look like they can't afford it. They just didn't
    see a point.

    It takes cubic dollars to fund a high end design team--something
like $200M/year just in development, more when considering building
    product.

    It's Oracle of 2017 that we are talking about.
    Total revenues : $ 37.728 B
    Operating income: $ 12.710 B
    Net income : $ 9.335 B

    What is 0.2 B for such juggernaut ?
Just to put things in proportion, in November 2016 Oracle paid $9.3
billion USD for NetSuite.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Mar 20 01:25:00 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

x86-64 has a funky instruction encoding and, due to the 64-bit word
length, constants no longer have the full word range,

    $ cat > foo.c
    unsigned long foo()
    {
    return 0x1234567890abcdef;
    }
    $ gcc -S foo.c

    movabsq $1311768467294899695, %rax

But you cannot put 64-bit constants in other instructions; constants
in other instructions are still 32-bit.
    --
    Waldek Hebisch
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Mar 20 01:42:14 2026
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 3/17/2026 12:25 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi)
wrote:

Even I, who has indeed in Concertina II designed what has been in
some of its iterations - and which has again become in the current
iteration, sadly - some really weird ISAs would shrink from
designing an ISA in which the answer to that question would be
"Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction
    bits in a segment, or 8K bytes. The idea was that no subroutine or
    function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

    On the other? By the time I existed, it was basically already dead...

    432 is worthy of study if only to figure out "what not to do".


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint

Indeed. MIPS had software TLB[*], SPARC had register windows [**]
Alpha had a weak [***] memory ordering.

    Fwiw, SPARC in RMO mode was weak, but not as weak as a damn Alpha. At
    least SPARC honored data dependent load dependencies, ala implied
    consume membars.

    https://en.cppreference.com/w/cpp/atomic/memory_order.html

    I guess C++26 waves good bye to it. But, still... if they make a consume actually emit, say an acquire membar ala MEMBAR #LoadStore | #LoadLoad
    on SPARC, or an acquire on any other platform, I would be pissed. If at
    all, consume should give a warning and say C++26 does not like it
    anymore. It should be just a compiler barrier unless on something like
    an Alpha. Alpha, well shit out of luck? consume membar on Alpha would
    emit a mb instruction for data dependent loads. C++ says the Alpha can die?

    It already did.


    All survive
    for backward compatibility and are generally
    eschewed for new designs.

    [*] Didn't perform well in large scale SMP without
    tricks like page coloring. The virtually tagged
    caches were troublesome.

    [**] Register windows. Need I say more?

[***] Difficult to program multithreaded apps correctly, difficult to
port software to it from other more strongly ordered
architectures.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Mar 20 01:43:41 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 19 Mar 2026 22:32:41 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    --------------------
    SPARC died by stages. Sun couldn't afford to keep development at a
    high pitch, and the register windows did not help. Once Oracle
    owned them, the high cost of chip development caused Oracle to
    cancel it.

    Considering Oracle's financial situation at the time (very very
    good) it does not look like they can't afford it. They just didn't
    see a point.

    It takes cubic dollars to fund a high end design team--something
    like $200M/year just in development more when considering building
    product.

    It's Oracle of 2017 that we are talking about.
    Total revenues : $ 37.728 B
    Operating income: $ 12.710 B
    Net income : $ 9.335 B

What is 0.2 B for such a juggernaut?

It should not have been....

Just to put things in proportion, in November 2016 Oracle paid $9.3
billion USD for NetSuite.

    Fools and their money...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Fri Mar 20 02:20:45 2026
    From Newsgroup: comp.arch

    On Thu, 19 Mar 2026 22:24:09 +0000, MitchAlsup wrote:

    Even given all the snipped below text--unless you can get ISA finished
    and get a compiler written, software ported and a microarchitecture
    built--it is nothing but a mental exercise.

    I can't argue with that; it is indisputable.

    But I needed to get the basis right before proceeding with the hard work.

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Mar 20 07:23:11 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:

    It's Oracle of 2017 that we are talking about.
    Total revenues : $ 37.728 B
    Operating income: $ 12.710 B
    Net income : $ 9.335 B

What is 0.2 B for such a juggernaut?

    Not sure if you have ever worked for a big company...

    Each major project and product line is judged on its own merit.
    If the business case does not appear to be there, they will
    not spend the money.

Just to put things in proportion, in November 2016 Oracle paid $9.3
billion USD for NetSuite.

    They obviously thought it made sense. Let me qualify what I
    wrote above. There is a saying "Where there is a will, there is a
    business case." Such decisions are not always rational.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Mar 20 00:34:55 2026
    From Newsgroup: comp.arch

    On 3/19/2026 6:42 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 3/17/2026 12:25 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi)
wrote:

Even I, who has indeed in Concertina II designed what has been in
some of its iterations - and which has again become in the current
iteration, sadly - some really weird ISAs would shrink from
    designing an ISA in which the answer to that question would be
    "Yes".

    iAPX 432 had instructions which weren't in whole bytes, and were
    addressed by bit offset in a segment. You could only have 64K instruction
bits in a segment, or 8K bytes. The idea was that no subroutine or
function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

On the other? By the time I existed, it was basically already dead...
    432 is worthy of study if only to figure out "what not to do".


    It is like, there are one of several ways things can go:
    Abysmal Turd: iAPX 432
    Turd: IA-64
    Turd that flies: x86, x86-64
    Sane: ARM, RISC-V, ...
    Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint

Indeed. MIPS had software TLB[*], SPARC had register windows [**]
Alpha had a weak [***] memory ordering.

    Fwiw, SPARC in RMO mode was weak, but not as weak as a damn Alpha. At
    least SPARC honored data dependent load dependencies, ala implied
    consume membars.

    https://en.cppreference.com/w/cpp/atomic/memory_order.html

    I guess C++26 waves good bye to it. But, still... if they make a consume
    actually emit, say an acquire membar ala MEMBAR #LoadStore | #LoadLoad
    on SPARC, or an acquire on any other platform, I would be pissed. If at
    all, consume should give a warning and say C++26 does not like it
    anymore. It should be just a compiler barrier unless on something like
    an Alpha. Alpha, well shit out of luck? consume membar on Alpha would
    emit a mb instruction for data dependent loads. C++ says the Alpha can die?

    It already did.

    Well, I hope a C++ compiler does not treat a consume as an acquire
    membar then! Grrrr.....




    All survive
    for backward compatibility and are generally
    eschewed for new designs.

    [*] Didn't perform well in large scale SMP without
    tricks like page coloring. The virtually tagged
    caches were troublesome.

    [**] Register windows. Need I say more?

[***] Difficult to program multithreaded apps correctly, difficult to
port software to it from other more strongly ordered
architectures.



    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Mar 20 00:36:14 2026
    From Newsgroup: comp.arch

    On 3/20/2026 12:34 AM, Chris M. Thomasson wrote:
    On 3/19/2026 6:42 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 3/17/2026 12:25 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    BGB <cr88192@gmail.com> posted:

    On 3/15/2026 9:35 AM, John Dallman wrote:
In article <10p4p6g$lg6r$2@dont-email.me>, quadibloc@ca.invalid (quadi)
wrote:

Even I, who has indeed in Concertina II designed what has been in
some of its iterations - and which has again become in the current
iteration, sadly - some really weird ISAs would shrink from
designing an ISA in which the answer to that question would be
"Yes".

iAPX 432 had instructions which weren't in whole bytes, and were
addressed by bit offset in a segment. You could only have 64K instruction
bits in a segment, or 8K bytes. The idea was that no subroutine or
function ever needed to be bigger than that.


    On one hand? "kill it with fire!".

On the other? By the time I existed, it was basically already dead...
    432 is worthy of study if only to figure out "what not to do".


    It is like, there are one of several ways things can go:
         Abysmal Turd: iAPX 432
         Turd: IA-64
         Turd that flies: x86, x86-64
         Sane: ARM, RISC-V, ...
         Sane, but still died: PowerPC, MIPS, SPARC, ...

    interesting viewpoint

Indeed.  MIPS had software TLB[*], SPARC had register windows [**]
          Alpha had a weak [***] memory ordering.

    Fwiw, SPARC in RMO mode was weak, but not as weak as a damn Alpha. At
    least SPARC honored data dependent load dependencies, ala implied
    consume membars.

    https://en.cppreference.com/w/cpp/atomic/memory_order.html

I guess C++26 waves good bye to it. But, still... if they make a consume
actually emit, say an acquire membar ala MEMBAR #LoadStore | #LoadLoad
    on SPARC, or an acquire on any other platform, I would be pissed. If at
    all, consume should give a warning and say C++26 does not like it
    anymore. It should be just a compiler barrier unless on something like
    an Alpha. Alpha, well shit out of luck? consume membar on Alpha would
    emit a mb instruction for data dependent loads. C++ says the Alpha
    can die?

    It already did.

    Well, I hope a C++ compiler does not treat a consume as an acquire
    membar then! Grrrr.....

    I mean if they said consume is acquire it would RUIN all of the nice
    algos out there. Say even RCU. So, up to the C++ compiler vendors i
    guess.. ;^o





        All survive
               for backward compatibility and are generally
               eschewed for new designs.

    [*] Didn't perform well in large scale SMP without
          tricks like page coloring.   The virtually tagged
          caches were troublesome.

    [**] Register windows.  Need I say more?

[***] Difficult to program multithreaded apps correctly, difficult to
        port software to it from other more strongly ordered
        architectures.




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Mar 20 09:27:19 2026
    From Newsgroup: comp.arch

    They obviously thought it made sense. Let me qualify what I
    wrote above. There is a saying "Where there is a will, there is a
    business case." Such decisions are not always rational.

    There's also the well-known effect that it's harder to argue against
    large projects, because most arguments can be deflected on the basis
    either that it's negligible (if it quibbles with something "minor") or
    that it's based on an incomplete understanding of the project.


    === Stefan
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Mar 20 16:01:13 2026
    From Newsgroup: comp.arch

    On Fri, 20 Mar 2026 07:23:11 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:

    It's Oracle of 2017 that we are talking about.
    Total revenues : $ 37.728 B
    Operating income: $ 12.710 B
    Net income : $ 9.335 B

What is 0.2 B for such a juggernaut?

    Not sure if you have ever worked for a big company...


Not big in the USA or Japanese or Korean sense of the word.
But in the 90s I worked a few years for a company that had several
thousand employees. Formally, I was not on their payroll, but in
practice it hardly mattered.

    Each major project and product line is judged on its own merit.
    If the business case does not appear to be there, they will
    not spend the money.

Just to put things in proportion, in November 2016 Oracle paid $9.3
billion USD for NetSuite.

    They obviously thought it made sense. Let me qualify what I
    wrote above. There is a saying "Where there is a will, there is a
    business case." Such decisions are not always rational.

I don't think that what you say contradicts what I say. Which is "Oracle
could easily afford continued development of SPARC server CPUs, but
decided that they don't want to do it".




    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Andy Valencia@vandys@vsta.org to comp.arch on Fri Mar 20 07:19:52 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Michael S <already5chosen@yahoo.com> schrieb:
What is 0.2 B for such a juggernaut?
    Each major project and product line is judged on its own merit.
    If the business case does not appear to be there, they will
    not spend the money.

    I have to imagine that the board of Oracle was looking at a chip
    design activity which would cost real money and play to no existing
    strength of the company. Then would follow:

    "Can we just run on chips made by somebody else?"

    "Yes. But--"

    "Thank you."

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Mar 20 17:29:26 2026
    From Newsgroup: comp.arch

    On Fri, 20 Mar 2026 07:19:52 -0700
    Andy Valencia <vandys@vsta.org> wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Michael S <already5chosen@yahoo.com> schrieb:
What is 0.2 B for such a juggernaut?
    Each major project and product line is judged on its own merit.
    If the business case does not appear to be there, they will
    not spend the money.

    I have to imagine that the board of Oracle was looking at a chip
    design activity which would cost real money and play to no existing
    strength of the company. Then would follow:

    "Can we just run on chips made by somebody else?"

    "Yes. But--"

    "Thank you."


But after acquisition, in 2010, they did expand the SPARC design team and
most likely greatly improved their financing. Only 7 years later they
decided to quit.

As to the Board, I don't believe that Ellison and Hurd really cared
about its opinion.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Mar 20 19:29:24 2026
    From Newsgroup: comp.arch


    quadi <quadibloc@ca.invalid> posted:

    On Thu, 19 Mar 2026 22:24:09 +0000, MitchAlsup wrote:

    Even given all the snipped below text--unless you can get ISA finished
    and get a compiler written, software ported and a microarchitecture built--it is nothing but a mental exercise.

    I can't argue with that; it is indisputable.

    But I needed to get the basis right before proceeding with the hard work.

    20 years ago I had similar notions. But my experience with My 66000
    changed my mind.
    a) you have to have a compiler that can compile most things
    so that you can see the deficiencies with your ISA
    b) you have to have /bin/utils/ mostly compiled to see whatever
    damage you have done to yourself in terms of external variables
    and functions; SBI into and out of your OS; SBI' into and out
    of your HyperVisor, ...

    And after using these for a while, go back and correct the ISA.

    If you fix ISA too soon you cut off much of your future.

    John Savard

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Mar 20 16:56:52 2026
    From Newsgroup: comp.arch

    On 3/20/2026 2:29 PM, MitchAlsup wrote:

    quadi <quadibloc@ca.invalid> posted:

    On Thu, 19 Mar 2026 22:24:09 +0000, MitchAlsup wrote:

    Even given all the snipped below text--unless you can get ISA finished
    and get a compiler written, software ported and a microarchitecture
    built--it is nothing but a mental exercise.

    I can't argue with that; it is indisputable.

    But I needed to get the basis right before proceeding with the hard work.

    20 years ago I had similar notions. But my experience with My 66000
    changed my mind.
    a) you have to have a compiler that can compile most things
    so that you can see the deficiencies with your ISA
    b) you have to have /bin/utils/ mostly compiled to see whatever
    damage you have done to yourself in terms of external variables
    and functions; SBI into and out of your OS; SBI' into and out
    of your HyperVisor, ...

    And after using these for a while, go back and correct the ISA.

    If you fix ISA too soon you cut off much of your future.


    Or, another route being:
    Mock things up;
    Test things;
    Do more mock-ups;
    Test and compare;
    Pick whichever is more promising;
    ...



    Then usually later get annoyed because the design has accumulated some
    level of cruft, and it generally requires a partial "soft restart" for effective cruft removal.

    But, then one needs to have a good idea what the "one true path" is
    before deciding to soft-restart.



    And, then the deltas aren't that large (nowhere near "orders of magnitude").


    Say, for the same types of code, and code density:
XG1      : ~ 280K (16/32/64/96 bit ops)
XG2      : ~ 300K (32/64/96 bit ops)
RV64GC+JX: ~ 300K (16/32/64 bit ops)
XG3      : ~ 310K (32/64/96 bit ops)
RV64G+JX : ~ 340K (32/64 bit ops)
RV64GC   : ~ 360K (16/32 bit ops)
SH-4     : ~ 400K (16 bit ops only; for reference)
RV64G    : ~ 450K (32 bit ops only)


    So, with similar features (indexed load/store, jumbo-prefixes, etc),
    Doom's ".text" size tends to be a little over 300K, but suffers
    significantly without these.


    This is limited to cases where I control all of the external factors.
    Say, for example, comparing with different targets means different
    compilers and C libraries, which kinda ruin the ability to compare them directly.


    For performance:
    XG3 wins ATM:
    ~ 40% faster than RV64G for Doom
    ~ 55% faster than RV64GC for Doom
    ~ 5% faster than RV64G+Jx
    ~ 15% faster than RV64GC+Jx
    ~ 15% faster than XG1 (size optimized)
    XG3 vs XG2:
    Performance difference is negligible.
    XG1 vs RV64GC+Jx: Negligible difference.
    Though, there is around a 35% difference vs RV64GC with no Jx.
    ...


Having 16-bit ops basically gives around a 20% code-density improvement
at around a 10-15% performance penalty (when 16-bit ops and misaligned
instructions disallow superscalar operation).
This performance penalty could go away in theory with a CPU that allows
superscalar operation for misaligned instructions and can co-execute
16-bit ops.


    Whether or not 96-bit encodings are allowed (Imm64) tends to have a very
    small effect on code density as they tend to be rarely used (Imm33s can address the vast majority of cases).


But, having indexed load/store and an encoding scheme to deal with
larger immediate and displacement values has a more obvious visible
impact on code density or performance.


    Though, the performance difference mostly goes away if testing using
    Dhrystone or similar (doesn't care much about either indexed load/store
    or large imm/disp values).


    But, I am left to question if any of this really matters...




Well, I guess the contrast would be, say, data compression:
LZMA vs LZ4: one only sometimes sees more than a 25% difference in
compression ratio;
But, then there is often a 90x or so difference in speed.

So, for example, why I had often ended up with something like LZ4 or
RP2: because the speed advantage (large) mattered more than the ratio
difference (smaller); and most data still sees an advantage over raw
uncompressed data.

    LZ4 vs RP2: Difference is modest and data dependent. In terms of ratio,
    RP2 usually beats LZ4. But, differences are small, mostly differing in
    terms of how they encode matches and deal with EOB signaling and similar
    (RP2 using an explicit EOB signal; and with an encoding scheme that is essentially "like EA RefPack with the bits shuffled around and some
    fairly minor tweaks").

    ...


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From quadi@quadibloc@ca.invalid to comp.arch on Sat Mar 21 04:44:37 2026
    From Newsgroup: comp.arch

    On Fri, 20 Mar 2026 19:29:24 +0000, MitchAlsup wrote:

    quadi <quadibloc@ca.invalid> posted:

    On Thu, 19 Mar 2026 22:24:09 +0000, MitchAlsup wrote:

    Even given all the snipped below text--unless you can get ISA
    finished and get a compiler written, software ported and a
    microarchitecture built--it is nothing but a mental exercise.

    I can't argue with that; it is indisputable.

    But I needed to get the basis right before proceeding with the hard
    work.

    20 years ago I had similar notions. But my experience with My 66000
    changed my mind.
    a) you have to have a compiler that can compile most things
    so that you can see the deficiencies with your ISA
    b) you have to have /bin/utils/ mostly compiled to see whatever
    damage you have done to yourself in terms of external variables and
    functions; SBI into and out of your OS; SBI' into and out of your
    HyperVisor, ...

    And after using these for a while, go back and correct the ISA.

    If you fix ISA too soon you cut off much of your future.

    It may well be that when I get to the point of porting Linux, I'll find my current design is a disaster.

But then, why would I try to fix it? Clearly, if this (slightly!)
unconventional design of mine - it's nowhere near as original as the Mill
- turns out to be a wretched failure, so that what I need to do instead is
design something far more conventional... then what have I to contribute
that wouldn't be done just as well by SPARC or PowerPC?

Actually, I probably would still have some ideas - like taking the System/
360 off in a different direction than IBM did, or taking the 68000
architecture and making a 64-bit version of it - but they would be nothing much.

The thing is, though, that I have two escape routes with this current Concertina II ISA that make me feel that total abject disaster is
    unlikely. One escape route is to not use headers. Then, the ISA is not
    really better than any ordinary RISC - but it isn't worse or more
    complicated either.

    And _due_ to the header mechanism, it's not like an application which does
    use headers needs the OS to save special status information during
    interrupts.

    The second escape route is to consistently use the Type I header. Now, the instruction set, except for the mechanical issue of remembering that code addresses skip around the headers, is pretty much a plain ordinary CISC instruction set.

    So if the full complexity of mixing and matching block types creates
    something in which Linux can't be implemented - then all I have to do is
    not do it.

    Of course, it could be precisely because this is a mad scheme, worthy of
    the villain in a pulp fiction story, that there is a fatal flaw which I
    don't see which will bring everything crashing down. But while the ISA as
    a whole, thanks to its headers, is indeed unprecedented in its
    complexity... that's just to give it flexibility, and it can fall back to sheer ordinariness. So it seems to me that I've thought things out...

John Savard
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Mar 21 07:06:45 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
But after acquisition, in 2010, they did expand the SPARC design team and
most likely greatly improved their financing. Only 7 years later they
decided to quit.

    I was actually somewhat surprised that Oracle did not cancel SPARC
    right away. SPARC CPUs from Sun had lagged in performance for more
    than a decade. But at least they finally got the clock rate up and
    OoO execution. Looking at <https://www.oracle.com/us/products/servers-storage/sparc-m8-processor-ds-3864282.pdf>,
    they boast quad-issue OoO execution. The performance is still nothing
    to write home about:

    wget https://www.complang.tuwien.ac.at/forth/gforth/Snapshots/0.7.9_20260109/gforth-0.7.9_20260109.tar.xz
    time xz -d gforth-0.7.9_20260109.tar.xz
    time xz gforth-0.7.9_20260109.tar

    time result on a 5GHz SPARC M8 (running Solaris 11.4):

    real 0m0.504s
    user 0m0.403s
    sys 0m0.066s

    real 0m7.721s
    user 0m7.664s
    sys 0m0.054s


    time result on a 5GHz Ryzen 8700G:

    real 0m0.160s
    user 0m0.140s
    sys 0m0.020s

    real 0m3.149s
    user 0m3.121s
    sys 0m0.028s

    Admittedly a Ryzen 1800X (from 2017) or Intel Kaby Lake (from 2017)
    will be slower than the 8700G, but I expect them to still be faster
    than the SPARC M8 (from 2017).

    Here's another one:

    ./gforth-fast -i kernl32b.fi startup.fs onebench.fs #on SPARC
    ./gforth-fast onebench.fs #on AMD64

    sieve bubble matrix fib fft
    0.312 0.277 0.136 0.353 0.195 5GHz SPARC M8
    0.020 0.020 0.012 0.030 0.012 5GHz Ryzen 8700G

    More than a factor of 10 for each benchmark. My guess is that the M8
    has deficiencies in indirect-branch prediction.

Did Gforth disable its optimizations on SPARC?

    : squared dup * ; ok
    see-code squared
    <squared> dup 1->2
    0xfeeab718: mov %i5, %l6
    <squared+$4> * 2->1
    0xfeeab71c: smul %i5, %l6, %i5
    <squared+$8> ;s 1->1
    0xfeeab720: ld [ %i2 ], %i0
    0xfeeab724: add %i2, 4, %i2
    0xfeeab728: ld [ %i0 ], %g1
    0xfeeab72c: jmp %g1
    0xfeeab730: nop

    Looks perfectly fine; for comparison:

    <squared> dup 1->2
    7F3592470E60: mov r15,r13
    <squared+$8> * 2->1
    7F3592470E63: imul r13,r15
    <squared+$10> ;s 1->1
    7F3592470E67: mov rbx,[r14]
    7F3592470E6A: add r14,$08
    7F3592470E6E: mov rax,[rbx]
    7F3592470E71: jmp eax


And I expect that in the decades of even larger performance
    disadvantages for Sun's and Oracle SPARCs enough customers had jumped
    ship or at least decided to jump ship at the next opportunity that by
    2017 not enough SPARC business was left to pay for its continued
    development.

    So eventually, what had killed general-purpose MIPS, Alpha, HPPA, and (contemporaneously with SPARC) IA-64 also killed SPARC: High
    development costs of these architectures supported by not enough
    customer interest, due to customers defecting to Linux on Intel/AMD.

    Interesting that Power and S390x still persevere. For S390x I expect
    that the customers are not price-sensitive, and lots of legacy code in
    assembly language tied to some proprietary OS makes porting extra-hard
    and extra-risky. For Power, maybe the AS/400 customers are in a
    similar position. For MIPS, HPPA and SPARC there were lots of Unix
    customers for whom the jump to Linux was not that hard. For Alpha the
    VMS customer base apparently was not captured with enough ties to
    sustain Alpha (or later IA-64).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Mar 21 06:06:18 2026
    From Newsgroup: comp.arch

    On 3/21/2026 12:06 AM, Anton Ertl wrote:

    snipped a lot of detailed performance data

And I expect that in the decades of even larger performance
    disadvantages for Sun's and Oracle SPARCs enough customers had jumped
    ship or at least decided to jump ship at the next opportunity that by
    2017 not enough SPARC business was left to pay for its continued
    development.

    So eventually, what had killed general-purpose MIPS, Alpha, HPPA, and (contemporaneously with SPARC) IA-64 also killed SPARC: High
    development costs of these architectures supported by not enough
    customer interest, due to customers defecting to Linux on Intel/AMD.

    I think that analysis is generally correct.


    Interesting that Power and S390x still persevere. For S390x I expect
    that the customers are not price-sensitive, and lots of legacy code in assembly language tied to some proprietary OS makes porting extra-hard
    and extra-risky.

    Sort of. Even if the application code is in a HLL, it can be, and on
    S/390 often is, tied to some proprietary software. For the highest performance systems, that software is the OS, namely TPF (previously
    called ACP - Airline Control Program). For the vast majority of S/390
    users it is either a database such as IMS, or a transaction system such
    as CICS.


    For Power, maybe the AS/400 customers are in a
    similar position.

    No, actually the opposite position. AS/400 user code uses a very high
    level system (i.e. no user assembly code) that provides much of the work
    and is proprietary to IBM. While that system could be, and was, ported
    to a different architecture (e.g. from S/38 to Power), of course, IBM
    has no incentive to port it to a generic platform. Without that high
    level system, porting user code would essentially be impossible.

    For MIPS, HPPA and SPARC there were lots of Unix
    customers for whom the jump to Linux was not that hard. For Alpha the
    VMS customer base apparently was not captured with enough ties to
    sustain Alpha (or later IA-64).

    Agreed.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Mar 21 16:21:15 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/21/2026 12:06 AM, Anton Ertl wrote:
    For Power, maybe the AS/400 customers are in a
    similar position.

    No, actually the opposite position. AS/400 user code uses a very high
    level system (i.e. no user assembly code) that provides much of the work
    and is proprietary to IBM. While that system could be, and was, ported
    to a different architecture (e.g. from S/38 to Power), of course, IBM
    has no incentive to port it to a generic platform.

    That's an interesting statement. IBM could implement AS/400 on AMD64
    machines (AFAIK it uses some extra bit for tagging on their enhanced
    Power, but I am sure that an implementation on ARM A64 with top-byte
    ignore and on AMD64 with similar features (don't remember the name)
    would not incur too much overhead, if any). That would save them the
    cost of continuing Power development.
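
    The top-byte idea is easy to sketch. The following is a hypothetical
    layout (constants and helper names invented for illustration, not
    IBM's or ARM's actual encoding) of keeping a one-byte tag in the top
    bits of a 64-bit pointer, which TBI-style hardware can ignore on
    dereference so that software never has to mask:

    ```python
    # Hypothetical tag-in-top-byte layout; hardware with ARM's Top Byte
    # Ignore (or the comparable AMD64 feature) strips the top byte on
    # dereference for free, so only tag inspection needs the shift.
    TAG_SHIFT = 56
    ADDR_MASK = (1 << TAG_SHIFT) - 1

    def with_tag(addr, tag):
        """Pack a one-byte tag into the top byte of a 64-bit pointer."""
        assert 0 <= tag < 256
        return (tag << TAG_SHIFT) | (addr & ADDR_MASK)

    def tag_of(ptr):
        return ptr >> TAG_SHIFT

    def address_of(ptr):          # what TBI hardware does implicitly
        return ptr & ADDR_MASK

    p = with_tag(0x7FFF_1234_5678, 0xA5)
    assert tag_of(p) == 0xA5
    assert address_of(p) == 0x7FFF_1234_5678
    ```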

    So they probably think that they can charge AS/400 enough extra for
    running on Power, that it more than makes up for the development costs
    of Power. Why would AS/400 customers be willing to do that? My guess
    is that the different architecture is successfully sold as a "secret
    sauce" to them that justifies charging that much extra. Conversely,
    if they just were given hardware with ARM, Intel or AMD CPUs and the
    AS/400 (followon) OS, they would balk at the prices that IBM charges
    them, even if IBM reduces these prices by their share in Power
    development costs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Mar 21 16:44:31 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    ./gforth-fast -i kernl32b.fi startup.fs onebench.fs #on SPARC
    ./gforth-fast onebench.fs #on AMD64

    sieve bubble matrix fib fft
    0.312 0.277 0.136 0.353 0.195 5GHz SPARC M8
    0.020 0.020 0.012 0.030 0.012 5GHz Ryzen 8700G

    More than a factor of 10 for each benchmark. My guess is that the M8
    has deficiencies in indirect-branch prediction.

    To check that theory:

    ./gforth-fast --no-dynamic -i kernl32b.fi startup.fs -e cr onebench.fs #SPARC
    ./gforth-fast --no-dynamic -i kernl64l.fi startup.fs -e cr onebench.fs #AMD64

    sieve bubble matrix fib fft
    0.941 0.988 0.754 1.026 0.460 5GHz SPARC M8
    0.150 0.128 0.126 0.157 0.068 5GHz Ryzen 8700G

    Looking at the whole process (loading and running all benchmarks) on
    the M8 with performance monitoring counters, I see:

    cputrack -T 10 -c pic0=PAPI_tot_cyc,pic1=PAPI_tot_ins,pic2=PAPI_br_ins,pic3=Br_mispred ./gforth-fast -i kernl32b.fi startup.fs -e "cr" onebench.fs
    cputrack -T 10 -c pic0=PAPI_tot_cyc,pic1=PAPI_tot_ins,pic2=PAPI_br_ins,pic3=Br_mispred ./gforth-fast --no-dynamic -i kernl32b.fi startup.fs -e "cr" onebench.fs

    s cycles insts branches Br_mispred
    time pic0 pic1 pic2 pic3
    1.420 7056003107 2622301784 319937637 165499524 default
    4.451 22251377351 5399632927 828239093 678326557 --no-dynamic

    So the number of branches increases by 508301456 with --no-dynamic
    (and these are all indirect branches), while the number of branch
    mispredictions increases by 512827033 (1.009 times the number of
    additional indirect branches). There are additional instructions, but
    if we compute the additional cycles per additional misprediction, the
    result is 29.6 cycles. Below we see that a direct-threaded dispatch
    costs about 27 cycles, which leaves about 2.6 cycles for the other
    additional instructions.
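
    The 29.6-cycle figure can be reproduced from the cputrack table above
    (numbers copied verbatim); a quick sanity-check sketch:

    ```python
    # Reproducing the cycles-per-extra-misprediction arithmetic from the
    # default vs. --no-dynamic cputrack rows on the SPARC M8.
    extra_branches    = 828_239_093 - 319_937_637      # all indirect
    extra_mispredicts = 678_326_557 - 165_499_524
    extra_cycles      = 22_251_377_351 - 7_056_003_107

    assert extra_branches == 508_301_456
    assert round(extra_mispredicts / extra_branches, 3) == 1.009
    assert round(extra_cycles / extra_mispredicts, 1) == 29.6
    ```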

    Let's compare this with the results on the 8700G:

    LC_NUMERIC=prog perf stat -e cycles -e instructions -e branches -e branch-misses ./gforth-fast -i kernl64l.fi startup.fs onebench.fs
    LC_NUMERIC=prog perf stat -e cycles -e instructions -e branches -e branch-misses ./gforth-fast --no-dynamic -i kernl64l.fi startup.fs onebench.fs

    default --no-dynamic
    624_131_765 3_564_153_121 cycles
    2_210_769_688 4_522_459_842 instructions
    289_527_092 797_459_624 branches
    1_360_753 12_208_051 branch-misses

    We see somewhat fewer instructions on AMD64, maybe due to CISC, and
    about 30M fewer branches (both for default and --no-dynamic); I can
    only guess why that is; 10M branches of the difference already occur
    in the startup phase, but the other 20M are unclear. Anyway, the major
    difference is that the Ryzen 8700G has relatively few branch
    mispredictions (<3 MPKI) whereas the SPARC M8 has a lot (126 MPKI for
    --no-dynamic, 63 MPKI for default). And each of these mispredictions
    costs a lot, see below.

    On the Ryzen 8700G the IPC is reduced from 3.53 to 1.27 when using
    --no-dynamic, so the slowdown the 8700G sees from --no-dynamic is
    probably explained by additional dependences between instructions;
    --no-dynamic disables a number of optimizations in Gforth that are
    based on dynamic native code generation.
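
    The MPKI and IPC figures can likewise be rechecked against the two
    counter tables above:

    ```python
    # Sanity-checking the MPKI and IPC figures against the counter tables.
    def mpki(mispredictions, instructions):
        """Mispredictions per kilo-instruction."""
        return mispredictions / instructions * 1000

    # SPARC M8 (cputrack table)
    assert round(mpki(165_499_524, 2_622_301_784)) == 63    # default
    assert round(mpki(678_326_557, 5_399_632_927)) == 126   # --no-dynamic

    # Ryzen 8700G (perf table): well under 3 MPKI even with --no-dynamic
    assert mpki(12_208_051, 4_522_459_842) < 3

    # Ryzen 8700G IPC drops roughly from 3.5 to 1.3 with --no-dynamic
    assert 3.5 < 2_210_769_688 / 624_131_765 < 3.6
    assert 1.2 < 4_522_459_842 / 3_564_153_121 < 1.3
    ```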

    Another check of the theory that the SPARC M8 has bad indirect branch prediction and high misprediction penalty is my threading
    microbenchmark <http://www.complang.tuwien.ac.at/forth/threading/>.
    It performs 1G VM instruction dispatches (1 indirect branch or
    indirect call per dispatch for everything but "subroutine"). Here are
    the results on the SPARC M8:

    gmake TIME="cputrack -T 10 -c pic0=PAPI_tot_cyc,pic1=PAPI_tot_ins,pic2=PAPI_br_ins,pic3=Br_mispred"

    s cycles insts branches Br_mispred
    time pic0 pic1 pic2 pic3
    1.259 6300623847 5100432507 2200076723 4638 subroutine
    5.571 27900764018 3300432545 1100076805 1000004548 direct
    6.170 30901341158 4300432342 1100076731 1000004612 indirect
    6.649 33301453118 10400432413 3100076793 1000004955 switch
    8.326 41701051833 9700432505 3100076817 1000004488 call
    6.868 34400867876 13800432826 2100076852 1000004541 repl-switch

    On the SPARC M8 the performance counters report one misprediction per
    indirect branch/call, with the "direct" microbenchmark performing 3.3 instructions per indirect branch, typically:

    100000e94: 82 00 60 08 add %g1, 8, %g1
    100000e98: c4 58 40 00 ldx [ %g1 ], %g2
    100000e9c: 81 c0 80 00 jmp %g2

    These 3.3 instructions (on average) take 27.9 cycles, and my
    guess is that the mispredicted indirect branch here costs around 23
    cycles and the other instructions a total of around 4 cycles.
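
    The add/ldx/jmp sequence above is the classic direct-threaded NEXT:
    fetch the next handler address and jump to it. A sketch of the
    technique as a hypothetical mini-VM (not Gforth's actual
    implementation); the indirect call in the loop plays the role of the
    hard-to-predict indirect jump:

    ```python
    # Direct-threaded dispatch sketch: each cell of `code` holds either a
    # handler or an inline operand; dispatch fetches the next handler and
    # jumps to it, one indirect transfer per VM instruction.
    def run(code):
        stack, ip = [], 0
        while ip is not None:
            handler = code[ip]                 # ldx [%g1]: fetch handler
            ip = handler(code, ip + 1, stack)  # jmp %g2: indirect transfer
        return stack

    def lit(code, ip, stack):                  # push the inline operand
        stack.append(code[ip])
        return ip + 1

    def add(code, ip, stack):
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
        return ip

    def halt(code, ip, stack):
        return None

    # 2 3 + -> 5
    assert run([lit, 2, lit, 3, add, halt]) == [5]
    ```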

    Note that even CPUs from 2000 like the Coppermine or Thunderbird
    exhibit better indirect branch prediction (about 50% mispredictions
    for this benchmark). They also have a lower branch misprediction
    penalty in cycles, but given that they run at 1GHz while the SPARC M8
    runs at 5GHz, the M8 still has a lower misprediction penalty in ns.

    Here are performance counter results for the 8700G for this benchmark:

    make TIME="perf stat -e cycles -e instructions -e branches -e branch-misses"


    subroutine direct indirect switch call repl-switch
    4_103_611_283 2_102_149_984 2_001_992_398 4_709_960_957 24_015_297_261 3_504_026_549 cycles
    2_702_140_183 3_301_375_835 4_301_317_505 9_903_048_005 7_709_573_557 8_802_165_453 instructions
    2_200_488_130 1_100_314_913 1_100_301_489 3_700_697_820 3_102_167_429 2_100_494_942 branches
    62_916 40_447 38_683 90_519 1_000_277_186 64_853 branch-misses

    [Sorry for the overly wide table]

    The indirect branch predictor performs well on the 8700G, except for
    the call case. This benchmark has an indirect call to a return (in
    90% of the cases, the other 10% have instructions before the return,
    but apparently there is a misprediction in those cases, too), and that
    is followed by a direct branch. I guess that somewhere in this
    sequence the microarchitecture has a hiccup, resulting in a
    misprediction. The whole thing, including the misprediction, costs 24
    cycles, which is 17.7 cycles faster than the M8 on the call benchmark,
    which also has one misprediction. So apparently the M8 has other
    things that keep it back, not just the indirect branch predictor.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Mar 22 01:00:05 2026
    From Newsgroup: comp.arch

    On Sat, 21 Mar 2026 16:21:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/21/2026 12:06 AM, Anton Ertl wrote:
    For Power, maybe the AS/400 customers are in a
    similar position.

    No, actually the opposite position. AS/400 user code uses a very
    high level system (i.e. no user assembly code) that provides much of
    the work and is proprietary to IBM. While that system could be, and
    was, ported to a different architecture (e.g. from S/38 to Power),
    of course, IBM has no incentive to port it to a generic platform.

    That's an interesting statement. IBM could implement AS/400 on AMD64 machines (AFAIK it uses some extra bit for tagging on their enhanced
    Power, but I am sure that an implementation on ARM A64 with top-byte
    ignore and on AMD64 with similar features (don't remember the name)
    would not incur too much overhead, if any). That would save them the
    cost of continuing Power development.

    So they probably think that they can charge AS/400 enough extra for
    running on Power, that it more than makes up for the development costs
    of Power. Why would AS/400 customers be willing to do that? My guess
    is that the different architecture is successfully sold as a "secret
    sauce" to them that justifies charging that much extra. Conversely,
    if they just were given hardware with ARM, Intel or AMD CPUs and the
    AS/400 (followon) OS, they would balk at the prices that IBM charges
    them, even if IBM reduces these prices by their share in Power
    development costs.

    - anton

    I am not sure that I follow your logic.
    Is it based on the assumption that System i constitutes the bulk of
    IBM POWER income? I somehow think that it does not.

    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Mar 22 01:12:58 2026
    From Newsgroup: comp.arch

    According to Michael S <already5chosen@yahoo.com>:
    That's an interesting statement. IBM could implement AS/400 on AMD64
    machines ...

    That's probably true, give or take the fact that AS/400 and i are big
    endian and AMD is little endian. POWER swings both ways, big endian for
    i and little endian for p running AIX or linux.

    So they probably think that they can charge AS/400 enough extra for
    running on Power, that it more than makes up for the development costs
    of Power. Why would AS/400 customers be willing to do that? ...

    I am not sure that I follow your logic.
    Is it based on the assumption that System i constitutes the bulk of IBM POWER income? I somehow think that it does not.

    IBM does not say, but I would think that most POWER machines are running i.
    If you want a linux server, there are a lot of alternatives, mostly less expensive. If you want i, you get it from IBM.

    They're not mutually exclusive. Recent i systems have a mode that can
    run linux code, and POWER has a virtual machine hypervisor that can run
    i and p virtual machines on the same system.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Mar 22 03:28:36 2026
    From Newsgroup: comp.arch

    On Sun, 22 Mar 2026 01:12:58 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to Michael S <already5chosen@yahoo.com>:
    That's an interesting statement. IBM could implement AS/400 on
    AMD64 machines ...


    Not that I disagree with the content, but the quote above is
    not mine. It was written by Anton Ertl.

    That's probably true, give or take the fact that AS/400 and i are big
    endian and AMD is little endian. POWER swings both ways, big endian
    for i and little endian for p running AIX or linux.

    So they probably think that they can charge AS/400 enough extra for
    running on Power, that it more than makes up for the development
    costs of Power. Why would AS/400 customers be willing to do that?
    ...

    I am not sure that I follow your logic.
    Is it based on assumption that System i constitutes a bulk of IBM
    POWER income? I somehow think that it is does not.

    IBM does not say, but I would think that most POWER machines are
    running i.

    I have no data, but if that were the case, why would IBM bother with
    providing anything except the smallest POWER box?
    Even the smallest box is probably huge overkill for i.

    If you want a linux server, there are a lot of
    alternatives, mostly less expensive.

    POWER tries to minimize software licensing expenses.
    Including by playing the dirty game of calling something that for most
    practical purposes is a dual-core or quad-core cluster "a core".

    If you want i, you get it from
    IBM.

    They're not mutually exclusive. Recent i systems have a mode that can
    run linux code, and POWER has a virtual machine hypervisor that can
    run i and p virtual machines on the same system.


    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Mar 21 19:57:38 2026
    From Newsgroup: comp.arch

    On 3/21/2026 9:21 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/21/2026 12:06 AM, Anton Ertl wrote:
    For Power, maybe the AS/400 customers are in a
    similar position.

    No, actually the opposite position. AS/400 user code uses a very high
    level system (i.e. no user assembly code) that provides much of the work
    and is proprietary to IBM. While that system could be, and was, ported
    to a different architecture (e.g. from S/38 to Power), of course, IBM
    has no incentive to port it to a generic platform.

    That's an interesting statement. IBM could implement AS/400 on AMD64 machines (AFAIK it uses some extra bit for tagging on their enhanced
    Power, but I am sure that an implementation on ARM A64 with top-byte
    ignore and on AMD64 with similar features (don't remember the name)
    would not incur too much overhead, if any).

    As others have stated, that is almost certainly technically possible.
    BTW, the latest iterations don't use the "extra" bit. That would make
    porting easier.



    That would save them the
    cost of continuing Power development.

    Yes, but to make the decision correctly requires knowledge held by IBM management, that neither you nor I have. For example, they haven't made
    major improvements in the Power architecture for years, so I suspect
    their development team is rather small, and thus doesn't cost too much. Furthermore, neither you nor I know how much revenue they are getting
    from the System P sales (Linux or AIX on power). And I know that there
    used to be fairly substantial embedded sales (by Freescale). I don't
    know if they get any royalties, etc., from that, and how dropping
    development might affect that.

    So overall, while surely technically possible, the economics of what you propose is not knowable to us outsiders.


    So they probably think that they can charge AS/400 enough extra for
    running on Power, that it more than makes up for the development costs
    of Power. Why would AS/400 customers be willing to do that? My guess
    is that the different architecture is successfully sold as a "secret
    sauce" to them that justifies charging that much extra. Conversely,
    if they just were given hardware with ARM, Intel or AMD CPUs and the
    AS/400 (followon) OS, they would balk at the prices that IBM charges
    them, even if IBM reduces these prices by their share in Power
    development costs.

    You have expressed your "distaste" of IBM marketing and their customers' decisions before. While you may be correct, I am not so sure. Without
    more detailed financial knowledge, it is very hard to say.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Mar 22 08:31:25 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 21 Mar 2026 16:21:15 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    So they probably think that they can charge AS/400 enough extra for
    running on Power, that it more than makes up for the development costs
    of Power. Why would AS/400 customers be willing to do that? My guess
    is that the different architecture is successfully sold as a "secret
    sauce" to them that justifies charging that much extra. Conversely,
    if they just were given hardware with ARM, Intel or AMD CPUs and the
    AS/400 (followon) OS, they would balk at the prices that IBM charges
    them, even if IBM reduces these prices by their share in Power
    development costs.

    - anton

    I am not sure that I follow your logic.
    Is it based on the assumption that System i constitutes the bulk of IBM POWER income?

    It is based on the assumption that System i (if that is this week's
    name for AS/400) has enough margin to pay for Power development, while
    RS/6000 (whatever its current name is) competes head-on with Intel and
    AMD servers in the Linux and pretty much also in the AIX markets. My
    guess is that IBM charges very little or nothing of Power development
    to RS/6000, but while it is available thanks to System i, they sell
    it; even if they made no profits on that, one benefit would be the
    marketing aspect of supporting the architecture for that long, longer
    than all the competitors from the 1990s.

    I have no numbers to support my guesses, but the fact that Power is
    still there while everything else including SPARC has been cancelled
    makes me doubt that the AIX and Linux business are keeping Power
    alive.

    My university bought an RS/6000 cluster in the early 1990s, and may
    have bought some followons, but AFAIK nothing in the 2000s or later
    (some research group may have bought some, but I have not heard of
    one). It seems to me that there is not much AIX and Linux on Power
    business these days.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Mar 22 08:49:43 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Michael S <already5chosen@yahoo.com>:
    That's an interesting statement. IBM could implement AS/400 on AMD64
    machines ...

    That's probably true, give or take the fact that AS/400 and i are big
    endian and AMD is little endian.

    If the hardware abstraction of AS/400 does not include byte order,
    that's a good point. There are ways to implement big-endian emulation
    on little-endian hardware, but they tend to cost some performance.
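
    A minimal sketch of what such emulation pays (hypothetical helper
    names, not AS/400's actual mechanism): every emulated big-endian load
    and store has to byte-swap on a little-endian host:

    ```python
    # Emulating big-endian memory on a little-endian host: the swap on
    # every access is where the performance cost comes from.
    def load_be32(mem, addr):
        """32-bit big-endian load from a byte-addressed memory image."""
        return int.from_bytes(mem[addr:addr + 4], "big")

    def store_be32(mem, addr, value):
        """32-bit big-endian store; the swap happens again on the way out."""
        mem[addr:addr + 4] = value.to_bytes(4, "big")

    mem = bytearray(8)
    store_be32(mem, 0, 0x12345678)
    assert mem[0] == 0x12          # most significant byte at lowest address
    assert load_be32(mem, 0) == 0x12345678
    ```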

    POWER swings both ways, big endian for
    i and little endian for p running AIX or linux.

    AFAIK AIX has continued to be big-endian, only Linux switched.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Mar 22 09:06:22 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    And I know that there
    used to be a fairly substantial embedded sales (by Freescale). I don't
    know if they get any royalties, etc, from that and how dropping
    development might effect that.

    IBM probably gets licensing fees, but AIM parted ways two decades ago
    (and had not really worked for several years before that), Freescale
    has been bought by NXP, and I already heard 10 years ago that NXP
    carries their PPC-based microcontrollers only as a concession to
    existing customers and recommends ARM for new projects. I don't know
    how their contracts look, but as soon as Freescale turned
    embedded-only (20 years ago, maybe earlier), they did not depend on
    further development of Power by IBM, and I doubt that they wanted to
    pay for that, so I don't think they would pay extra for IBM writing
    such a development guarantee into the contract, and IBM would not
    give such a guarantee for free.

    You have expressed your "distaste" of IBM marketing and their customers' decisions before.

    It's not just IBM. It's the kind of salesmanship that goes directly
    to the top ranks of organizations and convinces them, with whatever
    means, that they should buy from the company of the salesman, with the
    results often looking unreasonable to those in the lower ranks. Of
    course, the lower ranks are not told about the real reasons, whether
    it is plain "good" salesmanship (e.g., finding out where the top ranks
    are gullible; where the top ranks fell for the salesman's reasons, the
    lower ranks are then told these reasons, which appear unreasonable to
    them; e.g., read about Intel's decision to go for WNT in, IIRC, the
    oral history of Robert Colwell), or supporting that by
    kickbacks, or just the kind of good deal that a year's supply of free
    Heroin is (e.g., tell the top ranks that they can reduce the server
    costs and the headcount for sysadmins by buying capacity from a cloud
    provider, and ramp up the prices once the customers have made
    themselves dependent on that cloud provider).

    AS/400 on Power might be one of the things where people are more
    easily convinced that they are getting their money's worth than in the
    alternative scenario. In this case it's probably not just the top
    ranks; the lower ranks probably also feel elevated by working on this
    special hardware rather than the run-of-the-mill AMD64 fare.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Mar 22 09:48:54 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 22 Mar 2026 01:12:58 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:
    IBM does not say, but I would think that most POWER machines are
    running i.

    I have no data, but if that were the case, why would IBM bother with
    providing anything except the smallest POWER box?
    Even the smallest box is probably huge overkill for i.

    I don't know anybody who uses System i, so I speculate: if they did
    not stop writing software for that system decades ago, I would be
    surprised if the software did not make use of additional hardware
    capacity, as software tends to do. I certainly expect that IBM
    provides software goodies that work better on the bigger machines.
    And the customers might actually see an increasing use over the years
    of the software that is running on top of System i, also resulting in
    increased loads.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Mar 22 10:12:35 2026
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 3/21/2026 12:06 AM, Anton Ertl wrote:
    For Power, maybe the AS/400 customers are in a
    similar position.

    No, actually the opposite position. AS/400 user code uses a very high
    level system (i.e. no user assembly code) that provides much of the
    work and is proprietary to IBM. While that system could be, and was,
    ported to a different architecture (e.g. from S/38 to Power), of
    course, IBM has no incentive to port it to a generic platform.

    That's an interesting statement. IBM could implement AS/400 on AMD64 machines (AFAIK it uses some extra bit for tagging on their enhanced
    Power, but I am sure that an implementation on ARM A64 with top-byte
    ignore and on AMD64 with similar freatures (don't remember the name)
    would not incur too much overhead, if any). That would save them the
    cost of continuing Power development.

    Looking at the list of supported systems for SAP/HANA at https://www.sap.com/dmc/exp/2014-09-02-hana-hardware/enEN/ , one can
    see that only certain Intel and POWER 9-11 processors are certified
    (not sure why the 2014 is in the link, the list is current).
    Neither AMD nor ARM are on that list.

    This is a good application to be in, also one where the memory
    architecture of POWER can play to its strengths.

    Also, looking at https://newsroom.ibm.com/2026-01-28-IBM-RELEASES-FOURTH-QUARTER-RESULTS
    they have a 22 % profit margin on their "Infrastructure" segment;
    they do not differentiate between POWER and Z, but this seems
    to be a rather nice situation overall; they have a higher margin
    than consulting, but less than software.

    So, I don't think that POWER solely relies on System i (or whatever
    it is called today).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Mar 22 10:17:35 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Yes, but to make the decision correctly requires knowledge held by IBM management, that neither you nor I have. For example, they haven't made major improvements in the Power architecture for years, so I suspect
    their development team is rather small, and thus doesn't cost too much.

    The last update to the POWER architecture was the 3.1 ISA, which
    included 34-bit constants. Looking at the startup code of functions
    to load the TOC was too painful, I guess... Since then, Power 11
    has come out, with microarchitectural improvements only.

    But having a stable ISA over many microarchitectural revisions is
    a good thing for software: you don't need to recompile everything
    for a multitude of features like you have to do for Intel...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Sun Mar 22 10:19:00 2026
    From Newsgroup: comp.arch

    In article <10php0p$122m0$1@dont-email.me>, tkoenig@netcologne.de (Thomas Koenig) wrote:

    Is anybody still doing Alpha?

    Nobody makes the processors anymore. Linux still has a maintainer
    for the Alpha platform, which makes it less dead than IA-64. NetBSD and
    OpenBSD still produce releases for it.

    John
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Mar 22 11:22:36 2026
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    Tanenbaum in his book about computer architecture mentions results
    of Bell and Newell from 1971. IIUC Bell and Newell coded
    some programs in the microcode of the 2025 (the microcode engine of
    the 360/25). The claim is that such programs ran 45 times faster
    than programs using the 360 instruction set.

    According to the Wikipedia article, the 360/25 had 64 bytes of
    high-speed (SLT) memory with an access time of 180 ns, five times
    as fast as main memory (or the usual microcode control store).
    That makes the factor of 45 just borderline believable, but
    still something that would later go away with higher-speed
    memory.

    They also created a Fortran compiler/
    interpreter combination with the interpreter coded in 2025
    microcode. They claimed that this Fortran ran at a speed
    comparable to "native" Fortran on the 360/50.

    Ah, that's where the people who furnished the Fountainhead project
    at Data General got their ideas about programming-language
    dependent microcode from.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21d-Linux NewsLink 1.2