• Concertina III May Be Returning

    From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Aug 31 06:17:07 2025
    From Newsgroup: comp.arch

    I have had so much success in adjusting Concertina II to achieve my goals
    more fully than I had thought possible... that I now think that it may be possible to proceed from Concertina II to a design which gets rid of the
    one feature of Concertina II that has been the most controversial.

    Yes, I think that I could actually do without block structure.

    What would Concertina III look like?

    Well, the basic instruction set would be similar to that of Concertina II.
    But the P bits would be taken out of the operate instructions, and so
    would the option of replacing a register specification by a pseudo-
    immediate pointer.

    The tiny gaps between the opcodes of some instructions to squeeze out
    space for block headers would be removed.

    But the big opcode spaces formerly used for the shortest block header prefixes are what would be used to do without headers.

    Instead of a block header being used to indicate code consisting of variable-length instructions, variable-length instructions would be
    contained within a sequence of pairs of 32-bit instructions of this form:

    11110xx(17 bits)(8 bits)
    11111x(9 bits)(17 bits)

    Instructions could be 17 bits long, 34 bits long, 51 bits long, and so on,
    any multiple of 17 bits in length.

    In the first instruction slot of the pair, the two bits xx indicate, for
    the two 17-bit regions of the variable-length instruction area that start
    in it, whether each is the first 17-bit area of an instruction. The second
    instruction slot only contains the start of one 17-bit area, so only one
    bit x is needed. Since 17 is an odd number, this meshes perfectly with the
    fact that the 17-bit area which straddles both words isn't split evenly;
    rather, one extra bit of it is in the second 32-bit instruction slot.
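    A minimal C sketch of that extraction (names hypothetical; bit positions
    assume the fields are packed left-to-right as drawn above):

    #include <stdint.h>

    typedef struct {
        uint32_t area[3];   /* the three 17-bit areas, right-justified */
        int      start[3];  /* 1 if area[i] begins an instruction      */
    } pair17;

    static pair17 unpack_pair(uint32_t w0, uint32_t w1)
    {
        pair17 p;
        /* w0 = 11110 xx (17 bits) (8 bits) */
        p.start[0] = (w0 >> 26) & 1;              /* first x       */
        p.start[1] = (w0 >> 25) & 1;              /* second x      */
        p.area[0]  = (w0 >> 8) & 0x1FFFF;
        /* straddling area: 8 bits from w0, 9 bits from w1 */
        p.area[1]  = ((w0 & 0xFF) << 9) | ((w1 >> 17) & 0x1FF);
        /* w1 = 11111 x (9 bits) (17 bits) */
        p.start[2] = (w1 >> 26) & 1;
        p.area[2]  = w1 & 0x1FFFF;
        return p;
    }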

    I had been hoping to use 18-bit areas instead, but after re-checking my calculations, I found there just wasn't enough opcode space.

    Long instructions that contain immediates would not be part of variable-
    length instruction code. Instead, their lengths would be multiples of 32
    bits, making them part of ordinary code with 32-bit instructions.

    Their form would be like this:

    32-bit immediate:

    1010(12 bits)(16 bits)
    10111(11 bits)(16 bits)

    where the first parenthesized area belongs to the instruction, and the
    second to the immediate.

    48-bit immediate:

    1010(12 bits)(16 bits)
    10110(11 bits)(16 bits)
    10111(11 bits)(16 bits)

    64-bit immediate:

    1010(12 bits)(16 bits)
    10110(3 bits)(24 bits)
    10111(3 bits)(24 bits)

    Since the instruction, exclusive of the immediate, really only needs 12
    bits - a 7-bit opcode and a 5-bit destination register - in each case
    there's enough additional space for the instruction to begin with a few
    bits that indicate its length, so that decoding is simple.

    The scheme is not really space-efficient.

    But the question that I really have is... is this really any better than having block headers? Or is it just as bad, just as complicated?

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 31 13:12:52 2025
    From Newsgroup: comp.arch

    On 8/31/2025 1:17 AM, John Savard wrote:
    I have had so much success in adjusting Concertina II to achieve my goals more fully than I had thought possible... that I now think that it may be possible to proceed from Concertina II to a design which gets rid of the
    one feature of Concertina II that has been the most controversial.

    Yes, I think that I could actually do without block structure.

    What would Concertina III look like?

    Well, the basic instruction set would be similar to that of Concertina II. But the P bits would be taken out of the operate instructions, and so
    would the option of replacing a register specification by a pseudo-
    immediate pointer.

    The tiny gaps between the opcodes of some instructions to squeeze out
    space for block headers would be removed.

    But the big opcode spaces formerly used for the shortest block header prefixes are what would be used to do without headers.

    Instead of a block header being used to indicate code consisting of variable-length instructions, variable-length instructions would be
    contained within a sequence of pairs of 32-bit instructions of this form:

    11110xx(17 bits)(8 bits)
    11111x(9 bits)(17 bits)

    Instructions could be 17 bits long, 34 bits long, 51 bits long, and so on, any multiple of 17 bits in length.

    In the first instruction slot of the pair, the two bits xx indicate, for
    the two 17-bit regions of the variable-length instruction area that start
    in it, whether each is the first 17-bit area of an instruction. The second
    instruction slot only contains the start of one 17-bit area, so only one
    bit x is needed. Since 17 is an odd number, this meshes perfectly with the
    fact that the 17-bit area which straddles both words isn't split evenly;
    rather, one extra bit of it is in the second 32-bit instruction slot.

    I had been hoping to use 18-bit areas instead, but after re-checking my calculations, I found there just wasn't enough opcode space.

    Long instructions that contain immediates would not be part of variable- length instruction code. Instead, their lengths would be multiples of 32 bits, making them part of ordinary code with 32-bit instructions.

    Their form would be like this:

    32-bit immediate:

    1010(12 bits)(16 bits)
    10111(11 bits)(16 bits)

    where the first parenthesized area belongs to the instruction, and the
    second to the immediate.

    48-bit immediate:

    1010(12 bits)(16 bits)
    10110(11 bits)(16 bits)
    10111(11 bits)(16 bits)

    64-bit immediate:

    1010(12 bits)(16 bits)
    10110(3 bits)(24 bits)
    10111(3 bits)(24 bits)

    Since the instruction, exclusive of the immediate, really only needs 12
    bits - a 7-bit opcode and a 5-bit destination register - in each case
    there's enough additional space for the instruction to begin with a few
    bits that indicate its length, so that decoding is simple.

    The scheme is not really space-efficient.

    But the question that I really have is... is this really any better than having block headers? Or is it just as bad, just as complicated?



    How about, say, 16/32/48/64/96:
    xxxx-xxxx-xxxx-xxx0 //16 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix

    Already elaborate enough...



    Where:
    Prefix+16b = 48b
    Prefix+32b = 64b
    Prefix+Prefix+32b = 96b
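
    A minimal C sketch of the length determination this implies (function
    name hypothetical; assumes the low half-word of a 32-bit op comes first):

    #include <stdint.h>

    static int instr_len_bits(const uint16_t *p)
    {
        int len = 0;
        /* consume prefixes: 32-bit words with low six bits all ones */
        while ((p[0] & 1) && ((p[0] & 0x3F) == 0x3F)) {
            len += 32;
            p   += 2;
        }
        if ((p[0] & 1) == 0)
            return len + 16;      /* 16b, or Prefix+16b = 48b        */
        return len + 32;          /* 32b, Prefix+32b = 64b, 96b, ... */
    }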

    This leaves 15 bits of encoding space for 16 bit ops.

    Nominally, the 16-bit ops can have 16 registers.
    Preferably:
    8 scratch, 8 callee save
    Roughly the first 4-8 argument registers.
    A few cases could have 32 registers.

    For 32-bit ops, 32 or 64 registers.

    Might map out register space as, say:
    R0..R3: ZR/LR/SP/GP
    R4..R7: Scratch
    R8..R15: Scratch / Arg
    R16..R31: Callee Save
    For high 32:
    R32..R47: Scratch
    R48..R63: Callee Save


    With 16-bit ops maybe having R8..R23 (or R16..R24,R8..R15) for the
    Reg4 field.

    Some possible 16-bit ops:
    00in-nnnn-iiii-0000 ADD Imm5s, Rn5 //"ADD 0, R0" = TRAP
    01in-nnnn-iiii-0000 MOV Imm5s, Rn5
    10mn-nnnn-mmmm-0000 ADD Rm5, Rn5
    11mn-nnnn-mmmm-0000 MOV Rm5, Rn5

    0000-nnnn-iiii-0010 ADDW Imm4u, Rn4 //ADD Imm, 32-bit sign extension
    0001-nnnn-mmmm-0010 SUB Rm4, Rn4
    0010-nnnn-mmmm-0010 ADDW Imm4n, Rn4 //ADD Imm, 32-bit sign extension
    0011-nnnn-mmmm-0010 MOVW Rm4, Rn4 //MOV with 32-bit sign extension
    0100-nnnn-mmmm-0010 ADDW Rm4, Rn4
    0101-nnnn-mmmm-0010 AND Rm4, Rn4
    0110-nnnn-mmmm-0010 OR Rm4, Rn4
    0111-nnnn-mmmm-0010 XOR Rm4, Rn4

    1000-dddd-dddd-0010 BRA Disp8
    ...

    0ddd-nnnn-mmmm-0100 LW Disp3(Rm4), Rn4
    1ddd-nnnn-mmmm-0100 SW Disp3(Rm4), Rn4
    0ddd-nnnn-mmmm-0110 LD Disp3(Rm4), Rn4
    1ddd-nnnn-mmmm-0110 SD Disp3(Rm4), Rn4

    00dn-nnnn-dddd-1000 LW Disp5(SP), Rn5
    01dn-nnnn-dddd-1000 LD Disp5(SP), Rn5
    10dn-nnnn-dddd-1000 SW Disp5(SP), Rn5
    11dn-nnnn-dddd-1000 SD Disp5(SP), Rn5

    ...

    Avoid the temptation to make immediate and displacement fields overly
    large; usage drops off very rapidly beyond small values. Not having
    enough registers hurts more here than narrow Imm/Disp fields.

    The main place a larger Disp is justified is SP-relative Load/Store.

    The 16-bit ops don't need to be sufficient to support the whole ISA,
    merely to provide space-savings for a subset of common-case operations. Preferably with the encodings not being confetti.



    32 bit instruction layout could do whatever.
    zzzz-oooo-oomm-mmmm-zzzz-nnnn-nnyy-yyy1 //Similar to XG3
    zzzz-zzzo-oooo-mmmm-mzzz-nnnn-nyyy-yyy1 //Similar to RISC-V
    zzzz-zzoo-ooom-mmmm-zzzz-znnn-nnyy-yyy1 //Intermediate, 5b registers
    ...

    May make sense to slightly shrink immediate and displacement fields
    relative to RISC-V, and instead assume use of jumbo prefixes.

    Also, probably not doing something like RISC-V's JAL, which is an
    unreasonable waste of encoding space.

    As for 5 or 6 bit registers, possibilities:
    * Purely 5-bit, like traditional RISC-V
    * Purely 6 bit, like XG3, but this leaves less encoding space.
    * Mixed 5 or 6 bit, like XG1, but more intentionally
    ** In XG1, the subset of 32b ops with Reg6 were a bit of a hack.


    Possible, assuming a mixed 5/6 bit scheme:
    zzzz-oooo-oomm-mmmm-zzzz-nnnn-nnyy-yyy1 //3R, Reg6
    zzzz-zooo-oozm-mmmm-zzzz-znnn-nnyy-yyy1 //3R, Reg5
    iiii-iiii-iimm-mmmm-zzzz-nnnn-nnyy-yyy1 //3RI, Reg6 (Imm10)
    iiii-iiii-iizm-mmmm-zzzz-znnn-nnyy-yyy1 //3RI, Reg5 (Imm10)
    iiii-iiii-iiii-iiii-zzzz-nnnn-nnyy-yyy1 //2RI, Reg6 (Imm16)
    iiii-iiii-iiii-iiii-zzzz-0jjj-jj11-1111 //Jumbo Prefix (Imm, Typical)
    Extends Imm10 to Imm33, leaves 4 bits for sub-operation.
    If prefix is used with a Reg5 op, extend reg fields to 6-bit.

    Keeping register fields consistent helps with superscalar. What matters
    more here is that the low-order bits remain in the same place.

    Keeping immediate fields consistent helps with "everything not being a
    massive pain". I would assume all normal 3RI immediate and displacement
    instructions have the same size and layout of immediate. However,
    depending on the instruction it may change scale. As I see it, scale
    changing is preferable to bit-slicing though.
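
    A sketch of the scale-changing idea (name hypothetical): one Imm10
    layout, sign-extended the same way everywhere, with only the per-opcode
    scale differing:

    #include <stdint.h>

    static int64_t disp_bytes(uint32_t imm10, int log2size)
    {
        int32_t d = (int32_t)(imm10 << 22) >> 22;  /* sign-extend 10 bits */
        return (int64_t)d << log2size;             /* e.g. 3 for a 64-bit load */
    }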


    While less elegant to have a mix of 5 and 6 bit register encodings,
    doing so could be more economical regarding the use of encoding space
    compared with purely 6 bit (while being less limiting compared with pure
    5 bit).

    Possibly, one could have a semi-split space where, say:
    R0..R31 are primarily integer registers;
    R32..R63 are primarily FPU and SIMD registers;
    A lot of core operations have access to the entire register space;
    Non-core operations might be specific to the low or high 32;
    SIMD-128 ops could use 5b register fields,
    but still have the full register space


    The semi-split scheme could work well in medium to low register pressure scenarios; with 6-bit core ops helping with high register pressure.

    Likely, at least, all the Load/Store and basic ALU operations need Reg6.


    Not having 16-bit ops could simplify implementation slightly and also
    allow better performance with a simpler implementation. One other option
    being to allow 16 bit ops, but with the caveat that using them may
    reduce performance (with the compiler ideally keeping performance
    optimized using primarily 32-bit encodings wherever possible, and
    maintaining a 32-bit alignment of the instruction stream).

    Where, in such a case, 16-bit ops could be left either for low-traffic
    code or for size optimized binaries.

    ...


    Though, more elegant (and possibly also highest performance) could be to
    just go for 32/64/96 bit instructions with 6-bit register fields (also,
    any lower-probability ops can use 64-bit encodings; though, at the cost
    of code density).


    Just another random pull here...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Sep 2 09:15:59 2025
    From Newsgroup: comp.arch

    On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:

    How about, say, 16/32/48/64/96:
    xxxx-xxxx-xxxx-xxx0 //16 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix

    Already elaborate enough...

    Thank you for your interesting suggestions.

    I'm envisaging Concertina III as closely based on Concertina II, with only minimal changes.

    Like Concertina II, it is to meet the overriding condition that
    instructions do not have to be decoded sequentially. This means that
    whenever an instruction, or group of instructions, spans more than 32
    bits, the 32 bit areas of the instruction, other than the first, must
    begin with a combination of bits that says "don't decode me".

    The first 32 bits of an instruction get decoded directly, and then trigger
    and control the decoding of the rest of the instruction.

    This has the consequence that any immediate value that is 32 bits or more
    in length has to be split up into smaller pieces; this is what I really
    don't like about giving up the block structure.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Sep 2 13:07:07 2025
    From Newsgroup: comp.arch

    On 9/2/2025 4:15 AM, John Savard wrote:
    On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:

    How about, say, 16/32/48/64/96:
    xxxx-xxxx-xxxx-xxx0 //16 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix

    Already elaborate enough...

    Thank you for your interesting suggestions.

    I'm envisaging Concertina III as closely based on Concertina II, with only minimal changes.

    Like Concertina II, it is to meet the overriding condition that
    instructions do not have to be decoded sequentially. This means that
    whenever an instruction, or group of instructions, spans more than 32
    bits, the 32 bit areas of the instruction, other than the first, must
    begin with a combination of bits that says "don't decode me".

    The first 32 bits of an instruction get decoded directly, and then trigger and control the decoding of the rest of the instruction.

    This has the consequence that any immediate value that is 32 bits or more
    in length has to be split up into smaller pieces; this is what I really
    don't like about giving up the block structure.


    Note that tagging like that described does still allow some amount of
    parallel decoding, since we still have combinatorial logic. Granted, scalability is an issue.

    As can be noted, my use of jumbo-prefixes for large immediate values
    does have the property of allowing reusing 32-bit decoders for 64-bit
    and 96-bit instructions. In most cases, the 64-bit and 96-bit encodings
    don't change the instruction being decoded, but merely extend it.

    Some internal plumbing is needed to stitch the immediate values together though, typically:
      We have OpA, OpB, OpC
      DecC gets OpC, and JBits from OpB
      DecB gets OpB, and JBits from OpA
      DecA gets OpA, and 0 for JBits.
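
    A rough C model of that routing (all names hypothetical; the
    jumbo-prefix test below is a placeholder, not any real encoding):

    #include <stdint.h>

    typedef struct { uint32_t op, jbits; } dec_in;

    static uint32_t jbits_of(uint32_t op)
    {
        int is_jumbo = (op & 0x7F) == 0x7F;   /* placeholder predicate */
        return is_jumbo ? (op >> 7) : 0;      /* prefix's immediate bits */
    }

    static void route_jbits(uint32_t opA, uint32_t opB, uint32_t opC,
                            dec_in *decA, dec_in *decB, dec_in *decC)
    {
        decA->op = opA; decA->jbits = 0;              /* no older word */
        decB->op = opB; decB->jbits = jbits_of(opA);
        decC->op = opC; decC->jbits = jbits_of(opB);
    }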

    In my CPU core, I had a few times considered changing how decoding
    worked, to either reverse or right-align the instruction block to reduce
    the amount of MUX'ing needed in the decoder. If going for
    right-alignment, then DecC would always go to Lane1, DecB to Lane2, and
    DecA to Lane3.

    Can note that for immediate-handling, the Lane1 decoder produces the low
    33 bits of the result. If a decoder has a jumbo prefix and is itself
    given a jumbo-prefix, it assumes a 96 bit encoding and produces the
    value for the high 32 bits.

    At least in my designs, I only account for 33 bits of immediate per
    lane. Instead, when a full 64-bit immediate is encoded, its value is
    assembled in the ID2/RF stage.


    Though, admittedly my CPU core design did fall back to sequential
    execution for 16-bit ops, but this was partly for cost reasons.

    For BJX2/XG1 originally, it was because the instructions couldn't use
    WEX tagging, but after adding superscalar it was because I would either
    need multiple parallel 16-bit decoders, or to change how 16 bit ops were handled (likely using a 16->32 repacker).

    So, say:
      IF stage:
        Retrieve instruction from Cache Line;
        Determine fetch length:
          XG1/XG2 used explicit tagging;
          XG3 and RV use SuperScalar checks.
        Run repackers.
          Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
      Decode Stage:
        Decode N parallel 32-bit ops;
        Prefixes route to the corresponding instructions;
        Any lane holding solely a prefix goes NOP.


    For a repacker, it would help if there were fairly direct mappings
    between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
    appear to fit such a pattern. Personally, there isn't much good to say
    about RVC's encoding scheme, as it is very much ad-hoc dog chew.

    The usual claim is more that it is "compressed" in that you can first
    generate a 32-bit op internally and "squish" it down into a 16-bit form
    if it fits. This isn't terribly novel as I see it. Repacking RVC has
    similar problems to decoding it directly, namely that for a fair number
    of instructions, nearly each instruction has its own idiosyncratic
    encoding scheme (and you can't just simply shuffle some of the bits
    around and fill others with 0s and similar to arrive back at a valid RVI instruction).


    Contrast, say, XG3 is mostly XG2 with the bits shuffled around; though
    there were some special cases made in the decoding rules. Though,
    admittedly I did do more than the bare minimum here (to fit it into the
    same encoding space as RV), mostly as I ended up going for a "Dog Chew Reduction" route rather than merely a "do the bare minimum bit-shuffling needed to make it fit".

    For better or worse, it effectively made XG3 its own ISA as far as BGBCC
    is concerned. Even if in theory I could have used repacking, the
    original XG1/XG2 emitter logic is a total mess. It was written
    originally for fixed-length 16-bit ops, so encodes and outputs
    instructions 16 bits at a time (using big "switch()" blocks, but the
    RISC-V and XG3 emitters also went this path; as far as BGBCC is
    concerned, it is treating XG3 as part of RISC-V land).


    Both the CPU core and also JX2VM handle it by repacking to XG2 though.
    For the XG3VM (userland only emulator for now), it instead decodes XG3 directly, with decoders for XG3, RVI, and RVC.

    Had noted the relative irony that despite XG3 having a longer
    instruction listing (than RVI) it still ends up with a slightly shorter decoder.

    Some of this has to do with one big annoyance of RISC-V's encoding
    scheme: its inconsistent and dog-chewed handling of immediate and
    displacement values.


    Though, for mixed-output, there are still a handful of cases where RVI encodings can beat XG3 encodings, mostly involving cases where the RVI encodings have a slightly larger displacement.

    In compiler stats, this seems to mostly affect:
      LB, LBU, LW, LWU
      SB, SW
      ADDI, ADDIW, LUI
    The former:
      Well, unscaled 12-bit beats scaled 10-bit for 8 and 16-bit load/store.
      ADDI: 12b > 10b
    LUI: because loading a 32-bit value of the form XXXXX000 does happen
    sometimes it seems.

    Instruction counts are low enough that a "pure XG3" would likely result
    in Doom being around 1K larger (the cases where RVI ops are used would
    need a 64-bit jumbo-encoding in XG3).

    Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using
    separate Imm10u/Imm10n encodings, rather than an Imm10s, does have merit
    in that this effectively gives it an Imm11s encoding; and ADD is one of
    the main instructions that tends to be big-immediate-heavy (and in early design it was a close race between ADD ImmU/ImmN, vs ADD/SUB ImmU, but
    the current scheme has a tiny bit more range, albeit SUB-ImmU could have possibly avoided the need for an ImmN case).

    So, say:
      ADD: Large immediate heavy.
        SUB: Can reduce to ADD in the immediate case.
      AND: Preferable to have signed immediate values;
        Not common enough to justify the ImmU/ImmN scheme.
      OR: Almost exclusively positive;
      XOR: Almost exclusively positive.

    Load/Store displacements are very lopsided in the positive direction.
      Disp10s slightly beats Disp10u though.
        More negative displacements than 512..1023.
      XG1 had sorta hacked around it by:
        Disp9u, Disp5n
        Disp10u, Disp6n was considered, but didn't go that way.
          Disp10s was at least, "slightly less ugly",
            Even if Disp10u+Disp6n would have arguably been better
            for code density.

    Or, cough, someone could maybe do signed load/store displacements like:
      000..110: Positive
      111: Negative
    So, Disp10as is 0..1791, -256..-1
    Would better fit the statistical distribution, but... Errm...


    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Sep 2 13:10:23 2025
    From Newsgroup: comp.arch

    On 9/2/2025 1:07 PM, BGB wrote:
    On 9/2/2025 4:15 AM, John Savard wrote:
    On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:

    How about, say, 16/32/48/64/96:
                            xxxx-xxxx-xxxx-xxx0  //16 bit
        xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1  //32 bit
        xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111  //64/48/96 bit prefix

    Already elaborate enough...

    Thank you for your interesting suggestions.

    I'm envisaging Concertina III as closely based on Concertina II, with
    only minimal changes.

    Like Concertina II, it is to meet the overriding condition that
    instructions do not have to be decoded sequentially. This means that
    whenever an instruction, or group of instructions, spans more than 32
    bits, the 32 bit areas of the instruction, other than the first, must
    begin with a combination of bits that says "don't decode me".

    The first 32 bits of an instruction get decoded directly, and then
    trigger and control the decoding of the rest of the instruction.

    This has the consequence that any immediate value that is 32 bits or more
    in length has to be split up into smaller pieces; this is what I really
    don't like about giving up the block structure.


    Note that tagging like that described does still allow some amount of parallel decoding, since we still have combinatorial logic. Granted, scalability is an issue.

    As can be noted, my use of jumbo-prefixes for large immediate values
    does have the property of allowing reusing 32-bit decoders for 64-bit
    and 96-bit instructions. In most cases, the 64-bit and 96-bit encodings don't change the instruction being decoded, but merely extend it.

    Some internal plumbing is needed to stitch the immediate values together though, typically:
      We have OpA, OpB, OpC
      DecC gets OpC, and JBits from OpB
      DecB gets OpB, and JBits from OpA
      DecA gets OpA, and 0 for JBits.

    In my CPU core, I had a few times considered changing how decoding
    worked, to either reverse or right-align the instruction block to reduce
    the amount of MUX'ing needed in the decoder. If going for right-
    alignment, then DecC would always go to Lane1, DecB to Lane2, and DecA
    to Lane3.

    Can note that for immediate-handling, the Lane1 decoder produces the low
    33 bits of the result. If a decoder has a jumbo prefix and is itself
    given a jumbo-prefix, it assumes a 96 bit encoding and produces the
    value for the high 32 bits.

    At least in my designs, I only account for 33 bits of immediate per
    lane. Instead, when a full 64-bit immediate is encoded, its value is assembled in the ID2/RF stage.


    Though, admittedly my CPU core design did fall back to sequential
    execution for 16-bit ops, but this was partly for cost reasons.

    For BJX2/XG1 originally, it was because the instructions couldn't use
    WEX tagging, but after adding superscalar it was because I would either
    need multiple parallel 16-bit decoders, or to change how 16 bit ops were handled (likely using a 16->32 repacker).

    So, say:
      IF stage:
        Retrieve instruction from Cache Line;
        Determine fetch length:
          XG1/XG2 used explicit tagging;
          XG3 and RV use SuperScalar checks.
        Run repackers.
          Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
      Decode Stage:
        Decode N parallel 32-bit ops;
        Prefixes route to the corresponding instructions;
        Any lane holding solely a prefix goes NOP.


    For a repacker, it would help if there were fairly direct mappings
    between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
    appear to fit such a pattern. Personally, there isn't much good to say
    about RVC's encoding scheme, as it is very much ad-hoc dog chew.

    The usual claim is more that it is "compressed" in that you can first generate a 32-bit op internally and "squish" it down into a 16-bit form
    if it fits. This isn't terribly novel as I see it. Repacking RVC has
    similar problems to decoding it directly, namely that for a fair number
    of instructions, nearly each instruction has its own idiosyncratic
    encoding scheme (and you can't just simply shuffle some of the bits
    around and fill others with 0s and similar to arrive back at a valid RVI instruction).


    Contrast, say, XG3 is mostly XG2 with the bits shuffled around; though
    there were some special cases made in the decoding rules. Though,
    admittedly I did do more than the bare minimum here (to fit it into the
    same encoding space as RV), mostly as I ended up going for a "Dog Chew Reduction" route rather than merely a "do the bare minimum bit-shuffling needed to make it fit".

    For better or worse, it effectively made XG3 its own ISA as far as BGBCC
    is concerned. Even if in theory I could have used repacking, the
    original XG1/XG2 emitter logic is a total mess. It was written
    originally for fixed-length 16-bit ops, so encodes and outputs
    instructions 16 bits at a time (using big "switch()" blocks, but the
    RISC-V and XG3 emitters also went this path; as far as BGBCC is
    concerned, it is treating XG3 as part of RISC-V land).


    Both the CPU core and also JX2VM handle it by repacking to XG2 though.
    For the XG3VM (userland only emulator for now), it instead decodes XG3 directly, with decoders for XG3, RVI, and RVC.

    Had noted the relative irony that despite XG3 having a longer
    instruction listing (than RVI) it still ends up with a slightly shorter decoder.

    Some of this has to do with one big annoyance of RISC-V's encoding
    scheme: its inconsistent and dog-chewed handling of immediate and
    displacement values.


    Though, for mixed-output, there are still a handful of cases where RVI encodings can beat XG3 encodings, mostly involving cases where the RVI encodings have a slightly larger displacement.

    In compiler stats, this seems to mostly affect:
      LB, LBU, LW, LWU
      SB, SW
      ADDI, ADDIW, LUI
    The former:
      Well, unscaled 12-bit beats scaled 10-bit for 8 and 16-bit load/store.
      ADDI: 12b > 10b
    LUI: because loading a 32-bit value of the form XXXXX000 does happen sometimes it seems.

    Instruction counts are low enough that a "pure XG3" would likely result
    in Doom being around 1K larger (the cases where RVI ops are used would
    need a 64-bit jumbo-encoding in XG3).

    Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using
    separate Imm10u/Imm10n encodings, rather than an Imm10s, does have merit
    in that this effectively gives it an Imm11s encoding; and ADD is one of
    the main instructions that tends to be big-immediate-heavy (and in early design it was a close race between ADD ImmU/ImmN, vs ADD/SUB ImmU, but
    the current scheme has a tiny bit more range, albeit SUB-ImmU could have possibly avoided the need for an ImmN case).

    So, say:
      ADD: Large immediate heavy.
        SUB: Can reduce to ADD in the immediate case.
      AND: Preferable to have signed immediate values;
        Not common enough to justify the ImmU/ImmN scheme.
      OR: Almost exclusively positive;
      XOR: Almost exclusively positive.

    Load/Store displacements are very lopsided in the positive direction.
      Disp10s slightly beats Disp10u though.
        More negative displacements than 512..1023.
      XG1 had sorta hacked around it by:
        Disp9u, Disp5n
        Disp10u, Disp6n was considered, but didn't go that way.
          Disp10s was at least, "slightly less ugly",
            Even if Disp10u+Disp6n would have arguably been better
            for code density.

    Or, cough, someone could maybe do signed load/store displacements like:
      000..110: Positive
      111: Negative
    So, Disp10as is 0..1791, -256..-1
    Would better fit the statistical distribution, but... Errm...


    Self-correction (brain fart), that would be a Disp11as, groan...

    Probably Disp10as:
      000..110: Positive
      111: Negative
      Disp10as is 0..895, -128..-1
    Or:
      00..10: Positive
      11: Negative
      Disp10as is 0..767, -256..-1
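
    For the first variant, a one-line C sketch (function name hypothetical):
    a raw 10-bit field whose top three bits are 111 decodes as the small
    negative range, giving 0..895 and -128..-1:

    static int dec_disp10as(unsigned d10)   /* d10 = raw 10-bit field */
    {
        return ((d10 >> 7) == 7) ? (int)d10 - 1024 : (int)d10;
    }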

    ...


    ...


    John Savard


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Sep 2 18:40:16 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:

    How about, say, 16/32/48/64/96:
    xxxx-xxxx-xxxx-xxx0 //16 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
    xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix

    Already elaborate enough...

    Thank you for your interesting suggestions.

    I'm envisaging Concertina III as closely based on Concertina II, with only minimal changes.

    Like Concertina II, it is to meet the overriding condition that
    instructions do not have to be decoded sequentially. This means that whenever an instruction, or group of instructions, spans more than 32
    bits, the 32 bit areas of the instruction, other than the first, must
    begin with a combination of bits that says "don't decode me".

    The first 32 bits of an instruction get decoded directly, and then trigger and control the decoding of the rest of the instruction.

    This has the consequence that any immediate value that is 32 bits or more
    in length has to be split up into smaller pieces; this is what I really don't like about giving up the block structure.

    I found this completely unnecessary.

    Only a small number of Major OpCodes can have constants, denoted by::

      0b'001xxx ddddd sssss D12ds minor xsssss

    D=0 signifies that '1' and '2' specify 5-bit immediates
    D=1 signifies a constant
    d=0 signifies 32-bit constant
    d=1 signifies 64-bit constant
    '1' signifies negation of Src1
    '2' signifies negation of Src2

    In effect, D12ds is a routing specifier, telling DECODE what to route
    where in an easy to determine pattern. You could go so far as to call
    it a routing OpCode. This field is a large contributor to how My 66000
    requires fewer instructions than Other ISAs.
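
    Reading the field as literally the bits D,1,2,d,s in that order, a decode
    sketch might look like this in C (the ordering and names are my
    assumptions, not a My 66000 reference; 's' is not defined in the excerpt
    and is ignored):

    typedef struct {
        int has_const;    /* D: a constant follows the instruction */
        int neg_src1;     /* '1': negate Src1                      */
        int neg_src2;     /* '2': negate Src2                      */
        int const_words;  /* d: 1 word (32-bit) or 2 (64-bit)      */
    } d12ds;

    static d12ds decode_d12ds(unsigned f)   /* f = the 5-bit field */
    {
        d12ds r;
        r.has_const   = (f >> 4) & 1;
        r.neg_src1    = (f >> 3) & 1;
        r.neg_src2    = (f >> 2) & 1;
        r.const_words = ((f >> 1) & 1) ? 2 : 1;
        return r;
    }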

    However, I also found that STs need an immediate and a displacement, so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Lest one thinks this results in serial decoding, consider that the
    pattern decoder is 40 gates (just larger than 3-flip-flops) so one
    can afford to put this pattern decoder on every word in the inst-
    buffer and then inst[0] selects inst[1], but inst[1] has already
    selected inst[2] which has selected inst[3] and we have a tree
    pattern that can parse 16-instructions in a 16-gate cycle time
    from a 24-32 word input-buffer to DECODE. I call this stage of
    the pipeline PARSE.

    Also note that 1 My 66000 instruction does the work of 1.4 RISC-V
    instructions, so, a 6-wide My 66000 machine is equivalent to an
    8.4-to-9-wide RISC-V machine.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Sep 2 23:55:17 2025
    From Newsgroup: comp.arch

    On Tue, 02 Sep 2025 18:40:16 +0000, MitchAlsup wrote:

    Lest one thinks this results in serial decoding, consider that the
    pattern decoder is 40 gates (just larger than 3-flip-flops) so one can
    afford to put this pattern decoder on every word in the inst-buffer

    Yes, given sufficiently simple decoding, one could allow backtracking when
    the second word of an instruction is decoded as if it was the first.

    Of course, though, it wastes electricity and produces heat, but a
    negligible amount, I agree.

    I'm designing my ISA, though, to make it simple to implement... in one
    specific sense. It's horribly large and complicated, but at least it
    doesn't demand that implementors understand any fancy tricks.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Sep 3 17:29:30 2025
    From Newsgroup: comp.arch

    On 9/2/2025 6:55 PM, John Savard wrote:
    On Tue, 02 Sep 2025 18:40:16 +0000, MitchAlsup wrote:

    Lest one thinks this results in serial decoding, consider that the
    pattern decoder is 40 gates (just larger than 3-flip-flops) so one can
    afford to put this pattern decoder on every word in the inst-buffer

    Yes, given sufficiently simple decoding, one could allow backtracking when the second word of an instruction is decoded as if it was the first.

    Of course, though, it wastes electricity and produces heat, but a
    negligible amount, I agree.

    I'm designing my ISA, though, to make it simple to implement... in one
    specific sense. It's horribly large and complicated, but at least it
    doesn't demand that implementors understand any fancy tricks.


    Usual strategy is, say, for each cache line:
    Detect which 16-bit words represent 16 or 32 bit instructions, and which
    are prefixes.


    Logic isn't too unreasonable, except that it needs to deal with multiple
    ISAs in my case; and so the current ISA mode needs to be visible to the
    L1 I$, have time to settle, and may need to flush cache lines when the
    ISA changes. The relevant tagging data exists alongside the cacheline
    proper, and is resolved when each line arrives into the L1 (following an
    I$ Miss). This has looser latency than determining width and similar
    during the IF stage proper (but, it is only reasonable to store a few
    bits per word).

    So, length determination:
      XG1:
        Is32  = (15:13)==111 || (15:12)==0111 || (15:12)==1001
        IsJX  = (15: 9)==1111_111
        IsWEX = ((15:12)==1111 && (10)==1) ||
                ((15:12)==1110 && (11:8)==1z1z)  //PrWEX
      XG2:
        Is32  = 1
        IsJX  = (12: 9)==1111
        IsWEX = ((12)==1 && (10)==1) ||
                ((12)==0 && (11:8)==1z1z)  //PrWEX
      XG3:
        Is32  = 1
        IsJX  =
          (5:0)== 11z10  ||  //XG3 prefix
          (6:0)==1111111     //RISC-V (longer instruction)
        IsWEX = 0
      RV64GC:
        Is32  = (1:0)==11
        IsJX  = (6:0)==1111111
        IsWEX = 0
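
    Modeled in C for a single 16-bit word w, the XG1 and RV64GC rows might
    look like (names hypothetical; 'z' don't-care bits are not checked):

    #include <stdint.h>

    static int xg1_is32(uint16_t w)
    {
        unsigned t4 = w >> 12;                     /* bits (15:12) */
        return ((w >> 13) == 7) || t4 == 7 || t4 == 9;
    }
    static int rv_is32(uint16_t w)  { return (w & 0x03) == 0x03; }
    static int rv_isjx(uint16_t w)  { return (w & 0x7F) == 0x7F; }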

    For odd-numbered 16-bit words, only the XG1 and RV64GC cases are
    relevant. Actual logic overlaps the logic for the ISA modes to some extent.

    For IF, it is a case of MUX'ing the bits for the low order part of PC,
    then feeding them through a "casez()" to arrive at the target length.

    For XG3 and RV64GC, the IsWEX flag would instead be provided external to
    the length-determination module by the superscalar logic. As noted, this
    only works for 32-bit aligned instructions (always 0 if misaligned).

    Here, one sub-module checks for register dependencies, and the other
    checks for which instructions are allowed in which context. These are
    used to determine a virtual WEX bit.

    Typically, then, the IsJX and IsWEX bits are OR'ed together to get an
    IsWJX bit, which is what is used during IF.

    Implicitly, this adds another constraint based on PC(3:2) for
    superscalar operation:
    00: 1-3 wide bundle
    01: 1-3 wide bundle
    10: 1/2 wide bundle
    11: scalar only
    Though, this restriction is N/A for jumbo prefixes.

    Superscalar can't infer across cache line boundaries as it doesn't
    necessarily know what exists in the following cache line. This situation
    would be less bad with 32B cache lines, but then you would need twice as
    much logic for the superscalar checks. Similar problem if trying to deal
    with misaligned instructions.

    Also, trying to deal with RVC here would make it "kinda evil". Currently
    the register-check logic only deals with RVI/RVI or XG3/XG3 pairs. Had
    experimented with RVI/XG3 pairs, but this added a fair bit of additional
    cost (and wasn't worth it, cheaper to assume "sparse mixing" of the encodings).

    More analysis would be needed to try to formalize the cost curve, but it
    seems to be fairly steep in any case, so the number of possible paths
    (between potential pairs of source and destination register ports) needs
    to be kept as small as possible. Which in this case, was best served by limiting things to fixed-length aligned-only and keeping RVI and XG3 instructions separate (in the case of a mixed pair, it always assumes
    that register aliasing may exist).


    Well, for similar reasons, "opcode fusion" as a general solution to ISA
    level inefficiencies (in the CPU) would have a "stupidly bad" cost curve (would likely make normal superscalar look "almost free" in comparison).

    And, there are "less stupidly bad" possibilities, like trying to "hot
    patch" the instruction-sequences at load-time.

    Say, for example, we have an Indexed-Load instruction in the CPU, and a
    "PNOP" (Special NOP designated to have an 0-cycle latency, vs the
    implied 1-cycle cost for a normal NOP).

    Then, say, the program loader hot patches an SLLI+ADD+LD sequence into PNOP+PNOP+LD_Ix.

    But, does still mean the CPU needs to have the instruction, and there is
    still a potential non-zero cost to the PNOPs (unless as a hack they
    actively behave like they were a WEX'ed NOP; in which case it might be
    illegal to have more than 2 PNOPs in a row on a 3-wide machine, ...).

    ...


    Though, even with everything, superscalar might still be a better
    general option than my older explicit WEX tagging system (say, by
    allowing 2 and 3 wide implementations to share the same binaries without
    a potentially steep performance penalty of needing to fall back to
    scalar operation in the case of a pipeline width mismatch).


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Sep 3 22:42:39 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    However, I also found that STs need an immediate and a displacement, so, Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.

    But it also doesn't need two immediates.
    A 16-bit integer or float and a 16-bit offset packed into a
    single 32-bit immediate would suffice for most purposes.
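
    A trivial sketch of that packing (names hypothetical): a 16-bit compare
    value and a 16-bit branch offset in one 32-bit immediate:

    #include <stdint.h>

    static uint32_t cb_pack(int16_t cmpval, int16_t off)
    {
        return ((uint32_t)(uint16_t)cmpval << 16) | (uint16_t)off;
    }
    static int16_t cb_cmpval(uint32_t imm) { return (int16_t)(imm >> 16); }
    static int16_t cb_offset(uint32_t imm) { return (int16_t)(imm & 0xFFFF); }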


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 4 00:20:03 2025
    From Newsgroup: comp.arch

    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a displacement, so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can affect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
    comparing against 0 (which can use the Zero Register). Only a relative minority being compares against non-zero constants.


    One could argue:
    This is high enough to care, but is it cheap enough?...


    I had experimented with this before, where if a Jumbo-Op64 prefix was
    used, and if flagged to synthesize an immediate for a conditional branch
    or store, it would do a Branch-With-Immediate or Store-With-Immediate.

    Applies to all ISAs, though with separate configuration flags for the
    RV+Jx and XG1/2/3 ISAs (mostly separately enabling or disabling the
    added routing). Both depend on mostly the same plumbing internally though.


    Though, checking:
    My CPU core was already paying for it, as I had left it enabled it
    seems. Whatever the case, the added cost wasn't high enough for me to
    bother disabling it again.

    Testing (in Vivado, turning it off again):
    Cost delta: ~ 300 LUTs.
    OK, so logic cost delta is borderline negligible it seems.


    In the case of its support in RV+Jx, it basically implements a 17-bit
    sign-extended immediate with a 12-bit displacement.

    Note that using a Jumbo_Imm prefix won't work, as this would instead
    extend the displacement to 33 bits. So, it needs to be a Jumbo_Op64 prefix.




    Enabling it again in the Doom build for RV64GC+Jx, has a slight negative effect on code-density.

    Though, this stands to reason:
    JOp64+Bcc needs 8 bytes;
    C.LI+Bcc needs 6 bytes.

    One can try to ram it into a 32-bit encoding, but then what it actually achieves in terms of hit rate is low enough to make it negligible.


    Effects on Doom:
    No obvious change in framerates;
    Average trace length gets 1.7% shorter.


    Theoretically, also might make it 1.7% faster, but 1.7% is below the
    threshold of what is easily seen in average Doom framerate.

    Trying "-timedemo demo1" with Doom:
      XG2                      : 1710 gametics / 1588 realtics
      XG3                      : 1710 gametics / 1605 realtics
      RV64GC+Jx (BccImm=false) : 1710 gametics / 1882 realtics
      RV64GC+Jx (BccImm=true ) : 1710 gametics / 1897 realtics
      RV64GC (plain)           : 1710 gametics / 2387 realtics

    So, plain RV64GC being ~ 50% slower than XG2 or XG3 in this test...

    Granted, had been working on and off on making RV64G support "less
    terrible" (it used to be slower). Note it isn't just about BGBCC being
    slow with plain RV64; GCC output is also kinda slow.


    And, Bcc+Imm seemingly makes it very slightly slower somehow (despite
    making the average trace length slightly shorter...).

    Then remembers another downside:
    Bcc+Imm doesn't work with the branch predictor;
    So, the branches cost more cycles.

    So, rough estimate:
    ~ 1.7 to 1.9% faster assuming it were supported by branch predictor;
    ~ or, 0.8% slower without branch predictor support.


    So, it may not be an issue of whether or not to support it, but rather
    whether or not to add support for dealing with it to the branch predictor.


    But it also doesn't need two immediates.
    A 16-bit integer or float and a 16-bit offset packed into a
    single 32-bit immediate would suffice for most purposes.


    This is one possible way:
    Produce a single immediate and then split it post-decode (such as in the
    RF stage).


    Can note in my case, I just sort of awkwardly routed a second immediate
    output out of the Lane 1 decoder, which could then optionally replace
    the Lane 3 immediate, with a special case where Lane 1 could then pull
    the immediate from Lane 3 (where, in this case, Lane 3 is often used
    when we need spare register ports or another immediate field; probably
    more often than for actual instructions, *1).

    I added both Branch-with-Immediate and Store-with-Immediate at the same
    time, as both effectively needed the same mechanism.


    *1: Where, Lane1 generally only sees traffic in the case of:
    ALU | ALU | ALU
    Or:
    ALU | ALU | Load
    Or similar.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Sep 4 10:33:12 2025
    From Newsgroup: comp.arch

    On Sun, 31 Aug 2025 06:17:07 +0000, John Savard wrote:

    Instead of a block header being used to indicate code consisting of variable-length instructions, variable-length instructions would be
    contained within a sequence of pairs of 32-bit instructions of this
    form:

    11110xx(17 bits)(8 bits)
    11111x(9 bits)(17 bits)

    Making the basic unit 17 bits without externally indicating the
    instruction length prevents a full 17-bit short instruction.

    As this is inadequate, I considered abandoning the dream of Concertina III.

    However, if I use both major chunks of opcode space formerly used for
    headers or other items to be dropped, I can do this:

    101xx(18 bits)(9 bits)
    1111x(9 bits)(18 bits)

    which means, though, that I now have to use a longer prefix portion on the instructions with immediates, and this may possibly prevent them from
    working out properly.

    I also have 111001 available, a prefix of only 6 bits. If I use that for immediates, how will it work out?

    111010(10 bits)(16 bits)
    111011(10 bits)(16 bits)

    ...20 bits of instruction, for 32 bits of immediate. Since the instruction proper needs 7 bits of opcode plus 5 bits of destination register, that
    leaves 8 bits to distinguish the instruction from other kinds of
    instruction with these prefixes. So far, so good.

    111010(10 bits)(16 bits)
    111011(2 bits)(24 bits)
    111011(2 bits)(24 bits)

    Fitting a 64-bit immediate into three words (rather than four) is also
    still doable. It takes 1/4 of the available opcode space - but that's OK, because nothing else has a similar problem, not 48-bit immediates, and not 128-bit immediates.

    The only thing I do lose is being able to also have, as I had only very recently introduced to Concertina II, the use of the 64-bit immediate structure to have memory-reference instructions with 64-bit absolute addresses.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 4 13:19:29 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a displacement, so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can affect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or comparing against 0 (which can use the Zero Register). Only a relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.
    Possibly those VAX Bcc's that did use ALU side-effect flags came from assembler code.

    One could argue:
    This is high enough to care, but is it cheap enough?...

    The instruction fetch buffer has to be larger as the worst case size
    just got larger. And there are more format variations so the Parser
    gets more complex. And Decode has to pick apart the two immediates
    and place them in different fields so more muxes.

    Each front end uOp lane would have two immediate fields, one for an
    integer or float data value up to 8 bytes, one for up to 8 byte offset.
    Then at Dispatch (hand-off to the back end) muxes to route each
    immediate onto the FU operand bus.

    The difference comes in the back end Reservation Stations.
    If they are valued RS then the immediates are held just like
    any other operand values that were ready at time of Dispatch.
    The number of operands doesn't change so no extra cost here.

    But if they are valueless RS then it has no place to hold those
    immediates so it needs some place to stash them until the uOp launches.
    In that case it might be better if Decode took all the immediates and
    stash them in a circular buffer and just passed the indexes in the uOp.
    Then at launch the FU would pull in the immediates
    just like it pulls in the register operand values.
    This gets rid of the extra front end costs.
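
    A sketch of such a circular immediate buffer (sizing arbitrary, names
    hypothetical; wrap-around/overwrite hazards are ignored here):

    #include <stdint.h>

    #define IMMQ_SIZE 64u                      /* power of two */
    static uint64_t immq[IMMQ_SIZE];
    static unsigned immq_head;

    static unsigned immq_push(uint64_t v)      /* at Decode */
    {
        unsigned idx = immq_head++ & (IMMQ_SIZE - 1);
        immq[idx] = v;
        return idx;                            /* index carried in the uOp */
    }

    static uint64_t immq_pull(unsigned idx)    /* at FU launch */
    {
        return immq[idx & (IMMQ_SIZE - 1)];
    }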



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Sep 4 10:30:56 2025
    From Newsgroup: comp.arch

    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a displacement,
    so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can affect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
    comparing against 0 (which can use the Zero Register). Only a relative
    minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of zero?
    I expect these to be a significant enough percentage to skew your analysis.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 4 15:06:21 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can affect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
    comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of zero?
    I expect these to be a significant enough percentage to skew your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Sep 4 13:20:45 2025
    From Newsgroup: comp.arch

    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has
    the size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can affect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
    comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution,
    1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks. That's interesting. So perhaps ~15% of all compares are to
    zero. I would have expected higher.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 4 20:43:21 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has
    the size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution,
    1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks. That's interesting. So perhaps ~15% of all compares are to
    zero. I would have expected higher.

    Kinda depends on the compilers used for the workload. I suspect
    that those workloads were mostly COBOL and FORTRAN and maybe BLISS-32
    or MACRO-32.

    Without a heavy C or C++ workload, the need to check for NULL
    pointer is rare.

Unlike Unix, the successful return status from VMS system and library
calls was SS$_NORMAL, which had the value 1 rather than zero; that
would probably also reduce the use of TST to check for zero.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Sep 4 14:03:19 2025
    From Newsgroup: comp.arch

    On 9/4/2025 1:43 PM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution,
1982

That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks. That's interesting. So perhaps ~15% of all compares are to
    zero. I would have expected higher.

    Kinda depends on the compilers used for the workload. I suspect
    that those workloads were mostly COBOL and FORTRAN and maybe BLISS-32
    or MACRO-32.

    Without a heavy C or C++ workload, the need to check for NULL
    pointer is rare.

    Unlike Unix, successful returns from VMS system and library calls
    was SS$_NORMAL, which had the value 1 rather than zero, which
    would probably also reduce the uses of TST to check for zero.


    Good call.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Fri Sep 5 01:28:00 2025
    From Newsgroup: comp.arch

    On Thu, 04 Sep 2025 10:33:12 +0000, John Savard wrote:

    111010(10 bits)(16 bits)
    111011(2 bits)(24 bits)
    111011(2 bits)(24 bits)

    Fitting a 64-bit immediate into three words (rather than four) is also
    still doable. It takes 1/4 of the available opcode space - but that's
    OK, because nothing else has a similar problem, not 48-bit immediates,
    and not 128-bit immediates.

The only thing I lose is something I had only recently introduced to
Concertina II: the use of the 64-bit immediate structure for
memory-reference instructions with 64-bit absolute addresses.

    If 101 says "start of two-word variable-length instruction area", and
    11101 says "start of long instruction with displacement", 1111 can just
    say "don't decode, continuation of instruction" for both kinds of
    instruction.

    With this, then:

    11101(11 bits)(16 bits)
    1111(4 bits)(24 bits)
    1111(4 bits)(24 bits)

    do we have enough bits?

    An operate instruction with a 64-bit immediate, as noted, just needs 12
    bits; seven for the opcode, five for the destination register.

    A memory-reference instruction with a 64-bit displacement needs five bits
    for the opcode, five bits for the destination register, and three bits for
    the index register. That's 13 bits. 11 plus 4 plus 4 is 19 bits, so there
    are extra bits for distinguishing between the two types of instruction.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 4 21:53:59 2025
    From Newsgroup: comp.arch

    On 9/4/2025 3:20 PM, Stephen Fuld wrote:
    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution,
    1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks.  That's interesting.  So perhaps ~15% of all compares are to zero.  I would have expected higher.


    Looking at some stats generated by my compiler (for branches):
    61% of branches are unconditional
    15% are comparing to 0
    13% are comparing two registers
    11% are comparing to some other non-zero constant.

    ...


    It would be a bit higher than 11% if one only counts conditional
    branches (it being around 28% of the conditional branches).

    Where the unconditional branch count includes:
    BRA/BSR/JAL
    RTS/RET/JMP/JSR/JALR
    In XG1/XG2, RTS is its own instruction.
    In XG3 and RV, RTS (or RET) transforms into JALR.
    JMP and JSR also map to JALR in RV and XG3.
    Note that JALR doesn't exist in XG1 or XG2.
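
For reference, the standard RISC-V lowerings of those pseudo-instructions
(per the usual assembler conventions):

    ret          ->  jalr x0, 0(x1)    # return via ra, discard the link
    jr   t0      ->  jalr x0, 0(t0)    # plain indirect jump
    jalr t0      ->  jalr x1, 0(t0)    # indirect call, link into ra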


    As noted, there is an average of around 6 instructions in each trace
    (before it terminates with a branch).

    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 4 23:57:26 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution,
    1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks. That's interesting. So perhaps ~15% of all compares are to
    zero. I would have expected higher.

Oh no, I didn't mean that. I meant that a compiler that wanted to
compare with zero would use a TST instruction, not a CMP.
That could be used for any branch: GT, GE, LE, LT, EQ, NE.
And that is < 2%.

    It would use CMP to compare with a number other than 0, which was 10-12%.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Sep 4 21:41:29 2025
    From Newsgroup: comp.arch

    On 9/4/2025 7:53 PM, BGB wrote:
    On 9/4/2025 3:20 PM, Stephen Fuld wrote:
    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional
    or comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler
    Execution, 1982

That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks.  That's interesting.  So perhaps ~15% of all compares are to
    zero.  I would have expected higher.


    Looking at some stats generated by my compiler (for branches):
     61% of branches are unconditional
     15% are comparing to 0
     13% are comparing two registers
     11% are comparing to some other non-zero constant.


    So ~39% of branches are conditional, and 15% compare to zero. So
    (15/39) ~38% of conditional branches are comparing to zero. That is
    more in line with what I had expected.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Sep 4 21:43:22 2025
    From Newsgroup: comp.arch

    On 9/4/2025 8:57 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional
    or comparing against 0 (which can use the Zero Register). Only a
    relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler
    Execution, 1982

That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks.  That's interesting.  So perhaps ~15% of all compares are to
    zero.  I would have expected higher.

Oh no, I didn't mean that. I meant that a compiler that wanted to
compare with zero would use a TST instruction, not a CMP.
That could be used for any branch: GT, GE, LE, LT, EQ, NE.
And that is < 2%.

    It would use CMP to compare with a number other than 0, which was 10-12%.

    I understand. I was using "compare" in the general sense, not just the
    use of the CMP instruction.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 5 02:16:17 2025
    From Newsgroup: comp.arch

    On 9/4/2025 11:41 PM, Stephen Fuld wrote:
    On 9/4/2025 7:53 PM, BGB wrote:
    On 9/4/2025 3:20 PM, Stephen Fuld wrote:
    On 9/4/2025 12:06 PM, EricP wrote:
    Stephen Fuld wrote:
    On 9/4/2025 10:19 AM, EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a
    displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
         std    #4607182418800017408,[r3,r2<<3,96]
    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
  High enough frequency/etc, is where the possible debate lies.

    Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional
or comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.

The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler
    Execution, 1982

That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    OK, but does this tell you how many of the CMPs are to a value of
    zero? I expect these to be a significant enough percentage to skew
    your analysis.

    Looking at
    Measurement and Analysis of Instruction Use in VAX 780, 1982

    VAX had a TST instruction which was the same as CMP src,#0.
    TST has < 2% usage while CMP 10-12%.

    Thanks.  That's interesting.  So perhaps ~15% of all compares are to
    zero.  I would have expected higher.


    Looking at some stats generated by my compiler (for branches):
      61% of branches are unconditional
      15% are comparing to 0
      13% are comparing two registers
      11% are comparing to some other non-zero constant.


    So  ~39% of branches are conditional, and 15% compare to zero.  So
    (15/39) ~38% of conditional branches are comparing to zero.  That is
    more in line with what I had expected.


    Yes.

    Basically, it also means:
    72% of conditional branches are already addressed by the existing "Bcc
    2R, Disp" instructions (in RISC-V or similar).




    28% of conditional branches could "maybe" use a combined Bcc-Imm
    instruction (say, if added to RV or similar).

But then, the issue is how to best encode it:
  64-bit encoding:
    Kinda meh;
    Doesn't help anything with code density;
    Only maybe helps with performance (if branch-predictor supports it);
    Delta is small.
  32-bit encoding options:
    Burn one of the User blocks;
      Huawei went this way.
      But, too much encoding space.
    Shrink the displacement from 12 to 9 bits.
      Would have 3 bits for compare operator.
      This would have been my choice.
      Does mean a new reloc type and decode/branch-predictor annoyance.
      Most short loops/etc are smaller than 512 bytes.
    Do a 32-bit op with a 3 bit register field.
      Very poor option IMHO.
    Do a 32-bit op with a 3 bit immediate field.
      Wasn't considered at the time, but wouldn't be terribly useful.
    Only do BEQ and BNE.
      Kinda meh (doesn't cover much, but at least usable).

They went with "BEQI/BNEI Rs1, Imm5, Label", which is kinda meh.
    Personally I would have assumed using these spots for BTST/BNTST.
    But, the TST operator has no precedent in RISC-V.

    But, as I see it, the rough ranking of comparison operators seems to be:
    BEQ/BNE
    BLT/BGE
    BTST/BNTST (N/A in RISC-V, but exists in XG2 and XG3)
    BLTU/BGEU

    Where, likely, a BTST/BNTST would have had a slightly higher average hit
    rate than a BEQI/BNEI. But, when I looked at it before, it seemed like
    it was pretty close either way, and of these, adding ((A&B)==0) is more
    likely to have had a higher logic cost.


    Though, ironically, BTST and BNTST are the more likely ops to use an immediate...
    Mostly as:
    "if(x&0x10) { ... }"
    Being a fairly common idiom.

    Though, for larger masks:
    if((x>>47)&1) { ... }
    Typically being slightly preferable to:
    if(x&0x0000800000000000ULL) { ... }
Mostly as, on average, the shift is cheaper than the large constant
(particularly on RISC-V), though little says a compiler can't turn the
latter into the former.
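
A concrete RV64 sketch of why (illustrative only; register choices are
arbitrary):

    # if (x & 0x0000800000000000ULL) { ... }  -- constant form
    lui   t0, 0x800          # t0 = 0x00800000
    slli  t0, t0, 24         # t0 = 0x0000800000000000
    and   t0, a0, t0
    bnez  t0, taken          # 4 instructions

    # if ((x >> 47) & 1) { ... }  -- shift form
    srli  t0, a0, 47
    andi  t0, t0, 1
    bnez  t0, taken          # 3 instructions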


    Though, I am left to suspect that BEQI/BNEI might not have been the best choice, as while BEQ/BNE are the two most common cases, they are also
    the most likely to compare against X0. If one eliminates BEQ and BNE
    where one of the operands is 0, then BLT/BGE would move into first place
    in the rankings.


I had just gone with a 64-bit encoding and kinda went "meh", as there was
seemingly no good way to make it particularly compelling.

So, I had usually preferred to focus more on features which have a more
obvious benefit.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 09:33:06 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont
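
A minimal sketch of such a rule over a toy three-address IR (the IR types
and names here are invented for illustration, not from any real compiler):

    /* toy IR: some ops set the condition flags as a side effect */
    typedef enum { OP_ADD, OP_TST, OP_BGE /* ... */ } Opcode;
    typedef struct { Opcode op; int dst, src1, src2; } Insn;

    /* does this op leave the flags as if "TST dst" had just run? */
    static int sets_flags_like_tst(const Insn *i) {
        return i->op == OP_ADD;  /* extend with SUB, AND, ... as the ISA allows */
    }

    /* peephole: { flag-setting op writing r ; TST r } => drop the TST */
    void drop_redundant_tst(Insn *code, int *n) {
        int w = 0;
        for (int r = 0; r < *n; r++) {
            if (code[r].op == OP_TST && w > 0 &&
                sets_flags_like_tst(&code[w - 1]) &&
                code[w - 1].dst == code[r].src1)
                continue;        /* flags already reflect this value */
            code[w++] = code[r];
        }
        *n = w;
    }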

    When I compile

long foo(long a, long b)
{
    if (a+b<0)
        return a-b;
    else
        return a*b;
}

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Sep 5 11:00:55 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
    if (a+b<0)
    return a-b;
    else
    return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton

    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
In this case it wouldn't make a difference, but in general I think they
would want the freedom to move code about and not have the ADD bound to
the Bcc too early, so this would have to be nearly the last optimization,
so that it doesn't interfere with code motion.

    The Microsoft compiler uses LEA to do the add which doesn't change flags
    so even if it has a flags optimization it would not detect it:

    long foo(long,long) PROC ; foo, COMDAT
    lea eax, DWORD PTR [rcx+rdx]
    test eax, eax
    jns SHORT $LN2@foo
    sub ecx, edx
    mov eax, ecx
    ret 0
    $LN2@foo:
    imul ecx, edx
    mov eax, ecx
    ret 0

    Also if MS had moved ecx to eax first as GCC does then it could have
    the function result land in eax and eliminate the final two MOV eax,ecx.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 5 15:51:13 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
    if (a+b<0)
    return a-b;
    else
    return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton

    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.

foo:
    ADD   R3,R1,R2
    PLT0  R3,TF
    ADD   R1,R1,-R2
    MUL   R1,R1,R2
    RET

    5 inst versus 8.

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference but in general
    I think they would want the freedom to move code about and not have
    the ADD bound to the Bcc too early so this would have to be about
    the very last optimization so it didn't interfere with code motion.

    The Microsoft compiler uses LEA to do the add which doesn't change flags
    so even if it has a flags optimization it would not detect it:

    long foo(long,long) PROC ; foo, COMDAT
    lea eax, DWORD PTR [rcx+rdx]
    test eax, eax
    jns SHORT $LN2@foo
    sub ecx, edx
    mov eax, ecx
    ret 0
    $LN2@foo:
    imul ecx, edx
    mov eax, ecx
    ret 0

    5 versus 9

    Also if MS had moved ecx to eax first as GCC does then it could have
    the function result land in eax and eliminate the final two MOV eax,ecx.

    still 5 versus 7
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 16:13:47 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    When I compile

    long foo(long a, long b)
    {
    if (a+b<0)
    return a-b;
    else
    return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret
    ...
    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.

    Yes, I often see more register-register moves in gcc-generated code
    than necessary.

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference but in general
    I think they would want the freedom to move code about and not have
    the ADD bound to the Bcc too early so this would have to be about
    the very last optimization so it didn't interfere with code motion.

    Yes, possible. When I look at what clang-14.0.6 -O -c produces, it's
    this:

    0000000000000000 <foo>:
    0: 48 89 f9 mov %rdi,%rcx
    3: 48 29 f1 sub %rsi,%rcx
    6: 48 89 f0 mov %rsi,%rax
    9: 48 0f af c7 imul %rdi,%rax
    d: 48 01 fe add %rdi,%rsi
    10: 48 0f 48 c1 cmovs %rcx,%rax
    14: c3 ret

    clang seems to prefer using cmov. The interesting thing here is that
    it puts the add right in front of the cmovs, after the code for "a-b"
    and "a*b". When I do

long foo(long a, long b)
{
    if (a+b*111<0)
        return a-b;
    else
        return a*b;
}

    clang produces this code:

    0000000000000000 <foo>:
    0: 48 6b ce 6f imul $0x6f,%rsi,%rcx
    4: 48 89 f8 mov %rdi,%rax
    7: 48 29 f0 sub %rsi,%rax
    a: 48 0f af f7 imul %rdi,%rsi
    e: 48 01 f9 add %rdi,%rcx
    11: 48 0f 49 c6 cmovns %rsi,%rax
    15: c3 ret

    I.e., rcx=b*111 is first, but a+rcx is late, right before the cmovns.
    So it seems to have some mechanism for keeping the add and the
    cmov(n)s as one unit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Sep 5 21:02:26 2025
    From Newsgroup: comp.arch

    EricP wrote:
    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

    However, I also found that STs need an immediate and a displacement,
    so,
    Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
             std    #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
      High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
    comparing against 0 (which can use the Zero Register). Only a relative
    minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't know about those side effect flags and will always emit a CMP or TST first.
I know I have seen lots of examples of x86 compilers which use the side
effect flags; they are pretty much the standard idiom for decrementing
loops, or for incrementing from a negative start. The latter case is a
common optimization which allows you to use the same register as the
source index/indices and the destination index, along with the loop
counter itself.
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 5 19:09:36 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or comparing against 0 (which can use the Zero Register). Only a relative minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    About 25% = (12%-9%)/12% use ALU CCs.

I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX Bcc using ALU side effect flags were assembler.

    VAX had "more regular" settings of ALU CCs than typical CISCs.
    This regularity made it easier for the compiler to track.

On the other hand:: a gain of 25%*12% = 3% would not have allowed CCs
to "make the cut" for RISC ISA designs.

    One could argue:
    This is high enough to care,

borderline

    but is it cheap enough?...

    not for me as it causes RoB/RETIRE to do a lot more work.
    It does require Forwarding to do more work;
    It may also cause DECODE to do more work.

    The instruction fetch buffer has to be larger as the worst case size
    just got larger. And there are more format variations so the Parser
    gets more complex. And Decode has to pick apart the two immediates
    and place them in different fields so more muxes.

    Each front end uOp lane would have two immediate fields, one for an
    integer or float data value up to 8 bytes, one for up to 8 byte offset.
    Then at Dispatch (hand-off to the back end) muxes to route each
    immediate onto the FU operand bus.

    The difference comes in the back end Reservation Stations.
    If they are valued RS then the immediates are held just like
    any other operand values that were ready at time of Dispatch.

In my designs, the value-capturing RS does not need the ST.data value
    until after it has write permission, so instead of capturing this in
    the RS early, I put a reservation on the location in IB, and move the
    immediate to RS after AGEN (and before RETIRE). This prevents excess
    RS operand capture flip-flops.

    The number of operands doesn't change so no extra cost here.

    If you capture ST.data early it does.

    But if they are valueless RS then it has no place to hold those
    immediates so it needs some place to stash them until the uOp launches.

    Here, the IB is the obvious place to store them until use.

    In that case it might be better if Decode took all the immediates and
    stash them in a circular buffer and just passed the indexes in the uOp.
    Then at launch the FU would pull in the immediates
    just like it pulls in the register operand values.
    This gets rid of the extra front end costs.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 5 17:05:47 2025
    From Newsgroup: comp.arch

    On 9/5/2025 2:09 PM, MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
    comparing against 0 (which can use the Zero Register). Only a relative
    minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    About 25% = (12%-9%)/12% use ALU CCs.

I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX Bcc using ALU side effect flags were assembler.

    VAX had "more regular" settings of ALU CCs than typical CISCs.
    This regularity made it easier for the compiler to track.

    On the other hand:: a gain of 25%*12% = 4% would not have allowed CCs
    to "make the cut" for RISC ISA designs.


    Several major RISCs still had them though:
    ARM
    POWER / PowerPC
    ...


    Generalized ALU CC's kinda suck though.

    In their "best case" they are kinda neutral.
    Want to write an emulator, or want to make the CPU superscalar, and CC's
    will kinda ruin the day.

    The 1-bit T/F status flag was at least "less bad" than full CC's:
    Only modified by certain instructions, for which modifying the flag was usually their primary purpose;
    Since only modified infrequently, and only used as inputs to certain
    classes of instructions (such as those which have been marked as
    conditional), it is less of an issue to manage it in the pipeline.
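
For reference, the SuperH-style pattern (which BJX2 inherits) looks
roughly like this (SH mnemonics; illustrative):

    CMP/GT  R4,R5      ; T = (R5 > R4)
    BT      label      ; branch if T is set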


    One could argue:
    This is high enough to care,

    boarder line

    but is it cheap enough?...

    not for me as it causes RoB/RETIRE to do a lot more work.
    It does require Forwarding to do more work;
    It may also cause DECODE to do more work.


    I was originally writing this in the context of Branch-with-Immediate instructions, which AFAIK/IIRC My66000 already has in some form?...


    Some RISC-V people had already been considering this, and some RISC-V
    SoC's apparently already have a variant of this (as custom extensions).


    I am not entirely sure where the stuff related to CC's came from, but alas.

    I was not arguing for having CC's in any case.



    But, yeah, as noted, the main added costs for Branch-with-Immediate and Store-with-Immediate are:
    Decoder needs to produce a second immediate output;
    It needs to be routed to Lane3 or similar;
    We need another pseudo-register, and the logic to fetch an immediate
    from Lane3.

    Where, as noted:
    In the BJX2 Core, the handling of immediate values is mostly done by
    using pseudo registers.

    A few examples of pseudo registers:
    ZZR : Zero
    IMM : Returns immediate associated with current lane.
    JIMM: Returns immediate from gluing the Lane 1/2 immediate together;
    PC : Returns PC of following instruction
    BPC : Returns PC of current instruction
    IMMB: Returns immediate from Lane 3
    ...

    This means I don't need separate logic internally for Register and
    Immediate forms of instructions, as all instructions can be implemented
    as register forms. Partial exception is a few cases like the FPU
    assuming that the Immediate field is used to route the current rounding
    mode and similar.


    <snip, not much to say>


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 5 23:39:52 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:

However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
    potential displacement (from D12ds above) and the immediate has the
    size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]

    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.


    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
    comparing against 0 (which can use the Zero Register). Only a relative
    minority being compares against non-zero constants.

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    About 25% = (12%-9%)/12% use ALU CCs.

I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX Bcc using ALU side effect flags were assembler.

    VAX had "more regular" settings of ALU CCs than typical CISCs.
    This regularity made it easier for the compiler to track.

    It also had AOB<cc> and SOB<cc> which combined the branch
    with the increment/decrement operation.

100$:   MOVAL   ERTAB,R1            ;ADDRESS OF TABLE
        CLRL    R2                  ;COUNT
101$:   CMPL    (R1)+[R2],4(R0)     ;LOOK FOR A MATCH
        BEQL    108$                ;BRANCH IF FOUND
        AOBLEQ  S^#ERNM,R2,101$     ;LOOP TILL DONE
105$:   MOVZWL  #SS$_RESIGNAL,R0    ;DON'T WANT TO KNOW IT
        RET                         ;GIVE BACK TO SYSTEM
108$:   MOVL    (R1)[R2],R0         ;GET ADDRESS OF DESCRIPTOR
        MOVQ    (R0),R4             ;GET DESCRIPTOR
        BRB     PRERLN              ;AND PRINT
;

    (fragment from the VAX FOCAL interpreter).
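
Rendered as rough C (names mirror the fragment; the descriptor handling
is elided):

    for (r2 = 0; r2 <= ERNM; r2++)      /* CLRL + AOBLEQ */
        if (ertab[r2] == key)           /* CMPL + BEQL */
            goto found;                 /* 108$ */
    return SS$_RESIGNAL;                /* 105$ */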

    An interesting note in the aforementioned analysis is why
    the call instruction was so expensive in time - the 780 cache
    was write-through, so the multiple stores would be limited
    to DRAM speeds.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 6 14:03:04 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    An interesting note in the aforementioned analysis is why
    the call instruction was so expensive in time - the 780 cache
    was write-through, so the multiple stores would be limited
    to DRAM speeds.

    But do you need fewer stores if you use simpler instructions? Did the
    C compiler that used BSR etc. to implement a call store less? How so?

Also, the DRAM speed is three cycles. CALL/RET took an average of 45
cycles each. RET does not store. So if most of the cost is storing and
loading, and, say, each instruction has 10 cycles of overhead (which
would already be a lot), that's 90 cycles for a call and a ret, and 70
cycles of that for n stores and n loads. With stores taking 3 cycles
and loads taking 1 (the stack stuff is usually in the cache), solving
3n + n = 70 gives n = 17.5. But VAX has only 16 registers (including
PC), and not every one of them is saved on every call. So there were
additional overheads.

    With good support for making full use of the cache read bandwidth, the
    loading part could be sped up to two loads per cycle. But I expect
    that the VAX 11/780 did not do that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Sep 6 16:29:36 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    scott@slp53.sl.home (Scott Lurndal) writes:
    An interesting note in the aforementioned analysis is why
    the call instruction was so expensive in time - the 780 cache
    was write-through, so the multiple stores would be limited
    to DRAM speeds.

    But do you need fewer stores if you use simpler instructions? Did the
    C compiler that used BSR etc. to implement a call store less? How so?

    Also, the DRAM speed is three cycles.

Is this a serial 3-cycles::

| route | DRAM | route |
                        | route | DRAM | route |

or a pipelineable 3-cycles::

| route | DRAM | route |
        | route | DRAM | route |
                | route | DRAM | route |

    It makes a big difference.

    CALL/RET took an average 45
    cycles.

    15-registers × 3-cycles

    RET does not store. So if most of the cost is storing and
    loading, and, say, each instruction has 10 cycles overhead (which
    would already be a lot), that's 90 cycles for a call and a ret, and 70
    cycles of that for n stores and n loads. With stores taking 3 cycles
    and loads taking 1 (the stack stuff is usually in the cache),
    n=17.5. But VAX has only 16 registers (including PC), and not every
    one of them is saved on every call. So there were additional
    overheads.

    It seems to me that pipelining of DRAM would have dramatically helped.
    Or making a write-back cache would have also helped immensely.

    With good support for making full use of the cache read bandwidth, the loading part could be sped up to two loads per cycle. But I expect
    that the VAX 11/780 did not do that.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Sep 6 13:46:09 2025
    From Newsgroup: comp.arch

    Terje Mathisen wrote:
    EricP wrote:

    The only instruction usage stats I have are from those VAX papers:
    A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982

    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers
    don't
    know about those side effect flags and will always emit a CMP or TST
    first.

    I know I have seen lots of examples of x86 compilers which used side
    effect flags, they are pretty much the standard idiom for decrementing
    loops or incrementing from negative start. The latter case is a common optimization which allows you to use the same register as the source index/indices and the destination index, along with the loop counter
    itself.

    Terje

    My mistake. I haven't seen the MS C compiler do this and erroneously
    assumed without checking that no one does it, perhaps due to portability
    issues particularly for x86 with its asymmetric flags updates.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Sep 8 14:52:08 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]
    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.

    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
    Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
    The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.
    About 25% = (12%-9%)/12% use ALU CCs.

I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX Bcc using ALU side effect flags were assembler.
    VAX had "more regular" settings of ALU CCs than typical CISCs.
    This regularity made it easier for the compiler to track.

    It also had AOB<cc> and SOB<cc> which combined the branch
    with the increment/decrement operation.

    100$: MOVAL ERTAB,R1 ;ADDRESS OF TABLE
    CLRL R2 ;COUNT
    101$: CMPL (R1)+[R2],4(R0) ;LOOK FOR A MATCH
    BEQL 108$ ;BRANCH IF FOUND
    AOBLEQ S^#ERNM,R2,101$ ;LOOP TILL DONE
    105$: MOVZWL #SS$_RESIGNAL,R0 ;DON'T WANT TO KNOW IT
    RET ;GIVE BACK TO SYSTEM
    108$: MOVL (R1)[R2],R0 ;GET ADDRESS OF DESCRIPTOR
    MOVQ (R0),R4 ;GET DESCRIPTOR
    BRB PRERLN ;AND PRINT
    ;

    (fragment from the VAX FOCAL interpreter).

    An interesting note in the aforementioned analysis is why
    the call instruction was so expensive in time - the 780 cache
    was write-through, so the multiple stores would be limited
    to DRAM speeds.

AOB Add-One-Branch, SOB Subtract-One-Branch, ACB Add-Compare-Branch,
could be nice single-cycle, single-write-port, risc-ish instructions.
The problem is that the most optimal and frequent formats would
have two or three immediate values.

    AOBcc count_Rsd, limit_Rs, offset_imm
    AOBcc count_Rsd, limit_imm, offset_imm

    SOBcc count_Rsd, limit_Rs, offset_imm
    SOBcc count_Rsd, limit_imm, offset_imm

    ACBcc count_Rsd, addend_Rs, limit_Rs, offset_imm
    ACBcc count_Rsd, addend_Imm, limit_Rs, offset_imm
    ACBcc count_Rsd, addend_Rs, limit_imm, offset_imm
    ACBcc count_Rsd, addend_Imm, limit_imm, offset_imm
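
A rough C-like sketch of the intended semantics (my reading of the
formats above):

    /* AOBcc count_Rsd, limit, offset */
    count_Rsd += 1;
    if (count_Rsd cc limit) goto pc + offset;

    /* ACBcc count_Rsd, addend, limit, offset */
    count_Rsd += addend;
    if (count_Rsd cc limit) goto pc + offset;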

Merging the two 16-bit immediates into one 32-bit field would
suffice for most purposes. The last ACB format packs three 16-bit
immediates into a 48-bit field.

    If the addend or limit operands are constants but do not fit into the
    16-bit field available then one must load the constant into a register.

If the branch offset doesn't fit into 16 bits then one cannot use these
instructions for that loop and must use individual branch instructions.
But it would be pretty rare for a loop to cross more than 32k bytes/words.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Sep 8 20:40:14 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    BGB wrote:
    On 9/3/2025 9:42 PM, EricP wrote:
    MitchAlsup wrote:
    However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
    std #4607182418800017408,[r3,r2<<3,96]
    Compare and Branch can also use two immediates as it
    has reg-reg or reg-imm compares plus displacement.
    And has high enough frequency to be worth considering.

    Can be done, yes.
    High enough frequency/etc, is where the possible debate lies.


    Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
    The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.
    About 25% = (12%-9%)/12% use ALU CCs.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX Bcc using ALU side effect flags were assembler.
VAX had "more regular" settings of ALU CCs than typical CISCs.
    This regularity made it easier for the compiler to track.

    It also had AOB<cc> and SOB<cc> which combined the branch
    with the increment/decrement operation.

    100$: MOVAL ERTAB,R1 ;ADDRESS OF TABLE
    CLRL R2 ;COUNT
    101$: CMPL (R1)+[R2],4(R0) ;LOOK FOR A MATCH
    BEQL 108$ ;BRANCH IF FOUND
    AOBLEQ S^#ERNM,R2,101$ ;LOOP TILL DONE
    105$: MOVZWL #SS$_RESIGNAL,R0 ;DON'T WANT TO KNOW IT
    RET ;GIVE BACK TO SYSTEM
    108$: MOVL (R1)[R2],R0 ;GET ADDRESS OF DESCRIPTOR
    MOVQ (R0),R4 ;GET DESCRIPTOR
    BRB PRERLN ;AND PRINT
    ;

    (fragment from the VAX FOCAL interpreter).

    An interesting note in the aforementioned analysis is why
    the call instruction was so expensive in time - the 780 cache
    was write-through, so the multiple stores would be limited
    to DRAM speeds.

    AOB Add-One-Branch, SOB Subtract-One-Branch, ACB Add-Compare-Branch,
    could be nice single cycle, single write port, risc-ish instructions.
    The problem comes from the most optimal and frequent formats would
    have two or three immediate values.

    AOBcc count_Rsd, limit_Rs, offset_imm
    AOBcc count_Rsd, limit_imm, offset_imm

    SOBcc count_Rsd, limit_Rs, offset_imm
    SOBcc count_Rsd, limit_imm, offset_imm

    ACBcc count_Rsd, addend_Rs, limit_Rs, offset_imm
    ACBcc count_Rsd, addend_Imm, limit_Rs, offset_imm
    ACBcc count_Rsd, addend_Rs, limit_imm, offset_imm
    ACBcc count_Rsd, addend_Imm, limit_imm, offset_imm

    Merging the two 16-bit immediate format into one 32-bit field would
    suffice for most purposes. The last ACB packs three 16-bit immediates
    into a 48-bit field.

I faced the same issue in designing the Virtual Vector Method. My LOOP
instruction contains 3 operand specifiers and a condition field. I could
not allow the front end (read: DECODE) to have to deal with 2 immediates,
so there is a bit used to distinguish at the LOOP Function Unit whether
the arriving 32-bit or 64-bit immediate was increment, comparison, or
both. Thus DECODE does not know if the immediate was split or not, and
remains ignorant of how LOOP consumes the immediate.

    LOOP (010101) uses the same format as FMAC (001101) but a different major OpCode, and the same table describes what DECODE needs to do. There are
    3 flavors of LOOP:

    LOOP1:: for( ; i cnd max; i += inc )
    LOOP2:: for( ; inc && i cnd max; i++ )
    LOOP3:: for( ; inc cnd max; i++ )

    Allowing leaf subroutines from str* and mem* to be fully vectorized.

    Your typical for loop is seen in ASM as::

    VEC Rtemp,{}
    inst
    inst
    inst
    LOOPn EQ,Ri,Rinc,Rmax
    or
    LOOPn LT,Ri,#inc,Rmax
    or
    LOOPn GE,Ri,Rinc,#max
    or
    LOOPn HS,Ri,#inc,#max

Rtemp contains the IP of the top of the loop
{} contains a bit vector of registers that are "live" after the loop
terminates--this is mostly lightly populated, often empty
LT is less than; there are 8 conditions that are allowed
Ri is the loop index variable
?inc is the increment value
?max is the termination value

LOOP performs the ADD-CMP-BC of your typical for loop. A CMP is simply a
SUB, which is an ADD with a negated operand. Thus, one can perform the
ADD with your typical adder, perform the ADD-CMP with a 3-input adder
(1 gate delay longer), and start Fetching from the top of the loop
while the arithmetic is ongoing.
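
In two's complement terms (a sketch of the arithmetic being described;
the flag wiring is my reading of the obvious implementation):

    next_i = i + inc                      // the plain 2-input ADD
    i + inc - max = i + inc + ~max + 1    // the 3-input ADD; its sign and
                                          // zero outputs give the CMP result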
    --------------------------------
    With the encoding, one has 5-bit immediates at no cost,
    32-bit immediate for #inc or #max or 16-bits for both #inc and #max, or
    64-bit immediate for #inc or #max or 32-bits for both #inc and #max; at
    the cost of the immediate attached to the instruction.

    If the addend or limit operands are constants but do not fit into the
    16-bit field available then one must load the constant into a register.

    Isn't it always the case...

    If the branch offset doesn't fit into 16 bits then one cannot use these instructions for that loop and must use individual branch instructions.
    But it would be pretty rare for a loop to cross more that 32k bytes/words.

    This is one reason for the VEC instruction--it denotes the top of the loop
    so the LOOP instruction does not need a BR-displacement. It also means you cannot abuse the LOOP instruction target (i.e., create ASM spaghetti).
    --- Synchronet 3.21a-Linux NewsLink 1.2