• Re: Matmul in VVM

    From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 25 12:00:52 2026
    From Newsgroup: comp.arch

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    C version loop invariant, and cursoring

    #define N 8
    void mm8(double *a, double *b, double *c)
    {
    int i,j,jN,k,kN;
    double *AcijN,*AbkjN,*AaijN;

    for( jN=0; jN<N*N; jN+=N ) {
    AcijN = &c[jN];
    AbkjN = &b[jN];
    for( kN=k=0; k<N; k++,kN+=N ) {
    AaikN = &a[kN];
    bN = AbkjN[k];
    for( i=0; i<N; i++ ) {
    AcijN[i] += AaikN[i] * bN;
    }
    }
    }
    }

    I did get this into:

    mm8:
    ; R1 = &a[0];
    ; R2 = &b[0];
    ; R3 = &c[0]; -------------------------------------------------------------
    MOV RjN,#0 ; R4
    loop1:
    LA RcijN,[Rc,RjN<<3] ; R5
    LA RbkjN,[Rb,RjN<<3] ; R6 -------------------------------------------------------------
    MOV RkN,#0 ; R7
    MOV Rk,#0 ; R8
    loop2:
    LA RaikN,[Ra,RkN<<3] ; R9
    LDD RbN,[RbkjN,Rk<<3] ; R10 -------------------------------------------------------------
    MOV Ri,#0 ; R11
    VEC 8,{}
    loop3:
    LDD Ra,[RaikN,Ri<<3] ; R12
    LDD Rc,[RcijN,Ri<<3] ; R13
    FMAC Rc,Ra,Rb,Rc ; R14
    STD Rc,[RcijN,Ri<<3] ;

    LOOP1 LE,Ri,#1,#8 ; R11 -------------------------------------------------------------
    ADD Rk,Rk,#1 ; R8
    ADD RkN,RkN,#8 ; R7
    CMP Rt,Rk,#8 ; R11
    BLE Rt,loop2 -------------------------------------------------------------
    ADD RjN,RjN,#8 ; R4
    CMP Rt,RjN,#64 ; R7
    BLE Rt,Loop1 -------------------------------------------------------------
    RET

    without needing any preserved registers.

    b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
    64 double reads for b. For a and c are 512 reads of doubles each,
    512 doubles are written for c. Total, 1600 memory access for doubles.

    By comparison, the SIMD code reads 192 doubles and writes 64, the
    minimum, for a total of 256. This is a factor of 6.25.

    It occurs to me that c[*] should be set to zero for a "real" matrix multiply...as is c[*] is both input and output.

    ----------------------------------
    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    double c0 = c[0 + j*N];
    double c1 = c[1 + j*N];
    double c2 = c[2 + j*N];
    double c3 = c[3 + j*N];
    double c4 = c[4 + j*N];
    double c5 = c[5 + j*N];
    double c6 = c[6 + j*N];
    double c7 = c[7 + j*N];
    for (int k=0; k<N; k++) {
    double bk = b[k + j*N];
    c0 += a[0 + k*N] * bk;
    c1 += a[1 + k*N] * bk;
    c2 += a[2 + k*N] * bk;
    c3 += a[3 + k*N] * bk;
    c4 += a[4 + k*N] * bk;
    c5 += a[5 + k*N] * bk;
    c6 += a[6 + k*N] * bk;
    c7 += a[7 + k*N] * bk;
    }
    /* write back c0 to c7 */
    }
    }
    }

    where the loop over k could be vectorized, but that would still
    leave eccessive memory traffic for a.

    ENTER Rc1,Rc8,#0 ; preserve c[1..8]
    MOV RjN,#0 ; R4
    loop1:
    LA Rca,[Rc,RjN<<3] ; &c[1..8]
    LDD Rc1,[Rca,#0] ; R23
    LDD Rc2,[Rca,#8]
    LDD Rc3,[Rca,#16]
    LDD Rc4,[Rca,#24]
    LDD Rc5,[Rca,#32]
    LDD Rc6,[Rca,#40]
    LDD Rc7,[Rca,#48]
    LDD Rc8,[Rca,#56] ; R30

    MOV RkN,#0 ; R5
    ---------------begin vectorize-------------------
    VEC 8,{Rc1..Rc8}
    loop2:
    LDD Rbk,[R2,RjN<<3] ; R6

    LA RakN,[Ra,RkN<<3] ; R7
    LDD Ra1,[RakN,#0] ; R8
    FMAC Rc1,Ra1,Rbk,Rc1 ; R23
    LDD Ra2,[RakN,#8] ; R7
    FMAC Rc2,Ra2,Rbk,Rc2 ; R24
    LDD Ra3,[RakN,#16] ; R7
    FMAC Rc3,Ra3,Rbk,Rc3 ; R25
    LDD Ra4,[RakN,#24] ; R7
    FMAC Rc4,Ra4,Rbk,Rc4 ; R26
    LDD Ra5,[RakN,#32] ; R7
    FMAC Rc5,Ra2,Rbk,Rc6 ; R27
    LDD Ra6,[RakN,#40] ; R7
    FMAC Rc6,Ra2,Rbk,Rc6 ; R28
    LDD Ra7,[RakN,#48] ; R7
    FMAC Rc7,Ra2,Rbk,Rc7 ; R29
    LDD Ra8,[RakN,#56] ; R7
    FMAC Rc8,Ra8,Rbk,Rc8 ; R30

    LOOP1 LE,RkN,#8,#64 ; R4
    ---------------end vectorize-------------------

    ADD RkN,RkN,#8 ; R5
    CMP Rt,RkN,$64 ; R6
    BLE Rt,loop1

    STD Rc1,[Rca,#0]
    STD Rc2,[Rca,#8]
    STD Rc3,[Rca,#16]
    STD Rc4,[Rca,#24]
    STD Rc5,[Rca,#32]
    STD Rc6,[Rca,#40]
    STD Rc7,[Rca,#48]
    STD Rc8,[Rca,#56]

    EXIT Rc1,Rc8,#0
    RET

    46 instructions 19 instructions in vectorized (unrolled) loop.

    c[k] is read once and written once
    b[k] is read 8×
    a[k] is read 8×

    If you are willing to have 64 FMACs in a row; a[k] can be read 2×
    {with very tr1cky register allocation}.

    Using this many registers causes 64 bytes to be written to stack
    and read back later. Solving the a[k] traffic increases the stack
    footprint to 104 bytes.

    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.

    I have been thinking about this and have come up with another potential solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 6600, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by allowing a sort of register indexing equivalent, I think it could make a substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the TT instruction, using the currently unused bit combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now. After execution of that instruction,
    which must not be a control transfer instruction, control is returned to
    the instruction after the TT instruction. This makes the TT instruction behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target address would be eight FMAC instructions, each using a different
    register(s). Of course, I realize that this adds an "extra" instruction execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU, thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is,once the Loop
    instruction has been encountered, you know how many of the executed instructions can fir in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in
    this case it eliminates 7/8, i.e. 87.5% of them.

    So overall, I think this idea reduces the memory traffic cost of keeping
    the A matrix in registers by a huge amount. It also eliminates any
    "mucking around" with the OoO mechanism to handle not knowing which
    registers are involved at instruction decode time that my previous idea had.

    As I said, I am sure there are issues with this. I welcome your comments.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon May 25 19:36:14 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    ----------------------------
    I have been thinking about this and have come up with another potential solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 6600, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by allowing a sort of register indexing equivalent, I think it could make a substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the JTT instruction, using the currently unused bit combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now.

    A minor issue is the out-of-sequence Fetch problem this introduces.
    Not un-workable, but an annoyance. Perhaps Call-through-Table would
    work better--let me think on it.

    After execution of that instruction,
    which must not be a control transfer instruction, control is returned to
    the instruction after the TT instruction. This makes the TT instruction behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target address would be eight FMAC instructions, each using a different register(s).

    This is why a CTT is better, you can perform multiple instructions before returning.

    Of course, I realize that this adds an "extra" instruction execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU,

    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an
    index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop instruction has been encountered, you know how many of the executed instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    So overall, I think this idea reduces the memory traffic cost of keeping
    the A matrix in registers by a huge amount.

    The issue is that there can be no-transfers-of-control* out of a vVM
    loop while remaining IN a vVM loop. {This simplified HW by enormous
    amounts. vVM only vectorizes the innermost loop.

    (*) Predicated flow control is allowed, but branches/calls/SVC are not.

    It also eliminates any
    "mucking around" with the OoO mechanism to handle not knowing which registers are involved at instruction decode time that my previous
    idea had.

    So does register[indexing] and out-of-line instruction execution.

    As I said, I am sure there are issues with this. I welcome your comments.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 25 17:52:51 2026
    From Newsgroup: comp.arch

    On 5/25/2026 12:36 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    ----------------------------
    I have been thinking about this and have come up with another potential
    solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 6600, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by
    allowing a sort of register indexing equivalent, I think it could make a
    substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the JTT instruction, using the currently unused bit
    combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now.

    A minor issue is the out-of-sequence Fetch problem this introduces.
    Not un-workable, but an annoyance. Perhaps Call-through-Table would
    work better--let me think on it.

    I understand. My rationale for using a different ROB value was to allow
    that functionality to be allowed within a VVM loop, whereas the other
    values invoke a jump or call, which you don't want to allow in VVM. In
    other words an indication that this is a small exception to the no
    control transfer rule and should be allowed. But obviously you
    understand the internals better than I do.





    After execution of that instruction,
    which must not be a control transfer instruction, control is returned to
    the instruction after the TT instruction. This makes the TT instruction
    behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target
    address would be eight FMAC instructions, each using a different
    register(s).

    This is why a CTT is better, you can perform multiple instructions before returning.

    I understand. If you want to go there within VVM fine. I was trying to
    avoid that.





    Of course, I realize that this adds an "extra" instruction
    execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do
    anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU,

    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an
    index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop
    instruction has been encountered, you know how many of the executed
    instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in
    this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    ??? I thought you would fetch the "executed" instruction once and have
    it internally for the next seven loop iterations. But I may be wrong.



    So overall, I think this idea reduces the memory traffic cost of keeping
    the A matrix in registers by a huge amount.

    The issue is that there can be no-transfers-of-control* out of a vVM
    loop while remaining IN a vVM loop. {This simplified HW by enormous
    amounts. vVM only vectorizes the innermost loop.

    I agree. This would have to be an exception, which is why I thought a
    unique value for ROP could indicate that.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue May 26 17:48:56 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/25/2026 12:36 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    ----------------------------
    I have been thinking about this and have come up with another potential
    solution. The usual caveats - I am not a hardware designer, and don't
    know the guts of the My 6600, nor am I a numerical analyst, so this may
    have some problems or even be totally unworkable. But if it works, by
    allowing a sort of register indexing equivalent, I think it could make a >> substantial reduction in the memory traffic. There are two parts to
    this proposal.

    First, is to enhance the JTT instruction, using the currently unused bit >> combination of the BOB field to indicate, execute the instruction
    pointed to by the displacement plus the contents of SRC1 (similarly to
    how the values 00 and 11 do now.

    A minor issue is the out-of-sequence Fetch problem this introduces.
    Not un-workable, but an annoyance. Perhaps Call-through-Table would
    work better--let me think on it.

    I understand. My rationale for using a different ROB value was to allow that functionality to be allowed within a VVM loop, whereas the other
    values invoke a jump or call, which you don't want to allow in VVM. In other words an indication that this is a small exception to the no
    control transfer rule and should be allowed. But obviously you
    understand the internals better than I do.





    After execution of that instruction,
    which must not be a control transfer instruction, control is returned to >> the instruction after the TT instruction. This makes the TT instruction >> behave similarly to an "Execute" instruction in some other
    architectures. For our current example, the instructions at the target
    address would be eight FMAC instructions, each using a different
    register(s).

    This is why a CTT is better, you can perform multiple instructions before returning.

    I understand. If you want to go there within VVM fine. I was trying to avoid that.





    Of course, I realize that this adds an "extra" instruction >> execution, and that substituting an I-cache read (the executed
    instruction) in place of a load doesn't seem like a savings. I can't do >> anything about extra instruction execution, but see below.

    So second, I think there is an enhancement that would eliminate most of
    the instruction fetches of the executed instructions. As I understand
    it, in VVM, when they are first encountered, the instructions between
    the VEC and the Loop instructions are fetched and stored in a special
    memory within the CPU,

    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop
    without multiple I-cache accesses. So the idea is, once the Loop
    instruction has been encountered, you know how many of the executed
    instructions can fit in the remaining space (in this case, I think all
    of them), and where to start them within this memory, (right after the
    loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in >> this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    ??? I thought you would fetch the "executed" instruction once and have
    it internally for the next seven loop iterations. But I may be wrong.

    Once the loop is in the reservation stations, they stay there for the
    entire execution of the loop! while FETCH remains quiescent.


    So overall, I think this idea reduces the memory traffic cost of keeping >> the A matrix in registers by a huge amount.

    The issue is that there can be no-transfers-of-control* out of a vVM
    loop while remaining IN a vVM loop. {This simplified HW by enormous amounts. vVM only vectorizes the innermost loop.

    I agree. This would have to be an exception, which is why I thought a unique value for ROP could indicate that.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue May 26 11:55:11 2026
    From Newsgroup: comp.arch

    On 5/26/2026 10:48 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/25/2026 12:36 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/3/2026 3:28 PM, MitchAlsup wrote:


    snip


    A very minor change to Reservation Station logic, where static operands
    can be used multiple times, and where the instruction remains present
    after being fired until Loop is satisfied. Each RS operand contains an
    index field that matches on the Loop iteration index.

    thus allowing multiple iterations of the loop >>>> without multiple I-cache accesses. So the idea is, once the Loop
    instruction has been encountered, you know how many of the executed
    instructions can fit in the remaining space (in this case, I think all >>>> of them), and where to start them within this memory, (right after the >>>> loop instruction). So further iterations of the loop can execute the
    target instructions without any I-cache references required. I think in >>>> this case it eliminates 7/8, i.e. 87.5% of them.

    Eliminates = 1.0-1.0/(loop_count)

    ??? I thought you would fetch the "executed" instruction once and have
    it internally for the next seven loop iterations. But I may be wrong.

    Once the loop is in the reservation stations, they stay there for the
    entire execution of the loop! while FETCH remains quiescent.

    Yes. That is what I thought. This would be an exception to that in
    that you would have to fetch and put into the reservation stations, the
    eight instructions pointed to be the TT instruction. It is those eight fetches out of the 64 executions of those instructions that led me to
    the (64-8)/64 = 7/8 reduction in the extra fetches that would otherwise
    be required.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed May 27 10:57:01 2026
    From Newsgroup: comp.arch

    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes. That is what I thought. This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to be the TT instruction. It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal. You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out
    (i.e. is executed) depends on the value in register Rn?


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed May 27 11:13:26 2026
    From Newsgroup: comp.arch

    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes. That is what I thought. This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to be the TT instruction. It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal. You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the
    reservations stations, the result is exactly as you describe. And, of
    course, you don't have to "skip over" the instructions not executed, you
    just use the M value to choose which of the instructions to execute.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu May 28 07:43:47 2026
    From Newsgroup: comp.arch

    On 5/27/2026 11:13 AM, Stephen Fuld wrote:
    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes.  That is what I thought.  This would be an exception to that in
    that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to be the TT instruction.  It is those
    eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal.  You want a kind of predication
    but instead of being a "predication-style `if`" it's a "predication-style
    `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out
    (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow what
    you call the NATPRED instruction (which is actually a modified TT instruction).  But once the instructions are loaded into the
    reservations stations, the result is exactly as you describe.  And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get
    predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.

    Thank you Stefan for helping me see this.

    So now the question is, does this save enough to be worth implementing?
    I don't know enough to write prototype code for say MATMUL (8) to see
    how much the savings would be, nor whether this mechanism could help in
    other functions. Mitch? Thomas?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu May 28 07:53:59 2026
    From Newsgroup: comp.arch

    On 5/28/2026 7:43 AM, Stephen Fuld wrote:
    On 5/27/2026 11:13 AM, Stephen Fuld wrote:
    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes.  That is what I thought.  This would be an exception to that in >>>> that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to be the TT instruction.  It is those >>>> eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal.  You want a kind of predication >>> but instead of being a "predication-style `if`" it's a "predication-
    style
    `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out >>> (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow
    what you call the NATPRED instruction (which is actually a modified TT
    instruction).  But once the instructions are loaded into the
    reservations stations, the result is exactly as you describe.  And, of
    course, you don't have to "skip over" the instructions not executed,
    you just use the M value to choose which of the instructions to execute.

    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code.  This allows the functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception

    Sorry, "extension", not "exception"


    would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc.  I
    don't know how useful this sort of thing would be.

    Thank you Stefan for helping me see this.

    So now the question is, does this save enough to be worth implementing?
    I don't know enough to write prototype code for say MATMUL (8) to see
    how much the savings would be, nor whether this mechanism could help in other functions.  Mitch? Thomas?


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu May 28 18:08:09 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/27/2026 11:13 AM, Stephen Fuld wrote:
    On 5/27/2026 7:57 AM, Stefan Monnier wrote:
    Stephen Fuld [2026-05-26 11:55:11] wrote:
    Yes.  That is what I thought.  This would be an exception to that in >>> that you would have to fetch and put into the reservation stations,
    the eight instructions pointed to be the TT instruction.  It is those >>> eight fetches out of the 64 executions of those instructions that led
    me to the (64-8)/64 = 7/8 reduction in the extra fetches that would
    otherwise be required.

    Oh, I think I understand your proposal.  You want a kind of predication >> but instead of being a "predication-style `if`" it's a "predication-style >> `case`", i.e. based on a numeric rather than a boolean value.
    E.g. you'd have a `NATPRED Rn, M` prefix instruction which
    would "shadow" the next M instructions such that all but one of the
    M instructions are "predicated out", and which one is not predicated-out >> (i.e. is executed) depends on the value in register Rn?

    Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction).  But once the instructions are loaded into the
    reservations stations, the result is exactly as you describe.  And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get predicated instructions into the reservation stations and eliminates the "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    The PRED instruction (when used in vVM loop} produces a lane mask so
    that different iterations of the loop are executed from the same
    starting time.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.

    That is what the then-clause and else-clause 'numbers' do--they set
    up the vertical lane-mask for that iteration. And each lane calculates
    its own lane mask.

    Thank you Stefan for helping me see this.

    So now the question is, does this save enough to be worth implementing?
    I don't know enough to write prototype code for say MATMUL (8) to see
    how much the savings would be, nor whether this mechanism could help in other functions. Mitch? Thomas?


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu May 28 19:29:11 2026
    From Newsgroup: comp.arch

    Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the reservations stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose
    which of the instructions to execute.

    In vVM, all the instructions that make up the loop end up all placed in
    the "dataflow" core once forall after which and they are just triggered
    several times until the loop exit condition is satisfied.

    So "skip over" makes no sense in this context (also because you want
    a much shorter delay between "M is known" and "the corresponding
    instruction is executed", so you have to decode the instruction(s)
    before M is known).

    But instead of skipping, You can "predicate away" the undesirable
    instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri May 29 08:50:17 2026
    From Newsgroup: comp.arch

    On 5/28/2026 11:08 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    snip


    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your
    understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get
    predicated instructions into the reservation stations and eliminates the
    "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    The PRED instruction (when used in vVM loop} produces a lane mask so
    that different iterations of the loop are executed from the same
    starting time.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.

    That is what the then-clause and else-clause 'numbers' do--they set
    up the vertical lane-mask for that iteration. And each lane calculates
    its own lane mask.

    Yes. But the proposed enhancement to my original proposal gives you
    another way to specify which instructions get executed. Your original
    PRED has a fixed mask to choose which instructions are executed or not,
    based on a binary condition test. This enhancement allows choosing
    which instruction gets executed in each lane based on the value in a
    register. Thus lane 1 could execute instruction 3, lane 2 could execute instruction 5, etc. This is what allows you to gain the effect of
    "indexed register accesses" at low cost (I believe one extra cycle.)

    My original proposal allows you to execute one instruction based on the
    value in a register, i.e. if the register contains the value 3, then the
    third instruction is the only one executed. The enhanced version allows
    more flexibility. For example, you could allow to specify execute all instructions up to the number in the register. As I said, while the use
    case for the basic instruction is clear, emulating register indexing, I
    am not sure there are any use cases for the enhancement.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri May 29 09:03:25 2026
    From Newsgroup: comp.arch

    On 5/28/2026 4:29 PM, Stefan Monnier wrote:
    Yes, with the exception that they don't have to immediately follow what you >> call the NATPRED instruction (which is actually a modified TT instruction). >> But once the instructions are loaded into the reservations stations, the
    result is exactly as you describe. And, of course, you don't have to "skip >> over" the instructions not executed, you just use the M value to choose
    which of the instructions to execute.

    In vVM, all the instructions that make up the loop end up all placed in
    the "dataflow" core once forall after which and they are just triggered several times until the loop exit condition is satisfied.

    Yes. Note that in an earlier post, I changed my proposal to be more in
    line with your ideas and the original pred - specifically, the
    instructions to be conditionally executed are physically placed inline,
    after the NATPRED.



    So "skip over" makes no sense in this context (also because you want
    a much shorter delay between "M is known" and "the corresponding
    instruction is executed", so you have to decode the instruction(s)
    before M is known).

    My exposition was sloppy. :-(


    But instead of skipping, You can "predicate away" the undesirable instructions.

    Agree. And clearer exposition.

    So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).

    Yes, size is clearly a limitation. I think it works for MATMUL (8), but probably not for MATMUL (16). I don't know enough to know how practical
    it would be to allow more than the current 32 instructions within a VVM
    loop. As for performance, I expect (hope) that instructions other than
    the one whose position matches the register value will not be executed.
    If that can be done, then the extra cost is presumably one cycle, and it
    saves (in the MATUL (8) example), executing eight load instructions.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 29 16:55:38 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    Yes, with the exception that they don't have to immediately follow what you call the NATPRED instruction (which is actually a modified TT instruction). But once the instructions are loaded into the reservations stations, the result is exactly as you describe. And, of course, you don't have to "skip over" the instructions not executed, you just use the M value to choose which of the instructions to execute.

    In vVM, all the instructions that make up the loop end up all placed in
    the "dataflow" core once forall after which and they are just triggered several times until the loop exit condition is satisfied.

    Correct.

    So "skip over" makes no sense in this context (also because you want
    a much shorter delay between "M is known" and "the corresponding
    instruction is executed", so you have to decode the instruction(s)
    before M is known).

    Also Correct.

    But instead of skipping, You can "predicate away" the undesirable instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).

    If the predicated instructions are used in at least 1 iteration, they
    are not useless.


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 29 16:58:42 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/28/2026 11:08 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    snip


    Sorry for the self followup, but I now believe that it would be better,
    both clearer to understand and easier to implement, to follow your
    understanding and to require the "predicated" instructions to
    immediately follow the "NATPRED" in the code. This allows the
    functionality to make use of the already existing mechanism to get
    predicated instructions into the reservation stations and eliminates the >> "jump" characteristic which required an exception to the no jump within
    a VVM loop rule.

    The PRED instruction (when used in vVM loop} produces a lane mask so
    that different iterations of the loop are executed from the same
    starting time.

    And since the NATPRED instruction as you defined it, and I agree, only
    has two operands, perhaps a reasonable exception would be to allow a
    third field that changes the sense of the test to allow things like
    execute all instructions up to N, all instructions except N, etc. I
    don't know how useful this sort of thing would be.

    That is what the then-clause and else-clause 'numbers' do--they set
    up the vertical lane-mask for that iteration. And each lane calculates
    its own lane mask.

    Yes. But the proposed enhancement to my original proposal gives you
    another way to specify which instructions get executed. Your original
    PRED has a fixed mask to choose which instructions are executed or not, based on a binary condition test. This enhancement allows choosing
    which instruction gets executed in each lane based on the value in a register. Thus lane 1 could execute instruction 3, lane 2 could execute instruction 5, etc. This is what allows you to gain the effect of
    "indexed register accesses" at low cost (I believe one extra cycle.)

    It is the latency of PARSDE+DECODE

    My original proposal allows you to execute one instruction based on the value in a register, i.e. if the register contains the value 3, then the third instruction is the only one executed. The enhanced version allows more flexibility. For example, you could allow to specify execute all instructions up to the number in the register. As I said, while the use case for the basic instruction is clear, emulating register indexing, I
    am not sure there are any use cases for the enhancement.

    Not sure what the source code would look like in order for the compiler
    to recognize this pattern and optimize to your solution.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 29 17:04:03 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 5/28/2026 4:29 PM, Stefan Monnier wrote:
    ----------------------
    So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).

    Yes, size is clearly a limitation.

    A 6-wide × 16-deep execution window would allow between 90-and-96 instructions.

    I think it works for MATMUL (8), but probably not for MATMUL (16).

    You run out of registers anyway.

    I don't know enough to know how practical
    it would be to allow more than the current 32 instructions within a VVM loop.

    It makes vVM harder for smaller machines.

    Oh and BTW, forward FFT is 27 instructions/butterfly iteration.

    As for performance, I expect (hope) that instructions other than
    the one whose position matches the register value will not be executed.
    If that can be done, then the extra cost is presumably one cycle, and it saves (in the MATUL (8) example), executing eight load instructions.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 1 15:45:37 2026
    From Newsgroup: comp.arch

    On 5/29/2026 9:58 AM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    snip

    My original proposal allows you to execute one instruction based on the
    value in a register, i.e. if the register contains the value 3, then the
    third instruction is the only one executed. The enhanced version allows
    more flexibility. For example, you could allow to specify execute all
    instructions up to the number in the register. As I said, while the use
    case for the basic instruction is clear, emulating register indexing, I
    am not sure there are any use cases for the enhancement.

    Not sure what the source code would look like in order for the compiler
    to recognize this pattern and optimize to your solution.

    Good question. I have thought about it for a while, and though I am far
    from a compiler expert, I have come up with a potential solution at
    least for the basic proposal.

    The idea is if the compiler sees a SWITCH statement where the clauses
    that are switched to will compile to one instruction (not counting any instructions need for array addressing (which would be handled outside
    the SWITCH, e.g. a loop counter), then it could emit the PREDNAT (or
    whatever name is better) followed by the single instructions for each
    clause. I am sure this needs more specificity, but I hope you get the idea.

    I am still not sure of the benefit of the "enhanced" instruction, and
    haven't come up with any reasonable source code that would benefit from it.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jun 1 14:58:35 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-05-29 16:55:38] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> posted:
    But instead of skipping, You can "predicate away" the undesirable
    instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large
    for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).
    If the predicated instructions are used in at least 1 iteration, they
    are not useless.

    They may not be useless overall, but they still waste resources at each iteration where they're not used. Traditional predication of an `if`
    gives a "50% waste" (for equal size branches or when each branch is
    taken as often as the other), whereas a predicated `switch` results in
    a waste of `N-1/N`. As N grows larger this becomes discouraging.


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 1 23:40:38 2026
    From Newsgroup: comp.arch

    On 6/1/2026 11:58 AM, Stefan Monnier wrote:
    MitchAlsup [2026-05-29 16:55:38] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> posted:
    But instead of skipping, You can "predicate away" the undesirable
    instructions. So, in sum, I think what you describe can be made to
    work. The main problem is that it will "fill" your dataflow core with
    many "useless" instructions, so it risks making the whole loop too large >>> for vVM and it risks also making it inefficient (in case all
    N instructions end up speculatively executed and the predication
    operates by throwing away N-1 of the values).
    If the predicated instructions are used in at least 1 iteration, they
    are not useless.

    They may not be useless overall, but they still waste resources at each iteration where they're not used. Traditional predication of an `if`
    gives a "50% waste" (for equal size branches or when each branch is
    taken as often as the other), whereas a predicated `switch` results in
    a waste of `N-1/N`. As N grows larger this becomes discouraging.

    OK, but what exactly are we wasting? We are taking space in the
    reservation stations, but we are saving executing actual load
    instructions. So the "waste" results in faster execution.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2