• Sane(r) SIMD

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 26 10:26:30 2026
    From Newsgroup: comp.arch

    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers than the current state of the art, where SIMD-based
    vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards, where bit n is set in lane
    m if lanes n and m hold the same value.
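
    A scalar sketch of that "match" semantics, purely for illustration
    (my own rendering, assuming 8 lanes of 32-bit values):

        #include <stdint.h>

        /* For each lane m, set bit n of the result iff lane n holds the
           same value as lane m. */
        void lane_match(uint8_t out[8], const uint32_t lane[8])
        {
            for (int m = 0; m < 8; m++) {
                uint8_t bits = 0;
                for (int n = 0; n < 8; n++)
                    if (lane[n] == lane[m])
                        bits |= (uint8_t)(1u << n);
                out[m] = bits;
            }
        }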

    Is this worth it? CPU manufacturers seem to think so; they devote
    considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if used a lot, can give significant
    speedups.

    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    The feature set should be constant across all vector widths.

    No vector registers should be used in interrupts :-)

    It might make sense for a process to announce to the OS which
    vector registers it uses for faster system calls.

    Data types: 8, 16, 32, 64 bit ints; also 128 bit?
    FP types: Same as above?

    Other points?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Apr 26 12:27:06 2026
    From Newsgroup: comp.arch

    On 4/26/2026 5:26 AM, Thomas Koenig wrote:
    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.


    One practical limit IMO is:
    Don't go wider than 4 elements (without a very good reason or in very
    special cases).

    If SIMD goes wider than 4 elements, the complexity curve quickly becomes unmanageable.

    Almost more preferable IMO to allow superscalar on SIMD vectors than to
    go to wider SIMD vectors. Granted, how much this scales depends more on
    the size of the register file. There is in effect a hard limit here.



    Personally, I don't really trust automatic vectorization, as it is sort
    of a double edged sword between "faster" and "slow and bloated"
    (particularly with MSVC).


    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    The feature set should be constant across all vector widths.


    Or keep all registers the same size, say, 64 bits.
    If you do a 128-bit op, it can use pairs.
    Practically, a 4-wide SIMD op can be seen internally as two 2-wide
    operations glued together (or the 128-bit SIMD operation effectively
    splitting into two co-issued 64-bit operations).

    Granted, one may note that a 4x Binary16 SIMD operation may consume both
    of these lanes. The main alternative being to have two extra
    Binary16-only units per lane, so the processor can effectively do 8
    Binary16 ops per cycle in some cases (if superscalar allows).

    Though, arguably, the probability of being able to co-issue Binary16
    SIMD ops is low enough that it may not make sense to allow it.

    Say, for example:
    Main FPU, Lanes 1 or 2, does not co-issue;
    SIMD units: Lanes 1 and 2, may co-issue.
    Each unit has 2x Binary16/Binary32 elements,
    and two Binary16 only elements.
    ...


    This isn't an exact match for my existing SIMD unit, but probably a
    direction it could make sense to go if I needed more throughput (not
    currently a limiting factor; even on SIMD-heavy code it is hard to
    "keep the thing fed").

    Well, Load/Store and SHUF operations effectively take away cycles
    that the SIMD unit could otherwise use. At least partly, where a 4x
    Binary16 SIMD instruction can co-issue with a Load or a SHUF, but the
    128-bit instructions can't co-issue at all (each 128-bit instruction
    effectively using the whole pipeline width).


    No vector registers should be used in interrupts :-)

    It might make sense for a process to announce to the OS which
    vector registers it uses for faster system calls.


    Or, no vector registers exist at all...

    The role that "could" be served by separate GPR/FPR/SIMD registers can
    instead be served by making the GPR space bigger. Currently, there is a practical limit of 64 both for the FPGA logic and for encoding.

    Though, a "break glass" option could be to allow for 64-bit encodings to address 256 registers, but maybe have R64..R255 as "narrow issue" (only accessible in 2R1W configurations or similar) and likely mapped to
    Block-RAM on an FPGA.



    Doing 2R1W with BRAMs would likely eat 4 BRAMs, or 8 BRAMs if one allows 128-bit SIMD to use them (while still following 2R1W semantics). Note
    that full 4R2W or 6R3W with BRAMs would likely become too expensive.

    Though, in the 128-bit case, if one is handling it internally as a 2R1W 128-bit reg-file with low/high access for 64-bit ops, could almost make
    sense to expand it to 512 registers to make more efficient use of the
    BRAM (though, 256 registers would come back within the reach of LUTRAM
    in this case).


    Though, granted, such a drastic register file expansion would risk
    making things like interrupt handling and system calls painfully slow.

    Would be sort of like register banking but worse:
    Where, register banking allows making basic interrupt handling faster at
    the cost of making context switches slower and more complicated; but a
    single non-banked register file basically keeps both at roughly the same
    cost (though, in my case, does mean effectively two context switches per system call, but system calls turned out to be a nice place to implement preemptive scheduling as it was typically less likely to leave things in
    an inconsistent state vs scheduling based on the timer interrupt).


    Data types: 8, 16, 32, 64 bit ints; also 128 bit?
    FP types: Same as above?


    In my case:
    8-bit: Mostly converter only
    16-bit: Native (Int16 and Binary16 FP)
    32-bit: Native (Int32 and Binary32)
    4x SIMD is 128-bit, as above.
    64-bit: Scalar/SIMD hybrid.
    Binary64 SIMD operations exist, but internally use pipelining.
    128-bit: Scalar only.
    Int128: Typically two 64-bit ALU lanes glued together (*1).
    Binary128: Software emulation only.
    Int128 in HW can help with emulation performance.

    *1:
    ADD/SUB: Can route some signals between ALUs to allow for 128-bit
    Carry-Select (a software analogy is sketched after this list);
    AND/OR/XOR: Co-issue 64-bit, nothing special needed.
    SHIFT: Internally co-issues two funnel shifts.
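
    As a software analogy of the ADD/SUB case above (an assumption on my
    part, not the actual hardware), the 128-bit result is just two 64-bit
    adds with the carry routed from the low lane into the high lane:

        #include <stdint.h>

        typedef struct { uint64_t lo, hi; } u128;

        static u128 add128(u128 a, u128 b)
        {
            u128 r;
            r.lo = a.lo + b.lo;
            r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low lane */
            return r;
        }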

    TBD: It is possible I could make the funnel shifter accessible at the
    ISA level (as 4R ops).


    Other points?


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 26 18:25:54 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    SIMD as a way to calculate more things per cycle is fine. vVM will
    end up doing multi-lane calculations SIMD style.

    SIMD as a way to consume vast quantities of ISA space is not. Done
    right there is no need for a vector RF and the associated context
    switch overhead.

    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory.

    I have been considering adding a Permute instruction to My 66000
    ISA over the last 2 weeks. Fixed permutes need a 24-bit constant
    (Encryption) while variable permutes can use a register value.

    It seems permute is to go with a carryless multiply.

    These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.

    SIMD data path yes, absolutely.
    SIMD instructions at best maybe.

    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    That is the problem with SIMD where it is used to make library code
    fast (str*, mem*): too many interrupt handlers and OS handlers want
    to use fast versions of those libraries.

    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    This is where SIMD consumes a cartesian product of ISA space.

    The feature set should be constant across all vector widths.

    No vector registers should be used in interrupts :-)

    It might make sense for a process to announce to the OS which
    vector registers it uses for faster system calls.

    Data types: 8, 16, 32, 64 bit ints; also 128 bit?
    FP types: Same as above?

    Other points?

    BGB makes the point that SIMD should stop at 4-wide. I mostly
    agree.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 26 19:07:20 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:
    On 4/26/2026 5:26 AM, Thomas Koenig wrote:
    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based
    vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote
    considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.


    One practical limit IMO is:
    Don't go wider than 4 elements (without a very good reason or in very special cases).


    If SIMD goes wider than 4 elements, the complexity curve quickly becomes unmanageable.

    AVX-512 allows for 64 8-bit numbers in parallel.

    Almost more preferable IMO to allow superscalar on SIMD vectors than to
    go to wider SIMD vectors. Granted, how much this scales depends more on
    the size of the register file. There is in effect a hard limit here.

    Personally, I don't really trust automatic vectorization, as it is sort
    of a double edged sword between "faster" and "slow and bloated" (particularly with MSVC).

    Agreed. For code for which vectors are a good match, Mitch's
    VVM is better. I am talking about the rest, specifically
    shuffles (and friends).


    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have
    sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    The feature set should be constant across all vector widths.


    Or keep all registers the same size, say, 64 bits.
    If you do a 128-bit op, it can use pairs.
    Practically, a 4-wide SIMD op can be seen internally as two 2-wide operations glued together (or the 128-bit SIMD operation effectively splitting into two co-issued 64-bit operations).

    Some time ago, Mitch explained one method of how SIMD addition can
    be performed using a single adder. Assume you want to add two 8-bit
    numbers and have a 17-bit adder. If the inputs to the adder are
    (little-endian) bits A<0:7>,I,A<8:15> and B<0:7>,0,B<8:15> and I is
    zero, you get two eight-bit results in Q<0:7> and Q<9:16>, with the
    carry bit of the first addition as Q<8> and of the second addition
    as the normal carry-out. If I is one, then you get the 16-bit
    result in Q<0:7> and Q<9:16>.

    Alternatively, it would also be possible to kill carries.
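
    A small C sketch of that separator-bit trick (illustrative only; the
    packing helper and the example values are mine):

        #include <stdint.h>
        #include <stdio.h>

        /* Pack A<0:7>, the separator bit I at bit 8, and A<8:15> at bits
           9..16 into one 17-bit operand. */
        static uint32_t pack17(uint16_t a, unsigned i)
        {
            return (a & 0xFFu) | ((i & 1u) << 8) | ((uint32_t)(a >> 8) << 9);
        }

        int main(void)
        {
            uint16_t a = 0x34F0, b = 0x1230;  /* low bytes 0xF0+0x30 carry */
            uint32_t q;

            /* I = 0: the separator bit absorbs the low-byte carry, giving
               two independent 8-bit sums. */
            q = pack17(a, 0) + pack17(b, 0);
            printf("lo=%02x hi=%02x\n",
                   (unsigned)(q & 0xFF),          /* Q<0:7>  = 0x20 */
                   (unsigned)((q >> 9) & 0xFF));  /* Q<9:16> = 0x46 */

            /* I = 1: the separator bit forwards the carry, giving the
               16-bit sum 0x4720 split across Q<0:7> and Q<9:16>. */
            q = pack17(a, 1) + pack17(b, 1);
            printf("lo=%02x hi=%02x\n",
                   (unsigned)(q & 0xFF),          /* 0x20 */
                   (unsigned)((q >> 9) & 0xFF));  /* 0x47 */
            return 0;
        }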


    Well, since Load/Store and SHUF operations effectively take away cycles
    that the SIMD unit could work. At least partly, where a 4x Binary16 SIMD instruction can co-issue with a Load or a SHUF, but the 128-bit
    instructions can't co-issue at all (each 128-bit instruction effectively using the whole pipeline width).


    No vector registers should be used in interrupts :-)

    It might make sense for a process to announce to the OS which
    vector registers it uses for faster system calls.


    Or, no vector registers exist at all...

    I should have written "SIMD registers" above.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Apr 26 19:38:19 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based
    vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    SIMD as a way to calculate more things per cycle is fine. vVM will
    end up doing multi-lane calculations SIMD style.

    SIMD as a way to consume vast quantities of ISA space is not.

    So it should be done right :-)

    Done
    right there is no need for a vector RF and the associated context
    switch overhead.

    That is what I am doubting.


    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory.

    I have been considering adding a Permute instruction to My 66000
    ISA over the last 2 weeks. Fixed permutes need a 24-bit constant
    (Encryption) while variable permutes can use a register value.

    Would permute work over larger blocks like 256 or 512 bits?
    Can the results from permute be re-used in registers, or would
    they have to be reloaded from memory?


    It seems permute is to go with a carryless multiply.

    I do not understand that sentence.


    These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote
    considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.

    SIMD data path yes, absolutely.
    SIMD instructions at best maybe.

    That's the point; I am trying to explore the "maybe".


    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    That is the problem with SIMD where it is used to make library code
    fast (str*, mem*) too many interrupt handlers and OS handlers want
    to use fast versions of those libraries.

    For code which can be efficiently vectorized, like str* and mem*,
    you are correct. I am talking about the cases where it is not.


    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have
    sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    This is where SIMD consumes a cartesian product of ISA space.

    It does not have to, I think.

    Let's see what an instruction modifier (like CARRY) could look
    like.

    Arithmetic operations like ADD have two bits for size in the newest
    version of the ISA, so the size of the units to be operated upon
    is known.

    The size of the SIMD register could be encoded in the otherwise
    unused SRC1 field; five bits are certainly enough for that.
    There are a maximum of three source registers in each instruction,
    so three bits are enough to encode if a register is an SIMD or a
    regular register for each instruction - room enough for a shadow
    of five instructions.
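
    Purely as a visualization of the field budget described above (all
    names made up, not an actual My 66000 encoding), such a modifier's
    payload might look like:

        /* 5 bits of SIMD width (in the otherwise unused SRC1 field) plus
           3 bits per shadowed instruction marking which of its registers
           are SIMD rather than regular: 5 + 5*3 = 20 bits for a shadow
           of five instructions. */
        struct ssimd_modifier {
            unsigned simd_width  : 5;
            unsigned simd_regs_0 : 3;
            unsigned simd_regs_1 : 3;
            unsigned simd_regs_2 : 3;
            unsigned simd_regs_3 : 3;
            unsigned simd_regs_4 : 3;
        };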

    Predicates would work fine with SIMD code, I think.

    So, no combinatorial explosion that I can see.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Apr 26 21:27:02 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based
    vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    SIMD as a way to calculate more things per cycle is fine. vVM will
    end up doing multi-lane calculations SIMD style.

    SIMD as a way to consume vast quantities of ISA space is not.

    So it should be done right :-)

    Done
    right there is no need for a vector RF and the associated context
    switch overhead.

    That is what I am doubting.


    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory.

    I have been considering adding a Permute instruction to My 66000
    ISA over the last 2 weeks. Fixed permutes need a 24-bit constant (Encryption) while variable permutes can use a register value.

    Would permute work over larger blocks like 256 or 512 bits?
    Can the results from permute be re-used in registers, or would
    they have to be reloaded from memory?

    Like everything else, ISA describes register-width units of calculation
    and memory references, while vVM allows bundling these into calculations
    as wide as CPU designers desire/allow--without changing ISA.

    But the target for permutes is faster swizzling of data in cyphers.
    After swizzling, the bytes are multiplied SIMD-style (carryless
    multiply). Put these in a loop and one has faster cyphers via SIMD
    expressed in vVM style.


    It seems permute is to go with a carryless multiply.

    I do not understand that sentence.

    64×64 multiply where only the XOR-gates process, since the majority
    gates are continuously de-asserted. Same gates as a standard multiplier,
    except the 3-2-majority gate has 2 more transistors that turn it off
    altogether. This logic is also used when one wants 2{32×32} multipliers
    or 4{16×16} multipliers, ...
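
    For reference, what a 64x64 carryless multiply computes (here the low
    64 bits, in plain C; "only the XOR-gates process" amounts to XOR-ing
    shifted copies instead of adding them):

        #include <stdint.h>

        static uint64_t clmul_lo(uint64_t a, uint64_t b)
        {
            uint64_t r = 0;
            for (int i = 0; i < 64; i++)
                if ((b >> i) & 1)
                    r ^= a << i;   /* XOR instead of ADD: no carries */
            return r;
        }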


    These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote
    considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.

    SIMD data path yes, absolutely.
    SIMD instructions at best maybe.

    That's the point, I am trying to explore the "maybe".


    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    That is the problem with SIMD where it is used to make library code
    fast (str*, mem*) too many interrupt handlers and OS handlers want
    to use fast versions of those libraries.

    For code which can be efficiently vectorized, like str* and mem*,
    you are correct. I am talking about the cases where it is not.


    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have
    sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    This is where SIMD consumes a cartesian product of ISA space.

    It does not have to, I think.

    Things vVM does better than SIMD: mixed operand and result widths,
    stride-based memory accesses, scatter/gather memory accesses. For
    example: Loop{LDByte LDHalf ADD-reduce} ST Doubleword is no problem
    for vVM and is not possible in any SIMD ISA.
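
    One reading of that example (my interpretation of the mnemonics, not
    Mitch's code): byte and halfword elements are loaded, added, and
    reduced into a single 64-bit result.

        #include <stdint.h>
        #include <stddef.h>

        int64_t mixed_width_reduce(const int8_t *bytes,    /* LDByte  */
                                   const int16_t *halves,  /* LDHalf  */
                                   size_t n)
        {
            int64_t acc = 0;                               /* ADD-reduce */
            for (size_t i = 0; i < n; i++)
                acc += (int64_t)bytes[i] + (int64_t)halves[i];
            return acc;                                    /* ST Doubleword */
        }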

    Let's see what an instruction modifier (like CARRY) could look
    like.

    Arithmetic operations like ADD have two bits for size in the newest
    version of the ISA, so the size of the units to be operated upon
    is known.

    The {Sign}{Size} of memory references and integers is known in the
    instruction.

    The size of the SIMD register could be encoded in the otherwise
    unused SRC1 field; five bits are certainly enough for that.

    Sure, those bits could be used to do that--but do you even need a
    SIMD-RF at all? vVM allows buffering between cache and data-path
    to provide the width specification as an implementation-by-
    implementation choice--and without any perturbation to SW. So,
    vVM code written for the smallest possible machine will run near
    optimally on the largest possible machine. Would your SIMD
    Instruction-Modifier have that same property ??

    There are a maximum of three source registers in each instruction,
    so three bits are enough to encode if a register is an SIMD or a
    regular register for each instructions - room enough for a shadow
    of five instructions.

    Predicates would work fine with SIMD code, I think.

    In vVM, predicate instructions are directly turned into lane-masks
    without having to be SIMD instructions creating lane masks; branches
    {of short forward nature} can be done similarly in larger
    implementations.

    So, no combinatorial explosion that I can see.

    SIMD as you describe only adds the superluminal properties of x86
    SIMD.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 27 05:42:46 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based
    vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    SIMD as a way to calculate more things per cycle is fine. vVM will
    end up doing multi-lane calculations SIMD style.

    SIMD as a way to consume vast quantities of ISA space is not.

    So it should be done right :-)

    Done
    right there is no need for a vector RF and the associated context
    switch overhead.

    That is what I am doubting.


    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory.

    I have been considering adding a Permute instruction to My 66000
    ISA over the last 2 weeks. Fixed permutes need a 24-bit constant
    (Encryption) while variable permutes can use a register value.

    Would permute work over larger blocks like 256 or 512 bits?
    Can the results from permute be re-used in registers, or would
    they have to be reloaded from memory?

    Like everything else, ISA describes register-width units of calculation
    and memory references, while vVM allows bundling these into calculations
    as wide as CPU designers desire/allow--without changing ISA.

    I know. And I think this works for many use cases, and is preferable
    whenever it works, but not for all.

    But the target for permutes is faster swizzling of data in cyphers.

    That is one application, but by far not the only one.

    After swizzling, the bytes are multiplied SIMD-style (carryless
    multiply). Put these in a loop and one has faster cyphers via SIMD
    expressed in vVM style.


    It seems permute is to go with a carryless multiply.

    I do not understand that sentence.

    64×64 multiply where only the XOR-gates process since the majority
    gates are continuously de-asserted. Same gates as standard multiplier
    except the 3-2-majority gate has 2 more transistors that turn it off altogether. This logic is also used when one wants 2{32×32} multipliers
    or 4{16×16} multipliers, ...

    OK.



    These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote
    considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.

    SIMD data path yes, absolutely.
    SIMD instructions at best maybe.

    That's the point, I am trying to explore the "maybe".


    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    That is the problem with SIMD where it is used to make library code
    fast (str*, mem*) too many interrupt handlers and OS handlers want
    to use fast versions of those libraries.

    For code which can be efficiently vectorized, like str* and mem*,
    you are correct. I am talking about the cases where it is not.


    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have
    sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    This is where SIMD consumes a cartesian product of ISA space.

    It does not have to, I think.

    things vVM does better than SIMD: mixed operand and result widths,
    stride based memory accesses, scatter gather memory accesses. For
    example: Loop{LDByte LDHalf ADD-reduce} ST Doubleword is no problem
    for vVM and is not possible in any SIMD ISA.

    VVM is very well suited to such things, correct. And when it is
    suitable, it should be used in preference.

    But this mixed arithmetic could also be performed by extending
    the instruction shadow.


    Let's see what an instruction modifier (like CARRY) could look
    like.

    Arithmetic operations like ADD have two bits for size in the newest
    version of the ISA, so the size of the units to be operated upon
    is known.

    The {Sign}{Size} of memory references and integers is known in the instruction.

    The size of the SIMD register could be encoded in the otherwise
    unused SRC1 field; five bits are certainly enough for that.

    Sure, those bits could be used to do that--but do you even need a
    SIMD-RF at all? vVM allows buffering between cache and data-path
    to provide the width specification as an implementation-by-
    implementation choice--and without any perturbation to SW. So,
    vVM code written for the smallest possible machine will run near
    optimally on the largest possible machine. Would your SIMD
    Instruction-Modifier have that same property ??

    No, and that is the big drawback. Again, I am not trying to replace
    VVM, which is preferable whenever it works well. I am looking at the
    cases where SIMD would work better than VVM when both are available.


    There are a maximum of three source registers in each instruction,
    so three bits are enough to encode if a register is an SIMD or a
    regular register for each instructions - room enough for a shadow
    of five instructions.

    Predicates would work fine with SIMD code, I think.

    In vVM, predicate instructions are directly turned into lane-masks
    without having to be SIMD instructions creating lane masks, branches
    {of short forward nature} can be done similarly in larger implemen-
    tations.

    So, no combinatorial explosion that I can see.

    SIMD as you describe only adds the superluminal properties of x86
    SIMD.

    superluminal?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Apr 27 06:56:10 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    SIMD data path yes, absolutely.
    SIMD instructions at best maybe.

    I think this points to a way to express shuffling in VVM:

    Have the shuffling specifier as an array (of, say, bytes) in memory.
    When the program does something like:

    loop over i
      p = shuffle[i]
      t = a[p]
      b[i] = t
    end

    the hardware can assemble it into uops of the SIMD data path that it
    has. The loop can contain additional instructions and maybe a
    different way of using the result than storing it into b, but the
    pattern to recognize is

    p = shuffle[i]
    t = a[p]

    The size of the SIMD data path and of the elements in shuffle
    determines how many microinstructions are necessary. E.g., if the
    SIMD data path can handle up to 16 elements of a in one uop, and
    shuffle contains values <32, each SIMD width requires 1 load from
    shuffle, 2 loads from a, 2 shuffling uops and a merge. If shuffle
    contains too large values compared to the SIMD width, falling back to
    scalar may be more economical, but then SIMD ISAs would not have
    supported the operation, either.

    The size of the elements of shuffle comes from the data path, and is
    needed in the decoder, which is usually a problem, and usually solved
    with a predictor. That can be done here, too. One can also imagine
    communicating the maximum size through the ISA, which may also serve
    as a hint for the uarch to use shuffling.
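
    For concreteness, the scalar loop above rendered as C (the point being
    that under VVM the b[i] = a[shuffle[i]] pattern is what the hardware
    would map onto its SIMD shuffle/permute data path):

        #include <stdint.h>
        #include <stddef.h>

        void gather_shuffle(uint8_t *b, const uint8_t *a,
                            const uint8_t *shuffle, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                uint8_t p = shuffle[i];   /* p = shuffle[i] */
                b[i] = a[p];              /* t = a[p]; b[i] = t */
            }
        }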

    BGB makes the point that SIMD should stop at 4-wide. I mostly
    agree.

    SIMD is useful for data-parallel tasks, many of which have a very wide
    data width. Stopping at 4-wide is as sensible as stopping at 1-wide.

    But given that VVM means that SIMD width is microarchitectural,
    mistakes like BGB's point would have little consequence, because one
    can always go wider while still using the same software.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Apr 26 18:30:31 2026
    From Newsgroup: comp.arch

    BGB [2026-04-26 12:27:06] wrote:
    On 4/26/2026 5:26 AM, Thomas Koenig wrote:
    [...]
    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    [...]
    One practical limit IMO is:
    Don't go wider than 4 elements (without a very good reason or in very
    special cases).

    My understanding is that it might be worth distinguishing the case of
    SIMD used for "tuples" and SIMD used for "arrays":

    - By "tuples" I mean data that is inherently of fixed size, such as the
    3 or 4 element vectors used to represent a point in 3D space, where
    each element has a specific role.
    SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
    Not sure how much shuffle they may need.

    - By "arrays", I mean data of a size that can be much larger than 3-4
    elements and which often/usually varies dynamically.
    When handling such data, SSE/AVX need to wrap the SIMD instructions
    inside loops. vVM should handle that much better in most cases.

    For the arrays case, it might be worth thinking about what kind of
    shuffle is needed and why. IIUC the shuffle is not needed over the
    whole array. Instead, it shows up for example when you do a reduction
    on the array: after doing a fast traversal of the array you end up
    with N partial-reduction results, and the shuffles are needed to do the
    remaining log N steps to combine those N results.
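
    A minimal sketch of that combining step (my own illustration; N is
    assumed to be a power of two, and in SIMD code each step below would
    be a shuffle/lane-swap plus an add):

        #include <stddef.h>

        float combine_partials(float partial[], size_t n)
        {
            for (size_t step = n / 2; step > 0; step /= 2)  /* log2(n) steps */
                for (size_t i = 0; i < step; i++)
                    partial[i] += partial[i + step];
            return partial[0];
        }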

    What other cases of shuffles show up?
    What would be the best way to handle them?

    On a related note, I'd love to see papers that try to reproduce in
    a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
    corresponds to a set of vectors and the control flow of SIMT warps could
    be represented as mask bitvectors, but then there's the automatic scheduling/switching between warps, plus all kinds of other details
    where the mapping between GPUs and vector CPUs doesn't seem so obvious.


    === Stefan
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Mon Apr 27 08:48:09 2026
    From Newsgroup: comp.arch

    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a very wide
    data width. Stopping at 4-wide is as sensible as stopping at 1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Apr 27 12:23:08 2026
    From Newsgroup: comp.arch

    On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a very
    wide data width. Stopping at 4-wide is as sensible as stopping at
    1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.
    You obviously don't know what you are talking about. That's not new.
    Modern GPUs are best described as multicore processors with each core
    having SIMD width comparable to that of many (not all) CPUs.
    My understanding, not necessarily precisely correct but reasonably
    close, is that the latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada
    Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM. I.e.
    Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu
    CPUs.
    Plus, nowadays they have outer product (a.k.a. tensor) engines, but
    that's OT.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Apr 27 17:11:50 2026
    From Newsgroup: comp.arch

    On Mon, 27 Apr 2026 12:23:08 +0300
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a very
    wide data width. Stopping at 4-wide is as sensible as stopping at 1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.

    You obviously don't know what you are talking about. That's not new.

    Modern GPUs are best described as multicore processors with each core
    having SIMD width comparable to that of many (not all) CPUs.
    My understanding, not necessarily precisely correct, but necessarily
    close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada
    Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM. I.e.
    Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu
    CPUs.

    Plus, nowadays they have outer product (a.k.a. tensor) engines, but
    that's OT.

    I tried to look for more info, and it seems that AMD RDNA, Intel Xe
    and Apple GPUs are all built around 1024-bit SIMD units. In AMD's case
    two units can be fused to run the same instructions, supposedly gaining
    extra efficiency, and the result can appear as 2048-bit SIMD to the
    programmer. In Intel's case one unit can be split into two halves or
    four quarters, running different operations, supposedly gaining extra
    flexibility at the cost of losing some efficiency.
    I am starting to suspect that Nvidia GPUs are in fact also built around
    1024-bit SIMD EUs, with half as many independent EUs per SM as I
    speculated in the post above.
    Anyway, the SIMD width in GPUs, measured in units of their most
    important data size (32 bits), varies from 8 to 64 depending on vendor
    and use case, rather than "hundreds or thousands".
    The rest of GPU parallelism is of the MIMD variety.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Apr 27 18:47:34 2026
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
    (and the fraction of unresolved bugs) to see how difficult that is.

    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.

    Shuffles and mixes are handled automatically by VVM, as long as you have
    register names available to describe each part of the shuffle. It does
    lead to much larger code to describe an arbitrary 16->16 shuffle
    operation, but for most algorithms not having to SIMD it removes the
    actual need for shuffles.

    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VVM) actually figuring out
    what can be done in parallel, makes life much simpler and makes for
    transparent portability between the smallest and largest instantiation.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 27 13:58:34 2026
    From Newsgroup: comp.arch

    On 4/26/2026 5:30 PM, Stefan Monnier wrote:
    BGB [2026-04-26 12:27:06] wrote:
    On 4/26/2026 5:26 AM, Thomas Koenig wrote:
    [...]
    However, there are use cases which VVM or similar systems do not
    cover. The main one is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    [...]
    One practical limit IMO is:
    Don't go wider than 4 elements (without a very good reason or in very
    special cases).

    My understanding is that it might be worth distinguishing the case of
    SIMD used for "tuples" and SIMD used for "arrays":

    - By "tuples" I mean data that is inherently of fixed size, such as the
    3 or 4 element vectors used to represent a point in 3D space, where
    each element has a specific role.
    SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
    Not sure how much shuffle they may need.


    Shuffles are frequent. Some operations like cross-product or similar
    need a lot of this sort of thing.

    Though, there are a lot of operations as well where elements don't cross.

    So, common types of operations are:
      Per-element Add
      Per-element Scale
        Multiply each element by a scalar
        The famous "frsqrt" being a very common use-case here.
        It is one of the more common things to scale a vector by...
      Dot Product
      Cross Product
      ...

    Meanwhile:
    Per-Element Mul is typically used as a way to implement one of the other
    operations, but is rarely the end goal in itself.

    Situations where Per-Element MAC could be useful are frequent, but in
    most cases double-rounding would be acceptable (single rounding is
    needed in some niche cases, but not usually in typical uses of SIMD in
    this way).

    Some operations, like a Quaternion Multiply, are effectively a
    cross-product with a modified Dot-Product.
    i/j/k (or x/y/z): Mostly a normal 3D cross product
    A few extra terms, mostly involving the r/w component.
    r or w: Aw*Bw-Ax*Bx-Ay*By-Az*Bz

    Though, one other way to see it is:
    The cross product is a quaternion product just with the r/w component
    always set to 0 (if this element is kept as 0 in 3D vectors, one could conceivably implement the usual 3D vector operations by reusing the
    quaternion operations, though a quat Mul would be more expensive than a
    normal Cross Product, ...).

    In both cases, these are fairly shuffle-heavy operations.
    Also BGBCC supports both but (even with SIMD) does neither operation
    inline (cross product and quaternion multiply exceeding the complexity
    where doing it inline makes sense).
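
    For concreteness, the standard quaternion (Hamilton) product that the
    above describes, in scalar C; in a SIMD version each term on the right
    is a shuffled copy of A or B, which is why it is so shuffle-heavy:

        typedef struct { float x, y, z, w; } quat;  /* w is the real part */

        static quat quat_mul(quat a, quat b)        /* note: a*b != b*a */
        {
            quat r;
            r.x = a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y;
            r.y = a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x;
            r.z = a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w;
            r.w = a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z;
            return r;
        }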

    Can also note that some languages like GLSL implicitly embed the shuffle
    operations directly into fetching a vector from another vector:
      v.xxxx, v.zxyw, ...
    Probably not hard to guess what is happening here.

    In BGBCC, had also added an extended special case to allow negating and zeroing members.
    v.xyzW //capital W negates it.
    v.xyz_ //gives a 4 element vector with w zeroed.
    In my ISA, this typically adds an additional instruction in addition to
    the shuffle (so, first shuffle and then negate/mask).


    Otherwise, quaternion math is very similar to normal 4D vector math
    (though, some people seem to instead prefer to imagine 3D and 4D vectors
    as 1x4 or 4x1 matrices and use matrix math as a basis, rather than
    treating 3D space as a subset of a 4D quaternion space).

    Well, and if a person wanted, they could probably define the usual 3D
    frustum transform and projection and transform math in terms of
    quaternion calculations rather than a matrix transform.
    Say, something to the effect of:
    MvV = ((Vtx*MvScale)*MvRotate+MvTranslate)
    PrV = MvV * ProjFrustum * ProjScrScale + ProjScrAdj
    ScXyz = PrV.xyz / PrV.w

    Also wouldn't mix well with OpenGL, where things typically work
    explicitly in terms of 4x4 matrix math (and would make sense mostly in a context where quaternion multiply is significantly faster than a matrix multiply).

    Note that (like matrix math): A*B is not equivalent to B*A.

    Note also, the complex plane can also exist as a subset of the
    quaternion space much as the reals can exist as a subset of the complex
    plane.

    Though, people traditionally store complex numbers as (real,imaginary)
    rather than (imaginary,real) or (imaginary,0,0,real). Though, the
    "traditional math notation" would be more like:
    C = ai + 0j + 0k + b

    Technically, any of i/j/k could be used, but as soon as values are mixed
    with different components, the calculation would leave the complex plane.


    In practice though, people usually "just wing it".

    ...



    A lot of RGB math can also be mapped to 3D vector math, but couldn't
    really map RGBA onto quaternions as the behavior of the real component
    and alpha channel are quite different.

    Well, and also no one has really bothered with formal mathematics
    definitions for things like RGBA and Alpha blending behavior (would be
    funny if someone did so, maybe call it "Alpha-Nu Vector Calculus" or something, then if anyone looks at it, it is very obviously just OpenGL
    or maybe Photoshop style color-blending rules, but explained in the sort
    of overly pretentious ways typical of how most math subjects are
    presented, and throw in some random first-order logic and other stuff
    for good measure so that everything looks good and cryptic; and one had
    to wade through a bunch of esoteric rambling to find the parts that are actually relevant).


    Well, sorta like physics which has seemingly split into 3 major groups:
    Pop-culture version:
    Lots of wonk, stuff from sci-fi or TV shows,
    discredited theories, and general pseudoscience.
    Mathematical esoterica:
    Bunch of cryptic stuff pretty much no one understands;
    Practical applied stuff:
    Take basic properties one cares about,
    implement on a computer or such.
    Most of the time just simplified Newtonian mechanics or such.
    Typically the bare minimum to deal with the task at hand (*1).
    Commonly used in 3D games.
    Maybe FEM or CFD in mechanical engineering stuff, ...


    *1:
    Level 1: Quake-Style (a minimal code sketch follows this list):
    Bounding boxes may fly through space and hit stuff;
    If they hit something, cancel velocity in that direction;
    Apply gravity at each time-step;
    If gravity would pull it into something solid,
    cancel the gravity move, and apply friction (if not in-water).
    If in water, apply water effects.
    Level 1.5: Extended, Quake Style (Minecraft or Half-Life)
    Add things like pushing forces between entities and similar.
    So, if you run into a walking entity or similar:
    It may be pushed, velocity is reduced but not canceled entirely.
    May add things like local gravity vectors.
    Say, for example, gravity and orientation can rotate.
    Eg: "Serious Sam", "Super Mario Galaxy", etc.
    Level 2: Basic Rigid Body (Eg: Half-Life 2, Portal, Doom 3, etc)
    Entities may have rotations and similar (often quaternions);
    Support for more complex bounding solids;
    Things like point of contact, torque forces, etc, become relevant.
    Usually player and enemy movement stick to Quake-style physics.
    Trying to use rigid body physics on players is janked.
    Level 3: Soft Body stuff
    Usually limited to cosmetic only effects (cloth, hair, etc).
    Computationally expensive.
    Ironically, not usually used for "jiggle physics":
    Which are more typically done via a skeletal / ragdoll system.
    So, leave most bones under control of the normal animations.
    Except those for the jiggly bits, that use ragdoll behavior.
    Then sometimes overused to a point of being annoying/distracting;
    Works best if skeletal system supports multi-bone weighting.
    ...
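
    A minimal sketch of the Level 1 step referenced above (illustrative
    assumptions only: a single ground plane at z=0 and gravity along -z):

        typedef struct { float x, y, z; } vec3;

        void phys_step(vec3 *pos, vec3 *vel, float dt,
                       float gravity, float friction)
        {
            /* apply gravity at each time-step */
            vel->z -= gravity * dt;

            /* fly through space */
            pos->x += vel->x * dt;
            pos->y += vel->y * dt;
            pos->z += vel->z * dt;

            /* hit something solid: cancel velocity in that direction
               and apply friction to the horizontal motion */
            if (pos->z < 0.0f) {
                pos->z = 0.0f;
                vel->z = 0.0f;
                vel->x *= friction;
                vel->y *= friction;
            }
        }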

    For a lot of this stuff, while one could use 4x4 matrix math for
    everything, using matrix calculations directly doesn't tend to work very
    well, so it makes sense to keep things like translation/rotation/scale
    as separate vectors/quaternions during calculations, and then to
    generate a final matrix from these when needed (such as for collision-detection or 3D rendering or similar).



    - By "arrays", I mean data of a size that can be much larger than 3-4
    elements and which often/usually varies dynamically.
    When handling such data, SSE/AVX need to wrap the SIMD instructions
    inside loops. vVM should handle that much better in most cases.


    Potentially.


    Many proponents of wider SIMD seem to assume the array-like case as the dominant or sole use-case, and the SIMD ISA's seem to assume this,
    rather than say, a SIMD vector where each element is itself a vector (or tuple).

    The latter case would be, say, to assume that a 16-element vector
    nominally contains something resembling a 4x4 matrix and not a
    16-element linear array.

    A 16-element SIMD could be more useful for my use-cases if each vector
    were a 4x4 matrix and one could effectively perform operations like
    matrix transpose and matrix multiply, and more easily recompose each
    matrix from discrete vectors as elements.
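
    As a sketch of what that looks like in software terms (my framing, not
    BGBCC's actual code): each row of the matrix is one 4-wide vector, and
    a 4x4 multiply is four broadcasts plus four 4-wide MACs per row.

        typedef struct { float v[4]; } vec4;     /* one 4-wide register   */
        typedef struct { vec4 row[4]; } mat4;    /* a "vector of vectors" */

        static vec4 vmac(vec4 acc, float s, vec4 b)  /* acc += s*b, per lane */
        {
            for (int i = 0; i < 4; i++) acc.v[i] += s * b.v[i];
            return acc;
        }

        static mat4 mat4_mul(mat4 a, mat4 b)
        {
            mat4 r;
            for (int i = 0; i < 4; i++) {
                vec4 acc = {{0, 0, 0, 0}};
                /* row i of A*B = sum over k of A[i][k] * row k of B */
                for (int k = 0; k < 4; k++)
                    acc = vmac(acc, a.row[i].v[k], b.row[k]);
                r.row[i] = acc;
            }
            return r;
        }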


    Even with 64 GPRs or similar, dealing with things like matrix multiply
    without blowing out the register budget is still pretty steep.

    It is more manageable with Binary16 vs Binary32, since each matrix needs
    half as many working registers.


    And, in my case, 2/3/4 element vectors (or "tuples") were the dominant use-case.


    For the arrays case, it might be worth thinking about what kind of
    shuffle is needed and why. IIUC the shuffle is not needed over the
    whole array. Instead, it shows up for example when you do a reduction
    on the array and after doing a fast traversal of the array you end up
    with N partial-reduction results and the shuffles are needed to do the remaining log N steps to combine those N results.

    What other cases of shuffles show up?
    What would be the best way to handle them?

    On a related note, I'd love to see papers that try to reproduce in
    a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
    corresponds to a set of vectors and the control flow of SIMT warps could
    be represented as mask bitvectors, but then there's the automatic scheduling/switching between warps, plus all kinds of other details
    where the mapping between GPUs and vector CPUs doesn't seem so obvious.


    Most likely I would think it would make sense (if going this route) to
    use large vectors, but then treat the vector as a vector of short vectors.

    So, for example, a shuffle over a 16-element vector wouldn't be a 16-way
    shuffle, but rather a 4-way shuffle applied over each group of 4.

    Or, maybe could have a 32-bit shuffle, to allow each group of 4 to be
    shuffled independently, with a separate shuffle for the groups of 4.
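
    A sketch of the group-wise interpretation (my illustration): the same
    4-lane permutation applied to each group of 4 within a 16-element
    vector, rather than a full 16-way shuffle.

        #include <stdint.h>

        void shuffle_16_as_4x4(uint32_t dst[16], const uint32_t src[16],
                               const uint8_t perm[4])   /* perm[i] in 0..3 */
        {
            for (int g = 0; g < 4; g++)        /* each group of 4 */
                for (int i = 0; i < 4; i++)
                    dst[g*4 + i] = src[g*4 + perm[i]];
        }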


    Wouldn't expect the designers of any mainline SIMD ISA's to go this
    direction though...



    === Stefan

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Apr 27 19:22:21 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    -------------------
    SIMD as you describe only adds the superluminal properties of x86
    SIMD.

    superluminal?

A single SIMD instruction nowhere near "in a loop".
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Apr 27 19:32:28 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a very
    wide data width. Stopping at 4-wide is as sensible as stopping at 1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.

    You obviously don't know what are you talking about. That's not new.

    Modern GPUs are best described as multicore processors with each core
    having SIMD width comparable to that of many (not all) CPUs.

    GPU people talk of x lanes of calculation tied to a single instruction
    {ala Burroughs Scientific Processor} where each lane has its own register
    file. Where x = {8, 16, 32, 64, or larger}

    My understanding, not necessarily precisely correct, but necessarily
    close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada
    Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM. I.e.
    Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu
    CPUs.

    Plus, nowadays they have outer product (a.k.a. tensor) engines, but
    that's OT.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Apr 27 19:42:10 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    BGB [2026-04-26 12:27:06] wrote:
    On 4/26/2026 5:26 AM, Thomas Koenig wrote:
    [...]
    However, there are use cases which VVM or similar systems do not
    cover. The main one are is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    [...]
    One practical limit IMO is:
    Don't go wider than 4 elements (without a very good reason or in very special cases).

    My understanding is that it might be worth distinguishing the case of
    SIMD used for "tuples" and SIMD used for "arrays":

    An interesting observation--congratulations!

    - By "tuples" I mean data that is inherently of fixed size, such as the
    3 or 4 element vectors used to represent a point in #3 space, where
    each element has a specific role.
    SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
    Not sure how much shuffle they may need.

    When that calculation is in a Loop, I suspect vVM is competitive.

    - By "arrays", I mean data of a size that can be much larger than 3-4
    elements and which often/usually varies dynamically.
    When handling such data, SSE/AVX need to wrap the SIMD instructions
    inside loops. vVM should handle that much better in most cases.

    Especially those cases where sizeof(ARRAY) mod 4 ~= 0 or sizeof()
    is unknown at compile time.

    For the arrays case, it might be worth thinking about what kind of
    shuffle is needed and why.

    In loop forms, shuffle is simply memory-indexing, while scatter/gather
    is memory-indirect.
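In loop form both reduce to indexed addressing; plain C, just to make the
pattern a VVM-style loop would present to the hardware concrete:

    /* "Shuffle" as an indexed load inside the loop (gather), and the
       corresponding indexed store (scatter). */
    void shuffle_loop(double *dst, const double *src, const int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[idx[i]];          /* indexed read  */
    }

    void scatter_loop(double *dst, const double *src, const int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            dst[idx[i]] = src[i];          /* indexed write */
    }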

    IIUC the shuffle is not needed over the
    whole array. Instead, it shows up for example when you do a reduction
    on the array and after doing a fast traversal of the array you end up
    with N partial-reduction results and the shuffles are needed to do the remaining log N steps to combine those N results.

    Vector reduction {in all its forms} is not well done in SIMD if you want
    the same result as you get from scalar code.
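For reference, the usual SIMD tail of a sum reduction combines the N
partial sums with log N shuffle+add steps; the association order differs
from a left-to-right scalar loop, which is one reason the FP results can
differ. A generic SSE sketch (not code from this thread):

    #include <xmmintrin.h>

    /* Combine 4 partial sums into one scalar via two shuffle+add steps. */
    static float hsum4(__m128 v)
    {
        __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* swap pairs   */
        __m128 sums = _mm_add_ps(v, shuf);                           /* {01,10,23,32} */
        shuf = _mm_movehl_ps(shuf, sums);                            /* bring 23 down */
        sums = _mm_add_ss(sums, shuf);                               /* 01 + 23       */
        return _mm_cvtss_f32(sums);
    }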

    What other cases of shuffles show up?
    What would be the best way to handle them?

    On a related note, I'd love to see papers that try to reproduce in
    a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
    corresponds to a set of vectors and the control flow of SIMT warps could
    be represented as mask bitvectors, but then there's the automatic scheduling/switching between warps, plus all kinds of other details
    where the mapping between GPUs and vector CPUs doesn't seem so obvious.

    In a GPU one can take a tessellated globe and a bit-map of the earth
    then use Texture-LDs to create a planet in space--no calculation
    instructions! {Sure there are zillions of calculations--all buried
    in texture memory access.}


    === Stefan
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 27 20:02:16 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    There is little doubt (in my mind, at least) that an abstraction
    such as the one offered by VVM is better and easier to use
    for compilers, than the current state of the art where SIMD-based
    vectorization is used. One need only look at the number of bugs
    blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 to
    (and the fraction of unresolved bugs) to see how difficulat that is.

    However, there are use cases which VVM or similar systems do not
    cover. The main one are is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    routines are usually written using (micro-)architecture-specific
    intrinsics. Other operations which could be useful are "match"
    operations known from graphics cards where bit n is set on lane
    m if lane n and m hold the same value.

    Is this worth it? CPU manufacturers seem to think so, they devote
    considerable on-chip resources to shuffles. Routines that use
    these are likely to be highly specialized, hard to write (needs
    somebody like Terje) and, if it is used a lot, can give significant
    speedup.

    Shuffles and mixes are handled automatically by VMM, as long as you have register names available to describe each part of the shuffle.

    So, no variable shuffles (at least from what I understood).

    It does
    lead to much larger code to describe an arbitrary 16->16 shuffle
    operation,

    With two restrictions in addition to the one above: This also
    increases the register pressure enormously, and you do not have
    the result in a register afterwards because VVM is memory to memory.
    You can load, shuffle and store, but you cannot load, shuffle and
    do some operation on it.

but for most algorithms not having to SIMD it removes the
    actual need for shuffles.

    To take an extreme case: If you have a few minutes to spare, maybe
    it could be interesting for you to take a look at https://gitlab.ethz.ch/extra_projects/fastjson/-/blob/master/src/scan.c

    The person who wrote that (certainly not me :-) tried to create
    the fastest JSON parser in the West, and used highly aggressive
    AVX512 to do this. (May not be 100% debugged).

    The function per_element_level() is quite interesting. It calculates
    the nesting depth of JSON elements in parallel, using packets of 64
    bytes. It does so by a clever combination of shifting

    Coding this in VVM would introduce a serial dependency.
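A scalar version of that job (a sketch; the real per_element_level() may
count brackets slightly differently) makes the dependency explicit: every
byte's depth depends on the running count from the byte before, i.e. a
prefix sum, which the AVX512 code parallelizes within each 64-byte packet.

    #include <stddef.h>

    /* Per-byte nesting depth of a JSON-like buffer; the running count d is
       a loop-carried dependency, so a naive load/op/store loop is serial. */
    void per_element_depth(const char *buf, int *depth, size_t n)
    {
        int d = 0;
        for (size_t i = 0; i < n; i++) {
            if (buf[i] == '{' || buf[i] == '[') d++;
            depth[i] = d;
            if (buf[i] == '}' || buf[i] == ']') d--;
        }
    }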

Is this extremely high performance code? Yes. Is this possible
    in VVM? No. Is this faster than what can be achieved with VVM?
    I certainly believe so. Will such code be written by many people,
    or emitted by a compiler for "standard" code? Certainly not.
    Should such code be possible in a new architecture? I think so.



    How should a new architecture deal with it? Not going down that
    path and forsaking the high-performance gains possible is one option,
    for example if one wants to keep interrupts fast.

    Otherwise, what decisions could be taken in designing such
    a SSIMD?

    Register width would be one concern. It could make sense to have
    sub-architectures with several widths, where a feature enquiry
    could be used to branch to several versions of code.

    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VMM) actually figuring out
    what can be done in parallel makes life much simpler, and makes for transparent portability between the smallest and largest instantiation.

    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
and the fraction of *running time* spent there could be much higher.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Apr 27 20:22:47 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    -------------------
    SIMD as you describe only adds the superluminal properties of x86
    SIMD.

    superluminal?

    A single SIMD instruction no where near "in a loop".

    I still don't understand.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 27 15:27:16 2026
    From Newsgroup: comp.arch

    On 4/27/2026 1:56 AM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    SIMD data path yes, absolutely.
    SIMD instructions at best maybe.

    I think this points to a way to express shuffling in VVM:

    Have the shuffling specifier as array (of, say, bytes) in memory.
    When the program does something like:

    loop over i
    p = shuffle[i]
    t = a[p]
    b[i] = t
    end

    the hardware can assemble it into uops of the SIMD data path that it
    has. The loop can contain additional instructions and maybe a
    different way of using the result than storing it into b, but the
    pattern to recognize is

    p = shuffle[i]
    t = a[p]

    The size of the SIMD data path and of the elements in shuffle
    determines how many microinstructions are necessary. E.g., if the
    SIMD data path can handle up to 16 elements of a in one uop, and
    shuffle contains values <32, each SIMD width requires 1 load from
    shuffle, 2 loads from a, 2 shuffling uops and a merge. If shuffle
    contains too large values compared to the SIMD width, falling back to
    scalar may be more economical, but then SIMD ISAs would not have
    supported the operation, either.

    The size of the elements of shuffle comes from the data path, and is
    needed in the decoder, which is usually a problem, and usually solved
    with a predictor. That can be done here, too. One can also imagine communicating the maximum size through the ISA, which may also serve
    as a hint for the uarch to use shuffling

    BGB makes the point where SIMD should stop at 4-wide. I mostly
    agree.

    SIMD is useful for data-parallel tasks, many of which have a very wide
    data width. Stopping at 4-wide is as sensible as stopping at 1-wide.

    But given that VVM means that SIMD width is microarchitectural,
    mistakes like BGB's point would have little consequence, because one
    can always go wider while still using the same software.


    A lot depends on how it is being used.


    My use cases were not typically data-parallel in the sense that wide
    SIMD is usually designed for, and would more often be better suited (for getting what parallelism exists) by having a lot of registers...


    The main reason for not stopping at 1 wide, is that then one needs too
    many registers...

    But, if the limit on the number of registers is lower, say 8 or 16,
    there is more pressure to have wider SIMD for similar parallelism.


    I don't see this as a mistake either way, but more a design tradeoff.


    But, on the other side, highly data-parallel array-processing tasks
    could be considered as "vector tasks" and potentially categorized
    differently (as Stefan had suggested).


    But, as noted, my own use-cases, while often using arrays, have been
    more often dominated by short vector calculations rather than long
    arrays of scalar calculations.

    And, as noted, 3D XYZ or RGB coordinates aren't necessarily getting
    wider. And, when one is working with multiple of them, it is often in a
    fixed configuration, such as a triangle or quad (and the generic "array
    of lots of items" space is above the level of the individual primitive,
    and if one may need to subdivide primitives or call other functions per-primitive, large vectors don't work here).

    Or, say another scenario:
    Can very wide SIMD help evaluate questions like which bounding box in a
    tree of bounding boxes a 3D point falls inside?...

    Once again, it ends up as a case where 4 elements is amazing, but 8 or
    16 not so much.
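E.g., the per-node test is only a handful of 4-wide operations and does
not get cheaper with wider vectors; a generic SSE sketch (not code from
this thread), with the point and box corners kept as x/y/z plus a padding
lane:

    #include <xmmintrin.h>

    /* Returns nonzero if point p lies inside the axis-aligned box [bmin, bmax];
       only the x/y/z lanes of the compare mask are tested. */
    static int point_in_box(__m128 p, __m128 bmin, __m128 bmax)
    {
        __m128 ge   = _mm_cmpge_ps(p, bmin);
        __m128 le   = _mm_cmple_ps(p, bmax);
        int    mask = _mm_movemask_ps(_mm_and_ps(ge, le));
        return (mask & 7) == 7;
    }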

    ...


    But, OTOH, if doing something like audio FFT or similar, yeah, very wide
    or variable-sized vectors would work nicely...


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 27 15:55:35 2026
    From Newsgroup: comp.arch

    On 4/27/2026 4:23 AM, Michael S wrote:
    On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a very
    wide data width. Stopping at 4-wide is as sensible as stopping at
    1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.

    You obviously don't know what are you talking about. That's not new.

    Modern GPUs are best described as multicore processors with each core
    having SIMD width comparable to that of many (not all) CPUs.
    My understanding, not necessarily precisely correct, but necessarily
    close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada
    Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM. I.e.
    Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu
    CPUs.



    From what I can gather, it is a sort of SIMT where although wide SIMD
    is used for the actual work, to each "thread" it looks like it is
    working with something much narrower (like a scalar value or 2 element vector), with wider vectors typically implemented using multiple
    registers (from a very wide register set).



Can also note that in my thinking, saying that SIMD is mostly limited to
a width of 4 at the ISA level would *not* be the same as saying that
    the CPU itself would have a hard limit of 4 FPU operations per clock cycle.


Like, you could still have 4-wide vectors with an 8- or 12/16-wide SIMD unit
    inside the CPU, provided the unit could be kept fed via the instruction
    stream well enough to make it worthwhile.

Much like how there is no reason to limit the CPU to a single ALU or 1 instruction per clock, even if, to a casual programmer, it appears as if
    the ISA is running everything sequentially.

    My thinking then is more:
    What if, we treat SIMD more like how we treat the integer ALUs?...

Like, try to make it more so that, if the CPU has the resources to do
    so, nothing exists to stop it from co-issuing the SIMD instructions?...


    The push towards ever wider SIMD makes more sense if one primarily
    assumes that they can't be performed superscalar.
    But, as I see it, this is not necessarily a valid assumption.



    Plus, nowadays they have outer product (a.k.a. tensor) engines, but
    that's OT.


    Yes, there are use-cases for this.


    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 27 16:42:19 2026
    From Newsgroup: comp.arch

    On 4/27/2026 2:42 PM, MitchAlsup wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    BGB [2026-04-26 12:27:06] wrote:
    On 4/26/2026 5:26 AM, Thomas Koenig wrote:
    [...]
    However, there are use cases which VVM or similar systems do not
    cover. The main one are is in-register permutes (aka shuffles)
    which do not go through memory. These can offer significant speed
    advantages (as in factors) for performance-critical code. Such
    [...]
    One practical limit IMO is:
    Don't go wider than 4 elements (without a very good reason or in very
    special cases).

    My understanding is that it might be worth distinguishing the case of
    SIMD used for "tuples" and SIMD used for "arrays":

    An interesting observation--congratulations!


    Agreed.

    Though I guess I wasn't obvious enough in my response.



    - By "tuples" I mean data that is inherently of fixed size, such as the
    3 or 4 element vectors used to represent a point in #3 space, where
    each element has a specific role.
SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
Not sure how much shuffle they may need.

    When that calculation is in a Loop, I suspect vVM is competitive.

    - By "arrays", I mean data of a size that can be much larger than 3-4
    elements and which often/usually varies dynamically.
    When handling such data, SSE/AVX need to wrap the SIMD instructions
    inside loops. vVM should handle that much better in most cases.

    Especially those cases where sizeof(ARRAY) mod 4 ~= 0 or sizeof()
    is unknown at compile time.

    For the arrays case, it might be worth thinking about what kind of
    shuffle is needed and why.

    In loop forms, shuffle is simply memory-indexing, while scatter/gather
    is memory-indirect.


    I guess, one can also differentiate between a shuffle operation
    (reordering elements), and a "pack" operation (combining parts of two registers). Where, SSE/AVX tends to conflate these.


    IIUC the shuffle is not needed over the
    whole array. Instead, it shows up for example when you do a reduction
    on the array and after doing a fast traversal of the array you end up
    with N partial-reduction results and the shuffles are needed to do the
    remaining log N steps to combine those N results.

    Vector reduction {in all its forms} is not well done in SIMD if you want
    the same result as you get from scalar code.


    For cases best served by scalar, my preference is mostly to keep scalar.


    Trying to take some generic C loop and force it into some absurd mess of
    SIMD operations isn't something I am well fond of, even if some people
    seem to imagine this is mostly what SIMD is for in the first place...



    What other cases of shuffles show up?
    What would be the best way to handle them?

    On a related note, I'd love to see papers that try to reproduce in
    a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
    corresponds to a set of vectors and the control flow of SIMT warps could
    be represented as mask bitvectors, but then there's the automatic
    scheduling/switching between warps, plus all kinds of other details
    where the mapping between GPUs and vector CPUs doesn't seem so obvious.

    In a GPU one can take a tessellated globe and a bit-map of the earth
    then use Texture-LDs to create a planet in space--no calculation instructions! {Sure there are zillions of calculations--all buried
    in texture memory access.}


    Even going part way, was how I ended up with LDTEX:

    LDTEX would collapse a series of 6 or so instructions into 1 instruction
    (or would be more, if not for helper instructions for things like Morton shuffles and dealing with compressed texture blocks).

    Partial factor why software rasterized OpenGL was viable in my ISA, but
    not so much in RISC-V. Kinda really need a lot of specialized helper ops
    to make this sort of thing viable.


    And, SIMD (or lack thereof) is why, even when it does a system call for
    the backend rendering parts, GLQuake still performs like epic dog crap
    in a plain RV64GC build.


    Though, I can note, for better or worse, I have a SIMD ISA partly
    designed around whatever best aligned with what I needed when trying to implement OpenGL (with SIMD extensions in C partly designed around a
    pattern of "what if I bolt GLSL stuff onto C, but then just end up
    mostly wrapping it in macros?...").


    Though, for better results:
    Still don't use the glBegin/glEnd/glVertex* stuff.
This adds boilerplate overhead that can't be optimized away.



    === Stefan

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Mon Apr 27 21:54:22 2026
    From Newsgroup: comp.arch

    On Mon, 27 Apr 2026 19:32:28 GMT, MitchAlsup wrote:

    GPU people talk of x lanes of calculation tied to a single
    instruction {ala Burroughs Scientific Processor} where each lane has
    its own register file. Where x = {8, 16, 32, 64, or larger}

    Also, they are fond of conditional-execution instructions, are they
    not. So each processing unit can do something slightly different,
    depending on the data it holds, while continuing to execute exactly
    the same instruction as all the other units.

    That way, the architecture remains “SIMD”, instead of turning into “MIMD”.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Apr 27 17:39:41 2026
    From Newsgroup: comp.arch

    On 4/27/2026 3:22 PM, Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    -------------------
    SIMD as you describe only adds the superluminal properties of x86
    SIMD.

    superluminal?

    A single SIMD instruction no where near "in a loop".

    I still don't understand.


    My guess:
    Goes as fast as it can go and can't be made faster?...



    But, yeah, say for example, one has something like:
    void SomeEntity_ApplyGravity(Entity *self)
    { self->impulseVelocity += sv_gravity * sv_ticktime; }

    SIMD operations may make this faster than scalar math would be...

    But, once it gets as fast as it can be, say, in XG3 or similar (*1):
SomeEntity_ApplyGravity:
    MOV.L    sv_ticktime, R13
    MOV.X    sv_gravity, R14
    MOVLD    R13, R13, R16
    MOVLD    R13, R13, R17
    MOV.X    (R10, disp), R12
    PMULX.F  R14, R16, R14
    PADDX.F  R12, R14, R12
    MOV.X    R12, (R10, disp)
    RTS

    Is there really much obvious way to make something like this all that
    much faster?...

    The limiting factors in stuff like this can't be made faster by making
    the SIMD wider or more advanced.

    Well, and assume that such a function is called via something like:
    ent->BasePhysicsTick(ent);

    Where the compiler sorta has its hands tied.



*1: Nevermind if this specific example is purely illustrative and would violate ABI rules in this case (accessing globals while being called as a
    function pointer means it needs a prolog/epilog and to perform a GP
    reload; but then any cycle cost of the SIMD instructions is drowned out
    by the overheads of the function itself).

    Well, then again:
    void SomeEntity_ApplyGravity(Entity *self)
    { self->impulseVelocity +=
    self->world->sv_gravity *
    self->world->sv_ticktime; }

    Would avoid the need for a prolog/epilog in this case (no access to
    global variables).

    ...

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Apr 28 00:03:42 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    -------------------
    SIMD as you describe only adds the superluminal properties of x86
    SIMD.

    superluminal?

    A single SIMD instruction no where near "in a loop".

    I still don't understand.

    Some random SIMD instruction does the data-manipulation you want
    done {and most likely was not designed to do that one specific
    unit of work but is a side effect of what the instruction can do}.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Apr 28 00:11:32 2026
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Mon, 27 Apr 2026 19:32:28 GMT, MitchAlsup wrote:

    GPU people talk of x lanes of calculation tied to a single
    instruction {ala Burroughs Scientific Processor} where each lane has
    its own register file. Where x = {8, 16, 32, 64, or larger}

    Also, they are fond of conditional-execution instructions, are they
    not. So each processing unit can do something slightly different,
    depending on the data it holds, while continuing to execute exactly
    the same instruction as all the other units.

    What most often happens with GPU conditional execution::

    01234567890123456789012345678901 ; lanes
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ; every body executes
    !!!---!---!!---!!!!!!--!!-!!-!-- ; then clause
    !!!---!---!!---!!!!!!--!!-!!-!--
    !!!---!---!!---!!!!!!--!!-!!-!--
    !!!---!---!!---!!!!!!--!!-!!-!--
    ---!!!-!!!--!!!------!!--!--!-!! ; else clause
    ---!!!-!!!--!!!------!!--!--!-!!
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ; every body executes

    It is easy to see how each lane computes its "execute this" bit
    in vector form over the WARP.

    In effect, each conditional that cannot be applied to all lanes
    leads to a 50% reduction in throughput.
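A scalar model of what that costs (an illustration, not code from this
thread): both clauses are issued across the whole warp, and the per-lane
mask only decides which lanes commit, so the inactive lanes' slots are
wasted in each clause.

    /* Then- and else-clauses each run over every lane; a lane does useful
       work in exactly one of them, hence the ~50% throughput on divergence. */
    void simt_if(float *r, const float *a, const int *exec_mask, int lanes)
    {
        for (int l = 0; l < lanes; l++)              /* "then" pass */
            if (exec_mask[l])  r[l] = a[l] * 2.0f;   /* example then-work */
        for (int l = 0; l < lanes; l++)              /* "else" pass */
            if (!exec_mask[l]) r[l] = a[l] + 1.0f;   /* example else-work */
    }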

    That way, the architecture remains “SIMD”, instead of turning into “MIMD”.

    SIMT where T means Thread instead of Data.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Tue Apr 28 01:17:03 2026
    From Newsgroup: comp.arch

    On Tue, 28 Apr 2026 00:11:32 GMT, MitchAlsup wrote:

    SIMT where T means Thread instead of Data.

    Is that a meaningful distinction? What is the point of multiple units
    executing the same instruction, if not to operate on different data?
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Tue Apr 28 05:46:30 2026
    From Newsgroup: comp.arch

    On Sun, 26 Apr 2026 18:30:31 -0400, Stefan Monnier wrote:

    - By "tuples" I mean data that is inherently of fixed size, such as
    the 3 or 4 element vectors used to represent a point in #3 space,
    where each element has a specific role.
    SSE/AVX approaches seem to work OK for such data, maybe better than
    vVM.

    Looking at some docs on Bitsavers from back in the Cray-1 era or soon
    after, they said that the break-even point for using the array
    instructions over the scalar ones was passed at arrays of as few as 2
    elements. That is, even with the array-instruction setup overhead, it
    was quicker to operate on a set of 2-element arrays than it was to do
    two sets of scalar operations.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Apr 28 14:36:21 2026
    From Newsgroup: comp.arch

    On Mon, 27 Apr 2026 19:32:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a
    very wide data width. Stopping at 4-wide is as sensible as
    stopping at 1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.

    You obviously don't know what are you talking about. That's not new.

    Modern GPUs are best described as multicore processors with each
    core having SIMD width comparable to that of many (not all) CPUs.

    GPU people talk of x lanes of calculation tied to a single instruction
    {ala Burroughs Scientific Processor} where each lane has its own
    register file. Where x = {8, 16, 32, 64, or larger}

    Can you point me to example of "or larger" among current high-volume
    GPU products?
    My impression is that even x=64 is implemented as pair of x=32 ALUs
    running in lock step.
    My understanding, not necessarily precisely correct, but necessarily
    close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM.
    I.e. Nvidia GPUs have exactly the same SIMD width as AMD, Intel and
    Fujitsu CPUs.

    Plus, nowadays they have outer product (a.k.a. tensor) engines, but
    that's OT.

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Apr 28 17:45:12 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 27 Apr 2026 19:32:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a
    very wide data width. Stopping at 4-wide is as sensible as
    stopping at 1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.

    You obviously don't know what are you talking about. That's not new.

    Modern GPUs are best described as multicore processors with each
    core having SIMD width comparable to that of many (not all) CPUs.

    GPU people talk of x lanes of calculation tied to a single instruction
    {ala Burroughs Scientific Processor} where each lane has its own
    register file. Where x = {8, 16, 32, 64, or larger}


    Can you point me to example of "or larger" among current high-volume
    GPU products?

    I have insufficient data as the [micro]architectures are rarely allowed
    into public domain.

    My impression is that even x=64 is implemented as pair of x=32 ALUs
    running in lock step.

    Samsung (we) had 64 threads per instruction, operating over 4 clocks,
using 16 'ALUs', so we did not need forwarding even for FMAC instructions.

    My understanding, not necessarily precisely correct, but necessarily close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM.
    I.e. Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu CPUs.

    Plus, nowadays they have outer product (a.k.a. tensor) engines, but that's OT.



    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Apr 28 18:03:28 2026
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Tue, 28 Apr 2026 00:11:32 GMT, MitchAlsup wrote:

    SIMT where T means Thread instead of Data.

    Is that a meaningful distinction? What is the point of multiple units executing the same instruction, if not to operate on different data?

    Data is different--necessarily,
    Whether to execute (or not) is slightly different,
    It is compiled as if Scalar,
    Different threads can use different TLB entries,
    ...

    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 30 10:02:11 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 27 Apr 2026 19:32:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
    Lawrence D’Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:

    SIMD is useful for data-parallel tasks, many of which have a
    very wide data width. Stopping at 4-wide is as sensible as
    stopping at 1-wide.

    GPUs implement SIMD up to hundreds or thousands of units wide.

You obviously don't know what are you talking about. That's not new.
Modern GPUs are best described as multicore processors with each
    core having SIMD width comparable to that of many (not all) CPUs.

    GPU people talk of x lanes of calculation tied to a single instruction
    {ala Burroughs Scientific Processor} where each lane has its own
    register file. Where x = {8, 16, 32, 64, or larger}


    Can you point me to example of "or larger" among current high-volume
    GPU products?

I have insufficient data as the [micro]architectures are rarely allowed
into public domain.

    My impression is that even x=64 is implemented as pair of x=32 ALUs
    running in lock step.

    Samsung (we) had 64 threads per instruction, operating over 4 clocks,
    using 16 'ALUs', so we did not need forwarding even for FMAC instruc-
    tions.
    That's exactly how Larrabee started out, with four threads in a barrel scheduler so that, per thread, the next instruction ran 4 cycles later
    and never had to wait for anything to resolve (except memory of course).
    If there were less than four threads available, then they had to wait a
    clock or two, but only after a multi-cycle operation.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Apr 30 14:36:00 2026
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VMM) actually figuring out
    what can be done in parallel makes life much simpler, and makes for
    transparent portability between the smallest and largest instantiation.

    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
    and the *running time* could be much higher.

    Having written a lot more than my fair part of such code over the last
    45 years, I certainly agree. :-)

    Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.

    Besides, this is some of the most fun code to figure out, the fact that
    the results are sometimes actually useful is just gravy.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Apr 30 10:35:41 2026
    From Newsgroup: comp.arch

    On 4/30/2026 5:36 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VMM) actually figuring out
    what can be done in parallel makes life much simpler, and makes for
    transparent portability between the smallest and largest instantiation.

    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
    and the *running time* could be much higher.

    Having written a lot more than my fair part of such code over the last
    45 years, I certainly agree. :-)

    Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.

    This gave me an idea. People could create a new kind of "standard
    benchmark". Since, I believe, the algorithms, and in at least some
    cases, reference code is publicly and freely available, it would be
    possible to create a benchmark suite of these programs, with a set of
    standard input data for each (I am not sure it is worth it for Quake,
    and there may be others that should be included).

    A vendor could then present the results of running the benchmark two
    ways. One way would be using the "standard" C compiler. The other way
    would allow use of assembler to squeeze the best performance, but the assembler source code must be provided.

    Of course, this would be a supplement to things like Spec; certainly not
a replacement. Is this a hare-brained scheme, or could/should it be
    pursued?

    Of course, this is very preliminary and I welcome any comments.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Apr 30 17:51:09 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/30/2026 5:36 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VMM) actually figuring out
    what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
    and the *running time* could be much higher.

    Having written a lot more than my fair part of such code over the last
    45 years, I certainly agree. :-)

    Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.

    This gave me an idea. People could create a new kind of "standard benchmark". Since, I believe, the algorithms, and in at least some
    cases, reference code is publicly and freely available, it would be
    possible to create a benchmark suite of these programs, with a set of standard input data for each (I am not sure it is worth it for Quake,
    and there may be others that should be included).

    A vendor could then present the results of running the benchmark two
    ways. One way would be using the "standard" C compiler. The other way would allow use of assembler to squeeze the best performance, but the assembler source code must be provided.

    Of course, this would be a supplement to things like Spec; certainly not
    a replacement. Is this a hair brained scheme, or could/should it be pursued?

    Of course, this is very preliminary and I welcome any comments.

    Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
    of M88Ksim into a single <M88K> instruction and inline that at every
    call point. It is all about reading the compiler asm output and finding
    new things to put into the compiler.

    (*) Greenhouse compiler I believe.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Apr 30 18:16:13 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/30/2026 5:36 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VMM) actually figuring out
    what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
    and the *running time* could be much higher.

    Having written a lot more than my fair part of such code over the last
    45 years, I certainly agree. :-)

    Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last
    fraction of a percent, but with way larger usage in CPU years.

    This gave me an idea. People could create a new kind of "standard
    benchmark". Since, I believe, the algorithms, and in at least some
    cases, reference code is publicly and freely available, it would be
    possible to create a benchmark suite of these programs, with a set of
    standard input data for each (I am not sure it is worth it for Quake,
    and there may be others that should be included).

    A vendor could then present the results of running the benchmark two
    ways. One way would be using the "standard" C compiler. The other way
    would allow use of assembler to squeeze the best performance, but the
    assembler source code must be provided.

    Of course, this would be a supplement to things like Spec; certainly not
    a replacement. Is this a hair brained scheme, or could/should it be
    pursued?

    Of course, this is very preliminary and I welcome any comments.

    Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
    of M88Ksim into a single <M88K> instruction and inline that at every
    call point. It is all about reading the compiler asm output and finding
    new things to put into the compiler.

    Was the greenhills compiler available then? We were still using
    the Moto PCC-based M88K compiler in 1990.


    (*) Greenhouse compiler I believe.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 1 00:28:46 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 4/30/2026 5:36 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    The feature set should be constant across all vector widths.

Having all code scalar, with the hardware (VMM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
    transparent portability between the smallest and largest instantiation.

    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
    and the *running time* could be much higher.

Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)

Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last
fraction of a percent, but with way larger usage in CPU years.

    This gave me an idea. People could create a new kind of "standard
    benchmark". Since, I believe, the algorithms, and in at least some
    cases, reference code is publicly and freely available, it would be
    possible to create a benchmark suite of these programs, with a set of
    standard input data for each (I am not sure it is worth it for Quake,
    and there may be others that should be included).

    A vendor could then present the results of running the benchmark two
ways. One way would be using the "standard" C compiler. The other way
would allow use of assembler to squeeze the best performance, but the
assembler source code must be provided.

Of course, this would be a supplement to things like Spec; certainly not
a replacement. Is this a hair brained scheme, or could/should it be
    pursued?

    Of course, this is very preliminary and I welcome any comments.

    Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
    of M88Ksim into a single <M88K> instruction and inline that at every
    call point. It is all about reading the compiler asm output and finding
    new things to put into the compiler.

    Was the greenhills compiler available then? We were still using
    the Moto PCC-based M88K compiler in 1990.

    Somewhere in 1988 it first became available.


    (*) Greenhouse compiler I believe.
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri May 1 07:48:50 2026
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
    of M88Ksim into a single <M88K> instruction and inline that at every
    call point. It is all about reading the compiler asm output and finding
    new things to put into the compiler.

    Was the greenhills compiler available then? We were still using
    the Moto PCC-based M88K compiler in 1990.

    I worked on DG Aviions in 1990 and 1991, and we had a Green Hills
    compiler installed as well as gcc. We used gcc.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Fri May 1 08:14:01 2026
    From Newsgroup: comp.arch

    On Fri, 01 May 2026 07:48:50 GMT, Anton Ertl wrote:

    I worked on DG Aviions in 1990 and 1991, and we had a Green Hills
    compiler installed as well as gcc. We used gcc.

    Had Green Hills caught up to ANSI C by that point?

    I was a heavy user of Apple’s MPW development environment from the
    late 1980s onwards. They initially offered a C compiler licensed from
    Green Hills, which was not ANSI-compliant. Then with MPW 3.0, they
    replaced it with their own in-house-developed ANSI-compliant C
    compiler, the one with the slightly tongue-in-cheek (or is that passive-aggressive?) error messages.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri May 1 10:21:15 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VMM) actually figuring out
    what can be done in parallel makes life much simpler, and makes for
    transparent portability between the smallest and largest instantiation.

    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
    and the *running time* could be much higher.

    Having written a lot more than my fair part of such code over the last
    45 years, I certainly agree. :-)

    As a matter of fact, I had two persons in mind when I wrote this,
    and one of them was you :-)

    Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.

    Besides, this is some of the most fun code to figure out, the fact that
    the results are sometimes actually useful is just gravy.

    :-)

    Maybe another example, the "hello world" of high-performance
    computing: An 8*8 matrix kernel, so C = C + A*B.

    https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning, a row vector of C is loaded each
    iteration and stored at the end of each iteration, and B is loaded
    (and used) element-wise.
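For reference, the source behind such a kernel is presumably something
close to the classic fixed-size triple loop (a reconstruction, not the
exact file behind the godbolt link; index order and types may differ):

    #define N 8

    /* C = C + A*B for fixed 8x8 matrices; with N known at compile time gcc
       can fully unroll this and keep one operand entirely in vector registers. */
    void kernel8x8(double *restrict c, const double *restrict a,
                   const double *restrict b)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i*N + j] += a[i*N + k] * b[k*N + j];
    }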

    I do not see how VVM could express this equally succinctly;
    there are simply not the (architectural) registers to load
    the 64 values of A to start with. (Unless I am mistaken
    and there is a way to express this in VVM - Mitch?)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 1 21:48:51 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    The feature set should be constant across all vector widths.

    Having all code scalar, with the hardware (VMM) actually figuring out
    what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
    What you say is true for 99.99% or more of code, but less than 100%.
    Some people may write highly optimized code which runs in hot
    sections which is then used a lot by many unsuspecting people,
    and the *running time* could be much higher.

    Having written a lot more than my fair part of such code over the last
    45 years, I certainly agree. :-)

    As a matter of fact, I had two persons in mind when I wrote this,
    and one of them was you :-)

    Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.

    Besides, this is some of the most fun code to figure out, the fact that the results are sometimes actually useful is just gravy.

    :-)

    Maybe another example, the "hello world" of high-performance
    computing: An 8*8 matrix kernel, so C = C + A*B.

    https://godbolt.org/z/xd4PedTqv shows an example generated by
    gcc 16.1 (so no hand-generated assembly). This loads all of A
    into memory at the beginning, a row vector of C is loaded each
    iteration and stored at the end of each iteration, and B is loaded
    (and used) element-wise.

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.

    there are simply not the (architectural) registers to load
    the 64 values of A to start with. (Unless I am mistaken
    and there is a way to express this in VVM - Mitch?)

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 1 22:21:27 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    -------------------------------------------------------------

    Maybe another example, the "hello world" of high-performance
    computing: An 8*8 matrix kernel, so C = C + A*B.

    https://godbolt.org/z/xd4PedTqv shows an example generated by
    gcc 16.1 (so no hand-generated assembly). This loads all of A
    into memory at the beginning, a row vector of C is loaded each
    iteration and stored at the end of each iteration, and B is loaded
    (and used) element-wise.

    # define N 7 and the code all goes to heck !

    Is that really what one wants ?!?

    I do not see how VVM could express this equally succinctly;
    there are simply not the (architectural) registers to load
    the 64 values of A to start with. (Unless I am mistaken
    and there is a way to express this in VVM - Mitch?)

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 2 02:10:21 2026
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.
    -------------------------------------------------------------
    I think the below is correct !?! Hand compiled

    /* with
    R1 = &a[]
    R2 = &b[]
    R3 = &c[]
    R4 = i+jN
    R5 = i+kN
    R6 = k+jN
    R7 = jN
    R8 = kN
    */

mm8:
-------------------------------------------------------------
Loop1:
    MOV   R7,#0
-------------------------------------------------------------
Loop2:
    MOV   R8,#0
    MOV   R5,#0
    MOV   R4,R7
-------------------------------------------------------------
    VEC   R15,{}                  ; nothing live out of loop
-------------------------------------------------------------
loop3:
    LDD   R10,[R1,R5<<3]          LDD   R10,[R1,R5<<3]
    LDD   R11,[R2,R6<<3]          LDD   R11,[R2,R6<<3]
    LDD   R12,[R3,R4<<3]          LDD   R12,[R3,R4<<3]
    FMAC  R12,R10,R11,R12         FMAC  R12,R10,R11,R12
    STD   R12,[R3,R4<<3]          STD   R12,[R3,R4<<3]
    ADD   R4,R4,R7                ADD   R4,R4,R7
-------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
    ADD   R8,R8,#8
    ADD   R5,R5,#1
    CMP   R13,R5,#8
    BLE   R13,Loop2
-------------------------------------------------------------
    ADD   R7,R7,#8
    CMP   R13,R7,#64
    BLE   R13,Loop1
-------------------------------------------------------------
    RET

    Where the doubled up column shows the instructions which run
    on a per lane basis. Given:
    1-lane: there are 8 loops
    2-lane: there are 4 loops
    4-lane: there are 2 loops
    8-lane: there is 1 loop

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    And it did not even have to push registers onto the stack!

    20 total instructions, 80 bytes.

Oh, and BTW; most of the compilers in godbolt hit a compile error--
I tried a fairly big sample across every architecture. Changing back
to K&R C, every one could compile.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Sat May 2 00:59:20 2026
    From Newsgroup: comp.arch

    On Fri, 1 May 2026 08:14:01 -0000 (UTC), Lawrence D´Oliveiro
    <ldo@nz.invalid> wrote:

    On Fri, 01 May 2026 07:48:50 GMT, Anton Ertl wrote:

    I worked on DG Aviions in 1990 and 1991, and we had a Green Hills
    compiler installed as well as gcc. We used gcc.

    Had Green Hills caught up to ANSI C by that point?

    I was a heavy user of Apple’s MPW development environment from the
    late 1980s onwards. They initially offered a C compiler licensed from
    Green Hills, which was not ANSI-compliant. Then with MPW 3.0, they
    replaced it with their own in-house-developed ANSI-compliant C
    compiler, the one with the slightly tongue-in-cheek (or is that >passive-aggressive?) error messages.

    I may be mis-remembering, but I thought the 3.0 (and later) MPW C
    compiler was by Symantec.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 2 06:02:39 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:
    https://godbolt.org/z/xd4PedTqv shows an example generated by
    gcc 16.1 (so no hand-generated assembly). This loads all of A
    into memory at the beginning, a row vector of C is loaded each
    iteration and stored at the end of each iteration, and B is loaded
    (and used) element-wise.

    # define N 7 and the code all goes to heck !

    Is that really what one wants ?!?

    That's auto-vectorization. I have heard that code for the Cray-1
    contains a lot of "64"s.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat May 2 06:54:00 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    -------------------------------------------------------------

    Maybe another example, the "hello world" of high-performance
    computing: An 8*8 matrix kernel, so C = C + A*B.

    https://godbolt.org/z/xd4PedTqv shows an example generated by
    gcc 16.1 (so no hand-generated assembly). This loads all of A
    into memory at the beginning, a row vector of C is loaded each
    iteration and stored at the end of each iteration, and B is loaded
    (and used) element-wise.

    # define N 7 and the code all goes to heck !

    Is that really what one wants ?!?

    Not with 7*7, obviously, but with microarch-dependent fixed size:
    Absolutely.

    High-performance matrix multiplication is made up of "kernels":
    An arbitrary-sized matrix is sliced up into small parts which
    are then computed as efficiently as possible; this could be 4*4,
    8*8, 16*2 or something else, depending on the microarchitecture,
    whatever is most efficient.

    And what I posted above is obviously hand-optimized assembly,
    but something generated by a compiler, which is worse.

    https://github.com/OpenMathLib/OpenBLAS/tree/develop/kernel/x86_64
    has the examples for OpenBLAS - for each microarchitecture,
    a kernel is selected according to what the developers have
    found to be best.
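
    To piece an arbitrary-sized multiplication together from such kernels,
    the driver is conceptually just a blocked loop nest. A sketch
    (hypothetical names, same indexing convention as the 8*8 example,
    i.e. c[i + j*N] += a[i + k*N] * b[k + j*N], leading dimensions
    lda/ldb/ldc, all sizes assumed to be multiples of 8, and a naive
    inner kernel standing in for the tuned one):

    #define NB 8

    /* 8x8 kernel with leading dimensions: C-tile += A-tile * B-tile */
    static void kernel_8x8(const double *a, const double *b, double *c,
                           int lda, int ldb, int ldc)
    {
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                for (int i = 0; i < NB; i++)
                    c[i + j*ldc] += a[i + k*lda] * b[k + j*ldb];
    }

    /* blocked driver: slice C into 8x8 tiles, accumulate kernel calls */
    void matmul(const double *a, const double *b, double *c,
                int m, int n, int k, int lda, int ldb, int ldc)
    {
        for (int j = 0; j < n; j += NB)
            for (int p = 0; p < k; p += NB)
                for (int i = 0; i < m; i += NB)
                    kernel_8x8(&a[i + p*lda], &b[p + j*ldb],
                               &c[i + j*ldc], lda, ldb, ldc);
    }

    Real BLAS implementations additionally pack the tiles into contiguous
    buffers and block for the cache hierarchy, but the kernel/driver split
    is the same idea.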
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat May 2 08:32:32 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.
    -------------------------------------------------------------
    I think the below is correct !?! Hand compiled

    /* with
    R1 = &a[]
    R2 = &b[]
    R3 = &c[]
    R4 = i+jN
    R5 = i+kN
    R6 = k+jN
    R7 = jN
    R8 = kN
    */

    mm8:
    -------------------------------------------------------------
    Loop1:
    MOV R7,#0
    -------------------------------------------------------------
    Loop2:
    MOV R8,#0
    MOV R5,#0
    MOV R4,R7
    -------------------------------------------------------------
    VEC R15,{} ; nothing live out of loop
    -------------------------------------------------------------
    loop3:
    LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
    LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
    LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
    STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
    ADD R4,R4,R7 ADD R4,R4,R7
    -------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
    -------------------------------------------------------------
    ADD R8,R8,#8
    ADD R5,R5,#1
    CMP R13,R5,#8
    BLE R13,Loop2
    -------------------------------------------------------------
    ADD R7,R7,#8
    CMP R13,R7,#64
    BLE R13,Loop1
    -------------------------------------------------------------
    RET

    Where the doubled up column shows the instructions which run
    on a per lane basis. Given:
    1-lane: there are 8 loops
    2-lane: there are 4 loops
    4-lane: there are 2 loops
    8-lane: there is 1 loop

    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).


    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    Plus, the setup time for VVM...

    So, including that and the loop overhead, one could expect around
    0.7 FMAs per cycle, correct?

    With AVX512, it is possible to run 16 FMAs per cycle. Divide that
    by a factor < 2 for overhead, and you run at maybe around 10 FMAs
    per cycle.

    BTW, the code generated by gcc is anything but ideal because of
    the dependency chain on zmm0. That is probably worth a PR.

    And it did not even have to push registers onto the stack!

    20 total instructions, 80 bytes.

    Oh, and BTW; most of the compilers in godbolt take a compile error--
    I tried a fairly big sample across every architecture. Changing back
    to K&R C and every one could compile.

    if they cannot grok restrict, they are not really modern :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat May 2 15:26:34 2026
    From Newsgroup: comp.arch

    On 2026-05-02, Thomas Koenig <tkoenig@netcologne.de> wrote:

    And what I posted above is obviously hand-optimized assembly,

    obviously NOT hand-optimized assembly.

    but something generated by a compiler, which is worse.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat May 2 16:29:03 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.
    -------------------------------------------------------------
    I think the below is correct !?! Hand compiled

    /* with
    R1 = &a[]
    R2 = &b[]
    R3 = &c[]
    R4 = i+jN
    R5 = i+kN
    R6 = k+jN
    R7 = jN
    R8 = kN
    */

    mm8:
    -------------------------------------------------------------
    Loop1:
    MOV R7,#0
    -------------------------------------------------------------
    Loop2:
    MOV R8,#0
    MOV R5,#0
    MOV R4,R7
    -------------------------------------------------------------
    VEC R15,{} ; nothing live out of loop
    -------------------------------------------------------------
    loop3:
    LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
    LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
    LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
    STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
    ADD R4,R4,R7 ADD R4,R4,R7
    -------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
    -------------------------------------------------------------
    ADD R8,R8,#8
    ADD R5,R5,#1
    CMP R13,R5,#8
    BLE R13,Loop2
    -------------------------------------------------------------
    ADD R7,R7,#8
    CMP R13,R7,#64
    BLE R13,Loop1
    -------------------------------------------------------------
    RET

    Where the doubled up column shows the instructions which run
    on a per lane basis. Given:
    1-lane: there are 8 loops
    2-lane: there are 4 loops
    4-lane: there are 2 loops
    8-lane: there is 1 loop

    Let's compare it directly. Posting URLs is not good for discussion.
    So the source code in the example is:

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    and the output of gcc-16.1 is (after cleanup by godbolt):

    mm8:
    vmovupd (%rdi), %zmm8
    vmovupd 64(%rdi), %zmm7
    vmovupd 128(%rdi), %zmm6
    vmovupd 192(%rdi), %zmm5
    vmovupd 256(%rdi), %zmm4
    vmovupd 320(%rdi), %zmm3
    vmovupd 384(%rdi), %zmm2
    vmovupd 448(%rdi), %zmm1
    movq %rsi, %rax
    leaq 512(%rdx), %rcx
    .L2:
    vbroadcastsd (%rax), %zmm0
    vfmadd213pd (%rdx), %zmm8, %zmm0
    addq $64, %rdx
    addq $64, %rax
    vfmadd231pd -56(%rax){1to8}, %zmm7, %zmm0
    vfmadd231pd -48(%rax){1to8}, %zmm6, %zmm0
    vfmadd231pd -40(%rax){1to8}, %zmm5, %zmm0
    vfmadd231pd -32(%rax){1to8}, %zmm4, %zmm0
    vfmadd231pd -24(%rax){1to8}, %zmm3, %zmm0
    vfmadd231pd -16(%rax){1to8}, %zmm2, %zmm0
    vfmadd231pd -8(%rax){1to8}, %zmm1, %zmm0
    vmovupd %zmm0, -64(%rdx)
    cmpq %rdx, %rcx
    jne .L2
    vzeroupper
    ret

    So your code reflects the three loops of the source code, with the
    inner loop being sped up by VVM, ideally such that only one
    microarchitectural iteration of the inner loop is necessary. You
    still have the other loops, and all the memory accesses.

    By contrast, the AVX-512 code produced by gcc-16.1 unrolls one loop
    level into using AVX-512 instructions and another loop level into
    using 8 different zmm registers. As a consequence, there is only one
    loop level left, and every byte of the array c is only stored once.
    Also, every byte of a and b is only loaded once (but the accesses to
    the b array are 8 bytes at a time, so 64 loads are needed for that,
    while a is loaded with 8 64-byte loads and c is stored with 8 64-byte
    stores).

    For VVM we can make use of 8 registers to achieve one level of
    unrolling, but I don't see how to reuse the registers as it is done
    with the a array in zmm1..zmm8 in the AVX-512 code. So one would have
    to load all of a on every iteration of the outer loop, instead of
    pulling these loads out of the loop as it is done by gcc-16.1.

    The VVM code would look maybe somewhat like:

    loop1:
    ... loop overhead left as exercise ...
    vec i
    loop2:
    ldd r17=c[...]
    ldd r1=a[i]
    ldd r2=a[i+8]
    ...
    ldd r8=a[i+56]
    ldd r9=b[...]
    fmac r18=r1*r9+r17
    ...
    ldd r16=b[...]
    fmac r25=r8*r16+r24
    std r25->c[...]
    loop1 ...
    ... loop overhead ...
    ble ..., loop1
    ret

    I don't know how well VVM handles the dependency chain of the FMACs (r17->r18...r25). One could use the same register here, as is done
    with zmm0 in the AVX-512 code, but I do not know if VVM would accept
    that.

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    In an OoO machine the latency within the loop is not very relevant,
    even the dependence chain of the 8 vfmadd231pd instructions typically
    is not, because the next iteration does not depend on the previous
    one, apart from the loop counter updates, and even that has become
    zero-cycle latency in recent Intel CPUs. So the next iteration can
    start immediately only limited by the resources.

    Oh, and BTW; most of the compilers in godbolt take a compile error--
    I tried a fairly big sample across every architecture.

    I tried clang-22, and it compiled the code. I also tried clang-11 and
    gcc-11 and they errored out complaining about the -march=znver5 flag,
    because they do not know this architecture (but still, why report an
    error for that?). Deleting that flag produced SSE2 code on both
    compilers, as expected.

    Changing back
    to K&R C and every one could compile.

    I expect that most do not like -march=znver5 even if the code is
    changed to K&R C. Also, prototypes, const, and restrict have been in
    C for a long time and are probably supported by most compilers on
    godbolt.

    However, trying gcc-1.27 I found that it reports a "parse error before
    'a'" in line 2, i.e., it does not understand restrict. After deleting
    the restricts, it complained about the

    for (int j=...

    usage.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 2 18:33:55 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:
    https://godbolt.org/z/xd4PedTqv shows an example generated by
    gcc 16.1 (so no hand-generated assembly). This loads all of A
    into memory at the beginning, a row vector of C is loaded each
    iteration and stored at the end of each iteration, and B is loaded
    (and used) element-wise.

    # define N 7 and the code all goes to heck !

    Is that really what one wants ?!?

    That's auto-vectorization. I have heard that code for the Cray-1
    contains a lot of "64"s.

    That was back in the days when, if one bought a CRAY-ABC, one hired a
    bank of programmers...

    - anton

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat May 2 18:46:48 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.
    -------------------------------------------------------------
    I think the below is correct !?! Hand compiled

    /* with
    R1 = &a[]
    R2 = &b[]
    R3 = &c[]
    R4 = i+jN
    R5 = i+kN
    R6 = k+jN
    R7 = jN
    R8 = kN
    */

    mm8:
    -------------------------------------------------------------
    Loop1:
    MOV R7,#0
    -------------------------------------------------------------
    Loop2:
    MOV R8,#0
    MOV R5,#0
    MOV R4,R7
    -------------------------------------------------------------
    VEC R15,{} ; nothing live out of loop
    -------------------------------------------------------------
    loop3:
    LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
    LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
    LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
    STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
    ADD R4,R4,R7 ADD R4,R4,R7
    -------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
    -------------------------------------------------------------
    ADD R8,R8,#8
    ADD R5,R5,#1
    CMP R13,R5,#8
    BLE R13,Loop2
    -------------------------------------------------------------
    ADD R7,R7,#8
    CMP R13,R7,#64
    BLE R13,Loop1
    -------------------------------------------------------------
    RET

    Where the doubled up column shows the instructions which run
    on a per lane basis. Given:
    1-lane: there are 8 loops
    2-lane: there are 4 loops
    4-lane: there are 2 loops
    8-lane: there is 1 loop

    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).

    The LDD using R6 as an index can be hoisted into Loop2 prologue.
    {I did miss that}.

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    Plus, the setup time for VVM...

    I have been thinking about this overnight and may have a solution
    that alters only the VEC instruction.

    So, including that and the loop overhead, once could expect around
    0.7 FMAs per cycle, correct?

    0.7*lanes maybe 0.9*Lanes if my VEC fix works.

    With AVX512, it is possible to run 16 FMAs per cycle.

    Your code can only use 8 FMACs per cycle in any event, and there
    is overhead due to the other instructions in the loop...

    Divide that
    by a factor < 2 for overhead, and you run at maybe around 10 FMAs
    per cycle.

    BTW, the code generated by gcc is anything but ideal because of
    the dependency chain on zmm0. That is probably worth a PR.

    And it did not even have to push registers onto the stack!

    20 total instructions, 80 bytes.

    SIMD got 26 instructions, likely longer than 4 bytes each due to
    the prefixes needed to encode the various SIMD lengths.

    Oh, and BTW; most of the compilers in godbolt take a compile error--
    I tried a fairly big sample across every architecture. Changing back
    to K&R C and every one could compile.

    if they cannot grok restrict, they are not really modern :-)


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun May 3 07:33:09 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.
    -------------------------------------------------------------
    I think the below is correct !?! Hand compiled

    /* with
    R1 = &a[]
    R2 = &b[]
    R3 = &c[]
    R4 = i+jN
    R5 = i+kN
    R6 = k+jN
    R7 = jN
    R8 = kN
    */

    mm8:
    -------------------------------------------------------------
    Loop1:
    MOV R7,#0
    -------------------------------------------------------------
    Loop2:
    MOV R8,#0
    MOV R5,#0
    MOV R4,R7
    -------------------------------------------------------------
    VEC R15,{} ; nothing live out of loop
    -------------------------------------------------------------
    loop3:
    LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
    LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
    LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
    STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
    ADD R4,R4,R7 ADD R4,R4,R7
    -------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
    -------------------------------------------------------------
    ADD R8,R8,#8
    ADD R5,R5,#1
    CMP R13,R5,#8
    BLE R13,Loop2
    -------------------------------------------------------------
    ADD R7,R7,#8
    CMP R13,R7,#64
    BLE R13,Loop1
    -------------------------------------------------------------
    RET

    Where the doubled up column shows the instructions which run
    on a per lane basis. Given:
    1-lane: there are 8 loops
    2-lane: there are 4 loops
    4-lane: there are 2 loops
    8-lane: there is 1 loop

    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).

    The LDD using R6 as an index can be hoisted into Loop2 prologue.
    {I did miss that}.

    I think the R6 usage is off (not unusual in hand-coded assembly, as I
    know only too well myself :-)

    But let's look at memory access. Like you said, in the code

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
    64 double reads for b. For a and c there are 512 reads of doubles each,
    and 512 doubles are written for c. Total: 1600 memory accesses for
    doubles.

    By comparison, the SIMD code reads 192 doubles and writes 64, the
    minimum, for a total of 256. This is a factor of 6.25.
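
    A quick counting sketch (plain C that just tallies the scalar accesses,
    assuming b[k + j*N] is hoisted out of the innermost loop) reproduces
    these numbers:

    #include <stdio.h>

    int main(void)
    {
        enum { N = 8 };
        long loads = 0, stores = 0;
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                loads += 1;              /* b[k + j*N], hoisted */
                for (int i = 0; i < N; i++) {
                    loads += 2;          /* a[i + k*N] and c[i + j*N] */
                    stores += 1;         /* c[i + j*N] */
                }
            }
        printf("loads=%ld stores=%ld total=%ld\n",
               loads, stores, loads + stores);
        /* prints loads=1088 stores=512 total=1600 */
        return 0;
    }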


    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    Plus, the setup time for VVM...

    I have been thinking about this overnight and may have a solution
    that alters only the VEC instruction.

    So, including that and the loop overhead, once could expect around
    0.7 FMAs per cycle, correct?

    0.7*lanes maybe 0.9*Lanes if my VEC fix works.

    With AVX512, it is possible to run 16 FMAs per cycle.

    Your code can only use 8 FMACs per cycle in any event,

    Actually not correct. Zen5 has a reciprocal throughput of 0.5
    for FMA; it can run two instructions in parallel.

    and there
    is overhead due to the other instructions in the loop...

    Looking at

    https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html

    where they put a lot of work into optimizing matmul, one can see
    they reached around 125 gflops on matrix sizes that include some
    odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
    that is 30.34 flops per cycle, which translates into 15.16 FMAs
    per cycle, which is extremely close to 16 and shows that the
    loop overhead is pretty much absorbed in the OoO handling.

    (It also shows that the gcc-generated code I linked to is anything
    but ideal due to the dependency chain it contains).



    Divide that
    by a factor < 2 for overhead, and you run at maybe around 10 FMAs
    per cycle.

    BTW, the code generated by gcc is anything but ideal because of
    the dependency chain on zmm0. That is probably worth a PR.

    And it did not even have to push registers onto the stack!

    20 total instructions, 80 bytes.

    SIMD got 26 instructions likely longer than 4-bytes each due to
    prefixes to get various SIMD lengths encoded.

    In this case, code length is of extremely minor importance.

    It could be interesting to see how to restructure the code
    for less memory traffic with VVM.

    Maybe instead of

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    unroll along i and write something like

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    double c0 = c[0 + j*N];
    double c1 = c[1 + j*N];
    double c2 = c[2 + j*N];
    double c3 = c[3 + j*N];
    double c4 = c[4 + j*N];
    double c5 = c[5 + j*N];
    double c6 = c[6 + j*N];
    double c7 = c[7 + j*N];
    for (int k=0; k<N; k++) {
    double bk = b[k + j*N];
    c0 += a[0 + k*N] * bk;
    c1 += a[1 + k*N] * bk;
    c2 += a[2 + k*N] * bk;
    c3 += a[3 + k*N] * bk;
    c4 += a[4 + k*N] * bk;
    c5 += a[5 + k*N] * bk;
    c6 += a[6 + k*N] * bk;
    c7 += a[7 + k*N] * bk;
    }
    /* write back c0 to c7 */
    }
    }

    where the loop over k could be vectorized, but that would still
    leave excessive memory traffic for a.

    Oh, and BTW; most of the compilers in godbolt take a compile error--
    I tried a fairly big sample across every architecture. Changing back
    to K&R C and every one could compile.

    if they cannot grok restrict, they are not really modern :-)

    Replacing -march=znver5 with -march=znver4 produces
    identical code for gcc 16.1, and should be accepted for
    many more compilers.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun May 3 11:13:16 2026
    From Newsgroup: comp.arch

    On Sun, 3 May 2026 07:33:09 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.
    -------------------------------------------------------------
    I think the below is correct !?! Hand compiled

    /* with
    R1 = &a[]
    R2 = &b[]
    R3 = &c[]
    R4 = i+jN
    R5 = i+kN
    R6 = k+jN
    R7 = jN
    R8 = kN
    */

    mm8:
    -------------------------------------------------------------
    Loop1:
    MOV R7,#0
    -------------------------------------------------------------
    Loop2:
    MOV R8,#0
    MOV R5,#0
    MOV R4,R7
    -------------------------------------------------------------
    VEC R15,{} ; nothing live out of loop
    -------------------------------------------------------------
    loop3:
    LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
    LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
    LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
    STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
    ADD R4,R4,R7 ADD R4,R4,R7
    -------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
    -------------------------------------------------------------
    ADD R8,R8,#8
    ADD R5,R5,#1
    CMP R13,R5,#8
    BLE R13,Loop2
    -------------------------------------------------------------
    ADD R7,R7,#8
    CMP R13,R7,#64
    BLE R13,Loop1
    -------------------------------------------------------------
    RET

    Where the doubled up column shows the instructions which run
    on a per lane basis. Given:
    1-lane: there are 8 loops
    2-lane: there are 4 loops
    4-lane: there are 2 loops
    8-lane: there is 1 loop

    One problem I see is memory traffic. In the SIMD version, A is
    loaded once at the beginning of the loop. Here, it is loaded N**2
    times, with different offsets each VVM iteration, vs only once
    for the AVX512 version. Also, C is loaded and stored N**2 times,
    vs. only once. (The AVX version also loads B only once).

    The LDD using R6 as an index can be hoisted into Loop2 prologue.
    {I did miss that}.

    I think R6 usage is off (not usual in hand-coded assembly, as I know
    only too well myself :-)

    But let's look at memory access. Like you said, in the code

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    b[k + j*N] is invariant for the innermost loop. So, for N=8, there
    are 64 double reads for b. For a and c are 512 reads of doubles each,
    512 doubles are written for c. Total, 1600 memory access for doubles.

    By comparison, the SIMD code reads 192 doubles and writes 64, the
    minimum, for a total of 256. This is a factor of 6.25.


    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth
    of the loop is 6-cycles, so the 8-wide machine would run the
    loop in 8-cycles of latency.

    Plus, the setup time for VVM...

    I have been thinking about this overnight and may have a solution
    that alters only the VEC instruction.

    So, including that and the loop overhead, once could expect around
    0.7 FMAs per cycle, correct?

    0.7*lanes maybe 0.9*Lanes if my VEC fix works.

    With AVX512, it is possible to run 16 FMAs per cycle.

    Your code can only use 8 FMACs per cycle in any event,

    Actually not correct. Zen5 has a reciprocal throughput of 0.5
    for FMA, it can run two instructions in parallel.

    and there
    is overhead due to the other instructions in the loop...

    Looking at

    https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html

    where they put a lot of work into optimizing matmul, one can see
    they reached around 125 gflops on matrix sizes that include some
    odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
    that is 30.34 flops per cycle, which translates into 15.16 FMAs
    per cycle, which is extremely close to 16 and shows that the
    loop overhad is pretty much absorbed in the OoO handling.

    (It also shows that the gcc-generated code I linked to is anything
    but ideal due to the dpendency chain it contains).


    My experience with matmul is that daxpy-like schemes, even in quite
    advanced form, like an inner loop that updates 2 accumulator rows from
    3 or 4 source rows, do not achieve anything like the maximal theoretical
    throughput.
    I had much better luck with [advanced forms of] dot-product-like
    schemes, more specifically with inner loops that multiply 5 rows by 2
    SIMD-wide columns or 4 rows by 3 SIMD-wide columns.
    But I didn't do it on Zen5, so it is possible that my experience does
    not apply to it.

    Of course, I was using SIMD intrinsics rather than relying on the
    compiler's autovectorization magic. Back then, in 2017, autovectorization
    in major compilers was very unreliable. I would guess that while today
    autovec is better, it is still not up to the job of getting utilization
    above 70%. daxpy-like kernels are relatively easier for a compiler to
    get right, but for dot-like kernels, last time I looked (not very
    recently, but in this decade) there was almost no progress.
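
    For illustration, a dot-product-style register-tile kernel of roughly
    that shape (a sketch only, not the code from back then: 4 rows by 2
    SIMD-wide columns of C kept in accumulator registers across the whole
    k loop, AVX-512 intrinsics, row-major storage with leading dimensions
    lda/ldb/ldc) could look like:

    #include <immintrin.h>

    /* C tile: 4 rows x 16 columns, held in 8 zmm accumulators */
    static void tile_4x16(const double *a, const double *b, double *c,
                          int K, int lda, int ldb, int ldc)
    {
        __m512d acc[4][2];
        for (int r = 0; r < 4; r++)
            for (int v = 0; v < 2; v++)
                acc[r][v] = _mm512_loadu_pd(&c[r*ldc + v*8]);

        for (int k = 0; k < K; k++) {
            __m512d b0 = _mm512_loadu_pd(&b[k*ldb + 0]);
            __m512d b1 = _mm512_loadu_pd(&b[k*ldb + 8]);
            for (int r = 0; r < 4; r++) {
                __m512d ar = _mm512_set1_pd(a[r*lda + k]);  /* broadcast a[r][k] */
                acc[r][0] = _mm512_fmadd_pd(ar, b0, acc[r][0]);
                acc[r][1] = _mm512_fmadd_pd(ar, b1, acc[r][1]);
            }
        }

        for (int r = 0; r < 4; r++)       /* C is written back once per tile */
            for (int v = 0; v < 2; v++)
                _mm512_storeu_pd(&c[r*ldc + v*8], acc[r][v]);
    }

    The point is that the 8 independent accumulators hide the FMA latency
    and C is loaded and stored only once per tile, unlike a daxpy-style
    loop that streams C through memory on every k iteration.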


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun May 3 11:29:18 2026
    From Newsgroup: comp.arch

    On Sat, 02 May 2026 16:29:03 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    I do not see how VVM could express this equally succinctly;

    Give me a day and I will see what I can do.
    -------------------------------------------------------------
    I think the below is correct !?! Hand compiled

    /* with
    R1 = &a[]
    R2 = &b[]
    R3 = &c[]
    R4 = i+jN
    R5 = i+kN
    R6 = k+jN
    R7 = jN
    R8 = kN
    */

    mm8:
    -------------------------------------------------------------
    Loop1:
    MOV R7,#0
    -------------------------------------------------------------
    Loop2:
    MOV R8,#0
    MOV R5,#0
    MOV R4,R7
    -------------------------------------------------------------
    VEC R15,{} ; nothing live out of loop
    -------------------------------------------------------------
    loop3:
    LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
    LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
    LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
    STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
    ADD R4,R4,R7 ADD R4,R4,R7
    -------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
    -------------------------------------------------------------
    ADD R8,R8,#8
    ADD R5,R5,#1
    CMP R13,R5,#8
    BLE R13,Loop2
    -------------------------------------------------------------
    ADD R7,R7,#8
    CMP R13,R7,#64
    BLE R13,Loop1
    -------------------------------------------------------------
    RET

    Where the doubled up column shows the instructions which run
    on a per lane basis. Given:
    1-lane: there are 8 loops
    2-lane: there are 4 loops
    4-lane: there are 2 loops
    8-lane: there is 1 loop

    Let's compare it directly. Posting URLs is not good for discussion.
    So the source code in the example is:

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    and the output of gcc-16.1 is (after cleanup by godbolt):

    mm8:
    vmovupd (%rdi), %zmm8
    vmovupd 64(%rdi), %zmm7
    vmovupd 128(%rdi), %zmm6
    vmovupd 192(%rdi), %zmm5
    vmovupd 256(%rdi), %zmm4
    vmovupd 320(%rdi), %zmm3
    vmovupd 384(%rdi), %zmm2
    vmovupd 448(%rdi), %zmm1
    movq %rsi, %rax
    leaq 512(%rdx), %rcx
    .L2:
    vbroadcastsd (%rax), %zmm0
    vfmadd213pd (%rdx), %zmm8, %zmm0
    addq $64, %rdx
    addq $64, %rax
    vfmadd231pd -56(%rax){1to8}, %zmm7, %zmm0
    vfmadd231pd -48(%rax){1to8}, %zmm6, %zmm0
    vfmadd231pd -40(%rax){1to8}, %zmm5, %zmm0
    vfmadd231pd -32(%rax){1to8}, %zmm4, %zmm0
    vfmadd231pd -24(%rax){1to8}, %zmm3, %zmm0
    vfmadd231pd -16(%rax){1to8}, %zmm2, %zmm0
    vfmadd231pd -8(%rax){1to8}, %zmm1, %zmm0
    vmovupd %zmm0, -64(%rdx)
    cmpq %rdx, %rcx
    jne .L2
    vzeroupper
    ret

    So your code reflects the three loops of the source code, with the
    inner loop being sped up by VVM, ideally such that only one microarchitectural iteration of the inner loop is necessary. You
    still have the other loops, and all the memory accesses.

    By contrast, the AVX-512 code produced by gcc-16.1 unrolls one loop
    level into using AVX-512 instructions and another loop level into
    using 8 different zmm registers. As a consequence, there is only one
    loop level left, and every byte of the array c is only stored once.
    Also, every byte of a and b are only loaded once (but the accesses to
    the b array are 8 bytes at a time, so 64 loads are needed for that,
    while a is loaded with 8 64-byte loads and c is stored with 8 64-byte
    stores.

    For VVM we can make use of 8 registers to achieve one level of
    unrolling, but I don't see how to reuse the registers as it is done
    with the a array in zmm1..zmm8 in the AVX-512 code. So one would have
    to load all of a on every iteration of the outer loop, instead of
    pulling these loads out of the loop as it is done by gcc-16.1.

    The VVM code would look maybe somewhat like:

    loop1:
    ... loop overhead left as exercise ...
    vec i
    loop2:
    ldd r17=c[...]
    ldd r1=a[i]
    ldd r2=a[i+8]
    ...
    ldd r8=a[i+56]
    ldd r9=b[...]
    fmac r18=r1*r9+r17
    ...
    ldd r16=b[...]
    fmac r25=r8*r16+r24
    std r25->c[...]
    loop1 ...
    ... loop overhead ...
    ble ..., loop1
    ret

    I don't know how well VVM handles the dependency chain of the FMACs (r17->r18...r25). One could use the same register here, as is done
    with zmm0 in the AVX-512 code, but I do not know if VVM would accept
    that.

    With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
    the loop is 6-cycles, so the 8-wide machine would run the loop in
    8-cycles of latency.

    In an OoO machine the latency within the loop is not very relevant,
    even the dependence chain of the 8 vfmadd231pd instructions typically
    is not, because the next iteration does not depend on the previous
    one, apart from the loop counter updates, and even that has become
    zero-cycle latency in recent Intel CPUs. So the next iteration can
    start immediately only limited by the resources.

    Oh, and BTW; most of the compilers in godbolt take a compile error--
    I tried a fairly big sample across every architecture.

    I tried clang-22, and it compiled the code. I also tried clang-11 and
    gcc-11 and they errored out complaining about the -march=znver5 flag,
    because they do not know this architecture (but still, why report an
    error for that?). Deleting that flag produced SSE2 code on both
    compilers, as expected.


    Why would you think that the code generated for Zen5 would be different
    from the code generated for other AVX512 targets with 2 FMA pipes, like
    any Intel server core starting from Skylake-SP?
    Specific targets I would try are:
    skylake-avx512
    cascadelake
    icelake-server
    sapphirerapids
    emeraldrapids
    graniterapids
    The last three are likely unsupported by clang-11, which is quite
    ancient, but with considerably newer gcc11 at least sapphirerapids
    should work.
    Anyway, cascadelake should be supported by all of them and I don't
    expect meaningful differences in code generation between cascadelake
    and znver5.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun May 3 11:22:02 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 02 May 2026 16:29:03 GMT
    I tried clang-22, and it compiled the code. I also tried clang-11 and
    gcc-11 and they errored out complaining about the -march=znver5 flag,
    because they do not know this architecture (but still, why report an
    error for that?). Deleting that flag produced SSE2 code on both
    compilers, as expected.


    Why would you think that the code generated for Zen5 would be different
    from code, generated for other AVX512 targets with 2 FMA pipes, like
    any Intel server core starting from Skylake-SP?

    I did not express anything in that direction. However, now that you
    ask, my experience is that it was more difficult than I had
    expected to get gcc (13 IIRC) to produce AVX-512 code, even with
    explicit vectorization. The actual target was a Rocket Lake machine,
    but using -march=native on that produced AVX-256 code (for the
    programs that I tried). IIRC I also tried specifying one other Intel
    uarch, and the code was not satisfactory, either, but I don't remember
    the details. Eventually I tried -march=znver4, and that worked, so I
    stuck with that.

    BTW, one thing that I find unsatisfactory about "x86-64-v4" is that it
    does not include the ADX instructions.

    Specific targets I would try are:
    skylake-avx512
    cascadelake
    icelake-server
    sapphirerapids
    emeraldrapids
    graniterapids
    The last three are likely unsupported by clang-11, which is quite
    ancient, but with considerably newer gcc11 at least sapphirerapids
    should work.
    Any way, cascadelake should be supported by all of them and I don't
    expect meaningful differences in code generation between cascadelake
    and znver5.

    Good to know. Anyway, I did not try to get older compilers to produce
    AVX-512 for <2026May2.182903@mips.complang.tuwien.ac.at>. Instead, I
    tried gcc-11 and clang-11 to find out why "most of the compilers in
    godbolt take a compile error", as Mitch Alsup claimed. It appears that
    most of the compilers barf on the flag "-march=znver5".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun May 3 16:53:46 2026
    From Newsgroup: comp.arch

    On Sun, 03 May 2026 11:22:02 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 02 May 2026 16:29:03 GMT
    I tried clang-22, and it compiled the code. I also tried clang-11
    and gcc-11 and they errored out complaining about the
    -march=znver5 flag, because they do not know this architecture
    (but still, why report an error for that?). Deleting that flag
    produced SSE2 code on both compilers, as expected.


    Why would you think that the code generated for Zen5 would be
    different from code, generated for other AVX512 targets with 2 FMA
    pipes, like any Intel server core starting from Skylake-SP?

    I did not express anything in that direction. However, now that you
    ask, my experience is that it is was more difficult than I had
    expected to get gcc (13 IIRC) to produce AVX-512 code, even with
    explicit vectorization. The actual target was a Rocket Lake machine,
    but using -march=native on that produced AVX-256 code (for the
    programs that I tried). IIRC I also tried specifying one other Intel
    uarch, and the code was not satisfactory, either, but I don't remember
    the details. Eventually I tried -march=znver4, and that worked, so I
    stuck with that.

    BTW, one thing that I find unsatisfactory about "x86-64-v4" is that it
    does not include the ADX instructions.

    Specific targets I would try are:
    skylake-avx512
    cascadelake
    icelake-server
    sapphirerapids
    emeraldrapids
    graniterapids
    The last three are likely unsupported by clang-11, which is quite
    ancient, but with considerably newer gcc11 at least sapphirerapids
    should work.
    Any way, cascadelake should be supported by all of them and I don't
    expect meaningful differences in code generation between cascadelake
    and znver5.

    Good to know. Anyway, I did not try to get older compilers to produce AVX-512 for <2026May2.182903@mips.complang.tuwien.ac.at>. Instead, I
    tried gcc-11 and clang-11 to find out why "most of the compilers in
    godbolt take a compile error", as Mitch Alsup claimed. It appears that
    most of the compilers barf on the flag "-march=znver5".

    - anton


    I tried it myself and results were rather unexpected.
    In order to convince gcc to generate AVX-512 on skylake-avx512 I had to
    go back to gcc7.

    cascadelake target is not recognized until gcc9 and for this
    particular kernel no version of gcc generates AVX512 code.

    icelake-server recognized by gcc8, i.e. earlier than cascadelake,
    despite the fact that Ice Lake server shipped 2 full years later than
    Cascade Lake. Here too no version of gcc generates AVX512 code for this
    kernel.

    sapphirerapids recognized by gcc11. Here too no version of gcc
    generates AVX512 code for this kernel.

    emeraldrapids recognized by gcc13. Here too no version of gcc
    generates AVX512 code for this kernel.

    graniterapids is exactly the same as emerald.

    So, it's hard to be more wrong than I was in my previous post.
    Just saying that the Intel people responsible for gcc maintenance
    screwed it up would be a big understatement.


    Now, I want to know if clang situation is any different.

    skylake-avx512: recognized since clang-3.9, generates semi-reasonable
    avx512 code with clang-3.9 to 9. Semi-reasonable means using 512-bit
    add and mul, but no fma. Good code is produced only by clang-4 with the
    following flags: -march=skylake-avx512 -Ofast -O3 -ffast-math


    cascadelake: recognized since clang-9. Code generation on supported
    versions appears to be the same as for skylake-avx512. In effect it
    means that good code is not generated at all.

    icelake-server: recognized since clang-7, i.e. earlier than cascadelake.
    Code generation appears to be the same.

    At this point I ran out of gas.
    But it looks almost certain that clang code generation is the same on
    all supported Intel server cores. I.e. newer clang does not generate
    AVX512 code, and older clang in some random versions generates the same
    code as new clang, while in other random versions it generates AVX512
    code, but often not 512-bit wide and at first glance looking like crap.

    In this particular case it could be a blessing for Intel, because newer
    clang code generated for znver5 used 512-bit SIMD but looks horrible.
    I fully expect that it is slower than more conservative code for Intel
    targets.



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun May 3 14:46:19 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    sapphirerapids recognized by gcc11. Here too no version of gcc
    generates AVX512 code for this kernel.

    I think it does, but you have to specify additional options.
    Historically, AVX512 has been a low performer due to frequency
    throttling etc. You may have to specify -mprefer-vector-width=512.
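
    For example, something like

    gcc -O3 -march=sapphirerapids -mprefer-vector-width=512 -S mm8.c

    (file name illustrative) makes gcc use 512-bit vectors where its
    tuning default for these targets would otherwise prefer 256-bit ones.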
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun May 3 14:39:54 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    In order to convince gcc to generate AVX-512 on skylake-avx512 I had to
    go back to gcc7.

    skylake-avx512: recognized since clang-3.9, generates semi-reasonable
    avx512 code with clang-3.9 to 9.

    gcc-8.1 was released on May 2, 2018.

    clang-10.0 was released on 24 March 2020.

    Alder Lake was released in late 2021, but initially had some AVX-512
    support that was disabled via firmware later.

    Given how indecisive Intel has been on Alder Lake, it seems unlikely
    that they eliminated the AVX-512 usage already in gcc-8 so early to
    avoid disappointments when Alder Lake is released. But somehow I
    cannot come up with a different explanation.

    In this particular case it could be a blessing for Intel, because newer
    clang code generated for znver5 used 512-bit SIMD but looks horrible.
    I fully expect that it is slower than more conservative code for Intel targets.

    The whole auto-vectorization stuff tends to be pretty unreliable
    overall. Sometimes the code that is generated looks good, sometimes
    it is a mess, sometimes there is no auto-vectorization at all.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun May 3 18:30:56 2026
    From Newsgroup: comp.arch

    On Sun, 03 May 2026 14:39:54 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    In order to convince gcc to generate AVX-512 on skylake-avx512 I had
    to go back to gcc7.

    skylake-avx512: recognized since clang-3.9, generates semi-reasonable avx512 code with clang-3.9 to 9.

    gcc-8.1 was released on May 2, 2018.

    clang-10.0 was released on 24 March 2020.

    Alder Lake was released in late 2021, but initially had some AVX-512
    support that was disabled via firmware later.

    Given how undecisive Intel has been on Alder Lake, it seems unlikely
    that they eliminated the AVX-512 usage already in gcc-8 so early to
    avoid disappointments when Alder Lake is released. But somehow I
    cannot come up with a different explanation.

    In this particular case it could be a blessing for Intel, because
    newer clang code generated for znver5 used 512-bit SIMD but looks
    horrible. I fully expect that it is slower than more conservative
    code for Intel targets.


    I wrote a less artificial kernel, one that is actually very similar to
    how I'd do the inner loop of a matmul of big matrices on a SIMD512
    target machine.

    https://godbolt.org/z/aYhxcTehq

    I am very impressed by the quality of the code that gcc generated for
    Zen4 and Zen5. Exactly the same code would be an excellent fit on any
    Intel AVX512 target, but especially so on Intel processors with dual
    512b pipes.
    But gcc does not generate this code for any Intel target. That's
    extremely weird.



    The whole auto-vectorization stuff tends to be pretty unreliable
    overall. Sometimes the code that is generated looks good, sometimes
    it is a mess, sometimes there is no auto-vectorization at all.

    - anton

    Very true.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun May 3 19:07:16 2026
    From Newsgroup: comp.arch

    On Sun, 3 May 2026 14:46:19 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    sapphirerapids recognized by gcc11. Here too no version of gcc
    generates AVX512 code for this kernel.

    I think it does, but you have to specify additional options.

    Yes, the option you listed helps to force the compiler into generating
    code identical to znver4/znver5.

    Historically, AVX512 has been a low performer due to frequency
    throttling etc.

    That's why the compiler has -march and -mtune, doesn't it?
    Historical problems of certain Skylake-SP and Cascade Lake SKUs, mostly
    of Gold-5xxx, Silver and Bronze, should have no effect on code
    generated for Icelake-server or for newer Xeon variants, where
    problems of this sort either do not exist or, at the very least,
    moderate throttling is more than offset by the higher performance per
    clock provided by wider SIMD.

    The observed gcc behavior, however, is that apart from the supported
    instruction set, the only march-related bit that the compiler cares
    about is the CPU manufacturer.

    You may have to specify -mprefer-vector-width=512 .

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 3 22:28:50 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    C version loop invariant, and cursoring

    #define N 8
    void mm8(double *a, double *b, double *c)
    {
    int i,j,jN,k,kN;
    double *AcijN,*AbkjN,*AaijN;

    for( jN=0; jN<N*N; jN+=N ) {
    AcijN = &c[jN];
    AbkjN = &b[jN];
    for( kN=k=0; k<N; k++,kN+=N ) {
    AaikN = &a[kN];
    bN = AbkjN[k];
    for( i=0; i<N; i++ ) {
    AcijN[i] += AaikN[i] * bN;
    }
    }
    }
    }

    I did get this into:

    mm8:
    ; R1 = &a[0];
    ; R2 = &b[0];
    ; R3 = &c[0];
    -------------------------------------------------------------
    MOV RjN,#0 ; R4
    loop1:
    LA RcijN,[Rc,RjN<<3] ; R5
    LA RbkjN,[Rb,RjN<<3] ; R6
    -------------------------------------------------------------
    MOV RkN,#0 ; R7
    MOV Rk,#0 ; R8
    loop2:
    LA RaikN,[Ra,RkN<<3] ; R9
    LDD RbN,[RbkjN,Rk<<3] ; R10
    -------------------------------------------------------------
    MOV Ri,#0 ; R11
    VEC 8,{}
    loop3:
    LDD Ra,[RaikN,Ri<<3] ; R12
    LDD Rc,[RcijN,Ri<<3] ; R13
    FMAC Rc,Ra,Rb,Rc ; R14
    STD Rc,[RcijN,Ri<<3] ;

    LOOP1 LE,Ri,#1,#8 ; R11
    -------------------------------------------------------------
    ADD Rk,Rk,#1 ; R8
    ADD RkN,RkN,#8 ; R7
    CMP Rt,Rk,#8 ; R11
    BLE Rt,loop2
    -------------------------------------------------------------
    ADD RjN,RjN,#8 ; R4
    CMP Rt,RjN,#64 ; R7
    BLE Rt,Loop1
    -------------------------------------------------------------
    RET

    without needing any preserved registers.

    b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
    64 double reads for b. For a and c are 512 reads of doubles each,
    512 doubles are written for c. Total, 1600 memory access for doubles.

    By comparison, the SIMD code reads 192 doubles and writes 64, the
    minimum, for a total of 256. This is a factor of 6.25.

    It occurs to me that c[*] should be set to zero for a "real" matrix
    multiply... as is, c[*] is both input and output.

    ----------------------------------
    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    double c0 = c[0 + j*N];
    double c1 = c[1 + j*N];
    double c2 = c[2 + j*N];
    double c3 = c[3 + j*N];
    double c4 = c[4 + j*N];
    double c5 = c[5 + j*N];
    double c6 = c[6 + j*N];
    double c7 = c[7 + j*N];
    for (int k=0; k<N; k++) {
    double bk = b[k + j*N];
    c0 += a[0 + k*N] * bk;
    c1 += a[1 + k*N] * bk;
    c2 += a[2 + k*N] * bk;
    c3 += a[3 + k*N] * bk;
    c4 += a[4 + k*N] * bk;
    c5 += a[5 + k*N] * bk;
    c6 += a[6 + k*N] * bk;
    c7 += a[7 + k*N] * bk;
    }
    /* write back c0 to c7 */
    }
    }

    where the loop over k could be vectorized, but that would still
    leave eccessive memory traffic for a.

    ENTER Rc1,Rc8,#0 ; preserve c[1..8]
    MOV RjN,#0 ; R4
    loop1:
    LA Rca,[Rc,RjN<<3] ; &c[1..8]
    LDD Rc1,[Rca,#0] ; R23
    LDD Rc2,[Rca,#8]
    LDD Rc3,[Rca,#16]
    LDD Rc4,[Rca,#24]
    LDD Rc5,[Rca,#32]
    LDD Rc6,[Rca,#40]
    LDD Rc7,[Rca,#48]
    LDD Rc8,[Rca,#56] ; R30

    MOV RkN,#0 ; R5
    ---------------begin vectorize-------------------
    VEC 8,{Rc1..Rc8}
    loop2:
    LDD Rbk,[R2,RjN<<3] ; R6

    LA RakN,[Ra,RkN<<3] ; R7
    LDD Ra1,[RakN,#0] ; R8
    FMAC Rc1,Ra1,Rbk,Rc1 ; R23
    LDD Ra2,[RakN,#8] ; R7
    FMAC Rc2,Ra2,Rbk,Rc2 ; R24
    LDD Ra3,[RakN,#16] ; R7
    FMAC Rc3,Ra3,Rbk,Rc3 ; R25
    LDD Ra4,[RakN,#24] ; R7
    FMAC Rc4,Ra4,Rbk,Rc4 ; R26
    LDD Ra5,[RakN,#32] ; R7
    FMAC Rc5,Ra5,Rbk,Rc5 ; R27
    LDD Ra6,[RakN,#40] ; R7
    FMAC Rc6,Ra6,Rbk,Rc6 ; R28
    LDD Ra7,[RakN,#48] ; R7
    FMAC Rc7,Ra7,Rbk,Rc7 ; R29
    LDD Ra8,[RakN,#56] ; R7
    FMAC Rc8,Ra8,Rbk,Rc8 ; R30

    LOOP1 LE,RkN,#8,#64 ; R4
    ---------------end vectorize-------------------

    ADD RkN,RkN,#8 ; R5
    CMP Rt,RkN,$64 ; R6
    BLE Rt,loop1

    STD Rc1,[Rca,#0]
    STD Rc2,[Rca,#8]
    STD Rc3,[Rca,#16]
    STD Rc4,[Rca,#24]
    STD Rc5,[Rca,#32]
    STD Rc6,[Rca,#40]
    STD Rc7,[Rca,#48]
    STD Rc8,[Rca,#56]

    EXIT Rc1,Rc8,#0
    RET

    46 instructions, 19 instructions in the vectorized (unrolled) loop.

    c[k] is read once and written once
    b[k] is read 8×
    a[k] is read 8×

    If you are willing to have 64 FMACs in a row, a[k] can be read 2×
    {with very tricky register allocation}.

    Using this many registers causes 64 bytes to be written to stack
    and read back later. Solving the a[k] traffic increases the stack
    footprint to 104 bytes.

    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri May 8 18:15:23 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    ---------------------------
    Your code can only use 8 FMACs per cycle in any event,

    Actually not correct. Zen5 has a reciprocal throughput of 0.5
    for FMA, it can run two instructions in parallel.

    and there
    is overhead due to the other instructions in the loop...

    Looking at

    https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html

    where they put a lot of work into optimizing matmul, one can see
    they reached around 125 gflops on matrix sizes that include some
    odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
    that is 30.34 flops per cycle, which translates into 15.16 FMAs
    per cycle, which is extremely close to 16 and shows that the
    loop overhad is pretty much absorbed in the OoO handling.

    But your loop only has 4×4 matrices.

    Yes, MM with big matrices reaches above 90% of theoretical perf;
    small ones do not.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun May 10 06:55:39 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    #define N 8
    void mm8(double * const restrict a, double * const restrict b,
    double * restrict c)
    {
    for (int j=0; j<N; j++) {
    for (int k=0; k<N; k++) {
    for (int i=0; i<N; i++) {
    c[i + j*N] += a[i + k*N] * b[k + j*N];
    }
    }
    }
    }

    C version loop invariant, and cursoring

    #define N 8
    void mm8(double *a, double *b, double *c)
    {
        int i,j,jN,k,kN;
        double *AcijN,*AbkjN,*AaijN;

        for( jN=0; jN<N*N; jN+=N ) {
            AcijN = &c[jN];
            AbkjN = &b[jN];
            for( kN=k=0; k<N; k++,kN+=N ) {
                AaikN = &a[kN];
                bN = AbkjN[k];
                for( i=0; i<N; i++ ) {
                    AcijN[i] += AaikN[i] * bN;
                }
            }
        }
    }

    I was going to check this for correctness and instrument it to
    actually count loads and stores (counting by hand can be wrong
    :-), but this does not compile (I renamed your function to mm8b
    for testing):

    mm1.c: In function ‘mm8b’:
    mm1.c:80:7: error: ‘AaikN’ undeclared (first use in this function); did you mean ‘AaijN’?
    80 | AaikN = &a[kN];
    | ^~~~~
    | AaijN
    mm1.c:80:7: note: each undeclared identifier is reported only once for each function it appears in
    mm1.c:81:7: error: ‘bN’ undeclared (first use in this function); did you mean ‘kN’?
    81 | bN = AbkjN[k];
    | ^~
    | kN
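
    Presumably AaikN was meant to be declared as a pointer (in place of
    the unused AaijN) and bN as a double. With those declarations -- my
    reconstruction, nothing more -- it compiles and matches the original
    mm8:

    #define N 8
    void mm8b(double *a, double *b, double *c)
    {
        int i, k, jN, kN;
        double *AcijN, *AbkjN, *AaikN;
        double bN;

        for (jN = 0; jN < N*N; jN += N) {
            AcijN = &c[jN];
            AbkjN = &b[jN];
            for (kN = k = 0; k < N; k++, kN += N) {
                AaikN = &a[kN];
                bN = AbkjN[k];
                for (i = 0; i < N; i++)
                    AcijN[i] += AaikN[i] * bN;
            }
        }
    }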

    b[k + j*N] is invariant for the innermost loop. So, for N=8, there are
    64 double reads for b. For a and c, there are 512 reads of doubles each,
    and 512 doubles are written for c. Total: 1600 memory accesses for doubles.

    By comparison, the SIMD code reads 192 doubles and writes 64, the
    minimum, for a total of 256. This is a factor of 6.25.
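
    Those counts are easy to check mechanically; a quick instrumented
    version (my sketch, with the invariant b[k + j*N] hoisted the way a
    compiler would) reports 64 reads of b, 512 reads each of a and c,
    and 512 writes of c, i.e. 1600 in total:

    #include <stdio.h>
    #define N 8

    static long rd_a, rd_b, rd_c, wr_c;    /* source-level access counters */

    static void mm8_counted(const double *a, const double *b, double *c)
    {
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                rd_b++;
                double bkj = b[k + j*N];    /* invariant in the i loop */
                for (int i = 0; i < N; i++) {
                    rd_a++; rd_c++; wr_c++;
                    c[i + j*N] += a[i + k*N] * bkj;
                }
            }
    }

    int main(void)
    {
        double a[N*N] = {0}, b[N*N] = {0}, c[N*N] = {0};
        mm8_counted(a, b, c);
        printf("b: %ld reads, a: %ld reads, c: %ld reads + %ld writes\n",
               rd_b, rd_a, rd_c, wr_c);
        return 0;
    }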

    It occurs to me that c[*] should be set to zero for a "real" matrix
    multiply... as it stands, c[*] is both input and output.

    If you are multiplying whole matrices, yes. If you are piecing together
    matrix multiplications from kernels, then you need to add up all
    the consecutive kernels that make up C(i,j).

    [...]


    ENTER Rc1,Rc8,#0 ; preserve c[1..8]
    MOV RjN,#0 ; R4
    loop1:
    LA Rca,[Rc,RjN<<3] ; &c[1..8]
    LDD Rc1,[Rca,#0] ; R23
    LDD Rc2,[Rca,#8]
    LDD Rc3,[Rca,#16]
    LDD Rc4,[Rca,#24]
    LDD Rc5,[Rca,#32]
    LDD Rc6,[Rca,#40]
    LDD Rc7,[Rca,#48]
    LDD Rc8,[Rca,#56] ; R30

    MOV RkN,#0 ; R5
    ---------------begin vectorize-------------------
    VEC 8,{Rc1..Rc8}
    loop2:
    LDD Rbk,[R2,RjN<<3] ; R6

    LA RakN,[Ra,RkN<<3] ; R7
    LDD Ra1,[RakN,#0] ; R8
    FMAC Rc1,Ra1,Rbk,Rc1 ; R23
    LDD Ra2,[RakN,#8] ; R7
    FMAC Rc2,Ra2,Rbk,Rc2 ; R24
    LDD Ra3,[RakN,#16] ; R7
    FMAC Rc3,Ra3,Rbk,Rc3 ; R25
    LDD Ra4,[RakN,#24] ; R7
    FMAC Rc4,Ra4,Rbk,Rc4 ; R26
    LDD Ra5,[RakN,#32] ; R7
    FMAC Rc5,Ra5,Rbk,Rc5 ; R27
    LDD Ra6,[RakN,#40] ; R7
    FMAC Rc6,Ra6,Rbk,Rc6 ; R28
    LDD Ra7,[RakN,#48] ; R7
    FMAC Rc7,Ra7,Rbk,Rc7 ; R29
    LDD Ra8,[RakN,#56] ; R7
    FMAC Rc8,Ra8,Rbk,Rc8 ; R30

    LOOP1 LE,RkN,#8,#64 ; R4

    I don't recall the limit on the number of statements in a VVM loop;
    what is it?

    [...]

    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.

    What are its drawbacks? Do register accesses get slower?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun May 10 07:14:37 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    ---------------------------
    Your code can only use 8 FMACs per cycle in any event,

    Actually not correct. Zen5 has a reciprocal throughput of 0.5
    for FMA, it can run two instructions in parallel.

    and there
    is overhead due to the other instructions in the loop...

    Looking at

    https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html

    where they put a lot of work into optimizing matmul, one can see
    they reached around 125 gflops on matrix sizes that include some
    odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
    that is 30.34 flops per cycle, which translates into 15.16 FMAs
    per cycle, which is extremely close to 16 and shows that the
    loop overhead is pretty much absorbed in the OoO handling.

    But your loop only has 4×4 matrixes.

    It is an example 8*8 (not 4*4) kernel. AMD actually used 24*8 for
    their microkernel.


    Yes, MM with big matrixes reached up above 90% of theoretical perf,
    small ones do not.

    Sure, overhead for small matrices matters a lot.

    Some time ago, I wrote an inline version of matmul for gfortran
    because small matrices (especially if their size is known at
    compile time) are handled very inefficiently by external packages.
    ifort had actually done so before, although I didn't know it at
    the time.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun May 10 08:58:34 2026
    From Newsgroup: comp.arch

    On 5/3/2026 3:28 PM, MitchAlsup wrote:

    big snip
    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.

    A possible alternative that I have seen is to "memory map" the registers
    as an alternative accessing mechanism. This allows you to "index" the registers, similarly to indexing a memory array.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun May 10 17:28:38 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    ----------------
    I don't recall the limit on the number of statements in a VVM loop;
    what is it?

    32

    [...]

    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into registers and indexed from the file itself. Most ISAs do not have this ability--although a few GPU ISAs do.

    What are its drawbacks? Do register accesses get slower?

    Yes; you have to read the register holding the index and move that
    value to where it can be used to read another register.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun May 10 16:15:12 2026
    From Newsgroup: comp.arch

    Thomas Koenig [2026-05-10 06:55:39] wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    The solution to the excessive a[] traffic would be having the ability
    to index the register file Ra[#] so the array can be allocated into
    registers and indexed from the file itself. Most ISAs do not have this
    ability--although a few GPU ISAs do.
    What are its drawbacks?

    I guess the problem is that in an OoO design, this introduces a deeply problematic dependency between the in-order front-end that renames
    logical registers to physical registers and the OoO core.

    Usually such dependencies (where the front-end needs info from the
    OoO core) are handled via speculation, the classical example being
    branches.

    Do register accesses get slower?

    In order not to mess up the whole pipeline, I think you'd have to
    predict the register indexing.

    In the meantime you can "simulate" it by replacing the register
    indexing with a `switch` table to various copies of the code, each one
    using the appropriate register. Of course, this wouldn't work in VVM
    since IIRC VVM doesn't support branches within the loop, and using
    predication to simulate the `switch` table would be impractical.
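
    In plain C the trick looks roughly like this (a toy sketch; the
    scalar parameters r0..r3 just stand in for registers):

    /* Each case is a copy of the code that names one particular
       "register"; the switch replaces the indexed register access. */
    double read_indexed(int idx, double r0, double r1, double r2, double r3)
    {
        switch (idx) {
        case 0:  return r0;
        case 1:  return r1;
        case 2:  return r2;
        default: return r3;
        }
    }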


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Mon May 11 05:36:37 2026
    From Newsgroup: comp.arch

    On Sun, 10 May 2026 07:14:37 -0000 (UTC), Thomas Koenig wrote:

    Some time ago, I wrote an inline version of matmul for gfortran
    because small matrices (especially if their size is known at compile
    time) are handled very inefficiently by external packages.

    General linear-algebra packages are optimized for large matrices I
    guess because those are common in linear optimization problems and the
    like.

    (I was reading earlier today about erasure codes, and I think you
    could use these there as well.)

    CG, on the other hand, will use matrices no larger than 4×4. For that
    sort of size, I think you can get away with computing inverses using
    Cramer’s rule. ;)
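
    (For instance, the 2×2 case in C -- a throwaway sketch; the 3×3 and
    4×4 cases follow the same cofactor pattern, just with more terms:)

    #include <math.h>

    /* Invert a row-major 2x2 matrix by Cramer's rule.
       Returns 0 if the matrix is (nearly) singular. */
    int inv2x2(const double m[4], double out[4])
    {
        double det = m[0]*m[3] - m[1]*m[2];
        if (fabs(det) < 1e-12)
            return 0;
        out[0] =  m[3] / det;  out[1] = -m[1] / det;
        out[2] = -m[2] / det;  out[3] =  m[0] / det;
        return 1;
    }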
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Tue May 12 02:17:49 2026
    From Newsgroup: comp.arch

    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 11 20:11:43 2026
    From Newsgroup: comp.arch

    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue May 12 05:14:48 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be used in rare circumstances.

    It would have to be implemented. How? And how does the supposed
    rareness help?

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon May 11 23:11:07 2026
    From Newsgroup: comp.arch

    On 5/11/2026 10:14 PM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
    On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:

    A possible alternative that I have seen is to "memory map" the
    registers as an alternative accessing mechanism. This allows you to
    "index" the registers, similarly to indexing a memory array.

    Doesn’t this defeat the point of how registers are supposed to work?

    No. In the vast majority of cases, you reference registers as you do
    now, with register numbers in assigned places in the instruction. But
    you do have an "alternate" way of referencing them that allows you to
    use an index, just as you can with memory. That mechanism would only be
    used in rare circumstances.

    It would have to be implemented. How?

    Let me give one possible implementation. There are certainly others.
    Say you have 32 registers. They are "memory mapped" into the first 32 addresses of memory. So programs would have to start not at zero, but
    at 32 (I know this can cause other problems - I clearly have not thought through all of the details.) So now when the CPU encounters a load (or
    store) instruction where the virtual address is less than 32, it is
    resolved not by the memory system, but by the appropriate register.
    i.e. if the virtual address was say 4, the load would be from register
    R4, not memory location 4. Yes, the virtual addressing mechanism
    would have to be sensitive to whether the address was below 32 or not,
    but that is simple within the CPU. Note that the load instruction in
    this case would not touch the memory system at all, so no cache lookups,
    no TLB lookups, etc.
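
    In emulator-style C the load path would look roughly like this (a
    sketch only; the names and the one-address-per-register mapping are
    mine, not a real ISA's):

    #include <stdint.h>

    #define NUM_REGS 32
    static uint64_t regs[NUM_REGS];
    static uint64_t ram[1u << 20];              /* toy backing store */

    static uint64_t load64(uint64_t vaddr)
    {
        if (vaddr < NUM_REGS)                   /* "memory mapped" register range */
            return regs[vaddr];                 /* no cache lookup, no TLB lookup */
        return ram[vaddr & ((1u << 20) - 1)];   /* normal memory path (simplified) */
    }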


    And how does the supposed
    rareness help?

    Laurence said it would defeat the purpose of registers. My comment was
    that since it would be rare, i.e. most of the register references would
    be the same as before, it wouldn't defeat the purpose.

    Remember the subject: You suggested this mechanism as a way to
    eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix multiplication, and the disadvantage was that VVM cannot eliminate
    some memory accesses that AVX-512 can. Turning the registers into
    memory does not solve that, and probably incurs additional costs.
    This cure is worse than the disease.

    The register accesses are not turned into memory accesses. If the
    address is less than 32, the instruction references the actual register,
    not the memory. The only advantage of this scheme is that it allows "indexing" the registers similarly to how one indexes memory today.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2