There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier for compilers
to use than the current state of the art, where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards, where bit n is set on lane
m if lanes n and m hold the same value.
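(For concreteness, a scalar reference of what such a match operation
computes, written for an 8-lane example; the function name is made up:)

#include <stdint.h>

/* bit n of out[m] is set iff lanes n and m hold the same value */
void lane_match(const uint32_t v[8], uint8_t out[8])
{
    for (int m = 0; m < 8; m++) {
        uint8_t bits = 0;
        for (int n = 0; n < 8; n++)
            if (v[n] == v[m])
                bits |= (uint8_t)(1u << n);
        out[m] = bits;
    }
}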
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if used a lot, can give significant
speedup.
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to one of several versions of the code.
The feature set should be constant across all vector widths.
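(For illustration, dispatch on such an enquiry could look roughly like
this in C; simd_width_bytes() and the kernel variants are hypothetical
names, not an existing API:)

#include <stddef.h>

extern size_t simd_width_bytes(void);   /* hypothetical feature enquiry */
void saxpy_512(float *y, const float *x, float a, size_t n);
void saxpy_256(float *y, const float *x, float a, size_t n);
void saxpy_scalar(float *y, const float *x, float a, size_t n);

void saxpy(float *y, const float *x, float a, size_t n)
{
    switch (simd_width_bytes()) {      /* branch to one of several versions */
    case 64: saxpy_512(y, x, a, n);    break;
    case 32: saxpy_256(y, x, a, n);    break;
    default: saxpy_scalar(y, x, a, n); break;
    }
}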
No vector registers should be used in interrupts :-)
It might make sense for a process to announce to the OS which
vector registers it uses for faster system calls.
Data types: 8-, 16-, 32-, 64-bit ints; also 128-bit?
FP types: Same as above?
Other points?
On 4/26/2026 5:26 AM, Thomas Koenig wrote:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very special cases).
If SIMD goes wider than 4 elements, the complexity curve quickly becomes unmanageable.
It is almost preferable, IMO, to allow superscalar execution on SIMD
vectors rather than to go to wider SIMD vectors. Granted, how much this
scales depends more on the size of the register file; there is in
effect a hard limit here.
Personally, I don't really trust automatic vectorization, as it is sort
of a double-edged sword between "faster" and "slow and bloated"
(particularly with MSVC).
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
The feature set should be constant across all vector widths.
Or keep all registers the same size, say, 64 bits.
If you do a 128-bit op, it can use pairs.
Practically, a 4-wide SIMD op can be seen internally as two 2-wide operations glued together (or the 128-bit SIMD operation effectively splitting into two co-issued 64-bit operations).
Well, Load/Store and SHUF operations effectively take away cycles
in which the SIMD unit could work. This only partly applies: a 4x
Binary16 SIMD instruction can co-issue with a Load or a SHUF, but the
128-bit instructions can't co-issue at all (each 128-bit instruction
effectively uses the whole pipeline width).
No vector registers should be used in interrupts :-)
It might make sense for a process to announce to the OS which
vector registers it uses for faster system calls.
Or, no vector registers exist at all...
Thomas Koenig <tkoenig@netcologne.de> posted:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
SIMD as a way to calculate more things per cycle is fine. vVM will
end up doing multi-lane calculations SIMD style.
SIMD as a way to consume vast quantities of ISA space is not.
Done right, there is no need for a vector RF and the associated
context switch overhead.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory.
I have been considering adding a Permute instruction to My 66000
ISA over the last 2 weeks. Fixed permutes need a 24-bit constant
(Encryption) while variable permutes can use a register value.
It seems permute is to go with a carryless multiply.
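(A generic scalar reference of what a variable byte permute computes
over a 64-bit register -- not necessarily how My 66000 defines it:)

#include <stdint.h>

/* result byte i = source byte sel[i]; 3 selector bits per lane */
uint64_t permute8(uint64_t src, uint64_t sel)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned s = (unsigned)(sel >> (8 * i)) & 7;
        r |= ((src >> (8 * s)) & 0xFF) << (8 * i);
    }
    return r;
}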
These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
That is the problem with SIMD where it is used to make library code
fast (str*, mem*): too many interrupt handlers and OS handlers want
to use fast versions of those libraries.
Otherwise, what decisions could be taken in designing such
a SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
This is where SIMD consumes a Cartesian product of ISA space.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
SIMD as a way to calculate more things per cycle is fine. vVM will
end up doing multi-lane calculations SIMD style.
SIMD as a way to consume vast quantities of ISA space is not.
So it should be done right :-)
Done right, there is no need for a vector RF and the associated
context switch overhead.
That is what I am doubting.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory.
I have been considering adding a Permute instruction to My 66000
ISA over the last 2 weeks. Fixed permutes need a 24-bit constant (Encryption) while variable permutes can use a register value.
Would permute work over larger blocks like 256 or 512 bits?
Can the results from permute be re-used in registers, or would
they have to be reloaded from memory?
It seems permute is to go with a carryless multiply.
I do not understand that sentence.
These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
That's the point, I am trying to explore the "maybe".
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
That is the problem with SIMD where it is used to make library code
fast (str*, mem*): too many interrupt handlers and OS handlers want
to use fast versions of those libraries.
For code which can be efficiently vectorized, like str* and mem*,
you are correct. I am talking about the cases where it is not.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
This is where SIMD consumes a Cartesian product of ISA space.
It does not have to, I think.
Let's see what an instruction modifier (like CARRY) could look
like.
Arithmetic operations like ADD have two bits for size in the newest
version of the ISA, so the size of the units to be operated upon
is known.
The size of the SIMD register could be encoded in the otherwise
unused SRC1 field; five bits are certainly enough for that.
There are a maximum of three source registers in each instruction,
so three bits are enough to encode if a register is an SIMD or a
regular register for each instruction - room enough for a shadow
of five instructions.
Predicates would work fine with SIMD code, I think.
So, no combinatorial explosion that I can see.
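(Purely as a sketch of the encoding budget described above -- the field
layout is speculative and not part of any actual ISA:)

#include <stdint.h>

struct simd_modifier {        /* hypothetical 32-bit modifier         */
    uint32_t opcode    : 6;   /* the modifier itself, like CARRY      */
    uint32_t width     : 5;   /* SIMD register size, in the otherwise */
                              /* unused SRC1 field                    */
    uint32_t simd_srcs : 15;  /* 3 bits x 5 shadowed instructions:    */
                              /* which source registers are SIMD      */
    uint32_t reserved  : 6;
};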
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
SIMD as a way to calculate more things per cycle is fine. vVM will
end up doing multi-lane calculations SIMD style.
SIMD as a way to consume vast quantities of ISA space is not.
So it should be done right :-)
Done right, there is no need for a vector RF and the associated
context switch overhead.
That is what I am doubting.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory.
I have been considering adding a Permute instruction to My 66000
ISA over the last 2 weeks. Fixed permutes need a 24-bit constant
(Encryption) while variable permutes can use a register value.
Would permute work over larger blocks like 256 or 512 bits?
Can the results from permute be re-used in registers, or would
they have to be reloaded from memory?
Like everything else, ISA describes register-width units of calculation
and memory references, while vVM allows bundling these into calculations
as wide as CPU designers desire/allow--without changing ISA.
But the target for permutes is faster swizzling of data in cyphers.
After swizzling, the bytes are multiplied SIMD-style (carryless
multiply). Put these in a loop and one has faster cyphers via SIMD
expressed in vVM style.
It seems permute is to go with a carryless multiply.
I do not understand that sentence.
64×64 multiply where only the XOR gates process, since the majority
gates are continuously de-asserted. Same gates as a standard multiplier,
except the 3-2 majority gate has 2 more transistors that turn it off
altogether. This logic is also used when one wants 2{32×32} multipliers
or 4{16×16} multipliers, ...
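(In scalar C, the carryless multiply being described is simply a
multiply whose partial products are combined with XOR instead of
addition:)

#include <stdint.h>

/* low 64 bits of the 128-bit carryless product */
uint64_t clmul_lo(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;    /* XOR-accumulate shifted partial product */
    return r;
}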
These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
That's the point, I am trying to explore the "maybe".
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
That is the problem with SIMD where it is used to make library code
fast (str*, mem*): too many interrupt handlers and OS handlers want
to use fast versions of those libraries.
For code which can be efficiently vectorized, like str* and mem*,
you are correct. I am talking about the cases where it is not.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
This is where SIMD consumes a Cartesian product of ISA space.
It does not have to, I think.
Things vVM does better than SIMD: mixed operand and result widths,
stride-based memory accesses, scatter/gather memory accesses. For
example: Loop{LDByte LDHalf ADD-reduce} ST Doubleword is no problem
for vVM and is not possible in any SIMD ISA.
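(The scalar loop corresponding to that example might look like this,
with made-up names; the final store of the sum is the ST Doubleword:)

#include <stddef.h>
#include <stdint.h>

int64_t add_reduce(const int8_t *a, const int16_t *b, size_t n)
{
    int64_t sum = 0;                   /* doubleword result          */
    for (size_t i = 0; i < n; i++)
        sum += (int64_t)a[i] + b[i];   /* LDByte, LDHalf, ADD-reduce */
    return sum;
}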
Let's see what an instruction modifier (like CARRY) could look
like.
Arithmetic operations like ADD have two bits for size in the newest
version of the ISA, so the size of the units to be operated upon
is known.
The {Sign}{Size} of memory references and integers is known in the instruction.
The size of the SIMD register could be encoded in the otherwise
unused SRC1 field; five bits are certainly enough for that.
Sure, those bits could be used to do that--but do you even need a
SIMD RF at all? vVM allows buffering between cache and data path to
provide the width specification as an implementation-by-implementation
choice--and without any perturbation to SW. So, vVM code written for
the smallest possible machine will run near optimally on the largest
possible machine. Would your SIMD Instruction-Modifier have that same
property ??
There are a maximum of three source registers in each instruction,
so three bits are enough to encode if a register is an SIMD or a
regular register for each instruction - room enough for a shadow
of five instructions.
Predicates would work fine with SIMD code, I think.
In vVM, predicate instructions are directly turned into lane masks
without having to be SIMD instructions creating lane masks; branches
{of a short forward nature} can be done similarly in larger
implementations.
So, no combinatorial explosion that I can see.
SIMD as you describe only adds the superluminal properties of x86
SIMD.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
BGB makes the point that SIMD should stop at 4-wide. I mostly
agree.
On 4/26/2026 5:26 AM, Thomas Koenig wrote:[...]
[...]However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very
special cases).
SIMD is useful for data-parallel tasks, many of which have a very wide
data width. Stopping at 4-wide is as sensible as stopping at 1-wide.
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a very
wide data width. Stopping at 4-wide is as sensible as stopping at
1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a very
wide data width. Stopping at 4-wide is as sensible as stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each core
having SIMD width comparable to that of many (not all) CPUs.
My understanding, not necessarily precisely correct, but necessarily
close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada
Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM. I.e.
Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu
CPUs.
Plus, nowadays they have outer product (a.k.a. tensor) engines, but
that's OT.
BGB [2026-04-26 12:27:06] wrote:
On 4/26/2026 5:26 AM, Thomas Koenig wrote:[...]
[...]However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very
special cases).
My understanding is that it might be worth distinguishing the case of
SIMD used for "tuples" and SIMD used for "arrays":
- By "tuples" I mean data that is inherently of fixed size, such as the
3 or 4 element vectors used to represent a point in 3-space, where
each element has a specific role.
SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
Not sure how much shuffle they may need.
- By "arrays", I mean data of a size that can be much larger than 3-4
elements and which often/usually varies dynamically.
When handling such data, SSE/AVX need to wrap the SIMD instructions
inside loops. vVM should handle that much better in most cases.
For the arrays case, it might be worth thinking about what kind of
shuffle is needed and why. IIUC the shuffle is not needed over the
whole array. Instead, it shows up for example when you do a reduction
on the array and after doing a fast traversal of the array you end up
with N partial-reduction results and the shuffles are needed to do the remaining log N steps to combine those N results.
What other cases of shuffles show up?
What would be the best way to handle them?
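(The combine step being described, written out in scalar C for a
power-of-two number of partials; each halving step is where a SIMD ISA
would need a shuffle:)

static float combine_partials(float p[], int n)
{
    for (int step = n / 2; step >= 1; step /= 2)
        for (int i = 0; i < step; i++)
            p[i] += p[i + step];   /* lane i combines with lane i+step */
    return p[0];
}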
On a related note, I'd love to see papers that try to reproduce in
a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
corresponds to a set of vectors and the control flow of SIMT warps could
be represented as mask bitvectors, but then there's the automatic scheduling/switching between warps, plus all kinds of other details
where the mapping between GPUs and vector CPUs doesn't seem so obvious.
=== Stefan
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
SIMD as you describe only adds the superluminal properties of x86
SIMD.
superluminal?
Thomas Koenig wrote:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
Shuffles and mixes are handled automatically by VVM, as long as you
have register names available to describe each part of the shuffle.
It does lead to much larger code to describe an arbitrary 16->16
shuffle operation, but for most algorithms not having to SIMD it
removes the actual need for shuffles.
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
Otherwise, what decisions could be taken in designing such
a SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel, makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
SIMD as you describe only adds the superluminal properties of x86
SIMD.
superluminal?
A single SIMD instruction nowhere near "in a loop".
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
I think this points to a way to express shuffling in VVM:
Have the shuffling specifier as an array (of, say, bytes) in memory.
When the program does something like:
loop over i
p = shuffle[i]
t = a[p]
b[i] = t
end
the hardware can assemble it into uops of the SIMD data path that it
has. The loop can contain additional instructions and maybe a
different way of using the result than storing it into b, but the
pattern to recognize is
p = shuffle[i]
t = a[p]
The size of the SIMD data path and of the elements in shuffle
determines how many microinstructions are necessary. E.g., if the
SIMD data path can handle up to 16 elements of a in one uop, and
shuffle contains values <32, each SIMD width requires 1 load from
shuffle, 2 loads from a, 2 shuffling uops and a merge. If shuffle
contains too large values compared to the SIMD width, falling back to
scalar may be more economical, but then SIMD ISAs would not have
supported the operation, either.
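(The pattern above as plain C, i.e. the loop the hardware would have to
recognize and map onto its shuffle/gather uops:)

#include <stddef.h>
#include <stdint.h>

void apply_shuffle(uint8_t *b, const uint8_t *a,
                   const uint8_t *shuffle, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[shuffle[i]];    /* p = shuffle[i]; t = a[p]; b[i] = t */
}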
The size of the elements of shuffle comes from the data path, and is
needed in the decoder, which is usually a problem, and usually solved
with a predictor. That can be done here, too. One can also imagine communicating the maximum size through the ISA, which may also serve
as a hint for the uarch to use shuffling.
BGB makes the point that SIMD should stop at 4-wide. I mostly
agree.
SIMD is useful for data-parallel tasks, many of which have a very wide
data width. Stopping at 4-wide is as sensible as stopping at 1-wide.
But given that VVM means that SIMD width is microarchitectural,
mistakes like BGB's point would have little consequence, because one
can always go wider while still using the same software.
Stefan Monnier <monnier@iro.umontreal.ca> posted:
BGB [2026-04-26 12:27:06] wrote:
On 4/26/2026 5:26 AM, Thomas Koenig wrote:[...]
[...]However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very
special cases).
My understanding is that it might be worth distinguishing the case of
SIMD used for "tuples" and SIMD used for "arrays":
An interesting observation--congratulations!
- By "tuples" I mean data that is inherently of fixed size, such as the
3 or 4 element vectors used to represent a point in #3 space, where
each element has a specific role.
SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
Not sure how much shuffle they may need.
When that calculation is in a Loop, I suspect vVM is competitive.
- By "arrays", I mean data of a size that can be much larger than 3-4
elements and which often/usually varies dynamically.
When handling such data, SSE/AVX need to wrap the SIMD instructions
inside loops. vVM should handle that much better in most cases.
Especially those cases where sizeof(ARRAY) mod 4 ~= 0 or sizeof()
is unknown at compile time.
For the arrays case, it might be worth thinking about what kind of
shuffle is needed and why.
In loop forms, shuffle is simply memory-indexing, while scatter/gather
is memory-indirect.
IIUC the shuffle is not needed over the
whole array. Instead, it shows up for example when you do a reduction
on the array and after doing a fast traversal of the array you end up
with N partial-reduction results and the shuffles are needed to do the
remaining log N steps to combine those N results.
Vector reduction {in all its forms} is not well done in SIMD if you want
the same result as you get from scalar code.
What other cases of shuffles show up?
What would be the best way to handle them?
On a related note, I'd love to see papers that try to reproduce in
a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
corresponds to a set of vectors and the control flow of SIMT warps could
be represented as mask bitvectors, but then there's the automatic
scheduling/switching between warps, plus all kinds of other details
where the mapping between GPUs and vector CPUs doesn't seem so obvious.
In a GPU one can take a tessellated globe and a bit-map of the earth
then use Texture-LDs to create a planet in space--no calculation instructions! {Sure there are zillions of calculations--all buried
in texture memory access.}
=== Stefan
GPU people talk of x lanes of calculation tied to a single
instruction {ala Burroughs Scientific Processor} where each lane has
its own register file. Where x = {8, 16, 32, 64, or larger}
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
SIMD as you describe only adds the superluminal properties of x86
SIMD.
superluminal?
A single SIMD instruction nowhere near "in a loop".
I still don't understand.
On Mon, 27 Apr 2026 19:32:28 GMT, MitchAlsup wrote:
GPU people talk of x lanes of calculation tied to a single
instruction {ala Burroughs Scientific Processor} where each lane has
its own register file. Where x = {8, 16, 32, 64, or larger}
Also, they are fond of conditional-execution instructions, are they
not. So each processing unit can do something slightly different,
depending on the data it holds, while continuing to execute exactly
the same instruction as all the other units.
That way, the architecture remains “SIMD”, instead of turning into “MIMD”.
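(In scalar terms, per-lane predication amounts to something like this
made-up example -- every "lane" executes the same instruction, and the
data-dependent condition becomes a lane mask:)

void cond_scale(float *x, const float *y, int n)
{
    for (int i = 0; i < n; i++)
        if (y[i] > 0.0f)     /* per-lane predicate */
            x[i] *= y[i];    /* masked operation   */
}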
SIMT where T means Thread instead of Data.
- By "tuples" I mean data that is inherently of fixed size, such as
the 3 or 4 element vectors used to represent a point in #3 space,
where each element has a specific role.
SSE/AVX approaches seem to work OK for such data, maybe better than
vVM.
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a
very wide data width. Stopping at 4-wide is as sensible as
stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each
core having SIMD width comparable to that of many (not all) CPUs.
GPU people talk of x lanes of calculation tied to a single instruction
{ala Burroughs Scientific Processor} where each lane has its own
register file. Where x = {8, 16, 32, 64, or larger}
My understanding, not necessarily precisely correct, but necessarily
close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM.
I.e. Nvidia GPUs have exactly the same SIMD width as AMD, Intel and
Fujitsu CPUs.
Plus, nowadays they have outer product (a.k.a. tensor) engines, but
that's OT.
On Mon, 27 Apr 2026 19:32:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a
very wide data width. Stopping at 4-wide is as sensible as
stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each
core having SIMD width comparable to that of many (not all) CPUs.
GPU people talk of x lanes of calculation tied to a single instruction
{ala Burroughs Scientific Processor} where each lane has its own
register file. Where x = {8, 16, 32, 64, or larger}
Can you point me to an example of "or larger" among current high-volume
GPU products?
My impression is that even x=64 is implemented as a pair of x=32 ALUs
running in lock step.
My understanding, not necessarily precisely correct, but necessarily close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM.
I.e. Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu CPUs.
Plus, nowadays they have outer product (a.k.a. tensor) engines, but that's OT.
On Tue, 28 Apr 2026 00:11:32 GMT, MitchAlsup wrote:
SIMT where T means Thread instead of Data.
Is that a meaningful distinction? What is the point of multiple units executing the same instruction, if not to operate on different data?
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 19:32:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a
very wide data width. Stopping at 4-wide is as sensible as
stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each
core having SIMD width comparable to that of many (not all) CPUs.
GPU people talk of x lanes of calculation tied to a single instruction
{ala Burroughs Scientific Processor} where each lane has its own
register file. Where x = {8, 16, 32, 64, or larger}
Can you point me to an example of "or larger" among current high-volume
GPU products?
I have insufficient data, as the [micro]architectures are rarely
allowed into the public domain.
My impression is that even x=64 is implemented as a pair of x=32 ALUs
running in lock step.
That's exactly how Larrabee started out, with four threads in a barrel
scheduler so that, per thread, the next instruction ran 4 cycles later.
Samsung (we) had 64 threads per instruction, operating over 4 clocks,
using 16 'ALUs', so we did not need forwarding even for FMAC
instructions.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VMM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections and is then used a lot by many unsuspecting people,
and its share of the *running time* could be much higher.
On 4/30/2026 5:36 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections and is then used a lot by many unsuspecting people,
and its share of the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.
This gave me an idea. People could create a new kind of "standard benchmark". Since, I believe, the algorithms, and in at least some
cases, reference code is publicly and freely available, it would be
possible to create a benchmark suite of these programs, with a set of standard input data for each (I am not sure it is worth it for Quake,
and there may be others that should be included).
A vendor could then present the results of running the benchmark two
ways. One way would be using the "standard" C compiler. The other way would allow use of assembler to squeeze the best performance, but the assembler source code must be provided.
Of course, this would be a supplement to things like SPEC; certainly not
a replacement. Is this a hare-brained scheme, or could/should it be
pursued?
Of course, this is very preliminary and I welcome any comments.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 4/30/2026 5:36 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections and is then used a lot by many unsuspecting people,
and its share of the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last
fraction of a percent, but with way larger usage in CPU years.
This gave me an idea. People could create a new kind of "standard
benchmark". Since, I believe, the algorithms, and in at least some
cases, reference code is publicly and freely available, it would be
possible to create a benchmark suite of these programs, with a set of
standard input data for each (I am not sure it is worth it for Quake,
and there may be others that should be included).
A vendor could then present the results of running the benchmark two
ways. One way would be using the "standard" C compiler. The other way
would allow use of assembler to squeeze the best performance, but the
assembler source code must be provided.
Of course, this would be a supplement to things like SPEC; certainly not
a replacement. Is this a hare-brained scheme, or could/should it be
pursued?
Of course, this is very preliminary and I welcome any comments.
Back in the late 1980s, the Mc 88000 C compiler* could compile a subroutine
of M88Ksim into a single <M88K> instruction and inline that at every
call point. It is all about reading the compiler asm output and finding
new things to put into the compiler.
(*) Greenhouse compiler I believe.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 4/30/2026 5:36 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections which is then used a lot by many unsuspecting people,
and the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last
fraction of a percent, but with way larger usage in CPU years.
This gave me an idea. People could create a new kind of "standard
benchmark". Since, I believe, the algorithms, and in at least some
cases, reference code is publicly and freely available, it would be
possible to create a benchmark suite of these programs, with a set of
standard input data for each (I am not sure it is worth it for Quake,
and there may be others that should be included).
A vendor could then present the results of running the benchmark two
ways. One way would be using the "standard" C compiler. The other way
would allow use of assembler to squeeze the best performance, but the
assembler source code must be provided.
Of course, this would be a supplement to things like SPEC; certainly not
a replacement. Is this a hare-brained scheme, or could/should it be
pursued?
Of course, this is very preliminary and I welcome any comments.
Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
of M88Ksim into a single <M88K> instruction and inline that at every
call point. It is all about reading the compiler asm output and finding
new things to put into the compiler.
Was the Green Hills compiler available then? We were still using
the Moto PCC-based M88K compiler in 1990.
(*) Greenhouse compiler I believe.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
of M88Ksim into a single <M88K> instruction and inline that at every
call point. It is all about reading the compiler asm output and finding
new things to put into the compiler.
Was the Green Hills compiler available then? We were still using
the Moto PCC-based M88K compiler in 1990.
I worked on DG AViiONs in 1990 and 1991, and we had a Green Hills
compiler installed as well as gcc. We used gcc.
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections which is then used a lot by many unsuspecting people,
and the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.
Besides, this is some of the most fun code to figure out; the fact that
the results are sometimes actually useful is just gravy.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:What you say is true for 99.99% or more of code, but less than 100%.
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VMM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
Some people may write highly optimized code which runs in hot
sections which is then used a lot by many unsuspecting people,
and the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
As a matter of fact, I had two persons in mind when I wrote this,
and one of them was you :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.
Besides, this is some of the most fun code to figure out; the fact that the results are sometimes actually useful is just gravy.
:-)
Maybe another example, the "hello world" of high-performance
computing: An 8*8 matrix kernel, so C = C + A*B.
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning; a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
I do not see how VVM could express this equally succinctly;
there are simply not the (architectural) registers to load
the 64 values of A to start with. (Unless I am mistaken
and there is a way to express this in VVM - Mitch?)
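(The C source behind the link is presumably close to the following;
the exact code there may differ in details:)

#define N 8

void mm8(double c[N][N], const double a[N][N], const double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];   /* C = C + A*B */
}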
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
-------------------------------------------------------------
Maybe another example, the "hello world" of high-performance
computing: An 8*8 matrix kernel, so C = C + A*B.
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning; a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
I do not see how VVM could express this equally succinctly;
there are simply not the (architectural) registers to load
the 64 values of A to start with. (Unless I am mistaken
and there is a way to express this in VVM - Mitch?)
Thomas Koenig <tkoenig@netcologne.de> posted:
-------------------------------------------------------------
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
On Fri, 01 May 2026 07:48:50 GMT, Anton Ertl wrote:
I worked on DG Aviions in 1990 and 1991, and we had a Green Hills
compiler installed as well as gcc. We used gcc.
Had Green Hills caught up to ANSI C by that point?
I was a heavy user of Apple’s MPW development environment from the
late 1980s onwards. They initially offered a C compiler licensed from
Green Hills, which was not ANSI-compliant. Then with MPW 3.0, they
replaced it with their own in-house-developed ANSI-compliant C
compiler, the one with the slightly tongue-in-cheek (or is that
passive-aggressive?) error messages.
Thomas Koenig <tkoenig@netcologne.de> posted:
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning; a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
# define N 7 and the code all goes to heck !
Is that really what one wants ?!?
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
    MOV  R7,#0
-------------------------------------------------------------
Loop2:
    MOV  R8,#0
    MOV  R5,#0
    MOV  R4,R7
-------------------------------------------------------------
    VEC  R15,{}              ; nothing live out of loop
-------------------------------------------------------------
loop3:
    LDD  R10,[R1,R5<<3]      LDD  R10,[R1,R5<<3]
    LDD  R11,[R2,R6<<3]      LDD  R11,[R2,R6<<3]
    LDD  R12,[R3,R4<<3]      LDD  R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12     FMAC R12,R10,R11,R12
    STD  R12,[R3,R4<<3]      STD  R12,[R3,R4<<3]
    ADD  R4,R4,R7            ADD  R4,R4,R7
-------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
    ADD  R8,R8,#8
    ADD  R5,R5,#1
    CMP  R13,R5,#8
    BLE  R13,Loop2
-------------------------------------------------------------
    ADD  R7,R7,#8
    CMP  R13,R7,#64
    BLE  R13,Loop1
-------------------------------------------------------------
    RET
Where the doubled-up column shows the instructions which run
on a per-lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
And it did not even have to push registers onto the stack!
20 total instructions, 80 bytes.
Oh, and BTW: most of the compilers on godbolt report a compile error--
I tried a fairly big sample across every architecture. After changing
back to K&R C, every one could compile.
And what I posted above is obviously hand-optimized assembly;
something generated by a compiler would be worse.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Thomas Koenig <tkoenig@netcologne.de> posted:
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning, a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
# define N 7 and the code all goes to heck !
Is that really what one wants ?!?
That's auto-vectorization. I have heard that code for the Cray-1
contains a lot of "64"s.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
Plus, the setup time for VVM...
So, including that and the loop overhead, one could expect around
0.7 FMAs per cycle, correct?
With AVX512, it is possible to run 16 FMAs per cycle.
Divide that
by a factor < 2 for overhead, and you run at maybe around 10 FMAs
per cycle.
BTW, the code generated by gcc is anything but ideal because of
the dependency chain on zmm0. That is probably worth a PR.
And it did not even have to push registers onto the stack!
20 total instructions, 80 bytes.
Oh, and BTW: most of the compilers in godbolt take a compile error --
I tried a fairly big sample across every architecture. Changing back
to K&R C, every one could compile.
if they cannot grok restrict, they are not really modern :-)
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
Plus, the setup time for VVM...
I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.
So, including that and the loop overhead, one could expect around
0.7 FMAs per cycle, correct?
0.7*lanes, maybe 0.9*lanes if my VEC fix works.
With AVX512, it is possible to run 16 FMAs per cycle.
Your code can only use 8 FMACs per cycle in any event,
and there
is overhead due to the other instructions in the loop...
Divide that
by a factor < 2 for overhead, and you run at maybe around 10 FMAs
per cycle.
BTW, the code generated by gcc is anything but ideal because of
the dependency chain on zmm0. That is probably worth a PR.
And it did not even have to push registers onto the stack!
20 total instructions, 80 bytes.
SIMD got 26 instructions, likely longer than 4 bytes each due to
the prefixes needed to encode the various SIMD lengths.
Oh, and BTW: most of the compilers in godbolt take a compile error --
I tried a fairly big sample across every architecture. Changing back
to K&R C, every one could compile.
if they cannot grok restrict, they are not really modern :-)
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.
I think the R6 usage is off (not unusual in hand-coded assembly,
as I know only too well myself :-)
But let's look at memory access. Like you said, in the code
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
b[k + j*N] is invariant for the innermost loop. So, for N=8, there
are 64 double reads for b. For a and c there are 512 reads of doubles
each, and 512 doubles are written for c. In total, 1600 memory
accesses of doubles.
By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
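To make the tally concrete, here is a small self-contained C sketch
(it only counts accesses from the formulas above; the names are mine):

#include <stdio.h>

int main(void)
{
    const long n = 8;
    /* naive loop nest: b is read once per (j,k); a and c are read,
       and c is written, once per (i,j,k) */
    long naive = n*n + n*n*n + n*n*n + n*n*n;
    /* register-blocked SIMD: a, b and c are each read once,
       and c is written once */
    long simd = 3*n*n + n*n;
    printf("naive %ld, simd %ld, ratio %.2f\n",
           naive, simd, (double)naive / simd);
    /* prints: naive 1600, simd 256, ratio 6.25 */
    return 0;
}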
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth
of the loop is 6-cycles, so the 8-wide machine would run the
loop in 8-cycles of latency.
Plus, the setup time for VVM...
I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.
So, including that and the loop overhead, one could expect around
0.7 FMAs per cycle, correct?
0.7*lanes, maybe 0.9*lanes if my VEC fix works.
With AVX512, it is possible to run 16 FMAs per cycle.
Your code can only use 8 FMACs per cycle in any event,
Actually not correct. Zen5 has a reciprocal throughput of 0.5
for FMA; it can run two instructions in parallel.
and there
is overhead due to the other instructions in the loop...
Looking at
https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html
where they put a lot of work into optimizing matmul, one can see
they reached around 125 gflops on matrix sizes that include some
odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
that is 30.34 flops per cycle, which translates into 15.16 FMAs
per cycle, which is extremely close to 16 and shows that the
loop overhead is pretty much absorbed in the OoO handling.
(It also shows that the gcc-generated code I linked to is anything
but ideal due to the dependency chain it contains).
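One way to get rid of such a chain (a sketch only, not what gcc
emits; the function name is mine) is to accumulate two independent
columns of c per outer iteration, so the two FMA pipes always see
independent chains:

#define N 8
void mm8_2col(const double * restrict a, const double * restrict b,
              double * restrict c)
{
    for (int j = 0; j < N; j += 2) {
        for (int i = 0; i < N; i++) {
            double c0 = c[i + j*N];     /* chain 1 */
            double c1 = c[i + (j+1)*N]; /* chain 2, independent of 1 */
            for (int k = 0; k < N; k++) {
                c0 += a[i + k*N] * b[k + j*N];
                c1 += a[i + k*N] * b[k + (j+1)*N];
            }
            c[i + j*N] = c0;
            c[i + (j+1)*N] = c1;
        }
    }
}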
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
I think the below is correct !?! Hand compiled
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
-------------------------------------------------------------
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
Let's compare it directly. Posting URLs is not good for discussion.
So the source code in the example is:
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
and the output of gcc-16.1 is (after cleanup by godbolt):
mm8:
vmovupd (%rdi), %zmm8
vmovupd 64(%rdi), %zmm7
vmovupd 128(%rdi), %zmm6
vmovupd 192(%rdi), %zmm5
vmovupd 256(%rdi), %zmm4
vmovupd 320(%rdi), %zmm3
vmovupd 384(%rdi), %zmm2
vmovupd 448(%rdi), %zmm1
movq %rsi, %rax
leaq 512(%rdx), %rcx
.L2:
vbroadcastsd (%rax), %zmm0
vfmadd213pd (%rdx), %zmm8, %zmm0
addq $64, %rdx
addq $64, %rax
vfmadd231pd -56(%rax){1to8}, %zmm7, %zmm0
vfmadd231pd -48(%rax){1to8}, %zmm6, %zmm0
vfmadd231pd -40(%rax){1to8}, %zmm5, %zmm0
vfmadd231pd -32(%rax){1to8}, %zmm4, %zmm0
vfmadd231pd -24(%rax){1to8}, %zmm3, %zmm0
vfmadd231pd -16(%rax){1to8}, %zmm2, %zmm0
vfmadd231pd -8(%rax){1to8}, %zmm1, %zmm0
vmovupd %zmm0, -64(%rdx)
cmpq %rdx, %rcx
jne .L2
vzeroupper
ret
So your code reflects the three loops of the source code, with the
inner loop being sped up by VVM, ideally such that only one
microarchitectural iteration of the inner loop is necessary. You
still have the other loops, and all the memory accesses.
By contrast, the AVX-512 code produced by gcc-16.1 unrolls one loop
level into using AVX-512 instructions and another loop level into
using 8 different zmm registers. As a consequence, there is only one
loop level left, and every byte of the array c is only stored once.
Also, every byte of a and b is only loaded once (but the accesses to
the b array are 8 bytes at a time, so 64 loads are needed for that,
while a is loaded with 8 64-byte loads and c is stored with 8 64-byte
stores).
For VVM we can make use of 8 registers to achieve one level of
unrolling, but I don't see how to reuse the registers as it is done
with the a array in zmm1..zmm8 in the AVX-512 code. So one would have
to load all of a on every iteration of the outer loop, instead of
pulling these loads out of the loop as it is done by gcc-16.1.
The VVM code would maybe look somewhat like:
loop1:
... loop overhead left as exercise ...
vec i
loop2:
ldd r17=c[...]
ldd r1=a[i]
ldd r2=a[i+8]
...
ldd r8=a[i+56]
ldd r9=b[...]
fmac r18=r1*r9+r17
...
ldd r16=b[...]
fmac r25=r8*r16+r24
std r25->c[...]
loop1 ...
... loop overhead ...
ble ..., loop1
ret
I don't know how well VVM handles the dependency chain of the FMACs
(r17->r18...r25). One could use the same register here, as is done
with zmm0 in the AVX-512 code, but I do not know if VVM would accept
that.
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
In an OoO machine the latency within the loop is not very relevant,
even the dependence chain of the 8 vfmadd231pd instructions typically
is not, because the next iteration does not depend on the previous
one, apart from the loop counter updates, and even that has become
zero-cycle latency in recent Intel CPUs. So the next iteration can
start immediately, limited only by the available resources.
Oh, and BTW: most of the compilers in godbolt take a compile error --
I tried a fairly big sample across every architecture.
I tried clang-22, and it compiled the code. I also tried clang-11 and
gcc-11 and they errored out complaining about the -march=znver5 flag,
because they do not know this architecture (but still, why report an
error for that?). Deleting that flag produced SSE2 code on both
compilers, as expected.
On Sat, 02 May 2026 16:29:03 GMT
I tried clang-22, and it compiled the code. I also tried clang-11 and
gcc-11 and they errored out complaining about the -march=znver5 flag,
because they do not know this architecture (but still, why report an
error for that?). Deleting that flag produced SSE2 code on both
compilers, as expected.
Why would you think that the code generated for Zen5 would be different
from code generated for other AVX512 targets with 2 FMA pipes, like
any Intel server core starting from Skylake-SP?
Specific targets I would try are:
skylake-avx512
cascadelake
icelake-server
sapphirerapids
emeraldrapids
graniterapids
The last three are likely unsupported by clang-11, which is quite
ancient, but with considerably newer gcc11 at least sapphirerapids
should work.
Anyway, cascadelake should be supported by all of them and I don't
expect meaningful differences in code generation between cascadelake
and znver5.
Michael S <already5chosen@yahoo.com> writes:
On Sat, 02 May 2026 16:29:03 GMT
I tried clang-22, and it compiled the code. I also tried clang-11
and gcc-11 and they errored out complaining about the
-march=znver5 flag, because they do not know this architecture
(but still, why report an error for that?). Deleting that flag
produced SSE2 code on both compilers, as expected.
Why would you think that the code generated for Zen5 would be
different from code generated for other AVX512 targets with 2 FMA
pipes, like any Intel server core starting from Skylake-SP?
I did not express anything in that direction. However, now that you
ask, my experience is that it was more difficult than I had
expected to get gcc (13 IIRC) to produce AVX-512 code, even with
explicit vectorization. The actual target was a Rocket Lake machine,
but using -march=native on that produced AVX-256 code (for the
programs that I tried). IIRC I also tried specifying one other Intel
uarch, and the code was not satisfactory, either, but I don't remember
the details. Eventually I tried -march=znver4, and that worked, so I
stuck with that.
BTW, one thing that I find unsatisfactory about "x86-64-v4" is that it
does not include the ADX instructions.
Specific targets I would try are:
skylake-avx512
cascadelake
icelake-server
sapphirerapids
emeraldrapids
graniterapids
The last three are likely unsupported by clang-11, which is quite
ancient, but with considerably newer gcc11 at least sapphirerapids
should work.
Anyway, cascadelake should be supported by all of them and I don't
expect meaningful differences in code generation between cascadelake
and znver5.
Good to know. Anyway, I did not try to get older compilers to produce
AVX-512 for <2026May2.182903@mips.complang.tuwien.ac.at>. Instead, I
tried gcc-11 and clang-11 to find out why "most of the compilers in
godbolt take a compile error", as Mitch Alsup claimed. It appears that
most of the compilers barf on the flag "-march=znver5".
- anton
sapphirerapids is recognized by gcc-11. Here, too, no version of gcc
generates AVX512 code for this kernel.
In order to convince gcc to generate AVX-512 on skylake-avx512 I had to
go back to gcc7.
skylake-avx512: recognized since clang-3.9, generates semi-reasonable
avx512 code with clang-3.9 to 9.
In this particular case it could be a blessing for Intel, because newer
clang code generated for znver5 used 512-bit SIMD but looks horrible.
I fully expect that it is slower than more conservative code for Intel
targets.
Michael S <already5chosen@yahoo.com> writes:
In order to convince gcc to generate AVX-512 on skylake-avx512 I had
to go back to gcc7.
skylake-avx512: recognized since clang-3.9, generates semi-reasonable
avx512 code with clang-3.9 to 9.
gcc-8.1 was released on May 2, 2018.
clang-10.0 was released on 24 March 2020.
Alder Lake was released in late 2021, but initially had some AVX-512
support that was disabled via firmware later.
Given how indecisive Intel has been on Alder Lake, it seems unlikely
that they eliminated the AVX-512 usage as early as gcc-8 in order to
avoid disappointments when Alder Lake was released. But somehow I
cannot come up with a different explanation.
In this particular case it could be a blessing for Intel, because
newer clang code generated for znver5 used 512-bit SIMD but looks
horrible. I fully expect that it is slower than more conservative
code for Intel targets.
The whole auto-vectorization stuff tends to be pretty unreliable
overall. Sometimes the code that is generated looks good, sometimes
it is a mess, sometimes there is no auto-vectorization at all.
- anton
Michael S <already5chosen@yahoo.com> schrieb:
sapphirerapids is recognized by gcc-11. Here, too, no version of gcc
generates AVX512 code for this kernel.
I think it does, but you have to specify additional options.
Historically, AVX512 has been a low performer due to frequency
throttling etc.
You may have to specify -mprefer-vector-width=512.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
b[k + j*N] is invariant for the innermost loop. So, for N=8, there
are 64 double reads for b. For a and c there are 512 reads of doubles
each, and 512 doubles are written for c. In total, 1600 memory
accesses of doubles.
By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
#define N 8
void mm8(double * const restrict a, double * const restrict b,
         double * restrict c)
{
    for (int j=0; j<N; j++) {
        double c0 = c[0 + j*N];
        double c1 = c[1 + j*N];
        double c2 = c[2 + j*N];
        double c3 = c[3 + j*N];
        double c4 = c[4 + j*N];
        double c5 = c[5 + j*N];
        double c6 = c[6 + j*N];
        double c7 = c[7 + j*N];
        for (int k=0; k<N; k++) {
            double bk = b[k + j*N];
            c0 += a[0 + k*N] * bk;
            c1 += a[1 + k*N] * bk;
            c2 += a[2 + k*N] * bk;
            c3 += a[3 + k*N] * bk;
            c4 += a[4 + k*N] * bk;
            c5 += a[5 + k*N] * bk;
            c6 += a[6 + k*N] * bk;
            c7 += a[7 + k*N] * bk;
        }
        /* write back c0 to c7 */
        c[0 + j*N] = c0;
        c[1 + j*N] = c1;
        c[2 + j*N] = c2;
        c[3 + j*N] = c3;
        c[4 + j*N] = c4;
        c[5 + j*N] = c5;
        c[6 + j*N] = c6;
        c[7 + j*N] = c7;
    }
}
where the loop over k could be vectorized, but that would still
leave excessive memory traffic for a.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Your code can only use 8 FMACs per cycle in any event,
Actually not correct. Zen5 has a reciprocal throughput of 0.5
for FMA; it can run two instructions in parallel.
and there
is overhead due to the other instructions in the loop...
Looking at
https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html
where they put a lot of work into optimizing matmul, one can see
they reached around 125 gflops on matrix sizes that include some
odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
that is 30.34 flops per cycle, which translates into 15.16 FMAs
per cycle, which is extremely close to 16 and shows that the
loop overhead is pretty much absorbed in the OoO handling.
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
C version, with the loop invariant hoisted, and cursoring:
#define N 8
void mm8(double *a, double *b, double *c)
{
    int i, k, jN, kN;
    double *AcijN, *AbkjN, *AaikN;
    double bN;
    for( jN=0; jN<N*N; jN+=N ) {
        AcijN = &c[jN];
        AbkjN = &b[jN];
        for( kN=k=0; k<N; k++,kN+=N ) {
            AaikN = &a[kN];
            bN = AbkjN[k];
            for( i=0; i<N; i++ ) {
                AcijN[i] += AaikN[i] * bN;
            }
        }
    }
}
b[k + j*N] is invariant for the innermost loop. So, for N=8, there
are 64 double reads for b. For a and c there are 512 reads of doubles
each, and 512 doubles are written for c. In total, 1600 memory
accesses of doubles.
By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
It occurs to me that c[*] should be set to zero for a "real" matrix
multiply... as is, c[*] is both input and output.
ENTER Rc1,Rc8,#0 ; preserve c[1..8]
MOV RjN,#0 ; R4
loop1:
LA Rca,[Rc,RjN<<3] ; &c[1..8]
LDD Rc1,[Rca,#0] ; R23
LDD Rc2,[Rca,#8]
LDD Rc3,[Rca,#16]
LDD Rc4,[Rca,#24]
LDD Rc5,[Rca,#32]
LDD Rc6,[Rca,#40]
LDD Rc7,[Rca,#48]
LDD Rc8,[Rca,#56] ; R30
MOV RkN,#0 ; R5
---------------begin vectorize-------------------
VEC 8,{Rc1..Rc8}
loop2:
LDD Rbk,[R2,RjN<<3] ; R6
LA RakN,[Ra,RkN<<3] ; R7
LDD Ra1,[RakN,#0] ; R8
FMAC Rc1,Ra1,Rbk,Rc1 ; R23
LDD Ra2,[RakN,#8] ; R7
FMAC Rc2,Ra2,Rbk,Rc2 ; R24
LDD Ra3,[RakN,#16] ; R7
FMAC Rc3,Ra3,Rbk,Rc3 ; R25
LDD Ra4,[RakN,#24] ; R7
FMAC Rc4,Ra4,Rbk,Rc4 ; R26
LDD Ra5,[RakN,#32] ; R7
FMAC Rc5,Ra5,Rbk,Rc5 ; R27
LDD Ra6,[RakN,#40] ; R7
FMAC Rc6,Ra6,Rbk,Rc6 ; R28
LDD Ra7,[RakN,#48] ; R7
FMAC Rc7,Ra7,Rbk,Rc7 ; R29
LDD Ra8,[RakN,#56] ; R7
FMAC Rc8,Ra8,Rbk,Rc8 ; R30
LOOP1 LE,RkN,#8,#64 ; R4
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
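As a rough C model of what such indexing would buy (a sketch; it
pretends that a small constant-size local array can live entirely in
the register file, which plain C cannot guarantee):

#define N 8
void mm8_rf(const double * restrict a, const double * restrict b,
            double * restrict c)
{
    double areg[N*N];   /* stand-in for an indexable register file */
    for (int t = 0; t < N*N; t++)
        areg[t] = a[t]; /* all of a[] read from memory exactly once */
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            double bk = b[k + j*N];
            for (int i = 0; i < N; i++)
                c[i + j*N] += areg[i + k*N] * bk; /* no a[] traffic */
        }
    }
}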
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Your code can only use 8 FMACs per cycle in any event,
Actually not correct. Zen5 has a reciprocal throughput of 0.5
for FMA; it can run two instructions in parallel.
and there
is overhead due to the other instructions in the loop...
Looking at
https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html
where they put a lot of work into optimizing matmul, one can see
they reached around 125 gflops on matrix sizes that include some
odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
that is 30.34 flops per cycle, which translates into 15.16 FMAs
per cycle, which is extremely close to 16 and shows that the
loop overhead is pretty much absorbed in the OoO handling.
But your loop only has 4×4 matrices.
Yes, MM with big matrices reaches above 90% of theoretical perf;
small ones do not.
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
I don't recall the limit on the number of statements in a VVM loop;
what is it?
[...]
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
What are its drawbacks? Do register accesses get slower?
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
What are its drawbacks? Do register accesses get slower?
Some time ago, I wrote an inline version of matmul for gfortran
because small matrices (especially if their size is known at compile
time) are handled very inefficiently by external packages.
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
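A sketch of how that aliasing might look to software (hypothetical
ISA and address window, purely illustrative):

#include <stdint.h>

/* Hypothetical: the integer register file is aliased into a reserved
   address window, so register n can be selected by an ordinary
   indexed load even when n is only known at run time. */
#define REGFILE_WINDOW ((volatile uint64_t *)(uintptr_t)0xFFFF0000u)

static inline uint64_t read_reg_indexed(unsigned n)
{
    return REGFILE_WINDOW[n];  /* one indexed load on such an ISA */
}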
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
It would have to be implemented. How?
And how does the supposed
rareness help?
Remember the subject: You suggested this mechanism as a way to
eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
multiplication, and the disadvantage was that VVM cannot eliminate
some memory accesses that AVX-512 can. Turning the registers into
memory does not solve that, and probably incurs additional costs.
This cure is worse than the disease.