There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier for compilers
to use than the current state of the art, where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards, where bit n is set on lane
m if lanes n and m hold the same value.
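(For concreteness, a scalar reference of what such a match operation
computes, written for an 8-lane example; the function name is made up:)

#include <stdint.h>

/* bit n of out[m] is set iff lanes n and m hold the same value */
void lane_match(const uint32_t v[8], uint8_t out[8])
{
    for (int m = 0; m < 8; m++) {
        uint8_t bits = 0;
        for (int n = 0; n < 8; n++)
            if (v[n] == v[m])
                bits |= (uint8_t)(1u << n);
        out[m] = bits;
    }
}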
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if used a lot, can give significant
speedup.
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to one of several versions of the code.
The feature set should be constant across all vector widths.
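(For illustration, dispatch on such an enquiry could look roughly like
this in C; simd_width_bytes() and the kernel variants are hypothetical
names, not an existing API:)

#include <stddef.h>

extern size_t simd_width_bytes(void);   /* hypothetical feature enquiry */
void saxpy_512(float *y, const float *x, float a, size_t n);
void saxpy_256(float *y, const float *x, float a, size_t n);
void saxpy_scalar(float *y, const float *x, float a, size_t n);

void saxpy(float *y, const float *x, float a, size_t n)
{
    switch (simd_width_bytes()) {      /* branch to one of several versions */
    case 64: saxpy_512(y, x, a, n);    break;
    case 32: saxpy_256(y, x, a, n);    break;
    default: saxpy_scalar(y, x, a, n); break;
    }
}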
No vector registers should be used in interrupts :-)
It might make sense for a process to announce to the OS which
vector registers it uses for faster system calls.
Data types: 8-, 16-, 32-, 64-bit ints; also 128-bit?
FP types: Same as above?
Other points?
On 4/26/2026 5:26 AM, Thomas Koenig wrote:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very special cases).
If SIMD goes wider than 4 elements, the complexity curve quickly becomes unmanageable.
It is almost preferable, IMO, to allow superscalar execution on SIMD
vectors rather than to go to wider SIMD vectors. Granted, how much this
scales depends more on the size of the register file; there is in
effect a hard limit here.
Personally, I don't really trust automatic vectorization, as it is sort
of a double-edged sword between "faster" and "slow and bloated"
(particularly with MSVC).
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
The feature set should be constant across all vector widths.
Or keep all registers the same size, say, 64 bits.
If you do a 128-bit op, it can use pairs.
Practically, a 4-wide SIMD op can be seen internally as two 2-wide operations glued together (or the 128-bit SIMD operation effectively splitting into two co-issued 64-bit operations).
Well, Load/Store and SHUF operations effectively take away cycles
in which the SIMD unit could work. This only partly applies: a 4x
Binary16 SIMD instruction can co-issue with a Load or a SHUF, but the
128-bit instructions can't co-issue at all (each 128-bit instruction
effectively uses the whole pipeline width).
No vector registers should be used in interrupts :-)
It might make sense for a process to announce to the OS which
vector registers it uses for faster system calls.
Or, no vector registers exist at all...
Thomas Koenig <tkoenig@netcologne.de> posted:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
SIMD as a way to calculate more things per cycle is fine. vVM will
end up doing multi-lane calculations SIMD style.
SIMD as a way to consume vast quantities of ISA space is not.
Done right, there is no need for a vector RF and the associated
context switch overhead.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory.
I have been considering adding a Permute instruction to My 66000
ISA over the last 2 weeks. Fixed permutes need a 24-bit constant
(Encryption) while variable permutes can use a register value.
It seems permute is to go with a carryless multiply.
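(A generic scalar reference of what a variable byte permute computes
over a 64-bit register -- not necessarily how My 66000 defines it:)

#include <stdint.h>

/* result byte i = source byte sel[i]; 3 selector bits per lane */
uint64_t permute8(uint64_t src, uint64_t sel)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned s = (unsigned)(sel >> (8 * i)) & 7;
        r |= ((src >> (8 * s)) & 0xFF) << (8 * i);
    }
    return r;
}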
These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
That is the problem with SIMD where it is used to make library code
fast (str*, mem*): too many interrupt handlers and OS handlers want
to use fast versions of those libraries.
Otherwise, what decisions could be taken in designing such
a SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
This is where SIMD consumes a Cartesian product of ISA space.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
SIMD as a way to calculate more things per cycle is fine. vVM will
end up doing multi-lane calculations SIMD style.
SIMD as a way to consume vast quantities of ISA space is not.
So it should be done right :-)
Done right, there is no need for a vector RF and the associated
context switch overhead.
That is what I am doubting.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory.
I have been considering adding a Permute instruction to My 66000
ISA over the last 2 weeks. Fixed permutes need a 24-bit constant (Encryption) while variable permutes can use a register value.
Would permute work over larger blocks like 256 or 512 bits?
Can the results from permute be re-used in registers, or would
they have to be reloaded from memory?
It seems permute is to go with a carryless multiply.
I do not understand that sentence.
These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
That's the point, I am trying to explore the "maybe".
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
That is the problem with SIMD where it is used to make library code
fast (str*, mem*): too many interrupt handlers and OS handlers want
to use fast versions of those libraries.
For code which can be efficiently vectorized, like str* and mem*,
you are correct. I am talking about the cases where it is not.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
This is where SIMD consumes a Cartesian product of ISA space.
It does not have to, I think.
Let's see what an instruction modifier (like CARRY) could look
like.
Arithmetic operations like ADD have two bits for size in the newest
version of the ISA, so the size of the units to be operated upon
is known.
The size of the SIMD register could be encoded in the otherwise
unused SRC1 field; five bits are certainly enough for that.
There are a maximum of three source registers in each instruction,
so three bits are enough to encode if a register is an SIMD or a
regular register for each instruction - room enough for a shadow
of five instructions.
Predicates would work fine with SIMD code, I think.
So, no combinatorial explosion that I can see.
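(Purely as a sketch of the encoding budget described above -- the field
layout is speculative and not part of any actual ISA:)

#include <stdint.h>

struct simd_modifier {        /* hypothetical 32-bit modifier         */
    uint32_t opcode    : 6;   /* the modifier itself, like CARRY      */
    uint32_t width     : 5;   /* SIMD register size, in the otherwise */
                              /* unused SRC1 field                    */
    uint32_t simd_srcs : 15;  /* 3 bits x 5 shadowed instructions:    */
                              /* which source registers are SIMD      */
    uint32_t reserved  : 6;
};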
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
SIMD as a way to calculate more things per cycle is fine. vVM will
end up doing multi-lane calculations SIMD style.
SIMD as a way to consume vast quantities of ISA space is not.
So it should be done right :-)
Done right, there is no need for a vector RF and the associated
context switch overhead.
That is what I am doubting.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory.
I have been considering adding a Permute instruction to My 66000
ISA over the last 2 weeks. Fixed permutes need a 24-bit constant
(Encryption) while variable permutes can use a register value.
Would permute work over larger blocks like 256 or 512 bits?
Can the results from permute be re-used in registers, or would
they have to be reloaded from memory?
Like everything else, ISA describes register-width units of calculation
and memory references, while vVM allows bundling these into calculations
as wide as CPU designers desire/allow--without changing ISA.
But the target for permutes is faster swizzling of data in cyphers.
After swizzling, the bytes are multiplied SIMD-style (carryless
multiply). Put these in a loop and one has faster cyphers via SIMD
expressed in vVM style.
It seems permute is to go with a carryless multiply.
I do not understand that sentence.
64×64 multiply where only the XOR gates process, since the majority
gates are continuously de-asserted. Same gates as a standard multiplier,
except the 3-2 majority gate has 2 more transistors that turn it off
altogether. This logic is also used when one wants 2{32×32} multipliers
or 4{16×16} multipliers, ...
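(In scalar C, the carryless multiply being described is simply a
multiply whose partial products are combined with XOR instead of
addition:)

#include <stdint.h>

/* low 64 bits of the 128-bit carryless product */
uint64_t clmul_lo(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;    /* XOR-accumulate shifted partial product */
    return r;
}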
These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
That's the point, I am trying to explore the "maybe".
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
That is the problem with SIMD where it is used to make library code
fast (str*, mem*): too many interrupt handlers and OS handlers want
to use fast versions of those libraries.
For code which can be efficiently vectorized, like str* and mem*,
you are correct. I am talking about the cases where it is not.
Otherwise, what decisions could be taken in designing such
an SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
This is where SIMD consumes a Cartesian product of ISA space.
It does not have to, I think.
Things vVM does better than SIMD: mixed operand and result widths,
stride-based memory accesses, scatter/gather memory accesses. For
example: Loop{LDByte LDHalf ADD-reduce} ST Doubleword is no problem
for vVM and is not possible in any SIMD ISA.
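(The scalar loop corresponding to that example might look like this,
with made-up names; the final store of the sum is the ST Doubleword:)

#include <stddef.h>
#include <stdint.h>

int64_t add_reduce(const int8_t *a, const int16_t *b, size_t n)
{
    int64_t sum = 0;                   /* doubleword result          */
    for (size_t i = 0; i < n; i++)
        sum += (int64_t)a[i] + b[i];   /* LDByte, LDHalf, ADD-reduce */
    return sum;
}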
Let's see what an instruction modifier (like CARRY) could look
like.
Arithmetic operations like ADD have two bits for size in the newest
version of the ISA, so the size of the units to be operated upon
is known.
The {Sign}{Size} of memory references and integers is known in the instruction.
The size of the SIMD register could be encoded in the otherwise
unused SRC1 field; five bits are certainly enough for that.
Sure, those bits could be used to do that--but do you even need a
SIMD RF at all? vVM allows buffering between cache and data path to
provide the width specification as an implementation-by-implementation
choice--and without any perturbation to SW. So, vVM code written for
the smallest possible machine will run near optimally on the largest
possible machine. Would your SIMD Instruction-Modifier have that same
property ??
There are a maximum of three source registers in each instruction,
so three bits are enough to encode if a register is an SIMD or a
regular register for each instruction - room enough for a shadow
of five instructions.
Predicates would work fine with SIMD code, I think.
In vVM, predicate instructions are directly turned into lane masks
without having to be SIMD instructions creating lane masks; branches
{of a short forward nature} can be done similarly in larger
implementations.
So, no combinatorial explosion that I can see.
SIMD as you describe only adds the superluminal properties of x86
SIMD.
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
BGB makes the point that SIMD should stop at 4-wide. I mostly
agree.
On 4/26/2026 5:26 AM, Thomas Koenig wrote:[...]
[...]However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very
special cases).
SIMD is useful for data-parallel tasks, many of which have a very wide
data width. Stopping at 4-wide is as sensible as stopping at 1-wide.
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a very
wide data width. Stopping at 4-wide is as sensible as stopping at
1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a very
wide data width. Stopping at 4-wide is as sensible as stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each core
having SIMD width comparable to that of many (not all) CPUs.
My understanding, not necessarily precisely correct, but necessarily
close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada
Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM. I.e.
Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu
CPUs.
Plus, nowadays they have outer product (a.k.a. tensor) engines, but
that's OT.
BGB [2026-04-26 12:27:06] wrote:
On 4/26/2026 5:26 AM, Thomas Koenig wrote:[...]
[...]However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very
special cases).
My understanding is that it might be worth distinguishing the case of
SIMD used for "tuples" and SIMD used for "arrays":
- By "tuples" I mean data that is inherently of fixed size, such as the
3 or 4 element vectors used to represent a point in 3-space, where
each element has a specific role.
SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
Not sure how much shuffle they may need.
- By "arrays", I mean data of a size that can be much larger than 3-4
elements and which often/usually varies dynamically.
When handling such data, SSE/AVX need to wrap the SIMD instructions
inside loops. vVM should handle that much better in most cases.
For the arrays case, it might be worth thinking about what kind of
shuffle is needed and why. IIUC the shuffle is not needed over the
whole array. Instead, it shows up for example when you do a reduction
on the array and after doing a fast traversal of the array you end up
with N partial-reduction results and the shuffles are needed to do the remaining log N steps to combine those N results.
What other cases of shuffles show up?
What would be the best way to handle them?
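(The combine step being described, written out in scalar C for a
power-of-two number of partials; each halving step is where a SIMD ISA
would need a shuffle:)

static float combine_partials(float p[], int n)
{
    for (int step = n / 2; step >= 1; step /= 2)
        for (int i = 0; i < step; i++)
            p[i] += p[i + step];   /* lane i combines with lane i+step */
    return p[0];
}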
On a related note, I'd love to see papers that try to reproduce in
a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
corresponds to a set of vectors and the control flow of SIMT warps could
be represented as mask bitvectors, but then there's the automatic scheduling/switching between warps, plus all kinds of other details
where the mapping between GPUs and vector CPUs doesn't seem so obvious.
=== Stefan
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
SIMD as you describe only adds the superluminal properties of x86
SIMD.
superluminal?
Thomas Koenig wrote:
There is little doubt (in my mind, at least) that an abstraction
such as the one offered by VVM is better and easier to use
for compilers, than the current state of the art where SIMD-based
vectorization is used. One need only look at the number of bugs
blocking https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
(and the fraction of unresolved bugs) to see how difficult that is.
However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
routines are usually written using (micro-)architecture-specific
intrinsics. Other operations which could be useful are "match"
operations known from graphics cards where bit n is set on lane
m if lane n and m hold the same value.
Is this worth it? CPU manufacturers seem to think so; they devote
considerable on-chip resources to shuffles. Routines that use
these are likely to be highly specialized, hard to write (needs
somebody like Terje) and, if it is used a lot, can give significant
speedup.
Shuffles and mixes are handled automatically by VVM, as long as you
have register names available to describe each part of the shuffle.
It does lead to much larger code to describe an arbitrary 16->16
shuffle operation, but for most algorithms not having to SIMD it
removes the actual need for shuffles.
How should a new architecture deal with it? Not going down that
path and forsaking the high-performance gains possible is one option,
for example if one wants to keep interrupts fast.
Otherwise, what decisions could be taken in designing such
a SSIMD?
Register width would be one concern. It could make sense to have
sub-architectures with several widths, where a feature enquiry
could be used to branch to several versions of code.
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel, makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
SIMD as you describe only adds the superluminal properties of x86
SIMD.
superluminal?
A single SIMD instruction nowhere near "in a loop".
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
SIMD data path yes, absolutely.
SIMD instructions at best maybe.
I think this points to a way to express shuffling in VVM:
Have the shuffling specifier as an array (of, say, bytes) in memory.
When the program does something like:
loop over i
p = shuffle[i]
t = a[p]
b[i] = t
end
the hardware can assemble it into uops of the SIMD data path that it
has. The loop can contain additional instructions and maybe a
different way of using the result than storing it into b, but the
pattern to recognize is
p = shuffle[i]
t = a[p]
The size of the SIMD data path and of the elements in shuffle
determines how many microinstructions are necessary. E.g., if the
SIMD data path can handle up to 16 elements of a in one uop, and
shuffle contains values <32, each SIMD width requires 1 load from
shuffle, 2 loads from a, 2 shuffling uops and a merge. If shuffle
contains too large values compared to the SIMD width, falling back to
scalar may be more economical, but then SIMD ISAs would not have
supported the operation, either.
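(The pattern above as plain C, i.e. the loop the hardware would have to
recognize and map onto its shuffle/gather uops:)

#include <stddef.h>
#include <stdint.h>

void apply_shuffle(uint8_t *b, const uint8_t *a,
                   const uint8_t *shuffle, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[shuffle[i]];    /* p = shuffle[i]; t = a[p]; b[i] = t */
}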
The size of the elements of shuffle comes from the data path, and is
needed in the decoder, which is usually a problem, and usually solved
with a predictor. That can be done here, too. One can also imagine communicating the maximum size through the ISA, which may also serve
as a hint for the uarch to use shuffling.
BGB makes the point that SIMD should stop at 4-wide. I mostly
agree.
SIMD is useful for data-parallel tasks, many of which have a very wide
data width. Stopping at 4-wide is as sensible as stopping at 1-wide.
But given that VVM means that SIMD width is microarchitectural,
mistakes like BGB's point would have little consequence, because one
can always go wider while still using the same software.
Stefan Monnier <monnier@iro.umontreal.ca> posted:
BGB [2026-04-26 12:27:06] wrote:
On 4/26/2026 5:26 AM, Thomas Koenig wrote:[...]
[...]However, there are use cases which VVM or similar systems do not
cover. The main one is in-register permutes (aka shuffles)
which do not go through memory. These can offer significant speed
advantages (as in factors) for performance-critical code. Such
One practical limit IMO is:
Don't go wider than 4 elements (without a very good reason or in very
special cases).
My understanding is that it might be worth distinguishing the case of
SIMD used for "tuples" and SIMD used for "arrays":
An interesting observation--congratulations!
- By "tuples" I mean data that is inherently of fixed size, such as the
3 or 4 element vectors used to represent a point in #3 space, where
each element has a specific role.
SSE/AVX approaches seem to work OK for such data, maybe better than vVM.
Not sure how much shuffle they may need.
When that calculation is in a Loop, I suspect vVM is competitive.
- By "arrays", I mean data of a size that can be much larger than 3-4
elements and which often/usually varies dynamically.
When handling such data, SSE/AVX need to wrap the SIMD instructions
inside loops. vVM should handle that much better in most cases.
Especially those cases where sizeof(ARRAY) mod 4 ~= 0 or sizeof()
is unknown at compile time.
For the arrays case, it might be worth thinking about what kind of
shuffle is needed and why.
In loop forms, shuffle is simply memory-indexing, while scatter/gather
is memory-indirect.
IIUC the shuffle is not needed over the
whole array. Instead, it shows up for example when you do a reduction
on the array and after doing a fast traversal of the array you end up
with N partial-reduction results and the shuffles are needed to do the
remaining log N steps to combine those N results.
Vector reduction {in all its forms} is not well done in SIMD if you want
the same result as you get from scalar code.
What other cases of shuffles show up?
What would be the best way to handle them?
On a related note, I'd love to see papers that try to reproduce in
a "normal vector CPU" the behavior of GPUs: in many ways a "warp"
corresponds to a set of vectors and the control flow of SIMT warps could
be represented as mask bitvectors, but then there's the automatic
scheduling/switching between warps, plus all kinds of other details
where the mapping between GPUs and vector CPUs doesn't seem so obvious.
In a GPU one can take a tessellated globe and a bit-map of the earth
then use Texture-LDs to create a planet in space--no calculation instructions! {Sure there are zillions of calculations--all buried
in texture memory access.}
=== Stefan
GPU people talk of x lanes of calculation tied to a single
instruction {ala Burroughs Scientific Processor} where each lane has
its own register file. Where x = {8, 16, 32, 64, or larger}
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
SIMD as you describe only adds the superluminal properties of x86
SIMD.
superluminal?
A single SIMD instruction nowhere near "in a loop".
I still don't understand.
On Mon, 27 Apr 2026 19:32:28 GMT, MitchAlsup wrote:
GPU people talk of x lanes of calculation tied to a single
instruction {ala Burroughs Scientific Processor} where each lane has
its own register file. Where x = {8, 16, 32, 64, or larger}
Also, they are fond of conditional-execution instructions, are they
not. So each processing unit can do something slightly different,
depending on the data it holds, while continuing to execute exactly
the same instruction as all the other units.
That way, the architecture remains “SIMD”, instead of turning into “MIMD”.
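(In scalar terms, per-lane predication amounts to something like this
made-up example -- every "lane" executes the same instruction, and the
data-dependent condition becomes a lane mask:)

void cond_scale(float *x, const float *y, int n)
{
    for (int i = 0; i < n; i++)
        if (y[i] > 0.0f)     /* per-lane predicate */
            x[i] *= y[i];    /* masked operation   */
}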
SIMT where T means Thread instead of Data.
- By "tuples" I mean data that is inherently of fixed size, such as
the 3 or 4 element vectors used to represent a point in #3 space,
where each element has a specific role.
SSE/AVX approaches seem to work OK for such data, maybe better than
vVM.
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a
very wide data width. Stopping at 4-wide is as sensible as
stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each
core having SIMD width comparable to that of many (not all) CPUs.
GPU people talk of x lanes of calculation tied to a single instruction
{ala Burroughs Scientific Processor} where each lane has its own
register file. Where x = {8, 16, 32, 64, or larger}
My understanding, not necessarily precisely correct, but necessarily
close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM.
I.e. Nvidia GPUs have exactly the same SIMD width as AMD, Intel and
Fujitsu CPUs.
Plus, nowadays they have outer product (a.k.a. tensor) engines, but
that's OT.
On Mon, 27 Apr 2026 19:32:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a
very wide data width. Stopping at 4-wide is as sensible as
stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each
core having SIMD width comparable to that of many (not all) CPUs.
GPU people talk of x lanes of calculation tied to a single instruction
{ala Burroughs Scientific Processor} where each lane has its own
register file. Where x = {8, 16, 32, 64, or larger}
Can you point me to an example of "or larger" among current high-volume
GPU products?
My impression is that even x=64 is implemented as a pair of x=32 ALUs
running in lock step.
My understanding, not necessarily precisely correct, but necessarily close, is that latest Nvidia GPUs have 4 (Turing, Ampere) or 8 (Ada Lovelace, Blackwell) 512-bit SIMD EUs (=16 CUDA "cores") per SM.
I.e. Nvidia GPUs have exactly the same SIMD width as AMD, Intel and Fujitsu CPUs.
Plus, nowadays they have outer product (a.k.a. tensor) engines, but that's OT.
On Tue, 28 Apr 2026 00:11:32 GMT, MitchAlsup wrote:
SIMT where T means Thread instead of Data.
Is that a meaningful distinction? What is the point of multiple units executing the same instruction, if not to operate on different data?
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 19:32:28 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Mon, 27 Apr 2026 08:48:09 -0000 (UTC)
Lawrence D’Oliveiro <ldo@nz.invalid> wrote:
On Mon, 27 Apr 2026 06:56:10 GMT, Anton Ertl wrote:
SIMD is useful for data-parallel tasks, many of which have a
very wide data width. Stopping at 4-wide is as sensible as
stopping at 1-wide.
GPUs implement SIMD up to hundreds or thousands of units wide.
You obviously don't know what you are talking about. That's not new.
Modern GPUs are best described as multicore processors with each
core having SIMD width comparable to that of many (not all) CPUs.
GPU people talk of x lanes of calculation tied to a single instruction
{ala Burroughs Scientific Processor} where each lane has its own
register file. Where x = {8, 16, 32, 64, or larger}
Can you point me to an example of "or larger" among current high-volume
GPU products?
I have insufficient data, as the [micro]architectures are rarely
allowed into the public domain.
My impression is that even x=64 is implemented as a pair of x=32 ALUs
running in lock step.
That's exactly how Larrabee started out, with four threads in a barrel
scheduler so that, per thread, the next instruction ran 4 cycles later.
Samsung (we) had 64 threads per instruction, operating over 4 clocks,
using 16 'ALUs', so we did not need forwarding even for FMAC
instructions.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VMM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections and is then used a lot by many unsuspecting people,
and its share of the *running time* could be much higher.
On 4/30/2026 5:36 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections and is then used a lot by many unsuspecting people,
and its share of the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.
This gave me an idea. People could create a new kind of "standard benchmark". Since, I believe, the algorithms, and in at least some
cases, reference code is publicly and freely available, it would be
possible to create a benchmark suite of these programs, with a set of standard input data for each (I am not sure it is worth it for Quake,
and there may be others that should be included).
A vendor could then present the results of running the benchmark two
ways. One way would be using the "standard" C compiler. The other way would allow use of assembler to squeeze the best performance, but the assembler source code must be provided.
Of course, this would be a supplement to things like SPEC; certainly not
a replacement. Is this a hare-brained scheme, or could/should it be
pursued?
Of course, this is very preliminary and I welcome any comments.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 4/30/2026 5:36 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections and is then used a lot by many unsuspecting people,
and its share of the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last
fraction of a percent, but with way larger usage in CPU years.
This gave me an idea. People could create a new kind of "standard
benchmark". Since, I believe, the algorithms, and in at least some
cases, reference code is publicly and freely available, it would be
possible to create a benchmark suite of these programs, with a set of
standard input data for each (I am not sure it is worth it for Quake,
and there may be others that should be included).
A vendor could then present the results of running the benchmark two
ways. One way would be using the "standard" C compiler. The other way
would allow use of assembler to squeeze the best performance, but the
assembler source code must be provided.
Of course, this would be a supplement to things like SPEC; certainly not
a replacement. Is this a hare-brained scheme, or could/should it be
pursued?
Of course, this is very preliminary and I welcome any comments.
Back in the late 1980s, the Mc 88000 C compiler* could compile a subroutine
of M88Ksim into a single <M88K> instruction and inline that at every
call point. It is all about reading the compiler asm output and finding
new things to put into the compiler.
(*) Greenhouse compiler I believe.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:
On 4/30/2026 5:36 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections which is then used a lot by many unsuspecting people,
and the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last
fraction of a percent, but with way larger usage in CPU years.
This gave me an idea. People could create a new kind of "standard
benchmark". Since, I believe, the algorithms, and in at least some
cases, reference code is publicly and freely available, it would be
possible to create a benchmark suite of these programs, with a set of
standard input data for each (I am not sure it is worth it for Quake,
and there may be others that should be included).
A vendor could then present the results of running the benchmark two
ways. One way would be using the "standard" C compiler. The other way
would allow use of assembler to squeeze the best performance, but the
assembler source code must be provided.
Of course, this would be a supplement to things like SPEC; certainly not
a replacement. Is this a hare-brained scheme, or could/should it be
pursued?
Of course, this is very preliminary and I welcome any comments.
Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
of M88Ksim into a single <M88K> instruction and inline that at every
call point. It is all about reading the compiler asm output and finding
new things to put into the compiler.
Was the Green Hills compiler available then? We were still using
the Moto PCC-based M88K compiler in 1990.
(*) Greenhouse compiler I believe.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Back in the late 1980s, Mc 88000 C-compiler* could compile a subroutine
of M88Ksim into a single <M88K> instruction and inline that at every
call point. It is all about reading the compiler asm output and finding
new things to put into the compiler.
Was the Green Hills compiler available then? We were still using
the Moto PCC-based M88K compiler in 1990.
I worked on DG AViiONs in 1990 and 1991, and we had a Green Hills
compiler installed as well as gcc. We used gcc.
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VVM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
What you say is true for 99.99% or more of code, but less than 100%.
Some people may write highly optimized code which runs in hot
sections which is then used a lot by many unsuspecting people,
and the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.
Besides, this is some of the most fun code to figure out; the fact that
the results are sometimes actually useful is just gravy.
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
Thomas Koenig wrote:What you say is true for 99.99% or more of code, but less than 100%.
The feature set should be constant across all vector widths.
Having all code scalar, with the hardware (VMM) actually figuring out
what can be done in parallel makes life much simpler, and makes for
transparent portability between the smallest and largest instantiation.
Some people may write highly optimized code which runs in hot
sections which is then used a lot by many unsuspecting people,
and the *running time* could be much higher.
Having written a lot more than my fair part of such code over the last
45 years, I certainly agree. :-)
As a matter of fact, I had two persons in mind when I wrote this,
and one of them was you :-)
Quake, AES, Ogg Vorbis, MPEG4 and h.264 decoding all land in that last fraction of a percent, but with way larger usage in CPU years.
Besides, this is some of the most fun code to figure out; the fact that the results are sometimes actually useful is just gravy.
:-)
Maybe another example, the "hello world" of high-performance
computing: An 8*8 matrix kernel, so C = C + A*B.
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning; a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
I do not see how VVM could express this equally succinctly;
there are simply not the (architectural) registers to load
the 64 values of A to start with. (Unless I am mistaken
and there is a way to express this in VVM - Mitch?)
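(The C source behind the link is presumably close to the following;
the exact code there may differ in details:)

#define N 8

void mm8(double c[N][N], const double a[N][N], const double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];   /* C = C + A*B */
}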
Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
-------------------------------------------------------------
Maybe another example, the "hello world" of high-performance
computing: An 8*8 matrix kernel, so C = C + A*B.
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning; a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
I do not see how VVM could express this equally succinctly;
there are simply not the (architectural) registers to load
the 64 values of A to start with. (Unless I am mistaken
and there is a way to express this in VVM - Mitch?)
Thomas Koenig <tkoenig@netcologne.de> posted:
-------------------------------------------------------------
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
On Fri, 01 May 2026 07:48:50 GMT, Anton Ertl wrote:
I worked on DG Aviions in 1990 and 1991, and we had a Green Hills
compiler installed as well as gcc. We used gcc.
Had Green Hills caught up to ANSI C by that point?
I was a heavy user of Apple’s MPW development environment from the
late 1980s onwards. They initially offered a C compiler licensed from
Green Hills, which was not ANSI-compliant. Then with MPW 3.0, they
replaced it with their own in-house-developed ANSI-compliant C
compiler, the one with the slightly tongue-in-cheek (or is that
passive-aggressive?) error messages.
Thomas Koenig <tkoenig@netcologne.de> posted:
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning; a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
# define N 7 and the code all goes to heck !
Is that really what one wants ?!?
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
    MOV  R7,#0
-------------------------------------------------------------
Loop2:
    MOV  R8,#0
    MOV  R5,#0
    MOV  R4,R7
-------------------------------------------------------------
    VEC  R15,{}              ; nothing live out of loop
-------------------------------------------------------------
loop3:
    LDD  R10,[R1,R5<<3]      LDD  R10,[R1,R5<<3]
    LDD  R11,[R2,R6<<3]      LDD  R11,[R2,R6<<3]
    LDD  R12,[R3,R4<<3]      LDD  R12,[R3,R4<<3]
    FMAC R12,R10,R11,R12     FMAC R12,R10,R11,R12
    STD  R12,[R3,R4<<3]      STD  R12,[R3,R4<<3]
    ADD  R4,R4,R7            ADD  R4,R4,R7
-------------------------------------------------------------
    LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
    ADD  R8,R8,#8
    ADD  R5,R5,#1
    CMP  R13,R5,#8
    BLE  R13,Loop2
-------------------------------------------------------------
    ADD  R7,R7,#8
    CMP  R13,R7,#64
    BLE  R13,Loop1
-------------------------------------------------------------
    RET
Where the doubled-up column shows the instructions which run
on a per-lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
And it did not even have to push registers onto the stack!
20 total instructions, 80 bytes.
Oh, and BTW: most of the compilers on godbolt report a compile error--
I tried a fairly big sample across every architecture. After changing
back to K&R C, every one could compile.
And what I posted above is obviously hand-optimized assembly;
something generated by a compiler would be worse.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
Thomas Koenig <tkoenig@netcologne.de> posted:
https://godbolt.org/z/xd4PedTqv shows an example generated by
gcc 16.1 (so no hand-generated assembly). This loads all of A
into registers at the beginning, a row vector of C is loaded each
iteration and stored at the end of each iteration, and B is loaded
(and used) element-wise.
# define N 7 and the code all goes to heck !
Is that really what one wants ?!?
That's auto-vectorization. I have heard that code for the Cray-1
contains a lot of "64"s.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
Plus, the setup time for VVM...
So, including that and the loop overhead, one could expect around
0.7 FMAs per cycle, correct?
With AVX512, it is possible to run 16 FMAs per cycle.
Divide that
by a factor < 2 for overhead, and you run at maybe around 10 FMAs
per cycle.
BTW, the code generated by gcc is anything but ideal because of
the dependency chain on zmm0. That is probably worth a PR.
And it did not even have to push registers onto the stack!
20 total instructions, 80 bytes.
Oh, and BTW: most of the compilers in godbolt take a compile error --
I tried a fairly big sample across every architecture. Changing back
to K&R C, every one could compile.
if they cannot grok restrict, they are not really modern :-)
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
Plus, the setup time for VVM...
I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.
So, including that and the loop overhead, one could expect around
0.7 FMAs per cycle, correct?
0.7*lanes, maybe 0.9*lanes if my VEC fix works.
With AVX512, it is possible to run 16 FMAs per cycle.
Your code can only use 8 FMACs per cycle in any event,
and there
is overhead due to the other instructions in the loop...
Divide that
by a factor < 2 for overhead, and you run at maybe around 10 FMAs
per cycle.
BTW, the code generated by gcc is anything but ideal because of
the dependency chain on zmm0. That is probably worth a PR.
And it did not even have to push registers onto the stack!
20 total instructions, 80 bytes.
SIMD got 26 instructions, likely longer than 4 bytes each due to
the prefixes needed to encode the various SIMD lengths.
Oh, and BTW: most of the compilers in godbolt take a compile error --
I tried a fairly big sample across every architecture. Changing back
to K&R C, every one could compile.
if they cannot grok restrict, they are not really modern :-)
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------------------------------------------------
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
I think the below is correct !?! Hand compiled
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
One problem I see is memory traffic. In the SIMD version, A is
loaded once at the beginning of the loop. Here, it is loaded N**2
times, with different offsets each VVM iteration, vs only once
for the AVX512 version. Also, C is loaded and stored N**2 times,
vs. only once. (The AVX version also loads B only once).
The LDD using R6 as an index can be hoisted into Loop2 prologue.
{I did miss that}.
I think the R6 usage is off (not unusual in hand-coded assembly,
as I know only too well myself :-)
But let's look at memory access. Like you said, in the code
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
b[k + j*N] is invariant for the innermost loop. So, for N=8, there
are 64 double reads for b. For a and c there are 512 reads of doubles
each, and 512 doubles are written for c. In total, 1600 memory
accesses of doubles.
By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
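To make the tally concrete, here is a small self-contained C sketch
(it only counts accesses from the formulas above; the names are mine):

#include <stdio.h>

int main(void)
{
    const long n = 8;
    /* naive loop nest: b is read once per (j,k); a and c are read,
       and c is written, once per (i,j,k) */
    long naive = n*n + n*n*n + n*n*n + n*n*n;
    /* register-blocked SIMD: a, b and c are each read once,
       and c is written once */
    long simd = 3*n*n + n*n;
    printf("naive %ld, simd %ld, ratio %.2f\n",
           naive, simd, (double)naive / simd);
    /* prints: naive 1600, simd 256, ratio 6.25 */
    return 0;
}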
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth
of the loop is 6-cycles, so the 8-wide machine would run the
loop in 8-cycles of latency.
Plus, the setup time for VVM...
I have been thinking about this overnight and may have a solution
that alters only the VEC instruction.
So, including that and the loop overhead, one could expect around
0.7 FMAs per cycle, correct?
0.7*lanes, maybe 0.9*lanes if my VEC fix works.
With AVX512, it is possible to run 16 FMAs per cycle.
Your code can only use 8 FMACs per cycle in any event,
Actually not correct. Zen5 has a reciprocal throughput of 0.5
for FMA; it can run two instructions in parallel.
and there
is overhead due to the other instructions in the loop...
Looking at
https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html
where they put a lot of work into optimizing matmul, one can see
they reached around 125 gflops on matrix sizes that include some
odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
that is 30.34 flops per cycle, which translates into 15.16 FMAs
per cycle, which is extremely close to 16 and shows that the
loop overhead is pretty much absorbed in the OoO handling.
(It also shows that the gcc-generated code I linked to is anything
but ideal due to the dependency chain it contains).
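One way to get rid of such a chain (a sketch only, not what gcc
emits; the function name is mine) is to accumulate two independent
columns of c per outer iteration, so the two FMA pipes always see
independent chains:

#define N 8
void mm8_2col(const double * restrict a, const double * restrict b,
              double * restrict c)
{
    for (int j = 0; j < N; j += 2) {
        for (int i = 0; i < N; i++) {
            double c0 = c[i + j*N];     /* chain 1 */
            double c1 = c[i + (j+1)*N]; /* chain 2, independent of 1 */
            for (int k = 0; k < N; k++) {
                c0 += a[i + k*N] * b[k + j*N];
                c1 += a[i + k*N] * b[k + (j+1)*N];
            }
            c[i + j*N] = c0;
            c[i + (j+1)*N] = c1;
        }
    }
}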
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
I think the below is correct !?! Hand compiled
Thomas Koenig <tkoenig@netcologne.de> posted:
I do not see how VVM could express this equally succinctly;
Give me a day and I will see what I can do.
-------------------------------------------------------------
/* with
R1 = &a[]
R2 = &b[]
R3 = &c[]
R4 = i+jN
R5 = i+kN
R6 = k+jN
R7 = jN
R8 = kN
*/
mm8:
-------------------------------------------------------------
Loop1:
MOV R7,#0
-------------------------------------------------------------
Loop2:
MOV R8,#0
MOV R5,#0
MOV R4,R7
-------------------------------------------------------------
VEC R15,{} ; nothing live out of loop
-------------------------------------------------------------
loop3:
LDD R10,[R1,R5<<3] LDD R10,[R1,R5<<3]
LDD R11,[R2,R6<<3] LDD R11,[R2,R6<<3]
LDD R12,[R3,R4<<3] LDD R12,[R3,R4<<3]
FMAC R12,R10,R11,R12 FMAC R12,R10,R11,R12
STD R12,[R3,R4<<3] STD R12,[R3,R4<<3]
ADD R4,R4,R7 ADD R4,R4,R7
-------------------------------------------------------------
LOOP1 LE,R5,R8,R7
-------------------------------------------------------------
ADD R8,R8,#8
ADD R5,R5,#1
CMP R13,R5,#8
BLE R13,Loop2
-------------------------------------------------------------
ADD R7,R7,#8
CMP R13,R7,#64
BLE R13,Loop1
-------------------------------------------------------------
RET
Where the doubled up column shows the instructions which run
on a per lane basis. Given:
1-lane: there are 8 loops
2-lane: there are 4 loops
4-lane: there are 2 loops
8-lane: there is 1 loop
Let's compare it directly. Posting URLs is not good for discussion.
So the source code in the example is:
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
and the output of gcc-16.1 is (after cleanup by godbolt):
mm8:
vmovupd (%rdi), %zmm8
vmovupd 64(%rdi), %zmm7
vmovupd 128(%rdi), %zmm6
vmovupd 192(%rdi), %zmm5
vmovupd 256(%rdi), %zmm4
vmovupd 320(%rdi), %zmm3
vmovupd 384(%rdi), %zmm2
vmovupd 448(%rdi), %zmm1
movq %rsi, %rax
leaq 512(%rdx), %rcx
.L2:
vbroadcastsd (%rax), %zmm0
vfmadd213pd (%rdx), %zmm8, %zmm0
addq $64, %rdx
addq $64, %rax
vfmadd231pd -56(%rax){1to8}, %zmm7, %zmm0
vfmadd231pd -48(%rax){1to8}, %zmm6, %zmm0
vfmadd231pd -40(%rax){1to8}, %zmm5, %zmm0
vfmadd231pd -32(%rax){1to8}, %zmm4, %zmm0
vfmadd231pd -24(%rax){1to8}, %zmm3, %zmm0
vfmadd231pd -16(%rax){1to8}, %zmm2, %zmm0
vfmadd231pd -8(%rax){1to8}, %zmm1, %zmm0
vmovupd %zmm0, -64(%rdx)
cmpq %rdx, %rcx
jne .L2
vzeroupper
ret
So your code reflects the three loops of the source code, with the
inner loop being sped up by VVM, ideally such that only one
microarchitectural iteration of the inner loop is necessary. You
still have the other loops, and all the memory accesses.
By contrast, the AVX-512 code produced by gcc-16.1 unrolls one loop
level into using AVX-512 instructions and another loop level into
using 8 different zmm registers. As a consequence, there is only one
loop level left, and every byte of the array c is only stored once.
Also, every byte of a and b is only loaded once (but the accesses to
the b array are 8 bytes at a time, so 64 loads are needed for that,
while a is loaded with 8 64-byte loads and c is stored with 8 64-byte
stores).
For VVM we can make use of 8 registers to achieve one level of
unrolling, but I don't see how to reuse the registers as it is done
with the a array in zmm1..zmm8 in the AVX-512 code. So one would have
to load all of a on every iteration of the outer loop, instead of
pulling these loads out of the loop as it is done by gcc-16.1.
The VVM code would maybe look somewhat like:
loop1:
... loop overhead left as exercise ...
vec i
loop2:
ldd r17=c[...]
ldd r1=a[i]
ldd r2=a[i+8]
...
ldd r8=a[i+56]
ldd r9=b[...]
fmac r18=r1*r9+r17
...
ldd r16=b[...]
fmac r25=r8*r16+r24
std r25->c[...]
loop1 ...
... loop overhead ...
ble ..., loop1
ret
I don't know how well VVM handles the dependency chain of the FMACs
(r17->r18...r25). One could use the same register here, as is done
with zmm0 in the AVX-512 code, but I do not know if VVM would accept
that.
With a 3-cycle LDD and a 4-cycle FMAC and 3 LD ports the depth of
the loop is 6-cycles, so the 8-wide machine would run the loop in
8-cycles of latency.
In an OoO machine the latency within the loop is not very relevant,
even the dependence chain of the 8 vfmadd231pd instructions typically
is not, because the next iteration does not depend on the previous
one, apart from the loop counter updates, and even that has become
zero-cycle latency in recent Intel CPUs. So the next iteration can
start immediately, limited only by the available resources.
Oh, and BTW: most of the compilers in godbolt take a compile error --
I tried a fairly big sample across every architecture.
I tried clang-22, and it compiled the code. I also tried clang-11 and
gcc-11 and they errored out complaining about the -march=znver5 flag,
because they do not know this architecture (but still, why report an
error for that?). Deleting that flag produced SSE2 code on both
compilers, as expected.
On Sat, 02 May 2026 16:29:03 GMT
I tried clang-22, and it compiled the code. I also tried clang-11 and
gcc-11 and they errored out complaining about the -march=znver5 flag,
because they do not know this architecture (but still, why report an
error for that?). Deleting that flag produced SSE2 code on both
compilers, as expected.
Why would you think that the code generated for Zen5 would be different
from code generated for other AVX512 targets with 2 FMA pipes, like
any Intel server core starting from Skylake-SP?
Specific targets I would try are:
skylake-avx512
cascadelake
icelake-server
sapphirerapids
emeraldrapids
graniterapids
The last three are likely unsupported by clang-11, which is quite
ancient, but with considerably newer gcc11 at least sapphirerapids
should work.
Anyway, cascadelake should be supported by all of them and I don't
expect meaningful differences in code generation between cascadelake
and znver5.
Michael S <already5chosen@yahoo.com> writes:
On Sat, 02 May 2026 16:29:03 GMT
I tried clang-22, and it compiled the code. I also tried clang-11
and gcc-11 and they errored out complaining about the
-march=znver5 flag, because they do not know this architecture
(but still, why report an error for that?). Deleting that flag
produced SSE2 code on both compilers, as expected.
Why would you think that the code generated for Zen5 would be
different from code generated for other AVX512 targets with 2 FMA
pipes, like any Intel server core starting from Skylake-SP?
I did not express anything in that direction. However, now that you
ask, my experience is that it was more difficult than I had
expected to get gcc (13 IIRC) to produce AVX-512 code, even with
explicit vectorization. The actual target was a Rocket Lake machine,
but using -march=native on that produced AVX-256 code (for the
programs that I tried). IIRC I also tried specifying one other Intel
uarch, and the code was not satisfactory, either, but I don't remember
the details. Eventually I tried -march=znver4, and that worked, so I
stuck with that.
BTW, one thing that I find unsatisfactory about "x86-64-v4" is that it
does not include the ADX instructions.
Specific targets I would try are:
skylake-avx512
cascadelake
icelake-server
sapphirerapids
emeraldrapids
graniterapids
The last three are likely unsupported by clang-11, which is quite
ancient, but with considerably newer gcc11 at least sapphirerapids
should work.
Anyway, cascadelake should be supported by all of them and I don't
expect meaningful differences in code generation between cascadelake
and znver5.
Good to know. Anyway, I did not try to get older compilers to produce
AVX-512 for <2026May2.182903@mips.complang.tuwien.ac.at>. Instead, I
tried gcc-11 and clang-11 to find out why "most of the compilers in
godbolt take a compile error", as Mitch Alsup claimed. It appears that
most of the compilers barf on the flag "-march=znver5".
- anton
sapphirerapids is recognized by gcc-11. Here, too, no version of gcc
generates AVX512 code for this kernel.
In order to convince gcc to generate AVX-512 on skylake-avx512 I had to
go back to gcc7.
skylake-avx512: recognized since clang-3.9, generates semi-reasonable
avx512 code with clang-3.9 to 9.
In this particular case it could be a blessing for Intel, because newer
clang code generated for znver5 used 512-bit SIMD but looks horrible.
I fully expect that it is slower than more conservative code for Intel
targets.
Michael S <already5chosen@yahoo.com> writes:
In order to convince gcc to generate AVX-512 on skylake-avx512 I had
to go back to gcc7.
skylake-avx512: recognized since clang-3.9, generates semi-reasonable
avx512 code with clang-3.9 to 9.
gcc-8.1 was released on May 2, 2018.
clang-10.0 was released on 24 March 2020.
Alder Lake was released in late 2021, but initially had some AVX-512
support that was disabled via firmware later.
Given how indecisive Intel has been on Alder Lake, it seems unlikely
that they eliminated the AVX-512 usage as early as gcc-8 in order to
avoid disappointments when Alder Lake was released. But somehow I
cannot come up with a different explanation.
In this particular case it could be a blessing for Intel, because
newer clang code generated for znver5 used 512-bit SIMD but looks
horrible. I fully expect that it is slower than more conservative
code for Intel targets.
The whole auto-vectorization stuff tends to be pretty unreliable
overall. Sometimes the code that is generated looks good, sometimes
it is a mess, sometimes there is no auto-vectorization at all.
- anton
Michael S <already5chosen@yahoo.com> schrieb:
sapphirerapids is recognized by gcc-11. Here, too, no version of gcc
generates AVX512 code for this kernel.
I think it does, but you have to specify additional options.
Historically, AVX512 has been a low performer due to frequency
throttling etc.
You may have to specify -mprefer-vector-width=512.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
b[k + j*N] is invariant for the innermost loop. So, for N=8, there
are 64 double reads for b. For a and c there are 512 reads of doubles
each, and 512 doubles are written for c. In total, 1600 memory
accesses of doubles.
By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
#define N 8
void mm8(double * const restrict a, double * const restrict b,
         double * restrict c)
{
    for (int j=0; j<N; j++) {
        double c0 = c[0 + j*N];
        double c1 = c[1 + j*N];
        double c2 = c[2 + j*N];
        double c3 = c[3 + j*N];
        double c4 = c[4 + j*N];
        double c5 = c[5 + j*N];
        double c6 = c[6 + j*N];
        double c7 = c[7 + j*N];
        for (int k=0; k<N; k++) {
            double bk = b[k + j*N];
            c0 += a[0 + k*N] * bk;
            c1 += a[1 + k*N] * bk;
            c2 += a[2 + k*N] * bk;
            c3 += a[3 + k*N] * bk;
            c4 += a[4 + k*N] * bk;
            c5 += a[5 + k*N] * bk;
            c6 += a[6 + k*N] * bk;
            c7 += a[7 + k*N] * bk;
        }
        /* write back c0 to c7 */
        c[0 + j*N] = c0;
        c[1 + j*N] = c1;
        c[2 + j*N] = c2;
        c[3 + j*N] = c3;
        c[4 + j*N] = c4;
        c[5 + j*N] = c5;
        c[6 + j*N] = c6;
        c[7 + j*N] = c7;
    }
}
where the loop over k could be vectorized, but that would still
leave excessive memory traffic for a.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Your code can only use 8 FMACs per cycle in any event,
Actually not correct. Zen5 has a reciprocal throughput of 0.5
for FMA; it can run two instructions in parallel.
and there
is overhead due to the other instructions in the loop...
Looking at
https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html
where they put a lot of work into optimizing matmul, one can see
they reached around 125 gflops on matrix sizes that include some
odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
that is 30.34 flops per cycle, which translates into 15.16 FMAs
per cycle, which is extremely close to 16 and shows that the
loop overhead is pretty much absorbed in the OoO handling.
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
#define N 8
void mm8(double * const restrict a, double * const restrict b,
double * restrict c)
{
for (int j=0; j<N; j++) {
for (int k=0; k<N; k++) {
for (int i=0; i<N; i++) {
c[i + j*N] += a[i + k*N] * b[k + j*N];
}
}
}
}
C version, with the loop invariant hoisted, and cursoring:
#define N 8
void mm8(double *a, double *b, double *c)
{
    int i, k, jN, kN;
    double *AcijN, *AbkjN, *AaikN;
    double bN;
    for( jN=0; jN<N*N; jN+=N ) {
        AcijN = &c[jN];
        AbkjN = &b[jN];
        for( kN=k=0; k<N; k++,kN+=N ) {
            AaikN = &a[kN];
            bN = AbkjN[k];
            for( i=0; i<N; i++ ) {
                AcijN[i] += AaikN[i] * bN;
            }
        }
    }
}
b[k + j*N] is invariant for the innermost loop. So, for N=8, there
are 64 double reads for b. For a and c there are 512 reads of doubles
each, and 512 doubles are written for c. In total, 1600 memory
accesses of doubles.
By comparison, the SIMD code reads 192 doubles and writes 64, the
minimum, for a total of 256. This is a factor of 6.25.
It occurs to me that c[*] should be set to zero for a "real" matrix
multiply... as is, c[*] is both input and output.
ENTER Rc1,Rc8,#0 ; preserve c[1..8]
MOV RjN,#0 ; R4
loop1:
LA Rca,[Rc,RjN<<3] ; &c[1..8]
LDD Rc1,[Rca,#0] ; R23
LDD Rc2,[Rca,#8]
LDD Rc3,[Rca,#16]
LDD Rc4,[Rca,#24]
LDD Rc5,[Rca,#32]
LDD Rc6,[Rca,#40]
LDD Rc7,[Rca,#48]
LDD Rc8,[Rca,#56] ; R30
MOV RkN,#0 ; R5
---------------begin vectorize-------------------
VEC 8,{Rc1..Rc8}
loop2:
LDD Rbk,[R2,RjN<<3] ; R6
LA RakN,[Ra,RkN<<3] ; R7
LDD Ra1,[RakN,#0] ; R8
FMAC Rc1,Ra1,Rbk,Rc1 ; R23
LDD Ra2,[RakN,#8] ; R7
FMAC Rc2,Ra2,Rbk,Rc2 ; R24
LDD Ra3,[RakN,#16] ; R7
FMAC Rc3,Ra3,Rbk,Rc3 ; R25
LDD Ra4,[RakN,#24] ; R7
FMAC Rc4,Ra4,Rbk,Rc4 ; R26
LDD Ra5,[RakN,#32] ; R7
FMAC Rc5,Ra5,Rbk,Rc5 ; R27
LDD Ra6,[RakN,#40] ; R7
FMAC Rc6,Ra6,Rbk,Rc6 ; R28
LDD Ra7,[RakN,#48] ; R7
FMAC Rc7,Ra7,Rbk,Rc7 ; R29
LDD Ra8,[RakN,#56] ; R7
FMAC Rc8,Ra8,Rbk,Rc8 ; R30
LOOP1 LE,RkN,#8,#64 ; R4
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
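As a rough C model of what such indexing would buy (a sketch; it
pretends that a small constant-size local array can live entirely in
the register file, which plain C cannot guarantee):

#define N 8
void mm8_rf(const double * restrict a, const double * restrict b,
            double * restrict c)
{
    double areg[N*N];   /* stand-in for an indexable register file */
    for (int t = 0; t < N*N; t++)
        areg[t] = a[t]; /* all of a[] read from memory exactly once */
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            double bk = b[k + j*N];
            for (int i = 0; i < N; i++)
                c[i + j*N] += areg[i + k*N] * bk; /* no a[] traffic */
        }
    }
}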
Thomas Koenig <tkoenig@netcologne.de> posted:
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
Your code can only use 8 FMACs per cycle in any event,
Actually not correct. Zen5 has a reciprocal throughput of 0.5
for FMA; it can run two instructions in parallel.
and there
is overhead due to the other instructions in the loop...
Looking at
https://www.amd.com/en/developer/resources/technical-articles/2025/aocl-blas-boosting-gemm-performance-for-small-matrices-.html
where they put a lot of work into optimizing matmul, one can see
they reached around 125 gflops on matrix sizes that include some
odd sizes. Dividing by the 4.12 GHz they give as maximum frequency,
that is 30.34 flops per cycle, which translates into 15.16 FMAs
per cycle, which is extremely close to 16 and shows that the
loop overhead is pretty much absorbed in the OoO handling.
But your loop only has 4×4 matrices.
Yes, MM with big matrices reaches above 90% of theoretical perf;
small ones do not.
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
I don't recall the limit on the number of statements in a VVM loop;
what is it?
[...]
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
What are its drawbacks? Do register accesses get slower?
MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
The solution to the excessive a[] traffic would be having the ability
to index the register file Ra[#] so the array can be allocated into
registers and indexed from the file itself. Most ISAs do not have this
ability--although a few GPU ISAs do.
What are its drawbacks? Do register accesses get slower?
Some time ago, I wrote an inline version of matmul for gfortran
because small matrices (especially if their size is known at compile
time) are handled very inefficiently by external packages.
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
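A sketch of how that aliasing might look to software (hypothetical
ISA and address window, purely illustrative):

#include <stdint.h>

/* Hypothetical: the integer register file is aliased into a reserved
   address window, so register n can be selected by an ordinary
   indexed load even when n is only known at run time. */
#define REGFILE_WINDOW ((volatile uint64_t *)(uintptr_t)0xFFFF0000u)

static inline uint64_t read_reg_indexed(unsigned n)
{
    return REGFILE_WINDOW[n];  /* one indexed load on such an ISA */
}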
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 5/11/2026 7:17 PM, Lawrence D’Oliveiro wrote:
On Sun, 10 May 2026 08:58:34 -0700, Stephen Fuld wrote:
A possible alternative that I have seen is to "memory map" the
registers as an alternative accessing mechanism. This allows you to
"index" the registers, similarly to indexing a memory array.
Doesn’t this defeat the point of how registers are supposed to work?
No. In the vast majority of cases, you reference registers as you do
now, with register numbers in assigned places in the instruction. But
you do have an "alternate" way of referencing them that allows you to
use an index, just as you can with memory. That mechanism would only be
used in rare circumstances.
It would have to be implemented. How?
And how does the supposed
rareness help?
Remember the subject: You suggested this mechanism as a way to
eliminate the disadvantage of VVM compared to AVX-512 in 8x8 matrix
multiplication, and the disadvantage was that VVM cannot eliminate
some memory accesses that AVX-512 can. Turning the registers into
memory does not solve that, and probably incurs additional costs.
This cure is worse than the disease.