Forum: War Ensemble BBS

Re: Zen 5 FP latencies / throughput

From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Sep 20 12:41:11 2025

From Newsgroup: comp.arch

Thomas Koenig <tkoenig@netcologne.de> wrote:

I just looked at the latency / throughput for Zen 5 (the link I
followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
wants to see for themselves), and I found the performance quite
impressive.

They can execute two 512-bit AVX512 fp adds in parallel (either 64
or 32 bits), plus two 512-bit AVX512 FMA instructions on top.

Latency for the floating point add is two (!) cycles, for the FMA
it is four cycles, which is not a lot when running with a boost
frequency of 5.7 GHz. The ratio is also interesting, they must
have optimized the floating-point adder quite well.
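
To put those latencies in wall-clock terms, here is a minimal sketch. It assumes the 5.7 GHz boost clock and the two-pipes-per-operation figures quoted above; the function names are made up for illustration, and the "chains in flight" number is just latency times issue width, not something taken from the AMD document.

```python
BOOST_HZ = 5.7e9  # boost clock quoted in the post (assumed sustained)

def latency_ns(cycles, clock_hz=BOOST_HZ):
    """Convert a latency in cycles to nanoseconds at the given clock."""
    return cycles / clock_hz * 1e9

def chains_needed(latency_cycles, pipes):
    """Independent dependency chains needed to keep all pipes busy.

    Each pipe can start one operation per cycle, so latency * pipes
    operations must be in flight at once.
    """
    return latency_cycles * pipes

fadd_ns = latency_ns(2)  # 2-cycle floating-point add
fma_ns = latency_ns(4)   # 4-cycle FMA
print(f"FADD: {fadd_ns:.2f} ns, FMA: {fma_ns:.2f} ns")
print("independent chains, FADD:", chains_needed(2, 2))
print("independent chains, FMA: ", chains_needed(4, 2))
```

So a loop with a single accumulator would be bound by latency rather than by the pipes; under these assumptions it takes roughly four independent add chains or eight independent FMA chains to approach the quoted throughput.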

Let's see... the peak FP performance with 64-bit reals, with 16
cores (to get an upper limit on FP performance) would be

16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
which is approximately 4.3 TFlops per CPU.
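
The arithmetic can be checked with a few lines. This is only a sketch of the formula above, with hypothetical parameter names; it counts an FMA as two flops and an add as one, and assumes all 16 cores hold the boost clock.

```python
def peak_flops(cores, fma_pipes, fadd_pipes, lanes, clock_hz):
    """Upper bound on flops/s: every pipe issues every cycle on every core."""
    # An FMA counts as 2 flops (multiply + add), a plain add as 1.
    flops_per_cycle = (2 * fma_pipes + 1 * fadd_pipes) * lanes
    return cores * flops_per_cycle * clock_hz

# 512-bit vectors hold 8 doubles; 2 FMA pipes and 2 FADD pipes per core.
peak = peak_flops(cores=16, fma_pipes=2, fadd_pipes=2, lanes=8, clock_hz=5.7e9)
print(f"{peak / 1e12:.2f} TFlops")
```

With these inputs the product is 768 flops per cycle across the chip, or a little under 4.4 TFlops.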

I do not think you can run 16 cores at boost frequency for any
reasonable period of time. And all processors that I looked at
slowed the clock down when AVX FMA was present. And I doubt this
"on the top" claim: 2 FMA-s + 2 fadd-s need 10 arguments.
If the chip can provide that many arguments in a single cycle,
it is probably possible only for some special combination of
sources.

And note that your mix is 2 multiplies and 4 adds per cycle.
Normal FP mix is closer to 50% multiplies.
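
The operand count and the multiply/add mix can be tallied as below. A sketch only: the table of per-instruction costs is an assumption (three sources for an FMA, two for an add), and the names are invented for illustration.

```python
# Per-instruction source operands and arithmetic content (assumed model).
OPS = {
    "fma":  {"sources": 3, "muls": 1, "adds": 1},
    "fadd": {"sources": 2, "muls": 0, "adds": 1},
}

def per_cycle(issue):
    """Sum source operands, multiplies and adds for one cycle's issue mix."""
    sources = sum(OPS[op]["sources"] * n for op, n in issue.items())
    muls = sum(OPS[op]["muls"] * n for op, n in issue.items())
    adds = sum(OPS[op]["adds"] * n for op, n in issue.items())
    return sources, muls, adds

sources, muls, adds = per_cycle({"fma": 2, "fadd": 2})
mul_share = muls / (muls + adds)
print(f"sources per cycle: {sources}")
print(f"multiplies: {muls}, adds: {adds}, multiply share: {mul_share:.0%}")
```

That gives 10 source operands per cycle and a mix of one third multiplies, against the roughly 50% mentioned above.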
--
Waldek Hebisch
--- Synchronet 3.21a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 20 13:32:34 2025

From Newsgroup: comp.arch

Waldek Hebisch <antispam@fricas.org> schrieb:

Thomas Koenig <tkoenig@netcologne.de> wrote:

I just looked at the latency / throughput for Zen 5 (the link I
followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
wants to see for themselves), and I found the performance quite
impressive.

They can execute two 512-bit AVX512 fp adds in parallel (either 64
or 32 bits), plus two 512-bit AVX512 FMA instructions on top.

Latency for the floating point add is two (!) cycles, for the FMA
it is four cycles, which is not a lot when running with a boost
frequency of 5.7 GHz. The ratio is also interesting, they must
have optimized the floating-point adder quite well.

Let's see... the peak FP performance with 64-bit reals, with 16
cores (to get an upper limit on FP performance) would be

16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
which is approximately 4.3 TFlops per CPU.

I do not think you can run 16 cores at boost frequency for any
reasonable period of time. And all processors that I looked at
slowed the clock down when AVX FMA was present.

It slows down somewhat, but the behavior is still impressive.
If you want to know the details, an analysis is at https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior .
Unfortunately, they didn't run two FMA + two adds, but only
two FMA + one add in parallel.

And I doubt this
"on the top" claim: 2 FMA-s + 2 fadd-s need 10 arguments.
If the chip can provide that many arguments in a single cycle,
it is probably possible only for some special combination of
sources.

Register to register

And note that your mix is 2 multiplies and 4 adds per cycle.
Normal FP mix is closer to 50% multiplies.

I wrote about "peak performance", which is the speed that is
guaranteed not to be exceeded :-) It's like the 160 MFlops
for the Cray-1, which people also could not realistically
achieve.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Sat Sep 20 22:10:57 2025

From Newsgroup: comp.arch

On Sat, 20 Sep 2025 12:41:11 -0000 (UTC)
antispam@fricas.org (Waldek Hebisch) wrote:

And note that your mix is 2 multiplies and 4 adds per cycle.
Normal FP mix is closer to 50% multiplies.

Normality is in the eye of the beholder.
Consider FFT.
The radix-2 butterfly that you find in the books consists of 2 FMUL,
2 FMADD and 4 FADD/FSUB. The radix-4 butterfly, which constitutes the
bulk of modern high-perf implementations, is a little less unbalanced,
but only a little.
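
To make that count concrete, here is a minimal sketch of a textbook radix-2 butterfly written out in real arithmetic, with a tally of the operations. The function name and the counter keys are invented for illustration, and Python does not actually fuse the multiply-adds; the lines marked FMADD are simply the ones a compiler could map to one.

```python
import cmath

def radix2_butterfly(a, b, w):
    """Return (a + w*b, a - w*b, op_counts) using explicit real operations."""
    counts = {"fmul": 0, "fmadd": 0, "faddsub": 0}

    # Complex product t = w * b: 2 FMUL + 2 FMADD.
    t_re = w.imag * b.imag            # FMUL
    counts["fmul"] += 1
    t_re = w.real * b.real - t_re     # FMADD (fused multiply-subtract form)
    counts["fmadd"] += 1
    t_im = w.imag * b.real            # FMUL
    counts["fmul"] += 1
    t_im = w.real * b.imag + t_im     # FMADD
    counts["fmadd"] += 1

    # Sum and difference: 4 FADD/FSUB.
    top = complex(a.real + t_re, a.imag + t_im)
    bottom = complex(a.real - t_re, a.imag - t_im)
    counts["faddsub"] += 4
    return top, bottom, counts

a, b = complex(1.0, 2.0), complex(3.0, -1.0)
w = cmath.exp(-2j * cmath.pi / 8)     # an arbitrary twiddle factor
top, bottom, counts = radix2_butterfly(a, b, w)

muls = counts["fmul"] + counts["fmadd"]
adds = counts["fmadd"] + counts["faddsub"]
mul_share = muls / (muls + adds)
print(counts, f"multiply share: {mul_share:.0%}")
```

Counted this way the butterfly has 4 multiplies and 6 adds. If FMUL and FMADD both issue on the FMA pipes (an assumption here), that is 4 operations for the FMA pipes and 4 for the add pipes.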

--- Synchronet 3.21a-Linux NewsLink 1.2

From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Oct 14 00:34:00 2025

From Newsgroup: comp.arch

On Fri, 19 Sep 2025 18:09:11 +0000, Thomas Koenig wrote:

An interesting question: When (approximately) did the total installed
floating point performance of all computers worldwide surpass that of a
single 16-core Zen5 CPU? My guess would be somewhere in the late
1970s/early 1980s, before the PC and the 8087 took off.

I wouldn't want to guess an answer to that question myself.

But when did floating-point performance of the world's computers increase
the most explosively? My guess for _that_ would be when the original 486,
later called the 486 DX when the 486 SX arrived, came on the market.

The 8087 was expensive, and only a few people originally bought it for
their PCs because they needed high floating-point performance. The 486, on
the other hand, was the new standard chip if you wanted a PC. It, too, was
expensive _at first_, but its price came down, as that of the 386 before
it did.

So about when the 486 got cheap, the performance of all the world's
computers skyrocketed to a level a single computer would be hard pressed
to equal.

John Savard
--- Synchronet 3.21a-Linux NewsLink 1.2
