• Re: Zen 5 FP latencies / throughput

    From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Sep 20 12:41:11 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    I just looked at the latency / throughput for Zen 5 (the link I
    followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
    wants to see for themselves), and I found the performance quite
    impressive.

    They can execute two 512-bit AVX512 fp adds in parallel (either 64
    or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

    Latency for the floating point add is two (!) cycles, for the FMA
    it is four cycles, which is not a lot when running with a boost
    frequency of 5.7 GHz. The ratio is also interesting, they must
    have optimized the floating-point adder quite well.

    Let's see... the peak FP performance with 64-bit reals, with 16
    cores (to get an upper limit on FP performance) would be

    16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
    which is approximately 4.3 TFlops per CPU.
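
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post: the constant and function names are made up for illustration, and the per-cycle issue rates and the 5.7 GHz clock are simply the figures quoted above, not measured values.

```python
# Peak FP64 throughput estimate, following the formula in the post.
# All names are illustrative; the numbers are the ones quoted above.
CORES = 16
FMA_PER_CYCLE = 2        # 512-bit FMA instructions issued per cycle
FADD_PER_CYCLE = 2       # 512-bit FP add instructions issued per cycle
FLOPS_PER_FMA = 2        # one multiply plus one add per lane
FLOPS_PER_FADD = 1       # one add per lane
LANES_FP64 = 512 // 64   # 8 double-precision values per 512-bit register
BOOST_HZ = 5.7e9         # assumed all-core clock (optimistic upper bound)


def peak_flops(cores=CORES, hz=BOOST_HZ):
    """Return the theoretical peak in FP64 operations per second."""
    flops_per_lane_per_cycle = (
        FMA_PER_CYCLE * FLOPS_PER_FMA + FADD_PER_CYCLE * FLOPS_PER_FADD
    )
    return cores * flops_per_lane_per_cycle * LANES_FP64 * hz


if __name__ == "__main__":
    print(f"{peak_flops() / 1e12:.2f} TFlops")
```

The product is 16 * 6 * 8 * 5.7e9, a little under 4.4e12, in line with the rounded figure in the post. Passing a lower `hz` shows how quickly the bound shrinks if the boost clock is not sustained.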

    I do not think you can run 16 cores at boost frequency for any
    reasonable period of time. And all processors that I looked at
    slowed down the clock when AVX FMA was present. And I doubt this
    "on the top" claim: 2 FMA-s + 2 fadd-s need 10 arguments.
    If the chip can provide that many arguments in a single cycle
    this probably can be only for some special combination of
    sources.

    And note that your mix is 2 multiplies and 4 adds per cycle.
    Normal FP mix is closer to 50% multiplies.
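
The two counts in this reply (10 source operands, and 2 multiplies against 4 adds per cycle) follow directly from the instruction mix. A small sketch, assuming the usual three-source FMA and two-source add; the table layout and names are hypothetical and say nothing about how the register file actually delivers the operands:

```python
# Per-cycle instruction mix discussed above.
# Fields: (name, instructions per cycle, source operands, multiplies, adds)
MIX = [
    ("FMA", 2, 3, 1, 1),   # a*b + c: three sources, one multiply, one add
    ("FADD", 2, 2, 0, 1),  # a + b: two sources, one add
]


def per_cycle_totals(mix=MIX):
    """Return (source operands, multiplies, adds) needed per cycle."""
    sources = sum(count * srcs for _, count, srcs, _, _ in mix)
    muls = sum(count * m for _, count, _, m, _ in mix)
    adds = sum(count * a for _, count, _, _, a in mix)
    return sources, muls, adds


if __name__ == "__main__":
    sources, muls, adds = per_cycle_totals()
    share = muls / (muls + adds)
    print(f"{sources} sources, {muls} multiplies, {adds} adds")
    print(f"multiply share: {share:.0%}")
```

With this mix the multiplies are one third of the arithmetic operations, which is the imbalance relative to a roughly 50% multiply mix that the reply points out.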
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 20 13:32:34 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    I just looked at the latency / throughput for Zen 5 (the link I
    followed is https://docs.amd.com/v/u/en-US/58455_1.00 if anybody
    wants to see for themselves), and I found the performance quite
    impressive.

    They can execute two 512-bit AVX512 fp adds in parallel (either 64
    or 32 bits), plus two 512-bit AVX-512 FMA instructions on top.

    Latency for the floating point add is two (!) cycles, for the FMA
    it is four cycles, which is not a lot when running with a boost
    frequency of 5.7 GHz. The ratio is also interesting, they must
    have optimized the floating-point adder quite well.

    Let's see... the peak FP performance with 64-bit reals, with 16
    cores (to get an upper limit on FP performance) would be

    16 cores * (2 * 2 for FMA + 1 * 2 for fadd) * 8 FP numbers * 5.7e9/s
    which is approximately 4.3 TFlops per CPU.

    I do not think you can run 16 cores at boost frequency for any
    reasonable period of time. And all processors that I looked at
    slowed down the clock when AVX FMA was present.

    It slows down somewhat, but the behavior is still impressive.
    If you want to know the details, an analysis is at https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior .
    Unfortunately, they didn't run two FMA + two adds, but only
    two FMA + one add in parallel.

    And I doubt this
    "on the top" claim: 2 FMA-s + 2 fadd-s need 10 arguments.
    If the chip can provide that many arguments in a single cycle
    this probably can be only for some special combination of
    sources.

    Register to register

    And note that your mix is 2 multiplies and 4 adds per cycle.
    Normal FP mix is closer to 50% multiplies.

    I wrote about "peak performance", which is the speed where
    it is guaranteed that it cannot be exceeded :-) It's like the
    160 MFlops for the Cray-1, which people also could not realistically
    achieve.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Sep 20 22:10:57 2025
    From Newsgroup: comp.arch

    On Sat, 20 Sep 2025 12:41:11 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:


    And note that your mix is 2 multiplies and 4 adds per cycle.
    Normal FP mix is closer to 50% multiplies.


    Normality is in the eye of the beholder.
    Consider FFT.
    The radix-2 butterfly that you find in the books consists of 2 FMUL,
    2 FMADD and 4 FADD/FSUB. The radix-4 butterfly that constitutes the
    bulk of modern high-perf implementations is a little less unbalanced,
    but only a little.
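
One way to arrive at the 2 FMUL, 2 FMADD and 4 FADD/FSUB count is to write the radix-2 butterfly out on real and imaginary parts. The sketch below is an assumed decomposition that matches those counts, not necessarily the one the poster had in mind. Plain Python does not fuse the multiply-add; the comments only mark which expressions would map to a single FMADD-type instruction on hardware that has one.

```python
def radix2_butterfly(a, b, w):
    """Return (a + w*b, a - w*b) computed on real/imaginary parts.

    Operation count: 2 FMUL, 2 FMADD-type, 4 FADD/FSUB.
    """
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    wr, wi = w.real, w.imag

    # Complex product t = w * b: 2 multiplies and 2 fused multiply-adds.
    p1 = wi * bi            # FMUL
    tr = wr * br - p1       # FMADD-type (multiply, then subtract p1)
    p2 = wi * br            # FMUL
    ti = wr * bi + p2       # FMADD-type (multiply, then add p2)

    # Butterfly outputs: 4 plain additions/subtractions.
    top = complex(ar + tr, ai + ti)     # 2 FADD
    bottom = complex(ar - tr, ai - ti)  # 2 FSUB
    return top, bottom


if __name__ == "__main__":
    a, b, w = complex(1.0, 2.0), complex(3.0, -1.0), complex(0.0, -1.0)
    print(radix2_butterfly(a, b, w))
    print((a + w * b, a - w * b))  # reference using built-in complex math
```

Counting adds inside the fused operations, that is 4 multiplies against 6 adds and subtracts per butterfly, so the add-heavy mix is not unusual for this kind of kernel.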


  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Oct 14 00:34:00 2025
    From Newsgroup: comp.arch

    On Fri, 19 Sep 2025 18:09:11 +0000, Thomas Koenig wrote:

    An interesting question: When (approximately) did the total installed
    floating point performance of all computers worldwide surpass that of a
    single 16-core Zen5 CPU? My guess would be somewhere in the late
    1970s/early 1980s, before the PC and the 8087 took off.

    I wouldn't want to guess an answer to that question myself.

    But when did floating-point performance of the world's computers increase
    the most explosively? My guess for _that_ would be when the original 486,
    later called the 486 DX when the 486 SX arrived, came on the market.

    The 8087 was expensive, and only a few people originally bought it for
    their PCs because they needed high floating-point performance. The 486, on
    the other hand, was the new standard chip if you wanted a PC. It, too, was
    expensive _at first_, but its price came down, as that of the 386 before
    it did.

    So about when the 486 got cheap, the performance of all the world's
    computers skyrocketed to a level a single computer would be hard pressed
    to equal.

    John Savard