• Re: Efficiency of in-order vs. OoO

    From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Mar 25 21:42:18 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters, I got a
    meeting with Intel about their next CPU (the Pentium Pro). What I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:46:39 2024
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    In article <2024Mar25.193535@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel
    or AMD.

    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
    most of the latter to want those features so that they can understand the
    performance of their silicon better.

    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    Look at vtune, for example.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Mar 25 20:48:08 2024
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    There is a significant demand for performance monitoring. Note
    that in addition to the standard performance monitoring registers,
    AArch64 also (optionally) supports statistical profiling and
    out-of-band instruction tracing (ETF). The demand from users
    is such that all those features are present in most designs.

    Interesting. I would have expected that the likes of me are few and
    far between, and easy to ignore for a big company like ARM, Intel or AMD.

    My theory was that the CPU manufacturers put performance monitoring
    counters in CPUs in order to understand the performance of real-world
    programs themselves, and how they should tweak the successor core to
    relieve it of bottlenecks.

    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Scan chains. The modern interface to scan chains (which we used on the
    mainframes in the late '70s/early '80s) is JTAG.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:22:31 2024
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    The question is if "users" to ARM Holdings are actual end-users, or the
    SoC manufacturers who build chips incorporating AArch64 cores. I'd expect
    most of the latter to want those features so that they can understand the
    performance of their silicon better.

    That might explain why for the AmLogic S922X in the Odroid N2/N2+
    there is a Linux 4.9 kernel that supports performance monitoring
    counters (AmLogic put that in for their own uses), but the mainline
    Linux kernel does not support perf on the S922X (perf was not in the
    requirements of whoever integrated the S922X stuff into the mainline).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 09:27:54 2024
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a
    long-running program.

    Look at vtune, for example.

    And?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Mar 26 10:47:07 2024
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Having reverse engineered the original Pentium EMON counters I got a
    meeting with Intel about their next cpu (the PentiumPro), what I was
    told about the Pentium was that this chip was the first one which was
    too complicated to create/sell an In-Circuit Emulator (ICE) version, so
    instead they added a bunch of counters for near-zero overhead monitoring
    and depended on a bit-serial read-out when they needed to dump all state
    for debugging. (I have forgotten the proper term for that interface! :-( )

    Scan chains. The modern interface to scan chains (which we used on the
    mainframes in the late '70s/early '80s) is JTAG.


    Thanks!

    JTAG was indeed the term I was looking for (and not remembering). Maybe
    I'm getting old?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Mar 26 14:15:41 2024
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a
    long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance, they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Mar 26 16:47:02 2024
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a
    long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance, they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we have PMCs not just for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler
    utilization, etc.
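
To illustrate how raw counter values of the kinds listed above are usually
turned into figures a core designer or application tuner can compare, here
is a minimal sketch. The field names and the choice of metrics are
assumptions made for this example; they do not correspond to the event
names of any particular CPU or profiling tool.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw event counts for one measurement interval (hypothetical names)."""
    cycles: int
    instructions: int
    branches: int
    branch_misses: int
    cache_refs: int
    cache_misses: int


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator counter never ticked."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(s: CounterSample) -> dict[str, float]:
    """Reduce raw counts to rates that can be compared across runs."""
    return {
        # Instructions retired per cycle.
        "ipc": _ratio(s.instructions, s.cycles),
        # Fraction of branches that were mispredicted.
        "branch_miss_rate": _ratio(s.branch_misses, s.branches),
        # Fraction of cache references that missed.
        "cache_miss_rate": _ratio(s.cache_misses, s.cache_refs),
        # Branch mispredictions per 1000 instructions.
        "branch_mpki": _ratio(1000 * s.branch_misses, s.instructions),
    }
```

Ratios rather than raw counts are used here because runs of different
lengths then remain comparable.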

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Mar 26 17:29:00 2024
    From Newsgroup: comp.arch

    In article <2024Mar26.174702@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    There can be considerable confusion on this point. In the early days of
    Intel VTune, it would only work on small and simple programs, but Intel
    sent one of the lead developers to visit the UK with it, expecting that
    it would instantly find huge speed-ups in my employers' code.

    What happened was that VTune crashed almost instantly when faced with
    something that large, and Intel learned about the difference between
    microarchitecture analysis and application analysis.

    John
  • From mitchalsup@mitchalsup@aol.com (MitchAlsup1) to comp.arch on Tue Mar 26 18:47:38 2024
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    The biggest demand is from the OS vendors. Hardware folks have
    simulation and emulators.

    You don't want to use a full-blown microarchitectural emulator for a
    long-running program.

    Generally hardware folks don't run 'long-running programs' when
    analyzing performance, they use the emulator for determining latencies,
    bandwidths and efficacy of cache coherency algorithms and
    cache prefetchers.

    Their target is not application analysis.

    This sounds like hardware folks that are only concerned with
    memory-bound programs.

    I OTOH expect that designers of out-of-order (and in-order) cores
    analyse the performance of various programs to find out where the
    bottlenecks of their microarchitectures are in benchmarks and
    applications that people look at to determine which CPU to buy. And
    that's why we have PMCs not just for memory accesses, but also
    for branch prediction accuracy, functional unit utilization, scheduler
    utilization, etc.

    Quit being so CPU-centric.

    You also need measurements of how many of which transactions flow across
    the bus, DRAM use analysis, and PCIe usage to fully tune the system.

    - anton