• A new method for OoO

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Sep 10 15:15:10 2025
    From Newsgroup: comp.arch

    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    This made me wonder about the number of cycles cache reads for the
    different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 10 18:21:30 2025

    On Wed, 10 Sep 2025 15:15:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    This made me wonder about the number of cycles cache reads for the
    different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?

    As far as I know, for L2 the answer is 'No'.
    For L3 - it depends.
    For main RAM - hopefully yes.


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 10 15:22:29 2025

Thomas Koenig <tkoenig@netcologne.de> writes:
https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    This made me wonder about the number of cycles cache reads for the
    different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?

    It's likely that there is a clock domain crossing involved
    to get to the memory subsystem.

    Note that in most processors, there are multiple clock domains;
one for the processor/core (e.g. 3 GHz) and one for the 'rest of chip'
(typ. 800 MHz - 1 GHz). L1 and L2 are generally in the processor
    clock domain, while L3 may be in either the processor domain
    or the rest-of-chip domain.

    Accesses to L1 and L2 take the same number of clocks regardless
    of the actual clock speed when they're part of the same clock domain.
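Scott's point can be made concrete with a quick back-of-the-envelope sketch (illustrative only; the 12-cycle L2 latency is an assumed figure, not taken from any particular core): if L2 always costs a fixed number of core clocks, its wall-clock latency stretches as the core clock drops.

```python
# Illustrative only: hypothetical cycle count, not from any specific CPU.
# A fixed-cycle L2 means its latency in nanoseconds grows as the core
# frequency falls, since each cycle simply lasts longer.

L2_CYCLES = 12  # assumed fixed L2 load-to-use latency, in core cycles

def l2_latency_ns(core_freq_ghz: float, cycles: int = L2_CYCLES) -> float:
    """Wall-clock L2 latency: cycle count divided by cycles-per-ns."""
    return cycles / core_freq_ghz

for freq in (3.0, 1.5, 0.8):
    print(f"{freq:.1f} GHz: {L2_CYCLES} cycles = {l2_latency_ns(freq):.1f} ns")
```

So at 3 GHz those 12 cycles are 4 ns, but at 0.8 GHz the same 12 cycles take 15 ns, which is why one might hope lower-level caches and DRAM (in other clock domains) need fewer core cycles at low frequency.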
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 11 15:51:28 2025


    scott@slp53.sl.home (Scott Lurndal) posted:

Thomas Koenig <tkoenig@netcologne.de> writes:
https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

To me, it sounds worrisome, as it leaves 5%-7% on the table

This made me wonder about the number of cycles cache reads for the
different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?

    It's likely that there is a clock domain crossing involved
    to get to the memory subsystem.

    Almost invariably

    Note that in most processors,

    Certainly at the chip level, the interiors of "cores" are mostly
    a single clock domain. Core = {processor, L1, L2, Miss buffering}

    there are multiple clock domains;
one for the processor/core (e.g. 3 GHz) and one for the 'rest of chip'
(typ. 800 MHz - 1 GHz). L1 and L2 are generally in the processor
    clock domain, while L3 may be in either the processor domain
    or the rest-of-chip domain.

    The interconnect can be running at core or rest-of-chip domain.
    PCIe can have each root complex at different frequencies.

    Accesses to L1 and L2 take the same number of clocks regardless
    of the actual clock speed when they're part of the same clock domain.

    Depends if the L1/L2 is banked or not. Accesses to free banks have
    fixed timing, access to conflicting banks have an added conflict
    delay.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 11 15:57:01 2025

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Note that in most processors,

    Certainly at the chip level, the interiors of "cores" are mostly
    a single clock domain. Core = {processor, L1, L2, Miss buffering}

    there are multiple clock domains;
one for the processor/core (e.g. 3 GHz) and one for the 'rest of chip'
(typ. 800 MHz - 1 GHz). L1 and L2 are generally in the processor
    clock domain, while L3 may be in either the processor domain
    or the rest-of-chip domain.

    The interconnect can be running at core or rest-of-chip domain.
    PCIe can have each root complex at different frequencies.

    PCIe may be in three different clock domains. One for the
    PCI controller (typically ROC), one for the PCIe MAC
    and one for the SERDES.

  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 11 13:22:44 2025

    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

It basically puts the whole schedule in one HW matrix of
time_slots * resources and scans forward looking for empty slots to
allocate to each instruction. The scheduling is done at Rename, with
time slots assigned for each resource needed: source operand read
ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.
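A toy software model of that forward scan (my reading of the patent's description, not the actual hardware; the slot count and resource mix are invented for illustration):

```python
# Toy model of a time-resource matrix scheduler (illustrative, not the
# patented hardware): matrix[t][r] marks resource r busy in time slot t.
# At rename, scan forward from the earliest cycle the operands can be
# ready and claim the first slot where every needed resource is free.

NUM_SLOTS = 16      # assumed schedule horizon, in cycles
NUM_RESOURCES = 4   # assumed: 0-1 = register read ports, 2 = ALU, 3 = result bus

def schedule(matrix, earliest, resources):
    """Return the first slot >= earliest with all listed resources free,
    marking them busy; None if the horizon is full (i.e. must stall)."""
    for t in range(earliest, NUM_SLOTS):
        if all(not matrix[t][r] for r in resources):
            for r in resources:
                matrix[t][r] = True
            return t
    return None

matrix = [[False] * NUM_RESOURCES for _ in range(NUM_SLOTS)]
t1 = schedule(matrix, 0, [0, 2, 3])  # first instruction lands in slot 0
t2 = schedule(matrix, 0, [0, 2, 3])  # conflicts on all three -> slot 1
print(t1, t2)  # 0 1
```

A replay on an L1 miss would then amount to re-running this allocation for the load and everything younger, which is where the claimed simplicity starts to look expensive.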

    They claim it is simpler but I question that.
    Putting all the schedule info in one matrix means that to scale it
    requires adding more ports to the matrix. Also different resources
    can require different allocation and scheduling algorithms.
    Doing all this in one place at the same time gets complicated quickly.

    My simulated design intentionally distributed schedulers to each FU's bank
    of reservation stations so they all schedule concurrently and each scheduler algorithm is optimized for its FU.

    Also a wake-up matrix is not that complicated. I used the write of the destination Physical Register Number (PRN) as the wake-up signal.
Each PRN has a wire that runs to all RS, and each operand waiting for
that PRN watches that wire for a pulse indicating the written result
value will be forwarded in the next cycle on a dynamically assigned
result bus.
    The RS operand can either save a copy of the value or launch execution immediately if all resources are available.
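That wake-up scheme can be sketched in a few lines (a simplified software model of my reading of the description above; the class and function names are invented):

```python
# Sketch of the described PRN-broadcast wake-up (simplified model, not
# RTL): writing a destination Physical Register Number (PRN) pulses a
# per-PRN wire; every reservation-station operand waiting on that PRN
# captures the forwarded value and may become ready.

class RSEntry:
    def __init__(self, src_prns):
        # Map each awaited source PRN -> captured value (None = still waiting).
        self.waiting = {prn: None for prn in src_prns}

    def ready(self):
        return all(v is not None for v in self.waiting.values())

def broadcast(rs_entries, prn, value):
    """Model the per-PRN wake-up wire: every matching operand captures."""
    for e in rs_entries:
        if prn in e.waiting and e.waiting[prn] is None:
            e.waiting[prn] = value

rs = [RSEntry([5, 9]), RSEntry([9])]
broadcast(rs, 9, 42)
print([e.ready() for e in rs])  # [False, True]: first entry still waits on PRN 5
```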

    My design appears to be similar to issue logic for
    RISC-V Berkeley Out-of-Order Machine (BOOM). As they note, schedulers
    are simple and different kinds can be used for different FU.
    My ALU used simple round-robin whereas Branch Unit BRU is age ordered.
This is simple to do as each scheduler only looks at its own RS bank.
https://docs.boom-core.org/en/latest/sections/issue-units.html
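The two pick policies mentioned can be sketched as follows (illustrative; a lower entry index stands in for greater age, and the round-robin pointer is an assumed detail):

```python
# Two per-FU pick policies over one RS bank's ready flags (toy model):
# an age-ordered pick (as described for the branch unit) and a
# round-robin pick (as described for the ALU).

def pick_oldest(ready):
    """Age-ordered pick: lowest-numbered (oldest) ready entry, else None."""
    for i, r in enumerate(ready):
        if r:
            return i
    return None

def pick_round_robin(ready, last):
    """Round-robin pick: first ready entry after the last grant, else None."""
    n = len(ready)
    for off in range(1, n + 1):
        i = (last + off) % n
        if ready[i]:
            return i
    return None

ready = [False, True, False, True]
print(pick_oldest(ready))          # 1
print(pick_round_robin(ready, 1))  # 3
```

Round-robin needs only a grant pointer per bank, while age order falls out for free if the bank is kept in program order, which is presumably why different FUs get different policies.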

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 11 18:48:06 2025


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

It basically puts the whole schedule in one HW matrix of
time_slots * resources and scans forward looking for empty slots to
allocate to each instruction. The scheduling is done at Rename, with
time slots assigned for each resource needed: source operand read
ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.

    They claim it is simpler but I question that.

    Scoreboards are simpler than RS (and smaller too) but come with a
    10%-odd disadvantage in performance (per frequency). The purported
    scheme is 7%-odd slower--read into that anything you want.

    Putting all the schedule info in one matrix means that to scale it
    requires adding more ports to the matrix. Also different resources
    can require different allocation and scheduling algorithms.
    Doing all this in one place at the same time gets complicated quickly.

    Scoreboards scale with inst^2+registers^3
    Stations scale with inst×FU+RoB

CDC got away with a Scoreboard because it tracked 3 sets of 8
registers; doing this with 32 uniform registers would be 8× as big !!!
and somewhat slower; doing this with 32+32 {int,FP} registers would be
16× worse than the 6600; adding in SIMD and I don't even know how to
calculate it.

    My simulated design intentionally distributed schedulers to each FU's bank
    of reservation stations so they all schedule concurrently and each scheduler algorithm is optimized for its FU.

    Each entire pipelined sequence is optimized for its pipeline::

    | INT RS | INT | Result|
    | MEM RS | AGEN | Cache | LDalgn | result|
    | DECODE | FMAC RS | MUX | MUL | ADD | NORM | Result|
    | MISC RS | stuff | Result|
    | BR RS | Check | backup|

    Also a wake-up matrix is not that complicated. I used the write of the destination Physical Register Number (PRN) as the wake-up signal.

    Agreed: however, I pipelined the result delivery mechanism into 3 stages:: {tag, result, exception} with the following timing::

    | tag | tag+1 | tag+2 |
    | result | result+1| result+2|
    | excptn | excptn+1| excptn+2|

    Tag consists of {pRN, pValid; slot, CKid, cValid}
    pRN is the physical Register Number
    pValid tells if you are writing the pRF
    slot is which FU
CKid is which Insert Bundle
    cValid tells if {slot, CKid} is delivering a result

    There is a case where aRN is written more than once in a single Insert
    Bundle, in these cases, its result is delivered only to RS entries
    waiting on {slot, CKid}; Here a pRN is not assigned to the result
    only a {slot, CKid}; hence pValid.

    There is the case where {slot, CKid} is not delivering a result;
    hence cValid. This is used for ST instructions to read pRF after
    all exceptions in the bundle have accrued. This eliminates forwarding
    on ST.data since all older results have <necessarily> been written
    into pRF.

    The exception timing allows for direct mapped caches to deliver data
    while checking for hit, and delivering miss after LD.data. It also
    allows for instructions like FDIV to deliver a result and then change
its mind later. Mc88120 could deliver FDIV at cycle 12 with a 1/128
chance of improper rounding, re-delivering the correctly rounded
result in cycle 17. SQRT was similar.

    The only real complication is that 1-cycle instructions have RS broadcast
    the tag instead of the dedicated FU.

Each PRN has a wire that runs to all RS, and each operand waiting for
that PRN watches that wire for a pulse indicating the written result
value will be forwarded in the next cycle on a dynamically assigned
result bus.

When instructions are written (Insert) into RS, each operand contains
the slot of the FU which will deliver that result. Thus, the operand
capture portion only "looks" at one result bus for its data. Mc88120,
1991.

    The RS operand can either save a copy of the value or launch execution immediately if all resources are available.

    My design appears to be similar to issue logic for
    RISC-V Berkeley Out-of-Order Machine (BOOM). As they note, schedulers
    are simple and different kinds can be used for different FU.
    My ALU used simple round-robin whereas Branch Unit BRU is age ordered.
    This is simple to do as each scheduler only looks at its own RS bank.

    I always considered the FU scheduler to be the RS "everybody ready?"
    OK "let's choose the oldest ready instruction !!" That is each FU has
    a dedicated RS on its front, and a dedicated result <bus> at its rear.

    https://docs.boom-core.org/en/latest/sections/issue-units.html

    The BOOM front end seems to have a lot more cycles than what is required.

I am working on a 6-wide GBOoO implementation, and
FETCH-PARSE-DECODE-INSERT is only 3½ cycles--while if RS does not
launch an instruction, the <just> decoded instruction can begin
{INSERT can be EXECUTE} in that 4th cycle, delivering its result in
cycle 5.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 12 05:45:32 2025

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

Maybe the same people/company. Thang Minh Tran, the inventor
of the patent, works for Simplex (the owner of the patent), but
previously worked for Andes, who gave the presentation. This
might be a case of shared IP, or a licensing agreement.

    Mitch, from his bio on simplexmicro.com, it seems that he worked
    at AMD around the same time you did, maybe a little earlier.
    Do you know him?

It basically puts the whole schedule in one HW matrix of
time_slots * resources and scans forward looking for empty slots to
allocate to each instruction. The scheduling is done at Rename, with
time slots assigned for each resource needed: source operand read
ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.

It's the same one that was referenced in the presentation; the
drawings also match.

    They claim it is simpler but I question that.

    Patents and marketing often claim advantages which are, let's say,
    dubious :-)

    [...]
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 12 15:44:51 2025


    Thomas Koenig <tkoenig@netcologne.de> posted:

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

Maybe the same people/company. Thang Minh Tran, the inventor
of the patent, works for Simplex (the owner of the patent), but
previously worked for Andes, who gave the presentation. This
might be a case of shared IP, or a licensing agreement.

    Mitch, from his bio on simplexmicro.com, it seems that he worked
    at AMD around the same time you did, maybe a little earlier.
    Do you know him?

    I was at AMD from Oct 1999 to May 2006

    I don't remember the name.

It basically puts the whole schedule in one HW matrix of time_slots * resources
    and scans forward looking for empty slots to allocate to each instruction.

    CRAY 1 vector sequencer used such a "shift register" approach.

The scheduling is done at Rename, with time slots assigned for each
resource needed: source operand read ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.

It's the same one that was referenced in the presentation; the
drawings also match.

    They claim it is simpler but I question that.

    Patents and marketing often claim advantages which are, let's say,
    dubious :-)

    [...]