• A new method for OoO

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Sep 10 15:15:10 2025
    From Newsgroup: comp.arch

    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    This made me wonder about the number of cycles cache reads for the
    different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 10 18:21:30 2025

    On Wed, 10 Sep 2025 15:15:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    This made me wonder about the number of cycles cache reads for the
    different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?

    As far as I know, for L2 the answer is 'No'.
    For L3 - it depends.
    For main RAM - hopefully yes.


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 10 15:22:29 2025

Thomas Koenig <tkoenig@netcologne.de> writes:
https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    This made me wonder about the number of cycles cache reads for the
    different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?

    It's likely that there is a clock domain crossing involved
    to get to the memory subsystem.

    Note that in most processors, there are multiple clock domains;
one for the processor/core (e.g. 3 GHz) and one for the 'rest of chip'
(typ. 800 MHz - 1 GHz). L1 and L2 are generally in the processor
    clock domain, while L3 may be in either the processor domain
    or the rest-of-chip domain.

    Accesses to L1 and L2 take the same number of clocks regardless
    of the actual clock speed when they're part of the same clock domain.
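Scott's point can be made concrete with a quick back-of-the-envelope sketch (illustrative only; the 12-cycle L2 latency is an assumed figure, not taken from any particular core): if L2 always costs a fixed number of core clocks, its wall-clock latency stretches as the core clock drops.

```python
# Illustrative only: hypothetical cycle count, not from any specific CPU.
# A fixed-cycle L2 means its latency in nanoseconds grows as the core
# frequency falls, since each cycle simply lasts longer.

L2_CYCLES = 12  # assumed fixed L2 load-to-use latency, in core cycles

def l2_latency_ns(core_freq_ghz: float, cycles: int = L2_CYCLES) -> float:
    """Wall-clock L2 latency: cycle count divided by cycles-per-ns."""
    return cycles / core_freq_ghz

for freq in (3.0, 1.5, 0.8):
    print(f"{freq:.1f} GHz: {L2_CYCLES} cycles = {l2_latency_ns(freq):.1f} ns")
```

So at 3 GHz those 12 cycles are 4 ns, but at 0.8 GHz the same 12 cycles take 15 ns, which is why one might hope lower-level caches and DRAM (in other clock domains) need fewer core cycles at low frequency.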
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 11 15:51:28 2025


    scott@slp53.sl.home (Scott Lurndal) posted:

Thomas Koenig <tkoenig@netcologne.de> writes:
https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

To me, it sounds worrisome, as it leaves 5%-7% on the table

This made me wonder about the number of cycles cache reads for the
different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
    is lower?

    It's likely that there is a clock domain crossing involved
    to get to the memory subsystem.

    Almost invariably

    Note that in most processors,

    Certainly at the chip level, the interiors of "cores" are mostly
    a single clock domain. Core = {processor, L1, L2, Miss buffering}

    there are multiple clock domains;
one for the processor/core (e.g. 3 GHz) and one for the 'rest of chip'
(typ. 800 MHz - 1 GHz). L1 and L2 are generally in the processor
    clock domain, while L3 may be in either the processor domain
    or the rest-of-chip domain.

    The interconnect can be running at core or rest-of-chip domain.
    PCIe can have each root complex at different frequencies.

    Accesses to L1 and L2 take the same number of clocks regardless
    of the actual clock speed when they're part of the same clock domain.

    Depends if the L1/L2 is banked or not. Accesses to free banks have
    fixed timing, access to conflicting banks have an added conflict
    delay.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 11 15:57:01 2025

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Note that in most processors,

    Certainly at the chip level, the interiors of "cores" are mostly
    a single clock domain. Core = {processor, L1, L2, Miss buffering}

    there are multiple clock domains;
one for the processor/core (e.g. 3 GHz) and one for the 'rest of chip'
(typ. 800 MHz - 1 GHz). L1 and L2 are generally in the processor
    clock domain, while L3 may be in either the processor domain
    or the rest-of-chip domain.

    The interconnect can be running at core or rest-of-chip domain.
    PCIe can have each root complex at different frequencies.

    PCIe may be in three different clock domains. One for the
    PCI controller (typically ROC), one for the PCIe MAC
    and one for the SERDES.

  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Sep 11 13:22:44 2025

    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

It basically puts the whole schedule in one HW matrix of
time_slots * resources and scans forward looking for empty slots to
allocate to each instruction. The scheduling is done at Rename, with
time slots assigned for each resource needed: source operand read
ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.
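A toy software model of that forward scan (my reading of the patent's description, not the actual hardware; the slot count and resource mix are invented for illustration):

```python
# Toy model of a time-resource matrix scheduler (illustrative, not the
# patented hardware): matrix[t][r] marks resource r busy in time slot t.
# At rename, scan forward from the earliest cycle the operands can be
# ready and claim the first slot where every needed resource is free.

NUM_SLOTS = 16      # assumed schedule horizon, in cycles
NUM_RESOURCES = 4   # assumed: 0-1 = register read ports, 2 = ALU, 3 = result bus

def schedule(matrix, earliest, resources):
    """Return the first slot >= earliest with all listed resources free,
    marking them busy; None if the horizon is full (i.e. must stall)."""
    for t in range(earliest, NUM_SLOTS):
        if all(not matrix[t][r] for r in resources):
            for r in resources:
                matrix[t][r] = True
            return t
    return None

matrix = [[False] * NUM_RESOURCES for _ in range(NUM_SLOTS)]
t1 = schedule(matrix, 0, [0, 2, 3])  # first instruction lands in slot 0
t2 = schedule(matrix, 0, [0, 2, 3])  # conflicts on all three -> slot 1
print(t1, t2)  # 0 1
```

A replay on an L1 miss would then amount to re-running this allocation for the load and everything younger, which is where the claimed simplicity starts to look expensive.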

    They claim it is simpler but I question that.
    Putting all the schedule info in one matrix means that to scale it
    requires adding more ports to the matrix. Also different resources
    can require different allocation and scheduling algorithms.
    Doing all this in one place at the same time gets complicated quickly.

    My simulated design intentionally distributed schedulers to each FU's bank
    of reservation stations so they all schedule concurrently and each scheduler algorithm is optimized for its FU.

    Also a wake-up matrix is not that complicated. I used the write of the destination Physical Register Number (PRN) as the wake-up signal.
Each PRN has a wire that runs to all RS, and each operand waiting for
that PRN watches that wire for a pulse indicating the written result
value will be forwarded in the next cycle on a dynamically assigned
result bus.
    The RS operand can either save a copy of the value or launch execution immediately if all resources are available.
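That wake-up scheme can be sketched in a few lines (a simplified software model of my reading of the description above; the class and function names are invented):

```python
# Sketch of the described PRN-broadcast wake-up (simplified model, not
# RTL): writing a destination Physical Register Number (PRN) pulses a
# per-PRN wire; every reservation-station operand waiting on that PRN
# captures the forwarded value and may become ready.

class RSEntry:
    def __init__(self, src_prns):
        # Map each awaited source PRN -> captured value (None = still waiting).
        self.waiting = {prn: None for prn in src_prns}

    def ready(self):
        return all(v is not None for v in self.waiting.values())

def broadcast(rs_entries, prn, value):
    """Model the per-PRN wake-up wire: every matching operand captures."""
    for e in rs_entries:
        if prn in e.waiting and e.waiting[prn] is None:
            e.waiting[prn] = value

rs = [RSEntry([5, 9]), RSEntry([9])]
broadcast(rs, 9, 42)
print([e.ready() for e in rs])  # [False, True]: first entry still waits on PRN 5
```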

    My design appears to be similar to issue logic for
    RISC-V Berkeley Out-of-Order Machine (BOOM). As they note, schedulers
    are simple and different kinds can be used for different FU.
    My ALU used simple round-robin whereas Branch Unit BRU is age ordered.
This is simple to do as each scheduler only looks at its own RS bank.
https://docs.boom-core.org/en/latest/sections/issue-units.html
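The two pick policies mentioned can be sketched as follows (illustrative; a lower entry index stands in for greater age, and the round-robin pointer is an assumed detail):

```python
# Two per-FU pick policies over one RS bank's ready flags (toy model):
# an age-ordered pick (as described for the branch unit) and a
# round-robin pick (as described for the ALU).

def pick_oldest(ready):
    """Age-ordered pick: lowest-numbered (oldest) ready entry, else None."""
    for i, r in enumerate(ready):
        if r:
            return i
    return None

def pick_round_robin(ready, last):
    """Round-robin pick: first ready entry after the last grant, else None."""
    n = len(ready)
    for off in range(1, n + 1):
        i = (last + off) % n
        if ready[i]:
            return i
    return None

ready = [False, True, False, True]
print(pick_oldest(ready))          # 1
print(pick_round_robin(ready, 1))  # 3
```

Round-robin needs only a grant pointer per bank, while age order falls out for free if the bank is kept in program order, which is presumably why different FUs get different policies.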

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 11 18:48:06 2025


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

It basically puts the whole schedule in one HW matrix of
time_slots * resources and scans forward looking for empty slots to
allocate to each instruction. The scheduling is done at Rename, with
time slots assigned for each resource needed: source operand read
ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.

    They claim it is simpler but I question that.

    Scoreboards are simpler than RS (and smaller too) but come with a
    10%-odd disadvantage in performance (per frequency). The purported
    scheme is 7%-odd slower--read into that anything you want.

    Putting all the schedule info in one matrix means that to scale it
    requires adding more ports to the matrix. Also different resources
    can require different allocation and scheduling algorithms.
    Doing all this in one place at the same time gets complicated quickly.

    Scoreboards scale with inst^2+registers^3
    Stations scale with inst×FU+RoB

CDC got away with a Scoreboard because it tracked 3 sets of 8
registers; doing this with 32 uniform registers would be 8× as big !!!
and somewhat slower; doing this with 32+32 {int,FP} registers would be
16× worse than the 6600; adding in SIMD and I don't even know how to
calculate it.

    My simulated design intentionally distributed schedulers to each FU's bank
    of reservation stations so they all schedule concurrently and each scheduler algorithm is optimized for its FU.

    Each entire pipelined sequence is optimized for its pipeline::

    | INT RS | INT | Result|
    | MEM RS | AGEN | Cache | LDalgn | result|
    | DECODE | FMAC RS | MUX | MUL | ADD | NORM | Result|
    | MISC RS | stuff | Result|
    | BR RS | Check | backup|

    Also a wake-up matrix is not that complicated. I used the write of the destination Physical Register Number (PRN) as the wake-up signal.

    Agreed: however, I pipelined the result delivery mechanism into 3 stages:: {tag, result, exception} with the following timing::

    | tag | tag+1 | tag+2 |
    | result | result+1| result+2|
    | excptn | excptn+1| excptn+2|

    Tag consists of {pRN, pValid; slot, CKid, cValid}
    pRN is the physical Register Number
    pValid tells if you are writing the pRF
    slot is which FU
CKid is which Insert Bundle
    cValid tells if {slot, CKid} is delivering a result

    There is a case where aRN is written more than once in a single Insert
    Bundle, in these cases, its result is delivered only to RS entries
    waiting on {slot, CKid}; Here a pRN is not assigned to the result
    only a {slot, CKid}; hence pValid.

    There is the case where {slot, CKid} is not delivering a result;
    hence cValid. This is used for ST instructions to read pRF after
    all exceptions in the bundle have accrued. This eliminates forwarding
    on ST.data since all older results have <necessarily> been written
    into pRF.

    The exception timing allows for direct mapped caches to deliver data
    while checking for hit, and delivering miss after LD.data. It also
    allows for instructions like FDIV to deliver a result and then change
its mind later. Mc88120 could deliver FDIV at cycle 12 with a 1/128
chance of improper rounding, re-delivering the correctly rounded
result in cycle 17. SQRT was similar.

    The only real complication is that 1-cycle instructions have RS broadcast
    the tag instead of the dedicated FU.

Each PRN has a wire that runs to all RS, and each operand waiting for
that PRN watches that wire for a pulse indicating the written result
value will be forwarded in the next cycle on a dynamically assigned
result bus.

When instructions are written (Insert) into RS, each operand contains
the slot of the FU which will deliver that result. Thus, the operand
capture portion only "looks" at one result bus for its data. Mc88120,
1991.

    The RS operand can either save a copy of the value or launch execution immediately if all resources are available.

    My design appears to be similar to issue logic for
    RISC-V Berkeley Out-of-Order Machine (BOOM). As they note, schedulers
    are simple and different kinds can be used for different FU.
    My ALU used simple round-robin whereas Branch Unit BRU is age ordered.
    This is simple to do as each scheduler only looks at its own RS bank.

    I always considered the FU scheduler to be the RS "everybody ready?"
    OK "let's choose the oldest ready instruction !!" That is each FU has
    a dedicated RS on its front, and a dedicated result <bus> at its rear.

    https://docs.boom-core.org/en/latest/sections/issue-units.html

    The BOOM front end seems to have a lot more cycles than what is required.

I am working on a 6-wide GBOoO implementation, and
FETCH-PARSE-DECODE-INSERT is only 3½ cycles--while if RS does not
launch an instruction, the <just> decoded instruction can begin
{INSERT can be EXECUTE} in that 4th cycle, delivering its result in
cycle 5.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 12 05:45:32 2025

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

Maybe the same people/company. Thang Minh Tran, the inventor
of the patent, works for Simplex (the owner of the patent), but
previously worked for Andes, who gave the presentation. This
might be a case of shared IP, or a licensing agreement.

    Mitch, from his bio on simplexmicro.com, it seems that he worked
    at AMD around the same time you did, maybe a little earlier.
    Do you know him?

It basically puts the whole schedule in one HW matrix of
time_slots * resources and scans forward looking for empty slots to
allocate to each instruction. The scheduling is done at Rename, with
time slots assigned for each resource needed: source operand read
ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.

It's the same one that was referenced in the presentation; the
drawings also match.

    They claim it is simpler but I question that.

    Patents and marketing often claim advantages which are, let's say,
    dubious :-)

    [...]
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 12 15:44:51 2025


    Thomas Koenig <tkoenig@netcologne.de> posted:

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
    apparently). Apparently, they predict how many cycles their
    instructions are going to take, and replay if that doesn't work
    (for example in case of an L1 cache miss).

    Sounds interesting, I wonder what people here think of it.

    I searched for "processor" "schedule" "time resource matrix" and got
    a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions:
https://patents.google.com/patent/US11829762B2

Maybe the same people/company. Thang Minh Tran, the inventor
of the patent, works for Simplex (the owner of the patent), but
previously worked for Andes, who gave the presentation. This
might be a case of shared IP, or a licensing agreement.

    Mitch, from his bio on simplexmicro.com, it seems that he worked
    at AMD around the same time you did, maybe a little earlier.
    Do you know him?

    I was at AMD from Oct 1999 to May 2006

    I don't remember the name.

It basically puts the whole schedule in one HW matrix of time_slots * resources
    and scans forward looking for empty slots to allocate to each instruction.

    CRAY 1 vector sequencer used such a "shift register" approach.

The scheduling is done at Rename, with time slots assigned for each
resource needed: source operand read ports, FUs, result buses.
    If a load later misses L1 it triggers a replay of all younger instructions.

It's the same one that was referenced in the presentation; the
drawings also match.

    They claim it is simpler but I question that.

    Patents and marketing often claim advantages which are, let's say,
    dubious :-)

    [...]