https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
has an interesting take on how to do OoO (quite patented,
apparently). Apparently, they predict how many cycles their
instructions are going to take, and replay if that doesn't work
(for example in case of an L1 cache miss).
Sounds interesting, I wonder what people here think of it.
This made me wonder about the number of cycles cache reads for the
different levels take on CPUs with variable frequency. Do modern
CPUs use fewer cycles to access, for example, L2, when the frequency
is lower?
Thomas Koenig <tkoenig@netcologne.de> writes:
> https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
> has an interesting take on how to do OoO (quite patented,
> apparently). Apparently, they predict how many cycles their
> instructions are going to take, and replay if that doesn't work
> (for example in case of an L1 cache miss).
> Sounds interesting, I wonder what people here think of it.
> This made me wonder about the number of cycles cache reads for the
> different levels take on CPUs with variable frequency. Do modern
> CPUs use fewer cycles to access, for example, L2, when the frequency
> is lower?
It's likely that there is a clock domain crossing involved
to get to the memory subsystem. Note that in most processors
there are multiple clock domains: one for the processor/core
(e.g. 3 GHz) and one for the 'rest of chip' (typically 800 MHz
to 1 GHz). L1 and L2 are generally in the processor clock
domain, while L3 may be in either the processor domain
or the rest-of-chip domain.

Accesses to L1 and L2 take the same number of clocks regardless
of the actual clock speed when they're part of the same clock domain.
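To make that concrete: when the cycle count is fixed within a clock domain, only the wall-clock latency changes with frequency. A back-of-the-envelope sketch (the 12-cycle L2 hit latency is an assumed, illustrative number, not from any particular chip):

```python
# Cache hit latency inside a single clock domain: the *cycle* count is
# fixed, so wall-clock latency scales inversely with core frequency.
# L2_HIT_CYCLES is an assumed, illustrative number.

L2_HIT_CYCLES = 12

def l2_latency_ns(core_freq_ghz: float) -> float:
    """Wall-clock L2 hit latency in nanoseconds at a given core clock."""
    return L2_HIT_CYCLES / core_freq_ghz  # cycles / (cycles per ns) = ns

print(l2_latency_ns(3.0))  # 4.0 ns at 3 GHz
print(l2_latency_ns(1.5))  # 8.0 ns at 1.5 GHz -- same 12 cycles
```

So at half the frequency the cache still takes the same number of cycles; it just takes twice as long in nanoseconds.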
scott@slp53.sl.home (Scott Lurndal) posted:
> Note that in most processors,
Certainly at the chip level, the interiors of "cores" are mostly
a single clock domain. Core = {processor, L1, L2, miss buffering}
> there are multiple clock domains;
> one for the processor/core (e.g. 3 GHz) and one for the 'rest of chip'
> (typically 800 MHz to 1 GHz). L1 and L2 are generally in the processor
> clock domain, while L3 may be in either the processor domain
> or the rest-of-chip domain.
The interconnect can run in either the core or the rest-of-chip domain.
PCIe can have each root complex at a different frequency.
Thomas Koenig wrote:
> https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
> has an interesting take on how to do OoO (quite patented,
> apparently). Apparently, they predict how many cycles their
> instructions are going to take, and replay if that doesn't work
> (for example in case of an L1 cache miss).
> Sounds interesting, I wonder what people here think of it.
I searched for "processor" "schedule" "time resource matrix" and got
a hit on a different company's patent for what looks like the same idea.

Time-resource matrix for a microprocessor with time counter
for statically dispatching instructions
https://patents.google.com/patent/US11829762B2

It basically puts the whole schedule in one HW matrix of
time_slots * resources and scans forward looking for empty slots to
allocate to each instruction. The scheduling is done at Rename, and
time slots are assigned for each resource needed: source operand read
ports, FUs, result buses. If a load later misses L1, it triggers a
replay of all younger instructions.

They claim it is simpler, but I question that.
Putting all the schedule info in one matrix means that to scale it
requires adding more ports to the matrix. Also different resources
can require different allocation and scheduling algorithms.
Doing all this in one place at the same time gets complicated quickly.
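As I read the patent, the forward scan might look roughly like this. A toy sketch only, not the patented implementation; the slot-window size, resource names, and the greedy first-fit policy are all my assumptions:

```python
# Toy sketch of a time-resource matrix: rows are future cycles, columns
# are resources (read ports, FUs, result buses). At rename, each
# instruction scans forward from its earliest ready cycle for a row
# where every resource it needs is still free. Names are illustrative.

NUM_SLOTS = 16  # how far into the future the matrix tracks

def allocate(matrix, earliest, needed):
    """Find the first cycle >= earliest with all needed resources free,
    claim them, and return that cycle (or None if the window is full)."""
    for t in range(earliest, NUM_SLOTS):
        if all(not matrix[t][r] for r in needed):
            for r in needed:
                matrix[t][r] = True  # claim the slots
            return t
    return None  # no slot in the window; dispatch would stall

resources = ["rd_port0", "rd_port1", "alu0", "alu1", "result_bus"]
matrix = [{r: False for r in resources} for _ in range(NUM_SLOTS)]

t1 = allocate(matrix, 0, ["rd_port0", "alu0", "result_bus"])
t2 = allocate(matrix, 0, ["rd_port0", "alu0", "result_bus"])  # conflict
print(t1, t2)  # 0 1 -- the second instruction slips to the next cycle
```

Even in this toy form you can see the scaling problem: every rename-stage allocation has to probe (and possibly update) the shared matrix, so widening dispatch means more ports into one structure.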
My simulated design intentionally distributed schedulers to each FU's bank
of reservation stations so they all schedule concurrently, and each
scheduler algorithm is optimized for its FU.
Also, a wake-up matrix is not that complicated. I used the write of the
destination Physical Register Number (PRN) as the wake-up signal.
Each PRN has a wire that runs to all RS, and each operand waiting for
that PRN watches that wire for a pulse indicating the write result value
will be forwarded in the next cycle on a dynamically assigned result bus.
The RS operand can either save a copy of the value or launch execution
immediately if all resources are available.
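In behavioral-model terms (a sketch of the broadcast idea above, not anyone's RTL; the class and PRN values are mine):

```python
# Toy sketch of PRN-broadcast wake-up: when a result's physical register
# number (PRN) is written, a pulse on that PRN's wire marks every
# reservation-station operand waiting on it as ready.

class RSEntry:
    def __init__(self, src_prns):
        # operands not yet available, keyed by the PRN each waits on
        self.waiting = set(src_prns)

    def wakeup(self, prn):
        self.waiting.discard(prn)  # pulse seen on this PRN's wire

    def ready(self):
        return not self.waiting    # all source operands satisfied

rs_bank = [RSEntry([5, 9]), RSEntry([9])]

# Broadcast: PRN 9's result will be forwarded next cycle.
for entry in rs_bank:
    entry.wakeup(9)

print([e.ready() for e in rs_bank])  # [False, True]
```

The point is that the broadcast itself is just one wire per PRN fanned out to the RS banks; each bank's scheduler then decides locally what to issue.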
My design appears to be similar to the issue logic of the
RISC-V Berkeley Out-of-Order Machine (BOOM). As they note, schedulers
are simple, and different kinds can be used for different FUs.
My ALU used simple round-robin, whereas the branch unit (BRU) is
age-ordered. This is simple to do, as each scheduler only looks at its
own RS bank.
https://docs.boom-core.org/en/latest/sections/issue-units.html
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
> Thomas Koenig wrote:
>> https://old.chipsandcheese.com/2025/08/29/condors-cuzco-risc-v-core-at-hot-chips-2025/
>> has an interesting take on how to do OoO (quite patented,
>> apparently). Apparently, they predict how many cycles their
>> instructions are going to take, and replay if that doesn't work
>> (for example in case of an L1 cache miss).
>> Sounds interesting, I wonder what people here think of it.
> I searched for "processor" "schedule" "time resource matrix" and got
> a hit on a different company's patent for what looks like the same idea.
> Time-resource matrix for a microprocessor with time counter
> for statically dispatching instructions
> https://patents.google.com/patent/US11829762B2

Maybe the same people/company. Thang Minh Tran, the inventor
of the patent, works for Simplex Micro (the owner of the patent), but
previously worked for Andes, who held the presentation. This
might be a case of shared IP, or a licensing agreement.

Mitch, from his bio on simplexmicro.com, it seems that he worked
at AMD around the same time you did, maybe a little earlier.
Do you know him?

> It basically puts all the schedule in one HW matrix of time_slots * resources
> and scans forward looking for empty slots to allocate to each instruction.
> The scheduling is done at Rename and time slots assigned for each resource
> needed, source operand read ports, FU's, result buses.
> If a load later misses L1 it triggers a replay of all younger instructions.

It's the same patent that was referenced in the presentation; the
drawings also match.

> They claim it is simpler but I question that.

Patents and marketing often claim advantages which are, let's say,
dubious :-)
[...]