Forum: War Ensemble BBS

Re: Intel's Software Defined Super Cores

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 11:48:00 2025

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> writes:

On 9/19/2025 4:50 AM, Anton Ertl wrote:

BGB <cr88192@gmail.com> writes:

And, for many uses, performance is "good enough";

In that case, better buy a cheaper AMD64 CPU rather than a
particularly fast CPU with a different architecture X and then run a
dynamic AMD64->X translator on it.

Possibly, it depends.

The question is what could Intel or AMD do if the wind blew in that direction.

What direction?

Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.

Evidence?

No hard numbers, but experience here:
ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
ARM11 cores).

The RasPi basically runs circles around the Eee...

That's probably a software problem. Different Eee PC models have
different CPUs, Celeron M @571MHz, 900MHz, or 630MHz, Atoms with
1330-1860MHz, or AMD C-50 or E350. All of them are quite a bit faster
than the 700MHz ARM11. While I don't have a Raspi1 result on
https://www.complang.tuwien.ac.at/franz/latex-bench, I have a Raspi 3
result (and the Raspi 3 with its 1200MHz 2-wide core is quite a bit
faster than the 700MHz ARM11), and also some CPUs similar to those
used in the Eee PC; numbers are times in seconds:

- Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8                      5.46
- Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2)                    2.89
- Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit    2.323
- AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit    1.216

So all of these CPUs clearly beat the one in the Raspi3, which I
expect to be clearly faster than the ARM11.
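
To put rough numbers on "clearly beat", the run times listed above can be turned into speedups relative to the Raspi 3. This is only a minimal sketch in C: the struct, the function names and the shortened labels are illustrative assumptions, and only the times themselves come from the latex-bench list.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the latex-bench list: a (shortened) label and the run time. */
struct bench_result {
    const char *label;
    double seconds;
};

static const struct bench_result results[] = {
    { "Raspberry Pi 3 (Cortex A53, 1.2GHz)", 5.46  },
    { "Celeron 800",                          2.89  },
    { "Intel Atom 330 (1.6GHz)",              2.323 },
    { "AMD E-450 (1650MHz)",                  1.216 },
};

/* Speedup over a reference machine: lower time is faster, so the
   reference time goes in the numerator. */
double speedup_over(double reference_seconds, double seconds)
{
    assert(seconds > 0.0);
    return reference_seconds / seconds;
}

/* Print every machine's speedup relative to the first row (the Raspi 3). */
void print_speedups(void)
{
    size_t n = sizeof results / sizeof results[0];
    for (size_t i = 0; i < n; i++)
        printf("%-38s %6.3f s  %.2fx\n", results[i].label,
               results[i].seconds,
               speedup_over(results[0].seconds, results[i].seconds));
}
```

On these figures the Celeron 800, Atom 330 and E-450 come out at roughly 1.9x, 2.35x and 4.5x the Raspi 3 on this one benchmark.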

Now imagine running the software that made the Eee PC so slow with
dynamic translation on a Raspi1. How slow would that be?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.21a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Sep 20 12:01:39 2025

EricP <ThatWouldBeTelling@thevillage.com> writes:

I see the difference between CISC and RISC as in the micro-architecture,

But the microarchitecture is not an architectural criterion.

changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.

People changed from talking CPI to IPC when CPI started to go below 1.
That's mainly a distinction between single-issue and superscalar CPUs.

The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.

The same holds true for the MIPS R2000, the ARM1/2 (and probably many successors), probably early SPARCs and early HPPA CPUs, all of which
are considered as RISCs. Documents about them also talk about CPI.

And the 486 is already pipelined and can perform straight-line code at
1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
straight-line code).

One can take an Alpha ISA and implement it with a microcoded sequencer
but that should not be called RISC

Alpha is a RISC architecture. So this hypothetical implementation
would certainly be an implementation of a RISC architecture.

RISC changes that design to one like a multi-threaded program with
messages passing between them called uOps, where the dynamic state
of each instruction is mostly carried with the uOp message,
and each thread does something very simple and passes the uOp on.
Where global resources are required, they are temporarily dynamically
allocated to the uOp by the various threads, carried with the uOp,
and returned later when the uOp message is passed to the Retire thread.
The Retire thread is the only one which updates the visible global state.

This does not sound like RISC vs. non-RISC at all, but like OoO microarchitecture, and the contrast would be an in-order execution microarchitecture. Both RISCs and non-RISCs can make use of OoO microarchitectures, and have done so.

The RISC design guidelines described by various papers, rather than
go/no-go decisions, are mostly engineering compromises for consideration
of things which would make an MST-MPA more expensive to implement or
otherwise interfere with maximizing the active concurrency of all threads.

The interesting aspect is that RISCs are easier to implement in simple
pipelines like the ones of early ARM, HPPA, MIPS and SPARC
implementations, but can also be implemented as in-order superscalar
or OoO superscalar microarchitectures; you can also implement them as a
sequentially-executed microcode engine. Wolfgang Kleinert implemented
a microcoded RISC in the 1980s, but I think that it was pipelined.

The advantages from the instruction set diminish with the more complex implementation techniques, and there are a number of instruction set
design decisions in early RISCs that turned out to be not so great and
that were eliminated in later RISCs (if not from the start), most
notably delayed branches, but many of the recent instruction sets (ARM
A64, RISC-V) take many of the same design decisions as the RISC
architectures of the 1980s (load/store, register architecture, etc.,
see John Mashey's criteria and recent discussions about this topic),
whereas many non-RISCs deviate from this design style.

This is why I think it would have been possible to build a risc-style
PDP-11 in 1975 TTL, or a VAX if they had just left the instructions of
the same complexity as PDP-11 ISA (53 opcodes, max one immediate,
max one mem op per instruction).

The PDP-11 instruction set is not RISC, and you paint a picture that
is too rosy: It has up to two mem ops per instruction, and IIRC even memory-indirect addressing modes. Not a problem for the
physically-addressed first implementations, nasty as soon as you add
virtual memory.

Building a pipelined implementation of the PDP-11 (like the 486 was
for IA-32) would have been quite a bit harder than building the 486
(admittedly the 486 has to deal with 16-bit modes and other legacy
features, so it's not the easiest target, either).

For the VAX I would go for a RISC instead of a cleaned-up IA-32-like
instruction set, and then implement pipelining. I would put the effort
into implementing compressed instructions rather than load-and-op or
RMW instructions.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Sep 20 13:10:49 2025

BGB <cr88192@gmail.com> wrote:

On 9/19/2025 9:33 AM, Anton Ertl wrote:

BGB <cr88192@gmail.com> writes:

Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

And most Intel and AMD chips don't have 150W TDP, either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently
higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.

Seems so...
Seems the CPU I am running has a 105W TDP, I had thought I remembered
150W, oh well...

Seems 150-200W is more Threadripper territory, and not the generic
desktop CPUs.

Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...

No TDP numbers are given for Oryon. For Apple's M4, the numbers are

M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W

Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.

A lot of the ARM SoC's I had seen had lower TDPs, though more often with Cortex A53 or A55/A78 cores or similar:

Say (MediaTek MT6752):
https://unite4buy.com/cpu/MediaTek-MT6752/
Has a claimed TDP here of 7W and has 8x A53.

Or, for a slightly newer chip (2020):
https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

TDP 5W, has A55 and A78 cores.

Some amount of the HiSilicon numbers look similar...

But, yeah, I guess if using these as data-points:
A55: ~ 5/8W, or ~ 0.625W (very crude)
Zen+: ~ 105/16W, ~ 6.56W

So, more like 10x here, but ...
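
The per-core figures above are just package TDP divided by a unit count, so the result depends on what is counted. The sketch below redoes that arithmetic in C. It assumes the 105W part is the 8-core/16-thread Ryzen 2700X mentioned elsewhere in the thread, and the function names are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

/* Crude per-unit power: package TDP spread evenly over `units`
   (cores or hardware threads). Ignores uncore, GPU and boost behaviour. */
double watts_per_unit(double package_tdp_watts, int units)
{
    assert(units > 0);
    return package_tdp_watts / units;
}

void print_tdp_estimates(void)
{
    double a55         = watts_per_unit(5.0, 8);    /* 5 W SoC over 8 cores        */
    double zen_threads = watts_per_unit(105.0, 16); /* 105 W over 16 threads       */
    double zen_cores   = watts_per_unit(105.0, 8);  /* same chip, 8 physical cores */

    printf("A55-class core:  %.3f W\n", a55);
    printf("Zen+ per thread: %.2f W (%.1fx)\n", zen_threads, zen_threads / a55);
    printf("Zen+ per core:   %.2f W (%.1fx)\n", zen_cores, zen_cores / a55);
}
```

Dividing by 16 threads gives the roughly 10x gap quoted above; dividing by the 8 physical cores instead gives about 13W per core, or roughly 21x. Either way this is TDP arithmetic, not measured power.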

Then, I guess it becomes a question of the relative performance
difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

Judging based on my cellphone (with A53 cores), and previously running
my emulator in Termux, there is a performance difference, but nowhere
near 10x.

Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks
to 4453.45 DMIPS (Dhrystone MIPS). Single core in my desktop benchmarks to
about 50000 DMIPS. Dhrystone contains string operations which benefit
from SSE/AVX, but I would expect that on media loads the speed ratio would
be even more favourable to the desktop core. On jumpy code the ratio is
probably lower. 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.

It is hard to compare performance per watt: Orange Pi Zero 3 has low
power draw (of order 100 mA from 5V USB charger with one core active) and
it is not clear how it is distributed between CPUs and Ethernet interface.
RISCV in Milkv-Duo has even lower power draw. OTOH desktop cores
normally seem to run at a fraction of rated power too (but I have
no way to directly measure CPU power draw).
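
As a rough cross-check of the figures above, the score ratios and the board-level power can be worked out as follows. This is only a sketch: the function names are made up for illustration, and "100 mA at 5V" is taken at face value as a whole-board figure, so the DMIPS/W number it prints says little about the CPU core alone.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two Dhrystone scores (same units on both sides). */
double score_ratio(double faster_dmips, double slower_dmips)
{
    assert(slower_dmips > 0.0);
    return faster_dmips / slower_dmips;
}

/* Electrical power drawn from a supply: P = V * I. */
double supply_watts(double volts, double amps)
{
    return volts * amps;
}

void print_board_comparison(void)
{
    const double orange_pi_dmips = 4453.45; /* one H618 core, from the post */
    const double desktop_dmips   = 50000.0; /* "about 50000", from the post */
    const double milkv_dmips     = 1472.0;  /* 1GHz RISC-V, from the post   */

    /* "of order 100 mA from 5V": whole board, not just the CPU core. */
    double board_watts = supply_watts(5.0, 0.100);

    printf("desktop / Orange Pi: %.1fx\n",
           score_ratio(desktop_dmips, orange_pi_dmips));
    printf("Orange Pi / Milk-V:  %.1fx\n",
           score_ratio(orange_pi_dmips, milkv_dmips));
    printf("Orange Pi board:     ~%.1f W, ~%.0f DMIPS/W (board-level)\n",
           board_watts, orange_pi_dmips / board_watts);
}
```

That puts the desktop core at about 11x the H618 core on Dhrystone, and the board at about 0.5W.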

Of course, there is a catch: desktop CPU is made on more advanced
process than small processors. So it is hard to separate effects
from architecture and from the process.
--
Waldek Hebisch

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Sep 20 19:32:17 2025

Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

I see the difference between CISC and RISC as in the micro-architecture,

But the microarchitecture is not an architectural criterion.

changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.

People changed from talking CPI to IPC when CPI started to go below 1.
That's mainly a distinction between single-issue and superscalar CPUs.

The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.

The same holds true for the MIPS R2000, the ARM1/2 (and probably many successors), probably early SPARCs and early HPPA CPUs, all of which
are considered as RISCs. Documents about them also talk about CPI.

And the 486 is already pipelined and can perform straight-line code at
1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
straight-line code).

Maybe relevant:

Performance optimizers writing asm regularly hit that 1 IPC on the 486
and (with more difficulty) 2 IPC on the Pentium.

When we did get there, the final performance was typically 3X compiled C
code.

That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
the PPro and later OoO CPUs.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 20 17:38:19 2025

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

I see the difference between CISC and RISC as in the micro-architecture,

But the microarchitecture is not an architectural criterion.

changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.

People changed from talking CPI to IPC when CPI started to go below 1.
That's mainly a distinction between single-issue and superscalar CPUs.

The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.

The same holds true for the MIPS R2000, the ARM1/2 (and probably many
successors), probably early SPARCs and early HPPA CPUs, all of which
are considered as RISCs. Documents about them also talk about CPI.

And the 486 is already pipelined and can perform straight-line code at
1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
straight-line code).

Maybe relevant:

Performance optimizers writing asm regularly hit that 1 IPC on the 486
and (with more difficulty) 2 IPC on the Pentium.

When we did get there, the final performance was typically 3X compiled C code.

That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
the PPro and later OoO CPUs.

And then came back with SIMD, I presume? :-)
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.

From Michael S@already5chosen@yahoo.com to comp.arch on Sat Sep 20 22:01:27 2025

On Sat, 20 Sep 2025 07:56:49 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

Yes, organizing the interconnect in a hierarchical way can help reduce
the increase in interconnect cost, but I expect that there is a reason
why Intel did not do that for its server CPUs with P-Cores, by e.g.,
forming clusters of 4, and then continuing with the ring; instead,
they opted for a grid interconnect.

- anton

I don't know for sure, but would imagine that the reason is that their
server CPUs with P-cores have the same design for low-to-mid end "cloud"
models and for high-end "enterprise" models. High-end models have OLTP
and similar enterprise workloads as a rather important market. Flatter
LLC is better for OLTP/enterprise than a dozen or two of separate L3
caches. Besides, their current L2 caches are rather big, so if they
make those separate L3s truly exclusive, which is optimal for reduction
of cc traffic, then there would be a rather big waste of total cache
capacity.

An alternative is to leave the LLC intact and instead make L2s shared by
pairs of cores. That is unacceptable because of yet another market
addressed by the same Xeon line - computations/HPC, where being
limited by L2 bandwidth is not rare even now. With shared L2 it will
become very common.

3 different uncore designs for 3 different markets can solve that
nicely, but of course in Intel's current financial situation that
is unthinkable. Probably even the current arrangement with 3 Xeon lines
(Xeon-E = desktop chips with E-cores fused off, Sierra Forest = plenty
of Crestmont cores and "normal" Xeons currently represented by Granite
Rapids) could be unsustainable.

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Sep 20 21:14:23 2025

Thomas Koenig wrote:

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

Anton Ertl wrote:

EricP <ThatWouldBeTelling@thevillage.com> writes:

I see the difference between CISC and RISC as in the micro-architecture,

But the microarchitecture is not an architectural criterion.

changing from a single sequential state machine view to multiple concurrent
machines view, and from Clocks Per Instruction to Instructions Per Clock.

People changed from talking CPI to IPC when CPI started to go below 1.
That's mainly a distinction between single-issue and superscalar CPUs.

The monolithic microcoded machine, which covers 360, 370, PDP-11, VAX,
386, 486 and Pentium, is like a single threaded program which
operates sequentially on a single global set of state variables.
While there is some variation and fuzziness around the edges,
the heart of each of these are single sequential execution engines.

The same holds true for the MIPS R2000, the ARM1/2 (and probably many
successors), probably early SPARCs and early HPPA CPUs, all of which
are considered as RISCs. Documents about them also talk about CPI.

And the 486 is already pipelined and can perform straight-line code at
1 CPI; the Pentium is superscalar, and can have up to 2 IPC (in
straight-line code).

Maybe relevant:

Performance optimizers writing asm regularly hit that 1 IPC on the 486
and (with more difficulty) 2 IPC on the Pentium.

When we did get there, the final performance was typically 3X compiled C
code.

That 3X gap almost went away (maybe 1.2 to 1.5X for many algorithms) on
the PPro and later OoO CPUs.

And then came back with SIMD, I presume? :-)

Sure!

I typically got 3X SIMD speedup from 4-way processing, years before any compilers were able to autovectorize to again partly close the gap.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From BGB@cr88192@gmail.com to comp.arch on Sat Sep 20 16:10:40 2025

On 9/20/2025 8:10 AM, Waldek Hebisch wrote:

BGB <cr88192@gmail.com> wrote:

On 9/19/2025 9:33 AM, Anton Ertl wrote:

BGB <cr88192@gmail.com> writes:

Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

And most Intel and AMD chips don't have 150W TDP, either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently
higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.

Seems so...
Seems the CPU I am running has a 105W TDP, I had thought I remembered
150W, oh well...

Seems 150-200W is more Threadripper territory, and not the generic
desktop CPUs.

Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...

No TDP numbers are given for Oryon. For Apple's M4, the numbers are

M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W

Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.

A lot of the ARM SoC's I had seen had lower TDPs, though more often with
Cortex A53 or A55/A78 cores or similar:

Say (MediaTek MT6752):
https://unite4buy.com/cpu/MediaTek-MT6752/
Has a claimed TDP here of 7W and has 8x A53.

Or, for a slightly newer chip (2020):
https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

TDP 5W, has A55 and A78 cores.

Some amount of the HiSilicon numbers look similar...

But, yeah, I guess if using these as data-points:
A55: ~ 5/8W, or ~ 0.625W (very crude)
Zen+: ~ 105/16W, ~ 6.56W

So, more like 10x here, but ...

Then, I guess it becomes a question of the relative performance
difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

Judging based on my cellphone (with A53 cores), and previously running
my emulator in Termux, there is a performance difference, but nowhere
near 10x.

Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks
to 4453.45 DMIPS (Dhrystone MIPS). Single core in my desktop benchmarks to
about 50000 DMIPS. Dhrystone contains string operations which benefit
from SSE/AVX, but I would expect that on media loads the speed ratio would
be even more favourable to the desktop core. On jumpy code the ratio is
probably lower. 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.

It is hard to compare performance per watt: Orange Pi Zero 3 has low
power draw (of order 100 mA from 5V USB charger with one core active) and
it is not clear how it is distributed between CPUs and Ethernet interface.
RISCV in Milkv-Duo has even lower power draw. OTOH desktop cores
normally seem to run at a fraction of rated power too (but I have
no way to directly measure CPU power draw).

Of course, there is a catch: desktop CPU is made on more advanced
process than small processors. So it is hard to separate effects
from architecture and from the process.

I had noted before that when I compiled Dhrystone on my Ryzen using
MSVC, it scores around 10M Dhrystones/second, or 5691 DMIPS, or around
1.53 DMIPS/MHz.

Curiously, the score is around 4x higher (around 40M) if Dhrystone is
compiled with GCC (and around 2.5x with Clang).

For most other things, the performance scores seem closer.

I don't really trust GCC's and Clang's Dhrystone scores as they seem
basically out-of-line with most other things I can measure.

Noting my BJX2 core seems to perform at 90K at 50MHz, or 1.02 DMIPS/MHz.
If assuming MSVC as the reference, this would imply (after normalizing
for clock-speeds) that the Ryzen only gets around 50% more IPC.
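
The conversion behind these figures is the usual one: DMIPS is Dhrystones per second divided by 1757, the VAX 11/780 score that defines 1 DMIPS. A small sketch in C, assuming the 3.7GHz clock quoted elsewhere in the thread for the Ryzen (the function names are illustrative):

```c
#include <assert.h>
#include <stdio.h>

/* Dhrystones/second of the VAX 11/780, the conventional 1-DMIPS reference. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

double dmips_from_dhrystones(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dmips_from_dhrystones(dhrystones_per_sec) / clock_mhz;
}

void print_dhrystone_comparison(void)
{
    /* Figures from the post: ~10M/s on the Ryzen (MSVC build, 3.7 GHz
       assumed), ~90K/s on the BJX2 core at 50 MHz. */
    double ryzen = dmips_per_mhz(10.0e6, 3700.0);
    double bjx2  = dmips_per_mhz(90.0e3, 50.0);

    printf("Ryzen (MSVC): %.0f DMIPS, %.2f DMIPS/MHz\n",
           dmips_from_dhrystones(10.0e6), ryzen);
    printf("BJX2:         %.1f DMIPS, %.2f DMIPS/MHz\n",
           dmips_from_dhrystones(90.0e3), bjx2);
    printf("per-clock ratio: %.2fx\n", ryzen / bjx2);
}
```

With those inputs the per-clock ratio comes out at about 1.5x, which matches the "around 50% more" above. It only holds for the MSVC build; the GCC and Clang scores mentioned above would scale the Ryzen side by roughly 4x and 2.5x.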

I noted when compiling my BJX2 emulator:
My Ryzen can emulate it at roughly 70MHz;
My cell-phone can manage it at roughly 30MHz.

This isn't *that* much larger than the difference in CPU clock speeds.

It is like, I seemingly live in a world where a lot of my own benchmark
attempts tend to be largely correlated with the relative difference in
clock speeds and similar.

Well, except for my old laptop (from 2003), and an ASUS Eee, which seem
to perform somewhat below that curve.

Though, in the case of the laptop, it may be a case of not getting all
that much memory bandwidth from a 100MHz DDR1 SO-DIMM (a lot of the performance on some tests seems highly correlated with "memcpy()"
speeds, and on that laptop, its memcpy speeds are kinda crap if compared
with CPU clock-speed).

Well, and the Eee has, IIRC, an Intel Atom N270 down-clocked to 630 MHz.
Thing ran Quake and Quake 2 pretty OK, but not much else.

Though, if running my emulator on the laptop, it is more back on the
curve of relative clock-speed, rather than on the
relative-memory-bandwidth curve.

It seems both my neural-net stuff and most of my data compression stuff,
more follow the memory bandwidth curve (though, for the laptop, it seems
NN stuff can get a big boost here by using BFloat16 and getting a little clever with the repacking).

Well, and then my BJX2 core seems to punch slightly outside its weight
class (MHz wise) by having disproportionately high memory bandwidth.

...

From BGB@cr88192@gmail.com to comp.arch on Sat Sep 20 22:01:48 2025

On 9/20/2025 6:48 AM, Anton Ertl wrote:

BGB <cr88192@gmail.com> writes:

On 9/19/2025 4:50 AM, Anton Ertl wrote:

BGB <cr88192@gmail.com> writes:

And, for many uses, performance is "good enough";

In that case, better buy a cheaper AMD64 CPU rather than a
particularly fast CPU with a different architecture X and then run a
dynamic AMD64->X translator on it.

Possibly, it depends.

The question is what could Intel or AMD do if the wind blew in that
direction.

What direction?

In some direction where emulating x86 on in-order cores was preferable to
having x86 in hardware...

May or may not be "extreme budget".

Though, I am writing this after having to battle for a while to get
"boot magic" out of a Dell OptiPlex that I got on Amazon for $80.

Turned out the UEFI BIOS was not installed correctly on the PC, which
was effectively "utterly helpless" without it.

Had to use a Dell tool to make an installer image on a USB thumb-drive,
to get a bootable BIOS to configure the thing into a form where it could
actually boot (where, it could then apparently install the BIOS config
UI from the USB drive). Apparently no support for Legacy Boot, option
was listed but just sort of grayed out and could not be selected
(apparently no TPM either, so can't run Win 11).

But, for $80, could get something with a Core i3 and a 500GB HDD.
Case was only really designed to handle a 2.5" drive, no space to fit a
3.5" HDD.

Going much cheaper, it apparently crosses from HDD into "eMMC Flash" territory, but "64GB eMMC Flash" was maybe a little too budget.

There were also some options with M.2, but I wanted SATA. At least, in
theory, with SATA one can swap HDDs if needed, but this is seemingly
hindered if the firmware is so limited as to be rendered helpless if it
can't load it from the HDD.

Likewise, x86 tends to need a lot of the "big CPU" stuff to perform
well, whereas something like a RISC style ISA can get better performance
on a comparably smaller and cheaper core, and with a somewhat better
"performance per watt" metric.

Evidence?

No hard numbers, but experience here:
ASUS Eee (with an in-order Intel Atom) vs original RasPi (with 700MHz
ARM11 cores).

The RasPi basically runs circles around the Eee...

That's probably a software problem. Different Eee PC models have
different CPUs, Celeron M @571MHz, 900MHz, or 630MHz, Atoms with
1330-1860MHz, or AMD C-50 or E350. All of them are quite a bit faster
than the 700MHz ARM11. While I don't have a Raspi1 result on
https://www.complang.tuwien.ac.at/franz/latex-bench, I have a Raspi 3
result (and the Raspi 3 with its 1200MHz 2-wide core is quite a bit
faster than the 700MHz ARM11), and also some CPUs similar to those
used in the Eee PC; numbers are times in seconds:

- Raspberry Pi 3, Cortex A53 1.2GHz Raspbian 8                      5.46
- Celeron 800, , PC133 SDRAM, RedHat 7.1 (expi2)                    2.89
- Intel Atom 330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit    2.323
- AMD E-450 1650MHz (Lenovo Thinkpad X121e), Ubuntu 11.10 64-bit    1.216

So all of these CPUs clearly beat the one in the Raspi3, which I
expect to be clearly faster than the ARM11.

IIRC, I was running Debian on the Eee (IIRC because the Xandros it came
with was kinda useless).

The one I have being one of the 701 variants (would need to find it
again to know the model). Looking online, it was probably one of the underclocked Celeron models though.

Not sure how fast (or not fast) it was, but it was basically about
enough to run Quake and Quake 2 in 640x480, but was hard pressed to do
much more than this (and be playable).

Trying to use Firefox or similar on it was just kinda painful.

Now imagine running the software that made the Eee PC so slow with
dynamic translation on a Raspi1. How slow would that be?

Seemingly the RasPi could run Quake OK in 800x600 though...
And, also did well working with CRAM video.

By other subjective measures, at least the GUI on the RasPi didn't
behave like molasses.

So, in any case, a better user experience at least (with some
uncertainty as to the actual speed).

Granted, might have been relevant to time running GCC builds or similar
for a more objective measure, would need to find both.

Though, at least, an emulator would need to be faster than DOSBox, as
DOSBox on RasPi tends to be too slow to even really run Doom or similar.

My cellphone at least gave a slightly better experience running DOSBox
(well, except that DOSBox and Termux on Android occasionally forget all
of their local storage and get reverted to their default contents).

RasPi+DOSBox can at least seemingly run Windows 3.11 and similar though.

Though, AFAIK DOSBox on ARM is running purely as an interpreter.

I remember though that one time I did try doing custom code generation
on the RasPi, and performance was terrible. At the time it seemed like
there was some "secret sauce" that GCC had to not get terrible performance.

Though, IIRC, this was a fork where I had tried to modify BGBCC's
SuperH backend to be able to target Thumb2.

Or, seeming informal/subjective ranking (mostly from memory):

Eee (CPU = something slow):
  Quake 2, 640x400, OK-ish
  Quake 3, N/A, didn't work
  (No memcpy score or formal benchmarks)

Laptop from 2003 (1.4GHz Athlon, of some variant):
  Quake 1/2: 1024x768, runs well.
    (1024x768 is max resolution of LCD).
  Quake 3: Also runs well.
    As did GLQuake and Quake2 in OpenGL.
  Half-Life runs well.
  Half-Life 2, ran but poorly.
  Gets around 400MB/s in a memcpy benchmark.
    DDR1 100 MHz (or, DDR-200)
    Notably lower than theoretical bandwidth.
  (No values for LZ4 or CRAM tests IIRC)

RasPi 1 (700 MHz ARM11):
  Quake 800x600 runs OK.
  Quake 3: Ran, but poorly.
  Gets around 1.2 GB/sec in memcpy.
  Around 300 MB/s LZ4 decode
  Around 400 Mpix/sec in CRAM decode.

RasPi 3 (1400 MHz 4x A53):
  Quake 1/2/3 and GLQuake and Q3A run well.
  Gets around 1.6 GB/sec in memcpy.
  Around 500 MB/s LZ4 decode
  Around 700 Mpix/sec in CRAM decode.

Laptop from 2009 (2.1 GHz Core 2, 2 cores):
  Quake 1/2 and Half-Life are 60 fps at max resolution (1440x900).
    In SW rendering only.
    It was a very good option if you were OK with software rendering.
  Quake 3: Around 20 fps.
  GLQuake and Quake3 perform like dog crap.
    GPU: Intel GMA X3100
  Half-Life 2: Also very poor.
  Minecraft ran, but unplayable.
    Even on lowest draw distance.
  Doom 3, started up at least...
    Severe graphical glitches (lighting didn't work correctly)
    Dead slow.
  Around 2.4 GB/sec in memcpy.
  Around 2.0 GB/s in LZ4
  Around 1500 Mpix/sec in CRAM decode.
  Performs well in CPU based tasks.
    OpenGL via Software rasterization almost as fast as the GPU.

Current PC (Ryzen 2700X, 3.7GHz, 8C16T)
  No issues running any of these games.
  Memcpy: 3.6 GB/sec.
    DDR4-2133
  Around 3.2 GB/sec in LZ4
  Around 2000 Mpix/sec in CRAM decode.

As can be noted:
  memcpy tests tend to measure lower than RAM bandwidth.
  CRAM decode often tends to exceed memcpy.
  My memcpy and LZ4 tests are single threaded.
    Multi-threading can often give higher total bandwidth.
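
For reference, a single-threaded memcpy test of this kind can be sketched as below, together with the theoretical peak of one DDR channel for comparison. This is not the benchmark used for the numbers above. The buffer size, repetition count, function names and the single 64-bit channel are all assumptions, and clock() measures CPU time, which is close to wall time only for a single-threaded run.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Written after every copy so the compiler cannot drop the memcpy. */
static volatile unsigned sink;

/* Theoretical peak of one DDR channel in MB/s: mega-transfers per second
   times bus width in bytes (8 for a standard 64-bit channel). */
double ddr_peak_mb_per_s(double mega_transfers_per_s, int bus_bytes)
{
    assert(bus_bytes > 0);
    return mega_transfers_per_s * bus_bytes;
}

/* Single-threaded memcpy throughput in MB/s (bytes copied, 1 MB = 1e6
   bytes). Returns 0.0 on bad arguments, allocation failure, or if the
   timer shows no elapsed time. */
double memcpy_mb_per_s(size_t buf_bytes, int reps)
{
    if (buf_bytes == 0 || reps <= 0)
        return 0.0;

    unsigned char *src = malloc(buf_bytes);
    unsigned char *dst = malloc(buf_bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 0x5A, buf_bytes);  /* touch pages so they are really mapped */
    memset(dst, 0, buf_bytes);

    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        src[0] = (unsigned char)i;            /* vary the input per pass */
        memcpy(dst, src, buf_bytes);
        sink += dst[0] + dst[buf_bytes - 1];  /* keep the copy observable */
    }
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    double megabytes = (double)buf_bytes * reps / 1e6;
    free(src);
    free(dst);
    return seconds > 0.0 ? megabytes / seconds : 0.0;
}

void print_bandwidth_report(void)
{
    printf("DDR-200 peak:   %.0f MB/s\n", ddr_peak_mb_per_s(200.0, 8));
    printf("DDR4-2133 peak: %.0f MB/s\n", ddr_peak_mb_per_s(2133.0, 8));
    printf("memcpy here:    %.0f MB/s\n", memcpy_mb_per_s((size_t)8 << 20, 32));
}
```

Under the single-channel assumption the peaks are 1600 MB/s for DDR-200 and 17064 MB/s for DDR4-2133, so the measured 400 MB/s and 3.6 GB/s are about a quarter and a fifth of peak. Part of that gap is inherent: memcpy both reads and writes every byte, so the memory traffic is about twice the number of bytes copied.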

The bulk of time in CRAM decoding is spent in logic like:
  tab[0]=colorA;
  tab[1]=colorB;
  px0=tab[(pix>>0)&1]; px1=tab[(pix>>1)&1];
  px2=tab[(pix>>2)&1]; px3=tab[(pix>>3)&1];
  ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
  ct+=stride;
  px0=tab[(pix>>4)&1]; px1=tab[(pix>>5)&1];
  px2=tab[(pix>>6)&1]; px3=tab[(pix>>7)&1];
  ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
  ct+=stride;
  px0=tab[(pix>> 8)&1]; px1=tab[(pix>> 9)&1];
  px2=tab[(pix>>10)&1]; px3=tab[(pix>>11)&1];
  ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
  ct+=stride;
  px0=tab[(pix>>12)&1]; px1=tab[(pix>>13)&1];
  px2=tab[(pix>>14)&1]; px3=tab[(pix>>15)&1];
  ct[0]=px0; ct[1]=px1; ct[2]=px2; ct[3]=px3;
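
The fragment above leaves out its declarations. A self-contained sketch of the same idea follows: a 16-bit mask picks one of two colours for each pixel of a 4x4 block. It assumes 16-bit pixels and uses a row loop in place of the hand-unrolled rows. The function name and parameter types are assumptions, not taken from the post.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one 4x4 two-colour block.
   ct:     top-left destination pixel of the block
   stride: distance between rows, in pixels (must be at least 4)
   pix:    selector mask; bit (4*row + col) picks colorB when set,
           colorA when clear */
void decode_block_2color(uint16_t *ct, int stride,
                         uint16_t pix, uint16_t colorA, uint16_t colorB)
{
    uint16_t tab[2];

    assert(stride >= 4);
    tab[0] = colorA;
    tab[1] = colorB;

    for (int row = 0; row < 4; row++) {
        /* The table lookup replaces a branch per pixel. */
        ct[0] = tab[(pix >> 0) & 1];
        ct[1] = tab[(pix >> 1) & 1];
        ct[2] = tab[(pix >> 2) & 1];
        ct[3] = tab[(pix >> 3) & 1];
        pix >>= 4;     /* next row's four selector bits */
        ct += stride;  /* next row of the output image  */
    }
}
```

Each block consumes a few bytes of input and writes 16 pixels with no data-dependent branches, which fits the observation that CRAM decode in Mpix/sec can exceed memcpy in MB/s.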

- anton

From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Sep 21 16:20:00 2025

BGB <cr88192@gmail.com> wrote:

On 9/20/2025 8:10 AM, Waldek Hebisch wrote:

BGB <cr88192@gmail.com> wrote:

On 9/19/2025 9:33 AM, Anton Ertl wrote:

BGB <cr88192@gmail.com> writes:

Like, most of the ARM chips don't exactly have like a 150W TDP or similar...

And most Intel and AMD chips don't have 150W TDP, either, although the
shenanigans they play with TDP are not nice. The usual TDP for
Desktop chips is 65W (with the power limits temporarily or permanently
higher). The Zen5 laptop chips (Strix Point, Krackan Point) have a
configurable TDP of 15-54W. Lunar Lake (4 P-cores, 4 LP-E-cores) has
a configurable TDP of 8-37W.

Seems so...
Seems the CPU I am running has a 105W TDP, I had thought I remembered
150W, oh well...

Seems 150-200W is more Threadripper territory, and not the generic
desktop CPUs.

Like, if an ARM chip uses 1/30th the power, unless it is more than 30x
slower, it may still win in Perf/W and similar...

No TDP numbers are given for Oryon. For Apple's M4, the numbers are

M4 4P 6E 22W
M4 Pro 8P 4E 38W
M4 Pro 10P 4E 46W
M4 Max 10P 4E 62W
M4 Max 12P 4E 70W

Not quite 1/30th of the power, although I think that Apple does not
play the same shenanigans as Intel and AMD.

A lot of the ARM SoCs I had seen had lower TDPs, though more often with
Cortex A53 or A55/A78 cores or similar:

Say (MediaTek MT6752):
https://unite4buy.com/cpu/MediaTek-MT6752/
Has a claimed TDP here of 7W and has 8x A53.

Or, for a slightly newer chip (2020):
https://www.cpu-monkey.com/en/cpu-mediatek_mt8188j

TDP 5W, has A55 and A78 cores.

Some amount of the HiSilicon numbers look similar...

But, yeah, I guess if using these as data-points:
A55: ~ 5/8W, or ~ 0.625W (very crude)
Zen+: ~ 105/16W, ~ 6.56W

So, more like 10x here, but ...

Then, I guess it becomes a question of the relative performance
difference, say, between a 2.0 GHz A55 vs a 3.7 GHz Zen+ core...

Judging based on my cellphone (with A53 cores), and previously running
my emulator in Termux, there is a performance difference, but nowhere
near 10x.
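
The back-of-envelope numbers above can be written out as a short sketch. The function names are hypothetical, and dividing package TDP evenly over the cores is the same crude assumption the post makes.

```c
#include <assert.h>

/* Crude per-core power: package TDP divided evenly over the cores.
 * This ignores uncore, memory controller and GPU power (an assumption). */
double watts_per_core(double tdp_watts, int cores)
{
    assert(cores > 0);
    return tdp_watts / cores;
}

/* Speedup the big core needs over the small core to merely tie on perf/W. */
double break_even_speedup(double big_watts_per_core, double small_watts_per_core)
{
    assert(small_watts_per_core > 0.0);
    return big_watts_per_core / small_watts_per_core;
}
```

With the figures quoted above, `watts_per_core(5.0, 8)` gives 0.625 and `watts_per_core(105.0, 16)` gives 6.5625, so the break-even speedup is 10.5: the big core has to be more than about 10x faster per core before it wins on this metric.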

Single core in Orange Pi Zero 3 (Allwinner H618 at about 1.2 GHz) benchmarks
to 4453.45 DMIPS (Dhrystone MIPS). Single core in my desktop benchmarks to
about 50000 DMIPS. Dhrystone contains string operations which benefit
from SSE/AVX, but I would expect that on media loads the speed ratio would
be even more favourable to the desktop core. On jumpy code the ratio is probably
lower. 1GHz RISCV in Milkv-Duo benchmarks to 1472 DMIPS.

It is hard to compare performance per watt: Orange Pi Zero 3 has low
power draw (of order 100 mA from 5V USB charger with one core active) and
it is not clear how it is distributed between the CPUs and the Ethernet interface.
RISCV in Milkv-Duo has even lower power draw. OTOH desktop cores
normally seem to run at a fraction of rated power too (but I have
no way to directly measure CPU power draw).

Of course, there is a catch: desktop CPU is made on more advanced
process than small processors. So it is hard to separate effects
from architecture and from the process.

I had noted before that when I compiled Dhrystone on my Ryzen using
MSVC, it is around 10M, or 5691 DMIPS, or around 1.53 DMIPS/MHz.

Curiously, the score is around 4x higher (around 40M) if Dhrystone is compiled with GCC (and around 2.5x with Clang).

For most other things, the performance scores seem closer.

I don't really trust GCC's and Clang's Dhrystone scores as they seem basically out-of-line with most other things I can measure.

I would not totally dismiss Dhrystone scores. Apparently Dhrystone
allows more optimizations than other programs. There may be bias,
because GCC and Clang developers select optimizations to improve
benchmark scores. But AFAICS the compiled code performs the work it should
do. And the work corresponds to a typical work mix from the past.
More important, optimizations in GCC are mostly independent of
architecture, so essentially the same optimizations are applied
on all machines.

BTW: I get similar Dhrystone results from GCC and Clang (differences of
a few percent or less).

Concerning other loads, my current desktop (12 cores) builds a medium size program about 8.5 times faster than a 4 core Core 2 from 2008. There
is a non-negligible serial part in the build, so a single modern core is about
3 times faster than a single core in the Core 2. I do not have comparable
results for 64-bit Orange Pi, but on slow machines I see build times
that are 40 times longer. A big part is the number of cores; hyperthreading
helps too (real time using 20 jobs is significantly smaller than real
time using 12 jobs). But clearly a single big core is significantly
faster than smaller cores.

Part of advantage of big core is due to big caches, my understanding
is that smaller processors that I use have much smaller caches.

Noting my BJX2 core seems to perform at 90K at 50MHz, or 1.02 DMIPS/MHz.
If assuming MSVC as the reference, this would imply (after normalizing
for clock-speeds) that the Ryzen only gets around 50% more IPC.
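
The conversions used in this comparison can be sketched as follows. By convention 1 DMIPS corresponds to 1757 Dhrystones per second (the VAX 11/780 score); the function names are hypothetical, and the 3700 MHz Ryzen clock is an assumption taken from the 3.7 GHz figure mentioned earlier in the thread.

```c
#include <assert.h>

/* 1 DMIPS is defined as 1757 Dhrystones/s, the VAX 11/780 reference score. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Clock-normalised score, a rough stand-in for per-clock throughput. */
double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return dmips(dhrystones_per_sec) / clock_mhz;
}
```

With 10M Dhrystones/s at an assumed 3700 MHz this gives roughly 1.54 DMIPS/MHz, and with 90K at 50 MHz roughly 1.02 DMIPS/MHz, a ratio of about 1.5. That is where the "around 50% more" figure comes from, and it holds only for the MSVC-compiled score.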

I noted when compiling my BJX2 emulator:
My Ryzen can emulate it at roughly 70MHz;
My cell-phone can manage it at roughly 30MHz.

This isn't *that* much larger than the difference in CPU clock speeds.

It is like, I seemingly live in a world where a lot of my own benchmark attempts tend to be largely correlated with the relative difference in
clock speeds and similar.

Well, clock speed is a major factor for power efficiency. Running a CPU
at a lower clock frequency significantly lowers energy per instruction.
And the mere capability to run at a high clock frequency causes increased
power use at lower clock frequencies (IIUC high frequency may need
bigger transistors and/or more transistors).

Well, except for my old laptop (from 2003), and an ASUS Eee, which seem
to perform somewhat below that curve.

Though, in the case of the laptop, it may be a case of not getting all
that much memory bandwidth from a 100MHz DDR1 SO-DIMM (a lot of the performance on some tests seems highly correlated with "memcpy()"
speeds, and on that laptop, its memcpy speeds are kinda crap if compared with CPU clock-speed).

Well, and the Eee has, IIRC, an Intel Atom N270 down-clocked to 630 MHz.
Thing ran Quake and Quake 2 pretty OK, but not much else.

Though, if running my emulator on the laptop, it is more back on the curve of relative clock-speed, rather than on the
relative-memory-bandwidth curve.

It seems both my neural-net stuff and most of my data compression stuff, more follow the memory bandwidth curve (though, for the laptop, it seems
NN stuff can get a big boost here by using BFloat16 and getting a little clever with the repacking).

Well, and then my BJX2 core seems to punch slightly outside its weight
class (MHz wise) by having disproportionately high memory bandwidth.

...

--
Waldek Hebisch
--- Synchronet 3.21a-Linux NewsLink 1.2

From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Mon Sep 22 03:21:17 2025

From Newsgroup: comp.arch

In article <bp4jck19kcmq4i571fiofcrk1k6nn9k0ha@4ax.com>,
George Neuner <gneuner2@comcast.net> wrote:

On Tue, 16 Sep 2025 00:03:51 -0000 (UTC), John Savard
<quadibloc@invalid.invalid> wrote:

On Mon, 15 Sep 2025 23:54:12 +0000, John Savard wrote:

Although it's called "inverse hyperthreading", this technique could be
combined with SMT - put the chunks into different threads on the same
core, rather than on different cores, and then one wouldn't need to add
extra connections between cores to make it work.

On further reflection, this may be equivalent to re-inventing out-of-order
execution.

John Savard

Sounds more like dynamic micro-threading.

Over the years I've seen a handful of papers about compile time
micro-threading: that is the compiler itself identifies separable
dependency chains in serial code and rewrites them into deliberate
threaded code to be executed simultaneously.

It is not easy to do under the best of circumstances and I've never
seen anything about doing it dynamically at run time.

To make a thread worth rehosting to another core, it would need to be
(at least) many 10s of instructions in length. To figure this out
dynamically at run time, it seems like you'd need the decode window to
be 1000s of instructions and a LOT of "figure-it-out" circuitry.

MMV, but to me it doesn't seem worth the effort.

I began reading the patent, and it's not clear to me this approach is
going to be much of an improvement. A great deal of analysis magic has
to happen to find code to spread across the cores. To summarize, it's basically taking code that looks like:

for(i = 0; i < N; i++) {
// Do some work
}

for(i = 0; i < M; i++) {
// Do some different work
}

and have two cores run the loops at the same time, with some special
check hardware to make sure they really are independent (I gave up before
really figuring out what they're going to do; patents are not fun to read).
I think they actually want to divide up each loop into sections, and do
them in parallel. If someone wanted to explain in better detail what
they are doing, I'd like to read that short summary in non-patentese.
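
For reference, the textbook condition for running two loops side by side (Bernstein's conditions) is that neither loop writes anything the other reads or writes. The sketch below is a software analogue under a strong simplifying assumption, namely that each loop reads one contiguous address range and writes one contiguous address range. It is not a description of what the patent's check hardware does, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [lo, hi). */
typedef struct { uintptr_t lo, hi; } range;

/* Assumed summary of a loop: one range it reads, one range it writes. */
typedef struct { range reads, writes; } loop_footprint;

bool ranges_overlap(range a, range b)
{
    assert(a.lo <= a.hi && b.lo <= b.hi);
    return a.lo < b.hi && b.lo < a.hi;
}

/* Bernstein's conditions: no write/write, write/read or read/write overlap. */
bool loops_independent(loop_footprint x, loop_footprint y)
{
    return !ranges_overlap(x.writes, y.writes)
        && !ranges_overlap(x.writes, y.reads)
        && !ranges_overlap(x.reads,  y.writes);
}
```

Two loops that only share read-only input pass the test; a loop that reads what the other one writes does not. The check is conservative, so a "no" answer only means the simple range summary could not prove independence.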

A trivial alternative approach to shrinking core size while not losing
single thread speed is to basically make all cores Narrow (meaning
support something like 4 instructions wide), and when code needs more,
stall the neighboring core and steal its functional units to form a new
8-wide core. This approaches the SMT hardware sharing from a different direction, and so code without much instruction parallelism will run
better on two smaller cores than on a big core with two threads, but if
a single thread can use 8-wide instruction execution, it can steal it from
the neighboring core for a while.

If that's too much trouble, then for x86, all cores have just AVX-256 width, and take two clocks to do each AVX-512 operation (which is still better than just AVX-256). But hardware can join the neighboring cores together to be AVX-512, with each AVX-512 op taking just one clock now (and this can just
be AVX, the other core can run other instructions unimpeded).

Kent
--- Synchronet 3.21a-Linux NewsLink 1.2

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Sep 22 11:28:13 2025

From Newsgroup: comp.arch

But, AFAIK the ARM cores tend to use significantly less power when
emulating x86 than a typical Intel or AMD CPU, even if slower.

AFAIK datacenters still use a lot of x86 CPUs, even though most of them
run software that's just as easily available for ARM. And many
datacenters care more about "perf per watt" than raw performance.

So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.

Stefan
--- Synchronet 3.21a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Mon Sep 22 20:28:33 2025

From Newsgroup: comp.arch

On 22/09/2025 17:28, Stefan Monnier wrote:

But, AFAIK the ARM cores tend to use significantly less power when
emulating x86 than a typical Intel or AMD CPU, even if slower.

AFAIK datacenters still use a lot of x86 CPUs, even though most of them
run software that's just as easily available for ARM. And many
datacenters care more about "perf per watt" than raw performance.

So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.

Yes, I think that is correct.

A lot of it, as far as I have read, comes down to the type of
calculation you are doing. ARM cores can often be a lot more efficient
at general integer work and other common actions, as a result of a
better designed instruction set and register set. But once you are
using slightly more specific hardware features - vector processing,
floating point, acceleration for cryptography, etc., it's all much the
same. It takes roughly the same energy to do these things regardless of
the instruction set. Cache memory takes about the same power, as do PCI interfaces, memory interfaces, and everything else that takes up power
on a chip.

So when you have a relatively small device - such as what you need for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than x86. (If you go
smaller - small embedded systems - x86 is totally non-existent because
an x86 microcontroller would be an order of magnitude bigger, more
expensive and power-consuming than an ARM core.) But when you have big processors for servers, and are using a significant fraction of the processor's computing power, the details of the core matter a lot less.

--- Synchronet 3.21a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Sep 22 19:36:05 2025

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> posted:

On 22/09/2025 17:28, Stefan Monnier wrote:

But, AFAIK the ARM cores tend to use significantly less power when
emulating x86 than a typical Intel or AMD CPU, even if slower.

AFAIK datacenters still use a lot of x86 CPUs, even though most of them
run software that's just as easily available for ARM. And many
datacenters care more about "perf per watt" than raw performance.

So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.

Yes, I think that is correct.

A lot of it, as far as I have read, comes down to the type of
calculation you are doing. ARM cores can often be a lot more efficient
at general integer work and other common actions, as a result of a
better designed instruction set and register set. But once you are
using slightly more specific hardware features - vector processing,
floating point, acceleration for cryptography, etc., it's all much the
same. It takes roughly the same energy to do these things regardless of
the instruction set. Cache memory takes about the same power, as do PCI interfaces, memory interfaces, and everything else that takes up power
on a chip.

So when you have a relatively small device - such as what you need for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than x86. (If you go smaller - small embedded systems - x86 is totally non-existent because
an x86 microcontroller would be an order of magnitude bigger, more
expensive and power-consuming than an ARM core.) But when you have big processors for servers, and are using a significant fraction of the processor's computing power, the details of the core matter a lot less.

Big servers have rather equal power in the peripherals {DISKs, SSDs, and
NICs} and DRAM {plus power supplies and cooling} than in the cores.
--- Synchronet 3.21a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Tue Sep 23 08:24:54 2025

From Newsgroup: comp.arch

On 22/09/2025 21:36, MitchAlsup wrote:

David Brown <david.brown@hesbynett.no> posted:

On 22/09/2025 17:28, Stefan Monnier wrote:

But, AFAIK the ARM cores tend to use significantly less power when
emulating x86 than a typical Intel or AMD CPU, even if slower.

AFAIK datacenters still use a lot of x86 CPUs, even though most of them
run software that's just as easily available for ARM. And many
datacenters care more about "perf per watt" than raw performance.

So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.

Yes, I think that is correct.

A lot of it, as far as I have read, comes down to the type of
calculation you are doing. ARM cores can often be a lot more efficient
at general integer work and other common actions, as a result of a
better designed instruction set and register set. But once you are
using slightly more specific hardware features - vector processing,
floating point, acceleration for cryptography, etc., it's all much the
same. It takes roughly the same energy to do these things regardless of
the instruction set. Cache memory takes about the same power, as do PCI
interfaces, memory interfaces, and everything else that takes up power
on a chip.

So when you have a relatively small device - such as what you need for a
mobile phone - the instruction set and architecture makes a significant
difference and ARM is a lot more power-efficient than x86. (If you go
smaller - small embedded systems - x86 is totally non-existent because
an x86 microcontroller would be an order of magnitude bigger, more
expensive and power-consuming than an ARM core.) But when you have big
processors for servers, and are using a significant fraction of the
processor's computing power, the details of the core matter a lot less.

Big servers have rather equal power in the peripherals {DISKs, SSDs, and NICs} and DRAM {plus power supplies and cooling} than in the cores.

Yes, all that will be independent of the type of cpu core.

--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 21:08:10 2025

From Newsgroup: comp.arch

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

David Brown <david.brown@hesbynett.no> posted:

On 22/09/2025 17:28, Stefan Monnier wrote:

But, AFAIK the ARM cores tend to use significantly less power
when emulating x86 than a typical Intel or AMD CPU, even if
slower.

AFAIK datacenters still use a lot of x86 CPUs, even though most
of them run software that's just as easily available for ARM.
And many datacenters care more about "perf per watt" than raw performance.

So, I think the difference in power consumption does not favor ARM
nearly as significantly as you think.

Yes, I think that is correct.

A lot of it, as far as I have read, comes down to the type of
calculation you are doing. ARM cores can often be a lot more
efficient at general integer work and other common actions, as a
result of a better designed instruction set and register set. But
once you are using slightly more specific hardware features -
vector processing, floating point, acceleration for cryptography,
etc., it's all much the same. It takes roughly the same energy to
do these things regardless of the instruction set. Cache memory
takes about the same power, as do PCI interfaces, memory
interfaces, and everything else that takes up power on a chip.

So when you have a relatively small device - such as what you need
for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than
x86. (If you go smaller - small embedded systems - x86 is totally non-existent because an x86 microcontroller would be an order of
magnitude bigger, more expensive and power-consuming than an ARM
core.) But when you have big processors for servers, and are using
a significant fraction of the processor's computing power, the
details of the core matter a lot less.

Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.

Still, CPU power often matters.
Spec.org has special benchmark for that called SPECpower_ssj 2008.
It is old and java-oriented but I don't think that it is useless.

Right now the benchmark clearly shows that AMD offerings dominate
Intel's.
The best AMD score is 44168 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2025q2/power_ssj2008-20250407-01522.html

The best Intel scores are 25526 ssj_ops/watt (Sierra Forest) and 25374 ssj_ops/watt (Granite Rapids). Both lag behind ~100 AMD scores.
They barely beat some old EPYC3 scores from 2021.
https://www.spec.org/power_ssj2008/results/res2025q3/power_ssj2008-20250811-01533.html
https://www.spec.org/power_ssj2008/results/res2025q1/power_ssj2008-20250310-01505.html

There are very few non-x86 submissions. The only one that I found in
the last 5 years was using Nvidia Grace CPU Superchip based on Arm Inc.
Neoverse V2 cores. It scored 13218 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2024q3/power_ssj2008-20240515-01413.html

--- Synchronet 3.21a-Linux NewsLink 1.2

From George Neuner@gneuner2@comcast.net to comp.arch on Wed Sep 24 15:56:37 2025

From Newsgroup: comp.arch

On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.

Still, CPU power often matters.

Yes ... and no.

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them. At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).

--- Synchronet 3.21a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Sep 24 20:00:07 2025

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> posted:

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

David Brown <david.brown@hesbynett.no> posted:

On 22/09/2025 17:28, Stefan Monnier wrote:

But, AFAIK the ARM cores tend to use significantly less power
when emulating x86 than a typical Intel or AMD CPU, even if
slower.

AFAIK datacenters still use a lot of x86 CPUs, even though most
of them run software that's just as easily available for ARM.
And many datacenters care more about "perf per watt" than raw performance.

So, I think the difference in power consumption does not favor ARM nearly as significantly as you think.

Yes, I think that is correct.

A lot of it, as far as I have read, comes down to the type of calculation you are doing. ARM cores can often be a lot more
efficient at general integer work and other common actions, as a
result of a better designed instruction set and register set. But
once you are using slightly more specific hardware features -
vector processing, floating point, acceleration for cryptography,
etc., it's all much the same. It takes roughly the same energy to
do these things regardless of the instruction set. Cache memory
takes about the same power, as do PCI interfaces, memory
interfaces, and everything else that takes up power on a chip.

So when you have a relatively small device - such as what you need
for a mobile phone - the instruction set and architecture makes a significant difference and ARM is a lot more power-efficient than
x86. (If you go smaller - small embedded systems - x86 is totally non-existent because an x86 microcontroller would be an order of magnitude bigger, more expensive and power-consuming than an ARM
core.) But when you have big processors for servers, and are using
a significant fraction of the processor's computing power, the
details of the core matter a lot less.

Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.

Still, CPU power often matters.
Spec.org has special benchmark for that called SPECpower_ssj 2008.
It is old and java-oriented but I don't think that it is useless.

Right now the benchmark clearly shows that AMD offerings dominate
Intel's.
The best AMD score is 44168 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2025q2/power_ssj2008-20250407-01522.html

The best Intel scores are 25526 ssj_ops/watt (Sierra Forest) and 25374 ssj_ops/watt (Granite Rapids). Both lag behind ~100 AMD scores.
They barely beat some old EPYC3 scores from 2021.
https://www.spec.org/power_ssj2008/results/res2025q3/power_ssj2008-20250811-01533.html
https://www.spec.org/power_ssj2008/results/res2025q1/power_ssj2008-20250310-01505.html

There are very few non-x86 submissions. The only one that I found in
the last 5 years was using Nvidia Grace CPU Superchip based on Arm Inc.
Neoverse V2 cores. It scored 13218 ssj_ops/watt https://www.spec.org/power_ssj2008/results/res2024q3/power_ssj2008-20240515-01413.html

A quick survey of the result database indicates only Oracle is
sending results to the data base.

Would be interesting to see the Apple/ARM comparisons.
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 23:37:17 2025

From Newsgroup: comp.arch

On Wed, 24 Sep 2025 20:00:07 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Michael S <already5chosen@yahoo.com> posted:

A quick survey of the result database indicates only Oracle is
sending results to the data base.

You misread it.
The organization that submits a result is listed as "Test Sponsor".
Oracle is the sponsor of none of the results that I listed in my previous
post.
The sponsors are ASUSTeK Computer Inc, New H3C Technologies Co, Lenovo
Global Technology and Infobell IT Solutions Pvt.

The most recent submissions are by Dell and Lenovo. https://www.spec.org/power_ssj2008/results/res2025q3/

Would be interesting to see the Apple/ARM comparisons.

Would be very interesting, but not going to happen.
Last time Apple submitted something to spec.org was almost 20 years ago.
And it never submitted to Spec Power SSJ, which sort of makes sense -
this is a benchmark designed for servers and Apple does not sell servers.

The ARM architecture vendor with the highest number of submissions to
spec.org is Ampere, but they abandoned Arm-designed cores a couple of
years ago and are now shipping Arm architecture CPUs with cores of their
own design.
However, there are a few results in the database that use their previous
offerings based on Arm Neoverse-N1 cores. Here is the best result:
https://www.spec.org/power_ssj2008/results/res2024q1/power_ssj2008-20231104-01332.html

--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Wed Sep 24 23:48:50 2025

From Newsgroup: comp.arch

On Wed, 24 Sep 2025 15:56:37 -0400
George Neuner <gneuner2@comcast.net> wrote:

On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Big servers have rather equal power in the peripherals {DISKs,
SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
the cores.

Still, CPU power often matters.

Yes ... and no.

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them.

I think that it's less than 80%. But it does not matter and does not
change anything - power spent for cooling is approximately
proportional to power spent for running.

At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).

I don't think that you have scientific study to support your claims.

That's before I state the obvious - even if you were correct about
main RAM consuming more power than CPU (which I doubt very much), still different CPUs can perform the same job with very different numbers of
main RAM accesses.

--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Sep 24 21:04:03 2025

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> writes:

On Wed, 24 Sep 2025 15:56:37 -0400
George Neuner <gneuner2@comcast.net> wrote:

On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Big servers have rather equal power in the peripherals {DISKs,
SSDs, and NICs} and DRAM {plus power supplies and cooling} than in
the cores.

Still, CPU power often matters.

Yes ... and no.

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them.

I think that it's less than 80%. But it does not matter and does not
change anything - power spent for cooling is approximately
proportional to power spent for running.

At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).

A typical 16GB DIMM module will dissipate 3-5 watts. So 128GB will
draw in the vicinity of 32 watts. The TDP for a high-end
Xeon may exceed 350 watts, Diamond Rapids may exceed 500 watts.
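
The estimate above can be spelled out as a small sketch. The function name is hypothetical, and it assumes the memory is built entirely from identical DIMMs that each dissipate the same power.

```c
#include <assert.h>

/* Total DRAM power, assuming identical DIMMs with equal dissipation. */
double ram_watts(int total_gb, int gb_per_dimm, double watts_per_dimm)
{
    assert(gb_per_dimm > 0 && total_gb % gb_per_dimm == 0);
    return (total_gb / gb_per_dimm) * watts_per_dimm;
}
```

128GB built from 16GB DIMMs is 8 modules, so 3-5 W each gives 24-40 W, with 32 W at the 4 W midpoint. Against a 350 W CPU TDP that is roughly 9% of the CPU's budget.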
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 25 00:21:02 2025

From Newsgroup: comp.arch

On Wed, 24 Sep 2025 21:04:03 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

<...>

Scott,
When you answer George Neuner's point, can you, please, reply to George Neuner's post rather than to mine?

--- Synchronet 3.21a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Sep 24 21:27:09 2025

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) writes:

Michael S <already5chosen@yahoo.com> writes:

On Wed, 24 Sep 2025 15:56:37 -0400
George Neuner <gneuner2@comcast.net> wrote:

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them.

In the old times I heard that they used about as much power for
cooling as goes into the machines. In recent times, I have heard
about success stories where they use less. <https://en.wikipedia.org/wiki/Coefficient_of_performance> says: "Most
air conditioners have a COP of 3.5 to 5", i.e., quite a bit less
energy is expended on cooling than is moved away.
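
As a rough check of what those COP figures imply, here is a minimal sketch. It assumes that all power drawn by the machines ends up as heat the air conditioner must move, and it ignores fans, pumps and distribution losses; the function names are hypothetical.

```c
#include <assert.h>

/* COP = heat moved / electrical work spent moving it. */
double cooling_watts(double heat_watts, double cop)
{
    assert(cop > 0.0);
    return heat_watts / cop;
}

/* Cooling as a fraction of (machine power + cooling power). */
double cooling_share(double cop)
{
    assert(cop > 0.0);
    return 1.0 / (1.0 + cop);
}
```

A COP of 3.5 to 5 puts cooling at about 22% down to 17% of the combined power under these assumptions. For cooling to reach 80% of the total, the COP would have to be 0.25, far below what the article quotes for ordinary air conditioners.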

At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).

Where do you get this from?

A typical 16GB DIMM module will dissipate 3-5 watts. So 128GB will
draw in the vicinity of 32 watts.

We have several machines with 128GB RAM. They idle at around 40W, and
a box with less RAM and otherwise the same hardware does not idle at
much lower power consumption. The RAM has no active cooler, no
passive cooler, and the modules sit close to each other, so it cannot dissipate
lots of power, certainly not 32W.

By contrast, the CPUs on these machines have elaborate active cooling solutions, and consume 105W TDP (142W power limit).

SSDs are also unlikely to be consuming a lot of power, given the kind
of cooling that they get. Yes, there are elaborate coolers for
M.2-format SSDs, but this is not the kind of format that the bigger
servers use (which rather use U.2 or U.3 SSDs), and even with M.2,
there is usually no need to use SSD cooling.

Maybe if you have a huge number of SSDs, power consumption may rival
that of the CPU.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
--- Synchronet 3.21a-Linux NewsLink 1.2

From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Sep 24 18:38:06 2025

From Newsgroup: comp.arch

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them.

Is it really *that* inefficient? Sounds even more horrible than what
I'd expect. Do you have some reference?

At the same time, most of the heat generated by typical systems is due
to the RAM - not the CPU(s).

Even if we consider "CPUs" their power consumption can go much further
than just that of the cores. I remember reading about Threadripper
spending about half its power in its interconnect.
Still, I suspect you need a lot of RAM before it starts consuming more
power than your CPUs (at least the kind of RAM you find in gaming
desktops consume significantly less than the CPU, last I checked), so it
likely depends on the workloads that are targeted.

Stefan
--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 14:23:04 2025

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> writes:

On Wed, 24 Sep 2025 21:04:03 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

<...>

Scott,
When you answer George Neuner's point, can you, please, reply to George
Neuner's post rather than to mine?

The attributions are there, as are the appropriate indentation markers ('>').

Once I've read an article and restarted my newsreader, I don't have access
to read articles (at least not easily).
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Thu Sep 25 17:49:13 2025

From Newsgroup: comp.arch

On Thu, 25 Sep 2025 14:23:04 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Once I've read an article and restarted my newsreader, I don't have
access to read articles (at least not easily).

Does not it suck?

--- Synchronet 3.21a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (M. Anton Ertl) to comp.arch on Thu Sep 25 15:28:56 2025

From Newsgroup: comp.arch

scott@slp53.sl.home (Scott Lurndal) writes:

Once I've read an article and restarted my newsreader, I don't have access
to read articles (at least not easily).

I press the "Goto parent" button, and I think that already existed in
xrn-9.03, which you use; maybe you need to configure it, or use the
shortcut if one exists. The only problem is that if the parent is
read, but an ancestor article is unread, it will skip the parent and
go to that ancestor. If I ever find the time, I will fix that and
send a patch to Jonathan Kamens.

- anton
--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 15:37:49 2025

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> writes:

On Thu, 25 Sep 2025 14:23:04 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

Once I've read an article and restarted my newsreader, I don't have
access to read articles (at least not easily).

Does not it suck?

Not really. I've been using the same client since 1989; I'm used to it.

--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Sep 25 15:41:30 2025

From Newsgroup: comp.arch

anton@mips.complang.tuwien.ac.at (M. Anton Ertl) writes:

scott@slp53.sl.home (Scott Lurndal) writes:

Once I've read an article and restarted my newsreader, I don't have access
to read articles (at least not easily).

I press the "Goto parent" button, and I think that already existed in
xrn-9.03,

yes, it has always existed, and yes, I can use it, but it is quite
slow over NNTP. As the quoting is always accurate,
I generally don't feel it is necessary in the case that Michael
complained about.

I can also hand-edit ~/.newsrc to see older articles, but seldom
have the need.

--- Synchronet 3.21a-Linux NewsLink 1.2

From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Sep 25 23:16:00 2025

From Newsgroup: comp.arch

George Neuner wrote:

On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.

Still, CPU power often matters.

Yes ... and no.

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them. At the same time, most of the heat
generated by typical systems is due to the RAM - not the CPU(s).

I am quite sure that number is simply bogus: The power factors we were
quoted when building the largest new datacenter in Norway 10+ years ago,
was more like 6-10% of total power for cooling afair.

.. a quick google...

https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

This one claims a 1.07 Power Usage Effectiveness.
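
For reference, PUE is total facility power divided by IT equipment power, so the share of total power that does not reach the IT load is 1 - 1/PUE. A minimal sketch of that arithmetic follows; the function names are mine, and it lumps cooling together with all other overhead, so it gives an upper bound on the cooling share rather than the cooling share itself.

```python
def overhead_fraction(pue: float) -> float:
    """Fraction of total facility power that does not reach the IT equipment."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 - 1.0 / pue


def pue_from_overhead_fraction(fraction: float) -> float:
    """PUE implied by a given non-IT share of total facility power."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # The 1.07 figure from the linked article.
    print(f"PUE 1.07 -> {overhead_fraction(1.07):.1%} of total is overhead")
    # The "80+% for cooling" claim, read as 80% overhead.
    print(f"80% overhead -> PUE {pue_from_overhead_fraction(0.80):.1f}")
```

A PUE of 1.07 works out to about 6.5% overhead, which is in line with the 6-10% quoted above, while an 80% overhead would need a PUE of about 5.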

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
--- Synchronet 3.21a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 25 23:48:19 2025

From Newsgroup: comp.arch

Terje Mathisen <terje.mathisen@tmsw.no> posted:

George Neuner wrote:

On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Big servers have rather equal power in the peripherals {DISKs, SSDs,
and NICs} and DRAM {plus power supplies and cooling} than in the
cores.

Still, CPU power often matters.

Yes ... and no.

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).

I am quite sure that number is simply bogus: The power factors we were quoted when building the largest new datacenter in Norway 10+ years ago,
was more like 6-10% of total power for cooling afair.

. a quick google...

https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

This one claims a 1.07 Power Usage Effectiveness.

All of this depends on where the "cold sink" is !! and how cold it is.

Pumping 6ºC sea water through water to air heat exchangers is a lot
more power efficient than using FREON and dumping the heat into 37ºC
air.

I still suspect that rectifying and delivering clean (low noise) D/C
to the chassis takes a lot more energy than taking the resulting heat
away.

Flash will have low heat signature
DRAM will have significant heat signature
DISKs will have significant heat signature
GPUs will have significant heat signature
CPUs will have significant heat signature
Motherboard has low-medium heat signature

Terje

--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 02:03:21 2025

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Terje Mathisen <terje.mathisen@tmsw.no> posted:

I am quite sure that number is simply bogus: The power factors we were
quoted when building the largest new datacenter in Norway 10+ years ago,
was more like 6-10% of total power for cooling afair.

. a quick google...

https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

This one claims a 1.07 Power Usage Effectiveness.

All of this depends on where the "cold sink" is !! and how cold it is.

Pumping 6ºC sea water through water to air heat exchangers is a lot
more power efficient than using FREON and dumping the heat into 37ºC
air.

I still suspect that rectifying and delivering clean (low noise) D/C
to the chassis takes a lot more energy than taking the resulting heat
away.

The FB article above describes how they reduced the
losses due to voltage changes as well as rectification.

Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.
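
To put the unoptimized chain next to the 7.5% figure: losses in series stages compound multiplicatively rather than adding. A rough sketch, using only the percentages above (the function name is mine):

```python
from math import prod


def total_loss(stage_losses):
    """Combined fractional loss of conversion stages connected in series."""
    return 1.0 - prod(1.0 - loss for loss in stage_losses)


if __name__ == "__main__":
    # primary -> 480v (2%), 480v -> 208v (3%), rectification (6% to 12%)
    best = total_loss([0.02, 0.03, 0.06])
    worst = total_loss([0.02, 0.03, 0.12])
    print(f"conventional chain: {best:.1%} to {worst:.1%} lost")
    print("optimized chain per the article: 7.5% lost")
```

That gives roughly 10.6% to 16.3% for the conventional chain.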

--- Synchronet 3.21a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Thu Sep 25 23:30:27 2025

From Newsgroup: comp.arch

On 9/25/2025 9:03 PM, Scott Lurndal wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Terje Mathisen <terje.mathisen@tmsw.no> posted:

I am quite sure that number is simply bogus: The power factors we were
quoted when building the largest new datacenter in Norway 10+ years ago,
was more like 6-10% of total power for cooling afair.

. a quick google...

https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

This one claims a 1.07 Power Usage Effectiveness.

All of this depends on where the "cold sink" is !! and how cold it is.

Pumping 6ºC sea water through water to air heat exchangers is a lot
more power efficient than using FREON and dumping the heat into 37ºC
air.

I still suspect that rectifying and delivering clean (low noise) D/C
to the chassis takes a lot more energy than taking the resulting heat
away.

The FB article above describes how they reduced the
losses due to voltage changes as well as rectification.

Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.

Hmm...

Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors, but
still limiting electrical losses due to wire resistance, while still
avoiding losses due to transformers and rectifiers.

To balance cost and efficiency, could use, say, 8 or 10AWG CCA (copper
clad aluminum) vs 10 or 12AWG copper. Could run the wires at a
relatively lower amperage rating, say:
8A over 10AWG CCA
16A over 8AWG CCA
Or, roughly 1/3 nominal.

Where, CCA wire is a lot cheaper than copper wire, so it is easier to
justify using absurdly thick wire here.

Where, contrast say to running 8A over 20AWG, which works, but a fair
bit more is lost due to heat. Or, the alternative could be to run the
power over parallel thinner wires rather than a single thicker wire. For example, replacing each 10AWG wire with four 14AWG wires.
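
As a rough check on the wire comparison, here is a sketch of the I^2*R loss per metre of conductor. It uses the standard AWG diameter formula and textbook resistivities for pure copper and aluminum; I don't have an exact resistivity for CCA, so the assumption is only that it falls somewhere between the two. All function names are made up for illustration.

```python
import math

# Resistivities at about 20 degC, in ohm-metres. CCA is assumed to sit
# somewhere between these two values, depending on its copper fraction.
RHO_COPPER = 1.68e-8
RHO_ALUMINUM = 2.65e-8


def awg_diameter_mm(awg: int) -> float:
    """Solid-wire diameter from the standard AWG definition."""
    return 0.127 * 92 ** ((36 - awg) / 39)


def awg_area_m2(awg: int) -> float:
    """Cross-sectional area of a solid AWG wire in square metres."""
    d_m = awg_diameter_mm(awg) / 1000.0
    return math.pi * d_m ** 2 / 4.0


def loss_w_per_m(current_a: float, awg: int, rho: float = RHO_COPPER) -> float:
    """Resistive (I^2 * R) loss per metre of a single conductor."""
    return current_a ** 2 * rho / awg_area_m2(awg)


if __name__ == "__main__":
    for awg in (10, 20):
        cu = loss_w_per_m(8, awg)
        al = loss_w_per_m(8, awg, RHO_ALUMINUM)
        print(f"8A over {awg}AWG: {cu:.2f} W/m (Cu) to {al:.2f} W/m (Al)")
    ratio = 4 * awg_area_m2(14) / awg_area_m2(10)
    print(f"four 14AWG vs one 10AWG cross-section: {ratio:.2f}x")
```

For copper this comes to roughly 0.2 W/m at 10AWG versus roughly 2 W/m at 20AWG, about a tenfold difference. Four 14AWG wires have about 1.6 times the cross-section of one 10AWG wire.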

8A at 192V being 1.5kW, and 8A at 960V being 7.7kW.

Though, assuming a series of 16 racks running on each shared 960V bus,
this would be 128A. The above de-rating scheme would likely make normal
CCA wire impractical. Probably could distribute DC power over a pair of
1.25" aluminum bars or 0.75" to 1.0" copper bars. Likely, the 1.25"
aluminum bar being the cheaper option here.

Could maybe then connect each 10AWG wire to the bars using a clamp,
and/or use an intermediate socket or modular connector.

Does kinda seem a bit overkill though.

Main power distribution would likely need to operate at a higher
voltage, otherwise the building-scale power rails would be absurd here.

Say, if one assumes a monolithic 960VDC system, and 16 rows, this is
2048A. Like, what does one do here, 3" copper or 5" aluminum rails?... Probably no.
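
The numbers above are just P = V * I and currents adding up along a shared bus. A small sketch, with the 8A-per-rack, 16-racks-per-row and 16-rows figures taken from the text and the helper names invented here:

```python
def power_kw(volts: float, amps: float) -> float:
    """DC power in kilowatts."""
    return volts * amps / 1000.0


def bus_current_a(feed_current_a: float, feeds: int) -> float:
    """Current on a shared bus supplying `feeds` identical loads."""
    return feed_current_a * feeds


if __name__ == "__main__":
    per_rack_a = 8.0
    row_a = bus_current_a(per_rack_a, 16)   # 16 racks on one 960V row bus
    building_a = bus_current_a(row_a, 16)   # 16 row buses on one main rail
    print(f"8A at 192V: {power_kw(192, per_rack_a):.2f} kW")
    print(f"8A at 960V: {power_kw(960, per_rack_a):.2f} kW")
    print(f"row bus: {row_a:.0f} A, main rail: {building_a:.0f} A")
    print(f"main rail power: {power_kw(960, building_a) / 1000:.2f} MW")
```

So the monolithic 960VDC case is a bit under 2 MW on one pair of rails.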

Well, or maybe get creative and use large aluminum I-beams that serve
both as power distribution and joists (so, all this metal can serve
additional purpose). Though, 960V through the joists seems like a
building maintenance hazard. Say, for example, 0V through the
floor and 960V through the ceiling.

Input power would likely need multiple transformers and rectifiers to be practical; though admittedly I have little idea here what sorts of
diodes would be used in these rectifiers. Seems like each diode would
itself need to be stupidly large to deal with this crap.

As for cooling, could maybe either use liquid cooling, or hybrid
air/liquid (say, with superchilled liquid pumped through radiators, and
then fans circulate air through these radiators).

To move lots of heat, could maybe use -90C ethanol as a coolant. Where
ethanol can be pumped like water, but could be nearly as cold as Freon.
Would likely still need big refrigeration pumps.
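
To get a feel for how big those refrigeration pumps would be, the ideal (Carnot) coefficient of performance for cooling is T_cold / (T_hot - T_cold) in kelvin. This is only a theoretical upper bound, and the 35C heat-rejection temperature and the 15C comparison coolant below are illustrative assumptions, not figures from any real installation.

```python
def carnot_cooling_cop(cold_c: float, hot_c: float) -> float:
    """Ideal heat moved per unit of work, for given cold/hot temperatures in Celsius."""
    cold_k = cold_c + 273.15
    hot_k = hot_c + 273.15
    if hot_k <= cold_k:
        raise ValueError("sink is not warmer than the coolant; no heat pump needed")
    return cold_k / (hot_k - cold_k)


if __name__ == "__main__":
    reject_c = 35.0  # assumed temperature at which heat is dumped
    for coolant_c in (-90.0, 15.0):
        cop = carnot_cooling_cop(coolant_c, reject_c)
        print(f"coolant at {coolant_c:.0f}C: ideal COP {cop:.1f}")
```

Under these assumptions the -90C loop tops out at an ideal COP of about 1.5, versus about 14 for a 15C loop, so the colder coolant costs far more work per unit of heat moved.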

If one could have an artificial lake outside (preferably with a
sun-blocking cover), this could be used as a heat-sink.

Where, say:
Inner loop uses cold ethanol;
Refrigeration system moves heat from ethanol loop to a water loop;
The water loop pumps to/from an artificial lake used as a heat sink.
If the lake is above ambient, it will dissipate heat, but if too much
higher it would suffer evaporation losses.

One idea here could be to have 2 levels of cover over the lake:
The lower one is a metal cover painted black on both sides, placed
roughly 20 inches over the surface of the water;
The second cover is another 20 inches higher, painted black on the lower
side and white on the upper side;
The lower cover has a blocking wall to limit how much water vapor
escapes, whereas the upper barrier is open to the sides (allowing air to
flow through).

As the water evaporates, it moves heat into the barrier, which then
radiates heat (as black-body radiation) where the water condenses and
falls back into the lake;
The upper barrier partly absorbs heat from the lower layer, and also
serves to reflect the sun. Air-flow between the layers can be used to
radiate heat.

One other possibility being to have a tall tapered tube (narrower near
the top) with an open top, with the coolant water in the bottom (with
the tube serving to reduce evaporation loss, as water is more
likely to re-condense on the walls and fall back down than to escape the
top). Could likely be made out of steel or similar, maybe black inside,
white outside. Then maybe could heat the coolant water to around 70 or 80C.

While in theory, a giant radiator could work, a sufficiently large
radiator would likely be impractically expensive.

Well, don't know what people actually do, this is just what comes to
mind at the moment.

...

--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 26 14:02:31 2025

From Newsgroup: comp.arch

On Thu, 25 Sep 2025 23:16:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

George Neuner wrote:

On Wed, 24 Sep 2025 21:08:10 +0300, Michael S
<already5chosen@yahoo.com> wrote:

On Mon, 22 Sep 2025 19:36:05 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

Big servers have rather equal power in the peripherals {DISKs,
SSDs, and NICs} and DRAM {plus power supplies and cooling} than
in the cores.

Still, CPU power often matters.

Yes ... and no.

80+% of the power used by datacenters is devoted to cooling the
computers - not to running them. At the same time, most of the heat generated by typical systems is due to the RAM - not the CPU(s).

I am quite sure that number is simply bogus: The power factors we
were quoted when building the largest new datacenter in Norway 10+
years ago, was more like 6-10% of total power for cooling afair.

.. a quick google...

https://engineering.fb.com/2011/04/14/core-infra/designing-a-very-efficient-data-center/

This one claims a 1.07 Power Usage Effectiveness.

Terje

I think 1.07 covers 480VAC outside the data center building to 48VDC at
the server power plug.
It does not include losses within the server:
- 48V to mostly 12V by the server's PSU
- 12V to the whole zoo of low voltages by on-board DC2DC.

--- Synchronet 3.21a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 26 12:10:41 2025

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

That makes little sense. If you're going to distribute power,
distribute it as AC so you save one transformer.

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors,

Transistors?
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Fri Sep 26 16:32:59 2025

From Newsgroup: comp.arch

On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my
impression was that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.

What if, opposed to each computer using its own power-supply (from
120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

That makes little sense. If you're going to distribute power,
distribute it as AC so you save one transformer.

I was never in a big datacenter, but heard that they prefer DC.

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors,

Transistors?

Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.

--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 14:28:02 2025

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> writes:

On 9/25/2025 9:03 PM, Scott Lurndal wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.

Hmm...

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).
--- Synchronet 3.21a-Linux NewsLink 1.2

From Al Kossow@aek@bitsavers.org to comp.arch on Fri Sep 26 07:37:59 2025

From Newsgroup: comp.arch

On 9/26/25 7:28 AM, Scott Lurndal wrote:

In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).

Is it still -48V?
Historically, Bell System plant voltage, supplied by batteries.
--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Sep 26 15:07:40 2025

From Newsgroup: comp.arch

Al Kossow <aek@bitsavers.org> writes:

On 9/26/25 7:28 AM, Scott Lurndal wrote:

In those datacenters, the UPS distributes 48VDC to the rack components
(computers, network switches, storage devices, etc).

Is it still -48V?
Historically, Bell System plant voltage, supplied by batteries.

Yes. Using a positive ground system reduced corrosion in buried
cabling. While corrosion is not generally an issue for datacenters,
they use the same PDUs that the telecom industry uses.
--- Synchronet 3.21a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 12:58:43 2025

From Newsgroup: comp.arch

On 9/26/2025 9:28 AM, Scott Lurndal wrote:

BGB <cr88192@gmail.com> writes:

On 9/25/2025 9:03 PM, Scott Lurndal wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.

Hmm...

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).

OK.

I had thought they were usually 120VAC or 240VAC.

At least, what rack-servers I had encountered were usually one of these (sometimes they had the little switch on the power-supply set to 240V
even in the US).

Then again, can also note that when setting up my milling machine,
lathe, and plasma table, that these were all using 240VAC for the power distribution to the various components. These were all Tormach machines though, so can't say for others.

48VDC also makes sense, as it is common in other contexts. I sorta
figured a higher voltage would have been used to reduce the wire
thickness needed.

Though, I don't actually know how real datacenters work here, just sort
of coming up with something assuming optimizing for the target goals
(powering all this stuff while minimizing electrical losses and cost).

I did realize after posting that, if the main power rails were organized
as a grid, the whole building could be done probably with 1.25" aluminum
bars.

Could power the grid of bars at each of the 4 corners, with maybe some
central diagonal bars (which cross and intersect with the central part
of the grid, and an additional square around the perimeter). Each corner supply could drive 512A, and with this layout, no bar or segment should
exceed 128A.

If they were using 240VAC, it seems like the typical household setup
(12AWG wire) would be woefully insufficient. Would either need to be
heavily built up and/or use much heavier gauge wiring.

Or also solid copper or aluminum bars. Not sure if I had heard of this,
usual idea IIRC was that people always use wire for AC power, except
that if pushing a continuous load of several hundred amps, wire seems
less practical (would need to be very thick, hard to work with, and expensive).

Granted, more likely they would run the cable closer to the rated values
and accept more energy loss due to electrical resistance (since, yeah, a
1.25" bar or similar for 128A is a little excessive).

Though, it seems likely that in this case, solid metal bars might be
cheaper than using a whole lot of heavy gauge wire. And, repurposing
generic aluminum bar-stock might be the cheapest option here (with joins either as aluminum clamps or via welding).

If operating closer to conventional electrical ratings, could drop to
0.375" bars for 128A. Going much thinner, voltage drops and heat would
become an issue.

So, say:
0.250" likely high resistive loss.
0.375" roughly nominal.
0.750" maybe sufficiently low resistance
(could likely handle 500A before significant heat)
1.250" maybe overkill

Well, and could maybe put a plastic coating or similar on the bars to
limit accidental short-circuits. Decided to leave out analysis, but the
most likely option (to balance cost and effectiveness) would likely be a post-install application of acrylic paint (latex paint would be
insufficient, epoxy likely too expensive, ...).

...

--- Synchronet 3.21a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 15:23:38 2025

From Newsgroup: comp.arch

On 9/26/2025 8:32 AM, Michael S wrote:

On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my
impression was that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.

IIRC, reluctance motors are also popular here. They are sorta like BLDC,
but cheaper due to not needing big magnets (though, BLDC motors can give
more power in a physically smaller package if compared with reluctance
motors; but reluctance motors are still more compact if compared with AC induction motors).

Like BLDC, it is possible to run reluctance motors at an exact speed.

This is unlike AC induction motors where, although speed can be adjusted
with a VFD, it isn't particularly exact as it depends on the load on the
motor and similar. Accurate speed control on an induction motor will
still require using an encoder, but they are still not good for
positional control (and the effective "holding torque" of an AC
induction motor is very low).
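
The load dependence comes from slip: the rotor of an induction motor always turns a bit slower than the rotating field, and the gap grows with load. A small sketch of the usual formulas, where the pole count and slip values are just illustrative picks:

```python
def synchronous_speed_rpm(frequency_hz: float, poles: int) -> float:
    """Speed of the rotating stator field: 120 * f / poles."""
    if poles <= 0 or poles % 2:
        raise ValueError("poles must be a positive even number")
    return 120.0 * frequency_hz / poles


def rotor_speed_rpm(frequency_hz: float, poles: int, slip: float) -> float:
    """Induction-motor shaft speed for a given fractional slip (0 = synchronous)."""
    return synchronous_speed_rpm(frequency_hz, poles) * (1.0 - slip)


if __name__ == "__main__":
    # Same VFD output frequency, different load -> different shaft speed.
    for slip in (0.01, 0.03, 0.05):
        rpm = rotor_speed_rpm(60.0, 4, slip)
        print(f"60Hz, 4 poles, slip {slip:.0%}: {rpm:.0f} rpm")
```

A synchronous machine (BLDC or reluctance) would sit at the 1800 rpm synchronous speed in this example regardless of load, as long as it doesn't drop out.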

Where more accuracy is needed, something like a big BLDC or reluctance
motor with a servo-drive might be used (typically with hall-effect
sensors in the stator).

Generally, these motors can't be driven open-loop, as they are prone to
"drop out" at relatively little load in these cases.

Technically, the stator construction for a reluctance motor can be
nearly identical to an induction motor, the main differences are in the
design of the rotor.

Where, say, an induction motor typically has a hollow rotor consisting
of layered steel plates with an embedded copper or aluminum "squirrel
cage" (a ring of bars around the perimeter, all shorted together at the
top and bottom).

The reluctance motor can use a solid steel rotor, with gaps machined in
to control where magnetic flux will go.

A typical BLDC motor either has a ring of permanent magnets, or
alternating poles (from the top/bottom) with a central ring magnet.

I had before imagined it should be possible to make a hybrid of a
reluctance and induction rotor for intermediate effects; partly by
filling the gaps in the reluctance rotor with aluminum in place of air.
This could still operate synchronously, but could have better torque
under load and less issue with drop out. If it drops below synchronous
speed, it would instead induce eddy currents in the aluminum parts of
the rotor; rather than the air being "basically useless". However,
aluminum would still behave more like air as far as the magnetic flux
lines are concerned.

Though, some commercial designs had instead gone the other way,
hybridizing the reluctance rotor with a BLDC rotor, and using (cheaper) ceramic magnets in place of rare-earth magnets (as typical in a BLDC).

One variant here resembling a reluctance motor with a split rotor, with
the top/bottom rotated relative to each other, and a central ceramic
ring magnet. Though, I think this pushes it more into the BLDC category.

Also common, on the AC side, are 440 and 208 3-phase.
Many traditional AC induction motors operate on 440VAC 3-phase.
A lot of traditional industrial machines were also 440VAC.

There is some stuff I saw about electrostatic motors gaining popularity
in some areas, but these tend to operate at high voltages but very
little amperage. They are comparably weak compared with magnetic motors,
but can be more energy efficient.

What if, opposed to each computer using its own power-supply (from
120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

That makes little sense. If you're going to distribute power,
distribute it as AC so you save one transformer.

I was never in a big datacenter, but heard that they prefer DC.

DC -> DC allows higher conversion efficiency compared to AC.
Higher voltage distribution also allows more efficiency.

Higher voltage would be needed with DC vs AC, as DC is more subject to resistive losses. Though, more efficiency on the AC side would be
possible by increasing line frequency, say, using 240Hz rather than
60Hz; but don't want to push the frequency too high as then the wires
would start working like antennas and radiating the power into space.

A higher line frequency would increase the relative efficiency of
electrical transformers. Higher voltage AC also has a higher conversion efficiency than lower voltage.

In theory, assuming the AC comes in at 60Hz, could have a sort of rotary converter to boost the line frequency (could have a vaguely similar construction to an AC motor, but where input power uses 6 coils, and the output side has 12 or 24 coils; likely also operating like a boost transformer).

Not sure if anyone already builds this, or the conversion efficiency of
such a device. Would need to hopefully have a high conversion efficiency (otherwise it would not offset the losses in all of any smaller
transformers).

Though, wouldn't really gain anything if just going directly to DC via
bridge rectifiers (with no intermediate transformers), and then using
DC-DC conversion.

So, say 1320VAC 3-phase could likely be rectified into 960VDC, where,
assuming the presence of big capacitors, the voltage would drop slightly
in conversion due to phase ripple (the "peaks" getting flattened out).
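
As a cross-check on these voltages: for an ideal three-phase full-wave (six-pulse) diode bridge, the textbook average output is (3*sqrt(2)/pi) times the line-to-line RMS voltage, about 1.35x, and with a large capacitor and light load it rises toward the peak, sqrt(2) times line-to-line. The sketch below assumes an ideal bridge fed line-to-line with no diode drops; the function names are mine. Under that formula 960VDC corresponds to about 711VAC line-to-line, so a 1320VAC input would need some further step-down that this sketch does not model.

```python
import math

SIX_PULSE_FACTOR = 3.0 * math.sqrt(2.0) / math.pi  # about 1.35


def six_pulse_average_dc(v_line_rms: float) -> float:
    """Average DC output of an ideal six-pulse bridge (no filter capacitor)."""
    return SIX_PULSE_FACTOR * v_line_rms


def six_pulse_peak_dc(v_line_rms: float) -> float:
    """Peak output, approached with a large capacitor and light load."""
    return math.sqrt(2.0) * v_line_rms


def line_voltage_for_dc(v_dc_average: float) -> float:
    """Line-to-line RMS voltage giving the requested average DC output."""
    return v_dc_average / SIX_PULSE_FACTOR


if __name__ == "__main__":
    print(f"1320VAC in: {six_pulse_average_dc(1320):.0f}V avg, "
          f"{six_pulse_peak_dc(1320):.0f}V peak")
    print(f"960VDC out needs about {line_voltage_for_dc(960):.0f}VAC line-to-line")
```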

Or, in theory, I have little idea where people would get diodes and
capacitors big enough for this. Presumably giant industrial-sized diodes
and capacitors could exist though (well, and/or PCBs with craptons of
smaller components).

Then again, in a relative sense, boards with 1000s of diodes and
capacitors wouldn't cost much relative to the cost of the building and servers.

...

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors,

Transistors?

Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.

Yes, pretty much.

MOSFET, diode (from ground), inductor, and a capacitor;
Then you need a controller circuit to keep track of the voltage and
adjust the duty cycle as needed to maintain the target voltage.

MOSFET lets power in, which goes through the coil, and charges the
capacitor (in parallel with the load). When the MOSFET turns off, there
is a voltage kick from the inductor (it goes negative), pulling power
from the ground plane.
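
For an ideal buck converter in continuous conduction, the output is the input times the duty cycle, so the controller is effectively steering toward D = Vout / Vin. A sketch of what that means for the voltages discussed here, ignoring all losses; the 100kHz switching frequency is an arbitrary assumption for illustration:

```python
def buck_duty_cycle(v_in: float, v_out: float) -> float:
    """Ideal continuous-conduction duty cycle of a buck converter."""
    if not 0 < v_out <= v_in:
        raise ValueError("a buck converter needs 0 < v_out <= v_in")
    return v_out / v_in


def on_time_ns(v_in: float, v_out: float, switching_hz: float) -> float:
    """MOSFET on-time per switching period, in nanoseconds."""
    return buck_duty_cycle(v_in, v_out) / switching_hz * 1e9


if __name__ == "__main__":
    f_sw = 100e3  # assumed switching frequency
    for v_in, v_out in ((960, 12), (960, 192), (192, 12)):
        d = buck_duty_cycle(v_in, v_out)
        t_on = on_time_ns(v_in, v_out, f_sw)
        print(f"{v_in}V -> {v_out}V: duty {d:.2%}, on-time {t_on:.0f} ns")
```

The single-stage 960V -> 12V case needs a duty cycle of 1.25%, which is one practical argument for the 2-stage arrangement, where each stage runs at a more comfortable 20% and 6.25%.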

It is possible to use an opamp for this (rather than a microcontroller),
but an opamp would generate very crude PWM, thus, noisier.

Possible noise reduction approaches:
Big capacitor;
Secondary inductor, diode, and capacitor.
Assuming a constant load, a second inductor could smooth the PWM noise
by maintaining closer to a constant current; but is more likely to see
voltage ripples if there are sudden changes in the load (if compared
with using a bigger capacitor).

By comparison, a microcontroller can generate a higher-frequency PWM
signal, and keep the initial noise lower.

--- Synchronet 3.21a-Linux NewsLink 1.2

From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 26 23:35:52 2025

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> posted:
--------------------snip----------------------------------

Higher voltage would be needed with DC vs AC, as DC is more subject to resistive losses. Though, more efficiency on the AC side would be
possible by increasing line frequency, say, using 240Hz rather than
60Hz; but don't want to push the frequency too high as then the wires
would start working like antennas and radiating the power into space.

The military routinely uses 400 Hz to reduce the weight of transformers.
--- Synchronet 3.21a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Fri Sep 26 19:37:34 2025

From Newsgroup: comp.arch

On 9/26/2025 6:35 PM, MitchAlsup wrote:

BGB <cr88192@gmail.com> posted:
--------------------snip----------------------------------

Higher voltage would be needed with DC vs AC, as DC is more subject to
resistive losses. Though, more efficiency on the AC side would be
possible by increasing line frequency, say, using 240Hz rather than
60Hz; but don't want to push the frequency too high as then the wires
would start working like antennas and radiating the power into space.

The military routinely uses 400 Hz to reduce the weight of transformers.

OK, so it makes sense then...

I guessed 240Hz as it could likely be enough to usefully boost
efficiency, but not so high as to cause significant leakage from the building's electrical system.

Something like 400 or 480Hz should also work.

Moving too far into kHz territory is likely to result in significant
leakage.

Though, looking into it, would likely have to get pretty high into the
kHz range before a building's power distribution system starts radiating
most of the power into the environment (with most of the sub-kHz
territory likely being pretty safe here).
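
A quick way to see this: a conductor generally has to be an appreciable fraction of a wavelength long before it radiates efficiently, and wavelength is c / f. The sketch below uses the free-space speed of light (in real cable the wave travels somewhat slower), and the 100 m run length is an arbitrary assumption.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def wavelength_m(frequency_hz: float) -> float:
    """Free-space wavelength in metres."""
    return SPEED_OF_LIGHT_M_S / frequency_hz


def electrical_length(conductor_m: float, frequency_hz: float) -> float:
    """Conductor length expressed as a fraction of a wavelength."""
    return conductor_m / wavelength_m(frequency_hz)


if __name__ == "__main__":
    run_m = 100.0  # assumed length of a building power run
    for f in (60.0, 240.0, 400.0, 1e3, 100e3):
        frac = electrical_length(run_m, f)
        print(f"{f:>8.0f} Hz: wavelength {wavelength_m(f) / 1000:8.1f} km, "
              f"{run_m:.0f} m run = {frac:.6f} wavelengths")
```

At 400Hz the wavelength is about 750 km, so a 100 m run is on the order of a ten-thousandth of a wavelength.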

...

--- Synchronet 3.21a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 27 08:14:11 2025

From Newsgroup: comp.arch

Michael S <already5chosen@yahoo.com> schrieb:

On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my
impression was that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.

I can only speak from personal experience about the industry I
work in (chemical). People used to use DC motors when they needed
variable motor speed, but have now switched to asynchronous (AC)
motors with frequency inverters, which usually have a 1:10 ratio
of speed. There are no DC networks in chemical plants.

If you have a high-voltage DC system (like in an electric car) then
using DC motors makes more sense.

What if, opposed to each computer using its own power-supply (from
120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

That makes little sense. If you're going to distribute power,
distribute it as AC so you save one transformer.

I was never in a big datacenter, but I have heard that they prefer DC.

Eventually, electronics requires DC. Of course, you can make
an economic calculation of where you put your transformers and
rectifiers, and where you want which voltage.

An option which makes little sense is to have a rectifier which
creates high-voltage DC, then distributes that, and to have
an alternator at the other end to create AC which you can then
transform down. It would be better to distribute AC and transform
it down, saving two parts.

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors,

Transistors?

Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.

I'm more used to thyristors in that role.
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.21a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 13:27:02 2025

From Newsgroup: comp.arch

On 26/09/2025 14:10, Thomas Koenig wrote:

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

True DC motors - with brushes - are rarely found outside of very small
motors (where they are cheap and simple). But there are a wide variety
of AC motors controlled in many different ways. Asynchronous AC motors
are only one type. There are lots of other topologies for motors and
their controllers, with different pros and cons and suitable applications.

What if, opposed to each computer using its own power-supply (from 120
or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

That makes little sense. If you're going to distribute power,
distribute it as AC so you save one transformer.

There are lots of advantages of distributing power as DC. Transformers
are only a good choice at higher voltages - once you get to the levels
that can be handled well by semiconductor switches, they are smaller and
more efficient, and work best for DC-to-DC. 1200V switches are cheap
and common now, though there are devices that handle a few thousand
volts. Electric car charger standards are 400V and 800V, with some new
ones at 1000V or up to 1500V.

It makes sense to distribute locally at something like 48V or 60V DC.
Connections are simpler, you can take it directly from a UPS, and the
local conversions to low voltage power lines are simpler than with 120V
or 240V AC.

So for a data centre, using perhaps 800V DC (taking advantage of the
electric car industry standards) to the rack, then 48V DC to the devices
on the rack would seem a good setup to me. DC also makes life much
easier and more efficient when you have UPSs and battery backup -
locally in a rack, or wider in the higher level supply.
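To put rough numbers on the 800 V versus 48 V trade-off, here is a small Python sketch. The 10 kW rack load and the 0.01 ohm cable resistance are made-up illustration values, not figures from this thread. The only point is that, for the same power and the same conductor, current scales with 1/V and resistive loss with 1/V^2.

```python
def current(power_w, voltage_v):
    """Current drawn by a load of power_w watts at voltage_v volts."""
    return power_w / voltage_v


def resistive_loss(power_w, voltage_v, resistance_ohm):
    """I^2 * R loss in a feed with the given (hypothetical) resistance."""
    i = current(power_w, voltage_v)
    return i * i * resistance_ohm


RACK_POWER_W = 10_000   # hypothetical rack load
CABLE_OHM = 0.01        # hypothetical feed resistance

for volts in (800, 48):
    amps = current(RACK_POWER_W, volts)
    loss = resistive_loss(RACK_POWER_W, volts, CABLE_OHM)
    print(f"{volts} V: {amps:.1f} A, {loss:.1f} W lost in the feed")
```

With these assumed values the 48 V feed carries roughly 208 A instead of 12.5 A, and loses (800/48)^2, about 278 times, as much power in the same conductor. That is why 48 V only makes sense over the short runs inside a rack.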

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors,

Transistors?

--- Synchronet 3.21a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 13:52:23 2025

From Newsgroup: comp.arch

On 27/09/2025 10:14, Thomas Koenig wrote:

Michael S <already5chosen@yahoo.com> schrieb:

On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my
impression was that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.

I can only speak from personal experience about the industry I
work in (chemical). People used to use DC motors when they needed
variable motor speed, but have now switched to asynchronous (AC)
motors with frequency inverters, which usually have a 1:10 ratio
of speed. There are no DC networks in chemical plants.

If you have a high-voltage DC system (like in an electric car) then
using DC motors makes more sense.

These are not "DC motors" in the traditional sense, like brushed DC
motors. The motors you use in a car have (roughly) sine wave drive
signals, generally 3 phases (but sometimes more). Even motors referred
to as "Brushless DC motors" - "BLDC" - use AC inputs, though the
waveforms are more trapezoidal than sinusoidal.

And whenever you have a frequency inverter, the input to the inverter
is first rectified to DC, then new AC waveforms are generated using PWM controlled semiconductor switches.

Really, the distinction between "DC motor" and "AC motor" is mostly meaningless, other than for the smallest and cheapest (or oldest)
brushed DC motors.

Bigger brushed DC motors, as you say, used to be used in situations
where you needed speed control and the alternative was AC motors driven
at fixed or geared speeds directly from the 50 Hz or 60 Hz supplies.
And as you say, these were replaced by AC motors driven from frequency inverters. Asynchronous motors (or "induction motors") were popular at
first, but are not common choices now for most use-cases because
synchronous AC motors give better control and efficiencies. (There are,
of course, many factors to consider - and sometimes asynchronous motors
are still the best choice.)

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors,

Transistors?

Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.

I'm more used to thyristors in that role.

It's better, perhaps, to refer to "semiconductor switches" as a more
general term.

Thyristors are mostly outdated, and are only used now in very high power situations. Even then, they are not your granddad's thyristors, but
have more control for switching off as well as switching on - perhaps
even using light for the switching rather than electrical signals.
(Those are particularly nice for megavolt DC lines.)

You can happily switch multiple MW of power with a single IGBT module
for a couple of thousand dollars. Or you can use SiC FETs for up to a
few hundred kW but with much faster PWM frequencies and thus better control.

--- Synchronet 3.21a-Linux NewsLink 1.2

From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Sep 27 12:38:14 2025

From Newsgroup: comp.arch

David Brown <david.brown@hesbynett.no> schrieb:

And whenever you have a frequency inverter, the input to the inverter
is first rectified to DC, then new AC waveforms are generated using PWM controlled semiconductor switches.

If you have three phases (required for high-power industrial motors)
I believe people use the three phases directly to convert from three
phases to three phases.

The resulting waveforms are not pretty, and contribute to the
difficulty of measuring power input.

[...]
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.
--- Synchronet 3.21a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Sat Sep 27 15:15:41 2025

From Newsgroup: comp.arch

On 27/09/2025 14:38, Thomas Koenig wrote:

David Brown <david.brown@hesbynett.no> schrieb:

And whenever you have a frequency inverter, the input to the inverter
is first rectified to DC, then new AC waveforms are generated using PWM
controlled semiconductor switches.

If you have three phases (required for high-power industrial motors)
I believe people use the three phases directly to convert from three
phases to three phases.

The resulting waveforms are not pretty, and contribute to the
difficulty of measuring power input.

That used to be how it was done - using thyristors, and powering
induction motors. But it is not how it has been done in new motors for
a long time. (In industrial use, some motors can be very big, very
expensive, and very difficult to replace - thus factories can have the
same motors for decades, even though better and more efficient ones are available.)

Using thyristors to regulate the power out from your three phase input
is relatively simple, but as you say, the waveforms are not pretty.
This leads to significant noise (electrical and audible), vibrations,
torque ripple, and wear and tear on the motor. And it makes a mess of
the input supply, giving harmonics and phase differences between the
current and voltage input - which leads to significant losses in the
power delivery. These losses are between the generation and the
customer, meaning the electricity supplier sees it but the customer does
not see it on their bill - thus electricity suppliers greatly dislike
it. The effect is less with thyristors on three phase power than
thyristors on two phase power, but it is still very bad.

So these days, the AC power - two phase or three phase - is invariably converted to DC first, using power factor correction rectification (so
that the instantaneous current draw is proportional to the voltage at
the time, keeping current and voltage in phase and nicely sinusoidal).
After that, the AC drive to the motor is generated using PWM signals -
from perhaps 2 or 4 kHz for old IGBT systems to at least 20 kHz for
newer systems (avoiding audible noise) or up to maybe 160 kHz using GaN
or SiC FETs - higher frequencies mean smaller and lighter inductors and capacitors.

These kinds of motor control are smaller, more efficient, and much more controllable than old thyristor-based drives.

--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Sep 27 14:56:06 2025

From Newsgroup: comp.arch

BGB <cr88192@gmail.com> writes:

On 9/26/2025 9:28 AM, Scott Lurndal wrote:

In those datacenters, the UPS distributes 48VDC to the rack components
(computers, network switches, storage devices, etc).

48VDC also makes sense, as it is common in other contexts. I sorta
figured a higher voltage would have been used to reduce the wire
thickness needed.

This is within a 19" rack.

I did realize after posting that, if the main power rails were organized
as a grid, the whole building could be done probably with 1.25" aluminum
bars.

The Burroughs V5x0 series ECL machines had Aluminum bus-bars.

Spectacular failure mode when/if something conductive (screwdriver,
wrench) was dropped across the hot and ground bars.

Could power the grid of bars at each of the 4 corners, with maybe some
central diagonal bars (which cross and intersect with the central part
of the grid, and an additional square around the perimeter). Each corner
supply could drive 512A, and with this layout, no bar or segment should
exceed 128A.

In the old mainframe days, there would be large bus-bars (in an enclosure) across the ceiling and plug-in tap boxes would drop power to the
various mainframe units.

--- Synchronet 3.21a-Linux NewsLink 1.2

From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Sep 27 14:56:47 2025

From Newsgroup: comp.arch

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

BGB <cr88192@gmail.com> posted:
--------------------snip----------------------------------

Higher voltage would be needed with DC vs AC, as DC is more subject to
resistive losses. Though, more efficiency on the AC side would be
possible by increasing line frequency, say, using 240Hz rather than
60Hz; but don't want to push the frequency too high as then the wires
would start working like antennas and radiating the power into space.

The military routinely uses 400 Hz to reduce the weight of transformers.

IBM mainframes used 400 Hz (via a motor-generator set).
--- Synchronet 3.21a-Linux NewsLink 1.2

From Al Kossow@aek@bitsavers.org to comp.arch on Sat Sep 27 08:57:44 2025

From Newsgroup: comp.arch

On 9/26/25 5:37 PM, BGB wrote:

Something like 400 or 480Hz should also work.

Would y'all please change the subject line.
--- Synchronet 3.21a-Linux NewsLink 1.2

From BGB@cr88192@gmail.com to comp.arch on Sat Sep 27 14:23:22 2025

From Newsgroup: comp.arch

On 9/27/2025 6:52 AM, David Brown wrote:

On 27/09/2025 10:14, Thomas Koenig wrote:

Michael S <already5chosen@yahoo.com> schrieb:

On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my
impression was that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.

I can only speak from personal experience about the industry I
work in (chemical). People used to use DC motors when they needed
variable motor speed, but have now switched to asynchronous (AC)
motors with frequency inverters, which usually have a 1:10 ratio
of speed. There are no DC networks in chemical plants.

If you have a high-voltage DC system (like in an electric car) then
using DC motors makes more sense.

These are not "DC motors" in the traditional sense, like brushed DC motors. The motors you use in a car have (roughly) sine wave drive signals, generally 3 phases (but sometimes more). Even motors referred
to as "Brushless DC motors" - "BLDC" - use AC inputs, though the
waveforms are more trapezoidal than sinusoidal.

Yes.

Typically one needs to generate a 3-phase waveform at the speed they
want to spin the motor at.

I had noted, from some experience writing code to spin motors (mostly experimentally, typically on an MSP430 or similar):
Sine waves give low noise, but less power;
Square waves are noisier and only work well at low RPM,
but have higher torque.
Sawtooth waves seem to work well at higher RPMs.
Well, sorta, more like sawtooth with alternating sign.
Square-Root Sine: Intermediate between sine and square.
Gives torque more like a square wave, but quieter.
Trapezoid waves are similar to this, but more noise.
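For readers who want to compare these shapes, here is a rough Python sketch of three of them. The post does not define "Square-Root Sine"; the version below assumes it means the square root of the sine's magnitude with the sign restored, which is only one plausible reading. The function names and the RMS comparison are illustrative and not from the original code.

```python
import math


def sine(x):
    return math.sin(x)


def square(x):
    # +1 for the positive half-wave, -1 for the negative half-wave.
    return 1.0 if math.sin(x) >= 0 else -1.0


def sqrt_sine(x):
    # Assumed definition: sign(sin x) * sqrt(|sin x|).
    s = math.sin(x)
    return math.copysign(math.sqrt(abs(s)), s)


def rms(wave, samples=3600):
    """Root-mean-square value of a waveform over one full period."""
    total = sum(wave(2 * math.pi * i / samples) ** 2 for i in range(samples))
    return math.sqrt(total / samples)


for name, wave in (("sine", sine), ("sqrt-sine", sqrt_sine), ("square", square)):
    print(f"{name:10s} rms = {rms(wave):.3f}")
```

Under that assumption the RMS drive level of the square-root sine (about 0.80 of the peak) sits between the sine (about 0.71) and the square wave (1.0), which is at least consistent with the "intermediate" behaviour described above.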

Seemingly, one "better" option might be to mutate the wave-shape between Square-Root-Sine and sawtooth depending on the target RPM. Also dropping
the wave amplitude at lower RPMs (at low RPMs motors pull more amperage
and thus generate a lot of heat otherwise).

In this case, the sawtooth wave helps because the coils don't like
changing quickly, so in this case one hits them full power at the start
(to get them going) and then rapidly drop back down to zero, then hit
them the same way on the opposite sign for the next part of the wave.

When I was messing around with it at the time, input control signals
were typically one of:
ADC input connected to a POT (for direct control);
1-2ms RC style PWM.

Step/Dir signaling (typical for stepper drivers and servomotor
controllers) could also make sense.

One other option is dual-phase motors, which have the partial advantage
that one can use a repurposed stepper driver. In this case typically set
to micro-stepping. A lot of the dual-phase motors in this case though
were built from repurposed capacitor-run split-phase motors.

Say, for example, one can be like, "Yeah, this AC split phase motor is
close enough to being a NEMA34 stepper...".

Typically need to partly rewire it as typically the split phase motors
have 3 wires, but need 4 wire in this case. Some other motors are easily modified into 3-phase though (with the same coils as a 3-phase motor internally, just wired into a split-phase configuration with a 60-degree
phase offset; vs 90 degrees in some other motors).

One can get different properties if they machine a new rotor, as these
motors invariably come with squirrel-cage rotors. Easier/cheaper to
machine here being a reluctance rotor.

Main annoyance mostly being that this can be a pretty big chunk of steel
for any non-trivial motor (also heavy). Could likely reduce weight by
making the base rotor by layering multiple sizes of steel tubing, then
brazing or welding it all together with some steel end-caps (drilled out
for the motor shaft, probably also brazed in place). Then turn it to the target diameter, and mill the side grooves.

Well, or find something with sufficiently thick walls (say, a chunk of
5" OD, schedule 120 or 180 steel pipe). This would simplify the process,
and be cheaper (and lighter) than, say, a chunk of 5" bar stock.

Haven't done much in this area for a while, was mostly messing around a
lot more with this when I was a little younger.

And whenever you have a frequency inverter, the input to the inverter
is first rectified to DC, then new AC waveforms are generated using PWM controlled semiconductor switches.

Yes:
Dual-phase: may use a "Dual H-Bridge" configuration
Where, the H-bridge is built using power transistors;
Three-phase: "Triple Half-Bridge"
Needs fewer transistors than dual phase.

It is slightly easier to build these drivers with BJTs or Darlington transistors, but these tend to handle less power and generate more heat,
but are more fault tolerant.

MOSFETs can handle more power, but one needs to be very careful not to
exceed the Gate-Source voltage limit, otherwise they are insta-dead (and
will behave as if they are shorted).

So, one needs a more complex circuit, say:
MOSFET power transistor (typically NMOS);
NPN or PNP control transistor (such as a 2N3904 or similar);
Pull up/down resistors;
Zener diode.
In which case the control transistors can be driven as in a typical
H-Bridge.

Say, for example:
Pull down resistor pulls Gate to Source, keeping it off by default;
Zener diode in parallel with resistor, to impose VGS limit;
Pull-up transistor connects to Drain via a resistor
(via emitter or collector, depending on PNP or NPN).
Base on control transistor used for control.

Then can control the MOSFETs as-if they were BJTs. Not sure why they
can't have this stuff built-in (sort of like with a Darlington), but alas.

One typically also needs flyback diodes, and a main DC rail capacitor,
and a DC rail zener diode, ...

Though, at this stage, more preferable to buy these things than build
them, as the hand-built ones tend to have a bad habit of exploding.

Really, the distinction between "DC motor" and "AC motor" is mostly meaningless, other than for the smallest and cheapest (or oldest)
brushed DC motors.

Pretty much.

More the motor technology one finds in toys and a lot of cordless power
tools. Also the "Power Wheels" vehicles, which tended to use the same
kind of 1/4 HP brushed DC motors often found in cordless power tools.

Some adults had ridden around on these things, but sometimes modded them
out to use bigger 1/2 or 3/4 HP motors. Typically also need a bigger
battery, as they were using repurposed UPS batteries (from one I ended
up tearing down some years ago). Otherwise, mostly all plastic apart
from a steel axle and similar.

Bigger brushed DC motors, as you say, used to be used in situations
where you needed speed control and the alternative was AC motors driven
at fixed or geared speeds directly from the 50 Hz or 60 Hz supplies. And
as you say, these were replaced by AC motors driven from frequency inverters. Asynchronous motors (or "induction motors") were popular at first, but are not common choices now for most use-cases because
synchronous AC motors give better control and efficiencies. (There are,
of course, many factors to consider - and sometimes asynchronous motors
are still the best choice.)

Yeah.

Large brushed DC motors are not usually seen much IME.

Have encountered brushed DC motors up to around 1/2 or 3/4 HP, not sure
if they go much larger.

They often go at lower RPMs, say (IIRC):
1/4 HP: 20000 RPM (roughly 1.25" OD x 2" L)
1/2 HP: 10000 RPM (roughly 2.5" OD x 4" L)
3/4 HP: 6000 RPM (roughly 4" OD x 6" L)

As well as typically being physically larger (though, a 3/4 HP brushed
DC motor is merely the size of a 1/4 HP AC induction motor). Like, very
large by DC motor standards, but by AC motor standards, smaller than the motors typically used to spin the fan blades on an air conditioner unit.

Whereas, a 3/4 HP induction motor is a much bigger beast.

BLDC motors are typically also small. But, pure BLDC motors are also
often very expensive much over 1/4 HP (often because they use neodymium magnets).

But, the other option being reluctance motors, but these may or may not
be passed off as BLDC.

Can sort of tell the difference when spinning them with no power
applied: True BLDCs will have high "cogging torque" (almost more like a stepper motor, but not as strong and with much bigger steps);
If there is a very weak cogging torque, it is likely one of the
intermediate reluctance/BLDC hybrids (eg, with a ceramic ring magnet);
If it spins freely (no cogging torque) it is likely a reluctance motor.

Or, 2-stage, say:
960V -> 192V (with 960V to each rack).
192V -> 12V (with 192V to each server).

Where the second stage drop could use slightly cheaper transistors,

Transistors?

Yes, transistors. DC-to-DC convertors are made of FETs. FETs are
transistors.

I'm more used to thyristors in that role.

It's better, perhaps, to refer to "semiconductor switches" as a more
general term.

Thyristors are mostly outdated, and are only used now in very high power situations. Even then, they are not your granddad's thyristors, but
have more control for switching off as well as switching on - perhaps
even using light for the switching rather than electrical signals.
(Those are particularly nice for megavolt DC lines.)

You can happily switch multiple MW of power with a single IGBT module
for a couple of thousand dollars. Or you can use SiC FETs for up to a
few hundred kW but with much faster PWM frequencies and thus better
control.

Yes.

For medium power, typically MOSFETs were used.

For low power, typically BJTs or Darlingtons.

But, BJTs seem to become impractical much over around 60V 5A or so. Even
this requires a pretty aggressive heat-sink and/or active cooling.

MOSFETs handle more power with less heat, and are often available up to
around 1000V 50A or so (in TO-247 packaging or similar), but can be run
in parallel as needed for more amps.

IGBTs for when one needs something big...

Never really messed with Thyristors.

--- Synchronet 3.21a-Linux NewsLink 1.2

From Michael S@already5chosen@yahoo.com to comp.arch on Sun Sep 28 12:00:56 2025

From Newsgroup: comp.arch

On Fri, 26 Sep 2025 14:28:02 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:

BGB <cr88192@gmail.com> writes:

On 9/25/2025 9:03 PM, Scott Lurndal wrote:

MitchAlsup <user5857@newsgrouper.org.invalid> writes:

Consider that there are losses converting from the
primary (e.g. 22kv) to 480v (2%), and additional losses
converting to 208v (3%) to the UPS. That's before any
rectification losses (6% to 12%). With various optimizations,
they reduced total losses to 7.5%, including rectification
and transformation from the primary voltage.
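As a back-of-the-envelope check on how those stage losses stack up, here is a short Python sketch. It assumes the three stages are simply in series and that each percentage is a fraction of that stage's input power, which the post does not state explicitly.

```python
def total_loss(stage_losses):
    """Combined fractional loss of conversion stages connected in series."""
    efficiency = 1.0
    for loss in stage_losses:
        efficiency *= 1.0 - loss
    return 1.0 - efficiency


# 22 kV -> 480 V (2%), 480 V -> 208 V (3%), rectification (6% to 12%)
best = total_loss([0.02, 0.03, 0.06])
worst = total_loss([0.02, 0.03, 0.12])
print(f"unoptimized chain: {best:.1%} to {worst:.1%} lost")
```

Under those assumptions the unoptimized chain loses somewhere between roughly 10.6% and 16.3%, which gives some context for the 7.5% figure quoted above.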

Hmm...

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

What if, opposed to each computer using its own power-supply (from
120 or 240 VAC), it uses a buck converter, say, 960VDC -> 12VDC.

In those datacenters, the UPS distributes 48VDC to the rack components (computers, network switches, storage devices, etc).

I looked at PSUs offered by Dell for their rack servers. There are four
options for the inputs, although not every server model has all four.
The options are:
- 100-240 VAC.
- 200-240 VAC
- -48 VDC
- 240-400 VDC

I don't know in which countries and in which branch of IT they prefer
the fourth option, but knowing Dell of late (as opposed to Dell of up to ~2008), they would not offer the option unless demand was quite
significant.

--- Synchronet 3.21a-Linux NewsLink 1.2

From David Brown@david.brown@hesbynett.no to comp.arch on Sun Sep 28 16:44:02 2025

From Newsgroup: comp.arch

On 27/09/2025 21:23, BGB wrote:

On 9/27/2025 6:52 AM, David Brown wrote:

On 27/09/2025 10:14, Thomas Koenig wrote:

Michael S <already5chosen@yahoo.com> schrieb:

On Fri, 26 Sep 2025 12:10:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

BGB <cr88192@gmail.com> schrieb:

Brings up a thought: 960VDC is a semi-common voltage in industrial
applications IIRC.

I've never encountered that voltage. Direct current motors are
also mostly being phased out (pun intended) by asynchronous motors
with frequency inverters.

Are you sure?
Indeed, in industry, outside of transportation, asynchronous AC motors
were the most widespread motors by far up to 25-30 years ago. But my
impression was that today various types of electric motors (DC, esp.
brushless, AC sync, AC async) enjoy similar popularity.

I can only speak from personal experience about the industry I
work in (chemical). People used to use DC motors when they needed
variable motor speed, but have now switched to asynchronous (AC)
motors with frequency inverters, which usually have a 1:10 ratio
of speed. There are no DC networks in chemical plants.

If you have a high-voltage DC system (like in an electric car) then
using DC motors makes more sense.

These are not "DC motors" in the traditional sense, like brushed DC
motors. The motors you use in a car have (roughly) sine wave drive
signals, generally 3 phases (but sometimes more). Even motors
referred to as "Brushless DC motors" - "BLDC" - use AC inputs, though
the waveforms are more trapezoidal than sinusoidal.

Yes.

Typically one needs to generate a 3-phase waveform at the speed they
want to spin the motor at.

Details of motor drives are perhaps getting a bit OT for this group - but
there are people here interested in all sorts of things. If you want to
have more discussions on motor drives, comp.arch.embedded might be a
nice place for a new thread - the group appears fairly empty, but
experts crawl out of the woodwork whenever an interesting new thread is started!

I had noted, from some experience writing code to spin motors (mostly experimentally, typically on an MSP430 or similar):

Experiments are always good, but it is also helpful to combine them with
a bit of theory so that you don't generalise too much from a small
number of tests. In particular, the motor windings in a three phase AC
motor can be done in several different ways, optimised for different
kinds of controlling waves. The two main ones for small and medium
permanent magnet motors are for sinusoidal waves (aiming for smoothest
and most controlled driving - often called "PMSM - permanent magnet synchronous motors") and for trapezoidal driving (for simpler driving,
often referred to as "BLDC - Brushless DC").

Then there are different ways to track the position of the motor. You
can have hall effect sensors, which are simple and cheap, giving 6
positions per electrical rotation (motors can have multiple sets of
windings and magnets, giving two or more electrical rotations per
mechanical rotation). These are good for trapezoidal BLDC control. It
is also possible to use sensorless control, where the hall effect
signals are calculated by measuring the back EMF from the motor windings during the off periods of the driver half bridges. This avoids the
sensors and can make cabling easier, but can't be used at low speed - it
is only suitable for continuously running motors rather than positioning motors.

Or you can have encoders, which give the more precise position needed
for sine wave or PMSM waves. These are usually quadrature encoders,
which are accurate and reliable but need to pass through an index
position to get their absolute position. Sometimes absolute encoders
are used - these are either cheaper but less precise using analogue hall effect sensors, or much more expensive using multiple Gray code rings
with optical or inductive sensing.
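As a rough illustration of how a quadrature encoder is read in software, here is a minimal Python sketch. It assumes the two channels are sampled fast enough that at most one channel changes between samples. Which direction counts as positive is arbitrary here, and index-pulse handling is left out.

```python
# The 2-bit state (A << 1 | B) steps through this Gray-code cycle when
# turning in the direction this sketch arbitrarily calls "forward".
_ORDER = [0b00, 0b01, 0b11, 0b10]


def decode(samples):
    """Return the net count from a sequence of 2-bit (A << 1 | B) samples."""
    count = 0
    prev = samples[0]
    for cur in samples[1:]:
        delta = (_ORDER.index(cur) - _ORDER.index(prev)) % 4
        if delta == 1:
            count += 1          # one step forward
        elif delta == 3:
            count -= 1          # one step backward
        elif delta == 2:
            # Both channels changed at once: a state was missed.
            raise ValueError("skipped a state: sampled too slowly or noisy input")
        prev = cur
    return count


print(decode([0b00, 0b01, 0b11, 0b10, 0b00]))  # one full cycle forward
print(decode([0b00, 0b10, 0b11, 0b01, 0b00]))  # one full cycle backward
```

The count is only relative to wherever the encoder was at power-up, which is why a real drive has to pass the index position once before the position is known absolutely.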

For trapezoidal drives, you usually have a simple 6-step switching
sequence, with each of the three half-bridges driving high for 2 steps,
off for 1 step, low for 2 steps, off for one step. You can control the
speed of the motor by the speed of the steps, and the power by using PWM modulation when driving high or low (or by using a single PWM control
for the common DC bus voltage).
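The 6-step sequence described above can be written out as a small table. The Python sketch below builds it from the single "high 2, off 1, low 2, off 1" pattern, shifting each phase by two steps (120 electrical degrees). The phase labels A/B/C and the symbols H, L and Z (floating) are just naming choices for this illustration.

```python
# State of one half-bridge over the six steps:
# high for 2 steps, off for 1, low for 2, off for 1.
PATTERN = ["H", "H", "Z", "L", "L", "Z"]


def six_step_table():
    """Return the (A, B, C) half-bridge states for each of the six steps."""
    table = []
    for step in range(6):
        a = PATTERN[step % 6]
        b = PATTERN[(step - 2) % 6]   # phase B lags A by 2 steps (120 degrees)
        c = PATTERN[(step - 4) % 6]   # phase C lags A by 4 steps (240 degrees)
        table.append((a, b, c))
    return table


for step, (a, b, c) in enumerate(six_step_table()):
    print(f"step {step}: A={a} B={b} C={c}")
```

In every step exactly one phase is driven high, one is driven low and one floats. The floating phase is the one whose back EMF can be measured for sensorless control.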

For sine wave driving, you need fast PWM for each of the three half
bridges to generate three sine waves at 120° phase differences. The PWM frequency has to be high enough so that after the filtering in the motor windings, you have little in the way of harmonics.

Generally, however, instead of actively producing sine waves, you do
what is known as "vector control" - you measure the currents in the
three branches, and use the angle data to convert these to currents perpendicular to and aligned to the motor position. You then regulate
the PWM values to control these two currents - aiming to get the desired current in the active direction, and zero current perpendicular to it
(since that is just wasted effort). The resulting waveforms are
somewhat distorted sine waves.
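The conversion step in vector control is usually done with the Clarke and Park transforms. The post does not name them, so treat the sketch below as the textbook formulation rather than as what any particular drive does. It uses the amplitude-invariant form and the common d/q naming, where d is aligned with the rotor angle and q is at right angles to it.

```python
import math


def clarke(ia, ib, ic):
    """Three phase currents -> stationary two-axis (alpha, beta) currents."""
    alpha = (2 * ia - ib - ic) / 3
    beta = (ib - ic) / math.sqrt(3)
    return alpha, beta


def park(alpha, beta, theta):
    """Rotate (alpha, beta) by the electrical rotor angle theta into (d, q)."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q


# Example: balanced currents of amplitude 2.0 leading the rotor by 90 degrees.
theta = 0.7                      # arbitrary electrical rotor angle in radians
phi = theta + math.pi / 2
ia = 2.0 * math.cos(phi)
ib = 2.0 * math.cos(phi - 2 * math.pi / 3)
ic = 2.0 * math.cos(phi + 2 * math.pi / 3)
d, q = park(*clarke(ia, ib, ic), theta)
print(f"d = {d:.6f}, q = {q:.6f}")
```

For this input all of the current shows up on one axis and essentially none on the other. A regulator (typically a pair of PI loops, not shown here) then adjusts the PWM so that the measured pair tracks the desired pair.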

An msp430 is fine for trapezoidal control and hall effect sensors, but a
bit underpowered for serious sine wave or vector control. You are
better with a Cortex-M4 for motors.

Sine waves give low noise, but less power;

Sine waves are closer to the ideal for many motors, but you'll get even
lower noise with good vector control.

You can also try adding some third harmonic - use sin(x) + 1/9 sin(3x).
The third harmonic disappears in the motor, since it affects all three
phases equally. But it flattens out the peaks of the sine wave and lets
you then increase the amplitude by about 12.5% before hitting 100% of
your DC bus voltage.
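Both claims are easy to check numerically. The Python sketch below scans one period of sin(x) + 1/9 sin(3x) for its peak, and compares the voltage between two phases with and without the third harmonic. The function names are made up for this illustration.

```python
import math


def phase_voltage(x, k):
    """Normalised phase voltage with a third harmonic of relative size k."""
    return math.sin(x) + k * math.sin(3 * x)


def peak(k, samples=100_000):
    """Largest magnitude of the phase voltage over one period."""
    return max(
        abs(phase_voltage(2 * math.pi * i / samples, k)) for i in range(samples)
    )


def line_to_line(x, k):
    """Voltage between two phases that are 120 degrees apart."""
    return phase_voltage(x, k) - phase_voltage(x - 2 * math.pi / 3, k)


k = 1 / 9
p = peak(k)
headroom = 1 / p - 1
print(f"peak = {p:.4f}, headroom = {headroom:.1%}")

# The third harmonic is identical in all three phases, so it drops out
# of the voltage the motor actually sees between any two terminals.
worst = max(
    abs(line_to_line(x / 100, k) - line_to_line(x / 100, 0.0)) for x in range(629)
)
print(f"largest line-to-line difference: {worst:.2e}")
```

The peak comes out at 8/9, so the fundamental can be scaled up by 9/8, which is the 12.5% mentioned above, and the line-to-line difference is zero to within floating-point rounding.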

Square waves are noisier and only work well at low RPM,
    but have higher torque.

Square waves are a really bad idea - you jump between high torque and
low torque, and will regularly be pulling the motor back a bit rather
than forwards. Prefer trapezoidal control - it is just as easy, and
works vastly better. You of course get more torque ripple than with
sine waves or vector control.

Sawtooth waves seem to work well at higher RPMs.
    Well, sorta, more like sawtooth with alternating sign.

Do you mean trapezoidal control?

Square-Root Sine: Intermediate between sine and square.
    Gives torque more like a square wave, but quieter.

That's just weird. I think what you are seeing is something similar to
the shape you get from vector control.

    Trapezoid waves are similar to this, but more noise.

Seemingly, one "better" option might be to mutate the wave-shape between Square-Root-Sine and sawtooth depending on the target RPM. Also dropping
the wave amplitude at lower RPMs (at low RPMs motors pull more amperage
and thus generate a lot of heat otherwise).

Of course you have lower average voltage at lower speeds and torques -
that's why you use PWM to control your wave amplitudes.

And whenever you have a frequency inverter, the input to the inverter
is first rectified to DC, then new AC waveforms are generated using
PWM controlled semiconductor switches.

Yes:
Dual-phase: may use a "Dual H-Bridge" configuration
    Where, the H-bridge is built using power transistors;
Three-phase: "Triple Half-Bridge"
    Needs fewer transistors than dual phase.

It is slightly easier to build these drivers with BJTs or Darlington transistors, but these tend to handle less power and generate more heat,
but are more fault tolerant.

MOSFETs can handle more power, but one needs to be very careful not to exceed the Gate-Source voltage limit, otherwise they are insta-dead (and will behave as if they are shorted).

FETs are always used (until voltage or current requirements force you to
use IGBTs) - no one uses BJTs or Darlingtons in real motor control. You
need an appropriate gate driver for the FETs, but those are common and
cheap, and usually combined with deadtime control to avoid accidental shoot-through when you enable the high side and low side together.
Modules that combine all this with three half-bridges are also small and cheap.

--- Synchronet 3.21a-Linux NewsLink 1.2
