• Re: VAX

    From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Tue Aug 5 21:13:50 2025
    From Newsgroup: comp.arch

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is a language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, gcc happens to be a
    widely used production compiler. I don't know why
    this time they chose the less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    (I don't know that to be true; an extension has to be documented other
    than by omission. But anyway, if the GCC documentation says somewhere
    something like, "no other identifier is reserved in this version of
    GCC", then it means that the remaining portions of the reserved
    namespaces are available to the program. Since it is undefined behavior
    to use those identifiers (or in certain ways in certain circumstances,
    as the case may be), being able to use them with the documentation's
    blessing constitutes use of a documented extension.)

    I would guess, up until this calendar year.
    Introducing a new extension without a way to disable it is different from supporting gradually introduced extensions, typically with names that
    start with a double underscore and often with __builtin.

    __builtin is also in a standard-defined reserved namespace: the double
    underscore namespace. It is no more or less conservative to name
    something __bitInt than _BitInt.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 6 00:21:25 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...

    Terje



    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs that is extremely rare.
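
    A minimal C sketch of the idea (assuming gcc/clang's unsigned __int128 and
    __builtin_expect; the function and variable names are illustrative, not
    from the thread). Each word's carry-out is taken from a[i]+b[i]+c[i]
    alone, so the common path does not depend on the previous word's full
    sum; the rare correction hides behind a branch the CPU would mispredict:

        #include <stdint.h>
        #include <stddef.h>

        /* r[] = a[] + b[] + c[], n 64-bit words, least significant first. */
        static void add3_predicted(uint64_t *r, const uint64_t *a,
                                   const uint64_t *b, const uint64_t *c,
                                   size_t n)
        {
            unsigned carry = 0;                       /* carry into word i */
            for (size_t i = 0; i < n; i++) {
                unsigned __int128 s = (unsigned __int128)a[i] + b[i] + c[i];
                unsigned predicted = (unsigned)(s >> 64); /* ignores carry-in */
                s += carry;
                r[i] = (uint64_t)s;
                carry = predicted;                    /* chain-breaking common path */
                if (__builtin_expect((unsigned)(s >> 64) != predicted, 0))
                    carry = (unsigned)(s >> 64);      /* rare fix-up: true carry-out */
            }
        }

    The result is still exact; the point is only that the value fed to the
    next iteration normally comes from a sum that excludes the incoming
    carry, turning the word-to-word data dependency into a control
    dependency that can be speculated around.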







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Tue Aug 5 21:25:17 2025
    From Newsgroup: comp.arch

    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a
    library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    It would be unthinkable for GCC to introduce, say, an extension
    using the identifier __libc_malloc.

    In addition to libraries, if some other important project that serves as
    a base package in many distributions happens to claim identifiers in
    those spaces, it wouldn't be wise for GCC (or the C libraries) to start
    taking them away.

    You can't just rename the identifier out of the way in the offending
    package, because that only fixes the issue going forward. Older versions
    of the package can't be compiled with the new compiler without a patch. Compiling older things with newer GCC happens.

    There are always the questions:

    1. Is there an issue? Is anything broken?

    2. If so, is what is broken important enough that it becomes a showstopper
    if the compiler change is rolled out (are major distros on fire)?
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Tue Aug 5 17:41:30 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.

    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 00:49:21 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 17:24:34 +0200, Terje Mathisen wrote:

    ... the problem was all the programs ported from unix which assumed
    that any negative return value was a failure code.

    If the POSIX API spec says a negative return for a particular call is an error, then a negative return for that particular call is an error.

    I can’t imagine this kind of thing blithely being carried over to any non-POSIX API calls.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Tue Aug 5 19:14:48 2025
    From Newsgroup: comp.arch

    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to
    _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code? My best guess is that there is no such code, that
    the only real world uses of the name _BitInt are deliberate uses of the
    new C23 feature, and that gcc's support of _BitInt in non-C23 mode
    will not break anything.

    It is of course possible that I'm wrong.

    If the name _BitInt did break (non-portable) existing C code, then the
    fault would lie with the C committee, not with the gcc maintainers.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Tue Aug 5 20:15:11 2025
    From Newsgroup: comp.arch

    On 8/5/25 17:59, Lawrence D'Oliveiro wrote:
    On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move towards
    workstations within the same company (which would be HARD).

    None of the companies which tried to move in that direction were
    successful. The mass micro market had much higher volumes and lower
    margins, and those accustomed to lower-volume, higher-margin operation
    simply couldn’t adapt.

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.


    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could have done to
    the PC market, if IBM could have been prevented from gaining such market
    dominance.

    IBM had massive marketing clout in the mainframe market. I think that was
    the basis on which customers gravitated to their products. And remember,
    the IBM PC was essentially a skunkworks project that totally went against
    the entire IBM ethos. Internally, it was seen as a one-off mistake that
    they determined never to repeat. Hence the PS/2 range.

    DEC was bigger in the minicomputer market. If DEC could have offered an open-standard machine, that could have offered serious competition to IBM. But what OS would they have used? They were still dominated by Unix-haters then.

    VMS was a heckuva good OS.


    A bit like the /360 strategy, offering a wide range of machines (or CPUs
    and systems) with different performance.

    That strategy was radical in 1964, less so by the 1970s and 1980s. DEC,
    for example, offered entire ranges of machines in each of its various minicomputer families.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Wed Aug 6 04:31:59 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library using _CreamPuff as an identifier, or of a compiler which misbehaves when a
    program uses it, on grounds of it being undefined behavior. Someone
    using _CreamPuff in their code is taking a risk that is vanishingly
    small, the same way that introducing _BitInt is a risk that is
    vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced some identifier is vastly larger than the audience of implementations that a
    given program will face that has introduced some funny identifier.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 05:37:32 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    The plurality of embedded systems are 8 bit processors - about 40
    percent of the total. They are largely used for things like industrial automation, Internet of Things, SCADA, kitchen appliances, etc.

    I believe heart pacemakers run with a 6502 (well, 65C02)

    16 bit processors account for a small, and shrinking, percentage. 32 bit is
    next (IIRC ~30-35%), but 64 bit is the fastest growing. Perhaps surprisingly, there
    is still a small market for 4 bit processors for things like TV remote controls, where battery life is more important than the highest performance.

    There is far more to the embedded market than phones and servers.

    Also, the processors which run in earphones etc...

    Does anybody have an estimate how many CPUs humanity has made
    so far?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 05:50:11 2025
    From Newsgroup: comp.arch

    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount of UNIX, as proof of superiority. The UNIX people
    saw it differently...

    But the _real_ killer application for UNIX wasn't writing patents,
    it was phototypesetting speeches for the CEO of AT&T, who, for
    reasons of vanity, did not want to wear glasses, and it was possible
    to scale the output of the phototypesetter so he would be able
    to read them.

    After somebody pointed out that having confidential speeches on
    one of the most well-known machines in the world, where loads of
    people had dial-up access, was not a good idea, his secretary got
    her own PDP-11 for that.

    And with support from that high up, the project flourished.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 6 05:53:22 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 06:20:57 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 05:37:32 -0000 (UTC), Thomas Koenig wrote:

    Does anybody have an estimate how many CPUs humanity has made so far?

    More ARM chips are made each year than the entire population of the Earth.

    I think RISC-V has also achieved that status.

    Where are they all going??
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 07:28:52 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 05:50:11 -0000 (UTC), Thomas Koenig wrote:

    Using UNIX faced stiff competition from AT&T's internal IT people, who
    wanted to run DEC's operating systems on all PDP-11 within the company (basically, they wanted to kill UNIX).

    But because AT&T controlled Unix, they were able to mould it like putty to their own uses. E.g. look at the MERT project which supported real-time
    tasks (as needed in telephone exchanges) besides conventional Unix ones.
    No way they could do this with an outside proprietary system, like those
    from DEC.

    AT&T also created its own hardware (the 3B range) to complement the
    software in serving those high-availability needs.

    But the _real_ killer application for UNIX wasn't writing patents, it
    was phototypesetting speeches for the CEO of AT&T, who, for reasons of vanity, did not want to wear glasses, and it was possible to scale the
    output of the phototypesetter so he would be able to read them.

    Heck, no. The biggest use for the Unix documentation tools was in the
    legal department, writing up patent applications. troff was just about the only software around that could do automatic line-numbering, which was
    crucial for this purpose.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,comp.lang.c on Wed Aug 6 11:48:09 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 04:31:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com>
    wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring
    to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will
    break any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.


    Exactly.
    The World is a very big place. Even nowadays it is not completely
    transparent. Even those parts that are in theory publicly visible have not
    necessarily been observed recently by any single person, even
    if the person in question is Keith.
    Besides, according to my understanding the majority of gcc users haven't yet
    migrated to gcc 14 or 15.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library
    using _CreamPuff as an identifier, or of a compiler which misbehaves
    when a program uses it, on grounds of it being undefined behavior.
    Someone using _CreamPuff in their code is taking a risk that is
    vanishingly small, the same way that introducing _BitInt is a risk
    that is vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced
    some identifier is vastly larger than the audience of implementations
    that a given program will face that has introduced some funny
    identifier.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 10:24:49 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Of all the major OSes for Alpha, Windows NT was the only one
    that couldn’t take advantage of the 64-bit architecture.

    Actually, Windows took good advantage of the 64-bit architecture:
    "64-bit Windows was initially developed on the Alpha AXP." <https://learn.microsoft.com/en-us/previous-versions/technet-magazine/cc718978(v=msdn.10)>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Wed Aug 6 10:48:51 2025
    From Newsgroup: comp.arch

    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount of UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Bell Telephone's computer center was basically an IBM shop
    before Unix was written, having written BESYS for the IBM 704,
    for instance. They made investments in GE machines around the
    time of the Multics project (e.g., they had a GE 645 and at
    least one 635). The PDP-11 used for Unix was so new that they
    had to wait a few weeks for its disk to arrive.

    Unix escaped out of research, and into the larger Bell System,
    via the legal department, as has been retold many times. It
    spread widely internally after that. After divestiture, when
    AT&T was freed to be able to compete in the computer industry,
    it was seen as a strategic asset.

    But the _real_ killer application for UNIX wasn't writing patents,
    it was phototypesetting speeches for the CEO of AT&T, who, for
    reasons of vanity, did not want to wear glasses, and it was possible
    to scale the output of the phototypesetter so he would be able
    to read them.

    After somebody pointed out that having confidential speeches on
    one of the most well-known machines in the world, where loads of
    people had dial-up access, was not a good idea, his secretary got
    her own PDP-11 for that.

    And with support from that high up, the project flourished.

    While it is true that Charlie Brown's office got a Unix system
    of their own to run troff because its output scaled to large
    sizes, the speeches weren't the data they were worried about
    protecting: those were records from AT&T board meetings.

    At the time, the research PDP-11 used for Unix at Bell Labs was
    not one of the, "most well-known machines in the world, where
    loads of people had dial-up access" in any sense; in the grand
    scheme of things, it was pretty obscure, and had a few dozen
    users. But it was a machine where most users had "root" access,
    and it was agreed that these documents shouldn't be on the
    research machine out of concern for confidentiality.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 10:32:39 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    ILP64 for Cray is documented in <https://en.cppreference.com/w/c/language/arithmetic_types.html>. For
    short int, I don't have a direct reference, only the statement

    |Firstly there was the word size, one rather large size fitted all,
    |integers and floats were represented in 64 bits

    <https://cray-history.net/faq-1-cray-supercomputer-families/faq-3/>

    For the 8-bit characters I found a reference (maybe somewhere else in
    that document), but I do not find it at the moment.

    Followups set to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 6 11:10:46 2025
    From Newsgroup: comp.arch

    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 11:05:30 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t

    On what keywords should these types be based? That's up to the
    implementor. In C23 one could

    typedef signed _BitInt(16) int16_t

    etc. Around 1990, one would have just followed the example of "long
    long" of accumulating several modifiers. I would go for 16-bit
    "short" and 32-bit "long short".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 13:48:17 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 5 Aug 2025 17:24:34 +0200, Terje Mathisen wrote:

    ... the problem was all the programs ported from unix which assumed
    that any negative return value was a failure code.

    If the POSIX API spec says a negative return for a particular call is an error, then a negative return for that particular call is an error.

    Please find a single POSIX API that says a negative return is an error.

    You won't have much success. POSIX explicitly states in most
    cases that the API returns -1 on error (mmap returns MAP_FAILED,
    which happens to be -1 on most implementations; regardless a
    POSIX application _must_ check for MAP_FAILED, not a negative
    return value).
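
    For example (a minimal sketch of the point above, not from the post;
    MAP_ANONYMOUS is near-universal but historically an extension):

        #include <stddef.h>
        #include <stdio.h>
        #include <sys/mman.h>

        void *map_one_page(size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {   /* correct: test the documented sentinel */
                perror("mmap");
                return NULL;
            }
            /* Testing "(intptr_t)p < 0" instead would be wrong: a perfectly
               valid mapping may sit in the upper half of the address space. */
            return p;
        }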

    More misinformation from LDO.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 6 16:19:11 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?

    No, you are not. I skipped pretty much all the setup code. :-)


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...


    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs that is extremely rare.

    Aha!

    That's _very_ nice.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 10:23:26 2025
    From Newsgroup: comp.arch

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.
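
    A behavioral sketch of that banking trick (a C model, not RTL; the names
    are illustrative): two 1R/W banks hold identical copies of the registers,
    each read port is served by its own bank, and every write is broadcast
    to both banks so the copies never diverge:

        #include <stdint.h>

        typedef struct {
            uint32_t bank_a[16];    /* copy serving read port A */
            uint32_t bank_b[16];    /* copy serving read port B */
        } regfile_2r1w;

        static void rf_read2(const regfile_2r1w *rf, unsigned ra, unsigned rb,
                             uint32_t *va, uint32_t *vb)
        {
            *va = rf->bank_a[ra];   /* two different registers can be read */
            *vb = rf->bank_b[rb];   /* in the same cycle, one per bank     */
        }

        static void rf_write(regfile_2r1w *rf, unsigned rd, uint32_t val)
        {
            rf->bank_a[rd] = val;   /* the single write goes to both banks */
            rf->bank_b[rd] = val;   /* at the same address                 */
        }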

    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780 second read port is likely used the same as later VAXen,
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.arch,comp.lang.c on Wed Aug 6 11:54:57 2025
    From Newsgroup: comp.arch

    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.arch,comp.lang.c on Wed Aug 6 11:56:04 2025
    From Newsgroup: comp.arch

    On 2025-08-05 17:25, Kaz Kylheku wrote:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    GCC cannot be implemented in such a way as to create a fully conforming implementation of C when used in connection with an arbitrary
    implementation of the C standard library. This is just one example of a
    more general potential problem: Both gcc and the library must use some
    reserved identifiers, and they might have made conflicting choices.
    That's just one example of the many things that might prevent them from
    being combined to form a conforming implementation of C. It doesn't mean
    that either one is defective. It does mean that the two groups of
    implementors should consider working together to resolve the conflicts.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 16:35:23 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount of UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Hmm... I _think_ it was in a talk given by the UNIX people,
    but I may be misremembering.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 16:34:55 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    The same happened to some extent with the early amd64 machines, which
    ended up running 32bit Windows and applications compiled for the i386
    ISA. Those processors were successful mostly because they were fast at running i386 code (with the added marketing benefit of being "64bit
    ready"): it took 2 years for MS to release a matching OS.

    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:12:32 2025
    From Newsgroup: comp.arch

    On 8/6/2025 6:05 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t


    Well, assuming a post C99 world.


    On what keywords should these types be based? That's up to the
    implementor. In C23 one could

    typedef signed _BitInt(16) int16_t


    Possible, though one can realize that _BitInt(16) is not equivalent to a normal 16-bit integer.

    _BitInt(16) sa, sb;
    _BitInt(32) lc;
    sa=0x5678;
    sb=0x789A;
    lc=sa+sb;

    Would give:
    0xFFFFCF12
    Rather than 0xCF12 (as would be expected with 'short' or similar).

    Because _BitInt(16) would not auto-promote before the addition, but
    rather would produce a _BitInt(16) result which is then widened to 32
    bits via sign extension.
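
    A compilable version of the above (gcc >= 14; note that overflowing a
    signed _BitInt is formally undefined behavior, so the 0xFFFFCF12 relies
    on the wrap-around that gcc produces in practice):

        #include <stdio.h>

        int main(void)
        {
            /* Plain short: both operands promote to int, so the addition
               happens in at least 32 bits and never overflows 16 bits. */
            short ha = 0x5678, hb = 0x789A;
            int   hc = ha + hb;

            /* _BitInt(16): no promotion; the addition is done in 16 bits
               and the 16-bit result is sign-extended to _BitInt(32). */
            _BitInt(16) sa = 0x5678, sb = 0x789A;
            _BitInt(32) lc = sa + sb;

            printf("short      : 0x%X\n", (unsigned)hc);  /* 0xCF12     */
            printf("_BitInt(16): 0x%X\n", (unsigned)lc);  /* 0xFFFFCF12 */
            return 0;
        }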


    etc. Around 1990, one would have just followed the example of "long
    long" of accumulating several modifiers. I would go for 16-bit
    "short" and 32-bit "long short".


    OK.

    Apparently at least some went for "__int32" instead.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 18:22:03 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 8/6/2025 6:05 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t


    Well, assuming a post C99 world.

    'typedef' was around long before C99 happened to
    standardize the aforementioned typedefs.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:12:30 2025
    From Newsgroup: comp.arch

    On 8/6/25 09:47, Anton Ertl wrote:


    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    I'm a time-traveler from the 1960s!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 6 20:06:00 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Wed Aug 6 13:58:51 2025
    From Newsgroup: comp.arch

    James Kuyper <jameskuyper@alumni.caltech.edu> writes:
    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.

    gcc 13.4.0 does not recognize _BitInt at all.

    gcc 14.2.0 handles _BitInt as a language feature in C23 mode,
    and as an "extension" in pre-C23 modes.

    It warns about _BitInt with "-std=c17 -pedantic", but not with
    just "-std=c17". I think I would have preferred a warning with
    "-std=c17", but it doesn't bother me. There's no mention of _BitInt
    as an extension or feature in the documentation. An implementation
    is required to document the implementation-defined value of
    BITINT_MAXWIDTH, so that's a conformance issue. In pre-C23 mode,
    since it's not documented, support for _BitInt is not formally an
    "extension"; it's an allowed behavior in the presence of code that
    has undefined behavior due to its use of a reserved identifier.
    (This is a picky language-lawyerly interpretation.)
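
    A one-file probe of that behavior (the command lines merely restate what
    is observed above for gcc 14.2.0, nothing more):

        /* bitint_probe.c */
        _BitInt(24) x = 1;

        /* gcc -std=c23 -c bitint_probe.c            accepted, no warning
           gcc -std=c17 -c bitint_probe.c            accepted quietly
           gcc -std=c17 -pedantic -c bitint_probe.c  warns about _BitInt */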
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 17:00:03 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.
    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Arrays (FPLAs, an AND-OR matrix) were available in 1975. Mask programmable PLAs were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 6 21:14:07 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.


    Signetics 82S100/101 Field Programmable Logic Arrays (FPLAs, an AND-OR matrix) were available in 1975. Mask programmable PLAs were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
    would have been far to expensive to use to build a RISC CPU,
    especially for one of the BUNCH, for whom backward compatability was
    paramount.

    [*] The machine (Unisys V530) sold for well over a megabuck in
    a single processor configuration.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 17:57:03 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Arrays (FPLAs, an AND-OR matrix) were available in 1975. Mask programmable PLAs were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lars Poulsen@lars@cleo.beagle-ears.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:12:26 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol processing.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:15:54 2025
    From Newsgroup: comp.arch

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't declare its variables to use those word sizes, I don't think "portable" is the right term.

    Perhaps "happens to work on some computers similar to the one it was originally written on."
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lars Poulsen@lars@cleo.beagle-ears.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:32:47 2025
    From Newsgroup: comp.arch

    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't declare its variables to use those word sizes, I don't think "portable" is the
    right term.

    My concern is: how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16-bit type?
    Or did the compiler have native types __int16 etc?
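
    (For illustration, the typedef idiom in question looks roughly like
    the sketch below; the type names and the CRAY macro are only
    assumptions for the example, and the point is that on a Cray-style C
    there is nothing of the right width to put on the right-hand side.)

        /* Typical pre-C99 fixed-width typedefs, selected per platform.
           A minimal sketch: the names and the CRAY macro are illustrative. */
        typedef signed char  int8;        /* assumes char is 8 bits   */
        #ifndef CRAY
        typedef short        int16;       /* assumes short is 16 bits */
        typedef int          int32;       /* assumes int is 32 bits   */
        #else
        /* On a Cray-style C where short, int and long are all 64 bits,
           there is simply no native type to alias int16 or int32 to.  */
        #endif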

    - Lars
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:38:15 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:32:39 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine ...

    But it was not byte-addressable. Its precursor CDC machines had 60-bit
    words, as I recall. DEC’s “large systems” family from around that era (PDP-6, PDP-10) had 36-bit words. And there were likely some other vendors offering 48-bit words, that kind of thing. Maybe some with word lengths
    even longer than 64 bits.

    I was thinking more specifically of machines from the byte-addressable
    era.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:40:48 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:24:49 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Of all the major OSes for Alpha, Windows NT was the only one that
    couldn’t take advantage of the 64-bit architecture.

    Actually, Windows took good advantage of the 64-bit architecture:
    "64-bit Windows was initially developed on the Alpha AXP." <https://learn.microsoft.com/en-us/previous-versions/technet-magazine/cc718978(v=msdn.10)>

    Remember the Alpha was first released in 1992. No shipping version of
    Windows NT ever ran on it in anything other than “TASO” (“Truncated Address-Space Option”, i.e. 32-bit-only addressing) mode.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Wed Aug 6 23:43:12 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:

    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA, which
    used PALs which were not available to the VAX 11/780 designers, so it
    could be clocked a bit higher, but at a multiple of the performance
    of the VAX.

    So, Anton visiting DEC or me visiting Data General could have brought
    them a technology which would significantly outperformed the VAX
    (especially if we brought along the algorithm for graph coloring. Some
    people at IBM would have been peeved at having somebody else "develop"
    this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR
    matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch,alt.folklore.computers on Wed Aug 6 20:21:31 2025
    From Newsgroup: comp.arch

    Robert Swindells wrote:
    On Wed, 06 Aug 2025 14:00:56 GMT, Anton Ertl wrote:

    For comparison:

    SPARC: Berkeley RISC research project between 1980 and 1984;
    <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
    801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
    in May 1982 (but could only run at 0.5MHz). No date for the completion
    of RISC-II, but given that the research project ended in 1984, it was
    probably at that time. Sun developed Berkeley RISC into SPARC, and the
    first SPARC machine, the Sun-4/260 appeared in July 1987 with a 16.67MHz
    processor.

    The Katevenis thesis on RISC-II contains a timeline on p. 6; it lists fabrication in spring '83 with testing during summer '83.

    There is also a bibliography entry of an informal discussion with John
    Cocke at Berkeley about the 801 in June 1983

    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons:
    Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that
    were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an attempt
    to improve performance. Note that this trend began because of the imbalance
    in speeds; it is not clear that architects have asked themselves whether
    this imbalance still holds for their designs."



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 20:41:44 2025
    From Newsgroup: comp.arch

    EricP wrote:

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    ^^^^
    Oops... typo. Should be FPLA.
    PAL or Programmable Array Logic was a slightly different thing,
    also an AND-OR matrix from Monolithic Memories.

    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    And PAL's too. Whatever works and is cheapest.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Charlie Gibbs@cgibbs@kltpzyxm.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 01:36:50 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Peter Flass <Peter@Iron-Spring.com> wrote:

    On 8/6/25 09:47, Anton Ertl wrote:

    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    I'm a time-traveler from the 1960s!

    I'm starting to tell people that I'm a traveller
    from a distant land known as the past.
    --
    /~\ Charlie Gibbs | Growth for the sake of
    \ / <cgibbs@kltpzyxm.invalid> | growth is the ideology
    X I'm really at ac.dekanfrus | of the cancer cell.
    / \ if you read it the right way. | -- Edward Abbey
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 02:22:05 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 20:21:31 -0400, EricP wrote:

    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer,
    1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several
    reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709
    [Cocke80]. The 701 CPU was about ten times as fast as the core main
    memory; this made any primitives that were implemented as
    subroutines much slower than primitives that were instructions. Thus
    the floating point subroutines became part of the 709 architecture
    with dramatic gains. Making the 709 more complex resulted in an
    advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an
    attempt to improve performance. Note that this trend began because
    of the imbalance in speeds; it is not clear that architects have
    asked themselves whether this imbalance still holds for their
    designs."

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex instructions, and are getting better performance with simpler ones.

    How come? Caching.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Thu Aug 7 10:27:40 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that
    were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an attempt
    to improve performance. Note that this trend began because of the imbalance in speeds; it is not clear that architects have asked themselves whether
    this imbalance still holds for their designs."

    At the start of this thread
    <2025Jul29.104514@mips.complang.tuwien.ac.at>, I made exactly this
    argument about the relation between memory speed and clock rate. In
    that posting, I wrote:

    |my guess is that in the VAX 11/780 timeframe, 2-3MHz DRAM access
    |within a row would have been possible. Moreover, the VAX 11/780 has a
    |cache

    In the meantime, this discussion and some additional searching has
    unearthed that the VAX 11/780 memory subsystem has 600ns main memory
    cycle time (apparently without contiguous-access (row) optimization),
    with the cache lowering the average memory cycle time to 290ns.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Thu Aug 7 11:06:06 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There is a citation to Cocke as "private communication" in 1980 by Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then, many "higher-level" instructions have been added to machines in an attempt to improve performance. Note that this trend began because of the imbalance in speeds; it is not clear that architects have asked themselves whether this imbalance still holds for their designs."

    At the start of this thread
    <2025Jul29.104514@mips.complang.tuwien.ac.at>, I made exactly this
    argument about the relation between memory speed and clock rate. In
    that posting, I wrote:

    |my guess is that in the VAX 11/780 timeframe, 2-3MHz DRAM access
    |within a row would have been possible. Moreover, the VAX 11/780 has a
    |cache

    In the meantime, this discussion and some additional searching has
    unearthed that the VAX 11/780 memory subsystem has 600ns main memory
    cycle time (apparently without contiguous-access (row) optimization),

    The memory subsystem was able to operate at bus speed: during a memory
    cycle the memory delivered 64 bits. The bus was 32 bits wide and needed 3 cycles
    (200 ns each) to transfer those 64 bits. Making memory faster would
    require redesigning the bus.

    with the cache lowering the average memory cycle time to 290ns.

    For the processor the miss penalty was 1800 ns (documentation says that
    was due to bus protocol overhead). Cache hit rate was claimed
    to be 95%.
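
    (As a back-of-the-envelope check, assuming the 1800 ns figure is the
    total miss service time and a hit costs one 200 ns cycle, both of
    which are assumptions on my part:)

        #include <stdio.h>

        /* Rough check of the average access time implied by the figures
           above: 95% hit rate, 200 ns on a hit (assumed), 1800 ns total
           on a miss (assumed). */
        int main(void)
        {
            double hit_rate = 0.95;
            double hit_ns   = 200.0;
            double miss_ns  = 1800.0;
            double avg = hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns;
            printf("average access time: %.0f ns\n", avg);  /* 280 ns */
            return 0;
        }

    That comes out at 280 ns, in the same ballpark as the 290 ns average
    memory cycle time quoted above.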
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 10:47:50 2025
    From Newsgroup: comp.arch

    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 7 11:16:20 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever bulit one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    Russians in the late sixties proposed graph coloring as a way of
    memory allocation (and proved that optimal allocation is
    equivalent to graph coloring). They also proposed heuristics
    for graph coloring and experimentally showed that they
    are reasonably effective. This is not the same thing as
    register allocation, but the connection is rather obvious.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 7 11:29:46 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    IIUC the description of the IBM 360-85, it had a pipeline which was much
    more aggressively clocked than the VAX. The 360-85 probably used ECL, but
    at VAX clock speeds it should be easily doable in Schottky TTL
    (used in the VAX).

    The question is could one build this at a commercially competitive price?

    Yes.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:21:56 2025
    From Newsgroup: comp.arch

    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type would make it very hard to adapt portable code, such as TCP/IP protocol processing.
    ...
    My concern is: how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16-bit type?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.
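
    (Something along these lines, as a minimal sketch; the field position
    is invented for the example, and unsigned long is assumed to be at
    least 64 bits, as on the Cray:)

        /* Pick 16- and 32-bit fields out of a 64-bit word using only
           shifts and masks, so nothing depends on a 16- or 32-bit type
           existing.  The field position is invented for illustration. */
        #define FIELD16(word, bitpos)  (((word) >> (bitpos)) & 0xFFFFUL)
        #define FIELD32(word, bitpos)  (((word) >> (bitpos)) & 0xFFFFFFFFUL)

        unsigned long source_port(unsigned long first_word)
        {
            /* e.g. a 16-bit port number assumed to sit in bits 48..63 */
            return FIELD16(first_word, 48);
        }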

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:38:54 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price? There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.
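
    (Very roughly, and not the actual R2000 refill sequence: the software
    path amounts to something like the sketch below, where the trap
    handler indexes a page table and installs the missing entry;
    tlb_write() and the one-level table are assumptions for illustration.)

        /* Hypothetical software TLB refill: the CPU traps here on a TLB
           miss, the handler walks a one-level page table in memory and
           installs the missing translation.  The real R2000 handler is a
           few instructions using dedicated coprocessor registers. */
        #define PAGE_SHIFT 12

        extern unsigned long page_table[];   /* one PTE per virtual page */
        extern void tlb_write(unsigned long vpn, unsigned long pte);

        void tlb_miss_handler(unsigned long fault_vaddr)
        {
            unsigned long vpn = fault_vaddr >> PAGE_SHIFT; /* virtual page number */
            tlb_write(vpn, page_table[vpn]);               /* install, then retry */
        }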

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:59:35 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs would have been far too expensive to use to build a RISC CPU,
    would have been far to expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 7 13:34:26 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:

    ...
    My concern is: how do you express your desire for having e.g. an int16? All the portable code I know defines int8, int16, int32 by means of a typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16-bit type?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.

    The more likely solution would be to push the protocol processing
    into an attached I/O processor, in those days.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Thu Aug 7 07:26:32 2025
    From Newsgroup: comp.arch

    On 8/6/25 22:29, Thomas Koenig wrote:


    That is one of the things I find astonishing - how a company like
    DG grew from a kitchen-table affair to the size they had.


    Recent history is littered with companies like this. The microcomputer revolution spawned scores of companies that started in someone's garage, ballooned to major presence overnight, and then disappeared - bankrupt,
    bought out, split up, etc. Look at all the players in the S-100 CP/M
    space, or Digital Research. Only a few, like Apple and Microsoft, made
    it out alive.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 7 15:03:23 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    Because you need to sell it. Without disrupting your existing
    customer base.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Thu Aug 7 08:38:56 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the
    Pentium Pro, Intel has had to implement the x86 instruction set as a
    microcoded program running on top of a simpler RISC architecture.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch,alt.folklore.computers on Thu Aug 7 17:52:05 2025
    From Newsgroup: comp.arch

    John Ames wrote:
    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex
    instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the Pentium Pro, Intel has had to implement the x86 instruction set as a microcoded program running on top of a simpler RISC architecture.

    That's simply wrong:

    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and
    spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch,alt.folklore.computers on Thu Aug 7 21:53:11 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 17:52:05 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    John Ames wrote:
    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex
    instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the
    Pentium Pro, Intel has had to implement the x86 instruction set as a
    microcoded program running on top of a simpler RISC architecture.

    That's simply wrong:

    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and
    spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje

    You say "tomato". 8-)

    It's still "microcode" for some definition ... just not a classic
    "interpreter" implementation where a library of routines implements
    the high level instructions.

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache. If they age out, the decoder has to produce them
    again from the "source" x86 instructions.

    So the core is executing microinstructions - not x86 - and the program
    as executed reasonably can be said to be "microcoded" ... again for
    some definition.

    YMMV.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Fri Aug 8 01:57:53 2025
    From Newsgroup: comp.arch

    In article <107008b$3g8jl$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM or DEC.

    Use of UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11s within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount for UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Hmm... I _think_ it was in a talk given by the UNIX people,
    but I may be misremembering.

    I have heard similar stories about DEC, but not AT&T. The Unix
    fortune file used to (in)famously have a quote from Ken Olsen
    about the relative volume of documentation between Unix and VMS
    (reproduced below).

    - Dan C.

    BEGIN FORTUNE<---

    One of the questions that comes up all the time is: How
    enthusiastic is our support for UNIX?
    Unix was written on our machines and for our machines many
    years ago. Today, much of UNIX being done is done on our machines.
    Ten percent of our VAXs are going for UNIX use. UNIX is a simple
    language, easy to understand, easy to get started with. It's great for students, great for somewhat casual users, and it's great for
    interchanging programs between different machines. And so, because of
    its popularity in these markets, we support it. We have good UNIX on
    VAX and good UNIX on PDP-11s.
    It is our belief, however, that serious professional users will
    run out of things they can do with UNIX. They'll want a real system and
    will end up doing VMS when they get to be serious about programming.
    With UNIX, if you're looking for something, you can easily and
    quickly check that small manual and find out that it's not there. With
    VMS, no matter what you look for -- it's literally a five-foot shelf of documentation -- if you look long enough it's there. That's the
    difference -- the beauty of UNIX is it's simple; and the beauty of VMS
    is that it's all there.
    -- Ken Olsen, President of DEC, 1984

    END FORTUNE<---
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Fri Aug 8 03:57:17 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 07:26:32 -0700, Peter Flass wrote:

    On 8/6/25 22:29, Thomas Koenig wrote:

    That is one of the things I find astonishing - how a company like DG
    grew from a kitche-table affair to the size they had.

    Recent history is littered with companies like this.

    DG were famously the setting for that Tracy Kidder book, “The Soul Of A
    New Machine”, chronicling their belated and high-pressure project to enter the 32-bit virtual-memory supermini market and compete with DEC’s VAX.

    Looking at things with the eyes of a software guy, I found some of their hardware decisions questionable. Like they thought they were very clever
    to avoid having separate privilege modes in the processor status register
    like the VAX did: instead, they encoded the access privilege mode in the address itself.

    I guess they thought that 32 address bits left plenty to spare for
    something like this. But I think it just shortened the life of their 32-
    bit architecture by that much more.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Fri Aug 8 06:16:51 2025
    From Newsgroup: comp.arch

    George Neuner <gneuner2@comcast.net> writes:
    On Thu, 7 Aug 2025 17:52:05 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    John Ames wrote:
    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje

    You say "tomato". 8-)

    It's still "microcode" for some definition ... just not a classic "interpreter" implementation where a library of routines implements
    the high level instructions.

    Exactly, for most instructions there is no microcode. There are
    microops, with 118 bits on the Pentium Pro (P6). They are not RISC instructions (no RISC has 118-bit instructions). At best one might
    argue that one P6 microinstruction typically does what a RISC
    instruction does in a RISC. But in the end the reorder buffer still
    has to deal with the CISC instructions.

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendents until the Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but eventually
    (with Core Duo, Core2 Duo) was replaced with P6 descendents that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Fri Aug 8 11:43:00 2025
    From Newsgroup: comp.arch

    On Fri, 8 Aug 2025 03:57:17 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:


    I guess they thought that 32 address bits left plenty to spare for
    something like this. But I think it just shortened the life of their
    32-bit architecture by that much more.


    History proved them right. The Eagle series didn't last long enough to
    run out of its 512MB address space.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Aug 8 10:08:43 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.
    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    I don't know why they think these are problems with the 82S100.
    These complaints sound like they come from a hobbyist.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton

    Yes. This risc-VAX would have to decode 1 instruction per clock to
    keep a pipeline full, so I envision running the fetch buffer
    through a bank of those PLAs and generating a uOp out.

    I don't know whether the instructions can be byte-aligned and variable-size
    or have to be a fixed 32 bits in order to meet performance requirements.
    I would prefer the flexibility of variable size but
    the Fetch byte alignment shifter adds a lot of logic.

    If variable then the high frequency instructions like MOV rd,rs
    and ADD rsd,rs fit into two bytes. The longest instruction looks like
    12 bytes, 4 bytes operation specifier (opcode plus registers)
    plus 8 bytes immediate FP64.

    If a variable-size instruction arranges that all the critical parse
    information is located in the first 8-16 bits, then we can just run
    those bits through PLAs in parallel and have that control the
    alignment shifter as well as generate the uOp.
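
    (As a sketch of what such a PLA would compute, written in C with a
    made-up encoding in which the first byte alone fixes the total length,
    so the alignment shift amount and the uOp template can be selected in
    parallel:)

        /* Hypothetical length decode for a byte-aligned variable-length
           encoding where the first byte determines instruction size:
           2 bytes for the frequent reg-reg forms (MOV rd,rs / ADD rsd,rs),
           4 bytes for a full operation specifier, 12 bytes for a 4-byte
           specifier plus an 8-byte FP64 immediate.  The opcode ranges are
           invented; the real thing would be a PLA, not C. */
        static int insn_length(unsigned char first_byte)
        {
            if (first_byte < 0x40)        /* two-byte reg-reg group     */
                return 2;
            else if (first_byte < 0xF0)   /* four-byte specifier group  */
                return 4;
            else                          /* specifier + FP64 immediate */
                return 12;
        }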

    I envision this Fetch buffer alignment shifter built from tri-state
    buffers rather than muxes as TTL muxes are very slow and we would
    need a lot of them.

    The whole fetch-parse-decode should fit on a single board.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 19:48:59 2025
    From Newsgroup: comp.arch

    On Fri, 08 Aug 2025 06:16:51 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    George Neuner <gneuner2@comcast.net> writes:

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendents until the Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but eventually
    (with Core Duo, Core2 Duo) was replaced with P6 descendents that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton

    Thanks for the correction. I did a fair amount of SIMD coding for
    Pentium II, III and IV, so was more aware of their architecture. After
    the IV, I moved on to other things so haven't kept up.

    Question:
    It would seem that, lacking the microop cache, the decoder would need
    to be involved, e.g., for every iteration of a loop, and there would
    be more pressure on I$1. Did these prove to be a bottleneck for the
    models lacking cache? [either? or something else?]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 21:43:11 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:23:26 -0400, EricP
    <ThatWouldBeTelling@thevillage.com> wrote:

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. To me the most likely is fast RAM, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where
    destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.
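
    (In C terms the two-bank trick looks roughly like the following;
    purely a behavioural model with invented names:)

        /* Behavioural model of a 2R1W register file built from two 1RW
           banks: each read port reads its own bank, and a write goes to
           both banks at the same index, so the banks always hold
           identical contents. */
        #define NREGS 16

        static unsigned long bank_a[NREGS], bank_b[NREGS];

        void regfile_cycle(int ra, int rb,               /* read addresses */
                           unsigned long *da, unsigned long *db,
                           int we, int rw, unsigned long wdata)
        {
            *da = bank_a[ra];            /* read phase: one bank per port */
            *db = bank_b[rb];
            if (we) {                    /* write phase: both banks       */
                bank_a[rw] = wdata;
                bank_b[rw] = wdata;
            }
        }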

    I was aware of this (thank you), but I was trying to figure out why
    the VAX - particularly early ones - would need it. And also it does
    not mesh with Waldek's comment [at top] about 3 copies.


    The VAX did have one (pathological?) address mode:

    displacement deferred indexed @dis(Rn)[Rx]

    in which Rn and Rx could be the same register. It is the only mode
    for which a single operand could reference a given register more than
    once. I never saw any code that actually did this, but the manual
    does say it was possible.

    But even with this situation, it seems that the register would only
    need to be read once (per operand, at least) and the value could be
    used twice.
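
    (For reference, the effective-address calculation for that mode works
    out to roughly the sketch below, with read_long() standing in for a
    32-bit memory read; note that Rn and Rx each need to be read only
    once even when they name the same register, which is the point above.)

        /* Effective address of the VAX mode @dis(Rn)[Rx] (displacement
           deferred, indexed): the longword at Rn+dis holds a base
           address, to which Rx scaled by the operand size is added.
           This is a behavioural sketch only. */
        extern unsigned int read_long(unsigned int addr);

        unsigned int ea_disp_deferred_indexed(unsigned int rn, int dis,
                                              unsigned int rx, int opsize)
        {
            unsigned int base = read_long(rn + dis);  /* one read of Rn */
            return base + rx * opsize;                /* one read of Rx */
        }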


    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780 second read port is likely used the same as later VAXen:
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDLL (r2)+, (r2)+, (r2)+

    You mean exceptions? Exceptions were handled between instructions.
    VAX had no iterating string-copy/move instructions so every
    instruction logically could stand alone.

    VAX separately identified the case where the instruction completed
    with a problem (trap) from where the instruction could not complete
    because of the problem (fault), but in both cases it indicated the
    offending instruction.


    the first (left) operand reads r2 then adds 4; the second operand reads
    that updated r2 and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.
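
    (To spell out the semantics being preserved here, a behavioural sketch
    in C; mem[] stands in for longword memory, addresses are assumed
    longword-aligned, and the temp-register bookkeeping that makes the
    instruction restartable is left out:)

        /* Behavioural effect of the three-operand add above with all
           operands (r2)+ : each operand evaluation uses the current r2
           as an address and then bumps it by 4, so the two sources and
           the destination are three consecutive longwords and r2 ends
           up advanced by 12. */
        extern unsigned int mem[];

        void add3_autoincrement(unsigned int *r2)
        {
            unsigned int src1 = mem[*r2 / 4];  *r2 += 4;  /* first  (r2)+        */
            unsigned int src2 = mem[*r2 / 4];  *r2 += 4;  /* second (r2)+        */
            mem[*r2 / 4] = src1 + src2;        *r2 += 4;  /* third  (r2)+ = dest */
        }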

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 08:07:12 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Sat Aug 9 09:04:40 2025
    From Newsgroup: comp.arch

    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not, but that's a LOT of
    speculation with hindsight-colored glasses. Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not. But as with all alternate history, this is
    completely unknowable.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 10:00:54 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    Sure. IBM was in less than no hurry to make a product out of
    the 801.


    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Aug 9 10:03:29 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,
    - with 8 optional XOR output invertors,
    - driving 8 tri-state or open collector buffers.

    So I count roughly 7 or 8 equivalent gate delays.
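
    As a software model, the whole AND-OR plane is easy to mimic; the
    sketch below is a generic FPLA-style model in C with the same
    16/48/8 dimensions. It is not bit-exact to the 82S100's programming
    format, and the programmed example terms are invented.

    /* Rough software model of a small FPLA-style AND-OR plane,
       in the spirit of the 82S100 (16 inputs, 48 product terms, 8 outputs).
       Not bit-exact to the real part; it just illustrates the structure. */
    #include <stdint.h>
    #include <stdio.h>

    #define NPT   48
    #define NOUT   8

    struct pla {
        uint16_t and_true[NPT];   /* product term uses input as-is        */
        uint16_t and_comp[NPT];   /* product term uses input inverted     */
        uint64_t or_plane[NOUT];  /* which product terms feed each output */
        uint8_t  xor_out;         /* per-output programmable inversion    */
    };

    static uint8_t pla_eval(const struct pla *p, uint16_t in)
    {
        uint64_t pt = 0;
        for (int t = 0; t < NPT; t++) {
            /* term is true if every selected true-input is 1 and every
               selected complemented input is 0 */
            int ok = ((in & p->and_true[t]) == p->and_true[t]) &&
                     ((~in & p->and_comp[t]) == p->and_comp[t]);
            if (ok)
                pt |= 1ull << t;
        }
        uint8_t out = 0;
        for (int o = 0; o < NOUT; o++)
            if (pt & p->or_plane[o])
                out |= 1u << o;
        return out ^ p->xor_out;
    }

    int main(void)
    {
        /* tiny example: output 0 = (in0 AND NOT in1), output 1 = in2 */
        struct pla p = {0};
        p.and_true[0] = 0x0001; p.and_comp[0] = 0x0002; p.or_plane[0] = 1ull << 0;
        p.and_true[1] = 0x0004;                         p.or_plane[1] = 1ull << 1;

        printf("%02x\n", pla_eval(&p, 0x0001));  /* in0=1, in1=0      -> 01 */
        printf("%02x\n", pla_eval(&p, 0x0007));  /* in0=in1=in2=1     -> 02 */
        return 0;
    }
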
    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 20:54:07 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.


    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch on Sat Aug 9 14:57:03 2025
    From Newsgroup: comp.arch

    On 8/9/25 1:54 PM, Thomas Koenig wrote:

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs?

    using typicals was a rookie mistake
    also not comparing delay times across vendors


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Sun Aug 10 12:06:46 2025
    From Newsgroup: comp.arch

    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    [snip]
    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    Ok. Sure.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Heh. :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    Absolutely.

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Similarly for other minicomputer companies.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.

    Plausibility is orthogonal to whether a thing is knowable.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Aug 10 15:18:23 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    <snip>

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer. Considerable
    internal resources were being applied to the Jupiter project
    at the end of the 1970s to support a wider range of applications.

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    Fundamentally, 36-bit words ended up being a dead-end.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Aug 10 21:01:50 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started,

    That is true. Reading https://acg.cis.upenn.edu/milom/cis501-Fall11/papers/cocke-RISC.pdf
    is instructive (I liked the potentially tongue-in-cheek "Regular
    Instruction Set-Computer" name for their instruction set).

    and even
    fewer would have believed it absent a working prototype,

    The simulation approach that IBM took is interesting. They built
    a fast simulator, translating one 801 instruction into one (or
    several) /370-instructions on the fly, with a fixed 32-bit size.
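
    Just to illustrate the flavour of such a simulator, here is a
    generic decode-once-then-dispatch sketch in C. The 32-bit mini-ISA
    (ADD/ADDI/HALT) and its encoding are invented for the example; this
    is not IBM's tool and not the 801 encoding.

    /* Decode each fixed-size 32-bit guest instruction once into a host
       handler plus operands, then run through the predecoded form. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ADD = 0, OP_ADDI = 1, OP_HALT = 2 };

    struct decoded {
        void (*handler)(const struct decoded *);
        uint8_t rd, rs1, rs2;
        int16_t imm;
    };

    static int32_t reg[32];
    static int running = 1;

    static void do_add (const struct decoded *d) { reg[d->rd] = reg[d->rs1] + reg[d->rs2]; }
    static void do_addi(const struct decoded *d) { reg[d->rd] = reg[d->rs1] + d->imm; }
    static void do_halt(const struct decoded *d) { (void)d; running = 0; }

    /* made-up guest encoding: op:8 rd:8 rs1:8 rs2-or-imm:8 */
    static void decode(uint32_t insn, struct decoded *d)
    {
        uint8_t op = insn >> 24;
        d->rd  = (insn >> 16) & 0xff;
        d->rs1 = (insn >>  8) & 0xff;
        d->rs2 = insn & 0xff;
        d->imm = (int8_t)(insn & 0xff);
        d->handler = (op == OP_ADD) ? do_add : (op == OP_ADDI) ? do_addi : do_halt;
    }

    int main(void)
    {
        uint32_t guest[] = {
            0x01010005u,  /* ADDI r1, r0, 5  */
            0x01020007u,  /* ADDI r2, r0, 7  */
            0x00030102u,  /* ADD  r3, r1, r2 */
            0x02000000u,  /* HALT            */
        };
        struct decoded prog[4];

        for (int i = 0; i < 4; i++)        /* "translate" once ...        */
            decode(guest[i], &prog[i]);

        for (int pc = 0; running; pc++)    /* ... then run the fast form  */
            prog[pc].handler(&prog[pc]);

        printf("r3 = %d\n", reg[3]);       /* prints 12 */
        return 0;
    }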


    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 11 08:17:48 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    The reasons why this once-common architectural style died out are:

    * 18-bit addresses

    * word addressing

    Sure, one could add 36-bit byte addresses to such an architecture
    (probably with 9-bit bytes to make it easy to deal with words), but it
    would force a completely different ABI and API, so the legacy code
    would still have no good upgrade path and would be limited to its
    256KW address space no matter how much actual RAM there is available.
    IBM decided to switch from this 36-bit legacy to the 32-bit
    byte-addressed S/360 in the early 1960s (with support for their legacy
    lines built into various S/360 implementations); DEC did so when they introduced the VAX.

    Concerning other manufacturers:

    <https://en.wikipedia.org/wiki/36-bit_computing> tells me that the
    GE-600 series was also 36-bit. It continued as Honeywell 6000 series <https://en.wikipedia.org/wiki/Honeywell_6000_series>. Honeywell
    introduced the DPS-88 in 1982; the architecture is described as
    supporting the usual 256KW, but apparently the DPS-88 could be bought
    with up to 128MB; programming that probably was no fun. Honeywell
    later sold the NEC S1000 as DPS-90, which does not sound like the
    Honeywell 6000 line was a growing business. And that's the last I
    read about the Honeywell 6000 line.

    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems. <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:

    |In addition to the IX (1100/2200) CPUs [...], the architecture had
    |Xeon [...] CPUs. Unisys' goal was to provide an orderly transition for
    |their 1100/2200 customers to a more modern architecture.

    So they continued to support it for a long time, but it's a legacy
    thing, not a future-oriented architecture.

    The Wikipedia article also mentions the Symbolics 3600 as 36-bit
    machine, but that was quite different from the 36-bit architectures of
    the 1950s and 1960s: The Symbolics 3600 has 28-bit addresses (the rest apparently taken by tags) and its successor Ivory has 32-bit addresses
    and a 40-bit word. Here the reason for its demise was the AI winter
    of the late 1980s and early 1990s.

    DEC did the right thing when they decided to support VAX as *the*
    future architecture, and the success of the VAX compared to the
    Honeywell 6000 and Univac 1100/2200 series demonstrates this.

    RISC-VAX would have been better than the PDP-10, for the same reasons:
    32-bit addresses and byte addressing. And in addition, the
    performance advantage of RISC-VAX would have made the position of
    RISC-VAX compared to PDP-10 even stronger.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Aug 11 14:51:20 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    In a sense, they still live in the Unisys Clearpath systems.


    The reason why this once-common architectural style died out are:

    * 18-bit addresses

    An issue for PDP-10, certainly. Not so much for the Univac
    systems.



    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems.
    <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:


    I spent 14 years at Burroughs/Unisys (on the Burroughs side, mainly).

    Yes, two of the six mainframe lines still exist (albeit in emulation);
    one 48-bit, the other 36-bit.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 11 17:27:30 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting link, thanks!


    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    They were considering byte-addressability; interesting. It is also
    slightly funny that a 9-bit byte address would be made up of
    30 bits of virtual address and 2 bits of byte address, i.e.
    a 32-bit address in total.
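
    A toy model of that addressing, in C. The packing of four 9-bit
    bytes per 36-bit word and the left-to-right byte numbering are my
    own arbitrary choices for illustration, not the layout in the
    Jupiter spec.

    /* 9-bit-byte addressing on a 36-bit-word machine: a 32-bit "byte
       address" = 30-bit word number + 2-bit byte select.  Words are
       held right-justified in a uint64_t on the host. */
    #include <stdint.h>
    #include <stdio.h>

    #define WORD_MASK 0xFFFFFFFFFULL       /* low 36 bits */

    static uint64_t memory[4];             /* tiny 4-word "memory" */

    static unsigned load_byte9(uint32_t byteaddr)
    {
        uint32_t word  = byteaddr >> 2;    /* 30-bit word number          */
        unsigned sel   = byteaddr & 3;     /* which 9-bit byte            */
        unsigned shift = (3 - sel) * 9;    /* byte 0 = leftmost (my pick) */
        return (memory[word] >> shift) & 0x1FF;
    }

    int main(void)
    {
        /* word 1 holds four 9-bit bytes: 0x155, 0x0AA, 0x1FF, 0x003 */
        memory[1] = ((uint64_t)0x155 << 27 | (uint64_t)0x0AA << 18 |
                     (uint64_t)0x1FF << 9  | 0x003) & WORD_MASK;

        for (uint32_t a = 4; a < 8; a++)   /* byte addresses 4..7 = word 1 */
            printf("byte %u = %03x\n", (unsigned)a, load_byte9(a));
        return 0;
    }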

    Fundamentally, 36-bit words ended up being a dead-end.

    Pretty much so. It was a pity for floating-point, where they had
    more precision than the 32-bit words (and especially the horrible
    IBM format).

    But byte addressability and power of two won.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 12 15:02:04 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that
    instruction fetching is done concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.
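
    For the record, the geometry those numbers imply, as a
    back-of-the-envelope sketch in C; the 4 KB line at the end is only
    there to show how much further the same 128 entries would reach
    with larger pages.

    /* 8 KB cache, 8-byte lines, 2-way; 128-entry TLB, 512-byte pages. */
    #include <stdio.h>

    int main(void)
    {
        int cache_bytes = 8 * 1024, line = 8, ways = 2;
        int sets = cache_bytes / (line * ways);        /* 512 sets     */
        int tlb_entries = 128, page = 512;
        long tlb_reach = (long)tlb_entries * page;     /* 64 KB mapped */

        printf("cache: %d sets x %d ways x %d-byte lines\n", sets, ways, line);
        printf("TLB reach: %ld KB\n", tlb_reach / 1024);
        printf("TLB reach at 4 KB pages: %ld KB\n",
               (long)tlb_entries * 4096 / 1024);
        return 0;
    }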

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780,
    11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 12 15:59:32 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs    MHz   CPI    Machine
       1     5    10     11/780
       4    12.5   6.25  8600
       6    22.2   7.4   8700
      35    90.9   5.1   NVAX+

     SPEC92   MHz  VAX CPI  Machine
        1/1     5  10/10    VAX 11/780
    133/200   200   3/2     Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 are anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.
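
    For anyone who wants to redo the arithmetic: the derivation is just
    MHz divided by estimated native MIPS, taking the anecdotal
    1 VUP = 0.5 MIPS (i.e. 10 CPI at 5 MHz on the 11/780) as the base.

    /* CPI derived from VUPs and clock, using the anecdotal 11/780
       baseline of 1 VUP = 0.5 native MIPS (10 CPI at 5 MHz). */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *name; double vups, mhz; } m[] = {
            { "11/780",  1,  5    },
            { "8600",    4, 12.5  },
            { "8700",    6, 22.2  },
            { "NVAX+",  35, 90.9  },
        };
        for (int i = 0; i < 4; i++) {
            double mips = m[i].vups * 0.5;     /* native MIPS estimate   */
            double cpi  = m[i].mhz / mips;     /* cycles per instruction */
            printf("%-8s %5.1f MHz  %4.1f VUPs  CPI %.2f\n",
                   m[i].name, m[i].mhz, m[i].vups, cpi);
        }
        return 0;
    }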

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    I doubt that they could afford 1-cycle multiply

    Yes, one might do a multiplier and divider with its own sequencer (and
    more sophisticated in later implementations), and with any user of the
    result stalling the pipeline until that is complete, and any
    following user of the multiplier or divider stalling the pipeline
    until it is free again.

    The idea of providing multiply-step instructions and using a bunch of
    them was short-lived; already the MIPS R2000 included a multiply
    instruction (with its own sequencer), HPPA has multiply-step as well
    as an FPU-based multiply from the start. The idea of avoiding divide instructions had a longer life. MIPS has divide right from the start,
    but Alpha and even IA-64 avoided it. RISC-V includes divide in the M
    extension that also gives multiply.

    or
    even a barrel shifter.

    Five levels of 32-bit 2->1 muxes might be doable, but would that be
    cost-effective?
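
    For concreteness, here is what those two structures compute, as a C
    sketch: a 32-bit logical left shift as five ranks of 2->1 muxes,
    and a radix-2 shift-and-add loop of the sort a multiply-step
    instruction performed one iteration of per issue. (Only the logic
    is modelled; nothing here says anything about the TTL cost.)

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t barrel_shl(uint32_t x, unsigned amount)
    {
        /* each line is one mux level: pass x through or shift it */
        x = (amount & 16) ? x << 16 : x;
        x = (amount &  8) ? x <<  8 : x;
        x = (amount &  4) ? x <<  4 : x;
        x = (amount &  2) ? x <<  2 : x;
        x = (amount &  1) ? x <<  1 : x;
        return x;
    }

    static uint32_t mulstep32(uint32_t a, uint32_t b)
    {
        /* 32 iterations of shift-and-add; one iteration is roughly
           what a multiply-step instruction did */
        uint32_t prod = 0;
        for (int i = 0; i < 32; i++) {
            if (b & 1)
                prod += a;
            a <<= 1;
            b >>= 1;
        }
        return prod;               /* low 32 bits of a*b */
    }

    int main(void)
    {
        printf("%08x\n", barrel_shl(0x00000001, 31));  /* 80000000 */
        printf("%u\n",   mulstep32(1234, 5678));       /* 7006652  */
        return 0;
    }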

    It is accepted in this era that using more hardware could
    give substantial speedup. IIUC IBM used a quadratic rule:
    performance was supposed to be proportional to square of
    CPU price. That was partly marketing, but partly due to
    compromises needed in smaller machines.

    That's more of a 1960s thing, probably because low-end S/360
    implementations used all (slow) tricks to minimize hardware. In the
    VAX 11/780 environment, I very much doubt that it is true. Looking at
    the early VAXen, you get the 11/730 with 0.3 VUPs up to the 11/784
    with 3.5 VUPs (from 4 11/780 CPUs). sqrt(3.5/0.3)=3.4. I very much
    doubt that you could get an 11/784 for 3.4 times the price of an
    11/730.

    Searching a little, I find

    |[11/730 is] to be a quarter the price and a quarter the performance of
    |a grown-up VAX (11/780) <https://retrocomputingforum.com/t/price-of-vax-730-with-vms-the-11-730-from-dec/3286>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 11:25:24 2025
    From Newsgroup: comp.arch

    In article <107b1bu$252qo$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    Sure. I wasn't disputing that, just saying that I don't think
    it mattered that much.

    [snip]
    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Well, then we're definitely into the unknowable. :-)

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 13 14:18:06 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.
    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.
    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    I didn't use the typical values. Yes, it would be dangerous to use them.
    I never understood why they even quoted those typical numbers.
    I always considered them marketing fluff.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?

    I wasn't suggesting that. People used to modern CMOS speeds might not appreciate how slow TTL was. I was showing that its 50 ns speed number
    was not out of line with other MSI parts of that day, and just happened
    to have a PDF TTL manual opened on that part so used it as an example.
    A 74181 4-bit ALU is also of similar complexity and 62 ns max.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 13 14:40:01 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780, 11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton

    Yes I saw the Patt paper recently. He has written many microarchitecture papers. I was surprised that in 1990 he would say on page 2:

    "All VAXes are microcoded. The richness of the instruction set urges that
    the flexibility of microcoded control be employed, notwithstanding the conventional mythology that hardwired control is somehow faster than
    microcode. It is instructive to point out that (1) hardwired control
    produces higher performance execution only in situations where the
    critical path is in the microsequencing function, and (2) that this
    should not occur in VAX implementations if one designs with the
    well-understood (to microarchitects) technique that the next control
    store address must be obtained from information available at the start
    of the current microcycle. A variation of this basic old technique is
    the recently popularized delayed branch present in many ISA architectures introduced in the last few years."

    When he refers to the "mythology that hardwired control is somehow faster"
    he appears to still be using the monolithic "eyes" I referred to earlier
    in that everything must go through a single microsequencer.
    He compares a hardwired sequential controller to a microcoded sequential controller and notes that in that case hardwired is no faster.

    What he is not doing is comparing multiple parallel hardware stages
    to a sequential controller, hardwired or microcoded.

    Risc brings with it the concurrent hardware stages view.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 13 12:09:35 2025
    From Newsgroup: comp.arch

    On 8/13/25 11:26, Ted Nolan <tednolan> wrote:
    In article <2025Aug13.194659@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide
    8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100

    That's completely different from what I suggest above, and DEC
    obviously did not capture the PC market with that.


    They did manage to crack the college market some where CS departments
    had DEC hardware anyway. I know USC (original) had a Rainbow computer
    lab circa 1985. That "in" didn't translate to anything else though.

    Skidmore College was a DEC shop back in the day.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Wed Aug 13 19:35:09 2025
    From Newsgroup: comp.arch

    In comp.arch Scott Lurndal <scott@slp53.sl.home> wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Stephen Fuld wrote:
    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/
    scientific computing the way some folks here do, I have not gotten the
    impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it
    initially divided the 4GB address space in 2 GB for the OS, and 2GB for
    users. Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users. I am sure others here will know more details.

    Any program written to Microsoft/Windows spec would work transparently
    with a 3:1 split, the problem was all the programs ported from unix
    which assumed that any negative return value was a failure code.

    The only interfaces that I recall this being an issue for were
    mmap(2) and lseek(2). The latter was really related to maximum
    file size (although it applied to /dev/[k]mem and /proc/<pid>/mem
    as well). The former was handled by the standard specifying
    MAP_FAILED as the return value.

    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    I remember RIM. When I compiled it on Linux and tried it I got an error
    due to a check for "< 0". Changing it to "== -1" fixed it. Possibly there
    were similar troubles in other programs that I do not remember.
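
    The distinction in code, for the record (plain POSIX; the file name
    is just an example): open() and lseek() document -1 and (off_t)-1
    as their failure values, while mmap() documents MAP_FAILED. Code
    that squeezed the returned pointer into a signed integer and tested
    "< 0" broke as soon as valid addresses crossed the 2 GB mark.

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        int fd = open("/etc/hostname", O_RDONLY);   /* example file */
        if (fd == -1) {                    /* compare with -1, not "< 0" */
            fprintf(stderr, "open: %s\n", strerror(errno));
            return 1;
        }

        off_t len = lseek(fd, 0, SEEK_END);
        if (len == (off_t)-1) {            /* documented failure value */
            fprintf(stderr, "lseek: %s\n", strerror(errno));
            return 1;
        }

        void *p = mmap(NULL, (size_t)len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {             /* NOT "(long)p < 0" */
            fprintf(stderr, "mmap: %s\n", strerror(errno));
            return 1;
        }

        printf("mapped %ld bytes at %p\n", (long)len, p);
        munmap(p, (size_t)len);
        close(fd);
        return 0;
    }
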
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 13 20:23:53 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375s
    (presumably more expensive) or a 74LS377 plus an eight-bit
    inverter chip (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From drb@drb@ihatespam.msu.edu (Dennis Boone) to comp.arch,alt.folklore.computers on Thu Aug 14 17:12:40 2025
    From Newsgroup: comp.arch

    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a
    total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    Maybe compare 808x to something more in its weight class? The 8-bit
    8080 was 1974, 16-bit 8086 1978, 16/8-bit 8088 1979.

    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    De
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch,alt.folklore.computers on Thu Aug 14 15:22:46 2025
    From Newsgroup: comp.arch

    Dennis Boone wrote:
    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    Maybe compare 808x to something more in its weight class? The 8-bit
    8080 was 1974, 16-bit 8086 1978, 16/8-bit 8088 1979.

    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    De

    For those interested in a blast from the past, on the Wikipedia WD16 page https://en.wikipedia.org/wiki/Western_Digital_WD16

    is a link to a copy of Electronic Design magazine from 1977 which
    has a set of articles on microprocessors starting on page 60.

    It's a nice summary of the state of the microprocessor world circa 1977.

    https://www.worldradiohistory.com/Archive-Electronic-Design/1977/Electronic-Design-V25-N21-1977-1011.pdf

    Table 1 General Purpose Microprocessors on pg 62 shows 8 different
    16-bit microprocessor chip sets including the WD16.

    Table 3 on pg 66 shows ~11 bit-slice families that can be used to build
    larger microcoded processors, such as AMD 2900 4-bit slice series.

    It also has many data sheets on various micros starting on pg 88
    and 16-bit ones starting on pg 170, mostly chips you never heard
    of, like the Ferranti F100L, but also some you'll know like the
    Data General MicroNova mN601 on page 178.
    The Western Digital WD-16 is on pg 190.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Thu Aug 14 12:59:00 2025
    From Newsgroup: comp.arch

    On 8/14/25 10:12 AM, Dennis Boone wrote:
    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    The only single die PDP-11 DEC produced was the T-11 and it didn't
    have an MMU

    The J-11 is a Harris two chip hybrid, and is in a >40 pin chip carrier. http://simh.trailing-edge.com/semi/j11.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 15 03:20:56 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that
    instruction fetching is done concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.
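
    Spelling out that arithmetic, as a trivial C sketch of the page
    counts a 128 KB minimum configuration gives at various page sizes:

    #include <stdio.h>

    int main(void)
    {
        long mem = 128 * 1024L;                 /* minimal 128 KB config */
        int sizes[] = { 512, 1024, 2048, 4096 };

        for (int i = 0; i < 4; i++)
            printf("%4d-byte pages: %4ld pages in 128 KB\n",
                   sizes[i], mem / sizes[i]);
        return 0;                               /* 256, 128, 64, 32 */
    }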

    _For current machines_ there are reasons to use bigger pages, but
    in the VAX's time bigger pages would almost surely have led to higher
    memory use and consequently to a higher price for the end user. In
    effect the machine would have been much less competitive.

    BTW: Long ago I saw a message about porting an application from
    VAX to Linux. On the VAX the application ran OK in 1GB of memory.
    On 32-bit Intel architecture Linux with 1 GB there was excessive
    paging. The reason was the much smaller number of bigger pages.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 15 05:07:01 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Aug 15 12:57:35 2025
    From Newsgroup: comp.arch

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    The VAX was built to be a commercial product. As such, it was
    designed to be successful in the market. But in order to be
    successful in the market, it was important that the designers be
    informed by the business landscape at both the time they were
    designing it, and what they could project would be the lifetime
    of the product. Those are considerations that extend beyond
    the purely technical aspects of the design, and are both more
    speculative and more abstract.

    Consider how the business criteria might influence the technical
    design, and how these might play off of one another: obviously,
    DEC understood that the PDP-11 was growing ever more constrained
    by its 16-bit address space, and that any successor would have
    to have a larger address space. From a business perspective, it
    made no sense to create a VAX with a 16-bit address space.
    Similarly, they could have chosen (say) a 20, 24, or 28 bit
    address space, or used segmented memory, or any number of other
    such decisions, but the model that they did choose (basically a
    flat 32-bit virtual address space: at least as far as the
    hardware was concerned; I know VMS did things differently) was
    ultimately the one that "won".

    Of course, those are obvious examples. What I'm contending is
    that the business<->technical relationship is probably deeper
    and that business has more influence on technology than we
    realize, up to and including the ISA design. I'm not saying
    that the business folks are looking over the engineers'
    shoulders telling them how the opcode space should be arranged,
    but I am saying that they're probably going to engineering with
    broad-strokes requirements based on market analysis and customer
    demand. Indeed, we see examples of this now, with the addition
    of vector instructions to most major ISAs. That's driven by the
    market, not merely engineers saying to each other, "you know
    what would be cool? AVX-512!"

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT
    instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    Of course, they messed some of it up; EDITPC was like the
    punchline of a bad joke, and the ways that POLY was messed up
    are well-known.
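
    (For readers who never met it: POLY evaluated a polynomial by
    Horner's rule from a coefficient table in a single instruction. A C
    sketch of the operation follows; the coefficient ordering here,
    highest degree first, is my choice for illustration, and the real
    instruction's table layout, rounding, and accuracy rules are in the
    VAX architecture manual.)

    #include <stdio.h>

    /* Horner's rule: r = (((c0*x + c1)*x + c2)*x + ...) */
    static double poly(double x, int degree, const double c[])
    {
        double r = c[0];                 /* coefficient of x^degree */
        for (int i = 1; i <= degree; i++)
            r = r * x + c[i];
        return r;
    }

    int main(void)
    {
        /* 3x^2 + 2x + 1 at x = 2.0 -> 17 */
        const double c[] = { 3.0, 2.0, 1.0 };
        printf("%g\n", poly(2.0, 2, c));
        return 0;
    }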

    Anyway, I apologize for the length of the post, but that's the
    sort of thing I mean.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Fri Aug 15 13:36:12 2025
    From Newsgroup: comp.arch

    On Fri, 15 Aug 2025 12:57:35 -0000 (UTC), Dan Cross wrote:

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in the
    mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm saying
    that business needs must have, at least in part, influenced the ISA
    design. That is, while mistaken, it was part of the business decision
    process regardless.

    It's not clear to me what the distinction of technical vs. business is
    supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [snip]

    There are also bits of the business requirements in each of the
    descriptions of DEC microprocessor projects on Bob Supnik's site
    that Al Kossow linked to earlier:

    <http://simh.trailing-edge.com/dsarchive.html>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 15 15:10:58 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is:
    https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.
    Section 2.7 also mentions an 8-byte instruction buffer, and that the
    instruction fetching is done concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 16 15:26:31 2025
    From Newsgroup: comp.arch

    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
    running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now
    become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent
    state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table, but is otherwise pretty much
    OK assuming TLB miss rate isn't too unreasonable.


    For the TLB, had noticed best results with 4 or 8 way associativity:
      1-way: Doesn't work for main TLB.
        1-way works OK for an L1-TLB in a split L1/L2 TLB config.
      2-way: Barely works
        In some edge cases and configurations,
        may get stuck in a TLB miss loop.
      4-way: Works fairly well, cheaper option.
      8-way: Works better, but significantly more expensive.

    A possible intermediate option could be 6-way associativity.
    Full associativity is impractically expensive.
    Also a large set associative TLB beats a small full associative TLB.

    For a lot of the test programs I run, TLB size:
      64x: Small, fairly high TLB miss rate.
      256x: Mostly good
      512x or 1024x: Can mostly eliminate TLB misses, but debatable.

    In practice, this has mostly left 256x4 as the main configuration for
    the Main TLB. Optionally, can use a 64x1 L1 TLB (with the main TLB as an
    L2 TLB), but this is optional.


    A hardware page walker or inverted page table has been considered, but
    not crossed into use yet. If I were to add a hardware page walker, it
    would likely be semi-optional (still allowing processes to use
    unconventional memory management as needed, *).

    Supported page sizes thus far are 4K, 16K, and 64K. In test-kern, 16K
    mostly won out, using a 3-level page table and 48-bit address space,
    though technically the current page-table layout only does 47 bits.

    Idea was that the high half of the address space could use a separate
    System page table, but this isn't really used thus far.

    *: One merit of a software TLB is that it allows for things like nested
    page tables or other trickery without needing any actual hardware
    support. Though, you can also easily enough fake software TLB in
    software as well (a host TLB miss pulling from the guest TLB and
    translating the address again).
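
    As a rough illustration of that last point, faking nested paging on
    top of a software-managed TLB can look something like the following
    minimal C sketch (the function names are placeholders, not an actual
    API; permissions from the two stages would also be intersected in
    practice):

    #include <stdint.h>

    /* Hypothetical helpers: stage 1 translates a guest VA to a guest PA
       via the guest's page table (or guest TLB image); stage 2 translates
       that guest PA to a host PA via the host's own table. */
    extern uint64_t guest_translate(uint64_t gva);
    extern uint64_t host_translate(uint64_t gpa);
    extern void     tlb_install(uint64_t va, uint64_t pa, unsigned perm);

    /* On a host TLB miss for a guest virtual address, run both stages in
       software and install the combined GVA -> host-PA entry; no special
       hardware support is needed. */
    void nested_tlb_miss(uint64_t gva, unsigned perm)
    {
        uint64_t gpa = guest_translate(gva);
        uint64_t hpa = host_translate(gpa);
        tlb_install(gva, hpa, perm);
    }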


    Ended up not as inclined towards inverted page tables, as they offer
    fewer benefits than a page walker but would have many of the same issues
    in terms of implementation complexity (needs to access RAM and perform multiple memory accesses to resolve a miss, ...). The page walker then
    is closer to the end goal, whereas the IPT is basically just a much
    bigger RAM-backed TLB.



    Actually, it is not too far removed from doing a weaker (not-quite-IEEE)
    FPU in hardware, and then using optional traps to emulate full IEEE
    behavior (nevermind if such an FPU encountering things like subnormal
    numbers or similar causes performance to tank; and the usual temptation
    to just disable the use of full IEEE semantics).

    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 06:16:08 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.

    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 10:00:56 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 15:21:38 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.
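
    For concreteness, one loop ordering with that kind of pessimal
    locality (an assumed reconstruction, not necessarily the code that
    was measured) keeps the row index innermost, so two of the three
    arrays advance by a full 8000-byte row, i.e. a different 4K page,
    on every inner iteration:

    #define N 1000

    /* c[i][j] and a[i][k] each touch a new page every iteration of the
       inner loop, while b[k][j] stays put: roughly 2 TLB misses per
       iteration once the working set exceeds what the TLB can cover. */
    void matmul_pessimal(double c[N][N], double a[N][N], double b[N][N])
    {
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                for (int i = 0; i < N; i++)
                    c[i][j] += a[i][k] * b[k][j];
    }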

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 11:29:20 2025
    From Newsgroup: comp.arch

    On 8/17/2025 1:16 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.


    Yes.

    AVL tree is a balanced binary tree that tracks depth and "rotates" nodes
    as needed to keep the depth of one side within +/- 1 of the other.

    The B-Trees would use N elements per node, which are stored in sorted
    order so that one can use a binary search.
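
    As a sketch of what one node of such a tree might look like when used
    for address translation (the field names here are made up for
    illustration, not taken from an actual implementation):

    #include <stdint.h>

    /* One node of an AVL-style mapping tree, keyed by virtual page number.
       Lookup is an ordinary binary search down the tree; insert and delete
       rebalance by rotating any node whose left/right height difference
       would leave the -1..+1 range. */
    struct vm_avl_node {
        uint64_t vpn;                 /* virtual page number (search key) */
        uint64_t pte;                 /* physical page + permission bits  */
        int8_t   balance;             /* height(right) - height(left)     */
        struct vm_avl_node *left;
        struct vm_avl_node *right;
    };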


    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?


    Can use less RAM for large sparse address spaces with aggressive ASLR.
    However, looking up a page or updating the page table is significantly
    slower (enough to be relevant).

    Though, I mostly ended up staying with more conventional page tables and weakening the ASLR, where it may try to reuse the previous bits (47:25)
    and (47:36) of the address a few times, to reduce page-table
    fragmentation (sparse, mostly-empty, page table pages).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 13:35:03 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
    Its programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
    For a wide instruction or stage register I'd look at chips such as a 74LS377
    with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375
    (presumably more expensive) or a 74LS377 and an eight-chip
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    a wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

    inst buf <15:8>   <7:0>
             |    |   |   |
           4:16 4:16 4:16 4:16
           vvvv vvvv vvvv vvvv
      10k  ---|---|---|---|------>INV->
      10k  ---------------------->INV->
      10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable length 1 to 12 bytes.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 12:53:32 2025
    From Newsgroup: comp.arch

    On 8/17/2025 9:00 AM, EricP wrote:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software.  While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.


    I am not saying SW page walkers are fast.
    Though in my experience, the cycle cost of the SW TLB miss handling
    isn't "too bad".

    If it were a bigger issue in my case, could probably add a HW page
    walker, as I had long considered it as a possible optional feature. In
    this case, it could be per-process (with the LOBs of the page-base
    register also encoding whether or not HW page-walking is allowed; along
    with in my case also often encoding the page-table type/layout).


    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."


    This is around 2 orders of magnitude more than I am often seeing in my
    testing (mind you, with a TLB miss handler that is currently written in C).


    But, this is partly where things like page-sizes and also the size of
    the TLB can have a big effect.

    Ideally, one wants a TLB that has a coverage larger than the working set
    of the typical applications (and OS); at which point miss rate becomes negligible. Granted, if one has GB's of RAM, and larger programs, this
    is a harder problem...


    Then the ratio of working set to TLB coverage comes into play, which
    granted (sadly) appears to follow a (workingSet/coverage)^2 curve...


    I had noted before that some of the 90s era RISC's had comparably very
    small TLBs, such as 64-entry fully associative, or 16x4.
    Such a TLB with a 4K page size having a coverage of roughly 256K.

    Where, most programs have working sets somewhat larger than 256K.

    Looking it up, the DEC Alpha used a 48 entry TLB, so 192K coverage, yeah...


    The CPU time cost of TLB Miss handling would be significantly reduced
    with a "not pissant" TLB.



    I was mostly using 256x4, with a 16K page size, which covers a working
    set of roughly 16MB.

    A 1024x4 would cover 64MB, and 1024x6 would cover 96MB.

    One possibility though would be to use 64K pages for larger programs,
    which would increase coverage of a 1024x TLB to 256MB or 384MB.

    At present, a 1024x4 TLB would use 64K of Block-RAM, and 1024x6 would
    use 98K.

    But, yeah... this is comparable to the apparent TLB sizes on a lot of
    modern ARM processors; which typically deal with somewhat larger working
    sets than I am dealing with.
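
    The coverage figures above fall straight out of sets * ways * page
    size; a throwaway check (assuming that reading of the NxM notation):

    #include <stdio.h>

    static unsigned long long coverage(unsigned sets, unsigned ways,
                                       unsigned long long page_bytes)
    {
        return sets * ways * page_bytes;  /* bytes of address space covered */
    }

    int main(void)
    {
        printf("256x4,  16K: %3llu MB\n", coverage( 256, 4, 16384) >> 20);
        printf("1024x4, 16K: %3llu MB\n", coverage(1024, 4, 16384) >> 20);
        printf("1024x6, 16K: %3llu MB\n", coverage(1024, 6, 16384) >> 20);
        printf("1024x4, 64K: %3llu MB\n", coverage(1024, 4, 65536) >> 20);
        printf("1024x6, 64K: %3llu MB\n", coverage(1024, 6, 65536) >> 20);
        return 0;
    }

    which prints 16, 64, 96, 256 and 384 MB respectively.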


    Another option is to RAM-back part of the TLB, essentially as an
    "Inverted Page Table", but admittedly, this has similar complexities to
    a HW page walker (and the hassle of still needing a fault handler to
    deal with missing IPT entries).



    In an ideal case, could make sense to write at least the fast path of
    the miss handler in ASM.

    Note that TLB misses are segregated into their own interrupt category
    separate from other interrupts:
    8: General Fault (Memory Faults, Instruction Faults, FPU Traps)
    A: TLB Miss (TLB Miss, ACL Miss)
    C: Interrupt (1kHz HW timer mostly)
    E: Syscall (System Calls)

    Typically, the VBR layout looks like:
    + 0: Reset (typically only used on boot, with VBR reset to 0)
    + 8: General Fault
    +16: TLB Miss
    +24: Interrupt
    +32: Syscall
    With a non-standard alignment requirement (vector table needs to be
    aligned to a multiple of 256 bytes, for "reasons"). Though actual CPU
    core currently only needs a 64B alignment (256B would allow adding a lot
    more vectors while staying with the use of bit-slicing). Each "entry" in
    this table being a branch to the entry point of the ISR handler.

    On initial Boot, as a little bit of a hack, the CPU looks at the
    encoding of the Reset Vector branch to determine the initial ISA Mode
    (such as XG1, XG3, or RV64GC).



    If doing a TLB miss handler in ASM, possible strategy could be:
      Save off some of the registers;
      Check if a simple case TLB miss or ACL miss;
        Try to deal with it;
        Restore registers;
        Return.
      Save rest of registers;
      Deal with more complex scenario (probably in C land);
        Such as initiate a context switch to the page-fault handler.

    For the simple cases (a rough C sketch follows below):
    TLB Miss involves walking the page table;
    ACL miss may involve first looking up the ID pairs in a hash table;
    Fallback cases may involve more complex logic in a more general handler.
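
    A minimal C sketch of that simple case, assuming something like the
    16K-page, 3-level layout mentioned earlier (11 index bits per level
    over a 47-bit VA); the helper names and the PTE bit layout are
    placeholders, and ACL/ASID handling is omitted:

    #include <stdint.h>

    #define PAGE_SHIFT  14u                        /* 16K pages             */
    #define PAGE_MASK   ((1ull << PAGE_SHIFT) - 1)
    #define IDX_BITS    11u                        /* 2048 8-byte PTEs/page */
    #define IDX_MASK    ((1u << IDX_BITS) - 1)
    #define PTE_VALID   1ull                       /* placeholder valid bit */

    extern uint64_t *page_table_root;     /* top-level (level-2) table,
                                             reachable from the handler     */
    extern void tlb_load_entry(uint64_t va, uint64_t pte);  /* placeholder  */
    extern void page_fault_slow_path(uint64_t va);  /* hand off to C land   */

    void tlb_miss_fast_path(uint64_t va)
    {
        uint64_t *tbl = page_table_root;

        for (int level = 2; ; level--) {
            unsigned ix = (unsigned)(va >> (PAGE_SHIFT + level * IDX_BITS)) & IDX_MASK;
            uint64_t pte = tbl[ix];

            if (!(pte & PTE_VALID)) {   /* unmapped: not the simple case */
                page_fault_slow_path(va);
                return;
            }
            if (level == 0) {           /* leaf PTE: install and return  */
                tlb_load_entry(va, pte);
                return;
            }
            /* interior PTE: assume it holds the page-aligned address of
               the next-level table */
            tbl = (uint64_t *)(uintptr_t)(pte & ~PAGE_MASK);
        }
    }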



    At present, the Interrupt and Syscall handlers have the quirk in that
    they require TBR to be set-up first, as they directly save to the
    register save area (relative to) TBR, rather than using the interrupt
    stack. The main rationale here being that these interrupts frequently
    perform context switches and saving/restoring registers to TBR greatly
    reduces the performance cost of performing a context switch.

    Note though that one ideally wants to use shared address spaces or ASIDs
    to limit the amount of TLB misses.

    Can note that currently my CPU core uses 16-bit ASIDs, split into 6+10
    bits, currently 64 groups, each with 1024 members. Global pages are
    generally only global within a groups, and high numbered groups are
    assumed to not allow global pages. Say, for example, if you were running
    a VM, you wouldn't want its VAS being polluted with global pages from
    the host OS.

    Though, global pages would allow things like DLLs and similar to be
    shared without needing TLB misses for them on context switches.
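
    One plausible reading of that 6+10 split, written as a TLB match
    check (the names and the "last group that allows global pages"
    cutoff are made up purely for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define ASID_GROUP(a)    ((unsigned)(a) >> 10)  /* 64 groups           */
    #define LAST_GLOBAL_GRP  47u                    /* hypothetical cutoff */

    /* An entry hits on an exact ASID match, or, for a global page, when
       the current ASID is in the same group and that group permits global
       pages at all (high-numbered groups, e.g. for VM guests, do not). */
    static bool tlb_entry_matches(uint16_t entry_asid, bool entry_global,
                                  uint16_t cur_asid)
    {
        if (entry_asid == cur_asid)
            return true;
        return entry_global &&
               ASID_GROUP(entry_asid) == ASID_GROUP(cur_asid) &&
               ASID_GROUP(cur_asid) <= LAST_GLOBAL_GRP;
    }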

    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Jakob Bohm@egenagwemdimtapsar@jbohm.dk to comp.arch,comp.lang.c on Sun Aug 17 20:18:36 2025
    From Newsgroup: comp.arch

    On 2025-08-05 23:08, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?


    What is actually going on is GCC offering its users a gradual way to transition from C17 to C23, by applying the C23 meaning of any C23
    construct that has no conflicting meaning in C17 . In particular, this
    allows installed library headers to use the new types as part of
    logically opaque (but compiler visible) implementation details, even
    when those libraries are used by pure C17 programs. For example, the
    ISO POSIX datatype struct stat could contain a _BitInt(128) type for
    st_dev or st_ino if the kernel needs that, as was the case with the 1996
    NT kernel . Or a _BitInt(512) for st_uid as used by that same kernel .

    GCC --pedantic is an option to check if a program is a fully conforming portable C program, with the obvious exception of the contents of any
    used "system" headers (including installed libc headers), as those are
    allowed to implement standard or non-standard features in implementation specific ways, and might even include implementation specific logic to
    report the use of non-standard extensions to the library standards when
    the compiler is invoked with --pedantic and no contrary options .

    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C instead
    of GNUC reverts those to the standard definition .

    Enjoy

    Jakob
    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 19:10:21 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 15:08:14 2025
    From Newsgroup: comp.arch

    On 8/17/2025 2:10 PM, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.


    If the processor has microcode, could try to handle it that way.

    If it could work, and the CPU allows sufficiently complex logic in
    microcode to deal with this.

    ...



    One idea I had considered early on would be that there would be a
    special interrupt class that always goes into the ROM; so to the OS it
    would always look as if there were a HW page walker.

    This was eventually dropped, though, as I was typically using 32K for the
    Boot ROM, and with the initial startup tests, font initialization, and
    FAT32 driver + PEL and ELF loaders, ..., there wasn't much space left
    for "niceties" like TLB miss handling and similar. So, the role of the
    ROM was largely reduced to initial boot-up.

    It could be possible to have a "2-stage ROM", where the first stage boot
    ROM also loads more "ROM" from the SDcard. But, at that point, may as
    well just go over to using the current loader design to essentially try
    to load a UEFI BIOS or similar (which could then load the OS, achieving basically the same effect).

    Where, in effect, UEFI is basically an OS in its own right, which just
    so happens to use similar binary formats to what I am using already (eg, PE/COFF).

    Not yet gone up the learning curve for how to make TestKern behave like
    a UEFI backend though (say, for example, if I wanted to try to get
    "Debian RV64G" or similar to boot on my stuff).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 18:56:49 2025
    From Newsgroup: comp.arch

    On 8/17/2025 12:35 PM, EricP wrote:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns.  Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
    Its programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like
    74LS375.

    For a wide instruction or stage register I'd look at chips such as a
    74LS377
    with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable,
    vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375
    (presumably more expensive) or a 74LS377 and an eight-chip
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more.  If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    a wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

    inst buf <15:8>   <7:0>
             |    |   |   |
           4:16 4:16 4:16 4:16
           vvvv vvvv vvvv vvvv
      10k  ---|---|---|---|------>INV->
      10k  ---------------------->INV->
      10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere.  Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable length 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.


    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)


    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding
    space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Sun Aug 17 22:18:28 2025
    From Newsgroup: comp.arch

    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 18 05:48:00 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [...]

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    I had actually forgotten that the VAX also had decimal
    instructions. But the 11/780 also had one really important
    restriction: It could only do one write every six cycles, see https://dl.acm.org/doi/pdf/10.1145/800015.808199 , so that
    severely limited their throughput there (assuming they did
    things bytewise). So yes, decimal arithmetic was important
    in the day for COBOL and related commercial applications.

    So, what to do with decimal arithmetic, which was important
    at the time (and a business consideration)?

    Something like Power's addg6s instruction could have been
    introduced; it adds two numbers together, generating only the
    decimal carries, and puts a nibble "6" into the corresponding
    nibble if there is one, and "0" otherwise. With 32 bits, that
    would allow addition of eight-digit decimal numbers in four
    instructions (see one of the POWER ISA documents for details),
    but the cycle of "read ASCII digits, do arithmetic, write
    ASCII digits" would have needed some extra shifts and masks,
    so it might have been more beneficial to use four digits per
    register.
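
    For comparison, the classic branch-free way to add eight packed-BCD
    digits in plain C looks like this; the correction mask built in the
    middle is essentially what an addg6s-style instruction would hand you
    in a single operation (give or take the exact polarity), which is
    where the instruction-count saving comes from:

    #include <stdint.h>

    /* Add two 8-digit packed-BCD values; the decimal carry out of the
       top digit is simply dropped here. */
    static uint32_t bcd_add8(uint32_t a, uint32_t b)
    {
        uint64_t t1  = (uint64_t)a + 0x66666666u; /* pre-bias every digit by 6    */
        uint64_t t2  = t1 + b;                    /* plain binary sum             */
        uint64_t c   = t1 ^ b ^ t2;               /* carry into each bit position */
        uint64_t no  = ~c & 0x111111110ull;       /* nibbles w/o a decimal carry  */
        uint64_t six = (no >> 2) | (no >> 3);     /* a 6 in each such nibble      */
        return (uint32_t)(t2 - six);              /* undo the bias where unneeded */
    }

    For example, bcd_add8(0x00000019, 0x00000023) returns 0x00000042.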

    The article above is also extremely interesting otherwise. It does
    not give cycle timings for each individual instruction and address
    mode, but it gives statistics on how they were used, and a good
    explanation of the timing implications of their microcode design.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.arch,comp.lang.c on Mon Aug 18 08:02:30 2025
    From Newsgroup: comp.arch

    On 18/08/2025 06:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    $ cat so.c
    #include <stdio.h>

    int main(void)
    {
        int foo = 42;
        size_t soa = sizeof (foo, 'C');
        size_t sob = sizeof foo;
        printf("%s.\n", (soa == sob) ? "Yes" : "No");
        return 0;
    }
    $ gcc -o so so.c
    $ ./so
    Yes.
    $ gcc --version
    gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch,comp.lang.c on Mon Aug 18 11:34:49 2025
    From Newsgroup: comp.arch

    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?


    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Aug 18 11:03:15 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.

    Yeah, this approach works a lot better than people seem to give it
    credit for...
    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.
    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    None of those research papers that I have seen consider the possibility
    that OoO can make use of multiple concurrent HW walkers if the
    cache supports hit-under-miss and multiple pending miss buffers.

    While instruction fetch only needs to occasionally translate a VA one
    at a time, with more aggressive alternate path prefetching all those VA
    have to be translated first before the buffers can be prefetched.
    LSQ could also potentially be translating as many VA as there are entries.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.

    - anton

    I'm looking for papers that separate out the common cost of loading a PTE
    from the extra cost of just the SW-miss handler. I had a paper a while
    back but can't find it now. IIRC in that paper the extra cost of the
    SW miss handler on Alpha was measured at 5-25%.

    One thing to mention about some of these papers looking at TLB performance:
    some papers on virtual address translation appear NOT to be aware
    that Intel's HW walker on its downward walk caches the interior node
    PTE's in auxiliary TLB's and checks for PTE TLB hits in bottom to top order (called a bottom-up walk) and thereby avoids many HW walks from the root.

    A SW walker can accomplish the same bottom-up walk by locating
    the different page table levels at *virtual* base addresses,
    and adding each VA of those interior PTE's to the TLB.
    This is what VAX VA translate did, probably Alpha too but I didn't check.
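
    A rough sketch of such a bottom-up software walk, assuming the page
    tables are themselves mapped at a fixed virtual base (a recursive
    self-map), with 8-byte PTEs and 8K pages as on Alpha; the names and
    the base address are placeholders:

    #include <stdint.h>

    #define PAGE_SHIFT 13u                        /* 8K pages               */
    #define PT_VBASE   0xFFFFFE0000000000ull      /* hypothetical self-map  */

    extern int      tlb_probe(uint64_t va);       /* is this VA in the TLB? */
    extern void     tlb_install(uint64_t va, uint64_t pte);
    extern uint64_t walk_from_root(uint64_t va);  /* top-down fallback      */

    void sw_miss_bottom_up(uint64_t va)
    {
        /* The leaf PTE for 'va' lives at a fixed *virtual* address. */
        uint64_t pte_va = PT_VBASE + ((va >> PAGE_SHIFT) << 3);

        if (!tlb_probe(pte_va)) {
            /* The page-table page holding that PTE is not itself mapped in
               the TLB: map it first (top-down here; one could also recurse
               a level up), so the interior PTE gets cached too. */
            tlb_install(pte_va, walk_from_root(pte_va));
        }
        /* Usually this single load is all a miss costs. */
        tlb_install(va, *(uint64_t *)(uintptr_t)pte_va);
    }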

    This interior PTE node caching is critical for optimal performance
    and some of their stats don't take it into account
    and give much worse numbers than they should.

    Also many papers were written before ASID's were in common use
    so the TLB got invalidated with each address space switch.
    This would penalize any OS which had separate user and kernel space.

    So all these numbers need to be taken with a grain of salt.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 18 15:35:36 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal
    reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Aug 18 17:19:13 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were
    added as independent hardware features in V8.1.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Mon Aug 18 21:57:59 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .
    I'm not sure what you're referring to. You didn't say what foo is.
    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed
    character constant is int.
    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".

    Yes (more of a thinko, actually).

    I meant to ask about `sizeof (foo, 'C')` yielding a value *other than*
    `sizeof (int)`. Jakob implies a difference in this area between GNU C
    and ISO C. I'm not aware of any.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 20 03:47:17 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs     MHz    CPI    Machine
       1     5      10     11/780
       4     12.5   6.25   8600
       6     22.2   7.4    8700
      35     90.9   5.1    NVAX+

    SPEC92    MHz   VAX CPI   Machine
    1/1       5     10/10     VAX 11/780
    133/200   200   3/2       Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 is anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    Prism paper says the following about RISC versus VAX performance:

    : 1. Shorter cycle time. VAX chips have more, and longer, critical
    : paths than RISC chips. The worst VAX paths are the control store
    : loop and the variable length instruction decode loop, both of
    : which are absent in RISC chips.

    : 2. Fewer cycles per function. Although VAX chips require fewer
    : instructions than RISC chips (1:2.3) to implement a given
    : function, VAX instructions take so many more cycles than RISC
    : instructions (5-10:1-1.5) that VAX chips require many more cycles
    : per function than RISC chips.

    : 3. Increased pipelining. VAX chips have more inter- and
    : intra-instruction dependencies, architectural irregularities,
    : instruction formats, address modes, and ordering requirements
    : than RISC chips. This makes VAX chips harder and more
    : complicated to pipeline.

    Point 1 above, for me, means that VAX chips were microcoded. Point
    2 above suggests that there were limited changes compared to the
    VAX-780 microcode.

    IIUC attempts to create better hardware for the VAX were canceled
    just after the PRISM memos, so later VAX designs used essentially the
    same logic, just rescaled to a better process.

    I think that the VAX had a problem with hardware decoders because of
    gate delay: in 1987 a hardware decoder probably would have slowed down
    the clock. But the 1977 design looks quite relaxed to me: the main
    logic was Schottky TTL, which nominally has 3 ns of inverter delay.
    With a 200 ns cycle this means about 66 gate delays per cycle. And in
    critical paths the VAX used ECL. I do not know exactly which ECL, but
    AFAIK 2 ns ECL was commonly available in 1970 and 1 ns ECL was leading
    edge in 1970.
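
    (A back-of-the-envelope check on the gate-depth budget implied by the
    numbers above; the figures are just the ones quoted in this paragraph,
    not measurements of the actual VAX design.)

    #include <stdio.h>

    int main(void)
    {
        const double cycle_ns = 200.0;  /* VAX 11/780 cycle time          */
        const double sttl_ns  = 3.0;    /* nominal Schottky TTL inverter  */
        const double ecl2_ns  = 2.0;    /* commonly available ECL, ~1970  */
        const double ecl1_ns  = 1.0;    /* leading-edge ECL, ~1970        */

        printf("Schottky TTL: ~%d gate delays per cycle\n", (int)(cycle_ns / sttl_ns));
        printf("2 ns ECL:     ~%d gate delays per cycle\n", (int)(cycle_ns / ecl2_ns));
        printf("1 ns ECL:     ~%d gate delays per cycle\n", (int)(cycle_ns / ecl1_ns));
        return 0;
    }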

    That is why I think that in 1977 a hardware decoder could have given a
    speedup, assuming that the execution units could keep up: the gate
    delay and cycle time mean that a rather deep circuit could fit within
    the cycle time. IIUC 1987 designs were much more aggressive, and
    decoder delay probably could not fit within a single cycle.

    It is quite possible that hardware designers attempting VAX hardware
    decoders were too ambitious and wanted to decode instructions that
    were too complicated in one cycle. AFAICS, for instructions that
    cannot be executed in one cycle, decode can be slower than one cycle;
    all one needs is to recognize within one cycle that decode will take
    multiple cycles.

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 14:36:43 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    Not quite.
    My idea was to have two HW threads HT1 and HT2 which are like x86 HW
    threads except when HT1 gets a TLB miss it stalls its execution and
    injects the TLB miss handler at the front of HT2 pipeline,
    and a HT2 TLB miss stalls itself and injects its handler into HT1.
    The TLB miss handler never itself TLB misses as it explicitly checks
    the TLB for any VA it needs to translate so recursion is not possible.

    As the handler is injected at the front of the pipeline no drain occurs.
    The only possible problem is if, after HT1 injects its miss handler
    into HT2, HT2's existing pipeline code then also takes a TLB miss.
    As this would cause a deadlock, if it occurs the core detects it
    and both HTs fault and run their own TLB miss handlers.
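
    (A minimal C sketch of the cross-injection rule just described; the
    state names and the deadlock test are my own simplification for
    illustration, not the actual design.)

    #include <stdbool.h>
    #include <stdio.h>

    enum ht_state { RUNNING, STALLED_WAITING_PEER, RUNNING_OWN_HANDLER };

    struct hw_thread { int id; bool tlb_miss; enum ht_state state; };

    /* HTa stalls on a miss and injects its handler at the front of HTb's
       pipeline; if both miss at once the cross-injection would deadlock,
       so each thread falls back to running its own handler. */
    static void resolve_miss(struct hw_thread *a, struct hw_thread *b)
    {
        if (a->tlb_miss && b->tlb_miss) {
            a->state = b->state = RUNNING_OWN_HANDLER;
            printf("HT%d and HT%d both missed: each runs its own handler\n",
                   a->id, b->id);
        } else if (a->tlb_miss) {
            a->state = STALLED_WAITING_PEER;
            printf("HT%d missed: handler injected into HT%d's front end\n",
                   a->id, b->id);
        } else if (b->tlb_miss) {
            b->state = STALLED_WAITING_PEER;
            printf("HT%d missed: handler injected into HT%d's front end\n",
                   b->id, a->id);
        }
    }

    int main(void)
    {
        struct hw_thread ht1 = {1, true,  RUNNING};
        struct hw_thread ht2 = {2, false, RUNNING};
        resolve_miss(&ht1, &ht2);             /* single miss: cross-inject */

        ht1.tlb_miss = ht2.tlb_miss = true;
        resolve_miss(&ht1, &ht2);             /* both miss: local fallback */
        return 0;
    }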

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were independent hardware features added in V8.1.

    Yes. A memory recycler can periodically clear the Accessed bit
    so it can detect page usage, and that might be a different core.
    But it might skip sending TLB shootdowns to all other cores
    to lower the overhead (maybe a lazy usage detector).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 16:41:39 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    Hmmm... I don't think that is possible, or if it is then it's really hairy.
    The miss handler needs to LD the memory PTE's, which can happen OoO.
    But it also needs to do things like writing control registers
    (e.g. the TLB) or setting the Accessed or Dirty bits on the in-memory PTE, things that usually only occur at retire. But those handler instructions
    can't get to retire because the older instructions that triggered the
    miss are stalled.

    The miss handler needs general registers so it needs to
    stash the current content someplace and it can't use memory.
    Then add a nested miss handler on top of that.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    As Scott said, to avoid race conditions with software clearing those bits.
    Plus there might be PTE modifications that an OS could perform on other
    PTE fields concurrently without first acquiring the normal mutexes
    and doing a TLB shoot down of the PTE on all the other cores,
    provided they are done atomically so the updates of one core
    don't clobber the changes of another.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    A HW walker looks simple to me.
    It has a few bits of state number and a couple of registers.
    It needs to detect memory read errors if they occur and abort.
    Otherwise it checks each TLB level in backwards order using the
    appropriate VA bits, and if it gets a hit walks back down the tree
    reading PTE's for each level and adding them to their level TLB,
    checking it is marked present, and performing an atomic OR to set
    the Accessed and Dirty flags if they are clear.

    The HW walker is even simpler if the atomic OR is implemented directly
    in the cache controller as part of the Atomic Fetch And OP series.
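
    (A minimal sketch of one step of such a walk, using C11
    atomic_fetch_or for the Accessed/Dirty update; the PTE bit positions
    and helper names are illustrative assumptions, not any particular
    architecture's format.)

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT  (1ull << 0)  /* illustrative bit positions */
    #define PTE_ACCESSED (1ull << 5)
    #define PTE_DIRTY    (1ull << 6)

    /* Check one PTE: verify Present, set Accessed (and Dirty on a write)
       with an atomic OR so a concurrent walker or software clearing the
       bits cannot lose the update, and return the next-level address. */
    static bool walk_step(_Atomic uint64_t *pte, bool is_write,
                          uint64_t *next_pa)
    {
        uint64_t e = atomic_load_explicit(pte, memory_order_acquire);
        if (!(e & PTE_PRESENT))
            return false;                       /* abort: page fault */

        uint64_t set = PTE_ACCESSED | (is_write ? PTE_DIRTY : 0);
        if ((e & set) != set)                   /* RMW only if needed */
            atomic_fetch_or_explicit(pte, set, memory_order_acq_rel);

        *next_pa = e & ~0xFFFull;               /* next table / frame */
        return true;
    }

    int main(void)
    {
        _Atomic uint64_t pte = PTE_PRESENT;     /* a "virgin" leaf PTE */
        uint64_t pa;
        return walk_step(&pte, true, &pa) ? 0 : 1;
    }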

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton

    Yes, and it seems to me that one would spend a lot more time trying to
    fix the SW walker than doing the simple HW walker that just works.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 19:17:01 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb DRAMs were just making it to customers; 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.
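
    (A purely hypothetical illustration of such a 2-byte accumulate
    format, with an 8-bit opcode and two 4-bit register fields, just to
    make the bit layout concrete; the opcode value is made up.)

    #include <stdint.h>
    #include <stdio.h>

    struct acc_insn { uint8_t opcode, rd, rs; };  /* Rd is src and dest */

    static struct acc_insn decode_acc(const uint8_t b[2])
    {
        struct acc_insn i;
        i.opcode = b[0];          /* byte 0: 8-bit opcode              */
        i.rd     = b[1] >> 4;     /* byte 1 high nibble: dest/source   */
        i.rs     = b[1] & 0x0F;   /* byte 1 low nibble: second source  */
        return i;
    }

    int main(void)
    {
        const uint8_t add_r3_r7[2] = { 0x10, 0x37 };   /* say, ADD R3,R7 */
        struct acc_insn i = decode_acc(add_r3_r7);
        printf("opcode %02X: R%d += R%d\n", i.opcode, i.rd, i.rs);
        return 0;
    }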

    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest
    conditional branch would be 3 bytes as it needs a register specifier.

    If one is doing variable byte length instructions then
    it allows the highest usage frequency to be most compact possible.
    Eg. an ADD with 32-bit immediate in 6 bytes.

    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.

    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 20 23:50:52 2025
    From Newsgroup: comp.arch

    On 8/20/2025 6:17 PM, EricP wrote:
    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb DRAMs were just making it to customers; 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most
    instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
      Maximizing code density often prefers fewer registers;
      For 16-bit instructions, 8 or 16 registers is good;
      8 is rather limiting;
      32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.


    Yeah.

    SuperH had:
    ZZZZnnnnmmmmZZZZ //2R
    ZZZZnnnniiiiiiii //2RI (Imm8)
    ZZZZnnnnZZZZZZZZ //1R


    For BJX2/XG1, had went with:
    ZZZZZZZZnnnnmmmm
    But, in retrospect, this layout was inferior to the one SuperH had used
    (and I almost would have just been better off doing a clean-up of the SH encoding scheme than moving the bits around).

    Though, this happened during a transition between B32V and BSR1, where:
    B32V was basically a bare-metal version of SH;
    BSR1 was an instruction repack (with tweaks to try make it more
    competitive with MSP430 while still remaining Load/Store);
    BJX2 was basically rebuilding all the stuff from BJX1 on top of BSR1's encoding scheme (which then mutated more).


    At first, BJX2's 32-bit ops were a prefix:
    111P-YYWY-qnmo-oooo ZZZZ-ZZZZ-nnnn-mmmm

    But, then got reorganized:
    111P-YYWY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ

    Originally, this repack was partly because I had ended up designing some Imm9/Disp9 encodings as it quickly became obvious that Imm5/Disp5 was insufficient. But, I had designed the new instructions to have the Imm
    field not be totally dog-chewed, so ended up changing the layout. Then
    ended up changing the encoding for the 3R instructions to better match
    that of the new Imm9 encodings.

    Then, XG2:
    NMOP-YYwY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ //3R

    Which moved entirely over to 32/64/96 bit encodings in exchange for
    being able to directly encode 64 GPRs in 32-bit encodings for the whole ISA.


    In the original BJX2 (later renamed XG1), only a small subset had
    direct access to the higher-numbered registers, with other
    instructions using 64-bit encodings.

    Though, ironically, XG2 never surpassed XG1 in terms of code-density;
    but being able to use 64 registers "pretty much everywhere" was (mostly)
    a good thing for performance.


    For XG3, there was another repack:
    ZZZZ-oooooo-mmmmmm-ZZZZ-nnnnnn-qY-YYPw //3R

    But, this was partly to allow it to co-exist with RISC-V.

    Technically, still has conditional instructions, but these were demoted
    to optional; as if one did a primarily RISC-V core, with an XG3 subset
    as an ISA extension, they might not necessarily want to deal with the
    added architectural state of a 'T' bit.

    BGBCC doesn't currently use it by default.

    Was also able to figure out how to make the encoding less dog chewed
    than either XG2 or RISC-V.


    Though, ironically, the full merits of XG3 are only really visible in
    cases where XG1 and XG2 are dropped. But, it has a new boat-anchor in
    that it now assumes coexistence with RISC-V (which itself has a fair bit
    of dog chew).

    And, if the goal is RISC-V first, then likely the design of XG3 is a big
    ask; it being essentially its own ISA.

    Though, while giving fairly solid performance, XG3 currently hasn't
    matched the code density of its predecessors (either XG1 or XG2). It is
    more like "RISC-V but faster".

    And, needing to use mode changes to access XG3 or RV-C is a little ugly.



    Though, OTOH, RISC-V land is annoying in a way; lots of people being
    like "RV-V will save us from all our performance woes!". Vs, realizing
    that some issues need to be addressed in the integer ISA, and SIMD and auto-vectorization will not address inefficiencies in the integer ISA.


    Though, I have seen glimmers of hope that other people in RV land
    realize this...


    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest conditional branch would be 3 bytes as it needs a register specifier.


    Yeah, "BT/BF Disp8".


    If one is doing variable byte length instructions then
    it allows the highest usage frequency to be most compact possible.
    Eg. an ADD with 32-bit immediate in 6 bytes.



    In BSR1, I had experimented with:
    LDIZ Imm12u, R0 //R0=Imm12
    LDISH Imm8u //R0=(R0<<8)|Umm8u
    OP Imm4R, Rn //OP [(R0<<4)|Imm4u], Rn

    Which allowed Imm24 in 6 bytes or Imm32 in 8 bytes.
    Granted, as 3 or 4 instructions.
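
    (A small worked example, in C, of how that sequence composes a full
    constant: 12 + 8 + 4 bits in three ops for Imm24, or one extra LDISH
    for Imm32. The helper names just mirror the mnemonics above.)

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t ldiz  (uint32_t imm12)             { return imm12 & 0xFFF; }
    static uint32_t ldish (uint32_t r0, uint32_t imm8) { return (r0 << 8) | (imm8 & 0xFF); }
    static uint32_t op_imm(uint32_t r0, uint32_t imm4) { return (r0 << 4) | (imm4 & 0xF); }

    int main(void)
    {
        /* Imm32 case: LDIZ + 2x LDISH + OP = 4 instructions, 8 bytes */
        uint32_t r0 = ldiz(0x123);
        r0 = ldish(r0, 0x45);
        r0 = ldish(r0, 0x67);
        printf("effective immediate = 0x%08X\n", op_imm(r0, 0x8));
        return 0;
    }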

    Though, this began the process of allowing the assembler to fake more
    complex instructions which would decompose into simpler instructions.


    But, this was not kept, and in BJX2 was mostly replaced with:
    LDIZ Imm24u, R0
    OP R0, Rn

    Then, when I added Jumbo Prefixes:
    OP Rm, Imm33s, Rn

    Some extensions of RISC-V support Imm32 in 48-bit ops, but this burns
    through lots of encoding space.

    iiiiiiii-iiiiiiii iiiiiiii-iiiiiiii zzzz-nnnnn-z0-11111

    This doesn't go very far.


    Can note ISAs with 16 bit encodings:
      PDP-11: 8 registers
      M68K  : 2x 8 (A and D)
      MSP430: 16
      Thumb : 8|16
      RV-C  : 8|32
      SuperH: 16
      XG1   : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.


    If the smallest instruction size is 16 bits, it simplifies things
    considerably vs 8 bits.

    If the smallest size is 32-bits, it simplifies things even more.
    Fixed length is the simplest case though.


    As noted, 32/64/96 bit fetch isn't too difficult though.

    For 64/96 bit instructions though, mostly want to be able to (mostly)
    treat it like a superscalar fetch of 2 or 3 32-bit instructions.

    In my CPU, I ended up making it so that only 32-bit instructions support superscalar; whereas 16 and 64/96 bit instructions are scalar only.

    Superscalar only works with native alignment though (for RISC-V), and
    for XG3, 32-bit instruction alignment is mandatory.


    As noted, in terms of code density, a few of the stronger options are
    Thumb2 and RV-C, which have 16 bits as the smallest size.


    I once experimented with having a range of 24-bit instructions, but the
    hair this added (combined with the fairly little gain in terms of code density) showed this was rather not worth it.


    ...


    In my recent fiddling for trying to design a pair encoding for XG3,
    can note the top-used instructions are mostly, it seems (non Ld/St):
      ADD   Rs, 0, Rd    //MOV     Rs, Rd
      ADD   X0, Imm, Rd  //MOV     Imm, Rd
      ADDW  Rs, 0, Rd    //EXTS.L  Rs, Rd
      ADDW  Rd, Imm, Rd  //ADDW    Imm, Rd
      ADD   Rd, Imm, Rd  //ADD     Imm, Rd

    Followed by:
      ADDWU Rs, 0, Rd    //EXTU.L  Rs, Rd
      ADDWU Rd, Imm, Rd  //ADDWu   Imm, Rd
      ADDW  Rd, Rs, Rd   //ADDW    Rs, Rd
      ADD   Rd, Rs, Rd   //ADD     Rs, Rd
      ADDWU Rd, Rs, Rd   //ADDWU   Rs, Rd

    Most every other ALU instruction and usage pattern either follows a
    bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
      SD  Rn, Disp(SP)
      LD  Rn, Disp(SP)
      LW  Rn, Disp(SP)
      SW  Rn, Disp(SP)

      LD  Rn, Disp(Rm)
      LW  Rn, Disp(Rm)
      SD  Rn, Disp(Rm)
      SW  Rn, Disp(Rm)


    For registers, there is a split:
      Leaf functions:
        R10..R17, R28..R31 dominate.
      Non-Leaf functions:
        R10, R18..R27, R8/R9

    For 3-bit configurations:
      R8..R15                             Reg3A
      R18/R19, R20/R21, R26/R27, R10/R11  Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less
    encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant
    scenarios).




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 21 16:21:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Consider "virgin" page, that is neither accessed nor modified.
    Intruction 1 reads the page, instruction 2 modifies it. After
    both are done you should have both bits set. But if miss handling
    for instruction 1 reads page table entry first, but stores after
    store fomr instruction 2 handler, then you get only accessed bit
    and modified flag is lost. Symbolically we could have

    read PTE for instruction 1
    read PTE for instruction 2
    store PTE for instruction 2 (setting Accessed and Modified)
    store PTE for instruction 1 (setting Accessed but clearing Modified)
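
    (A minimal C11 sketch of exactly that interleaving, contrasting a
    plain read-modify-write, which loses the Modified bit as shown above,
    with an atomic OR; bit positions are illustrative.)

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PTE_ACCESSED (1u << 5)   /* illustrative bit positions */
    #define PTE_MODIFIED (1u << 6)

    int main(void)
    {
        /* Plain RMW, in the order listed above: Modified gets lost. */
        uint64_t pte = 0;                        /* "virgin" page     */
        uint64_t h1 = pte;                       /* read PTE, instr 1 */
        uint64_t h2 = pte;                       /* read PTE, instr 2 */
        pte = h2 | PTE_ACCESSED | PTE_MODIFIED;  /* store, instr 2    */
        pte = h1 | PTE_ACCESSED;                 /* store, instr 1    */
        printf("plain RMW: A=%d M=%d\n",
               !!(pte & PTE_ACCESSED), !!(pte & PTE_MODIFIED));

        /* Atomic OR: the later store can no longer clear the bit.   */
        _Atomic uint64_t apte = 0;
        atomic_fetch_or(&apte, PTE_ACCESSED | PTE_MODIFIED);
        atomic_fetch_or(&apte, PTE_ACCESSED);
        uint64_t v = atomic_load(&apte);
        printf("atomic OR: A=%d M=%d\n",
               !!(v & PTE_ACCESSED), !!(v & PTE_MODIFIED));
        return 0;
    }
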
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Aug 21 19:26:47 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 22 16:36:09 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions. Also, to fit the
    design into a single chip, designers moved some functionality,
    like the bus interface, to support chips. A RISC processor with
    mixed 16/32-bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 22 16:45:56 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative
    numbers, but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big, but it wasn't *that* big.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 22 17:21:17 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions.

    Like the multiply instruction in ARM2.

    Also, to fit the
    design into a single chip, designers moved some functionality,
    like the bus interface, to support chips. A RISC processor with
    mixed 16/32-bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.

    Yep, FP support can be expensive and was an extra option
    on the VAX, which also included integer multiply.

    However, I maintain that a ~1977 supermini with a similar sort
    of bus, MMU, floating point unit etc. to the VAX, but with an
    architecture similar to ARM2, plus separate icache and dcache, would
    have beaten the VAX hands-down in performance - it would have taken
    fewer chips to implement, less power, and possibly less time to
    develop. HP showed this was possible some time later.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Aug 23 16:38:47 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative numbers, but 100K seems more likely. It was SLT, individual transistors mounted a few to a package. The /91 was big, but it wasn't *that* big.

    I remember this number, but do not remember where I found it. So
    it may be wrong.

    However, one can estimate possible density in a different way: a
    package of dimensions probably similar to a VAX package can hold about
    100 TTL chips. I do not have detailed data about chip usage and
    transistor counts for each chip. A simple NAND gate is 4 transistors,
    but the input transistor has two emitters and really works like two
    transistors, so it is probably better to count it as 2 transistors,
    and consequently consider a 2-input NAND gate as having 5 transistors.
    So a 74S00 gives 20 transistors. A D-flop is probably about 20-30
    transistors, so a 74S74 is probably around 40-60. A quad D-flop brings
    us close to 100. I suspect that in VAX times octal D-flops were
    available. There were 4-bit ALU slices. Also, multiplexers need a
    nontrivial number of transistors. So I think that 50 transistors is a
    reasonable (maybe low) estimate of average density. Assuming 50
    transistors per chip, that would be 5000 transistors per package.
    Packages were rather flat, so when mounted vertically one could
    probably allocate 1 cm of horizontal space for each. That would allow
    30 packages at a single level. With 7 levels we get 210 packages,
    enough for 1 mln transistors.
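
    (Just the arithmetic from the estimate above, collected in one place;
    all the numbers are the assumptions stated in that paragraph.)

    #include <stdio.h>

    int main(void)
    {
        const long transistors_per_chip = 50;   /* average SSI/MSI TTL  */
        const long chips_per_package    = 100;
        const long packages_per_level   = 30;   /* ~1 cm per package    */
        const long levels               = 7;

        long per_package = transistors_per_chip * chips_per_package;
        long packages    = packages_per_level * levels;
        printf("%ld transistors/package, %ld packages, %ld total\n",
               per_package, packages, per_package * packages);
        return 0;
    }
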
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Aug 25 00:56:26 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
        long i, r;
        for (i=0, r=0; i<n; i++)
            r+=v[i];
        return r;
    }
    arrays:
    MOV Ri,#0
    MOV Rr,#0
    VEC Rt,{}
    LDD Rl,[Rv,Ri<<3]
    ADD Rr,Rr,Rl
    LOOP LT,Ri,Rn,#1
    MOV R1,Rr
    RET

    7 instructions, 1 instruction-modifier; 8 words.

    long a, b, c, d;

    void globals(void)
    {
        a = 0x1234567890abcdefL;
        b = 0xcdef1234567890abL;
        c = 0x567890abcdef1234L;
        d = 0x5678901234abcdefL;
    }

    globals:
    STD 0x1234567890abcdef,[IP,a]
    STD 0xcdef1234567890ab,[IP,b]
    STD 0x567890abcdef1234,[IP,c]
    STD 0x5678901234abcdef,[IP,d]
    RET

    5 instructions, 13 words, 0 .data, 0 .bss

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays  globals      Architecture
      28    66 (34+32)   RV64GC
      27    69           AMD64
      44    84           ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

       libc     ksh    pax     ed
    1102054  124726  66218  26226  riscv-riscv32
    1077192  127050  62748  26550  riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang) versions.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Aug 27 00:56:58 2025
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but runs out of area when
    you put microcode on the same die (area). Thus, RISC was
    born. Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD} all pipelined.

    Also, PDP-11 compatibility depended on microcode.

    Different address modes mainly.

    Without microcode engine one would need parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing microcode
    engine.

    To summarize, it is not clear to me if RISC in VAX technology
    could be significantly faster than VAX especially given constraint
    of PDP-11 compatibility.

    RISC in MSI TTL logic would not have worked all that well.

    OTOH VAX designers probably felt
    that the CISC nature added significant value: they understood
    that the cost of programming was significant and believed that an
    orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler.

    Some of us RISC designers believe similarly {about orthogonal
    ISA not about address modes.}

    They
    probably thought that providing reasonably common procedures
    as microcoded instructions made the work of programmers simpler
    even if the routines were only marginally faster than ordinary
    code.

    We think similarly--but we do not accept µCode being slower
    than SW ISA, or especially compiled HLL.

    Part of this thinking was probably like the "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems; they wanted something with unique
    features that customers will want to use.

    s/used/get locked in on/

    Without
    insight into the future it is hard to say that they were
    wrong.

    It is hard to argue that they made ANY mistakes with
    what we know about the world of computers circa 1977.

    It is not hard in 2025.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 27 10:56:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, near as I can tell it took *at least* 1 clock per instruction byte
    for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline stalls. The microsequencer has some pipelining, overlap read of the next uWord
    with execute of current, which would introduce a branch delay slot into
    the microcode. As it uses the opcode and operand bytes to do N-way jump/call
    to uSubroutines, each of those dispatches might have a branch delay slot too.

    (Similar issues appear in the MV-8000 uSequencer except it appears to
    have 2 or maybe 3 microcode branch delay slots).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 28 07:49:31 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    Note that most of this is microcode ROM. They complicated
    logic to get smaller ROM size. For VAX it was quite different:
    microcode memory (and cache) were build from LSI chips,
    not suitable for logic at that time. Assuming 6 transistor
    static RAM cells VAX had 590000 transistors in microcode memory
    chips (and another 590000 transistors in cache chips).
    Comparatively one can estimate VAX logic chips as between 20000
    and 100000 transistors, with low numbers looking more likely
    to me. IIUC at least early VAX on a "single" chip were slowed
    down by going to off-chip microcode memory.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but runs out of area when
    you put microcode on the same die (area). Thus, RISC was
    born. Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD} all pipelined.

    Yes, but IIUC big item was on-chip microcode memory (or pins
    needed to go to external microcode memory).
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Aug 28 13:39:54 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, near as I can tell it took *at least* 1 clock per instruction byte for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline
    stalls.
    The microsequencer has some pipelining, overlap read of the next uWord
    with execute of current, which would introduce a branch delay slot into
    the microcode. As it uses the opcode and operand bytes to do N-way
    jump/call
    to uSubroutines, each of those dispatches might have a branch delay slot too.

    (Similar issues appear in the MV-8000 uSequencer except it appears to
    have 2 or maybe 3 microcode branch delay slots).

    I found a description of the 780 instruction buffer parser
    in the Data Path description on bitsavers and
    it does in fact pull one operand specifier from IB per clock.
    There is a mux network to handle various immediate formats in parallel.

    There are conflicting descriptions as to exactly how it handles the
    first operand, whether that is pulled with the opcode or in a separate clock, as the IB shifter can only do 1 to 5 byte shifts but an opcode with
    a first operand with 32-bit displacement would be 6 bytes.

    But basically it takes 1 clock for the opcode byte and the first operand specifier byte, a second clock if the first opspec has an immediate,
    then 1 clock for each subsequent operand specifier.
    If an operand has an immediate it is extracted in parallel with its opspec.

    If that is correct a MOV rs,rd or ADD rs,rd would take 2 clocks to decode,
    and a MOV offset(rs),rd would take 3 clocks to decode.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 31 18:04:44 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
        long i, r;
        for (i=0, r=0; i<n; i++)
            r+=v[i];
        return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
        a = 0x1234567890abcdefL;
        b = 0xcdef1234567890abL;
        c = 0x567890abcdef1234L;
        d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
              Bytes                       Instructions
    arrays  globals      Architecture    arrays  globals
      28    66 (34+32)   RV64GC            12       9
      27    69           AMD64             11       9
      44    84           ARM A64           11      22
      32    68           My 66000           8       5

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing, sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance of code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Sun Aug 31 16:43:26 2025
    From Newsgroup: comp.arch

    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    Same here (tho I was on team Debian), but I don't think GNU/Linux
    enthusiasts were the main buyers of those Opteron and
    Athlon64 machines.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sun Aug 31 22:26:43 2025
    From Newsgroup: comp.arch

    On Sun, 31 Aug 2025 16:43:26 -0400, Stefan Monnier wrote:

    ... I don't think GNU/Linux enthusiasts were the main buyers of
    those Opteron and Athlon64 machines.

    Their early popularity would have been in servers. And servers were
    already becoming dominated by Linux in those days.

    “Opteron” was specifically a brand name for server chips, as I recall.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Sep 1 06:07:27 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    Same here (tho I was on team Debian)

    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time,
    and still busily working on their multi-arch (IIRC) plans, so
    eventually I decided to go with Fedora Core 1, which just implemented
    /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after
    (/etc/hostname from 2005-02-20, and IIRC Debian still had not finished hammering out multi-arch at that time), before finally settling in
    Debian-land several years later.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Mon Sep 1 06:57:26 2025
    From Newsgroup: comp.arch

    On Mon, 01 Sep 2025 06:07:27 GMT, Anton Ertl wrote:

    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time
    ... so eventually I decided to go with Fedora Core 1, which just
    implemented /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after ...
    before finally settling in Debian-land several years later.

    Distro-hopping is a long-standing tradition in the Linux world. No other platform comes close.
    --- Synchronet 3.21a-Linux NewsLink 1.2