• Re: VAX

    From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Tue Aug 5 21:13:50 2025
    From Newsgroup: comp.arch

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is a language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, gcc happens to be a
    widely used production compiler. I don't know why
    this time they chose the less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    (I don't know that to be true; an extension has to be documented other
    than by omission. But anyway, if the GCC documentation says somewhere
    something like, "no other identifier is reserved in this version of
    GCC", then it means that the remaining portions of the reserved
    namespaces are available to the program. Since it is undefined behavior
    to use those identifiers (or in certain ways in certain circumstances,
    as the case may be), being able to use them with the documentation's
    blessing constitutes use of a documented extension.)

    I would guess, up until this calendar year.
    Introducing a new extension without a way to disable it is different from supporting gradually introduced extensions, typically with names that
    start with a double underscore and often with __builtin.

    __builtin is also in a standard-defined reserved namespace: the double
    underscore namespace. It is no more or less conservative to name
    something __bitInt than _BitInt.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 6 00:21:25 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...

    Terje



    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs that is extremely rare.
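
    A minimal C sketch of the idea (assuming gcc/clang's unsigned __int128 and
    __builtin_expect; the function and variable names are illustrative, not
    from the thread). Each word's carry-out is taken from a[i]+b[i]+c[i]
    alone, so the common path does not depend on the previous word's full
    sum; the rare correction hides behind a branch the CPU would mispredict:

        #include <stdint.h>
        #include <stddef.h>

        /* r[] = a[] + b[] + c[], n 64-bit words, least significant first. */
        static void add3_predicted(uint64_t *r, const uint64_t *a,
                                   const uint64_t *b, const uint64_t *c,
                                   size_t n)
        {
            unsigned carry = 0;                       /* carry into word i */
            for (size_t i = 0; i < n; i++) {
                unsigned __int128 s = (unsigned __int128)a[i] + b[i] + c[i];
                unsigned predicted = (unsigned)(s >> 64); /* ignores carry-in */
                s += carry;
                r[i] = (uint64_t)s;
                carry = predicted;                    /* chain-breaking common path */
                if (__builtin_expect((unsigned)(s >> 64) != predicted, 0))
                    carry = (unsigned)(s >> 64);      /* rare fix-up: true carry-out */
            }
        }

    The result is still exact; the point is only that the value fed to the
    next iteration normally comes from a sum that excludes the incoming
    carry, turning the word-to-word data dependency into a control
    dependency that can be speculated around.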







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Tue Aug 5 21:25:17 2025
    From Newsgroup: comp.arch

    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a
    library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    It would be unthinkable for GCC to introduce, say, an extension
    using the identifier __libc_malloc.

    In addition to libraries, if some other important project that serves as
    a base package in many distributions happens to claim identifiers in
    those spaces, it wouldn't be wise for GCC (or the C libraries) to start
    taking them away.

    You can't just rename the identifier out of the way in the offending
    package, because that only fixes the issue going forward. Older versions
    of the package can't be compiled with the new compiler without a patch. Compiling older things with newer GCC happens.

    There are always the questions:

    1. Is there an issue? Is anything broken?

    2. If so, is what is broken important enough that it becomes a showstopper
    if the compiler change is rolled out (are major distros on fire)?
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Tue Aug 5 17:41:30 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.

    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 00:49:21 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 17:24:34 +0200, Terje Mathisen wrote:

    ... the problem was all the programs ported from unix which assumed
    that any negative return value was a failure code.

    If the POSIX API spec says a negative return for a particular call is an error, then a negative return for that particular call is an error.

    I can’t imagine this kind of thing blithely being carried over to any non-POSIX API calls.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Tue Aug 5 19:14:48 2025
    From Newsgroup: comp.arch

    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to
    _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code? My best guess is that there is no such code, that
    the only real world uses of the name _BitInt are deliberate uses of the
    new C23 feature, and that gcc's support of _BitInt in non-C23 mode
    will not break anything.

    It is of course possible that I'm wrong.

    If the name _BitInt did break (non-portable) existing C code, then the
    fault would lie with the C committee, not with the gcc maintainers.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Tue Aug 5 20:15:11 2025
    From Newsgroup: comp.arch

    On 8/5/25 17:59, Lawrence D'Oliveiro wrote:
    On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move towards
    workstations within the same company (which would be HARD).

    None of the companies which tried to move in that direction were
    successful. The mass micro market had much higher volumes and lower
    margins, and those accustomed to lower-volume, higher-margin operation
    simply couldn’t adapt.

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.


    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could have done to
    the PC market, if IBM could have been prevented from gaining such market
    dominance.

    IBM had massive marketing clout in the mainframe market. I think that was
    the basis on which customers gravitated to their products. And remember,
    the IBM PC was essentially a skunkworks project that totally went against
    the entire IBM ethos. Internally, it was seen as a one-off mistake that
    they determined never to repeat. Hence the PS/2 range.

    DEC was bigger in the minicomputer market. If DEC could have offered an open-standard machine, that could have offered serious competition to IBM. But what OS would they have used? They were still dominated by Unix-haters then.

    VMS was a heckuva good OS.


    A bit like the /360 strategy, offering a wide range of machines (or CPUs
    and systems) with different performance.

    That strategy was radical in 1964, less so by the 1970s and 1980s. DEC,
    for example, offered entire ranges of machines in each of its various minicomputer families.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Wed Aug 6 04:31:59 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library using _CreamPuff as an identifier, or of a compiler which misbehaves when a
    program uses it, on grounds of it being undefined behavior. Someone
    using _CreamPuff in their code is taking a risk that is vanishingly
    small, the same way that introducing _BitInt is a risk that is
    vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced some identifier is vastly larger than the audience of implementations that a
    given program will face that has introduced some funny identifier.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 05:37:32 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    The plurality of embedded systems are 8 bit processors - about 40
    percent of the total. They are largely used for things like industrial automation, Internet of Things, SCADA, kitchen appliances, etc.

    I believe heart pacemakers run with a 6502 (well, 65C02)

    16 bit processors account for a small, and shrinking, percentage. 32 bit is
    next (IIRC ~30-35%), but 64 bit is the fastest growing. Perhaps surprisingly, there
    is still a small market for 4 bit processors for things like TV remote controls, where battery life is more important than the highest performance.

    There is far more to the embedded market than phones and servers.

    Also, the processors which run in earphones etc...

    Does anybody have an estimate how many CPUs humanity has made
    so far?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 05:50:11 2025
    From Newsgroup: comp.arch

    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount of UNIX, as proof of superiority. The UNIX people
    saw it differently...

    But the _real_ killer application for UNIX wasn't writing patents,
    it was phototypesetting speeches for the CEO of AT&T, who, for
    reasons of vanity, did not want to wear glasses, and it was possible
    to scale the output of the phototypesetter so he would be able
    to read them.

    After somebody pointed out that having confidential speeches on
    one of the most well-known machines in the world, where loads of
    people had dial-up access, was not a good idea, his secretary got
    her own PDP-11 for that.

    And with support from that high up, the project flourished.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 6 05:53:22 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 06:20:57 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 05:37:32 -0000 (UTC), Thomas Koenig wrote:

    Does anybody have an estimate how many CPUs humanity has made so far?

    More ARM chips are made each year than the entire population of the Earth.

    I think RISC-V has also achieved that status.

    Where are they all going??
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 07:28:52 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 05:50:11 -0000 (UTC), Thomas Koenig wrote:

    Using UNIX faced stiff competition from AT&T's internal IT people, who
    wanted to run DEC's operating systems on all PDP-11 within the company (basically, they wanted to kill UNIX).

    But because AT&T controlled Unix, they were able to mould it like putty to their own uses. E.g. look at the MERT project which supported real-time
    tasks (as needed in telephone exchanges) besides conventional Unix ones.
    No way they could do this with an outside proprietary system, like those
    from DEC.

    AT&T also created its own hardware (the 3B range) to complement the
    software in serving those high-availability needs.

    But the _real_ killer application for UNIX wasn't writing patents, it
    was phototypesetting speeches for the CEO of AT&T, who, for reasons of vanity, did not want to wear glasses, and it was possible to scale the
    output of the phototypesetter so he would be able to read them.

    Heck, no. The biggest use for the Unix documentation tools was in the
    legal department, writing up patent applications. troff was just about the only software around that could do automatic line-numbering, which was
    crucial for this purpose.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,comp.lang.c on Wed Aug 6 11:48:09 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 04:31:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com>
    wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring
    to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will
    break any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.


    Exactly.
    The World is a very big place. Even nowadays it is not completely
    transparent. Even those parts that are in theory publicly visible have not
    necessarily been observed recently by any single person, even
    if the person in question is Keith.
    Besides, according to my understanding the majority of gcc users haven't yet
    migrated to gcc 14 or 15.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library
    using _CreamPuff as an identifier, or of a compiler which misbehaves
    when a program uses it, on grounds of it being undefined behavior.
    Someone using _CreamPuff in their code is taking a risk that is
    vanishingly small, the same way that introducing _BitInt is a risk
    that is vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced
    some identifier is vastly larger than the audience of implementations
    that a given program will face that has introduced some funny
    identifier.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 10:24:49 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Of all the major OSes for Alpha, Windows NT was the only one
    that couldn’t take advantage of the 64-bit architecture.

    Actually, Windows took good advantage of the 64-bit architecture:
    "64-bit Windows was initially developed on the Alpha AXP." <https://learn.microsoft.com/en-us/previous-versions/technet-magazine/cc718978(v=msdn.10)>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Wed Aug 6 10:48:51 2025
    From Newsgroup: comp.arch

    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount of UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Bell Telephone's computer center was basically an IBM shop
    before Unix was written, having written BESYS for the IBM 704,
    for instance. They made investments in GE machines around the
    time of the Multics project (e.g., they had a GE 645 and at
    least one 635). The PDP-11 used for Unix was so new that they
    had to wait a few weeks for its disk to arrive.

    Unix escaped out of research, and into the larger Bell System,
    via the legal department, as has been retold many times. It
    spread widely internally after that. After divestiture, when
    AT&T was freed to be able to compete in the computer industry,
    it was seen as a strategic asset.

    But the _real_ killer application for UNIX wasn't writing patents,
    it was phototypesetting speeches for the CEO of AT&T, who, for
    reasons of vanity, did not want to wear glasses, and it was possible
    to scale the output of the phototypesetter so he would be able
    to read them.

    After somebody pointed out that having confidential speeches on
    one of the most well-known machines in the world, where loads of
    people had dial-up access, was not a good idea, his secretary got
    her own PDP-11 for that.

    And with support from that high up, the project flourished.

    While it is true that Charlie Brown's office got a Unix system
    of their own to run troff because its output scaled to large
    sizes, the speeches weren't the data they were worried about
    protecting: those were records from AT&T board meetings.

    At the time, the research PDP-11 used for Unix at Bell Labs was
    not one of the, "most well-known machines in the world, where
    loads of people had dial-up access" in any sense; in the grand
    scheme of things, it was pretty obscure, and had a few dozen
    users. But it was a machine where most users had "root" access,
    and it was agreed that these documents shouldn't be on the
    research machine out of concern for confidentiality.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 10:32:39 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    ILP64 for Cray is documented in <https://en.cppreference.com/w/c/language/arithmetic_types.html>. For
    short int, I don't have a direct reference, only the statement

    |Firstly there was the word size, one rather large size fitted all,
    |integers and floats were represented in 64 bits

    <https://cray-history.net/faq-1-cray-supercomputer-families/faq-3/>

    For the 8-bit characters I found a reference (maybe somewhere else in
    that document), but I do not find it at the moment.

    Followups set to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 6 11:10:46 2025
    From Newsgroup: comp.arch

    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 11:05:30 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t

    On what keywords should these types be based? That's up to the
    implementor. In C23 one could

    typedef signed _BitInt(16) int16_t

    etc. Around 1990, one would have just followed the example of "long
    long" of accumulating several modifiers. I would go for 16-bit
    "short" and 32-bit "long short".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 13:48:17 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 5 Aug 2025 17:24:34 +0200, Terje Mathisen wrote:

    ... the problem was all the programs ported from unix which assumed
    that any negative return value was a failure code.

    If the POSIX API spec says a negative return for a particular call is an error, then a negative return for that particular call is an error.

    Please find a single POSIX API that says a negative return is an error.

    You won't have much success. POSIX explicitly states in most
    cases that the API returns -1 on error (mmap returns MAP_FAILED,
    which happens to be -1 on most implementations; regardless a
    POSIX application _must_ check for MAP_FAILED, not a negative
    return value).
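
    For example (a minimal sketch of the point above, not from the post;
    MAP_ANONYMOUS is near-universal but historically an extension):

        #include <stddef.h>
        #include <stdio.h>
        #include <sys/mman.h>

        void *map_one_page(size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {   /* correct: test the documented sentinel */
                perror("mmap");
                return NULL;
            }
            /* Testing "(intptr_t)p < 0" instead would be wrong: a perfectly
               valid mapping may sit in the upper half of the address space. */
            return p;
        }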

    More misinformation from LDO.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 6 16:19:11 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?

    No, you are not. I skipped pretty much all the setup code. :-)


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...


    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay the
    heavy price of a branch misprediction. But outside of specially crafted
    inputs that is extremely rare.

    Aha!

    That's _very_ nice.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 10:23:26 2025
    From Newsgroup: comp.arch

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.
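
    A behavioral sketch of that banking trick (a C model, not RTL; the names
    are illustrative): two 1R/W banks hold identical copies of the registers,
    each read port is served by its own bank, and every write is broadcast
    to both banks so the copies never diverge:

        #include <stdint.h>

        typedef struct {
            uint32_t bank_a[16];    /* copy serving read port A */
            uint32_t bank_b[16];    /* copy serving read port B */
        } regfile_2r1w;

        static void rf_read2(const regfile_2r1w *rf, unsigned ra, unsigned rb,
                             uint32_t *va, uint32_t *vb)
        {
            *va = rf->bank_a[ra];   /* two different registers can be read */
            *vb = rf->bank_b[rb];   /* in the same cycle, one per bank     */
        }

        static void rf_write(regfile_2r1w *rf, unsigned rd, uint32_t val)
        {
            rf->bank_a[rd] = val;   /* the single write goes to both banks */
            rf->bank_b[rd] = val;   /* at the same address                 */
        }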

    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780 second read port is likely used the same as later VAXen,
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.arch,comp.lang.c on Wed Aug 6 11:54:57 2025
    From Newsgroup: comp.arch

    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.arch,comp.lang.c on Wed Aug 6 11:56:04 2025
    From Newsgroup: comp.arch

    On 2025-08-05 17:25, Kaz Kylheku wrote:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    GCC cannot be implemented in such a way as to create a fully conforming implementation of C when used in connection with an arbitrary
    implementation of the C standard library. This is just one example of a
    more general potential problem: Both gcc and the library must use some
    reserved identifiers, and they might have made conflicting choices.
    That's just one example of the many things that might prevent them from
    being combined to form a conforming implementation of C. It doesn't mean
    that either one is defective. It does mean that the two groups of
    implementors should consider working together to resolve the conflicts.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 16:35:23 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount of UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Hmm... I _think_ it was in a talk given by the UNIX people,
    but I may be misremembering.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 16:34:55 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    The same happened to some extent with the early amd64 machines, which
    ended up running 32bit Windows and applications compiled for the i386
    ISA. Those processors were successful mostly because they were fast at running i386 code (with the added marketing benefit of being "64bit
    ready"): it took 2 years for MS to release a matching OS.

    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:12:32 2025
    From Newsgroup: comp.arch

    On 8/6/2025 6:05 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t


    Well, assuming a post C99 world.


    On what keywords should these types be based? That's up to the
    implementor. In C23 one could

    typedef signed _BitInt(16) int16_t


    Possible, though one can realize that _BitInt(16) is not equivalent to a normal 16-bit integer.

    _BitInt(16) sa, sb;
    _BitInt(32) lc;
    sa=0x5678;
    sb=0x789A;
    lc=sa+sb;

    Would give:
    0xFFFFCF12
    Rather than 0xCF12 (as would be expected with 'short' or similar).

    Because _BitInt(16) would not auto-promote before the addition, but
    rather would produce a _BitInt(16) result which is then widened to 32
    bits via sign extension.
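
    A compilable version of the above (gcc >= 14; note that overflowing a
    signed _BitInt is formally undefined behavior, so the 0xFFFFCF12 relies
    on the wrap-around that gcc produces in practice):

        #include <stdio.h>

        int main(void)
        {
            /* Plain short: both operands promote to int, so the addition
               happens in at least 32 bits and never overflows 16 bits. */
            short ha = 0x5678, hb = 0x789A;
            int   hc = ha + hb;

            /* _BitInt(16): no promotion; the addition is done in 16 bits
               and the 16-bit result is sign-extended to _BitInt(32). */
            _BitInt(16) sa = 0x5678, sb = 0x789A;
            _BitInt(32) lc = sa + sb;

            printf("short      : 0x%X\n", (unsigned)hc);  /* 0xCF12     */
            printf("_BitInt(16): 0x%X\n", (unsigned)lc);  /* 0xFFFFCF12 */
            return 0;
        }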


    etc. Around 1990, one would have just followed the example of "long
    long" of accumulating several modifiers. I would go for 16-bit
    "short" and 32-bit "long short".


    OK.

    Apparently at least some went for "__int32" instead.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 18:22:03 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 8/6/2025 6:05 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t


    Well, assuming a post C99 world.

    'typedef' was around long before C99 happened to
    standardize the aforementioned typedefs.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:12:30 2025
    From Newsgroup: comp.arch

    On 8/6/25 09:47, Anton Ertl wrote:


    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    I'm a time-traveler from the 1960s!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 6 20:06:00 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Wed Aug 6 13:58:51 2025
    From Newsgroup: comp.arch

    James Kuyper <jameskuyper@alumni.caltech.edu> writes:
    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.

    gcc 13.4.0 does not recognize _BitInt at all.

    gcc 14.2.0 handles _BitInt as a language feature in C23 mode,
    and as an "extension" in pre-C23 modes.

    It warns about _BitInt with "-std=c17 -pedantic", but not with
    just "-std=c17". I think I would have preferred a warning with
    "-std=c17", but it doesn't bother me. There's no mention of _BitInt
    as an extension or feature in the documentation. An implementation
    is required to document the implementation-defined value of
    BITINT_MAXWIDTH, so that's a conformance issue. In pre-C23 mode,
    since it's not documented, support for _BitInt is not formally an
    "extension"; it's an allowed behavior in the presence of code that
    has undefined behavior due to its use of a reserved identifier.
    (This is a picky language-lawyerly interpretation.)
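
    A one-file probe of that behavior (the command lines merely restate what
    is observed above for gcc 14.2.0, nothing more):

        /* bitint_probe.c */
        _BitInt(24) x = 1;

        /* gcc -std=c23 -c bitint_probe.c            accepted, no warning
           gcc -std=c17 -c bitint_probe.c            accepted quietly
           gcc -std=c17 -pedantic -c bitint_probe.c  warns about _BitInt */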
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 17:00:03 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.
    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Arrays (FPLAs, an AND-OR matrix) were available in 1975. Mask programmable PLAs were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 6 21:14:07 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.


    Signetics 82S100/101 Field Programmable Logic Arrays (FPLAs, an AND-OR matrix) were available in 1975. Mask programmable PLAs were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
    would have been far to expensive to use to build a RISC CPU,
    especially for one of the BUNCH, for whom backward compatability was
    paramount.

    [*] The machine (Unisys V530) sold for well over a megabuck in
    a single processor configuration.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 17:57:03 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Arrays (FPLAs, an AND-OR matrix) were available in 1975. Mask programmable PLAs were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lars Poulsen@lars@cleo.beagle-ears.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:12:26 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol processing.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:15:54 2025
    From Newsgroup: comp.arch

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't declare its variables to use those word sizes, I don't think "portable" is the right term.

    Perhaps "happens to work on some computers similar to the one it was originally written on."
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lars Poulsen@lars@cleo.beagle-ears.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:32:47 2025
    From Newsgroup: comp.arch

    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't declare its variables to use those word sizes, I don't think "portable" is the
    right term.

    My concern is: how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16-bit type?
    Or did the compiler have native types __int16 etc?
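
    (For illustration, the typedef idiom in question looks roughly like
    the sketch below; the type names and the CRAY macro are only
    assumptions for the example, and the point is that on a Cray-style C
    there is nothing of the right width to put on the right-hand side.)

        /* Typical pre-C99 fixed-width typedefs, selected per platform.
           A minimal sketch: the names and the CRAY macro are illustrative. */
        typedef signed char  int8;        /* assumes char is 8 bits   */
        #ifndef CRAY
        typedef short        int16;       /* assumes short is 16 bits */
        typedef int          int32;       /* assumes int is 32 bits   */
        #else
        /* On a Cray-style C where short, int and long are all 64 bits,
           there is simply no native type to alias int16 or int32 to.  */
        #endif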

    - Lars
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:38:15 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:32:39 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine ...

    But it was not byte-addressable. Its precursor CDC machines had 60-bit
    words, as I recall. DEC’s “large systems” family from around that era (PDP-6, PDP-10) had 36-bit words. And there were likely some other vendors offering 48-bit words, that kind of thing. Maybe some with word lengths
    even longer than 64 bits.

    I was thinking more specifically of machines from the byte-addressable
    era.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:40:48 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:24:49 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Of all the major OSes for Alpha, Windows NT was the only one that
    couldn’t take advantage of the 64-bit architecture.

    Actually, Windows took good advantage of the 64-bit architecture:
    "64-bit Windows was initially developed on the Alpha AXP." <https://learn.microsoft.com/en-us/previous-versions/technet-magazine/cc718978(v=msdn.10)>

    Remember the Alpha was first released in 1992. No shipping version of
    Windows NT ever ran on it in anything other than “TASO” (“Truncated Address-Space Option”, i.e. 32-bit-only addressing) mode.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Wed Aug 6 23:43:12 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:

    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA, which
    used PALs which were not available to the VAX 11/780 designers, so it
    could be clocked a bit higher, but at a multiple of the performance
    of the VAX.

    So, Anton visiting DEC or me visiting Data General could have brought
    them a technology which would significantly outperformed the VAX
    (especially if we brought along the algorithm for graph coloring. Some
    people at IBM would have been peeved at having somebody else "develop"
    this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR
    matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch,alt.folklore.computers on Wed Aug 6 20:21:31 2025
    From Newsgroup: comp.arch

    Robert Swindells wrote:
    On Wed, 06 Aug 2025 14:00:56 GMT, Anton Ertl wrote:

    For comparison:

    SPARC: Berkeley RISC research project between 1980 and 1984;
    <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
    801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
    in May 1982 (but could only run at 0.5MHz). No date for the completion
    of RISC-II, but given that the research project ended in 1984, it was
    probably at that time. Sun developed Berkeley RISC into SPARC, and the
    first SPARC machine, the Sun-4/260 appeared in July 1987 with a 16.67MHz
    processor.

    The Katevenis thesis on RISC-II contains a timeline on p. 6; it lists fabrication in spring '83 with testing during summer '83.

    There is also a bibliography entry of an informal discussion with John
    Cocke at Berkeley about the 801 in June 1983

    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons:
    Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that
    were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an attempt
    to improve performance. Note that this trend began because of the imbalance
    in speeds; it is not clear that architects have asked themselves whether
    this imbalance still holds for their designs."



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 20:41:44 2025
    From Newsgroup: comp.arch

    EricP wrote:

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    ^^^^
    Oops... typo. Should be FPLA.
    PAL or Programmable Array Logic was a slightly different thing,
    also an AND-OR matrix from Monolithic Memories.

    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    And PAL's too. Whatever works and is cheapest.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Charlie Gibbs@cgibbs@kltpzyxm.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 01:36:50 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Peter Flass <Peter@Iron-Spring.com> wrote:

    On 8/6/25 09:47, Anton Ertl wrote:

    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    I'm a time-traveler from the 1960s!

    I'm starting to tell people that I'm a traveller
    from a distant land known as the past.
    --
    /~\ Charlie Gibbs | Growth for the sake of
    \ / <cgibbs@kltpzyxm.invalid> | growth is the ideology
    X I'm really at ac.dekanfrus | of the cancer cell.
    / \ if you read it the right way. | -- Edward Abbey
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 02:22:05 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 20:21:31 -0400, EricP wrote:

    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer,
    1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several
    reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709
    [Cocke80]. The 701 CPU was about ten times as fast as the core main
    memory; this made any primitives that were implemented as
    subroutines much slower than primitives that were instructions. Thus
    the floating point subroutines became part of the 709 architecture
    with dramatic gains. Making the 709 more complex resulted in an
    advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an
    attempt to improve performance. Note that this trend began because
    of the imbalance in speeds; it is not clear that architects have
    asked themselves whether this imbalance still holds for their
    designs."

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex instructions, and are getting better performance with simpler ones.

    How come? Caching.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Thu Aug 7 10:27:40 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that
    were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an attempt
    to improve performance. Note that this trend began because of the imbalance in speeds; it is not clear that architects have asked themselves whether
    this imbalance still holds for their designs."

    At the start of this thread
    <2025Jul29.104514@mips.complang.tuwien.ac.at>, I made exactly this
    argument about the relation between memory speed and clock rate. In
    that posting, I wrote:

    |my guess is that in the VAX 11/780 timeframe, 2-3MHz DRAM access
    |within a row would have been possible. Moreover, the VAX 11/780 has a
    |cache

    In the meantime, this discussion and some additional searching has
    unearthed that the VAX 11/780 memory subsystem has 600ns main memory
    cycle time (apparently without contiguous-access (row) optimization),
    with the cache lowering the average memory cycle time to 290ns.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Thu Aug 7 11:06:06 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There is a citation to Cocke as "private communication" in 1980 by Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then, many "higher-level" instructions have been added to machines in an attempt to improve performance. Note that this trend began because of the imbalance in speeds; it is not clear that architects have asked themselves whether this imbalance still holds for their designs."

    At the start of this thread
    <2025Jul29.104514@mips.complang.tuwien.ac.at>, I made exactly this
    argument about the relation between memory speed and clock rate. In
    that posting, I wrote:

    |my guess is that in the VAX 11/780 timeframe, 2-3MHz DRAM access
    |within a row would have been possible. Moreover, the VAX 11/780 has a
    |cache

    In the meantime, this discussion and some additional searching has
    unearthed that the VAX 11/780 memory subsystem has 600ns main memory
    cycle time (apparently without contiguous-access (row) optimization),

    The memory subsystem was able to operate at bus speed: during a memory
    cycle the memory delivered 64 bits. The bus was 32 bits wide and needed 3 cycles
    (200 ns each) to transfer those 64 bits. Making memory faster would
    require redesigning the bus.

    with the cache lowering the average memory cycle time to 290ns.

    For the processor the miss penalty was 1800 ns (documentation says that
    was due to bus protocol overhead). Cache hit rate was claimed
    to be 95%.
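
    (As a back-of-the-envelope check, assuming the 1800 ns figure is the
    total miss service time and a hit costs one 200 ns cycle, both of
    which are assumptions on my part:)

        #include <stdio.h>

        /* Rough check of the average access time implied by the figures
           above: 95% hit rate, 200 ns on a hit (assumed), 1800 ns total
           on a miss (assumed). */
        int main(void)
        {
            double hit_rate = 0.95;
            double hit_ns   = 200.0;
            double miss_ns  = 1800.0;
            double avg = hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns;
            printf("average access time: %.0f ns\n", avg);  /* 280 ns */
            return 0;
        }

    That comes out at 280 ns, in the same ballpark as the 290 ns average
    memory cycle time quoted above.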
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 10:47:50 2025
    From Newsgroup: comp.arch

    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 7 11:16:20 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever bulit one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    Russians in the late sixties proposed graph coloring as a way of
    memory allocation (and proved that optimal allocation is
    equivalent to graph coloring). They also proposed heuristics
    for graph coloring and experimentally showed that they
    are reasonably effective. This is not the same thing as
    register allocation, but the connection is rather obvious.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 7 11:29:46 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    IIUC the description of the IBM 360-85, it had a pipeline which was much
    more aggressively clocked than the VAX. The 360-85 probably used ECL, but
    at VAX clock speeds it should be easily doable in Schottky TTL
    (used in the VAX).

    The question is could one build this at a commercially competitive price?

    Yes.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:21:56 2025
    From Newsgroup: comp.arch

    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type would make it very hard to adapt portable code, such as TCP/IP protocol processing.
    ...
    My concern is: how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16-bit type?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.
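
    (Something along these lines, as a minimal sketch; the field position
    is invented for the example, and unsigned long is assumed to be at
    least 64 bits, as on the Cray:)

        /* Pick 16- and 32-bit fields out of a 64-bit word using only
           shifts and masks, so nothing depends on a 16- or 32-bit type
           existing.  The field position is invented for illustration. */
        #define FIELD16(word, bitpos)  (((word) >> (bitpos)) & 0xFFFFUL)
        #define FIELD32(word, bitpos)  (((word) >> (bitpos)) & 0xFFFFFFFFUL)

        unsigned long source_port(unsigned long first_word)
        {
            /* e.g. a 16-bit port number assumed to sit in bits 48..63 */
            return FIELD16(first_word, 48);
        }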

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:38:54 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price? There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.
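
    (Very roughly, and not the actual R2000 refill sequence: the software
    path amounts to something like the sketch below, where the trap
    handler indexes a page table and installs the missing entry;
    tlb_write() and the one-level table are assumptions for illustration.)

        /* Hypothetical software TLB refill: the CPU traps here on a TLB
           miss, the handler walks a one-level page table in memory and
           installs the missing translation.  The real R2000 handler is a
           few instructions using dedicated coprocessor registers. */
        #define PAGE_SHIFT 12

        extern unsigned long page_table[];   /* one PTE per virtual page */
        extern void tlb_write(unsigned long vpn, unsigned long pte);

        void tlb_miss_handler(unsigned long fault_vaddr)
        {
            unsigned long vpn = fault_vaddr >> PAGE_SHIFT; /* virtual page number */
            tlb_write(vpn, page_table[vpn]);               /* install, then retry */
        }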

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:59:35 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs would have been far too expensive to use to build a RISC CPU,
    would have been far to expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 7 13:34:26 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:

    ...
    My concern is: how do you express your desire for having e.g. an int16? All the portable code I know defines int8, int16, int32 by means of a typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16-bit type?
    Or did the compiler have native types __int16 etc?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.

    The more likely solution would be to push the protocol processing
    into an attached I/O processor, in those days.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Thu Aug 7 07:26:32 2025
    From Newsgroup: comp.arch

    On 8/6/25 22:29, Thomas Koenig wrote:


    That is one of the things I find astonishing - how a company like
    DG grew from a kitchen-table affair to the size they had.


    Recent history is littered with companies like this. The microcomputer revolution spawned scores of companies that started in someone's garage, ballooned to major presence overnight, and then disappeared - bankrupt,
    bought out, split up, etc. Look at all the players in the S-100 CP/M
    space, or Digital Research. Only a few, like Apple and Microsoft, made
    it out alive.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 7 15:03:23 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    Because you need to sell it. Without disrupting your existing
    customer base.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Thu Aug 7 08:38:56 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the
    Pentium Pro, Intel has had to implement the x86 instruction set as a
    microcoded program running on top of a simpler RISC architecture.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch,alt.folklore.computers on Thu Aug 7 17:52:05 2025
    From Newsgroup: comp.arch

    John Ames wrote:
    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex
    instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the Pentium Pro, Intel has had to implement the x86 instruction set as a microcoded program running on top of a simpler RISC architecture.

    That's simply wrong:

    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and
    spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch,alt.folklore.computers on Thu Aug 7 21:53:11 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 17:52:05 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    John Ames wrote:
    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex
    instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the
    Pentium Pro, Intel has had to implement the x86 instruction set as a
    microcoded program running on top of a simpler RISC architecture.

    That's simply wrong:

    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and
    spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje

    You say "tomato". 8-)

    It's still "microcode" for some definition ... just not a classic
    "interpreter" implementation where a library of routines implements
    the high level instructions.

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache. If they age out, the decoder has to produce them
    again from the "source" x86 instructions.

    So the core is executing microinstructions - not x86 - and the program
    as executed reasonably can be said to be "microcoded" ... again for
    some definition.

    YMMV.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Fri Aug 8 01:57:53 2025
    From Newsgroup: comp.arch

    In article <107008b$3g8jl$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM or DEC.

    Use of UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11s within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the low amount for UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Hmm... I _think_ it was in a talk given by the UNIX people,
    but I may be misremembering.

    I have heard similar stories about DEC, but not AT&T. The Unix
    fortune file used to (in)famously have a quote from Ken Olsen
    about the relative volume of documentation between Unix and VMS
    (reproduced below).

    - Dan C.

    BEGIN FORTUNE<---

    One of the questions that comes up all the time is: How
    enthusiastic is our support for UNIX?
    Unix was written on our machines and for our machines many
    years ago. Today, much of UNIX being done is done on our machines.
    Ten percent of our VAXs are going for UNIX use. UNIX is a simple
    language, easy to understand, easy to get started with. It's great for students, great for somewhat casual users, and it's great for
    interchanging programs between different machines. And so, because of
    its popularity in these markets, we support it. We have good UNIX on
    VAX and good UNIX on PDP-11s.
    It is our belief, however, that serious professional users will
    run out of things they can do with UNIX. They'll want a real system and
    will end up doing VMS when they get to be serious about programming.
    With UNIX, if you're looking for something, you can easily and
    quickly check that small manual and find out that it's not there. With
    VMS, no matter what you look for -- it's literally a five-foot shelf of documentation -- if you look long enough it's there. That's the
    difference -- the beauty of UNIX is it's simple; and the beauty of VMS
    is that it's all there.
    -- Ken Olsen, President of DEC, 1984

    END FORTUNE<---
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Fri Aug 8 03:57:17 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 07:26:32 -0700, Peter Flass wrote:

    On 8/6/25 22:29, Thomas Koenig wrote:

    That is one of the things I find astonishing - how a company like DG
    grew from a kitche-table affair to the size they had.

    Recent history is littered with companies like this.

    DG were famously the setting for that Tracy Kidder book, “The Soul Of A
    New Machine”, chronicling their belated and high-pressure project to enter the 32-bit virtual-memory supermini market and compete with DEC’s VAX.

    Looking at things with the eyes of a software guy, I found some of their hardware decisions questionable. Like they thought they were very clever
    to avoid having separate privilege modes in the processor status register
    like the VAX did: instead, they encoded the access privilege mode in the address itself.

    I guess they thought that 32 address bits left plenty to spare for
    something like this. But I think it just shortened the life of their 32-
    bit architecture by that much more.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Fri Aug 8 06:16:51 2025
    From Newsgroup: comp.arch

    George Neuner <gneuner2@comcast.net> writes:
    On Thu, 7 Aug 2025 17:52:05 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    John Ames wrote:
    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje

    You say "tomato". 8-)

    It's still "microcode" for some definition ... just not a classic "interpreter" implementation where a library of routines implements
    the high level instructions.

    Exactly, for most instructions there is no microcode. There are
    microops, with 118 bits on the Pentium Pro (P6). They are not RISC instructions (no RISC has 118-bit instructions). At best one might
    argue that one P6 microinstruction typically does what a RISC
    instruction does in a RISC. But in the end the reorder buffer still
    has to deal with the CISC instructions.

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendents until the Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but eventually
    (with Core Duo, Core2 Duo) was replaced with P6 descendents that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Fri Aug 8 11:43:00 2025
    From Newsgroup: comp.arch

    On Fri, 8 Aug 2025 03:57:17 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:


    I guess they thought that 32 address bits left plenty to spare for
    something like this. But I think it just shortened the life of their
    32-bit architecture by that much more.


    History proved them right. The Eagle series didn't last long enough to
    run out of its 512MB address space.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Aug 8 10:08:43 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.
    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    I don't know why they think these are problems with the 82S100.
    These complaints sound like they come from a hobbyist.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton

    Yes. This risc-VAX would have to decode 1 instruction per clock to
    keep a pipeline full, so I envision running the fetch buffer
    through a bank of those PLAs and generating a uOp out.

    I don't know whether the instructions can be byte-aligned and variable-size
    or have to be a fixed 32 bits in order to meet performance requirements.
    I would prefer the flexibility of variable size but
    the Fetch byte alignment shifter adds a lot of logic.

    If variable then the high frequency instructions like MOV rd,rs
    and ADD rsd,rs fit into two bytes. The longest instruction looks like
    12 bytes, 4 bytes operation specifier (opcode plus registers)
    plus 8 bytes immediate FP64.

    If a variable-size instruction arranges that all the critical parse
    information is located in the first 8-16 bits, then we can just run
    those bits through PLAs in parallel and have that control the
    alignment shifter as well as generate the uOp.
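
    (As a sketch of what such a PLA would compute, written in C with a
    made-up encoding in which the first byte alone fixes the total length,
    so the alignment shift amount and the uOp template can be selected in
    parallel:)

        /* Hypothetical length decode for a byte-aligned variable-length
           encoding where the first byte determines instruction size:
           2 bytes for the frequent reg-reg forms (MOV rd,rs / ADD rsd,rs),
           4 bytes for a full operation specifier, 12 bytes for a 4-byte
           specifier plus an 8-byte FP64 immediate.  The opcode ranges are
           invented; the real thing would be a PLA, not C. */
        static int insn_length(unsigned char first_byte)
        {
            if (first_byte < 0x40)        /* two-byte reg-reg group     */
                return 2;
            else if (first_byte < 0xF0)   /* four-byte specifier group  */
                return 4;
            else                          /* specifier + FP64 immediate */
                return 12;
        }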

    I envision this Fetch buffer alignment shifter built from tri-state
    buffers rather than muxes as TTL muxes are very slow and we would
    need a lot of them.

    The whole fetch-parse-decode should fit on a single board.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 19:48:59 2025
    From Newsgroup: comp.arch

    On Fri, 08 Aug 2025 06:16:51 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    George Neuner <gneuner2@comcast.net> writes:

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendents until the Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but eventually
    (with Core Duo, Core2 Duo) was replaced with P6 descendents that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton

    Thanks for the correction. I did a fair amount of SIMD coding for
    Pentium II, III and IV, so was more aware of their architecture. After
    the IV, I moved on to other things so haven't kept up.

    Question:
    It would seem that, lacking the microop cache, the decoder would need
    to be involved, e.g., for every iteration of a loop, and there would
    be more pressure on I$1. Did these prove to be a bottleneck for the
    models lacking cache? [either? or something else?]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 21:43:11 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:23:26 -0400, EricP
    <ThatWouldBeTelling@thevillage.com> wrote:

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. To me the most likely is fast RAM, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where
    destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.
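
    (In C terms the two-bank trick looks roughly like the following;
    purely a behavioural model with invented names:)

        /* Behavioural model of a 2R1W register file built from two 1RW
           banks: each read port reads its own bank, and a write goes to
           both banks at the same index, so the banks always hold
           identical contents. */
        #define NREGS 16

        static unsigned long bank_a[NREGS], bank_b[NREGS];

        void regfile_cycle(int ra, int rb,               /* read addresses */
                           unsigned long *da, unsigned long *db,
                           int we, int rw, unsigned long wdata)
        {
            *da = bank_a[ra];            /* read phase: one bank per port */
            *db = bank_b[rb];
            if (we) {                    /* write phase: both banks       */
                bank_a[rw] = wdata;
                bank_b[rw] = wdata;
            }
        }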

    I was aware of this (thank you), but I was trying to figure out why
    the VAX - particularly early ones - would need it. And also it does
    not mesh with Waldek's comment [at top] about 3 copies.


    The VAX did have one (pathological?) address mode:

    displacement deferred indexed @dis(Rn)[Rx]

    in which Rn and Rx could be the same register. It is the only mode
    for which a single operand could reference a given register more than
    once. I never saw any code that actually did this, but the manual
    does say it was possible.

    But even with this situation, it seems that the register would only
    need to be read once (per operand, at least) and the value could be
    used twice.
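
    (For reference, the effective-address calculation for that mode works
    out to roughly the sketch below, with read_long() standing in for a
    32-bit memory read; note that Rn and Rx each need to be read only
    once even when they name the same register, which is the point above.)

        /* Effective address of the VAX mode @dis(Rn)[Rx] (displacement
           deferred, indexed): the longword at Rn+dis holds a base
           address, to which Rx scaled by the operand size is added.
           This is a behavioural sketch only. */
        extern unsigned int read_long(unsigned int addr);

        unsigned int ea_disp_deferred_indexed(unsigned int rn, int dis,
                                              unsigned int rx, int opsize)
        {
            unsigned int base = read_long(rn + dis);  /* one read of Rn */
            return base + rx * opsize;                /* one read of Rx */
        }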


    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780 second read port is likely used the same as later VAXen:
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDLL (r2)+, (r2)+, (r2)+

    You mean exceptions? Exceptions were handled between instructions.
    VAX had no iterating string-copy/move instructions so every
    instruction logically could stand alone.

    VAX separately identified the case where the instruction completed
    with a problem (trap) from where the instruction could not complete
    because of the problem (fault), but in both cases it indicated the
    offending instruction.


    the first (left) operand reads r2 then adds 4; the second operand reads
    that updated r2 and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.
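
    (To spell out the semantics being preserved here, a behavioural sketch
    in C; mem[] stands in for longword memory, addresses are assumed
    longword-aligned, and the temp-register bookkeeping that makes the
    instruction restartable is left out:)

        /* Behavioural effect of the three-operand add above with all
           operands (r2)+ : each operand evaluation uses the current r2
           as an address and then bumps it by 4, so the two sources and
           the destination are three consecutive longwords and r2 ends
           up advanced by 12. */
        extern unsigned int mem[];

        void add3_autoincrement(unsigned int *r2)
        {
            unsigned int src1 = mem[*r2 / 4];  *r2 += 4;  /* first  (r2)+        */
            unsigned int src2 = mem[*r2 / 4];  *r2 += 4;  /* second (r2)+        */
            mem[*r2 / 4] = src1 + src2;        *r2 += 4;  /* third  (r2)+ = dest */
        }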

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 08:07:12 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Sat Aug 9 09:04:40 2025
    From Newsgroup: comp.arch

    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not, but that's a LOT of
    speculation with hindsight-colored glasses. Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not. But as with all alternate history, this is
    completely unknowable.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 10:00:54 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    Sure. IBM was in less than no hurry to make a product out of
    the 801.


    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Aug 9 10:03:29 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,
    - with 8 optional XOR output invertors,
    - driving 8 tri-state or open collector buffers.

    So I count roughly 7 or 8 equivalent gate delays.
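
    As a software model, the whole AND-OR plane is easy to mimic; the
    sketch below is a generic FPLA-style model in C with the same
    16/48/8 dimensions. It is not bit-exact to the 82S100's programming
    format, and the programmed example terms are invented.

    /* Rough software model of a small FPLA-style AND-OR plane,
       in the spirit of the 82S100 (16 inputs, 48 product terms, 8 outputs).
       Not bit-exact to the real part; it just illustrates the structure. */
    #include <stdint.h>
    #include <stdio.h>

    #define NPT   48
    #define NOUT   8

    struct pla {
        uint16_t and_true[NPT];   /* product term uses input as-is        */
        uint16_t and_comp[NPT];   /* product term uses input inverted     */
        uint64_t or_plane[NOUT];  /* which product terms feed each output */
        uint8_t  xor_out;         /* per-output programmable inversion    */
    };

    static uint8_t pla_eval(const struct pla *p, uint16_t in)
    {
        uint64_t pt = 0;
        for (int t = 0; t < NPT; t++) {
            /* term is true if every selected true-input is 1 and every
               selected complemented input is 0 */
            int ok = ((in & p->and_true[t]) == p->and_true[t]) &&
                     ((~in & p->and_comp[t]) == p->and_comp[t]);
            if (ok)
                pt |= 1ull << t;
        }
        uint8_t out = 0;
        for (int o = 0; o < NOUT; o++)
            if (pt & p->or_plane[o])
                out |= 1u << o;
        return out ^ p->xor_out;
    }

    int main(void)
    {
        /* tiny example: output 0 = (in0 AND NOT in1), output 1 = in2 */
        struct pla p = {0};
        p.and_true[0] = 0x0001; p.and_comp[0] = 0x0002; p.or_plane[0] = 1ull << 0;
        p.and_true[1] = 0x0004;                         p.or_plane[1] = 1ull << 1;

        printf("%02x\n", pla_eval(&p, 0x0001));  /* in0=1, in1=0      -> 01 */
        printf("%02x\n", pla_eval(&p, 0x0007));  /* in0=in1=in2=1     -> 02 */
        return 0;
    }
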
    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 20:54:07 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.


    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch on Sat Aug 9 14:57:03 2025
    From Newsgroup: comp.arch

    On 8/9/25 1:54 PM, Thomas Koenig wrote:

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs?

    using typicals was a rookie mistake
    also not comparing delay times across vendors


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Sun Aug 10 12:06:46 2025
    From Newsgroup: comp.arch

    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    [snip]
    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    Ok. Sure.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Heh. :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    Absolutely.

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Similarly for other minicomputer companies.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.

    Plausibility is orthogonal to whether a thing is knowable.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Aug 10 15:18:23 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    <snip>

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer. Considerable
    internal resources were being applied to the Jupiter project
    at the end of the 1970s to support a wider range of applications.

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    Fundamentally, 36-bit words ended up being a dead-end.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Aug 10 21:01:50 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started,

    That is true. Reading https://acg.cis.upenn.edu/milom/cis501-Fall11/papers/cocke-RISC.pdf
    is instructive (I liked the potentially tongue-in-cheek "Regular
    Instruction Set-Computer" name for their instruction set).

    and even
    fewer would have believed it absent a working prototype,

    The simulation approach that IBM took is interesting. They built
    a fast simulator, translating one 801 instruction into one (or
    several) /370-instructions on the fly, with a fixed 32-bit size.
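
    Just to illustrate the flavour of such a simulator, here is a
    generic decode-once-then-dispatch sketch in C. The 32-bit mini-ISA
    (ADD/ADDI/HALT) and its encoding are invented for the example; this
    is not IBM's tool and not the 801 encoding.

    /* Decode each fixed-size 32-bit guest instruction once into a host
       handler plus operands, then run through the predecoded form. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ADD = 0, OP_ADDI = 1, OP_HALT = 2 };

    struct decoded {
        void (*handler)(const struct decoded *);
        uint8_t rd, rs1, rs2;
        int16_t imm;
    };

    static int32_t reg[32];
    static int running = 1;

    static void do_add (const struct decoded *d) { reg[d->rd] = reg[d->rs1] + reg[d->rs2]; }
    static void do_addi(const struct decoded *d) { reg[d->rd] = reg[d->rs1] + d->imm; }
    static void do_halt(const struct decoded *d) { (void)d; running = 0; }

    /* made-up guest encoding: op:8 rd:8 rs1:8 rs2-or-imm:8 */
    static void decode(uint32_t insn, struct decoded *d)
    {
        uint8_t op = insn >> 24;
        d->rd  = (insn >> 16) & 0xff;
        d->rs1 = (insn >>  8) & 0xff;
        d->rs2 = insn & 0xff;
        d->imm = (int8_t)(insn & 0xff);
        d->handler = (op == OP_ADD) ? do_add : (op == OP_ADDI) ? do_addi : do_halt;
    }

    int main(void)
    {
        uint32_t guest[] = {
            0x01010005u,  /* ADDI r1, r0, 5  */
            0x01020007u,  /* ADDI r2, r0, 7  */
            0x00030102u,  /* ADD  r3, r1, r2 */
            0x02000000u,  /* HALT            */
        };
        struct decoded prog[4];

        for (int i = 0; i < 4; i++)        /* "translate" once ...        */
            decode(guest[i], &prog[i]);

        for (int pc = 0; running; pc++)    /* ... then run the fast form  */
            prog[pc].handler(&prog[pc]);

        printf("r3 = %d\n", reg[3]);       /* prints 12 */
        return 0;
    }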


    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 11 08:17:48 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    The reasons why this once-common architectural style died out are:

    * 18-bit addresses

    * word addressing

    Sure, one could add 36-bit byte addresses to such an architecture
    (probably with 9-bit bytes to make it easy to deal with words), but it
    would force a completely different ABI and API, so the legacy code
    would still have no good upgrade path and would be limited to its
    256KW address space no matter how much actual RAM there is available.
    IBM decided to switch from this 36-bit legacy to the 32-bit
    byte-addressed S/360 in the early 1960s (with support for their legacy
    lines built into various S/360 implementations); DEC did so when they introduced the VAX.

    Concerning other manufacturers:

    <https://en.wikipedia.org/wiki/36-bit_computing> tells me that the
    GE-600 series was also 36-bit. It continued as Honeywell 6000 series <https://en.wikipedia.org/wiki/Honeywell_6000_series>. Honeywell
    introduced the DPS-88 in 1982; the architecture is described as
    supporting the usual 256KW, but apparently the DPS-88 could be bought
    with up to 128MB; programming that probably was no fun. Honeywell
    later sold the NEC S1000 as DPS-90, which does not sound like the
    Honeywell 6000 line was a growing business. And that's the last I
    read about the Honeywell 6000 line.

    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems. <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:

    |In addition to the IX (1100/2200) CPUs [...], the architecture had
    |Xeon [...] CPUs. Unisys' goal was to provide an orderly transition for
    |their 1100/2200 customers to a more modern architecture.

    So they continued to support it for a long time, but it's a legacy
    thing, not a future-oriented architecture.

    The Wikipedia article also mentions the Symbolics 3600 as 36-bit
    machine, but that was quite different from the 36-bit architectures of
    the 1950s and 1960s: The Symbolics 3600 has 28-bit addresses (the rest apparently taken by tags) and its successor Ivory has 32-bit addresses
    and a 40-bit word. Here the reason for its demise was the AI winter
    of the late 1980s and early 1990s.

    DEC did the right thing when they decided to support VAX as *the*
    future architecture, and the success of the VAX compared to the
    Honeywell 6000 and Univac 1100/2200 series demonstrates this.

    RISC-VAX would have been better than the PDP-10, for the same reasons:
    32-bit addresses and byte addressing. And in addition, the
    performance advantage of RISC-VAX would have made the position of
    RISC-VAX compared to PDP-10 even stronger.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Aug 11 14:51:20 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    In a sense, they still live in the Unisys Clearpath systems.


    The reason why this once-common architectural style died out are:

    * 18-bit addresses

    An issue for PDP-10, certainly. Not so much for the Univac
    systems.



    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems.
    <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:


    I spent 14 years at Burroughs/Unisys (on the Burroughs side, mainly).

    Yes, two of the six mainframe lines still exist (albeit in emulation);
    one 48-bit, the other 36-bit.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 11 17:27:30 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting link, thanks!


    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    They were considering byte-addressability; interesting. It is also
    slightly funny that a 9-bit byte address would be made up of
    30 bits of virtual address and 2 bits of byte address, i.e.
    a 32-bit address in total.
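
    A toy model of that addressing, in C. The packing of four 9-bit
    bytes per 36-bit word and the left-to-right byte numbering are my
    own arbitrary choices for illustration, not the layout in the
    Jupiter spec.

    /* 9-bit-byte addressing on a 36-bit-word machine: a 32-bit "byte
       address" = 30-bit word number + 2-bit byte select.  Words are
       held right-justified in a uint64_t on the host. */
    #include <stdint.h>
    #include <stdio.h>

    #define WORD_MASK 0xFFFFFFFFFULL       /* low 36 bits */

    static uint64_t memory[4];             /* tiny 4-word "memory" */

    static unsigned load_byte9(uint32_t byteaddr)
    {
        uint32_t word  = byteaddr >> 2;    /* 30-bit word number          */
        unsigned sel   = byteaddr & 3;     /* which 9-bit byte            */
        unsigned shift = (3 - sel) * 9;    /* byte 0 = leftmost (my pick) */
        return (memory[word] >> shift) & 0x1FF;
    }

    int main(void)
    {
        /* word 1 holds four 9-bit bytes: 0x155, 0x0AA, 0x1FF, 0x003 */
        memory[1] = ((uint64_t)0x155 << 27 | (uint64_t)0x0AA << 18 |
                     (uint64_t)0x1FF << 9  | 0x003) & WORD_MASK;

        for (uint32_t a = 4; a < 8; a++)   /* byte addresses 4..7 = word 1 */
            printf("byte %u = %03x\n", (unsigned)a, load_byte9(a));
        return 0;
    }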

    Fundamentally, 36-bit words ended up being a dead-end.

    Pretty much so. It was a pity for floating-point, where they had
    more precision than the 32-bit words (and especially the horrible
    IBM format).

    But byte addressability and power of two won.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 12 15:02:04 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that
    instruction fetching is done concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.
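
    For the record, the geometry those numbers imply, as a
    back-of-the-envelope sketch in C; the 4 KB line at the end is only
    there to show how much further the same 128 entries would reach
    with larger pages.

    /* 8 KB cache, 8-byte lines, 2-way; 128-entry TLB, 512-byte pages. */
    #include <stdio.h>

    int main(void)
    {
        int cache_bytes = 8 * 1024, line = 8, ways = 2;
        int sets = cache_bytes / (line * ways);        /* 512 sets     */
        int tlb_entries = 128, page = 512;
        long tlb_reach = (long)tlb_entries * page;     /* 64 KB mapped */

        printf("cache: %d sets x %d ways x %d-byte lines\n", sets, ways, line);
        printf("TLB reach: %ld KB\n", tlb_reach / 1024);
        printf("TLB reach at 4 KB pages: %ld KB\n",
               (long)tlb_entries * 4096 / 1024);
        return 0;
    }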

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780,
    11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 12 15:59:32 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs    MHz   CPI    Machine
       1     5    10     11/780
       4    12.5   6.25  8600
       6    22.2   7.4   8700
      35    90.9   5.1   NVAX+

     SPEC92   MHz  VAX CPI  Machine
        1/1     5  10/10    VAX 11/780
    133/200   200   3/2     Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 are anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.
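
    For anyone who wants to redo the arithmetic: the derivation is just
    MHz divided by estimated native MIPS, taking the anecdotal
    1 VUP = 0.5 MIPS (i.e. 10 CPI at 5 MHz on the 11/780) as the base.

    /* CPI derived from VUPs and clock, using the anecdotal 11/780
       baseline of 1 VUP = 0.5 native MIPS (10 CPI at 5 MHz). */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *name; double vups, mhz; } m[] = {
            { "11/780",  1,  5    },
            { "8600",    4, 12.5  },
            { "8700",    6, 22.2  },
            { "NVAX+",  35, 90.9  },
        };
        for (int i = 0; i < 4; i++) {
            double mips = m[i].vups * 0.5;     /* native MIPS estimate   */
            double cpi  = m[i].mhz / mips;     /* cycles per instruction */
            printf("%-8s %5.1f MHz  %4.1f VUPs  CPI %.2f\n",
                   m[i].name, m[i].mhz, m[i].vups, cpi);
        }
        return 0;
    }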

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    I doubt that they could afford 1-cycle multiply

    Yes, one might do a multiplier and divider with its own sequencer (and
    more sophisticated in later implementations), and with any user of the
    result stalling the pipeline until that is complete, and any
    following user of the multiplier or divider stalling the pipeline
    until it is free again.

    The idea of providing multiply-step instructions and using a bunch of
    them was short-lived; already the MIPS R2000 included a multiply
    instruction (with its own sequencer), HPPA has multiply-step as well
    as an FPU-based multiply from the start. The idea of avoiding divide instructions had a longer life. MIPS has divide right from the start,
    but Alpha and even IA-64 avoided it. RISC-V includes divide in the M
    extension that also gives multiply.

    or
    even a barrel shifter.

    Five levels of 32-bit 2->1 muxes might be doable, but would that be
    cost-effective?
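
    For concreteness, here is what those two structures compute, as a C
    sketch: a 32-bit logical left shift as five ranks of 2->1 muxes,
    and a radix-2 shift-and-add loop of the sort a multiply-step
    instruction performed one iteration of per issue. (Only the logic
    is modelled; nothing here says anything about the TTL cost.)

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t barrel_shl(uint32_t x, unsigned amount)
    {
        /* each line is one mux level: pass x through or shift it */
        x = (amount & 16) ? x << 16 : x;
        x = (amount &  8) ? x <<  8 : x;
        x = (amount &  4) ? x <<  4 : x;
        x = (amount &  2) ? x <<  2 : x;
        x = (amount &  1) ? x <<  1 : x;
        return x;
    }

    static uint32_t mulstep32(uint32_t a, uint32_t b)
    {
        /* 32 iterations of shift-and-add; one iteration is roughly
           what a multiply-step instruction did */
        uint32_t prod = 0;
        for (int i = 0; i < 32; i++) {
            if (b & 1)
                prod += a;
            a <<= 1;
            b >>= 1;
        }
        return prod;               /* low 32 bits of a*b */
    }

    int main(void)
    {
        printf("%08x\n", barrel_shl(0x00000001, 31));  /* 80000000 */
        printf("%u\n",   mulstep32(1234, 5678));       /* 7006652  */
        return 0;
    }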

    It is accepted in this era that using more hardware could
    give substantial speedup. IIUC IBM used a quadratic rule:
    performance was supposed to be proportional to square of
    CPU price. That was partly marketing, but partly due to
    compromises needed in smaller machines.

    That's more of a 1960s thing, probably because low-end S/360
    implementations used all (slow) tricks to minimize hardware. In the
    VAX 11/780 environment, I very much doubt that it is true. Looking at
    the early VAXen, you get the 11/730 with 0.3 VUPs up to the 11/784
    with 3.5 VUPs (from 4 11/780 CPUs). sqrt(3.5/0.3)=3.4. I very much
    doubt that you could get an 11/784 for 3.4 times the price of an
    11/730.

    Searching a little, I find

    |[11/730 is] to be a quarter the price and a quarter the performance of
    |a grown-up VAX (11/780) <https://retrocomputingforum.com/t/price-of-vax-730-with-vms-the-11-730-from-dec/3286>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 11:25:24 2025
    From Newsgroup: comp.arch

    In article <107b1bu$252qo$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    Sure. I wasn't disputing that, just saying that I don't think
    it mattered that much.

    [snip]
    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Well, then we're definitely into the unknowable. :-)

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 13 14:18:06 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.
    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.
    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    I didn't use the typical values. Yes, it would be dangerous to use them.
    I never understood why they even quoted those typical numbers.
    I always considered them marketing fluff.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?

    I wasn't suggesting that. People used to modern CMOS speeds might not appreciate how slow TTL was. I was showing that its 50 ns speed number
    was not out of line with other MSI parts of that day, and just happened
    to have a PDF TTL manual opened on that part so used it as an example.
    A 74181 4-bit ALU is also of similar complexity and 62 ns max.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 13 14:40:01 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780, 11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton

    Yes I saw the Patt paper recently. He has written many microarchitecture papers. I was surprised that in 1990 he would say on page 2:

    "All VAXes are microcoded. The richness of the instruction set urges that
    the flexibility of microcoded control be employed, notwithstanding the conventional mythology that hardwired control is somehow faster than
    microcode. It is instructive to point out that (1) hardwired control
    produces higher performance execution only in situations where the
    critical path is in the microsequencing function, and (2) that this
    should not occur in VAX implementations if one designs with the
    well-understood (to microarchitects) technique that the next control
    store address must be obtained from information available at the start
    of the current microcycle. A variation of this basic old technique is
    the recently popularized delayed branch present in many ISA architectures introduced in the last few years."

    When he refers to the "mythology that hardwired control is somehow faster"
    he appears to still be using the monolithic "eyes" I referred to earlier
    in that everything must go through a single microsequencer.
    He compares a hardwired sequential controller to a microcoded sequential controller and notes that in that case hardwired is no faster.

    What he is not doing is comparing multiple parallel hardware stages
    to a sequential controller, hardwired or microcoded.

    Risc brings with it the concurrent hardware stages view.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 13 12:09:35 2025
    From Newsgroup: comp.arch

    On 8/13/25 11:26, Ted Nolan <tednolan> wrote:
    In article <2025Aug13.194659@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide
    8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100

    That's completely different from what I suggest above, and DEC
    obviously did not capture the PC market with that.


    They did manage to crack the college market some where CS departments
    had DEC hardware anyway. I know USC (original) had a Rainbow computer
    lab circa 1985. That "in" didn't translate to anything else though.

    Skidmore College was a DEC shop back in the day.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Wed Aug 13 19:35:09 2025
    From Newsgroup: comp.arch

    In comp.arch Scott Lurndal <scott@slp53.sl.home> wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Stephen Fuld wrote:
    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/
    scientific computing the way some folks here do, I have not gotten the
    impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it
    initially divided the 4GB address space in 2 GB for the OS, and 2GB for
    users. Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users. I am sure others here will know more details.

    Any program written to Microsoft/Windows spec would work transparently
    with a 3:1 split, the problem was all the programs ported from unix
    which assumed that any negative return value was a failure code.

    The only interfaces that I recall this being an issue for were
    mmap(2) and lseek(2). The latter was really related to maximum
    file size (although it applied to /dev/[k]mem and /proc/<pid>/mem
    as well). The former was handled by the standard specifying
    MAP_FAILED as the return value.

    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    I remember RIM. When I compiled it on Linux and tried it I got an error
    due to a check for "< 0". Changing it to "== -1" fixed it. Possibly there
    were similar troubles in other programs that I do not remember.
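
    The distinction in code, for the record (plain POSIX; the file name
    is just an example): open() and lseek() document -1 and (off_t)-1
    as their failure values, while mmap() documents MAP_FAILED. Code
    that squeezed the returned pointer into a signed integer and tested
    "< 0" broke as soon as valid addresses crossed the 2 GB mark.

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        int fd = open("/etc/hostname", O_RDONLY);   /* example file */
        if (fd == -1) {                    /* compare with -1, not "< 0" */
            fprintf(stderr, "open: %s\n", strerror(errno));
            return 1;
        }

        off_t len = lseek(fd, 0, SEEK_END);
        if (len == (off_t)-1) {            /* documented failure value */
            fprintf(stderr, "lseek: %s\n", strerror(errno));
            return 1;
        }

        void *p = mmap(NULL, (size_t)len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {             /* NOT "(long)p < 0" */
            fprintf(stderr, "mmap: %s\n", strerror(errno));
            return 1;
        }

        printf("mapped %ld bytes at %p\n", (long)len, p);
        munmap(p, (size_t)len);
        close(fd);
        return 0;
    }
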
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 13 20:23:53 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375s
    (presumably more expensive) or a 74LS377 plus an eight-bit
    inverter chip (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From drb@drb@ihatespam.msu.edu (Dennis Boone) to comp.arch,alt.folklore.computers on Thu Aug 14 17:12:40 2025
    From Newsgroup: comp.arch

    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a
    total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    Maybe compare 808x to something more in its weight class? The 8-bit
    8080 was 1974, 16-bit 8086 1978, 16/8-bit 8088 1979.

    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    De
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch,alt.folklore.computers on Thu Aug 14 15:22:46 2025
    From Newsgroup: comp.arch

    Dennis Boone wrote:
    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    Maybe compare 808x to something more in its weight class? The 8-bit
    8080 was 1974, 16-bit 8086 1978, 16/8-bit 8088 1979.

    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    De

    For those interested in a blast from the past, on the Wikipedia WD16 page https://en.wikipedia.org/wiki/Western_Digital_WD16

    is a link to a copy of Electronic Design magazine from 1977 which
    has a set of articles on microprocessors starting on page 60.

    It's a nice summary of the state of the microprocessor world circa 1977.

    https://www.worldradiohistory.com/Archive-Electronic-Design/1977/Electronic-Design-V25-N21-1977-1011.pdf

    Table 1 General Purpose Microprocessors on pg 62 shows 8 different
    16-bit microprocessor chip sets including the WD16.

    Table 3 on pg 66 shows ~11 bit-slice families that can be used to build
    larger microcoded processors, such as AMD 2900 4-bit slice series.

    It also has many data sheets on various micros starting on pg 88
    and 16-bit ones starting on pg 170, mostly chips you never heard
    of, like the Ferranti F100L, but also some you'll know like the
    Data General MicroNova mN601 on page 178.
    The Western Digital WD-16 is on pg 190.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Thu Aug 14 12:59:00 2025
    From Newsgroup: comp.arch

    On 8/14/25 10:12 AM, Dennis Boone wrote:
    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    The only single die PDP-11 DEC produced was the T-11 and it didn't
    have an MMU

    The J-11 is a Harris two chip hybrid, and is in a >40 pin chip carrier. http://simh.trailing-edge.com/semi/j11.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 15 03:20:56 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that
    instruction fetching is done concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.
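
    Spelling out that arithmetic, as a trivial C sketch of the page
    counts a 128 KB minimum configuration gives at various page sizes:

    #include <stdio.h>

    int main(void)
    {
        long mem = 128 * 1024L;                 /* minimal 128 KB config */
        int sizes[] = { 512, 1024, 2048, 4096 };

        for (int i = 0; i < 4; i++)
            printf("%4d-byte pages: %4ld pages in 128 KB\n",
                   sizes[i], mem / sizes[i]);
        return 0;                               /* 256, 128, 64, 32 */
    }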

    _For current machines_ there are reasons to use bigger pages, but
    in the VAX's time bigger pages would almost surely have led to higher
    memory use and consequently to a higher price for the end user. In
    effect the machine would have been much less competitive.

    BTW: Long ago I saw a message about porting an application from
    VAX to Linux. On the VAX the application ran OK in 1GB of memory.
    On 32-bit Intel architecture Linux with 1 GB there was excessive
    paging. The reason was the much smaller number of bigger pages.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 15 05:07:01 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Aug 15 12:57:35 2025
    From Newsgroup: comp.arch

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    The VAX was built to be a commercial product. As such, it was
    designed to be successful in the market. But in order to be
    successful in the market, it was important that the designers be
    informed by the business landscape at both the time they were
    designing it, and what they could project would be the lifetime
    of the product. Those are considerations that extend beyond
    the purely technical aspects of the design, and are both more
    speculative and more abstract.

    Consider how the business criteria might influence the technical
    design, and how these might play off of one another: obviously,
    DEC understood that the PDP-11 was growing ever more constrained
    by its 16-bit address space, and that any successor would have
    to have a larger address space. From a business perspective, it
    made no sense to create a VAX with a 16-bit address space.
    Similarly, they could have chosen (say) a 20, 24, or 28 bit
    address space, or used segmented memory, or any number of other
    such decisions, but the model that they did choose (basically a
    flat 32-bit virtual address space: at least as far as the
    hardware was concerned; I know VMS did things differently) was
    ultimately the one that "won".

    Of course, those are obvious examples. What I'm contending is
    that the business<->technical relationship is probably deeper
    and that business has more influence on technology than we
    realize, up to and including the ISA design. I'm not saying
    that the business folks are looking over the engineers'
    shoulders telling them how the opcode space should be arranged,
    but I am saying that they're probably going to engineering with
    broad-strokes requirements based on market analysis and customer
    demand. Indeed, we see examples of this now, with the addition
    of vector instructions to most major ISAs. That's driven by the
    market, not merely engineers saying to each other, "you know
    what would be cool? AVX-512!"

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT
    instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    Of course, they messed some of it up; EDITPC was like the
    punchline of a bad joke, and the ways that POLY was messed up
    are well-known.
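
    (For readers who never met it: POLY evaluated a polynomial by
    Horner's rule from a coefficient table in a single instruction. A C
    sketch of the operation follows; the coefficient ordering here,
    highest degree first, is my choice for illustration, and the real
    instruction's table layout, rounding, and accuracy rules are in the
    VAX architecture manual.)

    #include <stdio.h>

    /* Horner's rule: r = (((c0*x + c1)*x + c2)*x + ...) */
    static double poly(double x, int degree, const double c[])
    {
        double r = c[0];                 /* coefficient of x^degree */
        for (int i = 1; i <= degree; i++)
            r = r * x + c[i];
        return r;
    }

    int main(void)
    {
        /* 3x^2 + 2x + 1 at x = 2.0 -> 17 */
        const double c[] = { 3.0, 2.0, 1.0 };
        printf("%g\n", poly(2.0, 2, c));
        return 0;
    }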

    Anyway, I apologize for the length of the post, but that's the
    sort of thing I mean.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Fri Aug 15 13:36:12 2025
    From Newsgroup: comp.arch

    On Fri, 15 Aug 2025 12:57:35 -0000 (UTC), Dan Cross wrote:

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in the
    mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm saying
    that business needs must have, at least in part, influenced the ISA
    design. That is, while mistaken, it was part of the business decision
    process regardless.

    It's not clear to me what the distinction of technical vs. business is
    supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [snip]

    There are also bits of the business requirements in each of the
    descriptions of DEC microprocessor projects on Bob Supnik's site
    that Al Kossow linked to earlier:

    <http://simh.trailing-edge.com/dsarchive.html>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 15 15:10:58 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is:
    https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.
    Section 2.7 also mentions an 8-byte instruction buffer, and that the
    instruction fetching is done concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 16 15:26:31 2025
    From Newsgroup: comp.arch

    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
    running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now
    become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent
    state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table, but is otherwise pretty much
    OK assuming TLB miss rate isn't too unreasonable.


    For the TLB, had noticed best results with 4 or 8 way associativity:
      1-way: Doesn't work for main TLB.
        1-way works OK for an L1-TLB in a split L1/L2 TLB config.
      2-way: Barely works
        In some edge cases and configurations,
        may get stuck in a TLB miss loop.
      4-way: Works fairly well, cheaper option.
      8-way: Works better, but significantly more expensive.

    A possible intermediate option could be 6-way associativity.
    Full associativity is impractically expensive.
    Also a large set associative TLB beats a small full associative TLB.

    For a lot of the test programs I run, TLB size:
      64x: Small, fairly high TLB miss rate.
      256x: Mostly good
      512x or 1024x: Can mostly eliminate TLB misses, but debatable.

    In practice, this has mostly left 256x4 as the main configuration for
    the Main TLB. Optionally, can use a 64x1 L1 TLB (with the main TLB as an
    L2 TLB), but this is optional.


    A hardware page walker or inverted page table has been considered, but
    not crossed into use yet. If I were to add a hardware page walker, it
    would likely be semi-optional (still allowing processes to use
    unconventional memory management as needed, *).

    Supported page sizes thus far are 4K, 16K, and 64K. In test-kern, 16K
    mostly won out, using a 3-level page table and 48-bit address space,
    though technically the current page-table layout only does 47 bits.

    Idea was that the high half of the address space could use a separate
    System page table, but this isn't really used thus far.

    *: One merit of a software TLB is that it allows for things like nested
    page tables or other trickery without needing any actual hardware
    support. Though, you can also easily enough fake software TLB in
    software as well (a host TLB miss pulling from the guest TLB and
    translating the address again).
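
    As a rough illustration of that last point, faking nested paging on
    top of a software-managed TLB can look something like the following
    minimal C sketch (the function names are placeholders, not an actual
    API; permissions from the two stages would also be intersected in
    practice):

    #include <stdint.h>

    /* Hypothetical helpers: stage 1 translates a guest VA to a guest PA
       via the guest's page table (or guest TLB image); stage 2 translates
       that guest PA to a host PA via the host's own table. */
    extern uint64_t guest_translate(uint64_t gva);
    extern uint64_t host_translate(uint64_t gpa);
    extern void     tlb_install(uint64_t va, uint64_t pa, unsigned perm);

    /* On a host TLB miss for a guest virtual address, run both stages in
       software and install the combined GVA -> host-PA entry; no special
       hardware support is needed. */
    void nested_tlb_miss(uint64_t gva, unsigned perm)
    {
        uint64_t gpa = guest_translate(gva);
        uint64_t hpa = host_translate(gpa);
        tlb_install(gva, hpa, perm);
    }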


    Ended up not as inclined towards inverted page tables, as they offer
    fewer benefits than a page walker but would have many of the same issues
    in terms of implementation complexity (needs to access RAM and perform multiple memory accesses to resolve a miss, ...). The page walker then
    is closer to the end goal, whereas the IPT is basically just a much
    bigger RAM-backed TLB.



    Actually, it is not too far removed from doing a weaker (not-quite-IEEE)
    FPU in hardware, and then using optional traps to emulate full IEEE
    behavior (nevermind if such an FPU encountering things like subnormal
    numbers or similar causes performance to tank; and the usual temptation
    to just disable the use of full IEEE semantics).

    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 06:16:08 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.

    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 10:00:56 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 15:21:38 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.
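
    For concreteness, one loop ordering with that kind of pessimal
    locality (an assumed reconstruction, not necessarily the code that
    was measured) keeps the row index innermost, so two of the three
    arrays advance by a full 8000-byte row, i.e. a different 4K page,
    on every inner iteration:

    #define N 1000

    /* c[i][j] and a[i][k] each touch a new page every iteration of the
       inner loop, while b[k][j] stays put: roughly 2 TLB misses per
       iteration once the working set exceeds what the TLB can cover. */
    void matmul_pessimal(double c[N][N], double a[N][N], double b[N][N])
    {
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                for (int i = 0; i < N; i++)
                    c[i][j] += a[i][k] * b[k][j];
    }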

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 11:29:20 2025
    From Newsgroup: comp.arch

    On 8/17/2025 1:16 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.


    Yes.

    AVL tree is a balanced binary tree that tracks depth and "rotates" nodes
    as needed to keep the depth of one side within +/- 1 of the other.

    The B-Trees would use N elements per node, which are stored in sorted
    order so that one can use a binary search.
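
    As a sketch of what one node of such a tree might look like when used
    for address translation (the field names here are made up for
    illustration, not taken from an actual implementation):

    #include <stdint.h>

    /* One node of an AVL-style mapping tree, keyed by virtual page number.
       Lookup is an ordinary binary search down the tree; insert and delete
       rebalance by rotating any node whose left/right height difference
       would leave the -1..+1 range. */
    struct vm_avl_node {
        uint64_t vpn;                 /* virtual page number (search key) */
        uint64_t pte;                 /* physical page + permission bits  */
        int8_t   balance;             /* height(right) - height(left)     */
        struct vm_avl_node *left;
        struct vm_avl_node *right;
    };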


    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?


    Can use less RAM for large sparse address spaces with aggressive ASLR.
    However, looking up a page or updating the page table is significantly
    slower (enough to be relevant).

    Though, I mostly ended up staying with more conventional page tables and weakening the ASLR, where it may try to reuse the previous bits (47:25)
    and (47:36) of the address a few times, to reduce page-table
    fragmentation (sparse, mostly-empty, page table pages).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 13:35:03 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
    Its programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
    For a wide instruction or stage register I'd look at chips such as a 74LS377
    with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375
    (presumably more expensive) or a 74LS377 and an eight-chip
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    a wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

    inst buf <15:8>   <7:0>
             |    |   |   |
           4:16 4:16 4:16 4:16
           vvvv vvvv vvvv vvvv
      10k  ---|---|---|---|------>INV->
      10k  ---------------------->INV->
      10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable length 1 to 12 bytes.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 12:53:32 2025
    From Newsgroup: comp.arch

    On 8/17/2025 9:00 AM, EricP wrote:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software.  While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.


    I am not saying SW page walkers are fast.
    Though in my experience, the cycle cost of the SW TLB miss handling
    isn't "too bad".

    If it were a bigger issue in my case, could probably add a HW page
    walker, as I had long considered it as a possible optional feature. In
    this case, it could be per-process (with the LOBs of the page-base
    register also encoding whether or not HW page-walking is allowed; along
    with in my case also often encoding the page-table type/layout).


    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."


    This is around 2 orders of magnitude more than I am often seeing in my
    testing (mind you, with a TLB miss handler that is currently written in C).


    But, this is partly where things like page-sizes and also the size of
    the TLB can have a big effect.

    Ideally, one wants a TLB that has a coverage larger than the working set
    of the typical applications (and OS); at which point miss rate becomes negligible. Granted, if one has GB's of RAM, and larger programs, this
    is a harder problem...


    Then the ratio of working set to TLB coverage comes into play, which
    granted (sadly) appears to follow a (workingSet/coverage)^2 curve...


    I had noted before that some of the 90s era RISC's had comparably very
    small TLBs, such as 64-entry fully associative, or 16x4.
    Such a TLB with a 4K page size having a coverage of roughly 256K.

    Where, most programs have working sets somewhat larger than 256K.

    Looking it up, the DEC Alpha used a 48 entry TLB, so 192K coverage, yeah...


    The CPU time cost of TLB Miss handling would be significantly reduced
    with a "not pissant" TLB.



    I was mostly using 256x4, with a 16K page size, which covers a working
    set of roughly 16MB.

    A 1024x4 would cover 64MB, and 1024x6 would cover 96MB.

    One possibility though would be to use 64K pages for larger programs,
    which would increase coverage of a 1024x TLB to 256MB or 384MB.

    At present, a 1024x4 TLB would use 64K of Block-RAM, and 1024x6 would
    use 98K.

    But, yeah... this is comparable to the apparent TLB sizes on a lot of
    modern ARM processors; which typically deal with somewhat larger working
    sets than I am dealing with.
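
    The coverage figures above fall straight out of sets * ways * page
    size; a throwaway check (assuming that reading of the NxM notation):

    #include <stdio.h>

    static unsigned long long coverage(unsigned sets, unsigned ways,
                                       unsigned long long page_bytes)
    {
        return sets * ways * page_bytes;  /* bytes of address space covered */
    }

    int main(void)
    {
        printf("256x4,  16K: %3llu MB\n", coverage( 256, 4, 16384) >> 20);
        printf("1024x4, 16K: %3llu MB\n", coverage(1024, 4, 16384) >> 20);
        printf("1024x6, 16K: %3llu MB\n", coverage(1024, 6, 16384) >> 20);
        printf("1024x4, 64K: %3llu MB\n", coverage(1024, 4, 65536) >> 20);
        printf("1024x6, 64K: %3llu MB\n", coverage(1024, 6, 65536) >> 20);
        return 0;
    }

    which prints 16, 64, 96, 256 and 384 MB respectively.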


    Another option is to RAM-back part of the TLB, essentially as an
    "Inverted Page Table", but admittedly, this has similar complexities to
    a HW page walker (and the hassle of still needing a fault handler to
    deal with missing IPT entries).



    In an ideal case, could make sense to write at least the fast path of
    the miss handler in ASM.

    Note that TLB misses are segregated into their own interrupt category
    separate from other interrupts:
    8: General Fault (Memory Faults, Instruction Faults, FPU Traps)
    A: TLB Miss (TLB Miss, ACL Miss)
    C: Interrupt (1kHz HW timer mostly)
    E: Syscall (System Calls)

    Typically, the VBR layout looks like:
    + 0: Reset (typically only used on boot, with VBR reset to 0)
    + 8: General Fault
    +16: TLB Miss
    +24: Interrupt
    +32: Syscall
    With a non-standard alignment requirement (vector table needs to be
    aligned to a multiple of 256 bytes, for "reasons"). Though actual CPU
    core currently only needs a 64B alignment (256B would allow adding a lot
    more vectors while staying with the use of bit-slicing). Each "entry" in
    this table being a branch to the entry point of the ISR handler.

    On initial Boot, as a little bit of a hack, the CPU looks at the
    encoding of the Reset Vector branch to determine the initial ISA Mode
    (such as XG1, XG3, or RV64GC).



    If doing a TLB miss handler in ASM, possible strategy could be:
      Save off some of the registers;
      Check if a simple case TLB miss or ACL miss;
        Try to deal with it;
        Restore registers;
        Return.
      Save rest of registers;
      Deal with more complex scenario (probably in C land);
        Such as initiate a context switch to the page-fault handler.

    For the simple cases (a rough C sketch follows below):
    TLB Miss involves walking the page table;
    ACL miss may involve first looking up the ID pairs in a hash table;
    Fallback cases may involve more complex logic in a more general handler.
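
    A minimal C sketch of that simple case, assuming something like the
    16K-page, 3-level layout mentioned earlier (11 index bits per level
    over a 47-bit VA); the helper names and the PTE bit layout are
    placeholders, and ACL/ASID handling is omitted:

    #include <stdint.h>

    #define PAGE_SHIFT  14u                        /* 16K pages             */
    #define PAGE_MASK   ((1ull << PAGE_SHIFT) - 1)
    #define IDX_BITS    11u                        /* 2048 8-byte PTEs/page */
    #define IDX_MASK    ((1u << IDX_BITS) - 1)
    #define PTE_VALID   1ull                       /* placeholder valid bit */

    extern uint64_t *page_table_root;     /* top-level (level-2) table,
                                             reachable from the handler     */
    extern void tlb_load_entry(uint64_t va, uint64_t pte);  /* placeholder  */
    extern void page_fault_slow_path(uint64_t va);  /* hand off to C land   */

    void tlb_miss_fast_path(uint64_t va)
    {
        uint64_t *tbl = page_table_root;

        for (int level = 2; ; level--) {
            unsigned ix = (unsigned)(va >> (PAGE_SHIFT + level * IDX_BITS)) & IDX_MASK;
            uint64_t pte = tbl[ix];

            if (!(pte & PTE_VALID)) {   /* unmapped: not the simple case */
                page_fault_slow_path(va);
                return;
            }
            if (level == 0) {           /* leaf PTE: install and return  */
                tlb_load_entry(va, pte);
                return;
            }
            /* interior PTE: assume it holds the page-aligned address of
               the next-level table */
            tbl = (uint64_t *)(uintptr_t)(pte & ~PAGE_MASK);
        }
    }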



    At present, the Interrupt and Syscall handlers have the quirk in that
    they require TBR to be set-up first, as they directly save to the
    register save area (relative to) TBR, rather than using the interrupt
    stack. The main rationale here being that these interrupts frequently
    perform context switches and saving/restoring registers to TBR greatly
    reduces the performance cost of performing a context switch.

    Note though that one ideally wants to use shared address spaces or ASIDs
    to limit the amount of TLB misses.

    Can note that currently my CPU core uses 16-bit ASIDs, split into 6+10
    bits, currently 64 groups, each with 1024 members. Global pages are
    generally only global within a groups, and high numbered groups are
    assumed to not allow global pages. Say, for example, if you were running
    a VM, you wouldn't want its VAS being polluted with global pages from
    the host OS.

    Though, global pages would allow things like DLLs and similar to be
    shared without needing TLB misses for them on context switches.
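
    One plausible reading of that 6+10 split, written as a TLB match
    check (the names and the "last group that allows global pages"
    cutoff are made up purely for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define ASID_GROUP(a)    ((unsigned)(a) >> 10)  /* 64 groups           */
    #define LAST_GLOBAL_GRP  47u                    /* hypothetical cutoff */

    /* An entry hits on an exact ASID match, or, for a global page, when
       the current ASID is in the same group and that group permits global
       pages at all (high-numbered groups, e.g. for VM guests, do not). */
    static bool tlb_entry_matches(uint16_t entry_asid, bool entry_global,
                                  uint16_t cur_asid)
    {
        if (entry_asid == cur_asid)
            return true;
        return entry_global &&
               ASID_GROUP(entry_asid) == ASID_GROUP(cur_asid) &&
               ASID_GROUP(cur_asid) <= LAST_GLOBAL_GRP;
    }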

    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Jakob Bohm@egenagwemdimtapsar@jbohm.dk to comp.arch,comp.lang.c on Sun Aug 17 20:18:36 2025
    From Newsgroup: comp.arch

    On 2025-08-05 23:08, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?


    What is actually going on is GCC offering its users a gradual way to transition from C17 to C23, by applying the C23 meaning of any C23
    construct that has no conflicting meaning in C17 . In particular, this
    allows installed library headers to use the new types as part of
    logically opaque (but compiler visible) implementation details, even
    when those libraries are used by pure C17 programs. For example, the
    ISO POSIX datatype struct stat could contain a _BitInt(128) type for
    st_dev or st_ino if the kernel needs that, as was the case with the 1996
    NT kernel . Or a _BitInt(512) for st_uid as used by that same kernel .

    GCC --pedantic is an option to check if a program is a fully conforming portable C program, with the obvious exception of the contents of any
    used "system" headers (including installed libc headers), as those are
    allowed to implement standard or non-standard features in implementation specific ways, and might even include implementation specific logic to
    report the use of non-standard extensions to the library standards when
    the compiler is invoked with --pedantic and no contrary options .

    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C instead
    of GNUC reverts those to the standard definition .

    Enjoy

    Jakob
    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 19:10:21 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 15:08:14 2025
    From Newsgroup: comp.arch

    On 8/17/2025 2:10 PM, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.


    If the processor has microcode, could try to handle it that way.

    If it could work, and the CPU allows sufficiently complex logic in
    microcode to deal with this.

    ...



    One idea I had considered early on would be that there would be a
    special interrupt class that always goes into the ROM; so to the OS it
    would always look as if there were a HW page walker.

    This was eventually dropped, though, as I was typically using 32K for the
    Boot ROM, and with the initial startup tests, font initialization, and
    FAT32 driver + PEL and ELF loaders, ..., there wasn't much space left
    for "niceties" like TLB miss handling and similar. So, the role of the
    ROM was largely reduced to initial boot-up.

    It could be possible to have a "2-stage ROM", where the first stage boot
    ROM also loads more "ROM" from the SDcard. But, at that point, may as
    well just go over to using the current loader design to essentially try
    to load a UEFI BIOS or similar (which could then load the OS, achieving basically the same effect).

    Where, in effect, UEFI is basically an OS in its own right, which just
    so happens to use similar binary formats to what I am using already (eg, PE/COFF).

    Not yet gone up the learning curve for how to make TestKern behave like
    a UEFI backend though (say, for example, if I wanted to try to get
    "Debian RV64G" or similar to boot on my stuff).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 18:56:49 2025
    From Newsgroup: comp.arch

    On 8/17/2025 12:35 PM, EricP wrote:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns.  Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
    Its programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like
    74LS375.

    For a wide instruction or stage register I'd look at chips such as a
    74LS377
    with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable,
    vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375
    (presumably more expensive) or a 74LS377 and an eight-chip
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more.  If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    a wired-AND logic driving an invertor for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an invertor,
    to form the second output NAND layer.

    inst buf <15:8>   <7:0>
             |    |   |   |
           4:16 4:16 4:16 4:16
           vvvv vvvv vvvv vvvv
      10k  ---|---|---|---|------>INV->
      10k  ---------------------->INV->
      10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere.  Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable length 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.


    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)


    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding
    space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Sun Aug 17 22:18:28 2025
    From Newsgroup: comp.arch

    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 18 05:48:00 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [...]

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    I had actually forgotten that the VAX also had decimal
    instructions. But the 11/780 also had one really important
    restriction: It could only do one write every six cycles, see https://dl.acm.org/doi/pdf/10.1145/800015.808199 , so that
    severely limited their throughput there (assuming they did
    things bytewise). So yes, decimal arithmetic was important
    in the day for COBOL and related commercial applications.

    So, what to do with decimal arithmetic, which was important
    at the time (and a business consideration)?

    Something like Power's addg6s instruction could have been
    introduced; it adds two numbers together, generating only the
    decimal carries, and puts a nibble "6" into the corresponding
    nibble if there is one, and "0" otherwise. With 32 bits, that
    would allow addition of eight-digit decimal numbers in four
    instructions (see one of the POWER ISA documents for details),
    but the cycle of "read ASCII digits, do arithmetic, write
    ASCII digits" would have needed some extra shifts and masks,
    so it might have been more beneficial to use four digits per
    register.
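
    For comparison, the classic branch-free way to add eight packed-BCD
    digits in plain C looks like this; the correction mask built in the
    middle is essentially what an addg6s-style instruction would hand you
    in a single operation (give or take the exact polarity), which is
    where the instruction-count saving comes from:

    #include <stdint.h>

    /* Add two 8-digit packed-BCD values; the decimal carry out of the
       top digit is simply dropped here. */
    static uint32_t bcd_add8(uint32_t a, uint32_t b)
    {
        uint64_t t1  = (uint64_t)a + 0x66666666u; /* pre-bias every digit by 6    */
        uint64_t t2  = t1 + b;                    /* plain binary sum             */
        uint64_t c   = t1 ^ b ^ t2;               /* carry into each bit position */
        uint64_t no  = ~c & 0x111111110ull;       /* nibbles w/o a decimal carry  */
        uint64_t six = (no >> 2) | (no >> 3);     /* a 6 in each such nibble      */
        return (uint32_t)(t2 - six);              /* undo the bias where unneeded */
    }

    For example, bcd_add8(0x00000019, 0x00000023) returns 0x00000042.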

    The article above is also extremely interesting otherwise. It does
    not give cycle timings for each individual instruction and address
    mode, but it gives statistics on how they were used, and a good
    explanation of the timing implications of their microcode design.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.arch,comp.lang.c on Mon Aug 18 08:02:30 2025
    From Newsgroup: comp.arch

    On 18/08/2025 06:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    $ cat so.c
    #include <stdio.h>

    int main(void)
    {
        int foo = 42;
        size_t soa = sizeof (foo, 'C');
        size_t sob = sizeof foo;
        printf("%s.\n", (soa == sob) ? "Yes" : "No");
        return 0;
    }
    $ gcc -o so so.c
    $ ./so
    Yes.
    $ gcc --version
    gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch,comp.lang.c on Mon Aug 18 11:34:49 2025
    From Newsgroup: comp.arch

    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?


    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Aug 18 11:03:15 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.

    Yeah, this approach works a lot better than people seem to give it
    credit for...
    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.
    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    None of those research papers that I have seen consider the possibility
    that OoO can make use of multiple concurrent HW walkers if the
    cache supports hit-under-miss and multiple pending miss buffers.

    While instruction fetch only needs to occasionally translate a VA one
    at a time, with more aggressive alternate path prefetching all those VA
    have to be translated first before the buffers can be prefetched.
    LSQ could also potentially be translating as many VA as there are entries.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.

    - anton

    I'm looking for papers that separate out the common cost of loading a PTE
    from the extra cost of just the SW-miss handler. I had a paper a while
    back but can't find it now. IIRC in that paper the extra cost of the
    SW miss handler on Alpha was measured at 5-25%.

    One thing to mention about some of these papers looking at TLB performance:
    some papers on virtual address translation appear NOT to be aware
    that Intel's HW walker on its downward walk caches the interior node
    PTE's in auxiliary TLB's and checks for PTE TLB hits in bottom to top order (called a bottom-up walk) and thereby avoids many HW walks from the root.

    A SW walker can accomplish the same bottom-up walk by locating
    the different page table levels at *virtual* base addresses,
    and adding each VA of those interior PTE's to the TLB.
    This is what VAX VA translate did, probably Alpha too but I didn't check.
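
    A rough sketch of such a bottom-up software walk, assuming the page
    tables are themselves mapped at a fixed virtual base (a recursive
    self-map), with 8-byte PTEs and 8K pages as on Alpha; the names and
    the base address are placeholders:

    #include <stdint.h>

    #define PAGE_SHIFT 13u                        /* 8K pages               */
    #define PT_VBASE   0xFFFFFE0000000000ull      /* hypothetical self-map  */

    extern int      tlb_probe(uint64_t va);       /* is this VA in the TLB? */
    extern void     tlb_install(uint64_t va, uint64_t pte);
    extern uint64_t walk_from_root(uint64_t va);  /* top-down fallback      */

    void sw_miss_bottom_up(uint64_t va)
    {
        /* The leaf PTE for 'va' lives at a fixed *virtual* address. */
        uint64_t pte_va = PT_VBASE + ((va >> PAGE_SHIFT) << 3);

        if (!tlb_probe(pte_va)) {
            /* The page-table page holding that PTE is not itself mapped in
               the TLB: map it first (top-down here; one could also recurse
               a level up), so the interior PTE gets cached too. */
            tlb_install(pte_va, walk_from_root(pte_va));
        }
        /* Usually this single load is all a miss costs. */
        tlb_install(va, *(uint64_t *)(uintptr_t)pte_va);
    }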

    This interior PTE node caching is critical for optimal performance
    and some of their stats don't take it into account
    and give much worse numbers than they should.

    Also many papers were written before ASID's were in common use
    so the TLB got invalidated with each address space switch.
    This would penalize any OS which had separate user and kernel space.

    So all these numbers need to be taken with a grain of salt.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 18 15:35:36 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal
    reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Aug 18 17:19:13 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were
    added as independent hardware features in V8.1.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Mon Aug 18 21:57:59 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .
    I'm not sure what you're referring to. You didn't say what foo is.
    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed
    character constant is int.
    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".

    Yes (more of a thinko, actually).

    I meant to ask about `sizeof (foo, 'C')` yielding a value *other than*
    `sizeof (int)`. Jakob implies a difference in this area between GNU C
    and ISO C. I'm not aware of any.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 20 03:47:17 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs     MHz    CPI    Machine
       1     5      10     11/780
       4     12.5   6.25   8600
       6     22.2   7.4    8700
      35     90.9   5.1    NVAX+

    SPEC92    MHz   VAX CPI   Machine
    1/1       5     10/10     VAX 11/780
    133/200   200   3/2       Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 is anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    Prism paper says the following about RISC versus VAX performance:

    : 1. Shorter cycle time. VAX chips have more, and longer, critical
    : paths than RISC chips. The worst VAX paths are the control store
    : loop and the variable length instruction decode loop, both of
    : which are absent in RISC chips.

    : 2. Fewer cycles per function. Although VAX chips require fewer
    : instructions than RISC chips (1:2.3) to implement a given
    : function, VAX instructions take so many more cycles than RISC
    : instructions (5-10:1-1.5) that VAX chips require many more cycles
    : per function than RISC chips.

    : 3. Increased pipelining. VAX chips have more inter- and
    : intra-instruction dependencies, architectural irregularities,
    : instruction formats, address modes, and ordering requirements
    : than RISC chips. This makes VAX chips harder and more
    : complicated to pipeline.

    Point 1 above, for me, means that VAX chips were microcoded. Point
    2 above suggests that there were limited changes compared to the
    VAX-780 microcode.

    IIUC attempts to create better hardware for the VAX were canceled
    just after the PRISM memos, so later VAX designs used essentially the
    same logic, just rescaled to a better process.

    I think that the VAX had a problem with hardware decoders because of
    gate delay: in 1987 a hardware decoder probably would have slowed down
    the clock. But the 1977 design looks quite relaxed to me: the main
    logic was Schottky TTL, which nominally has 3 ns of inverter delay.
    With a 200 ns cycle this means about 66 gate delays per cycle. And in
    critical paths the VAX used ECL. I do not know exactly which ECL, but
    AFAIK 2 ns ECL was commonly available in 1970 and 1 ns ECL was leading
    edge in 1970.
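
    (A back-of-the-envelope check on the gate-depth budget implied by the
    numbers above; the figures are just the ones quoted in this paragraph,
    not measurements of the actual VAX design.)

    #include <stdio.h>

    int main(void)
    {
        const double cycle_ns = 200.0;  /* VAX 11/780 cycle time          */
        const double sttl_ns  = 3.0;    /* nominal Schottky TTL inverter  */
        const double ecl2_ns  = 2.0;    /* commonly available ECL, ~1970  */
        const double ecl1_ns  = 1.0;    /* leading-edge ECL, ~1970        */

        printf("Schottky TTL: ~%d gate delays per cycle\n", (int)(cycle_ns / sttl_ns));
        printf("2 ns ECL:     ~%d gate delays per cycle\n", (int)(cycle_ns / ecl2_ns));
        printf("1 ns ECL:     ~%d gate delays per cycle\n", (int)(cycle_ns / ecl1_ns));
        return 0;
    }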

    That is why I think that in 1977 a hardware decoder could have given a
    speedup, assuming that the execution units could keep up: the gate
    delay and cycle time mean that a rather deep circuit could fit within
    the cycle time. IIUC 1987 designs were much more aggressive, and
    decoder delay probably could not fit within a single cycle.

    It is quite possible that hardware designers attempting VAX hardware
    decoders were too ambitious and wanted to decode instructions that
    were too complicated in one cycle. AFAICS, for instructions that
    cannot be executed in one cycle, decode can be slower than one cycle;
    all one needs is to recognize within one cycle that decode will take
    multiple cycles.

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 14:36:43 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    Not quite.
    My idea was to have two HW threads HT1 and HT2 which are like x86 HW
    threads except when HT1 gets a TLB miss it stalls its execution and
    injects the TLB miss handler at the front of HT2 pipeline,
    and a HT2 TLB miss stalls itself and injects its handler into HT1.
    The TLB miss handler never itself TLB misses as it explicitly checks
    the TLB for any VA it needs to translate so recursion is not possible.

    As the handler is injected at the front of the pipeline no drain occurs.
    The only possible problem is if, after HT1 injects its miss handler
    into HT2, HT2's existing pipeline code then also takes a TLB miss.
    As this would cause a deadlock, if it occurs the core detects it
    and both HTs fault and run their own TLB miss handlers.
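
    (A minimal C sketch of the cross-injection rule just described; the
    state names and the deadlock test are my own simplification for
    illustration, not the actual design.)

    #include <stdbool.h>
    #include <stdio.h>

    enum ht_state { RUNNING, STALLED_WAITING_PEER, RUNNING_OWN_HANDLER };

    struct hw_thread { int id; bool tlb_miss; enum ht_state state; };

    /* HTa stalls on a miss and injects its handler at the front of HTb's
       pipeline; if both miss at once the cross-injection would deadlock,
       so each thread falls back to running its own handler. */
    static void resolve_miss(struct hw_thread *a, struct hw_thread *b)
    {
        if (a->tlb_miss && b->tlb_miss) {
            a->state = b->state = RUNNING_OWN_HANDLER;
            printf("HT%d and HT%d both missed: each runs its own handler\n",
                   a->id, b->id);
        } else if (a->tlb_miss) {
            a->state = STALLED_WAITING_PEER;
            printf("HT%d missed: handler injected into HT%d's front end\n",
                   a->id, b->id);
        } else if (b->tlb_miss) {
            b->state = STALLED_WAITING_PEER;
            printf("HT%d missed: handler injected into HT%d's front end\n",
                   b->id, a->id);
        }
    }

    int main(void)
    {
        struct hw_thread ht1 = {1, true,  RUNNING};
        struct hw_thread ht2 = {2, false, RUNNING};
        resolve_miss(&ht1, &ht2);             /* single miss: cross-inject */

        ht1.tlb_miss = ht2.tlb_miss = true;
        resolve_miss(&ht1, &ht2);             /* both miss: local fallback */
        return 0;
    }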

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were independent hardware features added in V8.1.

    Yes. A memory recycler can periodically clear the Accessed bit
    so it can detect page usage, and that might be a different core.
    But it might skip sending TLB shootdowns to all other cores
    to lower the overhead (maybe a lazy usage detector).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 16:41:39 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    Hmmm... I don't think that is possible, or if it is then it's really hairy.
    The miss handler needs to LD the memory PTE's, which can happen OoO.
    But it also needs to do things like writing control registers
    (e.g. the TLB) or setting the Accessed or Dirty bits on the in-memory PTE, things that usually only occur at retire. But those handler instructions
    can't get to retire because the older instructions that triggered the
    miss are stalled.

    The miss handler needs general registers so it needs to
    stash the current content someplace and it can't use memory.
    Then add a nested miss handler on top of that.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    As Scott said, to avoid race conditions with software clearing those bits.
    Plus there might be PTE modifications that an OS could perform on other
    PTE fields concurrently without first acquiring the normal mutexes
    and doing a TLB shoot down of the PTE on all the other cores,
    provided they are done atomically so the updates of one core
    don't clobber the changes of another.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    A HW walker looks simple to me.
    It has a few bits of state number and a couple of registers.
    It needs to detect memory read errors if they occur and abort.
    Otherwise it checks each TLB level in backwards order using the
    appropriate VA bits, and if it gets a hit walks back down the tree
    reading PTE's for each level and adding them to their level TLB,
    checking it is marked present, and performing an atomic OR to set
    the Accessed and Dirty flags if they are clear.

    The HW walker is even simpler if the atomic OR is implemented directly
    in the cache controller as part of the Atomic Fetch And OP series.
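
    (A minimal sketch of one step of such a walk, using C11
    atomic_fetch_or for the Accessed/Dirty update; the PTE bit positions
    and helper names are illustrative assumptions, not any particular
    architecture's format.)

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT  (1ull << 0)  /* illustrative bit positions */
    #define PTE_ACCESSED (1ull << 5)
    #define PTE_DIRTY    (1ull << 6)

    /* Check one PTE: verify Present, set Accessed (and Dirty on a write)
       with an atomic OR so a concurrent walker or software clearing the
       bits cannot lose the update, and return the next-level address. */
    static bool walk_step(_Atomic uint64_t *pte, bool is_write,
                          uint64_t *next_pa)
    {
        uint64_t e = atomic_load_explicit(pte, memory_order_acquire);
        if (!(e & PTE_PRESENT))
            return false;                       /* abort: page fault */

        uint64_t set = PTE_ACCESSED | (is_write ? PTE_DIRTY : 0);
        if ((e & set) != set)                   /* RMW only if needed */
            atomic_fetch_or_explicit(pte, set, memory_order_acq_rel);

        *next_pa = e & ~0xFFFull;               /* next table / frame */
        return true;
    }

    int main(void)
    {
        _Atomic uint64_t pte = PTE_PRESENT;     /* a "virgin" leaf PTE */
        uint64_t pa;
        return walk_step(&pte, true, &pa) ? 0 : 1;
    }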

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton

    Yes, and it seems to me that one would spend a lot more time trying to
    fix the SW walker than doing the simple HW walker that just works.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 19:17:01 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb DRAMs were just making it to customers; 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.
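
    (A purely hypothetical illustration of such a 2-byte accumulate
    format, with an 8-bit opcode and two 4-bit register fields, just to
    make the bit layout concrete; the opcode value is made up.)

    #include <stdint.h>
    #include <stdio.h>

    struct acc_insn { uint8_t opcode, rd, rs; };  /* Rd is src and dest */

    static struct acc_insn decode_acc(const uint8_t b[2])
    {
        struct acc_insn i;
        i.opcode = b[0];          /* byte 0: 8-bit opcode              */
        i.rd     = b[1] >> 4;     /* byte 1 high nibble: dest/source   */
        i.rs     = b[1] & 0x0F;   /* byte 1 low nibble: second source  */
        return i;
    }

    int main(void)
    {
        const uint8_t add_r3_r7[2] = { 0x10, 0x37 };   /* say, ADD R3,R7 */
        struct acc_insn i = decode_acc(add_r3_r7);
        printf("opcode %02X: R%d += R%d\n", i.opcode, i.rd, i.rs);
        return 0;
    }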

    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest
    conditional branch would be 3 bytes as it needs a register specifier.

    If one is doing variable byte length instructions then
    it allows the highest usage frequency to be most compact possible.
    Eg. an ADD with 32-bit immediate in 6 bytes.

    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.

    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 20 23:50:52 2025
    From Newsgroup: comp.arch

    On 8/20/2025 6:17 PM, EricP wrote:
    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb DRAMs were just making it to customers; 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most
    instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
      Maximizing code density often prefers fewer registers;
      For 16-bit instructions, 8 or 16 registers is good;
      8 is rather limiting;
      32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.


    Yeah.

    SuperH had:
    ZZZZnnnnmmmmZZZZ //2R
    ZZZZnnnniiiiiiii //2RI (Imm8)
    ZZZZnnnnZZZZZZZZ //1R


    For BJX2/XG1, had went with:
    ZZZZZZZZnnnnmmmm
    But, in retrospect, this layout was inferior to the one SuperH had used
    (and I almost would have just been better off doing a clean-up of the SH encoding scheme than moving the bits around).

    Though, this happened during a transition between B32V and BSR1, where:
    B32V was basically a bare-metal version of SH;
    BSR1 was an instruction repack (with tweaks to try make it more
    competitive with MSP430 while still remaining Load/Store);
    BJX2 was basically rebuilding all the stuff from BJX1 on top of BSR1's encoding scheme (which then mutated more).


    At first, BJX2's 32-bit ops were a prefix:
    111P-YYWY-qnmo-oooo ZZZZ-ZZZZ-nnnn-mmmm

    But, then got reorganized:
    111P-YYWY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ

    Originally, this repack was partly because I had ended up designing some Imm9/Disp9 encodings as it quickly became obvious that Imm5/Disp5 was insufficient. But, I had designed the new instructions to have the Imm
    field not be totally dog-chewed, so ended up changing the layout. Then
    ended up changing the encoding for the 3R instructions to better match
    that of the new Imm9 encodings.

    Then, XG2:
    NMOP-YYwY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ //3R

    Which moved entirely over to 32/64/96 bit encodings in exchange for
    being able to directly encode 64 GPRs in 32-bit encodings for the whole ISA.


    In the original BJX2 (later renamed XG1), only a small subset had
    direct access to the higher-numbered registers, with other
    instructions using 64-bit encodings.

    Though, ironically, XG2 never surpassed XG1 in terms of code-density;
    but being able to use 64 registers "pretty much everywhere" was (mostly)
    a good thing for performance.


    For XG3, there was another repack:
    ZZZZ-oooooo-mmmmmm-ZZZZ-nnnnnn-qY-YYPw //3R

    But, this was partly to allow it to co-exist with RISC-V.

    Technically, still has conditional instructions, but these were demoted
    to optional; as if one did a primarily RISC-V core, with an XG3 subset
    as an ISA extension, they might not necessarily want to deal with the
    added architectural state of a 'T' bit.

    BGBCC doesn't currently use it by default.

    Was also able to figure out how to make the encoding less dog chewed
    than either XG2 or RISC-V.


    Though, ironically, the full merits of XG3 are only really visible in
    cases where XG1 and XG2 are dropped. But, it has a new boat-anchor in
    that it now assumes coexistence with RISC-V (which itself has a fair bit
    of dog chew).

    And, if the goal is RISC-V first, then likely the design of XG3 is a big
    ask; it being essentially its own ISA.

    Though, while giving fairly solid performance, XG3 currently hasn't
    matched the code density of its predecessors (either XG1 or XG2). It is
    more like "RISC-V but faster".

    And, needing to use mode changes to access XG3 or RV-C is a little ugly.



    Though, OTOH, RISC-V land is annoying in a way; lots of people being
    like "RV-V will save us from all our performance woes!". Vs, realizing
    that some issues need to be addressed in the integer ISA, and SIMD and auto-vectorization will not address inefficiencies in the integer ISA.


    Though, I have seen glimmers of hope that other people in RV land
    realize this...


    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest conditional branch would be 3 bytes as it needs a register specifier.


    Yeah, "BT/BF Disp8".


    If one is doing variable byte length instructions then
    it allows the highest usage frequency to be most compact possible.
    Eg. an ADD with 32-bit immediate in 6 bytes.



    In BSR1, I had experimented with:
    LDIZ Imm12u, R0 //R0=Imm12
    LDISH Imm8u //R0=(R0<<8)|Umm8u
    OP Imm4R, Rn //OP [(R0<<4)|Imm4u], Rn

    Which allowed Imm24 in 6 bytes or Imm32 in 8 bytes.
    Granted, as 3 or 4 instructions.
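
    (A small worked example, in C, of how that sequence composes a full
    constant: 12 + 8 + 4 bits in three ops for Imm24, or one extra LDISH
    for Imm32. The helper names just mirror the mnemonics above.)

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t ldiz  (uint32_t imm12)             { return imm12 & 0xFFF; }
    static uint32_t ldish (uint32_t r0, uint32_t imm8) { return (r0 << 8) | (imm8 & 0xFF); }
    static uint32_t op_imm(uint32_t r0, uint32_t imm4) { return (r0 << 4) | (imm4 & 0xF); }

    int main(void)
    {
        /* Imm32 case: LDIZ + 2x LDISH + OP = 4 instructions, 8 bytes */
        uint32_t r0 = ldiz(0x123);
        r0 = ldish(r0, 0x45);
        r0 = ldish(r0, 0x67);
        printf("effective immediate = 0x%08X\n", op_imm(r0, 0x8));
        return 0;
    }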

    Though, this began the process of allowing the assembler to fake more
    complex instructions which would decompose into simpler instructions.


    But, this was not kept, and in BJX2 was mostly replaced with:
    LDIZ Imm24u, R0
    OP R0, Rn

    Then, when I added Jumbo Prefixes:
    OP Rm, Imm33s, Rn

    Some extensions of RISC-V support Imm32 in 48-bit ops, but this burns
    through lots of encoding space.

    iiiiiiii-iiiiiiii iiiiiiii-iiiiiiii zzzz-nnnnn-z0-11111

    This doesn't go very far.


    Can note ISAs with 16 bit encodings:
      PDP-11: 8 registers
      M68K  : 2x 8 (A and D)
      MSP430: 16
      Thumb : 8|16
      RV-C  : 8|32
      SuperH: 16
      XG1   : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.


    If the smallest instruction size is 16 bits, it simplifies things
    considerably vs 8 bits.

    If the smallest size is 32-bits, it simplifies things even more.
    Fixed length is the simplest case though.


    As noted, 32/64/96 bit fetch isn't too difficult though.

    For 64/96 bit instructions though, mostly want to be able to (mostly)
    treat it like a superscalar fetch of 2 or 3 32-bit instructions.

    In my CPU, I ended up making it so that only 32-bit instructions support superscalar; whereas 16 and 64/96 bit instructions are scalar only.

    Superscalar only works with native alignment though (for RISC-V), and
    for XG3, 32-bit instruction alignment is mandatory.


    As noted, in terms of code density, a few of the stronger options are
    Thumb2 and RV-C, which have 16 bits as the smallest size.


    I once experimented with having a range of 24-bit instructions, but the
    hair this added (combined with the fairly little gain in terms of code density) showed this was rather not worth it.


    ...


    In my recent fiddling for trying to design a pair encoding for XG3,
    can note the top-used instructions are mostly, it seems (non Ld/St):
      ADD   Rs, 0, Rd    //MOV     Rs, Rd
      ADD   X0, Imm, Rd  //MOV     Imm, Rd
      ADDW  Rs, 0, Rd    //EXTS.L  Rs, Rd
      ADDW  Rd, Imm, Rd  //ADDW    Imm, Rd
      ADD   Rd, Imm, Rd  //ADD     Imm, Rd

    Followed by:
      ADDWU Rs, 0, Rd    //EXTU.L  Rs, Rd
      ADDWU Rd, Imm, Rd  //ADDWu   Imm, Rd
      ADDW  Rd, Rs, Rd   //ADDW    Rs, Rd
      ADD   Rd, Rs, Rd   //ADD     Rs, Rd
      ADDWU Rd, Rs, Rd   //ADDWU   Rs, Rd

    Most every other ALU instruction and usage pattern either follows a
    bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
      SD  Rn, Disp(SP)
      LD  Rn, Disp(SP)
      LW  Rn, Disp(SP)
      SW  Rn, Disp(SP)

      LD  Rn, Disp(Rm)
      LW  Rn, Disp(Rm)
      SD  Rn, Disp(Rm)
      SW  Rn, Disp(Rm)


    For registers, there is a split:
      Leaf functions:
        R10..R17, R28..R31 dominate.
      Non-Leaf functions:
        R10, R18..R27, R8/R9

    For 3-bit configurations:
      R8..R15                             Reg3A
      R18/R19, R20/R21, R26/R27, R10/R11  Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less
    encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant
    scenarios).




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 21 16:21:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Consider "virgin" page, that is neither accessed nor modified.
    Intruction 1 reads the page, instruction 2 modifies it. After
    both are done you should have both bits set. But if miss handling
    for instruction 1 reads page table entry first, but stores after
    store fomr instruction 2 handler, then you get only accessed bit
    and modified flag is lost. Symbolically we could have

    read PTE for instruction 1
    read PTE for instruction 2
    store PTE for instruction 2 (setting Accessed and Modified)
    store PTE for instruction 1 (setting Accessed but clearing Modified)
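
    (A minimal C11 sketch of exactly that interleaving, contrasting a
    plain read-modify-write, which loses the Modified bit as shown above,
    with an atomic OR; bit positions are illustrative.)

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PTE_ACCESSED (1u << 5)   /* illustrative bit positions */
    #define PTE_MODIFIED (1u << 6)

    int main(void)
    {
        /* Plain RMW, in the order listed above: Modified gets lost. */
        uint64_t pte = 0;                        /* "virgin" page     */
        uint64_t h1 = pte;                       /* read PTE, instr 1 */
        uint64_t h2 = pte;                       /* read PTE, instr 2 */
        pte = h2 | PTE_ACCESSED | PTE_MODIFIED;  /* store, instr 2    */
        pte = h1 | PTE_ACCESSED;                 /* store, instr 1    */
        printf("plain RMW: A=%d M=%d\n",
               !!(pte & PTE_ACCESSED), !!(pte & PTE_MODIFIED));

        /* Atomic OR: the later store can no longer clear the bit.   */
        _Atomic uint64_t apte = 0;
        atomic_fetch_or(&apte, PTE_ACCESSED | PTE_MODIFIED);
        atomic_fetch_or(&apte, PTE_ACCESSED);
        uint64_t v = atomic_load(&apte);
        printf("atomic OR: A=%d M=%d\n",
               !!(v & PTE_ACCESSED), !!(v & PTE_MODIFIED));
        return 0;
    }
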
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Aug 21 19:26:47 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 22 16:36:09 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions. Also, to fit the
    design into a single chip, designers moved some functionality,
    like the bus interface, to support chips. A RISC processor with
    mixed 16/32-bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 22 16:45:56 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative
    numbers, but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big, but it wasn't *that* big.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 22 17:21:17 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions.

    Like the multiply instruction in ARM2.

    Also, to fit the
    design into a single chip, designers moved some functionality,
    like the bus interface, to support chips. A RISC processor with
    mixed 16/32-bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.

    Yep, FP support can be expensive and was an extra option
    on the VAX, which also included integer multiply.

    However, I maintain that a ~1977 supermini with a similar sort
    of bus, MMU, floating point unit etc. to the VAX, but with an
    architecture similar to ARM2, plus separate icache and dcache, would
    have beaten the VAX hands-down in performance - it would have taken
    fewer chips to implement, less power, and possibly less time to
    develop. HP showed this was possible some time later.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Aug 23 16:38:47 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative numbers, but 100K seems more likely. It was SLT, individual transistors mounted a few to a package. The /91 was big, but it wasn't *that* big.

    I remember this number, but do not remember where I found it. So
    it may be wrong.

    However, one can estimate possible density in a different way: a
    package of dimensions probably similar to a VAX package can hold about
    100 TTL chips. I do not have detailed data about chip usage and
    transistor counts for each chip. A simple NAND gate is 4 transistors,
    but the input transistor has two emitters and really works like two
    transistors, so it is probably better to count it as 2 transistors,
    and consequently consider a 2-input NAND gate as having 5 transistors.
    So a 74S00 gives 20 transistors. A D-flop is probably about 20-30
    transistors, so a 74S74 is probably around 40-60. A quad D-flop brings
    us close to 100. I suspect that in VAX times octal D-flops were
    available. There were 4-bit ALU slices. Also, multiplexers need a
    nontrivial number of transistors. So I think that 50 transistors is a
    reasonable (maybe low) estimate of average density. Assuming 50
    transistors per chip, that would be 5000 transistors per package.
    Packages were rather flat, so when mounted vertically one could
    probably allocate 1 cm of horizontal space for each. That would allow
    30 packages at a single level. With 7 levels we get 210 packages,
    enough for 1 mln transistors.
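
    (Just the arithmetic from the estimate above, collected in one place;
    all the numbers are the assumptions stated in that paragraph.)

    #include <stdio.h>

    int main(void)
    {
        const long transistors_per_chip = 50;   /* average SSI/MSI TTL  */
        const long chips_per_package    = 100;
        const long packages_per_level   = 30;   /* ~1 cm per package    */
        const long levels               = 7;

        long per_package = transistors_per_chip * chips_per_package;
        long packages    = packages_per_level * levels;
        printf("%ld transistors/package, %ld packages, %ld total\n",
               per_package, packages, per_package * packages);
        return 0;
    }
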
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Aug 25 00:56:26 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
        long i, r;
        for (i=0, r=0; i<n; i++)
            r+=v[i];
        return r;
    }
    arrays:
    MOV Ri,#0
    MOV Rr,#0
    VEC Rt,{}
    LDD Rl,[Rv,Ri<<3]
    ADD Rr,Rr,Rl
    LOOP LT,Ri,Rn,#1
    MOV R1,Rr
    RET

    7 instructions, 1 instruction-modifier; 8 words.

    long a, b, c, d;

    void globals(void)
    {
        a = 0x1234567890abcdefL;
        b = 0xcdef1234567890abL;
        c = 0x567890abcdef1234L;
        d = 0x5678901234abcdefL;
    }

    globals:
    STD 0x1234567890abcdef,[IP,a]
    STD 0xcdef1234567890ab,[IP,b]
    STD 0x567890abcdef1234,[IP,c]
    STD 0x5678901234abcdef,[IP,d]
    RET

    5 instructions, 13 words, 0 .data, 0 .bss

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays  globals      Architecture
      28    66 (34+32)   RV64GC
      27    69           AMD64
      44    84           ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

       libc     ksh    pax     ed
    1102054  124726  66218  26226  riscv-riscv32
    1077192  127050  62748  26550  riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang) versions.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Aug 27 00:56:58 2025
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but runs out of area when
    you put microcode on the same die (area). Thus, RISC was
    born. Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD} all pipelined.

    Also, PDP-11 compatibility depended on microcode.

    Different address modes mainly.

    Without microcode engine one would need parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing microcode
    engine.

    To summarize, it is not clear to me if RISC in VAX technology
    could be significantly faster than VAX especially given constraint
    of PDP-11 compatibility.

    RISC in MSI TTL logic would not have worked all that well.

    OTOH VAX designers probably felt
    that the CISC nature added significant value: they understood
    that the cost of programming was significant and believed that an
    orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler.

    Some of us RISC designers believe similarly {about orthogonal
    ISA not about address modes.}

    They
    probably thought that providing reasonably common procedures
    as microcoded instructions made the work of programmers simpler
    even if the routines were only marginally faster than ordinary
    code.

    We think similarly--but we do not accept µCode being slower
    than SW ISA, or especially compiled HLL.

    Part of this thinking was probably like the "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems; they wanted something with unique
    features that customers will want to use.

    s/used/get locked in on/

    Without
    insight into the future it is hard to say that they were
    wrong.

    It is hard to argue that they made ANY mistakes with
    what we know about the world of computers circa 1977.

    It is not hard in 2025.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 27 10:56:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, near as I can tell it took *at least* 1 clock per instruction byte
    for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline stalls. The microsequencer has some pipelining, overlap read of the next uWord
    with execute of current, which would introduce a branch delay slot into
    the microcode. As it uses the opcode and operand bytes to do N-way jump/call
    to uSubroutines, each of those dispatches might have a branch delay slot too.

    (Similar issues appear in the MV-8000 uSequencer except it appears to
    have 2 or maybe 3 microcode branch delay slots).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 28 07:49:31 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:
    -----------snip--------------
    If VAX designers could not afford a pipeline, then it is
    not clear if RISC could afford it: removing microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    Witness Mc 68000, Mc 68010, and Mc 68020. In all these
    designs, the microcode and its surrounding engine took
    1/2 of the die area inside the pins.

    Note that most of this is microcode ROM. They complicated
    logic to get smaller ROM size. For VAX it was quite different:
    microcode memory (and cache) were build from LSI chips,
    not suitable for logic at that time. Assuming 6 transistor
    static RAM cells VAX had 590000 transistors in microcode memory
    chips (and another 590000 transistors in cache chips).
    Comparatively one can estimate VAX logic chips as between 20000
    and 100000 transistors, with low numbers looking more likely
    to me. IIUC at least early VAX on a "single" chip were slowed
    down by going to off-chip microcode memory.

    In 1980 it was possible to put the data path of a 32-bit
    ISA on one die and pipeline it, but runs out of area when
    you put microcode on the same die (area). Thus, RISC was
    born. Mc88100 had a decoder and sequencer that was 1/8
    of the interior area of the chip and had 4 FUs {Int,
    Mem, MUL, and FADD} all pipelined.

    Yes, but IIUC big item was on-chip microcode memory (or pins
    needed to go to external microcode memory).
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Aug 28 13:39:54 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.

    Going through the VAX 780 hardware schematics and various performance
    papers, near as I can tell it took *at least* 1 clock per instruction byte for decode, plus any I&D cache miss and execute time, as it appears to
    use microcode to pull bytes from the 8-byte instruction buffer (IB)
    *one at a time*.

    So far I have not found any parallel pathway that could pull a multi-byte immediate operand from the IB in 1 clock.

    And I say "at least" 1 C/IB as I am not including any micro-pipeline
    stalls.
    The microsequencer has some pipelining, overlap read of the next uWord
    with execute of current, which would introduce a branch delay slot into
    the microcode. As it uses the opcode and operand bytes to do N-way
    jump/call
    to uSubroutines, each of those dispatches might have a branch delay slot too.

    (Similar issues appear in the MV-8000 uSequencer except it appears to
    have 2 or maybe 3 microcode branch delay slots).

    I found a description of the 780 instruction buffer parser
    in the Data Path description on bitsavers and
    it does in fact pull one operand specifier from IB per clock.
    There is a mux network to handle various immediate formats in parallel.

    There are conflicting descriptions as to exactly how it handles the
    first operand, whether that is pulled with the opcode or in a separate clock, as the IB shifter can only do 1 to 5 byte shifts but an opcode with
    a first operand with 32-bit displacement would be 6 bytes.

    But basically it takes 1 clock for the opcode byte and the first operand specifier byte, a second clock if the first opspec has an immediate,
    then 1 clock for each subsequent operand specifier.
    If an operand has an immediate it is extracted in parallel with its opspec.

    If that is correct a MOV rs,rd or ADD rs,rd would take 2 clocks to decode,
    and a MOV offset(rs),rd would take 3 clocks to decode.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 31 18:04:44 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
        long i, r;
        for (i=0, r=0; i<n; i++)
            r+=v[i];
        return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
        a = 0x1234567890abcdefL;
        b = 0xcdef1234567890abL;
        c = 0x567890abcdef1234L;
        d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
              Bytes                       Instructions
    arrays  globals      Architecture    arrays  globals
      28    66 (34+32)   RV64GC            12       9
      27    69           AMD64             11       9
      44    84           ARM A64           11      22
      32    68           My 66000           8       5

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing, sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance of code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Sun Aug 31 16:43:26 2025
    From Newsgroup: comp.arch

    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    Same here (tho I was on team Debian), but I don't think GNU/Linux
    enthusiasts were the main buyers of those Opteron and
    Athlon64 machines.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sun Aug 31 22:26:43 2025
    From Newsgroup: comp.arch

    On Sun, 31 Aug 2025 16:43:26 -0400, Stefan Monnier wrote:

    ... I don't think GNU/Linux enthusiasts were the main buyers of
    those Opteron and Athlon64 machines.

    Their early popularity would have been in servers. And servers were
    already becoming dominated by Linux in those days.

    “Opteron” was specifically a brand name for server chips, as I recall.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Sep 1 06:07:27 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    Same here (tho I was on team Debian)

    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time,
    and still busily working on their multi-arch (IIRC) plans, so
    eventually I decided to go with Fedora Core 1, which just implemented
    /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after
    (/etc/hostname from 2005-02-20, and IIRC Debian still had not finished hammering out multi-arch at that time), before finally settling in
    Debian-land several years later.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Mon Sep 1 06:57:26 2025
    From Newsgroup: comp.arch

    On Mon, 01 Sep 2025 06:07:27 GMT, Anton Ertl wrote:

    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time
    ... so eventually I decided to go with Fedora Core 1, which just
    implemented /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after ...
    before finally settling in Debian-land several years later.

    Distro-hopping is a long-standing tradition in the Linux world. No other platform comes close.
    --- Synchronet 3.21a-Linux NewsLink 1.2