• Re: VAX (was: Why I've Dropped In)

    From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 00:59:07 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move towards workstations within the same company (which would be HARD).

    None of the companies which tried to move in that direction were
    successful. The mass micro market had much higher volumes and lower
    margins, and those accustomed to lower-volume, higher-margin operation
    simply couldn’t adapt.

    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could have done to
    the PC market, if IBM could have been prevented from gaining such market dominance.

    IBM had massive marketing clout in the mainframe market. I think that was
    the basis on which customers gravitated to their products. And remember,
    the IBM PC was essentially a skunkworks project that totally went against
    the entire IBM ethos. Internally, it was seen as a one-off mistake that
    they determined never to repeat. Hence the PS/2 range.

    DEC was bigger in the minicomputer market. If DEC could have offered an open-standard machine, that could have offered serious competition to IBM.
    But what OS would they have used? They were still dominated by Unix-haters then.

    A bit like the /360 strategy, offering a wide range of machines (or CPUs
    and systems) with different performance.

    That strategy was radical in 1964, less so by the 1970s and 1980s. DEC,
    for example, offered entire ranges of machines in each of its various minicomputer families.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 11:28:45 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    counter-argument to ILP64, where the more natural alternative is LP64.

    I am curious what makes you think that I32LP64 is "more natural",
    given that C is a human creation.

    ILP64 is more consistent with the historic use of int: int is the
    integer type corresponding to the unnamed single type of B
    (predecessor of C), which was used for both integers and pointers.
    You can see that in various parts of C, e.g., in the integer type
    promotion rules (all integers are promoted at least to int in any
    case, beyond that only when another bigger integer is involved).
    Another example is

main(argc, argv)
char *argv[];
{
    return 0;
}

    Here the return type of main() defaults to int, and the type of argc
    defaults to int.

    As a consequence, one should be able to cast int->pointer->int and pointer->int->pointer without loss. That's not the case with I32LP64.
    It is the case for ILP64.
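
A small test program (a sketch of mine, nothing standard about the casts;
compile warnings aside) makes the difference visible:

#include <stdio.h>

int main(void)
{
    int x = 42;
    void *p = &x;
    int i = (int)(long)p;       /* pointer -> int: truncates if int is narrower than a pointer */
    void *q = (void *)(long)i;  /* int -> pointer: the lost bits cannot come back */
    printf("%s\n", p == q ? "round trip preserved" : "round trip lost");
    return 0;
}

Under ILP64 this prints "round trip preserved"; under I32LP64 it typically
does not, because the pointer no longer fits in an int.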

    Some people conspired in 1992 to set the de-facto standard, and made
    the mistake of deciding on I32LP64 <https://queue.acm.org/detail.cfm?id=1165766>, and we have paid for
    this mistake ever since, one way or the other.

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.

    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.
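
A small example of the kind of code in question (a sketch; compilers may of
course do this differently): under I32LP64 the index below is a 32-bit int,
so an A64 compiler can use a load with an SXTW-extended register offset and
an RV64G compiler keeps the index canonical with ADDIW/SEXT.W, whereas under
ILP64 the index would already be 64 bits and a plain add and load would do.

#include <stdint.h>

int64_t sum32(const int32_t *a, int n)
{
    int64_t s = 0;
    for (int i = 0; i < n; i++)   /* i has type int: 32-bit under I32LP64, 64-bit under ILP64 */
        s += a[i];                /* the 32-bit index must be sign-extended before scaling */
    return s;
}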

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Wed Aug 6 08:28:03 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 00:59:07 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    DEC was bigger in the minicomputer market. If DEC could have offered
    an open-standard machine, that could have offered serious competition
    to IBM. But what OS would they have used? They were still dominated
    by Unix-haters then.

    DEC had plenty of experience in small-system single-user OSes by then;
    their bigger challenge would've been picking one. (CP/M owes a lot to
    the DEC lineage, although it dispenses with some of the more tedious mainframe-isms - e.g. the RUN [program] [parameters] syntax vs. just
    treating executable files on disk as commands in themselves.)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 6 15:55:06 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.

Objectively, a lot of programs fit into a 32-bit address space and
    may wish to run as 32-bit code for increased performance. Code
    that fits into 16-bit address space is rare enough on 64-bit
    machines to ignore.

    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.

It is more complex. There are machines on the market with 64 MB
RAM and a 64-bit RISC-V processor. There are (or were) machines
with 512 MB RAM and a 64-bit ARM processor. On such machines it
is quite natural to use 32-bit pointers. With 32-bit pointers
there is the possibility of using existing 32-bit code. And
ILP32 is the natural model.

You can say that 32-bit pointers on 64-bit hardware are rare.
But we really do not know. And especially in the embedded space one
big customer may want a feature, and the vendor, to avoid fragmentation,
provides that feature to everyone.

Why does such code need 32-bit addressing? Well, if enough parts of
C were undefined, the compiler could just extend everything to
64 bits during load. So you could equally well claim that the real problem
is that the C standard should have more undefined behaviour.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 14:00:56 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The claim by John Savard was that the VAX "was a good match to the
    technology *of its time*". It was not. It may have been a good
    match for the beliefs of the time, but that's a different thing.


The evidence of the 801 is that the 801 did not deliver until more than a decade
later. And the variant that delivered was quite different from the original
801.
Actually, it can be argued that the 801 didn't deliver until more than 15
years later.

    Maybe for IBM. IBM had its successful S/370 business, and no real
    need for the IBM 801 after the telephone switch project for which it
was originally developed had been canceled, so they were in no hurry to productize it. <https://en.wikipedia.org/wiki/IBM_ROMP> says:

|The architectural work on the ROMP began in late spring of 1977, as a
|spin-off of IBM Research's 801 RISC processor (hence the "Research"
    |in the acronym). Most of the architectural changes were for cost
    |reduction, such as adding 16-bit instructions for
    |byte-efficiency. [...]
    |
    |The first chips were ready in early 1981 [...] ROMP first appeared in
    |a commercial product as the processor for the IBM RT PC workstation,
    |which was introduced in 1986. To provide examples for RT PC
    |production, volume production of the ROMP and its MMU began in
    |1985. The delay between the completion of the ROMP design, and
    |introduction of the RT PC was caused by overly ambitious software
    |plans for the RT PC and its operating system (OS).

    If IBM had been in a hurry to introduce ROMP, they would have had a
    contingency plan for the RT PC system software.

    For comparison:

    HPPA: "In early 1982, work on the Precision Architecture began at HP Laboratories, defining the instruction set and virtual memory
    system. Development of the first TTL implementation started in April
    1983. With simulation of the processor having completed in 1983, a
    final processor design was delivered to software developers in July
    1984. Systems prototyping followed, with "lab prototypes" being
    produced in 1985 and product prototypes in 1986. The first processors
    were introduced in products during 1986, with the first HP 9000 Series
    840 units shipping in November of that year." <https://en.wikipedia.org/wiki/PA-RISC>

    MIPS: Inspired by IBM 801, Stanford MIPS research project 1981-1984,
    1984 MIPS Inc, R2000 and R2010 (FP) introduced May 1986 (12.5MHz), and according to
    <https://en.wikipedia.org/wiki/MIPS_Computer_Systems#History> MIPS
    delivered a workstation in the same year.

    SPARC: Berkeley RISC research project between 1980 and 1984; <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
    801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
    in May 1982 (but could only run at 0.5MHz). No date for the
    completion of RISC-II, but given that the research project ended in
    1984, it was probably at that time. Sun developed Berkeley RISC into
    SPARC, and the first SPARC machine, the Sun-4/260 appeared in July
    1987 with a 16.67MHz processor.

    ARM: Inspired by Berkeley RISC, "Acorn initiated its RISC research
    project in October 1983" <https://en.wikipedia.org/wiki/Acorn_Computers#New_RISC_architecture>
    "The first samples of ARM silicon worked properly when first received
    and tested on 26 April 1985. Known as ARM1, these versions ran at 6
    MHz.[...] late 1986 introduction of the ARM2 design running at 8 MHz
    [...] Acorn Archimedes personal computer models A305, A310, and A440,
    launched on the 6th June 1987." <https://en.wikipedia.org/wiki/ARM_architecture_family#History> Note
    that the Acorn people originally were not computer architects or
    circuit designers. ARM1 and ARM2 did not include an MMU, cache
    controller, or FPU, however.

    There are examples of Motorola (88000, 1988), Intel (i960, 1988), IBM
    (RS/6000, 1990), and DEC (Alpha, 1992) which had successful
    established architectures, and that caused the problem of how to place
    the RISC architecture in the market, and a certain lack of urgency.
    Read up on the individual architectures and their predecessors to
    learn about the individual causes for delays (there's not much in
    Wikipedia about the development of the 88000, however).

    HP might have been in the same camp, but apparently someone high up at
    HP decided to replace all their existing architectures with RISC ASAP,
    and they succeeded.

    In any case, RISCs delivered, starting in 1986. There is no reason
    they could not have delivered earlier.


    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 16:21:51 2025
    From Newsgroup: comp.arch

    Al Kossow <aek@bitsavers.org> writes:
    [RISC] didn't really make sense until main
    memory systems got a lot faster.

    The memory system of the VAX 11/780 was plenty fast for RISC to make
    sense:

    Cache cycle time: 200ns
    Memory cycle time: 600ns
    Average memory access time: 290ns
    Average VAX instruction execution time: 2000ns

If we assume 1.5 RISC instructions per average VAX instruction, and a
RISC CPI of 2 cycles (400ns: the 290ns plus extra time for data memory
accesses and branches), the equivalent of a VAX instruction takes
1.5 * 400ns = 600ns, more than 3 times as fast as the actual VAX.

    Followups to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 6 12:00:36 2025
    From Newsgroup: comp.arch

    On 8/6/2025 6:28 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    counter-argument to ILP64, where the more natural alternative is LP64.

    I am curious what makes you think that I32LP64 is "more natural",
    given that C is a human creation.


    We would have needed a new type to be able to express 32 bit values.

    Though, goes and looks it up, apparently the solution was to add __int32
    to address this issue.

    So, it seems, an early occurrence of the __int8, __int16, __int32,
    __int64, __int128 system.


    ILP64 is more consistent with the historic use of int: int is the
    integer type corresponding to the unnamed single type of B
    (predecessor of C), which was used for both integers and pointers.
    You can see that in various parts of C, e.g., in the integer type
    promotion rules (all integers are promoted at least to int in any
    case, beyond that only when another bigger integer is involved).
    Another example is

main(argc, argv)
char *argv[];
{
    return 0;
}

    Here the return type of main() defaults to int, and the type of argc
    defaults to int.

    As a consequence, one should be able to cast int->pointer->int and pointer->int->pointer without loss. That's not the case with I32LP64.
    It is the case for ILP64.


    Possibly.

    Though, in BGBCC I did make a minor tweak in the behavior of K&R and C89
    style code:
    The 'implicit int' was replaced with 'implicit long'...

    Which, ironically, allows a lot more K&R style code to run unmodified on
    a 64-bit machine. Where, if one assumes 'int', then a lot of K&R style
    code doesn't work correctly.
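
A classic example of the kind of K&R code this helps with (my illustration;
compile it as C89, modern compilers will warn or reject it): with no
prototype in scope, malloc() is assumed to return the implicit type, so
under I32LP64 the returned pointer gets truncated to 32 bits, while an
implicit 64-bit 'long' keeps it intact.

char *buf;

main()
{
    buf = malloc(100);   /* no declaration in scope: assumed to return int (or, with BGBCC's tweak, long) */
    return 0;
}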


    Some people conspired in 1992 to set the de-facto standard, and made
    the mistake of deciding on I32LP64 <https://queue.acm.org/detail.cfm?id=1165766>, and we have paid for
    this mistake ever since, one way or the other.

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.


    It is a tradeoff.

    A lot of 32 bit code expected int to be 32 bits, and also expects int to
    wrap on overflow. Without ADDW and friends, the expected wrap on
    overflow behavior is not preserved.

    Early BitManip would have added an ADDWU instruction (ADDW but zero extending); but then they dropped it.

    In my own RV extensions, I re-added ADDWU because IMHO dropping it was a mistake.


    In Zba, they have ADDUW instead, which zero-extends Rs1; so "ADDUW Rd,
    Rs, X0" can be used to zero-extend stuff, but this isn't as good. There
    was at one point an ADDIWU instruction, but I did not re-add it. I
    managed to add the original form of my jumbo prefix into the same
    encoding space; but have since relocated it.


    Re-adding ADDIWU is more debatable as the relative gains are smaller
    than for ADDWU (in a compiler with zero-extended unsigned int).

    For RV64G, it still needs, say:
    ADD Rd, Rs, Rt
    SLLI Rd, Rd, 32
SRLI Rd, Rd, 32
    Which isn't ideal.

    Though, IMHO, the cost of needing 2 shifts for "unsigned int" ADD is
    less than the mess that results from sign-extending "unsigned int".

    Like, Zba adds "SHnADD.UW" and similar, which with zero-extended
    "unsigned int" would have been entirely unnecessary.


    So, that was my partial act of rebellion against the RV ABI spec (well,
    that and different handling of passing and returning structs by value).

    Where, BGBCC handles it in a way more like that in MS style ABIs, where:
    1-16 bytes, pass in registers or register pair;
    17+ bytes: pass or return via memory reference.

    As opposed to using on-stack copying as the fallback case.
    Though, arguably at least less of a mess than whatever was going on in
    the design of the SysV AMD64 ABI.



    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.


    This assumes though that all 64 bit operations can have the same latency
    as 32 bit operations.

    If you have a machine where common 32-bit ops can have 1 cycle latency
    but 64 needs 2 cycles, then it may be preferable to have 32 bit types
    for cases where 64 isn't needed.


    But, yeah, in an idealized world, maybe yeah, the avoidance of 32-bit
    int, or at least the avoidance of a dependency on assumed wrap on
    overflow semantics, or implicit promotion to whatever is the widest
    natively supported type, could have led to less of a mess.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Wed Aug 6 10:20:18 2025
    From Newsgroup: comp.arch

    On 8/6/25 7:00 AM, Anton Ertl wrote:

    In any case, RISCs delivered, starting in 1986.

    http://bitsavers.org/pdf/ridge/Ridge_Hardware_Reference_Manual_Aug82.pdf


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Wed Aug 6 17:25:25 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, ...

    The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
    but I would be surprised if anyone had written a C compiler for it.

    It was bit addressable but memories in those days were so small that a full bit address was only 24 bits. So if I were writing a C compiler, pointers and ints would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely unrelated.)
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 16:47:39 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    De Castro had had a big success with a simple load-store
    architecture, the Nova. He did that to reduce CPU complexity
    and cost, to compete with DEC and its PDP-8. (Byte addressing
    was horrible on the Nova, though).

    The PDP-8, and its 16-bit followup, the Nova, may be load/store, but
neither is a register machine nor byte-addressed, while the PDP-11 is,
    and the RISC-VAX would be, too.

    Now, assume that, as a time traveler wanting to kick off an early
    RISC revolution, you are not allowed to reveal that you are a time
    traveler (which would have larger effects than just a different
    computer architecture). What do you do?

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    Yes, convincing people in the mid-1970s to bet the company on RISC is
a hard sell; that's why I asked for "a magic wand that would convince the
    DEC management and workforce that I know how to design their next
    architecture, and how to compile for it" in <2025Mar1.125817@mips.complang.tuwien.ac.at>.

    Some arguments that might help:

    Complexity in CISC and how it breeds complexity elsewhere; e.g., the interaction of having more than one data memory access per
    instruction, virtual memory, and precise exceptions.

    How the CDC 6600 achieved performance (pipelining) and how non-complex
    its instructions are.

    I guess I would read through RISC-vs-CISC literature before entering
    the time machine in order to have some additional arguments.


    Concerning your three options, I think it will be a problem in any
    case. Data General's first bet was on FHP, a microcoded machine with user-writeable microcode, so maybe even more in the wrong direction
    than VAX; I can imagine a high-performance OoO VAX implementation, but
    for an architecture with exposed microcode like FHP an OoO
    implementation would probably be pretty challenging. The backup
    project that eventually came through was also a CISC.

Concerning founding one's own company, one would have to convince
    venture capital, and then run the RISC of being bought by one of the
    big players, who buries the architecture. And even if you survive,
    you then have to build up the whole thing: production, marketing,
    sales, software support, ...

    In any case, the original claim was about the VAX, so of course the
    question at hand is what DEC could have done instead.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 6 20:43:33 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

BTW, it seems that in your code fragment above you forgot to
zeroize EDX at the beginning of the iteration. Or am I missing
something?

    No, you are not. I skipped pretty much all the setup code. :-)

It's not the setup code that looks missing to me, but the zeroing of RDX in
the body of the loop.



    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort
of dependency in the renamer. Probably not yet when it is coded as
'inc', but when coded as 'add' or 'lea'.

The dependency through RDX/RBX does not form a chain. The next
value of [rdi+rcx*8] does depend on the value of rbx from the previous
iteration, but the next value of rbx depends only on [rsi+rcx*8],
[r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous
value of rbx, except for a control dependency that hopefully would
    be speculated around.

I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries
    from the previous round.

    This is the carry chain that I don't see any obvious way to
    break...

You break the chain by *predicting* that
carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you
pay the heavy price of a branch misprediction. But outside of specially
    crafted inputs it is extremely rare.
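
A scalar C sketch of that idea (mine, only meant to show where the
rarely-taken correction sits; in the asm version the 'if' becomes a
predicted-not-taken branch):

#include <stdint.h>
#include <stddef.h>

/* dst = a + b + c over n 64-bit words, least significant word first */
void add3_words(uint64_t *dst, const uint64_t *a,
                const uint64_t *b, const uint64_t *c, size_t n)
{
    uint64_t carry_in = 0;             /* carries from the previous word */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + b[i];
        uint64_t cy = (t < a[i]);      /* carry out of a+b */
        uint64_t s = t + c[i];
        cy += (s < t);                 /* carry out of a+b+c: the "predicted" carry */
        uint64_t r = s + carry_in;
        if (r < s)                     /* rare: adding the incoming carry overflowed as well */
            cy += 1;
        dst[i] = r;
        carry_in = cy;
    }
}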

    Aha!

    That's _very_ nice.

    Terje



I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
7543P).
In order to see the effects more clearly I had to modify Anton's function
to one that operates on pointers, because otherwise too much time was
spent at the caller's site copying things around, which made the
measurements too noisy.

void add3(uintNN_t *dst, const uintNN_t *a, const uintNN_t *b, const uintNN_t *c)
{
    *dst = *a + *b + *c;
}


After the change I saw a significant speed-up on 3 out of 4 platforms.
The only platform where the speed-up was non-significant was Skylake,
probably because its rename stage is too narrow to profit from the change.
The widest machine (Raptor Cove) benefited most.
The results appear inconclusive with regard to the question of whether the
dependency between loop iterations is eliminated completely or just
shortened to 1-2 clock cycles per iteration. Even the widest of my
cores is relatively narrow. Considering that my variant of the loop contains
13 x86-64 instructions and 16 uOps, I am afraid that even the likes of Apple
M4 would be too narrow :(

    Here are results in nanoseconds for N=65472
Platform        RC      GM      SK      Z3
clang        896.1  1476.7  1453.2  1348.0
gcc          879.2  1661.4  1662.9  1655.0
x86          585.8  1489.3   901.5   672.0
Terje's      772.6  1293.2  1012.6  1127.0
My           397.5   803.8   965.3   660.0
ADX          579.1  1650.1   728.9   853.0
x86/u2       581.5  1246.2   679.9   584.0
Terje's/u3   503.7   954.3   630.9   755.0
My/u3        266.6   487.2   486.5   440.0
ADX/u8       350.4   839.3   490.4   451.0

'x86' is a variant that was sketched in one of my above
posts. It calculates the sum in two passes over the arrays.
'ADX' is a variant that uses the ADCX/ADOX instructions as suggested by
Anton, but unlike his suggestion does it in a loop rather than in one long
straight code sequence.
/u2, /u3, /u8 indicate the unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 6 12:47:10 2025
    From Newsgroup: comp.arch

    On 8/6/2025 10:55 AM, Waldek Hebisch wrote:
    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.

Objectively, a lot of programs fit into a 32-bit address space and
    may wish to run as 32-bit code for increased performance. Code
    that fits into 16-bit address space is rare enough on 64-bit
    machines to ignore.

    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.

It is more complex. There are machines on the market with 64 MB
RAM and a 64-bit RISC-V processor. There are (or were) machines
with 512 MB RAM and a 64-bit ARM processor. On such machines it
is quite natural to use 32-bit pointers. With 32-bit pointers
there is the possibility of using existing 32-bit code. And
ILP32 is the natural model.

You can say that 32-bit pointers on 64-bit hardware are rare.
But we really do not know. And especially in the embedded space one
big customer may want a feature, and the vendor, to avoid fragmentation,
provides that feature to everyone.

Why does such code need 32-bit addressing? Well, if enough parts of
C were undefined, the compiler could just extend everything to
64 bits during load. So you could equally well claim that the real problem
is that the C standard should have more undefined behaviour.


Something like the X32-style ABIs almost makes sense, since most
processes need less than 4GB of RAM.

    But, then the problem becomes that one would need both 32 and 64 bit
    variants of most of the OS shared libraries, which may well offset the
    savings from using less RAM for pointers.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:11:08 2025
    From Newsgroup: comp.arch

    On 8/6/25 10:25, John Levine wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, ...

    The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
    but I would be surprised if anyone had written a C compiler for it.

    It was bit addressable but memories in those days were so small that a full bit
    address was only 24 bits. So if I were writing a C compiler, pointers and ints
    would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely unrelated.)

    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few bytes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Wed Aug 6 19:50:17 2025
    From Newsgroup: comp.arch

    According to Peter Flass <Peter@Iron-Spring.com>:
    It was bit addressable but memories in those days were so small that a full bit
    address was only 24 bits. So if I were writing a C compiler, pointers and ints
    would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely unrelated.)

    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few
bytes.

STRETCH had a severe case of second-system syndrome, and was full of
complex features that weren't worth the effort; it was impressive
that IBM got it to work and to run as fast as it did.

    In that era memory was expensive, and usually measured in K, not M.
    The idea was presumably to pack data as tightly as possible.

    In the 1970s I briefly used a B1700 which was bit addressable and had reloadable
    microcode so COBOL programs used the COBOL instruction set, FORTRAN programs used the FORTRAN instruction set, and so forth, with each one having whatever word or byte sizes they wanted. In retrospect it seems like a lot of
    premature optimization.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 20:30:00 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Peter Flass <Peter@Iron-Spring.com>:
    It was bit addressable but memories in those days were so small that a full bit
    address was only 24 bits. So if I were writing a C compiler, pointers and ints
    would be 32 bits, char 8 bits, long 64 bits.

(There is a thing called STRETCH C Compiler but it's completely unrelated.)

I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few
bytes.

STRETCH had a severe case of second-system syndrome, and was full of
complex features that weren't worth the effort; it was impressive
that IBM got it to work and to run as fast as it did.

    In that era memory was expensive, and usually measured in K, not M.
    The idea was presumably to pack data as tightly as possible.

In the 1970s I briefly used a B1700 which was bit addressable and had reloadable
microcode so COBOL programs used the COBOL instruction set, FORTRAN programs
used the FORTRAN instruction set, and so forth, with each one having whatever
word or byte sizes they wanted. In retrospect it seems like a lot of
premature optimization.

    We had a B1900 in the software lab, but I don't recall anyone
    actually using it - I believe it had been moved from Santa
    Barbara (Small Systems plant) and may have been used for
    reproducing customer issues, but by 1983, there weren't many
    small systems customers remaining.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch,alt.folklore.computers on Wed Aug 6 22:30:56 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 14:00:56 GMT, Anton Ertl wrote:

    For comparison:

    SPARC: Berkeley RISC research project between 1980 and 1984; <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
    801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
    in May 1982 (but could only run at 0.5MHz). No date for the completion
    of RISC-II, but given that the research project ended in 1984, it was probably at that time. Sun developed Berkeley RISC into SPARC, and the
    first SPARC machine, the Sun-4/260 appeared in July 1987 with a 16.67MHz processor.

The Katevenis thesis on RISC-II contains a timeline on p6; it lists fabrication in spring '83 with testing during summer '83.

    There is also a bibliography entry of an informal discussion with John
Cocke at Berkeley about the 801 in June 1983.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Wed Aug 6 23:34:04 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 11:28:45 GMT, Anton Ertl wrote:

    Why were 32-bit indices and 32-bit operations more important than 16-bit indices and 16-bit operations?

32 bits was considered a kind of “sweet spot” in the evolution of computer architectures. It was the first point at which memory-addressability constraints were no longer at the top of the list of things to worry about
    when designing a software architecture.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:36:11 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 12:11:08 -0700, Peter Flass wrote:

    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few bytes.

    But with 64-bit addressing, it only means sacrificing the bottom 3 bits.

    With normal load/store, you can insist that these 3 bits be zero, whereas
    in bit-aligned load/store, they can specify a nonzero bit offset.
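
In other words, a bit address is just a byte address shifted left by 3 with
the bit offset in the low bits; a minimal sketch (mine):

#include <stdint.h>

/* load a single bit, given a base pointer and a 64-bit *bit* address:
   the upper bits select the byte, the low 3 bits select the bit */
static inline unsigned load_bit(const uint8_t *mem, uint64_t bitaddr)
{
    return (mem[bitaddr >> 3] >> (bitaddr & 7)) & 1u;
}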
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:45:44 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 08:28:03 -0700, John Ames wrote:

    CP/M owes a lot to the DEC lineage, although it dispenses with some
    of the more tedious mainframe-isms - e.g. the RUN [program]
    [parameters] syntax vs. just treating executable files on disk as
    commands in themselves.)

    It added its own misfeatures, though. Like single-letter device names,
    but only for disks. Non-file-structured devices were accessed via “reserved” file names, which continue to bedevil Microsoft Windows to
    this day, aggravated by a totally perverse extension of the concept to
    paths with hierarchical directory names.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Charlie Gibbs@cgibbs@kltpzyxm.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 01:49:18 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 6 Aug 2025 08:28:03 -0700, John Ames wrote:

    CP/M owes a lot to the DEC lineage, although it dispenses with some
    of the more tedious mainframe-isms - e.g. the RUN [program]
    [parameters] syntax vs. just treating executable files on disk as
    commands in themselves.)

    It added its own misfeatures, though. Like single-letter device names,
    but only for disks. Non-file-structured devices were accessed via “reserved” file names, which continue to bedevil Microsoft Windows to this day, aggravated by a totally perverse extension of the concept to
    paths with hierarchical directory names.

    Funny how people ridicule COBOL's reserved words, while accepting MS-DOS/Windows' CON, LPT, etc. If only a trailing colon (which I
    always used) were mandatory; that would put device names cleanly
    into a different name space, eliminating the problem.

    But, you know, Microsoft...
    --
    /~\ Charlie Gibbs | Growth for the sake of
    \ / <cgibbs@kltpzyxm.invalid> | growth is the ideology
    X I'm really at ac.dekanfrus | of the cancer cell.
    / \ if you read it the right way. | -- Edward Abbey
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Thu Aug 7 02:56:08 2025
    From Newsgroup: comp.arch

    According to Lars Poulsen <lars@cleo.beagle-ears.com>:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

Not having a 16-bit integer type and not having a 32-bit integer type
would make it very hard to adapt portable code, such as TCP/IP protocol
processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't
    declare its variables to use those word sizes, I don't think "portable" is the
    right term.

My concern is how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
native type. If "short" is 64 bits, how do you define a 16-bit one?

    In modern C you use the values in limits.h to pick the type, and define
    macros that mask values to the size you need. In older C you did the same thing in much uglier ways. Writing code that is portable across different
    word sizes has always been tedious.
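
A minimal sketch of that approach (the names int16 and MASK16 are mine): on
a machine where short is 64 bits this still yields a type that can hold
16-bit values, and the mask macro supplies 16-bit wraparound where the code
relies on it.

#include <limits.h>

#if SCHAR_MAX >= 32767
typedef signed char int16;    /* rare, but possible on machines with big chars */
#else
typedef short int16;          /* short is at least 16 bits on any C implementation */
#endif

/* force 16-bit unsigned wraparound where the algorithm depends on it */
#define MASK16(x) ((unsigned long)(x) & 0xFFFFUL)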

    Or did the compiler have native types __int16 etc?

    Given how long ago it was, I doubt it.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Thu Aug 7 05:29:33 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    De Castro had had a big success with a simple load-store
    architecture, the Nova. He did that to reduce CPU complexity
    and cost, to compete with DEC and its PDP-8. (Byte addressing
    was horrible on the Nova, though).

    The PDP-8, and its 16-bit followup, the Nova, may be load/store, but
neither is a register machine nor byte-addressed, while the PDP-11 is,
    and the RISC-VAX would be, too.

    Now, assume that, as a time traveler wanting to kick off an early
    RISC revolution, you are not allowed to reveal that you are a time
    traveler (which would have larger effects than just a different
    computer architecture). What do you do?

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

Bring a mobile phone or tablet with you, install Stockfish,
    and beat everybody at chess.

    But making it known that you are a time traveller (and being able
    to prove it) would very probably invite all sorts of questions
    from all sorts of people about the future (or even about things
    in the then-present which were declassified in the future), and
these people might not take "no" or "I don't know" for an answer.

    [...]

    Yes, convincing people in the mid-1970s to bet the company on RISC is
a hard sell; that's why I asked for "a magic wand that would convince the
    DEC management and workforce that I know how to design their next architecture, and how to compile for it" in
    <2025Mar1.125817@mips.complang.tuwien.ac.at>.

    Some arguments that might help:

    Complexity in CISC and how it breeds complexity elsewhere; e.g., the interaction of having more than one data memory access per
    instruction, virtual memory, and precise exceptions.

    How the CDC 6600 achieved performance (pipelining) and how non-complex
    its instructions are.

    I guess I would read through RISC-vs-CISC literature before entering
    the time machine in order to have some additional arguments.


    Concerning your three options, I think it will be a problem in any
    case. Data General's first bet was on FHP, a microcoded machine with user-writeable microcode,

    That would have been the right time, I think - convince de Castro
    that, instead of writable microcode, RISC is the right direction.
The Fountainhead project started in July 1975, more or less contemporary
with the VAX, and an alternate Fountainhead could probably have
    been introduced at the same time, in 1977.

    so maybe even more in the wrong direction
    than VAX; I can imagine a high-performance OoO VAX implementation, but
    for an architecture with exposed microcode like FHP an OoO
    implementation would probably be pretty challenging. The backup
    project that eventually came through was also a CISC.

    Sure.


Concerning founding one's own company, one would have to convince
    venture capital, and then run the RISC of being bought by one of the
    big players, who buries the architecture. And even if you survive,
    you then have to build up the whole thing: production, marketing,
    sales, software support, ...

    That is one of the things I find astonishing - how a company like
DG grew from a kitchen-table affair to the size they had.

    In any case, the original claim was about the VAX, so of course the
    question at hand is what DEC could have done instead.

    - anton
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Aug 7 15:15:17 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

BTW, it seems that in your code fragment above you forgot to
zeroize EDX at the beginning of the iteration. Or am I missing
something?

    No, you are not. I skipped pretty much all the setup code. :-)

It's not the setup code that looks missing to me, but the zeroing of RDX in
the body of the loop.

    I don't remember my code exactly, but the intent was that RDX would
    contain any incoming carries (0,1,2) from the previous iteration.

    Using ADCX/ADOX would not be an obvious speedup, at least not obvious to me.

    Terje

I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
7543P).
In order to see the effects more clearly I had to modify Anton's function
to one that operates on pointers, because otherwise too much time was
spent at the caller's site copying things around, which made the
measurements too noisy.

void add3(uintNN_t *dst, const uintNN_t *a, const uintNN_t *b, const uintNN_t *c)
{
    *dst = *a + *b + *c;
}


After the change I saw a significant speed-up on 3 out of 4 platforms.
The only platform where the speed-up was non-significant was Skylake,
probably because its rename stage is too narrow to profit from the change.
The widest machine (Raptor Cove) benefited most.
The results appear inconclusive with regard to the question of whether the
dependency between loop iterations is eliminated completely or just
shortened to 1-2 clock cycles per iteration. Even the widest of my
cores is relatively narrow. Considering that my variant of the loop contains
13 x86-64 instructions and 16 uOps, I am afraid that even the likes of Apple
M4 would be too narrow :(

    Here are results in nanoseconds for N=65472
Platform        RC      GM      SK      Z3
clang        896.1  1476.7  1453.2  1348.0
gcc          879.2  1661.4  1662.9  1655.0
x86          585.8  1489.3   901.5   672.0
Terje's      772.6  1293.2  1012.6  1127.0
My           397.5   803.8   965.3   660.0
ADX          579.1  1650.1   728.9   853.0
x86/u2       581.5  1246.2   679.9   584.0
Terje's/u3   503.7   954.3   630.9   755.0
My/u3        266.6   487.2   486.5   440.0
ADX/u8       350.4   839.3   490.4   451.0

'x86' is a variant that was sketched in one of my above
posts. It calculates the sum in two passes over the arrays.
'ADX' is a variant that uses the ADCX/ADOX instructions as suggested by
Anton, but unlike his suggestion does it in a loop rather than in one long
straight code sequence.
/u2, /u3, /u8 indicate the unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz


    Thanks for an interesting set of tests/results!

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch,alt.folklore.computers on Thu Aug 7 15:44:55 2025
    From Newsgroup: comp.arch

    Peter Flass wrote:
    On 8/6/25 10:25, John Levine wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, ...

    The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
    but I would be surprised if anyone had written a C compiler for it.

It was bit addressable but memories in those days were so small that a
full bit address was only 24 bits. So if I were writing a C compiler,
pointers and ints would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely
    unrelated.)

I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few bytes.
Bit addressing, presumably combined with an easy way to mask the
results/pick an arbitrary number of bits less than or equal to the register
width, makes it easier to implement compression/decompression/codecs.

However, since the only thing needed to do the same on current CPUs is a
single shift after an aligned load, this feature costs far too much in
    reduced address space compared to what you gain.
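
A sketch of that shift-after-load idea (mine; assumes little-endian and
that the field fits inside the 8 loaded bytes, i.e. n <= 57):

#include <stdint.h>
#include <string.h>

/* extract n bits starting at bit position bitpos of a byte-addressed buffer */
static inline uint64_t get_bits(const uint8_t *buf, uint64_t bitpos, unsigned n)
{
    uint64_t w;
    memcpy(&w, buf + (bitpos >> 3), sizeof w);   /* load the 8 bytes containing the field */
    return (w >> (bitpos & 7)) & (((uint64_t)1 << n) - 1);
}
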
In the real world, all important codecs (like MP4 or AES crypto) end up
as dedicated hardware, either AES opcodes or a standalone VLSI slice
capable of CABAC decoding. The main reason is energy: a cell phone or
laptop cannot stream video all day without having hardware support for
the decoding task.

One possibly relevant anecdote: Back in the late 1990s, when Intel
was producing the first quad core Pentium Pro-style CPUs, I showed them
that it was in fact possible for one of those CPUs to decode a maximum
h264 bitstream, with 40 Mbit/s of CABAC-coded data, in pure software.
(Their own SW engineers had claimed that every other frame of a 60 Hz HD
video would have to be skipped.)

What Intel did was to license h264 decoding IP, since that would use far
less power and leave 3 of the 4 cores totally idle.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Thu Aug 7 07:34:28 2025
    From Newsgroup: comp.arch

    On 8/7/25 06:44, Terje Mathisen wrote:

Bit addressing, presumably combined with an easy way to mask the
results/pick an arbitrary number of bits less than or equal to the register
width, makes it easier to implement compression/decompression/codecs.

    However, since the only thing needed to do the same on current CPUs is a single shift after an aligned load, this feature costs far too much in reduced address space compared to what you gain.


Bit addressing *as an option* (Bit Load, Bit Store instructions, etc.)
is a great idea; for example, it greatly simplifies BitBlt logic. The
432's use of bit addressing for everything, especially instructions,
seems just too cute. I forget the details, it's been a while since I
looked, but it forced extremely small code segments which, combined with
the segmentation logic, etc., really impacted performance.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Thu Aug 7 08:28:52 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 23:45:44 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    It added its own misfeatures, though.

    Unfortunately, yes. "User areas" in particular are just a completely
    useless bastard child of proper subdirectories and something like
    TOPS-10's programmer/project pairs; even making user area 0 a "common
    area" accessible from any of the others would've helped, but they
    didn't do that. It's a sign of how misconceived they were that MS-DOS
    (in re-implementing CP/M) dropped them entirely and nobody complained,
    then added real subdirectories later.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Thu Aug 7 14:57:59 2025
    From Newsgroup: comp.arch

    Peter Flass <Peter@Iron-Spring.com> writes:
    [IBM STRETCH bit-addressable]
    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too

    One might come to think that it's the signature of overambitious
    projects that eventually fail.

    However, in the case of the IBM STRETCH, I think there's a good
    excuse: If you go from word addressing to subunit addressing (not sure
    why Stretch went there, however; does a supercomputer need that?), why
    stop at characters (especially given that character size at the time
    was still not settled)? Why not continue down to bits?

    The S/360 then found the compromise that conquered the world: Byte
    addressing with 8-bit bytes.

Why the iAPX432 went for bit addressing at a time when byte addressing and
the 8-bit byte were firmly established, over ten years after the S/360
and 5 years after the PDP-11, is a mystery, however.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From drb@drb@ihatespam.msu.edu (Dennis Boone) to comp.arch,alt.folklore.computers on Thu Aug 7 15:54:16 2025
    From Newsgroup: comp.arch

    However, in the case of the IBM STRETCH, I think there's a good
    excuse: If you go from word addressing to subunit addressing (not sure
    why Stretch went there, however; does a supercomputer need that?), why
    stop at characters (especially given that character size at the time
    was still not settled)? Why not continue down to bits?

    Remember who they built STRETCH for.

    De
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lynn Wheeler@lynn@garlic.com to comp.arch,alt.folklore.computers on Thu Aug 7 07:32:35 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    It's a 32 bit architecture with 31 bit addressing, kludgily extended
    from 24 bit addressing in the 1970s.

    2nd half of the 70s kludge, with 370s that could have 64mbytes of real memory
    but only 24bit addressing ... the virtual memory page table entry (PTE)
    had 16bits with 2 "unused bits" ... 12bit page number (12bit 4kbyte
    pages, 24bits) ... and the two unused bits were redefined to prepend to the
    page number ... making a 14bit page number ... for 26bits (instructions
    were still 24bit, but virtual memory was used to translate to 26bit real addressing).

    original 360 I/O had only 24bit addressing; adding virtual memory (to
    all 370s) added IDALs. The CCW was still 24bit but was still being
    built by applications running in virtual memory ... and (effectively)
    assumed any large storage location was one contiguous
    area. Moving to virtual memory, a large "contiguous" I/O area was now
    broken into page size chunks in non-contiguous real areas. Translating the
    "virtual" I/O program, the original virtual CCW ... would be converted
    to a CCW with real addresses and flagged as IDAL ... where the CCW pointed
    to an IDAL list of real addresses ... 32 bit words ... (31 bits specifying the real address) for each (possibly non-contiguous) real page
    involved.
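
    As a rough sketch (not actual VM/370 code) of that translation step, assuming a hypothetical page-table lookup virt_to_real(): the buffer is contiguous in virtual memory, so the translator emits one real address per 4K page into an IDAL-style list:

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096u

    extern uint32_t virt_to_real(uint32_t vaddr);  /* hypothetical lookup */

    /* Fill `idal` with one real address per page covering [vaddr, vaddr+len);
       the first entry keeps the byte offset into its page, the rest are
       page-aligned.  Returns the number of entries written. */
    static size_t build_idal(uint32_t vaddr, uint32_t len, uint32_t *idal)
    {
        size_t n = 0;

        while (len > 0) {
            uint32_t in_page = PAGE_SIZE - (vaddr % PAGE_SIZE); /* bytes left in this page */
            if (in_page > len)
                in_page = len;
            idal[n++] = virt_to_real(vaddr);   /* real address for this chunk */
            vaddr += in_page;
            len   -= in_page;
        }
        return n;
    }
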
    --
    virtualization experience starting Jan1968, online at home since Mar1970
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 13:01:07 2025
    From Newsgroup: comp.arch

    On 8/7/2025 7:57 AM, Anton Ertl wrote:
    Peter Flass <Peter@Iron-Spring.com> writes:
    [IBM STRETCH bit-addressable]
    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too

    One might come to think that it's the signature of overambitious
    projects that eventually fail.

    Interesting. While it seems to be sufficient to predict the failure of
    a project, it certainly isn't necessary. So I think calling it a
    signature is too extreme.


    However, in the case of the IBM STRETCH, I think there's a good
    excuse: If you go from word addressing to subunit addressing (not sure
    why Stretch went there, however; does a supercomputer need that?)

    While perhaps not absolutely necessary, it is very useful. For example, inputting the parameters for, and showing the results of, a simulation in
    human-readable format. And for a compiler. While you could do all of
    those things on another (different architecture) computer, and transfer
    the results via say magnetic tape, that is pretty inconvenient and
    increases the cost for that additional computer. And there is
    interaction with the console.


    , why
    stop at characters (especially given that character size at the time
    was still not settled)? Why not continue down to bits?

    According to Wikipedia

    https://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_formats

    it supported both binary and decimal fixed point arithmetic (so it helps
    to have four-bit "characters"), the floating point representation had a
    four-bit sign, and alphanumeric characters could be anywhere from 1-8
    bits. And as you say, 6-bit characters were common, especially for
    scientific computers.


    The S/360 then found the compromise that conquered the world: Byte
    addressing with 8-bit bytes.

    Yes, but several years later.

    Another factor that may have contributed. According to the same
    Wikipedia article, the requirements for the system came from Edward
    Teller then at Lawrence Livermore Labs, so there may have been some
    classified requirement that led to bit addressability.


    Why iAPX432 went for bit addressing at a time when byte addressing and
    the 8-bit byte was firmly established, over ten years after the S/360
    and 5 years after the PDP-11 is a mystery, however.

    Agreed.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Thu Aug 7 13:34:09 2025
    From Newsgroup: comp.arch

    The TI TMS34020 graphics processor may have been the last CPU to have bit addressing.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Aug 7 23:48:10 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Thu Aug 7 20:54:01 2025
    From Newsgroup: comp.arch

    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few
    bytes.

    Bit addressing, presumably combined with an easy way to mask the results/pick an arbitrary number of bits less than or equal to register
    width, makes it easier to implement compression/decompression/codecs.

    STRETCH was designed in the late 1950s. Shannon-Fano coding was invented
    in the 1940s, and Huffman published his paper on optimal coding in 1952,
    but modern codes like LZ were only invented in the 1970s. I doubt anyone
    did compression or decompression on STRETCH other than packing and unpacking bit fields.

    IBM's commercial machines were digit or character addressed, with a variety of different representations. They didn't know what the natural byte size would be, so they let you use whatever you wanted. That made it easy to pack and unpack bit fields to store data compactly in fields of exactly the minimum size.

    The NSA was an important customer, for whom they built the 7950 HARVEST coprocessor
    and it's quite plausible that they had applications for which bit addressing was useful.

    The paper on the design of S/360 said they looked at addressing of 6 bit characters, and 8 bit characters, with 4-bit BCD digits sometimes stored in them. It was evident at the time that 6 bit characters were too small, so
    8 bits it was. They don't mention bit addressing, so they'd presumably already decided that was a bad idea.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Thu Aug 7 16:01:07 2025
    From Newsgroup: comp.arch

    On 8/7/25 3:48 PM, Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    No, I do not. And I am worried.

    brian

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Fri Aug 8 03:51:08 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 15:44:55 +0200, Terje Mathisen wrote:

    However, since the only thing needed to do the same on current CPUs is a single shift after an aligned load, this feature costs far too much in reduced address space compared to what you gain.

    Reserving the bottom 3 bits for a bit offset in a 64-bit address, even if
    it is unused in most instructions, doesn’t seem like such a big cost. And
    it unifies the pointer representation for all data types, which can make things more convenient in a higher-level language.
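
    A small C sketch of the representation being suggested (names made up for illustration): keep the bit offset in the low 3 bits, so a plain byte address is just the bit pointer shifted right by 3:

    #include <stdint.h>

    typedef uint64_t bitptr;   /* (byte address << 3) | bit-within-byte */

    static bitptr make_bitptr(const uint8_t *p, unsigned bit)
    {
        /* The 3 bits shifted out at the top are the address-space cost
           discussed above. */
        return ((uint64_t)(uintptr_t)p << 3) | (bit & 7u);
    }

    static unsigned load_bit(bitptr bp)
    {
        const uint8_t *byte = (const uint8_t *)(uintptr_t)(bp >> 3);
        return (*byte >> (bp & 7u)) & 1u;
    }
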
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 8 01:41:00 2025
    From Newsgroup: comp.arch

    On 8/7/2025 4:01 PM, Brian G. Lucas wrote:
    On 8/7/25 3:48 PM, Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    No, I do not.  And I am worried.


    Yeah, that is concerning...


    brian


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Aug 8 11:58:39 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    I've been in contact; he lost his usenet provider, and the one I am
    using does not seem to accept new registrations any longer.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Aug 8 13:20:33 2025
    From Newsgroup: comp.arch

    On Fri, 8 Aug 2025 11:58:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently
    disappeared from the Usenet?

    I've been in cantact,

    Good.

    he lost his usenet provider,

    Terje


    I suspected as much. What made me worried is that at almost the
    same date he stopped posting on the RWT forum.

    and the one I am using does not seem to accept new registrations
    any longer.

    Eternal September does not accept new registrations?
    I think, if that is true, Ray Banana will make an exception for Mitch if
    asked personally.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 8 14:22:28 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 8 Aug 2025 11:58:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently
    disappeared from the Usenet?

    I've been in cantact,

    Good.

    he lost his usenet provider,

    Terje


    I was suspecting that much. What made me worrying is that almost at the
    same date he stopped posting on RWT forum.

    and the one I am using does not seem to accept new registrations
    any longer.

    Eternal September does not accept new registrations?
    I think, if it is true, Ray Banana will make excception for Mitch if
    asked personally.


    www.usenetserver.com is priced reasonably. I've been using them
    for well over a decade now.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Fri Aug 8 18:34:46 2025
    From Newsgroup: comp.arch

    On 2025-08-08 17:22, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 8 Aug 2025 11:58:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently
    disappeared from the Usenet?

    I've been in cantact,

    Good.

    he lost his usenet provider,

    Terje


    I was suspecting that much. What made me worrying is that almost at the
    same date he stopped posting on RWT forum.

    and the one I am using does not seem to accept new registrations
    any longer.

    Eternal September does not accept new registrations?
    I think, if it is true, Ray Banana will make excception for Mitch if
    asked personally.


    www.usenetserver.com is priced reasonably. I've been using them
    for well over a decade now.


    I have been happy with http://news.individual.net/. 10 euro/year.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 19:07:17 2025
    From Newsgroup: comp.arch

    On Fri, 8 Aug 2025 13:20:33 +0300, Michael S
    <already5chosen@yahoo.com> wrote:


    Eternal September does not accept new registrations?
    I think, if it is true, Ray Banana will make excception for Mitch if
    asked personally.

    Eternal September still accepts new users. What they don't support is
    shrouding your email address other than using ".invalid" as the
    domain. It's trivially easy to figure out the real addresses, so for
    users who care about address hiding, ES would not be a good choice.

    However, Mitch has tended to use his own name for his addresses in the
    past, so I doubt he cares much about hiding. ES is free ($0), so
    somebody who can reach him ought to mention it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Aug 10 19:55:01 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    Fundamentally, 36-bit words ended up being a dead-end.

    Interesting document. It added half-hearted 9 bit byte addressing to the PDP-10, intended for COBOL string processing and decimal arithmetic.

    Except that the PDP-10's existing byte instructions let you use any byte
    size you wanted, and everyone used 7-bit bytes for ASCII strings. It would have been straightforward but very tedious to add 9 bit byte strings to
    the COBOL compiler since they'd need ways to say which text data was in
    which format and convert as needed. Who knows what they'd have done for
    data files.
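
    As a rough C model (encoding details simplified) of what those byte instructions provide: the byte pointer names a position and size, and an LDB-style load pulls an S-bit byte sitting P bits from the right end of a 36-bit word; with S=7 you get the usual five ASCII characters per word, with S=9 you get four:

    #include <stdint.h>

    #define WORDBITS 36u

    /* LDB-style extract: an S-bit byte located P bits from the right end
       of a 36-bit word (P = number of bits to the right of the byte). */
    static uint64_t ldb(uint64_t word, unsigned p, unsigned s)
    {
        word &= (UINT64_C(1) << WORDBITS) - 1;        /* keep the 36 valid bits */
        return (word >> p) & ((UINT64_C(1) << s) - 1);
    }
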

    36 bit word machines had a good run starting in the mid 1950s but once
    S/360 came out with 8 bit bytes and power of two addressing for larger
    data, all of the other addressing models were doomed.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Tue Aug 12 15:28:27 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    MAP_32BIT is only used on x86-64 on Linux, and was originally
    a performance hack for allocating thread stacks: apparently, it
    was cheaper to do a thread switch with a stack below the 4GiB
    barrier (sign extension artifact maybe? Who knows...). But it's
    no longer required for that. But there's no indication that it
    was for supporting ILP32 on a 64-bit system.

    Reading up about x32, it requires quite a bit more than just
    allocating everything in the low 2GB.

    My memories (from reading about it, I never compiled a program for
    that usage myself) are that on Digital OSF/1, the corresponding usage
    did just that: Configure the compiler for ILP32, and allocate all
    memory in the low 2GB. I expect that types such as off_t would be
    defined appropriately, and any pointers in library-defined structures
    (e.g., FILE from <stdio.h>) consumed 8 bytes, even though the ILP32
    code only accessed the bottom 4. Or maybe they had compiled the
    library also for ILP32. In those days fewer shared libraries were in
    play, and the number of system calls and their interface complexity in
    OSF/1 was probably closer to Unix v6 or so than to Linux today (or in
    2012, when x32 was introduced), so all of that required a lot less
    work.
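
    For reference, the MAP_32BIT flag mentioned in the quoted text still exists on x86-64 Linux; a small example (nothing OSF/1-specific) of asking for the kind of low placement an ILP32-on-64-bit environment needs:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        /* MAP_32BIT asks for a mapping in the low 2GB of the address space. */
        void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("mapped at %p\n", p);   /* expect an address below 2GB */
        return 0;
    }
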

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Tue Aug 12 16:08:58 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    MAP_32BIT is only used on x86-64 on Linux, and was originally
    a performance hack for allocating thread stacks: apparently, it
    was cheaper to do a thread switch with a stack below the 4GiB
    barrier (sign extension artifact maybe? Who knows...). But it's
    no longer required for that. But there's no indication that it
    was for supporting ILP32 on a 64-bit system.

    Reading up about x32, it requires quite a bit more than just
    allocating everything in the low 2GB.

    The primary issue on x86 was with the API definitions. Several
    legacy API declarations used signed integers (int) for
    address parameters. This limited addresses to 2GB on
    a 32-bit system.

    https://en.wikipedia.org/wiki/Large-file_support

    The Large File Summit (I was one of the Unisys reps at the LFS)
    specified a standard way to support files larger than 2GB
    on 32-bit systems that used signed integers for file offsets
    and file size.

    Also, https://en.wikipedia.org/wiki/2_GB_limit
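
    A small illustration of the usage the LFS standardized (the file name here is made up): build with 64-bit file offsets and keep offsets in off_t, so files past 2GB work even on a 32-bit system:

    #define _FILE_OFFSET_BITS 64   /* LFS: make off_t 64-bit on 32-bit systems */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("big.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t end = lseek(fd, 0, SEEK_END);    /* may exceed 2^31 - 1 */
        if (end == (off_t)-1) { perror("lseek"); return 1; }

        printf("file size: %lld bytes\n", (long long)end);
        close(fd);
        return 0;
    }
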

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Tue Aug 12 11:53:37 2025
    From Newsgroup: comp.arch

    On 8/12/2025 11:08 AM, Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    MAP_32BIT is only used on x86-64 on Linux, and was originally
    a performance hack for allocating thread stacks: apparently, it
    was cheaper to do a thread switch with a stack below the 4GiB
    barrier (sign extension artifact maybe? Who knows...). But it's
    no longer required for that. But there's no indication that it
    was for supporting ILP32 on a 64-bit system.

    Reading up about x32, it requires quite a bit more than just
    allocating everything in the low 2GB.

    The primary issue on x86 was with the API definitions. Several
    legacy API declarations used signed integers (int) for
    address parameters. This limited addresses to 2GB on
    a 32-bit system.

    https://en.wikipedia.org/wiki/Large-file_support

    The Large File Summit (I was one of the Unisys reps at the LFS)
    specified a standard way to support files larger than 2GB
    on 32-bit systems that used signed integers for file offsets
    and file size.

    Also, https://en.wikipedia.org/wiki/2_GB_limit


    Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    Though, admittedly, I have not messed with it much personally...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From aph@aph@littlepinkcloud.invalid to comp.arch,alt.folklore.computers on Tue Aug 12 17:57:20 2025
    From Newsgroup: comp.arch

    In comp.arch BGB <cr88192@gmail.com> wrote:

    Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    We have done something similar for years at Red Hat: not X32, but
    x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
    (which we were), all you have to do is copy all 32-bit libraries from
    one repo to the other.

    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the
    effort.

    Andrew.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Tue Aug 12 19:09:27 2025
    From Newsgroup: comp.arch

    According to <aph@littlepinkcloud.invalid>:
    In comp.arch BGB <cr88192@gmail.com> wrote:

    Also, IIRC, the major point of X32 was that it would narrow pointers and
    similar back down to 32 bits, requiring special versions of any shared
    libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    We have done something similar for years at Red Hat: not X32, but
    x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
    (which we were) all you have to do is copy all 32-bit libraries from
    one one repo to the other.

    FreeBSD does the same thing. The 32 bit libraries are installed by default
    on 64 bit systems because, by current standards, they're not very big.

    I've stopped installing them because I know I don't have any 32 bit apps
    left, but on systems with old packages, who knows?
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 06:11:02 2025
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid writes:
    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the
    effort.

    Alpha: On Digital OSF/1 the advantage was to be able to run programs
    that work on ILP32, but not I32LP64.

    x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
    and unmaintained ones did not get an x32 port anyway. And if there
    are cases where my expectations do not hold, there still is i386. The
    only advantage of x32 was a speed advantage on select programs.
    That's apparently not enough to gain a critical mass of x32 programs.

    Aarch64-ILP32: My guess is that the situation is very similar to the
    x32 situation. Admittedly, there are CPUs without ARM A32/T32
    support, but if there was any significant program for these CPUs that
    does not work with I32LP64, the manufacturer would have chosen to
    include the A32/T32 option. Given that the situation is the same as
    for x32, the result is the same: What I find about it are discussions
    about deprecation and removal <https://www.phoronix.com/news/GCC-Deprecates-ARM64-ILP32>.

    Concerning performance, <https://static.linaro.org/connect/bkk16/Presentations/Wednesday/BKK16-305B.pdf>
    shows SPECint 2006 benchmarks on two unnamed platforms. Out of 12
    benchmark programs, ILP32 shows a speedup by a factor ~1.55 on
    429.mcf, ~1.2 on 471.omnetpp, ~1.1 on 483.xalancbmk, ~1.05 on 403.gcc,
    and ~0.95 (i.e., slowdowns) on 401.bzip2, 456.hmmer, 458.sjeng.

    That slide deck concludes with:

    |Do We Care? Enough?
    |
    |A lot of code to maintain for little gain.

    Apparently the answer to these questions is no.

    Followups to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 07:32:28 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    That may be the interface of the C system call wrapper, along with
    errno, but at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those architectures where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.

    Let's look at what the system call wrappers do on RV64G(C) (which has
    no carry flag). For read(2) the wrapper contains:

    0x3ff7f173be <read+20>: ecall
    0x3ff7f173c2 <read+24>: lui a5,0xfffff
    0x3ff7f173c4 <read+26>: mv s0,a0
    0x3ff7f173c6 <read+28>: bltu a5,a0,0x3ff7f1740e <read+100>

    For dup(2) the wrapper contains:

    0x3ff7e7fe9a <dup+2>: ecall
    0x3ff7e7fe9e <dup+6>: lui a7,0xfffff
    0x3ff7e7fea0 <dup+8>: bltu a7,a0,0x3ff7e7fea6 <dup+14>

    and for mmap(2):

    0x3ff7e86b6e <mmap64+12>: ecall
    0x3ff7e86b72 <mmap64+16>: lui a5,0xfffff
    0x3ff7e86b74 <mmap64+18>: bltu a5,a0,0x3ff7e86b8c <mmap64+42>

    So instead of checking the sign flag, on RV64G the wrapper checks
    whether the result is >0xfffffffffffff000 (the lui result is
    sign-extended on RV64, so only the last 4095 values are reserved for
    errors). This costs one instruction more than just checking the sign
    flag, and almost doubles the number of bytes read(2) can read in one
    call, the number of file ids that can be returned by dup(2), and the
    address range returnable by mmap(2). Will we ever see processes that
    need more than 8EB? Maybe not, but the designers of the RV64G(C) ABI
    obviously did not want to be the ones that are quoted as saying "8EB
    should be enough for anyone" :-).

    Followups to comp.arch

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 08:22:17 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    To be efficient, a RISC needs a full-width (presumably 32 bit)
    external data bus, plus a separate address bus, which should at
    least be 26 bits, better 32. A random ARM CPU I looked at at
    bitsavers had 84 pins, which sounds reasonable.

    Building an ARM-like instead of a 68000 would have been feasible,
    but the resulting systems would have been more expensive (the
    68000 had 64 pins).

    One could have done a RISC-VAX microprocessor with 16-bit data bus and
    24-bit address bus, like the 68000, or even an 8-bit data bus, and
    without FPU and MMU and without PDP-11 decoder. The performance would
    have been memory-bandwidth-limited and therefore similar to the 68000
    and 68008, respectively (unless extra love was spent on the memory
    interface, e.g., with row optimization), with a few memory accesses
    saved by having more registers. This would still have made sense in a
    world where the same architecture was available (with better
    performance) on the supermini of the day, the RISC-VAX: Write your
    code on the cheap micro RISC-VAX and this will give you the
    performance advantages in a few years when proper 32-bit computing
    arrives (or on more expensive systems today).

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move
    towards workstations within the same company (which would be HARD).

    For workstations one would need the MMU and the FPU as extra chips.

    Getting a company to avoid trying to milk the cash cow for longer
    (short-term profits) by burying in-company progress (that other
    companies then make, i.e., long-term loss) may be hard, but given that
    some companies have survived, it's obviously possible.

    HP seems to have avoided the problem at various stages: They had their
    own HP3000 and HP9000/500 architectures, but found ways to drop that
    for HPPA without losing too many customers, then they dropped HPPA for
    IA-64, and IA-64 for AMD64, and they still survive. They also managed
    to become one of the biggest PC makers, but found it necessary to
    split the PC and big-machine businesses into two companies.

    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could
    have done to the PC market, if IBM could have been prevented
    from gaining such market dominance.

    The IBM PC success was based on the open architecture, on being more
    advanced than the Apple II and not too expensive, and the IBM name
    certainly helped at the start. In the long run it was an Intel and
    Microsoft success, not an IBM success. And Intel's 8086 success was
    initially helped by being able to port 8080 programs (with 8080->8086 assemblers).

    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    An alternative would be to sell it as a faster and better upgrade path
    for the 8088 later, as competition to the 80286. Have a RISC-VAX
    (without MMU and FPU) with an additional 8086 decoder for running
    legacy programs (should be possible in the 134,000 transistors that the
    80286 has): Users could run their existing code, as well as
    future-oriented (actually present-oriented) 32-bit code. The next
    step would be adding the TLB for paging.

    Concerning on how to do it from the business side: The microprocessor
    business (at least, maybe more) should probably be spun off as an
    independent company, such that customers would not need to worry about
    being at a disadvantage compared to DEC in-house demands.

    One can also imagine other ways: Instead of the reduced-RISC-VAX, try
    to get a PDP-11 variant with 8-bit data bus into the actual IBM PC
    (instead of the 8088), or set up your own PC business based on such a processor; and then the logical upgrade path would be to the successor
    of the PDP-11, the RISC-VAX (with PDP-11 decoder).

    What about the fears of the majority in the company working on big
    computers? They would continue to make big computers, with initially
    faster and later more CPUs than PCs. That's what we are seeing today.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 09:37:27 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move towards
    workstations within the same company (which would be HARD).

    None of the companies which tried to move in that direction were
    successful. The mass micro market had much higher volumes and lower
    margins, and those accustomed to lower-volume, higher-margin operation simply couldn’t adapt.

    At least some of the Nova-based microprocessors were relatively cheap,
    and still did not succeed. I think that the essential parts of the
    success of the 8088 were:

    * Offered 1MB of address space. In a cumbersome way, but still; and
    AFAIK less cumbersome than what you would do on a mini or Apple III.
    Intel's architects did not understand that themselves, as shown by
    the 80286, which offered decent support for multiple processes, each
    with 64KB address space. Users actually preferred single-tasking of
    programs that can access more than 64KB easily to multitasking of
    64KB (or 64KB+64KB) processes.

    * Cheap to design computers for, in particular the 8-bit bus and small
    package.

    * Support for porting 8080 assembly code to the 8086 architecture.
    That was not needed for long, but it provided a boost in available
    software at a critical time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 14:24:48 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    aph@littlepinkcloud.invalid writes:
    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the effort.

    Alpha: On Digital OSF/1 the advantage was to be able to run programs
    that work on ILP32, but not I32LP64.

    I understand what you're saying here, but disagree. A program that
    works on ILP32 but not I32LP64 is fundamentally broken, IMHO.


    x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
    and unmaintained ones did not get an x32 port anyway. And if there
    are cases where my expectations do not hold, there still is i386. The
    only advantage of x32 was a speed advantage on select programs.

    I suspect that the performance advantage was minimal; the primary advantage would have been that existing applications didn't need to be rebuilt
    and requalified.

    That's apparently not enough to gain a critical mass of x32 programs.

    Aarch64-ILP32: My guess is that the situation is very similar to the
    x32 situation.

    In the early days of AArch64 (2013), we actually built a toolchain to support Aarch64-ILP32. Not a single customer exhibited _any_ interest in that
    and the project was dropped.

    Admittedly, there are CPUs without ARM A32/T32

    Very few AArch64 designs included AArch32 support; even the Cortex
    chips supported it only at exception level zero (user mode), not
    at the other exception levels. The latest Neoverse chips have,
    for the most part, dropped AArch32 completely, even at EL0.

    The markets for AArch64 (servers, high-end appliances) didn't have
    a huge existing reservoir of 32-bit ARM applications, so there was
    no demand to support them.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 13 14:26:18 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    To be efficient, a RISC needs a full-width (presumably 32 bit)
    external data bus, plus a separate address bus, which should at
    least be 26 bits, better 32. A random ARM CPU I looked at at
    bitsavers had 84 pins, which sounds reasonable.

    Building an ARM-like instead of a 68000 would have been feasible,
    but the resulting systems would have been more expensive (the
    68000 had 64 pins).

    One could have done a RISC-VAX microprocessor with 16-bit data bus and
    24-bit address bus.

    LSI11?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 13 14:44:29 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 15:03:08 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    That may be the interface of the C system call wrapper,

    It _is_ the interface that the programmers need to be
    concerned with when using POSIX C language bindings.

    Other language bindings offer alternative mechanisms.


    errno, but at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those architectures where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed? The
    return value from the kernel should be passed through to
    the application as per the POSIX language binding requirements.

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
    to the language library (e.g. libc) is irrelevant to typical programmers, except in the case where it doesn't provide the correct semantics.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 13 16:10:10 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    That may be the interface of the C system call wrapper,

    It _is_ the interface that the programmers need to be
    concerted with when using POSIX C language bindings.

    True, but not relevant for the question at hand.

    at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those architectures where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed?

    The actual system call returns an error flag and a register. On some architectures, they support just a register. If there is no error,
    the wrapper returns the content of the register. If the system call
    indicates an error, you see from the value of the register which error
    it is; the wrapper then typically transforms the register in some way
    (e.g., by negating it) and stores the result in errno, and returns -1.

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
    to the language library (e.g. libc) is irrelevent to typical programmers

    Sure, but system calls are first introduced in real kernels using the
    actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures. And when you look
    closely, you find how the system calls are design to support returning
    the error indication, success value, and errno in one register.

    lseek64 on 32-bit platforms is an exception (the success value does
    not fit in one register), and looking at the machine code of the
    wrapper and comparing it with the machine code for the lseek wrapper,
    some funny things are going on, but I would have to look at the source
    code to understand what is going on. One other interesting thing I
    noticed is that the system call wrappers from libc-2.36 on i386 now
    draw the boundary between success returns and error returns at
    0xfffff000:

    0xf7d853c4 <lseek+68>: call *%gs:0x10
    0xf7d853cb <lseek+75>: cmp $0xfffff000,%eax
    0xf7d853d0 <lseek+80>: ja 0xf7d85410 <lseek+144>

    So now the kernel can produce 4095 error values, and the rest can be
    success values. In particular, mmap() can return all possible page
    addresses as success values with these wrappers. When I last looked
    at how system calls are done, I found just a check of the N or the C
    flag. I wonder how the kernel is informed that it can now return more addresses from mmap().

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 17:46:59 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100

    That's completely different from what I suggest above, and DEC
    obviously did not capture the PC market with that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 17:50:35 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Building an ARM-like instead of a 68000 would have been feasible,
    but the resulting systems would have been more expensive (the
    68000 had 64 pins).

    One could have done a RISC-VAX microprocessor with 16-bit data bus and 24-bit address bus.

    LSI11?

    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a
    total of 160 pins; and it supported only 16 address bits without extra
    chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 18:15:23 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally
    broken.

    That may be the interface of the C system call wrapper,

    It _is_ the interface that the programmers need to be
    concerted with when using POSIX C language bindings.

    True, but not relevant for the question at hand.

    at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those architectures where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed?

    The actual system call returns an error flag and a register. On some architectures, they support just a register. If there is no error,
    the wrapper returns the content of the register. If the system call indicates an error, you see from the value of the register which error
    it is; the wrapper then typically transforms the register in some way
    (e.g., by negating it) and stores the result in errno, and returns -1.

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    Which was addressed by the LFS (Large File Summit), to support
    files > 2GB in size.

    There is also the degenerate case of open("/dev/mem"...) which
    requires lseek support over the entire physical address space
    and /dev/kmem which supports access to the kernel virtual memory
    address space, which on most systems has the high-order bit
    in the address set to one. Personally, I've used pread/pwrite
    in those cases (once 1003.4 was merged) rather than lseek/read
    and lseek/write.
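
    A short example of the pread approach (the device path and offset are only illustrative): the offset goes in as an argument, so nothing ever has to come back through lseek's signed return value:

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint32_t word;
        off_t phys = (off_t)0xfff00000;            /* illustrative offset */
        if (pread(fd, &word, sizeof word, phys) != (ssize_t)sizeof word) {
            perror("pread");
            return 1;
        }
        printf("read 0x%08x\n", (unsigned)word);
        close(fd);
        return 0;
    }
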



    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    Aside from mmap-ing /dev/mem or /dev/kmem,
    one must also consider the use of MAP_FIXED, when supported,
    where the kernel doesn't choose the mapped address (although
    it is allowed to refuse to map certain ranges).

    The return value for mmap is 'void *'. The only special value
    for mmap(2) is MAP_FAILED (which is the unsigned equivalent of -1)
    which implies that a one-byte mapping at the end of the address
    space isn't supported.

    All that said, my initial point about -1 was that applications
    should always check for -1 (or MAP_FAILED), not for return
    values less than zero. The actual kernel interface to the
    C library is clearly implementation dependent although it
    must preserve the user-visible required semantics.
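
    A tiny example of the check being described (MAP_FAILED compared exactly; a signed "< 0" test on the returned pointer would wrongly reject legitimate mappings in the upper half of the address space):

    #define _DEFAULT_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {        /* the only documented failure value */
            perror("mmap");
            return 1;
        }
        munmap(p, 4096);
        return 0;
    }
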
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From ted@loft.tnolan.com (Ted Nolan@tednolan to comp.arch,alt.folklore.computers on Wed Aug 13 18:26:44 2025
    From Newsgroup: comp.arch

    In article <2025Aug13.194659@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100

    That's completely different from what I suggest above, and DEC
    obviously did not capture the PC market with that.


    They did manage to crack the college market somewhat, where CS departments
    had DEC hardware anyway. I know USC (original) had a Rainbow computer
    lab circa 1985. That "in" didn't translate to anything else though.
    --
    columbiaclosings.com
    What's not in Columbia anymore..
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 13 18:13:35 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    aph@littlepinkcloud.invalid writes:
    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the effort.

    Alpha: On Digital OSF/1 the advantage was to be able to run programs
    that work on ILP32, but not I32LP64.

    I understand what you're saying here, but disagree. A program that
    works on ILP32 but not I32LP64 is fundamentally broken, IMHO.

    In 1992 most C programs worked on ILP32, but not on I32LP64. That's
    because Digital OSF/1 was the first I32LP64 platform, and it only
    appeared in 1992. ILP32 support was a way to increase the amount of
    available software.

    x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
    and unmaintained ones did not get an x32 port anyway. And if there
    are cases where my expectations do not hold, there still is i386. The
    only advantage of x32 was a speed advantage on select programs.

    I suspect that performance advantage was minimal, the primary advantage would have been that existing applications didn't need to be rebuilt
    and requalified.

    You certainly have to rebuild for x32. It's a new ABI.

    Aarch64-ILP32: My guess is that the situation is very similar to the
    x32 situation.

    In the early days of AArch64 (2013), we actually built a toolchain to support Aarch64-ILP32. Not a single customer exhibited _any_ interest in that
    and the project was dropped.

    Admittedly, there are CPUs without ARM A32/T32

    Very few AArch64 designs included AArch32 support

    If by Aarch32 you mean what ARM now calls the A32 and T32 instruction
    sets (their constant renamings are confusing, but the A64/A32/T32
    naming makes more sense than earlier ones), every ARMv8 core I use
    (A53, A55, A72, A73, A76) includes A32 and T32 support.


    even the Cortex
    chips supported it only at exception level zero (user mode)

    When you run user-mode software, that's what's important. Only kernel developers care about which instruction set kernel mode supports.

    The markets for AArch64 (servers, high-end appliances) didn't have
    a huge existing reservoir of 32-bit ARM applications, so there was
    no demand to support them.

    Actually there is a huge market for CPUs with ARM A32/T32 ISA
    (earlier) and ARM A64 ISA (now): smartphones and tablets. Apparently
    this market has mechanisms that remove software after relatively few
    years and the customers accept it. So the appearance of cores without
    A32/T32 support indicates that the software compiled to A32/T32 has
    been mostly eliminated. Smartphone SoCs typically still contain some
    cores that support A32/T32 (at least last time I read about them), but
    others don't. It's interesting to see which cores support A32/T32 and
    which don't.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 18:51:15 2025
    From Newsgroup: comp.arch

    In article <MO1nQ.2$Bui1.0@fx10.iad>, Scott Lurndal <slp53@pacbell.net> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    [snip]
    errno, but at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those architectures where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed? The
    return value from the kernel should be passed through to
    the application as per the POSIX language binding requirements.

    For the branch to `cerror`. That is, the usual reason is (was?)
    to convert from the system call interface to the C ABI,
    specifically, to populate the (userspace, now thread-local)
    `errno` variable if there was an error. (I know you know this,
    Scott, but others reading the discussion may not.)

    Looking at the 32v code for VAX and 7th Edition on the PDP-11,
    on error the kernel returns a non-zero value and sets the carry
    bit in the PSW. The stub checks whether the C bit is set, and
    if so, copies R0 to `errno` and then sets R0 to -1. On the
    PDP-11, `as` supports the non-standard "bec" mnemonic as an
    alias for "bcc" and the stub is actually something like:

    / Do sys call....land in the kernel `trap` in m40.s
    bec 1f
    jmp cerror
1:
    rts pc

    cerror:
    mov r0, _errno
    mov $-1, r0
    rts pc

In other words, if the carry bit is not set, the system call
    was successful, so just return whatever it returned. Otherwise,
    the kernel is returning an error to the user, so do the dance of
    setting up `errno` and returning -1.

(There are some fiddly bits with popping R5, which Unix used as
the frame pointer, but I omitted those for brevity.)
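
For readers who prefer C to PDP-11 assembly, here is a rough sketch of
the same convention, assuming a hypothetical raw_syscall() primitive
that hands back both the result register and the state of the carry
flag (no such C-level interface exists; the real stub is necessarily
assembly):

#include <errno.h>   /* historically errno was just an extern int, _errno */

/* Hypothetical primitive standing in for the trap instruction: it
   returns the value left in R0 and whether the kernel set the C bit. */
struct raw_result { long r0; int carry; };
extern struct raw_result raw_syscall(int number);

long syscall_stub(int number)
{
    struct raw_result r = raw_syscall(number);
    if (r.carry) {          /* kernel signalled an error */
        errno = (int)r.r0;  /* R0 holds the error code */
        return -1;          /* C-level convention: -1 on failure */
    }
    return r.r0;            /* success: pass the value through */
}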

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

At least for lseek, that was true in the 1990 POSIX standard,
where the programmer was expected to (maybe save and then) clear
`errno`, invoke `lseek`, and then check the value of `errno`
after return to see if there was an error, but this has been relaxed
in subsequent editions (including POSIX 2024), where `lseek` now
must fail with `EINVAL` if the resulting offset would be negative for a regular
    file, directory, or block-special file. (https://pubs.opengroup.org/onlinepubs/9799919799/functions/lseek.html;
    see "ERRORS")

    For mmap, at least the only documented error return value is
    `MAP_FAILED`, and programmers must check for that explicitly.

    It strikes me that this implies that the _value_ of `MAP_FAILED`
    need not be -1; on x86_64, for instance, it _could_ be any
    non-canonical address.
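
In code, the only portable check looks like this (the helper name is
purely illustrative; MAP_ANONYMOUS is a common extension rather than
base POSIX):

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

void *map_one_page(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {      /* the one documented error value */
        perror("mmap");
        return NULL;
    }
    return p;                   /* any other value is a valid mapping */
}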

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers,
except in the case where it doesn't provide the correct semantics.

    Certainly, these are hidden by the system call stubs in the
    libraries for language-specific bindings, and workaday
    programmers should not be trying to side-step those!

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 19:25:31 2025
    From Newsgroup: comp.arch

    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally
    broken.

    That may be the interface of the C system call wrapper,

It _is_ the interface that the programmers need to be
concerned with when using POSIX C language bindings.

    True, but not relevant for the question at hand.

    at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those
architectures, where the sign is used, mmap(2) cannot return negative
addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed?

The actual system call returns an error flag and a register. On some
architectures, only a register is used. If there is no error,
the wrapper returns the content of the register. If the system call
indicates an error, you see from the value of the register which error
    it is; the wrapper then typically transforms the register in some way
    (e.g., by negating it) and stores the result in errno, and returns -1.
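
A rough C rendering of that, with a hypothetical raw_syscall() standing
in for the trap itself; note that the naive negative-value test shown
here is precisely what gets mmap() into trouble once valid addresses
have the sign bit set:

#include <errno.h>

extern long raw_syscall(long nr, long a, long b, long c);  /* hypothetical */

long wrapper(long nr, long a, long b, long c)
{
    long ret = raw_syscall(nr, a, b, c);
    /* Naive check: treat any negative register value as an error.
       Real wrappers instead test for the narrow range -4095..-1 so
       that large success values (e.g. high mmap addresses) are not
       misread as errors. */
    if (ret < 0) {
        errno = (int)-ret;      /* kernel returned -errno */
        return -1;
    }
    return ret;
}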

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset could
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however. But, POSIX 2024
    (still!!) supports multiple definitions of `off_t` for multiple
    environments, in which overflow is potentially unavoidable.
    This leads to considerable complexity in implementations that
    try to support such multiple environments in their ABI (for
instance, for backwards compatibility with old programs).
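
One familiar instance of that complexity, assuming a 32-bit glibc
target (the feature-test macro is glibc's, not POSIX's, and the helper
is just for illustration): the same source can be built against either
a 32-bit or a 64-bit off_t.

/* Must come before any system header is included. */
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <unistd.h>

/* With the define above, off_t is 64 bits and lseek() is mapped to the
   64-bit implementation; without it, off_t may be 32 bits and offsets
   that do not fit are reported via EOVERFLOW. */
off_t end_of_file(int fd)
{
    return lseek(fd, 0, SEEK_END);
}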

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    The point is that the programmer shouldn't have to care. The
    programmer should check the return value against MAP_FAILED, and
    if it is NOT that value, then the returned address may be
    assumed valid. If such an address is not actually valid, that
    indicates a bug in the implementation of `mmap`.

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers

    Sure, but system calls are first introduced in real kernels using the
    actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures.

    Not precisely. On x86_64, for example, some Unixes use a flag
    bit to determine whether the system call failed, and return
    (positive) errno values; Linux returns negative numbers to
    indicate errors, and constrains those to values between -4095
    and -1.

    Presumably that specific set of values is constrained by `mmap`:
    assuming a minimum 4KiB page size, the last architecturally
    valid address where a page _could_ be mapped is equivalent to
    -4096 and the first is 0. If they did not have that constraint,
    they'd have to treat `mmap` specially in the system call path.

    Linux _could_ decide to define `MAP_FAILED` as
    0x0fff_ffff_0000_0000, which is non-canonical on all extant
    versions of x86-64, even with 5-level paging, but maybe they do
    not because they're anticipating 6-level paging showing up at
    some point.

    And when you look
closely, you find how the system calls are designed to support returning
    the error indication, success value, and errno in one register.

    lseek64 on 32-bit platforms is an exception (the success value does
    not fit in one register), and looking at the machine code of the
    wrapper and comparing it with the machine code for the lseek wrapper,
    some funny things are going on, but I would have to look at the source
    code to understand what is going on. One other interesting thing I
    noticed is that the system call wrappers from libc-2.36 on i386 now
draw the boundary between success returns and error returns at
    0xfffff000:

    0xf7d853c4 <lseek+68>: call *%gs:0x10
    0xf7d853cb <lseek+75>: cmp $0xfffff000,%eax
    0xf7d853d0 <lseek+80>: ja 0xf7d85410 <lseek+144>

    So now the kernel can produce 4095 error values, and the rest can be
    success values. In particular, mmap() can return all possible page
    addresses as success values with these wrappers. When I last looked
    at how system calls are done, I found just a check of the N or the C
    flag.

    Yes; see above.

    I wonder how the kernel is informed that it can now return more
    addresses from mmap().

    Assuming you mean the Linux kernel, when it loads an ELF
    executable, the binary image itself is "branded" with an ABI
    type that it can use to make that determination.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 19:40:17 2025
    From Newsgroup: comp.arch

    In article <%C4nQ.6540$CQJe.2438@fx14.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    all that said, my initial point about -1 was that applications
    should always check for -1 (or MAP_FAILED), not for return
    values less than zero. The actual kernel interface to the
    C library is clearly implementation dependent although it
    must preserve the user-visible required semantics.

    For some reason, I have a vague memory of reading somewhere that
    it was considered "more robust" to check for a negative return
    value, and not just -1 specifically. Perhaps this was just
    superstition, or perhaps someone had been bit by an overly
    permissive environment. It certainly seems like advice that we
    can safely discard at this point.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 20:28:13 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <MO1nQ.2$Bui1.0@fx10.iad>, Scott Lurndal <slp53@pacbell.net> wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:

    For mmap, at least the only documented error return value is
    `MAP_FAILED`, and programmers must check for that explicitly.

    It strikes me that this implies that the _value_ of `MAP_FAILED`
    need not be -1; on x86_64, for instance, it _could_ be any
    non-canonical address.

    And in the very unlikely case that a C compiler was developed
    for the Burroughs B4900, MAP_FAILED could be 0xC0EEEEEE (which
    is how the NULL pointer was encoded in the hardware). Because
    all the data was BCD, undigits (a-f) in an address were
    unconditionally illegal.

    There were instructions to search linked lists, so the hardware
    needed to understand the concept of a NULL pointer (as well
    as deal with the possibility of a loop, using a timer while
    the search instruction was executing).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 13 21:23:34 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset could
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day. Nowadays the exotic hardware where this would
    not work that way has almost completely died out (and C is not used on
    the remaining exotic hardware), but now compilers sometimes do funny
    things on integer overflow, so better don't go there or anywhere near
    it.

    But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple
environments, in which overflow is potentially unavoidable.
    environments, in which overflow is potentially unavoidable.

    POSIX also has the EOVERFLOW error for exactly that case.

    Bottom line: The off_t returned by lseek(2) is signed and always
    positive.

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.

    The point is that the programmer shouldn't have to care.

    True, but completely misses the point.

Sure, but system calls are first introduced in real kernels using the
actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures.

    Not precisely. On x86_64, for example, some Unixes use a flag
    bit to determine whether the system call failed, and return
    (positive) errno values; Linux returns negative numbers to
    indicate errors, and constrains those to values between -4095
    and -1.

    Presumably that specific set of values is constrained by `mmap`:
    assuming a minimum 4KiB page size, the last architecturally
    valid address where a page _could_ be mapped is equivalent to
    -4096 and the first is 0. If they did not have that constraint,
    they'd have to treat `mmap` specially in the system call path.

    I am pretty sure that in the old times, Linux-i386 indicated failure
    by returning a value with the MSB set, and the wrapper just checked
    whether the return value was negative. And for mmap() that worked
because user-mode addresses were all below 2GB. Addresses further up
were reserved for the kernel.

    I wonder how the kernel is informed that it can now return more
    addresses from mmap().

    Assuming you mean the Linux kernel, when it loads an ELF
    executable, the binary image itself is "branded" with an ABI
    type that it can use to make that determination.

    I have checked that with binaries compiled in 2003 and 2000:

-rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
-rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*

    [~:160080] file /usr/local/bin/gforth-0.5.0
    /usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
    [~:160081] file /usr/local/bin/gforth-0.6.2
    /usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.0.0, stripped

    So there is actually a difference between these two. However, if I
    just strace them as they are now, they both happily produce very high
    addresses with mmap, e.g.,

    mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000

    I don't know what the difference is between "for GNU/Linux 2.0.0" and
    not having that, but the addresses produced by mmap() seem unaffected.

    However, by calling the binaries with setarch -L, mmap() returns only
    addresses < 2GB in all calls I have looked at. I guess if I had
    statically linked binaries, i.e., with old system call wrappers, I
    would have to use

    setarch -L <binary>

    to make it work properly with mmap(). Or maybe Linux is smart enough
    to do it by itself when it encounters a statically-linked old binary.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 14 07:58:41 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I am pretty sure that in the old times, Linux-i386 indicated failure
    by returning a value with the MSB set, and the wrapper just checked
    whether the return value was negative.

    I have now checked this by chrooting into an old Red Hat 6.2 system
    (not RHEL) with glibc-2.1.3 (released in Feb 2000) and its system call wrappers. And already those wrappers use the current way of
    determining whether a system call returns an error or not:

    For mmap():

    0xf7fd984b <__mmap+11>: int $0x80
    0xf7fd984d <__mmap+13>: mov %edx,%ebx
    0xf7fd984f <__mmap+15>: cmp $0xfffff000,%eax
    0xf7fd9854 <__mmap+20>: ja 0xf7fd9857 <__mmap+23>

    Bottom line: If Linux-i386 ever had a different way of determining
    whether a system call has an error result, it was changed to the
    current way early on. Given that IIRC I looked into that later than
    in 2000, my memory is obviously not of Linux. I must have looked at
    source code for a different system.

    Actually, the whole wrapper is short enough to easily understand what
    is going on:

    0xf7fd9840 <__mmap>: mov %ebx,%edx
    0xf7fd9842 <__mmap+2>: mov $0x5a,%eax
    0xf7fd9847 <__mmap+7>: lea 0x4(%esp,1),%ebx
    0xf7fd984b <__mmap+11>: int $0x80
    0xf7fd984d <__mmap+13>: mov %edx,%ebx
    0xf7fd984f <__mmap+15>: cmp $0xfffff000,%eax
    0xf7fd9854 <__mmap+20>: ja 0xf7fd9857 <__mmap+23>
    0xf7fd9856 <__mmap+22>: ret
    0xf7fd9857 <__mmap+23>: push %ebx
    0xf7fd9858 <__mmap+24>: call 0xf7fd985d <__mmap+29>
    0xf7fd985d <__mmap+29>: pop %ebx
    0xf7fd985e <__mmap+30>: xor %edx,%edx
    0xf7fd9860 <__mmap+32>: add $0x400b,%ebx
    0xf7fd9866 <__mmap+38>: sub %eax,%edx
    0xf7fd9868 <__mmap+40>: push %edx
    0xf7fd9869 <__mmap+41>: call 0xf7fd7f80 <__errno_location>
    0xf7fd986e <__mmap+46>: pop %ecx
    0xf7fd986f <__mmap+47>: pop %ebx
    0xf7fd9870 <__mmap+48>: mov %ecx,(%eax)
    0xf7fd9872 <__mmap+50>: or $0xffffffff,%eax
    0xf7fd9875 <__mmap+53>: jmp 0xf7fd9856 <__mmap+22>

One interesting difference from the current way of invoking a system
call: nowadays the wrapper loads the arguments from memory (the IA-32
ABI passes parameters on the stack) into registers and then performs
the system call in some newfangled way, whereas here (as far as I
understand the wrapper) the arguments are left in memory, and apparently
a pointer to the first argument is passed in %ebx; the system call is
invoked in the old way: int $0x80.
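
If I read the disassembly right, that matches the historical i386 "old
mmap" system call (number 90, the 0x5a loaded into %eax above), which
takes a single pointer to a block of six arguments rather than six
separate registers. A sketch, with illustrative names (the kernel's own
struct is, if memory serves, called mmap_arg_struct):

/* Six longs, in argument order, sitting in memory; %ebx points here. */
struct old_mmap_args {
    unsigned long addr;
    unsigned long len;
    unsigned long prot;
    unsigned long flags;
    unsigned long fd;
    unsigned long offset;
};

/* Hypothetical primitive: in effect %eax = 90, %ebx = args, int $0x80,
   result (or -errno) back in %eax, then the usual boundary check. */
extern long old_int80_mmap(struct old_mmap_args *args);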

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 14 13:28:31 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Bottom line: If Linux-i386 ever had a different way of determining
    whether a system call has an error result, it was changed to the
    current way early on. Given that IIRC I looked into that later than
    in 2000, my memory is obviously not of Linux. I must have looked at
    source code for a different system.

    I looked around and found
    <2016Sep18.100027@mips.complang.tuwien.ac.at>. I mentioned the Linux
    approach there, but apparently it did not stick in my memory. I
    linked to <http://stackoverflow.com/questions/36845866/history-of-using-negative-errno-values-in-gnu>,
    and there fuz writes:

    |Historically, system calls returned either a positive value (in case
    |of success) or a negative value indicating an error code. This has
    |been the case from the very beginning of UNIX as far as I'm concerned.

    and Steve Summit earlier writes essentially the same. But Lars
    Brinkhoff read my posting and contradicted Steve Summit and fuz, e.g.:

    |PDP-11 Unix V1 does not do this. When there's an error, the system
    |call sets the carry flag in the status register, and returns the error
    |code in register R0. On success, the carry flag is cleared, and R0
    |holds a return value. Unix V7 does the same.

    Why do I know he read my posting? Because he wrote a followup: <868tunivr0.fsf@molnjunk.nocrew.org>.

    In <2016Sep20.160042@mips.complang.tuwien.ac.at> I wrote:

    |Some Linux ports use a second register to indicate that there is an
    |error, and SPARC even uses the carry flag.

    So apparently I had looked at the source code of the C wrappers (or of
    the Linux kernel) at that point. I definitely remember finding this
    in some source code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 15:14:56 2025
    From Newsgroup: comp.arch

    In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset could
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day.

    This is simply not true. If anything, there was more variety of
    hardware supported by C90, and some of those systems were 1's
    complement or sign/mag, not 2's complement. Consequently,
    signed integer overflow has _always_ had undefined behavior in
    ANSI/ISO C.

    However, conversion from signed to unsigned has always been
    well-defined, and follows effectively 2's complement semantics.

    Conversion from unsigned to signed is a bit more complex, and is
    implementation defined, but not UB. Given that the system call
interface is necessarily deeply intertwined with the implementation
    I see no reason why the semantics of signed overflow should be
    an issue here.
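
A small illustration of the distinction, with the rules as I understand
them stated in the comments:

#include <limits.h>

void conversions(void)
{
    /* Signed -> unsigned: always well defined, reduced modulo 2^N. */
    unsigned int u = (unsigned int)-1;   /* UINT_MAX, guaranteed */

    /* Unsigned -> signed, value out of range: implementation-defined
       result (or an implementation-defined signal), but not UB.  On
       the usual two's-complement ABIs this yields -1. */
    int s = (int)UINT_MAX;

    /* Signed arithmetic overflow, by contrast, is UB, so nothing of
       the sort is attempted here. */
    (void)u; (void)s;
}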

    Nowadays the exotic hardware where this would
    not work that way has almost completely died out (and C is not used on
    the remaining exotic hardware),

    If by "C is not used" you mean newer editions of the C standard
    are not used on very old computers with strange representations
    of signed integers, then maybe.

    but now compilers sometimes do funny
    things on integer overflow, so better don't go there or anywhere near
    it.

    This isn't about signed overflow. The issue here is conversion
    of an unsigned value to signed; almost certainly, the kernel
    performs the calculation of the actual file offset using
    unsigned arithmetic, and relies on the (assembler, mind you)
    system call stubs to map those to the appropriate userspace
    type.

    I think this is mostly irrelevant, as the system call stub,
    almost by necessity, must be written in assembler in order to
have precise control over the use of specific registers and so
    on. From C's perspective, a program making a system call just
    calls some function that's defined to return a signed integer;
    the assembler code that swizzles the register that integer will
    be extracted from sets things up accordingly. In other words,
    the conversion operation that the C standard mentions isn't at
    play, since the code that does the "conversion" is in assembly.
    Again from C's perspective the return value of the syscall stub
    function is already signed with no need of conversion.

    No, for `lseek`, the POSIX rationale explains the reasoning here
    quite clearly: the 1990 standard permitted negative offsets, and
    programs were expected to accommodate this by special handling
    of `errno` before and after calls to `lseek` that returned
    negative values. This was deemed onerous and fragile, so they
    modified the standard to prohibit calls that would result in
    negative offsets.

    But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple
environments, in which overflow is potentially unavoidable.

    POSIX also has the EOVERFLOW error for exactly that case.

    Bottom line: The off_t returned by lseek(2) is signed and always
    positive.

    As I said earlier, post POSIX.1-1990, this is true.

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.

    The point is that the programmer shouldn't have to care.

    True, but completely misses the point.

    I don't see why. You were talking about the system call stubs,
which run in userspace, and are responsible for setting up state
    so that the kernel can perform some requested action on entry,
    whether by trap, call gate, or special instruction, and then for
    tearing down that state and handling errors on return from the
    kernel.

    For mmap, there is exactly one value that may be returned from
its stub that indicates an error; any other value, by
    definition, represents a valid mapping. Whether such a mapping
    falls in the first 2G, 3G, anything except the upper 256MiB, or
    some hole in the middle is the part that's irrelevant, and
    focusing on that misses the main point: all the stub has to do
is detect the error, using whatever convention the kernel
    specifies for communicating such things back to the program, and
    ensure that in an error case, MAP_FAILED is returned from the
    stub and `errno` is set appropriately. Everything else is
    superfluous.
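
Schematically, and assuming the Linux convention of returning -errno in
the result register plus a hypothetical raw_mmap() for the trap itself,
the stub's whole job fits in a few lines:

#include <errno.h>
#include <sys/mman.h>

extern long raw_mmap(void *addr, unsigned long len, int prot,
                     int flags, int fd, long off);   /* hypothetical */

void *mmap_stub(void *addr, unsigned long len, int prot,
                int flags, int fd, long off)
{
    long ret = raw_mmap(addr, len, prot, flags, fd, off);
    if ((unsigned long)ret > (unsigned long)-4096L) {
        errno = (int)-ret;      /* ret was -errno */
        return MAP_FAILED;      /* the one and only error value */
    }
    return (void *)ret;         /* anything else is the mapping */
}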

Sure, but system calls are first introduced in real kernels using the
actual system call interface, and are limited by that interface. And
that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures.

    Not precisely. On x86_64, for example, some Unixes use a flag
    bit to determine whether the system call failed, and return
    (positive) errno values; Linux returns negative numbers to
    indicate errors, and constrains those to values between -4095
    and -1.

    Presumably that specific set of values is constrained by `mmap`:
    assuming a minimum 4KiB page size, the last architecturally
    valid address where a page _could_ be mapped is equivalent to
    -4096 and the first is 0. If they did not have that constraint,
    they'd have to treat `mmap` specially in the system call path.

    I am pretty sure that in the old times, Linux-i386 indicated failure
    by returning a value with the MSB set, and the wrapper just checked
    whether the return value was negative. And for mmap() that worked
because user-mode addresses were all below 2GB. Addresses further up
were reserved for the kernel.

    Define "Linux-i386" in this case. For the kernel, I'm confident
    that was NOT the case, and it is easy enough to research, since
    old kernel versions are online. Looking at e.g. 0.99.15, one
    can see that they set the carry bit in the flags register to
    indicate an error, along with returning a negative errno value: https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/+/refs/tags/v0.99.15/kernel/sys_call.S

    By 2.0, they'd stopped setting the carry bit, though they
    continued to clear it on entry.

    But remember, `mmap` returns a pointer, not an integer, relying
    on libc to do the necessary translation between whatever the
    kernel returns and what the program expects. So if the behavior
you describe were anywhere, it would be in libc. Given that
    they have, and had, a mechanism for signaling an error
    independent of C already, and necessarily the fixup of the
    return value must happen in the syscall stub in whatever library
the system used, relying solely on negative values to detect
errors seems like a poor design decision for a C library.

So if what you're saying were true, such a check would have to
    be in the userspace library that provides the syscall stubs; the
    kernel really doesn't care. I don't know what version libc
    Torvalds started with, or if he did his own bespoke thing
    initially or something, but looking at some commonly used C
    libraries of a certain age, such as glibc 2.0 from 1997-ish, one
    can see that they're explicitly testing the error status against
    -4095 (as an unsigned value) in the stub. (e.g., in sysdeps/unix/sysv/linux/i386/syscall.S).

    But glibc-1.06.1 is a different story, and _does_ appear to
    simply test whether the return value is negative and then jump
    to an error handler if so. So mmap may have worked incidentally
    due to the restriction on where in the address space it would
    place a mapping in very early kernel versions, as you described,
    but that's a library issue, not a kernel issue: again, the
    kernel doesn't care.

The old version of libc5 available on kernel.org is similar; it
    looks like HJ Lu changed the error handling path to explicitly
    compare against -4095 in October of 1996.

    So, fixed in the most common libc's used with Linux on i386 for
    nearly 30 years, well before the existence of x86_64.

    I wonder how the kernel is informed that it can now return more
    addresses from mmap().

    Assuming you mean the Linux kernel, when it loads an ELF
    executable, the binary image itself is "branded" with an ABI
    type that it can use to make that determination.

    I have checked that with binaries compiled in 2003 and 2000:

-rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
-rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*

    [~:160080] file /usr/local/bin/gforth-0.5.0
    /usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
    [~:160081] file /usr/local/bin/gforth-0.6.2
    /usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for
    GNU/Linux 2.0.0, stripped

    So there is actually a difference between these two. However, if I
just strace them as they are now, they both happily produce very high
addresses with mmap, e.g.,

    mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000

    I don't see any reason why it wouldn't.

    I don't know what the difference is between "for GNU/Linux 2.0.0" and
    not having that,

    `file` is pulling that from a `PT_NOTE` segment defined in the
    program header for that second file. A better tool for picking
    apart the details of those binaries is probably `objdump`.

    I'm mildly curious what version of libc those are linked against
    (e.g., as reported by `ldd`).

    but the addresses produced by mmap() seem unaffected.

    I don't see why it would be. Any common libc post 1997-ish
    handles errors in a way that permits this to work correctly. If
    you tried glibc 1.0, it might be a different story, but the
    Linux folks forked that in 1994 and modified it as "Linux libc"
    and the

However, by calling the binaries with setarch -L, mmap() returns only
addresses < 2GB in all calls I have looked at. I guess if I had
    statically linked binaries, i.e., with old system call wrappers, I
    would have to use

    setarch -L <binary>

    to make it work properly with mmap(). Or maybe Linux is smart enough
    to do it by itself when it encounters a statically-linked old binary.

    Unclear without looking at the kernel source code, but possibly.
    `setarch -L` turns on the "legacy" virtual address space layout,
    but I suspect that the number of binaries that _actually care_
    is pretty small, indeed.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 15:25:13 2025
    From Newsgroup: comp.arch

    In article <107kuhg$8ks$1@reader1.panix.com>,
    Dan Cross <cross@spitfire.i.gajendra.net> wrote:
[snip]
    I don't see why it would be. Any common libc post 1997-ish
    handles errors in a way that permits this to work correctly. If
    you tried glibc 1.0, it might be a different story, but the
    Linux folks forked that in 1994 and modified it as "Linux libc"
    and the

    ...and the Linux folks changed this to the present mechanism in
    1996.

    (Sorry 'bout that.)


    - Dan C.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 14 15:32:40 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset could
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day.

    This is simply not true. If anything, there was more variety of
    hardware supported by C90, and some of those systems were 1's
    complement or sign/mag, not 2's complement. Consequently,
    signed integer overflow has _always_ had undefined behavior in
    ANSI/ISO C.

    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 15:44:34 2025
    From Newsgroup: comp.arch

    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset could
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

I don't think that this is accidental. In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
    compilers of the day.

    This is simply not true. If anything, there was more variety of
    hardware supported by C90, and some of those systems were 1's
    complement or sign/mag, not 2's complement. Consequently,
    signed integer overflow has _always_ had undefined behavior in
    ANSI/ISO C.

    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
concern that they could influence the emerging C standard. Not
    so much anymore.

Regardless, signed integer overflow remains UB in the current C
standard, never mind that the representation is now definitionally
two's complement. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.
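
One commonly cited example of the sort of thing meant here (a sketch,
not a claim about any particular compiler): because the increment is
assumed never to wrap, the compiler may conclude the loop runs exactly
n+1 iterations and terminates.

/* With signed overflow UB, the compiler may assume i never wraps: if
   n were INT_MAX the increment would overflow, and that is assumed
   not to happen.  With a wrapping (e.g. unsigned) index the loop
   could wrap around rather than terminate, which blocks some of
   these assumptions. */
long sum_inclusive(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i <= n; i++)
        s += a[i];
    return s;
}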

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Aug 14 19:15:42 2025
    From Newsgroup: comp.arch

    On 14.08.2025 17:44, Dan Cross wrote:
    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
    concern that they could influence the emerginc C standard. Not
    so much anymore.


    They would presumably have been part of the justification for supporting multiple signed integer formats at the time. UB on signed integer
    arithmetic overflow is a different matter altogether.

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.


    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
    in the range of representable values for its type), the behavior is
    undefined.
    """

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware. It doesn't even have much to do
with integers at all. It is simply that if the calculation can't give a correct answer, then the C standards don't say anything about the
    results or effects.

The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world positive integers, you don't get a negative integer.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.


Modern C and C++ standards have dropped support for signed integer representations other than two's complement, because they are not in use
in any modern hardware (including any DSPs) - at least, not for general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Aug 14 17:43:50 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> schrieb:

The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world positive integers, you don't get a negative integer.

    I believe it was you who wrote "If you add enough apples to a
    pile, the number of apples becomes negative", so there is
clearly a defined physical meaning to overflow.

    :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 21:44:42 2025
    From Newsgroup: comp.arch

    In article <107l5ju$k78a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 14.08.2025 17:44, Dan Cross wrote:
    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
    concern that they could influence the emerginc C standard. Not
    so much anymore.

They would presumably have been part of the justification for supporting
multiple signed integer formats at the time.

    C90 doesn't have much to say about this at all, other than
    saying that the actual representation and ranges of the integer
    types are implementation defined (G.3.5 para 1).

    C90 does say that, "The representations of integral types shall
    define values by use of a pure binary numeration system" (sec
    6.1.2.5).

    C99 tightens this up and talks about 2's comp, 1's comp, and
    sign/mag as being the permissible representations (J.3.5, para
    1).

    UB on signed integer
    arithmetic overflow is a different matter altogether.

    I disagree.

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.

    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    Not in ANSI/ISO 9899-1990. In that revision of the standard,
    sec 6.5 covers declarations.

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
in the range of representable values for its type), the behavior is
undefined.
    """

    In C90, this language appears in sec 6.3 para 5. Note, however,
    that they do not define what an exception _is_, only a few
    things that _may_ cause one. See below.

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware.

    Consider this language from the (non-normative) example 4 in sec
    5.1.2.3:

    |On a machine in which overflows produce an exception and in
    |which the range of values representable by an *int* is
    |[-32768,+32767], the implementation cannot rewrite this
    |expression as [continues with the specifics of the example]....

    That seems pretty clear that they're thinking about machines
    that actually generate a hardware trap of some kind on overflow.

    It doesn't even have much to do
with integers at all. It is simply that if the calculation can't give a
correct answer, then the C standards don't say anything about the
results or effects.

The point is that when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.

    Sorry, but I don't buy this argument as anything other than a
    justification after the fact. We're talking about history and
    motivation here, not the behavior described in the standard.

    In particular, C is a programming language for actual machines,
    not a mathematical notation; the language is free to define the
    behavior of arithmetic expressions in any way it chooses, though
    one presumes it would do so in a way that makes sense for the
    machines that it targets. Thus, it could have formalized the
    result of signed integer overflow to follow 2's complement
    semantics had the committee so chosen, in which case the result
    would not be "incorrect", it would be well-defined with respect
    to the semantics of the language. Java, for example, does this,
    as does C11 (and later) atomic integer operations. Indeed, the
    C99 rationale document makes frequent reference to twos
    complement, where overflow and modular behavior are frequently
    equivalent, being the common case. But aside from the more
    recent atomics support, C _chose_ not to do this.

    Also, consider that _unsigned_ arithmetic is defined as having
    wrap-around semantics similar to modular arithmetic, and thus
    incapable of overflow. But that's simply a fiction invented for
    the abstract machine described informally in the standard: it
    requires special handling on machines like the 1100 series,
    because those machines might trap on overflow. The C committee
    could just as well have said that the unsigned arithmetic
    _could_ overflow and that the result was UB.

    So why did C choose this way? The only logical reason is that
    there were machines at the time where a) integer overflow
    caused machine exceptions, and b) the representation of signed
    integers was not well-defined, so that the actual value
    resulting from overflow could not be rigorously defined. Given
    that C90 mandated a binary representation for integers and so
    the representation of unsigned integers is basically common,
    there was no need to do that for unsigned arithmetic.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    Modern C and C++ standards have dropped support for signed integer representation other than two's complement, because they are not in use
    in any modern hardware (including any DSP's) - at least, not for general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    Yes. As I said, performance is often the justification.

    I'm not convinced that there are no custom chips and/or DSPs
    that are not manufactured today. They may not be common, their
    mere existence is certainly dumb and offensive, but that does
    not mean that they don't exist. Note that the survey in, e.g., https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
    only mentions _popular_ DSPs, not _all_ DSPs.

    Of course, if such machines exist, I will certainly concede that
    I doubt very much that anyone is targeting them with C code
    written to a modern standard.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From OrangeFish@OrangeFish@invalid.invalid to comp.arch,alt.folklore.computers on Fri Aug 15 11:42:09 2025
    From Newsgroup: comp.arch

    On 2025-08-12 15:09, John Levine wrote:
    According to <aph@littlepinkcloud.invalid>:
    In comp.arch BGB <cr88192@gmail.com> wrote:

    Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared
    libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    We have done something similar for years at Red Hat: not X32, but
    x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
    (which we were) all you have to do is copy all 32-bit libraries from
    one repo to the other.

    FreeBSD does the same thing. The 32 bit libraries are installed by default on 64 bit systems because, by current standards, they're not very big.

    Same is true for Solaris Sparc.

    OF.


    I've stopped installing them because I know I don't have any 32 bit apps
    left but on systems with old packages, who knows?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Aug 15 17:49:53 2025
    From Newsgroup: comp.arch

    On 14.08.2025 23:44, Dan Cross wrote:
    In article <107l5ju$k78a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 14.08.2025 17:44, Dan Cross wrote:
    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
    concern that they could influence the emerging C standard. Not
    so much anymore.

    They would presumably have been part of the justification for supporting
    multiple signed integer formats at the time.

    C90 doesn't have much to say about this at all, other than
    saying that the actual representation and ranges of the integer
    types are implementation defined (G.3.5 para 1).

    C90 does say that, "The representations of integral types shall
    define values by use of a pure binary numeration system" (sec
    6.1.2.5).

    C99 tightens this up and talks about 2's comp, 1's comp, and
    sign/mag as being the permissible representations (J.3.5, para
    1).

    Yes. Early C didn't go into the details, then C99 described the systems
    that could realistically be used. And now in C23 only two's complement
    is allowed.


    UB on signed integer
    arithmetic overflow is a different matter altogether.

    I disagree.


    You have overflow when the mathematical result of an operation cannot be expressed accurately in the type - regardless of the representation
    format for the numbers. Your options, as a language designer or
    implementer, for handling the overflow are the same regardless of the representation. You can pick a fixed value to return, or saturate, or
    invoke some kind of error handler mechanism, or return a "don't care" unspecified value of the type, or perform a specified algorithm to get a representable value (such as reduction modulo 2^n), or you can simply
    say the program is broken if this happens (it is UB).

    I don't see where the representation comes into it - overflow is a
    matter of values and the ranges that can be stored in a type, not how
    those values are stored in the bits of the data.
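
    (Illustrative aside: a minimal sketch of two of those options - "saturate"
    and "report to an error handler" - written for a GCC/Clang-style compiler
    that provides __builtin_add_overflow; the function names are made up.)

    #include <limits.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Option "saturate": clamp to INT_MAX/INT_MIN instead of overflowing. */
    static int add_saturating(int a, int b)
    {
        int r;
        if (__builtin_add_overflow(a, b, &r))
            return (a > 0) ? INT_MAX : INT_MIN;
        return r;
    }

    /* Option "error handler": report the overflow back to the caller. */
    static bool add_checked(int a, int b, int *r)
    {
        return !__builtin_add_overflow(a, b, r);
    }

    int main(void)
    {
        int r;
        printf("%d\n", add_saturating(INT_MAX, 1));   /* prints INT_MAX */
        if (!add_checked(INT_MAX, 1, &r))
            printf("overflow detected\n");
        return 0;
    }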

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.

    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    Not in ANSI/ISO 9899-1990. In that revision of the standard,
    sec 6.5 covers declarations.

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
    in the range of representable values for its type), the behavior is
    undefined.
    """

    In C90, this language appears in sec 6.3 para 5. Note, however,
    that they do not define what an exception _is_, only a few
    things that _may_ cause one. See below.


    It's basically the same in C90 onwards, with just small changes to the wording. And it /does/ define what is meant by an "exceptional
    condition" (or just "exception" in C90) - that is done by the part in parentheses.

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware.

    Consider this language from the (non-normative) example 4 in sec
    5.1.2.3:

    |On a machine in which overflows produce an exception and in
    |which the range of values representable by an *int* is
    |[-32768,+32767], the implementation cannot rewrite this
    |expression as [continues with the specifics of the example]....

    That seems pretty clear that they're thinking about machines
    that actually generate a hardware trap of some kind on overflow.


    They are thinking about that possibility, yes. In C90, the term
    "exception" here was not clearly defined - and it is definitely not the
    same as the term "exception" in 6.3p5. The wording was improved in C99 without changing the intended meaning - there the term in the paragraph
    under "Expressions" is "exceptional condition" (defined in that
    paragraph), while in the example in "Execution environments", it says
    "On a machine in which overflows produce an explicit trap". (C11
    further clarifies what "performs a trap" means.)

    But this is about re-arrangements the compiler is allowed to make, or
    barred from making - it can't make re-arrangements that would mean
    execution failed when the direct execution of the code according to the
    C abstract machine would have worked correctly (without ever having encountered an "exceptional condition" or other UB). Representation is
    not relevant here - there is nothing about two's complement, ones'
    complement, sign-magnitude, or anything else. Even the machine hardware
    is not actually particularly important, given that most processors
    support non-trapping integer arithmetic instructions and for those that
    don't have explicit trap instructions, a compiler could generate "jump
    if overflow flag set" or similar instructions to emulate traps
    reasonably efficiently. (Many compilers support that kind of thing as
    an option to aid debugging.)
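
    (Illustrative aside: GCC and Clang, for instance, expose such checking as
    -ftrapv, which aborts on signed overflow, and as the UBSan check
    -fsanitize=signed-integer-overflow. A trivial test case; without such
    options, the overflowing addition below is simply undefined behaviour.)

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        volatile int x = INT_MAX;   /* volatile keeps the add from being folded away */
        int y = x + 1;              /* signed overflow: UB in standard C */
        printf("%d\n", y);
        return 0;
    }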


    It doesn't even have much to do
    with integers at all. It is simply that if the calculation can't give a
    correct answer, then the C standards don't say anything about the
    results or effects.

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    Sorry, but I don't buy this argument as anything other than a
    justification after the fact. We're talking about history and
    motivation here, not the behavior described in the standard.

    It is a fair point that I am describing a rational and sensible reason
    for UB on arithmetic overflow - and I do not know the motivation of the
    early C language designers, compiler implementers, and authors of the
    first C standard.

    I do know, however, that the principle of "garbage in, garbage out" was
    well established long before C was conceived. And programmers of that
    time were familiar with the concept of functions and operations being
    defined for appropriate inputs, and having no defined behaviour for
    invalid inputs. C is full of other things where behaviour is left
    undefined when no sensible correct answer can be specified, and that is
    not just because the behaviour of different hardware could vary. It
    seems perfectly reasonable to me to suppose that signed integer
    arithmetic overflow is just another case, no different from
    dereferencing an invalid pointer, dividing by zero, or any one of the
    other UB's in the standards.


    In particular, C is a programming language for actual machines,
    not a mathematical notation; the language is free to define the
    behavior of arithmetic expressions in any way it chooses, though
    one presumes it would do so in a way that makes sense for the
    machines that it targets.

    Yes, that is true. It is, however, also important to remember that it
    was based on a general abstract machine, not any particular hardware,
    and that the operations were intended to follow standard mathematics as
    well as practically possible - operations and expressions in C were not designed for any particular hardware. (Though some design choices were
    biased by particular hardware.)

    Thus, it could have formalized the
    result of signed integer overflow to follow 2's complement
    semantics had the committee so chosen, in which case the result
    would not be "incorrect", it would be well-defined with respect
    to the semantics of the language. Java, for example, does this,
    as does C11 (and later) atomic integer operations. Indeed, the
    C99 rationale document makes frequent reference to twos
    complement, where overflow and modular behavior are frequently
    equivalent, being the common case. But aside from the more
    recent atomics support, C _chose_ not to do this.


    It could have made signed integer overflow defined behaviour, but it did
    not. The C standards committee have explicitly chosen not to do that,
    even after deciding that two's complement is the only supported
    representation for signed integers in C23 onwards. It is fine to have
    two's complement representation, and fine to have modulo arithmetic in
    some circumstances, while leaving other arithmetic overflow undefined. Unsigned integer operations in C have always been defined as modulo
    arithmetic - addition of unsigned values is a different operation from addition of signed values. Having some modulo behaviour does not in any
    way imply that signed arithmetic should be modulo.

    In Java, the language designers decided that integer arithmetic
    operations would be modulo operations. Wrapping therefore gives the
    correct answer for those operations - it does not give the correct
    answer for mathematical integer operations. And Java loses common mathematical identities which C retains - such as the identity that
    adding a positive integer to another integer will increase its value. Something always has to be lost when approximating unbounded
    mathematical integers in a bounded implementation - I think C made the
    right choices here about what to keep and what to lose, and Java made
    the wrong choices. (Others may of course have different opinions.)
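
    (Illustrative aside: the identity mentioned above is exactly what optimizers
    exploit. A minimal example; the function name is made up, and "typically
    prints 1" describes common behaviour at -O2 rather than a guarantee - since
    the overflow is UB, either result is permitted. With gcc/clang's -fwrapv,
    or in Java, the test is genuinely false for INT_MAX.)

    #include <limits.h>
    #include <stdio.h>

    /* Because signed overflow is UB, a compiler may assume x + 1 > x holds
       for every int x and fold this function to "return 1". */
    static int always_true(int x)
    {
        return x + 1 > x;
    }

    int main(void)
    {
        printf("%d\n", always_true(INT_MAX));   /* typically prints 1 at -O2 */
        return 0;
    }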

    In Zig, unsigned integer arithmetic overflow is also UB as these
    operations are not defined as modulo. I think that is a good natural
    choice too - but it is useful for a language to have a way to do
    wrapping arithmetic on the occasions you need it.

    Also, consider that _unsigned_ arithmetic is defined as having
    wrap-around semantics similar to modular arithmetic, and thus
    incapable of overflow.

    Yes. Unsigned arithmetic operations are different operations from
    signed arithmetic operations in C.

    But that's simply a fiction invented for
    the abstract machine described informally in the standard: it
    requires special handling on machines like the 1100 series,
    because those machines might trap on overflow. The C committee
    could just as well have said that the unsigned arithmetic
    _could_ overflow and that the result was UB.


    They could have done that (as the Zig folk did).

    So why did C choose this way? The only logical reason is that
    there were machines at the time where a) integer overflow
    caused machine exceptions, and b) the representation of signed
    integers was not well-defined, so that the actual value
    resulting from overflow could not be rigorously defined. Given
    that C90 mandated a binary representation for integers and so
    the representation of unsigned integers is basically common,
    there was no need to do that for unsigned arithmetic.


    Not at all. Usually when someone says "the only logical reason is...",
    they really mean "the only logical reason /I/ can think of is...", or
    "the only reason that /I/ can think of that /I/ think is logical is...".

    For a language that can be used as a low-level systems language, it is important to be able to do modulo arithmetic efficiently. It is needed
    for a number of low-level tasks, including the implementation of large arithmetic operations, handling timers, counters, and other bits and
    pieces. So it was definitely a useful thing to have in C.
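
    (Illustrative aside: the timer/counter case looks roughly like this in C.
    The names are made up; the trick relies only on unsigned subtraction being
    defined modulo 2^32, so it keeps working when the tick counter wraps,
    provided intervals stay under 2^31 ticks.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Wraparound-safe "has the deadline passed?" test on a free-running
       32-bit tick counter. */
    static bool deadline_reached(uint32_t now, uint32_t deadline)
    {
        return (uint32_t)(now - deadline) < UINT32_C(0x80000000);
    }

    int main(void)
    {
        uint32_t deadline = UINT32_C(0xFFFFFFF0) + 32;     /* wraps past zero to 16 */
        printf("%d\n", deadline_reached(16, deadline));    /* now == deadline: prints 1 */
        return 0;
    }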

    For a language that can be used as a fast and efficient application
    language, it must have a reasonable approximation to mathematical
    integer arithmetic. Implementations should not be forced to have
    behaviours beyond the mathematically sensible answers - if a calculation
    can't be done correctly, there's no point in doing it. Giving nonsense results does not help anyone - C programmers or toolchain implementers,
    so the language should not specify any particular result. More sensible defined overflow behaviour - saturation, error values, language
    exceptions or traps, etc., would be very inefficient on most hardware.
    So UB is the best choice - and implementations can do something
    different if they like.

    Too many options make a language bigger - harder to implement, harder to learn, harder to use. So it makes sense to have modulo arithmetic for unsigned types, and normal arithmetic for signed types.

    I am not claiming to know that this is the reasoning made by the C
    language pioneers. But it is definitely an alternative logical reason
    for C being the way it is.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    Modern C and C++ standards have dropped support for signed integer
    representation other than two's complement, because they are not in use
    in any modern hardware (including any DSP's) - at least, not for
    general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    Yes. As I said, performance is often the justification.

    I'm not convinced that there are no custom chips and/or DSPs
    that are not manufactured today. They may not be common, their
    mere existence is certainly dumb and offensive, but that does
    not mean that they don't exist. Note that the survey in, e.g., https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
    only mentions _popular_ DSPs, not _all_ DSPs.


    I think you might have missed a few words in that paragraph, but I
    believe I know what you intended. There are certainly DSPs and other
    cores that have strong support for alternative overflow behaviour -
    saturation is very common in DSPs, and it is also common to have a
    "sticky overflow" flag so that you can do lots of calculations in a
    tight loop, and check for problems once you are finished. I think it is highly unlikely that you'll find a core with something other than two's complement as the representation for signed integer types, though I
    can't claim that I know /all/ devices! (I do know a bit about more
    cores than would be considered popular or common.)

    Of course, if such machines exist, I will certainly concede that
    I doubt very much that anyone is targeting them with C code
    written to a modern standard.


    Modern C is definitely used on DSPs with strong saturation support.
    (Even ARM cores have saturated arithmetic instructions.) But they can
    also handle two's complement wrapped signed integer arithmetic if the programmer wants that - after all, it's exactly the same in the hardware
    as modulo unsigned arithmetic (except for division). That doesn't mean
    that wrapping signed integer overflow is useful or desired behaviour.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Aug 15 17:49:58 2025
    From Newsgroup: comp.arch

    On 14.08.2025 19:43, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    The point is that there when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    I believe it was you who wrote "If you add enough apples to a
    pile, the number of apples becomes negative", so there is
    clearly a defined physical meaning to overflow.

    :-)

    Yes, I did say something along those lines - but perhaps not /exactly/
    those words!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 15 16:53:29 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 15 13:19:36 2025
    From Newsgroup: comp.arch

    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and
    seemingly found a local optimum at around 16K.

    Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
    but 16K to 32K or 64K did not see any significant reduction; but did see
    a more significant increase in memory footprint due to allocation
    overheads (where, OTOH, going from 4K to 16K pages does not see much
    increase in memory footprint).

    Patterns seemed consistent across multiple programs tested, but harder
    to say if this pattern would be universal.


    Had noted if running stats on where in the pages memory accesses land:
    4K: Pages tend to be accessed fairly evenly
    16K: Minor variation as to what parts of the page are being used.
    64K: Significant variation between parts of the page.
    Basically, tracking per-page memory accesses on a finer grain boundary
    (eg, 512 bytes).

    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at all
    (and increasing page size only really sees benefit for TLB miss rate so
    long as the whole page is "actually being used").
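
    (Illustrative aside: the sort of bookkeeping described above can be sketched
    roughly as follows. All names are made up, and the single 64-bit mask limits
    this particular sketch to page sizes of 32K or below, since it tracks at
    most 64 sub-blocks of 512 bytes per page.)

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_PAGES 65536

    static uint64_t page_id[MAX_PAGES];
    static uint64_t sub_mask[MAX_PAGES];   /* one bit per 512B sub-block */
    static int      n_pages;

    /* Record one access at 'addr' for a page size of (1 << page_bits) bytes. */
    static void touch(uint64_t addr, unsigned page_bits)
    {
        uint64_t page = addr >> page_bits;
        unsigned sub  = (unsigned)((addr & ((1ULL << page_bits) - 1)) >> 9);
        for (int i = 0; i < n_pages; i++)
            if (page_id[i] == page) { sub_mask[i] |= 1ULL << sub; return; }
        if (n_pages < MAX_PAGES) {
            page_id[n_pages] = page;
            sub_mask[n_pages] = 1ULL << sub;
            n_pages++;
        }
    }

    /* Report how many pages were touched and what fraction of their 512B
       sub-blocks were actually used. */
    static void report(unsigned page_bits)
    {
        unsigned long used = 0, total = 0;
        unsigned subs_per_page = 1u << (page_bits - 9);
        for (int i = 0; i < n_pages; i++) {
            for (unsigned b = 0; b < subs_per_page; b++)
                if (sub_mask[i] & (1ULL << b)) used++;
            total += subs_per_page;
        }
        if (total)
            printf("%uK pages: %d touched, %.1f%% of sub-blocks used\n",
                   1u << (page_bits - 10), n_pages, 100.0 * used / total);
    }

    int main(void)
    {
        /* Tiny synthetic trace: two hot spots roughly 40KB apart, 16K pages. */
        for (uint64_t i = 0; i < 4096; i += 8) {
            touch(0x100000 + i, 14);
            touch(0x10A000 + i, 14);
        }
        report(14);
        return 0;
    }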


    Though, can also note that a skew appeared in 64K pages where they were
    more likely to be being accessed in the first 32K rather than the latter
    32K. Though, I would expect the opposite pattern with stack pages (in
    this case, a traditional "grows downwards" stack being used).


    Granted, always possible other people might see something different if
    they ran their own tests on different programs.



    Can also note that in this case, IIRC, the "malloc()" tended to operate
    by allocating chunks of memory for the "medium heap" 128K at a time, and
    for objects larger than 64K would fall back to page allocation. This may
    be a poor fit for 64K pages (since a whole class of "just over 64K"
    mallocs needs 128K).

    Arguably, would be better for this page size to grow the "malloc()" heap
    in 1MB chunks; but this only really makes sense if the system has
    "plenty of RAM" and applications tend to use a lot of RAM. Where, say,
    if the program is fine with 64K of stack and 128K of heap, it is kind of
    a waste to allocate 1MB of heap for it (although in TestKern, programs
    default to 128K, which is about how much is needed for Doom and Quake
    and similar; though Quake will exceed 128K if all local arrays are stack-allocated rather than auto-folding large structs or arrays to heap allocation, in which case it needs around 256K of stack space; more for
    GLQuake, as it used a very large stack array, ~256K IIRC, for texture resizing).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Aug 15 18:33:07 2025
    From Newsgroup: comp.arch

    In article <107nkv2$1753a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 14.08.2025 23:44, Dan Cross wrote:
    In article <107l5ju$k78a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    [snip]
    UB on signed integer
    arithmetic overflow is a different matter altogether.

    I disagree.

    You have overflow when the mathematical result of an operation cannot be expressed accurately in the type - regardless of the representation
    format for the numbers. Your options, as a language designer or implementer, for handling the overflow are the same regardless of the representation. You can pick a fixed value to return, or saturate, or invoke some kind of error handler mechanism, or return a "don't care" unspecified value of the type, or perform a specified algorithm to get a representable value (such as reduction modulo 2^n), or you can simply
    say the program is broken if this happens (it is UB).

    I don't see where the representation comes into it - overflow is a
    matter of values and the ranges that can be stored in a type, not how
    those values are stored in the bits of the data.

    I understood your point. But we are talking about the history of
    the language here, not the presently defined behavior.

    We do, in fact, have historical source materials we can draw
    from when discussing this; there's little need to guess. Here,
    we know that the earliest C implementations simply ignored the
    possibility of overflow. In K&R1, chap 2, sec 2.5 ("Arithmetic
    Operators") on page 38, the authors write, "the action taken on
    overflow or underflow depends on the machine at hand." In
    Appendix A, sec 7 ("Expressions"), page 185, the authors write:
    "The handling of overflow and divide check in expression
    evaluation is machine-dependent. All existing implementations of C
    ignore integer overflows; treatment of division by 0, and all
    floating point exceptions, varies between machines, and is
    usually adjustable by a library function."

    In other words, different machines give different results; some
    will trap, others will differ due to representation issues. Nowhere
    does this suggest that the language designers were
    worried about getting the "wrong" result, as you have asserted.

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.

    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    Not in ANSI/ISO 9899-1990. In that revision of the standard,
    sec 6.5 covers declarations.

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
    in the range of representable values for its type), the behavior is
    undefined.
    """

    In C90, this language appears in sec 6.3 para 5. Note, however,
    that they do not define what an exception _is_, only a few
    things that _may_ cause one. See below.

    It's basically the same in C90 onwards, with just small changes to the wording.

    Did I suggest otherwise?

    And it /does/ define what is meant by an "exceptional
    condition" (or just "exception" in C90) - that is done by the part in >parentheses.

    That is an interpretation.

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware.

    Consider this language from the (non-normative) example 4 in sec
    5.1.2.3:

    |On a machine in which overflows produce an exception and in
    |which the range of values representable by an *int* is
    |[-32768,+32767], the implementation cannot rewrite this
    |expression as [continues with the specifics of the example]....

    That seems pretty clear that they're thinking about machines
    that actually generate a hardware trap of some kind on overflow.

    They are thinking about that possibility, yes. In C90, the term
    "exception" here was not clearly defined - and it is definitely not the
    same as the term "exception" in 6.3p5. The wording was improved in C99 without changing the intended meaning - there the term in the paragraph under "Expressions" is "exceptional condition" (defined in that
    paragraph), while in the example in "Execution environments", it says
    "On a machine in which overflows produce an explicit trap". (C11
    further clarifies what "performs a trap" means.)

    But this is about re-arrangements the compiler is allowed to make, or
    barred from making - it can't make re-arrangements that would mean
    execution failed when the direct execution of the code according to the
    C abstract machine would have worked correctly (without ever having encountered an "exceptional condition" or other UB). Representation is
    not relevant here - there is nothing about two's complement, ones' complement, sign-magnitude, or anything else. Even the machine hardware
    is not actually particularly important, given that most processors
    support non-trapping integer arithmetic instructions and for those that don't have explicit trap instructions, a compiler could generate "jump
    if overflow flag set" or similar instructions to emulate traps
    reasonably efficiently. (Many compilers support that kind of thing as
    an option to aid debugging.)

    It doesn't even have much to do
    with integers at all. It is simply that if the calculation can't give a correct answer, then the C standards don't say anything about the
    results or effects.

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    Sorry, but I don't buy this argument as anything other than a
    justification after the fact. We're talking about history and
    motivation here, not the behavior described in the standard.

    It is a fair point that I am describing a rational and sensible reason
    for UB on arithmetic overflow - and I do not know the motivation of the early C language designers, compiler implementers, and authors of the
    first C standard.

    Then there's really nothing more to discuss. The intent here is
    to understand the motivation of those folks.

    Early C didn't even have unsigned; Dennis Ritchie's paper for
    the History of Programming Languages conference said that it
    came around 1977 (https://www.nokia.com/bell-labs/about/dennis-m-ritchie/chist.html;
    see the section on "portability"), and in pre-ANSI C, struct
    fields of `int` type were effectively unsigned (K&R1,
    pp.138,197). I mentioned the quote from K&R1 about overflow
    above, but we see some other hints about signed overflow
    becoming negative in other documents. For instance, K&R2, p 118
    gives the example of a hash function followed by the sentence,
    "unsigned arithmetic ensures that the hash value is
    non-negative." This does not suggest to me that the authors
    thought that the wrapping behavior of twos-complement arithmetic
    was "incorrect".

    I do know, however, that the principle of "garbage in, garbage out" was
    well established long before C was conceived. And programmers of that
    time were familiar with the concept of functions and operations being defined for appropriate inputs, and having no defined behaviour for
    invalid inputs. C is full of other things where behaviour is left
    undefined when no sensible correct answer can be specified, and that is
    not just because the behaviour of different hardware could vary. It
    seems perfectly reasonable to me to suppose that signed integer
    arithmetic overflow is just another case, no different from
    dereferencing an invalid pointer, dividing by zero, or any one of the
    other UB's in the standards.

    Indeed; this is effectively what I've been saying: signed
    integer overflow is UB because the behavior of overflow varied
    between the machines of the day, so C could not make assumptions
    about what value would result, in part because of representation
    issues: at the hardware level, signed overflow of the largest
    representable positive integer yields different _values_ between
    1s comp and 2s comp machines. Who is to say which is correct?

    In particular, C is a programming language for actual machines,
    not a mathematical notation; the language is free to define the
    behavior of arithmetic expressions in any way it chooses, though
    one presumes it would do so in a way that makes sense for the
    machines that it targets.

    Yes, that is true. It is, however, also important to remember that it
    was based on a general abstract machine, not any particular hardware,
    and that the operations were intended to follow standard mathematics as
    well as practically possible - operations and expressions in C were not designed for any particular hardware. (Though some design choices were biased by particular hardware.)

    This is historically inaccurate.

    C was developed by and for the PDP-11 initially, targeting Unix,
    building from Martin Richards's BCPL (which Ritchie and Thompson
    had used under Multics on the GE-645 machine, and GCOS on the
    635) and Ken Thompson's B language, which he had implemented as
    a chopped-down BCPL to be a systems programming language for
    _very_ early Unix on the PDP-7. B was typeless, as the PDP-7
    was word-oriented, and we see vestiges of this ancestral DNA in
    C today. See Ritchie's C history paper for details.

    Concerns for portability, leading to the development of the
    abstract machine informally described by the C standard, came
    much, much later in its evolutionary development.

    Thus, it could have formalized the
    result of signed integer overflow to follow 2's complement
    semantics had the committee so chosen, in which case the result
    would not be "incorrect", it would be well-defined with respect
    to the semantics of the language. Java, for example, does this,
    as does C11 (and later) atomic integer operations. Indeed, the
    C99 rationale document makes frequent reference to twos
    complement, where overflow and modular behavior are frequently
    equivalent, being the common case. But aside from the more
    recent atomics support, C _chose_ not to do this.

    It could have made signed integer overflow defined behaviour, but it did not. The C standards committee have explicitly chosen not to do that,
    even after deciding that two's complement is the only supported representation for signed integers in C23 onwards. It is fine to have
    two's complement representation, and fine to have modulo arithmetic in
    some circumstances, while leaving other arithmetic overflow undefined. Unsigned integer operations in C have always been defined as modulo arithmetic - addition of unsigned values is a different operation from addition of signed values. Having some modulo behaviour does not in any
    way imply that signed arithmetic should be modulo.

    In Java, the language designers decided that integer arithmetic
    operations would be modulo operations. Wrapping therefore gives the
    correct answer for those operations - it does not give the correct
    answer for mathematical integer operations. And Java loses common mathematical identities which C retains - such as the identity that
    adding a positive integer to another integer will increase its value. Something always has to be lost when approximating unbounded
    mathematical integers in a bounded implementation - I think C made the
    right choices here about what to keep and what to lose, and Java made
    the wrong choices. (Others may of course have different opinions.)

    In Zig, unsigned integer arithmetic overflow is also UB as these
    operations are not defined as modulo. I think that is a good natural
    choice too - but it is useful for a language to have a way to do
    wrapping arithmetic on the occasions you need it.

    None of this seems relevant to understanding the motivations of
    the members of the committee that produced the 1990 C standard,
    other than agreeing that the decision could have been different.

    I would add that very early C treated signed and unsigned
    arithmetic as more or less equivalent. It wasn't until they
    started porting C to machines other than the PDP-11 that it
    started to matter.

    Also, consider that _unsigned_ arithmetic is defined as having
    wrap-around semantics similar to modular arithmetic, and thus
    incapable of overflow.

    Yes. Unsigned arithmetic operations are different operations from
    signed arithmetic operations in C.

    This is the second time you have mentioned this. Did I say
    something that led you believe that I suggested otherwise, or
    am somehow unaware of this fact?

    But that's simply a fiction invented for
    the abstract machine described informally in the standard: it
    requires special handling on machines like the 1100 series,
    because those machines might trap on overflow. The C committee
    could just as well have said that the unsigned arithmetic
    _could_ overflow and that the result was UB.

    They could have done that (as the Zig folk did).

    Or the SML folks before the Zig folks.

    So why did C choose this way? The only logical reason is that
    there were machines at the time where a) integer overflow
    caused machine exceptions, and b) the representation of signed
    integers was not well-defined, so that the actual value
    resulting from overflow could not be rigorously defined. Given
    that C90 mandated a binary representation for integers and so
    the representation of unsigned integers is basically common,
    there was no need to do that for unsigned arithmetic.

    Not at all. Usually when someone says "the only logical reason is...",
    they really mean "the only logical reason /I/ can think of is...", or
    "the only reason that /I/ can think of that /I/ think is logical is...".

    I probably should have said that I'm also drawing from direct
    references, as well as hints and inferences from other
    historical documents; both editions of K&R as well as early Unix
    source code and the "C Reference Manual" from 6th and 7th
    Edition Unix (the language described in 7th Ed is quite
    different from the language in 6th Ed; most of this was driven
    by a) portability, and b) the need to support
    phototypesetters, hence why the C implemented in 7th Ed and PCC
    is sometimes called "Typesetter C"). This is complemented with
    direct conversations with some of the original players, though
    admittedly those were quite a while ago.

    For a language that can be used as a low-level systems language, it is important to be able to do modulo arithmetic efficiently. It is needed
    for a number of low-level tasks, including the implementation of large arithmetic operations, handling timers, counters, and other bits and
    pieces. So it was definitely a useful thing to have in C.

    For a language that can be used as a fast and efficient application language, it must have a reasonable approximation to mathematical
    integer arithmetic. Implementations should not be forced to have
    behaviours beyond the mathematically sensible answers - if a calculation can't be done correctly, there's no point in doing it. Giving nonsense results does not help anyone - C programmers or toolchain implementers,
    so the language should not specify any particular result. More sensible defined overflow behaviour - saturation, error values, language
    exceptions or traps, etc., would be very inefficient on most hardware.
    So UB is the best choice - and implementations can do something
    different if they like.

    This is where we differ: you keep asserting notions of
    "correctness", without acknowledging that a) correctness differs
    in this context, and b) the notion of what is "correct" has
    itself differed over time as C has evolved.

    Moreover, when you say, "if a calculation can't be done
    correctly, there's no point in doing it" that seems highly
    specific and reliant on your definition of correctness. My

    Here's an example:

    char foo = 128;
    int x = foo + 1;
    printf("%d\n", x);

    What is printed? (Note: that's rhetorical)

    On the systems I just tested, x86_64, ARM64 and RISCV64, I get
    -127 for the first two, and 129 for the last.

    Of course, we all know that this relies on implementation
    defined behavior around whether `char` is treated as signed or
    unsigned (and the resulting conversion from an unsigned constant
    to signed), but if what you say were true about GIGO, why is
    this not _undefined_ behavior?
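
    (Illustrative aside: one way to see which behaviour a given target picks is
    to check CHAR_MIN from <limits.h>; it is 0 where plain char is unsigned and
    negative where char is signed - which of the two applies is part of the
    platform ABI.)

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        char foo = 128;   /* implementation-defined result if char is signed */
        int  x   = foo + 1;
        printf("char is %s, CHAR_MIN = %d, x = %d\n",
               (CHAR_MIN < 0) ? "signed" : "unsigned", CHAR_MIN, x);
        return 0;
    }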

    Too many options make a language bigger - harder to implement, harder to learn, harder to use. So it makes sense to have modulo arithmetic for unsigned types, and normal arithmetic for signed types.

    I am not claiming to know that this is the reasoning made by the C
    language pioneers. But it is definitely an alternative logical reason
    for C being the way it is.

    But we _can_ see what those pioneers were thinking by reading
    the artifacts they left behind, which we know, again based on
    primary sources, had an impact on the standards committee.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    Modern C and C++ standards have dropped support for signed integer
    representation other than two's complement, because they are not in use
    in any modern hardware (including any DSP's) - at least, not for
    general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    Yes. As I said, performance is often the justification.

    I'm not convinced that there are no custom chips and/or DSPs
    that are not manufactured today. They may not be common, their
    mere existence is certainly dumb and offensive, but that does
    not mean that they don't exist. Note that the survey in, e.g.,
    https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
    only mentions _popular_ DSPs, not _all_ DSPs.

    I think you might have missed a few words in that paragraph, but I
    believe I know what you intended. There are certainly DSPs and other
    cores that have strong support for alternative overflow behaviour - saturation is very common in DSPs, and it is also common to have a
    "sticky overflow" flag so that you can do lots of calculations in a
    tight loop, and check for problems once you are finished. I think it is highly unlikely that you'll find a core with something other than two's complement as the representation for signed integer types, though I
    can't claim that I know /all/ devices! (I do know a bit about more
    cores than would be considered popular or common.)

    I was referring specifically to integer representation here, not
    saturating (or other) operations, but sure.

    Of course, if such machines exist, I will certainly concede that
    I doubt very much that anyone is targeting them with C code
    written to a modern standard.

    Modern C is definitely used on DSPs with strong saturation support.
    (Even ARM cores have saturated arithmetic instructions.) But they can
    also handle two's complement wrapped signed integer arithmetic if the programmer wants that - after all, it's exactly the same in the hardware
    as modulo unsigned arithmetic (except for division). That doesn't mean
    that wrapping signed integer overflow is useful or desired behaviour.

    So again, the context here is understanding the initial
    motivation. I've mentioned reasons why they don't change it now
    (there _are_ arguments about correctness, but compiler writers
    also argue strongly that making signed integer overflow well
    defined would prohibit them from implementing what they consider
    to be important optimizations).

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Aug 15 12:03:41 2025
    From Newsgroup: comp.arch

    On 8/15/2025 11:19 AM, BGB wrote:
    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to
    have "typically 97% hit rate".  I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and seemingly found a local optimum at around 16K.

    I think that is consistent with what some others have found. I suspect
    the average page size should grow as memory gets cheaper, which leads to
    more memory on average in systems. This also leads to larger programs,
    as they can "fit" in larger memory with less paging. And as disk
    (spinning or SSD) get faster transfer rates, the cost (in time) of
    paging a larger page goes down. While 4K was the sweet spot some
    decades ago, I think it has increased, probably to 16K. At some point
    in the future, it may get to 64K, but not for some years yet.


    Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
    but 16K to 32K or 64K did not see any significant reduction; but did see
    a more significant increase in memory footprint due to allocation
    overheads (where, OTOH, going from 4K to 16K pages does not see much increase in memory footprint).

    Patterns seemed consistent across multiple programs tested, but harder
    to say if this pattern would be universal.


    Had noted if running stats on where in the pages memory accesses land:
      4K: Pages tend to be accessed fairly evenly
     16K: Minor variation as to what parts of the page are being used.
     64K: Significant variation between parts of the page.
    Basically, tracking per-page memory accesses on a finer grain boundary
    (eg, 512 bytes).

    Interesting.


    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at all
    (and increasing page size only really sees benefit for TLB miss rate so
    long as the whole page is "actually being used").

    Not necessarily. Consider the case of a 16K (or larger) page with two
    "hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K pages, but only one with larger pages.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 15 19:19:50 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 8/15/2025 11:19 AM, BGB wrote:
    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to have "typically 97% hit rate".  I would go for larger pages, which would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and
    seemingly found a local optimum at around 16K.

    I think that is consistent with what some others have found. I suspect
    the average page size should grow as memory gets cheaper, which leads to more memory on average in systems. This also leads to larger programs,
    as they can "fit" in larger memory with less paging. And as disk
    (spinning or SSD) get faster transfer rates, the cost (in time) of
    paging a larger page goes down. While 4K was the sweet spot some
    decades ago, I think it has increased, probably to 16K. At some point
    in the future, it may get to 64K, but not for some years yet.

    ARM64 (ARMv8) architecturally supports 4k, 16k and 64k. When
    ARMv8 first came out, one vendor (Redhat) shipped using 64k pages,
    while Ubuntu shipped with 4k page support. 16k support by the
    processor was optional (although the Neoverse cores support all
    three, some third-party cores developed before ARM added 16k
    pages to the architecture specification only supported 4k and 64k).


    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at all
    (and increasing page size only really sees benefit for TLB miss rate so
    long as the whole page is "actually being used").

    Not necessarily. Consider the case of a 16K (or larger) page with two
    "hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K >pages, but only one with larger pages.

    Note that the ARMv8 architecture[*] supports terminating the table walk
    before reaching the smallest level, so with 4K pages[**], a single TLB
    entry can cover 4K, 2M or 1GB blocks. With 16k pages, a single
    TLB entry can cover 16k, 32MB or 64GB blocks. 64k pages support
    64k, 512M and 4TB block sizes.

    [*] Intel, AMD and others have similar "large page" capabilities.
    [**] Granules, in ARM terminology.
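
    (Illustrative aside: a back-of-the-envelope check of the block sizes quoted
    above, assuming 8-byte descriptors, so a granule of G bytes holds G/8 table
    entries, a level-2 block maps (G/8)*G bytes, and a level-1 block maps
    (G/8)^2*G bytes.)

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long granule[] = { 4096ULL, 16384ULL, 65536ULL };
        for (int i = 0; i < 3; i++) {
            unsigned long long g = granule[i];
            unsigned long long entries = g / 8;   /* 8-byte descriptors per entry */
            printf("%lluK granule: L2 block = %llu MB, L1 block = %llu GB\n",
                   g >> 10, (entries * g) >> 20, (entries * entries * g) >> 30);
        }
        return 0;   /* prints 2 MB / 1 GB, 32 MB / 64 GB, 512 MB / 4096 GB */
    }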
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 15 20:40:44 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    ARM64 (ARMv8) architecturally supports 4k, 16k and 64k.

    S/370 had 2K or 4K pages grouped into 64K or 1M segment tables. By the time it became S/390 it was just 4K pages and 1M segment tables, in a 31 bit address space.

    In zSeries there are multiple 2G regions consisting of 1M segments and 4K pages.
    A segment can optionally be mapped as a single unit, in effect a 1M page.

    These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks. I can believe that with today's giant memories and bloated programs larger than 4K pages would work better.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Aug 15 21:22:53 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.

    Two block devices bought less than a year ago:

    Disk model: KINGSTON SEDC2000BM8960G
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes

    Disk model: WD Blue SN580 2TB
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 16 01:22:57 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.

    Two block devices bought less than a year ago:

    SSDs often let you do 512 byte reads and writes for backward compatibility even though the physical block size is much larger.

    Wikipedia tells us all about it:

    https://en.wikipedia.org/wiki/Advanced_Format#512_emulation_(512e)

    Disk model: KINGSTON SEDC2000BM8960G

    Says here the block size of the 480GB version is 16K, so I'd assume the 960GB is
    the same:

    https://www.techpowerup.com/ssd-specs/kingston-dc2000b-480-gb.d2166

    Disk model: WD Blue SN580 2TB

    I can't find anything on its internal structure but I see the vendor's random read/write benchmarks all use 4K blocks so that's probably the internal block size.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Aug 16 05:09:43 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors. In 1985, 1986 and 1992 the common HDDs of the time had
    actual 512B sectors, so if that argument had any merit, the i386
    (1985), MIPS R2000 (1986), SPARC (1986), and Alpha (1992) should have
    been introduced with 512B pages, but they actually were introduced
    with 4KB (386, MIPS, SPARC) and 8KB (Alpha) pages.

    Disk model: WD Blue SN580 2TB

    I can't find anything on its internal structure but I see the vendor's random read/write benchmarks all use 4K blocks so that's probably the internal block size.

    https://www.techpowerup.com/ssd-specs/western-digital-sn580-2-tb.d1542

    claims

    |Page Size: 16 KB
    |Block Size: 1344 Pages

    I assume that the "Block size" means the size of an erase block.
    Where does the number 1344 come from? My guess is that it has to do
    with:

    |Type: TLC
    |Technology: 112-layer

    3*112*4=1344

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 16 03:17:49 2025
    From Newsgroup: comp.arch

    On 8/15/2025 2:03 PM, Stephen Fuld wrote:
    On 8/15/2025 11:19 AM, BGB wrote:
    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to have "typically 97% hit rate".  I would go for larger pages, which would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and
    seemingly found a local optimum at around 16K.

    I think that is consistent with what some others have found.  I suspect
    the average page size should grow as memory gets cheaper, which leads to more memory on average in systems.  This also leads to larger programs,
    as they can "fit" in larger memory with less paging.  And as disks
    (spinning or SSD) get faster transfer rates, the cost (in time) of
    paging a larger page goes down.  While 4K was the sweet spot some
    decades ago, I think it has increased, probably to 16K.  At some point
    in the future, it may get to 64K, but not for some years yet.


    Some of the programs I have tested don't have particularly large memory footprints by modern standards (~ 10 to 50MB).

    Excluding very small programs (where the TLB miss rate becomes negligible),
    I had noted that 16K appeared to be reasonably stable.


    Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
    but 16K to 32K or 64K did not see any significant reduction; but did
    see a more significant increase in memory footprint due to allocation
    overheads (where, OTOH, going from 4K to 16K pages does not see much
    increase in memory footprint).

    Patterns seemed consistent across multiple programs tested, but harder
    to say if this pattern would be universal.


    I had noted, when running stats on where in the pages memory accesses land:
       4K: Pages tend to be accessed fairly evenly.
      16K: Minor variation as to what parts of the page are being used.
      64K: Significant variation between parts of the page.
    This was done by tracking per-page memory accesses at a finer granularity
    (e.g., 512 bytes).

    Interesting.


    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at
    all (and increasing page size only really sees benefit for TLB miss
    rate so long as the whole page is "actually being used").

    Not necessarily.  Consider the case of a 16K (or larger) page with two
    "hot spots" that are more than 4K apart.  That takes 2 TLB slots with 4K pages, but only one with larger pages.


    This is part of why 16K has an advantage.

    But, it drops off with 32K or 64K, as one may have a lot of large gaps
    of relatively little activity.

    So, rather than having a 64K page with two or more hot-spots ~ 30K apart
    or less, one may often just have a lot of pages with one hot-spot.

    Granted, my testing was far from exhaustive...


    One might think that a larger page would always be better for the TLB miss
    rate, but this assumes that, for most pages, most of the page is being
    accessed.

    Which, as noted, is fairly true at 4/8/16K, but seemingly not as true at
    32K or 64K.

    And, for the more limited effect of the larger page size on reducing TLB
    miss rate, one does have a lot more memory being wasted by things like "mmap()" type calls.


    Say, for example, you want to allocate 93K via "mmap()":
    4K pages: 96K (waste=3K, 3%)
    8K pages: 96K (waste=3K, 3%)
    16K pages: 96K (waste=3K, 3%)
    32K pages: 96K (waste=3K, 3%)
    64K pages: 128K (waste=35K, 27%)
    OK, 99K:
    4K: 100K (waste= 1K, 1%)
    8K: 104K (waste= 5K, 5%)
    16K: 112K (waste=13K, 12%)
    32K: 128K (waste=29K, 23%)
    64K: 128K (waste=29K, 23%)
    What about 65K:
    4K: 68K (waste= 3K, 4%)
    8K: 72K (waste= 7K, 10%)
    16K: 80K (waste=15K, 19%)
    32K: 96K (waste=31K, 32%)
    64K: 128K (waste=63K, 49%)

    ...


    So, bigger pages aren't great for "mmap()" with smaller allocation sizes.
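
    The rounding arithmetic behind these numbers, as a throwaway C sketch
    (illustrative only, not from the post; percentages are relative to the
    rounded-up allocation):

    #include <stdio.h>

    int main(void)
    {
        const unsigned req[]  = { 93, 99, 65 };        /* requested size, KB */
        const unsigned page[] = { 4, 8, 16, 32, 64 };  /* page size, KB      */
        for (unsigned i = 0; i < 3; i++) {
            for (unsigned j = 0; j < 5; j++) {
                unsigned pages   = (req[i] + page[j] - 1) / page[j];
                unsigned rounded = pages * page[j];    /* what mmap() hands back */
                unsigned waste   = rounded - req[i];
                printf("%2uK request, %2uK pages: %3uK (waste=%2uK, %.0f%%)\n",
                       req[i], page[j], rounded, waste,
                       100.0 * waste / rounded);
            }
        }
        return 0;
    }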


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Aug 16 10:00:18 2025
    From Newsgroup: comp.arch

    On 8/15/2025 10:09 PM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I don't think anyone has argued for 512B page sizes. There are two
    issues that are perhaps being conflated. One is whether it would be
    better if page sizes were increased from the current typical 4K to 16K.
    The other is about changing the size of blocks on disks (both hard disks
    and SSDs) from 512 bytes to 4K bytes.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 16 17:06:42 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I think we're agreeing that even in the early 1980s a 512 byte page was
    too small. They certainly couldn't have made it any smaller, but they
    should have made it larger.

    S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 05:47:01 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner loop
    (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    However, even without the loop overhead (which can be reduced with
    unrolling) that's 8 instructions per iteration, and therefore we will
    have a hard time executing it at less than 1cycle/iteration on current
    CPUs. What if we mix in some adc-based stuff to bring down the
    instruction count? E.g., with one adc-based and one cmov-based
    iteration:

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    add [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    Now we have 15 instructions per unrolled iteration (3 original
    iterations). Executing an unrolled iteration in less than three
    cycles might be in reach for Zen3 and Raptor Cove (I don't remember if
    all the other resource limits are also satisfied; the load/store unit
    may be at its limit, too).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 07:09:51 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one
    iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready
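
    As a plain-C reference for the single-iteration version earlier in this
    post (a sketch with made-up names, not code from the post): the next
    carry is computed from a[i]+b[i] alone, and the rare extra carry from
    folding in the incoming carry goes to the cold branch.

    #include <stdint.h>
    #include <stddef.h>

    /* dst[i] = a[i] + b[i] + carry-in; the carry stays 0 or 1. */
    static void add2_ref(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                         size_t n)
    {
        uint64_t carry = 0;                  /* ebx in the asm            */
        for (size_t i = 0; i < n; i++) {
            uint64_t sum = a[i] + b[i];
            uint64_t new_carry = sum < a[i]; /* cmovc (or setc) into edx  */
            sum += carry;                    /* add rax, rbx              */
            if (sum < carry)                 /* the rare jc incremen_edx  */
                new_carry++;
            dst[i] = sum;
            carry = new_carry;               /* mov ebx, edx              */
        }
    }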

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Aug 19 12:11:56 2025
    From Newsgroup: comp.arch

    Anton, I like what you and Michael have done, but I'm still not sure everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    Re breaking dependency chains (from Michael):

    In each iteration we have four inputs:

    carry_in from the previous iteration, [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8], and we want to generate [rdi+rcx*8] and the carry_out.

    Assuming effectively random inputs, cin+[rsi]+[r8]+[r9] will result in
    random low-order 64 bits in [rdi], and either 0, 1 or 2 as carry_out.

    In order to break the per-iteration dependency (per Michael), it is
    sufficient to branch out IFF adding cin to the 3-sum produces an
    additional carry:

    ; rdx = cin (0,1,2)
    next:
    mov rbx,rdx ; Save CIN
    xor rdx,rdx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc rdx,rdx ; RDX = 0 or 1 (50:50)
    add rax,[r9+rcx*8]
    adc rdx,0 ; RDX = 0, 1 or 2 (33:33:33)

    ; At this point RAX has the 3-sum, now do the cin 0..2 add

    add rax,rbx
    jc fixup ; Pretty much never taken

    save:
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ret ; don't fall through into the fixup

    fixup:
    inc rdx
    jmp save

    It would also be possible to use SETC to save the intermediate carries...
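
    The same idea in plain C, for the three-summand case just described (a
    sketch with made-up names, not code from the thread; cin and cout are
    the 0..2 carries):

    #include <stdint.h>
    #include <stddef.h>

    static void add3_ref(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                         const uint64_t *c, size_t n)
    {
        uint64_t cin = 0;                    /* carry_in: 0, 1 or 2        */
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[i] + b[i];
            uint64_t cout = s < a[i];        /* adc rdx,rdx                */
            uint64_t t = s + c[i];
            cout += t < s;                   /* adc rdx,0                  */
            uint64_t r = t + cin;            /* fold in the incoming carry */
            if (r < cin)                     /* the rarely-taken fixup     */
                cout++;                      /* cout still stays <= 2      */
            dst[i] = r;
            cin = cout;
        }
    }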

    Terje

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely.For random inputs-approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 17:20:54 2025
    From Newsgroup: comp.arch

    On Tue, 19 Aug 2025 07:09:51 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready
    and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It
    all makes more sense with that. ebx then contains the carry from
    the last cycle on entry. The carry dependency chain starts at
    clearing edx, then gets to additional carries, then is copied to
    ebx, transferred into the next iteration, and is ended there by
    overwriting ebx. No dependency cycles (except the loop counter and addresses, which can be dealt with by hardware or by unrolling), and
    ebx contains the carry from the last iteration

    One other problem is that according to Agner Fog's instruction
    tables, even the latest and greatest CPUs from AMD and Intel that he measured (Zen5 and Tiger Lake) can only execute one adc/adcx/adox
    per cycle, and adc has a latency of 1, so breaking the dependency
    chain in a beneficial way should avoid the use of adc. For our three-summand add, it's not clear if adcx and adox can run in the
    same cycle, but looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    Too many back-and-forth mental switches between Intel and AT&T syntax.
    The real code that I measured was for the Windows platform, but in AT&T
    (gnu) syntax.
    Below is the full function with the loop unrolled by 3. The rest I'll
    maybe answer later; right now I don't have time.

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %r13
    .seh_pushreg %r13
    pushq %r12
    .seh_pushreg %r12
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %rbp
    mov 16(%rcx,%rdx), %r10
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %rbp
    adc 16(%rcx,%r8), %r10
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %rbp
    adc 16(%rcx,%r9), %r10
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %rbp, 8(%rcx)
    mov %r10, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    popq %r12
    popq %r13
    ret

    .prop_carry:
    add $1, %rbp
    adc $0, %r10
    adc $0, %esi
    jmp .carry_done

    .seh_endproc








    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs, approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 17:24:23 2025
    From Newsgroup: comp.arch

    By mistake I posted above a variant that is not the most up to date, sorry.
    Here is the correct code:

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %r10
    mov 16(%rcx,%rdx), %r11
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %r10
    adc 16(%rcx,%r8), %r11
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %r10
    adc 16(%rcx,%r9), %r11
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %r10, 8(%rcx)
    mov %r11, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    ret

    .prop_carry:
    add $1, %r10
    adc $0, %r11
    adc $0, %esi
    jmp .carry_done

    .seh_endproc


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 17:43:25 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton, I like what you and Michael have done, but I'm still not sure everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    I implemented a two-summand addition, not three-summand. I wanted the
    minumum of complexity to make it easier to understand, and latency is
    a bigger problem for the two-summand case.

    It would also be possible to use SETC to save the intermediate carries...

    I must have had a bad morning. Instead of xor edx, edx, setc dl (also
    2 per cycle on Zen3), I wrote

    mov edi,1
    ...
    xor edx, edx
    ...
    cmovc edx, edi

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 23:03:01 2025
    From Newsgroup: comp.arch

    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are execution times of very heavily unrolled adcx/adox code with the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    the following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

    For Gracemont and Skylake there exists a possibility of a small
    measurement mistake, but Raptor Cove appears to be capable of at least 2
    instructions of this type per clock, while Zen3 is capable of at least 1.5,
    but more likely also 2.
    It looks to me that the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock.


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - a
    mov %rdx, %r11 # r11 = a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    mov 16(%r11), %rax
    adcx 16(%r11,%r8), %rax
    adox 16(%r11,%r9), %rax
    mov %rax, 16(%r10,%r11)

    mov 24(%r11), %rax
    adcx 24(%r11,%r8), %rax
    adox 24(%r11,%r9), %rax
    mov %rax, 24(%r10,%r11)

    mov 32(%r11), %rax
    adcx 32(%r11,%r8), %rax
    adox 32(%r11,%r9), %rax
    mov %rax, 32(%r10,%r11)

    mov 40(%r11), %rax
    adcx 40(%r11,%r8), %rax
    adox 40(%r11,%r9), %rax
    mov %rax, 40(%r10,%r11)

    mov 48(%r11), %rax
    adcx 48(%r11,%r8), %rax
    adox 48(%r11,%r9), %rax
    mov %rax, 48(%r10,%r11)

    mov 56(%r11), %rax
    adcx 56(%r11,%r8), %rax
    adox 56(%r11,%r9), %rax
    mov %rax, 56(%r10,%r11)

    mov 64(%r11), %rax
    adcx 64(%r11,%r8), %rax
    adox 64(%r11,%r9), %rax
    mov %rax, 64(%r10,%r11)

    mov 72(%r11), %rax
    adcx 72(%r11,%r8), %rax
    adox 72(%r11,%r9), %rax
    mov %rax, 72(%r10,%r11)

    mov 80(%r11), %rax
    adcx 80(%r11,%r8), %rax
    adox 80(%r11,%r9), %rax
    mov %rax, 80(%r10,%r11)

    mov 88(%r11), %rax
    adcx 88(%r11,%r8), %rax
    adox 88(%r11,%r9), %rax
    mov %rax, 88(%r10,%r11)

    mov 96(%r11), %rax
    adcx 96(%r11,%r8), %rax
    adox 96(%r11,%r9), %rax
    mov %rax, 96(%r10,%r11)

    mov 104(%r11), %rax
    adcx 104(%r11,%r8), %rax
    adox 104(%r11,%r9), %rax
    mov %rax, 104(%r10,%r11)

    mov 112(%r11), %rax
    adcx 112(%r11,%r8), %rax
    adox 112(%r11,%r9), %rax
    mov %rax, 112(%r10,%r11)

    mov 120(%r11), %rax
    adcx 120(%r11,%r8), %rax
    adox 120(%r11,%r9), %rax
    mov %rax, 120(%r10,%r11)

    lea 136(%r11), %r11

    mov -8(%r11), %rax
    adcx -8(%r11,%r8), %rax
    adox -8(%r11,%r9), %rax
    mov %rax, -8(%r10,%r11)

    mov %ebx, %eax # EAX <= 0
    adcx %ebx, %eax # EAX <= CF, CF <= 0
    adox %ebx, %eax # EAX += OF, OF <= 0

    add %rcx, %rsi
    jc .prop_carry
    .carry_done:
    mov %rsi, -136(%r10,%r11)
    mov %eax, %ecx
    dec %edx
    jnz .loop

    # last 3
    mov (%r11), %rax
    mov 8(%r11), %rdx
    mov 16(%r11), %rbx
    add (%r11,%r8), %rax
    adc 8(%r11,%r8), %rdx
    adc 16(%r11,%r8), %rbx
    add (%r11,%r9), %rax
    adc 8(%r11,%r9), %rdx
    adc 16(%r11,%r9), %rbx
    add %rcx, %rax
    adc $0, %rdx
    adc $0, %rbx
    mov %rax, (%r10,%r11)
    mov %rdx, 8(%r10,%r11)
    mov %rbx, 16(%r10,%r11)

    lea (-1020*8)(%r10,%r11), %rax
    popq %rbx
    popq %rsi
    ret

    .prop_carry:
    lea -128(%r10,%r11), %rbx
    xor %ecx, %ecx
    addq $1, (%rbx)
    adc %rcx, 8(%rbx)
    adc %rcx, 16(%rbx)
    adc %rcx, 24(%rbx)
    adc %rcx, 32(%rbx)
    adc %rcx, 40(%rbx)
    adc %rcx, 48(%rbx)
    adc %rcx, 56(%rbx)
    adc %rcx, 64(%rbx)
    adc %rcx, 72(%rbx)
    adc %rcx, 80(%rbx)
    adc %rcx, 88(%rbx)
    adc %rcx, 96(%rbx)
    adc %rcx,104(%rbx)
    adc %rcx,112(%rbx)
    adc %rcx,120(%rbx)
    adc %ecx, %eax
    jmp .carry_done
    .seh_endproc








    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 20 01:49:41 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I think we're agreeing that even in the early 1980s a 512 byte page was
    too small. They certainly couldn't have made it any smaller, but they
    should have made it larger.

    S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.

    Several posts above I wrote:

    : I think that in 1979 VAX 512 bytes page was close to optimal.
    : Namely, IIUC smallest supported configuration was 128 KB RAM.
    : That gives 256 pages, enough for sophisticated system with
    : fine-grained access control.

    Note that 360 has optional page protection used only for access
    control. In 370 era they had legacy of 2k or 4k pages, and
    AFAICS IBM was mainly aiming at bigger machines, so they
    were not so worried about fragmentation. PDP-11 experience
    possibly contributed to using smaller pages for VAX.

    Microprocessors were designed with different constraints, which
    led to bigger pages. But the VAX apparently could afford a reasonably
    large TLB, and due to the VMS structure the gain was bigger than for
    other OSes.

    And a little correction: the VAX architecture handbook is dated 1977,
    so the decision about the page size actually had to be made by 1977,
    and possibly earlier.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Aug 20 02:49:26 2025
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
    ...

    Note that 360 has optional page protection used only for access
    control. In 370 era they had legacy of 2k or 4k pages, and
    AFAICS IBM was mainly aiming at bigger machines, so they
    were not so worried about fragmentation.

    I don't think so. The smallest 370s were 370/115 with 64K to 192K of
    RAM, 370/125 with 96K to 256K, both with paging hardware and running
    DOS/VS. The 115 was shipped in 1973, the 125 in 1972.

    PDP-11 experience possibly contributed to using smaller pages for VAX.

    The PDP-11's pages were 8K which were too big to be used as pages so
    we used them as a single block for swapping. When I was at Yale I did
    a hack that mapped the 32K display memory for a bitmap terminal into
    the high half of the process' data space but that left too little room
    for regular data so we addressed the display memory a different way that
    didn't use up address space.

    Microprocessors were designed with different constraints, which
    led to bigger pages. But the VAX apparently could afford a reasonably
    large TLB, and due to the VMS structure the gain was bigger than for
    other OSes.

    I can only guess what their thinking was, but I can tell you that
    at the time the 512 byte pages seemed oddly small.

    And a little correction: the VAX architecture handbook is dated 1977,
    so the decision about the page size actually had to be made by 1977,
    and possibly earlier.

    The VAX design started in 1976, well after IBM had shipped those
    low end 370s with tiny memories and 2K pages.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 20 10:50:39 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are execution times of very heavily unrolled adcx/adox code with the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    the following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

    For Gracemont and Skylake there exists a possibility of a small
    measurement mistake, but Raptor Cove appears to be capable of at least 2
    instructions of this type per clock, while Zen3 is capable of at least 1.5,
    but more likely also 2.
    It looks to me that the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - c
    mov %rdx, %r11 # r11 - a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    [snipped the rest]


    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain two
    carry bits without having to save them off to an additional register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 20 14:16:55 2025
    From Newsgroup: comp.arch

    On Wed, 20 Aug 2025 10:50:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:



    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain
    two carry bits without having to save them off to an additional
    register.

    Terje


    It is interesting as an exercise in ADX extension programming, but in
    practice it is only 0-10% faster than the much simpler and smaller code
    presented in the other post, which uses no ISA extensions and so runs on
    every AMD64 CPU since the K8.
    I suspect that this result is quite representative of the gains that
    can be achieved with ADX. Maybe, if there is a crypto requirement of
    independence of execution time from inputs, the gain would be somewhat
    bigger, but even there I would be very surprised to find a 1.5x gain.
    Overall, I think that time spent by Intel engineers on invention of ADX
    could have been spent much better.


    Going back to the task of 3-way addition, another approach that can
    utilize the same idea of breaking data dependency is using SIMD.
    In the case of the 4 cores that I tested, SIMD means AVX2.
    These are the results of an AVX2 implementation that unrolls by two,
    i.e. 512 output bits per iteration of the inner loop.

    Platform RC GM SK Z3
    add3_avxq_u2 226.7 823.3 321.1 309.5

    The speed is about equal to the more unrolled ADX variant on RC, faster on
    Z3, much faster on SK, and much slower on GM. Unlike ADX, it runs on
    Intel Haswell and on a few pre-Zen AMD CPUs.

    .file "add3_avxq_u2.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    subq $56, %rsp
    .seh_stackalloc 56
    vmovups %xmm6, 32(%rsp)
    .seh_savexmm %xmm6, 32
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx # %rdx - a-dst
    sub %rcx, %r8 # %r8 - b-dst
    sub %rcx, %r9 # %r9 - c-dst
    vpcmpeqq %ymm6, %ymm6, %ymm6
    vpsllq $63, %ymm6, %ymm6 # ymm6[0:3] = msbit = 2**63
    vpxor %xmm5, %xmm5, %xmm5 # ymm5[0] = carry = 0
    mov $127, %eax
    .loop:
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 32(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[4:7] = a[4:7] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 32(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[4:7] = iA[4:7] + b[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[4:7] = iA[4:7] > iSum1[4:7]
    vpaddq 32(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum1[4:7] + c[4:7]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[4:7] = iSum1[4:7] > iSum2[4:7]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum0[4:7] = c1[4:7] + c2[4:7]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0:2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0x93, %ymm3, %ymm5
    # ymm5[0:3] = cSum0[7,4:6] == carry
    vpblendd $3, %ymm4, %ymm5, %ymm3
    # ymm3[0:3] = cSum[4:7] = { cSum0[3], cSum0[4:6] }
    .add_carry:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm3[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vpor %ymm0, %ymm1, %ymm4
    vptest %ymm4, %ymm4
    jne .prop_carry
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %ymm1, 32(%rcx)
    addq $64, %rcx
    dec %eax
    jnz .loop

    # last 7
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 24(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[3:6] = a[3:6] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 24(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[3:6] = iA[3:6] + b[3:6]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[3:6] = iA[3:6] > iSum1[3:6]
    vpaddq 24(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[3:6] = iSum1[3:6] + c[3:6]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[3:6] = iSum1[3:6] > iSum2[3:6]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum[4:7] = cSum0[3:6] = c1[3:6] + c2[3:6]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0,1,2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0xF9, %ymm1, %ymm1
    # ymm1[0:3] = iSum2[4:6,6]
    .add_carry2:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm1[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vptest %ymm0, %ymm0
    jne .prop_carry2
    vptest %ymm1, %ymm1
    jne .prop_carry2
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %xmm1, 32(%rcx)
    vextractf128 $1, %ymm1, %xmm1
    vmovq %xmm1, 48(%rcx)

    lea -(127*64)(%rcx), %rax
    vzeroupper
    vmovups 32(%rsp), %xmm6
    addq $56, %rsp
    ret

    .prop_carry:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vpaddq %xmm2, %xmm5, %xmm5
    # ymm5[0] = carry += c3[7]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry

    .prop_carry2:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry2

    .seh_endproc

    AVX2 is rather poorly suited for this task - it lacks unsigned
    comparison instructions, so the first input should be shifted by
    half-range at the beginning and the result should be shifted back.
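
    The half-range trick looks roughly like this in C intrinsics, for one
    lane-wise 4x64-bit add (a minimal sketch assuming AVX2 and <immintrin.h>;
    the function name is made up, and this shows only the carry-detection
    idea, not the code above):

    #include <immintrin.h>

    /* Lane-wise a+b plus a carry mask, using only AVX2's *signed* 64-bit
       compare: XOR with 2^63 (the same as adding 2^63 mod 2^64) moves the
       values into the signed domain, the add is done there, and a signed
       compare detects the wrap-around; XOR the sum back at the end. */
    static inline __m256i add_u64_carry(__m256i a, __m256i b, __m256i *carry)
    {
        const __m256i msbit = _mm256_set1_epi64x((long long)0x8000000000000000ULL);
        __m256i ia   = _mm256_xor_si256(a, msbit);  /* biased a                   */
        __m256i isum = _mm256_add_epi64(ia, b);     /* biased a+b                 */
        *carry = _mm256_cmpgt_epi64(ia, isum);      /* all-ones where a+b wrapped */
        return _mm256_xor_si256(isum, msbit);       /* unbiased sum               */
    }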

    AVX-512 can be more suitable. But the only AVX-512 capable CPU that I
    have access to is a mini PC with a cheap and slow Core i3, used by family
    members almost exclusively for viewing movies. It does not even have
    minimal programming environments installed.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 20 14:08:34 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Overall, I think that time spent by Intel engineers on invention of ADX
    could have been spent much better.

    The whitepapers about ADX are about long multiplication and squaring
    (for cryptographic uses), and they are from 2012, and ADX was first
    implemented in Broadwell (2014), when microarchitectures were quite a
    bit narrower than the recent ones.

    If you implement the classical long multiplication algorithm, but add
    each line to the intermediate sum as you create it, you need an
    operation like

    intermediateresult += multiplicand*multiplicator[i]

    where all parts are multi-precision numbers, but only one word of the multiplicator is used. The whole long multiplication would look
    somewhat like:

    intermediateresult=0;
    for (i=0; i<n; i++) {
    intermediateresult += multiplicand*multiplicator[i];
    shift intermediate result by one word; /* actually, you will access it at */
    /* an offset, but how to express this in this pseudocode? */ }

    The operation for a single line can be implemented as:

    carry=0;
    for (j=0; j<m; j++) {
    uint128_t d = intermediateresult[j] +
    multiplicand[j]*(uint128_t)multiplicator[i] +
    (uint128_t)carry;
    intermediateresult[j] = d; /* low word */
    carry = d >> 64;
    }
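
    Made compilable (a sketch assuming a compiler with unsigned __int128,
    e.g. gcc or clang; the names are mine, and result has n+m words and is
    zeroed by the caller), the same algorithm is:

    #include <stdint.h>
    #include <stddef.h>

    typedef unsigned __int128 u128;

    static void longmul(uint64_t *result,                /* n+m words, zeroed */
                        const uint64_t *multiplicand, size_t m,
                        const uint64_t *multiplicator, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint64_t carry = 0;
            for (size_t j = 0; j < m; j++) {
                u128 d = (u128)result[i + j]
                       + (u128)multiplicand[j] * multiplicator[i]
                       + carry;
                result[i + j] = (uint64_t)d;             /* low word  */
                carry = (uint64_t)(d >> 64);             /* high word */
            }
            result[i + m] = carry;   /* "shift by one word" = line i lands at offset i */
        }
    }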

    The computation of d (both words) can be written on AMD64 as:

    #multiplicator[i] in rax
    mov multiplicator[i], rax
    mulq multiplicand[j]
    addq intermediateresult[j], rax
    adcq $0, rdx
    addq carry, rax
    adcq $0, rdx
    mov rdx, carry

    With ADX and BMI2, this can be coded as:

    #carry is represented as carry1+C+O
    mulx ma, m, carry2
    adcx mb, m
    adox carry1, m
    #carry is represented as carry2+C+O
    #unroll by an even factor, and switch roles for carry1 and carry2

    Does it matter? We can apply the usual blocking techniques to
    perform, say, a 4x4 submultiplication in the registers (probably even
    a little larger, but let's stick with these numbers). That's 16
    mulx/adox/adcx combinations, loads of 4+4 words of inputs and stores
    of 8 words of output. mulx is probably limited to one per cycle, but
    if we want to utilize this on a 4-wide machine like the Broadwell, we
    must have at most 3 additional instructions per mulx; with ADX, one
    additional instruction is adcx, another adox, and the third is either
    a load or a store. Any additional overhead, and the code will be
    limited by resources other than the multiplier.

    On today's CPUs, we can reach the 1 mul/cycle limit with the x86-64-v1
    code shown before the ADX code. But then, they might put a second
    multiplier in, and we would profit from ADX again.

    But Intel seems to have second thoughts on ADX itself. ADX has not
    been included in x86-64-v4, despite the fact that every CPU that
    supports the other extensions of x86-64-v4 also supports ADX. And the
    whitepapers have vanished from Intel's web pages. Some time ago I still
    found it on https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
    (i.e., Intel China), but now it's gone there, too. I can still find
    it on <https://raw.githubusercontent.com/wiki/intel/intel-ipsec-mb/doc/ia-large-integer-arithmetic-paper.pdf>

    There is another whitepaper on using ADX for squaring numbers, but I
    cannot find that. Looking around what Erdinç Öztürk (aka Erdinc
    Ozturk) has also written, there's a patent "SIMD integer
    multiply-accumulate instruction for multi-precision arithmetic" from
    2016, so maybe Intel's thoughts are now into doing it with SIMD
    instead of with scalar instructions.

    Still, why deemphasize ADX? Do they want to drop it eventually? Why?
    They have to support separate renaming of C, O, and the other three
    because of instructions that go much farther back. The only way would
    be to provide alternatives to these instructions, and then deemphasize
    them over time, and eventually rename all flags together (and the old instructions may then perform slowly).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Aug 21 21:48:03 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:

    I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
    My 66000 calls this mode of operation "safe stack".

    This sounds like an idea worth stealing, although no doubt the way I
    would attempt to copy it would be a failure which removed all the
    usefulness of it.

    For one thing, I don't have a stack for calling subroutines, or any other purpose.

    But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.

    The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
    of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
    that.)

    In reverse order:
    If the TI 9900 had used its registers like a write-back cache, then typical accesses would be fast and efficient. When the register pointer is altered, the old file is written out en masse and a new file is read in en masse {possibly with some buffering to lessen the visible cycle count} ... but I digress.

    {Conceptually}
    My 66000 uses this concept for its GPRs and for its Thread State but only at context switch time, not for subroutine calls and returns. HW saves and restores Thread State and Registers on context switches so that the CPU
    never has to Disable Interrupts (it can, it just doesn't have to). {/Conceptually}
    I bracketed the above with 'Conceptually' because it is completely
    reasonable to envision a Core big enough to have 4 copies of Thread
    State and Register files, and bank switch between them. The important
    property is that the switch delivers control reentrantly; HOW any
    given implementation does that is NOT architecture--that it does IS architecture.

    I specifically left how many registers are preserved per CALL up to SW,
    because up to 50% of calls need 0, and only a few % require more than 4.
    This appears to indicate that SPARC using 8 was overkill ... but I digress again.

    Safe Stack is a stack used for preserving the ABI contract between caller
    and callee even in the face of buffer overruns, RoP, and other malicious program behavior. SS places the return address and the preserved registers
    in an area of memory where LD and ST instructions have no access (RWE = 000) but ENTER, EXIT, and RET do. This was done in such a way that correct code
    runs both with SS=on and SS=off, so the compiler does not have to know.

    Only CALL, CALX, RET, ENTER, and EXIT are aware of the existence of SS
    and only in HW implementations.

    I have harped on you for a while to start development of your compiler.
    One of the first things a compiler needs to do is to develop its means
    to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, and dealing with TRY-THROW-CATCH SW-defined exception handling. I KNOW of nobody who does this without some kind of stack.

    I happen to use 2 such stacks mostly to harden the environment at low
    cost to malicious attack vectors. It comes with benefits: Lines removed
    from SS do not migrate to L2 or even DRAM, they can be discarded at
    end-of-use, reducing memory traffic; the SW contract between Caller and
    Callee is guaranteed even in the face of malicious code; it can be used
    as a debug tool to catch malicious code. ...

    NOTE: malicious code can still damage data*, just not the preserved regs
    or the return address, guaranteeing that control returns to the instruction following CALL. And all without adding a single instruction to the CALL/RET
    instruction sequence.

    (*) memory

    So I've probably completely misunderstood you here.

    Not the first time ...

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 23 08:51:34 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    I have harped on you for a while to start development of your compiler.
    One of the first things a compiler needs to do is to develop its means
    to call subroutines and return back. This requires a philosophy of passing arguments, returning results, dealing with recursion, dealing with TRY- THROW-CATCH SW defined exception handling. I KNOW of nobody who does this without some kind of stack.

    There is one additional, quite thorny issue: how to maintain
    state for nested functions that are to be invoked via pointers, which
    have to have access to local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
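
    A minimal GNU C illustration of the problem (a sketch, not the
    gcc-internal mechanism): taking the address of a nested function that
    captures an outer local forces gcc to materialize a trampoline on the
    stack, which is why the stack then has to be executable.

        #include <stdio.h>

        static void apply(void (*f)(int), int n)
        {
            for (int i = 0; i < n; i++)
                f(i);
        }

        int main(void)
        {
            int sum = 0;
            void add(int i) { sum += i; }   /* nested, captures 'sum'      */
            apply(add, 5);                  /* plain pointer => trampoline */
            printf("%d\n", sum);            /* prints 10                   */
            return 0;
        }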

    I think we discussed this for My 66000 some time ago, but there
    is no resolution as yet.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 23 19:36:47 2025
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    John Levine <johnl@taugh.com> wrote:
    It also seems rather high for the /91. I can't find any authoritative
    numbers but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big but it wasn't *that* big.

    I remember this number, but do not remember where I found it. So
    it may be wrong.

    However, one can estimate possible density in a different way: a package
    probably of similar dimensions to a VAX package can hold about 100 TTL
    chips. I do not have detailed data about chip usage and transistor
    counts for each chip. A simple NAND gate is 4 transistors, but the input
    transistor has two emitters and really works like two transistors,
    so it is probably better to count it as 2 transistors, and consequently
    consider a 2-input NAND gate as having 5 transistors. So a 74S00 gives
    20 transistors. A D-flop is probably about 20-30 transistors, so a
    74S74 is probably around 40-60. A quad D-flop brings us close to 100.
    I suspect that in VAX times octal D-flops were available. There
    were 4-bit ALU slices. Also, multiplexers need a nontrivial number
    of transistors. So I think that 50 transistors is a reasonable (maybe
    low) estimate of average density. Assuming 50 transistors per chip,
    that would be 5000 transistors per package. Packages were rather
    flat, so when mounted vertically one probably could allocate 1 cm
    of horizontal space for each. That would allow 30 packages at a
    single level. With 7 levels we get 210 packages, enough for
    1 million transistors.

    I don't see what this could have to do with the 360/91. As I said,
    it was built with SLT, a few individual transistors and resistors
    per package.

    IBM's first integrated circuits were MST, used in the 360/85 and System/3.
    Pugh et al are kind of vague about how many transistors per chip, but they
    show an exemplar circuit with six transistors and say there were up to four
    circuits per chip, so that's still only about two dozen transistors per chip.

    I realize that densities got a lot greater, but that was S/370 and beyond.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Aug 26 21:46:24 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 7/28/2025 6:18 PM, John Savard wrote:
    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.


    The use of addressing modes drops off pretty sharply though.

    Like, if one could stat it out, one might see a static-use pattern
    something like:
    80%: [Rb+disp]
    15%: [Rb+Ri*Sc]
    3%: (Rb)+ / -(Rb)
    1%: [Rb+Ri*Sc+Disp]
    <1%: Everything else

    Since RISC-V only has [Rb+disp12], the other 20% take at least 2 instructions. Simple math indicates this requires 1.2+ instructions/mem-ref instead of 1.0 instructions/mem-ref. The small disp12 range does not help either.
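
    As a rough sketch of where the extra instructions come from (the exact
    code depends on the compiler, and the scaled-index alternative in the
    comment is the generic form, not any particular ISA's syntax):

        long get(long *a, long i)
        {
            return a[i];              /* one memory reference in the source */
        }

        /* Plausible RV64 code, lacking a scaled-index mode:
               slli  a1, a1, 3       # i * 8
               add   a1, a0, a1      # a + i*8
               ld    a0, 0(a1)       # the actual load
           versus a single load such as LDD Rd,[Ra,Ri<<3] on an ISA
           with [Rb+Ri*Sc] addressing. */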

    My 66000 does not have (Rb)+ or -(Rb), and most RISC machines don't either.
    On the other hand, I see more [Rb+Ri<<s+disp] than 1%--more like 3%-4%--
    partially due to using indexing rather than incrementation when doing
    loops:

        MOV   Ri,#0
        VEC   R15,{}
        LDD   R9,[R3,Ri<<3+disp]
        calk
        LOOP  LT,Ri,#1,Rn

    instead of:

        MOV   Ri,#0
        LDA   R14,[R3+disp]
        VEC   R15,{}
        LDD   R9,(R14)+
        calk
        LOOP  LT,Ri,#1,Rn

    {and the second loop has an additional ADD in it}

    Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.

    Granted, the dominance of [Rb+Disp] does drop off slightly when
    considering dynamic instruction use. Part of it is due to the
    prolog/epilog sequences.

    I have a lot of [IP,DISP] due to the way the compiler places data.

    If one had instead used (SP)+ and -(SP) addressing for prologs and
    epilogs, then one might see around 20% or so going to these instead.
    Or, if one had PUSH/POP, to PUSH/POP.

    ENTER and EXIT compress prologues and epilogues to a single instruction
    each. They also have the option of placing the preserved registers in
    a place where the called subroutine cannot damage them.

    The discrepancy between static and dynamic instruction counts is then mostly due to things like loops and similar.

    Estimating the effect of loops in a compiler is hard, but I had noted that
    assuming a scale factor of around 1.5^D for loop nesting level D
    seemed to be in the right area. Many loops end up never being reached,
    or only running a few times, so, possibly counter-intuitively,
    it is often faster to assume that a loop body will likely only cycle 2
    or 3 times rather than 100s or 1000s; trying to aggressively
    optimize loops by assuming large N tends to be detrimental to performance.

    VAX compilers assumed a loop count of 10 and did OK for their era. A
    low count (like 10) balances the small loops (letters in a name)
    against the larger loops like Matrix300.

    Well, and at least thus far, profiler-driven optimization isn't really a thing in my case.


    -----------------------

    One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
    ISA has a lot of registers, the relative benefit of LoadOp is reduced.

    LoadOp being mostly a benefit if the value is loaded exactly once, and
    there is some other ALU operation or similar that can be fused with it.

    Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
    z=arr[i]+x;


    But, the relative incidence of things like this is low enough as to not
    save that much.

    The other thing is that one has to implement it in a way that does not increase pipeline length,

    This is the key point about LD-OPs:: if you build a pipeline to support
    them, then you will suffer when the instruction stream is independent RISC-
    like instructions--conversely, if you build the pipeline for RISC-like instructions, LD-OPs take a penalty unless you buy off on Medium OoO, at
    least.

    since if one makes the pipeline longer for the
    sake of LoadOp or OpStore, then this is likely to be a net negative for performance vs prioritizing Load/Store, unless the pipeline had already needed to be lengthened for other reasons.

    And thus, this is why RISC-machines largely avoid LD-OPs.

    One can be like, "But what if the local variables are not in registers?"
    but on a machine with 32 or 64 registers, most likely your local
    variable is already going to be in a register.

    So, the main potential merit of LoadOp is that it "doesn't hurt as bad on a
    register-starved machine".
    So does poking your eye with a hot knife.

    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.


    Yeah.

    There are some living descendants of that family, but pretty much
    everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Aug 27 00:35:18 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not expect compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    Much more than just "well aware": there were at least 15 grad students
    at CMU working on optimizing compilers AND the VAX ISA, as well as Wulf, Newell, and Bell leading the pack.

    POLY would have made sense in a world where microcode makes sense: if microcode can be executed faster than subroutines, put a building
    block for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    Hold on a minute:: My Transcendentals are done in POLY-like fashion;
    it is just that the constants come from ROM inside the FPU, instead
    of user-defined coefficients in DRAM. Thus, POLY is good, POLY as an
    instruction is bad.
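
    For reference, the core of what POLY computes is just Horner evaluation
    over a coefficient table; a hedged C sketch (illustrative names, not the
    full VAX semantics):

        /* Evaluate c[0] + c[1]*x + ... + c[degree]*x^degree by Horner's
           rule: one multiply-add per coefficient.  VAX POLY walked a
           user-supplied table like this in microcode; a ROM-coefficient
           FPU sequence does the same with built-in tables. */
        double poly_eval(double x, const double *c, int degree)
        {
            double r = c[degree];
            for (int i = degree - 1; i >= 0; i--)
                r = r * x + c[i];
            return r;
        }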

    Related to the microcode issue they also don't seem to have anticipated how important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot easier to pipeline.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Compilers have taught us that one-address-mode per instruction is
    "sufficient" {if you are going to have address modes.}

    My work on My 66000 has taught me that 1 constant per instruction
    is nearly sufficient. The only places I break this is ST #val[disp]
    and LOOP cnd,Ri,#inc,#max.

    Pipeline work from 1983 to the present has shown that separate LD and OP
    instructions perform just as fast as LD+OP. Also, there are ways to perform
    LD+OP as if it were LD and OP, and there are ways to perform LD and OP as
    if it were LD+OP.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one would still
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    Condition codes get hard when DECODE width grows greater than 3.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Aug 27 01:01:21 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That is actually what we did on the Mc88100, and while a lot better than
    just [Base+Disp] it is still not as good as [RB+Ri<<s+Disp]; the latter
    saves instructions that merely create constants.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 27 05:12:57 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    10.5 on a characteristic mix, actually.

    See "A Characterization of Processor Performance in the VAX-11/780"
    by Emer and Clark, their Table 8.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 27 17:19:06 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.
    ...
    [...] POLY as an
    instruction is bad.

    Exactly.

    One must remember that VAX was a 5-cycle per instruction machine !!!
    (200ns : 1 MIP)

    It's better to forget this misinformation, and instead remember that
    the VAX has an average CPI of 10.6 (Table 8 of <https://american.cs.ucdavis.edu/academic/readings/papers/p301-emer.pdf>)

    Table 9 of that reference is also interesting:

    CALL/RET instructions take an average of 45 cycles, Character
    instructions (I guess this means stuff like EDIT) take an average of 117
    cycles, and Decimal instructions take an average of 101 cycles. It seems
    that these instructions all have no special hardware support on the
    VAX 11/780 and do it all through microcode. So replacing Character
    and Decimal instructions with calls to functions on a RISC-VAX could
    easily outperform the VAX 11/780 even without special hardware
    support. Now add decimal support like the HPPA has done or string
    support like the Alpha has done, and you see even better speed for
    these instructions.

    For CALL/RET, one might use one of the modern calling conventions.
    However, this loses some capabilities compared to the VAX. So one may
    prefer to keep frame pointers by default and maybe other features that
    allow, e.g., universal cross-language debugging on the VAX without monstrosities like ELF and DWARF.

    Pipeline work over 1983-to-current has shown that LD and OPs perform
    just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
    were LD and OP, and there are way to perform LD and OP as if it were
    LD+OP.

    I don't know what you are getting at here. When implementing the 486,
    Intel chose the following pipeline:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2/OP
    Writeback

    This meant that load-and-op instructions take 2 cycles (and RMW
    instructions take three); it gave us the address-generation interlock (op-to-load latency 2), and 3-cycle taken branches. An alternative
    would have been:

    Instruction Fetch
    Instruction Decode
    Mem1
    Mem2
    OP
    Writeback

    This would have resulted in a max throughput of 1 CPI for sequences of load-and-op instructions, but would have resulted in an AGI of 3
    cycles, and 4-cycle taken branches.

    For the Bonnell Intel chose such a pipeline (IIRC with a third mem
    stage), but the Bonnell has a branch predictor, so the longer branch
    latency usually does not strike.

    AFAIK IBM used such a pipeline for some S/360 descendants.

    Condition codes get hard when DECODE width grows greater than 3.

    And yet the widest implementations (up to 10 wide up to now) are of
    ISAs that have condition-code registers. Even particularly nasty ones
    in the case of AMD64.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Aug 28 15:10:55 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    It appears that Waldek Hebisch <antispam@fricas.org> said:
    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    But how about this?

    ADDL3 (R2)+,(R2)+,(R2)+

    Now you need at least two temps, the second of which depends on the
    first, and there are instructions with six operands. Or how about
    this:

    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
    It must have seemed clever at the time, but ugh.


    What we must all realize is that each address mode in VAX was a microinstruction all unto itself.

    And that is why it was not pipelineable in any real sense.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Aug 29 10:34:31 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    John Levine <johnl@taugh.com> posted:

    It appears that Waldek Hebisch <antispam@fricas.org> said:
    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3
    But how about this?

    ADDL3 (R2)+,(R2)+,(R2)+

    Now you need at least two temps, the second of which depends on the
    first, and there are instructions with six operands. Or how about
    this:

    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
    It must have seemed clever at the time, but ugh.


    What we must all realize is that each address mode in VAX was a microinstruction all unto itself.

    And that is why it was not pipelineable in any real sense.

    Yes. The instructions are designed to be parsed by a byte-code interpreter
    in microcode. Even on the NVAX in 1992, Decode can only produce one
    operand per clock.

    If that operand is one of the complex memory address modes then it
    might be possible to dispatch it and let the back end chew on it
    while Decode works on the second operand.

    But that assumes the operands are in slow memory. If they are in fast
    registers, then it stalls waiting for the second and third operands to be decoded, making a pipeline pointless.

    And since programs mostly put operands in registers it stalls at Decode.

    One might say we should just build a fast decoder. But if you look at
    the instruction formats, even the simplest 2-register instructions are
    3 bytes and would require looking at 24 instruction bits and 3 valid bits,
    or 27 bits, at once. The 3-operand rs1,rs2,rd instructions are 36 bits.

    That decoder has to deal with 2^27 or 2^36 possibilities!
    And that just handles 2 and 3 register instructions, no memory references.

    It is hypothetically possible with a pre-decode stage to compact those
    down to 17 bits for 2 register and 21 bits for 3 register but that is
    still too many possibilities. That just throws transistors at a problem
    that never needed to exist in the first place, and would still not be affordable in 1992 NMOS, certainly not in 1975 TTL.

    If we look at what the VAX is actually spending most of its time on,
    2 and 3 register ALU operations, those can be decoded in parallel by
    looking at 10 bits (8 opcode + 2 valid) for 2 register,
    15 bits (12 opcode + 3 valid) for 3 register instructions.
    Which is quite doable in 1975 TTL in 1 clock.
    And that allows the pipeline to not stall at Decode.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Aug 29 17:07:03 2025
    From Newsgroup: comp.arch

    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one or the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other
    language has this weird combination of features?


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 30 01:47:03 2025
    From Newsgroup: comp.arch

    On 8/29/2025 4:07 PM, Stefan Monnier wrote:
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one of the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other language has this weird combination of features?


    FWIW, BGBCC has this (as both a C extension, and within my rarely-used
    BS2 language).

    But, yeah, in this case, the general idea is that lambdas consist of 2
    or 3 parts:
    The function body, located in ".text";
    The data area holding the captured scope;
    An executable "thunk", which loads the data pointer and transfers
    control to the body (may be either RWX memory, or from a "pool" of
    possible function pointers).


    When implemented as RWX heap, the data area directly follows the thunk,
    and both are located in special executable heap memory. An
    automatic-only capture-by-reference form exists, but still uses heap
    memory for this (but these heap allocations will be freed automatically).

    So, the lambdas look the same as normal C function pointers in this way,
    but creating new lambdas may leak memory if they are not freed.


    There is another option which I have used sometimes which doesn't
    require RWX memory, but which may technically abuse the C ABI:
    Create a pool of functions with a more generic argument list, and then allocate lambdas from the pool. Each function in the pool pulls its
    data-area pointer from an array, with each function in the pool having a corresponding array index (with a set upper limit to the maximum number
    of lambdas).
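
    A hedged C sketch of that pool idea, under the stated assumption that one
    generic signature (here: a single 64-bit argument and result) covers all
    lambdas; all names are illustrative, not BGBCC's actual runtime:

        #include <stdint.h>
        #include <stddef.h>

        #define POOL_MAX 256
        typedef uint64_t (*generic_fn)(void *env, uint64_t arg);

        static generic_fn pool_body[POOL_MAX];   /* real lambda bodies    */
        static void      *pool_env [POOL_MAX];   /* captured environments */

        /* One tiny stub per slot; &pool_stub_N is a plain C function
           pointer that finds its data area via its fixed index.        */
        #define DEF_STUB(N) \
            static uint64_t pool_stub_##N(uint64_t arg) \
            { return pool_body[N](pool_env[N], arg); }
        DEF_STUB(0) DEF_STUB(1) DEF_STUB(2) /* ... more, up to the limit */

        static uint64_t (*const pool_entry[])(uint64_t) = {
            pool_stub_0, pool_stub_1, pool_stub_2 /* ... */
        };

        /* Hand out the next free slot; returns a plain function pointer,
           or NULL once the pool (the hard limit) is exhausted.          */
        uint64_t (*make_lambda(generic_fn body, void *env))(uint64_t)
        {
            static size_t next;
            if (next >= sizeof pool_entry / sizeof pool_entry[0])
                return NULL;
            pool_body[next] = body;
            pool_env [next] = env;
            return pool_entry[next++];
        }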

    Though, arguably, if the number of "live lambdas" is large, or the
    lambdas are never freed, there is a problem with the program
    (and even if an implementation has a hard limit of, say, 256 or 1024
    live lambda instances, this usually isn't too much of a problem).


    This strategy works better for ABIs which pass every argument in
    basically the same way (or can be made to look like such). If these
    functions need to care about argument number or types (*), it becomes a
    much harder problem.

    *: Though, usually limited to a scheme like JVM-style I/L/F/D/A, as this
    is sufficient, but "X*5^(1..n)" is still a much bigger number than X,
    meaning 'n' (the maximum number of arguments) would need to be kept
    small. This does not scale well...


    For contrast, if one knows, for example, that in the ABI every relevant argument is passed the same way regardless of type (say, as a fixed 64-bit element), and that any 128-bit arguments are passed as an even-numbered
    pair or similar (and we can always pretend as if we are passing the
    maximum number of arguments), things become simpler.

    The latter leaves the use of any executable memory as mostly optional, but
    unlike the pool, executable memory has no set limit on the maximum
    number of lambdas. There are tradeoffs either way.


    Can note that on my ABI designs, the RISC-V LP64 ABI, and the Win64 ABI,
    this property mostly holds. On the SysV AMD64 ABI, or RISC-V LP64D ABI,
    it does not. Can note that BGBCC when targeting RV64 currently uses a
    variant of the LP64 ABI.

    For XG3, it may use either the LP64 ABI, or an experimental "XG3 Native"
    ABI which differs slightly:
    X10..X17 are used for arguments 1..8;
    F10..F17 are used for arguments 9..16;
    F4..F7 are reassigned to being callee-save.
    Partly balancing out the register mix.
    X: 15 scratch; 12 callee-save
    F: 16 scratch; 16 callee-save.
    So: 31 scratch, 28 callee-save.
    Vs: 35 scratch, 24 callee-save.
    Struct pass/return:
    1-8 bytes: 1 register/spot;
    9-16 bytes: 2 registers/spots, padded to an even index.
    17+: pass/return via pointer.
    For struct return, an implicit argument is passed;
    Callee copies returned struct to the address passed by caller.


    Though, another partial motivation for this sort of thing is to make it simpler to marshal COM-style interfaces (it lessens the burden on the
    lower levels to need to care about the method signatures for the
    marshaled objects). Though, a higher level mechanism, such as an RPC implementation, would still need to know about the method signatures.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Aug 30 15:36:46 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one of the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other language has this weird combination of features?

    Well, more precisely:
    - function pointer is supposed to take the same space as a single
    machine address
    - function pointer is supposed to be directly invokable, that is
    point to machine code of the function
    - one wants to support nested functions
    - there is no garbage collector, one does not want to introduce extra
    stack and one does not want to leak storage allocated to nested
    functions.

    To explain more:
    - arguably in "safe" C data pointers should consist
    of 3 machine words; such pointers have room for the extra data needed
    for nested functions.
    - some calling conventions introduce extra indirection, that is,
    the function pointer points to a data structure containing the address
    of machine code and the extra data needed by nested functions.
    A function call puts the extra data in a dedicated machine register and
    then transfers control via the address contained in the function data
    structure. IIUC IBM AIX uses such an approach.
    - one could create trampolines in a separate area of memory. In
    that case there is trouble with deallocating no-longer-needed
    trampolines. This trouble can be resolved by using GC, or
    by using a parallel stack dedicated to trampolines.
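
    A small C model of the descriptor approach (names are illustrative):
    the "function pointer" is the address of a little descriptor rather
    than of the code itself, and the call site supplies the environment as
    a hidden argument, standing in for the dedicated register.

        struct fdesc {
            void (*code)(void *env, int arg);   /* actual machine code       */
            void  *env;                         /* static chain / GOT / etc. */
        };

        typedef const struct fdesc *funcptr;    /* still one machine word    */

        static inline void call_via(funcptr f, int arg)
        {
            f->code(f->env, arg);               /* load code ptr + env, jump */
        }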

    Concerning languages, any language which has nested functions and
    wants seamless cooperation with C needs to resolve the problem.
    That affects Pascal, Ada, PL/I, that is, basically most classic
    non-C languages. IIUC several "higher level" languages resolve
    the trouble by a combination of a parallel stack and/or GC. But
    when a language wants to compete with the efficiency of C and does not
    want GC, then trampolines allocated on the machine stack may be the
    only choice (on a register-starved machine a parallel stack may be
    too expensive). AFAIK GNU Ada uses (or used) trampolines
    allocated on the machine stack.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 30 13:19:46 2025
    From Newsgroup: comp.arch

    On 8/30/2025 10:36 AM, Waldek Hebisch wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    There is one additional, quite thorny issue: How to maintain
    state for nested functions to be invoked via pointers, which
    have to have access local variables in the outer scope.
    gcc does so by default by making the stack executable, but
    that is problematic. An alternative is to make some sort of
    executable heap. This is now becoming a real problem, see
    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .

    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.

    In most cases you have either one of the other but not both. E.g. in
    C we don't have nested functions, and in Javascript functions are
    heap-allocated objects.

    Other than GNU C (with its support for nested functions), which other
    language has this weird combination of features?

    Well, more precisely:
    - function pointer is supposed to take the same space as a single
    machine address
    - function pointer is supposed to be directly invokable, that is
    point to machine code of the function
    - one wants to support nested functions
    - there is no garbage collector, one does not want to introduce extra
    stack and one does not want to leak storage allocated to nested
    functions.


    Yes.

    To explain more:
    - arguably in "safe" C data pointers should consist
    of 3 machine words, such pointer have place for extra data needed
    for nested functions.

    No one wants to pay for this...
    Even 2 machine words per pointer is a hard sell.
    Much less the mess created by any programs that use integer->pointer
    casts (main option I can think of is turning these cases into runtime
    calls).

    I have experimentally used bounds checking in the past, although the
    main form I ended up using sort of crudely approximates the bounds (with
    a minifloat style format) and shoves them into the high 16 bits of the
    pointer (with 0x0000 still allowed for untagged C pointers). It was more limited, more intended to deal with common case "out of bounds" bugs
    rather than do anything security related.

    - some calling conventions introduce extra indirection, that is
    function pointer point to a data structure containing address
    of machine code and extra data needed by nested functions.
    Function call puts extra data in dedicated machine register and
    then transfers control via address contained in function data
    structure. IIUC IBM AIX uses such approach.

    Yes. Also FDPIC generally falls into that category.

    Function pointer consists of a pointer to a blob of memory holding a
    code pointer and typically the callee's GOT pointer.

    Would be easier to implement lambdas on top of an FDPIC style ABI in
    this way, but FDPIC tends to kinda suck in terms of having more
    expensive function calls.


    - one could create trampolines in a separate area of memory. In
    such case there is trouble with dealocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.


    One could classify them based on lifetime:
    Automatic, freed automatically when parent function exits;
    Global, may live indefinitely, but needs to be freed.


    General strategy for auto-freeing is to (internally) build a linked list
    of items that need to be auto-freed when the current function terminates
    (in the epilog, it calls a hidden runtime function which frees
    everything in the list, with each item both providing a data pointer and
    the function needed to free said pointer). This mechanism had also been extended to "alloca()", C99 style VLAs, and any large by-value structs
    or arrays too large to reasonably fit into the current local frame.

    The internal allocation calls can be provided with a double-indirect
    pointer to the linked list of items to be freed.
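
    A hedged C sketch of that cleanup-list idea (the names here are made up;
    in BGBCC the list head and the epilog call would be inserted by the
    compiler, not written by hand):

        #include <stdlib.h>

        struct cleanup {
            struct cleanup *next;
            void  (*release)(void *);   /* how to free this item            */
            void   *data;               /* thunk / VLA / large struct block */
        };

        /* Hidden allocation helper: the compiler passes &frame_list.
           (Error handling omitted for brevity.) */
        static void *cleanup_alloc(struct cleanup **list, size_t n)
        {
            struct cleanup *c = malloc(sizeof *c);
            c->data    = malloc(n);
            c->release = free;
            c->next    = *list;
            *list      = c;
            return c->data;
        }

        /* Called from the function epilog: free everything in the list. */
        static void cleanup_run(struct cleanup **list)
        {
            for (struct cleanup *c = *list, *next; c != NULL; c = next) {
                next = c->next;
                c->release(c->data);
                free(c);
            }
            *list = NULL;
        }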


    Concerning languages, any language which has nested functions and
    wants seamless cooperation with C needs to resolve the problem.
    That affects Pascal, Ada, PL/I. That is basicaly most classic
    non-C languages. IIUC several "higher level" languages resolve
    the trouble by combination of parallel stack and/or GC. But
    when language want to compete with efficiency of C and does not
    want GC, then trampolines allocated on machine stack may be the
    only choice (on register starved machine parallel stack may be
    too expensive). AFAIK GNU Ada uses (or used) trampolines
    allocated on machine stack.


    Yes, true. But, for other reasons we really don't want RWX on the main
    stack.

    Also, GC is generally too expensive and doesn't play well with many
    use-cases (more or less anything that is timing sensitive).


    Automatic reference counting is sometimes an option, but tends to carry
    too much overhead in the general case (but, may make sense for "dynamic"/"variant" types, where one is already paying for the overhead
    of dynamic type-tag checking). But, wouldn't want to use it for general pointers or function pointers in a C like language.

    Refcounting mostly works OK for dynamic types and similar, and has fewer performance issues than, say, mark/sweep. The usual big downside is that any cyclic structures will tend to be leaked.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sat Aug 30 14:22:32 2025
    From Newsgroup: comp.arch

    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words
    (address of the code plus address of the context/environment/GOT), so
    there's no dynamic allocation involved.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 30 13:46:05 2025
    From Newsgroup: comp.arch

    On 8/30/2025 1:22 PM, Stefan Monnier wrote:
    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so
    there's no dynamic allocation involved.


    FDPIC typically always uses the normal pointer width, just with more indirection:
    Load target function pointer from GOT;
    Save off current GOT pointer to stack;
    Load code pointer from function pointer;
    Load GOT pointer from function pointer;
    Call function;
    Reload previous GOT pointer.

    It, errm, kinda sucks...

    Seemingly, there are thus far no FDPIC RV64 targets (in GCC), but it does
    apparently exist for RV32 No-MMU style targets.


    I took a lower overhead approach in my PBO ABI (optional callee side
    GBR/GP reload), but it lacks any implicit way to implement lambdas.



    Stefan

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 31 05:36:53 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    AFAIK this is a problem only in those rare languages where a function
    value is expected to take up the same space as any other pointer while
    at the same time supporting nested functions.
    ...
    Other than GNU C (with its support for nested functions), which other language has this weird combination of features?

    Forth has single-cell (cell=machine word) execution tokens (~function pointers). In contrast to C, where in theory you could have bigger
    function pointers and only ABIs and legacy code encourage function
    pointers that fit in a single machine word, in a Forth you would have
    to make all cells bigger (i.e., all integers and all addresses) if you
    want more space for xts.

    Gforth adds closures, that of course have to be represented by
    single-cell execution tokens that behave like other execution tokens.
    But Gforth has the advantage that xts are not represented by addresses
    of machine code, and instead there is one indirection level between
    the xt and the machine code. The reason for that is that xts not only represent colon definitions (~C functions), but also variables,
    constants, and words defined with create...does> (somewhat
    closure-like, but "statically" allocated). So Gforth implements
    closures using an extension of that mechanism, see Section 4 of [ertl&paysan18].

    However, there are Forth systems that implement xts as addresses of
    machine code, and if they implemented closures, they would need to use
    run-time code generation.

    - anton

    @InProceedings{ertl&paysan18,
    author = {M. Anton Ertl and Bernd Paysan},
    title = {Closures --- the {Forth} way},
    crossref = {euroforth18},
    pages = {17--30},
    url = {https://www.complang.tuwien.ac.at/papers/ertl%26paysan.pdf},
    url2 = {http://www.euroforth.org/ef18/papers/ertl.pdf},
    slides-url = {http://www.euroforth.org/ef18/papers/ertl-slides.pdf},
    video = {https://wiki.forth-ev.de/doku.php/events:ef2018:closures},
    OPTnote = {refereed},
    abstract = {In Forth 200x, a quotation cannot access a local
    defined outside it, and therefore cannot be
    parameterized in the definition that produces its
    execution token. We present Forth closures; they
    lift this restriction with minimal implementation
    complexity. They are based on passing parameters on
    the stack when producing the execution token. The
    programmer has to explicitly manage the memory of
    the closure. We show a number of usage examples.
    We also present the current implementation, which
    takes 109~source lines of code (including some extra
    features). The programmer can mechanically convert
    lexical scoping (accessing a local defined outside)
    into code using our closures, by applying assignment
    conversion and flat-closure conversion. The result
    can do everything one expects from closures,
    including passing Knuth's man-or-boy test and living
    beyond the end of their enclosing definitions.}
    }

    @Proceedings{euroforth18,
    title = {34th EuroForth Conference},
    booktitle = {34th EuroForth Conference},
    year = {2018},
    key = {EuroForth'18},
    url = {http://www.euroforth.org/ef18/papers/proceedings.pdf}
    }
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Aug 31 16:21:38 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 8/30/2025 1:22 PM, Stefan Monnier wrote:
    Function pointer consists of a pointer to a blob of memory holding
    a code pointer and typically the callee's GOT pointer.

    Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so there's no dynamic allocation involved.


    FDPIC typically always uses the normal pointer width, just with more indirection:
    Load target function pointer from GOT;
    Save off current GOT pointer to stack;
    Load code pointer from function pointer;
    Load GOT pointer from function pointer;
    Call function;
    Reload previous GOT pointer.

    My 66000 can indirect through GOT so the above sequence is::

    CALX [ip,,GOT[n]-.]

    and references to GOT are like above (functions) or (extern) as::

    LDD Rp,[ip,,GOT[n]-.]

    Each linked module gets its own GOT.

    It, errm, kinda sucks...

    Bad ISA makes many things suck--whereas good ISA does not.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Sep 1 07:40:47 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I would have liked to install 64-bit Debian (IIRC I initially ran
    32-bit Debian on the Athlon 64), but they were not ready at the time,
    and still busily working on their multi-arch (IIRC) plans, so
    eventually I decided to go with Fedora Core 1, which just implemented
    /lib and /lib64 and was there first.

    For some reason I switched to Gentoo relatively soon after
    (/etc/hostname from 2005-02-20, and IIRC Debian still had not finished hammering out multi-arch at that time), before finally settling in Debian-land several years later.

    Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
    first Debian with official AMD64 support.

    Multiarch was introduced in Debian 7 (Wheezy), released 4 May 2013.

    So Multiarch took much longer than they had originally expected, and
    they apparently settled for the lib64 approach for Debian 4-6.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Mon Sep 1 12:15:54 2025
    From Newsgroup: comp.arch

    Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
    first Debian with official AMD64 support.

    Indeed, I misremembered: I used Debian's i386 port on my 2003 AMD64
    machine.
    It didn't have enough RAM to justify the bother of distro hopping. 🙂


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Mon Sep 1 11:33:33 2025
    From Newsgroup: comp.arch

    On 9/1/2025 11:15 AM, Stefan Monnier wrote:
    Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
    first Debian with official AMD64 support.

    Indeed, I misremembered: I used Debian's i386 port on my 2003 AMD64
    machine.
    It didn't have enough RAM to justify the bother of distro hopping. 🙂


    My first AMD64 machine also ended up mostly running 32-bit OS's, but
    more because initially:
    It was unstable if running 64-bit Linux;
    It was also not very stable initially with XP-X64;
    Driver support for XP-X64, initially, was almost non existent.
    So, ended up mostly running 32-bit WinXP on the thing.

    Though, after the initial weak results, on my next machine I had a
    better experience and IIRC had it set up to dual boot XP-X64 and Fedora,
    by that point stuff was stable and XP-X64 had drivers for stuff. I stuck
    with XP-X64 mostly as Vista was kinda trash (until some years later
    jumping to Win7, and now Win10).


    Well, and (at least in those years) Linux still had serious issues with
    driver compatibility, so you could use the OS but typically with no 3D acceleration or sound (and Mesa-GL in SW mode is horribly slow).

    At least Ethernet tended to work as most MOBOs had settled on the
    RTL8139 or similar (well, until MOBOs started having Gigabit Ethernet,
    and suddenly Ethernet no longer worked in Linux for a while, ...).

    Well, Linux land often failed to provide a great experience, not so
    much because of the UI (I actually like using the Bash shell for
    stuff), but because of ever-present hardware support issues (so I usually
    couldn't run it as the main OS, as much of the HW often didn't
    work).

    ...



    Stefan

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Sep 1 20:34:13 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    It didn't have enough RAM to justify

    My Athlon 64 only had 1GB of RAM, so an IA-32 distribution would have
    done nicely for it, but I wanted to be able to build and run AMD64
    software.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 4 15:23:26 2025
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

        #include <stddef.h>

        long arrays(long *v, size_t n)
        {
            long i, r;
            for (i=0, r=0; i<n; i++)
                r+=v[i];
            return r;
        }

        arrays:
            MOV   R3,#0
            MOV   R4,#0
            VEC   R5,{}
            LDD   R6,[R1,R3<<3]
            ADD   R4,R4,R6
            LOOP  LT,R3,#1,R2
            MOV   R1,R4
            RET


        long a, b, c, d;

        void globals(void)
        {
            a = 0x1234567890abcdefL;
            b = 0xcdef1234567890abL;
            c = 0x567890abcdef1234L;
            d = 0x5678901234abcdefL;
        }

        globals:
            STD   #0x1234567890abcdef,[ip,a-.]
            STD   #0xcdef1234567890ab,[ip,b-.]
            STD   #0x567890abcdef1234,[ip,c-.]
            STD   #0x5678901234abcdef,[ip,d-.]
            RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:

              Bytes                             Instructions
        arrays   globals       Architecture    arrays   globals
          28     66 (34+32)    RV64GC            12        9
          27     69            AMD64             11        9
          44     84            ARM A64           11       22
          32     68            My 66000           8        5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing; sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance on code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Sep 4 10:25:49 2025
    From Newsgroup: comp.arch

    On 9/4/2025 8:23 AM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
    Bytes Instructions
    arrays globals Architecture arrays globals
    28 66 (34+32) RV64GC 12 9
    27 69 AMD64 11 9
    44 84 ARM A64 11 22
    32 68 My 66000 8 5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    In general yes, but as you pointed out in another post, if you are
    talking about a GBOoO machine, it isn't the absolute number of
    instructions (because of parallel execution), but the number of cycles
    to execute a particular routine. Of course, this is harder to tell at a glance from a code listing.

    And, of course your "150%" is arbitrary, but I agree that small
    differences in code size are not important, except in some small
    embedded applications.

    And I guess I would add, as a third, much lower priority, power usage.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Sep 4 21:00:36 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 9/4/2025 8:23 AM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    In general yes, but as you pointed out in another post, if you are
    talking about a GBOoO machine, it isn't the absolute number of
    instructions (because of parallel execution), but the number of cycles
    to execute a particular routine. Of course, this is harder to tell at a glance from a code listing.

    I can't seem to find the code examples from the snipped posts anywhere.

    For arrays:
    The inner loops are 4 instructions (3 for My 66000) and the loop is 2×
    data dependent on the integer ADDs, so all 4 instructions can be pitched
    at 1-cycle. Let us assume the loop is executed 10×, so 10 loop-latencies
    is 10-cycles plus LD-latency plus ADD latency:: {using LD-latency = 4}

    setup
    | MOV #0 |
    | MOV #0 |
    loop[0] | LD AGEN|rot | Cache | LD align |rot | D ADD |
    | LP ADD | BLT ! |
    loop[1] | LD AGEN|rot | Cache | LD align |rot | D ADD |
    | LP ADD | BLT ! |
    loop[2] | LD AGEN|rot | Cache | LD align |rot | D ADD |
    | LP ADD | BLT × | repair |
    exit
    | MOV |
    | RET |
    | looping | recovery |

    // where rot is time it takes to route AGEN to the SRAM arrays and back,
    // and showing the exit of the loop by mispredicting the last branch back
    // to the top of the loop, 2-cycle repairing state, and returning from
    // subroutine.

    Any µarchitecture that can start 1 LD per cycle, start 2 integer ADDs
    per cycle, and 1 branch per cycle, has enough resources to perform
    arrays as drawn above.

    For globals:

    RV64GC does 4 LDs and 4 STs, each ST being data dependent on 1 LD.
    It is conceivable that a 10-wide machine might do 4 LDs in a cycle,
    and figure out that the 4 values are in the same cache line, so the
    latency of calculation is LD-latency + ST AGEN. Let's say LD-latency
    is 4-cycles, so the calculation latency is 5-cycles. RET can probably
    be performed simultaneous with the first LD AGEN.

    My 66000 does 4 parallel ST # all of which can start on the same cycle,
    as can RET, for a latency of 1-cycle.

    On the other hand:: My 66000 implementation may only be 6-wide and
    the 4 STs take 2-execution-cycles, but the RET still takes place in
    cycle-1.

    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better
    than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    And, of course your "150%" is arbitrary,

    yes, of course, completely arbitrary--but this is the typical RISC-CISC instruction count ratio. Now, on the other hand, My 66000 runs closer to
    115% size and 70% RISC-V count {although the examples above are 66% and
    55%}.

    but I agree that small
    differences in code size are not important, except in some small
    embedded applications.

    And I guess I would add, as a third, much lower priority, power usage.

    I would suggest power has become a second order desire (up from third)
    {maybe even a primary desire at some scales}.

    But note: Nothing delivers a fixed bit-pattern as an operand at lower
    power than plucking the bits from the instruction stream; saving a
    good deal of the power consumed by forwarding (the multiple comparators
    and the find youngest logic plus the buffers to drive the result-to-
    operand multiplexers).

    And certainly: plucking the bit-pattern from the instruction stream is
    vastly lower power than LDing the bit-pattern from memory ! close to
    4× lower.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Sep 4 16:54:21 2025
    From Newsgroup: comp.arch

    On 9/4/2025 12:25 PM, Stephen Fuld wrote:
    On 9/4/2025 8:23 AM, MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
       Crappy with arrays;
       Crappy with code with lots of large immediate values;
       Crappy with code which mostly works using lots of global variables;
         Say, for example, a lot of Apogee / 3D Realms code;
         They sure do like using lots of global variables.
         id Software also likes globals, but not as much.
       ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
       long i, r;
       for (i=0, r=0; i<n; i++)
         r+=v[i];
       return r;
    }

    arrays:
          MOV  R3,#0
          MOV  R4,#0
          VEC  R5,{}
          LDD  R6,[R1,R3<<3]
          ADD  R4,R4,R6
          LOOP LT,R3,#1,R2
          MOV  R1,R4
          RET


    long a, b, c, d;

    void globals(void)
    {
       a = 0x1234567890abcdefL;
       b = 0xcdef1234567890abL;
       c = 0x567890abcdef1234L;
       d = 0x5678901234abcdefL;
    }

    globals:
         STD #0x1234567890abcdef,[ip,a-.]
         STD #0xcdef1234567890ab,[ip,b-.]
         STD #0x567890abcdef1234,[ip,c-.]
         STD #0x5678901234abcdef,[ip,d-.]
         RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC)
    are:
         Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
       32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better
    than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    In general yes, but as you pointed out in another post, if you are
    talking about a GBOoO machine, it isn't the absolute number of
    instructions (because of parallel execution), but the number of cycles
    to execute a particular routine.  Of course, this is harder to tell at a glance from a code listing.

    And, of course your "150%" is arbitrary, but I agree that small
    differences in code size are not important, except in some small
    embedded applications.


    Yeah.

    The main use case where code size is a big priority is when trying to fit
    code into a small fixed-size ROM. If loading into RAM, and the RAM is
    non-tiny, then generally exact binary size is much less important, and
    as long as it isn't needlessly huge/bloated, it doesn't matter too much.

    For traditional software, data/bss, stack, and heap memory will often
    be the dominant factors in overall RAM usage.

    For a lot of command-line tools, there will often be a lot of code for
    relatively little RAM use, but then the priority is less about minimal
    code size (though small code size will often matter more than
    performance for many such tools) than about the overhead of creating and
    destroying process instances.

    ...


    And I guess I would add, as a third, much lower priority, power usage.


    It depends:
    For small embedded devices, power usage often dominates;
    Usually, this is most affected by executing as few instructions as
    possible while also using the least complicated hardware logic to
    perform those instructions.



    For a lot of DSP tasks, power use is a priority even while doing lots
    of math operations, in which case one often wants FPUs and similar with
    the minimal sufficient precision (so, for example, rocking it with lots
    of Binary16 math, and FPUs which natively operate on Binary16); or a lot
    of 8- and 16-bit integer math.

    While FP8 is interesting, sadly direct FP8 math often has too little
    precision for many tasks.


    I guess the issue then becomes one of the cheapest-possible Binary16
    capable FPU (both in terms of logic resources and energy use).

    Ironically, one option here is to use log-scaled values (scaled to
    mimic Binary16) and just sort of pass them off as Binary16. If one
    switches entirely to log-scaled math, then it can at least be
    self-consistent. However, if mixed/matched with "real" Binary16,
    typically only the top few digits will match up.

    Where, as noted, it works well at low precision, but scales poorly (and
    even Binary16 is pushing it).

    Though, unclear about ASIC space.
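
    As a rough, self-contained sketch of the general log-number-system idea
    hinted at above (my own illustration, not a specific format anyone has
    committed to): values are stored as a sign bit plus a fixed-point log2 of
    the magnitude, so multiplication reduces to an integer add of the log
    fields; addition/subtraction would still need a correction step, which is
    omitted here.

    #include <math.h>
    #include <stdint.h>

    #define LNS_FRAC 8                 /* fraction bits of the log2 field */

    typedef uint16_t lns16;            /* bit 15 = sign, bits 14..0 = biased log2(|x|) */

    static lns16 lns_from_float(float x)
    {
        uint16_t s = (x < 0) ? 0x8000 : 0;
        float m = fabsf(x);
        if (m == 0.0f) return s;       /* crude zero encoding: log field = 0 */
        int32_t l = (int32_t)lrintf(log2f(m) * (1 << LNS_FRAC)) + 0x4000;
        if (l < 1) l = 1;
        if (l > 0x7FFF) l = 0x7FFF;
        return s | (uint16_t)l;
    }

    static float lns_to_float(lns16 v)
    {
        if ((v & 0x7FFF) == 0) return (v & 0x8000) ? -0.0f : 0.0f;
        float l = ((int32_t)(v & 0x7FFF) - 0x4000) / (float)(1 << LNS_FRAC);
        return (v & 0x8000) ? -exp2f(l) : exp2f(l);
    }

    /* Multiply: XOR the signs, add the log fields, re-apply the bias. */
    static lns16 lns_mul(lns16 a, lns16 b)
    {
        uint16_t s = (a ^ b) & 0x8000;
        if ((a & 0x7FFF) == 0 || (b & 0x7FFF) == 0) return s;   /* zero */
        int32_t l = (int32_t)(a & 0x7FFF) + (int32_t)(b & 0x7FFF) - 0x4000;
        if (l < 1) l = 1;
        if (l > 0x7FFF) l = 0x7FFF;
        return s | (uint16_t)l;
    }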



    For integer math, it might make sense to use a lot of zero-extended
    16-bit math, since using sign-extended math would likely waste more
    energy flipping all the high order bits for sign extension.

    Well, or other, possibly even more wacky, options, like zigzag-folded
    gray-coded byte values.

    Though it would be kinda wacky/nonstandard: if ALU operations fold the
    sign into the LSB and use gray-coding for the value, then arithmetic
    could be performed while minimizing the number of bit flips, and thus
    potentially using less total energy for registers and memory operations.

    Though, potentially, the CPU could be made to look as-if it were
    operating on normal twos complement math; since if the arithmetic
    results are the same, it might be essentially invisible to the software
    that numbers are being stored in a nonstandard way.

    Or, say, mapping from twos complement to folded bytes (with every byte
    being folded):
    00->00, 01->02, 02->06, 03->04, ...
    FF->01, FE->03, FD->07, FC->05, ...
    So, say, a value flipping sign would typically only need to flip a small fraction of the number of bits (and the encode/decode process would
    mostly consist of bitwise XORs).
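
    A minimal C sketch of that folding, under one plausible reading of the
    mapping above (sign folded into the LSB, with the magnitude bits above it
    gray-coded); the function names are made up for illustration:

    #include <stdint.h>

    /* Fold a two's-complement byte: the sign goes to bit 0, and the
       magnitude (v for non-negative values, ~v for negative ones) is
       gray-coded into bits 7..1. This reproduces the table above:
       00->00, 01->02, 02->06, 03->04, FF->01, FE->03, FD->07, FC->05. */
    static uint8_t fold_byte(uint8_t v)
    {
        uint8_t sign = v >> 7;
        uint8_t mag  = sign ? (uint8_t)~v : v;     /* 7-bit magnitude */
        uint8_t gray = mag ^ (mag >> 1);           /* binary -> gray */
        return (uint8_t)((gray << 1) | sign);
    }

    static uint8_t unfold_byte(uint8_t f)
    {
        uint8_t sign = f & 1;
        uint8_t mag  = f >> 1;                     /* gray -> binary */
        mag ^= mag >> 1;
        mag ^= mag >> 2;
        mag ^= mag >> 4;
        return sign ? (uint8_t)~mag : mag;
    }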


    Though, might still make sense to keep things "normal" in the ALU and
    CPU registers, but then apply such a transform at the level of the
    memory caches (and in external RAM). A lot may depend on the energy cost
    of performing this transformation though (and, it does implicitly assume
    that much of the RAM is holding signed integer values).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 15:03:15 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    Performance from a given chip area.

    The RISC-V people argue that they can combine instructions with a few transistors. But, OTOH, they have 16-bit and 32-bit wide
    instructions, which means that a part of the decoder results will be
    thrown away, increasing the decode cost for a given number of average
    decoded instructions per cycle. Plus, they need more decoded
    instructions per cycle for a given amount of performance.

    Intel and AMD demonstrate that you can get high performance even with
    an instruction set that is even worse for decoding, but that's not cheap.

    ARM A64 goes the other way: Fixed-width instructions ensure that all
    decoding on correctly predicted paths is actually useful.

    However, it pays for this in other ways: Instructions like load pair
    with auto-increment need to write 3 registers, and the write port
    arbitration certainly has a hardware cost. However, such an
    instruction would need two loads and an add if expressed in RISC-V; if
    RISC-V combines these instructions, it has the same write-port
    arbitration problem. If it does not combine at least the loads, it
    will tend to perform worse with the same number of load/store units.

    So it's a balancing game: If you lose some weight here, do you need to
    add the same, more, or less weight elsewhere to compensate for the
    effects elsewhere?

    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better
    than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    I don't think that even VAX encoding would be the major problem of the
    VAX these days. There are microop caches and speculative decoders for
    that (although, as EricP points out, the VAX is an especially
    expensive nut to crack for a speculative decoder).

    In any case, if smaller code size was it, RV64GC would win according
    to my results. However, compilers often generate code that has a
    bigger code size rather than a smaller one (loop unrolling, inlining),
    so code size is not that important in the eyes of the maintainers of
    these compilers.

    I also often see code produced with more (dynamic) instructions than
    necessary. So the number of instructions is apparently not that
    important, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 5 14:26:28 2025
    From Newsgroup: comp.arch

    On 9/5/2025 10:03 AM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    Performance from a given chip area.

    The RISC-V people argue that they can combine instructions with a few transistors. But, OTOH, they have 16-bit and 32-bit wide
    instructions, which means that a part of the decoder results will be
    thrown away, increasing the decode cost for a given number of average
    decoded instructions per cycle. Plus, they need more decoded
    instructions per cycle for a given amount of performance.

    Intel and AMD demonstrate that you can get high performance even with
    an instruction set that is even worse for decoding, but that's not cheap.

    ARM A64 goes the other way: Fixed-width instructions ensure that all
    decoding on correctly predicted paths is actually useful.

    However, it pays for this in other ways: Instructions like load pair
    with auto-increment need to write 3 registers, and the write port
    arbitration certainly has a hardware cost. However, such an
    instruction would need two loads and an add if expressed in RISC-V; if
    RISC-V combines these instructions, it has the same write-port
    arbitration problem. If it does not combine at least the loads, it
    will tend to perform worse with the same number of load/store units.

    So it's a balancing game: If you lose some weight here, do you need to
    add the same, more, or less weight elsewhere to compensate for the
    effects elsewhere?


    It is tradeoffs...

    Load/Store Pair helps, and isn't too bad if one already has the register
    ports (if it is at least a 2-wide superscalar, you can afford it with
    little additional cost).

    Auto-increment slightly helps with code density, but is a net negative
    in other ways. Depending on implementation, some of its more obvious
    use-cases (such as behaving like a PUSH/POP) may end up slower than
    using separate instructions.


    Say, the most obvious way to implement auto-increment in my case would
    be to likely have the instruction decode as if there were an implicit
    ADD being executed in parallel.

    Say:
    MOV.Q (R10)+, R18
    MOV.Q R19, -(R11)
    Behaving as:
    ADD 8, R10 | MOV.Q (R10), R18
    ADD -8, R11 | MOV.Q R19, -8(R11)
    So far, so good... Both execute with a 1-cycle latency, but...
    MOV.Q R18, -(R2)
    MOV.Q R19, -(R2)
    MOV.Q R20, -(R2)
    MOV.Q R21, -(R2)
    Would take 8 cycles rather than 4 (due to R2 dependencies).

    Vs:
    MOV.Q R18, -8(R2) //*1
    MOV.Q R19, -16(R2)
    MOV.Q R20, -24(R2)
    MOV.Q R21, -32(R2)
    ADD -32, R2
    Needing 5 cycles (or, maybe 4, if the superscalar logic is clever and
    can run the ADD in parallel with the final MOV.Q).

    *1: Where, "-8(R2)" and "(R2, -8)" are analogous as far as BGBCC's ASM
    parser are concerned, but the former is more traditional, so figured I
    would use it here.
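
    As a C-level restatement of that dependency-chain point (a hedged sketch;
    the function names are mine, and whether a compiler actually emits
    auto-increment forms for the first version is implementation-dependent):

    /* Both functions store a, b, c, d below *pp and move *pp down by 4
       slots. The first chains every store through the previous pointer
       update (the auto-increment pattern); the second keeps all four
       stores independent and does a single pointer adjustment. */
    void push4_chained(long **pp, long a, long b, long c, long d)
    {
        long *p = *pp;
        *--p = a;       /* each store depends on the preceding update of p */
        *--p = b;
        *--p = c;
        *--p = d;
        *pp = p;
    }

    void push4_flat(long **pp, long a, long b, long c, long d)
    {
        long *p = *pp;
        p[-1] = a;      /* the four stores are mutually independent */
        p[-2] = b;
        p[-3] = c;
        p[-4] = d;
        *pp = p - 4;    /* one pointer adjustment, off the critical path */
    }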


    Likewise, in C if you write:
    v0=*cs++;
    v1=*cs++;
    And it were compiled as auto-increment loads, it could also end up
    slower than a Load+Load+ADD sequence (for the same reason).

    But, what about:
    v0=*cs++;
    //... something else unrelated to cs (or v0).
    Well, then the ADD gets executed in parallel with whatever follows, so
    may still work out to a 1-cycle latency in this case.


    And, a feature is not particularly compelling when its main obvious use
    cases would end up with little/no performance gain (or would actually
    end up slower than what one does in its absence).

    Only really works if one has a 1-cycle ADD.

    Where, otherwise, seemingly the only real advantage of auto-increment
    is to make the binaries slightly smaller.


    Wouldn't take much to re-add it though, as noted, the ancestor of the
    current backend was written for an ISA that did have auto-increment. I
    just sort of ended up dropping it as it wasn't really worth it. Not only
    was it not particularly effective, but tended to be a lot further down
    the ranking in terms of usage frequency of addressing modes. If one
    excludes using it for PUSH/POP, its usage frequency basically falls to
    "hardly anything". Otherwise, you can basically count how many times you
    see "*ptr++" or similar in C, this is about all it would ever end up
    being used; which even in C, is often relatively infrequent).




    But, yeah, as noted, the major areas where RISC-V tends to lose out
    IMHO are:
    Lack of Indexed Load/Store;
    Crappy handling of large constants and lack of large immediate values.

    I had noted before, that the specific combination of adding these features:
    Indexed Load/Store;
    Load/Store Pair;
    Jumbo Prefixes.
    This combination both improves code density over plain RV64G/RV64GC and
    gains a roughly 40-50% speedup in programs like Doom.

    While not significantly increasing logic cost over what one would
    already need for a 2-wide machine. Could make sense to skip them for a
    1-wide machine, but then you don't really care too much about
    performance if going for 1-wide.


    Then again, Indexed Load/Store, due to a "register port issue" for
    Indexed Store, does give a performance advantage to going 3 wide over 2
    wide even if the 3rd lane is rarely used otherwise.


    Though, one could argue:
    But, the relative delta (of assuming these features, over plain RV64GC)
    is slightly less if one assumes a CPU with 1 cycle latency on ALU
    instructions and similar. But, this is still kind of weak IMO (ideally,
    the latency cost of ADD and similar should effect everything equally,
    and that 2-cycle ADD and Shifts disproportionately hurts RV64G/RV64GC performance, is not to RV64G's merit).

    Well, and Zba helps, but not fully. If SHnADD still has a 2c
    latency, well, your indexed load is 3 cycles vs 5 cycles, but
    still worse than 1 cycle...

    And, statistically, indexed loads tend to be far too large a fraction of the
    dynamic instruction mix to justify cheaping out here. Even if static
    instruction counts make them seem less relevant, indexed loads also tend
    to be more concentrated inside loops (whereas fixed-displacement loads
    are more concentrated in prologs and epilogs). If one excludes the
    prolog and epilog related loads/stores, the proportion of indexed
    load/store goes up significantly.



    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better
    than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    I don't think that even VAX encoding would be the major problem of the
    VAX these days. There are microop caches and speculative decoders for
    that (although, as EricP points out, the VAX is an especially
    expensive nut to crack for a speculative decoder).


    Well, if Intel and AMD could make x86 work... yeah...


    In any case, if smaller code size was it, RV64GC would win according
    to my results. However, compilers often generate code that has a
    bigger code size rather than a smaller one (loop unrolling, inlining),
    so code size is not that important in the eyes of the maintainers of
    these compilers.


    I haven't really tested, but I suspect one could improve over RV64GC
    slightly here.


    For example:

    * 00in-nnnn-iiii-0000 ADD Imm5s, Rn5 //"ADD 0, R0" = TRAP
    * 01in-nnnn-iiii-0000 LI Imm5s, Rn5
    * 10mn-nnnn-mmmm-0000 ADD Rm5, Rn5
    * 11mn-nnnn-mmmm-0000 MV Rm5, Rn5

    * 0000-nnnn-iiii-0100 ADDW Imm4u, Rn4
    * 0001-nnnn-mmmm-0100 SUB Rm4, Rn4
    * 0010-nnnn-mmmm-0100 ADDW Imm4n, Rn4
    * 0011-nnnn-mmmm-0100 MVW Rm4, Rn4 //ADDW Rm, 0, Rn
    * 0100-nnnn-mmmm-0100 ADDW Rm4, Rn4
    * 0101-nnnn-mmmm-0100 AND Rm4, Rn4
    * 0110-nnnn-mmmm-0100 OR Rm4, Rn4
    * 0111-nnnn-mmmm-0100 XOR Rm4, Rn4

    * 0iii-0nnn-0mmm-1001 ? SLL Rm3, Imm3u, Rn3
    * 0iii-0nnn-1mmm-1001 ? SRL Rm3, Imm3u, Rn3
    * 0iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3u, Rn3
    * 0iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3u, Rn3
    * 1iii-0nnn-0mmm-1001 ? AND Rm3, Imm3u, Rn3
    * 1iii-0nnn-1mmm-1001 ? SRA Rm3, Imm3u, Rn3
    * 1iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3n, Rn3
    * 1iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3n, Rn3

    * 0ooo-0nnn-0mmm-1101 ? SLL Rm3, Ro3, Rn3
    * 0ooo-0nnn-1mmm-1101 ? SRL Rm3, Ro3, Rn3
    * 0ooo-1nnn-0mmm-1101 ? AND Rm3, Ro3, Rn3
    * 0ooo-1nnn-1mmm-1101 ? SRA Rm3, Ro3, Rn3
    * 1ooo-0nnn-0mmm-1101 ? ADD Rm3, Ro3, Rn3
    * 1ooo-0nnn-1mmm-1101 ? SUB Rm3, Ro3, Rn3
    * 1ooo-1nnn-0mmm-1101 ? ADDW Rm3, Ro3, Rn3
    * 1ooo-1nnn-1mmm-1101 ? SUBW Rm3, Ro3, Rn3

    * 0ddd-nnnn-mmmm-0001 LW Disp3u(Rm4), Rn4
    * 1ddd-nnnn-mmmm-0001 LD Disp3u(Rm4), Rn4
    * 0ddd-nnnn-mmmm-0101 SW Rn4, Disp3u(Rm4)
    * 1ddd-nnnn-mmmm-0101 SD Rn4, Disp3u(Rm4)

    * 00dn-nnnn-dddd-1001 LW Disp5u(SP), Rn5
    * 01dn-nnnn-dddd-1001 LD Disp5u(SP), Rn5
    * 10dn-nnnn-dddd-1001 SW Rn5, Disp5u(SP)
    * 11dn-nnnn-dddd-1001 SD Rn5, Disp5u(SP)

    * 00dd-dddd-dddd-1101 J Disp10
    * 01dn-nnnn-dddd-1101 LD Disp5u(SP), FRn5
    * 10in-nnnn-iiii-1101 LUI Imm5s, Rn5
    * 11dn-nnnn-dddd-1101 SD FRn5, Disp5u(SP)

    Could achieve a higher average hit-rate than RV-C while *also* using
    less encoding space.
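
    As a sketch of what decoding just the first group above (the
    ....-....-....-0000 block) would involve, under my own assumed field
    packing -- Rn5 = {bit 12, bits 11..8}, and the second operand (Imm5s or
    Rm5) = {bit 13, bits 7..4} -- purely illustrative, not a spec:

    #include <stdint.h>
    #include <stdio.h>

    static void decode_g0000(uint16_t op)      /* assumes (op & 0xF) == 0 */
    {
        int rn  = (((op >> 12) & 1) << 4) | ((op >> 8) & 0xF);
        int rm  = (((op >> 13) & 1) << 4) | ((op >> 4) & 0xF);
        int imm = (rm & 0x10) ? rm - 32 : rm;  /* sign-extend Imm5s */

        switch (op >> 14) {
        case 0:
            if (op == 0) printf("TRAP\n");     /* "ADD 0, R0" repurposed */
            else         printf("ADD  %d, R%d\n", imm, rn);
            break;
        case 1: printf("LI   %d, R%d\n", imm, rn); break;
        case 2: printf("ADD  R%d, R%d\n", rm, rn); break;
        case 3: printf("MV   R%d, R%d\n", rm, rn); break;
        }
    }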


    Why? Partly because Reg4 (R8..R23) is less useless than Reg3 (R8..R15).

    Less shift range, but shifts are over-represented in RV-C, and the
    shifts that are present have a very low hit rate due to tending not to
    match the patterns that tend to exist in the compiler output (unlike
    ADD, shifts being far more likely to have different source and
    destination registers).


    The 3R/3RI instructions would still be limited to the "kinda useless"
    3-bit registers, but this still isn't exactly worse than what is already
    the case for RV-C (even if they still have a poor hit rate).

    I left out things like ADDI16SP and ADDI4SPN and similar, as these
    aren't frequent enough to be relevant here (nor do existing instances of
    "ADD SP, Imm, Rn" tend to hit within the limitations of "ADDI4SPN", as
    it is still borderline useless in BGBCC in this case, *1).


    *1: The only times Reg3 has an OK hit rate is in leaf functions, and
    there seems to be a strong negative correlation between leaf functions
    and stack arrays. Also at best, the underlying instruction tends to have
    a low hit-rate as, when a stack array is used semi-frequently, BGBCC
    tends to end up loading the address into a register and leaving it there
    for multiple uses (and, due to "quirks", if you access a local array in
    an inner loop, it will tend to end up in the fixed-assignment case, in
    which case the array address is loaded into a register one-off in the
    prolog). The ADDI4SPN instruction only really makes sense if one assumes
    that stack arrays are both very frequent (in leaf functions?) and/or
    that the compiler tends to load the address of the array into a scratch register and then immediately discard it again (neither of which seems
    true in my case).

    ADDI16SP would be relevant for prologs and epilogs, but has a
    statistical incidence too low to really justify a 16 bit encoding (in
    many cases, it would only occur twice per function or so, which is,
    statistically, a fairly low incidence rate).

    ...


    Though, that said, RVC in BGBCC still does seem to be semi-effective
    despite its limitations.



    I also often see code produced with more (dynamic) instructions than necessary. So the number of instructions is apparently not that
    important, either.


    Yeah, probably true.

    Often it seems better to try to minimize instruction-instruction
    dependency chains.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Sep 5 14:38:15 2025
    From Newsgroup: comp.arch

    On 9/5/2025 2:26 PM, BGB wrote:
    On 9/5/2025 10:03 AM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    ...

    And, self-correction...


    For example:

    * 00in-nnnn-iiii-0000  ADD        Imm5s, Rn5  //"ADD 0, R0" = TRAP
    * 01in-nnnn-iiii-0000  LI        Imm5s, Rn5
    * 10mn-nnnn-mmmm-0000  ADD        Rm5, Rn5
    * 11mn-nnnn-mmmm-0000  MV        Rm5, Rn5

    * 0000-nnnn-iiii-0100  ADDW        Imm4u, Rn4
    * 0001-nnnn-mmmm-0100  SUB        Rm4, Rn4
    * 0010-nnnn-mmmm-0100  ADDW        Imm4n, Rn4
    * 0011-nnnn-mmmm-0100  MVW        Rm4, Rn4 //ADDW  Rm, 0, Rn
    * 0100-nnnn-mmmm-0100  ADDW        Rm4, Rn4
    * 0101-nnnn-mmmm-0100  AND        Rm4, Rn4
    * 0110-nnnn-mmmm-0100  OR        Rm4, Rn4
    * 0111-nnnn-mmmm-0100  XOR        Rm4, Rn4

    * 0iii-0nnn-0mmm-1001 ? SLL        Rm3, Imm3u, Rn3
    * 0iii-0nnn-1mmm-1001 ? SRL        Rm3, Imm3u, Rn3
    * 0iii-1nnn-0mmm-1001 ? ADD        Rm3, Imm3u, Rn3
    * 0iii-1nnn-1mmm-1001 ? ADDW        Rm3, Imm3u, Rn3
    * 1iii-0nnn-0mmm-1001 ? AND        Rm3, Imm3u, Rn3
    * 1iii-0nnn-1mmm-1001 ? SRA        Rm3, Imm3u, Rn3
    * 1iii-1nnn-0mmm-1001 ? ADD        Rm3, Imm3n, Rn3
    * 1iii-1nnn-1mmm-1001 ? ADDW        Rm3, Imm3n, Rn3

    * 0ooo-0nnn-0mmm-1101 ? SLL        Rm3, Ro3, Rn3
    * 0ooo-0nnn-1mmm-1101 ? SRL        Rm3, Ro3, Rn3
    * 0ooo-1nnn-0mmm-1101 ? AND        Rm3, Ro3, Rn3
    * 0ooo-1nnn-1mmm-1101 ? SRA        Rm3, Ro3, Rn3
    * 1ooo-0nnn-0mmm-1101 ? ADD        Rm3, Ro3, Rn3
    * 1ooo-0nnn-1mmm-1101 ? SUB        Rm3, Ro3, Rn3
    * 1ooo-1nnn-0mmm-1101 ? ADDW        Rm3, Ro3, Rn3
    * 1ooo-1nnn-1mmm-1101 ? SUBW        Rm3, Ro3, Rn3


    ^ flip the LSB for all of the 3R instructions there; it seems to have
    been a screw-up when coming up with my listing...

    But, these were the newest and most uncertain addition, as they use a
    big chunk of encoding space and aren't great for hit rate due to Reg3
    and similar.


    * 0ddd-nnnn-mmmm-0001  LW        Disp3u(Rm4), Rn4
    * 1ddd-nnnn-mmmm-0001  LD        Disp3u(Rm4), Rn4
    * 0ddd-nnnn-mmmm-0101  SW        Rn4, Disp3u(Rm4)
    * 1ddd-nnnn-mmmm-0101  SD        Rn4, Disp3u(Rm4)

    * 00dn-nnnn-dddd-1001  LW        Disp5u(SP), Rn5
    * 01dn-nnnn-dddd-1001  LD        Disp5u(SP), Rn5
    * 10dn-nnnn-dddd-1001  SW        Rn5, Disp5u(SP)
    * 11dn-nnnn-dddd-1001  SD        Rn5, Disp5u(SP)

    * 00dd-dddd-dddd-1101  J        Disp10
    * 01dn-nnnn-dddd-1101  LD        Disp5u(SP), FRn5
    * 10in-nnnn-iiii-1101  LUI        Imm5s, Rn5
    * 11dn-nnnn-dddd-1101  SD        FRn5, Disp5u(SP)

    Could achieve a higher average hit-rate than RV-C while *also* using
    less encoding space.


    Granted, more testing could be done.

    This partly came up as another possibility for a "compressed" XG3, which basically just trades the space used for predicated ops back for 16-bit ops.

    But, alas, RV-C doesn't hold up as well if you try to trim it down to
    2/3 the encoding space.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Sep 5 21:56:07 2025
    From Newsgroup: comp.arch

    On 2025-09-04 11:23 a.m., MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    arrays:
    MOV R3,#0
    MOV R4,#0
    VEC R5,{}
    LDD R6,[R1,R3<<3]
    ADD R4,R4,R6
    LOOP LT,R3,#1,R2
    MOV R1,R4
    RET


    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    globals:
    STD #0x1234567890abcdef,[ip,a-.]
    STD #0xcdef1234567890ab,[ip,b-.]
    STD #0x567890abcdef1234,[ip,c-.]
    STD #0x5678901234abcdef,[ip,d-.]
    RET

    -----------------

    So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                         Instructions
    arrays globals    Architecture  arrays    globals
    28     66 (34+32) RV64GC            12          9
    27     69         AMD64             11          9
    44     84         ARM A64           11         22
    32     68         My 66000           8          5

    In light of the above, what do people think is more important, small
    code size or fewer instructions ??

    At some scale, smaller code size is beneficial, but once the implementation
    has a GBOoO µarchitecture, I would think that fewer instructions is better
    than smaller code--so long as the code size is less than 150% of the smaller
    AND so long as the ISA does not resort to sequential decode (i.e., VAX).

    What say ye !

    Things could be architect-ed to allow a tradeoff between code size and
    number of instructions executed in the same ISA. Sometimes one may want
    really small code; other times performance is more important.


    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test.

    Size is one thing, sooner or later one has to execute the instructions,
    and here My 66000 needs to execute fewer, while being within spitting
    distance of code size.

    Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions,

    3 for My 66000

    so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four
    instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.
    * My 66000 uses ST immediate for globals

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Sep 10 13:31:58 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    Things could be architect-ed to allow a tradeoff between code size and number of instructions executed in the same ISA. Sometimes one may want really small code; other times performance is more important.

    That's what -Os vs. -O1, -O2, -O3 etc is about :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Sep 12 17:47:59 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    - one could create trampolines in a separate area of memory. In
    such case there is trouble with deallocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.

    [...]

    gcc has -ftrampoline-impl=[stack|heap], see https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

    Don't longjmp out of a nested function though.
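
    For reference, a minimal sketch of the kind of code that makes gcc
    materialize a trampoline in the first place: a GNU C nested function whose
    address escapes (a gcc-only extension; the -ftrampoline-impl= choice only
    changes where the trampoline lives):

    #include <stdio.h>

    /* Calling through a plain function pointer forces 'add' to be reached
       via a trampoline, since it needs a static chain to find 'total'. */
    static void apply(void (*fn)(int), int x) { fn(x); }

    int main(void)
    {
        int total = 0;
        void add(int x) { total += x; }    /* GNU C nested function */
        apply(add, 3);                     /* address of 'add' escapes here */
        apply(add, 4);
        printf("total = %d\n", total);     /* prints "total = 7" */
        return 0;
    }
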
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 12 19:02:01 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Waldek Hebisch <antispam@fricas.org> schrieb:

    - one could create trampolines in a separate area of memory. In
    such case there is trouble with deallocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.

    [...]

    gcc has -ftrampoline-impl=[stack|heap], see https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

    Don't longjmp out of a nested function though.

    Or longjump around subroutines using 'new'.
    Or longjump out of 'signal' handlers.
    ...

    Somebody should write a paper entitled "longjump considered dangerous"
    annotating all the ways it can be used to abuse compiler assumptions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Sep 14 15:16:33 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Waldek Hebisch <antispam@fricas.org> schrieb:

    - one could create trampolines in a separate area of memory. In
    such case there is trouble with deallocating no longer needed
    trampolines. This trouble can be resolved by using GC. Or
    by using a parallel stack dedicated to trampolines.

    [...]

    gcc has -ftrampoline-impl=[stack|heap], see
    https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

    Don't longjmp out of a nested function though.

    Or longjump around subroutines using 'new'.

    Actually, one must take care when calling longjmp
    from a function that allocates memory to ensure that
    the memory will be tracked and/or deallocated as
    and when required. That's perfectly feasible.

    Although in my experience, code that uses longjmp
    (say instead of C++ exceptions) is performance
    sensitive and performance sensitive code doesn't
    do dynamic allocation (i.e. high-performance
    C++ code won't use the standard C++ library
    functionality that requires dynamic allocation).


    Or longjump out of 'signal' handlers.

    Again, one must take the appropriate care. Such as
    using the correct API (e.g. POSIX siglongjmp(2)).

    It is quite common to use siglongjmp to leave
    a SIGINT (Control-C) handler.

    https://pubs.opengroup.org/onlinepubs/9799919799/functions/siglongjmp.html
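
    A minimal sketch of that Control-C pattern, using only the standard POSIX
    calls (illustrative structure, not anyone's production code):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static sigjmp_buf jb;

    static void on_sigint(int sig)
    {
        (void)sig;
        siglongjmp(jb, 1);              /* also restores the saved signal mask */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigint;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);

        if (sigsetjmp(jb, 1) == 0) {    /* nonzero 2nd arg: save the signal mask */
            puts("press Control-C to interrupt");
            for (;;)
                pause();                /* "work" that the handler abandons */
        }
        puts("back at top level after SIGINT");
        return 0;
    }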
    --- Synchronet 3.21a-Linux NewsLink 1.2