So... a strategy could have been to establish the concept with
minicomputers, to make money (the VAX sold big) and then move
aggressively towards microprocessors, trying the disruptive move towards workstations within the same company (which would be HARD).
As for the PC - a scaled-down, cheap, compatible, multi-cycle per
instruction microprocessor could have worked for that market,
but it is entirely unclear to me what this would / could have done to
the PC market, if IBM could have been prevented from gaining such market dominance.
A bit like the /360 strategy, offering a wide range of machines (or CPUs
and systems) with different performance.
DEC was bigger in the minicomputer market. If DEC could have offered
an open-standard machine, that could have offered serious competition
to IBM. But what OS would they have used? They were still dominated
by Unix-haters then.
On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The claim by John Savard was that the VAX "was a good match to the
technology *of its time*". It was not. It may have been a good
match for the beliefs of the time, but that's a different thing.
The evidence of the 801 is that it did not deliver until more than a decade
later. And the variant that delivered was quite different from the original
801.
Actually, it can be argued that the 801 didn't deliver until more than 15
years later.
[RISC] didn't really make sense until main
memory systems got a lot faster.
BGB <cr88192@gmail.com> writes:
counter-argument to ILP64, where the more natural alternative is LP64.
I am curious what makes you think that I32LP64 is "more natural",
given that C is a human creation.
ILP64 is more consistent with the historic use of int: int is the
integer type corresponding to the unnamed single type of B
(predecessor of C), which was used for both integers and pointers.
You can see that in various parts of C, e.g., in the integer type
promotion rules (all integers are promoted at least to int in any
case, beyond that only when another bigger integer is involved).
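A small illustration of that promotion rule (a sketch; the behaviour is
the same under I32LP64 and a hypothetical ILP64, only the width of int
differs):

#include <stdio.h>

int main(void)
{
    unsigned char a = 200, b = 100;
    printf("%zu\n", sizeof(a + b)); /* sizeof(int): both operands are promoted */
    printf("%d\n", a + b);          /* prints 300, not 44: no 8-bit wrap-around */
    return 0;
}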
Another example is
main(argc, argv)
char *argv[];
{
    return 0;
}
Here the return type of main() defaults to int, and the type of argc
defaults to int.
As a consequence, one should be able to cast int->pointer->int and pointer->int->pointer without loss. That's not the case with I32LP64.
It is the case for ILP64.
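A self-contained sketch of that round-trip (intptr_t is used only to
keep the narrowing cast explicit; the comment states the assumption):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int x = 42;
    void *p1 = &x;
    /* Under ILP64 (sizeof(int) == sizeof(void *)) this round-trip is
       lossless; under I32LP64 the cast to int drops the upper 32 bits
       of the pointer, so p2 may no longer equal p1. */
    int   i  = (int)(intptr_t)p1;    /* pointer -> int */
    void *p2 = (void *)(intptr_t)i;  /* int -> pointer */
    printf("round-trip %s\n", p1 == p2 ? "preserved" : "lost");
    return 0;
}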
Some people conspired in 1992 to set the de-facto standard, and made
the mistake of deciding on I32LP64 <https://queue.acm.org/detail.cfm?id=1165766>, and we have paid for
this mistake ever since, one way or the other.
E.g., the designers of ARM A64 included addressing modes for using
32-bit indices (but not 16-bit indices) into arrays. The designers of
RV64G added several sign-extending 32-bit instructions (ending in
"W"), but not corresponding instructions for 16-bit operations. The
RISC-V manual justifies this with
|A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
|and shifts to ensure reasonable performance for 32-bit values.
Why were 32-bit indices and 32-bit operations more important than
16-bit indices and 16-bit operations? Because with 32-bit int, every
integer type is automatically promoted to at least 32 bits.
Likewise, with ILP64 the size of integers in computations would always
be 64 bits, and many scalar variables (of type int and unsigned) would
also be 64 bits. As a result, 32-bit indices and 32-bit operations
would be rare enough that including these addressing modes and
instructions would not be justified.
But, you might say, what about memory usage? We would use int32_t
where appropriate in big arrays and in fields of structs/classes with
many instances. We would access these array elements and fields with
LW/SW on RV64G and the corresponding instructions on ARM A64, no need
for the addressing modes and instructions mentioned above.
So the addressing mode bloat of ARM A64 and the instruction set bloat
of RV64G that I mentioned above is courtesy of I32LP64.
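A minimal C sketch of that storage/computation split under a
hypothetical ILP64 model (struct and function are made up for
illustration; int64_t stands in for the 64-bit int):

#include <stddef.h>
#include <stdint.h>

struct particle {           /* many instances: keep the fields narrow */
    int32_t x, y, z;
    int32_t charge;
};

int64_t total_charge(const struct particle *p, size_t n)
{
    int64_t sum = 0;                /* computation in 64-bit integers */
    for (size_t i = 0; i < n; i++)  /* 64-bit index: no 32-bit "W" ops needed */
        sum += p[i].charge;         /* 32-bit load (LW and friends), widened on load */
    return sum;
}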
- anton
In any case, RISCs delivered, starting in 1986.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine, ...
De Castro had had a big success with a simple load-store
architecture, the Nova. He did that to reduce CPU complexity
and cost, to compete with DEC and its PDP-8. (Byte addressing
was horrible on the Nova, though).
Now, assume that, as a time traveler wanting to kick off an early
RISC revolution, you are not allowed to reveal that you are a time
traveler (which would have larger effects than just a different
computer architecture). What do you do?
a) You go to DEC
b) You go to Data General
c) You found your own company
Michael S wrote:
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to
zeroize EDX at the beginning of the iteration. Or am I missing
something?
No, you are not. I skipped pretty much all the setup code. :-)
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort
of dependency in the renamer. Probably not yet when it is coded as
'inc', but when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next
value of [rdi+rcx*8] does depend on the value of rbx from the previous
iteration, but the next value of rbx depends only on [rsi+rcx*8],
[r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous
value of rbx, except for a control dependency that hopefully would
be speculated around.
I believe we are doing a bigint three-way add, so each result word
depends on the three corresponding input words, plus any carries
from the previous round.
This is the carry chain that I don't see any obvious way to
break...
You break the chain by *predicting* that
carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you
pay a heavy price of branch misprediction. But outside of specially
crafted inputs it is extremely rare.
Aha!
That's _very_ nice.
Terje
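A C-level sketch of the trick Michael S describes, assuming 64-bit
limbs; the rarely-taken branch in the middle is the 'misprediction'
path, which is what lets the hardware speculate past the carry chain:

#include <stddef.h>
#include <stdint.h>

void add3_words(uint64_t *dst, const uint64_t *a,
                const uint64_t *b, const uint64_t *c, size_t n)
{
    unsigned carry = 0;                /* 0..2 carries into this limb */
    for (size_t i = 0; i < n; i++) {
        uint64_t s1 = a[i] + b[i];
        unsigned cy = s1 < a[i];       /* carry out of the first add */
        uint64_t s2 = s1 + c[i];
        cy += s2 < s1;                 /* carry out of the second add */
        uint64_t s3 = s2 + carry;      /* fold in the incoming carries */
        if (s3 < s2)                   /* rare: adding the carry wrapped around */
            cy += 1;
        dst[i] = s3;
        carry = cy;
    }
}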
In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
E.g., the designers of ARM A64 included addressing modes for using
32-bit indices (but not 16-bit indices) into arrays. The designers of
RV64G added several sign-extending 32-bit instructions (ending in
"W"), but not corresponding instructions for 16-bit operations. The
RISC-V manual justifies this with
|A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
|and shifts to ensure reasonable performance for 32-bit values.
Why were 32-bit indices and 32-bit operations more important than
16-bit indices and 16-bit operations? Because with 32-bit int, every
integer type is automatically promoted to at least 32 bits.
Objectively, a lot of programs fit into a 32-bit address space and
may wish to run as 32-bit code for increased performance. Code
that fits into 16-bit address space is rare enough on 64-bit
machines to ignore.
Likewise, with ILP64 the size of integers in computations would always
be 64 bits, and many scalar variables (of type int and unsigned) would
also be 64 bits. As a result, 32-bit indices and 32-bit operations
would be rare enough that including these addressing modes and
instructions would not be justified.
But, you might say, what about memory usage? We would use int32_t
where appropriate in big arrays and in fields of structs/classes with
many instances. We would access these array elements and fields with
LW/SW on RV64G and the corresponding instructions on ARM A64, no need
for the addressing modes and instructions mentioned above.
So the addressing mode bloat of ARM A64 and the instruction set bloat
of RV64G that I mentioned above is courtesy of I32LP64.
It is more complex. There are machines on the market with 64 MB
RAM and a 64-bit RISC-V processor. There are (or were) machines
with 512 MB RAM and a 64-bit ARM processor. On such machines it
is quite natural to use 32-bit pointers. With 32-bit pointers
there is the possibility to use existing 32-bit code. And
ILP32 is the natural model.
You can say that 32-bit pointers on 64-bit hardware are rare.
But we really do not know. And especially in embedded space one
big customer may want a feature, and the vendor, to avoid fragmentation,
provides that feature to everyone.
Why does such code need 32-bit addressing? Well, if enough parts of
C were undefined, the compiler could just extend everything during
load to 64 bits. So equally well you can claim that the real problem
is that the C standard should have more undefined behaviour.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine, ...
The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
but I would be surprised if anyone had written a C compiler for it.
It was bit addressable but memories in those days were so small that a full bit
address was only 24 bits. So if I were writing a C compiler, pointers and ints
would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely unrelated.)
It was bit addressable but memories in those days were so small that a full bit
address was only 24 bits. So if I were writing a C compiler, pointers and ints
would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely unrelated.)
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few
bytes.
According to Peter Flass <Peter@Iron-Spring.com>:
It was bit addressable but memories in those days were so small that a full bit
address was only 24 bits. So if I were writing a C compiler, pointers and ints
would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely unrelated.)
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few bytes.
STRETCH had a severe case of second system syndrome, and was full of
complex features that weren't worth the effort, but it was impressive
that IBM got it to work and to run as fast as it did.
In that era memory was expensive, and usually measured in K, not M.
The idea was presumably to pack data as tightly as possible.
In the 1970s I briefly used a B1700 which was bit addressable and had reloadable
microcode so COBOL programs used the COBOL instruction set, FORTRAN programs
used the FORTRAN instruction set, and so forth, with each one having whatever
word or byte sizes they wanted. In retrospect it seems like a lot of
premature optimization.
For comparison:
SPARC: Berkeley RISC research project between 1980 and 1984; <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
in May 1982 (but could only run at 0.5MHz). No date for the completion
of RISC-II, but given that the research project ended in 1984, it was probably at that time. Sun developed Berkeley RISC into SPARC, and the
first SPARC machine, the Sun-4/260, appeared in July 1987 with a
16.67MHz processor.
CP/M owes a lot to the DEC lineage, although it dispenses with some
of the more tedious mainframe-isms - e.g. the RUN [program]
[parameters] syntax vs. just treating executable files on disk as
commands in themselves.)
On Wed, 6 Aug 2025 08:28:03 -0700, John Ames wrote:
CP/M owes a lot to the DEC lineage, although it dispenses with some
of the more tedious mainframe-isms - e.g. the RUN [program]
[parameters] syntax vs. just treating executable files on disk as
commands in themselves.)
It added its own misfeatures, though. Like single-letter device names,
but only for disks. Non-file-structured devices were accessed via “reserved” file names, which continue to bedevil Microsoft Windows to this day, aggravated by a totally perverse extension of the concept to
paths with hierarchical directory names.
["Followup-To:" header set to comp.arch.]
On 2025-08-06, John Levine <johnl@taugh.com> wrote:
AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
Cray-1 and successors implemented, as far as I can determine
type        bits
char           8
short int     64
int           64
long int      64
pointer       64
Not having a 16-bit integer type and not having a 32-bit integer type
would make it very hard to adapt portable code, such as TCP/IP protocol
processing.
I'd think this was obvious, but if the code depends on word sizes and doesn't
declare its variables to use those word sizes, I don't think "portable" is the
right term.
My concern is how do you express your desire for having e.g. an int16?
All the portable code I know defines int8, int16, int32 by means of a
typedef that adds an appropriate alias for each of these back to a
native type. If "short" is 64 bits, how do you define a 16-bit type?
Or did the compiler have native types __int16 etc?
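The usual pre-<stdint.h> idiom being asked about looks roughly like
this (the platform macros and typedef names are illustrative); the Cray
case is exactly where it breaks down:

#if defined(PLATFORM_I32LP64)       /* hypothetical platform selector */
typedef signed char int8;
typedef short       int16;
typedef int         int32;
typedef long        int64;
#elif defined(PLATFORM_CRAY)        /* char 8 bits; short/int/long all 64 bits */
typedef signed char int8;
/* no native 16-bit or 32-bit type to alias -- the problem raised above */
typedef long        int64;
#endif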
Thomas Koenig <tkoenig@netcologne.de> writes:
De Castro had had a big success with a simple load-store
architecture, the Nova. He did that to reduce CPU complexity
and cost, to compete with DEC and its PDP-8. (Byte addressing
was horrible on the Nova, though).
The PDP-8, and its 16-bit followup, the Nova, may be load/store, but
they are neither register machines nor byte-addressed, while the PDP-11
is both, and the RISC-VAX would be, too.
Now, assume that, as a time traveler wanting to kick off an early
RISC revolution, you are not allowed to reveal that you are a time
traveler (which would have larger effects than just a different
computer architecture). What do you do?
a) You go to DEC
b) You go to Data General
c) You found your own company
Even if I am allowed to reveal that I am a time traveler, that may not
help; how would I prove it?
Yes, convincing people in the mid-1970s to bet the company on RISC is
a hard sell; that's why I asked for "a magic wand that would convince the
DEC management and workforce that I know how to design their next architecture, and how to compile for it" in
<2025Mar1.125817@mips.complang.tuwien.ac.at>.
Some arguments that might help:
- Complexity in CISC and how it breeds complexity elsewhere; e.g., the
  interaction of having more than one data memory access per
  instruction, virtual memory, and precise exceptions.
- How the CDC 6600 achieved performance (pipelining) and how non-complex
  its instructions are.
I guess I would read through RISC-vs-CISC literature before entering
the time machine in order to have some additional arguments.
Concerning your three options, I think it will be a problem in any
case. Data General's first bet was on FHP, a microcoded machine with user-writeable microcode,
so maybe even more in the wrong direction
than VAX; I can imagine a high-performance OoO VAX implementation, but
for an architecture with exposed microcode like FHP an OoO
implementation would probably be pretty challenging. The backup
project that eventually came through was also a CISC.
Concerning founding ones own company, one would have to convince
venture capital, and then run the RISC of being bought by one of the
big players, who buries the architecture. And even if you survive,
you then have to build up the whole thing: production, marketing,
sales, software support, ...
In any case, the original claim was about the VAX, so of course the
question at hand is what DEC could have done instead.
- anton
On Wed, 6 Aug 2025 16:19:11 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to
zeroize EDX at the beginning of the iteration. Or am I missing
something?
No, you are not. I skipped pretty much all the setup code. :-)
It's not the setup code that looks missing to me, but the zeroing of
RDX in the body of the loop.
I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
7543P).
In order to see the effects more clearly I had to modify Anton's function
to one that operates on pointers, because otherwise too much time was
spent at the caller's site copying things around, which made the
measurements too noisy.
void add3(uintNN_t *dst, const uintNN_t *a,
          const uintNN_t *b, const uintNN_t *c) {
    *dst = *a + *b + *c;
}
After the change I saw a significant speed-up on 3 out of 4 platforms.
The only platform where the speed-up was not significant was Skylake,
probably because its rename stage is too narrow to profit from the
change. The widest machine (Raptor Cove) benefited most.
The results appear inconclusive with regard to the question of whether
the dependency between loop iterations is eliminated completely or just
shortened to 1-2 clock cycles per iteration. Even the widest of my
cores is relatively narrow. Considering that my variant of the loop
contains 13 x86-64 instructions and 16 uOps, I am afraid that even the
likes of Apple M4 would be too narrow :(
Here are results in nanoseconds for N=65472
Platform      RC      GM      SK      Z3
clang       896.1  1476.7  1453.2  1348.0
gcc         879.2  1661.4  1662.9  1655.0
x86         585.8  1489.3   901.5   672.0
Terje's     772.6  1293.2  1012.6  1127.0
My          397.5   803.8   965.3   660.0
ADX         579.1  1650.1   728.9   853.0
x86/u2      581.5  1246.2   679.9   584.0
Terje's/u3  503.7   954.3   630.9   755.0
My/u3       266.6   487.2   486.5   440.0
ADX/u8      350.4   839.3   490.4   451.0
'x86' is a variant that was sketched in one of my above
posts. It calculates the sum in two passes over the arrays.
'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
Anton, but unlike his suggestion does it in a loop rather than in a
long straight-line code sequence.
/u2, /u3, /u8 indicate unroll factors of the inner loop.
Frequency:
RC 5.30 GHz (Est)
GM 4.20 GHz (Est)
SK 4.25 GHz
Z3 3.70 GHz
On 8/6/25 10:25, John Levine wrote:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine, ...
The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
but I would be surprised if anyone had written a C compiler for it.
It was bit addressable but memories in those days were so small that a
full bit address was only 24 bits. So if I were writing a C compiler,
pointers and ints would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely
unrelated.)
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few bytes.
Bit addressing, presumably combined with an easy way to mask the
results / pick an arbitrary number of bits less than or equal to register
width, makes it easier to implement compression/decompression/codecs.
However, since the only thing needed to do the same on current CPUs is a single shift after an aligned load, this feature costs far too much in reduced address space compared to what you gain.
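A sketch of that 'load plus shift' on a byte-addressed machine
(assumptions: little-endian host, field width of at most 57 bits, and 8
readable bytes at the load address):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

uint64_t get_bits(const uint8_t *buf, size_t bitpos, unsigned width)
{
    uint64_t w;
    memcpy(&w, buf + bitpos / 8, sizeof w);  /* one 64-bit load */
    w >>= bitpos % 8;                        /* shift the field down to bit 0 */
    return w & ((1ULL << width) - 1);        /* mask off the higher bits */
}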
It's a 32 bit architecture with 31 bit addressing, kludgily extended
from 24 bit addressing in the 1970s.
Peter Flass <Peter@Iron-Spring.com> writes:
[IBM STRETCH bit-addressable]
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too
One might come to think that it's the signature of overambitious
projects that eventually fail.
However, in the case of the IBM STRETCH, I think there's a good
excuse: If you go from word addressing to subunit addressing (not sure
why Stretch went there, however; does a supercomputer need that?), why
stop at characters (especially given that character size at the time
was still not settled)? Why not continue down to bits?
The S/360 then found the compromise that conquered the world: Byte
addressing with 8-bit bytes.
Why iAPX432 went for bit addressing at a time when byte addressing and
the 8-bit byte were firmly established, over ten years after the S/360
and 5 years after the PDP-11, is a mystery, however.
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few
bytes.
Bit addressing, presumably combined with an easy way to mask the
results / pick an arbitrary number of bits less than or equal to register
width, makes it easier to implement compression/decompression/codecs.
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently disappeared
from the Usenet?
However, since the only thing needed to do the same on current CPUs is a single shift after an aligned load, this feature costs far too much in reduced address space compared to what you gain.
On 8/7/25 3:48 PM, Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently disappeared
from the Usenet?
No, I do not. And I am worried.
brian
Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently
disappeared from the Usenet?
I've been in contact; he lost his usenet provider, and the one I am
using does not seem to accept new registrations any longer.
Terje
On Fri, 8 Aug 2025 11:58:39 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently
disappeared from the Usenet?
I've been in contact,
Good.
he lost his usenet provider,
Terje
I was suspecting as much. What made me worry is that almost on the
same date he stopped posting on the RWT forum.
and the one I am using does not seem to accept new registrations
any longer.
Eternal September does not accept new registrations?
I think, if it is true, Ray Banana will make an exception for Mitch if
asked personally.
Michael S <already5chosen@yahoo.com> writes:
On Fri, 8 Aug 2025 11:58:39 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently
disappeared from the Usenet?
I've been in contact,
Good.
he lost his usenet provider,
Terje
I was suspecting as much. What made me worry is that almost on the
same date he stopped posting on the RWT forum.
and the one I am using does not seem to accept new registrations
any longer.
Eternal September does not accept new registrations?
I think, if it is true, Ray Banana will make an exception for Mitch if
asked personally.
www.usenetserver.com is priced reasonably. I've been using them
for well over a decade now.
Interesting quote that indicates the direction they were looking:
"Many of the instructions in this specification could only
be used by COBOL if 9-bit ASCII were supported. There is currently
no plan for COBOL to support 9-bit ASCII".
"The following goals were taken into consideration when deriving an
address scheme for addressing 9-bit byte strings:"
Fundamentally, 36-bit words ended up being a dead-end.
MAP_32BIT is only used on x86-64 on Linux, and was originally
a performance hack for allocating thread stacks: apparently, it
was cheaper to do a thread switch with a stack below the 4GiB
barrier (sign extension artifact maybe? Who knows...). But it's
no longer required for that. But there's no indication that it
was for supporting ILP32 on a 64-bit system.
MAP_32BIT is only used on x86-64 on Linux, and was originally
a performance hack for allocating thread stacks: apparently, it
was cheaper to do a thread switch with a stack below the 4GiB
barrier (sign extension artifact maybe? Who knows...). But it's
no longer required for that. But there's no indication that it
was for supporting ILP32 on a 64-bit system.
Reading up about x32, it requires quite a bit more than just
allocating everything in the low 2GB.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
MAP_32BIT is only used on x86-64 on Linux, and was originally
a performance hack for allocating thread stacks: apparently, it
was cheaper to do a thread switch with a stack below the 4GiB
barrier (sign extension artifact maybe? Who knows...). But it's
no longer required for that. But there's no indication that it
was for supporting ILP32 on a 64-bit system.
Reading up about x32, it requires quite a bit more than just
allocating everything in the low 2GB.
The primary issue on x86 was with the API definitions. Several
legacy API declarations used signed integers (int) for
address parameters. This limited addresses to 2GB on
a 32-bit system.
https://en.wikipedia.org/wiki/Large-file_support
The Large File Summit (I was one of the Unisys reps at the LFS)
specified a standard way to support files larger than 2GB
on 32-bit systems that used signed integers for file offsets
and file size.
Also, https://en.wikipedia.org/wiki/2_GB_limit
Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared libraries or similar.
But, it is unattractive to have both 32 and 64 bit versions of all the SO's.
In comp.arch BGB <cr88192@gmail.com> wrote:
Also, IIRC, the major point of X32 was that it would narrow pointers and
similar back down to 32 bits, requiring special versions of any shared
libraries or similar.
But, it is unattractive to have both 32 and 64 bit versions of all the SO's.
We have done something similar for years at Red Hat: not X32, but
x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
(which we were) all you have to do is copy all 32-bit libraries from
one repo to the other.
I thought the AArch64 ILP32 design was pretty neat, but no one seems
to have been interested. I guess there wasn't an advantage worth the
effort.
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
To be efficient, a RISC needs a full-width (presumably 32 bit)
external data bus, plus a separate address bus, which should at
least be 26 bits, better 32. A random ARM CPU I looked at at
bitsavers had 84 pins, which sounds reasonable.
Building an ARM-like instead of a 68000 would have been feasible,
but the resulting systems would have been more expensive (the
68000 had 64 pins).
So... a strategy could have been to establish the concept with
minicomputers, to make money (the VAX sold big) and then move
aggressively towards microprocessors, trying the disruptive move
towards workstations within the same company (which would be HARD).
As for the PC - a scaled-down, cheap, compatible, multi-cycle per
instruction microprocessor could have worked for that market,
but it is entirely unclear to me what this would / could
have done to the PC market, if IBM could have been prevented
from gaining such market dominance.
On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:
So... a strategy could have been to establish the concept with
minicomputers, to make money (the VAX sold big) and then move
aggressively towards microprocessors, trying the disruptive move towards
workstations within the same company (which would be HARD).
None of the companies which tried to move in that direction were
successful. The mass micro market had much higher volumes and lower
margins, and those accustomed to lower-volume, higher-margin operation
simply couldn’t adapt.
I thought the AArch64 ILP32 design was pretty neat, but no one seems
to have been interested. I guess there wasn't an advantage worth the
effort.
Alpha: On Digital OSF/1 the advantage was to be able to run programs
that work on ILP32, but not I32LP64.
x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
and unmaintained ones did not get an x32 port anyway. And if there
are cases where my expectations do not hold, there still is i386. The
only advantage of x32 was a speed advantage on select programs.
That's apparently not enough to gain a critical mass of x32 programs.
Aarch64-ILP32: My guess is that the situation is very similar to the
x32 situation.
Admittedly, there are CPUs without ARM A32/T32
Thomas Koenig <tkoenig@netcologne.de> writes:
To be efficient, a RISC needs a full-width (presumably 32 bit)
external data bus, plus a separate address bus, which should at
least be 26 bits, better 32. A random ARM CPU I looked at at
bitsavers had 84 pins, which sounds reasonable.
Building an ARM-like instead of a 68000 would have been feasible,
but the resulting systems would have been more expensive (the
68000 had 64 pins).
One could have done a RISC-VAX microprocessor with 16-bit data bus and
24-bit address bus.
Thomas Koenig <tkoenig@netcologne.de> writes:
<snip>
So how could one capture the PC market? The RISC-VAX would probably
have been too expensive for a PC, even with an 8-bit data bus and a
reduced instruction set, along the lines of RV32E. Or maybe that
would have been feasible, in which case one would provide
8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
porting easier. And then try to sell it to IBM Boca Raton.
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
That may be the interface of the C system call wrapper, which reports
errors by returning -1 and setting errno, but at the actual system call
level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those
architectures where the sign is used, mmap(2) cannot return negative
addresses, or must have a special wrapper.
Let's look at what the system call wrappers do on RV64G(C) (which has
no carry flag). For read(2) the wrapper contains:
0x3ff7f173be <read+20>: ecall
0x3ff7f173c2 <read+24>: lui a5,0xfffff
0x3ff7f173c4 <read+26>: mv s0,a0
0x3ff7f173c6 <read+28>: bltu a5,a0,0x3ff7f1740e <read+100>
For dup(2) the wrapper contains:
0x3ff7e7fe9a <dup+2>: ecall
0x3ff7e7fe9e <dup+6>: lui a7,0xfffff
0x3ff7e7fea0 <dup+8>: bltu a7,a0,0x3ff7e7fea6 <dup+14>
and for mmap(2):
0x3ff7e86b6e <mmap64+12>: ecall
0x3ff7e86b72 <mmap64+16>: lui a5,0xfffff
0x3ff7e86b74 <mmap64+18>: bltu a5,a0,0x3ff7e86b8c <mmap64+42>
So instead of checking for the sign flag, on RV64G the wrapper checks
if the result is >0xfffffffffffff000. This costs one instruction more
than just checking the sign flag, and allows almost doubling the
number of bytes read(2) can read in one call, the number of file ids
that can be returned by dup(2), and the address range returnable by
mmap(2). Will we ever see processes that need more than 8EB? Maybe
not, but the designers of the RV64G(C) ABI obviously did not want to
be the ones that are quoted as saying "8EB should be enough for
anyone":-).
Followups to comp.arch
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
That may be the interface of the C system call wrapper,
It _is_ the interface that the programmers need to be
concerned with when using POSIX C language bindings.
at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those
architectures where the sign is used, mmap(2) cannot return negative
addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed?
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
Clearly POSIX defines the interfaces and the underlying OS and/or
library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
<snip>
So how could one capture the PC market? The RISC-VAX would probably
have been too expensive for a PC, even with an 8-bit data bus and a
reduced instruction set, along the lines of RV32E. Or maybe that
would have been feasible, in which case one would provide
8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
porting easier. And then try to sell it to IBM Boca Raton.
https://en.wikipedia.org/wiki/Rainbow_100
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Building an ARM-like instead of a 68000 would have been feasible,
but the resulting systems would have been more expensive (the
68000 had 64 pins).
One could have done a RISC-VAX microprocessor with 16-bit data bus and
24-bit address bus.
LSI11?
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
That may be the interface of the C system call wrapper,
It _is_ the interface that the programmers need to be
concerned with when using POSIX C language bindings.
True, but not relevant for the question at hand.
at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those
architectures where the sign is used, mmap(2) cannot return negative
addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed?
The actual system call returns an error flag and a register. On some
architectures, they support just a register. If there is no error,
the wrapper returns the content of the register. If the system call
indicates an error, you see from the value of the register which error
it is; the wrapper then typically transforms the register in some way
(e.g., by negating it) and stores the result in errno, and returns -1.
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
<snip>
So how could one capture the PC market? The RISC-VAX would probably
have been too expensive for a PC, even with an 8-bit data bus and a
reduced instruction set, along the lines of RV32E. Or maybe that
would have been feasible, in which case one would provide
8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
porting easier. And then try to sell it to IBM Boca Raton.
https://en.wikipedia.org/wiki/Rainbow_100
That's completely different from what I suggest above, and DEC
obviously did not capture the PC market with that.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
aph@littlepinkcloud.invalid writes:
I thought the AArch64 ILP32 design was pretty neat, but no one seems
to have been interested. I guess there wasn't an advantage worth the
effort.
Alpha: On Digital OSF/1 the advantage was to be able to run programs
that work on ILP32, but not I32LP64.
I understand what you're saying here, but disagree. A program that
works on ILP32 but not I32LP64 is fundamentally broken, IMHO.
x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
and unmaintained ones did not get an x32 port anyway. And if there
are cases where my expectations do not hold, there still is i386. The
only advantage of x32 was a speed advantage on select programs.
I suspect that the performance advantage was minimal; the primary advantage would
have been that existing applications didn't need to be rebuilt
and requalified.
Aarch64-ILP32: My guess is that the situation is very similar to the
x32 situation.
In the early days of AArch64 (2013), we actually built a toolchain to support
Aarch64-ILP32. Not a single customer exhibited _any_ interest in that
and the project was dropped.
Admittedly, there are CPUs without ARM A32/T32
Very few AArch64 designs included AArch32 support; even the Cortex
chips supported it only at exception level zero (user mode).
The markets for AArch64 (servers, high-end appliances) didn't have
a huge existing reservoir of 32-bit ARM applications, so there was
no demand to support them.
scott@slp53.sl.home (Scott Lurndal) writes:
[snip]
errno, but at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those
architectures where the sign is used, mmap(2) cannot return negative
addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed? The
return value from the kernel should be passed through to
the application as per the POSIX language binding requirements.
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
Clearly POSIX defines the interfaces and the underlying OS and/or
library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers,
except in the case where it doesn't provide the correct semantics.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
That may be the interface of the C system call wrapper,
It _is_ the interface that the programmers need to be
concerned with when using POSIX C language bindings.
True, but not relevant for the question at hand.
at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those
architectures where the sign is used, mmap(2) cannot return negative
addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed?
The actual system call returns an error flag and a register. On some
architectures, they support just a register. If there is no error,
the wrapper returns the content of the register. If the system call
indicates an error, you see from the value of the register which error
it is; the wrapper then typically transforms the register in some way
(e.g., by negating it) and stores the result in errno, and returns -1.
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
Clearly POSIX defines the interfaces and the underlying OS and/or
library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers
Sure, but system calls are first introduced in real kernels using the
actual system call interface, and are limited by that interface. And
that interface is remarkably similar between the early days of Unix
and recent Linux kernels for various architectures.
And when you look
closely, you find how the system calls are designed to support returning
the error indication, success value, and errno in one register.
lseek64 on 32-bit platforms is an exception (the success value does
not fit in one register), and looking at the machine code of the
wrapper and comparing it with the machine code for the lseek wrapper,
some funny things are going on, but I would have to look at the source
code to understand what is going on. One other interesting thing I
noticed is that the system call wrappers from libc-2.36 on i386 now
draw the boundary between success returns and error returns at
0xfffff000:
0xf7d853c4 <lseek+68>: call *%gs:0x10
0xf7d853cb <lseek+75>: cmp $0xfffff000,%eax
0xf7d853d0 <lseek+80>: ja 0xf7d85410 <lseek+144>
So now the kernel can produce 4095 error values, and the rest can be
success values. In particular, mmap() can return all possible page
addresses as success values with these wrappers. When I last looked
at how system calls are done, I found just a check of the N or the C
flag.
I wonder how the kernel is informed that it can now return more
addresses from mmap().
[snip]
all that said, my initial point about -1 was that applications
should always check for -1 (or MAP_FAILED), not for return
values less than zero. The actual kernel interface to the
C library is clearly implementation dependent although it
must preserve the user-visible required semantics.
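A short example of the checks being advocated here, comparing against
the documented sentinel values rather than '< 0' (function and variable
names are illustrative):

#include <sys/mman.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int demo_checks(int fd, size_t len)
{
    off_t pos = lseek(fd, 0, SEEK_END);
    if (pos == (off_t)-1) {            /* not: pos < 0 */
        fprintf(stderr, "lseek: %s\n", strerror(errno));
        return -1;
    }
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {             /* the only documented error value */
        fprintf(stderr, "mmap: %s\n", strerror(errno));
        return -1;
    }
    munmap(p, len);
    return 0;
}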
In article <MO1nQ.2$Bui1.0@fx10.iad>, Scott Lurndal <slp53@pacbell.net> wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
For mmap, at least the only documented error return value is
`MAP_FAILED`, and programmers must check for that explicitly.
It strikes me that this implies that the _value_ of `MAP_FAILED`
need not be -1; on x86_64, for instance, it _could_ be any
non-canonical address.
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset would
overflow, which is why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple
environments, in which overflow is potentially unavoidable.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
The point is that the programmer shouldn't have to care.
Sure, but system calls are first introduced in real kernels using the
actual system call interface, and are limited by that interface. And
that interface is remarkably similar between the early days of Unix
and recent Linux kernels for various architectures.
Not precisely. On x86_64, for example, some Unixes use a flag
bit to determine whether the system call failed, and return
(positive) errno values; Linux returns negative numbers to
indicate errors, and constrains those to values between -4095
and -1.
Presumably that specific set of values is constrained by `mmap`:
assuming a minimum 4KiB page size, the last architecturally
valid address where a page _could_ be mapped is equivalent to
-4096 and the first is 0. If they did not have that constraint,
they'd have to treat `mmap` specially in the system call path.
I wonder how the kernel is informed that it can now return more
addresses from mmap().
Assuming you mean the Linux kernel, when it loads an ELF
executable, the binary image itself is "branded" with an ABI
type that it can use to make that determination.
I am pretty sure that in the old times, Linux-i386 indicated failure
by returning a value with the MSB set, and the wrapper just checked
whether the return value was negative.
Bottom line: If Linux-i386 ever had a different way of determining
whether a system call has an error result, it was changed to the
current way early on. Given that IIRC I looked into that later than
in 2000, my memory is obviously not of Linux. I must have looked at
source code for a different system.
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
I don't think that this is accidental. In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
compilers of the day.
Nowadays the exotic hardware where this would
not work that way has almost completely died out (and C is not used on
the remaining exotic hardware),
but now compilers sometimes do funny
things on integer overflow, so better don't go there or anywhere near
it.
But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple
environments, in which overflow is potentially unavoidable.
POSIX also has the EOVERFLOW error for exactly that case.
Bottom line: The off_t returned by lseek(2) is signed and always
positive.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
The point is that the programmer shouldn't have to care.
True, but completely misses the point.
Sure, but system calls are first introduced in real kernels using the
actual system call interface, and are limited by that interface. And
that interface is remarkably similar between the early days of Unix
and recent Linux kernels for various architectures.
Not precisely. On x86_64, for example, some Unixes use a flag
bit to determine whether the system call failed, and return
(positive) errno values; Linux returns negative numbers to
indicate errors, and constrains those to values between -4095
and -1.
Presumably that specific set of values is constrained by `mmap`:
assuming a minimum 4KiB page size, the last architecturally
valid address where a page _could_ be mapped is equivalent to
-4096 and the first is 0. If they did not have that constraint,
they'd have to treat `mmap` specially in the system call path.
I am pretty sure that in the old times, Linux-i386 indicated failure
by returning a value with the MSB set, and the wrapper just checked
whether the return value was negative. And for mmap() that worked
because user-mode addresses were all below 2GB. Addresses further up
were reserved for the kernel.
I wonder how the kernel is informed that it can now return more
addresses from mmap().
Assuming you mean the Linux kernel, when it loads an ELF
executable, the binary image itself is "branded" with an ABI
type that it can use to make that determination.
I have checked that with binaries compiled in 2003 and 2000:
-rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
-rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*
[~:160080] file /usr/local/bin/gforth-0.5.0
/usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
[~:160081] file /usr/local/bin/gforth-0.6.2
/usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for
GNU/Linux 2.0.0, stripped
So there is actually a difference between these two. However, if I
just strace them as they are now, they both happily produce very high
addresses with mmap, e.g.,
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000
I don't know what the difference is between "for GNU/Linux 2.0.0" and
not having that,
but the addresses produced by mmap() seem unaffected.
However, by calling the binaries with setarch -L, mmap() returns only
addresses < 2GB in all calls I have looked at. I guess if I had
statically linked binaries, i.e., with old system call wrappers, I
would have to use
setarch -L <binary>
to make it work properly with mmap(). Or maybe Linux is smart enough
to do it by itself when it encounters a statically-linked old binary.
In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
I don't think that this is accidental. In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
compilers of the day.
This is simply not true. If anything, there was more variety of
hardware supported by C90, and some of those systems were 1's
complement or sign/mag, not 2's complement. Consequently,
signed integer overflow has _always_ had undefined behavior in
ANSI/ISO C.
However, conversion from signed to unsigned has always been
well-defined, and follows effectively 2's complement semantics.
Conversion from unsigned to signed is a bit more complex, and is
implementation defined, but not UB. Given that the system call
interface is necessarily deeply intwined with the implementation
I see no reason why the semantics of signed overflow should be
an issue here.
Nowadays the exotic hardware where this would
not work that way has almost completely died out (and C is not used on
the remaining exotic hardware),
If by "C is not used" you mean newer editions of the C standard
are not used on very old computers with strange representations
of signed integers, then maybe.
but now compilers sometimes do funny
things on integer overflow, so better don't go there or anywhere near
it.
This isn't about signed overflow. The issue here is conversion
of an unsigned value to signed; almost certainly, the kernel
performs the calculation of the actual file offset using
unsigned arithmetic, and relies on the (assembler, mind you)
system call stubs to map those to the appropriate userspace
type.
I think this is mostly irrelevant, as the system call stub,
almost by necessity, must be written in assembler in order to
have precise control over the use of specific registers and so
on. From C's perspective, a program making a system call just
calls some function that's defined to return a signed integer;
the assembler code that swizzles the register that integer will
be extracted from sets things up accordingly. In other words,
the conversion operation that the C standard mentions isn't at
play, since the code that does the "conversion" is in assembly.
Again from C's perspective the return value of the syscall stub
function is already signed with no need of conversion.
No, for `lseek`, the POSIX rationale explains the reasoning here
quite clearly: the 1990 standard permitted negative offsets, and
programs were expected to accommodate this by special handling
of `errno` before and after calls to `lseek` that returned
negative values. This was deemed onerous and fragile, so they
modified the standard to prohibit calls that would result in
negative offsets.
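For concreteness, the pre-2024 idiom described above looks roughly like
this (a sketch only; careful_lseek is a name I am introducing, and the
errno dance is only needed where a negative offset can be legitimate):

#include <unistd.h>
#include <errno.h>

off_t careful_lseek(int fd, off_t offset, int whence)
{
    errno = 0;
    off_t pos = lseek(fd, offset, whence);
    if (pos == (off_t)-1 && errno != 0)
        return (off_t)-1;   /* a real error; errno says which one */
    return pos;             /* possibly a legitimate, even negative, offset */
}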
But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple
environments, in which overflow is potentially unavoidable.
POSIX also has the EOVERFLOW error for exactly that case.
Bottom line: The off_t returned by lseek(2) is signed and always
positive.
As I said earlier, post POSIX.1-1990, this is true.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns.  E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
The point is that the programmer shouldn't have to care.
True, but completely misses the point.
I don't see why.  You were talking about the system call stubs,
which run in userspace, and are responsible for setting up state
so that the kernel can perform some requested action on entry,
whether by trap, call gate, or special instruction, and then for
tearing down that state and handling errors on return from the
kernel.
For mmap, there is exactly one value that may be returned from
its stub that indicates an error; any other value, by
definition, represents a valid mapping. Whether such a mapping
falls in the first 2G, 3G, anything except the upper 256MiB, or
some hole in the middle is the part that's irrelevant, and
focusing on that misses the main point: all the stub has to do
is detect the error, using whatever convention the kernel
specifies for communicating such things back to the program, and
ensure that in an error case, MAP_FAILED is returned from the
stub and `errno` is set appropriately. Everything else is
superfluous.
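A minimal sketch of such a stub, assuming the Linux convention of the raw
call returning -errno in the range [-4095, -1] (raw_mmap is a hypothetical
stand-in for the actual trap into the kernel, not a real libc function):

#include <sys/mman.h>
#include <errno.h>

extern long raw_mmap(void *addr, unsigned long len, int prot,
                     int flags, int fd, long off);

void *mmap_stub(void *addr, unsigned long len, int prot,
                int flags, int fd, long off)
{
    long ret = raw_mmap(addr, len, prot, flags, fd, off);
    if ((unsigned long)ret >= (unsigned long)-4095) {   /* error range */
        errno = (int)-ret;
        return MAP_FAILED;
    }
    return (void *)ret;   /* any other value is a valid mapping address */
}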
Sure, but system calls are first introduced in real kernels using the
actual system call interface, and are limited by that interface.  And
that interface is remarkably similar between the early days of Unix
and recent Linux kernels for various architectures.
Not precisely. On x86_64, for example, some Unixes use a flag
bit to determine whether the system call failed, and return
(positive) errno values; Linux returns negative numbers to
indicate errors, and constrains those to values between -4095
and -1.
Presumably that specific set of values is constrained by `mmap`:
assuming a minimum 4KiB page size, the last architecturally
valid address where a page _could_ be mapped is equivalent to
-4096 and the first is 0. If they did not have that constraint,
they'd have to treat `mmap` specially in the system call path.
I am pretty sure that in the old times, Linux-i386 indicated failure
by returning a value with the MSB set, and the wrapper just checked
whether the return value was negative. And for mmap() that worked
because user-mode addresses were all below 2GB.  Addresses further up
were reserved for the kernel.
Define "Linux-i386" in this case. For the kernel, I'm confident
that was NOT the case, and it is easy enough to research, since
old kernel versions are online. Looking at e.g. 0.99.15, one
can see that they set the carry bit in the flags register to
indicate an error, along with returning a negative errno value:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/+/refs/tags/v0.99.15/kernel/sys_call.S
By 2.0, they'd stopped setting the carry bit, though they
continued to clear it on entry.
But remember, `mmap` returns a pointer, not an integer, relying
on libc to do the necessary translation between whatever the
kernel returns and what the program expects. So if the behavior
you describe were anywhere, it would be in libc.  Given that
they have, and had, a mechanism for signaling an error
independent of C already, and necessarily the fixup of the
return value must happen in the syscall stub in whatever library
the system used, relying solely on negative values to detect
errors seems like a poor design decision for a C library.
So if what you're saying were true, such a check would have to
be in the userspace library that provides the syscall stubs; the
kernel really doesn't care. I don't know what version libc
Torvalds started with, or if he did his own bespoke thing
initially or something, but looking at some commonly used C
libraries of a certain age, such as glibc 2.0 from 1997-ish, one
can see that they're explicitly testing the error status against
-4095 (as an unsigned value) in the stub (e.g., in
sysdeps/unix/sysv/linux/i386/syscall.S).
But glibc-1.06.1 is a different story, and _does_ appear to
simply test whether the return value is negative and then jump
to an error handler if so. So mmap may have worked incidentally
due to the restriction on where in the address space it would
place a mapping in very early kernel versions, as you described,
but that's a library issue, not a kernel issue: again, the
kernel doesn't care.
The old version of libc5 available on kernel.org is similar; it
looks like HJ Lu changed the error handling path to explicitly
compare against -4095 in October of 1996.
So, fixed in the most common libc's used with Linux on i386 for
nearly 30 years, well before the existence of x86_64.
I wonder how the kernel is informed that it can now return more
addresses from mmap().
Assuming you mean the Linux kernel, when it loads an ELF
executable, the binary image itself is "branded" with an ABI
type that it can use to make that determination.
I have checked that with binaries compiled in 2003 and 2000:
-rwxr-xr-x 1 root root 44660 Sep 26  2000 /usr/local/bin/gforth-0.5.0*
-rwxr-xr-x 1 root root 92352 Sep  7  2003 /usr/local/bin/gforth-0.6.2*
[~:160080] file /usr/local/bin/gforth-0.5.0
/usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
[~:160081] file /usr/local/bin/gforth-0.6.2
/usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for
GNU/Linux 2.0.0, stripped
So there is actually a difference between these two. However, if I
just strace them as they are now, they both happily produce very high
addresses with mmap, e.g.,
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000
I don't see any reason why it wouldn't.
I don't know what the difference is between "for GNU/Linux 2.0.0" and
not having that,
`file` is pulling that from a `PT_NOTE` segment defined in the
program header for that second file. A better tool for picking
apart the details of those binaries is probably `objdump`.
I'm mildly curious what version of libc those are linked against
(e.g., as reported by `ldd`).
but the addresses produced by mmap() seem unaffected.
I don't see why it would be. Any common libc post 1997-ish
handles errors in a way that permits this to work correctly. If
you tried glibc 1.0, it might be a different story, but the
Linux folks forked that in 1994 and modified it as "Linux libc"
and the
However, by calling the binaries with setarch -L, mmap() returns only
addresses < 2GB in all calls I have looked at.  I guess if I had
statically linked binaries, i.e., with old system call wrappers, I
would have to use
setarch -L <binary>
to make it work properly with mmap(). Or maybe Linux is smart enough
to do it by itself when it encounters a statically-linked old binary.
Unclear without looking at the kernel source code, but possibly.
`setarch -L` turns on the "legacy" virtual address space layout,
but I suspect that the number of binaries that _actually care_
is pretty small, indeed.
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
I don't think that this is accidental.  In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
compilers of the day.
This is simply not true. If anything, there was more variety of
hardware supported by C90, and some of those systems were 1's
complement or sign/mag, not 2's complement. Consequently,
signed integer overflow has _always_ had undefined behavior in
ANSI/ISO C.
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
In article <sknnQ.168942$Bui1.63359@fx10.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
Yup. The 1100-series machines were (are) 1's complement. Those
are the ones I usually think of when cursing that signed integer
overflow is UB in C.
I don't think anyone is compiling C23 code for those machines,
but back in the late 1980s, they were still enough of a going
concern that they could influence the emerging C standard.  Not
so much anymore.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
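As a hedged illustration of the kind of loop optimization usually cited
(my sketch, not an example from the standard): with a 32-bit int index on
an LP64 target, the compiler may widen i to a 64-bit induction variable
and elide per-iteration sign extension precisely because it may assume
i++ never overflows:

long sum(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)   /* UB on overflow => compiler may assume
                                     i stays in [0, n) and never wraps */
        s += a[i];
    return s;
}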
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
On 14.08.2025 17:44, Dan Cross wrote:
In article <sknnQ.168942$Bui1.63359@fx10.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
Yup. The 1100-series machines were (are) 1's complement. Those
are the ones I usually think of when cursing that signed integer
overflow is UB in C.
I don't think anyone is compiling C23 code for those machines,
but back in the late 1980s, they were still enough of a going
concern that they could influence the emerging C standard.  Not
so much anymore.
They would presumably have been part of the justification for supporting
multiple signed integer formats at the time.
UB on signed integer
arithmetic overflow is a different matter altogether.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
The justification for "signed integer arithmetic overflow is UB" is in
the C standards 6.5p5 under "Expressions" :
"""
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or not
in the range of representable values for its type), the behavior is
undefined.
"""
It actually has absolutely nothing to do with signed integer
representation, or machine hardware.
It doesn't even have much to do
with integers at all.  It is simply that if the calculation can't give a
correct answer, then the C standards don't say anything about the
results or effects.
The point is that when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct.  If you add two real-world
positive integers, you don't get a negative integer.
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
Modern C and C++ standards have dropped support for signed integer
representation other than two's complement, because they are not in use
in any modern hardware (including any DSP's) - at least, not for
general-purpose integers.  Both committees have consistently voted to
keep overflow as UB.
According to <aph@littlepinkcloud.invalid>:
In comp.arch BGB <cr88192@gmail.com> wrote:
Also, IIRC, the major point of X32 was that it would narrow pointers and
similar back down to 32 bits, requiring special versions of any shared
libraries or similar.
But, it is unattractive to have both 32 and 64 bit versions of all the SO's.
We have done something similar for years at Red Hat: not X32, but
x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
(which we were) all you have to do is copy all 32-bit libraries from
one repo to the other.
FreeBSD does the same thing. The 32 bit libraries are installed by default on 64 bit systems because, by current standards, they're not very big.
I've stopped installing them because I know I don't have any 32 bit apps
left but on systems with old packages, who knows?
In article <107l5ju$k78a$1@dont-email.me>,
David Brown <david.brown@hesbynett.no> wrote:
On 14.08.2025 17:44, Dan Cross wrote:
In article <sknnQ.168942$Bui1.63359@fx10.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
Yup. The 1100-series machines were (are) 1's complement. Those
are the ones I usually think of when cursing that signed integer
overflow is UB in C.
I don't think anyone is compiling C23 code for those machines,
but back in the late 1980s, they were still enough of a going
concern that they could influence the emerging C standard.  Not
so much anymore.
They would presumably have been part of the justification for supporting
multiple signed integer formats at the time.
C90 doesn't have much to say about this at all, other than
saying that the actual representation and ranges of the integer
types are implementation defined (G.3.5 para 1).
C90 does say that, "The representations of integral types shall
define values by use of a pure binary numeration system" (sec
6.1.2.5).
C99 tightens this up and talks about 2's comp, 1's comp, and
sign/mag as being the permissible representations (J.3.5, para
1).
UB on signed integer
arithmetic overflow is a different matter altogether.
I disagree.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
The justification for "signed integer arithmetic overflow is UB" is in
the C standards 6.5p5 under "Expressions" :
Not in ANSI/ISO 9899-1990. In that revision of the standard,
sec 6.5 covers declarations.
"""
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or not
in the range of representable values for its type), the behavior is
undefined.
"""
In C90, this language appears in sec 6.3 para 5. Note, however,
that they do not define what an exception _is_, only a few
things that _may_ cause one. See below.
It actually has absolutely nothing to do with signed integer
representation, or machine hardware.
Consider this language from the (non-normative) example 4 in sec
5.1.2.3:
|On a machine in which overflows produce an exception and in
|which the range of values representable by an *int* is
|[-32768,+32767], the implementation cannot rewrite this
|expression as [continues with the specifics of the example]....
That seems pretty clear that they're thinking about machines
that actually generate a hardware trap of some kind on overflow.
It doesn't even have much to do
with integers at all. It is simply that if the calculation can't give a
correct answer, then the C standards don't say anything about the
results or effects.
The point is that when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.
Sorry, but I don't buy this argument as anything other than a
justification after the fact. We're talking about history and
motivation here, not the behavior described in the standard.
In particular, C is a programming language for actual machines,
not a mathematical notation; the language is free to define the
behavior of arithmetic expressions in any way it chooses, though
one presumes it would do so in a way that makes sense for the
machines that it targets.
Thus, it could have formalized the
result of signed integer overflow to follow 2's complement
semantics had the committee so chosen, in which case the result
would not be "incorrect", it would be well-defined with respect
to the semantics of the language. Java, for example, does this,
as does C11 (and later) atomic integer operations. Indeed, the
C99 rationale document makes frequent reference to twos
complement, where overflow and modular behavior are frequently
equivalent, being the common case. But aside from the more
recent atomics support, C _chose_ not to do this.
Also, consider that _unsigned_ arithmetic is defined as having
wrap-around semantics similar to modular arithmetic, and thus
incapable of overflow.
But that's simply a fiction invented for
the abstract machine described informally in the standard: it
requires special handling on machines like the 1100 series,
because those machines might trap on overflow. The C committee
could just as well have said that the unsigned arithmetic
_could_ overflow and that the result was UB.
So why did C choose this way?  The only logical reason is that
there were machines at the time where a) integer overflow
caused machine exceptions, and b) the representation of signed
integers was not well-defined, so that the actual value
resulting from overflow could not be rigorously defined. Given
that C90 mandated a binary representation for integers and so
the representation of unsigned integers is basically common,
there was no need to do that for unsigned arithmetic.
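The distinction reads, in code, roughly like this (my sketch; both
functions are hypothetical names):

unsigned int unext(unsigned int x)
{
    return x + 1u;    /* defined for every x: wraps to 0 at UINT_MAX */
}

int snext(int x)
{
    return x + 1;     /* undefined behaviour when x == INT_MAX */
}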
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
Modern C and C++ standards have dropped support for signed integer
representation other than two's complement, because they are not in use
in any modern hardware (including any DSP's) - at least, not for
general-purpose integers. Both committees have consistently voted to
keep overflow as UB.
Yes. As I said, performance is often the justification.
I'm not convinced that there are no custom chips and/or DSPs
that are not manufactured today. They may not be common, their
mere existence is certainly dumb and offensive, but that does
not mean that they don't exist. Note that the survey in, e.g., https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
only mentions _popular_ DSPs, not _all_ DSPs.
Of course, if such machines exist, I will certainly concede that
I doubt very much that anyone is targeting them with C code
written to a modern standard.
David Brown <david.brown@hesbynett.no> schrieb:
The point is that there when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.
I believe it was you who wrote "If you add enough apples to a
pile, the number of apples becomes negative", so there is
clearly a defined physical meaning to overflow.
:-)
Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to
have "typically 97% hit rate".  I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to
have "typically 97% hit rate".  I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
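For a rough feel of the "larger pages" argument above, here is the reach
of a 128-entry TLB at a few page sizes (a sketch; only the 512-byte row
corresponds to the actual VAX configuration being discussed):

#include <stdio.h>

int main(void)
{
    const unsigned entries = 128;
    const unsigned sizes[] = { 512, 2048, 4096, 16384 };   /* bytes per page */

    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        printf("%5u-byte pages: TLB reach = %u KB\n",
               sizes[i], entries * sizes[i] / 1024);
    return 0;
}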
On 14.08.2025 23:44, Dan Cross wrote:
In article <107l5ju$k78a$1@dont-email.me>,
David Brown <david.brown@hesbynett.no> wrote:
[snip]
UB on signed integer
arithmetic overflow is a different matter altogether.
I disagree.
You have overflow when the mathematical result of an operation cannot be
expressed accurately in the type - regardless of the representation
format for the numbers.  Your options, as a language designer or
implementer, of handling the overflow are the same regardless of the
representation.  You can pick a fixed value to return, or saturate, or
invoke some kind of error handler mechanism, or return a "don't care"
unspecified value of the type, or perform a specified algorithm to get a
representable value (such as reduction modulo 2^n), or you can simply
say the program is broken if this happens (it is UB).
I don't see where the representation comes into it - overflow is a
matter of values and the ranges that can be stored in a type, not how
those values are stored in the bits of the data.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
The justification for "signed integer arithmetic overflow is UB" is in
the C standards 6.5p5 under "Expressions" :
Not in ANSI/ISO 9899-1990. In that revision of the standard,
sec 6.5 covers declarations.
"""
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or not
in the range of representable values for its type), the behavior is
undefined.
"""
In C90, this language appears in sec 6.3 para 5. Note, however,
that they do not define what an exception _is_, only a few
things that _may_ cause one. See below.
It's basically the same in C90 onwards, with just small changes to the
wording.  And it /does/ define what is meant by an "exceptional
condition" (or just "exception" in C90) - that is done by the part in
parentheses.
It actually has absolutely nothing to do with signed integer
representation, or machine hardware.
Consider this language from the (non-normative) example 4 in sec
5.1.2.3:
|On a machine in which overflows produce an exception and in
|which the range of values representable by an *int* is
|[-32768,+32767], the implementation cannot rewrite this
|expression as [continues with the specifics of the example]....
That seems pretty clear that they're thinking about machines
that actually generate a hardware trap of some kind on overflow.
They are thinking about that possibility, yes. In C90, the term
"exception" here was not clearly defined - and it is definitely not the
same as the term "exception" in 6.3p5.  The wording was improved in C99
without changing the intended meaning - there the term in the paragraph
under "Expressions" is "exceptional condition" (defined in that
paragraph), while in the example in "Execution environments", it says
"On a machine in which overflows produce an explicit trap". (C11
further clarifies what "performs a trap" means.)
But this is about re-arrangements the compiler is allowed to make, or
barred from making - it can't make re-arrangements that would mean
execution failed when the direct execution of the code according to the
C abstract machine would have worked correctly (without ever having
encountered an "exceptional condition" or other UB).  Representation is
not relevant here - there is nothing about two's complement, ones'
complement, sign-magnitude, or anything else.  Even the machine hardware
is not actually particularly important, given that most processors
support non-trapping integer arithmetic instructions and for those that
don't have explicit trap instructions, a compiler could generate "jump
if overflow flag set" or similar instructions to emulate traps
reasonably efficiently. (Many compilers support that kind of thing as
an option to aid debugging.)
It doesn't even have much to do
with integers at all.  It is simply that if the calculation can't give a
correct answer, then the C standards don't say anything about the
results or effects.
The point is that when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.
Sorry, but I don't buy this argument as anything other than a
justification after the fact. We're talking about history and
motivation here, not the behavior described in the standard.
It is a fair point that I am describing a rational and sensible reason
for UB on arithmetic overflow - and I do not know the motivation of the
early C language designers, compiler implementers, and authors of the
first C standard.
I do know, however, that the principle of "garbage in, garbage out" was
well established long before C was conceived. And programmers of that
time were familiar with the concept of functions and operations being
defined for appropriate inputs, and having no defined behaviour for
invalid inputs. C is full of other things where behaviour is left
undefined when no sensible correct answer can be specified, and that is
not just because the behaviour of different hardware could vary. It
seems perfectly reasonable to me to suppose that signed integer
arithmetic overflow is just another case, no different from
dereferencing an invalid pointer, dividing by zero, or any one of the
other UB's in the standards.
In particular, C is a programming language for actual machines,
not a mathematical notation; the language is free to define the
behavior of arithmetic expressions in any way it chooses, though
one presumes it would do so in a way that makes sense for the
machines that it targets.
Yes, that is true. It is, however, also important to remember that it
was based on a general abstract machine, not any particular hardware,
and that the operations were intended to follow standard mathematics as
well as practically possible - operations and expressions in C were not
designed for any particular hardware.  (Though some design choices were
biased by particular hardware.)
Thus, it could have formalized the
result of signed integer overflow to follow 2's complement
semantics had the committee so chosen, in which case the result
would not be "incorrect", it would be well-defined with respect
to the semantics of the language. Java, for example, does this,
as does C11 (and later) atomic integer operations. Indeed, the
C99 rationale document makes frequent reference to twos
complement, where overflow and modular behavior are frequently
equivalent, being the common case. But aside from the more
recent atomics support, C _chose_ not to do this.
It could have made signed integer overflow defined behaviour, but it did
not.  The C standards committee have explicitly chosen not to do that,
even after deciding that two's complement is the only supported
representation for signed integers in C23 onwards.  It is fine to have
two's complement representation, and fine to have modulo arithmetic in
some circumstances, while leaving other arithmetic overflow undefined.
Unsigned integer operations in C have always been defined as modulo
arithmetic - addition of unsigned values is a different operation from
addition of signed values.  Having some modulo behaviour does not in any
way imply that signed arithmetic should be modulo.
In Java, the language designers decided that integer arithmetic
operations would be modulo operations. Wrapping therefore gives the
correct answer for those operations - it does not give the correct
answer for mathematical integer operations.  And Java loses common
mathematical identities which C retains - such as the identity that
adding a positive integer to another integer will increase its value.
Something always has to be lost when approximating unbounded
mathematical integers in a bounded implementation - I think C made the
right choices here about what to keep and what to lose, and Java made
the wrong choices. (Others may of course have different opinions.)
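One concrete identity of that kind (a sketch of mine, not from the post):
because signed overflow is UB, a C compiler may fold the test below to a
constant 1, whereas the analogous check on a wrapping Java int must be
false for the maximum value:

int stays_larger(int x)
{
    return x + 1 > x;   /* commonly optimized to "return 1;" under C rules */
}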
In Zig, unsigned integer arithmetic overflow is also UB as these
operations are not defined as modulo. I think that is a good natural
choice too - but it is useful for a language to have a way to do
wrapping arithmetic on the occasions you need it.
Also, consider that _unsigned_ arithmetic is defined as having
wrap-around semantics similar to modular arithmetic, and thus
incapable of overflow.
Yes. Unsigned arithmetic operations are different operations from
signed arithmetic operations in C.
But that's simply a fiction invented for
the abstract machine described informally in the standard: it
requires special handling on machines like the 1100 series,
because those machines might trap on overflow. The C committee
could just as well have said that the unsigned arithmetic
_could_ overflow and that the result was UB.
They could have done that (as the Zig folk did).
So why did C choose this way?  The only logical reason is that
there were machines at the time where a) integer overflow
caused machine exceptions, and b) the representation of signed
integers was not well-defined, so that the actual value
resulting from overflow could not be rigorously defined. Given
that C90 mandated a binary representation for integers and so
the representation of unsigned integers is basically common,
there was no need to do that for unsigned arithmetic.
Not at all. Usually when someone says "the only logical reason is...",
they really mean "the only logical reason /I/ can think of is...", or
"the only reason that /I/ can think of that /I/ think is logical is...".
For a language that can be used as a low-level systems language, it is
important to be able to do modulo arithmetic efficiently.  It is needed
for a number of low-level tasks, including the implementation of large
arithmetic operations, handling timers, counters, and other bits and
pieces.  So it was definitely a useful thing to have in C.
For a language that can be used as a fast and efficient application
language, it must have a reasonable approximation to mathematical
integer arithmetic.  Implementations should not be forced to have
behaviours beyond the mathematically sensible answers - if a calculation
can't be done correctly, there's no point in doing it.  Giving nonsense
results does not help anyone - C programmers or toolchain implementers -
so the language should not specify any particular result.  More sensible
defined overflow behaviour - saturation, error values, language
exceptions or traps, etc. - would be very inefficient on most hardware.
So UB is the best choice - and implementations can do something
different if they like.
Too many options make a language bigger - harder to implement, harder to
learn, harder to use.  So it makes sense to have modulo arithmetic for
unsigned types, and normal arithmetic for signed types.
I am not claiming to know that this is the reasoning made by the C
language pioneers. But it is definitely an alternative logical reason
for C being the way it is.
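As one example of the timer/counter use case mentioned above (my sketch,
in the style of the common "time after" idiom; time_after32 is a name I am
introducing):

#include <stdint.h>
#include <stdbool.h>

/* True iff 'now' lies within half the counter range past 'deadline';
   correct across wrap-around because uint32_t subtraction is defined
   modulo 2^32. */
bool time_after32(uint32_t now, uint32_t deadline)
{
    return (uint32_t)(now - deadline) < UINT32_C(0x80000000);
}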
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
Modern C and C++ standards have dropped support for signed integer
representation other than two's complement, because they are not in use
in any modern hardware (including any DSP's) - at least, not for
general-purpose integers. Both committees have consistently voted to
keep overflow as UB.
Yes. As I said, performance is often the justification.
I'm not convinced that there are no custom chips and/or DSPs
that are not manufactured today. They may not be common, their
mere existence is certainly dumb and offensive, but that does
not mean that they don't exist. Note that the survey in, e.g.,
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
only mentions _popular_ DSPs, not _all_ DSPs.
I think you might have missed a few words in that paragraph, but I
believe I know what you intended. There are certainly DSPs and other
cores that have strong support for alternative overflow behaviour -
saturation is very common in DSPs, and it is also common to have a
"sticky overflow" flag so that you can do lots of calculations in a
tight loop, and check for problems once you are finished.  I think it is
highly unlikely that you'll find a core with something other than two's
complement as the representation for signed integer types, though I
can't claim that I know /all/ devices! (I do know a bit about more
cores than would be considered popular or common.)
Of course, if such machines exist, I will certainly concede that
I doubt very much that anyone is targeting them with C code
written to a modern standard.
Modern C is definitely used on DSPs with strong saturation support.
(Even ARM cores have saturated arithmetic instructions.) But they can
also handle two's complement wrapped signed integer arithmetic if the
programmer wants that - after all, it's exactly the same in the hardware
as modulo unsigned arithmetic (except for division). That doesn't mean
that wrapping signed integer overflow is useful or desired behaviour.
On 8/15/2025 11:53 AM, John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to
have "typically 97% hit rate".  I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
Yeah.
Can note in some of my own testing, I tested various page sizes, and seemingly found a local optimum at around 16K.
Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
but 16K to 32K or 64K did not see any significant reduction; but did see
a more significant increase in memory footprint due to allocation
overheads (where, OTOH, going from 4K to 16K pages does not see much increase in memory footprint).
Patterns seemed consistent across multiple programs tested, but harder
to say if this pattern would be universal.
Had noted if running stats on where in the pages memory accesses land:
4K: Pages tend to be accessed fairly evenly
16K: Minor variation as to what parts of the page are being used.
64K: Significant variation between parts of the page.
Basically, tracking per-page memory accesses on a finer grain boundary
(eg, 512 bytes).
Say, for example, at 64K one part of the page may be being accessed
readily but another part of the page isn't really being accessed at all
(and increasing page size only really sees benefit for TLB miss rate so
long as the whole page is "actually being used").
On 8/15/2025 11:19 AM, BGB wrote:
On 8/15/2025 11:53 AM, John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to
have "typically 97% hit rate".  I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
Yeah.
Can note in some of my own testing, I tested various page sizes, and
seemingly found a local optimum at around 16K.
I think that is consistent with what some others have found. I suspect
the average page size should grow as memory gets cheaper, which leads to
more memory on average in systems.  This also leads to larger programs,
as they can "fit" in larger memory with less paging. And as disk
(spinning or SSD) get faster transfer rates, the cost (in time) of
paging a larger page goes down. While 4K was the sweet spot some
decades ago, I think it has increased, probably to 16K. At some point
in the future, it may get to 64K, but not for some years yet.
Say, for example, at 64K one part of the page may be being accessed
readily but another part of the page isn't really being accessed at all
(and increasing page size only really sees benefit for TLB miss rate so
long as the whole page is "actually being used").
Not necessarily. Consider the case of a 16K (or larger) page with two
"hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K >pages, but only one with larger pages.
ARM64 (ARMv8) architecturally supports 4k, 16k and 64k.
These days it doesn't make much sense to have pages smaller than 4K since
that's the block size on most disks.
John Levine <johnl@taugh.com> writes:
These days it doesn't make much sense to have pages smaller than 4K since
that's the block size on most disks.
Two block devices bought less than a year ago:
Disk model: KINGSTON SEDC2000BM8960G
Disk model: WD Blue SN580 2TB
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Disk model: WD Blue SN580 2TB
I can't find anything on its internal structure but I see the vendor's random
read/write benchmarks all use 4K blocks so that's probably the internal block
size.
On 8/15/2025 11:19 AM, BGB wrote:
On 8/15/2025 11:53 AM, John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to
have "typically 97% hit rate".  I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
Yeah.
Can note in some of my own testing, I tested various page sizes, and
seemingly found a local optimum at around 16K.
I think that is consistent with what some others have found. I suspect
the average page size should grow as memory gets cheaper, which leads to more memory on average in systems. This also leads to larger programs,
as they can "fit" in larger memory with less paging. And as disk
(spinning or SSD) get faster transfer rates, the cost (in time) of
paging a larger page goes down. While 4K was the sweet spot some
decades ago, I think it has increased, probably to 16K. At some point
in the future, it may get to 64K, but not for some years yet.
Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
but 16K to 32K or 64K did not see any significant reduction; but did
see a more significant increase in memory footprint due to allocation
overheads (where, OTOH, going from 4K to 16K pages does not see much
increase in memory footprint).
Patterns seemed consistent across multiple programs tested, but harder
to say if this pattern would be universal.
Had noted if running stats on where in the pages memory accesses land:
4K: Pages tend to be accessed fairly evenly
16K: Minor variation as to what parts of the page are being used.
64K: Significant variation between parts of the page.
Basically, tracking per-page memory accesses on a finer grain boundary
(eg, 512 bytes).
Interesting.
Say, for example, at 64K one part of the page may be being accessed
readily but another part of the page isn't really being accessed at
all (and increasing page size only really sees benefit for TLB miss
rate so long as the whole page is "actually being used").
Not necessarily. Consider the case of a 16K (or larger) page with two
"hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K pages, but only one with larger pages.
John Levine <johnl@taugh.com> writes:
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Yes. But if the argument had any merit that 512B is a good page size
because it avoids having to transfer 8, 16, or 32 sectors at a time,
it would still have merit, because the interface still shows 512B
sectors.
For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
Intel Lion Cove, I'd do the following modification to your inner loop
(back in Intel syntax):
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
adc edx,edx
add rax,[r9+rcx*8]
adc edx,0
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
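For readers following along, here is a plain C reference model of what
this three-summand loop computes, word by word (my sketch; add3_ref is a
name I am introducing, and it ignores the final carry out of the top word,
since the fragments here do not show how that is handled):

#include <stdint.h>
#include <stddef.h>

void add3_ref(uint64_t *dst, const uint64_t *a,
              const uint64_t *b, const uint64_t *c, size_t n)
{
    unsigned carry = 0;                 /* always 0, 1 or 2 */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        unsigned cy = s < a[i];         /* carry out of the first add */
        uint64_t t = s + c[i];
        cy += t < s;                    /* carry out of the second add */
        uint64_t r = t + carry;
        cy += r < t;                    /* carry out of adding the incoming carry */
        dst[i] = r;
        carry = cy;
    }
}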
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
makes more sense with that. ebx then contains the carry from the last
cycle on entry. The carry dependency chain starts at clearing edx,
then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx.  No
dependency cycles (except the loop counter and addresses, which can be
dealt with by hardware or by unrolling), and ebx contains the carry
from the last iteration.
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
and adc has a latency of 1, so breaking the dependency chain in a
beneficial way should avoid the use of adc. For our three-summand
add, it's not clear if adcx and adox can run in the same cycle, but
looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register.  According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
makes more sense with that. ebx then contains the carry from the last
cycle on entry. The carry dependency chain starts at clearing edx,
then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx. No
dependency cycles (except the loop counter and addresses, which can be
dealt with by hardware or by unrolling), and ebx contains the carry
from the last iteration
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
and adc has a latency of 1, so breaking the dependency chain in a
beneficial way should avoid the use of adc. For our three-summand
add, it's not clear if adcx and adox can run in the same cycle, but
looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
Forgot to fix the "mov edx, ebx" here. One other thing: I think that
the "add rbx, rax" should be "add rax, rbx". You want to add the
carry to rax before storing the result. So the version with just one iteration would be:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
And the version with the two additional adc-using iterations would be
(with an additional correction):
mov edi,1
xor ebx,ebx
next:
mov rax,[rsi+rcx*8]
add [r8+rcx*8], rax
mov rax,[rsi+rcx*8+8]
adc [r8+rcx*8+8], rax
xor edx, edx
mov rax,[rsi+rcx*8+16]
adc rax,[r8+rcx*8+16]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8+16],rax
add rcx,3
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
- anton
Anton, I like what you and Michael have done, but I'm still not sure
everything is OK:
In your code, I only see two input arrays [rsi] and [r8], instead of
three? (Including [r9])
It would also be possible to use SETC to save the intermediate carries...
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
John Levine <johnl@taugh.com> writes:
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Yes.  But if the argument had any merit that 512B is a good page size
because it avoids having to transfer 8, 16, or 32 sectors at a time,
it would still have merit, because the interface still shows 512B
sectors.
I think we're agreeing that even in the early 1980s a 512 byte page was
too small. They certainly couldn't have made it any smaller, but they
should have made it larger.
S/370 was a decade before that and its pages were 2K or 4K.  The KI-10,
the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
S/370 was a decade before that and its pages were 2K or 4K. The KI-10,...
the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
Note that 360 has optional page protection used only for access
control. In 370 era they had legacy of 2k or 4k pages, and
AFAICS IBM was mainly aiming at bigger machines, so they
were not so worried about fragmentation.
PDP-11 experience possibly contributed to using smaller pages for VAX.
Microprocessors were designed with different constraints, which
led to bigger pages.  But the VAX apparently could afford a reasonably
large TLB, and due to the VMS structure the gain was bigger than for other
OS-es.
And little correction: VAX architecture handbook is dated 1977,
so actually decision about page size had to be made at least
in 1977 and possibly earlier.
On Tue, 19 Aug 2025 05:47:01 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
are certainly capable of more than 1 adcx|adox per cycle.
Below are execution times of very heavily unrolled adcx/adox code with the dependency broken by a trick similar to the one above:
Platform RC GM SK Z3
add3_my_adx_u17 244.5 471.1 482.4 407.0
Considering that there are 2166 adcx/adox/adc instructions, we have the
following number of adcx/adox/adc instructions per clock:
Platform RC GM SK Z3
1.67 1.10 1.05 1.44
For Gracemont and Skylake there is a possibility of a small
measurement mistake, but Raptor Cove appears to be capable of at least 2
instructions of this type per clock, while Zen3 is capable of at least 1.5,
but more likely also 2.
It looks to me that the bottleneck on both RC and Z3 is either the rename
phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
more than 3 load-or-store accesses per clock.
Code:
.file "add3_my_adx_u17.s"
.text
.p2align 4
.globl add3
.def add3; .scl 2; .type 32; .endef
.seh_proc add3
add3:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
# %rcx - dst
# %rdx - a
# %r8 - b
# %r9 - c
sub %rdx, %rcx
mov %rcx, %r10 # r10 = dst - a
sub %rdx, %r8 # r8 = b - a
sub %rdx, %r9 # r9 = c - a
mov %rdx, %r11 # r11 = a
mov $60, %edx
xor %ecx, %ecx
.p2align 4
.loop:
xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
mov (%r11), %rsi
adcx (%r11,%r8), %rsi
adox (%r11,%r9), %rsi
mov 8(%r11), %rax
adcx 8(%r11,%r8), %rax
adox 8(%r11,%r9), %rax
mov %rax, 8(%r10,%r11)
Very impressive Michael!
I particularly like how you are interleaving ADOX and ADCX to gain
two carry bits without having to save them off to an additional
register.
Terje
Overall, I think that the time spent by Intel engineers on the invention of ADX
could have been spent much better.
On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:
I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".
This sounds like an idea worth stealing, although no doubt the way I
would attempt to copy it would be a failure which removed all the
usefulness of it.
For one thing, I don't have a stack for calling subroutines, or any other purpose.
But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.
The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
that.)
So I've probably completely misunderstood you here.
John Savard
I have harped on you for a while to start development of your compiler.
One of the first things a compiler needs to do is to develop its means
to call subroutines and return. This requires a philosophy of passing
arguments, returning results, dealing with recursion, and dealing with
TRY-THROW-CATCH SW-defined exception handling. I KNOW of nobody who does
this without some kind of stack.
John Levine <johnl@taugh.com> wrote:
It also seems rather high for the /91. I can't find any authoritative
numbers but 100K seems more likely. It was SLT, individual transistors
mounted a few to a package. The /91 was big but it wasn't *that* big.
I remember this number, but do not remember where I found it. So
it may be wrong.
However, one can estimate possible density in a different way: a package
probably of similar dimensions to a VAX package can hold about 100 TTL
chips. I do not have detailed data about chip usage and transistor
counts for each chip. A simple NAND gate is 4 transistors, but the input
transistor has two emitters and really works like two transistors,
so it is probably better to count it as 2 transistors, and consequently
consider a 2-input NAND gate as having 5 transistors. So a 74S00 gives
20 transistors. A D-flop is probably about 20-30 transistors, so a
74S74 is probably around 40-60. A quad D-flop brings us close to 100.
I suspect that in VAX times octal D-flops were available. There
were 4-bit ALU slices. Also multiplexers need a nontrivial number
of transistors. So I think that 50 transistors is a reasonable (maybe
low) estimate of average density. Assuming 50 transistors per chip,
that would be 5000 transistors per package. Packages were rather
flat, so when mounted vertically one could probably allocate 1 cm
of horizontal space for each. That would allow 30 packages at a
single level. With 7 levels we get 210 packages, enough for
1 million transistors.
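Spelling that estimate out as a quick calculation (all inputs are the rough
assumptions from the paragraph above, not measured data):

#include <stdio.h>

int main(void)
{
    long transistors_per_chip = 50;  /* assumed average over SSI/MSI TTL parts */
    long chips_per_package    = 100; /* package of roughly VAX-like dimensions */
    long packages_per_level   = 30;  /* about 1 cm of horizontal space each */
    long levels               = 7;

    long per_package = transistors_per_chip * chips_per_package; /* 5000 */
    long packages    = packages_per_level * levels;              /* 210 */
    printf("%ld transistors\n", per_package * packages);         /* 1050000 */
    return 0;
}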
On 7/28/2025 6:18 PM, John Savard wrote:
On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:
VAX tried too hard in my opinion to close the semantic gap.
Any operand could be accessed with any address mode. Now while this
makes the puny 16-register file seem larger,
what VAX designers forgot, is that each address mode was an instruction
in its own right.
So, VAX shot at minimum instruction count, and purposely miscounted
address modes not equal to %k as free.
Fancy addressing modes certainly aren't _free_. However, they are,
in my opinion, often cheaper than achieving the same thing with an
extra instruction.
So it makes sense to add an addressing mode _if_ what that addressing
mode does is pretty common.
The use of addressing modes drops off pretty sharply though.
Like, if one could stat it out, one might see a static-use pattern
something like:
80%: [Rb+disp]
15%: [Rb+Ri*Sc]
3%: (Rb)+ / -(Rb)
1%: [Rb+Ri*Sc+Disp]
<1%: Everything else
Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.
Granted, the dominance of [Rb+Disp] does drop off slightly when
considering dynamic instruction use. Part of it is due to the
prolog/epilog sequences.
If one had instead used (SP)+ and -(SP) addressing for prologs and
epilogs, then one might see around 20% or so going to these instead.
Or, if one had PUSH/POP, to PUSH/POP.
The discrepancy between static and dynamic instruction counts is then mostly due to things like loops and similar.
Estimating the effect of loops in a compiler is hard, but I had noted that
assuming a scale factor of around 1.5^D for loop nesting depth D
seemed to be in the right area. Many loop bodies are often not reached at
all, or only run a few times, so, perhaps counter-intuitively, it is often
faster to assume that a loop body will only cycle 2 or 3 times rather than
hundreds or thousands of times; trying to aggressively optimize loops by
assuming large N tends to be detrimental to performance.
Well, and at least thus far, profiler-driven optimization isn't really a thing in my case.
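As a sketch of that kind of static heuristic (the names and the spill-cost
use are hypothetical, not taken from any particular compiler): weight each
block by 1.5^D for loop-nesting depth D instead of assuming loops run
hundreds of times.

#include <math.h>

/* Estimated execution frequency for a block at loop-nesting depth d,
   assuming each loop level only multiplies the frequency by about 1.5
   (i.e. the loop body is expected to run just a few times). */
static double block_weight(int d)
{
    return pow(1.5, (double)d);
}

/* Hypothetical use: bias register-allocation spill costs by that weight,
   so values used in deeply nested loops are kept in registers first. */
static double spill_cost(int num_uses, int loop_depth)
{
    return (double)num_uses * block_weight(loop_depth);
}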
One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
ISA has a lot of registers, the relative benefit of LoadOp is reduced.
LoadOp being mostly a benefit if the value is loaded exactly once, and
there is some other ALU operation or similar that can be fused with it.
Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
z=arr[i]+x;
But, the relative incidence of things like this is low enough as to not
save that much.
The other thing is that one has to implement it in a way that does not
increase pipeline length, since if one makes the pipeline longer for the
sake of LoadOp or OpStore, then this is likely to be a net negative for
performance vs prioritizing Load/Store, unless the pipeline had already
needed to be lengthened for other reasons.
One can be like, "But what if the local variables are not in registers?"
but on a machine with 32 or 64 registers, most likely your local
variable is already going to be in a register.
So, the main potential merit of LoadOp being "doesn't hurt as bad on a register-starved machine".
That being said, though, designing a new machine today like the VAX
would be a huge mistake.
But the VAX, in its day, was very successful. And I don't think that
this was just a result of riding on the coattails of the huge popularity
of the PDP-11. It was a good match to the technology *of its time*,
that being machines that were implemented using microcode.
Yeah.
There are some living descendants of that family, but pretty much
everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.
John Savard
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instruction and address modes and the tiny 512 byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.
Another aspect from those measurements is that the 68k instruction set
(with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.
Another, which is not entirely their fault, is that they did not expect
compilers to improve as fast as they did, leading to a machine which was fun to
program in assembler but full of stuff that was useless to compilers and
instructions like POLY that should have been subroutines. The 801 project and
PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
presumably didn't know about it.
DEC probably was aware from the work of William Wulf and his students
what optimizing compilers can do and how to write them. After all,
they used his language BLISS and its compiler themselves.
POLY would have made sense in a world where microcode makes sense: If
microcode can be executed faster than subroutines, put a building
block for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
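For reference, POLY evaluates a polynomial from a coefficient table by
Horner's rule, so a plain subroutine does the same job. A rough C sketch
(coefficient order assumed highest-degree first here; the instruction's own
intermediate-rounding rules are ignored):

double poly_eval(double x, const double coef[], int degree)
{
    double r = coef[0];            /* assumed: highest-order coefficient first */
    for (int i = 1; i <= degree; i++)
        r = r * x + coef[i];       /* Horner step */
    return r;
}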
Related to the microcode issue they also don't seem to have anticipated how
important pipelining would be. Some minor changes to the VAX, like not letting
one address modify another in the same instruction, would have made it a lot
easier to pipeline.
My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
given us.
Another issue would be is how to implement the PDP-11 emulation mode.
I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
prefer a ARM/SPARC/HPPA-like handling of conditions.
- anton
On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:
If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
cases, one gets most of the benefit with less issues.
That's certainly a way to do it. But then you either need to dedicate
one base register to each array - perhaps easier if there's opcode
space to use all 32 registers as base registers, which this would allow -
or you would have to load the base register with the address of the
array.
John Savard
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:...
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
[...] POLY as an
instruction is bad.
One must remember that VAX was a 5-cycle per instruction machine !!!
(200ns : 1 MIP)
Pipeline work over 1983-to-current has shown that LD and OPs perform
just as fast as LD+OP. Also, there are ways to perform LD+OP as if it
were LD and OP, and there are ways to perform LD and OP as if it were
LD+OP.
Condition codes get hard when DECODE width grows greater than 3.
It appears that Waldek Hebisch <antispam@fricas.org> said:
My idea was that instruction decoder could essentially translate
ADDL (R2)+, R2, R3
into
MOV (R2)+, TMP
ADDL TMP, R2, R3
But how about this?
ADDL3 (R2)+,(R2)+,(R2)+
Now you need at least two temps, the second of which depends on the
first, and there are instructions with six operands. Or how about
this:
ADDL3 (R2)+,#1234,(R2)+
This is encoded as
OPCODE (R2)+ (PC)+ <1234> (R2)+
The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
It must have seemed clever at the time, but ugh.
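A simplified C sketch of why that forces sequential decoding: the length of
each operand specifier depends on its mode (and, for immediates, on the
operand size), so the position of operand k is unknown until operands
1..k-1 have been parsed. Only a few of the real VAX modes are modelled
here, and index mode and the deferred variants are glossed over; this is an
illustration, not a faithful decoder.

#include <stddef.h>
#include <stdint.h>

static size_t specifier_length(const uint8_t *p, size_t operand_size)
{
    uint8_t mode = p[0] >> 4, reg = p[0] & 0xF;
    switch (mode) {
    case 0: case 1: case 2: case 3: return 1;            /* short literal */
    case 5: case 6: case 7:         return 1;            /* Rn, (Rn), -(Rn) */
    case 8:  return (reg == 15) ? 1 + operand_size : 1;  /* (PC)+ = immediate, else (Rn)+ */
    case 0xA: case 0xB:             return 2;            /* byte displacement */
    case 0xC: case 0xD:             return 3;            /* word displacement */
    case 0xE: case 0xF:             return 5;            /* longword displacement */
    default:                        return 1;            /* index/deferred modes omitted */
    }
}

/* The byte offset of operand k can only be found by walking the earlier ones. */
size_t operand_offset(const uint8_t *insn, int k, size_t operand_size)
{
    size_t off = 1;                                      /* skip the opcode byte */
    for (int i = 0; i < k; i++)
        off += specifier_length(insn + off, operand_size);
    return off;
}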
John Levine <johnl@taugh.com> posted:
It appears that Waldek Hebisch <antispam@fricas.org> said:
My idea was that instruction decoder could essentially translate
ADDL (R2)+, R2, R3
into
MOV (R2)+, TMP
ADDL TMP, R2, R3
But how about this?
ADDL3 (R2)+,(R2)+,(R2)+
Now you need at least two temps, the second of which depends on the
first, and there are instructions with six operands. Or how about
this:
ADDL3 (R2)+,#1234,(R2)+
This is encoded as
OPCODE (R2)+ (PC)+ <1234> (R2)+
The immediate word is in the middle of the instruction. You have to decode
the operands one at a time so you can recognize immediates and skip over them.
It must have seemed clever at the time, but ugh.
What we must all realize is that each address mode in VAX was a microinstruction all unto itself.
And that is why it was not pipelineable in any real sense.
There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.
In most cases you have either one of the other but not both. E.g. in
C we don't have nested functions, and in Javascript functions are heap-allocated objects.
Other than GNU C (with its support for nested functions), which other language has this weird combination of features?
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
There is one additional, quite thorny issue: How to maintain
state for nested functions to be invoked via pointers, which
have to have access local variables in the outer scope.
gcc does so by default by making the stack executable, but
that is problematic. An alternative is to make some sort of
executable heap. This is now becoming a real problem, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117455 .
AFAIK this is a problem only in those rare languages where a function
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.
In most cases you have either one of the other but not both. E.g. in
C we don't have nested functions, and in Javascript functions are
heap-allocated objects.
Other than GNU C (with its support for nested functions), which other
language has this weird combination of features?
Well, more precisely:
- function pointer is supposed to take the same space as a single
machine address
- function pointer is supposed to be directly invokable, that is
point to machine code of the function
- one wants to support nested functions
- there is no garbage collector, one does not want to introduce extra
stack and one does not want to leak storage allocated to nested
functions.
To explain more:
- arguably in "safe" C data pointers should consist
of 3 machine words, such pointer have place for extra data needed
for nested functions.
- some calling conventions introduce extra indirection, that is
function pointer point to a data structure containing address
of machine code and extra data needed by nested functions.
Function call puts extra data in dedicated machine register and
then transfers control via address contained in function data
structure. IIUC IBM AIX uses such approach.
- one could create trampolines in a separate area of memory. In
such case there is trouble with dealocating no longer needed
trampolines. This trouble can be resolved by using GC. Or
by using a parallel stack dedicated to trampolines.
Concerning languages, any language which has nested functions and
wants seamless cooperation with C needs to resolve the problem.
That affects Pascal, Ada, PL/I. That is basicaly most classic
non-C languages. IIUC several "higher level" languages resolve
the trouble by combination of parallel stack and/or GC. But
when language want to compete with efficiency of C and does not
want GC, then trampolines allocated on machine stack may be the
only choice (on register starved machine parallel stack may be
too expensive). AFAIK GNU Ada uses (or used) trampolines
allocated on machine stack.
Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.
Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.
Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so
there's no dynamic allocation involved.
Stefan
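A small C sketch of that two-word-function-pointer idea (all names made up):
the "pointer" carries both the code address and the environment, so no
trampoline and no dynamic allocation are needed; the cost is that every
indirect call passes the environment explicitly.

#include <stdio.h>

struct closure {
    long (*code)(void *env, long arg);   /* callee takes its environment explicitly */
    void *env;                           /* frame of the enclosing function */
};

struct outer_frame { long bias; };

static long inner(void *env, long x)     /* the "nested" function */
{
    struct outer_frame *f = env;
    return x + f->bias;
}

static long apply(struct closure c, long x)
{
    return c.code(c.env, x);             /* caller supplies the environment */
}

int main(void)
{
    struct outer_frame f = { 42 };
    struct closure c = { inner, &f };
    printf("%ld\n", apply(c, 8));        /* prints 50 */
    return 0;
}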
AFAIK this is a problem only in those rare languages where a function...
value is expected to take up the same space as any other pointer while
at the same time supporting nested functions.
Other than GNU C (with its support for nested functions), which other language has this weird combination of features?
On 8/30/2025 1:22 PM, Stefan Monnier wrote:
Function pointer consists of a pointer to a blob of memory holding
a code pointer and typically the callee's GOT pointer.
Better skip the redirection and make function pointers take up 2 words (address of the code plus address of the context/environment/GOT), so there's no dynamic allocation involved.
FDPIC typically always uses the normal pointer width, just with more indirection:
Load target function pointer from GOT;
Save off current GOT pointer to stack;
Load code pointer from function pointer;
Load GOT pointer from function pointer;
Call function;
Reload previous GOT pointer.
It, errm, kinda sucks...
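A rough C model of that call sequence (hypothetical names; in a real FDPIC
ABI the GOT pointer lives in a reserved register and the compiler emits the
save/load/reload around every indirect call): the "function pointer" is the
address of a small descriptor holding the code address and the callee's GOT
base.

#include <stdio.h>

struct func_desc {
    void (*code)(void);     /* entry point of the callee */
    void *got;              /* callee's GOT / data base */
};

static void *current_got;   /* stand-in for the reserved GOT register */

static void call_indirect(const struct func_desc *fp)
{
    void *saved = current_got;   /* save off current GOT pointer */
    current_got = fp->got;       /* load GOT pointer from the descriptor */
    fp->code();                  /* call through the code pointer */
    current_got = saved;         /* reload previous GOT pointer */
}

static void hello(void) { puts("hello"); }
static char fake_got[16];        /* placeholder for the callee's data segment */
static const struct func_desc hello_desc = { hello, fake_got };

int main(void)
{
    call_indirect(&hello_desc);
    return 0;
}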
I would have liked to install 64-bit Debian (IIRC I initially ran
32-bit Debian on the Athlon 64), but they were not ready at the time,
and still busily working on their multi-arch (IIRC) plans, so
eventually I decided to go with Fedora Core 1, which just implemented
/lib and /lib64 and was there first.
For some reason I switched to Gentoo relatively soon after
(/etc/hostname from 2005-02-20, and IIRC Debian still had not finished
hammering out multi-arch at that time), before finally settling in
Debian-land several years later.
Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
first Debian with official AMD64 support.
Reading some more, Debian 4.0 (Etch), released 8 April 2007, was the
first Debian with official AMD64 support.
Indeed, I misremembered: I used Debian's i386 port on my 2003 AMD64
machine.
It didn't have enough RAM to justify the bother of distro hopping. 🙂
Stefan
It didn't have enough RAM to justify
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                           Instructions
arrays  globals      Architecture    arrays  globals
  32      68         My 66000            8       5
  28      66 (34+32) RV64GC             12       9
  27      69         AMD64              11       9
  44      84         ARM A64            11      22
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test.
Size is one thing, sooner or later one has to execute the instructions,
and here My 66000 needs to execute fewer, while being within spitting
distance of code size.
Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions,
3 for My 66000
so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* My 66000 uses ST immediate for globals.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
- anton
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                           Instructions
arrays  globals      Architecture    arrays  globals
  32      68         My 66000            8       5
  28      66 (34+32) RV64GC             12       9
  27      69         AMD64              11       9
  44      84         ARM A64            11      22
In light of the above, what do people think is more important, small
code size or fewer instructions ??
At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).
What say ye !
On 9/4/2025 8:23 AM, MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                           Instructions
arrays  globals      Architecture    arrays  globals
  32      68         My 66000            8       5
  28      66 (34+32) RV64GC             12       9
  27      69         AMD64              11       9
  44      84         ARM A64            11      22
In light of the above, what do people think is more important, small
code size or fewer instructions ??
In general yes, but as you pointed out in another post, if you are
talking about a GBOoO machine, it isn't the absolute number of
instructions (because of parallel execution), but the number of cycles
to execute a particular routine. Of course, this is harder to tell at a glance from a code listing.
At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller
AND so long as the ISA does not resort to sequential decode (i.e., VAX).
What say ye !
And, of course your "150%" is arbitrary,
but I agree that small
differences in code size are not important, except in some small
embedded applications.
And I guess I would add, as a third, much lower priority, power usage.
On 9/4/2025 8:23 AM, MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                           Instructions
arrays  globals      Architecture    arrays  globals
  32      68         My 66000            8       5
  28      66 (34+32) RV64GC             12       9
  27      69         AMD64              11       9
  44      84         ARM A64            11      22
In light of the above, what do people think is more important, small
code size or fewer instructions ??
At some scale, smaller code size is beneficial, but once the implementation
has a GBOoO µarchitecture, I would think that fewer instructions is better
than smaller code--so long as the code size is less than 150% of the smaller
AND so long as the ISA does not resort to sequential decode (i.e., VAX).
What say ye !
In general yes, but as you pointed out in another post, if you are
talking about a GBOoO machine, it isn't the absolute number of
instructions (because of parallel execution), but the number of cycles
to execute a particular routine. Of course, this is harder to tell at a glance from a code listing.
And, of course your "150%" is arbitrary, but I agree that small
differences in code size are not important, except in some small
embedded applications.
And I guess I would add, as a third, much lower priority, power usage.
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                           Instructions
arrays  globals      Architecture    arrays  globals
  32      68         My 66000            8       5
  28      66 (34+32) RV64GC             12       9
  27      69         AMD64              11       9
  44      84         ARM A64            11      22
In light of the above, what do people think is more important, small
code size or fewer instructions ??
At some scale, smaller code size is beneficial, but once the implementation
has a GBOoO µarchitecture, I would think that fewer instructions is better
than smaller code--so long as the code size is less than 150% of the smaller
AND so long as the ISA does not resort to sequential decode (i.e., VAX).
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                           Instructions
arrays  globals      Architecture    arrays  globals
  32      68         My 66000            8       5
  28      66 (34+32) RV64GC             12       9
  27      69         AMD64              11       9
  44      84         ARM A64            11      22
In light of the above, what do people think is more important, small
code size or fewer instructions ??
Performance from a given chip area.
The RISC-V people argue that they can combine instructions with a few transistors. But, OTOH, they have 16-bit and 32-bit wide
instructions, which means that a part of the decoder results will be
thrown away, increasing the decode cost for a given number of average
decoded instructions per cycle. Plus, they need more decoded
instructions per cycle for a given amount of performance.
Intel and AMD demonstrate that you can get high performance even with
an instruction set that is even worse for decoding, but that's not cheap.
ARM A64 goes the other way: Fixed-width instructions ensure that all
decoding on correctly predicted paths is actually useful.
However, it pays for this in other ways: Instructions like load pair
with auto-increment need to write 3 registers, and the write port
arbitration certainly has a hardware cost. However, such an
instruction would need two loads and an add if expressed in RISC-V; if
RISC-V combines these instructions, it has the same write-port
arbitration problem. If it does not combine at least the loads, it
will tend to perform worse with the same number of load/store units.
So it's a balancing game: If you lose some weight here, do you need to
add the same, more, or less weight elsewhere to compensate for the
effects elsewhere?
At some scale, smaller code size is beneficial, but once the implementation
has a GBOoO µarchitecture, I would think that fewer instructions is better
than smaller code--so long as the code size is less than 150% of the smaller
AND so long as the ISA does not resort to sequential decode (i.e., VAX).
I don't think that even VAX encoding would be the major problem of the
VAX these days. There are microop caches and speculative decoders for
that (although, as EricP points out, the VAX is an especially
expensive nut to crack for a speculative decoder).
In any case, if smaller code size was it, RV64GC would win according
to my results. However, compilers often generate code that has a
bigger code size rather than a smaller one (loop unrolling, inlining),
so code size is not that important in the eyes of the maintainers of
these compilers.
I also often see code produced with more (dynamic) instructions than necessary. So the number of instructions is apparently not that
important, either.
- anton
On 9/5/2025 10:03 AM, Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
For example:
* 00in-nnnn-iiii-0000 ADD Imm5s, Rn5 //"ADD 0, R0" = TRAP
* 01in-nnnn-iiii-0000 LI Imm5s, Rn5
* 10mn-nnnn-mmmm-0000 ADD Rm5, Rn5
* 11mn-nnnn-mmmm-0000 MV Rm5, Rn5
* 0000-nnnn-iiii-0100 ADDW Imm4u, Rn4
* 0001-nnnn-mmmm-0100 SUB Rm4, Rn4
* 0010-nnnn-mmmm-0100 ADDW Imm4n, Rn4
* 0011-nnnn-mmmm-0100 MVW Rm4, Rn4 //ADDW Rm, 0, Rn
* 0100-nnnn-mmmm-0100 ADDW Rm4, Rn4
* 0101-nnnn-mmmm-0100 AND Rm4, Rn4
* 0110-nnnn-mmmm-0100 OR Rm4, Rn4
* 0111-nnnn-mmmm-0100 XOR Rm4, Rn4
* 0iii-0nnn-0mmm-1001 ? SLL Rm3, Imm3u, Rn3
* 0iii-0nnn-1mmm-1001 ? SRL Rm3, Imm3u, Rn3
* 0iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3u, Rn3
* 0iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3u, Rn3
* 1iii-0nnn-0mmm-1001 ? AND Rm3, Imm3u, Rn3
* 1iii-0nnn-1mmm-1001 ? SRA Rm3, Imm3u, Rn3
* 1iii-1nnn-0mmm-1001 ? ADD Rm3, Imm3n, Rn3
* 1iii-1nnn-1mmm-1001 ? ADDW Rm3, Imm3n, Rn3
* 0ooo-0nnn-0mmm-1101 ? SLL Rm3, Ro3, Rn3
* 0ooo-0nnn-1mmm-1101 ? SRL Rm3, Ro3, Rn3
* 0ooo-1nnn-0mmm-1101 ? AND Rm3, Ro3, Rn3
* 0ooo-1nnn-1mmm-1101 ? SRA Rm3, Ro3, Rn3
* 1ooo-0nnn-0mmm-1101 ? ADD Rm3, Ro3, Rn3
* 1ooo-0nnn-1mmm-1101 ? SUB Rm3, Ro3, Rn3
* 1ooo-1nnn-0mmm-1101 ? ADDW Rm3, Ro3, Rn3
* 1ooo-1nnn-1mmm-1101 ? SUBW Rm3, Ro3, Rn3
* 0ddd-nnnn-mmmm-0001 LW Disp3u(Rm4), Rn4
* 1ddd-nnnn-mmmm-0001 LD Disp3u(Rm4), Rn4
* 0ddd-nnnn-mmmm-0101 SW Rn4, Disp3u(Rm4)
* 1ddd-nnnn-mmmm-0101 SD Rn4, Disp3u(Rm4)
* 00dn-nnnn-dddd-1001 LW Disp5u(SP), Rn5
* 01dn-nnnn-dddd-1001 LD Disp5u(SP), Rn5
* 10dn-nnnn-dddd-1001 SW Rn5, Disp5u(SP)
* 11dn-nnnn-dddd-1001 SD Rn5, Disp5u(SP)
* 00dd-dddd-dddd-1101 J Disp10
* 01dn-nnnn-dddd-1101 LD Disp5u(SP), FRn5
* 10in-nnnn-iiii-1101 LUI Imm5s, Rn5
* 11dn-nnnn-dddd-1101 SD FRn5, Disp5u(SP)
Could achieve a higher average hit-rate than RV-C while *also* using
less encoding space.
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
arrays:
MOV R3,#0
MOV R4,#0
VEC R5,{}
LDD R6,[R1,R3<<3]
ADD R4,R4,R6
LOOP LT,R3,#1,R2
MOV R1,R4
RET
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
globals:
STD #0x1234567890abcdef,[ip,a-.]
STD #0xcdef1234567890ab,[ip,b-.]
STD #0x567890abcdef1234,[ip,c-.]
STD #0x5678901234abcdef,[ip,d-.]
RET
-----------------
So, the overall sizes (including data size for globals() on RV64GC) are:
        Bytes                           Instructions
arrays  globals      Architecture    arrays  globals
  32      68         My 66000            8       5
  28      66 (34+32) RV64GC             12       9
  27      69         AMD64              11       9
  44      84         ARM A64            11      22
In light of the above, what do people think is more important, small
code size or fewer instructions ??
At some scale, smaller code size is beneficial, but once the implementation has a GBOoO µarchitecture, I would think that fewer instructions is better than smaller code--so long as the code size is less than 150% of the smaller AND so long as the ISA does not resort to sequential decode (i.e., VAX).
What say ye !
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test.
Size is one thing, sooner or later one has to execute the instructions,
and here My 66000 needs to execute fewer, while being within spitting
distance of code size.
Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions,
3 for My 66000
so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four
instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* My 66000 uses ST immediate for globals.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
- anton
Things could be architected to allow a tradeoff between code size and the number of instructions executed in the same ISA. Sometimes one may want really small code; other times performance is more important.
- one could create trampolines in a separate area of memory. In
such case there is trouble with dealocating no longer needed
trampolines. This trouble can be resolved by using GC. Or
by using a parallel stack dedicated to trampolines.
Waldek Hebisch <antispam@fricas.org> schrieb:
- one could create trampolines in a separate area of memory. In
such case there is trouble with dealocating no longer needed
trampolines. This trouble can be resolved by using GC. Or
by using a parallel stack dedicated to trampolines.
[...]
gcc has -ftrampoline-impl=[stack|heap], see https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html
Don't longjmp out of a nested function though.
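A minimal GNU C example (not ISO C) of the case this option is about:
taking the address of a nested function that captures a local forces gcc to
materialize a trampoline for it; -ftrampoline-impl then chooses whether
that trampoline lives on the stack or on an executable heap.

#include <stdio.h>

static void apply(void (*f)(int), int n)
{
    for (int i = 0; i < n; i++)
        f(i);
}

int main(void)
{
    int sum = 0;
    void accumulate(int i)     /* GNU C nested function capturing 'sum' */
    {
        sum += i;
    }
    apply(accumulate, 5);      /* address escapes, so a trampoline is built */
    printf("%d\n", sum);       /* prints 10 */
    return 0;
}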
Thomas Koenig <tkoenig@netcologne.de> posted:
Waldek Hebisch <antispam@fricas.org> schrieb:
- one could create trampolines in a separate area of memory. In
such case there is trouble with dealocating no longer needed
trampolines. This trouble can be resolved by using GC. Or
by using a parallel stack dedicated to trampolines.
[...]
gcc has -ftrampoline-impl=[stack|heap], see
https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html
Don't longjmp out of a nested function though.
Or longjump around subroutines using 'new'.
Or longjump out of 'signal' handlers.