The ARM Lead Architect and Fellow Richard Grisenthwaite
presents "Lessons from the ARM Architecture".
https://www.scribd.com/document/231464485/ARM-RG
Yes, scribd.com is annoying, but the document is interesting.
Page 22:
"If a feature requires a combination of hardware and specific
software..."
- Be afraid.
- Be _very_ afraid of language specific features.
- If it feels a bit clunky when you first design it
... it won't improve over time
"Lessons from the ARM Architecture"
In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
"Lessons from the ARM Architecture"
Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>
Note that this is from 2010 and does not discuss ARM64.
One comment:
] * "Trap and Emulate” is an illusion of compatibility"
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
Scott Lurndal <scott@slp53.sl.home> schrieb:
The ARM Lead Architect and Fellow Richard Grisenthwaite
presents "Lessons from the ARM Architecture".
https://www.scribd.com/document/231464485/ARM-RG
Yes, scribd.com is annoying, but the document is interesting.
It certainly is, thanks for sharing!
He left out POWER, for some reason or other.
Page 22:
"If a feature requires a combination of hardware and specific
software..."
- Be afraid.
- Be _very_ afraid of language specific features.
- If it feels a bit clunky when you first design it
... it won't improve over time
Wise words.
On 12/15/2025 1:37 PM, John Dallman wrote:
In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
"Lessons from the ARM Architecture"
Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>
Note that this is from 2010 and does not discuss ARM64.
One comment:
] * "Trap and Emulate” is an illusion of compatibility"
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
IME, it depends:
If the scenario is semi-common, it doesn't work out so well;
If it is rare enough, it can be OK.
Sometimes, I am left considering wacky ideas, say for example:
A Partial Page Table, where, rather than having a full page walker,
there is a TLB-like structure which merely caches the bottom level of
the page table.
This could have some similar merits to a software-managed TLB
(flexibility and simpler hardware), while potentially getting a roughly
512x to 2048x multiplier out of the effective size of the TLB (and
without some of the drawbacks of an Inverted Page Table).
Say, for example, with a 64-entry 1-way LLPTLB and 16K pages (with 8B
PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
cover 16GB).
Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.
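A rough C sketch of what such an LLPTLB lookup would mean, modeling the
64-entry, 1-way case with 16K pages and 8B PTEs from the figures above
(the names and the struct layout are illustrative, not an actual design):

    #include <stdint.h>

    #define LLPTLB_ENTRIES  64
    #define PAGE_SHIFT      14                         /* 16K pages */
    #define PTES_PER_PAGE   (1u << (PAGE_SHIFT - 3))   /* 2048 8-byte PTEs per PT page */

    typedef struct {
        uint64_t tag;                    /* which bottom-level PT page is cached */
        uint64_t pte[PTES_PER_PAGE];     /* cached copy of that PT page */
        int      valid;
    } llptlb_entry_t;

    static llptlb_entry_t llptlb[LLPTLB_ENTRIES];

    /* Returns the PTE for 'va', or 0 on a miss (software would then refill the
       entry, much as with a software-managed TLB).  Coverage per entry is
       2048 PTEs * 16K = 32MB, so 64 entries give ~2GB, matching the figures. */
    static uint64_t llptlb_lookup(uint64_t va)
    {
        uint64_t vpn  = va >> PAGE_SHIFT;         /* virtual page number */
        uint64_t ptpn = vpn / PTES_PER_PAGE;      /* bottom-level PT page number */
        llptlb_entry_t *e = &llptlb[ptpn % LLPTLB_ENTRIES];  /* 1-way index */

        if (e->valid && e->tag == ptpn)
            return e->pte[vpn % PTES_PER_PAGE];
        return 0;
    }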
One potential ugly point that exists in my designs (including XG3) is
the use of 48-bit pointers with tagging in the high 16 bits. I can't
prove that this won't come back to bite; but, at the same time, I still
need some amount of tag bits. And Intel/AMD/ARM have ended up partly
going in a similar direction as an optional feature (though differing
on the 8-vs-16-bits question).
For textures, one can map the textures over to "color()", which is then
understood as identifying a texture when not given a normal color
name/value. There is no direct need to specify ST/UV coords here, as
these are implied via texture projection.
BGB <cr88192@gmail.com> posted:
On 12/15/2025 1:37 PM, John Dallman wrote:
In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
"Lessons from the ARM Architecture"
Non-Scribd:
<https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>
Note that this is from 2010 and does not discuss ARM64.
One comment:
] * "Trap and Emulate" is an illusion of compatibility
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
IME, it depends:
If the scenario is semi-common, it doesn't work out so well;
If it is rare enough, it can be OK.
One must also take into consideration the relative time of each path.
For example, I have Transcendental instructions in my ISA where I can
perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
I make a rounding error about once every 378 calculations. Running a SW
routine that gives correct rounding is 300-500 cycles.
So, if one can take the Incorrect-Rounding exception in less than 100
cycles (round trip), trap-and-emulate adds only about 1 cycle on average
to the running of the transcendentals.
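As a back-of-envelope check of that claim (the 378, ~100, and 300-500
cycle figures are from the paragraph above; the 400 is just the midpoint
of the 300-500 range):

    #include <stdio.h>

    int main(void)
    {
        double per_error = 100.0 + 400.0;   /* trap round trip + SW fixup, cycles */
        double rate      = 1.0 / 378.0;     /* fraction of results needing the fixup */

        /* expected extra cycles per transcendental op: ~1.3 */
        printf("average added cost: %.2f cycles\n", per_error * rate);
        return 0;
    }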
-----------------------------------------------
Sometimes, I am left considering wacky ideas, say for example:
A Partial Page Table.
Where, rather than having a full page walker, it has a TLB like
structure which merely caches the bottom level of the page table.
This could have some similar merits to software TLB (flexibility and
simpler hardware), while potentially getting a roughly 512x to 2048x
multiplier out of the effective size of the TLB (and without some of the
drawbacks of an Inverted Page Table).
Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B
PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
cover 16GB).
Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.
If your HW tablewalker made one probe into a large (16MB) SW-controlled
buffer of PTEs, and if it found the PTE, great: it gets installed and
life goes on; otherwise, SW updates the table and the TLB and returns as
soon as possible.
This gives you an L2 TLB with 1M entries, enough to map 8GB, and a
table-walking state machine with 3 states.
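A minimal C sketch of that single-probe scheme; the slot layout, the
modulo "hash", and the 8K page size (chosen so that 1M entries cover 8GB)
are assumptions for illustration, not a description of an actual design:

    #include <stdint.h>

    #define PTE_BUF_ENTRIES  (1u << 20)    /* 1M slots, ~16MB of PTE storage */
    #define PAGE_SHIFT       13            /* 8K pages: 1M entries map 8GB */

    typedef struct {
        uint64_t vpn;     /* tag: virtual page number this slot maps */
        uint64_t pte;     /* translation + permission bits; bit 0 = valid */
    } pte_slot_t;

    static pte_slot_t pte_buf[PTE_BUF_ENTRIES];   /* the SW-controlled buffer */

    /* The walker's three states collapse to: probe, install on hit, trap on miss. */
    static int probe_and_install(uint64_t va, uint64_t *pte_out)
    {
        uint64_t vpn  = va >> PAGE_SHIFT;
        pte_slot_t *s = &pte_buf[vpn % PTE_BUF_ENTRIES];   /* the single probe */

        if ((s->pte & 1) && s->vpn == vpn) {
            *pte_out = s->pte;     /* hit: hardware installs this into the TLB */
            return 1;
        }
        return 0;                  /* miss: trap; SW fills the slot and the TLB */
    }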
-----------------------------------------------
One potential ugly point that exists in my designs (including XG3) is
the use of 48-bit pointers with tagging in the high 16 bits. I can't
prove that this wont come back to bite; but then at the same time I also
still need some amount of tag bits. And, Intel/AMD/ARM had ended up
partly going in a similar direction as an optional feature (though
differing on the 8 vs 16 bits question).
I CAN prove it WILL come back to bite -- but only if your architecture survives.
-----------------------------------------------------
For textures, can map the textures over to "color()" which is then
understood as identifying a texture if not using a normal color
name/value. There is no direct need to specify ST/UV coords here, as
these are implied via texture projection.
Can you think of uses for textures where an LD instruction has an FP
index:
LD Rd,[Rbase,Rindex,Displacement]
where: Rbase is a pointer
Rindex is an FP value
Displacement is what it is
The integer part of Rindex and the integer part + 1 index the texture.
The fraction part of Rindex performs the blending:
LERP( texture[int(Rindex)], texture[int(Rindex)+1], fract(Rindex) );
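In C terms, the load would compute something like the following (treating
the texture as a 1-D array; the function name and the double element type
are just for illustration):

    #include <math.h>

    /* Semantics of the hypothetical LD Rd,[Rbase,Rindex,Disp] with an FP index:
       fetch the two neighboring elements and blend by the fractional part. */
    static double ld_fp_indexed(const double *texture, double rindex)
    {
        double ipart;
        double f = modf(rindex, &ipart);   /* split into integer and fraction */
        int    i = (int)ipart;

        /* LERP(texture[i], texture[i+1], f) */
        return texture[i] + (texture[i + 1] - texture[i]) * f;
    }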
According to BGB <cr88192@gmail.com>:
On 12/15/2025 1:37 PM, John Dallman wrote:
] * "Trap and Emulate” is an illusion of compatibility"
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
IME, it depends:
If the scenario is semi-common, it doesn't work out so well;
If it is rare enough, it can be OK.
I think it's always been iffy.
S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
8 byte boundaries. Normally the Fortran compiler aligned data correctly but you
could use an EQUIVALENCE statement to create misaligned doubles, something that
occasionally happened in code ported from 709x where it wasn't a problem. There
was a library trap routine to fix it but it was stupendously slow, so IBM added
the "byte oriented operand" feature to the 360/85 and later models.
The high end 360/91 did not have most of the decimal instructions, on the reasonable theory that few people would run the kind of programs that used them
on a high end scientific machine. The operating system simulated them slowly, so
the programs did still work. But the subsequent 360/195 had the full instruction
set.
On the other hand, back when PCs were young, Intel CPUs had a separate, optional,
fairly expensive floating point coprocessor. Most PCs didn't have one, and
programs that did floating point arithmetic all had software floating point
libraries. If you didn't have the FPU, the FP instructions didn't trap, they just
did nothing. The C compiler we used had a clever hack that emitted floating point
instructions with the first byte changed to a trap instruction. If the computer
had hardware floating point, the trap handler patched the instruction to the
real floating point one and returned to it; otherwise it used the software
floating point. This got pretty good performance either way. It helped
that the PC had no hardware protection, so the C library could just install
its own trap handler that went directly to the float stuff.
On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:
Misaligned access is common enough here that, if it were not supported
natively, this would likely tank performance...
Still there are/were some architectures that refused to support it.
BGB <cr88192@gmail.com> schrieb:
double sin(double ang)
{
double t, x, th, th2;
int i;
i=ang*M_TAU_R;
th=ang-(i*M_TAU);
th2=th*th;
That leaves a lot of room for improvement; you would be better
off reducing to [-pi/4,+pi/4]. See
https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReduction.pdf
for a random reference, grabbed off the net, on how to do it.
[...]
Time cost: roughly 450 clock cycles (penalties are steep on this one).
What is your accuracy?
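For reference, a minimal C sketch of that kind of reduction (a simple
two-constant Cody-Waite split in the spirit of the Ng paper; this is only
adequate for moderate arguments, a full Payne-Hanek reduction is needed
for huge ones, and the helper name is made up):

    #include <math.h>

    /* Reduce x to r in [-pi/4, +pi/4] plus a quadrant, so that
       sin(x) = +/-sin(r) or +/-cos(r) depending on (quadrant & 3). */
    static double reduce_pio2(double x, int *quadrant)
    {
        const double head = 0x1.921fb544p+0;   /* pi/2 truncated to 33 bits      */
        const double tail = (M_PI_2 - head) + 6.123233995736766e-17; /* the rest */

        double k = nearbyint(x * (2.0 / M_PI)); /* nearest multiple of pi/2      */
        double r = (x - k * head) - k * tail;   /* k*head is exact for small k   */

        *quadrant = (int)k & 3;
        return r;
    }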
Running some stats for misaligned access in my emulator (percentage of
total memory accesses, organized by type).
Initial bootup:
Word : 7.67% DWord: 0.13% QWord: 9.53%
On 12/16/2025 7:44 PM, John Levine wrote:
S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
8 byte boundaries. Normally the Fortran compiler aligned data correctly but you
could use an EQUIVALENCE statement to create misaligned doubles, something that
occasionally happened in code ported from 709x where it wasn't a problem. There
was a library trap routine to fix it but it was stupendously slow, so IBM added
the "byte oriented operand" feature to the 360/85 and later models. ...
Misaligned access is common enough here that, if it were not supported
natively, this would likely tank performance...
BGB <cr88192@gmail.com> schrieb:
Running some stats for misaligned access in my emulator (percentage of
total memory accesses, organized by type).
Initial bootup:
Word : 7.67% DWord: 0.13% QWord: 9.53%
As the system is under your total control, you should be able
to find out where this comes from. Does the compiler not place
words correctly, how do you align your structures, do you use
misaligned large words for memcpy, ... ?
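One common source, as a concrete illustration (this is a generic
word-at-a-time copy loop, not claimed to be the actual memcpy in question):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Copies in 8-byte words regardless of the alignment of src/dst.  On a
       core with cheap misaligned access this is a win; without it, every
       QWord access below could fault or trap. */
    static void copy_words(void *dst, const void *src, size_t n)
    {
        unsigned char       *d = dst;
        const unsigned char *s = src;

        while (n >= 8) {
            uint64_t w;
            memcpy(&w, s, 8);   /* compiles down to a (possibly misaligned) load */
            memcpy(d, &w, 8);   /* ... and store, where the target allows it     */
            s += 8; d += 8; n -= 8;
        }
        while (n--)
            *d++ = *s++;
    }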
jgd@cix.co.uk (John Dallman) writes:
] * "Trap and Emulate” is an illusion of compatibility"
] * Performance differential is too great for most applications
I disagree. Trap-and-emulate may be too slow for a feature that you
want programmers to use in hot paths on current CPUs, but there are
other cases. In particular, for a feature that cannot be implemented
properly yet: if you provide it as a trap-and-emulated instruction in
the current generation, and as a faster implementation in a future
generation of the architecture, programmers will be much less
reluctant to use that instruction once it is fast on the dominant
implementation of the day than if it just produces a SIGILL or some
such on older chips (or on new, but feature-reduced, chips). Intel's
marketing does not understand that; that's why they are selling
feature-reduced versions of chips that have AVX and AVX-512 in
hardware.
- anton
On 12/16/2025 3:02 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 12/15/2025 1:37 PM, John Dallman wrote:
In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:
"Lessons from the ARM Architecture"
Non-Scribd:
<https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>
Note that this is from 2010 and does not discuss ARM64.
One comment:
] * "Trap and Emulate" is an illusion of compatibility
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
IME, it depends:
If the scenario is semi-common, it doesn't work out so well;
If it is rare enough, it can be OK.
One must also take into consideration the relative time of each path.
For example, I have Transcendental instructions in my ISA where I can
perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
I make a rounding error about once every 378 calculations. Running a SW
routine that gives correct rounding is 300-500 cycles.
So, if one can take the Incorrect-Rounding exception in less than 100
cycles (round trip), trap-and-emulate adds only about 1 cycle on average
to the running of the transcendentals.
In my case, handling of "sin()" looks sorta like:
double sin(double ang)
{
double t, x, th, th2;
int i;
/* range reduction: fold ang into one period of tau (2*pi) */
i=ang*M_TAU_R;
th=ang-(i*M_TAU);
th2=th*th;
/* odd polynomial in th; sintab_c00..c20 hold the coefficients */
t =th*sintab_c00; x=th*th2;
t+=x*sintab_c01; x*=th2;
t+=x*sintab_c02; x*=th2;
t+=x*sintab_c03; x*=th2;
t+=x*sintab_c04; x*=th2;
t+=x*sintab_c05; x*=th2;
t+=x*sintab_c06; x*=th2;
t+=x*sintab_c07; x*=th2;
t+=x*sintab_c08; x*=th2;
t+=x*sintab_c09; x*=th2;
t+=x*sintab_c10; x*=th2;
t+=x*sintab_c11; x*=th2;
t+=x*sintab_c12; x*=th2;
t+=x*sintab_c13; x*=th2;
t+=x*sintab_c14; x*=th2;
t+=x*sintab_c15; x*=th2;
t+=x*sintab_c16; x*=th2;
t+=x*sintab_c17; x*=th2;
t+=x*sintab_c18; x*=th2;
t+=x*sintab_c19; x*=th2;
t+=x*sintab_c20; x*=th2;
return(t);
}
Time cost: roughly 450 clock cycles (penalties are steep on this one).
This would be roughly 3x slower if handled with a trap and a special instruction.
Does it make sense to add an FSIN instruction? No, as FSIN is nowhere
near common enough to make much impact on code size (and FSIN would
invariably be slower).
How much space does it take? Around 460 bytes.
This is assuming DAZ/FTZ, but denormals shouldn't happen unless ang is pretty close to an integer multiple of M_TAU (2*M_PI).
On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
BGB <cr88192@gmail.com> schrieb:
double sin(double ang)
{
double t, x, th, th2;
int i;
i=ang*M_TAU_R;
th=ang-(i*M_TAU);
th2=th*th;
That leaves a lot of room for improvement; you would be better
off reducing to [-pi/4,+pi/4].
It's not what I'd do in practice. The degree of poly required after
reduction to [-pi/4,+pi/4] is way too high. It seems you would need a
Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
calls 'faithful rounding'. For something better, like 0.51 ULP, you
would need one more term.
There are better methods, like reducing to a much smaller interval, e.g.
to [-1/64,+1/64], maybe even to [-1/128,+1/128]. The details of the
trade-off between the size of the reduction table and the length of the
polynomial depend on how often you plan to use your sin() function.
Michael S <already5chosen@yahoo.com> posted:
On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
BGB <cr88192@gmail.com> schrieb:
double sin(double ang)
{
double t, x, th, th2;
int i;
i=ang*M_TAU_R;
th=ang-(i*M_TAU);
th2=th*th;
That leaves a lot of room for improvement; you would be better
off reducing to [-pi/4,+pi/4].
It's not what I'd do in practice. The degree of poly required after reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
calls 'faithful rounding'. For something better, like 0.51 ULP, you
would need one more term.
To be clear, my coefficients are not restricted to 53-bits like a SW implementation.
There are better methods. Like reducing to much smaller interval,
e.g. to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details
of trade off between size of reduction table and length of
polynomial depend on how often do you plan to use your sin()
function.
With every shrink of the argument range, the table size blows up
exponentially. For my transcendentals, the combined table sizes
are about the same as the table sizes for FDIV/FSQRT when using
Goldschmidt iteration with 11-bit-in, 9-bit-out tables.
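For readers unfamiliar with the technique, here is a C sketch of
Goldschmidt division seeded from a low-precision reciprocal (the 9-bit
seed quantization stands in for the kind of small lookup table being
compared above; this is a sketch of the iteration, not of any particular
hardware):

    #include <math.h>

    static double goldschmidt_div(double n, double d)
    {
        /* Seed: reciprocal of d rounded to ~9 significant bits, standing in
           for a small table lookup. */
        double seed = 1.0 / d;
        int    e    = ilogb(seed);
        seed = ldexp(rint(ldexp(seed, 8 - e)), e - 8);

        double N = n * seed, D = d * seed;   /* D is now within ~2^-9 of 1.0 */
        for (int i = 0; i < 3; i++) {        /* error squares: 2^-9 -> -18 -> -36 -> -72 */
            double F = 2.0 - D;
            N *= F;
            D *= F;
        }
        return N;    /* ~n/d; real FDIV hardware adds care for correct rounding */
    }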
On Wed, 17 Dec 2025 20:06:57 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
BGB <cr88192@gmail.com> schrieb:
double sin(double ang)
{
double t, x, th, th2;
int i;
i=ang*M_TAU_R;
th=ang-(i*M_TAU);
th2=th*th;
That leaves a lot of room for improvement; you would be better
off reducing to [-pi/4,+pi/4].
It's not what I'd do in practice. The degree of poly required after reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
calls 'faithful rounding'. For something better, like 0.51 ULP, you
would need one more term.
To be clear, my coefficients are not restricted to 53-bits like a SW implementation.
There exist tricks that can achieve the same in software, sometimes at the
cost of one additional FP op and sometimes even for free. The latter is
especially common when FMA costs the same as FMUL.
There are better methods. Like reducing to much smaller interval,
e.g. to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details
of trade off between size of reduction table and length of
polynomial depend on how often do you plan to use your sin()
function.
With every shrink of the argument range, the table size blows up exponentially. For my transcendentals, the combined table sizes
are about the same as the table sizes for FDIV/FSQRT when using
Goldschmidt iteration using 11-bit in, 9 bit out tables.
Software trade-offs are different.
Assuming the argument is already in the [0, +pi/2] range, reduction down
to [-1/128, +1/128] requires pi/2*64 ≈ 100 table entries. Each entry
occupies, depending on the format used, 18 to 24 bytes, so 1800 to 2400
bytes total. That size fits very comfortably in the L1D cache, so from
the perspective of hit rate there is no incentive to use a smaller table.
At first glance, reduction to [-1/128, +1/128] appears especially
attractive for an implementation that does not aim for very high
precision. Something like 0.65 ULP looks attainable with a poly of the
form (A*x**4 + B*x**2 + C)*x. That's just my back-of-envelope estimate
in the late evening hours, so I can be wrong about it. But I can also be
right.
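A C sketch of the structure being described (the table is built here with
libm sin/cos and the residual polynomials are plain Taylor terms, purely
to show the shape; a real implementation would use minimax coefficients
and a carefully generated, possibly extended-precision, table):

    #include <math.h>

    #define TBL_N 102   /* ceil((pi/2) * 64) + 1 entries, per the estimate above */

    static double sin_tbl[TBL_N], cos_tbl[TBL_N];
    static int    tbl_ready;

    static void init_tbl(void)
    {
        for (int k = 0; k < TBL_N; k++) {
            sin_tbl[k] = sin(k / 64.0);
            cos_tbl[k] = cos(k / 64.0);
        }
        tbl_ready = 1;
    }

    /* sin(x) for x already reduced to [0, pi/2]: pick the nearest table point
       c = k/64, then sin(x) = sin(c)*cos(r) + cos(c)*sin(r) with |r| <= 1/128. */
    static double sin_reduced(double x)
    {
        if (!tbl_ready) init_tbl();

        int    k  = (int)(x * 64.0 + 0.5);    /* nearest multiple of 1/64 */
        double r  = x - k / 64.0;             /* residual, |r| <= 1/128   */
        double r2 = r * r;
        double sr = ((r2 / 120.0 - 1.0 / 6.0) * r2 + 1.0) * r;  /* ~sin(r) */
        double cr = (r2 / 24.0 - 0.5) * r2 + 1.0;               /* ~cos(r) */

        return sin_tbl[k] * cr + cos_tbl[k] * sr;
    }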
Lawrence D'Oliveiro <ldo@nz.invalid> posted:
On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:
Misaligned access is common enough here that, if it were not
supported natively, this would likely tank performance...
Still there are/were some architectures that refused to support it.
There are smart fools everywhere.
] * "Trap and Emulate” is an illusion of compatibility"
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
It appears that John Dallman <jgd@cix.co.uk> said:
] * "Trap and Emulateďż˝ is an illusion of compatibility"
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
For at least 20 years IBM's mainframes have used what they call millicode. The
relatively simple instructions are implemented in hardware, and everything else
in millicode, including the more complicated instructions, I/O, and other
system features. Millicode runs on the same CPU using the same instruction set
as regular code, with some extra registers and instructions to handle aspects of
the hardware not visible to regular programs. It is stored in dedicated memory
which is loaded at boot time, so it's easy to update.
I gather there have been instructions that were implemented in millicode, then
moved into hardware in the next CPU generation since they were used enough for
the speed to matter.
Here's a 2012 slide deck:
https://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_metro_naspa_2012_10_what_and_why_of_system_z_millicode.pdf
On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> posted:
On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:
Misaligned access is common enough here that, if it were not
supported natively, this would likely tank performance...
Still there are/were some architectures that refused to support it.
There are smart fools everywhere.
No doubt another lesson learned from instruction traces: misaligned
accesses occurred so rarely, it made sense to simplify the hardware by leaving them out.
The same conclusion was drawn about integer multiplication and
division in those early days, wasn’t it.
On 18/12/2025 01:34, John Levine wrote:
It appears that John Dallman <jgd@cix.co.uk> said:
] * "Trap and Emulateďż˝ is an illusion of compatibility"
] * Performance differential is too great for most applications
This is inevitably true nowadays, but wasn't when the idea was invented.
For at least 20 years IBM's mainframes have used what they call millicode. The
relatively simple instructions are implemented in hardware, and everything else
in millicode, ...
Typical IBM boosterism of a minor variant of an existing technique,
as used (for range compatibility, etc) by the ICT 1900 Series in 1965
and before that (for h/w economy) by the Ferranti Atlas and Orion.