• Lessons from the ARM Architecture

    From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 15 17:19:35 2025
    From Newsgroup: comp.arch

    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.


    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Dec 15 18:05:53 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.

    It certainly is, thanks for sharing!

    He left out POWER, for some reason or other.


    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    Wise words.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Mon Dec 15 19:37:00 2025
    From Newsgroup: comp.arch

    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    John
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Dec 15 22:24:42 2025
    From Newsgroup: comp.arch

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.


    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    There is also a tradeoff between density in cold path vs relative
    overhead vs a direct runtime call.


    I ended up partially migrating towards trap-and-emulate from runtime
    calls for "long double" as these can have a higher relative density in
    the cold path, and the added cost of the trap is smaller relative to the
    cost of running the Binary128 operations.

    Though, for semi-common cases, runtime calls are the preferable option.

    For example, floating-point or integer divide are better served by a
    runtime call than a trap:
    Relative cost of the trap is very high in comparison;
    More common in the hot-path.


    Things like memory barrier instructions and mutex-related instructions
    could make sense to handle via traps, but it is still preferable to do
    mutex locks from userland via a system call or similar.

    In kernel space, it is more of an issue (if not just using a function
    call), since then it turns into one of several scenarios:
    Multicore:
    Need to flush caches and lock the mutex in a coherent way.
    Single core:
    If mutex isn't free, kernel is already dead...
    Though, if locked by the current task,
    can just gloss over and let it continue;
    Else, trigger a panic or something.

    Things like IO devices may present their own sorts of challenges
    (particularly if one device is both the main OS device and the device
    holding the pagefile). IME, things like the pagefile generally need to
    bypass the main filesystem, and the filesystem code needs to be
    structured so that it is (ideally) impossible for a page fault or TLB
    miss to happen during an IO operation (for this reason, any VFS
    block-caching or similar needs to be done with physically-backed memory
    rather than virtual memory, etc).

    Well, and also some other important structures like task contexts can't
    use virtual memory (well, more so if the architecture doesn't allow
    interrupt handlers to natively access virtual memory; and it still makes
    sense to be able to handle things like task scheduling in an interrupt handler).

    ...


    But, then one can argue that maybe the cost of trap handling could get
    higher relative to that of native or runtime call handling.

    For example:
    It makes sense to handle Binary128 as traps if the overhead can stay
    under around 1000 cycles or so, but would not make sense at 100000 or
    1000000 clock cycles, where runtime calls would be the clear winner.

    Or, alternatively, if the runtime calls for the binary128 operations
    were under 100-200 clock cycles (very possible on an OoO machine).

    But, if there is a 200% speed difference, 800% code-size difference, and
    the operation is cold-path only, trapping makes sense.


    A bigger performance concern though is trapping on FDIV or denormals,
    which (if not using DAZ+FTZ) may still be common enough to be significant.


    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x multiplier out of the effective size of the TLB (and without some of the drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B
    PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.
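
    (Coverage math, for reference: a 16K page holds 16K/8B = 2048 PTEs, each
    mapping a 16K page, so each cached last-level table page covers
    2048*16K = 32MB; 64 entries * 32MB = 2GB, and 512 entries (256x2) gives
    16GB. A conventional 256x4 TLB covers 1024 * 16K = 16MB.)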



    Sometimes, it makes more sense to *not* have an instruction: namely, when
    the instruction only makes sense if it is semi-frequently used, but will
    at the same time negatively affect the implementation and add significant
    cost if implemented using traps.

    One example being possibly RISC-V's V extension, where:
    If implemented as trap handlers;
    If programs actually use it.
    Is very liable to wreck performance.

    A concern is partly that, at least to me, the design of RISC-V's V
    extension seems like a foot-gun. Some people argue that it isn't
    inherently any more expensive than ARM's NEON, but I still have some
    concern here.

    Well, I also have some concern over parts of the B extension, which seem overly niche and which don't map onto C or similar.

    ...



    One potential ugly point that exists in my designs (including XG3) is
    the use of 48-bit pointers with tagging in the high 16 bits. I can't
    prove that this won't come back to bite; but then at the same time I also
    still need some amount of tag bits. And, Intel/AMD/ARM had ended up
    partly going in a similar direction as an optional feature (though
    differing on the 8 vs 16 bits question).


    When I ended up moving the FPU status from GBR/GP to SP, this created a
    new mess, so was likely a partial mistake. I ended up moving these bits
    again to being located in TBR, with SP turned into a special "HOBs are potentially always zero" register.

    Granted, this did have a non-zero code impact (the code for the
    "FENV_ACCESS" stuff needed to be tweaked). Does leave a problem for
    RISC-V as accessing FENV in the standard way (via CSRs) will have a
    steep cost. For whatever reason, the RV designers thought it made sense
    to put things like rounding-mode and flags into separate sub-CSRs (which
    would be too expensive to handle and not used often enough to justify
    making it faster).

    ...



    But, activity in my ISA is slow recently.

    Has reached a point of "mostly good enough", which makes a difficulty.
    Like, what does one do when they reach a point where there isn't as much
    to obviously improve on?...

    Well, TestKern is still kinda crap, still no working Linux nor ability
    to run Linux binaries.

    Not entirely clear where to go from here, or the use case for what I
    ended up with. Here, XG3 works reasonably OK as a VM, but this was kind
    of an overly long and convoluted path if I have still just ended up back
    with having a VM (and to what extent I do electronics projects, still
    often just using a RasPi or similar).

    Like, this isn't much beyond where I was a decade ago.



    More recent activity has been in other areas, mostly involving BGBCC's resource-converter stuff:
    Working some on trying to improve my UPIC stuff, as UPIC turned out to
    be "kinda useful" (*1).


    *1: UPIC kinda resembles T.81 JPEG, but:
    Uses STF+AdRice rather than Huffman;
    Uses BHT+RCT rather than DCT+YCbCr;
    ...

    A few more recent extensions optionally add:
    WHT, LGT-5/3, and DCT transforms;
    YCoCg-R and Approx_YCbCr.
    Encoder can try out different options and pick the configuration that maximizes Q/bpp (lossy) or minimizes size (lossless, excludes DCT and Approx_YCbCr as these are inherently lossy).

    Had noted that the "winner" seems to depend a lot on image type and quality:
    Lossless:
    BHT or LGT-5/3 usually win (BHT wins more often).
    More often RCT wins over YCoCg-R.
    So, BHT + RCT seems to dominate for Lossless.
    Near Lossless (high quality):
    LGT-5/3 and YCoCg-R often jump into the lead.
    Low Quality:
    DCT and Approx_YCbCr often move into the lead for photo-like images;
    LGT-5/3 and YCoCg still do well for cartoon-like images though.

    WHT could do OK for photo-like images, but:
    Only holds a lead at lower quality levels;
    Almost always loses to DCT when DCT is still an option;
    So, WHT is making the worst showing here;
    Though, it is faster than DCT, and supports lossless.
    Never mind that, for lossless, it nearly always loses to LGT-5/3.

    Approx_YCbCr is basically an approximation of YCbCr:
    Y=(G*4+R*3+B)/8
    U=B-Y
    V=R-Y
    Though, this isn't exact, and:
    U=(B-Y)/2+128
    V=(R-Y)/2+128
    Would technically be closer.

    Though, in UPIC, the DC bias is 0 rather than 128, so differs here from
    JPEG.
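
    As a rough C sketch of the forward/inverse transform as used here
    (0-centered chroma; the function names are just made up for illustration,
    and the integer truncation is what makes it slightly lossy):

        static void aycbcr_fwd(int r, int g, int b, int *y, int *u, int *v)
        {
            *y = (g*4 + r*3 + b) >> 3;
            *u = b - *y;
            *v = r - *y;
        }
        static void aycbcr_inv(int y, int u, int v, int *r, int *g, int *b)
        {
            *b = y + u;
            *r = y + v;
            *g = (y*8 - (*r)*3 - (*b)) / 4;  /* from Y = (G*4 + R*3 + B)/8 */
        }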


    Had experimented with a few other possibilities, but none was
    particularly competitive.


    Where:
    BHT : Block Haar (Wavelet) Transform
    Recursively maps: (A,B) => ((A+B)/2, A-B)
    Split into averages and differences, applied recursively.
    LGT 5/3: Block (Le Gall–Tabatabai) 5/3 (Wavelet)
    Recursively split even/odd, predict odds from averages of evens.
    Bottom step resembles Haar.
    WHT : Walsh-Hadamard Transform
    Kinda resembles BHT, but more complicated.
    There is also another kinda WHT, but it did worse:
    (A+B+C+D, A+B-C-D, A-B-C+D, A-B+C-D)
    DCT : Discrete Cosine Transform
    Basically the same thing JPEG used here.
    Lossy only (making DCT lossless is too slow).
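
    FWIW, the BHT (A,B) => ((A+B)/2, A-B) step can be made exactly
    invertible despite the /2 via the usual Haar/S-transform lifting trick
    (a sketch; not necessarily exactly how UPIC implements it):

        /* forward */
        d = a - b;           /* difference */
        s = b + (d >> 1);    /* == floor((a+b)/2) */
        /* inverse */
        b = s - (d >> 1);
        a = b + d;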

    As with T.81 JPEG, UPIC works with 8x8 blocks, typically organized into
    a 16x16 macroblock with 4:2:0 or 4:4:4 chroma subsampling. As in the
    Theora codec, the individual sub-blocks within a macroblock (when more
    than one) are encoded in Hilbert order (U shaped). The format also
    natively supports an alpha channel.

    In this case, the 8x8 transforms are formed by two levels of 1D
    transform. Horizontal then vertical for encode, vertical then horizontal
    for decode (correct ordering here is important for lossless).


    Had noted in some testing:
    Lossless compression, for many images, can beat PNG.
    At near-lossless quality, Q/bpp also beats T.81 JPEG.

    But:
    For "very synthetic" images, such as pie charts or graphs, PNG tends to
    win over UPIC in terms of size (though UPIC still wins in being faster
    to decode than PNG and with less memory footprint).

    At the lower end of the quality scale (mostly under 30%), JPEG still
    holds on well for Q/bpp, ... But, this is very sensitive to how the
    quantization matrix is computed (the quantization matrix needs to be
    tuned for the specific transform).

    Possible reasons:
    STF+AdRice is slightly less optimal than Huffman;
    The zero count was reduced from 4 to 3 bits, which is weaker with
    longer runs of zeroes (more common with flat-color blocks and at low
    quality levels).

    One considered but untested possibility:
    Jointly encoding groups of four 8x8 blocks as one giant 256-entry block
    (to partly consolidate the long zero runs and the early EOBs). Unclear
    if it would be worth the added complexity though.


    Had considered using a variant of my BT5B codec as an image format for textures, but ended up deciding against this as the Q/bpp was kinda awful
    (and it started to seem better to use UPIC instead and just eat the
    extra CPU time for converting this to DXT1 or DXT5). While my older
    (BTIC4B) format had worked OK, it is considerably more complicated (and
    UPIC being "like JPEG but faster" and transcoding to DXT1 is "maybe good enough").

    Where, basically, for technical reasons, BT5B was essentially unable to
    significantly beat the DDS format in Q/bpp. It could get smaller files,
    but at much worse quality; it is still essentially similar tech to the
    MS-CRAM video codec (and for similar reasons one wouldn't really want to
    store textures via CRAM; it was at least possible to get close to 1:1
    with DDS for quality, but then also roughly 1:1 in terms of file size).


    Note that for a decode path to DXT1/5, UPIC generally decodes and
    transforms one macroblock at a time (this is more cache-friendly than
    decoding the entire image to RGBA and then DXT encoding it; though it
    still uses full-image passes for mipmap generation).

    Note that Q/bpp could be improved, technically, by feeding the image
    through a bitwise range coder, but this would have a significant adverse impact on speed.

    ...



    Also (probably feature creep) ended up adding 3D model conversion and
    CSG stuff to BGBCC (including a makeshift interpreter for the SCAD /
    OpenSCAD language).

    This was in part because:
    BGBCC's core also serves as the core of my WAD4 packing tool;
    It is useful to have model conversion stuff.

    It would partly take over the role of using my current 3D engine for
    model conversion.

    In the 3D engine I had typically been using a modified BASIC dialect, but
    had mostly designed the models originally in OpenSCAD, and using the
    SCAD language allows keeping the models intact (but OpenSCAD doesn't
    natively support colors or textures for any of the 3D model export formats).

    Initially, these were only supported for my own 3D model format (used
    for my 3D engine), but a not-yet-released version also supports the
    "Wavefront OBJ" format (I also tried implementing some of the color
    extensions for Binary STL, but nothing seemed able to import them with
    color).

    Other formats had other drawbacks:
    AMF: More complicated, poorly supported;
    XML based.
    3MF: Much more complicated.
    Format consists of XML blobs inside a ZIP based package, ...
    DAE: Yeah, no.
    3DS: Also no.

    DAE and 3DS seem more sane for high-end mesh modeling, not so much for CSG.


    Doing CSG with skeletal animation is currently supported for my BASIC
    dialect; could map it over to SCAD, but would not have any direct
    equivalent in OpenSCAD.

    For textures, can map the textures over to "color()" which is then
    understood as identifying a texture if not using a normal color
    name/value. There is no direct need to specify ST/UV coords here, as
    these are implied via texture projection.

    Textures may be specified with planes, and then when projecting it will
    pick the plane for this texture that is the closest match to the face
    normal.

    Sorta like:
    texturename:
    image [ Sx Sy Sz So ] [ Tx Ty Tz To ]
    ...
    Where: N = S x T
    S/T encoding both the scale and translation along the axis.
    Basically, working in a vaguely similar way to how texturing worked in Half-Life (just with the 'color' also able to specify a texture).
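
    In C terms, the per-vertex projection is then roughly (a sketch; the
    variable names are placeholders):

        /* Project a vertex (vx,vy,vz) through one [Sx Sy Sz So][Tx Ty Tz To]
           pair, Quake/Half-Life style. The pair used is the candidate whose
           N = S x T best matches the face normal. */
        u = vx*Sx + vy*Sy + vz*Sz + So;
        v = vx*Tx + vy*Ty + vz*Tz + To;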

    If I added bones to SCAD, one possibility is:
    bone("bonename") {
    ... geometry ...
    }

    In the BT3 engine, the animation was handled by assigning geometry to
    the skeleton (no vertex weights for now), and then using external files
    (also text based) to describe any animation sequences.

    Though, I had decided I will likely stick mostly with using sprite
    graphics for mobs and NPCs, so the animation part may be less relevant.


    I may start making more use of procedural compositing and animation for sprites though (possibly using a similar system to that used for VTuber
    rigs; though likely specified as text files and using coords). Haven't
    done much yet.

    So, say, in this case, might define parts relative to a sprite sheet:
    part hand sprites/somechar [ 320 100 ] [ 400 160 ]
    part forearm sprites/somechar [ 410 100 ] [ 550 160 ]
    ...
    bone spine hip [ 320 400 ]
    bone neckbase spine [ 320 600 ]
    bone lupperarm neckbase [ 290 600 ]
    bone lforearm lupperarm [ 160 600 ]
    bone lhand lforearm [ 100 600 ]
    ...
    attach hand lhand [ 10 580 ] [ 1 0 0 1 ]
    attach hand rhand [ 600 580 ] [ -1 0 0 1 ]
    ...

    Well, contrast this with VTuber rigs usually being built with a lot of
    drag-and-drop tools (but I have my reasons for wanting it to remain
    text based; it would then likely be parsed and cooked down into a binary format).

    Or, basically, each part is clipped out of a sheet texture, then
    attached to a skeleton giving the base location and a transform (as a
    2x2 matrix).

    Animations could then be given as a sequence of frames specifying the
    relative translation and rotation of each bone (or possibly
    hiding/showing certain attachments, say for example open-vs-closed
    hands, etc).

    Though, a 2D skeleton would still carry over some of the limitations of
    traditional 2D sprite graphics.


    Though, I guess another option here is to use a hybrid approach of
    attaching sprites to a 3D skeleton and then animating it as-if it were
    3D; but this doesn't really save anything (effort wise) vs the use of
    CSG models (and, if anything, more just makes a case for the ability to
    attach sprites in addition to CSG solids).

    Then again, some early 3D games, like "Super Mario 64" did make
    effective use of such a strategy (character models often combined 3D
    elements with the use of sprites in some cases; rather than being purely
    3D).

    ...



    Though, for past stuff, had usually drawn out every sprite frame.
    But, this doesn't scale well to characters with dynamic movement.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Tue Dec 16 12:02:18 2025
    From Newsgroup: comp.arch

    In article <memo.20251215193718.13016J@jgd.cix.co.uk>,
    John Dallman <jgd@cix.co.uk> wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Archive.org link to the actual PDF: https://web.archive.org/web/20160201075644/https://www.eit.lth.se/fileadmin/eit/courses/eitf20/ARM_RG.pdf

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    And how!

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 16 15:40:26 2025
    From Newsgroup: comp.arch

    On Mon, 15 Dec 2025 18:05:53 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.

    It certainly is, thanks for sharing!

    He left out POWER, for some reason or other.


    I think that the reason was simple: he searched his drawer and found a
    brochure from the 1992 Microprocessor Forum. By chance, IBM didn't present
    POWER-related stuff at that forum. Likely because POWER (later known as
    POWER1) and RSC had been presented in the previous year[s], and for PPC601
    they wanted to hold their cards closer to their collective chests, because
    anything else would be incompatible with Apple's culture.



    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    Wise words.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 16 12:29:58 2025
    From Newsgroup: comp.arch

    On 12/15/2025 12:05 PM, Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.

    It certainly is, thanks for sharing!

    He left out POWER, for some reason or other.


    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    Wise words.

    Yes.


    Ideally, any feature should do one of:
    Be trivially usable by a C compiler or similar;
    Address a scenario that is common but can't be handled
    efficiently enough otherwise.

    An example of the latter would be things like RGB555 packing/unpacking,
    where RGB555 has ended up very common in a lot of my graphics handling
    code, but the traditional ways to pack/unpack RGB555 are a bit lacking
    if it needs to be done in a tight loop.
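
    For reference, the "traditional" scalar path looks something like this
    (purely illustrative), which is a fair pile of shifts and masks per
    pixel once it lands in a tight loop:

        static inline unsigned rgb555_pack(unsigned r, unsigned g, unsigned b)
        {   /* r,g,b in 0..255 */
            return ((r>>3)<<10) | ((g>>3)<<5) | (b>>3);
        }
        static inline void rgb555_unpack(unsigned p,
            unsigned *r, unsigned *g, unsigned *b)
        {
            *r=((p>>10)&31)<<3; *g=((p>>5)&31)<<3; *b=(p&31)<<3;
        }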


    Though, in most cases, designing instructions or similar for a specific algorithm seems like a bad idea. In nearly every other situation the instructions would end up being essentially useless.


    In my case, I had an overly niche feature I had called "LDTEX":
    Takes a compressed texture given as a base address and loads a texel
    with fixed-point ST coords given in the index register (nearest neighbor
    only, but there was a 64-bit encoding variant to control the texel
    rounding to specify one of 4 positions).

    This had a very niche use-case:
    When implementing a software rasterizer.

    Usage elsewhere:
    Not really...


    Status:
    Now mostly leaving it disabled, as it is less needed if using a hardware rasterizer (but would be more relevant if I had GLSL; issue mostly is
    that a GLSL compiler would be a bit too much code footprint).

    Though, could potentially do ARB shaders, or a non-standard shader
    language based on unstructured BASIC or similar.

    With the HW rasterizer, there is still the drawback that geometry
    processing on the CPU is a significant bottleneck (as are often things
    within the 3D engines).

    ...


    Then again, ARM had ended up with some things that were far more niche,
    and ultimately even more useless, like Jazelle and similar.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 16 21:02:28 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.


    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can
    perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
    I make a rounding error about once every 378 calculations. Running a SW
    routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100
    cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.
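
    (Back of the envelope: (<100 cycle trap + 300-500 cycle SW path) / 378
    calls works out to roughly 1.0-1.6 extra cycles per call, amortized.)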

    -----------------------------------------------
    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x multiplier out of the effective size of the TLB (and without some of the drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.

    If your HW tablewalker made 1 probe into a large (16MB) SW controlled
    buffer of PTEs, and if it found the PTE great, it gets installed and
    life goes on, otherwise, SW updates the Table and TLB and returns as
    soon as possible.

    This gives you a L2-TLB with 1M entries, enough to map 8GB and a
    table walking state machine with 3 states.

    -----------------------------------------------
    One potential ugly point that exists in my designs (including XG3) is
    the use of 48-bit pointers with tagging in the high 16 bits. I can't
    prove that this won't come back to bite; but then at the same time I also still need some amount of tag bits. And, Intel/AMD/ARM had ended up
    partly going in a similar direction as an optional feature (though
    differing on the 8 vs 16 bits question).

    I CAN prove it WILL come back to Bite--but only if your architecture survives

    -----------------------------------------------------
    For textures, can map the textures over to "color()" which is then understood as identifying a texture if not using a normal color
    name/value. There is no direct need to specify ST/UV coords here, as
    these are implied via texture projection.

    Can you think of uses for Texture where a LD instruction has a FP
    index::

    LD Rd,[Rbase,Rindex,Displacement]

    where: Rbase is a pointer
    Rindex is a FP value
    Displacement is what it is

    The integer part of Rindex and integer part + 1 index the Texture.
    The fraction part of Rindex performs the blending between
    LERP( texture[int(Rindex)], texture[int(Rindex)+1], fract(Rindex) );
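
    In scalar C terms (ignoring the displacement), such a load would
    compute roughly:

        double ld_lerp(const double *base, double idx)
        {
            int    i = (int)idx;       /* integer part */
            double f = idx - i;        /* fractional part */
            return base[i] + (base[i+1] - base[i]) * f;
        }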
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Dec 16 16:20:34 2025
    From Newsgroup: comp.arch

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can
    perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
    I make a rounding error about once every 378 calculations. Running a SW routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100
    cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.

    Hmm... I think that calculation isn't quite right:

    - First, this assumes a random distribution. Depending on the details
    (e.g. if the 1/378 cases are grouped in a specific subspace that some
    algorithm might use heavily), it can get a lot worse (or a bit better).

    - During those 300-500 cycles, no other instruction is processed, IOW
    it's not as if this one instruction took just one more cycle, but as if
    the whole CPU stopped all activity for one cycle. For an ILP of 1,
    it's about the same, but for an ILP of 4, it's 4x more expensive.

    I CAN prove it WILL come back to Bite--but only if your architecture
    survives

    It's part of the problem with those "lessons from" successful projects:
    don't forget that it is sometimes indispensable to cut corners for the
    project to survive in the first place. Maybe you'll regret it years
    later, yet, without it, there wouldn't even be any opportunity to regret anything years later. There are delicate trade-offs.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 16 19:01:18 2025
    From Newsgroup: comp.arch

    On 12/16/2025 3:02 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd:
    <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can
    perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
    I make a rounding error about once every 378 calculations. Running a SW routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100
    cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.



    In my case, handling of "sin()" looks sorta like:

    double sin(double ang)
    {
        double t, x, th, th2;
        int i;

        /* Range-reduce to roughly [0, 2*PI) (or (-2*PI, 0] for negative ang). */
        i=ang*M_TAU_R;
        th=ang-(i*M_TAU);
        th2=th*th;

        /* Polynomial in odd powers: t = c00*th + c01*th^3 + c02*th^5 + ... */
        t =th*sintab_c00;  x=th*th2;
        t+=x*sintab_c01;   x*=th2;
        t+=x*sintab_c02;   x*=th2;
        t+=x*sintab_c03;   x*=th2;
        t+=x*sintab_c04;   x*=th2;
        t+=x*sintab_c05;   x*=th2;
        t+=x*sintab_c06;   x*=th2;
        t+=x*sintab_c07;   x*=th2;
        t+=x*sintab_c08;   x*=th2;
        t+=x*sintab_c09;   x*=th2;
        t+=x*sintab_c10;   x*=th2;
        t+=x*sintab_c11;   x*=th2;
        t+=x*sintab_c12;   x*=th2;
        t+=x*sintab_c13;   x*=th2;
        t+=x*sintab_c14;   x*=th2;
        t+=x*sintab_c15;   x*=th2;
        t+=x*sintab_c16;   x*=th2;
        t+=x*sintab_c17;   x*=th2;
        t+=x*sintab_c18;   x*=th2;
        t+=x*sintab_c19;   x*=th2;
        t+=x*sintab_c20;
        return(t);
    }

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    This would be roughly 3x slower if handled with a trap and a special instruction.

    Does it make sense to add an FSIN instruction? No, as FSIN is nowhere
    near common enough to make much impact on code size (and, FSIN will be invariably slower).

    How much space does it take? Around 460 bytes.


    This is assuming DAZ/FTZ, but denormals shouldn't happen unless ang is
    pretty close to an integer multiple of M_TAU (2*M_PI).

    A possible hack would be to detect that if 'th' is sufficiently close to
    0 then just return 0 (and sidestep going into denormal land).
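
    Say (with the threshold purely as a placeholder):

        if(fabs(th)<1e-9)   //small enough that the th^n terms go denormal
            return(0.0);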



    What about "sinl()"?
    long double sinl(long double ang);
    Rest similar, just long double.


    Roughly 80k cycles...
    This is kinda awful.

    If using function calls, it can drop to 60k cycles, still awful.

    How much space:
    Around 800 bytes if we assume special instructions and traps.
    Eg: FMUL.Q and FADD.Q and similar.
    Around 3K if using function calls.

    so, in this case, 30% slower but 27% the size still seems like a win...

    And, at 60k or 80k cycles, there isn't really any making this fast.


    Ironically, might get faster if/when there were some implementation that actually had native (non-trapping) Binary128 support.



    Other option is to just use "sin()" internally and convert to/from "long double".



    -----------------------------------------------
    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x
    multiplier out of the effective size of the TLB (and without some of the
    drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B
    PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.

    If your HW tablewalker made 1 probe into a large (16MB) SW controlled
    buffer of PTEs, and if it found the PTE great, it gets installed and
    life goes on, otherwise, SW updates the Table and TLB and returns as
    soon as possible.

    This gives you a L2-TLB with 1M entries, enough to map 8GB and a
    table walking state machine with 3 states.

    -----------------------------------------------
    One potential ugly point that exists in my designs (including XG3) is
    the use of 48-bit pointers with tagging in the high 16 bits. I can't
    prove that this won't come back to bite; but then at the same time I also
    still need some amount of tag bits. And, Intel/AMD/ARM had ended up
    partly going in a similar direction as an optional feature (though
    differing on the 8 vs 16 bits question).

    I CAN prove it WILL come back to Bite--but only if your architecture survives


    Maybe, but it is unclear if it will at this point.

    RISC-V is more likely to survive, and XG3 could make sense in the RISC-V context, but not likely as a primary ISA in any case.

    Though, the big uncertainty is whether 48-bit tagged PC and
    Link-Registers are too much of an ask. An implementation could be
    possible that used 56 bit addresses, but would not be strictly
    compatible with my existing implementation (doing mode jumps does
    require awareness of the tagging scheme, which has a non-zero code impact).

    I had at one point considered the possibility of hacking certain RISC-V
    Load instructions to function as mode-selecting JALR's to allow LUI+Lx
    to encode an inter-mode JAL (and also as a way to detect XG3 support),
    but thus far this has not been done.

    Say:
        LUI X5, AddrHi
        LW  X0, AddrLo(X5)   //JALR to XG3 mode
        ...                  //If we get here, it didn't work.

    Where "LW X0" would be understood as a JALR to XG3 mode, and "LH X0"
    as a JALR to RV64GC mode. The scheme works because normal RISC-V ops
    are valid in both modes.


    One merit of the pointer tagging though is that it allows call/return
    between modes, but exposes a certain level of wonk.


    -----------------------------------------------------
    For textures, can map the textures over to "color()" which is then
    understood as identifying a texture if not using a normal color
    name/value. There is no direct need to specify ST/UV coords here, as
    these are implied via texture projection.

    Can you think of uses for Texture where a LD instruction has a FP
    index::

    LD Rd,[Rbase,Rindex,Displacement]

    where: Rbase is a pointer
    Rindex is a FP value
    Displacement is what it is

    The integer part of Rindex and integer part + 1 index the Texture.
    The fraction part of Rindex performs the blending between
    LERP( texture[int(Rindex)], texture[int(Rindex)+1], fract(Rindex) );


    For the LDTEX instruction, it was:
    (63:48): Integer part of T coord
    (47:32): Fraction of T coord
    (31:16): Integer part of S coord
    (15: 0): Fraction of S coord.

    For the 64-bit encoding, the Disp field had encoded:
    0: Truncate S and T
    1: Round S up, Truncate T
    2: Truncate S, Round T up
    3: Round S and T up.

    With the HOB's of the address given to LDTEX encoding the texture size
    and format (could all be fit into 16 bits, due to textures being
    power-of-2 size and square/rectangular).
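
    I.e., packing the index register looks something like (a sketch, using
    u32/u64 as shorthand typedefs):

        /* S and T each as 16.16 fixed point: T in the high 32 bits,
           S in the low 32 bits, matching the layout above. */
        static inline u64 ldtex_pack_st(u32 s_fix, u32 t_fix)
        {
            return ((u64)t_fix<<32) | s_fix;
        }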


    This allowed 4 LDTEX instructions, and BLINT instructions, to be used to implement bilinear filtering.

    Though, cheaper was to only do 3 LDTEX, 2 BLINT (S and T), and then
    average the color vector. This gives an effect similar to the texture filtering used on the Nintendo 64.

    In the absence of LDTEX, one might do instead:
    PMORT.Q + SHLD + AND + MOV.Q + BLKUTX2

    Though, not as bad for bilinear as they could be interleaved (though,
    still higher latency than LDTEX).

    And, then use packed SIMD calculations rather than BLINT.


    In theory, 5 LDTEX's could be used to implement trilinear, but had noted
    that some tweaking with the interpolation coords can be used to mimic
    the effects of trilinear filtering without actually needing to access
    another mipmap level. Rather than loading and interpolating the next
    mipmap level, merely push the interpolation coords towards the center of
    the 4 texels.

    Resulting in a sorta "Crappy bilinear filtering that mimics the look of trilinear filtering". Well, it at least can avoid the main obvious
    artifact of bilinear, which is that there is a big obvious seam whenever
    it jumps between mipmap levels.


    Though, in the case of SCAD, the "color()" operation is usually given a string, say:
    color("#AA5500")
    {
    translate([32,-1,1])
    {
    cube([8,5,2]);
    translate([1,3.2,0])
    rotate([0,0,25])
    cube([6,2,2]);
    }
    }

    One annoyance with SCAD is that (like with GLSL) it takes a pretty big
    chunk of code to deal with it.

    I don't generally want to deal with this directly on the target ISA, so
    makes more sense to precook it into a triangle-based model.

    In this case, one can get creative:
    color("sometexture")
    {
    ...
    }
    Where "sometexture" is understood as giving the name of the texture or material to apply, rather than the name of the color.



    But, there is an annoyance of model formats:
    BMD: a custom model format.
    Mostly uses a packed format for coordinates (joint exponent fun);
    Reasonably compact and cheap to unpack.
    Partly influenced by the Quake and Half-Life model formats.
    STL:
    Common;
    Lacks any standard way to do colors,
    and even then, only in the binary format.
    Binary STL files are more bulky (50 bytes per triangle).
    OBJ: (Wavefront OBJ)
    Semi-common;
    More advanced, text based format.
    Loader needs a text parser, ...
    Text format makes models comparably bulky.
    AMF, 3MF:
    XML + ZIP
    Don't want to deal with this.
    XML parsing and Deflate decompression is expensive on a 50 MHz CPU.
    DAE: Just no.

    Theoretically, there is glTF, but glTF seems a bit overkill for my uses.


    In my case, BMD models also support collision detection. This mostly
    relies on the fact that, since the models are generated via CSG, all
    the meshes can be closed and non-self-intersecting.

    This makes it possible to detect collisions:
    Start from a vertex, and trace a line somewhere else (outside of the
    model being checked):
    If the number of crossed faces is even (including 0), the point is outside the model;
    If it is odd, point is inside the model.

    Doesn't really work if the model is self-intersecting or non-closed. Also
    can assume that a point is always non-colliding if it falls outside the
    bounding box.
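
    A rough sketch of the parity test in C (Moller-Trumbore for the
    ray/triangle part; purely illustrative, not the actual BMD code, and it
    ignores the degenerate cases where the ray grazes an edge or vertex):

        #include <math.h>

        typedef struct { double x, y, z; } vec3;

        static vec3   v3sub  (vec3 a, vec3 b)
            { vec3 r={a.x-b.x, a.y-b.y, a.z-b.z}; return r; }
        static vec3   v3cross(vec3 a, vec3 b)
            { vec3 r={a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x}; return r; }
        static double v3dot  (vec3 a, vec3 b)
            { return a.x*b.x + a.y*b.y + a.z*b.z; }

        /* Returns 1 if the ray (org, dir) crosses triangle (a,b,c) at t>0. */
        static int ray_tri(vec3 org, vec3 dir, vec3 a, vec3 b, vec3 c)
        {
            vec3 e1=v3sub(b,a), e2=v3sub(c,a), p=v3cross(dir,e2), tv, q;
            double det=v3dot(e1,p), inv, u, v, t;
            if(fabs(det)<1e-12) return 0;   /* ray parallel to triangle */
            inv=1.0/det;
            tv=v3sub(org,a);
            u=v3dot(tv,p)*inv;   if(u<0 || u>1)   return 0;
            q=v3cross(tv,e1);
            v=v3dot(dir,q)*inv;  if(v<0 || u+v>1) return 0;
            t=v3dot(e2,q)*inv;
            return t>0;
        }

        /* Point-in-mesh by crossing parity, for a closed non-self-intersecting
           mesh given as ntris triangles (3 vertices each). */
        static int point_in_mesh(vec3 pt, const vec3 *tri, int ntris)
        {
            vec3 dir={1.0, 0.0, 0.0};    /* trace towards +X, out of the model */
            int i, hits=0;
            for(i=0; i<ntris; i++)
                hits+=ray_tri(pt, dir, tri[i*3+0], tri[i*3+1], tri[i*3+2]);
            return hits&1;               /* odd => inside */
        }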

    Note that BSP trees could be used to optimize these sorts of checks, but
    a BSP adds complexity that is not usually worthwhile. Currently, BMD
    does not store BSP trees.



    Can note that in my BT3 engine, it mostly does collision detection and
    physics by using point probing (rather than box/plane checks).
    Ironically, this is more like old NES/SNES games than it is like Doom or Quake.

    But, for physics, it was simpler/easier to answer the question of "is
    this point inside anything solid?" for the corners of a bounding box,
    than it was to check for lots of AABB collisions or similar.

    Though, for AABB/mesh checks, line/box checks are also needed: if any of
    the corner points of the box fall inside the mesh, or if any polygon
    edges from the mesh intersect the AABB, assume a collision has occurred.

    If using CSG brushes, could use the "separating axis theorem" approach,
    but this requires keeping the CSG brushes around.

    ...



    There isn't currently a plan to add rigid-body physics.
    My first 3D engine had this, but:
    It is a pain to get this working reliably;
    90% of everything still ends up being done with AABBs;
    Often lacks an obvious use-case other than "objects can fall over realistically".

    Seemed compelling in Half-Life 2 and similar, but mostly amounted to not
    much more than a gimmick.

    At the time, was more effort than it was worth trying to get objects to
    form stable stacks. One of the harder problems of developing such a
    physics engine being to get rigid-body objects to stack without becoming unstable.

    Then ended up mostly not using the rigid-body physics stuff for much of anything anyways.

    In theory, could just use (or could have used) an existing physics engine.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Dec 17 01:44:48 2025
    From Newsgroup: comp.arch

    According to BGB <cr88192@gmail.com>:
    On 12/15/2025 1:37 PM, John Dallman wrote:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    I think it's always been iffy.

    S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
    8 byte boundaries. Normally the Fortran compiler aligned data correctly but you could use an EQUIVALENCE statement to create misaligned doubles, something that occasionally happened in code ported from 709x where it wasn't a problem. There was a library trap routine to fix it but it was stupendously slow, so IBM added the "byte oriented operand" feature to the 360/85 and later models.

    The high end 360/91 did not have most of the decimal instructions, on the reasonable theory that few people would run the kind of programs that used them on a high end scientific machine. The operating system simulated them slowly, so
    the programs did still work. But the subsequent 360/195 had the full instruction
    set.

    On the other hand, back when PCs were young, Intel CPUs had a separate optional fairly expensive floating point coprocessor. Most PCs didn't have them, and programs that did arithmetic all had software floating point libraries. If you didn't have the FPU, instructions didn't trap, they just did nothing. The C compiler we used had a clever hack that emitted floating point instructions with the first byte changed to a trap instruction. If the computer had hardware floating point, the trap handler patched the instruction to the
    real floating point one and returned to it, otherwise used the software floating point. This got pretty good performance either way. It helped
    that the PC had no hardware protection so the C library could just install
    its own trap handler that went directly to the float stuff.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 00:51:17 2025
    From Newsgroup: comp.arch

    On 12/16/2025 7:44 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    On 12/15/2025 1:37 PM, John Dallman wrote:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    I think it's always been iffy.

    S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
    8 byte boundaries. Normally the Fortran compiler aligned data correctly but you
    could use an EQUIVALENCE statement to create misaligned doubles, something that
    occasionally happened in code ported from 709x where it wasn't a problem. There
    was a library trap routine to fix it but it was stupendously slow, so IBM added
    the "byte oriented operand" feature to the 360/85 and later models.

    The high end 360/91 did not have most of the decimal instructions, on the reasonable theory that few people would run the kind of programs that used them
    on a high end scientific machine. The operating system simulated them slowly, so
    the programs did still work. But the subsequent 360/195 had the full instruction
    set.


    Running some stats for misaligned access in my emulator (percentage of
    total memory accesses, organized by type).


    Initial bootup:
    Word : 7.67% DWord: 0.13% QWord: 9.53%

    Doom Startup:
    Word : 0.64% DWord: 0.05% QWord: 3.48%
    Demo Loop:
    Word : 0.03% DWord: 0.03% QWord: 2.27%

    SW Quake Startup:
    Word : 0.63% DWord: 0.06% QWord: 6.70%
    SW Quake first Demo:
    Word : 0.03% DWord: 0.07% QWord: 3.18%

    GLQuake Startup:
    Word : 0.83% DWord: 0.03% QWord: 4.87%
    GLQuake first Demo:
    Word : 0.23% DWord: 0.02% QWord: 2.23%


    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...


    Seems that misaligned QWord is the most common, but then again, QWord is
    used for things like "memcpy()" and similar (and a lot of my runtime
    code was written to assume fast misaligned memory access, ...).

    You could arguably implement "memcpy()" as a series of byte loads and
    stores, but then "memcpy()" is 8x slower.
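
    E.g., the usual fast path, assuming misaligned 64-bit loads/stores are
    cheap (a sketch; strict-aliasing and portability caveats apply):

        #include <stdint.h>
        #include <stddef.h>

        void *memcpy_q(void *dstv, const void *srcv, size_t n)
        {
            unsigned char *d=dstv;
            const unsigned char *s=srcv;
            while(n>=8)   /* QWord copy, alignment ignored */
                { *(uint64_t *)d=*(const uint64_t *)s; d+=8; s+=8; n-=8; }
            while(n--)    /* byte tail */
                *d++=*s++;
            return dstv;
        }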



    So, in this case, MOV.Q is 4% of CPU time, and say 5% of these are
    misaligned.

    So, 0.2% of the CPU cycles are misaligned QWord accesses. What if these
    took 500x longer? That adds roughly 100%; or, effectively, doing this
    would halve performance.

    So, "rare enough":
    Misaligned falls well short of being "rare enough"...


    I am not thinking of things like misaligned here, more things like "long double". As-is, "long double" is basically unused in the hot path.



    On the other hand, back when PCs were young, Intel CPUs had a separate optional
    fairly expensive floating point coprocessor. Most PCs didn't have them, and programs that did arithmetic all had software floating point libraries. If you
    didn't have the FPU, instructions didn't trap, they just did nothing. The C compiler we used had a clever hack that emitted floating point instructions with the first byte changed to a trap instruction. If the computer had hardware floating point, the trap handler patched the instruction to the
    real floating point one and returned to it, otherwise used the software floating point. This got pretty good performance either way. It helped
    that the PC had no hardware protection so the C library could just install its own trap handler that went directly to the float stuff.


    Yeah, early PCs and the math coprocessor are more analogous to the "long
    double" situation.

    You already know it is slow, but sometimes need it. But, can't use it
    often, because it is slow.

    Early PCs used fixed-point. In my case, one still has "float" and
    "double", but these still aren't super fast either (so, using
    fixed-point can still be faster in many cases).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Wed Dec 17 07:11:04 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 01:27:31 2025
    From Newsgroup: comp.arch

    On 12/16/2025 3:02 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Getting back to a part I somehow missed responding to...


    -----------------------------------------------
    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x
    multiplier out of the effective size of the TLB (and without some of the
    drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B
    PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.

    If your HW tablewalker made 1 probe into a large (16MB) SW controlled
    buffer of PTEs, and if it found the PTE great, it gets installed and
    life goes on, otherwise, SW updates the Table and TLB and returns as
    soon as possible.


    Likely the LLPTLB would be keyed on (47:25) or so, then holding (47:14)
    for the PPN. Would likely have a similar structure to the normal TLBE,
    just ignoring the low part of the virtual address.


    But, yeah, in the case of an LLPTLB hit (and PTE is marked Valid, etc),
    it would bypass the usual TLB Miss interrupt, and instead refill the TLB directly from the last level of the page table.

    Could potentially result in a significant reduction in TLB miss
    frequency. Though, as-is, at least in my test programs, TLB miss rate
    tends to be somewhat lower than the Timer IRQ rate, so... A stronger case
    could almost be made for not hitting the timer interrupt at 1 kHz...

    Where:
    1024 Hz: Traditional value for the RTC interrupt, seemed sane.
    100 Hz: I think NT4 used this or something.
    18 Hz: MS-DOS and similar.
    32768 Hz: MSP430 can use this,
    but this would eat the CPU in my case...

    As-is, I seem to be seeing values of around 60 to 200 Hz for TLB Miss interrupts (with the existing 256x4 configuration).

    Though this is primarily with a single-address-space system.


    This gives you a L2-TLB with 1M entries, enough to map 8GB and a
    table walking state machine with 3 states.


    The idea here is to have a LLPTLB which would still use a TLB Miss event
    in the event that both TLB and LLPTLB miss, but would fetch the PTE from memory in the case of an LLPTLB hit.

    The TLB Miss handler would then also load an entry (describing the
    last-level of the page-table) into the LLPTLB (along with the normal TLBE).


    My estimate is that this would greatly increase working set size, but
    LLPTLB conflict misses could still be an issue (would need an
    associative LLPTLB to reduce conflict misses on context switch; but
    LLPTLB misses are likely to be more dominated by conflict misses,
    particularly on context switches).


    While not as transparent as a full page walker, it would preserve the flexibility to construct non-standard or ad-hoc virtual address spaces
    (which would be lost with a more traditional hardware page-walker).
    Like, say, I don't need the MMU to be aware of nested page-tables,
    because this can be done in software.

    Also, avoids the drawback of Inverted Page Tables, in that IPT's would
    still effectively need TLB miss events to drip-feed TLBEs into the IPT;
    while also being more memory efficient.

    Does have the drawback of the MMU needing to be able to access memory,
    but I can use a simpler mechanism in that it only needs to be able to
    access a single cache line (unlike either a full page walk or IPT, which
    would need multiple memory accesses).


    Though, the MMU may need some way to temporarily "hold" missed requests
    during a TLB Miss to determine whether or not it can resolve them (since
    it would no longer know the answer within a fixed latency).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Dec 17 07:59:10 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4]. See https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReduction.pdf
    for a random reference grabbed off the net how to do it.

    [...]

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    What is your accuracy?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 02:27:33 2025
    From Newsgroup: comp.arch

    On 12/17/2025 1:11 AM, Lawrence D’Oliveiro wrote:
    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not supported
    natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    Yes.

    Or, like the "SiFive U74" and similar, where the funny thing of the
    RISC-V ISA using unscaled displacements but then having a CPU that uses internal traps (and is horribly slow) in the case of misaligned access...

    Meanwhile, I prefer to have memcpy and LZ decompression where
    "performance doesn't suck".

    Also useful for things like Huffman and Rice decoding, etc. Say, for
    Huffman decoding, if one needs to use branches to detect when to pull in
    more bytes, this eats more clock-cycles than advancing the bit-stream
    position implicitly via arithmetic tricks.

    Well, and this is also an example of why to use LSB-first bit ordering, and
    not to use FF escape encodings and similar:
    MSB-first, FF escapes, the 16-bit length limit, etc., manage to make
    JPEG bit-stream handling a lot slower than it could have been.

    Whereas, say, LSB-first and imposing a 12-bit length limit allows some
    speedup here.

    Though, the Rice coder in UPIC effectively uses an 8-bit lookup, but
    this is because it uses 3 bits for the Rk factor. So, sadly, it needs a fallback path to decode symbols that exceed 8 bits.

    So, pseudo-code (for AdRice Decoding):
    win=*(u32 *)cs;
    b=win>>pos;
    ix=(rk<<8)|(b&255);
    v=ricefasttab[ix]; //constant lookup table for Rice-code state space
    l=(v>>8)&15;
    if(l<=8)
    {
        //faster path
        pos+=l;
        cs+=pos>>3;
        pos&=7;
        rk=(v>>12);
        return(v&255);
    }
    // ... slower path ...
    q=riceqtab[b&255]; //count bits for Q prefix.
    if(q==8)
    {
        //escape case, Q==8 escapes a raw max-length symbol
        l=16;
        v=(b>>8)&255;
        rk+=2;
        if(rk>7)rk=7;
    }else
    {
        l=q+rk+1;
        v=((b>>(q+1))&((1<<rk)-1))|(q<<rk);
        if((q==0) && (rk>0)) rk--;
        if((q>=2) && (rk<7)) rk++;
    }
    pos+=l;
    cs+=pos>>3;
    pos&=7;
    return(v);

    Which may not seem very fast, but could be a lot worse.

    In this case (for L1 cache reasons) the slightly more complicated
    approach here works out faster on average than using a single giant
    lookup table.



    So, my CPU supports misaligned access natively.


    Can make sense to skip it for microcontroller-class cores though, since
    in this case "cheaper L1 cache" is likely to be a higher priority.

    Doesn't make sense for things bigger than a microcontroller though, as allowing for misaligned memory accesses is too useful IMO.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 03:03:04 2025
    From Newsgroup: comp.arch

    On 12/17/2025 1:59 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4]. See https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReduction.pdf
    for a random reference grabbed off the net how to do it.


    Possible, I didn't evaluate this.

    I just sorta reduced it to [0, 2*PI] noting that if values got too far
    outside of this range, accuracy started dropping off, so limiting to
    this range kept the accuracy good.

    Though, it flips to [-2*PI, 0] if ang<0, since conversion to 'int'
    truncates towards zero and not towards negative infinity. Not entirely
    sure why I used 'int' and not 'long', possible oversight here...


    [...]

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    What is your accuracy?


    IIRC, I used 20 stages because this seemed to fully converge the double
    to about as accurate as it was going to get in my initial testing
    (without adding too many extra stages).

    IIRC, basic algo was based on the Taylor Series expansion off of
    Wikipedia or similar (I didn't show the magic constants).


    It was also still a vast improvement over the code that the C library originally came with (IIRC, it used a "for()" loop and calculated the
    factorials and did the division inline; this was far slower than using precomputed constants).
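
    Roughly the sort of naive loop being described (an illustrative
    reconstruction, not the original library code): each term recomputes the
    factorial and divides inline, rather than multiplying by precomputed
    reciprocal-factorial constants.

    /* Taylor series for sin(), computing (2k+1)! and dividing per term. */
    double sin_naive(double th)
    {
        double t = 0, p = th, fact = 1;
        int k;
        for (k = 0; k < 21; k++)
        {
            t += ((k & 1) ? -1.0 : 1.0) * p / fact;   /* +/- th^(2k+1)/(2k+1)! */
            p *= th * th;
            fact *= (2*k + 2) * (2*k + 3);            /* (2k+1)! -> (2k+3)! */
        }
        return t;
    }

    The per-term divide is a big part of why this is far slower than
    multiplying by precomputed constants.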


    For "long double", it is both slower to calculate and also requires more stages to converge the result.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 17 11:40:50 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after
    reduction to [-pi/4,+pi/4] is way too high. It seems you would need a
    Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    There are better methods, like reducing to a much smaller interval, e.g.
    to [-1/64,+1/64], maybe even to [-1/128,+1/128]. The details of the trade-off
    between the size of the reduction table and the length of the polynomial
    depend on how often you plan to use your sin() function.

    See
    https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReduction.pdf
    for a random reference grabbed off the net how to do it.

    [...]

    Time cost: roughly 450 clock cycles (penalties are steep on this
    one).

    What is your accuracy?

    Before asking that, it's worth asking about the expected range of the
    argument.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Dec 17 10:40:16 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Running some stats for misaligned access in my emulator (percentage of
    total memory accesses, organized by type).


    Initial bootup:
    Word : 7.67% DWord: 0.13% QWord: 9.53%

    As the system is under your total control, you should be able
    to find out where this comes from. Does the compiler not place
    words correctly, how do you align your structures, do you use
    misaligned large words for memcpy, ... ?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Dec 17 15:42:58 2025
    From Newsgroup: comp.arch

    According to BGB <cr88192@gmail.com>:
    On 12/16/2025 7:44 PM, John Levine wrote:
    S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
    8 byte boundaries. Normally the Fortran compiler aligned data correctly but you
    could use an EQUIVALENCE statement to create misaligned doubles, something that
    occasionally happened in code ported from 709x where it wasn't a problem. There
    was a library trap routine to fix it but it was stupendously slow, so IBM added
    the "byte oriented operand" feature to the 360/85 and later models. ...

    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...

    I suspect it is not a coincidence that the 360/85 introduced both a cache and misaligned accesses. Since the cache lines are bigger than words, I would think a cache would greatly decrease the cost of the extra memory references.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 17 16:18:24 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    I found the lessons not very enlightening, and also too abstract.
    Maybe he presented concrete cases to support the lessons in the audio
    track that accompanied the slides, but the slides were too general for
    my taste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 17 16:21:36 2025
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    I disagree. Trap-and-emulate may be too slow for a feature that you
    want programmers to use in hot paths on current CPUs, but there are
    other cases. In particular, for a feature that cannot be implemented
    properly yet, if you provide it as a trap-and-emulated instruction in
    the current generation, and a faster implementation of the instruction
    in a future implementation of the architecture, programmers will be
    much less reluctant to use that instruction when its implementation is
    fast on the dominant implementation of the day than if it just
    produces a SIGILL or somesuch on older chips (or new, but
    feature-reduced chips). Intel's marketing does not understand that,
    that's why they are selling feature-reduced versions of chips that
    have AVX and AVX-512 in hardware.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 12:49:34 2025
    From Newsgroup: comp.arch

    On 12/17/2025 4:40 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    Running some stats for misaligned access in my emulator (percentage of
    total memory accesses, organized by type).


    Initial bootup:
    Word : 7.67% DWord: 0.13% QWord: 9.53%

    As the system is under your total control, you should be able
    to find out where this comes from. Does the compiler not place
    words correctly, how do you align your structures, do you use
    misaligned large words for memcpy, ... ?


    The most likely source of this particular pattern is LZ4 decompression.
    Binaries and a lot of other data use LZ4.
    Compression is mostly used here to reduce IO to the SD card.

    RP2 decompression also uses misaligned access, but it is primarily QWORD based. The same goes for the scheme used for pagefile compression (which
    uses natively aligned 16-bit words but misaligned QWords).


    My C compiler follows usual C ABI alignment rules, but both memcpy and
    LZ decompression are written to assume misaligned access, because this
    is faster than using byte loads/stores on my core.


    Some paths for memcpy may use MOV.X (Pair), but this instruction is only
    valid with an 8 byte alignment (and gives best performance with a
    16-byte alignment). So, generic/small cases use QWord.
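
    As a sketch of the sort of inner loop this enables (illustrative only, in
    the spirit of the pseudo-code earlier in the thread; it assumes the core
    handles misaligned access natively and, like many LZ inner loops, that
    over-copying up to 7 bytes past 'n' is acceptable):

    #include <stddef.h>

    static void copy_qwords(unsigned char *dst, const unsigned char *src, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i += 8)
        {
            /* QWord copy regardless of the alignment of dst/src */
            *(unsigned long long *)(dst + i) =
                *(const unsigned long long *)(src + i);
        }
    }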

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 13:51:31 2025
    From Newsgroup: comp.arch

    On 12/17/2025 10:21 AM, Anton Ertl wrote:
    jgd@cix.co.uk (John Dallman) writes:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    I disagree. Trap-and-emulate may be too slow for a feature that you
    want programmers to use in hot paths on current CPUs, but there are
    other cases. In particular, for a feature that cannot be implemented properly yet, if you provide it as a trap-and-emulated instruction in
    the current generation, and a faster implementation of the instruction
    in a future implementation of the architecture, programmers will be
    much less reluctant to use that instruction when its implementation is
    fast on the dominant implementation of the day than if it just
    produces a SIGILL or somesuch on older chips (or new, but
    feature-reduced chips). Intel's marketing does not understand that,
    that's why they are selling feature-reduced versions of chips that
    have AVX and AVX-512 in hardware.


    Pretty much.

    This is why I am promoting it for cold path / infrequent cases.

    But, for pretty much anything that happens more often than around 0.001%
    of instructions, in most cases you don't want trapping.

    If it is rarer than that (0.001%, or 0.0001%), then trapping may make sense.


    As a possible way to support obscure legacy features, or optional or
    "not widely supported in hardware" features, it makes sense.

    In these cases, "still runs, but slowly" being preferable to "just
    crashes due to an unsupported instruction" or similar.


    For something like AVX-512, or RISC-V's V extension, it could likely
    make more sense to go to a "trap and full emulation" path. In this case, rather than always faulting on an instruction and immediately returning control to the application, it could instead handle blocks of
    instructions with a tracing JIT and then only return to normal flow of
    control once "the coast is clear" (behaving as-if the JIT'ed
    instructions had been run natively on the CPU in question).

    Or, one possible strategy here (for normal applications) being to "hot
    patch" the application to effectively encode branches into the JIT'ted
    code sequences over the top of the offending instructions.

    So, for example, quietly turning any offending RV V instructions into
    JAL's or similar (with the JIT trace then jumping back into the normal instruction sequence afterwards as-if nothing had happened).

    This could be better for performance, but less transparent (because the application could potentially see where its code had been hot-patched).
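
    As a rough sketch of the patch step itself (hypothetical helper names; it
    assumes the stub lands within the +/-1MB reach of a RISC-V JAL and that
    the code page is writable through whatever aliasing trick is in use):

    #include <stdint.h>

    /* Encode a RISC-V JAL rd, offset (J-type immediate packing). */
    static uint32_t encode_jal(unsigned rd, int32_t off)
    {
        uint32_t u = (uint32_t)off;
        uint32_t imm20    = (u >> 20) & 0x1;
        uint32_t imm10_1  = (u >> 1)  & 0x3FF;
        uint32_t imm11    = (u >> 11) & 0x1;
        uint32_t imm19_12 = (u >> 12) & 0xFF;
        return (imm20 << 31) | (imm10_1 << 21) | (imm11 << 20) |
               (imm19_12 << 12) | ((uint32_t)rd << 7) | 0x6F;
    }

    /* Replace the offending instruction at 'pc' with a jump into 'stub'. */
    static void hot_patch(uint32_t *pc, void *stub)
    {
        int32_t off = (int32_t)((char *)stub - (char *)pc);
        *pc = encode_jal(0, off);   /* JAL X0, stub  (plain jump, no link) */
        /* then a FENCE.I (or equivalent) so the I$ sees the new code */
    }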


    For something like a VM, it would make more sense to not hot-patch, and
    slower than hot-patching, but would be more transparent as it would not
    modify the original memory.

    Well, or possibly use some sort of hardware double-mapping where the
    pages seen by the D$ and I$ differ, so it "looks" as if the VM is still running the original code, but then the CPU's I$ can see a different set
    of hot-patched pages.

    Well, in my ISA, this could be done in theory by using 96-bit mode and
    putting GBH and PCH in two separate address spaces that look basically
    the same, differing solely in that PCH may point to an address space
    with hot-patched pages.

    Another possibility could be to hack it in the MMU by having separate
    DTLB and ITLB entries:
    Normal TLB Entry: Matches on either D$ or I$
    DTLB: Only matches for D$
    ITLB: Only matches for I$

    Then the TLB Miss handler can special-case the handling of hot-patched
    pages (might consider this if I get around to implementing the recent
    LLPTLB idea; both would have the "feature" of effectively creating
    multiple sub-types of TLB entries).
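
    The match rule being described could be sketched roughly like this
    (hypothetical structure and field names, just to make the three entry
    sub-types concrete):

    enum { TLBE_BOTH, TLBE_DATA_ONLY, TLBE_FETCH_ONLY };

    typedef struct {
        unsigned long long vpn;   /* virtual page number (plus ASID, etc.) */
        unsigned long long pte;
        int kind;                 /* one of the three sub-types above */
    } TlbEntry;

    static int tlbe_matches(const TlbEntry *e, unsigned long long vpn, int is_ifetch)
    {
        if (e->vpn != vpn)
            return 0;
        if (e->kind == TLBE_DATA_ONLY && is_ifetch)
            return 0;                        /* DTLB entry: no hit for I$ */
        if (e->kind == TLBE_FETCH_ONLY && !is_ifetch)
            return 0;                        /* ITLB entry: no hit for D$ */
        return 1;
    }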


    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 19:53:23 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/16/2025 3:02 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd:
    <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively. I make a rounding error about once every 378 calculations. Running a SW routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100 cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.



    In my case, handling of "sin()" looks sorta like:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;
    t =th*sintab_c00; x=th*th2;
    t+=x*sintab_c01; x*=th2;
    t+=x*sintab_c02; x*=th2;
    t+=x*sintab_c03; x*=th2;
    t+=x*sintab_c04; x*=th2;
    t+=x*sintab_c05; x*=th2;
    t+=x*sintab_c06; x*=th2;
    t+=x*sintab_c07; x*=th2;
    t+=x*sintab_c08; x*=th2;
    t+=x*sintab_c09; x*=th2;
    t+=x*sintab_c10; x*=th2;
    t+=x*sintab_c11; x*=th2;
    t+=x*sintab_c12; x*=th2;
    t+=x*sintab_c13; x*=th2;
    t+=x*sintab_c14; x*=th2;
    t+=x*sintab_c15; x*=th2;
    t+=x*sintab_c16; x*=th2;
    t+=x*sintab_c17; x*=th2;
    t+=x*sintab_c18; x*=th2;
    t+=x*sintab_c19; x*=th2;
    t+=x*sintab_c20; x*=th2;
    return(t);
    }

    Wow, so big, so slow, it is no wonder it is not used "often":
    42 multiplies. Whereas my 19-cycle version includes Payne &
    Hanek argument reduction (3 cycles) and needs only 7 multiplies,
    using Chebyshev coefficients.

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    This would be roughly 3x slower if handled with a trap and a special instruction.

    Does it make sense to add an FSIN instruction? No, as FSIN is nowhere
    near common enough to make much impact on code size (and, FSIN will be invariably slower).

    Obviously, doing it your way is ineffective, but when SIN() costs no
    more than FDIV, it should be included.

    How much space does it take? Around 460 bytes.

    4-bytes.

    This is assuming DAZ/FTZ, but denormals shouldn't happen unless ang is pretty close to an integer multiple of M_TAU (2*M_PI).

    I include denorms, no flush to zero, and use 128 bits of 2/pi for argument reduction, giving more than 64 bits of accuracy in the reduced argument.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 19:55:38 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 20:06:57 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after
    reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    To be clear, my coefficients are not restricted to 53-bits like a SW implementation.

    There are better methods. Like reducing to much smaller interval, e.g.
    to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details of trade
    off between size of reduction table and length of polynomial depend on
    how often do you plan to use your sin() function.

    With every shrink of the argument range, the table size blows up
    exponentially. For my transcendentals, the combined table sizes
    are about the same as the table sizes for FDIV/FSQRT when using
    Goldschmidt iteration using 11-bit in, 9 bit out tables.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 17 23:18:57 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 20:06:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    To be clear, my coefficients are not restricted to 53-bits like a SW implementation.

    There exist tricks that can achieve the same in software, sometimes at
    the cost of one additional FP op and sometimes even for free. The latter
    is especially common when FMA costs the same as FMUL.


    There are better methods. Like reducing to much smaller interval,
    e.g. to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details
    of trade off between size of reduction table and length of
    polynomial depend on how often do you plan to use your sin()
    function.

    With every shrink of the argument range, the table size blows up exponentially. For my transcendentals, the combined table sizes
    are about the same as the table sizes for FDIV/FSQRT when using
    Goldschmidt iteration using 11-bit in, 9 bit out tables.

    Software trade-offs are different.
    Assuming that the argument is already in the [0:+pi/2] range, reduction down
    to [-1/128:+1/128] requires pi/2*64 ≈ 100 table entries. Each entry
    occupies, depending on the format used, 18 to 24 bytes, so 1800 to 2400
    bytes total. That size fits very comfortably in the L1D cache, so from the
    perspective of improving hit rate there is no incentive to use a
    smaller table.

    At first glance, reduction to [-1/128:+1/128] appears
    especially attractive for an implementation that does not aim for very
    high precision. Something like 0.65 ULP looks attainable with a poly of the
    form (A*x**4 + B*x**2 + C)*x. That's just my back-of-the-envelope estimate
    in the late evening hours, so I can be wrong about it. But I also can be
    right.
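
    A minimal software sketch of that general shape of scheme (illustrative
    only: the residual polynomials below are just leading Taylor terms rather
    than tuned minimax/Chebyshev coefficients, so it will not reach the ULP
    figures being discussed; it also assumes the argument has already been
    reduced to [0, pi/2]):

    #include <math.h>

    static double sin_tab[102], cos_tab[102];   /* sin/cos at multiples of 1/64 */

    static void init_tabs(void)
    {
        int k;
        for (k = 0; k <= 101; k++)
        {
            sin_tab[k] = sin(k / 64.0);         /* built once, here with libm */
            cos_tab[k] = cos(k / 64.0);
        }
    }

    /* sin(x) for x in [0, pi/2]: table point k/64 plus a tiny residual r. */
    double sin_small(double x)
    {
        int k = (int)(x * 64.0 + 0.5);          /* nearest table point */
        double r = x - k * (1.0 / 64.0);        /* |r| <= ~1/128 */
        double r2 = r * r;
        double sr = r * (1.0 - r2 * (1.0 / 6.0));    /* ~sin(r) */
        double cr = 1.0 - r2 * 0.5;                  /* ~cos(r) */
        return sin_tab[k] * cr + cos_tab[k] * sr;    /* angle addition */
    }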

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 21:36:28 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 20:06:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    To be clear, my coefficients are not restricted to 53-bits like a SW implementation.

    There exists tricks that can achieve the same in software, sometimes at
    cost of one additional FP op and sometimes even for free. The latter
    esp. common when FMA costs the same as FMUL.


    There are better methods. Like reducing to much smaller interval,
    e.g. to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details
    of trade off between size of reduction table and length of
    polynomial depend on how often do you plan to use your sin()
    function.

    With every shrink of the argument range, the table size blows up exponentially. For my transcendentals, the combined table sizes
    are about the same as the table sizes for FDIV/FSQRT when using
    Goldschmidt iteration using 11-bit in, 9 bit out tables.

    Software trade offs are different.

    Indeed. To SW, memory is "about" free, whereas to HW, ROM is NOT free.

    Assuming that argument is already in [0:-pi/2] range, reduction down
    to [-1/128:+1/128] requires pi/2*64=100 table entries.

    The IEEE 754-2019 specs indicate the argument range to SIN(x) has
    x in the range {-infinity..+infinity}. So, you need to count the cycles
    it takes you to go from {-I..+I} to {-pi/4..+pi/4} or {-½..+½}, and
    then the top several bits of the fraction index the coefficient
    tables.

    FP32 can use 128-entry tables and a Quadratic (C0+C1*x+C2*x^2) or
    can use 16-entry tables and a Cubic (C0+C1*x+C2*x^2+C3*x^3)

    FP64 can use a 128-entry table and a Quartic (C0+C1*x+C2*x^2+C3*x^3+C4*x^4)

    This one table for SIN is larger than the combined table sizes for
    {SIN, COS, TAN, ATAN, ASIN, ACOS, Ln, Ln2, Log, LnP1, Ln2P1, LOGP1,
    exp, exp2, 10**, expM1, exp2M1, 10**M1, pow, ATAN2}

    Although SIN, being odd, can use x**k :: k odd
    COS, being even, x**k :: k even

    Each entry
    occupies, depending on used format, 18 to 24 bytes. So, 1800 to 2400
    bytes total. That size fits very comfortably in L1D cache, so from perspectives of improvement of hit rate there is no incentive to use
    smaller table.

    In HW there is no LD instruction needed to "get" the coefficients;
    there is just the plethora of FMACs (and equivalents).

    At the first glance, reduction to [-1/128:+1/128] appears
    especially attractive for implementation that does not look for very
    high precision. Something like 0.65 ULP looks attainable with poly in
    form (A*x**4 + B*x**2 + C)*x. That's just my back of envelop estimate
    in late evening hours, so I can be wrong about it. But I also can be
    right.

    I have a spreadsheet to do all of this for me--including determining the Chebyshev coefficients for orders {1,2,3,4,5,6,7,8,9} and graphing the
    number of bits of precision.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Thu Dec 18 00:43:35 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not
    supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.

    No doubt another lesson learned from instruction traces: misaligned
    accesses occurred so rarely, it made sense to simplify the hardware by
    leaving them out.

    The same conclusion was drawn about integer multiplication and
    division in those early days, wasn't it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Dec 18 01:34:05 2025
    From Newsgroup: comp.arch

    It appears that John Dallman <jgd@cix.co.uk> said:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    For at least 20 years IBM's mainframes have used what they call millicode. The relatively simple instructions are implemented in hardware, and everything else in millicode, including the more complicated instructions, I/O, and other system features. Millicode runs on the same CPU using the same instruction set as regular code, with some extra registers and instructions to handle aspects of the hardware not visible to regular programs. It is stored in dedicated memory which is loaded at boot time so it's easy to update.

    I gather there have been instructions that were implemented in millicode, then moved into hardware in the next CPU generation since they were used enough for the speed to matter.

    Here's a 2012 slide deck:

    https://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_metro_naspa_2012_10_what_and_why_of_system_z_millicode.pdf
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From moi@findlaybill@blueyonder.co.uk to comp.arch on Thu Dec 18 03:44:51 2025
    From Newsgroup: comp.arch

    On 18/12/2025 01:34, John Levine wrote:
    It appears that John Dallman <jgd@cix.co.uk> said:
    ] * "Trap and Emulateďż˝ is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    For at least 20 years IBM's mainframes have used what they call millicode. The
    relatively simple instructions are implemented in hardware, and everything else
    in millicode, including the more complicated instructions, I/O, and other system features. Millicode runs on the same CPU using the same instruction set
    as regular code with some extra registers and instructions to handle aspects of
    the hardware not visible to regular programs. It is stored in dedicated memory
    which is loaded at boot time so it's easy to update.

    I gather there have been instructions that were implemented in millicode, then
    moved into hardware in the next CPU generation since they were used enough for
    the speed to matter.

    Here's a 2012 slide deck:

    https://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_metro_naspa_2012_10_what_and_why_of_system_z_millicode.pdf


    Typical IBM boosterism of a minor variant of an existing technique,
    as used (for range compatibility, etc) by the ICT 1900 Series in 1965
    and before that (for h/w economy) by the Ferranti Atlas and Orion.
    --
    Bill F.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Dec 18 14:41:31 2025
    From Newsgroup: comp.arch

    On 12/17/2025 6:43 PM, Lawrence D'Oliveiro wrote:
    On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not
    supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.

    No doubt another lesson learned from instruction traces: misaligned
    accesses occurred so rarely, it made sense to simplify the hardware by leaving them out.


    They occur rarely, or not at all, if avoided.

    They seem to occur (on average) for 1-5% of the loads/stores if the code
    makes use of them in cases where doing so would be beneficial.

    Well, this is because the naive/portable approaches can often be unacceptably slow.

    Granted, dealing with misaligned access does add cost and complexity to
    the L1 D$ (particularly due to the case of dealing with misaligned
    values crossing a cache-line boundary).


    The same conclusion was drawn about integer multiplication and
    division in those early days, wasn't it.


    Some timings from my case:
    32 bit multiply (3 cycle latency):
    0.70% of clock cycles;
    MOD and DIV (36 and 68 cycles):
    0.36%

    Kinda need multiply, but divide is a bit more optional, and doesn't need
    to be all that fast.

    However, if DIV and MOD were implemented using traps, they are still
    common enough to where there would be reason to care (this would still
    have an obvious performance impact).

    In the absence of hardware DIV/MOD, the better option is mostly to
    handle it with runtime calls.
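
    For example, a runtime helper of roughly this sort (a minimal
    restoring-divide sketch with a made-up name; a real runtime would
    typically special-case small divisors and powers of two):

    #include <stdint.h>

    uint64_t soft_udiv64(uint64_t n, uint64_t d)
    {
        uint64_t q = 0, r = 0;
        int i;
        if (d == 0)
            return ~(uint64_t)0;        /* divide-by-zero policy is ABI-specific */
        for (i = 63; i >= 0; i--)
        {
            r = (r << 1) | ((n >> i) & 1);
            if (r >= d)
            {
                r -= d;
                q |= (uint64_t)1 << i;
            }
        }
        return q;                       /* the remainder is left in r */
    }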



    Otherwise, went and added some special case logic to allow for
    transparent hot-patching via TLB trickery. Was actually simpler/cheaper
    than it seemed at first.

    Does have some restrictions though, namely in that it will only work
    with read-only pages.

    Ended up adding a special case with the Dirty flag:
    D+NR+NW: X only, no hit for D$
    D+NX+NW: R only, no hit for I$
    But:
    D+NX: Normal Read/Write Memory
    D (alone): Normal R/W/X Memory

    The D / Dirty flag is used for PTEs, but is not changed by HW. In effect
    its main use would be more for write barriers (setting up and handling a
    trap the first time the page is written to).



    There is also the issue that one can't encode a branch to an arbitrary address in
    32 bits. In effect, hot-patching in this way would need somewhere to
    branch to that could be put within the window of what is reachable.

    For plain RISC-V, there is another issue:
    There is no way to do longer-distance branches that won't stomp a register.

    Jumbo prefixes and XG3 at least allow other options:
    XG3's branches have a 32MB range, so more likely to be able to reach
    something as most binaries are not that large (but, N/E in RV64GC mode);
    Jumbo Prefixes: Can encode a +/- 4GB branch in 64 bits (but then needs
    to patch at least 2 instructions).

    Another issue being that any such logic needs to be able to operate with
    zero free registers, so at least in this sense isn't much better off
    than an interrupt handler. But, main difference being that any hot
    patching doesn't need to decode an instruction and can be a special
    sequence representing the instructions that originally generated the
    trap (rather than a general purpose handler).

    In the relevant ABIs, could assume that memory below SP is always safe
    to use though (to save/restore any working registers).



    In premise, one could put the hot-patching area before the loaded
    binary, but generally this would only be usable (in RV64GC or similar)
    if ".text" is somewhat less than 1MB.

    Likely would make sense to handle it as, say:
    SD X1, -8(SP)
    SD X5, -16(SP)
    LUI X5, AddrHi
    JALR X1, DispLo(X5)
    LD X5, -16(SP)
    LD X1, -8(SP)
    JAL X0, RetAddr

    With another area (somewhere in the low 2GB) handling the actual traces (trying to keep the area just before ".text" mostly limited to
    trampoline handlers, except for extra short sequences).

    Or, AUIPC if the handler is placed +/- 2GB from this table.


    Or, loading a 64-bit address from memory and then possibly running the
    handler code in XG3 mode (would have access to 128-bit arithmetic and
    some other things lacking in RV mode).

    Likely would make sense to handle it as, say:
    SD X1, -8(SP)
    SD X5, -16(SP)
    AUIPC X5, AddrHi
    LD X5, DispLo(X5) //load address of entry point, PC-rel (*1)
    JALR X1, Disp2(X5)
    LD X5, -16(SP)
    LD X1, -8(SP)
    JAL X0, RetAddr

    *1: Also no way in RV64GC to directly include a PC-rel load in a single instruction, so need an AUIPC to do so. In this case, need to jump
    through a full 64-bit pointer to be able to perform the mode switch (AUIPC+JALR would merely branch within RV64GC mode).

    Though, in some cases could make sense to keep the handlers in RV64G
    mode, in which case no mode-change is needed.

    ...


    The initial setup for these cases would likely be the same as that for ("normal") trap and emulate, just with the option of replacing some instructions with alternative handlers that are not quite as inefficient...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 18 21:17:28 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not
    supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.

    No doubt another lesson learned from instruction traces: misaligned
    accesses occurred so rarely, it made sense to simplify the hardware by leaving them out.

    The same conclusion was drawn about integer multiplication and
    division in those early days, wasn't it.

    My RISC 1st gen processor had 3-cycle integer multiply, and 35-cycle
    division.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Dec 19 03:25:27 2025
    From Newsgroup: comp.arch

    It appears that moi <findlaybill@blueyonder.co.uk> said:
    On 18/12/2025 01:34, John Levine wrote:
    It appears that John Dallman <jgd@cix.co.uk> said:
    ] * "Trap and Emulateďż˝ is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    For at least 20 years IBM's mainframes have used what they call millicode. The
    relatively simple instructions are implemented in hardware, and everything else
    in millicode, ...

    Typical IBM boosterism of a minor variant of an existing technique,
    as used (for range compatibility, etc) by the ICT 1900 Series in 1965
    and before that (for h/w economy) by the Ferranti Atlas and Orion.

    I don't think they ever claimed it was a new idea, but they're definitely still using it in computers that they are selling today.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2