• Lessons from the ARM Architecture

    From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 15 17:19:35 2025
    From Newsgroup: comp.arch

    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.


    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Dec 15 18:05:53 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.

    It certainly is, thanks for sharing!

    He left out POWER, for some reason or other.


    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    Wise words.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Mon Dec 15 19:37:00 2025
    From Newsgroup: comp.arch

    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    John
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Dec 15 22:24:42 2025
    From Newsgroup: comp.arch

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.


    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    There is also a tradeoff between density in cold path vs relative
    overhead vs a direct runtime call.


    I ended up partially migrating towards trap-and-emulate from runtime
    calls for "long double" as these can have a higher relative density in
    the cold path, and the added cost of the trap is smaller relative to the
    cost of running the Binary128 operations.

    Though, for semi-common cases, runtime calls are the preferable option.

    For example, floating-point or integer divide are better served by a
    runtime call than a trap:
    Relative cost of the trap is very high in comparison;
    More common in the hot-path.


    Things like memory barrier instructions and mutex-related instructions
    could make sense to handle via traps, but it is still preferable to do
    mutex locks from userland via a system call or similar.

    In kernel space, it is more of an issue (if not just using a function
    call), since then it turns into one of several scenarios:
    Multicore:
    Need to flush caches and lock the mutex in a coherent way.
    Single core:
    If mutex isn't free, kernel is already dead...
    Though, if locked by the current task,
    can just gloss over and let it continue;
    Else, trigger a panic or something.

    Things like IO devices may present their own sorts of challenges
    (particularly if one device is both the main OS device and the device
    holding the pagefile). IME, things like the pagefile generally need to
    bypass the main filesystem, and the filesystem code needs to be
    structured so that it is (ideally) impossible for a page fault or TLB
    miss to happen during an IO operation (for this reason, any VFS
    block-caching or similar needs to be done with physically-backed memory
    rather than virtual memory, etc).

    Well, and also some other important structures like task contexts can't
    use virtual memory (well, more so if the architecture doesn't allow
    interrupt handlers to natively access virtual memory; and it still makes
    sense to be able to handle things like task scheduling in an interrupt handler).

    ...


    But, then one can argue that maybe the cost of trap handling could get
    higher relative to that of native or runtime call handling.

    For example:
    It makes sense to handle Binary128 as traps if the overhead can stay
    under around 1000 cycles or so, but would not make sense at 100000 or
    1000000 clock cycles, where runtime calls would be the clear winner.

    Or, alternatively, if the runtime calls for the binary128 operations
    were under 100-200 clock cycles (very possible on an OoO machine).

    But, if there is a 200% speed difference, 800% code-size difference, and
    the operation is cold-path only, trapping makes sense.


    A bigger performance concern though is trapping on FDIV or denormals,
    which (if not using DAZ+FTZ) may still be common enough to be significant.


    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x multiplier out of the effective size of the TLB (and without some of the drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B
    PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.
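
    (Coverage math, for reference: a 16K page holds 16K/8B = 2048 PTEs, each
    mapping a 16K page, so each cached last-level table page covers
    2048*16K = 32MB; 64 entries * 32MB = 2GB, and 512 entries (256x2) gives
    16GB. A conventional 256x4 TLB covers 1024 * 16K = 16MB.)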



    Sometimes, it makes more sense to *not* have an instruction: namely, when
    the instruction only makes sense if it is semi-frequently used, but will
    at the same time negatively affect the implementation and add significant
    cost if implemented using traps.

    One example being possibly RISC-V's V extension, where:
    If implemented as trap handlers;
    If programs actually use it.
    Is very liable to wreck performance.

    A concern is partly that, at least to me, the design of RISC-V's V
    extension seems like a foot-gun. Some people argue that it isn't
    inherently any more expensive than ARM's NEON, but I still have some
    concern here.

    Well, I also have some concern over parts of the B extension, which seem overly niche and which don't map onto C or similar.

    ...



    One potential ugly point that exists in my designs (including XG3) is
    the use of 48-bit pointers with tagging in the high 16 bits. I can't
    prove that this won't come back to bite; but then at the same time I also
    still need some amount of tag bits. And, Intel/AMD/ARM had ended up
    partly going in a similar direction as an optional feature (though
    differing on the 8 vs 16 bits question).


    When I ended up moving the FPU status from GBR/GP to SP, this created a
    new mess, so was likely a partial mistake. I ended up moving these bits
    again to being located in TBR, with SP turned into a special "HOBs are potentially always zero" register.

    Granted, this did have a non-zero code impact (the code for the
    "FENV_ACCESS" stuff needed to be tweaked). Does leave a problem for
    RISC-V as accessing FENV in the standard way (via CSRs) will have a
    steep cost. For whatever reason, the RV designers thought it made sense
    to put things like rounding-mode and flags into separate sub-CSRs (which
    would be too expensive to handle and not used often enough to justify
    making it faster).

    ...



    But, activity in my ISA is slow recently.

    Has reached a point of "mostly good enough", which makes a difficulty.
    Like, what does one do when they reach a point where there isn't as much
    to obviously improve on?...

    Well, TestKern is still kinda crap, still no working Linux nor ability
    to run Linux binaries.

    Not entirely clear where to go from here, or the use case for what I
    ended up with. Here, XG3 works reasonably OK as a VM, but this was kind
    of an overly long and convoluted path if I have still just ended up back
    with having a VM (and to what extent I do electronics projects, still
    often just using a RasPi or similar).

    Like, this isn't much beyond where I was a decade ago.



    More recent activity has been in other areas, mostly involving BGBCC's resource-converter stuff:
    Working some on trying to improve my UPIC stuff, as UPIC turned out to
    be "kinda useful" (*1).


    *1: UPIC kinda resembles T.81 JPEG, but:
    Uses STF+AdRice rather than Huffman;
    Uses BHT+RCT rather than DCT+YCbCr;
    ...

    A few more recent extensions optionally add:
    WHT, LGT-5/3, and DCT transforms;
    YCoCg-R and Approx_YCbCr.
    Encoder can try out different options and pick the configuration that maximizes Q/bpp (lossy) or minimizes size (lossless, excludes DCT and Approx_YCbCr as these are inherently lossy).

    Had noted that the "winner" seems to depend a lot on image type and quality:
    Lossless:
    BHT or LGT-5/3 usually win (BHT wins more often).
    More often RCT wins over YCoCg-R.
    So, BHT + RCT seems to dominate for Lossless.
    Near Lossless (high quality):
    LGT-5/3 and YCoCg-R often jump into the lead.
    Low Quality:
    DCT and Approx_YCbCr often move into the lead for photo-like images;
    LGT-5/3 and YCoCg still do well for cartoon-like images though.

    WHT could do OK for photo-like images, but:
    Only holds a lead at lower quality levels;
    Almost always loses to DCT when DCT is still an option;
    So, WHT is making the worst showing here;
    Though, it is faster than DCT, and supports lossless.
    Never mind that, for lossless, it nearly always loses to LGT-5/3.

    Approx_YCbCr is basically an approximation of YCbCr:
    Y=(G*4+R*3+B)/8
    U=B-Y
    V=R-Y
    Though, this isn't exact, and:
    U=(B-Y)/2+128
    V=(R-Y)/2+128
    Would technically be closer.

    Though, in UPIC, the DC bias is 0 rather than 128, so differs here from
    JPEG.
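
    As a rough C sketch of the forward/inverse transform as used here
    (0-centered chroma; the function names are just made up for illustration,
    and the integer truncation is what makes it slightly lossy):

        static void aycbcr_fwd(int r, int g, int b, int *y, int *u, int *v)
        {
            *y = (g*4 + r*3 + b) >> 3;
            *u = b - *y;
            *v = r - *y;
        }
        static void aycbcr_inv(int y, int u, int v, int *r, int *g, int *b)
        {
            *b = y + u;
            *r = y + v;
            *g = (y*8 - (*r)*3 - (*b)) / 4;  /* from Y = (G*4 + R*3 + B)/8 */
        }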


    Had experimented with a few other possibilities, but none was
    particularly competitive.


    Where:
    BHT : Block Haar (Wavelet) Transform
    Recursively maps: (A,B) => ((A+B)/2, A-B)
    Split into averages and differences, applied recursively.
    LGT 5/3: Block (Le Gall–Tabatabai) 5/3 (Wavelet)
    Recursively split even/odd, predict odds from averages of evens.
    Bottom step resembles Haar.
    WHT : Walsh-Hadamard Transform
    Kinda resembles BHT, but more complicated.
    There is also another kinda WHT, but it did worse:
    (A+B+C+D, A+B-C-D, A-B-C+D, A-B+C-D)
    DCT : Discrete Cosine Transform
    Basically the same thing JPEG used here.
    Lossy only (making DCT lossless is too slow).
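
    FWIW, the BHT (A,B) => ((A+B)/2, A-B) step can be made exactly
    invertible despite the /2 via the usual Haar/S-transform lifting trick
    (a sketch; not necessarily exactly how UPIC implements it):

        /* forward */
        d = a - b;           /* difference */
        s = b + (d >> 1);    /* == floor((a+b)/2) */
        /* inverse */
        b = s - (d >> 1);
        a = b + d;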

    As with T.81 JPEG, UPIC works with 8x8 blocks, typically organized into
    a 16x16 macroblock with 4:2:0 or 4:4:4 chroma subsampling. As in the
    Theora codec, the individual sub-blocks within a macroblock (when more
    than one) are encoded in Hilbert order (U shaped). The format also
    natively supports an alpha channel.

    In this case, the 8x8 transforms are formed by two levels of 1D
    transform. Horizontal then vertical for encode, vertical then horizontal
    for decode (correct ordering here is important for lossless).


    Had noted in some testing:
    Lossless compression, for many images, can beat PNG.
    At near-lossless quality, Q/bpp also beats T.81 JPEG.

    But:
    For "very synthetic" images, such as pie charts or graphs, PNG tends to
    win over UPIC in terms of size (though UPIC still wins in being faster
    to decode than PNG and with less memory footprint).

    At the lower end of the quality scale (mostly under 30%), JPEG still
    holds on well for Q/bpp, ... But, this is very sensitive to how the
    quantization matrix is computed (the quantization matrix needs to be
    tuned for the specific transform).

    Possible reasons:
    STF+AdRice is slightly less optimal than Huffman;
    The zero count was reduced from 4 to 3 bits, which is weaker with
    longer runs of zeroes (more common with flat-color blocks and at low
    quality levels).

    One considered but untested possibility:
    Jointly encoding groups of four 8x8 blocks as one giant 256-entry block
    (to partly consolidate the long zero runs and the early EOBs). Unclear
    if it would be worth the added complexity though.


    Had considered using a variant of my BT5B codec as an image format for textures, but ended up deciding against this as the Q/bpp was kinda awful
    (and it started to seem better to use UPIC instead and just eat the
    extra CPU time for converting this to DXT1 or DXT5). While my older
    (BTIC4B) format had worked OK, it is considerably more complicated (and
    UPIC being "like JPEG but faster" and transcoding to DXT1 is "maybe good enough").

    Where, basically, for technical reasons, BT5B was essentially unable to
    significantly beat the DDS format in Q/bpp. It could get smaller files,
    but at much worse quality; it is still essentially similar tech to the
    MS-CRAM video codec (and for similar reasons one wouldn't really want to
    store textures via CRAM; it was at least possible to get close to 1:1
    with DDS for quality, but then also roughly 1:1 in terms of file size).


    Note that for a decode path to DXT1/5, UPIC generally decodes and
    transforms one macroblock at a time (this is more cache-friendly than
    decoding the entire image to RGBA and then DXT encoding it; though it
    still uses full-image passes for mipmap generation).

    Note that Q/bpp could be improved, technically, by feeding the image
    through a bitwise range coder, but this would have a significant adverse impact on speed.

    ...



    Also (probably feature creep) ended up adding 3D model conversion and
    CSG stuff to BGBCC (including a makeshift interpreter for the SCAD /
    OpenSCAD language).

    This was in part because:
    BGBCC's core also serves as the core of my WAD4 packing tool;
    It is useful to have model conversion stuff.

    It would partly take over the role of using my current 3D engine for
    model conversion.

    In the 3D engine I had typically been using a modified BASIC dialect, but
    had mostly designed the models originally in OpenSCAD, and using the
    SCAD language allows keeping the models intact (but OpenSCAD doesn't
    natively support colors or textures for any of the 3D model export formats).

    Initially, these were only supported for my own 3D model format (used
    for my 3D engine), but a not-yet-released version also supports the
    "Wavefront OBJ" format (I also tried implementing some of the color
    extensions for Binary STL, but nothing seemed able to import them with
    color).

    Other formats had other drawbacks:
    AMF: More complicated, poorly supported;
    XML based.
    3MF: Much more complicated.
    Format consists of XML blobs inside a ZIP based package, ...
    DAE: Yeah, no.
    3DS: Also no.

    DAE and 3DS seem more sane for high-end mesh modeling, not so much for CSG.


    Doing CSG with skeletal animation is currently supported for my BASIC
    dialect; could map it over to SCAD, but would not have any direct
    equivalent in OpenSCAD.

    For textures, can map the textures over to "color()" which is then
    understood as identifying a texture if not using a normal color
    name/value. There is no direct need to specify ST/UV coords here, as
    these are implied via texture projection.

    Textures may be specified with planes, and then when projecting it will
    pick the plane for this texture that is the closest match to the face
    normal.

    Sorta like:
    texturename:
    image [ Sx Sy Sz So ] [ Tx Ty Tz To ]
    ...
    Where: N = S x T
    S/T encoding both the scale and translation along the axis.
    Basically, working in a vaguely similar way to how texturing worked in Half-Life (just with the 'color' also able to specify a texture).
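
    In C terms, the per-vertex projection is then roughly (a sketch; the
    variable names are placeholders):

        /* Project a vertex (vx,vy,vz) through one [Sx Sy Sz So][Tx Ty Tz To]
           pair, Quake/Half-Life style. The pair used is the candidate whose
           N = S x T best matches the face normal. */
        u = vx*Sx + vy*Sy + vz*Sz + So;
        v = vx*Tx + vy*Ty + vz*Tz + To;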

    If I added bones to SCAD, one possibility is:
    bone("bonename") {
    ... geometry ...
    }

    In the BT3 engine, the animation was handled by assigning geometry to
    the skeleton (no vertex weights for now), and then using external files
    (also text based) to describe any animation sequences.

    Though, I had decided I will likely stick mostly with using sprite
    graphics for mobs and NPCs, so the animation part may be less relevant.


    I may start making more use of procedural compositing and animation for sprites though (possibly using a similar system to that used for VTuber
    rigs; though likely specified as text files and using coords). Haven't
    done much yet.

    So, say, in this case, might define parts relative to a sprite sheet:
    part hand sprites/somechar [ 320 100 ] [ 400 160 ]
    part forearm sprites/somechar [ 410 100 ] [ 550 160 ]
    ...
    bone spine hip [ 320 400 ]
    bone neckbase spine [ 320 600 ]
    bone lupperarm neckbase [ 290 600 ]
    bone lforearm lupperarm [ 160 600 ]
    bone lhand lforearm [ 100 600 ]
    ...
    attach hand lhand [ 10 580 ] [ 1 0 0 1 ]
    attach hand rhand [ 600 580 ] [ -1 0 0 1 ]
    ...

    Well, contrast this with VTuber rigs usually being built with a lot of
    drag-and-drop tools (but I have my reasons for wanting it to remain
    text based; it would then likely be parsed and cooked down into a binary format).

    Or, basically, each part is clipped out of a sheet texture, then
    attached to a skeleton giving the base location and a transform (as a
    2x2 matrix).

    Animations could then be given as a sequence of frames specifying the
    relative translation and rotation of each bone (or possibly
    hiding/showing certain attachments, say for example open-vs-closed
    hands, etc).

    Though, a 2D skeleton would still carry over some of the limitations of
    traditional 2D sprite graphics.


    Though, I guess another option here is to use a hybrid approach of
    attaching sprites to a 3D skeleton and then animating it as-if it were
    3D; but this doesn't really save anything (effort wise) vs the use of
    CSG models (and, if anything, more just makes a case for the ability to
    attach sprites in addition to CSG solids).

    Then again, some early 3D games, like "Super Mario 64" did make
    effective use of such a strategy (character models often combined 3D
    elements with the use of sprites in some cases; rather than being purely
    3D).

    ...



    Though, for past stuff, had usually drawn out every sprite frame.
    But, this doesn't scale well to characters with dynamic movement.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Tue Dec 16 12:02:18 2025
    From Newsgroup: comp.arch

    In article <memo.20251215193718.13016J@jgd.cix.co.uk>,
    John Dallman <jgd@cix.co.uk> wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Archive.org link to the actual PDF: https://web.archive.org/web/20160201075644/https://www.eit.lth.se/fileadmin/eit/courses/eitf20/ARM_RG.pdf

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    And how!

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Dec 16 15:40:26 2025
    From Newsgroup: comp.arch

    On Mon, 15 Dec 2025 18:05:53 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.

    It certainly is, thanks for sharing!

    He left out POWER, for some reason or other.


    I think that the reason was simple: he searched his drawer and found a
    brochure from the 1992 Microprocessor Forum. By chance, IBM didn't present
    POWER-related stuff at that forum. Likely because POWER (later known as
    POWER1) and RSC had been presented in the previous year[s], and for PPC601
    they wanted to hold their cards closer to their collective chests, because
    anything else would be incompatible with Apple's culture.



    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    Wise words.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 16 12:29:58 2025
    From Newsgroup: comp.arch

    On 12/15/2025 12:05 PM, Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    Yes, scribd.com is annoying, but the document is interesting.

    It certainly is, thanks for sharing!

    He left out POWER, for some reason or other.


    Page 22:

    "If a feature requires a combination of hardware and specific
    software..."
    - Be afraid.
    - Be _very_ afraid of language specific features.
    - If it feels a bit clunky when you first design it
    ... it won't improve over time

    Wise words.

    Yes.


    Ideally, any feature should do one of:
    Be trivially usable by a C compiler or similar;
    Address a scenario that is common but can't be handled
    efficiently enough otherwise.

    An example of the latter would be things like RGB555 packing/unpacking,
    where RGB555 has ended up very common in a lot of my graphics handling
    code, but the traditional ways to pack/unpack RGB555 are a bit lacking
    if it needs to be done in a tight loop.
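
    For reference, the "traditional" scalar path looks something like this
    (purely illustrative), which is a fair pile of shifts and masks per
    pixel once it lands in a tight loop:

        static inline unsigned rgb555_pack(unsigned r, unsigned g, unsigned b)
        {   /* r,g,b in 0..255 */
            return ((r>>3)<<10) | ((g>>3)<<5) | (b>>3);
        }
        static inline void rgb555_unpack(unsigned p,
            unsigned *r, unsigned *g, unsigned *b)
        {
            *r=((p>>10)&31)<<3; *g=((p>>5)&31)<<3; *b=(p&31)<<3;
        }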


    Though, in most cases, designing instructions or similar for a specific algorithm seems like a bad idea. In nearly every other situation the instructions would end up being essentially useless.


    In my case, I had an overly niche feature I had called "LDTEX":
    Takes a compressed texture given as a base address and loads a texel
    with fixed-point ST coords given in the index register (nearest neighbor
    only, but there was a 64-bit encoding variant to control the texel
    rounding to specify one of 4 positions).

    This had a very niche use-case:
    When implementing a software rasterizer.

    Usage elsewhere:
    Not really...


    Status:
    Now mostly leaving it disabled, as it is less needed if using a hardware rasterizer (but would be more relevant if I had GLSL; issue mostly is
    that a GLSL compiler would be a bit too much code footprint).

    Though, could potentially do ARB shaders, or a non-standard shader
    language based on unstructured BASIC or similar.

    With the HW rasterizer, there is still the drawback that geometry
    processing on the CPU is a significant bottleneck (as are often things
    within the 3D engines).

    ...


    Then again, ARM had ended up with some things that were far more niche,
    and ultimately even more useless, like Jazelle and similar.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 16 21:02:28 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd: <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.


    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can
    perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
    I make a rounding error about once every 378 calculations. Running a SW
    routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100
    cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.
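
    (Back of the envelope: (<100 cycle trap + 300-500 cycle SW path) / 378
    calls works out to roughly 1.0-1.6 extra cycles per call, amortized.)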

    -----------------------------------------------
    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x multiplier out of the effective size of the TLB (and without some of the drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.

    If your HW tablewalker made 1 probe into a large (16MB) SW controlled
    buffer of PTEs, and if it found the PTE great, it gets installed and
    life goes on, otherwise, SW updates the Table and TLB and returns as
    soon as possible.

    This gives you a L2-TLB with 1M entries, enough to map 8GB and a
    table walking state machine with 3 states.

    -----------------------------------------------
    One potential ugly point that exists in my designs (including XG3) is
    the use of 48-bit pointers with tagging in the high 16 bits. I can't
    prove that this won't come back to bite; but then at the same time I also still need some amount of tag bits. And, Intel/AMD/ARM had ended up
    partly going in a similar direction as an optional feature (though
    differing on the 8 vs 16 bits question).

    I CAN prove it WILL come back to Bite--but only if your architecture survives

    -----------------------------------------------------
    For textures, can map the textures over to "color()" which is then understood as identifying a texture if not using a normal color
    name/value. There is no direct need to specify ST/UV coords here, as
    these are implied via texture projection.

    Can you think of uses for Texture where a LD instruction has a FP
    index::

    LD Rd,[Rbase,Rindex,Displacement]

    where: Rbase is a pointer
    Rindex is a FP value
    Displacement is what it is

    The integer part of Rindex and integer part + 1 index the Texture.
    The fraction part of Rindex performs the blending between
    LERP( texture[int(Rindex)], texture[int(Rindex)+1], fract(Rindex) );
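
    In scalar C terms (ignoring the displacement), such a load would
    compute roughly:

        double ld_lerp(const double *base, double idx)
        {
            int    i = (int)idx;       /* integer part */
            double f = idx - i;        /* fractional part */
            return base[i] + (base[i+1] - base[i]) * f;
        }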
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Dec 16 16:20:34 2025
    From Newsgroup: comp.arch

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can
    perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
    I make a rounding error about once every 378 calculations. Running a SW routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100
    cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.

    Hmm... I think that calculation isn't quite right:

    - First, this assumes a random distribution. Depending on the details
    (e.g. if the 1/378 cases are grouped in a specific subspace that some
    algorithm might use heavily), it can get a lot worse (or a bit better).

    - During those 300-500 cycles, no other instruction is processed, IOW
    it's not as if this one instruction took just one more cycle, but as if
    the whole CPU stopped all activity for one cycle. For an ILP of 1,
    it's about the same, but for an ILP of 4, it's 4x more expensive.

    I CAN prove it WILL come back to Bite--but only if your architecture
    survives

    It's part of the problem with those "lessons from" successful projects:
    don't forget that it is sometimes indispensable to cut corners for the
    project to survive in the first place. Maybe you'll regret it years
    later, yet, without it, there wouldn't even be any opportunity to regret anything years later. There are delicate trade-offs.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Dec 16 19:01:18 2025
    From Newsgroup: comp.arch

    On 12/16/2025 3:02 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd:
    <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can
    perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively.
    I make a rounding error about once every 378 calculations. Running a SW routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100
    cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.



    In my case, handling of "sin()" looks sorta like:

    double sin(double ang)
    {
        double t, x, th, th2;
        int i;

        /* Range-reduce to roughly [0, 2*PI) (or (-2*PI, 0] for negative ang). */
        i=ang*M_TAU_R;
        th=ang-(i*M_TAU);
        th2=th*th;

        /* Polynomial in odd powers: t = c00*th + c01*th^3 + c02*th^5 + ... */
        t =th*sintab_c00;  x=th*th2;
        t+=x*sintab_c01;   x*=th2;
        t+=x*sintab_c02;   x*=th2;
        t+=x*sintab_c03;   x*=th2;
        t+=x*sintab_c04;   x*=th2;
        t+=x*sintab_c05;   x*=th2;
        t+=x*sintab_c06;   x*=th2;
        t+=x*sintab_c07;   x*=th2;
        t+=x*sintab_c08;   x*=th2;
        t+=x*sintab_c09;   x*=th2;
        t+=x*sintab_c10;   x*=th2;
        t+=x*sintab_c11;   x*=th2;
        t+=x*sintab_c12;   x*=th2;
        t+=x*sintab_c13;   x*=th2;
        t+=x*sintab_c14;   x*=th2;
        t+=x*sintab_c15;   x*=th2;
        t+=x*sintab_c16;   x*=th2;
        t+=x*sintab_c17;   x*=th2;
        t+=x*sintab_c18;   x*=th2;
        t+=x*sintab_c19;   x*=th2;
        t+=x*sintab_c20;
        return(t);
    }

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    This would be roughly 3x slower if handled with a trap and a special instruction.

    Does it make sense to add an FSIN instruction? No, as FSIN is nowhere
    near common enough to make much impact on code size (and, FSIN will be invariably slower).

    How much space does it take? Around 460 bytes.


    This is assuming DAZ/FTZ, but denormals shouldn't happen unless ang is
    pretty close to an integer multiple of M_TAU (2*M_PI).

    A possible hack would be to detect that if 'th' is sufficiently close to
    0 then just return 0 (and sidestep going into denormal land).
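
    Say (with the threshold purely as a placeholder):

        if(fabs(th)<1e-9)   //small enough that the th^n terms go denormal
            return(0.0);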



    What about "sinl()"?
    long double sinl(long double ang);
    Rest similar, just long double.


    Roughly 80k cycles...
    This is kinda awful.

    If using function calls, it can drop to 60k cycles, still awful.

    How much space:
    Around 800 bytes if we assume special instructions and traps.
    Eg: FMUL.Q and FADD.Q and similar.
    Around 3K if using function calls.

    so, in this case, 30% slower but 27% the size still seems like a win...

    And, at 60k or 80k cycles, there isn't really any making this fast.


    Ironically, might get faster if/when there were some implementation that actually had native (non-trapping) Binary128 support.



    Other option is to just use "sin()" internally and convert to/from "long double".



    -----------------------------------------------
    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x
    multiplier out of the effective size of the TLB (and without some of the
    drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B
    PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.

    If your HW tablewalker made 1 probe into a large (16MB) SW controlled
    buffer of PTEs, and if it found the PTE great, it gets installed and
    life goes on, otherwise, SW updates the Table and TLB and returns as
    soon as possible.

    This gives you a L2-TLB with 1M entries, enough to map 8GB and a
    table walking state machine with 3 states.

    -----------------------------------------------
    One potential ugly point that exists in my designs (including XG3) is
    the use of 48-bit pointers with tagging in the high 16 bits. I can't
    prove that this won't come back to bite; but then at the same time I also
    still need some amount of tag bits. And, Intel/AMD/ARM had ended up
    partly going in a similar direction as an optional feature (though
    differing on the 8 vs 16 bits question).

    I CAN prove it WILL come back to Bite--but only if your architecture survives


    Maybe, but it is unclear if it will at this point.

    RISC-V is more likely to survive, and XG3 could make sense in the RISC-V context, but not likely as a primary ISA in any case.

    Though, the big uncertainty is whether 48-bit tagged PC and
    Link-Registers are too much of an ask. An implementation could be
    possible that used 56 bit addresses, but would not be strictly
    compatible with my existing implementation (doing mode jumps does
    require awareness of the tagging scheme, which has a non-zero code impact).

    I had at one point considered the possibility of hacking certain RISC-V
    Load instructions to function as mode-selecting JALR's to allow LUI+Lx
    to encode an inter-mode JAL (and also as a way to detect XG3 support),
    but thus far this has not been done.

    Say:
        LUI X5, AddrHi
        LW  X0, AddrLo(X5)   //JALR to XG3 mode
        ...                  //If we get here, it didn't work.

    Where "LW X0" would be understood as a JALR to XG3 mode, and "LH X0"
    as a JALR to RV64GC mode. The scheme works because normal RISC-V ops
    are valid in both modes.


    One merit of the pointer tagging though is that it allows call/return
    between modes, but exposes a certain level of wonk.


    -----------------------------------------------------
    For textures, can map the textures over to "color()" which is then
    understood as identifying a texture if not using a normal color
    name/value. There is no direct need to specify ST/UV coords here, as
    these are implied via texture projection.

    Can you think of uses for Texture where a LD instruction has a FP
    index::

    LD Rd,[Rbase,Rindex,Displacement]

    where: Rbase is a pointer
    Rindex is a FP value
    Displacement is what it is

    The integer part of Rindex and integer part + 1 index the Texture.
    The fraction part of Rindex performs the blending between
    LERP( texture[int(Rindex)], texture[int(Rindex)+1], fract(Rindex) );


    For the LDTEX instruction, it was:
    (63:48): Integer part of T coord
    (47:32): Fraction of T coord
    (31:16): Integer part of S coord
    (15: 0): Fraction of S coord.

    For the 64-bit encoding, the Disp field had encoded:
    0: Truncate S and T
    1: Round S up, Truncate T
    2: Truncate S, Round T up
    3: Round S and T up.

    With the HOB's of the address given to LDTEX encoding the texture size
    and format (could all be fit into 16 bits, due to textures being
    power-of-2 size and square/rectangular).
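
    I.e., packing the index register looks something like (a sketch, using
    u32/u64 as shorthand typedefs):

        /* S and T each as 16.16 fixed point: T in the high 32 bits,
           S in the low 32 bits, matching the layout above. */
        static inline u64 ldtex_pack_st(u32 s_fix, u32 t_fix)
        {
            return ((u64)t_fix<<32) | s_fix;
        }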


    This allowed 4 LDTEX instructions, and BLINT instructions, to be used to implement bilinear filtering.

    Though, cheaper was to only do 3 LDTEX, 2 BLINT (S and T), and then
    average the color vector. This gives an effect similar to the texture filtering used on the Nintendo 64.

    In the absence of LDTEX, one might do instead:
    PMORT.Q + SHLD + AND + MOV.Q + BLKUTX2

    Though, not as bad for bilinear as they could be interleaved (though,
    still higher latency than LDTEX).

    And, then use packed SIMD calculations rather than BLINT.


    In theory, 5 LDTEX's could be used to implement trilinear, but had noted
    that some tweaking with the interpolation coords can be used to mimic
    the effects of trilinear filtering without actually needing to access
    another mipmap level. Rather than loading and interpolating the next
    mipmap level, merely push the interpolation coords towards the center of
    the 4 texels.

    Resulting in a sorta "Crappy bilinear filtering that mimics the look of trilinear filtering". Well, it at least can avoid the main obvious
    artifact of bilinear, which is that there is a big obvious seam whenever
    it jumps between mipmap levels.


    Though, in the case of SCAD, the "color()" operation is usually given a string, say:
    color("#AA5500")
    {
    translate([32,-1,1])
    {
    cube([8,5,2]);
    translate([1,3.2,0])
    rotate([0,0,25])
    cube([6,2,2]);
    }
    }

    One annoyance with SCAD is that (like with GLSL) it takes a pretty big
    chunk of code to deal with it.

    I don't generally want to deal with this directly on the target ISA, so
    makes more sense to precook it into a triangle-based model.

    In this case, one can get creative:
    color("sometexture")
    {
    ...
    }
    Where "sometexture" is understood as giving the name of the texture or material to apply, rather than the name of the color.



    But, there is an annoyance of model formats:
    BMD: a custom model format.
    Mostly uses a packed format for coordinates (joint exponent fun);
    Reasonably compact and cheap to unpack.
    Partly influenced by the Quake and Half-Life model formats.
    STL:
    Common;
    Lacks any standard way to do colors,
    and even then, only in the binary format.
    Binary STL files are more bulky (50 bytes per triangle).
    OBJ: (Wavefront OBJ)
    Semi-common;
    More advanced, text based format.
    Loader needs a text parser, ...
    Text format makes models comparably bulky.
    AMF, 3MF:
    XML + ZIP
    Don't want to deal with this.
    XML parsing and Deflate decompression is expensive on a 50 MHz CPU.
    DAE: Just no.

    Theoretically, there is glTF, but glTF seems a bit overkill for my uses.


    In my case, BMD models also support collision detection. This mostly
    relies on the fact that, since the models are generated via CSG, all
    the meshes can be closed and non-self-intersecting.

    This makes it possible to detect collisions:
    Start from a vertex, and trace a line somewhere else (outside of the
    model being checked):
    If the number of crossed faces is even (including 0), the point is outside the model;
    If it is odd, point is inside the model.

    Doesn't really work if the model is self-intersecting or non-closed. Also
    can assume that a point is always non-colliding if it falls outside the
    bounding box.
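
    A rough sketch of the parity test in C (Moller-Trumbore for the
    ray/triangle part; purely illustrative, not the actual BMD code, and it
    ignores the degenerate cases where the ray grazes an edge or vertex):

        #include <math.h>

        typedef struct { double x, y, z; } vec3;

        static vec3   v3sub  (vec3 a, vec3 b)
            { vec3 r={a.x-b.x, a.y-b.y, a.z-b.z}; return r; }
        static vec3   v3cross(vec3 a, vec3 b)
            { vec3 r={a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x}; return r; }
        static double v3dot  (vec3 a, vec3 b)
            { return a.x*b.x + a.y*b.y + a.z*b.z; }

        /* Returns 1 if the ray (org, dir) crosses triangle (a,b,c) at t>0. */
        static int ray_tri(vec3 org, vec3 dir, vec3 a, vec3 b, vec3 c)
        {
            vec3 e1=v3sub(b,a), e2=v3sub(c,a), p=v3cross(dir,e2), tv, q;
            double det=v3dot(e1,p), inv, u, v, t;
            if(fabs(det)<1e-12) return 0;   /* ray parallel to triangle */
            inv=1.0/det;
            tv=v3sub(org,a);
            u=v3dot(tv,p)*inv;   if(u<0 || u>1)   return 0;
            q=v3cross(tv,e1);
            v=v3dot(dir,q)*inv;  if(v<0 || u+v>1) return 0;
            t=v3dot(e2,q)*inv;
            return t>0;
        }

        /* Point-in-mesh by crossing parity, for a closed non-self-intersecting
           mesh given as ntris triangles (3 vertices each). */
        static int point_in_mesh(vec3 pt, const vec3 *tri, int ntris)
        {
            vec3 dir={1.0, 0.0, 0.0};    /* trace towards +X, out of the model */
            int i, hits=0;
            for(i=0; i<ntris; i++)
                hits+=ray_tri(pt, dir, tri[i*3+0], tri[i*3+1], tri[i*3+2]);
            return hits&1;               /* odd => inside */
        }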

    Note that BSP trees could be used to optimize these sorts of checks, but
    a BSP adds complexity that is not usually worthwhile. Currently, BMD
    does not store BSP trees.



    Can note that in my BT3 engine, it mostly does collision detection and
    physics by using point probing (rather than box/plane checks).
    Ironically, this is more like old NES/SNES games than it is like Doom or Quake.

    But, for physics, it was simpler/easier to answer the question of "is
    this point inside anything solid?" for the corners of a bounding box,
    than it was to check for lots of AABB collisions or similar.

    Though, for AABB/mesh checks, line/box checks are also needed: if any of
    the corner points of the box fall inside the mesh, or if any polygon
    edges from the mesh intersect the AABB, assume a collision has occurred.

    If using CSG brushes, could use the "separating axis theorem" approach,
    but this requires keeping the CSG brushes around.

    ...



    There isn't currently a plan to add rigid-body physics.
    My first 3D engine had this, but:
    It is a pain to get this working reliably;
    90% of everything still ends up being done with AABBs;
    Often lacks an obvious use-case other than "objects can fall over realistically".

    Seemed compelling in Half-Life 2 and similar, but mostly amounted to not
    much more than a gimmick.

    At the time, was more effort than it was worth trying to get objects to
    form stable stacks. One of the harder problems of developing such a
    physics engine being to get rigid-body objects to stack without becoming unstable.

    Then ended up mostly not using the rigid-body physics stuff for much of anything anyways.

    In theory, could just use (or could have used) an existing physics engine.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Dec 17 01:44:48 2025
    From Newsgroup: comp.arch

    According to BGB <cr88192@gmail.com>:
    On 12/15/2025 1:37 PM, John Dallman wrote:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    I think it's always been iffy.

    S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
    8 byte boundaries. Normally the Fortran compiler aligned data correctly but you could use an EQUIVALENCE statement to create misaligned doubles, something that occasionally happened in code ported from 709x where it wasn't a problem. There was a library trap routine to fix it but it was stupendously slow, so IBM added the "byte oriented operand" feature to the 360/85 and later models.

    The high end 360/91 did not have most of the decimal instructions, on the reasonable theory that few people would run the kind of programs that used them on a high end scientific machine. The operating system simulated them slowly, so
    the programs did still work. But the subsequent 360/195 had the full instruction
    set.

    On the other hand, back when PCs were young, Intel CPUs had a separate optional fairly expensive floating point coprocessor. Most PCs didn't have them, and programs that did arithmetic all had software floating point libraries. If you didn't have the FPU, instructions didn't trap, they just did nothing. The C compiler we used had a clever hack that emitted floating point instructions with the first byte changed to a trap instruction. If the computer had hardware floating point, the trap handler patched the instruction to the
    real floating point one and returned to it, otherwise used the software floating point. This got pretty good performance either way. It helped
    that the PC had no hardware protection so the C library could just install
    its own trap handler that went directly to the float stuff.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 00:51:17 2025
    From Newsgroup: comp.arch

    On 12/16/2025 7:44 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    On 12/15/2025 1:37 PM, John Dallman wrote:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    I think it's always been iffy.

    S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
    8 byte boundaries. Normally the Fortran compiler aligned data correctly but you
    could use an EQUIVALENCE statement to create misaligned doubles, something that
    occasionally happened in code ported from 709x where it wasn't a problem. There
    was a library trap routine to fix it but it was stupendously slow, so IBM added
    the "byte oriented operand" feature to the 360/85 and later models.

    The high end 360/91 did not have most of the decimal instructions, on the reasonable theory that few people would run the kind of programs that used them
    on a high end scientific machine. The operating system simulated them slowly, so
    the programs did still work. But the subsequent 360/195 had the full instruction
    set.


    Running some stats for misaligned access in my emulator (percentage of
    total memory accesses, organized by type).


    Initial bootup:
    Word : 7.67% DWord: 0.13% QWord: 9.53%

    Doom Startup:
    Word : 0.64% DWord: 0.05% QWord: 3.48%
    Demo Loop:
    Word : 0.03% DWord: 0.03% QWord: 2.27%

    SW Quake Startup:
    Word : 0.63% DWord: 0.06% QWord: 6.70%
    SW Quake first Demo:
    Word : 0.03% DWord: 0.07% QWord: 3.18%

    GLQuake Startup:
    Word : 0.83% DWord: 0.03% QWord: 4.87%
    GLQuake first Demo:
    Word : 0.23% DWord: 0.02% QWord: 2.23%


    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...


    Seems that misaligned QWord is the most common, but then again, QWord is
    used for things like "memcpy()" and similar (and a lot of my runtime
    code was written to assume fast misaligned memory access, ...).

    You could arguably implement "memcpy()" as a series of byte loads and
    stores, but then "memcpy()" is 8x slower.
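
    E.g., the usual fast path, assuming misaligned 64-bit loads/stores are
    cheap (a sketch; strict-aliasing and portability caveats apply):

        #include <stdint.h>
        #include <stddef.h>

        void *memcpy_q(void *dstv, const void *srcv, size_t n)
        {
            unsigned char *d=dstv;
            const unsigned char *s=srcv;
            while(n>=8)   /* QWord copy, alignment ignored */
                { *(uint64_t *)d=*(const uint64_t *)s; d+=8; s+=8; n-=8; }
            while(n--)    /* byte tail */
                *d++=*s++;
            return dstv;
        }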



    So, in this case, MOV.Q is 4% of CPU time, and say 5% of these are
    misaligned.

    So, 0.2% of the CPU cycles are misaligned QWord accesses. What if these
    took 500x longer? That adds roughly 100%; or, effectively, doing this
    would halve performance.

    So, "rare enough":
    Misaligned falls well short of being "rare enough"...


    I am not thinking of things like misaligned here, more things like "long double". As-is, "long double" is basically unused in the hot path.



    On the other hand, back when PCs were young, Intel CPUs had a separate optional
    fairly expensive floating point coprocessor. Most PCs didn't have them, and programs that did arithmetic all had software floating point libraries. If you
    didn't have the FPU, instructions didn't trap, they just did nothing. The C compiler we used had a clever hack that emitted floating point instructions with the first byte changed to a trap instruction. If the computer had hardware floating point, the trap handler patched the instruction to the
    real floating point one and returned to it, otherwise used the software floating point. This got pretty good performance either way. It helped
    that the PC had no hardware protection so the C library could just install its own trap handler that went directly to the float stuff.


    Yeah, early PCs and the math coprocessor are more analogous to the "long
    double" situation.

    You already know it is slow, but sometimes need it. But, can't use it
    often, because it is slow.

    Early PCs used fixed-point. In my case, one still has "float" and
    "double", but these still aren't super fast either (so, using
    fixed-point can still be faster in many cases).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Wed Dec 17 07:11:04 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 01:27:31 2025
    From Newsgroup: comp.arch

    On 12/16/2025 3:02 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Getting back to a part I somehow missed responding to...


    -----------------------------------------------
    Sometimes, I am left considering wacky ideas, say for example:
    A Partial Page Table.
    Where, rather than having a full page walker, it has a TLB like
    structure which merely caches the bottom level of the page table.

    This could have some similar merits to software TLB (flexibility and
    simpler hardware), while potentially getting a roughly 512x to 2048x
    multiplier out of the effective size of the TLB (and without some of the
    drawbacks of an Inverted Page Table).

    Say, for example, with a 64-entry 1-way LLPTLB, and 16K pages (with 8B
    PTEs), this would cover a working set of 2GB of VAS (and 256x2 would
    cover 16GB).

    Well, vs the 16MB of coverage by a 256x4 TLB with 16K pages.

    If your HW tablewalker made 1 probe into a large (16MB) SW controlled
    buffer of PTEs, and if it found the PTE great, it gets installed and
    life goes on, otherwise, SW updates the Table and TLB and returns as
    soon as possible.


    Likely the LLPTLB would be keyed on (47:25) or so, then holding (47:14)
    for the PPN. Would likely have a similar structure to the normal TLBE,
    just ignoring the low part of the virtual address.


    But, yeah, in the case of an LLPTLB hit (and PTE is marked Valid, etc),
    it would bypass the usual TLB Miss interrupt, and instead refill the TLB directly from the last level of the page table.

    Could potentially result in a significant reduction in TLB miss
    frequency. Though, as-is, at least in my test programs, TLB miss rate
    tends to be somewhat lower than the Timer IRQ rate, so... A stronger case
    could almost be made for not hitting the timer interrupt at 1 kHz...

    Where:
    1024 Hz: Traditional value for the RTC interrupt, seemed sane.
    100 Hz: I think NT4 used this or something.
    18 Hz: MS-DOS and similar.
    32768 Hz: MSP430 can use this,
    but this would eat the CPU in my case...

    As-is, I seem to be seeing values of around 60 to 200 Hz for TLB Miss interrupts (with the existing 256x4 configuration).

    Though this is primarily with a single-address-space system.


    This gives you a L2-TLB with 1M entries, enough to map 8GB and a
    table walking state machine with 3 states.


    The idea here is to have a LLPTLB which would still use a TLB Miss event
    in the event that both TLB and LLPTLB miss, but would fetch the PTE from memory in the case of an LLPTLB hit.

    The TLB Miss handler would then also load an entry (describing the
    last-level of the page-table) into the LLPTLB (along with the normal TLBE).


    My estimate is that this would greatly increase working set size, but
    LLPTLB conflict misses could still be an issue (would need an
    associative LLPTLB to reduce conflict misses on context switch; but
    LLPTLB misses are likely to be more dominated by conflict misses,
    particularly on context switches).


    While not as transparent as a full page walker, it would preserve the flexibility to construct non-standard or ad-hoc virtual address spaces
    (which would be lost with a more traditional hardware page-walker).
    Like, say, I don't need the MMU to be aware of nested page-tables,
    because this can be done in software.

    Also, avoids the drawback of Inverted Page Tables, in that IPT's would
    still effectively need TLB miss events to drip-feed TLBEs into the IPT;
    while also being more memory efficient.

    Does have the drawback of the MMU needing to be able to access memory,
    but I can use a simpler mechanism in that it only needs to be able to
    access a single cache line (unlike either a full page walk or IPT, which
    would need multiple memory accesses).


    Though, the MMU may need some way to temporarily "hold" missed requests
    during a TLB Miss to determine whether or not it can resolve them (since
    it would no longer know the answer within a fixed latency).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Dec 17 07:59:10 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4]. See https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReduction.pdf
    for a random reference grabbed off the net how to do it.

    [...]

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    What is your accuracy?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 02:27:33 2025
    From Newsgroup: comp.arch

    On 12/17/2025 1:11 AM, Lawrence D’Oliveiro wrote:
    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not supported
    natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    Yes.

    Or, like the "SiFive U74" and similar, where the funny thing of the
    RISC-V ISA using unscaled displacements but then having a CPU that uses internal traps (and is horribly slow) in the case of misaligned access...

    Meanwhile, I prefer to have memcpy and LZ decompression where
    "performance doesn't suck".

    Also useful for things like Huffman and Rice decoding, etc. Say, for
    Huffman decoding, if one needs to use branches to detect when to pull in
    more bytes, this eats more clock-cycles than advancing the bit-stream
    position implicitly via arithmetic tricks.

    Well, and this is also an example of why to use LSB-first bit ordering, and
    not to use FF escape encodings and similar:
    MSB-first, FF escapes, the 16-bit length limit, etc., manage to make
    JPEG bit-stream handling a lot slower than it could have been.

    Whereas, say, LSB-first and imposing a 12-bit length limit allows some
    speedup here.

    Though, the Rice coder in UPIC effectively uses an 8-bit lookup, but
    this is because it uses 3 bits for the Rk factor. So, sadly, it needs a fallback path to decode symbols that exceed 8 bits.

    So, pseudo-code (for AdRice Decoding):
    win=*(u32 *)cs;
    b=win>>pos;
    ix=(rk<<8)|(b&255);
    v=ricefasttab[ix]; //constant lookup table for Rice-code state space
    l=(v>>8)&15;
    if(l<=8)
    {
        //faster path
        pos+=l;
        cs+=pos>>3;
        pos&=7;
        rk=(v>>12);
        return(v&255);
    }
    // ... slower path ...
    q=riceqtab[b&255]; //count bits for Q prefix.
    if(q==8)
    {
        //escape case, Q==8 escapes a raw max-length symbol
        l=16;
        v=(b>>8)&255;
        rk+=2;
        if(rk>7)rk=7;
    }else
    {
        l=q+rk+1;
        v=((b>>(q+1))&((1<<rk)-1))|(q<<rk);
        if((q==0) && (rk>0)) rk--;
        if((q>=2) && (rk<7)) rk++;
    }
    pos+=l;
    cs+=pos>>3;
    pos&=7;
    return(v);

    Which may not seem very fast, but could be a lot worse.

    In this case (for L1 cache reasons) the slightly more complicated
    approach here works out faster on average than using a single giant
    lookup table.



    So, my CPU supports misaligned access natively.


    Can make sense to skip it for microcontroller-class cores though, since
    in this case "cheaper L1 cache" is likely to be a higher priority.

    Doesn't make sense for things bigger than a microcontroller though, as allowing for misaligned memory accesses is too useful IMO.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 03:03:04 2025
    From Newsgroup: comp.arch

    On 12/17/2025 1:59 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4]. See https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReduction.pdf
    for a random reference grabbed off the net how to do it.


    Possible, I didn't evaluate this.

    I just sorta reduced it to [0, 2*PI] noting that if values got too far
    outside of this range, accuracy started dropping off, so limiting to
    this range kept the accuracy good.

    Though, it flips to [-2*PI, 0] if ang<0, since conversion to 'int'
    truncates towards zero and not towards negative infinity. Not entirely
    sure why I used 'int' and not 'long', possible oversight here...


    [...]

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    What is your accuracy?


    IIRC, I used 20 stages because this seemed to fully converge the double
    to about as accurate as it was going to get in my initial testing
    (without adding too many extra stages).

    IIRC, basic algo was based on the Taylor Series expansion off of
    Wikipedia or similar (I didn't show the magic constants).


    It was also still a vast improvement over the code that the C library originally came with (IIRC, it used a "for()" loop and calculated the
    factorials and did the division inline; this was far slower than using precomputed constants).
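
    Roughly the sort of naive loop being described (an illustrative
    reconstruction, not the original library code): each term recomputes the
    factorial and divides inline, rather than multiplying by precomputed
    reciprocal-factorial constants.

    /* Taylor series for sin(), computing (2k+1)! and dividing per term. */
    double sin_naive(double th)
    {
        double t = 0, p = th, fact = 1;
        int k;
        for (k = 0; k < 21; k++)
        {
            t += ((k & 1) ? -1.0 : 1.0) * p / fact;   /* +/- th^(2k+1)/(2k+1)! */
            p *= th * th;
            fact *= (2*k + 2) * (2*k + 3);            /* (2k+1)! -> (2k+3)! */
        }
        return t;
    }

    The per-term divide is a big part of why this is far slower than
    multiplying by precomputed constants.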


    For "long double", it is both slower to calculate and also requires more stages to converge the result.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 17 11:40:50 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after
    reduction to [-pi/4,+pi/4] is way too high. It seems you would need a
    Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    There are better methods, like reducing to a much smaller interval, e.g.
    to [-1/64,+1/64], maybe even to [-1/128,+1/128]. The details of the trade-off
    between the size of the reduction table and the length of the polynomial
    depend on how often you plan to use your sin() function.

    See
    https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReduction.pdf
    for a random reference grabbed off the net how to do it.

    [...]

    Time cost: roughly 450 clock cycles (penalties are steep on this
    one).

    What is your accuracy?

    Before asking that, it's worth asking about the expected range of the
    argument.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Dec 17 10:40:16 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Running some stats for misaligned access in my emulator (percentage of
    total memory accesses, organized by type).


    Initial bootup:
    Word : 7.67% DWord: 0.13% QWord: 9.53%

    As the system is under your total control, you should be able
    to find out where this comes from. Does the compiler not place
    words correctly, how do you align your structures, do you use
    misaligned large words for memcpy, ... ?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Dec 17 15:42:58 2025
    From Newsgroup: comp.arch

    According to BGB <cr88192@gmail.com>:
    On 12/16/2025 7:44 PM, John Levine wrote:
    S/360 required strict boundary alignment, floats on 4 byte boundaries, doubles on
    8 byte boundaries. Normally the Fortran compiler aligned data correctly but you
    could use an EQUIVALENCE statement to create misaligned doubles, something that
    occasionally happened in code ported from 709x where it wasn't a problem. There
    was a library trap routine to fix it but it was stupendously slow, so IBM added
    the "byte oriented operand" feature to the 360/85 and later models. ...

    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...

    I suspect it is not a coincidence that the 360/85 introduced both a cache and misaligned accesses. Since the cache lines are bigger than words, I would think a cache would greatly decrease the cost of the extra memory references.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 17 16:18:24 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    The ARM Lead Architect and Fellow Richard Grisenthwaite
    presents "Lessons from the ARM Architecture".

    https://www.scribd.com/document/231464485/ARM-RG

    I found the lessons not very enlightening, and also too abstract.
    Maybe he presented concrete cases to support the lessons in the audio
    track that accompanied the slides, but the slides were too general for
    my taste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Dec 17 16:21:36 2025
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    I disagree. Trap-and-emulate may be too slow for a feature that you
    want programmers to use in hot paths on current CPUs, but there are
    other cases. In particular, for a feature that cannot be implemented
    properly yet, if you provide it as a trap-and-emulated instruction in
    the current generation, and a faster implementation of the instruction
    in a future implementation of the architecture, programmers will be
    much less reluctant to use that instruction when its implementation is
    fast on the dominant implementation of the day than if it just
    produces a SIGILL or somesuch on older chips (or new, but
    feature-reduced chips). Intel's marketing does not understand that,
    that's why they are selling feature-reduced versions of chips that
    have AVX and AVX-512 in hardware.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 12:49:34 2025
    From Newsgroup: comp.arch

    On 12/17/2025 4:40 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    Running some stats for misaligned access in my emulator (percentage of
    total memory accesses, organized by type).


    Initial bootup:
    Word : 7.67% DWord: 0.13% QWord: 9.53%

    As the system is under your total control, you should be able
    to find out where this comes from. Does the compiler not place
    words correctly, how do you align your structures, do you use
    misaligned large words for memcpy, ... ?


    The most likely source of this particular pattern is LZ4 decompression.
    Binaries and a lot of other data use LZ4.
    Compression is mostly used here to reduce IO to the SD card.

    RP2 decompression also uses misaligned access, but it is primarily QWORD based. The same goes for the scheme used for pagefile compression (which
    uses natively aligned 16-bit words but misaligned QWords).


    My C compiler follows usual C ABI alignment rules, but both memcpy and
    LZ decompression are written to assume misaligned access, because this
    is faster than using byte loads/stores on my core.


    Some paths for memcpy may use MOV.X (Pair), but this instruction is only
    valid with an 8 byte alignment (and gives best performance with a
    16-byte alignment). So, generic/small cases use QWord.
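
    As a sketch of the sort of inner loop this enables (illustrative only, in
    the spirit of the pseudo-code earlier in the thread; it assumes the core
    handles misaligned access natively and, like many LZ inner loops, that
    over-copying up to 7 bytes past 'n' is acceptable):

    #include <stddef.h>

    static void copy_qwords(unsigned char *dst, const unsigned char *src, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i += 8)
        {
            /* QWord copy regardless of the alignment of dst/src */
            *(unsigned long long *)(dst + i) =
                *(const unsigned long long *)(src + i);
        }
    }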

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Dec 17 13:51:31 2025
    From Newsgroup: comp.arch

    On 12/17/2025 10:21 AM, Anton Ertl wrote:
    jgd@cix.co.uk (John Dallman) writes:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    I disagree. Trap-and-emulate may be too slow for a feature that you
    want programmers to use in hot paths on current CPUs, but there are
    other cases. In particular, for a feature that cannot be implemented properly yet, if you provide it as a trap-and-emulated instruction in
    the current generation, and a faster implementation of the instruction
    in a future implementation of the architecture, programmers will be
    much less reluctant to use that instruction when its implementation is
    fast on the dominant implementation of the day than if it just
    produces a SIGILL or somesuch on older chips (or new, but
    feature-reduced chips). Intel's marketing does not understand that,
    that's why they are selling feature-reduced versions of chips that
    have AVX and AVX-512 in hardware.


    Pretty much.

    This is why I am promoting it for cold path / infrequent cases.

    But, for pretty much anything that happens more often than around 0.001%
    of instructions, in most cases you don't want trapping.

    If it is rarer than that (0.001%, or 0.0001%), then trapping may make sense.


    As a possible way to support obscure legacy features, or optional or
    "not widely supported in hardware" features, it makes sense.

    In these cases, "still runs, but slowly" being preferable to "just
    crashes due to an unsupported instruction" or similar.


    For something like AVX-512, or RISC-V's V extension, it could likely
    make more sense to go to a "trap and full emulation" path. In this case, rather than always faulting on an instruction and immediately returning control to the application, it could instead handle blocks of
    instructions with a tracing JIT and then only return to normal flow of
    control once "the coast is clear" (behaving as-if the JIT'ed
    instructions had been run natively on the CPU in question).

    Or, one possible strategy here (for normal applications) being to "hot
    patch" the application to effectively encode branches into the JIT'ted
    code sequences over the top of the offending instructions.

    So, for example, quietly turning any offending RV V instructions into
    JAL's or similar (with the JIT trace then jumping back into the normal instruction sequence afterwards as-if nothing had happened).

    This could be better for performance, but less transparent (because the application could potentially see where its code had been hot-patched).
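
    As a rough sketch of the patch step itself (hypothetical helper names; it
    assumes the stub lands within the +/-1MB reach of a RISC-V JAL and that
    the code page is writable through whatever aliasing trick is in use):

    #include <stdint.h>

    /* Encode a RISC-V JAL rd, offset (J-type immediate packing). */
    static uint32_t encode_jal(unsigned rd, int32_t off)
    {
        uint32_t u = (uint32_t)off;
        uint32_t imm20    = (u >> 20) & 0x1;
        uint32_t imm10_1  = (u >> 1)  & 0x3FF;
        uint32_t imm11    = (u >> 11) & 0x1;
        uint32_t imm19_12 = (u >> 12) & 0xFF;
        return (imm20 << 31) | (imm10_1 << 21) | (imm11 << 20) |
               (imm19_12 << 12) | ((uint32_t)rd << 7) | 0x6F;
    }

    /* Replace the offending instruction at 'pc' with a jump into 'stub'. */
    static void hot_patch(uint32_t *pc, void *stub)
    {
        int32_t off = (int32_t)((char *)stub - (char *)pc);
        *pc = encode_jal(0, off);   /* JAL X0, stub  (plain jump, no link) */
        /* then a FENCE.I (or equivalent) so the I$ sees the new code */
    }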


    For something like a VM, it would make more sense to not hot-patch, and
    slower than hot-patching, but would be more transparent as it would not
    modify the original memory.

    Well, or possibly use some sort of hardware double-mapping where the
    pages seen by the D$ and I$ differ, so it "looks" as if the VM is still running the original code, but then the CPU's I$ can see a different set
    of hot-patched pages.

    Well, in my ISA, this could be done in theory by using 96-bit mode and
    putting GBH and PCH in two separate address spaces that look basically
    the same, differing solely in that PCH may point to an address space
    with hot-patched pages.

    Another possibility could be to hack it in the MMU by having separate
    DTLB and ITLB entries:
    Normal TLB Entry: Matches on either D$ or I$
    DTLB: Only matches for D$
    ITLB: Only matches for I$

    Then the TLB Miss handler can special-case the handling of hot-patched
    pages (might consider this if I get around to implementing the recent
    LLPTLB idea; both would have the "feature" of effectively creating
    multiple sub-types of TLB entries).
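
    The match rule being described could be sketched roughly like this
    (hypothetical structure and field names, just to make the three entry
    sub-types concrete):

    enum { TLBE_BOTH, TLBE_DATA_ONLY, TLBE_FETCH_ONLY };

    typedef struct {
        unsigned long long vpn;   /* virtual page number (plus ASID, etc.) */
        unsigned long long pte;
        int kind;                 /* one of the three sub-types above */
    } TlbEntry;

    static int tlbe_matches(const TlbEntry *e, unsigned long long vpn, int is_ifetch)
    {
        if (e->vpn != vpn)
            return 0;
        if (e->kind == TLBE_DATA_ONLY && is_ifetch)
            return 0;                        /* DTLB entry: no hit for I$ */
        if (e->kind == TLBE_FETCH_ONLY && !is_ifetch)
            return 0;                        /* ITLB entry: no hit for D$ */
        return 1;
    }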


    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 19:53:23 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 12/16/2025 3:02 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 12/15/2025 1:37 PM, John Dallman wrote:
    In article <HqX%Q.42014$tt1a.10258@fx47.iad>, scott@slp53.sl.home (Scott Lurndal) wrote:

    "Lessons from the ARM Architecture"

    Non-Scribd:
    <https://studylib.net/doc/8671203/lessons-from-the-arm-architecture>

    Note that this is from 2010 and does not discuss ARM64.

    One comment:

    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    IME, it depends:
    If the scenario is semi-common, it doesn't work out so well;
    If it is rare enough, it can be OK.

    One must also take into consideration the relative time of each path.
    For example, I have Transcendental instructions in my ISA where I can perform FP64×{SIN, COS, Ln2, 2**} in {19, 19, 14, 15} cycles respectively. I make a rounding error about once every 378 calculations. Running a SW routine that gives correct rounding is 300-500 cycles.

    So, if one can take the Incorrect-Rounding exception in less than 100 cycles (round trip), trap-and-emulate adds only 1 cycle on average
    to the running of the transcendentals.



    In my case, handling of "sin()" looks sorta like:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;
    t =th*sintab_c00; x=th*th2;
    t+=x*sintab_c01; x*=th2;
    t+=x*sintab_c02; x*=th2;
    t+=x*sintab_c03; x*=th2;
    t+=x*sintab_c04; x*=th2;
    t+=x*sintab_c05; x*=th2;
    t+=x*sintab_c06; x*=th2;
    t+=x*sintab_c07; x*=th2;
    t+=x*sintab_c08; x*=th2;
    t+=x*sintab_c09; x*=th2;
    t+=x*sintab_c10; x*=th2;
    t+=x*sintab_c11; x*=th2;
    t+=x*sintab_c12; x*=th2;
    t+=x*sintab_c13; x*=th2;
    t+=x*sintab_c14; x*=th2;
    t+=x*sintab_c15; x*=th2;
    t+=x*sintab_c16; x*=th2;
    t+=x*sintab_c17; x*=th2;
    t+=x*sintab_c18; x*=th2;
    t+=x*sintab_c19; x*=th2;
    t+=x*sintab_c20; x*=th2;
    return(t);
    }

    Wow, so big, so slow, it is no wonder it is not used "often":
    42 multiplies. Whereas my 19-cycle version includes Payne &
    Hanek argument reduction (3 cycles) and needs only 7 multiplies,
    using Chebyshev coefficients.

    Time cost: roughly 450 clock cycles (penalties are steep on this one).

    This would be roughly 3x slower if handled with a trap and a special instruction.

    Does it make sense to add an FSIN instruction? No, as FSIN is nowhere
    near common enough to make much impact on code size (and, FSIN will be invariably slower).

    Obviously, doing it your way is ineffective, but when SIN() costs no
    more than FDIV, it should be included.

    How much space does it take? Around 460 bytes.

    4-bytes.

    This is assuming DAZ/FTZ, but denormals shouldn't happen unless ang is pretty close to an integer multiple of M_TAU (2*M_PI).

    I include denorms, no flush to zero, and use 128 bits of 2/pi for argument reduction, giving more than 64 bits of accuracy in the reduced argument.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 19:55:38 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 20:06:57 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after
    reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    To be clear, my coefficients are not restricted to 53-bits like a SW implementation.

    There are better methods. Like reducing to much smaller interval, e.g.
    to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details of trade
    off between size of reduction table and length of polynomial depend on
    how often do you plan to use your sin() function.

    With every shrink of the argument range, the table size blows up
    exponentially. For my transcendentals, the combined table sizes
    are about the same as the table sizes for FDIV/FSQRT when using
    Goldschmidt iteration using 11-bit in, 9 bit out tables.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Dec 17 23:18:57 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 20:06:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    To be clear, my coefficients are not restricted to 53-bits like a SW implementation.

    There exist tricks that can achieve the same in software, sometimes at
    the cost of one additional FP op and sometimes even for free. The latter
    is especially common when FMA costs the same as FMUL.


    There are better methods. Like reducing to much smaller interval,
    e.g. to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details
    of trade off between size of reduction table and length of
    polynomial depend on how often do you plan to use your sin()
    function.

    With every shrink of the argument range, the table size blows up exponentially. For my transcendentals, the combined table sizes
    are about the same as the table sizes for FDIV/FSQRT when using
    Goldschmidt iteration using 11-bit in, 9 bit out tables.

    Software trade-offs are different.
    Assuming that the argument is already in the [0:+pi/2] range, reduction down
    to [-1/128:+1/128] requires pi/2*64 ≈ 100 table entries. Each entry
    occupies, depending on the format used, 18 to 24 bytes, so 1800 to 2400
    bytes total. That size fits very comfortably in the L1D cache, so from the
    perspective of improving hit rate there is no incentive to use a
    smaller table.

    At first glance, reduction to [-1/128:+1/128] appears
    especially attractive for an implementation that does not aim for very
    high precision. Something like 0.65 ULP looks attainable with a poly of the
    form (A*x**4 + B*x**2 + C)*x. That's just my back-of-the-envelope estimate
    in the late evening hours, so I can be wrong about it. But I also can be
    right.
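
    A minimal software sketch of that general shape of scheme (illustrative
    only: the residual polynomials below are just leading Taylor terms rather
    than tuned minimax/Chebyshev coefficients, so it will not reach the ULP
    figures being discussed; it also assumes the argument has already been
    reduced to [0, pi/2]):

    #include <math.h>

    static double sin_tab[102], cos_tab[102];   /* sin/cos at multiples of 1/64 */

    static void init_tabs(void)
    {
        int k;
        for (k = 0; k <= 101; k++)
        {
            sin_tab[k] = sin(k / 64.0);         /* built once, here with libm */
            cos_tab[k] = cos(k / 64.0);
        }
    }

    /* sin(x) for x in [0, pi/2]: table point k/64 plus a tiny residual r. */
    double sin_small(double x)
    {
        int k = (int)(x * 64.0 + 0.5);          /* nearest table point */
        double r = x - k * (1.0 / 64.0);        /* |r| <= ~1/128 */
        double r2 = r * r;
        double sr = r * (1.0 - r2 * (1.0 / 6.0));    /* ~sin(r) */
        double cr = 1.0 - r2 * 0.5;                  /* ~cos(r) */
        return sin_tab[k] * cr + cos_tab[k] * sr;    /* angle addition */
    }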

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 17 21:36:28 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 20:06:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 17 Dec 2025 07:59:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    BGB <cr88192@gmail.com> schrieb:

    double sin(double ang)
    {
    double t, x, th, th2;
    int i;
    i=ang*M_TAU_R;
    th=ang-(i*M_TAU);
    th2=th*th;

    That leaves a lot of room for improvement; you would be better
    off reducing to [-pi/4,+pi/4].

    It's not what I'd do in practice. The degree of poly required after reduction to [-pi/4,+pi/4] is way too high. It seems, you would need Chebyshev poly of 15th degree (8 odd terms) just to get what Mitch
    calls 'faithful rounding'. For something better, like 0.51 ULP, you
    would need one more term.

    To be clear, my coefficients are not restricted to 53-bits like a SW implementation.

    There exists tricks that can achieve the same in software, sometimes at
    cost of one additional FP op and sometimes even for free. The latter
    esp. common when FMA costs the same as FMUL.


    There are better methods. Like reducing to much smaller interval,
    e.g. to [-1/64,+1/64]. May be, even to [-1/128,+1/128]. The details
    of trade off between size of reduction table and length of
    polynomial depend on how often do you plan to use your sin()
    function.

    With every shrink of the argument range, the table size blows up exponentially. For my transcendentals, the combined table sizes
    are about the same as the table sizes for FDIV/FSQRT when using
    Goldschmidt iteration using 11-bit in, 9 bit out tables.

    Software trade offs are different.

    Indeed. To SW, memory is "about" free, whereas to HW, ROM is NOT free.

    Assuming that argument is already in [0:-pi/2] range, reduction down
    to [-1/128:+1/128] requires pi/2*64=100 table entries.

    The IEEE 754-2019 specs indicate the argument range to SIN(x) has
    x in the range {-infinity..+infinity}. So, you need to count the cycles
    it takes you to go from {-I..+I} to {-pi/4..+pi/4} or {-½..+½}, and
    then the top several bits of the fraction index the coefficient
    tables.

    FP32 can use 128-entry tables and a Quadratic (C0+C1*x+C2*x^2) or
    can use 16-entry tables and a Cubic (C0+C1*x+C2*x^2+C3*x^3)

    FP64 can use a 128-entry table and a Quartic (C0+C1*x+C2*x^2+C3*x^3+C4*x^4)

    This one table for SIN is larger than the combined table sizes for
    {SIN, COS, TAN, ATAN, ASIN, ACOS, Ln, Ln2, Log, LnP1, Ln2P1, LOGP1,
    exp, exp2, 10**, expM1, exp2M1, 10**M1, pow, ATAN2}

    Although SIN, being odd, can use x**k :: k odd
    COS, being even, x**k :: k even

    Each entry
    occupies, depending on used format, 18 to 24 bytes. So, 1800 to 2400
    bytes total. That size fits very comfortably in L1D cache, so from perspectives of improvement of hit rate there is no incentive to use
    smaller table.

    In HW there is no LD instruction needed to "get" the coefficients;
    there is just the plethora of FMACs (and equivalents).

    At the first glance, reduction to [-1/128:+1/128] appears
    especially attractive for implementation that does not look for very
    high precision. Something like 0.65 ULP looks attainable with poly in
    form (A*x**4 + B*x**2 + C)*x. That's just my back of envelop estimate
    in late evening hours, so I can be wrong about it. But I also can be
    right.

    I have a spreadsheet to do all of this for me--including determining the Chebyshev coefficients for orders {1,2,3,4,5,6,7,8,9} and graphing the
    number of bits of precision.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Thu Dec 18 00:43:35 2025
    From Newsgroup: comp.arch

    On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not
    supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.

    No doubt another lesson learned from instruction traces: misaligned
    accesses occurred so rarely, it made sense to simplify the hardware by
    leaving them out.

    The same conclusion was drawn about integer multiplication and
    division in those early days, wasn't it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Dec 18 01:34:05 2025
    From Newsgroup: comp.arch

    It appears that John Dallman <jgd@cix.co.uk> said:
    ] * "Trap and Emulate” is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    For at least 20 years IBM's mainframes have used what they call millicode. The relatively simple instructions are implemented in hardware, and everything else in millicode, including the more complicated instructions, I/O, and other system features. Millicode runs on the same CPU using the same instruction set as regular code, with some extra registers and instructions to handle aspects of the hardware not visible to regular programs. It is stored in dedicated memory which is loaded at boot time so it's easy to update.

    I gather there have been instructions that were implemented in millicode, then moved into hardware in the next CPU generation since they were used enough for the speed to matter.

    Here's a 2012 slide deck:

    https://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_metro_naspa_2012_10_what_and_why_of_system_z_millicode.pdf
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From moi@findlaybill@blueyonder.co.uk to comp.arch on Thu Dec 18 03:44:51 2025
    From Newsgroup: comp.arch

    On 18/12/2025 01:34, John Levine wrote:
    It appears that John Dallman <jgd@cix.co.uk> said:
    ] * "Trap and Emulateďż˝ is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    For at least 20 years IBM's mainframes have used what they call millicode. The
    relatively simple instructions are implemented in hardware, and everything else
    in millicode, including the more complicated instructions, I/O, and other system features. Millicode runs on the same CPU using the same instruction set
    as regular code with some extra registers and instructions to handle aspects of
    the hardware not visible to regular programs. It is stored in dedicated memory
    which is loaded at boot time so it's easy to update.

    I gather there have been instructions that were implemented in millicode, then
    moved into hardware in the next CPU generation since they were used enough for
    the speed to matter.

    Here's a 2012 slide deck:

    https://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_metro_naspa_2012_10_what_and_why_of_system_z_millicode.pdf


    Typical IBM boosterism of a minor variant of an existing technique,
    as used (for range compatibility, etc) by the ICT 1900 Series in 1965
    and before that (for h/w economy) by the Ferranti Atlas and Orion.
    --
    Bill F.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Dec 18 14:41:31 2025
    From Newsgroup: comp.arch

    On 12/17/2025 6:43 PM, Lawrence D'Oliveiro wrote:
    On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not
    supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.

    No doubt another lesson learned from instruction traces: misaligned
    accesses occurred so rarely, it made sense to simplify the hardware by leaving them out.


    They occur rarely, or not at all, if avoided.

    They seem to occur (on average) for 1-5% of the loads/stores if the code
    makes use of them in cases where doing so would be beneficial.

    Well, this is because the naive/portable approaches can often be unacceptably slow.

    Granted, dealing with misaligned access does add cost and complexity to
    the L1 D$ (particularly due to the case of dealing with misaligned
    values crossing a cache-line boundary).


    The same conclusion was drawn about integer multiplication and
    division in those early days, wasn't it.


    Some timings from my case:
    32 bit multiply (3 cycle latency):
    0.70% of clock cycles;
    MOD and DIV (36 and 68 cycles):
    0.36%

    Kinda need multiply, but divide is a bit more optional, and doesn't need
    to be all that fast.

    However, if DIV and MOD were implemented using traps, they are still
    common enough to where there would be reason to care (this would still
    have an obvious performance impact).

    In the absence of hardware DIV/MOD, the better option is mostly to
    handle it with runtime calls.
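
    For example, a runtime helper of roughly this sort (a minimal
    restoring-divide sketch with a made-up name; a real runtime would
    typically special-case small divisors and powers of two):

    #include <stdint.h>

    uint64_t soft_udiv64(uint64_t n, uint64_t d)
    {
        uint64_t q = 0, r = 0;
        int i;
        if (d == 0)
            return ~(uint64_t)0;        /* divide-by-zero policy is ABI-specific */
        for (i = 63; i >= 0; i--)
        {
            r = (r << 1) | ((n >> i) & 1);
            if (r >= d)
            {
                r -= d;
                q |= (uint64_t)1 << i;
            }
        }
        return q;                       /* the remainder is left in r */
    }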



    Otherwise, went and added some special case logic to allow for
    transparent hot-patching via TLB trickery. Was actually simpler/cheaper
    than it seemed at first.

    Does have some restrictions though, namely in that it will only work
    with read-only pages.

    Ended up adding a special case with the Dirty flag:
    D+NR+NW: X only, no hit for D$
    D+NX+NW: R only, no hit for I$
    But:
    D+NX: Normal Read/Write Memory
    D (alone): Normal R/W/X Memory

    The D / Dirty flag is used for PTEs, but is not changed by HW. In effect
    its main use would be more for write barriers (setting up and handling a
    trap the first time the page is written to).



    There is also the issue that one can't encode a branch to an arbitrary address in
    32 bits. In effect, hot-patching in this way would need somewhere to
    branch to that could be put within the window of what is reachable.

    For plain RISC-V, there is another issue:
    There is no way to do longer-distance branches that won't stomp a register.

    Jumbo prefixes and XG3 at least allow other options:
    XG3's branches have a 32MB range, so more likely to be able to reach
    something as most binaries are not that large (but, N/E in RV64GC mode);
    Jumbo Prefixes: Can encode a +/- 4GB branch in 64 bits (but then needs
    to patch at least 2 instructions).

    Another issue being that any such logic needs to be able to operate with
    zero free registers, so at least in this sense isn't much better off
    than an interrupt handler. But, main difference being that any hot
    patching doesn't need to decode an instruction and can be a special
    sequence representing the instructions that originally generated the
    trap (rather than a general purpose handler).

    In the relevant ABIs, could assume that memory below SP is always safe
    to use though (to save/restore any working registers).



    In premise, one could put the hot-patching area before the loaded
    binary, but generally this would only be usable (in RV64GC or similar)
    if ".text" is somewhat less than 1MB.

    Likely would make sense to handle it as, say:
    SD X1, -8(SP)
    SD X5, -16(SP)
    LUI X5, AddrHi
    JALR X1, DispLo(X5)
    LD X5, -16(SP)
    LD X1, -8(SP)
    JAL X0, RetAddr

    With another area (somewhere in the low 2GB) handling the actual traces (trying to keep the area just before ".text" mostly limited to
    trampoline handlers, except for extra short sequences).

    Or, AUIPC if the handler is placed +/- 2GB from this table.


    Or, loading a 64-bit address from memory and then possibly running the
    handler code in XG3 mode (would have access to 128-bit arithmetic and
    some other things lacking in RV mode).

    Likely would make sense to handle it as, say:
    SD X1, -8(SP)
    SD X5, -16(SP)
    AUIPC X5, AddrHi
    LD X5, DispLo(X5) //load address of entry point, PC-rel (*1)
    JALR X1, Disp2(X5)
    LD X5, -16(SP)
    LD X1, -8(SP)
    JAL X0, RetAddr

    *1: Also no way in RV64GC to directly include a PC-rel load in a single instruction, so need an AUIPC to do so. In this case, need to jump
    through a full 64-bit pointer to be able to perform the mode switch (AUIPC+JALR would merely branch within RV64GC mode).

    Though, in some cases could make sense to keep the handlers in RV64G
    mode, in which case no mode-change is needed.

    ...


    The initial setup for these cases would likely be the same as that for ("normal") trap and emulate, just with the option of replacing some instructions with alternative handlers that are not quite as inefficient...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 18 21:17:28 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 19:55:38 GMT, MitchAlsup wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 17 Dec 2025 00:51:17 -0600, BGB wrote:

    Misaligned access is common enough here that, if it were not
    supported natively, this would likely tank performance...

    Still there are/were some architectures that refused to support it.

    There are smart fools everywhere.

    No doubt another lesson learned from instruction traces: misaligned
    accesses occurred so rarely, it made sense to simplify the hardware by leaving them out.

    The same conclusion was drawn about integer multiplication and
    division in those early days, wasn't it.

    My RISC 1st gen processor had 3-cycle integer multiply, and 35-cycle
    division.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Dec 19 03:25:27 2025
    From Newsgroup: comp.arch

    It appears that moi <findlaybill@blueyonder.co.uk> said:
    On 18/12/2025 01:34, John Levine wrote:
    It appears that John Dallman <jgd@cix.co.uk> said:
    ] * "Trap and Emulateďż˝ is an illusion of compatibility"
    ] * Performance differential is too great for most applications

    This is inevitably true nowadays, but wasn't when the idea was invented.

    For at least 20 years IBM's mainframes have used what they call millicode. The
    relatively simple instructions are implemented in hardware, and everything else
    in millicode, ...

    Typical IBM boosterism of a minor variant of an existing technique,
    as used (for range compatibility, etc) by the ICT 1900 Series in 1965
    and before that (for h/w economy) by the Ferranti Atlas and Orion.

    I don't think they ever claimed it was a new idea, but they're definitely still using it in computers that they are selling today.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2