• VLIW Architecture of Google TPUs

    From John Levine@johnl@taugh.com to comp.arch on Wed Jun 17 02:07:46 2026
    From Newsgroup: comp.arch

    Here's a preprint of an IEEE Micro article:

    Google's Training Supercomputers from TPU v2 to Ironwood:
    Architectural Stability, Scale, Resilience, Power Efficiency,
    and Sustainability Across Five Generations

    They call the architecture VLIW which I don't think is quite
    right -- they do indeed have wide instruction words but I don't
    think they do speculative execution.

    https://arxiv.org/abs/2606.15870
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 17 15:12:44 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    They call the architecture VLIW which I don't think is quite
    right -- they do indeed have wide instruction words but I don't
    think they do speculative execution.

    Speculative execution in a VLIW (and EPIC) architecture with in-order implementation (the usual case) depends on the compiler reordering the instructions. Some architectures have architectural support for
    speculative execution, e.g., the speculative load of IA-64, but these architectures tend to go by the label EPIC rather than VLIW. IIRC
    classic VLIWs like the Cydrome Cydra 5 or the Multiflow machines do
    not have architectural support for speculation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jun 17 16:05:55 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    Here's a preprint of an IEEE Micro article:

    Google's Training Supercomputers from TPU v2 to Ironwood:
    Architectural Stability, Scale, Resilience, Power Efficiency,
    and Sustainability Across Five Generations

    They call the architecture VLIW which I don't think is quite
    right -- they do indeed have wide instruction words but I don't
    think they do speculative execution.

    It is more like clocked (or lock-step) data flow architectures,
    without the natural asynchronicity in data flow.
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jun 17 15:22:35 2026
    From Newsgroup: comp.arch

    On 6/17/2026 11:05 AM, Scott Lurndal wrote:
    John Levine <johnl@taugh.com> writes:
    Here's a preprint of an IEEE Micro article:

    Google's Training Supercomputers from TPU v2 to Ironwood:
    Architectural Stability, Scale, Resilience, Power Efficiency,
    and Sustainability Across Five Generations

    They call the architecture VLIW which I don't think is quite
    right -- they do indeed have wide instruction words but I don't
    think they do speculative execution.

    It is more like clocked (or lock-step) data flow architectures,
    without the natural asynchronicity in data flow.


    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.


    If limited to strictly one operation per cycle (or less), there could potentially be another option, like, say:
    Mem/Mem/Mem
    Or, basically:
    Load, Load, Op, Store
    Within a single instruction.
    Per-cycle, with any additional logic mostly for addressing, but could be reduced by having more addressing modes (with double-indirect and double-indirect auto-increment, could potentially eliminate the need for
    CPU registers).
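    A rough sketch of the "Load, Load, Op, Store" idea, assuming a flat word-addressed memory modeled as a Python list. The `step` function, the `OPS` table, and the three-address operand order are made up for illustration and do not describe any real ISA:

```python
import operator

# Hypothetical opcode table for a memory-to-memory machine.
OPS = {"ADD": operator.add, "SUB": operator.sub, "AND": operator.and_}

def step(mem, op, dst, src1, src2):
    """Execute one Mem/Mem/Mem instruction: load, load, op, store."""
    a = mem[src1]            # first load
    b = mem[src2]            # second load
    mem[dst] = OPS[op](a, b) # operate and store, no CPU registers involved

mem = [0, 2, 3, 0]
step(mem, "ADD", dst=3, src1=1, src2=2)
```

    Every instruction touches memory three times, which is where the concern about memory-access latency comes from.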

    But, say, would fall on its face as soon as one allows superscalar, and
    would be nearly impossible to scale. Even getting 1 IPC could end up
    being a feat of engineering.

    Any advantage would also be lost in the face of memory-access latency.



    But, once the possibility of superscalar is in play, the balance would
    shift highly towards register machines.

    Then the tradeoff becomes more one of "register width" vs "machine width".

    Traditional SIMD: Favors wider registers to reduce instruction counts,
    but the "ever wider SIMD" inclination seems to follow the implicit
    assumption that one can't do superscalar on the SIMD ops.

    Like, the advantage of 256 or 512 bit SIMD would mostly evaporate if you assume that 2 or 4 128-bit SIMD operations could potentially be done in
    a single clock-cycle (and, if the wider SIMD doesn't match the native
    data width well, it could actually become a liability).


    Though, one could argue about the relative cost of register ports, say,
    for example, if each register port remains the same maximum width, then pushing larger SIMD vectors would necessarily require more register
    ports which may scale more steeply than increasing register-port width.


    Well, and/or one ends up with a CPU where the port-count (and max ILP)
    remains fixed, so then using narrower operations comes at a cost of only
    using part of the registers (and there may be penalties if multiple instructions use virtual registers that correspond to the same machine register).


    Say, for example:
    Machine is 6R3W with native 128-bit ports, which are divided in half for 64-bit ops.
    So, say:
    ADD R10, R23, R8
    ADD R11, R29, R9
    Can't co-execute because, at the HW register level, both are accessing
    the same logical registers (well, unless additional logic existed to
    subdivide access to these 128-bit ports).

    For reads, an additional bit in the register specifier would select which
    half to use; for writes, a small field would select which part to update.

    Say:
    00: Update the low 64-bits (64b op);
    01: Update the high 64-bits (64b op);
    10: Update both (128b op);
    11: -
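
    The write-select encoding above could be modeled roughly as follows. This is only a sketch: the `write_port` name is invented, and it assumes a 64-bit result arrives in the low 64 bits of `value`:

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def write_port(old128: int, value: int, sel: int) -> int:
    """Return the new 128-bit register contents after a write with select `sel`."""
    if sel == 0b00:  # update the low 64 bits, keep the high half
        return (old128 & (MASK64 << 64)) | (value & MASK64)
    if sel == 0b01:  # update the high 64 bits, keep the low half
        return (old128 & MASK64) | ((value & MASK64) << 64)
    if sel == 0b10:  # update all 128 bits
        return value & MASK128
    raise ValueError("select 0b11 is left unassigned in the list above")
```

    With something like this, two 64-bit ops that target different halves of the same 128-bit port would no longer have to conflict.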



    Or, could instead do "multi-ganging" in the decoder:
    ADD R12, R24, R16
    ADD R13, R25, R17
    ADD R14, R26, R18
    ADD R15, R27, R19
    Being special-case recognized as all mapping the same
    operation to the same machine registers (and thus behaving as a single-cycle 256-bit SIMD op to the CPU).

    Though, can note that a similar trick has been considered for implied 128-bit SIMD ops in my ISA, but the relative merit is lower when it
    already has explicit 128-bit SIMD ops that would cover mostly the same
    cases (except for 64b integer SIMD, which could still use such pairing
    in this scheme; and even when unpaired is still typically the most
    efficient way to use conventional superscalar).

    In the 2-wide case, this mostly means detecting cases where both instructions
    map to appropriate FUs, and where the registers follow the correct
    patterns (high-bits equal but low-bits differ, and no output-input dependencies on the same machine physical register).
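
    A minimal sketch of that 2-wide check, assuming 64-bit architectural registers where two registers differing only in bit 0 share one 128-bit machine register. The `Insn` tuple and the `can_gang` name are hypothetical, and the FU-availability part of the test is left out:

```python
from typing import NamedTuple

class Insn(NamedTuple):
    op: str
    rd: int
    rs1: int
    rs2: int

def same_machine_reg(x: int, y: int) -> bool:
    """High bits equal, low bit differs: both halves of one 128-bit register."""
    return (x ^ y) == 1

def can_gang(a: Insn, b: Insn) -> bool:
    """True if two adjacent ops could issue as one double-width op."""
    if a.op != b.op:
        return False
    if not (same_machine_reg(a.rd, b.rd)
            and same_machine_reg(a.rs1, b.rs1)
            and same_machine_reg(a.rs2, b.rs2)):
        return False
    # No output-input dependency: the second op must not read the first op's result.
    return a.rd not in (b.rs1, b.rs2)
```

    Under these assumptions, "ADD R12, R24, R16" followed by "ADD R13, R25, R17" passes, while the earlier "ADD R10, R23, R8" / "ADD R11, R29, R9" pair fails because R23 and R29 differ in more than the low bit.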



    But, a 6R3W 128-bit machine could do 3R1W for 256-bit (or 4R2W could do
    2R1W ops); or an 8R4W machine could potentially do native 512-bit vector
    ops.

    Though, wider SIMD registers do also avoid the drawback that every
    time the effective register width is increased by pairing, it reduces
    the number of available registers at the larger width (so, if you wanted
    32x 256-bit registers, would need 128x 64-bit; or end up with a skewed
    space where part of the register space that exists at wider widths is
    not as easily accessible at the narrower widths).

    Though, in practice, this doesn't seem like a huge loss (since it
    doesn't actually necessarily increase the number of physical registers
    or register ports, just means that patterns which don't fit the HW
    resources come at an ILP cost).


    Either way, the relative cost at a 64b/128b scale is small (and 256b
    seems to be diminishing returns relative to 128b).


    But, alas, my approach does still allow:
    PMULX.F R40, R48, R52
    PMULX.F R42, R50, R24
    To potentially behave as a 256-bit vector if the hardware allows it.

    ...


    Dunno how well any of this would map to OoO.
    But, if one is doing OoO, they can presumably still afford the logic to
    detect "the next two or four instructions represent a single SIMD
    operation at a larger width"; which could likely be handled during IF
    (so, before it hits the ROB and so-on).

    But, not sure why this approach isn't more popular (vs, say, "ever wider
    SIMD registers" or vector ISAs).




    But, yeah, can note that recently I did get around to getting ALUX
    support re-implemented for XG3, but ended up adding a new set of 128-bit compare ops (to match the more RV-like patterns that XG3 uses).

    In this case, ended up using 64-bit encodings for the 128-bit CMPxx ops
    as they aren't used often enough to justify burning 32-bit encoding
    space on them (but are still common enough in live execution to make it undesirable to fake them via multi-op sequences).

    Doing 128-bit branch-compare still requires at least 2 ops though, as
    128b branch-compare is very rare and would be hard-pressed to justify
    the added cost (of encoding or allowing 128-bit inputs to the Bcc ops).

    Or, say:
    CMPEQ.X R10, R12, R5
    BNEZ R5, label
    Rather than, say:
    BEQ.X R10, R12, label

    Though, debatable, could consider allowing BEQ.X and similar as pseudo-instructions (likely decaying as above).
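
    How such a pseudo-instruction might decay in an assembler, as a rough sketch. The `expand` function is hypothetical, and using R5 as the scratch register simply copies the example above rather than reflecting any actual ABI rule:

```python
SCRATCH = "R5"  # scratch register borrowed from the example; an assumption

def expand(line: str) -> list[str]:
    """Expand a BEQ.X pseudo-instruction; pass anything else through unchanged."""
    mnemonic, _, rest = line.strip().partition(" ")
    if mnemonic != "BEQ.X":
        return [line.strip()]
    ra, rb, label = [field.strip() for field in rest.split(",")]
    # 128-bit compare into the scratch register, then branch on the result.
    return [f"CMPEQ.X {ra}, {rb}, {SCRATCH}", f"BNEZ {SCRATCH}, {label}"]
```

    The cost is that the pseudo-op silently clobbers a register, which is part of why it is debatable.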

    Or, I re-allow the XG2 ops if PRED is enabled, so it turns into,
    essentially:
    CMPEQ.X R10, R12
    BRA?T label

    Which has the merit of larger branch displacement and more compact encoding.

    Where "BRA?T label" is equivalent to "BT label", but more explicit about
    the encoding.

    Well, or one could use CMPXEQ, but this is mostly just a notation/ASM-level difference. Decided to start treating X as a type-suffix, rather than
    some of the infix-type notation sometimes used. But, this was itself a side-effect of dropping the "/" used in SH op naming patterns.
    "CMP/EQ" and "CMP.Q/EQ" (BJX1-64C) which collapsed to "CMPQEQ".

    But, at some point I decided that '/' in instruction names was ugly.


    Otherwise, almost half tempting to consider tweaking the Reg6->Imm33s
    case in XG3 to have a bit-pattern more consistent with the Imm10->Imm33s
    case, but, this would be another breaking change (and would also break
    from XG2).

    Say, as-is:
    Imm10 -> Imm33s: ckkkkkkkkjjjjjjjjjjjjjjjjiiiiiiii
    Reg6 -> Imm33s: daaackkkkkkkkjjjjjjjjjjjjjjjjiiii
    But, say:
    Reg6 -> Imm33s: ckkkkkkkkjjjjjjjjjjjjjjjjaaadiiii
    Meaning that less bits need to move around when gluing on the prefix (potentially saving some MUXing here).

    Then again, this would break consistency between the Reg6->Imm33s and Reg6->Imm17s patterns, which are consistent for the bits that overlap,
    etc, so likely better to not poke at it for now.


    ...


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 18 01:19:12 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 6/17/2026 11:05 AM, Scott Lurndal wrote:
    John Levine <johnl@taugh.com> writes:
    Here's a preprint of an IEEE Micro article:

    Google's Training Supercomputers from TPU v2 to Ironwood:
    Architectural Stability, Scale, Resilience, Power Efficiency,
    and Sustainability Across Five Generations

    They call the architecture VLIW which I don't think is quite
    right -- they do indeed have wide instruction words but I don't
    think they do speculative execution.

    It is more like clocked (or lock-step) data flow architectures,
    without the natural asynchronicity in data flow.


    An idle thought here is whether there is any "better" option than conventional register-machine designs.

    Consider a machine designed to crunch AI-like vector data.

    A vector can contain between 768-and-8192 datums.

    One has an HNSW tree of vectors in a vector data base.

    A query takes one <query> vector and compares it with a number of
    vectors contained in the DB. One comparison is the sum of vector-length multiplies (a dot product). So a <small> comparison requires 768 multiplies and 767 adds.

    Given that the VDB is going to be stored in some kind of persistent
    store (say FLASH), and given that the FLASH data rate is 5 GHz and
    4 beats per FP32, one can assign a FMAC per FLASH bus and do all the
    arithmetic on-line, never needing a single register, and consuming
    the data as fast as it streams out of the store. One simply builds
    the HW to perform this kind of work.

    Power savings:
    1) don't have to run the data up PCIe tree and into DRAM
    2) don't have to take an interrupt per vector
    3) don't have to read vector from DRAM into core <SRAM>
    4) don't have to read data from SRAM (cache) into register(s)
    5) perform FMACs at 5GHz

    In effect one converts 3072 bytes into 4 bytes as the data returned to the point of query, and converts 200-500 interrupts into 1.
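
    A software sketch of the streaming comparison, only to show the data reduction. The `stream_dot` function and the little-endian FP32 byte layout are assumptions for illustration; the hardware described would do this with an FMAC on the FLASH bus rather than in code:

```python
import struct

def stream_dot(query, stored):
    """Consume stored FP32 values 4 bytes at a time, one multiply-accumulate each."""
    acc = 0.0
    for i, q in enumerate(query):
        (x,) = struct.unpack_from("<f", stored, 4 * i)
        acc += q * x  # the only state kept is the running sum
    return acc

DIM = 768  # the <small> vector length from above
query = [1.0] * DIM
stored = struct.pack(f"<{DIM}f", *([0.5] * DIM))
result = struct.pack("<f", stream_dot(query, stored))
print(len(stored), "bytes in,", len(result), "bytes out")
```

    The stored vector is 768 x 4 = 3072 bytes, and only the 4-byte sum has to travel back to the point of query.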

    Sometimes the proper point of doing a calculation is not in the core.
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Jun 17 23:17:07 2026
    From Newsgroup: comp.arch

    On 6/17/2026 1:22 PM, BGB wrote:

    snip

    An idle thought here is whether there is any "better" option than conventional register-machine designs.

    There has been a fair amount of work on, and several working, analog
    neural network chips. I believe there has also been some work on
    digital chips which contain lots of "neurons" each with a little
    hardwired logic to sum the weighted inputs, compare to a threshold and
    output a signal if the threshold has been exceeded, sort of like real
    neurons do. The issue with these is the interconnect.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Thu Jun 18 07:45:02 2026
    From Newsgroup: comp.arch

    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than conventional register-machine designs.

    Back in the 1980s and earlier, there were many architectural
    possibilities being considered.

    Consider the Transputer: this consisted of a lot of CPU nodes, each
    with its own local memory. I guess they envisioned scaling up both
    numbers of CPUs as well as amount of memory in more powerful
    configurations, so the amount of memory per CPU stayed roughly
    constant.

    Why didn’t that work? Obviously, because CPU speeds increased disproportionately more than memory speeds. And so the total amount of
    memory has in reality been increasing much faster than the number of
    CPUs.

    And I don’t think you see NUMA in consumer machines; maybe in
    supercomputers.
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 03:50:52 2026
    From Newsgroup: comp.arch

    On 6/18/2026 2:45 AM, Lawrence D’Oliveiro wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    Back in the 1980s and earlier, there were many architectural
    possibilities being considered.

    Consider the Transputer: this consisted of a lot of CPU nodes, each
    with its own local memory. I guess they envisioned scaling up both
    numbers of CPUs as well as amount of memory in more powerful
    configurations, so the amount of memory per CPU stayed roughly
    constant.

    Why didn’t that work? Obviously, because CPU speeds increased disproportionately more than memory speeds. And so the total amount of
    memory has in reality been increasing much faster than the number of
    CPUs.

    And I don’t think you see NUMA in consumer machines; maybe in supercomputers.

    Well, such is the problem.
    Seemingly computational throughput is easier to achieve than memory
    bandwidth.


    Well, even with an FPGA:
    It wouldn't be that hard to make a SIMD unit that could pull off 400 or
    800 MFLOP at 50 MHz...

    Keeping it fed is another problem entirely.
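
    Some back-of-envelope numbers for the feeding problem. The helper names are made up, and the worst-case assumption is that every FLOP needs two fresh FP32 operands from memory with no reuse:

```python
def flops_per_cycle(mflops: float, clock_mhz: float) -> float:
    """FLOPs the SIMD unit must complete each clock cycle."""
    return mflops / clock_mhz

def feed_bandwidth_gb_s(mflops: float, bytes_per_operand: int = 4,
                        operands_per_flop: int = 2) -> float:
    """Worst-case input bandwidth in GB/s, assuming no operand reuse."""
    return mflops * 1e6 * bytes_per_operand * operands_per_flop / 1e9

for rate in (400, 800):
    print(rate, "MFLOP:", flops_per_cycle(rate, 50), "FLOP/cycle,",
          feed_bandwidth_gb_s(rate), "GB/s of operands")
```

    That is 8 or 16 FLOP per cycle at 50 MHz, and under the no-reuse assumption several GB/s of operand traffic, so the memory side becomes the hard part unless data is reused locally.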



    I guess one relative merit of register machines is that one can try to static-schedule around memory access patterns to some extent.

    At least, more so than with hardware that streams to/from memory. Well,
    and memory streaming would introduce unavoidable round-trips to memory, whereas operating within registers could potentially do more stuff
    locally in registers before needing to hit back to memory.


    I guess a big what if is, say, rather than having a 64-bit or 128-bit
    pipe to a relatively large RAM, you could have a whole lot of pipes to
    smaller and narrower RAM modules.


    Say, for example, 64x 16b LPDDR?...

    As opposed to say, two 128-bit channels covering 4 DIMMs?...


  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Jun 18 16:52:28 2026
    From Newsgroup: comp.arch

    On 2026-06-18 10:45, Lawrence D’Oliveiro wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    Back in the 1980s and earlier, there were many architectural
    possibilities being considered.

    Consider the Transputer: this consisted of a lot of CPU nodes, each
    with its own local memory. I guess they envisioned scaling up both
    numbers of CPUs as well as amount of memory in more powerful
    configurations, so the amount of memory per CPU stayed roughly
    constant.

    Why didn’t that work? Obviously, because CPU speeds increased disproportionately more than memory speeds. And so the total amount of
    memory has in reality been increasing much faster than the number of
    CPUs.

    The Transputer (from Inmos) led to the XCORE many-core chips from XMOS (https://www.xmos.com/) which seem to be somewhat successful today.

    The Transputer was successful for a while, but IMO then waned because
    Inmos focused on making each single-core processor chip more powerful,
    which put the transputer into direct competition with conventional processors and made many-processor transputer systems expensive, instead
    of making many-core chips, which is what XMOS does, making the cost per
    core small.






  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 18 14:39:07 2026
    From Newsgroup: comp.arch

    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 16:54:57 2026
    From Newsgroup: comp.arch

    On 18/06/2026 15:52, Niklas Holsti wrote:
    On 2026-06-18 10:45, Lawrence D’Oliveiro wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    Back in the 1980s and earlier, there were many architectural
    possibilities being considered.

    Consider the Transputer: this consisted of a lot of CPU nodes, each
    with its own local memory. I guess they envisioned scaling up both
    numbers of CPUs as well as amount of memory in more powerful
    configurations, so the amount of memory per CPU stayed roughly
    constant.

    Why didn’t that work? Obviously, because CPU speeds increased
    disproportionately more than memory speeds. And so the total amount of
    memory has in reality been increasing much faster than the number of
    CPUs.

    The Transputer (from Inmos) led to the XCORE many-core chips from XMOS (https://www.xmos.com/) which seem to be somewhat successful today.

    The Transputer was successful for a while, but IMO then waned because
    Inmos focused on making each single-core processor chip more powerful,
    which put the transputer into direct competition with conventional processors and made many-processor transputer systems expensive, instead
    of making many-core chips, which is what XMOS does, making the cost per
    core small.


    The XCORE core in the XMOS devices is not actually multi-core - it is a
    single core with a very deterministic multi-threading system. You get 8 hardware threads, with stepping between threads on every clock tick.
    The original XCOREs ran at a fixed 500 MHz (I believe they are faster
    now) with a 5 stage pipeline, with pipeline overlap between hardware
    threads but not within a hardware thread. So no virtual cpu would run
    faster than 100 MHz (but it could be slower if more than 5 of the 8
    hardware threads are active). Since the individual virtual cpus do not
    see the pipelining, everything is very deterministic and predictable -
    there are no delays for branches, no pipeline stalls, no waits for
    memory (using the onboard static ram), etc.
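
    A simple model of that thread timing, assuming strict round-robin issue where a thread gets at most one slot every five clocks. The `thread_rate_mhz` function and its defaults come from the figures above, not from XMOS documentation:

```python
def thread_rate_mhz(active: int, clock_mhz: float = 500.0,
                    pipeline_depth: int = 5, max_threads: int = 8) -> float:
    """Instruction rate seen by each hardware thread under round-robin issue."""
    if not 1 <= active <= max_threads:
        raise ValueError("active thread count out of range")
    # A thread cannot re-enter the pipeline until its previous instruction has
    # left it, so the round length is never shorter than the pipeline depth.
    return clock_mhz / max(pipeline_depth, active)

for n in (1, 5, 8):
    print(n, "active threads:", thread_rate_mhz(n), "MHz each")
```

    In this model the per-thread rate stays at 100 MHz for up to 5 active threads and falls to 62.5 MHz with all 8 running.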

    There are some XMOS devices with more than one XCORE on the same die, so
    they are multi-core. All communication between cores - real or virtual
    - is supposed to be via message passing, which is supported by hardware (including dedicated processor instructions).

    The whole idea was to take the principle of having many communicating
    cores, but implement it efficiently in hardware - reusing the same ALU
    and other hardware for multiple virtual cores to save space and cost,
    while simultaneously eliminating the timing complications from running a pipelined core quickly.



    Another massively multi-core device I read about was the GreenArrays
    GA144 <https://www.greenarraychips.com/>. In theory, the 144 processing elements mean it can do a massive number of operations per second with
    very little power and cost - in practice, the tiny amount of ram for
    code and data on each element means it can do almost nothing. It is programmed in a type of Forth (I know there are Forth experts in this
    group, who might have more informed opinions on the chip and development
    for it), but it is an obscure and limited Forth. Combined with the complication of splitting tasks between many elements and communicating
    and synchronising between them, I think making use of these devices is a very
    niche skill.


  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 16:59:11 2026
    From Newsgroup: comp.arch

    On 18/06/2026 16:39, Scott Lurndal wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    There have been plenty of microcontrollers where there is only one or
    very few actual registers - everything else is ram. The 8-bit PIC
    family works like that, and has been hugely popular. There are also
    "stack machine" architectures where you have, at most, a register for
    the top-of-stack (along with at least one stack pointer register, a
    program counter, and perhaps a flag/status register). Pretty much all
    4-bit processors work like that, AFAIK.

    I think there's a lot to be said for stack machine type designs,
    possibly with more than one stack.


  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 18 16:21:04 2026
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    Registers in high-performance CPUs give you several benefits:

    1) The addresses are hard-coded in the instructions. This means that
    read access can start early, that dependencies (read-after-write, write-after-write, write-after-read) can be determined early and used
    for forwarding, for renaming registers, and for reducing port requirements.

    2) They have many read and write ports.

    3) Fast access time. Well, maybe. Thanks to 1) fast access time is
    actually not necessary, it just means that you need fewer forwarding
    paths.

    Let's look at your thought experiment:

    Advantage 1 is missing. Some AMD64 implementations still manage to
    implement 0-cycle store-to-load-forwarding in many cases, but AFAIK
    not as reliably as for registers.

    Advantage 2 tends to be missing. E.g., the most extreme I have seen
    up to now is 3 reads and 2 writes per cycle, and IIRC <5 total memory
    accesses per cycle, on a machine that can do 8 or 10 instructions per
    cycle, i.e. at least 16 register reads and 8 register writes per cycle
    (maybe limited to less, but with advantage 1 mitigating that to some
    extent).

    Advantage 3: What would single-cycle memory access mean for d=a+b+c? It
    would be compiled to

    t=b+c
    d=a+t

    With registers this has a latency of typically 2 cycles. With
    single-cycle memory access this typically has a latency of 6 cycles.
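
    One way to arrive at those two numbers, as a sketch. The per-step costs are an assumed breakdown (one cycle each for the operand load, the add, and the result store on the dependence chain); `chain_latency` is a made-up helper:

```python
def chain_latency(dependent_ops: int, alu: int = 1,
                  load: int = 0, store: int = 0) -> int:
    """Latency in cycles of a chain of dependent operations."""
    return dependent_ops * (load + alu + store)

# d = a + b + c compiles to two dependent adds: t = b + c, then d = a + t.
register_latency = chain_latency(2)                  # operands already in registers
memory_latency = chain_latency(2, load=1, store=1)   # each add loads and stores
print(register_latency, memory_latency)
```

    Under those assumptions the register version takes 2 cycles and the memory version 6, because the intermediate t has to go out to memory and come back.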

    BTW, it's not just a thought experiment:

    A number of IA-64 implementations have had single-cycle D-cache
    access. It still had registers.

    Processors like the 6502 and the 6809 have single-cycle memory access.
    They still have registers (actually, accumulators and index
    registers).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 18 16:53:42 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 18/06/2026 16:39, Scott Lurndal wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    There have been plenty of microcontrollers where there is only one or
    very few actual registers - everything else is ram. The 8-bit PIC
    family works like that, and has been hugely popular. There are also
    "stack machine" architectures where you have, at most, a register for
    the top-of-stack (along with at least one stack pointer register, a
    program counter, and perhaps a flag/status register). Pretty much all
    4-bit processors work like that, AFAIK.

    The Burroughs B3500 and IBM 1401 were memory-to-memory
    architectures and were popular in the day. In both cases
    they supported index registers mapped to memory addresses.

    The B3500 TOS (Top Of Stack) was stored in a reserved memory
    address (address 40), although the original parameter passing
    mechanism was not reentrant (parameters were stored in the
    code space immediately following the enter (NTR) instruction).

    A later enhancement to the architecture (the VEN - Virtual Enter)
    instruction was re-entrant.



    I think there's a lot to be said for stack machine type designs,
    possibly with more than one stack.

    The B6500 family lives on today (albeit in emulation). The HP-3000
    was also a stack-based architecture influenced by the Burroughs
    large systems.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 18 16:49:40 2026
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    The Transputer was successful for a while, but IMO then waned because
    Inmos focused on making each single-core processor chip more powerful,
    which put the transputer into direct competition with conventional processors and made many-processor transputer systems expensive, instead
    of making many-core chips, which is what XMOS does, making the cost per
    core small.

    The multiprocessor Transputer systems were never particularly
    successful. I have read that the focus on programming the system in
    Occam and the limitations of that programming language resulted in a
    lack of adoption. However, thanks to the possibility to use the
    transputers with few support chips, they were successful as
    single-processors in high-end embedded systems.

    The early transputers were fast for their time, e.g., with the T414 at
    up to 20MHz (and single-cycle instruction execution) in 1985, while
    the 80386 was introduced in 1985 at 12.5MHz (with at least two cycles
    per instruction), and the MIPS R2000 introduced in 1986 was available
    at up to 15MHz.

    Later they tried to follow up on that with the T9000 with multiple
    instructions per cycle and higher clock rate, but that ran into long
    delays, deadly in the 1990s with its extreme clock rate increases
    every year, and so the T9000 was eventually cancelled.

    As for multi-processing, in addition to Occam the Transputer concept
    suffers from distributed memory. As for the idea that the work on the
    T9000 prevented success of having a many-processor with cheap CPUs,
    that's not the case because the T9000 never appeared, and you could
    buy various cheaper transputers. So if a transputer system made up of
    lots of T212s would have been a winner, nobody (and certainly not the
    T9000) stopped that from happening. It was not a winner, because
    programming a distributed-memory machine is harder than a sequential
    machine.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Jun 18 17:38:06 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    Another massively multi-core device I read about was the GreenArray
    GA144 <https://www.greenarraychips.com/>. In theory, the 144 processing elements means it can do a massive number of operations per second with
    very little power and cost - in practice, the tiny amount of ram for
    code and data on each element means it can do almost nothing. It is programmed in a type of Forth (I know there are Forth experts in this
    group, who might have more informed opinions on the chip and development
    for it), but it is an obscure and limited Forth.

    That is not its major problem IMO.

    The Greenarrays chips have IIRC 64 18-bit words per core. That's
    really little for a general-purpose computer, and too little to be of
    any use in that capacity. A number of people in the Forth community
    were fascinated by these chips and ordered some to play around with
    them, but I rarely heard of any actual uses, much less production
    uses. Greenarrays apparently is still around, so maybe someone has
    found some use for it.

    One suggestion I have read is that it would be useful for bit-banging
    on I/O lines. 64 words might be enough for that (as long as the
    protocol is not too complex), and at 700MHz these chips might outdo
    FPGAs in some of these applications. But I have not heard much about
    such applications, either.

    Combined with the
    complication of splitting tasks between many elements and communicating
    and synchronising between them, I think making use of these devices is a very
    niche skill.

    One interesting aspect is the synchronization. AFAIK you can send
    over a word to a neighboring core. As long as that word is not
    consumed, sending another word will block. Reading a word from a
    neighbor if there is none available will block, too. Not sure if it
    has a way to check whether something is in the buffer before trying to
    read from or write to it.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 20:01:34 2026
    From Newsgroup: comp.arch

    On 18/06/2026 19:38, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    Another massively multi-core device I read about was the GreenArray
    GA144 <https://www.greenarraychips.com/>. In theory, the 144 processing
    elements means it can do a massive number of operations per second with
    very little power and cost - in practice, the tiny amount of ram for
    code and data on each element means it can do almost nothing. It is
    programmed in a type of Forth (I know there are Forth experts in this
    group, who might have more informed opinions on the chip and development
    for it), but it is an obscure and limited Forth.

    That is not its major problem IMO.

    The Greenarrays chips have IIRC 64 18-bit words per core. That's
    really little for a general-purpose computer, and too little to be of
    any use in that capacity. A number of people in the Forth community
    were fascinated by these chips and ordered some to play around with
    them, but I rarely heard of any actual uses, much less production
    uses. Greenarrays apparently is still around, so maybe someone has
    found some use for it.

    Yes, it is the small program size that is the big limit. If there were
    more code space, it would be straightforward to simply add new Forth
    words as needed until you had something that was more directly practical.


    One suggestion I have read is that it would be useful for bit-banging
    on I/O lines. 64 words might be enough for that (as long as the
    protocol is not too complex), and at 700MHz these chips might outdo
    FPGAs in some of these applications. But I have not heard much about
    such applications, either.

    That is something that is done with XMOS devices (at 100 MHz per virtual
    cpu). But to make it work well, they also have a large number of
    hardware timers and parallel-to-serial and serial-to-parallel shift
    registers. This lets you make things like a 100 Mbps Ethernet MAC in
    only a few hardware threads, or multiple UARTs on one thread. One of
    the GA144 example applications is an Ethernet MAC, but IIRC it takes
    over half the chip. And even though the XMOS devices could do Ethernet
    and USB (480 Mbps) in software, they quickly realised that they are much
    more efficient in dedicated hardware blocks.


    Combined with the
    complication of splitting tasks between many elements and communicating
    and synchronising between them, I think making use of these devices is a very
    niche skill.

    One interesting aspect is the synchronization. AFAIK you can send
    over a word to a neighboring core. As long as that word is not
    consumed, sending another word will block. Reading a word from a
    neighbor if there is none available will block, too. Not sure if it
    has a way to check whether something is in the buffer before trying to
    read from or write to it.


    That's a nice idea. But can one have a bit of flexibility here, with a FIFO
    of size greater than 1? That would provide many other uses. I suppose
    it is possible to use one of the cores as a FIFO.
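
    As a rough illustration of the blocking one-word handoff and of using a
    core in the middle as a FIFO, here is a minimal Python sketch. It is a
    toy model, not GA144 code: the names OneWordPort and RelayCore, the
    relay capacity of 4, and the choice to report "would block" as False
    instead of actually blocking are all assumptions made for the example.

```python
from collections import deque


class OneWordPort:
    """Hypothetical model of a one-word link between two neighboring cores."""

    def __init__(self):
        self.word = None  # None means the port is empty

    def try_send(self, value):
        # A real core would block here; the model reports "would block" instead.
        if self.word is not None:
            return False
        self.word = value
        return True

    def try_recv(self):
        # Returns (ok, value); ok is False where a real core would block.
        if self.word is None:
            return False, None
        value, self.word = self.word, None
        return True, value


class RelayCore:
    """A core between two ports that buffers words in its own local RAM."""

    def __init__(self, inp, out, capacity):
        self.inp, self.out = inp, out
        self.capacity = capacity
        self.buffer = deque()

    def step(self):
        # Drain the upstream port while there is room, then feed downstream.
        if len(self.buffer) < self.capacity:
            ok, value = self.inp.try_recv()
            if ok:
                self.buffer.append(value)
        if self.buffer and self.out.try_send(self.buffer[0]):
            self.buffer.popleft()


def demo():
    upstream, downstream = OneWordPort(), OneWordPort()
    relay = RelayCore(upstream, downstream, capacity=4)
    # Producer runs ahead of a stalled consumer: sends succeed until every
    # slot (2 port words + 4 relay words) is full.
    accepted = []
    for value in range(8):
        accepted.append(upstream.try_send(value))
        relay.step()
    # Consumer now drains; the relay keeps refilling the downstream port.
    received = []
    ok, value = downstream.try_recv()
    while ok:
        received.append(value)
        relay.step()
        ok, value = downstream.try_recv()
    return accepted, received
```

    In this model the producer gets six words ahead of a stalled consumer
    before it would block, and the words still arrive in order. The cost is
    a whole core spent on buffering.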

  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Jun 18 11:08:25 2026
    From Newsgroup: comp.arch

    On 6/18/2026 7:59 AM, David Brown wrote:
    On 18/06/2026 16:39, Scott Lurndal wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    There have been plenty of microcontrollers where there is only one or
    very few actual registers - everything else is ram.  The 8-bit PIC
    family works like that, and has been hugely popular.  There are also
    "stack machine" architectures where you have, at most, a register for
    the top-of-stack (along with at least one stack pointer register, a
    program counter, and perhaps a flag/status register).  Pretty much all
    4-bit processors work like that, AFAIK.

    I think there's a lot to be said for stack machine type designs,
    possibly with more than one stack.
    Yes, for some applications. As you noted, many/most of the successful
    stack-architecture CPUs are in the small embedded space. The
    advantages of stack architectures, besides the ones you mentioned,
    include smaller code footprint and faster context switch.

    The downside becomes more problematic when you get to more powerful
    systems and try to do superscalar operations. For example, it is easy
    to see how to perform two adds simultaneously when they involve different
    registers, but since essentially all operations in a stack machine have
    top of stack as the destination, it gets more tricky.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 18 21:08:09 2026
    From Newsgroup: comp.arch

    On 18/06/2026 20:08, Stephen Fuld wrote:
    On 6/18/2026 7:59 AM, David Brown wrote:
    On 18/06/2026 16:39, Scott Lurndal wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    There have been plenty of microcontrollers where there is only one or
    very few actual registers - everything else is ram.  The 8-bit PIC
    family works like that, and has been hugely popular.  There are also
    "stack machine" architectures where you have, at most, a register for
    the top-of-stack (along with at least one stack pointer register, a
    program counter, and perhaps a flag/status register).  Pretty much all
    4-bit processors work like that, AFAIK.

    I think there's a lot to be said for stack machine type designs,
    possibly with more than one stack.
    Yes, for some applications.  As you noted, many/most of the successful
    stack-architecture CPUs are in the small embedded space.  The
    advantages of stack architectures, besides the ones you mentioned,
    include smaller code footprint and faster context switch.

    The downside becomes more problematic when you get to more powerful
    systems and try to do superscalar operations.  For example, it is easy
    to see how to perform two adds simultaneously when they involve different
    registers, but since essentially all operations in a stack machine have
    top of stack as the destination, it gets more tricky.


    While a lot of problems benefit most from fast performance per thread,
    some can spread across many threads, and performance per Watt is key.
    Then it doesn't matter if you can't perform multiple simultaneous
    additions if you can switch to a new thread in a single cycle.

    Of course, the challenge here is that programming is significantly
    different from what we are used to, and you need a new type of OS as
    well as new applications - and ideally, new programming languages.
    That's a lot of big hurdles to clear even if the result is theoretically
    more efficient. (And you still need to keep the fast single-threaded
    cpus, and the fast SIMD / vector processing systems, for other kinds of tasks.)

    Possibly the biggest millstone around the neck of computing
    architectures is the C language. For every processor that is not just
    for highly niche code (like gpus), what matters is how fast C code can
    run on it. Most other languages either use a similar model, or run on
    VMs written in C. Why bother with support for multiple stacks, or other interesting hardware innovations, if it doesn't support faster C? With
    all due respect to Anton and other Forth enthusiasts, "fastest Forth benchmarks" is not going to attract much investment money.

    I'd love to see new architectures and new hardware features that are
    genuinely different, but they rarely turn up. Even with C programming,
    there are so many things that could be made more efficient with a bit of interesting hardware. (I say this with little knowledge of the
    complications implied.) A lot of time in C code is spent in memory
    allocation work - that's got to be a prime candidate for hardware acceleration, especially if we can get away from the brutish malloc/free approach. (Stack-based allocators are one possibility for a lot of allocations.) There could be hardware support for threading, locking,
    and inter-process communication. Separate data stacks and return stacks
    would make things faster and more secure. Fat pointers that can track
    access modes and range limits, at least in common cases, would aid
    reliability and security.



  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 14:16:46 2026
    From Newsgroup: comp.arch

    On 6/18/2026 9:39 AM, Scott Lurndal wrote:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    This is what I was debating. But, trying to push registers out of the
    mix adds a new complexity:
    Either need for very complex addressing mode, or a section of memory
    whose sole purpose is to behave like registers.


    Could mostly eliminate GPRs, and assume only a few SPRs, say:
    ZP: Zero-Page Base
    SP: Stack Pointer
    GP: Global Pointer
    PC: Program Counter

    Then, you can use one of these as a base register, and access an offset.
    Say:
    (ZP,Disp) //location at ZP+Disp
    Indirect:
    @(ZP,Disp) //location pointed to by pointer at ZP+Disp
    Increment:
    @(ZP,Disp)+, @-(ZP,Disp)
    Uses pointer, then increments.

    This makes a problem for structs and arrays though...

    You either need a multi-op sequence to access a struct member, or the
    ability to double-up the addressing modes:
    ((ZP,Disp),Disp2)
    ((ZP,Disp),(ZP,Disp2))

    And, if one allows:
    ADD.L
    ((ZP,Disp),(ZP,Disp2)),
    ((ZP,Disp),(ZP,Disp2)),
    ((ZP,Disp),(ZP,Disp2))

    This would effectively be 9 memory accesses in a single instruction...
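
    The count of 9 can be checked with a small Python sketch. The operand
    tags ("direct", "indirect", "indexed") are hypothetical names for the
    three addressing forms above, and the model assumes each step of an
    address calculation costs exactly one memory access.

```python
def accesses(operand_kind):
    """Memory accesses needed to reach one operand (hypothetical model)."""
    if operand_kind == "direct":      # (ZP,Disp): the value itself
        return 1
    if operand_kind == "indirect":    # ((ZP,Disp),Disp2): pointer load + value
        return 2
    if operand_kind == "indexed":     # ((ZP,Disp),(ZP,Disp2)): pointer + index + value
        return 3
    raise ValueError(operand_kind)


def instruction_accesses(operand_kinds):
    """Total accesses for one instruction with the given operand forms."""
    return sum(accesses(kind) for kind in operand_kinds)


# ADD.L with two doubled-up sources and a doubled-up destination.
add_l = instruction_accesses(["indexed", "indexed", "indexed"])
```

    With plain (ZP,Disp) operands the same ADD.L would need 3 accesses in
    this model, so the doubled-up form triples the memory traffic of a
    single instruction.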


    Another option is a stack-machine, like Forth or PostScript (or JVM).

    Naively this works, but as noted similar work in a stack machine tends
    to need around 60% more instructions than a register machine (works OK
    as a compiler IR though mostly because many of the ops can evaporate
    when translating to 3AC/SSA).

    But:
    PUSH.L (ZP,Disp)
    PUSH.L (ZP,Disp)
    ADD
    POP.L (ZP,Disp)
    4 ops to do "c=a+b;"

    Or, to access an array:
    PUSH.A (ZP,Disp)
    PUSH.L (ZP,Disp)
    LOADINDEX.L
    POP.L (ZP,Disp)

    Also, to be effective if compiling a language like C, would likely need a
    frame-pointer to allow effective access to stack-based local variables,
    but the relative merit of using ZP for global state would be reduced.
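
    To make the op counts concrete, here is a minimal Python sketch of an
    interpreter for the two sequences above. The opcode names follow the
    listing with the size suffixes dropped, PUSHA stands in for PUSH.A, and
    the memory layout (slots A, B, C, I and an array at ARR) is made up for
    the example.

```python
def run_stack(program, memory):
    """Execute (opcode, operand) tuples on a toy stack machine; return op count."""
    stack = []
    for opcode, operand in program:
        if opcode == "PUSH":          # push the value stored at (ZP,Disp)
            stack.append(memory[operand])
        elif opcode == "PUSHA":       # push the address itself
            stack.append(operand)
        elif opcode == "POP":         # store top of stack to (ZP,Disp)
            memory[operand] = stack.pop()
        elif opcode == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif opcode == "LOADINDEX":   # replace (base, index) with memory[base+index]
            index, base = stack.pop(), stack.pop()
            stack.append(memory[base + index])
        else:
            raise ValueError(opcode)
    return len(program)


A, B, C, I, ARR = 0, 1, 2, 3, 8       # hypothetical displacements from ZP
memory = [0] * 16
memory[A], memory[B], memory[I] = 5, 7, 2
memory[ARR + 2] = 99

# c = a + b
add_ops = run_stack([("PUSH", A), ("PUSH", B), ("ADD", None), ("POP", C)], memory)
sum_result = memory[C]

# c = arr[i]
idx_ops = run_stack([("PUSHA", ARR), ("PUSH", I), ("LOADINDEX", None), ("POP", C)], memory)
idx_result = memory[C]
```

    Both statements take 4 ops here, where a register machine with the
    values already in registers would need one 3-operand instruction each.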



    Does all seem to work out to being a disadvantage vs a register machine...



    The other option being register machine with some Load-Op and Op-Store instructions and maybe some more advanced addressing modes.

    But, can note (if running Doom):
    (Rb, Disp): ~ 60%
    (Rb, Index): ~ 36% (*1)
    (Rb, Index, Disp): ~ 2%
    (Rb)+ / -(Rb): ~ 2%
    *1: Live stats, closer to (76, 21, 1, 2) for static counts.


    While one could think of specific cases like:
    [SP+Ix*4+Ofs]

    This mostly becomes moot if the address of the array ends up pinned in a register. This leaves array-inside-struct as the primary use-case, but
    not common enough to really justify it.


    This leaves Load-Op and Op-Store:
    ADD.L (SP, Disp), Rn //Rn=Rn+(SP,Disp)
    ADD.L Rn, (SP, Disp) //(SP,Disp)=(SP,Disp)+Rn

    Which can help if the variable is not in a register and only accessed
    once. This is theoretically spec'ed, but not really implemented for XG3
    in BGBCC yet. The emulator and CPU core should support it though (though
    is an optional feature, mostly overlaps with the same mechanism as the
    RV AMO ops).

    Relative gains are small but non-zero.


    The main useful instruction from this cohort is mostly:
    XCHGV.L (Rb, Disp), Rn //Rp<=Rn, LD=>Rn
    Which can do:
    Rn=(Rb,Disp)
    (Rb,Disp)=Rp
    As a single operation, with volatile access to main RAM (for the V variant).

    This happens to map over to the:
    InterlockedExchange()
    Intrinsic from MSVC, which can be used to implement spinlocks and
    similar. Granted, this mostly defeats the use-case for CAS or LR/SC
    (which are both more expensive and have little obvious advantage in this case).
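
    For illustration, a spinlock built only on an atomic exchange can be
    sketched in Python as below. This is a model, not XG3 or MSVC code:
    AtomicWord is a hypothetical stand-in for a memory word with a hardware
    exchange, and its internal threading.Lock exists only to imitate the
    atomicity the hardware would provide.

```python
import threading
import time


class AtomicWord:
    """Stand-in for a memory word with a hardware atomic exchange."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # models the atomicity of the hardware op

    def exchange(self, new):
        # Store `new` and return the previous value as one indivisible step.
        with self._guard:
            old, self._value = self._value, new
            return old


class SpinLock:
    def __init__(self):
        self._word = AtomicWord(0)      # 0 = free, 1 = held

    def acquire(self):
        # Keep swapping in 1 until the old value was 0 (we took the lock).
        while self._word.exchange(1) != 0:
            time.sleep(0)               # yield so the holder can make progress

    def release(self):
        self._word.exchange(0)


def count_with_lock(threads=4, iterations=200):
    lock, total = SpinLock(), [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1               # critical section
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

    The lock needs only "store 1, see what was there before", which is why
    a plain exchange is enough for this use and CAS or LR/SC add nothing.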

    ...



    Otherwise, did another test:
    Modeled what the result would be if the CPU imposes register-aliasing
    between registers within the same pair (say, assume a core with 64-bit
    logical registers but 128-bit physical registers):
    Current shuffling: ~ 4% penalty
    Pairing aware shuffling: ~ 1%
    Paired shuffling vs current on current cores: ~ 1%

    So, in effect, the overall cost of 128-bit physical registers would be
    around a 2% penalty vs 64-bit physical registers (absent maybe also
    making the register allocator aware of such shenanigans, and avoiding allocating registers within the same pair except when register pressure demands it).

    Or, in effect:
    6R3W register file with 30x 128-bit physical registers;
    Each physical register containing a logical pair of GPRs.
    WB would only write the low or high-half for 64b ops.
    This would be for exploring an option that could support 256-bit SIMD.
    Could make a tricked-out SIMD monster...
    But not enough memory bandwidth to make effective use of it.
    ...

    Though, this is still a fair bit less than the penalty from using RV-C
    (closer to around 10-15% vs non RV-C). This penalty could be avoided
    mostly by having superscalar handling that can deal with RV-C, but this
    is easier said than done. Well, or have a compiler that gets "clever"
    and tries to constrain RV-C to behave like pair-packing rather than
    free-form (trading some code-density for higher performance by not
    wrecking things as badly).


    Well, and as-is, XG3 is beating plain RV64GC on both code density and performance (RV64GC+Jx puts up a stronger fight in that at least
    binaries can be smaller, *).

    *: Though, to put RV64GC+Jx ahead, ended up adding stuff like 32-bit
    encodings for "LD/LW/LWU Disp17s*{4|8}(GP)" and similar (along with jumbo-prefixes, etc). Though, XG3 still remains faster.


    ...


  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 15:37:39 2026
    From Newsgroup: comp.arch

    On 6/18/2026 11:21 AM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    Registers in high-performance CPUs give you several benefits:

    1) The addresses are hard-coded in the instructions. This means that
    read access can start early, and that dependencies (read-after-write,
    write-after-write, write-after-read) can be determined early and used
    for forwarding, for renaming registers, and for reducing port
    requirements.

    2) They have many read and write ports.

    3) Fast access time. Well, maybe. Thanks to 1) fast access time is
    actually not necessary, it just means that you need fewer forwarding
    paths.

    Let's look at your thought experiment:

    Advantage 1 is missing. Some AMD64 implementations still manage to
    implement 0-cycle store-to-load-forwarding in many cases, but AFAIK
    not as reliably as for registers.

    Advantage 2 tends to be missing. E.g., the most extreme I have seen
    up to now is 3 reads and 2 writes per cycle, and IIRC <5 total memory accesses per cycle, on a machine that can do 8 or 10 instructions per
    cycle, i.e. at least 16 register reads and 8 register writes per cycle
    (maybe limited to less, but with advantage 1 mitigating that to some
    extent).

    Advantage 3: What would single-cycle memory access mean for d=a+b+c? It would be compiled to

    t=b+c
    d=a+t

    With registers this has a latency of typically 2 cycles. With
    single-cycle memory access this typically has a latency of 6 cycles.
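
    One plausible accounting for those two numbers, written as a small
    Python sketch. The breakdown of each memory-based add into a 1-cycle
    operand load, a 1-cycle add and a 1-cycle store is an assumption made
    here to reproduce the 6; the post only gives the totals.

```python
def chain_latency(dependent_ops, cycles_per_op):
    """Latency of a chain of operations where each needs the previous result."""
    return dependent_ops * cycles_per_op


ALU = 1   # one cycle for the add itself
MEM = 1   # the thought experiment's single-cycle memory access

# t=b+c; d=a+t with operands in registers: only the adds are on the chain.
register_latency = chain_latency(2, ALU)

# Same code with operands in memory: load, add, store for each dependent op.
memory_latency = chain_latency(2, MEM + ALU + MEM)
```

    Under these assumptions the register version comes out at 2 cycles and
    the memory version at 6, because the second add cannot load t until the
    first add has stored it.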

    BTW, it's not just a thought experiment:

    A number of IA-64 implementations have had single-cycle D-cache
    access. It still had registers.

    Processors like the 6502 and the 6809 have single-cycle memory access.
    They still have registers (actually, accumulators and index
    registers).



    I can try various ideas in my head, but almost invariably, "have
    registers, and a good number of them" comes out as the best-case option.


    One other possibility could be some sort of dynamically reconfigurable systolic array, but this would be "not so great" on area cost. And,
    would likely still make sense to drive them using big SIMD registers or similar as inputs.

    Say:
    Shove SIMD vectors into specialized reconfigurable unit;
    Wait for an N cycle latency;
    Deal with results coming out the other side.
    If exposed in an ISA, would likely make sense to violate the assumption
    that the instruction produces an immediate result, or alternatively the
    input and output ports are decoupled.

    SYSARR_IN_A0 Rv1, Rv2, Rv3 //Feed inputs to unit A0 (3R)
    ...
    SYSARR_OUT_A0 Rv4 //get results from A0 (1W)

    With pipelining:
    SYSARR_IN_A0 Rv1, Rv2, Rv3
    SYSARR_IN_A0 Rv4, Rv5, Rv6
    ...
    SYSARR_OUT_A0 Rv10
    SYSARR_OUT_A0 Rv11
    ...

    But, this would be a bit messy, as the instructions would effectively
    have their timing latency as part of the operation. Likely each unit
    would need control registers to describe the operation to be performed.
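
    A minimal Python sketch of the decoupled input/output idea, assuming a
    fixed latency and results that come out in issue order. The class
    PipelinedUnit, its method names and the multiply-add operation are
    hypothetical; they only mimic the SYSARR_IN / SYSARR_OUT pairing.

```python
from collections import deque


class PipelinedUnit:
    """Hypothetical fixed-latency unit with decoupled input and output ports."""

    def __init__(self, latency, op):
        self.latency = latency
        self.op = op
        self.cycle = 0
        self.in_flight = deque()      # (ready_cycle, result), oldest first

    def tick(self):
        self.cycle += 1

    def feed(self, a, b, c):
        # SYSARR_IN-style: accept 3 inputs; result is ready `latency` cycles later.
        self.in_flight.append((self.cycle + self.latency, self.op(a, b, c)))

    def result_ready(self):
        return bool(self.in_flight) and self.in_flight[0][0] <= self.cycle

    def take(self):
        # SYSARR_OUT-style: only valid once the oldest result is ready.
        if not self.result_ready():
            raise RuntimeError("result not ready; a real ISA might stall here")
        return self.in_flight.popleft()[1]


unit = PipelinedUnit(latency=4, op=lambda a, b, c: a * b + c)
unit.feed(1, 2, 3)                    # issued at cycle 0
unit.tick()
unit.feed(4, 5, 6)                    # issued at cycle 1, overlaps the first
for _ in range(3):
    unit.tick()                       # now at cycle 4
first = unit.take()                   # first result is ready
unit.tick()                           # cycle 5
second = unit.take()                  # second result follows one cycle later
```

    The second operation finishes one cycle after the first even though
    each takes 4 cycles, which is the benefit of pipelining. The program
    has to know, or check, when a result can be taken.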


    But, still the same fundamental issue:
    Feeding data in and dealing with the data out is likely to be a bigger bottleneck than the computation itself.

    Well, and that when memory access stops being as big of a dominant
    bottleneck, things like going superscalar or OoO on the memory
    operations opens up as an option (well, and the main reason to have
    memory operations as single-ported is mostly due to the higher cost of multi-ported access to memory, which isn't really solvable in any
    obvious way until one already has the resource budget to address it).

    Well, and "why not just make load/store and the bus wider?" is
    moderately effective (if naive) limited mostly by how much one can
    reasonably deal with, or the costs one can pay (say, major reason I am
    using 16-byte cache-lines being because it is more expensive to use
    32-byte cache lines, even if the per-line overheads would be lower at 32 bytes).

    But, say, if the core could do 256-bit load/store, would make sense to
    use a bus design with 256 or 512 bit cache lines. Well, and then 256b load/store would likely require 128-bit alignment.


    But, not there at the moment.



    One could map systolic arrays directly to memory, which does at least
    seem less awkward and works so long as the whole operation can be mapped
    to the input/output buffers, working like a niche co-processor. But,
    logic complexity and flexibility would be fairly limited in this case.

    Such a unit would still be limited to whatever the memory subsystem can deliver (so not necessarily much faster than a CPU running SIMD, which
    is limited to the same underlying constraint).

    Then, the only merit of the systolic array is if it can use less area or energy than a comparable CPU core (which isn't as likely to be true in a bandwidth-limited scenario; unless it is very narrow so that the bus can
    keep up, but then almost may as well just use a CPU).

    But, such an approach could make sense assuming there is enough memory bandwidth to keep it fed (and thus able to spend nearly every clock
    cycle doing useful work).


    ...




    But, does bring up an idle thought (unrelated):
    Could probably do a faster 3D renderer for my project if a lot of the
    work, like the dynamic tessellation and transforms, could be cached
    between frames.

    I guess could maybe try to bolt it onto OpenGL by hashing the vertex
    arrays, then building a cached tessellated array if the array's contents
    and modelview matrix are stable for N frames. Could then use a cheaper path
    to feed all the (pre-tessellated) primitives through the projection
    matrix and rasterizer.
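
    A rough Python sketch of that caching scheme, under several
    assumptions: vertex arrays and the matrix are flat lists of floats,
    each draw call has a stable "slot" id, and tessellate is whatever
    expensive function is being avoided. None of these names come from the
    actual renderer or from OpenGL.

```python
import hashlib
import struct


def array_key(vertices, modelview):
    """Hash the raw floats of a vertex array plus its modelview matrix."""
    data = struct.pack(f"{len(vertices)}f", *vertices)
    data += struct.pack(f"{len(modelview)}f", *modelview)
    return hashlib.sha1(data).digest()


class TessellationCache:
    def __init__(self, tessellate, stable_frames=2):
        self.tessellate = tessellate      # the expensive function being cached
        self.stable_frames = stable_frames
        self.slots = {}                   # slot -> [key, frames_seen, cached_result]
        self.misses = 0

    def draw(self, slot, vertices, modelview):
        key = array_key(vertices, modelview)
        entry = self.slots.get(slot)
        if entry is None or entry[0] != key:
            entry = [key, 0, None]        # input changed: start counting again
            self.slots[slot] = entry
        entry[1] += 1
        if entry[2] is not None:
            return entry[2]               # cheap path: reuse cached primitives
        self.misses += 1
        result = self.tessellate(vertices, modelview)
        if entry[1] >= self.stable_frames:
            entry[2] = result             # stable long enough: keep it
        return result
```

    Waiting for N identical frames before caching avoids paying for cache
    storage on geometry that changes every frame. Hashing every array each
    frame has its own cost, which this sketch does not account for.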

    ...


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 18 22:37:29 2026
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:
    -----------------------
    Of course, the challenge here is that programming is significantly
    different from what we are used to, and you need a new type of OS as
    well as new applications - and ideally, new programming languages.
    That's a lot of big hurdles to clear even if the result is theoretically more efficient. (And you still need to keep the fast single-threaded
    cpus, and the fast SIMD / vector processing systems, for other kinds of tasks.)

    Possibly the biggest millstone around the neck of computing
    architectures is the C language.

    C is not an albatross !! it is a standard to which one designs--
    exactly like air (our atmosphere) provides the standard to which
    airplanes have to be designed.

    C ended up being this model because its floor supports almost all other programming languages: certainly {Fortran, C++, Algol68, Pascal, Jovial,
    ...} and is not all that bad when doing {LISP, RPG, COBOL, APL, SNOBOL}.

    For every processor that is not just
    for highly niche code (like gpus), what matters is how fast C code can
    run on it. Most other languages either use a similar model, or run on
    VMs written in C.

    You state that like it was a BAD thing--it is not. I just wish we had all
    chosen the same standards to which to design; {BE or LE} is like having
    the steering wheel on the {left or right}, ...

    Why bother with support for multiple stacks, or other interesting hardware innovations, if it doesn't support faster C?

    I can argue that having 2 stacks {one for the preserved state from
    caller to callee, the other for data} enables ever so slightly faster
    C--but that is not the point--the point is robustness in the face of
    threats (buffer overruns, ROP, malicious use of memory).
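
    The robustness argument can be illustrated with a toy Python model. It
    is not how My 66000 implements its stacks: the frame layout, the word
    values and the function names are all assumptions, chosen only to show
    where an unchecked copy lands in each arrangement.

```python
def call_unified(buf_size, payload, return_addr):
    """One stack: the local buffer sits directly below the saved return address."""
    stack = [0] * buf_size + [return_addr]
    for i, word in enumerate(payload):       # unchecked copy, like a bad memcpy
        if i < len(stack):
            stack[i] = word
    return stack[buf_size]                   # address the function returns to


def call_split(buf_size, payload, return_addr):
    """Two stacks: the overrun stays on the data stack."""
    data_stack = [0] * (buf_size + 4)        # buffer plus some neighboring data
    call_stack = [return_addr]               # preserved state lives elsewhere
    for i, word in enumerate(payload):
        if i < len(data_stack):
            data_stack[i] = word
    return call_stack[-1]


overrun = [0xBAD] * 5                        # one word more than the buffer holds
hijacked = call_unified(4, overrun, 0x1000)
protected = call_split(4, overrun, 0x1000)
```

    In the unified layout the extra word replaces the return address. In
    the split layout it still corrupts neighboring data, but the return
    address is untouched, which is the point about ROP-style attacks.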

    The speed advantage is by knowing that registers written to the call-
    stack do not need to be written to L2 (or farther) if RET has occurred
    and the cache line replaced. Saves a trifling of power, too. There is
    another advantage when an EXIT instruction is still reading from
    stack and an ENTER instruction starts writing to the stack. We taught
    the compiler a prescribed order to utilize the registers, so that when
    an EXIT is running and an ENTER is decoded, the EXIT can be short
    circuited and some of the ENTER short circuited, eliding cycle waste.

    With
    all due respect to Anton and other Forth enthusiasts, "fastest Forth benchmarks" is not going to attract much investment money.

    I'd love to see new architectures and new hardware features that are genuinely different, but they rarely turn up.

    My 66000 is replete with those features--and it is argued here daily
    that it (my 66000 ISA) has gone too far !!

    [Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere in memory at any
    time.

    Jump-Through-Table (switch) making PIC standard; while making the tables smaller {1/8th to 1/4th}

    Load IP instructions (CALX, JMPX, CALA, JMPA) enable control transfers
    directly through GOT (or other SW table).

    Multi-line multi-instruction ATOMIC sequences freely available to SW.

    Transcendental instructions that take FDIV number of cycles.

    Context Switches performed without instruction execution--as if the
    state of a thread was treated like a write-back cache.

    Interrupt tables that can be used as a low level scheduler built into
    the priority (and privilege) model with support for vVMs monitoring vMs.
    One can schedule a DPC/softIRQ in 1 instruction that never fails (excepting
    when the interrupt message takes an unrecoverable ECC failure between core
    and table.)

    Even with C programming, there are so many things that could be made more efficient with a bit of interesting hardware. (I say this with little knowledge of the complications implied.) A lot of time in C code is spent in memory allocation work - that's got to be a prime candidate for hardware acceleration, especially if we can get away from the brutish malloc/free approach.

    And then C++ goes all 'new' on using memory...

    (Stack-based allocators are one possibility for a lot of allocations.)
    There could be hardware support for threading, locking,
    and inter-process communication.

    My 66000 can switch threads in a single instruction.
    My 66000 ESM provides unrealized synchronization capabilities.
    My 66000 Interrupt tables provide single instruction message send and
    single instruction message receives.
    SW determines what the messages mean.

    Separate data stacks and return stacks would make things faster and more secure.

    Already present. But in addition, My 66000 DRAM controller is free from RowHammer-like attack vectors--By Architecture of the memory hierarchy.

    Plus, the code sees a 64-bit Virtual Address Space, while the system has
    four 64-bit physical address spaces. The spaces are used to determine
    the consistency model:: {
    Cacheable DRAM is causally ordered and coherent and cached
    unCacheable DRAM is sequentially consistent incoherent
    ROM is unordered incoherent but cached
    MMI/O is sequentially consistent incoherent
    Config is strongly ordered incoherent
    }
    Fat pointers that can track access modes and range limits, at least in common cases, would aid reliability and security.

    A bit Too "Cheri" for me.



  • From Lawrence D’Oliveiro@ldo@nz.invalid to comp.arch on Thu Jun 18 22:58:26 2026
    From Newsgroup: comp.arch

    On Thu, 18 Jun 2026 03:50:52 -0500, BGB wrote:

    I guess a big what if is, say, rather than having a 64-bit or
    128-bit pipe to a relatively large RAM, you could have a whole lot
    of pipes to smaller and narrower RAM modules.

    Say, for example, 64x 16b LPDDR?...

    16-bit ... wasn’t that the bus width of Rambus?

    Was Rambus trying to do anything like this? Remember, Intel invested
    heavily in it ... only for just about the entire rest of the industry
    to bring out DDR.

    Yes, pipelining RAM would seem an obvious answer to try to keep up
    with faster and faster CPUs.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jun 19 00:26:42 2026
    From Newsgroup: comp.arch


    Lawrence D’Oliveiro <ldo@nz.invalid> posted:

    On Thu, 18 Jun 2026 03:50:52 -0500, BGB wrote:

    I guess a big what if is, say, rather than having a 64-bit or
    128-bit pipe to a relatively large RAM, you could have a whole lot
    of pipes to smaller and narrower RAM modules.

    Say, for example, 64x 16b LPDDR?...

    16-bit ... wasn’t that the bus width of Rambus?

    Was Rambus trying to do anything like this? Remember, Intel invested
    heavily in it ... only for just about the entire rest of the industry
    to bring out DDR.

    Yes, pipelining RAM would seem an obvious answer to try to keep up
    with faster and faster CPUs.

    See my patent 5,367,494

    Where there were 3 busses, an address bus, a read-out bus and a
    write-in bus, each bus had/has independent timing.

    Basically a mainframe multi-banked memory system in a single chip.
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 18 23:48:10 2026
    From Newsgroup: comp.arch

    On 6/18/2026 5:58 PM, Lawrence D’Oliveiro wrote:
    On Thu, 18 Jun 2026 03:50:52 -0500, BGB wrote:

    I guess a big what if is, say, rather than having a 64-bit or
    128-bit pipe to a relatively large RAM, you could have a whole lot
    of pipes to smaller and narrower RAM modules.

    Say, for example, 64x 16b LPDDR?...

    16-bit ... wasn’t that the bus width of Rambus?

    Was Rambus trying to do anything like this? Remember, Intel invested
    heavily in it ... only for just about the entire rest of the industry
    to bring out DDR.

    Yes, pipelining RAM would seem an obvious answer to try to keep up
    with faster and faster CPUs.

    The idea here is partly:
    LPDDR vs DDR:
    LPDDR has lower pin count due to multiplexing;
    Narrower interface (typically 16 bit);
    Commonly used in cellphones or similar;
    ...

    So, for normal DDR, there are normally:
    Command pins;
    Address pins;
    Data Pins.
    The command/address is sent in parallel, typically SDR.


    On a DIMM, typically the C/A pins would be shared across all of the
    chips, so each chip would individually provide an 8 or 16 bit interface,
    but then they are ganged up for a 64 bit interface.

    If each DDR chip were addressed individually, the C/A pins would
    greatly outweigh the data pins.


    Whereas with individual addressing, LPDDR wouldn't have nearly as big of
    an impact on pin count. The C/A pins are more heavily multiplexed and
    driven using DDR signals.


    Why go narrower?...
    Mostly, each memory access has a certain latency (RAS and CAS), which
    needs to be paid for every access. To get the most efficient use of
    bandwidth, one effectively needs to perform relatively large burst
    transfers (say, one would need to transfer around 512 bytes or so to get
    peak efficiency from a DIMM; one can do 128 or 256 byte bursts, but then
    one is wasting more of the time on CAS latency).
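
    As a rough illustration of that trade-off, here is a minimal C sketch of bus efficiency for one isolated burst. The function name, the model, and the 32-clock overhead used in the note below are all assumptions made for illustration, not datasheet values, and the model ignores the bank overlap that real controllers exploit.

```c
#include <assert.h>

/* Fraction of bus time spent moving data for one isolated burst.
 * overhead_clocks stands in for the RAS+CAS cost paid per access
 * (an assumed round number, not a datasheet value).
 * A DDR bus moves two beats of bus_bytes each per clock. */
double burst_efficiency(unsigned burst_bytes, unsigned bus_bytes,
                        unsigned overhead_clocks)
{
    assert(bus_bytes > 0);
    double data_clocks = (double)burst_bytes / (2.0 * bus_bytes);
    return data_clocks / (data_clocks + overhead_clocks);
}
```

    Under these assumptions, burst_efficiency(512, 8, 32) is 0.5 for a 64-bit DIMM, and a 16-bit channel reaches the same figure with a 128-byte burst, burst_efficiency(128, 2, 32).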

    But, then a more subtle problem emerges:
    The bigger the block you transfer, the lower the probability that all of
    the data in that block will actually be used.


    Say, block size:
    16B: Very likely all of it is relevant;
    32B: Also likely that all of it is relevant;
    64B: Meh;
    128B: Chances are half the block is wasted;
    256B/512B: Likely only part of the block will be accessed in the near-term.


    If the RAM interface is pushed narrower, this block width is pushed
    down, so a higher percentage is likely to be useful (within the time
    window of however long it is in the cache).

    The pin budget can then instead be used to access more blocks. And, if
    you have a lot of cores, more likely they are going to be accessing RAM
    in a fairly scatter-shot pattern rather than anything resembling a
    sequential access pattern.

    Likewise, more of the total RAM's bandwidth budget is likely to go
    towards useful work vs accessing RAM that just happened to be nearby the
    data that was being accessed.


    Though, the correlation pattern is different than the one affecting MMU
    page size, but MMU page size applies over a few orders of magnitude
    larger time scale (where, say, data in the L1 and L2 caches tends to be
    much shorter lived vs stuff in the TLB).

    But, yeah, you wouldn't likely want 256B or 512B burst transfers for
    similar reasons to why you wouldn't want a 64K or 128K page size.

    ...


    But, then again, probably if mainstream systems designers felt there
    were good reason to stick a bunch of cellphone style RAM chips onto the
    DIMMs, they would have probably done it...


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Jun 19 09:24:43 2026
    From Newsgroup: comp.arch

    On 19/06/2026 00:37, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:
    -----------------------
    Of course, the challenge here is that programming is significantly
    different from what we are used to, and you need a new type of OS as
    well as new applications - and ideally, new programming languages.
    That's a lot of big hurdles to clear even if the result is theoretically
    more efficient. (And you still need to keep the fast single-threaded
    cpus, and the fast SIMD / vector processing systems, for other kinds of
    tasks.)

    Possibly the biggest millstone around the neck of computing
    architectures is the C language.

    C is not an albatross !! it is a standard to which one designs--
    exactly like air (our atmosphere) provides the standard to which
    airplanes have to be designed.

    C ended up being this model because its floor supports almost all other programming languages: certainly {Fortran, C++, Algol68, Pascal, Jovial,
    ...} and is not all that bad when doing {LISP, RPG, COBOL, APL, SNOBOL}.


    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    This is not necessarily a bad thing - it lets everyone get stuff done.
    But it means that we are stuck on a local maximum. If there is a better
    way out there somewhere, it would be a long and arduous journey to get
    there.

    For every processor that is not just
    for highly niche code (like gpus), what matters is how fast C code can
    run on it. Most other languages either use a similar model, or run on
    VMs written in C.

    You state that like it was a BAD thing--it is not. I just wish we had all
    chosen the same standards at which to design; {BE or LE} is like having
    the steering wheel on the {left or right}, ...


    Having a consistency here makes a lot of things easier and more
    efficient. But it also makes change harder, even if the end results
    would perhaps be better. (I don't know if there are other models that
    really are better - this thread is titled "Thought experiment", after all!)

    Why bother with support for multiple stacks, or other
    interesting hardware innovations, if it doesn't support faster C?

    I can argue that having 2 stacks {one for the preserved state from
    caller to callee, the other for data} enables ever so slightly faster
    C--but that is not the point--the point is robustness in the face of
    threats (buffer overruns, ROP, malicious use of memory).


    Those could be handled in hardware too.

    Intel "Control-flow Enforcement Technology" sounds fancy and innovative,
    but it is really nothing more than having a second stack for return
    addresses. Having two or more stacks, with hardware protection for what
    can be done with them, should be very positive for robustness and security.

    The speed advantage is by knowing that registers written to the call-
    stack do not need to be written to L2 (or farther) if RET has occurred
    and the cache line replaced. Saves a trifling of power, too. There is
    another advantage when an EXIT instruction is still reading from
    stack and an ENTER instruction starts writing to the stack. We taught
    the compiler a prescribed order to utilize the registers, so that when
    an EXIT is running and an ENTER is decoded, the EXIT can be short
    circuited and some of the ENTER short circuited, eliding cycle waste.


    That is one benefit, yes - things above the stack line (for any of the
    stacks) can be discarded without being pushed back to main memory. But
    you can do better. They can not only be discarded, but cleared,
    improving security. They can use specialised cpu-local caches for
    different purposes - return stacks with just addresses will be much
    smaller than data stacks, and can fit tightly together with prefetches, speculative execution, etc. You know that the different types of data
    don't overlap, there is no need to worry about data accesses addressing
    things on the return stack, and so on. (That would add restrictions
    limiting some kinds of self-modifying program - good riddance!)

    With
    all due respect to Anton and other Forth enthusiasts, "fastest Forth
    benchmarks" is not going to attract much investment money.

    I'd love to see new architectures and new hardware features that are
    genuinely different, but they rarely turn up.

    My 66000 is replete with those features--and it is argued here daily
    that it (my 66000 ISA) has gone too far !!

    [Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere in memory at any time.

    Jump-Through-Table (switch) making PIC standard; while making the tables smaller {1/8th to 1/4th}

    Load IP instructions (CALX, JMPX, CALA, JMPA) enable control transfers directly through GOT (or other SW table).


    Those might all be useful at times, but are not game-changers as far as
    I can see.

    Multi-line multi-instruction ATOMIC sequences freely available to SW.

    That's more fun.


    Transcendental instructions that take FDIV number of cycles.

    That's "just" efficiency.


    Context Switches performed without instruction execution--as if the
    state of a thread was treated like a write-back cache.

    /Now/ we are getting somewhere. That sounds like the kind of feature I
    am talking about - something that changes the way people design code,
    not just making existing stuff faster.


    Interrupt tables that can be used as a low level scheduler built into
    the priority (and privilege) model with support for vVMs monitoring vMs.
    One can schedule a DPC/softIRQ in 1 instruction that never fails
    (excepting when the interrupt message takes an unrecoverable ECC failure
    between core and table.)

    Even with C programming,
    there are so many things that could be made more efficient with a bit of
    interesting hardware. (I say this with little knowledge of the
    complications implied.) A lot of time in C code is spent in memory
    allocation work - that's got to be a prime candidate for hardware
    acceleration, especially if we can get away from the brutish malloc/free
    approach.

    And then C++ goes all 'new' on using memory...

    "new" is mostly built on top of malloc/free. While you /can/ make your
    own overrides for "new", either globally or for specific types, in the
    great majority of cases it boils down to calling "malloc" then calling
    the type's constructor.


    (Stack-based allocators are one possibility for a lot of
    allocations.)
    There could be hardware support for threading, locking,
    and inter-process communication.

    My 66000 can switch threads in a single instruction.
    My 66000 ESM provides unrealized synchronization capabilities.
    My 66000 Interrupt tables provide single instruction message send and
    single instruction message receives.
    SW determines what the messages mean.


    That gives me confidence that not all my ideas are crazy ! I hope you
    succeed with your design - these are features I would love to be able to
    use in my daily work.

    Separate data stacks and return stacks
    would make things faster and more secure.

    Already present. But in addition, My 66000 DRAM controller is free from RowHammer-like attack vectors--By Architecture of the memory hierarchy.

    Plus, the code sees a 64-bit Virtual Address Space, while the system has
    four 64-bit physical address spaces. The spaces are used to determine
    the consistency model:: {
    Cacheable DRAM is causally ordered and coherent and cached
    unCacheable DRAM is sequentially consistent incoherent
    ROM is unordered incoherent but cached
    MMI/O is sequentially consistent incoherent
    Config is strongly ordered incoherent
    }
    Fat pointers that can track
    access modes and range limits, at least in common cases, would aid
    reliability and security.

    A bit Too "Cheri" for me.

    I think "Cheri" took it too far. I believe there is scope for tagging a
    bit of information onto pointers without trying to do everything.

    I also think a lot can be done on the side of programming languages and
    tools, which could catch far more possible pointer mistakes. That won't
    stop the bad guys, of course, but I think more bad accesses are from
    bugs than hackers.





    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 19 06:02:16 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    Possibly the biggest millstone around the neck of computing
    architectures is the C language. For every processor that is not just
    for highly niche code (like gpus), what matters is how fast C code can
    run on it. Most other languages either use a similar model, or run on
    VMs written in C. Why bother with support for multiple stacks, or other
    interesting hardware innovations, if it doesn't support faster C? With
    all due respect to Anton and other Forth enthusiasts, "fastest Forth
    benchmarks" is not going to attract much investment money.

    That is probably the case, but "fast Java performance", "fast Python performance", or "fast JavaScript performance" might be a different
    issue. And indeed, once upon a time Sun boasted architectural
    features to support Java. AFAIK these were features to improve the indirect-branch performance in the Java VM interpreter, to improve the
    startup performance. However, other CPU manufacturers (in particular
    Intel and AMD, and, more recently, ARM and Apple and probably
    Nuvia/Qualcomm) made indirect branches fast without architectural
    support.

    Concerning multiple stacks, existing architectures and their
    implementations support multiple stacks just fine. While AMD64, which
    is descended from an architecture that is older than C, has special architectural support for one stack, based on the call/return stack of
    its ancestor, implementing a stack using a different register as stack
    pointer works just fine and is efficient. In newer architectures like
    ARM A64 and RISC-V it's just software convention that one register is designated as stack pointer (the Compression extension of RISC-V has
    compressed instructions for dealing with that register, but apart from
    a code size advantage with C the stack pointer is just another
    register on RISC-V).

    Most Forth implementations on AMD64 implement the data stack, the most
    heavily used stack in Forth, using a different register than %RSP.
    There is one high-performance implementation that uses %RSP as data
    stack, but it does not perform generally better than other
    high-performance implementations that use a different register.

    Of course, high-performance implementations tend to keep data stack
    items in registers and access them in memory as rarely as possible, so
    the question of having stack pointers with different performance characteristics would have less influence than some might expect, but
    I have not observed or read about different performance
    characteristics. A long time ago I read about consecutive
    pushes and pops being optimized to avoid the dependency due to
    stack-pointer updates, but for a stack implemented using stores,
    loads, and additions to a register, any even moderately sophisticated
    compiler combines the additions of consecutive stack accesses, too.

    One case where C and Forth have a difference and where a preference
    for C may show in some architectures is in reifying comparison results
    as integer values. Most comparison results are only used by
    conditional branches, and compilers can easily avoid having to reify
    them in this context, but sometimes this optimization does not happen.
    There C produces 1 for true, while Forth produces -1 (all bits set)
    for true.

    RISC-V clearly is in C land, and its comparison instructions produce 1
    for true, and Forth needs an additional instruction for its reified
    flag.
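
    To make the two conventions concrete, here is a small C sketch. The function names are made up for illustration. The negation in forth_less plays the role of the additional instruction mentioned above, and select_by_flag shows why an all-bits-set flag is convenient with bitwise operations.

```c
#include <assert.h>
#include <stdint.h>

/* C convention: a true comparison yields 1. */
int64_t c_less(int64_t a, int64_t b)
{
    return a < b;
}

/* Forth convention: a true comparison yields -1 (all bits set).
 * Negating the C result is the "additional instruction". */
int64_t forth_less(int64_t a, int64_t b)
{
    return -(int64_t)(a < b);
}

/* An all-bits-set flag can be used directly as a mask:
 * returns x when flag is -1, y when flag is 0. */
int64_t select_by_flag(int64_t flag, int64_t x, int64_t y)
{
    assert(flag == 0 || flag == -1);
    return (flag & x) | (~flag & y);
}
```

    For example, select_by_flag(forth_less(3, 5), 10, 20) picks 10 without a branch; with the C-style flag the same selection would first need the flag widened to a mask.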

    AMD64 inherits the SETcc instruction that produces a byte result of 1
    for true, and needs another instruction to produce an integer result.
    This tends to result in one additional instruction for Forth, but in
    some cases shorter sequences are possible. E.g., for "5 u<" VFX Forth
    produces the code:

    CMP RBX, # 05
    SBB RBX, RBX
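
    The two-instruction sequence works because CMP leaves the unsigned borrow in the carry flag, and SBB of a register with itself computes 0 minus that carry. A C model of the same arithmetic, with a function name invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Models "CMP RBX, 5 ; SBB RBX, RBX" for the Forth phrase "5 u<".
 * CMP sets the carry flag when x < 5 as unsigned numbers;
 * SBB r, r computes r - r - CF, i.e. 0 or all bits set. */
uint64_t u_less_5(uint64_t x)
{
    uint64_t carry = (x < 5);          /* CF after the CMP */
    uint64_t flag = x - x - carry;     /* what SBB RBX, RBX leaves in RBX */
    assert(flag == 0 || flag == UINT64_MAX);
    return flag;
}
```

    This only models the result; whether a given C compiler emits CMP/SBB for such source is not something the sketch claims.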

    In AVX2 the comparison instructions produce all-bits-set for true.
    Did they design it for Forth? Probably not, but they designed it for
    use of the values with bitwise operations (and, or, xor), just like
    the designers of Forth-83 did. In any case, this is a counterexample
    to the theory that everything is designed to accommodate C.

    ARM A64 supports reifying flags in the Forth way just as well as it
    does reifying them in the C way. E.g., in Gforth the code for < is:

    cmp x28, x20
    csetm x28, lt // lt = tstop

    Actually, cmp and csetm are aliases for specific uses of more
    versatile instructions. The same code decoded with the more versatile
    instructions:

    subs xzr, x28, x20 #set the flags, throw subtraction result away
    csinv x28, xzr, xzr, ge #select either xzr or xzr-1, depending on flag
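
    For readers without the A64 manual at hand, the semantics of that pair can be modelled in C as follows. This is only a sketch of the architectural behaviour; the function names are hypothetical, and the signed comparison stands in for the flags set by cmp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CSINV Xd, Xn, Xm, cond: Xd = cond ? Xn : ~Xm */
uint64_t csinv(bool cond, uint64_t xn, uint64_t xm)
{
    return cond ? xn : ~xm;
}

/* CSETM Xd, cond is an alias for CSINV Xd, XZR, XZR, <inverted cond>,
 * so "csetm x28, lt" becomes "csinv x28, xzr, xzr, ge". */
uint64_t forth_less_a64(int64_t a, int64_t b)
{
    bool ge = (a >= b);                 /* condition after "cmp a, b" */
    uint64_t flag = csinv(ge, 0, 0);    /* xzr reads as zero */
    assert(flag == 0 || flag == UINT64_MAX);
    return flag;
}
```

    The inverted condition is why the disassembly with the general instruction shows ge where the alias shows lt.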

    It's interesting that ARM A64 has not just the instruction, but also a
    separate mnemonic for the 0-or-all-bits-set case. The ARM A64
    architects obviously had more than just C in mind.

    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp), and
    SPARC contains some support for that, but as one of the IIRC Franz
    Lisp developers has explained in this newsgroup, they actually did not
    use this feature, because the performance benefit was not big enough
    to justify the complications of modifying their tagging architecture
    to make use of that. However, in recent years AMD, ARM, and Intel
    have added features to ignore the top (7,8, or 16) bits in every
    address (how many depends on the feature and the selected variant of
    the feature), probably to support pointer tagging in such programming languages. I am sure that no C need is behind this feature addition.
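
    To show what such a feature saves, here is a software-only sketch of top-byte pointer tagging in C. It assumes a 64-bit platform whose user-space pointers have a zero top byte; the names and the 8-bit tag width are choices made for this example, not any vendor's API.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 56   /* top 8 bits of a 64-bit address; an assumption */

/* Store an 8-bit tag in the top byte of a pointer value. */
uintptr_t tag_ptr(void *p, uint8_t tag)
{
    uintptr_t bits = (uintptr_t)p;
    assert(sizeof(uintptr_t) == 8);
    assert((bits >> TAG_SHIFT) == 0);   /* top byte must be free */
    return bits | ((uintptr_t)tag << TAG_SHIFT);
}

/* Read the tag back out of a tagged pointer value. */
uint8_t ptr_tag(uintptr_t tagged)
{
    return (uint8_t)(tagged >> TAG_SHIFT);
}

/* Without hardware support this mask is needed before every dereference;
 * the address-ignoring features make it unnecessary. */
void *untag_ptr(uintptr_t tagged)
{
    return (void *)(tagged & (((uintptr_t)1 << TAG_SHIFT) - 1));
}
```

    With hardware that ignores the top bits, the tagged value could be dereferenced directly and untag_ptr would disappear from the hot path.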

    I'd love to see new architectures and new hardware features that are
    genuinely different, but they rarely turn up.

    I see lots of architectural features that are not or badly supported
    by C, and so obviously are not designed for C.

    A prominent example is SIMD. Standard C does not have language
    features that map to SIMD instructions, and much as C compiler writers
    want to make use of them with auto-vectorization, the result is
    hit-or-miss. Admittedly, SIMD has existed for more than 50 years, so
    it's not a new architectural feature, but the fact that it has been
    added to many architectures after C became prominent is another
    indication that architects do not restrain themselves to things that C supports.

    Another example is the ADX extension for AMD64 (introduced with
    Broadwell (released 2014)), which does not correspond to a language
    feature of C before C23's _BitInt, and which existing C compilers do
    not support at all AFAICS. Read all about it in <https://repositum.tuwien.at/bitstream/20.500.12708/226349/5/Ertl-2025-Multi-precision%20integer%20arithmetics-vor.pdf>
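
    For context, this is roughly what multi-precision addition looks like in portable C, where the carry has to be recovered with comparisons because the language has no carry flag. It is a minimal sketch with a made-up function name, not code from the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* sum = a + b over n 64-bit limbs, least significant limb first.
 * Returns the carry out of the top limb (0 or 1). */
unsigned mp_add(uint64_t *sum, const uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + carry;
        unsigned c1 = (t < carry);      /* wrapped when adding the carry in */
        uint64_t s = t + b[i];
        unsigned c2 = (s < t);          /* wrapped when adding b[i] */
        sum[i] = s;
        carry = c1 | c2;
    }
    assert(carry <= 1);
    return carry;
}
```

    Whether a compiler maps this loop onto add-with-carry instructions, let alone the two carry chains that ADX provides, is not guaranteed by anything in the source.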

    Even with C programming,
    there are so many things that could be made more efficient with a bit of
    interesting hardware. (I say this with little knowledge of the
    complications implied.) A lot of time in C code is spent in memory
    allocation work - that's got to be a prime candidate for hardware
    acceleration, especially if we can get away from the brutish malloc/free
    approach. (Stack-based allocators are one possibility for a lot of
    allocations.)

    What architectural features do you have in mind?

    There could be hardware support for threading, locking,
    and inter-process communication.

    There is, some better, some worse. First of all, we have shared
    memory rather than distributed memory. Next, we have cache-coherent
    shared memory. The cache coherence architectures are generally
    deficient, with the deficiencies described as "memory model".

    Separate data stacks and return stacks
    would make things faster and more secure.

    It's not clear that more architectural support for that would make
    things faster, see above; plus, one of the results of my work on PAF <https://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-paf.pdf>
    is that for the majority of Forth code, one stack pointer is enough;
    that's not because the architectural support for several stack
    pointers is so bad, but because registers are a limited resource, and
    being stingy with them helps.

    Concerning "more secure", the idea is probably that buffer overflows
    then cannot overwrite return addresses. Interestingly, some
    architectures now support an additional return stack to thwart such
    attacks. Anyway, buffer overflows are still a security problem even
    with separate return stacks, because the data contains code pointers,
    sometimes with some indirections, such as C function pointers,
    pointers to virtual function tables in object-oriented programs etc.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 18 23:57:22 2026
    From Newsgroup: comp.arch

    Anton Ertl [2026-06-18 16:49:40] wrote:
    The early transputers were fast for their time, e.g., with the T414 at
    up to 20MHz (and single-cycle instruction execution) in 1985, while
    the 80386 was introduced in 1985 at 12.5MHz (with at least two cycles
    per instruction), and the MIPS R2000 introduced in 1986 was available
    at up to 15MHz.

    Another interesting project in that ballpark was the [iWarp](https://en.wikipedia.org/wiki/IWarp)

    I never really played with one, but I worked next to one during
    one summer. 🙂


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Fri Jun 19 11:20:10 2026
    From Newsgroup: comp.arch

    On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp), and
    SPARC contains some support for that, but as one of the IIRC Franz Lisp developers has explained in this newsgroup, they actually did not use
    this feature, because the performance benefit was not big enough to
    justify the complications of modifying their tagging architecture to
    make use of that. However, in recent years AMD, ARM, and Intel have
    added features to ignore the top (7,8, or 16) bits in every address (how
    many depends on the feature and the selected variant of the feature), probably to support pointer tagging in such programming languages. I am
    sure that no C need is behind this feature addition.

    The architectural support for tagging in SPARC only avoided the need to
    untag and tag integers in compiled code.

    David Ungar's thesis on SOAR provided measurements of the impact of this
    on benchmarks for Smalltalk.

    The layout of having the tags in the bottom 2 bits of a 32 bit word works
    fine without architectural support, being able to turn on traps for
    unaligned data access helps though.
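
    A short C sketch of that layout, assuming small integers carry tag 00 in the bottom two bits (the type and function names are invented for illustration). With that choice, adding two tagged integers needs no untagging at all; only the tag check is left, and overflow checking is omitted here.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_MASK   3u   /* bottom 2 bits of a 32-bit word */
#define TAG_FIXNUM 0u   /* assumed: fixnums carry tag 00 */

typedef uint32_t lispval;

lispval make_fixnum(int32_t n)
{
    assert(n >= -(1 << 29) && n < (1 << 29));   /* must fit in 30 bits */
    return (uint32_t)n << 2;
}

int32_t fixnum_value(lispval v)
{
    assert((v & TAG_MASK) == TAG_FIXNUM);
    return (int32_t)v / 4;     /* exact: the low two bits are zero */
}

/* With tag 00, the sum of two tagged fixnums is already correctly tagged,
 * so no untag/retag is needed; only the tag check remains. */
lispval fixnum_add(lispval a, lispval b)
{
    assert(((a | b) & TAG_MASK) == TAG_FIXNUM);
    return a + b;   /* overflow checking omitted in this sketch */
}
```

    The same trick extends to three tag bits in a 64-bit word by changing the mask and the shift.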

    In 64 bit machines you can use the bottom three bits for Lisp tags but
    SPARC64 didn't provide instructions to work with this.

    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    In previous discussions, I had tried to press Mitch to see if he could
    remember what kind of benchmarks they had run on the 88100 that showed
    it running Lisp faster than SPARC.

    To me, the old SPEC li benchmark was a test of the speed of an interpreter written in C and doesn't say anything useful about how well a system would
    run Lisp that had been compiled to machine code.

    There were well known (non SPEC) Lisp benchmarks at the time.

    I'd love to see new architectures and new hardware features that are
    genuinely different, but they rarely turn up.

    I see lots of architectural features that are not or badly supported by
    C, and so obviously are not designed for C.

    The one architectural feature of Lisp Machines that I don't think has been carried forward was a multi-way switch instruction.

    The rest of the MIT Lisp Machine microarchitecture was just a pipelined,
    three address, load/store one that provides another data point for the discussion from a few months ago on whether VAX could have been a RISC
    using TTL chips.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Fri Jun 19 15:08:40 2026
    From Newsgroup: comp.arch

    In article <2026Jun19.080216@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    once upon a time Sun boasted architectural features to support
    Java. AFAIK these were features to improve the indirect-branch
    performance in the Java VM interpreter, to improve the startup
    performance.

    Did this become obsolete when Java runtime environments switched to
    JITing to native code?

    Admittedly, SIMD has existed for more than 50 years, so it's not
    a new architectural feature, but the fact that it has been added
    to many architectures after C became prominent is another
    indication that architects do not restrain themselves to things
    that C supports.

    They don't, but they don't do a good job of making those features usable either. Support for new instructions is readily provided via intrinsics,
    but those aren't portable. Back when I had a close relationship with
    Intel, it seemed that they assumed all software would be built for a
    specific combination of manufacturer and chip generation, and software suppliers would happily maintain several such versions simultaneously.
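
    A small C sketch of what that means in practice: the same loop written once with x86 SSE2 intrinsics and once as a portable fallback, selected at compile time. The function name is hypothetical, and the __SSE2__ test relies on the macro GCC and Clang predefine; other compilers will simply take the scalar path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* x86 SSE2 intrinsics; not available elsewhere */
#endif

/* dst[i] = a[i] + b[i] for n 32-bit integers. */
void add_i32(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;
    assert(dst && a && b);
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
#endif
    for (; i < n; i++)          /* scalar tail, and the portable fallback */
        dst[i] = a[i] + b[i];
}
```

    Every additional instruction set or chip generation adds another such conditional block, which is the maintenance burden described above.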

    John
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Jun 19 16:16:05 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:
    -----------------------
    Of course, the challenge here is that programming is significantly
    different from what we are used to, and you need a new type of OS as
    well as new applications - and ideally, new programming languages.
    That's a lot of big hurdles to clear even if the result is theoretically
    more efficient. (And you still need to keep the fast single-threaded
    cpus, and the fast SIMD / vector processing systems, for other kinds of
    tasks.)

    Possibly the biggest millstone around the neck of computing
    architectures is the C language.

    C is not an albatross !! it is a standard to which one designs--
    exactly like air (our atmosphere) provides the standard to which
    airplanes have to be designed.

    C ended up being this model because its floor supports almost all other programming languages: certainly {Fortran, C++, Algol68, Pascal, Jovial,
    ...} and is not all that bad when doing {LISP, RPG, COBOL, APL, SNOBOL}.

    For every processor that is not just
    for highly niche code (like gpus), what matters is how fast C code can
    run on it. Most other languages either use a similar model, or run on
    VMs written in C.

    You state that like it was a BAD thing--it is not. I just wish we had all
    chosen the same standards at which to design; {BE or LE} is like having
    the steering wheel on the {left or right}, ...

    Why bother with support for multiple stacks, or other
    interesting hardware innovations, if it doesn't support faster C?

    I can argue that having 2 stacks {one for the preserved state from
    caller to callee, the other for data} enables ever so slightly faster
    C--but that is not the point--the point is robustness in the face of
    threats (buffer overruns, ROP, malicious use of memory).

    The speed advantage is by knowing that registers written to the call-
    stack do not need to be written to L2 (or farther) if RET has occurred
    and the cache line replaced. Saves a trifling of power, too. There is
    another advantage when an EXIT instruction is still reading from
    stack and an ENTER instruction starts writing to the stack. We taught
    the compiler a prescribed order to utilize the registers, so that when
    an EXIT is running and an ENTER is decoded, the EXIT can be short
    circuited and some of the ENTER short circuited, eliding cycle waste.

    With
    all due respect to Anton and other Forth enthusiasts, "fastest Forth
    benchmarks" is not going to attract much investment money.

    I'd love to see new architectures and new hardware features that are
    genuinely different, but they rarely turn up.

    My 66000 is replete with those features--and it is argued here daily
    that it (my 66000 ISA) has gone too far !!

    [Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere in memory at any time.

    Nice to have.

    Jump-Through-Table (switch) making PIC standard; while making the tables smaller {1/8th to 1/4th}

    I used this idea in its extreme form when my 486 Word Count code had
    the state variable in BL and loaded the next byte into BH: At this point
    I could jump directly to the code BX was pointing to, so a 256*number of
    main states (=2, inside or outside a word) => a 512-entry jump table.

    When the Pentium turned up a few years later, branching got relatively
    even costlier, so I got rid of every branch inside the 256-byte main processing loop.
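
    A C rendering of the same idea, with a 2 x 256 table indexed by state and input byte. This is a data-table sketch written for illustration, not the original 486 jump-table code; the names and the whitespace set are assumptions. Each entry carries the next state and whether the byte starts a new word, so the loop body needs no conditional branch.

```c
#include <assert.h>
#include <stddef.h>

/* action[state][byte]: bit 0 = next state (1 = inside a word),
 * bit 1 = set when this byte starts a new word. */
static unsigned char action[2][256];
static int table_ready;

static void init_table(void)
{
    for (int s = 0; s < 2; s++)
        for (int c = 0; c < 256; c++) {
            int space = (c == ' ' || c == '\t' || c == '\n' ||
                         c == '\r' || c == '\v' || c == '\f');
            int inside = !space;
            int starts = inside && s == 0;
            action[s][c] = (unsigned char)(inside | (starts << 1));
        }
    table_ready = 1;
}

size_t count_words(const unsigned char *buf, size_t len)
{
    size_t words = 0;
    unsigned state = 0;                 /* start outside a word */
    assert(buf != NULL || len == 0);
    if (!table_ready)
        init_table();
    for (size_t i = 0; i < len; i++) {
        unsigned act = action[state][buf[i]];
        words += act >> 1;              /* 1 only on outside -> inside */
        state = act & 1;
    }
    return words;
}
```

    The table has 512 entries, matching the 256 byte values times 2 main states described above.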


    Load IP instructions (CALX, JMPX, CALA, JMPA) enable control transfers directly through GOT (or other SW table).

    Also nice to have.


    Multi-line multi-instruction ATOMIC sequences freely available to SW.

    Transcendental instructions that take FDIV number of cycles.

    :-)


    Context Switches performed without instruction execution--as if the
    state of a thread was treated like a write-back cache.

    Interrupt tables that can be used as a low level scheduler built into
    the priority (and privilege) model with support for vVMs monitoring vMs.
    One can schedule a DPC/softIRQ in 1 instruction that never fails
    (excepting when the interrupt message takes an unrecoverable ECC failure
    between core and table.)

    Even with C programming,
    there are so many things that could be made more efficient with a bit of
    interesting hardware. (I say this with little knowledge of the
    complications implied.) A lot of time in C code is spent in memory
    allocation work - that's got to be a prime candidate for hardware
    acceleration, especially if we can get away from the brutish malloc/free
    approach.

    And then C++ goes all 'new' on using memory...

    (Stack-based allocators are one possibility for a lot of
    allocations.)
    There could be hardware support for threading, locking,
    and inter-process communication.

    My 66000 can switch threads in a single instruction.
    My 66000 ESM provides unrealized synchronization capabilities.
    My 66000 Interrupt tables provide single instruction message send and
    single instruction message receives.
    SW determines what the messages mean.

    Separate data stacks and return stacks
    would make things faster and more secure.

    Already present. But in addition, My 66000 DRAM controller is free from RowHammer-like attack vectors--By Architecture of the memory hierarchy.

    Plus, the code sees a 64-bit Virtual Address Space, while the system has
    four 64-bit physical address spaces. The spaces are used to determine
    the consistency model:: {
    Cacheable DRAM is causally ordered and coherent and cached
    unCacheable DRAM is sequentially consistent incoherent
    ROM is unordered incoherent but cached
    MMI/O is sequentially consistent incoherent
    Config is strongly ordered incoherent
    }
    Fat pointers that can track
    access modes and range limits, at least in common cases, would aid
    reliability and security.

    A bit Too "Cheri" for me.

    :-)

    The Mill is probably the closest to Cheri that is still in active
    development.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jun 19 14:34:10 2026
    From Newsgroup: comp.arch


    Robert Swindells <rjs@fdy2.co.uk> posted:

    On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:

    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp), and
    SPARC contains some support for that, but as one of the IIRC Franz Lisp developers has explained in this newsgroup, they actually did not use
    this feature, because the performance benefit was not big enough to
    justify the complications of modifying their tagging architecture to
    make use of that. However, in recent years AMD, ARM, and Intel have
    added features to ignore the top (7,8, or 16) bits in every address (how many depends on the feature and the selected variant of the feature), probably to support pointer tagging in such programming languages. I am sure that no C need is behind this feature addition.

    The architectural support for tagging in SPARC only avoided the need to
    untag and tag integers in compiled code.

    David Ungar's thesis on SOAR provided measurements of the impact of this
    on benchmarks for Smalltalk.

    The layout of having the tags in the bottom 2 bits of a 32 bit word works fine without architectural support, being able to turn on traps for unaligned data access helps though.

    In 64 bit machines you can use the bottom three bits for Lisp tags but SPARC64 didn't provide instructions to work with this.
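    A minimal C sketch of the low-bit tagging layout described above, for illustration only. The tag assignments (00 for a small integer, 01 for a cons pointer) and every name here are assumptions, not taken from any particular Lisp implementation. With tag 00 on integers, two tagged fixnums can be added directly and the sum is still a tagged fixnum, so no untag/retag step is needed; only the tag check remains.

```c
#include <stdint.h>

/* Hypothetical layout: the low 2 bits of a word hold the tag. */
typedef uintptr_t lispval;

#define TAG_MASK   ((lispval)3)
#define TAG_FIXNUM ((lispval)0)   /* assumed: 00 = small integer */
#define TAG_CONS   ((lispval)1)   /* assumed: 01 = pointer to a cons cell */

/* Store n in the upper bits; the low 2 bits become the 00 tag. */
static inline lispval make_fixnum(intptr_t n)
{
    return (lispval)n << 2;
}

static inline int is_fixnum(lispval v)
{
    return (v & TAG_MASK) == TAG_FIXNUM;
}

/* Exact division because the low 2 bits are zero. Relies on the usual
   two's-complement conversion from uintptr_t to intptr_t. */
static inline intptr_t fixnum_value(lispval v)
{
    return (intptr_t)v / 4;
}

/* Both operands end in 00, so the plain sum is already a tagged fixnum
   (overflow is ignored in this sketch). */
static inline lispval fixnum_add(lispval a, lispval b)
{
    return a + b;
}

/* Requires p to be at least 4-byte aligned so the low bits are free. */
static inline lispval tag_cons(void *p)
{
    return (lispval)p | TAG_CONS;
}

static inline void *untag_cons(lispval v)
{
    return (void *)(v - TAG_CONS);
}
```

    On a 64-bit machine with 8-byte-aligned objects the same idea extends to a 3-bit tag by widening TAG_MASK and the shift.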

    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    In previous discussions, I had tried to press Mitch to see if he could remember what kind of benchmarks they had run on the 88100 that showed
    it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to
    use 2 shifts to perform an extract; indexing was scaled:: both helped interpreters.
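    To make the extract point concrete, here is a small C sketch of an unsigned bit-field extract done two ways. The function names are made up for illustration, and the code does not model either instruction set exactly: the first form is the two-shift sequence, the second computes the same value the way a single extract operation conceptually does.

```c
#include <stdint.h>

/* Two-shift form: push the field to the top of the word, then pull it
   back down. Requires width >= 1 and lsb + width <= 32. */
static inline uint32_t extract_two_shifts(uint32_t x, unsigned lsb,
                                          unsigned width)
{
    return (x << (32 - lsb - width)) >> (32 - width);
}

/* Shift-and-mask form: the value a single extract instruction with an
   (offset, width) operand would produce. Same preconditions. */
static inline uint32_t extract_mask(uint32_t x, unsigned lsb,
                                    unsigned width)
{
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (x >> lsb) & mask;
}
```

    An interpreter that decodes tags or opcode fields on every step pays for that extra shift each time, which is consistent with the claim above.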

    To me, the old SPEC li benchmark was a test of the speed of an interpreter written in C and doesn't say anything useful about how well a system would run Lisp that had been compiled to machine code.

    There were well known (non SPEC) Lisp benchmarks at the time.

    I'd love to see new architectures and new hardware features that are genuinely different, but they rarely turn up.

    I see lots of architectural features that are not or badly supported by
    C, and so obviously are not designed for C.

    The one architectural feature of Lisp Machines that I don't think has been carried forward was a multi-way switch instruction.

    My 66000 has a jump-through-table instruction which performs the C-switch including range checks and default substitutions--doing the work of 4-5
    normal instructions and allowing the jump table to be <short> integer data instead of pointers.
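    A rough C analogue of what such an instruction folds together, as a sketch only: the range check, the default substitution, and a table of small integers instead of pointers. The handler names, the case range 10..13, and the use of byte-sized indices (rather than the PC-relative offsets a real table would presumably hold) are all assumptions for illustration.

```c
#include <stdint.h>

typedef int (*handler_fn)(int);

static int on_default(int x) { (void)x; return -1; }
static int on_even(int x)    { return x; }
static int on_odd(int x)     { return x + 1; }

/* Handlers are reached through a small index, not a full pointer. */
static const handler_fn handlers[] = { on_default, on_even, on_odd };

/* One byte per case instead of one 8-byte pointer per case.
   Covers selectors 10..13; 11 has no case label, so it holds the
   default index 0. */
enum { CASE_MIN = 10, CASE_COUNT = 4 };
static const uint8_t case_table[CASE_COUNT] = { 1, 0, 2, 1 };

static int dispatch(int selector, int arg)
{
    /* Unsigned subtraction turns "below CASE_MIN" into a huge value,
       so a single compare performs both halves of the range check. */
    unsigned idx = (unsigned)selector - (unsigned)CASE_MIN;
    uint8_t h = (idx < CASE_COUNT) ? case_table[idx] : 0;
    return handlers[h](arg);
}
```

    In this sketch the subtract, compare, default select, table load and indirect jump are separate steps; the post's claim is that the instruction performs all of them at once.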

    The rest of the MIT Lisp Machine microarchitecture was just a pipelined, three address, load/store one that provides another data point for the discussion from a few months ago on whether VAX could have been a RISC
    using TTL chips.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Jun 19 18:16:16 2026
    From Newsgroup: comp.arch

    On Fri, 19 Jun 2026 16:16:05 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:
    -----------------------
    Of course, the challenge here is that programming is significantly
    different from what we are used to, and you need a new type of OS
    as well as new applications - and ideally, new programming
    languages. That's a lot of big hurdles to clear even if the result
    is theoretically more efficient. (And you still need to keep the
    fast single-threaded cpus, and the fast SIMD / vector processing
    systems, for other kinds of tasks.)

    Possibly the biggest millstone around the neck of computing
    architectures is the C language.

    C is not an albatross !! it is a standard to which one designs--
    exactly like air (our atmosphere) provides the standard to which
    airplanes have to be designed.

    C ended up being this model because its floor supports almost all
    other programming languages: certainly {Fortran, C++, Algol68,
    Pascal, Jovial, ...} and is not all that bad when doing {LISP, RPG,
    COBOL, APL, SNOBOL}.
    For every processor that is not
    just for highly niche code (like gpus), what matters is how fast C
    code can run on it. Most other languages either use a similar
    model, or run on VMs written in C.

    You state that like it was a BAD thing--it is not. I just wish we had
    all chosen the same standards to which to design: {BE or LE} is like
    having the steering wheel on the {left or right}, ...

    Why bother with support for multiple stacks,
    or other interesting hardware innovations, if it doesn't support
    faster C?

    I can argue that having 2 stacks {one for the preserved state from
    caller to callee, the other for data} enables ever so slightly
    faster C--but that is not the point--the point is robustness in the
    face of threats (buffer overruns, ROP, malicious use of memory).

    The speed advantage comes from knowing that registers written to the
    call-stack do not need to be written to L2 (or farther) if RET has
    occurred and the cache line is replaced. Saves a trifling of power,
    too. There is another advantage when an EXIT instruction is
    still reading from the stack and an ENTER instruction starts writing to
    the stack. We taught the compiler a prescribed order in which to utilize
    the registers, so that when an EXIT is running and an ENTER is decoded,
    the EXIT can be short circuited and some of the ENTER short
    circuited, eliding cycle waste.
    With all due respect to Anton and other Forth enthusiasts, "fastest
    Forth benchmarks" is not going to attract much investment money.

    I'd love to see new architectures and new hardware features that
    are genuinely different, but they rarely turn up.

    My 66000 is replete with those features--and it is argued here daily
    that it (my 66000 ISA) has gone too far !!

    [Rbase+Rindex<<scale+DISP] is more than most would allow. Yet with universal Constants, a single memory reference can access anywhere
    in memory at any time.

    Nice to have.

    Jump-Through-Table (switch) making PIC standard; while making the
    tables smaller {1/8th to 1/4th}

    I used this idea, in its extreme form when my 486 Word Count code had
    the state variable in BL and loaded the next byte into BH: At this
    point I could jump directly to the code BX was pointing to, so a
    256*number of main states (=2, inside or outside a word) => a
    512-entry jump table.

    When the Pentium turned up a few years later, branching got
    relatively even costlier, so I got rid of every branch inside the
    256-byte main processing loop.
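    A table-driven word counter in the same spirit can be sketched in C as follows. This is not the original 486 code: that version jumped through a 512-entry table of code addresses, whereas this sketch swaps in a 512-entry data table and has no data-dependent branch in the loop body. The separator set and all names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* transition[state][byte], state 0 = outside a word, 1 = inside.
   Bit 0 of an entry is the next state; bit 1 is set when this byte
   starts a new word. 2 states * 256 bytes = 512 entries. */
static uint8_t transition[2][256];
static int transition_ready;

/* Assumed separator set; the original post does not specify one. */
static int is_separator(int c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\f' || c == '\v';
}

static void init_transition(void)
{
    for (int state = 0; state < 2; state++) {
        for (int c = 0; c < 256; c++) {
            unsigned in_word = !is_separator(c);
            unsigned starts_word = in_word && state == 0;
            transition[state][c] = (uint8_t)(in_word | (starts_word << 1));
        }
    }
    transition_ready = 1;
}

static size_t count_words(const unsigned char *buf, size_t len)
{
    if (!transition_ready)
        init_transition();

    size_t words = 0;
    unsigned state = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t e = transition[state][buf[i]];
        words += e >> 1;   /* 1 only on an outside-to-inside step */
        state = e & 1u;
    }
    return words;
}
```

    The state and the input byte together form the table index, which mirrors the BL/BH trick of building the index in BX.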


    Load IP instructions (CALX, JMPX, CALA, JMPA) enable control
    transfers directly through GOT (or other SW table).

    Also nice to have.


    Multi-line multi-instruction ATOMIC sequences freely available to
    SW.

    Transcendental instructions that take FDIV number of cycles.

    :-)


    "FDIV number of cycles" is a moving target. Mitch has a tendency of
    using "Opteron" as his measurement stick. The question of what is
    "number of cycles" is also not obvious. Single or double precision?
    Latency or throughput?

    Apple has single-cycle FDIV throughput since ~2019. That applies to
    both scalar and 128bit SIMD variants of instruction.
    So, for single-precision vector variant the throughput is 4 FDIV per
    clock.

    Intel has 4-cycle SP FDIV throughput (or 5-cycle by other sources) for
    256-bit vectors since 2015. That's 2 SP FDIV per clock.

    AMD started with 6-cycle 256-bit SP FDIV on Zen1.
    It progressed to 3-3.5 cycles on Zen 2/3/4. Then on Zen5 they
    progressed to 3 cycles per 512bit vector register. So, by now they are
    at 5.33 SP FDIV per clock - ahead of Apple of 6 years ago.
    I don't know where Apple stands right now.
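    The per-clock figures above follow from lanes per vector divided by cycles per vector. A tiny C helper makes the arithmetic explicit; the function name is made up, and the inputs are simply the numbers quoted in this post, not independent measurements.

```c
/* Single-precision FDIV results per clock for a vector divide:
   (vector width / 32-bit lane width) / reciprocal throughput. */
static double sp_fdiv_per_clock(unsigned vector_bits,
                                double cycles_per_vector)
{
    return (vector_bits / 32.0) / cycles_per_vector;
}
```

    With the quoted inputs this gives 4 per clock for a 128-bit vector every cycle, 2 per clock for a 256-bit vector every 4 cycles, and 16/3 (about 5.33) per clock for a 512-bit vector every 3 cycles.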

    Somehow, I suspect that when Mitch says that his transcendental
    instructions "take FDIV number of cycles" he does not mean that he can
    run 5.33 transcendental instructions per clock.

    Against DP rather than SP and latency rather than throughput, Mitch's
    claim is probably closer to reality. But still...
    Apple of 6 years ago had latency of DP FDIV = 10.
    AMD has worst case latency = 13 since Zen1 (best latency used to be
    faster, but on newer chips worst and best are the same).
    Intel has worst case DP FDIV latency = 14 since Ivy Bridge (2012-04).
    That's probably close to the date when Mitch started to consider His
    66000.


    Context Switches performed without instruction execution--as if the
    state of a thread was treated like a write-back cache.

    Interrupt tables that can be used as a low level scheduler built
    into the priority (and privilege) model with support for vVMs
    monitoring vMs
    One can schedule a DPC/softIRQ in 1 instruction that
    never fails (excepting when the interrupt message takes an
    unrecoverable ECC failure between core and table.)

    Even with C
    programming, there are so many things that could be made more
    efficient with a bit of interesting hardware. (I say this with
    little knowledge of the complications implied.) A lot of time in
    C code is spent in memory allocation work - that's got to be a
    prime candidate for hardware acceleration, especially if we can
    get away from the brutish malloc/free approach.

    And then C++ goes all 'new' on using memory...

    (Stack-based allocators are one possibility for a lot
    of allocations.)
    There could be hardware support for threading,
    locking, and inter-process communication.

    My 66000 can switch threads in a single instruction.
    My 66000 ESM provides unrealized synchronization capabilities.
    My 66000 Interrupt tables provide single instruction message send
    and single instruction message receives.
    SW determines what the messages mean.

    Separate data stacks and return
    stacks would make things faster and more secure.

    Already present. But in addition, My 66000 DRAM controller is free
    from RowHammer-like attack vectors--By Architecture of the memory hierarchy.

    Plus, the code sees a 64-bit Virtual Address Space, while the
    system has four 64-bit physical address spaces. The spaces are used
    to determine the consistency model:: {
    Cacheable DRAM is causally ordered and coherent and cached
    unCacheable DRAM is sequentially consistent incoherent
    ROM is unordered incoherent but cached
    MMI/O is sequentially consistent incoherent
    Config is strongly ordered incoherent
    }
    Fat pointers that can
    track access modes and range limits, at least in common cases,
    would aid reliability and security.

    A bit Too "Cheri" for me.

    :-)

    The Mill is probably the closest to Cheri that is still in active development.

    Terje



    Did not Ivan say that he likes capabilities, but decided that Mill
    already has too many innovative concepts as it is, and that including
    capabilities would be too much?
    Also, claiming that Mill is still in active development sounds to me as
    a stretch of the word "active".





    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Fri Jun 19 20:20:31 2026
    From Newsgroup: comp.arch

    On 2026-06-19 18:16, Michael S wrote:
    On Fri, 19 Jun 2026 16:16:05 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    [snip]

    Fat pointers that can
    track access modes and range limits, at least in common cases,
    would aid reliability and security.

    A bit Too "Cheri" for me.

    :-)

    The Mill is probably the closest to Cheri that is still in active
    development.

    Terje



    Did not Ivan say that he likes capabilities, but decided that Mill
    already has too many innovative concepts as it is, and that including
    capabilities would be too much?

    As I recall, Ivan used to say that he knew how to /build/ a capability machine, but did not know how to /sell/ it. I believe he meant that such
    a machine would not run "normal" C/C++ code, at least not very well,
    which would turn away many potential customers.




    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Jun 19 18:59:47 2026
    From Newsgroup: comp.arch

    According to David Brown <david.brown@hesbynett.no>:
    Possibly the biggest millstone around the neck of computing
    architectures is the C language. ...

    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    C killed off every memory model other than flat byte addressed memory.
    Pointers are sort of typed, but any real C program does stuff like this:

    p = (struct foo *) malloc(42 * sizeof(struct foo));

    or worse

    typedef struct { // string with explicit length
        int len;
        char str[0];
    } varstr;

    varstr *p;
    char *s = "swordfish";

    // initialize p from s
    p = (varstr *)malloc(sizeof(varstr)+strlen(s));
    p->len = strlen(s);
    strncpy(p->str, s, p->len);

    so in practice pointers all have to be pointers to bytes or something
    that can losslessly be converted to and from them.

    This evolution was certainly helped along by the horrible implementation
    of segmented memory in the Intel 8086 and 286, which persuaded people
    that segments are a plague to be avoided rather than a tool to make
    programs more reliable.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 19 13:09:04 2026
    From Newsgroup: comp.arch

    On 6/19/2026 11:59 AM, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    Possibly the biggest millstone around the neck of computing
    architectures is the C language. ...

    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    C killed off every memory model other than flat byte addressed memory. Pointers are sort of typed, but any real C program does stuff like this:

    p = (struct foo *) malloc(42 * sizeof(struct foo));

    Fwiw, why all of the casts?
    __________________
    #include <stdio.h>
    #include <stdlib.h>


    struct foo
    {
        int m_bar;
    };


    int main()
    {
        struct foo* foo = malloc(sizeof(*foo));

        printf("foo = %p", (void*)foo); // cast needed for %p

        free(foo);

        return 0;
    }
    __________________



    or worse

    typedef struct { // string with explicit length
        int len;
        char str[0];
    } varstr;

    varstr *p;
    char *s = "swordfish";

    // initialize p from s
    p = (varstr *)malloc(sizeof(varstr)+strlen(s));
    p->len = strlen(s);
    strncpy(p->str, s, p->len);

    so in practice pointers all have to be pointers to bytes or something
    that can losslessly be converted to and from them.

    This evolution was certainly helped along by the horrible implementation
    of segmented memory in the Intel 8086 and 286, which persuaded people
    that segments are a plague to be avoided rather than a tool to make
    programs more reliable.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 19 13:10:37 2026
    From Newsgroup: comp.arch

    On 6/19/2026 7:16 AM, Terje Mathisen wrote:
    [...]
    The Mill is probably the closest to Cheri that is still in active development.

    How close are you guys to making a Mill processor?

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jun 19 22:03:38 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Fri, 19 Jun 2026 16:16:05 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:
    ---------------------
    Transcendental instructions that take FDIV number of cycles.

    :-)


    "FDIV number of cycles" is a moving target. Mitch has a tendency of
    using "Opteron" as his measurement stick. The question of what is
    "number of cycles" is also not obvious. Single or double precision?
    Latency or throughput?

    One can use more/less HoBs of the fraction to index coefficient tables.
    This enables a tradeoff of FMAC cycles with table size. Given one has
    64-bit constant fractions of 2/e, 10/e, and 1072 bits of 2/pi::
    {
    For a coefficient table with size equal to that of Goldschmidt FDIV+
    SQRT tables:: one can implement {Ln2, Ln, LOG, Exp2, Exp, Exp10, SIN,
    COS, TAN, ASIN, ACOS, ATAN, POW, POWI, ATAN2} that use 9 multiplies
    and 8 adds to achieve 17-cycle Transcendentals. This number is similar
    to FDIV at Opteron times.

    POW is basically 2 transcendentals with a 64-bit fraction multiply in
    the middle. I see 38-cycles as typical.

    ATAN2 is ½ the time 1 transcendental and a 64-bit fraction multiply, and
    the other ½ the time 2 transcendentals and a 64-bit fraction multiply.
    So, typically ~26-cycles.

    And ROUGHLY::
    For every doubling (2×) of table size one saves 1 multiply and 1 add.
    So, a quad-sized table could perform these in 13-cycles, or a quarter
    (¼×) sized table could perform at 21-cycles.
    }
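    The table-size tradeoff stated above can be written as a one-line model. This is only a restatement of the rough rule in this post (9 multiplies and 8 adds at the baseline table size, one multiply and one add saved per doubling), assuming one cycle per operation as the 17-cycle figure implies; the function name and the log2 parameter are made up, and the rule is only claimed to hold ROUGHLY over a small range.

```c
/* log2_table_scale: 0 = baseline (Goldschmidt-sized) table,
   +2 = table 4x larger, -2 = table 4x smaller. */
static int transcendental_cycles(int log2_table_scale)
{
    int multiplies = 9 - log2_table_scale;
    int adds = 8 - log2_table_scale;
    return multiplies + adds;   /* assumes 1 cycle per operation */
}
```

    This reproduces the quoted points: 17 cycles at the baseline, 13 with a quad-sized table, and 21 with a quarter-sized table.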

    Apple has single-cycle FDIV throughput since ~2019. That applies to
    both scalar and 128bit SIMD variants of instruction.

    What is the latency of each step in:
    FDIV Rd,---,--
    FDIV Rd,Rs1,Rd
    FDIV Rd,Rs2,Rd
    // until blue in the face
    ???

    So, for single-precision vector variant the throughput is 4 FDIV per
    clock.

    Intel has 4-cycle SP FDIV throughput (or 5-cycle by other sources) for 256-bit vectors since 2015. That's 2 SP FDIV per clock.

    All of the numbers I have talked about (above and before) are FP64.
    FP32 numbers are roughly 2/3rds the cycle count.

    AMD started with 6-cycle 256-bit SP FDIV on Zen1.
    It progressed to 3-3.5 cycles on Zen 2/3/4. Then on Zen5 they
    progressed to 3 cycles per 512bit vector register. So, by now they are
    at 5.33 SP FDIV per clock - ahead of Apple of 6 years ago.
    I don't know where Apple stands right now.

    Somehow, I suspect that when Mitch says that his transcendental
    instructions "take FDIV number of cycles" he does not mean that he can
    run 5.33 transcendental instructions per clock.

    Given a K-wide SIMD vector of FMAC units, I can run {K double, 2×K single, 4×K
    half} transcendentals in {quoted, 2/3×quoted, 4/9×quoted}-cycles.

    Against DP rather than SP and latency rather than throughput, Mitch's
    claim is probably closer to reality. But still...

    Apple of 6 years ago had latency of DP FDIV = 10.
    AMD has worst case latency = 13 since Zen1 (best latency used to be
    faster, but on newer chips worst and best are the same).
    Intel has worst case DP FDIV latency = 14 since Ivy Bridge (2012-04).
    That's probably close to the date when Mitch started to consider His
    66000.

    It's all a tradeoff between table size and loop iterations.

    -------------
    Fat pointers that can
    track access modes and range limits, at least in common cases,
    would aid reliability and security.

    A bit Too "Cheri" for me.

    :-)

    The Mill is probably the closest to Cheri that is still in active development.

    Terje



    Did not Ivan say that he likes capabilities, but decided that Mill
    already has too many innovative concepts as it is, and that including
    capabilities would be too much?

    Given some thought, adding capabilities is acceptable, so long as one
    can support a flat VAS of at least 64-bits. In effect, capability
    pointers add the capability stuff in another 64-bit container attached
    to the pointer.

    Also, claiming that Mill is still in active development sounds to me as
    a stretch of the word "active".
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadibloc@quadibloc@invalid.com (John Savard) to comp.arch on Fri Jun 19 22:43:28 2026
    From Newsgroup: comp.arch

    On Fri, 19 Jun 2026 09:24:43 +0200, David Brown
    <david.brown@hesbynett.no> wrote:

    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    This is not necessarily a bad thing - it lets everyone get stuff done.
    But it means that we are stuck on a local maxima. If there is a better
    way out there somewhere, it would be a long and arduous journey to get
    there.

    I might well be willing to concede that C does have its flaws. But
    these are flaws it has _as a programming language_, and not flaws that
    have affected the design of computers. Why do I say this?

    Because while C, as a procedural language, took a lot from its
    predecessors - its punctuation came from PL/I - it was designed to not
    be too different from the underlying hardware. It had pointers to
    memory, which most programming languages up until then did not bother
    with, in order to be able to substitute for assembler language.

    If, instead, a "better" language, like ALGOL or Pascal, had become the "standard", we might have ended up with computers like the Burroughs
    B6500 or the Intel 432. That would indeed have been a situation where
    computers were less efficient and powerful because they were designed
    around the peculiarities of the most-used higher-level language.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From quadibloc@quadibloc@invalid.com (John Savard) to comp.arch on Fri Jun 19 22:49:53 2026
    From Newsgroup: comp.arch

    On Thu, 18 Jun 2026 14:39:07 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    Back when logic and memory were more evenly matched, computers still
    had one register - the accumulator. And instructions basically did
    arithmetic between memory and the accumulator. Of course,
    memory-to-memory operations were also possible without even an
    accumulator.
    But since memory isn't likely to get that fast, it's not really useful
    to think of how to design for something that can't happen.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Jun 20 00:07:03 2026
    From Newsgroup: comp.arch

    quadibloc@invalid.com (John Savard) writes:
    On Thu, 18 Jun 2026 14:39:07 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    Back when logic and memory were more evenly matched, computers still
    had one register - the accumulator. And instructions basically did
    arithmetic between memory and the accumulator. Of course,
    memory-to-memory operations were also possible without even an
    accumulator.

    And some computers in those days simply used memory to memory operations exclusively without needing an accumulator.

    Memory superscalar/OoO require a ROB that works with addresses rather than
    registers (perhaps CAM based); the size of the ROB is still limited
    to the degree of OoO.

    That noted, it seems to me that if access to all of memory costs
    the same as access to a register, the need for OoO support in
    the core becomes less interesting when the normal delays for which
    instruction-level parallelism helps don't apply
    (e.g. cache misses, NUMA latency, etc).

    But since memory isn't likely to get that fast, it's not really useful
    to think of how to design for something that can't happen.

    Never say never.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Jun 20 00:09:48 2026
    From Newsgroup: comp.arch

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/19/2026 11:59 AM, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    Possibly the biggest millstone around the neck of computing
    architectures is the C language. ...

    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    C killed off every memory model other than flat byte addressed memory.
    Pointers are sort of typed, but any real C program does stuff like this:

    p = (struct foo *) malloc(42 * sizeof(struct foo));

    Fwiw, why all of the casts?

    C and C++ handle void* conversions differently. You must cast
    the malloc result to a pointer of the declared type when using C++.

    It doesn't hurt to add the cast in C, and may help with documenting
    the intention of the programmer who wrote the code.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 01:01:29 2026
    From Newsgroup: comp.arch


    quadibloc@invalid.com (John Savard) posted:

    On Fri, 19 Jun 2026 09:24:43 +0200, David Brown
    <david.brown@hesbynett.no> wrote:

    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    This is not necessarily a bad thing - it lets everyone get stuff done.
    But it means that we are stuck on a local maxima. If there is a better
    way out there somewhere, it would be a long and arduous journey to get
    there.

    I might well be willing to concede that C does have its flaws. But
    these are flaws it has _as a programming language_, and not flaws that
    have affected the design of computers. Why do I say this?

    Because while C, as a procedural language, took a lot from its
    predecessors - its punctuation came from PL/I

    I think it comes closer to the Algol line of languages; but more
    like BCPL, Bliss, and B; than contemporaneous others.

    PL/1 has things like <well> like::

    struct {type var, var1;
    like other_struct}; // so you don't have to type it all in again

    - it was designed to not
    be too different from the underlying hardware. It had pointers to
    memory,

    PL/1s most useful memory trick was using an area. So, one could 'malloc'
    a bunch of data, and then free it all with one free! Nothing prevents C
    from doing this, but C++ has new and new is not compatible with area.
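    For readers who have not met the idea, here is a minimal C sketch of area-style allocation: many allocations carved from one block, all released by a single free. It is a generic bump allocator, not a model of PL/1's AREA semantics; the type and function names are made up, and it assumes a C11 compiler for _Alignof and max_align_t.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;   /* the single underlying malloc block */
    size_t used;           /* bytes handed out so far */
    size_t cap;            /* total bytes available */
} arena;

/* Returns 1 on success, 0 if the backing allocation failed. */
static int arena_init(arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->used = 0;
    a->cap = a->base ? cap : 0;
    return a->base != NULL;
}

/* Bump allocation: round the offset up to the strictest alignment,
   then hand out the next n bytes. Returns NULL when the arena is full. */
static void *arena_alloc(arena *a, size_t n)
{
    const size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) / align * align;
    if (start > a->cap || n > a->cap - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* One free releases every object allocated from the arena. */
static void arena_free_all(arena *a)
{
    free(a->base);
    a->base = NULL;
    a->used = 0;
    a->cap = 0;
}
```

    Individual objects are never freed on their own in this scheme, which is what makes the single release possible.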

    which most programming languages up until then did not bother
    with, in order to be able to substitute for assembler language.

    If, instead, a "better" language, like ALGOL

    Algol was ruined with its parameter passing in 'thunks' and strict
    1-file compilation.

    or Pascal, had become the "standard", we might have ended up with computers like the Burroughs
    B6500 or the Intel 432.

    That is one bullet we dodged!!

    That would indeed have been a situation where computers were less efficient and powerful because they were designed
    around the peculiarities of the most-used higher-level language.

    Instead, C presents a programming model way down at the vonNeumann level::
    1 unit of work (step) at a time.

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 01:06:10 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    quadibloc@invalid.com (John Savard) writes:
    On Thu, 18 Jun 2026 14:39:07 GMT, scott@slp53.sl.home (Scott Lurndal)
    wrote:

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    Back when logic and memory were more evenly matched, computers still
    had one register - the accumulator. And instructions basically did
    arithmetic between memory and the accumulator. Of course,
    memory-to-memory operations were also possible without even an
    accumulator.

    And some computers in those days simply used memory to memory operations exclusively without needing an accumulator.

    Memory superscalar/OoO require a ROB that works with addresses rather than
    registers (perhaps CAM based); the size of the ROB is still limited
    to the degree of OoO.

    That noted, it seems to me that if access to all of memory costs
    the same as access to a register, the need for OoO support in
    the core becomes less interesting when the normal delays for which
    instruction-level parallelism helps don't apply
    (e.g. cache misses, NUMA latency, etc).

    With the execution window being architected to absorb latency
    AND
    memory being longer latency than registers,

    The size of the RoB would have to be larger, and the size of each field
    being compared grows from 8-bits (256 dynamic names) to <what> 64-bits::
    the RoB, RAT, and other structures get "out of hand" quickly.

    But since memory isn't likely to get that fast, it's not really useful
    to think of how to design for something that can't happen.

    Never say never.

    "When your Register file is as big as your cache,
    register access will be as slow as your cache" Andy Glew.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 01:09:28 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/19/2026 11:59 AM, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    Possibly the biggest millstone around the neck of computing
    architectures is the C language. ...

    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    C killed off every memory model other than flat byte addressed memory.
    Pointers are sort of typed, but any real C program does stuff like this:
    p = (struct foo *) malloc(42 * sizeof(struct foo));

    Fwiw, why all of the casts?

    C and C++ handle void* conversions differently. You must cast
    the malloc result to a pointer of the declared type when using C++.

    This reminds me of an entry in the obfuscated C contest way back::
    There was an entry where one could compile it in {C, Fortran, Pascal,
    and some other language} and the source would be compiled into the same
    object module from each.

    It doesn't hurt to add the cast in C, and may help with documenting
    the intention of the programmer who wrote the code.
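
    A minimal sketch of the cast-free C form, for comparison with the cast
    line quoted above. The struct layout and the name alloc_foos are made up
    for illustration. In C the void * returned by malloc converts implicitly
    to the target pointer type; the same line would be rejected by a C++
    compiler, which is the difference described above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical payload type, only here so the example compiles. */
struct foo {
    int id;
    double weight;
};

/* Allocate room for n structs, or return NULL on overflow or failure.
   No cast is needed in C: void * converts implicitly to struct foo *.
   Writing sizeof *p (rather than sizeof(struct foo)) keeps the size
   expression tied to the pointer's own type. */
struct foo *alloc_foos(size_t n)
{
    if (n > SIZE_MAX / sizeof(struct foo))
        return NULL;                     /* n * sizeof would overflow */
    struct foo *p = malloc(n * sizeof *p);
    return p;
}
```

    The caller releases the block with free(p) as usual.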

  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sat Jun 20 11:09:46 2026
    From Newsgroup: comp.arch

    On 2026-06-20 4:01, MitchAlsup wrote:

    quadibloc@invalid.com (John Savard) posted:

    On Fri, 19 Jun 2026 09:24:43 +0200, David Brown
    <david.brown@hesbynett.no> wrote:

    De-facto standards are /always/ albatrosses to some extent. Things are
    done that way because things are done that way - processors are designed
    to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    This is not necessarily a bad thing - it lets everyone get stuff done.
    But it means that we are stuck on a local maximum. If there is a better
    way out there somewhere, it would be a long and arduous journey to get
    there.

    I might well be willing to concede that C does have its flaws. But
    these are flaws it has _as a programming language_, and not flaws that
    have affected the design of computers. Why do I say this?

    Because while C, as a procedural language, took a lot from its
    predecessors - its punctuation came from PL/I

    [snip irrelevant syntax discussion]

    - it was designed to not
    be too different from the underlying hardware.

    The underlying hardware *of that time*. Therefore, it may have
    contributed to "locking in" that style of hardware. But I do not pretend
    to know that a different style of hardware would be better today.

    PL/I's most useful memory trick was using an area. So, one could 'malloc'
    a bunch of data, and then free it all with one free! Nothing prevents C
    from doing this, but C++ has new and new is not compatible with area.

    (Mostly irrelevant to the suggestion that C is an "albatross", but Ada provides such areas in the form of "storage pools" which can also
    contain "subpools". One can allocate objects in a pool or subpool and
    then deallocate the whole pool or subpool at once.)
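
    The "allocate many, free once" idea behind PL/I areas and Ada storage
    pools can be sketched in C as a bump allocator. This is only an
    illustration under my own assumptions: the Arena type and the arena_*
    names are invented here, it needs C11 for _Alignof and max_align_t, and
    it is not how either PL/I or Ada implements the feature.

```c
#include <stddef.h>
#include <stdlib.h>

/* One contiguous region; objects are carved off the front. */
typedef struct {
    unsigned char *base;   /* start of the region, NULL if unusable */
    size_t cap;            /* total bytes in the region */
    size_t used;           /* bytes handed out so far */
} Arena;

/* Returns 1 on success, 0 if the backing malloc failed. */
int arena_init(Arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap  = a->base ? cap : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Bump-allocate n bytes, aligned for any object type; NULL when full. */
void *arena_alloc(Arena *a, size_t n)
{
    const size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) / align * align;
    if (start > a->cap || n > a->cap - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* Free every object in the arena with a single call. */
void arena_release(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap  = 0;
    a->used = 0;
}
```

    Individual objects are never freed; the whole region goes away in
    arena_release, which is the property being discussed.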

    If, instead, a "better" language, like ALGOL

    Algol was ruined with its parameter passing in 'thunks' and strict
    1-file compilation.

    The 1-file compilation was an implementation issue. For example,
    Burroughs Algol supported (and no doubt still supports) separate
    compilation of subprograms. (IIRC, even the paper-tape-based HP Algol
    for the HP2100 series did that.) Algol can do pass-by-value, and the alternative pass-by-name method could have been deprecated and removed
    as the language evolved, or reduced to pass-by-reference, if it was
    judged to be an obstruction.

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    That is one bullet we dodged!!

    Well, who knows what would have resulted if a similar amount of effort
    had been made to improve such computers, as has been made to improve the currently conventional style of computers.

    Algol, Pascal, Ada, etc. can be compiled as well for the currently conventional style of computers as for B6500-style computers, while
    compiling C for a B6500-style computer limits the compiler to use a
    subset of the hardware functions, as I have understood it, because of
    the usual assumption in C programs of a single flat memory space.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 20 10:33:58 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    "When your Register file is as big as your cache,
    register access will be as slow as your cache" Andy Glew.

    Looking at https://www.guru3d.com/data/publish/223/54520bdd20560bcbc963979637025fd69682f6/afnviogfwsce6yxo.webp

    It seems that the vector register file is about as large as the
    D-cache ("L1d$"). The I-cache seems to be only a little larger than
    the integer register file, and is smaller than 1/4 of the vector
    register file.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 20 10:46:59 2026
    From Newsgroup: comp.arch

    jgd@cix.co.uk (John Dallman) writes:
    In article <2026Jun19.080216@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    once upon a time Sun boasted architectural features to support
    Java. AFAIK these were features to improve the indirect-branch
    performance in the Java VM interpreter, to improve the startup
    performance.

    Did this become obsolete when Java runtime environments switched to
    JITing to native code?

    I don't remember in which SPARC this feature was added, but IIRC it
    was after the introduction of the HotSpot Java VM implementation.
    Note that HotSpot uses an interpreter on startup and on the cold
    methods, and only compiles hot methods to native code after executing
    them for a while and collecting execution statistics.

    They don't, but they don't do a good job of making those features usable
    either. Support for new instructions is readily provided via intrinsics,
    but those aren't portable.

    Yes. But with a programming language like C, what is the alternative?

    GNU C supports a vector extension; I don't know how fast Intel added
    support for AVX, AVX2, and the various AVX-512 variants to gcc and
    llvm (which also supports this extension); certainly recent gcc and
    clang versions use AVX-512 if you press the right buttons.
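
    For readers who have not seen the GNU C vector extension, here is a
    small sketch. The type name v4f and the function saxpy4 are invented for
    the example; the vector_size attribute, element-wise arithmetic and
    subscripting are the documented parts of the extension, accepted by both
    gcc and clang. It is not ISO C.

```c
/* Four packed floats (16 bytes), GNU C vector extension. */
typedef float v4f __attribute__((vector_size(16)));

/* Computes a*x + y on all four lanes with ordinary operators.
   The compiler chooses the instructions; no intrinsics are named. */
v4f saxpy4(float a, v4f x, v4f y)
{
    v4f av = {a, a, a, a};   /* broadcast the scalar by hand */
    return av * x + y;
}
```

    As far as I know, gcc synthesizes these operations from narrower ones
    when the target lacks a matching SIMD unit, so the source does not
    depend on a particular instruction set extension.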

    Fortran supports the array sublanguage which AFAIU makes vectorization
    easy within an expression. But as Thomas Koenig tells us, gcc's
    Fortran front end translates that into scalar code and relies on auto-vectorization in the back end to produce vectorized code.
    Intel's Fortran compiler ifort has been replaced by something that
    uses IIRC LLVM as back end.

    And finally we have auto-vectorization, where it is a matter of luck
    whether the scalar code is vectorized or not (e.g., I have code that
    one compiler auto-vectorizes with -Os, but not -O3, and another
    compiler that auto-vectorizes with -O3, but not -Os).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sat Jun 20 14:25:41 2026
    From Newsgroup: comp.arch

    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he could
    remember what kind of benchmarks they had run on the 88100 that showed
    it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to use
    2 shifts to perform an extract; indexing was scaled:: both helped interpreters.

    Production Lisp environments are not interpreters, even back then.
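
    To make the "two shifts to perform an extract" remark concrete, here is
    the idiom in C next to the shift-and-mask form. The function names are
    mine, and nothing here models the actual M88K or SPARC encodings; it
    only shows the operation a single extract instruction would replace.

```c
#include <stdint.h>

/* Extract `width` bits starting at bit `lsb` (bit 0 = least significant),
   zero-extended, using two shifts.
   Requires width >= 1 and lsb + width <= 32. */
uint32_t extract_two_shifts(uint32_t x, unsigned lsb, unsigned width)
{
    uint32_t left = x << (32u - lsb - width);  /* drop bits above the field */
    return left >> (32u - width);              /* drop bits below, right-justify */
}

/* Same result with a shift and a mask, for comparison. */
uint32_t extract_mask(uint32_t x, unsigned lsb, unsigned width)
{
    uint32_t mask = (width == 32u) ? UINT32_MAX
                                   : ((UINT32_C(1) << width) - 1u);
    return (x >> lsb) & mask;
}
```

    Pulling a type tag out of a word is the kind of field extract that
    shows up constantly in a Lisp runtime, whether interpreted or compiled.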

  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Jun 20 17:26:43 2026
    From Newsgroup: comp.arch

    Niklas Holsti wrote:
    On 2026-06-19 18:16, Michael S wrote:
    On Fri, 19 Jun 2026 16:16:05 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

       [snip]

    Fat pointers that can
    track access modes and range limits, at least in common cases,
    would aid reliability and security.

    A bit Too "Cheri" for me.

    :-)

    The Mill is probably the closest to Cheri that is still in active
    development.

    Terje



    Did not Ivan say that he likes capabilities, but decided that the Mill
    already has too many innovative concepts as it is, and that including
    capabilities would be too much?

    As I recall, Ivan used to say that he knew how to /build/ a capability
    machine, but did not know how to /sell/ it. I believe he meant that such
    a machine would not run "normal" C/C++ code, at least not very well,
    which would turn away many potential customers.

    The most Cheri-like feature of the Mill is probably the hardware
    byte-granularity security, with the ability for users to create subsets
    in size and/or access rights.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Jun 20 17:27:59 2026
    From Newsgroup: comp.arch

    Chris M. Thomasson wrote:
    On 6/19/2026 7:16 AM, Terje Mathisen wrote:
    [...]
    The Mill is probably the closest to Cheri that is still in active
    development.

    How close are you guys to making a Mill processor?

    I don't know, I'm just the FP emulation guy in the project. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Jun 20 17:33:13 2026
    From Newsgroup: comp.arch

    Robert Swindells <rjs@fdy2.co.uk> writes:
    The architectural support for tagging in SPARC only avoided the need to
    untag and tag integers in compiled code.

    David Ungar's thesis on SOAR provided measurements of the impact of this
    on benchmarks for Smalltalk.

    The layout of having the tags in the bottom 2 bits of a 32 bit word works
    fine without architectural support, being able to turn on traps for
    unaligned data access helps though.

    Actually IA-32 (since the 486) and AMD64 have a bit for turning on
    unaligned traps, but unfortunately there is too much software in
    libraries that performs unaligned accesses.

    On IA-32 that's the result of aligning FP numbers to 4-byte boundaries
    in the ABI, while the hardware requires 8-byte alignment (both
    hardware and ABI come from Intel).

    On AMD64 that's a result of statement-level auto-vectorization; e.g.,
    two bytes get loaded as one 16-bit value even if the resulting access
    is not naturally aligned.
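
    A sketch of the low-two-bit tagging layout mentioned above, assuming
    tag 00 means "small integer" (the tag assignment and all names are my
    own choice, not taken from any particular Lisp or Smalltalk system).
    With that choice, adding two tagged integers needs no untagging, which
    is why the scheme works without architectural support.

```c
#include <stdint.h>

typedef uintptr_t value;   /* one tagged machine word */

enum { TAG_BITS = 2, TAG_MASK = 3, TAG_FIXNUM = 0 };

/* Tag an integer: shift it up so the low two bits are 00. */
static inline value fix_make(intptr_t n)
{
    return (value)n << TAG_BITS;
}

/* Untag: the low bits are zero, so dividing by 4 is exact. */
static inline intptr_t fix_value(value v)
{
    return (intptr_t)v / (1 << TAG_BITS);
}

static inline int is_fixnum(value v)
{
    return (v & TAG_MASK) == TAG_FIXNUM;
}

/* 4a + 4b = 4(a + b): the sum of two fixnums is already a fixnum. */
static inline value fix_add(value a, value b)
{
    return a + b;
}
```

    If pointers carry a nonzero tag and are dereferenced with a compensating
    offset, a word with the wrong tag yields a misaligned address, and
    that is where an alignment trap would catch the type error for free.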

    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    Then I misremembered the Lisp system's name. But it was one whose name
    I had already read before.

    To me, the old SPEC li benchmark was a test of the speed of an interpreter
    written in C and doesn't say anything useful about how well a system would
    run Lisp that had been compiled to machine code.

    It also does not say anything useful about the speed of a
    high-performance interpreter. See Figure 5 of <https://jilp.org/vol5/v5paper12.pdf>.

    The one architectural feature of Lisp Machines that I don't think has been
    carried forward was a multi-way switch instruction.

    What you get today is indirect branches with good predictors, and
    conditional branches that take 0 cycles if not taken and correctly
    predicted, and 1 cycle if taken and correctly predicted. I expect they are pretty good for implementing multi-way switches.
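
    As an illustration of a multi-way switch on current hardware, here is a
    toy stack-machine dispatch loop in C. The opcode set and the name run
    are invented for the example. Whether the compiler turns the switch into
    a jump table (one indirect branch) or a chain of conditional branches is
    its own decision; either way it lands on the branch machinery described
    above.

```c
#include <stddef.h>

enum op { OP_PUSH1, OP_ADD, OP_DOUBLE, OP_HALT };

/* Runs a well-formed program and returns the top of stack at OP_HALT.
   No stack bounds checks: the caller must supply a valid program.
   Returns -1 if an unknown opcode is encountered. */
long run(const enum op *code)
{
    long stack[16];
    size_t sp = 0;

    for (;;) {
        switch (*code++) {             /* the multi-way dispatch point */
        case OP_PUSH1:
            stack[sp++] = 1;
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_DOUBLE:
            stack[sp - 1] *= 2;
            break;
        case OP_HALT:
            return stack[sp - 1];
        default:
            return -1;
        }
    }
}
```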

    The rest of the MIT Lisp Machine microarchitecture was just a pipelined,
    three address, load/store one that provides another data point for the
    discussion from a few months ago on whether VAX could have been a RISC
    using TTL chips.

    Hmm, but IIRC the architecture was more in the direction of "closing
    the semantic gap".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 14:02:56 2026
    From Newsgroup: comp.arch

    On 6/19/2026 2:24 AM, David Brown wrote:
    On 19/06/2026 00:37, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:
    -----------------------

    <snip>

                                                Fat pointers that can track
    access modes and range limits, at least in common cases, would aid
    reliability and security.

    A bit Too "Cheri" for me.

    I think "Cheri" took it too far.  I believe there is scope for tagging a bit of information onto pointers without trying to do everything.

    I also think a lot can be done on the side of programming languages and tools, which could catch far more possible pointer mistakes.  That won't stop the bad guys, of course, but I think more bad accesses are from
    bugs than hackers.


    Agreed, this is a route I experimented with.

    A basic bounds-checking mechanism can help with debugging and security.
    One option here is, say, using pointer tagging bits to encode
    bounds-check information and then have the compiler emit instructions to detect (roughly) when an access has gone out-of-bounds.
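
    One way such a tagged-pointer bounds check could look, as a sketch only.
    Everything here is an assumption made for illustration: a 48-bit address
    field, a 6-bit log2 size class stored above it, and objects placed in
    size-aligned power-of-two blocks. It is not the encoding used by BGBCC
    or by any shipping ISA, and it works on plain integers rather than real
    pointers so it can be run anywhere.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADDR_BITS 48u
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1u)

/* Pack an address with log2 of the power-of-two block that contains it. */
static inline uint64_t tp_make(uint64_t addr, unsigned log2_size)
{
    return (addr & ADDR_MASK) | ((uint64_t)(log2_size & 0x3Fu) << ADDR_BITS);
}

/* True if the len bytes starting off bytes past the pointer stay inside
   the pointer's block. This is the check a compiler would emit before
   an access. */
static inline bool tp_in_bounds(uint64_t tp, uint64_t off, uint64_t len)
{
    unsigned k    = (unsigned)(tp >> ADDR_BITS) & 0x3Fu;
    uint64_t size = UINT64_C(1) << k;
    uint64_t rel  = (tp & ADDR_MASK) & (size - 1u);  /* pointer's offset in block */

    if (off > size || len > size)
        return false;                 /* also keeps rel + off from overflowing */
    return rel + off <= size - len;
}
```

    The check is only as fine as the block size, which matches the
    "(roughly)" above: an object smaller than its block can still be
    overrun up to the block boundary without being noticed.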

    Extending it to the scope CHERI did adds new problems:
    Adds significant implementation overhead;
    Interferes with C programming practices;
    ...

    And with a glaring weakness:
    By its design, it is theoretically incapable of by itself forming a
    sandbox capable of stopping actively hostile code.

    It *could* still make it a PITA for human programmers to break out of,
    but if a determined human programmer (or an AI assisted one) could put
    in the work and break out of it via convoluted pointer de-referencing
    (and if this break-ability is likely necessary for things like the C
    runtime to be able to work), this is a weak point.

    And, if it can't lock down security against actively hostile code, then
    its more heavy-handed aspects are no longer justifiable.


    Meanwhile, if the task is subdivided, some similar benefits can be
    realized more cheaply:
    Bounds checked pointers to trap on out-of-bounds access;
    ASLR to make it much harder for shell-code to know where anything is;
    Tagging to make it harder to stomp the link register;
    Any attack attempt will need to know an N-bit magic number.
    ...

    Or, if code sandboxing is needed, MMU based trickery is known effective.
    Code can't access something if it is not accessible within the current
    address space;
    It can't gain additional access unless either CPU instructions exist or SYSCALL trickery would allow for some level of privilege escalation.

    This is sort of the route I had been (gradually) investigating (though,
    using the VUGID/ACLID system within a SAS rather than separate address spaces).


    A similar mechanism can be used both for intra-process calls across
    security domains, and for inter-process communication:
    Wrapping interfaces in COM-like objects and routing the method calls
    over the SYSCALL mechanism.

    Had considered the possibility of some lower overhead mechanisms (such
    as a "Secure Execute" mode that could allow for very localized CPU
    privilege escalation), but demoted these ideas, mostly because I
    realized that they were still exploitable (and by nature using them
    would leave a security risk).

    Like, even if the untrusted code can't itself get access to privileged
    instructions, that doesn't mean it can't try to use the security-transfer
    mechanism as a widget to try to grant itself access to other parts of
    the address space (such as by forging cross-domain COM-like objects
    that transfer control back to itself).

    To enforce things, the privilege transfer can't exist within the scope
    of the memory that the unprivileged code has access to, which in this
    case, effectively means it needs to be abstract handles that only the
    kernel or similar can deal with.

    ...


    Ironically, all this has less visible surface area at the ISA level.
    Like, still works with just plain RISC-V code or similar. Like, ideally,
    the code should not need to be able to see the bars of the jail in which
    it exists (or that, just outside of its reach, there exists a register
    that if it could somehow zero-out, it would gain access to the entire
    address space; but that it is limited only to its own code and data
    sections and any local heap, say, with even its own C runtime access
    being possibly via a concealed COM-like object).

    For untrusted/sandboxed code, could maybe also make sense for there to
    be an OS feature to disallow true OS syscalls (so, unlike the normal user-mode; it isn't allowed to talk to the OS directly, but only to
    objects supplied by the host application).


    Granted, one could maybe argue that a mechanism where
    cross-security-domain calls need to go across COM objects or similar, is
    a bit heavy handed. Or, for this matter, what exactly to call these
    things (which are mostly like COM objects, but do not involve MS's tools
    or APIs).

    Well, or if it is more the:
    (*pVt)->SomeMethod(pVt, ...);
    Aspect that matters.
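
    For anyone unfamiliar with the calling pattern, here is a sketch of a
    COM-like object in plain C, with no MS tools or APIs involved. The
    Counter type and its two methods are invented for the example, and the
    calls here are ordinary in-process calls; routing them over a SYSCALL
    mechanism as described above is not shown.

```c
/* The method table: every method takes the object handle first. */
typedef struct CounterVt CounterVt;
struct CounterVt {
    int (*Add)(const CounterVt **self, int n);
    int (*Get)(const CounterVt **self);
};

/* The object: the vtable pointer must be the first member, so a pointer
   to it is also a pointer to the object. */
typedef struct {
    const CounterVt *vt;
    int total;
} Counter;

static int counter_add(const CounterVt **self, int n)
{
    Counter *c = (Counter *)self;    /* valid: vt is the first member */
    c->total += n;
    return c->total;
}

static int counter_get(const CounterVt **self)
{
    return ((Counter *)self)->total;
}

static const CounterVt counter_vt = { counter_add, counter_get };

/* Initialise an object and hand back the opaque handle callers use. */
const CounterVt **counter_init(Counter *c)
{
    c->vt = &counter_vt;
    c->total = 0;
    return &c->vt;
}
```

    A caller holding only pVt = counter_init(&c) then writes
    (*pVt)->Add(pVt, 5); it never needs to know the layout behind the
    handle, which is what makes the pattern usable across a domain boundary.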


    Side note:
    Could in theory route it over individual function pointers, it is just
    that the mechanisms in question would be in effect too heavyweight to
    justify using them to map function pointers (an object can effectively
    map a whole big group of function pointers for the same cost as a single
    pointer in a callback-marshaling approach; and also objects are somewhat
    less of a pain to deal with than a "WhateverGetProcAddress()" style
    mechanism).


    ...


  • From John Levine@johnl@taugh.com to comp.arch on Sat Jun 20 19:07:15 2026
    From Newsgroup: comp.arch

    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    - [C] was designed to not
    be too different from the underlying hardware.

    The underlying hardware *of that time*. Therefore, it may have
    contributed to "locking in" that style of hardware. But I do not pretend
    to know that a different style of hardware would be better today.

    C evolved from B which had a memory model that addressed words, which made sense for a lot of the computers of the 1960s. I gather the earliest
    versions of C were on the GE 635 which was a 36 bit word addressed machine
    but when it moved to the byte addressed PDP-11, dmr had to add typed pointers so it could do something reasonable with pointers to character strings vs pointers to words.

    I think that with or without C, flat byte addressed memory would have won out due to the success of S/360 and the PDP-11, both of which were programmed
    in lots of languages other than C.

    Algol was ruined with its parameter passing in 'thunks' and strict
    1-file compilation.

    The 1-file compilation was an implementation issue. For example,
    Burroughs Algol supported (and no doubt still supports) separate
    compilation of subprograms. (IIRC, even the paper-tape-based HP Algol
    for the HP2100 series did that.) Algol can do pass-by-value, and the
    alternative pass-by-name method could have been deprecated and removed
    as the language evolved, or reduced to pass-by-reference, if it was
    judged to be an obstruction.

    Call by name and thunks were a mistake. The Algol committee was trying
    to write an elegant description of call by reference and only when
    Jensen's device came along did they realize what they'd done. Alan
    Perlis, who was on the Algol committee, told me this. Then when they
    tried to fix Algol 60, the committee was hijacked by people who
    produced Algol 68 which was quite a good language, but was defined so
    obscurely that people wrongly assumed it was hard to learn and use.
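
    For readers who have not met Jensen's device, here is an emulation in C.
    C has no call by name, so the thunk is written out by hand as a function
    pointer plus an environment; all names are invented for the sketch. The
    point is that the "argument" is re-evaluated on every use, after the
    callee has changed the loop variable it shares with the caller.

```c
/* A thunk re-evaluates the by-name argument expression on each call. */
typedef double (*thunk)(void *env);

/* The variables the argument expressions refer to. */
struct env {
    int i;
    const double *a;
};

/* The expression a[i], as a thunk. */
static double term_a_i(void *e)
{
    struct env *v = e;
    return v->a[v->i];
}

/* The expression i*i, as a thunk. */
static double term_sq(void *e)
{
    struct env *v = e;
    return (double)(v->i * v->i);
}

/* Algol 60 style sum(i, lo, hi, term): i is shared with the caller
   (by name), and term is evaluated afresh for each value of i. */
double sum_by_name(int *i, int lo, int hi, thunk term, void *env)
{
    double s = 0.0;
    for (*i = lo; *i <= hi; ++*i)
        s += term(env);
    return s;
}
```

    The same sum_by_name adds up array elements or squares depending only
    on which expression is passed, which is the effect that reportedly took
    the committee by surprise.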

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    That is one bullet we dodged!!

    I doubt it. Several parallel strands of RISC research independently found
    that moving complexity from the hardware into the compiler made computers faster and cheaper. IBM's PL.8 compiler had excellent error checking even though it was originally targeted at the RISC 801, but somehow people always want to turn off the error checks in the production build of their code.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From John Levine@johnl@taugh.com to comp.arch on Sat Jun 20 19:17:45 2026
    From Newsgroup: comp.arch

    According to Robert Swindells <rjs@fdy2.co.uk>:
    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he could
    remember what kind of benchmarks they had run on the 88100 that showed
    it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to use
    2 shifts to perform an extract; indexing was scaled:: both helped
    interpreters.

    Production Lisp environments are not interpreters, even back then.

    Quite right. Lisp 1.5 on the 7094 had a compiler. See the Lisp 1.5
    manual published in 1962. Appendix D describes the compiler:

    https://archive.org/details/bitsavers_mitrlelisprammersManual2ed1985_9279667
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 14:51:46 2026
    From Newsgroup: comp.arch

    On 6/20/2026 5:46 AM, Anton Ertl wrote:
    jgd@cix.co.uk (John Dallman) writes:
    In article <2026Jun19.080216@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    once upon a time Sun boasted architectural features to support
    Java. AFAIK these were features to improve the indirect-branch
    performance in the Java VM interpreter, to improve the startup
    performance.

    Did this become obsolete when Java runtime environments switched to
    JITing to native code?

    I don't remember in which SPARC this feature was added, but IIRC it
    was after the introduction of the HotSpot Java VM implementation.
    Note that HotSpot uses an interpreter on startup and on the cold
    methods, and only compiles hot methods to native code after executing
    them for a while and collecting execution statistics.

    They don't, but they don't do a good job of making those features usable
    either. Support for new instructions is readily provided via intrinsics,
    but those aren't portable.

    Yes. But with a programming language like C, what is the alternative?


    Define a language extension for vectors that "doesn't suck";
    Get widespread enough support that code can use it without it turning
    into a mess;
    Make it work whether or not the target has native architectural support
    for a given feature.


    GNU C supports a vector extension; I don't know how fast Intel added
    support for AVX, AVX2, and the various AVX-512 variants to gcc and
    llvm (which also supports this extension); certainly recent gcc and
    clang versions use AVX-512 if you press the right buttons.


    You don't want to enable this stuff too fast though.

    AVX512 is far from universal.

    In my case, I am still running a CPU where enabling the use of the
    256-bit AVX instructions comes with a significant performance penalty.


    OTOH, GCC still also tends to fall into the trap of only supporting
    certain features on certain targets if the specific ISA (or target configuration) can support them natively.

    Having C code that can either compile or fail to compile depending on
    the specific combination of target machine and compiler feature flags is
    a poor situation.


    Fortran supports the array sublanguage which AFAIU makes vectorization
    easy within an expression. But as Thomas Koenig tells us, gcc's
    Fortran front end translates that into scalar code and relies on auto-vectorization in the back end to produce vectorized code.
    Intel's Fortran compiler ifort has been replaced by something that
    uses IIRC LLVM as back end.

    And finally we have auto-vectorization, where it is a matter of luck
    whether the scalar code is vectorized or not (e.g., I have code that
    one compiler auto-vectorizes with -Os, but not -O3, and another
    compiler that auto-vectorizes with -O3, but not -Os).


    IMO auto-vectorization is another mess:
    Sometimes effective, but sometimes makes it worse.

    With BGBCC at least, have not gone that route.


    I probably would not, unless it could get consistently positive gains;
    the current scattershot of "maybe faster, maybe shoot oneself in the
    foot, almost invariably makes binary bigger" situation just kinda sucks.
    Doesn't exactly give me strong optimism in this area.

    Though, MSVC is kind of a bigger offender in some ways (kinda almost
    wish for a "Have SIMD enabled, but do not attempt auto-vectorization"
    option). Kinda ironically the closest it does have to this is "/O1", but
    this is more of a "just half-ass the optimizations" option rather than explicitly disallowing vectorization (or, contrast with "/O0" which is
    "Hope you like every operation to be 'Load Load Op Store', and no
    constant folding or similar"; and then VS debugger only really debugs
    stuff effectively when it is built in this "performs like epic dog crap" mode).

    Ironically, BGBCC doesn't go this far, as there isn't any direct
    equivalent of "/O0" or "-O0" behavior, as I would actually have to go
    out of my way to make the compiler generate code that poorly...


    Then again:
    RV64G/RV64GC doesn't really have SIMD;
    XG1/XG2/XG3 have SIMD, but it is based on some fundamentally different assumptions from either SSE/AVX or RV-V, and does not have separate SIMD/Vector registers.

    ...

  • From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 15:39:06 2026
    From Newsgroup: comp.arch

    On 6/20/2026 9:25 AM, Robert Swindells wrote:
    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he could
    remember what kind of benchmarks they had run on the 88100 that showed
    it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to use
    2 shifts to perform an extract; indexing was scaled:: both helped
    interpreters.

    Production Lisp environments are not interpreters, even back then.


    Lisp is a funny language:
    Big promises in the design;
    But, only deliver them poorly (and can't improve on the delivery of any
    given thing without eroding the original promises).

    Simplicity and elegance of an interpreter,
    But only if already operating within a Lisp environment...
    Clean and elegant syntax,
    That actually sucks real hard to use in practice.
    Performance, but only if compiled to something else...

    A naive Lisp interpreter being almost the slowest style of interpreter...



    Well, but I found one, "naive walk of XML DOM trees" being slower.
    Back in the 2000s, managed to make a script interpreter that was "so
    damn slow" that when used in a makeshift 3D engine, could actually
    "feel" whenever the short script fragments fired off...

    Though, this was a design where basically something as little as a
    constant expression would require:
    Walk along and repeatedly check the node-tag against each tag string;
    Fetch the attribute holding the value as a string;
    "atof()" or similar;
    Allocate a memory box to hold the value;
    Put the value in a memory box.

    This VM didn't live long...


    But, ironically, part of this code is what eventually became BGBCC.
    Though BGBCC's node system is still far more optimized than what this interpreter originally used (because early on, BGBCC's compilation
    process was also painfully slow for similar reasons).


    Ironically, what did the script-language interpreter do?
    Got rewritten to re-use part of a Scheme interpreter I had written in high-school as the backend.

    Which later got rewritten to target a bytecode, and then to have a JIT.
    Which was not exactly all that dissimilar from the RIL bytecode (used by BGBCC) and general structure in JX2VM and similar.

    Though, JX2VM ended up mostly not JIT'ing, as JIT is unnecessary when
    limited to emulating a 50 MHz target CPU on a modern desktop PC.

    ...


    But, one can be like, what is faster (and less LOC) than a Lisp or
    Scheme interpreter?...

    A pre-tokenized BASIC interpreter.
    But, then, arguably, BASIC is a cruftier language; but one that is
    ironically still more readily usable for non-toy programming tasks.

    Though, then I recent-ish ended up using BASIC as a basis for limited
    CSG tasks, and the result caused it to start to mutate into something resembling a BASIC/Emacs-Lisp hybrid.

    If intentionally going this direction though, would make more sense to
    just go for something more resembling something like Emacs-Lisp though.

    Well, as trying to hybridize Lisp and BASIC gives something that sucks
    worse than either Lisp or BASIC would on their own (like a "Beware, this
    path leads to dog crap" thing).

    ...

  • From George Neuner@gneuner2@comcast.net to comp.arch on Sat Jun 20 16:52:44 2026
    From Newsgroup: comp.arch

    On Sat, 20 Jun 2026 01:01:29 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:


    PL/I's most useful memory trick was using an area. So, one could 'malloc'
    a bunch of data, and then free it all with one free!

    Aka "Mark-Release".


    Nothing prevents C from doing this, but C++ has new and new is not
    compatible with area.


    If by "compatible" you mean being able to select an area on a
    per-allocation basis, then C's default malloc can't do that either.

    You certainly can write a custom malloc, but for compatibility with
    the standard function it would have to work from a "default" heap. You
    could change the default heap at will, but not in the "compatible"
    malloc call itself.
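
    A minimal sketch of the arrangement described above, in plain C: a
    malloc-shaped call that takes no area argument and draws from whichever
    area is currently selected as the default, plus mark/release to free
    everything at once. All names here (Area, area_select, area_malloc,
    area_mark, area_release) are hypothetical and not from any standard
    library; it assumes C11 for max_align_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump-pointer "area"; none of these names are standard. */
typedef struct {
    unsigned char *base;  /* start of the backing block */
    size_t         used;  /* bytes handed out so far    */
    size_t         cap;   /* total bytes available      */
} Area;

/* The "default" area that area_malloc() draws from. */
static Area *current_area = NULL;

/* Create an area of cap bytes; returns 0 on success, -1 on failure. */
int area_init(Area *a, size_t cap) {
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = a->base ? cap : 0;
    return a->base ? 0 : -1;
}

/* Change the default area; this happens outside the allocation call. */
void area_select(Area *a) { current_area = a; }

/* Same shape as malloc(): a size in, a pointer out, no area parameter. */
void *area_malloc(size_t n) {
    Area *a = current_area;
    assert(a != NULL);
    size_t align = sizeof(max_align_t);
    size_t start = (a->used + align - 1) / align * align;  /* round up */
    if (start > a->cap || n > a->cap - start)
        return NULL;                                       /* area is full */
    a->used = start + n;
    return a->base + start;
}

/* Mark: remember the current fill level. */
size_t area_mark(const Area *a) { return a->used; }

/* Release: drop everything allocated since the mark in one step. */
void area_release(Area *a, size_t mark) {
    assert(mark <= a->used);
    a->used = mark;
}

/* Return the backing block to the system allocator. */
void area_destroy(Area *a) {
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```

    Typical use would be area_init(), area_select(), a mark, a run of
    area_malloc() calls, and then a single area_release() back to the mark.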


    You can do the same in C++: although AFAIK there is no standard way to
    do it, every compiler I have seen provided some way to replace the
    standard allocator. If you do this, then all new/delete calls will
    use it.

    Also constructors for all the standard containers take an extra
    parameter which tells them which allocator to use [and defaults to the
    standard allocator].

    And new()/delete() can be overloaded per class to do whatever you
    want.


    Few programmers in C++ ever delve so deeply into the mysteries of
    allocation. IME too many have trouble even understanding the working
    of C++'s "placement" new.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 22:01:54 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 6/19/2026 2:24 AM, David Brown wrote:
    On 19/06/2026 00:37, MitchAlsup wrote:
    --------------------------
    A bit Too "Cheri" for me.

    I think "Cheri" took it too far.  I believe there is scope for tagging a bit of information onto pointers without trying to do everything.

    I also think a lot can be done on the side of programming languages and tools, which could catch far more possible pointer mistakes.  That won't stop the bad guys, of course, but I think more bad accesses are from
    bugs than hackers.


    Agreed, this is a route I experimented with.

    A basic bounds-checking mechanism can help with debugging and security.
    One option here is, say, using pointer tagging bits to encode
    bounds-check information and then have the compiler emit instructions to detect (roughly) when an access has gone out-of-bounds.

    How do you do this with 64-bit registers and a 64-bit virtual address space ??!!

    Extending it to the scope CHERI did adds new problems:
    Adds significant implementation overhead;
    Interferes with C programming practices;

    Which means it is not going to fly.....
    ...

    And with a glaring weakness:
    By its design, it is theoretically incapable of by itself forming a
    sandbox capable of stopping actively hostile code.

    It *could* still make it a PITA for human programmers to break out of,
    but if a determined human programmer (or an AI assisted one) could put
    in the work and break out of it via convoluted pointer de-referencing
    (and if this break-ability is likely necessary for things like the C
    runtime to be able to work), this is a weak point.

    And, if it can't lock down security against actively hostile code, then
    its more heavy-handed aspects are no longer justifiable.


    Meanwhile, if the task is subdivided, some similar benefits can be
    realized more cheaply:
    Bounds checked pointers to trap on out-of-bounds access;
    ASLR to make it much harder for shell-code to know where anything is;

    With pure PIC coding practices, and the link-pointer being stored directly
    onto the call stack (not the data stack), one has no particular reason to
    know where they currently are (IP) and few ways of seeing where they are.

    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the application has no access permissions.

    Any attack attempt will need to know an N-bit magic number.

    No need for a number, it cannot be accessed outside of calling and returning. {{Which is not being done with a series of instructions--but by 1 designed
    for the task at hand}}


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Jun 20 22:08:23 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 9:25 AM, Robert Swindells wrote:
    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he
    could remember what kind of benchmarks they had run on the 88100
    that showed it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to
    use 2 shifts to perform an extract; indexing was scaled:: both
    helped interpreters.

    Production Lisp environments are not interpreters, even back then.


    Lisp is a funny language:
    Big promises in the design;
    But, only deliver them poorly (and can't improve on the delivery of any given thing without eroding the original promises).

    Simplicity and elegance of an interpreter,
    But only if already operating within a Lisp environment...
    Clean and elegant syntax,
    That actually sucks real hard to use in practice.
    Performance, but only if compiled to something else...

    A naive Lisp interpreter being almost the slowest style of interpreter...

    Given that one can create a data structure in LISP, and then execute it;
    how would you do this without an interpreter or a JIT ??
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sat Jun 20 10:15:41 2026
    From Newsgroup: comp.arch

    Robert Swindells [2026-06-19 11:20:10] wrote:
    On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:
    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp), and
    SPARC contains some support for that, but as one of the IIRC Franz Lisp
    developers has explained in this newsgroup, they actually did not use
    this feature, because the performance benefit was not big enough to
    [...]
    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    I guess you two aren't talking about the same "Franz Lisp".
    AFAIK Anton is referring to the commercial Common Lisp compiler
    associated with the Franz Inc company, marketed under the name
    "Allegro".


    === Stefan
  • From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 20:42:03 2026
    From Newsgroup: comp.arch

    On 6/20/2026 5:08 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 9:25 AM, Robert Swindells wrote:
    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he
    could remember what kind of benchmarks they had run on the 88100
    that showed it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to
    use 2 shifts to perform an extract; indexing was scaled:: both
    helped interpreters.

    Production Lisp environments are not interpreters, even back then.


    Lisp is a funny language:
    Big promises in the design;
    But, only deliver them poorly (and can't improve on the delivery of any
    given thing without eroding the original promises).

    Simplicity and elegance of an interpreter,
    But only if already operating within a Lisp environment...
    Clean and elegant syntax,
    That actually sucks real hard to use in practice.
    Performance, but only if compiled to something else...

    A naive Lisp interpreter being almost the slowest style of interpreter...

    Given that one can create a data structure in LISP, and then execute it;
    how would you do this without an interpreter or a JIT ??

    Eval can be implemented, just eval isn't necessarily fast.

    It is common to implement eval by using a recursive tree walk:
    Is list:
    Fetch first item in list:
    Is function or builtin?
    Eval each argument;
    Apply function;
    Is a macro:
    Do macro-handling stuff.
    Is fixnum or flonum or similar:
    Evaluates to itself.
    ...

    There is a penalty due to things like type-checks, etc.
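
    The tree walk outlined above can be sketched roughly as follows. This
    is a deliberately tiny illustration, not any particular interpreter:
    the Node layout, the builtin signature, and the MAX_ARGS limit are all
    assumptions, and the macro case is left out. The tag tests on every
    node are the type-check penalty in question.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_NUM, T_BUILTIN, T_CONS } Tag;

typedef struct Node Node;
typedef long (*BuiltinFn)(const long *args, int nargs);

/* Hypothetical node layout: one tagged struct covers all three kinds. */
struct Node {
    Tag         tag;
    long        num;  /* T_NUM: the value                        */
    BuiltinFn   fn;   /* T_BUILTIN: native function              */
    const Node *car;  /* T_CONS: first item                      */
    const Node *cdr;  /* T_CONS: rest of list, NULL marks the end */
};

long builtin_add(const long *args, int nargs) {
    long sum = 0;
    for (int i = 0; i < nargs; i++) sum += args[i];
    return sum;
}

long builtin_mul(const long *args, int nargs) {
    long prod = 1;
    for (int i = 0; i < nargs; i++) prod *= args[i];
    return prod;
}

#define MAX_ARGS 8

long eval(const Node *n) {
    switch (n->tag) {
    case T_NUM:                       /* fixnum: evaluates to itself */
        return n->num;
    case T_CONS: {                    /* list: fetch the first item  */
        const Node *head = n->car;
        assert(head != NULL && head->tag == T_BUILTIN);
        long args[MAX_ARGS];
        int nargs = 0;
        /* Eval each argument (recursively), then apply the function. */
        for (const Node *p = n->cdr; p != NULL; p = p->cdr) {
            assert(p->tag == T_CONS && nargs < MAX_ARGS);
            args[nargs++] = eval(p->car);
        }
        return head->fn(args, nargs);
    }
    default:
        assert(!"cannot evaluate this node");
        return 0;
    }
}
```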



    Contrast, say, a naive bytecode:
    Go through a switch;
    Goes fairly directly to logic.

    A language like Forth or PostScript can be pretokenized fairly directly
    to a stack-based bytecode.

    In a pretokenized BASIC, one could use a dispatch based on the first token.
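
    For comparison, a naive switch-dispatched stack bytecode might look
    like the sketch below. The opcode set and the fixed 64-entry stack are
    made up for illustration; the point is only that each opcode goes
    through one switch and lands directly on its logic, with no per-node
    type tests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_MAX 64

/* Runs code until OP_HALT and returns the value on top of the stack. */
long run(const long *code) {
    long stack[STACK_MAX];
    int sp = 0;
    for (size_t pc = 0;;) {
        switch (code[pc++]) {
        case OP_PUSH:                 /* operand follows the opcode */
            assert(sp < STACK_MAX);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(!"bad opcode");
            return 0;
        }
    }
}
```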


    Granted, the "per-eval" cost for a simple Lisp expression (such as in a
    REPL) would be lower than that of a JavaScript interpreter (which will typically need to parse and translate to some internal format before
    running it; such as a stack-machine bytecode, which is then often taken further).


    And, admittedly, this was back to my first BGBScript, where:
    First version:
    Parse to a DOM-based AST;
    Tree walk the AST.
    Second version:
    Parse into a Scheme-based format;
    Run a modified Scheme interpreter (older).
    Then became:
    Translate to a stack based bytecode;
    Translate stack bytecode into 3AC ops in a "trace graph";
    Optional JIT compile the trace graph.
    Could push close to native C speeds with this at the time.

    As over time it went from a dynamic to static typed core, and the VM
    became very big and complicated (though, later looked at V8 and
    SpiderMonkey, which were basically in a similar weight class; and had seemingly ended up crossing into a lot of similar domain here).


    Then I did a reboot, turning it into a Java style language (using JSON
    based ASTs), which was faster still, but lost the ability to be used for scripting tasks or to eval anything. It also still failed to compete
    well with C on C's home turf.

    Well, and the original reason I had written my own scheme interpreter
    was due to frustrations with SCM and GNU Guile (I had looked at the code
    and was like, "I could just do this stuff myself").


    And, as noted, another offshoot of this first VM became BGBCC, which has outlived the descendants of the second VM.



    Ironically, now I am back where I started, still lacking a particularly
    good general-purpose scripting VM.

    In the minimal case, 80s style BASIC can be implemented in minimal LOC,
    but doesn't scale very well.

    A minimal JS-like language needs more LOC, but is more complicated.


    Ideally, one wants something small and self-contained enough that it can
    be casually copied from one project or context to another without having
    a bunch of extra "plumbing pain", which is notoriously harder to pull
    off for scripting VMs than, say, file-format importers/exporters or data-compression code.


    Something like a small Lisp/Scheme almost starts looking tempting again.
    Slower in a naive implementation than BASIC;
    But, scales a little better to some tasks;
    Smaller implementation cost than a minimal JS dialect.


    One might also end up with goals like, say:
    Whole VM needs to fit within a single source file with no external dependencies beyond the basic C library;
    Also whole VM shouldn't be more than a few kLOC;
    ...



    Well, and I am lacking a good way to implement something like a GLSL
    compiler in a way that isn't overly heavyweight. But, GLSL poses a very different problem space from a light duty scripting language (would need
    to implement this to get TKRA-GL beyond 1.x territory).

    Then again, could ironically make a case again for using S-Expressions
    again, but say:
    Parse GLSL into S-Expressions;
    Translate S-Expressions into a modified ARB-like IR;
    Map ARB-like commands to blobs of XG3 machine instructions or similar.
    Possibly using a vaguely similar approach to the Quake3 QVM;
    Naive/sucks, but kinda works;
    Could map IR registers ~ 1:1 to CPU registers;
    This basically means the reg-alloc happens when compiling the AST.
    ...


    Despite front-end similarities, the backend for a GLSL compiler would
    not be a good fit for scripting tasks.


    Was able to get a basic JS like interpreter into a few kLOC, but
    extending this into a GLSL compiler (while trying to keep code footprint small) would be a harder problem.


    Well, and is likely to end about like how my effort to implement a new lighter-weight C compiler ended. Previously I had wanted to try to
    implement a C compiler in a smaller code footprint than the Doom engine,
    but kinda failed at this. And then it fizzled, like unless it fits into
    Doom's code size and memory footprint, there was little reason to keep
    working on it; vs just continuing to use BGBCC; which, granted, weighs
    in at around 10x the size of Doom; or on-par with the size of Quake 3 Arena.


    Can't really use BGBCC to compile anything within TestKern itself mostly because it uses too much RAM (would need a fairly large pagefile, and
    then it would spend most of its time paging).

    Though, BGBCC does work very different from 80s/90s style compilers, effectively loading the whole target program into RAM and dealing with everything, rather than compiling each translation unit sequentially
    from top to bottom (and then running a linker).


    Not entirely sure how linkers worked on very old computers though, as seemingly the working set of the linker would likely end up needing to
    be larger than the VAS on something like a PDP-11.


    ...


  • From BGB@cr88192@gmail.com to comp.arch on Sat Jun 20 21:04:41 2026
    From Newsgroup: comp.arch

    On 6/20/2026 5:01 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/19/2026 2:24 AM, David Brown wrote:
    On 19/06/2026 00:37, MitchAlsup wrote:
    --------------------------
    A bit Too "Cheri" for me.

    I think "Cheri" took it too far.  I believe there is scope for tagging a
    bit of information onto pointers without trying to do everything.

    I also think a lot can be done on the side of programming languages and
    tools, which could catch far more possible pointer mistakes.  That won't
    stop the bad guys, of course, but I think more bad accesses are from
    bugs than hackers.


    Agreed, this is a route I experimented with.

    A basic bounds-checking mechanism can help with debugging and security.
    One option here is, say, using pointer tagging bits to encode
    bounds-check information and then have the compiler emit instructions to
    detect (roughly) when an access has gone out-of-bounds.

    How do you do this with 64-bit registers and a 64-bit virtual address space ??!!


    You can't...


    It is doable at least with a 48-bit VAS as there are just enough HOBs
    left over to sort of shove in an approximate bounds-check scheme.

    Exponent, Range_Mantissa, Range_Bias.
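
    As a rough illustration of shoving bounds information into the high
    bits above a 48-bit address, here is a much cruder scheme than the
    three-field one named above: it keeps only an exponent, and assumes
    the object lives in a 2^e-byte region aligned to 2^e. The field
    layout and function names are assumptions made for this sketch, not
    the actual encoding. Addresses are modelled as plain integers, so
    nothing is dereferenced.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* Pack a 48-bit address with a size exponent e in the 16 high bits.
   Assumption: the object occupies a 2^e-byte region aligned to 2^e. */
uint64_t tag_ptr(uint64_t addr, unsigned e) {
    assert(e < ADDR_BITS);
    assert((addr & ~ADDR_MASK) == 0);
    assert((addr & ((UINT64_C(1) << e) - 1)) == 0);
    return ((uint64_t)e << ADDR_BITS) | addr;
}

/* Pointer arithmetic with an approximate bounds check.
   Writes the adjusted pointer to *out and returns 1 if the new address
   is still inside the same 2^e region, 0 if it has left it. */
int ptr_add_checked(uint64_t p, int64_t delta, uint64_t *out) {
    unsigned e = (unsigned)(p >> ADDR_BITS);
    assert(e < ADDR_BITS);
    uint64_t old_addr = p & ADDR_MASK;
    uint64_t new_addr = (old_addr + (uint64_t)delta) & ADDR_MASK;
    *out = (p & ~ADDR_MASK) | new_addr;   /* tag bits ride along */
    /* In bounds iff all address bits above bit e-1 are unchanged. */
    return ((old_addr ^ new_addr) >> e) == 0;
}
```

    Since only a power-of-two region is tracked, the check is approximate:
    it cannot see overruns inside the region's padding, and it flags a
    one-past-the-end pointer.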


    Extending it to the scope CHERI did adds new problems:
    Adds significant implementation overhead;
    Interferes with C programming practices;

    Which means it is not going to fly.....
    ...

    Yeah.

    It is in this awkward area of "sorta works" but far from being entirely transparent.



    And with a glaring weakness:
    By its design, it is theoretically incapable of by itself forming a
    sandbox capable of stopping actively hostile code.

    It *could* still make it a PITA for human programmers to break out of,
    but if a determined human programmer (or an AI assisted one) could put
    in the work and break out of it via convoluted pointer de-referencing
    (and if this break-ability is likely necessary for things like the C
    runtime to be able to work), this is a weak point.

    And, if it can't lock down security against actively hostile code, then
    its more heavy-handed aspects are no longer justifiable.


    Meanwhile, if the task is subdivided, some similar benefits can be
    realized more cheaply:
    Bounds checked pointers to trap on out-of-bounds access;
    ASLR to make it much harder for shell-code to know where anything is;

    With pure PIC coding practices, and the link-pointer being stored directly onto the call stack (not the data stack), one has no particular reason to know where they currently are (IP) and few ways of seeing where they are.

    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.


    Can at least reduce success rate (for stomped LR being able to redirect control without immediate CPU fault) from 100% down to 0.4%, ...

    Stack canaries can also help, but then compilers (and programmers) like
    to disable them for sake of the "usually fraction of a percent"
    performance overhead.


    Can also be defeated if the attacker can guess the stack canary (could
    be very well possible if they know enough to be able to defeat the ASLR).

    If the start-up is predictable enough, an attacker could likely guess
    the ASLR, stack-canary value and XOR mask, and the magic number that
    goes in the link register, ... then we have a harder problem.


    Well, and then again, there is still the problem of all the potential weak-points that don't depend on injecting shell code. Like, it is
    rendered moot if a program can use a malformed string to escape commands
    into a script interpreter or use an insecure interface to get at the C library's "system()" function or similar...




    Any attack attempt will need to know an N-bit magic number.

    No need for a number, it cannot be accessed outside of calling and returning. {{Which is not being done with a series of instructions--but by 1 designed for the task at hand}}


    OK.



  • From quadibloc@quadibloc@invalid.com (John Savard) to comp.arch on Sun Jun 21 05:10:00 2026
    From Newsgroup: comp.arch

    On Sat, 20 Jun 2026 01:01:29 GMT, MitchAlsup
    <user5857@newsgrouper.org.invalid> wrote:

    Instead, C presents a programming model way down at the vonNeumann level::
    1 unit of work (step) at a time.

    That could be considered a flaw.

    Of course, there are languages that address that flaw.

    APL has mathematical operators that act directly on vectors and
    matrices without loops.

    Modula-2, ADA, and some other languages include constructs for
    parallel execution.

    But then, even C has fork().

    John Savard
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Jun 21 10:45:53 2026
    From Newsgroup: comp.arch

    On 2026-Jun-20 15:07, John Levine wrote:

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    That is one bullet we dodged!!

    I doubt it. Several parallel strands of RISC research independently found that moving complexity from the hardware into the compiler made computers faster and cheaper. IBM's PL.8 compiler had excellent error checking even though it was originally targeted at the RISC 801, but somehow people always want to turn off the error checks in the production build of their code.

    I suspect that is because error checks were so badly designed.
    e.g. the x86 BOUND instruction costs more to set up than it saves
    because it requires 2 bounds to be set up in memory and then
    read every time.

    If checks are designed from a risc point of view
    they should have little to no runtime costs.

    For example, almost all arrays are 1-dimension, base-0 or base-1
    and most array bounds are constants, so one only needs to check,
    - for base-0 a single index unsigned < register or constant limit,
    - for base-1 a single index != 0 and unsigned <= register or constant limit. (It uses an unsigned compare because signed negative integers
    will be treated as large positive unsigned integers and fault.)

    Since the index will already be in a register, this is just a
    reg-reg or reg-imm compare and possibly fault.
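
    Written out in C, the two checks above come down to the following.
    The function names are illustrative only, and the sketch assumes
    32-bit indices and a limit no larger than INT32_MAX, so that any
    negative index becomes an unsigned value above the limit.

```c
#include <assert.h>
#include <stdint.h>

/* Base-0 array, valid indices 0 .. limit-1: one unsigned compare.
   A negative index converts to a huge unsigned value and fails. */
int in_bounds_base0(int32_t index, uint32_t limit) {
    return (uint32_t)index < limit;
}

/* Base-1 array, valid indices 1 .. limit: reject 0, then one
   unsigned compare against the limit. */
int in_bounds_base1(int32_t index, uint32_t limit) {
    return index != 0 && (uint32_t)index <= limit;
}
```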

    There are two forms of conditional check ChkCC.
    With the standard check, the following LD or ST is not dependent on
    the check success and could be speculatively executed before an
    index fault was thrown. It is therefore slightly faster but not
    Spectre safe, so it is suitable only for environments that are already secure.

    The second form, a sequential check ChkSeqCC, has 3 operands:
    the source index register, the limit imm or register, and a dest register.
    ChkSeqCC rd_index, rs1_index, imm_limit
    ChkSeqCC rd_index, rs1_index, rs2_limit
    When the check succeeds the rs1_index is copied into rd_index register,
    and the rd_index register is then used in the LD or ST instruction.
    This creates a sequential dependency of the LD/ST on the check having
    been passed and thus blocks Spectre style speculative indexing.

  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sun Jun 21 15:16:29 2026
    From Newsgroup: comp.arch

    On Sat, 20 Jun 2026 22:08:23 GMT, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 9:25 AM, Robert Swindells wrote:
    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he
    could remember what kind of benchmarks they had run on the 88100
    that showed it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to
    use 2 shifts to perform an extract; indexing was scaled:: both
    helped interpreters.

    Production Lisp environments are not interpreters, even back then.


    Lisp is a funny language:
    Big promises in the design;
    But, only deliver them poorly (and can't improve on the delivery of any
    given thing without eroding the original promises).

    Simplicity and elegance of an interpreter,
    But only if already operating within a Lisp environment...
    Clean and elegant syntax,
    That actually sucks real hard to use in practice.
    Performance, but only if compiled to something else...

    A naive Lisp interpreter being almost the slowest style of
    interpreter...

    Given that one can create a data structure in LISP, and then execute it;
    how would you do this without an interpreter or a JIT ??

    You don't do that for anything serious.

    You have an Ahead Of Time compiler that takes a file of source code and generates a file of machine code equivalent to it just as you do for any
    other Algol-family language. You also have the option of compiling an individual function to RAM but not saving it out to a file.

    A good overview of the SoTA back then is this:

    <https://dreamsongs.com/Files/Timrep.pdf>

    I would expect that some PDP-10s at CMU had MacLisp installed on them.

    The Franz Lisp source code was available from UCB including the compiler.

    The commercial Common Lisp implementations from Franz Inc., Lucid Inc. and Harlequin came after this point. I don't know if any of them were ported
    to the M88K.

    The Lisp system developed by the SPICE Project at CMU initially targeted
    the PERQ but later switched to running on conventional CPUs and is still
    in use today in SBCL and CMUCL variants.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jun 21 14:52:05 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like this:

    p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p. And programs usually satisfy that requirement.

    typedef struct { // string with explicit length
        int len;
        char str[0];
    } varstr;

    varstr *p;
    char *s = "swordfish";

    // initialize p from s
    p = (varstr *)malloc(sizeof(varstr)+strlen(s));
    p->len = strlen(s);
    strncpy(p->str, s, p->len);

    so in practice pointers all have to be pointers to bytes or something
    that can losslessly be converted to and from them.

    So you want typed pointers. Other languages have more type safety.

    What kind of segmentation do you have in mind that would provide type
    safety?

    This evolution was certainly helped along by the horrible implementation
    of segmented memory in the Intel 8086 and 286, which persuaded people
    that segments are a plague to be avoided rather than a tool to make
    programs more reliable.

    The 286 provides segments that fit the C standard. It seems that what
    people found horrible about them was that they are limited to 64KB,
    and that using them is slow, and that the 80286 protected mode was
    completely at odds with real mode instead of an upwards-compatible
    thing.

    The limit could be fixed, and they would require more hardware
    resources and be even slower. The limit on the number of segments is
    probably also a problem if you want to use them for C-standard
    objects.

    Concerning the performance, one can probably improve that, at the cost
    of additional hardware, but I fail to see how any segment-based
    hardware could be as fast (or at least close to) as flat memory with
    software bounds checking.

    One issue that segments as on the 80286 do not fix is dangling
    references (C memory safety checkers go to considerable lengths to
    deal with that problem). So the language implementation of a language
    with explicit deallocation (e.g., C or Pascal) would deallocate the
    segment, but, given the finite number of segment numbers, pass out the
    segment number again, and an access through a dangling reference to
    the old segment could wreak havoc.
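
    A toy model of that reuse problem, assuming a made-up four-entry
    descriptor table and first-free selector allocation (this is not the
    286's actual descriptor format): once a freed selector is handed out
    again, the hardware-style check can no longer tell a stale reference
    from a live one.

```c
#include <assert.h>
#include <stddef.h>

#define NSEG 4   /* deliberately tiny: the number of selectors is finite */

/* Hypothetical descriptor: just a valid bit and a limit. */
typedef struct { int in_use; size_t limit; } SegDesc;
static SegDesc seg_table[NSEG];

/* Allocate a segment of the given size; returns a selector or -1. */
int seg_alloc(size_t limit) {
    for (int s = 0; s < NSEG; s++) {
        if (!seg_table[s].in_use) {
            seg_table[s].in_use = 1;
            seg_table[s].limit  = limit;
            return s;
        }
    }
    return -1;
}

/* Deallocate: the selector becomes available for reuse. */
void seg_free(int s) {
    assert(s >= 0 && s < NSEG);
    seg_table[s].in_use = 0;
}

/* The check hardware can make: selector valid, offset below limit. */
int seg_access_ok(int s, size_t off) {
    return s >= 0 && s < NSEG && seg_table[s].in_use
        && off < seg_table[s].limit;
}
```

    With this model, freeing a 16-byte segment and then allocating a
    64-byte one returns the same selector, after which an access through
    the old reference at offset 32 passes the check even though it was
    never valid for the original object.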

    One problem connected to C and the 286 segments is that each pointer
    would require a segment number and an offset within the segment. For
    a language like Java, the segment number would be sufficient. But,
    e.g., Pascal reference parameters can reference a specific field in a
    record or an array element, so they would have to be represented by segment+offset, too.

    Do you have any example of non-horrible segmentation that provides
    memory safety? If not, do you have an idea what that would look like?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jun 21 15:38:52 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    - [C] was designed to not
    be too different from the underlying hardware.

    The underlying hardware *of that time*. Therefore, it may have
    contributed to "locking in" that style of hardware. But I do not pretend
    to know that a different style of hardware would be better today.

    C evolved from B which had a memory model that addressed words, which made
    sense for a lot of the computers of the 1960s. I gather the earliest
    versions of C were on the GE 635 which was a 36 bit word addressed machine
    but when it moved to the byte addressed PDP-11, dmr had to add typed pointers
    so it could do something reasonable with pointers to character strings vs
    pointers to words.

    My understanding is that Ritchie designed C to address B's
    deficiencies for byte-addressed machines; so C for a word-addressed
    machine makes no sense.

    Looking at "The Development of the C Language" <https://9p.io/cm/cs/who/dmr/chist.html>, Ritchie writes:

    |In 1971 I began to extend the B language by adding a character type
    |and also rewrote its compiler to generate PDP-11 machine instructions
    |instead of threaded code. Thus the transition from B to C was
    |contemporaneous with the creation of a compiler capable of producing
    |programs fast and small enough to compete with assembly language. I
    |called the slightly-extended language NB, for `new B.'

    There are three mentions of the GE-635 in this paper, none of them about
    a C compiler.

    I think that with or without C, flat byte addressed memory would have won out
    due to the success of S/360 and the PDP-11, both of which were programmed
    in lots of languages other than C.

    I agree. The IBM 704 also has flat memory.

    From the Datapoint 2200 up to and including the 8085 Intel used flat
    memory. The segments of the 8086 look more like a way to support more
    than 64KB than anything else:

    Memory safety? No.

    Rearrange memory? No, because the segment number directly specifies
    the memory.
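
    For reference, the 8086 address calculation behind that remark is
    simply segment * 16 + offset, truncated to 20 bits. The function name
    below is made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 8086 physical address: the 16-bit segment value is shifted left by 4
   and added to the 16-bit offset; the result wraps at 20 bits (1 MB). */
uint32_t phys_8086(uint16_t seg, uint16_t off) {
    return (((uint32_t)seg << 4) + off) & 0xFFFFF;
}
```

    Because the segment value is part of the address arithmetic rather
    than an index into a table, there is no limit to check and nothing
    the OS can remap: 1234:0010 and 1235:0000 both name physical 0x12350.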

    The 80286 protected mode was a serious attempt to provide memory
    safety and to make memory rearrangeable, but it failed in the market
    before C became dominant. Most software used real mode on the 286.

    Then they developed the 386 and knew that people want flat memory, so
    they found a way to extend 286 protected mode to the 386, but in a way
    that allows using a flat memory model; they also provided decent ways
    to run 8086 software (in particular, virtual-8086 mode); and the 386
    added paging.

    The 6800, 6502 and 68000 were designed before C became dominant and
    have flat addressing. And of course ARM also has flat addressing,
    because of the 6502 heritage, and because they could not afford
    segmentation frills.

    The other kind of "segmentation" I encountered is HPPA's and Power's
    address handling, but that has nothing to do with memory safety or
    other language concepts.

    Call by name and thunks were a mistake. The Algol committee was trying
    to write an elegant description of call by reference and only when
    Jensen's device came along did they realize what they'd done. Alan
    Perlis, who was on the Algol committee, told me this. Then when they
    tried to fix Algol 60, the committee was hijacked by people who
    produced Algol 68 which was quite a good language, but was defined so
    obscurely that people wrongly assumed it was hard to learn and use.

    The Van Wijngaarden grammars of Algol 68 can be seen as second-systems
    effect. After the success of BNF for Algol 60, they wanted to
    increase the reach of formal specifications, so they developed vW
    grammars and specified Algol 68 in it. No successful language
    followed that step (and I think most languages where formalism extends
    beyond the context-free grammar have not become popular).

    Was that the main reason why Algol 68 never became popular? Maybe,
    but others were possible. Algol 60, which did not have this problem,
    also was more popular as inspiration for other programming languages
    than for writing programs (Burroughs machines excepted).

    GCC now supports Algol 68, so one can try it out relatively easily.

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    AFAIK the B6500 leaves all security and safety to the compiler, so in
    a way it is the exact opposite of the iAPX432. We have settled on an
    in-between approach, where safety inside a process is the job of the
    compiler (or the programmer), while security between processes is
    managed by hardware.

    Concerning the 432: The 286 protected mode was inspired by the 432,
    but did Pascal compilers make use of it? Not that I know of.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:20:56 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a
    THROW() and its unstructured equivalent longjmp().

    In addition, code does not need to access a GOT entry and then call
    the address of an entry, one can LD directly into IP and at the
    same time deposit the return address where it can't be stomped.

    Only EXIT and RET can access the return address and 99% of the
    time it goes directly into IP.


    Can at least reduce success rate (for stomped LR being able to redirect control without immediate CPU fault) from 100% down to 0.4%, ...

    My way gets it down to 0%.

    Stack canaries can also help, but then compilers (and programmers) like
    to disable them for sake of the "usually fraction of a percent"
    performance overhead.

    Stack canaries are <unnecessary> added work to the instruction stream.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:29:55 2026
    From Newsgroup: comp.arch


    quadibloc@invalid.com (John Savard) posted:

    On Sat, 20 Jun 2026 01:01:29 GMT, MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Instead, C presents a programming model way down at the von Neumann level::
    1 unit of work (step) at a time.

    That could be considered a flaw.

    I assume you do not want to render your ISA code where each instruction performs 1/3 units of work. So, we are left with how many units of work
    should a single instruction do (incrementing IP is not considered a UoW).

    Indeed, compared to RISC-V, My 66000 performs 1.4 UoW per RISC-V UoW.

    Of course, there are languages that address that flaw.

    APL has mathematical operators that act directly on vectors and
    matrices without loops.

    By stating that this {vector, matrix} calculation is 1 step, you address
    the vast majority of the 'flaw' mentioned above.

    Modula-2, ADA, and some other languages include constructs for
    parallel execution.

    But then, even C has fork().

    John Savard
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:37:45 2026
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    On 2026-Jun-20 15:07, John Levine wrote:

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    That is one bullet we dodged!!

    I doubt it. Several parallel strands of RISC research independently found that moving complexity from the hardware into the compiler made computers faster and cheaper. IBM's PL.8 compiler had excellent error checking even though it was originally targeted at the RISC 801, but somehow people always
    want to turn off the error checks in the production build of their code.

    I suspect that is because error checks were so badly designed.
    e.g. the x86 BOUND instruction costs more to set up than it saves
    because it requires 2 bounds to be set up in memory and then
    read every time.

    If checks are designed from a risc point of view
    they should have little to no runtime costs.

    For example, almost all arrays are 1-dimension, base-0 or base-1
    and most array bounds are constants, so one only needs to check,
    - for base-0 a single index unsigned < register or constant limit,
    - for base-1 a single index != 0 and unsigned <= register or constant limit. (It uses an unsigned compare because signed negative integers
    will be treated as large positive unsigned integers and fault.)

    Since the index will already be in a register, this is just a
    reg-reg or reg-imm compare and possibly fault.
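
    A minimal C sketch of the two checks described above, assuming 64-bit
    indices. The function names are hypothetical, and the third variant
    (folding the base-1 test into one compare by subtracting 1) is an
    addition for illustration, not something stated in the post:

```c
#include <stdbool.h>
#include <stdint.h>

/* Base-0: valid indices are 0 .. limit-1.
 * Converting a negative signed index to unsigned yields a huge value,
 * so one unsigned compare rejects both "index < 0" and "index >= limit". */
static inline bool in_bounds_base0(int64_t index, uint64_t limit)
{
    return (uint64_t)index < limit;
}

/* Base-1: valid indices are 1 .. limit. */
static inline bool in_bounds_base1(int64_t index, uint64_t limit)
{
    return index != 0 && (uint64_t)index <= limit;
}

/* Base-1 with a single compare: index 0 wraps to UINT64_MAX after the
 * subtraction and therefore fails the "<" test. */
static inline bool in_bounds_base1_single(int64_t index, uint64_t limit)
{
    return (uint64_t)index - 1u < limit;
}
```

    With a constant limit, each of these would typically compile to one
    compare (plus at most one subtract) followed by a branch or trap.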

    My 66000 has bounds checks built into the CMP instruction.
    C would use the CIN check (0<=Rindex<Rcomparand)
    Fortran would use the FIN check (0<Rindex<=Rcomparand)
    An advantage of condition-code-less comparisons.

    There are two forms of conditional check ChkCC.
    With the standard check, the following LD or ST is not dependent on
    the check success and could be speculatively executed before an
    index fault was thrown. It is therefore slightly faster but not
    Spectre safe, so it is not suitable for secure environments.

    Everything becomes Spectre safe when caches are not updated until
    the causing instruction retires, AND when memory indirect* cannot
    access the cache a second time, until the first access passes its
    RWE permission checks.

    (*) LD Ri,[address] // get pointer
    MEM Rd,[Ri,...,...] // touch memory indirectly

    The second form, a sequential check ChkSeqCC, has 3 operands:
    the source index register, the limit imm or register, and a dest register.
    ChkSeqCC rd_index, rs1_index, imm_limit
    ChkSeqCC rd_index, rs1_index, rs2_limit
    When the check succeeds the rs1_index is copied into rd_index register,
    and the rd_index register is then used in the LD or ST instruction.
    This creates a sequential dependency of the LD/ST on the check having
    been passed and thus blocks Spectre style speculative indexing.
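
    A rough software analogue of that data dependency, sketched in C. This
    is not the ChkSeqCC instruction: it clamps a bad index to 0 instead of
    faulting, and the names checked_index and load_checked are hypothetical.
    A compiler is free to turn the comparison into a branch, which is why
    real implementations of this idea (e.g. array_index_nospec in the Linux
    kernel) rely on carefully written arithmetic or inline assembly:

```c
#include <stddef.h>

/* Returns idx when idx < limit, otherwise 0. The result is computed from
 * the comparison by masking, so the later load depends on the check by
 * data flow rather than by a predicted branch. */
static inline size_t checked_index(size_t idx, size_t limit)
{
    size_t mask = (size_t)0 - (size_t)(idx < limit); /* all ones or zero */
    return idx & mask;
}

/* Assumes limit > 0, so that element 0 is always a valid fallback. */
static inline int load_checked(const int *array, size_t limit, size_t idx)
{
    return array[checked_index(idx, limit)];
}
```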

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Jun 21 13:55:59 2026
    From Newsgroup: comp.arch

    On 6/21/2026 10:16 AM, Robert Swindells wrote:
    On Sat, 20 Jun 2026 22:08:23 GMT, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 9:25 AM, Robert Swindells wrote:
    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he
    could remember what kind of benchmarks they had run on the 88100
    that showed it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to
    use 2 shifts to perform an extract; indexing was scaled:: both
    helped interpreters.

    Production Lisp environments are not interpreters, even back then.


    Lisp is a funny language:
    Big promises in the design;
    But, only deliver them poorly (and can't improve on the delivery of any
    given thing without eroding the original promises).

    Simplicity and elegance of an interpreter,
    But only if already operating within a Lisp environment...
    Clean and elegant syntax,
    That actually sucks real hard to use in practice.
    Performance, but only if compiled to something else...

    A naive Lisp interpreter being almost the slowest style of
    interpreter...

    Given that one can create a data structure in LISP, and then execute it;
    how would you do this without an interpreter or a JIT ??

    You don't do that for anything serious.

    You have an Ahead Of Time compiler that takes a file of source code and generates a file of machine code equivalent to it just as you do for any other Algol-family language. You also have the option of compiling an individual function to RAM but not saving it out to a file.

    A good overview of the SoTA back then is this:

    <https://dreamsongs.com/Files/Timrep.pdf>

    I would expect that some PDP-10s at CMU had MacLisp installed on them.

    The Franz Lisp source code was available from UCB including the compiler.

    The commercial Common Lisp implementations from Franz Inc., Lucid Inc. and Harlequin came after this point. I don't know if any of them were ported
    to the M88K.

    The Lisp system developed by the SPICE Project at CMU initially targeted
    the PERQ but later switched to running on conventional CPUs and is still
    in use today in SBCL and CMUCL variants.


    Yeah.


    Though, I guess one merit of a Lisp like language is that it is a lot
    easier to parse, and it could be possible to implement a fairly cheap
    compiler for it (in the basic case).

    Usual downside is that the excessive parentheses tend to turn into a
    usability issue.

    One other major hassle was typically a lack of C style loops (with break
    or continue), but this could be addressed in theory.


    There is the pros/cons aspect of being typically dynamically typed, but
    could be possible to define a statically-typed dialect. Pure dynamic
    typing has unavoidable performance costs, etc.


    Hmm...
    (let (((:int x) 1) ((:int y) 2)))
    (defun (:int foo) ((:int x) (:int y)) (+ x y))

    Where, say, replacing a symbol with (:keyword symbol) being understood
    to declare it as a typed symbol.

    Likely (:keyword expr) could in other contexts be understood as a
    contextual attribute, with (cast :type expr) for casts.

    Could make sense to require all non-primitive types to effectively be typedef'ed rather than declared by an inline notation.

    (typedef pvoid (:ptr :void))
    (typedef ppvoid (:ptr :pvoid))

    The default type could maybe be a sort of auto-variant:
    Make an attempt to infer the type if possible;
    Failing this, fall back to variant (dynamic types).


    Could turn this into a more C-like language by throwing a full parser on
    the front (this was how my second implementation of the BGBScript VM
    worked). Initially, it remained fully dynamically typed, but the final
    version (before this project/language had died) had moved over to a static-typed core (with the top-level language kinda resembling
    ActionScript3 or HaXE).



    Could likely go from S expressions to a stack based bytecode, but could
    make sense to store the bytecode in a format where separate random
    access is more possible. This was a downside of BGBCC's bytecode; it
    basically required "all at once" processing. To be more friendly to a
    compiler with a more limited memory budget, would ideally want to be
    able to read in the bytecode and walk the reachability graph without
    needing to convert the whole thing to 3AC.

    Maybe keep the format still able to support C and similar.

    FWIW: BGBCC's RIL format was itself partly derived from the BGBScript
    VM's bytecode format (they had originally sort of co-evolved; and used
    very similar approaches).

    As noted, the BSVM backend didn't interpret the bytecode directly, but
    rather unpacked it into a 3AC format which was what it actually ran (vs
    BGBCC where the 3AC is instead used to drive machine-code generation).

    Both were developed in the era when RAM was assumed plentiful and cheap
    (or, at least, normal desktop PC doesn't care if one loads something the
    size of Doom or Quake entirely into decoded 3AC ops).

    Well, ironically, this is what JX2VM does as well for emulating stuff,
    but JX2VM couldn't really self host...



    A compromise option would be to move the outer layers to a TLV format,
    but not try to turn it into a structure resembling the RIFF/AVI format
    (or .NET style table-driven metadata), but instead maybe stay with local tagging.

    Say:
    BYTE tag; BYTE nlen; BYTE data[~nlen]; //small tag
    WORD tag; WORD nlen; BYTE data[~nlen]; //medium tag
    DWORD tag; DWORD nlen; BYTE data[~nlen]; //long tag
    Where, TAG bytes are required to stay in the range 01..7F (or 20..7E),
    and lengths are stored as a complement, so the high-order bit (HOB) is
    always set.

    This format makes it possible to detect the tag format, without more heavy-handed tagging (like in ASN.1 or MKV). Had used this sort of TLV
    format in some of my other formats (and ironically, contains a RIFF-like format as a subset case; differing primarily in that the length is complemented, and the data's stored length is not padded up to an even
    byte).
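
    A minimal C sketch of how the three header sizes could be told apart.
    It assumes little-endian multi-byte fields (so the HOB of nlen sits in
    the last header byte), which the post does not specify; tlv_header and
    tlv_parse_header are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;        /* tag bytes, first byte in the low bits */
    uint32_t length;     /* payload length, complement already undone */
    size_t   header_len; /* 2, 4 or 8 bytes */
} tlv_header;

static inline int is_tag_byte(uint8_t b)
{
    return b >= 0x01 && b <= 0x7F;
}

/* Returns 0 on success, -1 if the header is truncated or malformed. */
static inline int tlv_parse_header(const uint8_t *p, size_t avail, tlv_header *out)
{
    if (avail < 2 || !is_tag_byte(p[0]))
        return -1;
    if (p[1] & 0x80) {                      /* small: BYTE tag, BYTE nlen */
        out->tag = p[0];
        out->length = (uint8_t)~p[1];
        out->header_len = 2;
        return 0;
    }
    if (avail < 4 || !is_tag_byte(p[1]))
        return -1;
    if (p[3] & 0x80) {                      /* medium: WORD tag, WORD nlen */
        uint16_t nlen = (uint16_t)(p[2] | (p[3] << 8));
        out->tag = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
        out->length = (uint16_t)~nlen;
        out->header_len = 4;
        return 0;
    }
    if (avail < 8 || !is_tag_byte(p[2]) || !is_tag_byte(p[3]) || !(p[7] & 0x80))
        return -1;                          /* long: DWORD tag, DWORD nlen */
    uint32_t nlen = (uint32_t)p[4] | ((uint32_t)p[5] << 8) |
                    ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
    out->tag = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    out->length = ~nlen;
    out->header_len = 8;
    return 0;
}
```

    Because tag bytes never have the high bit set and the last length byte
    always does, checking byte 1, then byte 3, then byte 7 is enough to
    pick the format under this byte-order assumption.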


    So, one can have tags like:
    'FN': Function
    'VR': Variable
    'LV': Local Variables (functions)
    'LA': Local Arguments (functions)
    'N': Name (String Index, Name)
    'T': Type (String Index, Sig)
    'F': Flags (Maybe String Index, Sig-like)
    'B': Body (functions, bytecode)
    'V': Value (variables, maybe bytecode)
    'G': Global Dependencies List (likely global indices)

    May or may not make sense (for a compiler IL) to still use bytecode for declaring values.

    RIL had used inline strings, but arguably it is better to use a string
    table (with string literals as offsets). May make sense to support both
    normal (raw) strings, and LZ compressed strings (for larger text/data
    blobs).

    Might make sense to have both a global and local (per-function) string
    tables, with the literals able to encode which table they are pulling from.

    Possibly each declaration would be required to give a list of global dependencies (so that the dependency graph-walk can be done without
    needing to process the bytecode).


    Ideally, want to be able to use a process where one can read the image
    into RAM, and pull out bits when needed, rather than needing to up-front
    parse the whole thing (which was a design limitation with RIL, which
    wasn't really designed with the idea of needing to compile stuff with a
    more limited memory footprint).

    Likely the bytecode could use a similar format to RIL, say:
    Opcodes:
    00..DF: Single Byte
    E0..EF: Two Byte (0000..0FFF)
    (Longer, probably unneeded)
    ...
    Integer Numbers:
    00..7F: One Byte (00..7F)
    80..BF: Two Byte (0000..3FFF)
    C0..DF: Three Byte (...)
    ...
    Signed integers are stored sign-folded.
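
    A short C sketch of the integer encoding listed above, covering only
    the one- to three-byte forms. Two details are assumptions: that
    "sign-folded" means the usual zigzag mapping (0, -1, 1, -2 become
    0, 1, 2, 3), and that the bytes after the prefix byte are stored high
    byte first. The function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Zigzag fold: small magnitudes of either sign become small unsigned values. */
static inline uint32_t sign_fold(int32_t v)
{
    return v < 0 ? ~((uint32_t)v << 1) : (uint32_t)v << 1;
}

static inline int32_t sign_unfold(uint32_t u)
{
    return (int32_t)(u >> 1) ^ -(int32_t)(u & 1u);
}

/* Writes v in 1..3 bytes; returns the byte count, or 0 if v needs a longer form. */
static inline size_t put_uint(uint8_t *out, uint32_t v)
{
    if (v < 0x80u) {                 /* 00..7F: one byte */
        out[0] = (uint8_t)v;
        return 1;
    }
    if (v < 0x4000u) {               /* 80..BF: two bytes, 14 bits */
        out[0] = (uint8_t)(0x80u | (v >> 8));
        out[1] = (uint8_t)v;
        return 2;
    }
    if (v < 0x200000u) {             /* C0..DF: three bytes, 21 bits */
        out[0] = (uint8_t)(0xC0u | (v >> 16));
        out[1] = (uint8_t)(v >> 8);
        out[2] = (uint8_t)v;
        return 3;
    }
    return 0;
}

/* Reads one value; returns bytes consumed, or 0 for an unsupported prefix. */
static inline size_t get_uint(const uint8_t *p, uint32_t *v)
{
    uint8_t b = p[0];
    if (b < 0x80) { *v = b; return 1; }
    if (b < 0xC0) { *v = ((uint32_t)(b & 0x3F) << 8) | p[1]; return 2; }
    if (b < 0xE0) {
        *v = ((uint32_t)(b & 0x1F) << 16) | ((uint32_t)p[1] << 8) | p[2];
        return 3;
    }
    return 0;
}
```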

    The format isn't likely to be really optimized for interpreters, and
    would probably stick with leaving off opcode types (type information
    would be carried along the operand stack).


    If compiling a language like C, would need to decide between one
    bytecode blob per translation unit, or merging all TU's into a single
    big blob for libraries.

    If aiming for a RAM-conserving compiler, per-TU blobs likely make more
    sense (with libraries either storing compound blobs, or something like
    WAD or "!<arch>" libraries). Almost more tempting to just use a WAD
    variant for libraries (the traditional ".a" format just kinda sucks;
    would almost rather just use the ".tar" format than this).

    ...


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 18:57:56 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like this:

    p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it through pointers derived from p. And programs usually satisfy that requirement.

    {
    p = (struct foo *) malloc(42 * sizeof(struct foo));
    fprintf( stream, "0x16,", p );
    ...
    if( fscanf( stream, "x16", q ) ) {
    use q
    }
    }

    is q "derived" through p ??


    typedef struct { // string with explicit length
        int len;
        char str[0];
    } varstr;

    varstr *p;
    char *s = "swordfish";

    // initialize p from s
    p = (varstr *)malloc(sizeof(varstr)+strlen(s));
    p->len = strlen(s);
    strncpy(p->str, s, p->len);

    so in practice pointers all have to be pointers to bytes or something
    that can losslessly be converted to and from them.

    So you want typed pointers. Other languages have more type safety.

    What kind of segmentation do you have in mind that would provide type
    safety?

    This evolution was certainly helped along by the horrible implementation
    of segmented memory in the Intel 8086 and 286, which persuaded people
    that segments are a plague to be avoided rather than a tool to make
    programs more reliable.

    The 286 provides segments that fit the C standard. It seems that what
    people found horrible about them was that they are limited to 64KB,
    and that using them is slow, and that the 80286 protected mode was
    completely at odds with real mode instead of an upwards-compatible
    thing.

    That did not help--but the problem goes much deeper--especially with
    modern software adding DLLs as the <static> application ages.

    The limit could be fixed, and they would require more hardware
    resources and be even slower. The limit on the number of segments is
    probably also a problem if you want to use them for C-standard
    objects.

    Yes, for a 64-bit VAS, you want a 64-bit pointer to the start, no less
    than a 40-bit value for its size, and then 2 (or 3) sets of permissions--
    each permission being 7-8 bits long. You can pack all this into a
    128-bit descriptor if you accept that no segment can be larger than 2^40
    bytes. A 40-bit size is a bit on the restrictive side.
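
    One way such a 128-bit descriptor could be laid out, sketched in C.
    The field order, the choice of three 8-bit permission sets, and the
    R/W/X bit meanings are all hypothetical; the post only gives the
    widths. With a plain 40-bit size field the largest size this sketch
    can represent is 2^40 - 1 bytes:

```c
#include <stdbool.h>
#include <stdint.h>

#define SEG_SIZE_BITS 40
#define SEG_SIZE_MASK ((UINT64_C(1) << SEG_SIZE_BITS) - 1)

enum { SEG_R = 1, SEG_W = 2, SEG_X = 4 };   /* hypothetical permission bits */

typedef struct {
    uint64_t base;        /* 64-bit start address */
    uint64_t size_perms;  /* bits 0..39 size, bits 40..63 three 8-bit sets */
} seg_descriptor;

_Static_assert(sizeof(seg_descriptor) == 16, "descriptor must be 128 bits");

static inline seg_descriptor seg_make(uint64_t base, uint64_t size,
                                      uint8_t p0, uint8_t p1, uint8_t p2)
{
    seg_descriptor d;
    d.base = base;
    d.size_perms = (size & SEG_SIZE_MASK) | ((uint64_t)p0 << 40) |
                   ((uint64_t)p1 << 48) | ((uint64_t)p2 << 56);
    return d;
}

/* True if [addr, addr+len) lies inside the segment and permission set
 * number `set` (0..2) grants every bit in `need`. */
static inline bool seg_check(const seg_descriptor *d, uint64_t addr,
                             uint64_t len, unsigned set, uint8_t need)
{
    if (set >= 3)
        return false;
    uint64_t size   = d->size_perms & SEG_SIZE_MASK;
    uint64_t offset = addr - d->base;   /* wraps to a huge value if addr < base */
    uint8_t  perms  = (uint8_t)(d->size_perms >> (40 + 8 * set));
    return offset <= size && len <= size - offset && (perms & need) == need;
}
```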

    Concerning the performance, one can probably improve that, at the cost
    of additional hardware, but I fail to see how any segment-based
    hardware could be as fast (or at least close to) as flat memory with
    software bounds checking.

    Neither does anyone else in the RISC camp who architects instructions.

    Yes, you can throw a bunch of HW at the problem and almost make the
    performance degradation vanish--but at what cost {area, power, cycles}??

    One issue that segments as on the 80286 do not fix is dangling
    references (C memory safety checkers go to considerable lengths to
    deal with that problem). So the language implementation of a language
    with explicit deallocation (e.g., C or Pascal) would deallocate the
    segment, but, given the finite number of segment numbers, pass out the
    segment number again, and an access through a dangling reference to
    the old segment could wreak havoc.

    Yes, you are going to want at least 2^32 segments...

    One problem connected to C and the 286 segments is that each pointer
    would require a segment number and an offset within the segment. For
    a language like Java, the segment number would be sufficient. But,
    e.g., Pascal reference parameters can reference a specific field in a
    record or an array element, so they would have to be represented by segment+offset, too.

    And an ability to create, restrict, pass, and receive segment-descriptors across some kind of interface that takes very few instructions to set up
    and use--to other threads which share only part of your VAS {memmap()}.

    Do you have any example of non-horrible segmentation that provides
    memory safety? If not, do you have an idea what that would look like?

    - anton
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jun 21 19:02:32 2026
    From Newsgroup: comp.arch


    Robert Swindells <rjs@fdy2.co.uk> posted:

    On Sat, 20 Jun 2026 22:08:23 GMT, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 9:25 AM, Robert Swindells wrote:
    On Fri, 19 Jun 2026 14:34:10 GMT, MitchAlsup wrote:

    Robert Swindells <rjs@fdy2.co.uk> posted:

    In previous discussions, I had tried to press Mitch to see if he
    could remember what kind of benchmarks they had run on the 88100
    that showed it running Lisp faster than SPARC.

    M88K shift instructions could perform extracts, whereas SPARC had to
    use 2 shifts to perform an extract; indexing was scaled:: both
    helped interpreters.

    Production Lisp environments are not interpreters, even back then.


    Lisp is a funny language:
    Big promises in the design;
    But, only deliver them poorly (and can't improve on the delivery of any
    given thing without eroding the original promises).

    Simplicity and elegance of an interpreter,
    But only if already operating within a Lisp environment...
    Clean and elegant syntax,
    That actually sucks real hard to use in practice.
    Performance, but only if compiled to something else...

    A naive Lisp interpreter being almost the slowest style of
    interpreter...

    Given that one can create a data structure in LISP, and then execute it; how would you do this without an interpreter or a JIT ??

    You don't do that for anything serious.

    You have an Ahead Of Time compiler that takes a file of source code and generates a file of machine code equivalent to it just as you do for any other Algol-family language. You also have the option of compiling an individual function to RAM but not saving it out to a file.

    A good overview of the SoTA back then is this:

    <https://dreamsongs.com/Files/Timrep.pdf>

    I would expect that some PDP-10s at CMU had MacLisp installed on them.

    The Franz Lisp source code was available from UCB including the compiler.

    The commercial Common Lisp implementations from Franz Inc., Lucid Inc. and Harlequin came after this point. I don't know if any of them were ported
    to the M88K.

    I was told that the prolog application on M88K was faster than competing
    RISC processors. I remember that SPECint XLISP and M88Ksim were higher performing than several other competitors. I was told that the bit-field extract instructions had a lot to do with that.

    The Lisp system developed by the SPICE Project at CMU initially targeted
    the PERQ but later switched to running on conventional CPUs and is still
    in use today in SBCL and CMUCL variants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sun Jun 21 21:15:22 2026
    From Newsgroup: comp.arch

    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like this:
    p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p. And programs usually satisfy that
    requirement.

    {
    p = (struct foo *) malloc(42 * sizeof(struct foo));
    fprintf( stream, "0x16,", p );
    ...
    if( fscanf( stream, "x16", q ) ) {
    use q
    }
    }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer provenance"
    and when a compiler can know pointers definitely alias, definitely do
    not alias, can be assumed not to alias or must be assumed to possibly
    alias. I haven't followed the details enough to say what would be the
    case here, or if a consensus has been reached about such situations.

    However, Anton did say that programs /usually/ satisfy that requirement.
    It is possible to play silly buggers with pointers in C, but few
    people would do something as "creative" as you are suggesting here.

    There can be good reasons for doing some weird things with pointers in
    C, and it's not always clear what is entirely allowed or not. Sometimes
    it is safer to use unsigned character pointers (such as with "memcpy")
    to be sure - there's a risk to efficiency, but good compilers will often generate efficient results.

    C++23 introduced "start_lifetime_as", which lets you be a bit clearer
    about some things.

    I doubt if anyone will claim that C is a "perfect" language here, or
    that there are no risks of misunderstandings between the programmer and
    the compiler if you try to do very strange things. But usually you
    don't need to do such strange things in code.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sun Jun 21 19:52:10 2026
    From Newsgroup: comp.arch

    On Sun, 21 Jun 2026 19:02:32 GMT, MitchAlsup wrote:

    I was told that the prolog application on M88K was faster than competing
    RISC processors. I remember that SPECint XLISP and M88Ksim were higher performing than several other competitors. I was told that the bit-field extract instructions had a lot to do with that.

    XLisp doesn't use tags to encode types, it just has a C union of structs
    with a one byte field for the type. You can still find the source to the version used by SPECint. It doesn't provide a compiler.

    It isn't a useful benchmark for anyone who was interested in running Lisp
    back then.

    I'm not trying to defend SPARC and am happy to take your word for it that
    M88K was fast for the time.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Sun Jun 21 19:56:40 2026
    From Newsgroup: comp.arch

    On Sun, 21 Jun 2026 13:55:59 -0500, BGB wrote:

    Though, I guess one merit of a Lisp like language is that it is a lot
    easier to parse, and it could be possible to implement a fairly cheap compiler for it (in the basic case).

    Usual downside is that the excessive parentheses tend to turn into a usability issue.

    You use an editor that keeps track of them.

    One other major hassle was typically a lack of C style loops (with break
    or continue), but this could be addressed in theory.

    It is addressed in practice.

    You could run SBCL on your CPU, it has a RISC-V backend to the compiler.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jun 21 13:11:25 2026
    From Newsgroup: comp.arch

    On 6/21/2026 11:37 AM, MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    On 2026-Jun-20 15:07, John Levine wrote:

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    That is one bullet we dodged!!

    I doubt it. Several parallel strands of RISC research independently found
    that moving complexity from the hardware into the compiler made computers
    faster and cheaper. IBM's PL.8 compiler had excellent error checking even
    though it was originally targeted at the RISC 801, but somehow people always
    want to turn off the error checks in the production build of their code.

    I suspect that is because error checks were so badly designed.
    e.g. the x86 BOUND instruction costs more to set up than it saves
    because it requires 2 bounds to be set up in memory and then
    read every time.

    If checks are designed from a risc point of view
    they should have little to no runtime costs.

    For example, almost all arrays are 1-dimension, base-0 or base-1
    and most array bounds are constants, so one only needs to check,
    - for base-0 a single index unsigned < register or constant limit,
    - for base-1 a single index != 0 and unsigned <= register or constant limit.
    (It uses an unsigned compare because signed negative integers
    will be treated as large positive unsigned integers and fault.)

    Since the index will already be in a register, this is just a
    reg-reg or reg-imm compare and possibly fault.

    My 66000 has bounds checks built into the CMP instruction.
    C would use the CIN check (0<=Rindex<Rcomparand)
    Fortran would use the FIN check (0<Rindex<=Rcomparand)
    An advantage of condition-code-less comparisons.

    Yes, although a perhaps minor quibble. You would use the compare
    followed presumably by a branch on bit instruction. I believe Eric's
    proposal would generate a fault if the comparison failed, so a single instruction versus two for your solution. I am not sure how much the
    extra instruction costs, but if it occurs on every array reference, it
    might be an issue.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Jun 21 16:22:43 2026
    From Newsgroup: comp.arch

    On 6/21/2026 1:20 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a
    THROW() and its unstructured equivalent longjmp().

    In addition, code does not need to access a GOT entry and then call
    the address of an entry, one can LD directly into IP and at the
    same time deposit the return address where it can't be stomped.

    Only EXIT and RET can access the return address and 99% of the
    time it goes directly into IP.


    This requires a CPU that can deal with PUSH/POP mechanics in hardware.
    With a Link Register, HW doesn't need to deal with this.



    Then again, did see a video recently about a new interrupt-handling and
    system call mechanism for x86-64 (called FRED).

    And the big apparent change:
    Mostly makes SYSCALL behave like a normal interrupt, but drops the IDT
    in favor of BaseRegister + Disp and similar.


    So, seemingly:
    IDT:
    Push RIP and RFLAGS and similar;
    Jump to entry point loaded from IDT;
    Specific behavior depends on interrupt type, etc.
    SYSCALL:
    Copies RIP and RFLAGS to different registers;
    Jump to entry point from an MSR.
    New mechanism (FRED):
    Pushes stuff to stack again, but more stuff to the stack;
    Jump to fixed entry point with a per-category displacement;
    Stack contents are more consistent.


    Well, contrast the interrupt mechanism used in my ISA:
    Saves SR (flags) and PC and similar to special registers;
    Branches to VBR + Disp (category);
    Loads some mode-state from VBR;
    VBR can encode which ISA mode handles the interrupt (similar to LR).
    Mode flag causes SP and SSP to swap places in the decoder.
    Debatable, but stack-swapping avoids a bunch of PITA...


    There is a difference in that in my ISA designs, ISR needs to manually save/restore GPRs, whereas traditionally x86 / x86-64 dumps them to the
    TSS. Well, and ironically it seems FRED drops the TSS, so apparently now
    the ISR needs to save/restore all the GPRs itself.


    Like, they are seemingly (almost) converging towards a mechanism similar
    to what I was using, just with more pushing stuff to the stack, and 4
    stack pointers (one for each ring), rather than 2.

    Well, if they would have skipped the stack PUSH'ing and just copied the
    RIP and RFLAGS and similar to MSRs, it would have been "almost the same thing".


    Almost funny in a way, as my mechanism was basically designed around
    trying to do the bare minimum of what could have been done and still
    allow stuff to work.


    Could probably LOL pretty hard if x86-64 then went and added an explicit "Branch-with-Link" instruction (as a replacement for CALL).

    ...




    Can at least reduce success rate (for stomped LR being able to redirect
    control without immediate CPU fault) from 100% down to 0.4%, ...

    My way gets it down to 0%.


    One could also make a case for having a separate Stack and Hunk, with
    things like arrays/structs going onto the Hunk rather than the Stack.

    But, this would be like 2 pointers and some extra mechanics when the
    Hunk needs to be used.

    This would also eliminate the same basic issue, because buffer overflows
    can't stomp the saved registers (and if the Hunk moves upwards, then OOB
    is far less likely to stomp anything meaningful).

    Where, the Hunk allocator in this case could be functionally similar to
    the ones in Quake 2/3 (Quake 1 also used one).
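
    As a rough sketch of the Stack/Hunk split described above: a fixed-size
    region that grows upward, with mark/release semantics loosely in the
    style of the Quake hunk allocators. The names hunk_mark, hunk_alloc and
    hunk_release, the 4 KiB size, and the 16-byte rounding are assumptions
    made for the example, not taken from Quake or BGBCC.

```c
#include <stddef.h>

/* Hypothetical fixed-size hunk: arrays and structs would be carved out of
   this region instead of being placed on the call stack. */
#define HUNK_SIZE 4096u

static _Alignas(16) unsigned char hunk_mem[HUNK_SIZE];
static size_t hunk_top = 0;            /* grows upward from offset 0 */

/* Remember the current top so a function can release its locals on exit. */
static size_t hunk_mark(void) { return hunk_top; }

/* Allocate n bytes, rounded up to 16; returns NULL when the hunk is full. */
static void *hunk_alloc(size_t n)
{
    if (n > HUNK_SIZE) return NULL;    /* also keeps the rounding from wrapping */
    size_t rounded = (n + 15u) & ~(size_t)15u;
    if (rounded > HUNK_SIZE - hunk_top) return NULL;
    void *p = &hunk_mem[hunk_top];
    hunk_top += rounded;
    return p;
}

/* Release everything allocated since the matching hunk_mark(). */
static void hunk_release(size_t mark) { hunk_top = mark; }
```

    A function would call hunk_mark() on entry, hunk_alloc() for its buffers,
    and hunk_release() on exit. An overrun of such a buffer then runs upward
    into other hunk data rather than into saved registers or the return
    address.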


    Stack canaries can also help, but then compilers (and programmers) like
    to disable them for sake of the "usually fraction of a percent"
    performance overhead.

    Stack canaries are <unnecessary> added work to the instruction stream.


    They are pretty effective though against the specific case of
    out-of-bounds stomping saved registers or the return address...

    Also MSVC uses them by default (as does BGBCC), but GCC and Clang
    generally don't use them.


  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jun 22 00:24:07 2026
    From Newsgroup: comp.arch

    On Sun, 21 Jun 2026 14:52:05 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Do you have any example of non-horrible segmentation that provides
    memory safety? If not, do you have an idea what that would look like?

    - anton

    Nick McLaren used to say that he knows how to do segments right.
    But it was impossible to press him into providing even a minimally
    detailed description of his ideas.







  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Mon Jun 22 00:26:09 2026
    From Newsgroup: comp.arch

    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

      p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

         {
             p = (struct foo *) malloc(42 * sizeof(struct foo));
             fprintf( stream, "0x16,", p );
             ...
             if( fscanf( stream, "x16", q ) ) {
                 use q
             }
         }

    is q "derived" through p ??
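
    For reference, the fragment above can be written as compilable C roughly
    as follows. This is only a sketch: it uses a string buffer in place of the
    stream and the %p conversion in place of the "x16" shorthand, and
    round_trip is a made-up name. The C standard says a pointer printed with
    %p and read back during the same program execution compares equal to the
    original, which is what makes the "derived" question interesting.

```c
#include <stdio.h>
#include <stdlib.h>

struct foo { int x; };

/* Print p as text, read it back into q, and return q (NULL on failure).
   The text form is the only link between q and p. */
static struct foo *round_trip(struct foo *p)
{
    char text[64];
    void *q = NULL;

    snprintf(text, sizeof text, "%p", (void *)p);
    if (sscanf(text, "%p", &q) != 1)
        return NULL;
    return q;
}
```

    Whether an access through the returned pointer counts as an access
    "through pointers derived from p" is exactly the question being asked.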

    There is a discussion going on at the moment about "pointer providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...

    (Meanings of "providence" quoted from Wiktionary.)

  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:02:05 2026
    From Newsgroup: comp.arch


    Robert Swindells <rjs@fdy2.co.uk> posted:

    On Sun, 21 Jun 2026 19:02:32 GMT, MitchAlsup wrote:

    I was told that the prolog application on M88K was faster than competing RISC processors. I remember that SPECint XLISP and M88Ksim were higher performing than several other competitors. I was told that the bit-field extract instructions had a lot to do with that.

    XLisp doesn't use tags to encode types, it just has a C union of structs with a one byte field for the type. You can still find the source to the version used by SPECint. It doesn't provide a compiler.

    It isn't a useful benchmark for anyone who was interested in running Lisp back then.

    I'm not trying to defend SPARC and am happy to take your word for it that M88K was fast for the time.

    I think it is more appropriate to say the M88K was peaky--some things
    it did quite well, and others "not so much".
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:04:11 2026
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 6/21/2026 11:37 AM, MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    On 2026-Jun-20 15:07, John Levine wrote:

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    That is one bullet we dodged!!

    I doubt it. Several parallel strands of RISC research independently found
    that moving complexity from the hardware into the compiler made computers
    faster and cheaper. IBM's PL.8 compiler had excellent error checking even
    though it was originally targeted at the RISC 801, but somehow people always
    want to turn off the error checks in the production build of their code.
    I suspect that is because error checks were so badly designed.
    e.g. the x86 BOUND instruction costs more to set up than it saves
    because it requires 2 bounds to be set up in memory and then
    read every time.

    If checks are designed from a risc point of view
    they should have little to no runtime costs.

    For example, almost all arrays are 1-dimension, base-0 or base-1
    and most array bounds are constants, so one only needs to check,
    - for base-0 a single index unsigned < register or constant limit,
    - for base-1 a single index != 0 and unsigned <= register or constant limit.
    (It uses an unsigned compare because signed negative integers
    will be treated as large positive unsigned integers and fault.)

    Since the index will already be in a register, this is just a
    reg-reg or reg-imm compare and possibly fault.

    My 66000 has bounds checks built into the CMP instruction.
    C would use the CIN check (0<=Rindex<Rcomparand)
    Fortran would use the FIN check (0<Rindex<=Rcomparand)
    An advantage of condition-code-less comparisons.
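
    The two range checks can be modelled in C roughly like this. It is a
    sketch of the comparisons only, not of the CMP or CHK encodings of either
    ISA; the function names are hypothetical, and the limit is assumed to be
    no larger than INT64_MAX.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base-0 (C-style): valid when 0 <= index < limit.
   Casting to unsigned turns a negative index into a huge value,
   so one unsigned compare covers both ends of the range.
   Assumes limit <= INT64_MAX. */
static bool index_ok_base0(int64_t index, uint64_t limit)
{
    return (uint64_t)index < limit;
}

/* Base-1 (Fortran-style): valid when 0 < index <= limit.
   Assumes limit <= INT64_MAX. */
static bool index_ok_base1(int64_t index, uint64_t limit)
{
    return index != 0 && (uint64_t)index <= limit;
}
```

    In hardware the false case would fault (or set a testable bit) rather
    than return a value; the point is that each check is a single unsigned
    compare on a value that is already in a register.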

    Yes, although a perhaps minor quibble. You would use the compare
    followed presumably by a branch on bit instruction. I believe Eric's proposal would generate a fault if the comparison failed, so a single instruction versus two for your solution. I am not sure how much the
    extra instruction costs, but if it occurs on every array reference, it
    might be an issue.

    In the domain of GBOoO implementations, as long as the extra instruction
    does not add to the latency to a critical path, the only degradation is
    found in delta-ICache-performance.
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:23:04 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 6/21/2026 1:20 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000 ISA--except for the case where one wants to walk the stack back on a THROW() and its unstructured equivalent longjmp().

    In addition, code does not need to access a GOT entry and then call
    the address of an entry, one can LD directly into IP and at the
    same time deposit the return address where it can't be stomped.

    Only EXIT and RET can access the return address and 99% of the
    time it goes directly into IP.


    This requires a CPU that can deal with PUSH/POP mechanics in hardware.
    With a Link Register, HW doesn't need to deal with this.

    Consider the Push/Pop mechanics in HW compared to FMAC in HW--which
    do you think is easier ???

    Now consider 16 pushes in a row versus a single instruction that performs
    the same amount of work. Which one needs to translate an address more
    often, which one needs to AGEN more often, and which one can access the
    cache once for up to 8 registers ???

    Now given that the whole push sequence is dumped onto a MEMory unit,
    and the other FUs are available, how much easier is it to find work
    after the 16 pushes (or 1 ENTER) has been moved to the MEM FU ???
    ---------------------

    Then again, did see a video recently about a new interrupt-handling and system call mechanism for x86-64 (called FRED).

    And the big apparent change:
    Mostly makes SYSCALL behave like a normal interrupt, but drops the IDT
    in favor of BaseRegister + Disp and similar.


    So, seemingly:
    IDT:
    Push RIP and RFLAGS and similar;
    Jump to entry point loaded from IDT;
    Specific behavior depends on interrupt type, etc.
    SYSCALL:
    Copies RIP and RFLAGS to different registers;
    Jump to entry point from an MSR.
    New mechanism (FRED):
    Pushes stuff to stack again, but more stuff to the stack;
    Jump to fixed entry point with a per-category displacement;
    Stack contents are more consistent.


    Well, contrast the interrupt mechanism used in my ISA:
    Saves SR (flags) and PC and similar to special registers;
    Branches to VBR + Disp (category);
    Loads some mode-state from VBR;
    VBR can encode which ISA mode handles the interrupt (similar to LR).
    Mode flag causes SP and SSP to swap places in the decoder.
    Debatable, but stack-swapping avoids a bunch of PITA...

    And My 66000::

    The SVC instruction uses the 16-bit immediate to specify what service
    is being requested, and uses the SRC1 field as a 5-bit immediate to
    specify which application Registers are passed to the service routine.
    SVC pushes 1 DW (64-bits) onto the call stack. HW saves application thread.state and R[16:31] in permanent memory; and reads in thread.state
    for the service routine to be run.

    When control arrives, the registers are already present for the service
    routine to get about its business, interrupts were never disabled, it
    has a stack, and other useful pointers and local data in its VAS along
    with its ASID, ... SW needs to do nothing to get here.

    SVR undoes the state-changing control transfer. SVR uses the SRC1 field
    to specify how many supervisor registers are being copied back as one
    or more results {Linux uses 0, 1, 2 results}, clears the registers that
    are not reloaded R[2:15] while reloading the preserved registers.
    ------------------
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 00:23:57 2026
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.
    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

      p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

         {
             p = (struct foo *) malloc(42 * sizeof(struct foo));
             fprintf( stream, "0x16,", p );
             ...
             if( fscanf( stream, "x16", q ) ) {
                 use q
             }
         }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...

    Although many do .....

    (Meanings of "providence" quoted from Wiktionary.)

  • From Paul Clayton@paaronclayton@gmail.com to comp.arch on Sun Jun 21 20:49:38 2026
    From Newsgroup: comp.arch

    On 6/18/26 12:21 PM, Anton Ertl wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    On Wed, 17 Jun 2026 15:22:35 -0500, BGB wrote:

    An idle thought here is whether there is any "better" option than
    conventional register-machine designs.

    As an interesting thought experiment, let's assume that a vast
    amount of memory is available with access times better than
    SRAM (let's suppose 1-cycle for the purposes of this thread).

    Would registers even be needed in such an architecture?

    Registers in high-performance CPUs give you several benefits:

    1) The addresses are hard-coded in the instructions. This means that
    read access can start early, that dependencies (read-after-write,
    write-after-write, write-after-read) can be determined early and used
    for forwarding and for renaming registers, and for reducing port
    requirements.

    2) They have many read and write ports.

    3) Fast access time. Well, maybe. Thanks to 1) fast access time is
    actually not necessary, it just means that you need fewer forwarding
    paths.

    I would rather write that the three (typical) advantages of
    registers are:

    1) [Same] The address is hard-coded in the instruction.
    2) They are relatively few in number (typically less than 256) and
    smallish in storage.
    3) The access is "word" aligned (typically).

    Your number 2 derives mostly from my number 2. My number 2
    also helps with latency and access power. Code density may
    also be an advantage here, though variable length encoding
    could support diverse address sizes which could avoid explicit
    moves into the smaller address space (registers).

    My number 3 (alignment) helps with latency and access power at
    some cost in effective capacity. Partial reads do not seem that
    expensive (especially for operations where it would be practical
    to treat small operands as SIMD with the other lane(s)
    suppressed — carry suppression instead of alignment shifting).
    Partial writes introduce complications for out-of-order
    execution (with some similarity to conditional move).

    The direct addressing also facilitates use of compiler-assisted
    banking, though this may be limited to a renaming table (RAT) in
    an out-of-order implementation. (It _may_ be practical for a
    microarchitecture to support two banks based on the compiler-
    intended banking, i.e., some of the compiler's work may be
    useful. I suspect a greater degree of static banking even as
    hints would introduce utilization balance issues with out-of-
    order execution and even variable execution width.)

    As Mitch Alsup noted, a smaller address space also reduces the
    overhead of dependency tracking. (There may be some potential
    for exploiting spatial locality with accesses in a large address
    space, e.g., sharing most significant bits. It may also be
    possible to benefit from a conservative filter for dependency
    checking; such was proposed for Itanium's advanced loads.)

    Queue-based storage (like the Mill's Belt) further simplifies
    dependency tracking but introduces issues for long-lived values
    (they have to be "moved" to be preserved). One could provide two
    queues (one for longer-lived items) to reduce the number of
    preservation actions required.

    More persistent values might benefit from special handling since
    they would tend not to be retrieved by a forwarding path.

    There may also be benefits to special handling of values that
    are only slightly adjusted such as counters. If the modification
    is only dependent on the previous value and the instruction,
    including immediate, then reversal may be cheaper than storage.
    Even being able to share the storage of most significant bits
    could be useful.

    Completely avoiding named storage for temporary values that are
    known to be consumed immediately (effectively quasi-explicit
    instruction fusion) might be useful as well.

    Let's look at your thought experiment:

    Advantage 1 is missing. Some AMD64 implementations still manage to
    implement 0-cycle store-to-load-forwarding in many cases, but AFAIK
    not as reliably as for registers.

    Advantage 2 tends to be missing. E.g., the most extreme I have seen
    up to now is 3 reads and 2 writes per cycle, and IIRC <5 total memory
    accesses per cycle, on a machine that can do 8 or 10 instructions per
    cycle, i.e. at least 16 register reads and 8 register writes per cycle
    (maybe limited to less, but with advantage 1 mitigating that to some
    extent).

    A large storage area does reduce the overheads of banking.
    Subarrays are naturally used, so "only" routing overhead would
    be added.

    *General* memory accesses also introduce tag and permission
    check overhead. The former is a consequence of the huge address
    space (too large to store with cheap access — the proposal
    assumed small enough for fast access, so tags might be limited
    to something like ASID and such a direct-mapped cache can
    speculate on a hit and so cover tag checking latency).

    Since temporal locality is a major justification for memory
    hierarchy, a system designed for workloads lacking such might
    not have a substantial set of registers. (Spatial locality also
    justifies caching but does not typically apply to registers —
    software may sometimes load multiple values into a single
    register but that seems uncommon. Prefetching or other diverse
    ordering of accessing can also justify caching/buffering.)

    Advantage 3: What would single-cycle memory access mean for d=a+b+c? It
    would be compiled to

    t=b+c
    d=a+t

    With registers this has a latency of typically 2 cycles. With
    single-cycle memory access this typically has a latency of 6
    cycles.

    BTW, it's not just a thought experiment:

    A number of IA-64 implementations have had single-cycle D-cache
    access. It still had registers.

    A cache supports indirection and spatial locality and is
    expected to be managed primarily by hardware. I doubt anyone
    would try to implement a cache with 12 wordlines per entry,
    which Itanium 2 had (to provide 12 read and 8 write ports).

    If there is no indirection, such might be considered just a
    large register file. (Note that some stack cache proposals and
    other direct-mapped specialized caches such as the Knapsack
    cache could provide register-like access characteristics. The
    CRISP architecture specifically used a multi-ported stack cache
    for "registers"; this cache was accessed for ordinary memory
    accesses in the 32-word range cached, so some indirection was
    possible.)

    Processors like the 6502 and the 6809 have single-cycle memory access.
    They still have registers (actually, accumulators and index registers).

    - anton

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jun 22 06:28:16 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Swindells <rjs@fdy2.co.uk> posted:
    I was told that the prolog application on M88K was faster than competing
    RISC processors.

    I contributed to a Prolog implementation in 1990. We developed on DG
    Aviion machines that had an 88100 CPU, but wrote the Prolog system in
    C. It is based on the WAM, and IIRC we used the 4 most significant
    bits of every word for type tags. The bitfield instructions of the
    88k did not provide an advantage, because the top 4 bits can just as
    easily be accessed with shifts that every architecture has (and the C
    code used shifts).
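
    To illustrate the point about shifts, here is a sketch of a 32-bit tagged
    word with the type tag in the 4 most significant bits. The tag values and
    helper names are invented for the example and are not taken from the
    actual system.

```c
#include <stdint.h>

typedef uint32_t word;

enum { TAG_SHIFT = 28 };               /* 32-bit word, tag in bits 31..28 */
#define VALUE_MASK ((word)0x0FFFFFFFu)

/* Hypothetical tag assignments. */
enum { TAG_REF = 0x1, TAG_INT = 0x2, TAG_ATOM = 0x3 };

/* Combine a 4-bit tag with a 28-bit payload. */
static word make_word(unsigned tag, word value)
{
    return ((word)tag << TAG_SHIFT) | (value & VALUE_MASK);
}

/* One logical shift right, available on every architecture. */
static unsigned tag_of(word w)   { return (unsigned)(w >> TAG_SHIFT); }

/* Mask off the tag to get the 28-bit payload. */
static word     value_of(word w) { return w & VALUE_MASK; }
```

    With the tag at the top of the word, extracting it is a single shift, so
    a dedicated bit-field extract instruction has nothing left to save.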

    I also ran this system on a DecStation (MIPS CPU), and IIRC the
    DecStation was faster per MHz, and I attributed that to the larger
    caches of the DecStation, but the performance on the DecStation was
    pretty brittle (probably due to direct mapped caches). Inserting some
    code caused a 20% slowdown on a benchmark that did not execute the new
    code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Jun 22 10:05:25 2026
    From Newsgroup: comp.arch

    On 21/06/2026 23:26, Niklas Holsti wrote:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.
    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

      p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

         {
             p = (struct foo *) malloc(42 * sizeof(struct foo));
             fprintf( stream, "0x16,", p );
             ...
             if( fscanf( stream, "x16", q ) ) {
                 use q
             }
         }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...


    I thought that's how most C programming was done? :-)

    I probably rely too much on spell chequers - if there are no wiggly red
    lines, my post is ready to send!

    (Meanings of "providence" quoted from Wiktionary.)



  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Jun 22 10:44:42 2026
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.
    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

      p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

         {
             p = (struct foo *) malloc(42 * sizeof(struct foo));
             fprintf( stream, "0x16,", p );
             ...
             if( fscanf( stream, "x16", q ) ) {
                 use q
             }
         }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the "careful governance and guidance of God", or on an "instance of divine intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem? If so, "careful governance and guidance from God" may
    indeed be required.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Mon Jun 22 15:38:30 2026
    From Newsgroup: comp.arch

    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.
    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

      p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

         {
             p = (struct foo *) malloc(42 * sizeof(struct foo));
             fprintf( stream, "0x16,", p );
             ...
             if( fscanf( stream, "x16", q ) ) {
                 use q
             }
         }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem? If so, "careful governance and guidance from God" may
    indeed be required.

    I would assume it is undecidable, for unrestricted programs. The aim of pointer provenance is no doubt to restrict programs to make it decidable
    to some extent.

    I am reminded of the person, apparently very religious, who some decades
    ago posted to solicit help for reimplementing all of computing (gcc,
    GNU, et cetera) on Biblical principles, because he thought Richard
    Stallman was too atheistic and had tainted his products. I have not
    heard how that went.
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Jun 22 15:19:09 2026
    From Newsgroup: comp.arch

    On 22/06/2026 14:38, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed
    memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

       p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only
    access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

          {
              p = (struct foo *) malloc(42 * sizeof(struct foo));
              fprintf( stream, "0x16,", p );
              ...
              if( fscanf( stream, "x16", q ) ) {
                  use q
              }
          }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer providence"
    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem?  If so, "careful governance and guidance from God" may
    indeed be required.

    I would assume it is undecidable, for unrestricted programs. The aim of pointer provenance is no doubt to restrict programs to make it decidable
    to some extent.


    I am not sure if "pointer safety" has any specific defined meaning - but
    I strongly suspect you are right for any reasonable definition, with
    pointers as unrestricted as in C. But being undecidable does not mean equivalent to the halting problem - after all, Turing machines only have
    to cope with defined behaviour, while C programs can have undefined
    behaviour. That may mean that even if you have an oracle that solves
    the halting problem, it still could not be used in an algorithm to
    determine the pointer safety of a given C program.

    I am reminded of the person, apparently very religious, who some decades
    ago posted to solicit help for reimplementing all of computing (gcc,
    GNU, et cetera) on Biblical principles, because he thought Richard
    Stallman was too atheistic and had tainted his products. I have not
    heard how that went.

    I believe I know which person you are referring to, and have not heard
    from him for a long time. But I can't say I have made any effort to
    contact him after he stopped posting in comp.lang.c. I hope he got the
    help he /really/ needed, or at least found somewhere where he could be
    happily crazy rather than in constant conflict with reality and everyone
    he was in contact with.


  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jun 22 09:26:15 2026
    From Newsgroup: comp.arch

    On 2026-Jun-21 16:11, Stephen Fuld wrote:
    On 6/21/2026 11:37 AM, MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    On 2026-Jun-20 15:07, John Levine wrote:

    or Pascal, had become the "standard", we might have ended up with
    computers like the Burroughs B6500 or the Intel 432.

    That is one bullet we dodged!!

    I doubt it.  Several parallel strands of RISC research independently found
    that moving complexity from the hardware into the compiler made computers
    faster and cheaper.  IBM's PL.8 compiler had excellent error checking even
    though it was originally targeted at the RISC 801, but somehow people always
    want to turn off the error checks in the production build of their code.
    I suspect that is because error checks were so badly designed.
    e.g. the x86 BOUND instruction costs more to set up than it saves
    because it requires 2 bounds to be set up in memory and then
    read every time.

    If checks are designed from a risc point of view
    they should have little to no runtime costs.

    For example, almost all arrays are 1-dimension, base-0 or base-1
    and most array bounds are constants, so one only needs to check,
    - for base-0 a single index unsigned < register or constant limit,
    - for base-1 a single index != 0 and unsigned <= register or constant limit.
    (It uses an unsigned compare because signed negative integers
    will be treated as large positive unsigned integers and fault.)

    Since the index will already be in a register, this is just a
    reg-reg or reg-imm compare and possibly fault.

    My 66000 has bounds checks built into the CMP instruction.
    C       would use the CIN check (0<=Rindex<Rcomparand)
    Fortran would use the FIN check (0<Rindex<=Rcomparand)
    An advantage of condition-code-less comparisons.

    Yes, although  a perhaps minor quibble.  You would use the compare followed presumably by a branch on bit instruction.  I believe Eric's proposal would generate a fault if the comparison failed, so a single instruction versus two for your solution.  I am not sure how much the extra instruction costs, but if it occurs on every array reference, it might be an issue.

    Yes, CHKcc has a general set of condition codes to test like CMPcc
    and generates a fault exception if the test fails.
    Plus the special CC for handling base-1 array has the "and index != 0".
    These could be used for all kinds of range checks, not just array indexes.

    (A fault exception is precise and leaves the instruction pointer
    pointing at the instruction that triggered the exception.
    That combined with a stack trace-back, ideally with source routine
    names and line numbers, facilitates problem diagnosis.
    Yes, I do miss those VMS exception trace-back stacks.)

    There is also CHKVcc to check a single value with tests same as
    BRcc reg, offset but which generates a fault exception if the test fails.
    For example, in a checked language, a cast of a signed integer to unsigned could check that the value was >= 0.

    I also have integer down-size conversion check instructions.
    An unchecked conversion of int64 to int8 would truncate to 8 bits,
    but a checked conversion tests if the high order bits [63:8] are
    all the same as the sign bit [7], and throws an overflow exception if not.
    This ensures you are not trying to put 10 pounds into a 5 pound bag.
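    As a rough C sketch of that down-size check (names are hypothetical, and
    a failed check returns false here instead of throwing an overflow
    exception as the instruction would):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if bits [63:7] of v are all equal, which is the same as saying
 * that bits [63:8] all match the sign bit [7] of the truncated value. */
static bool fits_in_int8(int64_t v)
{
    uint64_t high = (uint64_t)v >> 7;        /* the 57 bits [63:7] */
    return high == 0 || high == (UINT64_MAX >> 7);
}

/* Checked int64 -> int8 conversion: stores the result and returns true
 * only when no information would be lost. */
static bool checked_narrow_to_int8(int64_t v, int8_t *out)
{
    if (!fits_in_int8(v))
        return false;
    *out = (int8_t)v;                        /* value-preserving here */
    return true;
}
```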

    In addition to the usual modulo arithmetic and shift instructions
    there are ones that check for signed and unsigned integer overflows
    and throw an exception if there is.
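    For comparison, this is roughly what the same checks cost in portable C
    when the hardware offers no help. It is a sketch with invented names,
    returning false where a checking instruction would throw.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed add: test the operands before adding, because signed overflow
 * itself is undefined behavior in C. */
static bool checked_add_i64(int64_t a, int64_t b, int64_t *sum)
{
    if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b))
        return false;
    *sum = a + b;
    return true;
}

/* Unsigned add: the add wraps modulo 2^64, and a wrapped result is
 * always smaller than either operand. */
static bool checked_add_u64(uint64_t a, uint64_t b, uint64_t *sum)
{
    *sum = a + b;
    return *sum >= a;
}
```

    The signed version needs several compares and branches per add, which
    is the overhead a checking add instruction would remove.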

    All of this makes the cost of verifying the original code's
    design assumptions at runtime nearly or actually zero.
    Programmers don't have to use these but it makes their lives easier
    to have code debug itself, and removes the "cost too much" excuse.



  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 22 07:59:12 2026
    From Newsgroup: comp.arch

    On 6/22/2026 3:44 AM, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

      p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

         {
             p = (struct foo *) malloc(42 * sizeof(struct foo));
             fprintf( stream, "0x16,", p );
             ...
             if( fscanf( stream, "x16", q ) ) {
                 use q
             }
         }

    is q "derived" through p ??
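    The snippet above seems to have lost its format specifiers somewhere
    along the way. A runnable version of the same idea, assuming "%p" was
    the intent and using a string buffer in place of the stream, might look
    like this (the function name and struct layout are made up for the
    example):

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct foo { int x; };

/* Print a pointer as text, read it back, and use the result. */
static bool pointer_survives_text_round_trip(void)
{
    struct foo *p = malloc(42 * sizeof *p);
    if (p == NULL)
        return false;

    char text[64];
    snprintf(text, sizeof text, "%p", (void *)p);

    /* C says a %p value printed earlier in the same program execution
     * reads back as a pointer that compares equal to the original. */
    void *raw = NULL;
    bool same = sscanf(text, "%p", &raw) == 1 && raw == (void *)p;

    if (same) {
        struct foo *q = raw;
        q[0].x = 7;                /* store through q ... */
        same = (p[0].x == 7);      /* ... and observe it through p */
    }
    free(p);
    return same;
}
```

    The code only shows that q compares equal to p and reaches the same
    storage. Whether q counts as "derived" from p for the optimizer is
    exactly the open question.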

    There is a discussion going on at the moment about "pointer providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem? If so, "careful governance and guidance from God" may
    indeed be required.

    I don't know the answer to your question, but presumably we can do
    better than C does. Isn't that one of the, at least claimed, advantages
    of Rust, and perhaps even Ada? Also, I believe that had the originators
    of C not allowed arithmetic on pointers (comparisons for equality would
    still be allowed, and array addressing would have to use subscripts)
    many of the problems with C pointers wouldn't have occurred. Of course,
    that horse has left the barn a long time ago.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Andy Valencia@vandys@vsta.org to comp.arch on Mon Jun 22 08:33:29 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Actually IA-32 (since the 486) and AMD64 have a bit for turning on
    unaligned traps, but unfortunately there is too much software in
    libraries that performs unaligned accesses.

    Interesting... what's this bit called? In which register does
    it live?

    Thanks!

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Mon Jun 22 19:26:03 2026
    From Newsgroup: comp.arch

    On 2026-06-22 17:59, Stephen Fuld wrote:
    On 6/22/2026 3:44 AM, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed
    memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

       p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only
    access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

          {
              p = (struct foo *) malloc(42 * sizeof(struct foo));
              fprintf( stream, "0x16,", p );
              ...
              if( fscanf( stream, "x16", q ) ) {
                  use q
              }
          }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem?  If so, "careful governance and guidance from God" may
    indeed be required.

    I don't know the answer to your question, but presumably we can do
    better than C does.  Isn't that one of the, at least claimed, advantages
    of Rust, and perhaps even Ada?

    Both Rust and Ada have to be restricted in certain ways in order to
    ensure absence of pointer errors: Rust has to avoid "unsafe" code, and
    Ada has to avoid pointer-related "unchecked" constructs and certain
    undefined behavior (which does exist in Ada, but less so than in C). The
    Ada subset called SPARK, together with its proof tools, is meant for
    such programming, and has a feature similar to Rust "ownership" though standard Ada does not.

     Also, I believe that had the originators
    of C not allowed arithmetic on pointers (comparisons for equality would still be allowed, and array addressing would have to use subscripts)
    many of the problems with C pointers wouldn't have occurred.  Of course, that horse has left the barn a long time ago.

    I recently helped to debug an Ada program that now and then, but not
    often, was overwriting some buffers. At one point in that program I had *cough* used pointer arithmetic *blush* instead of array indexing, for
    what I felt were good reasons at the time. But it bit me. An amusing
    clue to the error was that the bug happened more often when the
    satellite running the program was above Russia's borders. Perhaps you
    can guess reasons for that :-)

  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jun 22 09:50:16 2026
    From Newsgroup: comp.arch

    On 6/22/2026 9:26 AM, Niklas Holsti wrote:
    On 2026-06-22 17:59, Stephen Fuld wrote:
    On 6/22/2026 3:44 AM, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed
    memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

       p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only
    access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

          {
              p = (struct foo *) malloc(42 * sizeof(struct foo));
              fprintf( stream, "0x16,", p );
              ...
              if( fscanf( stream, "x16", q ) ) {
                  use q
              }
          }

    is q "derived" through p ??

    There is a discussion going on at the moment about "pointer
    providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem?  If so, "careful governance and guidance from God" may
    indeed be required.

    I don't know the answer to your question, but presumably we can do
    better than C does.  Isn't that one of the, at least claimed,
    advantages of Rust, and perhaps even Ada?

    Both Rust and Ada have to be restricted in certain ways in order to
    ensure absence of pointer errors: Rust has to avoid "unsafe" code,

    I like Rust's solution. You can do unsafe things - sometimes they are
    just necessary - but they are not the default way of doing things, and
    you have to notate them in the source code which serves to discourage
    them and points people debugging errors to certain areas of the code
    that are more likely to be problematic.

    and
    Ada has to avoid pointer-related "unchecked" constructs and certain undefined behavior (which does exist in Ada, but less so than in C). The
    Ada subset called SPARK, together with its proof tools, is meant for
    such programming, and has a feature similar to Rust "ownership" though standard Ada does not.

    Is programming under SPARK rules significantly harder than under
    nonSPARK Ada?


     Also, I believe that had the originators of C not allowed arithmetic
    on pointers (comparisons for equality would still be allowed, and
    array addressing would have to use subscripts) many of the problems
    with C pointers wouldn't have occurred.  Of course, that horse has
    left the barn a long time ago.

    I recently helped to debug an Ada program that now and then, but not
    often, was overwriting some buffers. At one point in that program I had *cough* used pointer arithmetic *blush* instead of array indexing, for
    what I felt were good reasons at the time. But it bit me. An amusing
    clue to the error was that the bug happened more often when the
    satellite running the program was above Russia's borders. Perhaps you
    can guess reasons for that :-)

    Interesting. Perhaps it is because Russia has less "careful governance
    and guidance from God" :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From John Levine@johnl@taugh.com to comp.arch on Mon Jun 22 17:10:45 2026
    From Newsgroup: comp.arch

    According to Andy Valencia <vandys@vsta.org>:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Actually IA-32 (since the 486) and AMD64 have a bit for turning on
    unaligned traps, but unfortunately there is too much software in
    libraries that performs unaligned accesses.

    Interesting... what's this bit called? In which register does
    it live?

    A quick grep of the Intel® 64 and IA-32 Architectures Software
    Developer’s Manual finds the AC bit in the EFLAGS register. I happen
    to have an old 486 manual and it's there, too.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jun 22 17:09:25 2026
    From Newsgroup: comp.arch

    Andy Valencia <vandys@vsta.org> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Actually IA-32 (since the 486) and AMD64 have a bit for turning on
    unaligned traps, but unfortunately there is too much software in
    libraries that performs unaligned accesses.

    Interesting... what's this bit called? In which register does
    it live?

    The bit is called "alignment check (AC)" and is bit 18 of EFLAGS.
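    For anyone who wants to test for it in a saved flags image, a small C
    sketch follows. The macro and function names are invented for this
    example, and the helpers only manipulate a copy of EFLAGS held in a
    variable; they do not touch the live register.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_AC_BIT   18u
#define EFLAGS_AC_MASK  (UINT32_C(1) << EFLAGS_AC_BIT)   /* 0x00040000 */

/* Report whether alignment check is requested in a saved EFLAGS value. */
static bool eflags_ac_is_set(uint32_t eflags)
{
    return (eflags & EFLAGS_AC_MASK) != 0;
}

/* Return a copy of the saved EFLAGS value with AC set or cleared. */
static uint32_t eflags_with_ac(uint32_t eflags, bool enable)
{
    return enable ? (eflags | EFLAGS_AC_MASK)
                  : (eflags & ~EFLAGS_AC_MASK);
}
```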

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jun 22 17:47:02 2026
    From Newsgroup: comp.arch


    Andy Valencia <vandys@vsta.org> posted:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Actually IA-32 (since the 486) and AMD64 have a bit for turning on unaligned traps, but unfortunately there is too much software in
    libraries that performs unaligned accesses.

    Even My 66000 has a bit to turn on misaligned checks {for debugging}.
    Added late last year.

    Interesting... what's this bit called? In which register does
    it live?

    Thanks!

    Andy Valencia
    Home page: https://www.vsta.org/andy/

  • From George Neuner@gneuner2@comcast.net to comp.arch on Mon Jun 22 18:49:40 2026
    From Newsgroup: comp.arch

    On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
    <monnier@iro.umontreal.ca> wrote:

    Robert Swindells [2026-06-19 11:20:10] wrote:
    On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:
    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp), and
    SPARC contains some support for that, but as one of the IIRC Franz Lisp
    developers has explained in this newsgroup, they actually did not use
    this feature, because the performance benefit was not big enough to
    [...]
    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    I guess you two aren't talking about the same "Franz Lisp".
    AFAIK Anton is referring to the commercial Common Lisp compiler
    associated with the Franz Inc company, marketed under the name
    "Allegro".

    === Stefan

    ISTM there were at least a couple of Lisps available for the Vax. I
    can't speak to Franz, but I do know at least one Vax Lisp was a
    BIBOP[1] system that (generally) did not use tags.


    In BIBOP, memory "pages"[2] are dedicated to a single data type. The
    base address of the page is mapped to the type of the objects the page contains, and so the objects (and pointers to them) need no type
    information themselves. This allowed for full width pointers, fixnums
    and floats, and for conses, boxes, and other fixed sized data types
    (including user types) to avoid tagging.

    Obviously there has to be a way to identify which "slots" on each page
    are in use. This typically is done using a bitmap kept with the type information. If a page becomes empty it can be repurposed to host
    another type.

    Also large and/or variably sized objects that don't fit into a single
    page or do not cleanly divide the chosen page size had to be handled separately. These objects need to be tagged (though their pointers do
    not) and allocated from a general heap.
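    A minimal C sketch of the type lookup described above, under some
    assumptions of my own: 4 KiB software-defined pages, a small static
    heap, and made-up type and function names. It leaves out the per-page
    slot bitmap and the separate heap for large objects.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                          /* assumed 4 KiB pages */
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)
#define NUM_PAGES  16u

enum obj_type { TYPE_FREE = 0, TYPE_CONS, TYPE_FLONUM, TYPE_BOX };

static unsigned char heap[NUM_PAGES * PAGE_SIZE];
static enum obj_type page_type[NUM_PAGES];      /* all TYPE_FREE at start */

/* Dedicate the first free page to `type`; returns its base, or NULL
 * when every page is already in use. */
static void *bibop_claim_page(enum obj_type type)
{
    for (size_t i = 0; i < NUM_PAGES; i++) {
        if (page_type[i] == TYPE_FREE) {
            page_type[i] = type;
            return heap + i * PAGE_SIZE;
        }
    }
    return NULL;
}

/* The type comes from the address alone: no tag bits in the pointer
 * and no header in the object.  `obj` must point into `heap`. */
static enum obj_type bibop_type_of(const void *obj)
{
    size_t offset = (size_t)((const unsigned char *)obj - heap);
    return page_type[offset >> PAGE_SHIFT];
}
```

    Repurposing an empty page for another type is then just a matter of
    rewriting its page_type entry.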


    [1] BIg Bag Of Pages
    [2] BIBoP "pages" could be actual VMM pages, or just same-sized
    contiguous blocks defined by software.
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jun 22 08:46:55 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-06-22 00:02:05] wrote:
    I think it is more appropriate to say the M88K was peaky--some things
    it did quite well, and others "not so much".

    I'd be interested to hear of the cases where it shone and the cases
    where it had more difficulty. Same for other architectures of the time.


    === Stefan
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jun 22 09:02:11 2026
    From Newsgroup: comp.arch

    BGB [2026-06-21 16:22:43] wrote:
    This requires a CPU that can deal with PUSH/POP mechanics in hardware.
    With a Link Register, HW doesn't need to deal with this.

    I don't think that's a very useful way to look at it: it's very easy
    for a CPU to handle PUSH/POP in hardware (as evidenced by the fact that
    many early CPUs did it).

    I think the downside of "hardware-managed stack" is not the hardware
    cost but the fact that it may not quite fit the needs of the compiler
    (e.g. it may be tricky to use if your compiler uses heap allocation for
    the stack frames).

    IOW, you need to make sure your CPU can be used efficiently without
    using the hardware-managed-stack, which should be easy to do, since it
    doesn't require much more than "jump-and-link" (all the rest is
    standard LD/ST/ADD/SUB).


    === Stefan
  • From Andy Valencia@vandys@vsta.org to comp.arch on Mon Jun 22 16:28:17 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    A quick grep of the Intel® 64 and IA-32 Architectures Software
    Developer's Manual finds the AC bit in the EFLAGS register. I happen
    to have an old 486 manual and it's there, too.

    Thank you! I read with interest its interaction with SMEP/SMAP as well.

    I have been out of the kernel game for many dog-years.

    -- Andy
  • From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 01:13:13 2026
    From Newsgroup: comp.arch

    On 6/21/2026 2:56 PM, Robert Swindells wrote:
    On Sun, 21 Jun 2026 13:55:59 -0500, BGB wrote:

    Though, I guess one merit of a Lisp like language is that it is a lot
    easier to parse, and it could be possible to implement a fairly cheap
    compiler for it (in the basic case).

    Usual downside is that the excessive parentheses tend to turn into a
    usability issue.

    You use an editor that keeps track of them.


    Probably.
    The main editor I use on Windows, Notepad2, has syntax highlighting and parenthesis matching.

    Normal Notepad does not.

    Though, would seem that these features have become fairly common in text-editors in Linux land.


    One other major hassle was typically a lack of C style loops (with break
    or continue), but this could be addressed in theory.

    It is addressed in practice.

    You could run SBCL on your CPU, it has a RISC-V backend to the compiler.

    Probably, would need to look into it.

    But, yeah, if it could target RV64G or similar and doesn't have too many platform dependencies, could work.

    ...




    Well, even with the annoyance of being unable to self-host with my C
    compiler due to RAM footprint (ideally, would want it to be able to
    compile something like Doom, or itself, in under around 30MB of RAM).

    As-is, compiling Doom in BGBCC needs ~ 100 MB (fiddled with it some
    more and got the footprint down a little more).


    Looks like main things eating RAM are (according to some internal
    mem-use stats via BGBCC's allocator, descending ranking):
    Register Info structs (used for describing variables), ~ 12MB;
    VirtOp structures (used for 3AC ops), ~ 9MB;
    Arrays of Pointers to RegInfo, ~ 7MB;
    Section Buffers, ~ 6MB;
    AST nodes, ~4MB;
    ...

    The internal memory manager lists around 64MB used, but Visual Studio
    says 100MB, though currently the internal stats don't list allocator
    overheads or the amount of memory eaten by the big interned-string table.

    Checking, string table is currently around 50K strings and using around
    920K of RAM, so an average of around 18 bytes per string.


    RegInfo's are using ~ 12MB, each struct is 320 bytes ATM, so around 40K RegInfo structs...

    Struct is kinda bulky as it has the combined fields needed to express variables, functions, structs/unions, ...

    It has a lot of pointers, difficult to avoid in this case. Did end up
    moving some of the string members from raw pointers to interned string indices, which maybe at least saved something...

    Checking:
    ~ 6K global declarations
    ~ 9K structs / typedefs / array-initializers / ...
    ~ 25K internal (non-global) variables
    Divided as some combination of: locals, args, struct fields, ...

    Of which, 1200 are functions that make it into the final binary.


    Looks like AST nodes aren't being too unreasonable ATM (~ 4MB) but these
    get reused between TUs.

    As for speed:
    ~ 14s with debug dumping;
    ~ 9s with no debug dumping;
    ~ 6s if I compile the compiler with optimizations enabled.

    ...

    Checking:
    ~ 5MB for .text
    ~ 300K for data+bss.


    Granted, maybe still moderately fast and lightweight by modern desktop
    PC standards...


    Though unclear how I could significantly reduce the memory footprint of
    the compiler if still doing whole-program builds, as it would appear a
    few significant memory consumers are things I would need even if I were
    doing the 3AC translation incrementally.

    Like, would likely still need to build a view of the global toplevel
    (and only real way to make the struct smaller would be to try to
    separate off the members needed for structs/functions from those needed
    for plain variables).

    ...


  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jun 23 05:54:47 2026
    From Newsgroup: comp.arch

    Andy Valencia <vandys@vsta.org> writes:
    I read with interest its interaction with SMEP/SMAP as well.

    That's gross. Because it's so cumbersome to change the SMAP bit
    (which prevents access to user pages from supervisor mode),
    they repurposed the AC flag to mean, in supervisor mode, that
    user-memory accesses are allowed even if the SMAP bit is set. The AC
    bit is easy to set with the STAC and clear with the CLAC instruction
    (these instructions do not exist in my Pentium manual, so they were
    added later). I guess that a set AC does not trap unaligned accesses
    in supervisor mode.

    I have been out of the kernel game for many dog-years.

    The alignment trap functionality can be switched on or off from user
    mode, and apparently, after reading the stuff about SMAP, I expect
    that this functionality exists only in user mode.

    One interesting aspect is that at some point, instructions for setting
    and clearing AC were added, making that operation look cheap, but they
    actually are serializing instructions on AMD processors up to and
    including Zen4, and probably also on some or all Intel processors.
    Thinking about the fact that the bit enables or disables trapping some conditions, yes, that's a way to implement these instructions.
    According to <https://github.com/advisories/ghsa-mhcq-hvgj-xm2h>, Zen5
    renames AC, which would mean that every memory access gets the renamed
    value of the AC flag as input.

    Another implementation that comes to my mind would be to implement
    STAC and CLAC purely in the front end without serialization, and
    provide the value of the AC bit as extra bit in every instruction
    coming out of the decoder; POPFD and possibly other instructions that
    change AC would be serializing, however.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Jun 23 10:34:44 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000 ISA--except for the case where one wants to walk the stack back on a
    THROW() and its unstructured equivalent longjmp().

    What about a debugging stack trace?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Tue Jun 23 16:37:58 2026
    From Newsgroup: comp.arch

    On 2026-06-22 19:50, Stephen Fuld wrote:
    On 6/22/2026 9:26 AM, Niklas Holsti wrote:
    On 2026-06-22 17:59, Stephen Fuld wrote:
    On 6/22/2026 3:44 AM, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:

    [snip]

    There is a discussion going on at the moment about "pointer
    providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem?  If so, "careful governance and guidance from God" may
    indeed be required.

    I don't know the answer to your question, but presumably we can do
    better than C does.  Isn't that one of the, at least claimed,
    advantages of Rust, and perhaps even Ada?

    Both Rust and Ada have to be restricted in certain ways in order to
    ensure absence of pointer errors: Rust has to avoid "unsafe" code,

    I like Rust's solution.  You can do unsafe things - sometimes they are
    just necessary - but they are not the default way of doing things, and
    you have to notate them in the source code which serves to discourage
    them and points people debugging errors to certain areas of the code
    that are more likely to be problematic.

    Same in Ada, mostly: some unsafe things are named "Unchecked_Xxx",
    others are available only if some specific predefined packages are used,
    which are not needed for most safe things.

    and Ada has to avoid pointer-related "unchecked" constructs and
    certain undefined behavior (which does exist in Ada, but less so than
    in C). The Ada subset called SPARK, together with its proof tools, is
    meant for such programming, and has a feature similar to Rust
    "ownership" though standard Ada does not.

    Is programming under SPARK rules significantly harder than under
    nonSPARK Ada?

    I don't have personal experience, but my impression is that it does not
    make it markedly harder than the usual restrictions on embedded,
    more-or-less critical software do. SPARK is defined and supported by the AdaCore company, not a standards group, and is evolving. The
    documentation is at https://www.adacore.com/documentation?tab=spark; the
    main restrictions are (quoted from https://docs.adacore.com/live/wave/spark2014/html/spark2014_rm/introduction.html#principal-language-restrictions,
    with my comments in []):

    --- quote:

    To facilitate formal analyses and verification, SPARK enforces a number
    of global restrictions to Ada. While these are covered in more detail in
    the remaining chapters of this document, the most notable restrictions are:

    - Restrictions on the use of access types and values [pointers], similar
    in some ways to the ownership model of the programming language Rust.

    - All expressions (including function calls) are free of side effects.

    - Aliasing of names is not permitted in general but the renaming of
    entities is permitted as there is a static relationship between the two
    names. In analysis all names introduced by a renaming declaration are
    replaced by the name of the renamed entity. This replacement is applied recursively when there are multiple renames of an entity.

    - Backward goto statements are not permitted.

    - The use of controlled types is not currently permitted. [These are
    types with automatic invocation of user-defined initialization and finalization operations on object creation, copying, and deletion.]

    - Tasks and protected objects are permitted only if the Ravenscar
    profile (or the Jorvik profile) is specified. [The main limitation in
    these profiles is that the set of tasks (threads) is static, no task
    ever terminates, and inter-task communication is by protected objects (monitors, synchronized objects) and not by rendez-vous.]

    - Raising and handling of exceptions is not currently permitted
    (exceptions can be included in a program but proof must be used to show
    that they cannot be raised).

    --- end quote.

     Also, I believe that had the originators of C not allowed arithmetic
    on pointers (comparisons for equality would still be allowed, and
    array addressing would have to use subscripts) many of the problems
    with C pointers wouldn't have occurred.  Of course, that horse has
    left the barn a long time ago.

    I recently helped to debug an Ada program that now and then, but not
    often, was overwriting some buffers. At one point in that program I
    had *cough* used pointer arithmetic *blush* instead of array indexing,
    for what I felt were good reasons at the time. But it bit me. An
    amusing clue to the error was that the bug happened more often when
    the satellite running the program was above Russia's borders. Perhaps
    you can guess reasons for that :-)

    Interesting.  Perhaps it is because Russia has less "careful governance
    and guidance from God" :-)

    One could indeed say so, because the reason is Putin's attack on
    Ukraine, as you may have guessed.

    The Ada program runs a satellite-based GNSS receiver that acquires
    (finds) and then tracks GNSS signals from GNSS satellites (GPS, Galileo,
    and others) as those satellites rise or set. The purpose is to measure atmospheric properties from the way the atmosphere refracts the signal.

    The design and/or coding error was in the transition between two stages
    of the multi-stage procedure for finding and starting to track a GNSS
    signal from a GNSS satellite.

    So then: Russia attacks Ukraine => Ukraine defends itself with
    long-distance drones => Russia jams and perturbs GNSS signals along its borders => the satellite software often loses track of a signal it is
    tracking => the satellite software often has to re-acquire signals =>
    the bug manifests more often over Russia's borders.

    If one favours the Ukrainian Orthodox church, which objects to this war, Russia is going against God's guidance. If one favours the Russian
    Orthodox church, which blesses this war, Russia is following God's guidance.

    (The bug was not found in testing because it did not manifest on every transition between the two acquisition stages -- it manifested only when
    two other dynamic program states occurred together, at the same time as
    the transition, and one of these states is rather rare, at least in test conditions.)

  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Jun 23 14:32:22 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 6/21/2026 2:56 PM, Robert Swindells wrote:
    On Sun, 21 Jun 2026 13:55:59 -0500, BGB wrote:

    Though, I guess one merit of a Lisp like language is that it is a lot
    easier to parse, and it could be possible to implement a fairly cheap
    compiler for it (in the basic case).

    The usual downside is that the excessive parentheses tend to turn into a
    usability issue.

    You use an editor that keeps track of them.


    Probably.
    The main editor I use on Windows, Notepad2, has syntax highlighting and
    parenthesis matching.

    Normal Notepad does not.

    Though, it would seem that these features have become fairly common in
    text editors in Linux land.

    That feature has been common in Unix and Linux land for close to three decades.

    One might even note that color syntax highlighting predates
    Windows completely in one form or another (e.g., the 1969 Emily editor).



  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Jun 23 07:37:12 2026
    From Newsgroup: comp.arch

    On 6/23/2026 6:37 AM, Niklas Holsti wrote:
    On 2026-06-22 19:50, Stephen Fuld wrote:
    On 6/22/2026 9:26 AM, Niklas Holsti wrote:
    On 2026-06-22 17:59, Stephen Fuld wrote:
    On 6/22/2026 3:44 AM, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:

       [snip]

    There is a discussion going on at the moment about "pointer
    providence"

    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem?  If so, "careful governance and guidance from God" may
    indeed be required.

    I don't know the answer to your question, but presumably we can do
    better than C does.  Isn't that one of the, at least claimed,
    advantages of Rust, and perhaps even Ada?

    Both Rust and Ada have to be restricted in certain ways in order to
    ensure absence of pointer errors: Rust has to avoid "unsafe" code,

    I like Rust's solution.  You can do unsafe things - sometimes they are
    just necessary - but they are not the default way of doing things, and
    you have to notate them in the source code, which serves to discourage
    them and points people debugging errors to certain areas of the code
    that are more likely to be problematic.

    Same in Ada, mostly: some unsafe things are named "Unchecked_Xxx",
    others are available only if some specific predefined packages are used, which are not needed for most safe things.

    and Ada has to avoid pointer-related "unchecked" constructs and
    certain undefined behavior (which does exist in Ada, but less so than
    in C). The Ada subset called SPARK, together with its proof tools, is
    meant for such programming, and has a feature similar to Rust
    "ownership" though standard Ada does not.

    Is programming under SPARK rules significantly harder than under
    non-SPARK Ada?

    I don't have personal experience, but my impression is that it does not
    make it markedly harder than the usual restrictions on embedded,
    more-or-less critical software do. SPARK is defined and supported by the
    AdaCore company, not a standards group, and is evolving. The
    documentation is at https://www.adacore.com/documentation?tab=spark; the
    main restrictions are (quoted from
    https://docs.adacore.com/live/wave/spark2014/html/spark2014_rm/introduction.html#principal-language-restrictions,
    with my comments in []):

    --- quote:

    To facilitate formal analyses and verification, SPARK enforces a number
    of global restrictions to Ada. While these are covered in more detail in
    the remaining chapters of this document, the most notable restrictions are:

    - Restrictions on the use of access types and values [pointers], similar
    in some ways to the ownership model of the programming language Rust.

    - All expressions (including function calls) are free of side effects.

    - Aliasing of names is not permitted in general but the renaming of
    entities is permitted as there is a static relationship between the two names. In analysis all names introduced by a renaming declaration are replaced by the name of the renamed entity. This replacement is applied recursively when there are multiple renames of an entity.

    - Backward goto statements are not permitted.

    - The use of controlled types is not currently permitted. [These are
    types with automatic invocation of user-defined initialization and finalization operations on object creation, copying, and deletion.]

    - Tasks and protected objects are permitted only if the Ravenscar
    profile (or the Jorvik profile) is specified. [The main limitation in
    these profiles is that the set of tasks (threads) is static, no task
    ever terminates, and inter-task communication is by protected objects (monitors, synchronized objects) and not by rendez-vous.]

    - Raising and handling of exceptions is not currently permitted
    (exceptions can be included in a program but proof must be used to show
    that they cannot be raised).

    --- end quote.

     Also, I believe that had the originators of C not allowed
    arithmetic on pointers (comparisons for equality would still be
    allowed, and array addressing would have to use subscripts) many of
    the problems with C pointers wouldn't have occurred.  Of course,
    that horse has left the barn a long time ago.

    I recently helped to debug an Ada program that now and then, but not
    often, was overwriting some buffers. At one point in that program I
    had *cough* used pointer arithmetic *blush* instead of array
    indexing, for what I felt were good reasons at the time. But it bit
    me. An amusing clue to the error was that the bug happened more often
    when the satellite running the program was above Russia's borders.
    Perhaps you can guess reasons for that :-)

    Interesting.  Perhaps it is because Russia has less "careful
    governance and guidance from God" :-)
    One could indeed say so, because the reason is Putin's attack on
    Ukraine, as you may have guessed.

    The Ada program runs a satellite-based GNSS receiver that acquires
    (finds) and then tracks GNSS signals from GNSS satellites (GPS, Galileo,
    and others) as those satellites rise or set. The purpose is to measure atmospheric properties from the way the atmosphere refracts the signal.

    The design and/or coding error was in the transition between two stages
    of the multi-stage procedure for finding and starting to track a GNSS
    signal from a GNSS satellite.

    So then: Russia attacks Ukraine => Ukraine defends itself with
    long-distance drones => Russia jams and perturbs GNSS signals along its
    borders => the satellite software often loses track of a signal it is
    tracking => the satellite software often has to re-acquire signals =>
    the bug manifests more often over Russia's borders.

    If one favours the Ukrainian Orthodox church, which objects to this war, Russia is going against God's guidance. If one favours the Russian
    Orthodox church, which blesses this war, Russia is following God's
    guidance.

    (The bug was not found in testing because it did not manifest on every transition between the two acquisition stages -- it manifested only when
    two other dynamic program states occurred together, at the same time as
    the transition, and one of these states is rather rare, at least in test conditions.)

    For both the above and the discussion about SPARK: thanks, Niklas, quite
    interesting.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jun 23 17:33:49 2026
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000 ISA--except for the case where one wants to walk the stack back on a THROW() and its unstructured equivalent longjump().

    What about a debugging stack trace?

    The debugger runs in a separate process with access to application
    Root pointer and ASID. In that process, Call-stack is RW-.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Jun 23 17:43:38 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a
    THROW() and its unstructured equivalent longjump().

    What about a debugging stack trace?

    The debugger runs in a separate process with access to application
    Root pointer and ASID. In that process, Call-stack is RW-.

    GLIBC has a function to obtain a backtrace at a current point
    in time. This is called in the context of the thread that invokes
    the call. It requires access to the call records on the stack
    in the context of the thread (the glibc functions are backtrace(3)
    and backtrace_symbols(3)).

    /**
     * Log a simulator stack traceback.
     */
    void
    c_osdep::backtrace(c_logger *lp)
    {
        int num_frames;
        void *framelist[100];
        char **strings;

        num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));
        strings = ::backtrace_symbols(framelist, num_frames);
        if (strings == NULL) {
            lp->log("Unable to obtain simulator stack traceback: %s\n",
                    strerror(errno));
            return;
        }
        for(int frame=0; frame < num_frames; frame++) {
            lp->log("[%2.2d] %s\n", frame, strings[frame]);
        }
        ::free(strings);
    }

  • From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 15:15:49 2026
    From Newsgroup: comp.arch

    On 6/23/2026 12:43 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a

    What about a debugging stack trace?

    The debugger runs in a separate process with access to application
    Root pointer and ASID. In that process, Call-stack is RW-.

    GLIBC has a function to obtain a backtrace at a current point
    in time. This is called in the context of the thread that invokes
    the call. It requires access to the call records on the stack
    in the context of the thread (the glibc functions are backtrace(3)
    and backtrace_symbols(3)).

    /**
     * Log a simulator stack traceback.
     */
    void
    c_osdep::backtrace(c_logger *lp)
    {
        int num_frames;
        void *framelist[100];
        char **strings;

        num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));
        strings = ::backtrace_symbols(framelist, num_frames);
        if (strings == NULL) {
            lp->log("Unable to obtain simulator stack traceback: %s\n",
                    strerror(errno));
            return;
        }
        for(int frame=0; frame < num_frames; frame++) {
            lp->log("[%2.2d] %s\n", frame, strings[frame]);
        }
        ::free(strings);
    }


    Yeah, whatever the arguable benefits of separate call / data stacks, or
    of making the call stack inaccessible to the program, this doesn't fit
    with the vibe of RISC philosophy, nor with practical things like
    implementing C++ style throw/catch, or mechanisms like C's longjmp, ...

    One would likely need to defy minimalism by having additional hardware mechanisms to support these kinda things.


    Or, at least more than the damage already done in my case by putting
    mode tag bits and similar in the link register, which could
    potentially affect code which messes with the link register value
    directly and assumes the link register represents a bare address value.

    ...

  • From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 17:48:22 2026
    From Newsgroup: comp.arch

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    C killed off every memory model other than flat byte addressed
    memory.

    At least in the C standard the memory is segmented into objects.

    Pointers are sort of typed, but any real C program does stuff like
    this:

       p = (struct foo *) malloc(42 * sizeof(struct foo));

    That produces an object of a certain size, and you must only access it
    through pointers derived from p.  And programs usually satisfy that
    requirement.

          {
              p = (struct foo *) malloc(42 * sizeof(struct foo));
              fprintf( stream, "0x16,", p );
              ...
              if( fscanf( stream, "x16", q ) ) {
                  use q
              }
          }

    is q "derived" through p ??
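
    For reference, a minimal C sketch of that round trip, using the standard
    %p conversion in place of the "0x16"/"x16" placeholders above and a string
    buffer in place of the stream. The name roundtrip_pointer is made up for
    illustration. ISO C only promises that a pointer printed with %p and read
    back with %p during the same program run compares equal to the original;
    whether q counts as "derived" from p is exactly the open provenance
    question.

```c
#include <stdio.h>

/* Hypothetical helper: print a pointer as text, then parse it back. */
void *roundtrip_pointer(void *p)
{
    char text[64];
    void *q = NULL;

    /* %p prints an implementation-defined representation of the pointer. */
    snprintf(text, sizeof text, "%p", p);

    /* On input, %p accepts what %p produced earlier in the same run. */
    if (sscanf(text, "%p", &q) != 1)
        return NULL;
    return q;
}
```

    The returned q compares equal to p, yet the compiler saw no data flow
    from p to q other than through the text buffer.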

    There is a discussion going on at the moment about "pointer providence"
    Perhaps you meant pointer "provenance"? I hope we will not rely on the
    "careful governance and guidance of God", or on an "instance of divine
    intervention" to ensure pointer safety...

    Has pointer safety been shown to be equivalent to the halting
    problem?  If so, "careful governance and guidance from God" may
    indeed be required.

    I would assume it is undecidable, for unrestricted programs. The aim of pointer provenance is no doubt to restrict programs to make it decidable
    to some extent.


    I didn't really understand it myself.

    In my case, I tended to use more conservative approaches and then only optimize based on what can be verified by the compiler within certain fundamental assumptions.

    Say:
    Pointer 1 points at a stack array in the local function;
    Pointer 2 was derived from taking the address of a global array;
    Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they
    don't alias with a pointer to a local array;
    ...


    Another option is a sort of "selective TBAA":
    Enable TBAA, but only if the current function doesn't contain any
    obvious pointer casts or similar.

    ...


    Then had noted that in my compiler (while working on it to try to reduce memory use), that there was a feature to walk the call-flow graph and
    mark off whichever global variables may be modified and similar as a
    result of calling some function.

    Had sort of forgot this existed, but is sometimes useful to know (can
    keep a global cached in a register if one knows the called function will
    not modify it, otherwise spill/reload is necessary).
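
    A rough sketch of that kind of call-graph walk, not the actual compiler
    code: each function records which globals it writes directly and which
    functions it calls, and the "may modify" sets are propagated until nothing
    changes. The names func_info and compute_may_modify are hypothetical, and
    the bitmasks assume at most 32 globals and 32 functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-function summary; bit i refers to global i or function i. */
struct func_info {
    uint32_t direct_mods;  /* globals written directly by this function */
    uint32_t callees;      /* functions called by this function */
    uint32_t may_modify;   /* result: globals possibly written by a call */
};

/* Propagate modification sets from callees to callers until a fixed point. */
void compute_may_modify(struct func_info *funcs, int n)
{
    bool changed;

    for (int i = 0; i < n; i++)
        funcs[i].may_modify = funcs[i].direct_mods;

    do {
        changed = false;
        for (int i = 0; i < n; i++) {
            uint32_t acc = funcs[i].may_modify;
            for (int j = 0; j < n; j++) {
                if (funcs[i].callees & (UINT32_C(1) << j))
                    acc |= funcs[j].may_modify;
            }
            if (acc != funcs[i].may_modify) {
                funcs[i].may_modify = acc;
                changed = true;
            }
        }
    } while (changed);
}
```

    A caller can then keep global g in a register across a call to function f
    whenever bit g is clear in funcs[f].may_modify. Iterating to a fixed point
    also copes with recursion and call cycles.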

    ...


    I am reminded of the person, apparently very religious, who some decades
    ago posted to solicit help for reimplementing all of computing (gcc,
    GNU, et cetera) on Biblical principles, because he thought Richard
    Stallman was too atheistic and had tainted his products. I have not
    heard how that went.

    There were a few people like that...

    There is seemingly a fine line though between being overly religious and
    being insane. A few of the people who I had seen who were like that, had
    been a bit of the latter.



    There is a lot of complexity with things like doctrine and theology,
    etc, but there is a characteristic difference IME.

    Well, and a leaning towards "reality defying" views; more emphasis on supernatural events and experiences, defiance of things like basic
    physics or rules of mathematics; and often pairing the outward
    religiosity with rather unstable or inconsistent adherence to moral or
    ethical behavior (or, applying it only to other people, while giving
    themselves free rein to indulge in whatever they feel like doing); ...

    Well, and seemingly, the "more genuine" thing being to express restraint
    in one's own behavior in these areas, not to worry about or try to
    control what anyone else is doing.

    Well, and then there is sorta the cultural expectation that one
    evangelize to others, etc, but this doesn't make as much sense in
    contexts where everyone likely already knows and/or has already made up
    their mind.


    Or, one ends up getting on others' bad sides, say, if one admits that
    they don't personally buy into the "Young Earth Creationist" mindset,
    and feel that (as a society) people have mostly been interpreting
    Genesis incorrectly (and making themselves look stupid in the process,
    by insisting that everyone adopt an overly particular and somewhat
    nonsensical interpretation).

    But, alas, ...




    Though, this doesn't mean that I can claim to always have a 100% stable
    hold on what constitutes "reality" (and my own experiences do include
    things that seem to deviate from normal expectations).

    Though, most in my experience seem to be things like seeming time-flow
    and causality breaks: experiences where normal linear time-flow seems to
    break down; where events happen in ways that seem to break forwards
    causal order; or where sometimes stuff just "changes around" for no
    particular reason.


    Though, I guess I differ by not claiming to have any higher explanation
    for stuff like this...

    Also often more like "bad Sci-Fi tropes" than particularly religious
    though (like one seemingly encounters weirdness that more seems like
    something out of Star Trek or something...).


    Well, like high-level examples:
    Seeming delayed-choice key-ring color instability;
    Choose one color of keyring, it flip flops and changes later.
    Several instances / variations of:
    Go to the bathroom, seemingly experience time displacement.
    Like, go into bathroom, may emerge with an unexpected time delta.
    Times where events seemingly tie into "time knots";
    Event sequence becomes paradoxical, causes/effects are reversed;
    Or occasional time-loops (reliving the same events multiple times);
    Unexpected changes appearing in ones' code;
    Like, the code was one way, then it was different.
    Or, in CNC, an event where some M01's turned into M00's somehow.
    Or, one remembers documenting something,
    but then what they wrote is nowhere to be found.
    ...


    Could try to come up with some sort of explanation, but purely
    observational, one might just claim "if one goes to the bathroom or
    similar, they may sometimes somehow initiate temporal anomalies". Well,
    and/or attribute it to neurological factors.

    Though, many of these sorts of events do seem oddly correlated with
    "went to the bathroom, then some weirdness happens...".



    Well, and realizing that some bigger mysteries from earlier in my life
    had more mundane explanations:
    Weird Mac style computer with external magneto-optical drive, etc:
    Apparently was actually a thing at the time...
    I just don't know why anyone would have showed it to me.
    Like, not actually "alien", just absurdly expensive.
    But, then, I question if I really saw it.
    But, why would I remember such a setup if I didn't see it?...
    But, why subject a random 3rd grader to Pascal and MPW?...
    Like, a story with plausible explanations, technically.
    But, the "why" aspect doesn't make sense...
    LaserDisc disappearance in the early 2000s:
    Apparently people mostly just got rid of them...
    The rare purple LaserDiscs:
    Apparently recordable LaserDisc was just a market flop,
    not some weird alien tech.
    A one-off incident in a school A/V setup.
    Like, sometimes they used LD, and not just VCRs.
    ...


    Though, it is still odd sometimes to have maybe encountered weird tech,
    to then have it disappear and never seeing it again (like, where one can question whether their current self is still living in the same timeline
    they existed in during their childhood).


    Not like there is anything particularly religious about tech though, and
    if I saw this stuff as an adult would probably have not thought as much
    about it.


    Does sometimes seem like life could have gone differently in some areas,
    I was just sort of an epic fail at everything.

    ...


  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jun 24 00:54:32 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:
    -------------
    In my case, I tended to use more conservative approaches and then only optimize based on what can be verified by the compiler within certain fundamental assumptions.

    Say:
    Pointer 1 points at a stack array in the local function;
    Pointer 2 was derived from taking the address of a global array;
    Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they don't alias with a pointer to a local array;

    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 arguments alias, it is a programming error.

    -----------------
    I am reminded of the person, apparently very religious, who some decades ago posted to solicit help for reimplementing all of computing (gcc,
    GNU, et cetera) on Biblical principles, because he thought Richard Stallman was too atheistic and had tainted his products. I have not
    heard how that went.

    Rick...

    --------
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jun 24 00:59:12 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a

    What about a debugging stack trace?

    The debugger runs in a separate process with access to application
    Root pointer and ASID. In that process, Call-stack is RW-.

    GLIBC has a function to obtain a backtrace at a current point
    in time. This is called in the context of the thread that invokes
    the call. It requires access to the call records on the stack
    in the context of the thread (the glibc functions are backtrace(3)
    and backtrace_symbols(3)).

    When Thread is unExceptional it cannot access Call Stack,
    when Thread is Exceptional it can.

    ENTER, EXIT, and RET are exempt from the protection check.
    Call Stack Pointer is not accessible to unprivileged code.

    Don't see how one gets from a running application into debugger without
    taking an exception !?! or from running in the debugger to running in application without returning from an exception !!!
  • From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 21:01:35 2026
    From Newsgroup: comp.arch

    On 6/23/2026 7:54 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:
    -------------
    In my case, I tended to use more conservative approaches and then only
    optimize based on what can be verified by the compiler within certain
    fundamental assumptions.

    Say:
    Pointer 1 points at a stack array in the local function;
    Pointer 2 was derived from taking the address of a global array;
    Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they
    don't alias with a pointer to a local array;

    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 arguments alias, it is a programming error.


    Hard proof that alias is impossible is harder to achieve in practice...

    A softer "there is no reasonable possibility of alias" is easier to achieve.

    Like, one can assume that each independent memory object exists in its
    own local void, and that there is no reasonable way to reach from one to another.


    Like, even if you can potentially go out of bounds to reach from one independent memory object to another, for a compiler it may be
    sufficient merely to prove that the origins reflect two independent
    memory objects (and not two pointers within the same object, or a
    parent/child relationship).

    Likewise, global variables can be seen as separate objects, along with independent local variables.

    Say:
    int arra[16];
    int arrb[16];
    With arra and arrb being assumed independent, even if in-memory they are
    right next to each other, but excluding arrays within a common struct
    (where the containing struct can be seen as a common origin point).


    Everything passed in can go into an "unknown" category; where unknown
    pointers may be assumed to alias with each other.

    Otherwise, one would need to make assumptions about "every possible
    caller", which is unreasonable (caller behavior can be assumed to fall
    into an open-ended set, even in cases where callee behavior can be
    reasoned about via graph walks).

    Though, could still be done when one assumes that the callers form a
    closed set.
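
    The rules above could be sketched roughly as the following "may alias"
    query. This is an illustration under assumptions, not the actual compiler
    logic: the names origin_kind, ptr_origin and may_alias are hypothetical,
    and the escaped flag is an added assumption to cover a local whose address
    has been stored or passed out of the function.

```c
#include <stdbool.h>

/* Hypothetical origin classes for a pointer, as tracked by the compiler. */
enum origin_kind {
    ORIGIN_LOCAL,    /* address of a local object in the current function */
    ORIGIN_GLOBAL,   /* address of a global object */
    ORIGIN_UNKNOWN   /* passed in, loaded from memory, or otherwise untracked */
};

struct ptr_origin {
    enum origin_kind kind;
    int object_id;   /* root object for LOCAL/GLOBAL; unused for UNKNOWN */
    bool escaped;    /* LOCAL only: address was stored or passed elsewhere */
};

/* Conservative query: returns false only when the origins are independent. */
bool may_alias(struct ptr_origin a, struct ptr_origin b)
{
    /* Two unknown pointers are assumed to alias each other. */
    if (a.kind == ORIGIN_UNKNOWN && b.kind == ORIGIN_UNKNOWN)
        return true;

    /* One unknown pointer: it cannot reach a local that never escaped,
       but it may well point at a global. */
    if (a.kind == ORIGIN_UNKNOWN || b.kind == ORIGIN_UNKNOWN) {
        struct ptr_origin known = (a.kind == ORIGIN_UNKNOWN) ? b : a;
        if (known.kind == ORIGIN_LOCAL)
            return known.escaped;
        return true;
    }

    /* Both origins known: a local and a global are separate objects. */
    if (a.kind != b.kind)
        return false;

    /* Same class: alias only if both derive from the same root object,
       for example two arrays inside one containing struct. */
    return a.object_id == b.object_id;
}
```

    With this, arra and arrb get distinct object_id values and are treated as
    independent, while two arrays inside a common struct share the struct's
    object_id and are treated as possibly aliasing.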

    ...


    -----------------
    I am reminded of the person, apparently very religious, who some decades
    ago posted to solicit help for reimplementing all of computing (gcc,
    GNU, et cetera) on Biblical principles, because he thought Richard
    Stallman was too atheistic and had tainted his products. I have not
    heard how that went.

    Rick...


    That was one of them...


  • From John Levine@johnl@taugh.com to comp.arch on Wed Jun 24 02:25:30 2026
    From Newsgroup: comp.arch

    According to BGB <cr88192@gmail.com>:
    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 arguments alias, it is a programming error.
    Hard proof that alias is impossible is harder to achieve in practice...

    A softer "there is no reasonable possibility of alias" is easier to achieve.

    Sort of. The standard says that the compiler can assume no type punning, so that
    if pointers are of different types, they can't point at the same thing (with an exception for pointers to unions.)

    Even so, C has "restrict" to tell the compiler to assume that pointers never alias, and "volatile" to assume they always do.
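
    As a small illustration of "restrict" (a sketch; add_scaled is a made-up
    name): the qualifier is a promise from the programmer that, within the
    function, the objects reached through dst and src do not overlap. The
    compiler may then keep values in registers or vectorize without reloading
    src after each store to dst. Calling it with overlapping arrays is
    undefined behavior, not a diagnosed error.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] += k * src[i] for i in [0, n).
   The restrict qualifiers assert that dst and src never overlap. */
void add_scaled(float *restrict dst, const float *restrict src,
                size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

    Without the qualifiers the compiler has to allow for a store to dst[i]
    changing a later src[j], since both are pointers to float.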
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
  • From BGB@cr88192@gmail.com to comp.arch on Tue Jun 23 22:41:19 2026
    From Newsgroup: comp.arch

    On 6/23/2026 9:25 PM, John Levine wrote:
    According to BGB <cr88192@gmail.com>:
    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 arguments alias, it is a programming error.
    Hard proof that alias is impossible is harder to achieve in practice...

    A softer "there is no reasonable possibility of alias" is easier to achieve.

    Sort of. The standard says that the compiler can assume no type punning, so that
    if pointers are of different types, they can't point at the same thing (with an
    exception for pointers to unions.)

    Even so, C has "restrict" to tell the compiler to assume that pointers never alias, and "volatile" to assume they always do.


    Possibly, though traditional type-based aliasing rules run into a
    problem in that pointer casting can break its assumptions, and a lot of
    code doesn't respect these rules (which taken purely at face value, are
    overly limiting).


    One option though is "if enabled, assume the rules are followed unless
    the compiler sees them being broken", in which case it disables TBAA
    when faced with TBAA violations. This approach seems to be moderately effective, and allows benefiting from some of the performance advantages
    of TBAA while also being more friendly to code that goes "wild west"
    with things like pointer casts and "cast and dereference" patterns.

    So, say, a nicer compromise (even if still breakable).
    int foo1(char *s, int *t)
    {
        *s=*t+1;
        return *t;
    }
    //assume not directly visible within same context:
    int foo2()
    {
        int i, j;
        i=4;
        j=foo1((char *)(&i), &i);
        return j;
    }
    What is the result of calling foo2?...
    Here, foo2 breaks TBAA but in a way invisible to foo1.

    Though, in theory, one workaround is that the compiler can see that foo2 breaks TBAA and can then flag foo1 that its operands may not safely
    assume TBAA.

    Though, this poses a problem for my current compiler design, as some of
    the alias handling stuff happens before the compiler will have a
    complete view of the call-graph.

    Would in effect need to add an additional internal compiler pass to
    detect and mark all the TBAA violations within the call-graph.

    But, for now, seems "mostly good enough".




    For volatile, one typically needs to go a little further:
    Every load and store needs to be performed explicitly;
    There is a need to disallow load/store reordering;
    ...
    Mostly because volatile may be used to access MMIO, and MMIO is more
    strict than normal RAM in this area.

    Though, could maybe be better if "volatile" could be broken into several subtypes depending on which particular behaviors are needed:
    Weaker case: Assume aliasing happens.
    May still prune non-aliasing load/store or reorder;
    Normal case:
    Every load/store needs to happen;
    No reordering allowed.
    Stronger case:
    Like the above, but also needs to be synchronous between cores;
    Though, this role overlaps with _Atomic.

    There is also ambiguity as to how far the volatile-ness extends, but
    this can be avoided by doing it at the point of cast-and-deref:
    (*(volatile uint64_t *)ptr)
    In this case, it applies explicitly to the deref operation rather than
    the handling of the pointer before this point.

    ...


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 05:34:30 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Consider the Push/Pop mechanics in HW compared to FMAC in HW--which
    do you think is easier ???

    Now consider 16 pushed in a row versus a single instruction that performs
    the same amount of work. Which one needs to translate an address more
    often, which one needs to AGEN more often, and which one can access the
    cache once for up to 8 registers ???

    Modern x86 processors have a "stack engine" to address this
    problem. Multiple push or pop instructions, respectively,
    are split into two microops (one memory access, one decrement or
    increment), and the decrement/increment microops are then merged.

    This probably costs an extra cycle of pipeline depth or so, but
    I haven't been able (after cursory looking) to find a number for
    newer architectures.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jun 24 05:48:44 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 06:06:14 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a
    THROW() and its unstructured equivalent longjump().

    What about a debugging stack trace?

    The debugger runs in a separate process with access to application
    Root pointer and ASID. In that process, Call-stack is RW-.

    But no error backtrace from an error occurring in a normal program?

    Example (Fortran reading from a non-opened file, compiled with
    -g -static-libgfortran):

    program memain
    call foo(a)
    print *,a
    end program memain

    subroutine foo(a)
    read (10) a
    end subroutine foo

    $ ./a.out
    At line 7 of file foo.f90 (unit = 10, file = 'fort.10')
    Fortran runtime error: End of file

    Error termination. Backtrace:
    #0 0x407d27 in us_read
    at ../../../dump/libgfortran/io/transfer.c:2983
    #1 0x407e24 in pre_position
    at ../../../dump/libgfortran/io/transfer.c:3109
    #2 0x40ae34 in data_transfer_init
    at ../../../dump/libgfortran/io/transfer.c:3562
    #3 0x40392f in foo_
    at /tmp/foo.f90:7
    #4 0x403976 in memain
    at /tmp/foo.f90:2
    #5 0x403a0f in main
    at /tmp/foo.f90:4
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 06:08:15 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a
    THROW() and its unstructured equivalent longjump().

    What about a debugging stack trace?

    The debugger runs in a separate process with access to application
    Root pointer and ASID. In that process, Call-stack is RW-.

    GLIBC has a function to obtain a backtrace at a current point
    in time. This is called in the context of the thread that invokes
    the call. It requires access to the call records on the stack
    in the context of the thread (the glibc functions are backtrace(3)
    and backtrace_symbols(3)).

    When Thread is unExceptional it cannot access Call Stack,
    when Thread is Exceptional it can.

    ENTER, EXIT, and RET are exempt from the protection check.
    Call Stack Pointer is not accessible to unprivileged code.

    Don't see how one gets from a running application into debugger without taking an exception !?! or from running in the debugger to running in application without returning from an exception !!!

    Issuing an application error for which one might want to look at
    a backtrace (see previous Fortran example).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jun 24 06:20:45 2026
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    [LISP]

    Usual downside is that the excessive parentheses tend to turn into a usability issue.

    Ample fun has been made of this over time.

    Example: https://xkcd.com/297/

    Or, from the priceless "A Brief, Incomplete, and Mostly Wrong History of Programming Languages":

    # 1958 - John McCarthy and Paul Graham invent LISP. Due to high
    # costs caused by a post-war depletion of the strategic parentheses
    # reserve LISP never becomes popular... Fortunately for computer
    # science the supply of curly braces and angle brackets remains high.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Jun 24 08:50:18 2026
    From Newsgroup: comp.arch

    On 24/06/2026 02:54, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:
    -------------
    In my case, I tended to use more conservative approaches and then only
    optimize based on what can be verified by the compiler within certain
    fundamental assumptions.

    Say:
    Pointer 1 points at a stack array in the local function;
    Pointer 2 was derived from taking the address of a global array;
    Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they
    don't alias with a pointer to a local array;

    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 argument alias, it is a programming error.


    C lets the compiler assume that things do not alias, under certain circumstances. If you have a local array (pointer 1) and its address
    does not "escape", and a global array (pointer 2), the compiler can
    assume they do not alias, as any aliasing could only be the result of UB.

    For pointers passed into functions, the compiler won't have any such
    knowledge (unless it happens to be able to see the calling and called
    code at the same time - if they are in the same file, or you are using
    some kind of link-time or whole-program optimisation). But you can tell
    the compiler that pointers don't alias, with the "restrict" qualifier.
    This can make a significant difference in some code, and means that the "Fortran is faster than C because pointer parameters can't alias"
    argument has not been true since 1999. (Fortran code may be faster for
    other reasons - such as "C programmers don't know how to use restrict".)
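
    As a minimal sketch of the "restrict" qualifier mentioned above (the function name scaled_add is hypothetical, chosen only for illustration): marking both pointer parameters restrict is a promise by the caller that the two arrays do not overlap, which lets the compiler keep values in registers or vectorise the loop without reloading src after each store to dst.

```c
#include <stddef.h>

/* Hypothetical helper: dst[i] += k * src[i] for i in [0, n).
 * "restrict" is the caller's promise that dst and src do not overlap;
 * calling this with overlapping arrays is undefined behaviour. */
void scaled_add(float *restrict dst, const float *restrict src,
                float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

    Whether this actually produces faster code depends on the compiler and target; the qualifier only removes an obstacle to optimisation.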


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jun 24 02:01:40 2026
    From Newsgroup: comp.arch

    On 6/24/2026 12:48 AM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.


    Yes, this is one place where I disagree with GCC on.
    I decided to go with "more sane" default behavior (no TBAA by default,
    it is opt-in).

    Goal is to find rules that are "mostly sane" while still being effective.


    Localized approaches can work OK, but necessarily need to be conservative.

    Something like full provenance poses a harder problem though, as to know
    a solid answer requires tracing the flow of a variable across multiple control-flow frames (or maybe going further, into reasoning about things
    like objects and linked lists).

    Decided not to go too much into it, but this is not the first time I
    have encountered a variation of this problem. It is doable in theory,
    but actually doing it in a compiler is a bit more of a pain...

    ...


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Jun 24 09:58:35 2026
    From Newsgroup: comp.arch

    On 24/06/2026 05:41, BGB wrote:
    On 6/23/2026 9:25 PM, John Levine wrote:
    According to BGB  <cr88192@gmail.com>:
    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 argument alias, it is a programming
    error.

    Hard proof that alias is impossible is harder to achieve in practice...

    A softer "there is no reasonable possibility of alias" is easier to
    achieve.

    Sort of.  The standard says that the compiler can assume no type
    punning, so that
    if pointers are of different types, they can't point at the same thing
    (with an
    exception for pointers to unions.)

    Even so, C has "restrict" to tell the compiler to assume that pointers
    never
    alias, and "volatile" to assume they always do.


    Possibly, though traditional type-based aliasing rules run into a
    problem in that pointer casting can break its assumptions, and a lot of
    code doesn't respect these rules (which taken purely at face value, are overly limiting).


    It's true that some programmers seem to think you can do whatever you
    like with pointers converted between different types. A lot of use of
    converted pointer types will be UB in C, but C does not make it at all
    difficult to write code with these conversions. There's a fair argument
    to be made that type-based alias analysis rarely gives good optimisation
    opportunities, restricts programmers, and lets people write code that
    they think is correct, but is not. Quite a number of C compilers
    specifically do not do any type-based aliasing analysis, or let you turn
    it off (gcc -fno-strict-aliasing).

    One key point is that in C++, type-based alias analysis is much more
    useful as you generally use far more different types (typedef in C does
    not make different types), and code is generally much more careful about accessing them with correct pointer types (or better, references,
    containers, smart pointers, etc.).

    Maybe things could be helped by attributes that give you better control
    over aliasing - gcc has a "may_alias" type attribute that can be used to
    give a type the "aliasing superpowers" of character types.
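
    A small sketch of the "may_alias" attribute, assuming GCC or Clang (it is a compiler extension, not standard C) and a 32-bit IEEE float; the names u32_alias and float_bits_may_alias are made up for this illustration:

```c
#include <stdint.h>

/* GCC/Clang extension: accesses through this typedef are treated like
 * character-type accesses for alias analysis, so they may alias any object. */
typedef uint32_t __attribute__((may_alias)) u32_alias;

/* Hypothetical helper: read the bit pattern of a float.
 * Assumes sizeof(float) == 4 and suitable alignment for uint32_t. */
uint32_t float_bits_may_alias(const float *f)
{
    return *(const u32_alias *)f;
}
```

    Without the attribute, the same cast-and-dereference through a plain uint32_t pointer would violate the type-based aliasing rules.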


    One option though is "if enabled, assume the rules are followed unless
    the compiler sees them being broken", in which case it disables TBAA
    when faced with TBAA violations.

    That sounds /really/ bad. You can't have the behaviour - the semantics
    - dependent on whether or not a compiler is able to find an error in
    your code!

    An option to say TBAA is enabled or not makes sense. Even better, is
    having it as a pragma. (I always use gcc optimise pragmas if I need to disable a particular optimisation, to keep it safe regardless of command
    line options.) Standardising this in some way could be useful. And it
    is probably a good idea to have TBAA off by default - let those who
    understand it and want it, enable it. (But it should probably be on by default for C++.)

    And when a compiler has TBAA enabled, and it spots a violation, that's
    time for an error message - not silently disabling it!


    This approach seems to be moderately
    effective, and allows benefiting from some of the performance advantages
    of TBAA while also being more friendly to code that goes "wild west"
    with things like pointer casts and "cast and dereference" patterns.

    So, say, a nicer compromise (even if still breakable).

    It's better to use something other than "char" pointers, since character pointers can be used to access any data. There is no UB in your example
    here, that I can see - "*s" is allowed to access data pointed to by
    "*t". Let's pretend you use "short * s" or "float * s" instead.

      int foo1(char *s, int *t)
      {
        *s=*t+1;
        return *t;
      }
      //assume not directly visible within same context:
      int foo2()
      {
        int i, j;
        i=4;
        j=foo1((char *)(&i), &i);
        return j;
      }
    What is the result of calling foo2?...
      Here, foo2 breaks TBAA but in a way invisible to foo1.

    With the proviso mentioned above, there are countless ways in which calling a function with unexpected or inappropriate parameters leads to
    UB. You always have to know the requirements for the parameters before calling a function. This situation is a drop in the ocean, and not
    really worth worrying about specifically IMHO.


    For volatile, one typically needs to go a little further:
      Every load and store needs to be performed explicitly;
      There is a need to disallow load/store reordering;

    "volatile" does not affect hardware ordering - it only affects the
    ordering within the program. It cannot see things that are at a level
    below the generated code.

    If you want to influence hardware ordering, use atomics and fences (from
    C11, or implementation extensions).
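
    A minimal sketch of the C11 route, assuming a compiler with <stdatomic.h>; the names payload, ready, publish and try_consume are hypothetical. The release store and acquire load order the plain write to payload against the flag, which "volatile" alone would not guarantee across cores:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary, non-atomic data */
static atomic_bool ready;    /* zero-initialised: false */

/* Producer: write the data, then publish it with release ordering. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: the acquire load pairs with the release store, so if the
 * flag is seen as true, the write to payload is visible as well. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```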

      ...
    Mostly because volatile may be used to access MMIO, and MMIO is more
    strict than normal RAM in this area.

    In the microcontroller world at least, that is done by the memory
    management unit or memory protection unit, specifying which address
    areas are accessible in different ways, which are cacheable, which can
    be buffered or re-ordered. That is all well below the level visible in
    a programming language - and once the MPU is set up correctly, it all
    "just works".


    Though, could maybe be better if "volatile" could be broken into several subtypes depending on which particular behaviors are needed:
      Weaker case: Assume aliasing happens.
        May still prune non-aliasing load/store or reorder;

    C does not need volatile for that - you've got aliasing superpower
    character types. In practice, you have memcpy() / memmove() to read or
    write data that might be aliased, or different types. All you need is
    for compilers to handle small memcpy's with fixed sizes efficiently (as
    gcc and clang do). Often this is more efficient than using volatiles -
    using "float f = 12.3; uint32_t x; memcpy(&x, &f, 4); return x;" will typically result in a float register to integer register move instruction.
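
    The memcpy() idiom above, written out as a complete function. This is a sketch assuming a 32-bit float (checked at compile time); the function name float_bits is only for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Return the object representation of a float as a 32-bit integer.
 * memcpy() with a small fixed size is well-defined regardless of the
 * types involved, and compilers commonly reduce it to a register move. */
uint32_t float_bits(float f)
{
    uint32_t x;
    _Static_assert(sizeof x == sizeof f, "float assumed to be 32 bits");
    memcpy(&x, &f, sizeof x);
    return x;
}
```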

      Normal case:
        Every load/store needs to happen;
        No reordering allowed.

    No re-ordering with respect to other volatiles, you mean. That is what volatile does today.

      Stronger case:
        Like the above, but also needs to be synchronous between cores;
        Though, this role overlaps with _Atomic.

    As you say, that is the job for atomics - or volatile atomics.

    So we already have all the features you want. It is fair to say,
    however, that some programmers misunderstand "volatile" and think it
    means one of the other cases you list. (Or the case that you didn't
    list - the assumption that volatile forces an order on non-volatile
    accesses or operations.)


    There is also ambiguity as to how far the volatile-ness extends, but
    this can be avoided by doing it at the point of cast-and-deref:
      (*(volatile uint64_t *)ptr)
    In this case, it applies explicitly to the deref operation rather than
    the handling of the pointer before this point.


    I don't know what you mean by an "ambiguity" here. There is no
    ambiguity in C about where "volatile" applies. There might be confusion
    or misunderstanding amongst some programmers, but not an ambiguity in
    the semantics.

    It is, IME, helpful to remember that "volatile" is primarily about
    /accesses/, rather than objects. This was somewhat unclear in the C
    standards until C17.
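
    To illustrate "volatile" as a property of the access, here is a sketch of the cast-and-dereference pattern wrapped in helpers. The names read_once_u64 and write_once_u64 are hypothetical, and the pointer is assumed to be suitably aligned and to refer to a uint64_t object:

```c
#include <stdint.h>

/* Perform exactly one volatile load of the pointed-to object.
 * The object itself does not need to be declared volatile. */
static inline uint64_t read_once_u64(const void *ptr)
{
    return *(const volatile uint64_t *)ptr;
}

/* Perform exactly one volatile store to the pointed-to object. */
static inline void write_once_u64(void *ptr, uint64_t value)
{
    *(volatile uint64_t *)ptr = value;
}
```

    These accesses are ordered only with respect to other volatile accesses in the program; they imply nothing about hardware ordering or atomicity.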



    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Jun 24 10:30:12 2026
    From Newsgroup: comp.arch

    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.


    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal (I don't think they are, but the changes I'd
    make might be different from the ones you would like). But it is
    entirely appropriate for a compiler to follow the C rules unless you
    specify otherwise.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jun 24 14:30:17 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    BGB <cr88192@gmail.com> posted:

    On 6/20/2026 5:01 PM, MitchAlsup wrote:
    ---------------
    Tagging to make it harder to stomp the link register;

    Put it somewhere it can't be stomped on !! like in memory on a page the
    application has no access permissions.


    Multiple stacks is a big ask, and non-accessible memory is not so good
    when dealing with an ISA where user code needs to handle the Link-Register.

    Code does not need to access or look at the return address in My 66000
    ISA--except for the case where one wants to walk the stack back on a
    THROW() and its unstructured equivalent longjump().

    What about a debugging stack trace?

    The debugger runs in a separate process with access to application
    Root pointer and ASID. In that process, Call-stack is RW-.

    GLIBC has a function to obtain a backtrace at a current point
    in time. This is called in the context of the thread that invokes
    the call. It requires access to the call records on the stack
    in the context of the thread (the glibc functions are backtrace(3)
    and backtrace_symbols(3)).

    When Thread is unExceptional it cannot access Call Stack,
    when Thread is Exceptional it can.

    ENTER, EXIT, and RET are exempt from the protection check.
    Call Stack Pointer is not accessible to unprivileged code.

    Don't see how one gets from a running application into debugger without
    taking an exception !?! or from running in the debugger to running in
    application without returning from an exception !!!

    The glibc function ::backtrace can be called at any time, in any context.

    Then there are the unix context functions that also allow access to
    resources not normally visible to an application - getcontext(2), makecontext(3) and the setjmp/sigsetjmp functions which also
    gather the thread context, including the current stack pointer.
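
    For reference, a minimal sketch of calling backtrace(3) and backtrace_symbols(3) from ordinary application code, assuming glibc's <execinfo.h>. The function name print_backtrace and the frame limit are made up for this example:

```c
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 32   /* arbitrary limit for this sketch */

/* Print the calling thread's current call chain to stderr and return
 * the number of frames captured. Runs in the caller's own context,
 * without involving a debugger or an exception. */
int print_backtrace(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    char **names = backtrace_symbols(frames, n);

    if (names == NULL)
        return n;       /* symbol lookup failed; addresses were still captured */
    for (int i = 0; i < n; i++)
        fprintf(stderr, "#%d %s\n", i, names[i]);
    free(names);        /* backtrace_symbols() returns one malloc'd block */
    return n;
}
```

    This works only because the thread can read the return addresses in its own stack frames, which is the access under discussion.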
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Wed Jun 24 14:38:02 2026
    From Newsgroup: comp.arch

    On Mon, 22 Jun 2026 18:49:40 -0400, George Neuner wrote:

    On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
    <monnier@iro.umontreal.ca> wrote:

    Robert Swindells [2026-06-19 11:20:10] wrote:
    On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:
    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp), and
    SPARC contains some support for that, but as one of the IIRC Franz
    Lisp developers has explained in this newsgroup, they actually did
    not use this feature, because the performance benefit was not big
    enough to
    [...]
    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    I guess you two aren't talking about the same "Franz Lisp". AFAIK Anton
    is referring to the commercial Common Lisp compiler associated with the
    Franz Inc company, marketed under the name "Allegro".

    === Stefan

    ISTM there were at least a couple of Lisps available for the Vax. I
    can't speak to Franz, but I do know at least one Vax Lisp was a BIBOP[1] system that (generally) did not use tags.

    Franz Lisp used BiBOP.

    I posted a link earlier in the thread to the PDF of "Performance and Evaluation of Lisp Systems", Chapter 2 contains descriptions of the
    various implementations available at that time.

    <https://dreamsongs.com/Files/Timrep.pdf>

    The following chapter lists the benchmarks used and results for each implementation.

    For some reason, later versions of SPECint li ran these Lisp benchmarks
    in the XLisp interpreter that had been compiled for the CPU under test.

    The benchmark results reported in the book are for fully compiled code.

    In BIBOP, memory "pages"[2] are dedicated to a single data type. The
    base address of the page is mapped to the type of the objects the page contains, and so the objects (and pointers to them) need no type
    information themselves. This allowed for full width pointers, fixnums
    and floats, and for conses, boxes, and other fixed sized data types (including user types) to avoid tagging.
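
    A toy sketch of the BIBOP lookup described above, assuming a single contiguous heap carved into 4 KB pages. All names (heap, page_type, alloc_page, type_of) and sizes are hypothetical and do not reflect any particular Lisp implementation:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)
#define NUM_PAGES  16

enum obj_type { TYPE_FREE = 0, TYPE_CONS, TYPE_FIXNUM, TYPE_FLOAT };

static unsigned char heap[NUM_PAGES * PAGE_SIZE];
static unsigned char page_type[NUM_PAGES];   /* the BIBOP table */
static size_t next_page;

/* Dedicate the next free page to objects of a single type. */
void *alloc_page(enum obj_type type)
{
    if (next_page >= NUM_PAGES)
        return NULL;
    page_type[next_page] = (unsigned char)type;
    return heap + PAGE_SIZE * next_page++;
}

/* Recover an object's type from its address alone: the page number
 * indexes the table, so the pointer needs no tag bits.
 * p must point into heap. */
enum obj_type type_of(const void *p)
{
    size_t page = (size_t)((const unsigned char *)p - heap) >> PAGE_SHIFT;
    return (enum obj_type)page_type[page];
}
```

    The type check costs a shift and a table load per object, which is the operation a side table in SRAM or spare PTE bits would take over.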

    I ran Franz Lisp on the Atari ST, not having tags made it easy to
    interface to the GEM GUI.

    I also made a start on a hardware accelerator for BiBOP type checking for
    the ST. The expansion connector on it provided access to the full 68k bus including the function pins.

    The idea was to look for data reads within a defined range then use the
    page number as an address for a small SRAM holding the BiBOP table and
    latch the value stored at that address. Would tweak the compiler slightly
    to read a value into a CPU register before needing to read the latched
    type of it.

    A variant of this idea could be to store the type value in spare bits in a PTE, then define an instruction that treats the contents of a register as
    an address and returns the matching "type" bits for it.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Jun 24 20:17:45 2026
    From Newsgroup: comp.arch

    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jun 24 16:34:55 2026
    From Newsgroup: comp.arch

    On 6/24/2026 3:17 PM, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    This one is why I added a "_memlzcpy()" function to my C library, whose
    main purpose is to give this sort of self-overlapping copy behavior (and
    to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

    In the case of a short backwards copy, it will call "memmove()", but as
    noted the behavior in the case of a short forwards copy is different.

    For non-overlap cases it can just invoke "memcpy()".
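
    A minimal sketch of the overlap-propagating copy an LZ77 decoder relies on. The name lz_copy is hypothetical and this is not the _memlzcpy() implementation, only the simplest byte-at-a-time form of the same idea:

```c
#include <stddef.h>

/* Copy n bytes strictly front to back, one byte at a time.
 * When dst lies shortly after src, bytes written earlier are re-read
 * later, so the pattern between src and dst repeats. Unlike memcpy(),
 * this behaviour is defined for overlapping buffers. */
void lz_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

    Applied to the a[100] example with dst = a+1 and src = a, this fills the whole array with the value of a[0], whereas memmove() would shift the old contents up by one byte.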





    Otherwise, was sitting around trying to fiddle with memory usage in my compiler, and there still seems to be around 16MB unaccounted for (after tracking basically all of the memory allocation and freeing in the
    compiler, VS debugger reports around 16MB more memory being used than
    the internal memory-use tracking does).


    This also seems larger than easily explained by the binary itself...

    Would estimate ~ 7MB for the EXE's sections + OS stack (1MB in Windows).


    But, in the past few days of fiddling I have gotten it from ~ 250 MB to compile Doom down to around 64MB (in VS), or 48MB (according to the
    internal allocation tracking).

    Though seemingly in the process, the build times have gotten ~ 2 seconds longer.

    Though, there were some changes affecting significant compiler
    structures (changing some raw strings to string handles, and breaking
    one larger structure into multiple parts), so this isn't entirely unreasonable. Changing the structure was a bit annoying as it was one of
    the most heavily used in the compiler, so involved touching a lot of code.


    One annoyance is that the 3AC opcode structure has a few fields that are minority use, but it likely isn't worth the pain of messing with it.

    Unlike the other struct, the Op struct is small enough that splitting it
    via a pointer would likely end up being net-negative for memory use.
    Would likely need to get a bit tricky and use two different-sized
    structs depending on sub-type and putting the lesser-used fields on the
    end, but this would be awkward and ugly. Not likely worth it.

    Well, because to address the full range of operations, it effectively has:
    operation tags
    2 types;
    2 destinations;
    4 sources;
    a 24-byte tagged-union immediate-value field
    needed for calls, member load/store, ...

    The 2-size strategy would likely be to have a full version with all the fields, and a subset version with:
    operation tags
    1 type
    1 destination
    3 sources
    And, then questioning whether it would be worth it to shave 48 bytes off
    an 88 byte struct (and maybe save a few MB at most).


    Well, and then I can see that the compiler is burning around 2MB on the
    data for figuring out whether called functions may have touched
    particular global variables (though this affects what optimizations the compiler can do, so isn't purely waste), ...

    ...

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 00:46:17 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 6/24/2026 3:17 PM, John Levine wrote:
    -------------------
    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    This one is why I added a "_memlzcpy()" function to my C library, whose
    main purpose is to give this sort of self-overlapping copy behavior (and
    to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

    Instead, I added an MM instruction to the ISA. MM is memmove()! LLVM is happy to
    use MM as a struct copy (sa = sb;) independent of where sa or sb are.

    In the case of a short backwards copy, it will call "memmove()", but as noted the behavior in the case of a short forwards copy is different.

    HW is really good at pointer compares and loop inversions.

    For non-overlap cases it can just invoke "memcpy()".

    Unnecessary with MM.

    Plus, while MM is doing its thing, non-memory ref instructions can make
    forward progress, and non-aliasing memory refs can use the 'other' Memory Units.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Andy Valencia@vandys@vsta.org to comp.arch on Wed Jun 24 19:17:39 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:
    Usual downside is that the excessive parentheses tend to turn into a usability issue.
    Ample fun has been made of this over time.

    From rec.humor.funny:

    From: jasmerb@mist.cs.orst.edu (Bryce Jasmer)
    Newsgroups: rec.humor.funny
    Subject: The Strategic Defense Initiative (SDI/Star Wars)
    Keywords: computer, funny
    Message-ID: <137457@looking.on.ca>
    Date: 23 Apr 90 10:30:08 GMT
    Sender: funnyr@looking.on.ca
    Posted: Mon Apr 23 11:30:08 1990
    Reply-Path: mist.cs.orst.edu!jasmerb

    Through some clever security hole manipulation I have been able to
    break into all of the government's computers and acquire the Lisp code
    to SDI. Here is the last page (tail -10) of it to prove that I actually
    have the code:

    )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) )))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))) ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))


    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 25 09:18:38 2026
    From Newsgroup: comp.arch

    On 24/06/2026 22:17, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    I think it is perhaps better to say that one of the less fortunate
    things about C is that people make assumptions without learning the
    language properly or looking up the details. And then code with these incorrect assumptions is propagated.

    "memcpy" is a fine example of this. It says on the tin that using it
    for objects that overlap is undefined behaviour - C standard code for
    "don't do that".

    But lots of people hammer away at their keyboards without reading the
    manuals or instructions, or paying much attention to their tutorial
    books or courses. Some languages are much more forgiving there - Python
    aims to accept as wide a range of inputs as possible for any operation
    or library function, and aims to give you as much feedback as possible
    about your mistakes at runtime. C aims for maximal efficiency on the assumption
    that you have read the specifications for the language, and follow the
    rules. You can do a lot of Python programming by combining trial and
    error with a bit of "how do I do this in Python" googling. In C, that's
    going to lead to tears sooner rather than later.

    In the case of "memcpy", a lot of people think it is defined - specified
    - by the naïve implementation (I believe it is shown as an example in
    K&R). And so they use "memcpy" everywhere, even in cases where
    "memmove" is the appropriate choice. Not long ago, a glibc developer discovered that on some Intel processors, in some circumstances, running
    the memory copy backwards led to a noticeable speedup for memcpy(), and
    thus implemented that. The backlash of people who said the change
    "broke" their code was overwhelming, and the change was reverted.

    I really do think that things like this should be /easy/ to get right. Parameters to "memcpy" are not allowed to overlap, so that the copying
    can be as efficient as possible. "memmove" allows the parameters to
    overlap, but is likely to be less efficient. Use the one that suits
    your requirements.

    But people get it wrong. There's a lot of people who sit alone,
    programming in C, who should not be programming in C - they should be
    using different languages, or learning C better before using it. Or
    they should have better guidance and help, code reviews from people more experienced. Many people programming in C don't even enable warnings on
    their compiler. C is not a language for people who program "by
    intuition", it requires more discipline in developers than many other languages.

    You are right that there are more subtle possibilities for errors in C,
    and I know of no one who thinks the rules of C and the standard library
    are all ideal. But a huge percentage of the code bugs in C (as distinct
    from logical errors, specification errors, etc., that plague all
    programming in all languages) could be avoided by better development practices.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 25 09:22:49 2026
    From Newsgroup: comp.arch

    On 24/06/2026 23:34, BGB wrote:
    On 6/24/2026 3:17 PM, John Levine wrote:
    According to David Brown  <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish.  Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address.  One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined.  It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write
    code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

        char a[100];

        a[0] = 42;
        memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't.  This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    This one is why I added a "_memlzcpy()" function to my C library, whose
    main purpose is to give this sort of self-overlapping copy behavior (and
    to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

    In the case of a short backwards copy, it will call "memmove()", but as noted the behavior in the case of a short forwards copy is different.

    For non-overlap cases it can just invoke "memcpy()".

    "memmove" will not fill the array above with 42. "memmove" acts as
    though it copies the source to a temporary buffer, then copies that
    temporary buffer to the destination. (If you want to fill the buffer
    with the value 42, "memset" is the function to use.)
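
    To make the difference concrete, here is a small sketch comparing a hand-written forward byte loop (which is not memcpy, just a loop whose behaviour is fully defined) with memmove on the array from the example. The helper names naive_forward_copy, init_ramp and run_both are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plain forward byte loop. Unlike memcpy, overlapping arguments are
   well defined here: each byte is read after earlier writes landed. */
static void naive_forward_copy(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Fill a[0..n-1] with 0, 1, 2, ... and then set a[0] = 42, so that the
   two copy strategies give visibly different results. */
static void init_ramp(unsigned char *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (unsigned char)i;
    a[0] = 42;
}

/* Run both strategies on separate arrays of n bytes (n >= 1). */
static void run_both(unsigned char *naive, unsigned char *moved, size_t n)
{
    assert(n >= 1);
    init_ramp(naive, n);
    naive_forward_copy(naive + 1, naive, n - 1);  /* smears 42 everywhere */
    init_ramp(moved, n);
    memmove(moved + 1, moved, n - 1);             /* shifts the old contents */
}
```

    With n = 100, the forward loop leaves every byte equal to 42, while memmove leaves moved[0] and moved[1] at 42 and moved[i] equal to i-1 for the rest, i.e. the original contents shifted up by one.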

    How is your "_memlzcpy" defined that is different from that?

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 03:23:27 2026
    From Newsgroup: comp.arch

    On 6/25/2026 2:22 AM, David Brown wrote:
    On 24/06/2026 23:34, BGB wrote:
    On 6/24/2026 3:17 PM, John Levine wrote:
    According to David Brown  <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish.  Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address.  One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined.  It is debatable as to whether the
    rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write
    code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

        char a[100];

        a[0] = 42;
        memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't.  This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    This one is why I added a "_memlzcpy()" function to my C library,
    whose main purpose is to give this sort of self-overlapping copy
    behavior (and to consolidate nearly every LZ77 style decompressor
    otherwise needing to supply their own version).

    In the case of a short backwards copy, it will call "memmove()", but
    as noted the behavior in the case of a short forwards copy is different.

    For non-overlap cases it can just invoke "memcpy()".

    "memmove" will not fill the array above with 42.  "memmove" acts as
    though it copies the source to a temporary buffer, then copies that temporary buffer to the destination.  (If you want to fill the buffer
    with the value 42, "memset" is the function to use.)


    Yeah, this is why I created "_memlzcpy()", because the defined behavior
    for "memmove()" is not what one wants for self-overlapping forward copy.


    How is your "_memlzcpy" defined that is different from that?
    Here:
    _memlzcpy(dst+1, dst, len);
    Is functionally equivalent to:
    memset(dst+1, *dst, len);

    But, it can do more:
    _memlzcpy(dst+2, dst, len); //repeating 2-byte pattern
    _memlzcpy(dst+3, dst, len); //repeating 3-byte pattern
    ...

    So, required to work for every self-overlap distance.


    Or, in the case as commonly used in an LZ77 style decompressor:
    _memlzcpy(dest, dest-distance, length);
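
    For readers who want the semantics spelled out, a rough sketch of what such a function could look like follows. This is not the actual "_memlzcpy()" source; lz_overlap_copy is a hypothetical name, and the three-way dispatch is just one reading of the description above. It assumes dst and src point into the same buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an LZ77-style overlapping copy.
   Assumes dst and src point into the same buffer. */
static void lz_overlap_copy(unsigned char *dst, const unsigned char *src,
                            size_t len)
{
    assert(dst != NULL && src != NULL);
    if (dst <= src) {
        /* Backwards (or identical) copy: ordinary memmove semantics. */
        memmove(dst, src, len);
    } else if ((size_t)(dst - src) >= len) {
        /* Source and destination do not overlap: plain memcpy is fine. */
        memcpy(dst, src, len);
    } else {
        /* Forward self-overlap: copy byte by byte so that bytes written
           earlier in this call are re-read, repeating the pattern. */
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
    }
}
```

    Starting from a buffer holding "abc", calling lz_overlap_copy(buf+3, buf, 9) yields "abcabcabcabc"; with a distance of 1 it degenerates to the memset case.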


    Though, there are also:
    _memcpyf()
    _memmovef()
    _memlzcpyf()

    Where the 'f' in this case means:
    Allowed to be a little faster by potentially going up to 32 bytes extra.

    Where, in some cases it is faster to overshoot the end than to give an
    exact length, but it would not be valid to overshoot the copy for the
    normal versions (exact length even if it is a little slower).
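
    As a rough illustration of the overshoot idea (again a sketch, not the library code): copying in fixed 32-byte granules and rounding the length up removes the tail handling, at the price of touching bytes past the requested end. The name copy_overshoot and the GRANULE constant are assumptions; in this particular sketch the overshoot is at most GRANULE-1 bytes, and both buffers must have that much slack and must not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { GRANULE = 32 };  /* assumed copy granule, in bytes */

/* Copy len bytes rounded up to a multiple of GRANULE. The caller must
   guarantee that both buffers are readable/writable up to the rounded
   length and that they do not overlap. */
static void copy_overshoot(unsigned char *dst, const unsigned char *src,
                           size_t len)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < len; i += GRANULE)
        memcpy(dst + i, src + i, GRANULE);  /* fixed size, no tail loop */
}
```

    A fixed-size memcpy like this is typically turned into a couple of vector loads and stores by the compiler, which is where the speedup would come from.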


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Wed Jun 24 19:50:09 2026
    From Newsgroup: comp.arch

    Robert Swindells [2026-06-24 14:38:02] wrote:
    On Mon, 22 Jun 2026 18:49:40 -0400, George Neuner wrote:
    On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
    <monnier@iro.umontreal.ca> wrote:
    Robert Swindells [2026-06-19 11:20:10] wrote:
    On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:
    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp), and
    SPARC contains some support for that, but as one of the IIRC Franz
    Lisp developers has explained in this newsgroup, they actually did
    not use this feature, because the performance benefit was not big
    enough to
    [...]
    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    I guess you two aren't talking about the same "Franz Lisp". AFAIK Anton
    is referring to the commercial Common Lisp compiler associated with the
    Franz Inc company, marketed under the name "Allegro".

    === Stefan

    ISTM there were at least a couple of Lisps available for the Vax. I
    can't speak to Franz, but I do know at least one Vax Lisp was a BIBOP[1]
    system that (generally) did not use tags.

    Franz Lisp used BiBOP.

    Side note: the BiBoP technique is largely orthogonal to the
    architectural support for pointer tagging, because usually BiBoP is used
    to "eliminate" the tags present inside the heap representation of
    objects rather than the few tagbits stolen from pointers: the purpose of
    those tagbits is usually to be able to determine the type of the object *without* any memory access whereas BiBoP stores the corresponding info
    in memory.

    E.g. tagbits are most commonly used to distinguish between an immediate
    small integer value and a pointer. BiBoP wouldn't help with that,
    forcing the small integer to be stored in some "page of small integers"
    which could have a very serious performance impact.
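
    A small sketch of the two mechanisms side by side, under made-up encodings: a low tag bit marks immediate small integers (no memory access needed to test it), while a BiBoP-style side table maps a heap page to the type of everything stored in it (one table load per type query). The names Value, make_fixnum, bibop_type_of and the page size are all assumptions for illustration, not any particular Lisp's layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Value;

/* Tag bit scheme (assumed): low bit 1 = small integer, 0 = pointer. */
static Value make_fixnum(intptr_t n) { return ((uintptr_t)n << 1) | 1u; }
static int is_fixnum(Value v) { return (int)(v & 1u); }
/* Clear the tag, then halve; relies on the usual two's-complement
   conversion from uintptr_t to intptr_t. */
static intptr_t fixnum_value(Value v) { return (intptr_t)(v - 1u) / 2; }

/* BiBoP scheme (assumed): every object in a page has the same type,
   recorded once in a table indexed by page number. */
enum { PAGE_SHIFT = 12, NUM_PAGES = 16 };
typedef enum { TYPE_FREE, TYPE_CONS, TYPE_FLONUM } PageType;

static unsigned char heap[NUM_PAGES << PAGE_SHIFT];
static PageType page_type[NUM_PAGES];

/* Type lookup costs a load from page_type, unlike the tag-bit test. */
static PageType bibop_type_of(const void *p)
{
    size_t offset = (size_t)((const unsigned char *)p - heap);
    assert(offset < sizeof heap);
    return page_type[offset >> PAGE_SHIFT];
}
```

    The two can be combined: the tag bit answers "integer or pointer?" from the register alone, and the page table answers "what kind of object?" for the pointers, without a header word in each object.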


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jun 25 14:39:54 2026
    From Newsgroup: comp.arch

    BGB wrote:
    On 6/25/2026 2:22 AM, David Brown wrote:
    On 24/06/2026 23:34, BGB wrote:
    On 6/24/2026 3:17 PM, John Levine wrote:
    According to David Brown  <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish.  Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address.  One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined.  It is debatable as to whether the
    rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write
    code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:
        char a[100];

        a[0] = 42;
        memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't.  This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    This one is why I added a "_memlzcpy()" function to my C library,
    whose main purpose is to give this sort of self-overlapping copy
    behavior (and to consolidate nearly every LZ77 style decompressor
    otherwise needing to supply their own version).

    In the case of a short backwards copy, it will call "memmove()", but
    as noted the behavior in the case of a short forwards copy is
    different.

    For non-overlap cases it can just invoke "memcpy()".

    "memmove" will not fill the array above with 42.  "memmove" acts as
    though it copies the source to a temporary buffer, then copies that
    temporary buffer to the destination.  (If you want to fill the buffer
    with the value 42, "memset" is the function to use.)


    Yeah, this is why I created "_memlzcpy()", because the defined behavior
    for "memmove()" is not what one wants for self-overlapping forward copy.


    How is your "_memlzcpy" defined that is different from that?
    Here:
      _memlzcpy(dst+1, dst, len);
    Is functionally equivalent to:
      memset(dst+1, *dst, len);

    But, it can do more:
      _memlzcpy(dst+2, dst, len);  //repeating 2-byte pattern
      _memlzcpy(dst+3, dst, len);  //repeating 3-byte pattern
      ...

    So, required to work for every self-overlap distance.


    Or, in the case as commonly used in an LZ77 style decompressor:
      _memlzcpy(dest, dest-distance, length);


    Though, there are also:
      _memcpyf()
      _memmovef()
      _memlzcpyf()

    Where the 'f' in this case means:
    Allowed to be a little faster by potentially going up to 32 bytes extra.

    I'm guessing you really meant up to 31 bytes extra?

    This is what my own (faster than Google's version) LZ4 decompressor
    uses internally.

    I am using either a pair of SSE or a single AVX register (so 32 bytes
    in both cases) as the copy granule. For the specific, very common, case
    of an overlapping copy that unrolls RLL-encoded data, I start by loading
    the starting pattern into the bottom of a register, then use the pattern
    length to index into a table of swizzle patterns that will generate the
    required results, for any pattern up to 32 bytes long.

    swizzle_table:

    [0,0,0,0,0,0,0,...
    [0,1,0,1,0,1,0,1,...
    [0,1,2,0,1,2,0,1,2,...
    [0,1,2,3,0,1,2,3,...
    [0,1,2,3,4,0,1,2,3,...

    etc.

    Note that having 31 entries of 32 bytes each means that I'm allocating
    almost a KB of L1 cache space just for this table, but when you're
    decompressing lots of data it pays off.

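
    A scalar sketch of how such a table can be generated and applied, assuming row p-1 holds the indices i mod p for pattern length p. The names build_swizzle_table and expand_pattern are hypothetical, and the inner loop only stands in for the byte-shuffle instruction a real SIMD version would use (where lane restrictions of the shuffle would also need handling).

```c
#include <assert.h>
#include <stddef.h>

enum { GRANULE = 32, MAX_PATTERN = 31 };

/* Row p-1 holds the source index for each output byte when a pattern
   of length p is repeated across one 32-byte granule. */
static unsigned char swizzle_table[MAX_PATTERN][GRANULE];

static void build_swizzle_table(void)
{
    for (int p = 1; p <= MAX_PATTERN; p++)
        for (int i = 0; i < GRANULE; i++)
            swizzle_table[p - 1][i] = (unsigned char)(i % p);
}

/* Scalar stand-in for a byte shuffle: out[i] = pattern[index[i]]. */
static void expand_pattern(unsigned char out[GRANULE],
                           const unsigned char *pattern, int p)
{
    assert(p >= 1 && p <= MAX_PATTERN);
    const unsigned char *index = swizzle_table[p - 1];
    for (int i = 0; i < GRANULE; i++)
        out[i] = pattern[index[i]];
}
```

    The table is 31 * 32 = 992 bytes, which matches the "almost a KB" figure.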
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Thu Jun 25 13:24:45 2026
    From Newsgroup: comp.arch

    On Wed, 24 Jun 2026 19:50:09 -0400, Stefan Monnier wrote:

    Robert Swindells [2026-06-24 14:38:02] wrote:
    On Mon, 22 Jun 2026 18:49:40 -0400, George Neuner wrote:
    On Sat, 20 Jun 2026 10:15:41 -0400, Stefan Monnier
    <monnier@iro.umontreal.ca> wrote:
    Robert Swindells [2026-06-19 11:20:10] wrote:
    On Fri, 19 Jun 2026 06:02:16 GMT, Anton Ertl wrote:
    Another architectural feature: One might think that tagging support
    would help dynamically typed programming languages (e.g., Lisp),
    and SPARC contains some support for that, but as one of the IIRC
    Franz Lisp developers has explained in this newsgroup, they
    actually did not use this feature, because the performance benefit
    was not big enough to
    [...]
    Franz Lisp doesn't use tags at all and only ran on VAX and 68k.

    I guess you two aren't talking about the same "Franz Lisp". AFAIK Anton
    is referring to the commercial Common Lisp compiler associated with
    the Franz Inc company, marketed under the name "Allegro".

    === Stefan

    ISTM there were at least a couple of Lisps available for the Vax. I
    can't speak to Franz, but I do know at least one Vax Lisp was a
    BIBOP[1]
    system that (generally) did not use tags.

    Franz Lisp used BiBOP.

    Side note: the BiBoP technique is largely orthogonal to the
    architectural support for pointer tagging, because usually BiBoP is used
    to "eliminate" the tags present inside the heap representation of
    objects rather than the few tagbits stolen from pointers: the purpose of those tagbits is usually to be able to determine the type of the object *without* any memory access whereas BiBoP stores the corresponding info
    in memory.

    E.g. tagbits are most commonly used to distinguish between an immediate
    small integer value and a pointer. BiBoP wouldn't help with that,
    forcing the small integer to be stored in some "page of small integers"
    which could have a very serious performance impact.

    But if you know that you have a "page of small integers" then you can just
    do address comparisons between them; the Franz Lisp compiler did this.



    === Stefan

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Jun 25 09:38:22 2026
    From Newsgroup: comp.arch

    On 2026-Jun-25 08:39, Terje Mathisen wrote:
    BGB wrote:
    On 6/25/2026 2:22 AM, David Brown wrote:
    On 24/06/2026 23:34, BGB wrote:
    On 6/24/2026 3:17 PM, John Levine wrote:
    According to David Brown  <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish.  Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address.  One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined.  It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:
        char a[100];

        a[0] = 42;
        memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't.  This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    This one is why I added a "_memlzcpy()" function to my C library, whose main purpose is to give this sort of self-overlapping copy behavior (and to consolidate nearly every LZ77 style decompressor otherwise needing to supply their own version).

    In the case of a short backwards copy, it will call "memmove()", but as noted the behavior in the case of a short forwards copy is different.

    For non-overlap cases it can just invoke "memcpy()".

    "memmove" will not fill the array above with 42.  "memmove" acts as though it copies the source to a temporary buffer, then copies that temporary buffer to the destination.  (If you want to fill the buffer with the value 42, "memset" is the function to use.)


    Yeah, this is why I created "_memlzcpy()", because the defined behavior for "memmove()" is not what one wants for self-overlapping forward copy.


    How is your "_memlzcpy" defined that is different from that?
    Here:
       _memlzcpy(dst+1, dst, len);
    Is functionally equivalent to:
       memset(dst+1, *dst, len);

    But, it can do more:
       _memlzcpy(dst+2, dst, len);  //repeating 2-byte pattern
       _memlzcpy(dst+3, dst, len);  //repeating 3-byte pattern
       ...

    So, required to work for every self-overlap distance.


    Or, in the case as commonly used in an LZ77 style decompressor:
       _memlzcpy(dest, dest-distance, length);


    Though, there are also:
       _memcpyf()
       _memmovef()
       _memlzcpyf()

    Where the 'f' in this case means:
    Allowed to be a little faster by potentially going up to 32 bytes extra.

    I'm guessing you really meant up to 31 bytes extra?

    This is what my own (faster than Google's version) LZ4 decompressor uses internally.

    I am using either a pair of SSE or a single AVX register (so 32 bytes in both cases) as the copy granule. For the specific, very common, case of an overlapping copy that unrolls RLL-encoded data, I start by loading the starting pattern into the bottom of a register, then use the pattern length to index into a table of swizzle patterns that will generate the required results, for any pattern up to 32 bytes long.

    swizzle_table:

    [0,0,0,0,0,0,0,...
    [0,1,0,1,0,1,0,1,...
    [0,1,2,0,1,2,0,1,2,...
    [0,1,2,3,0,1,2,3,...
    [0,1,2,3,4,0,1,2,3,..

    etc.

    Note that having 31 entries of 32 bytes each means that I'm allocating almost a KB of $L1 cache space just for this table, but when you're decompressing lots of data it pays off.

    Terje


    If I had 256-bit (32-byte) registers I would like to have LDV (Load Variable) and STV (Store Variable)
    instructions, which take an address, a src/dst SIMD register, and either a scalar register
    or an immediate byte count in the range 0..32. LDV loads the specified number of bytes into
    the SIMD register starting at the least significant byte and zero-fills any unread ones.
    These should be relatively easy to implement if one already has unaligned SIMD LD/ST.

    One might also consider LDBV/STBV for variable-length bit vectors of 0 to 256 bits.
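
    A software model of what these byte-count instructions would do, as a sketch only: Vec256, ldv and stv are hypothetical names, byte 0 of the array plays the role of the least significant byte, and the STV behaviour (store only the low count bytes, leave the rest of memory untouched) is an assumption, since only LDV is described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of a 256-bit register; b[0] is the least significant byte. */
typedef struct { unsigned char b[32]; } Vec256;

/* LDV model: load count bytes (0..32) and zero-fill the remainder. */
static Vec256 ldv(const void *addr, size_t count)
{
    Vec256 v;
    assert(count <= sizeof v.b);
    memset(v.b, 0, sizeof v.b);
    memcpy(v.b, addr, count);
    return v;
}

/* STV model (assumed semantics): store only the low count bytes. */
static void stv(void *addr, Vec256 v, size_t count)
{
    assert(count <= sizeof v.b);
    memcpy(addr, v.b, count);
}
```

    With these, the tail of a copy needs no overshoot at all: the last granule is moved with ldv/stv using the remaining byte count.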




    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jun 25 15:17:52 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.

    The Burroughs B3500 and its medium systems successors, which were
    memory-to-memory architectures, had a number of move instructions,
    several of which had architecturally defined semantics for
    overlapping source and destination fields, which included
    functionality similar to that you describe above.

    MVR (Move Repeat) was the one most commonly used to fill
    a single value into multiple memory locations (where
    the value could be from one to 100 digits and the repeat count
    between 1 and 100).

    The remaining move instructions MVA (Move Alpha - i.e. bytes)
    MVD (Move Data), MVW (Move Words) and
    MVN (Move Numeric) had defined semantics
    for some cases of overlapping operands that could result in
    "smearing" a store over a large region of memory or
    repeating a digit throughout the receiving operand.

    For example, the overlap behavior for MVW was:

    "When the final B address is less than the final A
    address and the fields partially overlap, the source
    data field will be shifted by that number of digits
    to the left. When the B data field partially overlaps
    the A data field and B is greater than A, repeat the data from
    the A address to the B address throughout the destination
    data field. The B data field may totally overlap the A
    data field"

    The overlap behavior for MVC (Move and Clear) could be
    used to right justify the A data in the B field with
    the destination filled with leading zeros or shift
    the data to the left depending on the relationship
    between A and B.

    All other overlap results were dependent upon the
    generation of processor and could not be relied upon
    between generations.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Jun 25 16:26:43 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> schrieb:

    "memmove" will not fill the array above with 42. "memmove" acts as
    though it copies the source to a temporary buffer, then copies that temporary buffer to the destination. (If you want to fill the buffer
    with the value 42, "memset" is the function to use.)

    This is what Fortran does for array assignment. From the language
    definition, the right-hand side of an assignment is evaluated
    completely, then the value is assigned to the left-hand side.
    So, from the language definition,

    a = a + 1.0

    is something like, assuming a suitable declaration for tmp,

    allocate (tmp(size(a)))
    tmp = a + 1.0
    a = tmp
    deallocate (tmp)

    and a compiler is free to do that. However, for efficiency
    reasons, a compiler writer is well-advised to detect this
    case and make it into a simple loop.

    A lot of tricks can be played with dependency checking, loop
    reversal etc.
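
    As an illustration of the loop-reversal trick, here is a sketch in C (not Fortran, and not any particular compiler's output) for a shifted assignment such as a(2:n) = a(1:n-1). Unlike a = a + 1.0, where each element depends only on itself, a forward loop here would overwrite values before reading them. The function names shift_with_temp and shift_reversed are made up for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reference semantics: evaluate the whole right-hand side into a
   temporary, then assign. Returns 0 on success, -1 if malloc fails. */
static int shift_with_temp(double *a, size_t n)
{
    if (n < 2)
        return 0;
    double *tmp = malloc((n - 1) * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, a, (n - 1) * sizeof *tmp);      /* tmp = a(1:n-1) */
    memcpy(a + 1, tmp, (n - 1) * sizeof *tmp);  /* a(2:n) = tmp   */
    free(tmp);
    return 0;
}

/* Same result with no temporary: iterate from the top down so each
   element is read before it is overwritten. */
static void shift_reversed(double *a, size_t n)
{
    for (size_t i = n; i-- > 1; )
        a[i] = a[i - 1];
}
```

    Dependency checking is what tells the compiler which of the three forms (forward loop, reversed loop, temporary) is safe for a given assignment.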
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jun 25 18:50:27 2026
    From Newsgroup: comp.arch

    EricP wrote:
    On 2026-Jun-25 08:39, Terje Mathisen wrote:
    BGB wrote:
    On 6/25/2026 2:22 AM, David Brown wrote:
    On 24/06/2026 23:34, BGB wrote:
    On 6/24/2026 3:17 PM, John Levine wrote:
    According to David Brown  <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish.  Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address.  One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined.  It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to
    write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

        char a[100];

        a[0] = 42;
        memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't.  This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.

    This one is why I added a "_memlzcpy()" function to my C library,
    whose main purpose is to give this sort of self-overlapping copy
    behavior (and to consolidate nearly every LZ77 style decompressor
    otherwise needing to supply their own version).

    In the case of a short backwards copy, it will call "memmove()",
    but as noted the behavior in the case of a short forwards copy is
    different.

    For non-overlap cases it can just invoke "memcpy()".

    "memmove" will not fill the array above with 42.  "memmove" acts
    as though it copies the source to a temporary buffer, then copies
    that temporary buffer to the destination.  (If you want to fill
    the buffer with the value 42, "memset" is the function to use.)


    Yeah, this is why I created "_memlzcpy()", because the defined
    behavior for "memmove()" is not what one wants for self-overlapping
    forward copy.


    How is your "_memlzcpy" defined that is different from that?
    Here:
       _memlzcpy(dst+1, dst, len);
    Is functionally equivalent to:
       memset(dst+1, *dst, len);

    But, it can do more:
       _memlzcpy(dst+2, dst, len);  //repeating 2-byte pattern
       _memlzcpy(dst+3, dst, len);  //repeating 3-byte pattern
       ...

    So, required to work for every self-overlap distance.


    Or, in the case as commonly used in an LZ77 style decompressor:
       _memlzcpy(dest, dest-distance, length);


    Though, there are also:
       _memcpyf()
       _memmovef()
       _memlzcpyf()

    Where the 'f' in this case means:
    Allowed to be a little faster by potentially going up to 32 bytes extra.

    I'm guessing you really meant up to 31 bytes extra?

    This is what my own (faster than Google's version) LZ4 decompressor
    uses internally.

    I am using either a pair of SSE or a single AVX register (so 32 bytes
    in both cases) as the copy granule. For the specific, very common, case
    of an overlapping copy that unrolls RLL-encoded data, I start by
    loading the starting pattern into the bottom of a register, then use
    the pattern length to index into a table of swizzle patterns that will
    generate the required results, for any pattern up to 32 bytes long.

    swizzle_table:

    [0,0,0,0,0,0,0,...
    [0,1,0,1,0,1,0,1,...
    [0,1,2,0,1,2,0,1,2,...
    [0,1,2,3,0,1,2,3,...
    [0,1,2,3,4,0,1,2,3,..

    etc.

    Note that having 31 entries of 32 bytes each means that I'm allocating
    almost a KB of $L1 cache space just for this table, but when you're
    decompressing lots of data it pays off.
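
    A scalar sketch of how such a table could be built and applied,
    assuming row p-1 holds i mod p for pattern length p (which matches the
    rows quoted above) and that lengths 1..31 need a row while 32 is a
    plain copy. The names swizzle_table, swizzle_init and swizzle_apply
    are made up for illustration; a real decompressor would presumably
    feed the row to a byte-shuffle instruction instead of looping:

```c
#include <stdint.h>
#include <stddef.h>

#define GRANULE     32   /* copy granule in bytes */
#define MAX_PATTERN 31   /* pattern lengths 1..31 get a table row */

static uint8_t swizzle_table[MAX_PATTERN][GRANULE];

/* Row p-1 lists, for each output byte, which pattern byte to take. */
static void swizzle_init(void)
{
    for (size_t p = 1; p <= MAX_PATTERN; p++)
        for (size_t i = 0; i < GRANULE; i++)
            swizzle_table[p - 1][i] = (uint8_t)(i % p);
}

/* Expand a p-byte pattern (1 <= p <= 31) into one 32-byte granule. */
static void swizzle_apply(uint8_t out[GRANULE], const uint8_t *pattern, size_t p)
{
    const uint8_t *row = swizzle_table[p - 1];
    for (size_t i = 0; i < GRANULE; i++)
        out[i] = pattern[row[i]];
}
```

    With this layout the table is 31 * 32 = 992 bytes, which agrees with
    the "almost a KB" figure.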


    If I had 256b, 32B registers I would like to have LDV Load Variable and
    STV Store Variable instructions, which take an address, a src/dst simd
    register, and either a scalar register or immediate byte count in the
    range 0..32. LDV loads the specified number of bytes into the simd
    starting at the least significant byte and zero-fills any unread ones.
    These should be relatively easy to implement if one already has
    unaligned SIMD LD/ST.
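
    The intended semantics can be modelled in plain C. This is only a
    sketch of the behavior described above: the vec256 type and the
    function names ldv/stv are illustrative, byte 0 of the array stands
    for the least significant byte, and the STV behavior (store only the
    low `count` bytes) is an assumption since the post only spells out LDV:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[32]; } vec256;   /* b[0] = least significant byte */

/* LDV: load `count` bytes (0..32) from addr, zero-fill the rest. */
static vec256 ldv(const void *addr, size_t count)
{
    vec256 v;
    memset(v.b, 0, sizeof v.b);
    if (count > sizeof v.b)
        count = sizeof v.b;
    memcpy(v.b, addr, count);
    return v;
}

/* STV (assumed): store only the low `count` bytes (0..32) to addr. */
static void stv(void *addr, vec256 v, size_t count)
{
    if (count > sizeof v.b)
        count = sizeof v.b;
    memcpy(addr, v.b, count);
}
```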

    One might also consider LDBV/STBV variable length bit vectors 0 to 256b,

    We do have that, in the form of a masked move, but it is more efficient
    to simply use the regular unaligned store op.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 17:13:04 2026
    From Newsgroup: comp.arch


    Andy Valencia <vandys@vsta.org> posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:
    Usual downside is that the excessive parentheses tend to turn into a
    usability issue.
    Ample fun has been made of this over time.

    From rec.humor.funny:

    From: jasmerb@mist.cs.orst.edu (Bryce Jasmer)
    Newsgroups: rec.humor.funny
    Subject: The Strategic Defense Initiative (SDI/Star Wars)
    Keywords: computer, funny
    Message-ID: <137457@looking.on.ca>
    Date: 23 Apr 90 10:30:08 GMT
    Sender: funnyr@looking.on.ca
    Posted: Mon Apr 23 11:30:08 1990
    Reply-Path: mist.cs.orst.edu!jasmerb

    Through some clever security hole manipulation if I have been able to
    break into all of the government's computers and acquire the Lisp code
    to SDI. Here is the last page (tail -10) of it to prove that I actually
    have the code:

    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))


    I remember the LISP on PDP-8. One could use the character ] to mean as many
    )s as needed to close the lambda.

    Andy Valencia
    Home page: https://www.vsta.org/andy/
    To contact me: https://www.vsta.org/contact/andy.html
    No AI was used in the composition of this message
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 17:14:28 2026
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    Why not::

    memset( a, 42, 100 );

    ?????

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 17:20:01 2026
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 24/06/2026 23:34, BGB wrote:
    ---------------------
    "memmove" will not fill the array above with 42. "memmove" acts as
    though it copies the source to a temporary buffer, then copies that temporary buffer to the destination. (If you want to fill the buffer
    with the value 42, "memset" is the function to use.)

    Act as though it copies twice is utterly unnecessary as overlapping
    memory can simply be performed back-to-front instead of front-to-back.

    How is your "_memlzcpy" defined that is different from that?

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Jun 25 20:45:26 2026
    From Newsgroup: comp.arch

    On 25/06/2026 19:20, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 24/06/2026 23:34, BGB wrote:
    ---------------------
    "memmove" will not fill the array above with 42. "memmove" acts as
    though it copies the source to a temporary buffer, then copies that
    temporary buffer to the destination. (If you want to fill the buffer
    with the value 42, "memset" is the function to use.)

    Act as though it copies twice is utterly unnecessary as overlapping
    memory can simply be performed back-to-front instead of front-to-back.


    You are mixing up "act as though" it does something, and implementing it
    that way. memmove implementations will typically figure out if they can
    work as a forwards loop or a backwards loop, and do that. For moves
    that are big enough to be worth the effort, they'll do it using big
    lumps (64 bit, or bigger if that is more efficient) and then handle any
    last few bytes individually. If the overlap is closer than a "lump",
    more effort is needed. As Thomas said in reference to Fortran array assignment, there are lots of tricks possible that give faster results
    than the simple forward or backwards byte copying.

    But however it is implemented, the result is the same as you would get
    by copying to a temporary area.
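
    The forwards-or-backwards decision can be sketched with a plain byte
    loop. This is a minimal illustration, not how any particular libc
    implements it (real versions copy in larger lumps as described above);
    the name sketch_memmove is made up, and the pointers are compared as
    integers to sidestep comparing pointers into unrelated objects:

```c
#include <stddef.h>
#include <stdint.h>

static void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d <= (uintptr_t)s || (uintptr_t)d >= (uintptr_t)s + n) {
        /* dst is below src, or the regions do not overlap: forwards. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* dst lies inside the source region: copy back-to-front so no
           source byte is overwritten before it has been read. */
        for (size_t i = n; i-- > 0; )
            d[i] = s[i];
    }
    return dst;
}
```

    Either way the result matches what a copy through a temporary buffer
    would give, so a[0] = 42 followed by a move to a+1 shifts the old
    contents up by one rather than filling the array with 42.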


    How is your "_memlzcpy" defined that is different from that?


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Jun 25 19:19:47 2026
    From Newsgroup: comp.arch

    According to MitchAlsup <user5857@newsgrouper.org.invalid>:

    John Levine <johnl@taugh.com> posted:

    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    Why not::

    memset( a, 42, 100 );

    Jeez, it's an example.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 14:55:28 2026
    From Newsgroup: comp.arch

    On 6/25/2026 7:39 AM, Terje Mathisen wrote:
    BGB wrote:
    On 6/25/2026 2:22 AM, David Brown wrote:
    On 24/06/2026 23:34, BGB wrote:
    On 6/24/2026 3:17 PM, John Levine wrote:
    According to David Brown  <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish.  Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address.  One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined.  It is debatable as to whether the
    rules in the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to
    write code that
    is intuitively reasonable and sometimes works but isn't portable,
    e.g.:

        char a[100];

        a[0] = 42;
        memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't.  This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


    This one is why I added a "_memlzcpy()" function to my C library,
    whose main purpose is to give this sort of self-overlapping copy
    behavior (and to consolidate nearly every LZ77 style decompressor
    otherwise needing to supply their own version).

    In the case of a short backwards copy, it will call "memmove()", but
    as noted the behavior in the case of a short forwards copy is
    different.

    For non-overlap cases it can just invoke "memcpy()".

    "memmove" will not fill the array above with 42.  "memmove" acts as
    though it copies the source to a temporary buffer, then copies that
    temporary buffer to the destination.  (If you want to fill the
    buffer with the value 42, "memset" is the function to use.)


    Yeah, this is why I created "_memlzcpy()", because the defined
    behavior for "memmove()" is not what one wants for self-overlapping
    forward copy.


    How is your "_memlzcpy" defined that is different from that?
    Here:
       _memlzcpy(dst+1, dst, len);
    Is functionally equivalent to:
       memset(dst+1, *dst, len);

    But, it can do more:
       _memlzcpy(dst+2, dst, len);  //repeating 2-byte pattern
       _memlzcpy(dst+3, dst, len);  //repeating 3-byte pattern
       ...

    So, required to work for every self-overlap distance.


    Or, in the case as commonly used in an LZ77 style decompressor:
       _memlzcpy(dest, dest-distance, length);


    Though, there are also:
       _memcpyf()
       _memmovef()
       _memlzcpyf()

    Where the 'f' in this case means:
    Allowed to be a little faster by potentially going up to 32 bytes extra.

    I'm guessing you really meant up to 31 bytes extra?


    Yeah, off by 1 error.


    This is what my own (faster than Google's version) LZ4 decompressor uses internally.

    I am using either a pair of SSE or a single AVX register (so 32 bytes in
    both cases) as the copy granule. For the specific, very common, case of
    an overlapping copy that unrolls RLL-encoded data, I start by loading
    the starting pattern into the bottom of a register, then use the pattern
    length to index into a table of swizzle patterns that will generate the
    required results, for any pattern up to 32 bytes long.

    swizzle_table:

    [0,0,0,0,0,0,0,...
    [0,1,0,1,0,1,0,1,...
    [0,1,2,0,1,2,0,1,2,...
    [0,1,2,3,0,1,2,3,...
    [0,1,2,3,4,0,1,2,3,..

    etc.

    Note that having 31 entries of 32 bytes each means that I'm allocating almost a KB of $L1 cache space just for this table, but when you're decompressing lots of data it pays off.


    Not using my own C library on x86-64, ... usually (depending on context) MSVCRT or glibc or similar.



    In my case it is typically using 64-bit copies, but for better pipeline utilization it is better to copy in groups of 4x 64-bit loads/stores, or
    32 bytes.

    Well, and a 0,2,1,3 order on my core; but this is mostly a benefit if
    the pointer is aligned on a 16-byte boundary (can potentially avoid some internal penalties within the L1 cache).


    But, yeah, I can use this for both LZ4 and RP2 compression, which are typically the main two that I use.

    Some common properties:
      Both byte-oriented designs that allow for fast decompressors.
    Some different properties:
      LZ4 usually does slightly better for program binaries;
      RP2 usually does better for general data;
      LZ4 is usually faster on OoO machines;
      RP2 is usually faster on in-order.
    LZ4 limits:
      Distance: 64K
      Literal Length: Unbounded
      Match Length: Unbounded
    RP2 limits (typical):
      Distance: 128K
      Literal Length: Unbounded
      Match Length: 516 bytes

    There exist variants of RP2 which allow larger limits, but for many
    use-cases, the version with these limits makes the most sense. A newer
    variant added a different mechanism for long-distance matching (the
    original mechanism wasn't ideal in some ways), and some special cases to
    help with long RLE runs (a long-match / very-short-distance case), but
    these aren't really used much.

    As noted, the design of RP2 was a mutation of the EA RefPack design, but
    with bits moved around to work better for a little endian (the original
    design seemingly assumed big endian). Also typically traded 1 bit of
    match length for 1 bit of literal length (literal runs typically 0..7 vs 0..3).

    Did investigate at one point whether it might have been better to trade
    1 bit on distance instead, but testing seemed to confirm taking it from
    match length as the correct choice.

    In both cases, there is a presumed positive correlation between match
    length and distance, unlike LZ4 where they are uncorrelated (distance
    always 16 bits).

    ...



    Had also experimented with bolt-on post-compressors for RP2:
      STF + AdRice:
        Can push compression similar to that of Deflate
        It tends to have an advantage for small payloads.
        But, relative gains diminish with payload size.
        Speed drops into Deflate-like areas.
      Range Coder:
        Compression increases to LZMA like areas;
        Speed decreases to LZMA like areas.

    No Huffman or ANS, but:
    Huffman would be a more complex mechanism to apply as a bolt-on post compressor, and isn't likely to see a significant compression delta.

    ANS seems very weird / convoluted;
    Extant implementations I have looked at don't seem to live up to claims
    of extreme compression at high speeds; most show both speed and
    compression that seem to fall between Huffman and Range-Coding;
    Has quirks that would pose problems for use as a simple post-compressor.


    The idea being that one first encodes as RP2 normally, and then checks
    if a post compressor could give an acceptable level of additional
    compression. This avoids wasting the speed-cost of more expensive post-encoders on data which does not significantly benefit (and entropy
    coding isn't always the win one might think it is).

    The post-compressor effectively sorts out bytes into various categories
    and runs them through the corresponding entropy context.

    Decompression could be done 2-pass, but usually faster to do a combined decoder.

    Might seem like an overly jank approach, but worked well in testing...



    STF + AdRice is used in some of my compressors:

    STF: Each time a symbol is encoded, swap it towards the front of the
    list. The symbols are encoded as their positions in the list.
    Typical swapping is with the value 7/8 or 15/16 the current index (7/8
    is better for short payloads, 15/16 or 31/32 for long payloads).
    Initial state is typically the bytes from 00 to FF in order, but
    sometimes differs for some uses. For a byte context, needs 256 bytes of storage per context for decoding (encoding typically needs 512, for a reverse-lookup table). Main premise is to turn raw bytes into something
    that Rice coding can use.
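
    A minimal sketch of the swap-towards-front step as described, assuming
    the 7/8 swap target and the 00..FF initial order. The struct and
    function names (stf_ctx, stf_init, stf_encode, stf_decode) are
    illustrative and not taken from the actual compressor; the encoder
    side keeps the extra 256-byte reverse-lookup table mentioned above:

```c
#include <stdint.h>

typedef struct {
    uint8_t sym[256];   /* list position -> symbol (decoder needs this) */
    uint8_t pos[256];   /* symbol -> list position (encoder reverse map) */
} stf_ctx;

static void stf_init(stf_ctx *c)
{
    for (unsigned i = 0; i < 256; i++) {
        c->sym[i] = (uint8_t)i;
        c->pos[i] = (uint8_t)i;
    }
}

/* Swap the entry at index i with the one at 7/8 of that index. */
static void stf_swap(stf_ctx *c, unsigned i)
{
    unsigned j = (i * 7) >> 3;
    uint8_t a = c->sym[i], b = c->sym[j];
    c->sym[i] = b;
    c->sym[j] = a;
    c->pos[a] = (uint8_t)j;
    c->pos[b] = (uint8_t)i;
}

/* Encoder: emit the symbol's current list position, then update. */
static uint8_t stf_encode(stf_ctx *c, uint8_t s)
{
    unsigned i = c->pos[s];
    stf_swap(c, i);
    return (uint8_t)i;
}

/* Decoder: map a list position back to its symbol, then update. */
static uint8_t stf_decode(stf_ctx *c, uint8_t idx)
{
    uint8_t s = c->sym[idx];
    stf_swap(c, idx);
    return s;
}
```

    Frequently used symbols drift towards small indices, which is the
    skewed distribution a Rice coder wants.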

    AdRice:
      Adaptive variant of Rice coding.
      Encodes Q as a unary coded prefix, followed by K bit suffix;
        Q=Val>>K, Suf=Val&((1<<K)-1)
      Except for an escape-case, where if Q>7:
        Encode an 8-bit prefix, and raw 8-bit index.
      State update:
        Typical variant:
          Q=0: If K>0, decrement K
          Q=1: Leave K as-is
          Q>1: If K<7, increment K
        Alt variant:
          K is understood as having a 3 bit fraction.
          Update instead increments the fraction,
          so major K update happens more slowly.
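
    The typical variant can be sketched as a small bit-level encoder and
    decoder. Several details here are assumptions, not taken from the
    post: the unary prefix is written as Q one-bits followed by a zero,
    the escape prefix is eight one-bits with no terminator, bits are
    packed LSB-first, and all names (bitio, adrice_encode, adrice_decode)
    are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Bit cursor over a byte buffer; the buffer must start zeroed for writing. */
typedef struct { uint8_t *buf; size_t pos; } bitio;

static void put_bit(bitio *b, int bit)
{
    if (bit)
        b->buf[b->pos >> 3] |= (uint8_t)(1u << (b->pos & 7));
    b->pos++;
}

static int get_bit(bitio *b)
{
    int bit = (b->buf[b->pos >> 3] >> (b->pos & 7)) & 1;
    b->pos++;
    return bit;
}

static void put_bits(bitio *b, unsigned v, int n)
{
    for (int i = 0; i < n; i++)
        put_bit(b, (int)((v >> i) & 1));
}

static unsigned get_bits(bitio *b, int n)
{
    unsigned v = 0;
    for (int i = 0; i < n; i++)
        v |= (unsigned)get_bit(b) << i;
    return v;
}

/* State update of the typical variant. */
static int adapt_k(int k, unsigned q)
{
    if (q == 0) return k > 0 ? k - 1 : k;
    if (q == 1) return k;
    return k < 7 ? k + 1 : k;
}

/* Encode one value in 0..255 and update K. */
static void adrice_encode(bitio *b, int *k, unsigned val)
{
    unsigned q = val >> *k;
    if (q > 7) {                      /* escape: 8-bit prefix + raw byte */
        put_bits(b, 0xFF, 8);
        put_bits(b, val, 8);
    } else {
        for (unsigned i = 0; i < q; i++)
            put_bit(b, 1);            /* unary prefix */
        put_bit(b, 0);
        put_bits(b, val & ((1u << *k) - 1), *k);
    }
    *k = adapt_k(*k, q);
}

/* Decode one value and apply the same K update as the encoder. */
static unsigned adrice_decode(bitio *b, int *k)
{
    unsigned q = 0, val;
    while (q < 8 && get_bit(b))
        q++;
    if (q == 8)
        val = get_bits(b, 8);         /* escape case */
    else
        val = (q << *k) | get_bits(b, *k);
    *k = adapt_k(*k, q);
    return val;
}
```

    Both sides see Q>1 in the escape case, so K stays in sync without any
    side information.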

    The alternate variant's update rules can help for compression in some use-cases at the expense of others. It comes at a minor speed cost in
    some cases, as the typical variant can move all of the short cases into
    a lookup table, operating more like a table-driven Huffman decoder, but
    the fractional K doesn't map well to expressing the updated K via a
    combined lookup table.

    The merit is that slowing down the K adaptation causes K to more often
    be at the optimal value, whereas with the simple case it is typically
    off by 1.


    As for Huffman, it poses a frequent problem:
    At a 15 or 16 bit symbol length, the needed lookup table to do a whole
    symbol at once does not fit in L1 cache and often has a poor L1 hit rate;
    At 12 or 13 bits, a single lookup strategy is faster, but compression
    suffers (and mostly loses the compression advantage over STF+AdRice,
    while still having higher cache pressure).

    A partial table, say the first 8 bits, with fallback for the rest, can
    work, but is also a speed penalty.
    It also poses a problem for small payloads in that filling the lookup
    table is often a significant time penalty.
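
    The cache argument is simple arithmetic. As a rough sketch, assuming
    2 bytes per table entry (symbol plus code length) and a 32 KiB L1
    data cache, both of which are assumptions rather than figures from
    this post:

```c
#include <stddef.h>

/* Bytes needed for a single-lookup decode table indexed by max_bits bits. */
static size_t huff_table_bytes(unsigned max_bits, size_t entry_bytes)
{
    return ((size_t)1 << max_bits) * entry_bytes;
}
```

    Under those assumptions a 15-bit table needs 64 KiB and a 16-bit one
    128 KiB, both larger than the assumed L1, while 12 or 13 bits need
    8 KiB or 16 KiB, and an 8-bit first-level table only 512 bytes.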

    ...


    There are some claims of formats with "extreme compression at amazing
    speeds" from some companies (like, Deflate-like compression at LZ4 like speeds), but given I don't have them, I can't test anything myself.


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 14:58:36 2026
    From Newsgroup: comp.arch

    On 6/25/2026 12:20 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 24/06/2026 23:34, BGB wrote:
    ---------------------
    "memmove" will not fill the array above with 42. "memmove" acts as
    though it copies the source to a temporary buffer, then copies that
    temporary buffer to the destination. (If you want to fill the buffer
    with the value 42, "memset" is the function to use.)

    Act as though it copies twice is utterly unnecessary as overlapping
    memory can simply be performed back-to-front instead of front-to-back.


    Yes, this is how memmove is done in practice IME.
      Forwards copy:
        Do it end-to-front (backwards)
      Backwards copy:
        Do it front-to-end (forwards)

    For what memmove is intended to do, it works well, but is not always
    what someone needs.

    And, for whatever reason, the people developing the C standards
    seemingly didn't feel "a copy function that explicitly produces
    repeating N byte patterns in the case of self-overlap" to be a priority...
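
    The behavior being asked for is what a strictly front-to-back byte
    loop gives. A minimal sketch, under the assumption that this matches
    the intended semantics; the name lz_copy_sketch is made up, and this
    is not the actual "_memlzcpy()" implementation (which, as described
    earlier, dispatches to memmove or memcpy where those suffice):

```c
#include <stddef.h>

/* Copy len bytes front-to-back, one byte at a time.  When dst is
   `distance` bytes above src and the regions overlap, bytes written
   earlier in the loop are read again later, so the first `distance`
   bytes repeat as a pattern. */
static void *lz_copy_sketch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < len; i++)
        d[i] = s[i];
    return dst;
}
```

    With a distance of 1 this fills the destination with one byte value,
    and with a distance of 2 or 3 it repeats a 2-byte or 3-byte pattern,
    which is the case an LZ77 match copy relies on.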


    How is your "_memlzcpy" defined that is different from that?


    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Jun 25 18:07:15 2026
    From Newsgroup: comp.arch

    On 6/25/2026 12:13 PM, MitchAlsup wrote:

    Andy Valencia <vandys@vsta.org> posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    BGB <cr88192@gmail.com> schrieb:
    Usual downside is that the excessive parentheses tend to turn into a
    usability issue.
    Ample fun has been made of this over time.

    From rec.humor.funny:

    From: jasmerb@mist.cs.orst.edu (Bryce Jasmer)
    Newsgroups: rec.humor.funny
    Subject: The Strategic Defense Initiative (SDI/Star Wars)
    Keywords: computer, funny
    Message-ID: <137457@looking.on.ca>
    Date: 23 Apr 90 10:30:08 GMT
    Sender: funnyr@looking.on.ca
    Posted: Mon Apr 23 11:30:08 1990
    Reply-Path: mist.cs.orst.edu!jasmerb

    Through some clever security hole manipulation if I have been able to
    break into all of the government's computers and acquire the Lisp code
    to SDI. Here is the last page (tail -10) of it to prove that I actually
    have the code:

    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
    ))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))


    I remember the LISP on PDP-8. One could use the character ] to mean as many )s as needed to close the lambda.


    I had ended up in a custom dialect using [] and {} for other things, say:
    (+ x y) //ye olde
    [1 2 3] //sorta like #<1 2 3>
    {:x 1 :y 2} //associative dictionary/object
    {x: 1 y: 2} //basically the same

    ":x" and "x:" were both understood as keywords (like in CL), sometimes
    but not always interchangeable (both were treated the same for most
    runtime tasks, but "(eq? :x x:)" would be false, as they were considered
    as distinct sub-types of keyword, mostly differentiated by remembering
    which side the colon went on).

    Though, these would evaluate their operands, so more like:
    [1 2 3] => (vector 1 2 3)
    {x: 1 y: 2} => (object x: 1 y: 2)
    Where, say:
    (define obj {x: 1 y: 2})
    (obj :z (+ (obj :x) (obj :y)))
    Would add a "z" member holding 3.

    But, say:
    (obj foo: 1 2 3)
    Would call a 'foo' method on the object with 1, 2, 3.

    (define obj2 {x: 1 y: 2 bar: (lambda () (+ x y))})

    Or, something to this effect...

    Also, IIRC, members starting with '$' would delegate, so like:
    (define obj1 {x: 1 y: 2})
    (define obj2 {z: 3 $up: obj1})

    (obj2 :y) => 2 (via the $up) member (there could be multiple).


    Though, this language wasn't pure by any means, sort of a Scheme /
    Common Lisp / Self hybrid.


    The original BGBScript language continued on with a similar model (*1),
    just replacing "$up" with "_up_" or similar. Also the lookup process
    would track where it had been, so cycles would not explode the lookup.

    Also there was a big hash table to track object/member lookups, so often lookups could remain (moderately) fast (by the standards of that era for
    a script VM).


    *1: Well, after the disastrously slow first BGBScript VM the second
    reused the first language (and its VM) as the core (throwing the JS like syntax on top).


    Seemed like a cool/nifty idea at the time, as did using this as the
    logical basis of the entire scoping model. But, when later wanting to
    move to a static-typed core (with type-inference, etc), this stuff came
    back to bite.


    Also went between different tagref formats.

    Say: Early VM:
      (31:3): Address
      ( 2:0): Tag
      Tag:
        000: Object Reference
        100: Cons Cell
        110: Various literal value types.
        x01: Fixnum (30 bits)
        x11: Flonum (30 bits, Binary32 with 2b cut off)

    First BS VM:
      Went over to bare pointers (more C friendly);
      Crammed fixnum and flonum into 24 bit address ranges (sucked).
    Second BSVM:
      Went back to the tagrefs.
      Also a precise GC, but this was a pain.
    Third VM:
      Mostly went back to bare pointers for objects and cons cells;
      Went back to a conservative GC.

    Later, eg:
      Went 64-bit, then ended up moving the VM over to a 64-bit format.
        (63:48): Tag Bits
        (47: 0): Bare Address
      With a tag in the HOBs:
        0000: Object/Etc
        0001: Literal values/etc
        001x: Misc (bounded pointers ATM)
        01xx: Fixnum
        10xx: Flonum
        11xx: ...
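
    A sketch of how such 64-bit tagrefs could be packed and tested. The
    reading of the table is an assumption: the four-bit patterns are
    taken to be bits 63:60, a fixnum is taken to carry a 62-bit
    sign-extended payload below its 01 tag, and every name (tagref,
    tagref_from_fixnum, ...) is illustrative rather than the actual VM's
    API:

```c
#include <stdint.h>

typedef uint64_t tagref;

#define ADDR_MASK    0x0000FFFFFFFFFFFFull      /* bits 47:0 */
#define FIXNUM_BITS  62
#define FIXNUM_MASK  ((1ull << FIXNUM_BITS) - 1)

/* Classification by the high-order bits. */
static int tagref_is_object(tagref r) { return (r >> 60) == 0x0; }  /* 0000 */
static int tagref_is_fixnum(tagref r) { return (r >> 62) == 0x1; }  /* 01xx */
static int tagref_is_flonum(tagref r) { return (r >> 62) == 0x2; }  /* 10xx */

/* Object references: tag bits zero, bare address in the low 48 bits. */
static tagref   tagref_from_address(uint64_t addr) { return addr & ADDR_MASK; }
static uint64_t tagref_address(tagref r)           { return r & ADDR_MASK; }

/* Fixnums: 01 in bits 63:62, two's-complement payload in bits 61:0. */
static tagref tagref_from_fixnum(int64_t v)
{
    return (1ull << 62) | ((uint64_t)v & FIXNUM_MASK);
}

static int64_t tagref_fixnum(tagref r)
{
    uint64_t p = r & FIXNUM_MASK;
    uint64_t sign = 1ull << (FIXNUM_BITS - 1);
    return (int64_t)((p ^ sign) - sign);    /* sign-extend from bit 61 */
}
```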



    Some may or may not recognize this as the format my ISA project is using...

    But, the basic scheme itself originated when the BGBScript VM went 64-bit.

    Well, also the type-tagging notation that BGBCC uses was also partly
    shared with the BGBScript VM.

    Ironically, cons cells and cons lists still exist, sorta, but are not
    really widely used in TestKern.

    Ironically, I had also partly used this typesystem internally in the
    makeshift BASIC dialect, which in another offshoot, started to gain some Lisp-like appendages.


    But, not entirely sure that "Weird mix of Lisp and 1980s style
    unstructured BASIC" is really a direction I want to go in.

    Though, could revive a Lisp style dialect, but with C style loops:
    (while (cond...) (begin
    (if (something) (break))
    ...
    ))
    (let-for (i 0) (< i 10) (++ i) (println "Yeah " i))
    With (break) and (continue).




    Well, and maybe in certain contexts bring back 32-bit tagrefs as a way
    to save memory (though, operating within the limits of a constrained
    heap, rather than public memory, and possibly referencing any external
    objects as handles).


    Could go further (eg, 16 bits), though most non-toy examples are likely
    to either run out of addressable cons-cells, or not actually have enough
    going on to benefit from the 16-bit handles.

    Say (if 16b):
      00: Object Handle
      01: CONS Handle
      10: Fixnum
      11: Misc
        00: Symbols (4K unique symbols)
        01: Keywords
        10: Magic values
        11: ?

    ...


    Though, could make sense for a GLSL compiler, as one is not as likely to
    run out of cons cells when compiling a shader. Fully populated CONS heap
    would be 64K, vs 256K for the same number of cons cells vs the normal
    memory management.

    Then again, the number of AST nodes that BGBCC uses when compiling Doom
    isn't wildly larger than this, so it is very possible that 16K cons
    cells could be enough to compile even a fairly complex GLSL shader...

    Well, or get wacky and use 48-bit cons cells with 24-bit tagrefs (4M
    cons cells, could compile full on C programs with this), hrrm...

    Well, also C would need more than 4K unique symbols, but one does
    generally stay under a 64K symbol limit (for something in a Doom'ish
    size range).

    ...

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 25 10:33:38 2026
    From Newsgroup: comp.arch

    The glibc function ::backtrace can be called at any time, in any context.

    I guess, internally, that backtrace function can signal an exception
    after setting up an appropriate "debugger" that will collect the
    backtrace and return it to the application.

    Then there are the unix context functions that also allow access to
    resources not normally visible to an application - getcontext(2), makecontext(3) and the setjmp/sigsetjmp functions which also
    gather the thread context, including the current stack pointer.

    IIRC these don't need to *look* at the stack, they can limit their work
    to manipulating the stack pointer.


    === Stefan
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jun 25 23:46:20 2026
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    /**
     * Log a simulator stack traceback.
     */
    void
    c_osdep::backtrace(c_logger *lp)
    {
        int    num_frames;
        void  *framelist[100];
        char **strings;

        num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));

    Where does ::backtrace get access to the number of preserved registers
    on the stack and where the return address is on a per subroutine basis ??

    That is: each stack frame is of a different size with return address at a different spot per subroutine.

        strings = ::backtrace_symbols(framelist, num_frames);
        if (strings == NULL) {
            lp->log("Unable to obtain simulator stack traceback: %s\n",
                    strerror(errno));
            return;
        }
        for(int frame=0; frame < num_frames; frame++) {
            lp->log("[%2.2d] %s\n", frame, strings[frame]);
        }
        ::free(strings);
    }

    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Jun 25 20:29:42 2026
    From Newsgroup: comp.arch

    On 6/19/2026 5:09 PM, Scott Lurndal wrote:
    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
    On 6/19/2026 11:59 AM, John Levine wrote:
    According to David Brown <david.brown@hesbynett.no>:
    Possibly the biggest millstone around the neck of computing
    architectures is the C language. ...

    De-facto standards are /always/ albatrosses to some extent. Things are >>>> done that way because things are done that way - processors are designed >>>> to run C (or C-model languages, if you like) because that's what
    existing code is written in, and code is written in C (or similar
    languages, or languages with a VM written in C) because that's how
    existing processors work.

    C killed off every memory model other than flat byte addressed memory.
    Pointers are sort of typed, but any real C program does stuff like this:
    p = (struct foo *) malloc(42 * sizeof(struct foo));

    Fwiw, why all of the casts?

    C and C++ handle void* conversions differently. You must cast
    the malloc result to a pointer of the declared type when using C++.

    Oh well, yeah. I had C on the brain. :^o



    It doesn't hurt to add the cast in C, and may help with documenting
    the intention of the programmer who wrote the code.


  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Jun 26 06:08:15 2026
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    GLIBC has a function to obtain a backtrace at a current point
    in time. This is called in the context of the thread that invokes
    the call. It requires access to the call records on the stack
    in the context of the thread (the glibc functions are backtrace(3)
    and backtrace_symbols(3)).

    /**
    * Log a simulator stack traceback.
    */
    void
    c_osdep::backtrace(c_logger *lp)

    Nit: That is not glibc code, glibc code is C (it would be strange to
    have a C++ runtime library for C...)

    The glibc code can be seen, for example, at

    https://github.com/bminor/glibc/blob/master/debug/backtrace.c
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Jun 26 06:38:48 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    /**
     * Log a simulator stack traceback.
     */
    void
    c_osdep::backtrace(c_logger *lp)
    {
        int num_frames;
        void *framelist[100];
        char **strings;

        num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));

    Where does ::backtrace get access to the number of preserved registers
    on the stack and where the return address is on a per subroutine basis ??

    That is: each stack frame is of a different size with return address at a different spot per subroutine.

    There are several methods.

    Rolling back via the frame pointer is one method, which of course
    incurs overhead.

    Then there's EH frame based stack tracing, which uses DWARF debug
    info that is also used for exception handling. To use this,
    you need to interpret DWARF opcodes. (You also need to interpret
    DWARF opcodes for exception handling. An exception will usually
    cost you thousands of cycles, which is HUGE).

    The latest and greatest for backtrace is probably SFrame (used in
    the Linux kernel, for example). This uses a lookup table to locate
    the current function. See https://sourceware.org/binutils/wiki/sframe .
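
    To make the lookup-table idea concrete, here is a toy sketch in C. It
    only illustrates how a per-function table can supply a different frame
    size and return-address slot for each subroutine. The record layout,
    the names unwind_entry, find_entry and toy_unwind, and the simulated
    word-array stack are all assumptions made up for this example; it is
    not the real SFrame format and not what glibc does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-function unwind record; NOT the real SFrame layout. */
struct unwind_entry {
    uintptr_t start;       /* first PC belonging to the function */
    uintptr_t end;         /* one past the last PC of the function */
    size_t    frame_words; /* frame size in words: caller SP = SP + frame_words */
    size_t    ra_slot;     /* word offset from SP holding the return address */
};

/* Binary search a table sorted by start address for the entry covering pc. */
static const struct unwind_entry *
find_entry(const struct unwind_entry *tab, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < tab[mid].start)
            hi = mid;
        else if (pc >= tab[mid].end)
            lo = mid + 1;
        else
            return &tab[mid];
    }
    return NULL;    /* pc is not inside any known function */
}

/*
 * Walk a simulated stack (an array of words; lower index = inner frame).
 * Records the PC of each frame in out[] and returns the frame count.
 * The walk stops when a PC has no table entry or the stack runs out.
 */
static size_t
toy_unwind(const struct unwind_entry *tab, size_t n,
           const uintptr_t *stack, size_t stack_words,
           uintptr_t pc, size_t sp, uintptr_t *out, size_t max)
{
    assert(tab != NULL && stack != NULL && out != NULL);
    size_t count = 0;
    while (count < max) {
        const struct unwind_entry *e = find_entry(tab, n, pc);
        if (e == NULL)
            break;
        out[count++] = pc;
        if (sp + e->ra_slot >= stack_words)
            break;
        pc = stack[sp + e->ra_slot];   /* return address, per-function slot */
        sp += e->frame_words;          /* pop this function's frame */
    }
    return count;
}
```

    The point of the sketch is that neither the frame size nor the
    return-address position is discovered from the stack itself; both come
    from the table entry selected by the current PC.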

    Hmm... one other question. Would EH frame-based stack unwinding
    (which is now the standard) work with My 66000's safe stack?
    I think not, because it needs the return address, but I may
    be wrong.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 03:24:25 2026
    From Newsgroup: comp.arch

    On 6/23/2026 5:54 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:
    -------------
    In my case, I tended to use more conservative approaches and then only
    optimize based on what can be verified by the compiler within certain
    fundamental assumptions.

    Say:
    Pointer 1 points at a stack array in the local function;
    Pointer 2 was derived from taking the address of a global array;
    Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they
    don't alias with a pointer to a local array;

    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 argument alias, it is a programming error.

    -----------------
    I am reminded of the person, apparently very religious, who some decades
    ago posted to solicit help for reimplementing all of computing (gcc,
    GNU, et cetera) on Biblical principles, because he thought Richard
    Stallman was too atheistic and had tainted his products. I have not
    heard how that went.

    Rick...

    --------

    Sorry if this is way off base, but well...

    What about container_of, or CONTAINING_RECORD?


  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 25 22:18:18 2026
    From Newsgroup: comp.arch

    But if you know that you have a "page of small integers" then you can just
    do address comparisons between them, the Franz Lisp compiler did this.

    Ah, so you're using the leading bits that correspond to that "page of
    integers" as tagbits. You don't really need BiBoP allocation to do
    that, tho.


    === Stefan
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Thu Jun 25 22:23:35 2026
    From Newsgroup: comp.arch

    John Levine [2026-06-25 19:19:47] wrote:
    According to MitchAlsup <user5857@newsgrouper.org.invalid>:
    John Levine <johnl@taugh.com> posted:
    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    Why not::

    memset( a, 42, 100 );

    Jeez, it's an example.

    It's an example, indeed, but it's a pretty bad one since using `memset`
    is more clear, more concise, and actually works, whereas your example
    seems very contrived.


    === Stefan
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Jun 26 14:10:55 2026
    From Newsgroup: comp.arch

    On 26/06/2026 04:23, Stefan Monnier wrote:
    John Levine [2026-06-25 19:19:47] wrote:
    According to MitchAlsup <user5857@newsgrouper.org.invalid>:
    John Levine <johnl@taugh.com> posted:
    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:
    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    Why not::

    memset( a, 42, 100 );

    Jeez, it's an example.

    It's an example, indeed, but it's a pretty bad one since using `memset`
    is more clear, more concise, and actually works, whereas your example
    seems very contrived.


    While a memset would be much better in this case (assuming that is the
    effect the author was aiming for), it is certainly the case that people
    have used memcpy() with overlapping regions and an assumption that it
    copies forward in some way. John's point that "it is easy to write code
    that is intuitively reasonable and sometimes works but isn't portable"
    is to a fair extent independent of the quick example he wrote to
    demonstrate it.

    However, the example does show how John is somewhat inaccurate - and it demonstrates how difficult things are when people write code with
    undefined behaviour.

    It is /not/ intuitively reasonable to write code like that example. But
    it /looks/ like it is reasonable. The critical issue is that it is not
    clear from the code whether the author wants the memset-like behaviour
    that some memcpy implementations would give, where a[] is filled with
    42, or if the author wants the memmove-like behaviour that many other
    memcpy implementations would give (where a[0] remains 42, a[1] gets 42, a[2..99] gets whatever was previously in a[1..98]). What those values
    were depends on any initialisation there was of the rest of a[] - if
    they were not initialised, then memmove() here would also have been UB.
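
    The two outcomes can be shown without invoking any UB. The sketch
    below is only an illustration: copy_forward_bytes and overlap_demo are
    hypothetical names, and the hand-written loop stands in for "a memcpy
    that happens to copy bytes forward", while memmove() shows the other
    behaviour. All of a[] is initialised first, so both calls are well
    defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-at-a-time forward copy. Overlap is well defined here because the
 * loop order is spelled out and no library memcpy is involved. */
static void copy_forward_bytes(char *dst, const char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Apply both copies to identically initialised 8-byte buffers. */
static void overlap_demo(char a[8], char b[8])
{
    static const char init[8] = { 42, 1, 2, 3, 4, 5, 6, 7 };
    memcpy(a, init, 8);                /* no overlap, so memcpy is fine */
    memcpy(b, init, 8);
    copy_forward_bytes(a + 1, a, 7);   /* replicates a[0] through the buffer */
    memmove(b + 1, b, 7);              /* shifts the old contents up by one */
}
```

    After overlap_demo(), a[] holds eight copies of 42 and b[] holds
    42, 42, 1, 2, 3, 4, 5, 6. Which of the two the original author wanted
    is exactly what the memcpy() call fails to say.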

    John is also a bit inaccurate in writing "sometimes works but isn't
    portable" - when you have UB in the code, it's just luck if the end
    results meet your intentions. "Non-portable" code, as I see it, is code
    that does what you want on one target or compiler, but might not do so
    for other targets or compilers. Code with UB is worse - even if your
    code "works" at the moment, apparently unconnected changes to other
    parts of your code, changes to compiler flags, small updates to your
    tools, can all give you an end result that no longer fits your
    intentions. There's nothing wrong with writing non-portable code,
    though it is best to be aware that you are doing so - people do that all
    the time. There is always something wrong with writing code with UB - unfortunately, people do that a lot too.

  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Jun 26 15:08:22 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    John Levine <johnl@taugh.com> posted:

    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    Why not::

    memset( a, 42, 100 );


    In the case of a single repeating byte, memset is of course optimal, but
    the same LZ4 encoding is used to encode any repeating pattern, of
    lengths from 1 and up. There is no indexed memset where the pattern is
    of arbitrary length.
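
    For what it is worth, a "pattern memset" can be written with
    well-defined memcpy() calls only. This is a minimal sketch under my
    own assumptions: fill_pattern is a hypothetical name, the pattern is
    assumed to sit in the first `period` bytes of the buffer already, and
    this is not how any particular LZ4 implementation does its match copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Repeat the first `period` bytes of buf until `total` bytes are filled.
 * Each round copies from the already-filled prefix into the area right
 * after it, so source and destination never overlap.
 */
static void fill_pattern(unsigned char *buf, size_t period, size_t total)
{
    assert(buf != NULL && period > 0);
    size_t filled = period < total ? period : total;
    while (filled < total) {
        size_t chunk = total - filled;
        if (chunk > filled)
            chunk = filled;               /* never read past the filled prefix */
        memcpy(buf + filled, buf, chunk); /* [0,chunk) vs [filled,filled+chunk) */
        filled += chunk;
    }
}
```

    With buf starting as "abc", fill_pattern(buf, 3, 10) yields
    "abcabcabca". The filled prefix doubles each round, so the number of
    memcpy() calls grows with the logarithm of total/period.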

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Jun 26 15:50:49 2026
    From Newsgroup: comp.arch

    On 26/06/2026 15:08, Terje Mathisen wrote:
    MitchAlsup wrote:

    John Levine <johnl@taugh.com> posted:

    According to David Brown  <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address.  One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined.  It is debatable as to whether the
    rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write
    code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

        char a[100];

        a[0] = 42;
        memcpy(a+1, a, 99);

    Why not::

            memset( a, 42, 100 );


    In the case of a single repeating byte, memset is of course optimal, but
    the same LZ4 encoding is used to encode any repeating pattern, of
    lengths from 1 and up. There is no indexed memset where the pattern is
    of arbitrary length.


    I don't think memset is necessarily "optimal", because the optimal
    solution will depend on the number of bytes to fill, and possibly
    alignments, and details of the exact processor. A particular memset implementation could be close to optimal for large blocks, where it is
    worth picking the best algorithm at runtime. And a compiler could pick
    the algorithm details at compile time if it knows the size of the target block. "Optimal" is a strong word.

    I would think that the kind of copying you need for LZ4 is quite
    specialised for that task - a function to do that belongs in LZ4 implementation code rather than as a standard function. Trying to use memcpy() for the task is, however, a recipe for having your name cursed
    by future maintainers!

  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Jun 26 10:11:05 2026
    From Newsgroup: comp.arch

    On 2026-Jun-26 06:24, Chris M. Thomasson wrote:
    On 6/23/2026 5:54 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:
    -------------
    In my case, I tended to use more conservative approaches and then only
    optimize based on what can be verified by the compiler within certain
    fundamental assumptions.

    Say:
        Pointer 1 points at a stack array in the local function;
        Pointer 2 was derived from taking the address of a global array;
        Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they
    don't alias with a pointer to a local array;

    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 argument alias, it is a programming error.

    Sorry if this is way off base, but well...

    What about container_of, or CONTAINING_RECORD?

    If that is what I think it is, where it casts from
    a pointer to a field inside a struct back to the containing struct
    by subtracting the field byte offset and changing the pointer type, irrespective of programming language that mechanism has been used
    by operating systems at least since RSX days.
    It is a compact way of having structs linked to many other structures.

    That macro is just a variant of the mechanism for C.
    The method is used by WinNT and Linux, and I believe also by the BSD's.

    GCC has a compile option, -fno-strict-aliasing, that anyone
    using it and doing "illegal" pointer casting must use.
    In Windows land, pointer casting at least used to be Microsoft's
    recommended method and is supported by their compiler because
    they use it too, extensively.

    I have used it when I had complex multiple linkages between data structures. Say an object is in multiple double linked lists and an index tree and I
    need to cast from a pointer to a list link field back to the object
    containing that link field. I also often put a validity check marker for
    each object type at the start of the container and Assert its correctness.
    The marker is zeroed when the container is destroyed to catch
    any dangling references.
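
    A minimal sketch of the mechanism in C, including the validity marker.
    The names CONTAINER_OF, struct link, struct widget, WIDGET_MAGIC and
    widget_from_age_link are hypothetical and chosen for this example;
    the real container_of and CONTAINING_RECORD macros differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step back from a member pointer to the start of the enclosing struct. */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic doubly linked list link, embedded in the objects it links. */
struct link {
    struct link *next;
    struct link *prev;
};

#define WIDGET_MAGIC 0x57494447u    /* arbitrary marker value */

/* An object that sits on two lists at once via two embedded links. */
struct widget {
    uint32_t    magic;      /* validity marker, zeroed on destruction */
    int         value;
    struct link by_age;
    struct link by_name;
};

/* Recover the widget from a pointer to its by_age link and check it. */
static struct widget *widget_from_age_link(struct link *l)
{
    struct widget *w = CONTAINER_OF(l, struct widget, by_age);
    assert(w->magic == WIDGET_MAGIC);   /* catches stale or wrong-list links */
    return w;
}
```

    Each list needs its own conversion (by_age versus by_name), since the
    offset subtracted depends on which embedded link the pointer refers to.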


  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Jun 26 14:21:10 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    /**
     * Log a simulator stack traceback.
     */
    void
    c_osdep::backtrace(c_logger *lp)
    {
        int num_frames;
        void *framelist[100];
        char **strings;

        num_frames = ::backtrace(framelist, sizeof(framelist)/sizeof(framelist[0]));

    Where does ::backtrace get access to the number of preserved registers
    on the stack and where the return address is on a per subroutine basis ??

    https://elixir.bootlin.com/glibc/glibc-2.43.9000/A/ident/backtrace

    It is processor dependent, of course.
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Jun 26 14:22:29 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    GLIBC has a function to obtain a backtrace at a current point
    in time. This is called in the context of the thread that invokes
    the call. It requires access to the call records on the stack
    in the context of the thread (the glibc functions are backtrace(3)
    and backtrace_symbols(3)).

    /**
    * Log a simulator stack traceback.
    */
    void
    c_osdep::backtrace(c_logger *lp)

    Nit: That is not glibc code, glibc code is C (it would be strange to
    have a C++ runtime library for C...)

    Indeed it is C++ code calling a GLIBC function.

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Jun 26 16:15:06 2026
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    it is certainly the case that people
    have used memcpy() with overlapping regions and an assumption that it
    copies forward in some way.

    More precisely, in 2010 there was a big flamewar because a newer glibc
    used backwards stride on some processors for some combinations of
    source and destination addresses, and this broke a pre-existing binary
    (of a Flash player IIRC). The "solution" was to use memmove for
    memcpy for existing binaries, and use the processor-dependent memcpy
    for new binaries. I heard no complaints about the solution for
    existing binaries.

    This shows that no binaries that link to glibc assumed that dest can
    overlap source with dest>src, and get some kind of replicating
    behaviour (probably because glibc stopped using byte-by-byte copying
    much earlier, if it ever had it at all).

    What the Flash player apparently used is overlapping memcpy with
    dest<src. It worked like memmove before the glibc release that caused
    the flame war, and actually used memmove once the solution was
    implemented.

    A better solution might have been to implement memcpy on the funny
    processors as follows:


    if (prefer_forward_stride(dest,src) || (((uintptr_t)src)-((uintptr_t)dest))<n )
        return memcpy_forward_stride(dest, src, n);
    else
        return memcpy_backward_stride(dest, src, n);

    This would have covered the Flash player usage (and any like it). Of
    course, memmove is only slightly more expensive to implement (you also
    have to cover the case where src<dest<src+n).
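
    For comparison, a byte-wise sketch of the full memmove-style direction
    choice, including the src<dest<src+n case. toy_memmove is a
    hypothetical name; this is an illustration under the assumption that
    converting pointers to uintptr_t gives comparable flat addresses, and
    it is not the glibc implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n bytes, choosing the direction so that overlapping regions are
 * handled. The unsigned subtraction wraps when dest is below src, so the
 * single comparison is true exactly when src <= dest < src + n.
 */
static void *toy_memmove(void *dest, const void *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((uintptr_t)d - (uintptr_t)s < n) {
        /* dest starts inside the source: copy backward, last byte first */
        while (n--)
            d[n] = s[n];
    } else {
        /* dest is below src or the regions are disjoint: copy forward */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dest;
}
```

    The dest<src overlap that the Flash player apparently relied on falls
    into the forward branch here, which is the same case the one-line test
    above is meant to cover.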

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Jun 26 13:51:01 2026
    From Newsgroup: comp.arch

    On 2026-Jun-26 12:15, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    it is certainly the case that people
    have used memcpy() with overlapping regions and an assumption that it
    copies forward in some way.

    More precisely, in 2010 there was a big flamewar because a newer glibc
    used backwards stride on some processors for some combinations of
    source and destination addresses, and this broke a pre-existing binary
    (of a Flash player IIRC).

    Unbroken parts of Flash player existed?


  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 11:33:21 2026
    From Newsgroup: comp.arch

    On 6/25/2026 10:14 AM, MitchAlsup wrote:

    John Levine <johnl@taugh.com> posted:

    According to David Brown <david.brown@hesbynett.no>:
    On 24/06/2026 07:48, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    C requires the compiler to prove that the pointers cannot alias.

    I wish. Actually, by default gcc assumes (i.e., it does not prove)
    that pointers to different types (except char) do not point to the
    same address. One has to turn that off with -fno-strict-aliasing.
    Other C compilers use the same assumption.

    That's the way C is defined. It is debatable as to whether the rules in
    the C standard are ideal ...

    One of the less fortunate things about C is that it is easy to write code that
    is intuitively reasonable and sometimes works but isn't portable, e.g.:

    char a[100];

    a[0] = 42;
    memcpy(a+1, a, 99);

    Why not::

    memset( a, 42, 100 );

    ?????

    Or something akin to, pseudo code:

    struct buffer
    {
        char a[100];
    };

    struct buffer b0 = { '\0' };

    or

    struct buffer b0 = { };

    Try to hold the flames for a little while. Typed it in as is from
    memory. ;^)



    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.


  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 11:36:30 2026
    From Newsgroup: comp.arch

    On 6/26/2026 7:11 AM, EricP wrote:
    On 2026-Jun-26 06:24, Chris M. Thomasson wrote:
    On 6/23/2026 5:54 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:
    -------------
    In my case, I tended to use more conservative approaches and then only
    optimize based on what can be verified by the compiler within certain
    fundamental assumptions.

    Say:
        Pointer 1 points at a stack array in the local function;
        Pointer 2 was derived from taking the address of a global array;
        Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they
    don't alias with a pointer to a local array;

    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 argument alias, it is a programming
    error.

    Sorry if this is way off base, but well...

    What about container_of, or CONTAINING_RECORD?

    If that is what I think it is, where it casts from
    a pointer to a field inside a struct back to the containing struct
    by subtracting the field byte offset and changing the pointer type, irrespective of programming language that mechanism has been used
    by operating systems at least since RSX days.
    It is a compact way of having structs linked to many other structures.

    That macro is just a variant of the mechanism for C.
    The method is used by WinNT and Linux, and I believe also by the BSD's.

    GCC has a compile option, -fno-strict-aliasing, that anyone
    using it and doing "illegal" pointer casting must use.
    In Windows land, pointer casting at least used to be Microsoft's
    recommended method and is supported by their compiler because
    they use it too, extensively.

    I have used it when I had complex multiple linkages between data
    structures.
    Say an object is in multiple double linked lists and an index tree and I
    need to cast from a pointer to a list link field back to the object containing that link field. I also often put a validity check marker for
    each object type at the start of the container and Assert its correctness. The marker is zeroed when the container is destroyed to catch
    any dangling references.



    Yup. You got it and basically had to use it the same way I have in the
    past. It's really cool. Also, check this shit out:

    #define RALLOC_ALIGN_OF(mp_type) \
      offsetof( \
        struct { \
          char pad_RALLOC_ALIGN_OF; \
          mp_type type_RALLOC_ALIGN_OF; \
        }, \
        type_RALLOC_ALIGN_OF \
      )

    ;^D
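
    The trick is that a char followed by a member of the probed type puts
    that member at the first offset satisfying its alignment. Below is a
    small sketch of the same idea with named probe structs, compared
    against C11 _Alignof. DECLARE_ALIGN_PROBE, ALIGN_FROM_PROBE and
    check_alignment_probes are hypothetical names for this example, and
    the equality with _Alignof is what common ABIs give rather than
    something I am claiming the standard guarantees. Named structs are
    used because defining a type inside offsetof may draw a warning from
    some compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Declare a probe struct: one char of padding, then the probed type. */
#define DECLARE_ALIGN_PROBE(tag, type) \
    struct tag { char pad; type member; }

/* The member's offset is the alignment the compiler applied to it. */
#define ALIGN_FROM_PROBE(tag) offsetof(struct tag, member)

DECLARE_ALIGN_PROBE(probe_char, char);
DECLARE_ALIGN_PROBE(probe_int, int);
DECLARE_ALIGN_PROBE(probe_ptr, void *);

/* Check the probe results against the C11 operator for a few types. */
static void check_alignment_probes(void)
{
    assert(ALIGN_FROM_PROBE(probe_char) == _Alignof(char));
    assert(ALIGN_FROM_PROBE(probe_int)  == _Alignof(int));
    assert(ALIGN_FROM_PROBE(probe_ptr)  == _Alignof(void *));
}
```

    With C11 or later available, _Alignof (or alignof from <stdalign.h>)
    does the job directly; the offsetof form is mainly of interest for
    older compilers.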
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Jun 26 11:40:07 2026
    From Newsgroup: comp.arch

    On 6/26/2026 11:36 AM, Chris M. Thomasson wrote:
    On 6/26/2026 7:11 AM, EricP wrote:
    On 2026-Jun-26 06:24, Chris M. Thomasson wrote:
    On 6/23/2026 5:54 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 6/22/2026 7:38 AM, Niklas Holsti wrote:
    On 2026-06-22 13:44, Thomas Koenig wrote:
    Niklas Holsti <niklas.holsti@tidorum.invalid> schrieb:
    On 2026-06-21 22:15, David Brown wrote:
    On 21/06/2026 20:57, MitchAlsup wrote:
    -------------
    In my case, I tended to use more conservative approaches and then only
    optimize based on what can be verified by the compiler within certain
    fundamental assumptions.

    Say:
        Pointer 1 points at a stack array in the local function;
        Pointer 2 was derived from taking the address of a global array;
        Compiler can safely assume no-alias.

    Also, if two pointers were passed into a function, can also assume they
    don't alias with a pointer to a local array;

    C requires the compiler to prove that the pointers cannot alias.
    Fortran specifies that if the 2 argument alias, it is a programming
    error.

    Sorry if this is way off base, but well...

    What about container_of, or CONTAINING_RECORD?

    If that is what I think it is, where it casts from
    a pointer to a field inside a struct back to the containing struct
    by subtracting the field byte offset and changing the pointer type,
    irrespective of programming language that mechanism has been used
    by operating systems at least since RSX days.
    It is a compact way of having structs linked to many other structures.

    That macro is just a variant of the mechanism for C.
    The method is used by WinNT and Linux, and I believe also by the BSD's.

    GCC has a compile option, -fno-strict-aliasing, that anyone
    using it and doing "illegal" pointer casting must use.
    In Windows land, pointer casting at least used to be Microsoft's
    recommended method and is supported by their compiler because
    they use it too, extensively.

    I have used it when I had complex multiple linkages between data
    structures.
    Say an object is in multiple double linked lists and an index tree and I
    need to cast from a pointer to a list link field back to the object
    containing that link field. I also often put a validity check marker for
    each object type at the start of the container and Assert its
    correctness.
    The marker is zeroed when the container is destroyed to catch
    any dangling references.



    Yup. You got it and basically had to use it the same way I have in the
    past. It's really cool. Also, check this shit out:

    #define RALLOC_ALIGN_OF(mp_type) \
      offsetof( \
        struct { \
          char pad_RALLOC_ALIGN_OF; \
          mp_type type_RALLOC_ALIGN_OF; \
        }, \
        type_RALLOC_ALIGN_OF \
      )

    ;^D

    fwiw, https://groups.google.com/g/comp.lang.c/c/7oaJFWKVCTw/m/sSWYU9BUS_QJ

  • From John Levine@johnl@taugh.com to comp.arch on Fri Jun 26 18:47:05 2026
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    a[0] = 42;
    memcpy(a+1, a, 99);

    A naive byte copy will fill a[] with 42, a more typical version that
    moves larger blocks won't. This example is really obvious (it's
    why there's also memmove()) but there's plenty of more subtle ones.

    The Burroughs B3500 and medium systems successors, which is a
    memory-to-memory architecture, had a number of move instructions,
    several of which had architecturally defined semantics for
    overlapping source and destination fields, which included
    functionality similar to that you describe above. ...

    So do S/360 and its successors. The original 360 had MVC which takes two
    addresses and a length, and the spec says it acts as if it copies a byte at
    a time, so the a -> a+1 hack is the usual way to set a block to a specific
    value. S/370 added MOVE LONG with separate lengths for the two operands and
    an explicit padding byte, so you fill memory with the padding byte by
    setting the source length to zero. If the operands have "destructive
    overlap", it sets a condition code and moves nothing. They also added MOVE
    INVERSE which reverses a byte string and has explicitly undefined results
    if the operands overlap by more than one byte.
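
    A loose C model of the MOVE LONG behaviour as described in the
    paragraph above, for readers who don't speak 360 assembler. It is only
    a sketch: move_long_model, the MOVE_OK and MOVE_OVERLAP codes and the
    simplified overlap test are my own assumptions, not the instruction's
    actual definition or condition codes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes for this model only. */
enum { MOVE_OK = 0, MOVE_OVERLAP = 1 };

/*
 * Copy min(src_len, dst_len) bytes, then pad the rest of dst with `pad`.
 * If dst starts inside the bytes that would be read (past the first one),
 * report overlap and move nothing.
 */
static int move_long_model(unsigned char *dst, size_t dst_len,
                           const unsigned char *src, size_t src_len,
                           unsigned char pad)
{
    assert(dst != NULL && src != NULL);
    size_t copy = src_len < dst_len ? src_len : dst_len;

    /* Simplified overlap test on flat addresses (an assumption). */
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    if (d > s && d - s < copy)
        return MOVE_OVERLAP;                  /* nothing is moved */

    memmove(dst, src, copy);                  /* safe when dst is below src */
    memset(dst + copy, pad, dst_len - copy);  /* fill the tail with padding */
    return MOVE_OK;
}
```

    Calling it with src_len set to zero gives the fill idiom: nothing is
    copied and all dst_len bytes receive the padding byte.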

    S/390 added the confusingly named MOVE LONG UNICODE which moves and pads
    pairs of bytes. (That works OK for UTF-16, not any other Unicode encoding.)
    It also has MOVE PAGE which blats a 4K page at a time, MOVE STRING which
    does a C-style copy up to a delimiter byte, and MOVE LONG EXTENDED which
    puts the padding byte in an operand (typically immediate in the
    instruction) rather than a register and doesn't check for destructive
    overlap; you get what you get.

    z/Series adds MOVE RIGHT TO LEFT which is similar to MVC except it's specified to move bytes right to left rather than left to right, with the example being to
    add a hole in the middle of an array.

    Copying strings is surprisingly complicated.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly