• locals (was: Coroutines in Forth)

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 25 04:47:12 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    There's also the realization that computer memory except for a few
    specialized Forth chips is always made from RAM. So ideological
    devotion to a pure stack VM seems to pass up perfectly good hardware
    capabilities.

    With competent Forth compilers, the machine code is 1) the same when
    using stack operations, when using the return stack, or when using
    locals, and 2) no RAM access happens (unless the compiler runs out
    of registers). This is demonstrated by lxf on the 3DUP variants
    <2024Apr10.090038@mips.complang.tuwien.ac.at>; to spare you having to
    look this posting up, here's the relevant part:

    |: 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    |: 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    |: 3dup.3 {: a b c :} a b c a b c ;
    |: 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;
    |
    |These four ways of expressing 3DUP are all compiled to exactly the
    |same code by lxf/ntf:
    |
    | 804FC0A 8B4500 mov eax , [ebp]
    | 804FC0D 8945F4 mov [ebp-Ch] , eax
    | 804FC10 8B4504 mov eax , [ebp+4h]
    | 804FC13 8945F8 mov [ebp-8h] , eax
    | 804FC16 895DFC mov [ebp-4h] , ebx
    | 804FC19 8D6DF4 lea ebp , [ebp-Ch]
    | 804FC1C C3 ret near
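
    As a quick sanity check that the four definitions have the same stack
    effect, they can be loaded into any system with Forth-2012 locals and
    compared on one input. This is only a sketch: CHECK-3DUP is a
    hypothetical helper that is not part of the quoted posting, and it says
    nothing about the machine code a given compiler generates.

    ```forth
    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    \ Hypothetical helper: run xt on 1 2 3 and check for 1 2 3 1 2 3.
    : check-3dup ( xt -- flag )
      >r 1 2 3 r> execute
      \ compare the six results from the top down, parking flags on R
      3 = >r  2 = >r  1 = >r  3 = >r  2 = >r  1 =
      r> and r> and r> and r> and r> and ;
    ```

    For example, ' 3dup.1 check-3dup . should print -1 (true), and likewise
    for the other three variants.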

    That leads to the questions in this discussion:

    1) Should we optimize for less competent compilers? Why?

    a) If yes, should we optimize all code, or only the part of the
    code that is actually executed frequently?

    2) Are there other criteria for deciding between the alternatives?
    Which ones?

    Gforth does support address-like locals if you want to use them.

    Gforth has provided variable-flavoured locals since I implemented
    locals (in 1994), because I had the idea that using ! is preferable to
    using TO, but in practice I did not use variable-flavoured locals, and
    instead preferred to avoid TO by defining locals where their value is
    known, and then just using them (possibly defining additional locals
    instead of using TO on existing locals). And AFAIK others have rarely
    used variable-flavoured locals, either.
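
    To make the three styles concrete, here is a minimal sketch for Gforth.
    The word names are made up for illustration, the W^ type specifier for
    variable-flavoured locals and the use of several locals definitions in
    one word are Gforth features rather than standard ones, and the example
    is deliberately trivial.

    ```forth
    \ 1) value-flavoured local, overwritten with TO
    : sumsq.to ( a b -- n )
      0 {: a b acc :}
      a a * to acc
      b b * acc + to acc
      acc ;

    \ 2) variable-flavoured local (Gforth's W^), updated with ! and +!
    : sumsq.var ( a b -- n )
      0 {: a b w^ acc :}
      a a * acc !
      b b * acc +!
      acc @ ;

    \ 3) locals defined where their value is known; nothing is overwritten
    : sumsq.def ( a b -- n )
      {: a b :}
      a a * {: a2 :}
      b b * {: b2 :}
      a2 b2 + ;
    ```

    All three compute the same result, e.g., 3 4 sumsq.def . prints 25.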

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.21f-Linux NewsLink 1.2
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri Apr 24 23:21:28 2026

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    With competent Forth compilers, the machine code is 1) the same when
    using stack operations, when using the return stack, or when using
    locals

    "Competent Forth compilers" there describes what by Forth standards
    would be called quite fancy optimizing compilers ("analytic compilers").
    They are a significant technical feat and there aren't that many of
    them. Traditionally Forth has been implemented as simple interpreters.

    In that case, a pure stack VM seems to ignore capabilities of the
    underlying hardware. Particularly, the stack's memory actually
    being RAM. Doesn't PICK go back to the earliest days of Forth, as a way
    to bypass the limitation?
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 25 05:26:47 2026

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    If you want to use a language that is "ideologically devoted" to the
    architecture, maybe you shouldn't use Forth at all - and stick with C.

    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register
    allocation of any of the three is similarly difficult (with big
    differences in difficulty between solutions that provide some register
    allocation and those that are so reliable that you usually count on
    them).

    Given the stuff I read about Chuck Moore's goals in designing Forth
    and what I read about the development of BCPL, B, and C, it's not too
    surprising that they are close to the hardware of the time when they
    were designed. It is interesting that both Forth and C standards
    (and, to some extent, implementations) have not reflected newer
    architectural features such as SIMD instructions. At least they
    managed to reflect different machine-word sizes (BLISS didn't,
    resulting in differences between BLISS-10, BLISS-11, and BLISS-32, and
    its losing against C despite having superior compilers for more than a
    decade).

    I know there are situations when there are six values on the data stack
    and four on the return stack which leave you with few other options. But
    you can always use vanilla variables or an extra stack (which is trivial
    to implement) to remedy that.

    Using Forth means being resourceful. Not to choose the most convenient
    and lazy solution imaginable.

    According to <https://www.dictionary.com/browse/resourceful>:

    |able to deal skillfully and promptly with new situations,
    |difficulties, etc.

    Forth systems that do not implement locals are not a new situation.
    So do you mean to say that it is a difficulty? I would agree. That's
    fine if you are using a tiny system and do not want to use an
    umbilical/tethered system, but if the system is big enough to support
    locals, the lack of locals shows the laziness of the system
    implementor.

    But blaming the programmer for the system implementor's failings is a
    tactic used widely by system implementors (in the C world as well as
    in the Forth world), and they often find some arguments that appeal to
    elitism (i.e., only the chosen ones can use this programming language
    for the elite as it should be used, and the others should program in
    Python or "should never have been allowed to touch a keyboard" (Ulrich
    Drepper)), and enough people fall for this that they repeat such
    arguments and come up with additional arguments of this kind.

    In any case, why should it be better to use an inconvenient solution
    that requires more work rather than a convenient solution that
    requires less work (i.e., is lazy)?

    For me, the virtues in programming are to produce correct code, to
    produce it quickly, to use resources economically (which does not mean
    that saving a few bytes on a machine with GBs of memory is virtuous),
    and to keep the code readable and maintainable.

    - anton
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri Apr 24 23:55:16 2026

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register
    allocation of any of the three is similarly difficult...

    I believe early C compilers didn't attempt much if any register
    allocation. You could say "register int x" to manually assign a
    register to x if one was available. You were limited to 2 or 3 of those
    on the PDP-11. Local variables in C otherwise lived in the stack. The
    difference was that the C compiler generated straightforward assembly
    code to access those variables even when they were in the stack
    interior. You didn't have to use ROT or juggle stuff to the R stack to
    get to the inner elements.

    In assembler, you could also program in a stack-oriented style yet
    straightforwardly access the inner elements. Forth for whatever reason
    chose strict stack discipline (with some loopholes like PICK). I
    understand wanting to stay with purity of a model, but a more
    hardware-sympathetic model would have been "stack implemented in RAM".

    So I still don't understand the benefit of the "pure abstract stack"
    approach, other than for a few weird special CPU's.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 25 06:43:23 2026

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    With competent Forth compilers, the machine code is 1) the same when
    using stack operations, when using the return stack, or when using
    locals

    "Competent Forth compilers" there describes what by Forth standards
    would be called quite fancy optimizing compilers ("analytic compilers").
    They are a significant technical feat and there aren't that many of
    them. Traditionally Forth has been implemented as simple interpreters.

    And traditionally Forth has been implemented without locals, for the
    same reason: It takes less memory and, for the system implementor,
    less work; on current non-tiny machines, the latter aspect still
    exists, and IMO is a big motivation for anti-locals advocacy (i.e., a
    sour-grapes argument).

    It's a bit perverse: You argue for locals with simple implementations,
    while anti-locals advocates argue against locals with simple
    implementations.

    And because it's more work, there are fewer sophisticated than simple
    systems. But who cares how many there are? The question is what
    programmers and users use and what their goals are.

    In any case, when it comes to performance measurements on "simple
    interpreters" like the Gforth of 1994, Forth code with locals usually
    turns out to be slower and consume more memory than Forth code using
    (and trying to avoid) stack juggling. E.g., my paper [ertl94l]
    contains the following comparison:

                   locals
               with     without   ratio
    max        3.56us    2.69us   1.32
    strcmp    83.20us   70.50us   1.18

    Numbers from a 486DX2/66, strcmp compares a string with 17 characters
    with itself.

    The explanation given is:

    |The slowdown factor of using locals is due to the execution of more
    |primitives (e.g., 14 instead of 12 per character in
    |"strcmp"). Originally there was also a large overhead due to fetching
    |inline arguments, resulting in slowdowns of 1.58 for "max" and 1.41
    |for "strcmp". This overhead has been eliminated mostly by using
    |versions of the primitives specialized for frequent inline arguments
    |(e.g., "8lp+!" as specialization of "lp+!#" with the inline
    |argument 8).

    @InProceedings{ertl94l,
    author = "M. Anton Ertl",
    title = "Automatic Scoping of Local Variables",
    booktitle = "EuroForth~'94 Conference Proceedings",
    year = "1994",
    address = "Winchester, UK",
    pages = "31--37",
    url = "https://www.complang.tuwien.ac.at/papers/ertl94l.ps.gz",
    abstract = "In the process of lifting the restrictions on using
    locals in Forth, an interesting problem poses
    itself: What does it mean if a local is defined in a
    control structure? Where is the local visible? Since
    the user can create every possible control structure
    in ANS Forth, the answer is not as simple as it may
    seem. Ideally, the local is visible at a place if
    the control flow {\em must} pass through the
    definition of the local to reach this place. This
    paper discusses locals in general, the visibility
    problem, its solution, the consequences and the
    implementation as well as related programming style
    questions."
    }

    It might be interesting to measure this again on current hardware with
    the current, somewhat more sophisticated, but not yet "competent"
    Gforth, and maybe I will, at some other time. However, looking at the
    code for Gforth for 3DUP.3 compared to the others, Gforth still uses
    more primitives (even with superinstructions) and more machine
    instructions. From <2025Oct2.224440@mips.complang.tuwien.ac.at>:

    : 3dup.1 ( a b c -- a b c a b c ) >r 2dup r@ -rot r> ;
    : 3dup.2 ( a b c -- a b c a b c ) 2 pick 2 pick 2 pick ;
    : 3dup.3 {: a b c :} a b c a b c ;
    : 3dup.4 ( a b c -- a b c a b c ) dup 2over rot ;

    And here's the gforth-fast code on AMD64:

    3dup.1              3dup.2              3dup.3              3dup.4
    >r 1->0             third 1->2          >l >l 1->1          dup 1->1
    mov -$08[r14],r13   mov r15,$10[r10]    >l 1->1             mov [r10],r13
    sub r14,$08         third 2->3          mov -$08[rbp],r13   sub r10,$08
    2dup 0->2           mov r9,$08[r10]     mov rdx,$08[r10]    2over 1->3
    mov r13,$10[r10]    third 3->1          mov rax,rbp         mov r15,$18[r10]
    mov r15,$08[r10]    mov [r10],r13       add r10,$10         mov r9,$10[r10]
    i 2->3              sub r10,$18         lea rbp,-$10[rbp]   rot 3->1
    mov r9,[r14]        mov $10[r10],r15    mov -$10[rax],rdx   mov [r10],r15
    -rot 3->2           mov $08[r10],r9     mov r13,[r10]       sub r10,$10
    mov [r10],r9        ;s 1->1             >l @local0 1->1     mov $08[r10],r9
    sub r10,$08         mov rbx,[r14]       @local0 1->1        ;s 1->1
    r> 2->1             add r14,$08         mov rax,rbp         mov rbx,[r14]
    mov -$08[r10],r15   mov rax,[rbx]       lea rbp,-$08[rbp]   add r14,$08
    sub r10,$10         jmp eax             mov -$08[rax],r13   mov rax,[rbx]
    mov $10[r10],r13                        @local1 1->2        jmp eax
    mov r13,[r14]                           mov r15,$08[rbp]
    add r14,$08                             @local2 2->1
    ;s 1->1                                 mov -$08[r10],r15
    mov rbx,[r14]                           sub r10,$10
    add r14,$08                             mov $10[r10],r13
    mov rax,[rbx]                           mov r13,$10[rbp]
    jmp eax                                 @local0 1->2
                                            mov r15,$00[rbp]
                                            @local1 2->3
                                            mov r9,$08[rbp]
                                            @local2 3->1
                                            mov -$10[r10],r9
                                            sub r10,$18
                                            mov $10[r10],r15
                                            mov $18[r10],r13
                                            mov r13,$10[rbp]
                                            lit 1->2
                                            #24
                                            mov r15,$50[rbx]
                                            lp+! 2->1
                                            add rbp,r15
                                            ;s 1->1
                                            mov rbx,[r14]
                                            add r14,$08
                                            mov rax,[rbx]
                                            jmp eax

    [Note that for a superinstruction like ">l >l" or ">l @local0", all
    threaded code cells are shown, the first as superinstruction, and the
    remaining ones as the simple primitive in that threaded-code slot; but
    the other threaded-code slots have no separate code generated.]

    You seem to argue that the random-access aspect of locals provides a
    performance advantage on simple systems, but in most cases, code using
    locals is at a performance disadvantage on such systems (and
    traditionalists have often used that to argue against locals).

    In that case, a pure stack VM seems to ignore capabilities of the
    underlying hardware. Particularly, the stack's memory actually
    being RAM.

    Keeping at least one stack item in a register leads to a smaller and
    faster implementation, and is not more complex than keeping all the
    stack memory in RAM. It does require enough registers, however (i.e.,
    you do not use this technique on the 6502).

    Doesn't PICK go back to the earliest days of Forth, as a way
    to bypass the limitation?

    A way to use RAM that is less frowned upon by Forth traditionalists is
    (global) variables. The fact that the use of global variables is
    frowned upon in the wider programming community for various reasons
    seems to pour oil on the fire of their elitism.

    - anton
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 25 08:21:41 2026

    Paul Rubin <no.email@nospam.invalid> writes:
    I believe early C compilers didn't attempt much if any register
    allocation.

    Yes, they did not allocate auto variables (what we consider locals) to
    registers.

    The
    difference was that the C compiler generated straightforward assembly
    code to access those variables even when they were in the stack
    interior. You didn't have to use ROT or juggle stuff to the R stack to
    get to the inner elements.

    That's the same with unsophisticated locals implementations like those
    of Gforth (I do not mention other Forth systems with such
    implementations to protect the guilty).

    Forth for whatever reason
    chose strict stack discipline (with some loopholes like PICK). I
    understand wanting to stay with purity of a model, but a more
    hardware-sympathetic model would have been "stack implemented in RAM".

    What do you mean by that? Forth already provides PICK. ROLL and
    -ROLL are either slow to implement in RAM or require significant
    sophistication. In addition, Gforth has

    : stick ( x0 x1 ... xu x u -- x x1 ... xu ) \ gforth-internal
        \ replace x0 with x; e.g., 5 PICK 1+ 5 STICK increments the 6th
        \ stack element (not recommended).
        2 + cells sp@ + ! ;

    which is used in the Gforth source code 7 times (compared to 20 times
    for PICK, 4 for FOURTH, 38 for THIRD, 308 for OVER and 1128 for DUP),
    always with colon-sys-xt-offset as U, so STICK is only used to
    manipulate colon-sys control-flow stack items. I have also had little
    appetite to use it elsewhere.

    In general, in Forth programming one copies things from various places
    in stacks with DUP, OVER, PICK, and R@; sometimes you do not need the
    item in its original place any more, then you SWAP, ROT or ROLL it
    instead of keeping it on the stack and dropping it later (and the item
    might be in the way). Very occasionally, you copy an item deeper into
    the stack, as with TUCK, or -ROT or -ROLL it out of the way.

    But overwriting an existing stack item with something else as done by
    STICK is not something we tend to do, and this also shows in the
    absence of such words for the top few stack items (while 1 PICK is
    OVER, there is no word that corresponds to 1 STICK). I think the
    reason why it is not done is that we avoid keeping dead stack items
    around that we might overwrite. Such dead stack items would often be
    in the way.
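
    To make the stack effect of STICK for small U concrete, here is a
    sketch of what 0 STICK and 1 STICK amount to when written with standard
    stack words. The names 0STICK and 1STICK are hypothetical; neither word
    is part of Gforth or the standard.

    ```forth
    \ Hypothetical words, for illustration only.
    : 0stick ( x0 x -- x )        nip ;            \ same effect as 0 STICK
    : 1stick ( x0 x1 x -- x x1 )  rot drop swap ;  \ same effect as 1 STICK
    ```

    For example, 10 20 30 99 1stick leaves 10 99 30 on the stack: the 20
    has been overwritten by 99, and the 30 above it is untouched.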

    And if someone has the desire for having a storage location that they
    want to overwrite, Forth has locals (although I avoid overwriting
    them, too, see <https://net2o.de/gforth/Locals-programming-style.html>).

    So I still don't understand the benefit of the "pure abstract stack"
    approach, other than for a few weird special CPU's.

    The benefit of not implementing locals is that implementing the Forth
    system takes less time and the resulting system is smaller.

    PICK tends to be frowned upon because it is a code smell that suggests
    that you have too much going on on the stack, which makes the program
    hard to understand, and you should be looking for alternatives.

    ROLL and -ROLL are avoided for the same reason and because they are
    slow on many implementations.

    As for STICK, see above.

    - anton
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sat Apr 25 11:27:36 2026

    In article <87fr4j4jhn.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register
    allocation of any of the three is similarly difficult...

    I believe early C compilers didn't attempt much if any register
    allocation. You could say "register int x" to manually assign a
    register to x if one was available. You were limited to 2 or 3 of those
    on the PDP-11. Local variables in C otherwise lived in the stack. The
    difference was that the C compiler generated straightforward assembly
    code to access those variables even when they were in the stack
    interior. You didn't have to use ROT or juggle stuff to the R stack to
    get to the inner elements.

    In assembler, you could also program in a stack-oriented style yet
    straightforwardly access the inner elements. Forth for whatever reason
    chose strict stack discipline (with some loopholes like PICK). I
    understand wanting to stay with purity of a model, but a more
    hardware-sympathetic model would have been "stack implemented in RAM".

    There are more loopholes, once you think of it.
    Suppose you have a recursive integration algorithm. Define an object
    that contains all relevant recursive data. Allocate it on the data
    stack ( DSP@ size - DSP! ) and make it the current object ( DSP@ ^recdat ! ).
    Free the stack once you're done ( DSP@ size + DSP! ).
    In this context you are using normal floats, not weird locals, and
    your choice of normal or single floats.
    [ More politically correct is probably to use ALLOCATE and FREE, for no
    clear benefit. ]


    So I still don't understand the benefit of the "pure abstract stack"
    approach, other than for a few weird special CPU's.


    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sat Apr 25 11:43:30 2026

    In article <2026Apr25.084323@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    <SNIP>

                   locals
               with     without   ratio
    max        3.56us    2.69us   1.32
    strcmp    83.20us   70.50us   1.18

    Interestingly, I don't allow complicated definitions with assembler
    implementations in ciforth.
    E.g. + XOR 0< EXECUTE are all low level, not much more.
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    <SNIP>

    - anton

    Groetjes Albert
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 25 10:22:16 2026

    albert@spenarnc.xs4all.nl writes:
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    In other words, Forth without locals is not well suited for words
    that have so much active data. That is also reflected in hardware
    designed for Forth, which got additional registers like A or B (or
    additional capabilities for the top of the return stack register R),
    which make it simpler and faster to implement such words.

    A definition of STRCMP in the paper is

    : strcmp { addr1 u1 addr2 u2 -- n }
      addr1 addr2
      u1 u2 min 0
      ?do { s1 s2 }
        s1 c@ s2 c@ - ?dup
        if
          unloop exit
        then
        s1 char+ s2 char+
      loop
      2drop
      u1 u2 - ;

    So in the loop we have a loop count (on the return stack), two cursors
    (s1 and s2) into the compared strings, and within the loop body we
    additionally have the two characters, for a total of five live values,
    three of which survive across iterations and are changed in every
    iteration. One could implement it as

    \ untested, and the following versions, too
    : strcmp { addr1 u1 addr2 u2 -- n }
      \ no cursors on the data stack in this version
      u1 u2 min 0
      ?do
        addr1 i + c@ addr2 i + c@ - ?dup
        if
          unloop exit
        then
      loop
      u1 u2 - ;

    where only one of the values changes in each iteration, but now the
    ?DO...LOOP cannot be replaced with a version that does not store a
    second value but counts down (or up) to 0, so now we have a total of 6
    live values, four of which survive across iterations, and one is
    changed on every iteration.

    One can reduce this by one value by keeping one of the cursors in the
    loop counter:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
      addr2 addr1 - {: offset :}
      u1 u2 min addr1 + addr1 ?do
        i c@ i offset + c@ - ?dup
        if
          unloop exit
        then
      loop
      u1 u2 - ;

    So now we have five live values in the body of the loop at the same
    time, three of which live across iterations, and one of which changes
    in each iteration. Keeping the loop parameters separate significantly
    lessens the load on the data stack.

    Let's see if we can eliminate the local from the loop body:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
      addr2 addr1 - ( offset )
      u1 u2 min addr1 + addr1 ?do ( offset )
        \ character from string 1 first, so the sign matches the versions above
        i c@ over i + c@ - ?dup
        if
          nip unloop exit
        then
      loop
      drop u1 u2 - ;

    That leaves stack purists with the task of eliminating the locals from
    the prologue and epilogue of this word. Two items have to be stored
    across the loop, or the difference could be computed speculatively and
    only one item stored across the loop. And the computations before the
    loop involve four values alive at the same time (fortunately addr2
    does not live long). Let's see:

    : strcmp ( addr1 u1 addr2 u2 -- n )
      rot 2dup - >r     ( addr1 addr2 u2 u1 R: n1 )
      min -rot over -   ( u12 addr1 offset R: n1 )
      swap rot bounds   ( offset limit start R: n1 )
      ?do               ( offset R: n1 loop-sys )
        i c@ over i + c@ - ?dup
        if
          nip unloop r> drop exit
        then
      loop
      drop r> negate ;

    As can be seen by the many stack comments, the stack load here is more
    than I can easily deal with.

    Maybe a stack purist can improve on that. But can he improve it
    enough to make it as easy to understand as any of the versions with
    locals?
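
    Since all of these versions are untested, a few string pairs are a cheap
    way to compare them. The following is a minimal sketch, assuming Gforth
    (for BOUNDS); SIGN and TEST-STRCMP are hypothetical helper names, and the
    STRCMP shown is a stack-only variant that takes the character from the
    first string first, so that a negative result means "first string is
    smaller", as in the locals versions.

    ```forth
    : strcmp ( addr1 u1 addr2 u2 -- n )
      rot 2dup - >r     ( addr1 addr2 u2 u1 R: u2-u1 )
      min -rot over -   ( u12 addr1 offset R: u2-u1 )
      swap rot bounds   ( offset limit start R: u2-u1 )
      ?do               ( offset )
        i c@ over i + c@ - ?dup if
          nip unloop r> drop exit
        then
      loop
      drop r> negate ;

    \ Hypothetical helper: reduce n to -1, 0, or 1.
    : sign ( n -- -1|0|1 ) dup 0< swap 0> - ;

    \ Hypothetical test word: prints one sign per string pair.
    : test-strcmp ( -- )
      s" abc" s" abc" strcmp sign .   \ equal strings
      s" abc" s" abd" strcmp sign .   \ differ in the last character
      s" abd" s" abc" strcmp sign .   \ same pair, swapped
      s" ab"  s" abc" strcmp sign . ; \ proper prefix
    ```

    Running test-strcmp should print 0 -1 1 -1. Replacing the definition of
    STRCMP with one of the locals versions allows a direct comparison.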

    - anton
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sat Apr 25 15:43:06 2026

    On 25-04-2026 07:26, Anton Ertl wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:

    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register
    allocation of any of the three is similarly difficult (with big
    differences in difficulty between solutions that provide some register
    allocation to those that are so reliable that you usually count on
    them).

    Well, you're actually shooting at Paul Rubin - not at me. Thank you! I
    take all the help I can get!

    Using Forth means being resourceful. Not to choose the most convenient
    and lazy solution imaginable.

    According to <https://www.dictionary.com/browse/resourceful>:

    |able to deal skillfully and promptly with new situations,
    |difficulties, etc.

    That's EXACTLY what I meant!

    Forth systems that do not implement locals are not a new situation.
    So do you mean to say that it is a difficulty?

    You're completely beside the point I wanted to make. I meant the design
    or algorithm one has to implement.

    But blaming the programmer for the system implementor's failings is a
    tactic used widely by system implementors (in the C world as well as
    in the Forth world).

    YAGNI is not a "system implementer's failing". It is a choice he made,
    because you (a) really don't need it - or (b) if you need it you can add
    it yourself. Which all seems very Forth like.

    (..) and they often find some arguments that appeal to
    elitism (i.e., only the chosen ones can use this programming language
    for the elite as it should be used, and the others should program in
    Python or "should never have been allowed to touch a keyboard" (Ulrich
    Drepper).

    It's your own pal Bernd that said: "A good programmer will write even
    better code in Forth. A bad programmer will write abysmal code in Forth.
    And I'm sorry to say - but most programmers are quite bad."

    So, either you agree with him or we have an unfortunate departure of one
    of the foremost members of Gforth. Because this states - in no
    uncertain words - that Forth programmers *ARE* elite.

    Which in itself is a defensible position. I mean - we're 0.1% of the
    programming population according to TIOBE. I blame it solely on our
    inability to procreate, but you may put up some other viable explanation.

    Moore himself thinks we're elite: "I must say that I'm appalled at the
    code I see. Because all this code suffers the same failings, I conclude
    it's not a sporadic problem."

    I mean - there is nothing wrong with being a subpar programmer. Plenty
    of languages to choose from - and still get bread on the table.

    Of course, it's expected that one states that "All humans are equal -
    even if they're programming". That's the time we live in.

    But I quote Jan Cremer, a famous Dutch writer: "'I'm okay and you're
    okay.' That sounds quite nice. But 'I'm okay and you're a dick' feels
    much better."

    Humanity can be divided in four groups:
    1. Those who can not write Forth;
    2. Those who tried Forth, but failed;
    3. Those who pretend to write Forth, but still fail;
    4. Those who can write Forth.

    I mean: the truth must be said. I'm Dutch. I can't help myself.

    In any case, why should it be better to use an inconvenient solution
    that requires more work rather than a convenient solution that
    requires less work (i.e., is lazy)?

    It would be better to think deeply, find an original solution and learn.
    Like Albert with his brilliant ;: word.

    For me virtues in programming are to produce correct code, to produce
    it quickly, the code should use the resources economically (which does
    not mean that saving a few bytes on a machine with GBs of memory is
    virtuous), and the code should be readable and maintainable.

    Well, to me it's something different. Who cares what you or I think.
    It's about what you can prove decisively.

    Hans Bezemer

  • From peter@peter.noreply@tin.it to comp.lang.forth on Sat Apr 25 16:07:47 2026
    From Newsgroup: comp.lang.forth

    On Sat, 25 Apr 2026 10:22:16 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    albert@spenarnc.xs4all.nl writes:
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    In other words, Forth without locals is not well suited for words
    that have so much active data. That is also reflected in hardware
    designed for Forth, which got additional registers like A or B (or
    additional capabilities for the top of the return stack register R),
    which make it simpler and faster to implement such words.

    A definition of STRCMP in the paper is

    : strcmp { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
    s1 c@ s2 c@ - ?dup
    if
    unloop exit
    then
    s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;

    So in the loop we have a loop count (on the return stack), two cursors
    (s1 and s2) into the compared strings, and within the loop body we
    additionally have the two characters, for a total of five live values,
    three of which survive across iterations and are changed in every
    iteration. One could implement it as

    \ untested, and the following versions, too
    : strcmp { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do
    addr1 i + c@ addr2 i + c@ - ?dup
    if
    unloop exit
    then
    loop
    u1 u2 - ;

    where only one of the values changes in each iteration, but now the
    ?DO...LOOP cannot be replaced with a version that does not store a
    second value but counts down (or up) to 0, so now we have a total of 6
    live values, four of which survive across iterations, and one is
    changed on every iteration.

    One can reduce this by one value by keeping one of the cursors in the
    loop counter:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
    addr2 addr1 - {: offset :}
    u1 u2 min addr1 + addr1 ?do
    i c@ i offset + c@ - ?dup
    if
    unloop exit
    then
    loop
    u1 u2 - ;

    So now we have five live values in the body of the loop at the same
    time, three of which live across iterations, and one of which changes
    in each iteration. Keeping the loop parameters separate significantly
    lessens the load on the data stack.

    Let's see if we can eliminate the local from the loop body:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
    addr2 addr1 - ( offset )
    u1 u2 min addr1 + addr1 ?do ( offset )
    i c@ over i + c@ - ?dup
    if
    nip unloop exit
    then
    loop
    drop u1 u2 - ;

    That leaves stack purists with the task of eliminating the locals from
    the prologue and epilogue of this word. Two items have to be stored
    across the loop, or the difference could be computed speculatively and
    only one item stored across the loop. And the computations before the
    loop involve four values alive at the same time (fortunately addr2
    does not live long). Let's see:

    : strcmp ( addr1 u1 addr2 u2 -- n )
    rot 2dup - >r ( addr1 addr2 u2 u1 R: n1 )
    min -rot over - ( u12 addr1 offset R: n1 )
    swap rot bounds ( offset limit start R: n1 )
    ?do ( offset R: n1 loop-sys )
    i c@ over i + c@ - ?dup
    if
    nip unloop r> drop exit
    then
    loop
    drop r> negate ;

    As can be seen by the many stack comments, the stack load here is more
    than I can easily deal with.

    Maybe a stack purist can improve on that. But can he improve it
    enough to make it as easy to understand as any of the versions with
    locals?
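
    For readers who want to load one of these: a minimal sketch of the
    variant that keeps one cursor in the loop counter. It uses a single
    Forth-2012 {: ... :} declaration with OFFSET as an uninitialized local
    set by TO, on the assumption that one declaration per word is the most
    portable form. The result convention (difference of the first
    differing character pair, otherwise u1-u2) is taken from the versions
    above; the name STRCMP-OFFSET is made up for this illustration.

```forth
\ Sketch only: compare two strings, one cursor kept in the loop index.
\ n<0, n=0, n>0 as in C's strcmp.  STRCMP-OFFSET is a hypothetical name.
: strcmp-offset ( c-addr1 u1 c-addr2 u2 -- n )
  {: addr1 u1 addr2 u2 | offset :}
  addr2 addr1 - to offset          \ distance from string 1 to string 2
  u1 u2 min addr1 + addr1 ?do      \ I walks over string 1
    i c@  i offset + c@  -         \ char of string 1 minus char of string 2
    ?dup if unloop exit then       \ the first difference decides
  loop
  u1 u2 - ;                        \ common prefix equal: the lengths decide
```

    Only OFFSET is live across iterations besides the loop parameters, so
    the body needs no stack juggling at all.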

    I recently reviewed the string comparison for search-wordlist
    and came up with the following

    The string stored in the word header is already uppercased.
    So string comparison will be case insensitive

    : UC ( c -- c' ) \ uppercase char
    dup $61 $7B within $20 and - ;


    : NCOMP4 ( addr n addr' n' - f) \ 0 is match
    dup >r
    begin
    rot = while \ str cstr
    r> dup 1- >r
    while \ str cstr
    swap count uc \ cstr str' s1
    rot count \ str' s1 cstr' c1
    repeat
    2drop r> drop 0 exit
    then
    2drop r> drop 1 ;

    In its first iteration the loop compares not chars but the lengths!
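
    The same two words laid out with the control flow spelled out in
    comments, as a sketch that can be loaded and tried. The code is
    unchanged apart from layout; it assumes a system that accepts the
    Forth-2012 $ hex prefix, and that the second string is already in
    upper case, as stated above.

```forth
: uc ( c -- c' )                  \ uppercase an ASCII letter
  dup $61 $7B within $20 and - ;  \ subtract $20 only for a..z

: ncomp4 ( addr n addr' n' -- f ) \ f=0 means match
  dup >r                          \ R: number of chars still to compare
  begin
    rot = while                   \ 1st pass: n = n' ?  later: chars equal?
    r> dup 1- >r                  \ fetch the remaining count, decrement it
  while                           \ chars left to compare?
    swap count uc                 \ next char of the search string, uppercased
    rot count                     \ next char of the header string
  repeat
    2drop r> drop 0 exit          \ count exhausted: match
  then
  2drop r> drop 1 ;               \ lengths or chars differed: no match
```

    With these definitions, s" dup" s" DUP" ncomp4 leaves 0, while
    s" dup" s" DROP" ncomp4 leaves 1 after the length test alone.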

    BR
    Peter



    - anton


  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sat Apr 25 17:38:11 2026
    From Newsgroup: comp.lang.forth

    On 25-04-2026 16:07, peter wrote:
    On Sat, 25 Apr 2026 10:22:16 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    albert@spenarnc.xs4all.nl writes:
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    In other words, Forth without locals is not well suited for words
    that have so much active data. That is also reflected in hardware
    designed for Forth, which got additional registers like A or B (or
    additional capabilities for the top of the return stack register R),
    which make it simpler and faster to implement such words.

    A definition of STRCMP in the paper is

    : strcmp { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
    s1 c@ s2 c@ - ?dup
    if
    unloop exit
    then
    s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;

    So in the loop we have a loop count (on the return stack), two cursors
    (s1 and s2) into the compared strings, and within the loop body we
    additionally have the two characters, for a total of five live values,
    three of which survive across iterations and are changed in every
    iteration. One could implement it as

    \ untested, and the following versions, too
    : strcmp { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do
    addr1 i + c@ addr2 i + c@ - ?dup
    if
    unloop exit
    then
    loop
    u1 u2 - ;

    where only one of the values changes in each iteration, but now the
    ?DO...LOOP cannot be replaced with a version that does not store a
    second value but counts down (or up) to 0, so now we have a total of 6
    live values, four of which survive across iterations, and one is
    changed on every iteration.

    One can reduce this by one value by keeping one of the cursors in the
    loop counter:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
    addr2 addr1 - {: offset :}
    u1 u2 min addr1 + addr1 ?do
    i c@ i offset + c@ - ?dup
    if
    unloop exit
    then
    loop
    u1 u2 - ;

    So now we have five live values in the body of the loop at the same
    time, three of which live across iterations, and one of which changes
    in each iteration. Keeping the loop parameters separate significantly
    lessens the load on the data stack.

    Let's see if we can eliminate the local from the loop body:

    : strcmp {: addr1 u1 addr2 u2 -- n :}
    addr2 addr1 - ( offset )
    u1 u2 min addr1 + addr1 ?do ( offset )
    i c@ over i + c@ - ?dup
    if
    nip unloop exit
    then
    loop
    drop u1 u2 - ;

    That leaves stack purists with the task of eliminating the locals from
    the prologue and epilogue of this word. Two items have to be stored
    across the loop, or the difference could be computed speculatively and
    only one item stored across the loop. And the computations before the
    loop involve four values alive at the same time (fortunately addr2
    does not live long). Let's see:

    : strcmp ( addr1 u1 addr2 u2 -- n )
    rot 2dup - >r ( addr1 addr2 u2 u1 R: n1 )
    min -rot over - ( u12 addr1 offset R: n1 )
    swap rot bounds ( offset limit start R: n1 )
    ?do ( offset R: n1 loop-sys )
    i c@ over i + c@ - ?dup
    if
    nip unloop r> drop exit
    then
    loop
    drop r> negate ;

    As can be seen by the many stack comments, the stack load here is more
    than I can easily deal with.

    Maybe a stack purist can improve on that. But can he improve it
    enough to make it as easy to understand as any of the versions with
    locals?

    I recently reviewed the string comparison for search-wordlist
    and came up with the following

    The string stored in the word header is already uppercased.
    So string comparison will be case insensitive

    : UC ( c -- c' ) \ uppercase char
    dup $61 $7B within $20 and - ;


    : NCOMP4 ( addr n addr' n' - f) \ 0 is match
    dup >r
    begin
    rot = while \ str cstr
    r> dup 1- >r
    while \ str cstr
    swap count uc \ cstr str' s1
    rot count \ str' s1 cstr' c1
    repeat
    2drop r> drop 0 exit
    then
    2drop r> drop 1 ;

    In its first iteration the loop compares not chars but the lengths!

    BR
    Peter

    This one is about a third bigger than yours - if we disregard the "UC",
    that is:

    : comp
    rot over - if drop 2drop true exit then
    0 ?do
    over i chars + c@ over i chars + c@ -
    if drop drop unloop true exit then
    loop drop drop false
    ;

    In 4tH, it is even visually more compact:

    : comp
    rot over - if drop 2drop true ;then
    0 ?do over i [] c@ over i [] c@ - if drop drop unloop true ;then loop
    drop drop false
    ;

    The extra length comes mainly from the three different possible exits:
    - It's not the same size (first line);
    - It's not the same content (exit within loop);
    - It's the same thing (after loop).

    I can't say I particularly like the use of "COUNT" here - because it
    actually represents "C@+" - except for the first run. Neither am I very
    happy with the BEGIN..WHILE..WHILE..REPEAT..THEN construct - but that's
    not your fault ;-)

    All that being said, I cannot deny it is a clever piece of code using
    the full capabilities of the language, bravo!

    Hans Bezemer
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Apr 25 17:21:11 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 25-04-2026 07:26, Anton Ertl wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    [reinserted deleted, relevant context]
    If you want to use a language that is "ideologically devoted" to the
    architecture, maybe you shouldn't use Forth at all - and stick with C.

    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register
    allocation of any of the three is similarly difficult (with big
    differences in difficulty between solutions that provide some register
    allocation to those that are so reliable that you usually count on
    them).

    Well, you're actually shooting at Paul Rubin - not at me. Thank you! I
    take all the help I can get!

    Actually, this whole paragraph is a reaction to your statement, not
    his. You deleted it for whatever reason, so I reinserted it.
    Concerning Paul Rubin, just because he is wrong does not mean you are
    right.

    (..) and they often find some arguments that appeal to
    elitism (i.e., only the chosen ones can use this programming language
    for the elite as it should be used, and the others should program in
    Python or "should never have been allowed to touch a keyboard" (Ulrich
    Drepper).

    It's your own pal Bernd that said: "A good programmer will write even
    better code in Forth. A bad programmer will write abysmal code in Forth.
    And I'm sorry to say - but most programmers are quite bad."

    So, either you agree with him or we have an unfortunate departure of one
    of the most foremost members of Gforth. Because this states - in no
    uncertain words - that Forth programmers *ARE* elite.

    What departure? We disagree on a number of things.

    And the issue is not whether Forth programmers or any other
    programmers are elite, but that many programmers think that they are
    elite (whether they are or aren't) and that the designers or advocates
    of deficient programming systems make use of that to dupe them, along
    the lines of: "You as elite programmers can cope with this deficiency
    [of course they don't call it a deficiency], it's only subpar
    programmers [more elaborate denigrations are common, see Ulrich
    Drepper] who complain about it."

    In the case of Forth and locals this tactic has not worked very well,
    so even Forth, Inc. (who have been the most vocal among the commercial
    Forth providers about their dislike of locals) have implemented
    locals. But of course we see the echo of all of this still around
    here.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 26 00:34:19 2026
    From Newsgroup: comp.lang.forth

    In article <nnd$1196d1a5$0da70c85@6de98b5b6c1b0418>,
    Hans Bezemer <the.beez.speaks@gmail.com> wrote:
    <SNIP>
    It would be better to think deeply, find an original solution and learn.
    Like Albert with his brilliant ;: word.

    Chuck Moore invented and coined the ;: word.
    I came up with CO, which is similar, or maybe the same.

    <SNIP>

    Hans Bezemer

    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 26 00:51:56 2026
    From Newsgroup: comp.lang.forth

    In article <2026Apr25.122216@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    albert@spenarnc.xs4all.nl writes:
    String handling and move operation are the exception, because
    they are both simpler and faster in low level.
    Simpler is the argument (especially for i86).
    Faster is the bonus.

    In other words, Forth without locals is not well suited for words
    that have so much active data. That is also reflected in hardware
    designed for Forth, which got additional registers like A or B (or
    additional capabilities for the top of the return stack register R),
    which make it simpler and faster to implement such words.

    A definition of STRCMP in the paper is

    : strcmp { addr1 u1 addr2 u2 -- n }
    addr1 addr2
    u1 u2 min 0
    ?do { s1 s2 }
    s1 c@ s2 c@ - ?dup
    if
    unloop exit
    then
    s1 char+ s2 char+
    loop
    2drop
    u1 u2 - ;

    Compare this with
    REPZ CMPSB ; Intel 86
    once the registers are filled.
    There are some extra instructions to massage the resulting
    zero/carry into the required form (-1/0/1).

    I chose to implement the primitive CORA
    HEADER( {MEMORY},{CORA},{CORA},{addr1 addr2 len --- n},{CIF},
    {Compare the memory areas at forthvar({addr1}) and forthvar({addr2})
    over a length forthvar({len}) .
    For the first bytes that differ, return -1 if the byte
    from forthvar({addr1}) is less (unsigned) than the one from
    forthvar({addr2}), and 1 if it is greater.
    If all forthvar({len}) bytes are equal, return zero. }
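
    What that description specifies can be sketched in portable
    high-level Forth. This is not ciforth's implementation (which is a
    code word built around REPZ CMPSB); the name CORA-SKETCH is invented
    here, and the loop is only meant to pin down the result convention.

```forth
\ Sketch of the behaviour described for CORA, in high-level Forth.
\ CORA-SKETCH is a hypothetical name, not a ciforth word.
: cora-sketch ( addr1 addr2 len -- n )
  0 ?do
    over i + c@  over i + c@       ( addr1 addr2 c1 c2 )
    2dup <> if                     \ first differing byte pair
      < if -1 else 1 then          \ bytes are 0..255, so < acts as unsigned
      nip nip unloop exit          ( n )
    then
    2drop                          ( addr1 addr2 )
  loop
  2drop 0 ;                        \ all LEN bytes equal
```

    A strcmp over two counted ranges would call this with the smaller of
    the two lengths and fall back to comparing the lengths on a zero
    result; the standard word COMPARE from the String word set takes both
    lengths directly.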

    - anton
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 26 01:13:55 2026
    From Newsgroup: comp.lang.forth

    In article <nnd$548d4f1b$1e104571@905dda44db1f54ae>,
    Hans Bezemer <the.beez.speaks@gmail.com> wrote:
    This one is about a third bigger than yours - if we disregard the "UC",
    that is:

    : comp
    rot over - if drop 2drop true exit then
    0 ?do
    over i chars + c@ over i chars + c@ -
    if drop drop unloop true exit then
    loop drop drop false
    ;

    In 4tH, it is even visually more compact:

    : comp
    rot over - if drop 2drop true ;then
    0 ?do over i [] c@ over i [] c@ - if drop drop unloop true ;then loop
    drop drop false
    ;

    The extra length comes mainly from the three different possible exits:
    - It's not the same size (first line);
    - It's not the same content (exit within loop);
    - It's the same thing (after loop).

    I can't say I particularly like the use of "COUNT" here - because it
    actually represents "C@+" - except for the first run. Neither am I very
    happy with the BEGIN..WHILE..WHILE..REPEAT..THEN construct - but that's
    not your fault ;-)

    All that being said, I cannot deny it is a clever piece of code using
    the full capabilities of the language, bravo!

    The corresponding word in ciforth is
    ( caddr len dea -- matchflag dea )
    : ~MATCH >R OVER R@ >NFA @ $@ CORA R> SWAP ;
    The pointer in the input stream is compared to the name over the
    length of the name. (Its length is ignored).
    This works because the names of words in the
    dictionary cannot contain spaces. It also finds := in the context
    of a pascal interpreter where the text is ":=(a+eb)" provided
    := is a PREFIX word.
    (Also note that CORA can serve to make a strcmp. )
    Hans Bezemer
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 26 15:21:19 2026
    From Newsgroup: comp.lang.forth

    On 26/04/2026 3:21 am, Anton Ertl wrote:
    ...
    In the case of Forth and locals this tactic has not worked very well,
    so even Forth, Inc. (who have been the most vocal among the commercial
    Forth providers about their dislike of locals) have implemented
    locals.

    Well, they seemed reluctant to adopt {: :} having previously implemented
    and used ANS LOCALS| .

    If one examines Forth Inc's use of locals within SwiftForth, one finds it
    is the exception and confined to 'subpar' code. That they use locals so
    infrequently suggests to me it was more effort deciding *whether* to use
    locals. This, I can understand perfectly - 'Shall I use forth, or shall I
    use locals?' - as they represent different mindsets. I just can't do that.
    I feel no need to do that.

    AFAIK Forth Inc doesn't offer locals for its SwiftX products.

    ...

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Sat Apr 25 22:40:01 2026
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    And traditionally Forth has been implemented without locals, for the
    same reason: It takes less memory and, for the system implementor,
    less work

    A simple implementation of locals doesn't sound like that much work?
    Mostly you need a runtime scheme to make sure the locals are cleaned up
    in case of exceptions being thrown. If you're willing to ignore the
    standard you don't need to complicate the text interpreter much. I
    remember Mark Wills' TI99/4A Forth simply reserved 4 extra return stack
    cells at each level of subroutine call, that you could treat like
    registers. Maybe that burns too much memory for a very small CPU. I've
    imagined some alternate versions of COLON, e.g.
    : foo ( ... ) ; \ regular colon, no locals
    1: foo ( ... ) ; \ one local called A
    2: foo (... ) ; \ two locals, A and B
    ...
    4: foo (... ) ; \ four locals: A, B, C, D.

    In any case, when it comes to performance measurements on "simple
    interpreters" like the Gforth of 1994, Forth code with locals usually
    turns out to be slower and consume more memory than Forth code using
    (and trying to avoid) stack juggling.

    The slowdown doesn't surprise me but it's not that big a deal, compared
    to the slowdown of using interpreted Forth instead of assembly language
    in the first place. Words that don't use the locals won't be affected.

    ... looking at the code for Gforth for 3DUP.3 compared to the others,
    Gforth still uses more primitives ...

    That's a lot of code in the expansion! I wonder how it will look in a
    simple interpreter.

    You seem to argue that the random-access aspect of locals provides a
    performance advantage on simple systems, but in most cases, code using
    locals is at a performance disadvantage on such systems

    Well, if the slowdown is less than say 2x, I'd say the code cleanup
    matters more, due to the traditional 90/10 rule (maybe now 99/1) of
    where CPU cycles go. Code the hot spots for speed and the rest for
    convenience.

    (and traditionalists have often used that to argue against locals).

    The REAL traditionalists (machine language programmers) can use the same
    argument against Forth itself.

    Keeping at least one stack item in a register leads to a smaller and
    faster implementation, and is not more complex than keeping all the
    stack memory in RAM.

    That's only with a fancy compiler AND a requirement of the application
    code having statically determined stack effects. Traditional words like
    ?DUP would confuse this scheme amirite?

    A way to use RAM that is less frowned upon by Forth traditionalists is
    (global) variables. The fact that the use of global variables is
    frowned upon in the wider programming community for various reasons
    seems to pour oil into the fire of their elitism.

    I see what you mean by that. But, whole-program C compilers do
    something like register allocation to re-use those "global" cells when
    sets of them won't be needed at the same time. The Forth approach would
    need either a similar fancy compiler, or else require the programmer to
    do an error-prone manual memory layout process, or else burn memory
    unnecessarily for those cells whose usage doesn't overlap.

    Currently I'm thinking about an 8051 part which has 256 bytes of RAM, so
    that issue is potentially significant.
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 26 05:55:04 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    We do have N>R (https://forth-standard.org/standard/tools/NtoR). So if
    the whole problem is "there is no more room on the FP stack", there is
    a way out.

    That must be pretty new (it's not in gforth 0.7.3)

    It was accepted into Forth-200x at the 2010 standards meeting.

    so I wonder how
    helpful it really is.

    We have two uses in the Gforth sources. I.e., not particularly
    useful.

    In any case, it does not help with FP stack limitations at all,
    because N>R transfers cells from the data stack to the return stack.

    My take on FP stack depth limitations in some systems is that you use
    as much FP stack as you need, and a Forth system (like Gforth) where
    you can make the FP stack as deep as available memory and address
    space permit, and publish that. Maybe it will inspire the system
    implementors with shallow FP stacks to provide deep FP stacks, at
    least optionally.

    However, when I did something that required a deep FP stack (adding up
    an array with pairwise addition
    <2025Jul16.132504@mips.complang.tuwien.ac.at>), I actually worked
    around the limitations of systems that only provide a shallow FP
    stack. But that was easy enough in that case.

    Concerning systems with FP stack limits, AFAIK VFX has FP packages
    that support very deep stacks, including the SSE-based package that
    used to be the default in VFX64 for a while.

    iForth implements a deep stack: it uses the 387 stack within a
    definition and stores the FP stack items that are on the 387 stack to
    memory on calls, and if the FP stack would overflow from the
    computations within a word. I think this is a good approach: Much FP
    computation time is spent in words that do not call other words, or at
    least the FP stack items do not live across the calls. iForth seems
    to overdo it, however, even code like

    : bar
    dup f@ cell+ dup f@ cell+ dup f@ cell+
    dup f@ cell+ dup f@ cell+ dup f@ cell+
    f+ f+ f+ f+ f+ ;

    which uses only 6 FP stack items does not produce the obvious code,
    but something significantly longer: It first performs 6 FLD
    instructions corresponding to the 6 F@, then stores 4 FP items,
    presumably on the memory FP stack, and only then starts the additions
    (interleaved with some other code).

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Sun Apr 26 00:28:06 2026
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    In any case, it does not help with FP stack limitations at all,
    because N>R transfers cells from the data stack to the return stack.

    In the code I mentioned, I wasn't running out of FP stack space, but
    rather, I didn't see how to write the function in any non-horrible way
    without using FP locals. Horrible ways included: 1) implementing a
    separate FP stack in memory for intermediate values during the
    recursion, or 2) using ugly hacks to stash FP values on the regular data
    stack.

    N>R was suggested as a way to implement horribleness #2 but it would
    actually have to be FN>R or something like that.
  • From peter@peter.noreply@tin.it to comp.lang.forth on Sun Apr 26 09:57:52 2026
    From Newsgroup: comp.lang.forth

    On Sun, 26 Apr 2026 05:55:04 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Paul Rubin <no.email@nospam.invalid> writes:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    We do have N>R (https://forth-standard.org/standard/tools/NtoR). So if
    the whole problem is "there is no more room on the FP stack", there is
    a way out.

    That must be pretty new (it's not in gforth 0.7.3)

    It was accepted into Forth-200x at the 2010 standards meeting.

    so I wonder how
    helpful it really is.

    We have two uses in the Gforth sources. I.e., not particularly
    useful.

    In any case, it does not help with FP stack limitations at all,
    because N>R transfers cells from the data stack to the return stack.

    My take on FP stack depth limitations in some systems is that you use
    as much FP stack as you need, and a Forth system (like Gforth) where
    you can make the FP stack as deep as available memory and address
    space permit, and publish that. Maybe it will inspire the system
    implementors with shallow FP stacks to provide deep FP stacks, at
    least optionally.

    However, when I did something that required a deep FP stack (adding up
    an array with pairwise addition
    <2025Jul16.132504@mips.complang.tuwien.ac.at>), I actually worked
    around the limitations of systems that only provide a shallow FP
    stack. But that was easy enough in that case.

    Concerning systems with FP stack limits, AFAIK VFX has FP packages
    that support very deep stacks, including the SSE-based package that
    used to be the default in VFX64 for a while.

    iForth implements a deep stack: it uses the 387 stack within a
    definition and stores the FP stack items that are on the 387 stack to
    memory on calls, and if the FP stack would overflow from the
    computations within a word. I think this is a good approach: Much FP
    computation time is spent in words that do not call other words, or at
    least the FP stack items do not live across the calls. iForth seems
    to overdo it, however, even code like

    : bar
    dup f@ cell+ dup f@ cell+ dup f@ cell+
    dup f@ cell+ dup f@ cell+ dup f@ cell+
    f+ f+ f+ f+ f+ ;

    which uses only 6 FP stack items does not produce the obvious code,
    but something significantly longer: It first performs 6 FLD
    instructions corresponding to the 6 F@, then stores 4 FP items,
    presumably on the memory FP stack, and only then starts the additions
    (interleaved with some other code).

    - anton

    lxf uses the cpu FP stack. I think that is one of the worst decisions
    I made for it. It will fail on all but the simplest complex fp math
    operations. For lxf64 a priority was to have a separate in memory
    FP stack. It has worked out very well!

    Your bar example becomes

    seea bar
    0x4282B0 C5FB100B vmovsd xmm1, qword [rbx]
    0x4282B4 4883C308 add rbx, 0x8
    0x4282B8 C5FB1013 vmovsd xmm2, qword [rbx]
    0x4282BC 4883C308 add rbx, 0x8
    0x4282C0 C5FB101B vmovsd xmm3, qword [rbx]
    0x4282C4 4883C308 add rbx, 0x8
    0x4282C8 C5FB1023 vmovsd xmm4, qword [rbx]
    0x4282CC 4883C308 add rbx, 0x8
    0x4282D0 C5FB102B vmovsd xmm5, qword [rbx]
    0x4282D4 4883C308 add rbx, 0x8
    0x4282D8 C5FB1033 vmovsd xmm6, qword [rbx]
    0x4282DC 4883C308 add rbx, 0x8
    0x4282E0 C5CB58F5 vaddsd xmm6, xmm6, xmm5
    0x4282E4 C5CB58F4 vaddsd xmm6, xmm6, xmm4
    0x4282E8 C5CB58F3 vaddsd xmm6, xmm6, xmm3
    0x4282EC C5CB58F2 vaddsd xmm6, xmm6, xmm2
    0x4282F0 C5CB58F1 vaddsd xmm6, xmm6, xmm1
    0x4282F4 C4C17B1145F8 vmovsd qword [r13-0x8], xmm0
    0x4282FA C5FB10C6 vmovsd xmm0, xmm0, xmm6
    0x4282FE 4D8D6DF8 lea r13, [r13-0x8]
    0x428302 C3 ret
    83 bytes, 21 instructions

    As seen all the fp registers can be put to good use!

    BR
    Peter


  • From dxf@dxforth@gmail.com to comp.lang.forth on Sun Apr 26 19:55:06 2026
    From Newsgroup: comp.lang.forth

    On 26/04/2026 5:28 pm, Paul Rubin wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    In any case, it does not help with FP stack limitations at all,
    because N>R transfers cells from the data stack to the return stack.

    In the code I mentioned, I wasn't running out of FP stack space, but
    rather, I didn't see how to write the function in any non-horrible way
    without using FP locals. Horrible ways included: 1) implementing a
    separate FP stack in memory for intermediate values during the
    recursion, or 2) using ugly hacks to stash FP values on the regular data
    stack.

    N>R was suggested as a way to implement horribleness #2 but it would
    actually have to be FN>R or something like that.

    Probably the flocals are more complicated but users rarely look beyond
    the interface.

  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 26 14:34:29 2026
    From Newsgroup: comp.lang.forth

    In article <20260426095752.00006baa@tin.it>,
    peter <peter.noreply@tin.it> wrote:
    <SNIP>
    lxf uses the cpu FP stack. I think that is one of the worst decisions
    I made for it. It will fail on all but the simplest complex fp math
    operations. For lxf64 a priority was to have a separate in memory
    FP stack. It has worked out very well!

    You give up on portable fp programs. The standard guarantees only
    8 items. So for "all but the simplest" you must take care.

    P.S. The example Anton gave is silly.
    MHX undoubtedly programs more like this:

    [ this example is ciforth code. Only 64 bits floats. ]
    WANT -fp-
    : BAR 0_ ( addr -- ) ( F: -- f )
    6 0 DO
    DUP F@ F+ CELL+ CELL+
    LOOP DROP ;

    How convenient, I could have added 100 floats !

    In building the transputer Forth I was obliged to generate
    Chebyshev approximations for every transcendental function.
    You have to do that too if you forego the Intel fp stack.

    The Intel internal stack gives FSIN FEXP etc. in a single instruction.
    CODE FCOS FCOS, NEXT, END-CODE
    CODE FLOG FLDLG2, FXCH, ST1| FYL2X, NEXT, END-CODE
    CODE 0_ FLDZ, NEXT, END-CODE
    BR
    Peter


    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From albert@albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Apr 26 14:55:52 2026
    From Newsgroup: comp.lang.forth

    In article <87340i46vi.fsf@nightsong.com>,
    Paul Rubin <no.email@nospam.invalid> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    And traditionally Forth has been implemented without locals, for the
    same reason: It takes less memory and, for the system implementor,
    less work

    A simple implementation of locals doesn't sound like that much work?
    Mostly you need a runtime scheme to make sure the locals are cleaned up
    in case of exceptions being thrown. If you're willing to ignore the
    standard you don't need to complicate the text interpreter much. I

    I get by with one screen, ignoring the standard only in the sense that
    LOCALs should be recursive.

    For a long time FORTRAN was the way to go. There are no locals in
    FORTRAN, only (c-speak) static variables, no recursion.
    Modules, including "named commons" only perform name hiding amongst each
    other.

    For decades the equivalent in Forth was
    (This is present in a lot of Marcel Hendrix programs. )
    PRIVATES
    VARIABLE x PRIVATE
    : aap .... x .. ;
    DEPRIVE
    This prevents visibility of x in the remainder of the program.
    It doesn't catch on.

    Namespaces ("wordlists") also provide the same functionality for hiding.
    You can emulate a FORTRAN program using wordlists.
    This is much more powerful than defining LOCAL then DLOCAL then FLOCAL
    then DFLOCAL then scratching your head inventing arrays of xxLOCAL
    stuff.

    Groetjes Albert
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Apr 26 15:08:28 2026
    From Newsgroup: comp.lang.forth

    On 25-04-2026 19:21, Anton Ertl wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 25-04-2026 07:26, Anton Ertl wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    [reinserted deleted, relevant context]
    If you want to use a language that is "ideologically devoted" to the
    architecture, maybe you shouldn't use Forth at all - and stick with C.

    I don't see anything about C that is closer to the hardware than Forth
    is, and I think that both languages are about equally '"ideologically
    devoted" to the architecture'. In particular, a C local variable is
    no closer to a register (the most efficient hardware feature for
    storing data) than a stack item or return stack item is, and register
    allocation of any of the three is similarly difficult (with big
    differences in difficulty between solutions that provide some register
    allocation to those that are so reliable that you usually count on
    them).

    Well, you're actually shooting at Paul Rubin - not at me. Thank you! I
    take all the help I can get!

    Actually, this whole paragraph is a reaction to your statement, not
    his. You deleted it for whatever reason, so I reinserted it.
    Concerning Paul Rubin, just because he is wrong does not mean you are
    right.

    I leave it here, because it doesn't hurt my point in any way whatsoever.

    What you obviously fail to recognize is that I'm just using a debating
    technique. You see, I'm not too interested in hardware. To me it's just
    a bottle to get to the soda - i.e. you need hardware to run a program.
    I'm not completely ignorant on the subject, but I'm not an expert by any
    measure.

    So what I do is *assume* the statement is true - and work out the
    consequences. In this case, if Forth is not the right language for an
    x86_64 architecture, why not turn to the most logical candidate (in this
    case, C, because it features local variables) instead of manhandling
    this alien concept into the Forth language?

    Yes, it would also mean I'm using an "inferior language", but watch my
    face and see how much I care.

    What you actually do is nullify his original statement, so there is no
    reason for either local variables or me changing my favorite language.

    Effectively, it's turned into a Catch-22. Look it up, if you don't know
    what this means. So yes - I win in both cases. Take it or leave it. :-)

    (..) and they often find some arguments that appeal to
    elitism (i.e., only the chosen ones can use this programming language
    for the elite as it should be used, and the others should program in
    Python or "should never have been allowed to touch a keyboard" (Ulrich
    Drepper).

    It's your own pal Bernd that said: "A good programmer will write even
    better code in Forth. A bad programmer will write abysmal code in Forth.
    And I'm sorry to say - but most programmers are quite bad."

    So, either you agree with him or we have an unfortunate departure of one
    of the foremost members of Gforth. Because this states - in no
    uncertain words - that Forth programmers *ARE* elite.

    What departure? We disagree on a number of things.

    You must be great friends! :-)

    And the issue is not whether Forth programmers or any other
    programmers are elite, but that many programmers think that they are
    elite (whether they are or aren't) and that the designers or advocates
    of deficient programming systems make use of that to dupe them, along
    the lines of: "You as elite programmers can cope with this deficiency
    [of course they don't call it a deficiency], it's only subpar
    programmers [more elaborate denigrations are common, see Ulrich
    Drepper] who complain about it."

    "Deficient" can be considered a secondary quality in Locke's ideas on
    properties, which makes the entire discussion futile at best. BTW, the
    same goes for "elite or subpar programmers". In order to validate such a
    discussion one might need to agree on standards. Which we rarely do, is
    my experience ;-)

    But I noticed you get triggered when dividing the world into "elite" and
    "subpar" programmers. Nietzsche called that (real or performed) humility
    "slave mentality". I see that a lot in governmental agencies. Everyone
    is afraid to stand up and put out their ideas - because you never know
    who is gonna punish you for challenging the boss.

    I can tell you with utmost certainty it kills innovation - and drives
    your best people out of your organization. So I can't stand it. I truly
    believe challenging ideas - no matter how established - is the only way
    forward. The point is, the only true difference between ideas is, which
    achieve the desired result - and which don't.

    The mindset classical Forth breeds has done wonders for me. And that
    experience simply cannot be denied.

    In the case of Forth and locals this tactic has not worked very well,
    so even Forth, Inc. (who have been the most vocal among the commercial
    Forth providers about their dislike of locals) have implemented
    locals. But of course we see the echo of all of this still around
    here.

    That is the most ridiculous argument I've ever seen appear from your
    hand. Really! Let me take myself as an example. I'm *NOT* a fan of
    locals, agree?

    But I have and maintain *FOUR* different "locals" libraries and *THREE*
    preprocessor libraries. Darn, I even got libraries for PICK, ROLL and
    ?DUP - which I usually refuse to touch without having a cross in my
    hands.

    You know, I think this statement says more about you than about me or
    Forth, Inc. Yeah, sometimes I port a library with locals or deep stack
    operators.

    And I think there are situations where one of my users might need those
    cursed words. So, I see no need to say: "You are not worthy of Forth. Go
    to hell, you sinner - and repent for your questionable choices!"

    They chose how they use my product. They don't need me to do that for
    them. So I provide it, no problem.

    But thank you for providing this insight. Although it is *completely*
    contrary to your last argument, it explains a lot.

    Hans Bezemer

  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Apr 26 15:10:37 2026
    From Newsgroup: comp.lang.forth

    On 26-04-2026 00:34, albert@spenarnc.xs4all.nl wrote:
    In article <nnd$1196d1a5$0da70c85@6de98b5b6c1b0418>,
    Hans Bezemer <the.beez.speaks@gmail.com> wrote:
    <SNIP>
    It would be better to think deeply, find an original solution and learn.
    Like Albert with his brilliant ;: word.

    Chuck Moore invented and coined the ;: word.
    I came up with CO, which is similar, or maybe the same.

    Thank you for that correction! Consider my mistake as a sign of respect ;-)

    Hans Bezemer

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 26 09:50:59 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    And traditionally Forth has been implemented without locals, for the
    same reason: It takes less memory and, for the system implementor,
    less work

    A simple implementation of locals doesn't sound like that much work?

    Bernd Paysan wrote a simple locals implementation
    <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/locals.fs>
    that takes 84 SLOC:

    [~/gforth:167833] cat locals.fs|grep -v '^\\'|grep -v '^$'|wc -l
    84

    When loaded it takes 3096 bytes on a 32-bit gforth-0.4.0, so at least
    1500 bytes on a system with 16-bit cells. Given the memory limits in
    the old days, it's no surprise that they did without that at first.
    Later a number of Forth programmers were proud of their skill in
    working without locals and found reasons (or, maybe, justifications)
    why it was still relevant when memory was no longer so scarce. You
    can read some of those reasons in this thread.

    I've
    imagined some alternate versions of COLON, e.g.
    : foo ( ... ) ; \ regular colon, no locals
    1: foo ( ... ) ; \ one local called A
    2: foo (... ) ; \ two locals, A and B
    ...
    4: foo (... ) ; \ four locals: A, B, C, D.

    If you cannot choose the names, locals lose a lot of their benefits in
    making the code more understandable (OTOH, mathematicians have made do
    with similar naming schemes for a long time). You might then just as
    well work with >R >R >R >R and R@, R'@, 2 RPICK and 3 RPICK.

    The slowdown doesn't surprise me but it's not that big a deal, compared
    to the slowdown of using interpreted Forth instead of assembly language
    in the first place.

    It means that the argument line about locals making better use of the
    random-access memory provided by hardware does not hold water.

    As for assembly language, that has been part of Forth since the
    beginning, and telling people to write code words has not only been
    suggested in cases where more performance was necessary than
    high-level Forth provided [1], but also in cases like the strcmp
    example where so many values are active at the same time that
    high-level Forth without locals becomes cumbersome. We have also seen
    that in this thread.

    [1] In a reversal of earlier Forth marketing, IIRC VFX was later
    described as having the benefit of no longer needing to write code
    words.

    ... looking at the code for Gforth for 3DUP.3 compared to the others,
    Gforth still uses more primitives ...

    That's a lot of code in the expansion! I wonder how it will look in a
    simple interpreter.

    In the code you see the threaded code interspersed with the native
    code. If you ignore the native code, you see what a simple
    interpreter would see (if it had a locals implementation that produced
    code similar to that of Gforth).

    You seem to argue that the random-access aspect of locals provides a
    performance advantage on simple systems, but in most cases, code using
    locals is at a performance disadvantage on such systems

    Well, if the slowdown is less than say 2x, I'd say the code cleanup
    matters more, due to the traditional 90/10 rule (maybe now 99/1) of
    where CPU cycles go. Code the hot spots for speed and the rest for
    convenience.

    So it's "code cleanup", not making use of hardware facilities for
    efficiency on simple interpreters, that you see as the benefit of
    locals.

    Keeping at least one stack item in a register leads to a smaller and
    faster implementation, and is not more complex than keeping all the
    stack memory in RAM.

    That's only with a fancy compiler AND a requirement of the application
    code having statically determined stack effects. Traditional words like
    ?DUP would confuse this scheme amirite?

    No. No fancy compiler; the compiler does not know about how the stack
    is represented. No statically determined stack effect necessary,
    because every word begins and ends in the same stack representation;
    Even with multi-representation stack-caching as used since Gforth 0.7
    (which does require more compiler smarts), no statically determined
    stack effect is necessary, because the code generator returns to the
    canonical state on control-flow.

    ?DUP also benefits: Implementation when TOS is in memory:

    tmp = sp[0]
    if tmp == 0 goto done
    sp = sp - cell
    sp[0] = tmp
    done:
    NEXT

    Implementation when TOS is in a register:

    if TOS == 0 goto done
    sp = sp - cell
    sp[0] = TOS #if SP points to the second item
    done:
    NEXT

    So the first instruction is omitted. The code that gcc generates
    for Gforth (TOS in memory for gforth, TOS in register for gforth-fast)
    is suboptimal, but if you really want, you can inspect it with SEE
    ?DUP and puzzle out which instruction corresponds to which part of the
    pseudocode above, and which instructions are just a sign of
    suboptimality.
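
    The two pseudocode variants above can be modelled in a few lines of
    Python. This is only a rough sketch, not Gforth's actual implementation:
    the function names and the load/store counters are made up for
    illustration. The data stack is a list with the top at the end, and the
    counters track accesses to stack memory.

```python
def qdup_tos_in_memory(stack):
    """?DUP with every stack item in memory. Returns (stack, loads, stores)."""
    tmp = stack[-1]            # tmp = sp[0]: one load from stack memory
    loads, stores = 1, 0
    if tmp != 0:
        stack = stack + [tmp]  # sp = sp - cell; sp[0] = tmp: one store
        stores = 1
    return stack, loads, stores


def qdup_tos_in_register(rest, tos):
    """?DUP with the top of stack cached in a register (here: the tos argument).

    `rest` holds the items below TOS that live in memory.
    Returns (rest, tos, loads, stores).
    """
    loads, stores = 0, 0       # TOS is already in a register: no load needed
    if tos != 0:
        rest = rest + [tos]    # sp = sp - cell; sp[0] = TOS: one store
        stores = 1
    return rest, tos, loads, stores


if __name__ == "__main__":
    print(qdup_tos_in_memory([5, 7]))
    print(qdup_tos_in_register([5], 7))
```

    In this model the register variant saves exactly the initial load, which
    matches the observation that the first instruction disappears.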

    A way to use RAM that is less frowned upon by Forth traditionalists is
    (global) variables. The fact that the use of global variables is
    frowned upon in the wider programming community for various reasons
    seems to pour oil into the fire of their elitism.

    I see what you mean by that. But, whole-program C compilers do
    something like register allocation to re-use those "global" cells when
    sets of them won't be needed at the same time. The Forth approach would
    need either a similar fancy compiler, or else require the programmer to
    do an error-prone manual memory layout process, or else burn memory
    unnecessarily for those cells whose usage doesn't overlap.

    Yes. It's even worse: Such variables are often user variables. But
    looking at the usage of such things in Forth systems, we have user
    variables like BASE and HLD (in F83, HOLDPTR in gforth). They are
    used across multiple words, and the fact that you don't have to pass
    them and put them into a local has been touted as an advantage over
    locals: Definitions that use global variables are easier to factor.

    BASE lives during the whole session, and its memory cannot be reused.
    The memory of HLD lives only between <# and #>, and could be reused,
    but has not been.

    In any case, this approach is not taken often, and when it is, often
    to good effect (that may be survivor's bias). I don't see a lot of
    overlap with the cases where one uses locals, but one can argue that
    it reduces stack pressure in those places where one would otherwise be
    tempted to use locals.

    Another case of reducing stack pressure is ?DO...LOOP and friends.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Sun Apr 26 16:22:50 2026
    From Newsgroup: comp.lang.forth

    On 26-04-2026 11:50, Anton Ertl wrote:
    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    And traditionally Forth has been implemented without locals, for the
    same reason: It takes less memory and, for the system implementor,
    less work

    A simple implementation of locals doesn't sound like that much work?

    Bernd Paysan wrote a simple locals implementation
    <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/locals.fs>
    that takes 84 SLOC:

    With all respect to Bernd, but yeah - compare that to this 0.5 SLOC
    implementation of local:

    : local r> swap dup >r @ >r ;: r> r> ! ;
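
    For readers unfamiliar with the idiom, here is a rough Python analogue of
    what this LOCAL appears to do. It assumes that ;: resumes the calling
    word and returns control to the code after it once the caller exits; that
    reading is an assumption, not something confirmed in this thread. The
    names `local` and `cells` are hypothetical, and the try/finally adds
    exception safety that the Forth one-liner may not have.

```python
from contextlib import contextmanager


@contextmanager
def local(cells, name):
    """Save a variable's value now and restore it when the caller is done."""
    saved = cells[name]          # roughly: dup >r @ >r (remember address and value)
    try:
        yield                    # roughly: ;: (the rest of the caller runs here)
    finally:
        cells[name] = saved      # roughly: r> r> ! (restore the old value)


cells = {"x": 1}                 # stands in for a Forth VARIABLE named x


def inner():
    with local(cells, "x"):
        cells["x"] = 99          # temporary value, visible only until exit
        return cells["x"]


if __name__ == "__main__":
    print(inner(), cells["x"])
```

    Seen this way, the word gives an existing global variable a temporary
    value for the duration of one definition, which differs from the named
    stack-based locals that locals.fs provides.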

    Hans Bezemer

  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 26 14:03:03 2026
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    I recently reviewed the string comparison for search-wordlist
    and came up with the following

    The string stored in the word header is already uppercased.
    So string comparison will be case insensitive

    : UC ( c -- c' ) \ uppercase char
        dup $61 $7B within $20 and - ;


    : NCOMP4 ( addr n addr' n' - f) \ 0 is match
        dup >r
        begin
            rot = while              \ str cstr
                r> dup 1- >r
            while                    \ str cstr
                swap count uc        \ cstr str' s1
                rot count            \ str' s1 cstr' c1
        repeat
            2drop r> drop 0 exit
        then
        2drop r> drop 1 ;

    First iteration in the loop it does not compare chars but the length!

    Clever, but, at least without comment, too clever.
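
    To spell out the trick, here is a small Python model of UC and NCOMP4 (a
    sketch for illustration only; the names `uc` and `ncomp` and the use of
    bytes objects are assumptions, not part of the posted code). The pair of
    values compared at the top of the loop starts out as the two lengths and
    only afterwards becomes a pair of characters.

```python
def uc(c: int) -> int:
    """Uppercase an ASCII code the way UC does: mask $20 with the WITHIN flag."""
    flag = -1 if 0x61 <= c < 0x7B else 0   # a Forth true flag has all bits set
    return c - (flag & 0x20)


def ncomp(name: bytes, header: bytes) -> int:
    """Return 0 on match, 1 otherwise; header is assumed already uppercased."""
    a, b = len(name), len(header)   # the first pair compared is the two lengths
    remaining = len(header)         # loop counter, kept on the return stack in Forth
    i = 0
    while a == b:                   # rot = while
        if remaining == 0:          # second WHILE fails: every char matched
            return 0
        remaining -= 1
        a, b = uc(name[i]), header[i]   # next pair: one char from each string
        i += 1
    return 1                        # the lengths or some char pair differed


if __name__ == "__main__":
    print(ncomp(b"dup", b"DUP"), ncomp(b"dup", b"DUPE"), ncomp(b"dux", b"DUP"))
```

    Because the lengths are checked as the very first pair, the character
    loop never runs for strings of different length.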

    This code, and, more clearly, Hans Bezemer's version demonstrate that
    STR= is easier than COMPARE, STRCMP, or STR<, because you can deal
    with the case of length difference right at the start, whereas the
    latter words have to check the characters up to the end of the shorter
    string first before dealing with the length. This shows the greatest
    benefit in cases like

    s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp

    As for STRCMP, I have measured the five versions shown in my earlier
    posting (whole program posted below), with the bugs fixed, and the
    ?DUP IF replaced by DUP IF ... THEN DROP, because it produces better
    code.

    I have also included the following versions:

    : strcmp { addr1 u1 addr2 u2 -- n }
        u1 u2 min 0
        ?do
            addr1 c@ addr2 c@ - ?dup
            if
                unloop exit
            then
            addr1 char+ TO addr1
            addr2 char+ TO addr2
        loop
        u1 u2 - ;

    This comes from the '94 paper and is the version that uses TO instead
    of defining new locals at every iteration. Paul Rubin will love the
    code that current Gforth produces for "addr2 char+ TO addr2":

    <strcmp+$E0> @local2 1->2
    $7F337DA71BBA: mov 0x10(%rbp),%r15
    <strcmp+$E8> char+ 2->2
    $7F337DA71BBE: add $0x1,%r15
    <strcmp+$F0> !local2 2->1
    $7F337DA71BC2: mov %r15,0x10(%rbp)

    The TO <local> code was not that efficient in earlier Gforth versions.

    The other version I added is:

    : strcmp ( addr1 u1 addr2 u2 -- n )
        rot 2dup 2>r min 0 ?do ( addr1 addr2 )
            over c@ over c@ - dup if
                nip nip 2rdrop unloop exit then
            drop
            char+ swap char+ swap
        loop
        2drop r> r> - ;

    This is the STRCMP3 from <2024Apr9.175958@mips.complang.tuwien.ac.at>
    and may be the locals-less version I compared against in the '94
    paper.

    I also included your version (without the UC call) and Hans Bezemer's
    version.

    I benchmarked two Forth systems, gforth-fast and gforth-itc.
    gforth-itc uses indirect-threaded code and should perform similarly to
    the "simple interpreters" that Paul Rubin had in mind.

    I ran three different benchmarks on these words, which performed the
    following a number of times:

    s" 0123456789abcdefg" 2dup strcmp drop                   \ bench1
    s" 0123456789abcdefg" s" 2123456789abcdefg" strcmp drop  \ bench2
    s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp drop \ bench3

    In bench1 the strings are equal and everything has to be compared. In
    bench2 the strings have the same length, but differ in the first char,
    so the loop can terminate after the first char. In bench3 the strings
    have different length, but all chars that both strings have are the
    same. In the latter case versionpeter and versionbezemer have an
    advantage from not performing the same functionality.
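
    As a reference for what the three benchmarks exercise, here is a small
    Python model (a sketch only: the function names and the comparison
    counter are assumptions added for illustration, and it says nothing
    about cycle counts). It returns how many character pairs each approach
    has to look at.

```python
def strcmp(s1: bytes, s2: bytes):
    """Full three-way compare. Returns (result, char_pairs_compared)."""
    compared = 0
    for c1, c2 in zip(s1, s2):          # runs over min(u1, u2) characters
        compared += 1
        if c1 != c2:
            return c1 - c2, compared    # difference of the first mismatching chars
    return len(s1) - len(s2), compared  # u1 u2 - when all shared chars match


def str_differs(s1: bytes, s2: bytes):
    """Equality-only compare. Returns (differs, char_pairs_compared)."""
    if len(s1) != len(s2):              # length check comes first
        return True, 0
    compared = 0
    for c1, c2 in zip(s1, s2):
        compared += 1
        if c1 != c2:
            return True, compared
    return False, compared


BASE = b"0123456789abcdefg"
BENCHES = {
    "bench1": (BASE, BASE),
    "bench2": (BASE, b"2123456789abcdefg"),
    "bench3": (BASE, BASE + b"h"),
}

if __name__ == "__main__":
    for name, (a, b) in BENCHES.items():
        print(name, strcmp(a, b), str_differs(a, b))
```

    In this model the full compare inspects 17, 1 and 17 character pairs for
    bench1, bench2 and bench3, while the equality-only compare inspects none
    for bench3.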

    The cycles numbers are per invocation of STRCMP, including benchmark
    overhead.
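
    The per-invocation figure is simply the total user-mode cycle count
    divided by the iteration count, which is what the awk one-liner in the
    measurement script computes from the first CSV field. A minimal Python
    sketch of that arithmetic follows; the sample line is made up and the
    function name is hypothetical.

```python
def cycles_per_invocation(perf_csv_line: str, iterations: int) -> float:
    """Divide the counter value (first CSV field) by the iteration count."""
    total_cycles = float(perf_csv_line.split(",")[0])
    return total_cycles / iterations


# Hypothetical line in the style of `perf stat -x,` output, not measured data.
SAMPLE = "10950000000,,cycles:u,1234567890,100.00,,"

if __name__ == "__main__":
    print(f"{cycles_per_invocation(SAMPLE, 100_000_000):6.1f}")
```

    Since the whole process is measured, system startup and the benchmark
    loop itself are included in the total, hence "including benchmark
    overhead".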

    The benchmarks are run on a Ryzen 8700G (Zen4).

    In addition to the cycles, I also show the bytes of the native code of
    the whole word in gforth-fast on AMD64 (without the final jmp (2
    Bytes)), and of the loop (including the code for the if...then).

       Bytes    | cycles gforth-fast   | cycles gforth-itc    |
    strcmp loop | bench1 bench2 bench3 | bench1 bench2 bench3 |
       262  127 |  109.5   16.6  109.4 | 1732.7  147.4 1724.5 | version0
       303  151 |  164.2   17.2  164.4 | 1714.1  170.4 1613.5 | version1
       257  122 |  105.3   17.4  105.1 | 1496.7  166.4 1493.0 | version2
       280  113 |   98.6   19.2   99.0 | 1230.1  194.4 1116.2 | version3
       267  118 |   91.2   17.9   91.2 | 1268.6  198.4 1269.0 | version4
       273  108 |   89.9   17.0   90.0 | 1136.0  178.4 1138.9 | version5
       261  128 |  121.1   14.6  118.5 | 1221.4  131.3 1213.3 | version6
       210  142 |  137.5   15.4    9.5 | 1244.4  155.3   78.3 | versionpeter
       260  119 |  107.8   16.4    9.8 | 1186.2  134.5   71.3 | versionbezemer

    So the champion among the full-featured strcmps for bench1 and bench3
    is version5, for bench2 version6. The str= variants are much faster
    for bench3 (of course), but slower than several other versions for
    bench1 and slower than version6 for bench2. The native code size is
    smallest for version2 (among the full-featured strcmp
    implementations), so the locals-less versions do not win everything.

    So locals-less (version5 and version6) is somewhat faster on both
    gforth-fast and gforth-itc.

    lxf has a more efficient locals implementation. Let's see how it
    fares. It does not support the usage in version1, so I leave that
    away.

    cycles lxf
    bench1 bench2 bench3
      79.9   12.0   79.9 version0
      99.6   12.0   99.6 version2
      98.8   14.1   98.1 version3
      86.0   13.2   86.0 version4
      84.1   12.6   84.2 version5
      88.7   10.0   92.8 version6
      98.3   10.0    6.0 versionpeter
      72.1    9.5    6.0 versionbezemer

    On lxf version0 (with locals) is the fastest for bench1 and bench3,
    and version6 is the fastest for bench2. Hans Bezemer's version wins
    everything if we are only interested in str= functionality.

    And here's the code (measurement scripts at the bottom):
    ----------------------------------------------------------
    [defined] version0 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        u1 u2 min 0
        ?do
            addr1 c@ addr2 c@ - dup
            if
                unloop exit
            then
            drop
            addr1 char+ TO addr1
            addr2 char+ TO addr2
        loop
        u1 u2 - ;
    [then]

    [defined] version1 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        addr1 addr2
        u1 u2 min 0
        ?do {: s1 s2 :}
            s1 c@ s2 c@ - dup
            if
                unloop exit
            then
            drop s1 char+ s2 char+
        loop
        2drop
        u1 u2 - ;
    [then]

    [defined] version2 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        u1 u2 min 0
        ?do
            addr1 i + c@ addr2 i + c@ - dup
            if
                unloop exit
            then
            drop
        loop
        u1 u2 - ;
    [then]

    [defined] version3 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        addr2 addr1 - {: offset :}
        u1 u2 min addr1 + addr1 ?do
            i c@ i offset + c@ - dup
            if
                unloop exit
            then
            drop
        loop
        u1 u2 - ;
    [then]

    [defined] version4 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        addr2 addr1 - ( offset )
        u1 u2 min addr1 + addr1 ?do ( offset )
            dup i + c@ i c@ - dup
            if
                nip negate unloop exit
            then
            drop
        loop
        drop u1 u2 - ;
    [then]

    [defined] version5 [if]
    : strcmp ( addr1 u1 addr2 u2 -- n )
        rot 2dup - >r   ( addr1 addr2 u2 u1 R: n1 ) \ n1 = u2-u1
        min -rot over - ( u12 addr1 offset R: n1 )
        swap rot bounds ( offset limit start R: n1 )
        ?do ( offset R: n1 loop-sys )
            dup i + c@ i c@ - dup
            if
                nip negate unloop r> drop exit
            then
            drop
        loop
        drop r> negate ;
    [then]

    [defined] version6 [if]
    [undefined] 2rdrop [if]
        : 2rdrop postpone 2r> postpone 2drop ; immediate
    [then]

    : strcmp ( addr1 u1 addr2 u2 -- n )
        rot 2dup 2>r min 0 ?do ( addr1 addr2 )
            over c@ over c@ - dup if
                nip nip 2rdrop unloop exit then
            drop
            char+ swap char+ swap
        loop
        2drop r> r> - ;
    [then]

    [defined] versionpeter [if]
    \ from <20260425160747.00007f4a@tin.it>
    \ renamed and deleted the call to UC
    : strcmp ( addr n addr' n' - f) \ 0 is match
        dup >r
        begin
            rot = while              \ str cstr
                r> dup 1- >r
            while                    \ str cstr
                swap count           \ cstr str' s1
                rot count            \ str' s1 cstr' c1
        repeat
            2drop r> drop 0 exit
        then
        2drop r> drop 1 ;
    [then]

    [defined] versionbezemer [if]
    \ from <nnd$548d4f1b$1e104571@905dda44db1f54ae>
    \ renamed
    : strcmp
        rot over - if drop 2drop true exit then
        0 ?do
            over i chars + c@ over i chars + c@ -
            if drop drop unloop true exit then
        loop drop drop false
    ;
    [then]

    [defined] t{ [if]
        t{ s" abc" s" abc" strcmp -> 0 }t
        t{ s" abc" s" abcd" strcmp -> -1 }t
        t{ s" abc" s" abd" strcmp -> -1 }t
        t{ s" abd" s" abc" strcmp -> 1 }t
        t{ s" cbc" s" abc" strcmp -> 2 }t
        t{ s" abc" s" adc" strcmp -> -2 }t
    [then]

    \ Benchmarks

    [undefined] iterations [if]
        100000000 constant iterations
    [then]

    : benchmark ( c-addr1 u1 c-addr2 u2 -- )
        iterations 0 do
            2over 2over strcmp drop
        loop
        2drop 2drop ;

    : bench1
        s" 0123456789abcdefg" 2dup benchmark ;

    : bench2
        s" 0123456789abcdefg" s" 2123456789abcdefg" benchmark ;

    : bench3
        s" 0123456789abcdefg" s" 0123456789abcdefgh" benchmark ;


    0 [if]
    # bash script for producing the cycles
    IFS=":"
    for i in 0 1 2 3 4 5 6 peter bezemer; do
        for forthit in gforth-fast:100000000 gforth-itc:10000000; do
            fields=($forthit); forth="${fields[0]}"; iterations="${fields[1]}"
            for bench in 1 2 3; do
                perf stat --log-fd 3 -x, -e cycles:u $forth -e "create version$i $iterations constant iterations" ~/forth/strcmp.4th -e "bench$bench bye" 3>&1 >/dev/null|
                    awk -F, '{printf "%6.1f ",$1/'$iterations'}'
            done
        done
        echo version$i
    done
    IFS=":"
    for i in 0 2 3 4 5 6 peter bezemer; do
        forth=lxf; iterations=100000000
        for bench in 1 2 3; do
            perf stat --log-fd 3 -x, -e cycles:u $forth "create version$i $iterations constant iterations include $HOME/forth/strcmp.4th bench$bench bye" 3>&1 >/dev/null|
                awk -F, '{printf "%6.1f ",$1/'$iterations'}'
        done
        echo version$i
    done
    [then]
    --------------------------------------------------------------

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Apr 26 17:04:39 2026
    From Newsgroup: comp.lang.forth

    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 26-04-2026 11:50, Anton Ertl wrote:
    Bernd Paysan wrote a simple locals implementation
    <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/locals.fs>
    that takes 84 SLOC:

    With all respect to Bernd, but yeah - compare that to this 0.5 SLOC
    implementation of local:

    : local r> swap dup >r @ >r ;: r> r> ! ;

    Let's see:

    [~:167902] gforth-0.5.0
    GForth 0.5.0, Copyright (C) 1995-2000 Free Software Foundation, Inc.
    GForth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    warnings off include locals.fs ok
    ok
    : local r> swap dup >r @ >r ;: r> r> ! ;
    *the terminal*:1: Undefined word
    : local r> swap dup >r @ >r ;: r> r> ! ;
    ^^
    Backtrace:
    $F7B5A158 throw
    $F7B6418C no.extensions

    Although, admittedly, while Bernd Paysan's locals.fs loads, it does
    not work AFAICT (I tried it on gforth-0.4 and gforth-0.5; it does not
    load on gforth-0.6 and later). Apparently it had bitrotted between
    the time when it was written in 1992 and gforth-0.4 in 1998.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
  • From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 27 11:12:03 2026
    From Newsgroup: comp.lang.forth

    On 26/04/2026 7:50 pm, Anton Ertl wrote:
    Paul Rubin <no.email@nospam.invalid> writes:
    ...
    I've
    imagined some alternate versions of COLON, e.g.
    : foo ( ... ) ; \ regular colon, no locals
    1: foo ( ... ) ; \ one local called A
    2: foo (... ) ; \ two locals, A and B
    ...
    4: foo (... ) ; \ four locals: A, B, C, D.

    If you cannot choose the names, locals lose a lot of their benefits in
    making the code more understandable (OTOH, mathematicians have made do
    with similar naming schemes for a long time). You might then just as
    well work with >R >R >R >R and R@, R'@, 2 RPICK and 3 RPICK.

    That Julian Noble (among others) felt the need for FTRAN INTRAN etc informs
    what scientists and academics really want - and it's a long way from the
    'stack based' locals offered by most forth systems. The latter represent
    a concession to forth before a user has even begun to consider identifiers.
    To an outsider, forth locals do nothing to ameliorate what they see as
    fundamentally broken about the language. ISTM if a forther has conceded to
    use stack-based locals, he can certainly make choices about what form
    identifiers take.

  • From dxf@dxforth@gmail.com to comp.lang.forth on Mon Apr 27 11:51:17 2026
    From Newsgroup: comp.lang.forth

    On 27/04/2026 3:04 am, Anton Ertl wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 26-04-2026 11:50, Anton Ertl wrote:
    Bernd Paysan wrote a simple locals implementation
    <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/locals.fs>
    that takes 84 SLOC:

    With all respect to Bernd, but yeah - compare that to this 0.5 SLOC
    implementation of local:

    : local r> swap dup >r @ >r ;: r> r> ! ;

    Let's see:

    [~:167902] gforth-0.5.0
    GForth 0.5.0, Copyright (C) 1995-2000 Free Software Foundation, Inc.
    GForth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    warnings off include locals.fs ok
    ok
    : local r> swap dup >r @ >r ;: r> r> ! ;
    *the terminal*:1: Undefined word

    That only tells what Gforth doesn't have. DX-Forth comes with four
    variants of locals. The next release will include a variant of FSL's
    flocals. It's not an endorsement of locals, rather a way of saying
    there's lots of ways to skin a cat and one isn't necessarily best.




  • From peter@peter.noreply@tin.it to comp.lang.forth on Mon Apr 27 09:31:03 2026
    From Newsgroup: comp.lang.forth

    On Sun, 26 Apr 2026 14:03:03 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    peter <peter.noreply@tin.it> writes:
    I recently reviewed the string comparison for search-wordlist
    and came up with the following

    The string stored in the word header is already uppercased.
    So string comparison will be case insensitive

    : UC ( c -- c' ) \ uppercase char
        dup $61 $7B within $20 and - ;


    : NCOMP4 ( addr n addr' n' - f) \ 0 is match
        dup >r
        begin
            rot = while             \ str cstr
            r> dup 1- >r
        while                       \ str cstr
            swap count uc           \ cstr str' s1
            rot count               \ str' s1 cstr' c1
        repeat
            2drop r> drop 0 exit
        then
        2drop r> drop 1 ;

    In its first iteration the loop does not compare chars but the lengths!

    Clever, but, at least without comment, too clever.
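
    A commented rendering of NCOMP4 may make the trick easier to follow. It
    is only a sketch of one reading of the code above: the definitions are
    unchanged, the comments are an interpretation, and it assumes (as stated
    above) that the second string is already uppercase.

    ```forth
    : UC ( c -- c' )  \ uppercase an ASCII lower-case letter
        dup $61 $7B within $20 and - ;

    : NCOMP4 ( addr n addr' n' -- f )  \ 0 is match, 1 is mismatch
        dup >r                \ R: number of chars still to compare
        begin
            rot =             \ pass 1: n = n' ?   later passes: c = c' ?
        while                 \ equal so far                ( str cstr )
            r> dup 1- >r      \ take the count, store count-1 for next pass
        while                 \ count was nonzero           ( str cstr )
            swap count uc     \ next char of str, uppercased ( cstr str' c )
            rot count         \ next char of cstr    ( str' c cstr' c' )
        repeat
            2drop r> drop 0 exit   \ count exhausted: everything matched
        then
        2drop r> drop 1 ;          \ lengths or a pair of chars differed

    s" dup" s" DUP"  ncomp4 .   \ prints 0
    s" dup" s" DUPE" ncomp4 .   \ prints 1 (lengths differ, no char is read)
    ```

    The second WHILE exits to the code after REPEAT, the first WHILE to the
    code after THEN, which is why the two exits can return different flags.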

    This code, and, more clearly, Hans Bezemer's version demonstrate that
    STR= is easier than COMPARE, STRCMP, or STR<, because you can deal
    with the case of length difference right at the start, whereas the
    latter words have to check the characters up to the end of the shorter
    string first before dealing with the length. This shows the greatest
    benefit in cases like

    s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp

    As for STRCMP, I have measured the five versions shown in my earlier
    posting (whole program posted below), with the bugs fixed, and the
    ?DUP IF replaced by DUP IF ... THEN DROP, because it produces better
    code.

    I have also included the following versions:

    : strcmp { addr1 u1 addr2 u2 -- n }
        u1 u2 min 0
        ?do
            addr1 c@ addr2 c@ - ?dup
            if
                unloop exit
            then
            addr1 char+ TO addr1
            addr2 char+ TO addr2
        loop
        u1 u2 - ;

    This comes from the '94 paper and is the version that uses TO instead
    of defining new locals at every iteration. Paul Rubin will love the
    code that current Gforth produces for "addr2 char+ TO addr2":

    <strcmp+$E0> @local2 1->2
    $7F337DA71BBA: mov 0x10(%rbp),%r15
    <strcmp+$E8> char+ 2->2
    $7F337DA71BBE: add $0x1,%r15
    <strcmp+$F0> !local2 2->1
    $7F337DA71BC2: mov %r15,0x10(%rbp)

    The TO <local> code was not that efficient in earlier Gforth versions.

    The other version I added is:

    : strcmp ( addr1 u1 addr2 u2 -- n )
        rot 2dup 2>r min 0 ?do ( addr1 addr2 )
            over c@ over c@ - dup if
                nip nip 2rdrop unloop exit then
            drop
            char+ swap char+ swap
        loop
        2drop r> r> - ;

    This is the STRCMP3 from <2024Apr9.175958@mips.complang.tuwien.ac.at>
    and may be the locals-less version I compared against in the '94
    paper.

    I also included your version (without the UC call) and Hans Bezemer's version.

    I benchmarked two Forth systems, gforth-fast and gforth-itc.
    gforth-itc uses indirect-threaded code and should perform similar to
    the "simple interpreters" that Paul Rubin had in mind.

    I ran three different benchmarks on these words, which performed the following a number of times:

    s" 0123456789abcdefg" 2dup strcmp drop                   \ bench1
    s" 0123456789abcdefg" s" 2123456789abcdefg" strcmp drop  \ bench2
    s" 0123456789abcdefg" s" 0123456789abcdefgh" strcmp drop \ bench3

    In bench1 the strings are equal and everything has to be compared. In
    bench2 the strings have the same length, but differ in the first char,
    so the loop can terminate after the first char. In bench3 the strings
    have different length, but all chars that both strings have are the
    same. In this last case versionpeter and versionbezemer have an
    advantage because they implement only the equality test (str=), not the
    full three-way comparison.

    The cycles numbers are per invocation of STRCMP, including benchmark overhead.

    The benchmarks are run on a Ryzen 8700G (Zen4).

    In addition to the cycles, I also show the bytes of the native code of
    the whole word in gforth-fast on AMD64 (without the final jmp (2
    Bytes)), and of the loop (including the code for the if...then).

       Bytes    | cycles gforth-fast   | cycles gforth-itc    |
    strcmp loop | bench1 bench2 bench3 | bench1 bench2 bench3 |
       262  127 |  109.5   16.6  109.4 | 1732.7  147.4 1724.5 | version0
       303  151 |  164.2   17.2  164.4 | 1714.1  170.4 1613.5 | version1
       257  122 |  105.3   17.4  105.1 | 1496.7  166.4 1493.0 | version2
       280  113 |   98.6   19.2   99.0 | 1230.1  194.4 1116.2 | version3
       267  118 |   91.2   17.9   91.2 | 1268.6  198.4 1269.0 | version4
       273  108 |   89.9   17.0   90.0 | 1136.0  178.4 1138.9 | version5
       261  128 |  121.1   14.6  118.5 | 1221.4  131.3 1213.3 | version6
       210  142 |  137.5   15.4    9.5 | 1244.4  155.3   78.3 | versionpeter
       260  119 |  107.8   16.4    9.8 | 1186.2  134.5   71.3 | versionbezemer

    So the champion among the full-featured strcmps for bench1 and bench3
    is version5, for bench2 version6. The str= variants are much faster
    for bench3 (of course), but slower than several other versions for
    bench1 and slower than version6 for bench2. The native code size is
    smallest for version2 (among the full-featured strcmp
    implementations), so the locals-less versions do not win everything.

    So locals-less (version5 and version6) is somewhat faster on both
    gforth-fast and gforth-itc.

    lxf has a more efficient locals implementation. Let's see how it
    fares. It does not support the usage in version1, so I leave that
    out.

    cycles lxf
    bench1 bench2 bench3
      79.9   12.0   79.9 version0
      99.6   12.0   99.6 version2
      98.8   14.1   98.1 version3
      86.0   13.2   86.0 version4
      84.1   12.6   84.2 version5
      88.7   10.0   92.8 version6
      98.3   10.0    6.0 versionpeter
      72.1    9.5    6.0 versionbezemer

    On lxf version0 (with locals) is the fastest for bench1 and bench3,
    and version6 is the fastest for bench2. Hans Bezemer's version wins
    everything if we are only interested in str= functionality.

    Anton, thanks for running all these tests.
    I have now also run them on my Ryzen 9950X.
    There is an error in version 6 that I corrected:
    2rdrop needs to come after unloop. On lxf64, which uses registers for
    loop parameters, this is necessary!
    version3 does not run, as lxf64 does not support defining locals
    several times. I will see if this can be changed.
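
    For reference, version6 with the order corrected as described might look
    like the following. It is a sketch based on the version6 source in this
    thread; 2RDROP is not a standard word, so the fallback definition from
    the posted program is included.

    ```forth
    [undefined] 2rdrop [if]
        : 2rdrop postpone 2r> postpone 2drop ; immediate
    [then]

    : strcmp ( addr1 u1 addr2 u2 -- n )
        rot 2dup 2>r min 0 ?do ( addr1 addr2 ) ( R: u2 u1 loop-sys )
            over c@ over c@ - dup if
                nip nip          \ keep only the difference
                unloop           \ first discard the loop parameters ...
                2rdrop           \ ... then the saved u2 u1
                exit then
            drop
            char+ swap char+ swap
        loop
        2drop r> r> - ;          \ common prefix equal: result is u1 - u2

    s" abc" s" abcd" strcmp .    \ prints -1
    s" abd" s" abc"  strcmp .    \ prints 1
    ```

    On a system that keeps the loop parameters on the return stack the two
    orders happen to remove the same number of cells, which would explain
    why the original order went unnoticed there.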

    I also needed to change the log-fd to 5 to get it to run.
    The tests are run with Debian under WSL2.

    Here are the results

    lxf64
      59.1   10.0   57.6 version0
      48.1   10.0   48.4 version2
      43.0   10.7   42.5 version4
      42.2    9.1   42.2 version5
      55.1    9.0   55.0 version6
      65.7    8.0    6.0 versionpeter
      32.8    9.0    4.2 versionbezemer

    lxf
      64.2    8.5   64.2 version0
     112.3   10.2   90.1 version2
      78.8   10.6   75.6 version4
      88.1    9.4   88.2 version5
     112.2    7.5  114.7 version6
      71.0    8.2    7.4 versionpeter
      50.9    8.3    4.3 versionbezemer

    There is a significant impact in having loop parameters in registers!
    Versions 2 and 6 are interesting for lxf. The full perf stat output
    gives some more info (sorry for the long lines).
    Version 2 compared to version 0:

    Peter@R9950WSL:/mnt/d/Dev/forth/lxf32v17$ perf stat ./lxf "create version2 100000000 constant iterations include strcmp.4th bench1 bye"


    Performance counter stats for './lxf create version2 100000000 constant iterations include strcmp.4th bench1 bye':

    1,955.50 msec task-clock:u # 0.998 CPUs utilized
    0 context-switches:u # 0.000 /sec
    0 cpu-migrations:u # 0.000 /sec
    64 page-faults:u # 32.728 /sec
    10,973,742,845 cycles:u # 5.612 GHz
    966,332,718 stalled-cycles-frontend:u # 8.81% frontend cycles idle
    34,901,611,693 instructions:u # 3.18 insn per cycle
    # 0.03 stalled cycles per insn
    3,900,350,964 branches:u # 1.995 G/sec
    36,727 branch-misses:u # 0.00% of all branches

    1.960183288 seconds time elapsed

    1.955783000 seconds user
    0.000000000 seconds sys


    peter@R9950WSL:/mnt/d/Dev/forth/lxf32v17$ perf stat ./lxf "create version0 100000000 constant iterations include strcmp.4th bench1 bye"


    Performance counter stats for './lxf create version0 100000000 constant iterations include strcmp.4th bench1 bye':

    1,158.97 msec task-clock:u # 0.996 CPUs utilized
    0 context-switches:u # 0.000 /sec
    0 cpu-migrations:u # 0.000 /sec
    64 page-faults:u # 55.221 /sec
    6,415,119,211 cycles:u # 5.535 GHz
    4,510,117 stalled-cycles-frontend:u # 0.07% frontend cycles idle
    38,301,605,801 instructions:u # 5.97 insn per cycle
    # 0.00 stalled cycles per insn
    3,900,348,894 branches:u # 3.365 G/sec
    19,563 branch-misses:u # 0.00% of all branches

    1.163667408 seconds time elapsed

    1.151174000 seconds user
    0.007966000 seconds sys

    BR
    Peter


    And here's the code (measurement scripts at the bottom):
    ----------------------------------------------------------
    [defined] version0 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        u1 u2 min 0
        ?do
            addr1 c@ addr2 c@ - dup
            if
                unloop exit
            then
            drop
            addr1 char+ TO addr1
            addr2 char+ TO addr2
        loop
        u1 u2 - ;
    [then]

    [defined] version1 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        addr1 addr2
        u1 u2 min 0
        ?do {: s1 s2 :}
            s1 c@ s2 c@ - dup
            if
                unloop exit
            then
            drop s1 char+ s2 char+
        loop
        2drop
        u1 u2 - ;
    [then]

    [defined] version2 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        u1 u2 min 0
        ?do
            addr1 i + c@ addr2 i + c@ - dup
            if
                unloop exit
            then
            drop
        loop
        u1 u2 - ;
    [then]

    [defined] version3 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        addr2 addr1 - {: offset :}
        u1 u2 min addr1 + addr1 ?do
            i c@ i offset + c@ - dup
            if
                unloop exit
            then
            drop
        loop
        u1 u2 - ;
    [then]

    [defined] version4 [if]
    : strcmp {: addr1 u1 addr2 u2 -- n :}
        addr2 addr1 - ( offset )
        u1 u2 min addr1 + addr1 ?do ( offset )
            dup i + c@ i c@ - dup
            if
                nip negate unloop exit
            then
            drop
        loop
        drop u1 u2 - ;
    [then]

    [defined] version5 [if]
    : strcmp ( addr1 u1 addr2 u2 -- n )
        rot 2dup - >r       ( addr1 addr2 u1 u2 R: n1 )
        min -rot over -     ( u12 addr1 offset R: n1 )
        swap rot bounds     ( offset limit start R: n1 )
        ?do                 ( offset R: n1 loop-sys )
            dup i + c@ i c@ - dup
            if
                nip negate unloop r> drop exit
            then
            drop
        loop
        drop r> negate ;
    [then]

    [defined] version6 [if]
    [undefined] 2rdrop [if]
        : 2rdrop postpone 2r> postpone 2drop ; immediate
    [then]

    \ As posted; Peter notes above that UNLOOP has to come before 2RDROP
    \ on systems that keep the loop parameters in registers.
    : strcmp ( addr1 u1 addr2 u2 -- n )
        rot 2dup 2>r min 0 ?do ( addr1 addr2 )
            over c@ over c@ - dup if
                nip nip 2rdrop unloop exit then
            drop
            char+ swap char+ swap
        loop
        2drop r> r> - ;
    [then]

    [defined] versionpeter [if]
    \ from <20260425160747.00007f4a@tin.it>
    \ renamed and deleted the call to UC
    : strcmp ( addr n addr' n' - f) \ 0 is match
        dup >r
        begin
            rot = while             \ str cstr
            r> dup 1- >r
        while                       \ str cstr
            swap count              \ cstr str' s1
            rot count               \ str' s1 cstr' c1
        repeat
            2drop r> drop 0 exit
        then
        2drop r> drop 1 ;
    [then]

    [defined] versionbezemer [if]
    \ from <nnd$548d4f1b$1e104571@905dda44db1f54ae>
    \ renamed
    : strcmp
        rot over - if drop 2drop true exit then
        0 ?do
            over i chars + c@ over i chars + c@ -
            if drop drop unloop true exit then
        loop drop drop false
    ;
    [then]

    [defined] t{ [if]
    t{ s" abc" s" abc" strcmp -> 0 }t
    t{ s" abc" s" abcd" strcmp -> -1 }t
    t{ s" abc" s" abd" strcmp -> -1 }t
    t{ s" abd" s" abc" strcmp -> 1 }t
    t{ s" cbc" s" abc" strcmp -> 2 }t
    t{ s" abc" s" adc" strcmp -> -2 }t
    [then]

    \ Benchmarks

    [undefined] iterations [if]
        100000000 constant iterations
    [then]

    : benchmark ( c-addr1 u1 c-addr2 u2 -- )
        iterations 0 do
            2over 2over strcmp drop
        loop
        2drop 2drop ;

    : bench1
        s" 0123456789abcdefg" 2dup benchmark ;

    : bench2
        s" 0123456789abcdefg" s" 2123456789abcdefg" benchmark ;

    : bench3
        s" 0123456789abcdefg" s" 0123456789abcdefgh" benchmark ;


    0 [if]
    # bash script for producing the cycles
    IFS=":"
    for i in 0 1 2 3 4 5 6 peter bezemer; do
        for forthit in gforth-fast:100000000 gforth-itc:10000000; do
            fields=($forthit); forth="${fields[0]}"; iterations="${fields[1]}"
            for bench in 1 2 3; do
                perf stat --log-fd 3 -x, -e cycles:u $forth -e "create version$i $iterations constant iterations" ~/forth/strcmp.4th -e "bench$bench bye" 3>&1 >/dev/null|
                    awk -F, '{printf "%6.1f ",$1/'$iterations'}'
            done
        done
        echo version$i
    done
    IFS=":"
    for i in 0 2 3 4 5 6 peter bezemer; do
        forth=lxf; iterations=100000000
        for bench in 1 2 3; do
            perf stat --log-fd 3 -x, -e cycles:u $forth "create version$i $iterations constant iterations include $HOME/forth/strcmp.4th bench$bench bye" 3>&1 >/dev/null|
                awk -F, '{printf "%6.1f ",$1/'$iterations'}'
        done
        echo version$i
    done
    [then]
    --------------------------------------------------------------

    - anton


  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Mon Apr 27 07:53:58 2026
    From Newsgroup: comp.lang.forth

    peter <peter.noreply@tin.it> writes:
    On Sun, 26 Apr 2026 14:03:03 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    ...

    And, to top it off, sf64 and vfx64, after correcting the bug in
    version6 that you pointed out:

    cycles sf-4.0.0-RC89 | cycles vfx64 5.43    |
    bench1 bench2 bench3 | bench1 bench2 bench3 |
     195.1   62.0  194.5 |  124.2   42.2  123.3 | version0
     136.3   63.0  136.2 |  200.4  124.1  204.4 | version2
     143.7   69.6  143.4 |   90.7   36.7   91.3 | version4
     115.1   36.0  114.1 |  102.0   30.2  101.8 | version5
     132.8   38.0  133.3 |   85.8   28.2   88.2 | version6
     182.0   19.0    9.0 |   95.7   10.2    6.2 | versionpeter
     224.9   40.2    8.0 |   63.2   29.2    6.2 | versionbezemer

    Interesting performance variations.

    Anton, thanks for running all these tests.
    I have now also run them on my Ryzen 9950X.
    There is an error in version 6 that I corrected.
    2rdrop needs to be after unloop. On lxf64 that uses registers for
    loop parameters this is necessary!

    Thanks. In sf64 and vfx64 this change is necessary, too.

    I needed also to change the log-fd to 5 to get it to run.
    The tests are run with Debian under WSL2.

    WSL2 supports performance counters. Great!

    What happens with log-fd=3?


    Here are the results

    lxf64
    59.1 10.0 57.6 version0
    48.1 10.0 48.4 version2
    43.0 10.7 42.5 version4
    42.2 9.1 42.2 version5
    55.1 9.0 55.0 version6
    65.7 8.0 6.0 versionpeter
    32.8 9.0 4.2 versionbezemer

    lxf
    64.2 8.5 64.2 version0
    112.3 10.2 90.1 version2
    78.8 10.6 75.6 version4
    88.1 9.4 88.2 version5
    112.2 7.5 114.7 version6
    71.0 8.2 7.4 versionpeter
    50.9 8.3 4.3 versionbezemer

    There is a significant impact in having loop parameters in registers!
    version 2 and 6 are interesting for lxf. The full stat gives some more
    info.

    Not any info that I find helpful. But my guess is as follows: Keeping
    the loop index in memory has reliably meant that counted loops take at
    least 5 cycles per iteration. In recent processors (from this decade
    or a little earlier), hardware can perform zero-cycle store-to-load
    forwarding, but it is not reliable. So my guess is that in version2
    and version6 we are seeing cases where this hardware optimization has
    not worked. So, yes, keeping loop parameters that change in registers
    is a good idea even on recent CPUs.

    The differences between Zen4 and Zen5 on lxf are significant, but I
    guess that if you take the average, you get the picture of small
    progress that I see on various websites.

    - anton
  • From peter@peter.noreply@tin.it to comp.lang.forth on Mon Apr 27 11:52:41 2026
    From Newsgroup: comp.lang.forth

    I think that code placement in memory plays a role. Look at these two
    runs of version 6:

    Performance counter stats for './lxf create version6 100000000 constant iterations include strcmp.4th bench1 bye':

    2,008.59 msec task-clock:u # 0.998 CPUs utilized
    0 context-switches:u # 0.000 /sec
    0 cpu-migrations:u # 0.000 /sec
    64 page-faults:u # 31.863 /sec
    11,290,674,838 cycles:u # 5.621 GHz
    1,554,765,442 stalled-cycles-frontend:u # 13.77% frontend cycles idle
    34,301,643,546 instructions:u # 3.04 insn per cycle
    # 0.05 stalled cycles per insn
    3,900,356,221 branches:u # 1.942 G/sec
    32,958 branch-misses:u # 0.00% of all branches

    2.013169221 seconds time elapsed

    2.004809000 seconds user
    0.003993000 seconds sys

    Compare with this run, where I merely allot 16 bytes in the code segment
    before loading:

    Performance counter stats for './lxf create version6 100000000 constant iterations 16 allot-c include strcmp.4th bench1 bye':

    1,202.67 msec task-clock:u # 0.996 CPUs utilized
    0 context-switches:u # 0.000 /sec
    0 cpu-migrations:u # 0.000 /sec
    64 page-faults:u # 53.215 /sec
    6,630,029,444 cycles:u # 5.513 GHz
    439,780,595 stalled-cycles-frontend:u # 6.63% frontend cycles idle
    34,301,649,298 instructions:u # 5.17 insn per cycle
    # 0.01 stalled cycles per insn
    3,900,358,028 branches:u # 3.243 G/sec
    146,947 branch-misses:u # 0.00% of all branches

    1.207212030 seconds time elapsed

    1.202879000 seconds user
    0.000000000 seconds sys

    I have observed the same behavior on other benchmarks.
    BR
    Peter

  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Tue Apr 28 08:21:37 2026
    From Newsgroup: comp.lang.forth

    On 26-04-2026 19:04, Anton Ertl wrote:
    Hans Bezemer <the.beez.speaks@gmail.com> writes:
    On 26-04-2026 11:50, Anton Ertl wrote:
    Bernd Paysan wrote a simple locals implementation
    <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/locals.fs>
    that takes 84 SLOC:

    With all respect to Bernd, but yeah - compare that to this 0.5 SLOC
    implementation of local:

    : local r> swap dup >r @ >r ;: r> r> ! ;

    Let's see:

    [~:167902] gforth-0.5.0
    GForth 0.5.0, Copyright (C) 1995-2000 Free Software Foundation, Inc.
    GForth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    warnings off include locals.fs ok
    ok
    : local r> swap dup >r @ >r ;: r> r> ! ;
    *the terminal*:1: Undefined word
    : local r> swap dup >r @ >r ;: r> r> ! ;
    ^^
    Backtrace:
    $F7B5A158 throw
    $F7B6418C no.extensions

    Although, admittedly, while Bernd Paysan's locals.fs loads, it does
    not work AFAICT (I tried it on gforth-0.4 and gforth-0.5; it does not
    load on gforth-0.6 and later). Apparently it had bitrotted between
    the time when it was written in 1992 and gforth-0.4 in 1998.

    - anton

    Oh dear, huge Gforth doesn't feature a ;: word? Let me help you. From
    the humble 4tH repository:

    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ;

    Well, that boosts it to almost 0.625 SLOC. It's almost bloatware!

    Now. Let's see how it performs:

    Gforth 0.7.9_20250321
    Authors: Anton Ertl, Bernd Paysan, Jens Wilke et al., for more type `authors'
    Copyright © 2025 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `help' for basic help
    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ; ok
    variable x ok
    : test x local x ! x ? cr ; ok
    ok
    25 x ! 12 test x ? cr 12
    25
    ok

    Well - it actually works! It's amazing! What a solid piece of software engineering!
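
    How the one-liner works may be easier to see when it is spread out with
    stack comments. The following is a sketch of one reading of it, not 4tH
    documentation. It assumes a system where a return address occupies one
    cell of the return stack and can be moved with R> and >R, which the
    standard does not guarantee (the transcript above shows it holds in that
    Gforth build).

    ```forth
    \ ;: pushes x on the return stack, so its own ; "returns" to x.
    \ What remains on top of the return stack is the address of the code
    \ that follows ;: in the word that called it.
    : ;: ( x -- )  >r ;

    : local ( a-addr -- )
        r>              \ ( a-addr ret )  ret: where to continue in the caller
        swap dup >r     \ ( ret a-addr )  R: a-addr
        @ >r            \ ( ret )         R: a-addr old-value
        ;:              \ continue in the caller; when the caller reaches
                        \ its ; control comes back to the next line
        r> r> ! ;       \ store old-value back into a-addr

    variable x
    : test ( n -- )  x local  x !  x ? cr ;

    25 x !  12 test  x ? cr     \ prints 12, then 25
    ```

    The variable is therefore saved on entry and restored on exit of the
    word that uses LOCAL; the price is three extra return-stack cells per
    LOCAL while that word runs.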

    Hans Bezemer


  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Tue Apr 28 14:34:39 2026
    From Newsgroup: comp.lang.forth

    On 27-04-2026 03:12, dxf wrote:
    On 26/04/2026 7:50 pm, Anton Ertl wrote:
    Paul Rubin <no.email@nospam.invalid> writes:
    ...
    I've
    imagined some alternate versions of COLON, e.g.
    : foo ( ... ) ; \ regular colon, no locals
    1: foo ( ... ) ; \ one local called A
    2: foo (... ) ; \ two locals, A and B
    ...
    4: foo (... ) ; \ four locals: A, B, C, D.


    As a matter of fact, this thingy creates locals:

    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ;

    If you cannot choose the names, locals lose a lot of their benefits in
    making the code more understandable (OTOH, mathematicians have made do
    with similar naming schemes for a long time). You might then just as
    well work with >R >R >R >R and R@, R'@, 2 RPICK and 3 RPICK.

    That Julian Noble (among others) felt the need for FTRAN INTRAN etc
    informs what scientists and academics really want - and it's a long way
    from the 'stack based' locals offered by most forth systems. The latter
    represent a concession to forth before a user has even begun to consider
    identifiers. To an outsider, forth locals do nothing to ameliorate what
    they see as fundamentally broken about the language. ISTM if a forther
    has conceded to use stack-based locals, he can certainly make choices
    about what form identifiers take.

    You can do that with the 4tH preprocessor. This uses the above code:

    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ;

    variable x                       \ var x
    variable y                       \ var y

    : multiply ( n1 n2 -- )          \ prints n1*n2
        x local                      \ turn into local
        y local                      \ turn into local

        let y=; let x=;              \ take values from the stack

        let x = (x * y);             \ multiply them
        let x,|. cr|;                \ get x, perform ". cr"
    ;

    23 x !                           \ to prove it is a local
    7 6 multiply                     \ multiply 7 by 6
    x ? cr                           \ now let's check that

    And this is the output:

    $ pp4th -x testme.4pp
    42
    23

    Note that the output of the preprocessor works fine on a vanilla Forth:

    Gforth 0.7.9_20250321
    Authors: Anton Ertl, Bernd Paysan, Jens Wilke et al., for more type `authors'
    Copyright © 2025 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `help' for basic help
    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ; ok
    ok
    variable x ok
    variable y ok
    ok
    : multiply compiling
    x local compiling
    y local compiling
    compiling
    y ! x ! compiling
    compiling
    x @ y @ * x ! compiling
    x @ . cr compiling
    ; ok
    ok
    23 x ! ok
    7 6 multiply 42
    ok
    x ? cr 23
    ok

    Hans Bezemer

  • From Gerry Jackson@do-not-use@swldwa.uk to comp.lang.forth on Wed Apr 29 12:44:30 2026
    From Newsgroup: comp.lang.forth

    On 28/04/2026 13:34, Hans Bezemer wrote:
    As a matter of fact, this thingy creates locals:

    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ;

    LOCAL can also be defined as:
    : local r> over @ rot 2>r ;: 2r> ! ;
    which I guess you won't like, but is a bit shorter. It also survives
    your pre-processor conversion of 2>r to >r >r, similarly 2r>
    --
    Gerry
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Apr 29 14:37:28 2026
    From Newsgroup: comp.lang.forth

    On 29-04-2026 13:44, Gerry Jackson wrote:
    On 28/04/2026 13:34, Hans Bezemer wrote:
    As a matter of fact, this thingy creates locals:

    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ;

    LOCAL can also be defined as:
    : local r> over @ rot 2>r ;: 2r> !  ;
    which I guess you won't like, but is a bit shorter. It also survives
    your pre-processor conversion of 2>r to >r >r, similarly 2r>


    I don't say you're wrong, but there is some logic to this madness:

    1. In 4tH, "2>R" is the same as ">R >R". The compiler expands it like
    that. So -- there is no advantage to do "2>R". Yes, you can do "2R@",
    but not "R@". It won't be portable;

    2. I don't consider "2>R" an optimization. To me it is an operator on
    a different type. There *HAS* to be a connection between two values.
    Like addr/count, double number, array/size, etc. To me it means I'm
    dealing with a "two cells type". My future me will thank me.

    And that's why I don't agree with you ;-)

    Hans Bezemer



  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Wed Apr 29 14:44:02 2026
    From Newsgroup: comp.lang.forth

    On 29-04-2026 14:37, Hans Bezemer wrote:
    On 29-04-2026 13:44, Gerry Jackson wrote:
    On 28/04/2026 13:34, Hans Bezemer wrote:
    As a matter of fact, this thingy creates locals:

    : ;: >r ; : local r> swap dup >r @ >r ;: r> r> ! ;

    LOCAL can also be defined as:
    : local r> over @ rot 2>r ;: 2r> !  ;
    which I guess you won't like, but is a bit shorter. It also survives
    your pre-processor conversion of 2>r to >r >r, similarly 2r>


    I don't say you're wrong, but there is some logic to this madness:

    1. In 4tH, "2>R" is the same as ">R >R". The compiler expands it like
    that. So -- there is no advantage to do "2>R". Yes, you can do "2R@",
    but not "R@". It won't be portable;

    QED:

    Addr| Opcode  Operand  Argument

       0| branch        2  ;:
       1| >r            0
       2| exit          0
       3| branch       14  local
       4| r>            0
       5| over          0
       6| @             0
       7| rot           0
       8| >r            0
       9| >r            0
      10| call          0  ;:
      11| r>            0
      12| r>            0
      13| !             0
      14| exit          0

    No trickery. That's the way it is. :-)

    Hans Bezemer

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri May 1 23:50:04 2026
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    If you cannot choose the names... You might then just as well work with
    >R >R >R >R and R@, R'@, 2 RPICK and 3 RPICK.

    That actually might be a workable idea. Thanks.

    In the code you see the threaded code interspersed with the native
    code. If you ignore the native code, you see what a simple
    interpreter would see (if it had a locals implementation that produced
    code similar to that of Gforth).

    I wonder if gforth would get less code bloat if you added some
    primitives for pushing more than one local. E.g. 2>L, 3>L, etc. would
    push that many stack elements to LOCAL0, LOCAL1, LOCAL2. Then there
    wouldn't be that big chunk of replicated code.

    So it's "code cleanup", not making use of hardware facilities for
    efficiency on simple interpreters, that you see as the benefit of
    locals.

    Well, I had hoped to get both, but yeah, ultimately cleaner and more
    reliable code takes precedence in most situations, by the 90/10 rule.

    Even with multi-representation stack-caching as used since Gforth 0.7
    (which does require more compiler smarts), no statically determined
    stack effect is necessary, because the code generator returns to the
    canonical state on control-flow.

    I see, yeah, but that means stack juggling to get to the canonical
    state.

    ... we have user variables like BASE and HLD (in F83, HOLDPTR in
    gforth). They are used across multiple words, and the fact that you
    don't have to pass them and put them into a local has been touted as
    an advantage over locals: Definitions that use global variables are
    easier to factor.

    Urgggh...
  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Fri May 1 23:54:11 2026
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    You might then just as well work with >R >R >R >R and R@, R'@, 2 RPICK
    and 3 RPICK.

    But, now you have to avoid mixing that style with using the R stack for
    temporaries, including stuff like loop indexes which sometimes go
    there. And you have to clean up the R stack before returning, and maybe
    arrange for that to happen in case of an exception.

    Flashforth has a separate P stack which can be used for temporaries
    within a word, but I don't remember how cleanup is handled, if at all.
  • From dxf@dxforth@gmail.com to comp.lang.forth on Sat May 2 17:36:25 2026
    From Newsgroup: comp.lang.forth

    On 2/05/2026 4:54 pm, Paul Rubin wrote:
    ...
    Flashforth has a separate P stack which can be used for temporaries
    within a word, but I don't remember how cleanup is handled, if at all.

    It's a cpu register - not a stack. For re-entrancy old value must first
    be pushed onto the cpu stack before loading the new. IIRC FF has a word
    that combines those. Basically a variable.

  • From Paul Rubin@no.email@nospam.invalid to comp.lang.forth on Sat May 2 01:11:53 2026
    From Newsgroup: comp.lang.forth

    dxf <dxforth@gmail.com> writes:
    It's a cpu register - not a stack. For re-entrancy old value must first
    be pushed onto the cpu stack before loading the new. IIRC FF has a word
    that combines those. Basically a variable.

    Aha, thanks, I mis-remembered how it worked.

    https://pajacobs-ghub.github.io/flashforth/ff5-quick-ref.html#_the_p_register
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat May 2 10:34:29 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    [...]
    I wonder if gforth would get less code bloat if you added some
    primitives for pushing more than one local. E.g. 2>L, 3>L, etc. would
    push that many stack elements to LOCAL0, LOCAL1, LOCAL2. Then there
    wouldn't be that big chunk of replicated code.

    Gforth has superinstructions for the sequences

    sequence         count  AMD64 code gforth-fast
    >l >l               62  len= 4+ 26+ 3
    >l >l >l             9  len= 4+ 34+ 3
    >l >l >l >l          5  len= 4+ 42+ 3
    l f>l                2  len= 4+ 42+ 3
    l @local0           20  len= 4+ 11+ 3
    l lit f@localn       1  len= 4+ 24+ 3

    compared to

    primitive        count  AMD64 code gforth-fast
    >l                  67  len= 4+ 18+ 3
    l                   10  len= 4+ 23+ 3

    The counts are static occurrences of the code for this primitive, i.e.,
    if for a sequence >l >l the superinstruction is selected, the
    superinstruction's count is increased, while the count of >l stays the
    same. The whole data is for gforth-fast with disabled static stack
    caching. With static stack caching enabled, there are more variants
    of >l that the counts distribute over.

    As far as native code is concerned, these superinstructions already
    give the benefit of the additional primitives you suggest. On the
    threaded-code side there is a threaded-code slot for each >L. That
    simplifies the implementation of superinstructions: no need to
    rearrange the threaded code when the decision about the use of
    superinstructions is taken.

    We could add some special mechanism to the locals implementation that
    produces 2>L etc. instead of just producing a sequence of >Ls, and
    letting the ordinary superinstruction mechanism in Gforth combine
    them. But would such a mechanism cost less in code size than the
    62+9*2+5*3=95 cells that it saves? Not to mention the development and
    maintenance effort.
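
    The figure of 95 appears to count one threaded-code cell saved for
    every >L beyond the first in each sequence, so a two-element sequence
    saves one cell, a three-element sequence two, and a four-element
    sequence three. That reading is an assumption, but the counts from the
    table above reproduce the total:

    ```forth
    \ occurrences * cells saved per occurrence, summed over the three sequences
    62 1 *   9 2 * +   5 3 * +  .   \ prints 95
    ```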

    While we are at it, here are the other locals-related primitives:

    primitive  count  AMD64 code gforth-fast
    @localn        1  len= 4+ 5+ 3
    @local0      115  len= 4+ 11+ 3
    @local1       87  len= 4+ 11+ 3
    @local2       62  len= 4+ 11+ 3
    @local3       31  len= 4+ 11+ 3
    @local4       18  len= 4+ 11+ 3
    @local5       16  len= 4+ 11+ 3
    @local6       10  len= 4+ 11+ 3
    @local7        4  len= 4+ 11+ 3
    !localn        0  len= 4+ 16+ 3
    !local0        4  len= 4+ 11+ 3
    !local1        1  len= 4+ 11+ 3
    !local2        4  len= 4+ 11+ 3
    !local3        2  len= 4+ 11+ 3
    !local4        0  len= 4+ 11+ 3
    !local5        0  len= 4+ 11+ 3
    !local6        0  len= 4+ 11+ 3
    !local7        0  len= 4+ 11+ 3
    +!localn       3  len= 4+ 16+ 3
    lp+n          82  len= 4+ 3+ 3
    f@localn      11  len= 4+ 24+ 3
    lp@           14  len= 4+ 10+ 3
    lp+!          67  len= 4+ 10+ 3
    lp-            3  len= 4+ 4+ 3
    lp+           36  len= 4+ 4+ 3
    lp+2          33  len= 4+ 4+ 3
    lp!           12  len= 4+ 10+ 3

    Even with multi-representation stack-caching as used since Gforth 0.7
    (which does require more compiler smarts), no statically determined
    stack effect is necessary, because the code generator returns to the
    canonical state on control-flow.

    I see, yeah, but that means stack juggling to get to the canonical
    state.

    In clf, "Stack juggling" usually means using words like ROT (see the
    cartoon about ROT in Starting Forth). That's not what happens here.

    What happens is that there is code that performs a transition between
    stack representations. Between any two primitives, as well as at the
    start and end of a sequence, the code generator can insert such code.
    It uses a shortest-path algorithm to find the shortest native-code
    sequence for the threaded-code sequence. The result is never longer
    than the native-code sequence that you get when you always use the
    implementation of the primitive that starts in the canonical
    representation and ends in the canonical representation, and it often
    is shorter.

    Here is the usage of the transitions between the stack representations
    on AMD64 for the gforth.fi image:

    trans  count  AMD64 code gforth-fast
    1-0     2932  len= 0+ 7+ 3
    2-0      944  len= 0+ 12+ 3
    3-0      135  len= 0+ 16+ 3
    0-1      152  len= 0+ 8+ 3
    2-1       87  len= 0+ 10+ 3
    3-1       35  len= 0+ 14+ 3
    0-2       39  len= 0+ 12+ 3
    1-2      151  len= 0+ 10+ 3
    3-2        0  len= 0+ 13+ 3
    0-3       15  len= 0+ 15+ 3
    1-3       48  len= 0+ 15+ 3
    2-3        5  len= 0+ 13+ 3

    The transition is shown as M-N, where M is the number of stack items
    in registers before the transition and N is the number of stack items
    in registers after the transition. The high number of transitions
    with N=0 is interesting, given that the canonical representation is 1.

    My impression is that for a primitive that pushes a value, such as
    lit, r@ or @local0, the code generator selects the transition to 0
    followed by the 0-1 variant of the primitive over the 1-1 variant of
    the primitive when the code size is the same.

    ... we have user variables like BASE and HLD (in F83, HOLDPTR in
    gforth). They are used across multiple words, and the fact that you
    don't have to pass them and put them into a local has been touted as
    an advantage over locals: Definitions that use global variables are
    easier to factor.

    Urgggh...

    When that is combined with proper wrapping, it can be a useful
    mechanism. E.g., environment variables in shell scripts work that
    way, or the graphics state in Postscript. But it's something that
    needs a lot of restraint to avoid creating a mess. Hanson and
    Proebsting presented a case (and efficient implementation) for this
    kind of mechanism:

    @InProceedings{hanson&proebsting01,
    author = {David R. Hanson and Todd A. Proebsting},
    title = {Dynamic Variables},
    crossref = {sigplan01},
    pages = {264--273},
    annote = {}
    }

    @Proceedings{sigplan01,
    booktitle = "SIGPLAN '01 Conference on Programming Language
    Design and Implementation",
    title = "SIGPLAN '01 Conference on Programming Language
    Design and Implementation",
    year = "2001",
    key = "PLDI '01"
    }

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat May 2 15:58:27 2026
    From Newsgroup: comp.lang.forth

    Paul Rubin <no.email@nospam.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    You might then just as well work with >R >R >R >R and R@, R'@, 2 RPICK
    and 3 RPICK.

    But, now you have to avoid mixing that style with using the R stack for
    temporaries, including stuff like loop indexes which sometimes go
    there.

    Standard locals have some of these restrictions, too, but not all of
    them. Concerning counted loops, you may want to take their return
    stack usage into account when RPICKing.

    And you have to clean up the R stack before returning,

    Yes. What's easier to implement is often harder to use.

    and maybe
    arrange for that to happen in case of an exception.

    THROW resets the return stack to the CATCH depth, so no extra work is
    necessary.
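
    A minimal sketch of that behaviour in standard Forth, assuming the
    Exception word set is present; the names RISKY and GUARDED are
    hypothetical and not from the thread:

    ```forth
    \ RISKY leaves three cells on the return stack and never removes them.
    : risky  ( -- )  1 >r 2 >r 3 >r  42 throw ;

    \ CATCH restores the return stack depth it saw on entry, so GUARDED
    \ can exit normally although RISKY did no cleanup of its own.
    : guarded  ( -- n )  ['] risky catch ;

    guarded .   \ prints 42
    ```

    Without the THROW, RISKY would have to drop its three cells itself
    before reaching its exit.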

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/