• DOES> speed

    From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Sep 6 14:48:57 2025

    Last year I found that the cd16sim benchmark can be sped up in
    gforth-fast by a factor of 3 by replacing one occurrence of

    does> ;

    with

    ;
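
    For illustration, here is a hedged sketch of the kind of change meant
    (hypothetical code, not the actual cd16sim source): a trailing DOES>
    with an empty body gives the defined words the same run-time behavior
    as plain CREATE, namely pushing the body address, just through a
    slower mechanism.

    \ hypothetical defining word with an empty DOES> clause
    : table  ( n "name" -- )  create cells allot  does> ;
    \ the modification: no DOES>; defined words still push their body address
    : table' ( n "name" -- )  create cells allot ;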

    When I looked at the generated code, I found that the code that Gforth generates for does>-defined words can be improved <2024Sep21.192551@mips.complang.tuwien.ac.at> <2024Oct3.125926@mips.complang.tuwien.ac.at>.

    This year I find that on a Core i5-1135G7 cd16sim on recent Gforth is
    twice as fast as on the Gforth that I used for the measurements last
    year; the one used for measurements last year is based on a July 2023
    version, with ip-update optimization and more stack caching integrated
    (which should be the relevant differences from the recent version, and,
    for the other benchmarks, they are). So knowing about this cd16sim
    issue, I wanted to check the performance of an empty does> with the
    following microbenchmark:

    : foo create does> ;
    foo bar
    : bench 100000000 0 do
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    loop ;

    So 1G invocations of BAR DROP.

    The results are:

    old             current
    22_005_621_021  13_516_461_910  instructions
    23_773_076_801   4_657_432_473  cycles Zen4 (Ryzen 8700G)
     6_861_042_139   4_654_259_763  cycles Tiger Lake (Core i5-1135G7)

    So with the old code this microbenchmark seems to hit a glass jaw of
    Zen4; the Tiger Lake, however, is not twice as slow for this
    microbenchmark, and the other code in CD16sim should dilute the speed
    difference even more, but who knows.

    The code for calling bar once is as follows:

    old                 current
    does-xt 1->1        lit 0->1
    bar                 bar
    mov [r11],rbx       mov r8,$10[rbx]
    mov rbx,$10[r13]    call 1->1
    sub r11,$08         $7DF305A7DD40
    add r13,$10         mov rax,$20[rbx]
    mov rax,-$08[rbx]   sub r14,$08
    mov rdx,$20[rax]    add rbx,$28
    mov rax,-$10[rdx]   mov [r14],rbx
    jmp eax             mov rbx,rax
                        mov rax,[rbx]
                        jmp eax

    DOES-XT pushes the body address of BAR and then EXECUTEs the xt of the anonymous colon definition started by DOES>, so it also runs DOCOL:

    add r13,$08
    sub r10,$08
    mov [r10],r13
    mov r13,rdx
    mov rax,$00[r13]
    jmp eax
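
    At the Forth level, the old scheme for BAR is roughly equivalent to
    the following sketch (illustration only, not Gforth source;
    DOES-CLAUSE-XT stands in for the xt of the anonymous colon definition
    begun by DOES>, whose body here is empty):

    :noname ( -- ) ; constant does-clause-xt  \ stands in for the empty DOES> clause
    : bar-sketch ( -- a-addr )
      [ ' bar >body ] literal   \ push the body address, as DOES-XT does
      does-clause-xt execute ;  \ EXECUTE the clause: DOCOL, then ;S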

    More precisely, the old code executes the following sequence:

    $7EF48495C548 does-xt 1->1
    $7EF48495C550 bar
    7EF4846035CD: mov [r11],rbx
    7EF4846035D0: mov rbx,$10[r13]
    7EF4846035D4: sub r11,$08
    7EF4846035D8: add r13,$10
    7EF4846035DC: mov rax,-$08[rbx]
    7EF4846035E0: mov rdx,$20[rax]
    7EF4846035E4: mov rax,-$10[rdx]
    7EF4846035E8: jmp eax
    DOCOL
    401A9A: add r13,$08
    401A9E: sub r10,$08
    401AA2: mov [r10],r13
    401AA5: mov r13,rdx
    401AA8: mov rax,$00[r13]
    401AAC: jmp eax

    And this reminds me what the glass jaw of Zen4 (and IIRC Zen3) is:
    Indirect branch prediction does not work properly if the target is
    too far from the indirect branch; the target of the indirect branch in
    DOCOL is:

    $7EF4846CC658 ;s 1->1
    7EF484603409: mov rax,[r10]
    7EF48460340C: add r10,$08
    7EF484603410: mov r13,rax
    7EF484603413: mov rax,$00[r13]
    7EF484603417: jmp eax

    so you get two such branch prediction hiccups per call to BAR: one on
    the indirect branch to DOCOL, and one for the indirect branch in DOCOL
    to the ;S.

    IIRC Gracemont (E-core of Alder Lake/Raptor Lake) suffers from the
    same problem. I have reported that on comp.arch or here when I first
    found it.

    Unfortunately, I still have not found the reason why cd16sim has been
    sped up by a factor of 2 when going from the "old" Gforth to the
    current Gforth, but it's nice to see that the new DOES> implementation
    provides a speedup.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Sep 7 06:12:27 2025

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Last year I found that the cd16sim benchmark can be sped up in
    gforth-fast by a factor of 3 by replacing one occurrence of

    does> ;

    with

    ;

    When I looked at the generated code, I found that the code that Gforth generates for does>-defined words can be improved <2024Sep21.192551@mips.complang.tuwien.ac.at> <2024Oct3.125926@mips.complang.tuwien.ac.at>.

    This year I find that on a Core i5-1135G7 cd16sim on recent Gforth is
    twice as fast as on the Gforth that I used for the measurements last
    year.

    To further investigate this, I measured both versions of Gforth with
    the original CD16sim benchmark (which I use for regular benchmarking)
    and with the modification that leaves out the DOES> in the definition
    where it is followed by ";".

    The results are as follows:

                  DOES> ;             |              ;
      old DOES-XT   new LIT CALL  |    old DOES-XT   new LIT CALL
    5_821_729_204  2_605_381_330  |  1_835_604_860  1_747_520_536  cycles:u
    8_358_289_634  7_264_296_443  |  5_418_477_733  5_107_832_523  instructions:u
       84_279_937      7_877_931  |      5_085_282      4_332_441  branch-misses:u
        9_128_041      3_195_014  |      2_034_349      1_594_811  L1-dcache-load-misses
       34_196_386      2_130_747  |        852_225        514_153  L1-icache-load-misses

    So we see that leaving out this one DOES> eliminates most of the
    speed difference. We also see that the version with DOES> and DOES-XT
    has many more branch mispredictions and I-cache misses, and somewhat
    more D-cache misses, and even the LIT CALL version has more I-cache
    and D-cache misses. At the moment I don't understand where those
    additional cache misses are coming from.

    The branch mispredictions are more plausible: with LIT CALL a history
    of one indirect branch is enough to make a correct prediction, whereas
    with DOES-XT two are needed; I would have thought that the
    indirect-branch history is longer on modern CPUs, but maybe it is not.

    I usually estimate a branch misprediction penalty at 20 cycles. The
    L2 latency for Tiger Lake is about 15 cycles (but OoO execution may
    overlap that with other instructions, reducing the increase in
    execution time). I have also measured the number of LLC-loads:u
    (last-level cache), and they are small, so nearly all L1 misses seem
    to hit in L2.

    With these estimates, the DOES-XT variant should consume about

    84_279_937 7_877_931 - 20 *
    9_128_041 34_196_386 + 3_195_014 2_130_747 + - 15 * +

    i.e., 2.1G cycles more than LIT CALL. The actual difference is 3.2G
    cycles. A little of that may be coming from having to execute 1.1G
    extra instructions, but I guess that there is another
    microarchitectural aspect involved. But I don't plan to investigate
    that.
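
    For clarity, here is the same estimate with the intermediate results
    spelled out (a commented restatement of the postfix expression above;
    the values in the comments are just the worked-out arithmetic):

    84279937 7877931 -     \ extra branch misses:       76_402_006
    20 *                   \ 1_528_040_120 cycles from mispredictions
    9128041 34196386 +     \ DOES-XT L1 misses:         43_324_427
    3195014 2130747 + -    \ minus LIT CALL L1 misses:  37_998_666
    15 *                   \ 569_979_990 cycles from L2 hits
    + .                    \ total estimate:            2_098_020_110 cycles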

    In any case, the DOES-XT implementation of DOES> seems to run into
    microarchitectural performance pitfalls on Tiger Lake that we see in
    CD16sim, but don't see in the microbenchmark (as a reminder, I show
    the microbenchmark and its results again below); the LIT CALL
    implementation also seems to have these issues, but to a much smaller
    degree. In addition, we see on the microbenchmark that DOES-XT runs
    into a microarchitectural pitfall on Zen4 (and, as shown in earlier
    measurements, Gracemont), while LIT CALL does not. So LIT CALL offers
    performance benefits beyond the obvious reduction in instructions.

    As a reminder, here's the microbenchmark and its results:

    : foo create does> ;
    foo bar
    : bench 100000000 0 do
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    loop ;

    So 1G invocations of BAR DROP.

    The results are:

    old             current
    22_005_621_021  13_516_461_910  instructions
    23_773_076_801   4_657_432_473  cycles Zen4 (Ryzen 8700G)
     6_861_042_139   4_654_259_763  cycles Tiger Lake (Core i5-1135G7)

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Sep 7 11:40:56 2025

    In article <2025Sep6.164857@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Last year I found that the cd16sim benchmark can be sped up in
    gforth-fast by a factor of 3 by replacing one occurrence of

    does> ;

    with

    ;


    All buffers generated by CREATE have a DOES> pointer that can be
    changed. That makes CREATE not the fundamental building block it is
    suggested to be. ciforth has DATA, which is in effect the familiar
    VARIABLE but without the allocation:

    : BUFFER   DATA CELLS ALLOCATE ;
    : VARIABLE DATA 1 CELLS ALLOCATE ;

    DATA creates a header with DOVAR as its action.

    A proper building block would be

    HEADER ( name label -- dea )

    where the name is the created name and the label points to the
    low-level action. The dea is the resulting "xt/nt". The dea is a
    dictionary entry, so that other fields (data content, immediate flags,
    etc.) can be filled in later.

    DATA then becomes:
    parse a name and pass the label DOVAR to HEADER, leaving a dea;
    fill in HERE as the low-level data field of the dea.
    (This data field is not to be confused with >BODY, which only makes
    sense in the context of CREATE DOES>.)
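
    A hedged sketch of that description (not ciforth source): it assumes
    HEADER with the stack effect ( name label -- dea ) as above, DOVAR
    available as a label value, a hypothetical PARSED-NAME that leaves the
    parsed name in whatever form HEADER expects, and a hypothetical >DFA
    that maps a dea to the address of its low-level data field.

    : DATA ( "name" -- )
       parsed-name dovar header   \ build a header whose action is DOVAR
       >dfa here swap ! ;         \ fill in HERE as its low-level data field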

    Remember CDFLN (code, data, flags, link, name).


    Groetjes Albert.
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.