From Newsgroup: comp.lang.forth
Last year I found that the cd16sim benchmark can be sped up in
gforth-fast by a factor of 3 by replacing one occurrence of
does> ;
with
;
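For concreteness, here is a minimal sketch of that kind of change
(DEFX1, DEFX2, X1, X2 are made-up names, not the actual cd16sim
source); a child word with an empty DOES> part just pushes its body
address, which is what a plain CREATEd word does anyway:
: defx1 ( "name" -- ) create 2 cells allot does> ;
: defx2 ( "name" -- ) create 2 cells allot ;
defx1 x1  defx2 x2
x1 .  x2 .  \ both just print their body addresses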
When I looked at the generated code, I found that the code that
Gforth generates for does>-defined words can be improved
<2024Sep21.192551@mips.complang.tuwien.ac.at>
<2024Oct3.125926@mips.complang.tuwien.ac.at>.
This year I find that on a Core i5-1135G7, cd16sim on recent Gforth
is twice as fast as on the Gforth that I used for the measurements
last year; that Gforth is based on a July 2023 version, with
ip-update optimization and more stack caching integrated (which
should be the relevant differences from the recent version, and for
the other benchmarks they are). So, knowing about this cd16sim
issue, I wanted to check the performance of an empty does> with the
following microbenchmark:
: foo create does> ;
foo bar
: bench 100000000 0 do
bar drop
bar drop
bar drop
bar drop
bar drop
bar drop
bar drop
bar drop
bar drop
bar drop
loop ;
So 1G invocations of BAR DROP.
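Such instruction and cycle counts can, for example, be obtained with
perf stat (bench.fs is a placeholder name for a file containing the
code above):
perf stat -e instructions,cycles gforth-fast bench.fs -e "bench bye"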
The results are:
           old          current
22_005_621_021   13_516_461_910   instructions
23_773_076_801    4_657_432_473   cycles Zen4 (Ryzen 8700G)
 6_861_042_139    4_654_259_763   cycles Tiger Lake (Core i5-1135G7)
So with the old code this microbenchmark seems to hit a glass jaw of
Zen4 (about 24 cycles per BAR DROP, versus about 7 on Tiger Lake),
but on Tiger Lake the old code is not twice as slow as the current
code for this microbenchmark, and the other code in cd16sim should
dilute the speed difference even more, but who knows.
The code for calling bar once is as follows (old code on the left,
current code on the right):
does-xt 1->1           lit 0->1
bar                    bar
mov [r11],rbx          mov r8,$10[rbx]
mov rbx,$10[r13]       call 1->1
sub r11,$08            $7DF305A7DD40
add r13,$10            mov rax,$20[rbx]
mov rax,-$08[rbx]      sub r14,$08
mov rdx,$20[rax]       add rbx,$28
mov rax,-$10[rdx]      mov [r14],rbx
jmp eax                mov rbx,rax
                       mov rax,[rbx]
                       jmp eax
DOES-XT pushes the body address of BAR and then EXECUTEs the xt of
the anonymous colon definition started by DOES>, so it also runs
DOCOL:
add r13,$08
sub r10,$08
mov [r10],r13
mov r13,rdx
mov rax,$00[r13]
jmp eax
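In Forth terms, the old BAR thus behaves like the following
hand-expansion (a sketch with made-up names BAR2-BODY, BAR2-DOES and
BAR2, not Gforth internals):
\ the xt of the empty DOES> code (it just returns, like ;S):
:noname ( a-addr -- a-addr ) ; constant bar2-does
create bar2-body      \ stands in for the body of BAR
: bar2 ( -- a-addr )
  bar2-body           \ push the body address of BAR
  bar2-does execute ; \ EXECUTE the xt of the DOES> code (DOCOL, then ;S)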
More precisely, the old code executes the following sequence:
$7EF48495C548 does-xt 1->1
$7EF48495C550 bar
7EF4846035CD: mov [r11],rbx
7EF4846035D0: mov rbx,$10[r13]
7EF4846035D4: sub r11,$08
7EF4846035D8: add r13,$10
7EF4846035DC: mov rax,-$08[rbx]
7EF4846035E0: mov rdx,$20[rax]
7EF4846035E4: mov rax,-$10[rdx]
7EF4846035E8: jmp eax
DOCOL
401A9A: add r13,$08
401A9E: sub r10,$08
401AA2: mov [r10],r13
401AA5: mov r13,rdx
401AA8: mov rax,$00[r13]
401AAC: jmp eax
And this reminds me of the glass jaw of Zen4 (and IIRC Zen3):
indirect branch prediction does not work properly if the target is
too far from the indirect branch; the target of the indirect branch
in DOCOL is:
$7EF4846CC658 ;s 1->1
7EF484603409: mov rax,[r10]
7EF48460340C: add r10,$08
7EF484603410: mov r13,rax
7EF484603413: mov rax,$00[r13]
7EF484603417: jmp eax
so you get two such branch prediction hiccups per call to BAR: one on
indirect-branching to DOCOL, and one for the indirect branch in DOCOL
to the ;S.
IIRC Gracemont (E-core of Alder Lake/Raptor Lake) suffers from the
same problem. I have reported that on comp.arch or here when I first
found it.
Unfortunately, I still have not found the reason why cd16sim has been
sped up by a factor of 2 when going from the "old" Gforth to the
current Gforth, but it's nice to see that the new DOES> implementation
provides a speedup.
- anton
--
M. Anton Ertl
http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs:
http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard:
https://forth-standard.org/
EuroForth 2025 CFP:
http://www.euroforth.org/ef25/cfp.html
EuroForth 2025 registration:
https://euro.theforth.net/