• DOES> speed

    From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sat Sep 6 14:48:57 2025

    Last year I found that the cd16sim benchmark can be sped up in
    gforth-fast by a factor of 3 by replacing one occurrence of

    does> ;

    with

    ;
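
    For illustration, here is a hedged sketch of the kind of change meant
    (hypothetical code, not the actual cd16sim source): a trailing DOES>
    with an empty body gives the defined words the same run-time behavior
    as plain CREATE, namely pushing the body address, just through a
    slower mechanism.

    \ hypothetical defining word with an empty DOES> clause
    : table  ( n "name" -- )  create cells allot  does> ;
    \ the modification: no DOES>; defined words still push their body address
    : table' ( n "name" -- )  create cells allot ;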

    When I looked at the generated code, I found that the code that Gforth generates for does>-defined words can be improved <2024Sep21.192551@mips.complang.tuwien.ac.at> <2024Oct3.125926@mips.complang.tuwien.ac.at>.

    This year I find that on a Core i5-1135G7 cd16sim on recent Gforth is
    twice as fast as on the Gforth that I used for the measurements last
    year; the one used for measurements last year is based on a July 2023
    version, with ip-update optimization and more stack caching integrated
    (which should be the relevant differences from the recent version, and,
    for the other benchmarks, they are). So knowing about this cd16sim
    issue, I wanted to check the performance of an empty does> with the
    following microbenchmark:

    : foo create does> ;
    foo bar
    : bench 100000000 0 do
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    loop ;

    So 1G invocations of BAR DROP.

    The results are:

    old             current
    22_005_621_021  13_516_461_910  instructions
    23_773_076_801   4_657_432_473  cycles Zen4 (Ryzen 8700G)
     6_861_042_139   4_654_259_763  cycles Tiger Lake (Core i5-1135G7)

    So with the old code this microbenchmark seems to hit a glass jaw of
    Zen4; the Tiger Lake, however, is not twice as slow for this
    microbenchmark, and the other code in CD16sim should dilute the speed
    difference even more, but who knows.

    The code for calling bar once is as follows:

    old                 current
    does-xt 1->1        lit 0->1
    bar                 bar
    mov [r11],rbx       mov r8,$10[rbx]
    mov rbx,$10[r13]    call 1->1
    sub r11,$08         $7DF305A7DD40
    add r13,$10         mov rax,$20[rbx]
    mov rax,-$08[rbx]   sub r14,$08
    mov rdx,$20[rax]    add rbx,$28
    mov rax,-$10[rdx]   mov [r14],rbx
    jmp eax             mov rbx,rax
                        mov rax,[rbx]
                        jmp eax

    DOES-XT pushes the body address of BAR and then EXECUTEs the xt of the anonymous colon definition started by DOES>, so it also runs DOCOL:

    add r13,$08
    sub r10,$08
    mov [r10],r13
    mov r13,rdx
    mov rax,$00[r13]
    jmp eax
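
    At the Forth level, the old scheme for BAR is roughly equivalent to
    the following sketch (illustration only, not Gforth source;
    DOES-CLAUSE-XT stands in for the xt of the anonymous colon definition
    begun by DOES>, whose body here is empty):

    :noname ( -- ) ; constant does-clause-xt  \ stands in for the empty DOES> clause
    : bar-sketch ( -- a-addr )
      [ ' bar >body ] literal   \ push the body address, as DOES-XT does
      does-clause-xt execute ;  \ EXECUTE the clause: DOCOL, then ;S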

    More precisely, the old code executes the following sequence:

    $7EF48495C548 does-xt 1->1
    $7EF48495C550 bar
    7EF4846035CD: mov [r11],rbx
    7EF4846035D0: mov rbx,$10[r13]
    7EF4846035D4: sub r11,$08
    7EF4846035D8: add r13,$10
    7EF4846035DC: mov rax,-$08[rbx]
    7EF4846035E0: mov rdx,$20[rax]
    7EF4846035E4: mov rax,-$10[rdx]
    7EF4846035E8: jmp eax
    DOCOL
    401A9A: add r13,$08
    401A9E: sub r10,$08
    401AA2: mov [r10],r13
    401AA5: mov r13,rdx
    401AA8: mov rax,$00[r13]
    401AAC: jmp eax

    And this reminds me what the glass jaw of Zen4 (and IIRC Zen3) is:
    Indirect branch prediction does not work properly if the target is
    too far from the indirect branch; the target of the indirect branch in
    DOCOL is:

    $7EF4846CC658 ;s 1->1
    7EF484603409: mov rax,[r10]
    7EF48460340C: add r10,$08
    7EF484603410: mov r13,rax
    7EF484603413: mov rax,$00[r13]
    7EF484603417: jmp eax

    so you get two such branch prediction hiccups per call to BAR: one on
    the indirect branch to DOCOL, and one for the indirect branch in DOCOL
    to the ;S.

    IIRC Gracemont (E-core of Alder Lake/Raptor Lake) suffers from the
    same problem. I have reported that on comp.arch or here when I first
    found it.

    Unfortunately, I still have not found the reason why cd16sim has been
    sped up by a factor of 2 when going from the "old" Gforth to the
    current Gforth, but it's nice to see that the new DOES> implementation
    provides a speedup.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Sun Sep 7 06:12:27 2025

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Last year I found that the cd16sim benchmark can be sped up in
    gforth-fast by a factor of 3 by replacing one occurrence of

    does> ;

    with

    ;

    When I looked at the generated code, I found that the code that Gforth generates for does>-defined words can be improved <2024Sep21.192551@mips.complang.tuwien.ac.at> <2024Oct3.125926@mips.complang.tuwien.ac.at>.

    This year I find that on a Core i5-1135G7 cd16sim on recent Gforth is
    twice as fast as on the Gforth that I used for the measurements last
    year.

    To further investigate this, I measured both versions of Gforth with
    the original CD16sim benchmark (which I use for regular benchmarking)
    and with the modification that leaves out the DOES> in the definition
    where it is followed by ";".

    The results are as follows:

                  DOES> ;             |              ;
      old DOES-XT   new LIT CALL  |    old DOES-XT   new LIT CALL
    5_821_729_204  2_605_381_330  |  1_835_604_860  1_747_520_536  cycles:u
    8_358_289_634  7_264_296_443  |  5_418_477_733  5_107_832_523  instructions:u
       84_279_937      7_877_931  |      5_085_282      4_332_441  branch-misses:u
        9_128_041      3_195_014  |      2_034_349      1_594_811  L1-dcache-load-misses
       34_196_386      2_130_747  |        852_225        514_153  L1-icache-load-misses

    So we see that leaving out this one DOES> eliminates most of the
    speed difference. We also see that the version with DOES> and DOES-XT
    has many more branch mispredictions and I-cache misses, and somewhat
    more D-cache misses, and even the LIT CALL version has more I-cache
    and D-cache misses. At the moment I don't understand where those
    additional cache misses are coming from.

    The branch mispredictions are more plausible: with LIT CALL a history
    of one indirect branch is enough to make a correct prediction, whereas
    with DOES-XT two are needed; I would have thought that the
    indirect-branch history is longer on modern CPUs, but maybe it is not.

    I usually estimate a branch misprediction penalty at 20 cycles. The
    L2 latency for Tiger Lake is about 15 cycles (but OoO execution may
    overlap that with other instructions, reducing the increase in
    execution time). I have also measured the number of LLC-loads:u
    (last-level cache), and they are small, so nearly all L1 misses seem
    to hit in L2.

    With these estimates, the DOES-XT variant should consume about

    84_279_937 7_877_931 - 20 *
    9_128_041 34_196_386 + 3_195_014 2_130_747 + - 15 * +

    i.e., 2.1G cycles more than LIT CALL. The actual difference is 3.2G
    cycles. A little of that may be coming from having to execute 1.1G
    extra instructions, but I guess that there is another
    microarchitectural aspect involved. But I don't plan to investigate
    that.
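
    For clarity, here is the same estimate with the intermediate results
    spelled out (a commented restatement of the postfix expression above;
    the values in the comments are just the worked-out arithmetic):

    84279937 7877931 -     \ extra branch misses:       76_402_006
    20 *                   \ 1_528_040_120 cycles from mispredictions
    9128041 34196386 +     \ DOES-XT L1 misses:         43_324_427
    3195014 2130747 + -    \ minus LIT CALL L1 misses:  37_998_666
    15 *                   \ 569_979_990 cycles from L2 hits
    + .                    \ total estimate:            2_098_020_110 cycles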

    In any case, the DOES-XT implementation of DOES> seems to run into
    microarchitectural performance pitfalls on Tiger Lake that we see in
    CD16sim, but don't see in the microbenchmark (as a reminder, I show
    the microbenchmark and its results again below); the LIT CALL
    implementation also seems to have these issues, but to a much smaller
    degree. In addition, we see on the microbenchmark that DOES-XT runs
    into a microarchitectural pitfall on Zen4 (and, as shown in earlier
    measurements, Gracemont), while LIT CALL does not. So LIT CALL offers
    performance benefits beyond the obvious reduction in instructions.

    As a reminder, here's the microbenchmark and its results:

    : foo create does> ;
    foo bar
    : bench 100000000 0 do
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    bar drop
    loop ;

    So 1G invocations of BAR DROP.

    The results are:

    old             current
    22_005_621_021  13_516_461_910  instructions
    23_773_076_801   4_657_432_473  cycles Zen4 (Ryzen 8700G)
     6_861_042_139   4_654_259_763  cycles Tiger Lake (Core i5-1135G7)

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html
    EuroForth 2025 registration: https://euro.theforth.net/
  • From albert@spenarnc.xs4all.nl to comp.lang.forth on Sun Sep 7 11:40:56 2025

    In article <2025Sep6.164857@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Last year I found that the cd16sim benchmark can be sped up in
    gforth-fast by a factor of 3 by replacing one occurrence of

    does> ;

    with

    ;


    All buffers generated by CREATE have a DOES> pointer that can be
    changed. That makes CREATE not the fundamental building block it is
    suggested to be. ciforth has DATA, which is in effect the familiar
    VARIABLE but without the allocation:

    : BUFFER   DATA CELLS ALLOCATE ;
    : VARIABLE DATA 1 CELLS ALLOCATE ;

    DATA creates a header with DOVAR as its action.

    A proper building block would be

    HEADER ( name label -- dea )

    where the name is the created name and the label points to the
    low-level action. The dea is the resulting "xt/nt". The dea is a
    dictionary entry, so that other fields (data content, immediate flags,
    etc.) can be filled in later.

    DATA then becomes:
    parse a name and pass the label DOVAR to HEADER, leaving a dea;
    fill in HERE as the low-level data field of the dea.
    (This data field is not to be confused with >BODY, which only makes
    sense in the context of CREATE DOES>.)
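
    A hedged sketch of that description (not ciforth source): it assumes
    HEADER with the stack effect ( name label -- dea ) as above, DOVAR
    available as a label value, a hypothetical PARSED-NAME that leaves the
    parsed name in whatever form HEADER expects, and a hypothetical >DFA
    that maps a dea to the address of its low-level data field.

    : DATA ( "name" -- )
       parsed-name dovar header   \ build a header whose action is DOVAR
       >dfa here swap ! ;         \ fill in HERE as its low-level data field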

    Remember CDFLN (code, data, flags, link, name).


    Groetjes Albert.
    --
    The Chinese government is satisfied with its military superiority over USA.
    The next 5 year plan has as primary goal to advance life expectancy
    over 80 years, like Western Europe.