Forum: War Ensemble BBS

?DUP

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri May 1 06:41:25 2026

From Newsgroup: comp.lang.forth

While looking at the strcmp code, I saw that the code that used ?DUP
was large, and replaced it with code using DUP and, in the 0 branch,
DROP. So now I look closely at the code produced for ?DUP and the alternatives.

Since the beginning Gforth has provided ?DUP-IF and ?DUP-0=-IF, which
have the advantage that the value is not tested twice (once in ?DUP
and once in IF), and there are not two branches. One might imagine
that this produces the optimal code, but see below.

Other systems may or may not combine the ?DUP and the IF by themselves
to avoid the two branches mentioned above.

Here are three simple words that all do the same thing:

: ?dup-test1 ( n1 -- n2 )
?dup if exit then 1 ;

: ?dup-test2 ( n1 -- n2 )
dup if exit then drop 1 ;

[defined] ?dup-if [if]
: ?dup-test3 ( n1 -- n2 )
?dup-if exit then 1 ;
[then]

Here's the code that gforth-fast produces for them (the first line
shows only the code that is different):

?dup if exit then dup if exit then drop ?dup-if exit then
?dup 1->1 dup 1->2 ?dup-?branch 1->1
add $0x8,%r10 mov %r13,%r15 <?dup-test2+$18>
test %r13,%r13 ?branch 2->1 add $0x10,%rbx
je 0x7f6ca1271b18 <?dup-test3+$20> mov -0x8(%rbx),%rax
mov %r13,-0x8(%r10) add $0x20,%rbx add $0x8,%r10
sub $0x8,%r10 mov (%rbx),%rax test %r13,%r13
sub $0x8,%r10 test %r15,%r15 je x ?branch 1->1 jne ;s mov %r13,-0x8(%r10) <?dup-test1+$20> jmp *%rax mov %rbx,%rax
add $0x20,%rbx ;s 1->1 sub $0x8,%r10
add $0x8,%r10 mov (%r14),%rbx x:mov (%r10),%r13
test %r13,%r13 add $0x8,%r14 mov %rax,%rbx
mov (%rbx),%rax mov (%rbx),%rax mov (%rbx),%rax
mov (%r10),%r13 jmp *%rax jmp *%rax
jne ;s drop 1->0 ;s 1->1
jmp *%rax lit 0->1 mov (%r14),%rbx
;s 1->1 #1 add $0x8,%r14
mov (%r14),%rbx mov 0x10(%rbx),%r13 mov (%rbx),%rax
add $0x8,%r14 ;s 1->1 jmp *%rax
mov (%rbx),%rax mov (%r14),%rbx lit 1->1
jmp *%rax add $0x8,%r14 #1
lit 1->1 mov (%rbx),%rax mov %r13,(%r10)
#1 jmp *%rax sub $0x8,%r10
mov %r13,(%r10) mov 0x8(%rbx),%r13
sub $0x8,%r10 ;s 1->1
mov 0x8(%rbx),%r13 mov (%r14),%rbx
;s 1->1 add $0x8,%r14
mov (%r14),%rbx mov (%rbx),%rax
add $0x8,%r14 jmp *%rax
mov (%rbx),%rax
jmp *%rax

So in the left column ?DUP checks the TOS and performs one branch, and
?BRANCH checks and branches again. The taken jne is the fall-through
path on the Forth level, the Forth-level taken branch is performed by
the "jmp *rax". All control-flow in Gforth works in the
direct-threaded code way, as seen in ?BRANCH and ;S.

In the middle column we see nicely that the DUP is compiled to only
one instruction (and ?BRANCH) becomes shorter than in the left
column. DROP is compiled to 0 instructions (and it makes the following
LIT cheaper, compare with the LIT in the left and the right column).
All of this is due to multi-state stack caching.

The right column drops TOS when TOS=0 (and restores the canonical
state that TOS is in %r13), and ?DUP-?BRANCH uses the threaded-code
dispatch on both outcomes. The latter could be fixed, but the code
would probably not become shorter.

Overall the middle-column approach shines. This is a best case for
this approach, but I think that it would still be at least on par for
other cases.

Currently there are 60 instances of ?DUP-IF and one instance of
?DUP-0=-IF in Gforth. It may be a good idea to replace them whith DUP
IF ... THEN DROP. Another day maybe.

Let's see how things look for other systems:

iForth 5.1-mini:
?dup if exit then dup if exit then drop
pop rbx pop rbx
or rbx, rbx cmp rbx, 0 b#
je x je y
push rbx push rbx
x:cmp rbx, 0 b# jmp z
je y pop rbx
jmp z y:push 1 b#
y:push 1 b# z:;
z:;

lxf:
?dup if exit then dup if exit then drop
or ebx , ebx cmp ebx , # 0h
je "0804FBEE" je "0804FC14"
mov [ebp-4h] , ebx ret near
lea ebp , [ebp-4h] x:mov ebx , # 1h
cmp ebx , # 0h ret near
mov ebx , [ebp]
lea ebp , [ebp+4h]
je x
ret near
x:mov [ebp-4h] , ebx
mov ebx , # 1h
lea ebp , [ebp-4h]
ret near

SwiftForth 4.0.0-RC89
?dup if exit then dup if exit then drop
RBX RBX OR RBX RBX OR
x JNZ x JZ
0 [RBP] RBX MOV RET
8 [RBP] RBP LEA x:0 [RBP] RBX MOV
y JMP 8 [RBP] RBP LEA
x:RET -8 [RBP] RBP LEA
y:-8 [RBP] RBP LEA RBX 0 [RBP] MOV
RBX 0 [RBP] MOV 1 # EBX MOV
1 # EBX MOV RET
RET

SwiftForth peephole-optimizes ?DUP IF into ?DUP-IF with the rule:

OPTIMIZE ?DUP (IF) SUBSTITUTE ?DUP-IF

(but, interestingly, "[defined] ?DUP-IF" produces 0 (while "locate
?DUP-IF" shows the source code). It does not know how to optimize
"DROP 1", but the result of the right-hand approach is still better.

vfx64 5.43
?dup if exit then dup if exit then drop
TEST RBX, RBX TEST RBX, RBX
JNZ/NE x JZ/E x
MOV RBX, [RBP] RET/NEXT
LEA RBP, [RBP+08] x:MOV EBX, # 00000001
JMP y RET/NEXT
x:RET/NEXT
y:LEA RBP, [RBP+-08]
MOV [RBP], RBX
MOV EBX, # 00000001
RET/NEXT

VFX also seems to optimize ?DUP IF to avoid the double checking, but
the result is still worse than for the approach in the second column.

Bottom line: On all of these systems, DUP IF ... THEN DROP produces
shorter code than ?DUP IF ... THEN or (if present) ?DUP-IF ... THEN.
?DUP and ?DUP-IF may be good ideas for traditional threaded-code
systems, but produce larger code on these more sophisticated systems.

Of course, with additional sophistication, a compiler could produce
the same code for ?DUP IF ... THEN as for DUP IF ... THEN DROP, but
for now we are not there, and is ?DUP IF used often enough to add such sophistication? In SwiftForth's source code there are 100 occurences
of ?DUP, and 57 of ?DUP IF.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
--- Synchronet 3.22a-Linux NewsLink 1.2

From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri May 1 10:36:57 2026

From Newsgroup: comp.lang.forth

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

Currently there are 60 instances of ?DUP-IF and one instance of
?DUP-0=-IF in Gforth. It may be a good idea to replace them whith DUP
IF ... THEN DROP. Another day maybe.

Actually the same day. I replaced 25 (out of 60) instances of ?DUP-IF
and the only occurence of ?DUP-0=-IF with DUP IF and an additional
DROP or equivalent elsewhere; these were the cases that have an ELSE
branch or the code between IF and THEN ends in EXIT. In the other
cases I left the ?DUP-IF alone.

The cases that were changed included some that are used frequently, in particular THROW.

As a result, the native-code size on AMD64 was reduced by 497 bytes,
i.e., by 19 bytes per replacement on average.

The threaded-code size increased by 47 cells (376 bytes), i.e, 1.8
cells per replacement. That was to be expected given the additional
primitives involved.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
--- Synchronet 3.22a-Linux NewsLink 1.2

From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Fri May 1 13:05:13 2026

From Newsgroup: comp.lang.forth

On 01-05-2026 08:41, Anton Ertl wrote:

While looking at the strcmp code, I saw that the code that used ?DUP
was large, and replaced it with code using DUP and, in the 0 branch,
DROP. So now I look closely at the code produced for ?DUP and the alternatives.

Since the beginning Gforth has provided ?DUP-IF and ?DUP-0=-IF, which
have the advantage that the value is not tested twice (once in ?DUP
and once in IF), and there are not two branches. One might imagine
that this produces the optimal code, but see below.

Other systems may or may not combine the ?DUP and the IF by themselves
to avoid the two branches mentioned above.

Here are three simple words that all do the same thing:

: ?dup-test1 ( n1 -- n2 )
?dup if exit then 1 ;

: ?dup-test2 ( n1 -- n2 )
dup if exit then drop 1 ;

[defined] ?dup-if [if]
: ?dup-test3 ( n1 -- n2 )
?dup-if exit then 1 ;
[then]

Here's the code that gforth-fast produces for them (the first line
shows only the code that is different):

?dup if exit then dup if exit then drop ?dup-if exit then
?dup 1->1 dup 1->2 ?dup-?branch 1->1
add $0x8,%r10 mov %r13,%r15 <?dup-test2+$18>
test %r13,%r13 ?branch 2->1 add $0x10,%rbx
je 0x7f6ca1271b18 <?dup-test3+$20> mov -0x8(%rbx),%rax
mov %r13,-0x8(%r10) add $0x20,%rbx add $0x8,%r10
sub $0x8,%r10 mov (%rbx),%rax test %r13,%r13
sub $0x8,%r10 test %r15,%r15 je x
?branch 1->1 jne ;s mov %r13,-0x8(%r10) <?dup-test1+$20> jmp *%rax mov %rbx,%rax
add $0x20,%rbx ;s 1->1 sub $0x8,%r10
add $0x8,%r10 mov (%r14),%rbx x:mov (%r10),%r13
test %r13,%r13 add $0x8,%r14 mov %rax,%rbx
mov (%rbx),%rax mov (%rbx),%rax mov (%rbx),%rax
mov (%r10),%r13 jmp *%rax jmp *%rax
jne ;s drop 1->0 ;s 1->1
jmp *%rax lit 0->1 mov (%r14),%rbx
;s 1->1 #1 add $0x8,%r14
mov (%r14),%rbx mov 0x10(%rbx),%r13 mov (%rbx),%rax
add $0x8,%r14 ;s 1->1 jmp *%rax
mov (%rbx),%rax mov (%r14),%rbx lit 1->1
jmp *%rax add $0x8,%r14 #1
lit 1->1 mov (%rbx),%rax mov %r13,(%r10)
#1 jmp *%rax sub $0x8,%r10
mov %r13,(%r10) mov 0x8(%rbx),%r13
sub $0x8,%r10 ;s 1->1
mov 0x8(%rbx),%r13 mov (%r14),%rbx
;s 1->1 add $0x8,%r14
mov (%r14),%rbx mov (%rbx),%rax
add $0x8,%r14 jmp *%rax
mov (%rbx),%rax
jmp *%rax

So in the left column ?DUP checks the TOS and performs one branch, and ?BRANCH checks and branches again. The taken jne is the fall-through
path on the Forth level, the Forth-level taken branch is performed by
the "jmp *rax". All control-flow in Gforth works in the
direct-threaded code way, as seen in ?BRANCH and ;S.

In the middle column we see nicely that the DUP is compiled to only
one instruction (and ?BRANCH) becomes shorter than in the left
column. DROP is compiled to 0 instructions (and it makes the following
LIT cheaper, compare with the LIT in the left and the right column).
All of this is due to multi-state stack caching.

The right column drops TOS when TOS=0 (and restores the canonical
state that TOS is in %r13), and ?DUP-?BRANCH uses the threaded-code
dispatch on both outcomes. The latter could be fixed, but the code
would probably not become shorter.

Overall the middle-column approach shines. This is a best case for
this approach, but I think that it would still be at least on par for
other cases.

Currently there are 60 instances of ?DUP-IF and one instance of
?DUP-0=-IF in Gforth. It may be a good idea to replace them whith DUP
IF ... THEN DROP. Another day maybe.

Let's see how things look for other systems:

iForth 5.1-mini:
?dup if exit then dup if exit then drop
pop rbx pop rbx
or rbx, rbx cmp rbx, 0 b#
je x je y
push rbx push rbx
x:cmp rbx, 0 b# jmp z
je y pop rbx
jmp z y:push 1 b#
y:push 1 b# z:;
z:;

lxf:
?dup if exit then dup if exit then drop
or ebx , ebx cmp ebx , # 0h
je "0804FBEE" je "0804FC14"
mov [ebp-4h] , ebx ret near
lea ebp , [ebp-4h] x:mov ebx , # 1h
cmp ebx , # 0h ret near
mov ebx , [ebp]
lea ebp , [ebp+4h]
je x
ret near
x:mov [ebp-4h] , ebx
mov ebx , # 1h
lea ebp , [ebp-4h]
ret near

SwiftForth 4.0.0-RC89
?dup if exit then dup if exit then drop
RBX RBX OR RBX RBX OR
x JNZ x JZ
0 [RBP] RBX MOV RET
8 [RBP] RBP LEA x:0 [RBP] RBX MOV
y JMP 8 [RBP] RBP LEA
x:RET -8 [RBP] RBP LEA
y:-8 [RBP] RBP LEA RBX 0 [RBP] MOV
RBX 0 [RBP] MOV 1 # EBX MOV
1 # EBX MOV RET
RET

SwiftForth peephole-optimizes ?DUP IF into ?DUP-IF with the rule:

OPTIMIZE ?DUP (IF) SUBSTITUTE ?DUP-IF

(but, interestingly, "[defined] ?DUP-IF" produces 0 (while "locate
?DUP-IF" shows the source code). It does not know how to optimize
"DROP 1", but the result of the right-hand approach is still better.

vfx64 5.43
?dup if exit then dup if exit then drop
TEST RBX, RBX TEST RBX, RBX
JNZ/NE x JZ/E x
MOV RBX, [RBP] RET/NEXT
LEA RBP, [RBP+08] x:MOV EBX, # 00000001
JMP y RET/NEXT
x:RET/NEXT
y:LEA RBP, [RBP+-08]
MOV [RBP], RBX
MOV EBX, # 00000001
RET/NEXT

VFX also seems to optimize ?DUP IF to avoid the double checking, but
the result is still worse than for the approach in the second column.

Bottom line: On all of these systems, DUP IF ... THEN DROP produces
shorter code than ?DUP IF ... THEN or (if present) ?DUP-IF ... THEN.
?DUP and ?DUP-IF may be good ideas for traditional threaded-code
systems, but produce larger code on these more sophisticated systems.

Of course, with additional sophistication, a compiler could produce
the same code for ?DUP IF ... THEN as for DUP IF ... THEN DROP, but
for now we are not there, and is ?DUP IF used often enough to add such sophistication? In SwiftForth's source code there are 100 occurences
of ?DUP, and 57 of ?DUP IF.

- anton

Frankly - I don't implement ?DUP. For the same reason I detest >FLOAT.
Words should *ALWAYS* have one - and only one stack diagram. It's
already hard enough to follow the stack flow - let alone if you
introduce conditional stack diagrams.

I love speed as much as the next guy - unless it destroys the clarity of
the code.

Consequently, I always use DUP IF .. ELSE DROP .. THEN, being assured
that the same stack diagram is provided by both branches.

In the roughly 400 example programs in the 4tH repository, ?DUP appears exactly 4 times - in programs I ported from Forth as verbatim as I could.

In the about 650 libraries of 4tH, ?DUP appears exactly *ONE* time - in
the library where it is defined. And it shares that space with PICK and
ROLL, a few other notorious citizens of NO-NO land.

Personally, I see where you're coming from with ?DUP-TEST - I just don't recognize the pattern, but I do like that the stack diagram remains the
same.

However, I do enjoy that compilers seem to agree with the practice I've exercised for roughly the last 30 years ;-)

Hans Bezemer
--- Synchronet 3.22a-Linux NewsLink 1.2

Sysop:	DaiTengu
Location:	Appleton, WI
Users:	1,124
Nodes:	10 (0 / 10)
Uptime:	22:23:13
Calls:	14,392
Calls today:	1
Files:	186,389
D/L today:	3,401 files (859M bytes)
Messages:	2,544,924
Posted today:	1

?DUP

Who's Online

Recent Visitors

System Info