Currently there are 60 instances of ?DUP-IF and one instance of
?DUP-0=-IF in Gforth. It may be a good idea to replace them whith DUP
IF ... THEN DROP. Another day maybe.
While looking at the strcmp code, I saw that the code that used ?DUP
was large, and replaced it with code using DUP and, in the 0 branch,
DROP. So now I look closely at the code produced for ?DUP and the alternatives.
Since the beginning Gforth has provided ?DUP-IF and ?DUP-0=-IF, which
have the advantage that the value is not tested twice (once in ?DUP
and once in IF), and there are not two branches. One might imagine
that this produces the optimal code, but see below.
Other systems may or may not combine the ?DUP and the IF by themselves
to avoid the two branches mentioned above.
Here are three simple words that all do the same thing:
: ?dup-test1 ( n1 -- n2 )
?dup if exit then 1 ;
: ?dup-test2 ( n1 -- n2 )
dup if exit then drop 1 ;
[defined] ?dup-if [if]
: ?dup-test3 ( n1 -- n2 )
?dup-if exit then 1 ;
[then]
Here's the code that gforth-fast produces for them (the first line
shows only the code that is different):
?dup if exit then dup if exit then drop ?dup-if exit then
?dup 1->1 dup 1->2 ?dup-?branch 1->1
add $0x8,%r10 mov %r13,%r15 <?dup-test2+$18>
test %r13,%r13 ?branch 2->1 add $0x10,%rbx
je 0x7f6ca1271b18 <?dup-test3+$20> mov -0x8(%rbx),%rax
mov %r13,-0x8(%r10) add $0x20,%rbx add $0x8,%r10
sub $0x8,%r10 mov (%rbx),%rax test %r13,%r13
sub $0x8,%r10 test %r15,%r15 je x
?branch 1->1 jne ;s mov %r13,-0x8(%r10) <?dup-test1+$20> jmp *%rax mov %rbx,%rax
add $0x20,%rbx ;s 1->1 sub $0x8,%r10
add $0x8,%r10 mov (%r14),%rbx x:mov (%r10),%r13
test %r13,%r13 add $0x8,%r14 mov %rax,%rbx
mov (%rbx),%rax mov (%rbx),%rax mov (%rbx),%rax
mov (%r10),%r13 jmp *%rax jmp *%rax
jne ;s drop 1->0 ;s 1->1
jmp *%rax lit 0->1 mov (%r14),%rbx
;s 1->1 #1 add $0x8,%r14
mov (%r14),%rbx mov 0x10(%rbx),%r13 mov (%rbx),%rax
add $0x8,%r14 ;s 1->1 jmp *%rax
mov (%rbx),%rax mov (%r14),%rbx lit 1->1
jmp *%rax add $0x8,%r14 #1
lit 1->1 mov (%rbx),%rax mov %r13,(%r10)
#1 jmp *%rax sub $0x8,%r10
mov %r13,(%r10) mov 0x8(%rbx),%r13
sub $0x8,%r10 ;s 1->1
mov 0x8(%rbx),%r13 mov (%r14),%rbx
;s 1->1 add $0x8,%r14
mov (%r14),%rbx mov (%rbx),%rax
add $0x8,%r14 jmp *%rax
mov (%rbx),%rax
jmp *%rax
So in the left column ?DUP checks the TOS and performs one branch, and ?BRANCH checks and branches again. The taken jne is the fall-through
path on the Forth level, the Forth-level taken branch is performed by
the "jmp *rax". All control-flow in Gforth works in the
direct-threaded code way, as seen in ?BRANCH and ;S.
In the middle column we see nicely that the DUP is compiled to only
one instruction (and ?BRANCH) becomes shorter than in the left
column. DROP is compiled to 0 instructions (and it makes the following
LIT cheaper, compare with the LIT in the left and the right column).
All of this is due to multi-state stack caching.
The right column drops TOS when TOS=0 (and restores the canonical
state that TOS is in %r13), and ?DUP-?BRANCH uses the threaded-code
dispatch on both outcomes. The latter could be fixed, but the code
would probably not become shorter.
Overall the middle-column approach shines. This is a best case for
this approach, but I think that it would still be at least on par for
other cases.
Currently there are 60 instances of ?DUP-IF and one instance of
?DUP-0=-IF in Gforth. It may be a good idea to replace them whith DUP
IF ... THEN DROP. Another day maybe.
Let's see how things look for other systems:
iForth 5.1-mini:
?dup if exit then dup if exit then drop
pop rbx pop rbx
or rbx, rbx cmp rbx, 0 b#
je x je y
push rbx push rbx
x:cmp rbx, 0 b# jmp z
je y pop rbx
jmp z y:push 1 b#
y:push 1 b# z:;
z:;
lxf:
?dup if exit then dup if exit then drop
or ebx , ebx cmp ebx , # 0h
je "0804FBEE" je "0804FC14"
mov [ebp-4h] , ebx ret near
lea ebp , [ebp-4h] x:mov ebx , # 1h
cmp ebx , # 0h ret near
mov ebx , [ebp]
lea ebp , [ebp+4h]
je x
ret near
x:mov [ebp-4h] , ebx
mov ebx , # 1h
lea ebp , [ebp-4h]
ret near
SwiftForth 4.0.0-RC89
?dup if exit then dup if exit then drop
RBX RBX OR RBX RBX OR
x JNZ x JZ
0 [RBP] RBX MOV RET
8 [RBP] RBP LEA x:0 [RBP] RBX MOV
y JMP 8 [RBP] RBP LEA
x:RET -8 [RBP] RBP LEA
y:-8 [RBP] RBP LEA RBX 0 [RBP] MOV
RBX 0 [RBP] MOV 1 # EBX MOV
1 # EBX MOV RET
RET
SwiftForth peephole-optimizes ?DUP IF into ?DUP-IF with the rule:
OPTIMIZE ?DUP (IF) SUBSTITUTE ?DUP-IF
(but, interestingly, "[defined] ?DUP-IF" produces 0 (while "locate
?DUP-IF" shows the source code). It does not know how to optimize
"DROP 1", but the result of the right-hand approach is still better.
vfx64 5.43
?dup if exit then dup if exit then drop
TEST RBX, RBX TEST RBX, RBX
JNZ/NE x JZ/E x
MOV RBX, [RBP] RET/NEXT
LEA RBP, [RBP+08] x:MOV EBX, # 00000001
JMP y RET/NEXT
x:RET/NEXT
y:LEA RBP, [RBP+-08]
MOV [RBP], RBX
MOV EBX, # 00000001
RET/NEXT
VFX also seems to optimize ?DUP IF to avoid the double checking, but
the result is still worse than for the approach in the second column.
Bottom line: On all of these systems, DUP IF ... THEN DROP produces
shorter code than ?DUP IF ... THEN or (if present) ?DUP-IF ... THEN.
?DUP and ?DUP-IF may be good ideas for traditional threaded-code
systems, but produce larger code on these more sophisticated systems.
Of course, with additional sophistication, a compiler could produce
the same code for ?DUP IF ... THEN as for DUP IF ... THEN DROP, but
for now we are not there, and is ?DUP IF used often enough to add such sophistication? In SwiftForth's source code there are 100 occurences
of ?DUP, and 57 of ?DUP IF.
- anton
| Sysop: | DaiTengu |
|---|---|
| Location: | Appleton, WI |
| Users: | 1,116 |
| Nodes: | 10 (0 / 10) |
| Uptime: | 89:18:11 |
| Calls: | 14,306 |
| Calls today: | 1 |
| Files: | 186,338 |
| D/L today: |
1,552 files (532M bytes) |
| Messages: | 2,525,562 |