• ?DUP

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri May 1 06:41:25 2026
    From Newsgroup: comp.lang.forth

    While looking at the strcmp code, I saw that the code that used ?DUP
    was large, and replaced it with code using DUP and, in the 0 branch,
    DROP. So now I look closely at the code produced for ?DUP and the
    alternatives.

    Since the beginning Gforth has provided ?DUP-IF and ?DUP-0=-IF, which
    have the advantage that the value is not tested twice (once in ?DUP
    and once in IF), and there are not two branches. One might imagine
    that this produces the optimal code, but see below.

    Other systems may or may not combine the ?DUP and the IF by themselves
    to avoid the two branches mentioned above.

    Here are three simple words that all do the same thing:

    : ?dup-test1 ( n1 -- n2 )
    ?dup if exit then 1 ;

    : ?dup-test2 ( n1 -- n2 )
    dup if exit then drop 1 ;

    [defined] ?dup-if [if]
    : ?dup-test3 ( n1 -- n2 )
    ?dup-if exit then 1 ;
    [then]
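
    A quick sanity check that the variants agree, as a minimal sketch: it
    repeats the first two definitions so that it runs on any standard
    system, and SAME? is a helper name made up for this illustration.

```forth
\ Self-contained copies of the first two variants from the post.
: ?dup-test1 ( n1 -- n2 )  ?dup if exit then 1 ;
: ?dup-test2 ( n1 -- n2 )  dup if exit then drop 1 ;

\ SAME? is a hypothetical helper: true if both variants agree on n.
: same? ( n -- flag )  dup ?dup-test1  swap ?dup-test2  = ;

0 same? .     \ zero input: both variants return 1, prints -1 (true)
5 same? .     \ nonzero input: both return the input, prints -1
-1 same? .    \ prints -1
```

    On a system that provides ?DUP-IF, ?dup-test3 can be compared in the
    same way.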


    Here's the code that gforth-fast produces for them (the first line
    shows only the code that is different):

    ?dup if exit then     dup if exit then drop  ?dup-if exit then
    ?dup 1->1             dup 1->2               ?dup-?branch 1->1
    add $0x8,%r10         mov %r13,%r15          <?dup-test2+$18>
    test %r13,%r13        ?branch 2->1           add $0x10,%rbx
    je 0x7f6ca1271b18     <?dup-test3+$20>       mov -0x8(%rbx),%rax
    mov %r13,-0x8(%r10)   add $0x20,%rbx         add $0x8,%r10
    sub $0x8,%r10         mov (%rbx),%rax        test %r13,%r13
    sub $0x8,%r10         test %r15,%r15         je x
    ?branch 1->1          jne ;s                 mov %r13,-0x8(%r10)
    <?dup-test1+$20>      jmp *%rax              mov %rbx,%rax
    add $0x20,%rbx        ;s 1->1                sub $0x8,%r10
    add $0x8,%r10         mov (%r14),%rbx        x:mov (%r10),%r13
    test %r13,%r13        add $0x8,%r14          mov %rax,%rbx
    mov (%rbx),%rax       mov (%rbx),%rax        mov (%rbx),%rax
    mov (%r10),%r13       jmp *%rax              jmp *%rax
    jne ;s                drop 1->0              ;s 1->1
    jmp *%rax             lit 0->1               mov (%r14),%rbx
    ;s 1->1               #1                     add $0x8,%r14
    mov (%r14),%rbx       mov 0x10(%rbx),%r13    mov (%rbx),%rax
    add $0x8,%r14         ;s 1->1                jmp *%rax
    mov (%rbx),%rax       mov (%r14),%rbx        lit 1->1
    jmp *%rax             add $0x8,%r14          #1
    lit 1->1              mov (%rbx),%rax        mov %r13,(%r10)
    #1                    jmp *%rax              sub $0x8,%r10
    mov %r13,(%r10)                              mov 0x8(%rbx),%r13
    sub $0x8,%r10                                ;s 1->1
    mov 0x8(%rbx),%r13                           mov (%r14),%rbx
    ;s 1->1                                      add $0x8,%r14
    mov (%r14),%rbx                              mov (%rbx),%rax
    add $0x8,%r14                                jmp *%rax
    mov (%rbx),%rax
    jmp *%rax

    So in the left column ?DUP checks the TOS and performs one branch, and
    ?BRANCH checks and branches again. The taken jne is the fall-through
    path on the Forth level; the Forth-level taken branch is performed by
    the "jmp *%rax". All control flow in Gforth works in the
    direct-threaded code way, as seen in ?BRANCH and ;S.

    In the middle column we see nicely that the DUP is compiled to only
    one instruction (and ?BRANCH becomes shorter than in the left
    column). DROP is compiled to 0 instructions (and it makes the following
    LIT cheaper, compare with the LIT in the left and the right column).
    All of this is due to multi-state stack caching.

    The right column drops TOS when TOS=0 (and restores the canonical
    state that TOS is in %r13), and ?DUP-?BRANCH uses the threaded-code
    dispatch on both outcomes. The latter could be fixed, but the code
    would probably not become shorter.

    Overall the middle-column approach shines. This is a best case for
    this approach, but I think that it would still be at least on par for
    other cases.

    Currently there are 60 instances of ?DUP-IF and one instance of
    ?DUP-0=-IF in Gforth. It may be a good idea to replace them with DUP
    IF ... THEN DROP. Another day maybe.

    Let's see how things look for other systems:

    iForth 5.1-mini:
    ?dup if exit then     dup if exit then drop
    pop rbx               pop rbx
    or rbx, rbx           cmp rbx, 0 b#
    je x                  je y
    push rbx              push rbx
    x:cmp rbx, 0 b#       jmp z
    je y                  pop rbx
    jmp z                 y:push 1 b#
    y:push 1 b#           z:;
    z:;

    lxf:
    ?dup if exit then     dup if exit then drop
    or ebx , ebx          cmp ebx , # 0h
    je "0804FBEE"         je "0804FC14"
    mov [ebp-4h] , ebx    ret near
    lea ebp , [ebp-4h]    x:mov ebx , # 1h
    cmp ebx , # 0h        ret near
    mov ebx , [ebp]
    lea ebp , [ebp+4h]
    je x
    ret near
    x:mov [ebp-4h] , ebx
    mov ebx , # 1h
    lea ebp , [ebp-4h]
    ret near

    SwiftForth 4.0.0-RC89
    ?dup if exit then       dup if exit then drop
    RBX RBX OR              RBX RBX OR
    x JNZ                   x JZ
    0 [RBP] RBX MOV         RET
    8 [RBP] RBP LEA         x:0 [RBP] RBX MOV
    y JMP                   8 [RBP] RBP LEA
    x:RET                   -8 [RBP] RBP LEA
    y:-8 [RBP] RBP LEA      RBX 0 [RBP] MOV
    RBX 0 [RBP] MOV         1 # EBX MOV
    1 # EBX MOV             RET
    RET

    SwiftForth peephole-optimizes ?DUP IF into ?DUP-IF with the rule:

    OPTIMIZE ?DUP (IF) SUBSTITUTE ?DUP-IF

    (but, interestingly, "[defined] ?DUP-IF" produces 0, while "locate
    ?DUP-IF" shows the source code). It does not know how to optimize
    "DROP 1", but the result of the right-hand approach is still better.

    vfx64 5.43
    ?dup if exit then       dup if exit then drop
    TEST RBX, RBX           TEST RBX, RBX
    JNZ/NE x                JZ/E x
    MOV RBX, [RBP]          RET/NEXT
    LEA RBP, [RBP+08]       x:MOV EBX, # 00000001
    JMP y                   RET/NEXT
    x:RET/NEXT
    y:LEA RBP, [RBP+-08]
    MOV [RBP], RBX
    MOV EBX, # 00000001
    RET/NEXT

    VFX also seems to optimize ?DUP IF to avoid the double checking, but
    the result is still worse than for the approach in the second column.

    Bottom line: On all of these systems, DUP IF ... THEN DROP produces
    shorter code than ?DUP IF ... THEN or (if present) ?DUP-IF ... THEN.
    ?DUP and ?DUP-IF may be good ideas for traditional threaded-code
    systems, but produce larger code on these more sophisticated systems.

    Of course, with additional sophistication, a compiler could produce
    the same code for ?DUP IF ... THEN as for DUP IF ... THEN DROP, but
    for now we are not there, and is ?DUP IF used often enough to add such
    sophistication? In SwiftForth's source code there are 100 occurrences
    of ?DUP, and 57 of ?DUP IF.

    - anton
    --
    M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
    comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
    New standard: https://forth-standard.org/
    EuroForth 2025 proceedings: http://www.euroforth.org/ef25/papers/
    --- Synchronet 3.22a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.lang.forth on Fri May 1 10:36:57 2026
    From Newsgroup: comp.lang.forth

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Currently there are 60 instances of ?DUP-IF and one instance of
    ?DUP-0=-IF in Gforth. It may be a good idea to replace them with DUP
    IF ... THEN DROP. Another day maybe.

    Actually the same day. I replaced 25 (out of 60) instances of ?DUP-IF
    and the only occurrence of ?DUP-0=-IF with DUP IF and an additional
    DROP or equivalent elsewhere; these were the cases that have an ELSE
    branch or the code between IF and THEN ends in EXIT. In the other
    cases I left the ?DUP-IF alone.
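
    The two shapes of rewrite described above could look roughly like
    this. This is a sketch on made-up words, not code from the Gforth
    sources; SCALE and FIRST-NONZERO are hypothetical names.

```forth
\ Shape 1: there is an ELSE branch, so the DROP goes at its start.
\ before:  ?dup-if  10 *  else  -1  then
: scale ( n -- n' )
  dup if  10 *  else  drop -1  then ;

\ Shape 2: the code between IF and THEN ends in EXIT,
\ so the DROP goes right after THEN.
\ before:  ?dup-if  nip exit  then
: first-nonzero ( n1 n2 -- n )
  dup if  nip exit  then  drop ;

3 scale .              \ prints 30
0 scale .              \ prints -1
4 9 first-nonzero .    \ prints 9
4 0 first-nonzero .    \ prints 4
```

    In both shapes the zero that ?DUP-IF would have consumed is removed by
    an explicit DROP on the path where the test failed.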

    The cases that were changed included some that are used frequently, in
    particular THROW.

    As a result, the native-code size on AMD64 was reduced by 497 bytes,
    i.e., by 19 bytes per replacement on average.

    The threaded-code size increased by 47 cells (376 bytes), i.e., 1.8
    cells per replacement. That was to be expected given the additional
    primitives involved.

    - anton
  • From Hans Bezemer@the.beez.speaks@gmail.com to comp.lang.forth on Fri May 1 13:05:13 2026
    From Newsgroup: comp.lang.forth

    Frankly - I don't implement ?DUP. For the same reason I detest >FLOAT.
    Words should *ALWAYS* have one - and only one - stack diagram. It's
    already hard enough to follow the stack flow - let alone if you
    introduce conditional stack diagrams.

    I love speed as much as the next guy - unless it destroys the clarity of
    the code.

    Consequently, I always use DUP IF .. ELSE DROP .. THEN, being assured
    that the same stack diagram is provided by both branches.
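
    To make the "conditional stack diagram" concrete, here is a small
    sketch. MY-?DUP and ITEMS-LEFT are hypothetical names; the body of
    MY-?DUP is the usual high-level definition of ?DUP, not 4tH's
    library code.

```forth
\ MY-?DUP stands in for ?DUP; note the two alternatives in its diagram.
: my-?dup ( x -- 0 | x x )  dup if dup then ;

\ ITEMS-LEFT (made-up helper) counts how many cells MY-?DUP leaves
\ in place of its argument, then removes them again.
: items-left ( x -- u )
  depth >r  my-?dup  depth r> - 1+   \ u = cells left behind
  dup >r  0 ?do drop loop  r> ;      \ clean up, return the count

0 items-left .    \ prints 1
7 items-left .    \ prints 2
```

    The stack depth after MY-?DUP depends on the run-time value, which is
    exactly what DUP IF .. ELSE DROP .. THEN avoids.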

    In the roughly 400 example programs in the 4tH repository, ?DUP appears
    exactly 4 times - in programs I ported from Forth as verbatim as I could.

    In the about 650 libraries of 4tH, ?DUP appears exactly *ONE* time - in
    the library where it is defined. And it shares that space with PICK and
    ROLL, a few other notorious citizens of NO-NO land.

    Personally, I see where you're coming from with ?DUP-TEST - I just don't
    recognize the pattern, but I do like that the stack diagram remains the
    same.

    However, I do enjoy that compilers seem to agree with the practice I've
    exercised for roughly the last 30 years ;-)

    Hans Bezemer