I have had so much success in adjusting Concertina II to achieve my goals more fully than I had thought possible... that I now think it may be possible to proceed from Concertina II to a design which gets rid of the one feature of Concertina II that has been the most controversial.
Yes, I think that I could actually do without block structure.
What would Concertina III look like?
Well, the basic instruction set would be similar to that of Concertina II. But the P bits would be taken out of the operate instructions, and so would the option of replacing a register specification by a pseudo-immediate pointer.
The tiny gaps between the opcodes of some instructions to squeeze out
space for block headers would be removed.
But the large opcode spaces formerly reserved for the shortest block header prefixes are what would be used to do without headers.
Instead of a block header being used to indicate code consisting of variable-length instructions, variable-length instructions would be
contained within a sequence of pairs of 32-bit instructions of this form:
11110xx(17 bits)(8 bits)
11111x(9 bits)(17 bits)
Instructions could be 17 bits long, 34 bits long, 51 bits long, and so on, any multiple of 17 bits in length.
In the first instruction slot of the pair, the two bits xx indicate, for
the two 17-bit regions of the variable-length instruction area that start
in it, whether they are the first 17-bit area of an instruction. The second instruction slot only contains the start of one 17-bit area, so only one
bit x is needed. Since 17 is an odd number, this meshes perfectly with the fact that the 17-bit area which straddles both words isn't split evenly,
but rather one extra bit of it is in the second 32-bit instruction slot.
I had been hoping to use 18-bit areas instead, but after re-checking my calculations, I found there just wasn't enough opcode space.
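To make the packing concrete, here is a rough C sketch of pulling one such pair apart. The exact bit positions are only an assumption for the example (fields taken MSB-first, in the order written above), not something I am specifying:

#include <stdint.h>

typedef struct {
    uint32_t unit[3];   /* the three 17-bit areas carried by the pair  */
    int      starts[3]; /* 1 if that area begins an instruction        */
} pair17;

static pair17 decode_pair(uint32_t a, uint32_t b)
{
    pair17 p;
    /* word A: 11110 | xx | 17 bits | 8 bits   (high bits to low bits) */
    /* word B: 11111 | x  | 9 bits  | 17 bits                          */
    p.unit[0] = (a >> 8) & 0x1FFFF;                      /* first area */
    p.unit[1] = ((a & 0xFF) << 9) | ((b >> 17) & 0x1FF); /* straddles  */
    p.unit[2] = b & 0x1FFFF;                             /* last area  */
    p.starts[0] = (a >> 26) & 1;  /* the two xx bits of the first slot */
    p.starts[1] = (a >> 25) & 1;
    p.starts[2] = (b >> 26) & 1;  /* the single x bit of the second    */
    return p;
}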
Long instructions that contain immediates would not be part of variable-length instruction code. Instead, their lengths would be multiples of 32 bits, making them part of ordinary code with 32-bit instructions.
Their form would be like this:
32-bit immediate:
1010(12 bits)(16 bits)
10111(11 bits)(16 bits)
where the first parenthesized area belongs to the instruction, and the
second to the immediate.
48-bit immediate:
1010(12 bits)(16 bits)
10110(11 bits)(16 bits)
10111(11 bits)(16 bits)
64-bit immediate:
1010(12 bits)(16 bits)
10110(3 bits)(24 bits)
10111(3 bits)(24 bits)
Since the instruction, exclusive of the immediate, really only needs 12
bits - a 7-bit opcode and a 5-bit destination register - in each case there's enough additional space for the instruction to begin with a few bits that indicate its length, so that decoding is simple.
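As a sketch of how little reassembly is involved, assuming (purely for the example) that the earlier words hold the less-significant 16-bit pieces:

#include <stdint.h>

static uint32_t imm32_pieces(uint32_t w0, uint32_t w1)
{
    /* 1010(12)(16) followed by 10111(11)(16) */
    return (w0 & 0xFFFFu) | ((w1 & 0xFFFFu) << 16);
}

static uint64_t imm48_pieces(uint32_t w0, uint32_t w1, uint32_t w2)
{
    /* 1010(12)(16), 10110(11)(16), 10111(11)(16) */
    return  (uint64_t)(w0 & 0xFFFFu)
         | ((uint64_t)(w1 & 0xFFFFu) << 16)
         | ((uint64_t)(w2 & 0xFFFFu) << 32);
}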
The scheme is not really space-efficient.
But the question that I really have is... is this really any better than having block headers? Or is it just as bad, just as complicated?
How about, say, 16/32/48/64/96:
xxxx-xxxx-xxxx-xxx0 //16 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix
Already elaborate enough...
On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:
How about, say, 16/32/48/64/96:
xxxx-xxxx-xxxx-xxx0 //16 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix
Already elaborate enough...
Thank you for your interesting suggestions.
I'm envisaging Concertina III as closely based on Concertina II, with only minimal changes.
Like Concertina II, it is to meet the overriding condition that
instructions do not have to be decoded sequentially. This means that
whenever an instruction, or group of instructions, spans more than 32
bits, the 32 bit areas of the instruction, other than the first, must
begin with a combination of bits that says "don't decode me".
The first 32 bits of an instruction get decoded directly, and then trigger and control the decoding of the rest of the instruction.
This has the consequence that any immediate value that is 32 bits or more
in length has to be split up into smaller pieces; this is what I really
don't like about giving up the block structure.
John Savard
On 9/2/2025 4:15 AM, John Savard wrote:
On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:
How about, say, 16/32/48/64/96:
xxxx-xxxx-xxxx-xxx0 //16 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix
Already elaborate enough...
Thank you for your interesting suggestions.
I'm envisaging Concertina III as closely based on Concertina II, with
only
minimal changes.
Like Concertina II, it is to meet the overriding condition that
instructions do not have to be decoded sequentially. This means that
whenever an instruction, or group of instructions, spans more than 32
bits, the 32 bit areas of the instruction, other than the first, must
begin with a combination of bits that says "don't decode me".
The first 32 bits of an instruction get decoded directly, and then
trigger
and control the decoding of the rest of the instruction.
This has the consequence that any immediate value that is 32 bits or more
in length has to be split up into smaller pieces; this is what I really
don't like about giving up the block structure.
Note that tagging like that described does still allow some amount of parallel decoding, since we still have combinatorial logic. Granted, scalability is an issue.
As can be noted, my use of jumbo-prefixes for large immediate values
does have the property of allowing reusing 32-bit decoders for 64-bit
and 96-bit instructions. In most cases, the 64-bit and 96-bit encodings don't change the instruction being decoded, but merely extend it.
Some internal plumbing is needed to stitch the immediate values together though, typically:
We have OpA, OpB, OpC
DecC gets OpC, and JBits from OpB
DecB gets OpB, and JBits from OpA
DecA gets OpA, and 0 for JBits.
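Roughly like this, as a C-ish model (the prefix test and the JBits field here are placeholders, not the actual encodings):

#include <stdint.h>
#include <stdbool.h>

/* Placeholder predicates; the real bit patterns are ISA-specific. */
static bool     is_jumbo_prefix(uint32_t op) { return (op >> 28) == 0xF; }
static uint32_t jumbo_bits(uint32_t op)      { return op & 0x00FFFFFF; }

#define NLANES 3
typedef struct { uint32_t op, jbits; bool nop; } lane_t;

static void route(const uint32_t w[NLANES], lane_t lane[NLANES])
{
    for (int i = 0; i < NLANES; i++) {
        lane[i].op    = w[i];
        /* first op in program order gets 0 for JBits, later ones get
           the bits carried by the word just before them */
        lane[i].jbits = (i == 0) ? 0 : jumbo_bits(w[i - 1]);
        /* a lane holding solely a prefix issues as a NOP */
        lane[i].nop   = is_jumbo_prefix(w[i]);
    }
}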
In my CPU core, I had a few times considered changing how decoding
worked, to either reverse or right-align the instruction block to reduce
the amount of MUX'ing needed in the decoder. If going for right-alignment,
then DecC would always go to Lane1, DecB to Lane2, and DecA
to Lane3.
Can note that for immediate-handling, the Lane1 decoder produces the low
33 bits of the result. If a decoder has a jumbo prefix and is itself
given a jumbo-prefix, it assumes a 96 bit encoding and produces the
value for the high 32 bits.
At least in my designs, I only account for 33 bits of immediate per
lane. Instead, when a full 64-bit immediate is encoded, its value is assembled in the ID2/RF stage.
Though, admittedly my CPU core design did fall back to sequential
execution for 16-bit ops, but this was partly for cost reasons.
For BJX2/XG1 originally, it was because the instructions couldn't use
WEX tagging, but after adding superscalar it was because I would either
need multiple parallel 16-bit decoders, or to change how 16 bit ops were handled (likely using a 16->32 repacker).
So, say:
IF stage:
Retrieve instruction from Cache Line;
Determine fetch length:
XG1/XG2 used explicit tagging;
XG3 and RV use SuperScalar checks.
Run repackers.
Currently both XG3 and RISC-V 48-bit ops are handled by repacking.
Decode Stage:
Decode N parallel 32-bit ops;
Prefixes route to the corresponding instructions;
Any lane holding solely a prefix goes NOP.
For a repacker, it would help if there were fairly direct mappings
between the 16-bit and 32-bit ops. Contrary to claims, RVC does not
appear to fit such a pattern. Personally, I don't have much good to say
about RVC's encoding scheme; it is very much ad-hoc dog chew.
The usual claim is more that it is "compressed" in that you can first generate a 32-bit op internally and "squish" it down into a 16-bit form
if it fits. This isn't terribly novel as I see it. Repacking RVC has
similar problems to decoding it directly, namely that a fair number of
instructions each have their own idiosyncratic
encoding scheme (and you can't simply shuffle some of the bits
around and fill others with 0s and similar to arrive back at a valid RVI instruction).
By contrast, XG3 is mostly XG2 with the bits shuffled around, though
there were some special cases made in the decoding rules. Admittedly,
I did do more than the bare minimum here (to fit it into the
same encoding space as RV), mostly as I ended up going for a "Dog Chew Reduction" route rather than merely doing the bare minimum bit-shuffling needed to make it fit.
For better or worse, it effectively made XG3 its own ISA as far as BGBCC
is concerned. Even if in theory I could have used repacking, the
original XG1/XG2 emitter logic is a total mess. It was written
originally for fixed-length 16-bit ops, so it encodes and outputs
instructions 16 bits at a time using big "switch()" blocks (the
RISC-V and XG3 emitters also went this path; as far as BGBCC is
concerned, XG3 is part of RISC-V land).
Both the CPU core and also JX2VM handle it by repacking to XG2 though.
For the XG3VM (userland only emulator for now), it instead decodes XG3 directly, with decoders for XG3, RVI, and RVC.
Had noted the relative irony that despite XG3 having a longer
instruction listing (than RVI) it still ends up with a slightly shorter decoder.
Some of this has to do with one big annoyance of RISC-V's encoding
scheme: its inconsistent and dog-chewed handling of immediate and displacement values.
Though, for mixed-output, there are still a handful of cases where RVI encodings can beat XG3 encodings, mostly involving cases where the RVI encodings have a slightly larger displacement.
In compiler stats, this seems to mostly affect:
LB, LBU, LW, LWU
SB, SW
ADDI, ADDIW, LUI
The former:
Well, unscaled 12-bit beats scaled 10-bit for 8 and 16-bit load/store.
ADDI: 12b > 10b
LUI: because loading a 32-bit value of the form XXXXX000 does happen sometimes it seems.
Instruction counts are low enough that a "pure XG3" would likely result
in Doom being around 1K larger (the cases where RVI ops are used would
need a 64-bit jumbo-encoding in XG3).
Though, the relative wonk of handling ADD in XG1/XG2/XG3 by using
separate Imm10u/Imm10n encodings, rather than an Imm10s, does have merit
in that this effectively gives it an Imm11s encoding; and ADD is one of
the main instructions that tends to be big-immediate-heavy (and in early design it was a close race between ADD ImmU/ImmN, vs ADD/SUB ImmU, but
the current scheme has a tiny bit more range, albeit SUB-ImmU could have possibly avoided the need for an ImmN case).
So, say:
ADD: Large immediate heavy.
SUB: Can reduce to ADD in the immediate case.
AND: Preferable to have signed immediate values;
Not common enough to justify the ImmU/ImmN scheme.
OR: Almost exclusively positive;
XOR: Almost exclusively positive.
Load/Store displacements are very lopsided in the positive direction.
Disp10s slightly beats Disp10u though.
There are more negative displacements than displacements in the 512..1023 range.
XG1 had sorta hacked around it by:
Disp9u, Disp5n
Disp10u, Disp6n was considered, but didn't go that way.
Disp10s was at least "slightly less ugly",
even if Disp10u+Disp6n would arguably have been better
for code density.
Or, cough, someone could maybe do signed load/store displacements like:
000..110: Positive
111: Negative
So, Disp10as is 0..1791, -256..-1
Would better fit the statistical distribution, but... Errm...
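Decode sketch for that kind of asymmetric field; note that the 0..1791 / -256..-1 range above corresponds to an 11-bit field, a 10-bit field would give 0..895 and -128..-1:

#include <stdint.h>

static int32_t decode_asym(uint32_t raw, int width)
{
    uint32_t mask = ((uint32_t)1 << width) - 1;
    uint32_t top3 = (raw >> (width - 3)) & 7;
    raw &= mask;
    return (top3 == 7) ? (int32_t)raw - (int32_t)(mask + 1) /* negative 1/8 */
                       : (int32_t)raw;                      /* positive 7/8 */
}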
...
John Savard
On Sun, 31 Aug 2025 13:12:52 -0500, BGB wrote:
How about, say, 16/32/48/64/96:
xxxx-xxxx-xxxx-xxx0 //16 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxyy-yyy1 //32 bit
xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xx11-1111 //64/48/96 bit prefix
Already elaborate enough...
Thank you for your interesting suggestions.
I'm envisaging Concertina III as closely based on Concertina II, with only minimal changes.
Like Concertina II, it is to meet the overriding condition that
instructions do not have to be decoded sequentially. This means that whenever an instruction, or group of instructions, spans more than 32
bits, the 32 bit areas of the instruction, other than the first, must
begin with a combination of bits that says "don't decode me".
The first 32 bits of an instruction get decoded directly, and then trigger and control the decoding of the rest of the instruction.
This has the consequence that any immediate value that is 32 bits or more
in length has to be split up into smaller pieces; this is what I really don't like about giving up the block structure.
John Savard
Lest one thinks this results in serial decoding, consider that the
pattern decoder is 40 gates (just larger than 3-flip-flops) so one can
afford to put this pattern decoder on every word in the inst-buffer.
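In C terms, the per-word check is something like the following; the prefix values are just the ones from the long-immediate forms quoted earlier in the thread, anything else is treated as a normal instruction start:

#include <stdint.h>

enum wclass { W_START, W_CONTINUATION };

static enum wclass classify(uint32_t w)
{
    uint32_t top5 = w >> 27;
    if (top5 == 0x16 || top5 == 0x17)   /* 10110 / 10111 continuation */
        return W_CONTINUATION;          /* words of a long immediate  */
    return W_START;
}

/* replicated per buffered word; in hardware these all run in parallel */
static void classify_buffer(const uint32_t *buf, enum wclass *cls, int n)
{
    for (int i = 0; i < n; i++)
        cls[i] = classify(buf[i]);
}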
On Tue, 02 Sep 2025 18:40:16 +0000, MitchAlsup wrote:
Lest one thinks this results in serial decoding, consider that the
pattern decoder is 40 gates (just larger than 3-flip-flops) so one can
afford to put this pattern decoder on every word in the inst-buffer.
Yes, given sufficiently simple decoding, one could allow backtracking when the second word of an instruction is decoded as if it was the first.
Of course, though, it wastes electricity and produces heat, but a
negligible amount, I agree.
I'm designing my ISA, though, to make it simple to implement... in one specific sense. It's horribly large and complicated, but at least it
doesn't demand that implementors understand any fancy tricks.
John Savard
However, I also found that STs need an immediate and a displacement, so, Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
But it also doesn't need two immediates.
A 16-bit integer or float and a 16-bit offset packed into a
single 32-bit immediate would suffice for most purposes.
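For example, packed and unpacked like this (the field order is arbitrary):

#include <stdint.h>

static uint32_t pack_cb(int16_t cmpval, int16_t disp)
{
    return (uint32_t)(uint16_t)cmpval | ((uint32_t)(uint16_t)disp << 16);
}

static void unpack_cb(uint32_t imm, int16_t *cmpval, int16_t *disp)
{
    *cmpval = (int16_t)(imm & 0xFFFFu);   /* casts recover the signs */
    *disp   = (int16_t)(imm >> 16);
}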
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can affect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest are unconditional or compare against 0 (which can use the Zero Register). Only a relative minority are compares against non-zero constants.
One could argue:
This is high enough to care, but is it cheap enough?...
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement,
so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
On 9/4/2025 10:19 AM, EricP wrote:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a
displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
OK, but does this tell you how many of the CMPs are to a value of zero?
I expect these to be a significant enough percentage to skew your analysis.
Stephen Fuld wrote:
On 9/4/2025 10:19 AM, EricP wrote:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a
displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution,
1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
OK, but does this tell you how many of the CMPs are to a value of
zero? I expect these to be a significant enough percentage to skew
your analysis.
Looking at
Measurement and Analysis of Instruction Use in VAX 780, 1982
VAX had a TST instruction which was the same as CMP src,#0.
TST has < 2% usage while CMP 10-12%.
On 9/4/2025 12:06 PM, EricP wrote:
Stephen Fuld wrote:
On 9/4/2025 10:19 AM, EricP wrote:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a
displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution,
1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
OK, but does this tell you how many of the CMPs are to a value of
zero? I expect these to be a significant enough percentage to skew
your analysis.
Looking at
Measurement and Analysis of Instruction Use in VAX 780, 1982
VAX had a TST instruction which was the same as CMP src,#0.
TST has < 2% usage while CMP 10-12%.
Thanks. That's interesting. So perhaps ~15% of all compares are to
zero. I would have expected higher.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 9/4/2025 12:06 PM, EricP wrote:
Stephen Fuld wrote:
On 9/4/2025 10:19 AM, EricP wrote:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a
displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
OK, but does this tell you how many of the CMPs are to a value of
zero? I expect these to be a significant enough percentage to skew
your analysis.
Looking at
Measurement and Analysis of Instruction Use in VAX 780, 1982
VAX had a TST instruction which was the same as CMP src,#0.
TST has < 2% usage while CMP 10-12%.
Thanks. That's interesting. So perhaps ~15% of all compares are to
zero. I would have expected higher.
Kinda depends on the compilers used for the workload. I suspect
that those workloads were mostly COBOL and FORTRAN and maybe BLISS-32
or MACRO-32.
Without a heavy C or C++ workload, the need to check for NULL
pointer is rare.
Unlike Unix, the successful return status from VMS system and library calls
was SS$_NORMAL, which had the value 1 rather than zero; that
would probably also reduce the uses of TST to check for zero.
111010(10 bits)(16 bits)
111011(2 bits)(24 bits)
111011(2 bits)(24 bits)
Fitting a 64-bit immediate into three words (rather than four) is also
still doable. It takes 1/4 of the available opcode space - but that's
OK, because nothing else has a similar problem, not 48-bit immediates,
and not 128-bit immediates.
The only thing I do lose is being able to also have, as I had only very recently introduced to Concertina II, the use of the 64-bit immediate structure to have memory-reference instructions with 64-bit absolute addresses.
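Assembling the 64-bit value from the three words would then look something like this; which piece is the low end is an arbitrary choice for the example:

#include <stdint.h>

static uint64_t imm64_3w(uint32_t w0, uint32_t w1, uint32_t w2)
{
    uint64_t lo  = w0 & 0xFFFFu;     /* 111010 (10 bits)(16 bits)     */
    uint64_t mid = w1 & 0xFFFFFFu;   /* 111011 (2 bits)(24 bits)      */
    uint64_t hi  = w2 & 0xFFFFFFu;   /* 111011 (2 bits)(24 bits)      */
    return lo | (mid << 16) | (hi << 40);  /* 16 + 24 + 24 = 64 bits  */
}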
On 9/4/2025 3:20 PM, Stephen Fuld wrote:
On 9/4/2025 12:06 PM, EricP wrote:
Stephen Fuld wrote:
On 9/4/2025 10:19 AM, EricP wrote:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a
displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional
or comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler
Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
OK, but does this tell you how many of the CMPs are to a value of
zero? I expect these to be a significant enough percentage to skew
your analysis.
Looking at
Measurement and Analysis of Instruction Use in VAX 780, 1982
VAX had a TST instruction which was the same as CMP src,#0.
TST has < 2% usage while CMP 10-12%.
Thanks. That's interesting. So perhaps ~15% of all compares are to
zero. I would have expected higher.
Looking at some stats generated by my compiler (for branches):
61% of branches are unconditional
15% are comparing to 0
13% are comparing two registers
11% are comparing to some other non-zero constant.
Stephen Fuld wrote:
On 9/4/2025 12:06 PM, EricP wrote:
Stephen Fuld wrote:
On 9/4/2025 10:19 AM, EricP wrote:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a
displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional
or comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler
Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
OK, but does this tell you how many of the CMPs are to a value of
zero? I expect these to be a significant enough percentage to skew
your analysis.
Looking at
Measurement and Analysis of Instruction Use in VAX 780, 1982
VAX had a TST instruction which was the same as CMP src,#0.
TST has < 2% usage while CMP 10-12%.
Thanks. That's interesting. So perhaps ~15% of all compares are to
zero. I would have expected higher.
Oh no I didn't mean that. I meant that a compiler that wanted
to compare with zero would use a TST instruction, not a CMP.
That could be used with any branch: GT, GE, LE, LT, EQ, NE.
And that is < 2%.
It would use CMP to compare with a number other than 0, which was 10-12%.
On 9/4/2025 7:53 PM, BGB wrote:
On 9/4/2025 3:20 PM, Stephen Fuld wrote:
On 9/4/2025 12:06 PM, EricP wrote:
Stephen Fuld wrote:
On 9/4/2025 10:19 AM, EricP wrote:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a
displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has
the size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional
or comparing against 0 (which can use the Zero Register). Only a
relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
OK, but does this tell you how many of the CMPs are to a value of
zero? I expect these to be a significant enough percentage to skew
your analysis.
Looking at
Measurement and Analysis of Instruction Use in VAX 780, 1982
VAX had a TST instruction which was the same as CMP src,#0.
TST has < 2% usage while CMP 10-12%.
Thanks. That's interesting. So perhaps ~15% of all compares are to
zero. I would have expected higher.
Looking at some stats generated by my compiler (for branches):
61% of branches are unconditional
15% are comparing to 0
13% are comparing two registers
11% are comparing to some other non-zero constant.
So ~39% of branches are conditional, and 15% compare to zero. So
(15/39) ~38% of conditional branches are comparing to zero. That is
more in line with what I had expected.
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer as even today compilers don't know about those side effect flags and will always emit a CMP or TST first.
EricP <ThatWouldBeTelling@thevillage.com> writes:
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer as even today compilers don't know about those side effect flags and will always emit a CMP or TST first.
Compilers certainly have problems with single flag registers, as they
run contrary to the base assumption of register allocation. But you
don't need full-blown tracking of flags in order to make use of flags
side effects in compilers. Plain peephole optimization can be good
enough. E.g., if you have
if (a+b<0) ...
the compiler may naively translate this to
add tmp = a, b
tst tmp
bge cont
The peephole optimizer can have a rule that says that this is
equivalent to
add tmp = a, b
bge cont
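A toy version of such a rule, over a made-up three-address IR, and restricted (conservatively) to branches that read only N or Z, like the js in the gcc output below:

#include <string.h>

typedef struct { char op[8]; char dst[8]; } ins_t;

static int sets_nz(const ins_t *i)      /* ALU ops assumed to set N/Z */
{
    return !strcmp(i->op, "add") || !strcmp(i->op, "sub") ||
           !strcmp(i->op, "and") || !strcmp(i->op, "or");
}

static int reads_only_nz(const ins_t *i)
{
    return !strcmp(i->op, "beq") || !strcmp(i->op, "bne") ||
           !strcmp(i->op, "bmi") || !strcmp(i->op, "bpl");
}

/* If code[i] is a tst made redundant by the preceding ALU op, drop it.
   Returns the new instruction count. */
static int drop_tst(ins_t *code, int n, int i)
{
    if (i > 0 && i + 1 < n &&
        !strcmp(code[i].op, "tst") &&
        sets_nz(&code[i - 1]) &&
        !strcmp(code[i - 1].dst, code[i].dst) &&
        reads_only_nz(&code[i + 1])) {
        memmove(&code[i], &code[i + 1], (size_t)(n - i - 1) * sizeof code[0]);
        return n - 1;
    }
    return n;
}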
When I compile
long foo(long a, long b)
{
if (a+b<0)
return a-b;
else
return a*b;
}
with gcc-12.2.0 -O -c on AMD64, I get
0000000000000000 <foo>:
0: 48 89 f8 mov %rdi,%rax
3: 48 89 fa mov %rdi,%rdx
6: 48 01 f2 add %rsi,%rdx
9: 78 05 js 10 <foo+0x10>
b: 48 0f af c6 imul %rsi,%rax
f: c3 ret
10: 48 29 f0 sub %rsi,%rax
13: c3 ret
Look, Ma, no tst.
- anton
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Compilers certainly have problems with single flag registers, as they
run contrary to the base assumption of register allocation. But you
don't need full-blown tracking of flags in order to make use of flags
side effects in compilers. Plain peephole optimization can be good
enough. E.g., if you have
if (a+b<0) ...
the compiler may naively translate this to
add tmp = a, b
tst tmp
bge cont
The peephole optimizer can have a rule that says that this is
equivalent to
add tmp = a, b
bge cont
When I compile
long foo(long a, long b)
{
if (a+b<0)
return a-b;
else
return a*b;
}
with gcc-12.2.0 -O -c on AMD64, I get
0000000000000000 <foo>:
0: 48 89 f8 mov %rdi,%rax
3: 48 89 fa mov %rdi,%rdx
6: 48 01 f2 add %rsi,%rdx
9: 78 05 js 10 <foo+0x10>
b: 48 0f af c6 imul %rsi,%rax
f: c3 ret
10: 48 29 f0 sub %rsi,%rax
13: c3 ret
Look, Ma, no tst.
- anton
This could be 1 MOV shorter.
It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
Just ADD %rsi,%rdi and after that use the %rax copy.
For that optimization { ADD CMP Bcc } => { ADD Bcc }
to work those three instructions must be adjacent.
In this case it wouldn't make a difference but in general
I think they would want the freedom to move code about and not have
the ADD bound to the Bcc too early so this would have to be about
the very last optimization so it didn't interfere with code motion.
The Microsoft compiler uses LEA to do the add which doesn't change flags
so even if it has a flags optimization it would not detect it:
long foo(long,long) PROC ; foo, COMDAT
lea eax, DWORD PTR [rcx+rdx]
test eax, eax
jns SHORT $LN2@foo
sub ecx, edx
mov eax, ecx
ret 0
$LN2@foo:
imul ecx, edx
mov eax, ecx
ret 0
Also if MS had moved ecx to eax first as GCC does then it could have
the function result land in eax and eliminate the final two MOV eax,ecx.
Anton Ertl wrote:...
When I compile
long foo(long a, long b)
{
if (a+b<0)
return a-b;
else
return a*b;
}
with gcc-12.2.0 -O -c on AMD64, I get
0000000000000000 <foo>:
0: 48 89 f8 mov %rdi,%rax
3: 48 89 fa mov %rdi,%rdx
6: 48 01 f2 add %rsi,%rdx
9: 78 05 js 10 <foo+0x10>
b: 48 0f af c6 imul %rsi,%rax
f: c3 ret
10: 48 29 f0 sub %rsi,%rax
13: c3 ret
This could be 1 MOV shorter.
It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
Just ADD %rsi,%rdi and after that use the %rax copy.
For that optimization { ADD CMP Bcc } => { ADD Bcc }
to work those three instructions must be adjacent.
In this case it wouldn't make a difference but in general
I think they would want the freedom to move code about and not have
the ADD bound to the Bcc too early so this would have to be about
the very last optimization so it didn't interfere with code motion.
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement,
so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer as even today compilers don't know about those side effect flags and will always emit a CMP or TST first.
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or comparing against 0 (which can use the Zero Register). Only a relative minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer as even today compilers don't know about those side effect flags and will always emit a CMP or TST first. Possibly those VAX Bcc that used ALU side-effect flags were in assembler code.
One could argue:
This is high enough to care,
but is it cheap enough?...
The instruction fetch buffer has to be larger as the worst case size
just got larger. And there are more format variations so the Parser
gets more complex. And Decode has to pick apart the two immediates
and place them in different fields so more muxes.
Each front end uOp lane would have two immediate fields, one for an
integer or float data value up to 8 bytes, one for up to 8 byte offset.
Then at Dispatch (hand-off to the back end) muxes to route each
immediate onto the FU operand bus.
The difference comes in the back end Reservation Stations.
If they are valued RS then the immediates are held just like
any other operand values that were ready at time of Dispatch.
The number of operands doesn't change so no extra cost here.
But if they are valueless RS then it has no place to hold those
immediates so it needs some place to stash them until the uOp launches.
In that case it might be better if Decode took all the immediates,
stashed them in a circular buffer, and just passed the indexes in the uOp.
Then at launch the FU would pull in the immediates
just like it pulls in the register operand values.
This gets rid of the extra front end costs.
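Sketch of that arrangement (all the sizes are arbitrary here):

#include <stdint.h>

#define IMMQ_SIZE 64                      /* power of two, arbitrary   */

typedef struct {
    uint64_t val[IMMQ_SIZE];
    uint32_t head;                        /* allocation cursor         */
} immq_t;

static uint32_t immq_push(immq_t *q, uint64_t imm)   /* done in Decode */
{
    uint32_t idx = q->head++ & (IMMQ_SIZE - 1);
    q->val[idx] = imm;
    return idx;                           /* index travels in the uOp  */
}

static uint64_t immq_read(const immq_t *q, uint32_t idx)  /* at launch */
{
    return q->val[idx & (IMMQ_SIZE - 1)];
}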
EricP <ThatWouldBeTelling@thevillage.com> posted:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
About 25% = (12%-9%)/12% of the conditional branches use ALU CCs (the Bcc not preceded by a CMP).
I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX that Bcc using ALU side effect flags were assembler.
VAX had "more regular" settings of ALU CCs than typical CISCs.
This regularity made it easier for the compiler to track.
On the other hand:: a gain of 25%*12% = 3% would not have allowed CCs
to "make the cut" for RISC ISA designs.
One could argue:
This is high enough to care,
borderline
but is it cheap enough?...
not for me as it causes RoB/RETIRE to do a lot more work.
It does require Forwarding to do more work;
It may also cause DECODE to do more work.
EricP <ThatWouldBeTelling@thevillage.com> posted:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
About 25% = (12%-9%)/12% use ALU CCs.
I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX that Bcc using ALU side effect flags were assembler.
VAX had "more regular" settings of ALU CCs than typical CISCs.
This regularity made it easier for the compiler to track.
An interesting note in the aforementioned analysis is why
the call instruction was so expensive in time - the 780 cache
was write-through, so the multiple stores would be limited
to DRAM speeds.
scott@slp53.sl.home (Scott Lurndal) writes:
An interesting note in the aforementioned analysis is why
the call instruction was so expensive in time - the 780 cache
was write-through, so the multiple stores would be limited
to DRAM speeds.
But do you need fewer stores if you use simpler instructions? Did the
C compiler that used BSR etc. to implement a call store less? How so?
Also, the DRAM speed is three cycles.
CALL/RET took an average 45
cycles.
RET does not store. So if most of the cost is storing and
loading, and, say, each instruction has 10 cycles overhead (which
would already be a lot), that's 90 cycles for a call and a ret, and 70
cycles of that for n stores and n loads. With stores taking 3 cycles
and loads taking 1 (the stack stuff is usually in the cache),
n=17.5. But VAX has only 16 registers (including PC), and not every
one of them is saved on every call. So there were additional
overheads.
With good support for making full use of the cache read bandwidth, the loading part could be sped up to two loads per cycle. But I expect
that the VAX 11/780 did not do that.
- anton
EricP wrote:
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
I know I have seen lots of examples of x86 compilers which used side
effect flags; they are pretty much the standard idiom for decrementing
loops or incrementing from negative start. The latter case is a common optimization which allows you to use the same register as the source index/indices and the destination index, along with the loop counter
itself.
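In C terms the incrementing-from-negative-start version of a loop looks like the following; the increment of the index doubles as the loop test, so no separate CMP/TST is needed (this is my illustration, not any particular compiler's output):

#include <stddef.h>

static void scale_copy(long *dst, const long *src, size_t n, long k)
{
    const long *send = src + n;        /* point at the array ends      */
    long       *dend = dst + n;
    ptrdiff_t   i    = -(ptrdiff_t)n;  /* negative start index         */
    while (i != 0) {                   /* maps to an inc/jnz pair      */
        dend[i] = send[i] * k;
        i++;
    }
}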
Terje
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
EricP <ThatWouldBeTelling@thevillage.com> posted:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
About 25% = (12%-9%)/12% use ALU CCs.
I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX that Bcc using ALU side effect flags were assembler.
VAX had "more regular" settings of ALU CCs than typical CISCs.
This regularity made it easier for the compiler to track.
It also had AOB<cc> and SOB<cc> which combined the branch
with the increment/decrement operation.
100$: MOVAL ERTAB,R1 ;ADDRESS OF TABLE
CLRL R2 ;COUNT
101$: CMPL (R1)+[R2],4(R0) ;LOOK FOR A MATCH
BEQL 108$ ;BRANCH IF FOUND
AOBLEQ S^#ERNM,R2,101$ ;LOOP TILL DONE
105$: MOVZWL #SS$_RESIGNAL,R0 ;DON'T WANT TO KNOW IT
RET ;GIVE BACK TO SYSTEM
108$: MOVL (R1)[R2],R0 ;GET ADDRESS OF DESCRIPTOR
MOVQ (R0),R4 ;GET DESCRIPTOR
BRB PRERLN ;AND PRINT
;
(fragment from the VAX FOCAL interpreter).
An interesting note in the aforementioned analysis is why
the call instruction was so expensive in time - the 780 cache
was write-through, so the multiple stores would be limited
to DRAM speeds.
Scott Lurndal wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
EricP <ThatWouldBeTelling@thevillage.com> posted:
BGB wrote:
On 9/3/2025 9:42 PM, EricP wrote:
MitchAlsup wrote:
However, I also found that STs need an immediate and a displacement, so,
Major == 0b'001001 and minor == 0b'011xxx has 4 ST instructions with
potential displacement (from D12ds above) and the immediate has the
size of the ST. This provides for::
std #4607182418800017408,[r3,r2<<3,96]
Compare and Branch can also use two immediates as it
has reg-reg or reg-imm compares plus displacement.
And has high enough frequency to be worth considering.
Can be done, yes.
High enough frequency/etc, is where the possible debate lies.
Checking stats, it can effect roughly 1.9% of the instructions.
Or, around 11% of branches; most of the rest being unconditional or
comparing against 0 (which can use the Zero Register). Only a relative
minority being compares against non-zero constants.
The only instruction usage stats I have are from those VAX papers:
A Case Study of VAX-11 Instruction Set Usage For Compiler Execution, 1982
That shows about 12% instructions are conditional branch and 9% CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
About 25% = (12%-9%)/12% use ALU CCs.
I would expect those two numbers to be closer as even today compilers don't
know about those side effect flags and will always emit a CMP or TST first.
Possibly those VAX that Bcc using ALU side effect flags were assembler.
VAX had "more regular" settings of ALU CCs than typical CISCs.
This regularity made it easier for the compiler to track.
It also had AOB<cc> and SOB<cc> which combined the branch
with the increment/decrement operation.
100$: MOVAL ERTAB,R1 ;ADDRESS OF TABLE
CLRL R2 ;COUNT
101$: CMPL (R1)+[R2],4(R0) ;LOOK FOR A MATCH
BEQL 108$ ;BRANCH IF FOUND
AOBLEQ S^#ERNM,R2,101$ ;LOOP TILL DONE
105$: MOVZWL #SS$_RESIGNAL,R0 ;DON'T WANT TO KNOW IT
RET ;GIVE BACK TO SYSTEM
108$: MOVL (R1)[R2],R0 ;GET ADDRESS OF DESCRIPTOR
MOVQ (R0),R4 ;GET DESCRIPTOR
BRB PRERLN ;AND PRINT
;
(fragment from the VAX FOCAL interpreter).
An interesting note in the aforementioned analysis is why
the call instruction was so expensive in time - the 780 cache
was write-through, so the multiple stores would be limited
to DRAM speeds.
AOB Add-One-Branch, SOB Subtract-One-Branch, ACB Add-Compare-Branch,
could be nice single cycle, single write port, risc-ish instructions.
The problem is that the most optimal and frequent formats would
have two or three immediate values.
AOBcc count_Rsd, limit_Rs, offset_imm
AOBcc count_Rsd, limit_imm, offset_imm
SOBcc count_Rsd, limit_Rs, offset_imm
SOBcc count_Rsd, limit_imm, offset_imm
ACBcc count_Rsd, addend_Rs, limit_Rs, offset_imm
ACBcc count_Rsd, addend_Imm, limit_Rs, offset_imm
ACBcc count_Rsd, addend_Rs, limit_imm, offset_imm
ACBcc count_Rsd, addend_Imm, limit_imm, offset_imm
Merging the two 16-bit immediates into one 32-bit field would
suffice for most purposes. The last ACB packs three 16-bit immediates
into a 48-bit field.
If the addend or limit operands are constants but do not fit into the
16-bit field available then one must load the constant into a register.
If the branch offset doesn't fit into 16 bits then one cannot use these instructions for that loop and must use individual branch instructions.
But it would be pretty rare for a loop to cross more than 32k bytes/words.
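For reference, rough C semantics for one of those forms; this is my reading of an AOBcc with a <= test, and the 4-byte fall-through step is just a placeholder:

#include <stdint.h>

static int64_t aob_leq(int64_t *count, int64_t limit,
                       int64_t pc, int32_t offset)
{
    *count += 1;                              /* add one               */
    return (*count <= limit) ? pc + offset    /* branch taken          */
                             : pc + 4;        /* fall through          */
}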