A common function in the GBA community is to convert the directional pad buttons into a signed integer value for each axis.
If Left is pressed, a -1
is returned, if Right is pressed, a +1
is returned. If both, or neither, are pressed: 0
is returned.
This is often called a “tribool”, AKA a 3-state value.
When represented with -1
, 0
, +1
, the tribool can be directly added to a position value to move a character left or right. Very handy.
Assume our tribool is constructed from a bitmask, where bit 0 is +1
, and bit 1 is -1
.
Most programmers would probably write something like this as their first attempt:
int tribool(int mask) {
const int bit0 = mask & 1;
const int bit1 = (mask >> 1) & 1;
if (bit0 == bit1) return 0;
return bit0 ? 1 : -1;
}
Compiled with GCC -O2
the 32-bit ARM code is:
tribool:
asr r3, r0, #1
and r3, r3, #1
and r0, r0, #1
cmp r0, r3
beq .L3
cmp r0, #0
mvneq r0, #0
bx lr
.L3:
mov r0, #0
bx lr
The libtonc implementation looks like:
int tribool(int mask) {
const int bit0 = mask & 1;
const int bit1 = (mask >> 1) & 1;
return bit0 - bit1;
}
Compiled with GCC -Og
the 32-bit ARM code is:
tribool:
and r3, r0, #1
asr r0, r0, #1
and r0, r0, #1
sub r0, r3, r0
bx lr
This looks astonishingly optimised, however I suspected we could do better.
On the GBA the key indices for each axis on the dpad, and the shoulder buttons, are next to each other.
I noticed that simply sign-extending the upper bit will get us somewhat close to a tribool implementation:
int almost_tribool(int mask) {
return (mask << 30) >> 30;
}
For an 8-bit number, this sign extend produces the following states:
2-bit Mask | 8-bit Sign Extended | “Tribool” State |
---|---|---|
0b00 | 0b00000000 | 0 |
0b01 | 0b00000001 | 1 |
0b10 | 0b11111110 | -2 |
0b11 | 0b11111111 | -1 |
Unfortunately this does not respect the exclusivity requirement of our tribool function: When both bits are equal (0b00
or 0b11
), the function must always return 0
.
But this is close, and for original GBA hardware where it is physically impossible to hit both left and right buttons at the same time this almost works (as long as you only do sign comparisons or compare against zero).
For anyone interested the GCC -Og
32-bit ARM code is:
almost_tribool:
lsls r0, r0, #30
asrs r0, r0, #30
bx lr
A simple sign-extend.
My first instinct was to analyse the possible states, and figure out how to manually fix the -2
and -1
states.
I noticed that just adding 1
gets these into range, however this breaks the 0
and 1
cases:
2-bit Mask | 8-bit Sign Extended + 1 |
“Tribool” State |
---|---|---|
0b00 | 0b00000001 | 1 |
0b01 | 0b00000010 | 2 |
0b10 | 0b11111111 | -1 |
0b11 | 0b00000000 | 0 |
So we should only add 1
for the negative cases, conveniently the cases with a sign-bit set:
int tribool(int mask) {
mask = (mask << 30) >> 30;
if (mask >= 0) return mask;
return mask + 1;
}
Compiled with GCC -O1
the 32-bit ARM code is perfect:
tribool:
lsl r0, r0, #30
asrs r0, r0, #30
addmi r0, r0, #1
bx lr
The addmi
instruction is conditionally executed only when the mi
CPU flag (the minus flag) is set, which the preceeding asrs
instruction sets if the result or arithmetically shifting right is a negative/minus number.
The 16-bit thumb code with GCC -O1
is equally as good:
tribool:
lsls r3, r0, #30
asrs r0, r3, #30
lsrs r3, r3, #31
adds r0, r0, r3
bx lr
In this code, the second lsrs
converts the sign bit into the number 1
and adds that to the original value.
-Og
?
Notice that I showed the GCC -O1
assembly output for the fixed implementation of sign-extended tribool.
For GDB debugging on embedded systems, especially retro consoles like the GBA, you should really be using -Og
.
GCC -Og
32-bit ARM:
tribool:
lsl r0, r0, #30
asrs r0, r0, #30
bxpl lr
add r0, r0, #1
bx lr
GCC -Og
16-bit thumb:
tribool:
lsls r3, r0, #30
asrs r0, r3, #30
cmp r3, #0
bge .L1
adds r0, r0, #1
.L1:
bx lr
The ARM code isn’t too bad, with bxpl
returning if the result is positive, but the thumb code now has a full-fat branch.
We can further fix this by explicitly writing out the 16-bit thumb logic in C:
int tribool(int mask) {
mask = (mask << 30) >> 30;
int fix = (unsigned int) mask >> 31;
return mask + fix;
}
Compiled with GCC -Og
16-bit thumb:
tribool:
lsls r3, r0, #30
asrs r0, r3, #30
lsrs r3, r3, #31
adds r0, r0, r3
bx lr
Compiled with GCC -Og
32-bit ARM:
tribool:
lsl r0, r0, #30
lsr r3, r0, #31
add r0, r3, r0, asr #30
bx lr
This is the best we can get with 32-bit ARM, with 1 instruction fewer than Tonclib.
For 16-bit thumb, there is one more optimisation available.
There’s never enough registers on a CPU, so often a lot of time is spent juggling registers on the stack.
If we’re inlining these functions (which for sometihng this small, you probably should be) then optimising how many registers are used becomes important.
The problem is the temporary used:
int fix = (unsigned int) mask >> 31;
The fix
value depends on mask
, and our mask + fix
return value depends on both, so two registers will be used.
If we could remove fix
then the compiler will be able to operate solely on the mask
.
Looking back at the 8-bit sign extended binary:
2-bit Mask | 8-bit Sign Extended | “Tribool” State |
---|---|---|
0b00 | 0b00000000 | 0 |
0b01 | 0b00000001 | 1 |
0b10 | 0b11111110 | -2 |
0b11 | 0b11111111 | -1 |
The two correct states (0
and 1
) only differ by bit index 0 (the lowest bit), and the two incorrect states (-2
and -1
) differ by the same bit index 0.
If we were to shift right by 1 bit with sign extend then we can separate the correct and incorrect states into two values:
2-bit Mask | 8-bit Sign Extended | Shifted right 1 | Fix State |
---|---|---|---|
0b00 | 0b00000000 | 0b00000000 | 0 |
0b01 | 0b00000001 | 0b00000000 | 0 |
0b10 | 0b11111110 | 0b11111111 | -1 |
0b11 | 0b11111111 | 0b11111111 | -1 |
The correct states are now 0
, and the incorrect states are now -1
, we can subtract by this -1
to apply our + 1
fix from earlier:
2-bit Mask | Sign Extended | Fix State | Sign Extended - Fix State |
---|---|---|---|
0b00 | 0 | 0 | 0 - 0 = 0 |
0b01 | 1 | 0 | 1 - 0 = 1 |
0b10 | -2 | -1 | -2 - -1 = -1 |
0b11 | -1 | -1 | -1 - -1 = 0 |
And now we have our desired tribool.
The C for this is:
int tribool(int mask) {
mask = (mask << 30) >> 30;
return mask - (mask >> 1);
}
But wait, this still internally requires a temporary register for mask >> 1
to be added to the original mask
, so all we’ve done is hide the register in C.
After staring at the binary for a while I noticed that the two incorrect states are ones’ complement of the actual state I needed them to be in.
So this whole time I was trying to figure out how to convert 2-bit ones’ complement negative numbers into 2-bit twos’ complement.
The Wikipedia articles for both ones’ complement and two’s complement has good information:
This took me down the rabbit hole of trying to convert using a combination of bit flips (~mask
) and two’s complement negation (-mask
), but nothing I tried was an improvement.
I went back to the analysis of the + 1
fix:
2-bit Mask | 8-bit Sign Extended + 1 |
“Tribool” State |
---|---|---|
0b00 | 0b00000001 | 1 |
0b01 | 0b00000010 | 2 |
0b10 | 0b11111111 | -1 |
0b11 | 0b00000000 | 0 |
After spending so long staring at the bits, I noticed something: The values I was looking for were right there all along!
2-bit Mask | 8-bit Sign Extended + 1 |
---|---|
0b00 | 0b00000001 |
0b01 | 0b00000010 |
0b10 | 0b11111111 |
0b11 | 0b00000000 |
Just right shifting one more place (with sign extend) gets the results I had been looking for.
2-bit Mask | 8-bit Sign Extended + 1 |
Shift Right | Tribool State |
---|---|---|---|
0b00 | 0b00000001 | 0b00000000 | 0 |
0b01 | 0b00000010 | 0b00000001 | 1 |
0b10 | 0b11111111 | 0b11111111 | -1 |
0b11 | 0b00000000 | 0b00000000 | 0 |
The C for this is the obvious:
int tribool(int mask) {
mask = (mask << 30) >> 30;
return (mask + 1) >> 1;
}
Compiled with GCC -Og
16-bit thumb:
tribool:
lsls r0, r0, #30
asrs r0, r0, #30
adds r0, r0, #1
asrs r0, r0, #1
bx lr
Compiled with GCC -Og
32-bit ARM:
tribool:
lsl r0, r0, #30
asr r0, r0, #30
add r0, r0, #1
asr r0, r0, #1
bx lr
Notice that the assembly for both thumb and ARM only operates on r0
, which is the input parameter mask
.
32-bit ARM mode has more available registers than 16-bit thumb, so optimising for register pressure isn’t necessarily the best choice.
My final implementation selects between optimising for code size with 32-bit ARM, and register pressure for 16-bit thumb.
int tribool(int mask) {
mask = (mask << 30) >> 30; // Sign extend high bit
#if defined(__arm__) && !defined(__thumb__)
return mask - (mask >> 1); // Optimise for code size
#else
return (mask + 1) >> 1; // Optimise for register pressure
#endif
}
Compiling for GCC -Og
on x86_64 will select the register pressure optimisation:
tribool:
mov eax, edi
sal eax, 30
sar eax, 30
add eax, 1
sar eax
ret
The GBA keypad register is actually bit inverted to what most programmers would expect, so a button pressed is 0
, and a button released is 1
.
Most developers invert the keypad mask after reading it, so they can test mask & KEYLEFT
and get the result they expect.
This extra invert isn’t strictly necessary (especially if you’re using C++ which can wrap the raw data with accessing member functions), but the tribool implementation needs to be inverted to account for 0
being our “true” value, and 1
being our “false” value.
int inv_tribool(int mask) {
mask = (mask << 30) >> 30; // Sign extend high bit
#if defined(__arm__) && !defined(__thumb__)
return (mask >> 1) - mask; // Optimise for code size
#else
return (0 - mask) >> 1; // Optimise for register pressure
#endif
}
Notice the symmetry with the regular function.
The thumb and ARM assembly is just as sensible:
16-bit thumb:
inv_tribool:
lsls r0, r0, #30
asrs r0, r0, #30
rsbs r0, r0, #0
asrs r0, r0, #1
bx lr
32-bit ARM:
inv_tribool:
lsl r0, r0, #30
asr r3, r0, #31
sub r0, r3, r0, asr #30
bx lr