Hardest bug I ever solved

A lot of people probably have that hard bug they spent a lot of hours trying to solve. Almost 1.5 years after solving my hardest bug I got nostalgic and thought "Hey, this was pretty cool! I should write about it to not forget it". Since it's so long ago there might be things I misremember, but the gist of it should still apply.

Let's jump to it! Presenting: "The curious case of the crashing compiler".
Compiling the following code works in my standard GCC but causes another GCC to crash:
#include <float.h>
#define PyLong_BASE (1 << 15)
int main(int argc, char* argv[]) {

       int base = 10;
       static double data[37] = {0.0e0,};
       data[base] = (log((double)base) / log((double)(1<<15)));
       return 0;

The weird thing? It's the same Binutils/GCC built with the same flags - two seemingly identical GCCs.

This is when porting tools to a new platform so there is no GDB and so on. After debugging this by putting trace printouts everywhere, it becomes apparent that the linker crashes in libMPFR.

int mpfr_const_catalan (mpfr_ptr x, mpfr_rnd_t rnd_mode) {
 return mpfr_cache (x, __gmpfr_cache_const_catalan, rnd_mode);

Compiling an unoptimized libMPFR with logging makes everything work - which was the case for the first GCC and why it worked. What would cause this?

The hint is that __gmpfr_cache_const_catalan is a so called thread-local variable when logging is not enabled:

/* Declare the cache */
MPFR_DECL_INIT_CACHE(__gmpfr_cache_const_pi, mpfr_const_pi_internal);
MPFR_DECL_INIT_CACHE(__gmpfr_normal_pi, mpfr_const_pi_internal);
MPFR_DECL_INIT_CACHE(__gmpfr_logging_pi, mpfr_const_pi_internal);
mpfr_cache_ptr MPFR_THREAD_ATTR __gmpfr_cache_const_pi = __gmpfr_normal_pi;

Debugging this I know from my IRC logs that I found and fixed a bug with regards to the so called TLS relocations in our binutils port, but I cannot access the commit anymore since all patches were squashed together in preparation for upstreaming. What I have is this line:

< blueCmd> stekern: the human readable version: I think it was a race depending on which relocation was read first and that TLS and non-TLS locations would change the sreloc pointer in the section back and forth
Alas, now the problem is that linker works, but GCC crashes instead.

After some more printouts I arrive at the conclusion that the call to mpfr_cache would get called with a totally bogus value for x and the pointer __gmpfr_cache_const_catalan would be something very obviously broken (like 0x1). Inside mpfr_const_catalan it would however be fine.

A thread-local variable can be resolved in a bunch of different ways depending on how "close" the compiler can assume the real value is. The furthest one requires a call to our C library for a function called __tls_get_addr that will resolve the pointer to something that is valid in the current running thread. Since this only happens when TLS is used we'r now pretty confident that the bug is somewhere in the TLS code.

Let's have a look on how mpfr_const_catalan is generated. On OpenRISC 1200 the calling convention is to place the first argument in r3. Remember that we want to pass the first argument straight through to mpfr_cache, so we shouldn't touch it. This is how the assembly was generated (I removed a lot of unrelated stuff):

Note: OpenRISC 1200 has a delay-slot (indented) after jumps, so you need to think out-of-order.
       l.movhi         r3,tlsgdhi(__gmpfr_cache_const_catalan)
# ^^-- r3 clobbered!
       l.ori           r3,r3,tlsgdlo(__gmpfr_cache_const_catalan)
       l.jal           8
        l.movhi         r16,gotpchi(_GLOBAL_OFFSET_TABLE_-4)
       l.ori           r16,r16,gotpclo(_GLOBAL_OFFSET_TABLE_+0)
       l.add           r16,r16,r9
       l.add           r3,r3,r16
       l.jal           plt(__tls_get_addr)
        l.add           r3,r4,r16
       l.jal           plt(mpfr_cache)
        l.ori           r4,r11,0
# ^^-- Real address for __gmpfr_cache_const_catalan as 2nd arg
And at this point I had a clear case what was wrong, but as I expressed it a few hours after finding this:

< blueCmd> I honestly have no idea how to solve this
This feeling of helplessness had never struck me before. I had none I could ask, I had nobody that knew the system better than me. This was unique for me and I was very excited and frustrated at the same time. Fast forward a few days of ball planking and looking at old bug reports of seemingly maybe-related-but-probably-not bug reports in GCC me and stekern arrived that the code is probably emitting the library call to __tls_get_addr at the wrong state in the code expansion (remember! this is not a normal function call, it's inferred by accessing a specific pointer) and therefore messing up the register scheduling.

Moving the so called "address legitimization" to an early stage (sounds easy but is a *pain*) and removing some parts where I tried to be clever fixed the issues and made the generated code use sensible registers. The clever part was that I thought I was helping GCC by telling it that "you can use the same register multiple times at this point" causing GCC to make some weird optimizations that were very broken. Simply letting GCC schedule that by itself solved that problem.

Special thanks to stekern for all the help and idea bouncing through all of this! GCC code is not easy to reason about, or as stekern and I put it:
< blueCmd> gah, gcc's damn .md files - they are confusing as f*ck
< stekern> you don't say
< stekern> I can't count the number of times I've went "ah, now I get it! ...no, I don't" =)
< blueCmd> "error: unrecognizable insn:" *sigh*

When asked "What is the hardest bug you ever solved?" I will probably think of this for a long time.


Popular posts from this blog

Open Datacenter Hardware - What is OCP?

Open Datacenter Hardware - Leopard Server