Wednesday, December 4, 2013

Call performance weirdness

Someone on the Internet has a weird hardware performance question. What do we do as people familiar with microarchitecture? We investigate!

The problem boils down to this: a tight loop with a sequence of loads and stores to the same address runs faster if you add an unnecessary call at the top of the loop body. Here's what the code looks like:
 08048430 <tightloop>:
 8048430:   53                      push   ebx
 8048431:   31 c0                   xor    eax,eax
 8048433:   31 d2                   xor    edx,edx
 8048435:   8d 76 00                lea    esi,[esi+0x0]
 8048438:   8b 0d 28 a0 04 08       mov    ecx,DWORD PTR ds:0x804a028
 804843e:   8b 1d 2c a0 04 08       mov    ebx,DWORD PTR ds:0x804a02c
 8048444:   01 c1                   add    ecx,eax
 8048446:   11 d3                   adc    ebx,edx
 8048448:   83 c0 01                add    eax,0x1
 804844b:   89 0d 28 a0 04 08       mov    DWORD PTR ds:0x804a028,ecx
 8048451:   89 c1                   mov    ecx,eax
 8048453:   83 d2 00                adc    edx,0x0
 8048456:   81 f1 00 84 d7 17       xor    ecx,0x17d78400
 804845c:   09 d1                   or     ecx,edx
 804845e:   89 1d 2c a0 04 08       mov    DWORD PTR ds:0x804a02c,ebx
 8048464:   75 d2                   jne    8048438 <tightloop+0x8>
 8048466:   5b                      pop    ebx
 8048467:   c3                      ret
Not a lot changes when we add the call to an empty function:
 08048480 <loop_with_extra_call>:
 8048480:   53                      push   ebx
 8048481:   31 c0                   xor    eax,eax
 8048483:   31 d2                   xor    edx,edx
 8048485:   8d 76 00                lea    esi,[esi+0x0]
 8048488:   e8 e3 ff ff ff          call   8048470 <foo>
 804848d:   8b 0d 28 a0 04 08       mov    ecx,DWORD PTR ds:0x804a028
 8048493:   8b 1d 2c a0 04 08       mov    ebx,DWORD PTR ds:0x804a02c
 8048499:   01 c1                   add    ecx,eax
 804849b:   11 d3                   adc    ebx,edx
 804849d:   83 c0 01                add    eax,0x1
 80484a0:   89 0d 28 a0 04 08       mov    DWORD PTR ds:0x804a028,ecx
 80484a6:   89 c1                   mov    ecx,eax
 80484a8:   83 d2 00                adc    edx,0x0
 80484ab:   81 f1 00 84 d7 17       xor    ecx,0x17d78400
 80484b1:   09 d1                   or     ecx,edx
 80484b3:   89 1d 2c a0 04 08       mov    DWORD PTR ds:0x804a02c,ebx
 80484b9:   75 cd                   jne    8048488 <loop_with_extra_call+0x8>
 80484bb:   5b                      pop    ebx
 80484bc:   c3                      ret
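For reference, here's a minimal C sketch reconstructed from the disassembly -- the function names, the 64-bit counter global and the 0x17d78400 bound come straight from the listings, while the build line and main's 't'/'c' argument handling are guesses based on the perf invocations below:

/*
 * Sketch only; the exact original source may differ.
 * Assumed build: gcc -m32 -O2 -o build/loop-call-weirdness loop.c
 *
 * 0x17d78400 is 400,000,000; the runs below use a smaller count
 * (~20M, judging by the instruction totals) to keep simulation manageable.
 */
#include <stdio.h>

#define N 0x17d78400ULL

volatile unsigned long long counter = 0;   /* the 64-bit global at 0x804a028 */

void __attribute__((noinline)) foo(void)
{
    __asm__ volatile("");                  /* keep gcc from dropping the empty call */
}

void __attribute__((noinline)) tightloop(void)
{
    unsigned long long i;
    for (i = 0; i < N; i++)
        counter += i;                      /* load, 64-bit add/adc, store */
}

void __attribute__((noinline)) loop_with_extra_call(void)
{
    unsigned long long i;
    for (i = 0; i < N; i++) {
        foo();                             /* the "unnecessary" call */
        counter += i;
    }
}

int main(int argc, char **argv)
{
    if (argc > 1 && argv[1][0] == 'c')
        loop_with_extra_call();
    else
        tightloop();
    printf("%llu\n", counter);
    return 0;
}
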
This is exactly the type of problem where an accurate x86 performance simulator would be a very useful tool. Well, guess who has ten fingers and has been building such a simulator for a long time. Yeah, this guy! Let's see what XIOSim can show us about this loop.

Ok, let's build in 32-bit mode and reduce the number of iterations, so it is easier to simulate. Running this on a SandyBridge, we get an IPC of 1.52 for the non-call version:
skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness t

 Performance counter stats for 'build/loop-call-weirdness t':

       156,709,126 cycles:HG                 #    3.293 GHz                     [83.18%]
       238,816,603 instructions:HG           #    1.52  insns per cycle
... and an IPC of 1.97 for the version with the extra call:
skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       142,337,932 cycles:HG                 #    3.271 GHz                     [81.70%]
       280,447,193 instructions:HG           #    1.97  insns per cycle
The call version still runs about 10% faster (142M vs. 157M cycles), so the effect is still there. Ok, let's test this on a Haswell:
skanev@haswell:~/eecs/loop_weird$ perf stat -- build/loop-call-weirdness t

 Performance counter stats for 'build/loop-call-weirdness t':

       143,093,636 cycles                    #    2.630 GHz
       240,420,445 instructions              #    1.68  insns per cycle
Wait, the non-call version is fast here.
skanev@haswell:~/eecs/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       142,766,225 cycles                    #    2.838 GHz
       260,414,544 instructions              #    1.82  insns per cycle
And so is the one with the call. Interesting. This means that it's probably not the call that's making this faster, but the tight version hitting some weird bottleneck on SandyBridge.

Ok, let's give this some thought and start crossing things off. Looking at several performance counters, nothing seems fishy in the load-store queue -- flushes because of memory reorderings there are off the table (the call-ret pair might have been messing with address disambiguation).

If that's the case, there shouldn't be anything particularly special about the call-ret pair -- strip away their memory accesses and what's left is a stack-pointer addition and subtraction. Actually, those are probably optimized away early on:

After the LSD buffer, the last step in decode is the dedicated stack engine, which removes all the stack modifying uops. The stack modifying uops are all executed by a dedicated adder that writes to a speculative delta register in the front end, which is then occasionally synchronized with a renamed architectural register that contains the non-speculative value of the stack. After the stack manipulating uops have been excised, the remaining uops head down into the out-of-order machine to be renamed, issued, dispatched and executed. (source)
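
To make that concrete, here's a toy model of the mechanism described in the quote -- purely an illustration with invented structure, not a description of the real hardware. Pushes and pops only adjust a front-end delta, and an ALU uop is only spent when the delta has to be folded back into the renamed esp; a matched call-ret pair leaves the delta at zero, so it costs nothing in the out-of-order core:

/* Toy model of the stack engine idea -- illustration only. */
#include <stdio.h>
#include <stdint.h>

static int32_t esp_delta;    /* speculative delta tracked in the front end */
static int32_t renamed_esp;  /* stands in for the renamed architectural esp */
static long    alu_uops;     /* uops that actually reach the out-of-order core */

static void on_push(void) { esp_delta -= 4; }  /* call pushes the return address */
static void on_pop(void)  { esp_delta += 4; }  /* ret pops it back */

/* Something reads esp directly: fold the delta in with a single sync uop. */
static void sync_esp(void)
{
    if (esp_delta != 0) {
        renamed_esp += esp_delta;
        esp_delta = 0;
        alu_uops++;          /* one sync uop instead of one uop per push/pop */
    }
}

int main(void)
{
    for (long i = 0; i < 1000; i++) {  /* 1000 call-ret pairs */
        on_push();
        on_pop();
    }
    sync_esp();
    printf("ALU uops spent on esp updates: %ld\n", alu_uops);  /* prints 0 */
    return 0;
}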

In that case, let's check a nop instead of the call:
skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       152,240,563 cycles:HG                 #    3.231 GHz                     [84.92%]
       259,694,527 instructions:HG           #    1.71  insns per cycle
Nope, still slow. How about a smarter nop encoding that actually goes through execution, like xchg edi, edi?
skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       144,585,773 cycles:HG                 #    3.252 GHz                     [82.01%]
       259,684,447 instructions:HG           #    1.80  insns per cycle
Aha! Fast again. Now we're onto something.
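
For reference, the nop and xchg variants were presumably made by swapping the call for a different filler instruction; here's a sketch of what that substitution might look like with gcc inline assembly (the actual experiment may well have done it differently), reusing the counter global from the sketch above:

extern volatile unsigned long long counter;       /* same global as before */

void __attribute__((noinline)) loop_with_nop(void)
{
    unsigned long long i;
    for (i = 0; i < 0x17d78400ULL; i++) {
        __asm__ volatile("nop");                  /* handled entirely in the front end */
        counter += i;
    }
}

void __attribute__((noinline)) loop_with_xchg(void)
{
    unsigned long long i;
    for (i = 0; i < 0x17d78400ULL; i++) {
        __asm__ volatile("xchg %edi, %edi");      /* a "nop" that still gets dispatched to an ALU port */
        counter += i;
    }
}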

We can completely rule memory out -- with the xchg, the memory accesses are exactly the same as in the tight version. Probably no structural (ALU) hazards either -- the xchg increases the utilization of port 1 (checked with UOPS_DISPATCHED_PORT.PORT_1), but we pay no penalty for it -- so out-of-order execution is covering it as it should.

Probably not something in the front end either -- the nop should have triggered the same front-end drama. This is starting to smell more and more like a bottleneck in register renaming. SandyBridge and newer have a unified physical register file implementation, so the renaming unit needs to keep at least a few tables (free list, architectural register mappings, speculative mappings) to work without storing destination registers in the re-order buffer (see the intro here).

If some of these tables fill up in the non-xchg case, the stalls might be bad enough that the out-of-order engine can't cover them. RESOURCE_STALLS2 is exactly the set of counters that should show this -- too bad they return complete junk! Not atypical for esoteric counters, sigh. It seems Intel noticed too -- those counters aren't there any more on IvyBridge or Haswell (SDM, chapter 19). Ugh.
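
To spell out what "tables getting full" would mean, here's a toy model of the suspected mechanism -- the sizes and rates are completely made up, the point is just the shape of the problem: every result-producing uop needs a physical register from the free list, registers only come back at retirement, and once the free list runs dry the front end has to stall.

/* Toy model of rename stalling when the physical register free list runs
 * dry. All sizes and rates here are invented for illustration. */
#include <stdio.h>
#include <stdbool.h>

#define NUM_PREGS 160          /* ballpark SNB-class integer PRF size */

static int  free_regs = NUM_PREGS;
static long rename_stalls;

/* Rename one result-producing uop: it needs a free physical register. */
static bool rename_uop(void)
{
    if (free_regs == 0) {
        rename_stalls++;       /* free list empty -> rename (and the front end) stalls */
        return false;
    }
    free_regs--;
    return true;
}

/* Retirement reclaims the physical register of the previous mapping. */
static void retire_uop(void)
{
    free_regs++;
}

int main(void)
{
    /* Pretend rename outpaces reclamation for a stretch: allocate 4 registers
     * per cycle but give back only 2, and watch the free list drain. */
    for (int cycle = 0; cycle < 200; cycle++) {
        for (int i = 0; i < 4; i++) rename_uop();
        for (int i = 0; i < 2; i++) retire_uop();
    }
    printf("free registers left: %d, rename stall events: %ld\n",
           free_regs, rename_stalls);
    return 0;
}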

Let's run these through XIOSim. Since I only have a Nehalem configuration, that's the machine model we'll be looking at:
#### SIMULATOR PERFORMANCE STATS ####                                                       
sim_cycle                       120205938 # total simulation cycles
total_IPC                          1.9974 # final commit IPC
Hmm, faster than the fast version. I guess there's some bottleneck I'm not modelling. Let's check the version with the call too:
#### SIMULATOR PERFORMANCE STATS ####                                                       
sim_cycle                       141640126 # total simulation cycles
total_IPC                          1.9775 # final commit IPC
Yup, IPC stays roughly constant, and we see proportionally more cycles for the extra ~40M instructions (a call-ret pair in each of the ~20M iterations). So in the simulator the call is hitting an additional bottleneck -- I'm certainly not modelling the stack pointer shortcut mentioned above -- maybe that creates a dependence chain between the 4 micro-ops in the call-ret pair. Let's try with xchg:
#### SIMULATOR PERFORMANCE STATS ####                                                       
sim_cycle                       120203910 # total simulation cycles
total_IPC                          2.1638 # final commit IPC
Yup, as fast as the tight version. So the pipeline can easily absorb the independent add that this xchg gets turned into. This confirms the suspicion from above that typical memory alignment issues or cache behavior can be ruled out.

Also, the Nehalem model in XIOSim doesn't model a physical register file -- register renaming there is only limited by the ROB capacity, which we certainly don't hit. One more reason to suspect that the tight loop on SNB is limited at renaming. But going further won't be easy.

Without the counters that could tell us what is going on (or a simulator that implements a physical register file -- put that on the to-do list for XIOSim), we are mostly flying blind. It appears that someone else on the Internet has noticed weird performance behavior likely related to reclaiming physical registers to the free list (Fig. 4, the paragraph "Physical Register Size"). I wonder if that's the same effect. Oh well, this will need more looking into if/when I implement detailed renaming in XIOSim...
