The problem boils down to this: a tight loop with a sequence of loads and stores to the same address performs faster if you add an unnecessary call at the beginning.
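The benchmark source isn't quoted in this post, but judging from the disassembly below, it amounts to something like the following (a reconstruction -- the function names come from the symbols, while the counter name and the attributes are my guesses):

#include <stdint.h>

/* 64-bit accumulator in memory; volatile keeps the loads and stores
 * to the same address in every iteration. */
volatile uint64_t counter = 0;

void __attribute__((noinline)) tightloop(void) {
    uint64_t i;
    for (i = 0; i < 400000000ULL; i++)     /* 400000000 == 0x17d78400 */
        counter += i;
}

void __attribute__((noinline)) foo(void) {}   /* empty */

void __attribute__((noinline)) loop_with_extra_call(void) {
    uint64_t i;
    for (i = 0; i < 400000000ULL; i++) {
        foo();                              /* the unnecessary call */
        counter += i;
    }
}

Here's what the compiled tight version looks like: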
08048430 <tightloop>:
 8048430:  53                      push   ebx
 8048431:  31 c0                   xor    eax,eax
 8048433:  31 d2                   xor    edx,edx
 8048435:  8d 76 00                lea    esi,[esi+0x0]
 8048438:  8b 0d 28 a0 04 08       mov    ecx,DWORD PTR ds:0x804a028
 804843e:  8b 1d 2c a0 04 08       mov    ebx,DWORD PTR ds:0x804a02c
 8048444:  01 c1                   add    ecx,eax
 8048446:  11 d3                   adc    ebx,edx
 8048448:  83 c0 01                add    eax,0x1
 804844b:  89 0d 28 a0 04 08       mov    DWORD PTR ds:0x804a028,ecx
 8048451:  89 c1                   mov    ecx,eax
 8048453:  83 d2 00                adc    edx,0x0
 8048456:  81 f1 00 84 d7 17       xor    ecx,0x17d78400
 804845c:  09 d1                   or     ecx,edx
 804845e:  89 1d 2c a0 04 08       mov    DWORD PTR ds:0x804a02c,ebx
 8048464:  75 d2                   jne    8048438 <tightloop+0x8>
 8048466:  5b                      pop    ebx
 8048467:  c3                      ret

Then, not a lot changes if we add the call to the empty function:
08048480 <loop_with_extra_call>:
 8048480:  53                      push   ebx
 8048481:  31 c0                   xor    eax,eax
 8048483:  31 d2                   xor    edx,edx
 8048485:  8d 76 00                lea    esi,[esi+0x0]
 8048488:  e8 e3 ff ff ff          call   8048470 <foo>
 804848d:  8b 0d 28 a0 04 08       mov    ecx,DWORD PTR ds:0x804a028
 8048493:  8b 1d 2c a0 04 08       mov    ebx,DWORD PTR ds:0x804a02c
 8048499:  01 c1                   add    ecx,eax
 804849b:  11 d3                   adc    ebx,edx
 804849d:  83 c0 01                add    eax,0x1
 80484a0:  89 0d 28 a0 04 08       mov    DWORD PTR ds:0x804a028,ecx
 80484a6:  89 c1                   mov    ecx,eax
 80484a8:  83 d2 00                adc    edx,0x0
 80484ab:  81 f1 00 84 d7 17       xor    ecx,0x17d78400
 80484b1:  09 d1                   or     ecx,edx
 80484b3:  89 1d 2c a0 04 08       mov    DWORD PTR ds:0x804a02c,ebx
 80484b9:  75 cd                   jne    8048488 <loop_with_extra_call+0x8>
 80484bb:  5b                      pop    ebx
 80484bc:  c3                      ret

This is exactly the type of problem where an accurate x86 performance simulator would be a very useful tool. Well, guess who has ten fingers and has been building such a simulator for a long time? Yeah, this guy! Let's see what XIOSim can show us about this loop.
Ok, let's build in 32-bit mode and reduce the number of iterations so it's easier to simulate.
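(The build setup isn't shown here; presumably it's something along the lines of the following, with the source file name being my invention:)

gcc -m32 -O2 -o build/loop-call-weirdness loop-call-weirdness.c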
Running this on a SandyBridge, we get an IPC of 1.52 for the non-call version:

skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness t

 Performance counter stats for 'build/loop-call-weirdness t':

       156,709,126 cycles:HG                 #    3.293 GHz          [83.18%]
       238,816,603 instructions:HG           #    1.52  insns per cycle

... and an IPC of 1.97 for the version with the extra call:
skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       142,337,932 cycles:HG                 #    3.271 GHz          [81.70%]
       280,447,193 instructions:HG           #    1.97  insns per cycle

The call version still runs about 10% faster, so the effect is still there. Ok, let's test this on a Haswell:
skanev@haswell:~/eecs/loop_weird$ perf stat -- build/loop-call-weirdness t

 Performance counter stats for 'build/loop-call-weirdness t':

       143,093,636 cycles                    #    2.630 GHz
       240,420,445 instructions              #    1.68  insns per cycle

Wait, the non-call version is fast here.
skanev@haswell:~/eecs/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       142,766,225 cycles                    #    2.838 GHz
       260,414,544 instructions              #    1.82  insns per cycle

And so is the one with the call. Interesting. This means that it's probably not the call that's making this faster, but the tight version hitting some weird bottleneck on SandyBridge.
Ok, let's give this some thought and start crossing things off. Looking at several performance counters, nothing seems fishy in the load-store queue, so flushes because of memory-order violations there are off the table (the call-ret pair could plausibly have been messing with address disambiguation).
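(For reference -- not necessarily the exact counters used here -- memory-ordering machine clears can be counted with something like:)

perf stat -e machine_clears.memory_ordering -- build/loop-call-weirdness t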
If this is the case, there shouldn't be anything much special about the call-ret pair -- strip away their memory accesses and what's left is a stack-pointer subtraction and addition (esp -= 4 for the call, esp += 4 for the ret). Actually, those are probably optimized out early on:
"After the LSD buffer, the last step in decode is the dedicated stack engine, which removes all the stack modifying uops. The stack modifying uops are all executed by a dedicated adder that writes to a speculative delta register in the front end, which is then occasionally synchronized with a renamed architectural register that contains the non-speculative value of the stack. After the stack manipulating uops have been excised, the remaining uops head down into the out-of-order machine to be renamed, issued, dispatched and executed." (source)
In that case, let's check a nop instead of the call:
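(The modified source isn't shown; presumably the call is simply swapped for inline assembly, something like this sketch -- the function name is made up:)

void loop_with_nop(void) {
    uint64_t i;
    for (i = 0; i < 400000000ULL; i++) {
        __asm__ __volatile__("nop");   /* one-byte 0x90 in place of the call */
        counter += i;
    }
}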
skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       152,240,563 cycles:HG                 #    3.231 GHz          [84.92%]
       259,694,527 instructions:HG           #    1.71  insns per cycle

Nope, still slow. How about a smarter nop encoding that goes through execution, like xchg edi, edi?
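(Again as inline assembly, presumably -- unlike the canonical one-byte nop, this two-byte encoding isn't treated as a no-op, so it actually occupies an execution port:)

        __asm__ __volatile__("xchg %edi, %edi");   /* 87 ff: executes as a real uop */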
skanev@rb22:~/loop_weird$ perf stat -- build/loop-call-weirdness c

 Performance counter stats for 'build/loop-call-weirdness c':

       144,585,773 cycles:HG                 #    3.252 GHz          [82.01%]
       259,684,447 instructions:HG           #    1.80  insns per cycle

Aha! Fast again. Now we're onto something.
We can completely rule memory out -- with the xchg, the memory accesses are exactly the same. Probably no structural (ALU) hazards either -- the xchg increases the utilization of port 1 (checked with UOPS_DISPATCHED_PORT.PORT_1), but we pay no penalty for it -- so out-of-order execution is working as it should.
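(That check, roughly, as perf spells the event on SandyBridge -- assuming the event name, which can vary across perf versions:)

perf stat -e uops_dispatched_port.port_1 -- build/loop-call-weirdness c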
Probably not something in the front end either -- the nop should have triggered any front-end drama just the same. This is starting to smell more and more like a bottleneck in register renaming. SandyBridge and newer have a unified physical register file implementation, so the renaming unit needs to keep at least a few tables (a free list, architectural registers, speculative registers) to work without storing destination registers in the re-order buffer (see the intro here).
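To make the bookkeeping concrete, here is a toy sketch of PRF-style renaming -- my illustration, not XIOSim's or Intel's implementation; the sizes are ballpark SandyBridge figures:

#include <stdbool.h>

#define NUM_ARCH_REGS  16
#define NUM_PHYS_REGS 160   /* roughly SandyBridge's integer PRF size */

static int map_table[NUM_ARCH_REGS];   /* speculative arch -> phys mapping */
static int free_list[NUM_PHYS_REGS];   /* physical registers not in use */
static int free_count = NUM_PHYS_REGS - NUM_ARCH_REGS;  /* table init omitted */

/* Rename one destination register; returns false on a rename stall. */
static bool rename_dest(int arch_reg, int *prev_phys) {
    if (free_count == 0)
        return false;                  /* free list dry: the front end stalls,
                                          no matter how much ROB space is left */
    *prev_phys = map_table[arch_reg];  /* the old mapping is reclaimed only when
                                          the new value commits, so registers can
                                          stay pinned long after rename */
    map_table[arch_reg] = free_list[--free_count];
    return true;
}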
If some of these tables get full in the non-xchg case, the stalls might be bad enough that the out-of-order engine can't cover them. RESOURCE_STALLS2 is exactly the set of counters that should show that -- too bad they return complete junk! Not atypical for esoteric counters, sigh. It seems like Intel noticed too -- the counters aren't there any more in IvyBridge or Haswell (SDM, chapter 19). Ugh.
Let's run these through XIOSim. Since I only have a Nehalem configuration, that's what we'll be looking at. First, the tight version:
#### SIMULATOR PERFORMANCE STATS ####
sim_cycle                 120205938 # total simulation cycles
total_IPC                    1.9974 # final commit IPC

Hmm, faster than the fast version. I guess I'm not modelling a bottleneck somewhere. Let's check the version with the call too:
#### SIMULATOR PERFORMANCE STATS ####
sim_cycle                 141640126 # total simulation cycles
total_IPC                    1.9775 # final commit IPC

Yup, IPC stays constant, and we see proportionally more cycles for the extra 20M instructions. So the call is hitting an additional bottleneck -- I am certainly not modelling the stack pointer shortcut mentioned above -- maybe that creates a dependence chain between the four micro-ops in the call-ret pair. Let's try with xchg:
#### SIMULATOR PERFORMANCE STATS ####
sim_cycle                 120203910 # total simulation cycles
total_IPC                    2.1638 # final commit IPC

Yup, as fast as the tight version. So the pipeline can easily absorb the independent add that this xchg gets turned into. This confirms the suspicion from above that typical memory alignment issues or cache behavior can be ruled out.
Also, the Nehalem model in XIOSim doesn't model a physical register file -- register renaming there should only be limited by the ROB capacity, which we certainly don't hit. Again, a reason to suspect that the tight loop on SNB is limited at renaming. But going further won't be easy.
Without the counters that can tell us what is going on (or without a simulator that implements a physical register file; put that on the to-do for XIOSim), we are mostly flying blind. It appears that someone else on the Internet has noticed weird performance behavior likely related to reclaiming physical registers to the free list (Fig. 4, the paragraph "Physical Register Size"). I wonder if that's related. Oh well, this will need more looking into if/when I implement detailed renaming in XIOSim...