Tuesday, June 15, 2010

GCC: Now you are going to unroll the loop or should I?

A codelet to calculate the log (to base 2) of a number which is known to be a power of 2.

unsigned int log2_pow2(unsigned int v) // 32-bit value to find the log2 of
{
static const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
0xFF00FF00, 0xFFFF0000};
register unsigned int r = (v & b[0]) != 0;
int i;
for (i = 4; i > 0; i--) // unroll for speed...
{
r |= ((v & b[i]) != 0) <<>
}
return r;
}

Let's see what do we get with GCC 4.4 at O3,

400550: 31 c0 xor %eax,%eax 400552: f7 c7 aa aa aa aa test $0xaaaaaaaa,%edi 400558: 0f 95 c0 setne %al 40055b: 31 d2 xor %edx,%edx 40055d: f7 c7 00 00 ff ff test $0xffff0000,%edi 400563: 0f 95 c2 setne %dl 400566: 31 c9 xor %ecx,%ecx 400568: c1 e2 04 shl $0x4,%edx 40056b: f7 c7 00 ff 00 ff test $0xff00ff00,%edi 400571: 0f 95 c1 setne %cl 400574: c1 e1 03 shl $0x3,%ecx 400577: 09 ca or %ecx,%edx 400579: 09 c2 or %eax,%edx 40057b: 31 c0 xor %eax,%eax 40057d: f7 c7 f0 f0 f0 f0 test $0xf0f0f0f0,%edi 400583: 0f 95 c0 setne %al 400586: c1 e0 02 shl $0x2,%eax 400589: 09 c2 or %eax,%edx 40058b: 31 c0 xor %eax,%eax 40058d: f7 c7 cc cc cc cc test $0xcccccccc,%edi 400593: 0f 95 c0 setne %al 400596: 01 c0 add %eax,%eax 400598: 09 d0 or %edx,%eax 40059a: c3 retq 40059b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)

So GCC unrolled the code for us. Nice. But how about this bit of assistance?

unsigned int log2_pow2_2(unsigned int v) { register unsigned int r = (v & 0xAAAAAAAA) != 0; r |= ((v & 0xFFFF0000) != 0) <<>

Shouldn't make much difference, right? Let's ask the GCC gods for their verdict.

4005a0: 89 f8 mov %edi,%eax 4005a2: 89 fa mov %edi,%edx 4005a4: 66 31 c0 xor %ax,%ax 4005a7: 83 f8 01 cmp $0x1,%eax 4005aa: 19 c0 sbb %eax,%eax 4005ac: 81 e2 00 ff 00 ff and $0xff00ff00,%edx 4005b2: f7 d0 not %eax 4005b4: 83 e0 10 and $0x10,%eax 4005b7: 83 fa 01 cmp $0x1,%edx 4005ba: 89 fa mov %edi,%edx 4005bc: 19 f6 sbb %esi,%esi 4005be: 81 e2 f0 f0 f0 f0 and $0xf0f0f0f0,%edx 4005c4: f7 d6 not %esi 4005c6: 83 e6 08 and $0x8,%esi 4005c9: 83 fa 01 cmp $0x1,%edx 4005cc: 89 fa mov %edi,%edx 4005ce: 19 c9 sbb %ecx,%ecx 4005d0: 81 e2 cc cc cc cc and $0xcccccccc,%edx 4005d6: f7 d1 not %ecx 4005d8: 83 e1 04 and $0x4,%ecx 4005db: 83 fa 01 cmp $0x1,%edx 4005de: 19 d2 sbb %edx,%edx 4005e0: f7 d2 not %edx 4005e2: 83 e2 02 and $0x2,%edx 4005e5: 81 e7 aa aa aa aa and $0xaaaaaaaa,%edi 4005eb: 40 0f 95 c7 setne %dil 4005ef: 40 0f b6 ff movzbl %dil,%edi 4005f3: 09 f8 or %edi,%eax 4005f5: 09 f0 or %esi,%eax 4005f7: 09 c8 or %ecx,%eax 4005f9: 09 d0 or %edx,%eax 4005fb: c3 retq 4005fc: 0f 1f 40 00 nopl 0x0(%rax)

Gee, that was unexpected. The code size grew. And is probably slower because of it. I do not claim to know much of non-SSEx assembly, but the manual unrolling seems to produce worse code.

Who knew? :|

No comments: