Message ID | 4EE0C9D3.4000201@linux.vnet.ibm.com |
---|---|
State | New |
On Thu, Dec 8, 2011 at 2:29 PM, Mark Wu <wudxw@linux.vnet.ibm.com> wrote:
> I tried to optimize the zero detecting code with SSE instruction. The idea
> comes from Paolo's patch "migration: vectorize is_dup_page". It's expected
> to give us an noticeable improvement. But I didn't find any improvement in
> the qemu-io test even though I increased the image size to 5GB. The
> following is my test patch. Could you please review it to see if I made any
> mistake and SSE can help for zero detecting?

Please put the zero detection function in a common location before
adding serious optimization so that qemu-img.c:is_not_zero() can also
use it.

Out of interest here is the code generated by gcc 4.6.2 from the non-SSE code:

    1d50:  89 c2                  mov    %eax,%edx
    1d52:  c1 fa 03               sar    $0x3,%edx
    1d55:  48 63 d2               movslq %edx,%rdx
    1d58:  48 83 3c d6 00         cmpq   $0x0,(%rsi,%rdx,8)
    1d5d:  0f 85 03 ff ff ff      jne    1c66 <qed_aio_write_data+0x146>
    1d63:  83 c0 08               add    $0x8,%eax
    1d66:  48 63 d0               movslq %eax,%rdx
    1d69:  48 39 d1               cmp    %rdx,%rcx
    1d6c:  77 e2                  ja     1d50 <qed_aio_write_data+0x230>

Once you have the zero detection code in a utility function it's easy
to write a small test program to run a performance benchmark.

Stefan
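
The suggestion above (a shared utility function plus a small benchmark) could look roughly like the following sketch. The names `buffer_is_zero_u64` and `bench_zero_scan` are hypothetical and do not come from the QEMU tree. The scan checks 8 bytes per iteration, in the spirit of the `cmpq $0x0` loop in the disassembly, and the timing uses plain `clock()`, so the numbers are only a rough guide.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: true if all `len` bytes at `buf` are zero. */
static bool buffer_is_zero_u64(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;

    /* Compare one 64-bit word at a time; memcpy keeps the load
     * well-defined for unaligned buffers. */
    for (i = 0; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, p + i, sizeof(w));
        if (w != 0) {
            return false;
        }
    }
    /* Check the remaining tail bytes one by one. */
    for (; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Hypothetical benchmark helper: scans a zeroed buffer `iterations`
 * times and returns the CPU time in seconds, or -1.0 if the
 * allocation fails. */
static double bench_zero_scan(size_t len, int iterations)
{
    unsigned char *buf = calloc(1, len);
    volatile bool sink = false;
    clock_t start, end;
    int i;

    if (buf == NULL) {
        return -1.0;
    }
    start = clock();
    for (i = 0; i < iterations; i++) {
#if defined(__GNUC__)
        /* Compiler barrier (GCC/Clang only) so the scan is not
         * hoisted out of the loop. */
        __asm__ volatile("" : : "r"(buf) : "memory");
#endif
        sink ^= buffer_is_zero_u64(buf, len);
    }
    end = clock();
    free(buf);
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `bench_zero_scan()` from a small `main()` with a buffer larger than the CPU caches, once for each implementation under test, gives a first comparison. A scan like this is probably limited by memory bandwidth on large buffers, which might be one reason a vectorized version shows little gain in an end-to-end qemu-io run.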
diff --git a/block/qed.c b/block/qed.c
index 75a44f3..61e4a27 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -998,6 +998,14 @@ static void qed_aio_write_l2_update_cb(void *opaque, int ret)
     qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
 }
 
+#ifdef __SSE2__
+#include <emmintrin.h>
+#define VECTYPE __m128i
+#define SPLAT(p) _mm_set1_epi8(*(p))
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#define VECTYPE_ZERO _mm_setzero_si128()
+#endif
+
 /**
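
The rest of the patch is not shown here, so the following is only a sketch of how macros like these might be used in a zero-detection loop. It is not the patch's actual code. The function name `is_zero_vec` and the `LOAD` macro (an unaligned 16-byte load) are assumptions added for illustration. On builds without SSE2 the function falls back to a plain byte loop.

```c
#include <stdbool.h>
#include <stddef.h>

#ifdef __SSE2__
#include <emmintrin.h>
#define VECTYPE        __m128i
#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
#define VECTYPE_ZERO   _mm_setzero_si128()
/* Assumption: not part of the patch; unaligned 16-byte load. */
#define LOAD(p)        _mm_loadu_si128((const __m128i *)(p))
#endif

/* Hypothetical helper: true if all `len` bytes at `buf` are zero. */
static bool is_zero_vec(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

#ifdef __SSE2__
    const VECTYPE zero = VECTYPE_ZERO;

    /* Compare 16 bytes per iteration against an all-zero vector.
     * ALL_EQ is true only if every byte lane compared equal. */
    for (; i + sizeof(VECTYPE) <= len; i += sizeof(VECTYPE)) {
        if (!ALL_EQ(LOAD(p + i), zero)) {
            return false;
        }
    }
#endif
    /* Tail bytes, or the whole buffer when SSE2 is unavailable. */
    for (; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}
```

The unaligned load keeps the sketch safe for arbitrary pointers. If the buffers are known to be 16-byte aligned, `_mm_load_si128` could be used instead.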