| Submitter | Mark Wu |
|---|---|
| Date | Dec. 8, 2011, 2:29 p.m. |
| Message ID | <4EE0C9D3.4000201@linux.vnet.ibm.com> |
| Permalink | /patch/130184/ |
| State | New |
Comments
On Thu, Dec 8, 2011 at 2:29 PM, Mark Wu <wudxw@linux.vnet.ibm.com> wrote:
> I tried to optimize the zero detecting code with SSE instructions. The idea
> comes from Paolo's patch "migration: vectorize is_dup_page". It's expected
> to give us a noticeable improvement. But I didn't find any improvement in
> the qemu-io test even though I increased the image size to 5GB. The
> following is my test patch. Could you please review it to see if I made any
> mistake and SSE can help for zero detecting?

Please put the zero detection function in a common location before
adding serious optimization so that qemu-img.c:is_not_zero() can also
use it.

Out of interest here is the code generated by gcc 4.6.2 from the
non-SSE code:

    1d50:  89 c2              mov    %eax,%edx
    1d52:  c1 fa 03           sar    $0x3,%edx
    1d55:  48 63 d2           movslq %edx,%rdx
    1d58:  48 83 3c d6 00     cmpq   $0x0,(%rsi,%rdx,8)
    1d5d:  0f 85 03 ff ff ff  jne    1c66 <qed_aio_write_data+0x146>
    1d63:  83 c0 08           add    $0x8,%eax
    1d66:  48 63 d0           movslq %eax,%rdx
    1d69:  48 39 d1           cmp    %rdx,%rcx
    1d6c:  77 e2              ja     1d50 <qed_aio_write_data+0x230>

Once you have the zero detection code in a utility function it's easy
to write a small test program to run a performance benchmark.

Stefan
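As a rough illustration of the suggestion above, the zero check could be pulled out into a standalone helper and timed outside of qemu-io. The following is only a sketch under stated assumptions: `buf_is_zero()` and `time_zero_scan()` are hypothetical names, not existing QEMU functions, and the scalar loop merely mirrors the 8-bytes-at-a-time scan in qed_is_zero_write().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not a QEMU API): returns true if all len bytes
 * at buf are zero. Scans 8 bytes at a time, then checks any tail bytes. */
static bool buf_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, p + i, sizeof(word)); /* no alignment assumption */
        if (word) {
            return false;
        }
    }
    for (; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Hypothetical timing helper: runs fn over buf `rounds` times, stores
 * whether every round reported all-zero, and returns CPU seconds used. */
static double time_zero_scan(bool (*fn)(const void *, size_t),
                             const void *buf, size_t len,
                             int rounds, bool *all_zero)
{
    clock_t start = clock();
    bool ok = true;

    assert(fn != NULL && all_zero != NULL);
    for (int r = 0; r < rounds; r++) {
        if (!fn(buf, len)) {
            ok = false;
        }
    }
    *all_zero = ok;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A small main() could allocate a zero-filled buffer of a few hundred megabytes, call time_zero_scan() once per candidate implementation, and print the two timings. With optimization enabled the compiler may fold repeated scans of an unchanged buffer, so the numbers are only meaningful if the result is actually consumed, for example by printing it.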
Patch
I tried to optimize the zero detecting code with SSE instructions. The idea
comes from Paolo's patch "migration: vectorize is_dup_page". It's expected
to give us a noticeable improvement. But I didn't find any improvement in
the qemu-io test even though I increased the image size to 5GB. The
following is my test patch. Could you please review it to see if I made any
mistake and SSE can help for zero detecting?

Thanks.

diff --git a/block/qed.c b/block/qed.c
index 75a44f3..61e4a27 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -998,6 +998,14 @@ static void qed_aio_write_l2_update_cb(void *opaque, int ret)
     qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
 }
 
+#ifdef __SSE2__
+#include <emmintrin.h>
+#define VECTYPE __m128i
+#define SPLAT(p) _mm_set1_epi8(*(p))
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#define VECTYPE_ZERO _mm_setzero_si128()
+#endif
+
 /**
  * Determine if we have a zero write to a block of clusters
  *
@@ -1027,6 +1035,19 @@ static bool qed_is_zero_write(QEDAIOCB *acb)
         }
 
         v = iov->iov_base;
+
+#ifdef __SSE2__
+        if ((iov->iov_len & 0x0f)) {
+            VECTYPE zero = VECTYPE_ZERO;
+            VECTYPE *p = (VECTYPE *)v;
+            for(j = 0; j < iov->iov_len / sizeof(VECTYPE); j++) {
+                if (!ALL_EQ(p[j], zero)) {
+                    return false;
+                }
+            }
+            continue;
+        }
+#endif
         for (j = 0; j < iov->iov_len; j += sizeof(v[0])) {
             if (v[j >> 3]) {
                 return false;
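One thing that may be worth checking in the patch: the guard `if ((iov->iov_len & 0x0f))` is true only when the length is not a multiple of 16, so for typical cluster-sized buffers the SSE2 branch would appear to be skipped, and when it is taken the trailing bytes are not examined before `continue`. The sketch below shows one possible shape for a vectorized check that avoids both issues. It is an assumption-laden illustration, not the patch author's code: `buf_is_zero_vec()` is a hypothetical name, and it swaps the patch's direct `p[j]` dereference for unaligned loads (`_mm_loadu_si128`) so that no alignment of iov_base is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper (not a QEMU API): returns true if all len bytes
 * at buf are zero. Uses 16-byte SSE2 compares when available and a
 * byte loop for the tail (or for everything when SSE2 is absent). */
static bool buf_is_zero_vec(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    assert(buf != NULL || len == 0);
#ifdef __SSE2__
    const __m128i zero = _mm_setzero_si128();

    for (; i + sizeof(__m128i) <= len; i += sizeof(__m128i)) {
        /* Unaligned load: works for any buffer address. */
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        /* cmpeq sets each equal byte to 0xFF; movemask collects the
         * 16 top bits, so 0xFFFF means all 16 bytes were zero. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xFFFF) {
            return false;
        }
    }
#endif
    /* Remaining bytes that do not fill a whole vector. */
    for (; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}
```

Whether this beats the scalar loop would still need to be measured; the gcc output quoted in the comment above suggests the non-SSE loop is already fairly tight, and the check may be limited by memory bandwidth rather than by instruction count.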