[v2,2/3] qed: add zero write detection support

Message ID 4EE0C9D3.4000201@linux.vnet.ibm.com
State New

Commit Message

Mark Wu Dec. 8, 2011, 2:29 p.m. UTC
I tried to optimize the zero-detection code with SSE instructions.  The
idea comes from Paolo's patch "migration: vectorize is_dup_page".  I
expected a noticeable improvement, but I didn't see any in the qemu-io
test, even after increasing the image size to 5GB.  The following is my
test patch.  Could you please review it to see whether I made a mistake
and whether SSE can help with zero detection?

Thanks.


   * Determine if we have a zero write to a block of clusters
   *
@@ -1027,6 +1035,19 @@ static bool qed_is_zero_write(QEDAIOCB *acb)
          }

          v = iov->iov_base;
+
+#ifdef __SSE2__
+       if ((iov->iov_len & 0x0f)) {
+            VECTYPE zero = VECTYPE_ZERO;
+            VECTYPE *p = (VECTYPE *)v;
+            for(j = 0; j < iov->iov_len / sizeof(VECTYPE); j++) {
+                 if (!ALL_EQ(p[j], zero)) {
+                    return false;
+                 }
+            }
+            continue;
+        }
+#endif
          for (j = 0; j < iov->iov_len; j += sizeof(v[0])) {
              if (v[j >> 3]) {
                  return false;

Comments

Stefan Hajnoczi Dec. 8, 2011, 3:54 p.m. UTC | #1
On Thu, Dec 8, 2011 at 2:29 PM, Mark Wu <wudxw@linux.vnet.ibm.com> wrote:
> I tried to optimize the zero-detection code with SSE instructions.  The idea
> comes from Paolo's patch "migration: vectorize is_dup_page".  I expected a
> noticeable improvement, but I didn't see any in the qemu-io test, even after
> increasing the image size to 5GB.  The following is my test patch.  Could you
> please review it to see whether I made a mistake and whether SSE can help
> with zero detection?

Please put the zero detection function in a common location before
adding serious optimization so that qemu-img.c:is_not_zero() can also
use it.

Out of interest, here is the code that gcc 4.6.2 generates for the non-SSE loop:

    1d50:	89 c2                	mov    %eax,%edx
    1d52:	c1 fa 03             	sar    $0x3,%edx
    1d55:	48 63 d2             	movslq %edx,%rdx
    1d58:	48 83 3c d6 00       	cmpq   $0x0,(%rsi,%rdx,8)
    1d5d:	0f 85 03 ff ff ff    	jne    1c66 <qed_aio_write_data+0x146>
    1d63:	83 c0 08             	add    $0x8,%eax
    1d66:	48 63 d0             	movslq %eax,%rdx
    1d69:	48 39 d1             	cmp    %rdx,%rcx
    1d6c:	77 e2                	ja     1d50 <qed_aio_write_data+0x230>

Once you have the zero detection code in a utility function it's easy
to write a small test program to run a performance benchmark.

Stefan
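
For illustration, a minimal sketch of what a shared zero-detection helper
could look like. The name buffer_is_zero_sketch is hypothetical and is not
an existing QEMU function. Two points about the posted hunk motivate the
shape of the sketch. First, as written, the hunk enters the SSE2 branch only
when (iov_len & 0x0f) is non-zero, that is, when the length is not a
multiple of 16; if the qemu-io requests are sector-sized (an assumption),
the vector path would never run, which could explain the unchanged timings.
Second, dereferencing a VECTYPE pointer assumes a 16-byte-aligned buffer.
The sketch avoids both by using unaligned loads for every full 16-byte
chunk and checking any remaining tail bytes one at a time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __SSE2__
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper (not part of QEMU): returns true if all len bytes
 * starting at buf are zero. Makes no alignment or length assumptions.
 */
static bool buffer_is_zero_sketch(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

#ifdef __SSE2__
    const __m128i zero = _mm_setzero_si128();

    /* Vector path: one unaligned 16-byte load per iteration. */
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));

        /*
         * cmpeq sets each byte to 0xFF where v matches zero; movemask
         * packs the top bit of each byte, so 0xFFFF means all 16 match.
         */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) != 0xFFFF) {
            return false;
        }
    }
#else
    /* Scalar fallback: 8 bytes at a time, memcpy avoids unaligned reads. */
    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t w;

        memcpy(&w, p + i, sizeof(w));
        if (w) {
            return false;
        }
    }
#endif

    /* Tail: whatever is left after the last full chunk. */
    for (; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}
```

With a helper of this shape, the per-iovec loop in qed_is_zero_write()
could call it once per iov_base/iov_len pair, and a standalone benchmark
could time repeated calls over a large zero-filled buffer against the
scalar loop. Both uses are suggestions, not something this thread confirms.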

Patch

diff --git a/block/qed.c b/block/qed.c
index 75a44f3..61e4a27 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -998,6 +998,14 @@  static void qed_aio_write_l2_update_cb(void *opaque, int ret)
      qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
  }

+#ifdef __SSE2__
+#include <emmintrin.h>
+#define VECTYPE        __m128i
+#define SPLAT(p)       _mm_set1_epi8(*(p))
+#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
+#define VECTYPE_ZERO   _mm_setzero_si128()
+#endif
+
  /**