Patchwork [05/21] tcg-i386: Tidy bswap operations.

Submitter malc
Date April 19, 2010, 4:05 p.m.
Message ID <alpine.LNX.2.00.1004192004080.1477@linmac>
Permalink /patch/50473/
State New

Comments

malc - April 19, 2010, 4:05 p.m.
On Mon, 19 Apr 2010, Richard Henderson wrote:

> On 04/18/2010 05:13 PM, Aurelien Jarno wrote:
> > On Tue, Apr 13, 2010 at 04:33:59PM -0700, Richard Henderson wrote:
> >> Define OPC_BSWAP.  Factor opcode emission to separate functions.
> >> Use bswap+shift to implement 16-bit swap instead of a rolw; this
> >> gets the proper zero-extension required by INDEX_op_bswap16_i32.
> > 
> > This is not required by INDEX_op_bswap16_i32. What is needed is that the
> > value in the input register has the 16 upper bits set to 0.
> 
> Ah.

Apparently i'm not the only one who misinterpreted this bit of bswap
documentation. How about:

 

> 
> > Considering
> > that, the rolw instruction is faster than bswap + shift.
> 
> Well, no, it isn't.
> 
>  static inline int test_rolw(unsigned short *s)
>  {
>    int i, start, end;
>    asm volatile("rdtsc\n\t"
>                 "movl %%eax, %1\n\t"
>                 "movzwl %3,%2\n\t"
>                 "rolw $8, %w2\n\t"
>                 "addl $1,%2\n\t"
>                 "rdtsc"
>                 : "=&a"(end), "=&r"(start), "=r"(i) : "m"(*s) : "edx");
>    return end - start;
>  }
>  
>  static inline int test_bswap(unsigned short *s)
>  {
>    int i, start, end;
>    asm volatile("rdtsc\n\t"
>                 "movl %%eax, %1\n\t"
>                 "movzwl %3,%2\n\t"
>                 "bswap %2\n\t"
>                 "shr $16,%2\n\t"
>                 "addl $1,%2\n\t"
>                 "rdtsc"
>                 : "=&a"(end), "=&r"(start), "=r"(i) : "m"(*s) : "edx");
>    return end - start;
>  }
> 
> 
> model name	: Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
>  rolw	   60   60   72   60   60   72   60   60   72   60
>  bswap	   60   60   60   60   60   60   60   60   60   60
> 
> model name	: Dual-Core AMD Opteron(tm) Processor 1210
>  rolw	    9   10    9    9    8    8    8    8    8    8
>  bswap	    9    9    8    8    8    8    8    8    8    8
> 
> The rolw sequence isn't ever faster, and it's more unstable,
> likely due to the partial register stall I mentioned.
> 
> I will grant that the rolw sequence is smaller, and I can 
> adjust this patch to use that sequence if you wish.
> 
> 
> r~
> 
>
Richard Henderson - April 19, 2010, 7:19 p.m.
On 04/19/2010 11:05 AM, malc wrote:
> Apparently i'm not the only one who misinterpreted this bit of bswap
> documentation. How about:
> 
> diff --git a/tcg/README b/tcg/README
> index 68d27ff..5b39a38 100644
> --- a/tcg/README
> +++ b/tcg/README
> @@ -269,7 +269,7 @@ ext32u_i64 t0, t1
>  * bswap16_i32/i64 t0, t1
>  
>  16 bit byte swap on a 32/64 bit value. It assumes that the two/six high order
> -bytes are set to zero.
> +bytes of t1 are set to zero.

Ok by me.  You should also adjust the bswap32 documentation just below.


r~
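
For readers following the thread, the point about the high-order bytes of t1 can be illustrated with a small C model. This is only a sketch: the function names below are hypothetical and do not appear in QEMU, and the code models the two x86 sequences discussed above (rolw $8 versus bswap followed by a 16-bit right shift) in portable C rather than emitting them.

```c
#include <assert.h>   /* for the checks shown in the note below */
#include <stdint.h>

/* Hypothetical helper: byte-reverse all 32 bits, as x86 "bswap" does. */
uint32_t model_bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hypothetical model of "rolw $8, %w0": swaps the two low bytes and
 * leaves bits 16..31 of the register untouched. */
uint32_t model_rolw8(uint32_t x)
{
    uint32_t lo = x & 0xffffu;
    uint32_t swapped = ((lo << 8) | (lo >> 8)) & 0xffffu;
    return (x & 0xffff0000u) | swapped;
}

/* Hypothetical model of "bswap %0; shr $16, %0": the swapped low half
 * ends up zero-extended regardless of the input's high bits. */
uint32_t model_bswap_shr16(uint32_t x)
{
    return model_bswap32(x) >> 16;
}
```

With an input whose high half is already zero, as the README wording requires, both models agree: assert(model_rolw8(0x0000AABBu) == model_bswap_shr16(0x0000AABBu)) holds. With a non-zero high half such as 0x1234AABB they differ, since the rolw model keeps 0x1234 in the upper bits while the bswap+shift model clears them. That is why the precondition on t1 makes either sequence acceptable.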

Patch

diff --git a/tcg/README b/tcg/README
index 68d27ff..5b39a38 100644
--- a/tcg/README
+++ b/tcg/README
@@ -269,7 +269,7 @@  ext32u_i64 t0, t1
 * bswap16_i32/i64 t0, t1
 
 16 bit byte swap on a 32/64 bit value. It assumes that the two/six high order
-bytes are set to zero.
+bytes of t1 are set to zero.
 
 * bswap32_i32/i64 t0, t1