
[07/10] tb hash: hash phys_pc, pc, and flags with xxhash

Message ID 1459834253-8291-8-git-send-email-cota@braap.org
State New

Commit Message

Emilio Cota April 5, 2016, 5:30 a.m. UTC
For some workloads such as arm bootup, tb_phys_hash is performance-critical.
This is due to the high frequency of accesses to the hash table, caused
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended patch addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
   high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TBs while others got very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though.

BTW, after this change the hash table should not be called "tb_phys_hash"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c             |  2 +-
 include/exec/tb-hash.h | 19 +++++++++++++++++--
 translate-all.c        |  6 +++---
 3 files changed, 21 insertions(+), 6 deletions(-)

Comments

Richard Henderson April 5, 2016, 3:41 p.m. UTC | #1
On 04/04/2016 10:30 PM, Emilio G. Cota wrote:
> Tests show that the other element checked for in tb_find_physical,
> cs_base, is always a match when tb_phys+pc+flags are a match,
> so hashing cs_base is wasteful. It could be that this is an ARM-only
> thing, though.

The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB 
consisting of only a delay slot).

It may well still turn out to be reasonable to ignore cs_base for hashing.

> +static inline
> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t flags)
>   {
> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    struct key {
> +        tb_page_addr_t phys_pc;
> +        target_ulong pc;
> +        uint64_t flags;

The flags field should be uint32_t, not uint64_t.
See its definition in TranslationBlock.

> +    } QEMU_PACKED k;
> +    unsigned int hash;
> +
> +    k.phys_pc = phys_pc;
> +    k.pc = pc;
> +    k.flags = flags;
> +    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);

I'm less than happy about putting data into a struct and hashing that memory. 
Why don't you just take the ideas from xxh32/xxh64 and perform the hash 
directly right here?

In particular, you've 3 data elements here, a maximum of 5 32-bit words, thus 
the loop within xxh32 will never roll, so it's mostly dead code.


r~
Paolo Bonzini April 5, 2016, 3:48 p.m. UTC | #2
On 05/04/2016 17:41, Richard Henderson wrote:
> 
>> +static inline
>> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc,
>> uint64_t flags)
>>   {
>> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
>> +    struct key {
>> +        tb_page_addr_t phys_pc;
>> +        target_ulong pc;
>> +        uint64_t flags;
> 
> The flags field should be uint32_t, not uint64_t.
> See its definition in TranslationBlock.
> 
>> +    } QEMU_PACKED k;
>> +    unsigned int hash;
>> +
>> +    k.phys_pc = phys_pc;
>> +    k.pc = pc;
>> +    k.flags = flags;
>> +    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
> 
> I'm less than happy about putting data into a struct and hashing that
> memory. Why don't you just take the ideas from xxh32/xxh64 and perform
> the hash directly right here?
> 
> In particular, you've 3 data elements here, a maximum of 5 32-bit words,
> thus the loop within xxh32 will never roll, so it's mostly dead code.

I think it's fine to use the struct.  The exact size of the struct
varies from 3 to 5 32-bit words, so it's hard to write nice
size-dependent code for the hash.

Paolo
Richard Henderson April 5, 2016, 4:07 p.m. UTC | #3
On 04/05/2016 08:48 AM, Paolo Bonzini wrote:
> I think it's fine to use the struct.  The exact size of the struct
> varies from 3 to 5 32-bit words, so it's hard to write nice
> size-dependent code for the hash.

I don't think it is.  We have 3 integers.  It is trivial to create a simple 
function of 2 multiplies, two adds, and a remainder.

Take the primes from the xxhash.h, for example:

   (phys_pc * PRIME32_2 + pc * PRIME32_3 + flags)
   % PRIME32_1
   & (CODE_GEN_PHYS_HASH_SIZE - 1)

Obviously, some bucket measurements should be taken, but I can well imagine 
that this might perform just as well as the fully generic hasher.
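
For concreteness, a minimal sketch of that suggestion (the function name is
hypothetical; PRIME32_1/2/3 are the constants defined in xxhash.h) could be:

static inline unsigned int tb_simple_hash_func(tb_page_addr_t phys_pc,
                                               target_ulong pc, uint32_t flags)
{
    /* two multiplies, two adds, a remainder, then mask to the table size */
    return ((phys_pc * PRIME32_2 + pc * PRIME32_3 + flags) % PRIME32_1)
           & (CODE_GEN_PHYS_HASH_SIZE - 1);
}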


r~
Laurent Desnogues April 5, 2016, 4:33 p.m. UTC | #4
On Tue, Apr 5, 2016 at 5:41 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 04/04/2016 10:30 PM, Emilio G. Cota wrote:
>>
>> Tests show that the other element checked for in tb_find_physical,
>> cs_base, is always a match when tb_phys+pc+flags are a match,
>> so hashing cs_base is wasteful. It could be that this is an ARM-only
>> thing, though.
>
>
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a
> TB consisting of only a delay slot).
>
> It may well still turn out to be reasonable to ignore cs_base for hashing.
>
>> +static inline
>> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t
>> flags)
>>   {
>> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
>> +    struct key {
>> +        tb_page_addr_t phys_pc;
>> +        target_ulong pc;
>> +        uint64_t flags;
>
>
> The flags field should be uint32_t, not uint64_t.
> See its definition in TranslationBlock.

The 'flags' field is 64-bit.  You're thinking of cflags, I guess.


Laurent
Richard Henderson April 5, 2016, 5:19 p.m. UTC | #5
On 04/05/2016 09:33 AM, Laurent Desnogues wrote:
> The 'flags' field is 64-bit.  You're thinking of cflags, I guess.

Well that's silly.  Since it's filled in via

static inline void cpu_get_tb_cpu_state(CPUMIPSState *env, target_ulong *pc,
                                        target_ulong *cs_base, int *flags)

and passed back in to generate code with

TranslationBlock *tb_gen_code(CPUState *cpu,
                              target_ulong pc, target_ulong cs_base, int flags,
                              int cflags);

So while TranslationBlock stores "uint64_t", the producer and consumer see "int".


r~
Emilio Cota April 5, 2016, 7:40 p.m. UTC | #6
On Tue, Apr 05, 2016 at 09:07:57 -0700, Richard Henderson wrote:
> On 04/05/2016 08:48 AM, Paolo Bonzini wrote:
> >I think it's fine to use the struct.  The exact size of the struct
> >varies from 3 to 5 32-bit words, so it's hard to write nice
> >size-dependent code for the hash.
> 
> I don't think it is.  We have 3 integers.  It is trivial to create a simple
> function of 2 multiplies, two adds, and a remainder.
> 
> Take the primes from the xxhash.h, for example:
> 
>   (phys_pc * PRIME32_2 + pc * PRIME32_3 + flags)
>   % PRIME32_1
>   & (CODE_GEN_PHYS_HASH_SIZE - 1)
> 
> Obviously, some bucket measurements should be taken, but I can well imagine
> that this might perform just as well as the fully generic hasher.

That function doesn't perform well: 25.06s vs. 21.18s with xxh32.

Having the packed struct and passing it to an *inlined* xxhash is
virtually unbeatable; gcc (>=v4.6, dunno about older ones) optimizes the
inline function since it knows the size of the struct.
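
The shape being relied on is roughly the following (a sketch of the patch's
tb_hash_func, assuming qemu_xxh32() is declared static inline in
qemu/xxhash.h with the signature the patch uses):

static inline
uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t flags)
{
    struct key {
        tb_page_addr_t phys_pc;
        target_ulong pc;
        uint64_t flags;
    } QEMU_PACKED k = { phys_pc, pc, flags };

    /* sizeof(k) is a compile-time constant, so once qemu_xxh32() is
     * inlined the word loop is fully unrolled and the unused tail
     * handling is eliminated as dead code. */
    return qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1)
           & (CODE_GEN_PHYS_HASH_SIZE - 1);
}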

To show this I'm appending the generated code for tb_hash_func when xxh32
is inlined vs. when it is not, for x86_64-softmmu. Results are similar
for arm-softmmu.

Anyway (for the arm bootup test) we're talking about ~0.50% of runtime spent
in tb_hash_func (with xxh32 inlined), so whatever we did here could not
improve overall performance much.

Thanks,

		Emilio

* no inline:

00000000001a4e60 <qemu_xxh32>:
  1a4e60:	48 83 ec 18          	sub    $0x18,%rsp
  1a4e64:	4c 8d 0c b7          	lea    (%rdi,%rsi,4),%r9
  1a4e68:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
  1a4e6f:	00 00 
  1a4e71:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
  1a4e76:	31 c0                	xor    %eax,%eax
  1a4e78:	48 83 fe 03          	cmp    $0x3,%rsi
  1a4e7c:	8d 82 b1 67 56 16    	lea    0x165667b1(%rdx),%eax
  1a4e82:	0f 86 92 00 00 00    	jbe    1a4f1a <qemu_xxh32+0xba>
  1a4e88:	4d 8d 59 f0          	lea    -0x10(%r9),%r11
  1a4e8c:	44 8d 82 28 44 23 24 	lea    0x24234428(%rdx),%r8d
  1a4e93:	8d 8a 77 ca eb 85    	lea    -0x7a143589(%rdx),%ecx
  1a4e99:	8d 82 4f 86 c8 61    	lea    0x61c8864f(%rdx),%eax
  1a4e9f:	90                   	nop
  1a4ea0:	44 8b 17             	mov    (%rdi),%r10d
  1a4ea3:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4eaa:	45 01 d0             	add    %r10d,%r8d
  1a4ead:	44 8b 57 04          	mov    0x4(%rdi),%r10d
  1a4eb1:	41 c1 c0 0d          	rol    $0xd,%r8d
  1a4eb5:	45 69 c0 b1 79 37 9e 	imul   $0x9e3779b1,%r8d,%r8d
  1a4ebc:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4ec3:	44 01 d1             	add    %r10d,%ecx
  1a4ec6:	44 8b 57 08          	mov    0x8(%rdi),%r10d
  1a4eca:	c1 c1 0d             	rol    $0xd,%ecx
  1a4ecd:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
  1a4ed3:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4eda:	44 01 d2             	add    %r10d,%edx
  1a4edd:	44 8b 57 0c          	mov    0xc(%rdi),%r10d
  1a4ee1:	48 83 c7 10          	add    $0x10,%rdi
  1a4ee5:	c1 c2 0d             	rol    $0xd,%edx
  1a4ee8:	69 d2 b1 79 37 9e    	imul   $0x9e3779b1,%edx,%edx
  1a4eee:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4ef5:	44 01 d0             	add    %r10d,%eax
  1a4ef8:	c1 c0 0d             	rol    $0xd,%eax
  1a4efb:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
  1a4f01:	49 39 fb             	cmp    %rdi,%r11
  1a4f04:	73 9a                	jae    1a4ea0 <qemu_xxh32+0x40>
  1a4f06:	c1 c9 19             	ror    $0x19,%ecx
  1a4f09:	41 c1 c8 1f          	ror    $0x1f,%r8d
  1a4f0d:	c1 ca 14             	ror    $0x14,%edx
  1a4f10:	44 01 c1             	add    %r8d,%ecx
  1a4f13:	c1 c8 0e             	ror    $0xe,%eax
  1a4f16:	01 ca                	add    %ecx,%edx
  1a4f18:	01 d0                	add    %edx,%eax
  1a4f1a:	4c 39 cf             	cmp    %r9,%rdi
  1a4f1d:	8d 34 b0             	lea    (%rax,%rsi,4),%esi
  1a4f20:	73 22                	jae    1a4f44 <qemu_xxh32+0xe4>
  1a4f22:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  1a4f28:	8b 17                	mov    (%rdi),%edx
  1a4f2a:	48 83 c7 04          	add    $0x4,%rdi
  1a4f2e:	69 c2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%eax
  1a4f34:	01 c6                	add    %eax,%esi
  1a4f36:	c1 c6 11             	rol    $0x11,%esi
  1a4f39:	69 f6 2f eb d4 27    	imul   $0x27d4eb2f,%esi,%esi
  1a4f3f:	49 39 f9             	cmp    %rdi,%r9
  1a4f42:	77 e4                	ja     1a4f28 <qemu_xxh32+0xc8>
  1a4f44:	89 f0                	mov    %esi,%eax
  1a4f46:	c1 e8 0f             	shr    $0xf,%eax
  1a4f49:	31 f0                	xor    %esi,%eax
  1a4f4b:	69 d0 77 ca eb 85    	imul   $0x85ebca77,%eax,%edx
  1a4f51:	89 d0                	mov    %edx,%eax
  1a4f53:	c1 e8 0d             	shr    $0xd,%eax
  1a4f56:	31 d0                	xor    %edx,%eax
  1a4f58:	69 d0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%edx
  1a4f5e:	89 d0                	mov    %edx,%eax
  1a4f60:	c1 e8 10             	shr    $0x10,%eax
  1a4f63:	31 d0                	xor    %edx,%eax
  1a4f65:	48 8b 54 24 08       	mov    0x8(%rsp),%rdx
  1a4f6a:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
  1a4f71:	00 00 
  1a4f73:	75 05                	jne    1a4f7a <qemu_xxh32+0x11a>
  1a4f75:	48 83 c4 18          	add    $0x18,%rsp
  1a4f79:	c3                   	retq   
  1a4f7a:	e8 f1 7a fe ff       	callq  18ca70 <__stack_chk_fail@plt>
  1a4f7f:	90                   	nop

00000000001a4f80 <tb_hash_func>:
  1a4f80:	48 83 ec 28          	sub    $0x28,%rsp
  1a4f84:	48 89 3c 24          	mov    %rdi,(%rsp)
  1a4f88:	48 89 74 24 08       	mov    %rsi,0x8(%rsp)
  1a4f8d:	48 89 e7             	mov    %rsp,%rdi
  1a4f90:	89 54 24 10          	mov    %edx,0x10(%rsp)
  1a4f94:	be 05 00 00 00       	mov    $0x5,%esi
  1a4f99:	ba 01 00 00 00       	mov    $0x1,%edx
  1a4f9e:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
  1a4fa5:	00 00 
  1a4fa7:	48 89 44 24 18       	mov    %rax,0x18(%rsp)
  1a4fac:	31 c0                	xor    %eax,%eax
  1a4fae:	e8 ad fe ff ff       	callq  1a4e60 <qemu_xxh32>
  1a4fb3:	48 8b 54 24 18       	mov    0x18(%rsp),%rdx
  1a4fb8:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
  1a4fbf:	00 00 
  1a4fc1:	75 05                	jne    1a4fc8 <tb_hash_func+0x48>
  1a4fc3:	48 83 c4 28          	add    $0x28,%rsp
  1a4fc7:	c3                   	retq   
  1a4fc8:	e8 a3 7a fe ff       	callq  18ca70 <__stack_chk_fail@plt>
  1a4fcd:	0f 1f 00             	nopl   (%rax)

* inline:

00000000001a6800 <tb_hash_func>:
  1a6800:	48 83 ec 28          	sub    $0x28,%rsp
  1a6804:	69 cf 77 ca eb 85    	imul   $0x85ebca77,%edi,%ecx
  1a680a:	48 89 3c 24          	mov    %rdi,(%rsp)
  1a680e:	48 c1 ef 20          	shr    $0x20,%rdi
  1a6812:	69 ff 77 ca eb 85    	imul   $0x85ebca77,%edi,%edi
  1a6818:	48 89 74 24 08       	mov    %rsi,0x8(%rsp)
  1a681d:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
  1a6824:	00 00 
  1a6826:	48 89 44 24 18       	mov    %rax,0x18(%rsp)
  1a682b:	31 c0                	xor    %eax,%eax
  1a682d:	81 c1 29 44 23 24    	add    $0x24234429,%ecx
  1a6833:	69 c6 77 ca eb 85    	imul   $0x85ebca77,%esi,%eax
  1a6839:	48 c1 ee 20          	shr    $0x20,%rsi
  1a683d:	81 ef 88 35 14 7a    	sub    $0x7a143588,%edi
  1a6843:	69 f6 77 ca eb 85    	imul   $0x85ebca77,%esi,%esi
  1a6849:	c1 c9 13             	ror    $0x13,%ecx
  1a684c:	c1 cf 13             	ror    $0x13,%edi
  1a684f:	83 c0 01             	add    $0x1,%eax
  1a6852:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
  1a6858:	c1 c8 13             	ror    $0x13,%eax
  1a685b:	81 c6 50 86 c8 61    	add    $0x61c88650,%esi
  1a6861:	69 ff b1 79 37 9e    	imul   $0x9e3779b1,%edi,%edi
  1a6867:	c1 ce 13             	ror    $0x13,%esi
  1a686a:	c1 c9 1f             	ror    $0x1f,%ecx
  1a686d:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
  1a6873:	c1 cf 19             	ror    $0x19,%edi
  1a6876:	69 f6 b1 79 37 9e    	imul   $0x9e3779b1,%esi,%esi
  1a687c:	8d 7c 39 14          	lea    0x14(%rcx,%rdi,1),%edi
  1a6880:	c1 c8 14             	ror    $0x14,%eax
  1a6883:	69 d2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%edx
  1a6889:	01 f8                	add    %edi,%eax
  1a688b:	c1 ce 0e             	ror    $0xe,%esi
  1a688e:	01 c6                	add    %eax,%esi
  1a6890:	01 f2                	add    %esi,%edx
  1a6892:	c1 ca 0f             	ror    $0xf,%edx
  1a6895:	69 d2 2f eb d4 27    	imul   $0x27d4eb2f,%edx,%edx
  1a689b:	89 d0                	mov    %edx,%eax
  1a689d:	c1 e8 0f             	shr    $0xf,%eax
  1a68a0:	31 d0                	xor    %edx,%eax
  1a68a2:	69 d0 77 ca eb 85    	imul   $0x85ebca77,%eax,%edx
  1a68a8:	89 d0                	mov    %edx,%eax
  1a68aa:	c1 e8 0d             	shr    $0xd,%eax
  1a68ad:	31 d0                	xor    %edx,%eax
  1a68af:	69 d0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%edx
  1a68b5:	89 d0                	mov    %edx,%eax
  1a68b7:	c1 e8 10             	shr    $0x10,%eax
  1a68ba:	31 d0                	xor    %edx,%eax
  1a68bc:	48 8b 54 24 18       	mov    0x18(%rsp),%rdx
  1a68c1:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
  1a68c8:	00 00 
  1a68ca:	75 05                	jne    1a68d1 <tb_hash_func+0xd1>
  1a68cc:	48 83 c4 28          	add    $0x28,%rsp
  1a68d0:	c3                   	retq   
  1a68d1:	e8 9a 61 fe ff       	callq  18ca70 <__stack_chk_fail@plt>
  1a68d6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1a68dd:	00 00 00
Richard Henderson April 5, 2016, 9:08 p.m. UTC | #7
On 04/05/2016 12:40 PM, Emilio G. Cota wrote:
> On Tue, Apr 05, 2016 at 09:07:57 -0700, Richard Henderson wrote:
>> On 04/05/2016 08:48 AM, Paolo Bonzini wrote:
>>> I think it's fine to use the struct.  The exact size of the struct
>>> varies from 3 to 5 32-bit words, so it's hard to write nice
>>> size-dependent code for the hash.
>>
>> I don't think it is.  We have 3 integers.  It is trivial to create a simple
>> function of 2 multiplies, two adds, and a remainder.
>>
>> Take the primes from the xxhash.h, for example:
>>
>>   (phys_pc * PRIME32_2 + pc * PRIME32_3 + flags)
>>   % PRIME32_1
>>   & (CODE_GEN_PHYS_HASH_SIZE - 1)
>>
>> Obviously, some bucket measurements should be taken, but I can well imagine
>> that this might perform just as well as the fully generic hasher.
>
> That function doesn't perform well: 25.06s vs. 21.18s with xxh32.

Sure, that's just what came off the top of my head.

But the point is that we can do better than dropping data into memory,
particularly on hosts that do not support unaligned accesses, which is what
the packed structure creates.


> * inline:
>
> 00000000001a6800 <tb_hash_func>:
>   1a6800:	48 83 ec 28          	sub    $0x28,%rsp
>   1a6804:	69 cf 77 ca eb 85    	imul   $0x85ebca77,%edi,%ecx
>   1a680a:	48 89 3c 24          	mov    %rdi,(%rsp)
>   1a680e:	48 c1 ef 20          	shr    $0x20,%rdi
>   1a6812:	69 ff 77 ca eb 85    	imul   $0x85ebca77,%edi,%edi
>   1a6818:	48 89 74 24 08       	mov    %rsi,0x8(%rsp)
>   1a681d:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
>   1a6824:	00 00
>   1a6826:	48 89 44 24 18       	mov    %rax,0x18(%rsp)
>   1a682b:	31 c0                	xor    %eax,%eax
>   1a682d:	81 c1 29 44 23 24    	add    $0x24234429,%ecx
>   1a6833:	69 c6 77 ca eb 85    	imul   $0x85ebca77,%esi,%eax
>   1a6839:	48 c1 ee 20          	shr    $0x20,%rsi
>   1a683d:	81 ef 88 35 14 7a    	sub    $0x7a143588,%edi
>   1a6843:	69 f6 77 ca eb 85    	imul   $0x85ebca77,%esi,%esi
>   1a6849:	c1 c9 13             	ror    $0x13,%ecx
>   1a684c:	c1 cf 13             	ror    $0x13,%edi
>   1a684f:	83 c0 01             	add    $0x1,%eax
>   1a6852:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
>   1a6858:	c1 c8 13             	ror    $0x13,%eax
>   1a685b:	81 c6 50 86 c8 61    	add    $0x61c88650,%esi
>   1a6861:	69 ff b1 79 37 9e    	imul   $0x9e3779b1,%edi,%edi
>   1a6867:	c1 ce 13             	ror    $0x13,%esi
>   1a686a:	c1 c9 1f             	ror    $0x1f,%ecx
>   1a686d:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
>   1a6873:	c1 cf 19             	ror    $0x19,%edi
>   1a6876:	69 f6 b1 79 37 9e    	imul   $0x9e3779b1,%esi,%esi
>   1a687c:	8d 7c 39 14          	lea    0x14(%rcx,%rdi,1),%edi
>   1a6880:	c1 c8 14             	ror    $0x14,%eax
>   1a6883:	69 d2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%edx
>   1a6889:	01 f8                	add    %edi,%eax
>   1a688b:	c1 ce 0e             	ror    $0xe,%esi
>   1a688e:	01 c6                	add    %eax,%esi
>   1a6890:	01 f2                	add    %esi,%edx
>   1a6892:	c1 ca 0f             	ror    $0xf,%edx
>   1a6895:	69 d2 2f eb d4 27    	imul   $0x27d4eb2f,%edx,%edx
>   1a689b:	89 d0                	mov    %edx,%eax
>   1a689d:	c1 e8 0f             	shr    $0xf,%eax
>   1a68a0:	31 d0                	xor    %edx,%eax
>   1a68a2:	69 d0 77 ca eb 85    	imul   $0x85ebca77,%eax,%edx
>   1a68a8:	89 d0                	mov    %edx,%eax
>   1a68aa:	c1 e8 0d             	shr    $0xd,%eax
>   1a68ad:	31 d0                	xor    %edx,%eax
>   1a68af:	69 d0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%edx
>   1a68b5:	89 d0                	mov    %edx,%eax
>   1a68b7:	c1 e8 10             	shr    $0x10,%eax
>   1a68ba:	31 d0                	xor    %edx,%eax
>   1a68bc:	48 8b 54 24 18       	mov    0x18(%rsp),%rdx
>   1a68c1:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
>   1a68c8:	00 00
>   1a68ca:	75 05                	jne    1a68d1 <tb_hash_func+0xd1>
>   1a68cc:	48 83 c4 28          	add    $0x28,%rsp
>   1a68d0:	c3                   	retq

If I inline everything explicitly, we do even better.


r~


0000000000000000 <tb_hash_func>:
    0:	44 69 c6 77 ca eb 85 	imul   $0x85ebca77,%esi,%r8d
    7:	48 c1 ee 20          	shr    $0x20,%rsi
    b:	69 cf 77 ca eb 85    	imul   $0x85ebca77,%edi,%ecx
   11:	48 c1 ef 20          	shr    $0x20,%rdi
   15:	41 83 c0 01          	add    $0x1,%r8d
   19:	41 c1 c0 0d          	rol    $0xd,%r8d
   1d:	81 c1 29 44 23 24    	add    $0x24234429,%ecx
   23:	69 c6 77 ca eb 85    	imul   $0x85ebca77,%esi,%eax
   29:	c1 c1 0d             	rol    $0xd,%ecx
   2c:	41 69 f0 b1 79 37 9e 	imul   $0x9e3779b1,%r8d,%esi
   33:	05 50 86 c8 61       	add    $0x61c88650,%eax
   38:	69 ff 77 ca eb 85    	imul   $0x85ebca77,%edi,%edi
   3e:	c1 c6 0c             	rol    $0xc,%esi
   41:	c1 c0 0d             	rol    $0xd,%eax
   44:	69 d2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%edx
   4a:	81 ef 88 35 14 7a    	sub    $0x7a143588,%edi
   50:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
   56:	8d 54 16 14          	lea    0x14(%rsi,%rdx,1),%edx
   5a:	c1 c7 0d             	rol    $0xd,%edi
   5d:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
   63:	d1 c1                	rol    %ecx
   65:	01 d1                	add    %edx,%ecx
   67:	c1 c8 0e             	ror    $0xe,%eax
   6a:	69 d7 b1 79 37 9e    	imul   $0x9e3779b1,%edi,%edx
   70:	01 c8                	add    %ecx,%eax
   72:	c1 c2 07             	rol    $0x7,%edx
   75:	01 d0                	add    %edx,%eax
   77:	c1 c8 0f             	ror    $0xf,%eax
   7a:	69 c0 2f eb d4 27    	imul   $0x27d4eb2f,%eax,%eax
   80:	89 c2                	mov    %eax,%edx
   82:	c1 ea 0f             	shr    $0xf,%edx
   85:	31 d0                	xor    %edx,%eax
   87:	69 c0 77 ca eb 85    	imul   $0x85ebca77,%eax,%eax
   8d:	89 c2                	mov    %eax,%edx
   8f:	c1 ea 0d             	shr    $0xd,%edx
   92:	31 d0                	xor    %edx,%eax
   94:	69 c0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%eax
   9a:	89 c2                	mov    %eax,%edx
   9c:	c1 ea 10             	shr    $0x10,%edx
   9f:	31 d0                	xor    %edx,%eax
   a1:	c3                   	retq


#define PRIME32_1   2654435761U
#define PRIME32_2   2246822519U
#define PRIME32_3   3266489917U
#define PRIME32_4    668265263U
#define PRIME32_5    374761393U
#define XXH_rotl32(x, r) ((x << r) | (x >> (32 - r)))

unsigned tb_hash_func(unsigned long phys_pc, unsigned long pc, unsigned flags)
{
   unsigned h32, seed = 1;
   unsigned w0, w1, w2, w3, w4;
   unsigned v1, v2, v3, v4;

   w0 = phys_pc;
   w1 = phys_pc >> 31 >> 1;
   w2 = pc;
   w3 = pc >> 31 >> 1;
   w4 = flags;

   v1 = seed + PRIME32_1 + PRIME32_2;
   v2 = seed + PRIME32_2;
   v3 = seed + 0;
   v4 = seed - PRIME32_1;

   v1 += w0 * PRIME32_2;
   v1 = XXH_rotl32(v1, 13);
   v1 *= PRIME32_1;
   v2 += w1 * PRIME32_2;
   v2 = XXH_rotl32(v2, 13);
   v2 *= PRIME32_1;
   v3 += w2 * PRIME32_2;
   v3 = XXH_rotl32(v3, 13);
   v3 *= PRIME32_1;
   v4 += w3 * PRIME32_2;
   v4 = XXH_rotl32(v4, 13);
   v4 *= PRIME32_1;

   h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7)
	+ XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

   h32 += 20;

   h32 += w4 * PRIME32_3;
   h32  = XXH_rotl32(h32, 17) * PRIME32_4;

   h32 ^= h32 >> 15;
   h32 *= PRIME32_2;
   h32 ^= h32 >> 13;
   h32 *= PRIME32_3;
   h32 ^= h32 >> 16;

   return h32;
}
Laurent Desnogues April 6, 2016, 6:06 a.m. UTC | #8
On Tue, Apr 5, 2016 at 7:19 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 04/05/2016 09:33 AM, Laurent Desnogues wrote:
>> The 'flags' field is 64-bit.  You're thinking of cflags, I guess.
>
> Well that's silly.  Since it's filled in via
>
> static inline void cpu_get_tb_cpu_state(CPUMIPSState *env, target_ulong *pc,
>                                         target_ulong *cs_base, int *flags)
>
> and passed back in to generate code with
>
> TranslationBlock *tb_gen_code(CPUState *cpu,
>                               target_ulong pc, target_ulong cs_base, int flags,
>                               int cflags);
>
> So while TranslationBlock stores "uint64_t", the producer and consumer see "int".

I agree.  I guess TranslationBlock should be fixed to use uint32_t
(note several functions have to be changed from using int to uint32_t
or aarch64-softmmu will fail).


Laurent

Patch

diff --git a/cpu-exec.c b/cpu-exec.c
index bbfcbfb..4194a4a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -233,7 +233,7 @@  static TranslationBlock *tb_find_physical(CPUState *cpu,
     /* find translated block using physical mappings */
     phys_pc = get_page_addr_code(env, pc);
     phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, pc, flags);
     ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     for(;;) {
         tb = *ptb1;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 0f4e8a0..c4942e1 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,6 +20,9 @@ 
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
+#include "exec/exec-all.h"
+#include "qemu/xxhash.h"
+
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
    addresses on the same page.  The top bits are the same.  This allows
    TLB invalidation to quickly clear a subset of the hash table.  */
@@ -43,9 +46,21 @@  static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
+static inline
+uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t flags)
 {
-    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    struct key {
+        tb_page_addr_t phys_pc;
+        target_ulong pc;
+        uint64_t flags;
+    } QEMU_PACKED k;
+    unsigned int hash;
+
+    k.phys_pc = phys_pc;
+    k.pc = pc;
+    k.flags = flags;
+    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
+    return hash & (CODE_GEN_PHYS_HASH_SIZE - 1);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index 8329ea6..8e7f9a7 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -972,7 +972,7 @@  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
 
     /* remove the TB from the page list */
@@ -1474,8 +1474,8 @@  static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     unsigned int h;
     TranslationBlock **ptb;
 
-    /* add in the physical hash table */
-    h = tb_phys_hash_func(phys_pc);
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     tb->phys_hash_next = *ptb;
     *ptb = tb;