[RFC,v4,1/9] exec.c: Add new exclusive bitmap to ram_list

Message ID 1438966995-5913-2-git-send-email-a.rigo@virtualopensystems.com
State New

Commit Message

Alvise Rigo Aug. 7, 2015, 5:03 p.m. UTC
The purpose of this new bitmap is to flag the memory pages that are in
the middle of LL/SC operations (after an LL, before an SC) on a per-vCPU
basis.
For all these pages, the corresponding TLB entries will be generated
in such a way as to force the slow-path if at least one vCPU has its bit
not set.
When the system starts, the whole memory is dirty (the whole bitmap is
set). A page, after being marked as exclusively-clean, will be
restored to dirty after the SC.

The accessors to this bitmap are currently not atomic, but they will
have to be in a real multi-threaded TCG.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 exec.c                  |  7 +++--
 include/exec/memory.h   |  3 ++-
 include/exec/ram_addr.h | 68 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 75 insertions(+), 3 deletions(-)
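
For orientation: the layout the patch introduces is smp_cpus consecutive bits
per guest RAM page in the new DIRTY_MEMORY_EXCLUSIVE bitmap. A minimal
restatement of that indexing in C, assuming the context of
include/exec/ram_addr.h as modified below (the helper name is invented for
illustration; the patch expresses the same thing with the
EXCL_BITMAP_GET_OFFSET() macro plus a "+ cpu_index" at the call sites):

    /* Bit index of (page containing addr, vCPU cpu_index) in the
     * DIRTY_MEMORY_EXCLUSIVE bitmap: smp_cpus consecutive bits per page.
     * Illustrative only; not part of the patch. */
    static inline unsigned long excl_bit_index(ram_addr_t addr,
                                               uint32_t cpu_index)
    {
        return smp_cpus * (addr >> TARGET_PAGE_BITS) + cpu_index;
    }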

Comments

Paolo Bonzini Aug. 11, 2015, 1:52 p.m. UTC | #1
On 07/08/2015 19:03, Alvise Rigo wrote:
> +static inline int cpu_physical_memory_excl_atleast_one_clean(ram_addr_t addr)
> +{
> +    unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
> +    unsigned long next, end;
> +
> +    if (likely(smp_cpus <= BITS_PER_LONG)) {

This only works if smp_cpus divides BITS_PER_LONG, i.e. BITS_PER_LONG %
smp_cpus == 0.
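
To make the failure mode concrete, here is a small self-contained
illustration (the values and names are assumptions for the example, not code
from the patch): when a page's group of smp_cpus bits crosses a long
boundary, the single-word shift-and-mask test in the function quoted here
cannot see all of them.

    #include <stdbool.h>
    #include <stdio.h>

    #define BITS_PER_LONG 64    /* stand-in for QEMU's definition */

    /* A page's group starts at bit smp_cpus * page_index; the shortcut
     * breaks whenever that group crosses a long boundary. */
    static bool group_straddles_word(unsigned long smp_cpus,
                                     unsigned long page_index)
    {
        unsigned long offset = smp_cpus * page_index;
        return (offset % BITS_PER_LONG) + smp_cpus > BITS_PER_LONG;
    }

    int main(void)
    {
        /* smp_cpus == 6 does not divide 64: page 10 starts at bit 60, so
         * its six bits occupy positions 60..65 and span two longs; the
         * shift-and-mask test silently ignores the bits of vCPUs 4 and 5. */
        printf("%d\n", group_straddles_word(6, 10));    /* prints 1 */
        return 0;
    }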

> +        unsigned long mask = (1 << smp_cpus) - 1;
> +
> +        return
> +            (mask & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
> +            (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)))) != mask;
> +    }
> +
> +    end = BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)) + smp_cpus;
> +    next = find_next_zero_bit(bitmap, end,
> +                              BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)));
> +
> +    return next < end;

> +static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr,
> +                                                    unsigned long cpu)
> +{
> +    unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
> +    unsigned long end, next;
> +    uint32_t add;
> +
> +    assert(cpu <= smp_cpus);
> +
> +    if (likely(smp_cpus <= BITS_PER_LONG)) {
> +        cpu = (cpu == smp_cpus) ? (1 << cpu) - 1 : (1 << cpu);
> +
> +        return cpu & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
> +                     (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)));
> +    }
> +
> +    add = (cpu == smp_cpus) ? 0 : 1;

Why not have a separate function for the cpu == smp_cpus case?

I don't think real hardware has ll/sc per CPU.  Can we have the bitmap as:

- 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
has done a concurrent access

- 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
a concurrent access.

Then:

- ll sets the bit to 0, and requests a flush if it was 1

- when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0

- in the TLB_EXCL slow path, set the bit to 1 and, for conditional
stores, succeed if the bit was 0

- when removing an exclusive entry, set the bit to 1

Paolo
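
As a reading aid, a minimal sketch of Paolo's single-bit scheme above,
transcribed literally into C. Everything here (the array, its size, the
helper names) is an assumption made up for the illustration, not code from
QEMU or from this series; Peter's reply below questions whether these
semantics are sufficient for ARM.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_GUEST_PAGES 4096                /* illustrative size only */

    static uint8_t excl_bit[NUM_GUEST_PAGES];   /* one bit per page: 0 or 1 */

    static void flush_all_tlbs(void) { /* stand-in for a cross-vCPU flush */ }

    /* ll: set the bit to 0, request a flush if it was 1 */
    static void on_ll(unsigned long page)
    {
        if (excl_bit[page]) {
            flush_all_tlbs();
        }
        excl_bit[page] = 0;
    }

    /* tlb_set_page: the new entry gets TLB_EXCL iff the bit is 0 */
    static bool needs_tlb_excl(unsigned long page)
    {
        return excl_bit[page] == 0;
    }

    /* TLB_EXCL slow path: set the bit to 1; a conditional store succeeds
     * only if the bit was still 0 */
    static bool on_excl_slow_path(unsigned long page, bool is_store_conditional)
    {
        bool was_clear = (excl_bit[page] == 0);
        excl_bit[page] = 1;
        return !is_store_conditional || was_clear;
    }

    /* removing an exclusive TLB entry: set the bit back to 1 */
    static void on_excl_entry_removed(unsigned long page)
    {
        excl_bit[page] = 1;
    }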
Peter Maydell Aug. 11, 2015, 2:24 p.m. UTC | #2
On 11 August 2015 at 14:52, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> I don't think real hardware has ll/sc per CPU.

On ARM, the exclusives are handled by the 'global monitor', which
supports tracking an exclusive access per CPU.

>  Can we have the bitmap as:
>
> - 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
> has done a concurrent access
>
> - 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
> a concurrent access.
>
> Then:
>
> - ll sets the bit to 0, and requests a flush if it was 1
>
> - when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0
>
> - in the TLB_EXCL slow path, set the bit to 1 and, for conditional
> stores, succeed if the bit was 0
>
> - when removing an exclusive entry, set the bit to 1

This doesn't sound like it has the correct semantics for ARM:
wouldn't it mean that CPU 1 could do an LL, and then CPU 2
could do an SC and have it succeed even if it hadn't
previously done an LL itself?

thanks
-- PMM
Alvise Rigo Aug. 11, 2015, 2:34 p.m. UTC | #3
On Tue, Aug 11, 2015 at 4:24 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 11 August 2015 at 14:52, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> I don't think real hardware has ll/sc per CPU.
>
> On ARM, the exclusives are handled by the 'global monitor', which
> supports tracking an exclusive access per CPU.
>
>>  Can we have the bitmap as:
>>
>> - 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
>> has done a concurrent access
>>
>> - 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
>> a concurrent access.
>>
>> Then:
>>
>> - ll sets the bit to 0, and requests a flush if it was 1
>>
>> - when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0
>>
>> - in the TLB_EXCL slow path, set the bit to 1 and, for conditional
>> stores, succeed if the bit was 0
>>
>> - when removing an exclusive entry, set the bit to 1
>
> This doesn't sound like it has the correct semantics for ARM:
> wouldn't it mean that CPU 1 could do an LL, and then CPU 2
> could do an SC and have it succeed even if it hadn't
> previously done an LL itself?

This case is handled correctly now: if CPU 2 has not previously
done an LL, the SC will fail without even entering the helper.
I think Paolo omitted this case from the list.

Regards,
alvise

>
> thanks
> -- PMM
Alvise Rigo Aug. 11, 2015, 3:54 p.m. UTC | #4
On Tue, Aug 11, 2015 at 3:52 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 07/08/2015 19:03, Alvise Rigo wrote:
>> +static inline int cpu_physical_memory_excl_atleast_one_clean(ram_addr_t addr)
>> +{
>> +    unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
>> +    unsigned long next, end;
>> +
>> +    if (likely(smp_cpus <= BITS_PER_LONG)) {
>
> This only works if smp_cpus divides BITS_PER_LONG, i.e. BITS_PER_LONG %
> smp_cpus == 0.

You are right,

>
> Why not have a separate function for the cpu == smp_cpus case?

I'll rework this part a bit.

>
> I don't think real hardware has ll/sc per CPU.  Can we have the bitmap as:
>
> - 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
> has done a concurrent access
>
> - 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
> a concurrent access.
>
> Then:
>
> - ll sets the bit to 0, and requests a flush if it was 1
>
> - when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0
>
> - in the TLB_EXCL slow path, set the bit to 1 and, for conditional
> stores, succeed if the bit was 0
>
> - when removing an exclusive entry, set the bit to 1

This can lead to an excessive rate of flush requests, since for one
CPU that removes the TLB_EXCL flag, all the others that are competing
for the same excl address will need to flush the entire cache and
start all over again.

If for a start we want to have a simpler implementation, I can revert
back to the one-bit design.

alvise

>
> Paolo
Paolo Bonzini Aug. 11, 2015, 3:55 p.m. UTC | #5
On 11/08/2015 17:54, alvise rigo wrote:
> This can lead to an excessive rate of flush requests, since for one
> CPU that removes the TLB_EXCL flag, all the others that are competing
> for the same excl address will need to flush the entire cache and
> start all over again.

Why flush the entire cache (I understand you mean TLB)?

Paolo

> If for a start we want to have a simpler implementation, I can revert
> back to the one-bit design.
Alvise Rigo Aug. 11, 2015, 4:11 p.m. UTC | #6
On Tue, Aug 11, 2015 at 5:55 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 11/08/2015 17:54, alvise rigo wrote:
>> This can lead to an excessive rate of flush requests, since for one
>> CPU that removes the TLB_EXCL flag, all the others that are competing
>> for the same excl address will need to flush the entire cache and
>> start all over again.
>
> Why flush the entire cache (I understand you mean TLB)?

Sorry, I meant the TLB.
If for each removal of an exclusive entry we set also the bit to 1, we
force the following LL to make a tlb_flush() on every vCPU.

alvise

>
> Paolo
>
>> If for a start we want to have a simpler implementation, I can revert
>> back to the one-bit design.
Paolo Bonzini Aug. 11, 2015, 4:32 p.m. UTC | #7
On 11/08/2015 18:11, alvise rigo wrote:
>> > Why flush the entire cache (I understand you mean TLB)?
> Sorry, I meant the TLB.
> If for each removal of an exclusive entry we set also the bit to 1, we
> force the following LL to make a tlb_flush() on every vCPU.

What if you only flush one entry with tlb_flush_entry (on every vCPU)?

Paolo
Alvise Rigo Aug. 12, 2015, 7:31 a.m. UTC | #8
I think that tlb_flush_entry is not enough, since in theory another
vCPU could have a different TLB address referring to the same phys
address.

alvise

On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 11/08/2015 18:11, alvise rigo wrote:
>>> > Why flush the entire cache (I understand you mean TLB)?
>> Sorry, I meant the TLB.
>> If for each removal of an exclusive entry we set also the bit to 1, we
>> force the following LL to make a tlb_flush() on every vCPU.
>
> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>
> Paolo
Paolo Bonzini Aug. 12, 2015, 12:36 p.m. UTC | #9
On 12/08/2015 09:31, alvise rigo wrote:
> I think that tlb_flush_entry is not enough, since in theory another
> vCPU could have a different TLB address referring to the same phys
> address.

You're right, this is a TLB so it's virtually-indexed. :(  I'm not sure 
what happens on ARM, since it has a virtually indexed (VIVT or VIPT) 
cache, but indeed it would be a problem when implementing e.g. CMPXCHG 
using the TCG ll/sc ops.

I'm a bit worried about adding such a big bitmap.  It's only used on 
TCG, but it is also allocated on KVM and on KVM you can have hundreds 
of VCPUs.  Wasting 200 bits per guest memory page (i.e. ~0.6% of guest 
memory) is obviously not a great idea. :(

Perhaps we can use a bytemap instead:

- 0..253 = TLB_EXCL must be set in all VCPUs except CPU n.  A VCPU that
loads the TLB for this vaddr does not have to set it.

- 254 = TLB_EXCL must be set in all VCPUs.  A VCPU that
loads the TLB for this vaddr has to set it.

- 255 = TLB_EXCL not set in at least two VCPUs

Transitions:

- ll transitions: anything -> 254

- sc transitions: 254 -> current CPU_ID

- TLB_EXCL store transitions: 254 -> current CPU_ID

- tlb_st_page transitions: CPU_ID other than current -> 255

The initial value is 255 on SMP guests, 0 on UP guests.

The algorithms are very similar to yours, just using this approximate
representation.

ll algorithm:
  llsc_value = bytemap[vaddr]
  if llsc_value == CPU_ID
     do nothing
  elseif llsc_value < 254
     flush TLB of CPU llsc_value
  elseif llsc_value == 255
     flush all TLBs
  set TLB_EXCL
  bytemap[vaddr] = 254
  load

tlb_set_page algorithm:
  llsc_value = bytemap[vaddr]
  if llsc_value == CPU_ID or llsc_value == 255
     do nothing
  else if llsc_value == 254
     set TLB_EXCL
  else
     # two CPUs without TLB_EXCL
     bytemap[vaddr] = 255

TLB_EXCL slow path algorithm:
   if bytemap[vaddr] == 254
      bytemap[vaddr] = CPU_ID
   else
      # two CPUs without TLB_EXCL
      bytemap[vaddr] = 255
   clear TLB_EXCL in this CPU
   store

sc algorithm:
   if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
      bytemap[vaddr] = CPU_ID
      clear TLB_EXCL in this CPU
      store
      succeed
   else
      fail

clear algorithm:
   if bytemap[vaddr] == 254
      bytemap[vaddr] = CPU_ID

The UP case is optimized because bytemap[vaddr] will always be 0 or 254.

The algorithm requires the LL to be cleared e.g. on exceptions.
Paolo
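
For concreteness, the ll and sc cases of the bytemap scheme above, transcribed
into C. This is only a sketch under the stated encoding; the bytemap array,
the flush/TLB helpers and their names are placeholders, not existing QEMU
interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    #define EXCL_ALL_MUST_SET 254   /* TLB_EXCL must be set on every vCPU     */
    #define EXCL_UNTRACKED    255   /* TLB_EXCL not set in at least two vCPUs */
    #define NUM_GUEST_PAGES   4096  /* illustrative size only */

    static uint8_t bytemap[NUM_GUEST_PAGES];            /* one byte per page */

    /* Placeholder stubs standing in for the real TLB hooks. */
    static void tlb_flush_one(uint8_t cpu)              { (void)cpu; }
    static void tlb_flush_all(void)                     { }
    static void set_tlb_excl_here(unsigned long page)   { (void)page; }
    static void clear_tlb_excl_here(unsigned long page) { (void)page; }

    static void on_ll(unsigned long page, uint8_t cpu_id)
    {
        uint8_t v = bytemap[page];

        if (v != cpu_id) {
            if (v < EXCL_ALL_MUST_SET) {
                tlb_flush_one(v);     /* only vCPU v may lack TLB_EXCL */
            } else if (v == EXCL_UNTRACKED) {
                tlb_flush_all();      /* at least two vCPUs lack it */
            }
        }
        set_tlb_excl_here(page);
        bytemap[page] = EXCL_ALL_MUST_SET;
        /* ...perform the load... */
    }

    static bool on_sc(unsigned long page, uint8_t cpu_id)
    {
        if (bytemap[page] == cpu_id || bytemap[page] == EXCL_ALL_MUST_SET) {
            bytemap[page] = cpu_id;
            clear_tlb_excl_here(page);
            /* ...perform the store... */
            return true;              /* SC succeeds */
        }
        return false;                 /* SC fails */
    }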

> alvise
> 
> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>> Why flush the entire cache (I understand you mean TLB)?
>>> Sorry, I meant the TLB.
>>> If for each removal of an exclusive entry we set also the bit to 1, we
>>> force the following LL to make a tlb_flush() on every vCPU.
>>
>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>
>> Paolo
Peter Maydell Aug. 12, 2015, 1:02 p.m. UTC | #10
On 12 August 2015 at 13:36, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/08/2015 09:31, alvise rigo wrote:
>> I think that tlb_flush_entry is not enough, since in theory another
>> vCPU could have a different TLB address referring to the same phys
>> address.
>
> You're right, this is a TLB so it's virtually-indexed. :(  I'm not sure
> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
> using the TCG ll/sc ops.

On ARM the exclusives operate on physical addresses, not virtual.

-- PMM
Alvise Rigo Aug. 12, 2015, 2:04 p.m. UTC | #11
On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/08/2015 09:31, alvise rigo wrote:
>> I think that tlb_flush_entry is not enough, since in theory another
>> vCPU could have a different TLB address referring to the same phys
>> address.
>
> You're right, this is a TLB so it's virtually-indexed. :(  I'm not sure
> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
> using the TCG ll/sc ops.
>
> I'm a bit worried about adding such a big bitmap.  It's only used on
> TCG, but it is also allocated on KVM and on KVM you can have hundreds
> of VCPUs.  Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
> memory) is obviously not a great idea. :(

I agree, it's a waste of memory.

>
> Perhaps we can use a bytemap instead:
>
> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n.  A VCPU that
> loads the TLB for this vaddr does not have to set it.
>
> - 254 = TLB_EXCL must be set in all VCPUs.  A VCPU that
> loads the TLB for this vaddr has to set it.
>
> - 255 = TLB_EXCL not set in at least two VCPUs
>
> Transitions:
>
> - ll transitions: anything -> 254
>
> - sc transitions: 254 -> current CPU_ID
>
> - TLB_EXCL store transitions: 254 -> current CPU_ID
>
> - tlb_st_page transitions: CPU_ID other than current -> 255
>
> The initial value is 255 on SMP guests, 0 on UP guests.
>
> The algorithms are very similar to yours, just using this approximate
> representation.
>
> ll algorithm:
>   llsc_value = bytemap[vaddr]
>   if llsc_value == CPU_ID
>      do nothing
>   elseif llsc_value < 254
>      flush TLB of CPU llsc_value
>   elseif llsc_value == 255
>      flush all TLBs
>   set TLB_EXCL
>   bytemap[vaddr] = 254
>   load
>
> tlb_set_page algorithm:
>   llsc_value = bytemap[vaddr]
>   if llsc_value == CPU_ID or llsc_value == 255
>      do nothing
>   else if llsc_value == 254
>      set TLB_EXCL
>   else
>      # two CPUs without TLB_EXCL
>      bytemap[vaddr] = 255
>
> TLB_EXCL slow path algorithm:
>    if bytemap[vaddr] == 254
>       bytemap[vaddr] = CPU_ID
>    else
>       # two CPUs without TLB_EXCL
>       bytemap[vaddr] = 255
>    clear TLB_EXCL in this CPU
>    store
>
> sc algorithm:
>    if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
>       bytemap[vaddr] = CPU_ID
>       clear TLB_EXCL in this CPU
>       store
>       succeed
>    else
>       fail
>
> clear algorithm:
>    if bytemap[vaddr] == 254
>       bytemap[vaddr] = CPU_ID

Isn't this also required for the clear algorithm?

    if bytemap[vaddr] < 254
        /* this can happen for the TLB_EXCL slow path effect */
        bytemap[vaddr] = 255

The whole idea makes sense, I will consider it for the next iteration
of the patches.

Thanks,
alvise

>
> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
>
> The algorithm requires the LL to be cleared e.g. on exceptions.
> Paolo
>
>> alvise
>>
>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>> Sorry, I meant the TLB.
>>>> If for each removal of an exclusive entry we set also the bit to 1, we
>>>> force the following LL to make a tlb_flush() on every vCPU.
>>>
>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>
>>> Paolo
Paolo Bonzini Aug. 12, 2015, 2:10 p.m. UTC | #12
On 12/08/2015 16:04, alvise rigo wrote:
>> > clear algorithm:
>> >    if bytemap[vaddr] == 254
>> >       bytemap[vaddr] = CPU_ID
> Isn't this also required for the clear algorithm?
> 
>     if bytemap[vaddr] < 254
>         /* this can happen for the TLB_EXCL slow path effect */
>         bytemap[vaddr] = 255

I don't think so because clear doesn't clear TLB_EXCL.  But maybe we're
talking about two different things. :)

Paolo
Alvise Rigo Aug. 12, 2015, 2:32 p.m. UTC | #13
On Wed, Aug 12, 2015 at 4:10 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/08/2015 16:04, alvise rigo wrote:
>>> > clear algorithm:
>>> >    if bytemap[vaddr] == 254
>>> >       bytemap[vaddr] = CPU_ID
>> Isn't this also required for the clear algorithm?
>>
>>     if bytemap[vaddr] < 254
>>         /* this can happen for the TLB_EXCL slow path effect */
>>         bytemap[vaddr] = 255
>
> I don't think so because clear doesn't clear TLB_EXCL.  But maybe we're
> talking about two different things. :)

Let me go through it again, I might have missed something :)

alvise

>
> Paolo
Alvise Rigo Sept. 10, 2015, 1:04 p.m. UTC | #14
Hi Paolo,

A brief update on this. I have a first implementation of the idea you
proposed, though it's not working really well. The failing rate of SCs
for some reason is very high.
Instead of trying to fix it, I came up with this alternative design:
we still use 8 bits per page and we group the smp_cpus in clusters of
maximum size ceiling(smp_cpus / 8).
Given a byte describing a certain page, its first bit will represent
the vCPUs of index 0, 9, 17, ... and so on for the other bits.

Note that systems with eight vCPUs or less will not be affected by
this simplified model.

Now, unless I'm missing some advantages of your proposal except the
less aggressive memory consumption, I would go with this design.

Regards,
alvise
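
A tiny sketch of the 8-bits-per-page idea described above, purely to make the
space/precision trade-off concrete. The exact vCPU-to-bit grouping in the mail
("0, 9, 17, ...") is not fully spelled out, so a plain modulo grouping is
assumed here; this is an illustration, not the series' code.

    /* Eight bits per page; each bit stands for a whole group of vCPUs.
     * The modulo grouping is an assumption made for this sketch only. */
    static inline unsigned excl_group_bit(unsigned cpu_index)
    {
        return cpu_index % 8;   /* with <= 8 vCPUs this is one bit per vCPU */
    }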

Alex Bennée Sept. 10, 2015, 4:19 p.m. UTC | #15
alvise rigo <a.rigo@virtualopensystems.com> writes:

> Hi Paolo,
>
> A brief update on this. I have a first implementation of the idea you
> proposed, though it's not working really well. The failing rate of SCs
> for some reason is very high.

Due to high memory contention on the EXCL page?

> Instead of trying to fix it, I came up with this alternative design:
> we still use 8 bits per page and we group the smp_cpus in clusters of
> maximum size ceiling(smp_cpus / 8).
>> Given a byte describing a certain page, its first bit will represent
> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>
> Note that systems with eight vCPUs or less will not be affected by
> this simplified model.
>
> Now, unless I'm missing some advantages of your proposal except the
> less aggressive memory consumption, I would go with this design.

BTW have you had a look at Emilio's series? I'm about half way through
reviewing it but I'm looking forward to seeing how his implementation
differs from yours. Maybe there are some ideas in there that will
inspire?

Paolo Bonzini Sept. 10, 2015, 4:25 p.m. UTC | #16
On 10/09/2015 15:04, alvise rigo wrote:
> Hi Paolo,
> 
> A brief update on this. I have a first implementation of the idea you
> proposed, though it's not working really well. The failing rate of SCs
> for some reason is very high.
> Instead of trying to fix it, I came up with this alternative design:
> we still use 8 bits per page and we group the smp_cpus in clusters of
> maximum size ceiling(smp_cpus / 8).
> Given a byte describing a certain page, its first bit will represent
> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
> 
> Note that systems with eight vCPUs or less will not be affected by
> this simplified model.
> 
> Now, unless I'm missing some advantages of your proposal except the
> less aggressive memory consumption, I would go with this design.

Sounds great.

Paolo
Alvise Rigo Sept. 10, 2015, 5:36 p.m. UTC | #17
Hi Alex,

On Thu, Sep 10, 2015 at 6:19 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> alvise rigo <a.rigo@virtualopensystems.com> writes:
>
>> Hi Paolo,
>>
>> A brief update on this. I have a first implementation of the idea you
>> proposed, though it's not working really well. The failing rate of SCs
>> for some reason is very high.
>
> Due to high memory contention on the EXCL page?

I guess it's also due to the fact that we were using the bitmap itself
to determine whether an SC operation should fail or not.
The SC should rather fail when another vCPU has an intersecting
exclusive range set, not because of the state of the bitmap.

>
>> Instead of trying to fix it, I came up with this alternative design:
>> we still use 8 bits per page and we group the smp_cpus in clusters of
>> maximum size ceiling(smp_cpus / 8).
>> Given a byte describing a certain page, its first bit will represent
>> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>>
>> Note that systems with eight vCPUs or less will not be affected by
>> this simplified model.
>>
>> Now, unless I'm missing some advantages of your proposal except the
>> less aggressive memory consumption, I would go with this design.
>
> BTW have you had a look at Emilio's series? I'm about half way through
> reviewing it but I'm looking forward to seeing how his implementation
> differs from yours. Maybe there are some ideas in there that will
> inspire?

I still have to take an in-depth look at them.
However, unless I'm missing something, the problem of the fast-path
not being interceptable is still there, isn't it?

Regards,
alvise


Patch

diff --git a/exec.c b/exec.c
index 7d60e15..f113076 100644
--- a/exec.c
+++ b/exec.c
@@ -1493,11 +1493,14 @@  static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
         int i;
 
         /* ram_list.dirty_memory[] is protected by the iothread lock.  */
-        for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
+        for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
             ram_list.dirty_memory[i] =
                 bitmap_zero_extend(ram_list.dirty_memory[i],
                                    old_ram_size, new_ram_size);
-       }
+        }
+        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] = bitmap_zero_extend(
+                ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                old_ram_size * smp_cpus, new_ram_size * smp_cpus);
     }
     cpu_physical_memory_set_dirty_range(new_block->offset,
                                         new_block->used_length,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1394715..a525259 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@ 
 #define DIRTY_MEMORY_VGA       0
 #define DIRTY_MEMORY_CODE      1
 #define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM       4        /* num of dirty bits */
 
 #include <stdint.h>
 #include <stdbool.h>
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index c113f21..6b678d6 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@ 
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "sysemu/sysemu.h"
 
 ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
                                     bool share, const char *mem_path,
@@ -135,6 +136,10 @@  static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
     if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
         bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
     }
+    if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+        bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE],
+                          page * smp_cpus, (end - page) * smp_cpus);
+    }
     xen_modified_memory(start, length);
 }
 
@@ -249,5 +254,68 @@  uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
     return num_dirty;
 }
 
+/* Exclusive bitmap support. */
+#define EXCL_BITMAP_GET_OFFSET(addr) (smp_cpus * (addr >> TARGET_PAGE_BITS))
+static inline void cpu_physical_memory_set_excl_dirty(ram_addr_t addr,
+                                                      uint32_t cpu_index)
+{
+    set_bit_atomic(EXCL_BITMAP_GET_OFFSET(addr) + cpu_index,
+                   ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+static inline int cpu_physical_memory_excl_atleast_one_clean(ram_addr_t addr)
+{
+    unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
+    unsigned long next, end;
+
+    if (likely(smp_cpus <= BITS_PER_LONG)) {
+        unsigned long mask = (1 << smp_cpus) - 1;
+
+        return
+            (mask & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
+            (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)))) != mask;
+    }
+
+    end = BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)) + smp_cpus;
+    next = find_next_zero_bit(bitmap, end,
+                              BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)));
+
+    return next < end;
+}
+
+static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr,
+                                                    unsigned long cpu)
+{
+    unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
+    unsigned long end, next;
+    uint32_t add;
+
+    assert(cpu <= smp_cpus);
+
+    if (likely(smp_cpus <= BITS_PER_LONG)) {
+        cpu = (cpu == smp_cpus) ? (1 << cpu) - 1 : (1 << cpu);
+
+        return cpu & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
+                     (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)));
+    }
+
+    add = (cpu == smp_cpus) ? 0 : 1;
+    end = BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)) + cpu + add;
+    next = find_next_bit(bitmap, end, BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))
+                         + (cpu % smp_cpus));
+
+    return next < end;
+}
+
+static inline bool cpu_physical_memory_clear_excl_dirty(ram_addr_t addr,
+                                                        uint32_t cpu_index)
+{
+    return bitmap_test_and_clear_atomic(
+                                ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                                EXCL_BITMAP_GET_OFFSET(addr) + cpu_index, 1);
+}
+
+
+
 #endif
 #endif
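
To close the loop on how the new helpers fit together, here is a hypothetical
illustration of the intended flow. The wrapper function and its call sites are
made up; the real users of these helpers appear only in later patches of the
series.

    /* Hypothetical usage sketch, in the context of the headers modified by
     * this patch; not code from the series. */
    static void example_llsc_flow(ram_addr_t addr, uint32_t cpu_index)
    {
        /* LL: mark the page exclusively clean for this vCPU. */
        cpu_physical_memory_clear_excl_dirty(addr, cpu_index);

        /* TLB fill on any vCPU: force the slow path while at least one
         * vCPU holds the page exclusively (i.e. has a clean bit). */
        if (cpu_physical_memory_excl_atleast_one_clean(addr)) {
            /* ...set TLB_EXCL in the TLB entry being created... */
        }

        /* SC (or a conflicting store): restore the page to dirty. */
        cpu_physical_memory_set_excl_dirty(addr, cpu_index);
    }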