
About QEMU BQL and dirty log switch in Migration

Message ID B2D15215269B544CADD246097EACE747395AD759@dggeml511-mbx.china.huawei.com
State New

Commit Message

Zhoujian (jay) May 11, 2017, 12:07 p.m. UTC
Hi all,

After applying the patch below, the time taken by the
memory_global_dirty_log_stop() function drops to milliseconds
for a 4T memory guest, but I'm not sure whether this patch will trigger
other problems. Does this patch make sense?


> * Yang Hongyang (yanghongyang@huawei.com) wrote:
> >
> >
> > On 2017/4/24 20:06, Juan Quintela wrote:
> > > Yang Hongyang <yanghongyang@huawei.com> wrote:
> > >> Hi all,
> > >>
> > >> We found that the dirty log switch costs more than 13 seconds while
> > >> migrating a 4T memory guest, and the dirty log switch is currently
> > >> protected by the QEMU BQL. This causes the guest to freeze for a long
> > >> time when switching dirty log on, and the migration downtime is
> > >> unacceptable.
> > >> Is there any chance to optimize the time cost of the dirty log switch
> > >> operation? Or to move the time-consuming operation out of the QEMU BQL?
> > >
> > > Hi
> > >
> > > Could you specify what you mean by the dirty log switch?
> > > The one inside kvm?
> > > The merge between kvm one and migration bitmap?
> >
> > The calls to the following functions:
> > memory_global_dirty_log_start/stop();
> 
> I suppose there are a few questions:
>   a) Do we actually need the BQL - and if so, why?
>   b) What actually takes 13s?  It's probably worth figuring out where the
> time goes; the whole bitmap is only 1GB, isn't it, even on a 4TB machine,
> and even the simplest way to fill that takes way less than 13s.
> 
> Dave
> 
> >
> > >
> > > Thanks, Juan.
> > >
> >
> > --
> > Thanks,
> > Yang
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Regards,
Jay Zhou
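
For context on Dave's question (b), a back-of-the-envelope sketch of the
dirty bitmap size for a 4TB guest, assuming one bit per 4KiB page (the
exact layout in QEMU and KVM differs, so this is only an estimate):

#include <stdio.h>

int main(void)
{
    unsigned long long guest_ram = 4ULL << 40;       /* 4 TiB of guest RAM */
    unsigned long long pages     = guest_ram >> 12;  /* 4 KiB pages -> 2^30 pages */
    unsigned long long bytes     = pages / 8;        /* one bit per page */

    /* Prints "1073741824 pages, 128 MiB per bitmap".  QEMU and KVM each
     * keep more than one such bitmap, which is roughly where an estimate
     * on the order of 1GB comes from. */
    printf("%llu pages, %llu MiB per bitmap\n", pages, bytes >> 20);
    return 0;
}

Even touching a few hundred megabytes of bitmap memory should take well
under a second, which supports Dave's point that the 13s must be going
somewhere else.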

Comments

Zhoujian (jay) May 11, 2017, 2:18 p.m. UTC | #1
Hi Wanpeng,

> 2017-05-11 21:43 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
> > 2017-05-11 20:24 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
> >>
> >> On 11/05/2017 14:07, Zhoujian (jay) wrote:
> >>> -        * Scan sptes if dirty logging has been stopped, dropping those
> >>> -        * which can be collapsed into a single large-page spte.  Later
> >>> -        * page faults will create the large-page sptes.
> >>> +        * Reset each vcpu's mmu, then page faults will create the large-page
> >>> +        * sptes later.
> >>>          */
> >>>         if ((change != KVM_MR_DELETE) &&
> >>>                 (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
> >>> -               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> >>> -               kvm_mmu_zap_collapsible_sptes(kvm, new);
> >
> > This is an unlikely branch (unless guest live migration fails and the
> > guest continues to run on the source machine) rather than a hot path;
> > do you have any performance numbers for your real workloads?
>
> I found the original discussion by google:
> https://lists.nongnu.org/archive/html/qemu-devel/2017-04/msg04143.html
> You will not reach this branch if the guest live migration completes
> successfully.

In our tests, this branch is taken when live migration is successful.
AFAIK, the kmod does not know whether live migration was successful or not
when dealing with the KVM_SET_USER_MEMORY_REGION ioctl. Do I miss something?

Regards,
Jay Zhou
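
To illustrate Jay's point: from the kernel's perspective, the end of dirty
logging looks the same whether migration succeeded or failed, because in
both cases userspace re-registers the memslot without
KVM_MEM_LOG_DIRTY_PAGES. A minimal userspace sketch (vm_fd, slot_id, gpa,
size and hva are placeholders, not actual QEMU code):

#include <linux/kvm.h>
#include <sys/ioctl.h>

static int stop_dirty_logging(int vm_fd, __u32 slot_id, __u64 gpa,
                              __u64 size, __u64 hva)
{
    struct kvm_userspace_memory_region region = {
        .slot            = slot_id,
        .flags           = 0,    /* KVM_MEM_LOG_DIRTY_PAGES cleared */
        .guest_phys_addr = gpa,
        .memory_size     = size,
        .userspace_addr  = hva,
    };

    /* kvm_arch_commit_memory_region() only sees the old->new flag
     * transition; it has no way to know why logging was stopped. */
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}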
Wanpeng Li May 12, 2017, 6:34 a.m. UTC | #2
2017-05-11 22:18 GMT+08:00 Zhoujian (jay) <jianjay.zhou@huawei.com>:
> Hi Wanpeng,
>
>> 2017-05-11 21:43 GMT+08:00 Wanpeng Li <kernellwp@gmail.com>:
>> > 2017-05-11 20:24 GMT+08:00 Paolo Bonzini <pbonzini@redhat.com>:
>> >>
>> >>
>> >> On 11/05/2017 14:07, Zhoujian (jay) wrote:
>> >>> -        * Scan sptes if dirty logging has been stopped, dropping those
>> >>> -        * which can be collapsed into a single large-page spte.  Later
>> >>> -        * page faults will create the large-page sptes.
>> >>> +        * Reset each vcpu's mmu, then page faults will create the large-page
>> >>> +        * sptes later.
>> >>>          */
>> >>>         if ((change != KVM_MR_DELETE) &&
>> >>>                 (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
>> >>> -               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
>> >>> -               kvm_mmu_zap_collapsible_sptes(kvm, new);
>> >
>> > This is an unlikely branch (unless guest live migration fails and the
>> > guest continues to run on the source machine) rather than a hot path;
>> > do you have any performance numbers for your real workloads?
>>
>> I found the original discussion by google:
>> https://lists.nongnu.org/archive/html/qemu-devel/2017-04/msg04143.html
>> You will not reach this branch if the guest live migration completes
>> successfully.
>
> In our tests, this branch is taken when live migration is successful.
> AFAIK, the kmod does not know whether live migration was successful or not
> when dealing with the KVM_SET_USER_MEMORY_REGION ioctl. Do I miss something?

Originally there was a bug where the memslot dirty log flag was not
cleared after live migration failed; a patch was submitted to fix it:
https://lists.nongnu.org/archive/html/qemu-devel/2015-04/msg00794.html.
However, I can't remember whether the dirty log flag was cleared when
live migration completed successfully at that time, but maybe not.
Paolo replied to the patch that he had a better method. I was too busy
then and didn't follow the qemu patch for this fix any more; I just
found that this commit has been merged:
http://git.qemu.org/?p=qemu.git;a=commit;h=6f6a5ef3e429f92f987678ea8c396aab4dc6aa19.
This commit clears the memslot dirty log flag after live migration, no
matter whether it succeeded or not.

Regards,
Wanpeng Li
Xiao Guangrong May 12, 2017, 8:09 a.m. UTC | #3
On 05/11/2017 08:24 PM, Paolo Bonzini wrote:
> 
> 
> On 11/05/2017 14:07, Zhoujian (jay) wrote:
>> -        * Scan sptes if dirty logging has been stopped, dropping those
>> -        * which can be collapsed into a single large-page spte.  Later
>> -        * page faults will create the large-page sptes.
>> +        * Reset each vcpu's mmu, then page faults will create the large-page
>> +        * sptes later.
>>           */
>>          if ((change != KVM_MR_DELETE) &&
>>                  (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
>> -               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
>> -               kvm_mmu_zap_collapsible_sptes(kvm, new);
>> +               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
>> +               kvm_for_each_vcpu(i, vcpu, kvm)
>> +                       kvm_mmu_reset_context(vcpu);
> 
> This should be "kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);" but
> I am not sure it is enough.  I think that if you do not zap the SPTEs,
> the page faults will use 4K SPTEs, not large ones (though I'd have to
> check better; CCing Xiao and Wanpeng).

Yes, Paolo is right. kvm_mmu_reset_context() just reloads the vCPU's
root page table; the 4k mappings are still kept.

There are two issues reported:
- one is kvm_mmu_slot_apply_flags(), when enabling dirty log tracking.

   Its root cause is that kvm_mmu_slot_remove_write_access() takes too
   much time.

   We can make the code adaptive to use the new fast-write-protect facility
   introduced by my patchset, i.e., if the number of pages contained in this
   memslot is more than TOTAL * FAST_WRITE_PROTECT_PAGE_PERCENTAGE, then we
   use fast-write-protect instead (see the sketch after this message).

- the other is kvm_mmu_zap_collapsible_sptes(), when disabling dirty
   log tracking.

   collapsible_sptes zaps 4k mappings to make memory reads happy; it is not
   required by the semantics of KVM_SET_USER_MEMORY_REGION and it is not
   urgent for the vCPUs' running, so it could be done in a separate thread
   using lock-break techniques.

Thanks!
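
A sketch of the adaptive policy described in Xiao's first item above. Note
that FAST_WRITE_PROTECT_PAGE_PERCENTAGE and
kvm_mmu_write_protect_all_pages() come from his (unmerged)
fast-write-protect series with signatures approximated from the posted
patches, and kvm_total_memslot_pages() is a hypothetical helper, so this
is pseudocode for the heuristic rather than mainline code:

/* When enabling dirty logging: write-protect the slot page by page if it
 * is a small fraction of guest memory, otherwise use the batched
 * fast-write-protect path from the patchset. */
static void slot_enable_log_dirty_adaptive(struct kvm *kvm,
                                           struct kvm_memory_slot *slot)
{
    unsigned long total = kvm_total_memslot_pages(kvm); /* hypothetical */

    if (slot->npages > total * FAST_WRITE_PROTECT_PAGE_PERCENTAGE / 100)
        kvm_mmu_write_protect_all_pages(kvm, true);
    else
        kvm_mmu_slot_remove_write_access(kvm, slot);
}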
Zhanghailiang May 12, 2017, 8:42 a.m. UTC | #4
On 2017/5/12 16:09, Xiao Guangrong wrote:
>
> On 05/11/2017 08:24 PM, Paolo Bonzini wrote:
>>
>> On 11/05/2017 14:07, Zhoujian (jay) wrote:
>>> -        * Scan sptes if dirty logging has been stopped, dropping those
>>> -        * which can be collapsed into a single large-page spte.  Later
>>> -        * page faults will create the large-page sptes.
>>> +        * Reset each vcpu's mmu, then page faults will create the large-page
>>> +        * sptes later.
>>>            */
>>>           if ((change != KVM_MR_DELETE) &&
>>>                   (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
>>> -               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
>>> -               kvm_mmu_zap_collapsible_sptes(kvm, new);
>>> +               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
>>> +               kvm_for_each_vcpu(i, vcpu, kvm)
>>> +                       kvm_mmu_reset_context(vcpu);
>> This should be "kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);" but
>> I am not sure it is enough.  I think that if you do not zap the SPTEs,
>> the page faults will use 4K SPTEs, not large ones (though I'd have to
>> check better; CCing Xiao and Wanpeng).
> Yes, Paolo is right. kvm_mmu_reset_context() just reloads the vCPU's
> root page table; the 4k mappings are still kept.
>
> There are two issues reported:
> - one is kvm_mmu_slot_apply_flags(), when enabling dirty log tracking.
>
>     Its root cause is that kvm_mmu_slot_remove_write_access() takes too
>     much time.
>
>     We can make the code adaptive to use the new fast-write-protect facility
>     introduced by my patchset, i.e., if the number of pages contained in this
>     memslot is more than TOTAL * FAST_WRITE_PROTECT_PAGE_PERCENTAGE, then
>     we use fast-write-protect instead.
>
> - the other is kvm_mmu_zap_collapsible_sptes(), when disabling dirty
>     log tracking.
>
>     collapsible_sptes zaps 4k mappings to make memory reads happy; it is not
>     required by the semantics of KVM_SET_USER_MEMORY_REGION and it is not
>     urgent for the vCPUs' running, so it could be done in a separate thread
>     using lock-break techniques.

How about moving the action of stopping the dirty log into migrate_fd_cleanup()
directly, which is processed in the main thread as a BH after migration has
completed? It would not have any side effects even if migration fails or the
user cancels it, no?
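
A minimal sketch of that suggestion on the QEMU side, assuming the stop
can simply be appended to the existing cleanup BH (the placement is
illustrative, not the actual code):

static void migrate_fd_cleanup(void *opaque)
{
    MigrationState *s = opaque;

    /* ... existing teardown of the migration file and thread ... */

    /* This BH runs in the main thread under the BQL once migration has
     * completed, failed, or been cancelled, so dirty logging is no
     * longer needed in any of these exit paths. */
    memory_global_dirty_log_stop();
}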

> Thanks!

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 464da93..fe26ee5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8313,6 +8313,8 @@  void kvm_arch_commit_memory_region(struct kvm *kvm,
                                enum kvm_mr_change change)
 {
        int nr_mmu_pages = 0;
+       int i;
+       struct kvm_vcpu *vcpu;
 
        if (!kvm->arch.n_requested_mmu_pages)
                nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm);
@@ -8328,14 +8330,15 @@  void kvm_arch_commit_memory_region(struct kvm *kvm,
         * in the source machine (for example if live migration fails), small
         * sptes will remain around and cause bad performance.
         *
-        * Scan sptes if dirty logging has been stopped, dropping those
-        * which can be collapsed into a single large-page spte.  Later
-        * page faults will create the large-page sptes.
+        * Reset each vcpu's mmu, then page faults will create the large-page
+        * sptes later.
         */
        if ((change != KVM_MR_DELETE) &&
                (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
-               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
-               kvm_mmu_zap_collapsible_sptes(kvm, new);
+               !(new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
+               kvm_for_each_vcpu(i, vcpu, kvm)
+                       kvm_mmu_reset_context(vcpu);
+       }
 
        /*
         * Set up write protection and/or dirty logging for the new slot.
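
For contrast, a minimal sketch of the variant Paolo suggests in the
discussion above (with the caveat, per Xiao, that it is not sufficient
on its own):

static void kvm_mmu_reload_all(struct kvm *kvm)
{
    /* One request observed by every vCPU; running vCPUs are kicked so
     * they process it promptly.  This only reloads the root page table:
     * the existing 4K sptes survive, so it is not a full replacement
     * for kvm_mmu_zap_collapsible_sptes(). */
    kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
}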