
[0/2,RFC] postcopy migration: Linux char device for postcopy

Message ID 20120104030355.GL19274@valinux.co.jp
State New

Commit Message

Isaku Yamahata Jan. 4, 2012, 3:03 a.m. UTC
On Mon, Jan 02, 2012 at 06:05:51PM +0100, Andrea Arcangeli wrote:
> On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> > On 12/29/2011 06:00 PM, Avi Kivity wrote:
> > > The NFS client has exactly the same issue, if you mount it with the intr
> > > option.  In fact you could use the NFS client as a trivial umem/cuse
> > > prototype.
> > 
> > Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
> 
> During KVMForum I suggested to a few people that it could be done
> entirely in userland with PROT_NONE. So the problem is if we do it in
> userland with the current functionality you'll run out of VMAs and
> slowdown performance too much.
> 
> But all you need is the ability to map single pages in the address
> space. The only special requirement is that a new vma must not be
> created during the map operation. It'd be very similar to
> remap_file_pages for MAP_SHARED, it also was created to avoid having
> to create new vmas on a large MAP_SHARED mapping and no other reason
> at all. In our case we deal with a large MAP_ANONYMOUS mapping and we
> must alter the pte without creating new vmas but the problem is very
> similar to remap_file_pages.
> 
> Qemu in the dst node can do:
> 
> 	mmap(MAP_ANONYMOUS....)
> 	fault_area_prepare(start, end, signalnr)
> 
> prepare_fault_area will map the range with the magic pte.
> 
> Then when the signalnr fires, you do:
> 
>      send(givemepageX)
>      recv(&tmpaddr_aligned, PAGE_SIZE,...);
>      fault_area_map(final_dest_aligned, tmpaddr_aligned, size)
> 
> map_fault_area will check the pgprot of the two vmas mapping
> final_dest_aligned and tmpaddr_aligned have the same vma->vm_pgprot
> and various other vma bits, and if all ok, it'll just copy the pte
> from tmpaddr_aligned, to final_dest_aligned and it'll update the
> page->index. It can fail if the page is shared to avoid dealing with
> the non-linearity of the page mapped in multiple vmas.
> 
> You basically need a bypass to avoid altering the pgprot of the vma,
> and enter into the pte a "magic" thing that fires signal handlers
> if accessed, without having to create new vmas. gup/gup_fast and stuff
> should just always fallback into handle_mm_fault when encountering such a
> thing, so returning failure as if gup_fast was run on a address beyond
> the end of the i_size in the MAP_SHARED case.

Yes, it's quite doable in user space (qemu) with a kernel enhancement,
and it would be easy to convert a separate daemon process into a thread
in qemu.
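
For reference, a rough sketch of the dst-side flow Andrea describes above.
The fault_area_prepare()/fault_area_map() calls are his proposed interfaces
(they do not exist in the kernel), and send_page_request() is a hypothetical
helper over the migration socket:

    /* sketch only -- proposed interfaces, not existing syscalls */
    void *guest_ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    fault_area_prepare(guest_ram, (char *)guest_ram + ram_size, SIGUSR1);

    /* later, in the signal handler / fault-service thread: */
    send_page_request(mig_fd, pgoff);                /* "givemepageX"  */
    recv(mig_fd, tmp_page, PAGE_SIZE, MSG_WAITALL);  /* page contents  */
    fault_area_map((char *)guest_ram + pgoff * PAGE_SIZE,
                   tmp_page, PAGE_SIZE);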

I think it should be done outside of the qemu process, for several reasons.
(I'm just repeating the same discussion from the KVM Forum because no one
remembers it.)

- ptrace (and its variants)
  Some people want to investigate guest RAM on the host (with qemu stopped
  or running live). For example, the crash utility could be enhanced to
  attach to the qemu process and debug the guest kernel.

- core dump
  The qemu process may core-dump. For postmortem analysis, people want to
  investigate guest RAM. Again, the crash utility could be enhanced to read
  the core file and analyze the guest kernel. By the time the core is
  written, the qemu process is already dead.

Handling the fault inside the qemu process precludes the above possibilities.


> THP already works on /dev/zero mmaps as long as it's a MAP_PRIVATE,
> KSM should work too but I doubt anybody tested it on MAP_PRIVATE of
> /dev/zero.

Oh, great. So it seems to work generally with anonymous pages backing a
non-anonymous VMA. Is that right?
If so, THP/KSM work with mmap(MAP_PRIVATE, /dev/umem...) too, don't they?


> The device driver provides an advantage in being self contained but I
> doubt it's simpler. I suppose after migration is complete you'll still
> switch the vma back to regular anonymous vma so leading to the same
> result?

Yes, that was my original intention.
The page is anonymous, but the vma isn't anonymous, so I was concerned that
KSM/THP wouldn't work with such pages.
If they do work, it isn't necessary to switch the VMA back to anonymous.
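
(For reference, a minimal sketch of the mapping in question, assuming the
/dev/umem node from this RFC; the real device presumably needs additional
ioctls before the mapping is usable, so this only illustrates the
MAP_PRIVATE mapping whose faulted-in pages KSM/THP would have to handle.)

    #include <err.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /* Map guest RAM from the postcopy char device with MAP_PRIVATE, so
     * the faulted-in pages are anonymous even though the vma is not. */
    static void *map_guest_ram(size_t ram_size)
    {
            int fd = open("/dev/umem", O_RDWR);
            if (fd < 0)
                    err(1, "open /dev/umem");
            void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);
            if (ram == MAP_FAILED)
                    err(1, "mmap /dev/umem");
            return ram;
    }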


> The patch 2/2 is small and self contained so it's quite attractive, I
> didn't see patch 1/2, was it posted?

Posted. It's quite short and trivial; it just adds EXPORT_SYMBOL_GPL for
mem_cgroup_cache_charge and shmem_zero_setup.
I've included it here for convenience.

From e8bfda16a845eef4381872a331c6f0f200c3f7d7 Mon Sep 17 00:00:00 2001
Message-Id: <e8bfda16a845eef4381872a331c6f0f200c3f7d7.1325055066.git.yamahata@valinux.co.jp>
In-Reply-To: <cover.1325055065.git.yamahata@valinux.co.jp>
References: <cover.1325055065.git.yamahata@valinux.co.jp>
From: Isaku Yamahata <yamahata@valinux.co.jp>
Date: Thu, 11 Aug 2011 20:05:28 +0900
Subject: [PATCH 1/2] export necessary symbols

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
---
 mm/memcontrol.c |    1 +
 mm/shmem.c      |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

Comments

Avi Kivity Jan. 12, 2012, 1:59 p.m. UTC | #1
On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> And it would be easy to convert a separated daemon process into a thread
> in qemu.
>
> I think it should be done out side of qemu process for some reasons.
> (I just repeat same discussion at the KVM-forum because no one remembers
> it)
>
> - ptrace (and its variant)
>   Some people want to investigate guest ram on host (qemu stopped or lively).
>   For example, enhance crash utility and it will attach qemu process and
>   debug guest kernel.

To debug the guest kernel you don't need to stop qemu itself.   I agree
it's a problem for qemu debugging though.

>
> - core dump
>   qemu process may core-dump.
>   As postmortem analysis, people want to investigate guest RAM.
>   Again enhance crash utility and it will read the core file and analyze
>   guest kernel.
>   When creating core, the qemu process is already dead.

Yes, strong point.

> It precludes the above possibilities to handle fault in qemu process.

I agree.
Benoit Hudzia Jan. 13, 2012, 1:09 a.m. UTC | #2
Hi,

Sorry to hijack the thread like this, but I would like to let you know
that we recently reached a milestone in the research project I'm leading:
we enhanced KVM to deliver post-copy live migration using RDMA at the
kernel level.

A few points on the architecture of the system:

* RDMA communication engine in the kernel (you can use soft-iWARP or
soft-RoCE if you don't have hardware acceleration; we also support
standard RDMA-enabled NICs).
* Pages are naturally transferred with a zero-copy protocol.
* Leverages the async page fault system.
* Pre-paging / pre-faulting.
* No context switch, as everything is handled within the kernel using
the page fault system.
* Hybrid migration (pre- + post-copy) available.
* Relies on an independent kernel module.
* No modification to the KVM kernel module.
* Minimal modification to the qemu-kvm code.
* We plan to add a page prioritization algorithm to optimise the
pre-paging and background transfer.


You can learn a little bit more and see a demo here:
http://tinyurl.com/8xa2bgl
I hope to be able to provide more detail on the design soon, as well as a
more concrete demo of the system (live migration of VMs running large
enterprise apps such as an ERP or an in-memory DB).

Note: this is just a stepping stone, as post-copy live migration mainly
enables us to validate the architecture design and code.

Regards
Benoit


On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
> On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> And it would be easy to convert a separated daemon process into a thread
>> in qemu.
>>
>> I think it should be done out side of qemu process for some reasons.
>> (I just repeat same discussion at the KVM-forum because no one remembers
>> it)
>>
>> - ptrace (and its variant)
>>   Some people want to investigate guest ram on host (qemu stopped or lively).
>>   For example, enhance crash utility and it will attach qemu process and
>>   debug guest kernel.
>
> To debug the guest kernel you don't need to stop qemu itself.   I agree
> it's a problem for qemu debugging though.
>
>>
>> - core dump
>>   qemu process may core-dump.
>>   As postmortem analysis, people want to investigate guest RAM.
>>   Again enhance crash utility and it will read the core file and analyze
>>   guest kernel.
>>   When creating core, the qemu process is already dead.
>
> Yes, strong point.
>
>> It precludes the above possibilities to handle fault in qemu process.
>
> I agree.
>
>
> --
> error compiling committee.c: too many arguments to function
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Takuya Yoshikawa Jan. 13, 2012, 1:31 a.m. UTC | #3
(2012/01/13 10:09), Benoit Hudzia wrote:
> Hi,
>
> Sorry to jump to hijack the thread  like that , however i would like
> to just to inform you  that we recently achieve a milestone out of the
> research project I'm leading. We enhanced KVM in order to deliver
> post copy live migration using RDMA at kernel level.
>
> Few point on the architecture of the system :
>
> * RDMA communication engine in kernel ( you can use soft iwarp or soft
> ROCE if you don't have hardware acceleration, however we also support
> standard RDMA enabled NIC) .
> * Naturally Page are transferred with Zerop copy protocol
> * Leverage the async page fault system.
> * Pre paging / faulting
> * No context switch as everything is handled within kernel and using
> the page fault system.
> * Hybrid migration ( pre + post copy) available
> * Rely on an independent Kernel Module
> * No modification to the KVM kernel Module
> * Minimal Modification to the Qemu-Kvm code
> * We plan to add the page prioritization algo in order to optimise the
> pre paging algo and background transfer
>
>
> You can learn a little bit more and see a demo here:
> http://tinyurl.com/8xa2bgl
> I hope to be able to provide more detail on the design soon. As well
> as more concrete demo of the system ( live migration of VM running
> large  enterprise apps such as ERP or In memory DB)
>
> Note: this is just a step stone as the post copy live migration mainly
> enable us to validate the architecture design and  code.

Do you have any plan to send the patch series of your implementation?

	Takuya
Isaku Yamahata Jan. 13, 2012, 2:03 a.m. UTC | #4
Very interesting. We can cooperate on better (postcopy) live migration.
The code doesn't seem to be available yet; I'm eager to see it.


On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
> Hi,
> 
> Sorry to jump to hijack the thread  like that , however i would like
> to just to inform you  that we recently achieve a milestone out of the
> research project I'm leading. We enhanced KVM in order to deliver
> post copy live migration using RDMA at kernel level.
> 
> Few point on the architecture of the system :
> 
> * RDMA communication engine in kernel ( you can use soft iwarp or soft
> ROCE if you don't have hardware acceleration, however we also support
> standard RDMA enabled NIC) .

Do you mean the InfiniBand subsystem?


> * Naturally Page are transferred with Zerop copy protocol
> * Leverage the async page fault system.
> * Pre paging / faulting
> * No context switch as everything is handled within kernel and using
> the page fault system.
> * Hybrid migration ( pre + post copy) available

Ah, I've also been planning this.
After the pre-copy phase, is the dirty bitmap sent?

So far I had naively assumed that the pre-copy phase would be terminated
after a number of iterations. Your choice, on the other hand, is a timeout
for the pre-copy phase. Do you have a rationale, or was it just natural for you?


> * Rely on an independent Kernel Module
> * No modification to the KVM kernel Module
> * Minimal Modification to the Qemu-Kvm code
> * We plan to add the page prioritization algo in order to optimise the
> pre paging algo and background transfer

Where do you plan to implement it: in qemu or in your kernel module?
This algorithm could be shared.

thanks in advance.

> You can learn a little bit more and see a demo here:
> http://tinyurl.com/8xa2bgl
> I hope to be able to provide more detail on the design soon. As well
> as more concrete demo of the system ( live migration of VM running
> large  enterprise apps such as ERP or In memory DB)
> 
> Note: this is just a step stone as the post copy live migration mainly
> enable us to validate the architecture design and  code.
> 
> Regards
> Benoit
> 
> 
> 
> 
> 
> 
> 
> Regards
> Benoit
> 
> 
> On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
> > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> >> And it would be easy to convert a separated daemon process into a thread
> >> in qemu.
> >>
> >> I think it should be done out side of qemu process for some reasons.
> >> (I just repeat same discussion at the KVM-forum because no one remembers
> >> it)
> >>
> >> - ptrace (and its variant)
> >>   Some people want to investigate guest ram on host (qemu stopped or lively).
> >>   For example, enhance crash utility and it will attach qemu process and
> >>   debug guest kernel.
> >
> > To debug the guest kernel you don't need to stop qemu itself.   I agree
> > it's a problem for qemu debugging though.
> >
> >>
> >> - core dump
> >>   qemu process may core-dump.
> >>   As postmortem analysis, people want to investigate guest RAM.
> >>   Again enhance crash utility and it will read the core file and analyze
> >>   guest kernel.
> >>   When creating core, the qemu process is already dead.
> >
> > Yes, strong point.
> >
> >> It precludes the above possibilities to handle fault in qemu process.
> >
> > I agree.
> >
> >
> > --
> > error compiling committee.c: too many arguments to function
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> " The production of too many useful things results in too many useless people"
>
Andrea Arcangeli Jan. 13, 2012, 2:09 a.m. UTC | #5
On Thu, Jan 12, 2012 at 03:59:59PM +0200, Avi Kivity wrote:
> On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> > Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> > And it would be easy to convert a separated daemon process into a thread
> > in qemu.
> >
> > I think it should be done out side of qemu process for some reasons.
> > (I just repeat same discussion at the KVM-forum because no one remembers
> > it)
> >
> > - ptrace (and its variant)
> >   Some people want to investigate guest ram on host (qemu stopped or lively).
> >   For example, enhance crash utility and it will attach qemu process and
> >   debug guest kernel.
> 
> To debug the guest kernel you don't need to stop qemu itself.   I agree
> it's a problem for qemu debugging though.

But don't you need to debug postcopy migration with gdb too? I don't
see a big benefit in trying to prevent gdb from seeing what is really
going on in the qemu image.

> > - core dump
> >   qemu process may core-dump.
> >   As postmortem analysis, people want to investigate guest RAM.
> >   Again enhance crash utility and it will read the core file and analyze
> >   guest kernel.
> >   When creating core, the qemu process is already dead.
> 
> Yes, strong point.
> 
> > It precludes the above possibilities to handle fault in qemu process.
> 
> I agree.

On the receiving node, if the memory is not there yet (and it isn't),
I'm not sure how you plan to get a clean core dump (as if live migration
weren't running), i.e. to prevent the kernel from dumping zeroes, if qemu
crashes during post-copy migration. Surely it won't be the kernel crash
handler that completes the post-copy migration; it won't even know where
to write the data in memory.
Isaku Yamahata Jan. 13, 2012, 2:15 a.m. UTC | #6
One more question.
Does your architecture/implementation (in theory) allow KVM memory
features like swap, KSM, THP?


On Fri, Jan 13, 2012 at 11:03:23AM +0900, Isaku Yamahata wrote:
> Very interesting. We can cooperate for better (postcopy) live migration.
> The code doesn't seem available yet, I'm eager for it.
> 
> 
> On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
> > Hi,
> > 
> > Sorry to jump to hijack the thread  like that , however i would like
> > to just to inform you  that we recently achieve a milestone out of the
> > research project I'm leading. We enhanced KVM in order to deliver
> > post copy live migration using RDMA at kernel level.
> > 
> > Few point on the architecture of the system :
> > 
> > * RDMA communication engine in kernel ( you can use soft iwarp or soft
> > ROCE if you don't have hardware acceleration, however we also support
> > standard RDMA enabled NIC) .
> 
> Do you mean infiniband subsystem?
> 
> 
> > * Naturally Page are transferred with Zerop copy protocol
> > * Leverage the async page fault system.
> > * Pre paging / faulting
> > * No context switch as everything is handled within kernel and using
> > the page fault system.
> > * Hybrid migration ( pre + post copy) available
> 
> Ah, I've been also planing this.
> After pre-copy phase, is the dirty bitmap sent?
> 
> So far I've thought naively that pre-copy phase would be finished by the
> number of iterations. On the other hand your choice is timeout of
> pre-copy phase. Do you have rationale? or it was just natural for you?
> 
> 
> > * Rely on an independent Kernel Module
> > * No modification to the KVM kernel Module
> > * Minimal Modification to the Qemu-Kvm code
> > * We plan to add the page prioritization algo in order to optimise the
> > pre paging algo and background transfer
> 
> Where do you plan to implement? in qemu or in your kernel module?
> This algo could be shared.
> 
> thanks in advance.
> 
> > You can learn a little bit more and see a demo here:
> > http://tinyurl.com/8xa2bgl
> > I hope to be able to provide more detail on the design soon. As well
> > as more concrete demo of the system ( live migration of VM running
> > large  enterprise apps such as ERP or In memory DB)
> > 
> > Note: this is just a step stone as the post copy live migration mainly
> > enable us to validate the architecture design and  code.
> > 
> > Regards
> > Benoit
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > Regards
> > Benoit
> > 
> > 
> > On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
> > > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
> > >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
> > >> And it would be easy to convert a separated daemon process into a thread
> > >> in qemu.
> > >>
> > >> I think it should be done out side of qemu process for some reasons.
> > >> (I just repeat same discussion at the KVM-forum because no one remembers
> > >> it)
> > >>
> > >> - ptrace (and its variant)
> > >>   Some people want to investigate guest ram on host (qemu stopped or lively).
> > >>   For example, enhance crash utility and it will attach qemu process and
> > >>   debug guest kernel.
> > >
> > > To debug the guest kernel you don't need to stop qemu itself.   I agree
> > > it's a problem for qemu debugging though.
> > >
> > >>
> > >> - core dump
> > >>   qemu process may core-dump.
> > >>   As postmortem analysis, people want to investigate guest RAM.
> > >>   Again enhance crash utility and it will read the core file and analyze
> > >>   guest kernel.
> > >>   When creating core, the qemu process is already dead.
> > >
> > > Yes, strong point.
> > >
> > >> It precludes the above possibilities to handle fault in qemu process.
> > >
> > > I agree.
> > >
> > >
> > > --
> > > error compiling committee.c: too many arguments to function
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> > 
> > -- 
> > " The production of too many useful things results in too many useless people"
> > 
> 
> -- 
> yamahata
>
Benoit Hudzia Jan. 13, 2012, 9:40 a.m. UTC | #7
Yes, we plan to release the patches as soon as we have cleaned up the code
and we get the green light from our company (and sadly that can take
months).

On 13 January 2012 01:31, Takuya Yoshikawa
<yoshikawa.takuya@oss.ntt.co.jp> wrote:
> (2012/01/13 10:09), Benoit Hudzia wrote:
>>
>> Hi,
>>
>> Sorry to jump to hijack the thread  like that , however i would like
>> to just to inform you  that we recently achieve a milestone out of the
>> research project I'm leading. We enhanced KVM in order to deliver
>> post copy live migration using RDMA at kernel level.
>>
>> Few point on the architecture of the system :
>>
>> * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> ROCE if you don't have hardware acceleration, however we also support
>> standard RDMA enabled NIC) .
>> * Naturally Page are transferred with Zerop copy protocol
>> * Leverage the async page fault system.
>> * Pre paging / faulting
>> * No context switch as everything is handled within kernel and using
>> the page fault system.
>> * Hybrid migration ( pre + post copy) available
>> * Rely on an independent Kernel Module
>> * No modification to the KVM kernel Module
>> * Minimal Modification to the Qemu-Kvm code
>> * We plan to add the page prioritization algo in order to optimise the
>> pre paging algo and background transfer
>>
>>
>> You can learn a little bit more and see a demo here:
>> http://tinyurl.com/8xa2bgl
>> I hope to be able to provide more detail on the design soon. As well
>> as more concrete demo of the system ( live migration of VM running
>> large  enterprise apps such as ERP or In memory DB)
>>
>> Note: this is just a step stone as the post copy live migration mainly
>> enable us to validate the architecture design and  code.
>
>
> Do you have any plan to send the patch series of your implementation?
>
>        Takuya
Benoit Hudzia Jan. 13, 2012, 9:48 a.m. UTC | #8
On 13 January 2012 02:03, Isaku Yamahata <yamahata@valinux.co.jp> wrote:
> Very interesting. We can cooperate for better (postcopy) live migration.
> The code doesn't seem available yet, I'm eager for it.
>
>
> On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
>> Hi,
>>
>> Sorry to jump to hijack the thread  like that , however i would like
>> to just to inform you  that we recently achieve a milestone out of the
>> research project I'm leading. We enhanced KVM in order to deliver
>> post copy live migration using RDMA at kernel level.
>>
>> Few point on the architecture of the system :
>>
>> * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> ROCE if you don't have hardware acceleration, however we also support
>> standard RDMA enabled NIC) .
>
> Do you mean infiniband subsystem?

Yes, basically any software or hardware implementation that supports
the standard RDMA / OFED verbs stack in the kernel.
>
>
>> * Naturally Page are transferred with Zerop copy protocol
>> * Leverage the async page fault system.
>> * Pre paging / faulting
>> * No context switch as everything is handled within kernel and using
>> the page fault system.
>> * Hybrid migration ( pre + post copy) available
>
> Ah, I've been also planing this.
> After pre-copy phase, is the dirty bitmap sent?

Yes, we send over the dirty bitmap, in order to identify what is left
to be transferred. Combined with the priority algorithm, we then
prioritise the pages for the background transfer.
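
(Sketch only, with hypothetical queue_background_pull()/page_priority()
helpers: the destination walks the received dirty bitmap and queues the
still-dirty pages for background transfer in priority order.)

    /* sketch: schedule background pulls from the received dirty bitmap */
    for (unsigned long pgoff = 0; pgoff < nr_pages; pgoff++)
            if (test_bit(pgoff, dirty_bitmap))
                    queue_background_pull(pgoff, page_priority(pgoff));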

>
> So far I've thought naively that pre-copy phase would be finished by the
> number of iterations. On the other hand your choice is timeout of
> pre-copy phase. Do you have rationale? or it was just natural for you?


The main rationale is that the typical sysadmin is human, and a live
migration iteration count has no meaning to them. As a result we
preferred to provide a time constraint rather than an iteration
constraint. Also, it is hard to estimate how much time and bandwidth
each iteration cycle will use, which leads to poor determinism.
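
(A tiny illustration of the difference, with hypothetical
send_dirty_pages()/send_dirty_bitmap() helpers: the pre-copy loop is
bounded by wall-clock time rather than by an iteration count, and the
remaining dirty bitmap is then handed over to the post-copy stage.)

    /* sketch: time-bounded pre-copy, then hand-off to post-copy */
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
            send_dirty_pages(mig_fd);            /* one pre-copy pass   */
            clock_gettime(CLOCK_MONOTONIC, &now);
    } while (now.tv_sec - start.tv_sec < precopy_timeout_sec);
    send_dirty_bitmap(mig_fd);                   /* what is still dirty */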

>
>
>> * Rely on an independent Kernel Module
>> * No modification to the KVM kernel Module
>> * Minimal Modification to the Qemu-Kvm code
>> * We plan to add the page prioritization algo in order to optimise the
>> pre paging algo and background transfer
>
> Where do you plan to implement? in qemu or in your kernel module?
> This algo could be shared.

Yes, we actually plan to release the algorithm first, before the RDMA
post-copy. The algorithm can be used for standard optimisation of the
normal pre-copy process (as demonstrated in my talk at the KVM Forum),
and with the priority reversed it drives the post-copy page pull. My
colleague Aidan Shribman is done with the implementation and we are now
in the testing phase, quantifying the improvement.


>
> thanks in advance.
>
>> You can learn a little bit more and see a demo here:
>> http://tinyurl.com/8xa2bgl
>> I hope to be able to provide more detail on the design soon. As well
>> as more concrete demo of the system ( live migration of VM running
>> large  enterprise apps such as ERP or In memory DB)
>>
>> Note: this is just a step stone as the post copy live migration mainly
>> enable us to validate the architecture design and  code.
>>
>> Regards
>> Benoit
>>
>>
>>
>>
>>
>>
>>
>> Regards
>> Benoit
>>
>>
>> On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
>> > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> >> And it would be easy to convert a separated daemon process into a thread
>> >> in qemu.
>> >>
>> >> I think it should be done out side of qemu process for some reasons.
>> >> (I just repeat same discussion at the KVM-forum because no one remembers
>> >> it)
>> >>
>> >> - ptrace (and its variant)
>> >>   Some people want to investigate guest ram on host (qemu stopped or lively).
>> >>   For example, enhance crash utility and it will attach qemu process and
>> >>   debug guest kernel.
>> >
>> > To debug the guest kernel you don't need to stop qemu itself.   I agree
>> > it's a problem for qemu debugging though.
>> >
>> >>
>> >> - core dump
>> >>   qemu process may core-dump.
>> >>   As postmortem analysis, people want to investigate guest RAM.
>> >>   Again enhance crash utility and it will read the core file and analyze
>> >>   guest kernel.
>> >>   When creating core, the qemu process is already dead.
>> >
>> > Yes, strong point.
>> >
>> >> It precludes the above possibilities to handle fault in qemu process.
>> >
>> > I agree.
>> >
>> >
>> > --
>> > error compiling committee.c: too many arguments to function
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> " The production of too many useful things results in too many useless people"
>>
>
> --
> yamahata
Benoit Hudzia Jan. 13, 2012, 9:55 a.m. UTC | #9
On 13 January 2012 02:15, Isaku Yamahata <yamahata@valinux.co.jp> wrote:
> One more question.
> Does your architecture/implementation (in theory) allow KVM memory
> features like swap, KSM, THP?

* Swap: yes, we support swap to disk (the page is pulled from swap before
being sent over); the swap process does its job on the other side.
* KSM: same, we support KSM. A KSM-shared page is broken down and split,
and the copies are sent individually (yes, sub-optimal, but it keeps the
protocol less messy), and we let the KSM daemon do its job on the other
side.
* THP: stickier here. Due to time constraints we decided to support it
only partially. What that means: if we encounter a THP we break it down
to standard page granularity, as that is the memory unit we currently
manipulate (see the sketch below). As a result you can have THP on the
source, but you won't have THP on the other side.
  Note: we didn't fully explore the ramifications of THP with RDMA; I
don't know whether THP plays well with the MMU of a hardware RDMA NIC.
One thing I would like to explore is whether it is possible to break a
THP down into standard pages and then reassemble it on the other side
(does anyone know whether it is possible to aggregate pages to form a
THP in the kernel?).
* cgroup: should work transparently, but we need to do more testing to
confirm that.
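
(As a rough kernel-side illustration of the THP point above -- a sketch
only, since the module code is not public yet: a transparent huge page can
be split back to base pages with split_huge_page() before transferring at
PAGE_SIZE granularity; transfer_one_page() is a hypothetical helper.)

    /* sketch: force base-page granularity before transfer; if the split
     * fails the page is busy and the transfer must be retried */
    if (PageTransHuge(page) && split_huge_page(page))
            return -EAGAIN;
    transfer_one_page(page);    /* now an ordinary PAGE_SIZE page */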




>
>
> On Fri, Jan 13, 2012 at 11:03:23AM +0900, Isaku Yamahata wrote:
>> Very interesting. We can cooperate for better (postcopy) live migration.
>> The code doesn't seem available yet, I'm eager for it.
>>
>>
>> On Fri, Jan 13, 2012 at 01:09:30AM +0000, Benoit Hudzia wrote:
>> > Hi,
>> >
>> > Sorry to jump to hijack the thread  like that , however i would like
>> > to just to inform you  that we recently achieve a milestone out of the
>> > research project I'm leading. We enhanced KVM in order to deliver
>> > post copy live migration using RDMA at kernel level.
>> >
>> > Few point on the architecture of the system :
>> >
>> > * RDMA communication engine in kernel ( you can use soft iwarp or soft
>> > ROCE if you don't have hardware acceleration, however we also support
>> > standard RDMA enabled NIC) .
>>
>> Do you mean infiniband subsystem?
>>
>>
>> > * Naturally Page are transferred with Zerop copy protocol
>> > * Leverage the async page fault system.
>> > * Pre paging / faulting
>> > * No context switch as everything is handled within kernel and using
>> > the page fault system.
>> > * Hybrid migration ( pre + post copy) available
>>
>> Ah, I've been also planing this.
>> After pre-copy phase, is the dirty bitmap sent?
>>
>> So far I've thought naively that pre-copy phase would be finished by the
>> number of iterations. On the other hand your choice is timeout of
>> pre-copy phase. Do you have rationale? or it was just natural for you?
>>
>>
>> > * Rely on an independent Kernel Module
>> > * No modification to the KVM kernel Module
>> > * Minimal Modification to the Qemu-Kvm code
>> > * We plan to add the page prioritization algo in order to optimise the
>> > pre paging algo and background transfer
>>
>> Where do you plan to implement? in qemu or in your kernel module?
>> This algo could be shared.
>>
>> thanks in advance.
>>
>> > You can learn a little bit more and see a demo here:
>> > http://tinyurl.com/8xa2bgl
>> > I hope to be able to provide more detail on the design soon. As well
>> > as more concrete demo of the system ( live migration of VM running
>> > large  enterprise apps such as ERP or In memory DB)
>> >
>> > Note: this is just a step stone as the post copy live migration mainly
>> > enable us to validate the architecture design and  code.
>> >
>> > Regards
>> > Benoit
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > Regards
>> > Benoit
>> >
>> >
>> > On 12 January 2012 13:59, Avi Kivity <avi@redhat.com> wrote:
>> > > On 01/04/2012 05:03 AM, Isaku Yamahata wrote:
>> > >> Yes, it's quite doable in user space(qemu) with a kernel-enhancement.
>> > >> And it would be easy to convert a separated daemon process into a thread
>> > >> in qemu.
>> > >>
>> > >> I think it should be done out side of qemu process for some reasons.
>> > >> (I just repeat same discussion at the KVM-forum because no one remembers
>> > >> it)
>> > >>
>> > >> - ptrace (and its variant)
>> > >>   Some people want to investigate guest ram on host (qemu stopped or lively).
>> > >>   For example, enhance crash utility and it will attach qemu process and
>> > >>   debug guest kernel.
>> > >
>> > > To debug the guest kernel you don't need to stop qemu itself.   I agree
>> > > it's a problem for qemu debugging though.
>> > >
>> > >>
>> > >> - core dump
>> > >>   qemu process may core-dump.
>> > >>   As postmortem analysis, people want to investigate guest RAM.
>> > >>   Again enhance crash utility and it will read the core file and analyze
>> > >>   guest kernel.
>> > >>   When creating core, the qemu process is already dead.
>> > >
>> > > Yes, strong point.
>> > >
>> > >> It precludes the above possibilities to handle fault in qemu process.
>> > >
>> > > I agree.
>> > >
>> > >
>> > > --
>> > > error compiling committee.c: too many arguments to function
>> > >
>> > > --
>> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > > the body of a message to majordomo@vger.kernel.org
>> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> >
>> >
>> > --
>> > " The production of too many useful things results in too many useless people"
>> >
>>
>> --
>> yamahata
>>
>
> --
> yamahata

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..85530fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2807,6 +2807,7 @@  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(mem_cgroup_cache_charge);
 
 /*
  * While swap-in, try_charge -> commit or cancel, the page is locked.
diff --git a/mm/shmem.c b/mm/shmem.c
index d672250..d137a37 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2546,6 +2546,7 @@  int shmem_zero_setup(struct vm_area_struct *vma)
 	vma->vm_flags |= VM_CAN_NONLINEAR;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(shmem_zero_setup);
 
 /**
  * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags.