diff mbox

[RFC RDMA support v4: 03/10] more verbose documentation of the RDMA transport

Message ID 1363576743-6146-4-git-send-email-mrhines@linux.vnet.ibm.com
State New
Headers show

Commit Message

mrhines@linux.vnet.ibm.com March 18, 2013, 3:18 a.m. UTC
From: "Michael R. Hines" <mrhines@us.ibm.com>

This tries to cover all the questions I got the last time.

Please do tell me what is not clear, and I'll revise again.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt |  208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)
 create mode 100644 docs/rdma.txt

Comments

Michael S. Tsirkin March 18, 2013, 10:40 a.m. UTC | #1
On Sun, Mar 17, 2013 at 11:18:56PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This tries to cover all the questions I got the last time.
> 
> Please do tell me what is not clear, and I'll revise again.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt |  208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
>  create mode 100644 docs/rdma.txt
> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..2a48ab0
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,208 @@
> +Changes since v3:
> +
> +- Compile-tested with and without --enable-rdma is working.
> +- Updated docs/rdma.txt (included below)
> +- Merged with latest pull queue from Paolo
> +- Implemented qemu_ram_foreach_block()
> +
> +mrhines@mrhinesdev:~/qemu$ git diff --stat master
> +Makefile.objs                 |    1 +
> +arch_init.c                   |   28 +-
> +configure                     |   25 ++
> +docs/rdma.txt                 |  190 +++++++++++
> +exec.c                        |   21 ++
> +include/exec/cpu-common.h     |    6 +
> +include/migration/migration.h |    3 +
> +include/migration/qemu-file.h |   10 +
> +include/migration/rdma.h      |  269 ++++++++++++++++
> +include/qemu/sockets.h        |    1 +
> +migration-rdma.c              |  205 ++++++++++++
> +migration.c                   |   19 +-
> +rdma.c                        | 1511 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +savevm.c                      |  172 +++++++++-
> +util/qemu-sockets.c           |    2 +-
> +15 files changed, 2445 insertions(+), 18 deletions(-)


Above looks strange :)

> +QEMUFileRDMA:

I think there are two things here: API documentation
and protocol documentation, and the protocol documentation
still needs some more work. Also, if what I understand
from this document is correct, this breaks memory overcommit
on the destination, which needs to be fixed.


> +==================================
> +
> +QEMUFileRDMA introduces a couple of new functions:
> +
> +1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> +
> +These two functions provide an RDMA transport
> +(not a protocol) without changing the upper-level
> +users of QEMUFile that depend on a bytestream abstraction.
> +
> +In order to provide the same bytestream interface 
> +for RDMA, we use SEND messages instead of sockets.
> +The operations themselves and the protocol built on 
> +top of QEMUFile used throughout the migration 
> +process do not change whatsoever.
> +
> +An infiniband SEND message is the standard ibverbs
> +message used by applications of infiniband hardware.
> +The only difference between a SEND message and an RDMA
> +message is that SEND messages cause completion notifications
> +to be posted to the completion queue (CQ) on the 
> +infiniband receiver side, whereas RDMA messages (used
> +for pc.ram) do not (to behave like an actual DMA).
> +    
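
To make the distinction concrete, here is a rough ibverbs sketch (not code
from this patch); the queue pair, memory region, and the peer's
remote_addr/rkey are assumed to exist:

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: post either a SEND or an RDMA write of len bytes at buf.
 * qp and mr must already be set up; remote_addr/rkey come from the
 * peer and matter only in the RDMA write case. */
static int post_message(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey,
                        int rdma_write)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        /* A SEND consumes an RQ entry and completes on the receiver's
         * CQ; an RDMA write deposits the bytes silently, like a DMA. */
        .opcode     = rdma_write ? IBV_WR_RDMA_WRITE : IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;

    if (rdma_write) {
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
    }
    return ibv_post_send(qp, &wr, &bad_wr);
}
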
> +Messages in infiniband require two things:
> +
> +1. registration of the memory that will be transmitted
> +2. (SEND only) work requests to be posted on both
> +   sides of the network before the actual transmission
> +   can occur.
> +
> +RDMA messages are much easier to deal with. Once the memory
> +on the receiver side is registered and pinned, we're
> +basically done. All that is required is for the sender
> +side to start dumping bytes onto the link.
> +
> +SEND messages require more coordination because the
> +receiver must have reserved space (using a receive
> +work request) on the receive queue (RQ) before QEMUFileRDMA
> +can start using them to carry all the bytes as
> +a transport for migration of device state.
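
A minimal sketch of posting such a receive work request with ibverbs
(illustrative only; qp and mr are assumed to be set up elsewhere):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Reserve space on the RQ so a peer's SEND has somewhere to land.
 * Sketch only, not the patch's code. */
static int post_rq_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t)buf,   /* so the CQE identifies the buffer */
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr;

    return ibv_post_recv(qp, &wr, &bad_wr);
}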
> +
> +After the initial connection setup (migration-rdma.c),

Is there any feature and/or version negotiation? How are we going to
handle compatibility when we extend the protocol?

> +this coordination starts by having both sides post
> +a single work request to the RQ before any users
> +of QEMUFile are activated.

So how does the destination know it's OK to send anything
to the source?
I suspect this is wrong. When using CM you must post
on RQ before completing the connection negotiation,
not after it's done.

> +
> +Once an initial receive work request is posted,
> +we have a put_buffer()/get_buffer() implementation
> +that looks like this:
> +
> +Logically:
> +
> +qemu_rdma_get_buffer():
> +
> +1. A user on top of QEMUFile calls ops->get_buffer(),
> +   which calls us.
> +2. We transmit an empty SEND to let the sender know that 
> +   we are *ready* to receive some bytes from QEMUFileRDMA.
> +   These bytes will come in the form of another SEND.
> +3. Before attempting to receive that SEND, we post another
> +   RQ work request to replace the one we just used up.
> +4. Block on a CQ event channel and wait for the SEND
> +   to arrive.
> +5. When the send arrives, librdmacm will unblock us
> +   and we can consume the bytes (described later).
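
Put together, a sketch of that sequence; every type and helper name
below is an invented placeholder, not code from the patch:

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

struct rdma_chan;                                   /* placeholder type */
void send_ready_message(struct rdma_chan *c);       /* placeholder      */
void repost_rq_buffer(struct rdma_chan *c);         /* placeholder      */
void wait_for_cq_event(struct rdma_chan *c);        /* placeholder      */
ssize_t consume_bytes(struct rdma_chan *c, uint8_t *dst, size_t want);
ssize_t send_payload(struct rdma_chan *c, const uint8_t *src, size_t len);

/* Sketch of the get_buffer() sequence above. */
static ssize_t rdma_get_bytes(struct rdma_chan *c, uint8_t *dst, size_t want)
{
    send_ready_message(c);   /* step 2: empty "ready" SEND              */
    repost_rq_buffer(c);     /* step 3: replace the used RQ slot        */
    wait_for_cq_event(c);    /* steps 4-5: block until the SEND arrives */
    return consume_bytes(c, dst, want);  /* handoff (described later)   */
}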

Using an empty message seems somewhat hacky, a fixed header in the
message would let you do more things if protocol is ever extended.

> +qemu_rdma_put_buffer(): 
> +
> +1. A user on top of QEMUFile calls ops->put_buffer(),
> +   which calls us.
> +2. Block on the CQ event channel waiting for a SEND
> +   from the receiver to tell us that the receiver
> +   is *ready* for us to transmit some new bytes.
> +3. When the "ready" SEND arrives, librdmacm will 
> +   unblock us and we immediately post a RQ work request
> +   to replace the one we just used up.
> +4. Now, we can actually deliver the bytes that
> +   put_buffer() wants and return. 
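
And the mirror-image sketch for the sender, with the same caveat that
the names are placeholders declared in the get_buffer() sketch above:

/* Sketch of the put_buffer() sequence above. */
static ssize_t rdma_put_bytes(struct rdma_chan *c,
                              const uint8_t *src, size_t len)
{
    wait_for_cq_event(c);    /* steps 2-3: block until the receiver's
                              * "ready" SEND arrives...                 */
    repost_rq_buffer(c);     /* ...then replace the used RQ slot        */
    return send_payload(c, src, len);    /* step 4: SEND the bytes      */
}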

OK to summarize flow control: at any time there's
either 0 or 1 outstanding buffers in RQ.
At each time only one side can talk.
Destination always goes first, then source, etc.
At each time a single send message can be passed.


Just FYI, this means you are often at 0 buffers in RQ and IIRC 0 buffers
is a worst-case path for infiniband. It's better to keep at least 1
buffer in RQ at all times, so prepost 2 initially so it would fluctuate
between 1 and 2.
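
Concretely, the suggestion amounts to something like this at setup time
(sketch; post_rq_buffer() is the hypothetical helper from the earlier
receive-side sketch):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Prepost two RQ buffers so the count fluctuates between 1 and 2
 * rather than 0 and 1. */
static int prepost_rq_buffers(struct ibv_qp *qp, struct ibv_mr *mr,
                              void *bufs[2], uint32_t len)
{
    for (int i = 0; i < 2; i++) {
        if (post_rq_buffer(qp, mr, bufs[i], len) < 0) {
            return -1;
        }
    }
    return 0;
}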

> +
> +NOTE: This entire sequence of events is designed this
> +way to mimic the operations of a bytestream and is not
> +typical of an infiniband application. (Something like MPI
> +would not 'ping-pong' messages like this and would not
> +block after every request, which would normally defeat
> +the purpose of using zero-copy infiniband in the first place).
> +
> +Finally, how do we handoff the actual bytes to get_buffer()?
> +
> +Again, because we're trying to "fake" a bytestream abstraction
> +using an analogy not unlike individual UDP frames, we have
> +to hold on to the bytes received from SEND in memory.
> +
> +Each time we get to "Step 5" above for get_buffer(),
> +the bytes from SEND are copied into a local holding buffer.
> +
> +Then, we return the number of bytes requested by get_buffer()
> +and leave the remaining bytes in the buffer until get_buffer()
> +comes around for another pass.
> +
> +If the buffer is empty, then we follow the same steps
> +listed above for qemu_rdma_get_buffer() and block waiting
> +for another SEND message to re-fill the buffer.
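
A sketch of that holding-buffer handoff; the struct and its fields are
invented for illustration:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct holding_buffer {
    uint8_t data[32768];   /* bytes from the last SEND            */
    size_t  len;           /* how many of them are valid          */
    size_t  off;           /* how many get_buffer() consumed      */
};

/* Return up to `want` bytes from the holding buffer; when it runs
 * dry, the caller falls back to blocking for another SEND. */
static size_t drain_holding_buffer(struct holding_buffer *hb,
                                   uint8_t *dst, size_t want)
{
    size_t avail = hb->len - hb->off;
    size_t n = want < avail ? want : avail;

    memcpy(dst, hb->data + hb->off, n);
    hb->off += n;
    if (hb->off == hb->len) {
        hb->off = hb->len = 0;   /* empty: next call blocks for a SEND */
    }
    return n;
}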
> +
> +Migration of pc.ram:
> +===============================
> +
> +At the beginning of the migration, (migration-rdma.c),
> +the sender and the receiver populate the list of RAMBlocks
> +to be registered with each other into a structure.

Could you add the packet format here as well please?
Need to document endian-ness etc.

> +Then, using a single SEND message, they exchange this
> +structure with each other, to be used later during the
> +iteration of main memory. This structure includes a list
> +of all the RAMBlocks, their offsets and lengths.
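
The patch does not pin down this wire format; as a sketch of the kind of
description being requested, one entry might look like this, where the
field names, sizes, and big-endian convention are all assumptions:

#include <stdint.h>
#include <endian.h>

/* Hypothetical on-the-wire layout for one RAMBlock entry in the
 * SEND-based exchange; nothing like this is specified yet. All
 * multi-byte fields big-endian. */
struct rdma_ram_block_desc {
    uint64_t offset;       /* block offset in the ram_addr_t space */
    uint64_t length;       /* block length in bytes                */
    uint8_t  idstr[256];   /* NUL-terminated RAMBlock name         */
} __attribute__((packed));

static void desc_to_wire(struct rdma_ram_block_desc *d)
{
    d->offset = htobe64(d->offset);
    d->length = htobe64(d->length);
}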

This basically means that all memory on the destination has to be registered
upfront.  A typical guest has gigabytes of memory; IMHO that's too much
memory to have pinned.

> +
> +Main memory is not migrated with SEND infiniband 
> +messages, but is instead migrated with RDMA infiniband
> +messages.
> +
> +Memory is migrated in "chunks" (about 64 pages right now).
> +Chunk size is not dynamic, but it could be in a future
> +implementation.
> +
> +When a total of 64 pages (or a flush()) are aggregated,
> +the memory backed by the chunk on the sender side is
> +registered with librdmacm and pinned in memory.
> +
> +After pinning, an RDMA write is generated and transmitted
> +for the entire chunk.
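
A sketch of that register-then-write step for one chunk; not the patch's
code, and remote_addr/rkey are assumed to have been learned from the
RAMBlock exchange described above:

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

static int rdma_write_chunk(struct ibv_pd *pd, struct ibv_qp *qp,
                            void *chunk, size_t len,
                            uint64_t remote_addr, uint32_t rkey)
{
    /* Pin and register the chunk. A write *source* needs no special
     * access flags (see the access-flags discussion later in the
     * thread). */
    struct ibv_mr *mr = ibv_reg_mr(pd, chunk, len, 0);
    if (!mr) {
        return -1;
    }

    struct ibv_sge sge = {
        .addr   = (uintptr_t)chunk,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,  /* no receiver-side CQ event */
        .send_flags = IBV_SEND_SIGNALED,  /* local completion only     */
    };
    struct ibv_send_wr *bad_wr;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}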

I think something chunk-based on the destination side is required
as well. You also can't trust the source to tell you
the chunk size; it could be malicious and ask for too much.
Maybe source gives chunk size hint and destination responds
with what it wants to use.


> +Error-handling:
> +===============================
> +
> +Infiniband has what is called a "Reliable, Connected"
> +link (one of 4 choices). This is the mode
> +we use for RDMA migration.
> +
> +If a *single* message fails,
> +the decision is to abort the migration entirely and
> +cleanup all the RDMA descriptors and unregister all
> +the memory.
> +
> +After cleanup, the Virtual Machine is returned to normal
> +operation the same way that would happen if the TCP
> +socket is broken during a non-RDMA based migration.

Yes but we also need to report errors detected during migration.
Need to document how this is done.
We also need to report success.

> +
> +USAGE
> +===============================
> +
> +Compiling:
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +
> +$ make
> +
> +Command-line on the Source machine AND Destination:
> +
> +$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps infiniband link performing a worst-case stress test:
> +
> +1. Average worst-case RDMA throughput, with
> +   $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 30 gbps (a little better than the paper)
> +2. Average worst-case TCP throughput, with
> +   $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 8 gbps (using IPoIB, IP over Infiniband)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) shows additional performance details
> +linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> -- 
> 1.7.10.4
mrhines@linux.vnet.ibm.com March 18, 2013, 8:24 p.m. UTC | #2
On 03/18/2013 06:40 AM, Michael S. Tsirkin wrote:
> I think there are two things here, API documentation and protocol 
> documentation, protocol documentation still needs some more work. Also 
> if what I understand from this document is correct this breaks memory 
> overcommit on destination which needs to be fixed.
>
> I think something chunk-based on the destination side is required as 
> well. You also can't trust the source to tell you the chunk size it 
> could be malicious and ask for too much. Maybe source gives chunk size 
> hint and destination responds with what it wants to use. 

Do we allow ballooning *during* the live migration? Is that necessary?

Would it be sufficient to inform the destination which pages are ballooned
and then only register the ones that the VM actually owns?

> Is there any feature and/or version negotiation? How are we going to
> handle compatibility when we extend the protocol?
You mean, on top of the protocol versioning that's already
built in to QEMUFile, inside qemu_savevm_state_begin()?

Should I piggy-back an additional protocol version number
before QEMUFile sends its version number?

> So how does destination know it's ok to send anything to source? I 
> suspect this is wrong. When using CM you must post on RQ before 
> completing the connection negotiation, not after it's done. 

This is already handled by the RDMA connection manager (librdmacm).

The library already has functions like listen() and accept() the same
way that TCP does.

Once these functions return success, we have a guarantee that both
sides of the connection have already posted the appropriate work
requests sufficient for driving the migration.
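
For reference, the librdmacm calls being referred to look roughly like
this on the passive (receiver) side; a simplified sketch with error
handling and QP creation elided:

#include <rdma/rdma_cma.h>

/* Passive-side connection setup with librdmacm. Real code must check
 * every return value and build the QP before accepting. */
static struct rdma_cm_id *passive_accept(struct sockaddr *addr)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *listen_id, *conn_id;
    struct rdma_cm_event *ev;

    rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);
    rdma_bind_addr(listen_id, addr);
    rdma_listen(listen_id, 1);

    rdma_get_cm_event(ec, &ev);   /* RDMA_CM_EVENT_CONNECT_REQUEST */
    conn_id = ev->id;
    rdma_ack_cm_event(ev);

    /* ...create the QP and post the initial RQ work requests here,
     * *before* completing the handshake... */
    rdma_accept(conn_id, NULL);

    rdma_get_cm_event(ec, &ev);   /* RDMA_CM_EVENT_ESTABLISHED */
    rdma_ack_cm_event(ev);
    return conn_id;
}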


>> +2. We transmit an empty SEND to let the sender know that
>> +   we are *ready* to receive some bytes from QEMUFileRDMA.
>> +   These bytes will come in the form of another SEND.
> Using an empty message seems somewhat hacky, a fixed header in the
> message would let you do more things if protocol is ever extended.

Great idea....... I'll add a struct RDMAHeader to each send
message in the next RFC which includes a version number.

(Until now, there were *only* QEMUFile bytes, nothing else,
so I didn't have any reason for a formal structure.)


> OK to summarize flow control: at any time there's either 0 or 1 
> outstanding buffers in RQ. At each time only one side can talk. 
> Destination always goes first, then source, etc. At each time a single 
> send message can be passed. Just FYI, this means you are often at 0 
> buffers in RQ and IIRC 0 buffers is a worst-case path for infiniband. 
> It's better to keep at least 1 buffers in RQ at all times, so prepost 
> 2 initially so it would fluctuate between 1 and 2. 

That's correct. Having 0 buffers is not possible - sending
a message with 0 buffers would throw an error. The "protocol"
as I described ensures that there is always one buffer posted
before waiting for another message to arrive.

I avoided "better" flow control because the non-live state
is so small in comparison to the pc.ram contents that would be sent.
The non-live state is in the range of kilobytes, so it seemed silly to
have more rigorous flow control....

>> +Migration of pc.ram:
>> +===============================
>> +
>> +At the beginning of the migration, (migration-rdma.c),
>> +the sender and the receiver populate the list of RAMBlocks
>> +to be registered with each other into a structure.
> Could you add the packet format here as well please?
> Need to document endian-ness etc.

There is no packet format for pc.ram. It's just bytes - raw RDMA
writes of each 4K page, because the memory must be registered
before the RDMA write can begin.

(As discussed, there will be a format for SEND, though - so I'll
take care of that in my next RFC).

>  Yes but we also need to report errors detected during migration. Need 
> to document how this is done. We also need to report success. 
Acknowledged - I'll add more verbosity to the different error conditions.

- Michael R. Hines
Michael S. Tsirkin March 18, 2013, 9:26 p.m. UTC | #3
On Mon, Mar 18, 2013 at 04:24:44PM -0400, Michael R. Hines wrote:
> On 03/18/2013 06:40 AM, Michael S. Tsirkin wrote:
> >I think there are two things here, API documentation and protocol
> >documentation, protocol documentation still needs some more work.
> >Also if what I understand from this document is correct this
> >breaks memory overcommit on destination which needs to be fixed.
> >
> >I think something chunk-based on the destination side is required
> >as well. You also can't trust the source to tell you the chunk
> >size it could be malicious and ask for too much. Maybe source
> >gives chunk size hint and destination responds with what it wants
> >to use.
> 
> Do we allow ballooning *during* the live migration? Is that necessary?

Probably but I haven't mentioned ballooning at all.

memory overcommit != ballooning

> Would it be sufficient to inform the destination which pages are ballooned
> and then only register the ones that the VM actually owns?

I haven't thought about it.

> >Is there any feature and/or version negotiation? How are we going to
> >handle compatibility when we extend the protocol?
> You mean, on top of the protocol versioning that's already
> builtin to QEMUFile? inside qemu_savevm_state_begin()?

I mean for protocol things like credit negotiation, which are unrelated
to high level QEMUFile.

> Should I piggy-back an additional protocol version number
> before QEMUFile sends its version number?

CM can exchange a bit of data during connection setup, maybe use that?
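
A sketch of what that could look like: rdma_connect() can attach a small
private_data blob that the passive side sees in its CONNECT_REQUEST
event. The capability struct here is invented:

#include <stdint.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

/* Hypothetical capability blob carried in the CM private data;
 * nothing like this exists in the patch yet. */
struct rdma_migration_caps {
    uint32_t version;   /* protocol version, network byte order */
    uint32_t flags;     /* feature bits, network byte order     */
};

static int connect_with_caps(struct rdma_cm_id *id)
{
    struct rdma_migration_caps caps = {
        .version = htonl(1),
        .flags   = htonl(0),
    };
    struct rdma_conn_param param = {
        .private_data     = &caps,
        .private_data_len = sizeof(caps),  /* limited to 255 bytes */
        .retry_count      = 7,
    };

    /* The passive side reads these bytes from
     * event->param.conn.private_data on CONNECT_REQUEST. */
    return rdma_connect(id, &param);
}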

> >So how does destination know it's ok to send anything to source? I
> >suspect this is wrong. When using CM you must post on RQ before
> >completing the connection negotiation, not after it's done.
> 
> This is already handled by the RDMA connection manager (librdmacm).
> 
> The library already has functions like listen() and accept() the same
> way that TCP does.
> 
> Once these functions return success, we have a guarantee that both
> sides of the connection have already posted the appropriate work
> requests sufficient for driving the migration.

Not if you don't post anything. librdmacm does not post requests.  So
everyone posts 1 buffer on RQ during connection setup?
OK though this is not what the document said, I was under the impression
this is done after connection setup.

> 
> >>+2. We transmit an empty SEND to let the sender know that
> >>+   we are *ready* to receive some bytes from QEMUFileRDMA.
> >>+   These bytes will come in the form of another SEND.
> >Using an empty message seems somewhat hacky, a fixed header in the
> >message would let you do more things if protocol is ever extended.
> 
> Great idea....... I'll add a struct RDMAHeader to each send
> message in the next RFC which includes a version number.
> 
> (Until now, there were *only* QEMUFile bytes, nothing else,
> so I didn't have any reason for a formal structure.)
> 
> 
> >OK to summarize flow control: at any time there's either 0 or 1
> >outstanding buffers in RQ. At each time only one side can talk.
> >Destination always goes first, then source, etc. At each time a
> >single send message can be passed. Just FYI, this means you are
> >often at 0 buffers in RQ and IIRC 0 buffers is a worst-case path
> >for infiniband. It's better to keep at least 1 buffers in RQ at
> >all times, so prepost 2 initially so it would fluctuate between 1
> >and 2.
> 
> That's correct. Having 0 buffers is not possible - sending
> a message with 0 buffers would throw an error. The "protocol"
> as I described ensures that there is always one buffer posted
> before waiting for another message to arrive.

So # of buffers goes 0 -> 1 -> 0 -> 1.
What I am saying is you should have an extra buffer
so it goes 1 -> 2 -> 1 -> 2
otherwise you keep hitting slow path in RQ processing:
each time you consume the last buffer, IIRC receiver sends
an ACK to the sender saying "hey this is the last buffer, slow down".
You don't want that.

> I avoided "better" flow control because the non-live state
> is so small in comparison to the pc.ram contents that would be sent.
> The non-live state is in the range of kilobytes, so it seemed silly to
> have more rigorous flow control....

I think it's good enough, just add an extra unused buffer to make
hardware happy.

> >>+Migration of pc.ram:
> >>+===============================
> >>+
> >>+At the beginning of the migration, (migration-rdma.c),
> >>+the sender and the receiver populate the list of RAMBlocks
> >>+to be registered with each other into a structure.
> >Could you add the packet format here as well please?
> >Need to document endian-ness etc.
> 
> There is no packet format for pc.ram.

The 'structure' above is passed using SEND so there is
a format.

> It's just bytes - raw RDMA
> writes of each 4K page, because the memory must be registered
> before the RDMA write can begin.
> 
> (As discussed, there will be a format for SEND, though - so I'll
> take care of that in my next RFC).
> 
> > Yes but we also need to report errors detected during migration.
> >Need to document how this is done. We also need to report success.
> Acknowledged - I'll add more verbosity to the different error conditions.
> 
> - Michael R. Hines
mrhines@linux.vnet.ibm.com March 18, 2013, 11:23 p.m. UTC | #4
On 03/18/2013 05:26 PM, Michael S. Tsirkin wrote:
>
> Probably but I haven't mentioned ballooning at all.
>
> memory overcommit != ballooning

Sure, then setting ballooning aside for the moment,
then let's just consider regular (unused) virtual memory.

In this case, what's wrong with the destination mapping
and pinning all the memory if it is not being ballooned?

If the guest touches all the memory during normal operation
before migration begins (which would be the common case),
then overcommit is irrelevant, no?

> This is already handled by the RDMA connection manager (librdmacm).
>
> The library already has functions like listen() and accept() the same
> way that TCP does.
>
> Once these functions return success, we have a guarantee that both
> sides of the connection have already posted the appropriate work
> requests sufficient for driving the migration.
> Not if you don't post anything. librdmacm does not post requests.  So
> everyone posts 1 buffer on RQ during connection setup?
> OK though this is not what the document said, I was under the impression
> this is done after connection setup.

Sorry, I wasn't being clear. Here's the existing sequence
that I've already coded and validated:

1. Receiver and Sender are started (command line):
      (The receiver has to be running before QMP migrate
       can connect, of course, or this all falls apart.)

2. Both sides post RQ work requests (or multiple ones)
3. Receiver does listen()
4. Sender does connect()
         At this point both sides have already posted
         work requests as stated before.
5. Receiver accept() => issue first SEND message

At this point the sequence of events I describe in the
documentation for put_buffer() / get_buffer() all kick
in and everything is normal.

I'll be sure to post an extra few work requests as suggested.

>
> So # of buffers goes 0 -> 1 -> 0 -> 1.
> What I am saying is you should have an extra buffer
> so it goes 1 -> 2 -> 1 -> 2
> otherwise you keep hitting slow path in RQ processing:
> each time you consume the last buffer, IIRC receiver sends
> an ACK to the sender saying "hey this is the last buffer, slow down".
> You don't want that.

No problem - I'll take care of it.......
Michael S. Tsirkin March 19, 2013, 8:19 a.m. UTC | #5
On Mon, Mar 18, 2013 at 07:23:53PM -0400, Michael R. Hines wrote:
> On 03/18/2013 05:26 PM, Michael S. Tsirkin wrote:
> >
> >Probably but I haven't mentioned ballooning at all.
> >
> >memory overcommit != ballooning
> 
> Sure, then setting ballooning aside for the moment,
> then let's just consider regular (unused) virtual memory.
> 
> In this case, what's wrong with the destination mapping
> and pinning all the memory if it is not being ballooned?
> 
> If the guest touches all the memory during normal operation
> before migration begins (which would be the common case),
> then overcommit is irrelevant, no?

We have ways (e.g. cgroups) to limit what a VM can do. If it tries to
use more RAM than we let it, it will swap, still making progress, just
slower.  OTOH it looks like pinning more memory than allowed by the
cgroups limit will just get stuck forever (probably a bug,
should fail instead? but does not help your protocol
which needs it all pinned at all times).

There are also per-task resource limits. If you exceed this,
registration will fail, so not good either.

I just don't see why we'd do registration by chunks
on the source but not on the destination.
mrhines@linux.vnet.ibm.com March 19, 2013, 1:21 p.m. UTC | #6
On 03/19/2013 04:19 AM, Michael S. Tsirkin wrote:
> We have ways (e.g. cgroups) to limit what a VM can do. If it tries to 
> use more RAM than we let it, it will swap, still making progress, just 
> slower. OTOH it looks like pinning more memory than allowed by the 
> cgroups limit will just get stuck forever (probably a bug, should fail 
> instead? but does not help your protocol which needs it all pinned at 
> all times). There are also per-task resource limits. If you exceed 
> this, registration will fail, so not good either. I just don't see why 
> we'd do registration by chunks on the source but not on the destination. 

Would this be a hard requirement for an initial version?

I do understand how and why this makes things more flexible during
the long run, but it does have the potential to slow down the RDMA
protocol significantly.

The way its implemented now, the sender can dump bytes
onto the wire at full speed (up to 30gbps last time I measured it),
but if we insert a round-trip message + registration on the
destination side before we're allowed to push more bytes out,
we'll have to introduce more complex flow control only for
the benefit of making the destination side have the flexibility
that you described.
mrhines@linux.vnet.ibm.com March 19, 2013, 3:08 p.m. UTC | #7
This is actually a much bigger problem than I thought, not just for RDMA:

Currently the *sender* side does not support overcommit
during a regular TCP migration.......I assume because the
migration_bitmap does not know which memory is mapped or
unmapped by the host kernel.

Is this a known issue?

- Michael

On 03/19/2013 04:19 AM, Michael S. Tsirkin wrote:
> On Mon, Mar 18, 2013 at 07:23:53PM -0400, Michael R. Hines wrote:
>> On 03/18/2013 05:26 PM, Michael S. Tsirkin wrote:
>>> Probably but I haven't mentioned ballooning at all.
>>>
>>> memory overcommit != ballooning
>> Sure, then setting ballooning aside for the moment,
>> then let's just consider regular (unused) virtual memory.
>>
>> In this case, what's wrong with the destination mapping
>> and pinning all the memory if it is not being ballooned?
>>
>> If the guest touches all the memory during normal operation
>> before migration begins (which would be the common case),
>> then overcommit is irrelevant, no?
> We have ways (e.g. cgroups) to limit what a VM can do. If it tries to
> use more RAM than we let it, it will swap, still making progress, just
> slower.  OTOH it looks like pinning more memory than allowed by the
> cgroups limit will just get stuck forever (probably a bug,
> should fail instead? but does not help your protocol
> which needs it all pinned at all times).
>
> There are also per-task resource limits. If you exceed this,
> registration will fail, so not good either.
>
> I just don't see why we'd do registration by chunks
> on the source but not on the destination.
>
Michael S. Tsirkin March 19, 2013, 3:16 p.m. UTC | #8
On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
> This is actually a much bigger problem than I thought, not just for RDMA:
> 
> Currently the *sender* side does not support overcommit
> during a regular TCP migration.......I assume because the
> migration_bitmap does not know which memory is mapped or
> unmapped by the host kernel.
> 
> Is this a known issue?
> 
> - Michael

I don't really understand what you are saying here.
Do you see some bug with migration where we might use
more memory than allowed by cgroups?
mrhines@linux.vnet.ibm.com March 19, 2013, 3:32 p.m. UTC | #9
On 03/19/2013 11:16 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
>> This is actually a much bigger problem than I thought, not just for RDMA:
>>
>> Currently the *sender* side does not support overcommit
>> during a regular TCP migration.......I assume because the
>> migration_bitmap does not know which memory is mapped or
>> unmapped by the host kernel.
>>
>> Is this a known issue?
>>
>> - Michael
> I don't really understand what you are saying here.
> Do you see some bug with migration where we might use
> more memory than allowed by cgroups?
>

Yes: cgroups does not coordinate with the list of pages
that have "not yet been mapped" or touched by the
virtual machine, right?

I may be missing something here from what I read in
the code, but even if I set a cgroups limit on memory,
QEMU will still attempt to access that memory if the
migration_bitmap tells it to, as far as I can tell.

Is this an accurate observation?

A simple solution would be to just have QEMU consult with /dev/pagemap, no?

- Michael
Michael S. Tsirkin March 19, 2013, 3:36 p.m. UTC | #10
On Tue, Mar 19, 2013 at 11:32:49AM -0400, Michael R. Hines wrote:
> On 03/19/2013 11:16 AM, Michael S. Tsirkin wrote:
> >On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
> >>This is actually a much bigger problem than I thought, not just for RDMA:
> >>
> >>Currently the *sender* side does not support overcommit
> >>during a regular TCP migration.......I assume because the
> >>migration_bitmap does not know which memory is mapped or
> >>unmapped by the host kernel.
> >>
> >>Is this a known issue?
> >>
> >>- Michael
> >I don't really understand what you are saying here.
> >Do you see some bug with migration where we might use
> >more memory than allowed by cgroups?
> >
> 
> Yes: cgroups does not coordinate with the list of pages
> that have "not yet been mapped" or touched by the
> virtual machine, right?
> 
> I may be missing something here from what I read in
> the code, but even if I set a cgroups limit on memory,
> QEMU will still attempt to access that memory if the
> migration_bitmap tells it to, as far as I can tell.
> 
> Is this an accurate observation?

Yes but this simply means QEMU will hit swap.

> A simple solution would be to just have QEMU consult with /dev/pagemap, no?
> 
> - Michael
mrhines@linux.vnet.ibm.com March 19, 2013, 5:09 p.m. UTC | #11
Allowing QEMU to swap due to a cgroup limit during migration is a viable 
overcommit option?

I'm trying to keep an open mind, but that would kill the migration time.....

- Michael

On 03/19/2013 11:36 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 11:32:49AM -0400, Michael R. Hines wrote:
>> On 03/19/2013 11:16 AM, Michael S. Tsirkin wrote:
>>> On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
>>>> This is actually a much bigger problem than I thought, not just for RDMA:
>>>>
>>>> Currently the *sender* side does not support overcommit
>>>> during a regular TCP migration.......I assume because the
>>>> migration_bitmap does not know which memory is mapped or
>>>> unmapped by the host kernel.
>>>>
>>>> Is this a known issue?
>>>>
>>>> - Michael
>>> I don't really understand what you are saying here.
>>> Do you see some bug with migration where we might use
>>> more memory than allowed by cgroups?
>>>
>> Yes: cgroups does not coordinate with the list of pages
>> that have "not yet been mapped" or touched by the
>> virtual machine, right?
>>
>> I may be missing something here from what I read in
>> the code, but even if I set a cgroups limit on memory,
>> QEMU will still attempt to access that memory if the
>> migration_bitmap tells it to, as far as I can tell.
>>
>> Is this an accurate observation?
> Yes but this simply means QEMU will hit swap.
>
>> A simple solution would be to just have QEMU consult with /dev/pagemap, no?
>>
>> - Michael
Paolo Bonzini March 19, 2013, 5:14 p.m. UTC | #12
On 19/03/2013 18:09, Michael R. Hines wrote:
> Allowing QEMU to swap due to a cgroup limit during migration is a viable
> overcommit option?
> 
> I'm trying to keep an open mind, but that would kill the migration
> time.....

Would it swap?  Doesn't the kernel back all zero pages with a single
copy-on-write page?  If that still accounts towards cgroup limits, it
would be a bug.

Old kernels do not have a shared zero hugepage, and that includes some
distro kernels.  Perhaps that's the problem.

Paolo
Michael S. Tsirkin March 19, 2013, 5:23 p.m. UTC | #13
On Tue, Mar 19, 2013 at 06:14:45PM +0100, Paolo Bonzini wrote:
> On 19/03/2013 18:09, Michael R. Hines wrote:
> > Allowing QEMU to swap due to a cgroup limit during migration is a viable
> > overcommit option?
> > 
> > I'm trying to keep an open mind, but that would kill the migration
> > time.....

Maybe not if you have a fast SSD, or are using swap in RAM or compressed
swap or ...

> Would it swap?  Doesn't the kernel back all zero pages with a single
> copy-on-write page?  If that still accounts towards cgroup limits, it
> would be a bug.
> 
> Old kernels do not have a shared zero hugepage, and that includes some
> distro kernels.  Perhaps that's the problem.
> 
> Paolo

AFAIK for zero pages, yes. I'm not sure what the problem is either.
mrhines@linux.vnet.ibm.com March 19, 2013, 5:40 p.m. UTC | #14
OK, so I did a quick test and the cgroup does appear to be working 
correctly for zero pages.

Nevertheless, this still doesn't solve the chunk registration problem 
for RDMA.

Even with a cgroup on the sender *or* receiver side, there is no API
that I know of that would correctly indicate to the migration process
which pages are safe to register with the hardware and which are not.
Without such an API, even a "smarter" chunked memory registration
scheme would not work with cgroups, because we would be attempting
to pin zero pages (for no reason) that cgroups has already kicked out,
which would defeat the purpose of using cgroups.

So, if I submit a separate patch to fix this, would you guys review it? 
(Using /dev/pagemap).

Unless there is a better idea? Does KVM expose the necessary mappings?

- Michael

On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
> On 19/03/2013 18:09, Michael R. Hines wrote:
>> Allowing QEMU to swap due to a cgroup limit during migration is a viable
>> overcommit option?
>>
>> I'm trying to keep an open mind, but that would kill the migration
>> time.....
> Would it swap?  Doesn't the kernel back all zero pages with a single
> copy-on-write page?  If that still accounts towards cgroup limits, it
> would be a bug.
>
> Old kernels do not have a shared zero hugepage, and that includes some
> distro kernels.  Perhaps that's the problem.
>
> Paolo
>
mrhines@linux.vnet.ibm.com March 19, 2013, 5:49 p.m. UTC | #15
I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)

So, infiniband is not smart enough to know how to avoid pinning a zero 
page, I guess.

- Michael

On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
> On 19/03/2013 18:09, Michael R. Hines wrote:
>> Allowing QEMU to swap due to a cgroup limit during migration is a viable
>> overcommit option?
>>
>> I'm trying to keep an open mind, but that would kill the migration
>> time.....
> Would it swap?  Doesn't the kernel back all zero pages with a single
> copy-on-write page?  If that still accounts towards cgroup limits, it
> would be a bug.
>
> Old kernels do not have a shared zero hugepage, and that includes some
> distro kernels.  Perhaps that's the problem.
>
> Paolo
>
Paolo Bonzini March 19, 2013, 5:52 p.m. UTC | #16
On 19/03/2013 18:40, Michael R. Hines wrote:
> registration scheme would not work with cgroups because we would be 
> attempting to pin zero pages (for no reason) that cgroups has already
> kicked out, which would defeat the purpose of using cgroups.

Yeah, pinning would be a problem.

> So, if I submit a separate patch to fix this, would you guys review it?
> (Using /dev/pagemap).

Sorry about the ignorance, but what is /dev/pagemap? :)

> Unless there is a better idea? Does KVM expose the necessary mappings?

We could have the balloon driver track the pages.  Michael and I had
some initial work a few months ago on extending the virtio-balloon spec
to allow this.  It went nowhere, though.

Still, at this point this is again an RDMA-specific problem, I don't
think it would be that bad if the first iterations of RDMA didn't
support ballooning/overcommit.

Paolo
mrhines@linux.vnet.ibm.com March 19, 2013, 6:04 p.m. UTC | #17
On 03/19/2013 01:52 PM, Paolo Bonzini wrote:
> So, if I submit a separate patch to fix this, would you guys review it?
> (Using /dev/pagemap).
> Sorry about the ignorance, but what is /dev/pagemap? :)
/dev/pagemap is a recent interface for userland access to the page tables.

https://www.kernel.org/doc/Documentation/vm/pagemap.txt

It would very easily tell you (without extra tracking) which pages
were mapped and which were not mapped.

It should work for both cgroups and ballooning. We've used it before.
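
For reference, a sketch of querying it; note the kernel document above
actually exposes the data as /proc/<pid>/pagemap, one 64-bit entry per
virtual page, with bit 63 = "page present" and bit 62 = "page swapped":

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Return 1 if the virtual page containing va is mapped to a physical
 * frame, 0 if not, -1 on error. Sketch only. */
static int page_is_mapped(uintptr_t va)
{
    uint64_t entry;
    long psize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0) {
        return -1;
    }
    if (pread(fd, &entry, sizeof(entry),
              (off_t)(va / psize) * sizeof(entry)) != sizeof(entry)) {
        close(fd);
        return -1;
    }
    close(fd);
    return (entry >> 63) & 1;   /* bit 63: page present */
}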

- Michael
Michael S. Tsirkin March 20, 2013, 1:07 p.m. UTC | #18
On Tue, Mar 19, 2013 at 06:52:59PM +0100, Paolo Bonzini wrote:
> On 19/03/2013 18:40, Michael R. Hines wrote:
> > registration scheme would not work with cgroups because we would be 
> > attempting to pin zero pages (for no reason) that cgroups has already
> > kicked out, which would defeat the purpose of using cgroups.
> 
> Yeah, pinning would be a problem.
> 
> > So, if I submit a separate patch to fix this, would you guys review it?
> > (Using /dev/pagemap).
> 
> Sorry about the ignorance, but what is /dev/pagemap? :)
> 
> > Unless there is a better idea? Does KVM expose the necessary mappings?
> 
> We could have the balloon driver track the pages.  I and Michael had
> some initial work a few months ago on extending the virtio-balloon spec
> to allow this.  It went nowhere, though.
> 
> Still, at this point this is again an RDMA-specific problem, I don't
> think it would be that bad if the first iterations of RDMA didn't
> support ballooning/overcommit.
> 
> Paolo

My problem is with the protocol. If it assumes at the protocol level
that everything is pinned down on the destination, we'll have to rework
it all to make it really useful.
mrhines@linux.vnet.ibm.com March 20, 2013, 3:15 p.m. UTC | #19
OK, can we make a deal? =)

I'm willing to put in the work to perform the dynamic registration on 
the destination side,
but let's go a step further and piggy-back on the effort:

We need to couple this registration with a very small modification to 
save_ram_block():

Currently, save_ram_block does:

1. is RDMA turned on?    if yes, unconditionally add to next chunk
                         (will be made to dynamically register
                          on the destination)
2. is_dup_page()?        if yes, skip
3. in xbzrle cache?      if yes, skip
4. still not sent?       if yes, transmit

I propose adding a "stub" function that adds:

0. is page mapped?         if yes, skip   (always returns true for now)
1. same
2. same
3. same
4. same

Then, later, in a separate patch, I can implement /dev/pagemap support.

When that's done, RDMA dynamic registration will actually take effect and
benefit from actually verifying that the page is mapped or not.
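
The proposed step 0 could start as little as this (a sketch of the stub
being described, with the s/mapped/unmapped/ correction from the
follow-up mail applied; RAMBlock and ram_addr_t are the existing QEMU
types, assumed to be in scope):

#include <stdbool.h>

/* Step-0 stub: skip a page only if it is known to be unmapped.
 * Always "mapped" until pagemap support lands in a later patch. */
static bool ram_page_is_unmapped(RAMBlock *block, ram_addr_t offset)
{
    (void)block;
    (void)offset;
    return false;   /* later: consult the pagemap interface instead */
}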

- Michael


On 03/20/2013 09:07 AM, Michael S. Tsirkin wrote:
> My problem is with the protocol. If it assumes at the protocol level 
> that everything is pinned down on the destination, we'll have to 
> rework it all to make it really useful.
mrhines@linux.vnet.ibm.com March 20, 2013, 3:22 p.m. UTC | #20
s / is page mapped?/ is page unmapped?/ g


On 03/20/2013 11:15 AM, Michael R. Hines wrote:
> OK, can we make a deal? =)
>
> I'm willing to put in the work to perform the dynamic registration on 
> the destination side,
> but let's go a step further and piggy-back on the effort:
>
> We need to couple this registration with a very small modification to 
> save_ram_block():
>
> Currently, save_ram_block does:
>
> 1. is RDMA turned on?      if yes, unconditionally add to next chunk
>                                          (will be made to dynamically 
> register on destination)
> 2. is_dup_page() ?            if yes, skip
> 3. in xbzrle cache?           if yes, skip
> 4. still not sent?                if yes, transmit
>
> I propose adding a "stub" function that adds:
>
> 0. is page mapped?         if yes, skip   (always returns true for now)
> 1. same
> 2. same
> 3. same
> 4. same
>
> Then, later, in a separate patch, I can implement /dev/pagemap support.
>
> When that's done, RDMA dynamic registration will actually take effect and
> benefit from actually verifying that the page is mapped or not.
>
> - Michael
>
>
> On 03/20/2013 09:07 AM, Michael S. Tsirkin wrote:
>> My problem is with the protocol. If it assumes at the protocol level 
>> that everything is pinned down on the destination, we'll have to 
>> rework it all to make it really useful. 
>
>
Michael S. Tsirkin March 20, 2013, 3:55 p.m. UTC | #21
On Wed, Mar 20, 2013 at 11:15:48AM -0400, Michael R. Hines wrote:
> OK, can we make a deal? =)
> 
> I'm willing to put in the work to perform the dynamic registration
> on the destination side,
> but let's go a step further and piggy-back on the effort:
> 
> We need to couple this registration with a very small modification
> to save_ram_block():
> 
> Currently, save_ram_block does:
> 
> 1. is RDMA turned on?      if yes, unconditionally add to next chunk
>                                          (will be made to
> dynamically register on destination)
> 2. is_dup_page() ?            if yes, skip
> 3. in xbzrle cache?           if yes, skip
> 4. still not sent?                if yes, transmit
> 
> I propose adding a "stub" function that adds:
> 
> 0. is page mapped?         if yes, skip   (always returns true for now)
> 1. same
> 2. same
> 3. same
> 4. same
> 
> Then, later, in a separate patch, I can implement /dev/pagemap support.
> 
> When that's done, RDMA dynamic registration will actually take effect and
> benefit from actually verifying that the page is mapped or not.
> 
> - Michael

Mapped into guest? You mean e.g. for ballooning?
mrhines@linux.vnet.ibm.com March 20, 2013, 4:08 p.m. UTC | #22
On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 11:15:48AM -0400, Michael R. Hines wrote:
>> OK, can we make a deal? =)
>>
>> I'm willing to put in the work to perform the dynamic registration
>> on the destination side,
>> but let's go a step further and piggy-back on the effort:
>>
>> We need to couple this registration with a very small modification
>> to save_ram_block():
>>
>> Currently, save_ram_block does:
>>
>> 1. is RDMA turned on?      if yes, unconditionally add to next chunk
>>                                           (will be made to
>> dynamically register on destination)
>> 2. is_dup_page() ?            if yes, skip
>> 3. in xbzrle cache?           if yes, skip
>> 4. still not sent?                if yes, transmit
>>
>> I propose adding a "stub" function that adds:
>>
>> 0. is page mapped?         if yes, skip   (always returns true for now)
>> 1. same
>> 2. same
>> 3. same
>> 4. same
>>
>> Then, later, in a separate patch, I can implement /dev/pagemap support.
>>
>> When that's done, RDMA dynamic registration will actually take effect and
>> benefit from actually verifying that the page is mapped or not.
>>
>> - Michael
> Mapped into guest? You mean e.g. for ballooning?
>

No, not just ballooning. Overcommit (i.e. cgroups).

Anytime cgroups kicks out a page (or anytime the balloon kicks in),
the page would become unmapped.

To make dynamic registration useful, we have to actually have something
in place in the future that knows how to *check* if a page is unmapped
from the virtual machine, either because it has never been dirtied before
(and might be pointing to the zero page) or because it has been madvised()
out or has been detached because of a cgroup limit.

- Michael
Michael S. Tsirkin March 20, 2013, 7:06 p.m. UTC | #23
On Wed, Mar 20, 2013 at 12:08:40PM -0400, Michael R. Hines wrote:
> 
> On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> >On Wed, Mar 20, 2013 at 11:15:48AM -0400, Michael R. Hines wrote:
> >>OK, can we make a deal? =)
> >>
> >>I'm willing to put in the work to perform the dynamic registration
> >>on the destination side,
> >>but let's go a step further and piggy-back on the effort:
> >>
> >>We need to couple this registration with a very small modification
> >>to save_ram_block():
> >>
> >>Currently, save_ram_block does:
> >>
> >>1. is RDMA turned on?      if yes, unconditionally add to next chunk
> >>                                          (will be made to
> >>dynamically register on destination)
> >>2. is_dup_page() ?            if yes, skip
> >>3. in xbzrle cache?           if yes, skip
> >>4. still not sent?                if yes, transmit
> >>
> >>I propose adding a "stub" function that adds:
> >>
> >>0. is page mapped?         if yes, skip   (always returns true for now)
> >>1. same
> >>2. same
> >>3. same
> >>4. same
> >>
> >>Then, later, in a separate patch, I can implement /dev/pagemap support.
> >>
> >>When that's done, RDMA dynamic registration will actually take effect and
> >>benefit from actually verifying that the page is mapped or not.
> >>
> >>- Michael
> >Mapped into guest? You mean e.g. for ballooning?
> >
> 
> No, not just ballooning. Overcommit (i.e. cgroups).
> 
> Anytime cgroups kicks out a page (or anytime the balloon kicks in),
> the page would become unmapped.

OK but we still need to send that page to remote.
It's in swap but has guest data in there, you can't
just ignore it.

> To make dynamic registration useful, we have to actually have something
> in place in the future that knows how to *check* if a page is unmapped
> from the virtual machine, either because it has never been dirtied before
> (and might be pointing to the zero page) or because it has been madvised()
> out or has been detached because of a cgroup limit.
> 
> - Michael
>
mrhines@linux.vnet.ibm.com March 20, 2013, 8:20 p.m. UTC | #24
On 03/20/2013 03:06 PM, Michael S. Tsirkin wrote:
> No, not just ballooning. Overcommit (i.e. cgroups).
>
> Anytime cgroups kicks out a page (or anytime the balloon kicks in),
> the page would become unmapped.
> OK but we still need to send that page to remote.
> It's in swap but has guest data in there, you can't
> just ignore it.

Yes, absolutely: https://www.kernel.org/doc/Documentation/vm/pagemap.txt

The pagemap will tell you that.

In fact the pagemap ideally would *only* be used for the 1st migration 
round.

The rest of them would depend exclusively on the dirty bitmap as they do.

Basically, we could use the pagemap as first-time "hint" for the bulk of
the memory that costs the most to transmit.
mrhines@linux.vnet.ibm.com March 20, 2013, 8:24 p.m. UTC | #25
On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> Then, later, in a separate patch, I can implement /dev/pagemap support.
>
> When that's done, RDMA dynamic registration will actually take effect and
> benefit from actually verifying that the page is mapped or not.
>
> - Michael
> Mapped into guest? You mean e.g. for ballooning?
>

Three scenarios are candidates for mapped checking:

1. anytime the virtual machine has not yet accessed a page (usually 
during the 1st-time boot)
2. Anytime madvise(DONTNEED) happens (for ballooning)
3.  Anytime cgroups kicks out a zero page that was accessed and faulted
but not dirtied - a clean candidate for unmapping.
        (I did a test that seems to confirm that cgroups is pretty 
"smart" about that)

Basically, anytime the pagemap says "this page is *not* in swap and
*not* mapped", the page is not important during the 1st iteration.

On the subsequent iterations, we come along as normal checking the dirty 
bitmap as usual.

- Michael
Michael S. Tsirkin March 20, 2013, 8:31 p.m. UTC | #26
On Wed, Mar 20, 2013 at 04:20:06PM -0400, Michael R. Hines wrote:
> On 03/20/2013 03:06 PM, Michael S. Tsirkin wrote:
> 
>     No, not just ballooning. Overcommit (i.e. cgroups).
> 
>     Anytime cgroups kicks out a page (or anytime the balloon kicks in),
>     the page would become unmapped.
> 
>     OK but we still need to send that page to remote.
>     It's in swap but has guest data in there, you can't
>     just ignore it.
> 
> 
> Yes, absolutely: https://www.kernel.org/doc/Documentation/vm/pagemap.txt
> 
> The pagemap will tell you that.
> 
> In fact the pagemap ideally would *only* be used for the 1st migration round.
> 
> The rest of them would depend exclusively on the dirty bitmap as they do.
> 
> Basically, we could use the pagemap as first-time "hint" for the bulk of
> the memory that costs the most to transmit.

OK sure, this could be useful to detect pages deduplicated by KSM and only
transmit one copy. There's still the question of creating the same
duplicate mappings on the destination - do you just do a data copy there?

Not sure why you talk about unmapped pages above though, it seems
not really relevant...

There's also the matter of KSM not touching pinned pages,
that's another good reason not to pin all pages on destination,
they won't be deduplicated.
Michael S. Tsirkin March 20, 2013, 8:37 p.m. UTC | #27
On Wed, Mar 20, 2013 at 04:24:14PM -0400, Michael R. Hines wrote:
> On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> >Then, later, in a separate patch, I can implement /dev/pagemap support.
> >
> >When that's done, RDMA dynamic registration will actually take effect and
> >benefit from actually verifying that the page is mapped or not.
> >
> >- Michael
> >Mapped into guest? You mean e.g. for ballooning?
> >
> 
> Three scenarios are candidates for mapped checking:
> 
> 1. anytime the virtual machine has not yet accessed a page (usually
> during the 1st-time boot)

So migrating booting machines is faster now?  Why is this worth
optimizing for?

> 2. Anytime madvise(DONTNEED) happens (for ballooning)

This is likely worth optimizing.
I think a better way to handle this one is by tracking
ballooned state. Just mark these pages as unused in qemu.

> 3.  Anytime cgroups kicks out a zero page that was accessed and
> faulted but not dirtied - a clean candidate for unmapping.
>        (I did a test that seems to confirm that cgroups is pretty
> "smart" about that)
> Basically, anytime the pagemap says "this page is *not* in swap and
> *not* mapped", the page is not important during the 1st iteration.
> On the subsequent iterations, we come along as normal checking the
> dirty bitmap as usual.
> 
> - Michael

If it will never be dirty you will never migrate it?
Seems wrong - it could have guest data on disk - AFAIK clean does not
mean no data, it means disk is in sync with memory.
mrhines@linux.vnet.ibm.com March 20, 2013, 8:39 p.m. UTC | #28
Agreed. Very useful for KSM.

Unmapped virtual addresses cannot be pinned for RDMA (the hardware will 
break),
but there's no way to know they are unmapped without checking another 
data structure.

- Michael

On 03/20/2013 04:31 PM, Michael S. Tsirkin wrote:
>
> OK sure, this could be useful to detect pages deduplicated by KSM and only
> transmit one copy. There's still the question of creating same
> duplicate mappings on destination - do you just do data copy on destination?
>
> Not sure why you talk about unmapped pages above though, it seems
> not really relevant...
>
> There's also the matter of KSM not touching pinned pages,
> that's another good reason not to pin all pages on destination,
> they won't be deduplicated.
>
mrhines@linux.vnet.ibm.com March 20, 2013, 8:45 p.m. UTC | #29
On 03/20/2013 04:37 PM, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 04:24:14PM -0400, Michael R. Hines wrote:
>> On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
>>> Then, later, in a separate patch, I can implement /dev/pagemap support.
>>>
>>> When that's done, RDMA dynamic registration will actually take effect and
>>> benefit from actually verifying that the page is mapped or not.
>>>
>>> - Michael
>>> Mapped into guest? You mean e.g. for ballooning?
>>>
>> Three scenarios are candidates for mapped checking:
>>
>> 1. anytime the virtual machine has not yet accessed a page (usually
>> during the 1st-time boot)
> So migrating booting machines is faster now?  Why is this worth
> optimizing for?
Yes, it helps both the TCP migration and RDMA migration simultaneously.
>
>> 2. Anytime madvise(DONTNEED) happens (for ballooning)
> This is likely worth optimizing.
> I think a better way to handle this one is by tracking
> ballooned state. Just mark these pages as unused in qemu.

Paolo said somebody attempted that, but stopped work on it for some reason?

>> 3.  Anytime cgroups kicks out a zero page that was accessed and
>> faulted but not dirty that is a clean candidate for unmapping.
>>         (I did a test that seems to confirm that cgroups is pretty
>> "smart" about that)
>> Basically, anytime the pagemap says "this page is *not* swap and
>> *not* mapped
>> - then the page is not important during the 1st iteration.
>> On the subsequent iterations, we come along as normal checking the
>> dirty bitmap as usual.
>>
>> - Michael
> If it will never be dirty you will never migrate it?
> Seems wrong - it could have guest data on disk - AFAIK clean does not
> mean no data, it means disk is in sync with memory.
>

Sorry, yes - that was a mis-statement: clean pages are always mapped (or 
swapped) and would have to
be transmitted at least once.

- Michael
Michael S. Tsirkin March 20, 2013, 8:46 p.m. UTC | #30
On Wed, Mar 20, 2013 at 04:39:00PM -0400, Michael R. Hines wrote:
> Unmapped virtual addresses cannot be pinned for RDMA (the hardware
> will break),
> but there's no way to know they are unmapped without checking
> another data structure.

So for RDMA, when you try to register them, this will fault them in.
For regular migration we really should try using vmsplice.  Anyone up to
it? If we do this TCP could outperform RDMA for some workloads ...
Michael S. Tsirkin March 20, 2013, 8:52 p.m. UTC | #31
On Wed, Mar 20, 2013 at 04:45:05PM -0400, Michael R. Hines wrote:
> On 03/20/2013 04:37 PM, Michael S. Tsirkin wrote:
> >On Wed, Mar 20, 2013 at 04:24:14PM -0400, Michael R. Hines wrote:
> >>On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> >>>Then, later, in a separate patch, I can implement /dev/pagemap support.
> >>>
> >>>When that's done, RDMA dynamic registration will actually take effect and
> >>>benefit from actually verifying that the page is mapped or not.
> >>>
> >>>- Michael
> >>>Mapped into guest? You mean e.g. for ballooning?
> >>>
> >>Three scenarios are candidates for mapped checking:
> >>
> >>1. anytime the virtual machine has not yet accessed a page (usually
> >>during the 1st-time boot)
> >So migrating booting machines is faster now?  Why is this worth
> >optimizing for?
> Yes, it helps both the TCP migration and RDMA migration simultaneously.

But for a class of VMs that is only common when you want to
run a benchmark. People do live migration precisely to
avoid the need to reboot the VM.

> >
> >>2. Anytime madvise(DONTNEED) happens (for ballooning)
> >This is likely worth optimizing.
> >I think a better way to handle this one is by tracking
> >ballooned state. Just mark these pages as unused in qemu.
> 
> Paolo said somebody attempted that, but stopped work on it for some reason?
> 
> >>3.  Anytime cgroups kicks out a zero page that was accessed and
> >>faulted but not dirtied - a clean candidate for unmapping.
> >>        (I did a test that seems to confirm that cgroups is pretty
> >>"smart" about that)
> >>Basically, anytime the pagemap says "this page is *not* in swap and
> >>*not* mapped", the page is not important during the 1st iteration.
> >>On the subsequent iterations, we come along as normal checking the
> >>dirty bitmap as usual.
> >>
> >>- Michael
> >If it will never be dirty you will never migrate it?
> >Seems wrong - it could have guest data on disk - AFAIK clean does not
> >mean no data, it means disk is in sync with memory.
> >
> 
> Sorry, yes - that was a mis-statement: clean pages are always mapped
> (or swapped) and would have to
> be transmitted at least once.
> 
> - Michael

Right so maybe my idea of looking at the PFNs in pagemap and transmitting
only once could help some VMs (and it would cover the booting VMs as a
partial case), and it could be a useful though linux-specific
optimization, but I don't see how looking at whether a page is
mapped would help for TCP.
mrhines@linux.vnet.ibm.com March 20, 2013, 8:56 p.m. UTC | #32
Forgive me, vmsplice system call? Or some other interface?

I'm not following......

On 03/20/2013 04:46 PM, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 04:39:00PM -0400, Michael R. Hines wrote:
>> Unmapped virtual addresses cannot be pinned for RDMA (the hardware
>> will break),
>> but there's no way to know they are unmapped without checking
>> another data structure.
> So for RDMA, when you try to register them, this will fault them in.
Michael S. Tsirkin March 21, 2013, 5:20 a.m. UTC | #33
On Wed, Mar 20, 2013 at 04:56:01PM -0400, Michael R. Hines wrote:
> 
> Forgive me, vmsplice system call? Or some other interface?
> 
> I'm not following......
> 
> On 03/20/2013 04:46 PM, Michael S. Tsirkin wrote:
> >On Wed, Mar 20, 2013 at 04:39:00PM -0400, Michael R. Hines wrote:
> >>Unmapped virtual addresses cannot be pinned for RDMA (the hardware
> >>will break),
> >>but there's no way to know they are unmapped without checking
> >>another data structure.
> >So for RDMA, when you try to register them, this will fault them in.

I'm just saying get_user_pages brings pages back in from swap.
Michael S. Tsirkin March 21, 2013, 6:11 a.m. UTC | #34
On Tue, Mar 19, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
> I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)
> 
> So, infiniband is not smart enough to know how to avoid pinning a
> zero page, I guess.
> 
> - Michael
> 
> On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
> >Il 19/03/2013 18:09, Michael R. Hines ha scritto:
> >>Allowing QEMU to swap due to a cgroup limit during migration is a viable
> >>overcommit option?
> >>
> >>I'm trying to keep an open mind, but that would kill the migration
> >>time.....
> >Would it swap?  Doesn't the kernel back all zero pages with a single
> >copy-on-write page?  If that still accounts towards cgroup limits, it
> >would be a bug.
> >
> >Old kernels do not have a shared zero hugepage, and that includes some
> >distro kernels.  Perhaps that's the problem.
> >
> >Paolo
> >

It really shouldn't break COW if you don't request LOCAL_WRITE.
I think it's a kernel bug, and apparently has been there in the code since the
first version: get_user_pages parameters swapped.

I'll send a patch. If it's applied, you should also
change your code from

+                                IBV_ACCESS_LOCAL_WRITE |
+                                IBV_ACCESS_REMOTE_WRITE |
+                                IBV_ACCESS_REMOTE_READ);

to

+                                IBV_ACCESS_REMOTE_READ);

on send side.
Then, each time we detect a page has changed we must make sure to
unregister and re-register it. Or if you want to be very
smart, check that the PFN didn't change and reregister
if it did.

This will make overcommit work.
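
To illustrate (a sketch only, not the patch's code; pfn_of() is a
hypothetical helper):

    /* Register the chunk read-only so registration does not ask
     * get_user_pages for write access and break COW zero pages. */
    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_REMOTE_READ);

    /* Later, if the page was written (its PFN changed), re-register. */
    if (pfn_of(addr) != old_pfn) {
        ibv_dereg_mr(mr);
        mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_REMOTE_READ);
    }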
mrhines@linux.vnet.ibm.com March 21, 2013, 3:22 p.m. UTC | #35
Very nice catch. Yes, I didn't think about that.

Thanks.

On 03/21/2013 02:11 AM, Michael S. Tsirkin wrote:
>
> It really shouldn't break COW if you don't request LOCAL_WRITE.
> I think it's a kernel bug, and apparently has been there in the code since the
> first version: get_user_pages parameters swapped.
>
> I'll send a patch. If it's applied, you should also
> change your code from
>
> +                                IBV_ACCESS_LOCAL_WRITE |
> +                                IBV_ACCESS_REMOTE_WRITE |
> +                                IBV_ACCESS_REMOTE_READ);
>
> to
>
> +                                IBV_ACCESS_REMOTE_READ);
>
> on send side.
> Then, each time we detect a page has changed we must make sure to
> unregister and re-register it. Or if you want to be very
> smart, check that the PFN didn't change and reregister
> if it did.
>
> This will make overcommit work.
>
mrhines@linux.vnet.ibm.com April 5, 2013, 8:45 p.m. UTC | #36
On 03/21/2013 02:11 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>> I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)
>>
>> So, infiniband is not smart enough to know how to avoid pinning a
>> zero page, I guess.
>>
>> - Michael
>>
>> On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
>>> Il 19/03/2013 18:09, Michael R. Hines ha scritto:
>>>> Allowing QEMU to swap due to a cgroup limit during migration is a viable
>>>> overcommit option?
>>>>
>>>> I'm trying to keep an open mind, but that would kill the migration
>>>> time.....
>>> Would it swap?  Doesn't the kernel back all zero pages with a single
>>> copy-on-write page?  If that still accounts towards cgroup limits, it
>>> would be a bug.
>>>
>>> Old kernels do not have a shared zero hugepage, and that includes some
>>> distro kernels.  Perhaps that's the problem.
>>>
>>> Paolo
>>>
> It really shouldn't break COW if you don't request LOCAL_WRITE.
> I think it's a kernel bug, and apparently has been there in the code since the
> first version: get_user_pages parameters swapped.
>
> I'll send a patch. If it's applied, you should also
> change your code from
>
> +                                IBV_ACCESS_LOCAL_WRITE |
> +                                IBV_ACCESS_REMOTE_WRITE |
> +                                IBV_ACCESS_REMOTE_READ);
>
> to
>
> +                                IBV_ACCESS_REMOTE_READ);
>
> on send side.
> Then, each time we detect a page has changed we must make sure to
> unregister and re-register it. Or if you want to be very
> smart, check that the PFN didn't change and reregister
> if it did.
>
> This will make overcommit work.
>
Unfortunately RDMA + cgroups still kills QEMU:

I removed the *_WRITE flags and did a test like this:

1. Start QEMU with 2GB ram configured

$ cd /sys/fs/cgroup/memory/libvirt/qemu
$ echo "-1" > memory.memsw.limit_in_bytes
$ echo "-1" > memory.limit_in_bytes
$ echo $(pidof qemu-system-x86_64) > tasks
$ echo 512M > memory.limit_in_bytes              # maximum RSS
$ echo 3G > memory.memsw.limit_in_bytes     # maximum RSS + swap, extra 1G to be safe

2. Start RDMA migration

3. RSS of 512M is reached
4. swap starts filling up
5. the kernel kills QEMU
6. dmesg:

[ 2981.657135] Task in /libvirt/qemu killed as a result of limit of 
/libvirt/qemu
[ 2981.657140] memory: usage 524288kB, limit 524288kB, failcnt 18031
[ 2981.657143] memory+swap: usage 525460kB, limit 3145728kB, failcnt 0
[ 2981.657146] Mem-Info:
[ 2981.657148] Node 0 DMA per-cpu:
[ 2981.657152] CPU    0: hi:    0, btch:   1 usd:   0
[ 2981.657155] CPU    1: hi:    0, btch:   1 usd:   0
[ 2981.657157] CPU    2: hi:    0, btch:   1 usd:   0
[ 2981.657160] CPU    3: hi:    0, btch:   1 usd:   0
[ 2981.657163] CPU    4: hi:    0, btch:   1 usd:   0
[ 2981.657165] CPU    5: hi:    0, btch:   1 usd:   0
[ 2981.657167] CPU    6: hi:    0, btch:   1 usd:   0
[ 2981.657170] CPU    7: hi:    0, btch:   1 usd:   0
[ 2981.657172] Node 0 DMA32 per-cpu:
[ 2981.657176] CPU    0: hi:  186, btch:  31 usd: 160
[ 2981.657178] CPU    1: hi:  186, btch:  31 usd:  22
[ 2981.657181] CPU    2: hi:  186, btch:  31 usd: 179
[ 2981.657184] CPU    3: hi:  186, btch:  31 usd:   6
[ 2981.657186] CPU    4: hi:  186, btch:  31 usd:  21
[ 2981.657189] CPU    5: hi:  186, btch:  31 usd:  15
[ 2981.657191] CPU    6: hi:  186, btch:  31 usd:  19
[ 2981.657194] CPU    7: hi:  186, btch:  31 usd:  22
[ 2981.657196] Node 0 Normal per-cpu:
[ 2981.657200] CPU    0: hi:  186, btch:  31 usd:  44
[ 2981.657202] CPU    1: hi:  186, btch:  31 usd:  58
[ 2981.657205] CPU    2: hi:  186, btch:  31 usd: 156
[ 2981.657207] CPU    3: hi:  186, btch:  31 usd: 107
[ 2981.657210] CPU    4: hi:  186, btch:  31 usd:  44
[ 2981.657213] CPU    5: hi:  186, btch:  31 usd:  70
[ 2981.657215] CPU    6: hi:  186, btch:  31 usd:  76
[ 2981.657218] CPU    7: hi:  186, btch:  31 usd: 173
[ 2981.657223] active_anon:181703 inactive_anon:68856 isolated_anon:0
[ 2981.657224]  active_file:66881 inactive_file:141056 isolated_file:0
[ 2981.657225]  unevictable:2174 dirty:6 writeback:0 unstable:0
[ 2981.657226]  free:4058168 slab_reclaimable:5152 slab_unreclaimable:10785
[ 2981.657227]  mapped:7709 shmem:192 pagetables:1913 bounce:0
[ 2981.657230] Node 0 DMA free:15896kB min:56kB low:68kB high:84kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB 
mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB 
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB 
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 
all_unreclaimable? no
[ 2981.657242] lowmem_reserve[]: 0 1966 18126 18126
[ 2981.657249] Node 0 DMA32 free:1990652kB min:7324kB low:9152kB 
high:10984kB active_anon:0kB inactive_anon:0kB active_file:0kB 
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:2013280kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB 
shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB 
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 
all_unreclaimable? no
[ 2981.657260] lowmem_reserve[]: 0 0 16160 16160
[ 2981.657268] Node 0 Normal free:14226124kB min:60200kB low:75248kB 
high:90300kB active_anon:726812kB inactive_anon:275424kB 
active_file:267524kB inactive_file:564224kB unevictable:8696kB 
isolated(anon):0kB isolated(file):0kB present:16547840kB mlocked:6652kB 
dirty:24kB writeback:0kB mapped:30832kB shmem:768kB 
slab_reclaimable:20608kB slab_unreclaimable:43140kB kernel_stack:1784kB 
pagetables:7652kB unstable:0kB bounce:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? no
[ 2981.657281] lowmem_reserve[]: 0 0 0 0
[ 2981.657289] Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 2*64kB 1*128kB 
1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15896kB
[ 2981.657307] Node 0 DMA32: 17*4kB 9*8kB 7*16kB 4*32kB 8*64kB 5*128kB 
6*256kB 4*512kB 3*1024kB 6*2048kB 481*4096kB = 1990652kB
[ 2981.657325] Node 0 Normal: 2*4kB 1*8kB 991*16kB 893*32kB 271*64kB 
50*128kB 50*256kB 12*512kB 5*1024kB 1*2048kB 3450*4096kB = 14225504kB
[ 2981.657343] 277718 total pagecache pages
[ 2981.657345] 68816 pages in swap cache
[ 2981.657348] Swap cache stats: add 656848, delete 588032, find 19850/22338
[ 2981.657350] Free swap  = 15288376kB
[ 2981.657353] Total swap = 15564796kB
[ 2981.706982] 4718576 pages RAM
mrhines@linux.vnet.ibm.com April 5, 2013, 8:46 p.m. UTC | #37
FYI, I used the following Red Hat cgroups instructions to test whether
overcommit + RDMA was working:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

- Michael

On 03/21/2013 02:11 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>> I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)
>>
>> So, infiniband is not smart enough to know how to avoid pinning a
>> zero page, I guess.
>>
>> - Michael
>>
>> On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
>>> Il 19/03/2013 18:09, Michael R. Hines ha scritto:
>>>> Allowing QEMU to swap due to a cgroup limit during migration is a viable
>>>> overcommit option?
>>>>
>>>> I'm trying to keep an open mind, but that would kill the migration
>>>> time.....
>>> Would it swap?  Doesn't the kernel back all zero pages with a single
>>> copy-on-write page?  If that still accounts towards cgroup limits, it
>>> would be a bug.
>>>
>>> Old kernels do not have a shared zero hugepage, and that includes some
>>> distro kernels.  Perhaps that's the problem.
>>>
>>> Paolo
>>>
> It really shouldn't break COW if you don't request LOCAL_WRITE.
> I think it's a kernel bug, and apparently has been there in the code since the
> first version: get_user_pages parameters swapped.
>
> I'll send a patch. If it's applied, you should also
> change your code from
>
> +                                IBV_ACCESS_LOCAL_WRITE |
> +                                IBV_ACCESS_REMOTE_WRITE |
> +                                IBV_ACCESS_REMOTE_READ);
>
> to
>
> +                                IBV_ACCESS_REMOTE_READ);
>
> on send side.
> Then, each time we detect a page has changed we must make sure to
> unregister and re-register it. Or if you want to be very
> smart, check that the PFN didn't change and reregister
> if it did.
>
> This will make overcommit work.
>
Patch

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..2a48ab0
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,208 @@ 
+Changes since v3:
+
+- Compile-tested with and without --enable-rdma is working.
+- Updated docs/rdma.txt (included below)
+- Merged with latest pull queue from Paolo
+- Implemented qemu_ram_foreach_block()
+
+mrhines@mrhinesdev:~/qemu$ git diff --stat master
+Makefile.objs                 |    1 +
+arch_init.c                   |   28 +-
+configure                     |   25 ++
+docs/rdma.txt                 |  190 +++++++++++
+exec.c                        |   21 ++
+include/exec/cpu-common.h     |    6 +
+include/migration/migration.h |    3 +
+include/migration/qemu-file.h |   10 +
+include/migration/rdma.h      |  269 ++++++++++++++++
+include/qemu/sockets.h        |    1 +
+migration-rdma.c              |  205 ++++++++++++
+migration.c                   |   19 +-
+rdma.c                        | 1511 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+savevm.c                      |  172 +++++++++-
+util/qemu-sockets.c           |    2 +-
+15 files changed, 2445 insertions(+), 18 deletions(-)
+
+QEMUFileRDMA:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions provide an RDMA transport
+(not a protocol) without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+In order to provide the same bytestream interface 
+for RDMA, we use SEND messages instead of sockets.
+The operations themselves and the protocol built on 
+top of QEMUFile used throughout the migration 
+process do not change whatsoever.
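+
+For illustration only, a sketch of how these might be wired up
+(assuming the QEMUFileOps layout of this QEMU version; the .close
+handler name is an assumption):
+
+    static const QEMUFileOps rdma_read_ops = {
+        .get_buffer = qemu_rdma_get_buffer, /* fills 'buf' from SENDs */
+        .close      = qemu_rdma_close,
+    };
+
+    static const QEMUFileOps rdma_write_ops = {
+        .put_buffer = qemu_rdma_put_buffer, /* carries 'buf' in a SEND */
+        .close      = qemu_rdma_close,
+    };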
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the 
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+    
+Messages in infiniband require two things (sketched below):
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
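+In ibverbs terms, both steps look roughly like this (a sketch only;
+'pd', 'qp', 'buf' and 'len' come from the connection setup):
+
+    /* 1. Register (and pin) the receive buffer. */
+    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
+
+    /* 2. Post a receive work request so an incoming SEND can land. */
+    struct ibv_sge sge = {
+        .addr   = (uintptr_t) buf,
+        .length = len,
+        .lkey   = mr->lkey,
+    };
+    struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
+    struct ibv_recv_wr *bad_wr;
+    ibv_post_recv(qp, &wr, &bad_wr);
+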
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+After the initial connection setup (migration-rdma.c),
+this coordination starts by having both sides post
+a single work request to the RQ before any users
+of QEMUFile are activated.
+
+Once an initial receive work request is posted,
+we have a put_buffer()/get_buffer() implementation
+that looks like this:
+
+Logically:
+
+qemu_rdma_get_buffer():
+
+1. A user on top of QEMUFile calls ops->get_buffer(),
+   which calls us.
+2. We transmit an empty SEND to let the sender know that 
+   we are *ready* to receive some bytes from QEMUFileRDMA.
+   These bytes will come in the form of another SEND.
+3. Before attempting to receive that SEND, we post another
+   RQ work request to replace the one we just used up.
+4. Block on a CQ event channel and wait for the SEND
+   to arrive.
+5. When the send arrives, librdmacm will unblock us
+   and we can consume the bytes (described later).
+   
+qemu_rdma_put_buffer(): 
+
+1. A user on top of QEMUFile calls ops->put_buffer(),
+   which calls us.
+2. Block on the CQ event channel waiting for a SEND
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+3. When the "ready" SEND arrives, librdmacm will 
+   unblock us and we immediately post an RQ work request
+   to replace the one we just used up.
+4. Now, we can actually deliver the bytes that
+   put_buffer() wants and return. 
+
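+For put_buffer(), those steps reduce to a sketch like this (the
+helper names are invented, not the functions in this patch):
+
+    static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf,
+                                    int64_t pos, int size)
+    {
+        RDMAContext *rdma = opaque;
+
+        block_for_ready_send(rdma);  /* step 2: wait for "ready" SEND */
+        post_recv_wr(rdma);          /* step 3: replace the used-up WR */
+        return send_bytes(rdma, buf, size); /* step 4: deliver bytes */
+    }
+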
+NOTE: This entire sequence of events is designed this
+way to mimic the operations of a bytestream and is not
+typical of an infiniband application. (Something like MPI
+would not 'ping-pong' messages like this and would not
+block after every request, which would normally defeat
+the purpose of using zero-copy infiniband in the first place).
+
+Finally, how do we hand off the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from SEND in memory.
+
+Each time we get to "Step 5" above for get_buffer(),
+the bytes from SEND are copied into a local holding buffer.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the buffer until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above for qemu_rdma_get_buffer() and block waiting
+for another SEND message to re-fill the buffer.
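+
+A sketch of that holding-buffer logic (the field names here are
+assumptions, not the names used in rdma.c):
+
+    /* Hand out up to 'size' leftover bytes from the last SEND;
+     * a return of 0 means: block for another SEND to refill. */
+    static int qemu_rdma_fill(RDMAContext *rdma, uint8_t *buf, int size)
+    {
+        int len = MIN(size, rdma->bytes_in_buffer);
+
+        memcpy(buf, rdma->holding_buffer + rdma->buffer_pos, len);
+        rdma->buffer_pos      += len;
+        rdma->bytes_in_buffer -= len;
+        return len;
+    }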
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver each populate a structure with the
+list of RAMBlocks to be registered with each other.
+
+Then, using a single SEND message, they exchange this
+structure with each other, to be used later during the
+iteration of main memory. This structure includes a list
+of all the RAMBlocks, their offsets and lengths.
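+
+A hypothetical layout of that structure (the real one in rdma.c may
+differ):
+
+    typedef struct {
+        uint64_t offset;     /* offset of the RAMBlock */
+        uint64_t length;     /* length of the RAMBlock in bytes */
+        char     idstr[256]; /* block name, e.g. "pc.ram" */
+    } RDMARemoteBlock;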
+
+Main memory is not migrated with SEND infiniband 
+messages, but is instead migrated with RDMA infiniband
+messages.
+
+Memory is migrated in "chunks" (64 pages at a time right now).
+The chunk size is not dynamic, but it could be made so in a
+future implementation.
+
+When a total of 64 pages is aggregated (or a flush() occurs),
+the memory backed by the chunk on the sender side is
+registered with librdmacm and pinned in memory.
+
+After pinning, an RDMA write is generated and transmitted
+for the entire chunk.
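+
+In outline, a sketch of that aggregation (the chunk-size constant and
+rdma_write_chunk() are placeholders):
+
+    #define RDMA_CHUNK_PAGES 64
+
+    /* Aggregate pages; register/pin and write the chunk lazily. */
+    if (++chunk->num_pages == RDMA_CHUNK_PAGES || flushing) {
+        /* access flags of 0 suffice for the source of an RDMA write */
+        chunk->mr = ibv_reg_mr(pd, chunk->start, chunk->len, 0);
+        rdma_write_chunk(chunk); /* posts an IBV_WR_RDMA_WRITE */
+        chunk->num_pages = 0;
+    }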
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely,
+clean up all the RDMA descriptors, and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way it would be if the TCP
+socket were broken during a non-RDMA based migration.
+
+USAGE
+===============================
+
+Compiling:
+
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+
+$ make
+
+Command-line on the Source machine AND Destination:
+
+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+PERFORMANCE
+===================
+
+Using a 40gbps infiniband link performing a worst-case stress test,
+average worst-case throughput:
+
+1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
+   approximately 30 gbps (a little better than the paper)
+
+2. TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
+   approximately 8 gbps (using IPoIB, IP over Infiniband)
+
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+
+An *exhaustive* paper (2010) with additional performance details
+is linked on the QEMU wiki:
+
+http://wiki.qemu.org/Features/RDMALiveMigration