Patchwork [v3,9/9] coalesce adjacent iovecs

Submitter Orit Wasserman
Date March 21, 2013, 4:05 p.m.
Message ID <1363881940-27505-10-git-send-email-owasserm@redhat.com>
Permalink /patch/229756/
State New

Comments

Orit Wasserman - March 21, 2013, 4:05 p.m.
This way we send one big buffer instead of many small ones

Signed-off-by: Orit Wasserman <owasserm@redhat.com>
---
 savevm.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)
Orit Wasserman - March 21, 2013, 4:27 p.m.
On 03/21/2013 06:16 PM, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2013 at 06:05:40PM +0200, Orit Wasserman wrote:
>> This way we send one big buffer instead of many small ones
>>
>> Signed-off-by: Orit Wasserman <owasserm@redhat.com>
> 
> Why does this happen BTW?

It happens in the last phase, when we send the device state, which consists of a lot
of byte and int fields that are written using qemu_put_byte/be16/...


Michael S. Tsirkin - March 21, 2013, 4:29 p.m.
On Thu, Mar 21, 2013 at 06:27:42PM +0200, Orit Wasserman wrote:
> On 03/21/2013 06:16 PM, Michael S. Tsirkin wrote:
> > On Thu, Mar 21, 2013 at 06:05:40PM +0200, Orit Wasserman wrote:
> >> This way we send one big buffer instead of many small ones
> >>
> >> Signed-off-by: Orit Wasserman <owasserm@redhat.com>
> > 
> > Why does this happen BTW?
> 
> It happens in the last phase when we send the device state that consists of a lot
> bytes and int field that are written using qemu_put_byte/be16/...
> 

Confused. I thought device_state does not use _nocopy?
My idea of using vmsplice relies exactly on this:
we cannot splice device state ...

Orit Wasserman - March 21, 2013, 4:40 p.m.
On 03/21/2013 06:29 PM, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2013 at 06:27:42PM +0200, Orit Wasserman wrote:
>> On 03/21/2013 06:16 PM, Michael S. Tsirkin wrote:
>>> On Thu, Mar 21, 2013 at 06:05:40PM +0200, Orit Wasserman wrote:
>>>> This way we send one big buffer instead of many small ones
>>>>
>>>> Signed-off-by: Orit Wasserman <owasserm@redhat.com>
>>>
>>> Why does this happen BTW?
>>
>> It happens in the last phase when we send the device state that consists of a lot
>> bytes and int field that are written using qemu_put_byte/be16/...
>>
> 
> Confused  I thought device_state does not use _nocopy?
> My idea of using vmsplice relies exactly on this:
> we can not splice device state ...

qemu_put_buffer calls qemu_put_buffer_no_copy (you know, code reuse).
So I guess we will need to add some flag saying whether we want to use vmsplice
or not, or split no_copy into external and internal variants. It won't be a big change.
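Orit's second option could be sketched roughly as follows: split the iovec append into an external (spliceable) and an internal (copy-only) entry point, tagging each entry so a later vmsplice path knows which ones it may splice. All names here (QF, qf_add_iov, the spliceable[] array) are illustrative assumptions, not the actual QEMUFile API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define MAX_IOV_SIZE 64

typedef struct {
    struct iovec iov[MAX_IOV_SIZE];
    bool spliceable[MAX_IOV_SIZE];  /* may this entry be vmspliced? */
    int iovcnt;
} QF;

static void qf_add_iov(QF *f, const uint8_t *buf, size_t size, bool spliceable)
{
    f->iov[f->iovcnt].iov_base = (uint8_t *)buf;
    f->iov[f->iovcnt].iov_len = size;
    f->spliceable[f->iovcnt++] = spliceable;
}

/* External entry point: the caller guarantees buf stays stable
 * (e.g. guest RAM), so a vmsplice-based flush could splice it. */
static void qf_put_buffer_no_copy(QF *f, const uint8_t *buf, size_t size)
{
    qf_add_iov(f, buf, size, true);
}

/* Internal entry point: buf points into the QEMUFile's own staging
 * buffer (device state, headers), which must be copied, not spliced. */
static void qf_put_buffer_internal(QF *f, const uint8_t *buf, size_t size)
{
    qf_add_iov(f, buf, size, false);
}
```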

Michael S. Tsirkin - March 21, 2013, 5:10 p.m.
On Thu, Mar 21, 2013 at 06:40:24PM +0200, Orit Wasserman wrote:
> On 03/21/2013 06:29 PM, Michael S. Tsirkin wrote:
> > On Thu, Mar 21, 2013 at 06:27:42PM +0200, Orit Wasserman wrote:
> >> On 03/21/2013 06:16 PM, Michael S. Tsirkin wrote:
> >>> On Thu, Mar 21, 2013 at 06:05:40PM +0200, Orit Wasserman wrote:
> >>>> This way we send one big buffer instead of many small ones
> >>>>
> >>>> Signed-off-by: Orit Wasserman <owasserm@redhat.com>
> >>>
> >>> Why does this happen BTW?
> >>
> >> It happens in the last phase when we send the device state that consists of a lot
> >> bytes and int field that are written using qemu_put_byte/be16/...
> >>
> > 
> > Confused  I thought device_state does not use _nocopy?
> > My idea of using vmsplice relies exactly on this:
> > we can not splice device state ...
> 
> qemu_put_buffer calls qemu_put_buffer_no_copy (you know code reuse)
> So I guess we will need to add some flag if we want to use vmsplice or not
> or split no_copy into external and internal. It won't be a big change

Fair enough. Do you have time to look into this? Data copies are like 15% of CPU,
more with swap ...

Juan Quintela - March 21, 2013, 5:41 p.m.
Orit Wasserman <owasserm@redhat.com> wrote:
> This way we send one big buffer instead of many small ones
>
> Signed-off-by: Orit Wasserman <owasserm@redhat.com>
> ---
>  savevm.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/savevm.c b/savevm.c
> index 50e8fb2..13a533b 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -634,8 +634,14 @@ void qemu_put_buffer_no_copy(QEMUFile *f, const uint8_t *buf, int size)
>          abort();
>      }
>  
> -    f->iov[f->iovcnt].iov_base = (uint8_t *)buf;
> -    f->iov[f->iovcnt++].iov_len = size;


> +    /* check for adjoint buffer and colace them */

s/colace/coalesce/

> +    if (f->iovcnt > 0 && buf == f->iov[f->iovcnt - 1].iov_base +
> +        f->iov[f->iovcnt - 1].iov_len) {
> +        f->iov[f->iovcnt - 1].iov_len += size;
> +    } else {
> +        f->iov[f->iovcnt].iov_base = (uint8_t *)buf;
> +        f->iov[f->iovcnt++].iov_len = size;
> +    }

Can we create a function for this?
Would help make sure that we don't forget to "fix" both places when we
have changes.
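Such a helper could look like the following sketch. The name add_to_iovec and the standalone struct are assumptions; the real version would operate on QEMUFile and keep the existing overflow/flush checks around it:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define MAX_IOV_SIZE 64

typedef struct {
    struct iovec iov[MAX_IOV_SIZE];
    int iovcnt;
} IovBuf;

/* Append buf/size to the iovec list, coalescing with the previous
 * entry when the new buffer starts exactly where it ends. */
static void add_to_iovec(IovBuf *f, const uint8_t *buf, size_t size)
{
    if (f->iovcnt > 0 &&
        buf == (uint8_t *)f->iov[f->iovcnt - 1].iov_base +
               f->iov[f->iovcnt - 1].iov_len) {
        f->iov[f->iovcnt - 1].iov_len += size;
    } else {
        f->iov[f->iovcnt].iov_base = (uint8_t *)buf;
        f->iov[f->iovcnt++].iov_len = size;
    }
}
```

Both qemu_put_buffer_no_copy and qemu_put_byte would then call this, as Juan suggests.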

Later, Juan.
Juan Quintela - March 21, 2013, 5:44 p.m.
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Mar 21, 2013 at 06:27:42PM +0200, Orit Wasserman wrote:
>> On 03/21/2013 06:16 PM, Michael S. Tsirkin wrote:
>> > On Thu, Mar 21, 2013 at 06:05:40PM +0200, Orit Wasserman wrote:
>> >> This way we send one big buffer instead of many small ones
>> >>
>> >> Signed-off-by: Orit Wasserman <owasserm@redhat.com>
>> > 
>> > Why does this happen BTW?
>> 
>> It happens in the last phase when we send the device state that
>> consists of a lot
>> bytes and int field that are written using qemu_put_byte/be16/...
>> 
>
> Confused  I thought device_state does not use _nocopy?
> My idea of using vmsplice relies exactly on this:
> we can not splice device state ...


As it is today, I am not sure that we can use vmsplice() because we
are sending:


<header>
<page>
<header>
<page>
<header>
<page>

We can optimize at some point to write a bigger/different header and
send a bunch of pages together, but right now we don't have that code.
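The point can be illustrated with a toy model: each header lives in the QEMUFile's internal staging buffer while each page lives in guest memory, so the patch's coalescing check never fires across a header/page boundary, and N pages still cost 2*N iovec entries even when the pages themselves are contiguous. Sizes and names below are illustrative assumptions:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define MAX_IOV 256

enum { NPAGES = 8, HDR_LEN = 12, PAGE_LEN = 4096 };

static struct iovec iov[MAX_IOV];
static int iovcnt;

/* Same coalescing rule as the patch: merge only when the new buffer
 * starts exactly where the previous iovec ends. */
static void add_to_iovec(const uint8_t *buf, size_t len)
{
    if (iovcnt > 0 &&
        buf == (uint8_t *)iov[iovcnt - 1].iov_base + iov[iovcnt - 1].iov_len) {
        iov[iovcnt - 1].iov_len += len;
    } else {
        iov[iovcnt].iov_base = (uint8_t *)buf;
        iov[iovcnt++].iov_len = len;
    }
}

static uint8_t staging[NPAGES * HDR_LEN];    /* headers: internal buffer */
static uint8_t guest_ram[NPAGES][PAGE_LEN];  /* pages: guest memory */

/* Build the <header><page><header><page>... stream; return iovec count. */
static int build_stream(void)
{
    iovcnt = 0;
    for (int i = 0; i < NPAGES; i++) {
        add_to_iovec(&staging[i * HDR_LEN], HDR_LEN);
        add_to_iovec(guest_ram[i], PAGE_LEN);
    }
    return iovcnt;
}
```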

Later, Juan.
Michael S. Tsirkin - March 21, 2013, 5:46 p.m.
On Thu, Mar 21, 2013 at 06:44:14PM +0100, Juan Quintela wrote:
> As it is today, I am not sure that we can use vmsplice() because we
> are sending:
> 
> 
> <header>
> <page>
> <header>
> <page>
> <header>
> <page>
> 
> We can optimize at some point to write a bigger/different header and
> send a bunch of pages together, but right now we don't have that code.
> 
> Later, Juan.

Sending the page can do vmsplice, can't it?
Multipage is likely a good idea anyway, e.g. RDMA wants to
do this too.
Juan Quintela - March 21, 2013, 6:22 p.m.
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> Sending the page can do vmsplice, can't it?
> Multipage is likely a good idea anyway, e.g. RDMA wants to
> do this too.

RDMA requires the pages be locked into memory, no?

Later, Juan.
Michael S. Tsirkin - March 21, 2013, 6:33 p.m.
On Thu, Mar 21, 2013 at 07:22:01PM +0100, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> > Sending the page can do vmsplice, can't it?
> > Multipage is likely a good idea anyway, e.g. RDMA wants to
> > do this too.
> 
> RDMA requires the pages be locked into memory, no?
> 
> Later, Juan.

That's a separate issue, but the point is it is asynchronous.
It can be made to behave exactly the same implementing this API:

	- send async: lock, send
	- flush: poll for completion, unlock
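As a rough mock of that two-call API (all names, and the counters standing in for real page pinning and transport work, are assumptions for illustration only):

```c
#include <stddef.h>

/* send_async pins ("locks") the buffer and starts the transfer;
 * flush_sends waits for every outstanding send to complete, then
 * unpins.  Real RDMA or vmsplice plumbing would replace the counters. */
typedef struct {
    int pinned;     /* buffers currently locked in memory */
    int in_flight;  /* sends started but not yet completed */
} AsyncSender;

static void send_async(AsyncSender *s, const void *buf, size_t len)
{
    (void)buf; (void)len;
    s->pinned++;     /* lock: pin the pages for the duration */
    s->in_flight++;  /* send: hand the buffer to the transport */
}

static void flush_sends(AsyncSender *s)
{
    /* poll for completion of every outstanding send ... */
    s->in_flight = 0;
    /* ... then unlock the pages */
    s->pinned = 0;
}
```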

Patch

diff --git a/savevm.c b/savevm.c
index 50e8fb2..13a533b 100644
--- a/savevm.c
+++ b/savevm.c
@@ -634,8 +634,14 @@  void qemu_put_buffer_no_copy(QEMUFile *f, const uint8_t *buf, int size)
         abort();
     }
 
-    f->iov[f->iovcnt].iov_base = (uint8_t *)buf;
-    f->iov[f->iovcnt++].iov_len = size;
+    /* check for adjoint buffer and colace them */
+    if (f->iovcnt > 0 && buf == f->iov[f->iovcnt - 1].iov_base +
+        f->iov[f->iovcnt - 1].iov_len) {
+        f->iov[f->iovcnt - 1].iov_len += size;
+    } else {
+        f->iov[f->iovcnt].iov_base = (uint8_t *)buf;
+        f->iov[f->iovcnt++].iov_len = size;
+    }
 
     f->is_write = 1;
     f->bytes_xfer += size;
@@ -690,8 +696,15 @@  void qemu_put_byte(QEMUFile *f, int v)
     f->buf[f->buf_index++] = v;
     f->is_write = 1;
     f->bytes_xfer += 1;
-    f->iov[f->iovcnt].iov_base = f->buf + (f->buf_index - 1);
-    f->iov[f->iovcnt++].iov_len = 1;
+
+    /* check for adjoint buffer and colace them */
+    if (f->iovcnt > 0 && f->buf + (f->buf_index - 1)  ==
+        f->iov[f->iovcnt - 1].iov_base + f->iov[f->iovcnt - 1].iov_len) {
+        f->iov[f->iovcnt - 1].iov_len += 1;
+    } else {
+        f->iov[f->iovcnt].iov_base = f->buf + (f->buf_index - 1);
+        f->iov[f->iovcnt++].iov_len = 1;
+    }
 
     if (f->buf_index >= IO_BUF_SIZE || f->iovcnt >= MAX_IOV_SIZE) {
         qemu_fflush(f);