mbox series

[net-next,RFC,0/5] batched tx processing in vhost_net

Message ID 1506067355-5771-1-git-send-email-jasowang@redhat.com
Headers show
Series batched tx processing in vhost_net | expand

Message

Jason Wang Sept. 22, 2017, 8:02 a.m. UTC
Hi:

This series tries to implement basic tx batched processing. This is
done by prefetching descriptor indices and update used ring in a
batch. This intends to speed up used ring updating and improve the
cache utilization. Test shows about ~22% improvement in tx pss.

Please review.

Jason Wang (5):
  vhost: split out ring head fetching logic
  vhost: introduce helper to prefetch desc index
  vhost: introduce vhost_add_used_idx()
  vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
  vhost_net: basic tx virtqueue batched processing

 drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
 drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
 drivers/vhost/vhost.h |   9 ++
 3 files changed, 270 insertions(+), 125 deletions(-)

Comments

Michael S. Tsirkin Sept. 26, 2017, 1:45 p.m. UTC | #1
On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to implement basic tx batched processing. This is
> done by prefetching descriptor indices and update used ring in a
> batch. This intends to speed up used ring updating and improve the
> cache utilization.

Interesting, thanks for the patches. So IIUC most of the gain is really
overcoming some of the shortcomings of virtio 1.0 wrt cache utilization?

Which is fair enough (1.0 is already deployed) but I would like to avoid
making 1.1 support harder, and this patchset does this unfortunately,
see comments on individual patches. I'm sure it can be addressed though.

> Test shows about ~22% improvement in tx pss.

Is this with or without tx napi in guest?

> Please review.
> 
> Jason Wang (5):
>   vhost: split out ring head fetching logic
>   vhost: introduce helper to prefetch desc index
>   vhost: introduce vhost_add_used_idx()
>   vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
>   vhost_net: basic tx virtqueue batched processing
> 
>  drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
>  drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
>  drivers/vhost/vhost.h |   9 ++
>  3 files changed, 270 insertions(+), 125 deletions(-)
> 
> -- 
> 2.7.4
Michael S. Tsirkin Sept. 26, 2017, 7:26 p.m. UTC | #2
On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to implement basic tx batched processing. This is
> done by prefetching descriptor indices and update used ring in a
> batch. This intends to speed up used ring updating and improve the
> cache utilization. Test shows about ~22% improvement in tx pss.




> Please review.
> 
> Jason Wang (5):
>   vhost: split out ring head fetching logic
>   vhost: introduce helper to prefetch desc index
>   vhost: introduce vhost_add_used_idx()

Please squash these new APIs into where they are used.
This split-up just makes review harder for me as
I can't figure out how the new APIs are used.


>   vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH

This is ok as a separate patch.

>   vhost_net: basic tx virtqueue batched processing
> 
>  drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
>  drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
>  drivers/vhost/vhost.h |   9 ++
>  3 files changed, 270 insertions(+), 125 deletions(-)
> 
> -- 
> 2.7.4
Jason Wang Sept. 27, 2017, 12:27 a.m. UTC | #3
On 2017年09月26日 21:45, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to implement basic tx batched processing. This is
>> done by prefetching descriptor indices and update used ring in a
>> batch. This intends to speed up used ring updating and improve the
>> cache utilization.
> Interesting, thanks for the patches. So IIUC most of the gain is really
> overcoming some of the shortcomings of virtio 1.0 wrt cache utilization?

Yes.

Actually, looks like batching in 1.1 is not as easy as in 1.0.

In 1.0, we could do something like:

batch update used ring by user copy_to_user()
smp_wmb()
update used_idx

In 1.1, we need more memory barriers, can't benefit from fast copy helpers?

for () {
     update desc.addr
     smp_wmb()
     update desc.flag
}

>
> Which is fair enough (1.0 is already deployed) but I would like to avoid
> making 1.1 support harder, and this patchset does this unfortunately,

I think the new APIs do not expose more internal data structure of 
virtio than before? (vq->heads has already been used by vhost_net for 
years). Consider the layout is re-designed completely, I don't see an 
easy method to reuse current 1.0 API for 1.1.

> see comments on individual patches. I'm sure it can be addressed though.
>
>> Test shows about ~22% improvement in tx pss.
> Is this with or without tx napi in guest?

MoonGen is used in guest for better numbers.

Thanks

>
>> Please review.
>>
>> Jason Wang (5):
>>    vhost: split out ring head fetching logic
>>    vhost: introduce helper to prefetch desc index
>>    vhost: introduce vhost_add_used_idx()
>>    vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
>>    vhost_net: basic tx virtqueue batched processing
>>
>>   drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
>>   drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
>>   drivers/vhost/vhost.h |   9 ++
>>   3 files changed, 270 insertions(+), 125 deletions(-)
>>
>> -- 
>> 2.7.4
Jason Wang Sept. 27, 2017, 2:06 a.m. UTC | #4
On 2017年09月27日 03:26, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to implement basic tx batched processing. This is
>> done by prefetching descriptor indices and update used ring in a
>> batch. This intends to speed up used ring updating and improve the
>> cache utilization. Test shows about ~22% improvement in tx pss.
>
>
>
>> Please review.
>>
>> Jason Wang (5):
>>    vhost: split out ring head fetching logic
>>    vhost: introduce helper to prefetch desc index
>>    vhost: introduce vhost_add_used_idx()
> Please squash these new APIs into where they are used.
> This split-up just makes review harder for me as
> I can't figure out how the new APIs are used.

Ok.

Thanks

>
>
>>    vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
> This is ok as a separate patch.
>
>>    vhost_net: basic tx virtqueue batched processing
>>
>>   drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
>>   drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
>>   drivers/vhost/vhost.h |   9 ++
>>   3 files changed, 270 insertions(+), 125 deletions(-)
>>
>> -- 
>> 2.7.4
Michael S. Tsirkin Sept. 27, 2017, 10:28 p.m. UTC | #5
On Wed, Sep 27, 2017 at 08:27:37AM +0800, Jason Wang wrote:
> 
> 
> On 2017年09月26日 21:45, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > This series tries to implement basic tx batched processing. This is
> > > done by prefetching descriptor indices and update used ring in a
> > > batch. This intends to speed up used ring updating and improve the
> > > cache utilization.
> > Interesting, thanks for the patches. So IIUC most of the gain is really
> > overcoming some of the shortcomings of virtio 1.0 wrt cache utilization?
> 
> Yes.
> 
> Actually, looks like batching in 1.1 is not as easy as in 1.0.
> 
> In 1.0, we could do something like:
> 
> batch update used ring by user copy_to_user()
> smp_wmb()
> update used_idx
> In 1.1, we need more memory barriers, can't benefit from fast copy helpers?
> 
> for () {
>     update desc.addr
>     smp_wmb()
>     update desc.flag
> }

Yes but smp_wmb is a NOP on e.g. x86. We can switch to other types of
barriers as well.  We do need to do the updates in order, so we might
need new APIs for that to avoid re-doing the translation all the time.

In 1.0 the last update is a cache miss always. You need batching to get
less misses. In 1.1 you don't have it so fundamentally there is less
need for batching. But batching does not always work.  DPDK guys (which
batch things aggressively) already tried 1.1 and saw performance gains
so we do not need to argue theoretically.



> > 
> > Which is fair enough (1.0 is already deployed) but I would like to avoid
> > making 1.1 support harder, and this patchset does this unfortunately,
> 
> I think the new APIs do not expose more internal data structure of virtio
> than before? (vq->heads has already been used by vhost_net for years).

For sure we might need to change vring_used_elem.

> Consider the layout is re-designed completely, I don't see an easy method to
> reuse current 1.0 API for 1.1.

Current API just says you get buffers then you use them. It is not tied
to actual separate used ring.


> > see comments on individual patches. I'm sure it can be addressed though.
> > 
> > > Test shows about ~22% improvement in tx pss.
> > Is this with or without tx napi in guest?
> 
> MoonGen is used in guest for better numbers.
> 
> Thanks

Not sure I understand. Did you set napi_tx to true or false?

> > 
> > > Please review.
> > > 
> > > Jason Wang (5):
> > >    vhost: split out ring head fetching logic
> > >    vhost: introduce helper to prefetch desc index
> > >    vhost: introduce vhost_add_used_idx()
> > >    vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
> > >    vhost_net: basic tx virtqueue batched processing
> > > 
> > >   drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
> > >   drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
> > >   drivers/vhost/vhost.h |   9 ++
> > >   3 files changed, 270 insertions(+), 125 deletions(-)
> > > 
> > > -- 
> > > 2.7.4
Jason Wang Sept. 28, 2017, 7:16 a.m. UTC | #6
On 2017年09月28日 06:28, Michael S. Tsirkin wrote:
> On Wed, Sep 27, 2017 at 08:27:37AM +0800, Jason Wang wrote:
>>
>> On 2017年09月26日 21:45, Michael S. Tsirkin wrote:
>>> On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> This series tries to implement basic tx batched processing. This is
>>>> done by prefetching descriptor indices and update used ring in a
>>>> batch. This intends to speed up used ring updating and improve the
>>>> cache utilization.
>>> Interesting, thanks for the patches. So IIUC most of the gain is really
>>> overcoming some of the shortcomings of virtio 1.0 wrt cache utilization?
>> Yes.
>>
>> Actually, looks like batching in 1.1 is not as easy as in 1.0.
>>
>> In 1.0, we could do something like:
>>
>> batch update used ring by user copy_to_user()
>> smp_wmb()
>> update used_idx
>> In 1.1, we need more memory barriers, can't benefit from fast copy helpers?
>>
>> for () {
>>      update desc.addr
>>      smp_wmb()
>>      update desc.flag
>> }
> Yes but smp_wmb is a NOP on e.g. x86. We can switch to other types of
> barriers as well.

We need consider non x86 platforms as well. And we meet similar things 
on batch reading.

>    We do need to do the updates in order, so we might
> need new APIs for that to avoid re-doing the translation all the time.
>
> In 1.0 the last update is a cache miss always. You need batching to get
> less misses. In 1.1 you don't have it so fundamentally there is less
> need for batching. But batching does not always work.  DPDK guys (which
> batch things aggressively) already tried 1.1 and saw performance gains
> so we do not need to argue theoretically.

Do they test on non-x86 CPUs? And the prototype should be reviewed 
carefully since a bug can boost the performance in this case.

>
>
>
>>> Which is fair enough (1.0 is already deployed) but I would like to avoid
>>> making 1.1 support harder, and this patchset does this unfortunately,
>> I think the new APIs do not expose more internal data structure of virtio
>> than before? (vq->heads has already been used by vhost_net for years).
> For sure we might need to change vring_used_elem.
>
>> Consider the layout is re-designed completely, I don't see an easy method to
>> reuse current 1.0 API for 1.1.
> Current API just says you get buffers then you use them. It is not tied
> to actual separate used ring.

So do this patch I think? It introduces:

int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
                 struct vring_used_elem *heads,
                 u16 num);
int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);

There's nothing new exposed to vhost_net. (vring_used_elem has been used 
by net.c at many places without this patch).

>
>
>>> see comments on individual patches. I'm sure it can be addressed though.
>>>
>>>> Test shows about ~22% improvement in tx pss.
>>> Is this with or without tx napi in guest?
>> MoonGen is used in guest for better numbers.
>>
>> Thanks
> Not sure I understand. Did you set napi_tx to true or false?

MoonGen uses dpdk, so virtio-net pmd is used in guest. (See 
http://conferences.sigcomm.org/imc/2015/papers/p275.pdf)

Thanks

>
>>>> Please review.
>>>>
>>>> Jason Wang (5):
>>>>     vhost: split out ring head fetching logic
>>>>     vhost: introduce helper to prefetch desc index
>>>>     vhost: introduce vhost_add_used_idx()
>>>>     vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
>>>>     vhost_net: basic tx virtqueue batched processing
>>>>
>>>>    drivers/vhost/net.c   | 221 ++++++++++++++++++++++++++++----------------------
>>>>    drivers/vhost/vhost.c | 165 +++++++++++++++++++++++++++++++------
>>>>    drivers/vhost/vhost.h |   9 ++
>>>>    3 files changed, 270 insertions(+), 125 deletions(-)
>>>>
>>>> -- 
>>>> 2.7.4