Patchwork Add a memory barrier to DMA functions

login
register
mail settings
Submitter Benjamin Herrenschmidt
Date May 22, 2012, 4:34 a.m.
Message ID <1337661278.2779.166.camel@pasglop>
Download mbox | patch
Permalink /patch/160526/
State New
Headers show

Comments

Benjamin Herrenschmidt - May 22, 2012, 4:34 a.m.
The emulated devices can run simultaneously with the guest, so
we need to be careful with ordering of load and stores done by
them to the guest system memory, which need to be observed in
the right order by the guest operating system.

This adds a barrier call to the basic DMA read/write ops which
is currently implemented as a smp_mb(), but could be later
improved for more fine grained control of barriers.

Additionally, a _relaxed() variant of the accessors is provided
to easily convert devices who would be performance sensitive
and negatively impacted by the change.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

So here's the latest try :-) I've kept is simple, I don't
add anything to map/unmap at this stage, so we might still
have a problem with drivers who do that a lot without any
explicit barrier (ahci ?). My preference is to also add
barriers to map/unmap by default but we can discuss it.

Note that I've put the barrier in an inline "helper" so
we can nag about the type of barriers, or try to be smart,
or use flags in the DMAContext or whatever we want reasonably
easily as we have a single spot to modify.

I'm now going to see if I can measure a performance hit on
my x86 laptop, but if somebody who has existing x86 guest setups
wants to help, that would be much welcome :-)

 dma.h |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 52 insertions(+), 2 deletions(-)
Benjamin Herrenschmidt - May 22, 2012, 4:51 a.m.
On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> So here's the latest try :-) I've kept is simple, I don't
> add anything to map/unmap at this stage, so we might still
> have a problem with drivers who do that a lot without any
> explicit barrier (ahci ?). My preference is to also add
> barriers to map/unmap by default but we can discuss it.
> 
> Note that I've put the barrier in an inline "helper" so
> we can nag about the type of barriers, or try to be smart,
> or use flags in the DMAContext or whatever we want reasonably
> easily as we have a single spot to modify.
> 
> I'm now going to see if I can measure a performance hit on
> my x86 laptop, but if somebody who has existing x86 guest setups
> wants to help, that would be much welcome :-) 

Also an idea I had to make it easier & avoid a clutter of
_relaxed variants of everything:

Can we change DMADirection to be a DMAAttributes or DMAFlags,
and basically have more than just the direction in there ?

That way we can easily have "relaxed" be a flag, and thus
apply to most of the accessors without adding a bunch of
different variant (especially since we already have that
"cancel" variant for map, I don't want to make 2 more
for _relaxed).

If I do that change I'd also like to change the enum so that
read and write are distinct bits, and so we can set both for
bidirectional. This makes more sense if we ever have to play
with cache issues etc... and might help using more optimal
barriers. It might also fit better to IOMMUs who provide
distinct read vs. write permissions.

Cheers,
Ben.
Benjamin Herrenschmidt - May 22, 2012, 7:17 a.m.
On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> The emulated devices can run simultaneously with the guest, so
> we need to be careful with ordering of load and stores done by
> them to the guest system memory, which need to be observed in
> the right order by the guest operating system.
> 
> This adds a barrier call to the basic DMA read/write ops which
> is currently implemented as a smp_mb(), but could be later
> improved for more fine grained control of barriers.
> 
> Additionally, a _relaxed() variant of the accessors is provided
> to easily convert devices who would be performance sensitive
> and negatively impacted by the change.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> ---

(Note to Rusty: The number I told you on ST is wrong, see below)

So I tried to do some performance measurements with that patch using
netperf on an x86 laptop (x220 with core i7).

It's a bit tricky. For example, if I just create a tap interface,
give it a local IP on the laptop and a different IP on the guest,
(ie talking to a netserver on the host basically from the guest
via tap), the performance is pretty poor and the numbers seem
useless with and without the barrier.

So I did tests involving talking to a server on our gigabit network
instead.

The baseline is the laptop without kvm talking to the server. The
TCP_STREAM test results are:

(The "*" at the beginning of the lines is something I added to
 distinguish multi-line results on some tests)

MIGRATED TCP STREAM TEST
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

* 87380  16384  16384    10.02     933.02   
* 87380  16384  16384    10.03     908.64   
* 87380  16384  16384    10.03     926.78   
* 87380  16384  16384    10.03     919.73   

It's a bit noisy, ideally I should do a point-to-point setup to
an otherwise idle machine, here I'm getting some general lab network
noise but it gives us a pretty good baseline to begin with.

I have not managed to get any sensible result out of UDP_STREAM in
that configuration for some reason (ie just the host laptop), ie
they look insane :

MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

*229376   65507   10.00      270468      0    14173.84
 126976           10.00          44              2.31
*229376   65507   10.00      266526      0    13967.32
 126976           10.00          41              2.15

So we don't have a good comparison baseline but we can still compare
KVM against itself with and without the barrier.

Now KVM. This is x86_64 running an ubuntu precise guest (I had the
ISO laying around) and using the default setup which appears to be
an emulated e1000. I've done some tests with slirp just to see how
bad it was and it's bad enough to be irrelevant. The numbers have
thus been done using a tap interface bridged to the host ethernet
(who happens to also be some kind of e1000).

For each test I've done 3 series of numbers:

 - Without the barrier added
 - With the barrier added to dma_memory_rw
 - With the barrier added to dma_memory_rw -and- dma_memory_map

First TCP_STREAM. The numbers are a bit noisy, I suspect somebody
was hammering the server machine while I was doing one of the tests,
but here's what I got when it appeared to have stabilized:

MIGRATED TCP STREAM TEST
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

no barrier
* 87380 16384   16384    10.01    880.31
* 87380 16384   16384    10.01    876.73
* 87380 16384   16384    10.01    880.73
* 87380 16384   16384    10.01    878.63
barrier
* 87380 16384   16384    10.01    869.39
* 87380 16384   16384    10.01    864.99
* 87380 16384   16384    10.01    886.13
* 87380 16384   16384    10.01    872.90
barrier + map
* 87380 16384   16384    10.01    867.45
* 87380 16384   16384    10.01    868.51
* 87380 16384   16384    10.01    888.94
* 87380 16384   16384    10.01    888.19

As far as I can tell, it's all in the noise. I was about to concede a
small (1 % ?) loss to the barrier until I ran the last 2 tests and
then I stopped caring :-)

With UDP_STREAM, we get something like that:
MIGRATED UDP STREAM TEST
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

no barrier
*229376   65507   10.00        5208     0      272.92
 126976           10.00        5208            272.92
*229376   65507   10.00        5447     0      285.44
 126976           10.00        5447            285.44
*229376   65507   10.00        5119     0      268.22
 126976           10.00        5119            268.22
barrier
*229376   65507   10.00        5326     0      279.06
 126976           10.00        5326            279.06
*229376   65507   10.00        5072     0      265.75
 126976           10.00        5072            265.75
*229376   65507   10.00        5282     0      276.78
 126976           10.00        5282            276.78
barrier + map
*229376   65507   10.00        5512     0      288.83
 126976           10.00        5512            288.83
*229376   65507   10.00        5571     0      291.94
 126976           10.00        5571            291.94
*229376   65507   10.00        5195     0      272.23
 126976           10.00        5195            272.23

So I think here too we're in the noise. In fact, that makes me want to
stick the barrier in map() as well (though see my other email about
using a flag to implement "relaxed" to avoid an explosion of accessors).

Now, I suspect somebody needs to re-run those tests on HW that is known
to be more sensitive to memory barriers, it could be that my SB i7 in
64-bit mode is just the best case scenario and that some old core1 or 2
using a 32-bit lock instruction will suck a lot more.

In any case, it looks like the performance loss is minimal if measurable
at all, and in case there's a real concern on a given driver we can
always fix -that- driver to use more relaxed accessors.

Cheers,
Ben.
Michael S. Tsirkin - May 22, 2012, 11:14 a.m.
On Tue, May 22, 2012 at 05:17:39PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-22 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> > The emulated devices can run simultaneously with the guest, so
> > we need to be careful with ordering of load and stores done by
> > them to the guest system memory, which need to be observed in
> > the right order by the guest operating system.
> > 
> > This adds a barrier call to the basic DMA read/write ops which
> > is currently implemented as a smp_mb(), but could be later
> > improved for more fine grained control of barriers.
> > 
> > Additionally, a _relaxed() variant of the accessors is provided
> > to easily convert devices who would be performance sensitive
> > and negatively impacted by the change.
> > 
> > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > ---
> 
> (Note to Rusty: The number I told you on ST is wrong, see below)
> 
> So I tried to do some performance measurements with that patch using
> netperf on an x86 laptop (x220 with core i7).
> 
> It's a bit tricky. For example, if I just create a tap interface,
> give it a local IP on the laptop and a different IP on the guest,
> (ie talking to a netserver on the host basically from the guest
> via tap), the performance is pretty poor and the numbers seem
> useless with and without the barrier.
> 
> So I did tests involving talking to a server on our gigabit network
> instead.
> 
> The baseline is the laptop without kvm talking to the server. The
> TCP_STREAM test results are:

It's not a good test. The thing most affecting throughput results is how
much CPU does you guest get. So as a minumum you need to measure CPU
utilization on the host and divide by that.
Benjamin Herrenschmidt - May 22, 2012, 11:41 a.m.
On Tue, 2012-05-22 at 14:14 +0300, Michael S. Tsirkin wrote:
> > The baseline is the laptop without kvm talking to the server. The
> > TCP_STREAM test results are:
> 
> It's not a good test. The thing most affecting throughput results is
> how
> much CPU does you guest get. So as a minumum you need to measure CPU
> utilization on the host and divide by that. 

The simple fact that we don't reach the baseline while in qemu seems to
be a reasonably good indication that we tend to be CPU bound already so
it's not -that- relevant. It would be if we were saturating the network.

But yes, I can try to do more tests tomorrow, it would be nice if you
could contribute a proper test protocol (or even test on some machines)
since you seem to be familiar with those measurements (and I have a very
limited access to x86 gear ... basically just my laptop).

Cheers,
Ben.
Michael S. Tsirkin - May 22, 2012, 12:03 p.m.
On Tue, May 22, 2012 at 09:41:41PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-22 at 14:14 +0300, Michael S. Tsirkin wrote:
> > > The baseline is the laptop without kvm talking to the server. The
> > > TCP_STREAM test results are:
> > 
> > It's not a good test. The thing most affecting throughput results is
> > how
> > much CPU does you guest get. So as a minumum you need to measure CPU
> > utilization on the host and divide by that. 
> 
> The simple fact that we don't reach the baseline while in qemu seems to
> be a reasonably good indication that we tend to be CPU bound already so
> it's not -that- relevant. It would be if we were saturating the network.
> 
> But yes, I can try to do more tests tomorrow, it would be nice if you
> could contribute a proper test protocol (or even test on some machines)
> since you seem to be familiar with those measurements (and I have a very
> limited access to x86 gear ... basically just my laptop).
> 
> Cheers,
> Ben.

I have a deja vu. Amos sent perf results when you argued about
exactly the same issue in guest virtio. Delta was small but
measureable. At the moment I have no free time or free hardware
to redo the same work all over again. It's a well known fact that
actual memory barrier is slow on x86 CPUs. You can't see
it with network on your laptop? Write a microbenchmark.
Benjamin Herrenschmidt - May 22, 2012, 9:24 p.m.
On Tue, 2012-05-22 at 15:03 +0300, Michael S. Tsirkin wrote:
> I have a deja vu. Amos sent perf results when you argued about
> exactly the same issue in guest virtio. Delta was small but
> measureable. At the moment I have no free time or free hardware
> to redo the same work all over again. It's a well known fact that
> actual memory barrier is slow on x86 CPUs. You can't see
> it with network on your laptop? Write a microbenchmark. 

Or just screw it, it's small enough, for emulated devices we can swallow
it, and move on with life, since virtio is unaffected, and we can always
fine tune e1000 if performance of that one is critical.

Ben.
Anthony Liguori - May 22, 2012, 9:40 p.m.
On 05/22/2012 07:03 AM, Michael S. Tsirkin wrote:
> On Tue, May 22, 2012 at 09:41:41PM +1000, Benjamin Herrenschmidt wrote:
>> On Tue, 2012-05-22 at 14:14 +0300, Michael S. Tsirkin wrote:
>>>> The baseline is the laptop without kvm talking to the server. The
>>>> TCP_STREAM test results are:
>>>
>>> It's not a good test. The thing most affecting throughput results is
>>> how
>>> much CPU does you guest get. So as a minumum you need to measure CPU
>>> utilization on the host and divide by that.
>>
>> The simple fact that we don't reach the baseline while in qemu seems to
>> be a reasonably good indication that we tend to be CPU bound already so
>> it's not -that- relevant. It would be if we were saturating the network.
>>
>> But yes, I can try to do more tests tomorrow, it would be nice if you
>> could contribute a proper test protocol (or even test on some machines)
>> since you seem to be familiar with those measurements (and I have a very
>> limited access to x86 gear ... basically just my laptop).
>>
>> Cheers,
>> Ben.
>
> I have a deja vu. Amos sent perf results when you argued about
> exactly the same issue in guest virtio. Delta was small but
> measureable. At the moment I have no free time or free hardware
> to redo the same work all over again. It's a well known fact that
> actual memory barrier is slow on x86 CPUs. You can't see
> it with network on your laptop? Write a microbenchmark.

The latest patch doesn't put a barrier in map().  I have an extremely hard time 
believing anything that uses _rw is going to be performance sensitive.

There is a correctness problem here.  I think it's important that we fix that 
and then we can focus on improving performance later.

Regards,

Anthony Liguori

>

Patch

diff --git a/dma.h b/dma.h
index f1fcb71..0d57e50 100644
--- a/dma.h
+++ b/dma.h
@@ -13,6 +13,7 @@ 
 #include <stdio.h>
 #include "hw/hw.h"
 #include "block.h"
+#include "kvm.h"
 
 typedef struct DMAContext DMAContext;
 typedef struct ScatterGatherEntry ScatterGatherEntry;
@@ -70,6 +71,30 @@  typedef struct DMAContext {
     DMAUnmapFunc *unmap;
 } DMAContext;
 
+static inline void dma_barrier(DMAContext *dma, DMADirection dir)
+{
+    /*
+     * This is called before DMA read and write operations
+     * unless the _relaxed form is used and is responsible
+     * for providing some sane ordering of accesses vs
+     * concurrently running VCPUs.
+     *
+     * Users of map(), unmap() or lower level st/ld_*
+     * operations are responsible for providing their own
+     * ordering via barriers.
+     *
+     * This primitive implementation does a simple smp_mb()
+     * before each operation which provides pretty much full
+     * ordering.
+     *
+     * A smarter implementation can be devised if needed to
+     * use lighter barriers based on the direction of the
+     * transfer, the DMA context, etc...
+     */
+    if (kvm_enabled())
+        smp_mb();
+}
+
 static inline bool dma_has_iommu(DMAContext *dma)
 {
     return !!dma;
@@ -93,8 +118,9 @@  static inline bool dma_memory_valid(DMAContext *dma,
 
 int iommu_dma_memory_rw(DMAContext *dma, dma_addr_t addr,
                         void *buf, dma_addr_t len, DMADirection dir);
-static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
-                                void *buf, dma_addr_t len, DMADirection dir)
+static inline int dma_memory_rw_relaxed(DMAContext *dma, dma_addr_t addr,
+                                        void *buf, dma_addr_t len,
+                                        DMADirection dir)
 {
     if (!dma_has_iommu(dma)) {
         /* Fast-path for no IOMMU */
@@ -106,6 +132,28 @@  static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
     }
 }
 
+static inline int dma_memory_read_relaxed(DMAContext *dma, dma_addr_t addr,
+                                          void *buf, dma_addr_t len)
+{
+    return dma_memory_rw_relaxed(dma, addr, buf, len, DMA_DIRECTION_TO_DEVICE);
+}
+
+static inline int dma_memory_write_relaxed(DMAContext *dma, dma_addr_t addr,
+                                           const void *buf, dma_addr_t len)
+{
+    return dma_memory_rw_relaxed(dma, addr, (void *)buf, len,
+                                 DMA_DIRECTION_FROM_DEVICE);
+}
+
+static inline int dma_memory_rw(DMAContext *dma, dma_addr_t addr,
+                                void *buf, dma_addr_t len,
+                                DMADirection dir)
+{
+    dma_barrier(dma, dir);
+
+    return dma_memory_rw_relaxed(dma, addr, buf, len, dir);
+}
+
 static inline int dma_memory_read(DMAContext *dma, dma_addr_t addr,
                                   void *buf, dma_addr_t len)
 {
@@ -124,6 +172,8 @@  int iommu_dma_memory_set(DMAContext *dma, dma_addr_t addr, uint8_t c,
 static inline int dma_memory_set(DMAContext *dma, dma_addr_t addr,
                                  uint8_t c, dma_addr_t len)
 {
+    dma_barrier(dma, DMA_DIRECTION_FROM_DEVICE);
+
     if (!dma_has_iommu(dma)) {
         /* Fast-path for no IOMMU */
         cpu_physical_memory_set(addr, c, len);