Patchwork [RFC,00/17] virtual-bus

login
register
mail settings
Submitter Anthony Liguori
Date April 2, 2009, 4:09 p.m.
Message ID <49D4E33F.5000303@codemonkey.ws>
Download mbox | patch
Permalink /patch/25534/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Anthony Liguori - April 2, 2009, 4:09 p.m.
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Avi Kivity wrote:
>>>
>>> The alternative is to get a notification from the stack that the 
>>> packet is done processing.  Either an skb destructor in the kernel, 
>>> or my new API that everyone is not rushing out to implement.
>>
>> btw, my new api is
>>
>>
>>   io_submit(..., nr, ...): submit nr packets
>>   io_getevents(): complete nr packets
>
> I don't think we even need that to end this debate.  I'm convinced we 
> have a bug somewhere.  Even disabling TX mitigation, I see a ping 
> latency of around 300ns whereas it's only 50ns on the host.  This 
> defies logic so I'm now looking to isolate why that is.

I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes were 
the big winner... I hate qemu sometimes.

I'm pretty confident I can get at least to Greg's numbers with some 
poking.  I think I understand why he's doing better after reading his 
patches carefully but I also don't think it'll scale with many guests 
well...  stay tuned.

But most importantly, we are darn near where vbus is with this patch wrt 
added packet latency and this is totally from userspace with no host 
kernel changes.

So no, userspace is not the issue.

Regards,

Anthony Liguori

> Regards,
>
> Anthony Liguori
>
Avi Kivity - April 2, 2009, 4:19 p.m.
Anthony Liguori wrote:
>> I don't think we even need that to end this debate.  I'm convinced we 
>> have a bug somewhere.  Even disabling TX mitigation, I see a ping 
>> latency of around 300ns whereas it's only 50ns on the host.  This 
>> defies logic so I'm now looking to isolate why that is.
>
> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
> were the big winner... I hate qemu sometimes.
>
>

What, this:

> diff --git a/qemu/exec.c b/qemu/exec.c
> index 67f3fa3..1331022 100644
> --- a/qemu/exec.c
> +++ b/qemu/exec.c
> @@ -3268,6 +3268,10 @@ uint32_t ldl_phys(target_phys_addr_t addr)
>      unsigned long pd;
>      PhysPageDesc *p;
>  
> +#if 1
> +    return ldl_p(phys_ram_base + addr);
> +#endif
> +
>      p = phys_page_find(addr >> TARGET_PAGE_BITS);
>      if (!p) {
>          pd = IO_MEM_UNASSIGNED;
> @@ -3300,6 +3304,10 @@ uint64_t ldq_phys(target_phys_addr_t addr)
>      unsigned long pd;
>      PhysPageDesc *p;
>  
> +#if 1
> +    return ldq_p(phys_ram_base + addr);
> +#endif
> +
>      p = phys_page_find(addr >> TARGET_PAGE_BITS);
>      if (!p) {
>          pd = IO_MEM_UNASSIGNED;

The way I read it, it will run only run slowly once per page, then 
settle to a cache miss per page.

Regardless, it makes a memslot model even more attractive.
Anthony Liguori - April 2, 2009, 6:18 p.m.
Avi Kivity wrote:
> Anthony Liguori wrote:
>>> I don't think we even need that to end this debate.  I'm convinced 
>>> we have a bug somewhere.  Even disabling TX mitigation, I see a ping 
>>> latency of around 300ns whereas it's only 50ns on the host.  This 
>>> defies logic so I'm now looking to isolate why that is.
>>
>> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
>> were the big winner... I hate qemu sometimes.
>>
>>
>
> What, this:

UDP_RR test was limited by CPU consumption.  QEMU was pegging a CPU with 
only about 4000 packets per second whereas the host could do 14000.  An 
oprofile run showed that phys_page_find/cpu_physical_memory_rw where at 
the top by a wide margin which makes little sense since virtio is zero 
copy in kvm-userspace today.

That leaves the ring queue accessors that used ld[wlq]_phys and friends 
that happen to make use of the above.  That led me to try this terrible 
hack below and low and beyond, we immediately jumped to 10000 pps.  This 
only works because almost nothing uses ld[wlq]_phys in practice except 
for virtio so breaking it for the non-RAM case didn't matter.

We didn't encounter this before because when I changed this behavior, I 
tested streaming and ping.  Both remained the same.  You can only expose 
this issue if you first disable tx mitigation.

Anyway, if we're able to send this many packets, I suspect we'll be able 
to also handle much higher throughputs without TX mitigation so that's 
what I'm going to look at now.

Regards,

Anthony Liguori
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - April 3, 2009, 1:11 a.m.
Anthony Liguori <anthony@codemonkey.ws> wrote:
>
> Anyway, if we're able to send this many packets, I suspect we'll be able 
> to also handle much higher throughputs without TX mitigation so that's 
> what I'm going to look at now.

Awesome! I'm prepared to eat my words :)

On the subject of TX mitigation, can we please set a standard
on how we measure it? For instance, do we bind the the backend
qemu to the same CPU as the guest, or do we bind it to a different
CPU that shares cache? They're two completely different scenarios
and I think we should be explicit about which one we're measuring.

Thanks,
Gregory Haskins - April 3, 2009, 12:03 p.m.
Anthony Liguori wrote:
> Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Avi Kivity wrote:
>>>>
>>>> The alternative is to get a notification from the stack that the
>>>> packet is done processing.  Either an skb destructor in the kernel,
>>>> or my new API that everyone is not rushing out to implement.
>>>
>>> btw, my new api is
>>>
>>>
>>>   io_submit(..., nr, ...): submit nr packets
>>>   io_getevents(): complete nr packets
>>
>> I don't think we even need that to end this debate.  I'm convinced we
>> have a bug somewhere.  Even disabling TX mitigation, I see a ping
>> latency of around 300ns whereas it's only 50ns on the host.  This
>> defies logic so I'm now looking to isolate why that is.
>
> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes
> were the big winner... I hate qemu sometimes.

[ Ive already said this privately to Anthony on IRC, but ..]

Hey, congrats!  Thats impressive actually.

So I realize that perhaps you guys are not quite seeing my long term
vision here, which I think will offer some new features that we dont
have today.  I hope to change that over the coming weeks.  However, I
should also point out that perhaps even if, as of right now, my one and
only working module (venet-tap) were all I could offer, it does give us
a "rivalry" position between the two, and this historically has been a
good thing on many projects.  This helps foster innovation through
competition that potentially benefits both.  Case in point, a little
competition provoked an investigation that brought virtio-net's latency
down from 3125us to 90us.  I realize its not a production-ready patch
quite yet, but I am confident Anthony will find something that is
suitable to checkin very soon.  That's a huge improvement to a problem
that was just sitting around unnoticed because there was nothing to
compare it with.

So again, I am proposing for consideration of accepting my work (either
in its current form, or something we agree on after the normal review
process) not only on the basis of the future development of the
platform, but also to keep current components in their running to their
full potential.  I will again point out that the code is almost
completely off to the side, can be completely disabled with config
options, and I will maintain it.  Therefore the only real impact is to
people who care to even try it, and to me.

-Greg
Avi Kivity - April 3, 2009, 12:15 p.m.
Gregory Haskins wrote:
> So again, I am proposing for consideration of accepting my work (either
> in its current form, or something we agree on after the normal review
> process) not only on the basis of the future development of the
> platform, but also to keep current components in their running to their
> full potential.  I will again point out that the code is almost
> completely off to the side, can be completely disabled with config
> options, and I will maintain it.  Therefore the only real impact is to
> people who care to even try it, and to me.
>   

Your work is a whole stack.  Let's look at the constituents.

- a new virtual bus for enumerating devices.

Sorry, I still don't see the point.  It will just make writing drivers 
more difficult.  The only advantage I've heard from you is that it gets 
rid of the gunk.  Well, we still have to support the gunk for non-pv 
devices so the gunk is basically free.  The clean version is expensive 
since we need to port it to all guests and implement exciting features 
like hotplug.

- finer-grained point-to-point communication abstractions

Where virtio has ring+signalling together, you layer the two.  For 
networking, it doesn't matter.  For other applications, it may be 
helpful, perhaps you have something in mind.

- your "bidirectional napi" model for the network device

virtio implements exactly the same thing, except for the case of tx 
mitigation, due to my (perhaps pig-headed) rejection of doing things in 
a separate thread, and due to the total lack of sane APIs for packet 
traffic.

- a kernel implementation of the host networking device

Given the continuous rejection (or rather, their continuous 
non-adoption-and-implementation) of my ideas re zerocopy networking aio, 
that seems like a pragmatic approach.  I wish it were otherwise.

- a promise of more wonderful things yet to come

Obviously I can't evaluate this.

Did I miss anything?

Right now my preferred course of action is to implement a prototype 
userspace notification for networking.  Second choice is to move the 
host virtio implementation into the kernel.  I simply don't see how the 
rest of the stack is cost effective.
Gregory Haskins - April 3, 2009, 1:13 p.m.
Avi Kivity wrote:
> Gregory Haskins wrote:
>> So again, I am proposing for consideration of accepting my work (either
>> in its current form, or something we agree on after the normal review
>> process) not only on the basis of the future development of the
>> platform, but also to keep current components in their running to their
>> full potential.  I will again point out that the code is almost
>> completely off to the side, can be completely disabled with config
>> options, and I will maintain it.  Therefore the only real impact is to
>> people who care to even try it, and to me.
>>   
>
> Your work is a whole stack.  Let's look at the constituents.
>
> - a new virtual bus for enumerating devices.
>
> Sorry, I still don't see the point.  It will just make writing drivers
> more difficult.  The only advantage I've heard from you is that it
> gets rid of the gunk.  Well, we still have to support the gunk for
> non-pv devices so the gunk is basically free.  The clean version is
> expensive since we need to port it to all guests and implement
> exciting features like hotplug.
My real objection to PCI is fast-path related.  I don't object, per se,
to using PCI for discovery and hotplug.  If you use PCI just for these
types of things, but then allow fastpath to use more hypercall oriented
primitives, then I would agree with you.  We can leave PCI emulation in
user-space, and we get it for free, and things are relatively tidy.

Its once you start requiring that we stay ABI compatible with something
like the existing virtio-net in x86 KVM where I think it starts to get
ugly when you try to move it into the kernel.  So that is what I had a
real objection to.  I think as long as we are not talking about trying
to make something like that work, its a much more viable prospect.

So what I propose is the following: 

1) The core vbus design stays the same (or close to it)
2) the vbus-proxy and kvm-guest patch go away
3) the kvm-host patch changes to work with coordination from the
userspace-pci emulation for things like MSI routing
4) qemu will know to create some MSI shim 1:1 with whatever it
instantiates on the bus (and can communicate changes
5) any drivers that are written for these new PCI-IDs that might be
present are allowed to use a hypercall ABI to talk after they have been
probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
access methods).

Once I get here, I might have greater clarity to see how hard it would
make to emulate fast path components as well.  It might be easier than I
think.

This is all off the cuff so it might need some fine tuning before its
actually workable.

Does that sound reasonable?

>
> - finer-grained point-to-point communication abstractions
>
> Where virtio has ring+signalling together, you layer the two.  For
> networking, it doesn't matter.  For other applications, it may be
> helpful, perhaps you have something in mind.

Yeah, actually.  Thanks for bringing that up.

So the reason why signaling and the ring are distinct constructs in the
design is to facilitate constructs other than rings.  For instance,
there may be some models where having a flat shared page is better than
a ring.  A ring will naturally preserve all values in flight, where as a
flat shared page would not (last update is always current).  There are
some algorithms where a previously posted value is obsoleted by an
update, and therefore rings are inherently bad for this update model. 
And as we know, there are plenty of algorithms where a ring works
perfectly.  So I wanted that flexibility to be able to express both.

One of the things I have in mind for the flat page model is that RT vcpu
priority thing.  Another thing I am thinking of is coming up with a PV
LAPIC type replacement (where we can avoid doing the EOI trap by having
the PICs state shared).

>
> - your "bidirectional napi" model for the network device
>
> virtio implements exactly the same thing, except for the case of tx
> mitigation, due to my (perhaps pig-headed) rejection of doing things
> in a separate thread, and due to the total lack of sane APIs for
> packet traffic.

Yeah, and this part is not vbus, nor in-kernel specific.  That was just
a design element of venet-tap.  Though note, I did design the
vbus/shm-signal infrastructure with rich support for such a notion in
mind, so it wasn't accidental or anything like that.

>
> - a kernel implementation of the host networking device
>
> Given the continuous rejection (or rather, their continuous
> non-adoption-and-implementation) of my ideas re zerocopy networking
> aio, that seems like a pragmatic approach.  I wish it were otherwise.

Well, that gives me hope, at least ;)


>
> - a promise of more wonderful things yet to come
>
> Obviously I can't evaluate this.

Right, sorry.  I wish I had more concrete examples to show you, but we
only have the venet-tap working at this time.  I was going for the
"release early/often" approach in getting the core reviewed before we
got too far down a path, but perhaps that was the wrong thing in this
case.  We will certainly be sending updates as we get some of the more
advanced models and concepts working.

-Greg
Avi Kivity - April 3, 2009, 1:37 p.m.
Gregory Haskins wrote:
> Avi Kivity wrote:
>   
>> Gregory Haskins wrote:
>>     
>>> So again, I am proposing for consideration of accepting my work (either
>>> in its current form, or something we agree on after the normal review
>>> process) not only on the basis of the future development of the
>>> platform, but also to keep current components in their running to their
>>> full potential.  I will again point out that the code is almost
>>> completely off to the side, can be completely disabled with config
>>> options, and I will maintain it.  Therefore the only real impact is to
>>> people who care to even try it, and to me.
>>>   
>>>       
>> Your work is a whole stack.  Let's look at the constituents.
>>
>> - a new virtual bus for enumerating devices.
>>
>> Sorry, I still don't see the point.  It will just make writing drivers
>> more difficult.  The only advantage I've heard from you is that it
>> gets rid of the gunk.  Well, we still have to support the gunk for
>> non-pv devices so the gunk is basically free.  The clean version is
>> expensive since we need to port it to all guests and implement
>> exciting features like hotplug.
>>     
> My real objection to PCI is fast-path related.  I don't object, per se,
> to using PCI for discovery and hotplug.  If you use PCI just for these
> types of things, but then allow fastpath to use more hypercall oriented
> primitives, then I would agree with you.  We can leave PCI emulation in
> user-space, and we get it for free, and things are relatively tidy.
>   

PCI has very little to do with the fast path (nothing, if we use MSI).

> Its once you start requiring that we stay ABI compatible with something
> like the existing virtio-net in x86 KVM where I think it starts to get
> ugly when you try to move it into the kernel.  So that is what I had a
> real objection to.  I think as long as we are not talking about trying
> to make something like that work, its a much more viable prospect.
>   

I don't see why the fast path of virtio-net would be bad.  Can you 
elaborate?

Obviously all the pci glue stays in userspace.

> So what I propose is the following: 
>
> 1) The core vbus design stays the same (or close to it)
>   

Sorry, I still don't see what advantage this has over PCI, and how you 
deal with the disadvantages.

> 2) the vbus-proxy and kvm-guest patch go away
> 3) the kvm-host patch changes to work with coordination from the
> userspace-pci emulation for things like MSI routing
> 4) qemu will know to create some MSI shim 1:1 with whatever it
> instantiates on the bus (and can communicate changes
>   

Don't userstand.  What's this MSI shim?

> 5) any drivers that are written for these new PCI-IDs that might be
> present are allowed to use a hypercall ABI to talk after they have been
> probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
> access methods).
>   

The way we'd to it with virtio is to add a feature bit that say "you can 
hypercall here instead of pio".  This way old drivers continue to work.

Note that nothing prevents us from trapping pio in the kernel (in fact, 
we do) and forwarding it to the device.  It shouldn't be any slower than 
hypercalls.

> Once I get here, I might have greater clarity to see how hard it would
> make to emulate fast path components as well.  It might be easier than I
> think.
>
> This is all off the cuff so it might need some fine tuning before its
> actually workable.
>
> Does that sound reasonable?
>   

The vbus part (I assume you mean device enumeration) worries me.  I 
don't think you've yet set down what its advantages are.  Being pure and 
clean doesn't count, unless you rip out PCI from all existing installed 
hardware and from Windows.

>> - finer-grained point-to-point communication abstractions
>>
>> Where virtio has ring+signalling together, you layer the two.  For
>> networking, it doesn't matter.  For other applications, it may be
>> helpful, perhaps you have something in mind.
>>     
>
> Yeah, actually.  Thanks for bringing that up.
>
> So the reason why signaling and the ring are distinct constructs in the
> design is to facilitate constructs other than rings.  For instance,
> there may be some models where having a flat shared page is better than
> a ring.  A ring will naturally preserve all values in flight, where as a
> flat shared page would not (last update is always current).  There are
> some algorithms where a previously posted value is obsoleted by an
> update, and therefore rings are inherently bad for this update model. 
> And as we know, there are plenty of algorithms where a ring works
> perfectly.  So I wanted that flexibility to be able to express both.
>   

I agree that there is significant potential here.

> One of the things I have in mind for the flat page model is that RT vcpu
> priority thing.  Another thing I am thinking of is coming up with a PV
> LAPIC type replacement (where we can avoid doing the EOI trap by having
> the PICs state shared).
>   

You keep falling into the paravirtualize the entire universe trap.  If 
you look deep down, you can see Jeremy struggling in there trying to 
bring dom0 support to Linux/Xen.

The lapic is a huge ball of gunk but ripping it out is a monumental job 
with no substantial benefits.  We can at much lower effort avoid the EOI 
trap by paravirtualizing that small bit of ugliness.  Sure the result 
isn't a pure and clean room implementation.  It's a band aid.  But I'll 
take a 50-line band aid over a 3000-line implementation split across 
guest and host, which only works with Linux.
Gregory Haskins - April 3, 2009, 4:28 p.m.
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>    
>>>> So again, I am proposing for consideration of accepting my work
>>>> (either
>>>> in its current form, or something we agree on after the normal review
>>>> process) not only on the basis of the future development of the
>>>> platform, but also to keep current components in their running to
>>>> their
>>>> full potential.  I will again point out that the code is almost
>>>> completely off to the side, can be completely disabled with config
>>>> options, and I will maintain it.  Therefore the only real impact is to
>>>> people who care to even try it, and to me.
>>>>         
>>> Your work is a whole stack.  Let's look at the constituents.
>>>
>>> - a new virtual bus for enumerating devices.
>>>
>>> Sorry, I still don't see the point.  It will just make writing drivers
>>> more difficult.  The only advantage I've heard from you is that it
>>> gets rid of the gunk.  Well, we still have to support the gunk for
>>> non-pv devices so the gunk is basically free.  The clean version is
>>> expensive since we need to port it to all guests and implement
>>> exciting features like hotplug.
>>>     
>> My real objection to PCI is fast-path related.  I don't object, per se,
>> to using PCI for discovery and hotplug.  If you use PCI just for these
>> types of things, but then allow fastpath to use more hypercall oriented
>> primitives, then I would agree with you.  We can leave PCI emulation in
>> user-space, and we get it for free, and things are relatively tidy.
>>   
>
> PCI has very little to do with the fast path (nothing, if we use MSI).

At the very least, PIOs are slightly slower than hypercalls.  Perhaps
not enough to care, but the last time I measured them they were slower,
and therefore my clean slate design doesn't use them.

But I digress.  I think I was actually kind of agreeing with you that we
could do this. :P

>
>> Its once you start requiring that we stay ABI compatible with something
>> like the existing virtio-net in x86 KVM where I think it starts to get
>> ugly when you try to move it into the kernel.  So that is what I had a
>> real objection to.  I think as long as we are not talking about trying
>> to make something like that work, its a much more viable prospect.
>>   
>
> I don't see why the fast path of virtio-net would be bad.  Can you
> elaborate?

Im not.  I am saying I think we might be able to do this.

>
> Obviously all the pci glue stays in userspace.
>
>> So what I propose is the following:
>> 1) The core vbus design stays the same (or close to it)
>>   
>
> Sorry, I still don't see what advantage this has over PCI, and how you
> deal with the disadvantages.

I think you are confusing the vbus-proxy (guest side) with the vbus
backend.  (1) is saying "keep the vbus backend'" and (2) is saying drop
the guest side stuff.  In this proposal, the guest would speak a PCI ABI
as far as its concerned.  Devices in the vbus backend would render as
PCI objects in the ICH (or whatever) model in userspace.

>
>> 2) the vbus-proxy and kvm-guest patch go away
>> 3) the kvm-host patch changes to work with coordination from the
>> userspace-pci emulation for things like MSI routing
>> 4) qemu will know to create some MSI shim 1:1 with whatever it
>> instantiates on the bus (and can communicate changes
>>   
>
> Don't userstand.  What's this MSI shim?

Well, if the device model was an object in vbus down in the kernel, yet
PCI emulation was up in qemu, presumably we would want something to
handle things like PCI config-cycles up in userspace.  Like, for
instance, if the guest re-routes the MSI.  The shim/proxy would handle
the config-cycle, and then turn around and do an ioctl to the kernel to
configure the change with the in-kernel device model (or the irq
infrastructure, as required).

But, TBH, I haven't really looked into whats actually required to make
this work yet.  I am just spitballing to try to find a compromise.

>
>> 5) any drivers that are written for these new PCI-IDs that might be
>> present are allowed to use a hypercall ABI to talk after they have been
>> probed for that ID (e.g. they are not limited to PIO or MMIO BAR type
>> access methods).
>>   
>
> The way we'd to it with virtio is to add a feature bit that say "you
> can hypercall here instead of pio".  This way old drivers continue to
> work.

Yep, agreed.  This is what I was thinking we could do.  But now that I
have the possibility that I just need to write a virtio-vbus module to
co-exist with virtio-pci, perhaps it doesn't even need to be explicit.

>
> Note that nothing prevents us from trapping pio in the kernel (in
> fact, we do) and forwarding it to the device.  It shouldn't be any
> slower than hypercalls.
Sure, its just slightly slower, so I would prefer pure hypercalls if at
all possible.

>
>> Once I get here, I might have greater clarity to see how hard it would
>> make to emulate fast path components as well.  It might be easier than I
>> think.
>>
>> This is all off the cuff so it might need some fine tuning before its
>> actually workable.
>>
>> Does that sound reasonable?
>>   
>
> The vbus part (I assume you mean device enumeration) worries me

No, you are confusing the front-end and back-end again ;)

The back-end remains, and holds the device models as before.  This is
the "vbus core".  Today the front-end interacts with the hypervisor to
render "vbus" specific devices.  The proposal is to eliminate the
front-end, and have the back end render the objects on the bus as PCI
devices to the guest.  I am not sure if I can make it work, yet.  It
needs more thought.

> .  I don't think you've yet set down what its advantages are.  Being
> pure and clean doesn't count, unless you rip out PCI from all existing
> installed hardware and from Windows.

You are being overly dramatic.  No one has ever said we are talking
about ripping something out.  In fact, I've explicitly stated that PCI
can coexist peacefully.    Having more than one bus in a system is
certainly not without precedent (PCI, scsi, usb, etc).

Rather, PCI is PCI, and will always be.  PCI was designed as a
software-to-hardware interface.  It works well for its intention.  When
we do full emulation of guests, we still do PCI so that all that
software that was designed to work software-to-hardware still continue
to work, even though technically its now software-to-software.  When we
do PV, on the other hand, we no longer need to pretend it is
software-to-hardware.  We can continue to use an interface designed for
software-to-hardware if we choose, or we can use something else such as
an interface designed specifically for software-to-software.

As I have stated, PCI was designed with hardware constraints in mind. 
What if I don't want to be governed by those constraints?  What if I
don't want an interrupt per device (I don't)?   What do I need BARs for
(I don't)?  Is a PCI PIO address relevant to me (no, hypercalls are more
direct)?  Etc.  Its crap I dont need.

All I really need is a way to a) discover and enumerate devices,
preferably dynamically (hotswap), and b) a way to communicate with those
devices.  I think you are overstating the the importance that PCI plays
in (a), and are overstating the complexity associated with doing an
alternative.  I think you are understating the level of hackiness
required to continue to support PCI as we move to new paradigms, like
in-kernel models.  And I think I have already stated that I can
establish a higher degree of flexibility, and arguably, performance for
(b).  Therefore, I have come to the conclusion that I don't want it and
thus eradicated the dependence on it in my design.  I understand the
design tradeoffs that are associated with that decision.

>
>>> - finer-grained point-to-point communication abstractions
>>>
>>> Where virtio has ring+signalling together, you layer the two.  For
>>> networking, it doesn't matter.  For other applications, it may be
>>> helpful, perhaps you have something in mind.
>>>     
>>
>> Yeah, actually.  Thanks for bringing that up.
>>
>> So the reason why signaling and the ring are distinct constructs in the
>> design is to facilitate constructs other than rings.  For instance,
>> there may be some models where having a flat shared page is better than
>> a ring.  A ring will naturally preserve all values in flight, where as a
>> flat shared page would not (last update is always current).  There are
>> some algorithms where a previously posted value is obsoleted by an
>> update, and therefore rings are inherently bad for this update model.
>> And as we know, there are plenty of algorithms where a ring works
>> perfectly.  So I wanted that flexibility to be able to express both.
>>   
>
> I agree that there is significant potential here.
>
>> One of the things I have in mind for the flat page model is that RT vcpu
>> priority thing.  Another thing I am thinking of is coming up with a PV
>> LAPIC type replacement (where we can avoid doing the EOI trap by having
>> the PICs state shared).
>>   
>
> You keep falling into the paravirtualize the entire universe trap.  If
> you look deep down, you can see Jeremy struggling in there trying to
> bring dom0 support to Linux/Xen.
>
> The lapic is a huge ball of gunk but ripping it out is a monumental
> job with no substantial benefits.  We can at much lower effort avoid
> the EOI trap by paravirtualizing that small bit of ugliness.  Sure the
> result isn't a pure and clean room implementation.  It's a band aid. 
> But I'll take a 50-line band aid over a 3000-line implementation split
> across guest and host, which only works with Linux.
Well, keep in mind that I was really just giving you an example of
something that might want a shared-page instead of a shared-ring model. 
The possibility that such a device may be desirable in the future was
enough for me to decide that I wanted the shm model to be flexible,
instead of, say, designed specifically for virtio.  We may never, in
fact, do anything with the LAPIC idea.

-Greg

>
>
Avi Kivity - April 5, 2009, 10 a.m.
Gregory Haskins wrote:
>   
>>> 2) the vbus-proxy and kvm-guest patch go away
>>> 3) the kvm-host patch changes to work with coordination from the
>>> userspace-pci emulation for things like MSI routing
>>> 4) qemu will know to create some MSI shim 1:1 with whatever it
>>> instantiates on the bus (and can communicate changes
>>>   
>>>       
>> Don't userstand.  What's this MSI shim?
>>     
>
> Well, if the device model was an object in vbus down in the kernel, yet
> PCI emulation was up in qemu, presumably we would want something to
> handle things like PCI config-cycles up in userspace.  Like, for
> instance, if the guest re-routes the MSI.  The shim/proxy would handle
> the config-cycle, and then turn around and do an ioctl to the kernel to
> configure the change with the in-kernel device model (or the irq
> infrastructure, as required).
>   

Right, this is how it should work.  All the gunk in userspace.

> But, TBH, I haven't really looked into whats actually required to make
> this work yet.  I am just spitballing to try to find a compromise.
>   

One thing I thought of trying to get this generic is to use file 
descriptors as irq handles.  So:

- userspace exposes a PCI device (same as today)
- guest configures its PCI IRQ (using MSI if it supports it)
- userspace handles this by calling KVM_IRQ_FD which converts the irq to 
a file descriptor
- userspace passes this fd to the kernel, or another userspace process
- end user triggers guest irqs by writing to this fd

We could do the same with hypercalls:

- guest and host userspace negotiate hypercall use through PCI config space
- userspace passes an fd to the kernel
- whenever the guest issues an hypercall, the kernel writes the 
arguments to the fd
- other end (in kernel or userspace) processes the hypercall


> No, you are confusing the front-end and back-end again ;)
>
> The back-end remains, and holds the device models as before.  This is
> the "vbus core".  Today the front-end interacts with the hypervisor to
> render "vbus" specific devices.  The proposal is to eliminate the
> front-end, and have the back end render the objects on the bus as PCI
> devices to the guest.  I am not sure if I can make it work, yet.  It
> needs more thought.
>   

It seems to me this already exists, it's the qemu device model.

The host kernel doesn't need any knowledge of how the devices are 
connected, even if it does implement some of them.

>> .  I don't think you've yet set down what its advantages are.  Being
>> pure and clean doesn't count, unless you rip out PCI from all existing
>> installed hardware and from Windows.
>>     
>
> You are being overly dramatic.  No one has ever said we are talking
> about ripping something out.  In fact, I've explicitly stated that PCI
> can coexist peacefully.    Having more than one bus in a system is
> certainly not without precedent (PCI, scsi, usb, etc).
>
> Rather, PCI is PCI, and will always be.  PCI was designed as a
> software-to-hardware interface.  It works well for its intention.  When
> we do full emulation of guests, we still do PCI so that all that
> software that was designed to work software-to-hardware still continue
> to work, even though technically its now software-to-software.  When we
> do PV, on the other hand, we no longer need to pretend it is
> software-to-hardware.  We can continue to use an interface designed for
> software-to-hardware if we choose, or we can use something else such as
> an interface designed specifically for software-to-software.
>
> As I have stated, PCI was designed with hardware constraints in mind. 
> What if I don't want to be governed by those constraints?  

I'd agree with all this if I actually saw a constraint in PCI.  But I don't.

> What if I
> don't want an interrupt per device (I don't)?   

Don't.  Though I thing you do, even multiple interrupts per device.

> What do I need BARs for
> (I don't)?  

Don't use them.

> Is a PCI PIO address relevant to me (no, hypercalls are more
> direct)?  Etc.  Its crap I dont need.
>   

So use hypercalls.

> All I really need is a way to a) discover and enumerate devices,
> preferably dynamically (hotswap), and b) a way to communicate with those
> devices.  I think you are overstating the the importance that PCI plays
> in (a), and are overstating the complexity associated with doing an
> alternative.  

Given that we have PCI, why would we do an alternative?

It works, it works with Windows, the nasty stuff is in userspace.  Why 
expend effort on an alternative?  Instead make it go faster.

> I think you are understating the level of hackiness
> required to continue to support PCI as we move to new paradigms, like
> in-kernel models.  

The kernel need know nothing about PCI, so I don't see how you work this 
out.

> And I think I have already stated that I can
> establish a higher degree of flexibility, and arguably, performance for
> (b).  

You've stated it, but failed to provide arguments for it.
Alex Williamson - April 20, 2009, 6:02 p.m.
On Thu, 2009-04-02 at 13:18 -0500, Anthony Liguori wrote:
> Avi Kivity wrote:
> > Anthony Liguori wrote:
> >>> I don't think we even need that to end this debate.  I'm convinced 
> >>> we have a bug somewhere.  Even disabling TX mitigation, I see a ping 
> >>> latency of around 300ns whereas it's only 50ns on the host.  This 
> >>> defies logic so I'm now looking to isolate why that is.
> >>
> >> I'm down to 90us.  Obviously, s/ns/us/g above.  The exec.c changes 
> >> were the big winner... I hate qemu sometimes.
> 
> Anyway, if we're able to send this many packets, I suspect we'll be able 
> to also handle much higher throughputs without TX mitigation so that's 
> what I'm going to look at now.

Anthony,

Any news on this?  I'm anxious to see virtio-net performance on par with
the virtual-bus results.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/qemu/exec.c b/qemu/exec.c
index 67f3fa3..1331022 100644
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -3268,6 +3268,10 @@  uint32_t ldl_phys(target_phys_addr_t addr)
     unsigned long pd;
     PhysPageDesc *p;
 
+#if 1
+    return ldl_p(phys_ram_base + addr);
+#endif
+
     p = phys_page_find(addr >> TARGET_PAGE_BITS);
     if (!p) {
         pd = IO_MEM_UNASSIGNED;
@@ -3300,6 +3304,10 @@  uint64_t ldq_phys(target_phys_addr_t addr)
     unsigned long pd;
     PhysPageDesc *p;
 
+#if 1
+    return ldq_p(phys_ram_base + addr);
+#endif
+
     p = phys_page_find(addr >> TARGET_PAGE_BITS);
     if (!p) {
         pd = IO_MEM_UNASSIGNED;
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 9bce3a0..ac77b80 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -36,6 +36,7 @@  typedef struct VirtIONet
     VirtQueue *ctrl_vq;
     VLANClientState *vc;
     QEMUTimer *tx_timer;
+    QEMUBH *bh;
     int tx_timer_active;
     int mergeable_rx_bufs;
     int promisc;
@@ -504,6 +505,10 @@  static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
     virtio_notify(&n->vdev, n->rx_vq);
 }
 
+VirtIODevice *global_vdev = NULL;
+
+extern void tap_try_to_recv(VLANClientState *vc);
+
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -545,42 +550,35 @@  static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
             len += hdr_len;
         }
 
+        global_vdev = &n->vdev;
         len += qemu_sendv_packet(n->vc, out_sg, out_num);
+        global_vdev = NULL;
 
         virtqueue_push(vq, &elem, len);
         virtio_notify(&n->vdev, vq);
     }
+
+    tap_try_to_recv(n->vc->vlan->first_client);
 }
 
 static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
-        virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_timer_active = 0;
-        virtio_net_flush_tx(n, vq);
-    } else {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-        n->tx_timer_active = 1;
-        virtio_queue_set_notification(vq, 0);
-    }
+#if 0
+    virtio_queue_set_notification(vq, 0);
+    qemu_bh_schedule(n->bh);
+#else
+    virtio_net_flush_tx(n, n->tx_vq);
+#endif
 }
 
-static void virtio_net_tx_timer(void *opaque)
+static void virtio_net_handle_tx_bh(void *opaque)
 {
     VirtIONet *n = opaque;
 
-    n->tx_timer_active = 0;
-
-    /* Just in case the driver is not ready on more */
-    if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
-        return;
-
-    virtio_queue_set_notification(n->tx_vq, 1);
     virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq, 1);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
@@ -675,8 +673,8 @@  PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vdev.get_features = virtio_net_get_features;
     n->vdev.set_features = virtio_net_set_features;
     n->vdev.reset = virtio_net_reset;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
-    n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+    n->tx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_tx);
     n->ctrl_vq = virtio_add_queue(&n->vdev, 16, virtio_net_handle_ctrl);
     memcpy(n->mac, nd->macaddr, ETH_ALEN);
     n->status = VIRTIO_NET_S_LINK_UP;
@@ -684,10 +682,10 @@  PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
                                  virtio_net_receive, virtio_net_can_receive, n);
     n->vc->link_status_changed = virtio_net_set_link_status;
 
+    n->bh = qemu_bh_new(virtio_net_handle_tx_bh, n);
+
     qemu_format_nic_info_str(n->vc, n->mac);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
 
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 577eb5a..1365d11 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -507,6 +507,39 @@  static void virtio_reset(void *opaque)
     }
 }
 
+void virtio_sample_start(VirtIODevice *vdev)
+{
+    vdev->n_samples = 0;
+    virtio_sample(vdev);
+}
+
+void virtio_sample(VirtIODevice *vdev)
+{
+    gettimeofday(&vdev->samples[vdev->n_samples], NULL);
+    vdev->n_samples++;
+}
+
+static unsigned long usec_delta(struct timeval *before, struct timeval *after)
+{
+    return (after->tv_sec - before->tv_sec) * 1000000UL + (after->tv_usec - before->tv_usec);
+}
+
+void virtio_sample_end(VirtIODevice *vdev)
+{
+    int last, i;
+
+    virtio_sample(vdev);
+
+    last = vdev->n_samples - 1;
+
+    printf("Total time = %ldus\n", usec_delta(&vdev->samples[0], &vdev->samples[last]));
+
+    for (i = 1; i < vdev->n_samples; i++)
+        printf("sample[%d .. %d] = %ldus\n", i - 1, i, usec_delta(&vdev->samples[i - 1], &vdev->samples[i]));
+
+    vdev->n_samples = 0;
+}
+
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
     VirtIODevice *vdev = to_virtio_device(opaque);
diff --git a/qemu/hw/virtio.h b/qemu/hw/virtio.h
index 18c7a1a..a039310 100644
--- a/qemu/hw/virtio.h
+++ b/qemu/hw/virtio.h
@@ -17,6 +17,8 @@ 
 #include "hw.h"
 #include "pci.h"
 
+#include <sys/time.h>
+
 /* from Linux's linux/virtio_config.h */
 
 /* Status byte for guest to report progress, and synchronize features. */
@@ -87,6 +89,8 @@  struct VirtIODevice
     void (*set_config)(VirtIODevice *vdev, const uint8_t *config);
     void (*reset)(VirtIODevice *vdev);
     VirtQueue *vq;
+    int n_samples;
+    struct timeval samples[100];
 };
 
 VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
@@ -122,4 +126,10 @@  int virtio_queue_ready(VirtQueue *vq);
 
 int virtio_queue_empty(VirtQueue *vq);
 
+void virtio_sample_start(VirtIODevice *vdev);
+
+void virtio_sample(VirtIODevice *vdev);
+
+void virtio_sample_end(VirtIODevice *vdev);
+
 #endif
diff --git a/qemu/net.c b/qemu/net.c
index efb64d3..dc872e5 100644
--- a/qemu/net.c
+++ b/qemu/net.c
@@ -733,6 +733,7 @@  typedef struct TAPState {
 } TAPState;
 
 #ifdef HAVE_IOVEC
+
 static ssize_t tap_receive_iov(void *opaque, const struct iovec *iov,
                                int iovcnt)
 {
@@ -853,6 +854,12 @@  static void tap_send(void *opaque)
     } while (s->size > 0);
 }
 
+void tap_try_to_recv(VLANClientState *vc)
+{
+    TAPState *s = vc->opaque;
+    tap_send(s);
+}
+
 int tap_has_vnet_hdr(void *opaque)
 {
     VLANClientState *vc = opaque;