[ovs-dev,RFC,net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

Message ID 20180704142342.21740-1-mcroce@redhat.com
State Not Applicable

Commit Message

Matteo Croce July 4, 2018, 2:23 p.m. UTC
From: Stefano Brivio <sbrivio@redhat.com>

Open vSwitch sends to userspace all received packets that have
no associated flow (thus doing an "upcall"). Then the userspace
program creates a new flow and determines the actions to apply
based on its configuration.

When a single port generates a high rate of upcalls, it can
prevent other ports from dispatching their own upcalls. vswitchd
works around this by creating several netlink sockets for each
port, but with large numbers of ports it quickly exceeds any
reasonable limit on open files.

This patch queues all the upcalls into a list, ordering them in
a per-port round-robin fashion, and schedules a deferred work item
to queue them to userspace.

The algorithm to queue upcalls in a round-robin fashion,
provided by Stefano, is based on these two rules:
 - upcalls for a given port must be inserted after all the other
   occurrences of upcalls for the same port already in the queue,
   in order to avoid out-of-order upcalls for a given port
 - insertion happens once the highest upcall count for any given
   port (excluding the one currently at hand) is greater than the
   count for the port we're queuing to -- if this condition is
   never true, the upcall is queued at the tail. This results in a
   per-port round-robin order.

In order to implement a fair round-robin behaviour, a variable
queueing delay is introduced. This will be zero if the upcall rate
is below a given threshold (i.e. the deferred work has been idle
for a while); otherwise it starts from a maximum value and
decreases linearly as the queue fills up, so that a busier queue
is flushed sooner.
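Concretely, with the values used in this patch (both UPCALL_QUEUE_TIMEOUT
and UPCALL_QUEUE_MAX_DELAY are 10ms, UPCALL_QUEUE_MAX_LEN is 200), the
delay computed by ovs_dp_upcall_delay() below is:

	delay = 0						if now - last_run >= TIMEOUT
	delay = MAX_DELAY - MAX_DELAY * len / MAX_LEN		otherwise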

This ensures fairness among ports under load and with few
netlink sockets.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Co-authored-by: Stefano Brivio <sbrivio@redhat.com>
---
 net/openvswitch/datapath.c | 143 ++++++++++++++++++++++++++++++++++---
 net/openvswitch/datapath.h |  27 ++++++-
 2 files changed, 161 insertions(+), 9 deletions(-)

Comments

Pravin Shelar July 10, 2018, 6:31 p.m. UTC | #1
On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce <mcroce@redhat.com> wrote:
> [...]
>
Thanks for the patch.
This patch adds the following overhead to upcall handling:
1. a kmalloc.
2. a global spin lock.
3. a context switch to a single worker thread.
I think this could become a bottleneck on most multi-core systems.
You have mentioned issues with the existing fairness mechanism; can you
elaborate on those? I think we could improve that before implementing
heavyweight fairness in upcall handling.
Matteo Croce July 16, 2018, 4:54 p.m. UTC | #2
On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar <pshelar@ovn.org> wrote:
>
> On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce <mcroce@redhat.com> wrote:
> > [...]
> >
> Thanks for the patch.
> This patch adds the following overhead to upcall handling:
> 1. a kmalloc.
> 2. a global spin lock.
> 3. a context switch to a single worker thread.
> I think this could become a bottleneck on most multi-core systems.
> You have mentioned issues with the existing fairness mechanism; can you
> elaborate on those? I think we could improve that before implementing
> heavyweight fairness in upcall handling.

Hi Pravin,

vswitchd allocates N * P netlink sockets, where N is the number of
online CPU cores, and P the number of ports.
With some setups, this number can grow quite fast, also exceeding the
system maximum file descriptor limit.
I've seen a 48 core server failing with -EMFILE when trying to create
more than 65535 netlink sockets needed for handling 1800+ ports.

I made a previous attempt to reduce the sockets to one per CPU, but
this was discussed and rejected on ovs-dev because it would remove
fairness among ports[1].
I think that the current approach of opening a huge number of sockets
doesn't really work (it certainly doesn't scale), and some queueing
logic (either in kernel or user space) is still needed if we really
want to be sure that low-traffic ports get their upcall quota when
other ports are doing way more traffic.

If you are concerned about the kmalloc or the spinlock, we can address
them with a kmem_cache, or with two copies of the list and RCU. I'll
be happy to discuss the implementation details, as long as we all agree
that the current implementation doesn't scale well and has an issue.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html

--
Matteo Croce
per aspera ad upstream
Matteo Croce July 31, 2018, 7:43 p.m. UTC | #3
On Mon, Jul 16, 2018 at 4:54 PM Matteo Croce <mcroce@redhat.com> wrote:
>
> [...]

Hi all,

any idea on how to solve the file descriptor limit hit by the netlink sockets?
I see this issue happen very often, and raising the FD limit to 400k
seems not the right way to solve it.
Any other suggestion on how to improve the patch, or solve the problem
in a different way?

Regards,



--
Matteo Croce
per aspera ad upstream
Ben Pfaff July 31, 2018, 10:06 p.m. UTC | #4
On Tue, Jul 31, 2018 at 07:43:34PM +0000, Matteo Croce wrote:
> [...]
>
> Hi all,
> 
> any idea on how to solve the file descriptor limit hit by the netlink sockets?
> I see this issue happen very often, and raising the FD limit to 400k
> seems not the right way to solve it.
> Any other suggestion on how to improve the patch, or solve the problem
> in a different way?

This is an awkward problem to try to solve with sockets because of the
nature of sockets, which are strictly first-in first-out.  What you
really want is something closer to the algorithm that we use in
ovs-vswitchd to send packets to an OpenFlow controller.  When the
channel becomes congested, then for each packet to be sent to the
controller, OVS appends it to a queue associated with its input port.
(This could be done on a more granular basis than just port.)  If the
maximum amount of queued packets is reached, then OVS discards a packet
from the longest queue.  When space becomes available in the channel,
OVS round-robins through the queues to send a packet.  This achieves
pretty good fairness but it can't be done with sockets because you can't
drop a packet that is already queued to one.
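Roughly, the shape of it in C (just an illustration with made-up names
and sizes, not the code ovs-vswitchd actually uses):

#include <stdlib.h>

#define N_PORTS		64
#define MAX_QUEUED	1024		/* global cap on queued packets */

struct pkt {
	struct pkt *next;
	/* ... packet data ... */
};

struct port_queue {
	struct pkt *head, *tail;
	int len;
};

static struct port_queue queues[N_PORTS];
static int total_queued;
static int rr_next;			/* next port to serve */

static void drop_from_longest(void)
{
	int i, longest = 0;
	struct pkt *victim;

	for (i = 1; i < N_PORTS; i++)
		if (queues[i].len > queues[longest].len)
			longest = i;

	victim = queues[longest].head;
	queues[longest].head = victim->next;
	if (!queues[longest].head)
		queues[longest].tail = NULL;
	queues[longest].len--;
	total_queued--;
	free(victim);
}

/* Channel congested: append the packet to its input port's queue, making
 * room by discarding from the longest queue if the global cap is hit.
 */
static void enqueue(int port, struct pkt *p)
{
	if (total_queued >= MAX_QUEUED)
		drop_from_longest();

	p->next = NULL;
	if (queues[port].tail)
		queues[port].tail->next = p;
	else
		queues[port].head = p;
	queues[port].tail = p;
	queues[port].len++;
	total_queued++;
}

/* Space available in the channel again: round-robin over the per-port
 * queues and send one packet from the first non-empty one.
 */
static struct pkt *dequeue(void)
{
	int i;

	for (i = 0; i < N_PORTS; i++) {
		int port = (rr_next + i) % N_PORTS;
		struct pkt *p = queues[port].head;

		if (!p)
			continue;
		queues[port].head = p->next;
		if (!queues[port].head)
			queues[port].tail = NULL;
		queues[port].len--;
		total_queued--;
		rr_next = (port + 1) % N_PORTS;
		return p;
	}

	return NULL;
}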

My current thought is that any fairness scheme we implement directly in
the kernel is going to need to evolve over time.  Maybe we could do
something flexible with BPF and maps, instead of hard-coding it.
Pravin Shelar July 31, 2018, 11:12 p.m. UTC | #5
On Tue, Jul 31, 2018 at 12:43 PM, Matteo Croce <mcroce@redhat.com> wrote:
> On Mon, Jul 16, 2018 at 4:54 PM Matteo Croce <mcroce@redhat.com> wrote:
>>
>> [...]
>>
>> Hi Pravin,
>>
>> vswitchd allocates N * P netlink sockets, where N is the number of
>> online CPU cores, and P the number of ports.
>> With some setups, this number can grow quite fast, also exceeding the
>> system maximum file descriptor limit.
>> I've seen a 48 core server failing with -EMFILE when trying to create
>> more than 65535 netlink sockets needed for handling 1800+ ports.
>>
>> I made a previous attempt to reduce the sockets to one per CPU, but
>> this was discussed and rejected on ovs-dev because it would remove
>> fairness among ports[1].

Rather than reducing the number of threads down to 1, we could find a
better number of FDs per port.
How about this simple solution (sketched below):
1. Allocate (N * P) FDs as long as it stays under the FD limit.
2. If the FD limit (-EMFILE) is hit, reduce N by half and repeat step 1.
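In pseudo-C, roughly (create_port_sockets() and destroy_port_sockets()
are made-up placeholders for the ovs-vswitchd code that opens and closes
the per-port netlink sockets):

	static int create_sockets_scaled(int n_handlers, int n_ports)
	{
		int n;

		for (n = n_handlers; n >= 1; n /= 2) {
			int err = create_port_sockets(n, n_ports);

			if (err != -EMFILE)
				return err;	/* success, or an unrelated error */

			destroy_port_sockets();	/* roll back the partial allocation */
		}

		return -EMFILE;
	}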


Thanks,
Pravin.

Stefano Brivio Aug. 3, 2018, 4:52 p.m. UTC | #6
Hi Ben,

On Tue, 31 Jul 2018 15:06:57 -0700
Ben Pfaff <blp@ovn.org> wrote:

> This is an awkward problem to try to solve with sockets because of the
> nature of sockets, which are strictly first-in first-out.  What you
> really want is something closer to the algorithm that we use in
> ovs-vswitchd to send packets to an OpenFlow controller.  When the
> channel becomes congested, then for each packet to be sent to the
> controller, OVS appends it to a queue associated with its input port.
> (This could be done on a more granular basis than just port.)  If the
> maximum amount of queued packets is reached, then OVS discards a packet
> from the longest queue.  When space becomes available in the channel,
> OVS round-robins through the queues to send a packet.  This achieves
> pretty good fairness but it can't be done with sockets because you can't
> drop a packet that is already queued to one.

Thanks for your feedback. What you describe is, though, functionally
equivalent to what this patch does, minus the per-port queueing limit.

However, instead of having one explicit queue for each port and
then fetching packets from all the queues in a round-robin fashion,
we implemented this with a single queue, choosing insertion points
while queueing in such a way that the result is equivalent. This way
we avoid the massive overhead associated with having one queue per
port (we can have up to 2^16 ports) and with cycling over all of them.

Let's say we have two ports, A and B, and three upcalls are sent for
each port. If we implement one queue for each port as you described, we
end up with this:

.---------------- - - -
| A1 | A2 | A3 |
'---------------- - - -

.---------------- - - -
| B1 | B2 | B3 |
'---------------- - - -

and then send upcalls in this order: A1, B1, A2, B2, A3, B3.

What we are doing here with a single queue is inserting the upcalls
directly in this order:

.------------------------------- - - -
| A1 | B1 | A2 | B2 | A3 | B3 |
'------------------------------- - - -

and dequeueing from the head.

About the per-port queueing limit: we currently have a global one
(UPCALL_QUEUE_MAX_LEN), while the per-port limit is simply given by
implementation constraints in our case:

		if (dp->upcalls.count[pos->port_no] == U8_MAX - 1) {
			err = -ENOSPC;
			goto out_clear;
		}

but we can easily swap that U8_MAX - 1 with another macro or a
configurable value, if there's any value in doing that.

> My current thought is that any fairness scheme we implement directly in
> the kernel is going to need to evolve over time.  Maybe we could do
> something flexible with BPF and maps, instead of hard-coding it.

Honestly, I fail to see what else we might want to do here, other than
adding a simple mechanism for fairness, to solve the specific issue at
hand. Flexibility would probably come at a higher cost. We could easily
make limits configurable if needed. Do you have anything else in mind?
Ben Pfaff Aug. 3, 2018, 11:01 p.m. UTC | #7
On Fri, Aug 03, 2018 at 06:52:41PM +0200, Stefano Brivio wrote:
> On Tue, 31 Jul 2018 15:06:57 -0700 Ben Pfaff <blp@ovn.org> wrote:
> > My current thought is that any fairness scheme we implement directly in
> > the kernel is going to need to evolve over time.  Maybe we could do
> > something flexible with BPF and maps, instead of hard-coding it.
> 
> Honestly, I fail to see what else we might want to do here, other than
> adding a simple mechanism for fairness, to solve the specific issue at
> hand. Flexibility would probably come at a higher cost. We could easily
> make limits configurable if needed. Do you have anything else in mind?

I think that a simple mechanism for fairness is fine.  The direction of
extensibility that makes me anxious is how to decide what matters for
fairness.  So far, we've talked about per-vport fairness.  That works
pretty well for packets coming in from virtual interfaces where each
vport represents a separate VM.  It does not work well if the traffic
filling your queues all comes from a single physical port because some
source of traffic is sending traffic at a high rate.  In that case,
you'll do a lot better if you do fairness based on the source 5-tuple.
But if you're doing network virtualization, then the outer source
5-tuples won't necessarily vary much and you'd be better off looking at
the VNI and maybe some Geneve TLV options and maybe the inner 5-tuple...

I would be very pleased if we could integrate a simple mechanism for
fairness, based for now on some simple criteria like the source port,
but thinking ahead to how we could later make it gracefully extensible
to consider more general and possibly customizable criteria.

Thanks,

Ben.
Stefano Brivio Aug. 4, 2018, 12:43 a.m. UTC | #8
On Fri, 3 Aug 2018 16:01:08 -0700
Ben Pfaff <blp@ovn.org> wrote:

> I think that a simple mechanism for fairness is fine.  The direction
> of extensibility that makes me anxious is how to decide what matters
> for fairness.  So far, we've talked about per-vport fairness.  That
> works pretty well for packets coming in from virtual interfaces where
> each vport represents a separate VM.

Yes, right, that's the case where we have significant issues currently.

> It does not work well if the traffic filling your queues all comes
> from a single physical port because some source of traffic is sending
> traffic at a high rate.  In that case, you'll do a lot better if you
> do fairness based on the source 5-tuple. But if you're doing network
> virtualization, then the outer source 5-tuples won't necessarily vary
> much and you'd be better off looking at the VNI and maybe some Geneve
> TLV options and maybe the inner 5-tuple...

Sure, I see what you mean now. That looks entirely doable if we
abstract the round-robin bucket selection out of the current patch.

> I would be very pleased if we could integrate a simple mechanism for
> fairness, based for now on some simple criteria like the source port,
> but thinking ahead to how we could later make it gracefully extensible
> to consider more general and possibly customizable criteria.

We could change the patch so that instead of just using the vport for
round-robin queue insertion, we generalise that and use "buckets"
instead of vports, and have a set of possible functions that are called
instead of using port_no directly in ovs_dp_upcall_queue_roundrobin(),
making this configurable via netlink, per datapath.

We could implement selection based on source port or a hash on the
source 5-tuple, and the relevant bits of
ovs_dp_upcall_queue_roundrobin() would look like this:

static int ovs_dp_upcall_queue_roundrobin(struct datapath *dp,
					  struct dp_upcall_info *upcall)
{

[...]

	list_for_each_entry(pos, head, list) {
		int bucket = dp->rr_select(pos);

		/* Count per-bucket upcalls. */
		if (dp->upcalls.count[bucket] == U8_MAX) {
			err = -ENOSPC;
			goto out_clear;
		}
		dp->upcalls.count[bucket]++;

		if (bucket == upcall->bucket) {
			/* Another upcall for the same bucket: move insertion
			 * point here, keep looking for insertion condition to
			 * be still met further on.
			 */
			find_next = true;
			here = pos;
			continue;
		}

		count = dp->upcalls.count[upcall->bucket];
		if (find_next && dp->upcalls.count[bucket] >= count) {
			/* Insertion condition met: no need to look further,
			 * unless another upcall for the same bucket occurs later.
			 */
			find_next = false;
			here = pos;
		}
	}

[...]

}

and implementations for dp->rr_select() would look like:

int rr_select_vport(struct dp_upcall_info *upcall)
{
	return upcall->port_no;
}

int rr_select_srcport(struct dp_upcall_info *upcall)
{
	/* look up source port from upcall->skb... */
}
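A selector hashing flows into a fixed number of buckets could then be
sketched as (skb_get_hash() returns the flow hash the kernel has already
computed, which may or may not be the right key for tunnelled traffic):

#define RR_N_BUCKETS	256

int rr_select_flowhash(struct dp_upcall_info *upcall)
{
	/* Fold the skb flow hash into a small bucket index. */
	return skb_get_hash(upcall->skb) % RR_N_BUCKETS;
}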

And we could then easily extend this to use BPF with maps one day.

This is for clarity by the way, but I guess we should avoid indirect
calls in the final implementation. 

What do you think?
Ben Pfaff Aug. 4, 2018, 12:54 a.m. UTC | #9
On Sat, Aug 04, 2018 at 02:43:24AM +0200, Stefano Brivio wrote:
> On Fri, 3 Aug 2018 16:01:08 -0700
> Ben Pfaff <blp@ovn.org> wrote:
> > I would be very pleased if we could integrate a simple mechanism for
> > fairness, based for now on some simple criteria like the source port,
> > but thinking ahead to how we could later make it gracefully extensible
> > to consider more general and possibly customizable criteria.
> 
> We could change the patch so that instead of just using the vport for
> round-robin queue insertion, we generalise that and use "buckets"
> instead of vports, and have a set of possible functions that are called
> instead of using port_no directly in ovs_dp_upcall_queue_roundrobin(),
> making this configurable via netlink, per datapath.
> 
> We could implement selection based on source port or a hash on the
> source 5-tuple, and the relevant bits of
> ovs_dp_upcall_queue_roundrobin() would look like this:

[...]

> What do you think?

I'd support that.  Thanks.
Stefano Brivio Aug. 7, 2018, 1:31 p.m. UTC | #10
Hi Pravin,

On Tue, 31 Jul 2018 16:12:03 -0700
Pravin Shelar <pshelar@ovn.org> wrote:

> Rather than reducing number of thread down to 1, we could find better
> number of FDs per port.
> How about this simple solution:
> 1. Allocate (N * P) FDs as long as it is under FD limit.
> 2. If FD limit (-EMFILE) is hit reduce N value by half and repeat step 1.

I still see a few quantitative issues with this approach, other than
Ben's observation about design (which, by the way, looks entirely
reasonable to me).

We're talking about a disproportionate number of sockets in any case.
We can have up to 2^16 vports here, with 5k vports being rather common,
and for any reasonable value of N that manages to somehow perturb the
distribution of upcalls per thread, we are talking about something well
in excess of 100k sockets. I think this doesn't really scale.
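Just to put a number on it: on the 48-core server from the example
earlier in the thread, vswitchd ends up with roughly 36 sockets per
port, so 5k vports already mean about 36 * 5000 = 180,000 netlink
sockets.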

With the current value for N (3/4 * number of threads) this can even
get close to /proc/sys/fs/file-max on some systems, and at that point
raising the number of file descriptors allowed for ovs-vswitchd isn't
a solution anymore.

I would instead try to address the concerns that you had about the
original patch adding fairness in the kernel, rather than trying to
make the issue appear less severe in ovs-vswitchd.
Stefano Brivio Aug. 7, 2018, 1:39 p.m. UTC | #11
On Tue, 7 Aug 2018 15:31:11 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> I would instead try to address the concerns that you had about the
> original patch adding fairness in the kernel, rather than trying to
> make the issue appear less severe in ovs-vswitchd.

And, by the way, if we introduce a way to configure the possible
fairness algorithms via netlink, we can also rather easily allow
ovs-vswitchd to disable fairness on the kernel side altogether, should
it be needed.
William Tu Aug. 10, 2018, 2:11 p.m. UTC | #12
On Fri, Aug 3, 2018 at 5:43 PM, Stefano Brivio <sbrivio@redhat.com> wrote:

> [...]
>
> And we could then easily extend this to use BPF with maps one day.
>
Hi Stefano,

If you want to experiment with BPF, Joe and I have a prototype.
We implemented the upcall mechanism using the BPF perf event helper function:
https://github.com/williamtu/ovs-ebpf/blob/master/bpf/datapath.c#L62

And there are threads polling the perf ring buffer to receive packets
from BPF:
https://github.com/williamtu/ovs-ebpf/blob/master/lib/perf-event.c#L232

If I follow the discussion correctly, before the upcall you need to
queue packets based on different configurations (vport/hash/vni/5-tuple/...)
and queue them to different buckets when congestion happens. In this
case, you probably need a BPF map to enqueue/dequeue the packets. BPF
queue maps are not supported yet, but there is a patch available:
[iovisor-dev] [RFC PATCH 1/3] bpf: add bpf queue map

So how to enqueue and dequeue packets depends on the user's BPF
implementation. This allows the fairness scheme to be extensible.

Regards,
William



Stefano Brivio Aug. 14, 2018, 3:25 p.m. UTC | #13
Hi William,

On Fri, 10 Aug 2018 07:11:01 -0700
William Tu <u9012063@gmail.com> wrote:

> > int rr_select_srcport(struct dp_upcall_info *upcall)
> > {
> >         /* look up source port from upcall->skb... */
> > }
> >
> > And we could then easily extend this to use BPF with maps one day.
> >
> >  
> Hi Stefano,
> 
> If you want to experiment with BPF, Joe and I have some prototype.
> We implemented the upcall mechanism using BPF perf event helper function
> https://github.com/williamtu/ovs-ebpf/blob/master/bpf/datapath.c#L62
> 
> And there are threads polling the perf ring buffer to receive packets from
> BPF.
> https://github.com/williamtu/ovs-ebpf/blob/master/lib/perf-event.c#L232

Interesting, thanks for the pointers!

> If I follow the discussion correctly, before upcall, you need to queue
> packets based on different configurations (vport/hash/vni/5-tuple/...)
> and queue to different buckets when congestion happens.

Yes, correct.

> In this case, you
> probably needs a BPF map to enqueue/dequeue the packet. BPF queue map is
> not supported yet, but there is patch available:
> [iovisor-dev] [RFC PATCH 1/3] bpf: add bpf queue map
> 
> So how to enqueue and dequeue packets depends on user's BPF implementation.
> This allows fairness scheme to be extensible.

For the time being we'll try to ensure that BPF can be plugged in
there rather easily. I see the advantage, but I'd rather do this as a
second step.
Pravin Shelar Aug. 15, 2018, 7:19 a.m. UTC | #14
Hi Stefano,

On Tue, Aug 7, 2018 at 6:31 AM, Stefano Brivio <sbrivio@redhat.com> wrote:
> Hi Pravin,
>
> On Tue, 31 Jul 2018 16:12:03 -0700
> Pravin Shelar <pshelar@ovn.org> wrote:
>
>> Rather than reducing number of thread down to 1, we could find better
>> number of FDs per port.
>> How about this simple solution:
>> 1. Allocate (N * P) FDs as long as it is under FD limit.
>> 2. If FD limit (-EMFILE) is hit reduce N value by half and repeat step 1.
>
> I still see a few quantitative issues with this approach, other than
> Ben's observation about design (which, by the way, looks entirely
> reasonable to me).
>
> We're talking about a disproportionate amount of sockets in any case.
> We can have up to 2^16 vports here, with 5k vports being rather common,
> and for any reasonable value of N that manages somehow to perturbate the
> distribution of upcalls per thread, we are talking about something well
> in excess of 100k sockets. I think this doesn't really scale.
>
My argument is not about the proposed fairness algorithm. It is about
the cost of the fairness, and I do not see it addressed in any of the
follow-ups. You seem to be worried about the memory cost and fairness
aspects; I am worried about the CPU cost of the solution.
I think the proposed solution solves the fairness issue, but it also
creates a bottleneck in upcall processing. OVS is known to have slow
upcall processing, and this patch adds even more cost to upcall
handling. The latency of first-packet handling also goes up with
this approach.

I revisited the original patch; here is what I see in terms of added
cost to the existing upcall processing:
1. one "kzalloc(sizeof(*upcall), GFP_ATOMIC);" -- this involves
allocating and initializing memory.
2. copying the flow key, which is more than 1 KB (upcall->key = *key).
3. acquiring the spin_lock_bh dp->upcalls.lock, which disables bottom
half processing on the CPU while waiting for the global lock.
4. iterating over the list of queued upcalls; one objective of that is
to avoid out-of-order packets, but I do not see the point of ordering
packets from different streams.
5. signalling the upcall thread after the ovs_dp_upcall_delay() delay,
which adds further to the latency.
6. the upcall is then handed over to a different thread (context
switch), likely on a different CPU.
8. the upcall object is freed on the remote CPU.
9. the single lock essentially means that OVS kernel datapath upcall
processing is single-threaded, no matter the number of cores in the
system.

I would be interested in how we are going to address these issues.

In the example you gave about the netlink fd issue on a server with 48
cores, how does this solution work when there are 5K ports each
triggering upcalls? Can you benchmark your patch? Do you have
performance numbers for TCP_CRR with and without this patch? Also,
please publish latency numbers for this patch, and turn off megaflows
to exercise upcall handling.

I understand fairness has a cost, but we need to find the right
balance between performance and fairness. The current fairness scheme
is a lockless algorithm without much computational overhead; did you
try to improve the current algorithm so that it uses a smaller number
of ports?


> With the current value for N (3/4 * number of threads) this can even get
> close to /proc/sys/fs/file-max on some systems, and there raising the
> number of allowed file descriptors for ovs-vswitchd isn't a solution
> anymore.
>
> I would instead try to address the concerns that you had about the
> original patch adding fairness in the kernel, rather than trying to
> make the issue appear less severe in ovs-vswitchd.
>
> --
> Stefano
Stefano Brivio Aug. 16, 2018, 9:07 p.m. UTC | #15
Pravin,

On Wed, 15 Aug 2018 00:19:39 -0700
Pravin Shelar <pshelar@ovn.org> wrote:

> My argument is not about proposed fairness algorithm. It is about cost
> of the fairness and I do not see it is addressed in any of the follow
> ups.

We are still working on it (especially on the points that you mentioned
already), that's why there hasn't been any follow-up yet.

After all, we marked this patch as RFC because we thought we needed to
gather some feedback before we'd reach a solution that's good enough :)

> I revisited the original patch, here is what I see in term of added
> cost to existing upcall processing:

Thanks for the review and the detailed summary, some answers follow:

> 1. one "kzalloc(sizeof(*upcall), GFP_ATOMIC);" This involve allocate
> and initialize memory

We would use kmem_cache_alloc() here; the queue is rather small. As
for the initialisation, we don't really need it, so we can drop it.
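Something along these lines (sketch):

	/* Dedicated slab cache instead of kzalloc(); the zeroing can be
	 * skipped as long as every field used later is set explicitly.
	 */
	static struct kmem_cache *upcall_cache;

	/* at datapath init: */
	upcall_cache = kmem_cache_create("ovs_upcall_info",
					 sizeof(struct dp_upcall_info),
					 0, SLAB_HWCACHE_ALIGN, NULL);

	/* in ovs_dp_process_packet(), instead of kzalloc(): */
	upcall = kmem_cache_alloc(upcall_cache, GFP_ATOMIC);

	/* in ovs_dp_upcall_dequeue(), instead of kfree(u): */
	kmem_cache_free(upcall_cache, u);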

> 2. copy flow key which is more than 1 KB (upcall->key = *key)

The current idea here is to find a way to safely hold the pointer to the
flow key. Do you have any suggestion?

> 3. Acquire spin_lock_bh dp->upcalls.lock, which would disable bottom
> half processing on CPU while waiting for the global lock.

A double list, whose pointer is swapped when we start dequeuing
packets (same as it's done e.g. for the flow table on rehashing), would
avoid the need for this spinlock. We're trying that out.
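Even just detaching the whole list in one go before processing it would
shrink that critical section to almost nothing -- a rough sketch (not
yet the two-list version):

	static void ovs_dp_upcall_dequeue(struct work_struct *work)
	{
		struct datapath *dp = container_of(work, struct datapath,
						   upcalls.work.work);
		struct dp_upcall_info *u, *n;
		LIST_HEAD(todo);

		spin_lock_bh(&dp->upcalls.lock);
		list_splice_init(&dp->upcalls.list, &todo);
		dp->upcalls.len = 0;
		spin_unlock_bh(&dp->upcalls.lock);

		/* Process the detached upcalls without holding the lock. */
		list_for_each_entry_safe(u, n, &todo, list) {
			if (unlikely(ovs_dp_upcall(dp, u->skb, &u->key, u, 0)))
				kfree_skb(u->skb);
			else
				consume_skb(u->skb);
			kfree(u);
		}
	}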

> 4. Iterate list of queued upcalls, one of objective it is to avoid out
> of order packet. But I do not see point of ordering packets from
> different streams.

Please note, though, that we also have packets from the same stream.
Actually, the whole point of this exercise is to get packets from
different streams out of order, while maintaining order for a given
stream.

> 5. signal upcall thread after delay ovs_dp_upcall_delay(). This adds
> further to the latency.

The idea behind ovs_dp_upcall_delay() is to schedule without delay if
we don't currently have a storm of upcalls.

But if we do, we're probably introducing less latency by doing this than
by letting ovs-vswitchd handle them. It's also a fundamental requirement
to have fairness: we need to schedule upcalls, and to schedule we need
some (small, in the overall picture) delay. This is another point where
we need to show some detailed measurements, I guess.

> 6. upcall is then handed over to different thread (context switch),
> likely on different CPU.
> 8. the upcall object is freed on remote CPU.

The solution could be to use cmwq instead and have per-CPU workers and
queues. But I wonder what would be the drawbacks of having per-CPU
fairness. I think this depends a lot on how ovs-vswitchd handles the
upcalls. We could check how that performs. Any thoughts?

> 9. single lock essentially means OVS kernel datapath upcall processing
> is single threaded no matter number of cores in system.

This should also be solved by keeping two queues.

> I would be interested in how are we going to address these issues.
> 
> In example you were talking about netlink fd issue on server with 48
> core, how does this solution works when there are 5K ports each
> triggering upcall ? Can you benchmark your patch? Do you have
> performance numbers for TCP_CRR with and without this patch ? Also
> publish latency numbers for this patch. Please turn off megaflow to
> exercise upcall handling.

We just ran some tests that show that fairness is maintained with a
much lower number of ports, but we have no performance numbers at the
moment -- other than the consideration that, when flooding with
upcalls, ovs-vswitchd is the bottleneck. We'll run proper performance
tests, focusing especially on latency (which we have kind of ignored
so far).

> I understand fairness has cost, but we need to find right balance
> between performance and fairness. Current fairness scheme is a
> lockless algorithm without much computational overhead, did you try to
> improve current algorithm so that it uses less number of ports.

We tried with one socket per thread; it just doesn't work. We can
definitely try a bit harder. The problem I see here is that the current
mechanism is not actually a fairness scheme. It kind of works for most
workloads, but if a port happens to be flooding with a given timing, I
don't see how fairness can be guaranteed.
Stefano Brivio Sept. 26, 2018, 9:58 a.m. UTC | #16
Hi Pravin,

On Wed, 15 Aug 2018 00:19:39 -0700
Pravin Shelar <pshelar@ovn.org> wrote:

> I understand fairness has cost, but we need to find right balance
> between performance and fairness. Current fairness scheme is a
> lockless algorithm without much computational overhead, did you try to
> improve current algorithm so that it uses less number of ports.

In the end, we tried harder as you suggested, and found out a way to
avoid the need for per-thread sets of per-vport sockets: with a few
changes, a process-wide set of per-vport sockets is actually enough to
maintain the current fairness behaviour.

Further, we can get rid of duplicate fd events if the EPOLLEXCLUSIVE
epoll() flag is available, which improves performance noticeably. This
is explained in more detail in the commit message of the ovs-vswitchd
patch Matteo wrote [1], now merged.
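The core of it on the userspace side is roughly the following (a sketch,
not the actual ovs-vswitchd code): each handler thread registers the same
per-vport netlink socket in its own epoll set with EPOLLEXCLUSIVE, so a
single event no longer wakes up every waiting thread:

	#include <sys/epoll.h>

	static int add_channel(int epfd, int nl_sock)
	{
		struct epoll_event ev = {
			.events = EPOLLIN | EPOLLEXCLUSIVE,
			.data.fd = nl_sock,
		};

		return epoll_ctl(epfd, EPOLL_CTL_ADD, nl_sock, &ev);
	}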

This approach solves the biggest issue (i.e. huge amount of netlink
sockets) in ovs-vswitchd itself, without the need for kernel changes.

It doesn't address some of the proposed improvements, though: it does
nothing to improve the current fairness scheme, and it allows neither
the configurable fairness criteria proposed by Ben nor the usage of BPF
maps for extensibility as suggested by William.

I would however say that those topics definitely bear lower priority,
and perhaps addressing them at all becomes overkill now that we no
longer have a serious issue.

[1] https://patchwork.ozlabs.org/patch/974304/
Pravin Shelar Sept. 28, 2018, 5:15 p.m. UTC | #17
On Wed, Sep 26, 2018 at 2:58 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> Hi Pravin,
>
> On Wed, 15 Aug 2018 00:19:39 -0700
> Pravin Shelar <pshelar@ovn.org> wrote:
>
> > I understand fairness has cost, but we need to find right balance
> > between performance and fairness. Current fairness scheme is a
> > lockless algorithm without much computational overhead, did you try to
> > improve current algorithm so that it uses less number of ports.
>
> In the end, we tried harder as you suggested, and found out a way to
> avoid the need for per-thread sets of per-vport sockets: with a few
> changes, a process-wide set of per-vport sockets is actually enough to
> maintain the current fairness behaviour.
>
> Further, we can get rid of duplicate fd events if the EPOLLEXCLUSIVE
> epoll() flag is available, which improves performance noticeably. This
> is explained in more detail in the commit message of the ovs-vswitchd
> patch Matteo wrote [1], now merged.
>
> This approach solves the biggest issue (i.e. huge amount of netlink
> sockets) in ovs-vswitchd itself, without the need for kernel changes.
>

Thanks for working on this issue.

> It doesn't address some proposed improvements though, that is, it does
> nothing to improve the current fairness scheme, it won't allow neither
> the configurable fairness criteria proposed by Ben, nor usage of BPF
> maps for extensibility as suggested by William.
>
> I would however say that those topics bear definitely lower priority,
> and perhaps addressing them at all becomes overkill now that we don't
> any longer have a serious issue.
>
> [1] https://patchwork.ozlabs.org/patch/974304/
Nice!

Thanks,
Pravin.

Patch

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 0f5ce77460d4..2cfd504562d8 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -59,6 +59,10 @@ 
 #include "vport-internal_dev.h"
 #include "vport-netdev.h"
 
+#define UPCALL_QUEUE_TIMEOUT	msecs_to_jiffies(10)
+#define UPCALL_QUEUE_MAX_DELAY	msecs_to_jiffies(10)
+#define UPCALL_QUEUE_MAX_LEN	200
+
 unsigned int ovs_net_id __read_mostly;
 
 static struct genl_family dp_packet_genl_family;
@@ -225,6 +229,116 @@  void ovs_dp_detach_port(struct vport *p)
 	ovs_vport_del(p);
 }
 
+static void ovs_dp_upcall_dequeue(struct work_struct *work)
+{
+	struct datapath *dp = container_of(work, struct datapath,
+					   upcalls.work.work);
+	struct dp_upcall_info *u, *n;
+
+	spin_lock_bh(&dp->upcalls.lock);
+	list_for_each_entry_safe(u, n, &dp->upcalls.list, list) {
+		if (unlikely(ovs_dp_upcall(dp, u->skb, &u->key, u, 0)))
+			kfree_skb(u->skb);
+		else
+			consume_skb(u->skb);
+		kfree(u);
+	}
+	dp->upcalls.len = 0;
+	INIT_LIST_HEAD(&dp->upcalls.list);
+	spin_unlock_bh(&dp->upcalls.lock);
+}
+
+/* Calculate the delay of the deferred work which sends the upcalls. If it ran
+ * more than UPCALL_QUEUE_TIMEOUT ago, schedule the work immediately. Otherwise
+ * return a time between 0 and UPCALL_QUEUE_MAX_DELAY, depending linearly on the
+ * queue utilisation.
+ */
+static unsigned long ovs_dp_upcall_delay(int queue_len, unsigned long last_run)
+{
+	if (jiffies - last_run >= UPCALL_QUEUE_TIMEOUT)
+		return 0;
+
+	return UPCALL_QUEUE_MAX_DELAY -
+	       UPCALL_QUEUE_MAX_DELAY * queue_len / UPCALL_QUEUE_MAX_LEN;
+}
+
+static int ovs_dp_upcall_queue_roundrobin(struct datapath *dp,
+					  struct dp_upcall_info *upcall)
+{
+	struct list_head *head = &dp->upcalls.list;
+	struct dp_upcall_info *here = NULL, *pos;
+	bool find_next = true;
+	unsigned long delay;
+	int err = 0;
+	u8 count;
+
+	spin_lock_bh(&dp->upcalls.lock);
+	if (dp->upcalls.len > UPCALL_QUEUE_MAX_LEN) {
+		err = -ENOSPC;
+		goto out;
+	}
+
+	/* Insert upcalls in the list in a per-port round-robin fashion, look
+	 * for insertion point:
+	 * - to avoid out-of-order per-port upcalls, we can insert only after
+	 *   the last occurrence of upcalls for the same port
+	 * - insert upcall only after we reach a count of occurrences for a
+	 *   given port greater than the one we're inserting this upcall for
+	 */
+	list_for_each_entry(pos, head, list) {
+		/* Count per-port upcalls. */
+		if (dp->upcalls.count[pos->port_no] == U8_MAX - 1) {
+			err = -ENOSPC;
+			goto out_clear;
+		}
+		dp->upcalls.count[pos->port_no]++;
+
+		if (pos->port_no == upcall->port_no) {
+			/* Another upcall for the same port: move insertion
+			 * point here, keep looking for insertion condition to
+			 * be still met further on.
+			 */
+			find_next = true;
+			here = pos;
+			continue;
+		}
+
+		count = dp->upcalls.count[upcall->port_no];
+		if (find_next && dp->upcalls.count[pos->port_no] >= count) {
+			/* Insertion condition met: no need to look further,
+			 * unless another upcall for the same port occurs later.
+			 */
+			find_next = false;
+			here = pos;
+		}
+	}
+
+	if (here)
+		list_add(&upcall->list, &here->list);
+	else
+		list_add_tail(&upcall->list, head);
+
+	dp->upcalls.len++;
+
+out_clear:
+	/* Clear the per-port counters we used, so that we don't need to zero
+	 * out the counters array on every insertion.
+	 */
+	list_for_each_entry_reverse(pos, head, list)
+		dp->upcalls.count[pos->port_no] = 0;
+
+out:
+	spin_unlock_bh(&dp->upcalls.lock);
+
+	if (!err) {
+		delay = ovs_dp_upcall_delay(dp->upcalls.len,
+					    dp->upcalls.last_run);
+		mod_delayed_work(system_wq, &dp->upcalls.work, delay);
+	}
+
+	return err;
+}
+
 /* Must be called with rcu_read_lock. */
 void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 {
@@ -241,18 +355,25 @@  void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 	/* Look up flow. */
 	flow = ovs_flow_tbl_lookup_stats(&dp->table, key, &n_mask_hit);
 	if (unlikely(!flow)) {
-		struct dp_upcall_info upcall;
+		struct dp_upcall_info *upcall;
 		int error;
 
-		memset(&upcall, 0, sizeof(upcall));
-		upcall.cmd = OVS_PACKET_CMD_MISS;
-		upcall.portid = ovs_vport_find_upcall_portid(p, skb);
-		upcall.mru = OVS_CB(skb)->mru;
-		error = ovs_dp_upcall(dp, skb, key, &upcall, 0);
+		upcall = kzalloc(sizeof(*upcall), GFP_ATOMIC);
+		if (!upcall) {
+			kfree_skb(skb);
+			stats_counter = &stats->n_missed;
+			goto out;
+		}
+
+		upcall->cmd = OVS_PACKET_CMD_MISS;
+		upcall->portid = ovs_vport_find_upcall_portid(p, skb);
+		upcall->port_no = p->port_no;
+		upcall->mru = OVS_CB(skb)->mru;
+		upcall->skb = skb;
+		upcall->key = *key;
+		error = ovs_dp_upcall_queue_roundrobin(dp, upcall);
 		if (unlikely(error))
 			kfree_skb(skb);
-		else
-			consume_skb(skb);
 		stats_counter = &stats->n_missed;
 		goto out;
 	}
@@ -1589,6 +1710,10 @@  static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	for (i = 0; i < DP_VPORT_HASH_BUCKETS; i++)
 		INIT_HLIST_HEAD(&dp->ports[i]);
 
+	INIT_LIST_HEAD(&dp->upcalls.list);
+	spin_lock_init(&dp->upcalls.lock);
+	INIT_DELAYED_WORK(&dp->upcalls.work, ovs_dp_upcall_dequeue);
+
 	err = ovs_meters_init(dp);
 	if (err)
 		goto err_destroy_ports_array;
@@ -1658,6 +1783,8 @@  static void __dp_destroy(struct datapath *dp)
 {
 	int i;
 
+	cancel_delayed_work_sync(&dp->upcalls.work);
+
 	for (i = 0; i < DP_VPORT_HASH_BUCKETS; i++) {
 		struct vport *vport;
 		struct hlist_node *n;
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index c9eb267c6f7e..f8b8bb679929 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -24,6 +24,7 @@ 
 #include <linux/mutex.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>
+#include <linux/workqueue.h>
 #include <linux/u64_stats_sync.h>
 #include <net/ip_tunnels.h>
 
@@ -70,6 +71,12 @@  struct dp_stats_percpu {
  * @net: Reference to net namespace.
  * @max_headroom: the maximum headroom of all vports in this datapath; it will
  * be used by all the internal vports in this dp.
+ * @upcalls.work: sends queued upcalls to userspace.
+ * @upcalls.list: list of queued upcalls.
+ * @upcalls.len: elements in upcall_list.
+ * @upcalls.lock: lock for the upcall list.
+ * @upcalls.count: array used to sort the upcalls delivered to userspace.
+ * @upcalls.last_run: timestamp of last work run.
  *
  * Context: See the comment on locking at the top of datapath.c for additional
  * locking information.
@@ -96,6 +103,16 @@  struct datapath {
 
 	/* Switch meters. */
 	struct hlist_head *meters;
+
+	/* Upcalls queue handling. */
+	struct {
+		struct delayed_work work;
+		struct list_head list;
+		int len;
+		spinlock_t lock;	/* Protects len and upcall list. */
+		u8 count[DP_MAX_PORTS];
+		unsigned long last_run;
+	} upcalls;
 };
 
 /**
@@ -116,7 +133,7 @@  struct ovs_skb_cb {
 #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
 
 /**
- * struct dp_upcall - metadata to include with a packet to send to userspace
+ * struct dp_upcall_info - Upcall for userspace, including metadata to send
  * @cmd: One of %OVS_PACKET_CMD_*.
  * @userdata: If nonnull, its variable-length value is passed to userspace as
  * %OVS_PACKET_ATTR_USERDATA.
@@ -125,6 +142,10 @@  struct ovs_skb_cb {
  * counter.
  * @egress_tun_info: If nonnull, becomes %OVS_PACKET_ATTR_EGRESS_TUN_KEY.
  * @mru: If not zero, Maximum received IP fragment size.
+ * @list: list within vport for upcall queue handling.
+ * @skb: the socket buffer that generated the upcall.
+ * @key: flow key.
+ * @port_no: port number within the datapath.
  */
 struct dp_upcall_info {
 	struct ip_tunnel_info *egress_tun_info;
@@ -134,6 +155,10 @@  struct dp_upcall_info {
 	u32 portid;
 	u8 cmd;
 	u16 mru;
+	struct list_head list;
+	struct sk_buff *skb;
+	struct sw_flow_key key;
+	u16 port_no;
 };
 
 /**