
[v2,net-next,4/5] xps_flows: XPS for packets that don't have a socket

Message ID 20160929035428.204355-5-tom@herbertland.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Tom Herbert Sept. 29, 2016, 3:54 a.m. UTC
xps_flows maintains a per device flow table that is indexed by the
skbuff hash. The table is only consulted when there is no queue saved in
a transmit socket for an skbuff.

Each entry in the flow table contains a queue index and a queue
pointer. The queue pointer is set when a queue is chosen using a
flow table entry. This pointer is set to the head pointer in the
transmit queue (which is maintained by BQL).

The new function get_xps_flows_index looks up flows in the
xps_flows table. The entry returned gives the last queue a matching flow
used. The returned queue is compared against the normal XPS queue. If
they are different, then we only switch if the tail pointer in the TX
queue has advanced past the pointer saved in the entry. In this
way OOO should be avoided when XPS wants to use a different queue.

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 net/Kconfig    |  6 ++++
 net/core/dev.c | 87 +++++++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 74 insertions(+), 19 deletions(-)

Comments

Eric Dumazet Sept. 29, 2016, 4:54 a.m. UTC | #1
On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> xps_flows maintains a per device flow table that is indexed by the
> skbuff hash. The table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
> 
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a
> flow table entry. This pointer is set to the head pointer in the
> transmit queue (which is maintained by BQL).
> 
> The new function get_xps_flows_index that looks up flows in the
> xps_flows table. The entry returned gives the last queue a matching flow
> used. The returned queue is compared against the normal XPS queue. If
> they are different, then we only switch if the tail pointer in the TX
> queue has advanced past the pointer saved in the entry. In this
> way OOO should be avoided when XPS wants to use a different queue.

There is something I do not understand with this.

If this OOO avoidance is tied to BQL, it means that packets sitting in a
qdisc won't be part of the detection.

So packets of flow X could possibly be queued on multiple qdiscs.
Tom Herbert Sept. 29, 2016, 12:53 p.m. UTC | #2
On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> xps_flows maintains a per device flow table that is indexed by the
>> skbuff hash. The table is only consulted when there is no queue saved in
>> a transmit socket for an skbuff.
>>
>> Each entry in the flow table contains a queue index and a queue
>> pointer. The queue pointer is set when a queue is chosen using a
>> flow table entry. This pointer is set to the head pointer in the
>> transmit queue (which is maintained by BQL).
>>
>> The new function get_xps_flows_index that looks up flows in the
>> xps_flows table. The entry returned gives the last queue a matching flow
>> used. The returned queue is compared against the normal XPS queue. If
>> they are different, then we only switch if the tail pointer in the TX
>> queue has advanced past the pointer saved in the entry. In this
>> way OOO should be avoided when XPS wants to use a different queue.
>
> There is something I do not understand with this.
>
> If this OOO avoidance is tied to BQL, it means that packets sitting in a
> qdisc wont be part of the detection.
>
> So packets of flow X could possibly be queued on multiple qdiscs.
>
Hi Eric,

I'm not sure I understand your concern. If packets for flow X can be
queued on multiple qdiscs, wouldn't that mean the flow would be subject
to ooo transmission regardless of this patch? In other words, here
we're trying to keep packets for the flow in order as they are emitted
from the qdiscs, which we assume maintain ordering of packets within a
flow.

Tom

>
>
Eric Dumazet Sept. 29, 2016, 1:18 p.m. UTC | #3
On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> >> xps_flows maintains a per device flow table that is indexed by the
> >> skbuff hash. The table is only consulted when there is no queue saved in
> >> a transmit socket for an skbuff.
> >>
> >> Each entry in the flow table contains a queue index and a queue
> >> pointer. The queue pointer is set when a queue is chosen using a
> >> flow table entry. This pointer is set to the head pointer in the
> >> transmit queue (which is maintained by BQL).
> >>
> >> The new function get_xps_flows_index that looks up flows in the
> >> xps_flows table. The entry returned gives the last queue a matching flow
> >> used. The returned queue is compared against the normal XPS queue. If
> >> they are different, then we only switch if the tail pointer in the TX
> >> queue has advanced past the pointer saved in the entry. In this
> >> way OOO should be avoided when XPS wants to use a different queue.
> >
> > There is something I do not understand with this.
> >
> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
> > qdisc wont be part of the detection.
> >
> > So packets of flow X could possibly be queued on multiple qdiscs.
> >
> Hi Eric,
> 
> I'm not sure I understand your concern. If packets for flow X can be
> queued on multiple qdiscs wouldn't that mean that flow will be subject
> to ooo transmission regardless of this patch? In other words here
> we're trying to keep packets for the flow in order as they are emitted
> from the qdiscs which we assume maintains ordering of packets in a
> flow.

Well, then what is this patch series solving?

You have a producer of packets running on 8 vcpus in a VM.

Packets are exiting the VM and need to be queued on a mq NIC in the
hypervisor.

Flow X can be scheduled on any of these 8 vcpus, so XPS currently
selects different TXQs.

Your patch aims to detect this and steer packets to one TXQ, but not
using a static hash() % num_real_queues (as RSS would have done on a NIC
at RX); rather it tries to please XPS (that is: selecting a queue close
to the CPU producing the packet. A VM with one vcpu, and cpu bound,
would really be happy to not spread packets all over the queues)

Your mechanism relies on counters updated at the time packets are given
to the NIC, not at the time packets are given to the txq (where they may
sit for a while in the qdisc, just before the BQL layer)

dev_queue_xmit() is not talking to the device directly...


All I am saying is that the producer/consumer counters you added are not
at the right place.

You tried hard to not change the drivers, by adding something to
existing BQL. But packets can sit in a qdisc for a while...

So you added 2 potential cache lines misses per outgoing packet, but
with no guarantee that OOO are really avoided.
Tom Herbert Sept. 29, 2016, 2:08 p.m. UTC | #4
On Thu, Sep 29, 2016 at 9:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-09-29 at 08:53 -0400, Tom Herbert wrote:
>> On Thu, Sep 29, 2016 at 12:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
>> >> xps_flows maintains a per device flow table that is indexed by the
>> >> skbuff hash. The table is only consulted when there is no queue saved in
>> >> a transmit socket for an skbuff.
>> >>
>> >> Each entry in the flow table contains a queue index and a queue
>> >> pointer. The queue pointer is set when a queue is chosen using a
>> >> flow table entry. This pointer is set to the head pointer in the
>> >> transmit queue (which is maintained by BQL).
>> >>
>> >> The new function get_xps_flows_index that looks up flows in the
>> >> xps_flows table. The entry returned gives the last queue a matching flow
>> >> used. The returned queue is compared against the normal XPS queue. If
>> >> they are different, then we only switch if the tail pointer in the TX
>> >> queue has advanced past the pointer saved in the entry. In this
>> >> way OOO should be avoided when XPS wants to use a different queue.
>> >
>> > There is something I do not understand with this.
>> >
>> > If this OOO avoidance is tied to BQL, it means that packets sitting in a
>> > qdisc wont be part of the detection.
>> >
>> > So packets of flow X could possibly be queued on multiple qdiscs.
>> >
>> Hi Eric,
>>
>> I'm not sure I understand your concern. If packets for flow X can be
>> queued on multiple qdiscs wouldn't that mean that flow will be subject
>> to ooo transmission regardless of this patch? In other words here
>> we're trying to keep packets for the flow in order as they are emitted
>> from the qdiscs which we assume maintains ordering of packets in a
>> flow.
>
> Well, then what this patch series is solving ?
>
It addresses the issue that Rick Jones pointed out was happening with
XPS. When packets are sent for a flow that has no socket and XPS is
enabled, each packet uses the XPS queue based on the running CPU.
Since the thread sending on a flow can be rescheduled on different
CPUs, this creates ooo packets. In this case the ooo is being
caused by interaction with XPS.

> You have a producer of packets running on 8 vcpus in a VM.
>
> Packets are exiting the VM and need to be queued on a mq NIC in the
> hypervisor.
>
> Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
> selecting different TXQ.
>
> Your patch aims to detect this and steer packets to one TXQ, but not
> using a static hash() % num_real_queues (as RSS would have done on a NIC
> at RX), but trying to please XPS (as : selecting a queue close to the
> CPU producing the packet. VM with one vcpu, and cpu bounded, would
> really be happy to not spread packets all over the queues)
>
You're not describing this patch, you're describing how XPS works in
the first place. This patch is done to make socketless packets work
well with XPS. If all you want is a static hash() % num_real_queues
then just disable XPS to get that.

> Your mechanism relies on counters updated at the time packets are given
> to the NIC, not at the time packets are given to the txq (and eventually
> sit for a while in the qdisc right before BQL layer)
>
> dev_queue_xmit() is not talking to the device directly...
>
>
> All I am saying is that the producer/consumer counters you added are not
> at the right place.
>
> You tried hard to not change the drivers, by adding something to
> existing BQL. But packets can sit in a qdisc for a while...
>
But again, this patch is to prevent ooo being caused by an interaction
with XPS. It does not address ooo being caused by qdiscs or VMs, but
then I don't see that as being the issue reported by Rick.

> So you added 2 potential cache lines misses per outgoing packet, but
> with no guarantee that OOO are really avoided.
>
>
>
Eric Dumazet Sept. 29, 2016, 2:51 p.m. UTC | #5
On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:

> It addresses  the issue that Rick Jones pointed out was happening with
> XPS. When packets are sent for a flow that has no socket and XPS is
> enabled then each packet uses the XPS queue based on the running CPU.
> Since the thread sending on a flow can be rescheduled on different
> CPUs this is creating ooo packets. In this case the ooo is being
> caused by interaction with XPS.
> 

Nope, your patch does not address the problem properly.

I am not sure I want to spend more time explaining the issue.

Let's talk about this in Tokyo next week.
Eric Dumazet Sept. 29, 2016, 3:15 p.m. UTC | #6
On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote:
> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:
> 
> > It addresses  the issue that Rick Jones pointed out was happening with
> > XPS. When packets are sent for a flow that has no socket and XPS is
> > enabled then each packet uses the XPS queue based on the running CPU.
> > Since the thread sending on a flow can be rescheduled on different
> > CPUs this is creating ooo packets. In this case the ooo is being
> > caused by interaction with XPS.
> > 
> 
> Nope, your patch does not address the problem properly.
> 
> I am not sure I want to spend more time explaining the issue.
> 
> Lets talk about this in Tokyo next week.
> 

Just as a reminder, and sorry to bother you by stating some facts
obvious to both of us: these are public exchanges, so we also need to
re-explain how things work.

Queue selection on xmit happens before we hit the qdisc and its delays.

So when you access txq->dql.num_completed_ops and
txq->dql.num_enqueue_ops you can observe values that do not change for a
while.

Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow
(skb_get_hash() returns the same value for these 2 packets).

P1 is sent on behalf of CPU 1; we pick queue txq1 and queue the
packet on its qdisc. Transmit does not happen because of some
constraint like rate limiting or scheduling.

P2 is sent on behalf of CPU 2; we pick queue txq2 and notice that the
prior packet chose txq1. We check txq1->dql and decide it is fine to use
txq2, since the dql params of txq1 have not changed yet.

( txq->dql.num_completed_ops == ent.queue_ptr )

Note that in RFS case, we have the guarantee that we observe 'live
queues' since they are the per cpu backlog.

So input_queue_head_incr() and input_queue_tail_incr_save() are
correctly doing the OOO prevention, because a queued packet immediately
changes the state.

So really your patch works only if you have no qdisc, or a non-congested
qdisc. (Think of P1 being dropped by a full pfifo or pfifo_fast: we really
want to avoid steering P2, P3, ..., PN onto this full pfifo while maybe
other txqs are idle). Strange attractors are back (check commit
9b462d02d6dd6).

You could avoid (ab)using BQL with a different method: grabbing
skb->destructor for the packets that are socketless.

The hash table would simply track the sum of skb->truesize to allow flow
migration. This would be self-contained and not intrusive.
Rick Jones Sept. 29, 2016, 4:35 p.m. UTC | #7
On 09/29/2016 06:18 AM, Eric Dumazet wrote:
> Well, then what this patch series is solving ?
>
> You have a producer of packets running on 8 vcpus in a VM.
>
> Packets are exiting the VM and need to be queued on a mq NIC in the
> hypervisor.
>
> Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
> selecting different TXQ.

Just for completeness, in my testing, the VMs were single-vCPU.

rick jones
Tom Herbert Sept. 29, 2016, 8:26 p.m. UTC | #8
On Thu, Sep 29, 2016 at 11:15 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-09-29 at 07:51 -0700, Eric Dumazet wrote:
>> On Thu, 2016-09-29 at 10:08 -0400, Tom Herbert wrote:
>>
>> > It addresses  the issue that Rick Jones pointed out was happening with
>> > XPS. When packets are sent for a flow that has no socket and XPS is
>> > enabled then each packet uses the XPS queue based on the running CPU.
>> > Since the thread sending on a flow can be rescheduled on different
>> > CPUs this is creating ooo packets. In this case the ooo is being
>> > caused by interaction with XPS.
>> >
>>
>> Nope, your patch does not address the problem properly.
>>
>> I am not sure I want to spend more time explaining the issue.
>>
>> Lets talk about this in Tokyo next week.
>>
>
> Just as a reminder, sorry to bother you, stating some obvious facts for
> both of us. We have public exchanges, so we also need to re-explain how
> things work.
>
> Queue selection on xmit happens before we hit the qdisc and its delays.
>
> So when you access txq->dql.num_completed_ops and
> txq->dql.num_enqueue_ops you can observe values that do not change for a
> while.
>
> Say a thread runs on a VM, and sends 2 packets P1, P2 on the same flow
> (skb_get_hash() returns the same value for these 2 packets)
>
> P1 is sent on behalf of CPU 1, we pickup queue txq1, and queue the
> packet on its qdisc . Transmit does not happen because of some
> constraints like rate limiting or scheduling constraints.
>
> P2 is sent on behalf of CPU 2, we pickup queue txq2, notice that prior
> packet chose txq1. We check txq1->dql and decide it is fine to use txq2,
> since the dql params of txq1 were not changed yet.
>
> ( txq->dql.num_completed_ops == ent.queue_ptr )
>
> Note that in RFS case, we have the guarantee that we observe 'live
> queues' since they are the per cpu backlog.
>
> So input_queue_head_incr() and input_queue_tail_incr_save() are
> correctly doing the OOO prevention, because a queued packet immediately
> changes the state.
>
> So really your patch works if you have no qdisc, or a non congested
> qdisc. (Think if P1 is dropped by a full pfifo or pfifo_fast : We really
> want to avoid steering P2, P3, ..., PN on this full pfifo while maybe
> other txq are idle). Strange attractors are back (check commit
> 9b462d02d6dd6 )
>
Understood.

> You could avoid (ab)using BQL with a different method, grabbing
> skb->destructor for the packets that are socketless
>
> The hash table would simply track the sum of skb->truesize to allow flow
> migration. This would be self contained and not intrusive.
>
Okay, I will look at that.

>
>
>

Patch

diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..f77fad1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -255,6 +255,12 @@  config XPS
 	depends on SMP
 	default y
 
+config XPS_FLOWS
+	bool
+	depends on XPS
+	depends on BQL
+	default y
+
 config HWBM
        bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f..1ca08b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3210,6 +3210,7 @@  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
+/* Must be called with RCU read_lock */
 static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
@@ -3217,7 +3218,6 @@  static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
-	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_maps);
 	if (dev_maps) {
 		map = rcu_dereference(
@@ -3228,15 +3228,62 @@  static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			else
 				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
 									   map->len)];
-			if (unlikely(queue_index >= dev->real_num_tx_queues))
-				queue_index = -1;
+			if (queue_index >= 0 &&
+			    likely(queue_index < dev->real_num_tx_queues))
+				return queue_index;
 		}
 	}
-	rcu_read_unlock();
+#endif
+	return skb_tx_hash(dev, skb);
+}
+
+/* Must be called with RCU read_lock */
+static int get_xps_flows_index(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS_FLOWS
+	struct xps_dev_flow_table *flow_table;
+	struct xps_dev_flow ent;
+	int queue_index;
+	struct netdev_queue *txq;
+	u32 hash;
+
+	queue_index = get_xps_queue(dev, skb);
+	if (queue_index < 0)
+		return -1;
+
+	flow_table = rcu_dereference(dev->xps_flow_table);
+	if (!flow_table)
+		return queue_index;
+
+	hash = skb_get_hash(skb);
+	if (!hash)
+		return queue_index;
+
+	ent.v64 = flow_table->flows[hash & flow_table->mask].v64;
+
+	if (queue_index != ent.queue_index &&
+	    ent.queue_index >= 0 &&
+	    ent.queue_index < dev->real_num_tx_queues) {
+		txq = netdev_get_tx_queue(dev, ent.queue_index);
+		if ((int)(txq->dql.num_completed_ops - ent.queue_ptr) < 0)  {
+			/* The current queue's tail has not advanced beyond the
+			 * last packet that was enqueued using the table entry.
+			 * We can't change queues without risking OOO. Stick
+			 * with the queue listed in the flow table.
+			 */
+			queue_index = ent.queue_index;
+		}
+	}
+
+	/* Save the updated entry */
+	txq = netdev_get_tx_queue(dev, queue_index);
+	ent.queue_index = queue_index;
+	ent.queue_ptr = txq->dql.num_enqueue_ops;
+	flow_table->flows[hash & flow_table->mask].v64 = ent.v64;
 
 	return queue_index;
 #else
-	return -1;
+	return get_xps_queue(dev, skb);
 #endif
 }
 
@@ -3244,22 +3291,24 @@  static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 	int queue_index = sk_tx_queue_get(sk);
-
-	if (queue_index < 0 || skb->ooo_okay ||
-	    queue_index >= dev->real_num_tx_queues) {
-		int new_index = get_xps_queue(dev, skb);
-		if (new_index < 0)
-			new_index = skb_tx_hash(dev, skb);
-
-		if (queue_index != new_index && sk &&
-		    sk_fullsock(sk) &&
-		    rcu_access_pointer(sk->sk_dst_cache))
-			sk_tx_queue_set(sk, new_index);
-
-		queue_index = new_index;
+	int new_index;
+
+	if (queue_index < 0) {
+		/* Socket did not provide a queue index, try xps_flows */
+		new_index = get_xps_flows_index(dev, skb);
+	} else if (skb->ooo_okay || queue_index >= dev->real_num_tx_queues) {
+		/* Queue index in socket, see if we can find a better one */
+		new_index = get_xps_queue(dev, skb);
+	} else {
+		/* Valid queue in socket and can't send OOO. Just return it */
+		return queue_index;
 	}
 
-	return queue_index;
+	if (queue_index != new_index && sk && sk_fullsock(sk) &&
+	    rcu_access_pointer(sk->sk_dst_cache))
+		sk_tx_queue_set(sk, new_index);
+
+	return new_index;
 }
 
 struct netdev_queue *netdev_pick_tx(struct net_device *dev,