Patchwork [1/4] : net: Allow RX queue selection to seed TX queue hashing.

Submitter David Miller
Date Jan. 28, 2009, 12:40 a.m.
Message ID <20090127.164027.245779962.davem@davemloft.net>
Permalink /patch/20540/
State Accepted
Delegated to: David Miller

Comments

David Miller - Jan. 28, 2009, 12:40 a.m.
The idea is that drivers which implement multiqueue RX
pre-seed the SKB by recording the RX queue selected by
the hardware.

If such a seed is found on TX, we'll use that to select
the outgoing TX queue.

This helps get more consistent load balancing on router
and firewall loads.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/skbuff.h |   15 +++++++++++++++
 net/core/dev.c         |    8 ++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)
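The helpers added by the patch (shown in full at the bottom of the page) store the RX queue in skb->queue_mapping offset by one, so that zero keeps its meaning of "nothing recorded". A minimal userspace paraphrase of that encoding (the struct here is an illustrative stub, not the real sk_buff):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the real struct sk_buff. */
struct skb_stub {
	uint16_t queue_mapping;	/* 0 == no RX queue recorded */
};

/* Store rx_queue + 1, reserving 0 for "not recorded". */
static void skb_record_rx_queue(struct skb_stub *skb, uint16_t rx_queue)
{
	skb->queue_mapping = rx_queue + 1;
}

static uint16_t skb_get_rx_queue(const struct skb_stub *skb)
{
	return skb->queue_mapping - 1;
}

static bool skb_rx_queue_recorded(const struct skb_stub *skb)
{
	return skb->queue_mapping != 0;
}
```

A multiqueue RX driver would call skb_record_rx_queue() with the hardware ring index at receive time; simple_tx_hash() then picks it up on the transmit side, as in the net/core/dev.c hunk below.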
Herbert Xu - Jan. 28, 2009, 8:53 a.m.
David Miller <davem@davemloft.net> wrote:
>
> +       if (skb_rx_queue_recorded(skb)) {
> +               u32 val = skb_get_rx_queue(skb);
> +
> +               hash = jhash_1word(val, simple_tx_hashrnd);

I'm not so sure about this added randomness.  On the one hand
I can see the benefit in being defensive about hashing, but this
does pose a problem for admins who're trying to optimize the system
by tying RX interrupts together with TX interrupts for the most
common traffic path.

For example, if you're forwarding traffic between multiqueue NICs
A and B, one would like to make it so that each queue on A goes
to a fixed queue on B where the CPU of the RX queue IRQ handler on
A is the same as the CPU of the TX queue IRQ handler on B.

This can still be done with the randomness, but it is much more
difficult.  Also if the randomness changes, we'd have to rejig
the IRQ assignment.

So can you think of a scenario where we really need this added
protection?

Cheers,
Eric Dumazet - Jan. 28, 2009, 9:29 a.m.
Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
>> +       if (skb_rx_queue_recorded(skb)) {
>> +               u32 val = skb_get_rx_queue(skb);
>> +
>> +               hash = jhash_1word(val, simple_tx_hashrnd);
> 
> I'm not so sure about this added randomness.  On the one hand
> I can see the benefit in being defensive about hashing, but this
> does pose a problem for admins who're trying to optimize the system
> by tying RX interrupts together with TX interrupts for the most
> common traffic path.
> 
> For example, if you're forwarding traffic between multiqueue NICs
> A and B, one would like to make it so that each queue on A goes
> to a fixed queue on B where the CPU of the RX queue IRQ handler on
> A is the same as the CPU of the TX queue IRQ handler on B.
> 
> This can still be done with the randomness, but it is much more
> difficult.  Also if the randomness changes, we'd have to rejig
> the IRQ assignment.
> 
> So can you think of a scenario where we really need this added
> protection?
> 
jhash_1word(val, some_32bits_value) has a shuffle property we want to keep.

Maybe we can add a boot parameter so that admins who are trying to
optimize their system can fix a given value (not known to outsiders)
for simple_tx_hashrnd. That way, they know subsequent reboots will keep
the same hash function.
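The shuffle property can be seen in a userspace sketch using the 2.6.x-era jhash_1word() (copied here for illustration): the seed permutes which TX queue each recorded RX queue value lands on, so an admin-fixed seed would keep the mapping stable across reboots. The tx_queue_for() helper is hypothetical, not kernel code.

```c
#include <stdint.h>

/* Userspace copy of the 2.6.x-era kernel jhash_1word(), for
 * illustration only. */
#define JHASH_GOLDEN_RATIO 0x9e3779b9u

#define __jhash_mix(a, b, c) do {	\
	a -= b; a -= c; a ^= (c >> 13);	\
	b -= c; b -= a; b ^= (a << 8);	\
	c -= a; c -= b; c ^= (b >> 13);	\
	a -= b; a -= c; a ^= (c >> 12);	\
	b -= c; b -= a; b ^= (a << 16);	\
	c -= a; c -= b; c ^= (b >> 12);	\
	a -= b; a -= c; a ^= (c >> 3);	\
	b -= c; b -= a; b ^= (a << 10);	\
	c -= a; c -= b; c ^= (b >> 15);	\
} while (0)

static uint32_t jhash_1word(uint32_t a, uint32_t initval)
{
	uint32_t b = JHASH_GOLDEN_RATIO, c = initval;

	a += JHASH_GOLDEN_RATIO;
	__jhash_mix(a, b, c);
	return c;
}

/* Hypothetical helper: which TX queue does a recorded RX queue
 * map to under a given seed?  Different seeds give different
 * (but per-boot stable) shuffles. */
static uint16_t tx_queue_for(uint32_t rx_queue, uint32_t seed, uint16_t ntxq)
{
	return (uint16_t)(((uint64_t)jhash_1word(rx_queue, seed) * ntxq) >> 32);
}
```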



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Brandeburg - Jan. 28, 2009, 6:50 p.m.
David Miller wrote:
> The idea is that drivers which implement multiqueue RX
> pre-seed the SKB by recording the RX queue selected by
> the hardware.
> 
> If such a seed is found on TX, we'll use that to select
> the outgoing TX queue.
> 
> This helps get more consistent load balancing on router
> and firewall loads.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  include/linux/skbuff.h |   15 +++++++++++++++
>  net/core/dev.c         |    8 ++++++++
>  2 files changed, 23 insertions(+), 0 deletions(-)

Hi Dave, thanks for doing this work, it looks very interesting. I'm curious
what you think should happen with the TX/RX interrupts when every queue has
a unique interrupt number.

I don't think the irqbalance daemon is smart enough to know that a given
tx queue should always end up either cache coherent (different core) or
core-coherent (same core) to the rx queue.

So, is it still expected that users running routing setups will always
have to hand-tune /proc/irq/NNN/smp_affinity for each of the tx and rx
queues?

Ideally the routing case would somehow end up working like the socket case,
where the scheduler would notice the cross-CPU wakeups and move the interrupt
for the routing case in the same way (I assume, from your previous comments)
that it would move the interrupt for a socket-based application.

I hope to be able to test this today or tomorrow.
David Miller - Jan. 28, 2009, 8:22 p.m.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 28 Jan 2009 19:53:58 +1100

> So can you think of a scenario where we really need this added
> protection?

No matter what you think about the randomness aspect, our
divide avoidance technique being used here will still
get in the way of the situations you seem to be concerned
about.

We do the jhash et al. magic in order to be able to use a
multiply to get the modulus.  So the function computing from
RX to TX queue numbers will never be straightforward.

And since we do need to do the jhash to make that multiply
trick work, the randomness comes essentially for free and
does not make the RX to TX queue relationship any more or
less straightforward.
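The "multiply to get the modulus" trick being referred to: scale the 32-bit hash by the queue count and take the high 32 bits, avoiding an integer divide. A self-contained sketch (the function name is illustrative):

```c
#include <stdint.h>

/* Divide-avoidance queue selection: if `hash` is uniform over
 * [0, 2^32), then (hash * n) >> 32 is uniform over [0, n) and
 * needs no division instruction.  This is the final step of
 * simple_tx_hash() in net/core/dev.c. */
static uint16_t pick_tx_queue(uint32_t hash, uint32_t num_tx_queues)
{
	return (uint16_t)(((uint64_t)hash * num_tx_queues) >> 32);
}
```

Because selection always goes through the hash, the RX-queue-number to TX-queue-index mapping is never a simple identity, with or without the random seed.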
Herbert Xu - Jan. 28, 2009, 9:31 p.m.
On Wed, Jan 28, 2009 at 12:22:31PM -0800, David Miller wrote:
>
> No matter what you think about the randomness aspect, our
> divide avoidance technique being used here will still
> get in the way of the situations you seem to be concerned
> about.
> 
> We do the jhash et al. magic in order to be able to use a
> multiply to get the modulus.  So the function computing from
> RX to TX queue numbers will never be straightforward.

Right, if the number of queues weren't equal then clearly it's
not an issue to tack on more randomness.  But I was thinking of
scenarios where you have two identical NICs, performing routing
where the number of queues is likely to be identical.

Cheers,
David Miller - Jan. 29, 2009, 12:40 a.m.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 29 Jan 2009 08:31:41 +1100

> On Wed, Jan 28, 2009 at 12:22:31PM -0800, David Miller wrote:
> >
> > No matter what you think about the randomness aspect, our
> > divide avoidance technique being used here will still
> > get in the way of the situations you seem to be concerned
> > about.
> > 
> > We do the jhash et al. magic in order to be able to use a
> > multiply to get the modulus.  So the function computing from
> > RX to TX queue numbers will never be straightforward.
> 
> Right, if the number of queues weren't equal then clearly it's
> not an issue to tack on more randomness.  But I was thinking of
> scenarios where you have two identical NICs, performing routing
> where the number of queues is likely to be identical.

It isn't even the same for "identical" NICs.  Robert Olsson knows this
quite well :-)

For example, with the NIU chips the number of TX queues is larger
than the number of RX queues.

Robert has been proposing some ideas wherein some mapping is
implemented such that we iterate over a group of TX queues
per RX queue.

Otherwise in a routing workload some of the TX queues become
unused.

That's a fun hack for testing when you know the characteristics of
all of the devices in your system, but doesn't generalize well.

It's a tricky thing because, of course, we don't know where we're
sending to.  It shows that the "RX queue number" is not a sufficient
seed.
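A hedged sketch of the kind of RX-to-TX group mapping described above (names and interface are illustrative, not from Robert's actual patches; it assumes num_tx is a whole multiple of num_rx): each RX queue owns a contiguous block of TX queues, and a per-flow hash, rather than a round-robin counter, picks within the block, so all TX queues carry traffic while any one flow always maps to the same TX queue.

```c
#include <stdint.h>

/* Each RX queue owns num_tx / num_rx consecutive TX queues; a flow
 * hash selects within the block using the same multiply trick as
 * simple_tx_hash(), keeping one flow on one TX queue (no reordering). */
static uint16_t rx_group_tx_queue(uint16_t rx_queue, uint32_t flow_hash,
				  uint16_t num_rx, uint16_t num_tx)
{
	uint16_t group_size = num_tx / num_rx;	/* TX queues per RX queue */
	uint16_t base = rx_queue * group_size;	/* start of this RX queue's block */

	return base + (uint16_t)(((uint64_t)flow_hash * group_size) >> 32);
}
```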

One way to deal with this is to grab the hash the chip computed.
I'm reluctant to advocate this because it's expensive with NIU
because I have to start using the larger RX descriptor layout
to get at that cookie.  (see "rx_pkt_hdr0" vs. "rx_pkt_hdr1" in
drivers/net/niu.h)

And I suspect other cards will require similar tradeoffs.
Herbert Xu - Jan. 29, 2009, 1:34 a.m.
On Wed, Jan 28, 2009 at 04:40:58PM -0800, David Miller wrote:
> 
> It isn't even the same for "identical" NICs.  Robert Olsson knows this
> quite well :-)
> 
> For example, with the NIU chips the number of TX queues is larger
> than the number of RX queues.

I see that as a temporary issue while the NIC manufacturers are
catching up with the CPU development.  From the CPU's point of
view, the ideal situation is where you reduce the amount of cross-
cache communication to a minimum.  So having more TX queues than
you have cores (or rather caches) isn't all that useful since a
single core can only do so much (and a single piece of wire can
only carry so much data, no matter how many queues you place in
front of it).

Of course having fewer queues than cores may be the order of the
day for a while and we should certainly deal with that as well
as we can, but we should at least ensure the optimal case where
you have exactly one queue per core/cache is not penalised.

Having said that, it would seem that the randomness isn't too
big a deal since we could always create an interface where the
admin sets it to a fixed value so that we get predictable IRQ
affinity.  So I'm happy :)

> Robert has been proposing some ideas wherein some mapping is
> implemented such that we iterate over a group of TX queues
> per RX queue.
> 
> Otherwise in a routing workload some of the TX queues become
> unused.

If the number of RX queues is less than the number of caches or
cores, then I think this is a great idea and we could use the
stuff that you played with at the Kernel Summit to in effect
boost the number of queues so that all cores are utilised while
incurring the minimum amount of interprocessor communication.

> It's a tricky thing because we don't know where we're sending to of
> course.  It shows that the "RX queue number" is not a sufficient
> seed.

I don't think it really matters where we're sending to.  Ideally
every outbound interface should have a sufficient number of TX
queues so that each core can just deposit its output into a queue
that does not incur cross-CPU overheads.

So from the RX queue's point of view, we want to have the packet
stay on the core it arrived at for as long as possible, with the
best case being the entire duration from the RX queue to the TX
queue.

Now of course if the number of RX queues is too small such that
we can't utilise all the cores, then something has to give.  My
suggestion (as above) would be to multiplex it as early as possible,
i.e., simulate the hardware multiplexer in software if the number
of hardware RX queues is too small.

In fact, while doing the GRO stuff it occurred to me that this is
the perfect spot to do the multiplexing since it's the first
moment (or at least close to it) when we touch the data.

> One way to deal with this is to grab the hash the chip computed.
> I'm reluctant to advocate this because it's expensive with NIU
> because I have to start using the larger RX descriptor layout
> to get at that cookie.  (see "rx_pkt_hdr0" vs. "rx_pkt_hdr1" in
> drivers/net/niu.h)

That's a separate discussion.  The hash would be useful for the
software multiplexer discussed above, and further processing in
our stack.  But it isn't all that important with regards to keeping
the traffic on the core where it arrived.

So if we can get it then great, but if it's more expensive than
computing one in software then forget it :)

Cheers,
David Miller - Jan. 29, 2009, 1:45 a.m.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 29 Jan 2009 12:34:03 +1100

> So having more TX queues than you have cores (or rather caches)
> isn't all that useful since a single core can only do so much (and a
> single piece of wire can only carry so much data, no matter how many
> queues you place in front of it).

This is already proven to be false :-)

Having more TX queues helps a lot.  Robert's testing confirms
this even when the number of TX queues exceeds the number of
cores.

When Robert directly maps RX queues to a single TX queue on NIU, we
get drops at the qdisc level during his routing tests.  With the TX
queue iterator which causes all of the TX queues to be utilized there
are no drops.

It would not be one bit surprising to me to see more hardware coming
out with more TX queues than RX queues, for similar reasons.  It's a
real situation and we have more proof that it's likely to continue
than not.

> > One way to deal with this is to grab the hash the chip computed.
> > I'm reluctant to advocate this because it's expensive with NIU
> > because I have to start using the larger RX descriptor layout
> > to get at that cookie.  (see "rx_pkt_hdr0" vs. "rx_pkt_hdr1" in
> > drivers/net/niu.h)
> 
> That's a separate discussion.  The hash would be useful for the
> software multiplexer discussed above, and further processing in
> our stack.  But it isn't all that important with regards to keeping
> the traffic on the core where it arrived.

I see your point.
Herbert Xu - Jan. 29, 2009, 4:41 a.m.
On Wed, Jan 28, 2009 at 05:45:22PM -0800, David Miller wrote:
> 
> This is already proven to be false :-)
> 
> Having more TX queues helps a lot.  Robert's testing confirms
> this even when the number of TX queues exceeds the number of
> cores.
> 
> When Robert directly maps RX queues to a single TX queue on NIU, we
> get drops at the qdisc level during his routing tests.  With the TX
> queue iterator which causes all of the TX queues to be utilized there
> are no drops.

This could just mean that the hardware has underprovisioned each
individual TX queue.  Or is there some fundamental efficiency in
using multiple queues that I have missed?

But yeah if a single TX queue isn't big enough to take care of
the output from a single core then you will need multiple TX
queues.  Although from a system-wide point of view it would
seem to be more optimal to make that TX queue bigger rather than
forcing the software to use multiple queues (which may come with
a bigger overhead) to make up for it.

Of course for us software developers we don't really have a choice :)

Cheers,
David Miller - Jan. 29, 2009, 5:01 a.m.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 29 Jan 2009 15:41:48 +1100

> But yeah if a single TX queue isn't big enough to take care of
> the output from a single core then you will need multiple TX
> queues.  Although from a system-wide point of view it would
> seem to be more optimal to make that TX queue bigger rather than
> forcing the software to use multiple queues (which may come with
> a bigger overhead) to make up for it.
> 
> Of course for us software developers we don't really have a choice :)

Yes, this is one of the things I told Robert, make the
TX queues larger :-)

Besides, a plain iterating RX --> TX function can't work because
that doesn't take flows into consideration and thus can introduce
reordering.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cf2cb50..a2c2378 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1904,6 +1904,21 @@  static inline void skb_copy_queue_mapping(struct sk_buff *to, const struct sk_bu
 	to->queue_mapping = from->queue_mapping;
 }
 
+static inline void skb_record_rx_queue(struct sk_buff *skb, u16 rx_queue)
+{
+	skb->queue_mapping = rx_queue + 1;
+}
+
+static inline u16 skb_get_rx_queue(struct sk_buff *skb)
+{
+	return skb->queue_mapping - 1;
+}
+
+static inline bool skb_rx_queue_recorded(struct sk_buff *skb)
+{
+	return (skb->queue_mapping != 0);
+}
+
 #ifdef CONFIG_XFRM
 static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 5379b0c..b21ad0b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1722,6 +1722,13 @@  static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb)
 		simple_tx_hashrnd_initialized = 1;
 	}
 
+	if (skb_rx_queue_recorded(skb)) {
+		u32 val = skb_get_rx_queue(skb);
+
+		hash = jhash_1word(val, simple_tx_hashrnd);
+		goto out;
+	}
+
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
 		if (!(ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)))
@@ -1759,6 +1766,7 @@  static u16 simple_tx_hash(struct net_device *dev, struct sk_buff *skb)
 
 	hash = jhash_3words(addr1, addr2, ports, simple_tx_hashrnd);
 
+out:
 	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
 }