diff mbox

[net-next,V2] gro: introduce gro_mac_header_len

Message ID 1328546834.2220.79.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Feb. 6, 2012, 4:47 p.m. UTC
Le lundi 06 février 2012 à 11:31 -0500, David Miller a écrit :

> We really need an explanation, probably both in the commit message and
> the comments next to this new struct member, explaining why in the world
> we can't use ->hard_header_len for this.

OK, I added some information from Or Gerlitz in V2

Thanks !

[PATCH net-next V2] gro: introduce gro_mac_header_len

Shlomo Pongratz reported GRO L2 header check was suited for Ethernet
only, and failed on IB/ipoib traffic.

He provided a patch faking a zeroed header to let GRO aggregates frames.

Roland Dreier, Herbert Xu, and others suggested we change GRO L2 header
check to be more generic.

This patch introduces a new netdevice field, gro_mac_header_len, giving
L2 header length, default to ETH_HLEN (14 bytes)

A device setup function can override this default value.

gro_max_header_len can be different than hard_header_len because as Or
Gerlitz said :

	IPoIB advertizes hard_header_len which is bigger than the
	IPoIB header len, this is done such that skbs sent by the
	network stack have enough headroom for a "pseudoheader"
	which for few flows (e.g unicast arp replies and multicast)
	is placed there by the ipoib hard_header function and later
	used by the xmit function.

__napi_gro_receive() has special handling for the common case (Ethernet)
to avoid a memcmp() call and use an inline optimized function instead.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Shlomo Pongratz <shlomop@mellanox.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
V2: added a comment saying why we dont use hard_header_len but a new
field.

 include/linux/netdevice.h |    1 +
 net/core/dev.c            |   11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Feb. 6, 2012, 4:58 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 06 Feb 2012 17:47:14 +0100

[ Roland Dreier CC:'d ]

> gro_max_header_len can be different than hard_header_len because as Or
> Gerlitz said :
> 
> 	IPoIB advertizes hard_header_len which is bigger than the
> 	IPoIB header len, this is done such that skbs sent by the
> 	network stack have enough headroom for a "pseudoheader"
> 	which for few flows (e.g unicast arp replies and multicast)
> 	is placed there by the ipoib hard_header function and later
> 	used by the xmit function.

Translation: IPoIB's path resolution mechanism is garbage

So if IPoIB path resolution was properly integrated into the neighbour
cache state machine, instead of being implemented awkwardly in the
device transmit path, this crap wouldn't be necessary right?

So here we have yet another incredibly painful side effect of how
IPoIB path resolution works.

Roland, I want you to seriously consider a way, any way, to get rid of
how IPoIB does path resolution.  It must be fully integrated into the
neighbour layer, the neighbour layer must be knowledgable about how
path resolution is a necessary step for a neighbour entry to enter the
valid state, and I want all of this awkward neighbour handling code
removed from the transmit path of IPoIB.

And finally it must not lie about it's hardware header length.

Then we won't need crap like what is being proposed here, a
"no_this_is_the_real_hard_header_len" struct member.  That's just
rediculious.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 6, 2012, 5:07 p.m. UTC | #2
Le lundi 06 février 2012 à 11:58 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 06 Feb 2012 17:47:14 +0100
> 
> [ Roland Dreier CC:'d ]
> 
> > gro_max_header_len can be different than hard_header_len because as Or
> > Gerlitz said :
> > 
> > 	IPoIB advertizes hard_header_len which is bigger than the
> > 	IPoIB header len, this is done such that skbs sent by the
> > 	network stack have enough headroom for a "pseudoheader"
> > 	which for few flows (e.g unicast arp replies and multicast)
> > 	is placed there by the ipoib hard_header function and later
> > 	used by the xmit function.
> 
> Translation: IPoIB's path resolution mechanism is garbage
> 
> So if IPoIB path resolution was properly integrated into the neighbour
> cache state machine, instead of being implemented awkwardly in the
> device transmit path, this crap wouldn't be necessary right?
> 
> So here we have yet another incredibly painful side effect of how
> IPoIB path resolution works.
> 
> Roland, I want you to seriously consider a way, any way, to get rid of
> how IPoIB does path resolution.  It must be fully integrated into the
> neighbour layer, the neighbour layer must be knowledgable about how
> path resolution is a necessary step for a neighbour entry to enter the
> valid state, and I want all of this awkward neighbour handling code
> removed from the transmit path of IPoIB.
> 
> And finally it must not lie about it's hardware header length.
> 
> Then we won't need crap like what is being proposed here, a
> "no_this_is_the_real_hard_header_len" struct member.  That's just
> rediculious.

OK, I'll resend my first patch then, using hard_header_len



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Feb. 6, 2012, 5:09 p.m. UTC | #3
On 2/6/2012 6:58 PM, David Miller wrote:
> So if IPoIB path resolution was properly integrated into the neighbour 
> cache state machine, instead of being implemented awkwardly in the 
> device transmit path, this crap wouldn't be necessary right?


Dave, say we have integrated the path resolution into ND cache, how 
would you suggest ipoib to act for skbs for which xmit is called without 
a neighbour? specifically arp replies and multicast. I can think of at 
least another one other location where the HW address can be stored 
between hard_header and xmit - on the sbk->cb storage, which is large 
enough.

Or.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz Feb. 6, 2012, 5:11 p.m. UTC | #4
On 2/6/2012 7:07 PM, Eric Dumazet wrote:
> OK, I'll resend my first patch then, using hard_header_len

so, we will be back to square one... as the hard_header_len which 
advertized now by IPoIB will fail
the GRO L2 check... lets see where this discussion evolves.

Or.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 6, 2012, 5:19 p.m. UTC | #5
Le lundi 06 février 2012 à 19:11 +0200, Or Gerlitz a écrit :
> On 2/6/2012 7:07 PM, Eric Dumazet wrote:
> > OK, I'll resend my first patch then, using hard_header_len
> 
> so, we will be back to square one... as the hard_header_len which 
> advertized now by IPoIB will fail
> the GRO L2 check... lets see where this discussion evolves.
> 
> Or.
> 

You'll have to define hard_header_len to IPOIB_ENCAP_LEN ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier Feb. 6, 2012, 5:23 p.m. UTC | #6
On Mon, Feb 6, 2012 at 8:58 AM, David Miller <davem@davemloft.net> wrote:
>
> So if IPoIB path resolution was properly integrated into the neighbour
> cache state machine, instead of being implemented awkwardly in the
> device transmit path, this crap wouldn't be necessary right?
>
> So here we have yet another incredibly painful side effect of how
> IPoIB path resolution works.
>
> Roland, I want you to seriously consider a way, any way, to get rid of
> how IPoIB does path resolution.  It must be fully integrated into the
> neighbour layer, the neighbour layer must be knowledgable about how
> path resolution is a necessary step for a neighbour entry to enter the
> valid state, and I want all of this awkward neighbour handling code
> removed from the transmit path of IPoIB.
>
> And finally it must not lie about it's hardware header length.

I would love to clean this up.  But I don't know how to do it.

IMHO the problem is in the IPoIB RFC: (4391) which makes
a distinction between an "encapsulation header" and the "link
layer address".  The LL address is what we put into ARP and
ND packets, and so I think we are forced into exposing that
to the network stack as our hardware address.

However this LL address is not actually what we need to send
a packet -- we need to take the GID of our destination and send
a query to the subnet manager to resolve it to a path.  And this
query really is an RPC to a remote entity somewhere else on
the network, so we have to do it asynchronously etc.

Now the part I don't know how to handle is when the network
stack gives us a LL address to send an ARP or something to
but the skb has no dst attached.  Suppose I don't have a path
for that LL address when I get that skb with no dst into my
.hard_header method.  Where do I stick the LL addr to keep
around until the packet shows up in the xmit function?

 - R.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 6, 2012, 7:12 p.m. UTC | #7
From: Roland Dreier <roland@kernel.org>
Date: Mon, 6 Feb 2012 09:23:58 -0800

> IMHO the problem is in the IPoIB RFC: (4391) which makes
> a distinction between an "encapsulation header" and the "link
> layer address".  The LL address is what we put into ARP and
> ND packets, and so I think we are forced into exposing that
> to the network stack as our hardware address.

An address is not a hardware MAC header, we're only talking
about the length of the latter.

If the addressing is such that you need to put the GID
into the ARP/NDISC packets, and that's different from what
ends up in the final encapsulation header, I really don't
see what the problem is specificially with respect to the
MAC header size.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier Feb. 6, 2012, 7:23 p.m. UTC | #8
On Mon, Feb 6, 2012 at 11:12 AM, David Miller <davem@davemloft.net> wrote:
>> IMHO the problem is in the IPoIB RFC: (4391) which makes
>> a distinction between an "encapsulation header" and the "link
>> layer address".  The LL address is what we put into ARP and
>> ND packets, and so I think we are forced into exposing that
>> to the network stack as our hardware address.
>
> An address is not a hardware MAC header, we're only talking
> about the length of the latter.
>
> If the addressing is such that you need to put the GID
> into the ARP/NDISC packets, and that's different from what
> ends up in the final encapsulation header, I really don't
> see what the problem is specificially with respect to the
> MAC header size.

Dave, please look at my whole email.  The problem is that
the LL addr is not what we need to send packets but it's all
we get in our .hard_header method.  So for eg unicast ARP
packets without a dst we have nowhere to stash what we
actually will need in our xmit method.

(The same problem for unicast ARPs applies to multicast
sends too).

Does the netdev driver own skb->cb between hard_header
and start_xmit?  If so we could use that instead of stealing
some header space, and that would at least let us not lie
about hard_header_len.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 6, 2012, 7:32 p.m. UTC | #9
From: Roland Dreier <roland@kernel.org>
Date: Mon, 6 Feb 2012 11:23:21 -0800

> Does the netdev driver own skb->cb between hard_header
> and start_xmit?  If so we could use that instead of stealing
> some header space, and that would at least let us not lie
> about hard_header_len.

Unfortunately the packet scheduler sits between the hard_header()
call (via neigh_*_output() --> dev_hard_header()) and when the
device xmit method is invoked.

And the packet scheduler can make use of the skb->cb[], for
include/net/sch_generic.h:qdisc_skb_cb


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 6, 2012, 7:51 p.m. UTC | #10
From: David Miller <davem@davemloft.net>
Date: Mon, 06 Feb 2012 14:32:30 -0500 (EST)

> From: Roland Dreier <roland@kernel.org>
> Date: Mon, 6 Feb 2012 11:23:21 -0800
> 
>> Does the netdev driver own skb->cb between hard_header
>> and start_xmit?  If so we could use that instead of stealing
>> some header space, and that would at least let us not lie
>> about hard_header_len.
> 
> Unfortunately the packet scheduler sits between the hard_header()
> call (via neigh_*_output() --> dev_hard_header()) and when the
> device xmit method is invoked.
> 
> And the packet scheduler can make use of the skb->cb[], for
> include/net/sch_generic.h:qdisc_skb_cb

Actually there is a way to make this work.

Define your ipoib_skb_cb something like:

struct ipoib_skb_cb {
	struct qdisc_skb_cb	qdisc_cb;

	... ipoib stuff goes here ...
};

That way you can use the SKB cb area for your ipoib info
without interfering with the packet scheduler.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 6, 2012, 7:56 p.m. UTC | #11
From: David Miller <davem@davemloft.net>
Date: Mon, 06 Feb 2012 14:51:48 -0500 (EST)

> From: David Miller <davem@davemloft.net>
> Date: Mon, 06 Feb 2012 14:32:30 -0500 (EST)
> 
>> From: Roland Dreier <roland@kernel.org>
>> Date: Mon, 6 Feb 2012 11:23:21 -0800
>> 
>>> Does the netdev driver own skb->cb between hard_header
>>> and start_xmit?  If so we could use that instead of stealing
>>> some header space, and that would at least let us not lie
>>> about hard_header_len.
>> 
>> Unfortunately the packet scheduler sits between the hard_header()
>> call (via neigh_*_output() --> dev_hard_header()) and when the
>> device xmit method is invoked.
>> 
>> And the packet scheduler can make use of the skb->cb[], for
>> include/net/sch_generic.h:qdisc_skb_cb
> 
> Actually there is a way to make this work.
> 
> Define your ipoib_skb_cb something like:
> 
> struct ipoib_skb_cb {
> 	struct qdisc_skb_cb	qdisc_cb;
> 
> 	... ipoib stuff goes here ...
> };
> 
> That way you can use the SKB cb area for your ipoib info
> without interfering with the packet scheduler.

But this needs a little bit of work since the qdisc_skb_cb ends with a
variable length array, but we can put an upper bound on this just like
we do for skb->cb[] itself to fix this issue.

I'll toss something together.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier Feb. 6, 2012, 7:59 p.m. UTC | #12
> Actually there is a way to make this work.
>
> Define your ipoib_skb_cb something like:
>
> struct ipoib_skb_cb {
>        struct qdisc_skb_cb     qdisc_cb;
>
>        ... ipoib stuff goes here ...
> };
>
> That way you can use the SKB cb area for your ipoib info
> without interfering with the packet scheduler.

Not sure I see how this could work.  sch_generic.h has:

struct qdisc_skb_cb {
        unsigned int            pkt_len;
        long                    data[];
};

and ipoib has no way of knowing what the biggest
scheduler-specific use of data[] will be.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier Feb. 6, 2012, 8:01 p.m. UTC | #13
On Mon, Feb 6, 2012 at 11:56 AM, David Miller <davem@davemloft.net> wrote:
> But this needs a little bit of work since the qdisc_skb_cb ends with a
> variable length array, but we can put an upper bound on this just like
> we do for skb->cb[] itself to fix this issue.
>
> I'll toss something together.

OK thanks.  We need something like 20 bytes of space to stash an
ipoib LL addr.  Which I guess leaves 28 bytes for sch use.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0eac07c..903bb6e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1095,6 +1095,7 @@  struct net_device {
 	unsigned int		mtu;	/* interface MTU value		*/
 	unsigned short		type;	/* interface hardware type	*/
 	unsigned short		hard_header_len;	/* hardware hdr length	*/
+	unsigned int		gro_mac_header_len; /* L2 header length for GRO */
 
 	/* extra head- and tailroom the hardware may need, but not in all cases
 	 * can this be guaranteed, especially tailroom. Some cases also use
diff --git a/net/core/dev.c b/net/core/dev.c
index f124947..0b43939 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3491,14 +3491,20 @@  static inline gro_result_t
 __napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
 	struct sk_buff *p;
+	unsigned int maclen = skb->dev->gro_mac_header_len;
 
 	for (p = napi->gro_list; p; p = p->next) {
 		unsigned long diffs;
 
 		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
 		diffs |= p->vlan_tci ^ skb->vlan_tci;
-		diffs |= compare_ether_header(skb_mac_header(p),
-					      skb_gro_mac_header(skb));
+		if (maclen == ETH_HLEN)
+			diffs |= compare_ether_header(skb_mac_header(p),
+						      skb_gro_mac_header(skb));
+		else if (!diffs)
+			diffs = memcmp(skb_mac_header(p),
+				       skb_gro_mac_header(skb),
+				       maclen);
 		NAPI_GRO_CB(p)->same_flow = !diffs;
 		NAPI_GRO_CB(p)->flush = 0;
 	}
@@ -5962,6 +5968,7 @@  struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	INIT_LIST_HEAD(&dev->unreg_list);
 	INIT_LIST_HEAD(&dev->link_watch_list);
 	dev->priv_flags = IFF_XMIT_DST_RELEASE;
+	dev->gro_mac_header_len = ETH_HLEN;
 	setup(dev);
 
 	dev->num_tx_queues = txqs;