Message ID | 1328546834.2220.79.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Mon, 06 Feb 2012 17:47:14 +0100 [ Roland Dreier CC:'d ] > gro_max_header_len can be different than hard_header_len because as Or > Gerlitz said : > > IPoIB advertizes hard_header_len which is bigger than the > IPoIB header len, this is done such that skbs sent by the > network stack have enough headroom for a "pseudoheader" > which for few flows (e.g unicast arp replies and multicast) > is placed there by the ipoib hard_header function and later > used by the xmit function. Translation: IPoIB's path resolution mechanism is garbage So if IPoIB path resolution was properly integrated into the neighbour cache state machine, instead of being implemented awkwardly in the device transmit path, this crap wouldn't be necessary right? So here we have yet another incredibly painful side effect of how IPoIB path resolution works. Roland, I want you to seriously consider a way, any way, to get rid of how IPoIB does path resolution. It must be fully integrated into the neighbour layer, the neighbour layer must be knowledgable about how path resolution is a necessary step for a neighbour entry to enter the valid state, and I want all of this awkward neighbour handling code removed from the transmit path of IPoIB. And finally it must not lie about it's hardware header length. Then we won't need crap like what is being proposed here, a "no_this_is_the_real_hard_header_len" struct member. That's just rediculious. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le lundi 06 février 2012 à 11:58 -0500, David Miller a écrit : > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Mon, 06 Feb 2012 17:47:14 +0100 > > [ Roland Dreier CC:'d ] > > > gro_max_header_len can be different than hard_header_len because as Or > > Gerlitz said : > > > > IPoIB advertizes hard_header_len which is bigger than the > > IPoIB header len, this is done such that skbs sent by the > > network stack have enough headroom for a "pseudoheader" > > which for few flows (e.g unicast arp replies and multicast) > > is placed there by the ipoib hard_header function and later > > used by the xmit function. > > Translation: IPoIB's path resolution mechanism is garbage > > So if IPoIB path resolution was properly integrated into the neighbour > cache state machine, instead of being implemented awkwardly in the > device transmit path, this crap wouldn't be necessary right? > > So here we have yet another incredibly painful side effect of how > IPoIB path resolution works. > > Roland, I want you to seriously consider a way, any way, to get rid of > how IPoIB does path resolution. It must be fully integrated into the > neighbour layer, the neighbour layer must be knowledgable about how > path resolution is a necessary step for a neighbour entry to enter the > valid state, and I want all of this awkward neighbour handling code > removed from the transmit path of IPoIB. > > And finally it must not lie about it's hardware header length. > > Then we won't need crap like what is being proposed here, a > "no_this_is_the_real_hard_header_len" struct member. That's just > rediculious. OK, I'll resend my first patch then, using hard_header_len -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2/6/2012 6:58 PM, David Miller wrote: > So if IPoIB path resolution was properly integrated into the neighbour > cache state machine, instead of being implemented awkwardly in the > device transmit path, this crap wouldn't be necessary right? Dave, say we have integrated the path resolution into ND cache, how would you suggest ipoib to act for skbs for which xmit is called without a neighbour? specifically arp replies and multicast. I can think of at least another one other location where the HW address can be stored between hard_header and xmit - on the sbk->cb storage, which is large enough. Or. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2/6/2012 7:07 PM, Eric Dumazet wrote:
> OK, I'll resend my first patch then, using hard_header_len
so, we will be back to square one... as the hard_header_len which
advertized now by IPoIB will fail
the GRO L2 check... lets see where this discussion evolves.
Or.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Le lundi 06 février 2012 à 19:11 +0200, Or Gerlitz a écrit : > On 2/6/2012 7:07 PM, Eric Dumazet wrote: > > OK, I'll resend my first patch then, using hard_header_len > > so, we will be back to square one... as the hard_header_len which > advertized now by IPoIB will fail > the GRO L2 check... lets see where this discussion evolves. > > Or. > You'll have to define hard_header_len to IPOIB_ENCAP_LEN ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 6, 2012 at 8:58 AM, David Miller <davem@davemloft.net> wrote: > > So if IPoIB path resolution was properly integrated into the neighbour > cache state machine, instead of being implemented awkwardly in the > device transmit path, this crap wouldn't be necessary right? > > So here we have yet another incredibly painful side effect of how > IPoIB path resolution works. > > Roland, I want you to seriously consider a way, any way, to get rid of > how IPoIB does path resolution. It must be fully integrated into the > neighbour layer, the neighbour layer must be knowledgable about how > path resolution is a necessary step for a neighbour entry to enter the > valid state, and I want all of this awkward neighbour handling code > removed from the transmit path of IPoIB. > > And finally it must not lie about it's hardware header length. I would love to clean this up. But I don't know how to do it. IMHO the problem is in the IPoIB RFC: (4391) which makes a distinction between an "encapsulation header" and the "link layer address". The LL address is what we put into ARP and ND packets, and so I think we are forced into exposing that to the network stack as our hardware address. However this LL address is not actually what we need to send a packet -- we need to take the GID of our destination and send a query to the subnet manager to resolve it to a path. And this query really is an RPC to a remote entity somewhere else on the network, so we have to do it asynchronously etc. Now the part I don't know how to handle is when the network stack gives us a LL address to send an ARP or something to but the skb has no dst attached. Suppose I don't have a path for that LL address when I get that skb with no dst into my .hard_header method. Where do I stick the LL addr to keep around until the packet shows up in the xmit function? - R. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Roland Dreier <roland@kernel.org> Date: Mon, 6 Feb 2012 09:23:58 -0800 > IMHO the problem is in the IPoIB RFC: (4391) which makes > a distinction between an "encapsulation header" and the "link > layer address". The LL address is what we put into ARP and > ND packets, and so I think we are forced into exposing that > to the network stack as our hardware address. An address is not a hardware MAC header, we're only talking about the length of the latter. If the addressing is such that you need to put the GID into the ARP/NDISC packets, and that's different from what ends up in the final encapsulation header, I really don't see what the problem is specificially with respect to the MAC header size. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 6, 2012 at 11:12 AM, David Miller <davem@davemloft.net> wrote: >> IMHO the problem is in the IPoIB RFC: (4391) which makes >> a distinction between an "encapsulation header" and the "link >> layer address". The LL address is what we put into ARP and >> ND packets, and so I think we are forced into exposing that >> to the network stack as our hardware address. > > An address is not a hardware MAC header, we're only talking > about the length of the latter. > > If the addressing is such that you need to put the GID > into the ARP/NDISC packets, and that's different from what > ends up in the final encapsulation header, I really don't > see what the problem is specificially with respect to the > MAC header size. Dave, please look at my whole email. The problem is that the LL addr is not what we need to send packets but it's all we get in our .hard_header method. So for eg unicast ARP packets without a dst we have nowhere to stash what we actually will need in our xmit method. (The same problem for unicast ARPs applies to multicast sends too). Does the netdev driver own skb->cb between hard_header and start_xmit? If so we could use that instead of stealing some header space, and that would at least let us not lie about hard_header_len. - R. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Roland Dreier <roland@kernel.org> Date: Mon, 6 Feb 2012 11:23:21 -0800 > Does the netdev driver own skb->cb between hard_header > and start_xmit? If so we could use that instead of stealing > some header space, and that would at least let us not lie > about hard_header_len. Unfortunately the packet scheduler sits between the hard_header() call (via neigh_*_output() --> dev_hard_header()) and when the device xmit method is invoked. And the packet scheduler can make use of the skb->cb[], for include/net/sch_generic.h:qdisc_skb_cb -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: David Miller <davem@davemloft.net> Date: Mon, 06 Feb 2012 14:32:30 -0500 (EST) > From: Roland Dreier <roland@kernel.org> > Date: Mon, 6 Feb 2012 11:23:21 -0800 > >> Does the netdev driver own skb->cb between hard_header >> and start_xmit? If so we could use that instead of stealing >> some header space, and that would at least let us not lie >> about hard_header_len. > > Unfortunately the packet scheduler sits between the hard_header() > call (via neigh_*_output() --> dev_hard_header()) and when the > device xmit method is invoked. > > And the packet scheduler can make use of the skb->cb[], for > include/net/sch_generic.h:qdisc_skb_cb Actually there is a way to make this work. Define your ipoib_skb_cb something like: struct ipoib_skb_cb { struct qdisc_skb_cb qdisc_cb; ... ipoib stuff goes here ... }; That way you can use the SKB cb area for your ipoib info without interfering with the packet scheduler. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: David Miller <davem@davemloft.net> Date: Mon, 06 Feb 2012 14:51:48 -0500 (EST) > From: David Miller <davem@davemloft.net> > Date: Mon, 06 Feb 2012 14:32:30 -0500 (EST) > >> From: Roland Dreier <roland@kernel.org> >> Date: Mon, 6 Feb 2012 11:23:21 -0800 >> >>> Does the netdev driver own skb->cb between hard_header >>> and start_xmit? If so we could use that instead of stealing >>> some header space, and that would at least let us not lie >>> about hard_header_len. >> >> Unfortunately the packet scheduler sits between the hard_header() >> call (via neigh_*_output() --> dev_hard_header()) and when the >> device xmit method is invoked. >> >> And the packet scheduler can make use of the skb->cb[], for >> include/net/sch_generic.h:qdisc_skb_cb > > Actually there is a way to make this work. > > Define your ipoib_skb_cb something like: > > struct ipoib_skb_cb { > struct qdisc_skb_cb qdisc_cb; > > ... ipoib stuff goes here ... > }; > > That way you can use the SKB cb area for your ipoib info > without interfering with the packet scheduler. But this needs a little bit of work since the qdisc_skb_cb ends with a variable length array, but we can put an upper bound on this just like we do for skb->cb[] itself to fix this issue. I'll toss something together. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Actually there is a way to make this work. > > Define your ipoib_skb_cb something like: > > struct ipoib_skb_cb { > struct qdisc_skb_cb qdisc_cb; > > ... ipoib stuff goes here ... > }; > > That way you can use the SKB cb area for your ipoib info > without interfering with the packet scheduler. Not sure I see how this could work. sch_generic.h has: struct qdisc_skb_cb { unsigned int pkt_len; long data[]; }; and ipoib has no way of knowing what the biggest scheduler-specific use of data[] will be. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Feb 6, 2012 at 11:56 AM, David Miller <davem@davemloft.net> wrote: > But this needs a little bit of work since the qdisc_skb_cb ends with a > variable length array, but we can put an upper bound on this just like > we do for skb->cb[] itself to fix this issue. > > I'll toss something together. OK thanks. We need something like 20 bytes of space to stash an ipoib LL addr. Which I guess leaves 28 bytes for sch use. - R. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0eac07c..903bb6e 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1095,6 +1095,7 @@ struct net_device { unsigned int mtu; /* interface MTU value */ unsigned short type; /* interface hardware type */ unsigned short hard_header_len; /* hardware hdr length */ + unsigned int gro_mac_header_len; /* L2 header length for GRO */ /* extra head- and tailroom the hardware may need, but not in all cases * can this be guaranteed, especially tailroom. Some cases also use diff --git a/net/core/dev.c b/net/core/dev.c index f124947..0b43939 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3491,14 +3491,20 @@ static inline gro_result_t __napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb) { struct sk_buff *p; + unsigned int maclen = skb->dev->gro_mac_header_len; for (p = napi->gro_list; p; p = p->next) { unsigned long diffs; diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev; diffs |= p->vlan_tci ^ skb->vlan_tci; - diffs |= compare_ether_header(skb_mac_header(p), - skb_gro_mac_header(skb)); + if (maclen == ETH_HLEN) + diffs |= compare_ether_header(skb_mac_header(p), + skb_gro_mac_header(skb)); + else if (!diffs) + diffs = memcmp(skb_mac_header(p), + skb_gro_mac_header(skb), + maclen); NAPI_GRO_CB(p)->same_flow = !diffs; NAPI_GRO_CB(p)->flush = 0; } @@ -5962,6 +5968,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, INIT_LIST_HEAD(&dev->unreg_list); INIT_LIST_HEAD(&dev->link_watch_list); dev->priv_flags = IFF_XMIT_DST_RELEASE; + dev->gro_mac_header_len = ETH_HLEN; setup(dev); dev->num_tx_queues = txqs;