
[2/2] net: ip, ipv6: handle gso skbs in forwarding path

Message ID 1390810971-23959-2-git-send-email-fw@strlen.de
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Florian Westphal Jan. 27, 2014, 8:22 a.m. UTC
Marcelo Ricardo Leitner reported problems when the forwarding link
path has a lower mtu than the incoming link if the inbound interface
supports GRO.

Given:
Host <mtu1500> R1 <mtu1200> R2

Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.

In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as gso packets currently bypass the
dst mtu checks in forward path. Instead, Linux tries to send out packets
exceeding R1-R2 link mtu.

When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.

This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.

For ipv6, we send out pkt too big error for gso if the individual
segments are too big.

For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
Changes since V2:
 - make this thing apply to current -net tree
 - kill unused variables in ip_forward/ip6_output

Changes since V1:
 suggestions from Eric Dumazet:
  - skip more expensive computation for small packets in fwd path
  - use netif_skb_features() feature mask and remove GSO flags
    instead of using 0 feature set.

 include/linux/skbuff.h | 17 ++++++++++++++
 net/ipv4/ip_forward.c  | 60 ++++++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv6/ip6_output.c  | 17 ++++++++++++--
 3 files changed, 90 insertions(+), 4 deletions(-)
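
The per-segment check described in the commit message can be illustrated outside the kernel. The following is a minimal userspace sketch, not the code from the patch: struct gso_pkt_model and the model_* helpers are hypothetical names, and the segment length is assumed to be network header plus TCP header plus gso_size.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the few packet fields the check needs. */
struct gso_pkt_model {
	unsigned int len;		/* aggregated length, network header onwards */
	unsigned int net_hdr_len;	/* IPv4/IPv6 header length, options included */
	unsigned int trans_hdr_len;	/* TCP header length, options included */
	unsigned int gso_size;		/* payload bytes per segment, 0 if not GSO */
};

/* Length of a single segment on the wire, network header onwards. */
static unsigned int model_network_seglen(const struct gso_pkt_model *p)
{
	assert(p->gso_size > 0);
	return p->net_hdr_len + p->trans_hdr_len + p->gso_size;
}

/* True if the packet cannot be forwarded unmodified over a link with this MTU. */
static bool model_exceeds_mtu(const struct gso_pkt_model *p, unsigned int mtu)
{
	if (p->len <= mtu)
		return false;	/* cheap path: small packets need no further checks */
	if (p->gso_size == 0)
		return true;	/* ordinary oversized packet */
	/* GSO packet: what matters is the size of each individual segment */
	return model_network_seglen(p) > mtu;
}
```

In this model an aggregated packet whose segments are 1500 bytes long is flagged for a 1200-byte outgoing MTU but passes for a 1500-byte one, regardless of the total aggregated length.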

Comments

David Miller Jan. 27, 2014, 8:34 a.m. UTC | #1
From: Florian Westphal <fw@strlen.de>
Date: Mon, 27 Jan 2014 09:22:51 +0100

> Changes since V2:
>  - make this thing apply to current -net tree
>  - kill unused variables in ip_forward/ip6_output

Still need changes.
> +	return skb_gso_network_seglen(skb) > dst_mtu(skb_dst(skb));

You can't use dst_mtu() directly. To be consistent with the
rest of the forwarding code in this file, you must use something like:

>  	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Jan. 27, 2014, 8:36 a.m. UTC | #2
David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 27 Jan 2014 09:22:51 +0100
> > Changes since V2:
> >  - make this thing apply to current -net tree
> >  - kill unused variables in ip_forward/ip6_output
> 
> Still need changes.
> > +	return skb_gso_network_seglen(skb) > dst_mtu(skb_dst(skb));
> 
> You can't use dst_mtu() directly. To be consistent with the
> rest of the forwarding code in this file, you must use something like:
> 
> >  	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);

You're right of course.

Sorry.  I will fix this up and NOT resend soon; it's clear I need
to do more homework (aka follow Hannes' PMTU changes).

Expect a V4 in a couple of hours.
Eric Dumazet Jan. 27, 2014, 6:22 p.m. UTC | #3
On Mon, 2014-01-27 at 09:22 +0100, Florian Westphal wrote:

> +/* called if GSO skb needs to be fragmented on forward.  */
> +static int ip_forward_finish_gso(struct sk_buff *skb)
> +{
> +	netdev_features_t features = netif_skb_features(skb);
> +	struct sk_buff *segs;
> +	int ret = 0;
> +
> +	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> +	if (IS_ERR(segs)) {
> +		kfree_skb(skb);
> +		return -ENOMEM;
> +	}
> +
> +	consume_skb(skb);
> +
> +	do {
> +		struct sk_buff *nskb = segs->next;
> +		int err;
> +
> +		segs->next = NULL;
> +		err = dst_output(segs);
> +
> +		if (err && ret == 0)
> +			ret = err;
> +		segs = nskb;
> +	} while (segs);
> +
> +	return ret;
> +}
> +

It's still unclear if this is the best strategy.

TCP streams not using the DF flag are very unlikely to care if we adjust
their MTU (lowering gso_size) at this point?
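
For reference, the output loop in the quoted function detaches each segment from the list before handing it on, and reports only the first error while still sending the remaining segments. A minimal userspace sketch of that pattern follows; struct seg_model, model_output_segments() and model_output_demo() are hypothetical names and no kernel API is used. Unlike the do/while in the patch, the sketch also tolerates an empty list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one segment in the list returned by segmentation. */
struct seg_model {
	struct seg_model *next;
	int id;
};

typedef int (*seg_output_fn)(struct seg_model *seg);

/* Example output callback: pretends that segments with a negative id fail. */
static int model_output_demo(struct seg_model *seg)
{
	return seg->id < 0 ? seg->id : 0;
}

/* Send every segment, remembering only the first error encountered. */
static int model_output_segments(struct seg_model *segs, seg_output_fn output)
{
	int ret = 0;

	assert(output != NULL);
	while (segs) {
		struct seg_model *next = segs->next;
		int err;

		segs->next = NULL;	/* detach before passing the segment on */
		err = output(segs);
		if (err && ret == 0)
			ret = err;	/* keep the first error, continue with the rest */
		segs = next;
	}
	return ret;
}
```

Saving the next pointer before the call matters because the output step takes ownership of the segment it is given.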



David Miller Jan. 27, 2014, 8:58 p.m. UTC | #4
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 27 Jan 2014 10:22:47 -0800

> Its still unclear if this is the best strategy.
> 
> TCP stream not using DF flag are very unlikely to care if we adjust
> their MTU (lowering gso_size) at this point ?

It's better than what happens now when the destination link has a lower
MTU, wouldn't you say?
David Miller Jan. 27, 2014, 9:08 p.m. UTC | #5
From: David Miller <davem@davemloft.net>
Date: Mon, 27 Jan 2014 12:58:38 -0800 (PST)

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 27 Jan 2014 10:22:47 -0800
> 
>> Its still unclear if this is the best strategy.
>> 
>> TCP stream not using DF flag are very unlikely to care if we adjust
>> their MTU (lowering gso_size) at this point ?
> 
> It's better than what happens now when the destination link has a lower
> MTU, wouldn't you say?

In the mean time I'll hold off on this patch while you guys discuss
this.
Hannes Frederic Sowa Jan. 28, 2014, 12:27 a.m. UTC | #6
On Mon, Jan 27, 2014 at 10:22:47AM -0800, Eric Dumazet wrote:
> On Mon, 2014-01-27 at 09:22 +0100, Florian Westphal wrote:
> 
> > +/* called if GSO skb needs to be fragmented on forward.  */
> > +static int ip_forward_finish_gso(struct sk_buff *skb)
> > +{
> > +	netdev_features_t features = netif_skb_features(skb);

netif_skb_features uses skb->dev for determination of offloading features but
we actually need rt->dst.dev, no?

> > +	struct sk_buff *segs;
> > +	int ret = 0;
> > +
> > +	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> > +	if (IS_ERR(segs)) {
> > +		kfree_skb(skb);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	consume_skb(skb);
> > +
> > +	do {
> > +		struct sk_buff *nskb = segs->next;
> > +		int err;
> > +
> > +		segs->next = NULL;
> > +		err = dst_output(segs);
> > +
> > +		if (err && ret == 0)
> > +			ret = err;
> > +		segs = nskb;
> > +	} while (segs);
> > +
> > +	return ret;
> > +}
> > +
> 
> Its still unclear if this is the best strategy.
> 
> TCP stream not using DF flag are very unlikely to care if we adjust
> their MTU (lowering gso_size) at this point ?

UDP shouldn't be a problem either.

Greetings,

  Hannes

Florian Westphal Jan. 28, 2014, 8:57 a.m. UTC | #7
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > +	do {
> > +		struct sk_buff *nskb = segs->next;
> > +		int err;
> > +
> > +		segs->next = NULL;
> > +		err = dst_output(segs);
> > +
> > +		if (err && ret == 0)
> > +			ret = err;
> > +		segs = nskb;
> > +	} while (segs);
> > +
> > +	return ret;
> > +}
> > +
> 
> Its still unclear if this is the best strategy.
> 
> TCP stream not using DF flag are very unlikely to care if we adjust
> their MTU (lowering gso_size) at this point ?

Thanks for this suggestion.  It would indeed be nice to avoid sw
segmentation.  I tried:

static void ip_gso_adjust_seglen(struct sk_buff *skb)
{
        unsigned int mtu;

        if (!skb_is_gso(skb))
                return;

        mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
        skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
}

But this yields

[   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
[   28.644776] invalid opcode: 0000 [#1] SMP 
[   28.644776] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0+ #35
[   28.644776] task: ffffffff818104c0 ti: ffffffff81800000 task.ti: ffffffff81800000
[   28.644776] RIP: 0010:[<ffffffff813b10d8>]  [<ffffffff813b10d8>] skb_segment+0x808/0x830
[   28.644776] RSP: 0018:ffff88002fc03688  EFLAGS: 00010212
[   28.644776] RAX: 000000000000047c RBX: ffff88002d614b00 RCX: ffff88002d72ab00
[   28.644776] RDX: 000000000000047c RSI: 00000000000050fa RDI: ffff88002cf9f800
[   28.644776] RBP: ffff88002fc03778 R08: 0000000000000000 R09: ffff88002cdaf300
[   28.644776] R10: 0000000000000011 R11: 0000000000004ff2 R12: ffff88002cf9ff80
[   28.644776] R13: 0000000000000011 R14: 00000000000050fa R15: 00000000000054a2
[   28.644776] FS:  00007f27db007700(0000) GS:ffff88002fc00000(0000) knlGS:0000000000000000
[   28.644776] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   28.644776] CR2: 00007f8176cedcfc CR3: 000000002cd14000 CR4: 00000000000006b0
[   28.644776] Stack:
[   28.644776]  0000000000000046 ffffffff00000014 0000000000000001 ffffffff00000022
[   28.644776]  ffff88002cdaf300 ffff88002d72aaf0 0000000000000000 0000000000004ff2
[   28.644776]  0000000000000014 ffffffff818104c0 ffffffff81810bc8 ffffffffffffffbe
[   28.644776] Call Trace:
[   28.644776]  <IRQ> 
[   28.644776]  [<ffffffff8125e742>] ? number.isra.1+0x302/0x330
[   28.644776]  [<ffffffff8142f35e>] tcp_gso_segment+0x11e/0x3f0
[   28.644776]  [<ffffffff8143f2c9>] inet_gso_segment+0x129/0x350
[   28.644776]  [<ffffffff810832cf>] ?  __lock_acquire+0x2ef/0x1ca0
[   28.644776]  [<ffffffff813bcd9d>] skb_mac_gso_segment+0xdd/0x1e0
[   28.644776]  [<ffffffff813bcd07>] ?  skb_mac_gso_segment+0x47/0x1e0
[   28.644776]  [<ffffffff813bcf00>] __skb_gso_segment+0x60/0xc0
[   28.644776]  [<ffffffff813bd203>] dev_hard_start_xmit+0x183/0x5b0
[   28.644776]  [<ffffffff813e064e>] sch_direct_xmit+0xfe/0x280
[   28.644776]  [<ffffffff813bd843>] __dev_queue_xmit+0x213/0x6b0
[   28.644776]  [<ffffffff813bd635>] ?  __dev_queue_xmit+0x5/0x6b0
[   28.644776]  [<ffffffff813bdcf0>] dev_queue_xmit+0x10/0x20
[   28.644776]  [<ffffffff8140c2a9>] ip_finish_output+0x419/0x600
[   28.644776]  [<ffffffff8140c4de>] ? ip_output+0x4e/0xc0
[   28.644776]  [<ffffffff810803e4>] ? __lock_is_held+0x54/0x80
[   28.644776]  [<ffffffff8140c4de>] ip_output+0x4e/0xc0
[   28.644776]  [<ffffffff81407ffb>] ip_forward+0x21b/0x650

Eric, any chance you know whether mucking with gso_size in this way
is supposed to work?

I will go through skb_segment and see if I can find out what exactly causes this
BUG_ON to trigger.
Florian Westphal Jan. 28, 2014, 9:12 a.m. UTC | #8
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> On Mon, Jan 27, 2014 at 10:22:47AM -0800, Eric Dumazet wrote:
> > On Mon, 2014-01-27 at 09:22 +0100, Florian Westphal wrote:
> > 
> > > +/* called if GSO skb needs to be fragmented on forward.  */
> > > +static int ip_forward_finish_gso(struct sk_buff *skb)
> > > +{
> > > +	netdev_features_t features = netif_skb_features(skb);
> 
> netif_skb_features uses skb->dev for determination of offloading features but
> we actually need rt->dst.dev, no?

Good catch, I cannot use netif_skb_features() as skb->dev still
points to the incoming device here...
Eric Dumazet Jan. 28, 2014, 4:34 p.m. UTC | #9
On Tue, 2014-01-28 at 09:57 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > +	do {
> > > +		struct sk_buff *nskb = segs->next;
> > > +		int err;
> > > +
> > > +		segs->next = NULL;
> > > +		err = dst_output(segs);
> > > +
> > > +		if (err && ret == 0)
> > > +			ret = err;
> > > +		segs = nskb;
> > > +	} while (segs);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > 
> > Its still unclear if this is the best strategy.
> > 
> > TCP stream not using DF flag are very unlikely to care if we adjust
> > their MTU (lowering gso_size) at this point ?
> 
> Thanks for this suggestion.  It would indeed be nice to avoid sw
> segmentation.  I tried:
> 
> static void ip_gso_adjust_seglen(struct sk_buff *skb)
> {
>         unsigned int mtu;
> 
>         if (!skb_is_gso(skb))
>                 return;
> 
>         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
>         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
> }
> 
> But this yields
> 
> [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!

Yep, let's CC Herbert Xu, as he 'owns' skb_segment()

BUG_ON(skb_headlen(fskb));

I once sent a generic version of skb_segment(), but Herbert preferred a
different one.

> [   28.644776] invalid opcode: 0000 [#1] SMP 
> [   28.644776] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0+ #35
> [   28.644776] task: ffffffff818104c0 ti: ffffffff81800000 task.ti: ffffffff81800000
> [   28.644776] RIP: 0010:[<ffffffff813b10d8>]  [<ffffffff813b10d8>] skb_segment+0x808/0x830
> [   28.644776] RSP: 0018:ffff88002fc03688  EFLAGS: 00010212
> [   28.644776] RAX: 000000000000047c RBX: ffff88002d614b00 RCX: ffff88002d72ab00
> [   28.644776] RDX: 000000000000047c RSI: 00000000000050fa RDI: ffff88002cf9f800
> [   28.644776] RBP: ffff88002fc03778 R08: 0000000000000000 R09: ffff88002cdaf300
> [   28.644776] R10: 0000000000000011 R11: 0000000000004ff2 R12: ffff88002cf9ff80
> [   28.644776] R13: 0000000000000011 R14: 00000000000050fa R15: 00000000000054a2
> [   28.644776] FS:  00007f27db007700(0000) GS:ffff88002fc00000(0000) knlGS:0000000000000000
> [   28.644776] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [   28.644776] CR2: 00007f8176cedcfc CR3: 000000002cd14000 CR4: 00000000000006b0
> [   28.644776] Stack:
> [   28.644776]  0000000000000046 ffffffff00000014 0000000000000001 ffffffff00000022
> [   28.644776]  ffff88002cdaf300 ffff88002d72aaf0 0000000000000000 0000000000004ff2
> [   28.644776]  0000000000000014 ffffffff818104c0 ffffffff81810bc8 ffffffffffffffbe
> [   28.644776] Call Trace:
> [   28.644776]  <IRQ> 
> [   28.644776]  [<ffffffff8125e742>] ? number.isra.1+0x302/0x330
> [   28.644776]  [<ffffffff8142f35e>] tcp_gso_segment+0x11e/0x3f0
> [   28.644776]  [<ffffffff8143f2c9>] inet_gso_segment+0x129/0x350
> [   28.644776]  [<ffffffff810832cf>] ?  __lock_acquire+0x2ef/0x1ca0
> [   28.644776]  [<ffffffff813bcd9d>] skb_mac_gso_segment+0xdd/0x1e0
> [   28.644776]  [<ffffffff813bcd07>] ?  skb_mac_gso_segment+0x47/0x1e0
> [   28.644776]  [<ffffffff813bcf00>] __skb_gso_segment+0x60/0xc0
> [   28.644776]  [<ffffffff813bd203>] dev_hard_start_xmit+0x183/0x5b0
> [   28.644776]  [<ffffffff813e064e>] sch_direct_xmit+0xfe/0x280
> [   28.644776]  [<ffffffff813bd843>] __dev_queue_xmit+0x213/0x6b0
> [   28.644776]  [<ffffffff813bd635>] ?  __dev_queue_xmit+0x5/0x6b0
> [   28.644776]  [<ffffffff813bdcf0>] dev_queue_xmit+0x10/0x20
> [   28.644776]  [<ffffffff8140c2a9>] ip_finish_output+0x419/0x600
> [   28.644776]  [<ffffffff8140c4de>] ? ip_output+0x4e/0xc0
> [   28.644776]  [<ffffffff810803e4>] ? __lock_is_held+0x54/0x80
> [   28.644776]  [<ffffffff8140c4de>] ip_output+0x4e/0xc0
> [   28.644776]  [<ffffffff81407ffb>] ip_forward+0x21b/0x650
> 
> Eric, any chance you know whether mucking with gso_size in this way
> is supposed to work?
> 
> I will go through skb_segment and see if I can find out what exactly causes this
> BUG_ON to trigger.

This is definitely net-next material anyway, no hurry ;)


Florian Westphal Jan. 28, 2014, 5:15 p.m. UTC | #10
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Eric, any chance you know whether mucking with gso_size in this way
> > is supposed to work?
> > 
> > I will go through skb_segment and see if I can find out what exactly causes this
> > BUG_ON to trigger.
> 
> This is definitely net-next material anyway, no hurry ;)

Yes, looks like it :)

Eric, do you mind if I re-send the patch with skb_gso_segment and a zero
feature mask?

I think that's the best solution for -net.  I would then try to come up
with a version that follows your "shrink gso_size" suggestion for -next.

Cheers,
Florian
Eric Dumazet Jan. 28, 2014, 5:30 p.m. UTC | #11
On Tue, 2014-01-28 at 18:15 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Eric, any chance you know whether mucking with gso_size in this way
> > > is supposed to work?
> > > 
> > > I will go through skb_segment and see if I can find out what exactly causes this
> > > BUG_ON to trigger.
> > 
> > This is definitely net-next material anyway, no hurry ;)
> 
> Yes, looks like it :)
> 
> Eric, do you mind if I re-send the patch with skb_gso_segment and a zero
> feature mask?



I think the xmit path will take care of the fallback anyway, if the skb
needs to be linear or the TX checksum needs to be computed.

> I think thats the best solution for -net.  I would then try to come up
> with a version that follows your "shrink gso_size" suggestion for -next.

Note that I mentioned this MTU thing months ago, and the bug has been
here for years. I do not think it's a very urgent matter :)



Florian Westphal Jan. 28, 2014, 5:37 p.m. UTC | #12
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I think the xmit will take care of doing the fallback anyway, if skb
> need to be linear or TX checksum be computed. 
> 
> > I think thats the best solution for -net.  I would then try to come up
> > with a version that follows your "shrink gso_size" suggestion for -next.
> 
> Note that I mentioned this MTU thing months ago, and the bug is here
> since years. I do not think its a very urgent matter :)

Fair enough.  I'll see that I have something ready when -next opens.

Thanks Eric.
Florian Westphal Jan. 29, 2014, 10:53 a.m. UTC | #13
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> > TCP stream not using DF flag are very unlikely to care if we adjust
> > their MTU (lowering gso_size) at this point ?
> 
> UDP shouldn't be a problem, too.

Sorry for the late reply, but how can this be safe for UDP?
Shouldn't we make sure that the peer sees the original, unchanged datagram?

The only solution for UDP that I can see is to do sw segmentation (i.e.
create ip fragments).

Thanks,
Florian
Florian Westphal Jan. 29, 2014, 11 a.m. UTC | #14
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > Its still unclear if this is the best strategy.
> > > 
> > > TCP stream not using DF flag are very unlikely to care if we adjust
> > > their MTU (lowering gso_size) at this point ?
> > 
> > Thanks for this suggestion.  It would indeed be nice to avoid sw
> > segmentation.  I tried:
> > 
> > static void ip_gso_adjust_seglen(struct sk_buff *skb)
> > {
> >         unsigned int mtu;
> > 
> >         if (!skb_is_gso(skb))
> >                 return;
> > 
> >         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
> >         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);

FWIW, Eric's idea works fine with:

        headerlen = skb_transport_header(skb) - skb_network_header(skb);
        if (likely(skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
                headerlen += tcp_hdrlen(skb);
        skb_shinfo(skb)->gso_size = mtu - headerlen;

and disabling 'sg' on the outgoing (lower-mtu) interface.  [ else BUG() ]
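
The difference between the two gso_size computations is plain arithmetic and can be checked in userspace. The sketch below is only an illustration: the model_* helpers are hypothetical names, and the 20-byte IPv4 header and 32-byte TCP header used in the note are example values. It does not attempt to explain the BUG_ON, which depends on how skb_segment() handles the skb layout.

```c
#include <assert.h>

/* gso_size as in the first attempt: only a fixed 20-byte IPv4 header is subtracted. */
static unsigned int model_gso_size_v1(unsigned int mtu)
{
	const unsigned int ipv4_hdr = 20;

	assert(mtu > ipv4_hdr);
	return mtu - ipv4_hdr;
}

/* gso_size as in the revised snippet: network header and TCP header are subtracted. */
static unsigned int model_gso_size_v2(unsigned int mtu, unsigned int net_hdr_len,
				      unsigned int tcp_hdr_len)
{
	unsigned int headerlen = net_hdr_len + tcp_hdr_len;

	assert(mtu > headerlen);
	return mtu - headerlen;
}

/* Size of one resulting segment on the wire, network header onwards. */
static unsigned int model_wire_len(unsigned int gso_size, unsigned int net_hdr_len,
				   unsigned int tcp_hdr_len)
{
	return net_hdr_len + tcp_hdr_len + gso_size;
}
```

With an MTU of 1200, the first variant yields a gso_size of 1180 and therefore 1232-byte segments, which still exceed the link MTU. The second variant yields 1148 and segments of exactly 1200 bytes.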

> > [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
> 
> Yep, lets CC Herbert Xu, as he 'owns' skb_segment()
> 
> BUG_ON(skb_headlen(fskb));
Hannes Frederic Sowa Jan. 29, 2014, 11:04 a.m. UTC | #15
On Wed, Jan 29, 2014 at 11:53:47AM +0100, Florian Westphal wrote:
> Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> > > TCP stream not using DF flag are very unlikely to care if we adjust
> > > their MTU (lowering gso_size) at this point ?
> > 
> > UDP shouldn't be a problem, too.
> 
> Sorry for late reply, but how can this be safe for UDP?
> We should make sure that peer sees original, unchanged datagram?

Peer as in original destination? Of course, we must not alter the datagram but
can only do fragmentation or send back frag_needed.

> And only solution for UDP that I can see is to do sw segmentation (i.e.
> create ip fragments).

Hardware(-UFO) would do fragmentation in hardware, too, because there is
no other way to split UDP data. If UFO is not supported,
manual sw segmentation would create the required fragments in the output
path, too.

Greetings,

  Hannes

Herbert Xu Feb. 9, 2014, 2:55 a.m. UTC | #16
On Tue, Jan 28, 2014 at 08:34:43AM -0800, Eric Dumazet wrote:
> On Tue, 2014-01-28 at 09:57 +0100, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > +	do {
> > > > +		struct sk_buff *nskb = segs->next;
> > > > +		int err;
> > > > +
> > > > +		segs->next = NULL;
> > > > +		err = dst_output(segs);
> > > > +
> > > > +		if (err && ret == 0)
> > > > +			ret = err;
> > > > +		segs = nskb;
> > > > +	} while (segs);
> > > > +
> > > > +	return ret;
> > > > +}
> > > > +
> > > 
> > > Its still unclear if this is the best strategy.
> > > 
> > > TCP stream not using DF flag are very unlikely to care if we adjust
> > > their MTU (lowering gso_size) at this point ?
> > 
> > Thanks for this suggestion.  It would indeed be nice to avoid sw
> > segmentation.  I tried:
> > 
> > static void ip_gso_adjust_seglen(struct sk_buff *skb)
> > {
> >         unsigned int mtu;
> > 
> >         if (!skb_is_gso(skb))
> >                 return;
> > 
> >         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
> >         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
> > }
> > 
> > But this yields
> > 
> > [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
> 
> Yep, lets CC Herbert Xu, as he 'owns' skb_segment()

IMHO we should just stop merging ~DF packets altogether, at least
for TCP.

Cheers,
Florian Westphal Feb. 10, 2014, 12:23 p.m. UTC | #17
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > static void ip_gso_adjust_seglen(struct sk_buff *skb)
> > > {
> > >         unsigned int mtu;
> > > 
> > >         if (!skb_is_gso(skb))
> > >                 return;
> > > 
> > >         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
> > >         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
> > > }
> > > 
> > > But this yields
> > > 
> > > [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
> > 
> > Yep, lets CC Herbert Xu, as he 'owns' skb_segment()
> 
> IMHO we should just stop merging ~DF packets altogether, at least
> for TCP.

Eric, you added DF aggregation in db8caf3dbc77599dc90f4ea0a803cd1d97116f30
(gro: should aggregate frames without DF).

I guess you don't want to revert this commit?
Any other ideas?

skb_gso_segment() is already very complex, I don't want to add more code
to it.  And that seems unavoidable if we need to de-couple nr_frags and
gso_size.
Herbert Xu Feb. 10, 2014, 12:31 p.m. UTC | #18
On Mon, Feb 10, 2014 at 01:23:31PM +0100, Florian Westphal wrote:
> Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > > static void ip_gso_adjust_seglen(struct sk_buff *skb)
> > > > {
> > > >         unsigned int mtu;
> > > > 
> > > >         if (!skb_is_gso(skb))
> > > >                 return;
> > > > 
> > > >         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
> > > >         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
> > > > }
> > > > 
> > > > But this yields
> > > > 
> > > > [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
> > > 
> > > Yep, lets CC Herbert Xu, as he 'owns' skb_segment()
> > 
> > IMHO we should just stop merging ~DF packets altogether, at least
> > for TCP.
> 
> Eric, you added DF aggregation in db8caf3dbc77599dc90f4ea0a803cd1d97116f30
> (gro: should aggregate frames without DF).
> 
> I guess you don't want to revert this commit?
> Any other ideas?
> 
> skb_gso_segment() is already very complex, I don't want to add more code
> to it.  And that seems unavoidable if we need to de-couple nr_frags and
> gso_size.

I don't think adding all this complexity just to be able to
aggregate ~DF packets (which are just wrong to begin with) is
worth it.

If aggregating ~DF packets was a one-liner then sure, but there
is a reason why I didn't aggregate them in the first place and
you've found it :)

Cheers,
Florian Westphal Feb. 10, 2014, 12:43 p.m. UTC | #19
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Feb 10, 2014 at 01:23:31PM +0100, Florian Westphal wrote:
> > Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > > > static void ip_gso_adjust_seglen(struct sk_buff *skb)
> > > > > {
> > > > >         unsigned int mtu;
> > > > > 
> > > > >         if (!skb_is_gso(skb))
> > > > >                 return;
> > > > > 
> > > > >         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
> > > > >         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
> > > > > }
> > > > > 
> > > > > But this yields
> > > > > 
> > > > > [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
> > > > 
> > > > Yep, lets CC Herbert Xu, as he 'owns' skb_segment()
> > > 
> > > IMHO we should just stop merging ~DF packets altogether, at least
> > > for TCP.
> > 
> > Eric, you added DF aggregation in db8caf3dbc77599dc90f4ea0a803cd1d97116f30
> > (gro: should aggregate frames without DF).
> > 
> > I guess you don't want to revert this commit?
> > Any other ideas?
> > 
> > skb_gso_segment() is already very complex, I don't want to add more code
> > to it.  And that seems unavoidable if we need to de-couple nr_frags and
> > gso_size.
> 
> I don't think adding all this complexity just to be able to
> aggregate ~DF packets (which are just wrong to begin with) is
> worth it.
> 
> If aggregating ~DF packets was a one-liner then sure, but there
> is a reason why I didn't aggregate them in the first place and
> you've found it :)

Well we could go with my original patch that will do software
segmentation on ~DF packets in the forwarding path if the outmtu is too
small for the individual packets.  The output path then simply
creates fragments.

Eric suggested shrinking gso_size instead to avoid segmentation+fragments.
I think it's a nice idea, but skb_gso_segment makes certain assumptions about
nr_frags and gso_size (it can't handle frag size > desired mss).

Hannes pointed out that we'd also need to deal with
SKB_MAX_FRAGS * gso_size exceeding fragments.

Quite frankly, I'd prefer to go with

skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);

The scenario is rare anyway given the number of bug reports (or lack
thereof) about '~DF tcp doesn't work with gro in fwd path when output
mtu is too small'.

It's not like this could never be improved later on.

Best regards,
Florian
Herbert Xu Feb. 10, 2014, 12:50 p.m. UTC | #20
On Mon, Feb 10, 2014 at 01:43:46PM +0100, Florian Westphal wrote:
>
> Eric suggested to shrink gso_size instead to avoid segmentation+fragments.
> I think its nice idea, but skb_gso_segment makes certain assumptions about
> nr_frags and gso_size (it can't handle frag size > desired mss).

This breaks the most important assumption behind GRO which is
to preserve end-to-end connectivity.  Resegmenting packets as
suggested on a router/bridge is just wrong.

Cheers,
Eric Dumazet Feb. 10, 2014, 1:08 p.m. UTC | #21
On Mon, 2014-02-10 at 20:50 +0800, Herbert Xu wrote:
> On Mon, Feb 10, 2014 at 01:43:46PM +0100, Florian Westphal wrote:
> >
> > Eric suggested to shrink gso_size instead to avoid segmentation+fragments.
> > I think its nice idea, but skb_gso_segment makes certain assumptions about
> > nr_frags and gso_size (it can't handle frag size > desired mss).
> 
> This breaks the most important assumption behind GRO which is
> to preserve end-to-end connectivity.  Resegmenting packets as
> suggested on a router/bridge is just wrong.

Yeah, this is the old mantra.

Sending TCP packets without DF means the sender does not care, by
definition.

If you disable GRO for such packets, it slows down receivers and
increases packet drops.

I've added the segmentation for these packets for a reason, that you are
free to not understand, but there is absolutely no reason to not
aggregate TCP packets without DF. This is what you suggested in order to
ignore the problem of skb_segment() being so limited.

Instead of a router being forced to segment all incoming fragments into

X+Y
X+Y
X+Y
X+Y

It's reasonable to send X+X+X+X+X

And we should be reasonable, not trying to enforce a particular view of
what the traffic _should_ look like on the Internet.
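
The X+Y versus X+X+X comparison can be put into numbers with a small userspace model. This is a sketch under stated assumptions, not kernel code: the model_* helpers are hypothetical names, every original segment is assumed to carry a full MSS, and fragmentation follows the IPv4 rule that all fragments except the last carry a multiple of 8 payload bytes. Exact counts depend on the sizes involved.

```c
#include <assert.h>

/*
 * Packets on the wire if every original segment is IPv4-fragmented on its
 * own (the "X+Y" pattern).
 */
static unsigned int model_pkts_fragmented(unsigned int nsegs, unsigned int mss,
					  unsigned int ip_hdr, unsigned int tcp_hdr,
					  unsigned int mtu)
{
	unsigned int ip_payload = tcp_hdr + mss;
	unsigned int per_frag;

	assert(mtu >= ip_hdr + 8);
	if (ip_hdr + ip_payload <= mtu)
		return nsegs;	/* segment fits, no fragmentation needed */

	/* every fragment except the last carries a multiple of 8 bytes */
	per_frag = ((mtu - ip_hdr) / 8) * 8;
	return nsegs * ((ip_payload + per_frag - 1) / per_frag);
}

/*
 * Packets on the wire if the aggregated TCP payload is re-cut with a
 * smaller segment size that fits the outgoing MTU (the "X+X+X" pattern).
 */
static unsigned int model_pkts_resegmented(unsigned int nsegs, unsigned int mss,
					   unsigned int ip_hdr, unsigned int tcp_hdr,
					   unsigned int mtu)
{
	unsigned int total = nsegs * mss;
	unsigned int new_mss;

	assert(mss > 0);
	assert(mtu > ip_hdr + tcp_hdr);
	new_mss = mtu - ip_hdr - tcp_hdr;
	if (new_mss > mss)
		new_mss = mss;	/* never grow the segment size */
	return (total + new_mss - 1) / new_mss;
}
```

For four aggregated segments with an MSS of 1460, 20-byte IP and TCP headers and an outgoing MTU of 1200, the model gives 8 packets when each segment is fragmented and 6 packets when the payload is resegmented.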



Eric Dumazet Feb. 10, 2014, 1:12 p.m. UTC | #22
On Mon, 2014-02-10 at 20:31 +0800, Herbert Xu wrote:
> On Mon, Feb 10, 2014 at 01:23:31PM +0100, Florian Westphal wrote:
> > Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > > > > static void ip_gso_adjust_seglen(struct sk_buff *skb)
> > > > > {
> > > > >         unsigned int mtu;
> > > > > 
> > > > >         if (!skb_is_gso(skb))
> > > > >                 return;
> > > > > 
> > > > >         mtu = ip_dst_mtu_maybe_forward(skb_dst(skb), true);
> > > > >         skb_shinfo(skb)->gso_size = mtu - sizeof(struct iphdr);
> > > > > }
> > > > > 
> > > > > But this yields
> > > > > 
> > > > > [   28.644776] kernel BUG at net/net/core/skbuff.c:2984!
> > > > 
> > > > Yep, lets CC Herbert Xu, as he 'owns' skb_segment()
> > > 
> > > IMHO we should just stop merging ~DF packets altogether, at least
> > > for TCP.
> > 
> > Eric, you added DF aggregation in db8caf3dbc77599dc90f4ea0a803cd1d97116f30
> > (gro: should aggregate frames without DF).
> > 
> > I guess you don't want to revert this commit?
> > Any other ideas?
> > 
> > skb_gso_segment() is already very complex, I don't want to add more code
> > to it.  And that seems unavoidable if we need to de-couple nr_frags and
> > gso_size.
> 
> I don't think adding all this complexity just to be able to
> aggregate ~DF packets (which are just wrong to begin with) is
> worth it.

Wrong by your standards. Which are not universal.

> 
> If aggregating ~DF packets was a one-liner then sure, but there
> is a reason why I didn't aggregate them in the first place and
> you've found it :)

Months ago I sent a solution for skb_segment() that you ignored.

I understand you never hit cases where DF was not set, but I can tell you
it's happening in the real world.

The GRO stack has broken reversibility by definition since day 0.

Recent tunneling support breaks it as well.

Eric Dumazet Feb. 10, 2014, 1:15 p.m. UTC | #23
On Mon, 2014-02-10 at 13:43 +0100, Florian Westphal wrote:

> Well we could go with my original patch that will do software
> segmentation on ~DF packets in the forwarding path if the outmtu is too
> small for the individual packets.  The output path then simply
> creates fragments.

Most Linux routers disable GRO anyway.

GRO is mostly used on Linux hosts to improve performance, so most GRO
packets are consumed on the receiving host.



Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f589c9a..3ebbbe7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2916,5 +2916,22 @@  static inline bool skb_head_is_locked(const struct sk_buff *skb)
 {
 	return !skb->head_frag || skb_cloned(skb);
 }
+
+/**
+ * skb_gso_network_seglen - Return length of individual segments of a gso packet
+ *
+ * @skb: GSO skb
+ *
+ * skb_gso_network_seglen is used to determine the real size of the
+ * individual segments, including Layer3 (IP, IPv6) and L4 headers (TCP/UDP).
+ *
+ * The MAC/L2 header is not accounted for.
+ */
+static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
+{
+	unsigned int hdr_len = skb_transport_header(skb) -
+			       skb_network_header(skb);
+	return hdr_len + skb_gso_transport_seglen(skb);
+}
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index e9f1217..91c8f51 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -39,6 +39,60 @@ 
 #include <net/route.h>
 #include <net/xfrm.h>
 
+static bool ip_may_fragment(const struct sk_buff *skb)
+{
+	return unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) ||
+	       skb->local_df;
+}
+
+static bool ip_gso_exceeds_dst_mtu(const struct sk_buff *skb)
+{
+	if (skb->local_df || !skb_is_gso(skb))
+		return false;
+	return skb_gso_network_seglen(skb) > dst_mtu(skb_dst(skb));
+}
+
+static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
+{
+	if (skb->len <= mtu || skb->local_df)
+		return false;
+
+	if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
+		return false;
+
+	return true;
+}
+
+/* called if GSO skb needs to be fragmented on forward.  */
+static int ip_forward_finish_gso(struct sk_buff *skb)
+{
+	netdev_features_t features = netif_skb_features(skb);
+	struct sk_buff *segs;
+	int ret = 0;
+
+	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+	if (IS_ERR(segs)) {
+		kfree_skb(skb);
+		return -ENOMEM;
+	}
+
+	consume_skb(skb);
+
+	do {
+		struct sk_buff *nskb = segs->next;
+		int err;
+
+		segs->next = NULL;
+		err = dst_output(segs);
+
+		if (err && ret == 0)
+			ret = err;
+		segs = nskb;
+	} while (segs);
+
+	return ret;
+}
+
 static int ip_forward_finish(struct sk_buff *skb)
 {
 	struct ip_options *opt	= &(IPCB(skb)->opt);
@@ -49,6 +103,9 @@  static int ip_forward_finish(struct sk_buff *skb)
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
+	if (ip_gso_exceeds_dst_mtu(skb))
+		return ip_forward_finish_gso(skb);
+
 	return dst_output(skb);
 }
 
@@ -91,8 +148,7 @@  int ip_forward(struct sk_buff *skb)
 
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
-	if (unlikely(skb->len > mtu && !skb_is_gso(skb) &&
-		     (ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {
+	if (!ip_may_fragment(skb) && ip_exceeds_mtu(skb, mtu)) {
 		IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ef02b26..070a2fa 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -342,6 +342,20 @@  static unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
 	return mtu;
 }
 
+static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
+{
+	if (skb->len <= mtu || skb->local_df)
+		return false;
+
+	if (IP6CB(skb)->frag_max_size && IP6CB(skb)->frag_max_size > mtu)
+		return true;
+
+	if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
+		return false;
+
+	return true;
+}
+
 int ip6_forward(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -466,8 +480,7 @@  int ip6_forward(struct sk_buff *skb)
 	if (mtu < IPV6_MIN_MTU)
 		mtu = IPV6_MIN_MTU;
 
-	if ((!skb->local_df && skb->len > mtu && !skb_is_gso(skb)) ||
-	    (IP6CB(skb)->frag_max_size && IP6CB(skb)->frag_max_size > mtu)) {
+	if (ip6_pkt_too_big(skb, mtu)) {
 		/* Again, force OUTPUT device used as source address */
 		skb->dev = dst->dev;
 		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
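
The IPv4 forwarding decision described in the commit message can be modelled
in userspace as a rough sketch. Nothing here is kernel API: struct pkt, enum
fwd_action and the pkt_* helpers are hypothetical stand-ins for the sk_buff
fields the checks read. The sketch assumes that a packet may be fragmented
when DF is clear or local_df is set (the condition the patch removes from
ip_forward()), and it uses one MTU value for both checks, which is a
simplification.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the sk_buff state read by the checks. */
struct pkt {
	unsigned int len;		/* length from the network header on */
	unsigned int net_hdr_len;	/* IP header bytes */
	unsigned int l4_hdr_len;	/* TCP/UDP header bytes */
	unsigned int gso_size;		/* payload per segment, 0 if not GSO */
	bool df;			/* IPv4 DF bit set */
	bool local_df;
};

enum fwd_action { FWD_OUTPUT, FWD_ICMP_FRAG_NEEDED, FWD_SW_SEGMENT };

/* Size of one segment on the wire, without the L2 header. */
static unsigned int pkt_network_seglen(const struct pkt *p)
{
	return p->net_hdr_len + p->l4_hdr_len + p->gso_size;
}

/* Too big only if neither the packet nor its GSO segments fit the MTU. */
static bool pkt_exceeds_mtu(const struct pkt *p, unsigned int mtu)
{
	if (p->len <= mtu || p->local_df)
		return false;
	if (p->gso_size && pkt_network_seglen(p) <= mtu)
		return false;
	return true;
}

static enum fwd_action ipv4_forward_action(const struct pkt *p,
					   unsigned int mtu)
{
	bool may_fragment = !p->df || p->local_df;

	if (!may_fragment && pkt_exceeds_mtu(p, mtu))
		return FWD_ICMP_FRAG_NEEDED;
	/* DF clear: segment in software, let the output path fragment. */
	if (!p->local_df && p->gso_size && pkt_network_seglen(p) > mtu)
		return FWD_SW_SEGMENT;
	return FWD_OUTPUT;
}
```

With an outgoing MTU of 1200, a GSO packet made of 1500-byte segments yields
FWD_ICMP_FRAG_NEEDED when DF is set and FWD_SW_SEGMENT when it is clear, while
a GSO packet whose segments are 1200 bytes is passed on unchanged.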