diff mbox

net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

Message ID 20160709132230.GD2067@breakpoint.cc
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Florian Westphal July 9, 2016, 1:22 p.m. UTC
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> I'd appreciate any suggestion how to determine traffic is local OTHER
> THAN testing IPSKB_FORWARDED; If we have such a way, there wouldn't be an
> impact on local traffic.
> 
> > What about setting IPCB FORWARD flag in iptunnel_xmit if
> > skb->skb_iif != 0... instead?
> 
> Can you please elaborate?

[ not even compile tested ]

Comments

Shmulik Ladkani July 10, 2016, 7:51 a.m. UTC | #1
On Sat, 9 Jul 2016 15:22:30 +0200, fw@strlen.de wrote:
> Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> > I'd appreciate any suggestion how to determine traffic is local OTHER
> > THAN testing IPSKB_FORWARDED; If we have such a way, there wouldn't be an
> > impact on local traffic.
> > 
> > > What about setting IPCB FORWARD flag in iptunnel_xmit if
> > > skb->skb_iif != 0... instead?
> > 
> > Can you please elaborate?
> 
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -65,6 +65,7 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
>  	struct net_device *dev = skb->dev;
>  	struct iphdr *iph;
>  	int err;
> +	bool fwd = skb->skb_iif > 0;
>  
>  	skb_scrub_packet(skb, xnet);
>  
> @@ -72,6 +73,9 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
>  	skb_dst_set(skb, &rt->dst);
>  	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
>  
> +	if (fwd)
> +		IPCB(skb)->flags = IPSKB_FORWARDED;
> +

Thanks.

OK with the general idea; However I don't want to abuse IPSKB_FORWARDED
as the skb is not really "ip forwarded", and it might have undesirable
affects, such as in 'ip_skb_dst_mtu' which follows.

How about a new IPSKB bit flag, say IPSKB_TUNNEL_FORWARDED, or a
different way of marking?

Regards,
Shmulik
Florian Westphal July 11, 2016, 8:15 a.m. UTC | #2
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> OK with the general idea; However I don't want to abuse IPSKB_FORWARDED
> as the skb is not really "ip forwarded", and it might have undesirable
> affects, such as in 'ip_skb_dst_mtu' which follows.

I don't see a problem with that. Hannes?
Hannes Frederic Sowa July 11, 2016, 1:32 p.m. UTC | #3
On 11.07.2016 04:15, Florian Westphal wrote:
> Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
>> OK with the general idea; However I don't want to abuse IPSKB_FORWARDED
>> as the skb is not really "ip forwarded", and it might have undesirable
>> affects, such as in 'ip_skb_dst_mtu' which follows.
> 
> I don't see a problem with that. Hannes?

If we segment/fragment along the way we need to use the
non-pmtu/interface-mtu, so this change would be correct in regard to the
mtu selection.

Bye,
Hannes
Shmulik Ladkani July 12, 2016, 5:56 a.m. UTC | #4
On Sat, 9 Jul 2016 15:22:30 +0200 Florian Westphal <fw@strlen.de> wrote:
> Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> > I'd appreciate any suggestion how to determine traffic is local OTHER
> > THAN testing IPSKB_FORWARDED; If we have such a way, there wouldn't be an
> > impact on local traffic.
> >   
> > > What about setting IPCB FORWARD flag in iptunnel_xmit if
> > > skb->skb_iif != 0... instead?  

I've came up with a suggestion that does not abuse IPSKB_FORWARDED,
while properly addressing the use case (and similar ones), without
introducing the cost of entering 'skb_gso_validate_mtu' in the local
case.

How about:

@@ -220,12 +220,15 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 				struct sk_buff *skb, unsigned int mtu)
 {
 	netdev_features_t features;
+	int local_trusted_gso;
 	struct sk_buff *segs;
 	int ret = 0;
 
-	/* common case: locally created skb or seglen is <= mtu */
-	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
-	      skb_gso_validate_mtu(skb, mtu))
+	local_trusted_gso = (IPCB(skb)->flags & IPSKB_FORWARDED) == 0 &&
+			    !(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY);
+	/* common case: locally created skb from a trusted gso source or
+	 * seglen is <= mtu */
+	if (local_trusted_gso || skb_gso_validate_mtu(skb, mtu))
 		return ip_finish_output2(net, sk, skb);
 
 	/* Slowpath -  GSO segment length is exceeding the dst MTU.

This well addresses the usecase where we have gso-skb arriving from an
untrusted source, thus its gso_size is out of our control (e.g. tun/tap,
macvtap, af_packet, xen-netfront...).

Locally "gso trusted" skbs (the common case) will NOT suffer the
additional (possibly costy) call to 'skb_gso_validate_mtu'.

Also, if IPSKB_FORWARDED is true, behavior stays exactly the same.

Regards,
Shmulik
Shmulik Ladkani July 13, 2016, 2 p.m. UTC | #5
Hi Florian, Hannes,

On Tue, 12 Jul 2016 08:56:56 +0300 Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> On Sat, 9 Jul 2016 15:22:30 +0200 Florian Westphal <fw@strlen.de> wrote:
> > >     
> > > > What about setting IPCB FORWARD flag in iptunnel_xmit if
> > > > skb->skb_iif != 0... instead?    
> 
> I've came up with a suggestion that does not abuse IPSKB_FORWARDED,
> while properly addressing the use case (and similar ones), without
> introducing the cost of entering 'skb_gso_validate_mtu' in the local
> case.
> 
> How about:
> 
> @@ -220,12 +220,15 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
>  				struct sk_buff *skb, unsigned int mtu)
>  {
>  	netdev_features_t features;
> +	int local_trusted_gso;
>  	struct sk_buff *segs;
>  	int ret = 0;
>  
> -	/* common case: locally created skb or seglen is <= mtu */
> -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> -	      skb_gso_validate_mtu(skb, mtu))
> +	local_trusted_gso = (IPCB(skb)->flags & IPSKB_FORWARDED) == 0 &&
> +			    !(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY);
> +	/* common case: locally created skb from a trusted gso source or
> +	 * seglen is <= mtu */
> +	if (local_trusted_gso || skb_gso_validate_mtu(skb, mtu))
>  		return ip_finish_output2(net, sk, skb);
>  
>  	/* Slowpath -  GSO segment length is exceeding the dst MTU.
> 
> This well addresses the usecase where we have gso-skb arriving from an
> untrusted source, thus its gso_size is out of our control (e.g. tun/tap,
> macvtap, af_packet, xen-netfront...).
> 
> Locally "gso trusted" skbs (the common case) will NOT suffer the
> additional (possibly costy) call to 'skb_gso_validate_mtu'.
> 
> Also, if IPSKB_FORWARDED is true, behavior stays exactly the same.

Any commnets regarding the latest suggestion above?
I'd like to post it as v2 - if it is in the right direction.

It handles the problem of gso_size values which are not in host's
control, it addresses the usecase described, and has a benefit of not
overloading IPSKB_FORWARDED with a new semantic that might be hard to
maintain.

PS:
Also, if we'd like to pinpoint it even further, we can:

local_trusted_gso = (IPCB(skb)->flags & IPSKB_FORWARDED) == 0 &&
		    (!sk || !(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY));

Which ensures only the following conditions go to the expensive
skb_gso_validate_mtu:

1. IPSKB_FORWARDED is on
2. IPSKB_FORWARDED is off, but sk exists and gso_size is untrusted.
   Meaning: we have a packet arriving from higher layers (sk is set)
   with a gso_size out of host's control.

This fine-tuining leaves standard l2 bridging case (e.g 2x taps bridged)
of a gso skb unaffected, as sk would be NULL.

Many thanks,
Shmulik
Hannes Frederic Sowa July 14, 2016, 1:12 p.m. UTC | #6
Hello Shmulik,

On Wed, Jul 13, 2016, at 16:00, Shmulik Ladkani wrote:
> Hi Florian, Hannes,
> 
> On Tue, 12 Jul 2016 08:56:56 +0300 Shmulik Ladkani
> <shmulik.ladkani@ravellosystems.com> wrote:
> > On Sat, 9 Jul 2016 15:22:30 +0200 Florian Westphal <fw@strlen.de> wrote:
> > > >     
> > > > > What about setting IPCB FORWARD flag in iptunnel_xmit if
> > > > > skb->skb_iif != 0... instead?    
> > 
> > I've came up with a suggestion that does not abuse IPSKB_FORWARDED,
> > while properly addressing the use case (and similar ones), without
> > introducing the cost of entering 'skb_gso_validate_mtu' in the local
> > case.
> > 
> > How about:
> > 
> > @@ -220,12 +220,15 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
> >  				struct sk_buff *skb, unsigned int mtu)
> >  {
> >  	netdev_features_t features;
> > +	int local_trusted_gso;
> >  	struct sk_buff *segs;
> >  	int ret = 0;
> >  
> > -	/* common case: locally created skb or seglen is <= mtu */
> > -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> > -	      skb_gso_validate_mtu(skb, mtu))
> > +	local_trusted_gso = (IPCB(skb)->flags & IPSKB_FORWARDED) == 0 &&
> > +			    !(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY);
> > +	/* common case: locally created skb from a trusted gso source or
> > +	 * seglen is <= mtu */
> > +	if (local_trusted_gso || skb_gso_validate_mtu(skb, mtu))
> >  		return ip_finish_output2(net, sk, skb);
> >  
> >  	/* Slowpath -  GSO segment length is exceeding the dst MTU.
> > 
> > This well addresses the usecase where we have gso-skb arriving from an
> > untrusted source, thus its gso_size is out of our control (e.g. tun/tap,
> > macvtap, af_packet, xen-netfront...).
> > 
> > Locally "gso trusted" skbs (the common case) will NOT suffer the
> > additional (possibly costy) call to 'skb_gso_validate_mtu'.
> > 
> > Also, if IPSKB_FORWARDED is true, behavior stays exactly the same.

Sorry for the late reply, I am right now travelling and can't review
that closely.

> Any commnets regarding the latest suggestion above?
> I'd like to post it as v2 - if it is in the right direction.
> 
> It handles the problem of gso_size values which are not in host's
> control, it addresses the usecase described, and has a benefit of not
> overloading IPSKB_FORWARDED with a new semantic that might be hard to
> maintain.

I liked the fact that setting IPSKB_FORWARDED was only contained in
vxlan and as such wouldn't have as much impact. It was more logically
easy to review for me actually.

> PS:
> Also, if we'd like to pinpoint it even further, we can:
> 
> local_trusted_gso = (IPCB(skb)->flags & IPSKB_FORWARDED) == 0 &&
> 		    (!sk || !(skb_shinfo(skb)->gso_type & SKB_GSO_DODGY));

This also looks valid but too random. It seems to be a mix of random
conditions to make it work. ;)

> 
> Which ensures only the following conditions go to the expensive
> skb_gso_validate_mtu:
> 
> 1. IPSKB_FORWARDED is on
> 2. IPSKB_FORWARDED is off, but sk exists and gso_size is untrusted.
>    Meaning: we have a packet arriving from higher layers (sk is set)
>    with a gso_size out of host's control.

When can this really happen? In general we don't want to refragment gso
skb's and I think we can only make an exception for vxlan or udp.

> This fine-tuining leaves standard l2 bridging case (e.g 2x taps bridged)
> of a gso skb unaffected, as sk would be NULL.

Bridging does not in general orphan the socket, no?

Bye,
Hannes
Shmulik Ladkani July 14, 2016, 2:13 p.m. UTC | #7
Hi,

On Thu, 14 Jul 2016 15:12:07 +0200, hannes@stressinduktion.org wrote:
> I liked the fact that setting IPSKB_FORWARDED was only contained in
> vxlan and as such wouldn't have as much impact. It was more logically
> easy to review for me actually.

I agree here. It is rather safe and to the point.

I'm trying to exaust other alternatives because it has one potential
drawback: the name IPSKB_FORWARDED suggests ipv4 forwarding had
happened. Indeed, current setters of IPSKB_FORWARDED are ip_forward and
ip_mr_forward.

If we set IPSKB_FORWARDED in iptunnel_xmit, with packet not being ipv4
forwarded (e.g. bridged from some ingress device to a tunnel device), it
presents a nuance whose impact is yet to be determined.

For example, what about a packet that gets encapsulated and sent to a
multicast destination? The condition controlling mc loop-back in
ip_mc_output is affected by the flag.

> > Which ensures only the following conditions go to the expensive
> > skb_gso_validate_mtu:
> > 
> > 1. IPSKB_FORWARDED is on
> > 2. IPSKB_FORWARDED is off, but sk exists and gso_size is untrusted.
> >    Meaning: we have a packet arriving from higher layers (sk is set)
> >    with a gso_size out of host's control.
> 
> When can this really happen? In general we don't want to refragment gso
> skb's and I think we can only make an exception for vxlan or udp.

When IPSKB_FORWARDED is off, we'll get SKB_GSO_DODGY if packet
originally arrived from tap/macvtap/packet and it did NOT pass ipv4
forwarding (e.g bridges: tap0 to eth0 bridge, or tap0 to vxlan0 bridge).

The rationale: in the SKB_GSO_DODGY cases, the gso_size is given by
the user's virtio-net header, which is not in kernel's control.

This exactly resembles the usecase: tap0 gives packets with gso_size
unsuitable for encapsulation and segmentation. I have no control on
the source that gives those packets.

If (1) it does not make sense, or (2) considered too broad-spectrum to
asses, then we can go with the safer IPSKB_FORWARDED approach.

Let me know.

Regards,
Shmulik
Hannes Frederic Sowa July 14, 2016, 11:32 p.m. UTC | #8
Hello,

On Thu, Jul 14, 2016, at 16:13, Shmulik Ladkani wrote:
> Hi,
> 
> On Thu, 14 Jul 2016 15:12:07 +0200, hannes@stressinduktion.org wrote:
> > I liked the fact that setting IPSKB_FORWARDED was only contained in
> > vxlan and as such wouldn't have as much impact. It was more logically
> > easy to review for me actually.
> 
> I agree here. It is rather safe and to the point.
> 
> I'm trying to exaust other alternatives because it has one potential
> drawback: the name IPSKB_FORWARDED suggests ipv4 forwarding had
> happened. Indeed, current setters of IPSKB_FORWARDED are ip_forward and
> ip_mr_forward.
> 
> If we set IPSKB_FORWARDED in iptunnel_xmit, with packet not being ipv4
> forwarded (e.g. bridged from some ingress device to a tunnel device), it
> presents a nuance whose impact is yet to be determined.

The reason why I rather don't want to see a general solution is that
currently gre and ipip are pmtu-safe and I rather would like to keep the
old behavior that gre and ipip packets don't get treated alike. I
couldn't check the current throughly but it seems they would also be
affected by this change.

My idea was to put those IPSKB_FORWARD just into the vxlan and geneve
endpoints which could be seen as UDP endpoints and don't forward and
care about pmtu events.

> For example, what about a packet that gets encapsulated and sent to a
> multicast destination? The condition controlling mc loop-back in
> ip_mc_output is affected by the flag.

I have to look at that more closely after my trip.

What about adding a new flag with a proper name. It shouldn't affect
performance in any way if you check for two bits instead of just the
FORWARDED one.

> > > Which ensures only the following conditions go to the expensive
> > > skb_gso_validate_mtu:
> > > 
> > > 1. IPSKB_FORWARDED is on
> > > 2. IPSKB_FORWARDED is off, but sk exists and gso_size is untrusted.
> > >    Meaning: we have a packet arriving from higher layers (sk is set)
> > >    with a gso_size out of host's control.
> > 
> > When can this really happen? In general we don't want to refragment gso
> > skb's and I think we can only make an exception for vxlan or udp.
> 
> When IPSKB_FORWARDED is off, we'll get SKB_GSO_DODGY if packet
> originally arrived from tap/macvtap/packet and it did NOT pass ipv4
> forwarding (e.g bridges: tap0 to eth0 bridge, or tap0 to vxlan0 bridge).
> 
> The rationale: in the SKB_GSO_DODGY cases, the gso_size is given by
> the user's virtio-net header, which is not in kernel's control.

But we can assume it being correct based on the mtu of the interface
which is not in control of the current namespace instance or in another
vm. We should act accordingly to normal ethernet bridging here, IMHO.

> This exactly resembles the usecase: tap0 gives packets with gso_size
> unsuitable for encapsulation and segmentation. I have no control on
> the source that gives those packets.

My argumentation is that vxlan and geneve can be seen as endpoints and
don't have proper pmtu handling. Thus I would be fine with adding hacks
to just make it working. For GRE I would like to keep strict internet
pmtu handling and also keep the normal ethernet semantics in place, that
we should drop packets if another interface did transmit on an ethernet
bridge with a too big frame size.

> If (1) it does not make sense, or (2) considered too broad-spectrum to
> asses, then we can go with the safer IPSKB_FORWARDED approach.

Please have a look at my reasoning. I probably can follow-up more deeply
from Sunday on.

Thanks,
Hannes
diff mbox

Patch

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -65,6 +65,7 @@  void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
 	struct net_device *dev = skb->dev;
 	struct iphdr *iph;
 	int err;
+	bool fwd = skb->skb_iif > 0;
 
 	skb_scrub_packet(skb, xnet);
 
@@ -72,6 +73,9 @@  void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
 	skb_dst_set(skb, &rt->dst);
 	memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
 
+	if (fwd)
+		IPCB(skb)->flags = IPSKB_FORWARDED;
+
 	/* Push down and install the IP header. */
 	skb_push(skb, sizeof(struct iphdr));
 	skb_reset_network_header(skb);