Message ID | 1464013425-31129-1-git-send-email-alan.davey@metaswitch.com |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
Headers | show |
From: Alan Davey <alan.davey@metaswitch.com> Date: Mon, 23 May 2016 15:23:45 +0100 > One of the bugs documented in the raw(7) man page is as follows: When the > IP_HDRINCL option is set, datagrams will not be fragmented and are limited to > the interface MTU. > > This patch fixes the bug by removing the check for "length > rt->dst.dev->mtu" > in raw_send_hdrinc() (net/ipv4/raw.c). Datagrams are no longer limited to the > interface MTU size if the IP_HDRINCL option is set, but are fragmented, if > necessary, in the same way as all other datagrams. > > Signed-off-by: Alan Davey <alan.davey@metaswitch.com> This is not a bug, it's a feature and it's how RAW ipv4 sockets have behaved for two decades. If the user wants to use hdr inclusion, he can send multiple frames and set the fragmentation bits appropriately. I'm not applying this patch.
Hi David Thank you for your email. I understand raw IP sockets have worked this way for a long time, but I think that there is real benefit in this patch for little risk. Please take a look at the following and let me know what you think. - The current behaviour is counter-intuitive (fragmentation takes place in all other cases) and therefore different to what everyone expects. - Consequently, everyone has to fix the same bug and work around it by fragmenting in their application (we have seen this happen several dozen times just in our experience). - The end result is that the fragmentation code ends up being implemented in many places, instead of just once, using the existing kernel code. - The patch is a low risk fix; removing 5 lines of code and using existing code to perform the fragmentation. It should be back-compatible because o existing code written to work round the feature will continue to work o it seems very unlikely that anyone relies on the current behaviour of oversized packets being rejected, and would not prefer the new behavior. Therefore, whether it is a bug or a feature, I think there is value in fixing the behaviour. Regards Alan -----Original Message----- From: David Miller [mailto:davem@davemloft.net] Sent: 31 May 2016 19:39 To: Alan Davey <Alan.Davey@metaswitch.com> Cc: netdev@vger.kernel.org; kuznet@ms2.inr.ac.ru; jmorris@namei.org; yoshfuji@linux-ipv6.org; kaber@trash.net Subject: Re: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set. From: Alan Davey <alan.davey@metaswitch.com> Date: Mon, 23 May 2016 15:23:45 +0100 > One of the bugs documented in the raw(7) man page is as follows: When > the IP_HDRINCL option is set, datagrams will not be fragmented and are > limited to the interface MTU. > > This patch fixes the bug by removing the check for "length > rt->dst.dev->mtu" > in raw_send_hdrinc() (net/ipv4/raw.c). Datagrams are no longer > limited to the interface MTU size if the IP_HDRINCL option is set, but > are fragmented, if necessary, in the same way as all other datagrams. > > Signed-off-by: Alan Davey <alan.davey@metaswitch.com> This is not a bug, it's a feature and it's how RAW ipv4 sockets have behaved for two decades. If the user wants to use hdr inclusion, he can send multiple frames and set the fragmentation bits appropriately. I'm not applying this patch.
Hi Please do not use "top-posting" here. Alan Davey wrote: > Hi David > > Thank you for your email. I understand raw IP sockets have worked this way for a long time, but I think that there is real benefit in this patch for little risk. Please take a look at the following and let me know what you think. > > - The current behaviour is counter-intuitive (fragmentation takes place in all other cases) and therefore different to what everyone expects. I do NEVER expect fragmentation occurs with IP_HDRINCL, at least.
Hi, We have to re-create IPv4 fragmentation in user-space in ospfd in Quagga cause of this, on raw IPv4/OSPF sockets. It'd be really nice to be able to deprecate that code and just have existing kernel code do that for us. It should only happen if the app indicates it though, e.g. via a sockopt, to avoid compatibility issues. regards, Paul On Wed, 8 Jun 2016, Alan Davey wrote: > - Consequently, everyone has to fix the same bug and work around it > by fragmenting in their application (we have seen this happen > several dozen times just in our experience). > > - The end result is that the fragmentation code ends up being > implemented in many places, instead of just once, using the existing > kernel code. > > - The patch is a low risk fix; removing 5 lines of code and using existing code to perform the fragmentation. It should be back-compatible because > o existing code written to work round the feature will continue to work > o it seems very unlikely that anyone relies on the current behaviour of oversized packets being rejected, and would not prefer the new behavior. > > Therefore, whether it is a bug or a feature, I think there is value in > fixing the behaviour. > Regards > Alan > > -----Original Message----- > From: David Miller [mailto:davem@davemloft.net] > Sent: 31 May 2016 19:39 > To: Alan Davey <Alan.Davey@metaswitch.com> > Cc: netdev@vger.kernel.org; kuznet@ms2.inr.ac.ru; jmorris@namei.org; yoshfuji@linux-ipv6.org; kaber@trash.net > Subject: Re: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set. > > From: Alan Davey <alan.davey@metaswitch.com> > Date: Mon, 23 May 2016 15:23:45 +0100 > >> One of the bugs documented in the raw(7) man page is as follows: When >> the IP_HDRINCL option is set, datagrams will not be fragmented and are >> limited to the interface MTU. >> >> This patch fixes the bug by removing the check for "length > rt->dst.dev->mtu" >> in raw_send_hdrinc() (net/ipv4/raw.c). Datagrams are no longer >> limited to the interface MTU size if the IP_HDRINCL option is set, but >> are fragmented, if necessary, in the same way as all other datagrams. >> >> Signed-off-by: Alan Davey <alan.davey@metaswitch.com> > > This is not a bug, it's a feature and it's how RAW ipv4 sockets have behaved for two decades. > > If the user wants to use hdr inclusion, he can send multiple frames and set the fragmentation bits appropriately. > > I'm not applying this patch. >
From: Alan Davey <Alan.Davey@metaswitch.com> Date: Wed, 8 Jun 2016 08:41:31 +0000 > - The current behaviour is counter-intuitive (fragmentation takes > - place in all other cases) and therefore different to what > - everyone expects. But it's what all existing applications must expect, and as you have seen in these replied they absolutely do. You cannot just break things on people like this.
From: David Miller [mailto:davem@davemloft.net] Sent: 08 June 2016 18:26 >> - The current behaviour is counter-intuitive (fragmentation takes >> - place in all other cases) and therefore different to what >> - everyone expects. > > But it's what all existing applications must expect, and as you have seen in these replied they absolutely do. > > You cannot just break things on people like this. The only case that would break is that where an application relies on the existing (documented as a bug) feature of getting an EMSGSIZE return code in the case of an over-sized packet. Applications that perform their own fragmentation would be unaffected. I think that the benefit of the patch, in moving all fragmentation and reassembly into the kernel, outweigh the very small chance that applications rely on the send of an over-sized packet failing. What is your thinking on taking the patch? Regards Alan
On 31.05.2016 20:39, David Miller wrote: > From: Alan Davey <alan.davey@metaswitch.com> > Date: Mon, 23 May 2016 15:23:45 +0100 > >> One of the bugs documented in the raw(7) man page is as follows: When the >> IP_HDRINCL option is set, datagrams will not be fragmented and are limited to >> the interface MTU. >> >> This patch fixes the bug by removing the check for "length > rt->dst.dev->mtu" >> in raw_send_hdrinc() (net/ipv4/raw.c). Datagrams are no longer limited to the >> interface MTU size if the IP_HDRINCL option is set, but are fragmented, if >> necessary, in the same way as all other datagrams. >> >> Signed-off-by: Alan Davey <alan.davey@metaswitch.com> > > This is not a bug, it's a feature and it's how RAW ipv4 sockets have behaved > for two decades. > > If the user wants to use hdr inclusion, he can send multiple frames and set > the fragmentation bits appropriately. Actually this is difficult, as only the kernel is in control of the IP_ID generator and it should never leak its state to user space. The probability of IP_ID collisions would increase drastically because of this, even imagine different programs doing so concurrently. Basically those checks are pretty much unnecessary anyway, if we set raw sockets by default into inet->pmtudisc IPPMTUDISC_DO mode, so a user can easily control the fragmentation on a raw socket themselves based on already provided infrastructure. Bye, Hannes
On Wed, 15 Jun 2016, Alan Davey wrote: > The only case that would break is that where an application relies on > the existing (documented as a bug) feature of getting an EMSGSIZE > return code in the case of an over-sized packet. Applications that > perform their own fragmentation would be unaffected. If this doesn't break existing applications that are doing fragmentation in userspace on raw sockets (e.g. Quagga ospfd), that's better. As per previous email, I'd love to be able to get rid of that code and have the kernel do it for me. However, I also don't want to have to do anything other non-trivial to that code either. :) The issue for us is, how would we know on any given host whether the kernel will do the fragmentation or whether ospfd has to do it? We need to be able to probe for that capability, surely? I guess "send an oversized packet and see if we get EMSGSIZE comes back or not" could be one way, though the 'man 7 raw' man page ony my system says that is returned on raw socket if the /IP/ max size (64 KiB) is exceeded - it doesn't say anything about that being returned if output MTU is exceeded. Also, that implies a failed packet send on startup on older kernels and having to rejiggle the packet we've constructed to retry. Some kind of "I want kernel to do the fragmentation" sockopt, that was guaranteed to error with ENOPROTOOPT on non-implementing kernels, that we could try set when creating the socket, would seem simpler/nicer from my POV of my corner of user-space. ? regards,
From: Paul Jakma <paul@jakma.org> Date: Fri, 8 Jul 2016 13:55:11 +0100 (BST) > On Wed, 15 Jun 2016, Alan Davey wrote: > >> The only case that would break is that where an application relies on >> the existing (documented as a bug) feature of getting an EMSGSIZE >> return code in the case of an over-sized packet. Applications that >> perform their own fragmentation would be unaffected. > > If this doesn't break existing applications that are doing > fragmentation in userspace on raw sockets (e.g. Quagga ospfd), that's > better. > > As per previous email, I'd love to be able to get rid of that code and > have the kernel do it for me. However, I also don't want to have to do > anything other non-trivial to that code either. :) > > The issue for us is, how would we know on any given host whether the > kernel will do the fragmentation or whether ospfd has to do it? We > need to be able to probe for that capability, surely? The fact is, regardless of whether you could probe for the capability or not, you have to keep the fragmentation code around forever. And that is yet another reason I do not want to add this change at all. It doesn't make any existing server any simpler, in fact it makes them all more complicated because not only do they keep the fragmentation code, they also get new logic to test for the feature that would allow them to avoid using it. Sorry, there is no way I am adding this, it's a net lose.
Hello! I can tell why it has not been done initially. Main problem was in IP options, which can be present in raw packet. They have to be properly fragmented, some options are to be deleted on fragments. Not that it is too complicated, it is just boring and ugly and inconsistent with IP_HDRINCL logic. So, it was done in logically consistent way: did you order IP_HDRINCL? Please, then deal with fragments yourself. Alexey
>> On Wed, 15 Jun 2016, Alan Davey wrote: >> >>> The only case that would break is that where an application relies on >>> the existing (documented as a bug) feature of getting an EMSGSIZE >>> return code in the case of an over-sized packet. Applications that >>> perform their own fragmentation would be unaffected. >> >> If this doesn't break existing applications that are doing >> fragmentation in userspace on raw sockets (e.g. Quagga ospfd), that's >> better. >> >> As per previous email, I'd love to be able to get rid of that code and >> have the kernel do it for me. However, I also don't want to have to do >> anything other non-trivial to that code either. :) >> >> The issue for us is, how would we know on any given host whether the >> kernel will do the fragmentation or whether ospfd has to do it? We >> need to be able to probe for that capability, surely? > >The fact is, regardless of whether you could probe for the capability or not, you have to keep the fragmentation code around forever. > >And that is yet another reason I do not want to add this change at all. > >It doesn't make any existing server any simpler, in fact it makes them all more complicated because not only do they keep the fragmentation code, they also >get new logic to test for the feature that would allow them to avoid using it. > >Sorry, there is no way I am adding this, it's a net lose. David, I accept you don't want to take this patch. However, I don't yet understand why, so I have a few more questions (which will hopefully help me produce patches that are likely to be accepted in future). Adding the patch means that some existing applications will continue to contain their own fragmentation code, which becomes unnecessary. This is OK; those applications will continue to fragment packets, and will continue to work with an updated kernel. Do you agree? Not adding the patch means that - all future applications have to continue to implement their own fragmentation code, duplicating that which already exists in the kernel - hardware vendors may choose to apply the patch themselves (they do not want to implement function that already exists), but would surely prefer to have it in the standard kernel. So I still don't understand what part of taking the patch would have a negative result on existing applications, it should be neutral to them. Could you help me out here and explain? Regards Alan
From: Alan Davey <Alan.Davey@metaswitch.com> Date: Tue, 12 Jul 2016 12:34:07 +0000 > - all future applications have to continue to implement their own fragmentation code, duplicating that which already exists in the kernel They have to do this anyways, don't you see this? Otherwise they don't support %99 of the kernels out there. Even if I put this change in now, it would take years for it to propagate to even a moderate percentage of Linux machines out there. And applications doing a "we only support kernels > version x.y.z" is a situation I don't want to promote if you think that's a reasonable way to handle this.
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 8d22de7..de690b3 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -351,11 +351,6 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, struct rtable *rt = *rtp; int hlen, tlen; - if (length > rt->dst.dev->mtu) { - ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport, - rt->dst.dev->mtu); - return -EMSGSIZE; - } if (flags&MSG_PROBE) goto out;
One of the bugs documented in the raw(7) man page is as follows: When the IP_HDRINCL option is set, datagrams will not be fragmented and are limited to the interface MTU. This patch fixes the bug by removing the check for "length > rt->dst.dev->mtu" in raw_send_hdrinc() (net/ipv4/raw.c). Datagrams are no longer limited to the interface MTU size if the IP_HDRINCL option is set, but are fragmented, if necessary, in the same way as all other datagrams. Signed-off-by: Alan Davey <alan.davey@metaswitch.com> --- net/ipv4/raw.c | 5 ----- 1 file changed, 5 deletions(-)