
Multicast packet reassembly can fail

Message ID 1256683583.3153.389.camel@linux-1lbu
State Rejected, archived
Delegated to: David Miller

Commit Message

Steve Chen Oct. 27, 2009, 10:46 p.m. UTC
Multicast packet reassembly can fail

When multicast connections with multiple fragments are received by the same
node on more than one Ethernet port, a race condition between fragments
from each Ethernet port can cause fragment reassembly to fail, leading to
packet drops.  This is because packets from each Ethernet port appear identical
to the code that reassembles the IP packet.

The solution is to evaluate the Ethernet interface number in addition to all other
parameters so that every packet can be uniquely identified.  The existing
iif field in struct ipq is now used to generate the hash key, and iif is also
used for comparison in case of hash collision.

Please note that q->saddr ^ (q->iif << 5) is now being passed into
ipqhashfn to generate the hash key.  This is borrowed from the routing
code.

Signed-off-by: Steve Chen <schen@mvista.com>
Signed-off-by: Mark Huth <mhuth@mvista.com>

---

 net/ipv4/ip_fragment.c |   24 +++++++++++++++++-------
 1 files changed, 17 insertions(+), 7 deletions(-)
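
For illustration, here is a minimal user-space sketch of the effect of the hash
change.  toy_hash() is a made-up stand-in for ipqhashfn() (the real function is
a jhash over the id, protocol and addresses), and the addresses and ifindex
values are invented; the point is only that XOR-ing the shifted ifindex into
saddr separates two otherwise identical fragment streams into different buckets.

#include <stdint.h>
#include <stdio.h>

/* Made-up mixing function standing in for the kernel's ipqhashfn(). */
static uint32_t toy_hash(uint32_t a, uint32_t b, uint32_t c)
{
	return (a * 2654435761u) ^ (b * 40503u) ^ (c * 2246822519u);
}

int main(void)
{
	uint32_t id_prot = (0x1234u << 16) | 17;         /* IP ID | protocol (UDP) */
	uint32_t saddr = 0x0a000001, daddr = 0xe0010101; /* example addresses */
	uint32_t eth0 = 2, eth1 = 3;                     /* hypothetical ifindexes */

	/* Unpatched: both receive paths produce the same hash input. */
	printf("%08x %08x\n", toy_hash(id_prot, saddr, daddr),
			      toy_hash(id_prot, saddr, daddr));

	/* Patched: saddr ^ (iif << 5) normally yields different buckets. */
	printf("%08x %08x\n", toy_hash(id_prot, saddr ^ (eth0 << 5), daddr),
			      toy_hash(id_prot, saddr ^ (eth1 << 5), daddr));
	return 0;
}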




Comments

Rick Jones Oct. 27, 2009, 11:22 p.m. UTC | #1
Steve Chen wrote:
> Multicast packet reassembly can fail
> 
> When multicast connections with multiple fragments are received by the same
> node on more than one Ethernet port, a race condition between fragments
> from each Ethernet port can cause fragment reassembly to fail, leading to
> packet drops.  This is because packets from each Ethernet port appear identical
> to the code that reassembles the IP packet.
> 
> The solution is to evaluate the Ethernet interface number in addition to all other
> parameters so that every packet can be uniquely identified.  The existing
> iif field in struct ipq is now used to generate the hash key, and iif is also
> used for comparison in case of hash collision.
> 
> Please note that q->saddr ^ (q->iif << 5) is now being passed into
> ipqhashfn to generate the hash key.  This is borrowed from the routing
> code.
> 
> Signed-off-by: Steve Chen <schen@mvista.com>
> Signed-off-by: Mark Huth <mhuth@mvista.com>

It has been hours since my last good Emily Litella moment so I'll ask - isn't 
the combination of source and dest addr, protocol, IP ID and fragment offset 
supposed to take care of this?  How does the ingress interface have anything to 
do with it?

rick jones
Eric Dumazet Oct. 28, 2009, 10:18 a.m. UTC | #2
Steve Chen wrote:
> Multicast packet reassembly can fail
> 
> When multicast connections with multiple fragments are received by the same
> node on more than one Ethernet port, a race condition between fragments
> from each Ethernet port can cause fragment reassembly to fail, leading to
> packet drops.  This is because packets from each Ethernet port appear identical
> to the code that reassembles the IP packet.
> 
> The solution is to evaluate the Ethernet interface number in addition to all other
> parameters so that every packet can be uniquely identified.  The existing
> iif field in struct ipq is now used to generate the hash key, and iif is also
> used for comparison in case of hash collision.
> 
> Please note that q->saddr ^ (q->iif << 5) is now being passed into
> ipqhashfn to generate the hash key.  This is borrowed from the routing
> code.
> 
> Signed-off-by: Steve Chen <schen@mvista.com>
> Signed-off-by: Mark Huth <mhuth@mvista.com>
> 

This makes no sense to me, but I need to check the code.

How could the interface matter in IP defragmentation?
And why is multicast part of the equation?

If defrag fails, it must be for some other reason,
and probably needs another fix.

Check line 219 of net/ipv4/inet_fragment.c

#ifdef CONFIG_SMP
        /* With SMP race we have to recheck hash table, because
         * such entry could be created on other cpu, while we
         * promoted read lock to write lock.
         */
        hlist_for_each_entry(qp, n, &f->hash[hash], list) {
                if (qp->net == nf && f->match(qp, arg)) {
                        atomic_inc(&qp->refcnt);
                        write_unlock(&f->lock);
                        qp_in->last_in |= INET_FRAG_COMPLETE;   <<< HERE >>>
                        inet_frag_put(qp_in, f);
                        return qp;
                }
        }
#endif

I really wonder why we set INET_FRAG_COMPLETE here
Steve Chen Oct. 28, 2009, 1:29 p.m. UTC | #3
On Tue, 2009-10-27 at 16:22 -0700, Rick Jones wrote:
> Steve Chen wrote:
> > Multicast packet reassembly can fail
> > 
> > When multicast connections with multiple fragments are received by the same
> > node on more than one Ethernet port, a race condition between fragments
> > from each Ethernet port can cause fragment reassembly to fail, leading to
> > packet drops.  This is because packets from each Ethernet port appear identical
> > to the code that reassembles the IP packet.
> > 
> > The solution is to evaluate the Ethernet interface number in addition to all other
> > parameters so that every packet can be uniquely identified.  The existing
> > iif field in struct ipq is now used to generate the hash key, and iif is also
> > used for comparison in case of hash collision.
> > 
> > Please note that q->saddr ^ (q->iif << 5) is now being passed into
> > ipqhashfn to generate the hash key.  This is borrowed from the routing
> > code.
> > 
> > Signed-off-by: Steve Chen <schen@mvista.com>
> > Signed-off-by: Mark Huth <mhuth@mvista.com>
> 
> It has been hours since my last good Emily Litella moment so I'll ask - isn't 
> the combination of source and dest addr, protocol, IP ID and fragment offset 
> supposed to take care of this?  How does the ingress interface have anything to 
> do with it?

Here is the scenario this patch tries to address

<src node> ---->  <switch>  ----> <eth0 dest node>
                            \--->  <eth1 dest node>

For this specific case, the src/dst addresses, protocol, IP ID and fragment
offset are all identical.  The only difference is the ingress interface.
A good follow-up question would be: why would anyone in their right mind
multicast to the same destination?  Well, I don't know.  I cannot get
the people who reported the problem to tell me either.  Since someone
found a need to do this, perhaps others may find it useful too.

Regards,

Steve
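
To make the collision concrete, here is a tiny user-space sketch (hypothetical
types and values, not kernel code) of the fields that ip4_frag_match() compares
before this patch; two copies of the same datagram arriving on eth0 and eth1
produce identical keys, so their fragments are merged into one reassembly queue.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the tuple ip4_frag_match() checks. */
struct frag_key {
	uint16_t id;       /* IP ID */
	uint32_t saddr;    /* source address */
	uint32_t daddr;    /* destination (multicast group) address */
	uint8_t  protocol;
	uint32_t user;     /* defrag user, e.g. local delivery */
};

static bool key_match(const struct frag_key *a, const struct frag_key *b)
{
	return a->id == b->id && a->saddr == b->saddr &&
	       a->daddr == b->daddr && a->protocol == b->protocol &&
	       a->user == b->user;
}

int main(void)
{
	/* The same datagram as delivered by the switch to both ports. */
	struct frag_key via_eth0 = { 0x1234, 0x0a000001, 0xe0010101, 17, 0 };
	struct frag_key via_eth1 = via_eth0;

	printf("keys match: %s\n",
	       key_match(&via_eth0, &via_eth1) ? "yes" : "no");
	return 0;
}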

Eric Dumazet Oct. 28, 2009, 1:30 p.m. UTC | #4
Steve Chen wrote:
 
> I sent the specific scenario the patch tries to address to the list in
> an earlier e-mail.  Would it be beneficial if I posted the test code
> somewhere so everyone can have access to it?
> 

Yes please, I cannot find your previous mail in my archives.

Thanks

Steve Chen Oct. 28, 2009, 1:32 p.m. UTC | #5
On Wed, 2009-10-28 at 11:18 +0100, Eric Dumazet wrote:
> Steve Chen wrote:
> > Multicast packet reassembly can fail
> > 
> > When multicast connections with multiple fragments are received by the same
> > node on more than one Ethernet port, a race condition between fragments
> > from each Ethernet port can cause fragment reassembly to fail, leading to
> > packet drops.  This is because packets from each Ethernet port appear identical
> > to the code that reassembles the IP packet.
> > 
> > The solution is to evaluate the Ethernet interface number in addition to all other
> > parameters so that every packet can be uniquely identified.  The existing
> > iif field in struct ipq is now used to generate the hash key, and iif is also
> > used for comparison in case of hash collision.
> > 
> > Please note that q->saddr ^ (q->iif << 5) is now being passed into
> > ipqhashfn to generate the hash key.  This is borrowed from the routing
> > code.
> > 
> > Signed-off-by: Steve Chen <schen@mvista.com>
> > Signed-off-by: Mark Huth <mhuth@mvista.com>
> > 
> 
> This makes no sense to me, but I need to check the code.
> 
> How could the interface matter in IP defragmentation?
> And why is multicast part of the equation?
>
> If defrag fails, it must be for some other reason,
> and probably needs another fix.
> 
> Check line 219 of net/ipv4/inet_fragment.c
> 
> #ifdef CONFIG_SMP
>         /* With SMP race we have to recheck hash table, because
>          * such entry could be created on other cpu, while we
>          * promoted read lock to write lock.
>          */
>         hlist_for_each_entry(qp, n, &f->hash[hash], list) {
>                 if (qp->net == nf && f->match(qp, arg)) {
>                         atomic_inc(&qp->refcnt);
>                         write_unlock(&f->lock);
>                         qp_in->last_in |= INET_FRAG_COMPLETE;   <<< HERE >>>
>                         inet_frag_put(qp_in, f);
>                         return qp;
>                 }
>         }
> #endif
> 
> I really wonder why we set INET_FRAG_COMPLETE here

I sent the specific scenario the patch tries to address to the list in
an earlier e-mail.  Would it be beneficial if I posted the test code
somewhere so everyone can have access to it?

Regards,

Steve

Mark Huth Oct. 28, 2009, 4:55 p.m. UTC | #6
Rick Jones wrote:
> Steve Chen wrote:
>> Multicast packet reassembly can fail
>>
>> When multicast connections with multiple fragments are received by the same
>> node on more than one Ethernet port, a race condition between fragments
>> from each Ethernet port can cause fragment reassembly to fail, leading to
>> packet drops.  This is because packets from each Ethernet port appear
>> identical to the code that reassembles the IP packet.
>>
>> The solution is to evaluate the Ethernet interface number in addition to
>> all other parameters so that every packet can be uniquely identified.  The
>> existing iif field in struct ipq is now used to generate the hash key, and
>> iif is also used for comparison in case of hash collision.
>>
>> Please note that q->saddr ^ (q->iif << 5) is now being passed into
>> ipqhashfn to generate the hash key.  This is borrowed from the routing
>> code.
>>
>> Signed-off-by: Steve Chen <schen@mvista.com>
>> Signed-off-by: Mark Huth <mhuth@mvista.com>
> 
> It has been hours since my last good Emily Litella moment so I'll ask - 
> isn't the combination of source and dest addr, protocol, IP ID and 
> fragment offset supposed to take care of this?  How does the ingress 
> interface have anything to do with it?
> 
> rick jones
The problem we've seen arises only when there are multiple interfaces 
each receiving the same multicast packets.  In that case there are 
multiple packets with the same key.  Steve was able to track down a 
packet loss due to re-assembly failure under certain arrival order 
conditions.

The proposed fix eliminated the packet loss in this case.  There might 
be a different problem in the re-assembly code that we have masked by 
separating the packets into streams from each interface.  Now that you 
mention it, the re-assembly code should be robust in the face of some 
duplicated and mis-ordered packets.  We can look more closely at that code.

Mark Huth

Rick Jones Oct. 28, 2009, 5:18 p.m. UTC | #7
>> It has been hours since my last good Emily Litella moment so I'll ask 
>> - isn't the combination of source and dest addr, protocol, IP ID and 
>> fragment offset supposed to take care of this?  How does the ingress 
>> interface have anything to do with it?
>>
>> rick jones
> 
> The problem we've seen arises only when there are multiple interfaces 
> each receiving the same multicast packets.  In that case there are 
> multiple packets with the same key.  Steve was able to track down a 
> packet loss due to re-assembly failure under certain arrival order 
> conditions.
> 
> The proposed fix eliminated the packet loss in this case.  There might 
> be a different problem in the re-assembly code that we have masked by 
> separating the packets into streams from each interface.  Now that you 
> mention it, the re-assembly code should be robust in the face of some 
> duplicated and mis-ordered packets.  We can look more closely at that code.

If I understand correctly, the idea here is that when multiple interfaces 
receive fragments of copies of the same IP datagram, both copies will 
"survive" and flow up the stack?

I'm basing that on your description, and an email from Steve that reads:

> Actually, the patch tries to prevent packet drops for this exact
> scenario.  Please consider the following scenarios:
> 1.  Packets come into the fragment reassembly code in the following order:
> (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> Packets from both interfaces get reassembled and further processed.
> 
> 2. Packets can sometimes arrive (perhaps in other orders as well) as
> (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> Without this patch, eth0 frags 1/2 are overwritten by eth1 frags 1/2, and
> the packet from eth1 is dropped in the routing code.

Doesn't that rather fly in the face of the weak-end-system model followed by Linux?

I can see where scenario one leads to two IP datagrams making it up the stack, 
but I would have thought that was simply an "accident" of the situation that 
cannot reasonably be prevented, not justification to cause scenario two to send 
two datagrams up the stack.

rick jones
Steve Chen Oct. 28, 2009, 5:50 p.m. UTC | #8
On Wed, 2009-10-28 at 10:18 -0700, Rick Jones wrote:
> >> It has been hours since my last good Emily Litella moment so I'll ask 
> >> - isn't the combination of source and dest addr, protocol, IP ID and 
> >> fragment offset supposed to take care of this?  How does the ingress 
> >> interface have anything to do with it?
> >>
> >> rick jones
> > 
> > The problem we've seen arises only when there are multiple interfaces 
> > each receiving the same multicast packets.  In that case there are 
> > multiple packets with the same key.  Steve was able to track down a 
> > packet loss due to re-assembly failure under certain arrival order 
> > conditions.
> > 
> > The proposed fix eliminated the packet loss in this case.  There might 
> > be a different problem in the re-assembly code that we have masked by 
> > separating the packets into streams from each interface.  Now that you 
> > mention it, the re-assembly code should be robust in the face of some 
> > duplicated and mis-ordered packets.  We can look more closely at that code.
> 
> If I understand correctly, the idea here is to say that when multiple interfaces 
> receive fragments of copies of the same  IP datagram that both copies will 
> "survive" and flow up the stack?
> 
> I'm basing that on your description, and an email from Steve that reads:
> 
> > Actually, the patch tries to prevent packet drops for this exact
> > scenario.  Please consider the following scenarios:
> > 1.  Packets come into the fragment reassembly code in the following order:
> > (eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> > Packets from both interfaces get reassembled and further processed.
> > 
> > 2. Packets can sometimes arrive (perhaps in other orders as well) as
> > (eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> > Without this patch, eth0 frags 1/2 are overwritten by eth1 frags 1/2, and
> > the packet from eth1 is dropped in the routing code.
> 
> Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
> 
> I can see where scenario one leads to two IP datagrams making it up the stack, 
> but I would have thought that was simply an "accident" of the situation that 
> cannot reasonably be prevented, not justification to cause scenario two to send 
> two datagrams up the stack.

For scenario 2, the routing code drops the 2nd packet.  As a result, no
packets make it to the application.  If someone is willing to suggest an
alternative, I can certainly rework the patch and retest.

Regards,

Steve

Rick Jones Oct. 28, 2009, 6:10 p.m. UTC | #9
>>If I understand correctly, the idea here is to say that when multiple interfaces 
>>receive fragments of copies of the same  IP datagram that both copies will 
>>"survive" and flow up the stack?
>>
>>I'm basing that on your description, and an email from Steve that reads:
>>
>>
>>>Actually, the patch tries to prevent packet drops for this exact
>>>scenario.  Please consider the following scenarios:
>>>1.  Packets come into the fragment reassembly code in the following order:
>>>(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
>>>Packets from both interfaces get reassembled and further processed.
>>>
>>>2. Packets can sometimes arrive (perhaps in other orders as well) as
>>>(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
>>>Without this patch, eth0 frags 1/2 are overwritten by eth1 frags 1/2, and
>>>the packet from eth1 is dropped in the routing code.
>>
>>Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
>>
>>I can see where scenario one leads to two IP datagrams making it up the stack, 
>>but I would have thought that was simply an "accident" of the situation that 
>>cannot reasonably be prevented, not justification to cause scenario two to send 
>>two datagrams up the stack.
> 
> 
> For scenario 2, the routing code drops the 2nd packet.  As a result, no
> packets make it to the application.  If someone is willing to suggest an
> alternative, I can certainly rework the patch and retest.

I'll ask my next potentially Emily Litella question - don't multicast IP 
applications bind to multicast IP addresses and not interfaces?  That is to say, 
doesn't the first completed datagram get delivered to all applications on the 
host which have bound to the corresponding multicast IP (and port number...)?

rick jones
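
For reference, a minimal sketch of the receive pattern being described, with a
made-up group address and port: the socket binds to the port (on INADDR_ANY,
not a particular interface) and joins the group with IP_ADD_MEMBERSHIP, so
delivery is keyed off the group and port rather than the ingress interface.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

static int open_mcast_rx(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in sa;
	struct ip_mreq mreq;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_ANY);   /* no specific interface */
	sa.sin_port = htons(5000);                /* made-up port */
	bind(fd, (struct sockaddr *)&sa, sizeof(sa));

	mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");  /* made-up group */
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
	setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
	return fd;
}

int main(void)
{
	return open_mcast_rx() >= 0 ? 0 : 1;
}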
Steve Chen Oct. 28, 2009, 6:40 p.m. UTC | #10
On Wed, 2009-10-28 at 11:10 -0700, Rick Jones wrote:
> >>If I understand correctly, the idea here is to say that when multiple interfaces 
> >>receive fragments of copies of the same  IP datagram that both copies will 
> >>"survive" and flow up the stack?
> >>
> >>I'm basing that on your description, and an email from Steve that reads:
> >>
> >>
> >>>Actually, the patch tries to prevent packet drops for this exact
> >>>scenario.  Please consider the following scenarios:
> >>>1.  Packets come into the fragment reassembly code in the following order:
> >>>(eth0 frag1), (eth0 frag2), (eth1 frag1), (eth1 frag2)
> >>>Packets from both interfaces get reassembled and further processed.
> >>>
> >>>2. Packets can sometimes arrive (perhaps in other orders as well) as
> >>>(eth0 frag1), (eth1 frag1), (eth0 frag2), (eth1 frag2)
> >>>Without this patch, eth0 frags 1/2 are overwritten by eth1 frags 1/2, and
> >>>the packet from eth1 is dropped in the routing code.
> >>
> >>Doesn't that rather fly in the face of the weak-end-system model followed by Linux?
> >>
> >>I can see where scenario one leads to two IP datagrams making it up the stack, 
> >>but I would have thought that was simply an "accident" of the situation that 
> >>cannot reasonably be prevented, not justification to cause scenario two to send 
> >>two datagrams up the stack.
> > 
> > 
> > For scenario 2, the routing code drops the 2nd packet.  As a result, no
> > packets make it to the application.  If someone is willing to suggest an
> > alternative, I can certainly rework the patch and retest.
> 
> I'll ask my next potentially Emily Litella question - don't multicast IP 
> applications bind to multicast IP addresses and not interfaces?  That is to say, 
> doesn't the first completed datagram get delivered to all applications on the 
> host which have bound to the corresponding multicast IP (and port number...)?
I actually didn't know who Emily Litella was until today.  This mailing
list is great not just for learning networking stuff :).  In the test
code I received, one of the setup steps is to configure the IP address
of the interface on which the application expects the packet.  It
appears to bind to the interface, based on that casual observation.  I'll
have to study the code in detail to be able to say for sure.

Regards,

Steve


David Stevens Oct. 28, 2009, 8:12 p.m. UTC | #11
I haven't gone through the entire thread yet, but I should point
out that this appears to break regular IP fragmentation for
unicast packets. There is no restriction whatsoever that
fragments from a remote destination that are actually for
the same datagram need to be routed on the same paths
and received on the same input interface.

For the multicast case, if they are from the same datagram,
it doesn't matter how you got them. If it's a different datagram
with the same ID, which can happen anyway, the checksum
should fail (at least (64K-1) of 64K cases). I don't see a special
case here, other than that you can tell by the interface if it was
actually a distinct datagram with the same ID in the multicast
case (and only in multicast and only if the different interfaces
are not in the same multicast routing domain).

NACK.

                                        +-DLS

David Miller Oct. 29, 2009, 4:57 a.m. UTC | #12
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 28 Oct 2009 11:18:24 +0100

> Check line 219 of net/ipv4/inet_fragment.c
> 
> #ifdef CONFIG_SMP
>         /* With SMP race we have to recheck hash table, because
>          * such entry could be created on other cpu, while we
>          * promoted read lock to write lock.
>          */
>         hlist_for_each_entry(qp, n, &f->hash[hash], list) {
>                 if (qp->net == nf && f->match(qp, arg)) {
>                         atomic_inc(&qp->refcnt);
>                         write_unlock(&f->lock);
>                         qp_in->last_in |= INET_FRAG_COMPLETE;   <<< HERE >>>
>                         inet_frag_put(qp_in, f);
>                         return qp;
>                 }
>         }
> #endif
> 
> I really wonder why we set INET_FRAG_COMPLETE here

What has happened here is that another cpu created an identical
frag entry before we took the write lock.

So we're letting that other cpu's entry stand, and will release
our local one and not use it at all.

Setting INET_FRAG_COMPLETE does two things:

1) It makes sure input frag processing skips this entry if such
   code paths happen to see it for some reason.

2) INET_FRAG_COMPLETE must be set when inet_frag_destroy() gets
   called by inet_frag_put() when it drops the refcount to zero.
   There is an assertion on INET_FRAG_COMPLETE in inet_frag_destroy.

Hope that clears things up.
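
A user-space paraphrase of the path being described (the names mirror the
kernel's, but the types and bodies are simplified stand-ins, not the real
code): inet_frag_put() drops the refcount and, at zero, calls
inet_frag_destroy(), which complains if the queue was never marked
INET_FRAG_COMPLETE; that is why the losing queue in the SMP recheck is marked
complete before being put.

#include <stdio.h>

#define INET_FRAG_COMPLETE 4   /* simplified stand-in flag */

struct inet_frag_queue {
	int refcnt;
	int last_in;
};

static void inet_frag_destroy(struct inet_frag_queue *q)
{
	if (!(q->last_in & INET_FRAG_COMPLETE))
		printf("WARN: destroying a queue not marked COMPLETE\n");
	/* in the kernel: stop the timer, free queued skbs, free q */
}

static void inet_frag_put(struct inet_frag_queue *q)
{
	if (--q->refcnt == 0)
		inet_frag_destroy(q);
}

int main(void)
{
	/* The queue that lost the SMP race: mark it complete, then put it. */
	struct inet_frag_queue qp_in = { .refcnt = 1, .last_in = 0 };

	qp_in.last_in |= INET_FRAG_COMPLETE;   /* the line Eric asked about */
	inet_frag_put(&qp_in);                 /* drops to 0, destroys cleanly */
	return 0;
}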
Eric Dumazet Oct. 29, 2009, 5:31 a.m. UTC | #13
David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 28 Oct 2009 11:18:24 +0100
> 
>> Check line 219 of net/ipv4/inet_fragment.c
>>
>> #ifdef CONFIG_SMP
>>         /* With SMP race we have to recheck hash table, because
>>          * such entry could be created on other cpu, while we
>>          * promoted read lock to write lock.
>>          */
>>         hlist_for_each_entry(qp, n, &f->hash[hash], list) {
>>                 if (qp->net == nf && f->match(qp, arg)) {
>>                         atomic_inc(&qp->refcnt);
>>                         write_unlock(&f->lock);
>>                         qp_in->last_in |= INET_FRAG_COMPLETE;   <<< HERE >>>
>>                         inet_frag_put(qp_in, f);
>>                         return qp;
>>                 }
>>         }
>> #endif
>>
>> I really wonder why we set INET_FRAG_COMPLETE here
> 
> What has happened here is that another cpu created an identical
> frag entry before we took the write lock.
> 
> So we're letting that other cpu's entry stand, and will release
> our local one and not use it at all.
> 
> Setting INET_FRAG_COMPLETE does two things:
> 
> 1) It makes sure input frag processing skips this entry if such
>    code paths happen to see it for some reason.
> 
> 2) INET_FRAG_COMPLETE must be set when inet_frag_destroy() gets
>    called by inet_frag_put() when it drops the refcount to zero.
>    There is an assertion on INET_FRAG_COMPLETE in inet_frag_destroy.
> 
> Hope that clears things up.


Yes thanks David, this is clear now.
Herbert Xu Oct. 29, 2009, 6:04 p.m. UTC | #14
Steve Chen <schen@mvista.com> wrote:
>
> of the interface on which the application expects the packet.  It
> appears to bind to the interface, based on that casual observation.  I'll
> have to study the code in detail to be able to say for sure.

Well if it does bind to the interface then that explains the
failure. And the fix is "if it hurts, don't do it" :)

Cheers,
Steve Chen Oct. 29, 2009, 6:33 p.m. UTC | #15
On Thu, 2009-10-29 at 14:04 -0400, Herbert Xu wrote:
> Steve Chen <schen@mvista.com> wrote:
> >
> > of the interface on which the application expects the packet.  It
> > appears to bind to the interface, based on that casual observation.  I'll
> > have to study the code in detail to be able to say for sure.
> 
> Well if it does bind to the interface then that explains the
> failure. And the fix is "if it hurts, don't do it" :)

I like that solution.  Maybe I can even use the first letter of every
line to send a "special" message to the customer :)

Steve

Steve Chen Nov. 2, 2009, 6:36 p.m. UTC | #16
On Thu, 2009-10-29 at 14:04 -0400, Herbert Xu wrote:
> Steve Chen <schen@mvista.com> wrote:
> >
> > of the interface on which the application expects the packet.  It
> > appears to bind to the interface, based on that casual observation.  I'll
> > have to study the code in detail to be able to say for sure.
> 
> Well if it does bind to the interface then that explains the
> failure. And the fix is "if it hurts, don't do it" :)
> 
> Cheers,

The packet drop was tracked down to rp_filter.  All packets were received as
expected after disabling rp_filter.  Thank you all for the input.

Regards,

Steve
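
For anyone hitting the same thing, a minimal sketch of turning rp_filter off
programmatically; it is equivalent to "sysctl -w net.ipv4.conf.all.rp_filter=0"
(plus the per-interface knobs under /proc/sys/net/ipv4/conf/<ifname>/rp_filter),
and whether relaxing reverse-path filtering is acceptable depends on the
deployment.

#include <stdio.h>

int main(void)
{
	const char *knobs[] = {
		"/proc/sys/net/ipv4/conf/all/rp_filter",
		"/proc/sys/net/ipv4/conf/default/rp_filter",
	};

	for (int i = 0; i < 2; i++) {
		FILE *f = fopen(knobs[i], "w");

		if (!f) {
			perror(knobs[i]);
			continue;
		}
		fputs("0\n", f);   /* 0 = no source validation */
		fclose(f);
	}
	return 0;
}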


Patch

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 575f9bd..2de0035 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -90,6 +90,7 @@  static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 struct ip4_create_arg {
 	struct iphdr *iph;
 	u32 user;
+	int iif;
 };
 
 static unsigned int ipqhashfn(__be16 id, __be32 saddr, __be32 daddr, u8 prot)
@@ -104,7 +105,8 @@  static unsigned int ip4_hashfn(struct inet_frag_queue *q)
 	struct ipq *ipq;
 
 	ipq = container_of(q, struct ipq, q);
-	return ipqhashfn(ipq->id, ipq->saddr, ipq->daddr, ipq->protocol);
+	return ipqhashfn(ipq->id, ipq->saddr ^ (ipq->iif << 5), ipq->daddr,
+			 ipq->protocol);
 }
 
 static int ip4_frag_match(struct inet_frag_queue *q, void *a)
@@ -117,6 +119,7 @@  static int ip4_frag_match(struct inet_frag_queue *q, void *a)
 			qp->saddr == arg->iph->saddr &&
 			qp->daddr == arg->iph->daddr &&
 			qp->protocol == arg->iph->protocol &&
+			qp->iif == arg->iif &&
 			qp->user == arg->user);
 }
 
@@ -140,6 +143,7 @@  static void ip4_frag_init(struct inet_frag_queue *q, void *a)
 	qp->saddr = arg->iph->saddr;
 	qp->daddr = arg->iph->daddr;
 	qp->user = arg->user;
+	qp->iif = arg->iif;
 	qp->peer = sysctl_ipfrag_max_dist ?
 		inet_getpeer(arg->iph->saddr, 1) : NULL;
 }
@@ -219,7 +223,8 @@  out:
 /* Find the correct entry in the "incomplete datagrams" queue for
  * this IP datagram, and create new one, if nothing is found.
  */
-static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
+static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user,
+				  int iif)
 {
 	struct inet_frag_queue *q;
 	struct ip4_create_arg arg;
@@ -227,9 +232,11 @@  static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 
 	arg.iph = iph;
 	arg.user = user;
+	arg.iif = iif;
 
 	read_lock(&ip4_frags.lock);
-	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
+	hash = ipqhashfn(iph->id, iph->saddr ^ (iif << 5), iph->daddr,
+			 iph->protocol);
 
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
 	if (q == NULL)
@@ -433,10 +440,9 @@  static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		qp->q.fragments = skb;
 
 	dev = skb->dev;
-	if (dev) {
-		qp->iif = dev->ifindex;
+	if (dev)
 		skb->dev = NULL;
-	}
+
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	atomic_add(skb->truesize, &qp->q.net->mem);
@@ -572,6 +578,7 @@  int ip_defrag(struct sk_buff *skb, u32 user)
 {
 	struct ipq *qp;
 	struct net *net;
+	int iif  = 0;
 
 	net = skb->dev ? dev_net(skb->dev) : dev_net(skb_dst(skb)->dev);
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMREQDS);
@@ -580,8 +587,12 @@  int ip_defrag(struct sk_buff *skb, u32 user)
 	if (atomic_read(&net->ipv4.frags.mem) > net->ipv4.frags.high_thresh)
 		ip_evictor(net);
 
+	if (skb->dev)
+		iif = skb->dev->ifindex;
+
 	/* Lookup (or create) queue header */
-	if ((qp = ip_find(net, ip_hdr(skb), user)) != NULL) {
+	qp = ip_find(net, ip_hdr(skb), user, iif);
+	if (qp != NULL) {
 		int ret;
 
 		spin_lock(&qp->q.lock);