diff mbox

TCP-MD5 checksum failure on x86_64 SMP

Message ID 1273085598.2367.233.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet May 5, 2010, 6:53 p.m. UTC
Le mercredi 05 mai 2010 à 23:33 +0530, Bhaskar Dutta a écrit :

> Hi,
> 
> TSO, GSO and SG are already turned off.
> rx/tx checksumming is on, but that shouldn't matter, right?
> 
> # ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: off
> 
> The bad packets are very small in size, most have no data at all (<300 bytes).
> 
> After adding some logs to kernel 2.6.31-12, it seems that
> tcp_v4_md5_hash_skb (function that calculates the md5 hash) is
> (might?) getting corrupt.
> 
> The tcp4_pseudohdr (bp = &hp->md5_blk.ip4) structure's saddr, daddr
> and len fields get modified to different values towards the end of the
> tcp_v4_md5_hash_skb function whenever there is a checksum error.
> 
> The tcp4_pseudohdr (bp) is within the tcp_md5sig_pool (hp), which is
> filled up by tcp_get_md5sig_pool (which calls per_cpu_ptr).
> 
> Using a local copy of the tcp4_pseudohdr in the same function
> tcp_v4_md5_hash_skb (copied all fields from the original
> tcp4_pseudohdr within the tcp_md5sig_pool) and calculating the md5
> checksum with the local  tcp4_pseudohdr seems to solve the issue
> (don't see bad packets for a hours in load tests, and without the
> change I can see them instantaneously in the load tests).
> 
> I am still unable to figure out how this is happening. Please let me
> know if you have any pointers.

I am not familiar with this code, but I suspect same per_cpu data can be
used at both time by a sender (process context) and by a receiver
(softirq context).

To trigger this, you need at least two active md5 sockets.

tcp_get_md5sig_pool() should probably disable bh to make sure current
cpu wont be preempted by softirq processing


Something like :


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bhaskar Dutta May 6, 2010, 11:55 a.m. UTC | #1
On Thu, May 6, 2010 at 12:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mercredi 05 mai 2010 à 23:33 +0530, Bhaskar Dutta a écrit :
>
>> Hi,
>>
>> TSO, GSO and SG are already turned off.
>> rx/tx checksumming is on, but that shouldn't matter, right?
>>
>> # ethtool -k eth0
>> Offload parameters for eth0:
>> rx-checksumming: on
>> tx-checksumming: on
>> scatter-gather: off
>> tcp segmentation offload: off
>> udp fragmentation offload: off
>> generic segmentation offload: off
>>
>> The bad packets are very small in size, most have no data at all (<300 bytes).
>>
>> After adding some logs to kernel 2.6.31-12, it seems that
>> tcp_v4_md5_hash_skb (function that calculates the md5 hash) is
>> (might?) getting corrupt.
>>
>> The tcp4_pseudohdr (bp = &hp->md5_blk.ip4) structure's saddr, daddr
>> and len fields get modified to different values towards the end of the
>> tcp_v4_md5_hash_skb function whenever there is a checksum error.
>>
>> The tcp4_pseudohdr (bp) is within the tcp_md5sig_pool (hp), which is
>> filled up by tcp_get_md5sig_pool (which calls per_cpu_ptr).
>>
>> Using a local copy of the tcp4_pseudohdr in the same function
>> tcp_v4_md5_hash_skb (copied all fields from the original
>> tcp4_pseudohdr within the tcp_md5sig_pool) and calculating the md5
>> checksum with the local  tcp4_pseudohdr seems to solve the issue
>> (don't see bad packets for a hours in load tests, and without the
>> change I can see them instantaneously in the load tests).
>>
>> I am still unable to figure out how this is happening. Please let me
>> know if you have any pointers.
>
> I am not familiar with this code, but I suspect same per_cpu data can be
> used at both time by a sender (process context) and by a receiver
> (softirq context).
>
> To trigger this, you need at least two active md5 sockets.
>
> tcp_get_md5sig_pool() should probably disable bh to make sure current
> cpu wont be preempted by softirq processing
>
>
> Something like :
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index fb5c66b..e232123 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
>        struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
>        if (!ret)
>                put_cpu();
> +       else
> +               local_bh_disable();
>        return ret;
>  }
>
>  static inline void             tcp_put_md5sig_pool(void)
>  {
>        __tcp_put_md5sig_pool();
> +       local_bh_enable();
>        put_cpu();
>  }
>
>
>

I put in the above change and ran some load tests with around 50
active TCP connections doing MD5.
I could see only 1 bad packet in 30 min (earlier the problem used to
occur instantaneously and repeatedly).

I think there is another possibility of being preempted when calling
tcp_alloc_md5sig_pool()
this function releases the spinlock when calling __tcp_alloc_md5sig_pool().

I will run some more tests after changing the  tcp_alloc_md5sig_pool
and see if the problem is completely resolved.

Thanks a lot for your help!
Bhaskar
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 6, 2010, 12:06 p.m. UTC | #2
Le jeudi 06 mai 2010 à 17:25 +0530, Bhaskar Dutta a écrit :

> I put in the above change and ran some load tests with around 50
> active TCP connections doing MD5.
> I could see only 1 bad packet in 30 min (earlier the problem used to
> occur instantaneously and repeatedly).
> 
> I think there is another possibility of being preempted when calling
> tcp_alloc_md5sig_pool()
> this function releases the spinlock when calling __tcp_alloc_md5sig_pool().
> 
> I will run some more tests after changing the  tcp_alloc_md5sig_pool
> and see if the problem is completely resolved.
> 

This code should be completely rewritten for linux-2.6.35, its very ugly
and over complex, yet it is not scalable.

It could use true percpu data, with no central lock or refcount.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 7, 2010, 5:04 a.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 06 May 2010 14:06:26 +0200

> Le jeudi 06 mai 2010 à 17:25 +0530, Bhaskar Dutta a écrit :
> 
>> I put in the above change and ran some load tests with around 50
>> active TCP connections doing MD5.
>> I could see only 1 bad packet in 30 min (earlier the problem used to
>> occur instantaneously and repeatedly).
>> 
>> I think there is another possibility of being preempted when calling
>> tcp_alloc_md5sig_pool()
>> this function releases the spinlock when calling __tcp_alloc_md5sig_pool().
>> 
>> I will run some more tests after changing the  tcp_alloc_md5sig_pool
>> and see if the problem is completely resolved.
>> 
> 
> This code should be completely rewritten for linux-2.6.35, its very ugly
> and over complex, yet it is not scalable.
> 
> It could use true percpu data, with no central lock or refcount.

Yes I've always disliked the way the TCP MD5 pool stuff is coded, it's
frankly crap and this is like the 5th major bug that had to get fixed
in it. :-)

But let's fix this as simply as possible in net-2.6 and -stable, so Eric
let me know when you have a tested patch for me to apply.

Thanks a lot!

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 7, 2010, 5:32 a.m. UTC | #4
Le jeudi 06 mai 2010 à 22:04 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>

> > This code should be completely rewritten for linux-2.6.35, its very ugly
> > and over complex, yet it is not scalable.
> > 
> > It could use true percpu data, with no central lock or refcount.
> 
> Yes I've always disliked the way the TCP MD5 pool stuff is coded, it's
> frankly crap and this is like the 5th major bug that had to get fixed
> in it. :-)
> 
> But let's fix this as simply as possible in net-2.6 and -stable, so Eric
> let me know when you have a tested patch for me to apply.

There are so many things ...

I am comtemplating commit aa133076

__tcp_alloc_md5sig_pool() now do a : 

p = kzalloc(sizeof(*p), sk->sk_allocation);

But later call :

hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);

(GFP_KERNEL allocations for sure)


tcp_v4_parse_md5_keys() also use :

p = kzalloc(sizeof(*p), sk->sk_allocation);

right after a (possibly sleeping) copy_from_user()

Oh well...


I'll test the already suggested patch before official submission,
thanks.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 7, 2010, 5:39 a.m. UTC | #5
Le jeudi 06 mai 2010 à 17:25 +0530, Bhaskar Dutta a écrit :
> On Thu, May 6, 2010 at 12:23 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > I am not familiar with this code, but I suspect same per_cpu data can be
> > used at both time by a sender (process context) and by a receiver
> > (softirq context).
> >
> > To trigger this, you need at least two active md5 sockets.
> >
> > tcp_get_md5sig_pool() should probably disable bh to make sure current
> > cpu wont be preempted by softirq processing
> >
> >
> > Something like :
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index fb5c66b..e232123 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
> >        struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
> >        if (!ret)
> >                put_cpu();
> > +       else
> > +               local_bh_disable();
> >        return ret;
> >  }
> >
> >  static inline void             tcp_put_md5sig_pool(void)
> >  {
> >        __tcp_put_md5sig_pool();
> > +       local_bh_enable();
> >        put_cpu();
> >  }
> >
> >
> >
> 
> I put in the above change and ran some load tests with around 50
> active TCP connections doing MD5.
> I could see only 1 bad packet in 30 min (earlier the problem used to
> occur instantaneously and repeatedly).
> 


> I think there is another possibility of being preempted when calling
> tcp_alloc_md5sig_pool()
> this function releases the spinlock when calling __tcp_alloc_md5sig_pool().
> 
> I will run some more tests after changing the  tcp_alloc_md5sig_pool
> and see if the problem is completely resolved.

I cant see a race with spinlock in
tcp_alloc_md5sig_pool/__tcp_alloc_md5sig_pool().

We allocate structures for all cpus, so preemption.migration should be
OK

Could you elaborate please ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lars Eggert May 7, 2010, 8:46 a.m. UTC | #6
On 2010-5-6, at 15:06, Eric Dumazet wrote:
> This code should be completely rewritten for linux-2.6.35, its very ugly
> and over complex, yet it is not scalable.

Let me quickly point out that the IETF has recently replaced TCP-MD5 with TCP-AO (http://tools.ietf.org/html/draft-ietf-tcpm-tcp-auth-opt-11) which is days or weeks away from being published as an RFC. (The draft has been approved and it waiting for the RFC Editor to finish the final copy-editing.)

You'll obviously still want TCP-MD5 support for talking to existing equipment, but significant cycles would IMO be better spent on TCP-AO.

Lars
Eric Dumazet May 7, 2010, 8:55 a.m. UTC | #7
Le vendredi 07 mai 2010 à 11:46 +0300, Lars Eggert a écrit :
> On 2010-5-6, at 15:06, Eric Dumazet wrote:
> > This code should be completely rewritten for linux-2.6.35, its very ugly
> > and over complex, yet it is not scalable.
> 
> Let me quickly point out that the IETF has recently replaced TCP-MD5 with TCP-AO (http://tools.ietf.org/html/draft-ietf-tcpm-tcp-auth-opt-11) which is days or weeks away from being published as an RFC. (The draft has been approved and it waiting for the RFC Editor to finish the final copy-editing.)
> 
> You'll obviously still want TCP-MD5 support for talking to existing equipment, but significant cycles would IMO be better spent on TCP-AO.
> 

Thanks for this info !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 7, 2010, 9:12 a.m. UTC | #8
From: Lars Eggert <lars.eggert@nokia.com>
Date: Fri, 7 May 2010 11:46:40 +0300

> You'll obviously still want TCP-MD5 support for talking to existing
> equipment, but significant cycles would IMO be better spent on
> TCP-AO.

Code we have and users use which is unstable and crashes is more
important to work on and fix than code we don't have which
user's therefore don't use.

You're wrong from just about every possible angle.

Whoever finds AO useful will work on it, just as was the case with MD5
support.  And right now, that "whoever" is definitely not us. :-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger May 7, 2010, 5:14 p.m. UTC | #9
On Fri, 07 May 2010 07:32:09 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le jeudi 06 mai 2010 à 22:04 -0700, David Miller a écrit :
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> 
> > > This code should be completely rewritten for linux-2.6.35, its very ugly
> > > and over complex, yet it is not scalable.
> > > 
> > > It could use true percpu data, with no central lock or refcount.
> > 
> > Yes I've always disliked the way the TCP MD5 pool stuff is coded, it's
> > frankly crap and this is like the 5th major bug that had to get fixed
> > in it. :-)
> > 
> > But let's fix this as simply as possible in net-2.6 and -stable, so Eric
> > let me know when you have a tested patch for me to apply.
> 
> There are so many things ...
> 
> I am comtemplating commit aa133076
> 
> __tcp_alloc_md5sig_pool() now do a : 
> 
> p = kzalloc(sizeof(*p), sk->sk_allocation);
> 
> But later call :
> 
> hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
> 
> (GFP_KERNEL allocations for sure)
> 
> 
> tcp_v4_parse_md5_keys() also use :
> 
> p = kzalloc(sizeof(*p), sk->sk_allocation);
> 
> right after a (possibly sleeping) copy_from_user()
> 
> Oh well...
> 
> 
> I'll test the already suggested patch before official submission,
> thanks.

Forget the per cpu data; the pool should just be scrapped.

The only reason the pool exists is that the crypto hash state which
should just be moved into the md5_info (88 bytes).  The pseudo
header can just be generated on the stack before passing to the crypto
code.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 7, 2010, 5:21 p.m. UTC | #10
Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :

> Forget the per cpu data; the pool should just be scrapped.
> 
> The only reason the pool exists is that the crypto hash state which
> should just be moved into the md5_info (88 bytes).  The pseudo
> header can just be generated on the stack before passing to the crypto
> code.


Sure, but I'm afraid there is no generic API do do that (if we want to
reuse crypto/md5.c code)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger May 7, 2010, 5:36 p.m. UTC | #11
On Fri, 07 May 2010 19:21:33 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :
> 
> > Forget the per cpu data; the pool should just be scrapped.
> > 
> > The only reason the pool exists is that the crypto hash state which
> > should just be moved into the md5_info (88 bytes).  The pseudo
> > header can just be generated on the stack before passing to the crypto
> > code.
> 
> 
> Sure, but I'm afraid there is no generic API do do that (if we want to
> reuse crypto/md5.c code).

It looks like the pool is just an optimization to avoid opening too
many crypto API connections.  This should only be an issue if offloading
MD5.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 7, 2010, 9:40 p.m. UTC | #12
Le vendredi 07 mai 2010 à 10:36 -0700, Stephen Hemminger a écrit :
> On Fri, 07 May 2010 19:21:33 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :
> > 
> > > Forget the per cpu data; the pool should just be scrapped.
> > > 
> > > The only reason the pool exists is that the crypto hash state which
> > > should just be moved into the md5_info (88 bytes).  The pseudo
> > > header can just be generated on the stack before passing to the crypto
> > > code.
> > 
> > 
> > Sure, but I'm afraid there is no generic API do do that (if we want to
> > reuse crypto/md5.c code).
> 
> It looks like the pool is just an optimization to avoid opening too
> many crypto API connections.  This should only be an issue if offloading
> MD5.

You mean we could allocate two contexts per socket, one for tx path, one
for rx path, but TCP-MD5 implementors chose to use percpu allocations to
factorize them. They should have allocated two contexts per cpu (one for
process context, preemption disabled, one for BH context)

As you said, this could be allocated on stack, with some changes to
crypto API I guess. Since TCP is not a module, md5 is also static, so
there is no module loading involved.

struct crypto_tfm *__crypto_alloc_tfm(struct crypto_alg *alg, u32
type,u32 mask)

-->

struct crypto_tfm *__crypto_alloc_tfm_onstack(struct crypto_alg *alg,
u32 type, u32 mask, void *storage, size_t storage_max_length)


Or a direct plug to crypto/md5.c functions and hand crafted context.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bijay Singh May 10, 2010, 2:55 p.m. UTC | #13
Hi,
I had noticed the corruption in the context and actually did what is mentioned.

I allocated the context on the stack and plugged in the md5.c functions. I was able to temporarily solve the problem, all this before I got a response on this thread.

But now I have seeing another problem, when i change the MTU on the interface from 1500 to 4470 none of the message from the peer get thru and I get hash failed message. I am wondering if this is another bug getting hit in this scenario.

Regards,
Bijay

On 08-May-2010, at 3:10 AM, Eric Dumazet wrote:

> Le vendredi 07 mai 2010 à 10:36 -0700, Stephen Hemminger a écrit :
>> On Fri, 07 May 2010 19:21:33 +0200
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> 
>>> Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :
>>> 
>>>> Forget the per cpu data; the pool should just be scrapped.
>>>> 
>>>> The only reason the pool exists is that the crypto hash state which
>>>> should just be moved into the md5_info (88 bytes).  The pseudo
>>>> header can just be generated on the stack before passing to the crypto
>>>> code.
>>> 
>>> 
>>> Sure, but I'm afraid there is no generic API do do that (if we want to
>>> reuse crypto/md5.c code).
>> 
>> It looks like the pool is just an optimization to avoid opening too
>> many crypto API connections.  This should only be an issue if offloading
>> MD5.
> 
> You mean we could allocate two contexts per socket, one for tx path, one
> for rx path, but TCP-MD5 implementors chose to use percpu allocations to
> factorize them. They should have allocated two contexts per cpu (one for
> process context, preemption disabled, one for BH context)
> 
> As you said, this could be allocated on stack, with some changes to
> crypto API I guess. Since TCP is not a module, md5 is also static, so
> there is no module loading involved.
> 
> struct crypto_tfm *__crypto_alloc_tfm(struct crypto_alg *alg, u32
> type,u32 mask)
> 
> -->
> 
> struct crypto_tfm *__crypto_alloc_tfm_onstack(struct crypto_alg *alg,
> u32 type, u32 mask, void *storage, size_t storage_max_length)
> 
> 
> Or a direct plug to crypto/md5.c functions and hand crafted context.
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 10, 2010, 3:18 p.m. UTC | #14
Le lundi 10 mai 2010 à 14:55 +0000, Bijay Singh a écrit :
> Hi,
> I had noticed the corruption in the context and actually did what is mentioned.
> 
> I allocated the context on the stack and plugged in the md5.c functions. I was able to temporarily solve the problem, all this before I got a response on this thread.
> 
> But now I have seeing another problem, when i change the MTU on the interface from 1500 to 4470 none of the message from the peer get thru and I get hash failed message. I am wondering if this is another bug getting hit in this scenario.

Thats very fine, but you mix very different problems.

Step by step resolution is required, and clean patches too, because
plugging md5.c functions is not an option for stable series :)

Obviously, nobody seriously used TCP-MD5 on linux, but you...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bijay Singh May 10, 2010, 5:27 p.m. UTC | #15
Hi Eric,

Didn't intend to mix  the issues. It was a hack intended to calm the nerves. I am going to apply the proper patches asap.

About the latest problem of MD5 not working with MTU set to 4470. I noticed this when i needed to change the MTU for some other  purpose. 

Since it was a production box, i have to first set up my box with the right NIC card to reproduce this and try debugging it. In the meantime any ques will  help.

Thanks,
Bijay
 
On 10-May-2010, at 8:48 PM, Eric Dumazet wrote:

> Le lundi 10 mai 2010 à 14:55 +0000, Bijay Singh a écrit :
>> Hi,
>> I had noticed the corruption in the context and actually did what is mentioned.
>> 
>> I allocated the context on the stack and plugged in the md5.c functions. I was able to temporarily solve the problem, all this before I got a response on this thread.
>> 
>> But now I have seeing another problem, when i change the MTU on the interface from 1500 to 4470 none of the message from the peer get thru and I get hash failed message. I am wondering if this is another bug getting hit in this scenario.
> 
> Thats very fine, but you mix very different problems.
> 
> Step by step resolution is required, and clean patches too, because
> plugging md5.c functions is not an option for stable series :)
> 
> Obviously, nobody seriously used TCP-MD5 on linux, but you...
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bijay Singh May 11, 2010, 4:08 a.m. UTC | #16
Hi Eric,

I guess that makes me the enviable one. So I am keen to test out this feature completely, as long as I know what to do as a next step, directions, patches.

Thanks,
Bijay

On 10-May-2010, at 8:48 PM, Eric Dumazet wrote:

> Le lundi 10 mai 2010 à 14:55 +0000, Bijay Singh a écrit :
>> Hi,
>> I had noticed the corruption in the context and actually did what is mentioned.
>> 
>> I allocated the context on the stack and plugged in the md5.c functions. I was able to temporarily solve the problem, all this before I got a response on this thread.
>> 
>> But now I have seeing another problem, when i change the MTU on the interface from 1500 to 4470 none of the message from the peer get thru and I get hash failed message. I am wondering if this is another bug getting hit in this scenario.
> 
> Thats very fine, but you mix very different problems.
> 
> Step by step resolution is required, and clean patches too, because
> plugging md5.c functions is not an option for stable series :)
> 
> Obviously, nobody seriously used TCP-MD5 on linux, but you...
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 11, 2010, 6:27 a.m. UTC | #17
Le mardi 11 mai 2010 à 04:08 +0000, Bijay Singh a écrit :
> Hi Eric,
> 
> I guess that makes me the enviable one. So I am keen to test out this feature

>  completely, as long as I know what to do as a next step, directions, patches.
> 

MTU > 4000 is not reliable because of high order allocations on typical
NICS. I am afraid you need NIC able to deliver page fragments.

Its working here (32bit kernel) with a tg3 NIC, but I got following
message :

ifconfig eth3 mtu 9000
...
[51492.936500] 167731 total pagecache pages
[51492.936500] 0 pages in swap cache
[51492.936500] Swap cache stats: add 0, delete 0, find 0/0
[51492.936500] Free swap  = 4192928kB
[51492.936500] Total swap = 4192928kB
[51492.936500] 1114110 pages RAM
[51492.936500] 885761 pages HighMem
[51492.936500] 77073 pages reserved
[51492.936500] 134483 pages shared
[51492.936500] 159131 pages non-shared
[51492.953027] tg3 0000:14:04.1: eth3: Using a smaller RX standard ring.
Only 183 out of 511 buffers were allocated successfully

$ ethtool -g eth3
Ring parameters for eth3:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		183
RX Mini:	0
RX Jumbo:	0
TX:		511

$ cat /proc/buddyinfo
Node 0, zone      DMA      5      2      1      1      2      2      2      1      0      1      0 
Node 0, zone   Normal   4285   1823    248     73      9      5      0      0      0      0      0 
Node 0, zone  HighMem     97    199    921    583    383    261    155    117     69     41    649 

I know that if I try to stress RX path, I'll get failures.

Could you explain me why you need both big MTUS and TCP-MD5 ?



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bijay Singh May 11, 2010, 8:23 a.m. UTC | #18
I need MD5 for my BGP sessions and need the jumbo packets for the IS-IS peering. MTU of 1500 results in LSPs higher that 1500 getting dropped at the peering router. 

On 11-May-2010, at 11:57 AM, Eric Dumazet wrote:

> Le mardi 11 mai 2010 à 04:08 +0000, Bijay Singh a écrit :
>> Hi Eric,
>> 
>> I guess that makes me the enviable one. So I am keen to test out this feature
> 
>> completely, as long as I know what to do as a next step, directions, patches.
>> 
> 
> MTU > 4000 is not reliable because of high order allocations on typical
> NICS. I am afraid you need NIC able to deliver page fragments.
> 
> Its working here (32bit kernel) with a tg3 NIC, but I got following
> message :
> 
> ifconfig eth3 mtu 9000
> ...
> [51492.936500] 167731 total pagecache pages
> [51492.936500] 0 pages in swap cache
> [51492.936500] Swap cache stats: add 0, delete 0, find 0/0
> [51492.936500] Free swap  = 4192928kB
> [51492.936500] Total swap = 4192928kB
> [51492.936500] 1114110 pages RAM
> [51492.936500] 885761 pages HighMem
> [51492.936500] 77073 pages reserved
> [51492.936500] 134483 pages shared
> [51492.936500] 159131 pages non-shared
> [51492.953027] tg3 0000:14:04.1: eth3: Using a smaller RX standard ring.
> Only 183 out of 511 buffers were allocated successfully
> 
> $ ethtool -g eth3
> Ring parameters for eth3:
> Pre-set maximums:
> RX:		511
> RX Mini:	0
> RX Jumbo:	0
> TX:		511
> Current hardware settings:
> RX:		183
> RX Mini:	0
> RX Jumbo:	0
> TX:		511
> 
> $ cat /proc/buddyinfo
> Node 0, zone      DMA      5      2      1      1      2      2      2      1      0      1      0 
> Node 0, zone   Normal   4285   1823    248     73      9      5      0      0      0      0      0 
> Node 0, zone  HighMem     97    199    921    583    383    261    155    117     69     41    649 
> 
> I know that if I try to stress RX path, I'll get failures.
> 
> Could you explain me why you need both big MTUS and TCP-MD5 ?
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller May 16, 2010, 7:30 a.m. UTC | #19
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 7 May 2010 10:36:39 -0700

> On Fri, 07 May 2010 19:21:33 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> Le vendredi 07 mai 2010 à 10:14 -0700, Stephen Hemminger a écrit :
>> 
>> > Forget the per cpu data; the pool should just be scrapped.
>> > 
>> > The only reason the pool exists is that the crypto hash state which
>> > should just be moved into the md5_info (88 bytes).  The pseudo
>> > header can just be generated on the stack before passing to the crypto
>> > code.
>> 
>> 
>> Sure, but I'm afraid there is no generic API do do that (if we want to
>> reuse crypto/md5.c code).
> 
> It looks like the pool is just an optimization to avoid opening too
> many crypto API connections.  This should only be an issue if offloading
> MD5.

It's an issue because creating a crypto API context is expensive, so this
influences our connection rates with MD5.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fb5c66b..e232123 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1221,12 +1221,15 @@  struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
 	struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
 	if (!ret)
 		put_cpu();
+	else
+		local_bh_disable();
 	return ret;
 }
 
 static inline void             tcp_put_md5sig_pool(void)
 {
 	__tcp_put_md5sig_pool();
+	local_bh_enable();
 	put_cpu();
 }