
conntrack: Reduce conntrack count in nf_conntrack_free()

Message ID 49C8CCF4.5050104@cosmosbay.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Eric Dumazet March 24, 2009, 12:07 p.m. UTC
Joakim Tjernlund wrote:
> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
>> Joakim Tjernlund a écrit :
>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15:
>>>> Joakim Tjernlund wrote:
>>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
>>>>>
>>>>>
>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
>>>>>>> to the board I see:
>>>>>>>
>>>>>> That means that something is leaking conntrack references, most 
>>> likely
>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess 
>>> would
>>>>>> be the ucc_geth driver.
>>>>>>
>>>>> Mucking around with the ucc_geth driver I found that if I:
>>>>>  - Move TX from IRQ to NAPI context
>>>>>  - double the weight.
>>>>>  - after booting up, wait a few mins until the JFFS2 GC kernel 
> thread 
>>> has 
>>>>> stopped
>>>>>    scanning the FS 
>>>>>
>>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops.
>>>>> Does this seem right to you guys?
>>>> No. As I said, something seems to be leaking packets. You should be
>>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
>>>> If that *doesn't* show any signs of a leak, please run "conntrack -E"
>>>> to capture the conntrack events before the "table full" message
>>>> appears and post the output.
>>> skbuff does not differ much, but others do
>>>
>>> Before ping:
>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 27  0 : slabdata      0      0      0
>>>   skbuff_head_cache     20     20    192   20    1 : tunables  120 60  0 : slabdata      1      1      0
>>>   size-64              731    767     64   59    1 : tunables  120 60  0 : slabdata     13     13      0
>>>   nf_conntrack          10     19    208   19    1 : tunables  120 60  0 : slabdata      1      1      0
>>>
>>> During ping:
>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 27  0 : slabdata      0      0      0
>>>   skbuff_head_cache     40     40    192   20    1 : tunables  120 60  0 : slabdata      2      2      0
>>>   size-64             8909   8909     64   59    1 : tunables  120 60  0 : slabdata    151    151      0
>>>   nf_conntrack        5111   5111    208   19    1 : tunables  120 60  0 : slabdata    269    269      0
>>>
>>> This feels more like the freeing of conntrack objects are delayed and 
>>> builds up when ping flooding.
>>>
>>> Don't have "conntrack -E" for my embedded board so that will have to 
> wait 
>>> a bit longer.
>> I dont understand how your ping can use so many conntrack entries...
>>
>> Then, as I said yesterday, I believe you have a RCU delay, because of
>> a misbehaving driver or something...
>>
>> grep RCU .config
> grep RCU .config
> # RCU Subsystem
> CONFIG_CLASSIC_RCU=y
> # CONFIG_TREE_RCU is not set
> # CONFIG_PREEMPT_RCU is not set
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PREEMPT_RCU_TRACE is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
> 
>> grep CONFIG_SMP .config
> grep CONFIG_SMP .config
> # CONFIG_SMP is not set
> 
>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 
> 80)
>> as a workaround. It should force a quiescent state after 1000 freed 
> conntracks.
> 
> right, doing this almost killed all conntrack messages, had to stress it 
> pretty
> hard before I saw handful "nf_conntrack: table full, dropping packet"
> 
> RCU is not my cup of tea, do you have any ideas were to look?

In a stress situation, you feed deleted conntracks to call_rcu() faster than
blimit allows them to be freed (10 real freeings per RCU softirq invocation).

So with the default qhimark of 10000, this means about 10000 conntracks
can sit in RCU queues (per CPU) before being really freed.

Only when the queue hits 10000 does RCU enter a special mode and free all
queued items, instead of a small batch of 10.
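
For reference, the batching knobs discussed here live near the top of
kernel/rcuclassic.c. Roughly (a sketch from memory, not a verbatim copy of
the file):

/* kernel/rcuclassic.c -- approximate sketch of the RCU batching knobs */
static int blimit = 10;      /* max callbacks freed per RCU softirq invocation */
static int qhimark = 10000;  /* queue length above which batch limiting is disabled */
static int qlowmark = 100;   /* queue length below which batch limiting resumes */

module_param(blimit, int, 0);
module_param(qhimark, int, 0);
module_param(qlowmark, int, 0);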

To solve your problem we can:

1) reduce qhimark from 10000 to 1000 (for example).
   This should probably be done anyway, to reduce spikes in the RCU code when
   freeing a whole batch of 10000 elements...
OR
2) change the conntrack tunable (maximum number of conntrack entries on your machine)
OR
3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
   in nf_conntrack_free() instead of in the RCU callback.

[PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In a DoS situation, RCU might
accumulate about 10,000 elements per CPU in its internal queues. To get accurate
conntrack counts (at the expense of slightly more RAM used), we make the
conntrack counter not take into account the "about to be freed elements, waiting
in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
callback.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>






Comments

Eric Dumazet March 24, 2009, 12:25 p.m. UTC | #1
Eric Dumazet wrote:
> Joakim Tjernlund a écrit :
>> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
>>> Joakim Tjernlund a écrit :
>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15:
>>>>> Joakim Tjernlund wrote:
>>>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
>>>>>>
>>>>>>
>>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
>>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
>>>>>>>> to the board I see:
>>>>>>>>
>>>>>>> That means that something is leaking conntrack references, most 
>>>> likely
>>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess 
>>>> would
>>>>>>> be the ucc_geth driver.
>>>>>>>
>>>>>> Mucking around with the ucc_geth driver I found that if I:
>>>>>>  - Move TX from IRQ to NAPI context
>>>>>>  - double the weight.
>>>>>>  - after booting up, wait a few mins until the JFFS2 GC kernel 
>> thread 
>>>> has 
>>>>>> stopped
>>>>>>    scanning the FS 
>>>>>>
>>>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops.
>>>>>> Does this seem right to you guys?
>>>>> No. As I said, something seems to be leaking packets. You should be
>>>>> able to confirm that by checking the sk_buff slabs in /proc/slabinfo.
>>>>> If that *doesn't* show any signs of a leak, please run "conntrack -E"
>>>>> to capture the conntrack events before the "table full" message
>>>>> appears and post the output.
>>>> skbuff does not differ much, but others do
>>>>
>>>> Before ping:
>>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 27 
>> 0 
>>>> : slabdata      0      0      0
>>>>   skbuff_head_cache     20     20    192   20    1 : tunables  120 60 
>> 0 
>>>> : slabdata      1      1      0
>>>>   size-64              731    767     64   59    1 : tunables  120 60 
>> 0 
>>>> : slabdata     13     13      0
>>>>   nf_conntrack          10     19    208   19    1 : tunables  120 60 
>> 0 
>>>> : slabdata      1      1      0
>>>>
>>>> During ping: 
>>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 27 
>> 0 
>>>> : slabdata      0      0      0
>>>>   skbuff_head_cache     40     40    192   20    1 : tunables  120 60 
>> 0 
>>>> : slabdata      2      2      0
>>>>   size-64             8909   8909     64   59    1 : tunables  120 60 
>> 0 
>>>> : slabdata    151    151      0
>>>>   nf_conntrack        5111   5111    208   19    1 : tunables  120 60 
>> 0 
>>>> : slabdata    269    269      0
>>>>
>>>> This feels more like the freeing of conntrack objects are delayed and 
>>>> builds up when ping flooding.
>>>>
>>>> Don't have "conntrack -E" for my embedded board so that will have to 
>> wait 
>>>> a bit longer.
>>> I dont understand how your ping can use so many conntrack entries...
>>>
>>> Then, as I said yesterday, I believe you have a RCU delay, because of
>>> a misbehaving driver or something...
>>>
>>> grep RCU .config
>> grep RCU .config
>> # RCU Subsystem
>> CONFIG_CLASSIC_RCU=y
>> # CONFIG_TREE_RCU is not set
>> # CONFIG_PREEMPT_RCU is not set
>> # CONFIG_TREE_RCU_TRACE is not set
>> # CONFIG_PREEMPT_RCU_TRACE is not set
>> # CONFIG_RCU_TORTURE_TEST is not set
>> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
>>
>>> grep CONFIG_SMP .config
>> grep CONFIG_SMP .config
>> # CONFIG_SMP is not set
>>
>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c (line 
>> 80)
>>> as a workaround. It should force a quiescent state after 1000 freed 
>> conntracks.
>>
>> right, doing this almost killed all conntrack messages, had to stress it 
>> pretty
>> hard before I saw handful "nf_conntrack: table full, dropping packet"
>>
>> RCU is not my cup of tea, do you have any ideas were to look?
> 
> In a stress situation, you feed more deleted conntracks to call_rcu() than
> the blimit (10 real freeing per RCU softirq invocation). 
> 
> So with default qhimark being 10000, this means about 10000 conntracks
> can sit in RCU (per CPU) before being really freed.
> 
> Only when hitting 10000, RCU enters a special mode to free all queued items, instead
> of a small batch of 10
> 
> To solve your problem we can :
> 
> 1) reduce qhimark from 10000 to 1000 (for example)
>    Probably should be done to reduce some spikes in RCU code when freeing
>    whole 10000 elements...
> OR
> 2) change conntrack tunable (max conntrack entries on your machine)
> OR
> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
>   in nf_conntrack_free() instead of callback.
> 
> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
> 
> We use RCU to defer freeing of conntrack structures. In DOS situation, RCU might
> accumulate about 10.000 elements per CPU in its internal queues. To get accurate
> conntrack counts (at the expense of slightly more RAM used), we might consider
> conntrack counter not taking into account "about to be freed elements, waiting
> in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
> callback.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> 
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index f4935e3..6478dc7 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
>  static void nf_conntrack_free_rcu(struct rcu_head *head)
>  {
>  	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
> -	struct net *net = nf_ct_net(ct);
>  
>  	nf_ct_ext_free(ct);
>  	kmem_cache_free(nf_conntrack_cachep, ct);
> -	atomic_dec(&net->ct.count);
>  }
>  
>  void nf_conntrack_free(struct nf_conn *ct)
>  {
> +	struct net *net = nf_ct_net(ct);
> +
>  	nf_ct_ext_destroy(ct);
> +	atomic_dec(&net->ct.count);
>  	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_free);

I forgot to say that this is what we do for 'struct file' freeing as well: we
decrement nr_files in file_free(), not in file_free_rcu().

static inline void file_free_rcu(struct rcu_head *head)
{
        struct file *f = container_of(head, struct file, f_u.fu_rcuhead);

        put_cred(f->f_cred);
        kmem_cache_free(filp_cachep, f);
}

static inline void file_free(struct file *f)
{
        percpu_counter_dec(&nr_files);      <<<< HERE >>>>
        file_check_state(f);
        call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}



Patrick McHardy March 24, 2009, 12:43 p.m. UTC | #2
Eric Dumazet wrote:
>> In a stress situation, you feed more deleted conntracks to call_rcu() than
>> the blimit (10 real freeing per RCU softirq invocation). 
>>
>> So with default qhimark being 10000, this means about 10000 conntracks
>> can sit in RCU (per CPU) before being really freed.
>>
>> Only when hitting 10000, RCU enters a special mode to free all queued items, instead
>> of a small batch of 10
>>
>> To solve your problem we can :
>>
>> 1) reduce qhimark from 10000 to 1000 (for example)
>>    Probably should be done to reduce some spikes in RCU code when freeing
>>    whole 10000 elements...
>> OR
>> 2) change conntrack tunable (max conntrack entries on your machine)
>> OR
>> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
>>   in nf_conntrack_free() instead of callback.
>>
>> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
>>
>> We use RCU to defer freeing of conntrack structures. In DOS situation, RCU might
>> accumulate about 10.000 elements per CPU in its internal queues. To get accurate
>> conntrack counts (at the expense of slightly more RAM used), we might consider
>> conntrack counter not taking into account "about to be freed elements, waiting
>> in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
>> callback.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>>
>> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
>> index f4935e3..6478dc7 100644
>> --- a/net/netfilter/nf_conntrack_core.c
>> +++ b/net/netfilter/nf_conntrack_core.c
>> @@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
>>  static void nf_conntrack_free_rcu(struct rcu_head *head)
>>  {
>>  	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
>> -	struct net *net = nf_ct_net(ct);
>>  
>>  	nf_ct_ext_free(ct);
>>  	kmem_cache_free(nf_conntrack_cachep, ct);
>> -	atomic_dec(&net->ct.count);
>>  }
>>  
>>  void nf_conntrack_free(struct nf_conn *ct)
>>  {
>> +	struct net *net = nf_ct_net(ct);
>> +
>>  	nf_ct_ext_destroy(ct);
>> +	atomic_dec(&net->ct.count);
>>  	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
>>  }
>>  EXPORT_SYMBOL_GPL(nf_conntrack_free);
> 
> I forgot to say this is what we do for 'struct file' freeing as well. We
> decrement nr_files in file_free(), not in file_free_rcu()


While temporarily exceeding the limit by up to 10000 entries is
quite a lot, I guess the important thing is that it can't grow
unbounded, so I think this patch is fine.

Joakim Tjernlund March 24, 2009, 1:20 p.m. UTC | #3
Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16:
> 
> Joakim Tjernlund a écrit :
> > Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
> >> Joakim Tjernlund a écrit :
> >>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15:
> >>>> Joakim Tjernlund wrote:
> >>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
> >>>>>
> >>>>>
> >>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
> >>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
> >>>>>>> to the board I see:
> >>>>>>>
> >>>>>> That means that something is leaking conntrack references, most 
> >>> likely
> >>>>>> by leaking skbs. Since I haven't seen any other reports, my guess 

> >>> would
> >>>>>> be the ucc_geth driver.
> >>>>>>
> >>>>> Mucking around with the ucc_geth driver I found that if I:
> >>>>>  - Move TX from IRQ to NAPI context
> >>>>>  - double the weight.
> >>>>>  - after booting up, wait a few mins until the JFFS2 GC kernel 
> > thread 
> >>> has 
> >>>>> stopped
> >>>>>    scanning the FS 
> >>>>>
> >>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops.
> >>>>> Does this seem right to you guys?
> >>>> No. As I said, something seems to be leaking packets. You should be
> >>>> able to confirm that by checking the sk_buff slabs in 
/proc/slabinfo.
> >>>> If that *doesn't* show any signs of a leak, please run "conntrack 
-E"
> >>>> to capture the conntrack events before the "table full" message
> >>>> appears and post the output.
> >>> skbuff does not differ much, but others do
> >>>
> >>> Before ping:
> >>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 
27 
> > 0 
> >>> : slabdata      0      0      0
> >>>   skbuff_head_cache     20     20    192   20    1 : tunables  120 
60 
> > 0 
> >>> : slabdata      1      1      0
> >>>   size-64              731    767     64   59    1 : tunables  120 
60 
> > 0 
> >>> : slabdata     13     13      0
> >>>   nf_conntrack          10     19    208   19    1 : tunables  120 
60 
> > 0 
> >>> : slabdata      1      1      0
> >>>
> >>> During ping: 
> >>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 
27 
> > 0 
> >>> : slabdata      0      0      0
> >>>   skbuff_head_cache     40     40    192   20    1 : tunables  120 
60 
> > 0 
> >>> : slabdata      2      2      0
> >>>   size-64             8909   8909     64   59    1 : tunables  120 
60 
> > 0 
> >>> : slabdata    151    151      0
> >>>   nf_conntrack        5111   5111    208   19    1 : tunables  120 
60 
> > 0 
> >>> : slabdata    269    269      0
> >>>
> >>> This feels more like the freeing of conntrack objects are delayed 
and 
> >>> builds up when ping flooding.
> >>>
> >>> Don't have "conntrack -E" for my embedded board so that will have to 

> > wait 
> >>> a bit longer.
> >> I dont understand how your ping can use so many conntrack entries...
> >>
> >> Then, as I said yesterday, I believe you have a RCU delay, because of
> >> a misbehaving driver or something...
> >>
> >> grep RCU .config
> > grep RCU .config
> > # RCU Subsystem
> > CONFIG_CLASSIC_RCU=y
> > # CONFIG_TREE_RCU is not set
> > # CONFIG_PREEMPT_RCU is not set
> > # CONFIG_TREE_RCU_TRACE is not set
> > # CONFIG_PREEMPT_RCU_TRACE is not set
> > # CONFIG_RCU_TORTURE_TEST is not set
> > # CONFIG_RCU_CPU_STALL_DETECTOR is not set
> > 
> >> grep CONFIG_SMP .config
> > grep CONFIG_SMP .config
> > # CONFIG_SMP is not set
> > 
> >> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c 
(line 
> > 80)
> >> as a workaround. It should force a quiescent state after 1000 freed 
> > conntracks.
> > 
> > right, doing this almost killed all conntrack messages, had to stress 
it 
> > pretty
> > hard before I saw handful "nf_conntrack: table full, dropping packet"
> > 
> > RCU is not my cup of tea, do you have any ideas were to look?
> 
> In a stress situation, you feed more deleted conntracks to call_rcu() 
than
> the blimit (10 real freeing per RCU softirq invocation). 
> 
> So with default qhimark being 10000, this means about 10000 conntracks
> can sit in RCU (per CPU) before being really freed.
> 
> Only when hitting 10000, RCU enters a special mode to free all queued 
items, instead
> of a small batch of 10
> 
> To solve your problem we can :
> 
> 1) reduce qhimark from 10000 to 1000 (for example)
>    Probably should be done to reduce some spikes in RCU code when 
freeing
>    whole 10000 elements...
> OR
> 2) change conntrack tunable (max conntrack entries on your machine)
> OR
> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
>   in nf_conntrack_free() instead of callback.
> 
> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()

The patch fixes the problem and the system feels a bit more responsive
too, thanks.
I guess I should probably do both 1) and 3) as my board is pretty slow
too.

Been trying to figure out a good value for the NAPI weight too. Currently my
HW RX and TX queues are 16 packets deep and the weight is 16 too. If I move TX
processing to NAPI context AND increase the weight to 32, the system is a lot
more responsive during ping flooding. Does a weight of 32 make sense when the
HW TX and RX queues are 16 deep?

 Jocke

Patrick McHardy March 24, 2009, 1:28 p.m. UTC | #4
Joakim Tjernlund wrote:
> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16:
>> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
> 
> The patch fixes the problem and the system feels a bit more responsive 
> too, thanks.

Applied, thanks everyone.
Eric Dumazet March 24, 2009, 1:29 p.m. UTC | #5
Joakim Tjernlund wrote:
> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 13:07:16:
>> Joakim Tjernlund a écrit :
>>> Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 10:12:53:
>>>> Joakim Tjernlund a écrit :
>>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 18:49:15:
>>>>>> Joakim Tjernlund wrote:
>>>>>>> Patrick McHardy <kaber@trash.net> wrote on 23/03/2009 13:29:33:
>>>>>>>
>>>>>>>
>>>>>>>>> There is no /proc/net/netfilter/nf_conntrack. There is a
>>>>>>>>> /proc/net/nf_conntrack though and it is empty. If I telnet
>>>>>>>>> to the board I see:
>>>>>>>>>
>>>>>>>> That means that something is leaking conntrack references, most 
>>>>> likely
>>>>>>>> by leaking skbs. Since I haven't seen any other reports, my guess 
> 
>>>>> would
>>>>>>>> be the ucc_geth driver.
>>>>>>>>
>>>>>>> Mucking around with the ucc_geth driver I found that if I:
>>>>>>>  - Move TX from IRQ to NAPI context
>>>>>>>  - double the weight.
>>>>>>>  - after booting up, wait a few mins until the JFFS2 GC kernel 
>>> thread 
>>>>> has 
>>>>>>> stopped
>>>>>>>    scanning the FS 
>>>>>>>
>>>>>>> Then the "nf_conntrack: table full, dropping packet." msgs stops.
>>>>>>> Does this seem right to you guys?
>>>>>> No. As I said, something seems to be leaking packets. You should be
>>>>>> able to confirm that by checking the sk_buff slabs in 
> /proc/slabinfo.
>>>>>> If that *doesn't* show any signs of a leak, please run "conntrack 
> -E"
>>>>>> to capture the conntrack events before the "table full" message
>>>>>> appears and post the output.
>>>>> skbuff does not differ much, but others do
>>>>>
>>>>> Before ping:
>>>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 
> 27 
>>> 0 
>>>>> : slabdata      0      0      0
>>>>>   skbuff_head_cache     20     20    192   20    1 : tunables  120 
> 60 
>>> 0 
>>>>> : slabdata      1      1      0
>>>>>   size-64              731    767     64   59    1 : tunables  120 
> 60 
>>> 0 
>>>>> : slabdata     13     13      0
>>>>>   nf_conntrack          10     19    208   19    1 : tunables  120 
> 60 
>>> 0 
>>>>> : slabdata      1      1      0
>>>>>
>>>>> During ping: 
>>>>>   skbuff_fclone_cache    0      0    352   11    1 : tunables   54 
> 27 
>>> 0 
>>>>> : slabdata      0      0      0
>>>>>   skbuff_head_cache     40     40    192   20    1 : tunables  120 
> 60 
>>> 0 
>>>>> : slabdata      2      2      0
>>>>>   size-64             8909   8909     64   59    1 : tunables  120 
> 60 
>>> 0 
>>>>> : slabdata    151    151      0
>>>>>   nf_conntrack        5111   5111    208   19    1 : tunables  120 
> 60 
>>> 0 
>>>>> : slabdata    269    269      0
>>>>>
>>>>> This feels more like the freeing of conntrack objects are delayed 
> and 
>>>>> builds up when ping flooding.
>>>>>
>>>>> Don't have "conntrack -E" for my embedded board so that will have to 
> 
>>> wait 
>>>>> a bit longer.
>>>> I dont understand how your ping can use so many conntrack entries...
>>>>
>>>> Then, as I said yesterday, I believe you have a RCU delay, because of
>>>> a misbehaving driver or something...
>>>>
>>>> grep RCU .config
>>> grep RCU .config
>>> # RCU Subsystem
>>> CONFIG_CLASSIC_RCU=y
>>> # CONFIG_TREE_RCU is not set
>>> # CONFIG_PREEMPT_RCU is not set
>>> # CONFIG_TREE_RCU_TRACE is not set
>>> # CONFIG_PREEMPT_RCU_TRACE is not set
>>> # CONFIG_RCU_TORTURE_TEST is not set
>>> # CONFIG_RCU_CPU_STALL_DETECTOR is not set
>>>
>>>> grep CONFIG_SMP .config
>>> grep CONFIG_SMP .config
>>> # CONFIG_SMP is not set
>>>
>>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c 
> (line 
>>> 80)
>>>> as a workaround. It should force a quiescent state after 1000 freed 
>>> conntracks.
>>>
>>> right, doing this almost killed all conntrack messages, had to stress 
> it 
>>> pretty
>>> hard before I saw handful "nf_conntrack: table full, dropping packet"
>>>
>>> RCU is not my cup of tea, do you have any ideas were to look?
>> In a stress situation, you feed more deleted conntracks to call_rcu() 
> than
>> the blimit (10 real freeing per RCU softirq invocation). 
>>
>> So with default qhimark being 10000, this means about 10000 conntracks
>> can sit in RCU (per CPU) before being really freed.
>>
>> Only when hitting 10000, RCU enters a special mode to free all queued 
> items, instead
>> of a small batch of 10
>>
>> To solve your problem we can :
>>
>> 1) reduce qhimark from 10000 to 1000 (for example)
>>    Probably should be done to reduce some spikes in RCU code when 
> freeing
>>    whole 10000 elements...
>> OR
>> 2) change conntrack tunable (max conntrack entries on your machine)
>> OR
>> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
>>   in nf_conntrack_free() instead of callback.
>>
>> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
> 
> The patch fixes the problem and the system feels a bit more responsive 
> too, thanks.
> I guess I should probably do both 1) and 3) as my board is pretty slow 
> too.
> 
> Been trying to figure out a good value for NAPI weigth too. Currently my
> HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move TX 
> processing
> to NAPI context AND increase weigth to 32, the system is a lot more 
> responsive during
> ping flooding. Does weigth 32 make sense when the HW TX and RX queues are 
> 16?

If you only have one NIC, I don't understand why changing the weight should
make a difference. Are you referring to dev_weight or netdev_budget?

# cat /proc/sys/net/core/dev_weight
64
# cat /proc/sys/net/core/netdev_budget
300


Eric Dumazet March 24, 2009, 1:32 p.m. UTC | #6
Patrick McHardy wrote:
> Eric Dumazet wrote:
>>> In a stress situation, you feed more deleted conntracks to call_rcu()
>>> than
>>> the blimit (10 real freeing per RCU softirq invocation).
>>> So with default qhimark being 10000, this means about 10000 conntracks
>>> can sit in RCU (per CPU) before being really freed.
>>>
>>> Only when hitting 10000, RCU enters a special mode to free all queued
>>> items, instead
>>> of a small batch of 10
>>>
>>> To solve your problem we can :
>>>
>>> 1) reduce qhimark from 10000 to 1000 (for example)
>>>    Probably should be done to reduce some spikes in RCU code when
>>> freeing
>>>    whole 10000 elements...
>>> OR
>>> 2) change conntrack tunable (max conntrack entries on your machine)
>>> OR
>>> 3) change net/netfilter/nf_conntrack_core.c to decrement net->ct.count
>>>   in nf_conntrack_free() instead of callback.
>>>
>>> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
>>>
>>> We use RCU to defer freeing of conntrack structures. In DOS
>>> situation, RCU might
>>> accumulate about 10.000 elements per CPU in its internal queues. To
>>> get accurate
>>> conntrack counts (at the expense of slightly more RAM used), we might
>>> consider
>>> conntrack counter not taking into account "about to be freed
>>> elements, waiting
>>> in RCU queues". We thus decrement it in nf_conntrack_free(), not in
>>> the RCU
>>> callback.
>>>
>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>>
>>>
>>> diff --git a/net/netfilter/nf_conntrack_core.c
>>> b/net/netfilter/nf_conntrack_core.c
>>> index f4935e3..6478dc7 100644
>>> --- a/net/netfilter/nf_conntrack_core.c
>>> +++ b/net/netfilter/nf_conntrack_core.c
>>> @@ -516,16 +516,17 @@ EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
>>>  static void nf_conntrack_free_rcu(struct rcu_head *head)
>>>  {
>>>      struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
>>> -    struct net *net = nf_ct_net(ct);
>>>  
>>>      nf_ct_ext_free(ct);
>>>      kmem_cache_free(nf_conntrack_cachep, ct);
>>> -    atomic_dec(&net->ct.count);
>>>  }
>>>  
>>>  void nf_conntrack_free(struct nf_conn *ct)
>>>  {
>>> +    struct net *net = nf_ct_net(ct);
>>> +
>>>      nf_ct_ext_destroy(ct);
>>> +    atomic_dec(&net->ct.count);
>>>      call_rcu(&ct->rcu, nf_conntrack_free_rcu);
>>>  }
>>>  EXPORT_SYMBOL_GPL(nf_conntrack_free);
>>
>> I forgot to say this is what we do for 'struct file' freeing as well. We
>> decrement nr_files in file_free(), not in file_free_rcu()
> 
> 
> While temporarily exceeding the limit by up to 10000 entries is
> quite a lot, I guess the important thing is that it can't grow
> unbounded, so I think this patch is fine.
> 
> 

Maybe we could use the SLAB_DESTROY_BY_RCU trick and avoid the call_rcu()
queueing problem entirely. That would make better use of CPU caches as well...
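
A minimal sketch of what that would mean for the conntrack cache, assuming
the usual kmem_cache_create() API (this is only an illustration, not the
actual implementation):

/* Create the conntrack slab with SLAB_DESTROY_BY_RCU: freed objects can then
 * be recycled without one call_rcu() per object; only the slab pages
 * themselves are returned to the system after a grace period. */
nf_conntrack_cachep = kmem_cache_create("nf_conntrack",
                                        sizeof(struct nf_conn),
                                        0, SLAB_DESTROY_BY_RCU, NULL);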

Patrick McHardy March 24, 2009, 1:38 p.m. UTC | #7
Eric Dumazet wrote:
> Patrick McHardy a écrit :
>>> I forgot to say this is what we do for 'struct file' freeing as well. We
>>> decrement nr_files in file_free(), not in file_free_rcu()
>>
>> While temporarily exceeding the limit by up to 10000 entries is
>> quite a lot, I guess the important thing is that it can't grow
>> unbounded, so I think this patch is fine.
>>
> 
> Maybe we could use SLAB_DESTROY_BY_RCU thing and no more call_rcu() queueing
> problem. That would better use CPU caches as well...

I'm not sure I understand the rules correctly, but we'd still
have to wait for the grace period before an object can be reused,
no?



Joakim Tjernlund March 24, 2009, 1:41 p.m. UTC | #8
Eric Dumazet <dada1@cosmosbay.com> wrote on 24/03/2009 14:29:29:

[SNIP]

> >>>> You could change qhimark from 10000 to 1000 in kernel/rcuclassic.c 
> > (line 
> >>> 80)
> >>>> as a workaround. It should force a quiescent state after 1000 freed 

> >>> conntracks.
> >>>
> >>> right, doing this almost killed all conntrack messages, had to 
stress 
> > it 
> >>> pretty
> >>> hard before I saw handful "nf_conntrack: table full, dropping 
packet"
> >>>
> >>> RCU is not my cup of tea, do you have any ideas were to look?
> >> In a stress situation, you feed more deleted conntracks to call_rcu() 

> > than
> >> the blimit (10 real freeing per RCU softirq invocation). 
> >>
> >> So with default qhimark being 10000, this means about 10000 
conntracks
> >> can sit in RCU (per CPU) before being really freed.
> >>
> >> Only when hitting 10000, RCU enters a special mode to free all queued 

> > items, instead
> >> of a small batch of 10
> >>
> >> To solve your problem we can :
> >>
> >> 1) reduce qhimark from 10000 to 1000 (for example)
> >>    Probably should be done to reduce some spikes in RCU code when 
> > freeing
> >>    whole 10000 elements...
> >> OR
> >> 2) change conntrack tunable (max conntrack entries on your machine)
> >> OR
> >> 3) change net/netfilter/nf_conntrack_core.c to decrement 
net->ct.count
> >>   in nf_conntrack_free() instead of callback.
> >>
> >> [PATCH] conntrack: Reduce conntrack count in nf_conntrack_free()
> > 
> > The patch fixes the problem and the system feels a bit more responsive 

> > too, thanks.
> > I guess I should probably do both 1) and 3) as my board is pretty slow 

> > too.
> > 
> > Been trying to figure out a good value for NAPI weigth too. Currently 
my
> > HW RX and TX queues are 16 pkgs deep and weigth is 16 too. If I move 
TX 
> > processing
> > to NAPI context AND increase weigth to 32, the system is a lot more 
> > responsive during
> > ping flooding. Does weigth 32 make sense when the HW TX and RX queues 
are 
> > 16?
> 
> If you only have one NIC, I dont understand why changing weight should 
make
> a difference. Are you referring to dev_weight or netdev_budget ?
> 
> # cat /proc/sys/net/core/dev_weight
> 64
> # cat /proc/sys/net/core/netdev_budget
> 300

I mean this call in ucc_geth:
  netif_napi_add(dev, &ugeth->napi, ucc_geth_poll, UCC_GETH_DEV_WEIGHT);
UCC_GETH_DEV_WEIGHT is 16

Noticed that rcuclassic.c has a
  module_param(qhimark, int, 0);
but I can't figure out how to set this qhimark from the cmdline;
rcuclassic.c is not a module (I don't use modules at all).

 Jocke


Eric Dumazet March 24, 2009, 1:47 p.m. UTC | #9
Patrick McHardy wrote:
> Eric Dumazet wrote:
>> Patrick McHardy a écrit :
>>>> I forgot to say this is what we do for 'struct file' freeing as
>>>> well. We
>>>> decrement nr_files in file_free(), not in file_free_rcu()
>>>
>>> While temporarily exceeding the limit by up to 10000 entries is
>>> quite a lot, I guess the important thing is that it can't grow
>>> unbounded, so I think this patch is fine.
>>>
>>
>> Maybe we could use SLAB_DESTROY_BY_RCU thing and no more call_rcu()
>> queueing
>> problem. That would better use CPU caches as well...
> 
> I'm not sure I understand the rules correctly, but we'd still
> have to wait for the grace period before an object can be reused,
> no?

No, we don't have to, but we must do additional checks after getting
a reference on an object found during lookup
(we must re-check the keys used during the search).

This re-check is not very expensive since everything is hot in the CPU cache.

Check Documentation/RCU/rculist_nulls.txt for some documentation.

1) Lookup algo
--------------

rcu_read_lock()
begin:
obj = lockless_lookup(key);
if (obj) {
  if (!try_get_ref(obj)) // might fail for free objects
    goto begin;
  /*
   * Because a writer could delete the object, and another writer could
   * reuse this object before the RCU grace period expires, we
   * must re-check the key after getting the reference on the object
   */
  if (obj->key != key) { // not the object we expected
     put_ref(obj);
     goto begin;
   }
}
rcu_read_unlock();
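
Applied to conntrack, the re-checked key would be the tuple. A hypothetical
sketch of such a lookup (nf_ct_tuplehash_to_ctrack(), nf_ct_tuple_equal(),
__nf_conntrack_find() and nf_ct_put() are existing conntrack helpers; the
wrapper function itself is only illustrative, not the final implementation):

/* Hypothetical sketch: RCU lookup with re-validation under SLAB_DESTROY_BY_RCU.
 * An entry found in the hash may have been freed and recycled for another
 * flow, so after taking a reference we confirm the tuple still matches. */
static struct nf_conn *find_get_sketch(struct net *net,
                                       const struct nf_conntrack_tuple *tuple)
{
        struct nf_conntrack_tuple_hash *h;
        struct nf_conn *ct = NULL;

        rcu_read_lock();
begin:
        h = __nf_conntrack_find(net, tuple);    /* lockless hash lookup */
        if (h) {
                ct = nf_ct_tuplehash_to_ctrack(h);
                if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
                        goto begin;             /* object was being freed */
                if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple))) {
                        nf_ct_put(ct);          /* recycled for another flow */
                        goto begin;
                }
        }
        rcu_read_unlock();
        return h ? ct : NULL;
}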


Maxime Bizon March 24, 2009, 3:17 p.m. UTC | #10
On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote:

Hi Eric,

> We use RCU to defer freeing of conntrack structures. In DOS situation,
> RCU might accumulate about 10.000 elements per CPU in its internal
> queues. To get accurate conntrack counts (at the expense of slightly
> more RAM used), we might consider conntrack counter not taking into
> account "about to be freed elements, waiting in RCU queues". We thus
> decrement it in nf_conntrack_free(), not in the RCU callback.

Your patch fixes the problem on my board too (embedded mips router
250Mhz), thanks.

Yet I'm concerned about what you said concerning RAM usage. I have a
very small amount of memory left on my board (less than 4M), and I tuned
the ip route cache size and nf_conntrack_max to make sure I won't go OOM.

With your patch, does it mean 10000 conntrack entries can be allocated
while nf_conntrack_max is say only 2048 ?

Regards,
Patrick McHardy March 24, 2009, 3:21 p.m. UTC | #11
Maxime Bizon wrote:
> On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote:
> 
>> We use RCU to defer freeing of conntrack structures. In DOS situation,
>> RCU might accumulate about 10.000 elements per CPU in its internal
>> queues. To get accurate conntrack counts (at the expense of slightly
>> more RAM used), we might consider conntrack counter not taking into
>> account "about to be freed elements, waiting in RCU queues". We thus
>> decrement it in nf_conntrack_free(), not in the RCU callback.
> 
> Your patch fixes the problem on my board too (embedded mips router
> 250Mhz), thanks.
> 
> Yet I'm concerned about what you said concerning RAM usage. I have a
> very small amount on memory left on my board (less than 4M), and I tuned
> ip route cache size and nf_conntrack_max to make sure I won't go OOM.
> 
> With your patch, does it mean 10000 conntrack entries can be allocated
> while nf_conntrack_max is say only 2048 ?

Temporarily under worst-case circumstances, yes. Eric is already working
on his proposed improvement though :)
Eric Dumazet March 24, 2009, 3:27 p.m. UTC | #12
Maxime Bizon wrote:
> On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote:
> 
> Hi Eric,
> 
>> We use RCU to defer freeing of conntrack structures. In DOS situation,
>> RCU might accumulate about 10.000 elements per CPU in its internal
>> queues. To get accurate conntrack counts (at the expense of slightly
>> more RAM used), we might consider conntrack counter not taking into
>> account "about to be freed elements, waiting in RCU queues". We thus
>> decrement it in nf_conntrack_free(), not in the RCU callback.
> 
> Your patch fixes the problem on my board too (embedded mips router
> 250Mhz), thanks.
> 
> Yet I'm concerned about what you said concerning RAM usage. I have a
> very small amount on memory left on my board (less than 4M), and I tuned
> ip route cache size and nf_conntrack_max to make sure I won't go OOM.
> 
> With your patch, does it mean 10000 conntrack entries can be allocated
> while nf_conntrack_max is say only 2048 ?

Well... yes, RCU can have this 'interesting' OOM property.

For small machines, you really want to lower the RCU parameters, because,
as you said, we also push route cache entries into the RCU queue, whether my
patch is applied or not (but those use call_rcu_bh(), so we get lower
latencies, I think).

We are working on a SLAB_DESTROY_BY_RCU implementation so that
conntrack won't use call_rcu() anymore; give us a couple of days :)

Paul, could we have /sys knobs to tune qhimark, blimit & qlowmark?

Thanks

Joakim Tjernlund March 24, 2009, 6:29 p.m. UTC | #13
Maxime Bizon <mbizon@freebox.fr> wrote on 24/03/2009 16:17:30:
> 
> 
> On Tue, 2009-03-24 at 13:07 +0100, Eric Dumazet wrote:
> 
> Hi Eric,
> 
> > We use RCU to defer freeing of conntrack structures. In DOS situation,
> > RCU might accumulate about 10.000 elements per CPU in its internal
> > queues. To get accurate conntrack counts (at the expense of slightly
> > more RAM used), we might consider conntrack counter not taking into
> > account "about to be freed elements, waiting in RCU queues". We thus
> > decrement it in nf_conntrack_free(), not in the RCU callback.
> 
> Your patch fixes the problem on my board too (embedded mips router
> 250Mhz), thanks.
> 
> Yet I'm concerned about what you said concerning RAM usage. I have a
> very small amount on memory left on my board (less than 4M), and I tuned
> ip route cache size and nf_conntrack_max to make sure I won't go OOM.
> 
> With your patch, does it mean 10000 conntrack entries can be allocated
> while nf_conntrack_max is say only 2048 ?

Just add "rcuclassic.qhimark=2048" to your cmdline.

 Jocke


Patch

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f4935e3..6478dc7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -516,16 +516,17 @@  EXPORT_SYMBOL_GPL(nf_conntrack_alloc);
 static void nf_conntrack_free_rcu(struct rcu_head *head)
 {
 	struct nf_conn *ct = container_of(head, struct nf_conn, rcu);
-	struct net *net = nf_ct_net(ct);
 
 	nf_ct_ext_free(ct);
 	kmem_cache_free(nf_conntrack_cachep, ct);
-	atomic_dec(&net->ct.count);
 }
 
 void nf_conntrack_free(struct nf_conn *ct)
 {
+	struct net *net = nf_ct_net(ct);
+
 	nf_ct_ext_destroy(ct);
+	atomic_dec(&net->ct.count);
 	call_rcu(&ct->rcu, nf_conntrack_free_rcu);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_free);