
[RFC,0/2] net: threadable napi poll loop

Message ID 1462920697.23934.113.camel@edumazet-glaptop3.roam.corp.google.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet May 10, 2016, 10:51 p.m. UTC
On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:

> Not only did we want to present this solely as a bugfix but also as
> performance enhancements in case of virtio (as you can see in the cover
> letter). Given that a long time ago there was a tendency to remove
> softirqs completely, we thought it might be very interesting, that a
> threaded napi in general seems to be absolutely viable nowadays and
> might offer new features.

Well, you did not fix the bug; you worked around it by adding yet another
layer, with another sysctl that admins or programs have to manage.

If you have a special need for virtio, do not hide it behind a 'bug fix'
but add it as a feature request.

This ksoftirqd issue is real and a fix looks very reasonable.

Please try this patch, as I had very good success with it.

Thanks.

Comments

Peter Zijlstra May 11, 2016, 6:55 a.m. UTC | #1
On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342..22463217e3cf 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
>  static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
>  
>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
>  
>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
> @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
>  	/* Interrupts are disabled: no need to stop preemption */
>  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>  
> -	if (tsk && tsk->state != TASK_RUNNING)
> +	if (tsk && tsk->state != TASK_RUNNING) {
> +		__this_cpu_write(ksoftirqd_scheduled, true);
>  		wake_up_process(tsk);

Since we're already looking at tsk->state, and the wake_up_process()
ensures the thing becomes TASK_RUNNING, you could add:

static inline bool ksoftirqd_running(void)
{
	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
}

> +	}
>  }
>  
>  /*
> @@ -162,7 +165,9 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
>  	 */
>  	preempt_count_sub(cnt - 1);
>  
> -	if (unlikely(!in_interrupt() && local_softirq_pending())) {
> +	if (unlikely(!in_interrupt() &&
> +		     local_softirq_pending() &&
> +		     !__this_cpu_read(ksoftirqd_scheduled))) {

And use it here,

>  		/*
>  		 * Run softirq if any pending. And do it in its own stack
>  		 * as we may be calling this deep in a task call stack already.
> @@ -340,6 +345,9 @@ void irq_enter(void)
>  
>  static inline void invoke_softirq(void)
>  {
> +	if (__this_cpu_read(ksoftirqd_scheduled))

and here.

> +		return;
> +
>  	if (!force_irqthreads) {
>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>  		/*
> @@ -660,6 +668,8 @@ static void run_ksoftirqd(unsigned int cpu)
>  		 * in the task stack here.
>  		 */
>  		__do_softirq();
> +		if (!local_softirq_pending())
> +			__this_cpu_write(ksoftirqd_scheduled, false);

And avoid twiddling the new variable which only seems to mirror
tsk->state.

>  		local_irq_enable();
>  		cond_resched_rcu_qs();
>  		return;
> 
>
Paolo Abeni May 11, 2016, 9:48 a.m. UTC | #2
Hi Eric,
On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> 
> > Not only did we want to present this solely as a bugfix but also as
> > performance enhancements in case of virtio (as you can see in the cover
> > letter). Given that a long time ago there was a tendency to remove
> > softirqs completely, we thought it might be very interesting, that a
> > threaded napi in general seems to be absolutely viable nowadays and
> > might offer new features.
> 
> Well, you did not fix the bug; you worked around it by adding yet another
> layer, with another sysctl that admins or programs have to manage.
> 
> If you have a special need for virtio, do not hide it behind a 'bug fix'
> but add it as a feature request.
> 
> This ksoftirqd issue is real and a fix looks very reasonable.
> 
> Please try this patch, as I had very good success with it.

Thank you for your time and your effort.

I tested your patch on the bare metal "single core" scenario, disabling
the unneeded cores with:
CPUS=`nproc`
for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done

And I got a:

[   86.925249] Broke affinity for irq <num>

for each irq number generated by a network device.

In this scenario, your patch solves the ksoftirqd issue, performing
comparably to the napi threaded patches (with a negative delta in the
noise range) and introducing a minor regression with a single flow, in
the noise range (3%).

As said in a previous mail, we actually experimented with something similar,
but it felt quite hackish.

AFAICS this patch adds three more tests in the fast path and affects all
other softirq use cases. I'm not sure how to check for regressions there.

The napi thread patches are actually a new feature that also fixes the
ksoftirqd issue: hunting the ksoftirqd issue has been the initial
trigger for this work. I'm sorry for not being clear enough in the cover
letter.

The napi thread patches offer additional benefits, i.e. an additional
relevant gain in the described test scenario, and do not impact other
subsystems/kernel entities.

I still think they are worthwhile, and I bet you would disagree, but could
you please articulate more which parts concern you most and/or are more
bloated ?

Thank you,

Paolo
Eric Dumazet May 11, 2016, 1:08 p.m. UTC | #3
On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
> Hi Eric,
> On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> > On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> > 
> > > Not only did we want to present this solely as a bugfix but also as
> > > performance enhancements in case of virtio (as you can see in the cover
> > > letter). Given that a long time ago there was a tendency to remove
> > > softirqs completely, we thought it might be very interesting, that a
> > > threaded napi in general seems to be absolutely viable nowadays and
> > > might offer new features.
> > 
> > Well, you did not fix the bug; you worked around it by adding yet another
> > layer, with another sysctl that admins or programs have to manage.
> > 
> > If you have a special need for virtio, do not hide it behind a 'bug fix'
> > but add it as a feature request.
> > 
> > This ksoftirqd issue is real and a fix looks very reasonable.
> > 
> > Please try this patch, as I had very good success with it.
> 
> Thank you for your time and your effort.
> 
> I tested your patch on the bare metal "single core" scenario, disabling
> the unneeded cores with:
> CPUS=`nproc`
> for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done
> 
> And I got a:
> 
> [   86.925249] Broke affinity for irq <num>
> 

Was it fatal, or simply a warning that you are removing the cpu that was
the only allowed cpu in an affinity_mask ?

Looks like another bug to fix then ? We disabled CPU hotplug here at Google
for our production, as it was notoriously buggy. No time to fix dozens
of issues added by a crowd of developers that do not even know a cpu can
be unplugged.

Maybe some caller of local_bh_disable()/local_bh_enable() expected that
current softirq would be processed. Obviously flaky even before the
patches.

> for each irq number generated by a network device.
> 
> In this scenario, your patch solves the ksoftirqd issue, performing
> comparably to the napi threaded patches (with a negative delta in the
> noise range) and introducing a minor regression with a single flow, in
> the noise range (3%).
> 
> As said in a previous mail, we actually experimented with something similar,
> but it felt quite hackish.

Right, we are networking guys, and we feel that messing with such core
infra is not for us. So we feel comfortable adding a pure networking
patch.

> 
> AFAICS this patch adds three more tests in the fast path and affects all
> other softirq use cases. I'm not sure how to check for regressions there.

It is obvious to me that the ksoftirqd mechanism is not working as intended.

Fixing it might uncover bugs from parts of the kernel relying on the
bug, indirectly or directly. Is it a good thing ?

I can not tell before trying.

Just by looking at /proc/{ksoftirqd_pid}/sched you can see the problem:
we normally schedule ksoftirqd under stress, but most of the time
the softirq items were processed by other tasks, as you found out.


> 
> The napi thread patches are actually a new feature that also fixes the
> ksoftirqd issue: hunting the ksoftirqd issue has been the initial
> trigger for this work. I'm sorry for not being clear enough in the cover
> letter.
> 
> The napi thread patches offer additional benefits, i.e. an additional
> relevant gain in the described test scenario, and do not impact other
> subsystems/kernel entities.
> 
> I still think they are worthwhile, and I bet you would disagree, but could
> you please articulate more which parts concern you most and/or are more
> bloated ?

Just look at the added code. napi_threaded_poll() is very buggy, but
honestly I do not want to fix the bugs you added there. If you have only
one vcpu, how can jiffies ever change since you block BH ?

I was planning to remove cond_resched_softirq(), which we no longer use
after my recent changes to the TCP stack, and you call it again (while it
is obviously buggy, since it does not check if a BH is pending, only if a
thread needs the cpu).

I prefer fixing the existing code, really. It took us years to
understand it and maybe fix it.

Just think of what will happen if you have 10 devices (10 new threads in
your model) and one cpu.

Instead of the nice existing netif_rx() doing rounds of 64 packets per
device, you'll now rely on process scheduler behavior that has no such
granularity.

Adding more threads is the natural answer of userland programmers, but
in the kernel it is not the right answer. We already have mechanisms;
just use them and fix them if they are broken.

Sorry, I really do not think your patches are the way to go.
But this thread is definitely interesting.
Hannes Frederic Sowa May 11, 2016, 1:13 p.m. UTC | #4
On 11.05.2016 08:55, Peter Zijlstra wrote:
> On Tue, May 10, 2016 at 03:51:37PM -0700, Eric Dumazet wrote:
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 17caf4b63342..22463217e3cf 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -56,6 +56,7 @@ EXPORT_SYMBOL(irq_stat);
>>  static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
>>  
>>  DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
>> +DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
>>  
>>  const char * const softirq_to_name[NR_SOFTIRQS] = {
>>  	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
>> @@ -73,8 +74,10 @@ static void wakeup_softirqd(void)
>>  	/* Interrupts are disabled: no need to stop preemption */
>>  	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>>  
>> -	if (tsk && tsk->state != TASK_RUNNING)
>> +	if (tsk && tsk->state != TASK_RUNNING) {
>> +		__this_cpu_write(ksoftirqd_scheduled, true);
>>  		wake_up_process(tsk);
> 
> Since we're already looking at tsk->state, and the wake_up_process()
> ensures the thing becomes TASK_RUNNING, you could add:
> 
> static inline bool ksoftirqd_running(void)
> {
> 	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
> }

This looks racy to me, as ksoftirqd could be in the process of stopping
and we would miss another softirq invocation.

Thanks,
Hannes
Hannes Frederic Sowa May 11, 2016, 1:39 p.m. UTC | #5
Hi all,

On 11.05.2016 15:08, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
>> Hi Eric,
>> On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
>>> On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
>>>
>>>> Not only did we want to present this solely as a bugfix but also as
>>>> performance enhancements in case of virtio (as you can see in the cover
>>>> letter). Given that a long time ago there was a tendency to remove
>>>> softirqs completely, we thought it might be very interesting, that a
>>>> threaded napi in general seems to be absolutely viable nowadays and
>>>> might offer new features.
>>>
>>> Well, you did not fix the bug; you worked around it by adding yet another
>>> layer, with another sysctl that admins or programs have to manage.
>>>
>>> If you have a special need for virtio, do not hide it behind a 'bug fix'
>>> but add it as a feature request.
>>>
>>> This ksoftirqd issue is real and a fix looks very reasonable.
>>>
>>> Please try this patch, as I had very good success with it.
>>
>> Thank you for your time and your effort.
>>
>> I tested your patch on the bare metal "single core" scenario, disabling
>> the unneeded cores with:
>> CPUS=`nproc`
>> for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done
>>
>> And I got a:
>>
>> [   86.925249] Broke affinity for irq <num>
>>
> 
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?
> 
> Looks like another bug to fix then ? We disabled CPU hotplug here at Google
> for our production, as it was notoriously buggy. No time to fix dozens
> of issues added by a crowd of developers that do not even know a cpu can
> be unplugged.
> 
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.

Yes, I fear this could come up. If we want to target net or stable,
maybe we should special-case this patch specifically for net-rx?

>> for each irq number generated by a network device.
>>
>> In this scenario, your patch solves the ksoftirqd issue, performing
>> comparably to the napi threaded patches (with a negative delta in the
>> noise range) and introducing a minor regression with a single flow, in
>> the noise range (3%).
>>
>> As said in a previous mail, we actually experimented with something similar,
>> but it felt quite hackish.
> 
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.

We posted this patch as an RFC. My initial internal proposal only had a
check in ___napi_schedule, relied completely on threaded irqs, and
didn't spawn a thread per napi instance in the networking stack. I think
this is the better approach long term, as it allows configuring
threaded irqs per device and isn't specific to networking.
NAPI must be aware of when to schedule, obviously, so we need
another check in napi_schedule.

My plan was definitely to go with something more generic, but we didn't
yet know how to express that in a generic way, so we relied on the forced
threaded irqs kernel parameter.

>> AFAICS this patch adds three more tests in the fast path and affects all
>> other softirq use cases. I'm not sure how to check for regressions there.
> 
> It is obvious to me that the ksoftirqd mechanism is not working as intended.

Yes.

> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
> 
> I can not tell before trying.
> 
> Just by looking at /proc/{ksoftirqd_pid}/sched you can see the problem:
> we normally schedule ksoftirqd under stress, but most of the time
> the softirq items were processed by other tasks, as you found out.

Exactly: the pending mask gets reset by the task handling the softirq
inline, and ksoftirqd runs dry too early, not processing any more softirq
notifications.

>> The napi thread patches are actually a new feature that also fixes the
>> ksoftirqd issue: hunting the ksoftirqd issue has been the initial
>> trigger for this work. I'm sorry for not being clear enough in the cover
>> letter.
>>
>> The napi thread patches offer additional benefits, i.e. an additional
>> relevant gain in the described test scenario, and do not impact other
>> subsystems/kernel entities.
>>
>> I still think they are worthwhile, and I bet you would disagree, but could
>> you please articulate more which parts concern you most and/or are more
>> bloated ?
> 
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how can jiffies ever change since you block BH ?

I think the local_bh_disable/enable needs to be more fine-grained,
correct (inside the loop).

> I was planning to remove cond_resched_softirq(), which we no longer use
> after my recent changes to the TCP stack, and you call it again (while it
> is obviously buggy, since it does not check if a BH is pending, only if a
> thread needs the cpu).

Good point, thanks!

> I prefer fixing the existing code, really. It took us years to
> understand it and maybe fix it.

I agree, we should find a simple way to let ksoftirqd and netrx behave
better and target threaded napi as a new feature.

> Just think of what will happen if you have 10 devices (10 new threads in
> your model) and one cpu.
> 
> Instead of the nice existing netif_rx() doing rounds of 64 packets per
> device, you'll now rely on process scheduler behavior that has no such
> granularity.

We haven't inspected all kinds of workloads yet, but I hope to see
benefits in forwarding with multiple interfaces. This is also why we
want to have some "easy" way for admins to configure this. Also,
integration with RPS is on the todo list.

> Adding more threads is the natural answer of userland programmers, but
> in the kernel it is not the right answer. We already have mechanisms;
> just use them and fix them if they are broken.
> 
> Sorry, I really do not think your patches are the way to go.
> But this thread is definitely interesting.

I am fine with that. It is up to us to show clear benefits or use cases
for that. If we fail to do so, no problem at all if the patches get
rejected; we don't want to add bloat to the kernel, for sure! At this
point I still think the possibility to run napi in kthreads will allow
specific workloads to see an improvement. Maybe the simple branch in
napi_schedule is worthwhile just so people can play around with it. As it
shouldn't change behavior, we can later on simply remove it.

Thanks,
Hannes
Hannes Frederic Sowa May 11, 2016, 1:47 p.m. UTC | #6
On 11.05.2016 15:39, Hannes Frederic Sowa wrote:
> I am fine with that. It is up to us to show clear benefits or use cases
> for that. If we fail to do so, no problem at all if the patches get
> rejected; we don't want to add bloat to the kernel, for sure! At this
> point I still think the possibility to run napi in kthreads will allow
> specific workloads to see an improvement. Maybe the simple branch in
> napi_schedule is worthwhile just so people can play around with it. As it
> shouldn't change behavior, we can later on simply remove it.

Actually, I consider it a bug that, when we force the kernel to use
threaded irqs, we only schedule the softirq for napi later on and
don't do the processing within the thread.

Bye,
Hannes
Paolo Abeni May 11, 2016, 2:38 p.m. UTC | #7
On Wed, 2016-05-11 at 06:08 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
> > Hi Eric,
> > On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> > > On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> > > 
> > > > Not only did we want to present this solely as a bugfix but also as
> > > > performance enhancements in case of virtio (as you can see in the cover
> > > > letter). Given that a long time ago there was a tendency to remove
> > > > softirqs completely, we thought it might be very interesting, that a
> > > > threaded napi in general seems to be absolutely viable nowadays and
> > > > might offer new features.
> > > 
> > > Well, you did not fix the bug; you worked around it by adding yet another
> > > layer, with another sysctl that admins or programs have to manage.
> > > 
> > > If you have a special need for virtio, do not hide it behind a 'bug fix'
> > > but add it as a feature request.
> > > 
> > > This ksoftirqd issue is real and a fix looks very reasonable.
> > > 
> > > Please try this patch, as I had very good success with it.
> > 
> > Thank you for your time and your effort.
> > 
> > I tested your patch on the bare metal "single core" scenario, disabling
> > the unneeded cores with:
> > CPUS=`nproc`
> > for I in `seq 1 $CPUS`; do echo 0  >  /sys/devices/system/node/node0/cpu$I/online; done
> > 
> > And I got a:
> > 
> > [   86.925249] Broke affinity for irq <num>
> > 
> 
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?

The above message is emitted with pr_notice() by the x86 version of
fixup_irqs(). It's not fatal; the host is alive and well after that. The
unpatched kernel does not emit it when disabling cpus.

I'll try to look into this later.

> Looks like another bug to fix then ? We disabled CPU hotplug here at Google
> for our production, as it was notoriously buggy. No time to fix dozens
> of issues added by a crowd of developers that do not even know a cpu can
> be unplugged.
> 
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.
> 
> > for each irq number generated by a network device.
> > 
> > In this scenario, your patch solves the ksoftirqd issue, performing
> > comparably to the napi threaded patches (with a negative delta in the
> > noise range) and introducing a minor regression with a single flow, in
> > the noise range (3%).
> > 
> > As said in a previous mail, we actually experimented with something similar,
> > but it felt quite hackish.
> 
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.
> 
> > 
> > AFAICS this patch adds three more tests in the fast path and affects all
> > other softirq use cases. I'm not sure how to check for regressions there.
> 
> It is obvious to me that the ksoftirqd mechanism is not working as intended.
> 
> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
> 
> I can not tell before trying.
> 
> Just by looking at /proc/{ksoftirqd_pid}/sched you can see the problem:
> we normally schedule ksoftirqd under stress, but most of the time
> the softirq items were processed by other tasks, as you found out.
> 
> 
> > 
> > The napi thread patches are actually a new feature that also fixes the
> > ksoftirqd issue: hunting the ksoftirqd issue has been the initial
> > trigger for this work. I'm sorry for not being clear enough in the cover
> > letter.
> > 
> > The napi thread patches offer additional benefits, i.e. an additional
> > relevant gain in the described test scenario, and do not impact other
> > subsystems/kernel entities.
> > 
> > I still think they are worthwhile, and I bet you would disagree, but could
> > you please articulate more which parts concern you most and/or are more
> > bloated ?
> 
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how can jiffies ever change since you block BH ?

Uh, we likely have the same issue in the net_rx_action() function, which
also executes with bh disabled and checks for jiffies changes, even on
single core hosts ?!?

Aren't jiffies updated by the timer interrupt, and thus even with
bh disabled ?!?

> I was planning to remove cond_resched_softirq(), which we no longer use
> after my recent changes to the TCP stack, and you call it again (while it
> is obviously buggy, since it does not check if a BH is pending, only if a
> thread needs the cpu).

I missed that, thank you for pointing out.

> I prefer fixing the existing code, really. It took us years to
> understand it and maybe fix it.
> 
> Just think of what will happen if you have 10 devices (10 new threads in
> your model) and one cpu.
> 
> Instead of the nice existing netif_rx() doing rounds of 64 packets per
> device, you'll now rely on process scheduler behavior that has no such
> granularity.
> 
> Adding more threads is the natural answer of userland programmers, but
> in the kernel it is not the right answer. We already have mechanisms;
> just use them and fix them if they are broken.
> 
> Sorry, I really do not think your patches are the way to go.
> But this thread is definitely interesting.

Oh, this is a far better comment than I would have expected ;-)

Cheers,

Paolo
Eric Dumazet May 11, 2016, 2:40 p.m. UTC | #8
On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:

> This looks racy to me, as ksoftirqd could be in the process of stopping
> and we would miss another softirq invocation.

Looking at smpboot_thread_fn(), it looks fine :

                if (!ht->thread_should_run(td->cpu)) {
                        preempt_enable_no_resched();
                        schedule();
                } else {
                        __set_current_state(TASK_RUNNING);
                        preempt_enable();
                        ht->thread_fn(td->cpu);
                }
Eric Dumazet May 11, 2016, 2:45 p.m. UTC | #9
On Wed, May 11, 2016 at 7:38 AM, Paolo Abeni <pabeni@redhat.com> wrote:

> Uh, we likely have the same issue in the net_rx_action() function, which
> also executes with bh disabled and checks for jiffies changes, even on
> single core hosts ?!?

That is why we have a loop break after netdev_budget=300 packets.
And a sysctl to tune this if needed.

Same issue for softirq handler, look at commit
34376a50fb1fa095b9d0636fa41ed2e73125f214

Your questions about this central piece of networking code are worrying.

>
> Aren't jiffies updated by the timer interrupt, and thus even with
> bh disabled ?!?

Exactly my point : jiffies won't be updated in your code, since you block BH.
Rik van Riel May 11, 2016, 3:01 p.m. UTC | #10
On Wed, 2016-05-11 at 07:40 -0700, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
> > This looks racy to me, as ksoftirqd could be in the process of stopping
> > and we would miss another softirq invocation.
> 
> Looking at smpboot_thread_fn(), it looks fine :
> 

Additionally, we are talking about waking up
ksoftirqd on the same CPU.

That means the wakeup code could interrupt
ksoftirqd just as it is going to sleep, but the
two code paths could not run simultaneously.

That does narrow the scope considerably.
Eric Dumazet May 11, 2016, 3:50 p.m. UTC | #11
On Wed, 2016-05-11 at 07:40 -0700, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 6:13 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> 
> > This looks racy to me, as ksoftirqd could be in the process of stopping
> > and we would miss another softirq invocation.
> 
> Looking at smpboot_thread_fn(), it looks fine :
> 
>                 if (!ht->thread_should_run(td->cpu)) {
>                         preempt_enable_no_resched();
>                         schedule();
>                 } else {
>                         __set_current_state(TASK_RUNNING);
>                         preempt_enable();
>                         ht->thread_fn(td->cpu);
>                 }

BTW, I wonder why we pass td->cpu as the argument to ht->thread_fn(td->cpu).

This should always be the current processor id.

Or do we have an issue because we ignore it in :

static int ksoftirqd_should_run(unsigned int cpu)
{
        return local_softirq_pending();
}
Hannes Frederic Sowa May 11, 2016, 10:47 p.m. UTC | #12
On 11.05.2016 16:45, Eric Dumazet wrote:
> On Wed, May 11, 2016 at 7:38 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> 
>> Uh, we likely have the same issue in the net_rx_action() function, which
>> also executes with bh disabled and checks for jiffies changes, even on
>> single core hosts ?!?
> 
> That is why we have a loop break after netdev_budget=300 packets.
> And a sysctl to tune this if needed.
> 
> Same issue for softirq handler, look at commit
> 34376a50fb1fa095b9d0636fa41ed2e73125f214
> 
> Your questions about this central piece of networking code are worrying.
> 
>>
>> Aren't jiffies updated by the timer interrupt, and thus even with
>> bh disabled ?!?
> 
> Exactly my point : jiffies won't be updated in your code, since you block BH.

To be fair, jiffies get updated in hardirq and not softirq context. But
cond_resched_softirq() not checking for pending softirqs is indeed a problem.

Thanks,
Hannes

Patch

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..22463217e3cf 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,7 @@  EXPORT_SYMBOL(irq_stat);
 static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(bool, ksoftirqd_scheduled);
 
 const char * const softirq_to_name[NR_SOFTIRQS] = {
 	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
@@ -73,8 +74,10 @@  static void wakeup_softirqd(void)
 	/* Interrupts are disabled: no need to stop preemption */
 	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
 
-	if (tsk && tsk->state != TASK_RUNNING)
+	if (tsk && tsk->state != TASK_RUNNING) {
+		__this_cpu_write(ksoftirqd_scheduled, true);
 		wake_up_process(tsk);
+	}
 }
 
 /*
@@ -162,7 +165,9 @@  void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 	 */
 	preempt_count_sub(cnt - 1);
 
-	if (unlikely(!in_interrupt() && local_softirq_pending())) {
+	if (unlikely(!in_interrupt() &&
+		     local_softirq_pending() &&
+		     !__this_cpu_read(ksoftirqd_scheduled))) {
 		/*
 		 * Run softirq if any pending. And do it in its own stack
 		 * as we may be calling this deep in a task call stack already.
@@ -340,6 +345,9 @@  void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+	if (__this_cpu_read(ksoftirqd_scheduled))
+		return;
+
 	if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
@@ -660,6 +668,8 @@  static void run_ksoftirqd(unsigned int cpu)
 		 * in the task stack here.
 		 */
 		__do_softirq();
+		if (!local_softirq_pending())
+			__this_cpu_write(ksoftirqd_scheduled, false);
 		local_irq_enable();
 		cond_resched_rcu_qs();
 		return;