[1/4] sched: move IO scheduling accounting from io_schedule_timeout() to __schedule()

Message ID 1477673892-28940-2-git-send-email-tj@kernel.org
State New, archived

Commit Message

Tejun Heo Oct. 28, 2016, 4:58 p.m. UTC
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule().  This makes it tedious to add IO
blocking to existing interfaces, as the schedule() calls that would
need switching are often buried deep inside them.

As we already have a way to mark a task as scheduling for IO, this
can be made easier by separating io_schedule() into multiple steps,
so that the IO scheduling preparation can be performed before
invoking a blocking interface and the actual accounting happens
inside schedule().

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling.  This patch moves #3 into
__schedule() so that later patches can separate out preparation and
finish steps from io_schedule().
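
To illustrate the direction, the split could look something like the
sketch below (helper names are illustrative only, not part of this
patch):

/* sketch only: possible prepare/finish helpers built on this change */
static inline int io_schedule_prepare(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;			/* #1: mark as IO sleep */
	blk_schedule_flush_plug(current);	/* #2: flush plugged IOs */
	return old_iowait;
}

static inline void io_schedule_finish(int token)
{
	current->in_iowait = token;		/* restore previous state */
}

With #3 performed inside __schedule() whenever ->in_iowait is set,
any blocking call bracketed by these helpers is accounted as IO
sleep.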

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 kernel/sched/core.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

Comments

Peter Zijlstra Oct. 28, 2016, 6:27 p.m. UTC | #1
On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
>  	unsigned long *switch_count;
>  	struct pin_cookie cookie;
>  	struct rq *rq;
> -	int cpu;
> +	int cpu, in_iowait;
>  
>  	cpu = smp_processor_id();
>  	rq = cpu_rq(cpu);
>  	prev = rq->curr;
> +	in_iowait = prev->in_iowait;
> +
> +	if (in_iowait) {
> +		delayacct_blkio_start();
> +		atomic_inc(&rq->nr_iowait);
> +	}
>  
>  	schedule_debug(prev);
>  
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
>  	}
>  
>  	balance_callback(rq);
> +
> +	if (in_iowait) {
> +		atomic_dec(&rq->nr_iowait);
> +		delayacct_blkio_end();
> +	}
>  }
>  
>  void __noreturn do_task_dead(void)

Urgh, can't say I like this much. It moves two branches into the
schedule path.

Nor do I really like the idea of having to annotate special mutexes for
the iowait crap.

I'll think more after KS/LPC etc..
Peter Zijlstra Oct. 28, 2016, 7:07 p.m. UTC | #2
On Fri, Oct 28, 2016 at 08:27:12PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> >  	unsigned long *switch_count;
> >  	struct pin_cookie cookie;
> >  	struct rq *rq;
> > -	int cpu;
> > +	int cpu, in_iowait;
> >  
> >  	cpu = smp_processor_id();
> >  	rq = cpu_rq(cpu);
> >  	prev = rq->curr;
> > +	in_iowait = prev->in_iowait;
> > +
> > +	if (in_iowait) {
> > +		delayacct_blkio_start();
> > +		atomic_inc(&rq->nr_iowait);
> > +	}
> >  
> >  	schedule_debug(prev);
> >  
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> >  	}
> >  
> >  	balance_callback(rq);
> > +
> > +	if (in_iowait) {
> > +		atomic_dec(&rq->nr_iowait);
> > +		delayacct_blkio_end();
> > +	}
> >  }
> >  
> >  void __noreturn do_task_dead(void)
> 
> Urgh, can't say I like this much. It moves two branches into the
> schedule path.
> 
> Nor do I really like the idea of having to annotate special mutexes for
> the iowait crap.
> 
> I'll think more after KS/LPC etc..

One alternative is to inherit the iowait state of the task we block on.
That'll not get rid of the branches much, but it will remove the new
mutex APIs.
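
A rough sketch of the idea (hypothetical; names are illustrative and
it assumes the owner is always tracked, as with the mutex owner
rework pending at the time):

/* inherit iowait from the lock owner before blocking (sketch only) */
static void inherit_iowait_from_owner(struct mutex *lock)
{
	struct task_struct *owner = __mutex_owner(lock);

	/*
	 * If the owner went to sleep waiting for IO, count our sleep
	 * on this mutex as iowait too.  This would be called from the
	 * mutex slowpath just before scheduling.
	 */
	if (owner && owner->in_iowait)
		current->in_iowait = 1;
}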
Tejun Heo Oct. 28, 2016, 7:12 p.m. UTC | #3
Hello, Peter.

On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> One alternative is to inherit the iowait state of the task we block on.
> That'll not get rid of the branches much, but it will remove the new
> mutex APIs.

Yeah, thought about that briefly but we don't necessarily track mutex
or other synchronization construct owners, things get gnarly with
rwsems (the inode ones sometimes end up in a similar situation), and
we'll probably end up dealing with some surprising propagations down
the line.  That said, getting such automatic propagation working would
be great.

Thanks.
Peter Zijlstra Oct. 29, 2016, 3:21 a.m. UTC | #4
On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > One alternative is to inherit the iowait state of the task we block on.
> > That'll not get rid of the branches much, but it will remove the new
> > mutex APIs.
> 
> Yeah, thought about that briefly but we don't necessarily track mutex

This one I actually fixed and should be in -next. And it would be
sufficient to cover the use case here.

> or other synchronization construct owners, things get gnarly with
> rwsems (the inode ones sometimes end up in a similar situation), and
> we'll probably end up dealing with some surprising propagations down
> the line.

rwsems could be done for writers only.
Tejun Heo Oct. 31, 2016, 4:45 p.m. UTC | #5
Hello,

On Sat, Oct 29, 2016 at 05:21:26AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > > One alternative is to inherit the iowait state of the task we block on.
> > > That'll not get rid of the branches much, but it will remove the new
> > > mutex APIs.
> > 
> > Yeah, thought about that briefly but we don't necessarily track mutex
> 
> This one I actually fixed and should be in -next. And it would be
> sufficient to cover the use case here.

Tracking the owners of mutexes and rwsems does help quite a bit.  I
don't think it's as simple as inheriting io sleep state from the
current owner tho.  The owner might be running or in a non-IO sleep
when others try to grab the mutex.  It is an option to ignore those
cases but that would have a real possibility of leading to surprising
results in some corner cases.  If we choose to propagate dynamically,
it becomes a much more complex problem and I don't think it'd be
justifiable.

Unless there can be a simple enough and reliable solution, I think
it'd be better to stick with explicit marking.
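
For example, the explicit marking could be a thin IO-aware wrapper
(sketch only, building on the prepare/finish split sketched for this
series):

/* sketch: an explicitly IO-marked mutex lock */
static inline void mutex_lock_io(struct mutex *lock)
{
	int token = io_schedule_prepare();	/* mark task + flush plug */

	mutex_lock(lock);		/* sleep gets accounted as iowait */
	io_schedule_finish(token);	/* restore previous iowait state */
}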

Thanks.
Pavan Kondeti Nov. 3, 2016, 3:33 p.m. UTC | #6
Hi Tejun,

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1..f6baa38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
>         unsigned long *switch_count;
>         struct pin_cookie cookie;
>         struct rq *rq;
> -       int cpu;
> +       int cpu, in_iowait;
>
>         cpu = smp_processor_id();
>         rq = cpu_rq(cpu);
>         prev = rq->curr;
> +       in_iowait = prev->in_iowait;
> +
> +       if (in_iowait) {
> +               delayacct_blkio_start();
> +               atomic_inc(&rq->nr_iowait);
> +       }
>
>         schedule_debug(prev);
>
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
>         }
>
>         balance_callback(rq);
> +
> +       if (in_iowait) {
> +               atomic_dec(&rq->nr_iowait);
> +               delayacct_blkio_end();
> +       }
>  }

I think the nr_iowait update can go wrong here.

When the task migrates to a different CPU upon wakeup, this rq points
to a different CPU from the one on which nr_iowait was incremented
earlier.

Thanks,
Pavan
Tejun Heo Nov. 8, 2016, 10:51 p.m. UTC | #7
Hello,

On Thu, Nov 03, 2016 at 09:03:45PM +0530, Pavan Kondeti wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 94732d1..f6baa38 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> >         unsigned long *switch_count;
> >         struct pin_cookie cookie;
> >         struct rq *rq;
> > -       int cpu;
> > +       int cpu, in_iowait;
> >
> >         cpu = smp_processor_id();
> >         rq = cpu_rq(cpu);
> >         prev = rq->curr;
> > +       in_iowait = prev->in_iowait;
> > +
> > +       if (in_iowait) {
> > +               delayacct_blkio_start();
> > +               atomic_inc(&rq->nr_iowait);
> > +       }
> >
> >         schedule_debug(prev);
> >
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> >         }
> >
> >         balance_callback(rq);
> > +
> > +       if (in_iowait) {
> > +               atomic_dec(&rq->nr_iowait);
> > +               delayacct_blkio_end();
> > +       }
> >  }
> 
> I think the nr_iowait update can go wrong here.
> 
> When the task migrates to a different CPU upon wakeup, this rq points
> to a different CPU from the one on which nr_iowait was incremented
> earlier.

Ah, you're right, it should remember the original rq.
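
Something like this, roughly (untested sketch; orig_rq is just an
illustrative name):

	struct rq *orig_rq = NULL;	/* rq nr_iowait was bumped on */

	if (prev->in_iowait) {
		delayacct_blkio_start();
		orig_rq = rq;
		atomic_inc(&orig_rq->nr_iowait);
	}

	/* ... context switch; we may resume on a different CPU ... */

	if (orig_rq) {
		/* decrement on the rq that got the increment */
		atomic_dec(&orig_rq->nr_iowait);
		delayacct_blkio_end();
	}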

Thanks.
Tejun Heo Dec. 6, 2016, 9:30 p.m. UTC | #8
Hello,

On Mon, Oct 31, 2016 at 10:45:56AM -0600, Tejun Heo wrote:
> Tracking the owners of mutexes and rwsems does help quite a bit.  I
> don't think it's as simple as inheriting io sleep state from the
> current owner tho.  The owner might be running or in a non-IO sleep
> when others try to grab the mutex.  It is an option to ignore those
> cases but that would have a real possibility of leading to surprising
> results in some corner cases.  If we choose to propagate dynamically,
> it becomes a much more complex problem and I don't think it'd be
> justifiable.
> 
> Unless there can be a simple enough and reliable solution, I think
> it'd be better to stick with explicit marking.

Just posted the fixed version for the first patch.  Any more thoughts
on this?

Thanks.

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..f6baa38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
 	unsigned long *switch_count;
 	struct pin_cookie cookie;
 	struct rq *rq;
-	int cpu;
+	int cpu, in_iowait;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
+	in_iowait = prev->in_iowait;
+
+	if (in_iowait) {
+		delayacct_blkio_start();
+		atomic_inc(&rq->nr_iowait);
+	}
 
 	schedule_debug(prev);
 
@@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
 	}
 
 	balance_callback(rq);
+
+	if (in_iowait) {
+		atomic_dec(&rq->nr_iowait);
+		delayacct_blkio_end();
+	}
 }
 
 void __noreturn do_task_dead(void)
@@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);
 
-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
 
 	return ret;
 }