Message ID: 1477673892-28940-2-git-send-email-tj@kernel.org
State: New, archived
On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
>  	unsigned long *switch_count;
>  	struct pin_cookie cookie;
>  	struct rq *rq;
> -	int cpu;
> +	int cpu, in_iowait;
> 
>  	cpu = smp_processor_id();
>  	rq = cpu_rq(cpu);
>  	prev = rq->curr;
> +	in_iowait = prev->in_iowait;
> +
> +	if (in_iowait) {
> +		delayacct_blkio_start();
> +		atomic_inc(&rq->nr_iowait);
> +	}
> 
>  	schedule_debug(prev);
> 
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
>  	}
> 
>  	balance_callback(rq);
> +
> +	if (in_iowait) {
> +		atomic_dec(&rq->nr_iowait);
> +		delayacct_blkio_end();
> +	}
>  }
> 
>  void __noreturn do_task_dead(void)

Urgh, can't say I like this much. It moves two branches into the
schedule path.

Nor do I really like the idea of having to annotate special mutexes for
the iowait crap.

I'll think more after KS/LPC etc..
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Oct 28, 2016 at 08:27:12PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:58:09PM -0400, Tejun Heo wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> >  	unsigned long *switch_count;
> >  	struct pin_cookie cookie;
> >  	struct rq *rq;
> > -	int cpu;
> > +	int cpu, in_iowait;
> > 
> >  	cpu = smp_processor_id();
> >  	rq = cpu_rq(cpu);
> >  	prev = rq->curr;
> > +	in_iowait = prev->in_iowait;
> > +
> > +	if (in_iowait) {
> > +		delayacct_blkio_start();
> > +		atomic_inc(&rq->nr_iowait);
> > +	}
> > 
> >  	schedule_debug(prev);
> > 
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> >  	}
> > 
> >  	balance_callback(rq);
> > +
> > +	if (in_iowait) {
> > +		atomic_dec(&rq->nr_iowait);
> > +		delayacct_blkio_end();
> > +	}
> >  }
> > 
> >  void __noreturn do_task_dead(void)
> 
> Urgh, can't say I like this much. It moves two branches into the
> schedule path.
> 
> Nor do I really like the idea of having to annotate special mutexes for
> the iowait crap.
> 
> I'll think more after KS/LPC etc..

One alternative is to inherit the iowait state of the task we block on.
That'll not get rid of the branches much, but it will remove the new
mutex APIs.
Hello, Peter.

On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> One alternative is to inherit the iowait state of the task we block on.
> That'll not get rid of the branches much, but it will remove the new
> mutex APIs.

Yeah, thought about that briefly but we don't necessarily track mutex
or other synchronization construct owners, things get gnarly with
rwsems (the inode ones sometimes end up in a similar situation), and
we'll probably end up dealing with some surprising propagations down
the line.

That said, getting such automatic propagation working would be great.

Thanks.
On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > One alternative is to inherit the iowait state of the task we block on.
> > That'll not get rid of the branches much, but it will remove the new
> > mutex APIs.
> 
> Yeah, thought about that briefly but we don't necessarily track mutex

This one I actually fixed and should be in -next. And it would be
sufficient to cover the use case here.

> or other synchronization construct owners, things get gnarly with
> rwsems (the inode ones sometimes end up in a similar situation), and
> we'll probably end up dealing with some surprising propagations down
> the line.

rwsems could be done for writers only.
Hello,

On Sat, Oct 29, 2016 at 05:21:26AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 03:12:32PM -0400, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Fri, Oct 28, 2016 at 09:07:02PM +0200, Peter Zijlstra wrote:
> > > One alternative is to inherit the iowait state of the task we block on.
> > > That'll not get rid of the branches much, but it will remove the new
> > > mutex APIs.
> > 
> > Yeah, thought about that briefly but we don't necessarily track mutex
> 
> This one I actually fixed and should be in -next. And it would be
> sufficient to cover the use case here.

Tracking the owners of mutexes and rwsems does help quite a bit. I
don't think it's as simple as inheriting io sleep state from the
current owner tho. The owner might be running or in a non-IO sleep
when others try to grab the mutex. It is an option to ignore those
cases but this would have a real possibility of leading to surprising
results in some corner cases. If we choose to propagate dynamically,
it becomes a lot more complex problem and I don't think it'd be
justifiable.

Unless there can be a simple enough and reliable solution, I think
it'd be better to stick with explicit marking.

Thanks.
Hi Tejun,

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1..f6baa38 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
>  	unsigned long *switch_count;
>  	struct pin_cookie cookie;
>  	struct rq *rq;
> -	int cpu;
> +	int cpu, in_iowait;
> 
>  	cpu = smp_processor_id();
>  	rq = cpu_rq(cpu);
>  	prev = rq->curr;
> +	in_iowait = prev->in_iowait;
> +
> +	if (in_iowait) {
> +		delayacct_blkio_start();
> +		atomic_inc(&rq->nr_iowait);
> +	}
> 
>  	schedule_debug(prev);
> 
> @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
>  	}
> 
>  	balance_callback(rq);
> +
> +	if (in_iowait) {
> +		atomic_dec(&rq->nr_iowait);
> +		delayacct_blkio_end();
> +	}
>  }

I think the nr_iowait update can go wrong here.

When the task migrates to a different CPU upon wakeup, this rq points
to a different CPU from the one on which nr_iowait was incremented
before.

Thanks,
Pavan
Hello,

On Thu, Nov 03, 2016 at 09:03:45PM +0530, Pavan Kondeti wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 94732d1..f6baa38 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
> >  	unsigned long *switch_count;
> >  	struct pin_cookie cookie;
> >  	struct rq *rq;
> > -	int cpu;
> > +	int cpu, in_iowait;
> > 
> >  	cpu = smp_processor_id();
> >  	rq = cpu_rq(cpu);
> >  	prev = rq->curr;
> > +	in_iowait = prev->in_iowait;
> > +
> > +	if (in_iowait) {
> > +		delayacct_blkio_start();
> > +		atomic_inc(&rq->nr_iowait);
> > +	}
> > 
> >  	schedule_debug(prev);
> > 
> > @@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
> >  	}
> > 
> >  	balance_callback(rq);
> > +
> > +	if (in_iowait) {
> > +		atomic_dec(&rq->nr_iowait);
> > +		delayacct_blkio_end();
> > +	}
> >  }
> 
> I think the nr_iowait update can go wrong here.
> 
> When the task migrates to a different CPU upon wakeup, this rq points
> to a different CPU from the one on which nr_iowait was incremented
> before.

Ah, you're right, it should remember the original rq.

Thanks.
Hello,

On Mon, Oct 31, 2016 at 10:45:56AM -0600, Tejun Heo wrote:
> Tracking the owners of mutexes and rwsems does help quite a bit. I
> don't think it's as simple as inheriting io sleep state from the
> current owner tho. The owner might be running or in a non-IO sleep
> when others try to grab the mutex. It is an option to ignore those
> cases but this would have a real possibility of leading to surprising
> results in some corner cases. If we choose to propagate dynamically,
> it becomes a lot more complex problem and I don't think it'd be
> justifiable.
> 
> Unless there can be a simple enough and reliable solution, I think
> it'd be better to stick with explicit marking.

Just posted the fixed version for the first patch. Any more thoughts
on this?

Thanks.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..f6baa38 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3336,11 +3336,17 @@ static void __sched notrace __schedule(bool preempt)
 	unsigned long *switch_count;
 	struct pin_cookie cookie;
 	struct rq *rq;
-	int cpu;
+	int cpu, in_iowait;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
+	in_iowait = prev->in_iowait;
+
+	if (in_iowait) {
+		delayacct_blkio_start();
+		atomic_inc(&rq->nr_iowait);
+	}
 
 	schedule_debug(prev);
 
@@ -3406,6 +3412,11 @@ static void __sched notrace __schedule(bool preempt)
 	}
 
 	balance_callback(rq);
+
+	if (in_iowait) {
+		atomic_dec(&rq->nr_iowait);
+		delayacct_blkio_end();
+	}
 }
 
 void __noreturn do_task_dead(void)
@@ -5063,19 +5074,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 long __sched io_schedule_timeout(long timeout)
 {
 	int old_iowait = current->in_iowait;
-	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
 	blk_schedule_flush_plug(current);
 
-	delayacct_blkio_start();
-	rq = raw_rq();
-	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	current->in_iowait = old_iowait;
-	atomic_dec(&rq->nr_iowait);
-	delayacct_blkio_end();
 
 	return ret;
 }
For an interface to support blocking for IOs, it must call
io_schedule() instead of schedule(). This makes it tedious to add IO
blocking to existing interfaces as the switching between schedule()
and io_schedule() is often buried deep.

As we already have a way to mark the task as IO scheduling, this can
be made easier by separating out io_schedule() into multiple steps so
that IO schedule preparation can be performed before invoking a
blocking interface and the actual accounting happens inside
schedule().

io_schedule_timeout() does the following three things prior to calling
schedule_timeout().

 1. Mark the task as scheduling for IO.
 2. Flush out plugged IOs.
 3. Account the IO scheduling.

#1 and #2 can be performed in the preparation step while #3 must be
done close to the actual scheduling. This patch moves #3 into
__schedule() so that later patches can separate out preparation and
finish steps from io_schedule().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 kernel/sched/core.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)