Patchwork libata: use single threaded work queue

Submitter Jens Axboe
Date Aug. 19, 2009, 11:25 a.m.
Message ID <20090819112554.GY12579@kernel.dk>
Permalink /patch/31640/
State Not Applicable
Delegated to: David Miller

Comments

Jens Axboe - Aug. 19, 2009, 11:25 a.m.
Hi,

On boxes with lots of CPUs, we have so many kernel threads it's not
funny. The basic problem is that create_workqueue() creates a per-cpu
thread, where we could easily get by with a single thread for a lot of
cases.

One such case appears to be ata_wq. You want at most one per pio drive,
not one per CPU. I'd suggest just dropping it to a single threaded wq.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Jeff Garzik - Aug. 19, 2009, 11:59 a.m.
On 08/19/2009 07:25 AM, Jens Axboe wrote:
> Hi,
>
> On boxes with lots of CPUs, we have so many kernel threads it's not
> funny. The basic problem is that create_workqueue() creates a per-cpu
> thread, where we could easily get by with a single thread for a lot of
> cases.
>
> One such case appears to be ata_wq. You want at most one per pio drive,
> not one per CPU. I'd suggest just dropping it to a single threaded wq.
>
> Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 072ba5e..0d78628 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -6580,7 +6580,7 @@ static int __init ata_init(void)
>   {
>   	ata_parse_force_param();
>
> -	ata_wq = create_workqueue("ata");
> +	ata_wq = create_singlethread_workqueue("ata");
>   	if (!ata_wq)
>   		goto free_force_tbl;


I agree that one-thread-per-cpu is too much in these modern multi-core 
times, but one thread is too little.  You have essentially re-created 
simplex DMA -- blocking and waiting such that only one drive out of ~4 
can be used at any one time.

ata_pio_task() is in a workqueue so that it can sleep and/or spend a 
long time polling ATA registers.  That means an active task can 
definitely starve all other tasks in the workqueue, if only one thread 
is available.  If starvation occurs, it will potentially pause the 
unrelated task for several seconds.

The proposed patch actually expands an existing problem -- the 
uniprocessor case, where there is only one workqueue thread.  For the 
reasons outlined above, we actually want multiple threads even in the 
UP case.  If you have more than one PIO device, latency is bloody 
awful, with occasional multi-second "hiccups" as one PIO device waits 
for another.  It's an ugly wart that users DO occasionally complain 
about; luckily most users have at most one PIO-polled device.

It would be nice if we could replace this workqueue with a thread pool, 
where thread count inside the pool ranges from zero to $N depending on 
level of thread pool activity.  Our common case in libata would be 
_zero_ threads, if so...

	Jeff



--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe - Aug. 19, 2009, 12:04 p.m.
On Wed, Aug 19 2009, Jeff Garzik wrote:
> On 08/19/2009 07:25 AM, Jens Axboe wrote:
>> Hi,
>>
>> On boxes with lots of CPUs, we have so many kernel threads it's not
>> funny. The basic problem is that create_workqueue() creates a per-cpu
>> thread, where we could easily get by with a single thread for a lot of
>> cases.
>>
>> One such case appears to be ata_wq. You want at most one per pio drive,
>> not one per CPU. I'd suggest just dropping it to a single threaded wq.
>>
>> Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
>>
>> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
>> index 072ba5e..0d78628 100644
>> --- a/drivers/ata/libata-core.c
>> +++ b/drivers/ata/libata-core.c
>> @@ -6580,7 +6580,7 @@ static int __init ata_init(void)
>>   {
>>   	ata_parse_force_param();
>>
>> -	ata_wq = create_workqueue("ata");
>> +	ata_wq = create_singlethread_workqueue("ata");
>>   	if (!ata_wq)
>>   		goto free_force_tbl;
>
>
> I agree with one-thread-per-cpu is too much, in these modern multi-core  
> times, but one thread is too little.  You have essentially re-created  
> simplex DMA -- blocking and waiting such that one drive out of ~4 can be  
> used at any one time.
>
> ata_pio_task() is in a workqueue so that it can sleep and/or spend a  
> long time polling ATA registers.  That means an active task can  
> definitely starve all other tasks in the workqueue, if only one thread  
> is available.  If starvation occurs, it will potentially pause the  
> unrelated task for several seconds.
>
> The proposed patch actually expands an existing problem -- uniprocessor  
> case, where there is only one workqueue thread.  For the reasons  
> outlined above, we actually want multiple threads even in the UP case.  
> If you have more than one PIO device, latency is bloody awful, with  
> occasional multi-second "hiccups" as one PIO devices waits for another.  
> It's an ugly wart that users DO occasionally complain about; luckily  
> most users have at most one PIO polled device.
>
> It would be nice if we could replace this workqueue with a thread pool,  
> where thread count inside the pool ranges from zero to $N depending on  
> level of thread pool activity.  Our common case in libata would be  
> _zero_ threads, if so...

That would be ideal, with N essentially being:

        N = min(nr_cpus, nr_drives_that_need_pio);

How can I easily test whether we will ever need a PIO thread for a
drive in libata? As a simple first patch, I would suggest creating a
single-threaded workqueue per ap instead, if that ata_port would ever
want PIO.
Mark Lord - Aug. 19, 2009, 12:14 p.m.
Jens Axboe wrote:
> On Wed, Aug 19 2009, Jeff Garzik wrote:
>> On 08/19/2009 07:25 AM, Jens Axboe wrote:
>>> Hi,
>>>
>>> On boxes with lots of CPUs, we have so many kernel threads it's not
>>> funny. The basic problem is that create_workqueue() creates a per-cpu
>>> thread, where we could easily get by with a single thread for a lot of
>>> cases.
>>>
>>> One such case appears to be ata_wq. You want at most one per pio drive,
>>> not one per CPU. I'd suggest just dropping it to a single threaded wq.
>>>
>>> Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
>>>
>>> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
>>> index 072ba5e..0d78628 100644
>>> --- a/drivers/ata/libata-core.c
>>> +++ b/drivers/ata/libata-core.c
>>> @@ -6580,7 +6580,7 @@ static int __init ata_init(void)
>>>   {
>>>   	ata_parse_force_param();
>>>
>>> -	ata_wq = create_workqueue("ata");
>>> +	ata_wq = create_singlethread_workqueue("ata");
>>>   	if (!ata_wq)
>>>   		goto free_force_tbl;
>>
>> I agree with one-thread-per-cpu is too much, in these modern multi-core  
>> times, but one thread is too little.  You have essentially re-created  
>> simplex DMA -- blocking and waiting such that one drive out of ~4 can be  
>> used at any one time.
>>
>> ata_pio_task() is in a workqueue so that it can sleep and/or spend a  
>> long time polling ATA registers.  That means an active task can  
>> definitely starve all other tasks in the workqueue, if only one thread  
>> is available.  If starvation occurs, it will potentially pause the  
>> unrelated task for several seconds.
>>
>> The proposed patch actually expands an existing problem -- uniprocessor  
>> case, where there is only one workqueue thread.  For the reasons  
>> outlined above, we actually want multiple threads even in the UP case.  
>> If you have more than one PIO device, latency is bloody awful, with  
>> occasional multi-second "hiccups" as one PIO devices waits for another.  
>> It's an ugly wart that users DO occasionally complain about; luckily  
>> most users have at most one PIO polled device.
>>
>> It would be nice if we could replace this workqueue with a thread pool,  
>> where thread count inside the pool ranges from zero to $N depending on  
>> level of thread pool activity.  Our common case in libata would be  
>> _zero_ threads, if so...
> 
> That would be ideal, N essentially be:
> 
>         N = min(nr_cpus, nr_drives_that_need_pio);
..

No, that would leave just a single thread again for UP.

It would be nice to just create these threads on demand,
and destroy them again after periods of disuse.
Kind of like how Apache does worker threads.

Jens Axboe - Aug. 19, 2009, 12:23 p.m.
On Wed, Aug 19 2009, Mark Lord wrote:
> Jens Axboe wrote:
>> On Wed, Aug 19 2009, Jeff Garzik wrote:
>>> On 08/19/2009 07:25 AM, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> On boxes with lots of CPUs, we have so many kernel threads it's not
>>>> funny. The basic problem is that create_workqueue() creates a per-cpu
>>>> thread, where we could easily get by with a single thread for a lot of
>>>> cases.
>>>>
>>>> One such case appears to be ata_wq. You want at most one per pio drive,
>>>> not one per CPU. I'd suggest just dropping it to a single threaded wq.
>>>>
>>>> Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
>>>>
>>>> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
>>>> index 072ba5e..0d78628 100644
>>>> --- a/drivers/ata/libata-core.c
>>>> +++ b/drivers/ata/libata-core.c
>>>> @@ -6580,7 +6580,7 @@ static int __init ata_init(void)
>>>>   {
>>>>   	ata_parse_force_param();
>>>>
>>>> -	ata_wq = create_workqueue("ata");
>>>> +	ata_wq = create_singlethread_workqueue("ata");
>>>>   	if (!ata_wq)
>>>>   		goto free_force_tbl;
>>>
>>> I agree with one-thread-per-cpu is too much, in these modern 
>>> multi-core  times, but one thread is too little.  You have 
>>> essentially re-created  simplex DMA -- blocking and waiting such that 
>>> one drive out of ~4 can be  used at any one time.
>>>
>>> ata_pio_task() is in a workqueue so that it can sleep and/or spend a  
>>> long time polling ATA registers.  That means an active task can   
>>> definitely starve all other tasks in the workqueue, if only one 
>>> thread  is available.  If starvation occurs, it will potentially 
>>> pause the  unrelated task for several seconds.
>>>
>>> The proposed patch actually expands an existing problem -- 
>>> uniprocessor  case, where there is only one workqueue thread.  For 
>>> the reasons  outlined above, we actually want multiple threads even 
>>> in the UP case.  If you have more than one PIO device, latency is 
>>> bloody awful, with  occasional multi-second "hiccups" as one PIO 
>>> devices waits for another.  It's an ugly wart that users DO 
>>> occasionally complain about; luckily  most users have at most one PIO 
>>> polled device.
>>>
>>> It would be nice if we could replace this workqueue with a thread 
>>> pool,  where thread count inside the pool ranges from zero to $N 
>>> depending on  level of thread pool activity.  Our common case in 
>>> libata would be  _zero_ threads, if so...
>>
>> That would be ideal, N essentially be:
>>
>>         N = min(nr_cpus, nr_drives_that_need_pio);
> ..
>
> No, that would leave just a single thread again for UP.

So one thread per ap would be better, then.

> It would be nice to just create these threads on-demand,
> and destroy them again after periods of dis-use.
> Kind of like how Apache does worker threads.

Well, that's the same thread pool suggestion that Jeff came up with. And
I agree, that's a nicer long-term solution (it's also how the per-bdi
flushing replacement works). The problem with that appears to be that
any suggested patchset for thread pools spirals into certain "but what
color should it be?!" death.

I'll try and work up a simple create_threadpool() implementation; we can
literally cut away hundreds of threads with that.
Jeff Garzik - Aug. 19, 2009, 1:22 p.m.
On 08/19/2009 08:23 AM, Jens Axboe wrote:
> On Wed, Aug 19 2009, Mark Lord wrote:
>> Jens Axboe wrote:
>>> On Wed, Aug 19 2009, Jeff Garzik wrote:
>>>> On 08/19/2009 07:25 AM, Jens Axboe wrote:
>>>>> Hi,
>>>>>
>>>>> On boxes with lots of CPUs, we have so many kernel threads it's not
>>>>> funny. The basic problem is that create_workqueue() creates a per-cpu
>>>>> thread, where we could easily get by with a single thread for a lot of
>>>>> cases.
>>>>>
>>>>> One such case appears to be ata_wq. You want at most one per pio drive,
>>>>> not one per CPU. I'd suggest just dropping it to a single threaded wq.
>>>>>
>>>>> Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
>>>>>
>>>>> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
>>>>> index 072ba5e..0d78628 100644
>>>>> --- a/drivers/ata/libata-core.c
>>>>> +++ b/drivers/ata/libata-core.c
>>>>> @@ -6580,7 +6580,7 @@ static int __init ata_init(void)
>>>>>    {
>>>>>    	ata_parse_force_param();
>>>>>
>>>>> -	ata_wq = create_workqueue("ata");
>>>>> +	ata_wq = create_singlethread_workqueue("ata");
>>>>>    	if (!ata_wq)
>>>>>    		goto free_force_tbl;
>>>>
>>>> I agree with one-thread-per-cpu is too much, in these modern
>>>> multi-core  times, but one thread is too little.  You have
>>>> essentially re-created  simplex DMA -- blocking and waiting such that
>>>> one drive out of ~4 can be  used at any one time.
>>>>
>>>> ata_pio_task() is in a workqueue so that it can sleep and/or spend a
>>>> long time polling ATA registers.  That means an active task can
>>>> definitely starve all other tasks in the workqueue, if only one
>>>> thread  is available.  If starvation occurs, it will potentially
>>>> pause the  unrelated task for several seconds.
>>>>
>>>> The proposed patch actually expands an existing problem --
>>>> uniprocessor  case, where there is only one workqueue thread.  For
>>>> the reasons  outlined above, we actually want multiple threads even
>>>> in the UP case.  If you have more than one PIO device, latency is
>>>> bloody awful, with  occasional multi-second "hiccups" as one PIO
>>>> devices waits for another.  It's an ugly wart that users DO
>>>> occasionally complain about; luckily  most users have at most one PIO
>>>> polled device.
>>>>
>>>> It would be nice if we could replace this workqueue with a thread
>>>> pool,  where thread count inside the pool ranges from zero to $N
>>>> depending on  level of thread pool activity.  Our common case in
>>>> libata would be  _zero_ threads, if so...
>>>
>>> That would be ideal, N essentially be:
>>>
>>>          N = min(nr_cpus, nr_drives_that_need_pio);
>> ..
>>
>> No, that would leave just a single thread again for UP.
>
> So one thread per ap would be better, then.

Yes.


>> It would be nice to just create these threads on-demand,
>> and destroy them again after periods of dis-use.
>> Kind of like how Apache does worker threads.
>
> Well, that's the same thread pool suggestion that Jeff came up with. And
> I agree, that's a nicer long term solution (it's also how the per-bdi
> flushing replacement works). The problem with that appears to be that
> any suggested patchset for thread pools spiral into certain "but what
> color should it be?!" death.

Let people complain with code :)  libata has two basic needs in this area:
(1) specifying a thread count other than "1" or "nr-cpus"
(2) don't start unneeded threads / idle out unused threads


> I'll try and work up a simple create_threadpool() implementation, we can
> literally cut away hundreds of threads with that.

That would be fantastic.  It really _would_ remove hundreds of threads 
on this 2x4 (2 quad-core pkgs) system I have, for example.

I bet systems like those sparc64 Niagaras or 32-core MIPS are completely 
insane with unused kernel threads...

But if create_threadpool() proves too ambitious, having the ability for 
the subsystem to specify number of threads (#1, above) should suffice to 
fix your problem and the existing UP-thread-count problem.

	Jeff


Jeff Garzik - Aug. 19, 2009, 1:28 p.m.
>>> It would be nice to just create these threads on-demand,
>>> and destroy them again after periods of dis-use.
>>> Kind of like how Apache does worker threads.
>>
>> Well, that's the same thread pool suggestion that Jeff came up with. And
>> I agree, that's a nicer long term solution (it's also how the per-bdi
>> flushing replacement works). The problem with that appears to be that
>> any suggested patchset for thread pools spiral into certain "but what
>> color should it be?!" death.
>
> Let people complain with code :) libata has two basic needs in this area:
> (1) specifying a thread count other than "1" or "nr-cpus"
> (2) don't start unneeded threads / idle out unused threads

To be even more general,

libata needs a workqueue or thread pool that can

(a) scale up to nr-drives-that-use-pio threads, on demand
(b) scale down to zero threads, with lack of demand

That handles the worst case of each PIO-polling drive needing to sleep 
(thus massively impacting latency, if any other PIO-polling drive must 
wait for a free thread).

That also handles the best case of not needing any threads at all.

	Jeff


Tejun Heo - Aug. 19, 2009, 2:11 p.m.
Hello, guys.

Jeff Garzik wrote:
>> Let people complain with code :) libata has two basic needs in this area:
>> (1) specifying a thread count other than "1" or "nr-cpus"
>> (2) don't start unneeded threads / idle out unused threads
> 
> To be even more general,
> 
> libata needs a workqueue or thread pool that can
> 
> (a) scale up to nr-drives-that-use-pio threads, on demand
> (b) scale down to zero threads, with lack of demand
> 
> That handles the worst case of each PIO-polling drive needing to sleep
> (thus massively impacting latency, if any other PIO-polling drive must
> wait for a free thread).
> 
> That also handles the best case of not needing any threads at all.

Heh... I've been trying to implement in-kernel media presence polling
and hit about the same problem.  The problem is quite widespread.  The
choice of a multithreaded workqueue was intentional, as Jeff explained.
There are many workqueues which are created in fear of blocking, or
being blocked by, other works, although in most cases that shouldn't
be a problem.  Then there's the newly added async mechanism, which I
don't quite get, as it runs the worker function from a different
environment depending on resource availability - the worker function
might be executed synchronously, where it might have a different
context w.r.t. locking or whatever.

So, I've spent some time thinking about alternative so that things can
be unified.

* Per-cpu binding is good.

* Managing the right level of concurrency isn't easy.  If we try to
  schedule works too soon, we can end up wasting resources and slowing
  things down compared to the current somewhat confined work
  processing.  If works are scheduled too late, resources will be
  underutilized.

* Some workqueues are there to guarantee forward progress and avoid
  deadlocks around the work execution resource (workqueue threads).
  Similar mechanism needs to be in place.

* It would be nice to implement async execution in terms of workqueue
  or even replace it with workqueue.

My somewhat crazy idea was along the following lines.

* All works get queued on a single unified per-cpu work list.

* Perfect level of concurrency can be managed by hooking into
  scheduler and kicking a new worker thread iff the currently running
  worker is about to be scheduled out for whatever reason and there's
  no other worker ready to run.

* Thread pool of a few idle threads is always maintained per cpu and
  they get used by the above scheduler hooking.  When the thread pool
  gets exhausted, manager thread is scheduled instead and replenishes
  the pool.  When there are too many idle threads, the pool size is
  reduced slowly.

* Forward progress can be guaranteed by reserving a single thread for
  any such group of works.  When there are such works pending and the
  manager is invoked to replenish the worker pool, all such works on
  the queue are dispatched to their respective reserved threads.
  Please note that this will happen only rarely, as the worker pool
  size will be kept sufficient and stable most of the time.

* Async can be reimplemented as work which get assigned to cpus in
  round-robin manner.  This wouldn't be perfect but should be enough.

Managing the perfect level of concurrency would have benefits in
resource usages, cache footprint, bandwidth and responsiveness.  I
haven't actually tried to implement the above yet and am still
wondering whether the complexity is justified.

Thanks.
Alan Cox - Aug. 19, 2009, 3:21 p.m.
Somewhat simpler for the general case would be to implement

create_workqueue(min_threads, max_threads, max_thread_idle_time);
queue_work_within();

at which point you can queue work to the workqueue, but with a
per-workqueue timer running so you know when you might need to create a
new thread if the current work hasn't finished.  Idle threads would then
sleep and either expire or pick up new work - so that under load we
don't keep destructing threads.

That might need a single thread (for the system) that does nothing but
create workqueue threads to order. It could manage the "next workqueue
deadline" timer and thread creation. The threads would then pick up their
work queue work. There is an intrinsic race where if we are just on the
limit we might create a thread just as there is no work to be done - but
it would be rare and would just then go away.

I see no point trying to fix ata when applying sanity to the workqueue
logic would sort a lot of other cases out nicely.

Tejun Heo - Aug. 19, 2009, 3:53 p.m.
Hello, Alan.

Alan Cox wrote:
> Somewhat simpler for the general case would be to implement

Yeah, it would be great if there were a much simpler solution.  A
latency-based one could be a good compromise.

> create_workqueue(min_threads,max_threads, max_thread_idle_time);
> queue_work_within();
> 
> at which point you can queue work to the workqueue but with a per
> workqueue timing running so you know when you might need to create a new
> thread if the current work hasn't finished. Idle threads would then sleep
> and either expire or pick up new work - so that under load we don't keep
> destructing threads.

Are work threads per workqueue?  Combined with per-cpu binding, a
dynamic thread pool per workqueue can get quite messy.  All three
factors end up getting multiplied - i.e. #cpus * pool_size (which can
be enlarged by the same work hopping around) * #workqueues.

> That might need a single thread (for the system) that does nothing but
> create workqueue threads to order. It could manage the "next workqueue
> deadline" timer and thread creation.

Another problem is that if we apply this to the existing default
workqueue which is used by many different supposed-to-be-short works
in essentially batch mode, we might end up enlarging cache footprint
by scheduling unnecessarily many threads, which, in tight situations,
might show up as small but noticeable performance regression.

> The threads would then pick up their work queue work. There is an
> intrinsic race where if we are just on the limit we might create a
> thread just as there is no work to be done - but it would be rare
> and would just then go away.

Agreed.  As long as the pool size is reduced gradually maybe with some
buffer, I don't think this would be an issue.

> I see no point tring to fix ata when applying sanity to the workqueue
> logic would sort a lot of other cases out nicely.

About the same problem exists for in-kernel media presence polling.
Currently, I'm thinking about creating a worker thread for each device
which requires polling.  It isn't too bad, but it would still be far
better if I could just schedule a work and not have to worry about
managing concurrency.

Thanks.
Alan Cox - Aug. 19, 2009, 4:15 p.m.
> Are work threads per workqueue?  Combined with per-cpu binding,
> dynamic thread pool per workqueue can get quite messy.  All three
> factors end up getting multiplied - ie. #cpus * pool_size which can be
> enlarged by the same work hopping around * #workqueues.

Stop a moment. Exactly how many work queue users need per cpu binding
wizardry ?

> Another problem is that if we apply this to the existing default
> workqueue which is used by many different supposed-to-be-short works
> in essentially batch mode, we might end up enlarging cache footprint
> by scheduling unnecessarily many threads, which, in tight situations,
> might show up as small but noticeable performance regression.

Only if you make the default assumed max wait time for the work too low -
it's a tunable behaviour, in fact.

Alan
Tejun Heo - Aug. 19, 2009, 4:58 p.m.
Alan Cox wrote:
>> Are work threads per workqueue?  Combined with per-cpu binding,
>> dynamic thread pool per workqueue can get quite messy.  All three
>> factors end up getting multiplied - ie. #cpus * pool_size which can be
>> enlarged by the same work hopping around * #workqueues.
> 
> Stop a moment. Exactly how many work queue users need per cpu binding
> wizardry ?

It's not about needing per-cpu binding but if works can be executed on
the same cpu they were issued, it's almost always beneficial.  The
only reason why we have single threaded workqueue now is to limit the
number of threads.

>> Another problem is that if we apply this to the existing default
>> workqueue which is used by many different supposed-to-be-short works
>> in essentially batch mode, we might end up enlarging cache footprint
>> by scheduling unnecessarily many threads, which, in tight situations,
>> might show up as small but noticeable performance regression.
> 
> Only if you make the default assumed max wait time for the work too low -
> its a tunable behaviour in fact.

If the default workqueue is made to manage concurrency well, most
works should be able to just use it, so the queue will contain both
long-running ones and short-running ones, which can disturb the
current batch-like processing of the default workqueue, which assumes
only short ones.

If we maintain separate workqueues for different purposes, a tuning
knob could be enough.  If we try to manage the whole thing in uniform
manner, a good tune might not exist.

I'm not sure either way yet.  One of the reasons the current situation
is messy is that there are too many segregations among different
thread pools (different workqueues, the async thread pool, other
private kthreads).  It would be great if a single work API were
exported and concurrency managed automatically, so that no one else
has to worry about it; but achieving that requires much more
intelligence in the workqueue implementation, as the basic concurrency
policies which used to be imposed by those segregations would need to
be handled automatically.  Maybe it's a better trade-off to leave
those segregations as-are and just add another workqueue type with a
dynamic thread pool.

Thanks.
Alan Cox - Aug. 19, 2009, 5:23 p.m.
> It's not about needing per-cpu binding but if works can be executed on
> the same cpu they were issued, it's almost always beneficial.  The
> only reason why we have single threaded workqueue now is to limit the
> number of threads.

That would argue very strongly for putting all the logic in one place so
everything shares queues.

> > Only if you make the default assumed max wait time for the work too low -
> > its a tunable behaviour in fact.
> 
> If the default workqueue is made to manage concurrency well, most
> works should be able to just use it, so the queue will contain both
> long running ones and short running ones which can disturb the current
> batch like processing of the default workqueue which is assumed to
> have only short ones.

Not sure why it matters - the short ones will instead end up being
processed serially in parallel to the hog.

> kthreads).  It would be great if a single work API is exported and
> concurrency is managed automatically so that no one else has to worry
> about concurrency but achieving that requires much more intelligence
> on the workqueue implementation as the basic concurrency policies
> which used to be imposed by those segregations need to be handled
> automatically.  Maybe it's better trade-off to leave those
> segregations as-are and just add another workqueue type with dynamic
> thread pool.

The more intelligence in the workqueue logic, the less in the drivers,
and the more it can be adjusted and adapt itself. Consider things like
power management, which might argue for breaking the CPU affinity to
avoid waking up a sleeping CPU, in preference to jumping work between
processors.

Benjamin Herrenschmidt - Aug. 19, 2009, 10:22 p.m.
On Wed, 2009-08-19 at 14:23 +0200, Jens Axboe wrote:
> Well, that's the same thread pool suggestion that Jeff came up with.
> And
> I agree, that's a nicer long term solution (it's also how the per-bdi
> flushing replacement works). The problem with that appears to be that
> any suggested patchset for thread pools spiral into certain "but what
> color should it be?!" death.
> 
> I'll try and work up a simple create_threadpool() implementation, we
> can
> literally cut away hundreds of threads with that.

Don't we have one already? Dave Howells' slow_work or such?

Ben.


Stefan Richter - Aug. 20, 2009, 11:39 a.m.
Alan Cox wrote:
> Somewhat simpler for the general case would be to implement
> 
> create_workqueue(min_threads,max_threads, max_thread_idle_time);
> queue_work_within();
> 
> at which point you can queue work to the workqueue but with a per
> workqueue timing running so you know when you might need to create a new
> thread if the current work hasn't finished. Idle threads would then sleep
> and either expire or pick up new work
[...]

Sounds a lot like kernel/slow-work.c / include/linux/slow-work.h to me
(as Ben already pointed out).
Stefan Richter - Aug. 20, 2009, 12:11 p.m.
Stefan Richter wrote:
> Sounds a lot like kernel/slow-work.c / include/linux/slow-work.h to me
> (as Ben already pointed out).

On the other hand, Jens' new create_lazy_workqueue() from two hours ago
looks very nice in terms of the effort to convert candidate users to it.
Tejun Heo - Aug. 20, 2009, 12:46 p.m.
Hello, Alan.

Alan Cox wrote:
>> It's not about needing per-cpu binding but if works can be executed on
>> the same cpu they were issued, it's almost always beneficial.  The
>> only reason why we have single threaded workqueue now is to limit the
>> number of threads.
> 
> That would argue very strongly for putting all the logic in one place so
> everything shares queues.

Yes, it does.

>>> Only if you make the default assumed max wait time for the work too low -
>>> its a tunable behaviour in fact.
>> If the default workqueue is made to manage concurrency well, most
>> works should be able to just use it, so the queue will contain both
>> long running ones and short running ones which can disturb the current
>> batch-like processing of the default workqueue, which is assumed to
>> have only short ones.
> 
> Not sure why it matters - the short ones will instead end up being
> processed serially in parallel to the hog.

The problem is how to assign works to workers.  With long running
works, workqueue will definitely need some reserves in the worker
pool.  When short works are consecutively queued, without special
provision, they'll end up served by different workers, increasing cache
footprint and execution overhead.  The special provision could be
something timer based but modding timer for each work is a bit
expensive.  I think it needs to be more mechanical rather than depend
on heuristics or timing.

>> kthreads).  It would be great if a single work API is exported and
>> concurrency is managed automatically so that no one else has to worry
>> about concurrency but achieving that requires much more intelligence
>> on the workqueue implementation as the basic concurrency policies
>> which used to be imposed by those segregations need to be handled
>> automatically.  Maybe it's better trade-off to leave those
>> segregations as-are and just add another workqueue type with dynamic
>> thread pool.
> 
> The more intelligence in the workqueue logic, the less in the drivers and
> the more it can be adjusted and adapt itself.

Yeap, sure.

> Consider things like power management which might argue for breaking
> the cpu affinity to avoid waking up a sleeping CPU in preference to
> jumping work between processors

Yeah, that's one thing to consider too, but works being scheduled on a
particular cpu are usually the result of other activity on that cpu.  I
don't think workqueue needs to be modified for that.  If other
things move, workqueue will automatically follow.

Thanks.
Tejun Heo - Aug. 20, 2009, 12:47 p.m.
Benjamin Herrenschmidt wrote:
> On Wed, 2009-08-19 at 14:23 +0200, Jens Axboe wrote:
>> Well, that's the same thread pool suggestion that Jeff came up with.
>> And I agree, that's a nicer long term solution (it's also how the
>> per-bdi flushing replacement works).  The problem with that appears to
>> be that any suggested patchset for thread pools spirals into certain
>> "but what color should it be?!" death.
>>
>> I'll try and work up a simple create_threadpool() implementation; we
>> can literally cut away hundreds of threads with that.
> 
> Don't we have one already ? Dave Howells slow_work or such ?

Yes, it addresses a different aspect of the concurrency problem.  Might
be more suitable for ATA workqueues but definitely more costly to
convert to.  Argh...
Tejun Heo - Aug. 20, 2009, 12:48 p.m.
Tejun Heo wrote:
> Yes, it addresses a different aspect of the concurrency problem.  Might
> be more suitable for ATA workqueues but definitely more costly to
> convert to.  Argh...
            ^ compared to Jens's lazy workqueue.
James Bottomley - Aug. 20, 2009, 2:28 p.m.
On Thu, 2009-08-20 at 21:48 +0900, Tejun Heo wrote:
> Tejun Heo wrote:
> > Yes, it addresses a different aspect of the concurrency problem.  Might
> > be more suitable for ATA workqueues but definitely more costly to
> > convert to.  Argh...
>             ^ compared to Jens's lazy workqueue.

So there are two issues here.  One is ATA's need for execution in user
context that won't block other execution ... I really think that if
there's an existing pattern for this in the kernel, we should use it
rather than inventing our own.

The other is the question of whether the workqueue concept itself is
flawed.  This business of some jobs blocking other jobs due to execution
order on the queue can be a nasty side effect and it can lead to
entangled deadlocks, but for some uses, the whole concept of queued jobs
following a set order is necessary.  It might be appropriate to think
about whether we want to convert the whole workqueue infrastructure to
something like slow_work instead and possibly think about ordering on
top of this.

James



Patch

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 072ba5e..0d78628 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6580,7 +6580,7 @@ static int __init ata_init(void)
 {
 	ata_parse_force_param();
 
-	ata_wq = create_workqueue("ata");
+	ata_wq = create_singlethread_workqueue("ata");
 	if (!ata_wq)
 		goto free_force_tbl;