
[RFC,rdma-next] RDMA/core: Add attribute WQ_MEM_RECLAIM to workqueue "infiniband"

Message ID 1581996935-46507-1-git-send-email-chenglang@huawei.com
State Not Applicable
Delegated to: David Miller
Series [RFC,rdma-next] RDMA/core: Add attribute WQ_MEM_RECLAIM to workqueue "infiniband"

Commit Message

Lang Cheng Feb. 18, 2020, 3:35 a.m. UTC
The hns3 driver sets up the "hclge_service_task" workqueue with
the WQ_MEM_RECLAIM flag in order to guarantee forward progress
under memory pressure. When the hns3 ethernet driver performs a
reset because of a TX timeout or a RAS error, hclge_service_task
unregisters the ib_device before telling the firmware to perform
the hardware reset, in order to disable access to the ib_device.
ib_unregister_device() then calls ib_cache_cleanup_one() to
flush the "infiniband" workqueue, which does not have WQ_MEM_RECLAIM
set, so a WARNING is triggered as below:

[11246.200168] hns3 0000:bd:00.1: Reset done, hclge driver initialization finished.
[11246.209979] hns3 0000:bd:00.1 eth7: net open
[11246.227608] ------------[ cut here ]------------
[11246.237370] workqueue: WQ_MEM_RECLAIM hclge:hclge_service_task [hclge] is flushing !WQ_MEM_RECLAIM infiniband:0x0
[11246.237391] WARNING: CPU: 50 PID: 2279 at ./kernel/workqueue.c:2605 check_flush_dependency+0xcc/0x140
[11246.260412] Modules linked in: hclgevf hns_roce_hw_v2 rdma_test(O) hns3 xt_CHECKSUM iptable_mangle xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter vfio_iommu_type1 vfio_pci vfio_virqfd vfio ib_isert iscsi_target_mod ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi aes_ce_blk crypto_simd cryptd aes_ce_cipher sunrpc nls_iso8859_1 crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce joydev input_leds hid_generic usbkbd usbmouse sbsa_gwdt usbhid usb_storage hid ses hclge hisi_zip hisi_hpre hisi_sec2 hnae3 hisi_qm ahci hisi_trng_v2 evbug uacce rng_core gpio_dwapb autofs4 hisi_sas_v3_hw megaraid_sas hisi_sas_main libsas scsi_transport_sas [last unloaded: hns_roce_hw_v2]
[11246.325742] CPU: 50 PID: 2279 Comm: kworker/50:0 Kdump: loaded Tainted: G           O      5.4.0-rc4+ #1
[11246.335181] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 2280-V2 CS V3.B140.01 12/18/2019
[11246.344802] Workqueue: hclge hclge_service_task [hclge]
[11246.350007] pstate: 60c00009 (nZCv daif +PAN +UAO)
[11246.354779] pc : check_flush_dependency+0xcc/0x140
[11246.359549] lr : check_flush_dependency+0xcc/0x140
[11246.364317] sp : ffff800268a73990
[11246.367618] x29: ffff800268a73990 x28: 0000000000000001 
[11246.372907] x27: ffffcbe4f5868000 x26: ffffcbe4f5541000 
[11246.378196] x25: 00000000000000b8 x24: ffff002fdd0ff868 
[11246.383483] x23: ffff002fdd0ff800 x22: ffff2027401ba600 
[11246.388770] x21: 0000000000000000 x20: ffff002fdd0ff800 
[11246.394059] x19: ffff202719293b00 x18: ffffcbe4f5541948 
[11246.399347] x17: 000000006f8ad8dd x16: 0000000000000002 
[11246.404634] x15: ffff8002e8a734f7 x14: 6c66207369205d65 
[11246.409922] x13: 676c63685b206b73 x12: 61745f6563697672 
[11246.415208] x11: 65735f65676c6368 x10: 3a65676c6368204d 
[11246.420494] x9 : 49414c4345525f4d x8 : 6e6162696e69666e 
[11246.425782] x7 : 69204d49414c4345 x6 : ffffcbe4f5765145 
[11246.431068] x5 : 0000000000000000 x4 : 0000000000000000 
[11246.436355] x3 : 0000000000000030 x2 : 00000000ffffffff 
[11246.441642] x1 : 3349eb1ac5310100 x0 : 0000000000000000 
[11246.446928] Call trace:
[11246.449363]  check_flush_dependency+0xcc/0x140
[11246.453785]  flush_workqueue+0x110/0x410
[11246.457691]  ib_cache_cleanup_one+0x54/0x468
[11246.461943]  __ib_unregister_device+0x70/0xa8
[11246.466279]  ib_unregister_device+0x2c/0x40
[11246.470455]  hns_roce_exit+0x34/0x198 [hns_roce_hw_v2]
[11246.475571]  __hns_roce_hw_v2_uninit_instance.isra.56+0x3c/0x58 [hns_roce_hw_v2]
[11246.482934]  hns_roce_hw_v2_reset_notify+0xd8/0x210 [hns_roce_hw_v2]
[11246.489261]  hclge_notify_roce_client+0x84/0xe0 [hclge]
[11246.494464]  hclge_reset_rebuild+0x60/0x730 [hclge]
[11246.499320]  hclge_reset_service_task+0x400/0x5a0 [hclge]
[11246.504695]  hclge_service_task+0x54/0x698 [hclge]
[11246.509464]  process_one_work+0x15c/0x458
[11246.513454]  worker_thread+0x144/0x520
[11246.517186]  kthread+0xfc/0x128
[11246.520314]  ret_from_fork+0x10/0x18
[11246.523873] ---[ end trace eb980723699c2585 ]---
[11246.528710] hns3 0000:bd:00.2: Func clear success after reset.
[11246.528747] hns3 0000:bd:00.0: Func clear success after reset.
[11246.907710] hns3 0000:bd:00.1 eth7: link up

There are three possible ways to avoid the above warning:
1. Allocate the "hclge_service_task" workqueue without the
WQ_MEM_RECLAIM flag, which may cause a deadlock
when the hns3 driver is used as the low-level transport of
a network file system.

2. Do not unregister the ib_device during the reset process; instead,
only disable access to the ib_device using disable_device(),
as rdma_dev_change_netns() does.

3. Allocate the "infiniband" workqueue with the WQ_MEM_RECLAIM flag.

This patch allocates the "infiniband" workqueue with the WQ_MEM_RECLAIM
flag to avoid the warning.

Fixes: 0ea68902256e ("net: hns3: allocate WQ with WQ_MEM_RECLAIM flag")
Signed-off-by: Lang Cheng <chenglang@huawei.com>
---
 drivers/infiniband/core/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jason Gunthorpe Feb. 18, 2020, 3:31 p.m. UTC | #1
On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
> The hns3 driver sets "hclge_service_task" workqueue with
> WQ_MEM_RECLAIM flag in order to guarantee forward progress
> under memory pressure.

Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
with reclaim processing.

Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
allocations, can't do certain kinds of sleeps, can't hold certain
kinds of locks, etc.

Jason
Yunsheng Lin Feb. 19, 2020, 1:13 a.m. UTC | #2
On 2020/2/18 23:31, Jason Gunthorpe wrote:
> On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
>> The hns3 driver sets "hclge_service_task" workqueue with
>> WQ_MEM_RECLAIM flag in order to guarantee forward progress
>> under memory pressure.
> 
> Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
> with reclaimed processing.
> 
> Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
> allocations, can't do certain kinds of sleeps, can't hold certain
> kinds of locks, etc.

From the mlx5 driver, it seems that there are GFP_KERNEL allocations
on a wq marked with WQ_MEM_RECLAIM too:

mlx5e_tx_timeout_work() -> mlx5e_safe_reopen_channels() ->
mlx5e_safe_switch_channels() -> mlx5e_open_channels()

kcalloc() is called with GFP_KERNEL in mlx5e_open_channels(),
and mlx5e_tx_timeout_work() is queued on priv->wq, which is
allocated with the WQ_MEM_RECLAIM flag; see:

mlx5e_netdev_init() -> create_singlethread_workqueue()


From the comment in kernel/workqueue.c, work queued to a wq with
the WQ_MEM_RECLAIM flag set seems to be executed without blocking
in some rare cases. I still do not quite understand the comment,
and I cannot find any doc that points out that GFP_KERNEL
allocations cannot be done in a wq with WQ_MEM_RECLAIM.
Is there any doc that mentions that GFP_KERNEL allocations
cannot be done in a wq with WQ_MEM_RECLAIM?


/**
 * rescuer_thread - the rescuer thread function
 * @__rescuer: self
 *
 * Workqueue rescuer thread function.  There's one rescuer for each
 * workqueue which has WQ_MEM_RECLAIM set.
 *
 * Regular work processing on a pool may block trying to create a new
 * worker which uses GFP_KERNEL allocation which has slight chance of
 * developing into deadlock if some works currently on the same queue
 * need to be processed to satisfy the GFP_KERNEL allocation.  This is
 * the problem rescuer solves.
 *
 * When such condition is possible, the pool summons rescuers of all
 * workqueues which have works queued on the pool and let them process
 * those works so that forward progress can be guaranteed.
 *
 * This should happen rarely.
 *
 * Return: 0
 */


Below is the reason we set up the "hclge_service_task" workqueue
with WQ_MEM_RECLAIM, based on analysing why other ethernet drivers
allocate their wq with the WQ_MEM_RECLAIM flag; I may be wrong about it:

The hns3 ethernet driver may be used as the low-level transport of a
network file system, and the memory reclaim data path may depend on the
worker in the hns3 driver to bring back the ethernet link so that it can
flush some cache to a network-based disk.

> 
> Jason
> 
>
Leon Romanovsky Feb. 19, 2020, 6:45 a.m. UTC | #3
On Wed, Feb 19, 2020 at 09:13:23AM +0800, Yunsheng Lin wrote:
> On 2020/2/18 23:31, Jason Gunthorpe wrote:
> > On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
> >> The hns3 driver sets "hclge_service_task" workqueue with
> >> WQ_MEM_RECLAIM flag in order to guarantee forward progress
> >> under memory pressure.
> >
> > Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
> > with reclaimed processing.
> >
> > Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
> > allocations, can't do certain kinds of sleeps, can't hold certain
> > kinds of locks, etc.
>
> From mlx5 driver, it seems that there is GFP_KERNEL allocations
> on wq marked with WQ_MEM_RECLAIM too:
>
> mlx5e_tx_timeout_work() -> mlx5e_safe_reopen_channels() ->
> mlx5e_safe_switch_channels() -> mlx5e_open_channels()
>
> kcalloc() is called with GFP_KERNEL in mlx5e_open_channels(),
> and mlx5e_tx_timeout_work() is queued with priv->wq, which is
> allocated with WQ_MEM_RECLAIM flags. see:
>
> mlx5e_netdev_init() -> create_singlethread_workqueue()

There are two reasons for that: first, the mlx5 driver was written far
before WQ_MEM_RECLAIM usage was clarified; second, mlx5 has bugs.

>
>
> From the comment in kernel/workqueue.c, the work queued with
> wq with WQ_MEM_RECLAIM flag set seems to be executed without
> blocking under some rare case. I still not quite understand
> the comment, and I can not find any doc that point out the
> GFP_KERNEL allocations can not be done in wq with WQ_MEM_RECLAIM
> yet. Is there any doc that mentions that GFP_KERNEL allocations
> can not be done in wq with WQ_MEM_RECLAIM?

That is the whole purpose of the WQ_MEM_RECLAIM flag - to allow progress
in case of memory pressure. Allocating memory while we are under memory
pressure is an invitation for disaster.

>
>
> /**
>  * rescuer_thread - the rescuer thread function
>  * @__rescuer: self
>  *
>  * Workqueue rescuer thread function.  There's one rescuer for each
>  * workqueue which has WQ_MEM_RECLAIM set.
>  *
>  * Regular work processing on a pool may block trying to create a new
>  * worker which uses GFP_KERNEL allocation which has slight chance of
>  * developing into deadlock if some works currently on the same queue
>  * need to be processed to satisfy the GFP_KERNEL allocation.  This is
>  * the problem rescuer solves.
>  *
>  * When such condition is possible, the pool summons rescuers of all
>  * workqueues which have works queued on the pool and let them process
>  * those works so that forward progress can be guaranteed.
>  *
>  * This should happen rarely.
>  *
>  * Return: 0
>  */
>
>
> The below is the reason we add the sets "hclge_service_task" workqueue
> with WQ_MEM_RECLAIM through analysing why other ethernet drivers has
> allocated wq with WQ_MEM_RECLAIM flag, I may be wrong about that:

Many drivers are developed using copy/paste technique, so it is wrong
to assume that "other ethernet drivers" did the right thing.

>
> hns3 ethernet driver may be used as the low level transport of a
> network file system, memory reclaim data path may depend on the
> worker in hns3 driver to bring back the ethernet link so that it flush
> the some cache to network based disk.

It is unlikely that this "network file system" dependency on the ethernet link is correct.

Thanks

>
> >
> > Jason
> >
> >
>
Yunsheng Lin Feb. 19, 2020, 7:40 a.m. UTC | #4
+cc Bhaktipriya, Tejun and Jeff

On 2020/2/19 14:45, Leon Romanovsky wrote:
> On Wed, Feb 19, 2020 at 09:13:23AM +0800, Yunsheng Lin wrote:
>> On 2020/2/18 23:31, Jason Gunthorpe wrote:
>>> On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
>>>> The hns3 driver sets "hclge_service_task" workqueue with
>>>> WQ_MEM_RECLAIM flag in order to guarantee forward progress
>>>> under memory pressure.
>>>
>>> Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
>>> with reclaimed processing.
>>>
>>> Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
>>> allocations, can't do certain kinds of sleeps, can't hold certain
>>> kinds of locks, etc.

By the way, what kinds of sleeps and locks cannot be done in work
queued to a wq marked with WQ_MEM_RECLAIM?

>>
>> From mlx5 driver, it seems that there is GFP_KERNEL allocations
>> on wq marked with WQ_MEM_RECLAIM too:
>>
>> mlx5e_tx_timeout_work() -> mlx5e_safe_reopen_channels() ->
>> mlx5e_safe_switch_channels() -> mlx5e_open_channels()
>>
>> kcalloc() is called with GFP_KERNEL in mlx5e_open_channels(),
>> and mlx5e_tx_timeout_work() is queued with priv->wq, which is
>> allocated with WQ_MEM_RECLAIM flags. see:
>>
>> mlx5e_netdev_init() -> create_singlethread_workqueue()
> 
> There are two reasons for that, first mlx5 driver was written far before
> WQ_MEM_RECLAIM usage was clarified, second mlx5 has bugs.
> 
>>
>>
>> From the comment in kernel/workqueue.c, the work queued with
>> wq with WQ_MEM_RECLAIM flag set seems to be executed without
>> blocking under some rare case. I still not quite understand
>> the comment, and I can not find any doc that point out the
>> GFP_KERNEL allocations can not be done in wq with WQ_MEM_RECLAIM
>> yet. Is there any doc that mentions that GFP_KERNEL allocations
>> can not be done in wq with WQ_MEM_RECLAIM?
> 
> It is whole purpose of WQ_MEM_RECLAIM flag - allow progress in case of
> memory pressure. Allocation memory while we are under memory pressure
> is an invitation for a disaster.

Ok, make sense.

> 
>>
>>
>> /**
>>  * rescuer_thread - the rescuer thread function
>>  * @__rescuer: self
>>  *
>>  * Workqueue rescuer thread function.  There's one rescuer for each
>>  * workqueue which has WQ_MEM_RECLAIM set.
>>  *
>>  * Regular work processing on a pool may block trying to create a new
>>  * worker which uses GFP_KERNEL allocation which has slight chance of
>>  * developing into deadlock if some works currently on the same queue
>>  * need to be processed to satisfy the GFP_KERNEL allocation.  This is
>>  * the problem rescuer solves.
>>  *
>>  * When such condition is possible, the pool summons rescuers of all
>>  * workqueues which have works queued on the pool and let them process
>>  * those works so that forward progress can be guaranteed.
>>  *
>>  * This should happen rarely.
>>  *
>>  * Return: 0
>>  */
>>
>>
>> The below is the reason we add the sets "hclge_service_task" workqueue
>> with WQ_MEM_RECLAIM through analysing why other ethernet drivers has
>> allocated wq with WQ_MEM_RECLAIM flag, I may be wrong about that:
> 
> Many drivers are developed using copy/paste technique, so it is wrong
> to assume that "other ethernet drivers" did the right thing.
> 
>>
>> hns3 ethernet driver may be used as the low level transport of a
>> network file system, memory reclaim data path may depend on the
>> worker in hns3 driver to bring back the ethernet link so that it flush
>> the some cache to network based disk.
> 
> Unlikely that this "network file system" dependency on ethernet link is correct.

Ok, I may be wrong about the above use case,
but the commit below explicitly states that network devices may be used
in the memory reclaim path.
0a38c17a21a0 ("fm10k: Remove create_workqueue"):

fm10k: Remove create_workqueue

alloc_workqueue replaces deprecated create_workqueue().

A dedicated workqueue has been used since the workitem (viz
fm10k_service_task, which manages and runs other subtasks) is involved in
normal device operation and requires forward progress under memory
pressure.

create_workqueue has been replaced with alloc_workqueue with max_active
as 0 since there is no need for throttling the number of active work
items.

Since network devices may be used in memory reclaim path,
WQ_MEM_RECLAIM has been set to guarantee forward progress.

flush_workqueue is unnecessary since destroy_workqueue() itself calls
drain_workqueue() which flushes repeatedly till the workqueue
becomes empty. Hence the call to flush_workqueue() has been dropped.

Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

So:
1. Maybe the above commit log is misleading, and a network device driver's
   wq does not need the WQ_MEM_RECLAIM flag; in that case, maybe document
   what cannot be done in work queued to a wq marked with WQ_MEM_RECLAIM,
   and remove the WQ_MEM_RECLAIM flag from the wqs of network device drivers.


2. If a network device driver's wq does need the WQ_MEM_RECLAIM flag, then
   hns3 may have two problems here: the WQ_MEM_RECLAIM wq flushing a
   !WQ_MEM_RECLAIM wq, and GFP_KERNEL allocations in work queued to the
   WQ_MEM_RECLAIM wq.

> 
> Thanks
> 
>>
>>>
>>> Jason
>>>
>>>
>>
> 
> .
>
Leon Romanovsky Feb. 19, 2020, 11:07 a.m. UTC | #5
On Wed, Feb 19, 2020 at 03:40:59PM +0800, Yunsheng Lin wrote:
> +cc Bhaktipriya, Tejun and Jeff
>
> On 2020/2/19 14:45, Leon Romanovsky wrote:
> > On Wed, Feb 19, 2020 at 09:13:23AM +0800, Yunsheng Lin wrote:
> >> On 2020/2/18 23:31, Jason Gunthorpe wrote:
> >>> On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
> >>>> The hns3 driver sets "hclge_service_task" workqueue with
> >>>> WQ_MEM_RECLAIM flag in order to guarantee forward progress
> >>>> under memory pressure.
> >>>
> >>> Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
> >>> with reclaimed processing.
> >>>
> >>> Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
> >>> allocations, can't do certain kinds of sleeps, can't hold certain
> >>> kinds of locks, etc.
>
> By the way, what kind of sleeps and locks can not be done in the work
> queued to wq marked with WQ_MEM_RECLAIM?

I didn't see this documented anywhere, but I would assume that
everything that can block memory reclaim progress should not be
done in such a workqueue.

>
> >>
> >> From mlx5 driver, it seems that there is GFP_KERNEL allocations
> >> on wq marked with WQ_MEM_RECLAIM too:
> >>
> >> mlx5e_tx_timeout_work() -> mlx5e_safe_reopen_channels() ->
> >> mlx5e_safe_switch_channels() -> mlx5e_open_channels()
> >>
> >> kcalloc() is called with GFP_KERNEL in mlx5e_open_channels(),
> >> and mlx5e_tx_timeout_work() is queued with priv->wq, which is
> >> allocated with WQ_MEM_RECLAIM flags. see:
> >>
> >> mlx5e_netdev_init() -> create_singlethread_workqueue()
> >
> > There are two reasons for that, first mlx5 driver was written far before
> > WQ_MEM_RECLAIM usage was clarified, second mlx5 has bugs.
> >
> >>
> >>
> >> From the comment in kernel/workqueue.c, the work queued with
> >> wq with WQ_MEM_RECLAIM flag set seems to be executed without
> >> blocking under some rare case. I still not quite understand
> >> the comment, and I can not find any doc that point out the
> >> GFP_KERNEL allocations can not be done in wq with WQ_MEM_RECLAIM
> >> yet. Is there any doc that mentions that GFP_KERNEL allocations
> >> can not be done in wq with WQ_MEM_RECLAIM?
> >
> > It is whole purpose of WQ_MEM_RECLAIM flag - allow progress in case of
> > memory pressure. Allocation memory while we are under memory pressure
> > is an invitation for a disaster.
>
> Ok, make sense.
>
> >
> >>
> >>
> >> /**
> >>  * rescuer_thread - the rescuer thread function
> >>  * @__rescuer: self
> >>  *
> >>  * Workqueue rescuer thread function.  There's one rescuer for each
> >>  * workqueue which has WQ_MEM_RECLAIM set.
> >>  *
> >>  * Regular work processing on a pool may block trying to create a new
> >>  * worker which uses GFP_KERNEL allocation which has slight chance of
> >>  * developing into deadlock if some works currently on the same queue
> >>  * need to be processed to satisfy the GFP_KERNEL allocation.  This is
> >>  * the problem rescuer solves.
> >>  *
> >>  * When such condition is possible, the pool summons rescuers of all
> >>  * workqueues which have works queued on the pool and let them process
> >>  * those works so that forward progress can be guaranteed.
> >>  *
> >>  * This should happen rarely.
> >>  *
> >>  * Return: 0
> >>  */
> >>
> >>
> >> The below is the reason we add the sets "hclge_service_task" workqueue
> >> with WQ_MEM_RECLAIM through analysing why other ethernet drivers has
> >> allocated wq with WQ_MEM_RECLAIM flag, I may be wrong about that:
> >
> > Many drivers are developed using copy/paste technique, so it is wrong
> > to assume that "other ethernet drivers" did the right thing.
> >
> >>
> >> hns3 ethernet driver may be used as the low level transport of a
> >> network file system, memory reclaim data path may depend on the
> >> worker in hns3 driver to bring back the ethernet link so that it flush
> >> the some cache to network based disk.
> >
> > Unlikely that this "network file system" dependency on ethernet link is correct.
>
> Ok, I may be wrong about the above usecase.
> but the below commit explicitly state that network devices may be used in
> memory reclaim path.
>
> 0a38c17a21a0 ("fm10k: Remove create_workqueue"):
>
> fm10k: Remove create_workqueue
>
> alloc_workqueue replaces deprecated create_workqueue().
>
> A dedicated workqueue has been used since the workitem (viz
> fm10k_service_task, which manages and runs other subtasks) is involved in
> normal device operation and requires forward progress under memory
> pressure.
>
> create_workqueue has been replaced with alloc_workqueue with max_active
> as 0 since there is no need for throttling the number of active work
> items.
>
> Since network devices may be used in memory reclaim path,
> WQ_MEM_RECLAIM has been set to guarantee forward progress.
>
> flush_workqueue is unnecessary since destroy_workqueue() itself calls
> drain_workqueue() which flushes repeatedly till the workqueue
> becomes empty. Hence the call to flush_workqueue() has been dropped.
>
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>
> So:
> 1. Maybe the above commit log is misleading, and network device driver's
>    wq does not need the WQ_MEM_RECLAIM flag, then maybe document what can
>    not be done in the work queued to wq marked with WQ_MEM_RECLAIM, and
>    remove the WQ_MEM_RECLAIM flag for the wq of network device driver.

I wouldn't truly count on what is written in the commit messages of a
patch series that globally replaced the create_workqueue() interface.

>
>
> 2. If the network device driver's wq does need the WQ_MEM_RECLAIM flag, then
>    hns3 may have tow problems here: WQ_MEM_RECLAIM wq flushing !WQ_MEM_RECLAIM
>    wq problem and GFP_KERNEL allocations in the work queued to WQ_MEM_RECLAIM wq.

You are proposing to put WQ_MEM_RECLAIM on the infiniband queue, not on
a special queue inside the driver.

>
> >
> > Thanks
> >
> >>
> >>>
> >>> Jason
> >>>
> >>>
> >>
> >
> > .
> >
>
Jason Gunthorpe Feb. 19, 2020, 1:04 p.m. UTC | #6
On Wed, Feb 19, 2020 at 03:40:59PM +0800, Yunsheng Lin wrote:
> +cc Bhaktipriya, Tejun and Jeff
> 
> On 2020/2/19 14:45, Leon Romanovsky wrote:
> > On Wed, Feb 19, 2020 at 09:13:23AM +0800, Yunsheng Lin wrote:
> >> On 2020/2/18 23:31, Jason Gunthorpe wrote:
> >>> On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
> >>>> The hns3 driver sets "hclge_service_task" workqueue with
> >>>> WQ_MEM_RECLAIM flag in order to guarantee forward progress
> >>>> under memory pressure.
> >>>
> >>> Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
> >>> with reclaimed processing.
> >>>
> >>> Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
> >>> allocations, can't do certain kinds of sleeps, can't hold certain
> >>> kinds of locks, etc.
> 
> By the way, what kind of sleeps and locks can not be done in the work
> queued to wq marked with WQ_MEM_RECLAIM?

Anything that recurses back into a blocking allocation function.

If we are freeing memory because an allocation failed (eg GFP_KERNEL)
then we cannot go back into a blockable allocation while trying to
progress the first failing allocation. That is a deadlock.

So a WQ cannot hold any locks that enclose GFP_KERNEL in any other
threads.

Unfortunately we don't have a lockdep test for this by default.

> >> hns3 ethernet driver may be used as the low level transport of a
> >> network file system, memory reclaim data path may depend on the
> >> worker in hns3 driver to bring back the ethernet link so that it flush
> >> the some cache to network based disk.
> > 
> > Unlikely that this "network file system" dependency on ethernet link is correct.
> 
> Ok, I may be wrong about the above usecase.  but the below commit
> explicitly state that network devices may be used in memory reclaim
> path.

I don't really know how this works when the networking stacks
intersect with the block stack.

Forward progress on something like a NVMeOF requires a lot of stuff to
be working, and presumably under reclaim.

But, we can't make everything WQ_MEM_RECLAIM safe, because then we could
never do a GFP_KERNEL allocation.

I have never seen specific guidance what to do here, I assume it is
broken.

Jason
Yunsheng Lin Feb. 20, 2020, 1:06 a.m. UTC | #7
On 2020/2/19 21:04, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2020 at 03:40:59PM +0800, Yunsheng Lin wrote:
>> +cc Bhaktipriya, Tejun and Jeff
>> 
>> On 2020/2/19 14:45, Leon Romanovsky wrote:
>>> On Wed, Feb 19, 2020 at 09:13:23AM +0800, Yunsheng Lin wrote:
>>>> On 2020/2/18 23:31, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
>>>>>> The hns3 driver sets "hclge_service_task" workqueue with WQ_MEM_RECLAIM flag in order to guarantee forward progress under memory pressure.
>>>>> 
>>>>> Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked with reclaimed processing.
>>>>> 
>>>>> Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL allocations, can't do certain kinds of sleeps, can't hold certain kinds of locks, etc.
>> 
>> By the way, what kind of sleeps and locks can not be done in the work queued to wq marked with WQ_MEM_RECLAIM?
> 
> Anything that recurses back into a blocking allocation function.
> 
> If we are freeing memory because an allocation failed (eg GFP_KERNEL) then we cannot go back into a blockable allocation while trying to progress the first failing allocation. That is a deadlock.
> 
> So a WQ cannot hold any locks that enclose GFP_KERNEL in any other threads.
> 
> Unfortunately we don't have a lockdep test for this by default.
> 
>>>> hns3 ethernet driver may be used as the low level transport of a network file system, memory reclaim data path may depend on the worker in hns3 driver to bring back the ethernet link so that it flush the some cache to network based disk.
>>> 
>>> Unlikely that this "network file system" dependency on ethernet link is correct.
>> 
>> Ok, I may be wrong about the above usecase.  but the below commit explicitly state that network devices may be used in memory reclaim path.
> 
> I don't really know how this works when the networking stacks intersect with the block stack.
> 
> Forward progress on something like a NVMeOF requires a lot of stuff to be working, and presumably under reclaim.
> 
> But, we can't make everything WQ_MEM_RECLAIM safe, because we could never do a GFP_KERNEL allocation..
> 
> I have never seen specific guidance what to do here, I assume it is broken.

I assume the forward progress guarantee of network device wqs is broken
too, at least in the case of the hns3, fm10k and mlx5 drivers.

So I suggest removing WQ_MEM_RECLAIM from hns3's wq for now.

For now there are two known problems which defeat the forward progress
guarantee of WQ_MEM_RECLAIM when adding WQ_MEM_RECLAIM to hns3's
wq:
1. GFP_KERNEL allocations in hns3 work queued to the WQ_MEM_RECLAIM wq.
2. hns3's WQ_MEM_RECLAIM wq flushing infiniband's !WQ_MEM_RECLAIM wq.

We can add WQ_MEM_RECLAIM back once we have fixed the above problems and
found more specific guidance about handling the forward progress
guarantee in network device wqs.

Thanks for the feedback.

> 
> Jason
> 
> .
> 
Yunsheng Lin Feb. 20, 2020, 1:16 a.m. UTC | #8
On 2020/2/19 19:07, Leon Romanovsky wrote:
> On Wed, Feb 19, 2020 at 03:40:59PM +0800, Yunsheng Lin wrote:
>> +cc Bhaktipriya, Tejun and Jeff
>>
>> On 2020/2/19 14:45, Leon Romanovsky wrote:
>>> On Wed, Feb 19, 2020 at 09:13:23AM +0800, Yunsheng Lin wrote:
>>>> On 2020/2/18 23:31, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
>>>>>> The hns3 driver sets "hclge_service_task" workqueue with
>>>>>> WQ_MEM_RECLAIM flag in order to guarantee forward progress
>>>>>> under memory pressure.
>>>>>
>>>>> Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
>>>>> with reclaimed processing.
>>>>>
>>>>> Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
>>>>> allocations, can't do certain kinds of sleeps, can't hold certain
>>>>> kinds of locks, etc.
>>
>> By the way, what kind of sleeps and locks can not be done in the work
>> queued to wq marked with WQ_MEM_RECLAIM?
> 
> I didn't see this knowledge documented, but I would assume that
> everything that can block memory reclaim progress should not be
> in such workqueue.
> 
>>
>>>>
>>>> From mlx5 driver, it seems that there is GFP_KERNEL allocations
>>>> on wq marked with WQ_MEM_RECLAIM too:
>>>>
>>>> mlx5e_tx_timeout_work() -> mlx5e_safe_reopen_channels() ->
>>>> mlx5e_safe_switch_channels() -> mlx5e_open_channels()
>>>>
>>>> kcalloc() is called with GFP_KERNEL in mlx5e_open_channels(),
>>>> and mlx5e_tx_timeout_work() is queued with priv->wq, which is
>>>> allocated with WQ_MEM_RECLAIM flags. see:
>>>>
>>>> mlx5e_netdev_init() -> create_singlethread_workqueue()
>>>
>>> There are two reasons for that, first mlx5 driver was written far before
>>> WQ_MEM_RECLAIM usage was clarified, second mlx5 has bugs.
>>>
>>>>
>>>>
>>>> From the comment in kernel/workqueue.c, the work queued with
>>>> wq with WQ_MEM_RECLAIM flag set seems to be executed without
>>>> blocking under some rare case. I still not quite understand
>>>> the comment, and I can not find any doc that point out the
>>>> GFP_KERNEL allocations can not be done in wq with WQ_MEM_RECLAIM
>>>> yet. Is there any doc that mentions that GFP_KERNEL allocations
>>>> can not be done in wq with WQ_MEM_RECLAIM?
>>>
>>> It is whole purpose of WQ_MEM_RECLAIM flag - allow progress in case of
>>> memory pressure. Allocation memory while we are under memory pressure
>>> is an invitation for a disaster.
>>
>> Ok, make sense.
>>
>>>
>>>>
>>>>
>>>> /**
>>>>  * rescuer_thread - the rescuer thread function
>>>>  * @__rescuer: self
>>>>  *
>>>>  * Workqueue rescuer thread function.  There's one rescuer for each
>>>>  * workqueue which has WQ_MEM_RECLAIM set.
>>>>  *
>>>>  * Regular work processing on a pool may block trying to create a new
>>>>  * worker which uses GFP_KERNEL allocation which has slight chance of
>>>>  * developing into deadlock if some works currently on the same queue
>>>>  * need to be processed to satisfy the GFP_KERNEL allocation.  This is
>>>>  * the problem rescuer solves.
>>>>  *
>>>>  * When such condition is possible, the pool summons rescuers of all
>>>>  * workqueues which have works queued on the pool and let them process
>>>>  * those works so that forward progress can be guaranteed.
>>>>  *
>>>>  * This should happen rarely.
>>>>  *
>>>>  * Return: 0
>>>>  */
>>>>
>>>>
>>>> The below is the reason we add the sets "hclge_service_task" workqueue
>>>> with WQ_MEM_RECLAIM through analysing why other ethernet drivers has
>>>> allocated wq with WQ_MEM_RECLAIM flag, I may be wrong about that:
>>>
>>> Many drivers are developed using copy/paste technique, so it is wrong
>>> to assume that "other ethernet drivers" did the right thing.
>>>
>>>>
>>>> The hns3 ethernet driver may be used as the low level transport of a
>>>> network file system, so the memory reclaim data path may depend on the
>>>> worker in the hns3 driver to bring back the ethernet link so that it
>>>> can flush some cached data to a network-based disk.
>>>
>>> Unlikely that this "network file system" dependency on ethernet link is correct.
>>
>> Ok, I may be wrong about the above usecase,
>> but the below commit explicitly states that network devices may be used
>> in the memory reclaim path.
>>
>> 0a38c17a21a0 ("fm10k: Remove create_workqueue"):
>>
>> fm10k: Remove create_workqueue
>>
>> alloc_workqueue replaces deprecated create_workqueue().
>>
>> A dedicated workqueue has been used since the workitem (viz
>> fm10k_service_task, which manages and runs other subtasks) is involved in
>> normal device operation and requires forward progress under memory
>> pressure.
>>
>> create_workqueue has been replaced with alloc_workqueue with max_active
>> as 0 since there is no need for throttling the number of active work
>> items.
>>
>> Since network devices may be used in memory reclaim path,
>> WQ_MEM_RECLAIM has been set to guarantee forward progress.
>>
>> flush_workqueue is unnecessary since destroy_workqueue() itself calls
>> drain_workqueue() which flushes repeatedly till the workqueue
>> becomes empty. Hence the call to flush_workqueue() has been dropped.
>>
>> Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>>
>> So:
>> 1. Maybe the above commit log is misleading and network device drivers'
>>    wqs do not need the WQ_MEM_RECLAIM flag; then maybe document what
>>    cannot be done in work queued to a wq marked with WQ_MEM_RECLAIM, and
>>    remove the WQ_MEM_RECLAIM flag from the wqs of network device drivers.
> 
> I wouldn't truly count on what is written in commit messages of patch
> series which globally replaced create_workqueue() interface.
> 
>>
>>
>> 2. If the network device driver's wq does need the WQ_MEM_RECLAIM flag,
>>    then hns3 may have two problems here: a WQ_MEM_RECLAIM wq flushing a
>>    !WQ_MEM_RECLAIM wq, and GFP_KERNEL allocations in work queued to a
>>    WQ_MEM_RECLAIM wq.
> 
> You are proposing to put WQ_MEM_RECLAIM on the infiniband queue and not
> on the special queue inside of the driver.

In the commit log, we thought of three ways to avoid the warning,
which are:

1. Allocate the "hclge_service_task" workqueue without the
WQ_MEM_RECLAIM flag, which may cause a deadlock problem
when the hns3 driver is used as the low level transport of
a network file system.

2. Do not unregister the ib_device during the reset process; maybe
only disable access to the ib_device using disable_device(),
as rdma_dev_change_netns() does.

3. Allocate the "infiniband" workqueue with WQ_MEM_RECLAIM flag.


Actually I prefer the second one, which needs the IB stack to provide
an interface to disable and enable the ib_device in the RoCE device
driver when a hardware reset happens.

Is the above interface reasonable?
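To make option 2 concrete, a rough sketch of what such an interface might look like is below. The function names here are hypothetical and do not exist in the IB core today; as far as I can tell the core currently has only an internal disable_device(), used on unregister and by rdma_dev_change_netns():

```c
/*
 * Hypothetical sketch only: rdma_dev_disable()/rdma_dev_enable() are
 * made-up names illustrating the interface proposed in option 2, i.e.
 * quiescing an ib_device across a hardware reset without a full
 * ib_unregister_device()/ib_register_device() cycle.
 */

/* Exported by the IB core (hypothetical): */
void rdma_dev_disable(struct ib_device *device); /* block new verbs access */
void rdma_dev_enable(struct ib_device *device);  /* resume after HW reset */

/* Called from the RoCE driver's reset notification path (sketch;
 * hns_roce_hw_reset_notify() is an assumed helper name): */
static void hns_roce_hw_reset_notify(struct hns_roce_dev *hr_dev, bool down)
{
	if (down)
		rdma_dev_disable(&hr_dev->ib_dev);
	else
		rdma_dev_enable(&hr_dev->ib_dev);
}
```

With something along these lines, the hclge reset worker would not need to flush the "infiniband" workqueue at all, so the WQ_MEM_RECLAIM flush-dependency warning would not trigger in the first place.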

> 
>>
>>>
>>> Thanks
>>>
>>>>
>>>>>
>>>>> Jason
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
>
Alexander H Duyck Feb. 20, 2020, 5:46 p.m. UTC | #9
On Tue, Feb 18, 2020 at 11:42 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:

> Ok, I may be wrong about the above usecase,
> but the below commit explicitly states that network devices may be used
> in the memory reclaim path.
>
> 0a38c17a21a0 ("fm10k: Remove create_workqueue"):
>
> fm10k: Remove create_workqueue
>
> alloc_workqueue replaces deprecated create_workqueue().
>
> A dedicated workqueue has been used since the workitem (viz
> fm10k_service_task, which manages and runs other subtasks) is involved in
> normal device operation and requires forward progress under memory
> pressure.
>
> create_workqueue has been replaced with alloc_workqueue with max_active
> as 0 since there is no need for throttling the number of active work
> items.
>
> Since network devices may be used in memory reclaim path,
> WQ_MEM_RECLAIM has been set to guarantee forward progress.
>
> flush_workqueue is unnecessary since destroy_workqueue() itself calls
> drain_workqueue() which flushes repeatedly till the workqueue
> becomes empty. Hence the call to flush_workqueue() has been dropped.
>
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>
> So:
> 1. Maybe the above commit log is misleading, and network device driver's
>    wq does not need the WQ_MEM_RECLAIM flag, then maybe document what can
>    not be done in the work queued to wq marked with WQ_MEM_RECLAIM, and
>    remove the WQ_MEM_RECLAIM flag for the wq of network device driver.

I am not sure why they added WQ_MEM_RECLAIM to the fm10k service task
thread. It has nothing to do with memory reclaim. If a memory
allocation fails then it will just run to the end and bring the
interface down. The service task is related to dealing with various
one-off events like link up and link down, sorting out hangs, and
updating statistics. The only memory allocation it is involved with is
if it has to reset the interface in which case I believe there may
even be a few GFP_KERNEL calls in there since it is freeing and
reallocating several port related structures.

> 2. If the network device driver's wq does need the WQ_MEM_RECLAIM flag,
>    then hns3 may have two problems here: a WQ_MEM_RECLAIM wq flushing a
>    !WQ_MEM_RECLAIM wq, and GFP_KERNEL allocations in work queued to a
>    WQ_MEM_RECLAIM wq.

It seems like you could solve this by going the other way and dropping
the WQ_MEM_RECLAIM from the original patch you mentioned in your fixes
tag. I'm not seeing anything in hclge_periodic_service_task that
justifies the use of the WQ_MEM_RECLAIM flag. It claims to be involved
with memory reclaim but I don't see where that could be the case.

- Alex
Yunsheng Lin Feb. 21, 2020, 1:44 a.m. UTC | #10
On 2020/2/21 1:46, Alexander Duyck wrote:
> On Tue, Feb 18, 2020 at 11:42 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
>> Ok, I may be wrong about the above usecase,
>> but the below commit explicitly states that network devices may be used
>> in the memory reclaim path.
>>
>> 0a38c17a21a0 ("fm10k: Remove create_workqueue"):
>>
>> fm10k: Remove create_workqueue
>>
>> alloc_workqueue replaces deprecated create_workqueue().
>>
>> A dedicated workqueue has been used since the workitem (viz
>> fm10k_service_task, which manages and runs other subtasks) is involved in
>> normal device operation and requires forward progress under memory
>> pressure.
>>
>> create_workqueue has been replaced with alloc_workqueue with max_active
>> as 0 since there is no need for throttling the number of active work
>> items.
>>
>> Since network devices may be used in memory reclaim path,
>> WQ_MEM_RECLAIM has been set to guarantee forward progress.
>>
>> flush_workqueue is unnecessary since destroy_workqueue() itself calls
>> drain_workqueue() which flushes repeatedly till the workqueue
>> becomes empty. Hence the call to flush_workqueue() has been dropped.
>>
>> Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
>> Acked-by: Tejun Heo <tj@kernel.org>
>> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>>
>> So:
>> 1. Maybe the above commit log is misleading, and network device driver's
>>    wq does not need the WQ_MEM_RECLAIM flag, then maybe document what can
>>    not be done in the work queued to wq marked with WQ_MEM_RECLAIM, and
>>    remove the WQ_MEM_RECLAIM flag for the wq of network device driver.
> 
> I am not sure why they added WQ_MEM_RECLAIM to the fm10k service task
> thread. It has nothing to do with memory reclaim. If a memory
> allocation fails then it will just run to the end and bring the
> interface down. The service task is related to dealing with various
> one-off events like link up and link down, sorting out hangs, and
> updating statistics. The only memory allocation it is involved with is
> if it has to reset the interface in which case I believe there may
> even be a few GFP_KERNEL calls in there since it is freeing and
> reallocating several port related structures.

Yes, the hns3 driver does a few GFP_KERNEL calls too when resetting the
interface in hclge_reset_service_task(), which runs on the hns3 driver's
wq.

> 
>> 2. If the network device driver's wq does need the WQ_MEM_RECLAIM flag,
>>    then hns3 may have two problems here: a WQ_MEM_RECLAIM wq flushing a
>>    !WQ_MEM_RECLAIM wq, and GFP_KERNEL allocations in work queued to a
>>    WQ_MEM_RECLAIM wq.
> 
> It seems like you could solve this by going the other way and dropping
> the WQ_MEM_RECLAIM from the original patch you mentioned in your fixes
> tag. I'm not seeing anything in hclge_periodic_service_task that
> justifies the use of the WQ_MEM_RECLAIM flag. It claims to be involved
> with memory reclaim but I don't see where that could be the case.

Ok, will remove the WQ_MEM_RECLAIM flag first.

Thanks.
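For reference, the resulting hns3-side change would presumably look something like the sketch below; the variable name and the exact allocation site in hclge_main.c are assumptions on my part, not verified against the driver source:

```diff
 # Sketch only: drop WQ_MEM_RECLAIM from the hclge service workqueue,
 # so that flushing the !WQ_MEM_RECLAIM "infiniband" wq from the reset
 # worker no longer trips check_flush_dependency().
-	hdev->service_wq = alloc_workqueue("hclge", WQ_MEM_RECLAIM, 0);
+	hdev->service_wq = alloc_workqueue("hclge", 0, 0);
```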

> 
> - Alex
> 
> .
>

Patch

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 84dd74f..595548a 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2707,7 +2707,7 @@  static int __init ib_core_init(void)
 {
 	int ret;
 
-	ib_wq = alloc_workqueue("infiniband", 0, 0);
+	ib_wq = alloc_workqueue("infiniband", WQ_MEM_RECLAIM, 0);
 	if (!ib_wq)
 		return -ENOMEM;