Message ID | 20231215100343.32480-1-chengen.du@canonical.com |
---|---|
Series | Page fault in RDMA ODP triggers BUG_ON during MMU notifier registration |
On 12/15/23 3:03 AM, Chengen Du wrote:
> BugLink: https://bugs.launchpad.net/bugs/2046534
>
> SRU Justification:
>
> [Impact]
> When a page fault is triggered in RDMA ODP, it registers an MMU notifier during the process.
> Unfortunately, an error arises due to a race condition where the mm is released while attempting to register a notifier.
> ==========
> Oct 14 23:38:32 bnode001 kernel: [1576115.901880] kernel BUG at mm/mmu_notifier.c:255!
> Oct 14 23:38:32 bnode001 kernel: [1576115.909129] RSP: 0000:ffffbd3def843c90 EFLAGS: 00010246
> Oct 14 23:38:32 bnode001 kernel: [1576115.912689] RAX: ffffa11635d20000 RBX: ffffa0f913ba5800 RCX: 0000000000000000
> Oct 14 23:38:32 bnode001 kernel: [1576115.912691] RDX: ffffffffc0b666f0 RSI: ffffffffc0b601c7 RDI: ffffa0f913ba5850
> Oct 14 23:38:32 bnode001 kernel: [1576115.913564] RAX: 0000000000000000 RBX: ffffffffc0b5a060 RCX: 0000000000000000
> Oct 14 23:38:32 bnode001 kernel: [1576115.913565] RDX: 0000000000000007 RSI: ffffa1152ed3c400 RDI: ffffa1102dcd4300
> Oct 14 23:38:32 bnode001 kernel: [1576115.914431] RBP: ffffbd3defcb7c88 R08: ffffa1163f4f50e0 R09: ffffa11638c072c0
> Oct 14 23:38:32 bnode001 kernel: [1576115.914432] R10: ffffa0fd99a00000 R11: 0000000000000000 R12: ffffa1152c923b80
> Oct 14 23:38:32 bnode001 kernel: [1576115.915263] RBP: ffffbd3def843cb0 R08: ffffa1163f7350e0 R09: ffffa11638c072c0
> Oct 14 23:38:32 bnode001 kernel: [1576115.915265] R10: ffffa1088d000000 R11: 0000000000000000 R12: ffffa1102dcd4300
> Oct 14 23:38:32 bnode001 kernel: [1576115.916079] R13: ffffa1152c923b80 R14: ffffa1152c923bf8 R15: ffffa114f8127800
> Oct 14 23:38:32 bnode001 kernel: [1576115.916080] FS: 0000000000000000(0000) GS:ffffa1163f4c0000(0000) knlGS:0000000000000000
> Oct 14 23:38:32 bnode001 kernel: [1576115.917705] R13: ffffa1152ed3c400 R14: ffffa1152ed3c478 R15: ffffa1101cbfbc00
> Oct 14 23:38:32 bnode001 kernel: [1576115.917706] FS: 0000000000000000(0000) GS:ffffa1163f700000(0000) knlGS:0000000000000000
> Oct 14 23:38:32 bnode001 kernel: [1576115.918506] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 14 23:38:32 bnode001 kernel: [1576115.918508] CR2: 00007f94146af5e0 CR3: 0000001722472004 CR4: 0000000000760ee0
> Oct 14 23:38:32 bnode001 kernel: [1576115.919301] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Oct 14 23:38:32 bnode001 kernel: [1576115.919302] CR2: 00007f32f0a2dc80 CR3: 0000001f9f1fc004 CR4: 0000000000760ee0
> Oct 14 23:38:32 bnode001 kernel: [1576115.920082] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct 14 23:38:32 bnode001 kernel: [1576115.920084] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> Oct 14 23:38:32 bnode001 kernel: [1576115.920850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Oct 14 23:38:32 bnode001 kernel: [1576115.921604] PKRU: 55555554
> Oct 14 23:38:32 bnode001 kernel: [1576115.921605] Call Trace:
> Oct 14 23:38:32 bnode001 kernel: [1576115.922354] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
> Oct 14 23:38:32 bnode001 kernel: [1576115.922355] PKRU: 55555554
> Oct 14 23:38:32 bnode001 kernel: [1576115.923112] mmu_notifier_get_locked+0x5f/0xe0
> Oct 14 23:38:32 bnode001 kernel: [1576115.923867] Call Trace:
> Oct 14 23:38:32 bnode001 kernel: [1576115.923870] ? mmu_notifier_get_locked+0x79/0xe0
> Oct 14 23:38:32 bnode001 kernel: [1576115.924645] ib_umem_odp_alloc_child+0x15a/0x290 [ib_core]
> Oct 14 23:38:32 bnode001 kernel: [1576115.925409] ib_umem_odp_alloc_child+0x15a/0x290 [ib_core]
> Oct 14 23:38:32 bnode001 kernel: [1576115.926161] pagefault_mr+0x312/0x5d0 [mlx5_ib]
> Oct 14 23:38:32 bnode001 kernel: [1576115.926906] pagefault_mr+0x312/0x5d0 [mlx5_ib]
> Oct 14 23:38:32 bnode001 kernel: [1576115.927651] pagefault_single_data_segment.isra.0+0x284/0x490 [mlx5_ib]
> Oct 14 23:38:32 bnode001 kernel: [1576115.928393] pagefault_single_data_segment.isra.0+0x284/0x490 [mlx5_ib]
> Oct 14 23:38:32 bnode001 kernel: [1576115.929131] mlx5_ib_eqe_pf_action+0x7d5/0x990 [mlx5_ib]
> Oct 14 23:38:32 bnode001 kernel: [1576115.929866] mlx5_ib_eqe_pf_action+0x7d5/0x990 [mlx5_ib]
> Oct 14 23:38:32 bnode001 kernel: [1576115.930610] process_one_work+0x1eb/0x3b0
> Oct 14 23:38:32 bnode001 kernel: [1576115.931351] process_one_work+0x1eb/0x3b0
> Oct 14 23:38:32 bnode001 kernel: [1576115.932084] worker_thread+0x4d/0x400
> Oct 14 23:38:32 bnode001 kernel: [1576115.932813] worker_thread+0x4d/0x400
> Oct 14 23:38:32 bnode001 kernel: [1576115.933543] kthread+0x104/0x140
> Oct 14 23:38:32 bnode001 kernel: [1576115.934272] kthread+0x104/0x140
> Oct 14 23:38:32 bnode001 kernel: [1576115.934986] ? process_one_work+0x3b0/0x3b0
> Oct 14 23:38:32 bnode001 kernel: [1576115.934988] ? kthread_park+0x90/0x90
> Oct 14 23:38:32 bnode001 kernel: [1576115.935687] ? process_one_work+0x3b0/0x3b0
> Oct 14 23:38:32 bnode001 kernel: [1576115.935689] ? kthread_park+0x90/0x90
> Oct 14 23:38:32 bnode001 kernel: [1576115.936387] ret_from_fork+0x1f/0x40
> Oct 14 23:38:32 bnode001 kernel: [1576115.936389] ---[ end trace 1823b59637af552f ]---
> Oct 14 23:38:32 bnode001 kernel: [1576115.937077] ret_from_fork+0x1f/0x40
> ==========
>
> [Fix]
> There is an upstream patch that fixes this issue:
> ==========
> commit a4e63bce1414df7ab6eb82ca9feb8494ce13e554
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date: Thu Feb 27 13:41:18 2020 +0200
>
> RDMA/odp: Ensure the mm is still alive before creating an implicit child
> ==========
> The patch has been implemented to modify the behavior by calling mmget() around the registration, thereby ensuring it is held to avoid the race condition.
>
> [Test Plan]
> This is a race condition issue and may not be easy to reproduce.
> The test plan involves running on a system with InfiniBand, triggering the RDMA ODP page fault path to check if everything works as expected.
>
> [Where problems could occur]
> The patch calls mmget_not_zero() before registering the MMU notifier and puts it after registration is done.
> This change may not affect the execution result but ensures that the mm will not be released during registration.
> The risk associated with adopting this patch can be judged as low.
>
> Jason Gunthorpe (1):
> RDMA/odp: Ensure the mm is still alive before creating an implicit
> child
>
> drivers/infiniband/core/umem_odp.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>

Acked-by: Tim Gardner <tim.gardner@canonical.com>
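
As a reading aid for the [Fix] and [Where problems could occur] text quoted above: the pattern described is to take a reference on the mm only if its user count is still nonzero, perform the registration, and then drop the reference. The following is a minimal userspace sketch of that pattern in C, not the kernel patch itself. The names struct fake_mm, fake_mmget_not_zero(), fake_mmput() and register_child() are hypothetical stand-ins, and the -1 return value is a placeholder, since the thread does not show which error code the real patch returns.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct mm_struct; only the user count is modelled. */
struct fake_mm {
    atomic_int users; /* plays the role of mm->mm_users */
};

/* Take a reference only if the count is still nonzero (analogue of mmget_not_zero()). */
static bool fake_mmget_not_zero(struct fake_mm *mm)
{
    int old = atomic_load(&mm->users);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&mm->users, &old, old + 1))
            return true;
    }
    return false; /* the mm has already been released */
}

/* Drop the reference. The kernel's mmput() also tears the mm down at zero; omitted here. */
static void fake_mmput(struct fake_mm *mm)
{
    atomic_fetch_sub(&mm->users, 1);
}

/* Hypothetical registration path: abort if the mm is gone, otherwise hold it across the call. */
static int register_child(struct fake_mm *mm)
{
    if (!fake_mmget_not_zero(mm))
        return -1; /* placeholder error code (assumption) */

    /* ... the MMU notifier registration would happen here, with the mm pinned ... */

    fake_mmput(mm);
    return 0;
}

/* Small demonstration: one live mm and one that has already been released. */
void demo(void)
{
    struct fake_mm live, dead;

    atomic_init(&live.users, 1);
    atomic_init(&dead.users, 0);

    assert(register_child(&live) == 0);
    assert(register_child(&dead) == -1);
    printf("live: registered, dead: refused\n");
}
```

The point of the compare-and-swap loop is that the check and the increment happen as one atomic step, so a count that has already dropped to zero can never be raised again by a late caller.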
On Fri, Dec 15, 2023 at 06:03:42PM +0800, Chengen Du wrote:
> BugLink: https://bugs.launchpad.net/bugs/2046534
>
> [...]

Acked-by: Manuel Diewald <manuel.diewald@canonical.com>
On 15/12/2023 11:03, Chengen Du wrote:
> BugLink: https://bugs.launchpad.net/bugs/2046534
>
> [...]

Applied to focal master-next branch. Thanks!