Message ID | 20200420183936.40908-1-kelsey.skunberg@canonical.com |
---|---|
Headers | show |
Series | Fix LIO from hanging in iscsit_free_session and iscsit_stop_session | expand |
On 20.04.20 20:39, Kelsey Skunberg wrote: > BugLink: https://bugs.launchpad.net/bugs/1871688 > > SRU Justification > > [Impact] > > (Following details are from the bug report) > > The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI > sessions simultaneously. This is reproducible on systems that have multiple > targets with initiators regularly connecting/disconnecting. > > This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is > executed when a logout operation is underway. > > The iscsi target doesn't handle such events in a correct way: two or more > threads may end up sleeping while waiting for the driver to close the remaining > connections on the session. When the connections are closed, the driver wakes > up only the first thread that will then proceed to destroy the session > structure. The remaining threads are blocked there forever, waiting on a > completion synchronization mechanism that doesn't exist in memory anymore > because it has been freed by the first thread. > > Note that if the blocked threads are somehow forced to wake up, they will try > to free the same iSCSI session structure destroyed by the first thread, causing > double frees, memory corruptions, etc. > > The driver has been reorganized so the concurrent threads will set a flag in > the session structure to notify the driver that the session should be > destroyed; then, they wait for the driver to close the remaining connections. > When the connections are all closed, the driver will wake up all the threads > and will wait for the refcount of the iSCSI session structure to reach zero. > When the last thread wakes up, the refcount is decreased to zero and the driver > can proceed to destroy the session structure because no one is referencing it > anymore. > > Bug reporter witnessed this happening on hundreds of Ubuntu 16.04.5 systems. > States this is a regression, because this did not occur several years ago. No > detailed records from that far back to determine exactly which kernel reporter > was running that was not affected by this bug (Believes it was either 4.8.x or > 4.10.x). > > Attached in the bug report is the requested uname, version_signature, dmesg, and > lspci from reporter's system. However, the reporter has seen this happen on a > wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel > CPUs, onboard storage and PCIe SAS cards, etc. > > This has been fixed in the upstream master branch, but it hasn't yet been > backported to "-stable". > > [Fixes] > > These three commits should be backported: > * https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad > * https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796 > * https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e > > [Test] > > This is reproducible on systems that have multiple targets with initiators > regularly connecting/disconnecting by having multiple threads try and destroy > iSCSI sessions simultaneiously. > > This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed > when a logout operation is underway. > > > [Regression Risk] > > Low, cherry picked from upstream with no changes. > > Verified applies cleanly to Bionic/master-next and Focal/master-next. Build > tests pass. > > Maurizio Lombardi (3): > scsi: target: remove boilerplate code > scsi: target: fix hang when multiple threads try to destroy the same > iscsi session > scsi: target: iscsi: calling iscsit_stop_session() inside > iscsit_close_session() has no effect > > drivers/target/iscsi/iscsi_target.c | 82 ++++++-------------- > drivers/target/iscsi/iscsi_target.h | 1 - > drivers/target/iscsi/iscsi_target_configfs.c | 5 +- > drivers/target/iscsi/iscsi_target_login.c | 5 +- > include/target/iscsi/iscsi_target_core.h | 2 +- > 5 files changed, 32 insertions(+), 63 deletions(-) > LGTM, clean cherry-picks: Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Kamal Mostafa <kamal@canonical.com> -Kamal On Mon, Apr 20, 2020 at 12:39:33PM -0600, Kelsey Skunberg wrote: > BugLink: https://bugs.launchpad.net/bugs/1871688 > > SRU Justification > > [Impact] > > (Following details are from the bug report) > > The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI > sessions simultaneously. This is reproducible on systems that have multiple > targets with initiators regularly connecting/disconnecting. > > This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is > executed when a logout operation is underway. > > The iscsi target doesn't handle such events in a correct way: two or more > threads may end up sleeping while waiting for the driver to close the remaining > connections on the session. When the connections are closed, the driver wakes > up only the first thread that will then proceed to destroy the session > structure. The remaining threads are blocked there forever, waiting on a > completion synchronization mechanism that doesn't exist in memory anymore > because it has been freed by the first thread. > > Note that if the blocked threads are somehow forced to wake up, they will try > to free the same iSCSI session structure destroyed by the first thread, causing > double frees, memory corruptions, etc. > > The driver has been reorganized so the concurrent threads will set a flag in > the session structure to notify the driver that the session should be > destroyed; then, they wait for the driver to close the remaining connections. > When the connections are all closed, the driver will wake up all the threads > and will wait for the refcount of the iSCSI session structure to reach zero. > When the last thread wakes up, the refcount is decreased to zero and the driver > can proceed to destroy the session structure because no one is referencing it > anymore. > > Bug reporter witnessed this happening on hundreds of Ubuntu 16.04.5 systems. > States this is a regression, because this did not occur several years ago. No > detailed records from that far back to determine exactly which kernel reporter > was running that was not affected by this bug (Believes it was either 4.8.x or > 4.10.x). > > Attached in the bug report is the requested uname, version_signature, dmesg, and > lspci from reporter's system. However, the reporter has seen this happen on a > wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel > CPUs, onboard storage and PCIe SAS cards, etc. > > This has been fixed in the upstream master branch, but it hasn't yet been > backported to "-stable". > > [Fixes] > > These three commits should be backported: > * https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad > * https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796 > * https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e > > [Test] > > This is reproducible on systems that have multiple targets with initiators > regularly connecting/disconnecting by having multiple threads try and destroy > iSCSI sessions simultaneiously. > > This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed > when a logout operation is underway. > > > [Regression Risk] > > Low, cherry picked from upstream with no changes. > > Verified applies cleanly to Bionic/master-next and Focal/master-next. Build > tests pass. > > Maurizio Lombardi (3): > scsi: target: remove boilerplate code > scsi: target: fix hang when multiple threads try to destroy the same > iscsi session > scsi: target: iscsi: calling iscsit_stop_session() inside > iscsit_close_session() has no effect > > drivers/target/iscsi/iscsi_target.c | 82 ++++++-------------- > drivers/target/iscsi/iscsi_target.h | 1 - > drivers/target/iscsi/iscsi_target_configfs.c | 5 +- > drivers/target/iscsi/iscsi_target_login.c | 5 +- > include/target/iscsi/iscsi_target_core.h | 2 +- > 5 files changed, 32 insertions(+), 63 deletions(-) > > -- > 2.20.1 > > -- > kernel-team mailing list > kernel-team@lists.ubuntu.com > https://lists.ubuntu.com/mailman/listinfo/kernel-team
On 2020-04-20 12:39:33 , Kelsey Skunberg wrote: Applied to bionic/master-next and focal/master-next -Kelsey > BugLink: https://bugs.launchpad.net/bugs/1871688 > > SRU Justification > > [Impact] > > (Following details are from the bug report) > > The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI > sessions simultaneously. This is reproducible on systems that have multiple > targets with initiators regularly connecting/disconnecting. > > This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is > executed when a logout operation is underway. > > The iscsi target doesn't handle such events in a correct way: two or more > threads may end up sleeping while waiting for the driver to close the remaining > connections on the session. When the connections are closed, the driver wakes > up only the first thread that will then proceed to destroy the session > structure. The remaining threads are blocked there forever, waiting on a > completion synchronization mechanism that doesn't exist in memory anymore > because it has been freed by the first thread. > > Note that if the blocked threads are somehow forced to wake up, they will try > to free the same iSCSI session structure destroyed by the first thread, causing > double frees, memory corruptions, etc. > > The driver has been reorganized so the concurrent threads will set a flag in > the session structure to notify the driver that the session should be > destroyed; then, they wait for the driver to close the remaining connections. > When the connections are all closed, the driver will wake up all the threads > and will wait for the refcount of the iSCSI session structure to reach zero. > When the last thread wakes up, the refcount is decreased to zero and the driver > can proceed to destroy the session structure because no one is referencing it > anymore. > > Bug reporter witnessed this happening on hundreds of Ubuntu 16.04.5 systems. > States this is a regression, because this did not occur several years ago. No > detailed records from that far back to determine exactly which kernel reporter > was running that was not affected by this bug (Believes it was either 4.8.x or > 4.10.x). > > Attached in the bug report is the requested uname, version_signature, dmesg, and > lspci from reporter's system. However, the reporter has seen this happen on a > wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel > CPUs, onboard storage and PCIe SAS cards, etc. > > This has been fixed in the upstream master branch, but it hasn't yet been > backported to "-stable". > > [Fixes] > > These three commits should be backported: > * https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad > * https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796 > * https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e > > [Test] > > This is reproducible on systems that have multiple targets with initiators > regularly connecting/disconnecting by having multiple threads try and destroy > iSCSI sessions simultaneiously. > > This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed > when a logout operation is underway. > > > [Regression Risk] > > Low, cherry picked from upstream with no changes. > > Verified applies cleanly to Bionic/master-next and Focal/master-next. Build > tests pass. > > Maurizio Lombardi (3): > scsi: target: remove boilerplate code > scsi: target: fix hang when multiple threads try to destroy the same > iscsi session > scsi: target: iscsi: calling iscsit_stop_session() inside > iscsit_close_session() has no effect > > drivers/target/iscsi/iscsi_target.c | 82 ++++++-------------- > drivers/target/iscsi/iscsi_target.h | 1 - > drivers/target/iscsi/iscsi_target_configfs.c | 5 +- > drivers/target/iscsi/iscsi_target_login.c | 5 +- > include/target/iscsi/iscsi_target_core.h | 2 +- > 5 files changed, 32 insertions(+), 63 deletions(-) > > -- > 2.20.1