mbox series

[SRU,B,F,0/3] Fix LIO from hanging in iscsit_free_session and iscsit_stop_session

Message ID 20200420183936.40908-1-kelsey.skunberg@canonical.com
Headers show
Series Fix LIO from hanging in iscsit_free_session and iscsit_stop_session | expand

Message

Kelsey Skunberg April 20, 2020, 6:39 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1871688

SRU Justification

[Impact]

(Following details are from the bug report)

The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI 
sessions simultaneously. This is reproducible on systems that have multiple
targets with initiators regularly connecting/disconnecting.

This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is 
executed when a logout operation is underway.

The iscsi target doesn't handle such events in a correct way: two or more 
threads may end up sleeping while waiting for the driver to close the remaining
connections on the session. When the connections are closed, the driver wakes
up only the first thread that will then proceed to destroy the session
structure. The remaining threads are blocked there forever, waiting on a
completion synchronization mechanism that doesn't exist in memory anymore
because it has been freed by the first thread.

Note that if the blocked threads are somehow forced to wake up, they will try 
to free the same iSCSI session structure destroyed by the first thread, causing
double frees, memory corruptions, etc.

The driver has been reorganized so the concurrent threads will set a flag in 
the session structure to notify the driver that the session should be
destroyed; then, they wait for the driver to close the remaining connections.
When the connections are all closed, the driver will wake up all the threads
and will wait for the refcount of the iSCSI session structure to reach zero.
When the last thread wakes up, the refcount is decreased to zero and the driver
can proceed to destroy the session structure because no one is referencing it
anymore.

Bug reporter witnessed this happening on hundreds of Ubuntu 16.04.5 systems.
States this is a regression, because this did not occur several years ago.  No
detailed records from that far back to determine exactly which kernel reporter
was running that was not affected by this bug (Believes it was either 4.8.x or
4.10.x).

Attached in the bug report is the requested uname, version_signature, dmesg, and
lspci from reporter's system. However, the reporter has seen this happen on a
wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel
CPUs, onboard storage and PCIe SAS cards, etc.

This has been fixed in the upstream master branch, but it hasn't yet been 
backported to "-stable".

[Fixes]

These three commits should be backported:
* https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad
* https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796
* https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e

[Test]

This is reproducible on systems that have multiple targets with initiators
regularly connecting/disconnecting by having multiple threads try and destroy
iSCSI sessions simultaneiously. 

This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed
when a logout operation is underway.


[Regression Risk]

Low, cherry picked from upstream with no changes. 

Verified applies cleanly to Bionic/master-next and Focal/master-next. Build
tests pass.

Maurizio Lombardi (3):
  scsi: target: remove boilerplate code
  scsi: target: fix hang when multiple threads try to destroy the same
    iscsi session
  scsi: target: iscsi: calling iscsit_stop_session() inside
    iscsit_close_session() has no effect

 drivers/target/iscsi/iscsi_target.c          | 82 ++++++--------------
 drivers/target/iscsi/iscsi_target.h          |  1 -
 drivers/target/iscsi/iscsi_target_configfs.c |  5 +-
 drivers/target/iscsi/iscsi_target_login.c    |  5 +-
 include/target/iscsi/iscsi_target_core.h     |  2 +-
 5 files changed, 32 insertions(+), 63 deletions(-)

Comments

Kleber Sacilotto de Souza April 23, 2020, 1:21 p.m. UTC | #1
On 20.04.20 20:39, Kelsey Skunberg wrote:
> BugLink: https://bugs.launchpad.net/bugs/1871688
> 
> SRU Justification
> 
> [Impact]
> 
> (Following details are from the bug report)
> 
> The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI 
> sessions simultaneously. This is reproducible on systems that have multiple
> targets with initiators regularly connecting/disconnecting.
> 
> This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is 
> executed when a logout operation is underway.
> 
> The iscsi target doesn't handle such events in a correct way: two or more 
> threads may end up sleeping while waiting for the driver to close the remaining
> connections on the session. When the connections are closed, the driver wakes
> up only the first thread that will then proceed to destroy the session
> structure. The remaining threads are blocked there forever, waiting on a
> completion synchronization mechanism that doesn't exist in memory anymore
> because it has been freed by the first thread.
> 
> Note that if the blocked threads are somehow forced to wake up, they will try 
> to free the same iSCSI session structure destroyed by the first thread, causing
> double frees, memory corruptions, etc.
> 
> The driver has been reorganized so the concurrent threads will set a flag in 
> the session structure to notify the driver that the session should be
> destroyed; then, they wait for the driver to close the remaining connections.
> When the connections are all closed, the driver will wake up all the threads
> and will wait for the refcount of the iSCSI session structure to reach zero.
> When the last thread wakes up, the refcount is decreased to zero and the driver
> can proceed to destroy the session structure because no one is referencing it
> anymore.
> 
> Bug reporter witnessed this happening on hundreds of Ubuntu 16.04.5 systems.
> States this is a regression, because this did not occur several years ago.  No
> detailed records from that far back to determine exactly which kernel reporter
> was running that was not affected by this bug (Believes it was either 4.8.x or
> 4.10.x).
> 
> Attached in the bug report is the requested uname, version_signature, dmesg, and
> lspci from reporter's system. However, the reporter has seen this happen on a
> wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel
> CPUs, onboard storage and PCIe SAS cards, etc.
> 
> This has been fixed in the upstream master branch, but it hasn't yet been 
> backported to "-stable".
> 
> [Fixes]
> 
> These three commits should be backported:
> * https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad
> * https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796
> * https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e
> 
> [Test]
> 
> This is reproducible on systems that have multiple targets with initiators
> regularly connecting/disconnecting by having multiple threads try and destroy
> iSCSI sessions simultaneiously. 
> 
> This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed
> when a logout operation is underway.
> 
> 
> [Regression Risk]
> 
> Low, cherry picked from upstream with no changes. 
> 
> Verified applies cleanly to Bionic/master-next and Focal/master-next. Build
> tests pass.
> 
> Maurizio Lombardi (3):
>   scsi: target: remove boilerplate code
>   scsi: target: fix hang when multiple threads try to destroy the same
>     iscsi session
>   scsi: target: iscsi: calling iscsit_stop_session() inside
>     iscsit_close_session() has no effect
> 
>  drivers/target/iscsi/iscsi_target.c          | 82 ++++++--------------
>  drivers/target/iscsi/iscsi_target.h          |  1 -
>  drivers/target/iscsi/iscsi_target_configfs.c |  5 +-
>  drivers/target/iscsi/iscsi_target_login.c    |  5 +-
>  include/target/iscsi/iscsi_target_core.h     |  2 +-
>  5 files changed, 32 insertions(+), 63 deletions(-)
> 

LGTM, clean cherry-picks:

Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Kamal Mostafa April 23, 2020, 4:59 p.m. UTC | #2
Acked-by: Kamal Mostafa <kamal@canonical.com>

 -Kamal

On Mon, Apr 20, 2020 at 12:39:33PM -0600, Kelsey Skunberg wrote:
> BugLink: https://bugs.launchpad.net/bugs/1871688
> 
> SRU Justification
> 
> [Impact]
> 
> (Following details are from the bug report)
> 
> The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI 
> sessions simultaneously. This is reproducible on systems that have multiple
> targets with initiators regularly connecting/disconnecting.
> 
> This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is 
> executed when a logout operation is underway.
> 
> The iscsi target doesn't handle such events in a correct way: two or more 
> threads may end up sleeping while waiting for the driver to close the remaining
> connections on the session. When the connections are closed, the driver wakes
> up only the first thread that will then proceed to destroy the session
> structure. The remaining threads are blocked there forever, waiting on a
> completion synchronization mechanism that doesn't exist in memory anymore
> because it has been freed by the first thread.
> 
> Note that if the blocked threads are somehow forced to wake up, they will try 
> to free the same iSCSI session structure destroyed by the first thread, causing
> double frees, memory corruptions, etc.
> 
> The driver has been reorganized so the concurrent threads will set a flag in 
> the session structure to notify the driver that the session should be
> destroyed; then, they wait for the driver to close the remaining connections.
> When the connections are all closed, the driver will wake up all the threads
> and will wait for the refcount of the iSCSI session structure to reach zero.
> When the last thread wakes up, the refcount is decreased to zero and the driver
> can proceed to destroy the session structure because no one is referencing it
> anymore.
> 
> Bug reporter witnessed this happening on hundreds of Ubuntu 16.04.5 systems.
> States this is a regression, because this did not occur several years ago.  No
> detailed records from that far back to determine exactly which kernel reporter
> was running that was not affected by this bug (Believes it was either 4.8.x or
> 4.10.x).
> 
> Attached in the bug report is the requested uname, version_signature, dmesg, and
> lspci from reporter's system. However, the reporter has seen this happen on a
> wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel
> CPUs, onboard storage and PCIe SAS cards, etc.
> 
> This has been fixed in the upstream master branch, but it hasn't yet been 
> backported to "-stable".
> 
> [Fixes]
> 
> These three commits should be backported:
> * https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad
> * https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796
> * https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e
> 
> [Test]
> 
> This is reproducible on systems that have multiple targets with initiators
> regularly connecting/disconnecting by having multiple threads try and destroy
> iSCSI sessions simultaneiously. 
> 
> This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed
> when a logout operation is underway.
> 
> 
> [Regression Risk]
> 
> Low, cherry picked from upstream with no changes. 
> 
> Verified applies cleanly to Bionic/master-next and Focal/master-next. Build
> tests pass.
> 
> Maurizio Lombardi (3):
>   scsi: target: remove boilerplate code
>   scsi: target: fix hang when multiple threads try to destroy the same
>     iscsi session
>   scsi: target: iscsi: calling iscsit_stop_session() inside
>     iscsit_close_session() has no effect
> 
>  drivers/target/iscsi/iscsi_target.c          | 82 ++++++--------------
>  drivers/target/iscsi/iscsi_target.h          |  1 -
>  drivers/target/iscsi/iscsi_target_configfs.c |  5 +-
>  drivers/target/iscsi/iscsi_target_login.c    |  5 +-
>  include/target/iscsi/iscsi_target_core.h     |  2 +-
>  5 files changed, 32 insertions(+), 63 deletions(-)
> 
> -- 
> 2.20.1
> 
> -- 
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
Kelsey Skunberg April 24, 2020, 5:25 p.m. UTC | #3
On 2020-04-20 12:39:33 , Kelsey Skunberg wrote:

Applied to bionic/master-next and focal/master-next

-Kelsey 

> BugLink: https://bugs.launchpad.net/bugs/1871688
> 
> SRU Justification
> 
> [Impact]
> 
> (Following details are from the bug report)
> 
> The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI 
> sessions simultaneously. This is reproducible on systems that have multiple
> targets with initiators regularly connecting/disconnecting.
> 
> This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is 
> executed when a logout operation is underway.
> 
> The iscsi target doesn't handle such events in a correct way: two or more 
> threads may end up sleeping while waiting for the driver to close the remaining
> connections on the session. When the connections are closed, the driver wakes
> up only the first thread that will then proceed to destroy the session
> structure. The remaining threads are blocked there forever, waiting on a
> completion synchronization mechanism that doesn't exist in memory anymore
> because it has been freed by the first thread.
> 
> Note that if the blocked threads are somehow forced to wake up, they will try 
> to free the same iSCSI session structure destroyed by the first thread, causing
> double frees, memory corruptions, etc.
> 
> The driver has been reorganized so the concurrent threads will set a flag in 
> the session structure to notify the driver that the session should be
> destroyed; then, they wait for the driver to close the remaining connections.
> When the connections are all closed, the driver will wake up all the threads
> and will wait for the refcount of the iSCSI session structure to reach zero.
> When the last thread wakes up, the refcount is decreased to zero and the driver
> can proceed to destroy the session structure because no one is referencing it
> anymore.
> 
> Bug reporter witnessed this happening on hundreds of Ubuntu 16.04.5 systems.
> States this is a regression, because this did not occur several years ago.  No
> detailed records from that far back to determine exactly which kernel reporter
> was running that was not affected by this bug (Believes it was either 4.8.x or
> 4.10.x).
> 
> Attached in the bug report is the requested uname, version_signature, dmesg, and
> lspci from reporter's system. However, the reporter has seen this happen on a
> wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel
> CPUs, onboard storage and PCIe SAS cards, etc.
> 
> This has been fixed in the upstream master branch, but it hasn't yet been 
> backported to "-stable".
> 
> [Fixes]
> 
> These three commits should be backported:
> * https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad
> * https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796
> * https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e
> 
> [Test]
> 
> This is reproducible on systems that have multiple targets with initiators
> regularly connecting/disconnecting by having multiple threads try and destroy
> iSCSI sessions simultaneiously. 
> 
> This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed
> when a logout operation is underway.
> 
> 
> [Regression Risk]
> 
> Low, cherry picked from upstream with no changes. 
> 
> Verified applies cleanly to Bionic/master-next and Focal/master-next. Build
> tests pass.
> 
> Maurizio Lombardi (3):
>   scsi: target: remove boilerplate code
>   scsi: target: fix hang when multiple threads try to destroy the same
>     iscsi session
>   scsi: target: iscsi: calling iscsit_stop_session() inside
>     iscsit_close_session() has no effect
> 
>  drivers/target/iscsi/iscsi_target.c          | 82 ++++++--------------
>  drivers/target/iscsi/iscsi_target.h          |  1 -
>  drivers/target/iscsi/iscsi_target_configfs.c |  5 +-
>  drivers/target/iscsi/iscsi_target_login.c    |  5 +-
>  include/target/iscsi/iscsi_target_core.h     |  2 +-
>  5 files changed, 32 insertions(+), 63 deletions(-)
> 
> -- 
> 2.20.1