[SRU,Bionic,0/2] Fix IO hang regression

Message ID 20180421210826.27042-1-seth.forshee@canonical.com

Message

Seth Forshee April 21, 2018, 9:08 p.m. UTC
BugLink: http://bugs.launchpad.net/bugs/1765232

Impact: Since 4.15.0-15 some machines have been failing to boot due to
IO hangs. This is caused by patches applied for LP #1759723, which
assigned managed interrupt vectors and reply queues for all possible
CPUs, not just present CPUs. Some drivers were not prepared to cope with
this and end up selecting reply queues not mapped to an online CPU,
causing IO hangs during boot.

Fix: There are driver fixes available upstream, but there are 8-ish
patches in total and we're extremely close to release, so the safer bet
is to just revert the patches for LP #1759723. We can consider
reintroducing them with required fixes at a later time.

Regression Potential: This is obviously going to reintroduce the problem
the patches were intended to fix. That problem is less serious than the
problems which the patches introduced, and IBM has given their okay to
revert them as well.

Test Case: Verified to fix affected hardware on LP #1765232.

Thanks,
Seth


Seth Forshee (2):
  Revert "blk-mq: simplify queue mapping & schedule with each possisble
    CPU"
  Revert "genirq/affinity: assign vectors to all possible CPUs"

 block/blk-mq.c        | 19 +++++++++++--------
 kernel/irq/affinity.c | 30 +++++++++++++++---------------
 2 files changed, 26 insertions(+), 23 deletions(-)

Comments

Colin Ian King April 21, 2018, 9:32 p.m. UTC | #1
On 21/04/18 22:08, Seth Forshee wrote:
> BugLink: http://bugs.launchpad.net/bugs/1765232
> 
> Impact: Since 4.15.0-15 some machines have been failing to boot due to
> IO hangs. This is caused by patches applied for LP #1759723, which
> assigned managed interrupt vectors and reply queues for all possible
> CPUs, not just present CPUs. Some drivers were not prepared to cope with
> this and end up selecting reply queues not mapped to an online CPU,
> causing IO hangs during boot.
> 
> Fix: There are driver fixes available upstream, but there are 8-ish
> patches in total and we're extremely close to release, so the safer bet
> is to just revert the patches for LP #1759723. We can consider
> reintroducing them with required fixes at a later time.
> 
> Regression Potential: This is obviously going to reintroduce the problem
> the patches were intended to fix. That problem is less serious than the
> problems which the patches introduced, and IBM has given their okay to
> revert them as well.
> 
> Test Case: Verified to fix affected hardware on LP #1765232.
> 
> Thanks,
> Seth
> 
> 
> Seth Forshee (2):
>   Revert "blk-mq: simplify queue mapping & schedule with each possisble
>     CPU"
>   Revert "genirq/affinity: assign vectors to all possible CPUs"
> 
>  block/blk-mq.c        | 19 +++++++++++--------
>  kernel/irq/affinity.c | 30 +++++++++++++++---------------
>  2 files changed, 26 insertions(+), 23 deletions(-)
> 
> 
Seems like the best solution for the moment.

Acked-by: Colin Ian King <colin.king@canonical.com>
Khalid Elmously April 21, 2018, 9:47 p.m. UTC | #2
Original Message  
From: Seth Forshee
Sent: Saturday, April 21, 2018 5:11 PM
To: kernel-team@lists.ubuntu.com
Subject: [SRU][Bionic][PATCH 0/2] Fix IO hang regression

BugLink: http://bugs.launchpad.net/bugs/1765232

Impact: Since 4.15.0-15 some machines have been failing to boot due to
IO hangs. This is caused by patches applied for LP #1759723, which
assigned managed interrupt vectors and reply queues for all possible
CPUs, not just present CPUs. Some drivers were not prepared to cope with
this and end up selecting reply queues not mapped to an online CPU,
causing IO hangs during boot.

Fix: There are driver fixes available upstream, but there are 8-ish
patches in total and we're extremely close to release, so the safer bet
is to just revert the patches for LP #1759723. We can consider
reintroducing them with required fixes at a later time.

Regression Potential: This is obviously going to reintroduce the problem
the patches were intended to fix. That problem is less serious than the
problems which the patches introduced, and IBM has given their okay to
revert them as well.

Test Case: Verified to fix affected hardware on LP #1765232.

Thanks,
Seth


Seth Forshee (2):
  Revert "blk-mq: simplify queue mapping & schedule with each possisble
    CPU"
  Revert "genirq/affinity: assign vectors to all possible CPUs"

 block/blk-mq.c        | 19 +++++++++++--------
 kernel/irq/affinity.c | 30 +++++++++++++++---------------
 2 files changed, 26 insertions(+), 23 deletions(-)
Seth Forshee April 21, 2018, 10:04 p.m. UTC | #3
On Sat, Apr 21, 2018 at 04:08:24PM -0500, Seth Forshee wrote:
> BugLink: http://bugs.launchpad.net/bugs/1765232
> 
> Impact: Since 4.15.0-15 some machines have been failing to boot due to
> IO hangs. This is caused by patches applied for LP #1759723, which
> assigned managed interrupt vectors and reply queues for all possible
> CPUs, not just present CPUs. Some drivers were not prepared to cope with
> this and end up selecting reply queues not mapped to an online CPU,
> causing IO hangs during boot.
> 
> Fix: There are driver fixes available upstream, but there are 8-ish
> patches in total and we're extremely close to release, so the safer bet
> is to just revert the patches for LP #1759723. We can consider
> reintroducing them with required fixes at a later time.
> 
> Regression Potential: This is obviously going to reintroduce the problem
> the patches were intended to fix. That problem is less serious than the
> problems which the patches introduced, and IBM has given their okay to
> revert them as well.
> 
> Test Case: Verified to fix affected hardware on LP #1765232.

Applied to bionic/master-next.