
[SRU,Cosmic,Bionic,0/6] Fixes for LP1797367

Message ID CA+jPhpdaCgJvZXWtc-O4j42RxXQc5kkYnzm8ZtNKNtcUBYKgPg@mail.gmail.com
Series Fixes for LP1797367

Message

Frank Heimes Nov. 7, 2018, 6:20 p.m. UTC
BugLink: http://bugs.launchpad.net/bugs/1797367

== SRU Justification ==

While running a series of network stress tests on a bond device on
Ubuntu 18.04.1 with kernel 4.15.0-36.39, a kernel panic was observed
(it also occurs on non-bond devices).
This looks like a race between disabling a qeth device and accessing debugfs.
This is critical and repeatedly leads to a crash (sooner or later).

== Fix ==

e19e5be8b4ca ("s390/qeth: sanitize strings in debug messages")

pre-reqs:
750b162 ("s390/qeth: reduce hard-coded access to ccw channels")
d857e11 ("s390/qeth: remove outdated portname debug msg")
9d0a58f ("s390/qeth: avoid using is_multicast_ether_addr_64bits on (u8 *)[6]")
8174aa8 ("s390/qeth: consolidate qeth MAC address helpers")
4641b02 ("s390/qeth: don't keep track of MAC address's cast type")

== Regression Potential ==

Low, because:
- limited to s390x
- furthermore limited to the qeth driver
- fixes a problem identified during testing
- the fix was tested by IBM before submission

== Test Case ==

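Run the following script, passing the qeth device as the first argument:
   #!/bin/bash
   # Usage: $0 <qeth device>
   dev=${1:?usage: $0 <qeth device>}
   var=0
   while :
   do
        var=$((var + 1))
        echo "DBG count is $var"
        # collect debug data with dbginfo.sh into a fresh directory
        mkdir -p /tmp/DBGINFO
        dbginfo.sh -d /tmp/DBGINFO
        rm -rf /tmp/DBGINFO*
        echo "chzdev now is $var"
        # enable, then disable the qeth device
        chzdev -e "$dev"
        chzdev -d "$dev"
   done
On average a crash happens in fewer than 20 cycles (usually fewer than 10).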

Comments

Stefan Bader Nov. 8, 2018, 9:32 a.m. UTC | #1
On 07.11.18 19:20, Frank Heimes wrote:

And thanks for ASCII ;)

Acked-by: Stefan Bader <stefan.bader@canonical.com>
Kleber Sacilotto de Souza Nov. 8, 2018, 10:02 a.m. UTC | #2
On 11/07/18 19:20, Frank Heimes wrote:

Some of the patches probably won't apply because of bogus line breaks, as
I mentioned in one of the patches, so extra care is needed when applying
them. We'll probably need to cherry-pick the original commits and add the
BugLink and the s-o-b.


Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
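The cherry-pick route described above can be sketched as follows. This is a minimal sketch, assuming the upstream commit message has the usual git layout (subject, blank line, body) and that the BugLink belongs directly below the subject, as in the cover letter. The helper name `add_buglink` is hypothetical; `git cherry-pick -x -s` adds the "(cherry picked from commit ...)" line and the Signed-off-by.

```shell
#!/bin/bash
# Hypothetical helper: read a commit message on stdin and write it back
# with "BugLink: <url>" inserted below the subject line.
# Assumes line 1 is the subject and line 2 is the blank separator.
add_buglink() {
    awk -v link="BugLink: $1" '
        # Subject: print it, then a blank line and the BugLink.
        NR == 1 { print; print ""; print link; next }
        # Everything else, including the original blank separator
        # (which now sits between the BugLink and the body), is unchanged.
        { print }
    '
}

# Possible usage on a kernel tree (commands are illustrative):
#   git cherry-pick -x -s e19e5be8b4ca
#   git log -1 --format=%B \
#     | add_buglink http://bugs.launchpad.net/bugs/1797367 \
#     | git commit --amend -F -
```

Keeping the message rewrite as a plain stdin/stdout filter means it can be checked without a kernel tree and reused for every commit in the series.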
Stefan Bader Nov. 8, 2018, 3:51 p.m. UTC | #3
On 07.11.18 19:20, Frank Heimes wrote:

For Bionic I had to invert the series to make it apply. The actual fix still
needed some adaptation, so it is again more of a backport.
For Cosmic, only one of the pre-reqs was needed, and the actual fix had
to ignore a bit of context.

Applied to bionic,cosmic/master-next. Thanks.

-Stefan
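The Bionic application order mentioned above can be sketched like this. It assumes that "inverting the series" means applying the pre-reqs in the reverse of the order listed in the cover letter, with the fix e19e5be8b4ca applied last. The function names and the `DRY_RUN` switch are hypothetical, and the abbreviated hashes must resolve unambiguously in the local tree. On Bionic the fix still needed manual adaptation, and Cosmic needed only one pre-req, so treat this as a starting point rather than a recipe.

```shell
#!/bin/bash
# Commits from the cover letter (pre-reqs in the order they were listed).
fix=e19e5be8b4ca
prereqs="750b162 d857e11 9d0a58f 8174aa8 4641b02"

# Print the commits in application order: pre-reqs reversed, fix last.
apply_order() {
    reversed=""
    for c in $prereqs; do
        reversed="$c $reversed"
    done
    for c in $reversed $fix; do
        echo "$c"
    done
}

# DRY_RUN=1 (the default) only prints the git commands; DRY_RUN=0 runs
# them and stops at the first cherry-pick that fails (e.g. a conflict).
apply_series() {
    for c in $(apply_order); do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo git cherry-pick -x -s "$c"
        else
            git cherry-pick -x -s "$c" || return 1
        fi
    done
}
```

Defaulting to a dry run makes it easy to review the intended order before touching a branch.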
Seth Forshee Nov. 15, 2018, 5:23 p.m. UTC | #4
On Wed, Nov 07, 2018 at 07:20:40PM +0100, Frank Heimes wrote:

Applied patch 1 to unstable/master; the prerequisites were already
present. Thanks!