mbox series

[SRU,X/B/D/E,0/1] PM / hibernate: fix potential memory corruption

Message ID 20191007163550.20548-1-andrea.righi@canonical.com
Headers show
Series PM / hibernate: fix potential memory corruption | expand

Message

Andrea Righi Oct. 7, 2019, 4:35 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1847118

[Impact]

A caching bug in the hibernation code can lead to potential memory
corruptions on resume.

The hibernation code is representing all the allocated pages in memory
(pfn) using a list of extents, inside each extent it uses a radix tree
and each node in the tree contains a bitmap. This structure is used to
save the memory image to disk.

To speed up lookups in this structure the kernel is caching the position
of the previous lookup in the form (current_extent, current_node).
However, if two consecutive lookups are distant enough from each other,
the extent can change, but the kernel can still use the cached node
(current_node), accessing the wrong bitmap and ending up saving to disk
the wrong pfn's.

[Test Case]

Bug has been reproduced in Xenial and Bionic trying to hibernate a large
instance with a lot of RAM (100GB+).

But we also wrote a custom kernel module to better isolate the code that
triggers the problem: https://code.launchpad.net/~arighi/+git/mybitmap

This module has exactly the same code as the hibernation code, but it
can be used as a fast test case to reproduce the problem without
actually triggering a real hibernation/resume cycle.

[Fix]

This bug can be fixed by properly invalidating the cached pair (extent,
node) when the next lookup falls in a different extent or a different
node.

[Regression Potential]

The fix has been sent to the LKML for review/feedback
(https://lkml.org/lkml/2019/9/25/393), we have not received any feedback
so far, but the bug is pretty clear and well tested on the affected
platforms. Moreover, the code is isolated to the hibernation area, so
the overall regression potential is minimal.

----------------------------------------------------------------
Andy Whitcroft (1):
      PM / hibernate: memory_bm_find_bit -- tighten node optimisation

 kernel/power/snapshot.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Colin Ian King Oct. 7, 2019, 4:45 p.m. UTC | #1
On 07/10/2019 17:35, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1847118
> 
> [Impact]
> 
> A caching bug in the hibernation code can lead to potential memory
> corruptions on resume.
> 
> The hibernation code is representing all the allocated pages in memory
> (pfn) using a list of extents, inside each extent it uses a radix tree
> and each node in the tree contains a bitmap. This structure is used to
> save the memory image to disk.
> 
> To speed up lookups in this structure the kernel is caching the position
> of the previous lookup in the form (current_extent, current_node).
> However, if two consecutive lookups are distant enough from each other,
> the extent can change, but the kernel can still use the cached node
> (current_node), accessing the wrong bitmap and ending up saving to disk
> the wrong pfn's.
> 
> [Test Case]
> 
> Bug has been reproduced in Xenial and Bionic trying to hibernate a large
> instance with a lot of RAM (100GB+).
> 
> But we also wrote a custom kernel module to better isolate the code that
> triggers the problem: https://code.launchpad.net/~arighi/+git/mybitmap
> 
> This module has exactly the same code as the hibernation code, but it
> can be used as a fast test case to reproduce the problem without
> actually triggering a real hibernation/resume cycle.
> 
> [Fix]
> 
> This bug can be fixed by properly invalidating the cached pair (extent,
> node) when the next lookup falls in a different extent or a different
> node.
> 
> [Regression Potential]
> 
> The fix has been sent to the LKML for review/feedback
> (https://lkml.org/lkml/2019/9/25/393), we have not received any feedback
> so far, but the bug is pretty clear and well tested on the affected
> platforms. Moreover, the code is isolated to the hibernation area, so
> the overall regression potential is minimal.
> 
> ----------------------------------------------------------------
> Andy Whitcroft (1):
>       PM / hibernate: memory_bm_find_bit -- tighten node optimisation
> 
>  kernel/power/snapshot.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> 

I've been following this issue, the fix looks good to me.  Perhaps we
need to ping upstream again on this before the fix gets lost.

Acked-by: Colin Ian King <colin.king@canonical.com>

Colin
Thadeu Lima de Souza Cascardo Oct. 7, 2019, 4:46 p.m. UTC | #2
Limited to hibernation/resume path and tested.

Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Kleber Sacilotto de Souza Oct. 16, 2019, 1:27 p.m. UTC | #3
On 07.10.19 18:35, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1847118
> 
> [Impact]
> 
> A caching bug in the hibernation code can lead to potential memory
> corruptions on resume.
> 
> The hibernation code is representing all the allocated pages in memory
> (pfn) using a list of extents, inside each extent it uses a radix tree
> and each node in the tree contains a bitmap. This structure is used to
> save the memory image to disk.
> 
> To speed up lookups in this structure the kernel is caching the position
> of the previous lookup in the form (current_extent, current_node).
> However, if two consecutive lookups are distant enough from each other,
> the extent can change, but the kernel can still use the cached node
> (current_node), accessing the wrong bitmap and ending up saving to disk
> the wrong pfn's.
> 
> [Test Case]
> 
> Bug has been reproduced in Xenial and Bionic trying to hibernate a large
> instance with a lot of RAM (100GB+).
> 
> But we also wrote a custom kernel module to better isolate the code that
> triggers the problem: https://code.launchpad.net/~arighi/+git/mybitmap
> 
> This module has exactly the same code as the hibernation code, but it
> can be used as a fast test case to reproduce the problem without
> actually triggering a real hibernation/resume cycle.
> 
> [Fix]
> 
> This bug can be fixed by properly invalidating the cached pair (extent,
> node) when the next lookup falls in a different extent or a different
> node.
> 
> [Regression Potential]
> 
> The fix has been sent to the LKML for review/feedback
> (https://lkml.org/lkml/2019/9/25/393), we have not received any feedback
> so far, but the bug is pretty clear and well tested on the affected
> platforms. Moreover, the code is isolated to the hibernation area, so
> the overall regression potential is minimal.
> 
> ----------------------------------------------------------------
> Andy Whitcroft (1):
>       PM / hibernate: memory_bm_find_bit -- tighten node optimisation
> 
>  kernel/power/snapshot.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> 

This patch has been accepted upstream and applied to linux-next, so I have
removed the "UBUNTU: SAUCE:" tags from the subject, added the s-o-b
block from upstream and added the provenance:

({backported/cherry picked} from commit 39800d8fc4083cfe5c8f69333336bf03f9571070 linux-next)

Applied to xenial, bionic, disco and eoan master-next branches.

Thanks,
Kleber
Seth Forshee Oct. 22, 2019, 8:29 p.m. UTC | #4
On Mon, Oct 07, 2019 at 06:35:48PM +0200, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1847118
> 
> [Impact]
> 
> A caching bug in the hibernation code can lead to potential memory
> corruptions on resume.
> 
> The hibernation code is representing all the allocated pages in memory
> (pfn) using a list of extents, inside each extent it uses a radix tree
> and each node in the tree contains a bitmap. This structure is used to
> save the memory image to disk.
> 
> To speed up lookups in this structure the kernel is caching the position
> of the previous lookup in the form (current_extent, current_node).
> However, if two consecutive lookups are distant enough from each other,
> the extent can change, but the kernel can still use the cached node
> (current_node), accessing the wrong bitmap and ending up saving to disk
> the wrong pfn's.
> 
> [Test Case]
> 
> Bug has been reproduced in Xenial and Bionic trying to hibernate a large
> instance with a lot of RAM (100GB+).
> 
> But we also wrote a custom kernel module to better isolate the code that
> triggers the problem: https://code.launchpad.net/~arighi/+git/mybitmap
> 
> This module has exactly the same code as the hibernation code, but it
> can be used as a fast test case to reproduce the problem without
> actually triggering a real hibernation/resume cycle.
> 
> [Fix]
> 
> This bug can be fixed by properly invalidating the cached pair (extent,
> node) when the next lookup falls in a different extent or a different
> node.
> 
> [Regression Potential]
> 
> The fix has been sent to the LKML for review/feedback
> (https://lkml.org/lkml/2019/9/25/393), we have not received any feedback
> so far, but the bug is pretty clear and well tested on the affected
> platforms. Moreover, the code is isolated to the hibernation area, so
> the overall regression potential is minimal.

Applied to unstable/master, thanks!