Message ID | 20191007163550.20548-1-andrea.righi@canonical.com |
---|---|
Headers | show |
Series | PM / hibernate: fix potential memory corruption | expand |
On 07/10/2019 17:35, Andrea Righi wrote: > BugLink: https://bugs.launchpad.net/bugs/1847118 > > [Impact] > > A caching bug in the hibernation code can lead to potential memory > corruptions on resume. > > The hibernation code is representing all the allocated pages in memory > (pfn) using a list of extents, inside each extent it uses a radix tree > and each node in the tree contains a bitmap. This structure is used to > save the memory image to disk. > > To speed up lookups in this structure the kernel is caching the position > of the previous lookup in the form (current_extent, current_node). > However, if two consecutive lookups are distant enough from each other, > the extent can change, but the kernel can still use the cached node > (current_node), accessing the wrong bitmap and ending up saving to disk > the wrong pfn's. > > [Test Case] > > Bug has been reproduced in Xenial and Bionic trying to hibernate a large > instance with a lot of RAM (100GB+). > > But we also wrote a custom kernel module to better isolate the code that > triggers the problem: https://code.launchpad.net/~arighi/+git/mybitmap > > This module has exactly the same code as the hibernation code, but it > can be used as a fast test case to reproduce the problem without > actually triggering a real hibernation/resume cycle. > > [Fix] > > This bug can be fixed by properly invalidating the cached pair (extent, > node) when the next lookup falls in a different extent or a different > node. > > [Regression Potential] > > The fix has been sent to the LKML for review/feedback > (https://lkml.org/lkml/2019/9/25/393), we have not received any feedback > so far, but the bug is pretty clear and well tested on the affected > platforms. Moreover, the code is isolated to the hibernation area, so > the overall regression potential is minimal. > > ---------------------------------------------------------------- > Andy Whitcroft (1): > PM / hibernate: memory_bm_find_bit -- tighten node optimisation > > kernel/power/snapshot.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > I've been following this issue, the fix looks good to me. Perhaps we need to ping upstream again on this before the fix gets lost. Acked-by: Colin Ian King <colin.king@canonical.com> Colin
Limited to hibernation/resume path and tested.
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
On 07.10.19 18:35, Andrea Righi wrote: > BugLink: https://bugs.launchpad.net/bugs/1847118 > > [Impact] > > A caching bug in the hibernation code can lead to potential memory > corruptions on resume. > > The hibernation code is representing all the allocated pages in memory > (pfn) using a list of extents, inside each extent it uses a radix tree > and each node in the tree contains a bitmap. This structure is used to > save the memory image to disk. > > To speed up lookups in this structure the kernel is caching the position > of the previous lookup in the form (current_extent, current_node). > However, if two consecutive lookups are distant enough from each other, > the extent can change, but the kernel can still use the cached node > (current_node), accessing the wrong bitmap and ending up saving to disk > the wrong pfn's. > > [Test Case] > > Bug has been reproduced in Xenial and Bionic trying to hibernate a large > instance with a lot of RAM (100GB+). > > But we also wrote a custom kernel module to better isolate the code that > triggers the problem: https://code.launchpad.net/~arighi/+git/mybitmap > > This module has exactly the same code as the hibernation code, but it > can be used as a fast test case to reproduce the problem without > actually triggering a real hibernation/resume cycle. > > [Fix] > > This bug can be fixed by properly invalidating the cached pair (extent, > node) when the next lookup falls in a different extent or a different > node. > > [Regression Potential] > > The fix has been sent to the LKML for review/feedback > (https://lkml.org/lkml/2019/9/25/393), we have not received any feedback > so far, but the bug is pretty clear and well tested on the affected > platforms. Moreover, the code is isolated to the hibernation area, so > the overall regression potential is minimal. > > ---------------------------------------------------------------- > Andy Whitcroft (1): > PM / hibernate: memory_bm_find_bit -- tighten node optimisation > > kernel/power/snapshot.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > This patch has been accepted upstream and applied to linux-next, so I have removed the "UBUNTU: SAUCE:" tags from the subject, added the s-o-b block from upstream and added the provenance: ({backported/cherry picked} from commit 39800d8fc4083cfe5c8f69333336bf03f9571070 linux-next) Applied to xenial, bionic, disco and eoan master-next branches. Thanks, Kleber
On Mon, Oct 07, 2019 at 06:35:48PM +0200, Andrea Righi wrote: > BugLink: https://bugs.launchpad.net/bugs/1847118 > > [Impact] > > A caching bug in the hibernation code can lead to potential memory > corruptions on resume. > > The hibernation code is representing all the allocated pages in memory > (pfn) using a list of extents, inside each extent it uses a radix tree > and each node in the tree contains a bitmap. This structure is used to > save the memory image to disk. > > To speed up lookups in this structure the kernel is caching the position > of the previous lookup in the form (current_extent, current_node). > However, if two consecutive lookups are distant enough from each other, > the extent can change, but the kernel can still use the cached node > (current_node), accessing the wrong bitmap and ending up saving to disk > the wrong pfn's. > > [Test Case] > > Bug has been reproduced in Xenial and Bionic trying to hibernate a large > instance with a lot of RAM (100GB+). > > But we also wrote a custom kernel module to better isolate the code that > triggers the problem: https://code.launchpad.net/~arighi/+git/mybitmap > > This module has exactly the same code as the hibernation code, but it > can be used as a fast test case to reproduce the problem without > actually triggering a real hibernation/resume cycle. > > [Fix] > > This bug can be fixed by properly invalidating the cached pair (extent, > node) when the next lookup falls in a different extent or a different > node. > > [Regression Potential] > > The fix has been sent to the LKML for review/feedback > (https://lkml.org/lkml/2019/9/25/393), we have not received any feedback > so far, but the bug is pretty clear and well tested on the affected > platforms. Moreover, the code is isolated to the hibernation area, so > the overall regression potential is minimal. Applied to unstable/master, thanks!