diff mbox series

[v3] lib: memutils: respect minimum memory watermark when polluting memory

Message ID 20211020091353.90731-1-krzysztof.kozlowski@canonical.com
State Superseded
Headers show
Series [v3] lib: memutils: respect minimum memory watermark when polluting memory | expand

Commit Message

Krzysztof Kozlowski Oct. 20, 2021, 9:13 a.m. UTC
Previous fix for an out-of-memory killer killing ioctl_sg01 process
in commit 4d2e3d44fad5 ("lib: memutils: don't pollute
entire system memory to avoid OoM") was not fully effective.  While it
covers most of the cases, an ARM64 machine with 128 GB of memory, 64 kB
page size and v5.11 kernel hit it again.  Polluting the memory fails
with OoM:

  LTP: starting ioctl_sg01
  ioctl_sg01 invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
  ...
  Mem-Info:
  active_anon:309 inactive_anon:1964781 isolated_anon:0
                  active_file:94 inactive_file:0 isolated_file:0
                  unevictable:305 dirty:0 writeback:0
                  slab_reclaimable:1510 slab_unreclaimable:5012
                  mapped:115 shmem:339 pagetables:463 bounce:0
                  free:112043 free_pcp:1 free_cma:3159
  Node 0 active_anon:19776kB inactive_anon:125745984kB active_file:6016kB inactive_file:0kB unevictable:19520kB ...
  Node 0 DMA free:710656kB min:205120kB low:256384kB high:307648kB reserved_highatomic:0KB active_anon:0kB inactive_anon:3332032kB ...
  lowmem_reserve[]: 0 0 7908 7908 7908
  Node 0 Normal free:6460096kB min:6463168kB low:8078912kB high:9694656kB reserved_highatomic:0KB active_anon:19776kB inactive_anon:122413952kB ...
  lowmem_reserve[]: 0 0 0 0 0

The important part are details of memory on Node 0 in Normal zone:
1. free memory: 6460096 kB
2. min (minimum watermark): 6463168 kB

Parse the /proc/sys/vm/min_free_kbytes which contains the free
memory level used as minimum watermark (triggering OoM killer).

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>

---

Changes since v2:
1. Use /proc/sys/vm/min_free_kbytes instead of parsing zoneinfo, thanks
   tgo Liu Xinpeng.

Changes since v1:
1. Add static and rename to count_min_pages().
---
 lib/tst_memutils.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Li Wang Oct. 21, 2021, 7:21 a.m. UTC | #1
On Wed, Oct 20, 2021 at 5:14 PM Krzysztof Kozlowski <
krzysztof.kozlowski@canonical.com> wrote:

> Previous fix for an out-of-memory killer killing ioctl_sg01 process
> in commit 4d2e3d44fad5 ("lib: memutils: don't pollute
> entire system memory to avoid OoM") was not fully effective.  While it
> covers most of the cases, an ARM64 machine with 128 GB of memory, 64 kB
> page size and v5.11 kernel hit it again.  Polluting the memory fails
> with OoM:
>
>   LTP: starting ioctl_sg01
>   ioctl_sg01 invoked oom-killer:
> gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
>   ...
>   Mem-Info:
>   active_anon:309 inactive_anon:1964781 isolated_anon:0
>                   active_file:94 inactive_file:0 isolated_file:0
>                   unevictable:305 dirty:0 writeback:0
>                   slab_reclaimable:1510 slab_unreclaimable:5012
>                   mapped:115 shmem:339 pagetables:463 bounce:0
>                   free:112043 free_pcp:1 free_cma:3159
>   Node 0 active_anon:19776kB inactive_anon:125745984kB active_file:6016kB
> inactive_file:0kB unevictable:19520kB ...
>   Node 0 DMA free:710656kB min:205120kB low:256384kB high:307648kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:3332032kB ...
>   lowmem_reserve[]: 0 0 7908 7908 7908
>   Node 0 Normal free:6460096kB min:6463168kB low:8078912kB high:9694656kB
> reserved_highatomic:0KB active_anon:19776kB inactive_anon:122413952kB ...
>   lowmem_reserve[]: 0 0 0 0 0
>
> The important part are details of memory on Node 0 in Normal zone:
> 1. free memory: 6460096 kB
> 2. min (minimum watermark): 6463168 kB
>
> Parse the /proc/sys/vm/min_free_kbytes which contains the free
> memory level used as minimum watermark (triggering OoM killer).
>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
>
> ---
>
> Changes since v2:
> 1. Use /proc/sys/vm/min_free_kbytes instead of parsing zoneinfo, thanks
>    tgo Liu Xinpeng.
>
> Changes since v1:
> 1. Add static and rename to count_min_pages().
> ---
>  lib/tst_memutils.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/lib/tst_memutils.c b/lib/tst_memutils.c
> index af132bcc6c24..df53c542d239 100644
> --- a/lib/tst_memutils.c
> +++ b/lib/tst_memutils.c
> @@ -16,12 +16,18 @@
>  void tst_pollute_memory(size_t maxsize, int fillchar)
>  {
>         size_t i, map_count = 0, safety = 0, blocksize = BLOCKSIZE;
> +       unsigned long min_free;
>         void **map_blocks;
>         struct sysinfo info;
>
> +       SAFE_FILE_SCANF("/proc/sys/vm/min_free_kbytes", "%lu", &min_free);
> +       min_free *= 1024;
> +       /* Apply a margin because we cannot get below "min" watermark */
> +       min_free += min_free / 10;
> +
>         SAFE_SYSINFO(&info);
>         safety = MAX(4096 * SAFE_SYSCONF(_SC_PAGESIZE), 128 * 1024 * 1024);
> -       safety = MAX(safety, (info.freeram / 64));
> +       safety = MAX(safety, min_free);
>

Therically this is correct, and I believe it will work on your tested
machine.

But my concern is ioctl_sg01 still fails on the special system which
MemAvai < MemFree.

Just like the one Xinpeng mentioned before:
https://lists.linux.it/pipermail/ltp/2021-January/020817.html

[root@test-env-nm05-compute-14e5e72e38 ~]# cat /proc/meminfo

MemTotal:       526997420 kB
MemFree:        520224908 kB
MemAvailable:   519936744 kB...

[root@test-env-nm05-compute-14e5e72e38 ~]# cat  /proc/sys/vm/min_free_kbytes
90112


There even reserve the safety to the 128MB, still less than the gap
between MemFree and MemAvailable.

MemFree (520224908 kB) - MemAvailable (520224908 kB) = 288164 kB  > safety
Krzysztof Kozlowski Oct. 21, 2021, 7:36 a.m. UTC | #2
On 21/10/2021 09:21, Li Wang wrote:
> 
> 
> On Wed, Oct 20, 2021 at 5:14 PM Krzysztof Kozlowski
> <krzysztof.kozlowski@canonical.com
> <mailto:krzysztof.kozlowski@canonical.com>> wrote:
> 
>     Previous fix for an out-of-memory killer killing ioctl_sg01 process
>     in commit 4d2e3d44fad5 ("lib: memutils: don't pollute
>     entire system memory to avoid OoM") was not fully effective.  While it
>     covers most of the cases, an ARM64 machine with 128 GB of memory, 64 kB
>     page size and v5.11 kernel hit it again.  Polluting the memory fails
>     with OoM:
> 
>       LTP: starting ioctl_sg01
>       ioctl_sg01 invoked oom-killer:
>     gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
>     oom_score_adj=0
>       ...
>       Mem-Info:
>       active_anon:309 inactive_anon:1964781 isolated_anon:0
>                       active_file:94 inactive_file:0 isolated_file:0
>                       unevictable:305 dirty:0 writeback:0
>                       slab_reclaimable:1510 slab_unreclaimable:5012
>                       mapped:115 shmem:339 pagetables:463 bounce:0
>                       free:112043 free_pcp:1 free_cma:3159
>       Node 0 active_anon:19776kB inactive_anon:125745984kB
>     active_file:6016kB inactive_file:0kB unevictable:19520kB ...
>       Node 0 DMA free:710656kB min:205120kB low:256384kB high:307648kB
>     reserved_highatomic:0KB active_anon:0kB inactive_anon:3332032kB ...
>       lowmem_reserve[]: 0 0 7908 7908 7908
>       Node 0 Normal free:6460096kB min:6463168kB low:8078912kB
>     high:9694656kB reserved_highatomic:0KB active_anon:19776kB
>     inactive_anon:122413952kB ...
>       lowmem_reserve[]: 0 0 0 0 0
> 
>     The important part are details of memory on Node 0 in Normal zone:
>     1. free memory: 6460096 kB
>     2. min (minimum watermark): 6463168 kB
> 
>     Parse the /proc/sys/vm/min_free_kbytes which contains the free
>     memory level used as minimum watermark (triggering OoM killer).
> 
>     Signed-off-by: Krzysztof Kozlowski
>     <krzysztof.kozlowski@canonical.com
>     <mailto:krzysztof.kozlowski@canonical.com>>
> 
>     ---
> 
>     Changes since v2:
>     1. Use /proc/sys/vm/min_free_kbytes instead of parsing zoneinfo, thanks
>        tgo Liu Xinpeng.
> 
>     Changes since v1:
>     1. Add static and rename to count_min_pages().
>     ---
>      lib/tst_memutils.c | 8 +++++++-
>      1 file changed, 7 insertions(+), 1 deletion(-)
> 
>     diff --git a/lib/tst_memutils.c b/lib/tst_memutils.c
>     index af132bcc6c24..df53c542d239 100644
>     --- a/lib/tst_memutils.c
>     +++ b/lib/tst_memutils.c
>     @@ -16,12 +16,18 @@
>      void tst_pollute_memory(size_t maxsize, int fillchar)
>      {
>             size_t i, map_count = 0, safety = 0, blocksize = BLOCKSIZE;
>     +       unsigned long min_free;
>             void **map_blocks;
>             struct sysinfo info;
> 
>     +       SAFE_FILE_SCANF("/proc/sys/vm/min_free_kbytes", "%lu",
>     &min_free);
>     +       min_free *= 1024;
>     +       /* Apply a margin because we cannot get below "min" watermark */
>     +       min_free += min_free / 10;
>     +
>             SAFE_SYSINFO(&info);
>             safety = MAX(4096 * SAFE_SYSCONF(_SC_PAGESIZE), 128 * 1024 *
>     1024);
>     -       safety = MAX(safety, (info.freeram / 64));
>     +       safety = MAX(safety, min_free);
> 
> 
> Therically this is correct, and I believe it will work on your tested
> machine.
> 
> But my concern is ioctl_sg01 still fails on the special system which
> MemAvai < MemFree.
> 
> Just like the one Xinpeng mentioned before:
> https://lists.linux.it/pipermail/ltp/2021-January/020817.html
> <https://lists.linux.it/pipermail/ltp/2021-January/020817.html>
> 
> [root@test-env-nm05-compute-14e5e72e38
> <mailto:root@test-env-nm05-compute-14e5e72e38>~]# cat /proc/meminfo
> 
> MemTotal:       526997420 kB
> MemFree:        520224908 kB
> MemAvailable:   519936744 kB
> ...
> 
> [root@test-env-nm05-compute-14e5e72e38 <mailto:root@test-env-nm05-compute-14e5e72e38> ~]# cat  /proc/sys/vm/min_free_kbytes
> 90112
> 
> 
> There even reserve the safety to the 128MB, still less than the gap
> between MemFree and MemAvailable. 
> 
> MemFree (520224908 kB) - MemAvailable (520224908 kB) = 288164 kB  > safety

I don't have such case and I am not sure it is reasonable.

As mentioned in the thread there it looks unusual to have less available
memory than free. Maybe the system has some weird memory accounting
because MemAvailable is counted from MemFree by adding memory which can
be reclaimed. When adding a non-negative number, you should not end up
with lower MemAvailable than MemFree. :)

Maybe that's the reason why that patch was not accepted - the system is
not vanilla, not common?

Best regards,
Krzysztof
Krzysztof Kozlowski Oct. 21, 2021, 7:55 a.m. UTC | #3
On 21/10/2021 09:36, Krzysztof Kozlowski wrote:
> On 21/10/2021 09:21, Li Wang wrote:
>>
>>
>> On Wed, Oct 20, 2021 at 5:14 PM Krzysztof Kozlowski
>> <krzysztof.kozlowski@canonical.com
>> <mailto:krzysztof.kozlowski@canonical.com>> wrote:
>>
>>     Previous fix for an out-of-memory killer killing ioctl_sg01 process
>>     in commit 4d2e3d44fad5 ("lib: memutils: don't pollute
>>     entire system memory to avoid OoM") was not fully effective.  While it
>>     covers most of the cases, an ARM64 machine with 128 GB of memory, 64 kB
>>     page size and v5.11 kernel hit it again.  Polluting the memory fails
>>     with OoM:
>>
>>       LTP: starting ioctl_sg01
>>       ioctl_sg01 invoked oom-killer:
>>     gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
>>     oom_score_adj=0
>>       ...
>>       Mem-Info:
>>       active_anon:309 inactive_anon:1964781 isolated_anon:0
>>                       active_file:94 inactive_file:0 isolated_file:0
>>                       unevictable:305 dirty:0 writeback:0
>>                       slab_reclaimable:1510 slab_unreclaimable:5012
>>                       mapped:115 shmem:339 pagetables:463 bounce:0
>>                       free:112043 free_pcp:1 free_cma:3159
>>       Node 0 active_anon:19776kB inactive_anon:125745984kB
>>     active_file:6016kB inactive_file:0kB unevictable:19520kB ...
>>       Node 0 DMA free:710656kB min:205120kB low:256384kB high:307648kB
>>     reserved_highatomic:0KB active_anon:0kB inactive_anon:3332032kB ...
>>       lowmem_reserve[]: 0 0 7908 7908 7908
>>       Node 0 Normal free:6460096kB min:6463168kB low:8078912kB
>>     high:9694656kB reserved_highatomic:0KB active_anon:19776kB
>>     inactive_anon:122413952kB ...
>>       lowmem_reserve[]: 0 0 0 0 0
>>
>>     The important part are details of memory on Node 0 in Normal zone:
>>     1. free memory: 6460096 kB
>>     2. min (minimum watermark): 6463168 kB
>>
>>     Parse the /proc/sys/vm/min_free_kbytes which contains the free
>>     memory level used as minimum watermark (triggering OoM killer).
>>
>>     Signed-off-by: Krzysztof Kozlowski
>>     <krzysztof.kozlowski@canonical.com
>>     <mailto:krzysztof.kozlowski@canonical.com>>
>>
>>     ---
>>
>>     Changes since v2:
>>     1. Use /proc/sys/vm/min_free_kbytes instead of parsing zoneinfo, thanks
>>        tgo Liu Xinpeng.
>>
>>     Changes since v1:
>>     1. Add static and rename to count_min_pages().
>>     ---
>>      lib/tst_memutils.c | 8 +++++++-
>>      1 file changed, 7 insertions(+), 1 deletion(-)
>>
>>     diff --git a/lib/tst_memutils.c b/lib/tst_memutils.c
>>     index af132bcc6c24..df53c542d239 100644
>>     --- a/lib/tst_memutils.c
>>     +++ b/lib/tst_memutils.c
>>     @@ -16,12 +16,18 @@
>>      void tst_pollute_memory(size_t maxsize, int fillchar)
>>      {
>>             size_t i, map_count = 0, safety = 0, blocksize = BLOCKSIZE;
>>     +       unsigned long min_free;
>>             void **map_blocks;
>>             struct sysinfo info;
>>
>>     +       SAFE_FILE_SCANF("/proc/sys/vm/min_free_kbytes", "%lu",
>>     &min_free);
>>     +       min_free *= 1024;
>>     +       /* Apply a margin because we cannot get below "min" watermark */
>>     +       min_free += min_free / 10;
>>     +
>>             SAFE_SYSINFO(&info);
>>             safety = MAX(4096 * SAFE_SYSCONF(_SC_PAGESIZE), 128 * 1024 *
>>     1024);
>>     -       safety = MAX(safety, (info.freeram / 64));
>>     +       safety = MAX(safety, min_free);
>>
>>
>> Therically this is correct, and I believe it will work on your tested
>> machine.
>>
>> But my concern is ioctl_sg01 still fails on the special system which
>> MemAvai < MemFree.
>>
>> Just like the one Xinpeng mentioned before:
>> https://lists.linux.it/pipermail/ltp/2021-January/020817.html
>> <https://lists.linux.it/pipermail/ltp/2021-January/020817.html>
>>
>> [root@test-env-nm05-compute-14e5e72e38
>> <mailto:root@test-env-nm05-compute-14e5e72e38>~]# cat /proc/meminfo
>>
>> MemTotal:       526997420 kB
>> MemFree:        520224908 kB
>> MemAvailable:   519936744 kB
>> ...
>>
>> [root@test-env-nm05-compute-14e5e72e38 <mailto:root@test-env-nm05-compute-14e5e72e38> ~]# cat  /proc/sys/vm/min_free_kbytes
>> 90112
>>
>>
>> There even reserve the safety to the 128MB, still less than the gap
>> between MemFree and MemAvailable. 
>>
>> MemFree (520224908 kB) - MemAvailable (520224908 kB) = 288164 kB  > safety
> 
> I don't have such case and I am not sure it is reasonable.
> 
> As mentioned in the thread there it looks unusual to have less available
> memory than free. Maybe the system has some weird memory accounting
> because MemAvailable is counted from MemFree by adding memory which can
> be reclaimed. When adding a non-negative number, you should not end up
> with lower MemAvailable than MemFree. :)
> 
> Maybe that's the reason why that patch was not accepted - the system is
> not vanilla, not common?

OK, I found a possible explanation (on vanilla kernel) - the
totalreserve_pages. This is the only subtraction from free memory when
counting available. This could happen if someone was setting sysctl
lowmem_reserve_ratio or min_free_kbytes.

When setting min_free_kbytes, this will be reflected in
/proc/sys/vm/min_free_kbytes, so we are good.

When setting vm.lowmem_reserve_ratio, this will be missed by my patch
and MemAvailable could be lower than MemFree.

I'll send a v4.

Best regards,
Krzysztof
Li Wang Oct. 21, 2021, 8:05 a.m. UTC | #4
> >> Therically this is correct, and I believe it will work on your tested
> >> machine.
> >>
> >> But my concern is ioctl_sg01 still fails on the special system which
> >> MemAvai < MemFree.
> >>
> >> Just like the one Xinpeng mentioned before:
> >> https://lists.linux.it/pipermail/ltp/2021-January/020817.html
> >> <https://lists.linux.it/pipermail/ltp/2021-January/020817.html>
> >>
> >> [root@test-env-nm05-compute-14e5e72e38
> >> <mailto:root@test-env-nm05-compute-14e5e72e38>~]# cat /proc/meminfo
> >>
> >> MemTotal:       526997420 kB
> >> MemFree:        520224908 kB
> >> MemAvailable:   519936744 kB
> >> ...
> >>
> >> [root@test-env-nm05-compute-14e5e72e38 <mailto:
> root@test-env-nm05-compute-14e5e72e38> ~]# cat
> /proc/sys/vm/min_free_kbytes
> >> 90112
> >>
> >>
> >> There even reserve the safety to the 128MB, still less than the gap
> >> between MemFree and MemAvailable.
> >>
> >> MemFree (520224908 kB) - MemAvailable (520224908 kB) = 288164 kB  >
> safety
> >
> > I don't have such case and I am not sure it is reasonable.
> >
> > As mentioned in the thread there it looks unusual to have less available
> > memory than free. Maybe the system has some weird memory accounting
> > because MemAvailable is counted from MemFree by adding memory which can
> > be reclaimed. When adding a non-negative number, you should not end up
> > with lower MemAvailable than MemFree. :)
> >
> > Maybe that's the reason why that patch was not accepted - the system is
> > not vanilla, not common?
>
> OK, I found a possible explanation (on vanilla kernel) - the
> totalreserve_pages. This is the only subtraction from free memory when
> counting available. This could happen if someone was setting sysctl
> lowmem_reserve_ratio or min_free_kbytes.
>

That's exactly, beside the two controllers, you could also get such
a system with enabling smaller swap space on aarch64/x86_64.

(I did that and found that 'MemFree > MemAvail' is common to see)


> When setting min_free_kbytes, this will be reflected in
> /proc/sys/vm/min_free_kbytes, so we are good.
>
> When setting vm.lowmem_reserve_ratio, this will be missed by my patch
> and MemAvailable could be lower than MemFree.
>

Yes, we have to face different kinds of system configuration in testing.

Maybe we'd better keep the (info.freeram/64) in MAX() as well,
Or, do a comparison between MemFree and MemAvail to choose
the smaller one as baseline.
diff mbox series

Patch

diff --git a/lib/tst_memutils.c b/lib/tst_memutils.c
index af132bcc6c24..df53c542d239 100644
--- a/lib/tst_memutils.c
+++ b/lib/tst_memutils.c
@@ -16,12 +16,18 @@ 
 void tst_pollute_memory(size_t maxsize, int fillchar)
 {
 	size_t i, map_count = 0, safety = 0, blocksize = BLOCKSIZE;
+	unsigned long min_free;
 	void **map_blocks;
 	struct sysinfo info;
 
+	SAFE_FILE_SCANF("/proc/sys/vm/min_free_kbytes", "%lu", &min_free);
+	min_free *= 1024;
+	/* Apply a margin because we cannot get below "min" watermark */
+	min_free += min_free / 10;
+
 	SAFE_SYSINFO(&info);
 	safety = MAX(4096 * SAFE_SYSCONF(_SC_PAGESIZE), 128 * 1024 * 1024);
-	safety = MAX(safety, (info.freeram / 64));
+	safety = MAX(safety, min_free);
 	safety /= info.mem_unit;
 
 	if (info.freeswap > safety)