
lib: memutils: don't pollute entire system memory to avoid OoM

Message ID 20210624132226.84611-1-krzysztof.kozlowski@canonical.com
State Accepted
Series lib: memutils: don't pollute entire system memory to avoid OoM

Commit Message

Krzysztof Kozlowski June 24, 2021, 1:22 p.m. UTC
On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
failing because of OoM killer during memory pollution:

    tst_test.c:1311: TINFO: Timeout per run is 0h 05m 00s
    ioctl_sg01.c:81: TINFO: Found SCSI device /dev/sg2
    tst_test.c:1357: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
    tst_test.c:1359: TBROK: Test killed! (timeout?)

In dmesg:

    [76477.661067] LTP: starting cve-2018-1000204 (ioctl_sg01)
    [76578.062209] ioctl_sg01 invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
    ...
    [76578.062335] Mem-Info:
    [76578.062340] active_anon:63 inactive_anon:49016768 isolated_anon:0
                    active_file:253 inactive_file:117 isolated_file:0
                    unevictable:4871 dirty:4 writeback:0
                    slab_reclaimable:18451 slab_unreclaimable:56355
                    mapped:2478 shmem:310 pagetables:96625 bounce:0
                    free:121136 free_pcp:0 free_cma:0
    ...
    [76578.062527] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-40.scope,task=ioctl_sg01,pid=446171,uid=0
    [76578.062539] Out of memory: Killed process 446171 (ioctl_sg01) total-vm:195955840kB, anon-rss:195941256kB, file-rss:1416kB, shmem-rss:0kB, UID:0 pgtables:383496kB oom_score_adj:0
    [76581.046078] oom_reaper: reaped process 446171 (ioctl_sg01), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

It seems leaving a hard-coded 128 MB of free memory works for small or
medium systems, but on such a bigger machine it creates significant
memory pressure, triggering the out-of-memory reaper.

Memory pressure is usually defined by the ratio between free and total
memory, so adjust the safety/spare memory similarly, to always keep
0.5% of memory free.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
---
 lib/tst_memutils.c | 1 +
 1 file changed, 1 insertion(+)
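
For orientation before the review comments, here is a small standalone
sketch (illustration only, not LTP code) of what the proposed reserve
calculation amounts to; the MAX macro and the example figures in the
comments are assumptions based on the diff at the bottom of this page
and the MemFree value quoted later in the thread:

    /* safety_reserve.c - standalone illustration of the proposed reserve. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/sysinfo.h>

    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    int main(void)
    {
            struct sysinfo info;
            unsigned long long safety;

            if (sysinfo(&info))
                    return 1;

            /* Existing fixed floor: 4096 pages or 128 MB, whichever is larger. */
            safety = MAX(4096ULL * sysconf(_SC_PAGESIZE), 128ULL * 1024 * 1024);

            /* Proposed addition: never dirty the last 0.5% of free RAM.
             * With ~109 GB of MemFree (see the numbers later in the thread)
             * this raises the reserve from 128 MB to roughly 533 MB. */
            safety = MAX(safety,
                         (unsigned long long)info.freeram * info.mem_unit / 200);

            printf("reserve: %llu MB\n", safety / (1024 * 1024));
            return 0;
    }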

Comments

Martin Doucha June 24, 2021, 1:33 p.m. UTC | #1
On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
> On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
> failing because of OoM killer during memory pollution:
> 
> ...
> 
> It seems leaving hard-coded 128 MB free memory works for small or medium
> systems, but for such bigger machine it creates significant memory
> pressure triggering the out of memory reaper.
> 
> The memory pressure usually is defined by ratio between free and total
> memory, so adjust the safety/spare memory similarly to keep always 0.5%
> of memory free.

Hi,
I've sent a similar patch for the same issue a while ago. It covers a
few more edge cases. See [1] for the discussion about it.

[1]
https://patchwork.ozlabs.org/project/ltp/patch/20210127115606.28985-1-mdoucha@suse.cz/
Li Wang June 24, 2021, 2 p.m. UTC | #2
On Thu, Jun 24, 2021 at 9:33 PM Martin Doucha <mdoucha@suse.cz> wrote:

> On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
> > On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
> > failing because of OoM killer during memory pollution:
> >
> > ...
> >
> > It seems leaving hard-coded 128 MB free memory works for small or medium
> > systems, but for such bigger machine it creates significant memory
> > pressure triggering the out of memory reaper.
> >
> > The memory pressure usually is defined by ratio between free and total
> > memory, so adjust the safety/spare memory similarly to keep always 0.5%
> > of memory free.
>
> Hi,
> I've sent a similar patch for the same issue a while ago. It covers a
> few more edge cases. See [1] for the discussion about it.


> [1]
>
> https://patchwork.ozlabs.org/project/ltp/patch/20210127115606.28985-1-mdoucha@suse.cz/


FYI, Another related analysis:
https://lists.linux.it/pipermail/ltp/2021-April/021903.html

The mmap() behavior in GUESS mode changed with commit 8c7829b04c523cd:
we can no longer receive MAP_FAILED on ENOMEM in userspace, unless the
process explicitly allocates more memory than "total_ram + total_swap"
in a single call.

Which also means the MAP_FAILED check at line #51 has permanently lost
its effect:
https://github.com/linux-test-project/ltp/blob/master/lib/tst_memutils.c#L51
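
A minimal standalone demonstration of the behavior described above (a
sketch, assuming a 64-bit system with the default vm.overcommit_memory=0
heuristic/GUESS mode; not LTP code): a single anonymous mapping only
fails once its size exceeds roughly total RAM plus swap, so smaller
mappings never return MAP_FAILED even when the system is about to run
out of memory.

    /* overcommit_guess_demo.c - illustration only (64-bit, overcommit_memory=0). */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/sysinfo.h>

    int main(void)
    {
            struct sysinfo info;
            size_t total, half;
            void *p;

            if (sysinfo(&info))
                    return 1;

            total = (size_t)(info.totalram + info.totalswap) * info.mem_unit;
            half = total / 2;

            /* Well below RAM + swap: expected to succeed (nothing is touched). */
            p = mmap(NULL, half, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("half of RAM+swap: %s\n", p == MAP_FAILED ? "MAP_FAILED" : "ok");
            if (p != MAP_FAILED)
                    munmap(p, half);

            /* Larger than RAM + swap in one call: the only case that still
             * fails with ENOMEM under the GUESS heuristic. */
            p = mmap(NULL, total * 2, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("twice RAM+swap:   %s\n", p == MAP_FAILED ? "MAP_FAILED" : "ok");
            return 0;
    }
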
Martin Doucha June 24, 2021, 2:13 p.m. UTC | #3
On 24. 06. 21 16:00, Li Wang wrote:
> FYI, Another related analysis:
> https://lists.linux.it/pipermail/ltp/2021-April/021903.html
> 
> The mmap() behavior changed in GUESS mode from commit 8c7829b04c523cd,
> we can NOT receive MAP_FAILED on ENOMEM in userspace anymore,unless 
> the process one-time allocating memory larger than "total_ram+
> total_swap" explicitly.
> 
> Which also means the MAP_FAILED check lose effect permanently in line#51:
> https://github.com/linux-test-project/ltp/blob/master/lib/tst_memutils.c#L51

I'm pretty sure that a 32-bit x86 kernel with PAE and more than 3 GB of
RAM will give you MAP_FAILED, because you'll run out of available
address space before you run out of physical memory.
Krzysztof Kozlowski June 24, 2021, 3:07 p.m. UTC | #4
On 24/06/2021 15:33, Martin Doucha wrote:
> On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
>> On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
>> failing because of OoM killer during memory pollution:
>>
>> ...
>>
>> It seems leaving hard-coded 128 MB free memory works for small or medium
>> systems, but for such bigger machine it creates significant memory
>> pressure triggering the out of memory reaper.
>>
>> The memory pressure usually is defined by ratio between free and total
>> memory, so adjust the safety/spare memory similarly to keep always 0.5%
>> of memory free.
> 
> Hi,
> I've sent a similar patch for the same issue a while ago. It covers a
> few more edge cases. See [1] for the discussion about it.
> 

Thanks for the pointer. I see we partially used a similar solution -
always leave some percentage of free memory.

Different kernels might have different limits here, for example v5.11
where this happened has two additional restrictions:

1. vm.min_free_kbytes = 90112
The min_free_kbytes will grow non-linearly up to 256 MB (still for v5.11).

2. vm.lowmem_reserve_ratio = 256	256	32	0	0
This is a 1/X ratio for specific zones, and since this was a highmem
allocation it does not matter here (the machine has plenty of Normal
zone memory).

Therefore the OoM seems to be caused by min_free_kbytes. The machine has
two nodes and the limit looks to be spread between them:

[76578.062366] Node 0 Normal free:44536kB min:44600kB ...
[76578.062373] Node 1 Normal free:44824kB min:45060kB ...

The rest of the free memory is in other zones (11 MB in DMA and 380 MB in
DMA32), which were not used for this allocation.  Therefore, to be
accurate, the safety limit should process /proc/zoneinfo and count the
amount of free memory in the Normal zone. The 128 MB safety limit should
not be counted against total memory, but against the Normal zone.

But this is a much more complex task, and a simple limit of 0.5% usually
does the trick.

P.S. For 32-bit systems the HighMem zone should also be counted together with Normal.

Best regards,
Krzysztof
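
As an illustration of the "much more complex task" mentioned above, a
rough sketch of summing the "pages free" counters of the Normal (and,
on 32-bit, HighMem) zones from /proc/zoneinfo; this is not part of the
patch and the parsing is deliberately simplistic:

    /* zoneinfo_free.c - rough sketch, not part of the LTP patch. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/zoneinfo", "r");
            char line[256], zone[32] = "";
            unsigned long free_pages, usable_free = 0;
            int node;

            if (!f)
                    return 1;

            while (fgets(line, sizeof(line), f)) {
                    /* Zone headers look like "Node 0, zone   Normal". */
                    if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2)
                            continue;
                    /* The first counter under each zone header is "pages free". */
                    if (sscanf(line, " pages free %lu", &free_pages) == 1 &&
                        (!strcmp(zone, "Normal") || !strcmp(zone, "HighMem")))
                            usable_free += free_pages;
            }
            fclose(f);

            printf("free pages in Normal/HighMem zones: %lu\n", usable_free);
            return 0;
    }
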
Krzysztof Kozlowski June 24, 2021, 3:34 p.m. UTC | #5
On 24/06/2021 17:07, Krzysztof Kozlowski wrote:
> 
> On 24/06/2021 15:33, Martin Doucha wrote:
>> On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
>>> On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
>>> failing because of OoM killer during memory pollution:
>>>
>>> ...
>>>
>>> It seems leaving hard-coded 128 MB free memory works for small or medium
>>> systems, but for such bigger machine it creates significant memory
>>> pressure triggering the out of memory reaper.
>>>
>>> The memory pressure usually is defined by ratio between free and total
>>> memory, so adjust the safety/spare memory similarly to keep always 0.5%
>>> of memory free.
>>
>> Hi,
>> I've sent a similar patch for the same issue a while ago. It covers a
>> few more edge cases. See [1] for the discussion about it.
>>
> 
> Thanks for the pointer. I see partially we used similar solution -
> always leave some percentage of free memory.
> 
> Different kernels might have different limits here, for example v5.11
> where this happened has two additional restrictions:
> 
> 1. vm.min_free_kbytes = 90112
> The min_free_kbytes will grow non-linearly up to 256 MB (still for v5.11).
> 
> 2. vm.lowmem_reserve_ratio = 256	256	32	0	0
> Which is a ratio 1/X for specific zones and since it was highmem
> allocation, it does not matter here (machine has plenty of normal zone
> memory).
> 
> Therefore it OoM seems to be caused by min_free_kbytes. The machine has
> two nodes and the limit looks like to be spread between them:
> 
> [76578.062366] Node 0 Normal free:44536kB min:44600kB ...
> [76578.062373] Node 1 Normal free:44824kB min:45060kB ...
> 
> The rest of free memory is in other zones (11 MB DMA and 380 MB in
> DMA32), which were not used for this allocation.  Therefore to be
> accurate, the safety limit should process /proc/zoneinfo and count
> amount of free memory in Normal zone. This 128 MB safety limit should
> not be counted from total memory, but from Normal zone.
> 
> But this is much more complex task and simple limit of 0.5% usually does
> the trick.
> 
> P.S. For 32-bit systems the Highmem zone should also be included in Normal.

Just to backup this with some numbers:
MemTotal:       198067420 kB
MemFree:        109125196 kB => 27 281 299 pages
MemAvailable:   108425900 kB

Node 1 free pages: 2732177
Node 0 free pages: 24305662
                 2732177 + 24305662 = 27037839
                 27037839 + 240511 + 2949 = 27281299 (= MemFree in pages)
DMA32 free pages: 240511
DMA free pages: 2949

You can see that MemFree, which is returned by sysinfo, includes the
DMA32 and DMA zones, which is not valid: under low-memory pressure,
user-space (allocating a highmem page) cannot allocate memory from the
DMA zones, so the Normal zone counters are in reality lower and hit the
minimum level.

Best regards,
Krzysztof
Krzysztof Kozlowski June 24, 2021, 3:46 p.m. UTC | #6
On 24/06/2021 17:34, Krzysztof Kozlowski wrote:
> On 24/06/2021 17:07, Krzysztof Kozlowski wrote:
>>
>> On 24/06/2021 15:33, Martin Doucha wrote:
>>> On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
>>>> On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
>>>> failing because of OoM killer during memory pollution:
>>>>
>>>> ...
>>>>
>>>> It seems leaving hard-coded 128 MB free memory works for small or medium
>>>> systems, but for such bigger machine it creates significant memory
>>>> pressure triggering the out of memory reaper.
>>>>
>>>> The memory pressure usually is defined by ratio between free and total
>>>> memory, so adjust the safety/spare memory similarly to keep always 0.5%
>>>> of memory free.
>>>
>>> Hi,
>>> I've sent a similar patch for the same issue a while ago. It covers a
>>> few more edge cases. See [1] for the discussion about it.
>>>
>>
>> Thanks for the pointer. I see partially we used similar solution -
>> always leave some percentage of free memory.
>>
>> Different kernels might have different limits here, for example v5.11
>> where this happened has two additional restrictions:
>>
>> 1. vm.min_free_kbytes = 90112
>> The min_free_kbytes will grow non-linearly up to 256 MB (still for v5.11).
>>
>> 2. vm.lowmem_reserve_ratio = 256	256	32	0	0
>> Which is a ratio 1/X for specific zones and since it was highmem
>> allocation, it does not matter here (machine has plenty of normal zone
>> memory).
>>
>> Therefore it OoM seems to be caused by min_free_kbytes. The machine has
>> two nodes and the limit looks like to be spread between them:
>>
>> [76578.062366] Node 0 Normal free:44536kB min:44600kB ...
>> [76578.062373] Node 1 Normal free:44824kB min:45060kB ...
>>
>> The rest of free memory is in other zones (11 MB DMA and 380 MB in
>> DMA32), which were not used for this allocation.  Therefore to be
>> accurate, the safety limit should process /proc/zoneinfo and count
>> amount of free memory in Normal zone. This 128 MB safety limit should
>> not be counted from total memory, but from Normal zone.
>>
>> But this is much more complex task and simple limit of 0.5% usually does
>> the trick.
>>
>> P.S. For 32-bit systems the Highmem zone should also be included in Normal.
> 
> Just to backup this with some numbers:
> MemTotal:       198067420 kB
> MemFree:        109125196 kB => 27 281 299 pages
> MemAvailable:   108425900 kB
> 
> Node 1 free pages: 2732177
> Node 0 free pages: 24305662
>                  2732177+24305662 = 27037839
> DMA32 free pages: 240511
> DMA free pages: 2949
> 
> You can see that MemFree, which is returned by sysinfo, includes DMA32
> and DMA zones which is not valid. Under low memory pressure user-space
> (allocating highmem page) cannot allocate memory from DMA zones and
> normal zones counters are in reality lower and hitting minimal level.

Which brings up the point that using sysinfo is not reliable in the
first place: it returns free memory, not available memory, even though
the man page says otherwise.

It would be better to read /proc/meminfo, use MemAvailable and subtract
swap from it (as MemAvailable takes the watermarks /the low limit/ into
account).

Best regards,
Krzysztof
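
A hedged sketch of the alternative suggested above (illustration only,
not the submitted patch): read MemAvailable from /proc/meminfo, which
already accounts for the low watermark and reclaimable caches, instead
of relying on sysinfo's freeram:

    /* meminfo_available.c - illustration only. */
    #include <stdio.h>

    /* Returns MemAvailable in kB, or 0 if the field is missing
     * (kernels older than 3.14 do not export it). */
    static unsigned long long read_mem_available(void)
    {
            FILE *f = fopen("/proc/meminfo", "r");
            char line[128];
            unsigned long long kb = 0;

            if (!f)
                    return 0;

            while (fgets(line, sizeof(line), f)) {
                    if (sscanf(line, "MemAvailable: %llu kB", &kb) == 1)
                            break;
            }
            fclose(f);
            return kb;
    }

    int main(void)
    {
            printf("MemAvailable: %llu kB\n", read_mem_available());
            return 0;
    }
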
Cyril Hrubis Sept. 8, 2021, 1:37 p.m. UTC | #7
Hi!
I guess that this is another bug that should be fixed before the
release. I still think that the memory pollution is a best effort
operation and that we should be more conservative with the reserve. I
would go for a few percent of the free memory just to be extra sure
that we do not cause memory pressure.

If we go for 2% we will add the following:

safety = MAX(safety, info.freeram / 50);

Also, it looks like info.freeram is the same as MemFree: from
/proc/meminfo. I guess that this is not wrong, since memory that has
been used in buffers is dirty enough for our case.
Martin Doucha Sept. 8, 2021, 1:54 p.m. UTC | #8
On 08. 09. 21 15:37, Cyril Hrubis wrote:
> Hi!
> I guess that this is another bug that should be fixed before the
> release. I still think that the memory pollution is a best effort
> operation and that we should be more conservative with the reserve. I
> would go for a few percents of the free memory just to be extra sure
> that we do not cause memory pressure.
> 
> If we go for 2% we will add following;
> 
> safety = MAX(safety, info.freeram / 50);
> 
> Also it looks like info.freeram is the same as MemFree: from
> /proc/meminfo, I guess that this is not wrong, since memory that have
> been used in buffers is dirty enough for our case.

I'd recommend dividing by a power of 2 (either 32 or 64) but other than
that, I completely agree.
Cyril Hrubis Sept. 8, 2021, 2:17 p.m. UTC | #9
Hi!
> > I guess that this is another bug that should be fixed before the
> > release. I still think that the memory pollution is a best effort
> > operation and that we should be more conservative with the reserve. I
> > would go for a few percents of the free memory just to be extra sure
> > that we do not cause memory pressure.
> > 
> > If we go for 2% we will add following;
> > 
> > safety = MAX(safety, info.freeram / 50);
> > 
> > Also it looks like info.freeram is the same as MemFree: from
> > /proc/meminfo, I guess that this is not wrong, since memory that have
> > been used in buffers is dirty enough for our case.
> 
> I'd recommend dividing by a power of 2 (either 32 or 64) but other than
> that, I completely agree.

Sounds good.

Krzysztof unless you disagree I will push your patch but change the
division factor from 200 to 64.
Krzysztof Kozlowski Sept. 8, 2021, 2:19 p.m. UTC | #10
On Wed, 8 Sept 2021 at 16:17, Cyril Hrubis <chrubis@suse.cz> wrote:
>
> Hi!
> > > I guess that this is another bug that should be fixed before the
> > > release. I still think that the memory pollution is a best effort
> > > operation and that we should be more conservative with the reserve. I
> > > would go for a few percents of the free memory just to be extra sure
> > > that we do not cause memory pressure.
> > >
> > > If we go for 2% we will add following;
> > >
> > > safety = MAX(safety, info.freeram / 50);
> > >
> > > Also it looks like info.freeram is the same as MemFree: from
> > > /proc/meminfo, I guess that this is not wrong, since memory that have
> > > been used in buffers is dirty enough for our case.
> >
> > I'd recommend dividing by a power of 2 (either 32 or 64) but other than
> > that, I completely agree.
>
> Sounds good.
>
> Krzysztof unless you disagree I will push your patch but change the
> division factor from 200 to 64.


Sounds good. In that case please also update the % mentioned at the
end of the commit msg (0.5% -> 1.5%).

Best regards,
Krzysztof
Cyril Hrubis Sept. 8, 2021, 2:32 p.m. UTC | #11
Hi!
> > Sounds good.
> >
> > Krzysztof unless you disagree I will push your patch but change the
> > division factor from 200 to 64.
> 
> 
> Sounds good. In such case please also update the % mentioned at the
> end of commit msg (0.5% -> 1.5%).

Done and pushed, thanks.
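
Note for readers of the diff below: per the discussion above, the
version that was pushed presumably carries the agreed divisor, i.e.
something like

    safety = MAX(safety, (info.freeram / 64));

while the archived diff still shows the original /200 from this
submission.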

Patch

diff --git a/lib/tst_memutils.c b/lib/tst_memutils.c
index dd09db4902b0..abf382d41b20 100644
--- a/lib/tst_memutils.c
+++ b/lib/tst_memutils.c
@@ -21,6 +21,7 @@  void tst_pollute_memory(size_t maxsize, int fillchar)
 
 	SAFE_SYSINFO(&info);
 	safety = MAX(4096 * SAFE_SYSCONF(_SC_PAGESIZE), 128 * 1024 * 1024);
+	safety = MAX(safety, (info.freeram / 200));
 	safety /= info.mem_unit;
 
 	if (info.freeswap > safety)