Message ID | 20210624132226.84611-1-krzysztof.kozlowski@canonical.com
State      | Accepted
Series     | lib: memutils: don't pollute entire system memory to avoid OoM
On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
> On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
> failing because of OoM killer during memory pollution:
>
> ...
>
> It seems leaving hard-coded 128 MB free memory works for small or medium
> systems, but for such bigger machine it creates significant memory
> pressure triggering the out of memory reaper.
>
> The memory pressure usually is defined by ratio between free and total
> memory, so adjust the safety/spare memory similarly to keep always 0.5%
> of memory free.

Hi,
I've sent a similar patch for the same issue a while ago. It covers a
few more edge cases. See [1] for the discussion about it.

[1] https://patchwork.ozlabs.org/project/ltp/patch/20210127115606.28985-1-mdoucha@suse.cz/
On Thu, Jun 24, 2021 at 9:33 PM Martin Doucha <mdoucha@suse.cz> wrote:
> On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
> > [...]
>
> Hi,
> I've sent a similar patch for the same issue a while ago. It covers a
> few more edge cases. See [1] for the discussion about it.
>
> [1]
> https://patchwork.ozlabs.org/project/ltp/patch/20210127115606.28985-1-mdoucha@suse.cz/

FYI, another related analysis:
https://lists.linux.it/pipermail/ltp/2021-April/021903.html

The mmap() behavior changed in GUESS mode from commit 8c7829b04c523cd:
we can no longer receive MAP_FAILED on ENOMEM in userspace, unless the
process explicitly allocates, in one go, more memory than "total_ram +
total_swap".

This also means the MAP_FAILED check has permanently lost its effect at
line 51:
https://github.com/linux-test-project/ltp/blob/master/lib/tst_memutils.c#L51
On 24. 06. 21 16:00, Li Wang wrote:
> FYI, another related analysis:
> https://lists.linux.it/pipermail/ltp/2021-April/021903.html
>
> The mmap() behavior changed in GUESS mode from commit 8c7829b04c523cd:
> we can no longer receive MAP_FAILED on ENOMEM in userspace, unless
> the process explicitly allocates, in one go, more memory than
> "total_ram + total_swap".
>
> This also means the MAP_FAILED check has permanently lost its effect at
> line 51:
> https://github.com/linux-test-project/ltp/blob/master/lib/tst_memutils.c#L51

I'm pretty sure that a 32-bit x86 kernel with PAE and more than 3 GB of
RAM will give you MAP_FAILED, because you'll run out of available
address space before you run out of physical memory.
On 24/06/2021 15:33, Martin Doucha wrote:
> On 24. 06. 21 15:22, Krzysztof Kozlowski wrote:
>> On big memory systems, e.g. 196 GB RAM machine, the ioctl_sg01 test was
>> failing because of OoM killer during memory pollution:
>>
>> ...
>>
>> It seems leaving hard-coded 128 MB free memory works for small or medium
>> systems, but for such bigger machine it creates significant memory
>> pressure triggering the out of memory reaper.
>>
>> The memory pressure usually is defined by ratio between free and total
>> memory, so adjust the safety/spare memory similarly to keep always 0.5%
>> of memory free.
>
> Hi,
> I've sent a similar patch for the same issue a while ago. It covers a
> few more edge cases. See [1] for the discussion about it.

Thanks for the pointer. I see we partially used a similar solution -
always leave some percentage of free memory.

Different kernels might have different limits here; for example v5.11,
where this happened, has two additional restrictions:

1. vm.min_free_kbytes = 90112
   The min_free_kbytes will grow non-linearly up to 256 MB (still for
   v5.11).

2. vm.lowmem_reserve_ratio = 256 256 32 0 0
   This is a ratio 1/X for specific zones, and since this was a highmem
   allocation, it does not matter here (the machine has plenty of
   Normal zone memory).

Therefore the OoM seems to be caused by min_free_kbytes. The machine
has two nodes and the limit looks to be spread between them:

[76578.062366] Node 0 Normal free:44536kB min:44600kB ...
[76578.062373] Node 1 Normal free:44824kB min:45060kB ...

The rest of free memory is in other zones (11 MB DMA and 380 MB in
DMA32), which were not used for this allocation. Therefore, to be
accurate, the safety limit should process /proc/zoneinfo and count the
amount of free memory in the Normal zone. The 128 MB safety limit
should not be counted from total memory, but from the Normal zone.

But this is a much more complex task and a simple limit of 0.5% usually
does the trick.

P.S. For 32-bit systems the HighMem zone should also be included in
Normal.

Best regards,
Krzysztof
On 24/06/2021 17:07, Krzysztof Kozlowski wrote:
> On 24/06/2021 15:33, Martin Doucha wrote:
> [...]
>
> The rest of free memory is in other zones (11 MB DMA and 380 MB in
> DMA32), which were not used for this allocation. Therefore, to be
> accurate, the safety limit should process /proc/zoneinfo and count the
> amount of free memory in the Normal zone. The 128 MB safety limit
> should not be counted from total memory, but from the Normal zone.
>
> But this is a much more complex task and a simple limit of 0.5%
> usually does the trick.
>
> P.S. For 32-bit systems the HighMem zone should also be included in
> Normal.

Just to back this up with some numbers:

MemTotal:       198067420 kB
MemFree:        109125196 kB  => 27 281 299 pages
MemAvailable:   108425900 kB

Node 1 free pages: 2732177
Node 0 free pages: 24305662
2732177 + 24305662 = 27037839
DMA32 free pages: 240511
DMA free pages: 2949

You can see that MemFree, which is returned by sysinfo, includes the
DMA32 and DMA zones, which is not valid. Under low-memory pressure,
user-space (allocating a highmem page) cannot allocate memory from the
DMA zones, so the Normal zone counters are in reality lower and hit the
minimal level.

Best regards,
Krzysztof
On 24/06/2021 17:34, Krzysztof Kozlowski wrote:
> On 24/06/2021 17:07, Krzysztof Kozlowski wrote:
> [...]
>
> Just to back this up with some numbers:
>
> MemTotal:       198067420 kB
> MemFree:        109125196 kB  => 27 281 299 pages
> MemAvailable:   108425900 kB
>
> Node 1 free pages: 2732177
> Node 0 free pages: 24305662
> 2732177 + 24305662 = 27037839
> DMA32 free pages: 240511
> DMA free pages: 2949
>
> You can see that MemFree, which is returned by sysinfo, includes the
> DMA32 and DMA zones, which is not valid. Under low-memory pressure,
> user-space (allocating a highmem page) cannot allocate memory from the
> DMA zones, so the Normal zone counters are in reality lower and hit
> the minimal level.

Which brings us to the point that using sysinfo is not reliable in the
first place. It returns free memory, not available memory, even though
the man page says otherwise. It would be better to read /proc/meminfo,
use MemAvailable, and subtract swap from it (as MemAvailable takes the
watermarks /low limit/ into account).

Best regards,
Krzysztof
Hi!
I guess that this is another bug that should be fixed before the
release. I still think that the memory pollution is a best-effort
operation and that we should be more conservative with the reserve. I
would go for a few percent of the free memory just to be extra sure
that we do not cause memory pressure.

If we go for 2%, we will add the following:

	safety = MAX(safety, info.freeram / 50);

Also, it looks like info.freeram is the same as MemFree: from
/proc/meminfo. I guess that this is not wrong, since memory that has
been used in buffers is dirty enough for our case.
On 08. 09. 21 15:37, Cyril Hrubis wrote:
> Hi!
> I guess that this is another bug that should be fixed before the
> release. I still think that the memory pollution is a best-effort
> operation and that we should be more conservative with the reserve. I
> would go for a few percent of the free memory just to be extra sure
> that we do not cause memory pressure.
>
> If we go for 2%, we will add the following:
>
> 	safety = MAX(safety, info.freeram / 50);
>
> Also, it looks like info.freeram is the same as MemFree: from
> /proc/meminfo. I guess that this is not wrong, since memory that has
> been used in buffers is dirty enough for our case.

I'd recommend dividing by a power of 2 (either 32 or 64), but other
than that, I completely agree.
Hi!

> > If we go for 2%, we will add the following:
> >
> > 	safety = MAX(safety, info.freeram / 50);
> [...]
>
> I'd recommend dividing by a power of 2 (either 32 or 64), but other
> than that, I completely agree.

Sounds good.

Krzysztof, unless you disagree, I will push your patch but change the
division factor from 200 to 64.
On Wed, 8 Sept 2021 at 16:17, Cyril Hrubis <chrubis@suse.cz> wrote:
> [...]
>
> Sounds good.
>
> Krzysztof, unless you disagree, I will push your patch but change the
> division factor from 200 to 64.

Sounds good. In that case, please also update the % mentioned at the
end of the commit msg (0.5% -> 1.5%).

Best regards,
Krzysztof
Hi!

> > Sounds good.
> >
> > Krzysztof, unless you disagree, I will push your patch but change
> > the division factor from 200 to 64.
>
> Sounds good. In that case, please also update the % mentioned at the
> end of the commit msg (0.5% -> 1.5%).

Done and pushed, thanks.
diff --git a/lib/tst_memutils.c b/lib/tst_memutils.c
index dd09db4902b0..abf382d41b20 100644
--- a/lib/tst_memutils.c
+++ b/lib/tst_memutils.c
@@ -21,6 +21,7 @@ void tst_pollute_memory(size_t maxsize, int fillchar)
 	SAFE_SYSINFO(&info);
 	safety = MAX(4096 * SAFE_SYSCONF(_SC_PAGESIZE), 128 * 1024 * 1024);
+	safety = MAX(safety, (info.freeram / 200));
 	safety /= info.mem_unit;
 
 	if (info.freeswap > safety)
On big memory systems, e.g. a 196 GB RAM machine, the ioctl_sg01 test
was failing because of the OoM killer during memory pollution:

tst_test.c:1311: TINFO: Timeout per run is 0h 05m 00s
ioctl_sg01.c:81: TINFO: Found SCSI device /dev/sg2
tst_test.c:1357: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1359: TBROK: Test killed! (timeout?)

In dmesg:

[76477.661067] LTP: starting cve-2018-1000204 (ioctl_sg01)
[76578.062209] ioctl_sg01 invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
...
[76578.062335] Mem-Info:
[76578.062340] active_anon:63 inactive_anon:49016768 isolated_anon:0
               active_file:253 inactive_file:117 isolated_file:0
               unevictable:4871 dirty:4 writeback:0
               slab_reclaimable:18451 slab_unreclaimable:56355
               mapped:2478 shmem:310 pagetables:96625 bounce:0
               free:121136 free_pcp:0 free_cma:0
...
[76578.062527] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-1000.slice/session-40.scope,task=ioctl_sg01,pid=446171,uid=0
[76578.062539] Out of memory: Killed process 446171 (ioctl_sg01) total-vm:195955840kB, anon-rss:195941256kB, file-rss:1416kB, shmem-rss:0kB, UID:0 pgtables:383496kB oom_score_adj:0
[76581.046078] oom_reaper: reaped process 446171 (ioctl_sg01), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

It seems leaving a hard-coded 128 MB of free memory works for small or
medium systems, but for such a bigger machine it creates significant
memory pressure, triggering the out of memory reaper.

The memory pressure usually is defined by the ratio between free and
total memory, so adjust the safety/spare memory similarly to always
keep 0.5% of memory free.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
---
 lib/tst_memutils.c | 1 +
 1 file changed, 1 insertion(+)