mbox series

[v2,0/3] controllers/memcg: fixes for newer kernels

Message ID 20210617070730.7699-1-krzysztof.kozlowski@canonical.com
Headers show
Series controllers/memcg: fixes for newer kernels | expand

Message

Krzysztof Kozlowski June 17, 2021, 7:07 a.m. UTC
Hi,

This is a resend of previous Github pull:
https://github.com/linux-test-project/ltp/pull/830

The patches fix several test failures we are hitting since months, see:
https://bugs.launchpad.net/bugs/1829979
https://bugs.launchpad.net/bugs/1829984

Best regards,
Krzysztof


Krzysztof Kozlowski (3):
  controllers/memcg: accept range of max_usage_in_bytes
  controllers/memcg: accept range of usage_in_bytes
  controllers/memcg: accept non-zero max_usage_in_bytes after reset

 .../controllers/memcg/functional/memcg_lib.sh | 22 ++++++++++++++-----
 .../memcg_max_usage_in_bytes_test.sh          | 12 ++++++++--
 .../memcg/functional/memcg_stat_rss.sh        | 20 ++++++++---------
 .../memcg/functional/memcg_stat_test.sh       |  8 +++----
 .../functional/memcg_usage_in_bytes_test.sh   | 10 +++++++--
 5 files changed, 49 insertions(+), 23 deletions(-)

Comments

Richard Palethorpe June 17, 2021, 7:32 a.m. UTC | #1
Hello,

Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> writes:

> Hi,
>
> This is a resend of previous Github pull:
> https://github.com/linux-test-project/ltp/pull/830

LGTM (Reviewed on Github).

I doubt tests that expect a precise allocation figure are valid.

>
> The patches fix several test failures we are hitting since months, see:
> https://bugs.launchpad.net/bugs/1829979
> https://bugs.launchpad.net/bugs/1829984
>
> Best regards,
> Krzysztof
>
>
> Krzysztof Kozlowski (3):
>   controllers/memcg: accept range of max_usage_in_bytes
>   controllers/memcg: accept range of usage_in_bytes
>   controllers/memcg: accept non-zero max_usage_in_bytes after reset
>
>  .../controllers/memcg/functional/memcg_lib.sh | 22 ++++++++++++++-----
>  .../memcg_max_usage_in_bytes_test.sh          | 12 ++++++++--
>  .../memcg/functional/memcg_stat_rss.sh        | 20 ++++++++---------
>  .../memcg/functional/memcg_stat_test.sh       |  8 +++----
>  .../functional/memcg_usage_in_bytes_test.sh   | 10 +++++++--
>  5 files changed, 49 insertions(+), 23 deletions(-)
>
> -- 
> 2.27.0
Krzysztof Kozlowski June 24, 2021, 7:31 p.m. UTC | #2
On 17/06/2021 09:07, Krzysztof Kozlowski wrote:
> Hi,
> 
> This is a resend of previous Github pull:
> https://github.com/linux-test-project/ltp/pull/830
> 
> The patches fix several test failures we are hitting since months, see:
> https://bugs.launchpad.net/bugs/1829979
> https://bugs.launchpad.net/bugs/1829984
> 
> Best regards,
> Krzysztof
> 
> 
> Krzysztof Kozlowski (3):
>   controllers/memcg: accept range of max_usage_in_bytes
>   controllers/memcg: accept range of usage_in_bytes
>   controllers/memcg: accept non-zero max_usage_in_bytes after reset


Hi everyone,

The patchset got positive LGTM on the Github. Any further comments for
it or can it be applied?

Best regards,
Krzysztof
Li Wang June 25, 2021, 8:21 a.m. UTC | #3
On Fri, Jun 25, 2021 at 3:31 AM Krzysztof Kozlowski <
krzysztof.kozlowski@canonical.com> wrote:

> On 17/06/2021 09:07, Krzysztof Kozlowski wrote:
> > Hi,
> >
> > This is a resend of previous Github pull:
> > https://github.com/linux-test-project/ltp/pull/830
> >
> > The patches fix several test failures we are hitting since months, see:
> > https://bugs.launchpad.net/bugs/1829979
> > https://bugs.launchpad.net/bugs/1829984
> >
> > Best regards,
> > Krzysztof
> >
> >
> > Krzysztof Kozlowski (3):
> >   controllers/memcg: accept range of max_usage_in_bytes
> >   controllers/memcg: accept range of usage_in_bytes
> >   controllers/memcg: accept non-zero max_usage_in_bytes after reset
>
>
> Hi everyone,
>
> The patchset got positive LGTM on the Github. Any further comments for
> it or can it be applied?
>

I slightly agree with Richard that we need an explanation/investigation
on where the 32*PAGE_SIZE comes from. Otherwise, we are very possible
to mask a counter bug if only to make the test happy.

Forgive me pour cold water on the method though it looks good in coding:).
Krzysztof Kozlowski June 25, 2021, 9:13 a.m. UTC | #4
On 25/06/2021 10:21, Li Wang wrote:
> 
> 
> On Fri, Jun 25, 2021 at 3:31 AM Krzysztof Kozlowski
> <krzysztof.kozlowski@canonical.com
> <mailto:krzysztof.kozlowski@canonical.com>> wrote:
> 
>     Hi everyone,
> 
>     The patchset got positive LGTM on the Github. Any further comments for
>     it or can it be applied?
> 
> 
> I slightly agree with Richard that we need an explanation/investigation
> on where the 32*PAGE_SIZE comes from. Otherwise, we are very possible
> to mask a counter bug if only to make the test happy.

I don't know where 32*PAGE_SIZE comes from and investigation would
require effort/time which I don't have.

I don't think we mask current bug as this appears in multiple kernels -
I tested from v5.4 up to 5.13.0-rc5-next-20210608.

It is possible this will mask future bugs but that's life of a project
depending on kernel internals, not on API or ABI. The kernel is allowed
to change such details any moment because it is neither part of API nor
ABI. Therefore you just have to live with inaccurate limits or keep
investigating every x-months.

For now the test is simply unreliable.

Best regards,
Krzysztof
Joerg Vehlow June 25, 2021, 9:24 a.m. UTC | #5
Hi,

On 6/25/2021 10:21 AM, Li Wang wrote:
> On Fri, Jun 25, 2021 at 3:31 AM Krzysztof Kozlowski 
> <krzysztof.kozlowski@canonical.com 
> <mailto:krzysztof.kozlowski@canonical.com>> wrote:
>
>     On 17/06/2021 09:07, Krzysztof Kozlowski wrote:
>
>
>     The patchset got positive LGTM on the Github. Any further comments for
>     it or can it be applied?
>
>
> I slightly agree with Richard that we need an explanation/investigation
> on where the 32*PAGE_SIZE comes from. Otherwise, we are very possible
> to mask a counter bug if only to make the test happy.
+1. I think it has something to do with batch processing of counter updates.
The memory handling code is heavily batched per cpu and only committed 
to a global counter after the batch size overflows.
At least for the subgroup_charge test I once found that page table cache 
stored in kmem contributes to the memory counter used in the test.
Maybe something in that region was changed in recent kernel version. 
Maybe you could ask on the kernel mailing list?


>
> Forgive me pour cold water on the method though it looks good in coding:).
I would just suggest to make 32 * PAGESIZE a constant calculated in 
memcg_lib.sh, instead of calculating (and defining) it everywhere.
>
> -- 
> Regards,
> Li Wang
>
Jörg
Krzysztof Kozlowski June 25, 2021, 11:55 a.m. UTC | #6
On 25/06/2021 11:24, Joerg Vehlow wrote:
>> Forgive me pour cold water on the method though it looks good in coding:).
> I would just suggest to make 32 * PAGESIZE a constant calculated in 
> memcg_lib.sh, instead of calculating (and defining) it everywhere.

I understand that you want to define constant in memcg_lib for a
accepted higher bound of some specific memory limit checks? Won't that
be extra confusing? Trying to generalize some constant without actually
knowing what is it about and to which limits it applies?

Now it's this unknown part is quite local and documented in three
places. I can move it to bigger scope and pretend it's generic, if you want.

Best regards,
Krzysztof
Krzysztof Kozlowski July 2, 2021, 10:49 a.m. UTC | #7
On 25/06/2021 10:21, Li Wang wrote:
> 
> 
> On Fri, Jun 25, 2021 at 3:31 AM Krzysztof Kozlowski
> <krzysztof.kozlowski@canonical.com
> <mailto:krzysztof.kozlowski@canonical.com>> wrote:
> 
>     On 17/06/2021 09:07, Krzysztof Kozlowski wrote:
>     > Hi,
>     >
>     > This is a resend of previous Github pull:
>     > https://github.com/linux-test-project/ltp/pull/830
>     <https://github.com/linux-test-project/ltp/pull/830>
>     >
>     > The patches fix several test failures we are hitting since months,
>     see:
>     > https://bugs.launchpad.net/bugs/1829979
>     <https://bugs.launchpad.net/bugs/1829979>
>     > https://bugs.launchpad.net/bugs/1829984
>     <https://bugs.launchpad.net/bugs/1829984>
>     >
>     > Best regards,
>     > Krzysztof
>     >
>     >
>     > Krzysztof Kozlowski (3):
>     >   controllers/memcg: accept range of max_usage_in_bytes
>     >   controllers/memcg: accept range of usage_in_bytes
>     >   controllers/memcg: accept non-zero max_usage_in_bytes after reset
> 
> 
>     Hi everyone,
> 
>     The patchset got positive LGTM on the Github. Any further comments for
>     it or can it be applied?
> 
> 
> I slightly agree with Richard that we need an explanation/investigation
> on where the 32*PAGE_SIZE comes from. Otherwise, we are very possible
> to mask a counter bug if only to make the test happy.

On newer v5.11 the max_usage_in_bytes go even above 32 pages up to 34.

I got some explanation from Michal Hocko [1] from which one can try to
conclude:
1. There is significant caching in memory accounting. Not only charging
of cgroup is once per MEMCG_CHARGE_BATCH batch (try_charge()), but also
statistics are gathered from per-cpu if threshold exceed
MEMCG_CHARGE_BATCH (__mod_memcg_state()).

2. Depending on machine (different amount of CPUs), the memory group
charging be less or more accurate.

3. The accuracy of group memory accounting is not considered as
important thus test relying on it, will be failing from time to time.


I'll send a v3 with significantly increased limits and some explanation.

[1]
https://lore.kernel.org/linux-mm/85b8a4f9-b9e9-a6ca-5d0c-c1ecb8c11ef3@canonical.com/T/#m6459b3be3a86f5eaf2cfc48dd586b6faf949e440


Best regards,
Krzysztof