diff mbox series

[ovs-dev] vswitchd: Pass MCL_ONFAULT to mlockall

Message ID 20200710115313.18319-1-ross.lagerwall@citrix.com
State Changes Requested
Headers show
Series [ovs-dev] vswitchd: Pass MCL_ONFAULT to mlockall | expand

Commit Message

Ross Lagerwall July 10, 2020, 11:53 a.m. UTC
mlockall locks thread stack pages into memory, even pages which have not
yet been demand-paged.  As vswitchd has a lot of threads and the default
stack size on x86_64 is 8 MiB, this consumes a lot of memory.  On two
systems I looked at, vswitchd used ~150 MiB of RSS when idle after
startup.

Use the new MCL_ONFAULT flag to only lock pages into memory once they
have been demand-paged in. This still satisfies the requirement that
vswitchd is not swapped out but frees up ~144 MiB of unswappable memory
(18 threads x 8 MiB).  After this, vswitchd uses ~6 MiB when idle after
startup.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 vswitchd/ovs-vswitchd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Gregory Rose July 10, 2020, 3:08 p.m. UTC | #1
On 7/10/2020 4:53 AM, Ross Lagerwall wrote:
> mlockall locks thread stack pages into memory, even pages which have not
> yet been demand-paged.  As vswitchd has a lot of threads and the default
> stack size on x86_64 is 8 MiB, this consumes a lot of memory.  On two
> systems I looked at, vswitchd used ~150 MiB of RSS when idle after
> startup.
> 
> Use the new MCL_ONFAULT flag to only lock pages into memory once they
> have been demand-paged in. This still satisfies the requirement that
> vswitchd is not swapped out but frees up ~144 MiB of unswappable memory
> (18 threads x 8 MiB).  After this, vswitchd uses ~6 MiB when idle after
> startup.
> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>

Have you done any performance measurements?  This seems like it may
have some impact.

Thanks,

- Greg

> ---
>   vswitchd/ovs-vswitchd.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
> index 1e72b628b..e4d482521 100644
> --- a/vswitchd/ovs-vswitchd.c
> +++ b/vswitchd/ovs-vswitchd.c
> @@ -92,7 +92,10 @@ main(int argc, char *argv[])
>   
>       if (want_mlockall) {
>   #ifdef HAVE_MLOCKALL
> -        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
> +#ifndef MCL_ONFAULT
> +#define MCL_ONFAULT 0
> +#endif
> +        if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) {
>               VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));
>           } else {
>               set_memory_locked();
>
Flavio Leitner July 10, 2020, 3:12 p.m. UTC | #2
On Fri, Jul 10, 2020 at 12:53:13PM +0100, Ross Lagerwall wrote:
> mlockall locks thread stack pages into memory, even pages which have not
> yet been demand-paged.  As vswitchd has a lot of threads and the default
> stack size on x86_64 is 8 MiB, this consumes a lot of memory.  On two
> systems I looked at, vswitchd used ~150 MiB of RSS when idle after
> startup.
> 
> Use the new MCL_ONFAULT flag to only lock pages into memory once they
> have been demand-paged in. This still satisfies the requirement that
> vswitchd is not swapped out but frees up ~144 MiB of unswappable memory
> (18 threads x 8 MiB).  After this, vswitchd uses ~6 MiB when idle after
> startup.

The problem with this approach is that when using userspace datapath
those page faults can introduce scheduling points which introduces
jitter and impacts performance.

Alternatively you can try to reduce the default stack size as done
by this commit:

    commit b82a90e266e1246fe2973db97c95df22558174ea
    Author: Flavio Leitner <fbl@sysclose.org>
    Date:   Thu Feb 28 13:13:57 2019 -0300

        rhel: limit stack size to 2M.


Thanks,
fbl

> 
> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> ---
>  vswitchd/ovs-vswitchd.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
> index 1e72b628b..e4d482521 100644
> --- a/vswitchd/ovs-vswitchd.c
> +++ b/vswitchd/ovs-vswitchd.c
> @@ -92,7 +92,10 @@ main(int argc, char *argv[])
>  
>      if (want_mlockall) {
>  #ifdef HAVE_MLOCKALL
> -        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
> +#ifndef MCL_ONFAULT
> +#define MCL_ONFAULT 0
> +#endif
> +        if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) {
>              VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));
>          } else {
>              set_memory_locked();
> -- 
> 2.21.1
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
William Tu July 13, 2020, 9:49 p.m. UTC | #3
On Fri, Jul 10, 2020 at 12:12:57PM -0300, Flavio Leitner wrote:
> On Fri, Jul 10, 2020 at 12:53:13PM +0100, Ross Lagerwall wrote:
> > mlockall locks thread stack pages into memory, even pages which have not
> > yet been demand-paged.  As vswitchd has a lot of threads and the default
> > stack size on x86_64 is 8 MiB, this consumes a lot of memory.  On two
> > systems I looked at, vswitchd used ~150 MiB of RSS when idle after
> > startup.
> > 
> > Use the new MCL_ONFAULT flag to only lock pages into memory once they
> > have been demand-paged in. This still satisfies the requirement that
> > vswitchd is not swapped out but frees up ~144 MiB of unswappable memory
> > (18 threads x 8 MiB).  After this, vswitchd uses ~6 MiB when idle after
> > startup.
> 
> The problem with this approach is that when using userspace datapath
> those page faults can introduce scheduling points which introduces
> jitter and impacts performance.
> 
I like this idea of using MCL_ONFAULT.
IIUC, the performance impact only happens first time when page fault
happens, and afterward, it should have the same performance.

William
Flavio Leitner July 14, 2020, 1:31 p.m. UTC | #4
On Mon, Jul 13, 2020 at 02:49:50PM -0700, William Tu wrote:
> On Fri, Jul 10, 2020 at 12:12:57PM -0300, Flavio Leitner wrote:
> > On Fri, Jul 10, 2020 at 12:53:13PM +0100, Ross Lagerwall wrote:
> > > mlockall locks thread stack pages into memory, even pages which have not
> > > yet been demand-paged.  As vswitchd has a lot of threads and the default
> > > stack size on x86_64 is 8 MiB, this consumes a lot of memory.  On two
> > > systems I looked at, vswitchd used ~150 MiB of RSS when idle after
> > > startup.
> > > 
> > > Use the new MCL_ONFAULT flag to only lock pages into memory once they
> > > have been demand-paged in. This still satisfies the requirement that
> > > vswitchd is not swapped out but frees up ~144 MiB of unswappable memory
> > > (18 threads x 8 MiB).  After this, vswitchd uses ~6 MiB when idle after
> > > startup.
> > 
> > The problem with this approach is that when using userspace datapath
> > those page faults can introduce scheduling points which introduces
> > jitter and impacts performance.
> > 
> I like this idea of using MCL_ONFAULT.
> IIUC, the performance impact only happens first time when page fault
> happens, and afterward, it should have the same performance.

Yes, there is a performance impact only when the page fault happens.

The issue is that we can't tell when it is going to happen. Some
page faults will happen at the initialization, which is fine, but
some might happen when there is production traffic.

The thread is preempted and all the ongoing traffic is impacted
when that page fault happens. Therefore, this has a potential
of affecting "zero" packet loss, for instance, which is important
to Telco deployments.

It seems we have two relevant use cases here with conflicting
requirements. I don't think we can just change current behavior
without announcing first and still the current behavior is desired
for those who don't want surprises with page faults.

Perhaps we can leave --mlockall as is to avoid possible regressions,
and add another way to express the on demand use case?

Thanks,
William Tu July 14, 2020, 1:36 p.m. UTC | #5
On Tue, Jul 14, 2020 at 6:32 AM Flavio Leitner <fbl@sysclose.org> wrote:
>
> On Mon, Jul 13, 2020 at 02:49:50PM -0700, William Tu wrote:
> > On Fri, Jul 10, 2020 at 12:12:57PM -0300, Flavio Leitner wrote:
> > > On Fri, Jul 10, 2020 at 12:53:13PM +0100, Ross Lagerwall wrote:
> > > > mlockall locks thread stack pages into memory, even pages which have not
> > > > yet been demand-paged.  As vswitchd has a lot of threads and the default
> > > > stack size on x86_64 is 8 MiB, this consumes a lot of memory.  On two
> > > > systems I looked at, vswitchd used ~150 MiB of RSS when idle after
> > > > startup.
> > > >
> > > > Use the new MCL_ONFAULT flag to only lock pages into memory once they
> > > > have been demand-paged in. This still satisfies the requirement that
> > > > vswitchd is not swapped out but frees up ~144 MiB of unswappable memory
> > > > (18 threads x 8 MiB).  After this, vswitchd uses ~6 MiB when idle after
> > > > startup.
> > >
> > > The problem with this approach is that when using userspace datapath
> > > those page faults can introduce scheduling points which introduces
> > > jitter and impacts performance.
> > >
> > I like this idea of using MCL_ONFAULT.
> > IIUC, the performance impact only happens first time when page fault
> > happens, and afterward, it should have the same performance.
>
> Yes, there is a performance impact only when the page fault happens.
>
> The issue is that we can't tell when it is going to happen. Some
> page faults will happen at the initialization, which is fine, but
> some might happen when there is production traffic.
>
> The thread is preempted and all the ongoing traffic is impacted
> when that page fault happens. Therefore, this has a potential
> of affecting "zero" packet loss, for instance, which is important
> to Telco deployments.

I see, that makes sense, thanks for the detail.
>
> It seems we have two relevant use cases here with conflicting
> requirements. I don't think we can just change current behavior
> without announcing first and still the current behavior is desired
> for those who don't want surprises with page faults.
>
> Perhaps we can leave --mlockall as is to avoid possible regressions,
> and add another way to express the on demand use case?
>
Agree
Thanks,
William
diff mbox series

Patch

diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index 1e72b628b..e4d482521 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -92,7 +92,10 @@  main(int argc, char *argv[])
 
     if (want_mlockall) {
 #ifdef HAVE_MLOCKALL
-        if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
+#ifndef MCL_ONFAULT
+#define MCL_ONFAULT 0
+#endif
+        if (mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) {
             VLOG_ERR("mlockall failed: %s", ovs_strerror(errno));
         } else {
             set_memory_locked();