[net-next,0/2] net/sctp: Avoid allocating high order memory with kmalloc()

Message ID 1524508866-317485-1-git-send-email-obabin@virtuozzo.com

Message

Oleg Babin April 23, 2018, 6:41 p.m. UTC
Each SCTP association can have up to 65535 input and output streams.
For each stream type an array of sctp_stream_in or sctp_stream_out
structures is allocated using the kmalloc_array() function. This
function allocates physically contiguous memory regions, so this can
lead to allocation of memory regions of very high order, i.e.:

  sizeof(struct sctp_stream_out) == 24,
  65535 * 24 == 1572840 bytes == 384 memory pages (4096 bytes per page),
  which means an order-9 allocation.

This can lead to memory allocation failures on systems
under memory stress.

We do not actually need these arrays to be physically contiguous.
A simple solution would be to use kvmalloc() instead of kmalloc(),
as kvmalloc() can allocate physically scattered pages when contiguous
pages are not available. The problem is that the allocation can
happen in softirq context with the GFP_ATOMIC flag set, and
kvmalloc() cannot be used in that scenario.

So the other possible solution is to use flexible arrays instead of
contiguous arrays of memory so that the memory is allocated on a
per-page basis.

This patchset replaces kmalloc_array() with flex_array usage.
It consists of two parts:

  * First patch is preparatory - it mechanically wraps all direct
    access to assoc->stream.out[] and assoc->stream.in[] arrays
    with SCTP_SO() and SCTP_SI() wrappers so that later a direct
    array access could be easily changed to an access to a
    flex_array (or any other possible alternative).
  * Second patch replaces kmalloc_array() with flex_array usage
    (see the sketch below).
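
A minimal sketch of what these two steps could look like (illustrative
only: SCTP_SO()/SCTP_SI() are the wrapper names from patch 1, while
fa_alloc() is a made-up helper name, and it is assumed stream->out and
stream->in become struct flex_array pointers):

#include <linux/flex_array.h>

/* Patch 1 (sketch): wrap every direct stream->out[i] / stream->in[i]
 * access so the backing storage can change without touching the call
 * sites.  With the flex_array backing from patch 2, the wrappers
 * resolve elements through flex_array_get():
 */
#define SCTP_SO(stream, i) \
	((struct sctp_stream_out *)flex_array_get((stream)->out, (i)))
#define SCTP_SI(stream, i) \
	((struct sctp_stream_in *)flex_array_get((stream)->in, (i)))

/* Patch 2 (sketch): allocate the element storage page by page instead
 * of as one physically contiguous chunk.  flex_array_alloc() only sets
 * up the metadata; flex_array_prealloc() populates the per-page parts
 * up front so that later lookups cannot fail.
 */
static struct flex_array *fa_alloc(size_t obj_size, size_t count,
				   gfp_t gfp)
{
	struct flex_array *result;

	result = flex_array_alloc(obj_size, count, gfp);
	if (result && flex_array_prealloc(result, 0, count, gfp)) {
		flex_array_free(result);
		result = NULL;
	}
	return result;
}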

Oleg Babin (2):
  net/sctp: Make wrappers for accessing in/out streams
  net/sctp: Replace in/out stream arrays with flex_array

 include/net/sctp/structs.h   |  31 +++++---
 net/sctp/chunk.c             |   6 +-
 net/sctp/outqueue.c          |  11 +--
 net/sctp/socket.c            |   4 +-
 net/sctp/stream.c            | 165 +++++++++++++++++++++++++++++--------------
 net/sctp/stream_interleave.c |   2 +-
 net/sctp/stream_sched.c      |  13 ++--
 net/sctp/stream_sched_prio.c |  22 +++---
 net/sctp/stream_sched_rr.c   |   8 +--
 9 files changed, 167 insertions(+), 95 deletions(-)

Comments

Marcelo Ricardo Leitner April 23, 2018, 9:33 p.m. UTC | #1
Hi,

On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
> Each SCTP association can have up to 65535 input and output streams.
> For each stream type an array of sctp_stream_in or sctp_stream_out
> structures is allocated using the kmalloc_array() function. This
> function allocates physically contiguous memory regions, so this can
> lead to allocation of memory regions of very high order, i.e.:
>
>   sizeof(struct sctp_stream_out) == 24,
>   65535 * 24 == 1572840 bytes == 384 memory pages (4096 bytes per page),
>   which means an order-9 allocation.
>
> This can lead to memory allocation failures on systems
> under memory stress.

Did you do performance tests while actually using these 65k streams
and with 256 (so it gets 2 pages)?

This will introduce another deref on each access to an element, but
I'm not expecting any impact due to it.
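
Concretely, the extra dereference is roughly this (a sketch, not the
actual patch):

	/* before: one contiguous array, a single indexed access */
	state = stream->out[sid].state;

	/* after: flex_array_get() first resolves the part (page) that
	 * holds element sid, then the element inside that page -- one
	 * extra pointer chase per access
	 */
	state = ((struct sctp_stream_out *)
		 flex_array_get(stream->out, sid))->state;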

  Marcelo
Oleg Babin April 26, 2018, 10:14 p.m. UTC | #2
Hi Marcelo,

On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
> Hi,
> 
> On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
>> Each SCTP association can have up to 65535 input and output streams.
>> For each stream type an array of sctp_stream_in or sctp_stream_out
>> structures is allocated using the kmalloc_array() function. This
>> function allocates physically contiguous memory regions, so this can
>> lead to allocation of memory regions of very high order, i.e.:
>>
>>   sizeof(struct sctp_stream_out) == 24,
>>   65535 * 24 == 1572840 bytes == 384 memory pages (4096 bytes per page),
>>   which means an order-9 allocation.
>>
>> This can lead to memory allocation failures on systems
>> under memory stress.
> 
> Did you do performance tests while actually using these 65k streams
> and with 256 (so it gets 2 pages)?
> 
> This will introduce another deref on each access to an element, but
> I'm not expecting any impact due to it.
> 

No, I didn't do such tests. Could you please tell me what methodology
you usually use to measure performance properly?

I'm trying to do measurements with iperf3 on an unmodified kernel and
get very strange results like this:

ovbabin@ovbabin-laptop:~$ ~/programs/iperf/bin/iperf3 -c 169.254.11.150 --sctp
Connecting to host 169.254.11.150, port 5201
[  5] local 169.254.11.150 port 46330 connected to 169.254.11.150 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  9.88 MBytes  82.8 Mbits/sec                  
[  5]   1.00-2.00   sec   226 MBytes  1.90 Gbits/sec                  
[  5]   2.00-3.00   sec   832 KBytes  6.82 Mbits/sec                  
[  5]   3.00-4.00   sec   640 KBytes  5.24 Mbits/sec                  
[  5]   4.00-5.00   sec   756 MBytes  6.34 Gbits/sec                  
[  5]   5.00-6.00   sec   522 MBytes  4.38 Gbits/sec                  
[  5]   6.00-7.00   sec   896 KBytes  7.34 Mbits/sec                  
[  5]   7.00-8.00   sec   519 MBytes  4.35 Gbits/sec                  
[  5]   8.00-9.00   sec   504 MBytes  4.23 Gbits/sec                  
[  5]   9.00-10.00  sec   475 MBytes  3.98 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  2.94 GBytes  2.53 Gbits/sec                  sender
[  5]   0.00-10.04  sec  2.94 GBytes  2.52 Gbits/sec                  receiver

iperf Done.

The values vary enormously, from hundreds of kilobits to several
gigabits per second. I get similar results with netperf. This
particular result was obtained with client and server running on the
same machine, but I also tried different machines with different
kernel versions and the situation was similar. I compiled the latest
versions of iperf and netperf from source.

Could it be that I am missing something obvious?

Thanks!
Marcelo Ricardo Leitner April 26, 2018, 10:28 p.m. UTC | #3
On Fri, Apr 27, 2018 at 01:14:56AM +0300, Oleg Babin wrote:
> Hi Marcelo,
>
> On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
> > Hi,
> >
> > On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
> >> Each SCTP association can have up to 65535 input and output streams.
> >> For each stream type an array of sctp_stream_in or sctp_stream_out
> >> structures is allocated using the kmalloc_array() function. This
> >> function allocates physically contiguous memory regions, so this can
> >> lead to allocation of memory regions of very high order, i.e.:
> >>
> >>   sizeof(struct sctp_stream_out) == 24,
> >>   65535 * 24 == 1572840 bytes == 384 memory pages (4096 bytes per page),
> >>   which means an order-9 allocation.
> >>
> >> This can lead to memory allocation failures on systems
> >> under memory stress.
> >
> > Did you do performance tests while actually using these 65k streams
> > and with 256 (so it gets 2 pages)?
> >
> > This will introduce another deref on each access to an element, but
> > I'm not expecting any impact due to it.
> >
>
> No, I didn't do such tests. Could you please tell me what methodology
> you usually use to measure performance properly?
>
> I'm trying to do measurements with iperf3 on an unmodified kernel and
> get very strange results like this:
...

I've been trying to fight this fluctuation for some time now but
haven't really fixed it yet. One thing that usually helps (quite a lot)
is increasing the socket buffer sizes and/or using smaller messages,
so there is more cushion in the buffers.

What I have seen in my tests is that when it fluctuates like this, it
is because the socket buffers swing between empty and full and never
reach a steady state. I believe this is because the socket buffer size
limits the amount of memory used by the socket rather than the amount
of payload the buffer can hold. This causes some discrepancy,
especially because in SCTP we don't defrag the buffer (as TCP does
with its collapse operation), so the announced rwnd may turn out to be
a lie in the end, which triggers rx drops, then tx cwnd reduction, and
so on. SCTP's min_rto of 1s also doesn't help much in this situation.

On netperf, you may use -S 200000,200000 -s 200000,200000. That should
help it.
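
Those options set the local and remote send/receive socket buffer
sizes. In a hand-rolled test the rough equivalent would be (a sketch;
set_bufs() is a made-up helper, and note the kernel doubles the
requested values, see socket(7)):

#include <sys/socket.h>

static int set_bufs(int sd, int bytes)
{
	/* more cushion in both buffers so they don't swing
	 * between empty and full
	 */
	if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)))
		return -1;
	return setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

i.e. something like set_bufs(sd, 200000) on both ends.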

Cheers,
Marcelo
Oleg Babin April 26, 2018, 10:45 p.m. UTC | #4
On 04/27/2018 01:28 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Apr 27, 2018 at 01:14:56AM +0300, Oleg Babin wrote:
>> Hi Marcelo,
>>
>> On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
>>> Hi,
>>>
>>> On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
>>>> Each SCTP association can have up to 65535 input and output streams.
>>>> For each stream type an array of sctp_stream_in or sctp_stream_out
>>>> structures is allocated using the kmalloc_array() function. This
>>>> function allocates physically contiguous memory regions, so this can
>>>> lead to allocation of memory regions of very high order, i.e.:
>>>>
>>>>   sizeof(struct sctp_stream_out) == 24,
>>>>   65535 * 24 == 1572840 bytes == 384 memory pages (4096 bytes per page),
>>>>   which means an order-9 allocation.
>>>>
>>>> This can lead to memory allocation failures on systems
>>>> under memory stress.
>>>
>>> Did you do performance tests while actually using these 65k streams
>>> and with 256 (so it gets 2 pages)?
>>>
>>> This will introduce another deref on each access to an element, but
>>> I'm not expecting any impact due to it.
>>>
>>
>> No, I didn't do such tests. Could you please tell me what methodology
>> you usually use to measure performance properly?
>>
>> I'm trying to do measurements with iperf3 on an unmodified kernel and
>> get very strange results like this:
> ...
> 
> I've been trying to fight this fluctuation for some time now but
> haven't really fixed it yet. One thing that usually helps (quite a lot)
> is increasing the socket buffer sizes and/or using smaller messages,
> so there is more cushion in the buffers.
>
> What I have seen in my tests is that when it fluctuates like this, it
> is because the socket buffers swing between empty and full and never
> reach a steady state. I believe this is because the socket buffer size
> limits the amount of memory used by the socket rather than the amount
> of payload the buffer can hold. This causes some discrepancy,
> especially because in SCTP we don't defrag the buffer (as TCP does
> with its collapse operation), so the announced rwnd may turn out to be
> a lie in the end, which triggers rx drops, then tx cwnd reduction, and
> so on. SCTP's min_rto of 1s also doesn't help much in this situation.
> 
> On netperf, you may use -S 200000,200000 -s 200000,200000. That should
> help it.
>

Thank you very much! I'll try this and get back with results later.
Konstantin Khorenko July 24, 2018, 3:35 p.m. UTC | #5
On 04/27/2018 01:28 AM, Marcelo Ricardo Leitner wrote:
 > On Fri, Apr 27, 2018 at 01:14:56AM +0300, Oleg Babin wrote:
 >> Hi Marcelo,
 >>
 >> On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
 >>> Hi,
 >>>
 >>> On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
 >>>> Each SCTP association can have up to 65535 input and output streams.
 >>>> For each stream type an array of sctp_stream_in or sctp_stream_out
 >>>> structures is allocated using the kmalloc_array() function. This
 >>>> function allocates physically contiguous memory regions, so this can
 >>>> lead to allocation of memory regions of very high order, i.e.:
 >>>>
 >>>>   sizeof(struct sctp_stream_out) == 24,
 >>>>   65535 * 24 == 1572840 bytes == 384 memory pages (4096 bytes per page),
 >>>>   which means an order-9 allocation.
 >>>>
 >>>> This can lead to memory allocation failures on systems
 >>>> under memory stress.
 >>>
 >>> Did you do performance tests while actually using these 65k streams
 >>> and with 256 (so it gets 2 pages)?
 >>>
 >>> This will introduce another deref on each access to an element, but
 >>> I'm not expecting any impact due to it.
 >>>
 >>
 >> No, I didn't do such tests. Could you please tell me what methodology
 >> you usually use to measure performance properly?
 >>
 >> I'm trying to do measurements with iperf3 on an unmodified kernel and
 >> get very strange results like this:
 > ...
 >
 > I've been trying to fight this fluctuation for some time now but
 > haven't really fixed it yet. One thing that usually helps (quite a lot)
 > is increasing the socket buffer sizes and/or using smaller messages,
 > so there is more cushion in the buffers.
 >
 > What I have seen in my tests is that when it fluctuates like this, it
 > is because the socket buffers swing between empty and full and never
 > reach a steady state. I believe this is because the socket buffer size
 > limits the amount of memory used by the socket rather than the amount
 > of payload the buffer can hold. This causes some discrepancy,
 > especially because in SCTP we don't defrag the buffer (as TCP does
 > with its collapse operation), so the announced rwnd may turn out to be
 > a lie in the end, which triggers rx drops, then tx cwnd reduction, and
 > so on. SCTP's min_rto of 1s also doesn't help much in this situation.
 >
 > On netperf, you may use -S 200000,200000 -s 200000,200000. That should
 > help it.

Hi Marcelo,

It would be a pity to abandon Oleg's attempt to avoid high order
allocations by using flex_array, so I tried to do the performance
measurements with the options you kindly suggested.

Here are the results:
   * Kernel: v4.18-rc6 - stock and with 2 patches from Oleg (earlier in this thread)
   * Node: CPU (8 cores): Intel(R) Xeon(R) CPU E31230 @ 3.20GHz
           RAM: 32 GB

   * netperf: taken from https://github.com/HewlettPackard/netperf.git,
	     compiled from source with SCTP support
   * netperf server and client are run on the same node

The script used to run tests:
# cat run_tests.sh
#!/bin/bash
# Run each SCTP netperf test three times, with the enlarged socket
# buffers suggested earlier in the thread.

for test in SCTP_STREAM SCTP_STREAM_MANY SCTP_RR SCTP_RR_MANY; do
   echo "TEST: $test";
   for i in `seq 1 3`; do
     echo "Iteration: $i";
     set -x
     netperf -t $test -H localhost -p 22222 -S 200000,200000 -s 200000,200000 -l 60;
     set +x
   done
done
================================================

Results (a bit reformatted to be more readable):
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

				v4.18-rc6	v4.18-rc6 + fixes
TEST: SCTP_STREAM
212992 212992 212992    60.11       4.11	4.11
212992 212992 212992    60.11       4.11	4.11
212992 212992 212992    60.11       4.11	4.11
TEST: SCTP_STREAM_MANY
212992 212992   4096    60.00    1769.26	2283.85
212992 212992   4096    60.00    2309.59	858.43
212992 212992   4096    60.00    5300.65	3351.24

===========
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

					v4.18-rc6	v4.18-rc6 + fixes
TEST: SCTP_RR
212992 212992 1        1       60.00    44832.10	45148.68
212992 212992 1        1       60.00    44835.72	44662.95
212992 212992 1        1       60.00    45199.21	45055.86
TEST: SCTP_RR_MANY
212992 212992 1        1       60.00      40.90		45.55
212992 212992 1        1       60.00      40.65		45.88
212992 212992 1        1       60.00      44.53		42.15

As we can see, single stream tests do not show any noticeable
degradation, and the spread of the SCTP_*_MANY results decreased
significantly when the -S/-s options are used, but it is still too
big to call the performance test a pass or a fail.

Can you please advise anything else to try to decrease the dispersion,
or can we just consider these values fine, in which case I'll rework
the patch according to your comment about sctp_stream_in(asoc, sid)/sctp_stream_in_ptr(stream, sid)
and that's it?

Thank you in advance!

--
Best regards,
Konstantin
Marcelo Ricardo Leitner July 24, 2018, 5:36 p.m. UTC | #6
On Tue, Jul 24, 2018 at 06:35:35PM +0300, Konstantin Khorenko wrote:
> Hi Marcelo,
> 
> It would be a pity to abandon Oleg's attempt to avoid high order
> allocations by using flex_array, so I tried to do the performance
> measurements with the options you kindly suggested.

Nice, thanks!

...
> As we can see, single stream tests do not show any noticeable
> degradation, and the spread of the SCTP_*_MANY results decreased
> significantly when the -S/-s options are used, but it is still too
> big to call the performance test a pass or a fail.
>
> Can you please advise anything else to try to decrease the dispersion,

In addition, you can try using a veth tunnel or reducing the lo MTU
down to 1500, and also use the SCTP test-specific option -m 1452
(it has to come after the --). These will alleviate issues with cwnd
handling that happen on loopback due to the big MTU, and minimize
issues with rwnd/buffer size too.

Even with -S, -s, -m and the lower MTU, it is usual to see some
fluctuation, but not that much.
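
For example, something along these lines (illustrative values, reusing
the port from your script):

	ip link set dev lo mtu 1500
	netperf -t SCTP_STREAM -H localhost -p 22222 \
		-S 200000,200000 -s 200000,200000 -l 60 -- -m 1452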

> or can we just consider these values fine, in which case I'll rework
> the patch according to your comment about sctp_stream_in(asoc, sid)/sctp_stream_in_ptr(stream, sid)
> and that's it?

Ok, thanks. It seems so, yes.

  Marcelo