diff mbox series

[iproute2] lib/libnetlink: ensure a minimum of 32KB for the buffer used in rtnl_recvmsg()

Message ID 20190213015841.140383-1-edumazet@google.com
State Accepted
Delegated to: stephen hemminger
Headers show
Series [iproute2] lib/libnetlink: ensure a minimum of 32KB for the buffer used in rtnl_recvmsg() | expand

Commit Message

Eric Dumazet Feb. 13, 2019, 1:58 a.m. UTC
In the past, we tried to increase the buffer size up to 32 KB in order
to reduce number of syscalls per dump.

Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
brought the size back to 4KB because the kernel can not know the application
is ready to receive bigger requests.

See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
for more details.

Fixes: 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Cc: Phil Sutter <phil@nwl.cc>
---
 lib/libnetlink.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

David Ahern Feb. 13, 2019, 2:04 a.m. UTC | #1
On 2/12/19 6:58 PM, Eric Dumazet wrote:
> In the past, we tried to increase the buffer size up to 32 KB in order
> to reduce number of syscalls per dump.
> 
> Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> brought the size back to 4KB because the kernel can not know the application
> is ready to receive bigger requests.
> 
> See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
> d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
> for more details.
> 
> Fixes: 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Hangbin Liu <liuhangbin@gmail.com>
> Cc: Phil Sutter <phil@nwl.cc>
> ---
>  lib/libnetlink.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index 1892a02ab5d0d73776c9882ffc77edcd2c663d01..0d48a3d43cf03065dacbd419578ab10af56431a4 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -718,6 +718,8 @@ static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
>  	if (len < 0)
>  		return len;
>  
> +	if (len < 32768)
> +		len = 32768;
>  	buf = malloc(len);
>  	if (!buf) {
>  		fprintf(stderr, "malloc error: not enough buffer\n");
> 

I believe that negates the whole point of 2d34851cd341 - which I have no
problem with. 2 recvmsg calls per message is overkill.

Do we know of any single message sizes > 32k? 2d34851cd341 cites
increasing VF's but at some point there is a limit. If not, the whole
PEEK thing should go away and we just malloc 32k (or 64k) buffers for
each recvmsg.
Eric Dumazet Feb. 13, 2019, 2:08 a.m. UTC | #2
On 02/12/2019 06:04 PM, David Ahern wrote:
> On 2/12/19 6:58 PM, Eric Dumazet wrote:
>> In the past, we tried to increase the buffer size up to 32 KB in order
>> to reduce number of syscalls per dump.
>>
>> Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
>> brought the size back to 4KB because the kernel can not know the application
>> is ready to receive bigger requests.
>>
>> See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
>> d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
>> for more details.
>>
>> Fixes: 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Cc: Hangbin Liu <liuhangbin@gmail.com>
>> Cc: Phil Sutter <phil@nwl.cc>
>> ---
>>  lib/libnetlink.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
>> index 1892a02ab5d0d73776c9882ffc77edcd2c663d01..0d48a3d43cf03065dacbd419578ab10af56431a4 100644
>> --- a/lib/libnetlink.c
>> +++ b/lib/libnetlink.c
>> @@ -718,6 +718,8 @@ static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
>>  	if (len < 0)
>>  		return len;
>>  
>> +	if (len < 32768)
>> +		len = 32768;
>>  	buf = malloc(len);
>>  	if (!buf) {
>>  		fprintf(stderr, "malloc error: not enough buffer\n");
>>
> 
> I believe that negates the whole point of 2d34851cd341 - which I have no
> problem with. 2 recvmsg calls per message is overkill.
> 

It does not negates the point at all.

The main point was to eventually be able to allocate more than 32KB.

We need to have a minimum size of 32KB so that the kernel can cook reasonably sized skbs

Because trying to allocate 4KB only in 2019 is kind of stupid...

( Especially considering ss currently buffers the whole thing before calling render() !!! )

> Do we know of any single message sizes > 32k? 2d34851cd341 cites
> increasing VF's but at some point there is a limit. If not, the whole
> PEEK thing should go away and we just malloc 32k (or 64k) buffers for
> each recvmsg.
>
Hangbin Liu Feb. 13, 2019, 6:20 a.m. UTC | #3
Hi David,
On Tue, Feb 12, 2019 at 06:08:13PM -0800, Eric Dumazet wrote:
> >> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> >> index 1892a02ab5d0d73776c9882ffc77edcd2c663d01..0d48a3d43cf03065dacbd419578ab10af56431a4 100644
> >> --- a/lib/libnetlink.c
> >> +++ b/lib/libnetlink.c
> >> @@ -718,6 +718,8 @@ static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
> >>  	if (len < 0)
> >>  		return len;
> >>  
> >> +	if (len < 32768)
> >> +		len = 32768;
> >>  	buf = malloc(len);
> >>  	if (!buf) {
> >>  		fprintf(stderr, "malloc error: not enough buffer\n");
> >>
> > 
> > I believe that negates the whole point of 2d34851cd341 - which I have no
> > problem with. 2 recvmsg calls per message is overkill.

It should not affects ip cmd too much. But for ss, as Eric pointed, it
cause performance issue.

> > 
> 
> It does not negates the point at all.
> 
> The main point was to eventually be able to allocate more than 32KB.
> 
> We need to have a minimum size of 32KB so that the kernel can cook reasonably sized skbs
> 
> Because trying to allocate 4KB only in 2019 is kind of stupid...
> 
> ( Especially considering ss currently buffers the whole thing before calling render() !!! )

This makes sense to me.
 
> > Do we know of any single message sizes > 32k? 2d34851cd341 cites
> > increasing VF's but at some point there is a limit. If not, the whole
> > PEEK thing should go away and we just malloc 32k (or 64k) buffers for
> > each recvmsg.

Apart from the 200 VFs example, I think there will be more and more virtual
interfaces be used in cloud environment, like openstack/OVS, so useing 32K
or 64K is still not safe.

What do you think.

Thanks
Hangbin
Phil Sutter Feb. 13, 2019, 5:46 p.m. UTC | #4
Hi Eric,

On Tue, Feb 12, 2019 at 05:58:41PM -0800, Eric Dumazet wrote:
> In the past, we tried to increase the buffer size up to 32 KB in order
> to reduce number of syscalls per dump.
> 
> Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> brought the size back to 4KB because the kernel can not know the application
> is ready to receive bigger requests.
> 
> See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
> d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
> for more details.

Wouldn't it be better if the kernel recognized MSG_TRUNC and allocated a
buffer large enough to hold the full message in that case? I have no
idea how hard that would be to implement, but calling recvmsg() with
MSG_TRUNC set and not getting the full message length in return is not
quite what one expects after reading recvmsg(2).

Cheers, Phil
Eric Dumazet Feb. 13, 2019, 5:58 p.m. UTC | #5
On 02/13/2019 09:46 AM, Phil Sutter wrote:
> Hi Eric,
> 
> On Tue, Feb 12, 2019 at 05:58:41PM -0800, Eric Dumazet wrote:
>> In the past, we tried to increase the buffer size up to 32 KB in order
>> to reduce number of syscalls per dump.
>>
>> Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
>> brought the size back to 4KB because the kernel can not know the application
>> is ready to receive bigger requests.
>>
>> See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
>> d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
>> for more details.
> 
> Wouldn't it be better if the kernel recognized MSG_TRUNC and allocated a
> buffer large enough to hold the full message in that case? I have no
> idea how hard that would be to implement, but calling recvmsg() with
> MSG_TRUNC set and not getting the full message length in return is not
> quite what one expects after reading recvmsg(2).
> 

I am not sure what you are suggesting.

The buffer is already in the kernel, this is the skb in the receive queue.
Its size is unknown... We only can guess.

We can not change old kernels, and new iproute2/ss commands need to work with old kernels.

We can not change MSG_TRUNC semantic.

Adding another MSG_be_gentle_do_not_consume_skb_if_user_size_is_too_small wont solve the issue for old kernels.
Stephen Hemminger Feb. 13, 2019, 9:57 p.m. UTC | #6
On Tue, 12 Feb 2019 17:58:41 -0800
Eric Dumazet <edumazet@google.com> wrote:

> In the past, we tried to increase the buffer size up to 32 KB in order
> to reduce number of syscalls per dump.
> 
> Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> brought the size back to 4KB because the kernel can not know the application
> is ready to receive bigger requests.
> 
> See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
> d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
> for more details.
> 
> Fixes: 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Hangbin Liu <liuhangbin@gmail.com>
> Cc: Phil Sutter <phil@nwl.cc>

Applied, although maybe we should bump it to 64K or bigger?
Eric Dumazet Feb. 13, 2019, 9:59 p.m. UTC | #7
On Wed, Feb 13, 2019 at 1:57 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Tue, 12 Feb 2019 17:58:41 -0800
> Eric Dumazet <edumazet@google.com> wrote:
>
> > In the past, we tried to increase the buffer size up to 32 KB in order
> > to reduce number of syscalls per dump.
> >
> > Commit 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> > brought the size back to 4KB because the kernel can not know the application
> > is ready to receive bigger requests.
> >
> > See kernel commits 9063e21fb026 ("netlink: autosize skb lengthes") and
> > d35c99ff77ec ("netlink: do not enter direct reclaim from netlink_dump()")
> > for more details.
> >
> > Fixes: 2d34851cd341 ("lib/libnetlink: re malloc buff if size is not enough")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Hangbin Liu <liuhangbin@gmail.com>
> > Cc: Phil Sutter <phil@nwl.cc>
>
> Applied, although maybe we should bump it to 64K or bigger?

Note the kernel does not yet try 64KB allocations, so I do not see an
urgent need for that :)
Michal Kubecek Feb. 14, 2019, 1:49 p.m. UTC | #8
On Tue, Feb 12, 2019 at 07:04:17PM -0700, David Ahern wrote:
> 
> Do we know of any single message sizes > 32k? 2d34851cd341 cites
> increasing VF's but at some point there is a limit. If not, the whole
> PEEK thing should go away and we just malloc 32k (or 64k) buffers for
> each recvmsg.

IFLA_VF_LIST is by far the biggest thing I have seen so far. I don't
remember exact numbers but the issue with 32KB buffer (for the whole
RTM_NELINK message) was encountered by some of our customers with NICs
having 120 or 128 VFs.

There is a bigger issue with IFLA_VFINFO_LIST, though, as it's an
attribute so that netlink limits its size to 64 KB. IIRC with current
size of IFLA_VF_INFO this would be reached with 270-280 VFs (I'm sure
the number was higer than 256 but not too much higher.)

This would mean unless we let something else grow too much, the whole
message shouldn't get much bigger than 64 KB. And if we can find some
other solution (e.g. passing VF information in separate messages if
client declares support), even 32 KB would be more than enough.

Michal Kubecek
Michal Kubecek Feb. 14, 2019, 1:51 p.m. UTC | #9
On Wed, Feb 13, 2019 at 02:20:12PM +0800, Hangbin Liu wrote:
>  
> > > Do we know of any single message sizes > 32k? 2d34851cd341 cites
> > > increasing VF's but at some point there is a limit. If not, the whole
> > > PEEK thing should go away and we just malloc 32k (or 64k) buffers for
> > > each recvmsg.
> 
> Apart from the 200 VFs example, I think there will be more and more virtual
> interfaces be used in cloud environment, like openstack/OVS, so useing 32K
> or 64K is still not safe.

Many intefraces are not a problem. Those will have separate messages so
that they don't need to fit into one buffer.

Michal Kubecek
David Ahern Feb. 14, 2019, 5:34 p.m. UTC | #10
On 2/14/19 6:49 AM, Michal Kubecek wrote:
> On Tue, Feb 12, 2019 at 07:04:17PM -0700, David Ahern wrote:
>>
>> Do we know of any single message sizes > 32k? 2d34851cd341 cites
>> increasing VF's but at some point there is a limit. If not, the whole
>> PEEK thing should go away and we just malloc 32k (or 64k) buffers for
>> each recvmsg.
> 
> IFLA_VF_LIST is by far the biggest thing I have seen so far. I don't
> remember exact numbers but the issue with 32KB buffer (for the whole
> RTM_NELINK message) was encountered by some of our customers with NICs
> having 120 or 128 VFs.
> 
> There is a bigger issue with IFLA_VFINFO_LIST, though, as it's an
> attribute so that netlink limits its size to 64 KB. IIRC with current
> size of IFLA_VF_INFO this would be reached with 270-280 VFs (I'm sure
> the number was higer than 256 but not too much higher.)
> 
> This would mean unless we let something else grow too much, the whole
> message shouldn't get much bigger than 64 KB. And if we can find some
> other solution (e.g. passing VF information in separate messages if
> client declares support), even 32 KB would be more than enough.

That's what I was asking, thanks. So 32kB today is sufficient, 64kB has
future buffer. So this whole PEEK and allocate the message size is
overkill. It could just as easily been bumped from 32kB to 64kB in the
original patch and been good for a while.
Phil Sutter Feb. 14, 2019, 5:47 p.m. UTC | #11
On Thu, Feb 14, 2019 at 10:34:06AM -0700, David Ahern wrote:
> On 2/14/19 6:49 AM, Michal Kubecek wrote:
> > On Tue, Feb 12, 2019 at 07:04:17PM -0700, David Ahern wrote:
> >>
> >> Do we know of any single message sizes > 32k? 2d34851cd341 cites
> >> increasing VF's but at some point there is a limit. If not, the whole
> >> PEEK thing should go away and we just malloc 32k (or 64k) buffers for
> >> each recvmsg.
> > 
> > IFLA_VF_LIST is by far the biggest thing I have seen so far. I don't
> > remember exact numbers but the issue with 32KB buffer (for the whole
> > RTM_NELINK message) was encountered by some of our customers with NICs
> > having 120 or 128 VFs.
> > 
> > There is a bigger issue with IFLA_VFINFO_LIST, though, as it's an
> > attribute so that netlink limits its size to 64 KB. IIRC with current
> > size of IFLA_VF_INFO this would be reached with 270-280 VFs (I'm sure
> > the number was higer than 256 but not too much higher.)

Using netdevsim, 'ip link show' becomes unusable after enabling more
than 244 VFs. I guess it depends on how much info per VF is available.

> > This would mean unless we let something else grow too much, the whole
> > message shouldn't get much bigger than 64 KB. And if we can find some
> > other solution (e.g. passing VF information in separate messages if
> > client declares support), even 32 KB would be more than enough.
> 
> That's what I was asking, thanks. So 32kB today is sufficient, 64kB has
> future buffer. So this whole PEEK and allocate the message size is
> overkill. It could just as easily been bumped from 32kB to 64kB in the
> original patch and been good for a while.

Yes, I think the real problem is how VF-related messages are structured
currently.

Cheers, Phil
diff mbox series

Patch

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 1892a02ab5d0d73776c9882ffc77edcd2c663d01..0d48a3d43cf03065dacbd419578ab10af56431a4 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -718,6 +718,8 @@  static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
 	if (len < 0)
 		return len;
 
+	if (len < 32768)
+		len = 32768;
 	buf = malloc(len);
 	if (!buf) {
 		fprintf(stderr, "malloc error: not enough buffer\n");