diff mbox

[RFC] netlink: Increase netlink dump skb message size

Message ID 1303834864.3358.58.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet April 26, 2011, 4:21 p.m. UTC
Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :

> I'm fine with however you folks want to approach this, just give me some direction.

I would just try following patch :




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Rose, Gregory V April 26, 2011, 4:24 p.m. UTC | #1
> -----Original Message-----

> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]

> Sent: Tuesday, April 26, 2011 9:21 AM

> To: Rose, Gregory V

> Cc: David Miller; netdev@vger.kernel.org; bhutchings@solarflare.com

> Subject: RE: [RFC PATCH] netlink: Increase netlink dump skb message size

> 

> Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :

> 

> > I'm fine with however you folks want to approach this, just give me some

> direction.

> 

> I would just try following patch :


I'll test it out, it's certainly a lot simpler.  Perhaps I was getting a bit too fancy.

Ben would want to make sure it works for 127 VFs, my device does 63.

- Greg

> 

> 

> diff --git a/include/linux/netlink.h b/include/linux/netlink.h

> index 4c4ac3f..22cac81 100644

> --- a/include/linux/netlink.h

> +++ b/include/linux/netlink.h

> @@ -210,6 +210,7 @@ int netlink_sendskb(struct sock *sk, struct sk_buff

> *skb);

>  #else

>  #define NLMSG_GOODSIZE	SKB_WITH_OVERHEAD(8192UL)

>  #endif

> +#define NLMSG_DUMPSIZE SKB_WITH_OVERHEAD(8192UL)

> 

>  #define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)

> 

> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c

> index c8f35b5..cb8d6ac 100644

> --- a/net/netlink/af_netlink.c

> +++ b/net/netlink/af_netlink.c

> @@ -1669,7 +1669,7 @@ static int netlink_dump(struct sock *sk)

>  	struct nlmsghdr *nlh;

>  	int len, err = -ENOBUFS;

> 

> -	skb = sock_rmalloc(sk, NLMSG_GOODSIZE, 0, GFP_KERNEL);

> +	skb = sock_rmalloc(sk, NLMSG_DUMPSIZE, 0, GFP_KERNEL);

>  	if (!skb)

>  		goto errout;

> 

>
Ben Hutchings April 26, 2011, 7:07 p.m. UTC | #2
On Tue, 2011-04-26 at 09:24 -0700, Rose, Gregory V wrote:
> > -----Original Message-----
> > From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> > Sent: Tuesday, April 26, 2011 9:21 AM
> > To: Rose, Gregory V
> > Cc: David Miller; netdev@vger.kernel.org; bhutchings@solarflare.com
> > Subject: RE: [RFC PATCH] netlink: Increase netlink dump skb message size
> > 
> > Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :
> > 
> > > I'm fine with however you folks want to approach this, just give me some
> > direction.
> > 
> > I would just try following patch :
> 
> I'll test it out, it's certainly a lot simpler.  Perhaps I was getting a bit too fancy.
> 
> Ben would want to make sure it works for 127 VFs, my device does 63.

I haven't been working on SR-IOV myself so I'll pass this on to a
colleague.  Thanks, Eric.

Ben.
Rose, Gregory V April 26, 2011, 9:58 p.m. UTC | #3
> -----Original Message-----

> From: Ben Hutchings [mailto:bhutchings@solarflare.com]

> Sent: Tuesday, April 26, 2011 12:07 PM

> To: Rose, Gregory V; Eric Dumazet

> Cc: David Miller; netdev@vger.kernel.org

> Subject: RE: [RFC PATCH] netlink: Increase netlink dump skb message size

> 

> On Tue, 2011-04-26 at 09:24 -0700, Rose, Gregory V wrote:

> > > -----Original Message-----

> > > From: Eric Dumazet [mailto:eric.dumazet@gmail.com]

> > > Sent: Tuesday, April 26, 2011 9:21 AM

> > > To: Rose, Gregory V

> > > Cc: David Miller; netdev@vger.kernel.org; bhutchings@solarflare.com

> > > Subject: RE: [RFC PATCH] netlink: Increase netlink dump skb message

> size

> > >

> > > Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :

> > >

> > > > I'm fine with however you folks want to approach this, just give me

> some

> > > direction.

> > >

> > > I would just try following patch :

> >

> > I'll test it out, it's certainly a lot simpler.  Perhaps I was getting a

> bit too fancy.

> >

> > Ben would want to make sure it works for 127 VFs, my device does 63.

> 

> I haven't been working on SR-IOV myself so I'll pass this on to a

> colleague.  Thanks, Eric.


Eric's patch works fine for the case of 63 VFs.  For most dumps it's allocating quite a bit more memory than necessary but that's not a big issue since it's transient and not held for long.

- Greg
Steve Hodgson April 27, 2011, 3:46 p.m. UTC | #4
On 04/27/2011 04:24 PM, Eric Dumazet wrote:
> Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :
>
>> I'm fine with however you folks want to approach this, just give me some direction.
>
> I would just try following patch :
>

This allows the sfc driver to use 102 VFs, up from the current limit of 
45 VFs.

It's unfortunate that this patch isn't sufficient to allow all 127 VFs 
to be used, but whilst we wait for a new netlink api this is an 
improvement worth having.

I'd like to see this patch committed.

Thanks Eric,

- Steve
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 27, 2011, 4:30 p.m. UTC | #5
Le mercredi 27 avril 2011 à 16:46 +0100, Steve Hodgson a écrit :
> On 04/27/2011 04:24 PM, Eric Dumazet wrote:
> > Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :
> >
> >> I'm fine with however you folks want to approach this, just give me some direction.
> >
> > I would just try following patch :
> >
> 
> This allows the sfc driver to use 102 VFs, up from the current limit of 
> 45 VFs.
> 
> It's unfortunate that this patch isn't sufficient to allow all 127 VFs 
> to be used, but whilst we wait for a new netlink api this is an 
> improvement worth having.
> 

netlink recvmsg() supports MSG_PEEK so user would get the needed size of
its buffer before calling the real recvmsg()

big blobs could be attached as skb fragments (up to 64Kbytes), but do we
really want this...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rose, Gregory V April 27, 2011, 5:15 p.m. UTC | #6
> -----Original Message-----

> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]

> On Behalf Of Eric Dumazet

> Sent: Wednesday, April 27, 2011 9:30 AM

> To: Steve Hodgson

> Cc: Rose, Gregory V; David Miller; netdev@vger.kernel.org;

> bhutchings@solarflare.com

> Subject: Re: [RFC PATCH] netlink: Increase netlink dump skb message size

> 

> Le mercredi 27 avril 2011 à 16:46 +0100, Steve Hodgson a écrit :

> > On 04/27/2011 04:24 PM, Eric Dumazet wrote:

> > > Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :

> > >

> > >> I'm fine with however you folks want to approach this, just give me

> some direction.

> > >

> > > I would just try following patch :

> > >

> >

> > This allows the sfc driver to use 102 VFs, up from the current limit of

> > 45 VFs.

> >

> > It's unfortunate that this patch isn't sufficient to allow all 127 VFs

> > to be used, but whilst we wait for a new netlink api this is an

> > improvement worth having.

> >

> 

> netlink recvmsg() supports MSG_PEEK so user would get the needed size of

> its buffer before calling the real recvmsg()

> 

> big blobs could be attached as skb fragments (up to 64Kbytes), but do we

> really want this...

[Greg Rose] 

I'm looking into an approach in which we make the get info dump for VFs orthogonal to the set VF info, i.e. like this:

To set VF info we would follow the current convention:

# ip link set eth(x) vf (n) mac xx:xx:xx:xx:xx:xx
# ip link set eth(x) vf (n) vlan (nnnn)

To see VF info:

# ip link show eth(x) vf (n)

would show that VF's mac and vlan and could then be expanded in the future to display more information required for additional features that users are asking for.

The IFLA_VF_INFO dump would be moved out of the info dump for the physical function interface and would no longer be nested which would get rid of the need for huge amounts of buffer for info dumps on VFs.  The ip link show command for the PF would need to report the number of VFs currently allocated to the PF so that could fed into a script that loops to show each VFs info.

I think this approach would fix the problems we're looking at right now.

- Greg
Eric Dumazet April 27, 2011, 5:29 p.m. UTC | #7
Le mercredi 27 avril 2011 à 10:15 -0700, Rose, Gregory V a écrit :
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> > On Behalf Of Eric Dumazet
> > Sent: Wednesday, April 27, 2011 9:30 AM
> > To: Steve Hodgson
> > Cc: Rose, Gregory V; David Miller; netdev@vger.kernel.org;
> > bhutchings@solarflare.com
> > Subject: Re: [RFC PATCH] netlink: Increase netlink dump skb message size
> > 
> > Le mercredi 27 avril 2011 à 16:46 +0100, Steve Hodgson a écrit :
> > > On 04/27/2011 04:24 PM, Eric Dumazet wrote:
> > > > Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :
> > > >
> > > >> I'm fine with however you folks want to approach this, just give me
> > some direction.
> > > >
> > > > I would just try following patch :
> > > >
> > >
> > > This allows the sfc driver to use 102 VFs, up from the current limit of
> > > 45 VFs.
> > >
> > > It's unfortunate that this patch isn't sufficient to allow all 127 VFs
> > > to be used, but whilst we wait for a new netlink api this is an
> > > improvement worth having.
> > >
> > 
> > netlink recvmsg() supports MSG_PEEK so user would get the needed size of
> > its buffer before calling the real recvmsg()
> > 
> > big blobs could be attached as skb fragments (up to 64Kbytes), but do we
> > really want this...
> [Greg Rose] 
> 
> I'm looking into an approach in which we make the get info dump for VFs orthogonal to the set VF info, i.e. like this:
> 
> To set VF info we would follow the current convention:
> 
> # ip link set eth(x) vf (n) mac xx:xx:xx:xx:xx:xx
> # ip link set eth(x) vf (n) vlan (nnnn)
> 
> To see VF info:
> 
> # ip link show eth(x) vf (n)
> 
> would show that VF's mac and vlan and could then be expanded in the future to display more information required for additional features that users are asking for.
> 
> The IFLA_VF_INFO dump would be moved out of the info dump for the physical function interface and would no longer be nested which would get rid of the need for huge amounts of buffer for info dumps on VFs.  The ip link show command for the PF would need to report the number of VFs currently allocated to the PF so that could fed into a script that loops to show each VFs info.
> 
> I think this approach would fix the problems we're looking at right now.
> 

Hmm, if you look at "ip link ..." you'll see it dumps everything from
kernel and does the filter inside user command.

BTW "ip" uses a 16384 bytes buffer, not a 8192 bytes one.


$ strace -e trace=recvmsg ip link
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
msg_iov(1)=[{"\340\3\0\0\20\0\2\0dR\270M\224s\0\0\0\0\4\3\1\0\0\0I\0\1\0
\0\0\0\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3000
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
msg_iov(1)=[{"\364\3\0\0\20\0\2\0dR\270M\224s\0\0\0\0\1\0\4\0\0\0C\24\1
\0\0\0\0\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3024
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
msg_iov(1)=[{"\34\4\0\0\20\0\2\0dR\270M\224s\0\0\0\0\1\0\7\0\0\0C\20\1\0
\0\0\0\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 2104
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000},
msg_iov(1)=[{"\24\0\0\0\3\0\2\0dR\270M\224s\0\0\0\0\0\0\7\0\0\0C\20\1\0
\0\0\0\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 20

...


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 27, 2011, 6:05 p.m. UTC | #8
Le mercredi 27 avril 2011 à 10:39 -0700, Rose, Gregory V a écrit :

> Right, but when I look in rtnetlink I see the routine to calculate the
> amount of buffer needed for VF info dump is the number of device
> parent (PF) VFs * the sizeof various IFLA_VF_INFO items.  The more the
> VFs the bigger this gets, especially if you want to add more stuff to
> IFLA_VF_INFO.  So when the kernel dumps this all out it can get bigger
> than the NLMSG_GOODSIZE (or DUMPSIZE) pretty quickly.
> 
> > 
> > BTW "ip" uses a 16384 bytes buffer, not a 8192 bytes one.
> 
> I know, that's why I suffered some confusion about which size to use.
> The ip command uses 16K but the NLMSG_GOODSIZE can be as small as 3712
> bytes (depending on page size).  Despite the user buffer being 16k if
> the size calculated by if_nlmsg_size() in rtnetlink.c is bigger than
> NLMSG_GOODSIZE then you don't see the info for more than 40 or so VFs.
> More VFs than that and nothing gets displayed.
> 

One solution is to change rtnl_dump_ifinfo() to call rtnl_fill_ifinfo()
once time per device (RTM_NEWLINK like now but no more VFINFO inside),
then call another function to provide vf/vlan informations (RTM_NEWVF),
using cb->args[2] as an index into VF space, so that we can stop if
current skb is filled, and next recvmsg() starts at previous saved
index.

Or implement a new "ip vf ..." dumper and only dumps RTM_NEWVF messages.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rose, Gregory V April 27, 2011, 6:08 p.m. UTC | #9
> -----Original Message-----

> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]

> Sent: Wednesday, April 27, 2011 11:06 AM

> To: Rose, Gregory V

> Cc: Steve Hodgson; David Miller; netdev@vger.kernel.org;

> bhutchings@solarflare.com

> Subject: RE: [RFC PATCH] netlink: Increase netlink dump skb message size

> 

> Le mercredi 27 avril 2011 à 10:39 -0700, Rose, Gregory V a écrit :

> 

> > Right, but when I look in rtnetlink I see the routine to calculate the

> > amount of buffer needed for VF info dump is the number of device

> > parent (PF) VFs * the sizeof various IFLA_VF_INFO items.  The more the

> > VFs the bigger this gets, especially if you want to add more stuff to

> > IFLA_VF_INFO.  So when the kernel dumps this all out it can get bigger

> > than the NLMSG_GOODSIZE (or DUMPSIZE) pretty quickly.

> >

> > >

> > > BTW "ip" uses a 16384 bytes buffer, not a 8192 bytes one.

> >

> > I know, that's why I suffered some confusion about which size to use.

> > The ip command uses 16K but the NLMSG_GOODSIZE can be as small as 3712

> > bytes (depending on page size).  Despite the user buffer being 16k if

> > the size calculated by if_nlmsg_size() in rtnetlink.c is bigger than

> > NLMSG_GOODSIZE then you don't see the info for more than 40 or so VFs.

> > More VFs than that and nothing gets displayed.

> >

> 

> One solution is to change rtnl_dump_ifinfo() to call rtnl_fill_ifinfo()

> once time per device (RTM_NEWLINK like now but no more VFINFO inside),

> then call another function to provide vf/vlan informations (RTM_NEWVF),

> using cb->args[2] as an index into VF space, so that we can stop if

> current skb is filled, and next recvmsg() starts at previous saved

> index.

> 

> Or implement a new "ip vf ..." dumper and only dumps RTM_NEWVF messages.


I'm a netlink newbie so maybe that is a more obvious solution that I missed.  Let me have a look at the code and see.

Thanks,

- Greg
diff mbox

Patch

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 4c4ac3f..22cac81 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -210,6 +210,7 @@  int netlink_sendskb(struct sock *sk, struct sk_buff *skb);
 #else
 #define NLMSG_GOODSIZE	SKB_WITH_OVERHEAD(8192UL)
 #endif
+#define NLMSG_DUMPSIZE SKB_WITH_OVERHEAD(8192UL)
 
 #define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)
 
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index c8f35b5..cb8d6ac 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1669,7 +1669,7 @@  static int netlink_dump(struct sock *sk)
 	struct nlmsghdr *nlh;
 	int len, err = -ENOBUFS;
 
-	skb = sock_rmalloc(sk, NLMSG_GOODSIZE, 0, GFP_KERNEL);
+	skb = sock_rmalloc(sk, NLMSG_DUMPSIZE, 0, GFP_KERNEL);
 	if (!skb)
 		goto errout;