
tcp: assign the sock correctly to an outgoing SYNACK packet

Message ID 20130408154519.18177.57709.stgit@localhost
State Rejected, archived
Delegated to: David Miller

Commit Message

Paul Moore April 8, 2013, 3:45 p.m. UTC
Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
tcp_make_synack() to use alloc_skb() directly instead of calling
sock_wmalloc(), the goal being the elimination of two atomic
operations.  Unfortunately, in doing so the change broke certain
SELinux/NetLabel configurations by no longer correctly assigning
the sock to the outgoing packet.

This patch fixes this regression by doing the skb->sk assignment
directly inside tcp_make_synack().

Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 net/ipv4/tcp_output.c |    2 ++
 1 file changed, 2 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller April 8, 2013, 4:14 p.m. UTC | #1
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 11:45:19 -0400

> Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> tcp_make_synack() to use alloc_skb() directly instead of calling
> sock_wmalloc(), the goal being the elimination of two atomic
> operations.  Unfortunately, in doing so the change broke certain
> SELinux/NetLabel configurations by no longer correctly assigning
> the sock to the outgoing packet.
> 
> This patch fixes this regression by doing the skb->sk assignment
> directly inside tcp_make_synack().
> 
> Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> Signed-off-by: Paul Moore <pmoore@redhat.com>

Setting skb->sk without the destructor results in an SKB that can live
potentially forever with a stale reference to a destroyed socket.

You cannot fix the problem in this way.
Eric Dumazet April 8, 2013, 4:19 p.m. UTC | #2
On Mon, 2013-04-08 at 11:45 -0400, Paul Moore wrote:
> Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> tcp_make_synack() to use alloc_skb() directly instead of calling
> sock_wmalloc(), the goal being the elimination of two atomic
> operations.  Unfortunately, in doing so the change broke certain
> SELinux/NetLabel configurations by no longer correctly assigning
> the sock to the outgoing packet.
> 
> This patch fixes this regression by doing the skb->sk assignment
> directly inside tcp_make_synack().
> 
> Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> Signed-off-by: Paul Moore <pmoore@redhat.com>
> ---
>  net/ipv4/tcp_output.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 5d0b438..23cc295 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2705,6 +2705,8 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
>  		dst_release(dst);
>  		return NULL;
>  	}
> +	skb->sk = sk;
> +

Hmm...

Keeping a pointer on a socket without taking a refcount is not going to
work.

We are trying to make the stack scale, so you need to add a selinux call
to take a ref count only if needed.

That is : If selinux is not used, we don't need to slow down the stack.
 

Paul Moore April 8, 2013, 5:22 p.m. UTC | #3
On Monday, April 08, 2013 12:14:34 PM David Miller wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Mon, 08 Apr 2013 11:45:19 -0400
> 
> > Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> > tcp_make_synack() to use alloc_skb() directly instead of calling
> > sock_wmalloc(), the goal being the elimination of two atomic
> > operations.  Unfortunately, in doing so the change broke certain
> > SELinux/NetLabel configurations by no longer correctly assigning
> > the sock to the outgoing packet.
> > 
> > This patch fixes this regression by doing the skb->sk assignment
> > directly inside tcp_make_synack().
> > 
> > Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> > Signed-off-by: Paul Moore <pmoore@redhat.com>
> 
> Setting skb->sk without the destructor results in an SKB that can live
> potentially forever with a stale reference to a destroyed socket.
> 
> You cannot fix the problem in this way.

Okay, no worries, I'll work on v2.  For some reason I missed the destructor 
assignment in skb_set_owner_w(); I guess I was spending so much time hunting 
around looking for the missing skb->sk assignment that once I found it I 
declared victory ... a bit too soon.

Looking at the code again, I think the right solution is to call 
skb_set_owner_w() instead of doing the assignment directly but that is 
starting to bring us back to sock_wmalloc(force == 1) which gets back to 
Eric's comments ... (below) ...

On Monday, April 08, 2013 09:19:23 AM Eric Dumazet wrote:
> Keeping a pointer on a socket without taking a refcount is not going to
> work.
> 
> We are trying to make the stack scale, so you need to add a selinux call
> to take a ref count only if needed.
> 
> That is : If selinux is not used, we don't need to slow down the stack.

Contrary to popular belief, my goal is to not destroy the scalability and/or 
performance of our network stack, I just want to make sure we have a quality 
network stack that is not only fast and scalable, but also preserves the 
security functionality that makes Linux attractive to a number of users.  To 
that end, we could put an #ifdef in the middle of tcp_make_synack(), but that 
seems very ugly to me and I think sets a bad precedent for the network stack 
and kernel as a whole.

So a question for Dave, et al. - would you prefer that I fix this by:

1. Restore the original sock_wmalloc() call?
2. Keep things as-is with alloc_skb() but add skb_set_owner_w()?
3. Add an #ifdef depending on SELinux (probably the LSM in general, to be safe) 
and use sock_wmalloc() if enabled, alloc_skb() if not?

I guess I'm leaning towards #1 for the sake of simplicity, but I'd be happy 
with either #1 or #2.  The #3 option seems like a hack and makes me a bit 
afraid of the future.  I am also open to suggestions; to me, the most 
important thing is that we fix this regression, I'm less concerned about how 
we do it.
Eric Dumazet April 8, 2013, 5:36 p.m. UTC | #4
On Mon, 2013-04-08 at 13:22 -0400, Paul Moore wrote:
> On Monday, April 08, 2013 12:14:34 PM David Miller wrote:
> > From: Paul Moore <pmoore@redhat.com>
> > Date: Mon, 08 Apr 2013 11:45:19 -0400
> > 
> > > Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> > > tcp_make_synack() to use alloc_skb() directly instead of calling
> > > sock_wmalloc(), the goal being the elimination of two atomic
> > > operations.  Unfortunately, in doing so the change broke certain
> > > SELinux/NetLabel configurations by no longer correctly assigning
> > > the sock to the outgoing packet.
> > > 
> > > This patch fixes this regression by doing the skb->sk assignment
> > > directly inside tcp_make_synack().
> > > 
> > > Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> > > Signed-off-by: Paul Moore <pmoore@redhat.com>
> > 
> > Setting skb->sk without the destructor results in an SKB that can live
> > potentially forever with a stale reference to a destroyed socket.
> > 
> > You cannot fix the problem in this way.
> 
> Okay, no worries, I'll work on v2.  For some reason I missed the destructor 
> assignment in skb_set_owner_w(); I guess I was spending so much time hunting 
> around looking for the missing skb->sk assignment that once I found it I 
> declared victory ... a bit too soon.
> 
> Looking at the code again, I think the right solution is to call 
> skb_set_owner_w() instead of doing the assignment directly but that is 
> starting to bring us back to sock_wmalloc(force == 1) which gets back to 
> Eric's comments ... (below) ...
> 
> On Monday, April 08, 2013 09:19:23 AM Eric Dumazet wrote:
> > Keeping a pointer on a socket without taking a refcount is not going to
> > work.
> > 
> > We are trying to make the stack scale, so you need to add a selinux call
> > to take a ref count only if needed.
> > 
> > That is : If selinux is not used, we don't need to slow down the stack.
> 
> Contrary to popular belief, my goal is to not destroy the scalability and/or 
> performance of our network stack, I just want to make sure we have a quality 
> network stack that is not only fast and scalable, but also preserves the 
> security functionality that makes Linux attractive to a number of users.  To 
> that end, we could put a #ifdef in the middle of tcp_make_synack(), but that 
> seems very ugly to me and I think sets a bad precedence for the network stack 
> and kernel as a whole.
> 
> So a question for Dave, et al. - would you prefer that I fix this by:
> 
> 1. Restore the original sock_wmalloc() call?
> 2. Keep things as-is with alloc_skb() but add skb_set_owner_w()?
> 3. Add an #ifdef depending on SELinux (probably the LSM in general, to be safe) 
> and use sock_wmalloc() if enabled, alloc_skb() if not?

Didn't we have the same issue with RST packets?

Please take a look at commit 3a7c384ffd57ef5fbd95f48edaa2ca4eb3d9f2ee



Paul Moore April 8, 2013, 5:40 p.m. UTC | #5
On Monday, April 08, 2013 10:36:26 AM Eric Dumazet wrote:
> On Mon, 2013-04-08 at 13:22 -0400, Paul Moore wrote:
> > On Monday, April 08, 2013 12:14:34 PM David Miller wrote:
> > > From: Paul Moore <pmoore@redhat.com>
> > > Date: Mon, 08 Apr 2013 11:45:19 -0400
> > > 
> > > > Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2 converted
> > > > tcp_make_synack() to use alloc_skb() directly instead of calling
> > > > sock_wmalloc(), the goal being the elimination of two atomic
> > > > operations.  Unfortunately, in doing so the change broke certain
> > > > SELinux/NetLabel configurations by no longer correctly assigning
> > > > the sock to the outgoing packet.
> > > > 
> > > > This patch fixes this regression by doing the skb->sk assignment
> > > > directly inside tcp_make_synack().
> > > > 
> > > > Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> > > > Signed-off-by: Paul Moore <pmoore@redhat.com>
> > > 
> > > Setting skb->sk without the destructor results in an SKB that can live
> > > potentially forever with a stale reference to a destroyed socket.
> > > 
> > > You cannot fix the problem in this way.
> > 
> > Okay, no worries, I'll work on v2.  For some reason I missed the
> > destructor assignment in skb_set_owner_w(); I guess I was spending so much
> > time hunting around looking for the missing skb->sk assignment that once I
> > found it I declared victory ... a bit too soon.
> > 
> > Looking at the code again, I think the right solution is to call
> > skb_set_owner_w() instead of doing the assignment directly but that is
> > starting to bring us back to sock_wmalloc(force == 1) which gets back to
> > Eric's comments ... (below) ...
> > 
> > On Monday, April 08, 2013 09:19:23 AM Eric Dumazet wrote:
> > > Keeping a pointer on a socket without taking a refcount is not going to
> > > work.
> > > 
> > > We are trying to make the stack scale, so you need to add a selinux call
> > > to take a ref count only if needed.
> > > 
> > > That is : If selinux is not used, we don't need to slow down the stack.
> > 
> > Contrary to popular belief, my goal is to not destroy the scalability
> > and/or performance of our network stack, I just want to make sure we have
> > a quality network stack that is not only fast and scalable, but also
> > preserves the security functionality that makes Linux attractive to a
> > number of users.  To that end, we could put a #ifdef in the middle of
> > tcp_make_synack(), but that seems very ugly to me and I think sets a bad
> > precedence for the network stack and kernel as a whole.
> > 
> > So a question for Dave, et al. - would you prefer that I fix this by:
> > 
> > 1. Restore the original sock_wmalloc() call?
> > 2. Keep things as-is with alloc_skb() but add skb_set_owner_w()?
> > 3. Add an #ifdef depending on SELinux (probably the LSM in general, to be
> > safe) and use sock_wmalloc() if enabled, alloc_skb() if not?
> 
> Didnt we had the same issue with RST packets ?
> 
> Please take a look at commit 3a7c384ffd57ef5fbd95f48edaa2ca4eb3d9f2ee

Sort of a similar problem, but not really the same.  Also, arguably, there is 
no real associated sock/socket for a RST so orphaning the packet makes sense.  
In the case of a SYNACK we can, and should, associate the packet with a 
sock/socket.
Eric Dumazet April 8, 2013, 5:47 p.m. UTC | #6
On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:

> Sort of a similar problem, but not really the same.  Also, arguably, there is 
> no real associated sock/socket for a RST so orphaning the packet makes sense.  
> In the case of a SYNACK we can, and should, associate the packet with a 
> sock/socket.

What is the intent?

This kind of requirement has a huge cost, and that's why I want a hook
instead of a 'generic thing'.

Please read : 

http://workshop.netfilter.org/2013/wiki/images/3/3a/NFWS-TCP.pdf



Eric Dumazet April 8, 2013, 6:01 p.m. UTC | #7
On Mon, 2013-04-08 at 10:47 -0700, Eric Dumazet wrote:

> This kind of requirement has a huge cost, and thats why I want a hook
> instead of a 'generic thing'

I meant " I would like ... "

We for example have security_inet_csk_clone()

We could have security_skb_owned_by(skb, sk)

I probably can send a patch, it seems quite easy.

Sergei Shtylyov April 8, 2013, 6:03 p.m. UTC | #8
Hello.

On 04/08/2013 07:45 PM, Paul Moore wrote:

> Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2

    Please also specify that commit's summary line in parens
(DaveM seems to also require double quotes inside the parens).

> converted
> tcp_make_synack() to use alloc_skb() directly instead of calling
> sock_wmalloc(), the goal being the elimination of two atomic
> operations.  Unfortunately, in doing so the change broke certain
> SELinux/NetLabel configurations by no longer correctly assigning
> the sock to the outgoing packet.
>
> This patch fixes this regression by doing the skb->sk assignment
> directly inside tcp_make_synack().
>
> Reported-by: Miroslav Vadkerti <mvadkert@redhat.com>
> Signed-off-by: Paul Moore <pmoore@redhat.com>

WBR, Sergei

Paul Moore April 8, 2013, 6:12 p.m. UTC | #9
On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
> On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
> > Sort of a similar problem, but not really the same.  Also, arguably, there
> > is no real associated sock/socket for a RST so orphaning the packet makes
> > sense. In the case of a SYNACK we can, and should, associate the packet
> > with a sock/socket.
> 
> What is the intent ?

We have to do a number of painful things in SELinux because we aren't allowed 
a proper security blob (void *security) in a sk_buff.  One of those things is 
using the security blob in the originating sock as a stand-in for the packet's 
own security blob; as a result, when skb->sk is not set we have to make some 
guesses about the security attributes of the packet.  We do have methods to handle 
packets without a valid sock pointer, but those are used primarily for non-
local packets (e.g. forwarded traffic) and some limited local use cases (e.g. 
TCP RST packets).

Also, don't mention skb->secmark; unfortunately that was used for something 
else and doesn't apply to this conversation.
 
> On Mon, 2013-04-08 at 10:47 -0700, Eric Dumazet wrote:
> > This kind of requirement has a huge cost, and thats why I want a hook
> > instead of a 'generic thing'
> 
> I meant " I would like ... "
> 
> We for example have security_inet_csk_clone()
> 
> We could have security_skb_owned_by(skb, sk)
>
> I probably can send a patch, it seems quite easy.

It seems a bit fragile to me, perhaps even hacky, but in some ways I guess it 
isn't any more fragile than relying on skb->sk - as this problem demonstrates.  
My other concern is that adding this hook *correctly* is likely to touch a lot 
of files and may be a bit much so late in the 3.9 cycle, Dave, what say you?

Assuming that this is the preferred option, Eric, would you be open to 
reverting your patch for 3.9 with the assumption that I promise to add the 
hook for 3.10?  You've got my word that I'll have it done ASAP.
Paul Moore April 8, 2013, 6:12 p.m. UTC | #10
On Monday, April 08, 2013 10:03:33 PM Sergei Shtylyov wrote:
> Hello.
> 
> On 04/08/2013 07:45 PM, Paul Moore wrote:
> > Commit 90ba9b1986b5ac4b2d184575847147ea7c4280a2
> 
>     Please also specify that commit's summary line in parens
> (DaveM seems to also require double quotes inside the parens).

Noted for future reference.
Eric Dumazet April 8, 2013, 6:21 p.m. UTC | #11
On Mon, 2013-04-08 at 14:12 -0400, Paul Moore wrote:

> 
> It seems a bit fragile to me, perhaps even hacky, but in some ways I guess it 
> isn't anymore fragile than relying on skb->sk - as this problem demonstrates.  
> My other concern is that adding this hook *correctly* is likely to touch a lot 
> of files and may be a bit much so late in the 3.9 cycle, Dave, what say you?

I don't get it: 90ba9b1986b5ac4b2d18 was in 3.6, so why do you care about
3.9?

I am preparing a fix right now. Not a revert, thank you.




Paul Moore April 8, 2013, 6:26 p.m. UTC | #12
On Monday, April 08, 2013 11:21:43 AM Eric Dumazet wrote:
> On Mon, 2013-04-08 at 14:12 -0400, Paul Moore wrote:
> > It seems a bit fragile to me, perhaps even hacky, but in some ways I guess
> > it isn't anymore fragile than relying on skb->sk - as this problem
> > demonstrates. My other concern is that adding this hook *correctly* is
> > likely to touch a lot of files and may be a bit much so late in the 3.9
> > cycle, Dave, what say you?
>
> I don't get it, 90ba9b1986b5ac4b2d18 was in 3.6, why do you care of
> 3.9 ?

I wasn't made aware of the particular bug until just recently.  When I see a 
regression I generally try to get it fixed as soon as possible.
 
> I am preparing a fix right now. Not a revert, thank you.

I guess we'll have to wait and see then; the more I think about the new hook 
you proposed the less enthused I am about it.

I'm still curious to hear what Dave has to say on this.
Paul Moore April 8, 2013, 6:32 p.m. UTC | #13
On Monday, April 08, 2013 02:12:01 PM Paul Moore wrote:
> On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
> > On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
> > > Sort of a similar problem, but not really the same.  Also, arguably,
> > > there is no real associated sock/socket for a RST so orphaning the
> > > packet makes sense. In the case of a SYNACK we can, and should,
> > > associate the packet with a sock/socket.
> > 
> > What is the intent ?
> 
> We have to do a number of painful things in SELinux because we aren't
> allowed a proper security blob (void *security) in a sk_buff.  One of those
> things ...

Actually, I wonder if this problem means it is a good time to revisit the no-
security-blob-in-sk_buff decision?  The management of the blob would be hidden 
behind the LSM hooks like everything else and it would have a number of 
advantages including making problems like we are seeing here easier to fix or 
avoid entirely.  It would also make life much easier for those of us working on 
LSM stuff and it would pave the way for including network access controls in 
the stacked-LSM stuff Casey is kicking around.

I'm aware of all the arguments against, but thought it would be worth bringing 
it up again, if for no other reason than I haven't heard enough shouting yet 
today :)
Eric Dumazet April 8, 2013, 6:34 p.m. UTC | #14
On Mon, 2013-04-08 at 14:26 -0400, Paul Moore wrote:

> I guess we'll have to wait and see then; the more I think about the new hook 
> you proposed the less enthused I am about it.
> 
> I'm still curious to hear what Dave has to say on this.
> 

90ba9b1986b5ac4b2 is 10 months old, and nobody complained until today?

This sounds like a very small issue to me, a revert is simply overkill. 

Lets go forward instead of applying blind fixes.


David Miller April 8, 2013, 7:25 p.m. UTC | #15
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 13:22:50 -0400

> Contrary to popular belief, my goal is to not destroy the scalability and/or 
> performance of our network stack, I just want to make sure we have a quality 
> network stack that is not only fast and scalable, but also preserves the 
> security functionality that makes Linux attractive to a number of users.

Get the violin out.

> To that end, we could put a #ifdef in the middle of
> tcp_make_synack(), but that seems very ugly to me and I think sets a
> bad precedence for the network stack and kernel as a whole.

Not an ifdef, a run time state test.
Paul Moore April 8, 2013, 9:10 p.m. UTC | #16
On Monday, April 08, 2013 02:32:00 PM Paul Moore wrote:
> On Monday, April 08, 2013 02:12:01 PM Paul Moore wrote:
> > On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
> > > On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
> > > > Sort of a similar problem, but not really the same.  Also, arguably,
> > > > there is no real associated sock/socket for a RST so orphaning the
> > > > packet makes sense. In the case of a SYNACK we can, and should,
> > > > associate the packet with a sock/socket.
> > > 
> > > What is the intent ?
> > 
> > We have to do a number of painful things in SELinux because we aren't
> > allowed a proper security blob (void *security) in a sk_buff.  One of
> > those things ...
> 
> Actually, I wonder if this problem means it is a good time to revisit the
> no- security-blob-in-sk_buff decision?  The management of the blob would be
> hidden behind the LSM hooks like everything else and it would have a number
> of advantages including making problems like we are seeing here easier to
> fix or avoid entirely.  It would also make life much easier for those of
> working on LSM stuff and it would pave the way for including network access
> controls in the stacked-LSM stuff Casey is kicking around.

No comment, or am I just too anxious?
David Miller April 8, 2013, 9:15 p.m. UTC | #17
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 17:10:43 -0400

> On Monday, April 08, 2013 02:32:00 PM Paul Moore wrote:
>> On Monday, April 08, 2013 02:12:01 PM Paul Moore wrote:
>> > On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
>> > > On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
>> > > > Sort of a similar problem, but not really the same.  Also, arguably,
>> > > > there is no real associated sock/socket for a RST so orphaning the
>> > > > packet makes sense. In the case of a SYNACK we can, and should,
>> > > > associate the packet with a sock/socket.
>> > > 
>> > > What is the intent ?
>> > 
>> > We have to do a number of painful things in SELinux because we aren't
>> > allowed a proper security blob (void *security) in a sk_buff.  One of
>> > those things ...
>> 
>> Actually, I wonder if this problem means it is a good time to revisit the
>> no- security-blob-in-sk_buff decision?  The management of the blob would be
>> hidden behind the LSM hooks like everything else and it would have a number
>> of advantages including making problems like we are seeing here easier to
>> fix or avoid entirely.  It would also make life much easier for those of
>> working on LSM stuff and it would pave the way for including network access
>> controls in the stacked-LSM stuff Casey is kicking around.
> 
> No comment, or am I just too anxious?

There is no way I'm putting LSM overhead into sk_buff, it's already
too big.

I didn't comment because it wasn't worth a comment, but since you're
pushing me on the issue, I'll make the no explicit.
Paul Moore April 8, 2013, 9:24 p.m. UTC | #18
On Monday, April 08, 2013 05:15:12 PM David Miller wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Mon, 08 Apr 2013 17:10:43 -0400
> 
> > On Monday, April 08, 2013 02:32:00 PM Paul Moore wrote:
> >> On Monday, April 08, 2013 02:12:01 PM Paul Moore wrote:
> >> > On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
> >> > > On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
> >> > > > Sort of a similar problem, but not really the same.  Also,
> >> > > > arguably,
> >> > > > there is no real associated sock/socket for a RST so orphaning the
> >> > > > packet makes sense. In the case of a SYNACK we can, and should,
> >> > > > associate the packet with a sock/socket.
> >> > > 
> >> > > What is the intent ?
> >> > 
> >> > We have to do a number of painful things in SELinux because we aren't
> >> > allowed a proper security blob (void *security) in a sk_buff.  One of
> >> > those things ...
> >> 
> >> Actually, I wonder if this problem means it is a good time to revisit the
> >> no- security-blob-in-sk_buff decision?  The management of the blob would
> >> be hidden behind the LSM hooks like everything else and it would have a
> >> number of advantages including making problems like we are seeing here
> >> easier to fix or avoid entirely.  It would also make life much easier for
> >> those of working on LSM stuff and it would pave the way for including
> >> network access controls in the stacked-LSM stuff Casey is kicking around.
> > 
> > No comment, or am I just too anxious?
> 
> There is no way I'm putting LSM overhead into sk_buff, it's already
> too big.

If the void pointer is wrapped by an #ifdef (plenty of precedent for that) and 
the management of that pointer is handled by LSM hooks why is it a concern?  I 
apologize for pushing on the issue, but I'm having a hard time reconciling the 
reason for the "no" with the comments/decisions about the regression fix; at 
present there seems to be a level of contradiction between the two.

> I didn't comment because it wasn't worth a comment, but since you're
> pushing me on the issue, I'll make the no explicit.
David Miller April 8, 2013, 9:33 p.m. UTC | #19
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 17:24:50 -0400

> If the void pointer is wrapped by a #ifdef (plenty of precedence for that) and 
> the management of that pointer is handled by LSM hooks why is it a concern?  I 
> apologize for pushing on the issue, but I'm having a hard time reconciling the 
> reason for the "no" with the comments/decisions about the regression fix; at 
> present there seems to be a level of contradiction between the two.

8 bytes times however many millions of packets per second we can process
on a big machine, you do the math.

It's memory, less cache locality, etc. etc. etc.

It's the most important data structure in the entire networking stack,
and every single byte matters.

I want the overhead to be your problem, so that only users of your
stuff eat the overhead, rather than everyone.

And don't even mention ifdefs, that's bogus, because every
distribution turns every option on, %99.9999999 of users will
therefore not see the savings.

Really, this is a dead topic, let's move on.

Thanks.
Ben Hutchings April 8, 2013, 9:34 p.m. UTC | #20
On Mon, 2013-04-08 at 17:24 -0400, Paul Moore wrote:
> On Monday, April 08, 2013 05:15:12 PM David Miller wrote:
> > From: Paul Moore <pmoore@redhat.com>
> > Date: Mon, 08 Apr 2013 17:10:43 -0400
> > 
> > > On Monday, April 08, 2013 02:32:00 PM Paul Moore wrote:
> > >> On Monday, April 08, 2013 02:12:01 PM Paul Moore wrote:
> > >> > On Monday, April 08, 2013 10:47:47 AM Eric Dumazet wrote:
> > >> > > On Mon, 2013-04-08 at 13:40 -0400, Paul Moore wrote:
> > >> > > > Sort of a similar problem, but not really the same.  Also,
> > >> > > > arguably,
> > >> > > > there is no real associated sock/socket for a RST so orphaning the
> > >> > > > packet makes sense. In the case of a SYNACK we can, and should,
> > >> > > > associate the packet with a sock/socket.
> > >> > > 
> > >> > > What is the intent ?
> > >> > 
> > >> > We have to do a number of painful things in SELinux because we aren't
> > >> > allowed a proper security blob (void *security) in a sk_buff.  One of
> > >> > those things ...
> > >> 
> > >> Actually, I wonder if this problem means it is a good time to revisit the
> > >> no- security-blob-in-sk_buff decision?  The management of the blob would
> > >> be hidden behind the LSM hooks like everything else and it would have a
> > >> number of advantages including making problems like we are seeing here
> > >> easier to fix or avoid entirely.  It would also make life much easier for
> > >> those of working on LSM stuff and it would pave the way for including
> > >> network access controls in the stacked-LSM stuff Casey is kicking around.
> > > 
> > > No comment, or am I just too anxious?
> > 
> > There is no way I'm putting LSM overhead into sk_buff, it's already
> > too big.
> 
> If the void pointer is wrapped by a #ifdef (plenty of precedence for that) and 
> the management of that pointer is handled by LSM hooks why is it a concern?  I 
> apologize for pushing on the issue, but I'm having a hard time reconciling the 
> reason for the "no" with the comments/decisions about the regression fix; at 
> present there seems to be a level of contradiction between the two.

Most Linux users are running distribution kernels with one or more LSMs
built-in.  Anything you make dependent on CONFIG_SECURITY or similar
generic symbol will have a cost for all those users, whether or not they
actually use the LSM.

Ben.

> > I didn't comment because it wasn't worth a comment, but since you're
> > pushing me on the issue, I'll make the no explicit.
>
Paul Moore April 8, 2013, 10:01 p.m. UTC | #21
On Monday, April 08, 2013 05:33:25 PM David Miller wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Mon, 08 Apr 2013 17:24:50 -0400
> 
> > If the void pointer is wrapped by a #ifdef (plenty of precedence for that)
> > and the management of that pointer is handled by LSM hooks why is it a
> > concern?  I apologize for pushing on the issue, but I'm having a hard
> > time reconciling the reason for the "no" with the comments/decisions
> > about the regression fix; at present there seems to be a level of
> > contradiction between the two.
>
> 8 bytes times however many millions of packets per second we can process
> on a big machine, you do the math.
> 
> It's memory, less cache locality, etc. etc. etc.
> 
> It's the most important data structure in the entire networking stack,
> and every single byte matters.
> 
> I want the overhead to be your problem, so that only users of your
> stuff eat the overhead, rather than everyone.

Okay, if the objection is really just one of structure size and not the hooks, 
what if I did the work to consolidate the skb->secmark and skb->sp fields into 
a new structure/pointer?  Assuming it wasn't too painful, it would be a net 
reduction of four bytes.  If that worked would you have an objection to us 
adding a LSM security blob to this new structure?
David Miller April 8, 2013, 10:08 p.m. UTC | #22
From: Paul Moore <pmoore@redhat.com>
Date: Mon, 08 Apr 2013 18:01:56 -0400

> Okay, if the objection is really just one of structure size and not the hooks, 
> what if I did the work to consolidate the skb->secmark and skb->sp fields into 
> a new structure/pointer?  Assuming it wasn't too painful, it would be a net 
> reduction of four bytes.  If that worked would you have an objection to us 
> adding a LSM security blob to this new structure?

'sp' is separate from LSM and making it indirect would hurt IPSEC
performance.

Please, really, just drop this.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Casey Schaufler April 8, 2013, 11:40 p.m. UTC | #23
On 4/8/2013 2:33 PM, David Miller wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Mon, 08 Apr 2013 17:24:50 -0400
>
>> If the void pointer is wrapped by a #ifdef (plenty of precedent for that) and 
>> the management of that pointer is handled by LSM hooks why is it a concern?  I 
>> apologize for pushing on the issue, but I'm having a hard time reconciling the 
>> reason for the "no" with the comments/decisions about the regression fix; at 
>> present there seems to be a level of contradiction between the two.
> 8 bytes times however many millions of packets per second we can process
> on a big machine, you do the math.

OK, let's do the math.

First off, it's 4 bytes, not 8. It replaces the secmark.
Your increased memory usage is going to be

	4 bytes/packet *  M packets/second * N seconds

Where M is the rate at which you're processing packets and
N is the length of time it takes to process a packet.

Let's pretend we have an embedded system that does nothing but send
128 byte packets on a 10Gb port. That's 10M packets/second. If it
takes a full second to process a packet the overhead is 40MB for that
second. I have it on good authority that packets can be processed
in considerably less time than that. The real number is more like
0.05 seconds. That means your actual overhead is more like 1MB.

These are dumbed down calculations. I am not a memory usage expert.
I am convinced that "real" calculations are going to get similar
numbers. I am, of course, willing to be swayed by evidence that I
am wrong.

Compare that to the overhead associated with using CIPSO on packets
that never leave the box.



>
> It's memory, less cache locality, etc. etc. etc.
>
> It's the most important data structure in the entire networking stack,
> and every single byte matters.
>
> I want the overhead to be your problem, so that only users of your
> stuff eat the overhead, rather than everyone.
>
> And don't even mention ifdefs, that's bogus, because every
> distribution turns every option on, %99.9999999 of users will
> therefore not see the savings.
>
> Really, this is a dead topic, let's move on.
>
> Thanks.

Eric Dumazet April 9, 2013, 12:33 a.m. UTC | #24
On Mon, 2013-04-08 at 16:40 -0700, Casey Schaufler wrote:

> OK, let's do the math.
> 
> First off, it's 4 bytes, not 8. It replaces the secmark.
> Your increased memory usage is going to be
> 
> 	4 bytes/packet *  M packets/second * N seconds
> 
> Where M is the rate at which you're processing packets and
> N is the length of time it takes to process a packet.
> 
> Let's pretend we have an embedded system that does nothing but send
> 128 byte packets on a 10Gb port. That's 10M packets/second. If it
> takes a full second to process a packet the overhead is 40MB for that
> second. I have it on good authority that packets can be processed
> in considerably less time than that. The real number is more like
> 0.05 seconds. That means your actual overhead is more like 1MB.
> 
> These are dumbed down calculations. I am not a memory usage expert.
> I am convinced that "real" calculations are going to get similar
> numbers. I am, of course, willing to be swayed by evidence that I
> am wrong.
> 
> Compare that to the overhead associated with using CIPSO on packets
> that never leave the box.

Maths are not that simple, and it's not about the size of sk_buff, since the
number of in-flight skbs should be quite small.

It's the time to init this memory for _every_ packet.

sizeof(sk_buff) is 0xf8, very close to crossing the 256 byte limit.

Add a single _byte_ and it becomes a matter of adding a _cache_ line,
and that's a 25% cost, assuming 64-byte cache lines.

So instead of processing 10M packets per second, we would process 9M
packets per second, or maybe less.

Yes, 256 bytes per sk_buff, this is the current insane situation.
(Not counting the struct skb_shared_info, adding at least one additional
cache line)



Casey Schaufler April 9, 2013, 12:59 a.m. UTC | #25
On 4/8/2013 5:33 PM, Eric Dumazet wrote:
> On Mon, 2013-04-08 at 16:40 -0700, Casey Schaufler wrote:
>
>> OK, let's do the math.
>>
>> First off, it's 4 bytes, not 8. It replaces the secmark.
>> Your increased memory usage is going to be
>>
>> 	4 bytes/packet *  M packets/second * N seconds
>>
>> Where M is the rate at which you're processing packets and
>> N is the length of time it takes to process a packet.
>>
>> Let's pretend we have an embedded system that does nothing but send
>> 128 byte packets on a 10Gb port. That's 10M packets/second. If it
>> takes a full second to process a packet the overhead is 40MB for that
>> second. I have it on good authority that packets can be processed
>> in considerably less time than that. The real number is more like
>> 0.05 seconds. That means your actual overhead is more like 1MB.
>>
>> These are dumbed down calculations. I am not a memory usage expert.
>> I am convinced that "real" calculations are going to get similar
>> numbers. I am, of course, willing to be swayed by evidence that I
>> am wrong.
>>
>> Compare that to the overhead associated with using CIPSO on packets
>> that never leave the box.
> Maths are not that simple,

I am willing to accept that. I am willing to be
educated. I would be interested to see what the
"real" maths are, and how far off my dumb version
might actually be.

> and it's not about the size of sk_buff, since the
> number of in-flight skbs should be quite small.

The reason we're told that there can't be a blob
pointer in the sk_buff is that it increases the size
of the sk_buff. Yes, it *is* about the size.

> It's the time to init this memory for _every_ packet.

Which is a function of the size, no?

> sizeof(sk_buff) is 0xf8, very close to crossing the 256 byte limit.

0xf8 + 0x4 = 0xfc
248 + 4 = 252

> Add a single _byte_ and it becomes a matter of adding a _cache_ line,
> and that's a 25% cost, assuming 64-byte cache lines.

I don't see that with adding 4 bytes. Again, I'm willing to be
educated if I'm wrong.

> So instead of processing 10M packets per second, we would process 9M
> packets per second, or maybe less.
>
> Yes, 256 bytes per sk_buff, this is the current insane situation.
> (Not counting the struct skb_shared_info, adding at least one additional
> cache line)

Sorry, but I don't see what's insane with either the
current 248 byte sk_buff or a 252 byte sk_buff.

As always, I am willing to be educated.

Thank you.

Eric Dumazet April 9, 2013, 1:09 a.m. UTC | #26
On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:

> I don't see that with adding 4 bytes. Again, I'm willing to be
> educated if I'm wrong.

Feel free to add 4 bytes without having the 'align to 8 bytes' problem
on 64 bit arches. Show us your patch.



Casey Schaufler April 9, 2013, 1:24 a.m. UTC | #27
On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
>
>> I don't see that with adding 4 bytes. Again, I'm willing to be
>> educated if I'm wrong.
> Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> on 64 bit arches. Show us your patch.

Recall that it's replacing an existing 4 byte value with an 8 byte value.
My compiler days were quite short and long ago, but it would seem that
an 8 byte value ought not have an 'align to 8 bytes' problem.

Again, I'm willing to be educated.


Paul Moore April 9, 2013, 1:19 p.m. UTC | #28
On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> >> I don't see that with adding 4 bytes. Again, I'm willing to be
> >> educated if I'm wrong.
> > 
> > Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> > on 64 bit arches. Show us your patch.
> 
> Recall that it's replacing an existing 4 byte value with an 8 byte value.
> My compiler days were quite short and long ago, but it would seem that
> an 8 byte value ought not have an 'align to 8 bytes' problem.
> 
> Again, I'm willing to be educated.

Armed with a cup of coffee I took a look at the sk_buff structure this morning 
with the pahole tool and using the current sk_buff if we turn on all the 
#ifdefs here is what I see on x86_64:

struct sk_buff {
        struct sk_buff *           next;                 /*     0     8 */
        struct sk_buff *           prev;                 /*     8     8 */
        ktime_t                    tstamp;               /*    16     8 */
        struct sock *              sk;                   /*    24     8 */
        struct net_device *        dev;                  /*    32     8 */
        char                       cb[48];               /*    40    48 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
        long unsigned int          _skb_refdst;          /*    88     8 */
        struct sec_path *          sp;                   /*    96     8 */
        unsigned int               len;                  /*   104     4 */
        unsigned int               data_len;             /*   108     4 */
        __u16                      mac_len;              /*   112     2 */
        __u16                      hdr_len;              /*   114     2 */
        union {
                __wsum             csum;                 /*           4 */
                struct {
                        __u16      csum_start;           /*   116     2 */
                        __u16      csum_offset;          /*   118     2 */
                };                                       /*           4 */
        };                                               /*   116     4 */
        __u32                      priority;             /*   120     4 */
        int                        flags1_begin[0];      /*   124     0 */
        __u8                       local_df:1;           /*   124: 7  1 */
        __u8                       cloned:1;             /*   124: 6  1 */
        __u8                       ip_summed:2;          /*   124: 4  1 */
        __u8                       nohdr:1;              /*   124: 3  1 */
        __u8                       nfctinfo:3;           /*   124: 0  1 */
        __u8                       pkt_type:3;           /*   125: 5  1 */
        __u8                       fclone:2;             /*   125: 3  1 */
        __u8                       ipvs_property:1;      /*   125: 2  1 */
        __u8                       peeked:1;             /*   125: 1  1 */
        __u8                       nf_trace:1;           /*   125: 0  1 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        flags1_end[0];        /*   128     0 */
        __be16                     protocol;             /*   128     2 */

        /* XXX 6 bytes hole, try to pack */

        void                       (*destructor)(struct sk_buff *); /*   136     8 */
        struct nf_conntrack *      nfct;                 /*   144     8 */
        struct sk_buff *           nfct_reasm;           /*   152     8 */
        struct nf_bridge_info *    nf_bridge;            /*   160     8 */
        int                        skb_iif;              /*   168     4 */
        __u32                      rxhash;               /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */
        __u16                      tc_index;             /*   178     2 */
        __u16                      tc_verd;              /*   180     2 */
        __u16                      queue_mapping;        /*   182     2 */
        int                        flags2_begin[0];      /*   184     0 */
        __u8                       ndisc_nodetype:2;     /*   184: 6  1 */
        __u8                       pfmemalloc:1;         /*   184: 5  1 */
        __u8                       ooo_okay:1;           /*   184: 4  1 */
        __u8                       l4_rxhash:1;          /*   184: 3  1 */
        __u8                       wifi_acked_valid:1;   /*   184: 2  1 */
        __u8                       wifi_acked:1;         /*   184: 1  1 */
        __u8                       no_fcs:1;             /*   184: 0  1 */
        __u8                       head_frag:1;          /*   185: 7  1 */
        __u8                       encapsulation:1;      /*   185: 6  1 */

        /* XXX 6 bits hole, try to pack */
        /* XXX 2 bytes hole, try to pack */

        int                        flags2_end[0];        /*   188     0 */
        dma_cookie_t               dma_cookie;           /*   188     4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        __u32                      secmark;              /*   192     4 */
        union {
                __u32              mark;                 /*           4 */
                __u32              dropcount;            /*           4 */
                __u32              reserved_tailroom;    /*           4 */
        };                                               /*   196     4 */
        sk_buff_data_t             inner_transport_header; /*   200     8 */
        sk_buff_data_t             inner_network_header; /*   208     8 */
        sk_buff_data_t             transport_header;     /*   216     8 */
        sk_buff_data_t             network_header;       /*   224     8 */
        sk_buff_data_t             mac_header;           /*   232     8 */
        sk_buff_data_t             tail;                 /*   240     8 */
        sk_buff_data_t             end;                  /*   248     8 */
        /* --- cacheline 4 boundary (256 bytes) --- */
        unsigned char *            head;                 /*   256     8 */
        unsigned char *            data;                 /*   264     8 */
        unsigned int               truesize;             /*   272     4 */
        atomic_t                   users;                /*   276     4 */

        /* size: 280, cachelines: 5, members: 62 */
        /* sum members: 270, holes: 3, sum holes: 10 */
        /* bit holes: 1, sum bit holes: 6 bits */
        /* last cacheline: 24 bytes */
};

It looks like there are some holes we might be able to capitalize on.  If we 
remove "secmark" (we can handle it inside a security blob) and move "protocol" 
to after the flags2 bit field we can make an aligned 8 byte hole for a 
security blob before "destructor".  According to pahole the structure size 
stays the same and the only field which moves to a different cacheline is 
"dma_cookie" which moves from cacheline 2 to 3.  Here is the pahole output:

struct sk_buff_test {
        struct sk_buff *           next;                 /*     0     8 */
        struct sk_buff *           prev;                 /*     8     8 */
        ktime_t                    tstamp;               /*    16     8 */
        struct sock *              sk;                   /*    24     8 */
        struct net_device *        dev;                  /*    32     8 */
        char                       cb[48];               /*    40    48 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
        long unsigned int          _skb_refdst;          /*    88     8 */
        struct sec_path *          sp;                   /*    96     8 */
        unsigned int               len;                  /*   104     4 */
        unsigned int               data_len;             /*   108     4 */
        __u16                      mac_len;              /*   112     2 */
        __u16                      hdr_len;              /*   114     2 */
        union {
                __wsum             csum;                 /*           4 */
                struct {
                        __u16      csum_start;           /*   116     2 */
                        __u16      csum_offset;          /*   118     2 */
                };                                       /*           4 */
        };                                               /*   116     4 */
        __u32                      priority;             /*   120     4 */
        int                        flags1_begin[0];      /*   124     0 */
        __u8                       local_df:1;           /*   124: 7  1 */
        __u8                       cloned:1;             /*   124: 6  1 */
        __u8                       ip_summed:2;          /*   124: 4  1 */
        __u8                       nohdr:1;              /*   124: 3  1 */
        __u8                       nfctinfo:3;           /*   124: 0  1 */
        __u8                       pkt_type:3;           /*   125: 5  1 */
        __u8                       fclone:2;             /*   125: 3  1 */
        __u8                       ipvs_property:1;      /*   125: 2  1 */
        __u8                       peeked:1;             /*   125: 1  1 */
        __u8                       nf_trace:1;           /*   125: 0  1 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        flags1_end[0];        /*   128     0 */
        void *                     security;             /*   128     8 */
        void                       (*destructor)(struct sk_buff *); /*   136     8 */
        struct nf_conntrack *      nfct;                 /*   144     8 */
        struct sk_buff *           nfct_reasm;           /*   152     8 */
        struct nf_bridge_info *    nf_bridge;            /*   160     8 */
        int                        skb_iif;              /*   168     4 */
        __u32                      rxhash;               /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */
        __u16                      tc_index;             /*   178     2 */
        __u16                      tc_verd;              /*   180     2 */
        __u16                      queue_mapping;        /*   182     2 */
        int                        flags2_begin[0];      /*   184     0 */
        __u8                       ndisc_nodetype:2;     /*   184: 6  1 */
        __u8                       pfmemalloc:1;         /*   184: 5  1 */
        __u8                       ooo_okay:1;           /*   184: 4  1 */
        __u8                       l4_rxhash:1;          /*   184: 3  1 */
        __u8                       wifi_acked_valid:1;   /*   184: 2  1 */
        __u8                       wifi_acked:1;         /*   184: 1  1 */
        __u8                       no_fcs:1;             /*   184: 0  1 */
        __u8                       head_frag:1;          /*   185: 7  1 */
        __u8                       encapsulation:1;      /*   185: 6  1 */

        /* XXX 6 bits hole, try to pack */
        /* XXX 2 bytes hole, try to pack */

        int                        flags2_end[0];        /*   188     0 */
        __be16                     protocol;             /*   188     2 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        dma_cookie_t               dma_cookie;           /*   192     4 */
        union {
                __u32              mark;                 /*           4 */
                __u32              dropcount;            /*           4 */
                __u32              reserved_tailroom;    /*           4 */
        };                                               /*   196     4 */
        sk_buff_data_t             inner_transport_header; /*   200     8 */
        sk_buff_data_t             inner_network_header; /*   208     8 */
        sk_buff_data_t             transport_header;     /*   216     8 */
        sk_buff_data_t             network_header;       /*   224     8 */
        sk_buff_data_t             mac_header;           /*   232     8 */
        sk_buff_data_t             tail;                 /*   240     8 */
        sk_buff_data_t             end;                  /*   248     8 */
        /* --- cacheline 4 boundary (256 bytes) --- */
        unsigned char *            head;                 /*   256     8 */
        unsigned char *            data;                 /*   264     8 */
        unsigned int               truesize;             /*   272     4 */
        atomic_t                   users;                /*   276     4 */

        /* size: 280, cachelines: 5, members: 62 */
        /* sum members: 274, holes: 3, sum holes: 6 */
        /* bit holes: 1, sum bit holes: 6 bits */
        /* last cacheline: 24 bytes */
};

As Casey already mentioned, if this isn't acceptable please help me understand 
why.
Paul Moore April 9, 2013, 1:33 p.m. UTC | #29
On Tuesday, April 09, 2013 09:19:30 AM Paul Moore wrote:
> On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> > On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> > >> I don't see that with adding 4 bytes. Again, I'm willing to be
> > >> educated if I'm wrong.
> > > 
> > > Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> > > on 64 bit arches. Show us your patch.
> > 
> > Recall that it's replacing an existing 4 byte value with an 8 byte value.
> > My compiler days were quite short and long ago, but it would seem that
> > an 8 byte value ought not have an 'align to 8 bytes' problem.
> > 
> > Again, I'm willing to be educated.
> 
> Armed with a cup of coffee I took a look at the sk_buff structure this
> morning with the pahole tool and using the current sk_buff if we turn on
> all the #ifdefs here is what I see on x86_64:
> 
> struct sk_buff {

...

>         /* size: 280, cachelines: 5, members: 62 */
>         /* sum members: 270, holes: 3, sum holes: 10 */
>         /* bit holes: 1, sum bit holes: 6 bits */
>         /* last cacheline: 24 bytes */
> };
> 
> It looks like there are some holes we might be able to capitalize on.  If we
> remove "secmark" (we can handle it inside a security blob) and move
> "protocol" to after the flags2 bit field we can make an aligned 8 byte hole
> for a security blob before "destructor".  According to pahole the structure
> size stays the same and the only field which moves to a different cacheline
> is "dma_cookie" which moves from cacheline 2 to 3.  Here is the pahole
> output:
> 
> struct sk_buff_test {

...

>         /* size: 280, cachelines: 5, members: 62 */
>         /* sum members: 274, holes: 3, sum holes: 6 */
>         /* bit holes: 1, sum bit holes: 6 bits */
>         /* last cacheline: 24 bytes */
> };
> 
> As Casey already mentioned, if this isn't acceptable please help me
> understand why.

For the sake of completeness I also checked out the changes when compiled for 
32 bit and it was very much the same; same structure size and in the 32 bit 
case no field movement from one cacheline to another.
Eric Dumazet April 9, 2013, 2 p.m. UTC | #30
On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:

> As Casey already mentioned, if this isn't acceptable please help me understand 
> why.
> 

You see something which is not the reality. If you do such analysis,
better do it properly, because any change you are going to submit will
be doubly checked by people who really care.

sizeof(sk_buff) is not 280 (it gets aligned to 320 because the cache line
size is 64).

Tell me what you have when doing:

ls -l /sys/kernel/slab/skbuff_head_cache

Do you really see 

$ ls -l /sys/kernel/slab/skbuff_head_cache
lrwxrwxrwx 1 root root 0 Apr  9 06:54 /sys/kernel/slab/skbuff_head_cache -> :t-0000320

Here I get :

$ ls -l /sys/kernel/slab/skbuff_head_cache
lrwxrwxrwx 1 root root 0 Apr  9 06:54 /sys/kernel/slab/skbuff_head_cache -> :t-0000256

because sizeof(sk_buff) <= 256



Ben Hutchings April 9, 2013, 2:05 p.m. UTC | #31
On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> > On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> > >> I don't see that with adding 4 bytes. Again, I'm willing to be
> > >> educated if I'm wrong.
> > > 
> > > Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> > > on 64 bit arches. Show us your patch.
> > 
> > Recall that it's replacing an existing 4 byte value with an 8 byte value.
> > My compiler days were quite short and long ago, but it would seem that
> > an 8 byte value ought not have an 'align to 8 bytes' problem.
> > 
> > Again, I'm willing to be educated.
> 
> Armed with a cup of coffee I took a look at the sk_buff structure this morning 
> with the pahole tool and using the current sk_buff if we turn on all the 
> #ifdefs here is what I see on x86_64:
> 
> struct sk_buff {
[...]
>         sk_buff_data_t             inner_transport_header; /*   200     8 */
>         sk_buff_data_t             inner_network_header; /*   208     8 */
>         sk_buff_data_t             transport_header;     /*   216     8 */
>         sk_buff_data_t             network_header;       /*   224     8 */
>         sk_buff_data_t             mac_header;           /*   232     8 */
>         sk_buff_data_t             tail;                 /*   240     8 */
>         sk_buff_data_t             end;                  /*   248     8 */
[...]

This is wrong; sk_buff_data_t is always 32-bit:

#if BITS_PER_LONG > 32
#define NET_SKBUFF_DATA_USES_OFFSET 1
#endif

#ifdef NET_SKBUFF_DATA_USES_OFFSET
typedef unsigned int sk_buff_data_t;
#else
typedef unsigned char *sk_buff_data_t;
#endif

Ben.
Paul Moore April 9, 2013, 2:10 p.m. UTC | #32
On Tuesday, April 09, 2013 03:05:42 PM Ben Hutchings wrote:
> On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > On Monday, April 08, 2013 06:24:59 PM Casey Schaufler wrote:
> > > On 4/8/2013 6:09 PM, Eric Dumazet wrote:
> > > > On Mon, 2013-04-08 at 17:59 -0700, Casey Schaufler wrote:
> > > >> I don't see that with adding 4 bytes. Again, I'm willing to be
> > > >> educated if I'm wrong.
> > > > 
> > > > Feel free to add 4 bytes without having the 'align to 8 bytes' problem
> > > > on 64 bit arches. Show us your patch.
> > > 
> > > Recall that it's replacing an existing 4 byte value with an 8 byte
> > > value.
> > > My compiler days were quite short and long ago, but it would seem that
> > > an 8 byte value ought not have an 'align to 8 bytes' problem.
> > > 
> > > Again, I'm willing to be educated.
> > 
> > Armed with a cup of coffee I took a look at the sk_buff structure this
> > morning with the pahole tool and using the current sk_buff if we turn on
> > all the #ifdefs here is what I see on x86_64:
> > 
> > struct sk_buff {
> 
> [...]
> 
> >         sk_buff_data_t             inner_transport_header; /*   200     8
> >         */
> >         sk_buff_data_t             inner_network_header; /*   208     8 */
> >         sk_buff_data_t             transport_header;     /*   216     8 */
> >         sk_buff_data_t             network_header;       /*   224     8 */
> >         sk_buff_data_t             mac_header;           /*   232     8 */
> >         sk_buff_data_t             tail;                 /*   240     8 */
> >         sk_buff_data_t             end;                  /*   248     8 */
> 
> [...]
> 
> This is wrong; sk_buff_data_t is always 32-bit:

Yep.  My mistake.

While looking at this a bit more after my original email I noticed the same 
thing.  Ultimately it doesn't change the size or cachelines as the 
sk_buff_data_t fields are at the bottom of sk_buff but here are the 
correct breakdowns:

ORIG:
struct sk_buff {
        struct sk_buff *           next;                 /*     0     8 */
        struct sk_buff *           prev;                 /*     8     8 */
        ktime_t                    tstamp;               /*    16     8 */
        struct sock *              sk;                   /*    24     8 */
        struct net_device *        dev;                  /*    32     8 */
        char                       cb[48];               /*    40    48 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
        long unsigned int          _skb_refdst;          /*    88     8 */
        struct sec_path *          sp;                   /*    96     8 */
        unsigned int               len;                  /*   104     4 */
        unsigned int               data_len;             /*   108     4 */
        __u16                      mac_len;              /*   112     2 */
        __u16                      hdr_len;              /*   114     2 */
        union {
                __wsum             csum;                 /*           4 */
                struct {
                        __u16      csum_start;           /*   116     2 */
                        __u16      csum_offset;          /*   118     2 */
                };                                       /*           4 */
        };                                               /*   116     4 */
        __u32                      priority;             /*   120     4 */
        int                        flags1_begin[0];      /*   124     0 */
        __u8                       local_df:1;           /*   124: 7  1 */
        __u8                       cloned:1;             /*   124: 6  1 */
        __u8                       ip_summed:2;          /*   124: 4  1 */
        __u8                       nohdr:1;              /*   124: 3  1 */
        __u8                       nfctinfo:3;           /*   124: 0  1 */
        __u8                       pkt_type:3;           /*   125: 5  1 */
        __u8                       fclone:2;             /*   125: 3  1 */
        __u8                       ipvs_property:1;      /*   125: 2  1 */
        __u8                       peeked:1;             /*   125: 1  1 */
        __u8                       nf_trace:1;           /*   125: 0  1 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        flags1_end[0];        /*   128     0 */
        __be16                     protocol;             /*   128     2 */

        /* XXX 6 bytes hole, try to pack */

        void                       (*destructor)(struct sk_buff *); /*   136     8 */
        struct nf_conntrack *      nfct;                 /*   144     8 */
        struct sk_buff *           nfct_reasm;           /*   152     8 */
        struct nf_bridge_info *    nf_bridge;            /*   160     8 */
        int                        skb_iif;              /*   168     4 */
        __u32                      rxhash;               /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */
        __u16                      tc_index;             /*   178     2 */
        __u16                      tc_verd;              /*   180     2 */
        __u16                      queue_mapping;        /*   182     2 */
        int                        flags2_begin[0];      /*   184     0 */
        __u8                       ndisc_nodetype:2;     /*   184: 6  1 */
        __u8                       pfmemalloc:1;         /*   184: 5  1 */
        __u8                       ooo_okay:1;           /*   184: 4  1 */
        __u8                       l4_rxhash:1;          /*   184: 3  1 */
        __u8                       wifi_acked_valid:1;   /*   184: 2  1 */
        __u8                       wifi_acked:1;         /*   184: 1  1 */
        __u8                       no_fcs:1;             /*   184: 0  1 */
        __u8                       head_frag:1;          /*   185: 7  1 */
        __u8                       encapsulation:1;      /*   185: 6  1 */

        /* XXX 6 bits hole, try to pack */
        /* XXX 2 bytes hole, try to pack */

        int                        flags2_end[0];        /*   188     0 */
        dma_cookie_t               dma_cookie;           /*   188     4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        __u32                      secmark;              /*   192     4 */
        union {
                __u32              mark;                 /*           4 */
                __u32              dropcount;            /*           4 */
                __u32              reserved_tailroom;    /*           4 */
        };                                               /*   196     4 */
        sk_buff_data_t             inner_transport_header; /*   200     4 */
        sk_buff_data_t             inner_network_header; /*   204     4 */
        sk_buff_data_t             transport_header;     /*   208     4 */
        sk_buff_data_t             network_header;       /*   212     4 */
        sk_buff_data_t             mac_header;           /*   216     4 */
        sk_buff_data_t             tail;                 /*   220     4 */
        sk_buff_data_t             end;                  /*   224     4 */

        /* XXX 4 bytes hole, try to pack */

        unsigned char *            head;                 /*   232     8 */
        unsigned char *            data;                 /*   240     8 */
        unsigned int               truesize;             /*   248     4 */
        atomic_t                   users;                /*   252     4 */
        /* --- cacheline 4 boundary (256 bytes) --- */

        /* size: 256, cachelines: 4, members: 62 */
        /* sum members: 242, holes: 4, sum holes: 14 */
        /* bit holes: 1, sum bit holes: 6 bits */
};

W/BLOB:
struct sk_buff_test {
        struct sk_buff *           next;                 /*     0     8 */
        struct sk_buff *           prev;                 /*     8     8 */
        ktime_t                    tstamp;               /*    16     8 */
        struct sock *              sk;                   /*    24     8 */
        struct net_device *        dev;                  /*    32     8 */
        char                       cb[48];               /*    40    48 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
        long unsigned int          _skb_refdst;          /*    88     8 */
        struct sec_path *          sp;                   /*    96     8 */
        unsigned int               len;                  /*   104     4 */
        unsigned int               data_len;             /*   108     4 */
        __u16                      mac_len;              /*   112     2 */
        __u16                      hdr_len;              /*   114     2 */
        union {
                __wsum             csum;                 /*           4 */
                struct {
                        __u16      csum_start;           /*   116     2 */
                        __u16      csum_offset;          /*   118     2 */
                };                                       /*           4 */
        };                                               /*   116     4 */
        __u32                      priority;             /*   120     4 */
        int                        flags1_begin[0];      /*   124     0 */
        __u8                       local_df:1;           /*   124: 7  1 */
        __u8                       cloned:1;             /*   124: 6  1 */
        __u8                       ip_summed:2;          /*   124: 4  1 */
        __u8                       nohdr:1;              /*   124: 3  1 */
        __u8                       nfctinfo:3;           /*   124: 0  1 */
        __u8                       pkt_type:3;           /*   125: 5  1 */
        __u8                       fclone:2;             /*   125: 3  1 */
        __u8                       ipvs_property:1;      /*   125: 2  1 */
        __u8                       peeked:1;             /*   125: 1  1 */
        __u8                       nf_trace:1;           /*   125: 0  1 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        flags1_end[0];        /*   128     0 */
        void *                     security;             /*   128     8 */
        void                       (*destructor)(struct sk_buff *); /*   136     8 */
        struct nf_conntrack *      nfct;                 /*   144     8 */
        struct sk_buff *           nfct_reasm;           /*   152     8 */
        struct nf_bridge_info *    nf_bridge;            /*   160     8 */
        int                        skb_iif;              /*   168     4 */
        __u32                      rxhash;               /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */
        __u16                      tc_index;             /*   178     2 */
        __u16                      tc_verd;              /*   180     2 */
        __u16                      queue_mapping;        /*   182     2 */
        int                        flags2_begin[0];      /*   184     0 */
        __u8                       ndisc_nodetype:2;     /*   184: 6  1 */
        __u8                       pfmemalloc:1;         /*   184: 5  1 */
        __u8                       ooo_okay:1;           /*   184: 4  1 */
        __u8                       l4_rxhash:1;          /*   184: 3  1 */
        __u8                       wifi_acked_valid:1;   /*   184: 2  1 */
        __u8                       wifi_acked:1;         /*   184: 1  1 */
        __u8                       no_fcs:1;             /*   184: 0  1 */
        __u8                       head_frag:1;          /*   185: 7  1 */
        __u8                       encapsulation:1;      /*   185: 6  1 */

        /* XXX 6 bits hole, try to pack */
        /* XXX 2 bytes hole, try to pack */

        int                        flags2_end[0];        /*   188     0 */
        __be16                     protocol;             /*   188     2 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        dma_cookie_t               dma_cookie;           /*   192     4 */
        union {
                __u32              mark;                 /*           4 */
                __u32              dropcount;            /*           4 */
                __u32              reserved_tailroom;    /*           4 */
        };                                               /*   196     4 */
        sk_buff_data_t             inner_transport_header; /*   200     4 */
        sk_buff_data_t             inner_network_header; /*   204     4 */
        sk_buff_data_t             transport_header;     /*   208     4 */
        sk_buff_data_t             network_header;       /*   212     4 */
        sk_buff_data_t             mac_header;           /*   216     4 */
        sk_buff_data_t             tail;                 /*   220     4 */
        sk_buff_data_t             end;                  /*   224     4 */

        /* XXX 4 bytes hole, try to pack */

        unsigned char *            head;                 /*   232     8 */
        unsigned char *            data;                 /*   240     8 */
        unsigned int               truesize;             /*   248     4 */
        atomic_t                   users;                /*   252     4 */
        /* --- cacheline 4 boundary (256 bytes) --- */

        /* size: 256, cachelines: 4, members: 62 */
        /* sum members: 246, holes: 4, sum holes: 10 */
        /* bit holes: 1, sum bit holes: 6 bits */
};
Paul Moore April 9, 2013, 2:19 p.m. UTC | #33
On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > As Casey already mentioned, if this isn't acceptable please help me
> > understand why.
> 
> You see something which is not the reality. If you do such analysis,
> better do it properly, because any change you are going to submit will
> be doubly checked by people who really care.

I am attempting to do it properly; I simply made a mistake.  Ben also pointed 
it out.  As you wrote yesterday, "Lets go forward".

After fixing the BITS_PER_LONG problem I looked at it again and it appears 
that by simply replacing the "secmark" field with a blob we retain the size of 
the sk_buff as well as the cacheline positions of all the fields, e.g. 
dma_cookie no longer moves cachelines.  Thoughts?

struct sk_buff_test {
        struct sk_buff *           next;                 /*     0     8 */
        struct sk_buff *           prev;                 /*     8     8 */
        ktime_t                    tstamp;               /*    16     8 */
        struct sock *              sk;                   /*    24     8 */
        struct net_device *        dev;                  /*    32     8 */
        char                       cb[48];               /*    40    48 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
        long unsigned int          _skb_refdst;          /*    88     8 */
        struct sec_path *          sp;                   /*    96     8 */
        unsigned int               len;                  /*   104     4 */
        unsigned int               data_len;             /*   108     4 */
        __u16                      mac_len;              /*   112     2 */
        __u16                      hdr_len;              /*   114     2 */
        union {
                __wsum             csum;                 /*           4 */
                struct {
                        __u16      csum_start;           /*   116     2 */
                        __u16      csum_offset;          /*   118     2 */
                };                                       /*           4 */
        };                                               /*   116     4 */
        __u32                      priority;             /*   120     4 */
        int                        flags1_begin[0];      /*   124     0 */
        __u8                       local_df:1;           /*   124: 7  1 */
        __u8                       cloned:1;             /*   124: 6  1 */
        __u8                       ip_summed:2;          /*   124: 4  1 */
        __u8                       nohdr:1;              /*   124: 3  1 */
        __u8                       nfctinfo:3;           /*   124: 0  1 */
        __u8                       pkt_type:3;           /*   125: 5  1 */
        __u8                       fclone:2;             /*   125: 3  1 */
        __u8                       ipvs_property:1;      /*   125: 2  1 */
        __u8                       peeked:1;             /*   125: 1  1 */
        __u8                       nf_trace:1;           /*   125: 0  1 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        flags1_end[0];        /*   128     0 */
        __be16                     protocol;             /*   128     2 */

        /* XXX 6 bytes hole, try to pack */

        void                       (*destructor)(struct sk_buff *); /*   136     8 */
        struct nf_conntrack *      nfct;                 /*   144     8 */
        struct sk_buff *           nfct_reasm;           /*   152     8 */
        struct nf_bridge_info *    nf_bridge;            /*   160     8 */
        int                        skb_iif;              /*   168     4 */
        __u32                      rxhash;               /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */
        __u16                      tc_index;             /*   178     2 */
        __u16                      tc_verd;              /*   180     2 */
        __u16                      queue_mapping;        /*   182     2 */
        int                        flags2_begin[0];      /*   184     0 */
        __u8                       ndisc_nodetype:2;     /*   184: 6  1 */
        __u8                       pfmemalloc:1;         /*   184: 5  1 */
        __u8                       ooo_okay:1;           /*   184: 4  1 */
        __u8                       l4_rxhash:1;          /*   184: 3  1 */
        __u8                       wifi_acked_valid:1;   /*   184: 2  1 */
        __u8                       wifi_acked:1;         /*   184: 1  1 */
        __u8                       no_fcs:1;             /*   184: 0  1 */
        __u8                       head_frag:1;          /*   185: 7  1 */
        __u8                       encapsulation:1;      /*   185: 6  1 */

        /* XXX 6 bits hole, try to pack */
        /* XXX 2 bytes hole, try to pack */

        int                        flags2_end[0];        /*   188     0 */
        dma_cookie_t               dma_cookie;           /*   188     4 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        void *                     security;             /*   192     8 */
        union {
                __u32              mark;                 /*           4 */
                __u32              dropcount;            /*           4 */
                __u32              reserved_tailroom;    /*           4 */
        };                                               /*   200     4 */
        sk_buff_data_t             inner_transport_header; /*   204     4 */
        sk_buff_data_t             inner_network_header; /*   208     4 */
        sk_buff_data_t             transport_header;     /*   212     4 */
        sk_buff_data_t             network_header;       /*   216     4 */
        sk_buff_data_t             mac_header;           /*   220     4 */
        sk_buff_data_t             tail;                 /*   224     4 */
        sk_buff_data_t             end;                  /*   228     4 */
        unsigned char *            head;                 /*   232     8 */
        unsigned char *            data;                 /*   240     8 */
        unsigned int               truesize;             /*   248     4 */
        atomic_t                   users;                /*   252     4 */
        /* --- cacheline 4 boundary (256 bytes) --- */

        /* size: 256, cachelines: 4, members: 62 */
        /* sum members: 246, holes: 3, sum holes: 10 */
        /* bit holes: 1, sum bit holes: 6 bits */
};
Eric Dumazet April 9, 2013, 2:31 p.m. UTC | #34
On Tue, 2013-04-09 at 10:19 -0400, Paul Moore wrote:
> On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> > On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > > As Casey already mentioned, if this isn't acceptable please help me
> > > understand why.
> > 
> > You see something which is not the reality. If you do such analysis,
> > better do it properly, because any change you are going to submit will
> > be doubly checked by people who really care.
> 
> I am attempting to do it properly, I simply made a mistake.  Ben also pointed 
> it out.  As you wrote yesterday, "Lets go forward".
> 
> After fixing the BITS_PER_LONG problem I looked at it again and it appears 
> that by simply replacing the "secmark" field with a blob we retain the size of 
> the sk_buff as well as the cacheline positions of all the fields, e.g. 
> dma_cookie no longer moves cachelines.  Thoughts?

If you take a look at recent history of changes on sk_buff, you can see
we added very recently fields for encapsulation support. These were
absolutely wanted for modern operations at datacenter level.

This effort might still need new room, so I prefer not filling sk_buff
right now.

Take a look at the cloned sk_buff. We need an extra atomic_t at the end,
so if we make sk_buff bigger than 0xf8 bytes, fclone_cache will use an
extra cache line as well. Not a big deal, but RPC workloads like netperf
-t TCP_RR will probably show a regression.

ls -l /sys/kernel/slab/skbuff_fclone_cache



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Moore April 9, 2013, 2:52 p.m. UTC | #35
On Tuesday, April 09, 2013 07:31:04 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 10:19 -0400, Paul Moore wrote:
> > On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> > > On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > > > As Casey already mentioned, if this isn't acceptable please help me
> > > > understand why.
> > > 
> > > You see something which is not the reality. If you do such analysis,
> > > better do it properly, because any change you are going to submit will
> > > be doubly checked by people who really care.
> > 
> > I am attempting to do it properly, I simply made a mistake.  Ben also
> > pointed it out.  As you wrote yesterday, "Lets go forward".
> > 
> > After fixing the BITS_PER_LONG problem I looked at it again and it appears
> > that by simply replacing the "secmark" field with a blob we retain the
> > size of the sk_buff as well as the cacheline positions of all the fields,
> > e.g. dma_cookie no longer moves cachelines.  Thoughts?
> 
> If you take a look at recent history of changes on sk_buff, you can see
> we added very recently fields for encapsulation support. These were
> absolutely wanted for modern operations at datacenter level.
> 
> This effort might still need new room, so I prefer not filling sk_buff
> right now.

Has anyone proposed any additional encapsulation patches which need additional 
fields in the sk_buff?  Are you aware of any additional encapsulation patches 
which are in progress?  When would you consider it "safe"?

> Take a look at the cloned sk_buff. We need an extra atomic_t at the end,
> so if make sk_buff bigger than 0xf8 bytes,  fclone_cache will use an
> extra cache line as well. Not a big deal, but RPC workloads like netperf
> -t TCP_RR will probably show a regression.
> 
> ls -l /sys/kernel/slab/skbuff_fclone_cache

Perhaps I'm misunderstanding, but these comments above only apply if we were 
to increase the size of the sk_buff struct, yes?  What I proposed, replacing 
"secmark" with a blob, does not currently change the size of the sk_buff 
struct so the performance and memory usage should remain unchanged as well.
Paul Moore April 9, 2013, 3:05 p.m. UTC | #36
On Tuesday, April 09, 2013 10:52:17 AM Paul Moore wrote:
> On Tuesday, April 09, 2013 07:31:04 AM Eric Dumazet wrote:
> > On Tue, 2013-04-09 at 10:19 -0400, Paul Moore wrote:
> > > On Tuesday, April 09, 2013 07:00:22 AM Eric Dumazet wrote:
> > > > On Tue, 2013-04-09 at 09:19 -0400, Paul Moore wrote:
> > > > > As Casey already mentioned, if this isn't acceptable please help me
> > > > > understand why.
> > > > 
> > > > You see something which is not the reality. If you do such analysis,
> > > > better do it properly, because any change you are going to submit will
> > > > be doubly checked by people who really care.
> > > 
> > > I am attempting to do it properly, I simply made a mistake.  Ben also
> > > pointed it out.  As you wrote yesterday, "Lets go forward".
> > > 
> > > After fixing the BITS_PER_LONG problem I looked at it again and it
> > > appears
> > > that by simply replacing the "secmark" field with a blob we retain the
> > > size of the sk_buff as well as the cacheline positions of all the
> > > fields,
> > > e.g. dma_cookie no longer moves cachelines.  Thoughts?
> > 
> > If you take a look at recent history of changes on sk_buff, you can see
> > we added very recently fields for encapsulation support. These were
> > absolutely wanted for modern operations at datacenter level.
> > 
> > This effort might still need new room, so I prefer not filling sk_buff
> > right now.
> 
> Has anyone proposed any additional encapsulation patches which need
> additional fields in the sk_buff?  Are you aware of any additional
> encapsulation patches which are in progress?  When would you consider it
> "safe"?

Another thought.

If we move the "protocol" field after the flags2 bitfield, remove "secmark", 
and add the blob before/after "destructor" then we preserve the existing four 
byte hole in cacheline #3 for potential use by the encapsulation folks and 
move the dma_cookie into cacheline #3 as well which may actually be a good 
thing (corrections welcome on this comment).  The overall size remains the 
same at 256 bytes.

struct sk_buff_test {
        struct sk_buff *           next;                 /*     0     8 */
        struct sk_buff *           prev;                 /*     8     8 */
        ktime_t                    tstamp;               /*    16     8 */
        struct sock *              sk;                   /*    24     8 */
        struct net_device *        dev;                  /*    32     8 */
        char                       cb[48];               /*    40    48 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
        long unsigned int          _skb_refdst;          /*    88     8 */
        struct sec_path *          sp;                   /*    96     8 */
        unsigned int               len;                  /*   104     4 */
        unsigned int               data_len;             /*   108     4 */
        __u16                      mac_len;              /*   112     2 */
        __u16                      hdr_len;              /*   114     2 */
        union {
                __wsum             csum;                 /*           4 */
                struct {
                        __u16      csum_start;           /*   116     2 */
                        __u16      csum_offset;          /*   118     2 */
                };                                       /*           4 */
        };                                               /*   116     4 */
        __u32                      priority;             /*   120     4 */
        int                        flags1_begin[0];      /*   124     0 */
        __u8                       local_df:1;           /*   124: 7  1 */
        __u8                       cloned:1;             /*   124: 6  1 */
        __u8                       ip_summed:2;          /*   124: 4  1 */
        __u8                       nohdr:1;              /*   124: 3  1 */
        __u8                       nfctinfo:3;           /*   124: 0  1 */
        __u8                       pkt_type:3;           /*   125: 5  1 */
        __u8                       fclone:2;             /*   125: 3  1 */
        __u8                       ipvs_property:1;      /*   125: 2  1 */
        __u8                       peeked:1;             /*   125: 1  1 */
        __u8                       nf_trace:1;           /*   125: 0  1 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 2 boundary (128 bytes) --- */
        int                        flags1_end[0];        /*   128     0 */
        void *                     security;             /*   128     8 */
        void                       (*destructor)(struct sk_buff *); /*   136     8 */
        struct nf_conntrack *      nfct;                 /*   144     8 */
        struct sk_buff *           nfct_reasm;           /*   152     8 */
        struct nf_bridge_info *    nf_bridge;            /*   160     8 */
        int                        skb_iif;              /*   168     4 */
        __u32                      rxhash;               /*   172     4 */
        __u16                      vlan_tci;             /*   176     2 */
        __u16                      tc_index;             /*   178     2 */
        __u16                      tc_verd;              /*   180     2 */
        __u16                      queue_mapping;        /*   182     2 */
        int                        flags2_begin[0];      /*   184     0 */
        __u8                       ndisc_nodetype:2;     /*   184: 6  1 */
        __u8                       pfmemalloc:1;         /*   184: 5  1 */
        __u8                       ooo_okay:1;           /*   184: 4  1 */
        __u8                       l4_rxhash:1;          /*   184: 3  1 */
        __u8                       wifi_acked_valid:1;   /*   184: 2  1 */
        __u8                       wifi_acked:1;         /*   184: 1  1 */
        __u8                       no_fcs:1;             /*   184: 0  1 */
        __u8                       head_frag:1;          /*   185: 7  1 */
        __u8                       encapsulation:1;      /*   185: 6  1 */

        /* XXX 6 bits hole, try to pack */
        /* XXX 2 bytes hole, try to pack */

        int                        flags2_end[0];        /*   188     0 */
        __be16                     protocol;             /*   188     2 */

        /* XXX 2 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        dma_cookie_t               dma_cookie;           /*   192     4 */
        union {
                __u32              mark;                 /*           4 */
                __u32              dropcount;            /*           4 */
                __u32              reserved_tailroom;    /*           4 */
        };                                               /*   196     4 */
        sk_buff_data_t             inner_transport_header; /*   200     4 */
        sk_buff_data_t             inner_network_header; /*   204     4 */
        sk_buff_data_t             transport_header;     /*   208     4 */
        sk_buff_data_t             network_header;       /*   212     4 */
        sk_buff_data_t             mac_header;           /*   216     4 */
        sk_buff_data_t             tail;                 /*   220     4 */
        sk_buff_data_t             end;                  /*   224     4 */

        /* XXX 4 bytes hole, try to pack */

        unsigned char *            head;                 /*   232     8 */
        unsigned char *            data;                 /*   240     8 */
        unsigned int               truesize;             /*   248     4 */
        atomic_t                   users;                /*   252     4 */
        /* --- cacheline 4 boundary (256 bytes) --- */

        /* size: 256, cachelines: 4, members: 62 */
        /* sum members: 246, holes: 4, sum holes: 10 */
        /* bit holes: 1, sum bit holes: 6 bits */
};
Eric Dumazet April 9, 2013, 3:07 p.m. UTC | #37
On Tue, 2013-04-09 at 10:52 -0400, Paul Moore wrote:

> 
> Perhaps I'm misunderstanding, but these comments above only apply if we were 
> to increase the size of the sk_buff struct, yes?  What I proposed, replacing 
> "secmark" with a blob, does not currently change the size of the sk_buff 
> struct so the performance and memory usage should remain unchanged as well.
> 

If the blob size is 4 bytes, that's fine.

If not, read again my mail.
Paul Moore April 9, 2013, 3:17 p.m. UTC | #38
On Tuesday, April 09, 2013 08:07:06 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 10:52 -0400, Paul Moore wrote:
> > Perhaps I'm misunderstanding, but these comments above only apply if we
> > were to increase the size of the sk_buff struct, yes?  What I proposed,
> > replacing "secmark" with a blob, does not currently change the size of
> > the sk_buff struct so the performance and memory usage should remain
> > unchanged as well.
>
> If blob size is 4 bytes, thats fine.
> 
> If not, read again my mail.

The "blob" is a void pointer, so 8 bytes.  We're talking about removing the 
"secmark" field (4 bytes) and adding a void pointer (8 bytes).  I've shown 
several different approaches that make this change without increasing the 
overall size of the sk_buff struct.

One of the proposals makes use of the existing holes in the third cacheline to 
make the change without any increase in size, misalignment, or cacheline 
changes.  You were concerned that at some point in the future, the hardware 
encapsulation developers *might* want to add another field.

Taking your comments into consideration I just made another proposal which 
preserves the overall size of the sk_buff struct, as well as the 4 byte hole 
in the third cacheline (for possible use by hw encapsulation folks at some 
later date).  It does shift the "dma_cookie" field from the second to the 
third cacheline, but considering the fields already in the third cacheline 
this may be a good thing.

To the best of my knowledge, all of the proposals I've posted this morning do 
not change the size of the sk_buff so the cloned sk_buff 
functionality/performance/etc. should not be affected.  If that is not the 
case, please let me know because I'm currently at a loss (even after re-
reading your email).
Eric Dumazet April 9, 2013, 3:32 p.m. UTC | #39
On Tue, 2013-04-09 at 11:17 -0400, Paul Moore wrote:

> The "blob" is a void pointer, so 8 bytes.  We're talking about removing the 
> "secmark" field (4 bytes) and adding a void pointer (8 bytes).  I've shown 
> several different approaches that make this change without increasing the 
> overall size of the sk_buff struct.

You want to use 4 extra bytes in sk_buff. You'll have to show us why we
should close the way for other valid uses of the current holes.

I have no idea why it's needed, and why it can't be solved in another
way.

It looks like _I_ have to do your work. Sorry, I have no more time to
spend on this topic. You'll have to convince David, not me.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Moore April 9, 2013, 3:57 p.m. UTC | #40
On Tuesday, April 09, 2013 08:32:56 AM Eric Dumazet wrote:
> On Tue, 2013-04-09 at 11:17 -0400, Paul Moore wrote:
> > The "blob" is a void pointer, so 8 bytes.  We're talking about removing
> > the "secmark" field (4 bytes) and adding a void pointer (8 bytes).  I've
> > shown several different approaches that make this change without
> > increasing the overall size of the sk_buff struct.
> 
> You want to use 4 extra bytes in sk_buff. You'll have to show us why we
> should close off the way for other valid uses of the current holes.
> 
> I have no idea why it's needed, and why it can't be solved in another
> way.

FWIW, I was focusing on arriving at a basic design that addressed the initial 
reasons for not including a security blob in a sk_buff.  In the beginning I 
thought it was both the need for LSM hooks in the skb management routines as 
well as the memory overhead in the skb itself.  During the course of our 
discussion it became clear that the hooks were acceptable, it was the memory 
overhead that was the concern, so that is what I (and Casey) focused on.

Based on your latest comment, it appears that we have some possible candidates 
for adding a security blob (void *) to the sk_buff that address your technical 
arguments.  I wasn't aware we had reached that point, but it is indeed good 
news.  Now we just need to make our case that it is the "Right Thing to Do", 
which is perfectly reasonable.

> It looks like _I_ have to do your work.

I don't believe I ever asked you to do anything other than to repost a patch 
you posted to the LSM list so we could get it included upstream, a patch that 
you created to counter my proposed fix for an SELinux regression.  Further, I 
tested your patch, and ACK'd it earlier this morning.

I suppose I also asked you to explain/clarify a few of your technical 
objections a bit further so I could address them, but that just seems like 
normal peer design review.

> Sorry, I have no more time to spend on this topic. You'll have to convince
> David, not me.

Well, thank you for your time; I'm sure we'll get to talk about this again in 
the future.  It looks like we've had enough of a conversation now that I can 
start working on some patches.
Casey Schaufler April 9, 2013, 4:11 p.m. UTC | #41
On 4/9/2013 8:32 AM, Eric Dumazet wrote:
> On Tue, 2013-04-09 at 11:17 -0400, Paul Moore wrote:
>
>> The "blob" is a void pointer, so 8 bytes.  We're talking about removing the 
>> "secmark" field (4 bytes) and adding a void pointer (8 bytes).  I've shown 
>> several different approaches that make this change without increasing the 
>> overall size of the sk_buff struct.
> You want to use 4 extra bytes in sk_buff.

So far, so good.

> You'll have to show us why we
> should close off the way for other valid uses of the current holes.

We have what looks to us like a very legitimate
use for a 4 byte hole, we have that use defined,
and we have it today. In fact, we've had it for the
last five years. The 4 byte secmark (instead of an
8 byte blob pointer) has impacted design choices
since the LSM broke out of the SELinux monopoly.
Not having a blob *will* negatively impact the
forthcoming multiple concurrent LSM support. That
is not hypothetical, that is concrete and very
much a problem.

> I have no idea why it's needed, and why it can't be solved in another
> way.

Ah, let's see if I can't help there.

We have two LSMs today and a third on the way that
potentially want to use the secmark. SELinux uses the
secmark, Smack may pick it up for IPv6 support in the
not too distant future, and AppArmor appears to be
considering it as well.

Each of these LSMs uses (or will use) a 4 byte secid
to represent the security information about a process
or other relevant entity. This is what gets put into
the secmark. The secid maps to the real security information.

If I have two LSMs, say SELinux and AppArmor, I have a
real problem. I have 8 bytes of data and a 4 byte
secmark. If I have SELinux, Smack and AppArmor I have
12 bytes to stuff into a 4 byte bag. To top it off,
these are already indirect references to the security
data, not pointers to the data.  What I have to try to
do is fit three of these 4-byte references into 4 bytes.  Not good.

If we replace secmark with secblob, I don't have to
use the secid indirection if there is only one LSM.
If there are multiple LSMs the secblob contains a
pointer to a master blob, and it's still faster than
the secmark indirection.

The alternative is to restrict the use of secmarks to
one LSM and let all the others figure out some other
method for communicating the security information.
I don't see that as a great choice.

> It looks like _I_ have to do your work. Sorry, I have no more time to
> spend on this topic. You'll have to convince David, not me.

David Miller April 9, 2013, 4:56 p.m. UTC | #42
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 09 Apr 2013 08:32:56 -0700

> It looks like _I_ have to do your work. Sorry, I have no more time to
> spend on this topic. You'll have to convince David, not me.

I already have no interest in considering these changes seriously,
and I've already asked Paul multiple times to drop this idea.
Paul Moore April 9, 2013, 5 p.m. UTC | #43
On Tuesday, April 09, 2013 12:56:35 PM David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 09 Apr 2013 08:32:56 -0700
> 
> > It looks like _I_ have to do your work. Sorry, I have no more time to
> > spend on this topic. You'll have to convince David, not me.
> 
> I already have no interest in considering these changes seriously,
> and I've already asked Paul multiple times to drop this idea.

If we address all of your technical concerns, why are you not interested in 
allowing a security blob in the sk_buff?
David Miller April 9, 2013, 5:09 p.m. UTC | #44
From: Paul Moore <pmoore@redhat.com>
Date: Tue, 09 Apr 2013 13:00:06 -0400

> On Tuesday, April 09, 2013 12:56:35 PM David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Tue, 09 Apr 2013 08:32:56 -0700
>> 
>> > It looks like _I_ have to do your work. Sorry, I have no more time to
>> > spend on this topic. You'll have to convince David, not me.
>> 
>> I already have no interest in considering these changes seriously,
>> and I've already asked Paul multiple times to drop this idea.
> 
> If we address all of your technical concerns, why are you not interested in 
> allowing a security blob in the sk_buff?

If you have a patch that removes space, rather than adds space, to sk_buff,
then we can talk.

Otherwise, please don't waste our time, because I'm not adding anything to
sk_buff for LSM's sake.
David Miller April 9, 2013, 5:10 p.m. UTC | #45
And please, until you have a real patch, stop wasting our time.

Like Eric, as you can imagine, I have hundreds of other people's patches
to review and reports to analyze.

Your issue is not so important compared to all of the other ones
that it should consume such an unbalanced amount of my time, sorry.

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5d0b438..23cc295 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2705,6 +2705,8 @@  struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 		dst_release(dst);
 		return NULL;
 	}
+	skb->sk = sk;
+
 	/* Reserve space for headers. */
 	skb_reserve(skb, MAX_TCP_HEADER);
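For context on David's destructor objection: the kernel already has a helper, skb_set_owner_w(), that pairs the skb->sk assignment with a sock_wfree destructor and write-queue accounting, so the skb cannot outlive the socket with a stale pointer. A sketch of that alternative follows; note it reintroduces the sk_wmem_alloc atomic that commit 90ba9b19 removed, which is exactly the trade-off in dispute, and it is not necessarily the fix that was ultimately merged.

```c
/* Sketch only: allocate the SYNACK skb and attach the sock with a
 * proper destructor instead of a bare skb->sk assignment. */
static struct sk_buff *synack_alloc_sketch(struct sock *sk, gfp_t gfp)
{
	struct sk_buff *skb = alloc_skb(MAX_TCP_HEADER + 15, gfp);

	if (!skb)
		return NULL;
	/* skb_set_owner_w() sets skb->sk, charges sk_wmem_alloc, and
	 * installs sock_wfree as the destructor, so the socket
	 * reference is released when the skb is freed. */
	skb_set_owner_w(skb, sk);
	/* Reserve space for headers, as tcp_make_synack() does. */
	skb_reserve(skb, MAX_TCP_HEADER);
	return skb;
}
```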