[v2] kcm: remove any offset before parsing messages

Message ID 1536657703-27577-1-git-send-email-asmadeus@codewreck.org
State Changes Requested
Delegated to: David Miller
Headers show
Series
  • [v2] kcm: remove any offset before parsing messages
Related show

Commit Message

Dominique Martinet Sept. 11, 2018, 9:21 a.m.
The current code assumes kcm users know they need to look for the
strparser offset within their bpf program, which is not documented
anywhere and examples laying around do not do.

The actual recv function does handle the offset well, so we can create a
temporary clone of the skb and pull that one up as required for parsing.

The pull itself has a cost if we are pulling beyond the head data,
measured to 2-3% latency in a noisy VM with a local client stressing
that path. The clone's impact seemed too small to measure.

This bug can be exhibited easily by implementing a "trivial" kcm parser
taking the first bytes as size, and on the client sending at least two
such packets in a single write().

Note that bpf sockmap has the same problem, both for parse and for recv,
so it would pulling twice or a real pull within the strparser logic if
anyone cares about that.

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---

Hmm, while trying to benchmark this, I sometimes got hangs in
kcm_wait_data() for the last packet somehow?
The sender program was done (exited (zombie) so I assumed the sender
socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
indicating it parsed a header but there was no skb to peek at?
But the sock is locked so this shouldn't be racy...

I can get it fairly often with this patch and small messages with an
offset, but I think it's just because the pull changes some timing - I
can't hit it with just the clone, and I can hit it with a pull without
clone as well.... And I don't see how pulling a cloned skb can impact
the original socket, but I'm a bit fuzzy on this.

(it's interesting that I didn't seem to hit this race when pulling in
strparser, that shouldn't be any different)


I'll look at that a bit more, but there have been no activity here for
a while and I don't have the energy to keep pushing in the strparser
direction, so take this patch more as a change of direction and a bit
as a 'bump' on the subject - I think it's important but I have too much
on my plate right now to keep pushing if there is no interest from the
authors.

 net/kcm/kcmsock.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Comments

Dominique Martinet Sept. 12, 2018, 5:36 a.m. | #1
Dominique Martinet wrote on Tue, Sep 11, 2018:
> Hmm, while trying to benchmark this, I sometimes got hangs in
> kcm_wait_data() for the last packet somehow?
> The sender program was done (exited (zombie) so I assumed the sender
> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
> indicating it parsed a header but there was no skb to peek at?
> But the sock is locked so this shouldn't be racy...
> 
> I can get it fairly often with this patch and small messages with an
> offset, but I think it's just because the pull changes some timing - I
> can't hit it with just the clone, and I can hit it with a pull without
> clone as well.... And I don't see how pulling a cloned skb can impact
> the original socket, but I'm a bit fuzzy on this.

This is weird, I cannot reproduce at all without that pull, even if I
add another delay there instead of the pull, so it's not just timing...

I was confusing kcm_recvmsg and kcm_rcv_strparser but I still think
there is a race in kcm_wait_data between the peek and the wait. I have
confirmed on a hang that the sock's sk_receive_queue is not empty so
skb_peek would return something at this point, but it is waiting in
kcm_wait_data's sk_wait_data.. And no event would be coming at this
point since the sender is done.

kcm_wait_data does have the socket lock but the enqueuing counterpart
(kcm_queue_rcv_skb) is apparently called without lock most of the time
(at least, sk->sk_lock.owned is not set most of the times) ; but if I
add a lock_sock() here it can deadlock when a kfree_skb triggers
kcm_rcv_ready if that kfree_skb was done with the lock... ugh.

I thought the mux rx lock would be taken all the time but that appear
not to be the case either, and even if it were I don't see how to make
sk_wait_data with another lock in the first place.


So meh, I'm not sure why the pull messes up with this, I'm tempted to
say this is an old problem but I'm quite annoyed I do not seem to be
able to reproduce it.
I'm really out of time now so unless someone cares it'll wait for a few
more weeks.


(As an unrelated note, wrt to the patch, it might be nice to add a new
kcm socket option so users could say "my bpf program handles offsets,
don't bother with this")
David Miller Sept. 18, 2018, 1:45 a.m. | #2
From: Dominique Martinet <asmadeus@codewreck.org>
Date: Wed, 12 Sep 2018 07:36:42 +0200

> Dominique Martinet wrote on Tue, Sep 11, 2018:
>> Hmm, while trying to benchmark this, I sometimes got hangs in
>> kcm_wait_data() for the last packet somehow?
>> The sender program was done (exited (zombie) so I assumed the sender
>> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
>> indicating it parsed a header but there was no skb to peek at?
>> But the sock is locked so this shouldn't be racy...
>> 
>> I can get it fairly often with this patch and small messages with an
>> offset, but I think it's just because the pull changes some timing - I
>> can't hit it with just the clone, and I can hit it with a pull without
>> clone as well.... And I don't see how pulling a cloned skb can impact
>> the original socket, but I'm a bit fuzzy on this.
> 
> This is weird, I cannot reproduce at all without that pull, even if I
> add another delay there instead of the pull, so it's not just timing...

I really can't apply this patch until you resolve this.

It is weird, given your description, though...
Dominique Martinet Sept. 18, 2018, 1:57 a.m. | #3
David Miller wrote on Mon, Sep 17, 2018:
> From: Dominique Martinet <asmadeus@codewreck.org>
> Date: Wed, 12 Sep 2018 07:36:42 +0200
> 
> > Dominique Martinet wrote on Tue, Sep 11, 2018:
> >> Hmm, while trying to benchmark this, I sometimes got hangs in
> >> kcm_wait_data() for the last packet somehow?
> >> The sender program was done (exited (zombie) so I assumed the sender
> >> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
> >> indicating it parsed a header but there was no skb to peek at?
> >> But the sock is locked so this shouldn't be racy...
> >> 
> >> I can get it fairly often with this patch and small messages with an
> >> offset, but I think it's just because the pull changes some timing - I
> >> can't hit it with just the clone, and I can hit it with a pull without
> >> clone as well.... And I don't see how pulling a cloned skb can impact
> >> the original socket, but I'm a bit fuzzy on this.
> > 
> > This is weird, I cannot reproduce at all without that pull, even if I
> > add another delay there instead of the pull, so it's not just timing...
> 
> I really can't apply this patch until you resolve this.
> 
> It is weird, given your description, though...

Thanks for the reminder! I totally agree with you here and did not
expect this to be merged as it is (in retrospect, I probably should have
written something to that extent in the subject, "RFC"?)

I really don't have much time to give to that right now as I'm doing
this on my freetime, and the lack of reply has been rather demotivating
so it got pushed back a few times...
Given you did reply now I'll try to spend some time to figure that out
in the next couple of weeks but it might not make it for this cycle
depending on the number of rc we'll get and time you want this to soak
it -next.


(I can start by putting the pull back in netparser and try to reproduce,
it's really weird that I never got it to happen at the time...)
David Miller Sept. 18, 2018, 2:40 a.m. | #4
From: Dominique Martinet <asmadeus@codewreck.org>
Date: Tue, 18 Sep 2018 03:57:23 +0200

> Given you did reply now I'll try to spend some time to figure that out
> in the next couple of weeks but it might not make it for this cycle
> depending on the number of rc we'll get and time you want this to soak
> it -next.

Great.

Remind me, is there actually any way for the bpf programs run in this
situation to even _see_ strp_msg(skb)->offset at all?

There isn't right?  And the alternate proposal was to add such a
facility, right?

Just trying to remember all of the context, maybe it's good
information to add to the commit message?
Dominique Martinet Sept. 18, 2018, 2:45 a.m. | #5
David Miller wrote on Mon, Sep 17, 2018:
> Remind me, is there actually any way for the bpf programs run in this
> situation to even _see_ strp_msg(skb)->offset at all?

No, they can see it, so it's possible to make a KCM program that works
right now if you are careful (I'm not sure why the offset within bpf is
different from the offset in the kernel though, it looks like the bpf
program skips the qos part of the control buffer)

> There isn't right?  And the alternate proposal was to add such a
> facility, right?

The problem is that this isn't documented at all, and I could not find
any example doing that until Dave gave me one (I couldn't get it to work
because of the different offset).

The alternate proposal was to just document it, yes.

> Just trying to remember all of the context, maybe it's good
> information to add to the commit message?

Good idea, I'll add some more explanation there.
David Miller Sept. 18, 2018, 2:51 a.m. | #6
From: Dominique Martinet <asmadeus@codewreck.org>
Date: Tue, 18 Sep 2018 04:45:36 +0200

> David Miller wrote on Mon, Sep 17, 2018:
>> Remind me, is there actually any way for the bpf programs run in this
>> situation to even _see_ strp_msg(skb)->offset at all?
> 
> No, they can see it, so it's possible to make a KCM program that works
> right now if you are careful (I'm not sure why the offset within bpf is
> different from the offset in the kernel though, it looks like the bpf
> program skips the qos part of the control buffer)

What helper is used in the BPF program to get this offset value?

(also good info to add to the commit message)
Dominique Martinet Sept. 18, 2018, 2:58 a.m. | #7
David Miller wrote on Mon, Sep 17, 2018:
> > No, they can see it, so it's possible to make a KCM program that works
> > right now if you are careful (I'm not sure why the offset within bpf is
> > different from the offset in the kernel though, it looks like the bpf
> > program skips the qos part of the control buffer)
> 
> What helper is used in the BPF program to get this offset value?
> 
> (also good info to add to the commit message)

Dave defined one himself ; for a simple protocol where the offset is in
the first four bytes of the message.

The whole bpf program could look like this:

------
struct kcm_rx_msg { int full_len; int offset; };
static inline struct kcm_rx_msg *kcm_rx_msg(struct __sk_buff *skb) {
	return (struct kcm_rx_msg *)skb->cb;
}
int decode_framing(struct __sk_buff *skb) {
	return load_word(skb, kcm_rx_msg(skb)->offset);
}
------

If we go towards documenting it, adding a helper would be useful yes;
buf if we pull that becomes unnecessary.
(I'll add the example program in the commit message anyway at your
suggestion)
Dominique Martinet Oct. 31, 2018, 2:56 a.m. | #8
Dominique Martinet wrote on Tue, Sep 18, 2018:
> David Miller wrote on Mon, Sep 17, 2018:
> > From: Dominique Martinet <asmadeus@codewreck.org>
> > Date: Wed, 12 Sep 2018 07:36:42 +0200
> > > Dominique Martinet wrote on Tue, Sep 11, 2018:
> > >> Hmm, while trying to benchmark this, I sometimes got hangs in
> > >> kcm_wait_data() for the last packet somehow?
> > >> The sender program was done (exited (zombie) so I assumed the sender
> > >> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
> > >> indicating it parsed a header but there was no skb to peek at?
> > >> But the sock is locked so this shouldn't be racy...
> > >> 
> > >> I can get it fairly often with this patch and small messages with an
> > >> offset, but I think it's just because the pull changes some timing - I
> > >> can't hit it with just the clone, and I can hit it with a pull without
> > >> clone as well.... And I don't see how pulling a cloned skb can impact
> > >> the original socket, but I'm a bit fuzzy on this.
> > > 
> > > This is weird, I cannot reproduce at all without that pull, even if I
> > > add another delay there instead of the pull, so it's not just timing...
> > 
> > I really can't apply this patch until you resolve this.
> > 
> > It is weird, given your description, though...
> 
> Thanks for the reminder! I totally agree with you here and did not
> expect this to be merged as it is (in retrospect, I probably should have
> written something to that extent in the subject, "RFC"?)

Found the issue after some trouble reproducing on other VM, long story
short:
 - I was blaming kcm_wait_data's sk_wait_data to wait while there was
something in sk->sk_receive_queue, but after adding a fake timeout and
some debug messages I can see the receive queue is empty.
However going back up from the kcm_sock to the kcm_mux to the kcm_psock,
there are things in the psock's socket's receive_queue... (If I'm
following the code correctly, that would be the underlying tcp socket)

 - that psock's strparser contains some hints: the interrupted and
stopped bits are set. strp->interrupted looks like it's only set if
kcm_parse_msg returns something < 0. . .
And surely enough, the skb_pull returns NULL iff there's such a hang...!

I might be tempted to send a patch to strparser to add a pr_debug
message in strp_abort_strp...

Anyway, that probably explains I have no problem with bigger VM
(uselessly more memory available) or without KASAN (I guess there's
overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB
of ram, so if there's an allocation failure there I think there's a
problem ! . . .

So, well, I'm not sure on the way forward. Adding a bpf helper and
document that kcm users should mind the offset?

Thanks,
Dominique Martinet Feb. 15, 2019, 1 a.m. | #9
Dominique Martinet wrote on Wed, Oct 31, 2018:
> Anyway, that probably explains I have no problem with bigger VM
> (uselessly more memory available) or without KASAN (I guess there's
> overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB
> of ram, so if there's an allocation failure there I think there's a
> problem ! . . .
> 
> So, well, I'm not sure on the way forward. Adding a bpf helper and
> document that kcm users should mind the offset?

bump on this - I had mostly forgotten about it but the nfs-ganesha
community that could make use of KCM reminded me of my patch that's
waiting for this.

Summary for people coming back after four months:
 - kcm is great, but the bpf function that's supposed to be called for
each packet does not automatically adjust the offset so that it can
assume the skb starts with the packet it needs to look at

 - there is some workaround code that is far from obvious and
undocumented, see the original thread[1]:
[1] http://lkml.kernel.org/r/20180822183852.jnwlxnz54gbbf6po@davejwatson-mba.dhcp.thefacebook.com

 - my patch here tried to automatically pull the corresponding packet to
the front, but apparently with KASAN can trigger out of memory
behaviours on "small" VMs, so even if it doesn't seem to impact
performance much without KASAN I don't think it's really ok to
potentially hang the connection due to oom under severe conditions.


The best alternative I see is adding a proper helper to get
"kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
lost as I have been; I'm not quite sure how/where to add such a helper
though as I've barely looked at the bpf code until now, but should we go
for that?


(it really feels wrong to me that some random person who just started by
trying to use kcm has to put this much effort to keep the ball rolling,
if nobody cares about kcm I'm also open to have it removed completely
despite the obvious performance gain I benchmarked for ganesha[2] ;
barely maintained feature is worse than no feature)

[2] https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/421314

Thanks,
Tom Herbert Feb. 15, 2019, 1:20 a.m. | #10
On Thu, Feb 14, 2019 at 5:00 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Dominique Martinet wrote on Wed, Oct 31, 2018:
> > Anyway, that probably explains I have no problem with bigger VM
> > (uselessly more memory available) or without KASAN (I guess there's
> > overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB
> > of ram, so if there's an allocation failure there I think there's a
> > problem ! . . .
> >
> > So, well, I'm not sure on the way forward. Adding a bpf helper and
> > document that kcm users should mind the offset?
>
> bump on this - I had mostly forgotten about it but the nfs-ganesha
> community that could make use of KCM reminded me of my patch that's
> waiting for this.
>
> Summary for people coming back after four months:
>  - kcm is great, but the bpf function that's supposed to be called for
> each packet does not automatically adjust the offset so that it can
> assume the skb starts with the packet it needs to look at
>
>  - there is some workaround code that is far from obvious and
> undocumented, see the original thread[1]:
> [1] http://lkml.kernel.org/r/20180822183852.jnwlxnz54gbbf6po@davejwatson-mba.dhcp.thefacebook.com
>
>  - my patch here tried to automatically pull the corresponding packet to
> the front, but apparently with KASAN can trigger out of memory
> behaviours on "small" VMs, so even if it doesn't seem to impact
> performance much without KASAN I don't think it's really ok to
> potentially hang the connection due to oom under severe conditions.
>
>
> The best alternative I see is adding a proper helper to get
> "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> lost as I have been; I'm not quite sure how/where to add such a helper
> though as I've barely looked at the bpf code until now, but should we go
> for that?

Dominique,

Thanks for looking into this.

I'd rather not complicate the bpf code for this. Can we just always do
an pskb_pull after skb_clone?

Tom

>
>
> (it really feels wrong to me that some random person who just started by
> trying to use kcm has to put this much effort to keep the ball rolling,
> if nobody cares about kcm I'm also open to have it removed completely
> despite the obvious performance gain I benchmarked for ganesha[2] ;
> barely maintained feature is worse than no feature)
>
> [2] https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/421314
>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 15, 2019, 1:57 a.m. | #11
Tom Herbert wrote on Thu, Feb 14, 2019:
> > The best alternative I see is adding a proper helper to get
> > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> > lost as I have been; I'm not quite sure how/where to add such a helper
> > though as I've barely looked at the bpf code until now, but should we go
> > for that?
> 
> I'd rather not complicate the bpf code for this. Can we just always do
> an pskb_pull after skb_clone?

Which skb_clone are you thinking of?
If you're referring to the one in strparser, that was my very first
approach here[1], but Dordon shot it down saying that this is not an
strparser bug but a kcm bug since there are ways for users to properly
get the offset and use it -- and the ktls code does it right.

Frankly, my opinion still is that it'd be better in strparser because
there also is some bad use in bpf sockmap (they never look at the offset
either, while kcm uses it for rcv but not for parse), but looking at it
from an optimal performance point of view I agree the user can make
better decision than strparser so I understand where he comes from.


This second patch[2] (the current thread) now does an extra clone if
there is an offset, but the problem really isn't in the clone but the
pull itself that can fail and return NULL when there is memory pressure.
For some reason I hadn't been able to reproduce that behaviour with
strparser doing the pull, but I assume it would also be possible to hit
in extreme situations, I'm not sure...

So anyway, we basically have three choices that I can see:
 - push harder on strparser and go back to my first patch ; it's simple
and makes using strparser easier/safer but has a small overhead for
ktls, which uses the current strparser implementation correctly.
I hadn't been able to get this to fail with KASAN last time I tried but
I assume it should still be possible somehow.

 - the current patch, that I could only get to fail with KASAN, but does
complexify kcm a bit; this also does not fix bpf sockmap at all.
Would still require to fix the hang, so make strparser retry a few times
if strp->cb.parse_msg failed maybe? Or at least get the error back to
userspace somehow.

 - my last suggestion to document the kcm behaviour, and if possible add
a bpf helper to make proper usage easier ; this will make kcm harder to
use on end users but it's better than not being documented and seeing
random unappropriate lengths being parsed.



[1] first strparser patch
http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmadeus@codewreck.org
[2] current thread's patch
http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmadeus@codewreck.org


Thanks,
Tom Herbert Feb. 15, 2019, 2:48 a.m. | #12
On Thu, Feb 14, 2019 at 5:57 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Thu, Feb 14, 2019:
> > > The best alternative I see is adding a proper helper to get
> > > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> > > lost as I have been; I'm not quite sure how/where to add such a helper
> > > though as I've barely looked at the bpf code until now, but should we go
> > > for that?
> >
> > I'd rather not complicate the bpf code for this. Can we just always do
> > an pskb_pull after skb_clone?
>
> Which skb_clone are you thinking of?
> If you're referring to the one in strparser, that was my very first
> approach here[1], but Dordon shot it down saying that this is not an
> strparser bug but a kcm bug since there are ways for users to properly
> get the offset and use it -- and the ktls code does it right.
>
> Frankly, my opinion still is that it'd be better in strparser because
> there also is some bad use in bpf sockmap (they never look at the offset
> either, while kcm uses it for rcv but not for parse), but looking at it
> from an optimal performance point of view I agree the user can make
> better decision than strparser so I understand where he comes from.
>
>
> This second patch[2] (the current thread) now does an extra clone if
> there is an offset, but the problem really isn't in the clone but the
> pull itself that can fail and return NULL when there is memory pressure.
> For some reason I hadn't been able to reproduce that behaviour with
> strparser doing the pull, but I assume it would also be possible to hit
> in extreme situations, I'm not sure...
>
This option looks the best to me, at least as a way to fix the issue
without requiring a change to the API. If the pull fails, doesn't that
just mean that the parser fails? Is there some behavior with this
patch that is not being handled gracefully?

Thanks,
Tom

> So anyway, we basically have three choices that I can see:
>  - push harder on strparser and go back to my first patch ; it's simple
> and makes using strparser easier/safer but has a small overhead for
> ktls, which uses the current strparser implementation correctly.
> I hadn't been able to get this to fail with KASAN last time I tried but
> I assume it should still be possible somehow.
>
>  - the current patch, that I could only get to fail with KASAN, but does
> complexify kcm a bit; this also does not fix bpf sockmap at all.
> Would still require to fix the hang, so make strparser retry a few times
> if strp->cb.parse_msg failed maybe? Or at least get the error back to
> userspace somehow.
>
>  - my last suggestion to document the kcm behaviour, and if possible add
> a bpf helper to make proper usage easier ; this will make kcm harder to
> use on end users but it's better than not being documented and seeing
> random unappropriate lengths being parsed.
>
>
>
> [1] first strparser patch
> http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmadeus@codewreck.org
> [2] current thread's patch
> http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmadeus@codewreck.org
>
>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 15, 2019, 3:31 a.m. | #13
Tom Herbert wrote on Thu, Feb 14, 2019:
> > This second patch[2] (the current thread) now does an extra clone if
> > there is an offset, but the problem really isn't in the clone but the
> > pull itself that can fail and return NULL when there is memory pressure.
> > For some reason I hadn't been able to reproduce that behaviour with
> > strparser doing the pull, but I assume it would also be possible to hit
> > in extreme situations, I'm not sure...
>
> This option looks the best to me, at least as a way to fix the issue
> without requiring a change to the API. If the pull fails, doesn't that
> just mean that the parser fails? Is there some behavior with this
> patch that is not being handled gracefully?

Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
all: from a user point of view, the connection just hangs (recvmsg never
returns), without so much as a message in dmesg either.

It took me a while to figure out what failed exactly as I did indeed
expect strparser/kcm to handle that better, but ultimately as things
stand if the parser fails it calls strp_parser_err() with the error
which ends up in strp_abort_strp that should call
sk->sk_error_report(sk) but in kcm case sk is the csk and I guess
failing csk does not make a pending recv on the kcm sock to fail...

I'm not sure whether propagating the error to the upper socket is the
right thing to do, kcm is meant to be able to work with multiple
underlying sockets so I feel we must be cautious about that, but
strparser might be able to retry somehow.
This is what I said below:
> > [,,,]
> >  - the current patch, that I could only get to fail with KASAN, but does
> > complexify kcm a bit; this also does not fix bpf sockmap at all.
> > Would still require to fix the hang, so make strparser retry a few times
> > if strp->cb.parse_msg failed maybe? Or at least get the error back to
> > userspace somehow.

Thanks,
Tom Herbert Feb. 15, 2019, 4:01 a.m. | #14
On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Thu, Feb 14, 2019:
> > > This second patch[2] (the current thread) now does an extra clone if
> > > there is an offset, but the problem really isn't in the clone but the
> > > pull itself that can fail and return NULL when there is memory pressure.
> > > For some reason I hadn't been able to reproduce that behaviour with
> > > strparser doing the pull, but I assume it would also be possible to hit
> > > in extreme situations, I'm not sure...
> >
> > This option looks the best to me, at least as a way to fix the issue
> > without requiring a change to the API. If the pull fails, doesn't that
> > just mean that the parser fails? Is there some behavior with this
> > patch that is not being handled gracefully?
>
> Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
> all: from a user point of view, the connection just hangs (recvmsg never
> returns), without so much as a message in dmesg either.
>
Dominique,

That's by design. Here is the description in kcm.txt:

"When a TCP socket is attached to a KCM multiplexor data ready
(POLLIN) and write space available (POLLOUT) events are handled by the
multiplexor. If there is a state change (disconnection) or other error
on a TCP socket, an error is posted on the TCP socket so that a
POLLERR event happens and KCM discontinues using the socket. When the
application gets the error notification for a TCP socket, it should
unattach the socket from KCM and then handle the error condition (the
typical response is to close the socket and create a new connection if
necessary)."

> It took me a while to figure out what failed exactly as I did indeed
> expect strparser/kcm to handle that better, but ultimately as things
> stand if the parser fails it calls strp_parser_err() with the error
> which ends up in strp_abort_strp that should call
> sk->sk_error_report(sk) but in kcm case sk is the csk and I guess
> failing csk does not make a pending recv on the kcm sock to fail...
>
> I'm not sure whether propagating the error to the upper socket is the
> right thing to do, kcm is meant to be able to work with multiple
> underlying sockets so I feel we must be cautious about that, but

Right, that's the motivation for the design.

> strparser might be able to retry somehow.

We could retry on -ENOMEM since it is potentially a transient error,
but generally for errors (like connection is simply broken) it seems
like it's working as intended. I suppose we could add a socket option
to fail the KCM socket when all attached lower sockets have failed,
but I not sure that would be a significant benefit for additional
complexity.

> This is what I said below:
> > > [,,,]
> > >  - the current patch, that I could only get to fail with KASAN, but does
> > > complexify kcm a bit; this also does not fix bpf sockmap at all.
> > > Would still require to fix the hang, so make strparser retry a few times
> > > if strp->cb.parse_msg failed maybe? Or at least get the error back to
> > > userspace somehow.

The error should be getting to userspace via the TCP socket.

Tom

>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 15, 2019, 4:52 a.m. | #15
Tom Herbert wrote on Thu, Feb 14, 2019:
> On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet
> <asmadeus@codewreck.org> wrote:
> > Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
> > all: from a user point of view, the connection just hangs (recvmsg never
> > returns), without so much as a message in dmesg either.
>
> That's by design. Here is the description in kcm.txt:
> 
> "When a TCP socket is attached to a KCM multiplexor data ready
> (POLLIN) and write space available (POLLOUT) events are handled by the
> multiplexor. If there is a state change (disconnection) or other error
> on a TCP socket, an error is posted on the TCP socket so that a
> POLLERR event happens and KCM discontinues using the socket. When the
> application gets the error notification for a TCP socket, it should
> unattach the socket from KCM and then handle the error condition (the
> typical response is to close the socket and create a new connection if
> necessary)."

Sigh, that's what I get for relying on pieces of code found on the
internet.

It does make "trivial" kcm sockets difficult to use though, the old
ganesha code I adapted to kcm was the horrible (naive?) kind spawning
one thread per connection and just waiting on read(), so making it just
create a kcm socket from the tcp one and wait on recvmsg() until an
error pops up does not seem an option.

That being said I'm a bit confused, I thought part of the point of kcm
was the multiplexing so a simple server could just keep attaching tcp
sockets to a single kcm socket and only have a single trivial loop
reading from the kcm socket ; but I guess there's no harm in also
looking for POLLERR on the tcp sockets... It would still need to close
them once clients disconnect I guess, this really only affects my naive
server.

> > strparser might be able to retry somehow.
> 
> We could retry on -ENOMEM since it is potentially a transient error,

Yes, I think we should aim to retry on -ENOMEM; I agree all errors are
not meant to be retried on but there already are different cases based
on what the parse cb returned; but that can be a different patch.

> but generally for errors (like connection is simply broken) it seems
> like it's working as intended. I suppose we could add a socket option
> to fail the KCM socket when all attached lower sockets have failed,
> but I not sure that would be a significant benefit for additional
> complexity.

Right, I agree it's probably not worth doing, now I'm aware of this I
can probably motivate myself to change this old code to use epoll
properly.

I think it could make sense to have this option for simpler programs,
but as things stand I guess we can require kcm users to do this much,
thanks for the explanation, and sorry for having failed to notice it.



With all that said I guess my patch should work correctly then, I'll try
to find some time to check the error does come back up the tcp socket in
my reproducer but I have no reason to believe it doesn't.

I'd like to see some retry on ENOMEM before this is merged though, so
while I'm there I'll resend this with a second patch doing that
retry,.. I think just not setting strp->interrupted and not reporting
the error up might be enough? Will have to try either way.



Thanks for the feedback,

Patch

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 571d824e4e24..36c438b95955 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -381,8 +381,32 @@  static int kcm_parse_func_strparser(struct strparser *strp, struct sk_buff *skb)
 {
 	struct kcm_psock *psock = container_of(strp, struct kcm_psock, strp);
 	struct bpf_prog *prog = psock->bpf_prog;
+	struct sk_buff *clone_skb = NULL;
+	struct strp_msg *stm;
+	int rc;
+
+	stm = strp_msg(skb);
+	if (stm->offset) {
+		skb = clone_skb = skb_clone(skb, GFP_ATOMIC);
+		if (!clone_skb)
+			return -ENOMEM;
+
+		if (!pskb_pull(clone_skb, stm->offset)) {
+			rc = -ENOMEM;
+			goto out;
+		}
+
+		/* reset cloned skb's offset for bpf programs using it */
+		stm = strp_msg(clone_skb);
+		stm->offset = 0;
+	}
+
+	rc = (*prog->bpf_func)(skb, prog->insnsi);
+out:
+	if (clone_skb)
+		kfree_skb(clone_skb);
 
-	return (*prog->bpf_func)(skb, prog->insnsi);
+	return rc;
 }
 
 static int kcm_read_sock_done(struct strparser *strp, int err)