
[v2] kcm: remove any offset before parsing messages

Message ID 1536657703-27577-1-git-send-email-asmadeus@codewreck.org
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Dominique Martinet Sept. 11, 2018, 9:21 a.m. UTC
The current code assumes kcm users know they need to look for the
strparser offset within their bpf program, which is not documented
anywhere and which the examples lying around do not do.

The actual recv function does handle the offset well, so we can create a
temporary clone of the skb and pull that one up as required for parsing.

The pull itself has a cost if we are pulling beyond the head data,
measured at 2-3% latency in a noisy VM with a local client stressing
that path. The clone's impact seemed too small to measure.

This bug can be exhibited easily by implementing a "trivial" kcm parser
taking the first bytes as size, and on the client sending at least two
such packets in a single write().

Note that bpf sockmap has the same problem, both for parse and for recv,
so it would need either pulling twice, or a real pull within the
strparser logic, if anyone cares about that.

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
---

Hmm, while trying to benchmark this, I sometimes got hangs in
kcm_wait_data() for the last packet somehow?
The sender program was done (exited, zombie, so I assumed the sender
socket had flushed), but the receiver was in kcm_wait_data in kcm_recvmsg,
indicating it had parsed a header but there was no skb to peek at?
But the sock is locked so this shouldn't be racy...

I can get it fairly often with this patch and small messages with an
offset, but I think it's just because the pull changes some timing - I
can't hit it with just the clone, and I can hit it with a pull without
the clone as well... And I don't see how pulling a cloned skb can impact
the original socket, but I'm a bit fuzzy on this.

(it's interesting that I didn't seem to hit this race when pulling in
strparser, that shouldn't be any different)


I'll look at that a bit more, but there has been no activity here for
a while and I don't have the energy to keep pushing in the strparser
direction, so take this patch more as a change of direction and a bit
as a 'bump' on the subject - I think it's important, but I have too much
on my plate right now to keep pushing if there is no interest from the
authors.

 net/kcm/kcmsock.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Comments

Dominique Martinet Sept. 12, 2018, 5:36 a.m. UTC | #1
Dominique Martinet wrote on Tue, Sep 11, 2018:
> Hmm, while trying to benchmark this, I sometimes got hangs in
> kcm_wait_data() for the last packet somehow?
> The sender program was done (exited (zombie) so I assumed the sender
> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
> indicating it parsed a header but there was no skb to peek at?
> But the sock is locked so this shouldn't be racy...
> 
> I can get it fairly often with this patch and small messages with an
> offset, but I think it's just because the pull changes some timing - I
> can't hit it with just the clone, and I can hit it with a pull without
> clone as well.... And I don't see how pulling a cloned skb can impact
> the original socket, but I'm a bit fuzzy on this.

This is weird: I cannot reproduce it at all without that pull, even if I
add another delay there instead of the pull, so it's not just timing...

I was confusing kcm_recvmsg and kcm_rcv_strparser, but I still think
there is a race in kcm_wait_data between the peek and the wait. I have
confirmed on a hang that the sock's sk_receive_queue is not empty, so
skb_peek would return something at this point, but it is waiting in
kcm_wait_data's sk_wait_data... And no event would be coming at this
point since the sender is done.

kcm_wait_data does have the socket lock, but the enqueuing counterpart
(kcm_queue_rcv_skb) is apparently called without the lock most of the
time (at least, sk->sk_lock.owned is not set most of the time); but if I
add a lock_sock() here it can deadlock when a kfree_skb triggers
kcm_rcv_ready, if that kfree_skb was done with the lock held... ugh.

I thought the mux rx lock would be taken all the time, but that appears
not to be the case either, and even if it were I don't see how to make
sk_wait_data work with another lock in the first place.


So meh, I'm not sure why the pull messes with this; I'm tempted to
say this is an old problem, but I'm quite annoyed that I do not seem to
be able to reproduce it without the pull.
I'm really out of time now so unless someone cares it'll wait for a few
more weeks.


(As an unrelated note, with regard to the patch, it might be nice to add
a new kcm socket option so users could say "my bpf program handles
offsets, don't bother with this")
David Miller Sept. 18, 2018, 1:45 a.m. UTC | #2
From: Dominique Martinet <asmadeus@codewreck.org>
Date: Wed, 12 Sep 2018 07:36:42 +0200

> Dominique Martinet wrote on Tue, Sep 11, 2018:
>> Hmm, while trying to benchmark this, I sometimes got hangs in
>> kcm_wait_data() for the last packet somehow?
>> The sender program was done (exited (zombie) so I assumed the sender
>> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
>> indicating it parsed a header but there was no skb to peek at?
>> But the sock is locked so this shouldn't be racy...
>> 
>> I can get it fairly often with this patch and small messages with an
>> offset, but I think it's just because the pull changes some timing - I
>> can't hit it with just the clone, and I can hit it with a pull without
>> clone as well.... And I don't see how pulling a cloned skb can impact
>> the original socket, but I'm a bit fuzzy on this.
> 
> This is weird, I cannot reproduce at all without that pull, even if I
> add another delay there instead of the pull, so it's not just timing...

I really can't apply this patch until you resolve this.

It is weird, given your description, though...
Dominique Martinet Sept. 18, 2018, 1:57 a.m. UTC | #3
David Miller wrote on Mon, Sep 17, 2018:
> From: Dominique Martinet <asmadeus@codewreck.org>
> Date: Wed, 12 Sep 2018 07:36:42 +0200
> 
> > Dominique Martinet wrote on Tue, Sep 11, 2018:
> >> Hmm, while trying to benchmark this, I sometimes got hangs in
> >> kcm_wait_data() for the last packet somehow?
> >> The sender program was done (exited (zombie) so I assumed the sender
> >> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
> >> indicating it parsed a header but there was no skb to peek at?
> >> But the sock is locked so this shouldn't be racy...
> >> 
> >> I can get it fairly often with this patch and small messages with an
> >> offset, but I think it's just because the pull changes some timing - I
> >> can't hit it with just the clone, and I can hit it with a pull without
> >> clone as well.... And I don't see how pulling a cloned skb can impact
> >> the original socket, but I'm a bit fuzzy on this.
> > 
> > This is weird, I cannot reproduce at all without that pull, even if I
> > add another delay there instead of the pull, so it's not just timing...
> 
> I really can't apply this patch until you resolve this.
> 
> It is weird, given your description, though...

Thanks for the reminder! I totally agree with you here and did not
expect this to be merged as it is (in retrospect, I probably should have
written something to that effect in the subject, "RFC"?)

I really don't have much time to give to this right now as I'm doing it
in my free time, and the lack of reply has been rather demotivating, so
it got pushed back a few times...
Given that you did reply now, I'll try to spend some time figuring it
out in the next couple of weeks, but it might not make it for this cycle
depending on the number of rcs we get and how long you want this to soak
in -next.


(I can start by putting the pull back in strparser and trying to
reproduce; it's really weird that I never got it to happen at the time...)
David Miller Sept. 18, 2018, 2:40 a.m. UTC | #4
From: Dominique Martinet <asmadeus@codewreck.org>
Date: Tue, 18 Sep 2018 03:57:23 +0200

> Given you did reply now I'll try to spend some time to figure that out
> in the next couple of weeks but it might not make it for this cycle
> depending on the number of rc we'll get and time you want this to soak
> it -next.

Great.

Remind me, is there actually any way for the bpf programs run in this
situation to even _see_ strp_msg(skb)->offset at all?

There isn't right?  And the alternate proposal was to add such a
facility, right?

Just trying to remember all of the context, maybe it's good
information to add to the commit message?
Dominique Martinet Sept. 18, 2018, 2:45 a.m. UTC | #5
David Miller wrote on Mon, Sep 17, 2018:
> Remind me, is there actually any way for the bpf programs run in this
> situation to even _see_ strp_msg(skb)->offset at all?

Actually, they can see it, so it's possible to write a KCM program that
works right now if you are careful (I'm not sure why the offset within
bpf is different from the offset in the kernel though; it looks like the
bpf program skips the qos part of the control buffer)

> There isn't right?  And the alternate proposal was to add such a
> facility, right?

The problem is that this isn't documented at all, and I could not find
any example doing that until Dave gave me one (I couldn't get it to work
because of the different offset).

The alternate proposal was to just document it, yes.

> Just trying to remember all of the context, maybe it's good
> information to add to the commit message?

Good idea, I'll add some more explanation there.
David Miller Sept. 18, 2018, 2:51 a.m. UTC | #6
From: Dominique Martinet <asmadeus@codewreck.org>
Date: Tue, 18 Sep 2018 04:45:36 +0200

> David Miller wrote on Mon, Sep 17, 2018:
>> Remind me, is there actually any way for the bpf programs run in this
>> situation to even _see_ strp_msg(skb)->offset at all?
> 
> No, they can see it, so it's possible to make a KCM program that works
> right now if you are careful (I'm not sure why the offset within bpf is
> different from the offset in the kernel though, it looks like the bpf
> program skips the qos part of the control buffer)

What helper is used in the BPF program to get this offset value?

(also good info to add to the commit message)
Dominique Martinet Sept. 18, 2018, 2:58 a.m. UTC | #7
David Miller wrote on Mon, Sep 17, 2018:
> > No, they can see it, so it's possible to make a KCM program that works
> > right now if you are careful (I'm not sure why the offset within bpf is
> > different from the offset in the kernel though, it looks like the bpf
> > program skips the qos part of the control buffer)
> 
> What helper is used in the BPF program to get this offset value?
> 
> (also good info to add to the commit message)

Dave defined one himself, for a simple protocol where the length is in
the first four bytes of the message.

The whole bpf program could look like this:

------
struct kcm_rx_msg { int full_len; int offset; };
static inline struct kcm_rx_msg *kcm_rx_msg(struct __sk_buff *skb) {
	return (struct kcm_rx_msg *)skb->cb;
}
int decode_framing(struct __sk_buff *skb) {
	return load_word(skb, kcm_rx_msg(skb)->offset);
}
------

If we go towards documenting it, adding a helper would be useful, yes;
but if we pull, that becomes unnecessary.
(I'll add the example program to the commit message anyway at your
suggestion)
Dominique Martinet Oct. 31, 2018, 2:56 a.m. UTC | #8
Dominique Martinet wrote on Tue, Sep 18, 2018:
> David Miller wrote on Mon, Sep 17, 2018:
> > From: Dominique Martinet <asmadeus@codewreck.org>
> > Date: Wed, 12 Sep 2018 07:36:42 +0200
> > > Dominique Martinet wrote on Tue, Sep 11, 2018:
> > >> Hmm, while trying to benchmark this, I sometimes got hangs in
> > >> kcm_wait_data() for the last packet somehow?
> > >> The sender program was done (exited (zombie) so I assumed the sender
> > >> socket flushed), but the receiver was in kcm_wait_data in kcm_recvmsg
> > >> indicating it parsed a header but there was no skb to peek at?
> > >> But the sock is locked so this shouldn't be racy...
> > >> 
> > >> I can get it fairly often with this patch and small messages with an
> > >> offset, but I think it's just because the pull changes some timing - I
> > >> can't hit it with just the clone, and I can hit it with a pull without
> > >> clone as well.... And I don't see how pulling a cloned skb can impact
> > >> the original socket, but I'm a bit fuzzy on this.
> > > 
> > > This is weird, I cannot reproduce at all without that pull, even if I
> > > add another delay there instead of the pull, so it's not just timing...
> > 
> > I really can't apply this patch until you resolve this.
> > 
> > It is weird, given your description, though...
> 
> Thanks for the reminder! I totally agree with you here and did not
> expect this to be merged as it is (in retrospect, I probably should have
> written something to that extent in the subject, "RFC"?)

Found the issue after some trouble reproducing on another VM; long story
short:
 - I was blaming kcm_wait_data's sk_wait_data for waiting while there was
something in sk->sk_receive_queue, but after adding a fake timeout and
some debug messages I can see the receive queue is empty.
However, going back up from the kcm_sock to the kcm_mux to the kcm_psock,
there are things in the psock's socket's receive_queue... (If I'm
following the code correctly, that would be the underlying tcp socket)

 - that psock's strparser contains some hints: the interrupted and
stopped bits are set. strp->interrupted looks like it's only set if
kcm_parse_msg returns something < 0...
And sure enough, the skb_pull returns NULL exactly when there's such a
hang!

I might be tempted to send a patch to strparser to add a pr_debug
message in strp_abort_strp...

Anyway, that probably explains why I have no problem with a bigger VM
(more memory available than needed) or without KASAN (I guess there's
overhead?), but I'm sending at most 300k of data and the VM has 1.5GB
of RAM, so if there's an allocation failure there, I think there's a
problem!

So, well, I'm not sure of the way forward. Adding a bpf helper and
documenting that kcm users should mind the offset?

Thanks,
Dominique Martinet Feb. 15, 2019, 1 a.m. UTC | #9
Dominique Martinet wrote on Wed, Oct 31, 2018:
> Anyway, that probably explains I have no problem with bigger VM
> (uselessly more memory available) or without KASAN (I guess there's
> overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB
> of ram, so if there's an allocation failure there I think there's a
> problem ! . . .
> 
> So, well, I'm not sure on the way forward. Adding a bpf helper and
> document that kcm users should mind the offset?

bump on this - I had mostly forgotten about it but the nfs-ganesha
community that could make use of KCM reminded me of my patch that's
waiting for this.

Summary for people coming back after four months:
 - kcm is great, but the skb is not adjusted before the bpf function
is called for each packet, so the program cannot assume the skb starts
with the packet it needs to look at

 - there is some workaround code that is far from obvious and
undocumented, see the original thread[1]:
[1] http://lkml.kernel.org/r/20180822183852.jnwlxnz54gbbf6po@davejwatson-mba.dhcp.thefacebook.com

 - my patch here tried to automatically pull the corresponding packet to
the front, but with KASAN it can apparently trigger out-of-memory
behaviour on "small" VMs, so even if it doesn't seem to impact
performance much without KASAN, I don't think it's really ok to
potentially hang the connection due to OOM under severe conditions.


The best alternative I see is adding a proper helper to get
"kcm_rx_msg(skb)->offset" from bpf and documenting it so users aren't as
lost as I have been; I'm not quite sure how/where to add such a helper
though, as I've barely looked at the bpf code until now, but should we
go for that?


(it really feels wrong to me that some random person who just started
trying to use kcm has to put in this much effort to keep the ball
rolling; if nobody cares about kcm I'm also open to having it removed
completely, despite the obvious performance gain I benchmarked for
ganesha[2] - a barely maintained feature is worse than no feature)

[2] https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/421314

Thanks,
Tom Herbert Feb. 15, 2019, 1:20 a.m. UTC | #10
On Thu, Feb 14, 2019 at 5:00 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Dominique Martinet wrote on Wed, Oct 31, 2018:
> > Anyway, that probably explains I have no problem with bigger VM
> > (uselessly more memory available) or without KASAN (I guess there's
> > overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB
> > of ram, so if there's an allocation failure there I think there's a
> > problem ! . . .
> >
> > So, well, I'm not sure on the way forward. Adding a bpf helper and
> > document that kcm users should mind the offset?
>
> bump on this - I had mostly forgotten about it but the nfs-ganesha
> community that could make use of KCM reminded me of my patch that's
> waiting for this.
>
> Summary for people coming back after four months:
>  - kcm is great, but the bpf function that's supposed to be called for
> each packet does not automatically adjust the offset so that it can
> assume the skb starts with the packet it needs to look at
>
>  - there is some workaround code that is far from obvious and
> undocumented, see the original thread[1]:
> [1] http://lkml.kernel.org/r/20180822183852.jnwlxnz54gbbf6po@davejwatson-mba.dhcp.thefacebook.com
>
>  - my patch here tried to automatically pull the corresponding packet to
> the front, but apparently with KASAN can trigger out of memory
> behaviours on "small" VMs, so even if it doesn't seem to impact
> performance much without KASAN I don't think it's really ok to
> potentially hang the connection due to oom under severe conditions.
>
>
> The best alternative I see is adding a proper helper to get
> "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> lost as I have been; I'm not quite sure how/where to add such a helper
> though as I've barely looked at the bpf code until now, but should we go
> for that?

Dominique,

Thanks for looking into this.

I'd rather not complicate the bpf code for this. Can we just always do
a pskb_pull after the skb_clone?

Tom

>
>
> (it really feels wrong to me that some random person who just started by
> trying to use kcm has to put this much effort to keep the ball rolling,
> if nobody cares about kcm I'm also open to have it removed completely
> despite the obvious performance gain I benchmarked for ganesha[2] ;
> barely maintained feature is worse than no feature)
>
> [2] https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/421314
>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 15, 2019, 1:57 a.m. UTC | #11
Tom Herbert wrote on Thu, Feb 14, 2019:
> > The best alternative I see is adding a proper helper to get
> > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> > lost as I have been; I'm not quite sure how/where to add such a helper
> > though as I've barely looked at the bpf code until now, but should we go
> > for that?
> 
> I'd rather not complicate the bpf code for this. Can we just always do
> an pskb_pull after skb_clone?

Which skb_clone are you thinking of?
If you're referring to the one in strparser, that was my very first
approach here[1], but Dordon shot it down saying that this is not an
strparser bug but a kcm bug since there are ways for users to properly
get the offset and use it -- and the ktls code does it right.

Frankly, my opinion still is that it'd be better in strparser, because
there also is some bad use in bpf sockmap (they never look at the offset
either, while kcm uses it for rcv but not for parse); but looking at it
from an optimal-performance point of view, I agree the user can make a
better decision than strparser, so I understand where he's coming from.


This second patch[2] (the current thread) now does an extra clone if
there is an offset, but the problem really isn't in the clone but in the
pull itself, which can fail and return NULL when there is memory
pressure. For some reason I hadn't been able to reproduce that behaviour
with strparser doing the pull, but I assume it would also be possible to
hit in extreme situations; I'm not sure...

So anyway, we basically have three choices that I can see:
 - push harder on strparser and go back to my first patch; it's simple
and makes using strparser easier/safer, but has a small overhead for
ktls, which uses the current strparser implementation correctly.
I hadn't been able to get this to fail with KASAN last time I tried, but
I assume it should still be possible somehow.

 - the current patch, which I could only get to fail with KASAN, but
which does complicate kcm a bit; this also does not fix bpf sockmap at
all. It would still require fixing the hang, so perhaps make strparser
retry a few times if strp->cb.parse_msg failed? Or at least get the
error back to userspace somehow.

 - my last suggestion, to document the kcm behaviour and, if possible,
add a bpf helper to make proper usage easier; this will make kcm harder
to use for end users, but it's better than not being documented and
seeing random inappropriate lengths being parsed.



[1] first strparser patch
http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmadeus@codewreck.org
[2] current thread's patch
http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmadeus@codewreck.org


Thanks,
Tom Herbert Feb. 15, 2019, 2:48 a.m. UTC | #12
On Thu, Feb 14, 2019 at 5:57 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Thu, Feb 14, 2019:
> > > The best alternative I see is adding a proper helper to get
> > > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as
> > > lost as I have been; I'm not quite sure how/where to add such a helper
> > > though as I've barely looked at the bpf code until now, but should we go
> > > for that?
> >
> > I'd rather not complicate the bpf code for this. Can we just always do
> > an pskb_pull after skb_clone?
>
> Which skb_clone are you thinking of?
> If you're referring to the one in strparser, that was my very first
> approach here[1], but Dordon shot it down saying that this is not an
> strparser bug but a kcm bug since there are ways for users to properly
> get the offset and use it -- and the ktls code does it right.
>
> Frankly, my opinion still is that it'd be better in strparser because
> there also is some bad use in bpf sockmap (they never look at the offset
> either, while kcm uses it for rcv but not for parse), but looking at it
> from an optimal performance point of view I agree the user can make
> better decision than strparser so I understand where he comes from.
>
>
> This second patch[2] (the current thread) now does an extra clone if
> there is an offset, but the problem really isn't in the clone but the
> pull itself that can fail and return NULL when there is memory pressure.
> For some reason I hadn't been able to reproduce that behaviour with
> strparser doing the pull, but I assume it would also be possible to hit
> in extreme situations, I'm not sure...
>
This option looks the best to me, at least as a way to fix the issue
without requiring a change to the API. If the pull fails, doesn't that
just mean that the parser fails? Is there some behavior with this
patch that is not being handled gracefully?

Thanks,
Tom

> So anyway, we basically have three choices that I can see:
>  - push harder on strparser and go back to my first patch ; it's simple
> and makes using strparser easier/safer but has a small overhead for
> ktls, which uses the current strparser implementation correctly.
> I hadn't been able to get this to fail with KASAN last time I tried but
> I assume it should still be possible somehow.
>
>  - the current patch, that I could only get to fail with KASAN, but does
> complexify kcm a bit; this also does not fix bpf sockmap at all.
> Would still require to fix the hang, so make strparser retry a few times
> if strp->cb.parse_msg failed maybe? Or at least get the error back to
> userspace somehow.
>
>  - my last suggestion to document the kcm behaviour, and if possible add
> a bpf helper to make proper usage easier ; this will make kcm harder to
> use on end users but it's better than not being documented and seeing
> random unappropriate lengths being parsed.
>
>
>
> [1] first strparser patch
> http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmadeus@codewreck.org
> [2] current thread's patch
> http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmadeus@codewreck.org
>
>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 15, 2019, 3:31 a.m. UTC | #13
Tom Herbert wrote on Thu, Feb 14, 2019:
> > This second patch[2] (the current thread) now does an extra clone if
> > there is an offset, but the problem really isn't in the clone but the
> > pull itself that can fail and return NULL when there is memory pressure.
> > For some reason I hadn't been able to reproduce that behaviour with
> > strparser doing the pull, but I assume it would also be possible to hit
> > in extreme situations, I'm not sure...
>
> This option looks the best to me, at least as a way to fix the issue
> without requiring a change to the API. If the pull fails, doesn't that
> just mean that the parser fails? Is there some behavior with this
> patch that is not being handled gracefully?

Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
all: from a user point of view, the connection just hangs (recvmsg never
returns), without so much as a message in dmesg either.

It took me a while to figure out what failed exactly, as I did indeed
expect strparser/kcm to handle that better; but ultimately, as things
stand, if the parser fails it calls strp_parser_err() with the error,
which ends up in strp_abort_strp, which should call
sk->sk_error_report(sk) - but in the kcm case sk is the csk, and I guess
failing the csk does not make a pending recv on the kcm sock fail...

I'm not sure whether propagating the error to the upper socket is the
right thing to do; kcm is meant to be able to work with multiple
underlying sockets, so I feel we must be cautious about that, but
strparser might be able to retry somehow.
This is what I said below:
> > [,,,]
> >  - the current patch, that I could only get to fail with KASAN, but does
> > complexify kcm a bit; this also does not fix bpf sockmap at all.
> > Would still require to fix the hang, so make strparser retry a few times
> > if strp->cb.parse_msg failed maybe? Or at least get the error back to
> > userspace somehow.

Thanks,
Tom Herbert Feb. 15, 2019, 4:01 a.m. UTC | #14
On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Thu, Feb 14, 2019:
> > > This second patch[2] (the current thread) now does an extra clone if
> > > there is an offset, but the problem really isn't in the clone but the
> > > pull itself that can fail and return NULL when there is memory pressure.
> > > For some reason I hadn't been able to reproduce that behaviour with
> > > strparser doing the pull, but I assume it would also be possible to hit
> > > in extreme situations, I'm not sure...
> >
> > This option looks the best to me, at least as a way to fix the issue
> > without requiring a change to the API. If the pull fails, doesn't that
> > just mean that the parser fails? Is there some behavior with this
> > patch that is not being handled gracefully?
>
> Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
> all: from a user point of view, the connection just hangs (recvmsg never
> returns), without so much as a message in dmesg either.
>
Dominique,

That's by design. Here is the description in kcm.txt:

"When a TCP socket is attached to a KCM multiplexor data ready
(POLLIN) and write space available (POLLOUT) events are handled by the
multiplexor. If there is a state change (disconnection) or other error
on a TCP socket, an error is posted on the TCP socket so that a
POLLERR event happens and KCM discontinues using the socket. When the
application gets the error notification for a TCP socket, it should
unattach the socket from KCM and then handle the error condition (the
typical response is to close the socket and create a new connection if
necessary)."

> It took me a while to figure out what failed exactly as I did indeed
> expect strparser/kcm to handle that better, but ultimately as things
> stand if the parser fails it calls strp_parser_err() with the error
> which ends up in strp_abort_strp that should call
> sk->sk_error_report(sk) but in kcm case sk is the csk and I guess
> failing csk does not make a pending recv on the kcm sock to fail...
>
> I'm not sure whether propagating the error to the upper socket is the
> right thing to do, kcm is meant to be able to work with multiple
> underlying sockets so I feel we must be cautious about that, but

Right, that's the motivation for the design.

> strparser might be able to retry somehow.

We could retry on -ENOMEM since it is potentially a transient error,
but generally for errors (like the connection simply being broken) it
seems like it's working as intended. I suppose we could add a socket
option to fail the KCM socket when all attached lower sockets have
failed, but I'm not sure that would be a significant benefit for the
additional complexity.

> This is what I said below:
> > > [,,,]
> > >  - the current patch, that I could only get to fail with KASAN, but does
> > > complexify kcm a bit; this also does not fix bpf sockmap at all.
> > > Would still require to fix the hang, so make strparser retry a few times
> > > if strp->cb.parse_msg failed maybe? Or at least get the error back to
> > > userspace somehow.

The error should be getting to userspace via the TCP socket.

Tom

>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 15, 2019, 4:52 a.m. UTC | #15
Tom Herbert wrote on Thu, Feb 14, 2019:
> On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet
> <asmadeus@codewreck.org> wrote:
> > Yes, the parser fails with -ENOMEM ; that is not handled gracefully at
> > all: from a user point of view, the connection just hangs (recvmsg never
> > returns), without so much as a message in dmesg either.
>
> That's by design. Here is the description in kcm.txt:
> 
> "When a TCP socket is attached to a KCM multiplexor data ready
> (POLLIN) and write space available (POLLOUT) events are handled by the
> multiplexor. If there is a state change (disconnection) or other error
> on a TCP socket, an error is posted on the TCP socket so that a
> POLLERR event happens and KCM discontinues using the socket. When the
> application gets the error notification for a TCP socket, it should
> unattach the socket from KCM and then handle the error condition (the
> typical response is to close the socket and create a new connection if
> necessary)."

Sigh, that's what I get for relying on pieces of code found on the
internet.

It does make "trivial" kcm sockets difficult to use, though: the old
ganesha code I adapted to kcm was the horrible (naive?) kind, spawning
one thread per connection and just waiting on read(), so making it
simply create a kcm socket from the tcp one and wait on recvmsg() until
an error pops up does not seem to be an option.

That being said, I'm a bit confused: I thought part of the point of kcm
was the multiplexing, so a simple server could just keep attaching tcp
sockets to a single kcm socket and only have a single trivial loop
reading from the kcm socket; but I guess there's no harm in also
looking for POLLERR on the tcp sockets... It would still need to close
them once clients disconnect, I guess; this really only affects my naive
server.

> > strparser might be able to retry somehow.
> 
> We could retry on -ENOMEM since it is potentially a transient error,

Yes, I think we should aim to retry on -ENOMEM; I agree that not all
errors are meant to be retried, but there already are different cases
based on what the parse cb returned; that can be a different patch.

> but generally for errors (like connection is simply broken) it seems
> like it's working as intended. I suppose we could add a socket option
> to fail the KCM socket when all attached lower sockets have failed,
> but I not sure that would be a significant benefit for additional
> complexity.

Right, I agree it's probably not worth doing, now I'm aware of this I
can probably motivate myself to change this old code to use epoll
properly.

I think it could make sense to have this option for simpler programs,
but as things stand I guess we can require kcm users to do this much,
thanks for the explanation, and sorry for having failed to notice it.



With all that said I guess my patch should work correctly then, I'll try
to find some time to check the error does come back up the tcp socket in
my reproducer but I have no reason to believe it doesn't.

I'd like to see some retry on ENOMEM before this is merged though, so
while I'm there I'll resend this with a second patch doing that
retry... I think just not setting strp->interrupted and not reporting
the error up might be enough? Will have to try either way.



Thanks for the feedback,
Dominique Martinet Feb. 20, 2019, 4:11 a.m. UTC | #16
Dominique Martinet wrote on Fri, Feb 15, 2019:
> With all that said I guess my patch should work correctly then, I'll try
> to find some time to check the error does come back up the tcp socket in
> my reproducer but I have no reason to believe it doesn't.

Ok, so I can confirm this part - the 'csock' does come back up with
POLLERR if the parse function returns ENOMEM in the current code.

It also comes back up with POLLERR when the remote side closes the
connection, which is expected, but I'm having a very hard time
understanding how an application is supposed to deal with these
POLLERR after having read the documentation and a bit of
experimentation.
I'm not sure how much it would matter for real life (if the other end
closes the connection most servers would not care about what they said
just before closing, but I can imagine some clients doing that in real
life e.g. a POST request they don't care if it succeeds or not)...
My test program works like this:
 - client connects, sends 10k messages and close()s the socket
 - server loops recving and close()s after 10k messages; it used to be
recvmsg() directly but it's now using poll then recvmsg.
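The poll-then-recvmsg loop used by the server looks roughly like this. This is a sketch of the approach under discussion, not settled API usage: whether an `SO_ERROR` of `EPIPE` on the csock can safely be treated as a non-fatal half-close is exactly the open question in this thread.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Wait on both the KCM socket and the attached TCP socket (csock).
 * Returns 1 when kcm_fd has a message ready, -1 when csock raised a
 * "real" error, 0 on timeout or on an ignorable EPIPE (remote
 * half-close, per the discussion above). */
int kcm_wait(int kcm_fd, int csock_fd, int timeout_ms)
{
	struct pollfd pfd[2] = {
		{ .fd = kcm_fd,   .events = POLLIN },
		{ .fd = csock_fd, .events = 0 }, /* POLLERR is always reported */
	};

	if (poll(pfd, 2, timeout_ms) <= 0)
		return 0;
	if (pfd[0].revents & POLLIN)
		return 1;		/* drain parsed messages first */
	if (pfd[1].revents & POLLERR) {
		int err = 0;
		socklen_t len = sizeof(err);

		getsockopt(csock_fd, SOL_SOCKET, SO_ERROR, &err, &len);
		return err == EPIPE ? 0 : -1;
	}
	return 0;
}
```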


When the client closes the socket, some messages are obviously still "in
flight", and the server will recv a POLLERR notification on the csock at
some point with many messages left.
The documentation says to unattach the csock when you get POLLERR. If I
do that, the kcm socket will no longer give me any message, so all the
messages still in flight at the time are lost.

If I just ignore the csock like I used to, all the messages do come just
fine, but as said previously on a real error this will just make recvmsg
or the polling hang forever and I see no way to distinguish a "real"
error vs. a connection shut down from the remote side with data left in
the pipe.
I thought of checking POLLERR on csock and POLLIN not set on kcmsock,
but even that seems to happen fairly regularly - the kcm sock hasn't
been filled up, it's still reading from the csock.


On the other hand, checking POLLIN on the csock does say there is still
data left, so I know there is data left on the csock, but this is also
the case on a real error (e.g. if parser returns -ENOMEM)
... And this made me try to read from the csock after detaching it and I
can resume manual tcp parsing for a few messages until read() fails with
EPROTO ?! and I cannot seem to be able to get anything out of attaching
it back to kcm (for e.g. an ENOMEM error that was transient)...



I'm honestly not sure how the POLLERR notification mechanism works but I
think it would be much easier to use KCM if we could somehow delay that
error until KCM is done feeding from the csock (when netparser really
stops reading from it like on real error, e.g. abort callback maybe?)
I think it's fine if the csock is closed before the kcm sock message is
read, but we should not lose messages like this.



> I'd like to see some retry on ENOMEM before this is merged though, so
> while I'm there I'll resend this with a second patch doing that
> retry,.. I think just not setting strp->interrupted and not reporting
> the error up might be enough? Will have to try either way.

I also tried playing with that without much success.
I had assumed just not calling strp_parser_err() (which calls the
abort_parser cb) would be enough, eventually calling strp_start_timer()
like the !len case, but no can do.
With that said, returning 0 from the parse function also raises POLLERR
on the csock and hangs netparser, so things aren't that simple...


I could use a bit of help again.

Thanks,
Tom Herbert Feb. 20, 2019, 4:18 p.m. UTC | #17
On Tue, Feb 19, 2019 at 8:12 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Dominique Martinet wrote on Fri, Feb 15, 2019:
> > With all that said I guess my patch should work correctly then, I'll try
> > to find some time to check the error does come back up the tcp socket in
> > my reproducer but I have no reason to believe it doesn't.
>
> Ok, so I can confirm this part - the 'csock' does come back up with
> POLLERR if the parse function returns ENOMEM in the current code.
>
Good.

> It also comes back up with POLLERR when the remote side closes the
> connection, which is expected, but I'm having a very hard time
> understanding how an application is supposed to deal with these
> POLLERR after having read the documentation and a bit of
> experimentation.
> I'm not sure how much it would matter for real life (if the other end
> closes the connection most servers would not care about what they said
> just before closing, but I can imagine some clients doing that in real
> life e.g. a POST request they don't care if it succeeds or not)...
> My test program works like this:
>  - client connects, sends 10k messages and close()s the socket
>  - server loops recving and close()s after 10k messages; it used to be
> recvmsg() directly but it's now using poll then recvmsg.
>
>
> When the client closes the socket, some messages are obviously still "in
> flight", and the server will recv a POLLERR notification on the csock at
> some point with many messages left.
> The documentation says to unattach the csock when you get POLLERR. If I
> do that, the kcm socket will no longer give me any message, so all the
> messages still in flight at the time are lost.

So basically it sounds like you're interested in supporting TCP
connections that are half closed. I believe that the error in half
closed is EPIPE, so if the TCP socket returns that it can be ignored
and the socket can continue being attached and used to send data.

Another possibility is to add some linger semantics to an attached
socket. For instance, a large message might be sent so that part of
the message is queued in TCP and part is queued in the KCM socket.
Unattach would probably break that message. We probably want a linger
option similar to SO_LINGER (or maybe just use the option on the TCP
socket) that means don't complete the detach until any message being
transmitted on the lower socket has been queued.
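For comparison, the existing TCP-level knob being referenced would be set like this; note the KCM-level linger option proposed above does not exist, this only shows plain `SO_LINGER` on the lower socket.

```c
#include <sys/socket.h>

/* Sketch: make close() on the lower TCP socket wait (up to a bound)
 * for unsent queued data, the existing semantic a KCM detach-linger
 * option could mirror. */
int set_linger(int tcp_fd, int seconds)
{
	struct linger lg = {
		.l_onoff  = 1,       /* enable lingering on close() */
		.l_linger = seconds, /* upper bound on the wait */
	};

	return setsockopt(tcp_fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```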
>
> If I just ignore the csock like I used to, all the messages do come just
> fine, but as said previously on a real error this will just make recvmsg
> or the polling hang forever and I see no way to distinguish a "real"
> error vs. a connection shut down from the remote side with data left in
> the pipe.
> I thought of checking POLLERR on csock and POLLIN not set on kcmsock,
> but even that seems to happen fairly regularly - the kcm sock hasn't
> been filled up, it's still reading from the csock.
>
>
> On the other hand, checking POLLIN on the csock does say there is still
> data left, so I know there is data left on the csock, but this is also
> the case on a real error (e.g. if parser returns -ENOMEM)
> ... And this made me try to read from the csock after detaching it and I
> can resume manual tcp parsing for a few messages until read() fails with
> EPROTO ?! and I cannot seem to be able to get anything out of attaching
> it back to kcm (for e.g. an ENOMEM error that was transient)...
>
>
>
> I'm honestly not sure how the POLLERR notification mechanism works but I
> think it would be much easier to use KCM if we could somehow delay that
> error until KCM is done feeding from the csock (when netparser really
> stops reading from it like on real error, e.g. abort callback maybe?)
> I think it's fine if the csock is closed before the kcm sock message is
> read, but we should not lose messages like this.

Sounds like linger semantics is needed then.

>
>
>
> > I'd like to see some retry on ENOMEM before this is merged though, so
> > while I'm there I'll resend this with a second patch doing that
> > retry,.. I think just not setting strp->interrupted and not reporting
> > the error up might be enough? Will have to try either way.
>
> I also tried playing with that without much success.
> I had assumed just not calling strp_parser_err() (which calls the
> abort_parser cb) would be enough, eventually calling strp_start_timer()
> like the !len case, but no can do.

I think you need to ignore the ENOMEM and have a timer or other
callback to retry the operation in the future.

> With that said, returning 0 from the parse function also raises POLLERR
> on the csock and hangs netparser, so things aren't that simple...

Can you point to where this is happening. If the parse_msg callback
returns 0 that is supposed to indicate that more bytes are needed.
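To make that 0-return contract concrete, here is a userspace model of the kind of "trivial" parser this thread is exercising, assuming a hypothetical 2-byte big-endian length prefix (the real parser is a BPF program, this is just the logic):

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a strparser parse callback: return the full message length
 * (header + payload) once it can be determined, or 0 when more bytes
 * are needed to decide. */
int parse_msg_len(const uint8_t *buf, size_t avail)
{
	if (avail < 2)
		return 0;	/* header incomplete: ask for more bytes */

	return 2 + ((buf[0] << 8) | buf[1]);
}
```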

>

>
> I could use a bit of help again.
>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 21, 2019, 8:22 a.m. UTC | #18
Tom Herbert wrote on Wed, Feb 20, 2019:
> > When the client closes the socket, some messages are obviously still "in
> > flight", and the server will recv a POLLERR notification on the csock at
> > some point with many messages left.
> > The documentation says to unattach the csock when you get POLLERR. If I
> > do that, the kcm socket will no longer give me any message, so all the
> > messages still in flight at the time are lost.
> 
> So basically it sounds like you're interested in supporting TCP
> connections that are half closed. I believe that the error in half
> closed is EPIPE, so if the TCP socket returns that it can be ignored
> and the socket can continue being attached and used to send data.

Did you mean 'can continue being attached and used to receive data'?

I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see
how to efficiently ignore EPIPE until POLLIN gets unset -- polling on
both the csock and kcm socket will do many needless wakeups on only the
csock from what I can see, so I'd need some holdoff timer or something.
I guess it's possible though.

> Another possibility is to add some linger semantics to an attached
> socket. For instance, a large message might be sent so that part of
> the message is queued in TCP and part is queued in the KCM socket.
> Unattach would probably break that message. We probably want to linger
> option similar to SO_LINGER (or maybe just use the option on the TCP
> socket) that means don't complete the detach until any message being
> transmitted on the lower socket has been queued.

That would certainly work, even if non-obvious from a user standpoint.


> > > I'd like to see some retry on ENOMEM before this is merged though, so
> > > while I'm there I'll resend this with a second patch doing that
> > > retry,.. I think just not setting strp->interrupted and not reporting
> > > the error up might be enough? Will have to try either way.
> >
> > I also tried playing with that without much success.
> > I had assumed just not calling strp_parser_err() (which calls the
> > abort_parser cb) would be enough, eventually calling strp_start_timer()
> > like the !len case, but no can do.
> 
> I think you need to ignore the ENOMEM and have a timer or other
> callback to retry the operation in the future.

Yes, that's what I had intended to try; basically just break out and
schedule timer as said above.

After a bit more debugging, this part works (__strp_recv() is called
again); but the next packet that is treated properly is rejected because
by the time __strp_recv() was called again a new skb was read and the
length isn't large enough to go all the way into the new packet, so this
test fails:
                        } else if (len <= (ssize_t)head->len -
                                          skb->len - stm->strp.offset) {
                                /* Length must be into new skb (and also
                                 * greater than zero)
                                 */
                                STRP_STATS_INCR(strp->stats.bad_hdr_len);
                                strp_parser_err(strp, -EPROTO, desc);

So I need to figure a way to say "call this function again without
reading more data" somehow, or make this check more lax e.g. accept any
len > 0 after a retry maybe...
Removing that branch altogether seems to work at least but I'm not sure
we'd want to?
(grmbl at this slow VM and strparser not being possible to enable as a
module, it takes ages to test)


> > With that said, returning 0 from the parse function also raises POLLERR
> > on the csock and hangs netparser, so things aren't that simple...
> 
> Can you point to where this is happening. If the parse_msg callback
> returns 0 that is supposed to indicate that more bytes are needed.

I just blindly returned 0 "from time to time" in the kcm parser
function, but looking at above it was failing on the same check.
This somewhat makes sense for this one to fail here if a new packet was
read; no sane parser function should give a length smaller than what
it requires to determine the length.


Thanks,
Tom Herbert Feb. 22, 2019, 7:24 p.m. UTC | #19
On Thu, Feb 21, 2019 at 12:22 AM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Wed, Feb 20, 2019:
> > > When the client closes the socket, some messages are obviously still "in
> > > flight", and the server will recv a POLLERR notification on the csock at
> > > some point with many messages left.
> > > The documentation says to unattach the csock when you get POLLER. If I
> > > do that, the kcm socket will no longer give me any message, so all the
> > > messages still in flight at the time are lost.
> >
> > So basically it sounds like you're interested in supporting TCP
> > connections that are half closed. I believe that the error in half
> > closed is EPIPE, so if the TCP socket returns that it can be ignored
> > and the socket can continue being attached and used to send data.
>
> Did you mean 'can continue being attached and used to receive data'?
>
No, I meant shutdown on the receive side when FIN is received. TX is still
allowed to drain any queued bytes. To support shutdown on the TX side
would require additional logic since we need to effectively detach the
transmit path but retain the receive path. I'm not sure this is a
compelling use case to support.

> I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see
> how to efficiently ignore EPIPE until POLLIN gets unset -- polling on
> both the csock and kcm socket will do many needless wakeups on only the
> csock from what I can see, so I'd need some holdoff timer or something.
> I guess it's possible though.

We might need to clear the error somehow. Maybe a read of zero bytes?

>
> > Another possibility is to add some linger semantics to an attached
> > socket. For instance, a large message might be sent so that part of
> > the messge is queued in TCP and part is queued in the KCM socket.
> > Unattach would probably break that message. We probably want to linger
> > option similar to SO_LINGER (or maybe just use the option on the TCP
> > socket) that means don't complete the detach until any message being
> > transmitted on the lower socket has been queued.
>
> That would certainly work, even if non-obvious from a user standpoint.
>
>
> > > > I'd like to see some retry on ENOMEM before this is merged though, so
> > > > while I'm there I'll resend this with a second patch doing that
> > > > retry,.. I think just not setting strp->interrupted and not reporting
> > > > the error up might be enough? Will have to try either way.
> > >
> > > I also tried playing with that without much success.
> > > I had assumed just not calling strp_parser_err() (which calls the
> > > abort_parser cb) would be enough, eventually calling strp_start_timer()
> > > like the !len case, but no can do.
> >
> > I think you need to ignore the ENOMEM and have a timer or other
> > callback to retry the operation in the future.
>
> Yes, that's what I had intended to try; basically just break out and
> schedule timer as said above.

You might want to look at some other systems; I don't recall if
there's a hook that can be used for when memory pressure is relieved.

>
> After a bit more debugging, this part works (__strp_recv() is called
> again); but the next packet that is treated properly is rejected because
> by the time __strp_recv() was called again a new skb was read and the
> length isn't large enough to go all the way into the new packet, so this
> test fails:
>                         } else if (len <= (ssize_t)head->len -
>                                           skb->len - stm->strp.offset) {
>                                 /* Length must be into new skb (and also
>                                  * greater than zero)
>                                  */
>                                 STRP_STATS_INCR(strp->stats.bad_hdr_len);
>                                 strp_parser_err(strp, -EPROTO, desc);
>
> So I need to figure a way to say "call this function again without
> reading more data" somehow, or make this check more lax e.g. accept any
> len > 0 after a retry maybe...
> Removing that branch altogether seems to work at least but I'm not sure
> we'd want to?

I like the check since it's conservative and covers the normal case.
Maybe just need some more logic?
> (grmbl at this slow VM and strparser not being possible to enable as a
> module, it takes ages to test)
>
>
> > > With that said, returning 0 from the parse function also raises POLLERR
> > > on the csock and hangs netparser, so things aren't that simple...
> >
> > Can you point to where this is happening. If the parse_msg callback
> > returns 0 that is suppose to indicate that more bytes are needed.
>
> I just blindly returned 0 "from time to time" in the kcm parser
> function, but looking at above it was failing on the same check.
> This somewhat makes sense for this one to fail here if a new packet was
> read, no sane parser function should give a length smaller than what
> they require to determine the length.
>
>
> Thanks,
> --
> Dominique
Dominique Martinet Feb. 22, 2019, 8:27 p.m. UTC | #20
Tom Herbert wrote on Fri, Feb 22, 2019:
> > > So basically it sounds like you're interested in supporting TCP
> > > connections that are half closed. I believe that the error in half
> > > closed is EPIPE, so if the TCP socket returns that it can be ignored
> > > and the socket can continue being attached and used to send data.
> >
> > Did you mean 'can continue being attached and used to receive data'?
>
> No, I meant shutdown on the receive side when FIN is received. TX is still
> allowed to drain any queued bytes. To support shutdown on the TX side
> would require additional logic since we need to effectively detach the
> transmit path but retain the receive path. I'm not sure this is a
> compelling use case to support.

Hm, it must be a matter of how we see things but from what I understand
it's exactly the other way around. The remote closed the connection, so
trying to send anything would just yield a RST, so TX doesn't make
sense.
On the other hand, anything that had been sent by the remote before the
FIN and is on the local side's memory should still be receivable.

When you think about it as a TCP stream it's really weird: data coming,
data coming, data coming, FIN received.
But in the networking stack that received FIN short-circuits all the
data that was left around and immediately raises an EPIPE error.

I don't see what makes this FIN packet so great that it should be
processed before the data; we should only see that EPIPE when we're
done reading the data before it or trying to send something.

I'll check tomorrow/next week but I'm pretty sure the packets before
that have been ack'd at a tcp level as well, so losing them in the
application level is really unexpected.
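The half-close ordering argued for here already holds for plain stream sockets; a minimal self-contained illustration (a `socketpair()` standing in for the TCP connection):

```c
#include <sys/socket.h>
#include <unistd.h>

/* Data queued before the sender's shutdown(SHUT_WR) (the "FIN") stays
 * readable on the other end; EOF (read() returning 0) only shows up
 * once that data has been drained. */
int reads_before_eof(void)
{
	int sv[2];
	char buf[16];
	int reads = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	write(sv[0], "data", 4);
	shutdown(sv[0], SHUT_WR);	/* FIN sent after the data */

	while (read(sv[1], buf, sizeof(buf)) > 0)
		reads++;		/* queued data still arrives */

	close(sv[0]);
	close(sv[1]);
	return reads;			/* successful reads before EOF */
}
```

This is the behavior recvmsg() on a plain TCP socket gives (zero only after draining); the complaint in this thread is that the KCM/strparser path short-circuits it.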
 
> > I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see
> > how to efficiently ignore EPIPE until POLLIN gets unset -- polling on
> > both the csock and kcm socket will do many needless wakeups on only the
> > csock from what I can see, so I'd need some holdoff timer or something.
> > I guess it's possible though.
> 
> We might need to clear the error somehow. Maybe a read of zero bytes?

Can try.

> > After a bit more debugging, this part works (__strp_recv() is called
> > again); but the next packet that is treated properly is rejected because
> > by the time __strp_recv() was called again a new skb was read and the
> > length isn't large enough to go all the way into the new packet, so this
> > test fails:
> >                         } else if (len <= (ssize_t)head->len -
> >                                           skb->len - stm->strp.offset) {
> >                                 /* Length must be into new skb (and also
> >                                  * greater than zero)
> >                                  */
> >                                 STRP_STATS_INCR(strp->stats.bad_hdr_len);
> >                                 strp_parser_err(strp, -EPROTO, desc);
> >
> > So I need to figure a way to say "call this function again without
> > reading more data" somehow, or make this check more lax e.g. accept any
> > len > 0 after a retry maybe...
> > Removing that branch altogether seems to work at least but I'm not sure
> > we'd want to?
> 
> I like the check since it's conservative and covers the normal case.
> Maybe just need some more logic?

I can add a "retrying" state and not fail here if we were retrying for
whatever reason perhaps...
But I'm starting to wonder how this would work if my client didn't keep
on sending data, I'll try to fail on the last client's packet and see
how __strp_recv is called again through the timer, with the same skb
perhaps?
Tom Herbert Feb. 22, 2019, 9:01 p.m. UTC | #21
On Fri, Feb 22, 2019 at 12:28 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Tom Herbert wrote on Fri, Feb 22, 2019:
> > > > So basically it sounds like you're interested in supporting TCP
> > > > connections that are half closed. I believe that the error in half
> > > > closed is EPIPE, so if the TCP socket returns that it can be ignored
> > > > and the socket can continue being attached and used to send data.
> > >
> > > Did you mean 'can continue being attached and used to receive data'?
> >
> > No, I meant shutdown on receive side when FIN is receved. TX is still
> > allowed to drain an queued bytes. To support shutdown on the TX side
> > would require additional logic since we need to effectively detach the
> > transmit path but retain the receive path. I'm not sure this is a
> > compelling use case to support.
>
Dominique,

> Hm, it must be a matter of how we see things but from what I understand
> it's exactly the other way around. The remote closed the connection, so
> trying to send anything would just yield a RST, so TX doesn't make
> sense.
> On the other hand, anything that had been sent by the remote before the
> FIN and is on the local side's memory should still be receivable.

That's right, any data sent before the FIN can be received. After the
FIN, there is no more data to receive. But the FIN doesn't say
anything about the reverse path. For instance, if the peer did
shutdown(SHUT_WR), then it sent the FIN so it no longer transmits on
the connection, but still can receive data.

>
> When you think about it as a TCP stream it's really weird: data coming,
> data coming, data coming, FIN received.
> But in the networking stack that received FIN short-circuits all the
> data that was left around and immediately raises an EPIPE error.
>
Right, it doesn't help that most people automatically assume a
received FIN means the connection is closed! (although in practice
that assumption works pretty well for most applications).

> I don't see what makes this FIN packet so great that it should be
> processed before the data; we should only see that EPIPE when we're
> done reading the data before it or trying to send something.

It's not supposed to be. If you're seeing a problem, it might be because
state_change is processed before the received data is drained. If
plain recvmsg were being used, zero bytes is only returned after all
outstanding received data on the socket has been read. We might need
some additional logic in KCM to handle this.

>
> I'll check tomorrow/next week but I'm pretty sure the packets before
> that have been ack'd at a tcp level as well, so losing them in the
> application level is really unexpected.
>
> > > I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see
> > > how to efficiently ignore EPIPE until POLLIN gets unset -- polling on
> > > both the csock and kcm socket will do many needless wakeups on only the
> > > csock from what I can see, so I'd need some holdoff timer or something.
> > > I guess it's possible though.
> >
> > We might need to clear the error somehow. May a read of zero bytes?
>
> Can try.
>
> > > After a bit more debugging, this part works (__strp_recv() is called
> > > again); but the next packet that is treated properly is rejected because
> > > by the time __strp_recv() was called again a new skb was read and the
> > > length isn't large enough to go all the way into the new packet, so this
> > > test fails:
> > >                         } else if (len <= (ssize_t)head->len -
> > >                                           skb->len - stm->strp.offset) {
> > >                                 /* Length must be into new skb (and also
> > >                                  * greater than zero)
> > >                                  */
> > >                                 STRP_STATS_INCR(strp->stats.bad_hdr_len);
> > >                                 strp_parser_err(strp, -EPROTO, desc);
> > >
> > > So I need to figure a way to say "call this function again without
> > > reading more data" somehow, or make this check more lax e.g. accept any
> > > len > 0 after a retry maybe...
> > > Removing that branch altogether seems to work at least but I'm not sure
> > > we'd want to?
> >
> > I like the check since it's conservative and covers the normal case.
> > Maybe just need some more logic?
>
> I can add a "retrying" state and not fail here if we were retrying for
> whatever reason perhaps...
> But I'm starting to wonder how this would work if my client didn't keep
> on sending data, I'll try to fail on the last client's packet and see
> how __strp_recv is called again through the timer, with the same skb
> perhaps?

Right, a timer or some sort of asynchronous callback is needed. Like I
said, I don't think dealing with memory failure like this is unique to
KCM.

Thanks again for looking into this, I think this is good progress!

Tom

>
> --
> Dominique
diff mbox series

Patch

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 571d824e4e24..36c438b95955 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -381,8 +381,32 @@  static int kcm_parse_func_strparser(struct strparser *strp, struct sk_buff *skb)
 {
 	struct kcm_psock *psock = container_of(strp, struct kcm_psock, strp);
 	struct bpf_prog *prog = psock->bpf_prog;
+	struct sk_buff *clone_skb = NULL;
+	struct strp_msg *stm;
+	int rc;
+
+	stm = strp_msg(skb);
+	if (stm->offset) {
+		skb = clone_skb = skb_clone(skb, GFP_ATOMIC);
+		if (!clone_skb)
+			return -ENOMEM;
+
+		if (!pskb_pull(clone_skb, stm->offset)) {
+			rc = -ENOMEM;
+			goto out;
+		}
+
+		/* reset cloned skb's offset for bpf programs using it */
+		stm = strp_msg(clone_skb);
+		stm->offset = 0;
+	}
+
+	rc = (*prog->bpf_func)(skb, prog->insnsi);
+out:
+	if (clone_skb)
+		kfree_skb(clone_skb);
 
-	return (*prog->bpf_func)(skb, prog->insnsi);
+	return rc;
 }
 
 static int kcm_read_sock_done(struct strparser *strp, int err)