[net] net: saving irq context for peernet2id()

Message ID	CAM_iQpVFWzUm2BJZUs+-LWpKA8kubuKhzyE01_V69TEsrFnm-Q@mail.gmail.com
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> MIME-Version: 1.0 In-Reply-To: <CAHC9VhSronk914-0QZd7cPBEAOvSzN++Umc=_aueeCuOB9iG8Q@mail.gmail.com> References: <1476946352-15770-1-git-send-email-xiyou.wangcong@gmail.com> <2707c52d-88ec-7b93-f96e-eeaffc952c9c@tycho.nsa.gov> <CAM_iQpWD3WiX1ZX+3LJ8vbJc42K1jZcZ4-f6MO5kzZY-+Szkqw@mail.gmail.com> <CAHC9VhSronk914-0QZd7cPBEAOvSzN++Umc=_aueeCuOB9iG8Q@mail.gmail.com> From: Cong Wang <xiyou.wangcong@gmail.com> Date: Thu, 20 Oct 2016 16:35:53 -0700 Message-ID: <CAM_iQpVFWzUm2BJZUs+-LWpKA8kubuKhzyE01_V69TEsrFnm-Q@mail.gmail.com> Subject: Re: [Patch net] net: saving irq context for peernet2id() To: Paul Moore <paul@paul-moore.com> Cc: Stephen Smalley <sds@tycho.nsa.gov>, Linux Kernel Network Developers <netdev@vger.kernel.org>, Elad Raz <e@eladraz.com>, Richard Guy Briggs <rgb@redhat.com> Content-Type: multipart/mixed; boundary=94eb2c0590f036a536053f54644f Sender: netdev-owner@vger.kernel.org Precedence: bulk

Cong Wang Oct. 20, 2016, 11:35 p.m. UTC

On Thu, Oct 20, 2016 at 12:07 PM, Paul Moore <paul@paul-moore.com> wrote:
> On Thu, Oct 20, 2016 at 2:29 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> On Thu, Oct 20, 2016 at 7:58 AM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
>>> On 10/20/2016 02:52 AM, Cong Wang wrote:
>>>> A kernel warning inside __local_bh_enable_ip() was reported by people
>>>> running SELinux, this is caused due to some SELinux functions
>>>> (indirectly) call peernet2id() with IRQ disabled in process context,
>>>> when we re-enable BH with IRQ disabled kernel complains. Shut up this
>>>> warning by saving IRQ context in peernet2id(), BH is still implicitly
>>>> disabled.
>>>
>>> Not sure this suffices; kill_fasync() -> send_sigio() ->
>>> send_sigio_to_task() -> sigio_perm() -> security_file_send_sigiotask()
>>> -> selinux_file_send_sigiotask() -> ... -> audit_log() -> ... ->
>>> peernet2id()
>>
>> Oh, this is a new one. kill_fasync() is called in IRQ handler, so we actually
>> do multicast in IRQ context.... It makes no sense, netlink multicast could
>> be very expensive if we have many listeners.
>
> I'm sure there are a few others I don't know about, but I believe the
> only commonly used audit multicast listener is systemd.

But user-space is free to listen to this group, right? If so this is just open
for a potential DDOS attack.

>
>> I am Cc'ing Richard who added that multicast in audit_log_end(). It seems
>> not easy to just move the multicast to a workqueue, since the skb is copied
>> from audit_buffer which is freed immediately after that, probably need another
>> queue like audit_skb_queue.
>
> This approach would double the queue size which is something I want to
> avoid.  I would suggest sticking with a single queue and dealing with
> the netlink message link fixup and multicast send in the existing
> netlink unicast thread; basically we would just be moving the
> multicast code from audit_log_end() into kauditd_thread().  This is
> the same approach I mentioned earlier off-list.

This is what I did in the follow up patch. I attach the updated version
in this email for you to review, I still can't make selinux-testsuites working
on my Fedora even though I have SELinux=enforcing, anyhow I don't
see any kernel warning in my dmesg at least.

>
> However, that isn't something I want to mess with as a regression fix,
> mostly because I really want to see this regression gone by -rc2 as it
> is making SELinux testing a real pain.  If the patch posted at the top
> of this thread isn't a suitable fix, we really should revert the
> original patch.

Since you want to test SELinux anyway, please test the attached one.

Thanks.

Cong Wang Oct. 21, 2016, 4:47 a.m. UTC | #1

On Thu, Oct 20, 2016 at 4:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Since you want to test SELinux anyway, please test the attached one.
>

Finally my kernel config is friendly to SELinux, and now there are several
tests fails:


Test Summary Report
-------------------
sysctl/test           (Wstat: 0 Tests: 4 Failed: 2)
  Failed tests:  1-2
mmap/test             (Wstat: 0 Tests: 44 Failed: 2)
  Failed tests:  20, 26
overlay/test          (Wstat: 65280 Tests: 0 Failed: 0)
  Non-zero exit status: 255
  Parse errors: Bad plan.  You planned 121 tests but ran 0.
Files=44, Tests=306, 47 wallclock secs ( 0.23 usr  0.30 sys +  1.00
cusr  7.35 csys =  8.88 CPU)
Result: FAIL

Seems not related to my patch?

And, there is a kernel warning which is totally unrelated to my code:


[  136.788836] ------------[ cut here ]------------
[  136.793046] WARNING: CPU: 1 PID: 1345 at fs/locks.c:615
locks_unlink_lock_ctx+0xc7/0xcc
[  136.799829] CPU: 1 PID: 1345 Comm: test_lease Not tainted 4.8.0+ #271
[  136.803814] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  136.803814]  ffffc900036bfd38 ffffffff8152c08e 0000000000000000
0000000000000000
[  136.803814]  ffffc900036bfd78 ffffffff810841c0 00000267036bfc88
ffff8800db2371e8
[  136.803814]  ffff8800db26f4c0 ffffc900036bfe00 ffffc900036bfe00
ffff8800db239240
[  136.803814] Call Trace:
[  136.803814]  [<ffffffff8152c08e>] dump_stack+0x67/0x90
[  136.803814]  [<ffffffff810841c0>] __warn+0xbd/0xd8
[  136.803814]  [<ffffffff81084297>] warn_slowpath_null+0x1d/0x1f
[  136.803814]  [<ffffffff8121b9cb>] locks_unlink_lock_ctx+0xc7/0xcc
[  136.803814]  [<ffffffff8121bd9e>] locks_delete_lock_ctx+0x17/0x3a
[  136.803814]  [<ffffffff8121be71>] lease_modify+0xb0/0xb9
[  136.803814]  [<ffffffff8121e868>] locks_remove_file+0x93/0xbd
[  136.803814]  [<ffffffff811d903d>] __fput+0xd0/0x19b
[  136.803814]  [<ffffffff811d913e>] ____fput+0xe/0x10
[  136.803814]  [<ffffffff8109ff86>] task_work_run+0x6c/0x97
[  136.803814]  [<ffffffff8100188a>] prepare_exit_to_usermode+0xb0/0xcc
[  136.803814]  [<ffffffff81001a04>] syscall_return_slowpath+0x15e/0x1c2
[  136.803814]  [<ffffffff81001ad7>] do_syscall_64+0x6f/0x76
[  136.803814]  [<ffffffff81b22424>] entry_SYSCALL64_slow_path+0x25/0x25
[  138.538370] ---[ end trace 90914a16bd3272b8 ]---

Stephen Smalley Oct. 21, 2016, 1:51 p.m. UTC | #2

On 10/21/2016 12:47 AM, Cong Wang wrote:
> On Thu, Oct 20, 2016 at 4:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> Since you want to test SELinux anyway, please test the attached one.
>>
> 
> Finally my kernel config is friendly to SELinux, and now there are several
> tests fails:
> 
> 
> Test Summary Report
> -------------------
> sysctl/test           (Wstat: 0 Tests: 4 Failed: 2)
>   Failed tests:  1-2
> mmap/test             (Wstat: 0 Tests: 44 Failed: 2)
>   Failed tests:  20, 26
> overlay/test          (Wstat: 65280 Tests: 0 Failed: 0)
>   Non-zero exit status: 255
>   Parse errors: Bad plan.  You planned 121 tests but ran 0.
> Files=44, Tests=306, 47 wallclock secs ( 0.23 usr  0.30 sys +  1.00
> cusr  7.35 csys =  8.88 CPU)
> Result: FAIL
> 
> Seems not related to my patch?

Probably kernel config related.  All tests pass for me on rawhide.
Kernel config?

> And, there is a kernel warning which is totally unrelated to my code:
> 
> 
> [  136.788836] ------------[ cut here ]------------
> [  136.793046] WARNING: CPU: 1 PID: 1345 at fs/locks.c:615
> locks_unlink_lock_ctx+0xc7/0xcc
> [  136.799829] CPU: 1 PID: 1345 Comm: test_lease Not tainted 4.8.0+ #271
> [  136.803814] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  136.803814]  ffffc900036bfd38 ffffffff8152c08e 0000000000000000
> 0000000000000000
> [  136.803814]  ffffc900036bfd78 ffffffff810841c0 00000267036bfc88
> ffff8800db2371e8
> [  136.803814]  ffff8800db26f4c0 ffffc900036bfe00 ffffc900036bfe00
> ffff8800db239240
> [  136.803814] Call Trace:
> [  136.803814]  [<ffffffff8152c08e>] dump_stack+0x67/0x90
> [  136.803814]  [<ffffffff810841c0>] __warn+0xbd/0xd8
> [  136.803814]  [<ffffffff81084297>] warn_slowpath_null+0x1d/0x1f
> [  136.803814]  [<ffffffff8121b9cb>] locks_unlink_lock_ctx+0xc7/0xcc
> [  136.803814]  [<ffffffff8121bd9e>] locks_delete_lock_ctx+0x17/0x3a
> [  136.803814]  [<ffffffff8121be71>] lease_modify+0xb0/0xb9
> [  136.803814]  [<ffffffff8121e868>] locks_remove_file+0x93/0xbd
> [  136.803814]  [<ffffffff811d903d>] __fput+0xd0/0x19b
> [  136.803814]  [<ffffffff811d913e>] ____fput+0xe/0x10
> [  136.803814]  [<ffffffff8109ff86>] task_work_run+0x6c/0x97
> [  136.803814]  [<ffffffff8100188a>] prepare_exit_to_usermode+0xb0/0xcc
> [  136.803814]  [<ffffffff81001a04>] syscall_return_slowpath+0x15e/0x1c2
> [  136.803814]  [<ffffffff81001ad7>] do_syscall_64+0x6f/0x76
> [  136.803814]  [<ffffffff81b22424>] entry_SYSCALL64_slow_path+0x25/0x25
> [  138.538370] ---[ end trace 90914a16bd3272b8 ]---

Don't see this on current kernels (e.g. 4.9-rc1).

Paul Moore Oct. 21, 2016, 4:19 p.m. UTC | #3

On Thu, Oct 20, 2016 at 7:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 12:07 PM, Paul Moore <paul@paul-moore.com> wrote:
>> On Thu, Oct 20, 2016 at 2:29 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>> On Thu, Oct 20, 2016 at 7:58 AM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
>>>> On 10/20/2016 02:52 AM, Cong Wang wrote:
>>>>> A kernel warning inside __local_bh_enable_ip() was reported by people
>>>>> running SELinux, this is caused due to some SELinux functions
>>>>> (indirectly) call peernet2id() with IRQ disabled in process context,
>>>>> when we re-enable BH with IRQ disabled kernel complains. Shut up this
>>>>> warning by saving IRQ context in peernet2id(), BH is still implicitly
>>>>> disabled.
>>>>
>>>> Not sure this suffices; kill_fasync() -> send_sigio() ->
>>>> send_sigio_to_task() -> sigio_perm() -> security_file_send_sigiotask()
>>>> -> selinux_file_send_sigiotask() -> ... -> audit_log() -> ... ->
>>>> peernet2id()
>>>
>>> Oh, this is a new one. kill_fasync() is called in IRQ handler, so we actually
>>> do multicast in IRQ context.... It makes no sense, netlink multicast could
>>> be very expensive if we have many listeners.
>>
>> I'm sure there are a few others I don't know about, but I believe the
>> only commonly used audit multicast listener is systemd.
>
> But user-space is free to listen to this group, right? If so this is just open
> for a potential DDOS attack.

Listeners are required to have CAP_AUDIT_READ.

>>> I am Cc'ing Richard who added that multicast in audit_log_end(). It seems
>>> not easy to just move the multicast to a workqueue, since the skb is copied
>>> from audit_buffer which is freed immediately after that, probably need another
>>> queue like audit_skb_queue.
>>
>> This approach would double the queue size which is something I want to
>> avoid.  I would suggest sticking with a single queue and dealing with
>> the netlink message link fixup and multicast send in the existing
>> netlink unicast thread; basically we would just be moving the
>> multicast code from audit_log_end() into kauditd_thread().  This is
>> the same approach I mentioned earlier off-list.
>
> This is what I did in the follow up patch. I attach the updated version
> in this email for you to review ...

I think there is still some confusion.  The second patch you posted
still has two queues with potentially duplicated (minus the length
tweaks) skbs.

What I am talking about is queuing the skb in audit_log_end(), without
any modification, waking up the kauditd_thread, and then letting the
kauditd_thread() function do both the netlink multicast and unicast
sends, complete with the skb_copy() and length tweaks.  This way we
only queue one copy of the skb.  To help make this more clear, I'll
work up a patch and CC you.

However, let me say this one more time: this is *NOT* a change I want
to make during the -rcX cycle, this is a change that we should do for
-next and submit during the next merge window after is has been tested
and soaked in linux-next.  Given where we are at right now - it's
Friday and I expect -rc2 on Sunday - I think the best course of action
is to revert the original patch and move on.  I'm going to do that now
and I'll submit it to netdev as soon as I've done some basic sanity
checks.

> ... I still can't make selinux-testsuites working
> on my Fedora even though I have SELinux=enforcing, anyhow I don't
> see any kernel warning in my dmesg at least.

Stephen already responded and I agree with him, it looks like
something might be off with your kernel config.  Everything works
correctly for me with upstream and Fedora Rawhide kernels.

>> However, that isn't something I want to mess with as a regression fix,
>> mostly because I really want to see this regression gone by -rc2 as it
>> is making SELinux testing a real pain.  If the patch posted at the top
>> of this thread isn't a suitable fix, we really should revert the
>> original patch.
>
> Since you want to test SELinux anyway, please test the attached one.

Cong Wang Oct. 21, 2016, 6:02 p.m. UTC | #4

On Fri, Oct 21, 2016 at 9:19 AM, Paul Moore <paul@paul-moore.com> wrote:
> On Thu, Oct 20, 2016 at 7:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> This is what I did in the follow up patch. I attach the updated version
>> in this email for you to review ...
>
> I think there is still some confusion.  The second patch you posted
> still has two queues with potentially duplicated (minus the length
> tweaks) skbs.

The current code without my patch is already this, the only difference
is there is no queue for multicast case, duplication is already there.
So, why do you expect me to fix two problems in one patch? This
is totally unfair, it is probably based on your eager to revert...

>
> What I am talking about is queuing the skb in audit_log_end(), without
> any modification, waking up the kauditd_thread, and then letting the
> kauditd_thread() function do both the netlink multicast and unicast
> sends, complete with the skb_copy() and length tweaks.  This way we
> only queue one copy of the skb.  To help make this more clear, I'll
> work up a patch and CC you.

Sure, I hate the skb_copy() too since it could be in a IRQ handler,
I didn't remove it because that would make the patch more complicated
than the current one. We can always improve this later for the next merge
window, can't we? Why are you pushing something irrelevant to my
patch to make it unnecessarily complicated?

> However, let me say this one more time: this is *NOT* a change I want
> to make during the -rcX cycle, this is a change that we should do for
> -next and submit during the next merge window after is has been tested
> and soaked in linux-next.  Given where we are at right now - it's
> Friday and I expect -rc2 on Sunday - I think the best course of action
> is to revert the original patch and move on.  I'm going to do that now
> and I'll submit it to netdev as soon as I've done some basic sanity
> checks.

The problem with this is: I would have to revert this revert for the next
merge window, in the end you would have the following in git log:

1) original one
2) revert
3) audit fix
4) revert the above revert

comparing with:

1) original one
2) audit fix

You just want to make things unnecessarily complicated.

You need to really CALM DOWN, -rc2 is NOT late, assuming -rc7 is the final
release candidate, we still have 5 weeks to fix it, why are you so scared?

We have dealt much more complicated patch/backport for networking
for -stable. Please don't panic.

Richard Guy Briggs Oct. 21, 2016, 7:39 p.m. UTC | #5

On 2016-10-21 11:02, Cong Wang wrote:
> On Fri, Oct 21, 2016 at 9:19 AM, Paul Moore <paul@paul-moore.com> wrote:
> > On Thu, Oct 20, 2016 at 7:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >> This is what I did in the follow up patch. I attach the updated version
> >> in this email for you to review ...
> >
> > I think there is still some confusion.  The second patch you posted
> > still has two queues with potentially duplicated (minus the length
> > tweaks) skbs.
> 
> The current code without my patch is already this, the only difference
> is there is no queue for multicast case, duplication is already there.
> So, why do you expect me to fix two problems in one patch? This
> is totally unfair, it is probably based on your eager to revert...
> 
> >
> > What I am talking about is queuing the skb in audit_log_end(), without
> > any modification, waking up the kauditd_thread, and then letting the
> > kauditd_thread() function do both the netlink multicast and unicast
> > sends, complete with the skb_copy() and length tweaks.  This way we
> > only queue one copy of the skb.  To help make this more clear, I'll
> > work up a patch and CC you.
> 
> Sure, I hate the skb_copy() too since it could be in a IRQ handler,
> I didn't remove it because that would make the patch more complicated
> than the current one. We can always improve this later for the next merge
> window, can't we? Why are you pushing something irrelevant to my
> patch to make it unnecessarily complicated?
> 
> 
> > However, let me say this one more time: this is *NOT* a change I want
> > to make during the -rcX cycle, this is a change that we should do for
> > -next and submit during the next merge window after is has been tested
> > and soaked in linux-next.  Given where we are at right now - it's
> > Friday and I expect -rc2 on Sunday - I think the best course of action
> > is to revert the original patch and move on.  I'm going to do that now
> > and I'll submit it to netdev as soon as I've done some basic sanity
> > checks.
> 
> The problem with this is: I would have to revert this revert for the next
> merge window, in the end you would have the following in git log:
> 
> 1) original one
> 2) revert
> 3) audit fix
> 4) revert the above revert
> 
> comparing with:
> 
> 1) original one
> 2) audit fix
> 
> You just want to make things unnecessarily complicated.

I agree here.  I've been following this, thinking about it, but don't
yet have a solid recommendation about the way to proceed yet, but
reverting it does not seem like the right solution.

> You need to really CALM DOWN, -rc2 is NOT late, assuming -rc7 is the final
> release candidate, we still have 5 weeks to fix it, why are you so scared?

A revert seems pretty impulsive to me now.

> We have dealt much more complicated patch/backport for networking
> for -stable. Please don't panic.

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635

Paul Moore Oct. 21, 2016, 8:03 p.m. UTC | #6

On Fri, Oct 21, 2016 at 2:02 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Oct 21, 2016 at 9:19 AM, Paul Moore <paul@paul-moore.com> wrote:
>> On Thu, Oct 20, 2016 at 7:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>> This is what I did in the follow up patch. I attach the updated version
>>> in this email for you to review ...
>>
>> I think there is still some confusion.  The second patch you posted
>> still has two queues with potentially duplicated (minus the length
>> tweaks) skbs.
>
> The current code without my patch is already this, the only difference
> is there is no queue for multicast case, duplication is already there.

The difference is the period of time where the skbs are duplicated.
You patch duplicates the skb and then queues them, I'm suggesting
putting a single skb in the queue and then only duplicating it once it
has been pulled off the queue.

> So, why do you expect me to fix two problems in one patch? This
> is totally unfair, it is probably based on your eager to revert...

All I've been asking for this week is a fix before -rc2 is released; I
think I've been pretty clear and consistent about that.  At the start
of the week I didn't care if it was a revert or some other fix, so
long as the fix was relatively small and easily verified/tested.  I
did say that if we got to the end of the week and we didn't have a
solution in place I would advocate for a revert.  It's Friday
afternoon as I type this.

I've also been pretty clear from the very beginning that I don't
consider a rework of the audit multicast code to be a candidate for
4.9-rcX, that is -next material as far as I'm concerned.  I'll readily
admit that perhaps I'm more conservative than most maintainers, but I
take that approach to try to keep from breaking other subsystems (and
avoid situations like these, because this thread is so much fun after
all).

>> What I am talking about is queuing the skb in audit_log_end(), without
>> any modification, waking up the kauditd_thread, and then letting the
>> kauditd_thread() function do both the netlink multicast and unicast
>> sends, complete with the skb_copy() and length tweaks.  This way we
>> only queue one copy of the skb.  To help make this more clear, I'll
>> work up a patch and CC you.
>
> Sure, I hate the skb_copy() too since it could be in a IRQ handler,
> I didn't remove it because that would make the patch more complicated
> than the current one. We can always improve this later for the next merge
> window, can't we? Why are you pushing something irrelevant to my
> patch to make it unnecessarily complicated?

I don't even know where to begin ... please just re-read what I've
said above as well as previously this week.  I just want a simple fix
for 4.9-rc.  I'm not going to sign-off/ack a rework of the audit
multicast code for 4.9-rc.

>> However, let me say this one more time: this is *NOT* a change I want
>> to make during the -rcX cycle, this is a change that we should do for
>> -next and submit during the next merge window after is has been tested
>> and soaked in linux-next.  Given where we are at right now - it's
>> Friday and I expect -rc2 on Sunday - I think the best course of action
>> is to revert the original patch and move on.  I'm going to do that now
>> and I'll submit it to netdev as soon as I've done some basic sanity
>> checks.
>
> The problem with this is: I would have to revert this revert for the next
> merge window, in the end you would have the following in git log:
>
> 1) original one
> 2) revert
> 3) audit fix
> 4) revert the above revert
>
> comparing with:
>
> 1) original one
> 2) audit fix
>
> You just want to make things unnecessarily complicated.

No.  What I want, (one more time) is a fix in -rc2.

> You need to really CALM DOWN, -rc2 is NOT late, assuming -rc7 is the final
> release candidate, we still have 5 weeks to fix it, why are you so scared?

We don't know each other, in fact before this week I'm not sure we've
even emailed each other (my apologies if we have), so I'm just
ignoring statements like the above, but just some advice (feel free to
ignore): I don't think anyone ever responds well to demands to "CALM
DOWN", especially when I feel my side of the conversation has been
quite civil.

> We have dealt much more complicated patch/backport for networking
> for -stable. Please don't panic.

Paul Moore Oct. 21, 2016, 8:15 p.m. UTC | #7

On Fri, Oct 21, 2016 at 3:39 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2016-10-21 11:02, Cong Wang wrote:
>> On Fri, Oct 21, 2016 at 9:19 AM, Paul Moore <paul@paul-moore.com> wrote:
>> > On Thu, Oct 20, 2016 at 7:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> >> This is what I did in the follow up patch. I attach the updated version
>> >> in this email for you to review ...
>> >
>> > I think there is still some confusion.  The second patch you posted
>> > still has two queues with potentially duplicated (minus the length
>> > tweaks) skbs.
>>
>> The current code without my patch is already this, the only difference
>> is there is no queue for multicast case, duplication is already there.
>> So, why do you expect me to fix two problems in one patch? This
>> is totally unfair, it is probably based on your eager to revert...
>>
>> >
>> > What I am talking about is queuing the skb in audit_log_end(), without
>> > any modification, waking up the kauditd_thread, and then letting the
>> > kauditd_thread() function do both the netlink multicast and unicast
>> > sends, complete with the skb_copy() and length tweaks.  This way we
>> > only queue one copy of the skb.  To help make this more clear, I'll
>> > work up a patch and CC you.
>>
>> Sure, I hate the skb_copy() too since it could be in a IRQ handler,
>> I didn't remove it because that would make the patch more complicated
>> than the current one. We can always improve this later for the next merge
>> window, can't we? Why are you pushing something irrelevant to my
>> patch to make it unnecessarily complicated?
>>
>>
>> > However, let me say this one more time: this is *NOT* a change I want
>> > to make during the -rcX cycle, this is a change that we should do for
>> > -next and submit during the next merge window after is has been tested
>> > and soaked in linux-next.  Given where we are at right now - it's
>> > Friday and I expect -rc2 on Sunday - I think the best course of action
>> > is to revert the original patch and move on.  I'm going to do that now
>> > and I'll submit it to netdev as soon as I've done some basic sanity
>> > checks.
>>
>> The problem with this is: I would have to revert this revert for the next
>> merge window, in the end you would have the following in git log:
>>
>> 1) original one
>> 2) revert
>> 3) audit fix
>> 4) revert the above revert
>>
>> comparing with:
>>
>> 1) original one
>> 2) audit fix
>>
>> You just want to make things unnecessarily complicated.
>
> I agree here.  I've been following this, thinking about it, but don't
> yet have a solid recommendation about the way to proceed yet, but
> reverting it does not seem like the right solution.
>
>> You need to really CALM DOWN, -rc2 is NOT late, assuming -rc7 is the final
>> release candidate, we still have 5 weeks to fix it, why are you so scared?
>
> A revert seems pretty impulsive to me now.

I agree that if this issue had been identified today, it would be
impulsive to do a revert for -rc2; Stephen reported this problem
Wednesday morning.  I would also be okay waiting on a fix past -rc2 if
the solution was still under development and we all agreed on the
solution.

However, that's not the case is it?  Unless I missed something, the
fix that Cong Wang is advocating (rework the audit multicast code), is
a change that I have said I'm not going to accept during the -rc
phase.  It has been a few days now and no alternate fix has been
proposed, I'll give it a few more hours ...

David Miller Oct. 21, 2016, 8:33 p.m. UTC | #8

From: Paul Moore <paul@paul-moore.com>
Date: Fri, 21 Oct 2016 16:15:00 -0400

> However, that's not the case is it?  Unless I missed something, the
> fix that Cong Wang is advocating (rework the audit multicast code), is
> a change that I have said I'm not going to accept during the -rc
> phase.  It has been a few days now and no alternate fix has been
> proposed, I'll give it a few more hours ...

It really is the right way to fix this though.

Nothing should be emitting netlink messages, potentially en-masse
to a multicast group or broadcast, in hardware interrupt context.

I know it's been said that only systemd receives these things, so
that point doesn't need to be remade again.

We have many weeks until -final is released so I really don't
understand the reluctance at a slightly more involved fix in -rc2.  In
fact this is the most optimal time to try it this way, as we'll have
the maximum amount of time for it to have exposure for testing before
-final.

Thanks.

Paul Moore Oct. 21, 2016, 8:53 p.m. UTC | #9

On Fri, Oct 21, 2016 at 4:33 PM, David Miller <davem@davemloft.net> wrote:
> From: Paul Moore <paul@paul-moore.com>
> Date: Fri, 21 Oct 2016 16:15:00 -0400
>
>> However, that's not the case is it?  Unless I missed something, the
>> fix that Cong Wang is advocating (rework the audit multicast code), is
>> a change that I have said I'm not going to accept during the -rc
>> phase.  It has been a few days now and no alternate fix has been
>> proposed, I'll give it a few more hours ...
>
> It really is the right way to fix this though.
>
> Nothing should be emitting netlink messages, potentially en-masse
> to a multicast group or broadcast, in hardware interrupt context.
>
> I know it's been said that only systemd receives these things, so
> that point doesn't need to be remade again.

I think it is also worth noting that this code has been doing it this
way for some time now.  I say this not to advocate that it is correct,
only that there hasn't been a demonstrated problem until Cong Wang's
patch.

> We have many weeks until -final is released so I really don't
> understand the reluctance at a slightly more involved fix in -rc2.  In
> fact this is the most optimal time to try it this way, as we'll have
> the maximum amount of time for it to have exposure for testing before
> -final.

Well, I understand what you are trying to say, but the maximum amount
of time for exposure/testing would be to put it in -next.  The audit
netlink code needs a rework, but introducing such a change in the -rc
kernels is not something I'm going to do, especially when the change
which triggered the regression is an optimization that can be easily
reverted ... or fixed, but the only two options I've heard mentioned
are the audit multicast rework and the revert; if someone has a third
option I'm listening ...

Paul Moore Oct. 22, 2016, 1:55 a.m. UTC | #10

On Fri, Oct 21, 2016 at 4:53 PM, Paul Moore <paul@paul-moore.com> wrote:
> On Fri, Oct 21, 2016 at 4:33 PM, David Miller <davem@davemloft.net> wrote:
>> From: Paul Moore <paul@paul-moore.com>
>> Date: Fri, 21 Oct 2016 16:15:00 -0400
>>
>>> However, that's not the case is it?  Unless I missed something, the
>>> fix that Cong Wang is advocating (rework the audit multicast code), is
>>> a change that I have said I'm not going to accept during the -rc
>>> phase.  It has been a few days now and no alternate fix has been
>>> proposed, I'll give it a few more hours ...
>>
>> It really is the right way to fix this though.
>>
>> Nothing should be emitting netlink messages, potentially en-masse
>> to a multicast group or broadcast, in hardware interrupt context.
>>
>> I know it's been said that only systemd receives these things, so
>> that point doesn't need to be remade again.
>
> I think it is also worth noting that this code has been doing it this
> way for some time now.  I say this not to advocate that it is correct,
> only that there hasn't been a demonstrated problem until Cong Wang's
> patch.
>
>> We have many weeks until -final is released so I really don't
>> understand the reluctance at a slightly more involved fix in -rc2.  In
>> fact this is the most optimal time to try it this way, as we'll have
>> the maximum amount of time for it to have exposure for testing before
>> -final.
>
> Well, I understand what you are trying to say, but the maximum amount
> of time for exposure/testing would be to put it in -next.  The audit
> netlink code needs a rework, but introducing such a change in the -rc
> kernels is not something I'm going to do, especially when the change
> which triggered the regression is an optimization that can be easily
> reverted ... or fixed, but the only two options I've heard mentioned
> are the audit multicast rework and the revert; if someone has a third
> option I'm listening ...

It's the end of my day, and commitments over the weekend will limit
how much additional testing/work I can do so I went ahead and just
posted a simple revert to netdev, it should be in your inbox already.
Please fix this, either through a revert, or preferably some other fix
we haven't thought of yet, in time for -rc2.

Thanks.

Cong Wang Oct. 22, 2016, 3:26 a.m. UTC | #11

On Fri, Oct 21, 2016 at 1:03 PM, Paul Moore <paul@paul-moore.com> wrote:
> On Fri, Oct 21, 2016 at 2:02 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> On Fri, Oct 21, 2016 at 9:19 AM, Paul Moore <paul@paul-moore.com> wrote:
>>> On Thu, Oct 20, 2016 at 7:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>>> This is what I did in the follow up patch. I attach the updated version
>>>> in this email for you to review ...
>>>
>>> I think there is still some confusion.  The second patch you posted
>>> still has two queues with potentially duplicated (minus the length
>>> tweaks) skbs.
>>
>> The current code without my patch is already this, the only difference
>> is there is no queue for multicast case, duplication is already there.
>
> The difference is the period of time where the skbs are duplicated.
> You patch duplicates the skb and then queues them, I'm suggesting
> putting a single skb in the queue and then only duplicating it once it
> has been pulled off the queue.

I never disagree, the only thing you never explain is why we must do
it in this patch rather than a patch later?


>
>> So, why do you expect me to fix two problems in one patch? This
>> is totally unfair, it is probably based on your eager to revert...
>
> All I've been asking for this week is a fix before -rc2 is released; I


Why -rc2? Not -rc3 ... -rc7? Why should we, as a community, care
about a release candidate and specifically -rc2? If that is your own
schedule, why not cooperating with Linus' schedule to make other
people like me easy?


> think I've been pretty clear and consistent about that.  At the start
> of the week I didn't care if it was a revert or some other fix, so
> long as the fix was relatively small and easily verified/tested.  I
> did say that if we got to the end of the week and we didn't have a
> solution in place I would advocate for a revert.  It's Friday
> afternoon as I type this.

Blame the one who makes things unnecessarily complicated, please.
I am all for it.

>
> I've also been pretty clear from the very beginning that I don't
> consider a rework of the audit multicast code to be a candidate for
> 4.9-rcX, that is -next material as far as I'm concerned.  I'll readily

Calling a change of 30 lines a rework?? OMG... We definitely have
different definitions for rework. Many stable backports are more than
this size... (I know the number of lines can't tell everything, but it tells
something.)


> admit that perhaps I'm more conservative than most maintainers, but I
> take that approach to try to keep from breaking other subsystems (and
> avoid situations like these, because this thread is so much fun after
> all).


I totally understand your conservative, but in the meantime, please be
rational. If -rc7 is the final RC, then we have 5 weeks away.


>>> What I am talking about is queuing the skb in audit_log_end(), without
>>> any modification, waking up the kauditd_thread, and then letting the
>>> kauditd_thread() function do both the netlink multicast and unicast
>>> sends, complete with the skb_copy() and length tweaks.  This way we
>>> only queue one copy of the skb.  To help make this more clear, I'll
>>> work up a patch and CC you.
>>
>> Sure, I hate the skb_copy() too since it could be in a IRQ handler,
>> I didn't remove it because that would make the patch more complicated
>> than the current one. We can always improve this later for the next merge
>> window, can't we? Why are you pushing something irrelevant to my
>> patch to make it unnecessarily complicated?
>
> I don't even know where to begin ... please just re-read what I've
> said above as well as previously this week.  I just want a simple fix
> for 4.9-rc.  I'm not going to sign-off/ack a rework of the audit
> multicast code for 4.9-rc.


Calling a 30-line patch a rework is just unfair.


>
>>> However, let me say this one more time: this is *NOT* a change I want
>>> to make during the -rcX cycle, this is a change that we should do for
>>> -next and submit during the next merge window after is has been tested
>>> and soaked in linux-next.  Given where we are at right now - it's
>>> Friday and I expect -rc2 on Sunday - I think the best course of action
>>> is to revert the original patch and move on.  I'm going to do that now
>>> and I'll submit it to netdev as soon as I've done some basic sanity
>>> checks.
>>
>> The problem with this is: I would have to revert this revert for the next
>> merge window, in the end you would have the following in git log:
>>
>> 1) original one
>> 2) revert
>> 3) audit fix
>> 4) revert the above revert
>>
>> comparing with:
>>
>> 1) original one
>> 2) audit fix
>>
>> You just want to make things unnecessarily complicated.
>
> No.  What I want, (one more time) is a fix in -rc2.

You have a fix, you refuse, you consider it as a rework. It's your problem.

Cong Wang Oct. 22, 2016, 3:34 a.m. UTC | #12

On Fri, Oct 21, 2016 at 6:55 PM, Paul Moore <paul@paul-moore.com> wrote:
> It's the end of my day, and commitments over the weekend will limit
> how much additional testing/work I can do so I went ahead and just
> posted a simple revert to netdev, it should be in your inbox already.
> Please fix this, either through a revert, or preferably some other fix
> we haven't thought of yet, in time for -rc2.

Creative artificial deadline! Next time pick -rc1 please so that no single
fix can go in after merge window!!

Cong Wang Oct. 22, 2016, 3:53 a.m. UTC | #13

On Fri, Oct 21, 2016 at 1:33 PM, David Miller <davem@davemloft.net> wrote:
> From: Paul Moore <paul@paul-moore.com>
> Date: Fri, 21 Oct 2016 16:15:00 -0400
>
>> However, that's not the case is it?  Unless I missed something, the
>> fix that Cong Wang is advocating (rework the audit multicast code), is
>> a change that I have said I'm not going to accept during the -rc
>> phase.  It has been a few days now and no alternate fix has been
>> proposed, I'll give it a few more hours ...
>
> It really is the right way to fix this though.
>
> Nothing should be emitting netlink messages, potentially en-masse
> to a multicast group or broadcast, in hardware interrupt context.

+1

>
> I know it's been said that only systemd receives these things, so
> that point doesn't need to be remade again.
>
> We have many weeks until -final is released so I really don't
> understand the reluctance at a slightly more involved fix in -rc2.  In
> fact this is the most optimal time to try it this way, as we'll have
> the maximum amount of time for it to have exposure for testing before
> -final.

Exactly, this is how release candidates work and this is why Linus usually
puts 6 or 7 rc's before a final release, so that we have 6/7 weeks to fix bugs
(and bugs of bug fixes of course) from the merge window.

It is very common we have hidden bugs before a merge window, since they
were just sitting in a subsystem maintainer's tree, more testers come in after
merge window, a bug like this one is clearly a cross-subsystem one, and
-rc2 a perfect time to fix it. This is how we work for years.

Audit subsystem is in a different world with the rest of us. Sigh.

Paul Moore Oct. 22, 2016, 7:02 p.m. UTC | #14

On Fri, Oct 21, 2016 at 11:26 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Oct 21, 2016 at 1:03 PM, Paul Moore <paul@paul-moore.com> wrote:
>> On Fri, Oct 21, 2016 at 2:02 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>> On Fri, Oct 21, 2016 at 9:19 AM, Paul Moore <paul@paul-moore.com> wrote:
>>>> On Thu, Oct 20, 2016 at 7:35 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>>>> This is what I did in the follow up patch. I attach the updated version
>>>>> in this email for you to review ...
>>>>
>>>> I think there is still some confusion.  The second patch you posted
>>>> still has two queues with potentially duplicated (minus the length
>>>> tweaks) skbs.
>>>
>>> The current code without my patch is already this, the only difference
>>> is there is no queue for multicast case, duplication is already there.
>>
>> The difference is the period of time where the skbs are duplicated.
>> You patch duplicates the skb and then queues them, I'm suggesting
>> putting a single skb in the queue and then only duplicating it once it
>> has been pulled off the queue.
>
> I never disagree, the only thing you never explain is why we must do
> it in this patch rather than a patch later?

It seems obvious: if you do the skb_copy() before you queue the skbs
you are doubling the amount of memory used which is undesirable.

[net] net: saving irq context for peernet2id()

Commit Message

Comments

Patch