diff mbox

[net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

Message ID 58C08535.3070000@iogearbox.net
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Daniel Borkmann March 8, 2017, 10:27 p.m. UTC
[ + Kees, Laura, and Dave ]

On 03/08/2017 08:25 PM, Linus Torvalds wrote:
> Adding x86 people too, since this seems to be something off about
> ARCH_HAS_SET_MEMORY for x86-32.
>
> The code seems to be shared between x86-32 and 64, I'm not seeing why
> set_memory_r[ow]() should fail on one but not the other.
>
> Considering that it seems to be flaky even on 32-bit, maybe it's
> timing-related, or possibly related to TLB sizes or whatever (ie more
> likely hidden by a larger TLB on more modern hardware?)
>
> Anyway, just looking at change_page_attr_set_clr(), I notice that the
> page alias checking treats NX specially:
>
>          /* No alias checking for _NX bit modifications */
>          checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
>
> which seems insane. Why would NX be different from other protection
> bits (like _PAGE_RW)?
>
> But that doesn't explain why the bpf code would have issues with this
> all only on x86-32.
>
> Maybe somebody else can see why ARCH_HAS_SET_MEMORY would depend on
> 64-bit only..

Last days I tried to narrow this down a bit, I have a dirty hack I
used just for the sake of experimenting when booting into Fengguang's
x86-32 test kernel via qemu:

  lib/Makefile      |  3 ++-
  lib/test_setmem.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
  2 files changed, 52 insertions(+), 1 deletion(-)
  create mode 100644 lib/test_setmem.c

Comments

Kees Cook March 8, 2017, 10:36 p.m. UTC | #1
On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> [   28.474232] rodata_test: test data was not read only
> [...]

In my tests so far, I've never been able to get rodata_test to fail
(Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
recheck under Qemu 2.7.1. Do you see these failures on real hardware?

-Kees
Linus Torvalds March 8, 2017, 10:43 p.m. UTC | #2
On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> The issue seems to be accessing buff first (can be read or write access)
> and then doing set_memory_ro() doesn't make it read-only immediately,
> meaning the subsequent call into probe_kernel_write() will succeed without
> error.
>
> Then, if I don't touch buff first and only do the set_memory_ro() seems
> to work and probe_kernel_write() will then fail as expected due to pages
> being read-only now.

Ok, that definitely sounds like a TLB invalidate didn't happen.

> Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
> example, it "kind of" works most of the time (see last log extract below),
> and probe_kernel_write() will fail.

Yeah, very much consistent with a missing TLB invalidate. Scheduling
will end up invalidating it, although if it's a global page even that
might not do it (but eventually the entry will just get flushed due to
other activity).

> None of this seems an issue with x86_64 and the test_setmem runs fine all
> the time, same for the actual BPF stuff.

The code does look somewhat confused about when to actually flush
things - see my earlier note about NX - but it would seem to always do
__flush_tlb_all() unless I missed something. At least as long as
CPA_FLUSHTLB is set. Maybe some case forgets to set that..

       Linus
Daniel Borkmann March 8, 2017, 10:51 p.m. UTC | #3
On 03/08/2017 11:36 PM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> [   28.474232] rodata_test: test data was not read only
>> [...]
>
> In my tests so far, I've never been able to get rodata_test to fail
> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
> recheck under Qemu 2.7.1. Do you see these failures on real hardware?

The x86_64 tests on real hardware, and the x86-32 only in the qemu
environment with the reproducer script. Haven't tested x86-32 kernel
outside of qemu so far.

Thanks,
Daniel
Laura Abbott March 8, 2017, 11:55 p.m. UTC | #4
On 03/08/2017 02:36 PM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> [   28.474232] rodata_test: test data was not read only
>> [...]
> 
> In my tests so far, I've never been able to get rodata_test to fail
> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
> 
> -Kees
> 

FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
and rawhide respectively. 

I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
goes away. I can't tell if this is an indication of magically hiding
the TLB problem or if there is an issue with !X86_PAE invalidation.


Thanks,
Laura
kbuild test robot March 9, 2017, 1:34 a.m. UTC | #5
On Wed, Mar 08, 2017 at 02:43:44PM -0800, Linus Torvalds wrote:
>On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> The issue seems to be accessing buff first (can be read or write access)
>> and then doing set_memory_ro() doesn't make it read-only immediately,
>> meaning the subsequent call into probe_kernel_write() will succeed without
>> error.
>>
>> Then, if I don't touch buff first and only do the set_memory_ro() seems
>> to work and probe_kernel_write() will then fail as expected due to pages
>> being read-only now.
>
>Ok, that definitely sounds like a TLB invalidate didn't happen.
>
>> Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
>> example, it "kind of" works most of the time (see last log extract below),
>> and probe_kernel_write() will fail.
>
>Yeah, very much consistent with a missing TLB invalidate. Scheduling
>will end up invalidating it, although if it's a global page even that
>might not do it (but eventually the entry will just get flushed due to
>other activity).
>
>> None of this seems an issue with x86_64 and the test_setmem runs fine all
>> the time, same for the actual BPF stuff.
>
>The code does look somewhat confused about when to actually flush
>things - see my earlier note about NX - but it would seem to always do
>__flush_tlb_all() unless I missed something. At least as long as
>CPA_FLUSHTLB is set. Maybe some case forgets to set that..

Not sure if it's relevant, but out of 189 boots there are 2 boots
showing the below "CPA: called for zero pte." warning.

[    7.116932] random: trinity: uninitialized urandom read (4 bytes read)
[   16.366468] sock: process `trinity-main' is using obsolete setsockopt SO_BSDCOMPAT
[   17.202396] BUG: unable to handle kernel paging request at 655d9eb2
[   17.204081] IP: __release_sock+0x6e/0x100
[   17.205207] *pde = 00000000
[   17.205208]
[   17.206755] Oops: 0000 [#1]
[   17.207686] CPU: 0 PID: 382 Comm: trinity-main Not tainted 4.10.0-rc8-02017-g9d876e7 #1
[   17.209819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[   17.212431] task: d625d200 task.stack: d6222000
[   17.213655] EIP: __release_sock+0x6e/0x100
[   17.214833] EFLAGS: 00010246 CPU: 0
[   17.215951] EAX: 00000000 EBX: 655d9eb2 ECX: 00000000 EDX: 00000201
[   17.217587] ESI: 00000605 EDI: d6064800 EBP: d6223ef4 ESP: d6223ee8
[   17.219185]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[   17.220602] CR0: 80050033 CR2: 655d9eb2 CR3: 1610f000 CR4: 00000610
[   17.221966] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   17.223444] DR6: ffff0ff0 DR7: 00000600
[   17.224343] Call Trace:
[   17.225007]  release_sock+0x2e/0x80
[   17.225900]  sock_setsockopt+0x8c/0x880
[   17.226857]  SyS_socketcall+0x658/0x6a0
[   17.227804]  do_fast_syscall_32+0x9a/0x160
[   17.228765]  entry_SYSENTER_32+0x4c/0x7b
[   17.229694] EIP: 0xb7777cc5
[   17.230428] EFLAGS: 00000282 CPU: 0
[   17.231263] EAX: ffffffda EBX: 0000000e ECX: bfedce00 EDX: bfedce80
[   17.232582] ESI: 0000001a EDI: 000000ae EBP: b754f93c ESP: bfedcdec
[   17.233882]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[   17.235044] Code: eb 29 8d 76 00 89 da 89 f8 ff 97 98 01 00 00 31 c9 ba 06 08 00 00 b8 d8 19 b1 c1 e8 ed 3d 85 ff e8 e8 62 04 00 85 f6 89 f3 74 42 <8b>
 33 0f 18 06 8b 43 48 a8 01 74 0e 83 e0 fe 74 09 80 3d 3d 9c
[   17.240429] EIP: __release_sock+0x6e/0x100 SS:ESP: 0068:d6223ee8
[   17.241689] CR2: 00000000655d9eb2
[   17.242509] ---[ end trace dc10480164c75444 ]---
[   17.243569] ------------[ cut here ]------------
[   17.243574] WARNING: CPU: 0 PID: 15 at arch/x86/mm/pageattr.c:1150 __cpa_process_fault+0x388/0x390
[   17.243575] CPA: called for zero pte. vaddr = d7ab4000 cpa->vaddr = d7ab4000
[   17.243577] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G      D         4.10.0-rc8-02017-g9d876e7 #1
[   17.243578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[   17.243582] Workqueue: events bpf_prog_free_deferred
[   17.243583] Call Trace:
[   17.243588]  dump_stack+0x16/0x25
[   17.243588]  dump_stack+0x16/0x25
[   17.243590]  __warn+0xd1/0xf0
[   17.243592]  ? __cpa_process_fault+0x388/0x390
[   17.243593]  warn_slowpath_fmt+0x3b/0x40
[   17.243594]  __cpa_process_fault+0x388/0x390
[   17.243596]  ? lookup_address_in_pgd+0xa/0x90
[   17.243598]  __change_page_attr+0x520/0x6c0
[   17.243600]  ? pfn_range_is_mapped+0xe/0x80
[   17.243601]  __change_page_attr_set_clr+0x38/0x180
[   17.243603]  change_page_attr_set_clr+0x107/0x3f0
[   17.243605]  ? dequeue_entity+0x86/0x230
[   17.243607]  set_memory_rw+0x3a/0x40
[   17.243608]  bpf_prog_free_deferred+0x16/0x30
[   17.243612]  process_one_work+0xfc/0x440
[   17.243614]  ? pick_next_task_fair+0x149/0x1d0
[   17.243615]  worker_thread+0x37/0x4e0
[   17.243617]  kthread+0xdd/0x110
[   17.243618]  ? process_one_work+0x440/0x440
[   17.243620]  ? __kthread_create_on_node+0x100/0x100
[   17.243622]  ret_from_fork+0x21/0x2c
[   17.243623] ---[ end trace dc10480164c75445 ]---
[   17.243627] BUG: unable to handle kernel NULL pointer dereference at 00000007
[   17.243630] IP: ___cache_free+0x14/0x140
[   17.243631] *pde = 00000000
[   17.243631]
[   17.243633] Oops: 0000 [#2]
[   17.243635] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G      D W       4.10.0-rc8-02017-g9d876e7 #1
[   17.243635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[   17.243636] Workqueue: events bpf_prog_free_deferred
[   17.243637] task: d524e8c0 task.stack: d5254000
[   17.243639] EIP: ___cache_free+0x14/0x140
[   17.243640] EFLAGS: 00010046 CPU: 0
[   17.243641] EAX: d6945af8 EBX: 00000003 ECX: c10dc08e EDX: 3beb072d
[   17.243642] ESI: d6945af8 EDI: 3beb072d EBP: d5255f00 ESP: d5255ed4
[   17.243643]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[   17.243644] CR0: 80050033 CR2: 00000007 CR3: 1610f000 CR4: 00000610
[   17.243647] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   17.243648] DR6: ffff0ff0 DR7: 00000600
[   17.243648] Call Trace:
[   17.243650]  kfree+0x64/0xe0
[   17.243652]  ? bpf_prog_free_deferred+0x1e/0x30
[   17.243653]  bpf_prog_free_deferred+0x1e/0x30
[   17.243654]  process_one_work+0xfc/0x440
[   17.243656]  ? pick_next_task_fair+0x149/0x1d0
[   17.243658]  worker_thread+0x37/0x4e0
[   17.243659]  kthread+0xdd/0x110
[   17.243661]  ? process_one_work+0x440/0x440
[   17.243662]  ? __kthread_create_on_node+0x100/0x100
[   17.243664]  ret_from_fork+0x21/0x2c
[   17.243664] Code: 89 da 89 f0 ff 0d 64 3e b4 c1 e8 f8 fe ff ff 83 c4 14 5b 5e 5d c3 90 55 89 e5 57 56 53 83 ec 20 e8 d2 21 72 00 8b 18 89 c6 89 d7 <8b> 43 04 39 03 73 65 a1 a0 fa 4b c2 85 c0 7f 1c 8b 03 8d 50 01
[   17.243684] EIP: ___cache_free+0x14/0x140 SS:ESP: 0068:d5255ed4
[   17.243684] CR2: 0000000000000007
[   17.243685] ---[ end trace dc10480164c75446 ]---
[   17.243686] Kernel panic - not syncing: Fatal exception
[   17.243687] Kernel Offset: disabled

Regards,
Fengguang
Kees Cook March 9, 2017, 5:36 a.m. UTC | #6
On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <labbott@redhat.com> wrote:
> On 03/08/2017 02:36 PM, Kees Cook wrote:
>> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>> [   28.474232] rodata_test: test data was not read only
>>> [...]
>>
>> In my tests so far, I've never been able to get rodata_test to fail
>> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
>> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>>
>> -Kees
>>
>
> FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
> and rawhide respectively.
>
> I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
> I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
> goes away. I can't tell if this is an indication of magically hiding
> the TLB problem or if there is an issue with !X86_PAE invalidation.

I found my difference. I normally run qemu with "-cpu host" which
makes the failure go away. With "-cpu kvm64", I see the rodata_test
failure immediately. Seems like this may be a kvm cpu feature
emulation bug? I'll see if I can find the specific cpu feature in the
morning...

-Kees
Daniel Borkmann March 9, 2017, 1:04 p.m. UTC | #7
On 03/09/2017 06:36 AM, Kees Cook wrote:
> On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <labbott@redhat.com> wrote:
>> On 03/08/2017 02:36 PM, Kees Cook wrote:
>>> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> [   28.474232] rodata_test: test data was not read only
>>>> [...]
>>>
>>> In my tests so far, I've never been able to get rodata_test to fail
>>> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can
>>> recheck under Qemu 2.7.1. Do you see these failures on real hardware?
>>>
>>> -Kees
>>
>> FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24
>> and rawhide respectively.
>>
>> I also notice that CONFIG_X86_PAE is turned off in the defconfig. If
>> I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem
>> goes away. I can't tell if this is an indication of magically hiding
>> the TLB problem or if there is an issue with !X86_PAE invalidation.
>
> I found my difference. I normally run qemu with "-cpu host" which
> makes the failure go away. With "-cpu kvm64", I see the rodata_test
> failure immediately. Seems like this may be a kvm cpu feature
> emulation bug? I'll see if I can find the specific cpu feature in the
> morning...

Interesting! Changing to "-cpu host" makes rodata_test succeed plus
my test_setmem and the test_bpf suite runs fine as well. Haven't seen
a corruption since. Switching back to "-cpu kvm64" I immediately see
mentioned issues again.

With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
was set each time we switched attrs and a cpa_flush_range() was
performed (with the correct number of pages and cache set to 0). That
would be a __flush_tlb_all() eventually.

Hmm, it indeed might seem likely that this could be an emulation bug.

Thanks,
Daniel
Thomas Gleixner March 9, 2017, 1:10 p.m. UTC | #8
On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> was set each time we switched attrs and a cpa_flush_range() was
> performed (with the correct number of pages and cache set to 0). That
> would be a __flush_tlb_all() eventually.
> 
> Hmm, it indeed might seem likely that this could be an emulation bug.

Which variant of __flush_tlb_all() is used when the test fails?

Check for the following flags in /proc/cpuinfo: pge invpcid

Thanks,

	tglx
Daniel Borkmann March 9, 2017, 1:25 p.m. UTC | #9
On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>> was set each time we switched attrs and a cpa_flush_range() was
>> performed (with the correct number of pages and cache set to 0). That
>> would be a __flush_tlb_all() eventually.
>>
>> Hmm, it indeed might seem likely that this could be an emulation bug.
>
> Which variant of __flush_tlb_all() is used when the test fails?
>
> Check for the following flags in /proc/cpuinfo: pge invpcid

I added the following and booted with both variants:

printk("X86_FEATURE_PGE:%u\n",     static_cpu_has(X86_FEATURE_PGE));
printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));

"-cpu host" gives:

[    8.326117] X86_FEATURE_PGE:1
[    8.326381] X86_FEATURE_INVPCID:1

"-cpu kvm64" gives:

[    8.517069] X86_FEATURE_PGE:1
[    8.517393] X86_FEATURE_INVPCID:0

Thanks,
Daniel
Thomas Gleixner March 9, 2017, 2:49 p.m. UTC | #10
On Thu, 9 Mar 2017, Daniel Borkmann wrote:

> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
> > On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> > > With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> > > code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> > > was set each time we switched attrs and a cpa_flush_range() was
> > > performed (with the correct number of pages and cache set to 0). That
> > > would be a __flush_tlb_all() eventually.
> > > 
> > > Hmm, it indeed might seem likely that this could be an emulation bug.
> > 
> > Which variant of __flush_tlb_all() is used when the test fails?
> > 
> > Check for the following flags in /proc/cpuinfo: pge invpcid
> 
> I added the following and booted with both variants:
> 
> printk("X86_FEATURE_PGE:%u\n",     static_cpu_has(X86_FEATURE_PGE));
> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
> 
> "-cpu host" gives:
> 
> [    8.326117] X86_FEATURE_PGE:1
> [    8.326381] X86_FEATURE_INVPCID:1
> 
> "-cpu kvm64" gives:
> 
> [    8.517069] X86_FEATURE_PGE:1
> [    8.517393] X86_FEATURE_INVPCID:0

That's the one which fails. So it's using the CR4 based flushing. Just ran
a test on a physical system with PGE=1 and INVPCID=0. Works fine.

Emulation problem?

Thanks,

	tglx
Daniel Borkmann March 9, 2017, 5:51 p.m. UTC | #11
On 03/09/2017 03:49 PM, Thomas Gleixner wrote:
> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
>>> On Thu, 9 Mar 2017, Daniel Borkmann wrote:
>>>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
>>>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
>>>> was set each time we switched attrs and a cpa_flush_range() was
>>>> performed (with the correct number of pages and cache set to 0). That
>>>> would be a __flush_tlb_all() eventually.
>>>>
>>>> Hmm, it indeed might seem likely that this could be an emulation bug.
>>>
>>> Which variant of __flush_tlb_all() is used when the test fails?
>>>
>>> Check for the following flags in /proc/cpuinfo: pge invpcid
>>
>> I added the following and booted with both variants:
>>
>> printk("X86_FEATURE_PGE:%u\n",     static_cpu_has(X86_FEATURE_PGE));
>> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
>>
>> "-cpu host" gives:
>>
>> [    8.326117] X86_FEATURE_PGE:1
>> [    8.326381] X86_FEATURE_INVPCID:1
>>
>> "-cpu kvm64" gives:
>>
>> [    8.517069] X86_FEATURE_PGE:1
>> [    8.517393] X86_FEATURE_INVPCID:0
>
> That's the one which fails. So it's using the CR4 based flushing. Just ran
> a test on a physical system with PGE=1 and INVPCID=0. Works fine.
>
> Emulation problem?

So in the git qemu code base (target/i386/helper.c), cr3 vs cr4 looks
like the following, both sharing the tlb_flush() itself:

void cpu_x86_update_cr3(CPUX86State *env, target_ulong new_cr3)
{
     X86CPU *cpu = x86_env_get_cpu(env);

     env->cr[3] = new_cr3;
     if (env->cr[0] & CR0_PG_MASK) {
         qemu_log_mask(CPU_LOG_MMU,
                         "CR3 update: CR3=" TARGET_FMT_lx "\n", new_cr3);
         tlb_flush(CPU(cpu));
     }
}

void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
{
     X86CPU *cpu = x86_env_get_cpu(env);
     uint32_t hflags;

#if defined(DEBUG_MMU)
     printf("CR4 update: %08x -> %08x\n", (uint32_t)env->cr[4], new_cr4);
#endif
     if ((new_cr4 ^ env->cr[4]) &
         (CR4_PGE_MASK | CR4_PAE_MASK | CR4_PSE_MASK |
          CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_LA57_MASK)) {
         tlb_flush(CPU(cpu));
     }

     [...]
}

I added some debugging around __native_flush_tlb_global_irq_disabled()
and if I understand it correctly, the idea of cr4 is that we need to
toggle X86_CR4_PGE in order to trigger a TLB flush.

What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
consistent to native_read_cr4() and since cr4 is != 0, it tells me
based on the comment in native_read_cr4() that cr4 seems to be
supported. Thus, meaning we end up with writing ...

   native_write_cr4(0x610);
   native_write_cr4(0x610);

... twice, and this just doesn't trigger the desired TLB flush. I
changed the code into the following ...

         cr4 = this_cpu_read(cpu_tlbstate.cr4);
         /* clear PGE */
-       native_write_cr4(cr4 & ~X86_CR4_PGE);
+       native_write_cr4(cr4 ^ X86_CR4_PGE);
         /* write old PGE again and flush TLBs */
         native_write_cr4(cr4);

... and the test cases seem to be working for me now with "-cpu kvm64",
so that seems to trigger the TLB we were missing.

I don't know enough about x86 internals to tell whether the change is
sane, though, but it seems at least for qemu fwiw. ;) Thoughts?

Thanks,
Daniel
David Miller March 9, 2017, 6:08 p.m. UTC | #12
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Thu, 09 Mar 2017 18:51:03 +0100

> I added some debugging around __native_flush_tlb_global_irq_disabled()
> and if I understand it correctly, the idea of cr4 is that we need to
> toggle X86_CR4_PGE in order to trigger a TLB flush.
> 
> What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
> consistent to native_read_cr4() and since cr4 is != 0, it tells me
> based on the comment in native_read_cr4() that cr4 seems to be
> supported. Thus, meaning we end up with writing ...
> 
>   native_write_cr4(0x610);
>   native_write_cr4(0x610);
> 
> ... twice, and this just doesn't trigger the desired TLB flush. I
> changed the code into the following ...
> 
>         cr4 = this_cpu_read(cpu_tlbstate.cr4);
>         /* clear PGE */
> -       native_write_cr4(cr4 & ~X86_CR4_PGE);
> +       native_write_cr4(cr4 ^ X86_CR4_PGE);
>         /* write old PGE again and flush TLBs */
>         native_write_cr4(cr4);
> 
> ... and the test cases seem to be working for me now with "-cpu
> kvm64",
> so that seems to trigger the TLB we were missing.

Great detective work Daniel.
Linus Torvalds March 9, 2017, 6:10 p.m. UTC | #13
On Thu, Mar 9, 2017 at 9:51 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is
> consistent to native_read_cr4() and since cr4 is != 0, it tells me
> based on the comment in native_read_cr4() that cr4 seems to be
> supported. Thus, meaning we end up with writing ...
>
>   native_write_cr4(0x610);
>   native_write_cr4(0x610);
>
> ... twice, and this just doesn't trigger the desired TLB flush.

Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
supports it).

But yes, if PGE is clear then that certainly explains the bug, and
it's not an emulation issue.

> I changed the code into the following ...
>
>         cr4 = this_cpu_read(cpu_tlbstate.cr4);
>         /* clear PGE */
> -       native_write_cr4(cr4 & ~X86_CR4_PGE);
> +       native_write_cr4(cr4 ^ X86_CR4_PGE);
>         /* write old PGE again and flush TLBs */
>         native_write_cr4(cr4);

Yeah, good for debugging, but not a good patch in general. The only
valid reason for not having PGE enabled would be that the CPU doesn't
support PGE at all.

                    Linus
Linus Torvalds March 9, 2017, 6:15 p.m. UTC | #14
On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
> supports it).

Daniel, do you see the code in probe_page_size_mask() triggering?

        /* Enable PGE if available */
        if (boot_cpu_has(X86_FEATURE_PGE)) {
                cr4_set_bits_and_update_boot(X86_CR4_PGE);
                __supported_pte_mask |= _PAGE_GLOBAL;
        } else
                __supported_pte_mask &= ~_PAGE_GLOBAL;

but maybe there's something wrong with the percpu cr4 caching?

               Linus
Daniel Borkmann March 9, 2017, 6:31 p.m. UTC | #15
On 03/09/2017 07:15 PM, Linus Torvalds wrote:
> On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU
>> supports it).
>
> Daniel, do you see the code in probe_page_size_mask() triggering?
>
>          /* Enable PGE if available */
>          if (boot_cpu_has(X86_FEATURE_PGE)) {
>                  cr4_set_bits_and_update_boot(X86_CR4_PGE);
>                  __supported_pte_mask |= _PAGE_GLOBAL;

We do have boot_cpu_has(X86_FEATURE_PGE) and go indeed into this
branch here. So it seems something must be clearing it later, hmm.

>          } else
>                  __supported_pte_mask &= ~_PAGE_GLOBAL;
>
> but maybe there's something wrong with the percpu cr4 caching?
>
>                 Linus
>
diff mbox

Patch

diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a..702af98 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,8 @@  obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
  	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
  	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
  	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o refcount.o
+	 once.o refcount.o test_setmem.o
+
  obj-y += string_helpers.o
  obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
  obj-y += hexdump.o
diff --git a/lib/test_setmem.c b/lib/test_setmem.c
new file mode 100644
index 0000000..82b08db
--- /dev/null
+++ b/lib/test_setmem.c
@@ -0,0 +1,50 @@ 
+#include <linux/init.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+
+#include <asm/cacheflush.h>
+
+static int __init test_setmem(void)
+{
+        const gfp_t gfp_flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO;
+        int ret, i, good = 0, max_runs = 1000, one = 1;
+        u32 pages, size;
+        u8 *buff;
+
+        for (i = 0; i < max_runs; i++) {
+                size  = round_up(4096*8, PAGE_SIZE);
+                pages = size / PAGE_SIZE;
+
+                buff  = __vmalloc(size, gfp_flags, PAGE_KERNEL);
+                if (buff == NULL) {
+                        printk("%s: allocation failed!\n", __func__);
+                        return 0;
+                }
+#if 1
+                buff[0] = 41;
+                buff[10] = 42;
+                buff[100] = 43;
+                buff[1000] = 44;
+		// can also be just read access, f.e. if (buff[0] + ...)
+#endif
+                WARN_ON_ONCE(set_memory_ro((unsigned long)buff, pages));
+//              msleep(0);
+
+                ret = probe_kernel_write(&buff[1], &one, sizeof(one));
+                if (!ret)
+                        printk("%s: write succeeded despite read-only!\n", __func__);
+                else
+                        good++;
+
+                WARN_ON_ONCE(set_memory_rw((unsigned long)buff, pages));
+                vfree(buff);
+        }
+
+        printk("%s: write failed due to read-only: %d/%d\n", __func__, good, max_runs);
+        return 0;
+}
+
+late_initcall(test_setmem);