Message ID | 58C08535.3070000@iogearbox.net |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: > [ 28.474232] rodata_test: test data was not read only > [...] In my tests so far, I've never been able to get rodata_test to fail (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can recheck under Qemu 2.7.1. Do you see these failures on real hardware? -Kees
On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: > > The issue seems to be accessing buff first (can be read or write access) > and then doing set_memory_ro() doesn't make it read-only immediately, > meaning the subsequent call into probe_kernel_write() will succeed without > error. > > Then, if I don't touch buff first and only do the set_memory_ro() seems > to work and probe_kernel_write() will then fail as expected due to pages > being read-only now. Ok, that definitely sounds like a TLB invalidate didn't happen. > Now, if I access buff, do the set_memory_ro() and then a msleep(0), for > example, it "kind of" works most of the time (see last log extract below), > and probe_kernel_write() will fail. Yeah, very much consistent with a missing TLB invalidate. Scheduling will end up invalidating it, although if it's a global page even that might not do it (but eventually the entry will just get flushed due to other activity). > None of this seems an issue with x86_64 and the test_setmem runs fine all > the time, same for the actual BPF stuff. The code does look somewhat confused about when to actually flush things - see my earlier note about NX - but it would seem to always do __flush_tlb_all() unless I missed something. At least as long as CPA_FLUSHTLB is set. Maybe some case forgets to set that.. Linus
On 03/08/2017 11:36 PM, Kees Cook wrote: > On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: >> [ 28.474232] rodata_test: test data was not read only >> [...] > > In my tests so far, I've never been able to get rodata_test to fail > (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can > recheck under Qemu 2.7.1. Do you see these failures on real hardware? The x86_64 tests on real hardware, and the x86-32 only in the qemu environment with the reproducer script. Haven't tested x86-32 kernel outside of qemu so far. Thanks, Daniel
On 03/08/2017 02:36 PM, Kees Cook wrote: > On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: >> [ 28.474232] rodata_test: test data was not read only >> [...] > > In my tests so far, I've never been able to get rodata_test to fail > (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can > recheck under Qemu 2.7.1. Do you see these failures on real hardware? > > -Kees > FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24 and rawhide respectively. I also notice that CONFIG_X86_PAE is turned off in the defconfig. If I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem goes away. I can't tell if this is an indication of magically hiding the TLB problem or if there is an issue with !X86_PAE invalidation. Thanks, Laura
On Wed, Mar 08, 2017 at 02:43:44PM -0800, Linus Torvalds wrote: >On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: >> >> The issue seems to be accessing buff first (can be read or write access) >> and then doing set_memory_ro() doesn't make it read-only immediately, >> meaning the subsequent call into probe_kernel_write() will succeed without >> error. >> >> Then, if I don't touch buff first and only do the set_memory_ro() seems >> to work and probe_kernel_write() will then fail as expected due to pages >> being read-only now. > >Ok, that definitely sounds like a TLB invalidate didn't happen. > >> Now, if I access buff, do the set_memory_ro() and then a msleep(0), for >> example, it "kind of" works most of the time (see last log extract below), >> and probe_kernel_write() will fail. > >Yeah, very much consistent with a missing TLB invalidate. Scheduling >will end up invalidating it, although if it's a global page even that >might not do it (but eventually the entry will just get flushed due to >other activity). > >> None of this seems an issue with x86_64 and the test_setmem runs fine all >> the time, same for the actual BPF stuff. > >The code does look somewhat confused about when to actually flush >things - see my earlier note about NX - but it would seem to always do >__flush_tlb_all() unless I missed something. At least as long as >CPA_FLUSHTLB is set. Maybe some case forgets to set that.. Not sure if it's relevant, but out of 189 boots there are 2 boots showing the below "CPA: called for zero pte." warning. [ 7.116932] random: trinity: uninitialized urandom read (4 bytes read) [ 16.366468] sock: process `trinity-main' is using obsolete setsockopt SO_BSDCOMPAT [ 17.202396] BUG: unable to handle kernel paging request at 655d9eb2 [ 17.204081] IP: __release_sock+0x6e/0x100 [ 17.205207] *pde = 00000000 [ 17.205208] [ 17.206755] Oops: 0000 [#1] [ 17.207686] CPU: 0 PID: 382 Comm: trinity-main Not tainted 4.10.0-rc8-02017-g9d876e7 #1 [ 17.209819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014 [ 17.212431] task: d625d200 task.stack: d6222000 [ 17.213655] EIP: __release_sock+0x6e/0x100 [ 17.214833] EFLAGS: 00010246 CPU: 0 [ 17.215951] EAX: 00000000 EBX: 655d9eb2 ECX: 00000000 EDX: 00000201 [ 17.217587] ESI: 00000605 EDI: d6064800 EBP: d6223ef4 ESP: d6223ee8 [ 17.219185] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 17.220602] CR0: 80050033 CR2: 655d9eb2 CR3: 1610f000 CR4: 00000610 [ 17.221966] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 17.223444] DR6: ffff0ff0 DR7: 00000600 [ 17.224343] Call Trace: [ 17.225007] release_sock+0x2e/0x80 [ 17.225900] sock_setsockopt+0x8c/0x880 [ 17.226857] SyS_socketcall+0x658/0x6a0 [ 17.227804] do_fast_syscall_32+0x9a/0x160 [ 17.228765] entry_SYSENTER_32+0x4c/0x7b [ 17.229694] EIP: 0xb7777cc5 [ 17.230428] EFLAGS: 00000282 CPU: 0 [ 17.231263] EAX: ffffffda EBX: 0000000e ECX: bfedce00 EDX: bfedce80 [ 17.232582] ESI: 0000001a EDI: 000000ae EBP: b754f93c ESP: bfedcdec [ 17.233882] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b [ 17.235044] Code: eb 29 8d 76 00 89 da 89 f8 ff 97 98 01 00 00 31 c9 ba 06 08 00 00 b8 d8 19 b1 c1 e8 ed 3d 85 ff e8 e8 62 04 00 85 f6 89 f3 74 42 <8b> 33 0f 18 06 8b 43 48 a8 01 74 0e 83 e0 fe 74 09 80 3d 3d 9c [ 17.240429] EIP: __release_sock+0x6e/0x100 SS:ESP: 0068:d6223ee8 [ 17.241689] CR2: 00000000655d9eb2 [ 17.242509] ---[ end trace dc10480164c75444 ]--- [ 17.243569] ------------[ cut here ]------------ [ 17.243574] WARNING: CPU: 0 PID: 15 at arch/x86/mm/pageattr.c:1150 __cpa_process_fault+0x388/0x390 [ 17.243575] CPA: called for zero pte. vaddr = d7ab4000 cpa->vaddr = d7ab4000 [ 17.243577] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G D 4.10.0-rc8-02017-g9d876e7 #1 [ 17.243578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014 [ 17.243582] Workqueue: events bpf_prog_free_deferred [ 17.243583] Call Trace: [ 17.243588] dump_stack+0x16/0x25 [ 17.243588] dump_stack+0x16/0x25 [ 17.243590] __warn+0xd1/0xf0 [ 17.243592] ? __cpa_process_fault+0x388/0x390 [ 17.243593] warn_slowpath_fmt+0x3b/0x40 [ 17.243594] __cpa_process_fault+0x388/0x390 [ 17.243596] ? lookup_address_in_pgd+0xa/0x90 [ 17.243598] __change_page_attr+0x520/0x6c0 [ 17.243600] ? pfn_range_is_mapped+0xe/0x80 [ 17.243601] __change_page_attr_set_clr+0x38/0x180 [ 17.243603] change_page_attr_set_clr+0x107/0x3f0 [ 17.243605] ? dequeue_entity+0x86/0x230 [ 17.243607] set_memory_rw+0x3a/0x40 [ 17.243608] bpf_prog_free_deferred+0x16/0x30 [ 17.243612] process_one_work+0xfc/0x440 [ 17.243614] ? pick_next_task_fair+0x149/0x1d0 [ 17.243615] worker_thread+0x37/0x4e0 [ 17.243617] kthread+0xdd/0x110 [ 17.243618] ? process_one_work+0x440/0x440 [ 17.243620] ? __kthread_create_on_node+0x100/0x100 [ 17.243622] ret_from_fork+0x21/0x2c [ 17.243623] ---[ end trace dc10480164c75445 ]--- [ 17.243627] BUG: unable to handle kernel NULL pointer dereference at 00000007 [ 17.243630] IP: ___cache_free+0x14/0x140 [ 17.243631] *pde = 00000000 [ 17.243631] [ 17.243633] Oops: 0000 [#2] [ 17.243635] CPU: 0 PID: 15 Comm: kworker/0:1 Tainted: G D W 4.10.0-rc8-02017-g9d876e7 #1 [ 17.243635] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014 [ 17.243636] Workqueue: events bpf_prog_free_deferred [ 17.243637] task: d524e8c0 task.stack: d5254000 [ 17.243639] EIP: ___cache_free+0x14/0x140 [ 17.243640] EFLAGS: 00010046 CPU: 0 [ 17.243641] EAX: d6945af8 EBX: 00000003 ECX: c10dc08e EDX: 3beb072d [ 17.243642] ESI: d6945af8 EDI: 3beb072d EBP: d5255f00 ESP: d5255ed4 [ 17.243643] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 17.243644] CR0: 80050033 CR2: 00000007 CR3: 1610f000 CR4: 00000610 [ 17.243647] DR0: 080cb000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 17.243648] DR6: ffff0ff0 DR7: 00000600 [ 17.243648] Call Trace: [ 17.243650] kfree+0x64/0xe0 [ 17.243652] ? bpf_prog_free_deferred+0x1e/0x30 [ 17.243653] bpf_prog_free_deferred+0x1e/0x30 [ 17.243654] process_one_work+0xfc/0x440 [ 17.243656] ? pick_next_task_fair+0x149/0x1d0 [ 17.243658] worker_thread+0x37/0x4e0 [ 17.243659] kthread+0xdd/0x110 [ 17.243661] ? process_one_work+0x440/0x440 [ 17.243662] ? __kthread_create_on_node+0x100/0x100 [ 17.243664] ret_from_fork+0x21/0x2c [ 17.243664] Code: 89 da 89 f0 ff 0d 64 3e b4 c1 e8 f8 fe ff ff 83 c4 14 5b 5e 5d c3 90 55 89 e5 57 56 53 83 ec 20 e8 d2 21 72 00 8b 18 89 c6 89 d7 <8b> 43 04 39 03 73 65 a1 a0 fa 4b c2 85 c0 7f 1c 8b 03 8d 50 01 [ 17.243684] EIP: ___cache_free+0x14/0x140 SS:ESP: 0068:d5255ed4 [ 17.243684] CR2: 0000000000000007 [ 17.243685] ---[ end trace dc10480164c75446 ]--- [ 17.243686] Kernel panic - not syncing: Fatal exception [ 17.243687] Kernel Offset: disabled Regards, Fengguang
On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <labbott@redhat.com> wrote: > On 03/08/2017 02:36 PM, Kees Cook wrote: >> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: >>> [ 28.474232] rodata_test: test data was not read only >>> [...] >> >> In my tests so far, I've never been able to get rodata_test to fail >> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can >> recheck under Qemu 2.7.1. Do you see these failures on real hardware? >> >> -Kees >> > > FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24 > and rawhide respectively. > > I also notice that CONFIG_X86_PAE is turned off in the defconfig. If > I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem > goes away. I can't tell if this is an indication of magically hiding > the TLB problem or if there is an issue with !X86_PAE invalidation. I found my difference. I normally run qemu with "-cpu host" which makes the failure go away. With "-cpu kvm64", I see the rodata_test failure immediately. Seems like this may be a kvm cpu feature emulation bug? I'll see if I can find the specific cpu feature in the morning... -Kees
On 03/09/2017 06:36 AM, Kees Cook wrote: > On Wed, Mar 8, 2017 at 3:55 PM, Laura Abbott <labbott@redhat.com> wrote: >> On 03/08/2017 02:36 PM, Kees Cook wrote: >>> On Wed, Mar 8, 2017 at 2:27 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: >>>> [ 28.474232] rodata_test: test data was not read only >>>> [...] >>> >>> In my tests so far, I've never been able to get rodata_test to fail >>> (Qemu 2.5.0, Ubuntu). I'll retry with your .config and see if I can >>> recheck under Qemu 2.7.1. Do you see these failures on real hardware? >>> >>> -Kees >> >> FWIW, I'm seeing the same issue with qemu 2.6.2 and 2.8.0 on Fedora 24 >> and rawhide respectively. >> >> I also notice that CONFIG_X86_PAE is turned off in the defconfig. If >> I set CONFIG_HIGHMEM_64G which turns on CONFIG_X86_PAE the problem >> goes away. I can't tell if this is an indication of magically hiding >> the TLB problem or if there is an issue with !X86_PAE invalidation. > > I found my difference. I normally run qemu with "-cpu host" which > makes the failure go away. With "-cpu kvm64", I see the rodata_test > failure immediately. Seems like this may be a kvm cpu feature > emulation bug? I'll see if I can find the specific cpu feature in the > morning... Interesting! Changing to "-cpu host" makes rodata_test succeed plus my test_setmem and the test_bpf suite runs fine as well. Haven't seen a corruption since. Switching back to "-cpu kvm64" I immediately see mentioned issues again. With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB was set each time we switched attrs and a cpa_flush_range() was performed (with the correct number of pages and cache set to 0). That would be a __flush_tlb_all() eventually. Hmm, it indeed might seem likely that this could be an emulation bug. Thanks, Daniel
On Thu, 9 Mar 2017, Daniel Borkmann wrote: > With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated > code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB > was set each time we switched attrs and a cpa_flush_range() was > performed (with the correct number of pages and cache set to 0). That > would be a __flush_tlb_all() eventually. > > Hmm, it indeed might seem likely that this could be an emulation bug. Which variant of __flush_tlb_all() is used when the test fails? Check for the following flags in /proc/cpuinfo: pge invpcid Thanks, tglx
On 03/09/2017 02:10 PM, Thomas Gleixner wrote: > On Thu, 9 Mar 2017, Daniel Borkmann wrote: >> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated >> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB >> was set each time we switched attrs and a cpa_flush_range() was >> performed (with the correct number of pages and cache set to 0). That >> would be a __flush_tlb_all() eventually. >> >> Hmm, it indeed might seem likely that this could be an emulation bug. > > Which variant of __flush_tlb_all() is used when the test fails? > > Check for the following flags in /proc/cpuinfo: pge invpcid I added the following and booted with both variants: printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE)); printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID)); "-cpu host" gives: [ 8.326117] X86_FEATURE_PGE:1 [ 8.326381] X86_FEATURE_INVPCID:1 "-cpu kvm64" gives: [ 8.517069] X86_FEATURE_PGE:1 [ 8.517393] X86_FEATURE_INVPCID:0 Thanks, Daniel
On Thu, 9 Mar 2017, Daniel Borkmann wrote: > On 03/09/2017 02:10 PM, Thomas Gleixner wrote: > > On Thu, 9 Mar 2017, Daniel Borkmann wrote: > > > With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated > > > code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB > > > was set each time we switched attrs and a cpa_flush_range() was > > > performed (with the correct number of pages and cache set to 0). That > > > would be a __flush_tlb_all() eventually. > > > > > > Hmm, it indeed might seem likely that this could be an emulation bug. > > > > Which variant of __flush_tlb_all() is used when the test fails? > > > > Check for the following flags in /proc/cpuinfo: pge invpcid > > I added the following and booted with both variants: > > printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE)); > printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID)); > > "-cpu host" gives: > > [ 8.326117] X86_FEATURE_PGE:1 > [ 8.326381] X86_FEATURE_INVPCID:1 > > "-cpu kvm64" gives: > > [ 8.517069] X86_FEATURE_PGE:1 > [ 8.517393] X86_FEATURE_INVPCID:0 That's the one which fails. So it's using the CR4 based flushing. Just ran a test on a physical system with PGE=1 and INVPCID=0. Works fine. Emulation problem? Thanks, tglx
On 03/09/2017 03:49 PM, Thomas Gleixner wrote: > On Thu, 9 Mar 2017, Daniel Borkmann wrote: >> On 03/09/2017 02:10 PM, Thomas Gleixner wrote: >>> On Thu, 9 Mar 2017, Daniel Borkmann wrote: >>>> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated >>>> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB >>>> was set each time we switched attrs and a cpa_flush_range() was >>>> performed (with the correct number of pages and cache set to 0). That >>>> would be a __flush_tlb_all() eventually. >>>> >>>> Hmm, it indeed might seem likely that this could be an emulation bug. >>> >>> Which variant of __flush_tlb_all() is used when the test fails? >>> >>> Check for the following flags in /proc/cpuinfo: pge invpcid >> >> I added the following and booted with both variants: >> >> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE)); >> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID)); >> >> "-cpu host" gives: >> >> [ 8.326117] X86_FEATURE_PGE:1 >> [ 8.326381] X86_FEATURE_INVPCID:1 >> >> "-cpu kvm64" gives: >> >> [ 8.517069] X86_FEATURE_PGE:1 >> [ 8.517393] X86_FEATURE_INVPCID:0 > > That's the one which fails. So it's using the CR4 based flushing. Just ran > a test on a physical system with PGE=1 and INVPCID=0. Works fine. > > Emulation problem? So in the git qemu code base (target/i386/helper.c), cr3 vs cr4 looks like the following, both sharing the tlb_flush() itself: void cpu_x86_update_cr3(CPUX86State *env, target_ulong new_cr3) { X86CPU *cpu = x86_env_get_cpu(env); env->cr[3] = new_cr3; if (env->cr[0] & CR0_PG_MASK) { qemu_log_mask(CPU_LOG_MMU, "CR3 update: CR3=" TARGET_FMT_lx "\n", new_cr3); tlb_flush(CPU(cpu)); } } void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4) { X86CPU *cpu = x86_env_get_cpu(env); uint32_t hflags; #if defined(DEBUG_MMU) printf("CR4 update: %08x -> %08x\n", (uint32_t)env->cr[4], new_cr4); #endif if ((new_cr4 ^ env->cr[4]) & (CR4_PGE_MASK | CR4_PAE_MASK | CR4_PSE_MASK | CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_LA57_MASK)) { tlb_flush(CPU(cpu)); } [...] } I added some debugging around __native_flush_tlb_global_irq_disabled() and if I understand it correctly, the idea of cr4 is that we need to toggle X86_CR4_PGE in order to trigger a TLB flush. What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is consistent to native_read_cr4() and since cr4 is != 0, it tells me based on the comment in native_read_cr4() that cr4 seems to be supported. Thus, meaning we end up with writing ... native_write_cr4(0x610); native_write_cr4(0x610); ... twice, and this just doesn't trigger the desired TLB flush. I changed the code into the following ... cr4 = this_cpu_read(cpu_tlbstate.cr4); /* clear PGE */ - native_write_cr4(cr4 & ~X86_CR4_PGE); + native_write_cr4(cr4 ^ X86_CR4_PGE); /* write old PGE again and flush TLBs */ native_write_cr4(cr4); ... and the test cases seem to be working for me now with "-cpu kvm64", so that seems to trigger the TLB we were missing. I don't know enough about x86 internals to tell whether the change is sane, though, but it seems at least for qemu fwiw. ;) Thoughts? Thanks, Daniel
From: Daniel Borkmann <daniel@iogearbox.net> Date: Thu, 09 Mar 2017 18:51:03 +0100 > I added some debugging around __native_flush_tlb_global_irq_disabled() > and if I understand it correctly, the idea of cr4 is that we need to > toggle X86_CR4_PGE in order to trigger a TLB flush. > > What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is > consistent to native_read_cr4() and since cr4 is != 0, it tells me > based on the comment in native_read_cr4() that cr4 seems to be > supported. Thus, meaning we end up with writing ... > > native_write_cr4(0x610); > native_write_cr4(0x610); > > ... twice, and this just doesn't trigger the desired TLB flush. I > changed the code into the following ... > > cr4 = this_cpu_read(cpu_tlbstate.cr4); > /* clear PGE */ > - native_write_cr4(cr4 & ~X86_CR4_PGE); > + native_write_cr4(cr4 ^ X86_CR4_PGE); > /* write old PGE again and flush TLBs */ > native_write_cr4(cr4); > > ... and the test cases seem to be working for me now with "-cpu > kvm64", > so that seems to trigger the TLB we were missing. Great detective work Daniel.
On Thu, Mar 9, 2017 at 9:51 AM, Daniel Borkmann <daniel@iogearbox.net> wrote: > > What I see is that original cr4 is 0x610. The cpu_tlbstate.cr4 is > consistent to native_read_cr4() and since cr4 is != 0, it tells me > based on the comment in native_read_cr4() that cr4 seems to be > supported. Thus, meaning we end up with writing ... > > native_write_cr4(0x610); > native_write_cr4(0x610); > > ... twice, and this just doesn't trigger the desired TLB flush. Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU supports it). But yes, if PGE is clear then that certainly explains the bug, and it's not an emulation issue. > I changed the code into the following ... > > cr4 = this_cpu_read(cpu_tlbstate.cr4); > /* clear PGE */ > - native_write_cr4(cr4 & ~X86_CR4_PGE); > + native_write_cr4(cr4 ^ X86_CR4_PGE); > /* write old PGE again and flush TLBs */ > native_write_cr4(cr4); Yeah, good for debugging, but not a good patch in general. The only valid reason for not having PGE enabled would be that the CPU doesn't support PGE at all. Linus
On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU > supports it). Daniel, do you see the code in probe_page_size_mask() triggering? /* Enable PGE if available */ if (boot_cpu_has(X86_FEATURE_PGE)) { cr4_set_bits_and_update_boot(X86_CR4_PGE); __supported_pte_mask |= _PAGE_GLOBAL; } else __supported_pte_mask &= ~_PAGE_GLOBAL; but maybe there's something wrong with the percpu cr4 caching? Linus
On 03/09/2017 07:15 PM, Linus Torvalds wrote: > On Thu, Mar 9, 2017 at 10:10 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Very odd. We should always have PGE (0x0080) set in cr4 (if the CPU >> supports it). > > Daniel, do you see the code in probe_page_size_mask() triggering? > > /* Enable PGE if available */ > if (boot_cpu_has(X86_FEATURE_PGE)) { > cr4_set_bits_and_update_boot(X86_CR4_PGE); > __supported_pte_mask |= _PAGE_GLOBAL; We do have boot_cpu_has(X86_FEATURE_PGE) and go indeed into this branch here. So it seems something must be clearing it later, hmm. > } else > __supported_pte_mask &= ~_PAGE_GLOBAL; > > but maybe there's something wrong with the percpu cr4 caching? > > Linus >
diff --git a/lib/Makefile b/lib/Makefile index 320ac46a..702af98 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -41,7 +41,8 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \ percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \ - once.o refcount.o + once.o refcount.o test_setmem.o + obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += hexdump.o diff --git a/lib/test_setmem.c b/lib/test_setmem.c new file mode 100644 index 0000000..82b08db --- /dev/null +++ b/lib/test_setmem.c @@ -0,0 +1,50 @@ +#include <linux/init.h> +#include <linux/uaccess.h> +#include <linux/vmalloc.h> +#include <linux/printk.h> +#include <linux/slab.h> +#include <linux/delay.h> + +#include <asm/cacheflush.h> + +static int __init test_setmem(void) +{ + const gfp_t gfp_flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO; + int ret, i, good = 0, max_runs = 1000, one = 1; + u32 pages, size; + u8 *buff; + + for (i = 0; i < max_runs; i++) { + size = round_up(4096*8, PAGE_SIZE); + pages = size / PAGE_SIZE; + + buff = __vmalloc(size, gfp_flags, PAGE_KERNEL); + if (buff == NULL) { + printk("%s: allocation failed!\n", __func__); + return 0; + } +#if 1 + buff[0] = 41; + buff[10] = 42; + buff[100] = 43; + buff[1000] = 44; + // can also be just read access, f.e. if (buff[0] + ...) +#endif + WARN_ON_ONCE(set_memory_ro((unsigned long)buff, pages)); +// msleep(0); + + ret = probe_kernel_write(&buff[1], &one, sizeof(one)); + if (!ret) + printk("%s: write succeeded despite read-only!\n", __func__); + else + good++; + + WARN_ON_ONCE(set_memory_rw((unsigned long)buff, pages)); + vfree(buff); + } + + printk("%s: write failed due to read-only: %d/%d\n", __func__, good, max_runs); + return 0; +} + +late_initcall(test_setmem);