diff mbox series

[v15,7/9] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX

Message ID 20210609013431.9805-8-jniethe5@gmail.com (mailing list archive)
State Accepted
Headers show
Series powerpc: Further Strict RWX support | expand
Related show

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch powerpc/merge (c53db722ec7ab3ebf29ecf61e922820f31e5284b)
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 14 lines checked
snowpatch_ozlabs/needsstable success Patch has no Fixes tags

Commit Message

Jordan Niethe June 9, 2021, 1:34 a.m. UTC
From: Russell Currey <ruscur@russell.cc>

To enable strict module RWX on powerpc, set:

    CONFIG_STRICT_MODULE_RWX=y

You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real
security benefit.

ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX.
This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that
makes STRICT_MODULE_RWX *on by default* in configurations where
STRICT_KERNEL_RWX is *unavailable*.

Since this doesn't make much sense, and module RWX without kernel RWX
doesn't make much sense, having the same dependencies as kernel RWX
works around this problem.

Book3s/32 603 and 604 core processors are not able to write protect
kernel pages so do not set ARCH_HAS_STRICT_MODULE_RWX for Book3s/32.

Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Russell Currey <ruscur@russell.cc>
[jpn: - predicate on !PPC_BOOK3S_604
      - make module_alloc() use PAGE_KERNEL protection]
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
---
v10: - Predicate on !PPC_BOOK3S_604
     - Make module_alloc() use PAGE_KERNEL protection
v11: - Neaten up
v13: Use strict_kernel_rwx_enabled()
v14: Make changes to module_alloc() its own commit
v15: - Force STRICT_KERNEL_RWX if STRICT_MODULE_RWX is selected
     - Predicate on !PPC_BOOK3S_32 instead
---
 arch/powerpc/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

Comments

Laurent Vivier Aug. 5, 2021, 9:34 a.m. UTC | #1
Hi,

On 09/06/2021 03:34, Jordan Niethe wrote:
> From: Russell Currey <ruscur@russell.cc>
> 
> To enable strict module RWX on powerpc, set:
> 
>     CONFIG_STRICT_MODULE_RWX=y
> 
> You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real
> security benefit.
> 
> ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX.
> This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that
> makes STRICT_MODULE_RWX *on by default* in configurations where
> STRICT_KERNEL_RWX is *unavailable*.
> 
> Since this doesn't make much sense, and module RWX without kernel RWX
> doesn't make much sense, having the same dependencies as kernel RWX
> works around this problem.
> 
> Book3s/32 603 and 604 core processors are not able to write protect
> kernel pages so do not set ARCH_HAS_STRICT_MODULE_RWX for Book3s/32.
> 
> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> Signed-off-by: Russell Currey <ruscur@russell.cc>
> [jpn: - predicate on !PPC_BOOK3S_604
>       - make module_alloc() use PAGE_KERNEL protection]
> Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
> ---
> v10: - Predicate on !PPC_BOOK3S_604
>      - Make module_alloc() use PAGE_KERNEL protection
> v11: - Neaten up
> v13: Use strict_kernel_rwx_enabled()
> v14: Make changes to module_alloc() its own commit
> v15: - Force STRICT_KERNEL_RWX if STRICT_MODULE_RWX is selected
>      - Predicate on !PPC_BOOK3S_32 instead
> ---
>  arch/powerpc/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index abfe2e9225fa..72f307f1796b 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -142,6 +142,7 @@ config PPC
>  	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
>  	select ARCH_HAS_SET_MEMORY
>  	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION)
> +	select ARCH_HAS_STRICT_MODULE_RWX	if ARCH_HAS_STRICT_KERNEL_RWX && !PPC_BOOK3S_32
>  	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
>  	select ARCH_HAS_UACCESS_FLUSHCACHE
>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
> @@ -267,6 +268,7 @@ config PPC
>  	select PPC_DAWR				if PPC64
>  	select RTC_LIB
>  	select SPARSE_IRQ
> +	select STRICT_KERNEL_RWX if STRICT_MODULE_RWX
>  	select SYSCTL_EXCEPTION_TRACE
>  	select THREAD_INFO_IN_TASK
>  	select VIRT_TO_BUS			if !PPC64
> 

since this patch is merged my VM is experiencing a crash at boot (20% of the time):

[    8.496850] kernel tried to execute exec-protected page (c008000004073278) - exploit
attempt? (uid: 0)
[    8.496921] BUG: Unable to handle kernel instruction fetch
[    8.496954] Faulting instruction address: 0xc008000004073278
[    8.496994] Oops: Kernel access of bad area, sig: 11 [#1]
[    8.497028] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[    8.497071] Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks xfs
libcrc32c virtio_net net_failover virtio_blk vmx_crypto failover dm_mirror dm_region_hash
dm_log dm_mod
[    8.497186] CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
[    8.497228] Workqueue: events control_work_handler [virtio_console]
[    8.497272] NIP:  c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0
[    8.497320] REGS: c00000002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
[    8.497361] MSR:  800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822
XER: 200400cf
[    8.497426] CFAR: c0000000001e9e44 IRQMASK: 1
[    8.497426] GPR00: c008000004073278 c00000002e4efa80 c000000002a26b00 c000000042c39520
[    8.497426] GPR04: 0000000000000001 0000000000000000 0000000000000000 00000000000000ff
[    8.497426] GPR08: 0000000000000001 c000000042c39520 0000000000000001 c008000004076008
[    8.497426] GPR12: c0000000001e9de0 c0000001fffccb00 c00000000018ba88 c00000002c91d400
[    8.497426] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    8.497426] GPR20: 0000000000000000 0000000000000000 0000000000000000 c008000004080340
[    8.497426] GPR24: c0080000040a01e8 0000000000000000 0000000000000000 c00000002e0975c0
[    8.497426] GPR28: c00000002ce72940 c000000042c39520 0000000000000048 0000000000000038
[    8.497891] NIP [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
[    8.497934] LR [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
[    8.497976] Call Trace:
[    8.497993] [c00000002e4efa80] [c00800000407323c] fill_queue+0xb4/0x210
[virtio_console] (unreliable)
[    8.498052] [c00000002e4efae0] [c008000004073a90] add_port+0x1a8/0x470 [virtio_console]
[    8.498102] [c00000002e4efbb0] [c0080000040750f4] control_work_handler+0xbc/0x1e8
[virtio_console]
[    8.498160] [c00000002e4efc60] [c00000000017f4f0] process_one_work+0x290/0x590
[    8.498212] [c00000002e4efd00] [c00000000017f878] worker_thread+0x88/0x620
[    8.498256] [c00000002e4efda0] [c00000000018bc14] kthread+0x194/0x1a0
[    8.498299] [c00000002e4efe10] [c00000000000cf54] ret_from_kernel_thread+0x5c/0x64
[    8.498349] Instruction dump:
[    8.498374] 7da96b78 a14d0c8a 419c00b0 2f8a0000 419eff88 b32d0c8a 7c0004ac 4bffff7c
[    8.498430] 60000000 60000000 7fa3eb78 48002d95 <e8410018> 38600000 480025e1 e8410018
[    8.498485] ---[ end trace 16ee10903290b647 ]---
[    8.501433]
[    9.502601] Kernel panic - not syncing: Fatal exception

add_port+0x1a8/0x470 :

  1420	
  1421		/* We can safely ignore ENOSPC because it means
  1422		 * the queue already has buffers. Buffers are removed
  1423		 * only by virtcons_remove(), not by unplug_port()
  1424		 */
->1425		err = fill_queue(port->in_vq, &port->inbuf_lock);
  1426		if (err < 0 && err != -ENOSPC) {
  1427			dev_err(port->dev, "Error allocating inbufs\n");
  1428			goto free_device;
  1429		}

fill_queue+0x90/0x210 :

  1326	static int fill_queue(struct virtqueue *vq, spinlock_t *lock)
  1327	{
  1328		struct port_buffer *buf;
  1329		int nr_added_bufs;
  1330		int ret;
  1331	
  1332		nr_added_bufs = 0;
  1333		do {
  1334			buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
  1335			if (!buf)
  1336				return -ENOMEM;
  1337	
->1338			spin_lock_irq(lock);

I'm using an upstream kernel (5.14-rc4, 251a1524293d) in the VM.

My host is a RHEL 8.5/POWER9: qemu-kvm-6.0.0-21 and kernel-4.18.0-325

My qemu command line is:

/usr/libexec/qemu-kvm \
-M pseries,accel=kvm \
-nographic -nodefaults \
-device virtio-serial-pci \
-device virtconsole \
-device virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,netdev=hostnet0  \
-blockdev
node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=disk.qcow2 \
-netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
-device virtio-blk-pci,id=image1,drive=disk1 \
-m 8192  \
-smp 4 \
-serial mon:stdio


Do we need something in qemu/kvm to support STRICT_MODULE_RWX ?

Thanks,
Laurent
Fabiano Rosas Aug. 13, 2021, 10:58 p.m. UTC | #2
Laurent Vivier <lvivier@redhat.com> writes:

>
> since this patch is merged my VM is experiencing a crash at boot (20% of the time):
>
> [    8.496850] kernel tried to execute exec-protected page (c008000004073278) - exploit
> attempt? (uid: 0)
> [    8.496921] BUG: Unable to handle kernel instruction fetch
> [    8.496954] Faulting instruction address: 0xc008000004073278
> [    8.496994] Oops: Kernel access of bad area, sig: 11 [#1]
> [    8.497028] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [    8.497071] Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks xfs
> libcrc32c virtio_net net_failover virtio_blk vmx_crypto failover dm_mirror dm_region_hash
> dm_log dm_mod
> [    8.497186] CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
> [    8.497228] Workqueue: events control_work_handler [virtio_console]
> [    8.497272] NIP:  c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0
> [    8.497320] REGS: c00000002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
> [    8.497361] MSR:  800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822
> XER: 200400cf
> [    8.497426] CFAR: c0000000001e9e44 IRQMASK: 1
> [    8.497426] GPR00: c008000004073278 c00000002e4efa80 c000000002a26b00 c000000042c39520
> [    8.497426] GPR04: 0000000000000001 0000000000000000 0000000000000000 00000000000000ff
> [    8.497426] GPR08: 0000000000000001 c000000042c39520 0000000000000001 c008000004076008
> [    8.497426] GPR12: c0000000001e9de0 c0000001fffccb00 c00000000018ba88 c00000002c91d400
> [    8.497426] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [    8.497426] GPR20: 0000000000000000 0000000000000000 0000000000000000 c008000004080340
> [    8.497426] GPR24: c0080000040a01e8 0000000000000000 0000000000000000 c00000002e0975c0
> [    8.497426] GPR28: c00000002ce72940 c000000042c39520 0000000000000048 0000000000000038
> [    8.497891] NIP [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> [    8.497934] LR [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> [    8.497976] Call Trace:
> [    8.497993] [c00000002e4efa80] [c00800000407323c] fill_queue+0xb4/0x210
> [virtio_console] (unreliable)
> [    8.498052] [c00000002e4efae0] [c008000004073a90] add_port+0x1a8/0x470 [virtio_console]
> [    8.498102] [c00000002e4efbb0] [c0080000040750f4] control_work_handler+0xbc/0x1e8
> [virtio_console]
> [    8.498160] [c00000002e4efc60] [c00000000017f4f0] process_one_work+0x290/0x590
> [    8.498212] [c00000002e4efd00] [c00000000017f878] worker_thread+0x88/0x620
> [    8.498256] [c00000002e4efda0] [c00000000018bc14] kthread+0x194/0x1a0
> [    8.498299] [c00000002e4efe10] [c00000000000cf54] ret_from_kernel_thread+0x5c/0x64
> [    8.498349] Instruction dump:
> [    8.498374] 7da96b78 a14d0c8a 419c00b0 2f8a0000 419eff88 b32d0c8a 7c0004ac 4bffff7c
> [    8.498430] 60000000 60000000 7fa3eb78 48002d95 <e8410018> 38600000 480025e1 e8410018
> [    8.498485] ---[ end trace 16ee10903290b647 ]---
> [    8.501433]
> [    9.502601] Kernel panic - not syncing: Fatal exception
>
> add_port+0x1a8/0x470 :
>
>   1420	
>   1421		/* We can safely ignore ENOSPC because it means
>   1422		 * the queue already has buffers. Buffers are removed
>   1423		 * only by virtcons_remove(), not by unplug_port()
>   1424		 */
> ->1425		err = fill_queue(port->in_vq, &port->inbuf_lock);
>   1426		if (err < 0 && err != -ENOSPC) {
>   1427			dev_err(port->dev, "Error allocating inbufs\n");
>   1428			goto free_device;
>   1429		}
>
> fill_queue+0x90/0x210 :
>
>   1326	static int fill_queue(struct virtqueue *vq, spinlock_t *lock)
>   1327	{
>   1328		struct port_buffer *buf;
>   1329		int nr_added_bufs;
>   1330		int ret;
>   1331	
>   1332		nr_added_bufs = 0;
>   1333		do {
>   1334			buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
>   1335			if (!buf)
>   1336				return -ENOMEM;
>   1337	
> ->1338			spin_lock_irq(lock);
>
> I'm using an upstream kernel (5.14-rc4, 251a1524293d) in the VM.
>
> My host is a RHEL 8.5/POWER9: qemu-kvm-6.0.0-21 and kernel-4.18.0-325
>
> My qemu command line is:
>
> /usr/libexec/qemu-kvm \
> -M pseries,accel=kvm \
> -nographic -nodefaults \
> -device virtio-serial-pci \
> -device virtconsole \
> -device virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,netdev=hostnet0  \
> -blockdev
> node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=disk.qcow2 \
> -netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
> -device virtio-blk-pci,id=image1,drive=disk1 \
> -m 8192  \
> -smp 4 \
> -serial mon:stdio
>
>
> Do we need something in qemu/kvm to support STRICT_MODULE_RWX ?
>
> Thanks,
> Laurent

This patch survived 300 consecutive runs without the issue:

diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index 0876216ceee6..7aeb2b62e5dc 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -35,10 +35,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
        pte_t pte;
 
        spin_lock(&init_mm.page_table_lock);
-
-       /* invalidate the PTE so it's safe to modify */
-       pte = ptep_get_and_clear(&init_mm, addr, ptep);
-       flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+       pte = *ptep;
 
        /* modify the PTE bits as desired, then apply */
        switch (action) {
@@ -60,10 +57,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
        }
 
        set_pte_at(&init_mm, addr, ptep, pte);
-
-       /* See ptesync comment in radix__set_pte_at() */
-       if (radix_enabled())
-               asm volatile("ptesync": : :"memory");
+       flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        spin_unlock(&init_mm.page_table_lock);
 
        return 0;
---

What I think is happening is that the virtio_console code is running
at the same time we are doing the `module_enable_ro` at the end of
`do_init_module` due to the async nature of the work handler. There is
a window after the TLB flush when the PTE has its permission bits
cleared, so any translations of the module code page attempted during
that window will fault.

I'm ignorant of strict rwx in general so I don't see why we need to
clear the bits before setting them to their final value, but as I
understand it, the set_pte_at + flush_tlb_kernel_range satisfy the ISA
requirement of [ptesync; tlbie; eieio; tlbsync; ptesync;] so it seems
like the patch should work.

Now, I cannot explain why the crash always happens around the code
that does the module's symbols relocation (the NIP in Laurent's trace
is the TOC reload from module_64.c:restore_r2). Maybe because
instructions are already in icache until the first branch into the
stub?

Anyway, this is what Murilo and I found out over our debugging session
in the past couple of days. I hope it helps. =)
Jordan Niethe Aug. 14, 2021, 1:23 a.m. UTC | #3
On Sat, Aug 14, 2021 at 8:59 AM Fabiano Rosas <farosas@linux.ibm.com> wrote:
>
> Laurent Vivier <lvivier@redhat.com> writes:
>
> >
> > since this patch is merged my VM is experiencing a crash at boot (20% of the time):
> >
> > [    8.496850] kernel tried to execute exec-protected page (c008000004073278) - exploit
> > attempt? (uid: 0)
> > [    8.496921] BUG: Unable to handle kernel instruction fetch
> > [    8.496954] Faulting instruction address: 0xc008000004073278
> > [    8.496994] Oops: Kernel access of bad area, sig: 11 [#1]
> > [    8.497028] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > [    8.497071] Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks xfs
> > libcrc32c virtio_net net_failover virtio_blk vmx_crypto failover dm_mirror dm_region_hash
> > dm_log dm_mod
> > [    8.497186] CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
> > [    8.497228] Workqueue: events control_work_handler [virtio_console]
> > [    8.497272] NIP:  c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0
> > [    8.497320] REGS: c00000002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
> > [    8.497361] MSR:  800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822
> > XER: 200400cf
> > [    8.497426] CFAR: c0000000001e9e44 IRQMASK: 1
> > [    8.497426] GPR00: c008000004073278 c00000002e4efa80 c000000002a26b00 c000000042c39520
> > [    8.497426] GPR04: 0000000000000001 0000000000000000 0000000000000000 00000000000000ff
> > [    8.497426] GPR08: 0000000000000001 c000000042c39520 0000000000000001 c008000004076008
> > [    8.497426] GPR12: c0000000001e9de0 c0000001fffccb00 c00000000018ba88 c00000002c91d400
> > [    8.497426] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [    8.497426] GPR20: 0000000000000000 0000000000000000 0000000000000000 c008000004080340
> > [    8.497426] GPR24: c0080000040a01e8 0000000000000000 0000000000000000 c00000002e0975c0
> > [    8.497426] GPR28: c00000002ce72940 c000000042c39520 0000000000000048 0000000000000038
> > [    8.497891] NIP [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> > [    8.497934] LR [c008000004073278] fill_queue+0xf0/0x210 [virtio_console]
> > [    8.497976] Call Trace:
> > [    8.497993] [c00000002e4efa80] [c00800000407323c] fill_queue+0xb4/0x210
> > [virtio_console] (unreliable)
> > [    8.498052] [c00000002e4efae0] [c008000004073a90] add_port+0x1a8/0x470 [virtio_console]
> > [    8.498102] [c00000002e4efbb0] [c0080000040750f4] control_work_handler+0xbc/0x1e8
> > [virtio_console]
> > [    8.498160] [c00000002e4efc60] [c00000000017f4f0] process_one_work+0x290/0x590
> > [    8.498212] [c00000002e4efd00] [c00000000017f878] worker_thread+0x88/0x620
> > [    8.498256] [c00000002e4efda0] [c00000000018bc14] kthread+0x194/0x1a0
> > [    8.498299] [c00000002e4efe10] [c00000000000cf54] ret_from_kernel_thread+0x5c/0x64
> > [    8.498349] Instruction dump:
> > [    8.498374] 7da96b78 a14d0c8a 419c00b0 2f8a0000 419eff88 b32d0c8a 7c0004ac 4bffff7c
> > [    8.498430] 60000000 60000000 7fa3eb78 48002d95 <e8410018> 38600000 480025e1 e8410018
> > [    8.498485] ---[ end trace 16ee10903290b647 ]---
> > [    8.501433]
> > [    9.502601] Kernel panic - not syncing: Fatal exception
> >
> > add_port+0x1a8/0x470 :
> >
> >   1420
> >   1421                /* We can safely ignore ENOSPC because it means
> >   1422                 * the queue already has buffers. Buffers are removed
> >   1423                 * only by virtcons_remove(), not by unplug_port()
> >   1424                 */
> > ->1425                err = fill_queue(port->in_vq, &port->inbuf_lock);
> >   1426                if (err < 0 && err != -ENOSPC) {
> >   1427                        dev_err(port->dev, "Error allocating inbufs\n");
> >   1428                        goto free_device;
> >   1429                }
> >
> > fill_queue+0x90/0x210 :
> >
> >   1326        static int fill_queue(struct virtqueue *vq, spinlock_t *lock)
> >   1327        {
> >   1328                struct port_buffer *buf;
> >   1329                int nr_added_bufs;
> >   1330                int ret;
> >   1331
> >   1332                nr_added_bufs = 0;
> >   1333                do {
> >   1334                        buf = alloc_buf(vq->vdev, PAGE_SIZE, 0);
> >   1335                        if (!buf)
> >   1336                                return -ENOMEM;
> >   1337
> > ->1338                        spin_lock_irq(lock);
> >
> > I'm using an upstream kernel (5.14-rc4, 251a1524293d) in the VM.
> >
> > My host is a RHEL 8.5/POWER9: qemu-kvm-6.0.0-21 and kernel-4.18.0-325
> >
> > My qemu command line is:
> >
> > /usr/libexec/qemu-kvm \
> > -M pseries,accel=kvm \
> > -nographic -nodefaults \
> > -device virtio-serial-pci \
> > -device virtconsole \
> > -device virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,netdev=hostnet0  \
> > -blockdev
> > node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=disk.qcow2 \
> > -netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \
> > -device virtio-blk-pci,id=image1,drive=disk1 \
> > -m 8192  \
> > -smp 4 \
> > -serial mon:stdio
> >
> >
> > Do we need something in qemu/kvm to support STRICT_MODULE_RWX ?
> >
> > Thanks,
> > Laurent
>
> This patch survived 300 consecutive runs without the issue:
>
> diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
> index 0876216ceee6..7aeb2b62e5dc 100644
> --- a/arch/powerpc/mm/pageattr.c
> +++ b/arch/powerpc/mm/pageattr.c
> @@ -35,10 +35,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
>         pte_t pte;
>
>         spin_lock(&init_mm.page_table_lock);
> -
> -       /* invalidate the PTE so it's safe to modify */
> -       pte = ptep_get_and_clear(&init_mm, addr, ptep);
> -       flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +       pte = *ptep;
>
>         /* modify the PTE bits as desired, then apply */
>         switch (action) {
> @@ -60,10 +57,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
>         }
>
>         set_pte_at(&init_mm, addr, ptep, pte);
> -
> -       /* See ptesync comment in radix__set_pte_at() */
> -       if (radix_enabled())
> -               asm volatile("ptesync": : :"memory");
> +       flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>         spin_unlock(&init_mm.page_table_lock);
>
>         return 0;
> ---
>
> What I think is happening is that the virtio_console code is running
> at the same time we are doing the `module_enable_ro` at the end of
> `do_init_module` due to the async nature of the work handler. There is
> a window after the TLB flush when the PTE has its permission bits
> cleared, so any translations of the module code page attempted during
> that window will fault.

Thanks for looking at this. I agree that this is the problem.
This avoids the crash (not a proper solution):

--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2011,13 +2011,15 @@ static void module_enable_ro(const struct
module *mod, bool after_init)
        if (!rodata_enabled)
                return;

-       set_vm_flush_reset_perms(mod->core_layout.base);
-       set_vm_flush_reset_perms(mod->init_layout.base);
-       frob_text(&mod->core_layout, set_memory_ro);
-
-       frob_rodata(&mod->core_layout, set_memory_ro);
-       frob_text(&mod->init_layout, set_memory_ro);
-       frob_rodata(&mod->init_layout, set_memory_ro);
+       if (!after_init) {
+               set_vm_flush_reset_perms(mod->core_layout.base);
+               set_vm_flush_reset_perms(mod->init_layout.base);
+               frob_text(&mod->core_layout, set_memory_ro);
+
+               frob_rodata(&mod->core_layout, set_memory_ro);
+               frob_text(&mod->init_layout, set_memory_ro);
+               frob_rodata(&mod->init_layout, set_memory_ro);
+       }

>
> I'm ignorant of strict rwx in general so I don't see why we need to
> clear the bits before setting them to their final value, but as I
> understand it, the set_pte_at + flush_tlb_kernel_range satisfy the ISA
> requirement of [ptesync; tlbie; eieio; tlbsync; ptesync;] so it seems
> like the patch should work.
I believe it's done like that because ISA 6.10.1.2 Modifying a
Translation Table Entry -
"The sequence is equivalent to deleting the PTE and then adding a new one".
In that sequence, I think it is [set pte to 0; ptesync; tlbie; eieio;
tlbsync; ptesync;]

But it does seem like we need to do something like your patch for
change_page_attr() to work as expected.
>
> Now, I cannot explain why the crash always happens around the code
> that does the module's symbols relocation (the NIP in Laurent's trace
> is the TOC reload from module_64.c:restore_r2). Maybe because
> instructions are already in icache until the first branch into the
> stub?
Yeah, mpe thought it might be some cache effect of a branch instruction.
From time to time I did see Oopses like this (not "kernel tried to
execute exec-protected page")

[  124.986964][   T52] Oops: Kernel access of bad area, sig: 11 [#1]
[  124.987043][   T52] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  124.987095][   T52] Modules linked in: virtio_console binfmt_misc
virtiofs fuse virtio_net virtio_blk virtio_pci virtio_pci_modern_dev
virtio_ring virtio crc32c_vpmsum [last unloaded: virtio_console]
[  124.987209][   T52] CPU: 2 PID: 52 Comm: kworker/2:1 Not tainted
5.14.0-rc4 #4
[  124.987259][   T52] Workqueue: events control_work_handler [virtio_console]
[  124.987307][   T52] NIP:  c008000001a86044 LR: c008000001a82cf8
CTR: c00000000042a6c0
[  124.987358][   T52] REGS: c00000000b273770 TRAP: 0300   Not tainted
 (5.14.0-rc4)
[  124.987406][   T52] MSR:  800000000280b033
<SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 82002882  XER: 00000004
[  124.987475][   T52] CFAR: c008000001a82cf4 DAR: c008000001a86058
DSISR: 40000000 IRQMASK: 0
[  124.987475][   T52] GPR00: c008000001a82ce8 c00000000b273a10
c008000001ab8000 c000000010731320
[  124.987475][   T52] GPR04: 0000000000000001 0000000000000000
0000000000000000 0000000000000000
[  124.987475][   T52] GPR08: 0000000000000000 0000000000000000
0000000000000000 c008000001a86038
[  124.987475][   T52] GPR12: 0000000022002884 c0000001ffffdf00
c000000000169ae8 c0000000037700c0
[  124.987475][   T52] GPR16: 0000000000000000 0000000000000000
c000000021af0000 c000000003313900
[  124.987475][   T52] GPR20: 0000000000000001 0000000000000000
0000000000000000 c000000003316780
[  124.987475][   T52] GPR24: c008000001a90288 0000000000000000
0000000000000000 c000000010731320
[  124.987475][   T52] GPR28: c000000003316900 0000000000000018
0000000000000068 c00000000379c5a0
[  124.988028][   T52] NIP [c008000001a86044] fini+0x824/0xa7e0 [virtio_console]
[  124.988085][   T52] LR [c008000001a82cf8] fill_queue+0xb0/0x230
[virtio_console]
[  124.988141][   T52] Call Trace:
[  124.988171][   T52] [c00000000b273a10] [c008000001a82ce8]
fill_queue+0xa0/0x230 [virtio_console] (unreliable)
[  124.988246][   T52] [c00000000b273aa0] [c008000001a83208]
add_port.isra.0+0x1a0/0x4b0 [virtio_console]
[  124.988312][   T52] [c00000000b273b70] [c008000001a85474]
control_work_handler+0x46c/0x654 [virtio_console]
[  124.988403][   T52] [c00000000b273c70] [c00000000015ce60]
process_one_work+0x2a0/0x570
[  124.988586][   T52] [c00000000b273d10] [c00000000015d1d8]
worker_thread+0xa8/0x660
[  124.988684][   T52] [c00000000b273da0] [c000000000169c5c] kthread+0x17c/0x190
[  124.988762][   T52] [c00000000b273e10] [c00000000000cf54]
ret_from_kernel_thread+0x5c/0x64
[  124.988916][   T52] Instruction dump:
[  124.988965][   T52] 396be010 f8410018 e98b0020 7d8903a6 4e800420
00000000 73747562 003a0300
[  124.989152][   T52] c0000000 3d62fffd 396be038 f8410018 <e98b0020>
7d8903a6 4e800420 00000000
[  124.989298][   T52] ---[ end trace ab9046c024eb3154 ]---

I guess it depends on the timing of which pte is invalidated.
>
> Anyway, this is what Murilo and I found out over our debugging session
> in the past couple of days. I hope it helps. =)
Yes it does, thanks heaps.
>
diff mbox series

Patch

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index abfe2e9225fa..72f307f1796b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -142,6 +142,7 @@  config PPC
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
 	select ARCH_HAS_SET_MEMORY
 	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION)
+	select ARCH_HAS_STRICT_MODULE_RWX	if ARCH_HAS_STRICT_KERNEL_RWX && !PPC_BOOK3S_32
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
@@ -267,6 +268,7 @@  config PPC
 	select PPC_DAWR				if PPC64
 	select RTC_LIB
 	select SPARSE_IRQ
+	select STRICT_KERNEL_RWX if STRICT_MODULE_RWX
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
 	select VIRT_TO_BUS			if !PPC64