
[v2] vmalloc: Fix issues with flush flag

Message ID 20190520200703.15997-1-rick.p.edgecombe@intel.com
State Not Applicable
Delegated to: David Miller
Series [v2] vmalloc: Fix issues with flush flag

Commit Message

Edgecombe, Rick P May 20, 2019, 8:07 p.m. UTC
Switch VM_FLUSH_RESET_PERMS to use a regular TLB flush instead of
vm_unmap_aliases() and fix the calculation of the direct map range for the
CONFIG_ARCH_HAS_SET_DIRECT_MAP case.

Meelis Roos reported issues with the new VM_FLUSH_RESET_PERMS flag on a
sparc machine. On investigation some issues were noticed:

1. The calculation of the direct map address range to flush was wrong.
This could cause problems on x86 if a RO direct map alias ever got loaded
into the TLB. This shouldn't normally happen, but if it did, the
permissions could remain RO on the direct map alias, and the page would
then be returned from the page allocator to some other component as RO
and cause a crash.

2. Calling vm_unmap_aliases() on vfree could potentially be a lot of work
to do on a free operation. Simply flushing the TLB instead of doing the
whole vm_unmap_aliases() operation makes the frees faster and pushes the
heavy work to happen on allocation, where it would be more expected.
In addition to the extra work, vm_unmap_aliases() takes some locks, including
a long hold of vmap_purge_lock, which will make all other
VM_FLUSH_RESET_PERMS vfrees wait while the purge operation happens.

3. page_address() can have locking on some configurations, so skip calling
this when possible to further speed this up.
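
A condensed sketch of the new range computation from the diff below
(simplified, not a drop-in snippet): the range is seeded with the vmalloc
area itself, then widened to cover each page's direct map alias in full,
so the flush end lands one whole page past the last alias instead of at
its start address.

  unsigned long addr = (unsigned long)area->addr;
  unsigned long start = addr, end = addr + area->size;
  int i;

  for (i = 0; i < area->nr_pages; i++) {
          addr = (unsigned long)page_address(area->pages[i]);
          if (addr) {                               /* page has a direct map alias */
                  start = min(addr, start);
                  end = max(addr + PAGE_SIZE, end); /* cover the whole page */
          }
  }
  flush_tlb_kernel_range(start, end);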

Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
Reported-by: Meelis Roos <mroos@linux.ee>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---

Changes since v1:
 - Update commit message with more detail
 - Fix flush end range on !CONFIG_ARCH_HAS_SET_DIRECT_MAP case

 mm/vmalloc.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

Comments

Andy Lutomirski May 20, 2019, 9:25 p.m. UTC | #1
> On May 20, 2019, at 1:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> 
> Switch VM_FLUSH_RESET_PERMS to use a regular TLB flush intead of
> vm_unmap_aliases() and fix calculation of the direct map for the
> CONFIG_ARCH_HAS_SET_DIRECT_MAP case.
> 
> Meelis Roos reported issues with the new VM_FLUSH_RESET_PERMS flag on a
> sparc machine. On investigation some issues were noticed:
> 

Can you split this into a few (3?) patches, each fixing one issue?

> 1. The calculation of the direct map address range to flush was wrong.
> This could cause problems on x86 if a RO direct map alias ever got loaded
> into the TLB. This shouldn't normally happen, but it could cause the
> permissions to remain RO on the direct map alias, and then the page
> would return from the page allocator to some other component as RO and
> cause a crash.
> 
> 2. Calling vm_unmap_alias() on vfree could potentially be a lot of work to
> do on a free operation. Simply flushing the TLB instead of the whole
> vm_unmap_alias() operation makes the frees faster and pushes the heavy
> work to happen on allocation where it would be more expected.
> In addition to the extra work, vm_unmap_alias() takes some locks including
> a long hold of vmap_purge_lock, which will make all other
> VM_FLUSH_RESET_PERMS vfrees wait while the purge operation happens.
> 
> 3. page_address() can have locking on some configurations, so skip calling
> this when possible to further speed this up.
> 
> Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
> Reported-by: Meelis Roos <mroos@linux.ee>
> Cc: Meelis Roos <mroos@linux.ee>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> 
> Changes since v1:
> - Update commit message with more detail
> - Fix flush end range on !CONFIG_ARCH_HAS_SET_DIRECT_MAP case
> 
> mm/vmalloc.c | 23 +++++++++++++----------
> 1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c42872ed82ac..8d03427626dc 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2122,9 +2122,10 @@ static inline void set_area_direct_map(const struct vm_struct *area,
> /* Handle removing and resetting vm mappings related to the vm_struct. */
> static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
> {
> +    const bool has_set_direct = IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP);
> +    const bool flush_reset = area->flags & VM_FLUSH_RESET_PERMS;
>    unsigned long addr = (unsigned long)area->addr;
> -    unsigned long start = ULONG_MAX, end = 0;
> -    int flush_reset = area->flags & VM_FLUSH_RESET_PERMS;
> +    unsigned long start = addr, end = addr + area->size;
>    int i;
> 
>    /*
> @@ -2133,7 +2134,7 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
>     * This is concerned with resetting the direct map any an vm alias with
>     * execute permissions, without leaving a RW+X window.
>     */
> -    if (flush_reset && !IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> +    if (flush_reset && !has_set_direct) {
>        set_memory_nx(addr, area->nr_pages);
>        set_memory_rw(addr, area->nr_pages);
>    }
> @@ -2146,22 +2147,24 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
> 
>    /*
>     * If not deallocating pages, just do the flush of the VM area and
> -     * return.
> +     * return. If the arch doesn't have set_direct_map_(), also skip the
> +     * below work.
>     */
> -    if (!deallocate_pages) {
> -        vm_unmap_aliases();
> +    if (!deallocate_pages || !has_set_direct) {
> +        flush_tlb_kernel_range(start, end);
>        return;
>    }
> 
>    /*
>     * If execution gets here, flush the vm mapping and reset the direct
>     * map. Find the start and end range of the direct mappings to make sure
> -     * the vm_unmap_aliases() flush includes the direct map.
> +     * the flush_tlb_kernel_range() includes the direct map.
>     */
>    for (i = 0; i < area->nr_pages; i++) {
> -        if (page_address(area->pages[i])) {
> +        addr = (unsigned long)page_address(area->pages[i]);
> +        if (addr) {
>            start = min(addr, start);
> -            end = max(addr, end);
> +            end = max(addr + PAGE_SIZE, end);
>        }
>    }
> 
> @@ -2171,7 +2174,7 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
>     * reset the direct map permissions to the default.
>     */
>    set_area_direct_map(area, set_direct_map_invalid_noflush);
> -    _vm_unmap_aliases(start, end, 1);
> +    flush_tlb_kernel_range(start, end);
>    set_area_direct_map(area, set_direct_map_default_noflush);
> }
> 
> -- 
> 2.20.1
>
Meelis Roos May 20, 2019, 9:36 p.m. UTC | #2
> Switch VM_FLUSH_RESET_PERMS to use a regular TLB flush intead of
> vm_unmap_aliases() and fix calculation of the direct map for the
> CONFIG_ARCH_HAS_SET_DIRECT_MAP case.
> 
> Meelis Roos reported issues with the new VM_FLUSH_RESET_PERMS flag on a
> sparc machine. On investigation some issues were noticed:
> 
> 1. The calculation of the direct map address range to flush was wrong.
> This could cause problems on x86 if a RO direct map alias ever got loaded
> into the TLB. This shouldn't normally happen, but it could cause the
> permissions to remain RO on the direct map alias, and then the page
> would return from the page allocator to some other component as RO and
> cause a crash.
> 
> 2. Calling vm_unmap_alias() on vfree could potentially be a lot of work to
> do on a free operation. Simply flushing the TLB instead of the whole
> vm_unmap_alias() operation makes the frees faster and pushes the heavy
> work to happen on allocation where it would be more expected.
> In addition to the extra work, vm_unmap_alias() takes some locks including
> a long hold of vmap_purge_lock, which will make all other
> VM_FLUSH_RESET_PERMS vfrees wait while the purge operation happens.
> 
> 3. page_address() can have locking on some configurations, so skip calling
> this when possible to further speed this up.
> 
> Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
> Reported-by: Meelis Roos <mroos@linux.ee>
> Cc: Meelis Roos <mroos@linux.ee>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Nadav Amit <namit@vmware.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> 
> Changes since v1:
>   - Update commit message with more detail
>   - Fix flush end range on !CONFIG_ARCH_HAS_SET_DIRECT_MAP case

It does not work on my V445 where the initial problem happened.

[   46.582633] systemd[1]: Detected architecture sparc64.

Welcome to Debian GNU/Linux 10 (buster)!

[   46.759048] systemd[1]: Set hostname to <v445>.
[   46.831383] systemd[1]: Failed to bump fs.file-max, ignoring: Invalid argument
[   67.989695] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   68.074706] rcu:     0-...!: (0 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=33/33 fqs=0
[   68.198443] rcu:     2-...!: (0 ticks this GP) idle=e7e/1/0x4000000000000000 softirq=67/67 fqs=0
[   68.322198]  (detected by 1, t=5252 jiffies, g=-939, q=108)
[   68.402204]   CPU[  0]: TSTATE[0000000080001603] TPC[000000000043f298] TNPC[000000000043f29c] TASK[systemd-debug-g:89]
[   68.556001]              TPC[smp_synchronize_tick_client+0x18/0x1a0] O7[0xfff000010000691c] I7[xcall_sync_tick+0x1c/0x2c] RPC[alloc_set_pte+0xf4/0x300]
[   68.750973]   CPU[  2]: TSTATE[0000000080001600] TPC[000000000043f298] TNPC[000000000043f29c] TASK[systemd-cryptse:88]
[   68.904741]              TPC[smp_synchronize_tick_client+0x18/0x1a0] O7[filemap_map_pages+0x3cc/0x3e0] I7[xcall_sync_tick+0x1c/0x2c] RPC[handle_mm_fault+0xa0/0x180]
[   69.115991] rcu: rcu_sched kthread starved for 5252 jiffies! g-939 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[   69.262239] rcu: RCU grace-period kthread stack dump:
[   69.334741] rcu_sched       I    0    10      2 0x06000000
[   69.413495] Call Trace:
[   69.448501]  [000000000093325c] schedule+0x1c/0xc0
[   69.517253]  [0000000000936c74] schedule_timeout+0x154/0x260
[   69.598514]  [00000000004b65a4] rcu_gp_kthread+0x4e4/0xac0
[   69.677261]  [000000000047ecfc] kthread+0xfc/0x120
[   69.746018]  [00000000004060a4] ret_from_fork+0x1c/0x2c
[   69.821014]  [0000000000000000] 0x0

and hangs here, software watchdog kicks in soon.
Edgecombe, Rick P May 20, 2019, 9:48 p.m. UTC | #3
On Mon, 2019-05-20 at 14:25 -0700, Andy Lutomirski wrote:
> 
> 
> > On May 20, 2019, at 1:07 PM, Rick Edgecombe <
> > rick.p.edgecombe@intel.com> wrote:
> > 
> > Switch VM_FLUSH_RESET_PERMS to use a regular TLB flush intead of
> > vm_unmap_aliases() and fix calculation of the direct map for the
> > CONFIG_ARCH_HAS_SET_DIRECT_MAP case.
> > 
> > Meelis Roos reported issues with the new VM_FLUSH_RESET_PERMS flag
> > on a
> > sparc machine. On investigation some issues were noticed:
> > 
> 
> Can you split this into a few (3?) patches, each fixing one issue?
Sure, I just did one because it was all in the same function
and the address range calculation needs to be done differently for a pure
TLB flush, so it's kind of intertwined.

> > 1. The calculation of the direct map address range to flush was
> > wrong.
> > This could cause problems on x86 if a RO direct map alias ever got
> > loaded
> > into the TLB. This shouldn't normally happen, but it could cause
> > the
> > permissions to remain RO on the direct map alias, and then the page
> > would return from the page allocator to some other component as RO
> > and
> > cause a crash.
> > 
> > 2. Calling vm_unmap_alias() on vfree could potentially be a lot of
> > work to
> > do on a free operation. Simply flushing the TLB instead of the
> > whole
> > vm_unmap_alias() operation makes the frees faster and pushes the
> > heavy
> > work to happen on allocation where it would be more expected.
> > In addition to the extra work, vm_unmap_alias() takes some locks
> > including
> > a long hold of vmap_purge_lock, which will make all other
> > VM_FLUSH_RESET_PERMS vfrees wait while the purge operation happens.
> > 
> > 3. page_address() can have locking on some configurations, so skip
> > calling
> > this when possible to further speed this up.
> > 
> > Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special
> > permsissions")
> > Reported-by: Meelis Roos <mroos@linux.ee>
> > Cc: Meelis Roos <mroos@linux.ee>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Nadav Amit <namit@vmware.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > 
> > Changes since v1:
> > - Update commit message with more detail
> > - Fix flush end range on !CONFIG_ARCH_HAS_SET_DIRECT_MAP case
> > 
> > mm/vmalloc.c | 23 +++++++++++++----------
> > 1 file changed, 13 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index c42872ed82ac..8d03427626dc 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2122,9 +2122,10 @@ static inline void set_area_direct_map(const
> > struct vm_struct *area,
> > /* Handle removing and resetting vm mappings related to the
> > vm_struct. */
> > static void vm_remove_mappings(struct vm_struct *area, int
> > deallocate_pages)
> > {
> > +    const bool has_set_direct =
> > IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP);
> > +    const bool flush_reset = area->flags & VM_FLUSH_RESET_PERMS;
> >    unsigned long addr = (unsigned long)area->addr;
> > -    unsigned long start = ULONG_MAX, end = 0;
> > -    int flush_reset = area->flags & VM_FLUSH_RESET_PERMS;
> > +    unsigned long start = addr, end = addr + area->size;
> >    int i;
> > 
> >    /*
> > @@ -2133,7 +2134,7 @@ static void vm_remove_mappings(struct
> > vm_struct *area, int deallocate_pages)
> >     * This is concerned with resetting the direct map any an vm
> > alias with
> >     * execute permissions, without leaving a RW+X window.
> >     */
> > -    if (flush_reset &&
> > !IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> > +    if (flush_reset && !has_set_direct) {
> >        set_memory_nx(addr, area->nr_pages);
> >        set_memory_rw(addr, area->nr_pages);
> >    }
> > @@ -2146,22 +2147,24 @@ static void vm_remove_mappings(struct
> > vm_struct *area, int deallocate_pages)
> > 
> >    /*
> >     * If not deallocating pages, just do the flush of the VM area
> > and
> > -     * return.
> > +     * return. If the arch doesn't have set_direct_map_(), also
> > skip the
> > +     * below work.
> >     */
> > -    if (!deallocate_pages) {
> > -        vm_unmap_aliases();
> > +    if (!deallocate_pages || !has_set_direct) {
> > +        flush_tlb_kernel_range(start, end);
> >        return;
> >    }
> > 
> >    /*
> >     * If execution gets here, flush the vm mapping and reset the
> > direct
> >     * map. Find the start and end range of the direct mappings to
> > make sure
> > -     * the vm_unmap_aliases() flush includes the direct map.
> > +     * the flush_tlb_kernel_range() includes the direct map.
> >     */
> >    for (i = 0; i < area->nr_pages; i++) {
> > -        if (page_address(area->pages[i])) {
> > +        addr = (unsigned long)page_address(area->pages[i]);
> > +        if (addr) {
> >            start = min(addr, start);
> > -            end = max(addr, end);
> > +            end = max(addr + PAGE_SIZE, end);
> >        }
> >    }
> > 
> > @@ -2171,7 +2174,7 @@ static void vm_remove_mappings(struct
> > vm_struct *area, int deallocate_pages)
> >     * reset the direct map permissions to the default.
> >     */
> >    set_area_direct_map(area, set_direct_map_invalid_noflush);
> > -    _vm_unmap_aliases(start, end, 1);
> > +    flush_tlb_kernel_range(start, end);
> >    set_area_direct_map(area, set_direct_map_default_noflush);
> > }
> > 
> > -- 
> > 2.20.1
> >
Edgecombe, Rick P May 20, 2019, 10:17 p.m. UTC | #4
On Tue, 2019-05-21 at 00:36 +0300, Meelis Roos wrote:
> > Switch VM_FLUSH_RESET_PERMS to use a regular TLB flush intead of
> > vm_unmap_aliases() and fix calculation of the direct map for the
> > CONFIG_ARCH_HAS_SET_DIRECT_MAP case.
> > 
> > Meelis Roos reported issues with the new VM_FLUSH_RESET_PERMS flag
> > on a
> > sparc machine. On investigation some issues were noticed:
> > 
> > 1. The calculation of the direct map address range to flush was
> > wrong.
> > This could cause problems on x86 if a RO direct map alias ever got
> > loaded
> > into the TLB. This shouldn't normally happen, but it could cause
> > the
> > permissions to remain RO on the direct map alias, and then the page
> > would return from the page allocator to some other component as RO
> > and
> > cause a crash.
> > 
> > 2. Calling vm_unmap_alias() on vfree could potentially be a lot of
> > work to
> > do on a free operation. Simply flushing the TLB instead of the
> > whole
> > vm_unmap_alias() operation makes the frees faster and pushes the
> > heavy
> > work to happen on allocation where it would be more expected.
> > In addition to the extra work, vm_unmap_alias() takes some locks
> > including
> > a long hold of vmap_purge_lock, which will make all other
> > VM_FLUSH_RESET_PERMS vfrees wait while the purge operation happens.
> > 
> > 3. page_address() can have locking on some configurations, so skip
> > calling
> > this when possible to further speed this up.
> > 
> > Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special
> > permsissions")
> > Reported-by: Meelis Roos <mroos@linux.ee>
> > Cc: Meelis Roos <mroos@linux.ee>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Nadav Amit <namit@vmware.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > 
> > Changes since v1:
> >   - Update commit message with more detail
> >   - Fix flush end range on !CONFIG_ARCH_HAS_SET_DIRECT_MAP case
> 
> It does not work on my V445 where the initial problem happened.
> 
Thanks for testing. So I guess that suggests it's the TLB flush causing
the problem on sparc and not any lazy purge deadlock. I had sent Meelis
another test patch that just flushed the entire 0 to ULONG_MAX range to
try to always get the "flush all" logic, and apparently it mostly didn't
boot either. It also showed that it's not getting stuck anywhere
in the vm_remove_mappings() function. Something just hangs later.
David Miller May 20, 2019, 10:48 p.m. UTC | #5
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: Mon, 20 May 2019 22:17:49 +0000

> Thanks for testing. So I guess that suggests it's the TLB flush causing
> the problem on sparc and not any lazy purge deadlock. I had sent Meelis
> another test patch that just flushed the entire 0 to ULONG_MAX range to
> try to always the get the "flush all" logic and apprently it didn't
> boot mostly either. It also showed that it's not getting stuck anywhere
> in the vm_remove_alias() function. Something just hangs later.

I wonder if an address is making it to the TLB flush routines which is
not page aligned.  Or a TLB flush is being done before the callsites
are patched properly for the given cpu type.
Edgecombe, Rick P May 21, 2019, 12:20 a.m. UTC | #6
On Mon, 2019-05-20 at 15:48 -0700, David Miller wrote:
> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> Date: Mon, 20 May 2019 22:17:49 +0000
> 
> > Thanks for testing. So I guess that suggests it's the TLB flush
> > causing
> > the problem on sparc and not any lazy purge deadlock. I had sent
> > Meelis
> > another test patch that just flushed the entire 0 to ULONG_MAX
> > range to
> > try to always the get the "flush all" logic and apprently it didn't
> > boot mostly either. It also showed that it's not getting stuck
> > anywhere
> > in the vm_remove_alias() function. Something just hangs later.
> 
> I wonder if an address is making it to the TLB flush routines which
> is
> not page aligned.
I think vmalloc should force PAGE_SIZE alignment, but will double check
nothing got screwed up.

> Or a TLB flush is being done before the callsites
> are patched properly for the given cpu type.
Any idea how I could log when this is done? It looks like it's done
really early in boot assembly. This behavior shouldn't happen until
modules or BPF are being freed.
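(For reference, a hypothetical debug helper like the one below could be
dropped in front of the flush while testing to catch a bad range; this is
only a sketch for debugging, not part of the posted patch.)

  #include <linux/mm.h>
  #include <asm/tlbflush.h>

  /* Hypothetical debug wrapper: warn once if a range handed to the arch
   * TLB flush code is not page aligned or is reversed.
   */
  static void checked_flush_tlb_kernel_range(unsigned long start,
                                             unsigned long end)
  {
          WARN_ON_ONCE(!PAGE_ALIGNED(start) || !PAGE_ALIGNED(end) ||
                       end < start);
          flush_tlb_kernel_range(start, end);
  }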
David Miller May 21, 2019, 12:33 a.m. UTC | #7
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: Tue, 21 May 2019 00:20:13 +0000

> This behavior shouldn't happen until modules or BPF are being freed.

Then that would rule out my theory.

The only thing left is whether the permissions are actually set
properly.  If they aren't we'll take an exception when the BPF program
is run and I'm not 100% sure that kernel execute permission violations
are totally handled cleanly.
Edgecombe, Rick P May 21, 2019, 1:20 a.m. UTC | #8
On Mon, 2019-05-20 at 20:33 -0400, David Miller wrote:
> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> Date: Tue, 21 May 2019 00:20:13 +0000
> 
> > This behavior shouldn't happen until modules or BPF are being
> > freed.
> 
> Then that would rule out my theory.
> 
> The only thing left is whether the permissions are actually set
> properly.  If they aren't we'll take an exception when the BPF
> program
> is run and I'm not %100 sure that kernel execute permission
> violations
> are totally handled cleanly.
Permissions shouldn't be affected by this except on free. But reading
the code it looked like sparc had all PAGE_KERNEL as executable and no
set_memory_() implementations. Are there some places where permissions
are being set?

Should it handle executing an unmapped page gracefully? Because this
change is causing that to happen much earlier. If something was relying
on a cached translation to execute something it could find the mapping
disappear.
David Miller May 21, 2019, 1:43 a.m. UTC | #9
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: Tue, 21 May 2019 01:20:33 +0000

> Should it handle executing an unmapped page gracefully? Because this
> change is causing that to happen much earlier. If something was relying
> on a cached translation to execute something it could find the mapping
> disappear.

Does this work by not mapping any kernel mappings at the beginning,
and then filling in the BPF mappings in response to faults?
Edgecombe, Rick P May 21, 2019, 1:59 a.m. UTC | #10
On Mon, 2019-05-20 at 18:43 -0700, David Miller wrote:
> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> Date: Tue, 21 May 2019 01:20:33 +0000
> 
> > Should it handle executing an unmapped page gracefully? Because
> > this
> > change is causing that to happen much earlier. If something was
> > relying
> > on a cached translation to execute something it could find the
> > mapping
> > disappear.
> 
> Does this work by not mapping any kernel mappings at the beginning,
> and then filling in the BPF mappings in response to faults?
No, nothing too fancy. It just flushes the vm mapping immediately in
vfree for execute (and RO) mappings. The only thing that happens around
allocation time is the setting of a new flag to tell vmalloc to do the
flush.

The problem before was that the pages would be freed before the execute
mapping was flushed. So then when the pages got recycled, random,
sometimes coming from userspace, data would be mapped as executable in
the kernel by the un-flushed tlb entries.
David Miller May 22, 2019, 5:40 p.m. UTC | #11
From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Date: Tue, 21 May 2019 01:59:54 +0000

> On Mon, 2019-05-20 at 18:43 -0700, David Miller wrote:
>> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
>> Date: Tue, 21 May 2019 01:20:33 +0000
>> 
>> > Should it handle executing an unmapped page gracefully? Because
>> > this
>> > change is causing that to happen much earlier. If something was
>> > relying
>> > on a cached translation to execute something it could find the
>> > mapping
>> > disappear.
>> 
>> Does this work by not mapping any kernel mappings at the beginning,
>> and then filling in the BPF mappings in response to faults?
> No, nothing too fancy. It just flushes the vm mapping immediatly in
> vfree for execute (and RO) mappings. The only thing that happens around
> allocation time is setting of a new flag to tell vmalloc to do the
> flush.
> 
> The problem before was that the pages would be freed before the execute
> mapping was flushed. So then when the pages got recycled, random,
> sometimes coming from userspace, data would be mapped as executable in
> the kernel by the un-flushed tlb entries.

If I am to understand things correctly, there was a case where 'end'
could be smaller than 'start' when doing a range flush.  That would
definitely kill some of the sparc64 TLB flush routines.
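(That matches the pre-fix logic visible in the removed lines of the diff:
the direct map range starts out as start = ULONG_MAX, end = 0 and only
tightens when a page has a direct map address. A rough, simplified
illustration of how a reversed range could reach the flush, not the
actual kernel source:)

  unsigned long start = ULONG_MAX, end = 0;
  int i;

  for (i = 0; i < area->nr_pages; i++) {
          unsigned long addr = (unsigned long)page_address(area->pages[i]);

          if (addr) {
                  start = min(addr, start);
                  end = max(addr, end);
          }
  }
  /* If page_address() returned NULL for every page (e.g. highmem), this
   * is reached with end (0) < start (ULONG_MAX), i.e. a reversed range.
   */
  _vm_unmap_aliases(start, end, 1);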
Edgecombe, Rick P May 22, 2019, 7:26 p.m. UTC | #12
On Wed, 2019-05-22 at 10:40 -0700, David Miller wrote:
> From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> Date: Tue, 21 May 2019 01:59:54 +0000
> 
> > On Mon, 2019-05-20 at 18:43 -0700, David Miller wrote:
> > > From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> > > Date: Tue, 21 May 2019 01:20:33 +0000
> > > 
> > > > Should it handle executing an unmapped page gracefully? Because
> > > > this
> > > > change is causing that to happen much earlier. If something was
> > > > relying
> > > > on a cached translation to execute something it could find the
> > > > mapping
> > > > disappear.
> > > 
> > > Does this work by not mapping any kernel mappings at the
> > > beginning,
> > > and then filling in the BPF mappings in response to faults?
> > No, nothing too fancy. It just flushes the vm mapping immediatly in
> > vfree for execute (and RO) mappings. The only thing that happens
> > around
> > allocation time is setting of a new flag to tell vmalloc to do the
> > flush.
> > 
> > The problem before was that the pages would be freed before the
> > execute
> > mapping was flushed. So then when the pages got recycled, random,
> > sometimes coming from userspace, data would be mapped as executable
> > in
> > the kernel by the un-flushed tlb entries.
> 
> If I am to understand things correctly, there was a case where 'end'
> could be smaller than 'start' when doing a range flush.  That would
> definitely kill some of the sparc64 TLB flush routines.

Ok, thanks.

The patch at the beginning of this thread doesn't have that behavior
though, and it apparently still hung. I asked if Meelis could test with
this feature disabled and DEBUG_PAGEALLOC on, since that flushes on every
vfree and is not new logic, and also with a patch on top of the kernel
having this issue that logs exact TLB flush ranges and fault addresses.
Hopefully that will shed some light.

Sorry for all the noise and speculation on this. It has been difficult
to debug remotely with a tester and developer in different time zones.
Edgecombe, Rick P May 22, 2019, 10:40 p.m. UTC | #13
On Wed, 2019-05-22 at 12:26 -0700, Rick Edgecombe wrote:
> On Wed, 2019-05-22 at 10:40 -0700, David Miller wrote:
> > From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> > Date: Tue, 21 May 2019 01:59:54 +0000
> > 
> > > On Mon, 2019-05-20 at 18:43 -0700, David Miller wrote:
> > > > From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> > > > Date: Tue, 21 May 2019 01:20:33 +0000
> > > > 
> > > > > Should it handle executing an unmapped page gracefully?
> > > > > Because
> > > > > this
> > > > > change is causing that to happen much earlier. If something
> > > > > was
> > > > > relying
> > > > > on a cached translation to execute something it could find
> > > > > the
> > > > > mapping
> > > > > disappear.
> > > > 
> > > > Does this work by not mapping any kernel mappings at the
> > > > beginning,
> > > > and then filling in the BPF mappings in response to faults?
> > > No, nothing too fancy. It just flushes the vm mapping immediatly
> > > in
> > > vfree for execute (and RO) mappings. The only thing that happens
> > > around
> > > allocation time is setting of a new flag to tell vmalloc to do
> > > the
> > > flush.
> > > 
> > > The problem before was that the pages would be freed before the
> > > execute
> > > mapping was flushed. So then when the pages got recycled, random,
> > > sometimes coming from userspace, data would be mapped as
> > > executable
> > > in
> > > the kernel by the un-flushed tlb entries.
> > 
> > If I am to understand things correctly, there was a case where
> > 'end'
> > could be smaller than 'start' when doing a range flush.  That would
> > definitely kill some of the sparc64 TLB flush routines.
> 
> Ok, thanks.
> 
> The patch at the beginning of this thread doesn't have that behavior
> though and it apparently still hung. I asked if Meelis could test
> with
> this feature disabled and DEBUG_PAGEALLOC on, since it flushes on
> every
> vfree and is not new logic, and also with a patch that logs exact TLB
> flush ranges and fault addresses on top of the kernel having this
> issue. Hopefully that will shed some light.
> 
> Sorry for all the noise and speculation on this. It has been
> difficult
> to debug remotely with a tester and developer in different time
> zones.
> 
> 
Ok, so with a patch to disable setting the new vmalloc flush flag on
architectures that have normal memory as executable (includes sparc),
boot succeeds.

With this disable patch and DEBUG_PAGEALLOC on, it hangs earlier than
before. Going from clues in other logs, it looks like it hangs right at
the first normal vfree.

Thanks for all the testing Meelis!

So it seems like other, not new, TLB flushes also trigger the hang.

From earlier logs provided, this vfree would be the first call to
flush_tlb_kernel_range(), and it happens before any BPF allocations appear
in the logs. So I am suspecting some other cause than the bisected patch
at this point, but I guess it's not fully conclusive.

It could be informative to bisect upstream again with the
DEBUG_PAGEALLOC configs on, to see if it indeed points to an earlier
commit.
Edgecombe, Rick P May 24, 2019, 3:50 p.m. UTC | #14
On Wed, 2019-05-22 at 15:40 -0700, Rick Edgecombe wrote:
> On Wed, 2019-05-22 at 12:26 -0700, Rick Edgecombe wrote:
> > On Wed, 2019-05-22 at 10:40 -0700, David Miller wrote:
> > > From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> > > Date: Tue, 21 May 2019 01:59:54 +0000
> > > 
> > > > On Mon, 2019-05-20 at 18:43 -0700, David Miller wrote:
> > > > > From: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
> > > > > Date: Tue, 21 May 2019 01:20:33 +0000
> > > > > 
> > > > > > Should it handle executing an unmapped page gracefully?
> > > > > > Because
> > > > > > this
> > > > > > change is causing that to happen much earlier. If something
> > > > > > was
> > > > > > relying
> > > > > > on a cached translation to execute something it could find
> > > > > > the
> > > > > > mapping
> > > > > > disappear.
> > > > > 
> > > > > Does this work by not mapping any kernel mappings at the
> > > > > beginning,
> > > > > and then filling in the BPF mappings in response to faults?
> > > > No, nothing too fancy. It just flushes the vm mapping
> > > > immediatly
> > > > in
> > > > vfree for execute (and RO) mappings. The only thing that
> > > > happens
> > > > around
> > > > allocation time is setting of a new flag to tell vmalloc to do
> > > > the
> > > > flush.
> > > > 
> > > > The problem before was that the pages would be freed before the
> > > > execute
> > > > mapping was flushed. So then when the pages got recycled,
> > > > random,
> > > > sometimes coming from userspace, data would be mapped as
> > > > executable
> > > > in
> > > > the kernel by the un-flushed tlb entries.
> > > 
> > > If I am to understand things correctly, there was a case where
> > > 'end'
> > > could be smaller than 'start' when doing a range flush.  That
> > > would
> > > definitely kill some of the sparc64 TLB flush routines.
> > 
> > Ok, thanks.
> > 
> > The patch at the beginning of this thread doesn't have that
> > behavior
> > though and it apparently still hung. I asked if Meelis could test
> > with
> > this feature disabled and DEBUG_PAGEALLOC on, since it flushes on
> > every
> > vfree and is not new logic, and also with a patch that logs exact
> > TLB
> > flush ranges and fault addresses on top of the kernel having this
> > issue. Hopefully that will shed some light.
> > 
> > Sorry for all the noise and speculation on this. It has been
> > difficult
> > to debug remotely with a tester and developer in different time
> > zones.
> > 
> > 
> Ok, so with a patch to disable setting the new vmalloc flush flag on
> architectures that have normal memory as executable (includes sparc),
> boot succeeds.
> 
> With this disable patch and DEBUG_PAGEALLOC on, it hangs earlier than
> before. Going from clues in other logs, it looks like it hangs right
> at
> the first normal vfree.
> 
> Thanks for all the testing Meelis!
> 
> So it seems like other, not new, TLB flushes also trigger the hang.
> 
> From earlier logs provided, this vfree would be the first call to
> flush_tlb_kernel_range(), and before any BPF allocations appear in
> the
> logs. So I am suspecting some other cause than the bisected patch at
> this point, but I guess it's not fully conclusive.
> 
> It could be informative to bisect upstream again with the
> DEBUG_PAGEALLOC configs on, to see if it indeed points to an earlier
> commit.

So now Meelis has found that the commit before any of my vmalloc
changes also hangs during boot with DEBUG_PAGEALLOC on. It does this
shortly after the first vfree, which DEBUG_PAGEALLOC of course turns into
a flush_tlb_kernel_range() on the freed allocation, just like my
vmalloc changes do on certain vmallocs. The upstream code calls
vm_unmap_aliases() instead of calling flush_tlb_kernel_range() directly,
but we also tested a version that called the flush directly on just the
allocation and it also hung. So it seems like issues flushing vmallocs
on this platform exist outside my commits.

How do people feel about calling this a sparc specific issue uncovered
by my patch instead of caused by it at this point?

If people agree with this assessment, it of course still seems like the
new changes turn the root cause into a more impactful issue for this
specific combination. On the other hand, I am not the right person to
fix the root cause, for several reasons including no hardware access.

Otherwise I could submit a patch to disable this for sparc since it
doesn't really get a security benefit from it anyway. What do people
think?
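
(If it came to that, one hypothetical shape for such an opt-out, shown
only as a sketch and not as a submitted patch, would be to request the
flag only when the arch can actually change kernel permissions:)

  /* Hypothetical sketch: skip VM_FLUSH_RESET_PERMS on architectures that
   * cannot reset permissions anyway, e.g. sparc.
   */
  static inline void set_vm_flush_reset_perms_if_supported(void *addr)
  {
          if (IS_ENABLED(CONFIG_ARCH_HAS_SET_MEMORY) ||
              IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP))
                  set_vm_flush_reset_perms(addr);
  }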

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c42872ed82ac..8d03427626dc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2122,9 +2122,10 @@  static inline void set_area_direct_map(const struct vm_struct *area,
 /* Handle removing and resetting vm mappings related to the vm_struct. */
 static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 {
+	const bool has_set_direct = IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP);
+	const bool flush_reset = area->flags & VM_FLUSH_RESET_PERMS;
 	unsigned long addr = (unsigned long)area->addr;
-	unsigned long start = ULONG_MAX, end = 0;
-	int flush_reset = area->flags & VM_FLUSH_RESET_PERMS;
+	unsigned long start = addr, end = addr + area->size;
 	int i;
 
 	/*
@@ -2133,7 +2134,7 @@  static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 	 * This is concerned with resetting the direct map any an vm alias with
 	 * execute permissions, without leaving a RW+X window.
 	 */
-	if (flush_reset && !IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
+	if (flush_reset && !has_set_direct) {
 		set_memory_nx(addr, area->nr_pages);
 		set_memory_rw(addr, area->nr_pages);
 	}
@@ -2146,22 +2147,24 @@  static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 
 	/*
 	 * If not deallocating pages, just do the flush of the VM area and
-	 * return.
+	 * return. If the arch doesn't have set_direct_map_(), also skip the
+	 * below work.
 	 */
-	if (!deallocate_pages) {
-		vm_unmap_aliases();
+	if (!deallocate_pages || !has_set_direct) {
+		flush_tlb_kernel_range(start, end);
 		return;
 	}
 
 	/*
 	 * If execution gets here, flush the vm mapping and reset the direct
 	 * map. Find the start and end range of the direct mappings to make sure
-	 * the vm_unmap_aliases() flush includes the direct map.
+	 * the flush_tlb_kernel_range() includes the direct map.
 	 */
 	for (i = 0; i < area->nr_pages; i++) {
-		if (page_address(area->pages[i])) {
+		addr = (unsigned long)page_address(area->pages[i]);
+		if (addr) {
 			start = min(addr, start);
-			end = max(addr, end);
+			end = max(addr + PAGE_SIZE, end);
 		}
 	}
 
@@ -2171,7 +2174,7 @@  static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 	 * reset the direct map permissions to the default.
 	 */
 	set_area_direct_map(area, set_direct_map_invalid_noflush);
-	_vm_unmap_aliases(start, end, 1);
+	flush_tlb_kernel_range(start, end);
 	set_area_direct_map(area, set_direct_map_default_noflush);
 }