[v1,0/7] mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages

Message ID 20220315141837.137118-1-david@redhat.com (mailing list archive)

Message

David Hildenbrand March 15, 2022, 2:18 p.m. UTC
More information on the general COW issues can be found at [2]. This series
is based on v5.17-rc8, [1]:
	[PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for
	THP and swap
and [4]:
	[PATCH v2 00/15] mm: COW fixes part 2: reliable GUP pins of
	anonymous pages

v1 is located at:
	https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_3_v1


This series fixes memory corruptions that occur when a GUP R/W reference
(FOLL_WRITE | FOLL_GET) was taken on an anonymous page and COW logic fails
to detect exclusivity of the page, replacing the anonymous page with a copy
in the page table: the GUP reference then loses synchronicity with the
pages mapped into the page tables. This series focuses on x86, arm64,
s390x and ppc64/book3s -- other architectures are fairly easy to support
by implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.

This primarily fixes the O_DIRECT memory corruptions that can happen on
concurrent swapout, whereby we lose DMA reads to a page (modifying the user
page by writing to it).

O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM)
DMA from/to a user page. In the long run, we want to convert it to properly
use FOLL_PIN, and John is working on it, but that might take a while and
might not be easy to backport. In the meantime, let's restore what used to
work before we started modifying our COW logic: make R/W FOLL_GET
references reliable as long as there is no fork() after GUP involved.

This is just the natural follow-up of part 2, that will also further
reduce "wrong COW" on the swapin path, for example, when we cannot remove
a page from the swapcache due to concurrent writeback, or if we have two
threads faulting on the same swapped-out page. Fixing O_DIRECT is just a
nice side-product :)

This issue, including other related COW issues, has been summarized in [3]
under 2):
"
  2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)

  It was discovered that we can create a memory corruption by reading a
  file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
  concurrently writing to an unrelated part (e.g., last byte) of the same
  page, and concurrently write-protecting the page via clear_refs
  SOFTDIRTY tracking [6].

  For the reproducer, the issue is that O_DIRECT grabs a reference of the
  target page (via FOLL_GET) and clear_refs write-protects the relevant
  page table entry. On successive write access to the page from the
  process itself, we wrongly COW the page when resolving the write fault,
  resulting in a loss of synchronicity and consequently a memory corruption.

  While some people might think that using clear_refs in this combination
  is a corner case, it unfortunately turns out to be a more generic problem.

  For example, it was just recently discovered that we can similarly
  create a memory corruption without clear_refs, simply by concurrently
  swapping out the buffer pages [7]. Note that we nowadays even use the
  swap infrastructure in Linux without an actual swap disk/partition: the
  prime example is zram which is enabled as default under Fedora [10].

  The root issue is that a write-fault on a page that has additional
  references results in a COW and thereby a loss of synchronicity
  and consequently a memory corruption if two parties believe they are
  referencing the same page.
"

We don't particularly care about R/O FOLL_GET references: they were never
reliable and O_DIRECT doesn't expect to observe modifications from a page
after DMA was started.

Note that:
* this only fixes the issue on x86, arm64, s390x and ppc64/book3s
  ("enterprise architectures"). Other architectures have to implement
  __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
* this does *not* consider any kind of fork() after taking the reference:
  fork() after GUP never worked reliably with FOLL_GET.
* Not losing PG_anon_exclusive during swapout was the last remaining
  piece. KSM already makes sure that there are no other references on
  a page before considering it for sharing. Page migration maintains
  PG_anon_exclusive and simply fails when there are additional references
  (freezing the refcount fails). Only swapout code dropped the
  PG_anon_exclusive flag because it requires more work to remember +
  restore it.

With this series in place, most COW issues of [3] are fixed on said
architectures. Other architectures can implement
__HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
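
To give a rough idea of what supporting __HAVE_ARCH_PTE_SWP_EXCLUSIVE
involves, here is a minimal userspace sketch of keeping an "exclusive"
marker in a spare software bit of a non-present (swap) PTE. Everything in
it is an assumption made for illustration: the toy_* names, the 64-bit
layout and the chosen bit positions do not correspond to any real
architecture or to the helpers added by the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE: a plain 64-bit word. The layout is made up for illustration. */
typedef struct { uint64_t val; } toy_pte_t;

/* Hypothetical bit assignments, not taken from any architecture. */
#define TOY_PTE_PRESENT   (UINT64_C(1) << 0)
#define TOY_SWP_EXCLUSIVE (UINT64_C(1) << 2) /* only meaningful if !present */

/* Mark a swap PTE as referring to an exclusive anonymous page. */
static inline toy_pte_t toy_swp_mkexclusive(toy_pte_t pte)
{
	/* The bit may only be used while the PTE is non-present. */
	assert(!(pte.val & TOY_PTE_PRESENT));
	pte.val |= TOY_SWP_EXCLUSIVE;
	return pte;
}

/* Test whether the swap PTE carries the exclusive marker. */
static inline bool toy_swp_exclusive(toy_pte_t pte)
{
	return (pte.val & TOY_SWP_EXCLUSIVE) != 0;
}

/* Drop the marker again, leaving all other bits untouched. */
static inline toy_pte_t toy_swp_clear_exclusive(toy_pte_t pte)
{
	pte.val &= ~TOY_SWP_EXCLUSIVE;
	return pte;
}
```

The only real requirement the sketch tries to convey is that the bit must
not overlap with the bits used to encode the swap type and offset, so that
setting and clearing it leaves the swap entry itself intact.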

What remains is the COW security issue on hugetlb with FOLL_GET, and
SOFTDIRTY tracking. I'll tackle both (guess what?) in part 4 once part 2
and part 3 are on their way upstream.

[1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
[2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
[3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
[4] https://lkml.kernel.org/r/20220315104741.63071-1-david@redhat.com


David Hildenbrand (7):
  mm/swap: remember PG_anon_exclusive via a swp pte bit
  mm/debug_vm_pgtable: add tests for __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  arm64/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
  powerpc/pgtable: remove _PAGE_BIT_SWAP_TYPE for book3s
  powerpc/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE for book3s

 arch/arm64/include/asm/pgtable-prot.h        |  1 +
 arch/arm64/include/asm/pgtable.h             | 23 ++++++--
 arch/powerpc/include/asm/book3s/64/pgtable.h | 31 ++++++++---
 arch/s390/include/asm/pgtable.h              | 37 ++++++++++---
 arch/x86/include/asm/pgtable.h               | 16 ++++++
 arch/x86/include/asm/pgtable_64.h            |  4 +-
 arch/x86/include/asm/pgtable_types.h         |  5 ++
 include/linux/pgtable.h                      | 29 +++++++++++
 include/linux/swapops.h                      |  2 +
 mm/debug_vm_pgtable.c                        | 15 ++++++
 mm/memory.c                                  | 55 ++++++++++++++++++--
 mm/rmap.c                                    | 19 ++++---
 mm/swapfile.c                                | 13 ++++-
 13 files changed, 219 insertions(+), 31 deletions(-)

Comments

Jason Gunthorpe March 18, 2022, 11:48 p.m. UTC | #1
On Tue, Mar 15, 2022 at 03:18:30PM +0100, David Hildenbrand wrote:
> This is just the natural follow-up of part 2, that will also further
> reduce "wrong COW" on the swapin path, for example, when we cannot remove
> a page from the swapcache due to concurrent writeback, or if we have two
> threads faulting on the same swapped-out page. Fixing O_DIRECT is just a
> nice side-product :)

I know I would benefit a lot from a description of the swap specific
issue a bit more. Most of this message talks about clear_refs which I
do understand a bit better.

Is this talking about what happens after a page gets swapped back in?
eg the exclusive bit is missing when the page is recreated?

Otherwise nothing stood out to me here, but I know almost nothing
about swap.

Thanks,
Jason

David Hildenbrand March 19, 2022, 11:17 a.m. UTC | #2
On 19.03.22 00:48, Jason Gunthorpe wrote:
> On Tue, Mar 15, 2022 at 03:18:30PM +0100, David Hildenbrand wrote:
>> This is just the natural follow-up of part 2, that will also further
>> reduce "wrong COW" on the swapin path, for example, when we cannot remove
>> a page from the swapcache due to concurrent writeback, or if we have two
>> threads faulting on the same swapped-out page. Fixing O_DIRECT is just a
>> nice side-product :)

Hi Jason,

thanks for the review!

> 
> I know I would benefit a lot from a description of the swap specific
> issue a bit more. Most of this message talks about clear_refs which I
> do understand a bit better.

Patch #1 contains some additional information. In general, it's the same
issue as with any other mechanism that could get the page mapped R/O
while there is a FOLL_GET | FOLL_WRITE reference to it -- for example,
DMA to that page as happens with our O_DIRECT reproducer.

Part 2 essentially fixed the other cases (i.e., clear_refs), but the
remaining swapout+refault from swapcache case is handled in this series.

> 
> Is this talking about what happens after a page gets swapped back in?
> eg the exclusive bit is missing when the page is recreated?

Right, try_to_unmap() was the last remaining case where we'd have lost
the exclusivity information -- it wasn't required for reliable GUP pins
in part 2.

Here is what happens without PG_anon_exclusive:

1. The application uses parts of an anonymous base page for direct I/O,
let's assume the first 512 bytes of the page.

fd = open(filename, O_DIRECT | ...);
pread(fd, page, 512, 0);

O_DIRECT will take a FOLL_GET|FOLL_WRITE reference on the page.

2. Reclaim kicks in and wants to swap out the page -- mm/vmscan.c

shrink_page_list() first adds the page to the swapcache and then unmaps
it via try_to_unmap().

After the page was successfully unmapped, pageout() will start
triggering writeback but will realize that there are additional
references on the page (via is_page_cache_freeable()) and fail.

3. The application uses unrelated parts of the page for other purposes
while the DMA is not completed, e.g., doing a simple

page[4095]++;

The read access will fault in the page readable from the swap cache in
do_swap_page(). The write access will trigger our COW fault handler. As
we have an additional reference on the page, we will create a copy and
map it into our page table. At this point, the page table and the GUP
reference are out of sync.

4. O_DIRECT completes

The read targets the page that is no longer referenced in the page
tables. For the application, it looks like the read() never happened, as
we lost our DMA read to our page.
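
The four steps can be replayed in a small userspace toy model. This is
only a sketch under simplifying assumptions: the toy_* structures and
functions are made up for illustration, swapout plus the read fault are
collapsed into "the page is mapped read-only again", and the COW decision
is reduced to a bare reference count check rather than the actual logic
in mm/memory.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
	int refcount;                      /* page table + GUP references */
	unsigned char data[TOY_PAGE_SIZE];
};

struct toy_mm {
	struct toy_page *mapped;           /* page the page table points at */
	bool writable;
};

/*
 * Toy write fault on a read-only anonymous page: reuse the page only if
 * the page table holds the sole reference, otherwise map a copy.
 * Returns true if a copy was made.
 */
static inline bool toy_wp_fault(struct toy_mm *mm, struct toy_page *copy)
{
	bool copied = mm->mapped->refcount != 1;

	if (copied) {
		memcpy(copy->data, mm->mapped->data, TOY_PAGE_SIZE);
		copy->refcount = 1;
		mm->mapped->refcount--;        /* page table drops the old page */
		mm->mapped = copy;
	}
	mm->writable = true;
	return copied;
}

/* Replays steps 1-4; returns true if the application sees the DMA data. */
static inline bool toy_scenario(void)
{
	static struct toy_page orig, copy;
	struct toy_mm mm = { .mapped = &orig, .writable = true };
	struct toy_page *gup;

	memset(&orig, 0, sizeof(orig));
	memset(&copy, 0, sizeof(copy));
	orig.refcount = 1;

	gup = mm.mapped;                   /* 1. O_DIRECT takes a reference */
	gup->refcount++;
	mm.writable = false;               /* 2. unmapped, then faulted in R/O */
	toy_wp_fault(&mm, &copy);          /* 3. page[4095]++ triggers COW */
	mm.mapped->data[TOY_PAGE_SIZE - 1]++;
	gup->data[0] = 0xAA;               /* 4. DMA lands in the old page */

	return mm.mapped->data[0] == 0xAA;
}
```

In this model toy_scenario() returns false: the byte written through the
GUP reference ends up in the page that is no longer mapped, which is the
"lost DMA read" described in step 4.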


With PG_anon_exclusive from series part 2, we don't remember exclusivity
information in try_to_unmap() yet. do_swap_page() cannot restore it as
it has to assume the page is possibly shared.

With this series, we remember exclusivity information in try_to_unmap()
in the SWP PTE. do_swap_page() can restore it. Consequently, our COW
fault handler won't create a wrong copy and we won't go out of sync
between GUP and the page mapped into the page table.
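
The difference between the two paragraphs above can be sketched with
another toy model. Again, all names here (toy_page, toy_pte and the toy_*
functions) are hypothetical and the model is heavily simplified: it only
tracks whether the exclusivity information survives the trip through a
swap PTE, and it leaves out the reference-count based fallback that the
real COW handler can still use for pages not marked exclusive.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool anon_exclusive;   /* stands in for PG_anon_exclusive */
};

struct toy_pte {
	struct toy_page *page; /* page this entry refers to (via swapcache) */
	bool present;
	bool writable;
	bool swp_exclusive;    /* stands in for the bit in the swap PTE */
};

/*
 * Toy unmap during reclaim. With remember == false the exclusivity
 * information is simply dropped (behavior before this series); with
 * remember == true it is moved into the swap PTE.
 */
static inline void toy_try_to_unmap(struct toy_pte *pte, bool remember)
{
	assert(pte->present);
	pte->swp_exclusive = remember && pte->page->anon_exclusive;
	pte->page->anon_exclusive = false;
	pte->present = false;
	pte->writable = false;
}

/* Toy swap-in: restore whatever exclusivity the swap PTE still carries. */
static inline void toy_do_swap_page(struct toy_pte *pte)
{
	assert(!pte->present);
	pte->page->anon_exclusive = pte->swp_exclusive;
	pte->swp_exclusive = false;
	pte->present = true;   /* mapped read-only; a write faults again */
}

/* Toy write-fault decision: true means reuse in place, false means copy. */
static inline bool toy_wp_can_reuse(const struct toy_pte *pte)
{
	return pte->present && pte->page->anon_exclusive;
}
```

Calling toy_try_to_unmap(&pte, false) followed by toy_do_swap_page(&pte)
leaves toy_wp_can_reuse(&pte) false, so the model would copy the page;
with remember set to true the page is reused and the GUP reference stays
in sync with the page table.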


Hope that helps!