mbox series

[RFC,00/10] fs/dax: Fix FS DAX page reference counts

Message ID cover.fe275e9819458a4bbb9451b888cafb88af8867d4.1712796818.git-series.apopple@nvidia.com
Headers show
Series fs/dax: Fix FS DAX page reference counts | expand

Message

Alistair Popple April 11, 2024, 12:57 a.m. UTC
FS DAX pages have always maintained their own page reference counts
without following the normal rules for page reference counting. In
particular pages are considered free when the refcount hits one rather
than zero and refcounts are not added when mapping the page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.

This RFC is an initial attempt at removing the special reference
counting and instead refcount FS DAX pages the same as normal pages.

There are still a couple of rough edges - in particular I haven't
completely removed the devmap PTE bit references from arch specific
code and there is probably some more cleanup of dev_pagemap reference
counting that could be done, particular in mm/gup.c. I also haven't
yet compiled on anything other than x86_64.

Before continuing further with this clean-up though I would appreciate
some feedback on the viability of this approach and any issues I may
have overlooked, as I am not intimately familiar with FS DAX code (or
for that matter the FS layer in general).

I have of course run some basic testing which didn't reveal any
problems.

Signed-off-by: Alistair Popple <apopple@nvidia.com>

Alistair Popple (10):
  mm/gup.c: Remove redundant check for PCI P2PDMA page
  mm/hmm: Remove dead check for HugeTLB and FS DAX
  pci/p2pdma: Don't initialise page refcount to one
  fs/dax: Don't track page mapping/index
  fs/dax: Refactor wait for dax idle page
  fs/dax: Add dax_page_free callback
  mm: Allow compound zone device pages
  fs/dax: Properly refcount fs dax pages
  mm/khugepage.c: Warn if trying to scan devmap pmd
  mm: Remove pXX_devmap

 Documentation/mm/arch_pgtable_helpers.rst    |   6 +-
 arch/arm64/include/asm/pgtable.h             |  24 +---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  42 +-----
 arch/powerpc/mm/book3s64/hash_pgtable.c      |   3 +-
 arch/powerpc/mm/book3s64/pgtable.c           |   8 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c     |   5 +-
 arch/powerpc/mm/pgtable.c                    |   2 +-
 arch/x86/include/asm/pgtable.h               |  31 +---
 drivers/dax/super.c                          |   2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c       |   2 +-
 drivers/nvdimm/pmem.c                        |  10 +-
 drivers/pci/p2pdma.c                         |   4 +-
 fs/dax.c                                     | 158 +++++++-----------
 fs/ext4/inode.c                              |   5 +-
 fs/fuse/dax.c                                |   4 +-
 fs/fuse/virtio_fs.c                          |   8 +-
 fs/userfaultfd.c                             |   2 +-
 fs/xfs/xfs_file.c                            |   4 +-
 include/linux/dax.h                          |  16 ++-
 include/linux/huge_mm.h                      |  11 +-
 include/linux/memremap.h                     |  12 +-
 include/linux/migrate.h                      |   2 +-
 include/linux/mm.h                           |  41 +-----
 include/linux/page-flags.h                   |   6 +-
 include/linux/pgtable.h                      |  17 +--
 lib/test_hmm.c                               |   2 +-
 mm/debug_vm_pgtable.c                        |  51 +------
 mm/gup.c                                     | 165 +------------------
 mm/hmm.c                                     |  40 +----
 mm/huge_memory.c                             | 180 +++++++++-----------
 mm/internal.h                                |   2 +-
 mm/khugepaged.c                              |   2 +-
 mm/mapping_dirty_helpers.c                   |   4 +-
 mm/memory-failure.c                          |   6 +-
 mm/memory.c                                  | 109 ++++++++----
 mm/memremap.c                                |  36 +---
 mm/migrate_device.c                          |   6 +-
 mm/mm_init.c                                 |   5 +-
 mm/mprotect.c                                |   2 +-
 mm/mremap.c                                  |   5 +-
 mm/page_vma_mapped.c                         |   5 +-
 mm/pgtable-generic.c                         |   7 +-
 mm/swap.c                                    |   2 +-
 mm/vmscan.c                                  |   5 +-
 44 files changed, 338 insertions(+), 721 deletions(-)

base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa

Comments

Dan Williams April 11, 2024, 5:28 p.m. UTC | #1
Alistair Popple wrote:
> FS DAX pages have always maintained their own page reference counts
> without following the normal rules for page reference counting. In
> particular pages are considered free when the refcount hits one rather
> than zero and refcounts are not added when mapping the page.

> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.

This is fair. However, for anyone coming in fresh to this situation
maybe some more "how we get here" history helps. That longer story is
here:

http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/

> This RFC is an initial attempt at removing the special reference
> counting and instead refcount FS DAX pages the same as normal pages.
> 
> There are still a couple of rough edges - in particular I haven't
> completely removed the devmap PTE bit references from arch specific
> code and there is probably some more cleanup of dev_pagemap reference
> counting that could be done, particular in mm/gup.c. I also haven't
> yet compiled on anything other than x86_64.
> 
> Before continuing further with this clean-up though I would appreciate
> some feedback on the viability of this approach and any issues I may
> have overlooked, as I am not intimately familiar with FS DAX code (or
> for that matter the FS layer in general).
> 
> I have of course run some basic testing which didn't reveal any
> problems.

FWIW I see the following with the ndctl/dax test-suite (double-checked
that vanilla v6.6 passes). I will take a look at the patches, but in the
meantime...

# meson test -C build --suite ndctl:dax
ninja: no work to do.
ninja: Entering directory `/root/git/ndctl/build'
[1/70] Generating version.h with a custom command
 1/13 ndctl:dax / daxdev-errors.sh          OK              14.46s
 2/13 ndctl:dax / multi-dax.sh              OK               2.70s
 3/13 ndctl:dax / sub-section.sh            OK               7.21s
 4/13 ndctl:dax / dax-dev                   OK               0.08s
[5/13] 🌖 ndctl:dax / dax-ext4.sh                            0/600s

...that last test crashed with:

 EXT4-fs (pmem0): mounted filesystem 2adea02a-a791-4714-be40-125afd16634b r/w with ordered
ota mode: none.
 page:ffffea0005f00000 refcount:0 mapcount:0 mapping:ffff8882a8a6be10 index:0x5800 pfn:0x1

 head:ffffea0005f00000 order:9 entire_mapcount:0 nr_pages_mapped:0 pincount:0
 aops:ext4_dax_aops ino:c dentry name:"image"
 flags: 0x4ffff800004040(reserved|head|node=0|zone=4|lastcpupid=0x1ffff)
 page_type: 0xffffffff()
 raw: 004ffff800004040 ffff888202681520 0000000000000000 ffff8882a8a6be10
 raw: 0000000000005800 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127

 ------------[ cut here ]------------
 kernel BUG at include/linux/mm.h:1419!
 invalid opcode: 0000 [#1] PREEMPT SMP PTI
 CPU: 0 PID: 1415 Comm: dax-pmd Tainted: G           OE    N 6.6.0+ #209
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
 RIP: 0010:dax_insert_pfn_pmd+0x41c/0x430
 Code: 89 c1 41 b8 01 00 00 00 48 89 ea 4c 89 e6 4c 89 f7 e8 18 8a c7 ff e9 e0 fc ff ff 48
c b3 48 89 c7 e8 a4 53 f7 ff <0f> 0b e8 0d ba a8 00 48 8b 15 86 8a 62 01 e9 89 fc ff ff 90

 RSP: 0000:ffffc90001d57b68 EFLAGS: 00010246
 RAX: 000000000000005c RBX: ffffea0005f00000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffffffffb3749a15 RDI: 00000000ffffffff
 RBP: ffff8882982c07e0 R08: 00000000ffffdfff R09: 0000000000000001
 R10: 00000000ffffdfff R11: ffffffffb3a771c0 R12: 800000017c0008e7
 R13: 8000000000000025 R14: ffff888202a395f8 R15: ffffea0005f00000
 FS:  00007fdaa00e3d80(0000) GS:ffff888477000000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fda9f800000 CR3: 0000000296224000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? die+0x32/0x80
  ? do_trap+0xd6/0x100
  ? dax_insert_pfn_pmd+0x41c/0x430
  ? dax_insert_pfn_pmd+0x41c/0x430
  ? do_error_trap+0x81/0x110
  ? dax_insert_pfn_pmd+0x41c/0x430
  ? exc_invalid_op+0x4c/0x60
  ? dax_insert_pfn_pmd+0x41c/0x430
  ? asm_exc_invalid_op+0x16/0x20
  ? dax_insert_pfn_pmd+0x41c/0x430
  ? dax_insert_pfn_pmd+0x41c/0x430
  dax_fault_iter+0x5d0/0x700
  dax_iomap_pmd_fault+0x212/0x450
  ext4_dax_huge_fault+0x1dc/0x470
  __handle_mm_fault+0x808/0x13e0
  handle_mm_fault+0x178/0x3e0
  do_user_addr_fault+0x186/0x830
  exc_page_fault+0x6f/0x1d0
  asm_exc_page_fault+0x22/0x30
 RIP: 0033:0x7fdaa072d009
Jason Gunthorpe April 11, 2024, 5:35 p.m. UTC | #2
On Thu, Apr 11, 2024 at 10:28:56AM -0700, Dan Williams wrote:
> Alistair Popple wrote:
> > FS DAX pages have always maintained their own page reference counts
> > without following the normal rules for page reference counting. In
> > particular pages are considered free when the refcount hits one rather
> > than zero and refcounts are not added when mapping the page.
> 
> > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> > mechanism for allowing GUP to hold references on the page (see
> > get_dev_pagemap). However there doesn't seem to be any reason why FS
> > DAX pages need their own reference counting scheme.
> 
> This is fair. However, for anyone coming in fresh to this situation
> maybe some more "how we get here" history helps. That longer story is
> here:
> 
> http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/

This never got merged? :(

Jason
Dan Williams April 11, 2024, 5:56 p.m. UTC | #3
Jason Gunthorpe wrote:
> On Thu, Apr 11, 2024 at 10:28:56AM -0700, Dan Williams wrote:
> > Alistair Popple wrote:
> > > FS DAX pages have always maintained their own page reference counts
> > > without following the normal rules for page reference counting. In
> > > particular pages are considered free when the refcount hits one rather
> > > than zero and refcounts are not added when mapping the page.
> > 
> > > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> > > mechanism for allowing GUP to hold references on the page (see
> > > get_dev_pagemap). However there doesn't seem to be any reason why FS
> > > DAX pages need their own reference counting scheme.
> > 
> > This is fair. However, for anyone coming in fresh to this situation
> > maybe some more "how we get here" history helps. That longer story is
> > here:
> > 
> > http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/
> 
> This never got merged? :(

Yeah... I got hung up on the "allocation" API to take ZONE_DEVICE pages
from recfount 0 to 1 and then never circled back.
Alistair Popple April 12, 2024, 3:54 a.m. UTC | #4
Dan Williams <dan.j.williams@intel.com> writes:

> Alistair Popple wrote:
>> FS DAX pages have always maintained their own page reference counts
>> without following the normal rules for page reference counting. In
>> particular pages are considered free when the refcount hits one rather
>> than zero and refcounts are not added when mapping the page.
>
>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> mechanism for allowing GUP to hold references on the page (see
>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> DAX pages need their own reference counting scheme.
>
> This is fair. However, for anyone coming in fresh to this situation
> maybe some more "how we get here" history helps. That longer story is
> here:
>
> http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/

Good idea.

>> This RFC is an initial attempt at removing the special reference
>> counting and instead refcount FS DAX pages the same as normal pages.
>> 
>> There are still a couple of rough edges - in particular I haven't
>> completely removed the devmap PTE bit references from arch specific
>> code and there is probably some more cleanup of dev_pagemap reference
>> counting that could be done, particular in mm/gup.c. I also haven't
>> yet compiled on anything other than x86_64.
>> 
>> Before continuing further with this clean-up though I would appreciate
>> some feedback on the viability of this approach and any issues I may
>> have overlooked, as I am not intimately familiar with FS DAX code (or
>> for that matter the FS layer in general).
>> 
>> I have of course run some basic testing which didn't reveal any
>> problems.
>
> FWIW I see the following with the ndctl/dax test-suite (double-checked
> that vanilla v6.6 passes). I will take a look at the patches, but in the
> meantime...

Hmmm...

> # meson test -C build --suite ndctl:dax
> ninja: no work to do.
> ninja: Entering directory `/root/git/ndctl/build'
> [1/70] Generating version.h with a custom command
>  1/13 ndctl:dax / daxdev-errors.sh          OK              14.46s
>  2/13 ndctl:dax / multi-dax.sh              OK               2.70s
>  3/13 ndctl:dax / sub-section.sh            OK               7.21s
>  4/13 ndctl:dax / dax-dev                   OK               0.08s
> [5/13] 🌖 ndctl:dax / dax-ext4.sh                            0/600s

...thanks for pasting that output. Turns out I didn't have destructive
testing enabled during the build so hadn't noticed these tests were not
running. It would be nice if these were reported as skipped when not
enabled rather than hidden.

With that fixed I'm seeing a couple of kernel warnings (and I think I
know why), so it might be worth holding off looking at this too closely
until I've fixed these.

> ...that last test crashed with:
>
>  EXT4-fs (pmem0): mounted filesystem 2adea02a-a791-4714-be40-125afd16634b r/w with ordered
> ota mode: none.
>  page:ffffea0005f00000 refcount:0 mapcount:0 mapping:ffff8882a8a6be10 index:0x5800 pfn:0x1
>
>  head:ffffea0005f00000 order:9 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>  aops:ext4_dax_aops ino:c dentry name:"image"
>  flags: 0x4ffff800004040(reserved|head|node=0|zone=4|lastcpupid=0x1ffff)
>  page_type: 0xffffffff()
>  raw: 004ffff800004040 ffff888202681520 0000000000000000 ffff8882a8a6be10
>  raw: 0000000000005800 0000000000000000 00000000ffffffff 0000000000000000
>  page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127
>
>  ------------[ cut here ]------------
>  kernel BUG at include/linux/mm.h:1419!
>  invalid opcode: 0000 [#1] PREEMPT SMP PTI
>  CPU: 0 PID: 1415 Comm: dax-pmd Tainted: G           OE    N 6.6.0+ #209
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
>  RIP: 0010:dax_insert_pfn_pmd+0x41c/0x430
>  Code: 89 c1 41 b8 01 00 00 00 48 89 ea 4c 89 e6 4c 89 f7 e8 18 8a c7 ff e9 e0 fc ff ff 48
> c b3 48 89 c7 e8 a4 53 f7 ff <0f> 0b e8 0d ba a8 00 48 8b 15 86 8a 62 01 e9 89 fc ff ff 90
>
>  RSP: 0000:ffffc90001d57b68 EFLAGS: 00010246
>  RAX: 000000000000005c RBX: ffffea0005f00000 RCX: 0000000000000000
>  RDX: 0000000000000000 RSI: ffffffffb3749a15 RDI: 00000000ffffffff
>  RBP: ffff8882982c07e0 R08: 00000000ffffdfff R09: 0000000000000001
>  R10: 00000000ffffdfff R11: ffffffffb3a771c0 R12: 800000017c0008e7
>  R13: 8000000000000025 R14: ffff888202a395f8 R15: ffffea0005f00000
>  FS:  00007fdaa00e3d80(0000) GS:ffff888477000000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007fda9f800000 CR3: 0000000296224000 CR4: 00000000000006f0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  Call Trace:
>   <TASK>
>   ? die+0x32/0x80
>   ? do_trap+0xd6/0x100
>   ? dax_insert_pfn_pmd+0x41c/0x430
>   ? dax_insert_pfn_pmd+0x41c/0x430
>   ? do_error_trap+0x81/0x110
>   ? dax_insert_pfn_pmd+0x41c/0x430
>   ? exc_invalid_op+0x4c/0x60
>   ? dax_insert_pfn_pmd+0x41c/0x430
>   ? asm_exc_invalid_op+0x16/0x20
>   ? dax_insert_pfn_pmd+0x41c/0x430
>   ? dax_insert_pfn_pmd+0x41c/0x430
>   dax_fault_iter+0x5d0/0x700
>   dax_iomap_pmd_fault+0x212/0x450
>   ext4_dax_huge_fault+0x1dc/0x470
>   __handle_mm_fault+0x808/0x13e0
>   handle_mm_fault+0x178/0x3e0
>   do_user_addr_fault+0x186/0x830
>   exc_page_fault+0x6f/0x1d0
>   asm_exc_page_fault+0x22/0x30
>  RIP: 0033:0x7fdaa072d009
Alistair Popple April 12, 2024, 6:55 a.m. UTC | #5
Alistair Popple <apopple@nvidia.com> writes:

> Dan Williams <dan.j.williams@intel.com> writes:
>
>> Alistair Popple wrote:
>>> FS DAX pages have always maintained their own page reference counts
>>> without following the normal rules for page reference counting. In
>>> particular pages are considered free when the refcount hits one rather
>>> than zero and refcounts are not added when mapping the page.
>>
>>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>>> mechanism for allowing GUP to hold references on the page (see
>>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>>> DAX pages need their own reference counting scheme.
>>
>> This is fair. However, for anyone coming in fresh to this situation
>> maybe some more "how we get here" history helps. That longer story is
>> here:
>>
>> http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/
>
> Good idea.
>
>>> This RFC is an initial attempt at removing the special reference
>>> counting and instead refcount FS DAX pages the same as normal pages.
>>> 
>>> There are still a couple of rough edges - in particular I haven't
>>> completely removed the devmap PTE bit references from arch specific
>>> code and there is probably some more cleanup of dev_pagemap reference
>>> counting that could be done, particular in mm/gup.c. I also haven't
>>> yet compiled on anything other than x86_64.
>>> 
>>> Before continuing further with this clean-up though I would appreciate
>>> some feedback on the viability of this approach and any issues I may
>>> have overlooked, as I am not intimately familiar with FS DAX code (or
>>> for that matter the FS layer in general).
>>> 
>>> I have of course run some basic testing which didn't reveal any
>>> problems.
>>
>> FWIW I see the following with the ndctl/dax test-suite (double-checked
>> that vanilla v6.6 passes). I will take a look at the patches, but in the
>> meantime...
>
> Hmmm...
>
>> # meson test -C build --suite ndctl:dax
>> ninja: no work to do.
>> ninja: Entering directory `/root/git/ndctl/build'
>> [1/70] Generating version.h with a custom command
>>  1/13 ndctl:dax / daxdev-errors.sh          OK              14.46s
>>  2/13 ndctl:dax / multi-dax.sh              OK               2.70s
>>  3/13 ndctl:dax / sub-section.sh            OK               7.21s
>>  4/13 ndctl:dax / dax-dev                   OK               0.08s
>> [5/13] 🌖 ndctl:dax / dax-ext4.sh                            0/600s
>
> ...thanks for pasting that output. Turns out I didn't have destructive
> testing enabled during the build so hadn't noticed these tests were not
> running. It would be nice if these were reported as skipped when not
> enabled rather than hidden.
>
> With that fixed I'm seeing a couple of kernel warnings (and I think I
> know why), so it might be worth holding off looking at this too closely
> until I've fixed these.

Ok, I think I found the dragons you were talking about earlier for
device-dax. I completely broke that because as you've already pointed
out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX
(because the pages are essentially normal pages now anyway), but we
don't have a PMD equivalent of vm_normal_page() which leads to all sorts
of issues for DEVDAX.

So I will probably have to add something like that unless we only need
to support large (pmd/pud) mappings of DEVDAX pages on systems with
CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter
based on pte_special().

>> ...that last test crashed with:
>>
>>  EXT4-fs (pmem0): mounted filesystem 2adea02a-a791-4714-be40-125afd16634b r/w with ordered
>> ota mode: none.
>>  page:ffffea0005f00000 refcount:0 mapcount:0 mapping:ffff8882a8a6be10 index:0x5800 pfn:0x1
>>
>>  head:ffffea0005f00000 order:9 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>  aops:ext4_dax_aops ino:c dentry name:"image"
>>  flags: 0x4ffff800004040(reserved|head|node=0|zone=4|lastcpupid=0x1ffff)
>>  page_type: 0xffffffff()
>>  raw: 004ffff800004040 ffff888202681520 0000000000000000 ffff8882a8a6be10
>>  raw: 0000000000005800 0000000000000000 00000000ffffffff 0000000000000000
>>  page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127
>>
>>  ------------[ cut here ]------------
>>  kernel BUG at include/linux/mm.h:1419!
>>  invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>  CPU: 0 PID: 1415 Comm: dax-pmd Tainted: G           OE    N 6.6.0+ #209
>>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
>>  RIP: 0010:dax_insert_pfn_pmd+0x41c/0x430
>>  Code: 89 c1 41 b8 01 00 00 00 48 89 ea 4c 89 e6 4c 89 f7 e8 18 8a c7 ff e9 e0 fc ff ff 48
>> c b3 48 89 c7 e8 a4 53 f7 ff <0f> 0b e8 0d ba a8 00 48 8b 15 86 8a 62 01 e9 89 fc ff ff 90
>>
>>  RSP: 0000:ffffc90001d57b68 EFLAGS: 00010246
>>  RAX: 000000000000005c RBX: ffffea0005f00000 RCX: 0000000000000000
>>  RDX: 0000000000000000 RSI: ffffffffb3749a15 RDI: 00000000ffffffff
>>  RBP: ffff8882982c07e0 R08: 00000000ffffdfff R09: 0000000000000001
>>  R10: 00000000ffffdfff R11: ffffffffb3a771c0 R12: 800000017c0008e7
>>  R13: 8000000000000025 R14: ffff888202a395f8 R15: ffffea0005f00000
>>  FS:  00007fdaa00e3d80(0000) GS:ffff888477000000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 00007fda9f800000 CR3: 0000000296224000 CR4: 00000000000006f0
>>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>  Call Trace:
>>   <TASK>
>>   ? die+0x32/0x80
>>   ? do_trap+0xd6/0x100
>>   ? dax_insert_pfn_pmd+0x41c/0x430
>>   ? dax_insert_pfn_pmd+0x41c/0x430
>>   ? do_error_trap+0x81/0x110
>>   ? dax_insert_pfn_pmd+0x41c/0x430
>>   ? exc_invalid_op+0x4c/0x60
>>   ? dax_insert_pfn_pmd+0x41c/0x430
>>   ? asm_exc_invalid_op+0x16/0x20
>>   ? dax_insert_pfn_pmd+0x41c/0x430
>>   ? dax_insert_pfn_pmd+0x41c/0x430
>>   dax_fault_iter+0x5d0/0x700
>>   dax_iomap_pmd_fault+0x212/0x450
>>   ext4_dax_huge_fault+0x1dc/0x470
>>   __handle_mm_fault+0x808/0x13e0
>>   handle_mm_fault+0x178/0x3e0
>>   do_user_addr_fault+0x186/0x830
>>   exc_page_fault+0x6f/0x1d0
>>   asm_exc_page_fault+0x22/0x30
>>  RIP: 0033:0x7fdaa072d009
Jason Gunthorpe April 12, 2024, 11:53 a.m. UTC | #6
On Fri, Apr 12, 2024 at 04:55:31PM +1000, Alistair Popple wrote:

> Ok, I think I found the dragons you were talking about earlier for
> device-dax. I completely broke that because as you've already pointed
> out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX
> (because the pages are essentially normal pages now anyway), but we
> don't have a PMD equivalent of vm_normal_page() which leads to all sorts
> of issues for DEVDAX.

What about vm_normal_page() depends on the radix level ?

Doesn't DEVDAX memory have struct page too?

> So I will probably have to add something like that unless we only need
> to support large (pmd/pud) mappings of DEVDAX pages on systems with
> CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter
> based on pte_special().

pte_special should only be used by memory without a struct page, is
that what DEVDAX is?

Jason
Dan Williams April 12, 2024, 5:32 p.m. UTC | #7
Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 04:55:31PM +1000, Alistair Popple wrote:
> 
> > Ok, I think I found the dragons you were talking about earlier for
> > device-dax. I completely broke that because as you've already pointed
> > out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX
> > (because the pages are essentially normal pages now anyway), but we
> > don't have a PMD equivalent of vm_normal_page() which leads to all sorts
> > of issues for DEVDAX.
> 
> What about vm_normal_page() depends on the radix level ?
> 
> Doesn't DEVDAX memory have struct page too?

Yes.

> > So I will probably have to add something like that unless we only need
> > to support large (pmd/pud) mappings of DEVDAX pages on systems with
> > CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter
> > based on pte_special().
> 
> pte_special should only be used by memory without a struct page, is
> that what DEVDAX is?

Right, I don't think pte_special is applicable for any DAX pages.