[{"id":1763965,"web_url":"http://patchwork.ozlabs.org/comment/1763965/","msgid":"<20170906094700.GC27916@quack2.suse.cz>","list_archive_url":null,"date":"2017-09-06T09:47:00","subject":"Re: [PATCH 6/9] ext4: safely transition S_DAX on journaling changes","submitter":{"id":363,"url":"http://patchwork.ozlabs.org/api/people/363/","name":"Jan Kara","email":"jack@suse.cz"},"content":"On Tue 05-09-17 16:35:38, Ross Zwisler wrote:\n> The IOCTL path which switches the journaling mode for an inode is currently\n> unsafe because it doesn't properly do a writeback and invalidation on the\n> inode.  In XFS, for example, safe transitions of S_DAX are handled by\n> xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O,\n> does a writeback via filemap_write_and_wait() and an invalidation via\n> invalidate_inode_pages2().\n> \n> Without this in place we can see the following kernel warning when we try\n> and insert a DAX exceptional entry but find that a dirty page cache page is\n> still in the mapping->radix_tree:\n> \n>  WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550\n>  Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm\n>  CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3\n>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014\n>  task: ffff88020ccd0000 task.stack: ffffc900021d4000\n>  RIP: 0010:__delete_from_page_cache+0x375/0x550\n>  RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002\n>  RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68\n>  RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5\n>  RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70\n>  R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70\n>  R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68\n>  FS:  00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000\n>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033\n>  CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0\n>  Call Trace:\n>   dax_insert_mapping_entry+0x158/0x2c0\n>   dax_iomap_fault+0x1020/0x1bb0\n>   ext4_dax_huge_fault+0xc8/0x160\n>   ext4_dax_fault+0x10/0x20\n>   __do_fault+0x20/0x110\n>   __handle_mm_fault+0x97d/0x1120\n>   handle_mm_fault+0x188/0x2f0\n>   __do_page_fault+0x28f/0x590\n>   trace_do_page_fault+0x58/0x2c0\n>   do_async_page_fault+0x2c/0x90\n>   async_page_fault+0x28/0x30\n> \n> I'm pretty sure we could make a test that shows userspace visible data\n> corruption as well in this scenario.\n> \n> Make it safe to change the journaling mode and turn on or off S_DAX by\n> adding locking to properly lock out page faults (i_mmap_sem) and then doing\n> the writeback and invalidate.  I/O is already held off because all callers\n> of ext4_ioctl_setflags() hold the inode lock.\n\nYeah, this is a good point. It is just that this is not enough as I\ndiscovered in [1]. You also need to tear down & recreate VMAs when changing\nDAX flag which is a bit tricky. So for now I think returning EBUSY when\nfile is mmaped and we'd like to flip DAX flag is the best solution. Hmm?\n\n[1] https://www.spinics.net/lists/linux-xfs/msg09859.html\n\n> The locking for this new code is complex because of the following:\n> \n> 1) filemap_write_and_wait() eventually calls ext4_writepages(), which\n> acquires the sbi->s_journal_flag_rwsem.  This lock ranks above the\n> jbdw_handle which is eventually taken by ext4_journal_start().  This\n> essentially means that the writeback has to happen outside of the context\n> of an active journal handle (outside of ext4_journal_start() to\n> ext4_journal_stop().)\n> \n> 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and\n> this lock again ranks above the jbd2_handle taken by ext4_journal_start().\n> So, as with the writeback code in 1) above we have to take ei->i_mmap_sem\n> outside of the context of an active journal handle.\n\nWelcome to the joy of fs locking ;)\n\n> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>\n> CC: stable@vger.kernel.org\n\n\t\t\t\t\t\t\t\tHonza","headers":{"Return-Path":"<linux-ext4-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=linux-ext4-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xnJf02PGZz9sCZ\n\tfor <patchwork-incoming@ozlabs.org>;\n\tWed,  6 Sep 2017 19:47:20 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1752549AbdIFJrG (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tWed, 6 Sep 2017 05:47:06 -0400","from mx2.suse.de ([195.135.220.15]:57929 \"EHLO mx1.suse.de\"\n\trhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP\n\tid S1752257AbdIFJrF (ORCPT <rfc822;linux-ext4@vger.kernel.org>);\n\tWed, 6 Sep 2017 05:47:05 -0400","from relay2.suse.de (charybdis-ext.suse.de [195.135.220.254])\n\tby mx1.suse.de (Postfix) with ESMTP id 073A5AC61;\n\tWed,  6 Sep 2017 09:47:03 +0000 (UTC)","by quack2.suse.cz (Postfix, from userid 1000)\n\tid AFF671E07DE; Wed,  6 Sep 2017 11:47:00 +0200 (CEST)"],"X-Virus-Scanned":"by amavisd-new at test-mx.suse.de","Date":"Wed, 6 Sep 2017 11:47:00 +0200","From":"Jan Kara <jack@suse.cz>","To":"Ross Zwisler <ross.zwisler@linux.intel.com>","Cc":"Andrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org,\n\t\"Darrick J. Wong\" <darrick.wong@oracle.com>,\n\tTheodore Ts'o <tytso@mit.edu>, Andreas Dilger <adilger.kernel@dilger.ca>,\n\tChristoph Hellwig <hch@lst.de>, Dan Williams <dan.j.williams@intel.com>,\n\tDave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,\n\tlinux-ext4@vger.kernel.org, linux-nvdimm@lists.01.org,\n\tlinux-xfs@vger.kernel.org, stable@vger.kernel.org","Subject":"Re: [PATCH 6/9] ext4: safely transition S_DAX on journaling changes","Message-ID":"<20170906094700.GC27916@quack2.suse.cz>","References":"<20170905223541.20594-1-ross.zwisler@linux.intel.com>\n\t<20170905223541.20594-7-ross.zwisler@linux.intel.com>","MIME-Version":"1.0","Content-Type":"text/plain; charset=us-ascii","Content-Disposition":"inline","In-Reply-To":"<20170905223541.20594-7-ross.zwisler@linux.intel.com>","User-Agent":"Mutt/1.5.24 (2015-08-30)","Sender":"linux-ext4-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<linux-ext4.vger.kernel.org>","X-Mailing-List":"linux-ext4@vger.kernel.org"}},{"id":1764268,"web_url":"http://patchwork.ozlabs.org/comment/1764268/","msgid":"<20170906170946.GC17663@linux.intel.com>","list_archive_url":null,"date":"2017-09-06T17:09:46","subject":"Re: [PATCH 6/9] ext4: safely transition S_DAX on journaling changes","submitter":{"id":46514,"url":"http://patchwork.ozlabs.org/api/people/46514/","name":"Ross Zwisler","email":"ross.zwisler@linux.intel.com"},"content":"On Wed, Sep 06, 2017 at 11:47:00AM +0200, Jan Kara wrote:\n> On Tue 05-09-17 16:35:38, Ross Zwisler wrote:\n> > The IOCTL path which switches the journaling mode for an inode is currently\n> > unsafe because it doesn't properly do a writeback and invalidation on the\n> > inode.  In XFS, for example, safe transitions of S_DAX are handled by\n> > xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O,\n> > does a writeback via filemap_write_and_wait() and an invalidation via\n> > invalidate_inode_pages2().\n> > \n> > Without this in place we can see the following kernel warning when we try\n> > and insert a DAX exceptional entry but find that a dirty page cache page is\n> > still in the mapping->radix_tree:\n> > \n> >  WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550\n> >  Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm\n> >  CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3\n> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014\n> >  task: ffff88020ccd0000 task.stack: ffffc900021d4000\n> >  RIP: 0010:__delete_from_page_cache+0x375/0x550\n> >  RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002\n> >  RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68\n> >  RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5\n> >  RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70\n> >  R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70\n> >  R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68\n> >  FS:  00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000\n> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033\n> >  CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0\n> >  Call Trace:\n> >   dax_insert_mapping_entry+0x158/0x2c0\n> >   dax_iomap_fault+0x1020/0x1bb0\n> >   ext4_dax_huge_fault+0xc8/0x160\n> >   ext4_dax_fault+0x10/0x20\n> >   __do_fault+0x20/0x110\n> >   __handle_mm_fault+0x97d/0x1120\n> >   handle_mm_fault+0x188/0x2f0\n> >   __do_page_fault+0x28f/0x590\n> >   trace_do_page_fault+0x58/0x2c0\n> >   do_async_page_fault+0x2c/0x90\n> >   async_page_fault+0x28/0x30\n> > \n> > I'm pretty sure we could make a test that shows userspace visible data\n> > corruption as well in this scenario.\n> > \n> > Make it safe to change the journaling mode and turn on or off S_DAX by\n> > adding locking to properly lock out page faults (i_mmap_sem) and then doing\n> > the writeback and invalidate.  I/O is already held off because all callers\n> > of ext4_ioctl_setflags() hold the inode lock.\n> \n> Yeah, this is a good point. It is just that this is not enough as I\n> discovered in [1]. You also need to tear down & recreate VMAs when changing\n> DAX flag which is a bit tricky. So for now I think returning EBUSY when\n> file is mmaped and we'd like to flip DAX flag is the best solution. Hmm?\n> \n> [1] https://www.spinics.net/lists/linux-xfs/msg09859.html\n\nYea, thanks for the link, I totally missed this discussion (obviously). \n\nCool, I'll rework this for v2.\n\n> > The locking for this new code is complex because of the following:\n> > \n> > 1) filemap_write_and_wait() eventually calls ext4_writepages(), which\n> > acquires the sbi->s_journal_flag_rwsem.  This lock ranks above the\n> > jbdw_handle which is eventually taken by ext4_journal_start().  This\n> > essentially means that the writeback has to happen outside of the context\n> > of an active journal handle (outside of ext4_journal_start() to\n> > ext4_journal_stop().)\n> > \n> > 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and\n> > this lock again ranks above the jbd2_handle taken by ext4_journal_start().\n> > So, as with the writeback code in 1) above we have to take ei->i_mmap_sem\n> > outside of the context of an active journal handle.\n> \n> Welcome to the joy of fs locking ;)\n\n:)  Well, I feel like I learned a lot more about ext4 during this patch set!\n\n> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>\n> > CC: stable@vger.kernel.org\n> \n> \t\t\t\t\t\t\t\tHonza\n> \n> -- \n> Jan Kara <jack@suse.com>\n> SUSE Labs, CR","headers":{"Return-Path":"<linux-ext4-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=linux-ext4-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xnVSq6qXPz9t3Z\n\tfor <patchwork-incoming@ozlabs.org>;\n\tThu,  7 Sep 2017 03:10:03 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S932217AbdIFRJv (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tWed, 6 Sep 2017 13:09:51 -0400","from mga01.intel.com ([192.55.52.88]:47037 \"EHLO mga01.intel.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1755638AbdIFRJt (ORCPT <rfc822;linux-ext4@vger.kernel.org>);\n\tWed, 6 Sep 2017 13:09:49 -0400","from orsmga005.jf.intel.com ([10.7.209.41])\n\tby fmsmga101.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;\n\t06 Sep 2017 10:09:49 -0700","from theros.lm.intel.com (HELO linux.intel.com) ([10.232.112.77])\n\tby orsmga005.jf.intel.com with ESMTP; 06 Sep 2017 10:09:47 -0700"],"X-ExtLoop1":"1","X-IronPort-AV":"E=Sophos;i=\"5.42,354,1500966000\"; d=\"scan'208\";a=\"146185868\"","Date":"Wed, 6 Sep 2017 11:09:46 -0600","From":"Ross Zwisler <ross.zwisler@linux.intel.com>","To":"Jan Kara <jack@suse.cz>","Cc":"Ross Zwisler <ross.zwisler@linux.intel.com>,\n\tAndrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org,\n\t\"Darrick J. Wong\" <darrick.wong@oracle.com>,\n\tTheodore Ts'o <tytso@mit.edu>, Andreas Dilger <adilger.kernel@dilger.ca>,\n\tChristoph Hellwig <hch@lst.de>, Dan Williams <dan.j.williams@intel.com>,\n\tDave Chinner <david@fromorbit.com>, linux-ext4@vger.kernel.org,\n\tlinux-nvdimm@lists.01.org, linux-xfs@vger.kernel.org,\n\tstable@vger.kernel.org","Subject":"Re: [PATCH 6/9] ext4: safely transition S_DAX on journaling changes","Message-ID":"<20170906170946.GC17663@linux.intel.com>","References":"<20170905223541.20594-1-ross.zwisler@linux.intel.com>\n\t<20170905223541.20594-7-ross.zwisler@linux.intel.com>\n\t<20170906094700.GC27916@quack2.suse.cz>","MIME-Version":"1.0","Content-Type":"text/plain; charset=us-ascii","Content-Disposition":"inline","In-Reply-To":"<20170906094700.GC27916@quack2.suse.cz>","User-Agent":"Mutt/1.8.3 (2017-05-23)","Sender":"linux-ext4-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<linux-ext4.vger.kernel.org>","X-Mailing-List":"linux-ext4@vger.kernel.org"}}]