diff mbox

[RFC] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Message ID 20110422155839.3295e8e8.toshi.okajima@jp.fujitsu.com
State New, archived
Headers show

Commit Message

Toshiyuki Okajima April 22, 2011, 6:58 a.m. UTC
Hi,

On Tue, 19 Apr 2011 18:43:16 +0900
Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:
> Hi,
> 
> (2011/04/18 19:51), Jan Kara wrote:
> > On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
> >>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>>    For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >>>>> get modified to block while minor-faulting the page on frozen fs because
> >>>>> when blocks are already allocated we may skip starting a transaction and so
> >>>>> we could possibly modify the filesystem.
> >>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> >>>>
> >>>> (minor-pagefault)
> >>>> ->   do_wp_page()
> >>>>     ->   page_mkwrite(= ext4_mkwrite())
> >>>>        =>   BLOCK!
> >>>>
> >>>> (major-pagefault)
> >>>> ->   do_liner_fault()
> >>>>     ->   page_mkwrite(= ext4_mkwrite())
> >>>>        =>   BLOCK!
> >>>>
> >>>>>
> >>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>>>>> The original problem happens after the fsfreeze operation is done.
> >>>>>>>> I understand the normal write operation (not mmap) can be blocked while
> >>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
> >>>>>>>> while fsfreezing.
> >>>>>>>    Technically speaking, we block all the transaction starts which means we
> >>>>>>> end up blocking all the writes from going to disk. But that does not mean
> >>>>>>> we block all the writes from going to in-memory cache - as you properly
> >>>>>>> note the mmap case is one of such exceptions.
> >>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
> >>>>>> write to disk while fsfreezing because this deadlock problem happens after
> >>>>>> fsfreeze operation is done...
> >>>>>    I'm sorry I don't understand now - are you speaking about the case above
> >>>>> when writepage() does not wait for filesystem being frozen or something
> >>>>> else?
> >>>> Sorry, I didn't understand around the page fault path.
> >>>> So, I had read the kernel source code around it, then I maybe understand...
> >>>>
> >>>> I worry whether we can update the file data in mmap case while fsfreezing.
> >>>> Of course, I understand that we can write to in-memory cache, and it is not a
> >>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
> >>>> So, I summarize the cases whether we can write to disk or not.
> >>>>
> >>>> --------------------------------------------------------------------------
> >>>> Cases (Whether we can write the data mmapped to the file on the disk
> >>>> while fsfreezing)
> >>>>
> >>>> [1] One of the page which has been mmapped is not bound. And
> >>>>   the page is not allocated yet. (major fault?)
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     (2) a page fault occurs (do_page_fault)
> >>>>     (3) __do_falut is called.
> >>>>     (4) ext4_page_mkwrite is called
> >>>>     (5) ext4_write_begin is called
> >>>>     (6) ext4_journal_start_sb       =>   We can STOP!
> >>>>
> >>>> [2] One of the page which has been mmapped is not bound. But
> >>>>   the page is already allocated, and the buffer_heads of the page
> >>>>   are not mapped (BH_Mapped).  (minor fault?)
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     (2) a page fault occurs (do_page_fault)
> >>>>     (3) do_wp_page is called.
> >>>>     (4) ext4_page_mkwrite is called
> >>>>     (5) ext4_write_begin is called
> >>>>     (6) ext4_journal_start_sb       =>   We can STOP!
> >>>>
> >>>> [3] One of the page which has been mmapped is not bound. But
> >>>>   the page is already allocated, and the buffer_heads of the page
> >>>>   are mapped (BH_Mapped).  (minor fault?)
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     (2) a page fault occurs (do_page_fault)
> >>>>     (3) do_wp_page is called.
> >>>>     (4) ext4_page_mkwrite is called
> >>>>     * Cannot block the dirty page to be written because all bh is mapped.
> >>>>     (5) user munmaps the page (munmap)
> >>>>     (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>>>     (7) writeback thread writes the page (struct page) to disk
> >>>>                                             =>   We cannot STOP!
> >>>>
> >>>> [4] One of the page which has been mmapped is bound. And
> >>>>   the page is already allocated.
> >>>>
> >>>>     (1) user dirtys a page
> >>>>     ( ) no page fault occurs
> >>>>     (2) user munmaps the page (munmap)
> >>>>     (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>>>     (4) writeback thread writes the page (struct page) to disk
> >>>>                                             =>   We cannot STOP!
> >>>> --------------------------------------------------------------------------
> >>>>
> >>>> So, we can block the cases [1], [2].
> >>>> But I think we cannot block the cases [3], [4] now.
> >>>> If fixing the page_mkwrite, we can also block the case [3].
> >>>> But the case [4] is not blocked because no page fault occurs
> >>>> when we dirty the mmapped page.
> >>>>
> >>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
> >>>> I think we must modify the writeback thread to fix the case [4].
> >>>    The trick here is that when we write a page to disk, we write-protect
> >>> the page (you seem to call this that "the page is bound", I'm not sure why).
> >> Hm, I want to understand how to write-protect the page under fsfreezing.
> >    Look at what page_mkclean() called from clear_page_dirty_for_io() does...
> Thanks. I'll read that.
> 
> >
> >> But, anyway, I understand we don't need to consider the case [4].
> >    Yes.
> >
> >>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> >>> modify a page after we finish writeback while freezing the filesystem.
> >>> So principially all we need to do is just wait in ext4_page_mkwrite().
> >> OK. I understand.
> >> Are there any concrete ideas to fix this?
> >> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> >    Yes.
> >
> >> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
> >    Sadly I don't see a simple way to fix this issue for all filesystems at
> > once. Implementing proper wait in block_page_mkwrite() should fix the issue
> > for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
> > separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
> > have patches for this already for some time but I have to get to properly
> > testing them in more exotic conditions like 64k pages...
> OK. I understand the current status of your works to fix the problem which
> can be written with some data at mmap path while fsfreezing.
I have confirmed that the following patch works fine while my or
Mizuma-san's reproducer is running. Therefore,
 we can block to write the data, which is mmapped to a file, into a disk
by a page-fault while fsfreezing. 

I think this patch fixes the following two problems:
- A deadlock occurs between ext4_da_writepages() (called from
writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
- We can also write the data, which is mmapped to a file,
  into a disk while fsfreezing (ext3/ext4).
                                       (reported by me)

Please examine this patch.

Thanks,
Toshiyuki Okajima
---
 fs/ext3/file.c          |   19 ++++++++++++-
 fs/ext3/inode.c         |   71 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/inode.c         |    4 ++-
 include/linux/ext3_fs.h |    1 +
 4 files changed, 93 insertions(+), 2 deletions(-)

Comments

Peter M. Petrakis April 22, 2011, 9:26 p.m. UTC | #1
Hi All,

On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
> Hi,
> 
> On Tue, 19 Apr 2011 18:43:16 +0900
> Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> wrote:
>> Hi,
>>
>> (2011/04/18 19:51), Jan Kara wrote:
>>> On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
>>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>>>    For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>>>>>> get modified to block while minor-faulting the page on frozen fs because
>>>>>>> when blocks are already allocated we may skip starting a transaction and so
>>>>>>> we could possibly modify the filesystem.
>>>>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>>>>>
>>>>>> (minor-pagefault)
>>>>>> ->   do_wp_page()
>>>>>>     ->   page_mkwrite(= ext4_mkwrite())
>>>>>>        =>   BLOCK!
>>>>>>
>>>>>> (major-pagefault)
>>>>>> ->   do_liner_fault()
>>>>>>     ->   page_mkwrite(= ext4_mkwrite())
>>>>>>        =>   BLOCK!
>>>>>>
>>>>>>>
>>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>>>>>> while fsfreezing.
>>>>>>>>>    Technically speaking, we block all the transaction starts which means we
>>>>>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>>>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>>>>>> fsfreeze operation is done...
>>>>>>>    I'm sorry I don't understand now - are you speaking about the case above
>>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>>> else?
>>>>>> Sorry, I didn't understand around the page fault path.
>>>>>> So, I had read the kernel source code around it, then I maybe understand...
>>>>>>
>>>>>> I worry whether we can update the file data in mmap case while fsfreezing.
>>>>>> Of course, I understand that we can write to in-memory cache, and it is not a
>>>>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>>> while fsfreezing)
>>>>>>
>>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>>>   the page is not allocated yet. (major fault?)
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     (2) a page fault occurs (do_page_fault)
>>>>>>     (3) __do_falut is called.
>>>>>>     (4) ext4_page_mkwrite is called
>>>>>>     (5) ext4_write_begin is called
>>>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>>>
>>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>>>   the page is already allocated, and the buffer_heads of the page
>>>>>>   are not mapped (BH_Mapped).  (minor fault?)
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     (2) a page fault occurs (do_page_fault)
>>>>>>     (3) do_wp_page is called.
>>>>>>     (4) ext4_page_mkwrite is called
>>>>>>     (5) ext4_write_begin is called
>>>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>>>
>>>>>> [3] One of the page which has been mmapped is not bound. But
>>>>>>   the page is already allocated, and the buffer_heads of the page
>>>>>>   are mapped (BH_Mapped).  (minor fault?)
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     (2) a page fault occurs (do_page_fault)
>>>>>>     (3) do_wp_page is called.
>>>>>>     (4) ext4_page_mkwrite is called
>>>>>>     * Cannot block the dirty page to be written because all bh is mapped.
>>>>>>     (5) user munmaps the page (munmap)
>>>>>>     (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>>>     (7) writeback thread writes the page (struct page) to disk
>>>>>>                                             =>   We cannot STOP!
>>>>>>
>>>>>> [4] One of the page which has been mmapped is bound. And
>>>>>>   the page is already allocated.
>>>>>>
>>>>>>     (1) user dirtys a page
>>>>>>     ( ) no page fault occurs
>>>>>>     (2) user munmaps the page (munmap)
>>>>>>     (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>>>     (4) writeback thread writes the page (struct page) to disk
>>>>>>                                             =>   We cannot STOP!
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> So, we can block the cases [1], [2].
>>>>>> But I think we cannot block the cases [3], [4] now.
>>>>>> If fixing the page_mkwrite, we can also block the case [3].
>>>>>> But the case [4] is not blocked because no page fault occurs
>>>>>> when we dirty the mmapped page.
>>>>>>
>>>>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>>>>> I think we must modify the writeback thread to fix the case [4].
>>>>>    The trick here is that when we write a page to disk, we write-protect
>>>>> the page (you seem to call this that "the page is bound", I'm not sure why).
>>>> Hm, I want to understand how to write-protect the page under fsfreezing.
>>>    Look at what page_mkclean() called from clear_page_dirty_for_io() does...
>> Thanks. I'll read that.
>>
>>>
>>>> But, anyway, I understand we don't need to consider the case [4].
>>>    Yes.
>>>
>>>>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>>>>> modify a page after we finish writeback while freezing the filesystem.
>>>>> So principially all we need to do is just wait in ext4_page_mkwrite().
>>>> OK. I understand.
>>>> Are there any concrete ideas to fix this?
>>>> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
>>>    Yes.
>>>
>>>> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
>>>    Sadly I don't see a simple way to fix this issue for all filesystems at
>>> once. Implementing proper wait in block_page_mkwrite() should fix the issue
>>> for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
>>> separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
>>> have patches for this already for some time but I have to get to properly
>>> testing them in more exotic conditions like 64k pages...
>> OK. I understand the current status of your works to fix the problem which
>> can be written with some data at mmap path while fsfreezing.
> I have confirmed that the following patch works fine while my or
> Mizuma-san's reproducer is running. Therefore,
>  we can block to write the data, which is mmapped to a file, into a disk
> by a page-fault while fsfreezing. 
> 
> I think this patch fixes the following two problems:
> - A deadlock occurs between ext4_da_writepages() (called from
> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> - We can also write the data, which is mmapped to a file,
>   into a disk while fsfreezing (ext3/ext4).
>                                        (reported by me)
> 
> Please examine this patch.

We've recently identified the same root cause in 2.6.32 though the hit rate
is much much higher. The configuration is a SAN ALUA Active/Standby using
multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
when a path comes back into service, as a result of a kpartx invocation on
behalf of this udev rule.

/lib/udev/rules.d/95-kpartx.rules

# Create dm tables for partitions
ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
        RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"


Below are the logs of the current incarntion of the fault with your current patch against 2.6.38.
Still working to obtain a viable crashdump.

[ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)

[ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)

[ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)

[ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)

[ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)

[ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)

[ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)

[ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)

[ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA

[ 1898.578635] device-mapper: multipath: Failing path 8:32.

[ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.

[ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.361891] kjournald       D ffff88063acb9a90     0   595      2 0x00000000

[ 2041.369891]  ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000

[ 2041.378416]  0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8

[ 2041.386954]  ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0

[ 2041.395561] Call Trace:

[ 2041.398358]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50

[ 2041.404342]  [<ffffffff815d3120>] io_schedule+0x70/0xc0

[ 2041.410227]  [<ffffffff811923c5>] sync_buffer+0x45/0x50

[ 2041.416179]  [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90

[ 2041.422258]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50

[ 2041.428275]  [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90

[ 2041.435324]  [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40

[ 2041.441958]  [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30

[ 2041.448333]  [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0

[ 2041.455873]  [<ffffffff81038d09>] ? default_spin_lock_flags+0x9/0x10

[ 2041.463020]  [<ffffffff8107443c>] ? lock_timer_base+0x3c/0x70

[ 2041.469514]  [<ffffffff81074e33>] ? try_to_del_timer_sync+0x83/0xe0

[ 2041.476563]  [<ffffffff8123df7d>] kjournald+0xed/0x250

[ 2041.482349]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2041.489624]  [<ffffffff8123de90>] ? kjournald+0x0/0x250

[ 2041.495504]  [<ffffffff810865e6>] kthread+0x96/0xa0

[ 2041.501003]  [<ffffffff8100ce64>] kernel_thread_helper+0x4/0x10

[ 2041.507667]  [<ffffffff81086550>] ? kthread+0x0/0xa0

[ 2041.513301]  [<ffffffff8100ce60>] ? kernel_thread_helper+0x0/0x10

[ 2041.520247] INFO: task rsyslogd:1854 blocked for more than 120 seconds.

[ 2041.527677] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.536499] rsyslogd        D ffff88063c513170     0  1854      1 0x00000000

[ 2041.544533]  ffff88063d0e3cd8 0000000000000082 ffff88063c479180 0000000000000000

[ 2041.553108]  0000000000013cc0 ffff88063c512dc0 ffff88063c513170 ffff88063d0e3fd8

[ 2041.561691]  ffff88063c513178 0000000000013cc0 ffff88063d0e2010 0000000000013cc0

[ 2041.570323] Call Trace:

[ 2041.573108]  [<ffffffff8110c78d>] __generic_file_aio_write+0xbd/0x470

[ 2041.580447]  [<ffffffff8108a82d>] ? hrtimer_try_to_cancel+0x3d/0xd0

[ 2041.587496]  [<ffffffff81097e3d>] ? futex_wait_queue_me+0xcd/0x110

[ 2041.594489]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2041.601833]  [<ffffffff8110cba2>] generic_file_aio_write+0x62/0xd0

[ 2041.608831]  [<ffffffff81163a9a>] do_sync_write+0xda/0x120

[ 2041.615165]  [<ffffffff812de756>] ? rb_erase+0xd6/0x160

[ 2041.621050]  [<ffffffff812ac918>] ? apparmor_file_permission+0x18/0x20

[ 2041.628395]  [<ffffffff81279b23>] ? security_file_permission+0x23/0x90

[ 2041.635827]  [<ffffffff81164018>] vfs_write+0xc8/0x190

[ 2041.641649]  [<ffffffff811641d1>] sys_write+0x51/0x90

[ 2041.647337]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.654091] INFO: task multipathd:1337 blocked for more than 120 seconds.

[ 2041.661750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.670669] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000

[ 2041.678746]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000

[ 2041.687219]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8

[ 2041.695818]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0

[ 2041.704369] Call Trace:

[ 2041.707128]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300

[ 2041.713679]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2041.719846]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410

[ 2041.726301]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180

[ 2041.732685]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20

[ 2041.739138]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20

[ 2041.746079]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20

[ 2041.752716]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0

[ 2041.759853]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240

[ 2041.766503]  [<ffffffff8107e060>] __request_module+0x190/0x210

[ 2041.773054]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40

[ 2041.778636]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240

[ 2041.785121]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0

[ 2041.791312]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140

[ 2041.797671]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250

[ 2041.804413]  [<ffffffff8149de3a>] table_load+0xca/0x2f0

[ 2041.810317]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0

[ 2041.816316]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2041.822184]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2041.828188]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2041.834250]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170

[ 2041.840219]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2041.845898]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.852639] INFO: task iozone:1871 blocked for more than 120 seconds.

[ 2041.859921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.868760] iozone          D ffff880c3bc21a90     0  1871   1869 0x00000000

[ 2041.876728]  ffff880c3e743e20 0000000000000086 0000000000000001 ffff880c00000000

[ 2041.885177]  0000000000013cc0 ffff880c3bc216e0 ffff880c3bc21a90 ffff880c3e743fd8

[ 2041.893647]  ffff880c3bc21a98 0000000000013cc0 ffff880c3e742010 0000000000013cc0

[ 2041.902112] Call Trace:

[ 2041.906302]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2041.912494]  [<ffffffff815d4ddd>] rwsem_down_failed_common+0xcd/0x170

[ 2041.919718]  [<ffffffff8118f480>] ? sync_one_sb+0x0/0x30

[ 2041.925719]  [<ffffffff815d4eb5>] rwsem_down_read_failed+0x15/0x17

[ 2041.932690]  [<ffffffff812e41a4>] call_rwsem_down_read_failed+0x14/0x30

[ 2041.940116]  [<ffffffff815d4207>] ? down_read+0x17/0x20

[ 2041.945990]  [<ffffffff811665e1>] iterate_supers+0x71/0xf0

[ 2041.952149]  [<ffffffff8118f4df>] sys_sync+0x2f/0x70

[ 2041.957763]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.

[ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2041.980626] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000

[ 2041.988607]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000

[ 2041.997056]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8

[ 2042.005496]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0

[ 2042.013939] Call Trace:

[ 2042.016702]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150

[ 2042.023089]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2042.030321]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2042.036584]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70

[ 2042.042552]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330

[ 2042.049133]  [<ffffffff81115391>] ? do_writepages+0x21/0x40

[ 2042.055423]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60

[ 2042.062944]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90

[ 2042.069430]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70

[ 2042.075690]  [<ffffffff81166a85>] freeze_super+0x55/0x100

[ 2042.081754]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0

[ 2042.087625]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0

[ 2042.093495]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0

[ 2042.099948]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.105916]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0

[ 2042.111784]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.117753]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0

[ 2042.123621]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2042.129591]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2042.135493]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2042.141770]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2042.147739]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2042.153801]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2042.159478]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2161.971321] INFO: task rsyslogd:1854 blocked for more than 120 seconds.

[ 2161.978798] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2161.987656] rsyslogd        D ffff88063c513170     0  1854      1 0x00000000

[ 2161.995718]  ffff88063d0e3cd8 0000000000000082 ffff88063c479180 0000000000000000

[ 2162.004340]  0000000000013cc0 ffff88063c512dc0 ffff88063c513170 ffff88063d0e3fd8

[ 2162.012932]  ffff88063c513178 0000000000013cc0 ffff88063d0e2010 0000000000013cc0

[ 2162.021481] Call Trace:

[ 2162.024290]  [<ffffffff8110c78d>] __generic_file_aio_write+0xbd/0x470

[ 2162.031627]  [<ffffffff8108a82d>] ? hrtimer_try_to_cancel+0x3d/0xd0

[ 2162.038711]  [<ffffffff81097e3d>] ? futex_wait_queue_me+0xcd/0x110

[ 2162.045662]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2162.053007]  [<ffffffff8110cba2>] generic_file_aio_write+0x62/0xd0

[ 2162.059962]  [<ffffffff81163a9a>] do_sync_write+0xda/0x120

[ 2162.066165]  [<ffffffff812de756>] ? rb_erase+0xd6/0x160

[ 2162.072048]  [<ffffffff812ac918>] ? apparmor_file_permission+0x18/0x20

[ 2162.079387]  [<ffffffff81279b23>] ? security_file_permission+0x23/0x90

[ 2162.086761]  [<ffffffff81164018>] vfs_write+0xc8/0x190

[ 2162.092552]  [<ffffffff811641d1>] sys_write+0x51/0x90

[ 2162.098247]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.105042] INFO: task multipathd:1337 blocked for more than 120 seconds.

[ 2162.112667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.121487] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000

[ 2162.129517]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000

[ 2162.138112]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8

[ 2162.146688]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0

[ 2162.155253] Call Trace:

[ 2162.158073]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300

[ 2162.164639]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2162.170886]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410

[ 2162.177389]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180

[ 2162.183852]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20

[ 2162.190317]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20

[ 2162.197304]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20

[ 2162.203968]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0

[ 2162.211111]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240

[ 2162.217807]  [<ffffffff8107e060>] __request_module+0x190/0x210

[ 2162.224461]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40

[ 2162.230054]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240

[ 2162.236503]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0

[ 2162.242673]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140

[ 2162.249079]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250

[ 2162.255840]  [<ffffffff8149de3a>] table_load+0xca/0x2f0

[ 2162.261719]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0

[ 2162.267701]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2162.273621]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2162.279592]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2162.285710]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170

[ 2162.291694]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2162.297383]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.304169] INFO: task iozone:1871 blocked for more than 120 seconds.

[ 2162.311407] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.320229] iozone          D ffff880c3bc21a90     0  1871   1869 0x00000000

[ 2162.328317]  ffff880c3e743e20 0000000000000086 0000000000000001 ffff880c00000000

[ 2162.336901]  0000000000013cc0 ffff880c3bc216e0 ffff880c3bc21a90 ffff880c3e743fd8

[ 2162.345415]  ffff880c3bc21a98 0000000000013cc0 ffff880c3e742010 0000000000013cc0

[ 2162.353887] Call Trace:

[ 2162.356650]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90

[ 2162.362815]  [<ffffffff815d4ddd>] rwsem_down_failed_common+0xcd/0x170

[ 2162.370042]  [<ffffffff8118f480>] ? sync_one_sb+0x0/0x30

[ 2162.376121]  [<ffffffff815d4eb5>] rwsem_down_read_failed+0x15/0x17

[ 2162.383075]  [<ffffffff812e41a4>] call_rwsem_down_read_failed+0x14/0x30

[ 2162.390575]  [<ffffffff815d4207>] ? down_read+0x17/0x20

[ 2162.396501]  [<ffffffff811665e1>] iterate_supers+0x71/0xf0

[ 2162.402768]  [<ffffffff8118f4df>] sys_sync+0x2f/0x70

[ 2162.408360]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2162.415159] INFO: task kpartx:1897 blocked for more than 120 seconds.

[ 2162.422493] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[ 2162.431405] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000

[ 2162.439440]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000

[ 2162.448021]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8

[ 2162.456468]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0

[ 2162.464962] Call Trace:

[ 2162.467724]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150

[ 2162.474088]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40

[ 2162.481319]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2162.487577]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70

[ 2162.493548]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330

[ 2162.500107]  [<ffffffff81115391>] ? do_writepages+0x21/0x40

[ 2162.506415]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60

[ 2162.513947]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90

[ 2162.520514]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70

[ 2162.526783]  [<ffffffff81166a85>] freeze_super+0x55/0x100

[ 2162.532896]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0

[ 2162.538819]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0

[ 2162.544705]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0

[ 2162.551174]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.557160]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0

[ 2162.563082]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.569102]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0

[ 2162.574987]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0

[ 2162.581068]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240

[ 2162.586954]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20

[ 2162.593217]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20

[ 2162.599190]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0

[ 2162.605298]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0

[ 2162.610990]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b

[ 2191.336354] Uhhuh. NMI received for unknown reason 21 on CPU 0.

[ 2191.343064] Do you have a strange power saving mode enabled?

[ 2191.349476] Kernel panic - not syncing: NMI: Not continuing

[ 2191.355753] Pid: 0, comm: swapper Not tainted 2.6.38-8-server #43

[ 2191.362593] Call Trace:

[ 2191.365380]  <NMI>  [<ffffffff815d2083>] ? panic+0x91/0x19e

[ 2191.371779]  [<ffffffff815d21f8>] ? printk+0x68/0x70

[ 2191.377381]  [<ffffffff815d6333>] ? default_do_nmi+0x1f3/0x200

[ 2191.383929]  [<ffffffff815d63c0>] ? do_nmi+0x80/0x90

[ 2191.389526]  [<ffffffff815d5b50>] ? nmi+0x20/0x30

[ 2191.394816]  [<ffffffff81332d74>] ? intel_idle+0x94/0x120

[ 2191.400897]  <<EOE>>  [<ffffffff814b3472>] ? cpuidle_idle_call+0xb2/0x1b0

[ 2191.408606]  [<ffffffff8100b067>] ? cpu_idle+0xb7/0x110

[ 2191.414497]  [<ffffffff815b7682>] ? rest_init+0x72/0x80

[ 2191.420367]  [<ffffffff81ae2c95>] ? start_kernel+0x374/0x37b

[ 2191.426780]  [<ffffffff81ae2346>] ? x86_64_start_reservations+0x131/0x135

[ 2191.434457]  [<ffffffff81ae244d>] ? x86_64_start_kernel+0x103/0x112


Thanks.

Peter




> 
> Thanks,
> Toshiyuki Okajima
> ---
>  fs/ext3/file.c          |   19 ++++++++++++-
>  fs/ext3/inode.c         |   71 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext4/inode.c         |    4 ++-
>  include/linux/ext3_fs.h |    1 +
>  4 files changed, 93 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index f55df0e..6d376ef 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
>  	return 0;
>  }
>  
> +static const struct vm_operations_struct ext3_file_vm_ops = {
> +	.fault          = filemap_fault,
> +	.page_mkwrite   = ext3_page_mkwrite,
> +};
> +
> +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct address_space *mapping = file->f_mapping;
> +
> +	if (!mapping->a_ops->readpage)
> +		return -ENOEXEC;
> +	file_accessed(file);
> +	vma->vm_ops = &ext3_file_vm_ops;
> +	vma->vm_flags |= VM_CAN_NONLINEAR;
> +	return 0;
> +}
> +
>  const struct file_operations ext3_file_operations = {
>  	.llseek		= generic_file_llseek,
>  	.read		= do_sync_read,
> @@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
>  #ifdef CONFIG_COMPAT
>  	.compat_ioctl	= ext3_compat_ioctl,
>  #endif
> -	.mmap		= generic_file_mmap,
> +	.mmap		= ext3_file_mmap,
>  	.open		= dquot_file_open,
>  	.release	= ext3_release_file,
>  	.fsync		= ext3_sync_file,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 68b2e43..66c31dd 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
>  
>  	return err;
>  }
> +
> +int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct page *page = vmf->page;
> +	loff_t size;
> +	unsigned long len;
> +	int ret = -EINVAL;
> +	void *fsdata;
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	struct address_space *mapping = inode->i_mapping;
> +
> +	/*
> +	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> +	 * get i_mutex because we are already holding mmap_sem.
> +	 */
> +	down_read(&inode->i_alloc_sem);
> +	size = i_size_read(inode);
> +	if (page->mapping != mapping || size <= page_offset(page)
> +	   || !PageUptodate(page)) {
> +		/* page got truncated from under us? */
> +		goto out_unlock;
> +	}
> +	ret = 0;
> +	if (PageMappedToDisk(page))
> +		goto out_frozen;
> +
> +	if (page->index == size >> PAGE_CACHE_SHIFT)
> +		len = size & ~PAGE_CACHE_MASK;
> +	else
> +		len = PAGE_CACHE_SIZE;
> +
> +	lock_page(page);
> +	/*
> +	 * return if we have all the buffers mapped. This avoid
> +	 * the need to call write_begin/write_end which does a
> +	 * journal_start/journal_stop which can block and take
> +	 * long time
> +	 */
> +	if (page_has_buffers(page)) {
> +		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> +					buffer_unmapped)) {
> +			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> +			goto out_unlock;
> +		}
> +	}
> +	unlock_page(page);
> +	/*
> +	 * OK, we need to fill the hole... Do write_begin write_end
> +	 * to do block allocation/reservation.We are not holding
> +	 * inode.i__mutex here. That allow * parallel write_begin,
> +	 * write_end call. lock_page prevent this from happening
> +	 * on the same page though
> +	 */
> +	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
> +			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
> +			len, len, page, fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = 0;
> +out_unlock:
> +	if (ret)
> +		ret = VM_FAULT_SIGBUS;
> +	up_read(&inode->i_alloc_sem);
> +	return ret;
> +}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..44979ae 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	ret = 0;
>  	if (PageMappedToDisk(page))
> -		goto out_unlock;
> +		goto out_frozen;
>  
>  	if (page->index == size >> PAGE_CACHE_SHIFT)
>  		len = size & ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
>  			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
>  			goto out_unlock;
>  		}
>  	}
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 85c1d30..a0e39ca 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
>  extern void ext3_set_aops(struct inode *inode);
>  extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		       u64 start, u64 len);
> +extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  /* ioctl.c */
>  extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 22, 2011, 9:40 p.m. UTC | #2
Hello,

On Fri 22-04-11 17:26:07, Peter M. Petrakis wrote:
> On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
> > On Tue, 19 Apr 2011 18:43:16 +0900
> > I have confirmed that the following patch works fine while my or
> > Mizuma-san's reproducer is running. Therefore,
> >  we can block to write the data, which is mmapped to a file, into a disk
> > by a page-fault while fsfreezing. 
> > 
> > I think this patch fixes the following two problems:
> > - A deadlock occurs between ext4_da_writepages() (called from
> > writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> > - We can also write the data, which is mmapped to a file,
> >   into a disk while fsfreezing (ext3/ext4).
> >                                        (reported by me)
> > 
> > Please examine this patch.
> 
> We've recently identified the same root cause in 2.6.32 though the hit rate
> is much much higher. The configuration is a SAN ALUA Active/Standby using
> multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
> when a path comes back into service, as a result of a kpartx invocation on
> behalf of this udev rule.
> 
> /lib/udev/rules.d/95-kpartx.rules
> 
> # Create dm tables for partitions
> ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
>         RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"
  Hmm, I don't think this is the same problem... See:

> [ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)
> [ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)
> [ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)
> [ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)
> [ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)
> [ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)
> [ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)
> [ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)
> [ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA
> [ 1898.578635] device-mapper: multipath: Failing path 8:32.
> [ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.
> [ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2041.361891] kjournald       D ffff88063acb9a90     0   595      2 0x00000000
> [ 2041.369891]  ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000
> [ 2041.378416]  0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8
> [ 2041.386954]  ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0
> 
> [ 2041.395561] Call Trace:
> [ 2041.398358]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
> [ 2041.404342]  [<ffffffff815d3120>] io_schedule+0x70/0xc0
> [ 2041.410227]  [<ffffffff811923c5>] sync_buffer+0x45/0x50
> [ 2041.416179]  [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90
> [ 2041.422258]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
> [ 2041.428275]  [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90
> [ 2041.435324]  [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40
> [ 2041.441958]  [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30
> [ 2041.448333]  [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0
  So kjournald is committing a transaction and waiting for IO to complete.
Which maybe never happens because of multipath being in transition? That
would be a bug...

> [ 2041.670669] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000
> [ 2041.678746]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000
> [ 2041.687219]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8
> [ 2041.695818]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0
> [ 2041.704369] Call Trace:
> [ 2041.707128]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300
> [ 2041.713679]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90
> [ 2041.719846]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410
> [ 2041.726301]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180
> [ 2041.732685]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20
> [ 2041.739138]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
> [ 2041.746079]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20
> [ 2041.752716]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0
> [ 2041.759853]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240
> [ 2041.766503]  [<ffffffff8107e060>] __request_module+0x190/0x210
> [ 2041.773054]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40
> [ 2041.778636]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240
> [ 2041.785121]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0
> [ 2041.791312]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140
> [ 2041.797671]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250
> [ 2041.804413]  [<ffffffff8149de3a>] table_load+0xca/0x2f0
> [ 2041.810317]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0
> [ 2041.816316]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
> [ 2041.822184]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
> [ 2041.828188]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
> [ 2041.834250]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170
> [ 2041.840219]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
> [ 2041.845898]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
  multipathd is hung waiting for module to be loaded? How come?

> [ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.
> [ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2041.980626] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000
> [ 2041.988607]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000
> [ 2041.997056]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8
> [ 2042.005496]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0
> [ 2042.013939] Call Trace:
> [ 2042.016702]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150
> [ 2042.023089]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40
> [ 2042.030321]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
> [ 2042.036584]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70
> [ 2042.042552]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330
> [ 2042.049133]  [<ffffffff81115391>] ? do_writepages+0x21/0x40
> [ 2042.055423]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60
> [ 2042.062944]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90
> [ 2042.069430]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70
> [ 2042.075690]  [<ffffffff81166a85>] freeze_super+0x55/0x100
> [ 2042.081754]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0
> [ 2042.087625]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0
> [ 2042.093495]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0
> [ 2042.099948]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.105916]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0
> [ 2042.111784]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.117753]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0
> [ 2042.123621]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
> [ 2042.129591]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
> [ 2042.135493]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
> [ 2042.141770]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
> [ 2042.147739]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
> [ 2042.153801]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
> [ 2042.159478]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
  kpartx is waiting for kjournald to finish transaction commit and it is
holding s_umount but that doesn't really seem to be a problem...

So as I say, find a reason why kjournald is not able to finish committing a
transaction and you should solve this riddle ;).

								Honza
Jan Kara April 22, 2011, 10:10 p.m. UTC | #3
On Fri 22-04-11 15:58:39, Toshiyuki Okajima wrote:
> I have confirmed that the following patch works fine while my or
> Mizuma-san's reproducer is running. Therefore,
>  we can block to write the data, which is mmapped to a file, into a disk
> by a page-fault while fsfreezing. 
> 
> I think this patch fixes the following two problems:
> - A deadlock occurs between ext4_da_writepages() (called from
> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
> - We can also write the data, which is mmapped to a file,
>   into a disk while fsfreezing (ext3/ext4).
>                                        (reported by me)
> 
> Please examine this patch.
  Thanks for the patch. The ext3 part is not as easy as this. You cannot
really get i_alloc_sem in ext3_page_mkwrite() because mmap_sem is already
held by page fault code and i_alloc_sem should be acquired before it (yes I
know, ext4 already has this bug which should be fixed when I get to it).
Also you'll find that performance of random writers via mmap (which is
relatively common) is going to be rather bad with this patch (because the
file will be heavily fragmented). We have to be more clever which is
exactly why it's taking me so long with my patch :) But tests are already
running so if everything goes fine, I should have patches to submit next
week.

The ext4 part looks correct. I'd just also like to have some comments about
how freeze handling is done because it's kind of subtle.

								Honza

> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index f55df0e..6d376ef 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
>  	return 0;
>  }
>  
> +static const struct vm_operations_struct ext3_file_vm_ops = {
> +	.fault          = filemap_fault,
> +	.page_mkwrite   = ext3_page_mkwrite,
> +};
> +
> +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct address_space *mapping = file->f_mapping;
> +
> +	if (!mapping->a_ops->readpage)
> +		return -ENOEXEC;
> +	file_accessed(file);
> +	vma->vm_ops = &ext3_file_vm_ops;
> +	vma->vm_flags |= VM_CAN_NONLINEAR;
> +	return 0;
> +}
> +
>  const struct file_operations ext3_file_operations = {
>  	.llseek		= generic_file_llseek,
>  	.read		= do_sync_read,
> @@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
>  #ifdef CONFIG_COMPAT
>  	.compat_ioctl	= ext3_compat_ioctl,
>  #endif
> -	.mmap		= generic_file_mmap,
> +	.mmap		= ext3_file_mmap,
>  	.open		= dquot_file_open,
>  	.release	= ext3_release_file,
>  	.fsync		= ext3_sync_file,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 68b2e43..66c31dd 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3496,3 +3496,74 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
>  
>  	return err;
>  }
> +
> +int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct page *page = vmf->page;
> +	loff_t size;
> +	unsigned long len;
> +	int ret = -EINVAL;
> +	void *fsdata;
> +	struct file *file = vma->vm_file;
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	struct address_space *mapping = inode->i_mapping;
> +
> +	/*
> +	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> +	 * get i_mutex because we are already holding mmap_sem.
> +	 */
> +	down_read(&inode->i_alloc_sem);
> +	size = i_size_read(inode);
> +	if (page->mapping != mapping || size <= page_offset(page)
> +	   || !PageUptodate(page)) {
> +		/* page got truncated from under us? */
> +		goto out_unlock;
> +	}
> +	ret = 0;
> +	if (PageMappedToDisk(page))
> +		goto out_frozen;
> +
> +	if (page->index == size >> PAGE_CACHE_SHIFT)
> +		len = size & ~PAGE_CACHE_MASK;
> +	else
> +		len = PAGE_CACHE_SIZE;
> +
> +	lock_page(page);
> +	/*
> +	 * return if we have all the buffers mapped. This avoid
> +	 * the need to call write_begin/write_end which does a
> +	 * journal_start/journal_stop which can block and take
> +	 * long time
> +	 */
> +	if (page_has_buffers(page)) {
> +		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> +					buffer_unmapped)) {
> +			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> +			goto out_unlock;
> +		}
> +	}
> +	unlock_page(page);
> +	/*
> +	 * OK, we need to fill the hole... Do write_begin write_end
> +	 * to do block allocation/reservation.We are not holding
> +	 * inode.i__mutex here. That allow * parallel write_begin,
> +	 * write_end call. lock_page prevent this from happening
> +	 * on the same page though
> +	 */
> +	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
> +			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
> +			len, len, page, fsdata);
> +	if (ret < 0)
> +		goto out_unlock;
> +	ret = 0;
> +out_unlock:
> +	if (ret)
> +		ret = VM_FAULT_SIGBUS;
> +	up_read(&inode->i_alloc_sem);
> +	return ret;
> +}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f2fa5e8..44979ae 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5812,7 +5812,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	}
>  	ret = 0;
>  	if (PageMappedToDisk(page))
> -		goto out_unlock;
> +		goto out_frozen;
>  
>  	if (page->index == size >> PAGE_CACHE_SHIFT)
>  		len = size & ~PAGE_CACHE_MASK;
> @@ -5830,6 +5830,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
>  			unlock_page(page);
> +out_frozen:
> +			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
>  			goto out_unlock;
>  		}
>  	}
> diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
> index 85c1d30..a0e39ca 100644
> --- a/include/linux/ext3_fs.h
> +++ b/include/linux/ext3_fs.h
> @@ -919,6 +919,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
>  extern void ext3_set_aops(struct inode *inode);
>  extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		       u64 start, u64 len);
> +extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>  
>  /* ioctl.c */
>  extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
> -- 
> 1.5.5.6
Peter M. Petrakis April 22, 2011, 10:57 p.m. UTC | #4
On 04/22/2011 05:40 PM, Jan Kara wrote:
>   Hello,
> 
> On Fri 22-04-11 17:26:07, Peter M. Petrakis wrote:
>> On 04/22/2011 02:58 AM, Toshiyuki Okajima wrote:
>>> On Tue, 19 Apr 2011 18:43:16 +0900
>>> I have confirmed that the following patch works fine while my or
>>> Mizuma-san's reproducer is running. Therefore,
>>>  we can block to write the data, which is mmapped to a file, into a disk
>>> by a page-fault while fsfreezing. 
>>>
>>> I think this patch fixes the following two problems:
>>> - A deadlock occurs between ext4_da_writepages() (called from
>>> writeback_inodes_wb) and thaw_super(). (reported by Mizuma-san)
>>> - We can also write the data, which is mmapped to a file,
>>>   into a disk while fsfreezing (ext3/ext4).
>>>                                        (reported by me)
>>>
>>> Please examine this patch.
>>
>> We've recently identified the same root cause in 2.6.32 though the hit rate
>> is much much higher. The configuration is a SAN ALUA Active/Standby using
>> multipath. The s_wait_unfrozen/s_umount deadlock is regularly encountered
>> when a path comes back into service, as a result of a kpartx invocation on
>> behalf of this udev rule.
>>
>> /lib/udev/rules.d/95-kpartx.rules
>>
>> # Create dm tables for partitions
>> ENV{DM_STATE}=="ACTIVE", ENV{DM_UUID}=="mpath-*", \
>>         RUN+="/sbin/dmsetup ls --target multipath --exec '/sbin/kpartx -a -p -part' -j %M -m %m"
>   Hmm, I don't think this is the same problem... See:

Figures :)

 
>> [ 1898.017614] mptsas: ioc0: mptsas_add_fw_event: add (fw_event=0xffff880c3c815200)
>> [ 1898.025995] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814780)
>> [ 1898.034625] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c814b40), event = (0x12)
>> [ 1898.044803] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c814b40)
>> [ 1898.053475] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815c80), event = (0x12)
>> [ 1898.063690] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815c80)
>> [ 1898.072316] mptsas: ioc0: mptsas_firmware_event_work: fw_event=(0xffff880c3c815200), event = (0x0f)
>> [ 1898.082544] mptsas: ioc0: mptsas_free_fw_event: kfree (fw_event=0xffff880c3c815200)
>> [ 1898.571426] sd 0:0:1:0: alua: port group 01 state S supports toluSnA
>> [ 1898.578635] device-mapper: multipath: Failing path 8:32.
>> [ 2041.345645] INFO: task kjournald:595 blocked for more than 120 seconds.
>> [ 2041.353075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 2041.361891] kjournald       D ffff88063acb9a90     0   595      2 0x00000000
>> [ 2041.369891]  ffff88063ace1c30 0000000000000046 ffff88063c282140 ffff880600000000
>> [ 2041.378416]  0000000000013cc0 ffff88063acb96e0 ffff88063acb9a90 ffff88063ace1fd8
>> [ 2041.386954]  ffff88063acb9a98 0000000000013cc0 ffff88063ace0010 0000000000013cc0
>>
>> [ 2041.395561] Call Trace:
>> [ 2041.398358]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
>> [ 2041.404342]  [<ffffffff815d3120>] io_schedule+0x70/0xc0
>> [ 2041.410227]  [<ffffffff811923c5>] sync_buffer+0x45/0x50
>> [ 2041.416179]  [<ffffffff815d378f>] __wait_on_bit+0x5f/0x90
>> [ 2041.422258]  [<ffffffff81192380>] ? sync_buffer+0x0/0x50
>> [ 2041.428275]  [<ffffffff815d3838>] out_of_line_wait_on_bit+0x78/0x90
>> [ 2041.435324]  [<ffffffff81086b90>] ? wake_bit_function+0x0/0x40
>> [ 2041.441958]  [<ffffffff8119237e>] __wait_on_buffer+0x2e/0x30
>> [ 2041.448333]  [<ffffffff8123ab14>] journal_commit_transaction+0x7e4/0xec0
>   So kjournald is committing a transaction and waiting for IO to complete.
> Which maybe never happens because of multipath being in transition? That
> would be a bug...
>

and it would be a new one for us. It's entirely possible the original deadlock
is resolved, and this is new. With only the tracebacks to consult, and general
unfamiliarity with this area, it looked like the same fault to me.
In 2.6.32 it's a dead ringer per the thread parent:

http://permalink.gmane.org/gmane.comp.file-systems.ext4/23171

[Ubuntu 10.04 - 2.6.32 crashdump]

crash-5.0> ps | grep UN
    992      2   7  ffff8802678a8000  UN   0.0       0      0  [flush-251:5]
  17295   2537   2  ffff880267be0000  UN   0.2   47060  17368  iozone
  17314   2477   5  ffff88026a952010  UN   0.2   47060  17364  iozone
  17447   2573   0  ffff880268bd2010  UN   0.2   47060  17340  iozone
  17460      1  13  ffff88026b3c4020  UN   0.0  191564   1992  rsyslogd
  17606  17597  15  ffff880268420000  UN   0.0   10436    808  kpartx
  17738   2268  13  ffff88016908a010  UN   0.0   17756   1616  dhclient-script
  17747   2223  15  ffff88026a950000  UN   0.0  151460   1596  multipathd
  17748   2284   1  ffff88016908c020  UN   0.0   49260    688  sshd
  17749   2284   1  ffff880169088000  UN   0.0   49260    692  sshd
  17750   2284   1  ffff88016a628000  UN   0.0   49260    688  sshd
  17751   2284   0  ffff88026a3cc020  UN   0.0   49260    688  sshd
  17752   2284   0  ffff88026a3ca010  UN   0.0   49260    688  sshd
  17753   2284   0  ffff88026a3c8000  UN   0.0   49260    688  sshd
  17754   2284   0  ffff880268f60000  UN   0.0   49260    692  sshd
  17755   2284   0  ffff880268f62010  UN   0.0   49260    688  sshd
crash-5.0> bt 17606
PID: 17606  TASK: ffff880268420000  CPU: 15  COMMAND: "kpartx"
 #0 [ffff88026aac3b18] schedule at ffffffff8158bcbd
 #1 [ffff88026aac3bd0] rwsem_down_failed_common at ffffffff8158df2d
 #2 [ffff88026aac3c30] rwsem_down_write_failed at ffffffff8158e0b3
 #3 [ffff88026aac3c70] call_rwsem_down_write_failed at ffffffff812d9903
 #4 [ffff88026aac3ce0] thaw_bdev at ffffffff81186d5a
 #5 [ffff88026aac3d40] unlock_fs at ffffffff8145e46d
 #6 [ffff88026aac3d60] dm_resume at ffffffff8145fb38
 #7 [ffff88026aac3db0] do_resume at ffffffff81465c98
 #8 [ffff88026aac3de0] dev_suspend at ffffffff81465d65
 #9 [ffff88026aac3e20] ctl_ioctl at ffffffff814665f5
#10 [ffff88026aac3e90] dm_ctl_ioctl at ffffffff814666a3
#11 [ffff88026aac3ea0] vfs_ioctl at ffffffff81165e92
#12 [ffff88026aac3ee0] do_vfs_ioctl at ffffffff81166140
#13 [ffff88026aac3f30] sys_ioctl at ffffffff811664b1
#14 [ffff88026aac3f80] system_call_fastpath at ffffffff810131b2
    RIP: 00007fa798b04197  RSP: 00007fff4cf1c6e8  RFLAGS: 00010202
    RAX: 0000000000000010  RBX: ffffffff810131b2  RCX: 0000000000000000
    RDX: 0000000000bcf310  RSI: 00000000c138fd06  RDI: 0000000000000004
    RBP: 0000000000bcf340   R8: 00007fa798dc2528   R9: 00007fff4cf1c640
    R10: 00007fa798dc1dc0  R11: 0000000000000246  R12: 00007fa798dc1dc0
    R13: 0000000000004000  R14: 0000000000bce0f0  R15: 00007fa798dc1dc0
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
crash-5.0> bt 992

PID: 992    TASK: ffff8802678a8000  CPU: 7   COMMAND: "flush-251:5"
 #0 [ffff880267bddb00] schedule at ffffffff8158bcbd
 #1 [ffff880267bddbb8] ext4_force_commit at ffffffff8120b16d
 #2 [ffff880267bddc18] ext4_write_inode at ffffffff811f29e5
 #3 [ffff880267bddc68] writeback_single_inode at ffffffff81178964
 #4 [ffff880267bddcb8] writeback_sb_inodes at ffffffff81178f09
 #5 [ffff880267bddd18] wb_writeback at ffffffff8117995c
 #6 [ffff880267bdddc8] wb_do_writeback at ffffffff81179b6b
 #7 [ffff880267bdde58] bdi_writeback_task at ffffffff81179cc3
 #8 [ffff880267bdde98] bdi_start_fn at ffffffff8111e816
 #9 [ffff880267bddec8] kthread at ffffffff81088a06
#10 [ffff880267bddf48] kernel_thread at ffffffff810142ea

crash-5.0> super_block.s_frozen ffff880268a4e000
  s_frozen = 0x2,

int ext4_force_commit(struct super_block *sb)
{
        journal_t *journal;
        int ret = 0;

        if (sb->s_flags & MS_RDONLY)
                return 0;

        journal = EXT4_SB(sb)->s_journal;
        if (journal) {
                vfs_check_frozen(sb, SB_FREEZE_TRANS); <=== this is where sleep
                ret = ext4_journal_force_commit(journal);
        }

        return ret;
}
 

I have tried the previous versions of the patch, backporting
to 2.6.32 without any success. I thought I would just go for it
this time with the latest.

 
>> [ 2041.670669] multipathd      D ffff88063e3303b0     0  1337      1 0x00000000
>> [ 2041.678746]  ffff88063c0fda18 0000000000000082 0000000000000000 ffff880600000000
>> [ 2041.687219]  0000000000013cc0 ffff88063e330000 ffff88063e3303b0 ffff88063c0fdfd8
>> [ 2041.695818]  ffff88063e3303b8 0000000000013cc0 ffff88063c0fc010 0000000000013cc0
>> [ 2041.704369] Call Trace:
>> [ 2041.707128]  [<ffffffff815d349d>] schedule_timeout+0x21d/0x300
>> [ 2041.713679]  [<ffffffff8104c8ec>] ? resched_task+0x2c/0x90
>> [ 2041.719846]  [<ffffffff8105f763>] ? try_to_wake_up+0xc3/0x410
>> [ 2041.726301]  [<ffffffff815d2436>] wait_for_common+0xd6/0x180
>> [ 2041.732685]  [<ffffffff8105fb05>] ? wake_up_process+0x15/0x20
>> [ 2041.739138]  [<ffffffff8105fab0>] ? default_wake_function+0x0/0x20
>> [ 2041.746079]  [<ffffffff815d25bd>] wait_for_completion+0x1d/0x20
>> [ 2041.752716]  [<ffffffff8107de18>] call_usermodehelper_exec+0xd8/0xe0
>> [ 2041.759853]  [<ffffffff814a3110>] ? parse_hw_handler+0xb0/0x240
>> [ 2041.766503]  [<ffffffff8107e060>] __request_module+0x190/0x210
>> [ 2041.773054]  [<ffffffff812e0c28>] ? sscanf+0x38/0x40
>> [ 2041.778636]  [<ffffffff814a3110>] parse_hw_handler+0xb0/0x240
>> [ 2041.785121]  [<ffffffff814a38c3>] multipath_ctr+0x83/0x1d0
>> [ 2041.791312]  [<ffffffff8149abd5>] ? dm_split_args+0x75/0x140
>> [ 2041.797671]  [<ffffffff8149b9af>] dm_table_add_target+0xff/0x250
>> [ 2041.804413]  [<ffffffff8149de3a>] table_load+0xca/0x2f0
>> [ 2041.810317]  [<ffffffff8149dd70>] ? table_load+0x0/0x2f0
>> [ 2041.816316]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
>> [ 2041.822184]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
>> [ 2041.828188]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
>> [ 2041.834250]  [<ffffffff8109ae6b>] ? sys_futex+0x7b/0x170
>> [ 2041.840219]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
>> [ 2041.845898]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
>   multipathd is hung waiting for module to be loaded? How come?

It shouldn't, dh_alua is already loaded. 


>> [ 2041.964575] INFO: task kpartx:1897 blocked for more than 120 seconds.
>> [ 2041.971801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [ 2041.980626] kpartx          D ffff88063d05df30     0  1897   1896 0x00000000
>> [ 2041.988607]  ffff88063c3a5b58 0000000000000082 0000000e3c3a5ac8 ffff880c00000000
>> [ 2041.997056]  0000000000013cc0 ffff88063d05db80 ffff88063d05df30 ffff88063c3a5fd8
>> [ 2042.005496]  ffff88063d05df38 0000000000013cc0 ffff88063c3a4010 0000000000013cc0
>> [ 2042.013939] Call Trace:
>> [ 2042.016702]  [<ffffffff8123dc85>] log_wait_commit+0xc5/0x150
>> [ 2042.023089]  [<ffffffff81086b50>] ? autoremove_wake_function+0x0/0x40
>> [ 2042.030321]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
>> [ 2042.036584]  [<ffffffff811e6256>] ext3_sync_fs+0x66/0x70
>> [ 2042.042552]  [<ffffffff811ba7c1>] dquot_quota_sync+0x1c1/0x330
>> [ 2042.049133]  [<ffffffff81115391>] ? do_writepages+0x21/0x40
>> [ 2042.055423]  [<ffffffff8110ae8b>] ? __filemap_fdatawrite_range+0x5b/0x60
>> [ 2042.062944]  [<ffffffff8118f42c>] __sync_filesystem+0x3c/0x90
>> [ 2042.069430]  [<ffffffff8118f56b>] sync_filesystem+0x4b/0x70
>> [ 2042.075690]  [<ffffffff81166a85>] freeze_super+0x55/0x100
>> [ 2042.081754]  [<ffffffff811993b8>] freeze_bdev+0x98/0xe0
>> [ 2042.087625]  [<ffffffff81499001>] dm_suspend+0xa1/0x2e0
>> [ 2042.093495]  [<ffffffff8149ced9>] ? __get_name_cell+0x99/0xb0
>> [ 2042.099948]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.105916]  [<ffffffff8149e29b>] do_resume+0x17b/0x1b0
>> [ 2042.111784]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.117753]  [<ffffffff8149e365>] dev_suspend+0x95/0xb0
>> [ 2042.123621]  [<ffffffff8149e2d0>] ? dev_suspend+0x0/0xb0
>> [ 2042.129591]  [<ffffffff8149f0d5>] ctl_ioctl+0x1a5/0x240
>> [ 2042.135493]  [<ffffffff815d4eee>] ? _raw_spin_lock+0xe/0x20
>> [ 2042.141770]  [<ffffffff8149f183>] dm_ctl_ioctl+0x13/0x20
>> [ 2042.147739]  [<ffffffff81175245>] do_vfs_ioctl+0x95/0x3c0
>> [ 2042.153801]  [<ffffffff81175611>] sys_ioctl+0xa1/0xb0
>> [ 2042.159478]  [<ffffffff8100c042>] system_call_fastpath+0x16/0x1b
>   kpartx is waiting for kjournald to finish transaction commit and it is
> holding s_umount but that doesn't really seem to be a problem...
> 
> So as I say, find a reason why kjournald is not able to finish committing a
> transaction and you should solve this riddle ;).

Cool, thanks!

Peter

> 
> 								Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index f55df0e..6d376ef 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -52,6 +52,23 @@  static int ext3_release_file (struct inode * inode, struct file * filp)
 	return 0;
 }
 
+static const struct vm_operations_struct ext3_file_vm_ops = {
+	.fault          = filemap_fault,
+	.page_mkwrite   = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct address_space *mapping = file->f_mapping;
+
+	if (!mapping->a_ops->readpage)
+		return -ENOEXEC;
+	file_accessed(file);
+	vma->vm_ops = &ext3_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
+	return 0;
+}
+
 const struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -62,7 +79,7 @@  const struct file_operations ext3_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ext3_compat_ioctl,
 #endif
-	.mmap		= generic_file_mmap,
+	.mmap		= ext3_file_mmap,
 	.open		= dquot_file_open,
 	.release	= ext3_release_file,
 	.fsync		= ext3_sync_file,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 68b2e43..66c31dd 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3496,3 +3496,74 @@  int ext3_change_inode_journal_flag(struct inode *inode, int val)
 
 	return err;
 }
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	loff_t size;
+	unsigned long len;
+	int ret = -EINVAL;
+	void *fsdata;
+	struct file *file = vma->vm_file;
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+
+	/*
+	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
+	 * get i_mutex because we are already holding mmap_sem.
+	 */
+	down_read(&inode->i_alloc_sem);
+	size = i_size_read(inode);
+	if (page->mapping != mapping || size <= page_offset(page)
+	   || !PageUptodate(page)) {
+		/* page got truncated from under us? */
+		goto out_unlock;
+	}
+	ret = 0;
+	if (PageMappedToDisk(page))
+		goto out_frozen;
+
+	if (page->index == size >> PAGE_CACHE_SHIFT)
+		len = size & ~PAGE_CACHE_MASK;
+	else
+		len = PAGE_CACHE_SIZE;
+
+	lock_page(page);
+	/*
+	 * return if we have all the buffers mapped. This avoid
+	 * the need to call write_begin/write_end which does a
+	 * journal_start/journal_stop which can block and take
+	 * long time
+	 */
+	if (page_has_buffers(page)) {
+		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
+					buffer_unmapped)) {
+			unlock_page(page);
+out_frozen:
+			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+			goto out_unlock;
+		}
+	}
+	unlock_page(page);
+	/*
+	 * OK, we need to fill the hole... Do write_begin write_end
+	 * to do block allocation/reservation.We are not holding
+	 * inode.i__mutex here. That allow * parallel write_begin,
+	 * write_end call. lock_page prevent this from happening
+	 * on the same page though
+	 */
+	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
+			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
+			len, len, page, fsdata);
+	if (ret < 0)
+		goto out_unlock;
+	ret = 0;
+out_unlock:
+	if (ret)
+		ret = VM_FAULT_SIGBUS;
+	up_read(&inode->i_alloc_sem);
+	return ret;
+}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..44979ae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5812,7 +5812,7 @@  int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	}
 	ret = 0;
 	if (PageMappedToDisk(page))
-		goto out_unlock;
+		goto out_frozen;
 
 	if (page->index == size >> PAGE_CACHE_SHIFT)
 		len = size & ~PAGE_CACHE_MASK;
@@ -5830,6 +5830,8 @@  int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
 			unlock_page(page);
+out_frozen:
+			vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 			goto out_unlock;
 		}
 	}
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 85c1d30..a0e39ca 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -919,6 +919,7 @@  extern void ext3_get_inode_flags(struct ext3_inode_info *);
 extern void ext3_set_aops(struct inode *inode);
 extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		       u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* ioctl.c */
 extern long ext3_ioctl(struct file *, unsigned int, unsigned long);