From patchwork Tue Aug 16 21:32:12 2011 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Jiaying Zhang X-Patchwork-Id: 110243 Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 2A6E5B6F7C for ; Wed, 17 Aug 2011 07:32:23 +1000 (EST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752861Ab1HPVcT (ORCPT ); Tue, 16 Aug 2011 17:32:19 -0400 Received: from smtp-out.google.com ([216.239.44.51]:36910 "EHLO smtp-out.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752739Ab1HPVcS convert rfc822-to-8bit (ORCPT ); Tue, 16 Aug 2011 17:32:18 -0400 Received: from wpaz9.hot.corp.google.com (wpaz9.hot.corp.google.com [172.24.198.73]) by smtp-out.google.com with ESMTP id p7GLWEEH005005 for ; Tue, 16 Aug 2011 14:32:14 -0700 DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=google.com; s=beta; t=1313530335; bh=GiirEPdw38xXLgUQiGTBlJkvrBA=; h=MIME-Version:In-Reply-To:References:Date:Message-ID:Subject:From: To:Cc:Content-Type:Content-Transfer-Encoding; b=U/LZ9AkaxvbHsvyDN68lTBd+w2pw1aIMJ70/ly0ZFvhTWDAvQd+RHu629RnoGIaTO RkR27YLazVJ1cMAT/uHZg== DomainKey-Signature: a=rsa-sha1; s=beta; d=google.com; c=nofws; q=dns; h=dkim-signature:mime-version:in-reply-to:references:date: message-id:subject:from:to:cc:content-type: content-transfer-encoding:x-system-of-record; b=PZmPFza0xtLSuHAm2TPsD0ih78ceIq+3PqzbkDHZzmDsW4z1cB/o6/tzJzF771v02 vas9T7lJN1xodFJr+mtGg== Received: from yib19 (yib19.prod.google.com [10.243.65.83]) by wpaz9.hot.corp.google.com with ESMTP id p7GLUQMf009665 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=NOT) for ; Tue, 16 Aug 2011 14:32:13 -0700 Received: by yib19 with SMTP id 19so416340yib.27 for ; Tue, 16 Aug 2011 14:32:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=beta; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=AnDD+9IH2FqfuWBFTXqnlQGgngPvQobawu02FBl9X9k=; b=kWAUPtvn1LA8PmZbQ/MKj9H0yEhd1WruQ7Q2yXIBCdQQ7W641Ir4m/1Nmv4fwCqlgU i2eqJw0XdPlr4c6kaeNg== Received: by 10.90.126.5 with SMTP id y5mr224693agc.185.1313530332790; Tue, 16 Aug 2011 14:32:12 -0700 (PDT) MIME-Version: 1.0 Received: by 10.90.126.5 with SMTP id y5mr224675agc.185.1313530332379; Tue, 16 Aug 2011 14:32:12 -0700 (PDT) Received: by 10.90.5.5 with HTTP; Tue, 16 Aug 2011 14:32:12 -0700 (PDT) In-Reply-To: <4E4A86D0.2070300@tao.ma> References: <4E456436.8070107@msgid.tls.msk.ru> <1313251371-3672-1-git-send-email-tm@tao.ma> <4E4836A8.3080709@msgid.tls.msk.ru> <4E48390E.9050102@msgid.tls.msk.ru> <4E488625.609@tao.ma> <4E48D231.5060807@msgid.tls.msk.ru> <4E48DF31.4050603@msgid.tls.msk.ru> <20110816135325.GD23416@quack.suse.cz> <4E4A86D0.2070300@tao.ma> Date: Tue, 16 Aug 2011 14:32:12 -0700 Message-ID: Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0) From: Jiaying Zhang To: Tao Ma Cc: Jan Kara , Michael Tokarev , linux-ext4@vger.kernel.org, sandeen@redhat.com X-System-Of-Record: true Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma wrote: > On 08/16/2011 09:53 PM, Jan Kara wrote: >> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: >>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev wrote: >>>> 15.08.2011 12:00, Michael Tokarev wrote: >>>> [....] >>>> >>>> So, it looks like this (starting with cold cache): >>>> >>>> 1. rename the redologs and copy them over - this will >>>>   make a hot copy of redologs >>>> 2. startup oracle - it will complain that the redologs aren't >>>>   redologs, the header is corrupt >>>> 3. shut down oracle, start it up again - it will succeed. >>>> >>>> If between 1 and 2 you'll issue sync(1) everything will work. >>>> When shutting down, oracle calls fsync(), so that's like >>>> sync(1) again. >>>> >>>> If there will be some time between 1. and 2., everything >>>> will work too. >>>> >>>> Without dioread_nolock I can't trigger the problem no matter >>>> how I tried. >>>> >>>> >>>> A smaller test case.  I used redo1.odf file (one of the >>>> redologs) as a test file, any will work. >>>> >>>>  $ cp -p redo1.odf temp >>>>  $ dd if=temp of=foo iflag=direct count=20 >>> Isn't this the expected behavior here? When doing >>> 'cp -p redo1.odf temp', data is copied to temp through >>> buffer write, but there is no guarantee when data will be >>> actually written to disk. Then with 'dd if=temp of=foo >>> iflag=direct count=20', data is read directly from disk. >>> Very likely, the written data hasn't been flushed to disk >>> yet so ext4 returns zero in this case. >>   No it's not. Buffered and direct IO is supposed to work correctly >> (although not fast) together. In this particular case we take care to flush >> dirty data from page cache before performing direct IO read... But >> something is broken in this path obviously. I see. Thanks a lot for the explanation. I wonder whether the following patch will solve the problem: I tested the patch a little bit and it seems to resolve the race on dioread_nolock in my case. Michael, I would very appreciate if you can try this patch with your test case and see whether it works. >> >> I don't have time to dig into this in detail now but what seems to be the >> problem is that with dioread_nolock option, we don't acquire i_mutex for >> direct IO reads anymore. Thus these reads can compete with >> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under >> i_mutex so without dioread_nolock the race cannot happen). >> >> Hmm, the new writepages code seems to be broken in combination with direct >> IO. Direct IO code expects that when filemap_write_and_wait() finishes, >> data is on disk but with new bio submission code this is not true because >> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in >> ext4_end_io_buffer_write() but do extent conversion only after that in >> convert workqueue. So the race seems to be there all the time, just without >> dioread_nolock it's much smaller. I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled. So I think we should be ok with the default mount options. Jiaying > You are absolutely right. The really problem is that ext4_direct_IO > begins to work *after* we clear the page writeback flag and *before* we > convert unwritten extent to a valid state. Some of my trace does show > that. I am working on it now. >> >> Fixing this is going to be non-trivial - I'm not sure we can really move >> clearing of PageWriteback bit to conversion workqueue. I thienk we already >> tried that once but it caused deadlocks for some reason... > I just did as what you described and yes I met with another problem and > try to resolve it now. Once it is OK, I will send out the patch. > > Thanks > Tao > --- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 6c27111..ca90d73 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb, } retry: - if (rw == READ && ext4_should_dioread_nolock(inode)) + if (rw == READ && ext4_should_dioread_nolock(inode)) { + if (unlikely(!list_empty(&ei->i_completed_io_list))) { + mutex_lock(&inode->i_mutex); + ext4_flush_completed_IO(inode); + mutex_unlock(&inode->i_mutex); + } ret = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs, ext4_get_block, NULL, NULL, 0); - else { + } else { ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs,