Patchwork [03/12] ext4: Remove bogus wait for unwritten extents in ext4_ind_direct_IO

login
register
mail settings
Submitter Jan Kara
Date Jan. 18, 2013, noon
Message ID <1358510446-19174-4-git-send-email-jack@suse.cz>
Download mbox | patch
Permalink /patch/213588/
State Rejected
Headers show

Comments

Jan Kara - Jan. 18, 2013, noon
When using indirect blocks there is no possibility to have any unwritten
extents. So wait for them in ext4_ind_direct_IO() is just bogus.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)
Dmitri Monakho - Jan. 22, 2013, 11:11 a.m.
On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> When using indirect blocks there is no possibility to have any unwritten
> extents. So wait for them in ext4_ind_direct_IO() is just bogus.
But as soon as i remember indirect implementation may also be used by
extents based inodes 3074: ext4_ext_direct_IO
    /* Use the old path for reads and writes beyond i_size. */
    if (rw != WRITE || final_size > inode->i_size)
       return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);

Am I missing ?
> 
> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/indirect.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 20862f9..993247c 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  
>  retry:
>  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> -		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
> -			mutex_lock(&inode->i_mutex);
> -			ext4_flush_unwritten_io(inode);
> -			mutex_unlock(&inode->i_mutex);
> -		}
>  		/*
>  		 * Nolock dioread optimization may be dynamically disabled
>  		 * via ext4_inode_block_unlocked_dio(). Check inode's state
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Jan. 22, 2013, 1:44 p.m.
On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > When using indirect blocks there is no possibility to have any unwritten
> > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> But as soon as i remember indirect implementation may also be used by
> extents based inodes 3074: ext4_ext_direct_IO
>     /* Use the old path for reads and writes beyond i_size. */
>     if (rw != WRITE || final_size > inode->i_size)
>        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> 
> Am I missing ?
  Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
and that code path needs some cleaning and commenting. In particular I'm
afraid using dioread_nolock for inodes with indirect map causes data
exposure bugs when unlocked DIO read races with DIO write because such
inodes don't support uninitialized extents.

								Honza
Dmitri Monakho - Jan. 22, 2013, 2:12 p.m.
On Tue, 22 Jan 2013 14:44:00 +0100, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > When using indirect blocks there is no possibility to have any unwritten
> > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > But as soon as i remember indirect implementation may also be used by
> > extents based inodes 3074: ext4_ext_direct_IO
> >     /* Use the old path for reads and writes beyond i_size. */
> >     if (rw != WRITE || final_size > inode->i_size)
> >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > 
> > Am I missing ?
>   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> and that code path needs some cleaning and commenting. In particular I'm
> afraid using dioread_nolock for inodes with indirect map causes data
> exposure bugs when unlocked DIO read races with DIO write because such
> inodes don't support uninitialized extents.
Yes that's why dioread_nolock works only for extent based inodes

static inline int ext4_should_dioread_nolock(struct inode *inode)
{
        if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
                return 0;
        if (!S_ISREG(inode->i_mode))
                return 0;
                if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
                   return 0;
                   if (ext4_should_journal_data(inode))
                return 0;
        return 1;
}

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - Jan. 22, 2013, 2:22 p.m.
On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > When using indirect blocks there is no possibility to have any unwritten
> > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > But as soon as i remember indirect implementation may also be used by
> > extents based inodes 3074: ext4_ext_direct_IO
> >     /* Use the old path for reads and writes beyond i_size. */
> >     if (rw != WRITE || final_size > inode->i_size)
> >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > 
> > Am I missing ?
>   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> and that code path needs some cleaning and commenting. In particular I'm
> afraid using dioread_nolock for inodes with indirect map causes data
> exposure bugs when unlocked DIO read races with DIO write because such
> inodes don't support uninitialized extents.

Sorry, but I am still confused.  dioread_nolock is only for extent-based
file.  So when a file system without extent feature, dioread_nolock
couldn't be enabled.  It seems that we don't need to worry about
exposing stale data here.

Thanks,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Jan. 22, 2013, 3:21 p.m.
On Tue 22-01-13 18:12:38, Dmitry Monakhov wrote:
> On Tue, 22 Jan 2013 14:44:00 +0100, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > When using indirect blocks there is no possibility to have any unwritten
> > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > But as soon as i remember indirect implementation may also be used by
> > > extents based inodes 3074: ext4_ext_direct_IO
> > >     /* Use the old path for reads and writes beyond i_size. */
> > >     if (rw != WRITE || final_size > inode->i_size)
> > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > 
> > > Am I missing ?
> >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > and that code path needs some cleaning and commenting. In particular I'm
> > afraid using dioread_nolock for inodes with indirect map causes data
> > exposure bugs when unlocked DIO read races with DIO write because such
> > inodes don't support uninitialized extents.
> Yes that's why dioread_nolock works only for extent based inodes
> 
> static inline int ext4_should_dioread_nolock(struct inode *inode)
> {
>         if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
>                 return 0;
>         if (!S_ISREG(inode->i_mode))
>                 return 0;
>                 if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>                    return 0;
>                    if (ext4_should_journal_data(inode))
>                 return 0;
>         return 1;
> }
  Sure, I was confused. Things work correctly just the code flow is a bit
confusing.

								Honza
Jan Kara - Jan. 22, 2013, 3:22 p.m.
On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > When using indirect blocks there is no possibility to have any unwritten
> > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > But as soon as i remember indirect implementation may also be used by
> > > extents based inodes 3074: ext4_ext_direct_IO
> > >     /* Use the old path for reads and writes beyond i_size. */
> > >     if (rw != WRITE || final_size > inode->i_size)
> > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > 
> > > Am I missing ?
> >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > and that code path needs some cleaning and commenting. In particular I'm
> > afraid using dioread_nolock for inodes with indirect map causes data
> > exposure bugs when unlocked DIO read races with DIO write because such
> > inodes don't support uninitialized extents.
> 
> Sorry, but I am still confused.  dioread_nolock is only for extent-based
> file.  So when a file system without extent feature, dioread_nolock
> couldn't be enabled.  It seems that we don't need to worry about
> exposing stale data here.
  Well, you can have fs with extent feature enabled but still with inodes
using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
handles that correctly. So there's not a bug I was suspecting.

							Honza
Zheng Liu - Jan. 22, 2013, 4 p.m.
On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > But as soon as i remember indirect implementation may also be used by
> > > > extents based inodes 3074: ext4_ext_direct_IO
> > > >     /* Use the old path for reads and writes beyond i_size. */
> > > >     if (rw != WRITE || final_size > inode->i_size)
> > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > 
> > > > Am I missing ?
> > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > and that code path needs some cleaning and commenting. In particular I'm
> > > afraid using dioread_nolock for inodes with indirect map causes data
> > > exposure bugs when unlocked DIO read races with DIO write because such
> > > inodes don't support uninitialized extents.
> > 
> > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > file.  So when a file system without extent feature, dioread_nolock
> > couldn't be enabled.  It seems that we don't need to worry about
> > exposing stale data here.
>   Well, you can have fs with extent feature enabled but still with inodes
> using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> handles that correctly. So there's not a bug I was suspecting.

Yep, the patch itself is fine.  But that would be great if a comment is
added here.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Jan. 22, 2013, 11:14 p.m.
On Wed 23-01-13 00:00:17, Zheng Liu wrote:
> On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> > On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > > But as soon as i remember indirect implementation may also be used by
> > > > > extents based inodes 3074: ext4_ext_direct_IO
> > > > >     /* Use the old path for reads and writes beyond i_size. */
> > > > >     if (rw != WRITE || final_size > inode->i_size)
> > > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > > 
> > > > > Am I missing ?
> > > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > > and that code path needs some cleaning and commenting. In particular I'm
> > > > afraid using dioread_nolock for inodes with indirect map causes data
> > > > exposure bugs when unlocked DIO read races with DIO write because such
> > > > inodes don't support uninitialized extents.
> > > 
> > > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > > file.  So when a file system without extent feature, dioread_nolock
> > > couldn't be enabled.  It seems that we don't need to worry about
> > > exposing stale data here.
> >   Well, you can have fs with extent feature enabled but still with inodes
> > using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> > handles that correctly. So there's not a bug I was suspecting.
> 
> Yep, the patch itself is fine.  But that would be great if a comment is
> added here.
  No, the patch is wrong. The code before the patch is correct. We can get
to that code for extent based inode which has unwritten conversions pending
and we need to wait for those as otherwise we could return 0s in places
where we acknowledged successful write just a while ago. Or am I missing
something?

								Honza
Zheng Liu - Jan. 23, 2013, 6:11 a.m.
On Wed, Jan 23, 2013 at 12:14:32AM +0100, Jan Kara wrote:
> On Wed 23-01-13 00:00:17, Zheng Liu wrote:
> > On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> > > On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > > > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > > > But as soon as i remember indirect implementation may also be used by
> > > > > > extents based inodes 3074: ext4_ext_direct_IO
> > > > > >     /* Use the old path for reads and writes beyond i_size. */
> > > > > >     if (rw != WRITE || final_size > inode->i_size)
> > > > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > > > 
> > > > > > Am I missing ?
> > > > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > > > and that code path needs some cleaning and commenting. In particular I'm
> > > > > afraid using dioread_nolock for inodes with indirect map causes data
> > > > > exposure bugs when unlocked DIO read races with DIO write because such
> > > > > inodes don't support uninitialized extents.
> > > > 
> > > > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > > > file.  So when a file system without extent feature, dioread_nolock
> > > > couldn't be enabled.  It seems that we don't need to worry about
> > > > exposing stale data here.
> > >   Well, you can have fs with extent feature enabled but still with inodes
> > > using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> > > handles that correctly. So there's not a bug I was suspecting.
> > 
> > Yep, the patch itself is fine.  But that would be great if a comment is
> > added here.
>   No, the patch is wrong. The code before the patch is correct. We can get
> to that code for extent based inode which has unwritten conversions pending
> and we need to wait for those as otherwise we could return 0s in places
> where we acknowledged successful write just a while ago. Or am I missing
> something?

Ah, I see.  I guess that the problem is that the dio read races with buffered
write.

   dio read                         buffered write
                                    ->ext4_file_write
                                      ->ext4_da_write_begin
                                      ->ext4_da_write_end
                                      [buffered write has finished, but the data
                                       and metadata has not been flushed]
   ->generic_file_aio_read
     ->filemap_write_and_wait_range
       ->do_writepages
         ->ext4_da_writepages
     ->filemap_fdatawait_range
       ->wait_on_page_writeback
                                    ->ext4_end_bio
                                      ->end_page_writeback
                                        [unwritten extent has not been
                                         converted]
     ->ext4_ind_direct_IO
       [here we need to flush unwritten io]

So it seems that this patch could be applied after reworking unwritten extent
conversion.

FWIW, after applied this patch, the latency of dio read could be reduced
dramatically.  So that would be great if this patch can be applied when it
doesn't break something.

Thanks,
						- Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Jan. 23, 2013, 9:42 a.m.
On Wed 23-01-13 14:11:41, Zheng Liu wrote:
> On Wed, Jan 23, 2013 at 12:14:32AM +0100, Jan Kara wrote:
> > On Wed 23-01-13 00:00:17, Zheng Liu wrote:
> > > On Tue, Jan 22, 2013 at 04:22:43PM +0100, Jan Kara wrote:
> > > > On Tue 22-01-13 22:22:21, Zheng Liu wrote:
> > > > > On Tue, Jan 22, 2013 at 02:44:00PM +0100, Jan Kara wrote:
> > > > > > On Tue 22-01-13 15:11:24, Dmitry Monakhov wrote:
> > > > > > > On Fri, 18 Jan 2013 13:00:37 +0100, Jan Kara <jack@suse.cz> wrote:
> > > > > > > > When using indirect blocks there is no possibility to have any unwritten
> > > > > > > > extents. So wait for them in ext4_ind_direct_IO() is just bogus.
> > > > > > > But as soon as i remember indirect implementation may also be used by
> > > > > > > extents based inodes 3074: ext4_ext_direct_IO
> > > > > > >     /* Use the old path for reads and writes beyond i_size. */
> > > > > > >     if (rw != WRITE || final_size > inode->i_size)
> > > > > > >        return ext4_ind_direct_IO(rw, iocb, iov, offset, nr_segs);
> > > > > > > 
> > > > > > > Am I missing ?
> > > > > >   Ah, that's a catch. Thanks for pointing that out! So my patch is wrong
> > > > > > and that code path needs some cleaning and commenting. In particular I'm
> > > > > > afraid using dioread_nolock for inodes with indirect map causes data
> > > > > > exposure bugs when unlocked DIO read races with DIO write because such
> > > > > > inodes don't support uninitialized extents.
> > > > > 
> > > > > Sorry, but I am still confused.  dioread_nolock is only for extent-based
> > > > > file.  So when a file system without extent feature, dioread_nolock
> > > > > couldn't be enabled.  It seems that we don't need to worry about
> > > > > exposing stale data here.
> > > >   Well, you can have fs with extent feature enabled but still with inodes
> > > > using indirect map. But as Dmitry pointed out, ext4_should_dioread_nolock()
> > > > handles that correctly. So there's not a bug I was suspecting.
> > > 
> > > Yep, the patch itself is fine.  But that would be great if a comment is
> > > added here.
> >   No, the patch is wrong. The code before the patch is correct. We can get
> > to that code for extent based inode which has unwritten conversions pending
> > and we need to wait for those as otherwise we could return 0s in places
> > where we acknowledged successful write just a while ago. Or am I missing
> > something?
> 
> Ah, I see.  I guess that the problem is that the dio read races with buffered
> write.
> 
>    dio read                         buffered write
>                                     ->ext4_file_write
>                                       ->ext4_da_write_begin
>                                       ->ext4_da_write_end
>                                       [buffered write has finished, but the data
>                                        and metadata has not been flushed]
>    ->generic_file_aio_read
>      ->filemap_write_and_wait_range
>        ->do_writepages
>          ->ext4_da_writepages
>      ->filemap_fdatawait_range
>        ->wait_on_page_writeback
>                                     ->ext4_end_bio
>                                       ->end_page_writeback
>                                         [unwritten extent has not been
>                                          converted]
>      ->ext4_ind_direct_IO
>        [here we need to flush unwritten io]
  Yes, exactly.

> So it seems that this patch could be applied after reworking unwritten extent
> conversion.
  Yes. When PageWriteback is cleared after extent conversion, this waiting
can go away and everything should be fine.

> FWIW, after applied this patch, the latency of dio read could be reduced
> dramatically.  So that would be great if this patch can be applied when it
> doesn't break something.
  Sure, I'll have that in mind.

								Honza

Patch

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 20862f9..993247c 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,6 @@  ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
-			mutex_lock(&inode->i_mutex);
-			ext4_flush_unwritten_io(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state