diff mbox series

[1/1] ext4: Mark buffer new if it is unwritten to avoid stale data exposure

Message ID c307579df7e109eb4eedaaf07ebdc98b15d8b7ff.1693909504.git.ojaswin@linux.ibm.com
State New
Headers show
Series ext4: Fix stale data exposure caused with dioread_nolock | expand

Commit Message

Ojaswin Mujoo Sept. 5, 2023, 10:28 a.m. UTC
** Short Version **

In ext4 with dioread_nolock, we could have a scenario where the bh returned by
get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
never zero out the range of bh that is not under write, causing whatever stale
data is present in the folio at that time to be written out to disk. To fix this
mark the buffer as new in _ext4_get_block(), in case it is unwritten.

-----
** Long Version **

The issue mentioned above was resulting in two different bugs:

1. On block size < page size case in ext4, generic/269 was reliably
failing with dioread_nolock. The state of the write was as follows:

  * The write was extending i_size.
  * The last block of the file was fallocated and had an unwritten extent
  * We were near ENOSPC and hence we were switching to non-delayed alloc
    allocation.

In this case, the back trace that triggers the bug is as follows:

  ext4_da_write_begin()
    /* switch to nodelalloc due to low space */
    ext4_write_begin()
      ext4_should_dioread_nolock() // true since mount flags still have delalloc
      __block_write_begin(..., ext4_get_block_unwritten)
        __block_write_begin_int()
          for(each buffer head in page) {
            /* first iteration, this is bh1 which contains i_size */
            if (!buffer_mapped)
              get_block() /* returns bh with only UNWRITTEN and MAPPED */
            /* second iteration, bh2 */
              if (!buffer_mapped)
                get_block() /* we fail here, could be ENOSPC */
          }
          if (err)
            /*
             * this would zero out all new buffers and mark them uptodate.
             * Since bh1 was never marked new, we skip it here which causes
             * the bug later.
             */
            folio_zero_new_buffers();
      /* ext4_wrte_begin() error handling */
      ext4_truncate_failed_write()
        ext4_truncate()
          ext4_block_truncate_page()
            __ext4_block_zero_page_range()
              if(!buffer_uptodate())
                ext4_read_bh_lock()
                  ext4_read_bh() -> ... ext4_submit_bh_wbc()
                    BUG_ON(buffer_unwritten(bh)); /* !!! */

2. The second issue is stale data exposure with page size >= blocksize
with dioread_nolock. The conditions needed for it to happen are same as
the previous issue ie dioread_nolock around ENOSPC condition. The issue
is also similar where in __block_write_begin_int() when we call
ext4_get_block_unwritten() on the buffer_head and the underlying extent
is unwritten, we get an unwritten and mapped buffer head. Since it is
not new, we never zero out the partial range which is not under write,
thus writing stale data to disk. This can be easily observed with the
following reporducer:

 fallocate -l 4k testfile
 xfs_io -c "pwrite 2k 2k" testfile
 # hexdump output will have stale data in from byte 0 to 2k in testfile
 hexdump -C testfile

NOTE: To trigger this, we need dioread_nolock enabled and write
happening via ext4_write_begin(), which is usually used when we have -o
nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
alternate way to call ext4_write_begin() is to fill make sure dellayed
alloc switches to nodelalloc (ext4_da_write_begin() calls
ext4_write_begin()).  This will usually happen when FS is almost full
like the way generic/269 was triggering it in Issue 1 above. This might
make this issue harder to replicate hence for reliable replicate, I used
the below patch to temporarily allow dioread_nolock with nodelalloc and
then mount the disk with -o nodealloc,dioread_nolock. With this you can
hit the stale data issue 100% of times:

@@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
  if (ext4_should_journal_data(inode))
    return 0;
  /* temporary fix to prevent generic/422 test failures */
- if (!test_opt(inode->i_sb, DELALLOC))
-   return 0;
+ // if (!test_opt(inode->i_sb, DELALLOC))
+ //  return 0;
  return 1;
 }

-------

After applying this patch to mark buffer as NEW, both the above issues are
fixed.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/inode.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Ojaswin Mujoo Sept. 7, 2023, 7:36 a.m. UTC | #1
On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > ** Short Version **
> > 
> > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > never zero out the range of bh that is not under write, causing whatever stale
> > data is present in the folio at that time to be written out to disk. To fix this
> > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > 
> > -----
> > ** Long Version **
> > 
> > The issue mentioned above was resulting in two different bugs:
> > 
> > 1. On block size < page size case in ext4, generic/269 was reliably
> > failing with dioread_nolock. The state of the write was as follows:
> > 
> >   * The write was extending i_size.
> >   * The last block of the file was fallocated and had an unwritten extent
> >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> >     allocation.
> > 
> > In this case, the back trace that triggers the bug is as follows:
> > 
> >   ext4_da_write_begin()
> >     /* switch to nodelalloc due to low space */
> >     ext4_write_begin()
> >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> >       __block_write_begin(..., ext4_get_block_unwritten)
> >         __block_write_begin_int()
> >           for(each buffer head in page) {
> >             /* first iteration, this is bh1 which contains i_size */
> >             if (!buffer_mapped)
> >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> >             /* second iteration, bh2 */
> >               if (!buffer_mapped)
> >                 get_block() /* we fail here, could be ENOSPC */
> >           }
> >           if (err)
> >             /*
> >              * this would zero out all new buffers and mark them uptodate.
> >              * Since bh1 was never marked new, we skip it here which causes
> >              * the bug later.
> >              */
> >             folio_zero_new_buffers();
> >       /* ext4_wrte_begin() error handling */
> >       ext4_truncate_failed_write()
> >         ext4_truncate()
> >           ext4_block_truncate_page()
> >             __ext4_block_zero_page_range()
> 	>               if(!buffer_uptodate())
> >                 ext4_read_bh_lock()
> >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > 
> > 2. The second issue is stale data exposure with page size >= blocksize
> > with dioread_nolock. The conditions needed for it to happen are same as
> > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > is also similar where in __block_write_begin_int() when we call
> > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > not new, we never zero out the partial range which is not under write,
> > thus writing stale data to disk. This can be easily observed with the
> > following reporducer:
> > 
> >  fallocate -l 4k testfile
> >  xfs_io -c "pwrite 2k 2k" testfile
> >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> >  hexdump -C testfile
> > 
> > NOTE: To trigger this, we need dioread_nolock enabled and write
> > happening via ext4_write_begin(), which is usually used when we have -o
> > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > ext4_write_begin()).  This will usually happen when FS is almost full
> > like the way generic/269 was triggering it in Issue 1 above. This might
> > make this issue harder to replicate hence for reliable replicate, I used
> > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > hit the stale data issue 100% of times:
> > 
> > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> >   if (ext4_should_journal_data(inode))
> >     return 0;
> >   /* temporary fix to prevent generic/422 test failures */
> > - if (!test_opt(inode->i_sb, DELALLOC))
> > -   return 0;
> > + // if (!test_opt(inode->i_sb, DELALLOC))
> > + //  return 0;
> >   return 1;
> >  }
> > 
> > -------
> > 
> > After applying this patch to mark buffer as NEW, both the above issues are
> > fixed.
> > 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Hi Jan, thanks for the review.

> 
> Good catch! But I'm wondering whether this is really the right fix. For
> example in ext4_block_truncate_page() shouldn't we rather be checking
> whether the buffer isn't unwritten and if yes then bail because there's
> nothing to zero out in the block? That would seem like a more logical
> and robust solution of the first problem to me.

Right, I agree. I will look into it and prepare a patch for this in v2.

> 
> Regarding the second issue I agree that using buffer_new flag makes the
> most sense. But it would make most sense to me to put this special logic
> directly into ext4_get_block_unwritten() because it is really special logic
> when preparing buffered write via unwritten extent (and it relies on
> __block_write_begin_int() logic to interpret buffer_new flag in the right
> way). Putting in _ext4_get_block() seems confusing to me because it raises
> questions like why should we set it for reads? And why not set it already
> in ext4_map_blocks() which is also used by iomap?

Originally I had kept it there because it didn't seem to affect any read
related paths, and marking an unwritten buffer as new for zero'ing out
seemed like the right thing to do irrespective of which code path we
were coming from. However, I think its okay to move it
ext4_get_block_unwritten() it seems to be the only place where we need
to explicitly mark it as such.

That being said, I also had an alternate solution that marked the map
flag as NEW in ext4_map_blocks() -> ext4_ext4_map_blocks() ->
ext4_ext_handle_unwritten_extents(). Do you think it makes more
sense to handle this issue in ext4 map layer instead of relying on special
handling of buffer head?

Yesterday I looked into this a bit more and it seems that all the other
code paths in ext4, except ext4_da_get_block_prep(), rely on
ext4_map_blocks() setting the NEW flag correctly in map->m_flags
whenever the buffer might need to be zeroed out (this is true for dio
write via iomap as well). Now this makes me incline towards fixing the
issue in ext4_map_blocks layer, which might be better in the longer for
eg when we eventually move to iomap.

Regards,
ojaswin

> 
> 								Honza
> 
> 
> > ---
> >  fs/ext4/inode.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 6c490f05e2ba..a30bfec0b525 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -765,6 +765,10 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
> >  	if (ret > 0) {
> >  		map_bh(bh, inode->i_sb, map.m_pblk);
> >  		ext4_update_bh_state(bh, map.m_flags);
> > +
> > +		if (buffer_unwritten(bh))
> > +			set_buffer_new(bh);
> > +
> >  		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
> >  		ret = 0;
> >  	} else if (ret == 0) {
> > -- 
> > 2.31.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Sept. 7, 2023, 11:46 a.m. UTC | #2
On Thu 07-09-23 13:06:56, Ojaswin Mujoo wrote:
> On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> > On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > > ** Short Version **
> > > 
> > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > > never zero out the range of bh that is not under write, causing whatever stale
> > > data is present in the folio at that time to be written out to disk. To fix this
> > > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > > 
> > > -----
> > > ** Long Version **
> > > 
> > > The issue mentioned above was resulting in two different bugs:
> > > 
> > > 1. On block size < page size case in ext4, generic/269 was reliably
> > > failing with dioread_nolock. The state of the write was as follows:
> > > 
> > >   * The write was extending i_size.
> > >   * The last block of the file was fallocated and had an unwritten extent
> > >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> > >     allocation.
> > > 
> > > In this case, the back trace that triggers the bug is as follows:
> > > 
> > >   ext4_da_write_begin()
> > >     /* switch to nodelalloc due to low space */
> > >     ext4_write_begin()
> > >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> > >       __block_write_begin(..., ext4_get_block_unwritten)
> > >         __block_write_begin_int()
> > >           for(each buffer head in page) {
> > >             /* first iteration, this is bh1 which contains i_size */
> > >             if (!buffer_mapped)
> > >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> > >             /* second iteration, bh2 */
> > >               if (!buffer_mapped)
> > >                 get_block() /* we fail here, could be ENOSPC */
> > >           }
> > >           if (err)
> > >             /*
> > >              * this would zero out all new buffers and mark them uptodate.
> > >              * Since bh1 was never marked new, we skip it here which causes
> > >              * the bug later.
> > >              */
> > >             folio_zero_new_buffers();
> > >       /* ext4_wrte_begin() error handling */
> > >       ext4_truncate_failed_write()
> > >         ext4_truncate()
> > >           ext4_block_truncate_page()
> > >             __ext4_block_zero_page_range()
> > 	>               if(!buffer_uptodate())
> > >                 ext4_read_bh_lock()
> > >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> > >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > > 
> > > 2. The second issue is stale data exposure with page size >= blocksize
> > > with dioread_nolock. The conditions needed for it to happen are same as
> > > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > > is also similar where in __block_write_begin_int() when we call
> > > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > > not new, we never zero out the partial range which is not under write,
> > > thus writing stale data to disk. This can be easily observed with the
> > > following reporducer:
> > > 
> > >  fallocate -l 4k testfile
> > >  xfs_io -c "pwrite 2k 2k" testfile
> > >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> > >  hexdump -C testfile
> > > 
> > > NOTE: To trigger this, we need dioread_nolock enabled and write
> > > happening via ext4_write_begin(), which is usually used when we have -o
> > > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > > ext4_write_begin()).  This will usually happen when FS is almost full
> > > like the way generic/269 was triggering it in Issue 1 above. This might
> > > make this issue harder to replicate hence for reliable replicate, I used
> > > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > > hit the stale data issue 100% of times:
> > > 
> > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > >   if (ext4_should_journal_data(inode))
> > >     return 0;
> > >   /* temporary fix to prevent generic/422 test failures */
> > > - if (!test_opt(inode->i_sb, DELALLOC))
> > > -   return 0;
> > > + // if (!test_opt(inode->i_sb, DELALLOC))
> > > + //  return 0;
> > >   return 1;
> > >  }
> > > 
> > > -------
> > > 
> > > After applying this patch to mark buffer as NEW, both the above issues are
> > > fixed.
> > > 
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Hi Jan, thanks for the review.
> 
> > 
> > Good catch! But I'm wondering whether this is really the right fix. For
> > example in ext4_block_truncate_page() shouldn't we rather be checking
> > whether the buffer isn't unwritten and if yes then bail because there's
> > nothing to zero out in the block? That would seem like a more logical
> > and robust solution of the first problem to me.
> 
> Right, I agree. I will look into it and prepare a patch for this in v2.
> 
> > 
> > Regarding the second issue I agree that using buffer_new flag makes the
> > most sense. But it would make most sense to me to put this special logic
> > directly into ext4_get_block_unwritten() because it is really special logic
> > when preparing buffered write via unwritten extent (and it relies on
> > __block_write_begin_int() logic to interpret buffer_new flag in the right
> > way). Putting in _ext4_get_block() seems confusing to me because it raises
> > questions like why should we set it for reads? And why not set it already
> > in ext4_map_blocks() which is also used by iomap?
> 
> Originally I had kept it there because it didn't seem to affect any read
> related paths, and marking an unwritten buffer as new for zero'ing out
> seemed like the right thing to do irrespective of which code path we
> were coming from. However, I think its okay to move it
> ext4_get_block_unwritten() it seems to be the only place where we need
> to explicitly mark it as such.
> 
> That being said, I also had an alternate solution that marked the map
> flag as NEW in ext4_map_blocks() -> ext4_ext4_map_blocks() ->
> ext4_ext_handle_unwritten_extents(). Do you think it makes more
> sense to handle this issue in ext4 map layer instead of relying on special
> handling of buffer head?
> 
> Yesterday I looked into this a bit more and it seems that all the other
> code paths in ext4, except ext4_da_get_block_prep(), rely on
> ext4_map_blocks() setting the NEW flag correctly in map->m_flags
> whenever the buffer might need to be zeroed out (this is true for dio
> write via iomap as well). Now this makes me incline towards fixing the
> issue in ext4_map_blocks layer, which might be better in the longer for
> eg when we eventually move to iomap.

I was also thinking about this and I'm concerned about the following:
__block_write_begin_int() does:

                if (buffer_new(bh))
                        clear_buffer_new(bh);

before checking for buffer_mapped() flag. So if we mapped the buffer e.g.
in the read path and marked it as new there, then __block_write_begin_int()
will happily clear the new flag and because the buffer is mapped it will
just not bother with calling get_block() again. The buffer_new flag is not
really a buffer state flag but just a special side-band communication
between the ->get_block handler and __block_write_begin_int(). We have
similar communication happening through other bits of b_state in the legacy
direct IO code.

So this mess is specific to __block_write_begin_int() and its handling of
buffer heads. In iomap code we have iomap_block_needs_zeroing() used in
__iomap_write_begin() and unwritten extents do end up being zeroed
automatically regardless of the IOMAP_F_NEW flag.

								Honza
Ojaswin Mujoo Sept. 7, 2023, 7:33 p.m. UTC | #3
On Thu, Sep 07, 2023 at 01:46:30PM +0200, Jan Kara wrote:
> On Thu 07-09-23 13:06:56, Ojaswin Mujoo wrote:
> > On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> > > On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > > > ** Short Version **
> > > > 
> > > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > > > never zero out the range of bh that is not under write, causing whatever stale
> > > > data is present in the folio at that time to be written out to disk. To fix this
> > > > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > > > 
> > > > -----
> > > > ** Long Version **
> > > > 
> > > > The issue mentioned above was resulting in two different bugs:
> > > > 
> > > > 1. On block size < page size case in ext4, generic/269 was reliably
> > > > failing with dioread_nolock. The state of the write was as follows:
> > > > 
> > > >   * The write was extending i_size.
> > > >   * The last block of the file was fallocated and had an unwritten extent
> > > >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> > > >     allocation.
> > > > 
> > > > In this case, the back trace that triggers the bug is as follows:
> > > > 
> > > >   ext4_da_write_begin()
> > > >     /* switch to nodelalloc due to low space */
> > > >     ext4_write_begin()
> > > >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> > > >       __block_write_begin(..., ext4_get_block_unwritten)
> > > >         __block_write_begin_int()
> > > >           for(each buffer head in page) {
> > > >             /* first iteration, this is bh1 which contains i_size */
> > > >             if (!buffer_mapped)
> > > >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> > > >             /* second iteration, bh2 */
> > > >               if (!buffer_mapped)
> > > >                 get_block() /* we fail here, could be ENOSPC */
> > > >           }
> > > >           if (err)
> > > >             /*
> > > >              * this would zero out all new buffers and mark them uptodate.
> > > >              * Since bh1 was never marked new, we skip it here which causes
> > > >              * the bug later.
> > > >              */
> > > >             folio_zero_new_buffers();
> > > >       /* ext4_wrte_begin() error handling */
> > > >       ext4_truncate_failed_write()
> > > >         ext4_truncate()
> > > >           ext4_block_truncate_page()
> > > >             __ext4_block_zero_page_range()
> > > 	>               if(!buffer_uptodate())
> > > >                 ext4_read_bh_lock()
> > > >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> > > >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > > > 
> > > > 2. The second issue is stale data exposure with page size >= blocksize
> > > > with dioread_nolock. The conditions needed for it to happen are same as
> > > > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > > > is also similar where in __block_write_begin_int() when we call
> > > > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > > > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > > > not new, we never zero out the partial range which is not under write,
> > > > thus writing stale data to disk. This can be easily observed with the
> > > > following reporducer:
> > > > 
> > > >  fallocate -l 4k testfile
> > > >  xfs_io -c "pwrite 2k 2k" testfile
> > > >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> > > >  hexdump -C testfile
> > > > 
> > > > NOTE: To trigger this, we need dioread_nolock enabled and write
> > > > happening via ext4_write_begin(), which is usually used when we have -o
> > > > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > > > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > > > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > > > ext4_write_begin()).  This will usually happen when FS is almost full
> > > > like the way generic/269 was triggering it in Issue 1 above. This might
> > > > make this issue harder to replicate hence for reliable replicate, I used
> > > > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > > > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > > > hit the stale data issue 100% of times:
> > > > 
> > > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > > >   if (ext4_should_journal_data(inode))
> > > >     return 0;
> > > >   /* temporary fix to prevent generic/422 test failures */
> > > > - if (!test_opt(inode->i_sb, DELALLOC))
> > > > -   return 0;
> > > > + // if (!test_opt(inode->i_sb, DELALLOC))
> > > > + //  return 0;
> > > >   return 1;
> > > >  }
> > > > 
> > > > -------
> > > > 
> > > > After applying this patch to mark buffer as NEW, both the above issues are
> > > > fixed.
> > > > 
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > 
> > Hi Jan, thanks for the review.
> > 
> > > 
> > > Good catch! But I'm wondering whether this is really the right fix. For
> > > example in ext4_block_truncate_page() shouldn't we rather be checking
> > > whether the buffer isn't unwritten and if yes then bail because there's
> > > nothing to zero out in the block? That would seem like a more logical
> > > and robust solution of the first problem to me.
> > 
> > Right, I agree. I will look into it and prepare a patch for this in v2.
> > 
> > > 
> > > Regarding the second issue I agree that using buffer_new flag makes the
> > > most sense. But it would make most sense to me to put this special logic
> > > directly into ext4_get_block_unwritten() because it is really special logic
> > > when preparing buffered write via unwritten extent (and it relies on
> > > __block_write_begin_int() logic to interpret buffer_new flag in the right
> > > way). Putting in _ext4_get_block() seems confusing to me because it raises
> > > questions like why should we set it for reads? And why not set it already
> > > in ext4_map_blocks() which is also used by iomap?
> > 
> > Originally I had kept it there because it didn't seem to affect any read
> > related paths, and marking an unwritten buffer as new for zero'ing out
> > seemed like the right thing to do irrespective of which code path we
> > were coming from. However, I think its okay to move it
> > ext4_get_block_unwritten() it seems to be the only place where we need
> > to explicitly mark it as such.
> > 
> > That being said, I also had an alternate solution that marked the map
> > flag as NEW in ext4_map_blocks() -> ext4_ext4_map_blocks() ->
> > ext4_ext_handle_unwritten_extents(). Do you think it makes more
> > sense to handle this issue in ext4 map layer instead of relying on special
> > handling of buffer head?
> > 
> > Yesterday I looked into this a bit more and it seems that all the other
> > code paths in ext4, except ext4_da_get_block_prep(), rely on
> > ext4_map_blocks() setting the NEW flag correctly in map->m_flags
> > whenever the buffer might need to be zeroed out (this is true for dio
> > write via iomap as well). Now this makes me incline towards fixing the
> > issue in ext4_map_blocks layer, which might be better in the longer for
> > eg when we eventually move to iomap.
> 
> I was also thinking about this and I'm concerned about the following:
> __block_write_begin_int() does:
> 
>                 if (buffer_new(bh))
>                         clear_buffer_new(bh);
> 
> before checking for buffer_mapped() flag. So if we mapped the buffer e.g.
> in the read path and marked it as new there, then __block_write_begin_int()
> will happily clear the new flag and because the buffer is mapped it will
> just not bother with calling get_block() again. The buffer_new flag is not

So a question here, if we mark a buffer mapped while reading, then we
don't really need the new flag on it right? Since it'll already have
valid data, in which case it shouldn't matter if
__block_write_begin_int() clears the flag.

> really a buffer state flag but just a special side-band communication
> between the ->get_block handler and __block_write_begin_int(). We have
> similar communication happening through other bits of b_state in the legacy
> direct IO code.
> 
> So this mess is specific to __block_write_begin_int() and its handling of
> buffer heads. In iomap code we have iomap_block_needs_zeroing() used in
> __iomap_write_begin() and unwritten extents do end up being zeroed
> automatically regardless of the IOMAP_F_NEW flag.

So basically when to zero out is communicated to
__block_write_begin_int() via the buffer head new flag irrespective of
whether the extent itself is "new" or not (that is map flags has new).
Hence, the buffer being new (needing zeroing) is really something
get_blocks should figure out and communicate to buffer handling layer. 

Thanks for explaining this. Going through all this makes me feel the
whole interaction between __block_write_begin_int() -> get_blocks() ->
ext4_map_blocks() is kinda confusing/fragile and each layer has several
implicit assumptions about how the others will behave.

Also, just bouncing some ideas here. Why is it that
__block_write_begin_int() only considers buffer_new() when deciding to
zero out? Shouldn't we zero out when the buffer is unwritten as well?
That way we could avoid all the special logic of marking the buffer as
new whenever it is unwritten, as seen in this patch and in
ext4_da_get_block_prep().

Thanks,
ojaswin
Jan Kara Sept. 8, 2023, 8:38 a.m. UTC | #4
On Fri 08-09-23 01:03:35, Ojaswin Mujoo wrote:
> On Thu, Sep 07, 2023 at 01:46:30PM +0200, Jan Kara wrote:
> > On Thu 07-09-23 13:06:56, Ojaswin Mujoo wrote:
> > > On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> > > > On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > > > > ** Short Version **
> > > > > 
> > > > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > > > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > > > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > > > > never zero out the range of bh that is not under write, causing whatever stale
> > > > > data is present in the folio at that time to be written out to disk. To fix this
> > > > > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > > > > 
> > > > > -----
> > > > > ** Long Version **
> > > > > 
> > > > > The issue mentioned above was resulting in two different bugs:
> > > > > 
> > > > > 1. On block size < page size case in ext4, generic/269 was reliably
> > > > > failing with dioread_nolock. The state of the write was as follows:
> > > > > 
> > > > >   * The write was extending i_size.
> > > > >   * The last block of the file was fallocated and had an unwritten extent
> > > > >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> > > > >     allocation.
> > > > > 
> > > > > In this case, the back trace that triggers the bug is as follows:
> > > > > 
> > > > >   ext4_da_write_begin()
> > > > >     /* switch to nodelalloc due to low space */
> > > > >     ext4_write_begin()
> > > > >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> > > > >       __block_write_begin(..., ext4_get_block_unwritten)
> > > > >         __block_write_begin_int()
> > > > >           for(each buffer head in page) {
> > > > >             /* first iteration, this is bh1 which contains i_size */
> > > > >             if (!buffer_mapped)
> > > > >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> > > > >             /* second iteration, bh2 */
> > > > >               if (!buffer_mapped)
> > > > >                 get_block() /* we fail here, could be ENOSPC */
> > > > >           }
> > > > >           if (err)
> > > > >             /*
> > > > >              * this would zero out all new buffers and mark them uptodate.
> > > > >              * Since bh1 was never marked new, we skip it here which causes
> > > > >              * the bug later.
> > > > >              */
> > > > >             folio_zero_new_buffers();
> > > > >       /* ext4_wrte_begin() error handling */
> > > > >       ext4_truncate_failed_write()
> > > > >         ext4_truncate()
> > > > >           ext4_block_truncate_page()
> > > > >             __ext4_block_zero_page_range()
> > > > 	>               if(!buffer_uptodate())
> > > > >                 ext4_read_bh_lock()
> > > > >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> > > > >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > > > > 
> > > > > 2. The second issue is stale data exposure with page size >= blocksize
> > > > > with dioread_nolock. The conditions needed for it to happen are same as
> > > > > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > > > > is also similar where in __block_write_begin_int() when we call
> > > > > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > > > > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > > > > not new, we never zero out the partial range which is not under write,
> > > > > thus writing stale data to disk. This can be easily observed with the
> > > > > following reporducer:
> > > > > 
> > > > >  fallocate -l 4k testfile
> > > > >  xfs_io -c "pwrite 2k 2k" testfile
> > > > >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> > > > >  hexdump -C testfile
> > > > > 
> > > > > NOTE: To trigger this, we need dioread_nolock enabled and write
> > > > > happening via ext4_write_begin(), which is usually used when we have -o
> > > > > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > > > > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > > > > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > > > > ext4_write_begin()).  This will usually happen when FS is almost full
> > > > > like the way generic/269 was triggering it in Issue 1 above. This might
> > > > > make this issue harder to replicate hence for reliable replicate, I used
> > > > > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > > > > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > > > > hit the stale data issue 100% of times:
> > > > > 
> > > > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > > > >   if (ext4_should_journal_data(inode))
> > > > >     return 0;
> > > > >   /* temporary fix to prevent generic/422 test failures */
> > > > > - if (!test_opt(inode->i_sb, DELALLOC))
> > > > > -   return 0;
> > > > > + // if (!test_opt(inode->i_sb, DELALLOC))
> > > > > + //  return 0;
> > > > >   return 1;
> > > > >  }
> > > > > 
> > > > > -------
> > > > > 
> > > > > After applying this patch to mark buffer as NEW, both the above issues are
> > > > > fixed.
> > > > > 
> > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > 
> > > Hi Jan, thanks for the review.
> > > 
> > > > 
> > > > Good catch! But I'm wondering whether this is really the right fix. For
> > > > example in ext4_block_truncate_page() shouldn't we rather be checking
> > > > whether the buffer isn't unwritten and if yes then bail because there's
> > > > nothing to zero out in the block? That would seem like a more logical
> > > > and robust solution of the first problem to me.
> > > 
> > > Right, I agree. I will look into it and prepare a patch for this in v2.
> > > 
> > > > 
> > > > Regarding the second issue I agree that using buffer_new flag makes the
> > > > most sense. But it would make most sense to me to put this special logic
> > > > directly into ext4_get_block_unwritten() because it is really special logic
> > > > when preparing buffered write via unwritten extent (and it relies on
> > > > __block_write_begin_int() logic to interpret buffer_new flag in the right
> > > > way). Putting in _ext4_get_block() seems confusing to me because it raises
> > > > questions like why should we set it for reads? And why not set it already
> > > > in ext4_map_blocks() which is also used by iomap?
> > > 
> > > Originally I had kept it there because it didn't seem to affect any read
> > > related paths, and marking an unwritten buffer as new for zero'ing out
> > > seemed like the right thing to do irrespective of which code path we
> > > were coming from. However, I think its okay to move it
> > > ext4_get_block_unwritten() it seems to be the only place where we need
> > > to explicitly mark it as such.
> > > 
> > > That being said, I also had an alternate solution that marked the map
> > > flag as NEW in ext4_map_blocks() -> ext4_ext4_map_blocks() ->
> > > ext4_ext_handle_unwritten_extents(). Do you think it makes more
> > > sense to handle this issue in ext4 map layer instead of relying on special
> > > handling of buffer head?
> > > 
> > > Yesterday I looked into this a bit more and it seems that all the other
> > > code paths in ext4, except ext4_da_get_block_prep(), rely on
> > > ext4_map_blocks() setting the NEW flag correctly in map->m_flags
> > > whenever the buffer might need to be zeroed out (this is true for dio
> > > write via iomap as well). Now this makes me incline towards fixing the
> > > issue in ext4_map_blocks layer, which might be better in the longer for
> > > eg when we eventually move to iomap.
> > 
> > I was also thinking about this and I'm concerned about the following:
> > __block_write_begin_int() does:
> > 
> >                 if (buffer_new(bh))
> >                         clear_buffer_new(bh);
> > 
> > before checking for buffer_mapped() flag. So if we mapped the buffer e.g.
> > in the read path and marked it as new there, then __block_write_begin_int()
> > will happily clear the new flag and because the buffer is mapped it will
> > just not bother with calling get_block() again. The buffer_new flag is not
> 
> So a question here, if we mark a buffer mapped while reading, then we
> don't really need the new flag on it right? Since it'll already have
> valid data, in which case it shouldn't matter if
> __block_write_begin_int() clears the flag.

OK, that is true so this shouldn't generally be a problem. I've now also
realized another related quirk of the ->get_block functions and that is
that on read, unwritten buffers have to be returned back from ->get_block
as !buffer_mapped so that block_read_full_page() properly zeroes the buffer
content.

> > really a buffer state flag but just a special side-band communication
> > between the ->get_block handler and __block_write_begin_int(). We have
> > similar communication happening through other bits of b_state in the legacy
> > direct IO code.
> > 
> > So this mess is specific to __block_write_begin_int() and its handling of
> > buffer heads. In iomap code we have iomap_block_needs_zeroing() used in
> > __iomap_write_begin() and unwritten extents do end up being zeroed
> > automatically regardless of the IOMAP_F_NEW flag.
> 
> So basically when to zero out is communicated to
> __block_write_begin_int() via the buffer head new flag irrespective of
> whether the extent itself is "new" or not (that is map flags has new).
> Hence, the buffer being new (needing zeroing) is really something
> get_blocks should figure out and communicate to buffer handling layer. 

Yes.
 
> Thanks for explaining this. Going through all this makes me feel the
> whole interaction between __block_write_begin_int() -> get_blocks() ->
> ext4_map_blocks() is kinda confusing/fragile and each layer has several
> implicit assumptions about how the others will behave.

Agreed. That is one of the reasons why iomap was created :). Because these
interactions became unmaintainable mess.

> Also, just bouncing some ideas here. Why is it that
> __block_write_begin_int() only considers buffer_new() when deciding to
> zero out? Shouldn't we zero out when the buffer is unwritten as well?
> That way we could avoid all the special logic of marking the buffer as
> new whenever it is unwritten, as seen in this patch and in
> ext4_da_get_block_prep().

__block_write_begin_int() (at least the core of its functionality) comes
from the times when the buffer layer didn't handle unwritten buffers at
all. Then unwritten buffers and delayed allocation was bolted on top
and the design decisions were ... suboptimal ... at some points. Well,
everybody can be clever when the problems become obvious :). Now we could
obviously rewrite buffer head handling to a saner state but since buffer
heads have also other issues, the energy is much better invested into
conversion of filesystems into iomap. That's why I'd go for a minimal patch
to fix the data corruption bug.

								Honza
Ojaswin Mujoo Sept. 9, 2023, 7:25 a.m. UTC | #5
On Fri, Sep 08, 2023 at 10:38:05AM +0200, Jan Kara wrote:
> On Fri 08-09-23 01:03:35, Ojaswin Mujoo wrote:
> > On Thu, Sep 07, 2023 at 01:46:30PM +0200, Jan Kara wrote:
> > > On Thu 07-09-23 13:06:56, Ojaswin Mujoo wrote:
> > > > On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> > > > > On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > > > > > ** Short Version **
> > > > > > 
> > > > > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > > > > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > > > > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > > > > > never zero out the range of bh that is not under write, causing whatever stale
> > > > > > data is present in the folio at that time to be written out to disk. To fix this
> > > > > > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > > > > > 
> > > > > > -----
> > > > > > ** Long Version **
> > > > > > 
> > > > > > The issue mentioned above was resulting in two different bugs:
> > > > > > 
> > > > > > 1. On block size < page size case in ext4, generic/269 was reliably
> > > > > > failing with dioread_nolock. The state of the write was as follows:
> > > > > > 
> > > > > >   * The write was extending i_size.
> > > > > >   * The last block of the file was fallocated and had an unwritten extent
> > > > > >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> > > > > >     allocation.
> > > > > > 
> > > > > > In this case, the back trace that triggers the bug is as follows:
> > > > > > 
> > > > > >   ext4_da_write_begin()
> > > > > >     /* switch to nodelalloc due to low space */
> > > > > >     ext4_write_begin()
> > > > > >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> > > > > >       __block_write_begin(..., ext4_get_block_unwritten)
> > > > > >         __block_write_begin_int()
> > > > > >           for(each buffer head in page) {
> > > > > >             /* first iteration, this is bh1 which contains i_size */
> > > > > >             if (!buffer_mapped)
> > > > > >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> > > > > >             /* second iteration, bh2 */
> > > > > >               if (!buffer_mapped)
> > > > > >                 get_block() /* we fail here, could be ENOSPC */
> > > > > >           }
> > > > > >           if (err)
> > > > > >             /*
> > > > > >              * this would zero out all new buffers and mark them uptodate.
> > > > > >              * Since bh1 was never marked new, we skip it here which causes
> > > > > >              * the bug later.
> > > > > >              */
> > > > > >             folio_zero_new_buffers();
> > > > > >       /* ext4_wrte_begin() error handling */
> > > > > >       ext4_truncate_failed_write()
> > > > > >         ext4_truncate()
> > > > > >           ext4_block_truncate_page()
> > > > > >             __ext4_block_zero_page_range()
> > > > > 	>               if(!buffer_uptodate())
> > > > > >                 ext4_read_bh_lock()
> > > > > >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> > > > > >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > > > > > 
> > > > > > 2. The second issue is stale data exposure with page size >= blocksize
> > > > > > with dioread_nolock. The conditions needed for it to happen are same as
> > > > > > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > > > > > is also similar where in __block_write_begin_int() when we call
> > > > > > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > > > > > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > > > > > not new, we never zero out the partial range which is not under write,
> > > > > > thus writing stale data to disk. This can be easily observed with the
> > > > > > following reporducer:
> > > > > > 
> > > > > >  fallocate -l 4k testfile
> > > > > >  xfs_io -c "pwrite 2k 2k" testfile
> > > > > >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> > > > > >  hexdump -C testfile
> > > > > > 
> > > > > > NOTE: To trigger this, we need dioread_nolock enabled and write
> > > > > > happening via ext4_write_begin(), which is usually used when we have -o
> > > > > > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > > > > > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > > > > > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > > > > > ext4_write_begin()).  This will usually happen when FS is almost full
> > > > > > like the way generic/269 was triggering it in Issue 1 above. This might
> > > > > > make this issue harder to replicate hence for reliable replicate, I used
> > > > > > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > > > > > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > > > > > hit the stale data issue 100% of times:
> > > > > > 
> > > > > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > > > > >   if (ext4_should_journal_data(inode))
> > > > > >     return 0;
> > > > > >   /* temporary fix to prevent generic/422 test failures */
> > > > > > - if (!test_opt(inode->i_sb, DELALLOC))
> > > > > > -   return 0;
> > > > > > + // if (!test_opt(inode->i_sb, DELALLOC))
> > > > > > + //  return 0;
> > > > > >   return 1;
> > > > > >  }
> > > > > > 
> > > > > > -------
> > > > > > 
> > > > > > After applying this patch to mark buffer as NEW, both the above issues are
> > > > > > fixed.
> > > > > > 
> > > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > > 
> > > > Hi Jan, thanks for the review.
> > > > 
> > > > > 
> > > > > Good catch! But I'm wondering whether this is really the right fix. For
> > > > > example in ext4_block_truncate_page() shouldn't we rather be checking
> > > > > whether the buffer isn't unwritten and if yes then bail because there's
> > > > > nothing to zero out in the block? That would seem like a more logical
> > > > > and robust solution of the first problem to me.
> > > > 
> > > > Right, I agree. I will look into it and prepare a patch for this in v2.
> > > > 
> > > > > 
> > > > > Regarding the second issue I agree that using buffer_new flag makes the
> > > > > most sense. But it would make most sense to me to put this special logic
> > > > > directly into ext4_get_block_unwritten() because it is really special logic
> > > > > when preparing buffered write via unwritten extent (and it relies on
> > > > > __block_write_begin_int() logic to interpret buffer_new flag in the right
> > > > > way). Putting in _ext4_get_block() seems confusing to me because it raises
> > > > > questions like why should we set it for reads? And why not set it already
> > > > > in ext4_map_blocks() which is also used by iomap?
> > > > 
> > > > Originally I had kept it there because it didn't seem to affect any read
> > > > related paths, and marking an unwritten buffer as new for zero'ing out
> > > > seemed like the right thing to do irrespective of which code path we
> > > > were coming from. However, I think its okay to move it
> > > > ext4_get_block_unwritten() it seems to be the only place where we need
> > > > to explicitly mark it as such.
> > > > 
> > > > That being said, I also had an alternate solution that marked the map
> > > > flag as NEW in ext4_map_blocks() -> ext4_ext4_map_blocks() ->
> > > > ext4_ext_handle_unwritten_extents(). Do you think it makes more
> > > > sense to handle this issue in ext4 map layer instead of relying on special
> > > > handling of buffer head?
> > > > 
> > > > Yesterday I looked into this a bit more and it seems that all the other
> > > > code paths in ext4, except ext4_da_get_block_prep(), rely on
> > > > ext4_map_blocks() setting the NEW flag correctly in map->m_flags
> > > > whenever the buffer might need to be zeroed out (this is true for dio
> > > > write via iomap as well). Now this makes me incline towards fixing the
> > > > issue in ext4_map_blocks layer, which might be better in the longer for
> > > > eg when we eventually move to iomap.
> > > 
> > > I was also thinking about this and I'm concerned about the following:
> > > __block_write_begin_int() does:
> > > 
> > >                 if (buffer_new(bh))
> > >                         clear_buffer_new(bh);
> > > 
> > > before checking for buffer_mapped() flag. So if we mapped the buffer e.g.
> > > in the read path and marked it as new there, then __block_write_begin_int()
> > > will happily clear the new flag and because the buffer is mapped it will
> > > just not bother with calling get_block() again. The buffer_new flag is not
> > 
> > So a question here, if we mark a buffer mapped while reading, then we
> > don't really need the new flag on it right? Since it'll already have
> > valid data, in which case it shouldn't matter if
> > __block_write_begin_int() clears the flag.
> 
> OK, that is true so this shouldn't generally be a problem. I've now also
> realized another related quirk of the ->get_block functions and that is
> that on read, unwritten buffers have to be returned back from ->get_block
> as !buffer_mapped so that block_read_full_page() properly zeroes the buffer
> content.

Yes, and the same extent lookup will return buffer_mapped and
buffer_unwritten in other code paths like write with dioread_nolock :)

> 
> > > really a buffer state flag but just a special side-band communication
> > > between the ->get_block handler and __block_write_begin_int(). We have
> > > similar communication happening through other bits of b_state in the legacy
> > > direct IO code.
> > > 
> > > So this mess is specific to __block_write_begin_int() and its handling of
> > > buffer heads. In iomap code we have iomap_block_needs_zeroing() used in
> > > __iomap_write_begin() and unwritten extents do end up being zeroed
> > > automatically regardless of the IOMAP_F_NEW flag.
> > 
> > So basically when to zero out is communicated to
> > __block_write_begin_int() via the buffer head new flag irrespective of
> > whether the extent itself is "new" or not (that is map flags has new).
> > Hence, the buffer being new (needing zeroing) is really something
> > get_blocks should figure out and communicate to buffer handling layer. 
> 
> Yes.
>  
> > Thanks for explaining this. Going through all this makes me feel the
> > whole interaction between __block_write_begin_int() -> get_blocks() ->
> > ext4_map_blocks() is kinda confusing/fragile and each layer has several
> > implicit assumptions about how the others will behave.
> 
> Agreed. That is one of the reasons why iomap was created :). Because these
> interactions became unmaintainable mess.
> 
> > Also, just bouncing some ideas here. Why is it that
> > __block_write_begin_int() only considers buffer_new() when deciding to
> > zero out? Shouldn't we zero out when the buffer is unwritten as well?
> > That way we could avoid all the special logic of marking the buffer as
> > new whenever it is unwritten, as seen in this patch and in
> > ext4_da_get_block_prep().
> 
> __block_write_begin_int() (at least the core of its functionality) comes
> from the times when the buffer layer didn't handle unwritten buffers at
> all. Then unwritten buffers and delayed allocation was bolted on top
> and the design decisions were ... suboptimal ... at some points. Well,
> everybody can be clever when the problems become obvious :). Now we could
> obviously rewrite buffer head handling to a saner state but since buffer
> heads have also other issues, the energy is much better invested into
> conversion of filesystems into iomap. That's why I'd go for a minimal patch
> to fix the data corruption bug.

Got it, thanks for the background Jan. I'll move ahead with the changes
you suggested for v2.

Regards,
ojaswin

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Ojaswin Mujoo Sept. 14, 2023, 11:54 a.m. UTC | #6
On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > ** Short Version **
> > 
> > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > never zero out the range of bh that is not under write, causing whatever stale
> > data is present in the folio at that time to be written out to disk. To fix this
> > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > 
> > -----
> > ** Long Version **
> > 
> > The issue mentioned above was resulting in two different bugs:
> > 
> > 1. On block size < page size case in ext4, generic/269 was reliably
> > failing with dioread_nolock. The state of the write was as follows:
> > 
> >   * The write was extending i_size.
> >   * The last block of the file was fallocated and had an unwritten extent
> >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> >     allocation.
> > 
> > In this case, the back trace that triggers the bug is as follows:
> > 
> >   ext4_da_write_begin()
> >     /* switch to nodelalloc due to low space */
> >     ext4_write_begin()
> >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> >       __block_write_begin(..., ext4_get_block_unwritten)
> >         __block_write_begin_int()
> >           for(each buffer head in page) {
> >             /* first iteration, this is bh1 which contains i_size */
> >             if (!buffer_mapped)
> >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> >             /* second iteration, bh2 */
> >               if (!buffer_mapped)
> >                 get_block() /* we fail here, could be ENOSPC */
> >           }
> >           if (err)
> >             /*
> >              * this would zero out all new buffers and mark them uptodate.
> >              * Since bh1 was never marked new, we skip it here which causes
> >              * the bug later.
> >              */
> >             folio_zero_new_buffers();
> >       /* ext4_wrte_begin() error handling */
> >       ext4_truncate_failed_write()
> >         ext4_truncate()
> >           ext4_block_truncate_page()
> >             __ext4_block_zero_page_range()
> 	>               if(!buffer_uptodate())
> >                 ext4_read_bh_lock()
> >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > 
> > 2. The second issue is stale data exposure with page size >= blocksize
> > with dioread_nolock. The conditions needed for it to happen are same as
> > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > is also similar where in __block_write_begin_int() when we call
> > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > not new, we never zero out the partial range which is not under write,
> > thus writing stale data to disk. This can be easily observed with the
> > following reporducer:
> > 
> >  fallocate -l 4k testfile
> >  xfs_io -c "pwrite 2k 2k" testfile
> >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> >  hexdump -C testfile
> > 
> > NOTE: To trigger this, we need dioread_nolock enabled and write
> > happening via ext4_write_begin(), which is usually used when we have -o
> > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > ext4_write_begin()).  This will usually happen when FS is almost full
> > like the way generic/269 was triggering it in Issue 1 above. This might
> > make this issue harder to replicate hence for reliable replicate, I used
> > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > hit the stale data issue 100% of times:
> > 
> > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> >   if (ext4_should_journal_data(inode))
> >     return 0;
> >   /* temporary fix to prevent generic/422 test failures */
> > - if (!test_opt(inode->i_sb, DELALLOC))
> > -   return 0;
> > + // if (!test_opt(inode->i_sb, DELALLOC))
> > + //  return 0;
> >   return 1;
> >  }
> > 
> > -------
> > 
> > After applying this patch to mark buffer as NEW, both the above issues are
> > fixed.
> > 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Good catch! But I'm wondering whether this is really the right fix. For
> example in ext4_block_truncate_page() shouldn't we rather be checking
> whether the buffer isn't unwritten and if yes then bail because there's
> nothing to zero out in the block? That would seem like a more logical
> and robust solution of the first problem to me.

Hey Jan,

So I was looking into this to understand the code paths and it seems
like ext4_truncate doesn't really impose that a unwritten buffer does
not have any data in its corresponding folio, which might sometimes be
the case. 

For example, imagine a case where we get the last block of a file via
ext4_da_get_block_prep() which returns a bh that is unwritten, mapped
and new. During the write, we'll copy data in this folio and then 
adjust i_size in write_end, release the folio lock and ultimately the
inode_lock().

In this intermediate state, before writeback happens, the buffer is
unwritten but has data which will be written. At this point, if we call
ext4_block_truncate_page() and have the logic to exit early for bh_unwritten, the
we will never actually zero out the folio which might cause stale data to be
written during writeback (?)

Now, most of the calls to ext4_block_truncate_page() happen via ext4_truncate ( like via ext4_setattr,
ext4_truncate_failed_write() etc) call truncate_inode_pages() which
seems to handle zeroing the partial folio beyond i_size. However, if we
add the logic to skip unwritten blocks in this function then:

1. We create an implicit dependency in ext4_block_truncate_page() that the folio
needs to not have any data if its unwritten ie some other function has
taken care of by some other function called before it.

2. Additionally, that other function will also need to mark the relevant buffer dirty,
which is done in this function.

3. There are some other paths that call ext4_block_truncate_page()
without turncating the pagecache like ext4_zero_range(). Im not sure if
this will cause any issues as I've not gone through the function
completely though but yes, other functions that later call truncate
in future might need to keep this implicit dependency in mind.

This just makes me think that adding this particular if() to skip
bh_unwritten might not be as straightforward as I was initially thinking
it to be. I think for now I'll atleast post the patch to mark buffer_new
so that the immediate stale data issue and this bug on is taken care
of. This particular patch might need more through of the existing call
sites.

Let me know if I miss understood something here or if you have any inputs on this.

Thanks again!
Ojaswin

> 
> Regarding the second issue I agree that using buffer_new flag makes the
> most sense. But it would make most sense to me to put this special logic
> directly into ext4_get_block_unwritten() because it is really special logic
> when preparing buffered write via unwritten extent (and it relies on
> __block_write_begin_int() logic to interpret buffer_new flag in the right
> way). Putting in _ext4_get_block() seems confusing to me because it raises
> questions like why should we set it for reads? And why not set it already
> in ext4_map_blocks() which is also used by iomap?
> 
> 								Honza
> 
> 
> > ---
> >  fs/ext4/inode.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 6c490f05e2ba..a30bfec0b525 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -765,6 +765,10 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
> >  	if (ret > 0) {
> >  		map_bh(bh, inode->i_sb, map.m_pblk);
> >  		ext4_update_bh_state(bh, map.m_flags);
> > +
> > +		if (buffer_unwritten(bh))
> > +			set_buffer_new(bh);
> > +
> >  		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
> >  		ret = 0;
> >  	} else if (ret == 0) {
> > -- 
> > 2.31.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Sept. 14, 2023, 2:19 p.m. UTC | #7
Hello Ojaswin,

On Thu 14-09-23 17:24:52, Ojaswin Mujoo wrote:
> On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> > On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > > ** Short Version **
> > > 
> > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > > never zero out the range of bh that is not under write, causing whatever stale
> > > data is present in the folio at that time to be written out to disk. To fix this
> > > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > > 
> > > -----
> > > ** Long Version **
> > > 
> > > The issue mentioned above was resulting in two different bugs:
> > > 
> > > 1. On block size < page size case in ext4, generic/269 was reliably
> > > failing with dioread_nolock. The state of the write was as follows:
> > > 
> > >   * The write was extending i_size.
> > >   * The last block of the file was fallocated and had an unwritten extent
> > >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> > >     allocation.
> > > 
> > > In this case, the back trace that triggers the bug is as follows:
> > > 
> > >   ext4_da_write_begin()
> > >     /* switch to nodelalloc due to low space */
> > >     ext4_write_begin()
> > >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> > >       __block_write_begin(..., ext4_get_block_unwritten)
> > >         __block_write_begin_int()
> > >           for(each buffer head in page) {
> > >             /* first iteration, this is bh1 which contains i_size */
> > >             if (!buffer_mapped)
> > >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> > >             /* second iteration, bh2 */
> > >               if (!buffer_mapped)
> > >                 get_block() /* we fail here, could be ENOSPC */
> > >           }
> > >           if (err)
> > >             /*
> > >              * this would zero out all new buffers and mark them uptodate.
> > >              * Since bh1 was never marked new, we skip it here which causes
> > >              * the bug later.
> > >              */
> > >             folio_zero_new_buffers();
> > >       /* ext4_wrte_begin() error handling */
> > >       ext4_truncate_failed_write()
> > >         ext4_truncate()
> > >           ext4_block_truncate_page()
> > >             __ext4_block_zero_page_range()
> > 	>               if(!buffer_uptodate())
> > >                 ext4_read_bh_lock()
> > >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> > >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > > 
> > > 2. The second issue is stale data exposure with page size >= blocksize
> > > with dioread_nolock. The conditions needed for it to happen are same as
> > > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > > is also similar where in __block_write_begin_int() when we call
> > > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > > not new, we never zero out the partial range which is not under write,
> > > thus writing stale data to disk. This can be easily observed with the
> > > following reporducer:
> > > 
> > >  fallocate -l 4k testfile
> > >  xfs_io -c "pwrite 2k 2k" testfile
> > >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> > >  hexdump -C testfile
> > > 
> > > NOTE: To trigger this, we need dioread_nolock enabled and write
> > > happening via ext4_write_begin(), which is usually used when we have -o
> > > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > > ext4_write_begin()).  This will usually happen when FS is almost full
> > > like the way generic/269 was triggering it in Issue 1 above. This might
> > > make this issue harder to replicate hence for reliable replicate, I used
> > > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > > hit the stale data issue 100% of times:
> > > 
> > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > >   if (ext4_should_journal_data(inode))
> > >     return 0;
> > >   /* temporary fix to prevent generic/422 test failures */
> > > - if (!test_opt(inode->i_sb, DELALLOC))
> > > -   return 0;
> > > + // if (!test_opt(inode->i_sb, DELALLOC))
> > > + //  return 0;
> > >   return 1;
> > >  }
> > > 
> > > -------
> > > 
> > > After applying this patch to mark buffer as NEW, both the above issues are
> > > fixed.
> > > 
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > 
> > Good catch! But I'm wondering whether this is really the right fix. For
> > example in ext4_block_truncate_page() shouldn't we rather be checking
> > whether the buffer isn't unwritten and if yes then bail because there's
> > nothing to zero out in the block? That would seem like a more logical
> > and robust solution of the first problem to me.
> 
> So I was looking into this to understand the code paths and it seems
> like ext4_truncate doesn't really impose that a unwritten buffer does
> not have any data in its corresponding folio, which might sometimes be
> the case. 
> 
> For example, imagine a case where we get the last block of a file via
> ext4_da_get_block_prep() which returns a bh that is unwritten, mapped
> and new. During the write, we'll copy data in this folio and then 
> adjust i_size in write_end, release the folio lock and ultimately the
> inode_lock().
> 
> In this intermediate state, before writeback happens, the buffer is
> unwritten but has data which will be written. At this point, if we call
> ext4_block_truncate_page() and have the logic to exit early for bh_unwritten, the
> we will never actually zero out the folio which might cause stale data to be
> written during writeback (?)

Actually we will. truncate_inode_pages_range() ends up calling
truncate_inode_partial_folio() which zeros out the tail of the partial
page. I think you are confusing two different things. One is zeroing of
partial page cache pages - that is generally handled by the generic
truncate code - and another one is zeroing of on-disk partial blocks - that
is handled by the filesystem itself. The contents on on-disk blocks does
not matter as long as they are marked as unwritten in the extent tree
(their contents is random anyway) and therefore __ext4_block_zero_page_range()
has nothing to do in that case.

> Now, most of the calls to ext4_block_truncate_page() happen via ext4_truncate ( like via ext4_setattr,
> ext4_truncate_failed_write() etc) call truncate_inode_pages() which
> seems to handle zeroing the partial folio beyond i_size. However, if we
> add the logic to skip unwritten blocks in this function then:
> 
> 1. We create an implicit dependency in ext4_block_truncate_page() that
> the folio needs to not have any data if its unwritten ie some other
> function has taken care of by some other function called before it.

Yes, this dependency already exists today because when blocksize < pagesize
the zeroing happening in __ext4_block_zero_page_range() may be a subset of
what gets zeroed by truncate_inode_partial_folio(). Still we zero the page
in __ext4_block_zero_page_range() for the case when the page was not cached
at all and we've just loaded it from the disk.

> 2. Additionally, that other function will also need to mark the relevant
> buffer dirty, which is done in this function.

AFAICT there's no need to mark the buffer dirty - the whole point is we
don't need to touch the on-disk contents if the block is unwritten...

> 3. There are some other paths that call ext4_block_truncate_page()
> without turncating the pagecache like ext4_zero_range(). Im not sure if
> this will cause any issues as I've not gone through the function
> completely though but yes, other functions that later call truncate
> in future might need to keep this implicit dependency in mind.

Indeed, this is a good catch. So we either need to make both sites calling
ext4_zero_partial_blocks() to use truncate_pagecache_range() for the whole
range including partial blocks or we need to zero out the page cache
(without bringing the page uptodate or marking it dirty) in
ext4_zero_partial_blocks() even if the buffer is unwritten. I don't have a
strong preference either way.

								Honza
Ojaswin Mujoo Sept. 16, 2023, 11:02 a.m. UTC | #8
On Thu, Sep 14, 2023 at 04:19:20PM +0200, Jan Kara wrote:
> Hello Ojaswin,
> 
> On Thu 14-09-23 17:24:52, Ojaswin Mujoo wrote:
> > On Tue, Sep 05, 2023 at 03:56:29PM +0200, Jan Kara wrote:
> > > On Tue 05-09-23 15:58:01, Ojaswin Mujoo wrote:
> > > > ** Short Version **
> > > > 
> > > > In ext4 with dioread_nolock, we could have a scenario where the bh returned by
> > > > get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has
> > > > UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we
> > > > never zero out the range of bh that is not under write, causing whatever stale
> > > > data is present in the folio at that time to be written out to disk. To fix this
> > > > mark the buffer as new in _ext4_get_block(), in case it is unwritten.
> > > > 
> > > > -----
> > > > ** Long Version **
> > > > 
> > > > The issue mentioned above was resulting in two different bugs:
> > > > 
> > > > 1. On block size < page size case in ext4, generic/269 was reliably
> > > > failing with dioread_nolock. The state of the write was as follows:
> > > > 
> > > >   * The write was extending i_size.
> > > >   * The last block of the file was fallocated and had an unwritten extent
> > > >   * We were near ENOSPC and hence we were switching to non-delayed alloc
> > > >     allocation.
> > > > 
> > > > In this case, the back trace that triggers the bug is as follows:
> > > > 
> > > >   ext4_da_write_begin()
> > > >     /* switch to nodelalloc due to low space */
> > > >     ext4_write_begin()
> > > >       ext4_should_dioread_nolock() // true since mount flags still have delalloc
> > > >       __block_write_begin(..., ext4_get_block_unwritten)
> > > >         __block_write_begin_int()
> > > >           for(each buffer head in page) {
> > > >             /* first iteration, this is bh1 which contains i_size */
> > > >             if (!buffer_mapped)
> > > >               get_block() /* returns bh with only UNWRITTEN and MAPPED */
> > > >             /* second iteration, bh2 */
> > > >               if (!buffer_mapped)
> > > >                 get_block() /* we fail here, could be ENOSPC */
> > > >           }
> > > >           if (err)
> > > >             /*
> > > >              * this would zero out all new buffers and mark them uptodate.
> > > >              * Since bh1 was never marked new, we skip it here which causes
> > > >              * the bug later.
> > > >              */
> > > >             folio_zero_new_buffers();
> > > >       /* ext4_wrte_begin() error handling */
> > > >       ext4_truncate_failed_write()
> > > >         ext4_truncate()
> > > >           ext4_block_truncate_page()
> > > >             __ext4_block_zero_page_range()
> > > 	>               if(!buffer_uptodate())
> > > >                 ext4_read_bh_lock()
> > > >                   ext4_read_bh() -> ... ext4_submit_bh_wbc()
> > > >                     BUG_ON(buffer_unwritten(bh)); /* !!! */
> > > > 
> > > > 2. The second issue is stale data exposure with page size >= blocksize
> > > > with dioread_nolock. The conditions needed for it to happen are same as
> > > > the previous issue ie dioread_nolock around ENOSPC condition. The issue
> > > > is also similar where in __block_write_begin_int() when we call
> > > > ext4_get_block_unwritten() on the buffer_head and the underlying extent
> > > > is unwritten, we get an unwritten and mapped buffer head. Since it is
> > > > not new, we never zero out the partial range which is not under write,
> > > > thus writing stale data to disk. This can be easily observed with the
> > > > following reporducer:
> > > > 
> > > >  fallocate -l 4k testfile
> > > >  xfs_io -c "pwrite 2k 2k" testfile
> > > >  # hexdump output will have stale data in from byte 0 to 2k in testfile
> > > >  hexdump -C testfile
> > > > 
> > > > NOTE: To trigger this, we need dioread_nolock enabled and write
> > > > happening via ext4_write_begin(), which is usually used when we have -o
> > > > nodealloc. Since dioread_nolock is disabled with nodelalloc, the only
> > > > alternate way to call ext4_write_begin() is to fill make sure dellayed
> > > > alloc switches to nodelalloc (ext4_da_write_begin() calls
> > > > ext4_write_begin()).  This will usually happen when FS is almost full
> > > > like the way generic/269 was triggering it in Issue 1 above. This might
> > > > make this issue harder to replicate hence for reliable replicate, I used
> > > > the below patch to temporarily allow dioread_nolock with nodelalloc and
> > > > then mount the disk with -o nodealloc,dioread_nolock. With this you can
> > > > hit the stale data issue 100% of times:
> > > > 
> > > > @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > > >   if (ext4_should_journal_data(inode))
> > > >     return 0;
> > > >   /* temporary fix to prevent generic/422 test failures */
> > > > - if (!test_opt(inode->i_sb, DELALLOC))
> > > > -   return 0;
> > > > + // if (!test_opt(inode->i_sb, DELALLOC))
> > > > + //  return 0;
> > > >   return 1;
> > > >  }
> > > > 
> > > > -------
> > > > 
> > > > After applying this patch to mark buffer as NEW, both the above issues are
> > > > fixed.
> > > > 
> > > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > 
> > > Good catch! But I'm wondering whether this is really the right fix. For
> > > example in ext4_block_truncate_page() shouldn't we rather be checking
> > > whether the buffer isn't unwritten and if yes then bail because there's
> > > nothing to zero out in the block? That would seem like a more logical
> > > and robust solution of the first problem to me.
> > 
> > So I was looking into this to understand the code paths and it seems
> > like ext4_truncate doesn't really impose that a unwritten buffer does
> > not have any data in its corresponding folio, which might sometimes be
> > the case. 
> > 
> > For example, imagine a case where we get the last block of a file via
> > ext4_da_get_block_prep() which returns a bh that is unwritten, mapped
> > and new. During the write, we'll copy data in this folio and then 
> > adjust i_size in write_end, release the folio lock and ultimately the
> > inode_lock().
> > 
> > In this intermediate state, before writeback happens, the buffer is
> > unwritten but has data which will be written. At this point, if we call
> > ext4_block_truncate_page() and have the logic to exit early for bh_unwritten, the
> > we will never actually zero out the folio which might cause stale data to be
> > written during writeback (?)
> 
> Actually we will. truncate_inode_pages_range() ends up calling
> truncate_inode_partial_folio() which zeros out the tail of the partial
> page. I think you are confusing two different things. One is zeroing of
> partial page cache pages - that is generally handled by the generic
> truncate code - and another one is zeroing of on-disk partial blocks - that
> is handled by the filesystem itself. The contents on on-disk blocks does
> not matter as long as they are marked as unwritten in the extent tree
> (their contents is random anyway) and therefore __ext4_block_zero_page_range()
> has nothing to do in that case.

> 
> > Now, most of the calls to ext4_block_truncate_page() happen via ext4_truncate ( like via ext4_setattr,
> > ext4_truncate_failed_write() etc) call truncate_inode_pages() which
> > seems to handle zeroing the partial folio beyond i_size. However, if we
> > add the logic to skip unwritten blocks in this function then:
> > 
> > 1. We create an implicit dependency in ext4_block_truncate_page() that
> > the folio needs to not have any data if its unwritten ie some other
> > function has taken care of by some other function called before it.
> 
> Yes, this dependency already exists today because when blocksize < pagesize
> the zeroing happening in __ext4_block_zero_page_range() may be a subset of
> what gets zeroed by truncate_inode_partial_folio(). Still we zero the page
> in __ext4_block_zero_page_range() for the case when the page was not cached
> at all and we've just loaded it from the disk.

> 
> > 2. Additionally, that other function will also need to mark the relevant
> > buffer dirty, which is done in this function.
> 
> AFAICT there's no need to mark the buffer dirty - the whole point is we
> don't need to touch the on-disk contents if the block is unwritten...
> 
> > 3. There are some other paths that call ext4_block_truncate_page()
> > without turncating the pagecache like ext4_zero_range(). Im not sure if
> > this will cause any issues as I've not gone through the function
> > completely though but yes, other functions that later call truncate
> > in future might need to keep this implicit dependency in mind.
> 
> Indeed, this is a good catch. So we either need to make both sites calling
> ext4_zero_partial_blocks() to use truncate_pagecache_range() for the whole
> range including partial blocks or we need to zero out the page cache
> (without bringing the page uptodate or marking it dirty) in
> ext4_zero_partial_blocks() even if the buffer is unwritten. I don't have a
> strong preference either way.

Okay, got it so basically we depend on the truncate_inode_* family of
functions to zero out the partial pagecache anyways so once we have
fixed the other call sites like ext4_zero_partial_blocks() we should be
good to skip unwritten buffers in ext4_block_zero_page_range().

Thanks for the explanation, let me go through this and prepare a fix. Since
this might take a bit more time, I've submitted the v2 fix to stale data
exposure issue and will fix this one separately.

Thanks,
Ojaswin

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
diff mbox series

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6c490f05e2ba..a30bfec0b525 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -765,6 +765,10 @@  static int _ext4_get_block(struct inode *inode, sector_t iblock,
 	if (ret > 0) {
 		map_bh(bh, inode->i_sb, map.m_pblk);
 		ext4_update_bh_state(bh, map.m_flags);
+
+		if (buffer_unwritten(bh))
+			set_buffer_new(bh);
+
 		bh->b_size = inode->i_sb->s_blocksize * map.m_len;
 		ret = 0;
 	} else if (ret == 0) {