Message ID | 20231122181440.12043-1-jack@suse.cz |
---|---|
State | Superseded |
Series | ext4: Fix warning in ext4_dio_write_end_io() |
Jan Kara <jack@suse.cz> writes:

> The syzbot has reported that it can hit the warning in
> ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
> reproducer creates a race between DIO IO completion and truncate
> expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
> inode state where i_disksize is already updated but i_size is not
> updated yet. Since we are careful when setting up DIO write and consider
> it extending (and thus performing the IO synchronously with i_rwsem held
> exclusively) whenever it goes past either of i_size or i_disksize, we
> can use the same test during IO completion without risking entering
> ext4_handle_inode_extension() without i_rwsem held. This way we make it
> obvious both i_size and i_disksize are large enough when we report DIO
> completion without relying on unreliable WARN_ON.

Does it make sense to add this in ext4_handle_inode_extension()?
   WARN_ON_ONCE(!inode_is_locked(inode));
Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
hopefully it can catch via lockdep.

So, IIUC, the WARN happened when we were doing a non-extending
AIO-DIO write which was racing with truncate trying to expand the file
size. Because only then the DIO completion will not have i_rwsem held
which can race with truncate. Truncate since it is expanding the file
size, will not use inode_dio_wait() (since no block allocations).

Is this understanding correct?

>
> Reported-by: syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com
> Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0166bb9ca160..ba497aabdd1e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>  	 * blocks. But the code in ext4_iomap_alloc() is careful to use
>  	 * zeroed/unwritten extents if this is possible; thus we won't leave
>  	 * uninitialized blocks in a file even if we didn't succeed in writing
> -	 * as much as we intended.
> +	 * as much as we intended. Also we can race with truncate or write
> +	 * expanding the file so we have to be a bit careful here.
>  	 */
> -	WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> -	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> +	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
> +	    pos + size <= i_size_read(inode))
>  		return size;
>  	return ext4_handle_inode_extension(inode, pos, size);
>  }
> --
> 2.35.3
On Thu 23-11-23 12:37:03, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
>
> > The syzbot has reported that it can hit the warning in
> > ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
> > reproducer creates a race between DIO IO completion and truncate
> > expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
> > inode state where i_disksize is already updated but i_size is not
> > updated yet. Since we are careful when setting up DIO write and consider
> > it extending (and thus performing the IO synchronously with i_rwsem held
> > exclusively) whenever it goes past either of i_size or i_disksize, we
> > can use the same test during IO completion without risking entering
> > ext4_handle_inode_extension() without i_rwsem held. This way we make it
> > obvious both i_size and i_disksize are large enough when we report DIO
> > completion without relying on unreliable WARN_ON.
>
> Does it make sense to add this in ext4_handle_inode_extension()?
>    WARN_ON_ONCE(!inode_is_locked(inode));
> Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
> hopefully it can catch via lockdep.

Exactly.

> So, IIUC, the WARN happened when we were doing a non-extending
> AIO-DIO write which was racing with truncate trying to expand the file
> size. Because only then the DIO completion will not have i_rwsem held
> which can race with truncate. Truncate since it is expanding the file
> size, will not use inode_dio_wait() (since no block allocations).
>
> Is this understanding correct?

Yes, correct.

								Honza

> >
> > Reported-by: syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com
> > Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/file.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 0166bb9ca160..ba497aabdd1e 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> >  	 * blocks. But the code in ext4_iomap_alloc() is careful to use
> >  	 * zeroed/unwritten extents if this is possible; thus we won't leave
> >  	 * uninitialized blocks in a file even if we didn't succeed in writing
> > -	 * as much as we intended.
> > +	 * as much as we intended. Also we can race with truncate or write
> > +	 * expanding the file so we have to be a bit careful here.
> >  	 */
> > -	WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> > -	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> > +	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
> > +	    pos + size <= i_size_read(inode))
> >  		return size;
> >  	return ext4_handle_inode_extension(inode, pos, size);
> >  }
> > --
> > 2.35.3
>
Jan Kara <jack@suse.cz> writes:

> On Thu 23-11-23 12:37:03, Ritesh Harjani wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>> > The syzbot has reported that it can hit the warning in
>> > ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
>> > reproducer creates a race between DIO IO completion and truncate
>> > expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
>> > inode state where i_disksize is already updated but i_size is not
>> > updated yet. Since we are careful when setting up DIO write and consider
>> > it extending (and thus performing the IO synchronously with i_rwsem held
>> > exclusively) whenever it goes past either of i_size or i_disksize, we
>> > can use the same test during IO completion without risking entering
>> > ext4_handle_inode_extension() without i_rwsem held. This way we make it
>> > obvious both i_size and i_disksize are large enough when we report DIO
>> > completion without relying on unreliable WARN_ON.
>>
>> Does it make sense to add this in ext4_handle_inode_extension()?
>>    WARN_ON_ONCE(!inode_is_locked(inode));
>> Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
>> hopefully it can catch via lockdep.
>
> Exactly.
>
>> So, IIUC, the WARN happened when we were doing a non-extending
>> AIO-DIO write which was racing with truncate trying to expand the file
>> size. Because only then the DIO completion will not have i_rwsem held
>> which can race with truncate. Truncate since it is expanding the file
>> size, will not use inode_dio_wait() (since no block allocations).
>>
>> Is this understanding correct?
>
> Yes, correct.

Thanks Jan,

Also, the ext4_inode_extension_cleanup() function can take care of deleting
the inode from the orphan list in case there is a race with a truncate
which extended both i_disksize and inode->i_size, so that the DIO
completion couldn't call ext4_handle_inode_extension(), right?

In that case, does it make sense to update a comment here too?

@@ -350,7 +350,10 @@ static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
 	}
 	/*
 	 * If i_disksize got extended due to writeback of delalloc blocks while
-	 * the DIO was running we could fail to cleanup the orphan list in
+	 * the DIO was running, or
+	 * If i_disksize and inode->i_size both got extended during truncate
+	 * which raced with DIO completion,
+	 * In both such cases, we could fail to cleanup the orphan list in
 	 * ext4_handle_inode_extension(). Do it now.
 	 */
 	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {

-ritesh

>
> 								Honza
>
>>
>> >
>> > Reported-by: syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com
>> > Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
>> > Signed-off-by: Jan Kara <jack@suse.cz>
>> > ---
>> >  fs/ext4/file.c | 7 ++++---
>> >  1 file changed, 4 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> > index 0166bb9ca160..ba497aabdd1e 100644
>> > --- a/fs/ext4/file.c
>> > +++ b/fs/ext4/file.c
>> > @@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>> >  	 * blocks. But the code in ext4_iomap_alloc() is careful to use
>> >  	 * zeroed/unwritten extents if this is possible; thus we won't leave
>> >  	 * uninitialized blocks in a file even if we didn't succeed in writing
>> > -	 * as much as we intended.
>> > +	 * as much as we intended. Also we can race with truncate or write
>> > +	 * expanding the file so we have to be a bit careful here.
>> >  	 */
>> > -	WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
>> > -	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
>> > +	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
>> > +	    pos + size <= i_size_read(inode))
>> >  		return size;
>> >  	return ext4_handle_inode_extension(inode, pos, size);
>> >  }
>> > --
>> > 2.35.3
>>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Thu 23-11-23 15:17:28, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
>
> > On Thu 23-11-23 12:37:03, Ritesh Harjani wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >>
> >> > The syzbot has reported that it can hit the warning in
> >> > ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
> >> > reproducer creates a race between DIO IO completion and truncate
> >> > expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
> >> > inode state where i_disksize is already updated but i_size is not
> >> > updated yet. Since we are careful when setting up DIO write and consider
> >> > it extending (and thus performing the IO synchronously with i_rwsem held
> >> > exclusively) whenever it goes past either of i_size or i_disksize, we
> >> > can use the same test during IO completion without risking entering
> >> > ext4_handle_inode_extension() without i_rwsem held. This way we make it
> >> > obvious both i_size and i_disksize are large enough when we report DIO
> >> > completion without relying on unreliable WARN_ON.
> >>
> >> Does it make sense to add this in ext4_handle_inode_extension()?
> >>    WARN_ON_ONCE(!inode_is_locked(inode));
> >> Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
> >> hopefully it can catch via lockdep.
> >
> > Exactly.
> >
> >> So, IIUC, the WARN happened when we were doing a non-extending
> >> AIO-DIO write which was racing with truncate trying to expand the file
> >> size. Because only then the DIO completion will not have i_rwsem held
> >> which can race with truncate. Truncate since it is expanding the file
> >> size, will not use inode_dio_wait() (since no block allocations).
> >>
> >> Is this understanding correct?
> >
> > Yes, correct.
>
> Thanks Jan,
>
> Also, the ext4_inode_extension_cleanup() function can take care of deleting
> the inode from the orphan list in case there is a race with a truncate
> which extended both i_disksize and inode->i_size, so that the DIO
> completion couldn't call ext4_handle_inode_extension(), right?
>
> In that case, does it make sense to update a comment here too?
>
> @@ -350,7 +350,10 @@ static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
>  	}
>  	/*
>  	 * If i_disksize got extended due to writeback of delalloc blocks while
> -	 * the DIO was running we could fail to cleanup the orphan list in
> +	 * the DIO was running, or
> +	 * If i_disksize and inode->i_size both got extended during truncate
> +	 * which raced with DIO completion,
> +	 * In both such cases, we could fail to cleanup the orphan list in
>  	 * ext4_handle_inode_extension(). Do it now.
>  	 */
>  	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {

Good point. Expanded comment in this way. I'll send v2 shortly.

								Honza
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0166bb9ca160..ba497aabdd1e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
 	 * blocks. But the code in ext4_iomap_alloc() is careful to use
 	 * zeroed/unwritten extents if this is possible; thus we won't leave
 	 * uninitialized blocks in a file even if we didn't succeed in writing
-	 * as much as we intended.
+	 * as much as we intended. Also we can race with truncate or write
+	 * expanding the file so we have to be a bit careful here.
 	 */
-	WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
-	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
+	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
+	    pos + size <= i_size_read(inode))
 		return size;
 	return ext4_handle_inode_extension(inode, pos, size);
 }
The syzbot has reported that it can hit the warning in
ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
reproducer creates a race between DIO IO completion and truncate
expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
inode state where i_disksize is already updated but i_size is not
updated yet. Since we are careful when setting up DIO write and consider
it extending (and thus performing the IO synchronously with i_rwsem held
exclusively) whenever it goes past either of i_size or i_disksize, we
can use the same test during IO completion without risking entering
ext4_handle_inode_extension() without i_rwsem held. This way we make it
obvious both i_size and i_disksize are large enough when we report DIO
completion without relying on unreliable WARN_ON.

Reported-by: syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com
Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
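
For readers following the thread, the reasoning in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct toy_inode`, `old_invariant_violated()`, `completes_without_extension()` and `toy_selfcheck()` are hypothetical names invented here, plain fields stand in for `i_size_read()` and `READ_ONCE(EXT4_I(inode)->i_disksize)`, and no locking is modelled. It shows that the state seen mid-truncate trips the old `WARN_ON_ONCE` condition even though the write is harmless, while the new two-sided test simply takes the fast path for it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two size fields of an ext4 inode. */
struct toy_inode {
	long long i_size;     /* in-memory file size */
	long long i_disksize; /* file size recorded on disk */
};

/* The condition the removed WARN_ON_ONCE() tested: i_size < i_disksize. */
static bool old_invariant_violated(const struct toy_inode *inode)
{
	return inode->i_size < inode->i_disksize;
}

/*
 * The new completion-time test: the write ends within BOTH sizes,
 * so no inode extension handling is needed.
 */
static bool completes_without_extension(const struct toy_inode *inode,
					long long pos, long long size)
{
	return pos + size <= inode->i_disksize &&
	       pos + size <= inode->i_size;
}

/* Walks through the states discussed in the thread. */
void toy_selfcheck(void)
{
	/* Expanding truncate has updated i_disksize but not yet i_size. */
	struct toy_inode racing = { .i_size = 4096, .i_disksize = 8192 };
	/* Same inode once truncate has finished. */
	struct toy_inode settled = { .i_size = 8192, .i_disksize = 8192 };

	/* The old warning condition is true in the race window... */
	assert(old_invariant_violated(&racing));
	/* ...yet a write inside the old size is safely non-extending. */
	assert(completes_without_extension(&racing, 0, 4096));
	/* A write past i_size is not on the fast path in that window. */
	assert(!completes_without_extension(&racing, 4096, 4096));

	/* After truncate completes, both checks agree again. */
	assert(!old_invariant_violated(&settled));
	assert(completes_without_extension(&settled, 4096, 4096));
}
```

Calling `toy_selfcheck()` from a `main()` exercises all five cases. In the real code, a write that fails the two-sided test was already classified as extending at setup time and therefore completes with i_rwsem held exclusively, which is what makes reusing the same test at completion safe.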