diff mbox series

[PATCHv5,3/6] ext4: Move ext4 bmap to use iomap infrastructure.

Message ID 8bbd53bd719d5ccfecafcce93f2bf1d7955a44af.1582880246.git.riteshh@linux.ibm.com
State Accepted
Series ext4: bmap & fiemap conversion to use iomap

Commit Message

Ritesh Harjani Feb. 28, 2020, 9:26 a.m. UTC
ext4_iomap_begin is already implemented which provides ext4_map_blocks,
so just move the API from generic_block_bmap to iomap_bmap for iomap
conversion.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Darrick Wong Feb. 28, 2020, 3:25 p.m. UTC | #1
On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> so just move the API from generic_block_bmap to iomap_bmap for iomap
> conversion.
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6cf3b969dc86..81fccbae0aea 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
>  			return 0;
>  	}
>  
> -	return generic_block_bmap(mapping, block, ext4_get_block);
> +	return iomap_bmap(mapping, block, &ext4_iomap_ops);

/me notes that iomap_bmap will filemap_write_and_wait for you, so one
could optimize ext4_bmap to avoid the double-flush by moving the
filemap_write_and_wait at the top of the function into the JDATA state
clearing block.

OTOH it's bmap, who cares... :)

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

>  }
>  
>  static int ext4_readpage(struct file *file, struct page *page)
> -- 
> 2.21.0
>
Ritesh Harjani March 2, 2020, 8:58 a.m. UTC | #2
On 2/28/20 8:55 PM, Darrick J. Wong wrote:
> On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
>> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
>> so just move the API from generic_block_bmap to iomap_bmap for iomap
>> conversion.
>>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> ---
>>   fs/ext4/inode.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 6cf3b969dc86..81fccbae0aea 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
>>   			return 0;
>>   	}
>>   
>> -	return generic_block_bmap(mapping, block, ext4_get_block);
>> +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
> 
> /me notes that iomap_bmap will filemap_write_and_wait for you, so one
> could optimize ext4_bmap to avoid the double-flush by moving the
> filemap_write_and_wait at the top of the function into the JDATA state
> clearing block.

IIUC, delalloc and data=journal mode are mutually exclusive.
So we could get rid of calling filemap_write_and_wait() altogether
from ext4_bmap().
And as you pointed out, filemap_write_and_wait() is called by default in
iomap_bmap, which should cover the delalloc case.


@Jan/Darrick,
Could you check if the attached patch looks good? If yes, then I
will add your Reviewed-by and send a v6.

Thanks for the review!!

-ritesh
Darrick Wong March 3, 2020, 3:47 p.m. UTC | #3
On Mon, Mar 02, 2020 at 02:28:39PM +0530, Ritesh Harjani wrote:
> 
> 
> On 2/28/20 8:55 PM, Darrick J. Wong wrote:
> > On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
> > > ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> > > so just move the API from generic_block_bmap to iomap_bmap for iomap
> > > conversion.
> > > 
> > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > ---
> > >   fs/ext4/inode.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 6cf3b969dc86..81fccbae0aea 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
> > >   			return 0;
> > >   	}
> > > -	return generic_block_bmap(mapping, block, ext4_get_block);
> > > +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
> > 
> > /me notes that iomap_bmap will filemap_write_and_wait for you, so one
> > could optimize ext4_bmap to avoid the double-flush by moving the
> > filemap_write_and_wait at the top of the function into the JDATA state
> > clearing block.
> 
> IIUC, delalloc and data=journal mode are both mutually exclusive.
> So we could get rid of calling filemap_write_and_wait() all together
> from ext4_bmap().
> And as you pointed filemap_write_and_wait() is called by default in
> iomap_bmap which should cover for delalloc case.
> 
> 
> @Jan/Darrick,
> Could you check if the attached patch looks good. If yes then
> will add your Reviewed-by and send a v6.
> 
> Thanks for the review!!
> 
> -ritesh
> 
> 

> From 93f560d9a483b4f389056e543012d0941734a8f4 Mon Sep 17 00:00:00 2001
> From: Ritesh Harjani <riteshh@linux.ibm.com>
> Date: Tue, 20 Aug 2019 18:36:33 +0530
> Subject: [PATCH 3/6] ext4: Move ext4 bmap to use iomap infrastructure.
> 
> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> so just move the API from generic_block_bmap to iomap_bmap for iomap
> conversion.
> 
> Also no need to call for filemap_write_and_wait() any more in ext4_bmap
> since data=journal mode anyway doesn't support delalloc and for all other
> cases iomap_bmap() anyway calls the same function, so no need for doing
> it twice.
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>

Hmmm.  I don't recall how jdata actually works, but I get the impression
here that we're trying to flush dirty data out to the journal and then
out to disk, and then drop the JDATA state from the inode.  This
mechanism exists (I guess?) so that dirty file pages get checkpointed
out of jbd2 back into the filesystem so that bmap() returns meaningful
results to lilo.

This makes me wonder if you still need the filemap_write_and_wait in the
JDATA case because otherwise the journal flush won't have the effect of
writing all the dirty pagecache back to the filesystem?  OTOH I suppose
the implicit write-and-wait call after we clear JDATA will not be
writing to the journal.

Even more weirdly, the FIEMAP code doesn't drop JDATA at all...?

--D

> ---
>  fs/ext4/inode.c | 12 +-----------
>  1 file changed, 1 insertion(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6cf3b969dc86..fac8adbbb3f6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3174,16 +3174,6 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
>  	if (ext4_has_inline_data(inode))
>  		return 0;
>  
> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
> -			test_opt(inode->i_sb, DELALLOC)) {
> -		/*
> -		 * With delalloc we want to sync the file
> -		 * so that we can make sure we allocate
> -		 * blocks for file
> -		 */
> -		filemap_write_and_wait(mapping);
> -	}
> -
>  	if (EXT4_JOURNAL(inode) &&
>  	    ext4_test_inode_state(inode, EXT4_STATE_JDATA)) {
>  		/*
> @@ -3214,7 +3204,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
>  			return 0;
>  	}
>  
> -	return generic_block_bmap(mapping, block, ext4_get_block);
> +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
>  }
>  
>  static int ext4_readpage(struct file *file, struct page *page)
> -- 
> 2.21.0
>
Jan Kara March 4, 2020, 12:42 p.m. UTC | #4
On Tue 03-03-20 07:47:09, Darrick J. Wong wrote:
> On Mon, Mar 02, 2020 at 02:28:39PM +0530, Ritesh Harjani wrote:
> > 
> > 
> > On 2/28/20 8:55 PM, Darrick J. Wong wrote:
> > > On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
> > > > ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> > > > so just move the API from generic_block_bmap to iomap_bmap for iomap
> > > > conversion.
> > > > 
> > > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > > ---
> > > >   fs/ext4/inode.c | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 6cf3b969dc86..81fccbae0aea 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
> > > >   			return 0;
> > > >   	}
> > > > -	return generic_block_bmap(mapping, block, ext4_get_block);
> > > > +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
> > > 
> > > /me notes that iomap_bmap will filemap_write_and_wait for you, so one
> > > could optimize ext4_bmap to avoid the double-flush by moving the
> > > filemap_write_and_wait at the top of the function into the JDATA state
> > > clearing block.
> > 
> > IIUC, delalloc and data=journal mode are both mutually exclusive.
> > So we could get rid of calling filemap_write_and_wait() all together
> > from ext4_bmap().
> > And as you pointed filemap_write_and_wait() is called by default in
> > iomap_bmap which should cover for delalloc case.
> > 
> > 
> > @Jan/Darrick,
> > Could you check if the attached patch looks good. If yes then
> > will add your Reviewed-by and send a v6.
> > 
> > Thanks for the review!!
> > 
> > -ritesh
> > 
> > 
> 
> > From 93f560d9a483b4f389056e543012d0941734a8f4 Mon Sep 17 00:00:00 2001
> > From: Ritesh Harjani <riteshh@linux.ibm.com>
> > Date: Tue, 20 Aug 2019 18:36:33 +0530
> > Subject: [PATCH 3/6] ext4: Move ext4 bmap to use iomap infrastructure.
> > 
> > ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> > so just move the API from generic_block_bmap to iomap_bmap for iomap
> > conversion.
> > 
> > Also no need to call for filemap_write_and_wait() any more in ext4_bmap
> > since data=journal mode anyway doesn't support delalloc and for all other
> > cases iomap_bmap() anyway calls the same function, so no need for doing
> > it twice.
> > 
> > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> 
> Hmmm.  I don't recall how jdata actually works, but I get the impression
> here that we're trying to flush dirty data out to the journal and then
> out to disk, and then drop the JDATA state from the inode.  This
> mechanism exists (I guess?) so that dirty file pages get checkpointed
> out of jbd2 back into the filesystem so that bmap() returns meaningful
> results to lilo.

Exactly. E.g. when we are journalling data and we fill a hole through mmap,
we will have a block allocated as unwritten, and we need to write it out so
that the data gets to the journal and then do a journal flush to get the data
to disk so that lilo can read it from the device. So removing
filemap_write_and_wait() when journalling data is wrong.

> This makes me wonder if you still need the filemap_write_and_wait in the
> JDATA case because otherwise the journal flush won't have the effect of
> writing all the dirty pagecache back to the filesystem?  OTOH I suppose
> the implicit write-and-wait call after we clear JDATA will not be
> writing to the journal.
> 
> Even more weirdly, the FIEMAP code doesn't drop JDATA at all...?

Yeah, it should do that, but that's only a performance optimization so that
we bother with journal flushing only when someone uses a block mapping call
on a file with journalled dirty data. So you can hardly notice the bug by
testing...

								Honza
Darrick Wong March 4, 2020, 3:37 p.m. UTC | #5
On Wed, Mar 04, 2020 at 01:42:11PM +0100, Jan Kara wrote:
> On Tue 03-03-20 07:47:09, Darrick J. Wong wrote:
> > On Mon, Mar 02, 2020 at 02:28:39PM +0530, Ritesh Harjani wrote:
> > > 
> > > 
> > > On 2/28/20 8:55 PM, Darrick J. Wong wrote:
> > > > On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
> > > > > ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> > > > > so just move the API from generic_block_bmap to iomap_bmap for iomap
> > > > > conversion.
> > > > > 
> > > > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > > > ---
> > > > >   fs/ext4/inode.c | 2 +-
> > > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index 6cf3b969dc86..81fccbae0aea 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
> > > > >   			return 0;
> > > > >   	}
> > > > > -	return generic_block_bmap(mapping, block, ext4_get_block);
> > > > > +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
> > > > 
> > > > /me notes that iomap_bmap will filemap_write_and_wait for you, so one
> > > > could optimize ext4_bmap to avoid the double-flush by moving the
> > > > filemap_write_and_wait at the top of the function into the JDATA state
> > > > clearing block.
> > > 
> > > IIUC, delalloc and data=journal mode are both mutually exclusive.
> > > So we could get rid of calling filemap_write_and_wait() all together
> > > from ext4_bmap().
> > > And as you pointed filemap_write_and_wait() is called by default in
> > > iomap_bmap which should cover for delalloc case.
> > > 
> > > 
> > > @Jan/Darrick,
> > > Could you check if the attached patch looks good. If yes then
> > > will add your Reviewed-by and send a v6.
> > > 
> > > Thanks for the review!!
> > > 
> > > -ritesh
> > > 
> > > 
> > 
> > > From 93f560d9a483b4f389056e543012d0941734a8f4 Mon Sep 17 00:00:00 2001
> > > From: Ritesh Harjani <riteshh@linux.ibm.com>
> > > Date: Tue, 20 Aug 2019 18:36:33 +0530
> > > Subject: [PATCH 3/6] ext4: Move ext4 bmap to use iomap infrastructure.
> > > 
> > > ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> > > so just move the API from generic_block_bmap to iomap_bmap for iomap
> > > conversion.
> > > 
> > > Also no need to call for filemap_write_and_wait() any more in ext4_bmap
> > > since data=journal mode anyway doesn't support delalloc and for all other
> > > cases iomap_bmap() anyway calls the same function, so no need for doing
> > > it twice.
> > > 
> > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > 
> > Hmmm.  I don't recall how jdata actually works, but I get the impression
> > here that we're trying to flush dirty data out to the journal and then
> > out to disk, and then drop the JDATA state from the inode.  This
> > mechanism exists (I guess?) so that dirty file pages get checkpointed
> > out of jbd2 back into the filesystem so that bmap() returns meaningful
> > results to lilo.
> 
> Exactly. E.g. when we are journalling data, we fill hole through mmap, we will
> have block allocated as unwritten and we need to write it out so that the
> data gets to the journal and then do journal flush to get the data to disk
> so that lilo can read it from the devices. So removing
> filemap_write_and_wait() when journalling data is wrong.

<nod>

> > This makes me wonder if you still need the filemap_write_and_wait in the
> > JDATA case because otherwise the journal flush won't have the effect of
> > writing all the dirty pagecache back to the filesystem?  OTOH I suppose
> > the implicit write-and-wait call after we clear JDATA will not be
> > writing to the journal.
> > 
> > Even more weirdly, the FIEMAP code doesn't drop JDATA at all...?
> 
> Yeah, it should do that but that's only performance optimization so that we
> bother with journal flushing only when someone uses block mapping call on
> a file with journalled dirty data. So you can hardly notice the bug by
> testing...

If we ever decide to deprecate FIBMAP officially and push bootloaders to
use FIEMAP, then we'll have to emulate all the flushing behaviors.  But
that's something for a separate patch.

--D

> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
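
[Editorial sketch, not part of the thread.] For readers unfamiliar with the FIEMAP interface mentioned above, the following is a minimal userspace sketch of an extent query. The helper name fiemap_query() is hypothetical; FS_IOC_FIEMAP, struct fiemap, struct fiemap_extent, FIEMAP_FLAG_SYNC and FIEMAP_MAX_OFFSET come from the Linux UAPI headers. FIEMAP_FLAG_SYNC asks the kernel to write back dirty data before mapping; per the discussion above, that should not be assumed to reproduce the journal flush that ext4_bmap() performs for JDATA inodes.

```c
#include <assert.h>
#include <errno.h>
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: copy up to max_extents extent mappings of the
 * whole file into out[]. Returns the number of extents copied, or -1
 * with errno set by ioctl() or calloc().
 */
int fiemap_query(int fd, struct fiemap_extent *out, unsigned int max_extents)
{
	size_t sz = sizeof(struct fiemap) +
		    (size_t)max_extents * sizeof(struct fiemap_extent);
	struct fiemap *fm;
	unsigned int n;

	assert(out != NULL || max_extents == 0);

	fm = calloc(1, sz);
	if (!fm)
		return -1;

	fm->fm_start = 0;                    /* map from the start of the file */
	fm->fm_length = FIEMAP_MAX_OFFSET;   /* ... to the end */
	fm->fm_flags = FIEMAP_FLAG_SYNC;     /* write back dirty data first */
	fm->fm_extent_count = max_extents;   /* capacity of fm_extents[] */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		int saved = errno;

		free(fm);
		errno = saved;
		return -1;
	}

	/* Never copy more entries than the array we allocated can hold. */
	n = fm->fm_mapped_extents;
	if (n > max_extents)
		n = max_extents;
	if (n > 0)
		memcpy(out, fm->fm_extents, n * sizeof(*out));

	free(fm);
	return (int)n;
}
```

A caller would open the file read-only, pass a small array of struct fiemap_extent, and read fe_logical, fe_physical and fe_length (all in bytes) from each returned entry.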
Ritesh Harjani March 6, 2020, 5:49 p.m. UTC | #6
On 3/4/20 6:12 PM, Jan Kara wrote:
> On Tue 03-03-20 07:47:09, Darrick J. Wong wrote:
>> On Mon, Mar 02, 2020 at 02:28:39PM +0530, Ritesh Harjani wrote:
>>>
>>>
>>> On 2/28/20 8:55 PM, Darrick J. Wong wrote:
>>>> On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
>>>>> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
>>>>> so just move the API from generic_block_bmap to iomap_bmap for iomap
>>>>> conversion.
>>>>>
>>>>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>>> ---
>>>>>    fs/ext4/inode.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index 6cf3b969dc86..81fccbae0aea 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
>>>>>    			return 0;
>>>>>    	}
>>>>> -	return generic_block_bmap(mapping, block, ext4_get_block);
>>>>> +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
>>>>
>>>> /me notes that iomap_bmap will filemap_write_and_wait for you, so one
>>>> could optimize ext4_bmap to avoid the double-flush by moving the
>>>> filemap_write_and_wait at the top of the function into the JDATA state
>>>> clearing block.
>>>
>>> IIUC, delalloc and data=journal mode are both mutually exclusive.
>>> So we could get rid of calling filemap_write_and_wait() all together
>>> from ext4_bmap().
>>> And as you pointed filemap_write_and_wait() is called by default in
>>> iomap_bmap which should cover for delalloc case.
>>>
>>>
>>> @Jan/Darrick,
>>> Could you check if the attached patch looks good. If yes then
>>> will add your Reviewed-by and send a v6.
>>>
>>> Thanks for the review!!
>>>
>>> -ritesh
>>>
>>>
>>
>>>  From 93f560d9a483b4f389056e543012d0941734a8f4 Mon Sep 17 00:00:00 2001
>>> From: Ritesh Harjani <riteshh@linux.ibm.com>
>>> Date: Tue, 20 Aug 2019 18:36:33 +0530
>>> Subject: [PATCH 3/6] ext4: Move ext4 bmap to use iomap infrastructure.
>>>
>>> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
>>> so just move the API from generic_block_bmap to iomap_bmap for iomap
>>> conversion.
>>>
>>> Also no need to call for filemap_write_and_wait() any more in ext4_bmap
>>> since data=journal mode anyway doesn't support delalloc and for all other
>>> cases iomap_bmap() anyway calls the same function, so no need for doing
>>> it twice.
>>>
>>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>>
>> Hmmm.  I don't recall how jdata actually works, but I get the impression
>> here that we're trying to flush dirty data out to the journal and then
>> out to disk, and then drop the JDATA state from the inode.  This
>> mechanism exists (I guess?) so that dirty file pages get checkpointed
>> out of jbd2 back into the filesystem so that bmap() returns meaningful
>> results to lilo.
> 
> Exactly. E.g. when we are journalling data, we fill hole through mmap, we will
> have block allocated as unwritten and we need to write it out so that the
> data gets to the journal and then do journal flush to get the data to disk

So in data=journal case in ext4_page_mkwrite the data buffer will also
be marked as, to be journalled. So does jbd2_journal_flush() itself
don't take care of writing back any dirty page cache before it commit
that transaction? and after then checkpoint it?

Sorry my knowledge about jbd2 is very naive.

> so that lilo can read it from the devices. So removing
> filemap_write_and_wait() when journalling data is wrong.

Sure, I understand this part, but I was just curious about the above query.
Otherwise, IIUC, we will have to add
filemap_write_and_wait() for the JDATA case as well before calling
jbd2_journal_flush(). Will add this as a separate patch.


-ritesh

> 
>> This makes me wonder if you still need the filemap_write_and_wait in the
>> JDATA case because otherwise the journal flush won't have the effect of
>> writing all the dirty pagecache back to the filesystem?  OTOH I suppose
>> the implicit write-and-wait call after we clear JDATA will not be
>> writing to the journal.
>>
>> Even more weirdly, the FIEMAP code doesn't drop JDATA at all...?
> 
> Yeah, it should do that but that's only performance optimization so that we
> bother with journal flushing only when someone uses block mapping call on
> a file with journalled dirty data. So you can hardly notice the bug by
> testing...
> 
> 								Honza
>
Darrick Wong March 7, 2020, 12:51 a.m. UTC | #7
On Fri, Mar 06, 2020 at 11:19:31PM +0530, Ritesh Harjani wrote:
> 
> 
> On 3/4/20 6:12 PM, Jan Kara wrote:
> > On Tue 03-03-20 07:47:09, Darrick J. Wong wrote:
> > > On Mon, Mar 02, 2020 at 02:28:39PM +0530, Ritesh Harjani wrote:
> > > > 
> > > > 
> > > > On 2/28/20 8:55 PM, Darrick J. Wong wrote:
> > > > > On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
> > > > > > ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> > > > > > so just move the API from generic_block_bmap to iomap_bmap for iomap
> > > > > > conversion.
> > > > > > 
> > > > > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > > > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > > > > ---
> > > > > >    fs/ext4/inode.c | 2 +-
> > > > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > > index 6cf3b969dc86..81fccbae0aea 100644
> > > > > > --- a/fs/ext4/inode.c
> > > > > > +++ b/fs/ext4/inode.c
> > > > > > @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
> > > > > >    			return 0;
> > > > > >    	}
> > > > > > -	return generic_block_bmap(mapping, block, ext4_get_block);
> > > > > > +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
> > > > > 
> > > > > /me notes that iomap_bmap will filemap_write_and_wait for you, so one
> > > > > could optimize ext4_bmap to avoid the double-flush by moving the
> > > > > filemap_write_and_wait at the top of the function into the JDATA state
> > > > > clearing block.
> > > > 
> > > > IIUC, delalloc and data=journal mode are both mutually exclusive.
> > > > So we could get rid of calling filemap_write_and_wait() all together
> > > > from ext4_bmap().
> > > > And as you pointed filemap_write_and_wait() is called by default in
> > > > iomap_bmap which should cover for delalloc case.
> > > > 
> > > > 
> > > > @Jan/Darrick,
> > > > Could you check if the attached patch looks good. If yes then
> > > > will add your Reviewed-by and send a v6.
> > > > 
> > > > Thanks for the review!!
> > > > 
> > > > -ritesh
> > > > 
> > > > 
> > > 
> > > >  From 93f560d9a483b4f389056e543012d0941734a8f4 Mon Sep 17 00:00:00 2001
> > > > From: Ritesh Harjani <riteshh@linux.ibm.com>
> > > > Date: Tue, 20 Aug 2019 18:36:33 +0530
> > > > Subject: [PATCH 3/6] ext4: Move ext4 bmap to use iomap infrastructure.
> > > > 
> > > > ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> > > > so just move the API from generic_block_bmap to iomap_bmap for iomap
> > > > conversion.
> > > > 
> > > > Also no need to call for filemap_write_and_wait() any more in ext4_bmap
> > > > since data=journal mode anyway doesn't support delalloc and for all other
> > > > cases iomap_bmap() anyway calls the same function, so no need for doing
> > > > it twice.
> > > > 
> > > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> > > 
> > > Hmmm.  I don't recall how jdata actually works, but I get the impression
> > > here that we're trying to flush dirty data out to the journal and then
> > > out to disk, and then drop the JDATA state from the inode.  This
> > > mechanism exists (I guess?) so that dirty file pages get checkpointed
> > > out of jbd2 back into the filesystem so that bmap() returns meaningful
> > > results to lilo.
> > 
> > Exactly. E.g. when we are journalling data, we fill hole through mmap, we will
> > have block allocated as unwritten and we need to write it out so that the
> > data gets to the journal and then do journal flush to get the data to disk
> 
> So in data=journal case in ext4_page_mkwrite the data buffer will also
> be marked as, to be journalled. So does jbd2_journal_flush() itself
> don't take care of writing back any dirty page cache before it commit
> that transaction? and after then checkpoint it?

Er... this sentence is a little garbled, but I think the answer you're
looking for is:

"Yes, writeback (i.e. filemap_write_and_wait) attaches the dirty blocks
to a journal transaction; then jbd2_journal_flush forces the transaction
data out to the on-disk journal; and it also checkpoints the journal so
that the dirty blocks are then written back into the filesystem."
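
[Editorial sketch, not part of the thread.] The ordering described in the answer above can be illustrated with a small toy model. Everything here is hypothetical and heavily simplified: the toy_* names, the three "locations" and the per-file journal_data flag are illustration only and are not kernel code. toy_writeback() stands in for filemap_write_and_wait() and toy_journal_flush() stands in for jbd2_journal_flush(). The point it tries to show is that a journal flush alone cannot move data that is still sitting dirty in the page cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BLOCKS 8

/* Where the newest copy of a block's data currently lives (toy model). */
enum toy_where { TOY_HOME, TOY_PAGECACHE, TOY_JOURNAL };

struct toy_file {
	bool journal_data;                    /* models data=journal */
	size_t nr;                            /* number of blocks in use */
	enum toy_where where[TOY_MAX_BLOCKS];
};

void toy_init(struct toy_file *f, size_t nr, bool journal_data)
{
	assert(nr <= TOY_MAX_BLOCKS);
	f->journal_data = journal_data;
	f->nr = nr;
	for (size_t i = 0; i < nr; i++)
		f->where[i] = TOY_HOME;
}

/* A write (e.g. through mmap) leaves the newest data in the page cache. */
void toy_dirty(struct toy_file *f, size_t blk)
{
	assert(blk < f->nr);
	f->where[blk] = TOY_PAGECACHE;
}

/*
 * Stand-in for filemap_write_and_wait(): with journalled data the dirty
 * blocks are attached to the journal, otherwise they go straight home.
 */
void toy_writeback(struct toy_file *f)
{
	for (size_t i = 0; i < f->nr; i++)
		if (f->where[i] == TOY_PAGECACHE)
			f->where[i] = f->journal_data ? TOY_JOURNAL : TOY_HOME;
}

/*
 * Stand-in for jbd2_journal_flush(): commit and checkpoint, so blocks
 * that are in the journal reach their home location. Blocks still in
 * the page cache are not touched.
 */
void toy_journal_flush(struct toy_file *f)
{
	for (size_t i = 0; i < f->nr; i++)
		if (f->where[i] == TOY_JOURNAL)
			f->where[i] = TOY_HOME;
}

/* True when every block's newest data is at its home location. */
bool toy_home_is_current(const struct toy_file *f)
{
	for (size_t i = 0; i < f->nr; i++)
		if (f->where[i] != TOY_HOME)
			return false;
	return true;
}
```

In this model, calling toy_journal_flush() on a freshly dirtied file leaves the block in the page cache, while toy_writeback() followed by toy_journal_flush() brings it home, which is the order being argued for in this thread.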

> Sorry my knowledge about jbd2 is very naive.
> 
> > so that lilo can read it from the devices. So removing
> > filemap_write_and_wait() when journalling data is wrong.
> 
> Sure I understand this part. But was just curious on above query.
> Otherwise, IIUC, we will have to add
> filemap_write_and_wait() for JDATA case as well before calling
> for jbd2_journal_flush(). Will add this as a separate patch.

Well you could just move it...

bmap()
{
	/*
	 * In data=journal mode, we must checkpoint the journal to
	 * ensure that any dirty blocks in the journal are checkpointed
	 * to the location that we return to userspace.  Clear JDATA so
	 * that future writes will not be written through the journal.
	 */
	if (JDATA) {
		filemap_write_and_wait(...);
		clear JDATA
		jbd2_journal_flush(...);
	}

	return iomap_bmap(...);
}

(or did "Will add this as a separate patch" refer to fixing FIEMAP?)

--D

> 
> -ritesh
> 
> > 
> > > This makes me wonder if you still need the filemap_write_and_wait in the
> > > JDATA case because otherwise the journal flush won't have the effect of
> > > writing all the dirty pagecache back to the filesystem?  OTOH I suppose
> > > the implicit write-and-wait call after we clear JDATA will not be
> > > writing to the journal.
> > > 
> > > Even more weirdly, the FIEMAP code doesn't drop JDATA at all...?
> > 
> > Yeah, it should do that but that's only performance optimization so that we
> > bother with journal flushing only when someone uses block mapping call on
> > a file with journalled dirty data. So you can hardly notice the bug by
> > testing...
> > 
> > 								Honza
> > 
>
Theodore Ts'o March 7, 2020, 2:32 a.m. UTC | #8
On Wed, Mar 04, 2020 at 07:37:45AM -0800, Darrick J. Wong wrote:
> > > This makes me wonder if you still need the filemap_write_and_wait in the
> > > JDATA case because otherwise the journal flush won't have the effect of
> > > writing all the dirty pagecache back to the filesystem?  OTOH I suppose
> > > the implicit write-and-wait call after we clear JDATA will not be
> > > writing to the journal.
> > > 
> > > Even more weirdly, the FIEMAP code doesn't drop JDATA at all...?
> > 
> > Yeah, it should do that but that's only performance optimization so that we
> > bother with journal flushing only when someone uses block mapping call on
> > a file with journalled dirty data. So you can hardly notice the bug by
> > testing...
> 
> If we ever decide to deprecate FIBMAP officially and push bootloaders to
> use FIEMAP, then we'll have to emulate all the flushing behaviors.  But
> that's something for a separate patch.

This is really only needed for LILO, since I believe this is the only
bootloader which uses the output of FIBMAP to determine the block
number where it will attempt to ***write*** into a data block of a
mounted file system.

I seem to recall either Dave or Christoph ranting at one point that
any program which attempted to write into a mounted file system using
the output of FIEMAP was insane, and we should not be encouraging that
kind of wacko behavior.  :-)

What most bootloaders want is simply the accurate list of block
locations so they can write that into the stage 1 bootloader so it can
read the stage 2 bootloader from the disk.  The reason why we have the
JDATA hack in the bmap code is because LILO will get the block
location, and then try to write config information into that block.
So we are trying to prevent LILO's write of the boot command line from
possibly getting rewritten after a journal replay.  (Of course, no
distribution installer would do something as rude as just forcibly
rebooting the system without a clean unmount, so this would *never* be
a problem, RIGHT?  :-)

In any case, I'd much rather try to get LILO fixed to do something
sane, rather than move that heavy-ugly JDATA code into FIEMAP, where
it might get triggered unnecessarily by 99.9% of the users who are
doing something not-insane.

							- Ted
Ritesh Harjani March 7, 2020, 5:50 a.m. UTC | #9
On 3/7/20 6:21 AM, Darrick J. Wong wrote:
> On Fri, Mar 06, 2020 at 11:19:31PM +0530, Ritesh Harjani wrote:
>>
>>
>> On 3/4/20 6:12 PM, Jan Kara wrote:
>>> On Tue 03-03-20 07:47:09, Darrick J. Wong wrote:
>>>> On Mon, Mar 02, 2020 at 02:28:39PM +0530, Ritesh Harjani wrote:
>>>>>
>>>>>
>>>>> On 2/28/20 8:55 PM, Darrick J. Wong wrote:
>>>>>> On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
>>>>>>> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
>>>>>>> so just move the API from generic_block_bmap to iomap_bmap for iomap
>>>>>>> conversion.
>>>>>>>
>>>>>>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>>>>>>> Reviewed-by: Jan Kara <jack@suse.cz>
>>>>>>> ---
>>>>>>>     fs/ext4/inode.c | 2 +-
>>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>>> index 6cf3b969dc86..81fccbae0aea 100644
>>>>>>> --- a/fs/ext4/inode.c
>>>>>>> +++ b/fs/ext4/inode.c
>>>>>>> @@ -3214,7 +3214,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
>>>>>>>     			return 0;
>>>>>>>     	}
>>>>>>> -	return generic_block_bmap(mapping, block, ext4_get_block);
>>>>>>> +	return iomap_bmap(mapping, block, &ext4_iomap_ops);
>>>>>>
>>>>>> /me notes that iomap_bmap will filemap_write_and_wait for you, so one
>>>>>> could optimize ext4_bmap to avoid the double-flush by moving the
>>>>>> filemap_write_and_wait at the top of the function into the JDATA state
>>>>>> clearing block.
>>>>>
>>>>> IIUC, delalloc and data=journal mode are both mutually exclusive.
>>>>> So we could get rid of calling filemap_write_and_wait() all together
>>>>> from ext4_bmap().
>>>>> And as you pointed filemap_write_and_wait() is called by default in
>>>>> iomap_bmap which should cover for delalloc case.
>>>>>
>>>>>
>>>>> @Jan/Darrick,
>>>>> Could you check if the attached patch looks good. If yes then
>>>>> will add your Reviewed-by and send a v6.
>>>>>
>>>>> Thanks for the review!!
>>>>>
>>>>> -ritesh
>>>>>
>>>>>
>>>>
>>>>>   From 93f560d9a483b4f389056e543012d0941734a8f4 Mon Sep 17 00:00:00 2001
>>>>> From: Ritesh Harjani <riteshh@linux.ibm.com>
>>>>> Date: Tue, 20 Aug 2019 18:36:33 +0530
>>>>> Subject: [PATCH 3/6] ext4: Move ext4 bmap to use iomap infrastructure.
>>>>>
>>>>> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
>>>>> so just move the API from generic_block_bmap to iomap_bmap for iomap
>>>>> conversion.
>>>>>
>>>>> Also no need to call for filemap_write_and_wait() any more in ext4_bmap
>>>>> since data=journal mode anyway doesn't support delalloc and for all other
>>>>> cases iomap_bmap() anyway calls the same function, so no need for doing
>>>>> it twice.
>>>>>
>>>>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>>>>
>>>> Hmmm.  I don't recall how jdata actually works, but I get the impression
>>>> here that we're trying to flush dirty data out to the journal and then
>>>> out to disk, and then drop the JDATA state from the inode.  This
>>>> mechanism exists (I guess?) so that dirty file pages get checkpointed
>>>> out of jbd2 back into the filesystem so that bmap() returns meaningful
>>>> results to lilo.
>>>
>>> Exactly. E.g. when we are journalling data, we fill hole through mmap, we will
>>> have block allocated as unwritten and we need to write it out so that the
>>> data gets to the journal and then do journal flush to get the data to disk
>>
>> So in data=journal case in ext4_page_mkwrite the data buffer will also
>> be marked as, to be journalled. So does jbd2_journal_flush() itself
>> don't take care of writing back any dirty page cache before it commit
>> that transaction? and after then checkpoint it?
> 
> Er... this sentence is a little garbled, but I think the answer you're
> looking for is:
> 
> "Yes, writeback (i.e. filemap_write_and_wait) attaches the dirty blocks
> to a journal transaction; then jbd2_journal_flush forces the transaction
> data out to the on-disk journal; and it also checkpoints the journal so
> that the dirty blocks are then written back into the filesystem."

Yes. Thanks.


> 
>> Sorry my knowledge about jbd2 is very naive.
>>
>>> so that lilo can read it from the devices. So removing
>>> filemap_write_and_wait() when journalling data is wrong.
>>
>> Sure I understand this part. But was just curious on above query.
>> Otherwise, IIUC, we will have to add
>> filemap_write_and_wait() for JDATA case as well before calling
>> for jbd2_journal_flush(). Will add this as a separate patch.
> 
> Well you could just move it...
> 
> bmap()
> {
> 	/*
> 	 * In data=journal mode, we must checkpoint the journal to
> 	 * ensure that any dirty blocks in the journal are checkpointed
> 	 * to the location that we return to userspace.  Clear JDATA so
> 	 * that future writes will not be written through the journal.
> 	 */
> 	if (JDATA) {
> 		filemap_write_and_wait(...);
> 		clear JDATA
> 		jbd2_journal_flush(...);
> 	}
> 
> 	return iomap_bmap(...);
> }
> 

> (or did "Will add this as a separate patch" refer to fixing FIEMAP?)
No.

What I meant was that if filemap_write_and_wait() is required for the JDATA
case, then I will add the above diff which you just showed as a separate
patch before moving ext4_bmap() to use iomap_bmap(), i.e. rather than
clubbing it with patch 3, it will be a separate patch before patch 3.

Sorry about the confusion.

-ritesh

> 
> --D
> 
>>
>> -ritesh
>>
>>>
>>>> This makes me wonder if you still need the filemap_write_and_wait in the
>>>> JDATA case because otherwise the journal flush won't have the effect of
>>>> writing all the dirty pagecache back to the filesystem?  OTOH I suppose
>>>> the implicit write-and-wait call after we clear JDATA will not be
>>>> writing to the journal.
>>>>
>>>> Even more weirdly, the FIEMAP code doesn't drop JDATA at all...?
>>>
>>> Yeah, it should do that but that's only performance optimization so that we
>>> bother with journal flushing only when someone uses block mapping call on
>>> a file with journalled dirty data. So you can hardly notice the bug by
>>> testing...
>>>
>>> 								Honza
>>>
>>
Theodore Ts'o March 13, 2020, 8:16 p.m. UTC | #10
On Fri, Feb 28, 2020 at 02:56:56PM +0530, Ritesh Harjani wrote:
> ext4_iomap_begin is already implemented which provides ext4_map_blocks,
> so just move the API from generic_block_bmap to iomap_bmap for iomap
> conversion.
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Applied, thanks.

					- Ted
Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6cf3b969dc86..81fccbae0aea 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3214,7 +3214,7 @@  static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
 			return 0;
 	}
 
-	return generic_block_bmap(mapping, block, ext4_get_block);
+	return iomap_bmap(mapping, block, &ext4_iomap_ops);
 }
 
 static int ext4_readpage(struct file *file, struct page *page)