diff mbox

ubifs: Add new mount option to force fdatasync before rename

Message ID 560984B4.7090105@codeaurora.org
State Changes Requested
Headers show

Commit Message

Nikhilesh Reddy Sept. 28, 2015, 6:19 p.m. UTC
The rename operation in UBIFS is synchronous (or nearly synchronous)
while the write operation is not. This can result in zero length files when
renaming of files followed by an abrupt power down or a crash.

For example:
1) Say a file a.txt exists with size 1KB.
2) Create a file b.tmp (open)
3) Update the data in b.tmp with new values (write and close)
4) rename b.tmp to a.txt
5) Abrupt power down or crash

This above scenario can result in a.txt becoming a file of zero length and
giving the impression of a.txt being truncated.
This scenario can ofcourse be prevented by calling fsync or fdatasync
before the rename operation.

There are many applications and libraries that do not call fsync or
fdatasync since they were tested on EXT4 which has a hack to handle
the above case.

Add a new mount option of sync_before_rename which would implicitly
sync the data before renaming the file to help address cases where the
rename cases need to be handled implicitly and prevent the zero length
files as a result of a rename.

Change-Id: I4e8771d40307543b532df7f46bd87864f0d3d294
Signed-off-by: Nikhilesh Reddy <reddyn@codeaurora.org>
---
  fs/ubifs/dir.c   | 29 +++++++++++++++++++++++++++++
  fs/ubifs/super.c | 17 +++++++++++++++++
  fs/ubifs/ubifs.h |  5 +++++
  3 files changed, 51 insertions(+)

@enext, and
   *             @calc_idx_sz
@@ -1275,6 +1279,7 @@ struct ubifs_info {
  	unsigned int bulk_read:1;
  	unsigned int default_compr:2;
  	unsigned int rw_incompat:1;
+	unsigned int sync_before_rename:1;

  	struct mutex tnc_mutex;
  	struct ubifs_zbranch zroot;

Comments

Richard Weinberger Sept. 28, 2015, 7:38 p.m. UTC | #1
Hi!

Am 28.09.2015 um 20:19 schrieb Nikhilesh Reddy:
> The rename operation in UBIFS is synchronous (or nearly synchronous)
> while the write operation is not. This can result in zero length files when
> renaming of files followed by an abrupt power down or a crash.
> 
> For example:
> 1) Say a file a.txt exists with size 1KB.
> 2) Create a file b.tmp (open)
> 3) Update the data in b.tmp with new values (write and close)
> 4) rename b.tmp to a.txt
> 5) Abrupt power down or crash
> 
> This above scenario can result in a.txt becoming a file of zero length and
> giving the impression of a.txt being truncated.
> This scenario can ofcourse be prevented by calling fsync or fdatasync
> before the rename operation.
> 
> There are many applications and libraries that do not call fsync or
> fdatasync since they were tested on EXT4 which has a hack to handle
> the above case.

...and these applications are buggy by design.
ext4 has some hacks to detect some misuses (IIRC replace by rename and replace by truncate)
as these applications worked by chance on ext3.
Adding such a hack now to UBIFS needs a bit more justification.
Especially as your new mount option is a sledgehammer.

Which application triggers this issue?
I'm asking because UBIFS is more or less an embedded filesystem.
On ext4 mostly broken GUI programs like eclipse or kwrite forgot to fsync().

> Add a new mount option of sync_before_rename which would implicitly
> sync the data before renaming the file to help address cases where the
> rename cases need to be handled implicitly and prevent the zero length
> files as a result of a rename.
> 
> Change-Id: I4e8771d40307543b532df7f46bd87864f0d3d294

What is that?

Thanks,
//richard
Nikhilesh Reddy Sept. 28, 2015, 8:38 p.m. UTC | #2
On Mon 28 Sep 2015 12:38:29 PM PDT, Richard Weinberger wrote:
> Hi!
>
> Am 28.09.2015 um 20:19 schrieb Nikhilesh Reddy:
>> The rename operation in UBIFS is synchronous (or nearly synchronous)
>> while the write operation is not. This can result in zero length files when
>> renaming of files followed by an abrupt power down or a crash.
>>
>> For example:
>> 1) Say a file a.txt exists with size 1KB.
>> 2) Create a file b.tmp (open)
>> 3) Update the data in b.tmp with new values (write and close)
>> 4) rename b.tmp to a.txt
>> 5) Abrupt power down or crash
>>
>> This above scenario can result in a.txt becoming a file of zero length and
>> giving the impression of a.txt being truncated.
>> This scenario can ofcourse be prevented by calling fsync or fdatasync
>> before the rename operation.
>>
>> There are many applications and libraries that do not call fsync or
>> fdatasync since they were tested on EXT4 which has a hack to handle
>> the above case.
>
> ...and these applications are buggy by design.
> ext4 has some hacks to detect some misuses (IIRC replace by rename and replace by truncate)
> as these applications worked by chance on ext3.
> Adding such a hack now to UBIFS needs a bit more justification.
> Especially as your new mount option is a sledgehammer.
>
> Which application triggers this issue?
> I'm asking because UBIFS is more or less an embedded filesystem.
> On ext4 mostly broken GUI programs like eclipse or kwrite forgot to fsync().

Thanks Richard for lookign into this patch.
I completely agree with you on the fact that these applications are 
indeed buggy.
But yes the issues were seen on embedded systems.
We saw this issue when debugging a few applications that used an xml 
parser library.
to write  data.
There were a few other applications as well but i dont have access to 
their source.
Fixing all the applications is not exactly feasible since they may have 
bugs in multiple places.
And sometimes we dont have a legal go ahead to fix code that is from 
thirdparties who may never fix their code... or just distribute a s 
binaries.
This change was made due to multiple requests that came from our 
customers who ran into this issue on the applications that they run on 
their products.

We could  use "-o sync" mount option. But this makes UBIFS perform 
badly that just syncing the old inode data alone.
The idea was to have a mount point option that could be enabled only as 
needed and taking a performance hit during a rename.
All the tests showed no real performance degradation.
Since it would be disabled by default the normal mount without this 
would have no impact what so ever to the current behavior.
Only on filesystems that are mounted with this option will this new 
behavior kick in.

Please do consider applying the patch.
If you have any suggestions on improving this patch to you liking 
please do let me know and I am happy to make any chances that you deem 
necessary.
Thanks again for you consideration.



>
>> Add a new mount option of sync_before_rename which would implicitly
>> sync the data before renaming the file to help address cases where the
>> rename cases need to be handled implicitly and prevent the zero length
>> files as a result of a rename.
>>
>> Change-Id: I4e8771d40307543b532df7f46bd87864f0d3d294
>
> What is that?
Oops this was a leftover from an internal change id... can get rid of 
it.
Sorry about that.

>
> Thanks,
> //richard



--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,
a Linux Foundation Collaborative Project.
Richard Weinberger Sept. 28, 2015, 8:49 p.m. UTC | #3
Hi!

Am 28.09.2015 um 22:38 schrieb Nikhilesh Reddy:
>> ...and these applications are buggy by design.
>> ext4 has some hacks to detect some misuses (IIRC replace by rename and replace by truncate)
>> as these applications worked by chance on ext3.
>> Adding such a hack now to UBIFS needs a bit more justification.
>> Especially as your new mount option is a sledgehammer.
>>
>> Which application triggers this issue?
>> I'm asking because UBIFS is more or less an embedded filesystem.
>> On ext4 mostly broken GUI programs like eclipse or kwrite forgot to fsync().
> 
> Thanks Richard for lookign into this patch.
> I completely agree with you on the fact that these applications are indeed buggy.
> But yes the issues were seen on embedded systems.
> We saw this issue when debugging a few applications that used an xml parser library.
> to write  data.

libxml2?

> There were a few other applications as well but i dont have access to their source.
> Fixing all the applications is not exactly feasible since they may have bugs in multiple places.
> And sometimes we dont have a legal go ahead to fix code that is from thirdparties who may never fix their code... or just distribute a s binaries.
> This change was made due to multiple requests that came from our customers who ran into this issue on the applications that they run on their products.
> 
> We could  use "-o sync" mount option. But this makes UBIFS perform badly that just syncing the old inode data alone.
> The idea was to have a mount point option that could be enabled only as needed and taking a performance hit during a rename.
> All the tests showed no real performance degradation.

Hmm, I'd have assumed that programs with heavy rename() usage would degrade.

> Since it would be disabled by default the normal mount without this would have no impact what so ever to the current behavior.
> Only on filesystems that are mounted with this option will this new behavior kick in.
> 
> Please do consider applying the patch.
> If you have any suggestions on improving this patch to you liking please do let me know and I am happy to make any chances that you deem necessary.

Instead of making all rename() synchronous it would be a good start to detect broken
patterns like ext4 does.
I'll happily test and review those. :-)

Thanks,
//richard
Nikhilesh Reddy Sept. 29, 2015, 5:04 p.m. UTC | #4
On Mon 28 Sep 2015 01:49:28 PM PDT, Richard Weinberger wrote:
> Hi!
>
> Am 28.09.2015 um 22:38 schrieb Nikhilesh Reddy:
>>> ...and these applications are buggy by design.
>>> ext4 has some hacks to detect some misuses (IIRC replace by rename and replace by truncate)
>>> as these applications worked by chance on ext3.
>>> Adding such a hack now to UBIFS needs a bit more justification.
>>> Especially as your new mount option is a sledgehammer.
>>>
>>> Which application triggers this issue?
>>> I'm asking because UBIFS is more or less an embedded filesystem.
>>> On ext4 mostly broken GUI programs like eclipse or kwrite forgot to fsync().
>>
>> Thanks Richard for lookign into this patch.
>> I completely agree with you on the fact that these applications are indeed buggy.
>> But yes the issues were seen on embedded systems.
>> We saw this issue when debugging a few applications that used an xml parser library.
>> to write  data.
>
> libxml2?
No i believe its pugixml (not sure which version) or its variant.
>
>> There were a few other applications as well but i dont have access to their source.
>> Fixing all the applications is not exactly feasible since they may have bugs in multiple places.
>> And sometimes we dont have a legal go ahead to fix code that is from thirdparties who may never fix their code... or just distribute a s binaries.
>> This change was made due to multiple requests that came from our customers who ran into this issue on the applications that they run on their products.
>>
>> We could  use "-o sync" mount option. But this makes UBIFS perform badly that just syncing the old inode data alone.
>> The idea was to have a mount point option that could be enabled only as needed and taking a performance hit during a rename.
>> All the tests showed no real performance degradation.
>
> Hmm, I'd have assumed that programs with heavy rename() usage would degrade.
>
>> Since it would be disabled by default the normal mount without this would have no impact what so ever to the current behavior.
>> Only on filesystems that are mounted with this option will this new behavior kick in.
>>
>> Please do consider applying the patch.
>> If you have any suggestions on improving this patch to you liking please do let me know and I am happy to make any chances that you deem necessary.
>
> Instead of making all rename() synchronous it would be a good start to detect broken
> patterns like ext4 does.
> I'll happily test and review those. :-)

Unfortunately major changes would not be completely up to me :(
But before i spend time on it ... i was hoping to understand the 
disadvantages of this approach and the advantages of the other approach?
Can you kindly help me understand?

So the patch effectively converts all the rename to a fdatasync+rename 
which is what we would expect all the userspace applications to do 
anyway.
If the application already calls fdatasync then nothing new occurs as 
part of the rename. If it doesnt then we perform an implicit fdatasync.
So only the broken paths ( which do not fdatasync) would be impacted.

Hopefully that is ok?
If enough time had passed and the data is already written to flash 
before the rename then again nothing additional happens.
Hopefully i did not miss something obvious.
Please do correct me if i missed something or made wrong assumptions.

>
> Thanks,
> //richard



--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,
a Linux Foundation Collaborative Project.
Richard Weinberger Sept. 29, 2015, 5:52 p.m. UTC | #5
Hi!

Am 29.09.2015 um 19:04 schrieb Nikhilesh Reddy:
> Unfortunately major changes would not be completely up to me :(

hm?

> But before i spend time on it ... i was hoping to understand the disadvantages of this approach and the advantages of the other approach?
> Can you kindly help me understand?

Sure.
Instead of making rename() unconditionally synchronous we could do it like ext4 and detect rename+close patterns and make only
these synchronous.

So, your application does:
fd = open("foo", ...)
write(fd, "yadda yadda, ...)
close(fd)
rename("foo", "bar")

and then expects "bar" to exist with contents "yadda yadda"?

I'm asking because we also have to make sure that this is not an UBIFS bug.
We have to be careful, ext4 has a metadata journal, UBIFS' is different.

My point is that UBIFS better should detect the close+rename pattern then only then do rename synchronous.

> So the patch effectively converts all the rename to a fdatasync+rename which is what we would expect all the userspace applications to do anyway.

But there are also cases where you want a fast rename() and sync later.

Artem, what do you think?

Thanks,
//richard
Richard Weinberger Oct. 2, 2015, 9:38 p.m. UTC | #6
Hi!

Am 28.09.2015 um 20:19 schrieb Nikhilesh Reddy:
> The rename operation in UBIFS is synchronous (or nearly synchronous)
> while the write operation is not. This can result in zero length files when
> renaming of files followed by an abrupt power down or a crash.
> 
> For example:
> 1) Say a file a.txt exists with size 1KB.
> 2) Create a file b.tmp (open)
> 3) Update the data in b.tmp with new values (write and close)
> 4) rename b.tmp to a.txt
> 5) Abrupt power down or crash
> 
> This above scenario can result in a.txt becoming a file of zero length and
> giving the impression of a.txt being truncated.
> This scenario can ofcourse be prevented by calling fsync or fdatasync
> before the rename operation.

I gave this a try and hacked up something to emulate a powercut *exactly* after
rename() in UBIFS.

fd = open("b.tmp", ...)
write(fd, "foo", ...)
close(fd)
rename("b.tmp", "a.txt")
^---- powercut

After remounting UBIFS both a.txt and b.tmp are present
but b.tmp is truncated. Not a.txt as you said.

Can you please double check?
I want to make sure that we're talking about the same things.

Thanks,
//richard
Nikhilesh Reddy Oct. 5, 2015, 6:05 p.m. UTC | #7
On 10/02/2015 02:38 PM, Richard Weinberger wrote:
> Hi!
>
> Am 28.09.2015 um 20:19 schrieb Nikhilesh Reddy:
>> The rename operation in UBIFS is synchronous (or nearly synchronous)
>> while the write operation is not. This can result in zero length files when
>> renaming of files followed by an abrupt power down or a crash.
>>
>> For example:
>> 1) Say a file a.txt exists with size 1KB.
>> 2) Create a file b.tmp (open)
>> 3) Update the data in b.tmp with new values (write and close)
>> 4) rename b.tmp to a.txt
>> 5) Abrupt power down or crash
>>
>> This above scenario can result in a.txt becoming a file of zero length and
>> giving the impression of a.txt being truncated.
>> This scenario can ofcourse be prevented by calling fsync or fdatasync
>> before the rename operation.
>
> I gave this a try and hacked up something to emulate a powercut *exactly* after
> rename() in UBIFS.
>
> fd = open("b.tmp", ...)
> write(fd, "foo", ...)
> close(fd)
> rename("b.tmp", "a.txt")
> ^---- powercut
>
> After remounting UBIFS both a.txt and b.tmp are present
> but b.tmp is truncated. Not a.txt as you said.
>
> Can you please double check?
> I want to make sure that we're talking about the same things.

Since you mentioned a.txt and b.tmp are both present... i assume the 
file a.txt was present even before b.tmp was created?

I will try and explain as to what i understand the situation to be.

If both the files are present then the rename didnt actually get written 
to the device and was probably still in the internal ubifs write buffer.
I believe there is a small delay between the rename call and the inodes
being updated on the the device from the internal ubifs write buffer.

The scenario i described above seems to occur when the inode update is 
committed to the device... i.e here the b.tmp should not exist since the 
rename was successfully written but the file data writeback (that is in 
the page cache) has not yet been committed to the device.
Since the writeback buffer is way smaller than the page cache the inode 
update occurs first or is likely to have.


Hopefully i did not mess up on my understanding or explanation.
Please do correct me if i messed up.

>
> Thanks,
> //richard
>
Richard Weinberger Oct. 6, 2015, 8:09 a.m. UTC | #8
Am 05.10.2015 um 20:05 schrieb Nikhilesh Reddy:
> On 10/02/2015 02:38 PM, Richard Weinberger wrote:
>> Hi!
>>
>> Am 28.09.2015 um 20:19 schrieb Nikhilesh Reddy:
>>> The rename operation in UBIFS is synchronous (or nearly synchronous)
>>> while the write operation is not. This can result in zero length files when
>>> renaming of files followed by an abrupt power down or a crash.
>>>
>>> For example:
>>> 1) Say a file a.txt exists with size 1KB.
>>> 2) Create a file b.tmp (open)
>>> 3) Update the data in b.tmp with new values (write and close)
>>> 4) rename b.tmp to a.txt
>>> 5) Abrupt power down or crash
>>>
>>> This above scenario can result in a.txt becoming a file of zero length and
>>> giving the impression of a.txt being truncated.
>>> This scenario can ofcourse be prevented by calling fsync or fdatasync
>>> before the rename operation.
>>
>> I gave this a try and hacked up something to emulate a powercut *exactly* after
>> rename() in UBIFS.
>>
>> fd = open("b.tmp", ...)
>> write(fd, "foo", ...)
>> close(fd)
>> rename("b.tmp", "a.txt")
>> ^---- powercut
>>
>> After remounting UBIFS both a.txt and b.tmp are present
>> but b.tmp is truncated. Not a.txt as you said.
>>
>> Can you please double check?
>> I want to make sure that we're talking about the same things.
> 
> Since you mentioned a.txt and b.tmp are both present... i assume the file a.txt was present even before b.tmp was created?

Yes.

> I will try and explain as to what i understand the situation to be.
> 
> If both the files are present then the rename didnt actually get written to the device and was probably still in the internal ubifs write buffer.

A rename operation does not trigger a commit, therefore a powercut directly after rename() would make the rename() void.
In this context "both files present" means a.txt and b.tmp exist and are both synched to disk?

> I believe there is a small delay between the rename call and the inodes
> being updated on the the device from the internal ubifs write buffer.
> 
> The scenario i described above seems to occur when the inode update is committed to the device... i.e here the b.tmp should not exist since the rename was successfully written but
> the file data writeback (that is in the page cache) has not yet been committed to the device.
> Since the writeback buffer is way smaller than the page cache the inode update occurs first or is likely to have.
> 
> 
> Hopefully i did not mess up on my understanding or explanation.

Can you please share a reproducer?
A simple sequence of syscall would also do it.

Thanks,
//richard
Nikhilesh Reddy Oct. 13, 2015, 6:41 p.m. UTC | #9
On Tue 06 Oct 2015 01:09:22 AM PDT, Richard Weinberger wrote:
> Am 05.10.2015 um 20:05 schrieb Nikhilesh Reddy:
>> On 10/02/2015 02:38 PM, Richard Weinberger wrote:
>>> Hi!
>>>
>>> Am 28.09.2015 um 20:19 schrieb Nikhilesh Reddy:
>>>> The rename operation in UBIFS is synchronous (or nearly synchronous)
>>>> while the write operation is not. This can result in zero length files when
>>>> renaming of files followed by an abrupt power down or a crash.
>>>>
>>>> For example:
>>>> 1) Say a file a.txt exists with size 1KB.
>>>> 2) Create a file b.tmp (open)
>>>> 3) Update the data in b.tmp with new values (write and close)
>>>> 4) rename b.tmp to a.txt
>>>> 5) Abrupt power down or crash
>>>>
>>>> This above scenario can result in a.txt becoming a file of zero length and
>>>> giving the impression of a.txt being truncated.
>>>> This scenario can ofcourse be prevented by calling fsync or fdatasync
>>>> before the rename operation.
>>>
>>> I gave this a try and hacked up something to emulate a powercut *exactly* after
>>> rename() in UBIFS.
>>>
>>> fd = open("b.tmp", ...)
>>> write(fd, "foo", ...)
>>> close(fd)
>>> rename("b.tmp", "a.txt")
>>> ^---- powercut
>>>
>>> After remounting UBIFS both a.txt and b.tmp are present
>>> but b.tmp is truncated. Not a.txt as you said.
>>>
>>> Can you please double check?
>>> I want to make sure that we're talking about the same things.
>>
>> Since you mentioned a.txt and b.tmp are both present... i assume the file a.txt was present even before b.tmp was created?
>
> Yes.
>
>> I will try and explain as to what i understand the situation to be.
>>
>> If both the files are present then the rename didnt actually get written to the device and was probably still in the internal ubifs write buffer.
>
> A rename operation does not trigger a commit, therefore a powercut directly after rename() would make the rename() void.
> In this context "both files present" means a.txt and b.tmp exist and are both synched to disk?
>
>> I believe there is a small delay between the rename call and the inodes
>> being updated on the the device from the internal ubifs write buffer.
>>
>> The scenario i described above seems to occur when the inode update is committed to the device... i.e here the b.tmp should not exist since the rename was successfully written but
>> the file data writeback (that is in the page cache) has not yet been committed to the device.
>> Since the writeback buffer is way smaller than the page cache the inode update occurs first or is likely to have.
>>
>>
>> Hopefully i did not mess up on my understanding or explanation.
>
> Can you please share a reproducer?
> A simple sequence of syscall would also do it.
>
> Thanks,
> //richard

Sorry for the delay in my reply
I got tied up...

as for the reproducer... its exactly as i described in the commit 
message... though we performed the power reset after a bit of delay. it 
does take a few tries on our end to reproduce... so we have it on a 
loop until it is reproduced.

I Will definitely  send you more concrete steps once i have a bit of 
time.

--
Thanks
Nikhilesh Reddy

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,
a Linux Foundation Collaborative Project.
Richard Weinberger Oct. 15, 2015, 7:16 p.m. UTC | #10
Am 13.10.2015 um 20:41 schrieb Nikhilesh Reddy:
> On Tue 06 Oct 2015 01:09:22 AM PDT, Richard Weinberger wrote:
>> Am 05.10.2015 um 20:05 schrieb Nikhilesh Reddy:
>>> On 10/02/2015 02:38 PM, Richard Weinberger wrote:
>>>> Hi!
>>>>
>>>> Am 28.09.2015 um 20:19 schrieb Nikhilesh Reddy:
>>>>> The rename operation in UBIFS is synchronous (or nearly synchronous)
>>>>> while the write operation is not. This can result in zero length files when
>>>>> renaming of files followed by an abrupt power down or a crash.
>>>>>
>>>>> For example:
>>>>> 1) Say a file a.txt exists with size 1KB.
>>>>> 2) Create a file b.tmp (open)
>>>>> 3) Update the data in b.tmp with new values (write and close)
>>>>> 4) rename b.tmp to a.txt
>>>>> 5) Abrupt power down or crash
>>>>>
>>>>> This above scenario can result in a.txt becoming a file of zero length and
>>>>> giving the impression of a.txt being truncated.
>>>>> This scenario can ofcourse be prevented by calling fsync or fdatasync
>>>>> before the rename operation.
>>>>
>>>> I gave this a try and hacked up something to emulate a powercut *exactly* after
>>>> rename() in UBIFS.
>>>>
>>>> fd = open("b.tmp", ...)
>>>> write(fd, "foo", ...)
>>>> close(fd)
>>>> rename("b.tmp", "a.txt")
>>>> ^---- powercut
>>>>
>>>> After remounting UBIFS both a.txt and b.tmp are present
>>>> but b.tmp is truncated. Not a.txt as you said.
>>>>
>>>> Can you please double check?
>>>> I want to make sure that we're talking about the same things.
>>>
>>> Since you mentioned a.txt and b.tmp are both present... i assume the file a.txt was present even before b.tmp was created?
>>
>> Yes.
>>
>>> I will try and explain as to what i understand the situation to be.
>>>
>>> If both the files are present then the rename didnt actually get written to the device and was probably still in the internal ubifs write buffer.
>>
>> A rename operation does not trigger a commit, therefore a powercut directly after rename() would make the rename() void.
>> In this context "both files present" means a.txt and b.tmp exist and are both synched to disk?
>>
>>> I believe there is a small delay between the rename call and the inodes
>>> being updated on the the device from the internal ubifs write buffer.
>>>
>>> The scenario i described above seems to occur when the inode update is committed to the device... i.e here the b.tmp should not exist since the rename was successfully written but
>>> the file data writeback (that is in the page cache) has not yet been committed to the device.
>>> Since the writeback buffer is way smaller than the page cache the inode update occurs first or is likely to have.
>>>
>>>
>>> Hopefully i did not mess up on my understanding or explanation.
>>
>> Can you please share a reproducer?
>> A simple sequence of syscall would also do it.
>>
>> Thanks,
>> //richard
> 
> Sorry for the delay in my reply
> I got tied up...

No big deal.

> as for the reproducer... its exactly as i described in the commit message... though we performed the power reset after a bit of delay. it does take a few tries on our end to
> reproduce... so we have it on a loop until it is reproduced.
> 
> I Will definitely  send you more concrete steps once i have a bit of time.

Please do so.
As I said, if I do exactly what you wrote, as expected b.tmp will be truncated but the
already synced file a.txt stays.

Thanks,
//richard
diff mbox

Patch

diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 5083036..7943291 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -968,6 +968,24 @@  static void unlock_3_inodes(struct inode *inode1, 
struct inode *inode2,
  	mutex_unlock(&ubifs_inode(inode1)->ui_mutex);
  }

+static int force_sync_inode_data(struct ubifs_info *c, struct inode *inode)
+{
+	int err;
+
+	ubifs_assert(mutex_is_locked(&inode->i_mutex));
+
+	err = filemap_write_and_wait_range(inode->i_mapping, 0, LLONG_MAX);
+	if (err) {
+		ubifs_err(c, "filemap_write_and_wait_range failed with %d",
+				err);
+		return err;
+	}
+	err = ubifs_sync_wbufs_by_inode(c, inode);
+	if (err)
+		ubifs_err(c, "ubifs_sync_wbufs_by_inode failed with %d", err);
+	return err;
+}
+
  static int ubifs_rename(struct inode *old_dir, struct dentry *old_dentry,
  			struct inode *new_dir, struct dentry *new_dentry)
  {
@@ -1020,6 +1038,16 @@  static int ubifs_rename(struct inode *old_dir, 
struct dentry *old_dentry,
  		return err;
  	}

+	/* Before renaming, make sure old_inode is synced to disc if the
+	 * mount option is selected. */
+	if (c->sync_before_rename) {
+		err = force_sync_inode_data(c, old_inode);
+		if (err) {
+			ubifs_err(c, "old_inode data sync failed for rename");
+			goto out;
+		}
+	}
+
  	lock_3_inodes(old_dir, new_dir, new_inode);

  	/*
@@ -1129,6 +1157,7 @@  out_cancel:
  		}
  	}
  	unlock_3_inodes(old_dir, new_dir, new_inode);
+out:
  	ubifs_release_budget(c, &ino_req);
  	ubifs_release_budget(c, &req);
  	return err;
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 4fa829f..f185894 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -443,6 +443,11 @@  static int ubifs_show_options(struct seq_file *s, 
struct dentry *root)
  			   ubifs_compr_name(c->mount_opts.compr_type));
  	}

+	if (c->mount_opts.sync_before_rename == 2)
+		seq_puts(s, ",sync_before_rename");
+	else if (c->mount_opts.sync_before_rename == 1)
+		seq_puts(s, ",no_sync_before_rename");
+
  	return 0;
  }

@@ -928,6 +933,8 @@  enum {
  	Opt_chk_data_crc,
  	Opt_no_chk_data_crc,
  	Opt_override_compr,
+	Opt_sync_before_rename,
+	Opt_no_sync_before_rename,
  	Opt_err,
  };

@@ -939,6 +946,8 @@  static const match_table_t tokens = {
  	{Opt_chk_data_crc, "chk_data_crc"},
  	{Opt_no_chk_data_crc, "no_chk_data_crc"},
  	{Opt_override_compr, "compr=%s"},
+	{Opt_sync_before_rename, "sync_before_rename"},
+	{Opt_no_sync_before_rename, "no_sync_before_rename"},
  	{Opt_err, NULL},
  };

@@ -1039,6 +1048,14 @@  static int ubifs_parse_options(struct ubifs_info 
*c, char *options,
  			c->default_compr = c->mount_opts.compr_type;
  			break;
  		}
+		case Opt_sync_before_rename:
+			c->mount_opts.sync_before_rename = 2;
+			c->sync_before_rename = 1;
+			break;
+		case Opt_no_sync_before_rename:
+			c->mount_opts.sync_before_rename = 1;
+			c->sync_before_rename = 0;
+			break;
  		default:
  		{
  			unsigned long flag;
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index de75902..c2adf9c 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -942,6 +942,8 @@  struct ubifs_orphan {
   *                  specified in @compr_type)
   * @compr_type: compressor type to override the superblock compressor with
   *              (%UBIFS_COMPR_NONE, etc)
+ * @sync_before_rename: Sync the data associated with the old inode before
+			the rename.(%0 default, %1 disable, %2 enabled)
   */
  struct ubifs_mount_opts {
  	unsigned int unmount_mode:2;
@@ -949,6 +951,7 @@  struct ubifs_mount_opts {
  	unsigned int chk_data_crc:2;
  	unsigned int override_compr:1;
  	unsigned int compr_type:2;
+	unsigned int sync_before_rename:2;
  };

  /**
@@ -1034,6 +1037,7 @@  struct ubifs_debug_info;
   * @bulk_read: enable bulk-reads
   * @default_compr: default compression algorithm (%UBIFS_COMPR_LZO, etc)
   * @rw_incompat: the media is not R/W compatible
+ * @sync_before_rename: Perform an fdatasync on the old inode before 
rename.
   *
   * @tnc_mutex: protects the Tree Node Cache (TNC), @zroot, @cnext,