Message ID | 20170817092153.GA14074@quack2.suse.cz |
---|---|
State | Superseded, archived |
On Aug 17, 2017, at 3:21 AM, Jan Kara <jack@suse.cz> wrote:
>
> On Thu 17-08-17 11:19:59, Jan Kara wrote:
>> Hi Shilong!
>>
>> On Thu 17-08-17 06:23:26, Wang Shilong wrote:
>>> thanks for good suggestion, just one question we could not hold lock
>>> with nojounal mode, how about something attached one?
>>>
>>> please let me know if you have better taste for it, much appreciated!
>>
>> Thanks for quickly updating the patch! Is the only reason why you cannot
>> hold the lock in the nojournal mode that sb_getblk() might sleep? The
>> attached patch should fix that so that you don't have to special-case the
>> nojournal mode anymore.
>
> Forgot to attach the patch - here it is. Feel free to include it in your
> series as a preparatory patch.

Strange, I never even knew recently_deleted() existed, even though it was
added to the tree 4 years ago yesterday. It looks like this is only used
with the no-journal code, which I don't really interact with.

One thing I did notice when looking at it is that there is a Y2038 bug in
recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
get_seconds(). To fix this, it would be possible to either use a wrapped
32-bit comparison, like time_after() for jiffies, something like:

    u32 now, dtime;

    /* assume dtime is within the past 30 years, see time_after() */
    now = get_seconds();
    if (dtime && (dtime - now < 0) && (dtime + recentcy - now < 0))
        ret = 1;

or use i_ctime_extra to implicitly extend i_dtime beyond 2038, something like:

    /* assume dtime epoch same as ctime, see EXT4_INODE_GET_XTIME() */
    dtime = le32_to_cpu(raw_inode->i_dtime);
    if (EXT4_INODE_SIZE(sb) > EXT4_GOOD_OLD_INODE_SIZE &&
        offsetof(typeof(*raw_inode), i_ctime_extra) + 4 <=
        EXT4_GOOD_OLD_INODE_SIZE + le32_to_cpu(raw_inode->i_extra_isize))
        dtime += (long)(le32_to_cpu(raw_inode->i_ctime_extra) &
                        EXT4_EPOCH_MASK) << 32;

Cheers, Andreas
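The arithmetic behind the second approach (borrowing the epoch bits of
i_ctime_extra to widen i_dtime) can be sketched in user-space C. This is
only an illustration, not kernel code: the names sketch_extend_dtime,
SKETCH_EPOCH_BITS and SKETCH_EPOCH_MASK are hypothetical, the 2-bit epoch
width is an assumption taken to mirror ext4's epoch field, and the sketch
skips the inode-size checks shown above. It does the shift on a 64-bit
value, which stays well defined where long is only 32 bits wide.

```c
#include <stdint.h>

/* Assumed values for this sketch, meant to mirror ext4's 2-bit epoch field. */
#define SKETCH_EPOCH_BITS 2
#define SKETCH_EPOCH_MASK ((1u << SKETCH_EPOCH_BITS) - 1)

/*
 * Hypothetical helper: widen a 32-bit on-disk dtime with the epoch bits
 * taken from a ctime "extra" field. The epoch is held in a 64-bit
 * variable before shifting, so the shift by 32 is always valid.
 */
static uint64_t sketch_extend_dtime(uint32_t dtime32, uint32_t ctime_extra)
{
    uint64_t epoch = ctime_extra & SKETCH_EPOCH_MASK;

    return (uint64_t)dtime32 + (epoch << 32);
}
```

With an epoch of 0 the value is unchanged; each epoch step adds 2^32
seconds. The result is only meaningful if dtime and ctime really fall in
the same epoch, which is the assumption stated in the quoted comment.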
> Strange, I never even knew recently_deleted() existed, even though it was
> added to the tree 4 years ago yesterday. It looks like this is only used
> with the no-journal code, which I don't really interact with.
>
> One thing I did notice when looking at it is that there is a Y2038 bug in
> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
> get_seconds().

I don't think dtime has widened on the disk layout for ext4 according to
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am not sure
how fixing the internal implementation would be useful until we do that.
Is there a plan for that?

As far as get_seconds() is concerned, get_seconds() returns unsigned long
which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch. Since dtime
variable is declared as unsigned long in this function, same holds for the
size of this variable.

There is no y2038 problem on a 64 bit machine.

So moving to the case of a 32 bit machine:

get_seconds() can return values until year 2106. And, recentcy at max can
only be 35. Analyzing the current line:

    if (dtime && (dtime < now) && (now < dtime + recentcy))

The above equation should work fine at least until 35 seconds before
y2038 deadline.

-Deepa
On Fri, Aug 18, 2017 at 3:23 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>> Strange, I never even knew recently_deleted() existed, even though it was
>> added to the tree 4 years ago yesterday. It looks like this is only used
>> with the no-journal code, which I don't really interact with.
>>
>> One thing I did notice when looking at it is that there is a Y2038 bug in
>> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
>> get_seconds().
>
> I don't think dtime has widened on the disk layout for ext4 according
> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
> not sure how fixing the internal implementation would be useful until
> we do that. Is there a plan for that?
>
> As far as get_seconds() is concerned, get_seconds() returns unsigned
> long which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch.
> Since dtime variable is declared as unsigned long in this function,
> same holds for the size of this variable.
>
> There is no y2038 problem on a 64 bit machine.

I think what Andreas was saying is that it's actually the opposite:
on a 32-bit machine, the code will work correctly for 32-bit unsigned
long values as long as 'dtime' and 'now' are in the same epoch,
e.g. both are before 2106 or both are after.
On 64-bit systems it's always wrong after 2106.

> So moving to the case of a 32 bit machine:
>
> get_seconds() can return values until year 2106. And, recentcy at max
> can only be 35. Analyzing the current line:
>
> if (dtime && (dtime < now) && (now < dtime + recentcy))
>
> The above equation should work fine at least until 35 seconds before
> y2038 deadline.

Since it's all unsigned arithmetic, it should be fine until 2106.
However, we should get rid of get_seconds() long before then
and use ktime_get_real_seconds() instead, as most other users
of get_seconds() are (more) broken.

Looking at the two suggested approaches:

>> u32 now, dtime;
>>
>> /* assume dtime is within the past 30 years, see time_after() */
>> now = get_seconds();
>> if (dtime && (dtime - now < 0) && (dtime + recentcy - now < 0))
>> ret = 1;

* As 'dtime' and 'now' are both unsigned, subtracting them will also
  result in an unsigned value that is never less than zero, so it
  won't work. Adding a cast to 's32' would fix that the same way
  that time_after() does.
* please use ktime_get_real_seconds() instead of get_seconds(), so
  we don't have to replace it later.
* The comment should say '68 years', not 30.

> or use i_ctime_extra to implicitly extend i_dtime beyond 2038, something like:
>
> /* assume dtime epoch same as ctime, see EXT4_INODE_GET_XTIME() */
> dtime = le32_to_cpu(raw_inode->i_dtime);
> if (EXT4_INODE_SIZE(sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> offsetof(typeof(*raw_inode), i_ctime_extra) + 4 <=
> EXT4_GOOD_OLD_INODE_SIZE + le32_to_cpu(raw_inode->i_extra_isize))
> dtime += (long)(le32_to_cpu(raw_inode->i_ctime_extra) &
> EXT4_EPOCH_MASK) << 32;

* This is slightly incorrect when we are close to the epoch boundary,
  as i_ctime and i_dtime might end up being in different epochs. I
  would not go there.
* If we were to pick this approach, a cast to 'long' is obviously wrong
  on 32-bit systems, better use 'u64' or 'time64_t'.

Arnd
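The first bullet on the wrapped comparison can be illustrated with a small
user-space sketch. The function names are hypothetical and uint32_t/int32_t
stand in for the kernel's u32/s32; the sketch only shows the difference the
cast makes, it is not the proposed patch.

```c
#include <stdint.h>

/*
 * Difference of two unsigned 32-bit timestamps, left unsigned.
 * The result wraps modulo 2^32 and can never be negative, so a
 * "< 0" test on it can never be true.
 */
static uint32_t sketch_diff_u32(uint32_t dtime, uint32_t now)
{
    return dtime - now;
}

/*
 * Same difference reinterpreted as signed, the trick time_after() uses
 * for jiffies. Negative means dtime is earlier than now, provided the
 * two are less than 2^31 seconds (roughly 68 years) apart.
 */
static int32_t sketch_diff_s32(uint32_t dtime, uint32_t now)
{
    return (int32_t)(dtime - now);
}
```

For a deletion 10 seconds in the past, the unsigned difference is a huge
positive number while the signed one is -10. The signed form also gives a
small negative value when 'now' has wrapped past 2^32 and 'dtime' has not.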
On Thu, Aug 17, 2017 at 06:23:26PM -0700, Deepa Dinamani wrote:
>
> I don't think dtime has widened on the disk layout for ext4 according
> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
> not sure how fixing the internal implementation would be useful until
> we do that. Is there a plan for that?

The dtime field is not visible to the user; it's mostly for debugging
purposes. For debugfs we are just using i_ctime_extra to compose the
time. (Perhaps we should be using i_mtime_extra, or the max of the
ctime, mtime, and atime extra fields; but it's not really that
important.)

The issue which Andreas pointed out is the only place where we actually
use the dtime field, and that's so we can avoid re-using a freshly
deleted inode until at least N seconds have gone by in no-journal mode.
That's because if we don't, there are some unfortunate effects that can
take place if we crash and not all of the metadata gets updated. Even
after running e2fsck -fy, we can end up having a directory or an
immutable file show up where ntp or timed expects to find a time
adjustment file, or some such, that can cause various system daemons to
crash and burn because they aren't expecting to find a file at a
particular pathname they own which they can't delete.

There are a number of ways we could solve it; one is to just use a new
in-memory variable which can be 64-bits wide. This burns an extra 8
bytes for each inode in the inode cache, which is why we didn't do that.

It doesn't really have to be super exact; if we actually have an inode
that avoids getting reused for 136 years (2**32 seconds), it will have
disappeared from the in-memory inode cache. We just need something
which is valid for N seconds after the deletion time. (I think we may
have upped N to a larger value on our data center kernels --- 300
seconds if I recall correctly --- because there were some edge cases
where 35 seconds wasn't enough.)

- Ted
On Fri, Aug 18, 2017 at 2:31 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Fri, Aug 18, 2017 at 3:23 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>>> Strange, I never even knew recently_deleted() existed, even though it was
>>> added to the tree 4 years ago yesterday. It looks like this is only used
>>> with the no-journal code, which I don't really interact with.
>>>
>>> One thing I did notice when looking at it is that there is a Y2038 bug in
>>> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
>>> get_seconds().
>>
>> I don't think dtime has widened on the disk layout for ext4 according
>> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
>> not sure how fixing the internal implementation would be useful until
>> we do that. Is there a plan for that?
>>
>> As far as get_seconds() is concerned, get_seconds() returns unsigned
>> long which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch.
>> Since dtime variable is declared as unsigned long in this function,
>> same holds for the size of this variable.
>>
>> There is no y2038 problem on a 64 bit machine.
>
> I think what Andreas was saying is that it's actually the opposite:
> on a 32-bit machine, the code will work correctly for 32-bit unsigned
> long values as long as 'dtime' and 'now' are in the same epoch,
> e.g. both are before 2106 or both are after.
> On 64-bit systems it's always wrong after 2106.

There is some confusion here.
I was only referring to the current implementation:

static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
{
    ...
    unsigned long dtime, now;
    int offset, ret = 0, recentcy = RECENTCY_MIN;
    ...
    offset = (ino % inodes_per_block) * EXT4_INODE_SIZE(sb);
    raw_inode = (struct ext4_inode *) (bh->b_data + offset);
    dtime = le32_to_cpu(raw_inode->i_dtime);
    now = get_seconds();
    if (buffer_dirty(bh))
        recentcy += RECENTCY_DIRTY;

    if (dtime && (dtime < now) && (now < dtime + recentcy))
        ret = 1;
    ...
}

In the above implementation, I do not see any problem on a 64 bit machine.
The only problem is that dtime on disk representation is signed 32 bits only.
If that were not a problem then this would be fine from time perspective.

On 32 bit machine, dtime on disk representation again prevents it from
being able to represent times beyond 2038 unless one of the approaches
Ted mentioned is used to extend/interpret it.

>> So moving to the case of a 32 bit machine:
>>
>> get_seconds() can return values until year 2106. And, recentcy at max
>> can only be 35. Analyzing the current line:
>>
>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>
>> The above equation should work fine at least until 35 seconds before
>> y2038 deadline.
>
> Since it's all unsigned arithmetic, it should be fine until 2106.
> However, we should get rid of get_seconds() long before then
> and use ktime_get_real_seconds() instead, as most other users
> of get_seconds() are (more) broken.

Dtime on disk representation again breaks this for certain values in
2038 even though everything is unsigned.

I was just saying that whatever we do here depends on how dtime on
disk is interpreted.

Agree that ktime_get_real_seconds() should be used here. But, the way
we handle new values would rely on this new interpretation of dtime.
Also, using time64_t variables on stack only matters after this. Once
the types are corrected, maybe the comparison expression need not
change at all (after new dtime interpretation is in place).

Let me know if I am missing something here.

-Deepa
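The disagreement above about 32-bit versus 64-bit machines comes down to
how wide 'dtime' and 'now' are when the existing comparison runs. A
user-space model of just that arithmetic is sketched below. The function
names are hypothetical, and the model assumes the on-disk i_dtime holds
the low 32 bits of the deletion time, read back as an unsigned value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Existing comparison, modelled with a 64-bit unsigned long. */
static bool sketch_check_ulong64(uint32_t disk_dtime, uint64_t now,
                                 uint32_t recentcy)
{
    uint64_t dtime = disk_dtime;    /* zero-extended, high bits lost */

    return dtime && dtime < now && now < dtime + recentcy;
}

/* Existing comparison, modelled with a 32-bit unsigned long. */
static bool sketch_check_ulong32(uint32_t disk_dtime, uint64_t now64,
                                 uint32_t recentcy)
{
    uint32_t dtime = disk_dtime;
    uint32_t now = (uint32_t)now64; /* the clock value itself wraps */

    return dtime && dtime < now && now < dtime + recentcy;
}
```

Under these assumptions both models agree for present-day timestamps. For
an inode deleted 10 seconds ago once the clock is past 2^32 seconds, the
64-bit model reports "not recent" because 'now' keeps its high bits while
'dtime' does not, whereas the 32-bit model still reports "recent". Only
around the wrap itself (deleted just before 2^32, checked just after) do
the roles swap.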
> On Aug 18, 2017, at 9:38 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>
> On Fri, Aug 18, 2017 at 2:31 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Fri, Aug 18, 2017 at 3:23 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>>>
>>>> One thing I did notice when looking at it is that there is a Y2038 bug in
>>>> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
>>>> get_seconds().
>>>
>>> I don't think dtime has widened on the disk layout for ext4 according
>>> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
>>> not sure how fixing the internal implementation would be useful until
>>> we do that. Is there a plan for that?
>>>
>>> As far as get_seconds() is concerned, get_seconds() returns unsigned
>>> long which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch.
>>> Since dtime variable is declared as unsigned long in this function,
>>> same holds for the size of this variable.
>>>
>>> There is no y2038 problem on a 64 bit machine.
>>
>> I think what Andreas was saying is that it's actually the opposite:
>> on a 32-bit machine, the code will work correctly for 32-bit unsigned
>> long values as long as 'dtime' and 'now' are in the same epoch,
>> e.g. both are before 2106 or both are after.
>> On 64-bit systems it's always wrong after 2106.
>
> There is some confusion here.
> I was only referring to the current implementation:
>
> static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
> {
>     ...
>     unsigned long dtime, now;
>     int offset, ret = 0, recentcy = RECENTCY_MIN;
>     ...
>     offset = (ino % inodes_per_block) * EXT4_INODE_SIZE(sb);
>     raw_inode = (struct ext4_inode *) (bh->b_data + offset);
>     dtime = le32_to_cpu(raw_inode->i_dtime);
>     now = get_seconds();
>     if (buffer_dirty(bh))
>         recentcy += RECENTCY_DIRTY;
>
>     if (dtime && (dtime < now) && (now < dtime + recentcy))
>         ret = 1;
>     ...
> }
>
> In the above implementation, I do not see any problem on a 64 bit machine.
> The only problem is that dtime on disk representation is signed 32 bits only.
> If that were not a problem then this would be fine from time perspective.

The 32-bit dtime is the root of the problem. There is no plan to extend
the dtime field on disk, because it is used so little (mostly as a boolean
value, and for forensics).

>>> So moving to the case of a 32 bit machine:
>>>
>>> get_seconds() can return values until year 2106. And, recentcy at max
>>> can only be 35. Analyzing the current line:
>>>
>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>>
>>> The above equation should work fine at least until 35 seconds before
>>> y2038 deadline.
>>
>> Since it's all unsigned arithmetic, it should be fine until 2106.
>> However, we should get rid of get_seconds() long before then
>> and use ktime_get_real_seconds() instead, as most other users
>> of get_seconds() are (more) broken.
>
> Dtime on disk representation again breaks this for certain values in
> 2038 even though everything is unsigned.
>
> I was just saying that whatever we do here depends on how dtime on
> disk is interpreted.
>
> Agree that ktime_get_real_seconds() should be used here. But, the way
> we handle new values would rely on this new interpretation of dtime.
> Also, using time64_t variables on stack only matters after this. Once
> the types are corrected, maybe the comparison expression need not
> change at all (after new dtime interpretation is in place).

There will not be a new dtime format on disk, but since the calculation
here only depends on relative times (within a few minutes), then it would
be fine to use only 32-bit timestamps, and truncate off the high bits
from get_seconds()/ktime_get_real_seconds().

Cheers, Andreas
On Fri, Aug 18, 2017 at 6:09 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>
>>>> So moving to the case of a 32 bit machine:
>>>>
>>>> get_seconds() can return values until year 2106. And, recentcy at max
>>>> can only be 35. Analyzing the current line:
>>>>
>>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>>>
>>>> The above equation should work fine at least until 35 seconds before
>>>> y2038 deadline.
>>>
>>> Since it's all unsigned arithmetic, it should be fine until 2106.
>>> However, we should get rid of get_seconds() long before then
>>> and use ktime_get_real_seconds() instead, as most other users
>>> of get_seconds() are (more) broken.
>>
>> Dtime on disk representation again breaks this for certain values in
>> 2038 even though everything is unsigned.
>>
>> I was just saying that whatever we do here depends on how dtime on
>> disk is interpreted.
>>
>> Agree that ktime_get_real_seconds() should be used here. But, the way
>> we handle new values would rely on this new interpretation of dtime.
>> Also, using time64_t variables on stack only matters after this. Once
>> the types are corrected, maybe the comparison expression need not
>> change at all (after new dtime interpretation is in place).
>
> There will not be a new dtime format on disk, but since the calculation
> here only depends on relative times (within a few minutes), then it would
> be fine to use only 32-bit timestamps, and truncate off the high bits
> from get_seconds()/ktime_get_real_seconds().

Agreed.

Are you planning to apply your fix for it then? I think your first
suggestion is all we need, aside from the three minor comments
I had.

Arnd
On Aug 22, 2017, at 9:18 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Aug 18, 2017 at 6:09 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>
>>>>> So moving to the case of a 32 bit machine:
>>>>>
>>>>> get_seconds() can return values until year 2106. And, recentcy at max
>>>>> can only be 35. Analyzing the current line:
>>>>>
>>>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>>>>
>>>>> The above equation should work fine at least until 35 seconds before
>>>>> y2038 deadline.
>>>>
>>>> Since it's all unsigned arithmetic, it should be fine until 2106.
>>>> However, we should get rid of get_seconds() long before then
>>>> and use ktime_get_real_seconds() instead, as most other users
>>>> of get_seconds() are (more) broken.
>>>
>>> Dtime on disk representation again breaks this for certain values in
>>> 2038 even though everything is unsigned.
>>>
>>> I was just saying that whatever we do here depends on how dtime on
>>> disk is interpreted.
>>>
>>> Agree that ktime_get_real_seconds() should be used here. But, the way
>>> we handle new values would rely on this new interpretation of dtime.
>>> Also, using time64_t variables on stack only matters after this. Once
>>> the types are corrected, maybe the comparison expression need not
>>> change at all (after new dtime interpretation is in place).
>>
>> There will not be a new dtime format on disk, but since the calculation
>> here only depends on relative times (within a few minutes), then it would
>> be fine to use only 32-bit timestamps, and truncate off the high bits
>> from get_seconds()/ktime_get_real_seconds().
>
> Agreed.
>
> Are you planning to apply your fix for it then? I think your first
> suggestion is all we need, aside from the three minor comments
> I had.

Do you think it is worthwhile to introduce a "time_after32()" helper for this?
I suspect that this will also be useful for other parts of the kernel that
deal with relative 32-bit timestamps.

Cheers, Andreas
On Tue, Aug 22, 2017 at 6:20 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Aug 22, 2017, at 9:18 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Fri, Aug 18, 2017 at 6:09 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>>
>>>>>> So moving to the case of a 32 bit machine:
>>>>>>
>>>>>> get_seconds() can return values until year 2106. And, recentcy at max
>>>>>> can only be 35. Analyzing the current line:
>>>>>>
>>>>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>>>>>
>>>>>> The above equation should work fine at least until 35 seconds before
>>>>>> y2038 deadline.
>>>>>
>>>>> Since it's all unsigned arithmetic, it should be fine until 2106.
>>>>> However, we should get rid of get_seconds() long before then
>>>>> and use ktime_get_real_seconds() instead, as most other users
>>>>> of get_seconds() are (more) broken.
>>>>
>>>> Dtime on disk representation again breaks this for certain values in
>>>> 2038 even though everything is unsigned.
>>>>
>>>> I was just saying that whatever we do here depends on how dtime on
>>>> disk is interpreted.
>>>>
>>>> Agree that ktime_get_real_seconds() should be used here. But, the way
>>>> we handle new values would rely on this new interpretation of dtime.
>>>> Also, using time64_t variables on stack only matters after this. Once
>>>> the types are corrected, maybe the comparison expression need not
>>>> change at all (after new dtime interpretation is in place).
>>>
>>> There will not be a new dtime format on disk, but since the calculation
>>> here only depends on relative times (within a few minutes), then it would
>>> be fine to use only 32-bit timestamps, and truncate off the high bits
>>> from get_seconds()/ktime_get_real_seconds().
>>
>> Agreed.
>>
>> Are you planning to apply your fix for it then? I think your first
>> suggestion is all we need, aside from the three minor comments
>> I had.
>
> Do you think it is worthwhile to introduce a "time_after32()" helper for this?
> I suspect that this will also be useful for other parts of the kernel that
> deal with relative 32-bit timestamps.

I can't think of any other one at the moment. The RTC code may need
a similar check somewhere but it's more likely that they want something
slightly different.

No objections to introducing a time_after32() from my side if only for
documentation purposes, but we probably won't use it elsewhere.

Arnd
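A helper along the lines discussed might look like the user-space sketch
below. It is modelled on the time_after() idiom and is not taken from any
posted patch: the macro bodies, the sketch_recently_deleted() function and
its parameters are assumptions for illustration, with uint32_t/int32_t in
place of u32/s32 and a plain 64-bit 'now64' argument standing in for the
value returned by ktime_get_real_seconds().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrap-safe ordering of two 32-bit timestamps (sketch). Valid while the
 * two values are less than 2^31 seconds (roughly 68 years) apart.
 */
#define time_after32(a, b)  ((int32_t)((uint32_t)(b) - (uint32_t)(a)) < 0)
#define time_before32(b, a) time_after32(a, b)

/*
 * Hypothetical stand-in for the test in recently_deleted(): true when
 * dtime lies in the past but less than recentcy seconds ago.
 */
static bool sketch_recently_deleted(uint32_t dtime, int64_t now64,
                                    uint32_t recentcy)
{
    uint32_t now = (uint32_t)now64; /* keep only the low 32 bits */

    return dtime && time_before32(dtime, now) &&
           time_before32(now, dtime + recentcy);
}
```

Because only the low 32 bits of both values are compared, the result does
not depend on the width of unsigned long, and it stays correct when 'now'
has wrapped past 2^32 while the stored 'dtime' has not.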
From c9e9550fe6e2a7e498c1a8b709b570f4c5ed8e2b Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 17 Aug 2017 11:07:10 +0200
Subject: [PATCH] ext4: Do not unnecessarily allocate buffer in recently_deleted()

In recently_deleted() function we want to check whether inode is still
cached in buffer cache. Use sb_find_get_block() for that instead of
sb_getblk() to avoid unnecessary allocation of bdev page and buffer
heads.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ialloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 507bfb3344d4..0d03e73dccaf 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -707,9 +707,9 @@ static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
 	if (unlikely(!gdp))
 		return 0;
 
-	bh = sb_getblk(sb, ext4_inode_table(sb, gdp) +
+	bh = sb_find_get_block(sb, ext4_inode_table(sb, gdp) +
 		       (ino / inodes_per_block));
-	if (unlikely(!bh) || !buffer_uptodate(bh))
+	if (!bh || !buffer_uptodate(bh))
 		/*
 		 * If the block is not in the buffer cache, then it
 		 * must have been written out.
-- 
2.12.3