ext4: Return EIO on read error in ext4_find_entry

Submitted by Andreas Dilger on June 23, 2017, 9:36 p.m.

Details

Message ID C37CE24B-A6E7-4B65-B0C4-4965B3C45007@dilger.ca
State New
Headers show

Commit Message

Andreas Dilger June 23, 2017, 9:36 p.m.
On Jun 23, 2017, at 6:26 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Fri, Jun 23, 2017 at 12:33:02AM -0600, Andreas Dilger wrote:
>> On Jun 22, 2017, at 22:43, Theodore Ts'o <tytso@mit.edu> wrote:
>>> 
>>>> On Thu, Jun 22, 2017 at 04:23:07PM -0700, Khazhismel Kumykov wrote:
>>>> Previously, a read error would be ignored and we would eventually return
>>>> NULL from ext4_find_entry, which signals "no such file or directory". We
>>>> should be returning EIO.
>>>> 
>>>> Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
>>> 
>>> Thanks, applied.
>> 
>> I don't necessarily agree that this is an improvement.
>> 
>> If the requested entry is not in the bad block, this will return an
>> error even if the file name could be found in another block. It
>> would be better to save the error until the end and only return -EIO
>> if the entry cannot be found.
> 
> The problem is that if we continue, successive reads may all take
> seconds or minutes to fail, thus tieing up the process for a long
> time.

Sorry, I don't understand where the seconds or minutes of delay come from?
Is that because of long SCSI retries in the block layer, or in the disk
itself, or something caused specifically because of this code?

This is in the non-htree lookup path, so it is already doing a linear
directory scan to find the entry.  If this is a non-htree directory because
the directory is only a single block in size, then saving the EIO to return
at "the end" is identical to returning it immediately.  If it is using this
codepath because it is a large non-htree directory, then there is a real
chance that the entry can be found in another block, and this will cause the application to fail trying to find the file when it doesn't need to.

Since this is after EXT4_ERROR_INODE() then the filesystem must be mounted
with "errors=continue", so I'd think in that case the filesystem should try
to continue in the face of errors.

> If this process happens to be, say, the node's Kubernetes
> management server it can take down the entire node (since if there is
> a watchdog daemon which assumes that if the management server is down,
> it's time to kill and reset the entire node), and the file system is,
> say, a network file system error which should only kill the individual
> job, and not the entire node, the results could be quite unfortunate.

Sorry, I don't understand your example.  It isn't clear where the error is
coming from?  Is this a media error or I don't understand how this could be
a network filesystem error?  To my reading, the patch removing the ability
to skip a bad directory leaf block would _induce_ an error rather than avoid it.

> By returning EIO right away, we can "fast fail".

But it seems like you don't necessarily need to fail at all?  Something like the
following would return an error if the entry is not found, but still search the
rest of the leaf blocks (if any) before giving up:


Cheers, Andreas

Comments

Khazhismel Kumykov June 23, 2017, 10:33 p.m.
On Fri, Jun 23, 2017 at 2:36 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jun 23, 2017, at 6:26 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> The problem is that if we continue, successive reads may all take
>> seconds or minutes to fail, thus tieing up the process for a long
>> time.
>
> Sorry, I don't understand where the seconds or minutes of delay come from?
> Is that because of long SCSI retries in the block layer, or in the disk
> itself, or something caused specifically because of this code?

For a networked block device we may be retrying for a while before
giving up, although this also applies to the initial failed read.

>
>> By returning EIO right away, we can "fast fail".
>
> But it seems like you don't necessarily need to fail at all?  Something like the
> following would return an error if the entry is not found, but still search the
> rest of the leaf blocks (if any) before giving up:
>

Giving up early or checking future blocks both work, critical thing
here is not returning NULL after seeing a read error.
Previously to this the behavior was to continue to check future blocks
after a read error, and it seemed OK.

Khazhy
Theodore Ts'o June 23, 2017, 11:26 p.m.
On Fri, Jun 23, 2017 at 03:33:46PM -0700, Khazhismel Kumykov wrote:
> 
> Giving up early or checking future blocks both work, critical thing
> here is not returning NULL after seeing a read error.
> Previously to this the behavior was to continue to check future blocks
> after a read error, and it seemed OK.

Whether or not it is OK probably depends on how big the directory is.
If we need to suffer through N long error retries, whether it is
caused by long SCSI error retries, or long iSCSI error retries, sooner
or later it's going to be problematic if the process which is taking
forever to search through the whole directory has a some kind health
monitoring service or other watchdog timer.

Still, I agree that there will be some cases where instead of "Fast
fail", having the file server try as hard as possible fetch the file
from the failing disk is worthwhile.  I tend to be focused on the
cluster file system case where if it's going to several hundred
milliseconds to fetch the file, you're better off getting it from the
one other replicated copies from another server, or start the
reed-solomon reconstruction from.  However, if you have an
architecture where the only copy of the file is on the particular file
server (perhaps because you are depending on RAID instead of n=3
replication or reed-solomon erasure codes), having the file server try
as hard as possible to find the file is a good thing.

I wonder if the right answer is to have "fastfail" and "nofastfail"
mount option.

					- Ted
Andreas Dilger June 23, 2017, 11:34 p.m.
On Jun 23, 2017, at 5:26 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Fri, Jun 23, 2017 at 03:33:46PM -0700, Khazhismel Kumykov wrote:
>> 
>> Giving up early or checking future blocks both work, critical thing
>> here is not returning NULL after seeing a read error.
>> Previously to this the behavior was to continue to check future blocks
>> after a read error, and it seemed OK.
> 
> Whether or not it is OK probably depends on how big the directory is.
> If we need to suffer through N long error retries, whether it is
> caused by long SCSI error retries, or long iSCSI error retries, sooner
> or later it's going to be problematic if the process which is taking
> forever to search through the whole directory has a some kind health
> monitoring service or other watchdog timer.

I think this is a problem regardless of what is being done by the filesystem,
basically if the block device is broken then there will be a lot of retries
and/or errors.  I agree it doesn't make sense to return a benign error like
"ENOENT" if there are IO errors.

> Still, I agree that there will be some cases where instead of "Fast
> fail", having the file server try as hard as possible fetch the file
> from the failing disk is worthwhile.  I tend to be focused on the
> cluster file system case where if it's going to several hundred
> milliseconds to fetch the file, you're better off getting it from the
> one other replicated copies from another server, or start the
> reed-solomon reconstruction from.

Sure, but that is a problem independent of the readdir case I think?

> However, if you have an
> architecture where the only copy of the file is on the particular file
> server (perhaps because you are depending on RAID instead of n=3
> replication or reed-solomon erasure codes), having the file server try
> as hard as possible to find the file is a good thing.
> 
> I wonder if the right answer is to have "fastfail" and "nofastfail"
> mount option.

Wouldn't it just make sense to mount the filesystem with "errors=remount-ro"
or "errors=panic" in your case, where you can give up on a single node
easily if it detects device-level errors, rather than "errors=continue" as
it seems you currently have?  This is what we do in HA environments, and
fail the storage over to a backup server in case the problem is with the
node, SCSI cards, cables, etc. and not the disk (preventing further automatic
failback to prevent node ping-pong if there is actually a media error).

Cheers, Andreas
Theodore Ts'o June 24, 2017, 12:24 a.m.
On Fri, Jun 23, 2017 at 05:34:23PM -0600, Andreas Dilger wrote:
> 
> Sure, but that is a problem independent of the readdir case I think?

This is lookup case not the readdir case....

> Wouldn't it just make sense to mount the filesystem with "errors=remount-ro"
> or "errors=panic" in your case, where you can give up on a single node
> easily if it detects device-level errors, rather than "errors=continue" as
> it seems you currently have?  This is what we do in HA environments, and
> fail the storage over to a backup server in case the problem is with the
> node, SCSI cards, cables, etc. and not the disk (preventing further automatic
> failback to prevent node ping-pong if there is actually a media error).

"errors=remount-ro" doesn't really help when it ends up tieing up
process doing the lookup for long periods of time.

"errors=panic" isn't helpful if you have a large number of file
systems mounted.  The system might have dozens of disks mounted (as
individual file systems) on a file server, or there might be even more
networked-block devices mounted on a container-oriented job server.
So you really don't want to blow away a node with a reboot just
because errors are found on a single device or network device.

In such an environment it might be far more appropriate to return an
error to userspace with dispatch, and trust userspace to take the
appropriate recovery steps.

					- Ted
Tahsin Erdogan June 26, 2017, 7:22 p.m.
On Thu, Jun 22, 2017 at 4:23 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
> -                       /* read error, skip block & hope for the best */
>                         EXT4_ERROR_INODE(dir, "reading directory lblock %lu",
>                                          (unsigned long) block);
>                         brelse(bh);
> -                       goto next;
> +                       ret = ERR_PTR(-EIO);
> +                       goto cleanup_and_exit;

EXT4_ERROR_INODE() triggers ext4_handle_error() which performs error
handling. I think it should be removed here because we are returning
the error to the caller and there is nothing more drastic about this
error than other error return paths in the same function.
Andreas Dilger June 26, 2017, 8:45 p.m.
> On Jun 26, 2017, at 1:22 PM, Tahsin Erdogan <tahsin@google.com> wrote:
> 
> On Thu, Jun 22, 2017 at 4:23 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
>> -                       /* read error, skip block & hope for the best */
>>                        EXT4_ERROR_INODE(dir, "reading directory lblock %lu",
>>                                         (unsigned long) block);
>>                        brelse(bh);
>> -                       goto next;
>> +                       ret = ERR_PTR(-EIO);
>> +                       goto cleanup_and_exit;
> 
> EXT4_ERROR_INODE() triggers ext4_handle_error() which performs error
> handling. I think it should be removed here because we are returning
> the error to the caller and there is nothing more drastic about this
> error than other error return paths in the same function.

The issue is that this represents filesystem corruption, since the
IO error in the leaf block means some files cannot be looked up, and
new files cannot be created in this directory.  Running e2fsck will
repair this problem, so ext4_handle_error() will mark the filesystem
in error so that e2fsck will be run on the next restart and this error
is not lost because it is only returned to the application.

The sysadmin already has the option to declare the behaviour of
ext4_handle_error() - "continue" (which seems to what you are already
running), "remount-ro" (what we are using), or "panic" the node (which
some HA systems use).

To my reading, this patch only applies to the "continue" case, and it
is good to make this handling more robust.  Similar work was done with
the block and inode bitmaps (EXT4_{MB_GRP,GROUP_INFO}_BBITMAP_CORRUPT
and EXT4_{MB_GRP,GROUP_INFO}_IBITMAP_CORRUPT) to skip allocations in
those groups if an error is detected.

Cheers, Andreas

Patch hide | download patch | download mbox

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b81f7d4..c75b6dc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1434,6 +1434,8 @@  static struct buffer_head * ext4_find_entry (struct inode *dir,
 			EXT4_ERROR_INODE(dir, "reading directory lblock %lu",
 					 (unsigned long) block);
 			brelse(bh);
+			/* if entry isn't found in a later block, return an error */
+			ret = ERR_PTR(-EIO);
 			goto next;
 		}
 		if (!buffer_verified(bh) &&