Message ID: <C37CE24B-A6E7-4B65-B0C4-4965B3C45007@dilger.ca>
On Fri, Jun 23, 2017 at 2:36 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jun 23, 2017, at 6:26 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> The problem is that if we continue, successive reads may all take
>> seconds or minutes to fail, thus tying up the process for a long
>> time.
>
> Sorry, I don't understand where the seconds or minutes of delay come from?
> Is that because of long SCSI retries in the block layer, or in the disk
> itself, or something caused specifically because of this code?

For a networked block device we may be retrying for a while before giving
up, although this also applies to the initial failed read.

>> By returning EIO right away, we can "fast fail".
>
> But it seems like you don't necessarily need to fail at all?  Something
> like the following would return an error if the entry is not found, but
> still search the rest of the leaf blocks (if any) before giving up:

Giving up early or checking the remaining blocks both work; the critical
thing here is not returning NULL after seeing a read error.  Prior to this
patch, the behavior was to continue checking later blocks after a read
error, and that seemed OK.

Khazhy
On Fri, Jun 23, 2017 at 03:33:46PM -0700, Khazhismel Kumykov wrote:
>
> Giving up early or checking future blocks both work, critical thing
> here is not returning NULL after seeing a read error.
> Previously to this the behavior was to continue to check future blocks
> after a read error, and it seemed OK.

Whether or not it is OK probably depends on how big the directory is.
If we need to suffer through N long error retries, whether caused by
long SCSI error retries or long iSCSI error retries, sooner or later
it's going to be problematic if the process which is taking forever to
search through the whole directory is watched by some kind of health
monitoring service or other watchdog timer.

Still, I agree that there will be some cases where, instead of "fast
fail", having the file server try as hard as possible to fetch the file
from the failing disk is worthwhile.  I tend to be focused on the
cluster file system case where, if it's going to take several hundred
milliseconds to fetch the file, you're better off getting one of the
replicated copies from another server, or starting the Reed-Solomon
reconstruction.  However, if you have an architecture where the only
copy of the file is on that particular file server (perhaps because you
are depending on RAID instead of n=3 replication or Reed-Solomon erasure
codes), having the file server try as hard as possible to find the file
is a good thing.

I wonder if the right answer is to have "fastfail" and "nofastfail"
mount options.

					- Ted
On Jun 23, 2017, at 5:26 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Jun 23, 2017 at 03:33:46PM -0700, Khazhismel Kumykov wrote:
>>
>> Giving up early or checking future blocks both work, critical thing
>> here is not returning NULL after seeing a read error.
>> Previously to this the behavior was to continue to check future blocks
>> after a read error, and it seemed OK.
>
> Whether or not it is OK probably depends on how big the directory is.
> If we need to suffer through N long error retries, whether it is
> caused by long SCSI error retries, or long iSCSI error retries, sooner
> or later it's going to be problematic if the process which is taking
> forever to search through the whole directory has some kind of health
> monitoring service or other watchdog timer.

I think this is a problem regardless of what is being done by the
filesystem; basically, if the block device is broken then there will be
a lot of retries and/or errors.  I agree it doesn't make sense to return
a benign error like "ENOENT" if there are IO errors.

> Still, I agree that there will be some cases where instead of "fast
> fail", having the file server try as hard as possible to fetch the file
> from the failing disk is worthwhile.  I tend to be focused on the
> cluster file system case where if it's going to take several hundred
> milliseconds to fetch the file, you're better off getting it from one
> of the replicated copies on another server, or starting the
> Reed-Solomon reconstruction.

Sure, but that is a problem independent of the readdir case, I think?

> However, if you have an
> architecture where the only copy of the file is on the particular file
> server (perhaps because you are depending on RAID instead of n=3
> replication or Reed-Solomon erasure codes), having the file server try
> as hard as possible to find the file is a good thing.
>
> I wonder if the right answer is to have "fastfail" and "nofastfail"
> mount options.
Wouldn't it just make sense to mount the filesystem with
"errors=remount-ro" or "errors=panic" in your case, where you can give
up on a single node easily if it detects device-level errors, rather
than "errors=continue" as you seem to have currently?  This is what we
do in HA environments: fail the storage over to a backup server in case
the problem is with the node, SCSI cards, cables, etc. and not the disk
(disabling automatic failback to avoid node ping-pong if there is
actually a media error).

Cheers, Andreas
On Fri, Jun 23, 2017 at 05:34:23PM -0600, Andreas Dilger wrote:
>
> Sure, but that is a problem independent of the readdir case I think?

This is the lookup case, not the readdir case....

> Wouldn't it just make sense to mount the filesystem with "errors=remount-ro"
> or "errors=panic" in your case, where you can give up on a single node
> easily if it detects device-level errors, rather than "errors=continue" as
> it seems you currently have?  This is what we do in HA environments, and
> fail the storage over to a backup server in case the problem is with the
> node, SCSI cards, cables, etc. and not the disk (preventing further automatic
> failback to prevent node ping-pong if there is actually a media error).

"errors=remount-ro" doesn't really help when it ends up tying up the
process doing the lookup for long periods of time.  "errors=panic" isn't
helpful if you have a large number of file systems mounted.  The system
might have dozens of disks mounted (as individual file systems) on a
file server, or there might be even more networked block devices mounted
on a container-oriented job server.  So you really don't want to blow
away a node with a reboot just because errors are found on a single disk
or network device.

In such an environment it might be far more appropriate to return an
error to userspace with dispatch, and trust userspace to take the
appropriate recovery steps.

					- Ted
On Thu, Jun 22, 2017 at 4:23 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
> -			/* read error, skip block & hope for the best */
> 			EXT4_ERROR_INODE(dir, "reading directory lblock %lu",
> 					 (unsigned long) block);
> 			brelse(bh);
> -			goto next;
> +			ret = ERR_PTR(-EIO);
> +			goto cleanup_and_exit;

EXT4_ERROR_INODE() triggers ext4_handle_error(), which performs error
handling.  I think it should be removed here because we are returning
the error to the caller, and there is nothing more drastic about this
error than the other error return paths in the same function.
> On Jun 26, 2017, at 1:22 PM, Tahsin Erdogan <tahsin@google.com> wrote:
>
> On Thu, Jun 22, 2017 at 4:23 PM, Khazhismel Kumykov <khazhy@google.com> wrote:
>> -			/* read error, skip block & hope for the best */
>> 			EXT4_ERROR_INODE(dir, "reading directory lblock %lu",
>> 					 (unsigned long) block);
>> 			brelse(bh);
>> -			goto next;
>> +			ret = ERR_PTR(-EIO);
>> +			goto cleanup_and_exit;
>
> EXT4_ERROR_INODE() triggers ext4_handle_error() which performs error
> handling.  I think it should be removed here because we are returning
> the error to the caller and there is nothing more drastic about this
> error than other error return paths in the same function.

The issue is that this represents filesystem corruption, since the IO
error in the leaf block means some files cannot be looked up, and new
files cannot be created in this directory.  Running e2fsck will repair
this problem, so ext4_handle_error() marks the filesystem in error so
that e2fsck will be run on the next restart and the error is not lost
(otherwise it is only returned to the application).

The sysadmin already has the option to choose the behaviour of
ext4_handle_error(): "continue" (which seems to be what you are already
running), "remount-ro" (what we are using), or "panic" the node (which
some HA systems use).  To my reading, this patch only applies to the
"continue" case, and it is good to make this handling more robust.

Similar work was done with the block and inode bitmaps
(EXT4_{MB_GRP,GROUP_INFO}_BBITMAP_CORRUPT and
EXT4_{MB_GRP,GROUP_INFO}_IBITMAP_CORRUPT) to skip allocations in those
groups if an error is detected.

Cheers, Andreas
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b81f7d4..c75b6dc 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1434,6 +1434,8 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
 			EXT4_ERROR_INODE(dir, "reading directory lblock %lu",
 					 (unsigned long) block);
 			brelse(bh);
+			/* if entry isn't found in a later block, return an error */
+			ret = ERR_PTR(-EIO);
 			goto next;
 		}
 		if (!buffer_verified(bh) &&