Message ID | 20200810204856.2111211-3-adhemerval.zanella@linaro.org |
---|---|
State | New |
Headers | show |
Series | [1/3] stdlib: Use fixed buffer size for realpath (BZ #26241) | expand |
On 8/10/20 1:48 PM, Adhemerval Zanella wrote: > Regarding syscalls usage, for a sucessful path without symlinks it > trades 2 syscalls (getcwd/lstat) for 5 (openat, readlink, fstat, stat, > and close). Thanks for looking into this. There's no need for the new __resolve_path function to call __lstat64. It does that only to determine whether the file is a symlink or a directory, and it can do the former with __readlink (which it's gonna do already) and the latter by going into the next iteration and seeing whether it gets NOTDIR. Avoiding __lstat64 means we don't have to worry about irrelevant EOVERFLOW hassles. In the Linux __realpath_system, the code should be prepared for the /proc syscall to fail because /proc is not mounted. It can fall back on __resolve_path in that case. The code should be prepared for the fstat64 to fail due to EOVERFLOW. And I'm not quite getting why the code is comparing fstat64's result to stat64's result -- perhaps explain this in a comment and how EOVERFLOW is dealt with?
On Aug 10 2020, Adhemerval Zanella via Libc-alpha wrote: > This optimizes the stack usage for success case (from ~8K to ~4k) and > where 'resolved' input buffer is not provided. For ithe failure case s/ithe/the/ Andreas.
On 10/08/2020 18:25, Paul Eggert wrote: > On 8/10/20 1:48 PM, Adhemerval Zanella wrote: > >> Regarding syscalls usage, for a sucessful path without symlinks it >> trades 2 syscalls (getcwd/lstat) for 5 (openat, readlink, fstat, stat, >> and close). > > Thanks for looking into this. > > There's no need for the new __resolve_path function to call __lstat64. It does that only to determine whether the file is a symlink or a directory, and it can do the former with __readlink (which it's gonna do already) and the latter by going into the next iteration and seeing whether it gets NOTDIR. Avoiding __lstat64 means we don't have to worry about irrelevant EOVERFLOW hassles. I am not sure we can skip the __lstat64 because we can't check if a subpath passed on __readlink is a directory or not (so we can move to next iteration in some cases). For instance the test 30 from stdlib/test-canon.c which issues: realpath ("./doesExist/someFile/", ...) On the directory with: mkdir ("doesExist", 0777) creat ("doesExist/someFile", 0777) By just issuing the readlink on (strace output): readlink(".../doesExist", 0x7fff32598060, 4095) = -1 EINVAL We can't see that is a S_ISDIR to go for next iteration (and then fail on next iteration since "../doesExist/someFile' is not a directory end *end != '\0'). > > In the Linux __realpath_system, the code should be prepared for the /proc syscall to fail because /proc is not mounted. It can fall back on __resolve_path in that case. The code should be prepared for the fstat64 to fail due to EOVERFLOW. And I'm not quite getting why the code is comparing fstat64's result to stat64's result -- perhaps explain this in a comment and how EOVERFLOW is dealt with? Right, I though about the '/proc' failure and decided to fail based on on other implementations (fexecve fallback, fchmodat), but the __resolve_path should be ok to use in such case. For fstat64, afaik EOVERFLOW is returned only when called non-LFS interface which is not this case. Do you think re really should handle EOVERFLOW ? The fstat64/stat64 comparison is just a small check to see if the file was removed between calls. Thinking twice I am not sure how effective this really is, maybe remove it?
On 11/08/2020 11:14, Adhemerval Zanella wrote: > > > On 10/08/2020 18:25, Paul Eggert wrote: >> On 8/10/20 1:48 PM, Adhemerval Zanella wrote: >> >>> Regarding syscalls usage, for a sucessful path without symlinks it >>> trades 2 syscalls (getcwd/lstat) for 5 (openat, readlink, fstat, stat, >>> and close). >> >> Thanks for looking into this. >> >> There's no need for the new __resolve_path function to call __lstat64. It does that only to determine whether the file is a symlink or a directory, and it can do the former with __readlink (which it's gonna do already) and the latter by going into the next iteration and seeing whether it gets NOTDIR. Avoiding __lstat64 means we don't have to worry about irrelevant EOVERFLOW hassles. > > I am not sure we can skip the __lstat64 because we can't check if a subpath > passed on __readlink is a directory or not (so we can move to next iteration > in some cases). > > For instance the test 30 from stdlib/test-canon.c which issues: > > realpath ("./doesExist/someFile/", ...) And I just checked your patch on BZ#24970 and it shows the issues: $ grep ^FAIL stdlib/subdir-tests.sum FAIL: stdlib/test-canon $ cat stdlib/test-canon.out /home/azanella/Projects/glibc/build/x86_64-linux-gnu/stdlib/test-canon: flunked test 30 (expected `NULL', got `/home/azanella/Projects/glibc/build/x86_64-linux-gnu/stdlib/doesExist/someFile') /home/azanella/Projects/glibc/build/x86_64-linux-gnu/stdlib/test-canon: flunked test 31 (expected `NULL', got `/home/azanella/Projects/glibc/build/x86_64-linux-gnu/stdlib/doesExist') 2 errors. I couldn't figure out a way to have the expected realpath semantic by just using readlink.
On 8/11/20 7:14 AM, Adhemerval Zanella wrote: > For instance the test 30 from stdlib/test-canon.c which issues: > > realpath ("./doesExist/someFile/", ...) > > On the directory with: > > mkdir ("doesExist", 0777) > creat ("doesExist/someFile", 0777) > > By just issuing the readlink on (strace output): > > readlink(".../doesExist", 0x7fff32598060, 4095) = -1 EINVAL > > We can't see that is a S_ISDIR to go for next iteration (and then fail on next > iteration since "../doesExist/someFile' is not a directory end *end != '\0'). We don't need to test whether "doesExist" is a directory. The EINVAL tells us that it exists and is not a symlink. Therefore it is either a directory or a non-(symlink-or-directory). To discover whether it is a directory or a non-(symlink-or-directory), we go ahead with the next iteration: readlink ("doesExist/someFile", ...) If this fails with ENOTDIR, then "doesExist" was a non-(symlink-or-directory); if it succeeds (or fails with EINVAL) doesExist was a directory; otherwise can simply fail with whatever errno it fails with. > For fstat64, afaik EOVERFLOW is returned only when called non-LFS interface > which is not this case. Do you think re really should handle EOVERFLOW ? Ah, I hadn't thought of that. If EOVERFLOW is indeed impossible with fstat64 then we needn't worry about EOVERFLOW. > The fstat64/stat64 comparison is just a small check to see if the > file was removed between calls. Thinking twice I am not sure how > effective this really is, maybe remove it? Yes, let's remove that. realpath cannot possibly be atomic when it is issuing multiple syscalls, so there's no point to it trying to be "atomic" here.
On Aug 10 2020, Paul Eggert wrote: > In the Linux __realpath_system, the code should be prepared for the /proc > syscall to fail because /proc is not mounted. We already depend very much on /proc to be mounted, so this can be ignored. Andreas.
On 11/08/2020 12:52, Paul Eggert wrote: > On 8/11/20 7:14 AM, Adhemerval Zanella wrote: > >> For instance the test 30 from stdlib/test-canon.c which issues: >> >> realpath ("./doesExist/someFile/", ...) >> >> On the directory with: >> >> mkdir ("doesExist", 0777) >> creat ("doesExist/someFile", 0777) >> >> By just issuing the readlink on (strace output): >> >> readlink(".../doesExist", 0x7fff32598060, 4095) = -1 EINVAL >> >> We can't see that is a S_ISDIR to go for next iteration (and then fail on next >> iteration since "../doesExist/someFile' is not a directory end *end != '\0'). > > We don't need to test whether "doesExist" is a directory. The EINVAL tells us that it exists and is not a symlink. Therefore it is either a directory or a non-(symlink-or-directory). To discover whether it is a directory or a non-(symlink-or-directory), we go ahead with the next iteration: > > readlink ("doesExist/someFile", ...) > > If this fails with ENOTDIR, then "doesExist" was a non-(symlink-or-directory); if it succeeds (or fails with EINVAL) doesExist was a directory; otherwise can simply fail with whatever errno it fails with. The issue seems that readlink does not fail with ENOTDIR if 'doesExist/someFile' is indeed a file: readlink("/home/azanella/Projects/glibc/build/x86_64-linux-gnu/stdlib/doesExist/someFile", 0x7ffe2d3efb70, 4096) = -1 EINVAL (Invalid argument) We need to check with a final '/': readlink("/home/azanella/Projects/glibc/build/x86_64-linux-gnu/stdlib/doesExist/someFile/", 0x7ffee1893aa0, 4096) = -1 ENOTDIR (Not a directory) This is turn requires to change the loop to make this extra check with readlink with expected path (by appending the '/' when required). There is another issue which we also need to handle "./doesExist/someFile/..". But I have spent too much time for an small optimization not directly related to the Linux optimization. If you could make either current algorithm or the one on stdlib/canonicalize-internal.c no call lstat I can add on series (I don't have plan to spend more time on this readlink optimization). > >> For fstat64, afaik EOVERFLOW is returned only when called non-LFS interface >> which is not this case. Do you think re really should handle EOVERFLOW ? > > Ah, I hadn't thought of that. If EOVERFLOW is indeed impossible with fstat64 then we needn't worry about EOVERFLOW. > >> The fstat64/stat64 comparison is just a small check to see if the >> file was removed between calls. Thinking twice I am not sure how >> effective this really is, maybe remove it? > > Yes, let's remove that. realpath cannot possibly be atomic when it is issuing multiple syscalls, so there's no point to it trying to be "atomic" here. Ack.
On Tue, Aug 11, 2020 at 06:46:24PM +0200, Andreas Schwab wrote: > On Aug 10 2020, Paul Eggert wrote: > > > In the Linux __realpath_system, the code should be prepared for the /proc > > syscall to fail because /proc is not mounted. > > We already depend very much on /proc to be mounted, so this can be > ignored. Do we?
On Aug 17 2020, Dmitry V. Levin wrote: > On Tue, Aug 11, 2020 at 06:46:24PM +0200, Andreas Schwab wrote: >> On Aug 10 2020, Paul Eggert wrote: >> >> > In the Linux __realpath_system, the code should be prepared for the /proc >> > syscall to fail because /proc is not mounted. >> >> We already depend very much on /proc to be mounted, so this can be >> ignored. > > Do we? Yes, see <87blk9ry5d.fsf@oldenburg2.str.redhat.com> for example. Andreas.
On Mon, Aug 17, 2020 at 05:13:44PM +0200, Andreas Schwab wrote: > On Aug 17 2020, Dmitry V. Levin wrote: > > On Tue, Aug 11, 2020 at 06:46:24PM +0200, Andreas Schwab wrote: > >> On Aug 10 2020, Paul Eggert wrote: > >> > >> > In the Linux __realpath_system, the code should be prepared for the /proc > >> > syscall to fail because /proc is not mounted. > >> > >> We already depend very much on /proc to be mounted, so this can be > >> ignored. > > > > Do we? > > Yes, see <87blk9ry5d.fsf@oldenburg2.str.redhat.com> for example. That's a regression, I filed a bug at https://sourceware.org/bugzilla/show_bug.cgi?id=26401 Thanks,
diff --git a/include/stdlib.h b/include/stdlib.h index ffcefd7b85..dd51f66b26 100644 --- a/include/stdlib.h +++ b/include/stdlib.h @@ -20,6 +20,14 @@ # include <rtld-malloc.h> +# ifndef PATH_MAX +# ifdef MAXPATHLEN +# define PATH_MAX MAXPATHLEN +# else +# define PATH_MAX 1024 +# endif +# endif + extern __typeof (strtol_l) __strtol_l; extern __typeof (strtoul_l) __strtoul_l; extern __typeof (strtoll_l) __strtoll_l; @@ -92,6 +100,11 @@ extern int __unsetenv (const char *__name) attribute_hidden; extern int __clearenv (void) attribute_hidden; extern char *__mktemp (char *__template) __THROW __nonnull ((1)); extern char *__canonicalize_file_name (const char *__name); +extern _Bool __resolve_path (const char *name, char *resolved, + size_t path_max) + attribute_hidden; +extern char *__realpath_system (const char *name, char *resolved) + attribute_hidden; extern char *__realpath (const char *__name, char *__resolved); libc_hidden_proto (__realpath) extern int __ptsname_r (int __fd, char *__buf, size_t __buflen) diff --git a/stdlib/Makefile b/stdlib/Makefile index 7093b8a584..35ca04541f 100644 --- a/stdlib/Makefile +++ b/stdlib/Makefile @@ -53,7 +53,7 @@ routines := \ strtof strtod strtold \ strtof_l strtod_l strtold_l \ strtof_nan strtod_nan strtold_nan \ - system canonicalize \ + system realpath canonicalize canonicalize-internal \ a64l l64a \ rpmatch strfmon strfmon_l getsubopt xpg_basename fmtmsg \ strtoimax strtoumax wcstoimax wcstoumax \ diff --git a/stdlib/canonicalize-internal.c b/stdlib/canonicalize-internal.c new file mode 100644 index 0000000000..1b5f73a1cc --- /dev/null +++ b/stdlib/canonicalize-internal.c @@ -0,0 +1,136 @@ +/* Internal function for canonicalize absolute name of a given file. + Copyright (C) 1996-2020 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <stdbool.h> +#include <string.h> +#include <stdlib.h> +#include <errno.h> +#include <unistd.h> +#include <eloop-threshold.h> + +_Bool +__resolve_path (const char *name, char *resolved, size_t path_max) +{ + const char *start, *end; + char *rpath = resolved; + char *rpath_limit = rpath + path_max; + char *dest = resolved; + char extra_buf[PATH_MAX]; + int num_links = 0; + + if (name[0] != '/') + { + if (__getcwd (rpath, path_max) == NULL) + { + rpath[0] = '\0'; + return false; + } + dest = __rawmemchr (rpath, '\0'); + } + else + { + rpath[0] = '/'; + dest = rpath + 1; + } + + for (start = end = name; *start; start = end) + { + /* Skip sequence of multiple path-separators. */ + while (*start == '/') + ++start; + + /* Find end of path component. */ + for (end = start; *end && *end != '/'; ++end) + /* Nothing. */; + + if (end - start == 0) + break; + else if (end - start == 1 && start[0] == '.') + /* nothing */; + else if (end - start == 2 && start[0] == '.' && start[1] == '.') + { + /* Back up to previous component, ignore if at root already. */ + if (dest > rpath + 1) + while ((--dest)[-1] != '/'); + } + else + { + struct stat64 st; + + if (dest[-1] != '/') + *dest++ = '/'; + if (dest + (end - start) >= rpath_limit) + { + __set_errno (ENAMETOOLONG); + if (dest > rpath + 1) + dest--; + *dest = '\0'; + return false; + } + + dest = __mempcpy (dest, start, end - start); + *dest = '\0'; + + if (__lstat64 (rpath, &st) < 0) + return false; + + if (S_ISLNK (st.st_mode)) + { + if (++num_links > __eloop_threshold ()) + { + __set_errno (ELOOP); + return false; + } + + char buf[PATH_MAX]; + ssize_t n = __readlink (rpath, buf, sizeof (buf) - 1); + if (n < 0) + return false; + buf[n] = '\0'; + + size_t len = strlen (end); + if (path_max - n <= len) + { + __set_errno (ENAMETOOLONG); + return false; + } + + memmove (&extra_buf[n], end, len + 1); + name = end = memcpy (extra_buf, buf, n); + + if (buf[0] == '/') + dest = rpath + 1; /* It's an absolute symlink */ + else + /* Back up to previous component, ignore if at root already: */ + if (dest > rpath + 1) + while ((--dest)[-1] != '/'); + } + else if (!S_ISDIR (st.st_mode) && *end != '\0') + { + __set_errno (ENOTDIR); + return false; + } + } + } + + if (dest > rpath + 1 && dest[-1] == '/') + --dest; + *dest = '\0'; + + return true; +} diff --git a/stdlib/canonicalize.c b/stdlib/canonicalize.c index 91c30e38be..f4ab528a15 100644 --- a/stdlib/canonicalize.c +++ b/stdlib/canonicalize.c @@ -16,26 +16,11 @@ License along with the GNU C Library; if not, see <https://www.gnu.org/licenses/>. */ -#include <assert.h> #include <stdlib.h> -#include <string.h> -#include <unistd.h> -#include <limits.h> -#include <sys/stat.h> #include <errno.h> -#include <stddef.h> -#include <eloop-threshold.h> #include <shlib-compat.h> -#ifndef PATH_MAX -# ifdef MAXPATHLEN -# define PATH_MAX MAXPATHLEN -# else -# define PATH_MAX 1024 -# endif -#endif - /* Return the canonical absolute name of file NAME. A canonical name does not contain any `.', `..' components nor any repeated path separators ('/') or symlinks. All path components must exist. If @@ -50,10 +35,6 @@ char * __realpath (const char *name, char *resolved) { - char *rpath, *dest, extra_buf[PATH_MAX]; - const char *start, *end, *rpath_limit; - int num_links = 0; - if (name == NULL) { /* As per Single Unix Specification V2 we must return an error if @@ -72,127 +53,7 @@ __realpath (const char *name, char *resolved) return NULL; } - if (resolved == NULL) - { - rpath = malloc (PATH_MAX); - if (rpath == NULL) - return NULL; - } - else - rpath = resolved; - rpath_limit = rpath + PATH_MAX; - - if (name[0] != '/') - { - if (!__getcwd (rpath, PATH_MAX)) - { - rpath[0] = '\0'; - goto error; - } - dest = __rawmemchr (rpath, '\0'); - } - else - { - rpath[0] = '/'; - dest = rpath + 1; - } - - for (start = end = name; *start; start = end) - { - struct stat64 st; - int n; - - /* Skip sequence of multiple path-separators. */ - while (*start == '/') - ++start; - - /* Find end of path component. */ - for (end = start; *end && *end != '/'; ++end) - /* Nothing. */; - - if (end - start == 0) - break; - else if (end - start == 1 && start[0] == '.') - /* nothing */; - else if (end - start == 2 && start[0] == '.' && start[1] == '.') - { - /* Back up to previous component, ignore if at root already. */ - if (dest > rpath + 1) - while ((--dest)[-1] != '/'); - } - else - { - if (dest[-1] != '/') - *dest++ = '/'; - - if (dest + (end - start) >= rpath_limit) - { - __set_errno (ENAMETOOLONG); - if (dest > rpath + 1) - dest--; - *dest = '\0'; - goto error; - } - - dest = __mempcpy (dest, start, end - start); - *dest = '\0'; - - if (__lxstat64 (_STAT_VER, rpath, &st) < 0) - goto error; - - if (S_ISLNK (st.st_mode)) - { - char buf[PATH_MAX]; - size_t len; - - if (++num_links > __eloop_threshold ()) - { - __set_errno (ELOOP); - goto error; - } - - n = __readlink (rpath, buf, sizeof (buf) - 1); - if (n < 0) - goto error; - buf[n] = '\0'; - - len = strlen (end); - if (PATH_MAX - n <= len) - { - __set_errno (ENAMETOOLONG); - goto error; - } - - /* Careful here, end may be a pointer into extra_buf... */ - memmove (&extra_buf[n], end, len + 1); - name = end = memcpy (extra_buf, buf, n); - - if (buf[0] == '/') - dest = rpath + 1; /* It's an absolute symlink */ - else - /* Back up to previous component, ignore if at root already: */ - if (dest > rpath + 1) - while ((--dest)[-1] != '/'); - } - else if (!S_ISDIR (st.st_mode) && *end != '\0') - { - __set_errno (ENOTDIR); - goto error; - } - } - } - if (dest > rpath + 1 && dest[-1] == '/') - --dest; - *dest = '\0'; - - assert (resolved == NULL || resolved == rpath); - return rpath; - -error: - assert (resolved == NULL || resolved == rpath); - if (resolved == NULL) - free (rpath); - return NULL; + return __realpath_system (name, resolved); } libc_hidden_def (__realpath) versioned_symbol (libc, __realpath, realpath, GLIBC_2_3); diff --git a/stdlib/realpath.c b/stdlib/realpath.c new file mode 100644 index 0000000000..1a70c658b7 --- /dev/null +++ b/stdlib/realpath.c @@ -0,0 +1,42 @@ +/* Return the canonical absolute name of a given file. + Copyright (C) 1996-2020 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <stdlib.h> +#include <errno.h> +#include <shlib-compat.h> + +char * +__realpath_system (const char *name, char *resolved) +{ + bool resolved_malloc = false; + if (resolved == NULL) + { + resolved = malloc (PATH_MAX); + if (resolved == NULL) + return NULL; + resolved_malloc = true; + } + + if (! __resolve_path (name, resolved, PATH_MAX)) + { + if (resolved_malloc) + free (resolved); + return NULL; + } + return resolved; +} diff --git a/sysdeps/unix/sysv/linux/realpath.c b/sysdeps/unix/sysv/linux/realpath.c new file mode 100644 index 0000000000..4c69011f92 --- /dev/null +++ b/sysdeps/unix/sysv/linux/realpath.c @@ -0,0 +1,65 @@ +/* Return the canonical absolute name of a given file. Linux version. + Copyright (C) 2020 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <stdlib.h> +#include <errno.h> +#include <not-cancel.h> +#include <shlib-compat.h> + +char * +__realpath_system (const char *name, char *resolved) +{ + int fd = __open64_nocancel (name, O_PATH | O_NONBLOCK | O_CLOEXEC); + if (fd == -1) + { + /* If the call fails with either EACCES or ENOENT and resolved_path is + not NULL, then the prefix of path that is not readable or does not + exist is returned in resolved_path. This is a GNU extension. */ + if (resolved != NULL) + __resolve_path (name, resolved, PATH_MAX); + return NULL; + } + + char procname[sizeof ("/proc/self/fd/") + 3 * sizeof (int)]; + *_fitoa_word (fd, __stpcpy (procname, "/proc/self/fd/"), 10, 0) = '\0'; + + char path[PATH_MAX]; + ssize_t len = __readlink (procname, path, sizeof (path) - 1); + if (len < 0) + { + __close_nocancel_nostatus (fd); + return NULL; + } + path[len] = '\0'; + + struct stat64 st; + fstat64 (fd, &st); + dev_t st_dev = st.st_dev; + ino_t st_ino = st.st_ino; + int r = stat64 (path, &st); + if (r == -1 || st.st_dev != st_dev || st.st_ino != st_ino) + { + if (r == 0) + __set_errno (ELOOP); + __close_nocancel_nostatus (fd); + return NULL; + } + + __close_nocancel_nostatus (fd); + return resolved != NULL ? strcpy (resolved, path) : __strdup (path); +}