[PATCHv10,man-pages,5/5] execveat.2: initial man page for execveat(2)

Message ID 1416830039-21952-6-git-send-email-drysdale@google.com
State Not Applicable
Delegated to: David Miller
Commit Message

David Drysdale Nov. 24, 2014, 11:53 a.m. UTC
Signed-off-by: David Drysdale <drysdale@google.com>
---
 man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 man2/execveat.2

--
2.1.0.rc2.206.gedb03e5
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Michael Kerrisk (man-pages) Jan. 9, 2015, 3:47 p.m. UTC | #1
On 11/24/2014 12:53 PM, David Drysdale wrote:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 man2/execveat.2

David,

Thanks for the very nicely prepared man page. I've done 
a few very light edits, and will release the version below 
with the next man-pages release.

I have one question. In the message accompanying
commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:

  The filename fed to the executed program as argv[0] (or the name of the
  script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
  (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
  reflecting how the executable was found.  This does however mean that
  execution of a script in a /proc-less environment won't work; also, script
  execution via an O_CLOEXEC file descriptor fails (as the file will not be
  accessible after exec).

How does one produce this situation where the execed program sees 
argv[0] as a /dev/fd path? (i.e., what would the execveat()
call look like?) I tried to produce this scenario, but could not.

Cheers,

Michael

.\" Copyright (c) 2014 Google, Inc.
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH EXECVEAT 2 2015-01-09 "Linux" "Linux Programmer's Manual"
.SH NAME
execveat \- execute program relative to a directory file descriptor
.SH SYNOPSIS
.B #include <unistd.h>
.sp
.BI "int execveat(int " dirfd ", const char *" pathname ","
.br
.BI "             char *const " argv "[], char *const " envp "[],"
.br
.BI "             int " flags );
.SH DESCRIPTION
.\" commit 51f39a1f0cea1cacf8c787f652f26dfee9611874
The
.BR execveat ()
system call executes the program referred to by the combination of
.I dirfd
and
.IR pathname .
It operates in exactly the same way as
.BR execve (2),
except for the differences described in this manual page.

If the pathname given in
.I pathname
is relative, then it is interpreted relative to the directory
referred to by the file descriptor
.I dirfd
(rather than relative to the current working directory of
the calling process, as is done by
.BR execve (2)
for a relative pathname).

If
.I pathname
is relative and
.I dirfd
is the special value
.BR AT_FDCWD ,
then
.I pathname
is interpreted relative to the current working
directory of the calling process (like
.BR execve (2)).

If
.I pathname
is absolute, then
.I dirfd
is ignored.

If
.I pathname
is an empty string and the
.BR AT_EMPTY_PATH
flag is specified, then the file descriptor
.I dirfd
specifies the file to be executed (i.e.,
.IR dirfd
refers to an executable file, rather than a directory).

The
.I flags
argument is a bit mask that can include zero or more of the following flags:
.TP
.BR AT_EMPTY_PATH
If
.I pathname
is an empty string, operate on the file referred to by
.IR dirfd
(which may have been obtained using the
.BR open (2)
.B O_PATH
flag).
.TP
.B AT_SYMLINK_NOFOLLOW
If the file identified by
.I dirfd
and a non-NULL
.I pathname
is a symbolic link, then the call fails with the error
.BR EINVAL .
.SH "RETURN VALUE"
On success,
.BR execveat ()
does not return. On error \-1 is returned, and
.I errno
is set appropriately.
.SH ERRORS
The same errors that occur for
.BR execve (2)
can also occur for
.BR execveat ().
The following additional errors can occur for
.BR execveat ():
.TP
.B EBADF
.I dirfd
is not a valid file descriptor.
.TP
.B EINVAL
.I flags
includes
.BR AT_SYMLINK_NOFOLLOW
and the file identified by
.I dirfd
and a non-NULL
.I pathname
is a symbolic link.
.TP
.B EINVAL
Invalid flag specified in
.IR flags .
.TP
.B ENOENT
The program identified by
.I dirfd
and
.I pathname
requires the use of an interpreter program
(such as a script starting with "#!"), but the file descriptor
.I dirfd
was opened with the
.B O_CLOEXEC
flag, with the result that
the program file is inaccessible to the launched interpreter.
.TP
.B ENOTDIR
.I pathname
is relative and
.I dirfd
is a file descriptor referring to a file other than a directory.
.SH VERSIONS
.BR execveat ()
was added to Linux in kernel 3.19.
GNU C library support is pending.
.\" FIXME . check for glibc support in a future release
.SH CONFORMING TO
The
.BR execveat ()
system call is Linux-specific.
.SH NOTES
In addition to the reasons explained in
.BR openat (2),
the
.BR execveat ()
system call is also needed to allow
.BR fexecve (3)
to be implemented on systems that do not have the
.I /proc
filesystem mounted.
.SH SEE ALSO
.BR execve (2),
.BR openat (2),
.BR fexecve (3)
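
Because the page notes that GNU C library support is still pending, a caller
currently has to go through syscall(2). The following is a minimal sketch of a
directory-relative exec under that assumption; the helper name exec_in_dir is
hypothetical and not part of any library. Note that the directory descriptor
is opened with O_CLOEXEC, which is fine for a binary but, per the ENOENT entry
above, would make a "#!" script fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* open(), O_PATH, O_DIRECTORY, O_CLOEXEC */
#include <sys/syscall.h>  /* SYS_execveat */
#include <unistd.h>       /* syscall(), close() */

/* Hypothetical helper: execute `name` relative to directory `dir`.
 * Returns only on failure, with -1 and errno set. */
static int exec_in_dir(const char *dir, const char *name,
                       char *const argv[], char *const envp[])
{
    int dirfd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (dirfd == -1)
        return -1;

#ifdef SYS_execveat
    /* No libc wrapper is assumed, so invoke the system call directly. */
    syscall(SYS_execveat, dirfd, name, argv, envp, 0);
#else
    errno = ENOSYS;       /* headers predate the system call */
#endif

    /* Reaching this point means the exec failed; keep its errno. */
    int saved = errno;
    close(dirfd);
    errno = saved;
    return -1;
}
```

A call such as exec_in_dir("/usr/bin", "env", argv, envp) would replace the
calling process on success and return -1 otherwise.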
Rich Felker Jan. 9, 2015, 4:13 p.m. UTC | #2
On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
> On 11/24/2014 12:53 PM, David Drysdale wrote:
> > Signed-off-by: David Drysdale <drysdale@google.com>
> > ---
> >  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 153 insertions(+)
> >  create mode 100644 man2/execveat.2
> 
> David,
> 
> Thanks for the very nicely prepared man page. I've done 
> a few very light edits, and will release the version below 
> with the next man-pages release.
> 
> I have one question. In the message accompanying
> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
> 
>   The filename fed to the executed program as argv[0] (or the name of the
>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>   reflecting how the executable was found.  This does however mean that
>   execution of a script in a /proc-less environment won't work; also, script
>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>   accessible after exec).
> 
> How does one produce this situation where the execed program sees 
> argv[0] as a /dev/fd path? (i.e., what would the execveat()
> call look like?) I tried to produce this scenario, but could not.

I think this is wrong. argv[0] is an arbitrary string provided by the
caller and would never be derived from the fd passed. It's AT_EXECFN,
/proc/self/exe, and filenames shown elsewhere in /proc that may be
derived in odd ways.

I would also move the text about O_CLOEXEC to a BUGS or NOTES section
rather than the main description. The long-term intent should be that
script execution this way should work. IIRC this was discussed earlier
in the thread.

Rich
David Drysdale Jan. 9, 2015, 5:46 p.m. UTC | #3
On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>> > Signed-off-by: David Drysdale <drysdale@google.com>
>> > ---
>> >  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 153 insertions(+)
>> >  create mode 100644 man2/execveat.2
>>
>> David,
>>
>> Thanks for the very nicely prepared man page. I've done
>> a few very light edits, and will release the version below
>> with the next man-pages release.
>>
>> I have one question. In the message accompanying
>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>
>>   The filename fed to the executed program as argv[0] (or the name of the
>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>   reflecting how the executable was found.  This does however mean that
>>   execution of a script in a /proc-less environment won't work; also, script
>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>   accessible after exec).
>>
>> How does one produce this situation where the execed program sees
>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>> call look like?) I tried to produce this scenario, but could not.
>
> I think this is wrong. argv[0] is an arbitrary string provided by the
> caller and would never be derived from the fd passed.

Yeah, I think I just wrote that wrong; it's only relevant for scripts.
As Rich says, for normal binaries argv[0] is just the argv[0] that
was passed into the execve[at] call.  For a script, the code in
fs/binfmt_script.c removes the original argv[0] and puts the
interpreter name and the script filename (e.g. "/bin/sh",
"/dev/fd/6/script") in its place as two arguments.

[As an aside, IIRC the filename does get put into the new
process's memory, up above the environment strings -- but
that copy isn't visible via argv nor envp.]
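
The naming rule from the commit message can be sketched in user space as
follows. This only illustrates the two string forms ("/dev/fd/<fd>" for an
empty filename, "/dev/fd/<fd>/<filename>" otherwise); it is not the kernel
code, and the function name format_fd_path is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: build the script name an interpreter would be
 * handed, following the two forms quoted from the commit message.
 * Returns the snprintf() result (length that was or would be written). */
static int format_fd_path(char *buf, size_t len, int fd, const char *filename)
{
    if (filename == NULL || filename[0] == '\0')
        return snprintf(buf, len, "/dev/fd/%d", fd);          /* AT_EMPTY_PATH case */
    return snprintf(buf, len, "/dev/fd/%d/%s", fd, filename); /* dirfd + relative name */
}
```

With fd 6 and filename "script" this yields "/dev/fd/6/script", matching the
example above.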

> It's AT_EXECFN,
> /proc/self/exe, and filenames shown elsewhere in /proc that may be
> derived in odd ways.
>
> I would also move the text about O_CLOEXEC to a BUGS or NOTES section
> rather than the main description. The long-term intent should be that
> script execution this way should work. IIRC this was discussed earlier
> in the thread.

I may be misremembering, but I thought we hoped to be able to fix
execveat of a script without /proc in future, but didn't expect to fix
execveat of a script via an O_CLOEXEC fd (because in the latter
case the fd gets closed before the script interpreter runs, so even
if the interpreter (or a special filesystem) does clever things for names
starting with "/dev/fd/..." the file descriptor is already gone).
David Drysdale Jan. 9, 2015, 6:02 p.m. UTC | #4
On Fri, Jan 9, 2015 at 3:47 PM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 11/24/2014 12:53 PM, David Drysdale wrote:
>> Signed-off-by: David Drysdale <drysdale@google.com>
>> ---
>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 153 insertions(+)
>>  create mode 100644 man2/execveat.2
>
> David,
>
> Thanks for the very nicely prepared man page. I've done
> a few very light edits, and will release the version below
> with the next man-pages release.

Many thanks, one error (of mine) in 2 places pointed out below.


> .TH EXECVEAT 2 2015-01-09 "Linux" "Linux Programmer's Manual"
> .SH NAME
> execveat \- execute program relative to a directory file descriptor
> .SH SYNOPSIS
> .B #include <unistd.h>
> .sp
> .BI "int execveat(int " dirfd ", const char *" pathname ","
> .br
> .BI "             char *const " argv "[], char *const " envp "[],"
> .br
> .BI "             int " flags );
> .SH DESCRIPTION
> .\" commit 51f39a1f0cea1cacf8c787f652f26dfee9611874
> The
> .BR execveat ()
> system call executes the program referred to by the combination of
> .I dirfd
> and
> .IR pathname .
> It operates in exactly the same way as
> .BR execve (2),
> except for the differences described in this manual page.
>
> If the pathname given in
> .I pathname
> is relative, then it is interpreted relative to the directory
> referred to by the file descriptor
> .I dirfd
> (rather than relative to the current working directory of
> the calling process, as is done by
> .BR execve (2)
> for a relative pathname).
>
> If
> .I pathname
> is relative and
> .I dirfd
> is the special value
> .BR AT_FDCWD ,
> then
> .I pathname
> is interpreted relative to the current working
> directory of the calling process (like
> .BR execve (2)).
>
> If
> .I pathname
> is absolute, then
> .I dirfd
> is ignored.
>
> If
> .I pathname
> is an empty string and the
> .BR AT_EMPTY_PATH
> flag is specified, then the file descriptor
> .I dirfd
> specifies the file to be executed (i.e.,
> .IR dirfd
> refers to an executable file, rather than a directory).
>
> The
> .I flags
> argument is a bit mask that can include zero or more of the following flags:
> .TP
> .BR AT_EMPTY_PATH
> If
> .I pathname
> is an empty string, operate on the file referred to by
> .IR dirfd
> (which may have been obtained using the
> .BR open (2)
> .B O_PATH
> flag).
> .TP
> .B AT_SYMLINK_NOFOLLOW
> If the file identified by
> .I dirfd
> and a non-NULL
> .I pathname
> is a symbolic link, then the call fails with the error
> .BR EINVAL .

Apologies, I think this should be ELOOP.

> .SH "RETURN VALUE"
> On success,
> .BR execveat ()
> does not return. On error \-1 is returned, and
> .I errno
> is set appropriately.
> .SH ERRORS
> The same errors that occur for
> .BR execve (2)
> can also occur for
> .BR execveat ().
> The following additional errors can occur for
> .BR execveat ():
> .TP
> .B EBADF
> .I dirfd
> is not a valid file descriptor.
> .TP
> .B EINVAL

ELOOP here too.

> .I flags
> includes
> .BR AT_SYMLINK_NOFOLLOW
> and the file identified by
> .I dirfd
> and a non-NULL
> .I pathname
> is a symbolic link.
> .TP
> .B EINVAL
> Invalid flag specified in
> .IR flags .
> .TP
> .B ENOENT
> The program identified by
> .I dirfd
> and
> .I pathname
> requires the use of an interpreter program
> (such as a script starting with "#!"), but the file descriptor
> .I dirfd
> was opened with the
> .B O_CLOEXEC
> flag, with the result that
> the program file is inaccessible to the launched interpreter.
> .TP
> .B ENOTDIR
> .I pathname
> is relative and
> .I dirfd
> is a file descriptor referring to a file other than a directory.
> .SH VERSIONS
> .BR execveat ()
> was added to Linux in kernel 3.19.
> GNU C library support is pending.
> .\" FIXME . check for glibc support in a future release
> .SH CONFORMING TO
> The
> .BR execveat ()
> system call is Linux-specific.
> .SH NOTES
> In addition to the reasons explained in
> .BR openat (2),
> the
> .BR execveat ()
> system call is also needed to allow
> .BR fexecve (3)
> to be implemented on systems that do not have the
> .I /proc
> filesystem mounted.
> .SH SEE ALSO
> .BR execve (2),
> .BR openat (2),
> .BR fexecve (3)
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
Rich Felker Jan. 9, 2015, 8:48 p.m. UTC | #5
On Fri, Jan 09, 2015 at 05:46:28PM +0000, David Drysdale wrote:
> > It's AT_EXECFN,
> > /proc/self/exe, and filenames shown elsewhere in /proc that may be
> > derived in odd ways.
> >
> > I would also move the text about O_CLOEXEC to a BUGS or NOTES section
> > rather than the main description. The long-term intent should be that
> > script execution this way should work. IIRC this was discussed earlier
> > in the thread.
> 
> I may be misremembering, but I thought we hoped to be able to fix
> execveat of a script without /proc in future, but didn't expect to fix
> execveat of a script via an O_CLOEXEC fd (because in the latter
> case the fd gets closed before the script interpreter runs, so even
> if the interpreter (or a special filesystem) does clever things for names
> starting with "/dev/fd/..." the file descriptor is already gone).

I think this is a case that needs to be fixed, though it's hard. The
normal correct usage for fexecve is to always pass an O_CLOEXEC file
descriptor, and the caller can't really be expected to know whether
the file is a script or not. We discussed workarounds before and one
idea I proposed was having fexecve provide a "one open only" magic
symlink in /proc/self/ to pass to the interpreter. It would behave
like an O_PATH file descriptor magic symlink in /proc/self/fd, but
would automatically cease to exist on the first open (at which point
the interpreter would have a real O_RDONLY file descriptor for the
underlying file).

Rich
Al Viro Jan. 9, 2015, 8:56 p.m. UTC | #6
On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
> I think this is a case that needs to be fixed, though it's hard. The
> normal correct usage for fexecve is to always pass an O_CLOEXEC file
> descriptor, and the caller can't really be expected to know whether
> the file is a script or not. We discussed workarounds before and one
> idea I proposed was having fexecve provide a "one open only" magic
> symlink in /proc/self/ to pass to the interpreter. It would behave
> like an O_PATH file descriptor magic symlink in /proc/self/fd, but
> would automatically cease to exist on the first open (at which point
> the interpreter would have a real O_RDONLY file descriptor for the
> underlying file).

For fsck sake, folks, if you have bloody /proc, you don't need that shite
at all!  Just do execve on /proc/self/fd/n, and be done with that.

The sole excuse for merging that thing in the first place had been
"would anybody think of children^Wsclerotic^Whardened environments
where they have no /proc at all".

Sheesh...
Rich Felker Jan. 9, 2015, 8:59 p.m. UTC | #7
On Fri, Jan 09, 2015 at 08:56:26PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
> > I think this is a case that needs to be fixed, though it's hard. The
> > normal correct usage for fexecve is to always pass an O_CLOEXEC file
> > descriptor, and the caller can't really be expected to know whether
> > the file is a script or not. We discussed workarounds before and one
> > idea I proposed was having fexecve provide a "one open only" magic
> > symlink in /proc/self/ to pass to the interpreter. It would behave
> > like an O_PATH file descriptor magic symlink in /proc/self/fd, but
> > would automatically cease to exist on the first open (at which point
> > the interpreter would have a real O_RDONLY file descriptor for the
> > underlying file).
> 
> For fsck sake, folks, if you have bloody /proc, you don't need that shite
> at all!  Just do execve on /proc/self/fd/n, and be done with that.
> 
> The sole excuse for merging that thing in the first place had been
> "would anybody think of children^Wsclerotic^Whardened environments
> where they have no /proc at all".

That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
the time the interpreter runs, whether you're using execveat or
execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
problem. This breaks the intended idiom for fexecve.

Rich
Al Viro Jan. 9, 2015, 9:09 p.m. UTC | #8
On Fri, Jan 09, 2015 at 03:59:26PM -0500, Rich Felker wrote:

> > For fsck sake, folks, if you have bloody /proc, you don't need that shite
> > at all!  Just do execve on /proc/self/fd/n, and be done with that.
> > 
> > The sole excuse for merging that thing in the first place had been
> > "would anybody think of children^Wsclerotic^Whardened environments
> > where they have no /proc at all".
> 
> That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> the time the interpreter runs, whether you're using execveat or
> execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> problem. This breaks the intended idiom for fexecve.

Just what will your magical symlink do in case when the file is opened,
unlinked and marked O_CLOEXEC?  When should actual freeing of disk blocks,
etc. happen?  And no, you can't assume that interpreter will open the
damn thing even once - there's nothing to oblige it to do so.

Al, more and more tempted to ask for a revert of the whole thing - this hardcoded
/dev/fd/... (in fs/exec.c, no less) is disgraceful enough, but threats of
even more revolting kludges in the name of "intended idiom for fexecve"...
Eric W. Biederman Jan. 9, 2015, 9:20 p.m. UTC | #9
Rich Felker <dalias@aerifal.cx> writes:

> On Fri, Jan 09, 2015 at 08:56:26PM +0000, Al Viro wrote:
>> On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
>> > I think this is a case that needs to be fixed, though it's hard. The
>> > normal correct usage for fexecve is to always pass an O_CLOEXEC file
>> > descriptor, and the caller can't really be expected to know whether
>> > the file is a script or not. We discussed workarounds before and one
>> > idea I proposed was having fexecve provide a "one open only" magic
>> > symlink in /proc/self/ to pass to the interpreter. It would behave
>> > like an O_PATH file descriptor magic symlink in /proc/self/fd, but
>> > would automatically cease to exist on the first open (at which point
>> > the interpreter would have a real O_RDONLY file descriptor for the
>> > underlying file).
>> 
>> For fsck sake, folks, if you have bloody /proc, you don't need that shite
>> at all!  Just do execve on /proc/self/fd/n, and be done with that.
>> 
>> The sole excuse for merging that thing in the first place had been
>> "would anybody think of children^Wsclerotic^Whardened environments
>> where they have no /proc at all".
>
> That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> the time the interpreter runs, whether you're using execveat or
> execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> problem. This breaks the intended idiom for fexecve.

O_CLOEXEC with a #! interpreter cannot work.  If the file descriptor is
closed, a #! interpreter cannot open it.  So I don't know why or how
you want that to work, but it is nonsense.

This certainly does not break the intended usage for execveat.

Eric

Rich Felker Jan. 9, 2015, 9:28 p.m. UTC | #10
On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 03:59:26PM -0500, Rich Felker wrote:
> 
> > > For fsck sake, folks, if you have bloody /proc, you don't need that shite
> > > at all!  Just do execve on /proc/self/fd/n, and be done with that.
> > > 
> > > The sole excuse for merging that thing in the first place had been
> > > "would anybody think of children^Wsclerotic^Whardened environments
> > > where they have no /proc at all".
> > 
> > That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> > the time the interpreter runs, whether you're using execveat or
> > execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> > problem. This breaks the intended idiom for fexecve.
> 
> Just what will your magical symlink do in case when the file is opened,
> unlinked and marked O_CLOEXEC?  When should actual freeing of disk blocks,
> etc. happen?  And no, you can't assume that interpreter will open the
> damn thing even once - there's nothing to oblige it to do so.

Unlinking is not relevant. Magical symlinks refer to open file
descriptions (either real ones or O_PATH inode-reference-only ones),
not files. There is no new complexity proposed for freeing disk blocks
here. Semantics are identical to existing O_PATH inode references.

> Al, more and more tempted to ask for a revert of the whole thing - this hardcoded
> /dev/fd/... (in fs/exec.c, no less) is disgraceful enough, but threats of
> even more revolting kludges in the name of "intended idiom for fexecve"...

If you have a multithreaded process that's executing an external
program via fexecve, then unless it has specialized knowledge about
what other parts of the program/libraries are doing, it needs to be
using O_CLOEXEC for the file descriptor. Otherwise, the file
descriptor could be leaked to child processes started by other
threads. This is what I mean by the "intended idiom". Note that it's
easier to use pathnames instead of fexecve, but doing so may not be an
option if the program needs to verify the file before exec'ing it.

This issue can be avoided if you're going to fork-and-fexecve rather
than replacing the calling process, since after forking it's safe to
remove the close-on-exec flag. But then you still have the issue that
the child process, after exec, keeps a spurious file descriptor to its
own process image (executable file) open which it can never close
(because it doesn't know the number). This could eventually lead to fd
exhaustion after many generations.
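
The fork-side step described above (removing the close-on-exec flag in the
child before exec) can be sketched with fcntl(2). This is a minimal
illustration, assuming it runs in the child after fork(); the name
clear_cloexec is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>   /* fcntl(), F_GETFD, F_SETFD, FD_CLOEXEC */

/* Hypothetical helper: clear FD_CLOEXEC on fd so that it survives exec.
 * Returns 0 on success, -1 on error with errno set by fcntl(). */
static int clear_cloexec(int fd)
{
    int fl = fcntl(fd, F_GETFD);

    if (fl == -1)
        return -1;
    return fcntl(fd, F_SETFD, fl & ~FD_CLOEXEC);
}
```

The descriptor then stays open in the executed program, which is exactly the
spurious descriptor the paragraph above warns about.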

The "magic open-once magic symlink" approach is really the cleanest
solution I can find. In the case where the interpreter does not open
the script, nothing terribly bad happens; the magic symlink just
sticks around until _exit or exec. In the case where the interpreter
opens it more than once, you get a failure, but as far as I know
existing interpreters don't do this, and it's arguably bad design. In
any case it's a caught error.

Rich
Rich Felker Jan. 9, 2015, 9:31 p.m. UTC | #11
On Fri, Jan 09, 2015 at 03:20:04PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > On Fri, Jan 09, 2015 at 08:56:26PM +0000, Al Viro wrote:
> >> On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
> >> > I think this is a case that needs to be fixed, though it's hard. The
> >> > normal correct usage for fexecve is to always pass an O_CLOEXEC file
> >> > descriptor, and the caller can't really be expected to know whether
> >> > the file is a script or not. We discussed workarounds before and one
> >> > idea I proposed was having fexecve provide a "one open only" magic
> >> > symlink in /proc/self/ to pass to the interpreter. It would behave
> >> > like an O_PATH file descriptor magic symlink in /proc/self/fd, but
> >> > would automatically cease to exist on the first open (at which point
> >> > the interpreter would have a real O_RDONLY file descriptor for the
> >> > underlying file).
> >> 
> >> For fsck sake, folks, if you have bloody /proc, you don't need that shite
> >> at all!  Just do execve on /proc/self/fd/n, and be done with that.
> >> 
> >> The sole excuse for merging that thing in the first place had been
> >> "would anybody think of children^Wsclerotic^Whardened environments
> >> where they have no /proc at all".
> >
> > That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> > the time the interpreter runs, whether you're using execveat or
> > execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> > problem. This breaks the intended idiom for fexecve.
> 
> O_CLOEXEC with a #! interpreter cannot work.  If the file descriptor is
> closed, a #! interpreter cannot open it.  So I don't know why or how
> you want that to work, but it is nonsense.

The why is simple: fexecve always expects a close-on-exec file
descriptor. Otherwise the program being executed would need to take a
special option telling it to close the spurious fd it inherits. Most
programs don't have such an option, and there's no way to do it
without application-specific knowledge.

The how is difficult, but it can be done.

Rich
Al Viro Jan. 9, 2015, 9:50 p.m. UTC | #12
On Fri, Jan 09, 2015 at 04:28:52PM -0500, Rich Felker wrote:

> The "magic open-once magic symlink" approach is really the cleanest
> solution I can find. In the case where the interpreter does not open
> the script, nothing terribly bad happens; the magic symlink just
> sticks around until _exit or exec. In the case where the interpreter
> opens it more than once, you get a failure, but as far as I know
> existing interpreters don't do this, and it's arguably bad design. In
> any case it's a caught error.

You know what's cleaner than that?  git revert 27d6ec7ad
It has just been merged; until 3.19 it's fair game for removal.

And yes, I should've NAKed the damn thing loud and clear, rather than
asking questions back then, getting no answers and letting it slip.
Mea culpa.

Back then the procfs-free environments had been pushed as a serious argument
in favour of merging the damn thing.  Now you guys turn around and say that
we not only need procfs mounted, we need a yet-to-be-added kludge in there
to cope with the actual intended uses.
Eric W. Biederman Jan. 9, 2015, 10:13 p.m. UTC | #13
Rich Felker <dalias@aerifal.cx> writes:

> On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:

> The "magic open-once magic symlink" approach is really the cleanest
> solution I can find. In the case where the interpreter does not open
> the script, nothing terribly bad happens; the magic symlink just
> sticks around until _exit or exec. In the case where the interpreter
> opens it more than once, you get a failure, but as far as I know
> existing interpreters don't do this, and it's arguably bad design. In
> any case it's a caught error.

And it doesn't work without introducing security vulnerabilities into
the kernel, because it breaks close-on-exec semantics.

All you have to do is pick a file descriptor (good candidates are 0 and
255) and make it a convention that that file descriptor is used for
fexecve, at least when you want to support scripts.  Otherwise you can
set close-on-exec.

That results in no accumulation of file descriptors because everyone
always uses the same file descriptor.
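
A minimal sketch of that convention, assuming the caller picks the target
number (255 here is only the example from the text above): dup2(2) produces a
copy without close-on-exec, so the script stays reachable through the fixed
number. The name move_to_fixed_fd is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>    /* fcntl(), F_GETFD, F_SETFD, FD_CLOEXEC */
#include <unistd.h>   /* dup2(), close() */

/* Hypothetical helper: make `target` refer to the same open file as `fd`,
 * with FD_CLOEXEC clear, and close the original descriptor.
 * Returns target on success, -1 on error. */
static int move_to_fixed_fd(int fd, int target)
{
    if (fd == target) {
        /* dup2() does nothing when both numbers match, so clear the flag by hand. */
        int fl = fcntl(fd, F_GETFD);
        if (fl == -1)
            return -1;
        return fcntl(fd, F_SETFD, fl & ~FD_CLOEXEC) == -1 ? -1 : target;
    }
    if (dup2(fd, target) == -1)   /* the duplicate never has FD_CLOEXEC set */
        return -1;
    close(fd);
    return target;
}
```

Note that dup2() silently closes whatever was previously open at the target
number, so the convention only works if nothing else in the process uses it.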

Regardless you don't have a patch and you aren't proposing code and the
code isn't actually broken so please go away.

Eric
--
Rich Felker Jan. 9, 2015, 10:17 p.m. UTC | #14
On Fri, Jan 09, 2015 at 09:50:42PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 04:28:52PM -0500, Rich Felker wrote:
> 
> > The "magic open-once magic symlink" approach is really the cleanest
> > solution I can find. In the case where the interpreter does not open
> > the script, nothing terribly bad happens; the magic symlink just
> > sticks around until _exit or exec. In the case where the interpreter
> > opens it more than once, you get a failure, but as far as I know
> > existing interpreters don't do this, and it's arguably bad design. In
> > any case it's a caught error.
> 
> You know what's cleaner than that?  git revert 27d6ec7ad
> It has just been merged; until 3.19 it's fair game for removal.
> 
> And yes, I should've NAKed the damn thing loud and clear, rather than
> asking questions back then, getting no answers and letting it slip.
> Mea culpa.
> 
> Back then the procfs-free environments had been pushed as a serious argument
> in favour of merging the damn thing.  Now you guys turn around and say that
> we not only need procfs mounted, we need a yet-to-be-added kludge in there
> to cope with the actual intended uses.

Reverting does not fix the problem. There is no way to make fexecve
work for scripts without kernel support, and the needed kernel support
without fexecve would be even nastier, since handling of /proc/self/fd
magic-symlinks would need to be special-cased. The added execveat
syscall supports fully /proc-less operation for non-scripts.
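
A minimal sketch of that non-script case, assuming Linux 3.19 or later
and kernel headers that define SYS_execveat. The helper name
run_binary_by_fd is made up for illustration, and the raw syscall is
used because a libc wrapper may not be available. It executes a binary
through an O_CLOEXEC descriptor with an empty pathname and
AT_EMPTY_PATH, which does not depend on /proc for a non-script:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the binary at `path` through a file descriptor.
 * Returns the child's exit status, or -1 on error or abnormal termination.
 */
int run_binary_by_fd(const char *path, char *const argv[])
{
    /* Close-on-exec is fine here: the kernel maps the binary image itself. */
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        char *const envp[] = { NULL };
        /* Empty pathname plus AT_EMPTY_PATH means "execute fd itself". */
        syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
        _exit(127);             /* only reached if the exec failed */
    }

    close(fd);
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For a #! script the same call would hand the interpreter a
/dev/fd/<fd> pathname, which is where the O_CLOEXEC problem discussed
in this thread comes from.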

Rich
--
Al Viro Jan. 9, 2015, 10:33 p.m. UTC | #15
On Fri, Jan 09, 2015 at 05:17:28PM -0500, Rich Felker wrote:
> > Back then the procfs-free environments had been pushed as a serious argument
> > in favour of merging the damn thing.  Now you guys turn around and say that
> > we not only need procfs mounted, we need a yet-to-be-added kludge in there
> > to cope with the actual intended uses.
> 
> Reverting does not fix the problem. There is no way to make fexecve
> work for scripts without kernel support, and the needed kernel support
> without fexecve would be even nastier, since handling of /proc/self/fd
> magic-symlinks would need to be special-cased. The added execveat
> syscall supports fully /proc-less operation for non-scripts.

Oh, yes it does.  It's not *our* problem if it's out of tree and not
a part of ABI.  That way if you need it, *you* get to come up with clean
implementation.  If it's in-tree you get leverage to push ugly kludges
further in.  And frankly, I don't trust you to abstain from using that
leverage in rather nasty ways.

Out of curiosity, how would you expect that "open only once" to work?
All reliable variants I see are beyond sick...
--
Rich Felker Jan. 9, 2015, 10:38 p.m. UTC | #16
On Fri, Jan 09, 2015 at 04:13:27PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:
> 
> > The "magic open-once magic symlink" approach is really the cleanest
> > solution I can find. In the case where the interpreter does not open
> > the script, nothing terribly bad happens; the magic symlink just
> > sticks around until _exit or exec. In the case where the interpreter
> > opens it more than once, you get a failure, but as far as I know
> > existing interpreters don't do this, and it's arguably bad design. In
> > any case it's a caught error.
> 
> And it doesn't work without introducing security vulnerabilities into
> the kernel, because it breaks close-on-exec semantics.

I'm curious what those security vulnerabilities would be. The standard
issue with close-on-exec failure (e.g. races) is the leaking of
arbitrary file descriptors (typically, ones opened by other threads or
other unrelated portions of the program) to resources the new process
should not have. "Leaking" of an inode-reference-only (no permissions)
O_PATH fd or pseudo-fd to the script that's to be run does not seem
like a vulnerability to me, and it would only be "leaked" if the
interpreter does something unexpected.

> All you have to do is pick a file descriptor, good candidates are 0 and
> 255 and make it a convention that that file descriptor is used for
> fexecve.  At least when you want to support scripts.  Otherwise you can
> set close-on-exec.

0 is obviously not a candidate; it's stdin. 255 is also not a
candidate though. Consider for example something like irssi's /upgrade
that's going to have the child inheriting an arbitrary set of file
descriptors that need to keep their original numbers, possibly
including 255. Imposing a script in between should not cause arbitrary
file descriptors to be lost.

> That results in no accumulation of file descriptors  because everyone
> always uses the same file descriptor.
> 
> Regardless you don't have a patch and you aren't proposing code and the
> code isn't actually broken so please go away.

I'm not proposing code because I'm a libc developer not a kernel
developer. I know what's needed for userspace to provide a conforming
fexecve to applications, not how to implement that on the kernel side,
although I'm trying to provide constructive ideas. The hostility is
really not necessary.

Rich
--
Rich Felker Jan. 9, 2015, 10:42 p.m. UTC | #17
On Fri, Jan 09, 2015 at 10:33:00PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 05:17:28PM -0500, Rich Felker wrote:
> > > Back then the procfs-free environments had been pushed as a serious argument
> > > in favour of merging the damn thing.  Now you guys turn around and say that
> > > we not only need procfs mounted, we need a yet-to-be-added kludge in there
> > > to cope with the actual intended uses.
> > 
> > Reverting does not fix the problem. There is no way to make fexecve
> > work for scripts without kernel support, and the needed kernel support
> > without fexecve would be even nastier, since handling of /proc/self/fd
> > magic-symlinks would need to be special-cased. The added execveat
> > syscall supports fully /proc-less operation for non-scripts.
> 
> Oh, yes it does.  It's not *our* problem if it's out of tree and not
> a part of ABI.  That way if you need it, *you* get to come up with clean
> implementation.  If it's in-tree you get leverage to push ugly kludges
> further in.  And frankly, I don't trust you to abstain from using that
> leverage in rather nasty ways.
> 
> Out of curiosity, how would you expect that "open only once" to work?
> All reliable variants I see are beyond sick...

Here's a very simple way it could work -- it could put the O_PATH fd
on a previously-unused fd number, and put a special flag on the fd,
like FD_CLOEXEC, but that causes the kernel to close it whenever it's
opened. The pathname passed could then simply be /dev/fd/%d or
/proc/self/fd/%d, and although this is presently dependent on /proc
being mounted, virtual /dev/fd/* could someday be something completely
independent of procfs. The kernel keeps all the freedom to choose how
to pass the name to the interpreter. I'm not proposing any kernel
API/ABI lock-in and I'm with you in opposing such lock-in.

Rich
--
Al Viro Jan. 9, 2015, 10:57 p.m. UTC | #18
On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:

> Here's a very simple way it could work -- it could put the O_PATH fd
> on a previously-unused fd number, and put a special flag on the fd,
> like FD_CLOEXEC, but that causes the kernel to close it whenever it's
> opened. The pathname passed could then simply be /dev/fd/%d or
> /proc/self/fd/%d, and although this is presently dependent on /proc
> being mounted, virtual /dev/fd/* could someday be something completely
> independent of procfs. The kernel keeps all the freedom to choose how
> to pass the name to the interpreter. I'm not proposing any kernel
> API/ABI lock-in and I'm with you in opposing such lock-in.

Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
traversed and after that point there is no information whatsoever how we
got to that vfsmount/dentry pair.  I can imagine several kludges that would
work, but they are unspeakably ugly, and do_last() is already far too
convoluted as it is.
--
Rich Felker Jan. 9, 2015, 11:12 p.m. UTC | #19
On Fri, Jan 09, 2015 at 10:57:43PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:
> 
> > Here's a very simple way it could work -- it could put the O_PATH fd
> > on a previously-unused fd number, and put a special flag on the fd,
> > like FD_CLOEXEC, but that causes the kernel to close it whenever it's
> > opened. The pathname passed could then simply be /dev/fd/%d or
> > /proc/self/fd/%d, and although this is presently dependent on /proc
> > being mounted, virtual /dev/fd/* could someday be something completely
> > independent of procfs. The kernel keeps all the freedom to choose how
> > to pass the name to the interpreter. I'm not proposing any kernel
> > API/ABI lock-in and I'm with you in opposing such lock-in.
> 
> Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
> traversed and after that point there is no information whatsoever how we
> got to that vfsmount/dentry pair.  I can imagine several kludges that would
> work, but they are unspeakably ugly, and do_last() is already far too
> convoluted as it is.

I'm not sure where you're disagreeing with me. open of procfs symlinks
does not resolve the symlink and open the resulting pathname. They are
"magic symlinks" which are bound to the inode of the open file. I
don't see why this action, which is already special for magic
symlinks, can't check a flag on the magic symlink and possibly close
the corresponding file descriptor as part of its action.
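
The user-visible half of this can be checked from userspace. The
sketch below is only an illustration, assuming /proc is mounted and
/tmp is writable; the function name is made up. It unlinks a temporary
file and then reopens it through /proc/self/fd/<n>. If open() resolved
the text that readlink() reports, it would fail, because that pathname
no longer exists:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Returns 1 if an unlinked file can still be reopened through its
 * /proc/self/fd entry, 0 if it cannot, and -1 on a setup error.
 */
int reopen_unlinked_via_procfs(void)
{
    char tmpl[] = "/tmp/magiclinkXXXXXX";
    int fd = mkstemp(tmpl);
    if (fd < 0)
        return -1;

    /* Write a marker, then remove the only name the file has. */
    if (write(fd, "hello", 5) != 5 || unlink(tmpl) != 0) {
        close(fd);
        return -1;
    }

    char link[64];
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);

    /* A fresh open through the procfs entry starts at offset 0. */
    int fd2 = open(link, O_RDONLY);
    if (fd2 < 0) {
        close(fd);
        return 0;
    }

    char buf[8] = { 0 };
    ssize_t n = read(fd2, buf, sizeof buf - 1);
    close(fd2);
    close(fd);
    return n == 5 && memcmp(buf, "hello", 5) == 0;
}
```

This says nothing about how the kernel gets there, only that the
lookup lands on the open file rather than on a re-resolved pathname.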

In any case, whether/how fexecve works with interpreters is something
the kernel can change without breaking userspace expectations. My goal
is to avoid creating any new API/ABI requirement here.

Rich
--
Andy Lutomirski Jan. 9, 2015, 11:24 p.m. UTC | #20
On Fri, Jan 9, 2015 at 3:12 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Fri, Jan 09, 2015 at 10:57:43PM +0000, Al Viro wrote:
>> On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:
>>
>> > Here's a very simple way it could work -- it could put the O_PATH fd
>> > on a previously-unused fd number, and put a special flag on the fd,
>> > like FD_CLOEXEC, but that causes the kernel to close it whenever it's
>> > opened. The pathname passed could then simply be /dev/fd/%d or
>> > /proc/self/fd/%d, and although this is presently dependent on /proc
>> > being mounted, virtual /dev/fd/* could someday be something completely
>> > independent of procfs. The kernel keeps all the freedom to choose how
>> > to pass the name to the interpreter. I'm not proposing any kernel
>> > API/ABI lock-in and I'm with you in opposing such lock-in.
>>
>> Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
>> traversed and after that point there is no information whatsoever how we
>> got to that vfsmount/dentry pair.  I can imagine several kludges that would
>> work, but they are unspeakably ugly, and do_last() is already far too
>> convoluted as it is.
>
> I'm not sure where you're disagreeing with me. open of procfs symlinks
> does not resolve the symlink and open the resulting pathname. They are
> "magic symlinks" which are bound to the inode of the open file. I
> don't see why this action, which is already special for magic
> symlinks, can't check a flag on the magic symlink and possibly close
> the corresponding file descriptor as part of its action.
>
> In any case, whether/how fexecve works with interpreters is something
> the kernel can change without breaking userspace expectations. My goal
> is to avoid creating any new API/ABI requirement here.
>

I think that, if we really want to support clean fexecve on O_CLOEXEC
scripts some day, the right way to do it is to fix the script
interface for real.  Have a special flag in the headers of script
interpreters that support a new interface that says "when I'm a script
interpreter, I expect an auxv entry AT_SCRIPT_FD with an  open fd with
CLOEXEC set".  Then we can directly exec scripts by fd, even with
O_CLOEXEC set, without any races.
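
To make the idea concrete, here is a rough sketch of what the
interpreter side could look like. AT_SCRIPT_FD does not exist in any
kernel; the constant name and value below are pure placeholders, and
the function name is invented. Only getauxval() is a real interface
(glibc 2.16+, with ENOENT reporting since 2.19):

```c
#include <errno.h>
#include <sys/auxv.h>

/* Placeholder only: no kernel defines AT_SCRIPT_FD or assigns this value. */
#define AT_SCRIPT_FD_HYPOTHETICAL 0x7ffeUL

/*
 * Returns the script fd advertised in the auxiliary vector, or -1 if the
 * kernel passed no such entry (which is what every current kernel does).
 */
int script_fd_from_auxv(void)
{
    errno = 0;
    unsigned long v = getauxval(AT_SCRIPT_FD_HYPOTHETICAL);

    /* getauxval() returns 0 and sets ENOENT when the entry is absent. */
    if (v == 0 && errno == ENOENT)
        return -1;
    return (int)v;
}
```

An interpreter that gets -1 back would fall back to opening the
pathname in argv[] as it does today.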

--Andy
--
Al Viro Jan. 9, 2015, 11:36 p.m. UTC | #21
On Fri, Jan 09, 2015 at 06:12:48PM -0500, Rich Felker wrote:

> I'm not sure where you're disagreeing with me. open of procfs symlinks
> does not resolve the symlink and open the resulting pathname. They are
> "magic symlinks" which are bound to the inode of the open file. I
> don't see why this action, which is already special for magic
> symlinks, can't check a flag on the magic symlink and possibly close
> the corresponding file descriptor as part of its action.

_What_ action?  ->follow_link()?  As in "the same thing that e.g.
stat(2) would trigger"?
--
Rich Felker Jan. 9, 2015, 11:37 p.m. UTC | #22
On Fri, Jan 09, 2015 at 03:24:12PM -0800, Andy Lutomirski wrote:
> On Fri, Jan 9, 2015 at 3:12 PM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Fri, Jan 09, 2015 at 10:57:43PM +0000, Al Viro wrote:
> >> On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:
> >>
> >> > Here's a very simple way it could work -- it could put the O_PATH fd
> >> > on a previously-unused fd number, and put a special flag on the fd,
> >> > like FD_CLOEXEC, but that causes the kernel to close it whenever it's
> >> > opened. The pathname passed could then simply be /dev/fd/%d or
> >> > /proc/self/fd/%d, and although this is presently dependent on /proc
> >> > being mounted, virtual /dev/fd/* could someday be something completely
> >> > independent of procfs. The kernel keeps all the freedom to choose how
> >> > to pass the name to the interpreter. I'm not proposing any kernel
> >> > API/ABI lock-in and I'm with you in opposing such lock-in.
> >>
> >> Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
> >> traversed and after that point there is no information whatsoever how we
> >> got to that vfsmount/dentry pair.  I can imagine several kludges that would
> >> work, but they are unspeakably ugly, and do_last() is already far too
> >> convoluted as it is.
> >
> > I'm not sure where you're disagreeing with me. open of procfs symlinks
> > does not resolve the symlink and open the resulting pathname. They are
> > "magic symlinks" which are bound to the inode of the open file. I
> > don't see why this action, which is already special for magic
> > symlinks, can't check a flag on the magic symlink and possibly close
> > the corresponding file descriptor as part of its action.
> >
> > In any case, whether/how fexecve works with interpreters is something
> > the kernel can change without breaking userspace expectations. My goal
> > is to avoid creating any new API/ABI requirement here.
> 
> I think that, if we really want to support clean fexecve on O_CLOEXEC
> scripts some day, the right way to do it is to fix the script
> interface for real.  Have a special flag in the headers of script
> interpreters that support a new interface that says "when I'm a script
> interpreter, I expect an auxv entry AT_SCRIPT_FD with an  open fd with
> CLOEXEC set".  Then we can directly exec scripts by fd, even with
> O_CLOEXEC set, without any races.

This is also acceptable, but I don't think you'd really need a special
header flag. Just pass it, and also pass /dev/fd/%d or
/proc/self/fd/%d in argv[]. If the interpreter supports it, everything
works fine. If not, it still works as long as /proc is mounted, but
with a partial fd leak. (Note: the leak is not so bad since the
interpreter would inherit a close-on-exec fd and thus would not leak
it further.)

Aside from setting up the new auxv entry, the main trick the kernel
would have to do is bypassing FD_CLOEXEC at exec time while keeping
the FD_CLOEXEC flag present on the fd after exec.

Rich
--
Al Viro Jan. 10, 2015, 12:01 a.m. UTC | #23
On Fri, Jan 09, 2015 at 03:24:12PM -0800, Andy Lutomirski wrote:

> I think that, if we really want to support clean fexecve on O_CLOEXEC
> scripts some day, the right way to do it is to fix the script
> interface for real.  Have a special flag in the headers of script
> interpreters that support a new interface that says "when I'm a script
> interpreter, I expect an auxv entry AT_SCRIPT_FD with an  open fd with
> CLOEXEC set".  Then we can directly exec scripts by fd, even with
> O_CLOEXEC set, without any races.

Amazing.  Let me see if I got it straight - you want a magical Linux-only
flag to mark the binaries that might be used as interpreters.  _Plus_ the
Linux-only logics in their source to go with that.  With corresponding kludges
to parsing the command line (you know, like #!/usr/bin/make -f as the first
line in a script - somehow it should recognize the deep magic of the oh
so fucking superior interface and suppress the normal behaviour).  Maintained
by hell knows whom.  Onna stick.  Inna bun.  CMOT Dibbler would be proud...
--
Eric W. Biederman Jan. 10, 2015, 1:17 a.m. UTC | #24
Rich Felker <dalias@aerifal.cx> writes:

> I'm not proposing code because I'm a libc developer not a kernel
> developer. I know what's needed for userspace to provide a conforming
> fexecve to applications, not how to implement that on the kernel side,
> although I'm trying to provide constructive ideas. The hostility is
> really not necessary.

Conforming to what?

The Open Group fexecve specification says nothing about requiring a file
descriptor passed to fexecve to have O_CLOEXEC.

Further, looking at the Open Group specification of exec, it seems to indicate
the preferred way to handle this is for the kernel to return ENOEXEC
and then libc gets to figure out how to run the shell script.  Is that
the kind of ``conforming'' implementation you are looking for?

Eric
--
Rich Felker Jan. 10, 2015, 1:33 a.m. UTC | #25
On Fri, Jan 09, 2015 at 07:17:41PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > I'm not proposing code because I'm a libc developer not a kernel
> > developer. I know what's needed for userspace to provide a conforming
> > fexecve to applications, not how to implement that on the kernel side,
> > although I'm trying to provide constructive ideas. The hostility is
> > really not necessary.
> 
> Conforming to what?
> 
> The Open Group fexecve specification says nothing about requiring a file
> descriptor passed to fexecve to have O_CLOEXEC.

It doesn't require it but it allows it, and in multithreaded programs
that might run child processes (or library code that might be used in
such situations), O_CLOEXEC is mandatory everywhere to avoid fd leaks.

> Further, looking at the Open Group specification of exec, it seems to indicate
> the preferred way to handle this is for the kernel to return ENOEXEC
> and then libc gets to figure out how to run the shell script.  Is that
> the kind of ``conforming'' implementation you are looking for?

This is a complex issue, and does not apply to native #! support
(which is a supported executable format and thus not ENOEXEC) but
rather standard POSIX shell scripts (which don't have a #! line at
all). In this case the behavior of fexecve is perhaps under-specified.
However, in cases where execve would succeed (without causing
ENOEXEC), I think it's at least undesirable, if not non-conforming,
for fexecve to fail.

Should we request clarification from the Austin Group?

Rich
--
Al Viro Jan. 10, 2015, 3:03 a.m. UTC | #26
On Fri, Jan 09, 2015 at 11:36:44PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 06:12:48PM -0500, Rich Felker wrote:
> 
> > I'm not sure where you're disagreeing with me. open of procfs symlinks
> > does not resolve the symlink and open the resulting pathname. They are
> > "magic symlinks" which are bound to the inode of the open file. I
> > don't see why this action, which is already special for magic
> > symlinks, can't check a flag on the magic symlink and possibly close
> > the corresponding file descriptor as part of its action.
> 
> _What_ action?  ->follow_link()?  As in "the same thing that e.g.
> stat(2) would trigger"?

To elaborate a bit: the fundamental method for symlink traversal is
->follow_link().  It gets dentry of the object itself + opaque context.
Usually it just obtains some string (== symlink contents) and calls
nd_set_link(context, string).  In that case the string will be interpreted
by its callers in usual way.  Another possibility is to call
nd_jump_link(context, location), which will reset the current position
(directory in which the symlink has been found and relative to which it
would be interpreted) to given location in tree.  It might actually do
both - then the string will be interpreted relative to the new location.
Once the pathname resolution is done with the string stored by nd_set_link(),
it calls another method - ->put_link().  That one releases the object
that contains this string; it gets an opaque pointer returned by
->follow_link().  Returning ERR_PTR(-Esomething) indicates an error, so does
nd_set_link(context, ERR_PTR(-Esomething)).

readlink(2) is using a different method (->readlink()) and any object whose
->follow_link() only uses nd_set_link() can use generic_readlink as its
->readlink instance - that will call ->follow_link(), copy the string
stored by nd_set_link() to userland buffer and use ->put_link() to release
whatever needs to be released.  Most of the symlinks are doing just that.

procfs "magical" symlinks have ->follow_link() that uses nd_jump_link();
they obviously can't use generic_readlink() (there is no string left
by ->follow_link() for caller to traverse), so they have non-standard
->readlink() instances - ones that use d_path() to generate a plausible
pathname of the would-be destination of their ->follow_link().  Or something
like pipe:[696969], etc.

Note, however, that ->readlink() is used only by readlink(2) syscall; as far
as pathname resolution is concerned it is completely irrelevant.  What matters
is ->follow_link().

Now, the callers do not know (and do not care) what a particular symlink _is_.
A symlink is just a dentry with inode that has non-NULL ->follow_link()
method.  That's it.  Moreover, _any_ pathname resolution is using the
same method for symlink traversal, be it open(2), stat(2), whatever.  If
a symlink is to be traversed, that's it - the only choice VFS has is whether
to traverse it at all or not (think of stat(2) vs lstat(2) difference, or
O_NOFOLLOW, etc.)
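
The userspace view of the above can be sketched as follows, assuming
/proc is mounted; the struct and function names are invented for the
example. For a pipe, readlink(2) reports the synthetic pipe:[...] text,
lstat(2) does not traverse and sees a symlink, and stat(2) traverses
and lands on the pipe itself:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

struct magic_link_view {
    int lstat_is_symlink;   /* lstat(2): no traversal, sees the link */
    int stat_is_fifo;       /* stat(2): traversal, sees the pipe */
    char target[64];        /* text produced for readlink(2) */
};

/* Fills *out for the read end of a fresh pipe; returns 0 on success. */
int inspect_pipe_link(struct magic_link_view *out)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    char link[64];
    snprintf(link, sizeof link, "/proc/self/fd/%d", p[0]);

    struct stat ls, st;
    int rc = -1;
    ssize_t n = readlink(link, out->target, sizeof out->target - 1);
    if (n >= 0 && lstat(link, &ls) == 0 && stat(link, &st) == 0) {
        out->target[n] = '\0';
        out->lstat_is_symlink = S_ISLNK(ls.st_mode) != 0;
        out->stat_is_fifo = S_ISFIFO(st.st_mode) != 0;
        rc = 0;
    }

    close(p[0]);
    close(p[1]);
    return rc;
}
```

The readlink() text is not a usable pathname here, yet stat() through
the same entry succeeds, which matches the point that pathname
resolution never looks at ->readlink().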

_After_ the traversal it's too late to do this sort of thing - after all,
how do you tell if your current position had been set by the traversal of
your symlink or that of any normal /proc/self/fd/<n>?

And doing that _during_ the traversal would really suck - stray ls -lR /proc
could race with that open() done by script interpreter.

It might be possible to work around that, but trying that rapidly gets into
very ugly territory, *especially* since the handling of the final component
of open(2) (fs/namei.c:do_last()) is already far too convoluted.
--
Rich Felker Jan. 10, 2015, 3:41 a.m. UTC | #27
On Sat, Jan 10, 2015 at 03:03:00AM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 11:36:44PM +0000, Al Viro wrote:
> > On Fri, Jan 09, 2015 at 06:12:48PM -0500, Rich Felker wrote:
> > 
> > > I'm not sure where you're disagreeing with me. open of procfs symlinks
> > > does not resolve the symlink and open the resulting pathname. They are
> > > "magic symlinks" which are bound to the inode of the open file. I
> > > don't see why this action, which is already special for magic
> > > symlinks, can't check a flag on the magic symlink and possibly close
> > > the corresponding file descriptor as part of its action.
> > 
> > _What_ action?  ->follow_link()?  As in "the same thing that e.g.
> > stat(2) would trigger"?
> 
> To elaborate a bit: the fundamental method for symlink traversal is
> ->follow_link().  It gets dentry of the object itself + opaque context.
> Usually it just obtains some string (== symlink contents) and calls
> nd_set_link(context, string).  In that case the string will be interpreted
> by its callers in usual way.  Another possibility is to call
> nd_jump_link(context, location), which will reset the current position
> (directory in which the symlink has been found and relative to which it
> would be interpreted) to given location in tree.  It might actually do
> both - then the string will be interpreted relative to the new location.
> Once the pathname resolution is done with the string stored by nd_set_link(),
> it calls another method - ->put_link().  That one releases the object
> that contains this string; it gets an opaque pointer returned by
> ->follow_link().  Returning ERR_PTR(-Esomething) indicates an error, so does
> nd_set_link(context, ERR_PTR(-Esomething)).
> 
> readlink(2) is using a different method (->readlink()) and any object whose
> ->follow_link() only uses nd_set_link() can use generic_readlink as its
> ->readlink instance - that will call ->follow_link(), copy the string
> stored by nd_set_link() to userland buffer and use ->put_link() to release
> whatever needs to be released.  Most of the symlinks are doing just that.
> 
> procfs "magical" symlinks have ->follow_link() that uses nd_jump_link();
> they obviously can't use generic_readlink() (there is no string left
> by ->follow_link() for caller to traverse), so they have non-standard
> ->readlink() instances - ones that use d_path() to generate a plausible
> pathname of the would-be destination of their ->follow_link().  Or something
> like pipe:[696969], etc.
> 
> Note, however, that ->readlink() is used only by readlink(2) syscall; as far
> as pathname resolution is concerned it is completely irrelevant.  What matters
> is ->follow_link().
> 
> Now, the callers do not know (and do not care) what a particular symlink _is_.
> A symlink is just a dentry with inode that has non-NULL ->follow_link()
> method.  That's it.  Moreover, _any_ pathname resolution is using the
> same method for symlink traversal, be it open(2), stat(2), whatever.  If
> a symlink is to be traversed, that's it - the only choice VFS has is whether
> to traverse it at all or not (think of stat(2) vs lstat(2) difference, or
> O_NOFOLLOW, etc.)
> 
> _After_ the traversal it's too late to do this sort of thing - after all,
> how do you tell if your current position had been set by the traversal of
> your symlink or that of any normal /proc/self/fd/<n>?

Thanks for clarifying how this all works in the kernel. It makes it
easier to understand what the costs (especially complexity costs) of
different implementation options might be for the kernel.

> And doing that _during_ the traversal would really suck - stray ls -lR /proc
> could race with that open() done by script interpreter.

IMO this one issue is easily solvable by limiting the special action
to calls by the owning pid.

Rich
--
Al Viro Jan. 10, 2015, 4:14 a.m. UTC | #28
On Fri, Jan 09, 2015 at 10:41:44PM -0500, Rich Felker wrote:
> > _After_ the traversal it's too late to do this sort of thing - after all,
> > how do you tell if your current position had been set by the traversal of
> > your symlink or that of any normal /proc/self/fd/<n>?
> 
> Thanks for clarifying how this all works in the kernel. It makes it
> easier to understand what the costs (especially complexity costs) of
> different implementation options might be for the kernel.
> 
> > And doing that _during_ the traversal would really suck - stray ls -lR /proc
> > could race with that open() done by script interpreter.
> 
> IMO this one issue is easily solvable by limiting the special action
> to calls by the owning pid.

Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
etc.) before bothering with open(2), you'll get screwed.  Moreover, if it
does so only in case when you have something specific in environment,
you'll have the devil of a time trying to figure out how to reproduce
such a bug report...
--
Rich Felker Jan. 10, 2015, 5:57 a.m. UTC | #29
On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 10:41:44PM -0500, Rich Felker wrote:
> > > _After_ the traversal it's too late to do this sort of thing - after all,
> > > how do you tell if your current position had been set by the traversal of
> > > your symlink or that of any normal /proc/self/fd/<n>?
> > 
> > Thanks for clarifying how this all works in the kernel. It makes it
> > easier to understand what the costs (especially complexity costs) of
> > different implementation options might be for the kernel.
> > 
> > > And doing that _during_ the traversal would really suck - stray ls -lR /proc
> > > could race with that open() done by script interpreter.
> > 
> > IMO this one issue is easily solvable by limiting the special action
> > to calls by the owning pid.
> 
> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
> etc.) before bothering with open(2), you'll get screwed.

Yes, but I think that would be very bad interpreter design.
stat/getxattr/access/whatever followed by open is always a TOCTOU
race. The correct sequence of actions is always open followed by
fstat/fgetxattr/...
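
A minimal sketch of that ordering, for illustration only: the helper name
open_regular_file() is made up here, and the "must be a regular file" check
just stands in for whatever an interpreter would really want to verify. The
point is that the check is made on the object that was actually opened, not
on whatever the pathname happens to refer to at some earlier moment.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open first, then inspect the opened object.
 * Returns a descriptor for a regular file, or -1 on any failure. */
int open_regular_file(const char *path)
{
    struct stat sb;
    int fd;

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* fstat() describes the file we hold open, so the answer cannot be
     * invalidated by a rename or replacement of 'path' in between. */
    if (fstat(fd, &sb) == -1 || !S_ISREG(sb.st_mode)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The racy variant would call stat(path) first and open(path) afterwards, and
the two calls could then see different files.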

> Moreover, if it
> does so only when you have something specific in the environment,
> you'll have the devil of a time trying to figure out how to reproduce
> such a bug report...

Yes, this is a more serious concern. For example, if a shell processes
$HISTFILE or something before opening the script. I'm starting to
prefer the idea of just refusing to honor the close-on-exec flag for
the fd passed to fexecve but preserving it, and letting the
interpreter close the file itself if it wants to. This could be done
with or without the new auxv entry stuff.

Rich
Michael Kerrisk \(man-pages\) Jan. 10, 2015, 7:13 a.m. UTC | #30
On 01/09/2015 11:13 PM, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
>> On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:
> 
>> The "magic open-once magic symlink" approach is really the cleanest
>> solution I can find. In the case where the interpreter does not open
>> the script, nothing terribly bad happens; the magic symlink just
>> sticks around until _exit or exec. In the case where the interpreter
>> opens it more than once, you get a failure, but as far as I know
>> existing interpreters don't do this, and it's arguably bad design. In
>> any case it's a caught error.
> 
> And it doesn't work without introducing security vulnerabilities into
> the kernel, because it breaks close-on-exec semantics.
> 
> All you have to do is pick a file descriptor, good candidates are 0 and
> 255 and make it a convention that that file descriptor is used for
> fexecve.  At least when you want to support scripts.  Otherwise you can
> set close-on-exec.
> 
> That results in no accumulation of file descriptors  because everyone
> always uses the same file descriptor.
> 
> Regardless you don't have a patch and you aren't proposing code and the
> code isn't actually broken so please go away.

Eric,

This style of response isn't helpful. Suggesting that people must have
a patch in hand in order to have a conversation about kernel development
means a lot of clever people are going to be excluded from important
conversations. Those clever people are some user-space developers
who develop the software that the kernel interacts with--you know, the
user-space that is the kernel's raison-d'être.

Rich, as far as I've seen, is one of those clever people--he implemented
and maintains a (pretty much complete?) standard C library, so when he
comes to a conversation like this, I think it's best to start with
the assumption that he's thought long and hard about the problem, and
seemingly hostile responses such as those you (and Al) make above don't
do much to advance the conversation to a solution.

And there is a problem [*] and nothing I've seen so far in this
conversation seems to provide a solution within the current 
kernel implementation (but, maybe I am not clever enough to see it).

==

[*] A summary of the problem for bystanders:

[0.a] Some people want a solution to implementing fexecve() 
      (http://man7.org/linux/man-pages/man3/fexecve.3.html )
      in the absence of /proc (which is currently used for 
      the implementation). The new execveat() is a stepping
      stone to that solution.

[0.b] POSIX permits, but does not require, the FD_CLOEXEC
      (close-on-exec) file descriptor flag to be set on the
      file descriptor passed to fexecve().

[1]   The sequence:
          * Open a script file, to get a descriptor, 'fd'
          * Set the close-on-exec flag on 'fd'
          * execveat(fd, "", argv, envp, AT_EMPTY_PATH)

      fails in the execveat() because by the time the script 
      interpreter has been loaded, 'fd' has been closed because
      of the close-on-exec flag.

[2]   Omitting the use of close-on-exec on the FD given to
      fexecve()/execveat() means that the execed script
      receives a superfluous file descriptor that refers to the
      script file. The script cannot determine that there is such
      an FD or which FD it is without some messy special-case
      hacking to inspect its environment (and that hacking must be
      based on /proc, AFAICT!)

[3]   Scripts won't do the check in [2], with the result that
      there'll be descriptor leaks in some cases where
      fexecve()/execveat() is used repeatedly.

[4]   (As Rich points out in a reply to the parent message, the
      solution suggested above of using a fixed file descriptor 
      for fexecve() does not solve the problem either.)
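
To make the "special-case hacking" in [2] concrete, here is a rough sketch
of what a program would have to do to discover whether it holds a stray
descriptor for a given file. It is only an illustration: the function name
find_fd_for_path() is invented for this example, it assumes /proc is
mounted, and it compares link targets as plain strings.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: return the lowest-numbered open descriptor whose
 * /proc/self/fd link target equals 'target', or -1 if there is none
 * (or if /proc is not available). */
int find_fd_for_path(const char *target)
{
    DIR *d = opendir("/proc/self/fd");
    struct dirent *e;
    int found = -1;

    if (d == NULL)
        return -1;

    while ((e = readdir(d)) != NULL) {
        char link[64], buf[PATH_MAX];
        char *end;
        long fd;
        ssize_t n;

        fd = strtol(e->d_name, &end, 10);
        if (end == e->d_name || *end != '\0')
            continue;               /* skips "." and ".." */
        if (fd == dirfd(d))
            continue;               /* the descriptor used for this scan */

        snprintf(link, sizeof(link), "/proc/self/fd/%ld", fd);
        n = readlink(link, buf, sizeof(buf) - 1);
        if (n == -1)
            continue;
        buf[n] = '\0';

        if (strcmp(buf, target) == 0 && (found == -1 || fd < found))
            found = (int) fd;
    }
    closedir(d);
    return found;
}
```

Even this much depends on /proc, which is exactly what the fexecve() use
case in [0.a] is trying to avoid.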

For an example of the leak, consider the following simple program 
and script. The program is just a simple command-line interface to 
exercise execveat():

=====
/* t_execveat.c
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <sys/syscall.h>

#ifndef __NR_execveat
#define __NR_execveat 322 /* x86-64 */
#endif

/* Invoke the system call directly via syscall(2) */
static int execveat(int dirfd, const char *pathname, char *const argv[],
                    char *const envp[], int flags)
{
    return syscall(__NR_execveat, dirfd, pathname, argv, envp, flags);
}

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

extern char **environ;

int
main(int argc, char *argv[])
{
    int flags, dirfd;
    char *path;

    flags = 0;

    if (argc < 4) {
        fprintf(stderr, "%s dirfd-path path argv0 [argvN...]\n", argv[0]);
        fprintf(stderr, "\tSpecify 'dirfd' as '-' to get AT_FDCWD\n");
        fprintf(stderr, "\tSpecify 'path' as an empty string to get "
                "AT_EMPTY_PATH\n");
        exit(EXIT_FAILURE);
    }

    if (argv[1][0] == '-')
        dirfd = AT_FDCWD;
    else {
        dirfd = open(argv[1], O_RDONLY);
        if (dirfd == -1)
            errExit("open");
    }

    path = argv[2];
    if (strlen(path) == 0)
        flags = AT_EMPTY_PATH;

    execveat(dirfd, path, &argv[3], environ, flags);
    errExit("execveat");

    exit(EXIT_SUCCESS);
}
=====

And then a simple script (necho.sh) that recursively invokes itself using
the above program demonstrates the problem.

=====
#!/bin/sh
echo 
echo '$0 =' $0
ls -l /proc/$$/fd
./t_execveat ./necho.sh "" arg1 # $arg
=====

When we run this script, we see:

=====

# chmod +x necho.sh
# ./t_execveat ./necho.sh "" arg1

$0 = /dev/fd/3
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh

$0 = /dev/fd/4
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh

$0 = /dev/fd/5
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 5 -> /home/mtk/necho.sh

$0 = /dev/fd/6
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 5 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 6 -> /home/mtk/necho.sh

$0 = /dev/fd/7
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 5 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 6 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 7 -> /home/mtk/necho.sh


[and so on until we run out of file descriptors]
=====

(I think the FD 199 in the above output is some bash(1) artifact, unrelated
to the conversation at hand.)

Thanks,

Michael
Michael Kerrisk \(man-pages\) Jan. 10, 2015, 7:38 a.m. UTC | #31
On 01/09/2015 05:13 PM, Rich Felker wrote:
> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>> ---
>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 153 insertions(+)
>>>  create mode 100644 man2/execveat.2
>>
>> David,
>>
>> Thanks for the very nicely prepared man page. I've done 
>> a few very light edits, and will release the version below 
>> with the next man-pages release.
>>
>> I have one question. In the message accompanying
>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>
>>   The filename fed to the executed program as argv[0] (or the name of the
>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>   reflecting how the executable was found.  This does however mean that
>>   execution of a script in a /proc-less environment won't work; also, script
>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>   accessible after exec).
>>
>> How does one produce this situation where the execed program sees 
>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>> call look like?) I tried to produce this scenario, but could not.
> 
> I think this is wrong. argv[0] is an arbitrary string provided by the
> caller and would never be derived from the fd passed. It's AT_EXECFN,
> /proc/self/exe, and filenames shown elsewhere in /proc that may be
> derived in odd ways.
> 
> I would also move the text about O_CLOEXEC to a BUGS or NOTES section
> rather than the main description. The long-term intent should be that
> script execution this way should work. IIRC this was discussed earlier
> in the thread.

I agree that something needs to be said. What I did instead was
add "See BUGS" to the ENOENT error, and then this text:

   BUGS
       The ENOENT error described above means that it is not possible
       to set the close-on-exec flag on the file descriptor given to a
       call of the form:

           execveat(fd, "", argv, envp, AT_EMPTY_PATH);

       However, the inability to set the close-on-exec flag means that a
       file descriptor referring to the script leaks through to the
       script itself.  As well as wasting a file descriptor, this
       leakage can lead to file-descriptor exhaustion in scenarios where
       scripts recursively employ execveat() (or a future fexecve(3)
       implementation that might be based on execveat()).

Okay?

Thanks,

Michael
Michael Kerrisk \(man-pages\) Jan. 10, 2015, 7:43 a.m. UTC | #32
On 01/09/2015 06:46 PM, David Drysdale wrote:
> On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
>> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>>> ---
>>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 153 insertions(+)
>>>>  create mode 100644 man2/execveat.2
>>>
>>> David,
>>>
>>> Thanks for the very nicely prepared man page. I've done
>>> a few very light edits, and will release the version below
>>> with the next man-pages release.
>>>
>>> I have one question. In the message accompanying
>>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>>
>>>   The filename fed to the executed program as argv[0] (or the name of the
>>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>>   reflecting how the executable was found.  This does however mean that
>>>   execution of a script in a /proc-less environment won't work; also, script
>>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>>   accessible after exec).
>>>
>>> How does one produce this situation where the execed program sees
>>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>>> call look like?) I tried to produce this scenario, but could not.
>>
>> I think this is wrong. argv[0] is an arbitrary string provided by the
>> caller and would never be derived from the fd passed.
> 
> Yeah, I think I just wrote that wrong, it's only relevant for scripts.
> As Rich says, for normal binaries argv[0] is just the argv[0] that
> was passed into the execve[at] call.  For a script, the code in
> fs/binfmt_script.c will remove the original argv[0] and put the
> interpreter name and the script filename (e.g. "/bin/sh",
> "/dev/fd/6/script") in as 2 arguments in its place.

Yep, got it now.

> [As an aside, IIRC the filename does get put into the new
> process's memory, up above the environment strings -- but
> that copy isn't visible via argv nor envp.]
> 
>> It's AT_EXECFN,
>> /proc/self/exe, and filenames shown elsewhere in /proc that may be
>> derived in odd ways.
>>
>> I would also move the text about O_CLOEXEC to a BUGS or NOTES section
>> rather than the main description. The long-term intent should be that
>> script execution this way should work. IIRC this was discussed earlier
>> in the thread.
> 
> I may be misremembering, but I thought we hoped to be able to fix
> execveat of a script without /proc in future, but didn't expect to fix
> execveat of a script via an O_CLOEXEC fd (because in the latter
> case the fd gets closed before the script interpreter runs, so even
> if the interpreter (or a special filesystem) does clever things for names
> starting with "/dev/fd/..." the file descriptor is already gone).

See my other replies (and of course, Rich's). It does seem there is 
a real problem to be solved here.

Thanks,

Michael
Michael Kerrisk \(man-pages\) Jan. 10, 2015, 7:56 a.m. UTC | #33
On 01/09/2015 07:02 PM, David Drysdale wrote:
> On Fri, Jan 9, 2015 at 3:47 PM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>> ---
>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 153 insertions(+)
>>>  create mode 100644 man2/execveat.2
>>
>> David,
>>
>> Thanks for the very nicely prepared man page. I've done
>> a few very light edits, and will release the version below
>> with the next man-pages release.
> 
> Many thanks, one error (of mine) in 2 places pointed out below.

Well, the first error was yours. The second error was mine,
when I replicated your info about AT_SYMLINK_NOFOLLOW
into the ERRORS without verifying it. (Sorry about that!)

Both cases fixed now.

Thanks,

Michael
Michael Kerrisk \(man-pages\) Jan. 10, 2015, 8:27 a.m. UTC | #34
On 01/09/2015 06:46 PM, David Drysdale wrote:
> On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
>> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>>> ---
>>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 153 insertions(+)
>>>>  create mode 100644 man2/execveat.2
>>>
>>> David,
>>>
>>> Thanks for the very nicely prepared man page. I've done
>>> a few very light edits, and will release the version below
>>> with the next man-pages release.
>>>
>>> I have one question. In the message accompanying
>>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>>
>>>   The filename fed to the executed program as argv[0] (or the name of the
>>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>>   reflecting how the executable was found.  This does however mean that
>>>   execution of a script in a /proc-less environment won't work; also, script
>>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>>   accessible after exec).
>>>
>>> How does one produce this situation where the execed program sees
>>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>>> call look like?) I tried to produce this scenario, but could not.
>>
>> I think this is wrong. argv[0] is an arbitrary string provided by the
>> caller and would never be derived from the fd passed.
> 
> Yeah, I think I just wrote that wrong, it's only relevant for scripts.
> As Rich says, for normal binaries argv[0] is just the argv[0] that
> was passed into the execve[at] call.  For a script, the code in
> fs/binfmt_script.c will remove the original argv[0] and put the
> interpreter name and the script filename (e.g. "/bin/sh",
> "/dev/fd/6/script") in as 2 arguments in its place.

So, on reflection, I think it's worth saying something about this, and 
I added the following text to the man page:

   NOTES
       When asked to execute a script file, the argv[0] that  is  passed
       to  the  script  interpreter is a string of the form /dev/fd/N or
       /dev/fd/N/P, where N is the number of the file descriptor  passed
       via  the  dirfd argument.  A string of the first form occurs when
       AT_EMPTY_PATH is employed.  A string of the  second  form  occurs
       when the script is specified via both dirfd and pathname; in this
       case, P is the value given in pathname.
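
As an illustration of the two forms described in that text, the string can
be modelled as below. This is only a sketch of the naming rule as stated in
the NOTES, not the kernel code; the function name script_name_for() is
invented here, and it assumes dirfd is a real descriptor (not AT_FDCWD) and
pathname is relative.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the name handed to the interpreter:
 * "/dev/fd/N" for an empty pathname, "/dev/fd/N/P" otherwise.
 * Returns what snprintf() returns (the length of the full name). */
int script_name_for(char *buf, size_t size, int dirfd, const char *pathname)
{
    if (strlen(pathname) == 0)      /* the AT_EMPTY_PATH case */
        return snprintf(buf, size, "/dev/fd/%d", dirfd);

    return snprintf(buf, size, "/dev/fd/%d/%s", dirfd, pathname);
}
```

With dirfd 6 and pathname "script" this yields "/dev/fd/6/script", matching
the example David gave above.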

Thanks,

Michael
Rich Felker Jan. 10, 2015, 1:31 p.m. UTC | #35
On Sat, Jan 10, 2015 at 09:27:46AM +0100, Michael Kerrisk (man-pages) wrote:
> On 01/09/2015 06:46 PM, David Drysdale wrote:
> > On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
> >> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
> >>> On 11/24/2014 12:53 PM, David Drysdale wrote:
> >>>> Signed-off-by: David Drysdale <drysdale@google.com>
> >>>> ---
> >>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 153 insertions(+)
> >>>>  create mode 100644 man2/execveat.2
> >>>
> >>> David,
> >>>
> >>> Thanks for the very nicely prepared man page. I've done
> >>> a few very light edits, and will release the version below
> >>> with the next man-pages release.
> >>>
> >>> I have one question. In the message accompanying
> >>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
> >>>
> >>>   The filename fed to the executed program as argv[0] (or the name of the
> >>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
> >>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
> >>>   reflecting how the executable was found.  This does however mean that
> >>>   execution of a script in a /proc-less environment won't work; also, script
> >>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
> >>>   accessible after exec).
> >>>
> >>> How does one produce this situation where the execed program sees
> >>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
> >>> call look like?) I tried to produce this scenario, but could not.
> >>
> >> I think this is wrong. argv[0] is an arbitrary string provided by the
> >> caller and would never be derived from the fd passed.
> > 
> > Yeah, I think I just wrote that wrong, it's only relevant for scripts.
> > As Rich says, for normal binaries argv[0] is just the argv[0] that
> > was passed into the execve[at] call.  For a script, the code in
> > fs/binfmt_script.c will remove the original argv[0] and put the
> > interpreter name and the script filename (e.g. "/bin/sh",
> > "/dev/fd/6/script") in as 2 arguments in its place.
> 
> So, on reflection, I think it's worth saying something about this, and 
> I added the following text to the man page:
> 
>    NOTES
>        When asked to execute a script file, the argv[0] that  is  passed
>        to  the  script  interpreter is a string of the form /dev/fd/N or
>        /dev/fd/N/P, where N is the number of the file descriptor  passed
>        via  the  dirfd argument.  A string of the first form occurs when
>        AT_EMPTY_PATH is employed.  A string of the  second  form  occurs
>        when the script is specified via both dirfd and pathname; in this
>        case, P is the value given in pathname.

While I'm aware that you're simply documenting, it seems unnecessary
to me (and unnecessarily complicating of the cloexec issue) to have
the /dev/fd/N/P form. This could always be resolved by the kernel to a
single temp fd for the new process to use, and in fact it's probably
preferable to always get a "temp fd" in case the fd passed to fexecve
is NOT a throwaway one (e.g. if the original fd was stdin or
something); the program being executed should not have to use ugly and
error-prone heuristics to decide if it should close the exec fd.

On the other hand, this resolution could be done by userspace (open
with O_PATH|O_CLOEXEC prior to making the fexecveat syscall, and
always passing AT_EMPTY_PATH to the kernel) if desirable, so maybe it
doesn't make sense to have the kernel do it. In this sense the whole
"at" part of fexecveat becomes vestigial, though.
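
A sketch of what that userspace-side resolution might look like, under the
assumption that the C library calls the system call directly. The helper
names open_exec_fd() and exec_by_fd() are invented for this example, and
SYS_execveat is assumed to be provided by the installed kernel headers.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: resolve (dirfd, pathname) to a single descriptor
 * that only names the file and is closed automatically on exec. */
int open_exec_fd(int dirfd, const char *pathname)
{
    return openat(dirfd, pathname, O_PATH | O_CLOEXEC);
}

/* Hypothetical helper: execute the file behind 'fd', always using the
 * empty-pathname form. Returns only on failure, with errno set. */
int exec_by_fd(int fd, char *const argv[], char *const envp[])
{
    return (int) syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}
```

For a binary this should behave as expected; for a script, the O_CLOEXEC
flag runs straight into the ENOENT case discussed elsewhere in this thread.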

Any thoughts?

Rich
Eric W. Biederman Jan. 10, 2015, 10:27 p.m. UTC | #36
Rich Felker <dalias@aerifal.cx> writes:

> On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:

>> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
>> etc.) before bothering with open(2), you'll get screwed.
>
> Yes, but I think that would be very bad interpreter design.
> stat/getxattr/access/whatever followed by open is always a TOCTOU
> race. The correct sequence of actions is always open followed by
> fstat/fgetxattr/...

Sigh.  I think everyone who has looked at this has been blind.

If userspace is reasonable all we have to do is fix /proc/self/exe
for shell scripts to point at the actual script,
and then pass /proc/self/exe on the shell script's command line.

At a practical level we have to worry about backwards compatibility and
chroot jails.  But the existence of a clean implementation with
/proc/self/exe serves as a proof of concept that it would not be too
difficult.  When someone cares enough to implement it.

Eric
Rich Felker Jan. 11, 2015, 1:15 a.m. UTC | #37
On Sat, Jan 10, 2015 at 04:27:23PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:
> 
> >> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
> >> etc.) before bothering with open(2), you'll get screwed.
> >
> > Yes, but I think that would be very bad interpreter design.
> > stat/getxattr/access/whatever followed by open is always a TOCTOU
> > race. The correct sequence of actions is always open followed by
> > fstat/fgetxattr/...
> 
> Sigh.  I think everyone who has looked at this has been blind.
> 
> If userspace is reasonable all we have to do is fix /proc/self/exe
> for shell scripts to point at the actual script,
> and then pass /proc/self/exe on the shell script's command line.
> 
> At a practical level we have to worry about backwards compatibility and
> chroot jails.  But the existence of a clean implementation with
> /proc/self/exe serves as a proof of concept that it would not be too
> difficult.  When someone cares enough to implement it.

Is /proc/self/exe a "magic symlink" that's bound to the inode, or just
a regular symlink? In the latter case it defeats the whole purpose of
using O_EXEC fds and fexecve rather than pathnames.

Rich
Eric W. Biederman Jan. 11, 2015, 2:09 a.m. UTC | #38
Rich Felker <dalias@aerifal.cx> writes:

> On Sat, Jan 10, 2015 at 04:27:23PM -0600, Eric W. Biederman wrote:
>> Rich Felker <dalias@aerifal.cx> writes:
>> 
>> > On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:
>> 
>> >> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
>> >> etc.) before bothering with open(2), you'll get screwed.
>> >
>> > Yes, but I think that would be very bad interpreter design.
>> > stat/getxattr/access/whatever followed by open is always a TOCTOU
>> > race. The correct sequence of actions is always open followed by
>> > fstat/fgetxattr/...
>> 
>> Sigh.  I think everyone who has looked at this has been blind.
>> 
>> If userspace is reasonable all we have to do is fix /proc/self/exe
>> for shell scripts to point at the actual script,
>> and then pass /proc/self/exe on the shell script's command line.
>> 
>> At a practical level we have to worry about backwards compatibility and
>> chroot jails.  But the existence of a clean implementation with
>> /proc/self/exe serves as a proof of concept that it would not be too
>> difficult.  When someone cares enough to implement it.
>
> Is /proc/self/exe a "magic symlink" that's bound to the inode, or just
> a regular symlink? In the latter case it defeats the whole purpose of
> using O_EXEC fds and fexecve rather than pathnames.

In implementation /proc/self/exe is a named rather than a numbered file
descriptor.  Essentially when loading an elf executable the file
descriptor is duped to the name /proc/self/exe.  The implementation
otherwise is the same as /proc/self/fd/N.

The downside of course is that I expect if we were actually to change
/proc/self/exe to point at the script instead of the shell some
piece of software somewhere would come melting down.  I am totally not
ready to consider that kind of mine field today.

Eric

Christoph Hellwig Jan. 11, 2015, 11:02 a.m. UTC | #39
On Sat, Jan 10, 2015 at 08:09:10PM -0600, Eric W. Biederman wrote:
> In implementation /proc/self/exe is a named rather than a numbered file
> descriptor.  Essentially when loading an elf executable the file
> descriptor is duped to the name /proc/self/exe.  The implementation
> otherwise is the same as /proc/self/fd/N.
> 
> The downside of course is that I expect if we were actually to change
> /proc/self/exe to point at the script instead of the shell some
> piece of software somewhere would come melting down.  I am totally not
> ready to consider that kind of mine field today.

We could add a /proc/self/script that points to the script, and either
is not available or still points to the executable if we are not running
a script.
David Drysdale Jan. 12, 2015, 11:33 a.m. UTC | #40
On Sat, Jan 10, 2015 at 1:33 AM, Rich Felker <dalias@aerifal.cx> wrote:
> On Fri, Jan 09, 2015 at 07:17:41PM -0600, Eric W. Biederman wrote:
>> Rich Felker <dalias@aerifal.cx> writes:
>>
>> > I'm not proposing code because I'm a libc developer not a kernel
>> > developer. I know what's needed for userspace to provide a conforming
>> > fexecve to applications, not how to implement that on the kernel side,
>> > although I'm trying to provide constructive ideas. The hostility is
>> > really not necessary.
>>
>> Conforming to what?
>>
>> The open group fexecve says nothing about requiring a file descriptor
>> passed to fexecve to have O_CLOEXEC.
>
> It doesn't require it but it allows it, and in multithreaded programs
> that might run child processes (or library code that might be used in
> such situations), O_CLOEXEC is mandatory everywhere to avoid fd leaks.

As a naive idea related to Andy's suggestion elsewhere, could you
just have an environment convention for fexecve-ing scripts?  That
would reduce FD leaks without any need for kernel involvement/changes.

For example, set _FEXECVED_VIA_FD=4 but don't set
O_CLOEXEC before fexecve, and the interpreter reads then
closes that FD.  Or just get the interpreter to spot scripts named
"/dev/fd/%d" and read-then-close the FD that way, cf. Eric's suggestion
at https://lkml.org/lkml/2014/10/22/652.
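
For the second variant, the interpreter-side check could be as small as the
sketch below. It is purely illustrative: fd_from_script_name() is a made-up
name, and it deliberately accepts only the exact "/dev/fd/N" form, so a name
like "/dev/fd/6/script" is treated as an ordinary path.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if 'name' is exactly "/dev/fd/N", return N;
 * otherwise return -1 so the caller opens 'name' in the usual way. */
int fd_from_script_name(const char *name)
{
    static const char prefix[] = "/dev/fd/";
    const char *digits;
    char *end;
    long fd;

    if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
        return -1;

    digits = name + sizeof(prefix) - 1;
    if (!isdigit((unsigned char) digits[0]))
        return -1;                  /* rejects "", "-1", "+6", " 6" */

    errno = 0;
    fd = strtol(digits, &end, 10);
    if (errno != 0 || *end != '\0' || fd > INT_MAX)
        return -1;                  /* overflow or trailing characters */

    return (int) fd;
}
```

An interpreter that gets a non-negative result would read the script from
that inherited descriptor and close it when done, instead of opening the
name a second time.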

By the way, FreeBSD has a fexecve(2) syscall that behaves
in the same way as the current Linux code for an O_CLOEXEC
script -- the interpreter fails to open "/dev/fd/6" as it's gone.
Do you know if there are any other OSes that already do
something more sophisticated for this case?
David Drysdale Jan. 12, 2015, 2:18 p.m. UTC | #41
On Fri, Jan 9, 2015 at 9:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 09, 2015 at 04:28:52PM -0500, Rich Felker wrote:
>
>> The "magic open-once magic symlink" approach is really the cleanest
>> solution I can find. In the case where the interpreter does not open
>> the script, nothing terribly bad happens; the magic symlink just
>> sticks around until _exit or exec. In the case where the interpreter
>> opens it more than once, you get a failure, but as far as I know
>> existing interpreters don't do this, and it's arguably bad design. In
>> any case it's a caught error.
>
> You know what's cleaner than that?  git revert 27d6ec7ad
> It has just been merged; until 3.19 it's fair game for removal.
>
> And yes, I should've NAKed the damn thing loud and clear, rather than
> asking questions back then, getting no answers and letting it slip.
> Mea culpa.

Al, I'm sorry if I missed a question or concern of yours back in
October -- I certainly didn't intend to (that would be foolish indeed!).

[I thought the main open question was whether a dupfs
implementation would help with /dev/fd/ and /proc/ semantics, but I
had the (possibly incorrect) understanding that that was somewhat
orthogonal to the execveat implementation.]

Are there any changes/fixes/refactorings that I could do (especially
within the 3.19 timeframe) that would help mollify at all?

> Back then the procfs-free environments had been pushed as a serious argument
> in favour of merging the damn thing.  Now you guys turn around and say that
> we not only need procfs mounted, we need a yet-to-be-added kludge in there
> to cope with the actual intended uses.

Not me!
Rich Felker Jan. 12, 2015, 4:07 p.m. UTC | #42
On Mon, Jan 12, 2015 at 11:33:49AM +0000, David Drysdale wrote:
> On Sat, Jan 10, 2015 at 1:33 AM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Fri, Jan 09, 2015 at 07:17:41PM -0600, Eric W. Biederman wrote:
> >> Rich Felker <dalias@aerifal.cx> writes:
> >>
> >> > I'm not proposing code because I'm a libc developer not a kernel
> >> > developer. I know what's needed for userspace to provide a conforming
> >> > fexecve to applications, not how to implement that on the kernel side,
> >> > although I'm trying to provide constructive ideas. The hostility is
> >> > really not necessary.
> >>
> >> Conforming to what?
> >>
> >> The open group fexecve says nothing about requiring a file descriptor
> >> passed to fexecve to have O_CLOEXEC.
> >
> > It doesn't require it but it allows it, and in multithreaded programs
> > that might run child processes (or library code that might be used in
> > such situations), O_CLOEXEC is mandatory everywhere to avoid fd leaks.
> 
> As a naive idea related to Andy's suggestion elsewhere, could you
> just have an environment convention for fexecve-ing scripts?  That
> would reduce FD leaks without any need for kernel involvement/changes.
> 
> For example, set _FEXECVED_VIA_FD=4 but don't set
> O_CLOEXEC before fexecve, and the interpreter reads then
> closes that FD.  Or just get the interpreter to spot scripts named
> "/dev/fd/%d" and read-then-close the FD that way, cf. Eric's suggestion
> at https://lkml.org/lkml/2014/10/22/652.

No. Any omission of O_CLOEXEC even momentarily is a potentially
dangerous fd leak. This is the case whenever the process is
multithreaded and it's possible that other threads might fork and
exec. Think of the case of a privileged daemon re-execing itself (e.g.
to switch to an updated version) while there are potentially other
threads spawning non-privileged processes.

Rich

Patch

diff --git a/man2/execveat.2 b/man2/execveat.2
new file mode 100644
index 000000000000..937d79e4c4f0
--- /dev/null
+++ b/man2/execveat.2
@@ -0,0 +1,153 @@ 
+.\" Copyright (c) 2014 Google, Inc.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
+.SH NAME
+execveat \- execute program relative to a directory file descriptor
+.SH SYNOPSIS
+.B #include <unistd.h>
+.sp
+.BI "int execveat(int " fd ", const char *" pathname ","
+.br
+.BI "             char *const " argv "[],  char *const " envp "[],"
+.br
+.BI "             int " flags);
+.SH DESCRIPTION
+The
+.BR execveat ()
+system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
+The
+.BR execveat ()
+system call operates in exactly the same way as
+.BR execve (2),
+except for the differences described in this manual page.
+
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I fd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR execve (2)
+for a relative pathname).
+
+If
+.I pathname
+is relative and
+.I fd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR execve (2)).
+
+If
+.I pathname
+is absolute, then
+.I fd
+is ignored.
+
+If
+.I pathname
+is an empty string and the
+.BR AT_EMPTY_PATH
+flag is specified, then the file descriptor
+.I fd
+specifies the file to be executed.
+
+.I flags
+can either be 0, or include the following flags:
+.TP
+.BR AT_EMPTY_PATH
+If
+.I pathname
+is an empty string, operate on the file referred to by
+.IR fd
+(which may have been obtained using the
+.BR open (2)
+.B O_PATH
+flag).
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If the file identified by
+.I fd
+and a non-NULL
+.I pathname
+is a symbolic link, then the call fails with the error
+.BR ELOOP .
+.SH "RETURN VALUE"
+On success,
+.BR execveat ()
+does not return. On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The same errors that occur for
+.BR execve (2)
+can also occur for
+.BR execveat ().
+The following additional errors can occur for
+.BR execveat ():
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B ENOENT
+The program identified by \fIfd\fP and \fIpathname\fP requires the
+use of an interpreter program (such as a script starting with
+"#!") but the file descriptor
+.I fd
+was opened with the
+.B O_CLOEXEC
+flag and so the program file is inaccessible to the launched interpreter.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ENOTDIR
+.I pathname
+is relative and
+.I fd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR execveat ()
+was added to Linux in kernel 3.???.
+.SH NOTES
+In addition to the reasons explained in
+.BR openat (2),
+the
+.BR execveat ()
+system call is also needed to allow
+.BR fexecve (3)
+to be implemented on systems that do not have the
+.I /proc
+filesystem mounted.
+.SH SEE ALSO
+.BR execve (2),
+.BR fexecve (3)
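
For illustration only (this is not part of the patch): the AT_EMPTY_PATH case described in the man page could be exercised roughly as below. This is a minimal sketch under some assumptions: the helper name run_via_fd is hypothetical, the raw syscall(2) interface is used because a libc wrapper may not be available, and the target is assumed to be a binary rather than a "#!" script, since the page notes that an O_CLOEXEC descriptor makes a script inaccessible to its interpreter.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork, open `path`, and execute it through the
 * resulting file descriptor using execveat() with AT_EMPTY_PATH.
 * Returns the child's exit status, or -1 if fork/wait failed or the
 * child did not exit normally.
 */
int run_via_fd(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* O_PATH is enough to name the file for execveat().
         * O_CLOEXEC is only safe here because the target is assumed
         * to be a binary; a script's interpreter would need the
         * descriptor to stay open. */
        int fd = open(path, O_PATH | O_CLOEXEC);
        if (fd < 0)
            _exit(127);

        /* Empty pathname + AT_EMPTY_PATH: execute the file referred
         * to by fd itself. */
        syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);

        /* Only reached if execveat() failed. */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The exit code 127 for a failed open or exec is a convention borrowed from shells, not something execveat() itself defines. Calling the helper with a path such as /bin/true and an argv of { "true", NULL } should return that program's exit status.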