diff mbox

[Y2038,04/11] posix timers:Introduce the 64bit methods with timespec64 type for k_clock structure

Message ID 3755355.Xf0HbltZXg@wuerfel
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Arnd Bergmann April 21, 2015, 8:59 a.m. UTC
On Monday 20 April 2015 22:40:57 Thomas Gleixner wrote:
> On Mon, 20 Apr 2015, Baolin Wang wrote:
> > @@ -771,6 +771,7 @@ SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
> >  		struct itimerspec __user *, setting)
> >  {
> >  	struct itimerspec cur_setting;
> > +	struct itimerspec64 cur_setting64;
> >  	struct k_itimer *timr;
> >  	struct k_clock *kc;
> >  	unsigned long flags;
> > @@ -781,10 +782,16 @@ SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
> >  		return -EINVAL;
> >  
> >  	kc = clockid_to_kclock(timr->it_clock);
> > -	if (WARN_ON_ONCE(!kc || !kc->timer_get))
> > +	if (WARN_ON_ONCE(!kc || (!kc->timer_get && !kc->timer_get64))) {
> >  		ret = -EINVAL;
> > -	else
> > -		kc->timer_get(timr, &cur_setting);
> > +	} else {
> > +		if (kc->timer_get64) {
> > +			kc->timer_get64(timr, &cur_setting64);
> > +			cur_setting = itimerspec64_to_itimerspec(cur_setting64);
> > +		} else {
> > +			kc->timer_get(timr, &cur_setting);
> > +		}
> > +	}
> 
> This is really horrible. You add a metric ton of conditionals to every
> syscall just to remove them later again. I have not yet checked the
> end result, but this approach is error prone as hell and just
> introduces completely useless code churn.
> 
> It's useless because you do not factor out the guts of the syscall
> functions so we can reuse the very same logic for the future 2038 safe
> syscalls which we need to introduce for 32bit machines.

Hi Thomas,

I should probably go ahead and introduce all the new syscalls in a
separate patch series. I have my own ideas about how to get there,
in a slightly different way from what you propose:

> Take a look at the compat syscalls. They do the right thing.
> 
> COMPAT_SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
>                        struct compat_itimerspec __user *, setting)
> {
>         long err;
>         mm_segment_t oldfs;
>         struct itimerspec ts;
> 
>         oldfs = get_fs();
>         set_fs(KERNEL_DS);
>         err = sys_timer_gettime(timer_id,
>                                 (struct itimerspec __user *) &ts);
>         set_fs(oldfs);
>         if (!err && put_compat_itimerspec(setting, &ts))
>                 return -EFAULT;
>         return err;
> }

As a side note, I want to kill off the get_fs()/set_fs() calls in
the process. These always make me dizzy when I try to work out whether
there is a potential security hole (there is not in this case), and
they can be slow on some architectures.

> So we can be clever and do the following:
> 
> 1) Preparatory work in posix-timer.c (Patch #1)
>
> 2) Do the 64bit infrastructure work in posix-timer.c (Patch #2)
> 
> The result is two simple to review patches with minimal code churn.

This all sounds great, good idea.

> The nice thing is that once we introduce new syscalls for 32bit
> machines, e.g. sys_timer_gettime64(), all we need to do is:
> 
> /* Get the time remaining on a POSIX.1b interval timer. */
> SYSCALL_DEFINE2(timer_gettime64, timer_t, timer_id,
> 		struct itimerspec64 __user *, setting)
> {
> 	struct itimerspec64 cur_setting64;
> 	int ret = __timer_gettime(timer_id, &cur_setting64);
> 
> 	if (!ret && copy_to_user(setting, &cur_setting, sizeof (cur_setting)))
> 		return -EFAULT;
> 
> 	return ret;
> }
> 
> And on 64bit timer_gettime64() and timer_gettime() are the same, so we
> just need to do a clever mapping of timer_gettime() to
> timer_gettime64(). Not rocket science....
> 
> For 32 bit we provide the old timer_gettime() for non converted
> applications:
> 
> #ifdef CONFIG_32BIT_OLD_TIMESPEC_SYSCALLS
> /* Get the time remaining on a POSIX.1b interval timer in the old timespec format. */
> SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
> 		struct itimerspec __user *, setting)
> {
> 	struct itimerspec64 cur_setting64;
> 	struct itimerspec cur_setting;
> 	int ret = __timer_gettime(timer_id, &cur_setting64);
> 
> 	if (!ret) {
> 		cur_setting = itimerspec64_to_itimerspec(&cur_setting64);
> 
> 		if (copy_to_user(setting, &cur_setting, sizeof (cur_setting)))
> 		   	return -EFAULT;
> 	}
> 	return ret;
> }
> #endif
> 
> Simple, isn't it? No useless churn. Proper refactoring for the next
> step. No useless copying for 64 bit.

My preferred solution is one where we end up with the same syscalls for
both 32-bit and 64-bit, and basically use the compat_sys_timer_gettime()
implementation (or a simplified version) for the existing , something like this:



Note the use of a separate __kernel_itimerspec64 for the user interface
here, which I think will be needed to hide the differences between the
normal itimerspec on 64-bit machines, and the new itimerspec on 32-bit
platforms that will be defined differently (using 'long long').

My plan for migration here is to temporarily #define __kernel_itimerspec64
as itimerspec on 32-bit architectures (which may be a bit confusing),
and then introduce a new CONFIG_COMPAT_TIME option that is set by
each architecture that has been converted over to use the new syscalls.
This way we can do one syscall at a time at first, followed by one
architecture at a time.

Unless you have serious objections to that plan, I'd like to work
on a first version of this myself and send that out for clarifications.

I would also prefer not too many people to work on the syscalls, and
would rather have Baolin not touch any of the syscall prototypes for
the moment.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Thomas Gleixner April 21, 2015, 2:14 p.m. UTC | #1
On Tue, 21 Apr 2015, Arnd Bergmann wrote:
> > COMPAT_SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
> >                        struct compat_itimerspec __user *, setting)
> 
> As a side note, I want to kill off the get_fs()/set_fs() calls in
> the process. These always make me dizzy when I try to work out whether
> there is a potential security hole (there is not in this case), and
> they can be slow on some architectures.

Yeah. I have to take a deep breath every time I look at those :)
 
> My preferred solution is one where we end up with the same syscalls
> for both 32-bit and 64-bit, and basically use the
> compat_sys_timer_gettime() implementation (or a simplified version)
> for the existing , something like this:

No objections from my side. I was not looking into the syscall magic
yet. I just wanted to avoid the code churn and have the guts of the
syscalls factored out for simple reusage.

....
 
> Note the use of a separate __kernel_itimerspec64 for the user interface
> here, which I think will be needed to hide the differences between the
> normal itimerspec on 64-bit machines, and the new itimerspec on 32-bit
> platforms that will be defined differently (using 'long long').

Confused.

timespec64 / itimerspec64 should be the same independent of 64bit and
32bit. So why do we need another variant ?

> I would also prefer not too many people to work on the syscalls, and
> would rather have Baolin not touch any of the syscall prototypes for
> the moment.

I did not ask him to change any of the syscall prototypes. I just
wanted him to split out the guts of the syscall into a seperate static
function to avoid all that code churn.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann April 21, 2015, 2:57 p.m. UTC | #2
On Tuesday 21 April 2015 16:14:26 Thomas Gleixner wrote:
> > Note the use of a separate __kernel_itimerspec64 for the user interface
> > here, which I think will be needed to hide the differences between the
> > normal itimerspec on 64-bit machines, and the new itimerspec on 32-bit
> > platforms that will be defined differently (using 'long long').
> 
> Confused.
> 
> timespec64 / itimerspec64 should be the same independent of 64bit and
> 32bit. So why do we need another variant ?

There are multiple reasons:

* On 64-bit systems, timespec64 would always be defined in the same way
  as struct timespec { __kernel_time_t tv_sec; long tv_nsec; }, with
  __kernel_time_t being 'long'. On 32-bit, we probably need to make both
  members 'long long' for the user space side, in order to share the
  syscall implementation with the kernel side, but we may also want to
  keep the internal timespec64 using a 'long' for tv_nsec, as we do
  today. This means that both the binary layout (padding or no padding)
  and the basic types (long or long long) are different between 32-bit
  and 64-bit, and between kernel and user space
* We should not put 'struct timespec64' into the user space namespace,
  as applications might already use that identifier. This is similar
  to the __u32/u32 or __kernel_time_t/time_t tuple of types for interface
  and in-kernel uses. This is particularly important when embedding a
  timespec in another data structure.
* My plan is to use a temporary hack where I actually define
  __kernel_timespec64 to look like the 32-bit version of timespec,
  as an intermediate step when converting all 32-bit architectures over
  to use the compat_*() syscalls in place of the existing ones, so
  I can change over the normal syscalls to use __kernel_timespec64
  without having to change all architectures at once, or having to
  modify each syscall multiple times.

> > I would also prefer not too many people to work on the syscalls, and
> > would rather have Baolin not touch any of the syscall prototypes for
> > the moment.
> 
> I did not ask him to change any of the syscall prototypes. I just
> wanted him to split out the guts of the syscall into a seperate static
> function to avoid all that code churn.

Ok, I wasn't sure about that part, thanks for clarifying.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner April 21, 2015, 3:13 p.m. UTC | #3
On Tue, 21 Apr 2015, Arnd Bergmann wrote:
> On Tuesday 21 April 2015 16:14:26 Thomas Gleixner wrote:
> > > Note the use of a separate __kernel_itimerspec64 for the user interface
> > > here, which I think will be needed to hide the differences between the
> > > normal itimerspec on 64-bit machines, and the new itimerspec on 32-bit
> > > platforms that will be defined differently (using 'long long').
> > 
> > Confused.
> > 
> > timespec64 / itimerspec64 should be the same independent of 64bit and
> > 32bit. So why do we need another variant ?
> 
> There are multiple reasons:
> 
> * On 64-bit systems, timespec64 would always be defined in the same way
>   as struct timespec { __kernel_time_t tv_sec; long tv_nsec; }, with
>   __kernel_time_t being 'long'. On 32-bit, we probably need to make both
>   members 'long long' for the user space side, in order to share the
>   syscall implementation with the kernel side, but we may also want to
>   keep the internal timespec64 using a 'long' for tv_nsec, as we do
>   today. This means that both the binary layout (padding or no padding)
>   and the basic types (long or long long) are different between 32-bit
>   and 64-bit, and between kernel and user space

So you want to avoid a compat syscall for 32bit applications on a
64bit kernel, right?

That burdens 32bit with the extra 'long long' in user space. Not sure
whether user space folks will be happy about it.

> * We should not put 'struct timespec64' into the user space namespace,
>   as applications might already use that identifier. This is similar
>   to the __u32/u32 or __kernel_time_t/time_t tuple of types for interface
>   and in-kernel uses. This is particularly important when embedding a
>   timespec in another data structure.

Fair enough.

> * My plan is to use a temporary hack where I actually define
>   __kernel_timespec64 to look like the 32-bit version of timespec,
>   as an intermediate step when converting all 32-bit architectures over
>   to use the compat_*() syscalls in place of the existing ones, so
>   I can change over the normal syscalls to use __kernel_timespec64
>   without having to change all architectures at once, or having to
>   modify each syscall multiple times.

Makes sense.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann April 21, 2015, 3:40 p.m. UTC | #4
On Tuesday 21 April 2015 17:13:32 Thomas Gleixner wrote:
> On Tue, 21 Apr 2015, Arnd Bergmann wrote:
> > On Tuesday 21 April 2015 16:14:26 Thomas Gleixner wrote:
> > > > Note the use of a separate __kernel_itimerspec64 for the user interface
> > > > here, which I think will be needed to hide the differences between the
> > > > normal itimerspec on 64-bit machines, and the new itimerspec on 32-bit
> > > > platforms that will be defined differently (using 'long long').
> > > 
> > > Confused.
> > > 
> > > timespec64 / itimerspec64 should be the same independent of 64bit and
> > > 32bit. So why do we need another variant ?
> > 
> > There are multiple reasons:
> > 
> > * On 64-bit systems, timespec64 would always be defined in the same way
> >   as struct timespec { __kernel_time_t tv_sec; long tv_nsec; }, with
> >   __kernel_time_t being 'long'. On 32-bit, we probably need to make both
> >   members 'long long' for the user space side, in order to share the
> >   syscall implementation with the kernel side, but we may also want to
> >   keep the internal timespec64 using a 'long' for tv_nsec, as we do
> >   today. This means that both the binary layout (padding or no padding)
> >   and the basic types (long or long long) are different between 32-bit
> >   and 64-bit, and between kernel and user space
> 
> So you want to avoid a compat syscall for 32bit applications on a
> 64bit kernel, right?

That is the idea at the moment, yes. At least for the kernel-user
interface.

> That burdens 32bit with the extra 'long long' in user space. Not sure
> whether user space folks will be happy about it.

I know there are concerns about this, in particular because C11 and
POSIX both require tv_nsec to be 'long', unlike timeval->tv_usec,
which is a 'suseconds_t' and can be defined as 'long long'.

There are four possible ways that 32-bit user space could define
timespec based on the uncontroversial (I hope) 'typedef long long
time_t;'.

a)

struct timespec {
	time_t tv_sec;
	long long tv_nsec; /* or typedef long long snseconds_t */
};

This is not directly compatible with C11 or POSIX.1-2008, but it
matches what we do inside of 64-bit kernels, so probably has the
highest chance of working correctly in practice

b)

struct timespec {
	time_t tv_sec;
	long tv_nsec;
};

This is the definition according to C11 and POSIX, and the main problem
here is the padding, which may cause a 4-byte data leak and has the tv_nsec
in the wrong place when used together with the timespec we have in the kernel
on big-endian 64-bit machines.

c)

struct timespec {
	time_t tv_sec;
#if __BITS_PER_LONG == 32 && __BYTE_ORDER == __BIG_ENDIAN
	long __pad;
#endif
	long tv_nsec; /* or typedef long long snseconds_t */
#if __BITS_PER_LONG == 32 && __BYTE_ORDER == __LITTLE_ENDIAN
	long __pad;
#endif
};

This version could be used transparently by user space, but has
the potential to cause problems with existing user space doing
things like

struct timespec ts = { 0, 1000 }; /* one microsecond */

even though it is probably compliant.

d)

struct timespec {
	time_t tv_sec;
#if __BITS_PER_LONG == 32 && __BYTE_ORDER == __BIG_ENDIAN
	long :32;
#endif
	long tv_nsec; /* or typedef long long snseconds_t */
#if __BITS_PER_LONG == 32 && __BYTE_ORDER == __LITTLE_ENDIAN
	long :32;
#endif
};

This is very similar to c, but trades the problem I described
above for a dependency on a gcc extension that is not part of
C11 or any earlier standard.


From the kernel's point of view, b), c) and d) can all be
treated the same for output data, as we only ever pass back
normalized timespec structures that have zeroes in the upper
bits of timespec. However, for input to the kernel, we would
require an extra conditional on 64-bit kernels to decide whether
we ignore the upper bits (doing tv->tv_nsec &= 0xffffffff) or 
a structure that has the upper bits set  needs to cause EINVAL.
We could still do that in get_timespec() if we decide to not to
go with a).

See also https://lwn.net/Articles/457089/ for an earlier discussion
on the topic when debating the x32 ABI.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner April 21, 2015, 8:13 p.m. UTC | #5
On Tue, 21 Apr 2015, Arnd Bergmann wrote:
> I know there are concerns about this, in particular because C11 and
> POSIX both require tv_nsec to be 'long', unlike timeval->tv_usec,
> which is a 'suseconds_t' and can be defined as 'long long'.
>
> a)
> 
> struct timespec {
> 	time_t tv_sec;
> 	long long tv_nsec; /* or typedef long long snseconds_t */
> };
> 
> This is not directly compatible with C11 or POSIX.1-2008, but it
> matches what we do inside of 64-bit kernels, so probably has the
> highest chance of working correctly in practice

After reading Linus rant in the x32 thread again (thanks for the
reminder), and looking at b/c/d - which rate between ugly and butt
ugly - I think we should go for a) and screw POSIX and C11 as those
committee dinosaurs seem to completely ignore the 2038 problem on
32bit machines. At least I have not found any hint that these folks
care at all. So why should we comply to something which is completely
useless?

That also makes the question about the upper 32bits check moot, so
it's the simplest and clearest of the possible solutions.

Thanks,

	tglx


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner April 22, 2015, 8:45 a.m. UTC | #6
On Tue, 21 Apr 2015, Thomas Gleixner wrote:
> On Tue, 21 Apr 2015, Arnd Bergmann wrote:
> > I know there are concerns about this, in particular because C11 and
> > POSIX both require tv_nsec to be 'long', unlike timeval->tv_usec,
> > which is a 'suseconds_t' and can be defined as 'long long'.
> >
> > a)
> > 
> > struct timespec {
> > 	time_t tv_sec;
> > 	long long tv_nsec; /* or typedef long long snseconds_t */
> > };
> > 
> > This is not directly compatible with C11 or POSIX.1-2008, but it
> > matches what we do inside of 64-bit kernels, so probably has the
> > highest chance of working correctly in practice
> 
> After reading Linus rant in the x32 thread again (thanks for the
> reminder), and looking at b/c/d - which rate between ugly and butt
> ugly - I think we should go for a) and screw POSIX and C11 as those
> committee dinosaurs seem to completely ignore the 2038 problem on
> 32bit machines. At least I have not found any hint that these folks
> care at all. So why should we comply to something which is completely
> useless?
> 
> That also makes the question about the upper 32bits check moot, so
> it's the simplest and clearest of the possible solutions.

Second thoughts after some sleep.

So the outcome of this is going to be that user space libraries will
not expose the syscall variant of

    syscall_timespec64 {
           s64 tv_sec;
	   s64 tv_nsec;
    };

to applications. The libs will translate them to spec conforming

   timespec {
           time_t tv_sec;
	   long   tv_nsec;
   };

anyway. That means we have two translation steps on 32bit systems:

  1) user space timespec -> syscall timespec64

  2) syscall timespec64 -> scalar nsec s64 (ktime_t)

and the other way round. The kernel internal representation is simply
s64 (nsec) based all over the place.

So we could save one translation step if we implement new syscalls
which have a scalar nsec interface instead of the timespec/timeval
cruft and let user space do the translation to whatever it wants.

So

sys_clock_nanosleep(const clockid_t which_clock, int flags,
	            const struct timespec __user *expires,
		    struct timespec __user *reminder)

would get the new syscall variant:

sys_clock_nanosleep_ns(const clockid_t which_clock, int flags,
		       const s64 expires, s64 __user *reminder)

I personally would welcome such an interface as it makes user space
programming simpler. Just (re)arming a periodic nanosleep based on
absolute expiry time is horrible stupid today:

	 struct timespec expires;
	 ....
	 while ()
	       expires.tv_nsec += period.tv_nsec;
	       expires.tv_sec += period.tv_sec;
	       normalize_timespec(&expires);
	       sys_clock_nanosleep(CLOCK_ID, ABS, &expires, NULL);

So with a scalar interface this would reduce to:

	 s64 expires;
	 ....
	 while ()
	       expires += period;
	       sys_clock_nanosleep_ns(CLOCK_ID, ABS, &expires, NULL);

There is a difference both in text and storage size plus the avoidance
of the two translation steps (one translation step on 64bit).

I know that this is non portable, but OTOH if I look at the non
portable mechanisms which are used by data bases, java VMs and other
apps which exist to squeeze the last cycles out of the system, there
is certainly some value to that.

The portable/spec conforming apps can still use the user space
assisted translated timespec/timeval mechanisms.

There is one caveat though: sys_clock_gettime and sys_gettimeofday
will still need a syscall_timespec64 variant. We have no double
translation steps there because we maintain the timespec
representation in the timekeeping code for performance reasons to
avoid the division in the syscall interface. But everything else can
do nicely without the timespec cruft.

We really should talk to libc folks and high performance users about
this before blindly adding a gazillion of new timespec64 based
interfaces.

Thoughts?

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran April 22, 2015, 10:11 a.m. UTC | #7
On Wed, Apr 22, 2015 at 10:45:23AM +0200, Thomas Gleixner wrote:
> So we could save one translation step if we implement new syscalls
> which have a scalar nsec interface instead of the timespec/timeval
> cruft and let user space do the translation to whatever it wants.

+1

> I personally would welcome such an interface as it makes user space
> programming simpler. Just (re)arming a periodic nanosleep based on
> absolute expiry time is horrible stupid today:

Jup.

> Thoughts?

Current user space example: The linuxptp programs are doing ns64 to
timespec conversions to call into the kernel, which then does timespec
to ns64 to talk to the hardware.  I would bet that most (all?) use
cases are better served with 64 bit nanosecond system calls.

Thanks,
Richard
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight April 22, 2015, 10:44 a.m. UTC | #8
From: Thomas Gleixner
> Sent: 22 April 2015 09:45
> On Tue, 21 Apr 2015, Thomas Gleixner wrote:
> > On Tue, 21 Apr 2015, Arnd Bergmann wrote:
> > > I know there are concerns about this, in particular because C11 and
> > > POSIX both require tv_nsec to be 'long', unlike timeval->tv_usec,
> > > which is a 'suseconds_t' and can be defined as 'long long'.
> > >
> > > a)
> > >
> > > struct timespec {
> > > 	time_t tv_sec;
> > > 	long long tv_nsec; /* or typedef long long snseconds_t */
> > > };
> > >
> > > This is not directly compatible with C11 or POSIX.1-2008, but it
> > > matches what we do inside of 64-bit kernels, so probably has the
> > > highest chance of working correctly in practice
> >
> > After reading Linus rant in the x32 thread again (thanks for the
> > reminder), and looking at b/c/d - which rate between ugly and butt
> > ugly - I think we should go for a) and screw POSIX and C11 as those
> > committee dinosaurs seem to completely ignore the 2038 problem on
> > 32bit machines. At least I have not found any hint that these folks
> > care at all. So why should we comply to something which is completely
> > useless?
> >
> > That also makes the question about the upper 32bits check moot, so
> > it's the simplest and clearest of the possible solutions.
> 
> Second thoughts after some sleep.
> 
> So the outcome of this is going to be that user space libraries will
> not expose the syscall variant of
> 
>     syscall_timespec64 {
>            s64 tv_sec;
> 	   s64 tv_nsec;
>     };
> 
> to applications. The libs will translate them to spec conforming
> 
>    timespec {
>            time_t tv_sec;
> 	   long   tv_nsec;
>    };
> 
> anyway. That means we have two translation steps on 32bit systems:
> 
>   1) user space timespec -> syscall timespec64
> 
>   2) syscall timespec64 -> scalar nsec s64 (ktime_t)
> 
> and the other way round. The kernel internal representation is simply
> s64 (nsec) based all over the place.

Do you need the double-translation?
If all the kernel uses a 64bit nsec value the in-kernel syscall stub
can convert the user-supplied values appropriately before calling
the standard function.
Not that a syscall that takes a linear nsec value isn't useful.

FWIW I can't remember what NetBSD did when they extended time_t to 64bits.

	David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann April 22, 2015, 11:07 a.m. UTC | #9
On Wednesday 22 April 2015 10:45:23 Thomas Gleixner wrote:
> On Tue, 21 Apr 2015, Thomas Gleixner wrote:

> So we could save one translation step if we implement new syscalls
> which have a scalar nsec interface instead of the timespec/timeval
> cruft and let user space do the translation to whatever it wants.
> 
> So
> 
> sys_clock_nanosleep(const clockid_t which_clock, int flags,
> 	            const struct timespec __user *expires,
> 		    struct timespec __user *reminder)
> 
> would get the new syscall variant:
> 
> sys_clock_nanosleep_ns(const clockid_t which_clock, int flags,
> 		       const s64 expires, s64 __user *reminder)

As you might expect, there are a number of complications with this
approach:

- John Stultz likes to point out that it's easier to do one change
  at a time, so extending the interface to 64-bit has less potential
  of breaking things than a more fundamental change. I think it's
  useful to drop a lot of the syscalls when a more modern version
  is around (e.g. let libc implement usleep and nanosleep through
  clock_nanosleep), but keep the syscalls as close to the known-working
  64-bit versions as we can.
- The inode timestamp related syscalls (stat, utimes and variants
  thereof) require the full range of time64_t and cannot use ktime_t.
- converting between timespec types of different size is cheap,
  converting timespec to ktime_t is still relatively cheap, but
  converting ktime_t to timespec is rather expensive (at least eight
  32-bit multiplies, plus a few shifts and additions if you don't
  have 64-bit arithmetic).
- ioctls that pass a timespec need to keep doing that or would require
  a source-level change in user space instead of recompiling.

> I personally would welcome such an interface as it makes user space
> programming simpler. Just (re)arming a periodic nanosleep based on
> absolute expiry time is horrible stupid today:
> 
> 	 struct timespec expires;
> 	 ....
> 	 while ()
> 	       expires.tv_nsec += period.tv_nsec;
> 	       expires.tv_sec += period.tv_sec;
> 	       normalize_timespec(&expires);
> 	       sys_clock_nanosleep(CLOCK_ID, ABS, &expires, NULL);
> 
> So with a scalar interface this would reduce to:
> 
> 	 s64 expires;
> 	 ....
> 	 while ()
> 	       expires += period;
> 	       sys_clock_nanosleep_ns(CLOCK_ID, ABS, &expires, NULL);
> 
> There is a difference both in text and storage size plus the avoidance
> of the two translation steps (one translation step on 64bit).

We should probably look at it separately for each syscall. It's
quite possible that we find a number of them for which it helps
and others for which it hurts, so we need to see the big pictures.

There are also a few other calls that will never need 64-bit
time_t because the range is limited by the need to only ever
pass relative timeouts (select, poll, io_getevents, recvmmsg,
clock_getres, rt_sigtimedwait, sched_rr_get_interval, getrusage,
waitid, semtimedop, sysinfo), so we could actually leave them
using a 32-bit structure and have the libc do the conversion.

> I know that this is non portable, but OTOH if I look at the non
> portable mechanisms which are used by data bases, java VMs and other
> apps which exist to squeeze the last cycles out of the system, there
> is certainly some value to that.
> 
> The portable/spec conforming apps can still use the user space
> assisted translated timespec/timeval mechanisms.
> 
> There is one caveat though: sys_clock_gettime and sys_gettimeofday
> will still need a syscall_timespec64 variant. We have no double
> translation steps there because we maintain the timespec
> representation in the timekeeping code for performance reasons to
> avoid the division in the syscall interface. But everything else can
> do nicely without the timespec cruft.
> 
> We really should talk to libc folks and high performance users about
> this before blindly adding a gazillion of new timespec64 based
> interfaces.

I've started a list of affected syscalls at
https://docs.google.com/spreadsheets/d/1HCYwHXxs48TsTb6IGUduNjQnmfRvMPzCN6T_0YiQwis/edit?usp=sharing

Still adding more calls and description, let me know if you want edit
permissions.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner April 22, 2015, 1:37 p.m. UTC | #10
On Wed, 22 Apr 2015, Arnd Bergmann wrote:
> On Wednesday 22 April 2015 10:45:23 Thomas Gleixner wrote:
> > On Tue, 21 Apr 2015, Thomas Gleixner wrote:
> 
> > So we could save one translation step if we implement new syscalls
> > which have a scalar nsec interface instead of the timespec/timeval
> > cruft and let user space do the translation to whatever it wants.
> > 
> > So
> > 
> > sys_clock_nanosleep(const clockid_t which_clock, int flags,
> > 	            const struct timespec __user *expires,
> > 		    struct timespec __user *reminder)
> > 
> > would get the new syscall variant:
> > 
> > sys_clock_nanosleep_ns(const clockid_t which_clock, int flags,
> > 		       const s64 expires, s64 __user *reminder)
> 
> As you might expect, there are a number of complications with this
> approach:
> 
> - John Stultz likes to point out that it's easier to do one change
>   at a time, so extending the interface to 64-bit has less potential
>   of breaking things than a more fundamental change. I think it's
>   useful to drop a lot of the syscalls when a more modern version
>   is around (e.g. let libc implement usleep and nanosleep through
>   clock_nanosleep), but keep the syscalls as close to the known-working
>   64-bit versions as we can.

Well. I don't see a massive risk when implementing
e.g. usleep/nanosleep & al with clock_nanosleep_ns().

> - The inode timestamp related syscalls (stat, utimes and variants
>   thereof) require the full range of time64_t and cannot use ktime_t.

I'm aware that there are a lot of interfaces which cannot use
ktime_t. That's fine.

> - converting between timespec types of different size is cheap,
>   converting timespec to ktime_t is still relatively cheap, but
>   converting ktime_t to timespec is rather expensive (at least eight
>   32-bit multiplies, plus a few shifts and additions if you don't
>   have 64-bit arithmetic).

Right. That's what I said vs. gettime().

> - ioctls that pass a timespec need to keep doing that or would require
>   a source-level change in user space instead of recompiling.

No argument here.

> We should probably look at it separately for each syscall. It's
> quite possible that we find a number of them for which it helps
> and others for which it hurts, so we need to see the big pictures.

Agreed.

> There are also a few other calls that will never need 64-bit
> time_t because the range is limited by the need to only ever
> pass relative timeouts (select, poll, io_getevents, recvmmsg,
> clock_getres, rt_sigtimedwait, sched_rr_get_interval, getrusage,
> waitid, semtimedop, sysinfo), so we could actually leave them
> using a 32-bit structure and have the libc do the conversion.

Indeed.
 
> I've started a list of affected syscalls at
> https://docs.google.com/spreadsheets/d/1HCYwHXxs48TsTb6IGUduNjQnmfRvMPzCN6T_0YiQwis/edit?usp=sharing
> 
> Still adding more calls and description, let me know if you want edit
> permissions.

Only if you have a strong backup system for this file. My GUI foo is
rather limited :)

Thanks,

	tglx


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann April 22, 2015, 1:50 p.m. UTC | #11
On Wednesday 22 April 2015 13:07:44 Arnd Bergmann wrote:
> 
> I've started a list of affected syscalls at
> https://docs.google.com/spreadsheets/d/1HCYwHXxs48TsTb6IGUduNjQnmfRvMPzCN6T_0YiQwis/edit?usp=sharing
> 
> Still adding more calls and description, let me know if you want edit
> permissions.
> 

Got a first draft now, I'm relatively sure that the list is complete,
but it's not the end of the world if I missed a syscall now.

Here are my findings, and I guess we should discuss these with the libc
folks too. I'll group the syscalls according to subsystems:

=== clocks and timers ===

clock_gettime, clock_settime, clock_adjtime, clock_getres, clock_nanosleep,
timer_gettime, timer_settime, timerfd_gettime, timerfd_settime:

 these should be done consistently, either using timespec64 or 64-bit
 nanoseconds, either one works. 64-bit nanoseconds would simplify the
 kernel internally quite a bit by avoiding the double timekeeping (we
 keep track of both nanoseconds and timespec in the timekeeper struct).
 the downside of nanoseconds-only is that each existing caller would
 need a conversion in user space, where currently we can avoid the
 expensive ktime_to_ts() for some cases.

time, stime, gettimeofday, settimeofday, adjtimex, nanosleep,
getitimer, setitimer:
 all deprecated => wontfix

=== i/o ===

pselect6, ppoll, io_getevents, recvmmsg:
 These currently pass a timespec into the kernel with *relative*
 timeouts. Internally, they convert it to ktime_t and back on the
 way out. We have three options:
 - leave as is, get the libc to convert 64-bit timespec to 32-bit
   timespec on the way into the kernel and back on the way out,
   which works because the relative timeout will not overflow
 - use ktime_t to make these more efficient in the kernel, at the
   expense of requiring user space to convert it (all except 
   io_getevents pass back the remaining time).
 - leave the current behavior, but use 64-bit timespec.

select, old_selct, pselect6: deprecated

=== ipc ===

mq_timedsend, mqtimedreceive: These get an *absolute* timeout,
so we have to change them. Internally they use ktime_t, so that
would be the natural interface, but timespec64 would work as well.

semtimedop: This uses a relative timeout that is converted to
jiffies internally, so using ktime_t would not be as natural,
unless we rewrite the function to use hrtimers.

msgctl, semctl, shmctl: These have an output, which is a time_t
that stores the absolute seconds value of the last time something
happened. Internally this comes from get_seconds(), which has to
be efficient anyway. The best way forward is probably to use a
structure layout for these that is compatible with what 64-bit
architectures do. Note that the structures sometimes have padding
to deal with the extension of time_t to 64-bit, but not all
architectures have that, and some (notably big-endian arm) have
it in the wrong place, so my feeling is that we're better off not
using that padding and instead doing something that works for
everyone.

=== inodes and filesystems ===

utimesnsat, fstat64, fstatat64:

inode timestamps need to represent times before 1970 and way into
the future, so we need 64-bit time_t here, I see no other alternatives
here, so we have to pass struct timespec64 into utimensat, and
create version 4 of 'struct stat' to pass into the future fstat and
fstatat. I would use a version that matches the 64-bit layout
of 'struct stat'.

utime, utimes, futimensat, oldstat, oldlstat, oldfstat, newstat,
newlstat, newfstat, newfstatat, stat64 and lstat64: these are all
deprecated now, we have to stop getting this wrong!

=== tasks ===

getrusage, waitid: these pass a 'struct rusage' that contains a
'struct timeval' with elapsed time. Again there are multiple options:
- We could change rusage to contain a new 'struct relative_timeval'
  instead, with an unchanged layout, which makes the format incompatible
  with a standard libc that uses a 64-bit based timeval.
- We could make the layout the same as on 64-bit machines, as x32 does,
  which is again incompatible with posix but would work better
- We could make the layout what glibc expects, using 64-bit based
  timeval structures at the beginning.
- We could define a new structure usings pure nanosecond counters.

rt_sigtimedwait: This passes a relative timespec value in back out,
so we could keep the current layout and have glibc convert it, or
change it to something else. The kernel internally converts to jiffies
to call schedule_timeout.

futex: this passes a relative *or* absolute timespec in, so we have to
change it. The kernel uses ktime_t internally here, so we could make
the interface nanosecond based or stick with timespec64.

sched_rr_get_interval: This returns a timespec with the schedule interval
to user space, using a 32-bit based format is fine here, or we could
convert to timespec64. The kernel uses jiffies internally.

wait4: replaced by waitid

=== system wide ===

sysinfo: struct sysinfo contains '__kernel_long_t uptime', we can keep
that, it's fine.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Richard Cochran April 22, 2015, 2:54 p.m. UTC | #12
On Wed, Apr 22, 2015 at 03:50:45PM +0200, Arnd Bergmann wrote:
> time, stime, gettimeofday, settimeofday, adjtimex, nanosleep,
> getitimer, setitimer:
>  all deprecated => wontfix

If adjtimex is deprecated, what will replace it?  It is really
important for ntp.

Thanks,
Richard

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luc Van Oostenryck April 22, 2015, 3:14 p.m. UTC | #13
On Wed, Apr 22, 2015 at 03:50:45PM +0200, Arnd Bergmann wrote:
> On Wednesday 22 April 2015 13:07:44 Arnd Bergmann wrote:
...
> select, old_selct, pselect6: deprecated

Small, copy&paste error here for pselect6.


Luc
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann April 22, 2015, 3:37 p.m. UTC | #14
On Wednesday 22 April 2015 16:54:32 Richard Cochran wrote:
> On Wed, Apr 22, 2015 at 03:50:45PM +0200, Arnd Bergmann wrote:
> > time, stime, gettimeofday, settimeofday, adjtimex, nanosleep,
> > getitimer, setitimer:
> >  all deprecated => wontfix
> 
> If adjtimex is deprecated, what will replace it?  It is really
> important for ntp.

clock_adjtime(). Basically all all the basic time related syscalls
are superceded by the clock_* calls that can take a clockid argument.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann April 22, 2015, 3:38 p.m. UTC | #15
On Wednesday 22 April 2015 17:14:47 Luc Van Oostenryck wrote:
> On Wed, Apr 22, 2015 at 03:50:45PM +0200, Arnd Bergmann wrote:
> > On Wednesday 22 April 2015 13:07:44 Arnd Bergmann wrote:
> ...
> > select, old_selct, pselect6: deprecated
> 
> Small, copy&paste error here for pselect6.
> 

Right, that should have been poll.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index ef8187f9d28d..71e74a47a2cf 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -267,7 +267,7 @@ 
 258	i386	set_tid_address		sys_set_tid_address
 259	i386	timer_create		sys_timer_create		compat_sys_timer_create
 260	i386	timer_settime		sys_timer_settime		compat_sys_timer_settime
-261	i386	timer_gettime		sys_timer_gettime		compat_sys_timer_gettime
+261	i386	timer_gettime		compat_sys_timer_gettime
 262	i386	timer_getoverrun	sys_timer_getoverrun
 263	i386	timer_delete		sys_timer_delete
 264	i386	clock_settime		sys_clock_settime		compat_sys_clock_settime
@@ -365,3 +365,4 @@ 
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	timer_gettime64		sys_timer_gettime
diff --git a/kernel/compat.c b/kernel/compat.c
index 24f00610c575..15d4f5abb492 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -723,18 +723,14 @@  COMPAT_SYSCALL_DEFINE4(timer_settime, timer_t, timer_id, int, flags,
 COMPAT_SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
 		       struct compat_itimerspec __user *, setting)
 {
-	long err;
-	mm_segment_t oldfs;
-	struct itimerspec ts;
+	struct itimerspec64 ts;
+	long ret;
 
-	oldfs = get_fs();
-	set_fs(KERNEL_DS);
-	err = sys_timer_gettime(timer_id,
-				(struct itimerspec __user *) &ts);
-	set_fs(oldfs);
-	if (!err && put_compat_itimerspec(setting, &ts))
-		return -EFAULT;
-	return err;
+	ret = __timer_gettime(timer_id, &ts);
+	if (ret)
+		return ret;
+
+	return put_compat_itimerspec(setting, &ts);
 }
 
 COMPAT_SYSCALL_DEFINE2(clock_settime, clockid_t, which_clock,
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 31ea01f42e1f..c601f1969f5a 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -766,32 +766,27 @@  common_timer_get(struct k_itimer *timr, struct itimerspec *cur_setting)
 		cur_setting->it_value = ktime_to_timespec(remaining);
 }
 
+static int put_compat_itimerspec(struct __kernel_itimerspec64 __user *dst,
+				 const struct itimerspec64 *src)
+{
+	if (__put_timespec64(&src->it_interval, &dst->it_interval) ||
+	    __put_timespec64(&src->it_value, &dst->it_value))
+		return -EFAULT;
+        return 0;
+}
+
 /* Get the time remaining on a POSIX.1b interval timer. */
 SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
-		struct itimerspec __user *, setting)
+		struct __kernel_itimerspec64 __user *, setting)
 {
-	struct itimerspec cur_setting;
-	struct k_itimer *timr;
-	struct k_clock *kc;
-	unsigned long flags;
-	int ret = 0;
-
-	timr = lock_timer(timer_id, &flags);
-	if (!timr)
-		return -EINVAL;
+	struct itimerspec64 cur_setting;
+	int ret;
 
-	kc = clockid_to_kclock(timr->it_clock);
-	if (WARN_ON_ONCE(!kc || !kc->timer_get))
-		ret = -EINVAL;
-	else
-		kc->timer_get(timr, &cur_setting);
+	ret = __timer_gettime(timer_id, &cur_setting);
+	if (ret)
+		return ret;
 
-	unlock_timer(timr, flags);
-
-	if (!ret && copy_to_user(setting, &cur_setting, sizeof (cur_setting)))
-		return -EFAULT;
-
-	return ret;
+	return put_itimerspec(setting, &cur_setting);
 }
 
 /*