Message ID | a87876829697e1b3c63601b1401a07af79eddae6.1572651216.git.jstancek@redhat.com |
---|---|
State | Not Applicable |
Headers | show |
Series | kernel: use ktime_get_real_ts64() to calculate acct.ac_btime | expand |
Jan, The subsystem prefix for acct is surely not 'kernel.'. Try: git log --oneline kernel/acct.c On Sat, 2 Nov 2019, Jan Stancek wrote: > diff --git a/kernel/acct.c b/kernel/acct.c > index 81f9831a7859..991c898160cd 100644 > --- a/kernel/acct.c > +++ b/kernel/acct.c > @@ -417,6 +417,7 @@ static void fill_ac(acct_t *ac) > struct pacct_struct *pacct = ¤t->signal->pacct; > u64 elapsed, run_time; > struct tty_struct *tty; > + struct timespec64 ts; > > /* > * Fill the accounting struct with the needed info as recorded > @@ -448,7 +449,8 @@ static void fill_ac(acct_t *ac) > } > #endif > do_div(elapsed, AHZ); > - ac->ac_btime = get_seconds() - elapsed; > + ktime_get_real_ts64(&ts); > + ac->ac_btime = ts.tv_sec - elapsed; So the calculation goes like this: runtime = ktime_get_ns() - current->...->start_time; elapsed = ns_to_ahz(runtime) elapsed /= AHZ -> runtime in seconds And then you retrieve the current wall time and just use the seconds portion for building the delta. That still can fail in corner cases when the runtime to seconds conversion does not have truncation in the conversions and the timespec nanoseconds part is close to 1e9. There is another issue related to suspend. If the system suspends then runtime is accurate vs. clock MONOTONIC, but the offset between clock MONOTONIC and clock REALTIME is not longer the same due to the suspend/resume which has the same issue as what you are trying to solve because runtime = totaltime - time_in_suspend so your ac_btime will be off by time_in_suspend. Something like this should work: runtime = ktime_get_ns() - current->...->start_time; .... runtime_boot = ktime_get_boot_ns() - current->...->real_start_time; start_ns = ktime_get_real_ns() - runtime_boot; start_s = startns / NSEC_PER_SEC; current->...->real_start_time is clearly a misnomer as it suggests CLOCK_REALTIME at first sight ... But it would be way simpler just to store the CLOCK_REALTIME start time along with BOOT and MONOTONIC and just get rid of all these horrible calculations which are bound to be wrong. Peter? Thanks, tglx
On Sat, Nov 02, 2019 at 12:39:24AM +0100, Jan Stancek wrote: > fill_ac() calculates process creation time from current time as: > ac->ac_btime = get_seconds() - elapsed > > get_seconds() doesn't accumulate nanoseconds as regular time getters. > This creates race for user-space (e.g. LTP acct02), which typically > uses gettimeofday(), because process creation time sometimes appear > to be dated 'in past': > > acct("myfile"); > time_t start_time = time(NULL); > if (fork()==0) { > sleep(1); > exit(0); > } > waitpid(NULL); > acct(NULL); > > // acct.ac_btime == 1572616777 > // start_time == 1572616778 > Lets start by saying this accounting stuff is terrible crap and it deserves to fail and burn. The only spec I found for what it actually wants in those fields is Documentation/accounting/taskstats-struct.rst, that states: /* The time when a task begins, in [secs] since 1970. */ __u32 ac_btime; /* Begin time [sec since 1970] */ /* The elapsed time of a task, in [usec]. */ __u64 ac_etime; /* Elapsed time [usec] */ But that is not really well defined. As implemented etime does not include suspend time (maybe on purpose, maybe not). And what does btime want? As implemented it jumps around if you ask the question twice with an adjtime() call or suspend in between. Of course, if we take an actual CLOCK_REALTIME timestamp at fork() the value doesn't change, but then it can be in the future (DST,adjtime()), which is exactly the reason why CLOCK_REALTIME is absolute shit for timestamps (logging, accounting, etc.). And your 'fix' is pretty terible too. Arguably ktime_get_seconds() wants fixing for not having the ns accumulation and actually differing from tv_sec, but now you accrue one source of ns while still disregarding another (also, I friggin hate timespec, it's a terrible interface for time). All in all, I'm tempted to just declare this stuff broken and -EWONTFIX, but if we have to do something, something like the below is at least internally consistent. --- diff --git a/kernel/tsacct.c b/kernel/tsacct.c index 7be3e7530841..76d6325c2724 100644 --- a/kernel/tsacct.c +++ b/kernel/tsacct.c @@ -23,18 +23,31 @@ void bacct_add_tsk(struct user_namespace *user_ns, { const struct cred *tcred; u64 utime, stime, utimescaled, stimescaled; - u64 delta; + u64 mono, real, btime; BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN); + mono = ktime_get_ns(); + real = ktime_get_real_ns(); + /* calculate task elapsed time in nsec */ - delta = ktime_get_ns() - tsk->start_time; + delta = mono - tsk->start_time; /* Convert to micro seconds */ do_div(delta, NSEC_PER_USEC); stats->ac_etime = delta; - /* Convert to seconds for btime */ - do_div(delta, USEC_PER_SEC); - stats->ac_btime = get_seconds() - delta; + + /* + * Compute btime by subtracting the elapsed time from the current + * CLOCK_REALTIME. + * + * XXX totally buggered, because it changes results across + * adjtime() calls and suspend/resume. + */ + delta = mono - tsk->start_time; // elapsed in ns + btime = real - delta; // real ns - elapsed ns + do_div(btime, NSEC_PER_SEC); // truncated to seconds + stats->ac_btime = btime; + if (thread_group_leader(tsk)) { stats->ac_exitcode = tsk->exit_code; if (tsk->flags & PF_FORKNOEXEC)
On Thu, 7 Nov 2019, Peter Zijlstra wrote: > Lets start by saying this accounting stuff is terrible crap and it > deserves to fail and burn. No argument about that. > And what does btime want? As implemented it jumps around if you ask the > question twice with an adjtime() call or suspend in between. Of course, > if we take an actual CLOCK_REALTIME timestamp at fork() the value > doesn't change, but then it can be in the future (DST,adjtime()), which > is exactly the reason why CLOCK_REALTIME is absolute shit for timestamps > (logging, accounting, etc.). > > And your 'fix' is pretty terible too. Arguably ktime_get_seconds() wants > fixing for not having the ns accumulation and actually differing from > tv_sec, but now you accrue one source of ns while still disregarding > another (also, I friggin hate timespec, it's a terrible interface for > time). > > All in all, I'm tempted to just declare this stuff broken and -EWONTFIX, > but if we have to do something, something like the below is at least > internally consistent. Kinda :) > + mono = ktime_get_ns(); > + real = ktime_get_real_ns(); > + /* > + * Compute btime by subtracting the elapsed time from the current > + * CLOCK_REALTIME. > + * > + * XXX totally buggered, because it changes results across > + * adjtime() calls and suspend/resume. > + */ > + delta = mono - tsk->start_time; // elapsed in ns > + btime = real - delta; // real ns - elapsed ns > + do_div(btime, NSEC_PER_SEC); // truncated to seconds > + stats->ac_btime = btime; That has pretty much the same problem as just storing the CLOCK_REALTIME start time at fork and additionally it is wreckaged vs. suspend resume. So a CLOCK_REALTIME time stamp at fork would at least be correct vs. suspend resume. The same result is achieved by: boot = ktime_get_boot_ns(); delta = boot = tsk->real_start_time; Typing real_start_time makes me really cringe. Thanks, tglx
On Thu, Nov 07, 2019 at 01:40:47PM +0100, Thomas Gleixner wrote: > Typing real_start_time makes me really cringe. I have a patch fixing that... --- Subject: kernel: Rename tsk->real_start_time From: Peter Zijlstra <peterz@infradead.org> Date: Thu Nov 7 11:07:58 CET 2019 Since it stores CLOCK_BOOTTIME, not, as the name suggests, CLOCK_REALTIME, let's rename real_start_time. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- --- a/fs/exec.c +++ b/fs/exec.c @@ -1132,7 +1132,7 @@ static int de_thread(struct task_struct * also take its birthdate (always earlier than our own). */ tsk->start_time = leader->start_time; - tsk->real_start_time = leader->real_start_time; + tsk->start_boottime = leader->start_boottime; BUG_ON(!same_thread_group(leader, tsk)); BUG_ON(has_group_leader_pid(tsk)); --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -533,7 +533,7 @@ static int do_task_stat(struct seq_file nice = task_nice(task); /* convert nsec -> ticks */ - start_time = nsec_to_clock_t(task->real_start_time); + start_time = nsec_to_clock_t(task->start_boottime); seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns)); seq_puts(m, " ("); --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -879,7 +879,7 @@ struct task_struct { u64 start_time; /* Boot based time in nsecs: */ - u64 real_start_time; + u64 start_boottime; /* MM fault and swap info: this can arguably be seen as either mm-specific or thread-specific: */ unsigned long min_flt; --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2130,7 +2130,7 @@ static __latent_entropy struct task_stru */ p->start_time = ktime_get_ns(); - p->real_start_time = ktime_get_boottime_ns(); + p->start_boottime = ktime_get_boottime_ns(); /* * Make it visible to the rest of the system, but dont wake it up yet.
On Thu, Nov 07, 2019 at 09:56:12AM +0100, Thomas Gleixner wrote: > But it would be way simpler just to store the CLOCK_REALTIME start time > along with BOOT and MONOTONIC and just get rid of all these horrible > calculations which are bound to be wrong. > > Peter? So I'd really hate to do that, as that gives the impression REALTIME is a sane clock to use. As I argued in that other email, REALTIME is horrible crap.
On Thu, Nov 07, 2019 at 01:40:47PM +0100, Thomas Gleixner wrote: > On Thu, 7 Nov 2019, Peter Zijlstra wrote: > > + mono = ktime_get_ns(); > > + real = ktime_get_real_ns(); > > + /* > > + * Compute btime by subtracting the elapsed time from the current > > + * CLOCK_REALTIME. > > + * > > + * XXX totally buggered, because it changes results across > > + * adjtime() calls and suspend/resume. > > + */ > > + delta = mono - tsk->start_time; // elapsed in ns > > + btime = real - delta; // real ns - elapsed ns > > + do_div(btime, NSEC_PER_SEC); // truncated to seconds > > + stats->ac_btime = btime; > > That has pretty much the same problem as just storing the CLOCK_REALTIME > start time at fork and additionally it is wreckaged vs. suspend resume. It's wrecked in general. It also jumps around for any REALTIME adjustment. > So a CLOCK_REALTIME time stamp at fork would at least be correct > vs. suspend resume. But still wrecked vs REALTIME jumps, as in, when DST flips the clock back an hour, your timestamp is in the future. Any which way around the whole thing is buggered. The only real fix is not using REALTIME anything. Which is why I'm loath to add that REALTIME timestamp at fork(), it just encourages more use.
On Thu, 7 Nov 2019, Peter Zijlstra wrote: > On Thu, Nov 07, 2019 at 01:40:47PM +0100, Thomas Gleixner wrote: > > On Thu, 7 Nov 2019, Peter Zijlstra wrote: > > > > + mono = ktime_get_ns(); > > > + real = ktime_get_real_ns(); > > > + /* > > > + * Compute btime by subtracting the elapsed time from the current > > > + * CLOCK_REALTIME. > > > + * > > > + * XXX totally buggered, because it changes results across > > > + * adjtime() calls and suspend/resume. > > > + */ > > > + delta = mono - tsk->start_time; // elapsed in ns > > > + btime = real - delta; // real ns - elapsed ns > > > + do_div(btime, NSEC_PER_SEC); // truncated to seconds > > > + stats->ac_btime = btime; > > > > That has pretty much the same problem as just storing the CLOCK_REALTIME > > start time at fork and additionally it is wreckaged vs. suspend resume. > > It's wrecked in general. It also jumps around for any REALTIME > adjustment. > > > So a CLOCK_REALTIME time stamp at fork would at least be correct > > vs. suspend resume. > > But still wrecked vs REALTIME jumps, as in, when DST flips the clock > back an hour, your timestamp is in the future. > > Any which way around the whole thing is buggered. The only real fix is > not using REALTIME anything. Which is why I'm loath to add that REALTIME > timestamp at fork(), it just encourages more use. Fair enough. You have a sane alternative though: CLOCK_TAI Thanks, tglx
----- Original Message ----- > It's wrecked in general. It also jumps around for any REALTIME > adjustment. > > > So a CLOCK_REALTIME time stamp at fork would at least be correct > > vs. suspend resume. > > But still wrecked vs REALTIME jumps, as in, when DST flips the clock > back an hour, your timestamp is in the future. > > Any which way around the whole thing is buggered. The only real fix is > not using REALTIME anything. Which is why I'm loath to add that REALTIME > timestamp at fork(), it just encourages more use. Thank you for feedback and listing all other problems. I'll adjust test expectations. Regards, Jan
diff --git a/kernel/acct.c b/kernel/acct.c index 81f9831a7859..991c898160cd 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -417,6 +417,7 @@ static void fill_ac(acct_t *ac) struct pacct_struct *pacct = ¤t->signal->pacct; u64 elapsed, run_time; struct tty_struct *tty; + struct timespec64 ts; /* * Fill the accounting struct with the needed info as recorded @@ -448,7 +449,8 @@ static void fill_ac(acct_t *ac) } #endif do_div(elapsed, AHZ); - ac->ac_btime = get_seconds() - elapsed; + ktime_get_real_ts64(&ts); + ac->ac_btime = ts.tv_sec - elapsed; #if ACCT_VERSION==2 ac->ac_ahz = AHZ; #endif diff --git a/kernel/tsacct.c b/kernel/tsacct.c index 7be3e7530841..4d10854255ab 100644 --- a/kernel/tsacct.c +++ b/kernel/tsacct.c @@ -24,6 +24,7 @@ void bacct_add_tsk(struct user_namespace *user_ns, const struct cred *tcred; u64 utime, stime, utimescaled, stimescaled; u64 delta; + struct timespec64 ts; BUILD_BUG_ON(TS_COMM_LEN < TASK_COMM_LEN); @@ -34,7 +35,8 @@ void bacct_add_tsk(struct user_namespace *user_ns, stats->ac_etime = delta; /* Convert to seconds for btime */ do_div(delta, USEC_PER_SEC); - stats->ac_btime = get_seconds() - delta; + ktime_get_real_ts64(&ts); + stats->ac_btime = ts.tv_sec - delta; if (thread_group_leader(tsk)) { stats->ac_exitcode = tsk->exit_code; if (tsk->flags & PF_FORKNOEXEC)
fill_ac() calculates process creation time from current time as: ac->ac_btime = get_seconds() - elapsed get_seconds() doesn't accumulate nanoseconds as regular time getters. This creates race for user-space (e.g. LTP acct02), which typically uses gettimeofday(), because process creation time sometimes appear to be dated 'in past': acct("myfile"); time_t start_time = time(NULL); if (fork()==0) { sleep(1); exit(0); } waitpid(NULL); acct(NULL); // acct.ac_btime == 1572616777 // start_time == 1572616778 Testing: 10 hours of LTP acct02 on s390x with CONFIG_HZ=100, test failed on unpatched kernel in 15 minutes Signed-off-by: Jan Stancek <jstancek@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Fontana <rfontana@redhat.com> --- kernel/acct.c | 4 +++- kernel/tsacct.c | 4 +++- 2 files changed, 6 insertions(+), 2 deletions(-)