__remove_hrtimer() attempts to reprogram the clockevent device when
the timer being removed is the next to expire. However,
__remove_hrtimer() reprograms the clockevent *before* removing the
timer from the timerqueue and thus when hrtimer_force_reprogram()
finds the next timer to expire it finds the timer we're trying to
remove.
This is especially noticeable when the system switches to NOHz mode
and the system tick is removed. The timer tick is removed from the
system but the clockevent is programmed to wakeup in another HZ
anyway.
Silence the extra wakeup by removing the timer from the timerqueue
before calling hrtimer_force_reprogram() so that we actually program
the clockevent for the next timer to expire.
This was broken by 998adc3 "hrtimers: Convert hrtimers to use
timerlist infrastructure".
Signed-off-by: Jeff Ohlstein <johlstei@codeaurora.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1321660030-8520-1-git-send-email-johlstei@codeaurora.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else. Revector them
onto the isolated export header for faster compile times.
Nothing to see here but a whole lot of instances of:
-#include <linux/module.h>
+#include <linux/export.h>
This commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
commit 9ec2690758 ("timerfd: Manage cancelable timers in timerfd")
introduced a CONFIG_HIGHRES_TIMERS (should be CONFIG_HIGH_RES_TIMERS)
typo, which caused applications depending on CLOCK_REALTIME timers to
become sluggy due to the fact that the time base of the realtime
timers was not updated when the wall clock time was set.
This causes anything from 100% CPU use for some applications to odd
delays and hickups.
Reported-bisected-and-tested-by: Anca Emanuel <anca.emanuel@gmail.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Fatfingered-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ordering of the clock bases is historical due to the
CLOCK_REALTIME and CLOCK_MONOTONIC constants. Now the hrtimer bases
have their own enumeration due to the gap between CLOCK_MONOTONIC and
CLOCK_BOOTTIME. So we can be more clever as most timers end up on the
CLOCK_MONOTONIC base due to the virtue of POSIX declaring that
relative CLOCK_REALTIME timers are not affected by time changes. In
desktop environments this is slowly changing as applications switch to
absolute timers, but I've observed empty CLOCK_REALTIME bases often
enough. There is no performance penalty or overhead when
CLOCK_REALTIME timers are active, but in case they are not we don't
skip over a full cache line.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Instead of iterating over all possible timer bases avoid it by marking
the active bases in the cpu base.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Peter is concerned about the extra scan of CLOCK_REALTIME_COS in the
timer interrupt. Yes, I did not think about it, because the solution
was so elegant. I didn't like the extra list in timerfd when it was
proposed some time ago, but with a rcu based list the list walk it's
less horrible than the original global lock, which was held over the
list iteration.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Some applications must be aware of clock realtime being set
backward. A simple example is a clock applet which arms a timer for
the next minute display. If clock realtime is set backward then the
applet displays a stale time for the amount of time which the clock
was set backwards. Due to that applications poll the time because we
don't have an interface.
Extend the timerfd interface by adding a flag which puts the timer
onto a different internal realtime clock. All timers on this clock are
expired whenever the clock was set.
The timerfd core records the monotonic offset when the timer is
created. When the timer is armed, then the current offset is compared
to the previous recorded offset. When it has changed, then
timerfd_settime returns -ECANCELED. When a timer is read the offset is
compared and if it changed -ECANCELED returned to user space. Periodic
timers are not rearmed in the cancelation case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Chris Friesen <chris.friesen@genband.com>
Tested-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Reviewed-by: Alexander Shishkin <virtuoso@slind.org>
Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104271359580.3323%40ionos%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make clock_was_set() unconditional and rename hres_timers_resume to
hrtimers_resume. This is a preparatory patch for hrtimers which are
cancelled when clock realtime was set.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Sedat and Bruno reported RCU stalls which turned out to be caused by
the following;
sched_init() calls init_rt_bandwidth() which calls hrtimer_init()
_BEFORE_ hrtimers_init() is called. While not entirely correct this
worked because hrtimer_init() only accessed statically initialized
data (hrtimer_bases.clock_base[CLOCK_MONOTONIC])
Commit e06383db9 (hrtimers: extend hrtimer base code to handle more
then 2 clockids) added an indirection to the hrtimer_bases.clock_base
lookup to avoid gap handling in the hot path. The table which is used
for the translataion from CLOCK_ID to HRTIMER_BASE index is
initialized at runtime in hrtimers_init(). So the early call of the
scheduler code translates CLOCK_MONOTONIC to HRTIMER_BASE_REALTIME.
Thus the rt_bandwith timer ends up on CLOCK_REALTIME. If the timer is
armed and the wall clock time is set (e.g. ntpdate in the early boot
process - which also gives the problem deterministic behaviour
i.e. magic recovery after N hours), then the timer ends up with an
expiry time far into the future. That breaks the RT throttler
mechanism as rt runtime is accumulated and never cleared, so the rt
throttler detects a false cpu hog condition and blocks all RT tasks
until the timer finally expires. That in turn stalls the RCU thread of
TINYRCU which leads to an huge amount of RCU callbacks piling up.
Make the translation table statically initialized, so we are back to
the status of <= 2.6.39.
Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reported-by: Bruno Prémont <bonbons@linux-vserver.org>
Cc: John stultz <johnstul@us.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104282353140.3005%40ionos%3E
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In complex subsystems like mac80211 structures can contain several
timers and work structs, so identifying a specific instance from the
call trace and object type output of debugobjects can be hard.
Allow the subsystems which support debugobjects to provide a hint
function. This function returns a pointer to a kernel address
(preferrably the objects callback function) which is printed along
with the debugobjects type.
Add hint methods for timer_list, work_struct and hrtimer.
[ tglx: Massaged changelog, made it compile ]
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <20110307085809.GA9334@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We calculate the current time of each clock base by adding an offset
to clock_monotonic. The offset for the clock bases is set in
retrigger_next_event() which is called when we switch a cpu to highres
mode or when the clock was set.
Add the missing update for clock boottime.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <johnstul@us.ibm.com>
CLOCK_MONOTONIC stops while the system is in suspend. This is because
to applications system suspend is invisible. However, there is a
growing set of applications that are wanting to be suspend-aware,
but do not want to deal with the complications of CLOCK_REALTIME
(which might jump around if settimeofday is called).
For these applications, I propose a new clockid: CLOCK_BOOTTIME.
CLOCK_BOOTTIME is idential to CLOCK_MONOTONIC, except it also
includes any time spent in suspend.
This patch add hrtimer base for CLOCK_BOOTTIME, using
get_monotonic_boottime/ktime_get_boottime, to allow
in kernel users to set timers against.
CC: Jamie Lokier <jamie@shareable.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Alexander Shishkin <virtuoso@slind.org>
CC: Arve Hjønnevåg <arve@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The hrtimer code is written mainly with CLOCK_REALTIME and CLOCK_MONOTONIC
in mind. These are clockids 0 and 1 resepctively. However, if we are
to introduce any new hrtimer bases, using new clockids, we have to skip
the cputimers (clockids 2,3) as well as other clockids that may not impelement
timers.
This patch adds a little bit of indirection between the clockid and
the base, so that we can extend the base by one when we add
a new clockid at number 7 or so.
CC: Jamie Lokier <jamie@shareable.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Alexander Shishkin <virtuoso@slind.org>
CC: Arve Hjønnevåg <arve@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The hrtimer code accesses timekeeping variables under
xtime_lock. Provide a sensible accessor function and use it.
[ tglx: Removed the conditionals, unused variable, fixed codingstyle
and massaged changelog ]
Signed-off-by: Torben Hohn <torbenh@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: johnstul@us.ibm.com
Cc: yong.zhang0@gmail.com
Cc: hch@infradead.org
LKML-Reference: <20110127145905.23248.30458.stgit@localhost>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
__get_cpu_var() can be replaced with this_cpu_read and will then use a
single read instruction with implied address calculation to access the
correct per cpu instance.
However, the address of a per cpu variable passed to __this_cpu_read()
cannot be determined (since it's an implied address conversion through
segment prefixes). Therefore apply this only to uses of __get_cpu_var
where the address of the variable is not used.
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hughd@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In converting the hrtimers to timerqueue, I missed
a spot in hrtimer_run_queues where we loop running
timers. We end up not pulling the new next value out
and instead just use the last next value, causing
boot time hangs in some cases.
The proper fix is to pull timerqueue_getnext each iteration
instead of using a local next value.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Converts the hrtimer code to use the new timerlist infrastructure
Signed-off-by: John Stultz <john.stultz@linaro.org>
LKML Reference: <1290136329-18291-3-git-send-email-john.stultz@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
CC: Alessandro Zummo <a.zummo@towertech.it>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Richard Cochran <richardcochran@gmail.com>
The race is described as follows:
CPU X CPU Y
remove_hrtimer
// state & QUEUED == 0
timer->state = CALLBACK
unlock timer base
timer->f(n) //very long
hrtimer_start
lock timer base
remove_hrtimer // no effect
hrtimer_enqueue
timer->state = CALLBACK |
QUEUED
unlock timer base
hrtimer_start
lock timer base
remove_hrtimer
mode = INACTIVE
// CALLBACK bit lost!
switch_hrtimer_base
CALLBACK bit not set:
timer->base
changes to a
different CPU.
lock this CPU's timer base
The bug was introduced with commit ca109491f (hrtimer: removing all ur
callback modes) in 2.6.29
[ tglx: Feed new state via local variable and add a comment. ]
Signed-off-by: Salman Qazi <sqazi@google.com>
Cc: akpm@linux-foundation.org
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20101012142351.8485.21823.stgit@dungbeetle.mtv.corp.google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
No real bugs I believe, just some dead code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: andi@firstfloor.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provides an accessor function to replace hrtimer.c's
direct access of wall_to_monotonic.
This will allow wall_to_monotonic to be made static as
planned in Documentation/feature-removal-schedule.txt
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-9-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In the new push model, all idle CPUs indeed go into nohz mode. There is
still the concept of idle load balancer (performing the load balancing
on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz
balancer when any of the nohz CPUs need idle load balancing.
The kickee CPU does the idle load balancing on behalf of all idle CPUs
instead of the normal idle balance.
This addresses the below two problems with the current nohz ilb logic:
* the idle load balancer continued to have periodic ticks during idle and
wokeup frequently, even though it did not have any rebalancing to do on
behalf of any of the idle CPUs.
* On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
periodic wakeup can result in a periodic additional interrupt on a CPU
doing the timer broadcast.
Also currently we are migrating the unpinned timers from an idle to the cpu
doing idle load balancing (when all the cpus in the system are idle,
there is no idle load balancing cpu and timers get added to the same idle cpu
where the request was made. So the existing optimization works only on semi idle
system).
And In semi idle system, we no longer have periodic ticks on the idle load
balancer CPU. Using that cpu will add more delays to the timers than intended
(as that cpu's timer base may not be uptodate wrt jiffies etc). This was
causing mysterious slowdowns during boot etc.
For now, in the semi idle case, use the nearest busy cpu for migrating timers
from an idle cpu. This is good for power-savings anyway.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
hrtimer_get_softirq_time() has it's own xtime lock protection, so it's
safe to use plain __current_kernel_time() and avoid the double seqlock
loop.
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
LKML-Reference: <20100525214912.GA1934@r2bh72.net.upc.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The current version of schedule_hrtimeout() always uses the
monotonic clock. Some system calls such as mq_timedsend()
and mq_timedreceive(), however, require the use of the wall
clock due to the definition of the system call.
This patch provides the infrastructure to use schedule_hrtimeout()
with a CLOCK_REALTIME timer.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Tested-by: Pradyumna Sampath <pradysam@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Veen <arjan@infradead.org>
LKML-Reference: <20100402204331.167439615@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
There is no reason to make timer_stats_hrtimer_set_start_info and
friends visible to the rest of the kernel. So move all of them to
hrtimer.c. Also make timer_stats_hrtimer_set_start_info a static
inline function so it gets inlined and we avoid another function call.
Based on a patch by Thomas Gleixner.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20091210095629.GC4144@osiris.boeblingen.de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The hrtimer_interrupt hang logic adjusts min_delta_ns based on the
execution time of the hrtimer callbacks.
This is error-prone for virtual machines, where a guest vcpu can be
scheduled out during the execution of the callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor), which in can lead to large min_delta_ns rendering the
system unusable.
Replace the current heuristics with something more reliable. Allow the
interrupt code to try 3 times to catch up with the lost time. If that
fails use the total time spent in the interrupt handler to defer the
next timer interrupt so the system can catch up with other things
which got delayed. Limit that deferment to 100ms.
The retry events and the maximum time spent in the interrupt handler
are recorded and exposed via /proc/timer_list
Inspired by a patch from Marcelo.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
In the dynamic tick code, "max_delta_ns" (member of the
"clock_event_device" structure) represents the maximum sleep time
that can occur between timer events in nanoseconds.
The variable, "max_delta_ns", is defined as an unsigned long
which is a 32-bit integer for 32-bit machines and a 64-bit
integer for 64-bit machines (if -m64 option is used for gcc).
The value of max_delta_ns is set by calling the function
"clockevent_delta2ns()" which returns a maximum value of LONG_MAX.
For a 32-bit machine LONG_MAX is equal to 0x7fffffff and in
nanoseconds this equates to ~2.15 seconds. Hence, the maximum
sleep time for a 32-bit machine is ~2.15 seconds, where as for
a 64-bit machine it will be many years.
This patch changes the type of max_delta_ns to be "u64" instead of
"unsigned long" so that this variable is a 64-bit type for both 32-bit
and 64-bit machines. It also changes the maximum value returned by
clockevent_delta2ns() to KTIME_MAX. Hence this allows a 32-bit
machine to sleep for longer than ~2.15 seconds. Please note that this
patch also changes "min_delta_ns" to be "u64" too and although this is
unnecessary, it makes the patch simpler as it avoids to fixup all
callers of clockevent_delta2ns().
[ tglx: changed "unsigned long long" to u64 as we use this data type
through out the time code ]
Signed-off-by: Jon Hunter <jon-hunter@ti.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1250617512-23567-3-git-send-email-jon-hunter@ti.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On big systems, printing <number of CPUs> copies of
Switched to high resolution mode on CPU nnn
clutters up the kernel log for minimal gain. Just get rid of them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
LKML-Reference: <ada1vlw126s.fsf_-_@cisco.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On NOHZ systems the following timers,
- tick_nohz_restart_sched_tick (tick_sched_timer)
- hrtimer_start (tick_sched_timer)
are reprogramming the clock events device far more often than needed.
No specific test case was required to observe this effect. This
occurres because there was no check to see if the currently removed or
restarted hrtimer was:
1) the one which previously armed the clock events device.
2) going to be replaced by another timer which has the same expiry time.
Avoid the reprogramming in hrtimer_force_reprogram when the new expiry
value which is evaluated from the clock bases is equal to
cpu_base->expires_next. This results in faster application startup
time by ~4%.
[ tglx: simplified initial solution ]
Signed-off-by: Ashwin Chaugule <ashwinc@quicinc.com>
LKML-Reference: <4AA00165.90609@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add tracepoints which cover the life cycle of a hrtimer. The
tracepoints are integrated with the already existing debug_object
debug points as far as possible.
[ tglx: Fixed comments, made output conistent, easier to read and
parse. Fixed output for 32bit archs which do not use the
scalar representation of ktime_t. Hand current time to
trace_hrtimer_expiry_entry instead of calling get_time()
inside of the trace assignment. ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Anton Blanchard <anton@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Zhaolei <zhaolei@cn.fujitsu.com>
LKML-Reference: <4A7F8B2B.5020908@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This changes how the pktgen thread spins/waits between
packets if delay is configured. It uses a high res timer to
wait for time to arrive.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's unused, remove it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
The timer migration expiry check should prevent the migration of a
timer to another CPU when the timer expires before the next event is
scheduled on the other CPU. Migrating the timer might delay it because
we can not reprogram the clock event device on the other CPU. But the
code implementing that check has two flaws:
- for !HIGHRES the check compares the expiry value with the clock
events device expiry value which is wrong for CLOCK_REALTIME based
timers.
- the check is racy. It holds the hrtimer base lock of the target CPU,
but the clock event device expiry value can be modified
nevertheless, e.g. by an timer interrupt firing.
The !HIGHRES case is easy to fix as we can enqueue the timer on the
cpu which was selected by the load balancer. It runs the idle
balancing code once per jiffy anyway. So the maximum delay for the
timer is the same as when we keep the tick on the current cpu going.
In the HIGHRES case we can get the next expiry value from the hrtimer
cpu_base of the target CPU and serialize the update with the cpu_base
lock. This moves the lock section in hrtimer_interrupt() so we can set
next_event to KTIME_MAX while we are handling the expired timers and
set it to the next expiry value after we handled the timers under the
base lock. While the expired timers are processed timer migration is
blocked because the expiry time of the timer is always <= KTIME_MAX.
Also remove the now useless clockevents_get_next_event() function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timer migration code needs to check whether the expiry time of the
timer is before the programmed clock event expiry time when the timer
is enqueued on another CPU because we can not reprogram the timer
device on the other CPU. The current logic checks the expiry time even
if we enqueue on the current CPU when nohz_get_load_balancer() returns
current CPU. This might lead to an endless loop in the expiry check
code when the expiry time of the timer is before the current
programmed next event.
Check whether nohz_get_load_balancer() returns current CPU and skip
the expiry check if this is the case.
The bug was triggered from the networking code. The patch fixes the
regression http://bugzilla.kernel.org/show_bug.cgi?id=13738
(Soft-Lockup/Race in networking in 2.6.31-rc1+195)
Cc: Arun Bharadwaj <arun@linux.vnet.ibm.com
Tested-by: Joao Correia <joaomiguelcorreia@gmail.com>
Tested-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The ktime_get() functions for GENERIC_TIME=n are still located in
hrtimer.c. Move them to time/timekeeping.c where they belong.
LKML-Reference: <new-submission>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The generic ktime_get function defined in kernel/hrtimer.c is suboptimial
for GENERIC_TIME=y:
0) | ktime_get() {
0) | ktime_get_ts() {
0) | getnstimeofday() {
0) | read_tod_clock() {
0) 0.601 us | }
0) 1.938 us | }
0) | set_normalized_timespec() {
0) 0.602 us | }
0) 4.375 us | }
0) 5.523 us | }
Overall there are two read_seqbegin/read_seqretry loops and a lot of
unnecessary struct timespec calculations. ktime_get returns a nano second
value which is the sum of xtime, wall_to_monotonic and the nano second
delta from the clock source.
ktime_get can be optimized for GENERIC_TIME=y. The new version only calls
clocksource_read:
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.610 us | }
0) 1.977 us | }
It uses a single read_seqbegin/readseqretry loop and just adds everthing
to a nano second value.
ktime_get_ts is optimized in a similar fashion.
[ tglx: added WARN_ON(timekeeping_suspended) as in getnstimeofday() ]
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090707112728.3005244d@skybase>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
commit 3f68535ada (clocksource: sanity check sysfs clocksource
changes) prevents selection of non high resolution capable
clocksources when high resolution mode is active, but did not take
into account that the same rules apply for highres=off nohz=on.
Check the tick device mode instead of hrtimer_hres_active() to verify
whether the system needs to be protected from a switch to jiffies or
other non highres capable clock sources.
Reported-by: Luming Yu <luming.yu@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas, Andrew and Ingo pointed out that we don't have any safety checks
in the clocksource sysfs entries to make sure sysadmins don't try to
change the clocksource to a non high-res timer capable clocksource (such
as jiffies) when high-res timers (HRT) is enabled. Doing so will likely
hang a system.
Correct this by filtering non HRT clocksources from available_clocksources
and not accepting non HRT clocksources with HRT enabled.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We want to use hrtimers in UBIFS (for write-buffer write-back timer).
We need the 'hrtimer_set_expires_range_ns()', which is an in-line
function which uses 'ktime_add_safe()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:
This patch migrates all non pinned timers and hrtimers to the current
idle load balancer, from all the idle CPUs. Timers firing on busy CPUs
are not migrated.
While migrating hrtimers, care should be taken to check if migrating
a hrtimer would result in a latency or not. So we compare the expiry of the
hrtimer with the next timer interrupt on the target cpu and migrate the
hrtimer only if it expires *after* the next interrupt on the target cpu.
So, added a clockevents_get_next_event() helper function to return the
next_event on the target cpu's clock_event_device.
[ tglx: cleanups and simplifications ]
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:
This patch creates a new framework for identifying cpu-pinned timers
and hrtimers.
This framework is needed because pinned timers are expected to fire on
the same CPU on which they are queued. So it is essential to identify
these and not migrate them, in case there are any.
For regular timers, the currently existing add_timer_on() can be used
queue pinned timers and subsequently mod_timer_pinned() can be used
to modify the 'expires' field.
For hrtimers, new modes HRTIMER_ABS_PINNED and HRTIMER_REL_PINNED are
added to queue cpu-pinned hrtimer.
[ tglx: use .._PINNED mode argument instead of creating tons of new
functions ]
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It appears I inadvertly introduced rq->lock recursion to the
hrtimer_start() path when I delegated running already expired
timers to softirq context.
This patch fixes it by introducing a __hrtimer_start_range_ns()
method that will not use raise_softirq_irqoff() but
__raise_softirq_irqoff() which avoids the wakeup.
It then also changes schedule() to check for pending softirqs and
do the wakeup then, I'm not quite sure I like this last bit, nor
am I convinced its really needed.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
LKML-Reference: <20090313112301.096138802@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent false positive WARN_ON() in clockevents_program_event()
clock_was_set() changes the base->offset of CLOCK_REALTIME and
enforces the reprogramming of the clockevent device to expire timers
which are based on CLOCK_REALTIME. If the clock change is large enough
then the subtraction of the timer expiry value and base->offset can
become negative which triggers the warning in
clockevents_program_event().
Check the subtraction result and set a negative value to 0.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: fix CPU hotplug hang on Power6 testbox
On architectures that support offlining all cpus (at least powerpc/pseries),
hot-unpluging the tick_do_timer_cpu can result in a system hang.
This comes from the fact that if the cpu going down happens to be the
cpu doing the tick, then as the tick_do_timer_cpu handover happens after the
cpu is dead (via the CPU_DEAD notification), we're left without ticks,
jiffies are frozen and any task relying on timers (msleep, ...) is stuck.
That's particularly the case for the cpu looping in __cpu_die() waiting
for the dying cpu to be dead.
This patch addresses this by having the tick_do_timer_cpu handover happen
earlier during the CPU_DYING notification. For this, a new clockevent
notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered
in hrtimer_cpu_notify().
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: avoid timer IRQ hanging slow systems
While using the function graph tracer on a virtualized system, the
hrtimer_interrupt can hang the system on an infinite loop.
This can be caused in several situations:
- the hardware is very slow and HZ is set too high
- something intrusive is slowing the system down (tracing under emulation)
... and the next clock events to program are always before the current time.
This patch implements a reasonable compromise: if such a situation is
detected, we share the CPUs time in 1/4 to process the hrtimer interrupts.
This is enough to let the system running without serious starvation.
It has been successfully tested under VirtualBox with 1000 HZ and 100 HZ
with function graph tracer launched. On both cases, the clock events were
increased until about 25 ms periodic ticks, which means 40 HZ.
So we change a hard to debug hang into a warning message and a system that
still manages to limp along.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prevent false positive WARN_ON() in clockevents_program_event()
clock_was_set() changes the base->offset of CLOCK_REALTIME and
enforces the reprogramming of the clockevent device to expire timers
which are based on CLOCK_REALTIME. If the clock change is large enough
then the subtraction of the timer expiry value and base->offset can
become negative which triggers the warning in
clockevents_program_event().
Check the subtraction result and set a negative value to 0.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>