The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved. The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.
For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex. The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:
OOM invoking task:
mem_cgroup_handle_oom+0x241/0x3b0
mem_cgroup_cache_charge+0xbe/0xe0
add_to_page_cache_locked+0x4c/0x140
add_to_page_cache_lru+0x22/0x50
grab_cache_page_write_begin+0x8b/0xe0
ext3_write_begin+0x88/0x270
generic_file_buffered_write+0x116/0x290
__generic_file_aio_write+0x27c/0x480
generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
do_sync_write+0xea/0x130
vfs_write+0xf3/0x1f0
sys_write+0x51/0x90
system_call_fastpath+0x18/0x1d
OOM kill victim:
do_truncate+0x58/0xa0 # takes i_mutex
do_last+0x250/0xa30
path_openat+0xd7/0x440
do_filp_open+0x49/0xa0
do_sys_open+0x106/0x240
sys_open+0x20/0x30
system_call_fastpath+0x18/0x1d
The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.
A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit. But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks. For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.
This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:
1. When OOMing in a user fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.
2. When OOMing in a user fault but somebody else is handling it
(either the kernel OOM killer or a userspace handler), don't go to
sleep in the charge context. Instead, remember the OOMing memcg in
the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.
Debugged by Michal Hocko.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.
Enable the memcg OOM killer only for user faults, where it's really the
only option available.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_struct->pid/tgid should go away.
1. Change same_thread_group() to use task->signal for comparison.
2. Change has_group_leader_pid(task) to compare task_pid(task) with
signal->leader_pid.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit df54d6fa54.
The commit isn't necessarily wrong, but because it recalculates the
random mmap_base every time, it seems to confuse user memory allocators
that expect contiguous mmap allocations even when the mmap address isn't
specified.
In particular, the MATLAB Java runtime seems to be unhappy. See
https://bugzilla.kernel.org/show_bug.cgi?id=60774
So we'll want to apply the random offset only once, and Radu has a patch
for that. Revert this older commit in order to apply the other one.
Reported-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the stack is set to unlimited, the bottomup direction is used for
mmap-ings but the mmap_base is not used and thus effectively renders
ASLR for mmapings along with PIE useless.
Cc: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Sendroiu <molecula2788@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling freeze_processes sets a global flag that will cause any
process that calls try_to_freeze to enter the refrigerator. It
skips sending a signal to the current task, but if the current
task ever hits try_to_freeze, all threads will be frozen and the
system will deadlock.
Set a new flag, PF_SUSPEND_TASK, on the task that calls
freeze_processes. The flag notifies the freezer that the thread
is involved in suspend and should not be frozen. Also add a
WARN_ON in thaw_processes if the caller does not have the
PF_SUSPEND_TASK flag set to catch if a different task calls
thaw_processes than the one that called freeze_processes, leaving
a task with PF_SUSPEND_TASK permanently set on it.
Threads that spawn off a task with PF_SUSPEND_TASK set (which
swsusp does) will also have PF_SUSPEND_TASK set, preventing them
from freezing while they are helping with suspend, but they need
to be dead by the time suspend is triggered, otherwise they may
run when userspace is expected to be frozen. Add a WARN_ON in
thaw_processes if more than one thread has the PF_SUSPEND_TASK
flag set.
Reported-and-tested-by: Michael Leun <lkml20130126@newton.leun.net>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The wake-affine scheduler feature is currently always trying to pull
the wakee close to the waker. In theory this should be beneficial if
the waker's CPU caches hot data for the wakee, and it's also beneficial
in the extreme ping-pong high context switch rate case.
Testing shows it can benefit hackbench up to 15%.
However, the feature is somewhat blind, from which some workloads
such as pgbench suffer. It's also time-consuming algorithmically.
Testing shows it can damage pgbench up to 50% - far more than the
benefit it brings in the best case.
So wake-affine should be smarter and it should realize when to
stop its thankless effort at trying to find a suitable CPU to wake on.
This patch introduces 'wakee_flips', which will be increased each
time the task flips (switches) its wakee target.
So a high 'wakee_flips' value means the task has more than one
wakee, and the bigger the number, the higher the wakeup frequency.
Now when making the decision on whether to pull or not, pay attention to
the wakee with a high 'wakee_flips', pulling such a task may benefit
the wakee. Also imply that the waker will face cruel competition later,
it could be very cruel or very fast depends on the story behind
'wakee_flips', waker therefore suffers.
Furthermore, if waker also has a high 'wakee_flips', that implies that
multiple tasks rely on it, then waker's higher latency will damage all
of them, so pulling wakee seems to be a bad deal.
Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes
higher and higher, the cost of pulling seems to be worse and worse.
The patch therefore helps the wake-affine feature to stop its pulling
work when:
wakee->wakee_flips > factor &&
waker->wakee_flips > (factor * wakee->wakee_flips)
The 'factor' here is the number of CPUs in the current CPU's NUMA node,
so a bigger node will lead to more pulling since the trial becomes more
severe.
After applying the patch, pgbench shows up to 40% improvements and no regressions.
Tested with 12 cpu x86 server and tip 3.10.0-rc7.
The percentages in the final column highlight the areas with the biggest wins,
all other areas improved as well:
pgbench base smart
| db_size | clients | tps | | tps |
+---------+---------+-------+ +-------+
| 22 MB | 1 | 10598 | | 10796 |
| 22 MB | 2 | 21257 | | 21336 |
| 22 MB | 4 | 41386 | | 41622 |
| 22 MB | 8 | 51253 | | 57932 |
| 22 MB | 12 | 48570 | | 54000 |
| 22 MB | 16 | 46748 | | 55982 | +19.75%
| 22 MB | 24 | 44346 | | 55847 | +25.93%
| 22 MB | 32 | 43460 | | 54614 | +25.66%
| 7484 MB | 1 | 8951 | | 9193 |
| 7484 MB | 2 | 19233 | | 19240 |
| 7484 MB | 4 | 37239 | | 37302 |
| 7484 MB | 8 | 46087 | | 50018 |
| 7484 MB | 12 | 42054 | | 48763 |
| 7484 MB | 16 | 40765 | | 51633 | +26.66%
| 7484 MB | 24 | 37651 | | 52377 | +39.11%
| 7484 MB | 32 | 37056 | | 51108 | +37.92%
| 15 GB | 1 | 8845 | | 9104 |
| 15 GB | 2 | 19094 | | 19162 |
| 15 GB | 4 | 36979 | | 36983 |
| 15 GB | 8 | 46087 | | 49977 |
| 15 GB | 12 | 41901 | | 48591 |
| 15 GB | 16 | 40147 | | 50651 | +26.16%
| 15 GB | 24 | 37250 | | 52365 | +40.58%
| 15 GB | 32 | 36470 | | 50015 | +37.14%
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linux as a guest on KVM hypervisor, the only user of the pvclock
vsyscall interface, does not require notification on task migration
because:
1. cpu ID number maps 1:1 to per-CPU pvclock time info.
2. per-CPU pvclock time info is updated if the
underlying CPU changes.
3. that version is increased whenever underlying CPU
changes.
Which is sufficient to guarantee nanoseconds counter
is calculated properly.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
When building the htmldocs (in verbose mode), scripts/kernel-doc
reports the follwing type of warnings:
Warning(kernel/sched/core.c:936): No description found for return value of 'task_curr'
...
Fix those by:
- adding the missing descriptions
- using "Return" sections for the descriptions
Signed-off-by: Yacine Belkadi <yacine.belkadi.1@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1373654747-2389-1-git-send-email-yacine.belkadi.1@gmail.com
[ While at it, fix the cpupri_set() explanation. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit bf26c01849 ("Prepare to fix racy accesses on task
breakpoints").
The patch was fine but we can no longer race with SIGKILL after commit
9899d11f65 ("ptrace: ensure arch_ptrace/ptrace_request can never race
with SIGKILL"), the __TASK_TRACED tracee can't be woken up and
->ptrace_bps[] can't go away.
Now that ptrace_get_breakpoints/ptrace_put_breakpoints have no callers,
we can kill them and remove task->ptrace_bp_refcnt.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move __set_special_pids() from exit.c to sys.c close to its single caller
and make it static.
And rename it to set_special_pids(), another helper with this name has
gone away.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
we can use runnable load variables.
Also remove 2 CONFIG_FAIR_GROUP_SCHED setting which is not in reverted
patch(introduced in 9ee474f), but also need to revert.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is intended for use in loops which read data protected by RCU and may
have a large number of iterations. Such an example is dumping the list of
connections known to IPVS: ip_vs_conn_array() and ip_vs_conn_seq_next().
The benefits are for CONFIG_PREEMPT_RCU=y where we save CPU cycles
by moving rcu_read_lock and rcu_read_unlock out of large loops
but still allowing the current task to be preempted after every
loop iteration for the CONFIG_PREEMPT_RCU=n case.
The call to cond_resched() is not needed when CONFIG_PREEMPT_RCU=y.
Thanks to Paul E. McKenney for explaining this and for the
final version that checks the context with CONFIG_DEBUG_ATOMIC_SLEEP=y
for all possible configurations.
The function can be empty in the CONFIG_PREEMPT_RCU case,
rcu_read_lock and rcu_read_unlock are not needed in this case
because the task can be preempted on indication from scheduler.
Thanks to Peter Zijlstra for catching this and for his help
in trying a solution that changes __might_sleep.
Initial cond_resched_rcu_lock() function suggested by Eric Dumazet.
Tested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The scheduler doesn't yet fully support environments
with a single task running without a periodic tick.
In order to ensure we still maintain the duties of scheduler_tick(),
keep at least 1 tick per second.
This makes sure that we keep the progression of various scheduler
accounting and background maintainance even with a very low granularity.
Examples include cpu load, sched average, CFS entity vruntime,
avenrun and events such as load balancing, amongst other details
handled in sched_class::task_tick().
This limitation will be removed in the future once we get
these individual items to work in full dynticks CPUs.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
threadgroup_lock() takes signal->cred_guard_mutex to ensure that
thread_group_leader() is stable. This doesn't look nice, the scope of
this lock in do_execve() is huge.
And as Dave pointed out this can lead to deadlock, we have the
following dependencies:
do_execve: cred_guard_mutex -> i_mutex
cgroup_mount: i_mutex -> cgroup_mutex
attach_task_by_pid: cgroup_mutex -> cred_guard_mutex
Change de_thread() to take threadgroup_change_begin() around the
switch-the-leader code and change threadgroup_lock() to avoid
->cred_guard_mutex.
Note that de_thread() can't sleep with ->group_rwsem held, this can
obviously deadlock with the exiting leader if the writer is active, so it
does threadgroup_change_end() before schedule().
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are 2 well known and ancient problems with coredump/signals, and a
lot of related bug reports:
- do_coredump() clears TIF_SIGPENDING but of course this can't help
if, say, SIGCHLD comes after that.
In this case the coredump can fail unexpectedly. See for example
wait_for_dump_helper()->signal_pending() check but there are other
reasons.
- At the same time, dumping a huge core on the slow media can take a
lot of time/resources and there is no way to kill the coredumping
task reliably. In particular this is not oom_kill-friendly.
This patch tries to fix the 1st problem, and makes the preparation for the
next changes.
We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
that this process dumps the core. prepare_signal() checks this flag and
nacks any signal except SIGKILL.
Note that this check tries to be conservative, in the long term we should
probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
discussion. See marc.info/?l=linux-kernel&m=120508897917439
Notes:
- recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
The patch assumes that dump_write/etc paths should never
call it, but we can change it as well.
- There is another source of TIF_SIGPENDING, freezer. This
will be addressed separately.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most commands are entered one line at a time and processed as complete lines
in non-canonical mode. Commands that interactively require a password, enter
canonical mode to do this while shutting off echo. This pair of features
(icanon and !echo) can be used to avoid logging passwords by audit while still
logging the rest of the command.
Adding a member (log_passwd) to the struct audit_tty_status passed in by
pam_tty_audit allows control of canonical mode without echo per task.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
On my SMP platform which is made of 5 cores in 2 clusters, I
have the nr_busy_cpu field of sched_group_power struct that is
not null when the platform is fully idle - which makes the
scheduler unhappy.
The root cause is:
During the boot sequence, some CPUs reach the idle loop and set
their NOHZ_IDLE flag while waiting for others CPUs to boot. But
the nr_busy_cpus field is initialized later with the assumption
that all CPUs are in the busy state whereas some CPUs have
already set their NOHZ_IDLE flag.
More generally, the NOHZ_IDLE flag must be initialized when new
sched_domains are created in order to ensure that NOHZ_IDLE and
nr_busy_cpus are aligned.
This condition can be ensured by adding a synchronize_rcu()
between the destruction of old sched_domains and the creation of
new ones so the NOHZ_IDLE flag will not be updated with old
sched_domain once it has been initialized. But this solution
introduces a additionnal latency in the rebuild sequence that is
called during cpu hotplug.
As suggested by Frederic Weisbecker, another solution is to have
the same rcu lifecycle for both NOHZ_IDLE and sched_domain
struct. A new nohz_idle field is added to sched_domain so both
status and sched_domain will share the same RCU lifecycle and
will be always synchronized. In addition, there is no more need
to protect nohz_idle against concurrent access as it is only
modified by 2 exclusive functions called by local cpu.
This solution has been prefered to the creation of a new struct
with an extra pointer indirection for sched_domain.
The synchronization is done at the cost of :
- An additional indirection and a rcu_dereference for accessing nohz_idle.
- We use only the nohz_idle field of the top sched_domain.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linaro-kernel@lists.linaro.org
Cc: peterz@infradead.org
Cc: fweisbec@gmail.com
Cc: pjt@google.com
Cc: rostedt@goodmis.org
Cc: efault@gmx.de
Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
[ Fixed !NO_HZ build bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide a new helper to be called from the full dynticks engine
before stopping the tick in order to make sure we don't stop
it when there is more than one task running on the CPU.
This way we make sure that the tick stays alive to maintain
fairness.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
feature bit was really just an early hack to make with/without
mutex-spinning testable. So it is no longer necessary.
This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
move the mutex spinning code from kernel/sched/core.c back to
kernel/mutex.c which is where they should belong.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chandramouleeswaran Aswin <aswin@hp.com>
Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Norton Scott J <scott.norton@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently kernel generates IDs for posix timers in a global manner --
there's a kernel-wide IDR tree from which IDs are created. This makes
it impossible to recreate a timer with a desired ID (in particular
this is done by the CRIU checkpoint-restore project) -- since these
IDs are global it may happen, that at the time we recreate a timer, the
ID we want for it is already busy by some other timer.
In order to address this, replace the IDR tree with a global hash
table for timers and makes timer IDs unique per signal_struct (to
which timers are linked anyway). With this, two timers belonging to
different processes may have equal IDs and we can recreate either of
them with the ID we want.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Link: http://lkml.kernel.org/r/513D9FF5.9010004@parallels.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:
CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)
We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.
The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.
Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.
Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.
The migration thread has another related issue.
CPU0 CPU1
Bring up CPU2
create_thread(T)
park(T)
wait_for_completion()
parkme()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)
The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement set/clear functions for the idle need_resched poll
implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.518839807@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move it to a common place. Preparatory patch for implementing
set/clear for the idle need_resched poll implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Link: http://lkml.kernel.org/r/20130321215233.446034505@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We are planning to convert the dynticks Kconfig options layout
into a choice menu. The user must be able to easily pick
any of the following implementations: constant periodic tick,
idle dynticks, full dynticks.
As this implies a mutual exclusion, the two dynticks implementions
need to converge on the selection of a common Kconfig option in order
to ease the sharing of a common infrastructure.
It would thus seem pretty natural to reuse CONFIG_NO_HZ to
that end. It already implements all the idle dynticks code
and the full dynticks depends on all that code for now.
So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.
On the other hand we want to stay backward compatible: if
CONFIG_NO_HZ is set in an older config file, we want to
enable CONFIG_NO_HZ_IDLE by default.
But we can't afford both at the same time or we run into
a circular dependency:
1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
CONFIG_NO_HZ
2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE
We might be able to support that from Kconfig/Kbuild but it
may not be wise to introduce such a confusing behaviour.
So to solve this, create a new CONFIG_NO_HZ_COMMON option
which gathers the common code between idle and full dynticks
(that common code for now is simply the idle dynticks code)
and select it from their referring Kconfig.
Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
to it for backward compatibility.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Does writethrough and writeback caching, handles unclean shutdown, and
has a bunch of other nifty features motivated by real world usage.
See the wiki at http://bcache.evilpiepirate.org for more.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Wake up a CPU when a timer list timer is enqueued there and
the target is part of the full dynticks range. Sending an IPI
to it makes it reconsidering the next timer to program on top
of recent updates.
This may later be improved by checking if the tick is really
stopped on the target. This would need some careful
synchronization though. So deal with such optimization later
and start simple.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself. Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.
What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks. In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra proection as other tasks can inject work
items to the task anyway.
This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and return -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.
This will allow simplifying workqueue worker CPU affinity management.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
The full dynticks cputime accounting is able to account either
using the tick or the context tracking subsystem. This way
the housekeeping CPU can keep the low overhead tick based
solution.
This latter mode has a low jiffies resolution granularity and
need to be scaled against CFS precise runtime accounting to
improve its result. We are doing this for CONFIG_TICK_CPU_ACCOUNTING,
now we also need to expand it to full dynticks accounting dynamic
off-case as well.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Mats Liljegren <mats.liljegren@enea.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
As default_scale_{freq,smt}_power() and update_rt_power() are
used in kernel/sched/fair.c only, annotate them as static
functions.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A7AF.8010900@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They are used internally only.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A78E.7040609@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move struct sched_group_power and sched_group and related inline
functions to kernel/sched/sched.h, as they are used internally
only.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A77F.2010705@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They are used internally only.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A771.4070104@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No one will call those functions if CONFIG_SCHED_DEBUG=n.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5135A748.3050206@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
defines introduced in 54b501992d ("coredump: warn about unsafe
suid_dumpable / core_pattern combo"). Remove the new ones, and use the
prior values instead.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Chen Gang <gang.chen@asianux.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.
The patch trys to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:
- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside runtime resume callback of any one of its
ancestors(or the block device itself), the deadlock may be triggered
inside the memory allocation since it might not complete until the block
device becomes active and the involed page I/O finishes. The situation
is pointed out first by Alan Stern. It is not a good approach to
convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
subsystems may be involved(for example, PCI, USB and SCSI may be
involved for usb mass stoarage device, network devices involved too in
the iSCSI case)
- during block device runtime suspend, because runtime resume need to
wait for completion of concurrent runtime suspend.
- during error handling of usb mass storage deivce, USB bus reset will
be put on the device, so there shouldn't have any memory allocation with
GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
above may be triggered. Unfortunately, any usb device may include one
mass storage interface in theory, so it requires all usb interface
drivers to handle the situation. In fact, most usb drivers don't know
how to handle bus reset on the device and don't provide .pre_set() and
.post_reset() callback at all, so USB core has to unbind and bind driver
for these devices. So it is still not practical to resort to GFP_NOIO
for solving the problem.
Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.
It is not a good idea to convert all these GFP_KERNEL in the affected
path into GFP_NOIO because these functions doing that may be implemented
as library and will be called in many other contexts.
In fact, memalloc_noio_flags() can convert some of current static
GFP_NOIO allocation into GFP_KERNEL back in other non-affected contexts,
at least almost all GFP_NOIO in USB subsystem can be converted into
GFP_KERNEL after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.
[1], several GFP_KERNEL allocation examples in runtime resume path
- pci subsystem
acpi_os_allocate
<-acpi_ut_allocate
<-ACPI_ALLOCATE_ZEROED
<-acpi_evaluate_object
<-__acpi_bus_set_power
<-acpi_bus_set_power
<-acpi_pci_set_power_state
<-platform_pci_set_power_state
<-pci_platform_power_transition
<-__pci_complete_power_transition
<-pci_set_power_state
<-pci_restore_standard_config
<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
<-finish_port_resume
<-usb_port_resume
<-generic_resume
<-usb_resume_device
<-usb_resume_both
<-usb_runtime_resume
- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
That is just what I have found. Unfortunately, this allocation can only
be found by human being now, and there should be many not found since
any function in the resume path(call tree) may allocate memory with
GFP_KERNEL.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the sysctl-related bits from include/linux/sched.h into
a new file: include/linux/sched/sysctl.h. Then update source
files requiring access to those bits by including the new
header file.
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While remotely reading the cputime of a task running in a
full dynticks CPU, the values stored in utime/stime fields
of struct task_struct may be stale. Its values may be those
of the last kernel <-> user transition time snapshot and
we need to add the tickless time spent since this snapshot.
To fix this, flush the cputime of the dynticks CPUs on
kernel <-> user transition and record the time / context
where we did this. Then on top of this snapshot and the current
time, perform the fixup on the reader side from task_times()
accessors.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fixed kvm module related build errors]
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>