Jiri managed to trigger this warning:
[] ======================================================
[] [ INFO: possible circular locking dependency detected ]
[] 3.10.0+ #228 Tainted: G W
[] -------------------------------------------------------
[] p/6613 is trying to acquire lock:
[] (rcu_node_0){..-...}, at: [<ffffffff810ca797>] rcu_read_unlock_special+0xa7/0x250
[]
[] but task is already holding lock:
[] (&ctx->lock){-.-...}, at: [<ffffffff810f2879>] perf_lock_task_context+0xd9/0x2c0
[]
[] which lock already depends on the new lock.
[]
[] the existing dependency chain (in reverse order) is:
[]
[] -> #4 (&ctx->lock){-.-...}:
[] -> #3 (&rq->lock){-.-.-.}:
[] -> #2 (&p->pi_lock){-.-.-.}:
[] -> #1 (&rnp->nocb_gp_wq[1]){......}:
[] -> #0 (rcu_node_0){..-...}:
Paul was quick to explain that due to preemptible RCU we cannot call
rcu_read_unlock() while holding scheduler (or nested) locks when part
of the read side critical section was preemptible.
Therefore solve it by making the entire RCU read side non-preemptible.
Also pull out the retry from under the non-preempt to play nice with RT.
Reported-by: Jiri Olsa <jolsa@redhat.com>
Helped-out-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently when the child context for inherited events is
created, it's based on the pmu object of the first event
of the parent context.
This is wrong for the following scenario:
- HW context having HW and SW event
- HW event got removed (closed)
- SW event stays in HW context as the only event
and its pmu is used to clone the child context
The issue starts when the cpu context object is touched
based on the pmu context object (__get_cpu_context). In
this case the HW context will work with SW cpu context
ending up with following WARN below.
Fixing this by using parent context pmu object to clone
from child context.
Addresses the following warning reported by Vince Weaver:
[ 2716.472065] ------------[ cut here ]------------
[ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
[ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
[ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
[ 2716.476035] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 2716.476035] 0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
[ 2716.476035] ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
[ 2716.476035] ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
[ 2716.476035] Call Trace:
[ 2716.476035] [<ffffffff8102e215>] ? warn_slowpath_common+0x5b/0x70
[ 2716.476035] [<ffffffff810ab2bd>] ? task_ctx_sched_out+0x3c/0x5f
[ 2716.476035] [<ffffffff810af02a>] ? perf_event_exit_task+0xbf/0x194
[ 2716.476035] [<ffffffff81032a37>] ? do_exit+0x3e7/0x90c
[ 2716.476035] [<ffffffff810cd5ab>] ? __do_fault+0x359/0x394
[ 2716.476035] [<ffffffff81032fe6>] ? do_group_exit+0x66/0x98
[ 2716.476035] [<ffffffff8103dbcd>] ? get_signal_to_deliver+0x479/0x4ad
[ 2716.476035] [<ffffffff810ac05c>] ? __perf_event_task_sched_out+0x230/0x2d1
[ 2716.476035] [<ffffffff8100205d>] ? do_signal+0x3c/0x432
[ 2716.476035] [<ffffffff810abbf9>] ? ctx_sched_in+0x43/0x141
[ 2716.476035] [<ffffffff810ac2ca>] ? perf_event_context_sched_in+0x7a/0x90
[ 2716.476035] [<ffffffff810ac311>] ? __perf_event_task_sched_in+0x31/0x118
[ 2716.476035] [<ffffffff81050dd9>] ? mmdrop+0xd/0x1c
[ 2716.476035] [<ffffffff81051a39>] ? finish_task_switch+0x7d/0xa6
[ 2716.476035] [<ffffffff81002473>] ? do_notify_resume+0x20/0x5d
[ 2716.476035] [<ffffffff813654f5>] ? retint_signal+0x3d/0x78
[ 2716.476035] ---[ end trace 827178d8a5966c3d ]---
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fixes a serious bug in:
14c63f17b1 perf: Drop sample rate when sampling is too slow
There was an misunderstanding on the API of the do_div()
macro. It returns the remainder of the division and this
was not what the function expected leading to disabling the
interrupt latency watchdog.
This patch also remove a duplicate assignment in
perf_sample_event_took().
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: dave.hansen@linux.intel.com
Cc: ak@linux.intel.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/20130704223010.GA30625@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch keeps track of how long perf's NMI handler is taking,
and also calculates how many samples perf can take a second. If
the sample length times the expected max number of samples
exceeds a configurable threshold, it drops the sample rate.
This way, we don't have a runaway sampling process eating up the
CPU.
This patch can tend to drop the sample rate down to level where
perf doesn't work very well. *BUT* the alternative is that my
system hangs because it spends all of its time handling NMIs.
I'll take a busted performance tool over an entire system that's
busted and undebuggable any day.
BTW, my suspicion is that there's still an underlying bug here.
Using the HPET instead of the TSC is definitely a contributing
factor, but I suspect there are some other things going on.
But, I can't go dig down on a bug like that with my machine
hanging all the time.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org
Cc: acme@ghostprotocols.net
Cc: Dave Hansen <dave@sr71.net>
[ Prettified it a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch simply moves all per-cpu variables into the new
single per-cpu "struct bp_cpuinfo".
To me this looks more logical and clean, but this can also
simplify the further potential changes. In particular, I do not
think this memory should be per-cpu, it is never used "locally".
After this change it is trivial to turn it into, say,
bootmem[nr_cpu_ids].
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155020.GA6350@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. register_wide_hw_breakpoint() can use unregister_ if failure,
no need to duplicate the code.
2. "struct perf_event **pevent" adds the unnecesary lever of
indirection and complication, use per_cpu(*cpu_events, cpu).
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155018.GA6347@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the trivial helper which simply returns cpumask_of() or
cpu_possible_mask depending on bp->cpu.
Change fetch_bp_busy_slots() and toggle_bp_slot() to always do
for_each_cpu(cpumask_of_bp) to simplify the code and avoid the
code duplication.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155015.GA6340@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change toggle_bp_slot() to make "weight" negative if !enable.
This way we can always use "+ weight" without additional "if
(enable)" check and toggle_bp_task_slot() no longer needs this
arg.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155013.GA6337@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The enable/disable logic in toggle_bp_slot() is not symmetrical
and imho very confusing. "old_count" in toggle_bp_task_slot() is
actually new_count because this bp was already removed from the
list.
Change toggle_bp_slot() to always call list_add/list_del after
toggle_bp_task_slot(). This way old_idx is task_bp_pinned() and
this entry should be decremented, new_idx is +/-weight and we
need to increment this element. The code/logic looks obvious.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20130620155011.GA6330@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint
slot") in arch_install_hw_breakpoint() but the problem is not
arch-specific.
The problem is, task_bp_pinned(cpu) checks "cpu == iter->cpu"
but this doesn't account the "all cpus" events with iter->cpu <
0.
This means that, say, register_user_hw_breakpoint(tsk) can
happily create the arbitrary number > HBP_NUM of breakpoints
which can not be activated. toggle_bp_task_slot() is equally
wrong by the same reason and nr_task_bp_pinned[] can have
negative entries.
Simple test:
# perl -e 'sleep 1 while 1' &
# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl`
Before this patch this triggers the same problem/WARN_ON(),
after the patch it correctly fails with -ENOSPC.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This allows us to use pdev->name for registering a PMU device.
IMO the name is not supposed to be changed anyway.
Signed-off-by: Mischa Jonker <mjonker@synopsys.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370339148-5566-1-git-send-email-mjonker@synopsys.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 2b923c8 perf/x86: Check branch sampling priv level in generic code
was missing the check for the hypervisor (HV) priv level, so add it back.
With this patch, we get the following correct behavior:
# echo 2 >/proc/sys/kernel/perf_event_paranoid
$ perf record -j any,k noploop 1
Error:
You may not have permission to collect stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid:
-1 - Not paranoid at all
0 - Disallow raw tracepoint access for unpriv
1 - Disallow cpu events for unpriv
2 - Disallow kernel profiling for unpriv
$ perf record -j any,hv noploop 1
Error:
You may not have permission to collect stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid:
-1 - Not paranoid at all
0 - Disallow raw tracepoint access for unpriv
1 - Disallow cpu events for unpriv
2 - Disallow kernel profiling for unpriv
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130606090204.GA3725@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince's fuzzer once again found holes. This time it spotted a leak in
the locked page accounting.
When an event had redirected output and its close() was the last
reference to the buffer we didn't have a vm context to undo accounting.
Change the code to destroy the buffer on the last munmap() and detach
all redirected events at that time. This provides us the right context
to undo the vm accounting.
Reported-and-tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince reported a problem found by his perf specific trinity
fuzzer.
Al noticed 2 problems with perf's mmap():
- it has issues against fork() since we use vma->vm_mm for accounting.
- it has an rb refcount leak on double mmap().
We fix the issues against fork() by using VM_DONTCOPY; I don't
think there's code out there that uses this; we didn't hear
about weird accounting problems/crashes. If we do need this to
work, the previously proposed VM_PINNED could make this work.
Aside from the rb reference leak spotted by Al, Vince's example
prog was indeed doing a double mmap() through the use of
perf_event_set_output().
This exposes another problem, since we now have 2 events with
one buffer, the accounting gets screwy because we account per
event. Fix this by making the buffer responsible for its own
accounting.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/20130528085548.GA12193@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch moves commit 7cc23cd to the generic code:
perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
The check is now implemented in generic code instead of x86 specific
code. That way we do not have to repeat the test in each arch
supporting branch sampling.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20130521105337.GA2879@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds /sys/device/xxx/perf_event_mux_interval_ms to ajust
the multiplexing interval per PMU. The unit is milliseconds. Value has
to be >= 1.
In the 4th version, we renamed the sysfs file to be more consistent
with the other /proc/sys/kernel entries for perf_events.
In the 5th version, we handle the reprogramming of the hrtimer using
hrtimer_forward_now(). That way, we sync up to new timer value quickly
(suggested by Jiri Olsa).
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current scheme of using the timer tick was fine for per-thread
events. However, it was causing bias issues in system-wide mode
(including for uncore PMUs). Event groups would not get their fair
share of runtime on the PMU. With tickless kernels, if a core is idle
there is no timer tick, and thus no event rotation (multiplexing).
However, there are events (especially uncore events) which do count
even though cores are asleep.
This patch changes the timer source for multiplexing. It introduces a
per-PMU per-cpu hrtimer. The advantage is that even when a core goes
idle, it will come back to service the hrtimer, thus multiplexing on
system-wide events works much better.
The per-PMU implementation (suggested by PeterZ) enables adjusting the
multiplexing interval per PMU. The preferred interval is stashed into
the struct pmu. If not set, it will be forced to the default interval
value.
In order to minimize the impact of the hrtimer, it is turned on and
off on demand. When the PMU on a CPU is overcommited, the hrtimer is
activated. It is stopped when the PMU is not overcommitted.
In order for this to work properly, we had to change the order of
initialization in start_kernel() such that hrtimer_init() is run
before perf_event_init().
The default interval in milliseconds is set to a timer tick just like
with the old code. We will provide a sysctl to tune this in another
patch.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1364991694-5876-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The hw breakpoint pmu 'add' function is missing the
period_left update needed for SW events.
The perf HW breakpoint events use the SW events framework
to process the overflow, so it needs to be properly initialized
in the PMU 'add' method.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1367421944-19082-5-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add perf_event_aux() function to send out all types of
auxiliary events - mmap, task, comm events. For each type
there's match and output functions defined and used as
callbacks during perf_event_aux processing.
This way we can centralize the pmu/context iterating and
event matching logic. Also since lot of the code was
duplicated, this patch reduces the .text size about 2kB
on my setup:
snipped output from 'objdump -x kernel/events/core.o'
before:
Idx Name Size
0 .text 0000d313
after:
Idx Name Size
0 .text 0000cad3
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The perf_event_task_ctx() function needs to be called with
preemption disabled, since it's checking for currently
scheduled cpu against event cpu.
We disable preemption for task related perf event context
if there's one defined, leaving up to the chance which cpu
it gets scheduled in.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1367857638-27631-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If we allocate perf ring buffer with the size of single (user)
page, we will get memory corruption when releasing itin
rb_free_work function (for CONFIG_PERF_USE_VMALLOC option).
For single page sized ring buffer the page_order is -1 (because
nr_pages is 0). This needs to be recognized in the rb_free_work
function to release proper amount of pages.
Adding data_page_nr function that returns number of allocated
data pages. Customizing the rest of the code to use it.
Reported-by: Jan Stancek <jstancek@redhat.com>
Original-patch-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20130319143509.GA1128@krava.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide a new helper that help full dynticks CPUs to prevent
from stopping their tick in case there are events in the local
rotation list.
This way we make sure that perf_event_task_tick() is serviced
on demand.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Kick the current CPU's tick by sending it a self IPI when
an event is queued on the rotation list and it is the first
element inserted. This makes sure that perf_event_task_tick()
works on full dynticks CPUs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Trinity discovered that we fail to check all 64 bits of
attr.config passed by user space, resulting to out-of-bounds
access of the perf_swevent_enabled array in
sw_perf_event_destroy().
Introduced in commit b0a873ebb ("perf: Register PMU
implementations").
Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlike the kretprobes we can't trust userspace, thus must have
protection from user space attacks. User-space have "unlimited"
stack, and this patch limits the return probes nestedness as a
simple remedy for it.
Note that this implementation leaks return_instance on siglongjmp
until exit()/exec().
The intention is to have KISS and bare minimum solution for the
initial implementation in order to not complicate the uretprobes
code.
In the future we may come up with more sophisticated solution that
remove this depth limitation. It is not easy task and lays beyond
this patchset.
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Uretprobe handlers are invoked when the trampoline is hit, on completion
the trampoline is replaced with the saved return address and the uretprobe
instance deleted.
TODO: handle_trampoline() assumes that ->return_instances is always valid.
We should teach it to handle longjmp() which can invalidate the pending
return_instance's. This is nontrivial, we will try to do this in a separate
series.
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
When a uprobe with return probe consumer is hit, prepare_uretprobe()
function is invoked. It creates return_instance, hijacks return address
and replaces it with the trampoline.
* Return instances are kept as stack per uprobed task.
* Return instance is chained, when the original return address is
trampoline's page vaddr (e.g. recursive call of the probed function).
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Allocate trampoline page, as the very first one in uprobed
task xol area, and fill it with breakpoint opcode.
Also introduce get_trampoline_vaddr() helper, to wrap the
trampoline address extraction from area->vaddr. That removes
confusion and eases the debug experience in case ->vaddr
notion will be changed.
Signed-off-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
perf_event is one of a couple remaining cgroup controllers with broken
hierarchy support. Converting it to support hierarchy is almost
trivial. The only thing necessary is to consider a task belonging to
a descendant cgroup as a match. IOW, if the cgroup of the currently
executing task (@cpuctx->cgrp) equals or is a descendant of the
event's cgroup (@event->cgrp), then the event should be enabled.
Implement hierarchy support and remove .broken_hierarchy tag along
with the incorrect comment on what needs to be done for hierarchy
support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
For NUL terminated string, always make sure that there's '\0' at the end.
In our case we need a return value, so still use strncpy() and
fix up the tail explicitly.
(strlcpy() returns the size, not the pointer)
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: a.p.zijlstra@chello.nl <a.p.zijlstra@chello.nl>
Cc: paulus@samba.org <paulus@samba.org>
Cc: acme@ghostprotocols.net <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/51623E0B.7070101@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change write_opcode() to use copy_highpage() + copy_to_page()
and simplify the code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Extract the kmap_atomic/memcpy/kunmap_atomic code from
xol_get_insn_slot() into the new simple helper, copy_to_page().
It will have more users soon.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
__copy_insn(filp) can only be called after valid_vma() returns T,
vma->vm_file passed as "filp" can not be NULL.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Change __copy_insn() to use copy_from_page() and simplify the code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
No functional changes. Rename copy_opcode() into copy_from_page() and
add the new "int len" argument to make it more more generic for the
new users.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Anton Arapov <anton@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Some architectures like powerpc have multiple variants of the trap
instruction. Introduce an additional helper is_trap_insn() for run-time
handling of non-uprobe traps on such architectures.
While there, change is_swbp_at_addr() to is_trap_at_addr() for reading
clarity.
With this change, the uprobe registration path will supercede any trap
instruction inserted at the requested location, while taking care of
delivering the SIGTRAP for cases where the trap notification came in
for an address without a uprobe. See [1] for a more detailed explanation.
[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-March/104771.html
This change was suggested by Oleg Nesterov.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cleanup. Now that we have f_inode/file_inode() we can use it instead
of vm_file->f_mapping->host.
This should not make any difference for uprobes, but in theory this
change is more correct. We use this inode as a key, to compare it
with uprobe->inode set by uprobe_register(inode), and the caller uses
d_inode.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Type of mapping was lost and made it hard for a tool
to distinguish code vs. data mmaps. Perf has the ability
to distinguish the two.
Use a bit in the header->misc bitmask to keep track of
the mmap type. If PERF_RECORD_MISC_MMAP_DATA is set then
the mapping is not executable (!VM_EXEC). If not set, then
the mapping is executable.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-16-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch adds PERF_SAMPLE_DATA_SRC.
PERF_SAMPLE_DATA_SRC collects the data source, i.e., where
did the data associated with the sampled instruction
come from. Information is stored in a perf_mem_data_src
structure. It contains opcode, mem level, tlb, snoop,
lock information, subject to availability in hardware.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-8-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
For some events it's useful to weight sample with a hardware
provided number. This expresses how expensive the action the
sample represent was. This allows the profiler to scale
the samples to be more informative to the programmer.
There is already the period which is used similarly, but it
means something different, so I chose to not overload it.
Instead a new sample type for WEIGHT is added.
Can be used for multiple things. Initially it is used for TSX
abort costs and profiling by memory latencies (so to make
expensive load appear higher up in the histograms). The concept
is quite generic and can be extended to many other kinds of
events or architectures, as long as the hardware provides
suitable auxillary values. In principle it could be also used
for software tracepoints.
This adds the generic glue. A new optional sample format for a
64-bit weight value.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: namhyung.kim@lge.com
Link: http://lkml.kernel.org/r/1359040242-8269-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This patch fixes a flaw in perf_output_space(). In case the size
of the space needed is bigger than the actual buffer size, there
may be situations where the function would return true (i.e.,
there is space) when it should not. head > offset due to
rounding of the masking logic.
The problem can be tested by activating BTS on Intel processors.
A BTS record can be as big as 16 pages. The following command
fails:
$ perf record -m 4 -c 1 -e branches:u my_test_program
You will get a buffer corruption with this. Perf report won't be
able to parse the perf.data.
The fix is to first check that the requested space is smaller
than the buffer size. If so, then the masking logic will work
fine. If not, then there is no chance the record can be saved
and it will be gracefully handled by upper code layers.
[ In v2, we also make the logic for the writable more explicit by
renaming it to rb->overwrite because it tells whether or not the
buffer can overwrite its tail (suggested by PeterZ). ]
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: fweisbec@gmail.com
Link: http://lkml.kernel.org/r/20130318133327.GA3056@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's a per-cpu data structure but missed the __percpu annotation.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Link: http://lkml.kernel.org/r/1363600594-11453-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_event_task_event() iterates pmu list and generate events
for each eligible pmu context. But if task_event has task_ctx
like in EXIT it'll generate events even though the pmu doesn't
have an eligible one. Fix it by moving the code to proper
places.
Before this patch:
$ perf record -n true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.006 MB perf.data (~248 samples) ]
$ perf report -D | tail
Aggregated stats:
TOTAL events: 73
MMAP events: 67
COMM events: 2
EXIT events: 4
cycles stats:
TOTAL events: 73
MMAP events: 67
COMM events: 2
EXIT events: 4
After this patch:
$ perf report -D | tail
Aggregated stats:
TOTAL events: 70
MMAP events: 67
COMM events: 2
EXIT events: 1
cycles stats:
TOTAL events: 70
MMAP events: 67
COMM events: 2
EXIT events: 1
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363332433-7637-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When cpu/task clock events are initialized, their sampling
frequencies are converted to have a fixed value. However it
missed to update the hwc->last_period which was set to 1 for
initial sampling frequency calibration.
Because this hwc->last_period value is used as a period in
perf_swevent_ hrtime(), every recorded sample will have an
incorrected period of 1.
$ perf record -e task-clock noploop 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.158 MB perf.data (~6919 samples) ]
$ perf report -n --show-total-period --stdio
# Samples: 4K of event 'task-clock'
# Event count (approx.): 4000
#
# Overhead Samples Period Command Shared Object Symbol
# ........ ............ ............ ....... ............. ..................
#
99.95% 3998 3998 noploop noploop [.] main
0.03% 1 1 noploop libc-2.15.so [.] init_cacheinfo
0.03% 1 1 noploop ld-2.15.so [.] open_verify
Note that it doesn't affect the non-sampling event so that the
perf stat still gets correct value with or without this patch.
$ perf stat -e task-clock noploop 1
Performance counter stats for 'noploop 1':
1000.272525 task-clock # 1.000 CPUs utilized
1.000560605 seconds time elapsed
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1363574507-18808-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>