When updating PT_NOTE header size (ie. p_memsz), an overflow issue
happens with the following bogus note entry:
n_namesz = 0xFFFFFFFF
n_descsz = 0x0
n_type = 0x0
This kind of note entry should be dropped during updating p_memsz. But
because n_namesz is 32bit, after (n_namesz + 3) & (~3), it's overflow to
0x0, the note entry size looks sane and reserved.
When userspace (eg. crash utility) is trying to access such bogus note,
it could lead to an unexpected behavior (eg. crash utility segment fault
because it's reading bogus address).
The source of bogus note hasn't been identified yet. At least we could
drop the bogus note so user space wouldn't be surprised.
Signed-off-by: WANG Chao <chaowang@redhat.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Randy Wright <rwright@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Rashika Kheria <rashika.kheria@gmail.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of custom approach let's use string_escape_str() to escape a given
string (task_name in this case).
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The output of /proc/$pid/numa_maps is in terms of number of pages like
anon=22 or dirty=54. Here's some output:
7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
7fff8d425000 default stack anon=50 dirty=50 N0=50
Looks like we have a stack and a couple of anonymous hugetlbfs
areas page which both use the same amount of memory. They don't.
The 'bigfile' uses 1GB pages and takes up ~50GB of space. The
anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack
uses normal 4k pages. You can go over to smaps to figure out what the
page size _really_ is with KernelPageSize or MMUPageSize. But, I think
this is a pretty nasty and counterintuitive interface as it stands.
This patch introduces 'kernelpagesize_kB' line element to
/proc/<pid>/numa_maps report file in order to help identifying the size of
pages that are backing memory areas mapped by a given task. This is
specially useful to help differentiating between HUGE and GIGANTIC page
backed VMAs.
This patch is based on Dave Hansen's proposal and reviewer's follow-ups
taken from the following dicussion threads:
* https://lkml.org/lkml/2011/9/21/454
* https://lkml.org/lkml/2014/12/20/66
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the PDE() helper to get proc_dir_entry instead of coding it directly.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peak resident size of a process can be reset back to the process's
current rss value by writing "5" to /proc/pid/clear_refs. The driving
use-case for this would be getting the peak RSS value, which can be
retrieved from the VmHWM field in /proc/pid/status, per benchmark
iteration or test scenario.
[akpm@linux-foundation.org: clarify behaviour in documentation]
Signed-off-by: Petr Cermak <petrcermak@chromium.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Primiano Tucci <primiano@chromium.org>
Cc: Petr Cermak <petrcermak@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently pagewalker splits all THP pages on any clear_refs request. It's
not necessary. We can handle this on PMD level.
One side effect is that soft dirty will potentially see more dirty memory,
since we will mark whole THP page dirty at once.
Sanity checked with CRIU test suite. More testing is required.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
walk_page_range() silently skips vma having VM_PFNMAP set, which leads to
undesirable behaviour at client end (who called walk_page_range). For
example for pagemap_read(), when no callbacks are called against VM_PFNMAP
vma, pagemap_read() may prepare pagemap data for next virtual address
range at wrong index. That could confuse and/or break userspace
applications.
This patch avoid this misbehavior caused by vma(VM_PFNMAP) like follows:
- for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
over vma(VM_PFNMAP),
- for clear_refs and queue_pages which have their own ->tests_walk,
just return 1 and skip vma(VM_PFNMAP). This is no problem because
these are not interested in hole regions,
- for other callers, just skip the vma(VM_PFNMAP) as a default behavior.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pagewalk.c can handle vma in itself, so we don't have to pass vma via
walk->private. And show_numa_map() walks pages on vma basis, so using
walk_page_vma() is preferable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code
grep-friendly.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page table walker has the information of the current vma in mm_walk, so we
don't have to call find_vma() in each pagemap_(pte|hugetlb)_range() call
any longer. Currently pagemap_pte_range() does vma loop itself, so this
patch reduces many lines of code.
NULL-vma check is omitted because we assume that we never run these
callbacks on any address outside vma. And even if it were broken, NULL
pointer dereference would be detected, so we can get enough information
for debugging.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clear_refs_write() has some prechecks to determine if we really walk over
a given vma. Now we have a test_walk() callback to filter vmas, so let's
utilize it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pagewalk.c can handle vma in itself, so we don't have to pass vma via
walk->private. And show_smap() walks pages on vma basis, so using
walk_page_vma() is preferable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lockless access to pte in pagemap_pte_range() might race with page
migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():
CPU A (pagemap) CPU B (migration)
lock_page()
try_to_unmap(page, TTU_MIGRATION...)
make_migration_entry()
set_pte_at()
<read *pte>
pte_to_pagemap_entry()
remove_migration_ptes()
unlock_page()
if(is_migration_entry())
migration_entry_to_page()
BUG_ON(!PageLocked(page))
Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.
Fixes: 052fb0d635 ("proc: report file/anon bit in /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup. The trick is to allocate a lot of PMD page tables. Linux
kernel doesn't account PMD tables to the process, only PTE.
The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low. oom_score for the process will be 0.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)
#define NR_PUD 130000
int main(void)
{
char *addr = NULL;
unsigned long i;
prctl(PR_SET_THP_DISABLE);
for (i = 0; i < NR_PUD ; i++) {
addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
if (addr == MAP_FAILED) {
perror("mmap");
break;
}
*addr = 'x';
munmap(addr, PMD_SIZE);
mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
if (addr == MAP_FAILED)
perror("re-mmap"), exit(1);
}
printf("PID %d consumed %lu KiB in PMD page tables\n",
getpid(), i * 4096 >> 10);
return pause();
}
The patch addresses the issue by account PMD tables to the process the
same way we account PTE.
The main place where PMD tables is accounted is __pmd_alloc() and
free_pmd_range(). But there're few corner cases:
- HugeTLB can share PMD page tables. The patch handles by accounting
the table to all processes who share it.
- x86 PAE pre-allocates few PMD tables on fork.
- Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
check on exit(2).
Accounting only happens on configuration where PMD page table's level is
present (PMD is not folded). As with nr_ptes we use per-mm counter. The
counter value is used to calculate baseline for badness score by
oom-killer.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
detect zero_page in /proc/kpageflags, and then do memory analysis more
accurately.
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have to handle non-linear mappings for /proc/PID/{smaps,clear_refs}
which is unused now. Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are only 3 callers and quite a bit of that thing is executed
exactly in one of those. Just lift it there...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch include CMA info (CMATotal, CMAFree) in /proc/meminfo.
Currently, in a CMA enabled system, if somebody wants to know the total
CMA size declared, there is no way to tell, other than the dmesg or
/var/log/messages logs.
With this patch we are showing the CMA info as part of meminfo, so that it
can be determined at any point of time. This will be populated only when
CMA is enabled.
Below is the sample output from a ARM based device with RAM:512MB and CMA:16MB.
MemTotal: 471172 kB
MemFree: 111712 kB
MemAvailable: 271172 kB
.
.
.
CmaTotal: 16384 kB
CmaFree: 6144 kB
This patch also fix below checkpatch errors that were found during these changes.
ERROR: space required after that ',' (ctx:ExV)
199: FILE: fs/proc/meminfo.c:199:
+ ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
^
ERROR: space required after that ',' (ctx:ExV)
202: FILE: fs/proc/meminfo.c:202:
+ ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
^
ERROR: space required after that ',' (ctx:ExV)
206: FILE: fs/proc/meminfo.c:206:
+ ,K(totalcma_pages)
^
total: 3 errors, 0 warnings, 2 checks, 236 lines checked
Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.
CPU0 CPU1
show_interrupts()
desc = irq_to_desc(X);
free_desc(desc)
remove_from_radix_tree();
kfree(desc);
raw_spinlock_irq(&desc->lock);
/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.
The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.
For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.
Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.
Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.
Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.
Fixes: 1f5a5b87f7 "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
- Expose the knob to user space through a proc file /proc/<pid>/setgroups
A value of "deny" means the setgroups system call is disabled in the
current processes user namespace and can not be enabled in the
future in this user namespace.
A value of "allow" means the segtoups system call is enabled.
- Descendant user namespaces inherit the value of setgroups from
their parents.
- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.
- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.
This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.
A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already can not call setgroups so this
is a noop. Prodcess with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().
This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot. The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.
As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
proc_flush_task_mnt() always tries to flush task/pid, but this is
pointless if we reap the leader. d_invalidate() is recursive, and
if nothing else the next d_hash_and_lookup(tgid) should fail anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check. Other callers already use it
directly without additional checks.
Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
tgid/ngid and drop rcu lock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. The usage of fdt looks very ugly, it can't be NULL if ->files is
not NULL. We can use "unsigned int max_fds" instead.
2. This also allows to move seq_printf(max_fds) outside of task_lock()
and join it with the previous seq_printf(). See also the next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_state() reads cred->group_info under task_lock() because a long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
adds the confusion, move task_unlock() up.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Better to use existing macro that rewriting them.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a lot of netdevices are created, one of the bottleneck is the
creation of proc entries. This serie aims to accelerate this part.
The current implementation for the directories in /proc is using a single
linked list. This is slow when handling directories with large numbers of
entries (eg netdevice-related entries when lots of tunnels are opened).
This patch replaces this linked list by a red-black tree.
Here are some numbers:
dummy30000.batch contains 30 000 times 'link add type dummy'.
Before the patch:
$ time ip -b dummy30000.batch
real 2m31.950s
user 0m0.440s
sys 2m21.440s
$ time rmmod dummy
real 1m35.764s
user 0m0.000s
sys 1m24.088s
After the patch:
$ time ip -b dummy30000.batch
real 2m0.874s
user 0m0.448s
sys 1m49.720s
$ time rmmod dummy
real 1m13.988s
user 0m0.000s
sys 1m1.008s
The idea of improving this part was suggested by Thierry Herbelot.
[akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Thierry Herbelot <thierry.herbelot@6wind.com>.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As a small zero page, huge zero page should not be accounted in smaps
report as normal page.
For small pages we rely on vm_normal_page() to filter out zero page, but
vm_normal_page() is not designed to handle pmds. We only get here due
hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not
necessary compatible on each and every architecture.
Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will
detect huge zero page for us.
We would need pmd_dirty() helper to do this properly. The patch adds it
to THP-enabled architectures which don't yet have one.
[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengwei Yin <yfw.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.
That way struct proc_ns becomes invisible outside of fs/proc/*.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
MPX-enabled applications using large swaths of memory can
potentially have large numbers of bounds tables in process
address space to save bounds information. These tables can take
up huge swaths of memory (as much as 80% of the memory on the
system) even if we clean them up aggressively. In the worst-case
scenario, the tables can be 4x the size of the data structure
being tracked. IOW, a 1-page structure can require 4 bounds-table
pages.
Being this huge, our expectation is that folks using MPX are
going to be keen on figuring out how much memory is being
dedicated to it. So we need a way to track memory use for MPX.
If we want to specifically track MPX VMAs we need to be able to
distinguish them from normal VMAs, and keep them from getting
merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
both of those things. With this flag, MPX bounds-table VMAs can
be distinguished from other VMAs, and userspace can also walk
/proc/$pid/smaps to get memory usage for MPX.
In addition to this flag, we also introduce a special ->vm_ops
specific to MPX VMAs (see the patch "add MPX specific mmap
interface"), but currently different ->vm_ops do not by
themselves prevent VMA merging, so we still need this flag.
We understand that VM_ flags are scarce and are open to other
options.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
seq_printf functions shouldn't really check the return value.
Checking seq_has_overflowed() occasionally is used instead.
Update vfs documentation.
Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com
Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
[ did a few clean ups ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0'); /* new PTE allows write access */
assert(!soft_dirty(x));
*m = 'x'; /* should dirty the page */
assert(soft_dirty(x)); /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf2414 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always mark pages with PageBalloon even if balloon compaction is disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.
Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
"balloon_deflate" and "balloon_migrate". They accumulate balloon
activity. Current size of balloon is (balloon_inflate - balloon_deflate)
pages.
All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON.
It should be selected by ballooning driver which wants use this feature.
Currently virtio-balloon is the only user.
Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty. Here's a program to demonstrate the bug:
int main() {
const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
uint64_t pme[3];
int fd = open("/proc/self/pagemap", O_RDONLY);;
char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
munmap(m + getpagesize(), getpagesize());
pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
assert(pme[0] & PAGEMAP_SOFTDIRTY); /* passes */
assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
assert(pme[2] & PAGEMAP_SOFTDIRTY); /* passes */
return 0;
}
(Note that all pages in new VMAs are softdirty until cleared).
Tested:
Used the program given above. I'm going to include this code in
a selftest in the future.
[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9e7814404b "hold task->mempolicy while numa_maps scans." fixed the
race with the exiting task but this is not enough.
The current code assumes that get_vma_policy(task) should either see
task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
by hold_task_mempolicy(), so we can never race with __mpol_put(). But
this can only work if we can't race with do_set_mempolicy(), and thus
we can't race with another do_set_mempolicy() or do_exit() after that.
However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
race. This task can exec, change it's ->mm, and call do_set_mempolicy()
after that; in this case they take 2 different locks.
Change hold_task_mempolicy() to use get_task_policy(), it never returns
NULL, and change show_numa_map() to use __get_vma_policy() or fall back
to proc_priv->task_mempolicy.
Note: this is the minimal fix, we will cleanup this code later. I think
hold_task_mempolicy() and release_task_mempolicy() should die, we can
move this logic into show_numa_map(). Or we can move get_task_policy()
outside of ->mmap_sem and !CONFIG_NUMA code at least.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On some ARCHs modules range is eauql to vmalloc range. E.g on i686
"#define MODULES_VADDR VMALLOC_START"
"#define MODULES_END VMALLOC_END"
This will cause 2 duplicate program segments in /proc/kcore, and no flag
to indicate they are different. This is confusing. And usually people
who need check the elf header or read the content of kcore will check
memory ranges. Two program segments which are the same are unnecessary.
So check if the modules range is equal to vmalloc range. If so, just skip
adding the modules range.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Rename vm_is_stack() to task_of_stack() and change it to return
"struct task_struct *" rather than the global (and thus wrong in
general) pid_t.
- Add the new pid_of_stack() helper which calls task_of_stack() and
uses the right namespace to report the correct pid_t.
Unfortunately we need to define this helper twice, in task_mmu.c
and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
and move at least pid_of_stack/task_of_stack there to avoid the
code duplication.
- Change show_map_vma() and show_numa_map() to use the new helper.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
m_start() can use get_proc_task() instead, and "struct inode *"
provides more potentially useful info, see the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU
but the usage of task->mm in m_stop(). The task can exit/exec before
we take mmap_sem, in this case m_stop() can hit NULL or unlock the
wrong rw_semaphore.
Also, this code uses priv->task != NULL to decide whether we need
up_read/mmput. This is correct, but we will probably kill priv->task.
Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
from m_start() to proc_maps_open()" into task_nommu.c.
Change maps_open() to initialize priv->mm using proc_mem_open(), m_start()
can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the main loop in m_start() to update m->version. Mostly for
consistency, but this can help to avoid the same loop if the very
1st ->show() fails due to seq_overflow().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>