As noted in 83d349f35e ("x86: don't send
an IPI to the empty set of CPU's"), some APIC's will be very unhappy
with an empty destination mask. That commit added a WARN_ON() for that
case, and avoided the resulting problem, but didn't fix the underlying
reason for why those empty mask cases happened.
This fixes that, by checking the result of 'cpumask_andnot()' of the
current CPU actually has any other CPU's left in the set of CPU's to be
sent a TLB flush, and not calling down to the IPI code if the mask is
empty.
The reason this started happening at all is that we started passing just
the CPU mask pointers around in commit 4595f9620 ("x86: change
flush_tlb_others to take a const struct cpumask"), and when we did that,
the cpumask was no longer thread-local.
Before that commit, flush_tlb_mm() used to create it's own copy of
'mm->cpu_vm_mask' and pass that copy down to the low-level flush
routines after having tested that it was not empty. But after changing
it to just pass down the CPU mask pointer, the lower level TLB flush
routines would now get a pointer to that 'mm->cpu_vm_mask', and that
could still change - and become empty - after the test due to other
CPU's having flushed their own TLB's.
See
http://bugzilla.kernel.org/show_bug.cgi?id=13933
for details.
Tested-by: Thomas Björnell <thomas.bjornell@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With VMALLOC_END included in the calculation of MAXMEM (as of
2.6.28) it is no longer correct to also bump __VMALLOC_RESERVE
in reserve_top_address(). Doing so results in needlessly small
lowmem.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A71DD2A020000780000D482@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The code was incorrectly reserving memtypes using the page
virtual address instead of the physical address. Furthermore,
the code was not ignoring highmem pages as it ought to.
( upstream does not pass in highmem pages yet - but upcoming
graphics code will do it and there's no reason to not handle
this properly in the CPA APIs.)
Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=13884
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: <stable@kernel.org>
Cc: dri-devel@lists.sourceforge.net
Cc: venkatesh.pallipadi@intel.com
LKML-Reference: <1249284345-7654-1-git-send-email-thellstrom@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Changeset 3869c4aa18
that went in after 2.6.30-rc1 was a seemingly small change to _set_memory_wc()
to make it complaint with SDM requirements. But, introduced a nasty bug, which
can result in crash and/or strange corruptions when set_memory_wc is used.
One such crash reported here
http://lkml.org/lkml/2009/7/30/94
Actually, that changeset introduced two bugs.
* change_page_attr_set() takes &addr as first argument and can the addr value
might have changed on return, even for single page change_page_attr_set()
call. That will make the second change_page_attr_set() in this routine
operate on unrelated addr, that can eventually cause strange corruptions
and bad page state crash.
* The second change_page_attr_set() call, before setting _PAGE_CACHE_WC, should
clear the earlier _PAGE_CACHE_UC_MINUS, as otherwise cache attribute will not
be WC (will be UC instead).
The patch below fixes both these problems. Sending a single patch to fix both
the problems, as the change is to the same line of code. The change to have a
addr_copy is not very clean. But, it is simpler than making more changes
through various routines in pageattr.c.
A huge thanks to Jerome for reporting this problem and providing a simple test
case that helped us root cause the problem.
Reported-by: Jerome Glisse <glisse@freedesktop.org>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20090730214319.GA1889@linux-os.sc.intel.com>
Acked-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This functionality is needed to kmap_atomic() highmem pages that may
potentially have or are about to set up other mappings with
non-standard caching attributes.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()
Upcoming paches to support the new 64-bit "BookE" powerpc architecture
will need to have the virtual address corresponding to PTE page when
freeing it, due to the way the HW table walker works.
Basically, the TLB can be loaded with "large" pages that cover the whole
virtual space (well, sort-of, half of it actually) represented by a PTE
page, and which contain an "indirect" bit indicating that this TLB entry
RPN points to an array of PTEs from which the TLB can then create direct
entries. Thus, in order to invalidate those when PTE pages are deleted,
we need the virtual address to pass to tlbilx or tlbivax instructions.
The old trick of sticking it somewhere in the PTE page struct page sucks
too much, the address is almost readily available in all call sites and
almost everybody implemets these as macros, so we may as well add the
argument everywhere. I added it to the pmd and pud variants for consistency.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV]
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Need to clear both nodes and nodes_add state for start/end.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <20090718065657.GA2898@basil.fritz.box>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Since commit 5fd29d6c ("printk: clean up handling of log-levels
and newlines"), the kernel logs segfaults like:
<6>gnome-power-man[24509]: segfault at 20 ip 00007f9d4950465a sp 00007fffbb50fc70 error 4 in libgobject-2.0.so.0.2103.0[7f9d494f7000+45000]
with the extra "<6>" being KERN_INFO. This happens because the
printk in show_signal_msg() started with KERN_CONT and then
used "%s" to pass in the real level; and KERN_CONT is no longer
an empty string, and printk only pays attention to the level at
the very beginning of the format string.
Therefore, remove the KERN_CONT from this printk, since it is
now actively causing problems (and never really made any
sense).
Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <874otjitkj.fsf@shaolin.home.digitalvampire.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Alex found that specjbb2005 still can not run with hugepages on an
x86-64 machine. This only happens when numa is not compiled in.
The root cause: node_set_state will not set it back for us in that case,
so don't clear that when numa is not select in config
[ v2: use node_clear_state instead ]
Reported-and-Tested-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This sparse warning:
arch/x86/mm/init.c:83:16: warning: symbol 'check_efer' was not declared. Should it be static?
triggers because check_efer() is not decalared before using it.
asm/proto.h includes the declaration of check_efer(), so
including asm/proto.h to fix that - this also addresses the
sparse warning.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1246458263.6940.22.camel@hpdv5.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Nathan reported that
| commit 73d60b7f74
| Author: Yinghai Lu <yinghai@kernel.org>
| Date: Tue Jun 16 15:33:00 2009 -0700
|
| page-allocator: clear N_HIGH_MEMORY map before we set it again
|
| SRAT tables may contains nodes of very small size. The arch code may
| decide to not activate such a node. However, currently the early boot
| code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
| active although these nodes have no present pages.
|
| For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too
unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
an i386 kvm guest, meaning that cpuset.mems can not be used.
Fix this by only clearing node_states[N_NORMAL_MEMORY] for 64bit only.
and need to do save/restore for that in find_zone_movable_pfn
Reported-by: Nathan Lynch <ntl@pobox.com>
Tested-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The init_gbpages() function is conditionally called from
init_memory_mapping() function. There are two call-sites where
this 'after_bootmem' condition can be true: setup_arch() and
mem_init() via pci_iommu_alloc().
Therefore, it's safe to move the call to init_gbpages() to
setup_arch() as it's always called before mem_init().
This removes an after_bootmem use - paving the way to remove
all uses of that state variable.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <Pine.LNX.4.64.0906221731210.19474@melkki.cs.Helsinki.FI>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
lpage allocator aliases a PMD page for each cpu and returns whatever
is unused to the page allocator. When the pageattr of the recycled
pages are changed, this makes the two aliases point to the overlapping
regions with different attributes which isn't allowed and known to
cause subtle data corruption in certain cases.
This can be handled in simliar manner to the x86_64 highmap alias.
pageattr code should detect if the target pages have PMD alias and
split the PMD alias and synchronize the attributes.
pcpur allocator is updated to keep the allocated PMD pages map sorted
in ascending address order and provide pcpu_lpage_remapped() function
which binary searches the array to determine whether the given address
is aliased and if so to which address. pageattr is updated to use
pcpu_lpage_remapped() to detect the PMD alias and split it up as
necessary from cpa_process_alias().
Jan Beulich spotted the original problem and incorrect usage of vaddr
instead of laddr for lookup.
With this, lpage percpu allocator should work correctly. Re-enable
it.
[ Impact: fix subtle lpage pageattr bug and re-enable lpage ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reorganize cpa_process_alias() so that new alias condition can be
added easily.
Jan Beulich spotted problem in the original cleanup thread which
incorrectly assumed the two existing conditions were mutially
exclusive.
[ Impact: code reorganization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
This allows the callers to now pass down the full set of FAULT_FLAG_xyz
flags to handle_mm_fault(). All callers have been (mechanically)
converted to the new calling convention, there's almost certainly room
for architectures to clean up their code and then add FAULT_FLAG_RETRY
when that support is added.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's really not right to use 'access_ok()', since that is meant for the
normal "get_user()" and "copy_from/to_user()" accesses, which are done
through the TLB, rather than through the page tables.
Why? access_ok() does both too few, and too many checks. Too many,
because it is meant for regular kernel accesses that will not honor the
'user' bit in the page tables, and because it honors the USER_DS vs
KERNEL_DS distinction that we shouldn't care about in GUP. And too few,
because it doesn't do the 'canonical' check on the address on x86-64,
since the TLB will do that for us.
So instead of using a function that isn't meant for this, and does
something else and much more complicated, just do the real rules: we
don't want the range to overflow, and on x86-64, we want it to be a
canonical low address (on 32-bit, all addresses are canonical).
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Improve a few details in perfcounter call-chain recording that
makes use of fast-GUP:
- Use ACCESS_ONCE() to observe the pte value. ptes are fundamentally
racy and can be changed on another CPU, so we have to be careful
about how we access them. The PAE branch is already careful with
read-barriers - but the non-PAE and 64-bit side needs an
ACCESS_ONCE() to make sure the pte value is observed only once.
- make the checks a bit stricter so that we can feed it any kind of
cra^H^H^H user-space input ;-)
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prefetch instructions can generate spurious faults on certain
models of older CPUs. The faults themselves cannot be stopped
and they can occur pretty much anywhere - so the way we solve
them is that we detect certain patterns and ignore the fault.
There is one small path of code where we must not take faults
though: the #PF handler execution leading up to the reading
of the CR2 (the faulting address). If we take a fault there
then we destroy the CR2 value (with that of the prefetching
instruction's) and possibly mishandle user-space or
kernel-space pagefaults.
It turns out that in current upstream we do exactly that:
prefetchw(&mm->mmap_sem);
/* Get the faulting address: */
address = read_cr2();
This is not good.
So turn around the order: first read the cr2 then prefetch
the lock address. Reading cr2 is plenty fast (2 cycles) so
delaying the prefetch by this amount shouldnt be a big issue
performance-wise.
[ And this might explain a mystery fault.c warning that sometimes
occurs on one an old AMD/Semptron based test-system i have -
which does have such prefetch problems. ]
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
LKML-Reference: <20090616030522.GA22162@Krystal>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a gup_fast() variant which is usable from IRQ/NMI context.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Nick Piggin <npiggin@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We've had some troubles in the past with weird instructions. This
patch adds a self-test framework which can be used to verify that
a certain set of opcodes are decoded correctly. Of course, the
opcodes which are not tested can still give the wrong results.
In short, this is just a safeguard to catch unintentional changes
in the opcode decoder. It does not mean that errors can't still
occur!
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This adds support for tracking the initializedness of memory that
was allocated with the page allocator. Highmem requests are not
tracked.
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
[build fix for !CONFIG_KMEMCHECK]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
As these are allocated using the page allocator, we need to pass
__GFP_NOTRACK before we add page allocator support to kmemcheck.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
The hooks that we modify are:
- Page fault handler (to handle kmemcheck faults)
- Debug exception handler (to hide pages after single-stepping
the instruction that caused the page fault)
Also redefine memset() to use the optimized version if kmemcheck is
enabled.
(Thanks to Pekka Enberg for minimizing the impact on the page fault
handler.)
As kmemcheck doesn't handle MMX/SSE instructions (yet), we also disable
the optimized xor code, and rely instead on the generic C implementation
in order to avoid false-positive warnings.
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
[whitespace fixlet]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Lets use kmemcheck_pte_lookup() in kmemcheck_fault() instead of
open-coding it there.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This patch moves the CONFIG_X86_64 ifdef out of kmemcheck_opcode_decode() by
introducing a version of the function that always returns false for
CONFIG_X86_32.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Multiple ifdef'd definitions of the same global variable is ugly and
error-prone. Fix that up.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
The "Bugs, beware!" printout during is cute but confuses users that something
bad happened so change the text to the more boring "Initialized" message.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This patch reorders code in error.c so that we can get rid of the forward
declarations.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
kmemcheck/shadow.c needs to include <linux/module.h> to prevent
the following warnings:
linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning : data definition has no type or storage class
linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning : type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning : parameter names (without types) in function declaration
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: vegardno@ifi.uio.no
Cc: penberg@cs.helsinki.fi
Cc: akpm <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
General description: kmemcheck is a patch to the linux kernel that
detects use of uninitialized memory. It does this by trapping every
read and write to memory that was allocated dynamically (e.g. using
kmalloc()). If a memory address is read that has not previously been
written to, a message is printed to the kernel log.
Thanks to Andi Kleen for the set_memory_4k() solution.
Andrew Morton suggested documenting the shadow member of struct page.
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
[export kmemcheck_mark_initialized]
[build fix for setup_max_cpus]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
kernel_physical_mapping_init() could be called in memory hotplug path.
[ Impact: fix potential crash with memory hotplug ]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <20090612045752.GA827@sli10-desk.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pure renames only, to PERF_COUNT_HW_* and PERF_COUNT_SW_*.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit c9690998ef (x86: memtest: remove
64-bit division) introduced following compile warning:
arch/x86/mm/memtest.c: In function 'memtest':
arch/x86/mm/memtest.c:56: warning: comparison of distinct pointer types lacks a cast
arch/x86/mm/memtest.c:58: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Using gcc 3.3.5 a "make allmodconfig" + "CONFIG_KVM=n"
triggers a build error:
arch/x86/mm/built-in.o(.init.text+0x43f7): In function `__change_page_attr':
arch/x86/mm/pageattr.c:114: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1
The culprit turned out to be a division in arch/x86/mm/memtest.c
For more info see this thread:
http://marc.info/?l=linux-kernel&m=124416232620683
The patch entirely removes the division that caused the build
error.
[ Impact: build fix with certain GCC versions ]
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: xiyou.wangcong@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <20090608170939.GB12431@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302
On x86 and x86-64, it is possible that page tables are shared beween
shared mappings backed by hugetlbfs. As part of this,
page_table_shareable() checks a pair of vma->vm_flags and they must match
if they are to be shared. All VMA flags are taken into account, including
VM_LOCKED.
The problem is that VM_LOCKED is cleared on fork(). When a process with a
shared memory segment forks() to exec() a helper, there will be shared
VMAs with different flags. The impact is that the shared segment is
sometimes considered shareable and other times not, depending on what
process is checking.
What happens is that the segment page tables are being shared but the
count is inaccurate depending on the ordering of events. As the page
tables are freed with put_page(), bad pmd's are found when some of the
children exit. The hugepage counters also get corrupted and the Total and
Free count will no longer match even when all the hugepage-backed regions
are freed. This requires a reboot of the machine to "fix".
This patch addresses the problem by comparing all flags except VM_LOCKED
when deciding if pagetables should be shared or not for hugetlbfs-backed
mapping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <starlight@binnacle.cx>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup cpa_flush_array() to avoid back to back on_each_cpu() calls.
[ Impact: optimizes fix 0af48f42df ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
cpa_flush_array seems to prefer wbinvd() over clflush at 4M threshold.
clflush needs to be done on only one CPU as per instruction definition.
wbinvd() however, should be done on all CPUs.
[ Impact: fix missing flush which could cause data corruption ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
wbinvd is supported on all CPUs 486 or later. But,
pageattr.c is checking x86_model >= 4 before wbinvd(), which looks like
an oversight bug. It was first introduced at one place by changeset
d7c8f21a8c and got copied over to second
place in the same file later.
[ Impact: fix missing cache flush on early-model CPUs, potential data corruption ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Recently there were some changes to the meaning of node_possible_map,
and it is quite strange:
- the node without memory would be set in node_possible_map
- but some node with less NODE_MIN_SIZE will be kicked out of node_possible_map.
fix it by adding strict_setup_node_bootmem().
Also, remove unparse_node().
so result will be:
1. cpu_to_node() will return online node only (nearest one)
2. apicid_to_node() still returns the node that could be not online but is set
in node_possible_map.
3. node_possible_map will include nodes that mem on it are less NODE_MIN_SIZE
v2: after move_cpus_to_node change.
[ Impact: get node_possible_map right ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <4A0C49BE.6080800@kernel.org>
[ v3: various small cleanups and comment clarifications ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
after:
| commit b263295dbf
| Author: Christoph Lameter <clameter@sgi.com>
| Date: Wed Jan 30 13:30:47 2008 +0100
|
| x86: 64-bit, make sparsemem vmemmap the only memory model
we don't have MEMORY_HOTPLUG_RESERVE anymore.
Historically, x86-64 had an architecture-specific method for memory hotplug
whereby it scanned the SRAT for physical memory ranges that could be
potentially used for memory hot-add later. By reserving those ranges
without physical memory, the memmap would be allocated and left dormant
until needed. This depended on the DISCONTIG memory model which has been
removed so the code implementing HOTPLUG_RESERVE is now dead.
This patch removes the dead code used by MEMORY_HOTPLUG_RESERVE.
(Changelog authored by Mel.)
v2: updated changelog, and remove hotadd= in doc
[ Impact: remove dead code ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Workflow-found-OK-by: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A0C4910.7090508@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With sparse memory, holes should not be marked present for memmap.
This patch makes sure sparsemem really works on SMP mode (!NUMA).
[ Impact: use less memory to map fragmented RAM, avoid boot-OOM/crash ]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
LKML-Reference: <1242117600.22431.0.camel@sli10-desk.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's no need to use call memory_present() manually on UMA because
initmem_init() sets up early_node_map by calling
e820_register_active_regions().
[ Impact: cleanup ]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1241699742.17846.31.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit UMA and NUMA versions of paging_init() are almost identical.
Therefore, merge the copy in mm/numa_64.c to mm/init_64.c to remove
duplicate code.
[ Impact: cleanup ]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1241699741.17846.30.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It is expected that there might be slight differences between the e820
map and the SRAT table and the intention was that 1MB of slack be allowed.
The calculation comparing e820ram and pxmram assumes the units are bytes,
when they are in fact pages. This means 4GB of slack is being allowed,
not 1MB. This patch makes the correct comparison.
comment is from Mel.
[ Impact: don't accept buggy SRATs that could dump up to 4G of RAM ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A03E13E.6050107@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
node_cover_memory() sanity checks the SRAT table by ensuring that all
PXMs cover the memory reported in the e820.
However, when calculating the size of the holes in the e820, it uses
the early_node_map[] which contains information taken from both SRAT
and e820. If the SRAT is missing an entry, then it is not detected
that the SRAT table is incorrect and missing entries.
This patch uses the e820 map to calculate the holes instead of
early_node_map[].
comment is from Mel.
[ Impact: reject incorrect SRAT tables ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A03E10C.60906@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do this so we can check the range that is mapped before
init_memory_mapping().
To be able to print out meaningful info, we first have to fix
64-bit to have max_pfn_mapped assigned before that call. This
also unifies the code-path a bit.
[ Impact: print more debug info, cleanup ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49BF0978.40605@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the introduction of the .brk section, special care must be taken
that no unused page table entries remain if _brk_end and _end are
separated by a 2M page boundary. cleanup_highmap() runs very early and
hence cannot take care of that, hence potential entries needing to be
removed past _brk_end must be cleared once the brk allocator has done
its job.
[ Impact: avoids undesirable TLB aliases ]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>