arch/x86/pci/Makefile_32 has a nasty detail. VISWS and NUMAQ build
override the generic pci-y rules. This needs a proper cleanup, but
that needs more thoughts. Undo
commit 895d30935e
x86: numaq fix
do not override the existing pci-y rule when adding visws or
numaq rules.
There is also a stupid init function ordering problem vs. acpi.o
Add comments to the Makefile to avoid tripping over this again.
Remove the srat stub code in discontig_32.c to allow a proper NUMAQ
build.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this function doesnt just 'find' the max_pfn - it also has
other side-effects such as registering sparse memory maps.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
we could call find_max_pfn() directly instead of setup_memory() to get
max_pfn needed for mtrr trimming.
otherwise setup_memory() is called two times... that is duplicated...
[ mingo@elte.hu: both Thomas and me simulated a double call to
setup_bootmem_allocator() and can confirm that it is a real bug
which can hang in certain configs. It's not been reported yet but
that is probably due to the relatively scarce nature of
MTRR-trimming systems. ]
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patchset adds a flags variable to reserve_bootmem() and uses the
BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions
between crashkernel area and already used memory.
This patch:
Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE.
If that flag is set, the function returns with -EBUSY if the memory already
has been reserved in the past. This is to avoid conflicts.
Because that code runs before SMP initialisation, there's no race condition
inside reserve_bootmem_core().
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix powerpc build]
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: <linux-arch@vger.kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The DISCONTIG memory model on x86 32 bit uses a remap allocator early
in boot. The objective is that portions of every node are mapped in to
the kernel virtual area (KVA) in place of ZONE_NORMAL so that node-local
allocations can be made for pgdat and mem_map structures.
With SPARSEMEM, the amount that is set aside is insufficient for all the
mem_maps to be allocated. During the boot process, it falls back to using
the bootmem allocator. This breaks assumptions that SPARSEMEM makes about
the layout of the mem_map in memory and results in a VM_BUG_ON triggering
due to pfn_to_page() returning garbage values.
This patch only enables the remap allocator for use with DISCONTIG.
Without SRAT support, a compile-error occurs because ACPI table parsing
functions are only available in x86-64. This patch also adds no-op stubs
and prints a warning message. What likely needs to be done is sharing
the table parsing functions between 32 and 64 bit if they are
compatible.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On 32-bit NUMA, the memmap representing struct pages on each node is
allocated from node-local memory if possible. As only node-0 has memory from
ZONE_NORMAL, the memmap must be mapped into low memory. This is done by
reserving space in the Kernel Virtual Area (KVA) for the memmap belonging
to other nodes by taking pages from the end of ZONE_NORMAL and remapping
the other nodes memmap into those virtual addresses. The node boundaries
are then adjusted so that the region of pages is not used and it is marked
as reserved in the bootmem allocator.
This reserved portion of the KVA is PMD aligned althought
strictly speaking that requirement could be lifted (see thread at
http://lkml.org/lkml/2007/8/24/220). The problem is that when aligned, there
may be a portion of ZONE_NORMAL at the end that is not used for memmap and
does not have an initialised memmap nor is it marked reserved in the bootmem
allocator. Later in the boot process, these pages are freed and a storm of
Bad page state messages result.
This patch marks these pages reserved that are wasted due to alignment
in the bootmem allocator so they are not accidently freed. It is worth
noting that memory from node-0 is wasted where it could have been put into
ZONE_HIGHMEM on NUMA machines. Worse, the KVA is always reserved from the
location of real memory even when there is plenty of spare virtual address
space.
This patch also makes sure that reserve_bootmem() is not called with a
0-length size in numa_kva_reserve(). When this happens, it usually means
that a kernel built for Summit is being booted on a normal machine. The
resulting BUG_ON() is misleading so it is caught here.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The boot protocol has until now required that the initrd be located in
lowmem, which makes the lowmem/highmem boundary visible to the boot
loader. This was exported to the bootloader via a compile-time
field. Unfortunately, the vmalloc= command-line option breaks this
part of the protocol; instead of adding yet another hack that affects
the bootloader, have the kernel relocate the initrd down below the
lowmem boundary inside the kernel itself.
Note that this does not rely on HIGHMEM being enabled in the kernel.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
node0_bdata and paddr_to_nid() can become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch makes some needlessly global variables static.
[ tglx: arch/x86 adaptation ]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It's not necessarily a very sane configuration, but people running "make
randconfig" noticed it wouldn't compile. This fixes some obvious
problems in discontig.c to allow a clean compile.
Acked-by: andrew hendry <andrew.hendry@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extern declarations belong in headers. Times, they are a'changin.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
===================================================================
Fix compile error when config memory hotplug with numa on i386.
The cause of compile error was missing of arch_add_memory(),
remove_memory(), and memory_add_physaddr_to_nid().
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@cs.washington.edu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Arch-independent zone-sizing is using indices instead of symbolic names to
offset within an array related to zones (max_zone_pfns). The unintended
impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead
of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set. As a result, the
the machine fails to boot but will boot with CONFIG_HIGHMEM turned off.
The following patch properly initialises the max_zone_pfns[] array and uses
symbolic names instead of indices in each architecture using
arch-independent zone-sizing. Two users have successfully booted their
powerpcs with it (one an ibook G4). It has also been boot tested on x86,
x86_64, ppc64 and ia64. Please merge for 2.6.19-rc2.
Credit to Benjamin Herrenschmidt for identifying the bug and rolling the
first fix. Additional credit to Johannes Berg and Andreas Schwab for
reporting the problem and testing on powerpc.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This changes a couple of if() BUG(); constructs to
BUG_ON(); so it can be safely optimized away.
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM. Maybe we can find
a way to optimize the lookup in the future.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix array initialization in lots of arches
The number of zones may now be reduced from 4 to 2 for many arches. Fix the
array initialization for the zones array for all architectures so that it is
not initializing a fixed number of elements.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Address a long standing issue of booting with an initrd on an i386 numa
system. Currently (and always) the numa kva area is mapped into low memory
by finding the end of low memory and moving that mark down (thus creating
space for the kva). The issue with this is that Grub loads initrds into
this similar space so when the kernel check the initrd it finds it outside
max_low_pfn and disables it (it thinks the initrd is not mapped into usable
memory) thus initrd enabled kernels can't boot i386 numa :(
My solution to the problem just converts the numa kva area to use the
bootmem allocator to save it's area (instead of moving the end of low
memory). Using bootmem allows the kva area to be mapped into more diverse
addresses (not just the end of low memory) and enables the kva area to be
mapped below the initrd if present.
I have tested this patch on numaq(no initrd) and summit(initrd) i386 numa
based systems.
[akpm@osdl.org: cleanups]
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move initialization of all memory end variables to as early as
possible, so that dependent code doesn't need to check whether these
variables have already been set.
Change the range check in kunmap_atomic to actually make use of this
so that the no-mapping-estabished path (under CONFIG_DEBUG_HIGHMEM)
gets used only when the address is inside the lowmem area (and BUG()
otherwise).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Just about every architecture defines some macros to do operations on pfns.
They're all virtually identical. This patch consolidates all of them.
One minor glitch is that at least i386 uses them in a very skeletal header
file. To keep away from #include dependency hell, I stuck the new
definitions in a new, isolated header.
Of all of the implementations, sh64 is the only one that varied by a bit.
It used some masks to ensure that any sign-extension got ripped away before
the arithmetic is done. This has been posted to that sh64 maintainers and
the development list.
Compiles on x86, x86_64, ia64 and ppc64.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Because pgdat_list was linked to pgdat_list in *reverse* order, (By default)
some of arch has to sort it by themselves.
for_each_pgdat has gone..for_each_online_pgdat() uses node_online_map, which
doesn't need to be sorted.
This patch removes codes for sorting pgdat.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adds the necessary for non-NUMA hot-add of highmem to an existing zone on
i386.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mark variables which are usually accessed for reads with __readmostly.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move the fix to align node_end_pfns to a proper location. The earlier fix
made the node_remap_start_vaddr to get misaligned causing remap_numa_kva to
barf again :-/
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While reserving KVA for lmem_maps of node, we have to make sure that
node_remap_start_pfn[] is aligned to a proper pmd boundary.
(node_remap_start_pfn[] gets its value from node_end_pfn[])
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the x86 implementation of the crashkernel option. It reserves a
window of memory very early in the bootup process, so we never use it for
anything but the kernel to switch to when the running kernel panics.
In addition to reserving this memory a resource structure is registered so
looking at /proc/iomem it is clear what happened to that memory.
ISSUES:
Is it possible to implement this in a architecture generic way?
What should be done with architectures that always use an iommu and
thus don't report their RAM memory resources in /proc/iomem?
Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* EXPORT_SYMBOL's moved to other files
* #include <linux/config.h>, <linux/module.h> where needed
* #include's in i386_ksyms.c cleaned up
* After copy-paste, redundant due to Makefiles rules preprocessor directives
removed:
#ifdef CONFIG_FOO
EXPORT_SYMBOL(foo);
#endif
obj-$(CONFIG_FOO) += foo.o
* Tiny reformat to fit in 80 columns
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide the architecture specific implementation for SPARSEMEM for i386 SMP
and NUMA systems.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
discontig.c has some assumptions that mem_map[]s inside of a node are
contiguous. Teach it to make sure that each region that it's bringing online
is actually made up of valid ranges of ram.
Written-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce a simple allocator for the NUMA remap space. This space is very
scarce, used for structures which are best allocated node local.
This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
the pgdat->node_mem_map initialized in a consistent place for all
architectures.
Issues:
o alloc_remap takes a node_id where we might expect a pgdat which was intended
to allow us to allocate the pgdat's using this mechanism; which we do not yet
do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The following four patches provide the last needed changes before the
introduction of sparsemem. For a more complete description of what this
will do, please see this patch:
http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-150-sparsemem.patch
or previous posts on the subject:
http://marc.theaimsgroup.com/?t=110868540700001&r=1&w=2http://marc.theaimsgroup.com/?l=linux-mm&m=109897373315016&w=2
Three of these are i386-only, but one of them reorganizes the macros
used to manage the space in page->flags, and will affect all platforms.
There are analogous patches to the i386 ones for ppc64, ia64, and
x86_64, but those will be submitted by the normal arch maintainers.
The combination of the four patches has been test-booted on a variety of
i386 hardware, and compiled for ppc64, i386, and x86-64 with about 17
different .configs. It's also been runtime-tested on ia64 configs (with
more patches on top).
This patch:
We _know_ which node pages in general belong to, at least at a very gross
level in node_{start,end}_pfn[]. Use those to target the allocations of
pages.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!