Since the trampoline code is now used for ACPI resume from suspend to RAM,
the trampoline page tables have to be fixed up during boot not only on SMP
systems, but also on UP systems that use the trampoline.
Reference: http://bugzilla.kernel.org/show_bug.cgi?id=10923
Reported-by: Dionisus Torimens <djtm@gmx.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some Dell laptops enter resume with apparent garbage in the segment
descriptor registers (almost certainly the result of a botched
transition from protected to real mode.) The only way to clean that
up is to enter protected mode ourselves and clean out the descriptor
registers.
This fixes resume on Dell XPS M1210 and Dell D620.
Reference: http://bugzilla.kernel.org/show_bug.cgi?id=10927
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The build of the Alpha Linux kernel currently fails[1] with inconsistent
kallsyms data. As I never saw that before, I thought about hardware
problems. But in fact it is a bug in the Linux kernel.
The end of the rodata section is marked with the "__end_rodata" symbol.
This symbol have different aligning constraints than the inittext parts
and therefor the start marked "_sinittext". Because of that the
__end_rodata symbol shifts between < _sinittext and == _sinittext. The
later variant is seen as a code symbol and recorded in the kallsyms data.
On fix would be to move the exception table a little bit and get some
space between that two areas.
[1]: http://buildd.debian.org/fetch.cgi?pkg=linux-2.6&arch=alpha&ver=2.6.25-5&stamp=1213919009&file=log&as=raw
Cc: maximilian attems <max@stro.at>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide __ucmpdi2() for MN10300 so that allmodconfig can be built.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Export kernel_thread() and empty_zero_page so that allmodconfig can be
built for MN10300.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When converting the page number in a pte/pmd/pud/pgd between
machine and pseudo-physical addresses, the converted result was
being truncated at 32-bits. This caused failures on machines
with more than 4G of physical memory.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: "Christopher S. Aker" <caker@theshore.net>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The txx9_tmr_init() will not clear a timer counter register in a certain
case. The counter register is cleared on 1->0 transition of TCE bit if
CRE=1. So just clearing the TCE bit is not enough.
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
The introduction of a real dma cache invalidate makes it important
to have a correct cache line size, otherwise the kernel will gives
out two memory segment, which might share one cache line. The R4400
Indy/Indigo2 CPU modules are using a second level cache line size
of 128 bytes, so MIPS_L1_CACHE_SHIFT needs to be bumped up to 7 for
IP22.
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
It's possible that the crime interrupt handler is called without
pending interrupts (probably a hardware issue). To avoid irritating
"unexpected irq 71" messages, we now just ignore the spurious crime
interrupts.
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
fix this warning:
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:524: warning: passing argument 2 of 'find_e820_area_size' from incompatible pointer type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fedora reports that mem_init()'s zap_low_mappings(), extended to SMP in
61165d7a03 x86: fix app crashes after SMP
resume causes 32-bit Intel Mac machines to reboot very early when
booting with EFI.
The EFI code appears to manage low mappings for itself when needed; but
like many before it, confuses PSE with PAE. So it has only been mapping
half the space it needed when PSE but not PAE. This remained unnoticed
until we moved the SMP zap_low_mappings() before
efi_enter_virtual_mode(). Presumably could have been noticed years ago
if anyone ran a UP kernel on such machines?
Reported-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Peter Jones <pjones@redhat.com>
Cc: Glauber Costa <gcosta@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Peter Jones <pjones@redhat.com>
Suspend/resume ("echo mem > /sys/power/state") does not work with
vanilla kernels -- the system does not suspend correctly and just
hangs. This patch fixes this so suspend/resume works:
1) of_iomap does not map the whole 0xC000 of the MPC5200 immr so
saving registers does not work.
2) PCI registers need to be saved and restored.
Signed-off-by: Tim Yamin <plasm@roo.me.uk>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
The legacy serial driver does not work with an 8250 type UART that is
described in the device tree with the reg-offset and reg-shift
properties. This change makes legacy_serial ignore these devices.
Signed-off-by: John Linn <john.linn@xilinx.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
This change to the makefile corrects the build of a simpleImage with initrd.
Signed-off-by: John Linn <john.linn@xilinx>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
commit 4323838215
x86: change size of node ids from u8 to s16
set the range for NODES_SHIFT to 1..15.
The possible range is 1..9
Fixes Bugzilla #10726
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The symbol account_system_vtime is used by the kvm module but
not exported. This breaks building with CONFIG_VIRT_CPU_ACCOUNTING
and CONFIG_KVM=m.
Signed-off-by: Doug Chapman <doug.chapman@hp.com>
Acked-by: Hidetosho Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
On a system where there are no hot pluggable cpus "additional_cpus"
is still set to -1 at the point where we call per_cpu_scan_finalize().
If we didn't find an SRAT table and so pick the default "32" for the
number of cpus, when we get to:
high_cpu = min(high_cpu + reserve_cpus, NR_CPUS);
we will end up initializing for just 31 cpus ... and so we will
die horribly when bringing up cpu#32.
Problem introduced by: 2c6e6db41f
"Minimize per_cpu reservations."
Acked-by: Robin Holt <holt@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch annotates the platform_secondary_init function in
arch/arm/mach-realview/platsmp.c with trace_hardirqs_off to avoid a
warning when LOCKDEP and TRACE_IRQFLAGS are enabled.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
When I update kernel 2.6.25 from 2.6.24, gdb does not work.
On 2.6.25, ptrace(PTRACE_GETFPXREGS, ...) returns ENODEV.
But 2.6.24 kernel's ptrace() returns EIO.
It is issue of compatibility.
I attached test program as pt.c and patch for fix it.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
struct user_fxsr_struct {
unsigned short cwd;
unsigned short swd;
unsigned short twd;
unsigned short fop;
long fip;
long fcs;
long foo;
long fos;
long mxcsr;
long reserved;
long st_space[32]; /* 8*16 bytes for each FP-reg = 128 bytes */
long xmm_space[32]; /* 8*16 bytes for each XMM-reg = 128 bytes */
long padding[56];
};
int main(void)
{
pid_t pid;
pid = fork();
switch(pid){
case -1:/* error */
break;
case 0:/* child */
child();
break;
default:
parent(pid);
break;
}
return 0;
}
int child(void)
{
ptrace(PTRACE_TRACEME);
kill(getpid(), SIGSTOP);
sleep(10);
return 0;
}
int parent(pid_t pid)
{
int ret;
struct user_fxsr_struct fpxregs;
ret = ptrace(PTRACE_GETFPXREGS, pid, 0, &fpxregs);
if(ret < 0){
printf("%d: %s.\n", errno, strerror(errno));
}
kill(pid, SIGCONT);
wait(pid);
return 0;
}
/* in the kerel, at kernel/i387.c get_fpxregs() */
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Vegard Nossum reported crashes during cpu hotplug tests:
http://marc.info/?l=linux-kernel&m=121413950227884&w=4
In function _cpu_up, the panic happens when calling
__raw_notifier_call_chain at the second time. Kernel doesn't panic when
calling it at the first time. If just say because of nr_cpu_ids, that's
not right.
By checking the source code, I found that function do_boot_cpu is the culprit.
Consider below call chain:
_cpu_up=>__cpu_up=>smp_ops.cpu_up=>native_cpu_up=>do_boot_cpu.
So do_boot_cpu is called in the end. In do_boot_cpu, if
boot_error==true, cpu_clear(cpu, cpu_possible_map) is executed. So later
on, when _cpu_up calls __raw_notifier_call_chain at the second time to
report CPU_UP_CANCELED, because this cpu is already cleared from
cpu_possible_map, get_cpu_sysdev returns NULL.
Many resources are related to cpu_possible_map, so it's better not to
change it.
Below patch against 2.6.26-rc7 fixes it by removing the bit clearing in
cpu_possible_map.
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/mm/built-in.o(.text+0x3a1): Section mismatch in
reference from the function set_pte_phys() to the function
.init.text:spp_getpage()
The function set_pte_phys() references
the function __init spp_getpage().
This is often because set_pte_phys lacks a __init
annotation or the annotation of spp_getpage is wrong.
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:520: warning: passing argument 2 of
'find_e820_area_size' from incompatible pointer type
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
WARNING: vmlinux.o(.text+0x721a): Section mismatch in reference from the function ___fill_code_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_code_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_code_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
WARNING: vmlinux.o(.text+0x7238): Section mismatch in reference from the function ___fill_code_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_code_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_code_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
WARNING: vmlinux.o(.text+0x7250): Section mismatch in reference from the function ___fill_code_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_code_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_code_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
WARNING: vmlinux.o(.text+0x7264): Section mismatch in reference from the function ___fill_code_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_code_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_code_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
WARNING: vmlinux.o(.text+0x72a2): Section mismatch in reference from the function ___fill_data_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_data_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_data_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
WARNING: vmlinux.o(.text+0x72bc): Section mismatch in reference from the function ___fill_data_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_data_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_data_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
WARNING: vmlinux.o(.text+0x72d4): Section mismatch in reference from the function ___fill_data_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_data_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_data_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
WARNING: vmlinux.o(.text+0x72e8): Section mismatch in reference from the function ___fill_data_cplbtab() to the function .init.text:_fill_cplbtab()
The function ___fill_data_cplbtab() references
the function __init _fill_cplbtab().
This is often because ___fill_data_cplbtab lacks a __init
annotation or the annotation of _fill_cplbtab is wrong.
--
Signed-off-by: Bryan Wu <cooloney@kernel.org>
Initialize the lock of bad_irq_desc properly.
The content of irq_desc array is replaced by bad_irq_desc in blackfin
arch irqchip init code. So, do it properly as common irq init code.
Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Bryan Wu <cooloney@kernel.org>
This patch updates the kvm host code to use the pvclock structs
and functions, thereby making it compatible with Xen.
The patch also fixes an initialization bug: on SMP systems the
per-cpu has two different locations early at boot and after CPU
bringup. kvmclock must take that in account when registering the
physical address within the host.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch updates the kvm host code to use the pvclock structs.
It also makes the paravirt clock compatible with Xen.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch updates the xen guest to use the pvclock structs
and helper functions.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds structs for the paravirt clocksource ABI
used by both xen and kvm (pvclock-abi.h).
It also adds some helper functions to read system time and
wall clock time from a paravirtual clocksource (pvclock.[ch]).
They are based on the xen code. They are enabled using
CONFIG_PARAVIRT_CLOCK.
Subsequent patches of this series will put the code in use.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
As noted by Akinobu Mita alloc_bootmem and related functions never return
NULL and always return a zeroed region of memory. Thus a NULL test or
memset after calls to these functions is unnecessary.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The fix applied in e0c6d97c65
"security hole in sn2_ptc_proc_write" didn't take into account
the case where count==0 (which results in a buffer underrun
when adding the trailing '\0'). Thanks to Andi Kleen for
pointing this out.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Call check_sal_cache_flush() after platform_setup() as
check_sal_cache_flush() now relies on being able to call platform
vector code.
Problem was introduced by: 3463a93def
"Update check_sal_cache_flush to use platform_send_ipi()"
Signed-off-by: Jes Sorensen <jes@sgi.com>
Tested-by: Alex Chiang: <achiang@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Non-PAE operation has been deprecated in Xen for a while, and is
rarely tested or used. xen-unstable has now officially dropped
non-PAE support. Since Xen/pvops' non-PAE support has also been
broken for a while, we may as well completely drop it altogether.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Switching msrs can occur either synchronously as a result of calls to
the msr management functions (usually in response to the guest touching
virtualized msrs), or asynchronously when preempting a kvm thread that has
guest state loaded. If we're unlucky enough to have the two at the same
time, host msrs are corrupted and the machine goes kaput on the next syscall.
Most easily triggered by Windows Server 2008, as it does a lot of msr
switching during bootup.
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM has a heuristic to unshadow guest pagetables when userspace accesses
them, on the assumption that most guests do not allow userspace to access
pagetables directly. Unfortunately, in addition to unshadowing the pagetables,
it also oopses.
This never triggers on ordinary guests since sane OSes will clear the
pagetables before assigning them to userspace, which will trigger the flood
heuristic, unshadowing the pagetables before the first userspace access. One
particular guest, though (Xenner) will run the kernel in userspace, triggering
the oops. Since the heuristic is incorrect in this case, we can simply
remove it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm_mmu_pte_write() does not handle 32-bit non-PAE large page backed
guests properly. It will instantiate two 2MB sptes pointing to the same
physical 2MB page when a guest large pte update is trapped.
Instead of duplicating code to handle this, disallow directory level
updates to happen through kvm_mmu_pte_write(), so the two 2MB sptes
emulating one guest 4MB pte can be correctly created by the page fault
handling path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
rmap_next() does not work correctly after rmap_remove(), as it expects
the rmap chains not to change during iteration. Fix (for now) by restarting
iteration from the beginning.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If a timer fires after kvm_inject_pending_timer_irqs() but before
local_irq_disable() the code will enter guest mode and only inject such
timer interrupt the next time an unrelated event causes an exit.
It would be simpler if the timer->pending irq conversion could be done
with IRQ's disabled, so that the above problem cannot happen.
For now introduce a new vcpu requests bit to cancel guest entry.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
A guest vcpu instance can be scheduled to a different physical CPU
between the test for KVM_REQ_MIGRATE_TIMER and local_irq_disable().
If that happens, the timer will only be migrated to the current pCPU on
the next exit, meaning that guest LAPIC timer event can be delayed until
a host interrupt is triggered.
Fix it by cancelling guest entry if any vcpu request is pending. This
has the side effect of nicely consolidating vcpu->requests checks.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Noticed by Martin Michlmayr, this missing export prevents IEEE1394
from building with:
ERROR: "dma_sync_sg_for_device" [drivers/ieee1394/ieee1394.ko] undefined!
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Which was removed in the hope that generic legacy IDE quirk in
drivers/pci/probe.c is sufficient for Cypress IDE.
It isn't, as this controller has non-standard BAR layout:
secondary channel registers are in the BAR0-1 of the second
PCI function - not in the BAR2-3 of the same function, as the
generic quirk routine assumes.
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vast majority of these build failures are gcc-4.3 warnings
about static functions and objects being referenced from
non-static (read: "extern inline") functions, in conjunction
with our -Werror.
We cannot just convert "extern inline" to "static inline",
as people keep suggesting all the time, because "extern inline"
logic is crucial for generic kernel build.
So
- just make sure that all callees of critical "extern inline"
functions are also "extern inline";
- use "static inline", wherever it's possible.
traps.c: work around gcc-4.3 being too smart about array
bounds-checking.
TODO: add "gnu_inline" attribute to all our "extern inline"
functions to ensure desired behaviour with future compilers.
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With built-in scsi disk driver, the final link fails with a following
error:
`.exit.text' referenced in section `.rodata' of drivers/built-in.o:
defined in discarded section `.exit.text' of drivers/built-in.o
This happens with -Os (CONFIG_CC_OPTIMIZE_FOR_SIZE=y) with all gcc-4
versions, and also with -O2 and gcc-4.3.
The problem is in sd.c:sd_major() being inlined into __exit function
exit_sd(), and the compiler generating a jump table in .rodata section
for the 'switch' statement in sd_major(). So we have references to
discarded section.
Fixed with a big hammer in the form of -fno-jump-tables.
Note that jump tables vs. discarded sections is a generic problem,
other architectures are just lucky not to suffer from it. But with
a slightly more complex switch/case statement it can be reproduced
on x86 as well. So maybe at some point we should consider
-fno-jump-tables as a generic compile option...
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Security hole in sn2_ptc_proc_write
It is possible to overrun a buffer with a write to this /proc file.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
557ed1fa26 ("remove ZERO_PAGE") removed
the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
generally now populate the VM with real empty pages needlessly.
We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
since fault handling no longer uses ZERO_PAGE for new anonymous pages,
we now need to handle that special case in follow_page() instead.
In particular, the removal of ZERO_PAGE effectively removed the core
file writing optimization where we would skip writing pages that had not
been populated at all, and increased memory pressure a lot by allocating
all those useless newly zeroed pages.
This reinstates the optimization by making the unmapped PTE case the
same as for a non-existent page table, which already did this correctly.
While at it, this also fixes the XIP case for follow_page(), where the
caller could not differentiate between the case of a page that simply
could not be used (because it had no "struct page" associated with it)
and a page that just wasn't mapped.
We do that by simply returning an error pointer for pages that could not
be turned into a "struct page *". The error is arbitrarily picked to be
EFAULT, since that was what get_user_pages() already used for the
equivalent IO-mapped page case.
[ Also removed an impossible test for pte_offset_map_lock() failing:
that's not how that function works ]
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because NX is now enforced properly, we must put the hypercall page
into the .text segment so that it is executable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Stable: this isn't a bugfix in itself, but it's a pre-requiste
for "xen: don't drop NX bit" ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
General Software writes their own VSA2 module for their version
of the Geode BIOS, which returns a different ID then the standard
VSA2. This was causing the framebuffer driver to break for most
GSW boards.
Signed-off-by: Jordan Crouse <jordan.crouse@amd.com>
Cc: tglx@linutronix.de
Cc: linux-geode@lists.infradead.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses the BOOTMEM_EXCLUSIVE for crashkernel reservation also for
i386 and prints a error message on failure.
The patch is still for 2.6.26 since it is only bug fixing. The unification
of reserve_crashkernel() between i386 and x86_64 should be done for 2.6.27.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Booting 2.6.26-rc6 on my 486 DX/4 fails with a "BUG: Int 6"
(invalid opcode) and a kernel halt immediately after the
kernel has been uncompressed. The BUG shows EIP pointing
to an rdtsc instruction in native_read_tsc(), invoked from
native_sched_clock().
(This error occurs so early that not even the serial console
can capture it.)
A bisection showed that this bug first occurs in 2.6.26-rc3-git7,
via commit 9ccc906c97e34fd91dc6aaf5b69b52d824386910:
>x86: distangle user disabled TSC from unstable
>
>tsc_enabled is set to 0 from the command line switch "notsc" and from
>the mark_tsc_unstable code. Seperate those functionalities and replace
>tsc_enable with tsc_disable. This makes also the native_sched_clock()
>decision when to use TSC understandable.
>
>Preparatory patch to solve the sched_clock() issue on 32 bit.
>
>Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The core reason for this bug is that native_sched_clock() gets
called before tsc_init().
Before the commit above, tsc_32.c used a "tsc_enabled" variable
which defaulted to 0 == disabled, and which only got enabled late
in tsc_init(). Thus early calls to native_sched_clock() would skip
the TSC and use jiffies instead.
After the commit above, tsc_32.c uses a "tsc_disabled" variable
which defaults to 0, meaning that the TSC is Ok to use. Early calls
to native_sched_clock() now erroneously try to use the TSC on
!cpu_has_tsc processors, leading to invalid opcode exceptions.
My proposed fix is to initialise tsc_disabled to a "soft disabled"
state distinct from the hard disabled state set up by the "notsc"
kernel option. This fixes the native_sched_clock() problem. It also
allows tsc_init() to be simplified: instead of setting tsc_disabled = 1
on every error return, we just set tsc_disabled = 0 once when all
checks have succeeded.
I've verified that this lets my 486 boot again. I've also verified
that a Core2 machine still uses the TSC as clocksource after the patch.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Patrick McHardy reported a crash:
> > I get this oops once a day, its apparently triggered by something
> > run by cron, but the process is a different one each time.
> >
> > Kernel is -git from yesterday shortly before the -rc6 release
> > (last commit is the usb-2.6 merge, the x86 patches are missing),
> > .config is attached.
> >
> > I'll retry with current -git, but the patches that have gone in
> > since I last updated don't look related.
> >
> > [62060.043009] BUG: unable to handle kernel NULL pointer dereference at
> > 000001ff
> > [62060.043009] IP: [<c0102a9b>] __switch_to+0x2f/0x118
> > [62060.043009] *pde = 00000000
> > [62060.043009] Oops: 0002 [#1] PREEMPT
Vegard Nossum analyzed it:
> This decodes to
>
> 0: 0f ae 00 fxsave (%eax)
>
> so it's related to the floating-point context. This is the exact
> location of the crash:
>
> $ addr2line -e arch/x86/kernel/process_32.o -i ab0
> include/asm/i387.h:232
> include/asm/i387.h:262
> arch/x86/kernel/process_32.c:595
>
> ...so it looks like prev_task->thread.xstate->fxsave has become NULL.
> Or maybe it never had any other value.
Somehow (as described below) TS_USEDFPU is set but the fpu is not
allocated or freed.
Another possible FPU pre-emption issue with the sleazy FPU optimization
which was benign before but not so anymore, with the dynamic FPU allocation
patch.
New task is getting exec'd and it is prempted at the below point.
flush_thread() {
...
/*
* Forget coprocessor state..
*/
clear_fpu(tsk);
<----- Preemption point
clear_used_math();
...
}
Now when it context switches in again, as the used_math() is still set
and fpu_counter can be > 5, we will do a math_state_restore() which sets
the task's TS_USEDFPU. After it continues from the above preemption point
it does clear_used_math() and much later free_thread_xstate().
Now, at the next context switch, it is quite possible that xstate is
null, used_math() is not set and TS_USEDFPU is still set. This will
trigger unlazy_fpu() causing kernel oops.
Fix this by clearing tsk's fpu_counter before clearing task's fpu.
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we demote a slice from 64k to 4k, and we are about to insert an
HPTE for a 4k subpage and we notice that there is an existing 64k
HPTE, we first invalidate that HPTE before inserting the new 4k
subpage HPTE. Since the bits that encode which hash bucket the old
HPTE was in overlap with the bits that encode which of the 16 subpages
have HPTEs, we need to clear out the subpage HPTE-present bits before
starting to insert HPTEs for the 4k subpages. If we don't do that, we
can erroneously think that a subpage already has an HPTE when it
doesn't.
That in itself wouldn't be such a problem except that when we go to
update the HPTE that we think is present on machines with a
hypervisor, the hypervisor can tell us that the HPTE we think is there
is actually there even though it isn't, which can lead to a process
getting stuck in a loop, continually faulting. The reason for the
confusion is that the AVPN (abbreviated virtual page number) we are
looking for in the HPTE for a 4k subpage can actually match the AVPN
in a stale HPTE for another 64k page. For example, the HPTE for
the 4k subpage at 0x84000f000 will be in the same hash bucket and have
the same AVPN as the HPTE for the 64k page at 0x8400f0000.
This fixes the code to clear out the subpage HPTE-present bits.
Signed-off-by: Paul Mackerras <paulus@samba.org>