While porting some changes of the 2.6.21-rc7 pptp/proto_gre conntrack
and nat modules to a 2.4.32 kernel I noticed that the gre_key function
returns a wrong pointer to the GRE key of a version 0 packet thus
corrupting the packet payload.
The intended behaviour for GREv0 packets is to act like
nf_conntrack_proto_generic/nf_nat_proto_unknown so I have ripped the
offending functions (not used anymore) and modified the
nf_nat_proto_gre modules to not touch version 0 (non PPTP) packets.
Signed-off-by: Jorge Boncompte <jorge@dti2.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add __dev_getfirstbyhwtype for callers that don't want a reference but
some data from the device and thus need to take the rtnl anyway.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix skbuff.h kernel-doc:
linux-2.6.21-git4//include/linux/skbuff.h:316): No description found for parameter 'transport_header'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the match_*() functions take a const pointer to the options table
and make strings pointers in the options table const too.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is illegal not to return from a pio or mmio request without completing
it, as mmio or pio is an atomic operation. Therefore, we can simplify
the userspace interface by avoiding the completion indication.
Signed-off-by: Avi Kivity <avi@qumranet.com>
With this, we can specify that accesses to one physical memory range will
be remapped to another. This is useful for the vga window at 0xa0000 which
is used as a movable window into the (much larger) framebuffer.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The current string pio interface communicates using guest virtual addresses,
relying on userspace to translate addresses and to check permissions. This
interface cannot fully support guest smp, as the check needs to take into
account two pages at one in case an unaligned string transfer straddles a
page boundary.
Change the interface not to communicate guest addresses at all; instead use
a buffer page (mmaped by userspace) and do transfers there. The kernel
manages the virtual to physical translation and can perform the checks
atomically by taking the appropriate locks.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This allows us to store offsets in the kernel/user kvm_run area, and be
sure that userspace has them mapped. As offsets can be outside the
kvm_run struct, userspace has no way of knowing how much to mmap.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allow a special signal mask to be used while executing in guest mode. This
allows signals to be used to interrupt a vcpu without requiring signal
delivery to a userspace handler, which is quite expensive. Userspace still
receives -EINTR and can get the signal via sigwait().
Signed-off-by: Avi Kivity <avi@qumranet.com>
This is redundant, as we also return -EINTR from the ioctl, but it
allows us to examine the exit_reason field on resume without seeing
old data.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently, userspace is told about the nature of the last exit from the
guest using two fields, exit_type and exit_reason, where exit_type has
just two enumerations (and no need for more). So fold exit_type into
exit_reason, reducing the complexity of determining what really happened.
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM used to handle cpuid by letting userspace decide what values to
return to the guest. We now handle cpuid completely in the kernel. We
still let userspace decide which values the guest will see by having
userspace set up the value table beforehand (this is necessary to allow
management software to set the cpu features to the least common denominator,
so that live migration can work).
The motivation for the change is that kvm kernel code can be impacted by
cpuid features, for example the x86 emulator.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently when passing the a PIO emulation request to userspace, we
rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx
(on string instructions). This (a) requires two extra ioctls for getting
and setting the registers and (b) is unfriendly to non-x86 archs, when
they get kvm ports.
So fix by doing the register fixups in the kernel and passing to userspace
only an abstract description of the PIO to be done.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of passing a 'struct kvm_run' back and forth between the kernel and
userspace, allocate a page and allow the user to mmap() it. This reduces
needless copying and makes the interface expandable by providing lots of
free space.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Unless we finally completely remove it, people will always add new users.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch removes the PCI_MULTITHREAD_PROBE option that had already
been marked as broken.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch introduces an optional function, arch_teardown_msi_irqs(),
which gives an arch the opportunity to do per-device teardown for
MSI/X. If that's not required, the default version simply calls
arch_teardown_msi_irq() for each msi irq required.
arch_teardown_msi_irqs() is simply passed a pdev, attached to the pdev
is a list of msi_descs, it is up to the arch to free the irq associated
with each of these as appropriate.
For archs that _don't_ implement arch_teardown_msi_irqs(), all msi_descs
with irq == 0 are considered unallocated, and the arch teardown routine
is not called on them.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch introduces an optional function, arch_setup_msi_irqs(),
(note the plural) which gives an arch the opportunity to do per-device
setup for MSI/X and then allocate all the requested MSI/Xs at once.
If that's not required by the arch, the default version simply calls
arch_setup_msi_irq() for each MSI irq required.
arch_setup_msi_irqs() is passed a pdev, attached to the pdev is a list
of msi_descs with irq == 0, it is up to the arch to connect these up to
an irq (via set_irq_msi()) or return an error. For convenience the number
of vectors and the type are passed also.
All msi_descs with irq != 0 are considered allocated, and the arch
teardown routine will be called on them when necessary.
The existing semantics of pci_enable_msix() are that if the requested
number of irqs can not be allocated, the maximum number that _could_ be
allocated is returned. To support that, we define that in case of an
error from arch_setup_msi_irqs(), the number of msi_descs with irq != 0
are considered allocated, and are counted toward the "max that could be
allocated".
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that we keep a list of msi descriptors, we don't need first_msi_irq
in the pci dev.
If we somehow have zero MSIs configured list_entry() will give us weird
oopes or nice memory corruption bugs. So be paranoid. Add BUG_ONs and also
a check in pci_msi_check_device() to make sure nvec > 0.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The msi descriptors are linked together with what looks a lot like
a linked list, but isn't a struct list_head list. Make it one.
The only complication is that previously we walked a list of irqs, and
got the descriptor for each with get_irq_msi(). Now we have a list of
descriptors and need to get the irq out of it, so it needs to be in the
actual struct msi_desc. We use 0 to indicate no irq is setup.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There are currently several places in the kernel where we kmalloc()
a struct pci_dev and start initialising it. It'd be preferable to
have an allocator so we can ensure the pci_dev is correctly initialised
in one place.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Add an arch_check_device(), which gives archs a chance to check the input
to pci_enable_msi/x. The arch might be interested in the value of nvec so
pass it in. Propagate the error value returned from the arch routine out
to the caller.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Balance declarations of pci_request_regions() and pci_release_regions() with
empty inline definitions for the CONFIG_PCI=n case -- otherwise my patch to
drivers/net/3c59x.c in the -mm tree doesn't compile. :-)
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Adds a new API which can be used to issue various types
of PCI-E reset, including PCI-E warm reset and PCI-E hot reset.
This is needed for an ipr PCI-E adapter which does not properly
implement BIST. Running BIST on this adapter results in PCI-E
errors. The only reliable reset mechanism that exists on this
hardware is PCI Fundamental reset (warm reset). Since driving
this type of reset is architecture unique, this provides the
necessary hooks for architectures to add this support.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Acked-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We need to work on cleaning up the relationship between kobjects, ksets and
ktypes. The removal of 'struct subsystem' is the first step of this,
especially as it is not really needed at all.
Thanks to Kay for fixing the bugs in this patch.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Almost all definitions used by file2alias was already
present in mod_devicetable.h.
Added the last definition and killed the input.h usage.
The errornous include was pointed out
by: Jan Engelhardt <jengelh@linux01.gwdg.de>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
Cc: Deepak Saxena <dsaxena@plexity.net>
Three cleanups:
1: ELF notes are never mapped, so there's no need to have any access
flags in their phdr.
2: When generating them from asm, tell the assembler to use a SHT_NOTE
section type. There doesn't seem to be a way to do this from C.
3: Use ANSI rather than traditional cpp behaviour to stringify the
macro argument.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space. These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.
Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.
This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings. Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.
[ Zach - vmi.c will need some attention after this patch. It wasn't
immediately obvious to me what needs to be done. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Some versions of libc can't deal with a VDSO which doesn't have its
ELF headers matching its mapped address. COMPAT_VDSO maps the VDSO at
a specific system-wide fixed address. Previously this was all done at
build time, on the grounds that the fixed VDSO address is always at
the top of the address space. However, a hypervisor may reserve some
of that address space, pushing the fixmap address down.
This patch does the adjustment dynamically at runtime, depending on
the runtime location of the VDSO fixmap.
[ Patch has been through several hands: Jan Beulich wrote the orignal
version; Zach reworked it, and Jeremy converted it to relocate phdrs
as well as sections. ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Zachary Amsden <zach@vmware.com>
Cc: "Jan Beulich" <JBeulich@novell.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Rather than using a single constant PERCPU_ENOUGH_ROOM, compute it as
the sum of kernel_percpu + PERCPU_MODULE_RESERVE. This is now common
to all architectures; if an architecture wants to set
PERCPU_ENOUGH_ROOM to something special, then it may do so (ia64 is
the only one which does).
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
On x86-64, kernel memory freed after init can be entirely unmapped instead
of just getting 'poisoned' by overwriting with a debug pattern.
On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table
can also be write-protected.
Compared to the first version, this one prevents re-creating deleted
mappings in the kernel image range on x86-64, if those got removed
previously. This, together with the original changes, prevents temporarily
having inconsistent mappings when cacheability attributes are being
changed on such pages (e.g. from AGP code). While on i386 such duplicate
mappings don't exist, the same change is done there, too, both for
consistency and because checking pte_present() before using various other
pte_XXX functions is a requirement anyway. At once, i386 code gets
adjusted to use pte_huge() instead of open coding this.
AK: split out cpa() changes
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
The specific case I am encountering is kdump under Xen with a 64 bit
hypervisor and 32 bit kernel/userspace. The dump created is 64 bit due to
the hypervisor but the dump kernel is 32 bit for maximum compatibility.
It's possibly less likely to be useful in a purely native scenario but I
see no reason to disallow it.
[akpm@linux-foundation.org: build fix]
Signed-off-by: Ian Campbell <ian.campbell@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Horms <horms@verge.net.au>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Enable system hashtable memory to be distributed among nodes on x86_64 NUMA
Forcing the kernel to use node interleaved vmalloc instead of bootmem for
the system hashtable memory (alloc_large_system_hash) reduces the memory
imbalance on node 0 by around 40MB on a 8 node x86_64 NUMA box:
Before the following patch, on bootup of a 8 node box:
Node 0 MemTotal: 3407488 kB
Node 0 MemFree: 3206296 kB
Node 0 MemUsed: 201192 kB
Node 0 Active: 7012 kB
Node 0 Inactive: 512 kB
Node 0 Dirty: 0 kB
Node 0 Writeback: 0 kB
Node 0 FilePages: 1912 kB
Node 0 Mapped: 420 kB
Node 0 AnonPages: 5612 kB
Node 0 PageTables: 468 kB
Node 0 NFS_Unstable: 0 kB
Node 0 Bounce: 0 kB
Node 0 Slab: 5408 kB
Node 0 SReclaimable: 644 kB
Node 0 SUnreclaim: 4764 kB
After the patch (or using hashdist=1 on the kernel command line):
Node 0 MemTotal: 3407488 kB
Node 0 MemFree: 3247608 kB
Node 0 MemUsed: 159880 kB
Node 0 Active: 3012 kB
Node 0 Inactive: 616 kB
Node 0 Dirty: 0 kB
Node 0 Writeback: 0 kB
Node 0 FilePages: 2424 kB
Node 0 Mapped: 380 kB
Node 0 AnonPages: 1200 kB
Node 0 PageTables: 396 kB
Node 0 NFS_Unstable: 0 kB
Node 0 Bounce: 0 kB
Node 0 Slab: 6304 kB
Node 0 SReclaimable: 1596 kB
Node 0 SUnreclaim: 4708 kB
I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA
since x86_64 has no dearth of vmalloc space? Or maybe enable hash
distribution for all 64bit NUMA arches? The following patch does it only
for x86_64.
I ran a HPC MPI benchmark -- 'Ansys wingsolid', which takes up quite a bit of
memory and uses up tlb entries. This was on a 4 way, 2 socket
Tyan AMD box (non vsmp), with 8G total memory (4G pernode).
The results with and without hash distribution are:
1. Vanilla - runtime of 1188.000s
2. With hashdist=1 runtime of 1154.000s
Oprofile output for the duration of run is:
1. Vanilla:
PU: AMD64 processors, speed 2411.16 MHz (estimated)
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
mask of 0x00 (No unit mask) count 500
samples % app name symbol name
163054 6.5513 libansys1.so MultiFront::decompose(int, int,
Elemset *, int *, int, int, int)
162061 6.5114 libansys3.so blockSaxpy6L_fd
162042 6.5107 libansys3.so blockInnerProduct6L_fd
156286 6.2794 libansys3.so maxb33_
87879 3.5309 libansys1.so elmatrixmultpcg_
84857 3.4095 libansys4.so saxpy_pcg
58637 2.3560 libansys4.so .st4560
46612 1.8728 libansys4.so .st4282
43043 1.7294 vmlinux-t copy_user_generic_string
41326 1.6604 libansys3.so blockSaxpyBackSolve6L_fd
41288 1.6589 libansys3.so blockInnerProductBackSolve6L_fd
2. With hashdist=1
CPU: AMD64 processors, speed 2411.13 MHz (estimated)
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
mask of 0x00 (No unit mask) count 500
samples % app name symbol name
162993 6.9814 libansys1.so MultiFront::decompose(int, int,
Elemset *, int *, int, int, int)
160799 6.8874 libansys3.so blockInnerProduct6L_fd
160459 6.8729 libansys3.so blockSaxpy6L_fd
156018 6.6826 libansys3.so maxb33_
84700 3.6279 libansys4.so saxpy_pcg
83434 3.5737 libansys1.so elmatrixmultpcg_
58074 2.4875 libansys4.so .st4560
46000 1.9703 libansys4.so .st4282
41166 1.7632 libansys3.so blockSaxpyBackSolve6L_fd
41033 1.7575 libansys3.so blockInnerProductBackSolve6L_fd
35762 1.5318 libansys1.so inner_product_sub
35591 1.5245 libansys1.so inner_product_sub2
28259 1.2104 libansys4.so addVectors
Signed-off-by: Pravin B. Shelar <pravin.shelar@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix for the following patch. Provide dummy cpufreq functions when
CPUFREQ is not compiled in.
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
The reboot_fixups stuff seems to be a bit of a mess, specifically the
header is in linux/ when its a purely i386-specific piece of code. I'm
not sure why it has its config option; its only currently needed for
"geode-gx1/cs5530a", so perhaps whatever config option controls that
hardware should enable this?
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Change sysenter_setup to __cpuinit.
Change __INIT & __INITDATA to be cpu hotplug aware.
Resolve MODPOST warnings similar to:
WARNING: vmlinux - Section mismatch: reference to .init.text:sysenter_setup from
.text between 'identify_cpu' (at offset 0xc040a380) and 'detect_ht'
and
WARNING: vmlinux - Section mismatch: reference to .init.data:vsyscall_int80_end
from .text between 'sysenter_setup' (at offset 0xc041a269) and 'enable_sep_cpu'
WARNING: vmlinux - Section mismatch: reference to
.init.data:vsyscall_int80_start from .text between 'sysenter_setup' (at offset
0xc041a26e) and 'enable_sep_cpu'
WARNING: vmlinux - Section mismatch: reference to
.init.data:vsyscall_sysenter_end from .text between 'sysenter_setup' (at offset
0xc041a275) and 'enable_sep_cpu'
WARNING: vmlinux - Section mismatch: reference to
.init.data:vsyscall_sysenter_start from .text between 'sysenter_setup' (at
offset 0xc041a27a) and 'enable_sep_cpu'
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This patch adds ablkcipher_request_set_tfm for those users that need
to manage the memory for ablkcipher requests directly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the mid-level interface for asynchronous block ciphers.
It also includes a generic queueing mechanism that can be used by other
asynchronous crypto operations in future.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch passes the type/mask along when constructing instances of
templates. This is in preparation for templates that may support
multiple types of instances depending on what is requested. For example,
the planned software async crypto driver will use this construct.
For the moment this allows us to check whether the instance constructed
is of the correct type and avoid returning success if the type does not
match.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds the frontend interface for asynchronous block ciphers.
In addition to the usual block cipher parameters, there is a callback
function pointer and a data pointer. The callback will be invoked only
if the encrypt/decrypt handlers return -EINPROGRESS. In other words,
if the return value of zero the completion handler (or the equivalent
code) needs to be invoked by the caller.
The request structure is allocated and freed by the caller. Its size
is determined by calling crypto_ablkcipher_reqsize(). The helpers
ablkcipher_request_alloc/ablkcipher_request_free can be used to manage
the memory for a request.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This is a very simple bitbanging I2C bus driver utilizing the new
arch-neutral GPIO API. Useful for chips that don't have a built-in
I2C controller, additional I2C busses, or testing purposes.
To use, include something similar to the following in the
board-specific setup code:
#include <linux/i2c-gpio.h>
static struct i2c_gpio_platform_data i2c_gpio_data = {
.sda_pin = GPIO_PIN_FOO,
.scl_pin = GPIO_PIN_BAR,
};
static struct platform_device i2c_gpio_device = {
.name = "i2c-gpio",
.id = 0,
.dev = {
.platform_data = &i2c_gpio_data,
},
};
Register this platform_device, set up the I2C pins as GPIO if
required and you're ready to go. This will use default values for
udelay and timeout, and will work with GPIO hardware that does not
support open drain mode, but allows sensing of the SDA and SCL lines
even when they are being driven.
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>