Currently we rely on the PIPE_BUF_FLAG_LRU flag being set correctly
to know whether we need to fiddle with page LRU state after stealing it,
however for some origins we just don't know if the page is on the LRU
list or not.
So remove PIPE_BUF_FLAG_LRU and do this check/add manually in pipe_to_file()
instead.
Signed-off-by: Jens Axboe <axboe@suse.de>
When iptables userspace adds an ipt_standard_target, it calculates the size
of the entire entry as:
sizeof(struct ipt_entry) + XT_ALIGN(sizeof(struct ipt_standard_target))
ipt_standard_target looks like this:
struct xt_standard_target
{
struct xt_entry_target target;
int verdict;
};
xt_entry_target contains a pointer, so when compiled for 64 bit the
structure gets an extra 4 byte of padding at the end. On 32 bit
architectures where iptables aligns to 8 byte it will also have 4
byte padding at the end because it is only 36 bytes large.
The compat_ipt_standard_fn in the kernel adjusts the offsets by
sizeof(struct ipt_standard_target) - sizeof(struct compat_ipt_standard_target),
which will always result in 4, even if the structure from userspace
was already padded to a multiple of 8. On x86 this works out by
accident because userspace only aligns to 4, on all other
architectures this is broken and causes incorrect adjustments to
the size and following offsets.
Thanks to Linus for lots of debugging help and testing.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add an nid member to the spu structure, and store the numa id of the spu there
on creation.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change of_node_to_nid() to traverse the device tree, looking for a numa id.
Cell uses this to assign ids to SPUs, which are children of the CPU node.
Existing users of of_node_to_nid() are altered to use of_node_to_nid_single(),
which doesn't do the traversal.
Export an attach_sysdev_to_node() function, allowing system devices (eg.
SPUs) to link themselves into the numa topology in sysfs.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If SPLICE_F_GIFT is set, the user is basically giving this pages away to
the kernel. That means we can steal them for eg page cache uses instead
of copying it.
The data must be properly page aligned and also a multiple of the page size
in length.
Signed-off-by: Jens Axboe <axboe@suse.de>
The pipe ->map() method uses kmap() to virtually map the pages, which
is both slow and has known scalability issues on SMP. This patch enables
atomic copying of pipe pages, by pre-faulting data and using kmap_atomic()
instead.
lmbench bw_pipe and lat_pipe measurements agree this is a Good Thing. Here
are results from that on a UP machine with highmem (1.5GiB of RAM), running
first a UP kernel, SMP kernel, and SMP kernel patched.
Vanilla-UP:
Pipe bandwidth: 1622.28 MB/sec
Pipe bandwidth: 1610.59 MB/sec
Pipe bandwidth: 1608.30 MB/sec
Pipe latency: 7.3275 microseconds
Pipe latency: 7.2995 microseconds
Pipe latency: 7.3097 microseconds
Vanilla-SMP:
Pipe bandwidth: 1382.19 MB/sec
Pipe bandwidth: 1317.27 MB/sec
Pipe bandwidth: 1355.61 MB/sec
Pipe latency: 9.6402 microseconds
Pipe latency: 9.6696 microseconds
Pipe latency: 9.6153 microseconds
Patched-SMP:
Pipe bandwidth: 1578.70 MB/sec
Pipe bandwidth: 1579.95 MB/sec
Pipe bandwidth: 1578.63 MB/sec
Pipe latency: 9.1654 microseconds
Pipe latency: 9.2266 microseconds
Pipe latency: 9.1527 microseconds
Signed-off-by: Jens Axboe <axboe@suse.de>
The ->map() function is really expensive on highmem machines right now,
since it has to use the slower kmap() instead of kmap_atomic(). Splice
rarely needs to access the virtual address of a page, so it's a waste
of time doing it.
Introduce ->pin() to take over the responsibility of making sure the
page data is valid. ->map() is then reduced to just kmap(). That way we
can also share a most of the pipe buffer ops between pipe.c and splice.c
Signed-off-by: Jens Axboe <axboe@suse.de>
Found by Oleg Nesterov <oleg@tv-sign.ru>, fixed by me.
- Only allow full pages to go to the page cache.
- Check page != buf->page instead of using PIPE_BUF_FLAG_STOLEN.
- Remember to clear 'stolen' if add_to_page_cache() fails.
And as a cleanup on that:
- Make the bottom fall-through logic a little less convoluted. Also make
the steal path hold an extra reference to the page, so we don't have
to differentiate between stolen and non-stolen at the end.
Signed-off-by: Jens Axboe <axboe@suse.de>
1) The audit_ipc_perms() function has been split into two different
functions:
- audit_ipc_obj()
- audit_ipc_set_perm()
There's a key shift here... The audit_ipc_obj() collects the uid, gid,
mode, and SElinux context label of the current ipc object. This
audit_ipc_obj() hook is now found in several places. Most notably, it
is hooked in ipcperms(), which is called in various places around the
ipc code permforming a MAC check. Additionally there are several places
where *checkid() is used to validate that an operation is being
performed on a valid object while not necessarily having a nearby
ipcperms() call. In these locations, audit_ipc_obj() is called to
ensure that the information is captured by the audit system.
The audit_set_new_perm() function is called any time the permissions on
the ipc object changes. In this case, the NEW permissions are recorded
(and note that an audit_ipc_obj() call exists just a few lines before
each instance).
2) Support for an AUDIT_IPC_SET_PERM audit message type. This allows
for separate auxiliary audit records for normal operations on an IPC
object and permissions changes. Note that the same struct
audit_aux_data_ipcctl is used and populated, however there are separate
audit_log_format statements based on the type of the message. Finally,
the AUDIT_IPC block of code in audit_free_aux() was extended to handle
aux messages of this new type. No more mem leaks I hope ;-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hi,
The patch below builds upon the patch sent earlier and adds subject label to
all audit events generated via the netlink interface. It also cleans up a few
other minor things.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The below patch should be applied after the inode and ipc sid patches.
This patch is a reworking of Tim's patch that has been updated to match
the inode and ipc patches since its similar.
[updated:
> Stephen Smalley also wanted to change a variable from isec to tsec in the
> user sid patch. ]
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hi,
The patch below converts IPC auditing to collect sid's and convert to context
string only if it needs to output an audit record. This patch depends on the
inode audit change patch already being applied.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Previously, we were gathering the context instead of the sid. Now in this patch,
we gather just the sid and convert to context only if an audit event is being
output.
This patch brings the performance hit from 146% down to 23%
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The following patch provides selinux interfaces that will allow the audit
system to perform filtering based on the process context (user, role, type,
sensitivity, and clearance). These interfaces will allow the selinux
module to perform efficient matches based on lower level selinux constructs,
rather than relying on context retrievals and string comparisons within
the audit module. It also allows for dominance checks on the mls portion
of the contexts that are impossible with only string comparisons.
Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use hlist_unhashed() rather than accessing inside data structure.
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FXSAVE information leak patch introduced a bug in FP exception
handling: it clears FP exceptions only when there are already
none outstanding. Mikael Pettersson reported that causes problems
with the Erlang runtime and has tested this fix.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Acked-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While writing to an event device allows to set repeat rate for an
individual input device there is no way to retrieve current settings
so we need to ressurect EVIOCGREP. Also ressurect EVIOCSREP so we
have a symmetrical interface.
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Add a cputable entry for the POWER6 processor.
The SIHV and SIPR bits in the mmcra have moved in POWER6, so disable
support for that until oprofile is fixed.
Also tell firmware that we know about POWER6.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add a read_mostly section and define __read_mostly to prevent cache line
pollution due to writes for mostly read variables. In addition fix the
incorrect alignment of the cache_line_aligned data section. s390 has a
cacheline size of 256 bytes.
Signed-off-by: Christian Borntraeger <cborntra@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add support for atomic futex operations.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Add new SA_PROBEIRQ which suppresses the new sharing-mismatch warning.
Some drivers like to use request_irq() to find an unused interrupt slot.
- Use it in i82365.c
- Kill unused SA_PROBE.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This consists of offsets fix in ..._devices.c, and update of
ppc_sys_fixup_mem_resource() function to prevent subsequent fixups
Signed-off-by: Vitaly Bordug <vbordug@ru.mvista.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Wire up *at syscalls.
This patch has been tested on ppc64 (using glibc's testsuite, both 32bit
and 64bit), and compile-tested for ppc32 (I have currently no ppc32 system
available, but I expect no problems).
Signed-off-by: Andreas Schwab <schwab@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch adds workaround for PPC 440GX erratum 440_43. According to
this erratum spurious MachineChecks (caused by L1 cache parity) can
happen during DataTLB miss processing. We disable L1 cache parity
checking for 440GX rev.C and rev.F
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some people report that we die on some Macs when we are expecting to
catch machine checks after poking at some random I/O address. I'd seen
it happen on my dual G4 with serial ports until we fixed those to use
OF, but now other users are reporting it with i8042.
This expands the use of check_legacy_ioport() to avoid that situation
even on 32-bit kernels.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
At present, ARCH=powerpc kernels can waste considerable space in
pagetables when making large hugepage mappings. Hugepage PTEs go in
PMD pages, but each PMD page maps 256M and so contains only 16
hugepage PTEs (128 bytes of data), but takes up a 1024 byte
allocation. With CONFIG_PPC_64K_PAGES enabled (64k base page size),
the situation is worse. Now hugepage PTEs are at the PTE page level
(also mapping 256M), so we store 16 hugepage PTEs in a 64k allocation.
The PowerPC MMU already means that any 256M region is either all
hugepage, or all normal pages. Thus, with some care, we can use a
different allocation for the hugepage PTE tables and only allocate the
128 bytes necessary.
Signed-off-by: Paul Mackerras <paulus@samba.org>
In SLES10 (2.6.16) crash dumping (in my experience, LKCD) is unable to
capture the second page of the 2-page task/stack allocation.
This is particularly troublesome for dump analysis, as the stack traceback
cannot be done.
(A similar convention is probably needed throughout the kernel to make
kernel multi-page allocations detectable for dumping)
Multi-page kernel allocations are represented by the single page structure
associated with the first page of the allocation. The page structures
associated with the other pages are unintialized.
If the dumper is selecting only kernel pages it has no way to identify
any but the first page of the allocation.
The fix is to make the task/stack allocation a compound page.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Fix a bug that causes discovery of the nearest node/cpu to
a TIO (IO node) to fail.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This patch contains the following possible cleanups:
- #if 0 the following unused global function:
- subsys_remove_file()
- remove the following unused EXPORT_SYMBOL's:
- kset_find_obj
- subsystem_init
- remove the following unused EXPORT_SYMBOL_GPL:
- kobject_add_dir
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix the following warning which happens when OCFS2_FS is enabled but
DEBUG_FS isn't:
fs/ocfs2/dlmglue.c: In function `ocfs2_dlm_init_debug':
fs/ocfs2/dlmglue.c:2036: warning: passing arg 5 of `debugfs_create_file' discards qualifiers from pointer target type
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Joel Becker <Joel.Becker@oracle.com>
Acked-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This fixes a build error for various odd combinations of CONFIG_HOTPLUG
and CONFIG_NET.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Nigel Cunningham <ncunningham@cyclades.com>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fixed Oops at rmmod with CONFIG_SND_VERBOSE_PROCFS=n.
Add ifdef to struct fields for optimization and better compile
checks.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug,
so use the same fix for both. Turns out pmd_clear had it as well, but pgds
are not affected.
The problem is rather intricate. Page table entries in PAE mode are 64-bits
wide, but the only atomic 8-byte write operation available in 32-bit mode is
cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can
happen that the processor may prefetch entries into the TLB in the middle of an
operation which clears a page table entry. So one must always clear the P-bit
in the low word of the page table entry first when clearing it.
Since the sequence *ptep = __pte(0) leaves the order of the write dependent on
the compiler, it must be coded explicitly as a clear of the low word followed
by a clear of the high word. Further, there must be a write memory barrier
here to enforce proper ordering by the compiler (and, in the future, by the
processor as well).
On > 4GB memory machines, the implementation of pte_clear for PAE was clearly
deficient, as it could leave virtual mappings of physical memory above 4GB
aliased to memory below 4GB in the TLB. The implementation of
ptep_get_and_clear_full has a similar bug, although not nearly as likely to
occur, since the mappings being cleared are in the process of being destroyed,
and should never be dereferenced again.
But, as luck would have it, it is possible to trigger bugs even without ever
dereferencing these bogus TLB mappings, even if the clear is followed fairly
soon after with a TLB flush or invalidation. The problem is that memory above
4GB may now be aliased into the first 4GB of memory, and in fact, may hit a
region of memory with non-memory semantics. These regions include AGP and PCI
space. As such, these memory regions are not cached by the processor. This
introduces the bug.
The processor can speculate memory operations, including memory writes, as long
as they are committed with the proper ordering. Speculating a memory write to
a linear address that has a bogus TLB mapping is possible. Normally, the
speculation is harmless. But for cached memory, it does leave the falsely
speculated cacheline unmodified, but in a dirty state. This cache line will be
eventually written back. If this cacheline happens to intersect a region of
memory that is not protected by the cache coherency protocol, it can corrupt
data in I/O memory, which is generally a very bad thing to do, and can cause
total system failure or just plain undefined behavior.
These bugs are extremely unlikely, but the severity is of such magnitude, and
the fix so simple that I think fixing them immediately is justified. Also,
they are nearly impossible to debug.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With recent rewrite for generic bitops, fls() for 32bit kernel with
MIPS64_CPU is broken. Also, ffs(), fls() should be defined the same
way as the libc and compiler built-in routines (returns int instead of
unsigned long).
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
find_get_pages_contig() will break out if we hit a hole in the page cache.
From Andrew Morton, small modifications and documentation by me.
Signed-off-by: Jens Axboe <axboe@suse.de>
This is a workaround for the case edge-triggered irq's. Several users
seem to have broken configurations sharing edge-triggered irq's. To avoid
losing IRQ's, reshedule if more work arrives.
The changes to netdevice.h are to extract the part that puts device
back in list into separate inline.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
sys_splice() moves data to/from pipes with a file input/output. sys_vmsplice()
moves data to a pipe, with the input being a user address range instead.
This uses an approach suggested by Linus, where we can hold partial ranges
inside the pages[] map. Hopefully this will be useful for network
receive support as well.
Signed-off-by: Jens Axboe <axboe@suse.de>
Providing more accurate coordinates for thumb press requires additional
steps in the filtering logic:
- Ignore samples found invalid by the debouncing logic, or the ones that
have out of bound pressure value.
- Add a parameter to repeat debouncing, so that more then two consecutive
good readings are required for a valid sample.
Signed-off-by: Imre Deak <imre.deak@nokia.com>
Acked-by: Juha Yrjola <juha.yrjola@nokia.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Some (?) non-x86 architectures require 8byte alignment for u_int64_t
even when compiled for 32bit, using u_int32_t in compat_xt_counters
breaks on these architectures, use u_int64_t for everything but x86.
Reported by Andreas Schwab <schwab@suse.de>.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
__NR_sys_sync_file_range part was lost somewhere...
[glibc is already checking __NR_sync_file_range]
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some bugs in the current implementation of the SIOCSIWAP wext,
for example that when you do it twice and it fails, it may still try
another access point for some reason. This patch fixes this by introducing
a new flag that tells the association code that the bssid that is in use
was fixed by the user and shouldn't be deviated from.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
machine_is() was always returning 0 when used in a module, because
we weren't exporting the machine definitions. This was why sound
wasn't working on powermacs when CONFIG_SND_POWERMAC=m. Original
fix from Ben Herrenschmidt, further fixed by me.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since it is way more work to change most drivers to comply with parisc, take
the easy way out and make ioremap _NO_CACHE by default. This is in line with
what powerpc does.
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>