Prarit's excellent bug report:
> In recent Fedora releases (F17 & F18) some users have reported seeing
> messages similar to
>
> [ 15.478160] kvm: Could not allocate 304 bytes percpu data
> [ 15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
> reserved chunk failed
>
> during system boot. In some cases, users have also reported seeing this
> message along with a failed load of other modules.
>
> What is happening is systemd is loading an instance of the kvm module for
> each cpu found (see commit e9bda3b). When the module load occurs the kernel
> currently allocates the modules percpu data area prior to checking to see
> if the module is already loaded or is in the process of being loaded. If
> the module is already loaded, or finishes load, the module loading code
> releases the current instance's module's percpu data.
Now we have a new state MODULE_STATE_UNFORMED, we can insert the
module into the list (and thus guarantee its uniqueness) before we
allocate the per-cpu region.
Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Prarit Bhargava <prarit@redhat.com>
lib/rbtree.c declared __rb_erase_color() as __always_inline void, and
then exported it with EXPORT_SYMBOL.
This was because __rb_erase_color() must be exported for augmented
rbtree users, but it must also be inlined into rb_erase() so that the
dummy callback can get optimized out of that call site.
(Actually with a modern compiler, none of the dummy callback functions
should even be generated as separate text functions).
The above usage is legal C, but it was unusual enough for some compilers
to warn about it. This change makes things more explicit, with a static
__always_inline ____rb_erase_color function for use in rb_erase(), and a
separate non-inline __rb_erase_color function for use in
rb_erase_augmented call sites.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some cases, free_irq_cpu_rmap() is called while holding a lock (eg
rtnl). This can lead to deadlocks, because it invokes
flush_scheduled_work() which ends up waiting for whole system workqueue
to flush, but some pending works might try to acquire the lock we are
already holding.
This commit uses reference-counting to replace
irq_run_affinity_notifiers(). It also removes
irq_run_affinity_notifiers() altogether.
[akpm@linux-foundation.org: eliminate free_cpu_rmap, rename cpu_rmap_reclaim() to cpu_rmap_release(), propagate kref_put() retval from cpu_rmap_put()]
Signed-off-by: David Decotigny <decot@googlers.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, rcutorture traces every read-side access. This can be
problematic because even a two-minute rcutorture run on a two-CPU system
can generate 28,853,363 reads. Normally, only a failing read is of
interest, so this commit traces adjusts rcutorture's tracing to only
trace failing reads. The resulting event tracing records the time
and the ->completed value captured at the beginning of the RCU read-side
critical section, allowing correlation with other event-tracing messages.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Add fix to build problem located by Randy Dunlap based on
diagnosis by Steven Rostedt. ]
CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.
This change removes the last of the __dev* markings from the kernel from
a variety of different, tiny, places.
Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The atomic64 library uses a handful of static spin locks to implement
atomic 64-bit operations on architectures without support for atomic
64-bit instructions.
Unfortunately, the spinlocks are initialized in a pure initcall and that
is too late for the vfs namespace code which wants to use atomic64
operations before the initcall is run.
This became a problem as of commit 8823c079ba71: "vfs: Add setns support
for the mount namespace".
This leads to BUG messages such as:
BUG: spinlock bad magic on CPU#0, swapper/0/0
lock: atomic64_lock+0x240/0x400, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
do_raw_spin_lock+0x158/0x198
_raw_spin_lock_irqsave+0x4c/0x58
atomic64_add_return+0x30/0x5c
alloc_mnt_ns.clone.14+0x44/0xac
create_mnt_ns+0xc/0x54
mnt_init+0x120/0x1d4
vfs_caches_init+0xe0/0x10c
start_kernel+0x29c/0x300
coming out early on during boot when spinlock debugging is enabled.
Fix this by initializing the spinlocks statically at compile time.
Reported-and-tested-by: Vaibhav Bedia <vaibhav.bedia@ti.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is absolutely no reason to crash the kernel when we have a
perfectly good return value already available to use for conveying
failure status.
Let's return an error code instead of crashing the kernel: that sounds
like a much better plan.
[akpm@linux-foundation.org: s/E2BIG/EINVAL/]
Signed-off-by: Nick Bowler <nbowler@elliptictech.com>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add functions to get the requested number of pseudo-random bytes.
The difference from get_random_bytes() is that it generates pseudo-random
numbers by prandom_u32(). It doesn't consume the entropy pool, and the
sequence is reproducible if the same rnd_state is used. So it is suitable
for generating random bytes for testing.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: David Laight <david.laight@aculab.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Robert Love <robert.w.love@intel.com>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This renames all random32 functions to have 'prandom_' prefix as follows:
void prandom_seed(u32 seed); /* rename from srandom32() */
u32 prandom_u32(void); /* rename from random32() */
void prandom_seed_state(struct rnd_state *state, u64 seed);
/* rename from prandom32_seed() */
u32 prandom_u32_state(struct rnd_state *state);
/* rename from prandom32() */
The purpose of this renaming is to prevent some kernel developers from
assuming that prandom32() and random32() might imply that only
prandom32() was the one using a pseudo-random number generator by
prandom32's "p", and the result may be a very embarassing security
exposure. This concern was expressed by Theodore Ts'o.
And furthermore, I'm going to introduce new functions for getting the
requested number of pseudo-random bytes. If I continue to use both
prandom32 and random32 prefixes for these functions, the confusion
is getting worse.
As a result of this renaming, "prandom_" is the common prefix for
pseudo-random number library.
Currently, srandom32() and random32() are preserved because it is
difficult to rename too many users at once.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Robert Love <robert.w.love@intel.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: David Laight <david.laight@aculab.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the documentation for simple_strto* to reflect that it has been
obsoleted and advise the usage of kstrto*.
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Joe Perches <joe@perches.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Bruce Fields pointed out, kstrto* is currently lacking kerneldoc
comments. This patch adds kerneldoc comments to common variants of
kstrto*: kstrto(u)l, kstrto(u)ll and kstrto(u)int.
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Joe Perches <joe@perches.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix this warning:
lib/rbtree_test.c: In function `check':
lib/rbtree_test.c:121: warning: `blacks' may be used uninitialized in this function
Signed-off-by: Cong Ding <dinggnu@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add lockdep annotations. Not only this can help to find the potential
problems, we do not want the false warnings if, say, the task takes two
different percpu_rw_semaphore's for reading. IOW, at least ->rw_sem
should not use a single class.
This patch exposes this internal lock to lockdep so that it represents the
whole percpu_rw_semaphore. This way we do not need to add another "fake"
->lockdep_map and lock_class_key. More importantly, this also makes the
output from lockdep much more understandable if it finds the problem.
In short, with this patch from lockdep pov percpu_down_read() and
percpu_up_read() acquire/release ->rw_sem for reading, this matches the
actual semantics. This abuses __up_read() but I hope this is fine and in
fact I'd like to have down_read_no_lockdep() as well,
percpu_down_read_recursive_readers() will need it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu_rw_semaphore->writer_mutex was only added to simplify the initial
rewrite, the only thing it protects is clear_fast_ctr() which otherwise
could be called by multiple writers. ->rw_sem is enough to serialize the
writers.
Kill this mutex and add "atomic_t write_ctr" instead. The writers
increment/decrement this counter, the readers check it is zero instead of
mutex_is_locked().
Move atomic_add(clear_fast_ctr(), slow_read_ctr) under down_write() to
avoid the race with other writers. This is a bit sub-optimal, only the
first writer needs this and we do not need to exclude the readers at this
stage. But this is simple, we do not want another internal lock until we
add more features.
And this speeds up the write-contended case. Before this patch the racing
writers sleep in synchronize_sched_expedited() sequentially, with this
patch multiple synchronize_sched_expedited's can "overlap" with each
other. Note: we can do more optimizations, this is only the first step.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the writer does msleep() plus synchronize_sched() 3 times to
acquire/release the semaphore, and during this time the readers are
blocked completely. Even if the "write" section was not actually started
or if it was already finished.
With this patch down_write/up_write does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use the
slow path.
percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.
Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is another step towards better standard conformance. Rather than
adding a local buffer to store the specified portion of the string (with
the need to enforce an arbitrary maximum supported width to limit the
buffer size), do a maximum width conversion and then drop as much of it as
is necessary to meet the caller's request.
Also fail on negative field widths.
Uses the deprecated simple_strto*() functions because kstrtoXX() fail on
non-zero terminated strings.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the custom implementation of the functionality similar to kbasename().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation/printk-formats.txt says to use %zd for a ssize_t argument
and some drivers do. Unfortunately this prints a positive number for
negative values eg:
tpm_tis 70030000.tpm_tis: tpm_transmit: tpm_send: error 4294967234
Add a case to va_args a ssize_t type if the interpretation should be
signed.
Tested on PPC32.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants in the ASN.1
general decoder instead of the equivalent numbers.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This module used to inject errors in the pSeries specific dynamic
reconfiguration notifiers. Those are gone however, replaced by
generic notifiers for changes to the device-tree. So let's update
the module to deal with these instead and rename it along the way.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Akinobu Mita <akinobu.mita@gmail.com>
sse and avx2 stuff only exist on x86 arch, and we don't need to build
altivec on x86. And we can do that at lib/raid6/Makefile.
Proposed-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Add AVX2 optimized gen_syndrom functions, which is simply based on
sse2.c written by hpa.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Optimize RAID6 recovery functions to take advantage of
the 256-bit YMM integer instructions introduced in AVX2.
The patch was tested and benchmarked before submission.
However hardware is not yet released so benchmark numbers
cannot be reported.
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It is strange that alloc_bootmem() returns a virtual address and
free_bootmem() requires a physical address. Anyway, free_bootmem()'s
first parameter should be physical address.
There are some call sites for free_bootmem() with virtual address. So fix
them.
[akpm@linux-foundation.org: improve free_bootmem() and free_bootmem_pate() documentation]
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I've legally changed my name with New York State, the US Social Security
Administration, et al. This patch propagates the name change and change
in initials and login to comments in the kernel source as well.
Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
It is $(obj)/oid_registry.o that is dependent on $(obj)/oid_registry_data.c.
The object file cannot be built until $(obj)/oid_registry_data.c has been
generated.
A periodic and hard to reproduce parallel build failure is due to
this incorrect lib/Makefile dependency. The compile error is completely
disingenuous.
GEN lib/oid_registry_data.c
Compiling 49 OIDs
CC lib/oid_registry.o
gcc: error: lib/oid_registry.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
make[3]: *** [lib/oid_registry.o] Error 4
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fix an error in asn1_find_indefinite_length() whereby small definite length
elements of size 0x7f are incorrecly classified as non-small. Without this
fix, an error will be given as the length of the length will be perceived as
being very much greater than the maximum supported size.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
CONFIG_HOTPLUG is being removed so kobject_uevent needs to always be
part of the library.
Signed-off-by: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since 4.4 GCC on MIPS no longer recognizes the "h" constraint,
leading to this build failure:
CC lib/mpi/generic_mpih-mul1.o
lib/mpi/generic_mpih-mul1.c: In function 'mpihelp_mul_1':
lib/mpi/generic_mpih-mul1.c:50:3: error: impossible constraint in 'asm'
This patch updates MPI with the latest umul_ppm implementations for MIPS.
Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
Cc: Linux-MIPS <linux-mips@linux-mips.org>
Cc: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
Cc: James Morris <jmorris@namei.org>
Patchwork: https://patchwork.linux-mips.org/patch/4612/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
dma-debug depends on get_dma_ops() interface. Several architectures
do not define dma_ops and get_dma_ops(). When dma debug interfaces are
used on an architecture (e.g: c6x) that doesn't define get_dmap_ops(),
compilation fails. Changing dma-debug to call dma_mapping_error() instead
of defining its own that calls get_dma_ops(), such that the internal use of
dma_mapping_error() doesn't interfere with the debug_dma_mapping_error()
interface's mapping error checks. Moving dma_mapping_error() checks in
check_unmap() under the dma debug entry not found is sufficient to fix the
problem.
Reference: https://lkml.org/lkml/2012/10/26/367
Signed-off-by: Shuah Khan <shuah.khan@hp.com>
Reported-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
The RCU CPU stall warning timeout has defaulted to 60 seconds for
some years, with almost no false positives. This commit therefore
reduces the default to 21 seconds, slightly shorter than the new
soft-lockup timeout.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently swiotlb is the only consumer for swiotlb_bounce. Since that is the
case it doesn't make much sense to be exporting it so make it a static
function only.
In addition we can save a few more lines of code by making it so that it
accepts the DMA address as a physical address instead of a virtual one. This
is the last piece in essentially pushing all of the DMA address values to use
physical addresses in swiotlb.
In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_bounce I am renaming phys to orig_addr, and dma_addr to
tlb_addr. This way is should be clear that orig_addr is contained within
io_orig_addr and tlb_addr is an address within the io_tlb_addr buffer.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This change makes it so that the sync functionality also uses physical
addresses. This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.
In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_sync_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr. This way is should be clear that orig_addr is
contained within io_orig_addr and tlb_addr is an address within the
io_tlb_addr buffer.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This change makes it so that the unmap functionality also uses physical
addresses. This helps to further reduce the use of virt_to_phys and
phys_to_virt functions.
In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_unmap_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr. This way is should be clear that orig_addr is
contained within io_orig_addr and tlb_addr is an address within the
io_tlb_addr buffer.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This change makes it so that swiotlb_tbl_map_single will return a physical
address instead of a virtual address when called. The advantage to this once
again is that we are avoiding a number of virt_to_phys and phys_to_virt
translations by working with everything as a physical address.
One change I had to make in order to support using physical addresses is that
I could no longer trust 0 to be a invalid physical address on all platforms.
So instead I made it so that ~0 is returned on error. This should never be a
valid return value as it implies that only one byte would be available for
use.
In order to clarify things since we now have 2 physical addresses in use
inside of swiotlb_tbl_map_single I am renaming phys to orig_addr, and
dma_addr to tlb_addr. This way is should be clear that orig_addr is
contained within io_orig_addr and tlb_addr is an address within the
io_tlb_addr buffer.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This change makes it so that we can avoid virt_to_phys overhead when using the
io_tlb_overflow_buffer. My original plan was to completely remove the value
and replace it with a constant but I had seen that there were recent patches
that stated this couldn't be done until all device drivers that depended on
that functionality be updated.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This change replaces all references to the virtual address for io_tlb_start
with references to the physical address io_tlb_end. The main advantage of
replacing the virtual address with a physical address is that we can avoid
having to do multiple translations from the virtual address to the physical
one needed for testing an existing DMA address.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This change replaces all references to the virtual address for io_tlb_end
with references to the physical address io_tlb_end. The main advantage of
replacing the virtual address with a physical address is that we can avoid
having to do multiple translations from the virtual address to the physical
one needed for testing an existing DMA address.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The genalloc code uses the bitmap API from include/linux/bitmap.h and
lib/bitmap.c, which is based on long values. Both bitmap_set from
lib/bitmap.c and bitmap_set_ll, which is the lockless version from
genalloc.c, use BITMAP_LAST_WORD_MASK to set the first bits in a long in
the bitmap.
That one uses (1 << bits) - 1, 0b111, if you are setting the first three
bits. This means that the API counts from the least significant bits
(LSB from now on) to the MSB. The LSB in the first long is bit 0, then.
The same works for the lookup functions.
The genalloc code uses longs for the bitmap, as it should. In
include/linux/genalloc.h, struct gen_pool_chunk has unsigned long
bits[0] as its last member. When allocating the struct, genalloc should
reserve enough space for the bitmap. This should be a proper number of
longs that can fit the amount of bits in the bitmap.
However, genalloc allocates an integer number of bytes that fit the
amount of bits, but may not be an integer amount of longs. 9 bytes, for
example, could be allocated for 70 bits.
This is a problem in itself if the Least Significat Bit in a long is in
the byte with the largest address, which happens in Big Endian machines.
This means genalloc is not allocating the byte in which it will try to
set or check for a bit.
This may end up in memory corruption, where genalloc will try to set the
bits it has not allocated. In fact, genalloc may not set these bits
because it may find them already set, because they were not zeroed since
they were not allocated. And that's what causes a BUG when
gen_pool_destroy is called and check for any set bits.
What really happens is that genalloc uses kmalloc_node with __GFP_ZERO
on gen_pool_add_virt. With SLAB and SLUB, this means the whole slab
will be cleared, not only the requested bytes. Since struct
gen_pool_chunk has a size that is a multiple of 8, and slab sizes are
multiples of 8, we get lucky and allocate and clear the right amount of
bytes.
Hower, this is not the case with SLOB or with older code that did memset
after allocating instead of using __GFP_ZERO.
So, a simple module as this (running 3.6.0), will cause a crash when
rmmod'ed.
[root@phantom-lp2 foo]# cat foo.c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/genalloc.h>
MODULE_LICENSE("GPL");
MODULE_VERSION("0.1");
static struct gen_pool *foo_pool;
static __init int foo_init(void)
{
int ret;
foo_pool = gen_pool_create(10, -1);
if (!foo_pool)
return -ENOMEM;
ret = gen_pool_add(foo_pool, 0xa0000000, 32 << 10, -1);
if (ret) {
gen_pool_destroy(foo_pool);
return ret;
}
return 0;
}
static __exit void foo_exit(void)
{
gen_pool_destroy(foo_pool);
}
module_init(foo_init);
module_exit(foo_exit);
[root@phantom-lp2 foo]# zcat /proc/config.gz | grep SLOB
CONFIG_SLOB=y
[root@phantom-lp2 foo]# insmod ./foo.ko
[root@phantom-lp2 foo]# rmmod foo
------------[ cut here ]------------
kernel BUG at lib/genalloc.c:243!
cpu 0x4: Vector: 700 (Program Check) at [c0000000bb0e7960]
pc: c0000000003cb50c: .gen_pool_destroy+0xac/0x110
lr: c0000000003cb4fc: .gen_pool_destroy+0x9c/0x110
sp: c0000000bb0e7be0
msr: 8000000000029032
current = 0xc0000000bb0e0000
paca = 0xc000000006d30e00 softe: 0 irq_happened: 0x01
pid = 13044, comm = rmmod
kernel BUG at lib/genalloc.c:243!
[c0000000bb0e7ca0] d000000004b00020 .foo_exit+0x20/0x38 [foo]
[c0000000bb0e7d20] c0000000000dff98 .SyS_delete_module+0x1a8/0x290
[c0000000bb0e7e30] c0000000000097d4 syscall_exit+0x0/0x94
--- Exception: c00 (System Call) at 000000800753d1a0
SP (fffd0b0e640) is in userspace
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Benjamin Gaignard <benjamin.gaignard@stericsson.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add dma-debug interface debug_dma_mapping_error() to debug
drivers that fail to check dma mapping errors on addresses
returned by dma_map_single() and dma_map_page() interfaces.
This interface clears a flag set by debug_dma_map_page() to
indicate that dma_mapping_error() has been called by the
driver. When driver does unmap, debug_dma_unmap() checks the
flag and if this flag is still set, prints warning message
that includes call trace that leads up to the unmap. This
interface can be called from dma_mapping_error() routines to
enable dma mapping error check debugging.
Tested: Intel iommu and swiotlb (iommu=soft) on x86-64 with
CONFIG_DMA_API_DEBUG enabled and disabled.
Signed-off-by: Shuah Khan <shuah.khan@hp.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
If there is only one match, the unique matched entry should be returned.
Without the fix, the upcoming dma debug interfaces ("dma-debug: new
interfaces to debug dma mapping errors") can't work reliably because
only device and dma_addr are passed to dma_mapping_error().
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Tested-by: Shuah Khan <shuah.khan@hp.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously kvasprintf() allocation was being done through kmalloc(),
thus producing an inaccurate trace report.
This is a common problem: in order to get accurate callsite tracing, a
lib/utils function shouldn't allocate kmalloc but instead use
kmalloc_track_caller.
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
asn1_find_indefinite_length() returns an error indicator of -1, which the
caller asn1_ber_decoder() places in a size_t (which is usually unsigned) and
then checks to see whether it is less than 0 (which it can't be). This can
lead to the following warning:
lib/asn1_decoder.c:320 asn1_ber_decoder()
warn: unsigned 'len' is never less than zero.
Instead, asn1_find_indefinite_length() update the caller's idea of the data
cursor and length separately from returning the error code.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Add a CONFIG_DEBUG_VM_RB build option for the previously existing
DEBUG_MM_RB code. Now that Andi Kleen modified it to avoid using
recursive algorithms, we can expose it a bit more.
Also extend this code to validate_mm() after stack expansion, and to check
that the vma's start and last pgoffs have not changed since the nodes were
inserted on the anon vma interval tree (as it is important that the nodes
be reindexed after each such update).
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the generic interval tree code that was introduced in "mm: replace
vma prio_tree with an interval tree".
Changes:
- fixed 'endpoing' typo noticed by Andrew Morton
- replaced include/linux/interval_tree_tmpl.h, which was used as a
template (including it automatically defined the interval tree
functions) with include/linux/interval_tree_generic.h, which only
defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
defines the interval tree functions when invoked. Now that is a very
long macro which is unfortunate, but it does make the usage sites
(lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.
- make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
instead of duplicating that code in the interval tree template.
- replaced vma_interval_tree_add(), which was actually handling the
nonlinear and interval tree cases, with vma_interval_tree_insert_after()
which handles only the interval tree case and has an API that is more
consistent with the other interval tree handling functions.
The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide rb_insert_augmented() and rb_erase_augmented() through a new
rbtree_augmented.h include file. rb_erase_augmented() is defined there as
an __always_inline function, in order to allow inlining of augmented
rbtree callbacks into it. Since this generates a relatively large
function, each augmented rbtree user should make sure to have a single
call site.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After both prio_tree users have been converted to use red-black trees,
there is no need to keep around the prio tree library anymore.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement an interval tree as a replacement for the VMA prio_tree. The
algorithms are similar to lib/interval_tree.c; however that code can't be
directly reused as the interval endpoints are not explicitly stored in the
VMA. So instead, the common algorithm is moved into a template and the
details (node type, how to get interval endpoints from the node, etc) are
filled in using the C preprocessor.
Once the interval tree functions are available, using them as a
replacement to the VMA prio tree is a relatively simple, mechanical job.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>