Now we have a set of nested attributes:
IFLA_VFINFO_LIST (NESTED)
IFLA_VF_INFO (NESTED)
IFLA_VF_MAC
IFLA_VF_VLAN
IFLA_VF_TX_RATE
This allows a single set to operate on multiple attributes if desired.
Among other things, it means a dump can be replayed to set state.
The current interface has yet to be released, so this seems like
something to consider for 2.6.34.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) i_flags simply doesn't work for mount/unlink race prevention;
we may have many links to file and rm on one of those obviously
shouldn't prevent bind on top of another later on. To fix it
right way we need to mark _dentry_ as unsuitable for mounting
upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
i_mutex on the inode in question. Set it (with dont_mount(dentry))
in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
in namespace.c that used to check for S_DEAD. Setting S_DEAD
is still needed in places where we used to set it (for directories
getting killed), since we rely on it for readdir/rmdir race
prevention.
2) rename()/mount() protection has another bogosity - we unhash
the target before we'd checked that it's not a mountpoint. Fixed.
3) ancient bogosity in pivot_root() - we locked i_mutex on the
right directory, but checked S_DEAD on the different (and wrong)
one. Noticed and fixed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Originally, commit d899bf7b ("procfs: provide stack information for
threads") attempted to introduce a new feature for showing where the
threadstack was located and how many pages are being utilized by the
stack.
Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
applied to fix the NO_MMU case.
Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
64-bit") was applied to fix a bug in ia32 executables being loaded.
Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel
threads") was applied to fix a bug which had kernel threads printing a
userland stack address.
Commit 1306d603f ('proc: partially revert "procfs: provide stack
information for threads"') was then applied to revert the stack pages
being used to solve a significant performance regression.
This patch nearly undoes the effect of all these patches.
The reason for reverting these is it provides an unusable value in
field 28. For x86_64, a fork will result in the task->stack_start
value being updated to the current user top of stack and not the stack
start address. This unpredictability of the stack_start value makes
it worthless. That includes the intended use of showing how much stack
space a thread has.
Other architectures will get different values. As an example, ia64
gets 0. The do_fork() and copy_process() functions appear to treat the
stack_start and stack_size parameters as architecture specific.
I only partially reverted c44972f1 ("procfs: disable per-task stack usage
on NOMMU") . If I had completely reverted it, I would have had to change
mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
configured. Since I could not test the builds without significant effort,
I decided to not change mm/Makefile.
I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
information for threads on 64-bit") . I left the KSTK_ESP() change in
place as that seemed worthwhile.
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some RCU-lockdep splat repairs need to know whether they are running
in a single-threaded process. Unfortunately, the thread_group_empty()
primitive is defined in sched.h, and can induce #include hell. This
commit therefore introduces a rcu_my_thread_group_empty() wrapper that
is defined in rcupdate.c, thus avoiding the need to include sched.h
everywhere.
Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Expand task_subsys_state()'s rcu_dereference_check() to include the full
locking rule as documented in Documentation/cgroups/cgroups.txt by adding
a check for task->alloc_lock being held.
This fixes an RCU false positive when resuming from suspend. The warning
comes from freezer cgroup in cgroup_freezing_or_frozen().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Using a single list for all userspace devices leads to a dead lock
on multiplexed buses in some circumstances (mux chip instantiated
from userspace). This is solved by using a separate list for each
bus segment.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Michael Lawnick <ml.lawnick@gmx.de>
For more clearance what the functions actually do,
usb_buffer_alloc() is renamed to usb_alloc_coherent()
usb_buffer_free() is renamed to usb_free_coherent()
They should only be used in code which really needs DMA coherency.
[added compatibility macros so we can convert things easier - gregkh]
Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Pedro Ribeiro <pedrib@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
noop_backing_dev_info is used only as a flag to mark filesystems that
don't have any backing store, like tmpfs, procfs, spufs, etc.
Signed-off-by: Joern Engel <joern@logfs.org>
Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
to the noop_backing_dev_info is not legal and will not result in
them being flushed, but we already catch this condition in
__mark_inode_dirty() when checking for a registered bdi.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If a futex key happens to be located within a huge page mapped
MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a
page->mapping that will never exist.
See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details
about the problem.
This patch makes page->mapping a poisoned value that includes
PAGE_MAPPING_ANON mapped MAP_PRIVATE. This is enough for futex to
continue but because of PAGE_MAPPING_ANON, the poisoned value is not
dereferenced or used by futex. No other part of the VM should be
dereferencing the page->mapping of a hugetlbfs page as its page cache is
not on the LRU.
This patch fixes the problem with the test case described in the bugzilla.
[akpm@linux-foundation.org: mel cant spel]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Darren Hart <darren@dvhart.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This cleans up a few of the complaints of __generic_block_fiemap. I've
fixed all the typing stuff, used inline functions instead of macros,
gotten rid of a couple of variables, and made sure the size and block
requests are all block aligned. It also fixes a problem where sometimes
FIEMAP_EXTENT_LAST wasn't being set properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 2c61be0a94 (NFS: Ensure that the WRITE
and COMMIT RPC calls are always uninterruptible) exposed a race on file
close. In order to ensure correct close-to-open behaviour, we want to wait
for all outstanding background commit operations to complete.
This patch adds an inode flag that indicates if a commit operation is under
way, and provides a mechanism to allow ->write_inode() to wait for its
completion if this is a data integrity flush.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pretty trivial helper, just sets up the bdi and registers it. An atomic
sequence count is used to ensure that the registered sysfs names are
unique.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch increases the current hardcoded limit of NR_IOBUS_DEVS
from 6 to 200. We are hitting this limit when creating a guest with more
than 1 virtio-net device using vhost-net backend. Each virtio-net
device requires 2 such devices to service notifications from rx/tx queues.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Int is not long enough to store the size of a dirty bitmap.
This patch fixes this problem with the introduction of a wrapper
function to calculate the sizes of dirty bitmaps.
Note: in mark_page_dirty(), we have to consider the fact that
__set_bit() takes the offset as int, not long.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Have the stub variant of regulator_get() return NULL, so that drivers
can (but still don't have to) handle this case specifically.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Jerome Oufella <jerome.oufella@savoirfairelinux.com>
Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Liam Girdwood <lrg@slimlogic.co.uk>
The lockdep facility temporarily disables lockdep checking by
incrementing the current->lockdep_recursion variable. Such
disabling happens in NMIs and in other situations where lockdep
might expect to recurse on itself.
This patch therefore checks current->lockdep_recursion, disabling RCU
lockdep splats when this variable is non-zero. In addition, this patch
removes the "likely()", as suggested by Lai Jiangshan.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: David Miller <davem@davemloft.net>
Tested-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: eric.dumazet@gmail.com
LKML-Reference: <20100415195039.GA22623@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This was supposed to be generic "authors or copyright holders";
I mistakenly picked up text from a wrong file.
Reported-by: Daniel K. <dk@uw.no>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Among else, this allows projects like libdc1394 to carry copies of the
ABI related header files without them or distributors having to worry
about effects on the project's overall license terms. Switch to MIT
license as suggested by Kristian. Also update the year in the
copyright statement according to source history.
Cc: Jay Fenlason <fenlason@redhat.com>
Acked-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Kristian Høgsberg <krh@bitplanet.net>
This patch adds variants of rcu_dereference() that handle
situations where the RCU-protected data structure cannot change,
perhaps due to our holding the update-side lock, or where the
RCU-protected pointer is only to be fetched, not dereferenced.
These are needed due to some performance concerns with using
rcu_dereference() where it is not required, aside from the need
for lockdep/sparse checking.
The new rcu_access_pointer() primitive is for the case where the
pointer is be fetch and not dereferenced. This primitive may be
used without protection, RCU or otherwise, due to the fact that
it uses ACCESS_ONCE().
The new rcu_dereference_protected() primitive is for the case
where updates are prevented, for example, due to holding the
update-side lock. This primitive does neither ACCESS_ONCE() nor
smp_read_barrier_depends(), so can only be used when updates are
somehow prevented.
Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: eric.dumazet@gmail.com
LKML-Reference: <1270852752-25278-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Arnaud Giersch reports that NFSv4 locking is broken when we hold a
delegation since commit 8e469ebd6d (NFSv4:
Don't allow posix locking against servers that don't support it).
According to Arnaud, the lock succeeds the first time he opens the file
(since we cannot do a delegated open) but then fails after we start using
delegated opens.
The following patch fixes it by ensuring that locking behaviour is
governed by a per-filesystem capability flag that is initially set, but
gets cleared if the server ever returns an OPEN without the
NFS4_OPEN_RESULT_LOCKTYPE_POSIX flag being set.
Reported-by: Arnaud Giersch <arnaud.giersch@iut-bm.univ-fcomte.fr>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Add the missing documentation for iso packets.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set()
or radix_tree_tag_clear(). The problem is that the double tag_get() in
radix_tree_tag_get():
if (!tag_get(node, tag, offset))
saw_unset_tag = 1;
if (height == 1) {
int ret = tag_get(node, tag, offset);
may see the value change due to the action of set/clear. RCU is no protection
against this as no pointers are being changed, no nodes are being replaced
according to a COW protocol - set/clear alter the node directly.
The documentation in linux/radix-tree.h, however, says that
radix_tree_tag_get() is an exception to the rule that "any function modifying
the tree or tags (...) must exclude other modifications, and exclude any
functions reading the tree".
The problem is that the next statement in radix_tree_tag_get() checks that the
tag doesn't vary over time:
BUG_ON(ret && saw_unset_tag);
This has been seen happening in FS-Cache:
https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html
To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various
comments that the value of the tag may change whilst the RCU read lock is held,
and thus that the return value of radix_tree_tag_get() may not be relied upon
unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from
running concurrently with it.
Reported-by: Romain DEGEZ <romain.degez@smartjog.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As suggested by Linus, introduce a kern_ptr_validate() helper that does some
sanity checks to make sure a pointer is a valid kernel pointer. This is a
preparational step for fixing SLUB kmem_ptr_validate().
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most drives from Seagate, Hitachi, and possibly other brands,
do not allow LBA28 access to sector number 0x0fffffff (2^28 - 1).
So instead use LBA48 for such accesses.
This bug could bite a lot of systems, especially when the user has
taken care to align partitions to 4KB boundaries. On misaligned systems,
it is less likely to be encountered, since a 4KB read would end at
0x10000000 rather than at 0x0fffffff.
Signed-off-by: Mark Lord <mlord@pobox.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Move MULTIPORT feature and related config changes
out of exported headers, and disable the feature
at runtime.
At this point, it seems less risky to keep code around
until we can enable it than rip it out completely.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Presently, memcg's FILE_MAPPED accounting has following race with
move_account (happens at rmdir()).
increment page->mapcount (rmap.c)
mem_cgroup_update_file_mapped() move_account()
lock_page_cgroup()
check page_mapped() if
page_mapped(page)>1 {
FILE_MAPPED -1 from old memcg
FILE_MAPPED +1 to old memcg
}
.....
overwrite pc->mem_cgroup
unlock_page_cgroup()
lock_page_cgroup()
FILE_MAPPED + 1 to pc->mem_cgroup
unlock_page_cgroup()
Then,
old memcg (-1 file mapped)
new memcg (+2 file mapped)
This happens because move_account see page_mapped() which is not guarded
by lock_page_cgroup(). This patch adds FILE_MAPPED flag to page_cgroup
and move account information based on it. Now, all checks are synchronous
with lock_page_cgroup().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@in.ibm.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Andrea Righi <arighi@develer.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we look into pagemap using page-types with option -p, the value of
pfn for hugepages looks wrong (see below.) This is because pte was
evaluated only once for one vma although it should be updated for each
hugepage. This patch fixes it.
$ page-types -p 3277 -Nl -b huge
voffset offset len flags
7f21e8a00 11e400 1 ___U___________H_G________________
7f21e8a01 11e401 1ff ________________TG________________
^^^
7f21e8c00 11e400 1 ___U___________H_G________________
7f21e8c01 11e401 1ff ________________TG________________
^^^
One hugepage contains 1 head page and 511 tail pages in x86_64 and each
two lines represent each hugepage. Voffset and offset mean virtual
address and physical address in the page unit, respectively. The
different hugepages should not have the same offset value.
With this patch applied:
$ page-types -p 3386 -Nl -b huge
voffset offset len flags
7fec7a600 112c00 1 ___UD__________H_G________________
7fec7a601 112c01 1ff ________________TG________________
^^^
7fec7a800 113200 1 ___UD__________H_G________________
7fec7a801 113201 1ff ________________TG________________
^^^
OK
More info:
- This patch modifies walk_page_range()'s hugepage walker. But the
change only affects pagemap_read(), which is the only caller of hugepage
callback.
- Without this patch, hugetlb_entry() callback is called per vma, that
doesn't match the natural expectation from its name.
- With this patch, hugetlb_entry() is called per hugepte entry and the
callback can become much simpler.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When __ratelimit() returns 1 this means that we can go ahead.
Signed-off-by: Yong Zhang <yong.zhang@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Requested by hch, for consistency now it is exported.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Anton Blanchard <anton@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 148f948ba8 (vfs: Introduce new
helpers for syncing after writing to O_SYNC file or IS_SYNC inode) broke
the raw driver.
We now call through generic_file_aio_write -> generic_write_sync ->
vfs_fsync_range. vfs_fsync_range has:
if (!fop || !fop->fsync) {
ret = -EINVAL;
goto out;
}
But drivers/char/raw.c doesn't set an fsync method.
We have two options: fix it or remove the raw driver completely. I'm
happy to do either, the fact this has been broken for so long suggests it
is rarely used.
The patch below adds an fsync method to the raw driver. My knowledge of
the block layer is pretty sketchy so this could do with a once over.
If we instead decide to remove the raw driver, this patch might still be
useful as a backport to 2.6.33 and 2.6.32.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DECLARE_KFIFO creates a union with a struct kfifo and a buffer array with
size [size + sizeof(struct kfifo)].
INIT_KFIFO then sets the buffer pointer in struct kfifo to point to the
beginning of the buffer array which means that the first call to kfifo_in
will overwrite members of the struct kfifo.
Signed-off-by: David Härdeman <david@hardeman.nu>
Acked-by: Stefani Seibold <stefani@seibold.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Migration has been completed so remove this now. There's one straggler in
linux-next's drivers/mtd/sm_ftl.c. A patch has been sent.
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some BIOSes don't configure HPA during boot but do so while resuming.
This causes harddrives to shrink during resume making libata detach
and reattach them. This can be worked around by unlocking HPA if old
size equals native size.
Add ATA_DFLAG_UNLOCK_HPA so that HPA unlocking can be controlled
per-device and update ata_dev_revalidate() such that it sets
ATA_DFLAG_UNLOCK_HPA and fails with -EIO when the above condition is
detected.
This patch fixes the following bug.
https://bugzilla.kernel.org/show_bug.cgi?id=15396
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleksandr Yermolenko <yaa.bta@gmail.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
In an embedded system the matrix_keypad driver might be used to
interface with an external control panel and not an actual keyboard.
On the control panel some of the keys could be used to turn on/off
various functions. If key autorepeat is enabled this causes the
function to quickly toggle between the on and off states and makes
operation difficult.
Add an option in the platform-specific data to disable the key
autorepeat.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Module refcounting is implemented with a per-cpu counter for speed.
However there is a race when tallying the counter where a reference may
be taken by one CPU and released by another. Reference count summation
may then see the decrement without having seen the previous increment,
leading to lower than expected count. A module which never has its
actual reference drop below 1 may return a reference count of 0 due to
this race.
Module removal generally runs under stop_machine, which prevents this
race causing bugs due to removal of in-use modules. However there are
other real bugs in module.c code and driver code (module_refcount is
exported) where the callers do not run under stop_machine.
Fix this by maintaining running per-cpu counters for the number of
module refcount increments and the number of refcount decrements. The
increments are tallied after the decrements, so any decrement seen will
always have its corresponding increment counted. The final refcount is
the difference of the total increments and decrements, preventing a
low-refcount from being returned.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows arch code could decide the way to reserve the ibft.
And we should reserve ibft as early as possible, instead of BOOTMEM
stage, in case the table is in RAM range and is not reserved by BIOS
(this will often be the case.)
Move to just after find_smp_config().
Also when CONFIG_NO_BOOTMEM=y, We will not have reserve_bootmem() anymore.
-v2: fix typo about ibft pointed by Konrad Rzeszutek Wilk <konrad@darnok.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4BB510FB.80601@kernel.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Peter Jones <pjones@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
CC: Jan Beulich <jbeulich@novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
I noticed that my KVM virtual machines were experiencing IDE
issues resulting in processes stuck on waiting for buffers to
complete.
The root cause is of course race conditions in the ancient qemu
backend that I'm using. However, the fact that the guest isn't
recovering is a bug.
I've tracked it down to the change made last year to dequeue
requests at the start rather than at the end in the IDE layer.
commit 8f6205cd57
Author: Tejun Heo <tj@kernel.org>
Date: Fri May 8 11:53:59 2009 +0900
ide: dequeue in-flight request
The problem is that the function ide_dma_timeout_retry does not
requeue the current request, causing one request to be lost for
each DMA timeout.
This patch fixes this by requeueing the request.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Scheduler's task migration events don't work because they always
pass NULL regs perf_sw_event(). The event hence gets filtered
in perf_swevent_add().
Scheduler's context switches events use task_pt_regs() to get
the context when the event occured which is a wrong thing to
do as this won't give us the place in the kernel where we went
to sleep but the place where we left userspace. The result is
even more wrong if we switch from a kernel thread.
Use the hot regs snapshot for both events as they belong to the
non-interrupt/exception based events family. Unlike page faults
or so that provide the regs matching the exact origin of the event,
we need to save the current context.
This makes the task migration event working and fix the context
switch callchains and origin ip.
Example: perf record -a -e cs
Before:
10.91% ksoftirqd/0 0 [k] 0000000000000000
|
--- (nil)
perf_callchain
perf_prepare_sample
__perf_event_overflow
perf_swevent_overflow
perf_swevent_add
perf_swevent_ctx_event
do_perf_sw_event
__perf_sw_event
perf_event_task_sched_out
schedule
run_ksoftirqd
kthread
kernel_thread_helper
After:
23.77% hald-addon-stor [kernel.kallsyms] [k] schedule
|
--- schedule
|
|--60.00%-- schedule_timeout
| wait_for_common
| wait_for_completion
| blk_execute_rq
| scsi_execute
| scsi_execute_req
| sr_test_unit_ready
| |
| |--66.67%-- sr_media_change
| | media_changed
| | cdrom_media_changed
| | sr_block_media_changed
| | check_disk_change
| | cdrom_open
v2: Always build perf_arch_fetch_caller_regs() now that software
events need that too. They don't need it from modules, unlike trace
events, so we keep the EXPORT_SYMBOL in trace_event_perf.c
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Fix build for CONFIG_MODULES not enabled by providing a stub
for is_module_percpu_address().
kernel/lockdep.c:605: error: implicit declaration of function 'is_module_percpu_address'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
percpu.h has always been including slab.h to get k[mz]alloc/free() for
UP inline implementation. percpu.h being used by very low level
headers including module.h and sched.h, this meant that a lot files
unintentionally got slab.h inclusion.
Lee Schermerhorn was trying to make topology.h use percpu.h and got
bitten by this implicit inclusion. The right thing to do is break
this ultimately unnecessary dependency. The previous patch added
explicit inclusion of either gfp.h or slab.h to the source files using
them. This patch updates percpu.h such that slab.h is no longer
included from percpu.h.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
In commit 9df93939b7 ("ext3: Use bitops to read/modify
EXT3_I(inode)->i_state") ext3 changed its internal 'i_state' variable to
use bitops for its state handling. However, unline the same ext4
change, it didn't actually change the name of the field when it changed
the semantics of it.
As a result, an old use of 'i_state' remained in fs/ext3/ialloc.c that
initialized the field to EXT3_STATE_NEW. And that does not work
_at_all_ when we're now working with individually named bits rather than
values that get masked. So the code tried to mark the state to be new,
but in actual fact set the field to EXT3_STATE_JDATA. Which makes no
sense at all, and screws up all the code that checks whether the inode
was newly allocated.
In particular, it made the xattr code unhappy, and caused various random
behavior, like apparently
https://bugzilla.redhat.com/show_bug.cgi?id=577911
So fix the initialization, and rename the field to match ext4 so that we
don't have this happen again.
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Daniel J Walsh <dwalsh@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pl061.h is using u8 type. including <linux/types.h> in pl061.h to avoid
warning.
Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
Acked-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
linux/amba/bus.h have dependencies on linux/device.h and linux/resource.h, but
it doesn't include them. We get compilation errors in our files which include
bus.h but doesn't include device.h and resource.h. This patch includes device.h
and resource.h in linux/amba/bus.h file.
Signed-off-by: Viresh Kumar <viresh.kumar@st.com>
Acked-by: Linux Walleij <linux.ml.walleij@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>