Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.
However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs. We have had a case where a machine with
46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
dealing with pagecache pages. However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).
This patch implements slab reclaim during zone reclaim. Zone reclaim
occurs if there is a danger of an off node allocation. At that point we
1. Shrink the per node page cache if the number of pagecache
pages is more than min_unmapped_ratio percent of pages in a zone.
2. Shrink the slab cache if the number of the nodes reclaimable slab pages
(patch depends on earlier one that implements that counter)
are more than min_slab_ratio (a new /proc/sys/vm tunable).
The shrinking of the slab cache is a bit problematic since it is not node
specific. So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit. I hope we will have zone based
slab reclaim at some point which will make that easier.
The default for the min_slab_ratio is 5%
Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.
[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the atomic counter for slab_reclaim_pages and replace the counter
and NR_SLAB with two ZVC counter that account for unreclaimable and
reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE.
Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE. The
intend seems to be to check for slab pages that could be freed.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*_pages is a better description of the role of the variable.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In many places we will need to use the same combination of flags. Specify
a single GFP_THISNODE definition for ease of use in gfp.h.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
flag is essential if a kernel component requires memory to be located on a
certain node. It will be needed for alloc_pages_node() to force allocation
on the indicated node and for alloc_pages() to force allocation on the
current node.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Let's try to keep mm/ comments more useful and up to date. This is a start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
lock_page needs the caller to have a reference on the page->mapping inode
due to sync_page, ergo set_page_dirty_lock is obviously buggy according to
its comments.
Solve it by introducing a new lock_page_nosync which does not do a sync_page.
akpm: unpleasant solution to an unpleasant problem. If it goes wrong it could
cause great slowdowns while the lock_page() caller waits for kblockd to
perform the unplug. And if a filesystem has special sync_page() requirements
(none presently do), permanent hangs are possible.
otoh, set_page_dirty_lock() is usually (always?) called against userspace
pages. They are always up-to-date, so there shouldn't be any pending read I/O
against these pages.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch splits alloc_percpu() up into two phases. Likewise for
free_percpu(). This allows clients to limit initial allocations to online
cpu's, and to populate or depopulate per-cpu data at run time as needed:
struct my_struct *obj;
/* initial allocation for online cpu's */
obj = percpu_alloc(sizeof(struct my_struct), GFP_KERNEL);
...
/* populate per-cpu data for cpu coming online */
ptr = percpu_populate(obj, sizeof(struct my_struct), GFP_KERNEL, cpu);
...
/* access per-cpu object */
ptr = percpu_ptr(obj, smp_processor_id());
...
/* depopulate per-cpu data for cpu going offline */
percpu_depopulate(obj, cpu);
...
/* final removal */
percpu_free(obj);
Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a notifer chain to the out of memory killer. If one of the registered
callbacks could release some memory, do not kill the process but return and
retry the allocation that forced the oom killer to run.
The purpose of the notifier is to add a safety net in the presence of
memory ballooners. If the resource manager inflated the balloon to a size
where memory allocations can not be satisfied anymore, it is better to
deflate the balloon a bit instead of killing processes.
The implementation for the s390 ballooner is included.
[akpm@osdl.org: cleanups]
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I wonder why we need this bitmask indexing into zone->node_zonelists[]?
We always start with the highest zone and then include all lower zones
if we build zonelists.
Are there really cases where we need allocation from ZONE_DMA or
ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
of highest_zone() makes that already impossible.
If we go linear on the index then gfp_zone() == highest_zone() and a lot
of definitions fall by the wayside.
We can now revert back to the use of gfp_zone() in mempolicy.c ;-)
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
After we have done this we can now do some typing cleanup.
The memory policy layer keeps a policy_zone that specifies
the zone that gets memory policies applied. This variable
can now be of type enum zone_type.
The check_highest_zone function and the build_zonelists funnctionm must
then also take a enum zone_type parameter.
Plus there are a number of loops over zones that also should use
zone_type.
We run into some troubles at some points with functions that need a
zone_type variable to become -1. Fix that up.
[pj@sgi.com: fix set_mempolicy() crash]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is a check in zonelist_policy that compares pieces of the bitmap
obtained from a gfp mask via GFP_ZONETYPES with a zone number in function
zonelist_policy().
The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
The policy_zone is a zone number with the possible values of ZONE_DMA,
ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
of values.
For some reason seemed to work before the zone reduction patchset (It
definitely works on SGI boxes since we just have one zone and the check
cannot fail).
With the zone reduction patchset this check definitely fails on systems
with two zones if the system actually has memory in both zones.
This is because ZONE_NORMAL is selected using no __GFP flag at
all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1.
policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
populated.
For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is <
policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory
allocations!
Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
to DMA allocations!
What we realy want in that place is to establish the highest allowable
zone for a given gfp_mask. If the highest zone is higher or equal to the
policy_zone then memory policies need to be applied. We have such
a highest_zone() function in page_alloc.c.
So move the highest_zone() function from mm/page_alloc.c into
include/linux/gfp.h. On the way we simplify the function and use the new
zone_type that was also introduced with the zone reduction patchset plus we
also specify the right type for the gfp flags parameter.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
eventcounters: Do not display counters for zones that are not available on an
arch
Do not define or display counters for the DMA32 and the HIGHMEM zone if such
zones were not configured.
[akpm@osdl.org: s390 fix]
[heiko.carstens@de.ibm.com: s390 fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make ZONE_HIGHMEM optional
- ifdef out code and definitions related to CONFIG_HIGHMEM
- __GFP_HIGHMEM falls back to normal allocations if there is no
ZONE_HIGHMEM
- GFP_ZONEMASK becomes 0x01 if there is no DMA32 and no HIGHMEM
zone.
[jdike@addtoit.com: build fix]
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make ZONE_DMA32 optional
- Add #ifdefs around ZONE_DMA32 specific code and definitions.
- Add CONFIG_ZONE_DMA32 config option and use that for x86_64
that alone needs this zone.
- Remove the use of CONFIG_DMA_IS_DMA32 and CONFIG_DMA_IS_NORMAL
for ia64 and fix up the way per node ZVCs are calculated.
- Fall back to prior GFP_ZONEMASK of 0x03 if there is no
DMA32 zone.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use enum for zones and reformat zones dependent information
Add comments explaning the use of zones and add a zones_t type for zone
numbers.
Line up information that will be #ifdefd by the following patches.
[akpm@osdl.org: comment cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move totalhigh_pages and nr_free_highpages() into highmem.c/.h
Move the totalhigh_pages definition into highmem.c/.h. Move the
nr_free_highpages function into highmem.c
[yoichi_yuasa@tripeaks.co.jp: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It fixes various coding style issues, specially when spaces are useless. For
example '*' go next to the function name.
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
__init in headers is pretty useless because the compiler doesn't check it, and
they get out of sync relatively frequently. So if you see an __init in a
header file, it's quite unreliable and you need to check the definition
anyway.
Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes the following needlessly global functions static:
- slab.c: kmem_find_general_cachep()
- swap.c: __page_cache_release()
- vmalloc.c: __vmalloc_node()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Now that we can detect writers of shared mappings, throttle them. Avoids OOM
by surprise.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Tracking of dirty pages in shared writeable mmap()s.
The idea is simple: write protect clean shared writeable pages, catch the
write-fault, make writeable and set dirty. On page write-back clean all the
PTE dirty bits and write protect them once again.
The implementation is a tad harder, mainly because the default
backing_dev_info capabilities were too loosely maintained. Hence it is not
enough to test the backing_dev_info for cap_account_dirty.
The current heuristic is as follows, a VMA is eligible when:
- its shared writeable
(vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
- it is not a 'special' mapping
(vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
- the backing_dev_info is cap_account_dirty
mapping_cap_account_dirty(vma->vm_file->f_mapping)
- f_op->mmap() didn't change the default page protection
Page from remap_pfn_range() are explicitly excluded because their COW
semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and
because they don't have a backing store anyway.
mprotect() is taught about the new behaviour as well. However it overrides
the last condition.
Cleaning the pages on write-back is done with page_mkclean() a new rmap call.
It can be called on any page, but is currently only implemented for mapped
pages, if the page is found the be of a VMA that accounts dirty pages it will
also wrprotect the PTE.
Finally, in fs/buffers.c:try_to_free_buffers(); remove clear_page_dirty() from
under ->private_lock. This seems to be safe, since ->private_lock is used to
serialize access to the buffers, not the page itself. This is needed because
clear_page_dirty() will call into page_mkclean() and would thereby violate
locking order.
[dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce a VM_BUG_ON, which is turned on with CONFIG_DEBUG_VM. Use this
in the lightweight, inline refcounting functions; PageLRU and PageActive
checks in vmscan, because they're pretty well confined to vmscan. And in
page allocate/free fastpaths which can be the hottest parts of the kernel
for kbuilds.
Unlike BUG_ON, VM_BUG_ON must not be used to execute statements with
side-effects, and should not be used outside core mm code.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Give non-highmem architectures access to the kmap API for the purposes of
overriding (this is what the attached patch does).
The proposal is that we should now require all architectures with coherence
issues to manage data coherence via the kmap/kunmap API. Thus driver
writers never have to write code like
kmap(page)
modify data in page
flush_kernel_dcache_page(page)
kunmap(page)
instead, kmap/kunmap will manage the coherence and driver (and filesystem)
writers don't need to worry about how to flush between kmap and kunmap.
For most architectures, the page only needs to be flushed if it was
actually written to *and* there are user mappings of it, so the best
implementation looks to be: clear the page dirty pte bit in the kernel page
tables on kmap and on kunmap, check page->mappings for user maps, and then
the dirty bit, and only flush if it both has user mappings and is dirty.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
get_cpu_var()/per_cpu()/__get_cpu_var() arguments must be simple
identifiers. Otherwise the arch dependent implementations might break.
This patch enforces the correct usage of the macros by producing a syntax
error if the variable is not a simple identifier.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current time_before/time_after macros will fail typechecks
when passed u64 values (as returned by get_jiffies_64()). On 64bit
systems, this will just result in a warning about mismatching types
without explicit casts, but since unsigned long and u64
(unsigned long long) are of same size, it will still work.
On 32bit systems, a long is 32bits, so the value from get_jiffies_64()
will be truncated by the cast and thus lose all the precision gained by
64bit jiffies.
Signed-off-by: Dmitriy Zavin <dmitriyz@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This patch adds the per thread cookie field to the task struct and the PDA.
Also it makes sure that the PDA value gets the new cookie value at context
switch, and that a new task gets a new cookie at task creation time.
Signed-off-by: Arjan van Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Andi Kleen <ak@suse.de>
The EDD code would scan the command line as a fixed array, without
taking account of either whitespace, null-termination, the old
command-line protocol, late overrides early, or the fact that the
command line may not be reachable from INITSEG.
This should fix those problems, and enable us to use a longer command
line.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Apparently IA64 needs it, but i386/x86-64 don't anymore
since gcc 2.95 support was dropped. Nobody else on linux-arch
requested keeping it generically
Cc: tony.luck@intel.com
Cc: kaos@sgi.com
Signed-off-by: Andi Kleen <ak@suse.de>
Right now the kernel on x86-64 has a 100% lazy fpu behavior: after *every*
context switch a trap is taken for the first FPU use to restore the FPU
context lazily. This is of course great for applications that have very
sporadic or no FPU use (since then you avoid doing the expensive
save/restore all the time). However for very frequent FPU users... you
take an extra trap every context switch.
The patch below adds a simple heuristic to this code: After 5 consecutive
context switches of FPU use, the lazy behavior is disabled and the context
gets restored every context switch. If the app indeed uses the FPU, the
trap is avoided. (the chance of the 6th time slice using FPU after the
previous 5 having done so are quite high obviously).
After 256 switches, this is reset and lazy behavior is returned (until
there are 5 consecutive ones again). The reason for this is to give apps
that do longer bursts of FPU use still the lazy behavior back after some
time.
[akpm@osdl.org: place new task_struct field next to jit_keyring to save space]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
This patch moves the entry.S:error_entry to .kprobes.text section,
since code marked unsafe for kprobes jumps directly to entry.S::error_entry,
that must be marked unsafe as well.
This patch also moves all the ".previous.text" asm directives to ".previous"
for kprobes section.
AK: Following a similar i386 patch from Chuck Ebbert
AK: Also merged Jeremy's fix in.
+From: Jeremy Fitzhardinge <jeremy@goop.org>
KPROBE_ENTRY does a .section .kprobes.text, and expects its users to
do a .previous at the end of the function.
Unfortunately, if any code within the function switches sections, for
example .fixup, then the .previous ends up putting all subsequent code
into .fixup. Worse, any subsequent .fixup code gets intermingled with
the code its supposed to be fixing (which is also in .fixup). It's
surprising this didn't cause more havok.
The fix is to use .pushsection/.popsection, so this stuff nests
properly. A further cleanup would be to get rid of all
.section/.previous pairs, since they're inherently fragile.
+From: Chuck Ebbert <76306.1226@compuserve.com>
Because code marked unsafe for kprobes jumps directly to
entry.S::error_code, that must be marked unsafe as well.
The easiest way to do that is to move the page fault entry
point to just before error_code and let it inherit the same
section.
Also moved all the ".previous" asm directives for kprobes
sections to column 1 and removed ".text" from them.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
- Remove unused all_contexts parameter
No caller used it
- Move skip argument into the structure (needed for
followon patches)
Cc: mingo@elte.hu
Signed-off-by: Andi Kleen <ak@suse.de>
For NUMA optimization and some other algorithms it is useful to have a fast
to get the current CPU and node numbers in user space.
x86-64 added a fast way to do this in a vsyscall. This adds a generic
syscall for other architectures to make it a generic portable facility.
I expect some of them will also implement it as a faster vsyscall.
The cache is an optimization for the x86-64 vsyscall optimization. Since
what the syscall returns is an approximation anyways and user space
often wants very fast results it can be cached for some time. The norma
methods to get this information in user space are relatively slow
The vsyscall is in a better position to manage the cache because it has direct
access to a fast time stamp (jiffies). For the generic syscall optimization
it doesn't help much, but enforce a valid argument to keep programs
portable
I only added an i386 syscall entry for now. Other architectures can follow
as needed.
AK: Also added some cleanups from Andrew Morton
Signed-off-by: Andi Kleen <ak@suse.de>
This patch adds a vgetcpu vsyscall, which depending on the CPU RDTSCP
capability uses either the RDTSCP or CPUID to obtain a CPU and node
numbers and pass them to the program.
AK: Lots of changes over Vojtech's original code:
Better prototype for vgetcpu()
It's better to pass the cpu / node numbers as separate arguments
to avoid mistakes when going from SMP to NUMA.
Also add a fast time stamp based cache using a user supplied
argument to speed things more up.
Use fast method from Chuck Ebbert to retrieve node/cpu from
GDT limit instead of CPUID
Made sure RDTSCP init is always executed after node is known.
Drop printk
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
To quote Alan Cox:
The default Linux behaviour on an NMI of either memory or unknown is to
continue operation. For many environments such as scientific computing
it is preferable that the box is taken out and the error dealt with than
an uncorrected parity/ECC error get propogated.
A small number of systems do generate NMI's for bizarre random reasons
such as power management so the default is unchanged. In other respects
the new proc/sys entry works like the existing panic controls already in
that directory.
This is separate to the edac support - EDAC allows supported chipsets to
handle ECC errors well, this change allows unsupported cases to at least
panic rather than cause problems further down the line.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Adds a new /proc/sys/kernel/nmi call that will enable/disable the nmi
watchdog.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
There is a potential deadlock in the driver core. It boils down to
the fact that bus_remove_device() calls klist_remove() instead of
klist_del(), thereby waiting until the reference count of the
klist_node in the bus's klist of devices drops to 0. The refcount
can't reach 0 so long as a modprobe process is trying to bind a new
driver to the device being removed, by calling __driver_attach(). The
problem is that __driver_attach() tries to acquire the device's
parent's semaphore, but the caller of bus_remove_device() is quite
likely to own that semaphore already.
It isn't sufficient just to replace klist_remove() with klist_del().
Doing so runs the risk that the device would remain on the bus's klist
of devices for some time, and so could be bound to another driver even
after it was unregistered. What's needed is a new way to distinguish
whether or not a device is registered, based on a criterion other than
whether its klist_node is linked into the bus's klist of devices. That
way driver binding can fail when the device is unregistered, even if
it is still linked into the klist.
This patch (as782) implements the solution, by adding a new bitflag to
indiate when a struct device is registered, by testing the flag before
allowing a driver to bind a device, and by changing the definition of
the device_is_registered() inline.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This adds the infrastructure for drivers to do a threaded probe, and
waits at init time for all currently outstanding probes to complete.
A new kernel thread will be created when the probe() function for the
driver is called, if the multithread_probe bit is set in the driver
saying it can support this kind of operation.
I have tested this with USB and PCI, and it works, and shaves off a lot
of time in the boot process, but there are issues with finding root boot
disks, and some USB drivers assume that this can never happen, so it is
currently not enabled for any bus type. Individual drivers can enable
this right now if they wish, and bus authors can selectivly turn it on
as well, once they determine that their subsystem will work properly
with it.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Don't be crufty. Mark it __must_check too.
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Those 1500 warnings can be a bit of a pain. Add a config option to shut them
up.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We're getting a lot of crashes in the sysfs/kobject/device/bus/class code and
they're very hard to diagnose.
I'm suspecting that in some cases this is because drivers aren't checking
return values and aren't handling errors correctly. So the code blithely
blunders on and crashes later in very obscure ways.
There's just no reason to ignore errors which can and do occur. So the patch
sprinkles __must_check all over these APIs.
Causes 1,513 new warnings. Heh.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Make sysfs_remove_bin_file() void. If it detects an error,
printk the file name and call dump_stack().
sysfs_hash_and_remove() now returns an error code indicating
its success or failure so that sysfs_remove_bin_file() can
know success/failure.
Convert the only driver that checked the return value of
sysfs_remove_bin_file().
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Makes it easier for devices to create and remove binary attribute files
so they don't have to call directly into sysfs. This is needed to help
with the conversion from struct class_device to struct device.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When moving class_device usage over to device, we need to handle
class_interfaces properly with devices. This patch adds that support.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This change creates a devices/virtual/CLASS_NAME tree for struct devices
that belong to a class, yet do not have a "real" struct device for a
parent. It automatically creates the directories on the fly as needed.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>