Move memory_present() in arch/mips/kernel/setup.c. When using sparsemem
extreme, this function does an allocate for bootmem. This would always
fail since init_bootmem hasn't been called yet.
Move memory_present after free_bootmem. This only marks actual memory
ranges as present instead of the entire address space.
Signed-off-by: Chad Reese <creese@caviumnetworks.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Fix the non-linear memory mapping done via remap_file_pages() -- it
didn't work on any MIPS CPU because the page offset clashing with
_PAGE_FILE and some other page protection bits which should have been left
zeros for this kind of pages.
Signed-off-by: Konstantin Baydarov <kbaidarov@ru.mvista.com>
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
With 64-bit physical address enabled, 'swapon' was causing kernel oops on
Alchemy CPUs (MIPS32) because of the swap entry type field corrupting the
_PAGE_FILE bit in 'pte_low' field. So, switch to storing the swap entry in
'pte_high' field using all its bits except _PAGE_GLOBAL and _PAGE_VALID which
gives 25 bits for the swap entry offset.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
A while ago prom_prepare_cpus was replaced by plat_prepare_cpus but
the declaration has stayed unchanged.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Increase alignment of BogoMIPS loop to 8 bytes. Having the delay loop
overlap cache line boundaries may cause instable delays.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
If we move a mapping from one virtual address to another,
and this changes the virtual color of the mapping to those
pages, we can see corrupt data due to D-cache aliasing.
Check for and deal with this by overriding the move_pte()
macro. Set things up so that other platforms can cleanly
override the move_pte() macro too.
Signed-off-by: David S. Miller <davem@davemloft.net>
A small bugfix for up to now unused instruction definitions, and a
somewhat larger update to cover MIPS32R2 instructions.
Signed-off-by: Thiemo Seufer <ths@networkno.de>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Rename the 64-bit sc_hi and sc_lo arrays to use the same names
as the 32-bit struct sigcontext (sc_mdhi, sc_hi1, et cetera).
Signed-off-by: Daniel Jacobowitz <dan@codesourcery.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
o Implement futex_atomic_op_inuser() operation
o Don't use the R10000-ll/sc bug workaround version for every processor.
branch likely is deprecated and some historic ll/sc processors don't
implement it. In any case it's slow.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Nothing exciting; Linux just didn't know it yet so this is most adding
a value to a case statement.
Signed-off-by: Chris Dearman <chris@mips.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
In case of CONFIG_64BIT_PHYS_ADDR, set_pte() and pte_clear() functions
only set _PAGE_GLOBAL bit in the pte_low field of the buddy PTEs,
forgetting to propagate ito to pte_high. Thus, the both pages might not
really be made global for the CPU (since it AND's the G-bit of the
odd / even PTEs together to decide whether they're global or not). Thus,
if only a single page is allocated via vmalloc() or ioremap(), it's not
really global for CPU (and it must be, since this is kernel mapping),
and thus its ASID is compared against the current process' one -- so,
we'll get into trouble sooner or later... Also, pte_none() will fail
on global pages because _PAGE_GLOBAL bit is set in both pte_low and
pte_high, and pte_val() will return u64 value consisting of those fields
concateneted.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
With recent rewrite for generic bitops, fls() for 32bit kernel with
MIPS64_CPU is broken. Also, ffs(), fls() should be defined the same
way as the libc and compiler built-in routines (returns int instead of
unsigned long).
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Gcc might emit an absolute address for the the "m" constraint which
gas unfortunately does not permit.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Current implementations define NODES_SHIFT in include/asm-xxx/numnodes.h for
each arch. Its definition is sometimes configurable. Indeed, ia64 defines 5
NODES_SHIFT values in the current git tree. But it looks a bit messy.
SGI-SN2(ia64) system requires 1024 nodes, and the number of nodes already has
been changeable by config. Suitable node's number may be changed in the
future even if it is other architecture. So, I wrote configurable node's
number.
This patch set defines just default value for each arch which needs multi
nodes except ia64. But, it is easy to change to configurable if necessary.
On ia64 the number of nodes can be already configured in generic ia64 and SN2
config. But, NODES_SHIFT is defined for DIG64 and HP'S machine too. So, I
changed it so that all platforms can be configured via CONFIG_NODES_SHIFT. It
would be simpler.
See also: http://marc.theaimsgroup.com/?l=linux-kernel&m=114358010523896&w=2
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- fix: initialize the robust list(s) to NULL in copy_process.
- doc update
- cleanup: rename _inuser to _inatomic
- __user cleanups and other small cleanups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patchset provides a new (written from scratch) implementation of robust
futexes, called "lightweight robust futexes". We believe this new
implementation is faster and simpler than the vma-based robust futex solutions
presented before, and we'd like this patchset to be adopted in the upstream
kernel. This is version 1 of the patchset.
Background
----------
What are robust futexes? To answer that, we first need to understand what
futexes are: normal futexes are special types of locks that in the
noncontended case can be acquired/released from userspace without having to
enter the kernel.
A futex is in essence a user-space address, e.g. a 32-bit lock variable
field. If userspace notices contention (the lock is already owned and someone
else wants to grab it too) then the lock is marked with a value that says
"there's a waiter pending", and the sys_futex(FUTEX_WAIT) syscall is used to
wait for the other guy to release it. The kernel creates a 'futex queue'
internally, so that it can later on match up the waiter with the waker -
without them having to know about each other. When the owner thread releases
the futex, it notices (via the variable value) that there were waiter(s)
pending, and does the sys_futex(FUTEX_WAKE) syscall to wake them up. Once all
waiters have taken and released the lock, the futex is again back to
'uncontended' state, and there's no in-kernel state associated with it. The
kernel completely forgets that there ever was a futex at that address. This
method makes futexes very lightweight and scalable.
"Robustness" is about dealing with crashes while holding a lock: if a process
exits prematurely while holding a pthread_mutex_t lock that is also shared
with some other process (e.g. yum segfaults while holding a pthread_mutex_t,
or yum is kill -9-ed), then waiters for that lock need to be notified that the
last owner of the lock exited in some irregular way.
To solve such types of problems, "robust mutex" userspace APIs were created:
pthread_mutex_lock() returns an error value if the owner exits prematurely -
and the new owner can decide whether the data protected by the lock can be
recovered safely.
There is a big conceptual problem with futex based mutexes though: it is the
kernel that destroys the owner task (e.g. due to a SEGFAULT), but the kernel
cannot help with the cleanup: if there is no 'futex queue' (and in most cases
there is none, futexes being fast lightweight locks) then the kernel has no
information to clean up after the held lock! Userspace has no chance to clean
up after the lock either - userspace is the one that crashes, so it has no
opportunity to clean up. Catch-22.
In practice, when e.g. yum is kill -9-ed (or segfaults), a system reboot is
needed to release that futex based lock. This is one of the leading
bugreports against yum.
To solve this problem, 'Robust Futex' patches were created and presented on
lkml: the one written by Todd Kneisel and David Singleton is the most advanced
at the moment. These patches all tried to extend the futex abstraction by
registering futex-based locks in the kernel - and thus give the kernel a
chance to clean up.
E.g. in David Singleton's robust-futex-6.patch, there are 3 new syscall
variants to sys_futex(): FUTEX_REGISTER, FUTEX_DEREGISTER and FUTEX_RECOVER.
The kernel attaches such robust futexes to vmas (via
vma->vm_file->f_mapping->robust_head), and at do_exit() time, all vmas are
searched to see whether they have a robust_head set.
Lots of work went into the vma-based robust-futex patch, and recently it has
improved significantly, but unfortunately it still has two fundamental
problems left:
- they have quite complex locking and race scenarios. The vma-based
patches had been pending for years, but they are still not completely
reliable.
- they have to scan _every_ vma at sys_exit() time, per thread!
The second disadvantage is a real killer: pthread_exit() takes around 1
microsecond on Linux, but with thousands (or tens of thousands) of vmas every
pthread_exit() takes a millisecond or more, also totally destroying the CPU's
L1 and L2 caches!
This is very much noticeable even for normal process sys_exit_group() calls:
the kernel has to do the vma scanning unconditionally! (this is because the
kernel has no knowledge about how many robust futexes there are to be cleaned
up, because a robust futex might have been registered in another task, and the
futex variable might have been simply mmap()-ed into this process's address
space).
This huge overhead forced the creation of CONFIG_FUTEX_ROBUST, but worse than
that: the overhead makes robust futexes impractical for any type of generic
Linux distribution.
So it became clear to us, something had to be done. Last week, when Thomas
Gleixner tried to fix up the vma-based robust futex patch in the -rt tree, he
found a handful of new races and we were talking about it and were analyzing
the situation. At that point a fundamentally different solution occured to
me. This patchset (written in the past couple of days) implements that new
solution. Be warned though - the patchset does things we normally dont do in
Linux, so some might find the approach disturbing. Parental advice
recommended ;-)
New approach to robust futexes
------------------------------
At the heart of this new approach there is a per-thread private list of robust
locks that userspace is holding (maintained by glibc) - which userspace list
is registered with the kernel via a new syscall [this registration happens at
most once per thread lifetime]. At do_exit() time, the kernel checks this
user-space list: are there any robust futex locks to be cleaned up?
In the common case, at do_exit() time, there is no list registered, so the
cost of robust futexes is just a simple current->robust_list != NULL
comparison. If the thread has registered a list, then normally the list is
empty. If the thread/process crashed or terminated in some incorrect way then
the list might be non-empty: in this case the kernel carefully walks the list
[not trusting it], and marks all locks that are owned by this thread with the
FUTEX_OWNER_DEAD bit, and wakes up one waiter (if any).
The list is guaranteed to be private and per-thread, so it's lockless. There
is one race possible though: since adding to and removing from the list is
done after the futex is acquired by glibc, there is a few instructions window
for the thread (or process) to die there, leaving the futex hung. To protect
against this possibility, userspace (glibc) also maintains a simple per-thread
'list_op_pending' field, to allow the kernel to clean up if the thread dies
after acquiring the lock, but just before it could have added itself to the
list. Glibc sets this list_op_pending field before it tries to acquire the
futex, and clears it after the list-add (or list-remove) has finished.
That's all that is needed - all the rest of robust-futex cleanup is done in
userspace [just like with the previous patches].
Ulrich Drepper has implemented the necessary glibc support for this new
mechanism, which fully enables robust mutexes. (Ulrich plans to commit these
changes to glibc-HEAD later today.)
Key differences of this userspace-list based approach, compared to the vma
based method:
- it's much, much faster: at thread exit time, there's no need to loop
over every vma (!), which the VM-based method has to do. Only a very
simple 'is the list empty' op is done.
- no VM changes are needed - 'struct address_space' is left alone.
- no registration of individual locks is needed: robust mutexes dont need
any extra per-lock syscalls. Robust mutexes thus become a very lightweight
primitive - so they dont force the application designer to do a hard choice
between performance and robustness - robust mutexes are just as fast.
- no per-lock kernel allocation happens.
- no resource limits are needed.
- no kernel-space recovery call (FUTEX_RECOVER) is needed.
- the implementation and the locking is "obvious", and there are no
interactions with the VM.
Performance
-----------
I have benchmarked the time needed for the kernel to process a list of 1
million (!) held locks, using the new method [on a 2GHz CPU]:
- with FUTEX_WAIT set [contended mutex]: 130 msecs
- without FUTEX_WAIT set [uncontended mutex]: 30 msecs
I have also measured an approach where glibc does the lock notification [which
it currently does for !pshared robust mutexes], and that took 256 msecs -
clearly slower, due to the 1 million FUTEX_WAKE syscalls userspace had to do.
(1 million held locks are unheard of - we expect at most a handful of locks to
be held at a time. Nevertheless it's nice to know that this approach scales
nicely.)
Implementation details
----------------------
The patch adds two new syscalls: one to register the userspace list, and one
to query the registered list pointer:
asmlinkage long
sys_set_robust_list(struct robust_list_head __user *head,
size_t len);
asmlinkage long
sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
size_t __user *len_ptr);
List registration is very fast: the pointer is simply stored in
current->robust_list. [Note that in the future, if robust futexes become
widespread, we could extend sys_clone() to register a robust-list head for new
threads, without the need of another syscall.]
So there is virtually zero overhead for tasks not using robust futexes, and
even for robust futex users, there is only one extra syscall per thread
lifetime, and the cleanup operation, if it happens, is fast and
straightforward. The kernel doesnt have any internal distinction between
robust and normal futexes.
If a futex is found to be held at exit time, the kernel sets the highest bit
of the futex word:
#define FUTEX_OWNER_DIED 0x40000000
and wakes up the next futex waiter (if any). User-space does the rest of
the cleanup.
Otherwise, robust futexes are acquired by glibc by putting the TID into the
futex field atomically. Waiters set the FUTEX_WAITERS bit:
#define FUTEX_WAITERS 0x80000000
and the remaining bits are for the TID.
Testing, architecture support
-----------------------------
I've tested the new syscalls on x86 and x86_64, and have made sure the parsing
of the userspace list is robust [ ;-) ] even if the list is deliberately
corrupted.
i386 and x86_64 syscalls are wired up at the moment, and Ulrich has tested the
new glibc code (on x86_64 and i386), and it works for his robust-mutex
testcases.
All other architectures should build just fine too - but they wont have the
new syscalls yet.
Architectures need to implement the new futex_atomic_cmpxchg_inuser() inline
function before writing up the syscalls (that function returns -ENOSYS right
now).
This patch:
Add placeholder futex_atomic_cmpxchg_inuser() implementations to every
architecture that supports futexes. It returns -ENOSYS.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As announced in feature-removal-schedule.txt.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Bitmap functions for the minix filesystem and the ext2 filesystem except
ext2_set_bit_atomic() and ext2_clear_bit_atomic() do not require the atomic
guarantees.
But these are defined by using atomic bit operations on several architectures.
(cris, frv, h8300, ia64, m32r, m68k, m68knommu, mips, s390, sh, sh64, sparc,
sparc64, v850, and xtensa)
This patch switches to non atomic bit operation.
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add blkcnt_t as the type of inode.i_blocks. This enables you to make the size
of blkcnt_t either 4 bytes or 8 bytes on 32 bits architecture with CONFIG_LSF.
- CONFIG_LSF
Add new configuration parameter.
- blkcnt_t
On h8300, i386, mips, powerpc, s390 and sh that define sector_t,
blkcnt_t is defined as u64 if CONFIG_LSF is enabled; otherwise it is
defined as unsigned long.
On other architectures, it is defined as unsigned long.
- inode.i_blocks
Change the type from sector_t to blkcnt_t.
Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Implement the half-closed devices notifiation, by adding a new POLLRDHUP
(and its alias EPOLLRDHUP) bit to the existing poll/select sets. Since the
existing POLLHUP handling, that does not report correctly half-closed
devices, was feared to be changed, this implementation leaves the current
POLLHUP reporting unchanged and simply add a new bit that is set in the few
places where it makes sense. The same thing was discussed and conceptually
agreed quite some time ago:
http://lkml.org/lkml/2003/7/12/116
Since this new event bit is added to the existing Linux poll infrastruture,
even the existing poll/select system calls will be able to use it. As far
as the existing POLLHUP handling, the patch leaves it as is. The
pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
archs and sets the bit in the six relevant files. The other attached diff
is the simple change required to sys/epoll.h to add the EPOLLRDHUP
definition.
There is "a stupid program" to test POLLRDHUP delivery here:
http://www.xmailserver.org/pollrdhup-test.c
It tests poll(2), but since the delivery is same epoll(2) will work equally.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Provide abstraction for generating type and size information of assembly
routines and data, while permitting architectures to override these
defaults.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: "Russell King" <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "Andi Kleen" <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
unused isa_...() helpers removed.
Adrian Bunk:
The asm-sh part was rediffed due to unrelated changes.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>