There is a bug in writepage and delete_inode which allows jdata files to
invalidate pages from the address space without being in a transaction at
the time. This causes problems in case the pages are in the journal. This
patch fixes that case and prevents the resulting oops.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Move the contents of some headers which contained very
little into more sensible places, and remove the original
header files. This should make it easier to find things.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch implements the FIEMAP ioctl for GFS2. We can use the generic
code (aside from a lock order issue, solved as per Ted Tso's suggestion)
for which I've introduced a new variant of the generic function. We also
have one exception to deal with, namely stuffed files, so we do that
"by hand", setting all the required flags.
This has been tested with a modified (I could only find an old version) of
Eric's test program, and appears to work correctly.
This patch does not currently support FIEMAP of xattrs, but the plan is to add
that feature at some future point.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the
allocations happened. They are done in write_begin, which would always
assume that the filesystem can be entered from reclaim. This bug could
cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really
allow the caller to arbitrarily tinker with the context in which it can be
called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
take the page lock. The only thing any callers care about is __GFP_FS
anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to
accept a nofs argument as well, to honour that flag (while we're there,
change the name to grab_cache_page_write_begin which is more instructive
and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a
filesystem happens to want any extra allocations aside from the pagecache
ones in ints write_begin function, it may now use GFP_KERNEL (rather than
GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags
untouched to the grab_cache_page_write_begin() function. That
just simplifies everybody, and may even allow future expansion of the
logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As suggested by Andreas Dilger, introduce a bgl_lock_ptr() helper in
<linux/blockgroup_lock.h> and add separate sb_bgl_lock() helpers to
filesystem specific header files to break the hidden dependency to
struct ext[234]_sb_info.
Also, while at it, convert the macros to static inlines to try make up
for all the times I broke Andrew Morton's tree.
Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... just make it a binfmt handler like #! one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
They are actually alpha vs. i386/arm/m68k i.e. ecoff vs. aout.
In the only place where we actually tried to handle arm and i386/m68k in
different ways (START_DATA() in coredump handling), the arm variant
works for all of them (i386 and m68k have u.start_code set to 0).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/devpts/inode.c:324: warning: 'compare_init_pts_sb' defined but not used
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just nail the oddments now while this code is being touched
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support containers, allow multiple instances of devpts filesystem, such
that indices of ptys allocated in one instance are independent of ptys
allocated in other instances of devpts.
But to preserve backward compatibility, enable this support for multiple
instances only if:
- CONFIG_DEVPTS_MULTIPLE_INSTANCES is set to Y, and
- '-o newinstance' mount option is specified while mounting devpts
To use multi-instance mount, a container startup script could:
$ ns_exec -cm /bin/bash
$ umount /dev/pts
$ mount -t devpts -o newinstance lxcpts /dev/pts
$ mount -o bind /dev/pts/ptmx /dev/ptmx
$ /usr/sbin/sshd -p 1234
where 'ns_exec -cm /bin/bash' is calls clone() with CLONE_NEWNS flag and execs
/bin/bash in the child process. A pty created by the sshd is not visible in
the original mount of /dev/pts.
USER-SPACE-IMPACT:
- See Documentation/fs/devpts.txt (included in next patch) for user-
space impact in multi-instance and mixed-mode operation.
TODO:
- Update mount(8), pts(4) man pages. Highlight impact of not
redirecting /dev/ptmx to /dev/pts/ptmx after a multi-instance mount.
Changelog[v6]:
- [Dave Hansen] Use new get_init_pts_sb() interface
- [Serge Hallyn] Don't bother displaying 'newinstance' in show_options
- [Serge Hallyn] Use macros (PARSE_REMOUNT/PARSE_MOUNT) instead of 0/1.
- [Serge Hallyn] Check error return from get_sb_single() (now
get_init_pts_sb())
- devpts_pty_kill(): don't dput error dentries
Changelog[v5]:
- Move get_sb_ref() definition to earlier patch
- Move usage info to Documentation/filesystems/devpts.txt (next patch)
- Make ptmx node even in init_pts_ns, now that default mode is 0000
(defined in earlier patch, enabled here).
- Cache ptmx dentry and use to update mode during remount
(defined in earlier patch, enabled here).
- Bugfix: explicitly ignore newinstance on remount (if newinstance was
specified on remount of initial mount, it would be ignored but
/proc/mounts would imply that the option was set)
Changelog[v4]:
- Update patch description to address H. Peter Anvin's comments
- Consolidate multi-instance mode code under new config token,
CONFIG_DEVPTS_MULTIPLE_INSTANCE.
- Move usage-details from patch description to
Documentation/fs/devpts.txt
Changelog[v3]:
- Rename new mount option to 'newinstance'
- Create ptmx nodes only in 'newinstance' mounts
- Bugfix: parse_mount_options() modifies @data but since we need to
parse the @data twice (once in devpts_get_sb() and once during
do_remount_sb()), parse a local copy of @data in devpts_get_sb().
(restructured code in devpts_get_sb() to fix this)
Changelog[v2]:
- Support both single-mount and multiple-mount semantics and
provide '-onewmnt' option to select the semantics.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
See comments in the function header for details. The new interface will
be used in a follow-on patch.
Changelog [v2]:
[Dave Hansen] Replace get_sb_ref() in fs/super.c with get_init_pts_sb()
and make the new interface private to devpts
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/dev/ptmx is closely tied to the devpts filesystem. An open of /dev/ptmx,
allocates the next pty index and the associated device shows up in the
devpts fs as /dev/pts/n.
Wih multiple instancs of devpts filesystem, during an open of /dev/ptmx
we would be unable to determine which instance of the devpts is being
accessed.
So we move the 'ptmx' node into /dev/pts and use the inode of the 'ptmx'
node to identify the superblock and hence the devpts instance. This patch
adds ability for the kernel to internally create the [ptmx, c, 5:2] device
when mounting devpts filesystem. Since the ptmx node in devpts is new and
may surprise some userspace scripts, the default permissions for the new
node is 0000. These permissions can be changed either using chmod or by
remounting with the new '-o ptmxmode=0666' mount option.
Changelog[v5]:
- [Serge Hallyn bugfix]: Letting new_inode() assign inode number to
ptmx can collide with hand-assigning inode numbers to ptys. So,
hand-assign specific inode number to ptmx node also.
- [Serge Hallyn]: Maybe safer to grab root dentry mutex while creating
ptmx node
- [Bugfix with Serge Hallyn] Replace lookup_one_len() in mknod_ptmx()
wih d_alloc_name() (lookup during ->get_sb() locks up system). To
simplify patchset, fold the ptmx_dentry patch into this.
Changelog[v4]:
- Change default permissions of pts/ptmx node to 0000.
- Move code for ptmxmode under #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES.
Changelog[v3]:
- Rename ptmx_mode to ptmxmode (for consistency with 'newinstance')
Changelog[v2]:
- [H. Peter Anvin] Remove mknod() system call support and create the
ptmx node internally.
Changelog[v1]:
- Earlier version of this patch enabled creating /dev/pts/tty as
well. As pointed out by Al Viro and H. Peter Anvin, that is not
really necessary.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move code to parse mount options into a separate function so it can
(later) be shared between mount and remount operations.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With support for multiple mounts of devpts, the 'config' structure really
represents per-mount options rather than config parameters. Rename 'config'
structure to 'pts_mount_opts' and store it in the super-block.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To enable multiple mounts of devpts, 'allocated_ptys' must be a per-mount
variable rather than a global variable. Move 'allocated_ptys' into the
super_block's s_fs_info.
Changelog[v2]:
Define and use DEVPTS_SB() wrapper.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the 'devpts_root' global variable and find the root dentry using
the super_block. The super-block can be found from the device inode, using
the new wrapper, pts_sb_from_inode().
Changelog: This patch is based on an earlier patchset from Serge Hallyn
and Matt Helsley.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... and the same for reiserfs. The difference here is that we need
insert_inode_locked4() to match iget5_locked().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* make ext2_new_inode() put the inode into icache in locked state
* do not unlock until the inode is fully set up; otherwise nfsd
might pick it in half-baked state.
* make sure that ext2_new_inode() does *not* lead to two inodes with the
same inumber hashed at the same time; otherwise a bogus fhandle coming
from nfsd might race with inode creation:
nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
ext2_new_inode(): allocate inode with that inumber
ext2_new_inode(): insert it into icache, set it up and dirty
ext2_write_inode(): get the relevant part of inode table in cache,
set the entry for our inode (and start writing to disk)
nfsd: get CPU again, look into inode table, see nice and sane on-disk
inode, set the in-core inode from it
oops - we have two in-core inodes with the same inumber live in icache,
both used for IO. Welcome to fs corruption...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helpers - insert_inode_locked() and insert_inode_locked4().
Hash new inode, making sure that there's no such inode in icache
already. If there is and it does not end up unhashed (as would
happen if we have nfsd trying to resolve a bogus fhandle), fail.
Otherwise insert our inode into hash and succeed.
In either case have i_state set to new+locked; cleanup ends up
being simpler with such calling conventions.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Creating a generic filesystem notification interface, fsnotify, which will be
used by inotify, dnotify, and eventually fanotify is really starting to
clutter the fs directory. This patch simply moves inotify and dnotify into
fs/notify/inotify and fs/notify/dnotify respectively to make both current fs/
and future notification tidier.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- iget5_locked in bdget really needs blockdev_superblock, instead of
bd_mnt, so bd_mnt could be just a local variable;
- blockdev_superblock really needs __read_mostly, while local var bd_mnt
not;
- make use of sb_is_blkdev_sb in bd_forget, instead of direct reference
to blockdev_superblock.
Signed-off-by: Denis ChengRq <crquan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the hopelessly misguided ->dir_notify(). The only instance (cifs)
has been broken by design from the very beginning; the objects it creates
are never destroyed, keep references to struct file they can outlive, nothing
that could possibly evict them exists on close(2) path *and* no locking
whatsoever is done to prevent races with close(), should the previous, er,
deficiencies someday be dealt with.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of creating the "filp" kmem_cache in vfs_caches_init(),
we can do it a litle be later in files_init(), so that filp_cachep
is static to fs/file_table.c
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[AV: rediffed on top of unification of init_fs]
Initialization of init_fs still uses the deprecated RW_LOCK_UNLOCKED macro.
This patch updates it to use the __RW_LOCK_UNLOCKED(lock) macro.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With all the nameidata removal there's no point anymore for this helper.
Of the three callers left two will go away with the next lookup series
anyway.
Also add proper kerneldoc to inode_permission as this is the main
permission check routine now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need for the nameidata in may_open - a struct path is enough.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
walk_init_root is a tiny helper that is marked __always_inline, has just
one caller and an unused argument. Just merge it into the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We now pass on all MAY_ flags to the filesystems permission routines,
so remove the comment stating the contrary.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Explain that you really need to use the return value of d_path rather than
the buffer you passed into it.
Also fix the comment for seq_path(), the function arguments changed
recently but the comment hadn't been updated in sync.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
no function named d_put(), it should be dput().
Impact: fix document and comment, no functionality changed
Signed-off-by: Zhao Lei <zhaolei@cn.fuijtsu.com>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Cc: Sergey S. Kostyliov <rathamahata@php4.ru>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: adilger@sun.com
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On-disk data corruption could cause a page link to have its i_size set
to PAGE_SIZE (or a multiple thereof) and its contents all non-NUL.
NUL-terminate the link name to ensure this doesn't cause further
problems for the kernel.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The result from readlink is being used to index into the link name
buffer without checking whether it is a valid length. If readlink
returns an error this will fault or cause memory corruption.
Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: ecryptfs-devel@lists.launchpad.net
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The extra semicolon serves no purpose.
Signed-off-by: Julia Lawall <julia@diku.dk>
Reviewed-by: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
struct dentry is one of the most critical structures in the kernel. So it's
sad to see it going neglected.
With CONFIG_PROFILING turned on (which is probably the common case at least
for distros and kernel developers), sizeof(struct dcache) == 208 here
(64-bit). This gives 19 objects per slab.
I packed d_mounted into a hole, and took another 4 bytes off the inline
name length to take the padding out from the end of the structure. This
shinks it to 200 bytes. I could have gone the other way and increased the
length to 40, but I'm aiming for a magic number, read on...
I then got rid of the d_cookie pointer. This shrinks it to 192 bytes. Rant:
why was this ever a good idea? The cookie system should increase its hash
size or use a tree or something if lookups are a problem. Also the "fast
dcookie lookups" in oprofile should be moved into the dcookie code -- how
can oprofile possibly care about the dcookie_mutex? It gets dropped after
get_dcookie() returns so it can't be providing any sort of protection.
At 192 bytes, 21 objects fit into a 4K page, saving about 3MB on my system
with ~140 000 entries allocated. 192 is also a multiple of 64, so we get
nice cacheline alignment on 64 and 32 byte line systems -- any given dentry
will now require 3 cachelines to touch all fields wheras previously it
would require 4.
I know the inline name size was chosen quite carefully, however with the
reduction in cacheline footprint, it should actually be just about as fast
to do a name lookup for a 36 character name as it was before the patch (and
faster for other sizes). The memory footprint savings for names which are
<= 32 or > 36 bytes long should more than make up for the memory cost for
33-36 byte names.
Performance is a feature...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reorder struct inotify_device to remove 8 bytes of padding on 64bit
builds, reducing size to 128 bytes . Therefore allocating from a smaller
slab & using one fewer cachelines.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
----
Hi,
patch against 2.6.28-rc7.
built & tested on AMDX2 desktop.
I've not been able to send this to the listed inotify maintainers, I
just get mail failures. So I guessed filesystem was the best home for
it, hope that's ok.
regards
Richard
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>