check_unsafe_exec() also notes whether the fs_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient
use of get_fs_struct(), which also raises that sharing count.
This might occasionally cause a setuid program not to change euid,
in the same way as happened with files->count (check_unsafe_exec
also looks at sighand->count, but /proc doesn't raise that one).
We'd prefer exec not to unshare fs_struct: so fix this in procfs,
replacing get_fs_struct() by get_fs_path(), which does path_get
while still holding task_lock, instead of raising fs->count.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
___
fs/proc/base.c | 50 +++++++++++++++--------------------------------
1 file changed, 16 insertions(+), 34 deletions(-)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Malicki reports that setuid sometimes doesn't: very rarely,
a setuid root program does not get root euid; and, by the way,
they have a health check running lsof every few minutes.
Right, check_unsafe_exec() notes whether the files_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
use of get_files_struct(), which also raises that sharing count.
There's a rather simple fix for this: exec's check on files->count
has been redundant ever since 2.6.1 made it unshare_files() (except
while compat_do_execve() omitted to do so) - just remove that check.
[Note to -stable: this patch will not apply before 2.6.29: earlier
releases should just remove the files->count line from unsafe_exec().]
Reported-by: Joe Malicki <jmalicki@metacarta.com>
Narrowed-down-by: Michael Itz <mitz@metacarta.com>
Tested-by: Joe Malicki <jmalicki@metacarta.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2.6.26's commit fd8328be87
"sanitize handling of shared descriptor tables in failing execve()"
moved the unshare_files() from flush_old_exec() and several binfmts
to the head of do_execve(); but forgot to make the same change to
compat_do_execve(), leaving a CLONE_FILES files_struct shared across
exec from a 32-bit process on a 64-bit kernel.
It's arguable whether the files_struct really ought to be unshared
across exec; but 2.6.1 made that so to stop the loading binary's fd
leaking into other threads, and a 32-bit process on a 64-bit kernel
ought to behave in the same way as 32 on 32 and 64 on 64.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To be on the safe side, it should be less fragile to exclude I_NEW inodes
from inode list scans by default (unless there is an important reason to
have them).
Normally they will get excluded (eg. by zero refcount or writecount etc),
however it is a bit fragile for list walkers to know exactly what parts of
the inode state is set up and valid to test when in I_NEW. So along these
lines, move I_NEW checks upward as well (sometimes taking I_FREEING etc
checks with them too -- this shouldn't be a problem should it?)
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new_pts_mount() (including the get_sb_nodev()), shares a lot of code
with init_pts_mount(). The only difference between them is the 'test-super'
function passed into sget().
Move all common code into devpts_get_sb() and remove the new_pts_mount() and
init_pts_mount() functions,
Changelog[v3]:
[Serge Hallyn]: Remove unnecessary printk()s
Changelog[v2]:
(Christoph Hellwig): Merge code in 'do_pts_mount()' into devpts_get_sb()
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With mknod_ptmx() moved to devpts_get_sb(), init_pts_mount() becomes
a wrapper around get_init_pts_sb(). Remove get_init_pts_sb() and
fold code into init_pts_mount().
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We create 'ptmx' node in both single-instance and multiple-instance
mounts. So devpts_get_sb() can call mknod_ptmx() once rather than
have both modes calling mknod_ptmx() separately.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since all the mount option parsing is done in devpts, we could do it
just once and pass it around in devpts functions and eventually store
it in the super block.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On remount, devpts fs only needs to parse the mount options. Users cannot
directly create/dirty files in /dev/pts so the MS_RDONLY flag and
shrinking the dcache does not really apply to devpts.
So effectively on remount, devpts only parses the mount options and updates
these options in its super block. As such, we could replace do_remount_sb()
call with a direct parse_mount_options().
Doing so enables subsequent patches to avoid parsing the mount options twice
and simplify the code.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
simple_set_mnt() is defined as returning 'int' but always returns 0.
Callers assume simple_set_mnt() never fails and don't properly cleanup if
it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev()
should:
up_write(sb->s_unmount);
deactivate_super(sb);
if simple_set_mnt() fails.
Since simple_set_mnt() never fails, would be cleaner if it did not
return anything.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move some block device related code out from buffer.c and put it in
block_dev.c. I'm trying to move non-buffer_head code out of buffer.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We should drop the ->s_umount mutex if an error occurs after the
sget()/grab_super() call. This was introduced when adding support
for multiple instances of devpts and noticed during a code review/reorg.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The last user of do_pipe is in arch/alpha/, after replacing it with
do_pipe_flags, the do_pipe can be totally dropped.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Acked-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Copy symlink data into the union member it is accessed through. Although
this shouldn't make a difference to behaviour it makes the code easier
to follow and grep through. It may also prevent problems if the
struct/union definitions change in the future.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ufs2 fast symlinks can be twice as long as ufs ones, however the code
was using the ufs size in various places. Fix that so ufs2 symlinks over
60 characters aren't truncated.
Note that we copy the entire area instead of using the maxsymlinklen field
from the superblock. This way we will be more robust against corruption (of
the superblock).
While we are at it, use memcpy instead of open-coding it with for loops.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The maximum fast symlink size is set in the superblock of certain types
of UFS filesystem. Before using it we need to check that it isn't longer
than the available space we have in the inode.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a switch for the various i_mode fmt cases, and remove the comment
about writeability of devices nodes - that part is handled in
inode_permission and comment on (briefly) there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure that comments describe what's going on and not how, and always
use __d_instantiate instead of two separate branches, one with
d_instantiate and one with __d_instantiate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Due to a different size of ino_t ustat needs a compat handler, but
currently only x86 and mips provide one. Add a generic compat_sys_ustat
and switch all architectures over to it. Instead of doing various
user copy hacks compat_sys_ustat just reimplements sys_ustat as
it's trivial. This was suggested by Arnd Bergmann.
Found by Eric Sandeen when running xfstests/017 on ppc64, which causes
stack smashing warnings on RHEL/Fedora due to the too large amount of
data writen by the syscall.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In two error cases affs_remove_link doesn't call affs_unlock_dir to
release the i_hash_lock semaphore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't have to start a transaction in writepage() when all the blocks
are a properly allocated. Even in ordered mode either the data has been
written via write() and they are thus already added to transaction's list
or the data was written via mmap and then it's random in which transaction
they get written anyway.
This should help VM to pageout dirty memory without blocking on transaction
commits.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the default behaviour of the kernel to use relatime for all
filesystems. This can be overridden with the "strictatime" mount
option.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow atime to be updated once per day even with relatime. This lets
utilities like tmpreaper (which delete files based on last access time)
continue working, making relatime a plausible default for distributions.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Valerie Aurora Henson <vaurora@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dasd device driver will now support ECKD devices with more then
65520 cylinders.
In the traditional ECKD adressing scheme each track is addressed
by a 16-bit cylinder and 16-bit head number. The new addressing
scheme makes use of the fact that the actual number of heads is
never larger then 15, so 12 bits of the head number can be redefined
to be part of the cylinder address.
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Opencode a cheasy approach with kevent. The idea here is that we'll
add some generic delayed work infrastructure, which probably wont be
based on pdflush (or maybe it will, in which case we can just add it
back).
This is in preparation for getting rid of pdflush completely.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ext2_quota_read() doesn't initialize tmp_bh.b_size before calling
ext2_get_block() where we access it. Since it is a local variable it
might contain some garbage. Make sure it is filled with reasonable
value before passing.
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>