Add a new mount option which enables a new "lazytime" mode. This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode. The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.
This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.
For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table. The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives. Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact long tail latencies --- which
is a very big deal for web serving tiers (for example).
Google-Bug-Id: 18297052
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The unload_nls() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Get rid of le24 stuff, along with the bitfields use - all that stuff
can be done with standard stuff, in sparse-verifiable manner. Moreover,
that way (shift-and-mask) often generates better code - gcc optimizer
sucks on bitfields...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
----
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
CC: jfs-discussion@lists.sourceforge.net
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter(). A bunch of simple cases coverted to that...
[AV: fixed the braino spotted by Cyrill]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch replaces obsolete simple_str functions by kstr
use kstrtouint for
-uid_t ( __kernel_uid32_t )
-gid_t ( __kernel_gid32_t )
-jfs_sb_info->umask
-jfs_sb_info->minblks_trim
(all unsigned int)
newLVSize is s64 -> use kstrtol
Current parse_options behaviour stays the same ie it doesn't return kstr
rc but just 0 if function failed (parse_options callsites
return -EINVAL when there's anything wrong).
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Static values are automatically initialized to NULL
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Check for NULL before using the acl in the access type switch
statement. This seems to be consistent with what is done in the JFFS
and ext4 filesystems and with the behaviour of JFS in the 3.13 kernel.
The bug seemed to be introduced in commit 2cc6a5a0.
The bug results in a kernel Oops, NULL dereference could not be handled
when accessing a JFS filesystem. The rdiff-backup process seemed to
trigger the bug. See also reported bug #75341:
https://bugzilla.kernel.org/show_bug.cgi?id=75341
Signed-off-by: William Burrow <wbkernel@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
According to commit 5f16f3225b
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page. As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently. At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.
Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate. Reclaim will check
for this flag before installing shadow pages.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes a regression in 3.14-rc1 where xfstests generic/307 fails.
jfs sets the ctime on the inode when writing an xattr. Previously,
jfs went ahead and stored an acl that can be completely represented
in the traditional permission bits, so the ctime was always set in
the xattr code. The new code doesn't bother storing the acl in that
case, thus the ctime isn't getting set.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
I missed a couple errors in reviewing the patches converting jfs
to use the generic posix ACL function. Setting ACL's currently
fails with -EOPNOTSUPP.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Copy the scheme I introduced to btrfs many years ago to only use the
xattr handler for ACLs, but pass plain attrs straight through.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is a potential overflow if the specified EA value size is
greater than USHRT_MAX because the size of value is limited by
the on-disk format (i.e, __le16), this issue could be reflected
via the tests below:
# touch /jfs/testfile
# setfattr -n user.comment -v `perl -e 'print "A"x65536'` /jfs/testfile
setfattr: /jfs/testfile: Invalid argument
Syslog:
... jfs_xsetattr: xattr_size = 21, new_size = 65557
This patch add pre-checkups of EA value size against USHRT_MAX to
avoid this problem, and return -E2BIG which is consistent with the
VFS setxattr interface. Moreover, fix the debug code to print the
correct function name.
With this fix:
setfattr: /jfs/testfile: Argument list too long
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
If insert_inode_locked() fails, we shouldn't be calling
unlock_new_inode().
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Tested-by: Michael L. Semon <mlsemon35@gmail.com>
Cc: stable@vger.kernel.org
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression"). Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFSv4 reserves readdir cookie values 0-2 for special entries (. and ..),
but jfs allows a value of 2 for a non-special entry. This incompatibility
can result in the nfs client reporting a readdir loop.
This patch doesn't change the value stored internally, but adds one to
the value exposed to the iterate method.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use a more current logging style.
Add __printf format and argument verification.
Remove embedded function names from formats.
Add %pf, __builtin_return_address(0) to jfs_error.
Add newlines to formats for kernel style consistency.
(One format already had an erroneous newline)
Coalesce formats and align arguments.
Object size reduced ~1KiB.
$ size fs/jfs/built-in.o*
text data bss dec hex filename
201891 35488 63936 301315 49903 fs/jfs/built-in.o.new
202821 35488 64192 302501 49da5 fs/jfs/built-in.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
CHECK fs/jfs/xattr.c
fs/jfs/xattr.c:1092:5: warning: symbol 'jfs_initxattrs' was not declared. Should it be static?
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
The mentioned functions do not pay attention to the error codes returned
by the functions updateSuper(), lmLogInit() and lmLogShutdown(). It brings
to system crash later when writing to log.
The patch adds corresponding code to check and return the error codes
and to print correct error messages in case of errors.
Found by Linux File System Verification project (linuxtesting.org).
Signed-off-by: Vahram Martirosyan <vahram.martirosyan@linuxtesting.org>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.
Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).
This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.
We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.
Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way .Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
This patch fixes races uncovered by xfstests testcase 068.
One race is the result of jfs_sync() trying to write a sync point to the
journal after it has been frozen (or possibly in the process). Since
freezing sync's the journal, there is no need to write a sync point so
we simply want to return.
The second involves jfs_write_inode() being called on a deleted inode.
It calls jfs_flush_journal which is held up by the jfs_commit thread
doing the final iput on the same deleted inode, which itself is
waiting for the I_SYNC flag to be cleared. jfs_write_inode need not
do anything when i_nlink is zero, which is the easy fix.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
For immutable bvecs, all bi_idx usage needs to be audited - so here
we're removing all the unnecessary uses.
Most of these are places where it was being initialized on a bio that
was just allocated, a few others are conversions to standard macros.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Shifting a 32-bit int by 32 bits is undefined behavior in C, and
results in different behavior on different architectures (e.g., x86
and PowerPC). diAlloc() in fs/jfs/jfs_imap.c computes a mask using
0xffffffffu<<(32-bitno), which can left-shift by 32 bits. To avoid
unexpected behavior, explicitly check for bitno==0 and use a 0 mask.
Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Currently when 'range->start' is beyond the end of file system
nothing is done and that fact is ignored, where in fact we should return
EINVAL. The same problem is when 'range.len' is smaller than file system
block.
Fix this by adding check for such conditions and return EINVAL
appropriately.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Tino Reichardt <milky-kernel@mcmilk.de>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.
Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In a hasty fix to replace a 64-bit division with do_div, I
unintentionally assigned the divisor to a 32-bit variable.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Tino Reichardt <milky-kernel@mcmilk.de>