The nfsv4 state manager could potentially deadlock inside
__nfs_inode_return_delegation() if the server reboots, so that the calls to
nfs_msync_inode() end up waiting on state recovery to complete.
Also ensure that if a server reboot or network partition causes us to have
to stop returning delegations, that NFS4CLNT_DELEGRETURN is set so that
the state manager can resume any outstanding delegation returns after it
has dealt with the state recovery situation.
Finally, ensure that the state manager doesn't wait for the DELEGRETURN
call to complete. It doesn't need to, and that too can cause a deadlock.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Subject: [PATCH] nfs: fix acl decoding
Commit 28f566942c "NFS: use dynamically
computed compound_hdr.replen for xdr_inline_pages offset" accidentally
changed the amount of space to allow for the acl reply, resulting in an
IO error on attempts to get an acl.
Reported-by: Paul Rudin <paul@rudin.co.uk>
Cc: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- no one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more
- lumpy pageout will want to do nonblocking writeback without the
congestion wait
So remove the congestion checks as suggested by Chris.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: Alex Elder <aelder@sgi.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It will lower the flush priority for NFS, and maybe more in future.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This is dead code because no bdi flush thread will be started for
!bdi_cap_writeback_dirty bdi.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch fixes some ref counting issues. Firstly by moving
the point at which we drop the ref count after a dlm lock
operation has completed we ensure that we never call
gfs2_glock_hold() on a lock with a zero ref count.
Secondly, by using atomic_dec_and_lock() in gfs2_glock_put()
we ensure that at no time will a glock with zero ref count
appear on the lru_list. That means that we can remove the
check for this in our shrinker (which was racy).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
No one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more. And lumpy pageout will want to do
nonblocking writeback without the congestion wait.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When a gfs2 filesystem is grown, it needs to rebuild the rindex list to be able
to use the new space. gfs2 does this when the rindex is marked not uptodate,
which happens when the rindex glock is dropped. However, on a single node
setup, there is never any reason to drop the rindex glock, so gfs2 never
invalidates the the rindex. This patch makes gfs2 automatically drop the
rindex glock after filesystem grows, so it can refresh the rindex list.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There are two spare field in the header common to all GFS2
metadata. One is just the right size to fit a journal id
in it, and this patch updates the journal code so that each
time a metadata block is modified, we tag it with the journal
id of the node which is performing the modification.
The reason for this is that it should make it much easier to
debug issues which arise if we can tell which node was the
last to modify a particular metadata block.
Since the field is updated before the block is written into
the journal, each journal should only contain metadata which
is tagged with its own journal id. The one exception to this
is the journal header block, which might have a different node's
id in it, if that journal was recovered by another node in the
cluster.
Thus each journal will contain a record of which nodes recovered
it, via the journal header.
The other field in the metadata header could potentially be
used to hold information about what kind of operation was
performed, but for the time being we just zero it on each
transaction so that if we use it for that in future, we'll
know that the information (where it exists) is reliable.
I did consider using the other field to hold the journal
sequence number, however since in GFS2's journaling we write
the modified data into the journal and not the original
data, this gives no information as to what action caused the
modification, so I think we can probably come up with a better
use for those 64 bits in the future.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This function only had one caller left, and that caller only
called it for leaf blocks, hence one branch of the "if" was
never taken. In addition the call to get_left had already
verified the metadata type, so the function can be reduced
to a single line of code in its caller.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Currently gfs2 issues barrier unconditionally. There are various reasons
to disable them, be that just for testing or for stupid devices flushing
large battert backed caches. Add a nobarrier option that matches xfs and
btrfs for this. Also add a symmetric barrier option to turn it back on
at remount time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
It's not necessary to do any 64bit division for the statfs sync code, so
remove it.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 now has three new mount options, statfs_quantum, quota_quantum and
statfs_percent. statfs_quantum and quota_quantum simply allow you to
set the tunables of the same name. Setting setting statfs_quantum to 0
will also turn on the statfs_slow tunable. statfs_percent accepts an
integer between 0 and 100. Numbers between 1 and 100 will cause GFS2 to
do any early sync when the local number of blocks free changes by at
least statfs_percent from the totoal number of blocks free. Setting
statfs_percent to 0 disables this.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds support to GFS2 to send quota warnings via netlink.
Also it removes a stray \r which was left over from when the
code used to print warnings on the console.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Sending a message to userspace in a generic format to warn
of events (e.g. quota exceeded) in the quota subsystem is
a generically useful feature. This patch makes some minor
changes to the send_message function from dquot.c renaming
it quota_send_message, moving it to quota.c and exporting it
for use by filesystems which do not use the dquot code.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds support for viewing the current GFS2 quota settings
via the XFS quota API. The setting of quotas will be addressed
in a later patch. Fields which are not supported here are left
set to zero.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Both of these functions contained confusing and in one case
duplicate code. This patch adds a new check in do_glock()
so that we report -ENOENT if we are asked to sync a quota
entry which doesn't exist. Due to the previous patch this is
now reported correctly to userspace.
Also there are a few new comments, and I hope that the code
is easier to understand now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The "create" argument to qdsb_get() was only ever set to true,
so this patch removes that argument.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There is no point in testing for GLF_DEMOTE here, we might as
well always release the glock at that point.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The plan is to add further operations to the gfs2_quotactl_ops
in future patches. The sync operation is easy, so we start with
that one.
We plan to use the XFS quota control functions because they more
closely match the GFS2 ones.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
These two functions are altered so that gfs2_quota_sync may
in future be called directly from the VFS. The GFS2 superblock
changes to a VFS super block and there is an addition of an int
argument which is currently ignored.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2 needs to call this from under a glock, so we need GFP_NOFS
and I suspect that other filesystems might require this too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The other patches in this series have been building towards
being able to support cached ACLs like other filesystems. The
only real difference with GFS2 is that we have to invalidate
the cache when we drop a glock, but that is dealt with in earlier
patches.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
To prepare for support for caching of ACLs, this cleans up the GFS2
ACL support by pushing the xattr code back into xattr.c and changing
the acl_get function into one which only returns ACLs so that we
can drop the caching function into it shortly.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This code has been shamelessly stolen from XFS at the suggestion
of Christoph Hellwig. I've not added support for cached ACLs so
far... watch for that in a later patch, although this is designed
in such a way that they should be easy to add.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
We have a long term plan to use the "-o meta" flag to GFS2 mounts to
access the alternate root which is used to store metadata for a GFS2
filesystem. This will allow us to eventually remove support for the
gfs2meta filesystem type (which is in any case just a "front end" to
the gfs2 filesystem type with the meta/master root).
Currently the "-o meta" option is only taken into account on the
initial mount of the filesystem. Subsequent mounts of the same
filesystem (i.e. on the same device) result in basically the same
as bind mounting the root of the original mount.
This patch changes that by using what is more or less a copy
of get_sb_bdev() and extending it so that it will take into
account the alternate root in all cases. The main difference
is that we have to parse the mount options a bit earlier. We can
then use them to select the appropriate root towards the end of
the function.
In addition this also fixes a bug where it was possible (but certainly
not desirable) to set different ro/rw options for the meta root
when mounted via the gfs2meta fs compared with the original mount.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
We need to be careful of the ordering between clearing the
GLF_LOCK bit and scheduling the workqueue.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is a real fix for problem of utime/stime values decreasing
described in the thread:
http://lkml.org/lkml/2009/11/3/522
Now cputime is accounted in the following way:
- {u,s}time in task_struct are increased every time when the thread
is interrupted by a tick (timer interrupt).
- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after adjusted by task_times().
- When all threads in a thread_group exits, accumulated {u,s}time
(and also c{u,s}time) in signal struct are added to c{u,s}time
in signal struct of the group's parent.
So {u,s}time in task struct are "raw" tick count, while
{u,s}time and c{u,s}time in signal struct are "adjusted" values.
And accounted values are used by:
- task_times(), to get cputime of a thread:
This function returns adjusted values that originates from raw
{u,s}time and scaled by sum_exec_runtime that accounted by CFS.
- thread_group_cputime(), to get cputime of a thread group:
This function returns sum of all {u,s}time of living threads in
the group, plus {u,s}time in the signal struct that is sum of
adjusted cputimes of all exited threads belonged to the group.
The problem is the return value of thread_group_cputime(),
because it is mixed sum of "raw" value and "adjusted" value:
group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
This misbehavior can break {u,s}time monotonicity.
Assume that if there is a thread that have raw values greater
than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
but only runs 45ms) and if it exits, cputime will decrease (e.g.
-5ms).
To fix this, we could do:
group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
But task_times() contains hard divisions, so applying it for
every thread should be avoided.
This patch fixes the above problem in the following way:
- Modify thread's exit (= __exit_signal()) not to use task_times().
It means {u,s}time in signal struct accumulates raw values instead
of adjusted values. As the result it makes thread_group_cputime()
to return pure sum of "raw" values.
- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts "raw" values of thread_group_cputime() to "adjusted"
values, in same calculation procedure as task_times().
- Modify group's exit (= wait_task_zombie()) to use this introduced
thread_group_times(). It make c{u,s}time in signal struct to
have adjusted values like before this patch.
- Replace some thread_group_cputime() by thread_group_times().
This replacements are only applied where conveys the "adjusted"
cputime to users, and where already uses task_times() near by it.
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
This patch have a positive side effect:
- Before this patch, if a group contains many short-life threads
(e.g. runs 0.9ms and not interrupted by ticks), the group's
cputime could be invisible since thread's cputime was accumulated
after adjusted: imagine adjustment function as adj(ticks, runtime),
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch it will not happen because the adjustment is
applied after accumulated.
v2:
- remove if()s, put new variables into signal_struct.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Spencer Candland <spencer@bluehost.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When IMA is active, using dentry_open without updating the
IMA counters will result in free/open imbalance errors when
fput is eventually called.
Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While building 2.6.32-rc8-git2 for Fedora I noticed the following thinko
in commit 201a15428b ("FS-Cache: Handle
pages pending storage that get evicted under OOM conditions"):
fs/9p/cache.c: In function '__v9fs_fscache_release_page':
fs/9p/cache.c:346: error: 'vnode' undeclared (first use in this function)
fs/9p/cache.c:346: error: (Each undeclared identifier is reported only once
fs/9p/cache.c:346: error: for each function it appears in.)
make[2]: *** [fs/9p/cache.o] Error 1
Fix the 9P filesystem to correctly construct the argument to
fscache_maybe_release_page().
Signed-off-by: Kyle McMartin <kyle@redhat.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com> [from identical patch]
Signed-off-by: Stefan Lippers-Hollmann <s.l-h@gmx.de> [from identical patch]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace all GFP_KERNEL and ls_allocation with GFP_NOFS.
ls_allocation would be GFP_KERNEL for userland lockspaces
and GFP_NOFS for file system lockspaces.
It was discovered that any lockspaces on the system can
affect all others by triggering memory reclaim in the
file system which could in turn call back into the dlm
to acquire locks, deadlocking dlm threads that were
shared by all lockspaces, like dlm_recv.
Signed-off-by: David Teigland <teigland@redhat.com>
In 2.6.23 kernel, commit a32ea1e1f9
("Fix read/truncate race") fixed a race in the generic code, and as a
side effect, now do_generic_file_read() can ask us to readpage() past
the i_size. This seems to be correctly handled by the block routines
(e.g. block_read_full_page() fills the page with zeroes in case if
somebody is trying to read past the last inode's block).
JFFS2 doesn't handle this; it assumes that it won't be asked to read
pages which don't exist -- and thus that there will be at least _one_
valid 'frag' on the page it's being asked to read. It will fill any
holes with the following memset:
memset(buf, 0, min(end, frag->ofs + frag->size) - offset);
When the 'closest smaller match' returned by jffs2_lookup_node_frag() is
actually on a previous page and ends before 'offset', that results in:
memset(buf, 0, <huge unsigned negative>);
Hopefully, in most cases the corruption is fatal, and quickly causing
random oopses, like this:
root@10.0.0.4:~/ltp-fs-20090531# ./testcases/kernel/fs/ftest/ftest01
Unable to handle kernel paging request for data at address 0x00000008
Faulting instruction address: 0xc01cd980
Oops: Kernel access of bad area, sig: 11 [#1]
[...]
NIP [c01cd980] rb_insert_color+0x38/0x184
LR [c0043978] enqueue_hrtimer+0x88/0xc4
Call Trace:
[c6c63b60] [c004f9a8] tick_sched_timer+0xa0/0xe4 (unreliable)
[c6c63b80] [c0043978] enqueue_hrtimer+0x88/0xc4
[c6c63b90] [c0043a48] __run_hrtimer+0x94/0xbc
[c6c63bb0] [c0044628] hrtimer_interrupt+0x140/0x2b8
[c6c63c10] [c000f8e8] timer_interrupt+0x13c/0x254
[c6c63c30] [c001352c] ret_from_except+0x0/0x14
--- Exception: 901 at memset+0x38/0x5c
LR = jffs2_read_inode_range+0x144/0x17c
[c6c63cf0] [00000000] (null) (unreliable)
This patch fixes the issue, plus fixes all LTP tests on NAND/UBI with
JFFS2 filesystem that were failing since 2.6.23 (seems like the bug
above also broke the truncation).
Reported-By: Anton Vorontsov <avorontsov@ru.mvista.com>
Tested-By: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This separates wait function for submitted logs from the write
function nilfs_segctor_write(). A new list of segment buffers
"sc_write_logs" is added to hold logs under writing, and double
buffering is partially applied to hide io latency.
At this point, the double buffering is disabled for blocksize <
pagesize because page dirty flag is turned off during write and dirty
buffers are not properly collected for pages crossing over segments.
To receive full benefit of the double buffering, further refinement is
needed to move the io wait outside the lock section of log writer.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds a few iterator functions for segment buffers to make it easy
to handle multiple series of logs.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Hides nilfs_write_info struct and nilfs_segbuf_prepare_write function
in segbuf.c to simplify the interface of nilfs_segbuf_write function.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This moves io status variables in nilfs_write_info struct to
nilfs_segment_buffer struct.
This is a preparation to hide nilfs_write_info in segment buffer code.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Previously, log writer had possibility to set an io error flag on
segments even in case of memory allocation failure.
This fixes the issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This applies list_splice_tail (or list_splice_tail_init) operation
instead of list_splice (or list_splice_init, respectively) to append a
new list to tail of an existing list.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The comment in fuse_open about O_DIRECT:
"VFS checks this, but only _after_ ->open()"
also holds for fuse_create, however, the same kind of check was missing there.
As an impact of this bug, open(newfile, O_RDWR|O_CREAT|O_DIRECT) fails, but a
stub newfile will remain if the fuse server handled the implied FUSE_CREATE
request appropriately.
Other impact: in the above situation ima_file_free() will complain to open/free
imbalance if CONFIG_IMA is set.
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Harshavardhana <harsha@gluster.com>
Cc: stable@kernel.org
Replace mark_inode_dirty() as nilfs_mark_inode_dirty()
to reduce deep function calls.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>