Dave Sterba had put in patches to look for mixed data/metadata groups
with metadata bigger than 4KB. But these ended up in the wrong place
and it wasn't testing the feature flag correctly.
This updates the tests to make sure our sizes are matching
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we use autodefrag, we forget to update the index which indicates
the last page we've dirty. And we'll set dirty flags on a same set of
pages again and again.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs/ -oautodefrag
$ dd if=/dev/zero of=/mnt/btrfs/foobar bs=4k count=10 oflag=direct 2>/dev/null
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 3072 10 eof
/mnt/btrfs/foobar: 1 extent found
Now we have a big real extent [0, 40960), but autodefrag will still defrag it.
$ sync
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 3082 10 eof
/mnt/btrfs/foobar: 1 extent found
So if we already find a big real extent, we're ok about that, just skip it.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If our file's layout is as follows:
| hole | data1 | hole | data2 |
we do not need to defrag this file, because this file has holes and
cannot be merged into one extent.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
$ mkfs.btrfs disk
$ mount disk /mnt -o autodefrag
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
$ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
seek=$i conv=notrunc 2> /dev/null; done && sync
then we'll get to defrag "foobar" again and again.
So does option "-o autodefrag,compress".
Reasons:
When the cleaner kthread gets to fetch inodes from the defrag tree and defrag
them, it will dirty pages and submit them, this will comes to another DATA COW
where the processing inode will be inserted to the defrag tree again.
This patch sets a rule for COW code, i.e. insert an inode when we're really
going to make some defragments.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 600a45e1d5
(Btrfs: fix deadlock on page lock when doing auto-defragment)
fixes the deadlock on page, but it also introduces another bug.
A page may have been truncated after unlock & lock.
So we need to find it again to get the right one.
And since we've held i_mutex lock, inode size remains unchanged and
we can drop isize overflow checks.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bug is from running xfstests 209 with autodefrag.
The race is as follows:
t1 t2(autodefrag)
direct IO
invalidate pagecache
dio(old data) add_inode_defrag
invalidate pagecache
endio
direct IO
invalidate pagecache
run_defrag
readpage(old data)
set page dirty (old data)
dio(new data, rewrite)
invalidate pagecache (*)
endio
t2(autodefrag) will get old data into pagecache via readpage and set
pagecache dirty. Meanwhile, invalidate pagecache(*) will fail due to
dirty flags in pages. So the old data may be flushed into disk by
flush thread, which will lead to data loss.
And so does the case of user defragment progs.
The patch fixes this race by holding i_mutex when we readpage and set page dirty.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This deadlock comes from xfstests 251.
We'll hold the chunk_mutex throughout the whole of a chunk allocation.
But if we find that we've used up system chunk space, we need to allocate a
new system chunk, but this will lead to a recursion of chunk allocation and end
up with a deadlock on chunk_mutex.
So instead we need to allocate the system chunk first if we find we're in ENOSPC.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
o For space info, the type of space info is useful for debug.
o For transaction handle, its transid is useful.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Otherwise, we get a warning or error similar to this when building with
CONFIG_NFSD_V4 disabled:
ERROR: "nfsd4_cld_block" [fs/nfsd/nfsd.ko] undefined!
Fix this by wrapping the calls to rpc_pipefs_notifier_register and
..._unregister in another function and providing no-op replacements
when CONFIG_NFSD_V4 is disabled.
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
With the big metadata blocks, we can have crc items
that are much bigger than a page. There are a few
places that we try to kmalloc memory to hold the
items during a split.
Items bigger than 4KB don't really have a huge benefit
in efficiency, but they do trigger larger order allocations.
This commits changes the csums to make sure they stay under
4KB. This is not a format change, just a #define to limit
huge items.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs puts the filesystem metadata into its own address space, and
somehow the block device address space isn't getting onto disk properly
before a mount. The end result is that a loop of mkfs and mounting the
filesystem will sometimes find stale or incorrect data.
This commit should fix it by sprinkling fdatawrites and invalidate_bdev
calls around. This is a short term measure to make sure it is fixed.
The block devices really should be flushed and cleaned up higher in the
stack.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With support for bigger metadata blocks, we must avoid mounting a
filesystem with different block size for mixed block groups, this causes
corruption (found by xfstests/083).
Signed-off-by: David Sterba <dsterba@suse.cz>
If CONFIG_NET_NS, CONFIG_UTS_NS and CONFIG_IPC_NS are disabled,
ns_entries[] becomes empty and things like
ns_entries[ARRAY_SIZE(ns_entries) - 1] will explode.
Reported-by: Richard Weinberger <richard@nod.at>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In several code paths, such as when unmounting a file system (but not
only) we send an IPI to ask each cpu to invalidate its local LRU BHs.
For multi-cores systems that have many cpus that may not have any LRU BH
because they are idle or because they have not performed any file system
accesses since last invalidation (e.g. CPU crunching on high perfomance
computing nodes that write results to shared memory or only using
filesystems that do not use the bh layer.) This can lead to loss of
performance each time someone switches the KVM (the virtual keyboard and
screen type, not the hypervisor) if it has a USB storage stuck in.
This patch attempts to only send an IPI to cpus that have LRU BH.
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pmd_trans_unstable() should be called before pmd_offset_map() in the
locations where the mmap_sem is held for reading.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This isn't actually correct, but it works with the Linux client, and
agrees with the behavior we used to have before commit 80fc015bdf.
Later patches will implement the spec-mandated behavior (which is to use
the security parameters explicitly given by the client in create_session
or backchannel_ctl).
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:
perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>
asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.
asm/system.h has been disintegrated by this point on all arches into the
following common segments:
(1) asm/barrier.h
Moved memory barrier definitions here.
(2) asm/cmpxchg.h
Moved xchg() and cmpxchg() here. #included in asm/atomic.h.
(3) asm/bug.h
Moved die() and similar here.
(4) asm/exec.h
Moved arch_align_stack() here.
(5) asm/elf.h
Moved AT_VECTOR_SIZE_ARCH here.
(6) asm/switch_to.h
Moved switch_to() here.
Signed-off-by: David Howells <dhowells@redhat.com>
In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
case; in two subcases the inode i_lock is properly released but this
does not occur in the -ELOOP subcase.
This seems to have been introduced by commit 1836750115 ("fix loop
checks in d_materialise_unique()").
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: stable@vger.kernel.org # v3.0+
[ Added a comment, and moved the unlock to where we generate the -ELOOP,
which seems to be more natural.
You probably can't actually trigger this without a buggy network file
server - d_materialize_unique() is for finding aliases on non-local
filesystems, and the d_ancestor() case is for a hardlinked directory
loop.
But we should be robust in the case of such buggy servers anyway. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we find that "cup" is NULL in this case, then we obviously don't
want to dereference it. What we really want to print in this case
is the xid that we copied off earlier.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
"id" is type is a uid_t (32 bits) but on 64 bit systems strict_strtoul()
modifies 64 bits of data. We should use kstrtouint() instead.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Firstly, task->tk_status will always return negative error values,
so the current tests for 'NFS4ERR_DELEG_REVOKED' etc. are all being
ignored.
Secondly, clean up the code so that we only need to test
task->tk_status once!
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
We can currently loop forever in nfs4_lookup_root() and in
nfs41_proc_secinfo_no_name(), if the first iteration returns a
NFS4ERR_DELAY or something else that causes exception.retry to get
set.
Reported-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
When an IO error happens during inode deletion run from
xlog_recover_process_iunlinks() filesystem gets shutdown. Thus any subsequent
attempt to read buffers fails. Code in xlog_recover_process_iunlinks() does not
count with the fact that read of a buffer which was read a while ago can
really fail which results in the oops on
agi = XFS_BUF_TO_AGI(agibp);
Fix the problem by cleaning up the buffer handling in
xlog_recover_process_iunlinks() as suggested by Dave Chinner. We release buffer
lock but keep buffer reference to AG buffer. That is enough for buffer to stay
pinned in memory and we don't have to call xfs_read_agi() all the time.
CC: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_ioc_fstrim() doesn't treat the incoming offset and length
correctly. It treats them as a filesystem block address, rather than
a disk address. This is wrong because the range passed in is a
linear representation, while the filesystem block address notation
is a sparse representation. Hence we cannot convert the range direct
to filesystem block units and then use that for calculating the
range to trim.
While this sounds dangerous, the problem is limited to calculating
what AGs need to be trimmed. The code that calcuates the actual
ranges to trim gets the right result (i.e. only ever discards free
space), even though it uses the wrong ranges to limit what is
trimmed. Hence this is not a bug that endangers user data.
Fix this by treating the range as a disk address range and use the
appropriate functions to convert the range into the desired formats
for calculations.
Further, fix the first free extent lookup (the longest) to actually
find the largest free extent. Currently this lookup uses a <=
lookup, which results in finding the extent to the left of the
largest because we can never get an exact match on the largest
extent. This is due to the fact that while we know it's size, we
don't know it's location and so the exact match fails and we move
one record to the left to get the next largest extent. Instead, use
a >= search so that the lookup returns the largest extent regardless
of the fact we don't get an exact match on it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
Scrub used to be coded for nodesize == leafsize == sectorsize == PAGE_SIZE.
This is now changed to support sizes for nodesize and leafsize which are
N * PAGE_SIZE.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Just a minor cleanup commit in preparation for the big block changes.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Readahead already has a define for the max number of mirrors. Scrub
needs such a define now, the rest of the code will need something
like this soon. Therefore the define was added to ctree.h and removed
from the readahead code.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The 'forcemand' form of 'forcemandatorylock' mount option was missed
when the code moved to use the standard token parser. Return it back.
Also fix a comment style in the parser.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
If relocate of block group 0 fails with ENOSPC we end up infinitely
looping because key.offset -= 1 statement in that case brings us back to
where we started.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
init_ipath() allocates btrfs_data_container which is never freed. Free
it in free_ipath() and nuke the comment for init_data_container() - we
can safely free it with kfree().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Generally we don't allow dup for data, but mixed chunks are special and
people seem to think this has its use cases.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Do not run sanity checks on all target profiles unless they all will be
used. This came up because alloc_profile_is_valid() is now more strict
than it used to be.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently if we don't have enough space allocated we go ahead and loop
though devices in the hopes of finding enough space for a chunk of the
*same* type as the one we are trying to relocate. The problem with that
is that if we are trying to restripe the chunk its target type can be
more relaxed than the current one (eg require less devices or less
space). So, when restriping, run checks against the target profile
instead of the current one.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add __get_block_group_index() helper to be able to derive block group
index from an arbitary set of flags. Implement get_block_group_index()
in terms of it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Header file is not a good place to define functions. This also moves a
call to alloc_profile_is_valid() down the stack and removes a redundant
check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it
into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
"0" is a valid value for an on-disk chunk profile, but it is not a valid
extended profile. (We have a separate bit for single chunks in extended
case)
Also rename it to alloc_profile_is_valid() for clarity.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add functions to abstract the conversion between chunk and extended
allocation profile formats and switch everybody to use them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This has been causing a lot of confusion for quite a while now and a lot
of users were surprised by this (some of them were even stuck in a
ENOSPC situation which they couldn't easily get out of). The addition
of restriper gives users a clear choice between raid0 and drive concat
setup so there's absolutely no excuse for us to keep doing this.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In commit 4692cf58 we introduced new backref walking code for btrfs. This
assumes we're searching live roots, which requires a transaction context.
While scrubbing, however, we must not join a transaction because this could
deadlock with the commit path. Additionally, what scrub really wants to do
is resolving a logical address in the commit root it's currently checking.
This patch adds support for logical to path resolving on commit roots and
makes scrub use that.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The two helper functions commit_cowonly_roots() and
create_pending_snapshot() failed to check the return value from
btrfs_cow_block(), which could at least in theory fail with -ENOSPC from
btrfs_alloc_free_block(). This commit adds the missing checks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
btrfs_init_lockdep only makes our lockdep class names look prettier, thus
it did never hurt we forgot to actually call it. This turns our lockdep
identifier strings from lockdep auto-set #[id] into really pretty
"btrfs-fs-01" or "btrfs-csum-03".
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Since we need to read and write extent buffers in their entirety we can't use
the normal bio_readpage_error stuff since it only works on a per page basis. So
instead make it so that if we see an io error in endio we just mark the eb as
having an IO error and then in btree_read_extent_buffer_pages we will manually
try other mirrors and then overwrite the bad mirror if we find a good copy.
This works with larger than page size blocks. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>