After we transfer set the EXT4_ERROR_FS bit in the file system
superblock, it's not enough to call jbd2_journal_clear_err() to clear
the error indication from journal superblock --- we need to call
jbd2_journal_update_sb_errno() as well. Otherwise, when the root file
system is mounted read-only, the journal is replayed, and the error
indicator is transferred to the superblock --- but the s_errno field
in the jbd2 superblock is left set (since although we cleared it in
memory, we never flushed it out to disk).
This can end up confusing e2fsck. We should make e2fsck more robust
in this case, but the kernel shouldn't be leaving things in this
confused state, either.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from various jbd and jbd2.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use kmem_cache_zalloc wrapper instead of flag __GFP_ZERO.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Calculate and verify checksums of each data block being stored in the journal.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Calculate and verify the checksum of commit blocks. In checksum v2,
deprecate most of the checksum v1 commit block checksum fields, since
each block has its own checksum.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Calculate and verify a checksum of each descriptor block.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Calculate and verify a checksum covering the journal superblock.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Obtain a reference to the crc32c driver if needed for the v2 checksum.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add in the necessary code so that journal clients can enable the new
journal checksumming features.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Define flags and allocate space in on-disk journal structures to support
checksumming of journal metadata.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
flush request is issued in transaction commit code path, so looks using
GFP_KERNEL to allocate memory for flush request bio falls into the classic
deadlock issue. I saw btrfs and dm get it right, but ext4, xfs and md are
using GFP.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:
perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
Signed-off-by: David Howells <dhowells@redhat.com>
Normally, we have to issue a cache flush before we can update journal tail in
journal superblock, effectively wiping out old transactions from the journal.
So use the fact that during transaction commit we issue cache flush anyway and
opportunistically push journal tail as far as we can. Since update of journal
superblock is still costly (we have to use WRITE_FUA), we update log tail only
if we can free significant amount of space.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
All accesses to checkpointing entries in journal_head are protected
by j_list_lock. Thus __jbd2_journal_remove_checkpoint() doesn't really
need bh_state lock.
Also the only part of journal head that the rest of checkpointing code
needs to check is jh->b_transaction which is safe to read under
j_list_lock.
So we can safely remove bh_state lock from all of checkpointing code which
makes it considerably prettier.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The check b_jlist == BJ_None in __journal_try_to_free_buffer() is
always true (__jbd2_journal_temp_unlink_buffer() also checks this in
an assertion) so just remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
BH_JWrite bit should be set when buffer is written to the journal. So
checkpointing shouldn't set this bit when writing out buffer. This didn't
cause any observable bug since BH_JWrite bit is used only for debugging
purposes but it's good to have this consistent.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we reach jbd2_cleanup_journal_tail(), there is no guarantee that
checkpointed buffers are on a stable storage - especially if buffers were
written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's
caches. Thus when we update journal superblock effectively removing old
transaction from journal, this write of superblock can get to stable storage
before those checkpointed buffers which can result in filesystem corruption
after a crash. Thus we must unconditionally issue a cache flush before we
update journal superblock in these cases.
A similar problem can also occur if journal superblock is written only in
disk's caches, other transaction starts reusing space of the transaction
cleaned from the log and power failure happens. Subsequent journal replay would
still try to replay the old transaction but some of it's blocks may be already
overwritten by the new transaction. For this reason we must use WRITE_FUA when
updating log tail and we must first write new log tail to disk and update
in-memory information only after that.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With the latest and greatest changes to the freezer, I started seeing
panics that were caused by jbd2 running post-process freezing and
hitting the canary BUG_ON for non-TuxOnIce I/O submission. I've traced
this back to a lack of set_freezable calls in both jbd and jbd2. Since
they're clearly meant to be frozen (there are tests for freezing()), I
submit the following patch to add the missing calls.
Signed-off-by: Nigel Cunningham <nigel@tuxonice.net>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
There are some log tail updates that are not protected by j_checkpoint_mutex.
Some of these are harmless because they happen during startup or shutdown but
updates in jbd2_journal_commit_transaction() and jbd2_journal_flush() can
really race with other log tail updates (e.g. someone doing
jbd2_journal_flush() with someone running jbd2_cleanup_journal_tail()). So
protect all log tail updates with j_checkpoint_mutex.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are three case of updating journal superblock. In the first case, we want
to mark journal as empty (setting s_sequence to 0), in the second case we want
to update log tail, in the third case we want to update s_errno. Split these
cases into separate functions. It makes the code slightly more straightforward
and later patches will make the distinction even more important.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The V2 journal format was introduced around ten years ago,
for ext3. It seems highly unlikely that anyone will need this
migration option for ext4.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use the KMEM_CACHE helper macro instead of kmem_cache_create().
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch renames functions initializing the slab caches for the
journal head and handle structures to so they are consistent with the
names of the corresponding functions which destroys those slab caches.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is normally only a handful of these active at any one time, but
putting them in a separate slab cache makes debugging memory
corruption problems easier. Manish Katiyar also wanted this make it
easier to test memory failure scenarios in the jbd2 layer.
Cc: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
journal_unmap_buffer()'s zap_buffer: code clears a lot of buffer head
state ala discard_buffer(), but does not touch _Delay or _Unwritten as
discard_buffer() does.
This can be problematic in some areas of the ext4 code which assume
that if they have found a buffer marked unwritten or delay, then it's
a live one. Perhaps those spots should check whether it is mapped
as well, but if jbd2 is going to tear down a buffer, let's really
tear it down completely.
Without this I get some fsx failures on sub-page-block filesystems
up until v3.2, at which point 4e96b2dbbf
and 189e868fa8 make the failures go
away, because buried within that large change is some more flag
clearing. I still think it's worth doing in jbd2, since
->invalidatepage leads here directly, and it's the right place
to clear away these flags.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
This patch adds trace_jbd2_drop_transaction and
trace_jbd2_update_superblock_end because there are similar tracepoints
in jbd and they are needed in jbd2 as well.
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Toshiyuki Okajima found out that when running
for ((i=0; i < 100000; i++)); do
if ((i%2 == 0)); then
chattr +j /mnt/file
else
chattr -j /mnt/file
fi
echo "0" >> /mnt/file
done
process sometimes hangs indefinitely in jbd2_journal_lock_updates().
Toshiyuki identified that the following race happens:
jbd2_journal_lock_updates() |jbd2_journal_stop()
---------------------------------------+---------------------------------------
write_lock(&journal->j_state_lock) | .
++journal->j_barrier_count | .
spin_lock(&tran->t_handle_lock) | .
atomic_read(&tran->t_updates) //not 0 |
| atomic_dec_and_test(&tran->t_updates)
| // t_updates = 0
| wake_up(&journal->j_wait_updates)
prepare_to_wait() | // no process is woken up.
spin_unlock(&tran->t_handle_lock) |
write_unlock(&journal->j_state_lock) |
schedule() // never return |
We fix the problem by first calling prepare_to_wait() and only after that
checking t_updates in jbd2_journal_lock_updates().
Reported-and-analyzed-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently, we clear revoked flag only when a block is reused. However,
this can tigger a false journal error. Consider a situation when a block
is used as a meta block and is deleted(revoked) in ordered mode, then the
block is allocated as a data block to a file. At this moment, user changes
the file's journal mode from ordered to journaled and truncates the file.
The block will be considered re-revoked by journal because it has revoked
flag still pending from the last transaction and an assertion triggers.
We fix the problem by keeping the revoked status more uptodate - we clear
revoked flag when switching revoke tables to reflect there is no revoked
buffers in current transaction any more.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is no reason to export two functions for entering the
refrigerator. Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.
* Rename refrigerator() to __refrigerator() and make it return bool
indicating whether it scheduled out for freezing.
* Update try_to_freeze() to return bool and relay the return value of
__refrigerator() if freezing().
* Convert all refrigerator() users to try_to_freeze().
* Update documentation accordingly.
* While at it, add might_sleep() to try_to_freeze().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
Some jbd2 code prints out kernel messages with "JBD2: " prefix, at the
same time other jbd2 code prints with "JBD: " prefix. Unify the prefix
to "JBD2: ".
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I hit a J_ASSERT(blocknr != 0) failure in cleanup_journal_tail() when
mounting a fsfuzzed ext3 image. It turns out that the corrupted ext3
image has s_first = 0 in journal superblock, and the 0 is passed to
journal->j_head in journal_reset(), then to blocknr in
cleanup_journal_tail(), in the end the J_ASSERT failed.
So validate s_first after reading journal superblock from disk in
journal_get_superblock() to ensure s_first is valid.
The following script could reproduce it:
fstype=ext3
blocksize=1024
img=$fstype.img
offset=0
found=0
magic="c0 3b 39 98"
dd if=/dev/zero of=$img bs=1M count=8
mkfs -t $fstype -b $blocksize -F $img
filesize=`stat -c %s $img`
while [ $offset -lt $filesize ]
do
if od -j $offset -N 4 -t x1 $img | grep -i "$magic";then
echo "Found journal: $offset"
found=1
break
fi
offset=`echo "$offset+$blocksize" | bc`
done
if [ $found -ne 1 ];then
echo "Magic \"$magic\" not found"
exit 1
fi
dd if=/dev/zero of=$img seek=$(($offset+23)) conv=notrunc bs=1 count=1
mkdir -p ./mnt
mount -o loop $img ./mnt
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fix build error when CONFIG_BUG is not enabled:
fs/jbd2/transaction.c:1175:3: error: implicit declaration of function '__WARN'
by changing __WARN() to WARN_ON(), as suggested by
Arnaud Lacombe <lacombar@gmail.com>.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Arnaud Lacombe <lacombar@gmail.com>
This silences some Sparse warnings:
fs/jbd2/transaction.c:135:69: warning: incorrect type in argument 2 (different base types)
fs/jbd2/transaction.c:135:69: expected restricted gfp_t [usertype] flags
fs/jbd2/transaction.c:135:69: got int [signed] gfp_mask
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add debugging information in case jbd2_journal_dirty_metadata() is
called with a buffer_head which didn't have
jbd2_journal_get_write_access() called on it, or if the journal_head
has the wrong transaction in it. In addition, return an error code.
This won't change anything for ocfs2, which will BUG_ON() the non-zero
exit code.
For ext4, the caller of this function is ext4_handle_dirty_metadata(),
and on seeing a non-zero return code, will call __ext4_journal_stop(),
which will print the function and line number of the (buggy) calling
function and abort the journal. This will allow us to recover instead
of bug halting, which is better from a robustness and reliability
point of view.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Using function calls in TP_printk causes perf heartburn, so print the
MAJOR/MINOR device numbers instead.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In journal checkpoint, we write the buffer and wait for its finish.
But in cfq, the async queue has a very low priority, and in our test,
if there are too many sync queues and every queue is filled up with
requests, the write request will be delayed for quite a long time and
all the tasks which are waiting for journal space will end with errors like:
INFO: task attr_set:3816 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
attr_set D ffff880028393480 0 3816 1 0x00000000
ffff8802073fbae8 0000000000000086 ffff8802140847c8 ffff8800283934e8
ffff8802073fb9d8 ffffffff8103e456 ffff8802140847b8 ffff8801ed728080
ffff8801db4bc080 ffff8801ed728450 ffff880028393480 0000000000000002
Call Trace:
[<ffffffff8103e456>] ? __dequeue_entity+0x33/0x38
[<ffffffff8103caad>] ? need_resched+0x23/0x2d
[<ffffffff814006a6>] ? thread_return+0xa2/0xbc
[<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
[<ffffffffa01f6224>] ? jbd2_journal_dirty_metadata+0x116/0x126 [jbd2]
[<ffffffff81400d31>] __mutex_lock_common+0x14e/0x1a9
[<ffffffffa021dbfb>] ? brelse+0x13/0x15 [ext4]
[<ffffffff81400ddb>] __mutex_lock_slowpath+0x19/0x1b
[<ffffffff81400b2d>] mutex_lock+0x1b/0x32
[<ffffffffa01f927b>] __jbd2_journal_insert_checkpoint+0xe3/0x20c [jbd2]
[<ffffffffa01f547b>] start_this_handle+0x438/0x527 [jbd2]
[<ffffffff8106f491>] ? autoremove_wake_function+0x0/0x3e
[<ffffffffa01f560b>] jbd2_journal_start+0xa1/0xcc [jbd2]
[<ffffffffa02353be>] ext4_journal_start_sb+0x57/0x81 [ext4]
[<ffffffffa024a314>] ext4_xattr_set+0x6c/0xe3 [ext4]
[<ffffffffa024aaff>] ext4_xattr_user_set+0x42/0x4b [ext4]
[<ffffffff81145adb>] generic_setxattr+0x6b/0x76
[<ffffffff81146ac0>] __vfs_setxattr_noperm+0x47/0xc0
[<ffffffff81146bb8>] vfs_setxattr+0x7f/0x9a
[<ffffffff81146c88>] setxattr+0xb5/0xe8
[<ffffffff81137467>] ? do_filp_open+0x571/0xa6e
[<ffffffff81146d26>] sys_fsetxattr+0x6b/0x91
[<ffffffff81002d32>] system_call_fastpath+0x16/0x1b
So this patch tries to use WRITE_SYNC in __flush_batch so that the request will
be moved into sync queue and handled by cfq timely. We also use the new plug,
sot that all the WRITE_SYNC requests can be given as a whole when we unplug it.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Reported-by: Robin Dong <sanbai@taobao.com>
jbd2_journal_remove_journal_head() can oops when trying to access
journal_head returned by bh2jh(). This is caused for example by the
following race:
TASK1 TASK2
jbd2_journal_commit_transaction()
...
processing t_forget list
__jbd2_journal_refile_buffer(jh);
if (!jh->b_transaction) {
jbd_unlock_bh_state(bh);
jbd2_journal_try_to_free_buffers()
jbd2_journal_grab_journal_head(bh)
jbd_lock_bh_state(bh)
__journal_try_to_free_buffer()
jbd2_journal_put_journal_head(jh)
jbd2_journal_remove_journal_head(bh);
jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and
buffer is not part of any transaction and thus frees journal_head
before TASK1 gets to doing so. Note that even buffer_head can be
released by try_to_free_buffers() after
jbd2_journal_put_journal_head() which adds even larger opportunity for
oops (but I didn't see this happen in reality).
Fix the problem by making transactions hold their own journal_head
reference (in b_jcount). That way we don't have to remove journal_head
explicitely via jbd2_journal_remove_journal_head() and instead just
remove journal_head when b_jcount drops to zero. The result of this is
that [__]jbd2_journal_refile_buffer(),
[__]jbd2_journal_unfile_buffer(), and
__jdb2_journal_remove_checkpoint() can free journal_head which needs
modification of a few callers. Also we have to be careful because once
journal_head is removed, buffer_head might be freed as well. So we
have to get our own buffer_head reference where it matters.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
credits isn't a parameter for jbd2_journal_get_write_access and
jbd2_journal_get_undo_access. So remove the corresponding comments.
Acked-by: Jan Kara <jack@suse.cz>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
jbd2__journal_start() returns an ERR_PTR() value rather than NULL on
failure.
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Provide a function which returns whether a transaction with given tid
will send a flush to the filesystem device. The function will be used
by ext4 to detect whether fsync needs to send a separate flush or not.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In data=ordered mode, it's theoretically possible (however rare) that
an inode is filed to transaction's t_inode_list and a flusher thread
writes all the data and inode is reclaimed before the transaction
starts to commit. In such a case, we could erroneously omit sending a
flush to file system device when it is different from the journal
device (because data can still be in disk cache only).
Fix the problem by setting a flag in a transaction when some inode is added
to it and then send disk flush in the commit code when the flag is set.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
t_max_wait is added in commit 8e85fb3f to indicate how long we
were waiting for new transaction to start. In commit 6d0bf005,
it is moved to another function named update_t_max_wait to
avoid a build warning. But the wrong thing is that the original
'ts' is initialized in the start of function start_this_handle
and we can calculate t_max_wait in the right way. while with
this change, ts is initialized within the function and t_max_wait
can never be calculated right.
This patch moves the initialization of ts to the original beginning
of start_this_handle and pass it to function update_t_max_wait so
that it can be calculated right and the build warning is avoided also.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
summarise_journal_usage seems to be obsolete for a long time,
so remove it.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>