This value is visible through sysfs and is used by mdadm
when it manages a reshape (backing up data that is about to be
rearranged). So it is important that it is always correct.
Current it does not get updated properly when a reshape
starts which can cause problems when assembling an array
that is in the middle of being reshaped.
This is suitable for 2.6.31.y stable kernels.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If a 'sync_max' has been set (via sysfs), it is wrong to clear it
until a resync (or reshape or recovery ...) actually reached that
point.
So if a resync is interrupted (e.g. by device failure),
leave 'resync_max' unchanged.
This is particularly important for 'reshape' operations that do not
change the size of the array. For such operations mdadm needs to
monitor the reshape taking rolling backups of the section being
reshaped. If resync_max gets cleared, the reshape can get ahead of
mdadm and then the backups that mdadm creates are useless.
This is suitable for 2.6.31.y stable kernels.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Allow the snapshot chunk size to be smaller than the page size
The code is now capable of handling this due to some previous
fixes and enhancements.
As the page size varies between computers, prior to this patch,
the chunk size of a snapshot dictated which machines could read it:
Snapshots created on one machine might not be readable on another.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use unsigned integer chunk size.
Maximum chunk size is 512kB, there won't ever be need to use 4GB chunk size,
so the number can be 32-bit. This fixes compiler failure on 32-bit systems
with large block devices.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch locks the snapshot when returning status. It fixes a race
when it could return an invalid number of free chunks if someone
was simultaneously modifying it.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Properly close the device if failing because of an invalid chunk size.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If we are creating snapshot with memory-stored exception store, fail if
the user didn't specify chunk size. Zero chunk size would probably crash
a lot of places in the rest of snapshot code.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Multiple instances of dec_pending() can run concurrently so a lock is
needed when it saves the first error code.
I have never experienced actual problem without locking and just found
this during code inspection while implementing the barrier support
patch for request-based dm.
This patch adds the locking.
I've done compile, boot and basic I/O testings.
Cc: stable@kernel.org
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add missing del_gendisk() to error path when creation of workqueue fails.
Otherwice there is a resource leak and following warning is shown:
WARNING: at fs/sysfs/dir.c:487 sysfs_add_one+0xc5/0x160()
sysfs: cannot create duplicate filename '/devices/virtual/block/dm-0'
Cc: stable@kernel.org
Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
mips:
drivers/md/dm-log-userspace-base.c: In function `userspace_ctr':
drivers/md/dm-log-userspace-base.c:159: warning: cast from pointer to integer of different size
Cc: stable@kernel.org
Cc: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
While initializing the snapshot module, if we fail to register
the snapshot target then we must back-out the exception store
module initialization.
Cc: stable@kernel.org
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Avoid a race causing corruption when snapshots of the same origin have
different chunk sizes by sorting the internal list of snapshots by chunk
size, largest first.
https://bugzilla.redhat.com/show_bug.cgi?id=182659
For example, let's have two snapshots with different chunk sizes. The
first snapshot (1) has small chunk size and the second snapshot (2) has
large chunk size. Let's have chunks A, B, C in these snapshots:
snapshot1: ====A==== ====B====
snapshot2: ==========C==========
(Chunk size is a power of 2. Chunks are aligned.)
A write to the origin at a position within A and C comes along. It
triggers reallocation of A, then reallocation of C and links them
together using A as the 'primary' exception.
Then another write to the origin comes along at a position within B and
C. It creates pending exception for B. C already has a reallocation in
progress and it already has a primary exception (A), so nothing is done
to it: B and C are not linked.
If the reallocation of B finishes before the reallocation of C, because
there is no link with the pending exception for C it does not know to
wait for it and, the second write is dispatched to the origin and causes
data corruption in the chunk C in snapshot2.
To avoid this situation, we maintain snapshots sorted in descending
order of chunk size. This leads to a guaranteed ordering on the links
between the pending exceptions and avoids the problem explained above -
both A and B now get linked to C.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
md/raid6 passes a list of 'struct page *' to the async_tx routines,
which then either DMA map them for offload, or take the page_address
for CPU based calculations.
For RAID6 we sometime leave 'blanks' in the list of pages.
For CPU based calcs, we want to treat theses as a page of zeros.
For offloaded calculations, we simply don't pass a page to the
hardware.
Currently the 'blanks' are encoded as a pointer to
raid6_empty_zero_page. This is a 4096 byte memory region, not a
'struct page'. This is mostly handled correctly but is rather ugly.
So change the code to pass and expect a NULL pointer for the blanks.
When taking page_address of a page, we need to check for a NULL and
in that case use raid6_empty_zero_page.
Signed-off-by: NeilBrown <neilb@suse.de>
When a raid5 (or raid6) array is being reshaped to have fewer devices,
conf->raid_disks is the latter and hence smaller number of devices.
However sometimes we want to use a number which is the total number of
currently required devices - the larger of the 'old' and 'new' sizes.
Before we implemented reducing the number of devices, this was always
'new' i.e. ->raid_disks.
Now we need max(raid_disks, previous_raid_disks) in those places.
This particularly affects assembling an array that was shutdown while
in the middle of a reshape to fewer devices.
md.c needs a similar fix when interpreting the md metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsytem that initially showed some modest gains
(+4%). However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe(). As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).
By downleveling the parallelism to raid_run_ops the pathological
stripe_head bouncing is eliminated. This version still exhibits an
average 11% throughput loss for:
mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
echo 1024 > /sys/block/md0/md/stripe_cache_size
dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
...but the results are at least stable and can be used as a base for
further multicore experimentation.
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
drivers/md/unroll.pl replaced by awk script to drop build-time
dependency on perl
Signed-off-by: Vladimir Dronnikov <dronnikov@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
and replace with vfs_fsync which is much neater (but wasn't exported,
or even in existence at the time the code was written).
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Both raid1 and raid10 create a mempool during startup.
If the 'alloc' function for this mempool fails, unplug_slaves
is called.
If that happens when the pool is being initialised, unplug_slaves
will try to use the 'conf' structure that isn't filled in yet, and
badness will happen.
So ensure that unplug_slaves doesn't get called unless we know
that the conf structure if fully initialised.
Signed-off-by: NeilBrown <neilb@suse.de>
Deallocating a raid5_conf_t structure requires taking 'device_lock'.
Ensure it is initialized before it is used, i.e. initialize the lock
before attempting any further initializations that might fail.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
During 'check' of a raid1 or raid10 it is possible for the management
thread to spend a lot of time running 'memcmp' on blocks from
different devices, so make sure the thread has a chance to schedule.
raid5d already has a cond_resched (in process_stripe).
Reported-By: Lee Howard <faxguy@howardsilvan.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This reverts commit df10cfbc4d.
This patch was based on a misunderstanding and risks introducing a busy-wait loop.
So revert it.
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit a9327cac44 added seperate read
and write statistics of in_flight requests. And exported the number
of read and write requests in progress seperately through sysfs.
But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange
output from "iostat -kx 2". Global values for service time and
utilization were garbage. For interval values, utilization was always
100%, and service time is higher than normal.
So this was reverted by commit 0f78ab9899
The problem was in part_round_stats_single(), I missed the following:
if (now == part->stamp)
return;
- if (part->in_flight) {
+ if (part_in_flight(part)) {
__part_stat_add(cpu, part, time_in_queue,
part_in_flight(part) * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
With this chunk included, the reported regression gets fixed.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
--
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently Jens has changed bio_rw_flagged() logic by following
commit 1f98a13f62. Now it returns
bool instead of int. This broke raid1/raid10 RW bits manipulation logic.
One of visible result is BUG_ON triggering due to empty barrier
here scsi_lib.c:1108 scsi_setup_fs_cmnd()
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Recent commit bbba809e96
replaced mempool_create_kzalloc_pool with mempool_create_kmalloc_pool
plus a memset.
This memset is not needed (and we didn't need kzalloc in the first
place).
Ever field of the allocated structure (struct multipath_bh) is
initialised immediately except retry_list, and memset does not
initial a list_head anyway.
To remove the memset.
Signed-off-by: NeilBrown <neilb@suse.de>
The management thread for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusion.
So change md_register_thread to use the name from the personality
unless no alternate name (like 'resync' or 'reshape') is given.
This is simpler and more correct.
Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Rename some variable and remove some duplicate definitions
to avoid there warnings. None of them are actual errors.
Signed-off-by: NeilBrown <neilb@suse.de>
The kzalloc mempool does not re-zero items that have been used and then
returned to the pool. Manually zero the allocated multipath_bh instead.
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows subsytems to provide devtmpfs with non-default permissions
for the device node. Instead of the default mode of 0600, null, zero,
random, urandom, full, tty, ptmx now have a mode of 0666, which allows
non-privileged processes to access standard device nodes in case no
other userspace process applies the expected permissions.
This also fixes a wrong assignment in pktcdvd and a checkpatch.pl complain.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Neil says:
"It is correct as it stands, but the fact that every branch in
the 'if' part ends with a 'return' isn't immediately obvious,
so it is clearer if we are explicit about the if / then / else
structure."
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As pointed out by Neil it should be possible to build a driver with all
BUG_ON statements deleted. It's bad form to have a BUG_ON with a side
effect.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper
around it. DM needs this to avoid poking at the queue_limits directly.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.
This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Commit b8313b6da7 ("dm log: remove incorrect
field from userspace table output") added a call to strstr() with a
single-character "needle" string parameter.
Unfortunately some versions of gcc replace such calls to strstr() by calls
to strchr() behind our back. This causes linking errors if strchr() is
defined as an inline function in <asm/string.h> (e.g. on m68k):
| WARNING: "strchr" [drivers/md/dm-log-userspace.ko] undefined!
Avoid this by explicitly calling strchr() instead.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some engines optimize operation by reading ahead in the descriptor chain
such that descriptor2 may start execution before descriptor1 completes.
If descriptor2 depends on the result from descriptor1 then a fence is
required (on descriptor2) to disable this optimization. The async_tx
api could implicitly identify dependencies via the 'depend_tx'
parameter, but that would constrain cases where the dependency chain
only specifies a completion order rather than a data dependency. So,
provide an ASYNC_TX_FENCE to explicitly identify data dependencies.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fix some problems seen in the chunk size processing when activating a
pre-existing snapshot.
For a new snapshot, the chunk size can either be supplied by the creator
or a default value can be used. For an existing snapshot, the
chunk size in the snapshot header on disk should always be used.
If someone attempts to load an existing snapshot and has the 'default
chunk size' option set, the kernel uses its default value even when it
is incorrect for the snapshot being loaded. This patch ensures the
correct on-disk value is always used.
Secondly, when the code does use the chunk size stored on the disk it is
prudent to revalidate it, so the code can exit cleanly if it got
corrupted as happened in
https://bugzilla.redhat.com/show_bug.cgi?id=461506 .
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Break the function set_chunk_size to two functions in preparation for
the fix in the following patch.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If a persistent snapshot fills up, a race can corrupt the on-disk header
which causes a crash on any future attempt to activate the snapshot
(typically while booting). This patch fixes the race.
When the snapshot overflows, __invalidate_snapshot is called, which calls
snapshot store method drop_snapshot. It goes to persistent_drop_snapshot that
calls write_header. write_header constructs the new header in the "area"
location.
Concurrently, an existing kcopyd job may finish, call copy_callback
and commit_exception method, that goes to persistent_commit_exception.
persistent_commit_exception doesn't do locking, relying on the fact that
callbacks are single-threaded, but it can race with snapshot invalidation and
overwrite the header that is just being written while the snapshot is being
invalidated.
The result of this race is a corrupted header being written that can
lead to a crash on further reactivation (if chunk_size is zero in the
corrupted header).
The fix is to use separate memory areas for each.
See the bug: https://bugzilla.redhat.com/show_bug.cgi?id=461506
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Refactor chunk_io to prepare for the fix in the following patch.
Pass an area pointer to chunk_io and simplify zero_disk_area to use
chunk_io. No functional change.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Device-mapper userspace logs (like the clustered log) are
identified by a universally unique identifier (UUID). This
identifier is used to associate requests from the kernel to
a specific log in userspace. The UUID must be unique everywhere,
since multiple machines may use this identifier when communicating
about a particular log, as is the case for cluster logs.
Sometimes, device-mapper/LVM may re-use a UUID. This is the
case during pvmoves, when moving from one segment of an LV
to another, or when resizing a mirror, etc. In these cases,
a new log is created with the same UUID and loaded in the
"inactive" slot. When a device-mapper "resume" is issued,
the "live" table is deactivated and the new "inactive" table
becomes "live". (The "inactive" table can also be removed
via a device-mapper 'clear' command.)
The above two issues were colliding. More than one log was being
created with the same UUID, and there was no way to distinguish
between them. So, sometimes the wrong log would be swapped
out during the exchange.
The solution is to create a locally unique identifier,
'luid', to go along with the UUID. This new identifier is used
to determine exactly which log is being referenced by the kernel
when the log exchange is made. The identifier is not
universally safe, but it does not need to be, since
create/destroy/suspend/resume operations are bound to a specific
machine; and these are the operations that make up the exchange.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>