If the thread calling dm_kcopyd_copy is delayed due to scheduling inside
split_job/segment_complete and the subjobs complete before the loop in
split_job completes, the kcopyd callback could be invoked from the
thread that called dm_kcopyd_copy instead of the kcopyd workqueue.
dm_kcopyd_copy -> split_job -> segment_complete -> job->fn()
Snapshots depend on the fact that callbacks are called from the singlethreaded
kcopyd workqueue and expect that there is no racing between individual
callbacks. The racing between callbacks can lead to corruption of exception
store and it can also mean that exception store callbacks are called twice
for the same exception - a likely reason for crashes reported inside
pending_complete() / remove_exception().
This patch fixes two problems:
1. job->fn being called from the thread that submitted the job (see above).
- Fix: hand over the completion callback to the kcopyd thread.
2. job->fn(read_err, write_err, job->context); in segment_complete
reports the error of the last subjob, not the union of all errors.
- Fix: pass job->write_err to the callback to report all error bits
(it is done already in run_complete_job)
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Use a variable in segment_complete() to point to the dm_kcopyd_client
struct and only release job->pages in run_complete_job() if any are
defined. These changes are needed by the next patch.
Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Change #include "dm.h" to #include <linux/device-mapper.h> in all targets.
Targets should not need direct access to internal DM structures.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Write throughput to LVM snapshot origin volume is an order
of magnitude slower than those to LV without snapshots or
snapshot target volumes, especially in the case of sequential
writes with O_SYNC on.
The following patch originally written by Kevin Jamieson and
Jan Blunck and slightly modified for the current RCs by myself
tries to improve the performance by modifying the behaviour
of kcopyd, so that it pushes back an I/O job to the head of
the job queue instead of the tail as process_jobs() currently
does when it has to wait for free pages. This way, write
requests aren't shuffled to cause extra seeks.
I tested the patch against 2.6.27-rc5 and got the following results.
The test is a dd command writing to snapshot origin followed by fsync
to the file just created/updated. A couple of filesystem benchmarks
gave me similar results in case of sequential writes, while random
writes didn't suffer much.
dd if=/dev/zero of=<somewhere on snapshot origin> bs=4096 count=...
[conv=notrunc when updating]
1) linux 2.6.27-rc5 without the patch, write to snapshot origin,
average throughput (MB/s)
10M 100M 1000M
create,dd 511.46 610.72 11.81
create,dd+fsync 7.10 6.77 8.13
update,dd 431.63 917.41 12.75
update,dd+fsync 7.79 7.43 8.12
compared with write throughput to LV without any snapshots,
all dd+fsync and 1000 MiB writes perform very poorly.
10M 100M 1000M
create,dd 555.03 608.98 123.29
create,dd+fsync 114.27 72.78 76.65
update,dd 152.34 1267.27 124.04
update,dd+fsync 130.56 77.81 77.84
2) linux 2.6.27-rc5 with the patch, write to snapshot origin,
average throughput (MB/s)
10M 100M 1000M
create,dd 537.06 589.44 46.21
create,dd+fsync 31.63 29.19 29.23
update,dd 487.59 897.65 37.76
update,dd+fsync 34.12 30.07 26.85
Although still not on par with plain LV performance -
cannot be avoided because it's copy on write anyway -
this simple patch successfully improves throughtput
of dd+fsync while not affecting the rest.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Kazuo Ito <ito.kazuo@oss.ntt.co.jp>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.
It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
queue (i.e. block the requests from being dispatched to the physical device).
The queue is unplugged when the caller calls blk_unplug() function. Usually, the
sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
the queue and waits (to be possibly joined with other adjacent bios). Then, when
the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
the disk.
This was happenning:
When doing O_SYNC writes, function fsync_buffers_list() submits a list of
bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
woken up.
fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
there are no bios on the device queue as they are still in the dm_raid1 queue.
wait_on_buffer() starts waiting until the IO is finished.
kmirrord is scheduled, kmirrord takes bios and submits them to the devices.
The submitted bio plugs the harddisk queue but there is no one to unplug it.
(The process that called wait_on_buffer() is already sleeping.)
So there is a 3ms timeout, after which the queues on the harddisks are
unplugged and requests are processed.
This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
dm-raid1 is 10 times slower than md raid1.
Every time we submit something asynchronously via dm_io, we must unplug the
queue actually to send the request to the device.
This patch adds an unplug call to kmirrord - while processing requests, it keeps
the queue plugged (so that adjacent bios can be merged); when it finishes
processing all the bios, it unplugs the queue to submit the bios.
It also fixes kcopyd which has the same potential problem. All kcopyd requests
are submitted with BIO_RW_SYNC.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Remove client counting code that is no longer needed.
Initialization and destruction is made globally from dm_init and dm_exit and is
not based on client counts. Initialization allocates only one empty slab cache,
so there is no negative impact from performing the initialization always,
regardless of whether some client uses kcopyd or not.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change the global mempool in kcopyd into a per-device mempool to avoid
deadlock possibilities.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the kcopyd interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up the dm-io interface to prepare for publishing it in include/linux.
Signed-off-by: Heinz Mauelshagen <hjm@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
write_err is an unsigned long used with set_bit() so should not be passed
around as unsigned int.
http://bugzilla.kernel.org/show_bug.cgi?id=10271
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kcopyd uses a semaphore as mutex. Use the mutex API instead of the (binary)
semaphore,
Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use new KMEM_CACHE() macro and make the newly-exposed structure names more
meaningful. Also remove some superfluous casts and inlines (let a modern
compiler be the judge).
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kcopyd should accumulate errors - otherwise I/O failures may be ignored
unintentionally.
And invert 'success' (used in a future patch), using a more intuitive
!(read_err || write_err).
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Before removing a snapshot, wait for the completion of any kcopyd jobs using
it.
Do this by maintaining a count (nr_jobs) of how many outstanding jobs each
kcopyd_client has.
The snapshot destructor first unregisters the snapshot so that no new kcopyd
jobs (created by writes to the origin) will reference that particular
snapshot. kcopyd_client_destroy() is now run next to wait for the completion
of any outstanding jobs before the snapshot exception structures (that those
jobs reference) are freed.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Modify well over a dozen mempool users to call mempool_create_slab_pool()
rather than calling mempool_create() with extra arguments, saving about 30
lines of code and increasing readability.
Signed-off-by: Matthew Dobson <colpatch@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jens Axboe <axboe@suse.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
EDAC requires a way to scrub memory if an ECC error is found and the chipset
does not do the work automatically. That means rewriting memory locations
atomically with respect to all CPUs _and_ bus masters. That means we can't
use atomic_add(foo, 0) as it gets optimised for non-SMP
This adds a function to include/asm-foo/atomic.h for the platforms currently
supported which implements a scrub of a mapped block.
It also adjusts a few other files include order where atomic.h is included
before types.h as this now causes an error as atomic_scrub uses u32.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch #if 0's the not yet implemented global function kcopyd_cancel().
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!