For commands returned with failed status, queue these for resubmission
and continue retrying them until success or for a limited amount of
time. The final timeout was arbitrarily chosen so requests can't be
retried indefinitely.
Since these are requeued on the nvmeq that submitted the command, the
callbacks have to take an nvmeq instead of an nvme_dev as a parameter
so that we can use the locked queue to append the iod to retry later.
The nvme_iod conviently can be used to track how long we've been trying
to successfully complete an iod request. The nvme_iod also provides the
nvme prp dma mappings, so I had to move a few things around so we can
keep those mappings.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue with long line]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Some programs require HDIO_GETGEO work, which requires we implement
getgeo.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Done to ensure nvme_thread is not running when there
are no devices to poll.
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Increase the default timeout to 30 seconds to match SCSI.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[use byte instead of ushort]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Registers with hot cpu notification to rebalance, and potentially allocate
additional, io queues.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
The device's IO queues are associated with CPUs, so we can use a per-cpu
variable to map the a qid to a cpu. This provides a convienient way
to optimally assign queues to multiple cpus when the device supports
fewer queues than the host has cpus. The previous implementation may
have assigned these poorly in these situations. This patch addresses
this by sharing queues among cpus that are "close" together and should
have a lower lock contention penalty.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
zram is ram based block device and can be used by backend of filesystem.
When filesystem deletes a file, it normally doesn't do anything on data
block of that file. It just marks on metadata of that file. This
behavior has no problem on disk based block device, but has problems on
ram based block device, since we can't free memory used for data block.
To overcome this disadvantage, there is REQ_DISCARD functionality. If
block device support REQ_DISCARD and filesystem is mounted with discard
option, filesystem sends REQ_DISCARD to block device whenever some data
blocks are discarded. All we have to do is to handle this request.
This patch implements to flag up QUEUE_FLAG_DISCARD and handle this
REQ_DISCARD request. With it, we can free memory used by zram if it isn't
used.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysfs.txt documentation lists the following requirements:
- The buffer will always be PAGE_SIZE bytes in length. On i386, this
is 4096.
- show() methods should return the number of bytes printed into the
buffer. This is the return value of scnprintf().
- show() should always use scnprintf().
Use scnprintf() in show() functions.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we initialized zcomp with single, we couldn't change
max_comp_streams without zram reset but current interface doesn't show
any error to user and even it changes max_comp_streams's value without
any effect so it would make user very confusing.
This patch prevents max_comp_streams's change when zcomp was initialized
as single zcomp and emit the error to user(ex, echo).
[akpm@linux-foundation.org: don't return with the lock held, per Sergey]
[fengguang.wu@intel.com: fix coccinelle warnings]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of returning just NULL, return ERR_PTR from zcomp_create() if
compressing backend creation has failed. ERR_PTR(-EINVAL) for unsupported
compression algorithm request, ERR_PTR(-ENOMEM) for allocation (zcomp or
compression stream) error.
Perform IS_ERR() check of returned from zcomp_create() value in
disksize_store() and set return code to PTR_ERR().
Change suggested by Jerome Marchand.
[akpm@linux-foundation.org: clean up error recovery flow]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While fixing lockdep spew of ->init_lock reported by Sasha Levin [1],
Minchan Kim noted [2] that it's better to move compression backend
allocation (using GPF_KERNEL) out of the ->init_lock lock, same way as
with zram_meta_alloc(), in order to prevent the same lockdep spew.
[1] https://lkml.org/lkml/2014/2/27/337
[2] https://lkml.org/lkml/2014/3/3/32
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce LZ4 compression backend and make it available for selection.
LZ4 support is optional and requires user to set ZRAM_LZ4_COMPRESS config
option. The default compression backend is LZO.
TEST
(x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
ext4 file system, 3 compression streams)
iozone -t 3 -R -r 16K -s 60M -I +Z
Test LZO LZ4
----------------------------------------------
Initial write 1642744.62 1317005.09
Rewrite 2498980.88 1800645.16
Read 3957026.38 5877043.75
Re-read 3950997.38 5861847.00
Reverse Read 2937114.56 5047384.00
Stride read 2948163.19 4929587.38
Random read 3292692.69 4880793.62
Mixed workload 1545602.62 3502940.38
Random write 2448039.75 1758786.25
Pwrite 1670051.03 1338329.69
Pread 2530682.00 5097177.62
Fwrite 3232085.62 3275942.56
Fread 6306880.25 6645271.12
So on my system LZ4 is slower in write-only tests, while it performs
better in read-only and mixed (reads + writes) tests.
Official LZ4 benchmarks available here http://code.google.com/p/lz4/
(linux kernel uses revision r90).
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch allows to change max_comp_streams on initialised zcomp.
Introduce zcomp set_max_streams() knob, zcomp_strm_multi_set_max_streams()
and zcomp_strm_single_set_max_streams() callbacks to change streams limit
for zcomp_strm_multi and zcomp_strm_single, accordingly. set_max_streams
for single steam zcomp does nothing.
If user has lowered the limit, then zcomp_strm_multi_set_max_streams()
attempts to immediately free extra streams (as much as it can, depending
on idle streams availability).
Note, this patch does not allow to change stream 'policy' from single to
multi stream (or vice versa) on already initialised compression backend.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Existing zram (zcomp) implementation has only one compression stream
(buffer and algorithm private part), so in order to prevent data
corruption only one write (compress operation) can use this compression
stream, forcing all concurrent write operations to wait for stream lock
to be released. This patch changes zcomp to keep a compression streams
list of user-defined size (via sysfs device attr). Each write operation
still exclusively holds compression stream, the difference is that we
can have N write operations (depending on size of streams list)
executing in parallel. See TEST section later in commit message for
performance data.
Introduce struct zcomp_strm_multi and a set of functions to manage
zcomp_strm stream access. zcomp_strm_multi has a list of idle
zcomp_strm structs, spinlock to protect idle list and wait queue, making
it possible to perform parallel compressions.
The following set of functions added:
- zcomp_strm_multi_find()/zcomp_strm_multi_release()
find and release a compression stream, implement required locking
- zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
create and destroy zcomp_strm_multi
zcomp ->strm_find() and ->strm_release() callbacks are set during
initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
correspondingly.
Each time zcomp issues a zcomp_strm_multi_find() call, the following set
of operations performed:
- spin lock strm_lock
- if idle list is not empty, remove zcomp_strm from idle list, spin
unlock and return zcomp stream pointer to caller
- if idle list is empty, current adds itself to wait queue. it will be
awaken by zcomp_strm_multi_release() caller.
zcomp_strm_multi_release():
- spin lock strm_lock
- add zcomp stream to idle list
- spin unlock, wake up sleeper
Minchan Kim reported that spinlock-based locking scheme has demonstrated
a severe perfomance regression for single compression stream case,
comparing to mutex-based (see https://lkml.org/lkml/2014/2/18/16)
base spinlock mutex
==Initial write ==Initial write ==Initial write
records: 5 records: 5 records: 5
avg: 1642424.35 avg: 699610.40 avg: 1655583.71
std: 39890.95(2.43%) std: 232014.19(33.16%) std: 52293.96
max: 1690170.94 max: 1163473.45 max: 1697164.75
min: 1568669.52 min: 573429.88 min: 1553410.23
==Rewrite ==Rewrite ==Rewrite
records: 5 records: 5 records: 5
avg: 1611775.39 avg: 501406.64 avg: 1684419.11
std: 17144.58(1.06%) std: 15354.41(3.06%) std: 18367.42
max: 1641800.95 max: 531356.78 max: 1706445.84
min: 1593515.27 min: 488817.78 min: 1655335.73
When only one compression stream available, mutex with spin on owner
tends to perform much better than frequent wait_event()/wake_up(). This
is why single stream implemented as a special case with mutex locking.
Introduce and document zram device attribute max_comp_streams. This
attr shows and stores current zcomp's max number of zcomp streams
(max_strm). Extend zcomp's zcomp_create() with `max_strm' parameter.
`max_strm' limits the number of zcomp_strm structs in compression
backend's idle list (max_comp_streams).
max_comp_streams used during initialisation as follows:
-- passing to zcomp_create() max_strm equals to 1 will initialise zcomp
using single compression stream zcomp_strm_single (mutex-based locking).
-- passing to zcomp_create() max_strm greater than 1 will initialise zcomp
using multi compression stream zcomp_strm_multi (spinlock-based locking).
default max_comp_streams value is 1, meaning that zram with single stream
will be initialised.
Later patch will introduce configuration knob to change max_comp_streams
on already initialised and used zcomp.
TEST
iozone -t 3 -R -r 16K -s 60M -I +Z
test base 1 strm (mutex) 3 strm (spinlock)
-----------------------------------------------------------------------
Initial write 589286.78 583518.39 718011.05
Rewrite 604837.97 596776.38 1515125.72
Random write 584120.11 595714.58 1388850.25
Pwrite 535731.17 541117.38 739295.27
Fwrite 1418083.88 1478612.72 1484927.06
Usage example:
set max_comp_streams to 4
echo 4 > /sys/block/zram0/max_comp_streams
show current max_comp_streams (default value is 1).
cat /sys/block/zram0/max_comp_streams
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is preparation patch to add multi stream support to zcomp.
Introduce struct zcomp_strm_single and a set of functions to manage
zcomp_strm stream access. zcomp_strm_single implements single compession
stream, same way as current zcomp implementation. This moves zcomp_strm
stream control and locking from zcomp, so compressing backend zcomp is not
aware of required locking.
Single and multi streams require different locking schemes. Minchan Kim
reported that spinlock-based locking scheme (which is used in multi stream
implementation) has demonstrated a severe perfomance regression for single
compression stream case, comparing to mutex-based. see
https://lkml.org/lkml/2014/2/18/16
The following set of functions added:
- zcomp_strm_single_find()/zcomp_strm_single_release()
find and release a compression stream, implement required locking
- zcomp_strm_single_create()/zcomp_strm_single_destroy()
create and destroy zcomp_strm_single
New ->strm_find() and ->strm_release() callbacks added to zcomp, which are
set to zcomp_strm_single_find() and zcomp_strm_single_release() during
initialisation. Instead of direct locking and zcomp_strm access from
zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find()
and ->strm_release() correspondingly.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ZRAM performs direct LZO compression algorithm calls, making it the one
and only option. While LZO is generally performs well, LZ4 algorithm
tends to have a faster decompression (see http://code.google.com/p/lz4/
for full report)
Name Ratio C.speed D.speed
MB/s MB/s
LZ4 (r101) 2.084 422 1820
LZO 2.06 2.106 414 600
Thus, users who have mostly read (decompress) usage scenarious or mixed
workflow (writes with relatively high read ops number) will benefit from
using LZ4 compression backend.
Introduce compressing backend abstraction zcomp in order to support
multiple compression algorithms with the following set of operations:
.create
.destroy
.compress
.decompress
Schematically zram write() usually contains the following steps:
0) preparation (decompression of partioal IO, etc.)
1) lock buffer_lock mutex (protects meta compress buffers)
2) compress (using meta compress buffers)
3) alloc and map zs_pool object
4) copy compressed data (from meta compress buffers) to object allocated by 3)
5) free previous pool page, assign a new one
6) unlock buffer_lock mutex
As we can see, compressing buffers must remain untouched from 1) to 4),
because, otherwise, concurrent write() can overwrite data. At the same
time, zram_meta must be aware of a) specific compression algorithm memory
requirements and b) necessary locking to protect compression buffers. To
remove requirement a) new struct zcomp_strm introduced, which contains a
compress/decompress `buffer' and compression algorithm `private' part.
While struct zcomp implements zcomp_strm stream handling and locking and
removes requirement b) from zram meta. zcomp ->create() and ->destroy(),
respectively, allocate and deallocate algorithm specific zcomp_strm
`private' part.
Every zcomp has zcomp stream and mutex to protect its compression stream.
Stream usage semantics remains the same -- only one write can hold stream
lock and use its buffers. zcomp_strm_find() turns caller into exclusive
user of a stream (holding stream mutex until zram release stream), and
zcomp_strm_release() makes zcomp stream available (unlock the stream
mutex). Hence no concurrent write (compression) operations possible at
the moment.
iozone -t 3 -R -r 16K -s 60M -I +Z
test base patched
--------------------------------------------------
Initial write 597992.91 591660.58
Rewrite 609674.34 616054.97
Read 2404771.75 2452909.12
Re-read 2459216.81 2470074.44
Reverse Read 1652769.66 1589128.66
Stride read 2202441.81 2202173.31
Random read 2236311.47 2276565.31
Mixed workload 1423760.41 1709760.06
Random write 579584.08 615933.86
Pwrite 597550.02 594933.70
Pread 1703672.53 1718126.72
Fwrite 1330497.06 1461054.00
Fread 3922851.00 3957242.62
Usage examples:
comp = zcomp_create(NAME) /* NAME e.g. "lzo" */
which initialises compressing backend if requested algorithm is supported.
Compress:
zstrm = zcomp_strm_find(comp)
zcomp_compress(comp, zstrm, src, &dst_len)
[..] /* copy compressed data */
zcomp_strm_release(comp, zstrm)
Decompress:
zcomp_decompress(comp, src, src_len, dst);
Free compessing backend and its zcomp stream:
zcomp_destroy(comp)
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
allocate new `zram_meta' in disksize_store() only for uninitialised zram
device, saving a number of allocations and deallocations in case if
disksize_store() was called on currently used device. at the same time
zram_meta stack variable is not necessary, because we can set ->meta
directly. there is also no need in setting QUEUE_FLAG_NONROT queue on
every disksize_store(), set it once during device creation.
[minchan@kernel.org: handle zram->meta alloc fail case]
[minchan@kernel.org: prevent lockdep spew of init_lock]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram accounted but did not report numbers of failed read and write
queries. make these stats available as failed_reads and failed_writes
attrs.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparation patch for stats code duplication removal.
1) use atomic64_t for `pages_zero' and `pages_stored' zram stats.
2) `compr_size' and `pages_zero' struct zram_stats members did not
follow the existing device attr naming scheme: zram_stats.ATTR has
ATTR_show() function. rename them:
-- compr_size -> compr_data_size
-- pages_zero -> zero_pages
Minchan Kim's note:
If we really have trouble with atomic stat operation, we could
change it with percpu_counter so that it could solve atomic overhead and
unnecessary memory space by introducing unsigned long instead of 64bit
atomic_t.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove `good' and `bad' compressed sub-requests stats. RW request may
cause a number of RW sub-requests. zram used to account `good' compressed
sub-queries (with compressed size less than 50% of original size), `bad'
compressed sub-queries (with compressed size greater that 75% of original
size), leaving sub-requests with compression size between 50% and 75% of
original size not accounted and not reported. zram already accounts each
sub-request's compression size so we can calculate real device compression
ratio.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not pass rw argument down the __zram_make_request() -> zram_bvec_rw()
chain, decode it in zram_bvec_rw() instead. Besides, this is the place
where we distinguish READ and WRITE bio data directions, so account zram
RW stats here, instead of __zram_make_request(). This also allows to
account a real number of zram READ/WRITE operations, not just requests
(single RW request may cause a number of zram RW ops with separate
locking, compression/decompression, etc).
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce init_done() helper function which allows us to drop `init_done'
struct zram member. init_done() uses the fact that ->init_done == 1
equals to ->meta != NULL.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the libceph/osd side.
"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
In preparation for prefixing rbd writes with an allocation hint
introduce a num_ops parameter for rbd_osd_req_create(). The rationale
is that not every write request is a write op that needs to be prefixed
(e.g. watch op), so the num_ops logic needs to be in the callers.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Our longest osd request now contains 3 ops: copyup+hint+write.
Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was
hard-coded to 2. Fix it, and switch to rbd_assert while at it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is
not only insufficient, but also triggers an rbd_assert() in
rbd_obj_request_destroy():
Assertion failure in rbd_obj_request_destroy() at line 1867:
rbd_assert(obj_request->img_request == NULL);
rbd_img_obj_request_add() adds obj_requests to the img_request, the
opposite is rbd_img_obj_request_del(). Use it.
Fixes: http://tracker.ceph.com/issues/7327
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Commit 03507db631 ("rbd: fix buffer size for writes to images with
snapshots") moved the call to rbd_img_obj_request_add() up, making the
out_partial label bogus. Remove it.
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Olivier Bonvalet reported having repeated crashes due to a failed
assertion he was hitting in rbd_img_obj_callback():
Assertion failure in rbd_img_obj_callback() at line 2165:
rbd_assert(which >= img_request->next_completion);
With a lot of help from Olivier with reproducing the problem
we were able to determine the object and image requests had
already been completed (and often freed) at the point the
assertion failed.
There was a great deal of discussion on the ceph-devel mailing list
about this. The problem only arose when there were two (or more)
object requests in an image request, and the problem was always
seen when the second request was being completed.
The problem is due to a race in the window between setting the
"done" flag on an object request and checking the image request's
next completion value. When the first object request completes, it
checks to see if its successor request is marked "done", and if
so, that request is also completed. In the process, the image
request's next_completion value is updated to reflect that both
the first and second requests are completed. By the time the
second request is able to check the next_completion value, it
has been set to a value *greater* than its own "which" value,
which caused an assertion to fail.
Fix this problem by skipping over any completion processing
unless the completing object request is the next one expected.
Test only for inequality (not >=), and eliminate the bad
assertion.
Tested-by: Olivier Bonvalet <ob@daevel.fr>
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Checkpatch has started warning against using DEFINE_PCI_DEVICE_TABLE,
so replace it. Also update the copyright date and bump the module
version number to 0.9.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
dev->max_hw_sectors may be zero to indicate the device has no limit on
the number of sectors. nvme_trans_do_nvme_io() should use the software
limit, since this is guaranteed to be non-zero.
Reported-by: Mundu <mundu2510@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This adds rcu protected access to a queue in the nvme IOCTL path
to fix potential races between a surprise removal and queue usage in
nvme_submit_sync_cmd. The fix holds the rcu_read_lock() here to prevent
the nvme_queue from freeing while this path is executing so it can't
sleep, and so this path will no longer wait for a available command
id should they all be in use at the time a passthrough IOCTL request
is received.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
This adds rcu protected access to nvme_queue to fix a race between a
surprise removal freeing the queue and a thread with open reference on
a NVMe block device using that queue.
The queues do not need to be rcu protected during the initialization or
shutdown parts, so I've added a helper function for raw deferencing
to get around the sparse errors.
There is still a hole in the IOCTL path for the same problem, which is
fixed in a subsequent patch.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
If an NVMe device becomes ready but fails to create IO queues, the driver
creates a character device handle so the device can be managed. The
device reference count needs to be initialized before creating the
character device.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Add CONFIG_PM_SLEEP to suspend/resume functions to fix the following
build warning when CONFIG_PM_SLEEP is not selected. This is because
sleep PM callbacks defined by SIMPLE_DEV_PM_OPS are only used when
the CONFIG_PM_SLEEP is enabled.
drivers/block/nvme-core.c:2541:12: warning: 'nvme_suspend' defined but not used [-Wunused-function]
drivers/block/nvme-core.c:2550:12: warning: 'nvme_resume' defined but not used [-Wunused-function]
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Venkatash spake thus:
virtio-blk set the default queue depth to 64 requests, which was
insufficient for high-IOPS devices. Instead set the blk-queue depth to
the device's virtqueue depth divided by two (each I/O requires at least
two VQ entries).
But behold, Ted added a module parameter:
Also allow the queue depth to be something which can be set at module
load time or via a kernel boot-time parameter, for
testing/benchmarking purposes.
And I rewrote it substantially, mainly to take
VIRTIO_RING_F_INDIRECT_DESC into account.
As QEMU sets the vq size for PCI to 128, Venkatash's patch wouldn't
have made a change. This version does (since QEMU also offers
VIRTIO_RING_F_INDIRECT_DESC.
Inspired-by: "Theodore Ts'o" <tytso@mit.edu>
Based-on-the-true-story-of: Venkatesh Srinivas <venkateshs@google.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Cc: virtualization@lists.linux-foundation.org
Cc: Frank Swiderski <fes@google.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If drivers do dynamic allocation in the hardware command init
path, then we need to be able to handle and return failures.
And if they do allocations or mappings in the init command path,
then we need a cleanup function to free up that space at exit
time. So add blk_mq_free_commands() as the cleanup function.
This is required for the mtip32xx driver conversion to blk-mq.
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fixes 2 issues in the fast completion path:
1) Possible double completions / double dma_unmap_sg() calls due to lack
of atomicity in the check and subsequent dereference of the upper layer
callback function. Fixed with cmpxchg before unmap and callback.
2) Regression in unaligned IO constraining workaround for p420m devices.
Fixed by checking if IO is unaligned and using proper semaphore if so.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
If the buffers are unmapped after completing a request, then stale data
might be in the request.
Signed-off-by: Felipe Franciosi <felipe@paradoxo.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
We need to set the queue bounce limit during the device initialization to
prevent excessive bouncing on 32 bit architectures.
Signed-off-by: Felipe Franciosi <felipe@paradoxo.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
As result of deprecation of MSI-X/MSI enablement functions
pci_enable_msix() and pci_enable_msi_block() all drivers
using these two interfaces need to be updated to use the
new pci_enable_msi_range() or pci_enable_msi_exact()
and pci_enable_msix_range() or pci_enable_msix_exact()
interfaces.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-nvme@lists.infradead.org
Cc: linux-pci@vger.kernel.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently the driver falls back to INTx mode when MSI-X
initialization failed. This is a suboptimal behaviour
for chips that also support MSI. This update changes that
behaviour and falls back to MSI mode in case MSI-X mode
initialization failed.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: iss_storagedev@hp.com
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
interruptible_sleep_on is racy and going away. This replaces the one
caller in the swim3 driver with the equivalent race-free
wait_event_interruptible call. Since we're here already, this
also fixes the case where we get interrupted from atomic context,
which used to just spin in the loop.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>