kernel_samsung_sm7125

jenna

Author	SHA1	Message	Date
Tejun Heo	7918ffb5b8	cfq-iosched: implement cfq_group->nr_active and ->children_weight To prepare for blkcg hierarchy support, add cfqg->nr_active and ->children_weight. cfqg->nr_active counts the number of active cfqgs at the cfqg's level and ->children_weight is the sum of the weights of those cfqgs. The level covers itself (cfqg->leaf_weight) and immediate children. The two values are updated when a cfqg enters and leaves the group service tree. Unless the hierarchy is very deep, the added overhead should be negligible. Currently, the parent is determined using cfqg_flat_parent() which makes the root cfqg the parent of all other cfqgs. This is to make the transition to hierarchy-aware scheduling gradual. Scheduling logic will be converted to use cfqg->children_weight without actually changing the behavior. When everything is ready, blkcg_weight_parent() will be replaced with proper parent function. This patch doesn't introduce any behavior change. v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	12 years ago
Tejun Heo	e71357e118	cfq-iosched: add leaf_weight cfq blkcg is about to grow proper hierarchy handling, where a child blkg's weight would nest inside the parent's. This makes tasks in a blkg compete against both tasks in the sibling blkgs and the tasks of child blkgs. We're gonna use the existing weight as the group weight which decides the blkg's weight against its siblings. This patch introduces a new weight - leaf_weight - which decides the weight of a blkg against the child blkgs. It's named leaf_weight because another way to look at it is that each internal blkg node has a hidden child leaf node which contains all its tasks and leaf_weight is the weight of the leaf node and handled the same as the weight of the child blkgs. This patch only adds leaf_weight fields and exposes it to userland. The new weight isn't actually used anywhere yet. Note that cfq-iosched currently officially supports only single level hierarchy and root blkgs compete with the first level blkgs - ie. root weight is basically being used as leaf_weight. For root blkgs, the two weights are kept in sync for backward compatibility. v2: cfqd->root_group->leaf_weight initialization was missing from cfq_init_queue() causing divide by zero when !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com>	12 years ago
Tejun Heo	3c54786590	blkcg: make blkcg_gq's hierarchical Currently a child blkg (blkcg_gq) can be created even if its parent doesn't exist. ie. Given a blkg, it's not guaranteed that its ancestors will exist. This makes it difficult to implement proper hierarchy support for blkcg policies. Always create blkgs recursively and make a child blkg hold a reference to its parent. blkg->parent is added so that finding the parent is easy. blkcg_parent() is also added in the process. This change can be visible to userland. e.g. while issuing IO in a nested cgroup didn't affect the ancestors at all, now it will initialize all ancestor blkgs and zero stats for the request_queue will always appear on them. While this is userland visible, this shouldn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	12 years ago
Tejun Heo	93e6d5d8f5	blkcg: cosmetic updates to blkg_create() * Rename out_* labels to err_. Do ERR_PTR() conversion once in the error return path. This patch is cosmetic and to prepare for the hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	12 years ago
Tejun Heo	86cde6b623	blkcg: reorganize blkg_lookup_create() and friends Reorganize such that * __blkg_lookup() takes bool param @update_hint to determine whether to update hint. * __blkg_lookup_create() no longer performs lookup before trying to create. Renamed to blkg_create(). * blkg_lookup_create() now performs lookup and then invokes blkg_create() if lookup fails. * root_blkg creation in blkcg_activate_policy() updated accordingly. Note that blkcg_activate_policy() no longer updates lookup hint if root_blkg already exists. Except for the last lookup hint bit which is immaterial, this is pure reorganization and doesn't introduce any visible behavior change. This is to prepare for proper hierarchy support. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	12 years ago
Tejun Heo	356d2e5810	blkcg: fix minor bug in blkg_alloc() blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice. The latter test should have been on whether pol->pd_init_fn() exists. This doesn't cause actual problems because both blkcg policies implement pol->pd_init_fn(). Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	12 years ago
Vivek Goyal	b226e5c411	cfq-iosched: Print sync-noidle information in blktrace messages Currently we attach a character "S" or "A" to the cfqq<pid>, to represent whether the queue is sync or async. Add one more character "N" to represent whether it is a sync-noidle queue or a sync queue. So now the three different types of queues will look as follows. cfq1234S --> sync queue cfq1234SN --> sync noidle queue cfq1234A --> Async queue Previously S/A classification was being printed only if group scheduling was enabled. This patch also makes sure that this classification is displayed even if group idling is disabled. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	12 years ago
Vivek Goyal	1f23f12151	cfq-iosched: Get rid of unnecessary local variable Use of local variable "n" seems to be unnecessary. Remove it. This brings it in line with the function __cfq_group_st_add(), which does the similar operation of adding a group to an rb tree. No functionality change here. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	12 years ago
Vivek Goyal	6d816ec7c8	cfq-iosched: Rename few functions related to selecting workload choose_service_tree() selects/sets both wl_class and wl_type. Rename it to choose_wl_class_and_type() to make it very clear. cfq_choose_wl() only selects and sets wl_type. It is easy to confuse it with choose_st(). So rename it to cfq_choose_wl_type() to make it clear what it does. Just renaming. No functionality change. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	12 years ago
Vivek Goyal	34b98d03bd	cfq-iosched: Rename "service_tree" to "st" at some places At quite a few places we use the keyword "service_tree". At some places, especially local variables, I have abbreviated it to "st". Also at couple of places moved binary operator "+" from beginning of line to end of previous line, as per Tejun's feedback. v2: Reverted most of the service tree name change based on Jeff Moyer's feedback. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	12 years ago
Vivek Goyal	4d2ceea4cb	cfq-iosched: More renaming to better represent wl_class and wl_type Some more renaming. Again making the code uniform w.r.t use of wl_class/class to represent IO class (RT, BE, IDLE) and using wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC). At places this patch shortens the string "workload" to "wl". Renamed "saved_workload" to "saved_wl_type". Renamed "saved_serving_class" to "saved_wl_class". For uniformity with "saved_wl_*" variables, renamed "serving_class" to "serving_wl_class" and renamed "serving_type" to "serving_wl_type". Again, just trying to improve upon code uniformity and improve readability. No functional change. v2: - Restored the usage of keyword "service" based on Jeff Moyer's feedback. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	12 years ago
Vivek Goyal	3bf10fea3b	cfq-iosched: Properly name all references to IO class Currently CFQ has three IO classes, RT, BE and IDLE. At many a places we are calling workloads belonging to these classes as "prio". This gets very confusing as one starts to associate it with ioprio. So this patch just does bunch of renaming so that reading code becomes easier. All reference to RT, BE and IDLE workload are done using keyword "class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made using keyword "type". This makes me feel much better while I am reading the code. There is no functionality change due to this patch. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	12 years ago
Derek Basehore	12c2bdb232	block: prevent race/cleanup Remove a race condition which causes a warning in disk_clear_events. This is a race between disk_clear_events() and disk_flush_events(). ev->clearing will be altered by disk_flush_events() even though we are blocking event checking through disk_flush_events(). If this happens after ev->clearing was cleared for disk_clear_events(), this can cause the WARN_ON_ONCE() in that function to be triggered. This change also has disk_clear_events() not go through a workqueue. Since we have to wait for the work to complete, we should just call the function directly. Also, since this work cannot be put on a freezable workqueue, it will have to contend with increased demand, so calling the function directly avoids this. [akpm@linux-foundation.org: fix spello in comment] Signed-off-by: Derek Basehore <dbasehore@chromium.org> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Derek Basehore	aea24a8bbc	block: remove deadlock in disk_clear_events In disk_clear_events, do not put work on system_nrt_freezable_wq. Instead, put it on system_nrt_wq. There is a race between probing a usb and suspending the device. Since probing a usb calls disk_clear_events, which puts work on a frozen workqueue, probing cannot finish after the workqueue is frozen. However, suspending cannot finish until the usb probe is finished, so we get a deadlock, causing the system to reboot. The way to reproduce this bug is to wake up from suspend with a usb storage device plugged in, or plugging in a usb storage device right before suspend. The window of time is on the order of time it takes to probe the usb device. As long as the workqueues are frozen before the call to add_disk within sd_probe_async finishes, there will be a deadlock (which calls blkdev_get, sd_open, check_disk_change, then disk_clear_events). This is not difficult to reproduce after figuring out the timings. [akpm@linux-foundation.org: fix up comment] Signed-off-by: Derek Basehore <dbasehore@chromium.org> Reviewed-by: Mandeep Singh Baines <msb@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Oleg Nesterov	22b361d1df	percpu_rw_semaphore: introduce CONFIG_PERCPU_RWSEM Currently only block_dev and uprobes use percpu_rw_semaphore, add the config option selected by BLOCK || UPROBES. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Anton Arapov <anton@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	12 years ago
NeilBrown	cbae8d45d6	block: export block_unplug tracepoint This allows stacked devices (like md/raid5) to provide blktrace tracing, including unplug events. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Shaohua Li	0cfbcafcae	block: add plug for blkdev_issue_discard Last post of this patch appears lost, so I resend this. Now discard merge works, add plug for blkdev_issue_discard. This will help discard request merge especially for raid0 case. In raid0, a big discard request is split to small requests, and if correct plug is added, such small requests can be merged in low layer. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Shaohua Li	8dd2cb7e88	block: discard granularity might not be power of 2 In MD raid case, discard granularity might not be power of 2, for example, a 4-disk raid5 has 3*chunk_size discard granularity. Correct the calculation for such cases. Reported-by: Neil Brown <neilb@suse.de> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
xiaobing tu	75274551c8	deadline: Allow 0ms deadline latency, increase the read speed Change a timer compare from after to after-equals, thus allowing 0 timeout and making deadline schedule FIFO. Signed-off-by: xiaobing tu <xiaobing.tu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Diego Calleja	5f6f38dbb0	partitions: enable EFI/GPT support by default The Kconfig currently enables MSDOS partitions by default because they are assumed to be essential, but it's necessary to enable "advanced partition selection" in order to get GPT support. IMO GPT partitions are becoming common enough to deserve the same treatment MSDOS partitions get. (Side note: I got bit by a disk that had MSDOS and GPT partition tables, but for some reason the MSDOS table was different from the GPT one. I was stupid enough to disable "advanced partition selection" in my .config, which disabled GPT partitioning and made my btrfs pool unbootable because it couldn't find the partitions) Signed-off-by: Diego Calleja <diegocg@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Bart Van Assche	80729beb33	bsg: Remove unused function bsg_goose_queue() The function bsg_goose_queue() does not have any in-tree callers, so let's remove it. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Bart Van Assche	24faf6f604	block: Make blk_cleanup_queue() wait until request_fn finished Some request_fn implementations, e.g. scsi_request_fn(), unlock the queue lock internally. This may result in multiple threads executing request_fn for the same queue simultaneously. Keep track of the number of active request_fn calls and make sure that blk_cleanup_queue() waits until all active request_fn invocations have finished. A block driver may start cleaning up resources needed by its request_fn as soon as blk_cleanup_queue() finished, so blk_cleanup_queue() must wait for all outstanding request_fn invocations to finish. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Chanho Min <chanho.min@lge.com> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Bart Van Assche	704605711e	block: Avoid scheduling delayed work on a dead queue Running a queue must continue after it has been marked dying until it has been marked dead. So the function blk_run_queue_async() must not schedule delayed work after blk_cleanup_queue() has marked a queue dead. Hence add a test for that queue state in blk_run_queue_async() and make sure that queue_unplugged() invokes that function with the queue lock held. This avoids that the queue state can change after it has been tested and before mod_delayed_work() is invoked. Drop the queue dying test in queue_unplugged() since it is now superfluous: __blk_run_queue() already tests whether or not the queue is dead. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Bart Van Assche	c246e80d86	block: Avoid that request_fn is invoked on a dead queue A block driver may start cleaning up resources needed by its request_fn as soon as blk_cleanup_queue() finished, so request_fn must not be invoked after draining finished. This is important when blk_run_queue() is invoked without any requests in progress. As an example, if blk_drain_queue() and scsi_run_queue() run in parallel, blk_drain_queue() may have finished all requests after scsi_run_queue() has taken a SCSI device off the starved list but before that last function has had a chance to run the queue. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Chanho Min <chanho.min@lge.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Bart Van Assche	807592a4fa	block: Let blk_drain_queue() caller obtain the queue lock Let the caller of blk_drain_queue() obtain the queue lock to improve readability of the patch called "Avoid that request_fn is invoked on a dead queue". Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Chanho Min <chanho.min@lge.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Bart Van Assche	3f3299d5c0	block: Rename queue dead flag QUEUE_FLAG_DEAD is used to indicate that queuing new requests must stop. After this flag has been set queue draining starts. However, during the queue draining phase it is still safe to invoke the queue's request_fn, so QUEUE_FLAG_DYING is a better name for this flag. This patch has been generated by running the following command over the kernel source tree: git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' | xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" include/linux/blkdev.h; sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' -e 's/Dead queue/A dying queue/' block/blk-core.c Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: James Bottomley <JBottomley@Parallels.com> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Chanho Min <chanho.min@lge.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Roland Dreier	893d290f1d	block: Don't access request after it might be freed After we've done __elv_add_request() and __blk_run_queue() in blk_execute_rq_nowait(), the request might finish and be freed immediately. Therefore checking if the type is REQ_TYPE_PM_RESUME isn't safe afterwards, because if it isn't, rq might be gone. Instead, check beforehand and stash the result in a temporary. This fixes crashes in blk_execute_rq_nowait() I get occasionally when running with lots of memory debugging options enabled -- I think this race is usually harmless because the window for rq to be reallocated is so small. Signed-off-by: Roland Dreier <roland@purestorage.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Stephen Warren	d33b98fc82	block: partition: msdos: provide UUIDs for partitions The MSDOS/MBR partition table includes a 32-bit unique ID, often referred to as the NT disk signature. When combined with a partition number within the table, this can form a unique ID similar in concept to EFI/GPT's partition UUID. Constructing and recording this value in struct partition_meta_info allows MSDOS partitions to be referred to on the kernel command-line using the following syntax: root=PARTUUID=0002dd75-01 Signed-off-by: Stephen Warren <swarren@nvidia.com> Cc: Tejun Heo <tj@kernel.org> Cc: Will Drewry <wad@chromium.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Stephen Warren	1ad7e89940	block: store partition_meta_info.uuid as a string This will allow other types of UUID to be stored here, aside from true UUIDs. This also simplifies code that uses this field, since it's usually constructed from a, used as a, or compared to other, strings. Note: A simplistic approach here would be to set uuid_str[36]=0 whenever a /PARTNROFF option was found to be present. However, this modifies the input string, and causes subsequent calls to devt_from_partuuid() not to see the /PARTNROFF option, which causes different results. In order to avoid misleading future maintainers, this parameter is marked const. Signed-off-by: Stephen Warren <swarren@nvidia.com> Cc: Tejun Heo <tj@kernel.org> Cc: Will Drewry <wad@chromium.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Tejun Heo	92fb97487a	cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	12 years ago
Ezequiel Garcia	c304a51bf4	block: use NUMA_NO_NODE instead of -1 Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com> Modified by me to cover blk_init_queue() as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Shaohua Li	bee0393cc1	block: recursive merge requests In a workload, thread 1 accesses a, a+2, ..., thread 2 accesses a+1, a+3,.... When the requests are flushed to queue, a and a+1 are merged to (a, a+1), a+2 and a+3 too to (a+2, a+3), but (a, a+1) and (a+2, a+3) aren't merged. If we do a recursive merge for such interleaved access, the throughput of some workloads improves. A recent workload I'm checking on is swap; the change below boosts its throughput by around 5% to 10%. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Shaohua Li	3d106fba2e	block CFQ: avoid moving request to different queue A request is queued in the cfqq->fifo list. It looks possible that we move a request from one cfqq to another in the request merge case. In such a case, adjusting the fifo list order doesn't make sense and is impossible if we don't iterate the whole fifo list. My test does hit one case where the two cfqq are different, but it didn't cause a kernel crash, maybe because the fifo list isn't used frequently. Anyway, from the code logic, this is buggy. I think we can re-enable the recursive merge logic after this is fixed. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Tejun Heo	bcf6de1b91	cgroup: make ->pre_destroy() return void All ->pre_destroy() implementations return 0 now, which is the only allowed return value. Make it return void. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com>	12 years ago
Jianpeng Ma	975927b942	block: Add blk_rq_pos(rq) to sort rq when plushing My workload is a raid5 array with 16 disks, written to through our filesystem in direct-io mode. I used blktrace and found these messages: 8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5] 8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5] 8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5] 8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5] 8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5] 8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5] 8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5] 8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5] 8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5] 8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5] 8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5] 8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5] 8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5] 8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5] 8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5] 8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5] 8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5] 8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5] 8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5] 8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5] 8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5] 8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5] 8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5] 8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5] 8,16 0 0 2.453853661 0 m N cfq2579 insert_request 8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5] 8,16 0 0 2.453854439 0 m N cfq2579 insert_request 8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2 8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1 8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert 8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request 8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1 8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5] 8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1 8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert 8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request 8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2 8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5] 8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0] 8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0 8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0] 8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0 8,16 0 0 2.454795160 0 m N cfq schedule dispatch From the above messages, we can see that rq[W 7493144 + 104] and rq[W 7493120 + 24] are not merged, because the bio order is: 8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5] 8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5] 8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5] 8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5] The bio(7493144) comes first and the bio(7493120) later, so the subsequent bios are divided into two parts. When flushing the plug list, elv_attempt_insert_merge only supports back merges, not front merges, so rq[7493120 + 24] can't merge with rq[7493144 + 104]. In my tests, this situation accounts for about 25% of cases on our system. With this patch, it no longer occurs. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> CC:Shaohua Li <shli@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Kees Cook	8e42e0a23d	block: remove CONFIG_EXPERIMENTAL This config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it. CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Jun'ichi Nomura	65c77fd9e8	blkcg: stop iteration early if root_rl is the only request list __blk_queue_next_rl() finds next request list based on blkg_list while skipping root_blkg in the list. OTOH, root_rl is special as it may exist even without root_blkg. Though the later part of the function handles such a case correctly, exiting early is good for readability of the code. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Jun'ichi Nomura	65635cbc37	blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkg blk_put_rl() does not call blkg_put() for q->root_rl because we don't take request list reference on q->root_blkg. However, if root_blkg is once attached then detached (freed), blk_put_rl() is confused by the bogus pointer in q->root_blkg. For example, with !CONFIG_BLK_DEV_THROTTLING && CONFIG_CFQ_GROUP_IOSCHED, switching IO scheduler from cfq to deadline will cause system stall after the following warning with 3.6: > WARNING: at /work/build/linux/block/blk-cgroup.h:250 > blk_put_rl+0x4d/0x95() > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1 > Call Trace: > <IRQ> [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d > [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c > [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95 > [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb > [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f > [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d > [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d > [<ffffffff811d7728>] blk_end_request+0x10/0x12 > [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5 > [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103 > [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108 > [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1 > [<ffffffff810915d5>] ? > generic_smp_call_function_single_interrupt+0x9f/0xd7 > [<ffffffff8104cf5b>] __do_softirq+0x102/0x213 > [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb > [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d > [<ffffffff81424dfc>] call_softirq+0x1c/0x30 > [<ffffffff81011beb>] do_softirq+0x4b/0xa3 > [<ffffffff8104cdb0>] irq_exit+0x53/0xd5 > [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36 > [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80 > <EOI> [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd > [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd > [<ffffffff81017811>] cpu_idle+0xbb/0x114 > [<ffffffff81401fbd>] rest_init+0xc1/0xc8 > [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c > [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1 > [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7 > [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd > [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110 This patch clears q->root_blkg and q->root_rl.blkg when root blkg is destroyed. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	12 years ago
Stefan Weinhuber	46e8894786	s390/partitions: make partition detection independent from DASD ioctls In some usage scenarios it is desirable to work with disk images or virtualized DASD devices. One problem that prevents such applications is the partition detection in ibm.c. Currently it works only for devices that support the BIODASDINFO2 ioctl, in other words, it only works for devices that belong to the DASD device driver. The information gained from the BIODASDINFO2 ioctl is absolutely necessary only for a small set of legacy cases. All current VOL1, LNX1 and CMS1 type of disk labels can be interpreted correctly without this information, as long as the generic HDIO_GETGEO ioctl works and provides a correct disk geometry. This patch makes the ibm.c partition detection as independent as possible from the BIODASDINFO2 ioctl. Only the following two cases are still restricted to real DASDs: - An FBA DASD, or LDL formatted ECKD DASD without any disk label. - An old style LNX1 label (without large volume support) on a disk with inconsistent device geometry. Signed-off-by: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	13 years ago
Tejun Heo	60ea8226cb	block: fix request_queue->flags initialization A queue newly allocated with blk_alloc_queue_node() has only QUEUE_FLAG_BYPASS set. For request-based drivers, blk_init_allocated_queue() is called and q->queue_flags is overwritten with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the initial bypass is still in effect. In blk_init_allocated_queue(), or QUEUE_FLAG_DEFAULT to q->queue_flags instead of overwriting. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
Tejun Heo	749fefe677	block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() `b82d4b197c` ("blkcg: make request_queue bypassing on allocation") made request_queues bypassed on allocation to avoid switching on and off bypass mode on a queue being initialized. Some drivers allocate and then destroy a lot of queues without fully initializing them and incurring bypass latency overhead on each of them could add up to significant overhead. Unfortunately, blk_init_allocated_queue() is never used by queues of bio-based drivers, which means that all bio-based driver queues are in bypass mode even after initialization and registration complete successfully. Due to the limited way request_queues are used by bio drivers, this problem is hidden pretty well but it shows up when blk-throttle is used in combination with a bio-based driver. Trying to configure (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs indefinitely in blkg_conf_prep() waiting for bypass mode to end. This patch moves the initial blk_queue_bypass_end() call from blk_init_allocated_queue() to blk_register_queue() which is called for any userland-visible queues regardless of their type. I believe this is correct because I don't think there is any block driver which needs or wants working elevator and blk-cgroup on a queue which isn't visible to userland. If there are such users, we need a different solution. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au> Cc: stable@vger.kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
Martin K. Petersen	66ba32dc16	block: ioctl to zero block ranges Introduce a BLKZEROOUT ioctl which can be used to clear block ranges by way of blkdev_issue_zeroout(). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
Martin K. Petersen	579e8f3c7b	block: Make blkdev_issue_zeroout use WRITE SAME If the device supports WRITE SAME, use that to optimize zeroing of blocks. If the device does not support WRITE SAME or if the operation fails, fall back to writing zeroes the old-fashioned way. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
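The "try the optimized path, fall back on failure" shape described in the commit above can be sketched in plain C. This is a user-space analogy under stated assumptions, not the kernel code: `demo_zeroout()`, the `demo_fast_zero_fn` callback type, and `demo_fast_unsupported()` are hypothetical names, and a memory buffer stands in for the block range.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fast path: returns 0 on success, non-zero when the
 * "device" cannot do it or the operation fails. */
typedef int (*demo_fast_zero_fn)(unsigned char *buf, size_t len);

/* Zero a range: use the fast path when one is available and succeeds,
 * otherwise write the zeroes explicitly. */
int demo_zeroout(unsigned char *buf, size_t len, demo_fast_zero_fn fast)
{
    if (fast != NULL && fast(buf, len) == 0)
        return 0;            /* optimized path did the work */

    memset(buf, 0, len);     /* fallback: write zeroes the slow way */
    return 0;
}

/* Stand-in for a device without the optimized command. */
int demo_fast_unsupported(unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -1;
}
```

With `demo_fast_unsupported` as the callback, or with `NULL`, the buffer is still zeroed through the fallback, so the caller sees the same result either way.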
Martin K. Petersen	4363ac7c13	block: Implement support for WRITE SAME The WRITE SAME command supported on some SCSI devices allows the same block to be efficiently replicated throughout a block range. Only a single logical block is transferred from the host and the storage device writes the same data to all blocks described by the I/O. This patch implements support for WRITE SAME in the block layer. The blkdev_issue_write_same() function can be used by filesystems and block drivers to replicate a buffer across a block range. This can be used to efficiently initialize software RAID devices, etc. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
Martin K. Petersen	f31dc1cd49	block: Consolidate command flag and queue limit checks for merges - blk_check_merge_flags() verifies that cmd_flags / bi_rw are compatible. This function is called for both req-req and req-bio merging. - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used to query the maximum sector count for a given request or queue. The calls will return the right value from the queue limits given the type of command (RW, discard, write same, etc.) Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
Martin K. Petersen	e2a60da74f	block: Clean up special command handling logic Remove special-casing of non-rw fs style requests (discard). The nomerge flags are consolidated in blk_types.h, and rq_mergeable() and bio_mergeable() have been modified to use them. bio_is_rw() is used in place of bio_has_data() in a few places. This is done to distinguish true reads and writes from other fs type requests that carry a payload (e.g. write same). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
Alan Cox	2bd6efad25	blk: add an upper sanity check on partition adding 65536 should be ludicrous anyway but without it we overflow the memory computation doing the allocation and badness occurs. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
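The idea of bounding a count before it feeds a size computation, as in the commit above, can be sketched as follows. The names `DEMO_MAX_PARTS`, `struct demo_part`, and `demo_table_size()` are hypothetical, and the exact comparison the kernel uses is an assumption here; only the 65536 limit comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Upper bound taken from the commit message; the name is hypothetical. */
#define DEMO_MAX_PARTS 65536

/* Hypothetical per-partition entry, for illustration only. */
struct demo_part {
    uint64_t start;
    uint64_t len;
};

/* Compute the byte size of a table holding `count` entries.
 * Returns 0 and stores the size in *out, or -1 when the count is
 * outside the accepted range, so the multiplication below is never
 * performed on a value large enough to overflow. */
int demo_table_size(int count, size_t *out)
{
    if (count <= 0 || count > DEMO_MAX_PARTS)
        return -1;

    *out = sizeof(size_t) + (size_t)count * sizeof(struct demo_part);
    return 0;
}
```

Rejecting the count first keeps the arithmetic trivially safe, which is simpler than detecting an overflow after the fact.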
Tejun Heo	8c7f6edbda	cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them Currently, cgroup hierarchy support is a mess. cpu related subsystems behave correctly - configuration, accounting and control on a parent properly cover its children. blkio and freezer completely ignore hierarchy and treat all cgroups as if they're directly under the root cgroup. Others show yet different behaviors. These differing interpretations of cgroup hierarchy make using cgroup confusing and make it impossible to co-mount controllers into the same hierarchy and obtain sane behavior. Eventually, we want full hierarchy support from all subsystems and probably a unified hierarchy. Users using separate hierarchies expecting completely different behaviors depending on the mounted subsystem is detrimental to making any progress on this front. This patch adds cgroup_subsys.broken_hierarchy and sets it to %true for controllers which are lacking in hierarchy support. The goal of this patch is two-fold. * Move users away from using hierarchy on currently non-hierarchical subsystems, so that implementing proper hierarchy support on those doesn't surprise them. * Keep track of which controllers are broken how and nudge the subsystems to implement proper hierarchy support. For now, start with a single warning message. We can whine louder later on. v2: Fixed a typo spotted by Michal. Warning message updated. v3: Updated memcg part so that it doesn't generate warning in the cases where .use_hierarchy=false doesn't make the behavior different from root.use_hierarchy=true. Fixed a typo spotted by Glauber. v4: Check ->broken_hierarchy after cgroup creation is complete so that ->create() can affect the result per Michal. Dropped unnecessary memcg root handling per Michal. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	13 years ago
Peter Senna Tschudin	d41570b746	block/blk-tag.c: Remove useless kfree Remove useless kfree() and clean up code related to the removal. The semantic patch that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r exists@ position p1,p2; expression x; @@ if (x@p1 == NULL) { ... kfree@p2(x); ... return ...; } @unchanged exists@ position r.p1,r.p2; expression e <= r.x,x,e1; iterator I; statement S; @@ if (x@p1 == NULL) { ... when != I(x,...) S when != e = e1 when != e += e1 when != e -= e1 when != ++e when != --e when != e++ when != e-- when != &e kfree@p2(x); ... return ...; } @ok depends on unchanged exists@ position any r.p1; position r.p2; expression x; @@ ... when != true x@p1 == NULL kfree@p2(x); @depends on !ok && unchanged@ position r.p2; expression x; @@ *kfree@p2(x); // </smpl> Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago
Jaehoon Chung	e32463b2f7	block: remove the duplicated setting for congestion_threshold Before this call to blk_queue_congestion_threshold(), blk_queue_congestion_threshold() has already been called in blk_queue_make_request(). Because this code is a duplicate, it has been removed. Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	13 years ago

1902 Commits (dcf0105039660e951dfea348d317043d17988dfc)