This patch allows the various users of attribute_groups to selectively
allow the appearance of group attributes. The primary consumer of
this will be the transport classes in which we currently have
elaborate attribute selection algorithms to do this same thing.
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
This patch is the beginning of moving the attribute_containers to use
attribute groups exclusively. The attr element is now deprecated and
will eventually be removed (along with all the hand rolled code for
doing exactly what attribute groups do) when all the consumers are
converted to attribute groups.
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
The purpose of this is to allow stacked alignment settings, with the
ultimate queue alignment being set to the largest alignment requirement
in the stack.
The reason for this is so that the SCSI mid-layer can relax the default
alignment requirements (which are basically causing a lot of superfluous
copying to go on in the SG_IO interface) while allowing transports,
devices or HBAs to add stricter limits if they need them.
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
I realize that sg chaining is a ploy to make the rest of the kernel
devs feel the pain of the SCSI subsystem. But this was a little
unsubtle.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Create a bit to signal that a napi_disable() is in progress.
This sets up infrastructure such that net_rx_action() can generically
break out of the ->poll() loop on a NAPI context that has a pending
napi_disable() yet is being bombed with packets (and thus would
otherwise poll endlessly and not allow the napi_disable() to finish).
Now, what napi_disable() does is first set the NAPI_STATE_DISABLE bit
(to indicate that a disable is pending), then it polls for the
NAPI_STATE_SCHED bit, and once the NAPI_STATE_SCHED bit is acquired
the NAPI_STATE_DISABLE bit is cleared. Here, the test_and_set_bit()
provides the necessary memory barrier between the various bitops.
napi_schedule_prep() now tests for a pending disable as it's first
action and won't try to obtain the NAPI_STATE_SCHED bit if a disable
is pending.
As a result, we can remove the netif_running() check in
netif_rx_schedule_prep() because the NAPI disable pending state serves
this purpose. And, it does so in a NAPI centric manner which is what
we really want.
Signed-off-by: David S. Miller <davem@davemloft.net>
It is pointless, because everything that can make a device go away
will do a napi_disable() first.
The main impetus behind this is that now we can legally do a NAPI
completion in generic code like net_rx_action() which a following
changeset needs to do. net_rx_action() can only perform actions
in NAPI centric ways, because there may be a one to many mapping
between NAPI contexts and network devices (SKY2 is one example).
We also want to get rid of this because it's an extra atomic in the
NAPI paths, and also because it is one of the last instances where the
NAPI interfaces care about net devices.
The one remaining netdev detail the NAPI stuff cares about is the
netif_running() check which will be killed off in a subsequent
changeset.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleaning out all the incorrect 'no change made' checks for termios
settings showed up a problem with the PL2303. The hardware here seems to
lose sync and bits if you tell it to make no changes. This shows up with
a real world application.
To fix this the driver check for meaningful hardware changes is restored
but doing the tests correctly and as a tty layer function so it doesn't
get duplicated wrongly everywhere if other drivers turn out to need it.
Signed-off-by: Alan Cox <alan@redhat.com>
Tested-by: Mirko Parthey <mirko.parthey@informatik.tu-chemnitz.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 664cceb009 changed the parameters of
the function make_key_ref(). The macros that are used in case CONFIG_KEY
is not defined did not change.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make randconfig bootup testing found that the cpufreq code
crashes on bootup, if the powernow-k8 driver is enabled and
if maxcpus=1 passed on the boot line to a !CONFIG_HOTPLUG_CPU
kernel.
First lockdep found out that there's an inconsistent unlock
sequence:
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
swapper/1 is trying to release lock (&per_cpu(cpu_policy_rwsem, cpu)) at:
[<ffffffff806ffd8e>] unlock_policy_rwsem_write+0x3c/0x42
but there are no more locks to release!
Call Trace:
[<ffffffff806ffd8e>] unlock_policy_rwsem_write+0x3c/0x42
[<ffffffff80251c29>] print_unlock_inbalance_bug+0x104/0x12c
[<ffffffff80252f3a>] mark_held_locks+0x56/0x94
[<ffffffff806ffd8e>] unlock_policy_rwsem_write+0x3c/0x42
[<ffffffff807008b6>] cpufreq_add_dev+0x2a8/0x5c4
...
then shortly afterwards the cpufreq code crashed on an assert:
------------[ cut here ]------------
kernel BUG at drivers/cpufreq/cpufreq.c:1068!
invalid opcode: 0000 [1] SMP
[...]
Call Trace:
[<ffffffff805145d6>] sysdev_driver_unregister+0x5b/0x91
[<ffffffff806ff520>] cpufreq_register_driver+0x15d/0x1a2
[<ffffffff80cc0596>] powernowk8_init+0x86/0x94
[...]
---[ end trace 1e9219be2b4431de ]---
the bug was caused by maxcpus=1 bootup, which brought up the
secondary core as !cpu_online() but !cpu_is_offline() either,
which on on !CONFIG_HOTPLUG_CPU is always 0 (include/linux/cpu.h):
/* CPUs don't go offline once they're online w/o CONFIG_HOTPLUG_CPU */
static inline int cpu_is_offline(int cpu) { return 0; }
but the cpufreq code uses cpu_online() and cpu_is_offline() in
a mixed way - the low-level drivers use cpu_online(), while
the cpufreq core uses cpu_is_offline(). This opened up the
possibility to add the non-initialized sysdev device of the
secondary core:
cpufreq-core: trying to register driver powernow-k8
cpufreq-core: adding CPU 0
powernow-k8: BIOS error - no PSB or ACPI _PSS objects
cpufreq-core: initialization failed
cpufreq-core: adding CPU 1
cpufreq-core: initialization failed
which then blew up. The fix is to make cpu_is_offline() always
the negation of cpu_online(). With that fix applied the kernel
boots up fine without crashing:
Calling initcall 0xffffffff80cc0510: powernowk8_init+0x0/0x94()
powernow-k8: Found 1 AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ processors (1 cpu cores) (version 2.20.00)
powernow-k8: BIOS error - no PSB or ACPI _PSS objects
initcall 0xffffffff80cc0510: powernowk8_init+0x0/0x94() returned -19.
initcall 0xffffffff80cc0510 ran for 19 msecs: powernowk8_init+0x0/0x94()
Calling initcall 0xffffffff80cc328f: init_lapic_nmi_sysfs+0x0/0x39()
We could fix this by making CPU enumeration aware of max_cpus, but that
would be more fragile IMO, and the cpu_online(cpu) != cpu_is_offline(cpu)
possibility was quite confusing and a continuous source of bugs too.
Most distributions have kernels with CPU hotplug enabled, so this bug
remained hidden for a long time.
Bug forensics:
The broken cpu_is_offline() API variant was introduced via:
commit a59d2e4e6977e7b94e003c96a41f07e96cddc340
Author: Rusty Russell <rusty@rustcorp.com.au>
Date: Mon Mar 8 06:06:03 2004 -0800
[PATCH] minor cleanups for hotplug CPUs
( this predates linux-2.6.git, this commit is available from Thomas's
historic git tree. )
Then 1.5 years later the cpufreq code made use of it:
commit c32b6b8e52
Author: Ashok Raj <ashok.raj@intel.com>
Date: Sun Oct 30 14:59:54 2005 -0800
[PATCH] create and destroy cpufreq sysfs entries based on cpu notifiers
+ if (cpu_is_offline(cpu))
+ return 0;
which is a correct use of the subtly broken new API. v2.6.15 then
shipped with this bug included.
then it took two more years for random-kernel qa to hit it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Contents of /proc/*/maps is sensitive and may become sensitive after
open() (e.g. if target originally shares our ->mm and later does exec
on suid-root binary).
Check at read() (actually, ->start() of iterator) time that mm_struct
we'd grabbed and locked is
- still the ->mm of target
- equal to reader's ->mm or the target is ptracable by reader.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both SLUB and SLAB really did almost exactly the same thing for
/proc/slabinfo setup, using duplicate code and per-allocator #ifdef's.
This just creates a common CONFIG_SLABINFO that is enabled by both SLUB
and SLAB, and shares all the setup code. Maybe SLOB will want this some
day too.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move veth.h from net/ to linux/ since it is a user api, and add it to
user header processing Kbuild.
[ Use header-y as suggested by Sam Ravnborg. -DaveM ]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
iproute2 build needs tc_nat.h header from kernel make install_headers.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
quicklists must keep even off node pages on the quicklists until the TLB
flush has been completed.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure dm honours max_hw_sectors of underlying devices
We still have no firm testing evidence in support of this patch but
believe it may help to resolve some bug reports. - agk
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add unlocked version for use by irq_chip.set_type handlers which may
wish to change handler to level or edge handler when IRQ type is
changed.
The normal set_irq_handler() call cannot be used because it tries to
take irq_desc.lock which is already held when the irq_chip.set_type
hook is called.
Signed-off-by: Kevin Hilman <khilman@mvista.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
elv_register() always returns 0, and there isn't anything it does where
it should return an error (the only error condition is so grave that
it's handled with a BUG_ON).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This reverts commit 54f9f80d65 ("hugetlb:
Add hugetlb_dynamic_pool sysctl")
Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
sysctl is not needed, as its semantics can be expressed by 0 in the
overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
(pool enabled).
(Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:
1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.
2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.
To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition
nr_overcommit_hugepages > 0
indicates the same administrative setting as
hugetlb_dynamic_pool == 1
Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
A few caveats:
1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has allocated. Another process could then
try grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.
2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls are increased
sufficiently, or the surplus huge pages go out of use and are freed.
Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These types define the size of data read from /dev/apm_bios. They should
not be hidden behind #ifdef __KERNEL__.
This is killing my xserver compile, apm_event_t is used in the xserver
source.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make[3]: *** No rule to make target `/usr/src/devel/include/linux/ticable.h', needed by `/usr/src/devel/usr/include/linux/ticable.h'. Stop.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With ATAPI transfer chunk size properly programmed, libata PIO HSM
should be able to handle full spurious data chunks. Also, it's a good
idea to suppress trailing data warning for misc ATAPI commands as
there can be many of them per command - for example, if the chunk size
is 16 and the drive tries to transfer 510 bytes, there can be 31
trailing data messages.
This patch makes the following updates to libata ATAPI PIO HSM
implementation.
* Make it drain full spurious chunks.
* Suppress trailing data warning message for misc commands.
* Put limit on how many bytes can be drained.
* If odd, round up consumed bytes and the number of bytes to be
drained. This gets the number of bytes to drain right for drivers
which do 16bit PIO.
This patch is partial backport of improve-ATAPI-data-xfer patchset
pending for #upstream.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
On certain implementations, _GTF evaluation depends on preceding _STM
and both can be pretty picky about the configuration. Using _GTM
result cached during controller initialization satisfies the most
neurotic _STM implementation. However, libata evaluates _GTF after
reset during device configuration and the hardware state can be
different from what _GTF expects and can cause evaluation failure.
This patch adds dev->gtf_cache and updates ata_dev_get_GTF() such that
it uses the cached value if available. Cache is cleared with a call
to ata_acpi_clear_gtf().
Because for SATA ACPI nodes _GTF must be evaluated after _SDD which
can't be done till IDENTIFY is complete, _GTF caching from
ata_acpi_on_resume() is used only for IDE ACPI nodes.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
_GTM fetches currently configured transfer mode while _STM configures
controller according to _GTM parameter and prepares transfer mode
configuration TFs for _GTF. In many cases _GTM and _STM
implementations are quite brittle and can't cope with configuration
changed by libata.
libata does not depend on ATA ACPI to configure devices. The only
reason libata performs _GTM and _STM are to make _GTF evaluation
succeed and libata also doesn't care about how _GTF TFs configure
transfer mode. It overrides that configuration anyway, so from
libata's POV, it doesn't matter what value is feeded to _STM as long
as evaluation succeeds for _STM and following _GTF.
This patch adds dev->__acpi_init_gtm and store initial _GTM values on
host initialization before modified by reset and mode configuration.
If the field is valid, ata_acpi_init_gtm() returns pointer to the
saved _GTM structure; otherwise, NULL.
This saved value is used for _STM during resume and peek at
BIOS/firmware programmed initial timing for later use. The accessor
is there to make building w/o ACPI easy as dev->__acpi_init doesn't
exist if ACPI is not enabled.
On driver detach, the initial BIOS configuration is restored by
executing _STM with the initial _GTM values such that the next driver
can also use the initial BIOS configured values.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Add constants for DEVICE CONFIGURATION OVERLAY and SET_MAX to
include/linux/ata.h.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Make prink helpers format @lv together rather than prepending to the
format string as constant.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
* No internal function uses const ata_port. Drop const from @ap.
* Make ata_acpi_stm() copy @stm before using it and change @stm to
const.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Fix kernel-doc warning in usb.h:
Warning(linux-2.6.24-rc3-git7//include/linux/usb.h:166): No description found for parameter 'sysfs_files_created'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When a device cannot handle the smallest previously limited transfer
size (64 blocks) without stalling, limit the device to the amount of
packets that fit in a platform native page.
The lowest possible limit is PAGE_CACHE_SIZE, so if the device is ever
used on a platform that has larger than 8K pages, you lose unless you
can convince the device firmware folks to fix the issue.
Cc: Mathew Dharm <mdharm-scsi@one-eyed-alien.net>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Signed-off-by: Doug Maxey <dwm@austin.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
tipar: remove obsolete module
The tipar character driver was used to implement bit-banging access
to Texas Instruments parallel link cable. A user-land method now
exists thru PPDEV & PARPORT.
Signed-off-by: Romain Liévin <roms@lpg.ticalc.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As reported by Damien Thebault, the double POSTROUTING hook invocation
fix caused outgoing packets routed between two bridges to appear without
a link-layer header. The reason for this is that we're skipping the
br_nf_post_routing hook for routed packets now and don't save the
original link layer header, but nevertheless tries to restore it on
output, causing corruption.
The root cause for this is that skb->nf_bridge has no clearly defined
lifetime and is used to indicate all kind of things, but that is
quite complicated to fix. For now simply don't touch these packets
and handle them like packets from any other device.
Tested-by: Damien Thebault <damien.thebault@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* ide_xfer_verbose() fixups:
- beautify returned mode names
- fix PIO5 reporting
- make it return 'const char *'
* Change printk() level from KERN_DEBUG to KERN_INFO in ide_find_dma_mode().
* Add ide_id_dma_bug() helper based on ide_dma_verbose() to check for invalid
DMA info in identify block.
* Use ide_id_dma_bug() in ide_tune_dma() and ide_driveid_update().
As a result DMA won't be tuned or will be disabled after tuning if device
reports inconsistent info about enabled DMA mode (ide_dma_verbose() does the
same checks while the IDE device is probed by ide-{cd,disk} device driver).
* Remove no longer needed ide_dma_verbose().
This patch should fix the following problem with out-of-sync IDE messages
reported by Nick Warne:
hdd: ATAPI 48X DVD-ROM DVD-R-RAM CD-R/RW drive, 2048kB Cache<7>hdd:
skipping word 93 validity check
, UDMA(66)
and later debugged by Mark Lord to be caused by:
ide_dma_verbose()
printk( ... "2048kB Cache");
eighty_ninty_three()
printk(KERN_DEBUG "%s: skipping word 93 validity check\n");
ide_dma_verbose()
printk(", UDMA(66)"
Please note that as a result ide-{cd,disk} device drivers won't report the
DMA speed used but this is intended since now DMA mode being used is always
reported by IDE core code.
v2:
* fixes suggested by Randy:
- use KERN_CONT for printk()-s in ide-{cd,disk}.c
- don't remove argument name from ide_xfer_verbose() declaration
v3:
* Remove incorrect check for (id->field_valid & 1) from ide_id_dma_bug()
(spotted by Sergei).
* "XFER SLOW" -> "PIO SLOW" in ide_xfer_verbose() (suggested by Sergei).
* Fix ide_find_dma_mode() to report the correct mode ('mode' after being
limited by 'req_mode').
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Nick Warne <nick@ukfsn.org>
Cc: Mark Lord <lkml@rtr.ca>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
This field and corresponding defines are simply never used anywhere
in the code. But its mere presence is enough to confuse some host
driver authors who attempt to rely on it. Let's eliminate the
possibility for confusion and remove it entirely.
Signed-off-by: Nicolas Pitre <nico@cam.org>
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
The JMicron JMB38x chip doesn't support transfers that aren't 32-bit
aligned (both size and start address). It also doesn't like switching
between PIO and DMA mode, so it needs to be reset after each request.
Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
Add new hash for balance-xor and 802.3ad modes. Originally
submitted by "Glenn Griffin" <ggriffin.kernel@gmail.com>; modified by
Jay Vosburgh to move setting of hash policy out of line, tweak the
documentation update and add version update to 3.2.2.
Glenn's original comment follows:
Included is a patch for a new xmit_hash_policy for the bonding driver
that selects slaves based on MAC and IP information. This is a middle
ground between what currently exists in the layer2 only policy and the
layer3+4 policy. This policy strives to be fully 802.3ad compliant by
transmitting every packet of any particular flow over the same link.
As documented the layer3+4 policy is not fully compliant for extreme
cases such as ip fragmentation, so this policy is a nice compromise
for environments that require full compliance but desire more than the
layer2 only policy.
Signed-off-by: "Glenn Griffin" <ggriffin.kernel@gmail.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Convert part of the led trigger core from rw spinlocks to rw
semaphores. We're calling functions which can sleep from invalid
contexts otherwise. Fixes bug #9264.
Signed-off-by: Richard Purdie <rpurdie@rpsys.net>
Before we start committing a transaction, we call
__journal_clean_checkpoint_list() to cleanup transaction's written-back
buffers.
If this call happens to remove all of them (and there were already some
buffers), __journal_remove_checkpoint() will decide to free the transaction
because it isn't (yet) a committing transaction and soon we fail some
assertion - the transaction really isn't ready to be freed :).
We change the check in __journal_remove_checkpoint() to free only a
transaction in T_FINISHED state. The locking there is subtle though (as
everywhere in JBD ;(). We use j_list_lock to protect the check and a
subsequent call to __journal_drop_transaction() and do the same in the end
of journal_commit_transaction() which is the only place where a transaction
can get to T_FINISHED state.
Probably I'm too paranoid here and such locking is not really necessary -
checkpoint lists are processed only from log_do_checkpoint() where a
transaction must be already committed to be processed or from
__journal_clean_checkpoint_list() where kjournald itself calls it and thus
transaction cannot change state either. Better be safe if something
changes in future...
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Holmes found a bug in the -rt tree with respect to
pthread_cond_timedwait. After trying his test program on the latest git
from mainline, I found the bug was there too. The bug he was seeing
that his test program showed, was that if one were to do a "Ctrl-Z" on a
process that was in the pthread_cond_timedwait, and then did a "bg" on
that process, it would return with a "-ETIMEDOUT" but early. That is,
the timer would go off early.
Looking into this, I found the source of the problem. And it is a rather
nasty bug at that.
Here's the relevant code from kernel/futex.c: (not in order in the file)
[...]
smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
struct timespec __user *utime, u32 __user *uaddr2,
u32 val3)
{
struct timespec ts;
ktime_t t, *tp = NULL;
u32 val2 = 0;
int cmd = op & FUTEX_CMD_MASK;
if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
t = timespec_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add(ktime_get(), t);
tp = &t;
}
[...]
return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}
[...]
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
int ret;
int cmd = op & FUTEX_CMD_MASK;
struct rw_semaphore *fshared = NULL;
if (!(op & FUTEX_PRIVATE_FLAG))
fshared = ¤t->mm->mmap_sem;
switch (cmd) {
case FUTEX_WAIT:
ret = futex_wait(uaddr, fshared, val, timeout);
[...]
static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
u32 val, ktime_t *abs_time)
{
[...]
struct restart_block *restart;
restart = ¤t_thread_info()->restart_block;
restart->fn = futex_wait_restart;
restart->arg0 = (unsigned long)uaddr;
restart->arg1 = (unsigned long)val;
restart->arg2 = (unsigned long)abs_time;
restart->arg3 = 0;
if (fshared)
restart->arg3 |= ARG3_SHARED;
return -ERESTART_RESTARTBLOCK;
[...]
static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = (u32 __user *)restart->arg0;
u32 val = (u32)restart->arg1;
ktime_t *abs_time = (ktime_t *)restart->arg2;
struct rw_semaphore *fshared = NULL;
restart->fn = do_no_restart_syscall;
if (restart->arg3 & ARG3_SHARED)
fshared = ¤t->mm->mmap_sem;
return (long)futex_wait(uaddr, fshared, val, abs_time);
}
So when the futex_wait is interrupt by a signal we break out of the
hrtimer code and set up or return from signal. This code does not return
back to userspace, so we set up a RESTARTBLOCK. The bug here is that we
save the "abs_time" which is a pointer to the stack variable "ktime_t t"
from sys_futex.
This returns and unwinds the stack before we get to call our signal. On
return from the signal we go to futex_wait_restart, where we update all
the parameters for futex_wait and call it. But here we have a problem
where abs_time is no longer valid.
I verified this with print statements, and sure enough, what abs_time
was set to ends up being garbage when we get to futex_wait_restart.
The solution I did to solve this (with input from Linus Torvalds)
was to add unions to the restart_block to allow system calls to
use the restart with specific parameters. This way the futex code now
saves the time in a 64bit value in the restart block instead of storing
it on the stack.
Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
Not sure what that is there for. If this turns out to be a problem, I've
tested this with using "unsigned int" for u32 and "unsigned long long" for
u64 and it worked just the same. I'm using u32 and u64 just to be
consistent with what the futex code uses.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a field to the lro_mgr struct so that drivers can specify how much
padding is required to align layer 3 headers when a packet is copied
into a freshly allocated skb by inet_lro.c:lro_gen_skb(). Without
padding, skbs generated by LRO will cause alignment warnings on
architectures which require strict alignment (seen on sparc64).
Myri10GE is updated to use this field.
Signed-off-by: Andrew Gallatin <gallatin@myri.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
non-null hint address less than mmap_min_addr the mapping will fail the
security checks. Since this is just a hint address this patch will round
such a hint address above mmap_min_addr.
gcj was found to try to be very frugal with vm usage and give hint addresses
in the 8k-32k range. Without this patch all such programs failed and with
the patch they happily get a higher address.
This patch is wrappad in CONFIG_SECURITY since mmap_min_addr doesn't exist
without it and there would be no security check possible no matter what. So
we should not bother compiling in this rounding if it is just a waste of
time.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Lately I've got this nice badness on mdio bus removal:
Device 'e0103120:06' does not have a release() function, it is broken and must be fixed.
------------[ cut here ]------------
Badness at drivers/base/core.c:107
NIP: c015c1a8 LR: c015c1a8 CTR: c0157488
REGS: c34bdcf0 TRAP: 0700 Not tainted (2.6.23-rc5-g9ebadfbb-dirty)
MSR: 00029032 <EE,ME,IR,DR> CR: 24088422 XER: 00000000
...
[c34bdda0] [c015c1a8] device_release+0x78/0x80 (unreliable)
[c34bddb0] [c01354cc] kobject_cleanup+0x80/0xbc
[c34bddd0] [c01365f0] kref_put+0x54/0x6c
[c34bdde0] [c013543c] kobject_put+0x24/0x34
[c34bddf0] [c015c384] put_device+0x1c/0x2c
[c34bde00] [c0180e84] mdiobus_unregister+0x2c/0x58
...
Though actually there is nothing broken, it just device
subsystem core expects another "pattern" of resource managment.
This patch implement phy device's release function, thus
we're getting rid of this badness.
Also small hidden bug fixed, hope none other introduced. ;-)
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Commit cfb5285660 removed a useful feature for
us, which provided a cpu accounting resource controller. This feature would be
useful if someone wants to group tasks only for accounting purpose and doesnt
really want to exercise any control over their cpu consumption.
The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df6406), with
these differences:
- Removed load average information. I felt it needs more thought (esp
to deal with SMP and virtualized platforms) and can be added for
2.6.25 after more discussions.
- Convert group cpu usage to be nanosecond accurate (as rest of the cfs
stats are) and invoke cpuacct_charge() from the respective scheduler
classes
- Make accounting scalable on SMP systems by splitting the usage
counter to be per-cpu
- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
code is not big enough to warrant a new file and also this rightly
needs to live inside the scheduler. Also things like accessing
rq->lock while reading cpu usage becomes easier if the code lived in
kernel/sched.c)
The patch also modifies the cpu controller not to provide the same accounting
information.
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
some simple tests like cpuspin (spin on the cpu), ran several tasks in
the same group and timed them. Compared their time stamps with
cpuacct.usage.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow phylib specification of cases where hardware needs to configure
PHYs for Internal Delay only on either RX or TX (not both).
Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
Tested-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Well I clearly goofed when I added the initial network namespace support
for /proc/net. Currently things work but there are odd details visible to
user space, even when we have a single network namespace.
Since we do not cache proc_dir_entry dentries at the moment we can just
modify ->lookup to return a different directory inode depending on the
network namespace of the process looking at /proc/net, replacing the
current technique of using a magic and fragile follow_link method.
To accomplish that this patch:
- introduces a shadow_proc method to allow different dentries to
be returned from proc_lookup.
- Removes the old /proc/net follow_link magic
- Fixes a weakness in our not caching of proc generic dentries.
As shadow_proc uses a task struct to decided which dentry to return we can
go back later and fix the proc generic caching without modifying any code
that uses the shadow_proc method.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
RTC code is using mutex to assure exclusive access to /dev/rtc. This is
however wrong usage, as it leaves the mutex locked when returning into
userspace, which is unacceptable.
Convert rtc->char_lock into bit operation.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some open flags (O_APPEND, O_DIRECT) can be changed with fcntl(F_SETFL, ...)
after open, but fuse currently only sends the flags to userspace in open.
To make it possible to correcly handle changing flags, send the
current value to userspace in each read and write.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds Graphics Output Protocol support to the kernel. UEFI2.0 spec
deprecates Universal Graphics Adapter (UGA) protocol and only Graphics Output
Protocol (GOP) is produced. Therefore, the boot loader needs to query the
UEFI firmware with appropriate Output Protocol and pass the video information
to the kernel. As a result of GOP protocol, an EFI framebuffer driver is
needed for displaying console messages. The patch adds a EFI framebuffer
driver. The EFI frame buffer driver in this patch is based on the Intel Mac
framebuffer driver.
The ELILO bootloader takes care of passing the video information as
appropriate for EFI firmware.
The framebuffer driver has been tested in i386 kernel and x86_64 kernel on EFI
platform.
Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit a686cd898bd999fd026a51e90fb0a3410d258ddb:
"Val's cross-port of the ext3 reservations code into ext2."
include/linux/ext2_fs.h got a new function whose return value is only
defined if __KERNEL__ is defined. Putting #ifdef __KERNEL__ around the
function seems to help, patch below.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>