iw_cxgb3 has a potential problem where a CQ's comp_handler can get
called simultaneously from different places in iw_cxgb3 driver. This
does not comply with Documentation/infiniband/core_locking.txt, which
states that at a given point of time, there should be only one
callback per CQ should be active.
Such problem was reported by Parav Pandit <Parav.Pandit@Emulex.Com>
for iw_cxgb4 driver. Based on discussion between Parav Pandit and
Steve Wise, this patch fixes the above problem by serializing the
calls to a CQ's comp_handler using a spin_lock.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
As part of MPAv2 Enhanced RDMA Negotiation, pass max supported ird/ord
values upwards for the time being in iw_cxgb3 and amso1100.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This oops was reported recently:
d:mon> e
cpu 0xd: Vector: 300 (Data Access) at [c0000000fd4c7120]
pc: d00000000076f194: .t3_l2t_get+0x44/0x524 [cxgb3]
lr: d000000000b02108: .init_act_open+0x150/0x3d4 [cxgb3i]
sp: c0000000fd4c73a0
msr: 8000000000009032
dar: 0
dsisr: 40000000
current = 0xc0000000fd640d40
paca = 0xc00000000054ff80
pid = 5085, comm = iscsid
d:mon> t
[c0000000fd4c7450] d000000000b02108 .init_act_open+0x150/0x3d4 [cxgb3i]
[c0000000fd4c7500] d000000000e45378 .cxgbi_ep_connect+0x784/0x8e8 [libcxgbi]
[c0000000fd4c7650] d000000000db33f0 .iscsi_if_rx+0x71c/0xb18
[scsi_transport_iscsi2]
[c0000000fd4c7740] c000000000370c9c .netlink_data_ready+0x40/0xa4
[c0000000fd4c77c0] c00000000036f010 .netlink_sendskb+0x4c/0x9c
[c0000000fd4c7850] c000000000370c18 .netlink_sendmsg+0x358/0x39c
[c0000000fd4c7950] c00000000033be24 .sock_sendmsg+0x114/0x1b8
[c0000000fd4c7b50] c00000000033d208 .sys_sendmsg+0x218/0x2ac
[c0000000fd4c7d70] c00000000033f55c .sys_socketcall+0x228/0x27c
[c0000000fd4c7e30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000080da560cfc
The root cause was an EEH error, which sent us down the offload_close path in
the cxgb3 driver, which in turn sets cdev->l2opt to NULL, without regard for
upper layer driver (like the cxgbi drivers) which might have execution contexts
in the middle of its use. The result is the oops above, when t3_l2t_get attempts
to dereference L2DATA(cdev)->nentries in arp_hash right after the EEH error handler sets it to NULL.
The fix is to prevent the setting of the NULL pointer until after there are no
further users of it. The t3cdev->l2opt pointer is now converted to be an rcu
pointer and the L2DATA macro is now called under the protection of the
rcu_read_lock(). When the EEH error path:
t3_adapter_error->offload_close->cxgb3_offload_deactivate
Is exectured, setting of that l2opt pointer to NULL, is now gated on an rcu
quiescence point, preventing, allowing L2DATA callers to safely check for a NULL
pointer without concern that the underlying data will be freeded before the
pointer is dereferenced.
This has been tested by the reporter and shown to fix the reproted oops
[nhorman: fix up unitinialised variable reported by Dan Carpenter]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Karen Xie <kxie@chelsio.com>
Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Moves the drivers for the Chelsio chipsets into
drivers/net/ethernet/chelsio/ and the necessary Kconfig and Makefile
changes.
CC: Divy Le Ray <divy@chelsio.com>
CC: Dimitris Michailidis <dm@chelsio.com>
CC: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
These methods don't make sense for iWARP devices, so rather than
forcing them to implement stubs, just return -ENOSYS in the core if
the hardware driver doesn't set .modify_device and/or .modify_port.
Signed-off-by: Roland Dreier <roland@purestorage.com>
tx_ack() wasn't checking the endpoint state and consequently would
attempt to post the p2p 0B read on an endpoint/QP that is closing or
aborting. This causes a NULL pointer dereference crash.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes unused code found by running 'make namespacecheck';
compile tested only.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The flushing of work requests for user QPs is implemented entirely in
the user mode library. The only kernel interaction is to mark the
user QP object indicating it is in error when the QP exits RTS. When
the user QP operations are called by the application (eg: post_send,
post_recv), the QP in error bit is checked and if set, the library
flushes the QP. If, however, the application is not doing IO, but
rather just polling the CQ, it will never get flushed work requests.
This breaks some classes of applications.
This patch adds logic to mark user CQs in error when a QP that is bound
to the CQ is marked in error. The library poll code can then notice
the CQ is in error and flush all the in error QPs bound to that CQ.
Design:
- add 1 extra CQE entry to the CQ memory that will be used to indicate
in error status.
- return the desired CQ memory size that should be mapped by the library
- bump the ABI since the create_cq uverbs response changes.
- detect older libraries and reduce the mmap size accordingly.
(The ABI bump doesn't break old libraries, since they didn't check
the ABI field anyway)
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The patch below updates broken web addresses in the kernel
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Dimitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Ben Pfaff <blp@cs.stanford.edu>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Reviewed-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The HW by default has RX coalescing on. For iWARP connections, this
causes a 100ms delay in connection establishement due to the ingress
MPA Start message being stalled in HW. So explicitly turn RX
coalescing off when setting up iWARP connections.
This was causing very bad performance for NP64 gather operations using
Open MPI, due to the way it sets up connections on larger jobs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The max depth supported by T3 is 64K entries. This fixes a bug
introduced in commit 9918b28d ("RDMA/cxgb3: Increase the max CQ
depth") that causes stalls and possibly crashes in large MPI clusters.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Q_FREECNT() returns the number of spaces free. This should never be a
negative amount. Also the num_wrs is an unsigned int so it can never
be less than zero.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
remove useless union keyword in rtable, rt6_info and dn_route.
Since there is only one member in a union, the union keyword isn't useful.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new parameter to ib_register_device() so that low-level device
drivers can pass in a pointer to a callback function that will be
called for each port that is registered in sysfs. This allows
low-level device drivers to create files in
/sys/class/infiniband/<hca>/ports/<N>/
without having to poke through the internals of the RDMA sysfs handling.
There is no need for an unregister function since the kobject
reference will go to zero when ib_unregister_device() is called.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The low level cxgb3 driver can return NET_XMIT_CN and friends.
The iw_cxgb3 driver should _not_ treat these as errors.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The DMA API is preferred; no functional change.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
During a hot-plug LLD removal event or an EEH error event, iw_cxgb3
must ensure that any/all threads that might be in a cxgb3 exported
function must return from the function before iw_cxgb3 returns from
its event processing. Do this by calling synchronize_net().
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If cxgb3 calls the iw_cxgb3 t3cclient remove function due to a device
removal event, then the iwch device must be marked with CXIO_ERROR_FATAL
since the device below us is going away. Otherwise, we can get stuck in
a deadlock as RDMA ULPs try and deallocate objects (like MRs, QPs, etc).
So always mark the device with CXIO_ERROR_FATAL when removing.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Only kernel mode CQs need the SW queue memory allocated. The SW queue
for user mode CQs is allocated in userspace by libcxgb3.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
T3 hardware doorbell FIFO overflows can cause application stalls due
to lost doorbell ring events. This has been seen when running large
NP IMB alltoall MPI jobs. The T3 hardware supports an xon/xoff-type
flow control mechanism to help avoid overflowing the HW doorbell FIFO.
This patch uses these interrupts to disable RDMA QP doorbell rings
when we near an overflow condition, and then turn them back on (and
ring all the active QP doorbells) when when the doorbell FIFO empties
out. In addition if an doorbell ring is dropped by the hardware, the
code will now recover.
Design:
cxgb3:
- enable these DB interrupts
- in the interrupt handler, schedule work tasks to call the ULPs event
handlers with the new events.
- ring all the qset txqs when an overflow is detected.
iw_cxgb3:
- disable db ringing on all active qps when we get the DB_FULL event
- enable db ringing on all active qps and ring all active dbs when we get
the DB_EMPTY event
- On DB_DROP event:
- disable db rings in the event handler
- delay-schedule a work task which rings and enables the dbs on
all active qps.
- in post_send and post_recv logic, don't ring the db if it's disabled.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Failure to rearm a CQ means the cxgb3 device is wedged, but we shouldn't
kill the whole system with a BUG_ON() if this happens.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use the %pM kernel extension to display the MAC address.
The only difference in the output is that the MAC address is
shown in the usual colon-separated hex notation.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the "ignoring return value of '...', declared with attribute
warn_unused_result" compiler warning in several users of the new kfifo
API.
It removes the __must_check attribute from kfifo_in() and
kfifo_in_locked() which must not necessary performed.
Fix the allocation bug in the nozomi driver file, by moving out the
kfifo_alloc from the interrupt handler into the probe function.
Fix the kfifo_out() and kfifo_out_locked() users to handle a unexpected
end of fifo.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rename kfifo_put... into kfifo_in... to prevent miss use of old non in
kernel-tree drivers
ditto for kfifo_get... -> kfifo_out...
Improve the prototypes of kfifo_in and kfifo_out to make the kerneldoc
annotations more readable.
Add mini "howto porting to the new API" in kfifo.h
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the pointer to the spinlock out of struct kfifo. Most users in
tree do not actually use a spinlock, so the few exceptions now have to
call kfifo_{get,put}_locked, which takes an extra argument to a
spinlock.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a new generic kernel FIFO implementation.
The current kernel fifo API is not very widely used, because it has to
many constrains. Only 17 files in the current 2.6.31-rc5 used it.
FIFO's are like list's a very basic thing and a kfifo API which handles
the most use case would save a lot of development time and memory
resources.
I think this are the reasons why kfifo is not in use:
- The API is to simple, important functions are missing
- A fifo can be only allocated dynamically
- There is a requirement of a spinlock whether you need it or not
- There is no support for data records inside a fifo
So I decided to extend the kfifo in a more generic way without blowing up
the API to much. The new API has the following benefits:
- Generic usage: For kernel internal use and/or device driver.
- Provide an API for the most use case.
- Slim API: The whole API provides 25 functions.
- Linux style habit.
- DECLARE_KFIFO, DEFINE_KFIFO and INIT_KFIFO Macros
- Direct copy_to_user from the fifo and copy_from_user into the fifo.
- The kfifo itself is an in place member of the using data structure, this save an
indirection access and does not waste the kernel allocator.
- Lockless access: if only one reader and one writer is active on the fifo,
which is the common use case, no additional locking is necessary.
- Remove spinlock - give the user the freedom of choice what kind of locking to use if
one is required.
- Ability to handle records. Three type of records are supported:
- Variable length records between 0-255 bytes, with a record size
field of 1 bytes.
- Variable length records between 0-65535 bytes, with a record size
field of 2 bytes.
- Fixed size records, which no record size field.
- Preserve memory resource.
- Performance!
- Easy to use!
This patch:
Since most users want to have the kfifo as part of another object,
reorganize the code to allow including struct kfifo in another data
structure. This requires changing the kfifo_alloc and kfifo_init
prototypes so that we pass an existing kfifo pointer into them. This
patch changes the implementation and all existing users.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always set bad_wr when an immediate error is detected. Return ENOMEM
for queue full instead of EINVAL to match other drivers.
Signed-off-by: Frank Zago <fzago@systemfabricworks.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch was generated by
git grep -E -i -l '[Aa]quire' | xargs -r perl -p -i -e 's/([Aa])quire/$1cquire/'
and the cumsumed was found by checking the diff for aquire.
Signed-off-by: Uwe Kleine-Knig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
in_dev_get() can return NULL. If it does, iwch_query_port() will crash.
Handle the NULL case by mapping it to port state INIT.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
FW mismatches can cause a crash in the iw_cxgb3 event handler.
- NULL the t3cdev->ulp pointer on failures in cxio_rdev_open()
- Silently ignore events when the ulp ptr is NULL in iwch_err_handler()
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace open-coded reimplementations with printk_once().
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A close/abort while waiting for a wr_ack during connection migration
can cause a hung process in iwch_accept_cr/iwch_reject_cr.
The fix is to set rpl_error/rpl_done and wake up the waiters when we
get a close/abort while in MPA_REQ_RCVD state.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Keep ref on connection request endpoints until either accepted or
rejected so it doesn't get freed early.
- Endpoint flags now need to be set via atomic bitops because they can
be set on both the iw_cxgb3 workqueue thread and user disconnect
threads.
- Don't move out of CLOSING too early due to multiple calls to
iwch_ep_disconnect.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>