NFSv4 requires us to ensure that we break the TCP connection before we're
allowed to retransmit a request. However in the case where we're
retransmitting several requests that have been sent on the same
connection, we need to ensure that we don't interfere with the attempt to
reconnect and/or break the connection again once it has been established.
We therefore introduce a 'connection' cookie that is bumped every time a
connection is broken. This allows requests to track if they need to force a
disconnection.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The condition for exiting from the loop in xs_tcp_send_request() should be
that we find we're not making progress (i.e. number of bytes sent is 0).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to try to ensure that we always use the same credentials whenever
we re-establish the clientid on the server. If not, the server won't
recognise that we're the same client, and so may not allow us to recover
state.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
With the recent change to generic creds, we can no longer use
cred->cr_ops->cr_name to distinguish between RPCSEC_GSS principals and
AUTH_SYS/AUTH_NULL identities. Replace it with the rpc_authops->au_name
instead...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We want to ensure that req->rq_private_buf.len is updated before
req->rq_received, so that call_decode() doesn't use an old value for
req->rq_rcv_buf.len.
In 'call_decode()' itself, instead of using task->tk_status (which is set
using req->rq_received) must use the actual value of
req->rq_private_buf.len when deciding whether or not the received RPC reply
is too short.
Finally ensure that we set req->rq_rcv_buf.len to zero when retrying a
request. A typo meant that we were resetting req->rq_private_buf.len in
call_decode(), and then clobbering that value with the old rq_rcv_buf.len
again in xprt_transmit().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
..and always destroy using a 'soft' RPC call. Destroying GSS credentials
isn't mandatory; the server can always cope with a few credentials not
getting destroyed in a timely fashion.
This actually fixes a hang situation. Basically, some servers will decide
that the client is crazy if it tries to destroy an RPC context for which
they have sent an RPCSEC_GSS_CREDPROBLEM, and so will refuse to talk to it
for a while.
The regression therefor probably was introduced by commit
0df7fb74fb.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The rest of the networking layer uses SOCK_ASYNC_NOSPACE to signal whether
or not we have someone waiting for buffer memory. Convert the SUNRPC layer
to use the same idiom.
Remove the unlikely()s in xs_udp_write_space and xs_tcp_write_space. In
fact, the most common case will be that there is nobody waiting for buffer
space.
SOCK_NOSPACE is there to tell the TCP layer whether or not the cwnd was
limited by the application window. Ensure that we follow the same idiom as
the rest of the networking layer here too.
Finally, ensure that we clear SOCK_ASYNC_NOSPACE once we wake up, so that
write_space() doesn't keep waking things up on xprt->pending.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
call_verify() can, under certain circumstances, free the RPC slot. In that
case, our cached pointer 'req = task->tk_rqstp' is invalid. Bug was
introduced in commit 220bcc2afd (SUNRPC:
Don't call xprt_release in call refresh).
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Use new node_to_cpumask_ptr. This creates a pointer to the
cpumask for a given node. This definition is in mm patch:
asm-generic-add-node_to_cpumask_ptr-macro.patch
* Use new set_cpus_allowed_ptr function.
Depends on:
[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
[sched-devel]: sched: add new set_cpus_allowed_ptr function
[x86/latest]: x86: add cpus_scnprintf function
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Greg Banks <gnb@melbourne.sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The values chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 510deb0d was supposed to move the xprt_create_transport() call in
rpc_create(), but neglected to remove the old call site. This resulted in
a transport leak after every rpc_create() call.
This leak is present in 2.6.24 and 2.6.25.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a problem in _copy_to_pages(), whereby it may call flush_dcache_page()
with an invalid pointer due to the fact that 'pgto' gets incremented
beyond the end of the page array. Fix is to exit the loop without this
unnecessary increment of pgto.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The RDMACTXT_F_LAST_CTXT bit was getting set incorrectly
when the last chunk in the read-list spanned multiple pages. This
resulted in a kernel panic when the wrong context was used to
build the RPC iovec page list.
RDMA_READ is used to fetch RPC data from the client for
NFS_WRITE requests. A scatter-gather is used to map the
advertised client side buffer to the server-side iovec and
associated page list.
WR contexts are used to convey which scatter-gather entries are
handled by each WR. When the write data is large, a single RPC may
require multiple RDMA_READ requests so the contexts for a single RPC
are chained together in a linked list. The last context in this list
is marked with a bit RDMACTXT_F_LAST_CTXT so that when this WR completes,
the CQ handler code can enqueue the RPC for processing.
The code in rdma_read_xdr was setting this bit on the last two
contexts on this list when the last read-list chunk spanned multiple
pages. This caused the svc_rdma_recvfrom logic to incorrectly build
the RPC and caused the kernel to crash because the second-to-last
context doesn't contain the iovec page list.
Modified the condition that sets this bit so that it correctly detects
the last context for the RPC.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Tested-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The iWARP protocol limits RDMA read requests to a single scatter
entry. NFS/RDMA has code in rdma_read_max_sge() that is supposed to
limit the sge_count for RDMA read requests to 1, but the code to do
that is inside an #ifdef RDMA_TRANSPORT_IWARP block. In the mainline
kernel at least, RDMA_TRANSPORT_IWARP is an enum and not a
preprocessor #define, so the #ifdef'ed code is never compiled.
In my test of a kernel build with -j8 on an NFS/RDMA mount, this
problem eventually leads to trouble starting with:
svcrdma: Error posting send = -22
svcrdma : RDMA_READ error = -22
and things go downhill from there.
The trivial fix is to delete the #ifdef guard. The check seems to be
a remnant of when the NFS/RDMA code was not merged and needed to
compile against multiple kernel versions, although I don't think it
ever worked as intended. In any case now that the code is upstream
there's no need to test whether the RDMA_TRANSPORT_IWARP constant is
defined or not.
Without this patch, my kernel build on an NFS/RDMA mount using NetEffect
adapters quickly and 100% reproducibly failed with an error like:
ld: final link failed: Software caused connection abort
With the patch applied I was able to complete a kernel build on the
same setup.
(Tom Tucker says this is "actually an _ancient_ remnant when it had to
compile against iWARP vs. non-iWARP enabled OFA trees.")
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The NFSv4 protocol allows clients to negotiate security protocols on the
fly in the case where an administrator on the server changes the export
settings and/or in the case where we may have a filesystem migration event.
Instead of having the NFS client code cache credentials that are tied to a
particular AUTH method it is therefore preferable to have a generic credential
that can be converted into whatever AUTH is in use by the RPC client when
the read/write/sillyrename/... is put on the wire.
We do this by means of the new "generic" credential, which basically just
caches the minimal information that is needed to look up an RPCSEC_GSS,
AUTH_SYS, or AUTH_NULL credential.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need the ability to treat 'generic' creds specially, since they want to
bind instances of the auth cred instead of binding themselves.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add an rpc credential that is not tied to any particular auth mechanism,
but that can be cached by NFS, and later used to look up a cred for
whichever auth mechanism that turns out to be valid when the RPC call is
being made.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The current RPCAUTH_LOOKUP_ROOTCREDS flag only works for AUTH_SYS
authentication, and then only as a special case in the code. This patch
removes the auth_sys special casing, and replaces it with generic code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The assertion that checks for sge context overflow is
incorrectly hard-coded to 32. This causes a kernel bug
check when using big-data mounts. Changed the BUG_ON to
use the computed value RPCSVC_MAXPAGES.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RDMA connection shutdown on an SMP machine can cause a kernel crash due
to the transport close path racing with the I/O tasklet.
Additional transport references were added as follows:
- A reference when on the DTO Q to avoid having the transport
deleted while queued for I/O.
- A reference while there is a QP able to generate events.
- A reference until the DISCONNECTED event is received on the CM ID
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prevent an RPC oops when freeing a dynamically allocated RDMA
buffer, used in certain special-case large metadata operations.
Signed-off-by: Tom Talpey <tmt@netapp.com>
Signed-off-by: James Lentini <jlentini@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we've tightened up the locking rules for RPC queue wakeups, we can
remove the RCU-safe kfree calls...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is designed to replace the timeout timer in the individual rpc_tasks.
By putting the timer function in the wait queue, we will eventually be able
to reduce the total number of timers in use by the RPC subsystem.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Use proc_create() to make sure that ->proc_fops be setup before gluing
PDE to main tree.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An audit of the current RPC timeout functions shows that they don't really
ever need to run in the softirq context. As long as the softirq is
able to signal that the wakeup is due to a timeout (which it can do by
setting task->tk_status to -ETIMEDOUT) then the callback functions can just
run as standard task->tk_callback functions (in the rpciod/process
context).
The only possible border-line case would be xprt_timer() for the case of
UDP, when the callback is used to reduce the size of the transport
congestion window. In testing, however, the effect of moving that update
to a callback would appear to be minor.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In all cases where we currently use rpc_wake_up_task(), we almost always
know on which waitqueue the rpc_task is actually sleeping. This will allows
us to simplify the queue locking in a future patch.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All RPC timeout callback functions are expected to wake the task up. We can
enforce this by moving the wakeup back into rpc_run_timer.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
A lot of the work done by the rpc_release() callback is inappropriate for
rpciod as it will often involve things like starting a new rpc call in
order to clean up state after an interrupted NFSv4 open() call, or
calls to mntput(), etc.
This patch allows the caller of rpc_run_task() to specify that the
rpc_release callback should run on a different workqueue than the default
rpciod_workqueue.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Sorry for the noise, but here's the v3 of this compilation fix :)
There are some places, which declare the char buf[...] on the stack
to push it later into dprintk(). Since the dprintk sometimes (if the
CONFIG_SYSCTL=n) becomes an empty do { } while (0) stub, these buffers
cause gcc to produce appropriate warnings.
Wrap these buffers with RPC_IFDEBUG macro, as Trond proposed, to
compile them out when not needed.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use updated file list for docbook files and
fix kernel-doc warnings in sunrpc:
Warning(linux-2.6.24-git12//net/sunrpc/rpc_pipe.c:689): No description found for parameter 'rpc_client'
Warning(linux-2.6.24-git12//net/sunrpc/rpc_pipe.c:765): No description found for parameter 'flags'
Warning(linux-2.6.24-git12//net/sunrpc/clnt.c:584): No description found for parameter 'tk_ops'
Warning(linux-2.6.24-git12//net/sunrpc/clnt.c:618): No description found for parameter 'bufsize'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
net/sunrpc/xprtrdma/svc_rdma_sendto.c:160: warning: format '%llx'
expects type 'long long unsigned int', but argument 3 has type 'u64'
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: When looping over RPC version and procedure numbers, use
unsigned index variables.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Move the initialzation in __svc_create_thread that happens prior to
thread creation to a new function. Export the function to allow
services to have better control over the svc_rqst structs.
Also rearrange the rqstp initialization to prevent NULL pointer
dereferences in svc_exit_thread in case allocations fail.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Add the svcrdma module to the xprtrdma makefile.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>