This file documents the specifics of the RDS sockets API, as well as covering some of the details of its internal implementation.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
parent
55b7ed0b58
commit
0c5f9b8830
Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not InfiniBand-specific; it was designed to support different
transports. The current implementation used to support RDS over TCP as well
as IB. Work is in progress to support RDS over iWARP, and using DCE to
guarantee no dropped packets on Ethernet, it may be possible to use RDS over
UDP in the future.

The high-level semantics of RDS from the application's point of view are:

 *      Addressing
        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that involve
        passing addresses between kernel and user space generally
        use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

 *      Socket interface
        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the getsockopt/
        setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

 *      sysctls
        RDS supports a number of sysctls in /proc/sys/net/rds.


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        These constants haven't been assigned yet, because RDS isn't in
        mainline yet. Currently, the kernel module assigns some constant
        and publishes it to user space through two sysctl files:
                /proc/sys/net/rds/pf_rds
                /proc/sys/net/rds/sol_rds

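Since the constants are published as sysctl files, an application has to
read them at run time before it can call socket(). The following is a
minimal sketch of that step; rds_read_constant() is a hypothetical helper
name, and the sketch assumes each file holds a single decimal integer.

```c
#include <stdio.h>

/*
 * Read one integer constant from a sysctl-style file such as
 * /proc/sys/net/rds/pf_rds. Returns 0 on success, -1 on failure.
 * Hypothetical helper; assumes the file holds one decimal integer.
 */
int rds_read_constant(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would read pf_rds and sol_rds once at startup and pass the
results to socket() and setsockopt() instead of compile-time constants.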
  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDBUF bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVBUF option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.

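The buffer limits are set with the ordinary SOL_SOCKET options SO_SNDBUF
and SO_RCVBUF (the send and receive buffer size options described above).
A minimal sketch follows; the helper names are hypothetical, and the code
works on any socket type, so it makes no claim about how RDS rounds or
clamps the values.

```c
#include <sys/socket.h>

/* Hypothetical helper: set both queue limits. Returns 0, or -1 with errno set. */
int rds_set_bufs(int fd, int sndbuf, int rcvbuf)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}

/* Read back the effective SO_SNDBUF or SO_RCVBUF value, or -1 on error. */
int rds_get_buf(int fd, int optname)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, optname, &val, &len) < 0)
		return -1;
	return val;
}
```

Reading the value back is worthwhile because the kernel may adjust the
requested size, and the effective value is what the queue limits use.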
  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport.

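Creating and binding a socket can be sketched as below. The function
names are hypothetical, and pf_rds is passed in by the caller because
this document does not fix its value. On a machine without the RDS
module loaded, socket() simply fails and the function returns -1.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a sockaddr_in from dotted-quad text and a host-order port. */
int rds_make_addr(const char *ip, unsigned short port, struct sockaddr_in *sin)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_port = htons(port);
	return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}

/*
 * Create an RDS socket and bind it to ip:port. pf_rds is the protocol
 * family value published by the module. Returns the fd, or -1.
 */
int rds_socket_bound(int pf_rds, const char *ip, unsigned short port)
{
	struct sockaddr_in sin;
	int fd;

	if (rds_make_addr(ip, port, &sin) < 0)
		return -1;
	fd = socket(pf_rds, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		int saved = errno;	/* close() may overwrite errno */

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```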
  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDBUF will
        return with -EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDBUF threshold will return
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return ENOBUFS.

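The three sendmsg error cases call for different reactions from the
application. The sketch below maps them to actions; the enum and the
function name are hypothetical and only illustrate the decision, they
are not part of the RDS API.

```c
#include <errno.h>

enum send_action {
	SEND_OK,		/* message was queued */
	SEND_TOO_BIG,		/* EMSGSIZE: can never fit, shrink the message */
	SEND_QUEUE_FULL,	/* EAGAIN: send queue is full, retry later */
	SEND_CONGESTED,		/* ENOBUFS: destination port is congested */
	SEND_FAILED		/* any other error */
};

/* Classify the result of sendmsg(): ret is its return value, err is errno. */
enum send_action classify_send(long ret, int err)
{
	if (ret >= 0)
		return SEND_OK;
	switch (err) {
	case EMSGSIZE:
		return SEND_TOO_BIG;
	case EAGAIN:
		return SEND_QUEUE_FULL;
	case ENOBUFS:
		return SEND_CONGESTED;
	default:
		return SEND_FAILED;
	}
}
```

A caller would invoke classify_send(ret, errno) right after sendmsg()
and retry only in the SEND_QUEUE_FULL and SEND_CONGESTED cases.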
  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        recv queue accounting is adjusted, and if the queue length
        drops below SO_RCVBUF, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrived, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in manpages.

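Notifications arrive as ordinary control messages, so they can be walked
with the standard CMSG_* macros after recvmsg() returns. The sketch below
only counts the control messages of one level; count_cmsgs() is a
hypothetical helper, and the level to pass (the SOL_RDS value) and the
individual message types are left to the manpages.

```c
#include <stddef.h>
#include <sys/socket.h>

/*
 * Count the control messages of the given level in a msghdr that
 * recvmsg() has filled in. Hypothetical helper for illustration.
 */
int count_cmsgs(struct msghdr *msg, int level)
{
	struct cmsghdr *c;
	int n = 0;

	for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c))
		if (c->cmsg_level == level)
			n++;
	return n;
}
```

A real receiver would switch on c->cmsg_type inside the loop and read
the payload through CMSG_DATA(c).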
  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        this - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.

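One way to avoid the busy loop described above is to choose the poll
event from the last sendmsg error: wait for POLLOUT after EAGAIN, but
wait for POLLIN (a pending notification) after ENOBUFS. This is a sketch
under that assumption; both function names are hypothetical.

```c
#include <errno.h>
#include <poll.h>

/* Pick the event to wait for after sendmsg() failed with errno err. */
short next_poll_events(int err)
{
	if (err == EAGAIN)
		return POLLOUT;	/* send queue full: wait for room */
	if (err == ENOBUFS)
		return POLLIN;	/* congested: wait for a notification */
	return 0;		/* not retryable by polling */
}

/*
 * Wait up to timeout_ms for events on fd. Returns the revents mask,
 * 0 on timeout, or -1 on error.
 */
int wait_events(int fd, short events, int timeout_ms)
{
	struct pollfd p = { .fd = fd, .events = events, .revents = 0 };
	int n = poll(&p, 1, timeout_ms);

	if (n < 0)
		return -1;
	return n == 0 ? 0 : p.revents;
}
```

After wait_events() reports POLLIN in the congested case, the
application would call recvmsg() to consume the notification before
retrying the send.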
  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        This allows the application to cancel outstanding messages if
        it detects a timeout. For instance, if it tried to send a message,
        and the remote host is unreachable, RDS will keep trying forever.
        The application may decide it's not worth it, and cancel the
        operation. In this case, it would use RDS_CANCEL_SENT_TO to
        nuke any pending messages.


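A cancel call can be sketched as follows. The helper name is
hypothetical, and both the SOL_RDS level and the RDS_CANCEL_SENT_TO
option number are taken as parameters, since this document does not
assign their values.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Discard everything queued on fd for ip:port. sol_rds and opt_cancel
 * are the caller-supplied values of SOL_RDS and RDS_CANCEL_SENT_TO.
 * Returns 0 on success, -1 on failure.
 */
int rds_cancel_sent_to(int fd, int sol_rds, int opt_cancel,
		       const char *ip, unsigned short port)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &sin.sin_addr) != 1)
		return -1;
	return setsockopt(fd, sol_rds, opt_cancel, &sin, sizeof(sin));
}
```

An application would typically call this from its own timeout handler,
once it has given up on the destination.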
RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):
    Fields:
      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          CONG_BITMAP - this is a congestion update bitmap
          ACK_REQUIRED - receiver must ack this packet
          RETRANSMITTED - packet has previously been sent
      h_credit:
          indicate to other end of connection that
          it has more credits available (i.e. there is
          more send room)
      h_padding[4]:
          unused, for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.

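The field list above can be pictured as a C struct. The sketch below is
illustrative only: the field widths, the size of h_exthdr and the flag
bit values are assumptions made for this example, and the struct is
deliberately given a different name. The authoritative definition is
'struct rds_header' in rds.h.

```c
#include <stdint.h>

#define SKETCH_EXT_SPACE 16	/* assumed size of h_exthdr */

enum {				/* assumed bit assignments */
	FLAG_CONG_BITMAP   = 0x01,
	FLAG_ACK_REQUIRED  = 0x02,
	FLAG_RETRANSMITTED = 0x04
};

struct rds_header_sketch {
	uint64_t h_sequence;	/* per-packet sequence number */
	uint64_t h_ack;		/* piggybacked ack of last packet received */
	uint32_t h_len;		/* payload length, header excluded */
	uint16_t h_sport;	/* source port */
	uint16_t h_dport;	/* destination port */
	uint8_t  h_flags;	/* FLAG_* bits */
	uint8_t  h_credit;	/* new send credits granted to the peer */
	uint8_t  h_padding[4];	/* unused */
	uint16_t h_csum;	/* header checksum */
	uint8_t  h_exthdr[SKETCH_EXT_SPACE];	/* optional data, e.g. RDMA */
};

/* Nonzero if the receiver must ack this packet. */
int hdr_wants_ack(const struct rds_header_sketch *h)
{
	return (h->h_flags & FLAG_ACK_REQUIRED) != 0;
}

/* Nonzero if the payload is a congestion bitmap update. */
int hdr_is_cong_update(const struct rds_header_sketch *h)
{
	return (h->h_flags & FLAG_CONG_BITMAP) != 0;
}
```

On the wire the multi-byte fields would need a fixed byte order; the
sketch ignores that and uses host-order integers.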
  ACK and retransmit handling

      One might think that with reliable IB connections you wouldn't need
      to ack messages that have been received. The problem is that IB
      hardware generates an ack message before it has DMAed the message
      into memory. This creates a potential message loss if the HCA is
      disabled for any reason between when it sends the ack and when
      the message is DMAed and processed. This is only a potential issue
      if another HCA is available for fail-over.

      Sending an ack immediately would allow the sender to free the sent
      message from their send queue quickly, but could cause excessive
      traffic to be used for acks. RDS piggybacks acks on sent data
      packets. Ack-only packets are reduced by only allowing one to be
      in flight at a time, and by the sender only asking for acks when
      its send buffers start to fill up. All retransmissions are also
      acked.

  Flow Control

      RDS's IB transport uses a credit-based mechanism to verify that
      there is space in the peer's receive buffers for more data. This
      eliminates the need for hardware retries on the connection.

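The idea behind credit-based flow control can be sketched with a simple
counter. This is not the IB transport's code: the names are
hypothetical, and the sketch assumes one credit stands for one free
receive buffer at the peer and that credits arrive in h_credit.

```c
/* Sender-side view of how many messages the peer can still accept. */
struct credit_state {
	unsigned int send_credits;
};

/* Consume one credit before sending. Returns 1 if sending is allowed. */
int credit_take(struct credit_state *s)
{
	if (s->send_credits == 0)
		return 0;	/* peer has no free buffer, hold the message */
	s->send_credits--;
	return 1;
}

/* Add the credits advertised in an incoming header's h_credit field. */
void credit_grant(struct credit_state *s, unsigned char h_credit)
{
	s->send_credits += h_credit;
}
```

Because the sender never posts more messages than it holds credits for,
the receiver always has a buffer ready, which is why hardware retries
are not needed.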
  Congestion

      Messages waiting in the receive queue on the receiving socket
      are accounted against the socket's SO_RCVBUF option value. Only
      the payload bytes in the message are accounted for. If the
      number of bytes queued equals or exceeds rcvbuf then the socket
      is congested. All sends attempted to this socket's address
      should block or return -EWOULDBLOCK.

      Applications are expected to be reasonably tuned such that this
      situation very rarely occurs. An application encountering this
      "back-pressure" is considered a bug.

      This is implemented by having each node maintain bitmaps which
      indicate which ports on bound addresses are congested. As the
      bitmap changes it is sent through all the connections which
      terminate in the local address of the bitmap which changed.

      The bitmaps are allocated as connections are brought up. This
      avoids allocation in the interrupt handling path which queues
      messages on sockets. The dense bitmaps let transports send the
      entire bitmap on any bitmap change reasonably efficiently. This
      is much easier to implement than some finer-grained
      communication of per-port congestion. The sender does a very
      inexpensive bit test to test if the port it's about to send to
      is congested or not.


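A dense per-port bitmap and the cheap bit test can be sketched as below.
This is a simplified user-space illustration with hypothetical names:
one bit per 16-bit port in a flat array, indexed by the host-order port
number. It does not reproduce the kernel's actual layout or bit order.

```c
#include <stdint.h>
#include <string.h>

#define CONG_PORTS 65536	/* one bit for every 16-bit port number */

struct cong_map {
	unsigned char bits[CONG_PORTS / 8];	/* 8 KB in total */
};

void cong_clear_all(struct cong_map *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

/* Mark a port congested (receive queue reached its limit). */
void cong_set(struct cong_map *m, uint16_t port)
{
	m->bits[port / 8] |= (unsigned char)(1u << (port % 8));
}

/* Mark a port uncongested (receive queue drained below its limit). */
void cong_clear(struct cong_map *m, uint16_t port)
{
	m->bits[port / 8] &= (unsigned char)~(1u << (port % 8));
}

/* The sender's check before queuing a message for this port. */
int cong_test(const struct cong_map *m, uint16_t port)
{
	return (m->bits[port / 8] >> (port % 8)) & 1;
}
```

The whole map is small enough to send in one piece whenever a bit
changes, which is the trade-off the paragraph above describes.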
RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other InfiniBand details.


RDS Kernel Structures
=====================

  struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.
  struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.
  struct rds_socket
    per-socket information
  struct rds_connection
    per-connection information
  struct rds_transport
    pointers to transport-specific functions
  struct rds_statistics
    non-transport-specific statistics
  struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.


The send path
=============

  rds_sendmsg()
    struct rds_message built from incoming data
    CMSGs parsed (e.g. RDMA ops)
    transport connection alloced and connected if not already
    rds_message placed on send queue
    send worker awoken
  rds_send_worker()
    calls rds_send_xmit() until queue is empty
  rds_send_xmit()
    transmits congestion map if one is pending
    may set ACK_REQUIRED
    calls transport to send either non-RDMA or RDMA message
    (RDMA ops never retransmitted)
  rds_ib_xmit()
    allocs work requests from send ring
    adds any new send credits available to peer (h_credit)
    maps the rds_message's sg list
    piggybacks ack
    populates work requests
    post send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
    looks at write completions
    unmaps recv buffer from device
    if no errors, call rds_ib_process_recv()
    refill recv ring
  rds_ib_process_recv()
    validate header checksum
    copy header to rds_ib_incoming struct if start of a new datagram
    add to ibinc's fraglist
    if completed datagram:
      update cong map if datagram was cong update
      call rds_recv_incoming() otherwise
      note if ack is required
  rds_recv_incoming()
    drop duplicate packets
    respond to pings
    find the sock associated with this datagram
    add to sock queue
    wake up sock
    do some congestion calculations
  rds_recvmsg
    copy data into user iovec
    handle CMSGs
    return to application