The following patch aimed to resolve an issue where secondary, tertiary,
etc. addresses added to bond interfaces could overwrite the
bond->master_ip and vlan_ip values.
commit 917fbdb32f
Author: Henrik Saavedra Persson <henrik.e.persson@ericsson.com>
Date: Wed Nov 23 23:37:15 2011 +0000
bonding: only use primary address for ARP
That patch was good because it prevented bonds using ARP monitoring from
sending frames with an invalid source IP address. Unfortunately, it
didn't always work as expected.
When using an ioctl (like ifconfig does) to set the IP address and
netmask, 2 separate ioctls are actually called to set the IP and netmask
if the mask chosen doesn't match the standard mask for that class of
address. The first ioctl did not have a mask that matched the one in
the primary address and would still cause the device address to be
overwritten. The second ioctl that was called to set the mask would
then detect as secondary and ignored, but the damage was already done.
This was not an issue when using an application that used netlink
sockets as the setting of IP and netmask came down at once. The
inconsistent behavior between those two interfaces was something that
needed to be resolved.
While I was thinking about how I wanted to resolve this, Ralf Zeidler
came with a patch that resolved this on a RHEL kernel by keeping a full
shadow of the entries in dev->ifa_list for the bonding device and vlan
devices in the bonding driver. I didn't like the duplication of the
list as I want to see the 'bonding' struct and code shrink rather than
grow, but liked the general idea.
As the Subject indicates this patch drops the master_ip and vlan_ip
elements from the 'bonding' and 'vlan_entry' structs, respectively.
This can be done because a device's address-list is now traversed to
determine the optimal source IP address for ARP requests and for checks
to see if the bonding device has a particular IP address. This code
could have all be contained inside the bonding driver, but it made more
sense to me to EXPORT and call inet_confirm_addr since it did exactly
what was needed.
I tested this and a backported patch and everything works as expected.
Ralf also helped with verification of the backported patch.
Thanks to Ralf for all his help on this.
v2: Whitespace and organizational changes based on suggestions from Jay
Vosburgh and Dave Miller.
v3: Fixup incorrect usage of rcu_read_unlock based on Dave Miller's
suggestion.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
CC: Ralf Zeidler <ralf.zeidler@nsn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It used to be an int, and it got changed to a bool parameter at least
7 years ago. It happens that NF_ACCEPT and NF_DROP are 0 and 1, so
this works, but it's unclear, and the check that it's in range is not
required.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to permanently attach the timeout policy to the conntrack,
otherwise we may apply the custom timeout policy inconsistently.
Without this patch, the following example:
nfct timeout add test inet icmp timeout 100
iptables -I PREROUTING -t raw -p icmp -s 1.1.1.1 -j CT --timeout test
Will only apply the custom timeout policy to outgoing packets from
1.1.1.1, but not to reply packets from 2.2.2.2 going to 1.1.1.1.
To fix this issue, this patch modifies the current logic to attach the
timeout policy when the first packet is seen (which is when the
conntrack entry is created). Then, we keep using the attached timeout
policy until the conntrack entry is destroyed.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
`iptables -p all' uses 0 to match all protocols, while the conntrack
subsystem uses 255. We still need `-p all' to attach the custom
timeout policies for the generic protocol tracker.
Moreover, we may use `iptables -p sctp' while the SCTP tracker is
not loaded. In that case, we have to default on the generic protocol
tracker.
Another possibility is `iptables -p ip' that should be supported
as well. This patch makes sure we validate all possible scenarios.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces nf_conntrack_l4proto_find_get() and
nf_conntrack_l4proto_put() to fix module dependencies between
timeout objects and l4-protocol conntrack modules.
Thus, we make sure that the module cannot be removed if it is
used by any of the cttimeout objects.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We call the wrong replay notify function when we use ESN replay
handling. This leads to the fact that we don't send notifications
if we use ESN. Fix this by calling the registered callbacks instead
of xfrm_replay_notify().
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xfrm_state argument is unused in this function, so remove it.
Also the name xfrm_state_check_space does not really match what this
function does. It actually checks if we have enough head and tailroom
on the skb. So we rename the function to xfrm_skb_check_space.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should be using the gfp flags the caller specified here, instead of
GFP_KERNEL. I think this might be a bugfix, depending on the value of
"sock->sk->sk_allocation" when we call rds_conn_create_outgoing() in
rds_sendmsg(). Otherwise, it's just a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function takes a GFP flags as a parameter, but they are never used.
We don't take a lock in this function so there is no reason to prefer
GFP_ATOMIC over the caller's GFP flags.
There is only one caller, cipso_v4_map_cat_rng_ntoh(), and it passes
GFP_ATOMIC as the GFP flags so this doesn't change how the code works.
It's just a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In write_partial_msg_pages(), every case now does an identical call
to kmap(page). Instead, just call it once inside the CRC-computing
block where it's needed. Move the definition of kaddr inside that
block, and make it a (char *) to ensure portable pointer arithmetic.
We still don't kunmap() it until after the sendpage() call, in case
that also ends up needing to use the mapping.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
In write_partial_msg_pages() there is a local variable used to
track the starting offset within a bio segment to use. Its name,
"page_shift" defies the Linux convention of using that name for
log-base-2(page size).
Since it's only used in the bio case rename it "bio_offset". Use it
along with the page_pos field to compute the memory offset when
computing CRC's in that function. This makes the bio case match the
others more closely.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
There's not a lot of benefit to zero_page_address, which basically
holds a mapping of the zero page through the life of the messenger
module. Even with our own mapping, the sendpage interface where
it's used may need to kmap() it again. It's almost certain to
be in low memory anyway.
So stop treating the zero page specially in write_partial_msg_pages()
and just get rid of zero_page_address entirely.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
Make ceph_tcp_sendpage() be the only place kernel_sendpage() is
used, by using this helper in write_partial_msg_pages().
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
If a message queued for send gets revoked, zeroes are sent over the
wire instead of any unsent data. This is done by constructing a
message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().
Since we are already working with a page in this case we can use
the sendpage interface instead. Create a new ceph_tcp_sendpage()
helper that sets up flags to match the way ceph_tcp_sendmsg()
does now.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
CRC's are computed for all messages between ceph entities. The CRC
computation for the data portion of message can optionally be
disabled using the "nocrc" (common) ceph option. The default is
for CRC computation for the data portion to be enabled.
Unfortunately, the code that implements this feature interprets the
feature flag wrong, meaning that by default the CRC's have *not*
been computed (or checked) for the data portion of messages unless
the "nocrc" option was supplied.
Fix this, in write_partial_msg_pages() and read_partial_message().
Also change the flag variable in write_partial_msg_pages() to be
"no_datacrc" to match the usage elsewhere in the file.
This fixes http://tracker.newdream.net/issues/2064
Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
Nothing too big here.
- define the size of the buffer used for consuming ignored
incoming data using a symbolic constant
- simplify the condition determining whether to unmap the page
in write_partial_msg_pages(): do it for crc but not if the
page is the zero page
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Make a small change in the code that counts down kvecs consumed by
a ceph_tcp_sendmsg() call. Same functionality, just blocked out
a little differently.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Move blocks of code out of loops in read_partial_message_section()
and read_partial_message(). They were only was getting called at
the end of the last iteration of the loop anyway.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Calculate CRC in a separate step from rearranging the byte order
of the result, to improve clarity and readability.
Use offsetof() to determine the number of bytes to include in the
CRC calculation.
In read_partial_message(), switch which value gets byte-swapped,
since the just-computed CRC is already likely to be in a register.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Change the name (and type) of a few CRC-related Boolean local
variables so they contain the word "do", to distingish their purpose
from variables used for holding an actual CRC value.
Note that in the process of doing this I identified a fairly serious
logic error in write_partial_msg_pages(): the value of "do_crc"
assigned appears to be the opposite of what it should be. No
attempt to fix this is made here; this change preserves the
erroneous behavior. The problem I found is documented here:
http://tracker.newdream.net/issues/2064
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Many ceph-related Boolean options offer the ability to both enable
and disable a feature. For all those that don't offer this, add
a new option so that they do.
Note that ceph_show_options()--which reports mount options currently
in effect--only reports the option if it is different from the
default value.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This gathers a number of very minor changes:
- use %hu when formatting the a socket address's address family
- null out the ceph_msgr_wq pointer after the queue has been
destroyed
- drop a needless cast in ceph_write_space()
- add a WARN() call in ceph_state_change() in the event an
unrecognized socket state is encountered
- rearrange the logic in ceph_con_get() and ceph_con_put() so
that:
- the reference counts are only atomically read once
- the values displayed via dout() calls are known to
be meaningful at the time they are formatted
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller. Instead, have it return an error code,
which tidies things up a bit.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Define a helper function to perform various cleanup operations. Use
it both in the exit routine and in the init routine in the event of
an error.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The messenger workqueue has no need to be public. So give it static
scope.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array. Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.
Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.
This was split out from the next patch to highlight this change in
logic.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful. With this interface is not evident at call sites that
the pointer is always initialized. Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
This fixes some spots where a type cast to (void *) was used as
as a universal type hiding mechanism. Instead, properly cast the
type to the intended target type.
Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
This eliminates type casts in some places where they are not
required.
Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
A spinlock is used to protect a value used for selecting an array
index for a string used for formatting a socket address for human
consumption. The index is reset to 0 if it ever reaches the maximum
index value.
Instead, use an ever-increasing atomic variable as a sequence
number, and compute the array index by masking off all but the
sequence number's lowest bits. Make the number of entries in the
array a power of two to allow the use of such a mask (to avoid jumps
in the index value when the sequence number wraps).
The length of these strings is somewhat arbitrarily set at 60 bytes.
The worst-case length of a string produced is 54 bytes, for an IPv6
address that can't be shortened, e.g.:
[1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
Change it so we arbitrarily use 64 bytes instead; if nothing else
it will make the array of these line up better in hex dumps.
Rename a few things to reinforce the distinction between the number
of strings in the array and the length of individual strings.
Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
re-testing a value with consecutive "if" statements. Don't record a
connection's socket pointer unless the connect operation is
successful.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Each messenger allocates a page to be used when writing zeroes
out in the event of error or other abnormal condition. Instead,
use the kernel ZERO_PAGE() for that purpose.
Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The existing overflow check (n > ULONG_MAX / b) didn't work, because
n = ULONG_MAX / b would both bypass the check and still overflow the
allocation size a + n * b.
The correct check should be (n > (ULONG_MAX - a) / b).
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
The Ceph messenger would sometimes queue multiple work items to write
data to a socket when the socket buffer was full.
Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the
same way that net/core/stream.c:sk_stream_write_space() does, i.e.,
clearing it only when sufficient space is available in the socket buffer.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@dreamhost.com>
This fixes the following linking error:
xt_LOG.c:(.text+0x789b1): undefined reference to `ip6t_ext_hdr'
ifdefs have to use CONFIG_IP6_NF_IPTABLES instead of CONFIG_IPV6.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When L2TP is configured as a module, requests for L2TP sockets do not result
in the l2tp_ppp module being loaded. Fix this by adding the appropriate
MODULE_ALIAS to be recognized by pppox's request_module() call.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
napi->skb is allocated in napi_get_frags() using
netdev_alloc_skb_ip_align(), with a reserve of NET_SKB_PAD +
NET_IP_ALIGN bytes.
However, when such skb is recycled in napi_reuse_skb(), it ends with a
reserve of NET_IP_ALIGN which is suboptimal.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xprtrdma FRMR mapping logic assumes that a segment is <= PAGE_SIZE.
This is not true for NFS4.
Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The client side RDMA transport will bug check if it receives a duplicate
reply, instead we should simply drop the duplicate reply.
Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Stephen Rothwell reports:
net/sunrpc/rpcb_clnt.c: In function 'rpcb_enc_mapping':
net/sunrpc/rpcb_clnt.c:820:19: warning: unused variable 'task' [-Wunused-variable]
net/sunrpc/rpcb_clnt.c: In function 'rpcb_dec_getport':
net/sunrpc/rpcb_clnt.c:837:19: warning: unused variable 'task' [-Wunused-variable]
net/sunrpc/rpcb_clnt.c: In function 'rpcb_dec_set':
net/sunrpc/rpcb_clnt.c:860:19: warning: unused variable 'task' [-Wunused-variable]
net/sunrpc/rpcb_clnt.c: In function 'rpcb_enc_getaddr':
net/sunrpc/rpcb_clnt.c:892:19: warning: unused variable 'task' [-Wunused-variable]
net/sunrpc/rpcb_clnt.c: In function 'rpcb_dec_getaddr':
net/sunrpc/rpcb_clnt.c:914:19: warning: unused variable 'task' [-Wunused-variable]
fs/lockd/svclock.c:49:20: warning: 'nlmdbg_cookie2a' declared 'static' but never defined [-Wunused-function]
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
While testing L2TP functionality, I came across a bug in getsockname(). The
IP address returned within the pppol2tp_addr's addr memember was not being
set to the IP address in use. This bug is caused by using inet_sk() on the
wrong socket (the L2TP socket rather than the underlying UDP socket), and was
likely introduced during the addition of L2TPv3 support.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
no socket layer outputs a message for this error and neither should rds.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to turn on/off the dprintk() debugging interfaces for
those distributions that don't ship the 'rpcdebug' utility.
It also allows us to add Kbuild dependencies. Specifically, we already
know that dprintk() in general relies on CONFIG_SYSCTL. Now it turns out
that the NFS dprintks depend on CONFIG_CRC32 after we added support
for the filehandle hash.
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>