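In the previous post I mentioned running into a strange problem; this post takes the time to dig into it.
1. vm(eth0) ping vr(eth0): works
2. vm(eth1) ping vr(eth1): works
3. vm(eth0) ping vr(eth1): 100% packet loss, yet a capture on vm(eth0) shows that vr(eth1) did send the reply, and iptables logged the packet in the mangle PREROUTING chain. The packet seems to be lost at the routing decision, for reasons not yet known.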
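The topology is as follows:
The vrouter has two interfaces, eth0 (177.0.0.1/8) and eth1 (178.0.0.1/8)
The VM (trex-cxg) has two interfaces, eth0 (ssh) and eth4 (178.0.0.10)
The problem
vm(eth4) ping vr(eth1): works
vm(eth4) ping vr(eth0): fails, although the reply packets from vr(eth0) can be captured.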
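Why does the ping fail?
How do we confirm that the reply from vr(eth0) is actually received?
Running tcpdump on that specific interface captured the packet.
Could the packet be dropped before tcpdump ever sees it?
A drop before tcpdump is possible in general, but it is outside the scope of this discussion.
Once tcpdump has seen the packet arrive at the NIC, where else could it be dropped?
After the tcpdump tap, the packet enters the protocol stack, where the netfilter hooks used by iptables are registered. By adding iptables LOG rules at those hooks we can observe which points the packet passes through.
iptables has 3 paths, and rules act on 4 tables (raw, mangle, nat, filter):
Delivered to a local process (PREROUTING -> INPUT)
Forwarded (PREROUTING -> FORWARD -> POSTROUTING)
Sent by a local process (OUTPUT -> POSTROUTING)
Our case is the first path, delivery to a local process, but to make observation easier we log at every hook the packet could pass through.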
iptables --table raw -A PREROUTING -i eth4 -j LOG --log-prefix "iptables raw PREROUTING SEEN"
iptables --table mangle -A PREROUTING -i eth4 -j LOG --log-prefix "iptables mangle PREROUTING SEEN"
iptables --table nat -A PREROUTING -i eth4 -j LOG --log-prefix "iptables nat PREROUTING SEEN"
iptables --table filter -A INPUT -i eth4 -j LOG --log-prefix "iptables filter INPUT SEEN"
iptables --table filter -A OUTPUT -o eth4 -j LOG --log-prefix "iptables filter OUTPUT SEEN"
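The LOG rules are easy to forget once the debugging is done. As a rough sketch, the same five rules can be generated by a small helper, so that they can be added and later removed symmetrically. The function name `log_rules` and the shortened prefixes are made up for this example (the LOG target keeps at most 29 characters of the prefix, which is why a longer prefix shows up truncated in the kernel log). Note that `-D` only removes rules whose specification matches exactly, so it cleans up rules added by this helper, not the ones typed above.

```shell
#!/bin/sh
# Print (not run) the five LOG rules for one interface.
# $1 is the iptables action: -A to append the rules, -D to delete them again.
# $2 is the interface name, e.g. eth4.
log_rules() {
    action=$1
    ifname=$2
    # Inbound hooks match on the input interface (-i).
    for table in raw mangle nat; do
        printf 'iptables --table %s %s PREROUTING -i %s -j LOG --log-prefix "%s PREROUTING "\n' \
            "$table" "$action" "$ifname" "$table"
    done
    printf 'iptables --table filter %s INPUT -i %s -j LOG --log-prefix "filter INPUT "\n' \
        "$action" "$ifname"
    # The OUTPUT chain has no input interface, so it matches on -o instead.
    printf 'iptables --table filter %s OUTPUT -o %s -j LOG --log-prefix "filter OUTPUT "\n' \
        "$action" "$ifname"
}
```

Nothing is executed by the function itself; as root, `log_rules -A eth4 | sh` would apply the rules and `log_rules -D eth4 | sh` would remove them again.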
Follow the kernel log in real time with dmesg -e -w:
[ +0.999734] IN= OUT=eth4 SRC=178.0.0.10 DST=177.0.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=54941 DF PROTO=ICMP TYPE=8 CODE=0 ID=7974 SEQ=2
[ +0.000204] iptables raw PREROUTING SEENIN=eth4 OUT= MAC=fe:fc:fe:b9:97:c7:fe:fd:fe:43:61:cc:08:00 SRC=177.0.0.1 DST=178.0.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=54941 DF PROTO=ICMP TYPE=0 CODE=0 ID=7974 SEQ=2
[ +0.000008] iptables mangle PREROUTING SEIN=eth4 OUT= MAC=fe:fc:fe:b9:97:c7:fe:fd:fe:43:61:cc:08:00 SRC=177.0.0.1 DST=178.0.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=54941 DF PROTO=ICMP TYPE=0 CODE=0 ID=7974 SEQ=2
[ +0.999784] IN= OUT=eth4 SRC=178.0.0.10 DST=177.0.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=55830 DF PROTO=ICMP TYPE=8 CODE=0 ID=7974 SEQ=3
[ +0.000192] iptables raw PREROUTING SEENIN=eth4 OUT= MAC=fe:fc:fe:b9:97:c7:fe:fd:fe:43:61:cc:08:00 SRC=177.0.0.1 DST=178.0.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=55830 DF PROTO=ICMP TYPE=0 CODE=0 ID=7974 SEQ=3
[ +0.000013] iptables mangle PREROUTING SEIN=eth4 OUT= MAC=fe:fc:fe:b9:97:c7:fe:fd:fe:43:61:cc:08:00 SRC=177.0.0.1 DST=178.0.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=55830 DF PROTO=ICMP TYPE=0 CODE=0 ID=7974 SEQ=3
[ +0.999772] IN= OUT=eth4 SRC=178.0.0.10 DST=177.0.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56433 DF PROTO=ICMP TYPE=8 CODE=0 ID=7974 SEQ=4
[ +0.000191] iptables raw PREROUTING SEENIN=eth4 OUT= MAC=fe:fc:fe:b9:97:c7:fe:fd:fe:43:61:cc:08:00 SRC=177.0.0.1 DST=178.0.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56433 DF PROTO=ICMP TYPE=0 CODE=0 ID=7974 SEQ=4
[ +0.000040] iptables mangle PREROUTING SEIN=eth4 OUT= MAC=fe:fc:fe:b9:97:c7:fe:fd:fe:43:61:cc:08:00 SRC=177.0.0.1 DST=178.0.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56433 DF PROTO=ICMP TYPE=0 CODE=0 ID=7974 SEQ=4
[ +0.999790] IN= OUT=eth4 SRC=178.0.0.10 DST=177.0.0.1 LEN=84 TOS=0x00 PREC=0x00 TTL=64 ID=56903 DF PROTO=ICMP TYPE=8 CODE=0 ID=7974 SEQ=5
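Reading such a log by eye gets tedious when many packets are involved. A minimal sketch for summarizing it follows: for every echo reply (TYPE=0) it prints the last hook that logged it. The function name `last_hook_per_seq` is invented here, and the parsing assumes the log prefix starts with the word "iptables" followed by the table and the chain, as in the rules above.

```shell
#!/bin/sh
# Read kernel LOG lines on stdin. For each ICMP echo reply (TYPE=0), print
# "SEQ=<n> <table> <chain>" for the last hook that logged that sequence number.
last_hook_per_seq() {
    awk '
        / TYPE=0 / {
            seq = ""; hook = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^SEQ=/) seq = $i
                # The two words after "iptables" are the table and the chain.
                if ($i == "iptables") hook = $(i + 1) " " $(i + 2)
            }
            # Later lines overwrite earlier ones, so the last hook wins.
            if (seq != "" && hook != "") last[seq] = hook
        }
        END { for (s in last) print s, last[s] }
    ' | sort
}
```

Used as `dmesg | last_hook_per_seq`. If a reply never gets past mangle PREROUTING, that is the hook reported for its sequence number.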
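The reply does reach the mangle PREROUTING chain. Next comes nat, but nat has no rule that could drop it, so the packet cannot be lost there; after that comes the routing decision.
Since the filter INPUT rule never logs the reply, the routing decision looks like the only place where it can be dropped. Let's look at the current routes: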
[root@177-0-0-10 ~]# ip r
default via 10.131.192.1 dev eth0 metric 103
default via 178.0.0.1 dev eth4 metric 107
10.131.192.0/20 dev eth0 proto kernel scope link src 10.131.196.192
169.254.0.0/16 dev eth0 scope link metric 1002
169.254.0.0/16 dev eth3 scope link metric 1005
169.254.0.0/16 dev eth4 scope link metric 1006
178.0.0.0/8 dev eth4 proto kernel scope link src 178.0.0.10
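As I understood it, packets arriving on eth4 should be covered by
default via 178.0.0.1 dev eth4 metric 107
in both directions.
So where is the problem?
In my understanding, as it turns out...
What we do know is that the packet is dropped somewhere between PREROUTING and the routing decision,
so let's see which functions handle the packet in between: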
|-ip_rcv
  |-ip_rcv_finish
    |-ip_rcv_finish_core
      |-ip_route_input_noref
        |-ip_route_input_rcu
          |-ip_route_input_slow
            |-fib_validate_source
              |-__fib_validate_source
It looks like the drop happens in __fib_validate_source:
/* Given (packet source, input interface) and optional (dst, oif, tos):
 * - (main) check, that source is valid i.e. not broadcast or our local
 *   address.
 * - figure out what "logical" interface this packet arrived
 *   and calculate "specific destination" address.
 * - check, that packet arrived from expected physical interface.
 * called with rcu_read_lock()
 */
static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
				 u8 tos, int oif, struct net_device *dev,
				 int rpf, struct in_device *idev, u32 *itag)
{
	struct net *net = dev_net(dev);
	struct flow_keys flkeys;
	int ret, no_addr;
	struct fib_result res;
	struct flowi4 fl4;
	bool dev_match;

	fl4.flowi4_oif = 0;
	fl4.flowi4_iif = l3mdev_master_ifindex_rcu(dev);
	if (!fl4.flowi4_iif)
		fl4.flowi4_iif = oif ? : LOOPBACK_IFINDEX;
	fl4.daddr = src;
	fl4.saddr = dst;
	fl4.flowi4_tos = tos;
	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
	fl4.flowi4_tun_key.tun_id = 0;
	fl4.flowi4_flags = 0;
	fl4.flowi4_uid = sock_net_uid(net, NULL);
	fl4.flowi4_multipath_hash = 0;

	no_addr = idev->ifa_list == NULL;

	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
	if (!fib4_rules_early_flow_dissect(net, skb, &fl4, &flkeys)) {
		fl4.flowi4_proto = 0;
		fl4.fl4_sport = 0;
		fl4.fl4_dport = 0;
	}

	if (fib_lookup(net, &fl4, &res, 0))
		goto last_resort;
	if (res.type != RTN_UNICAST &&
	    (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
		goto e_inval;
	fib_combine_itag(itag, &res);

	dev_match = fib_info_nh_uses_dev(res.fi, dev);
	/* This is not common, loopback packets retain skb_dst so normally they
	 * would not even hit this slow path.
	 */
	dev_match = dev_match || (res.type == RTN_LOCAL &&
				  dev == net->loopback_dev);
	if (dev_match) {
		ret = FIB_RES_NHC(res)->nhc_scope >= RT_SCOPE_HOST;
		return ret;
	}
	if (no_addr)
		goto last_resort;
	if (rpf == 1)
		goto e_rpf;
	fl4.flowi4_oif = dev->ifindex;

	ret = 0;
	if (fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE) == 0) {
		if (res.type == RTN_UNICAST)
			ret = FIB_RES_NHC(res)->nhc_scope >= RT_SCOPE_HOST;
	}
	return ret;

last_resort:
	if (rpf)
		goto e_rpf;
	*itag = 0;
	return 0;

e_inval:
	return -EINVAL;
e_rpf:
	return -EXDEV;
}
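For someone like me who is not familiar with kernel code, a quick read of this does not reveal how it decides that the input device interface differs from the output device interface.
So I asked ChatGPT.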
Q: my kernel version(5,10) of source code ...list the code..., tell me how it compares iif different oif.
A:
After reviewing the code you provided, it appears that the version of the __fib_validate_source function in the Linux kernel 5.10 is indeed different from the version I discussed earlier.
In the provided code, the function doesn't directly compare the input interface (iif) and output interface (oif). Instead, it checks whether the route's next hop (found using fib_lookup) uses the same network device as the incoming packet. Here's a summary of the relevant parts of the code:
The function calls fib_lookup to find a route for the source IP address in the routing table.
It checks if the route type is either RTN_UNICAST or RTN_LOCAL and if the device specified in the route's next hop information (res.fi) matches the incoming device (dev). The fib_info_nh_uses_dev function is called to determine if the route uses the same device as the incoming packet.
If the route's device matches the incoming device (or if the route is a local route and the incoming device is the loopback device), the function returns a value indicating that the RPF check has passed.
The code snippet you provided doesn't explicitly compare iif and oif but achieves the same goal of verifying that the route associated with the source IP address uses the same device as the incoming packet. This ensures that the packet is not considered spoofed and can be processed further.
In summary, while the Linux kernel 5.10 version of the __fib_validate_source function doesn't explicitly compare iif and oif, it still serves the same purpose of validating the source IP address by checking if the route's next hop uses the same network device as the incoming packet.
ChatGPT points us to fib_info_nh_uses_dev,
which is where the input interface is compared with the output interface of the reverse route:
bool fib_info_nh_uses_dev(struct fib_info *fi, const struct net_device *dev)
{
	bool dev_match = false;
#ifdef CONFIG_IP_ROUTE_MULTIPATH
	if (unlikely(fi->nh)) {
		dev_match = nexthop_uses_dev(fi->nh, dev);
	} else {
		int ret;

		for (ret = 0; ret < fib_info_num_path(fi); ret++) {
			const struct fib_nh_common *nhc = fib_info_nhc(fi, ret);

			if (nhc_l3mdev_matches_dev(nhc, dev)) {
				dev_match = true;
				break;
			}
		}
	}
#else
	if (fib_info_nhc(fi, 0)->nhc_dev == dev)
		dev_match = true;
#endif
	return dev_match;
}
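It walks the devices of the route's paths (next hops) and checks whether any of them is the device the packet arrived on.
In our case nothing matches: the route back to 177.0.0.1 is the lower-metric default route via eth0, while the reply arrived on eth4. Execution therefore ends up in the last_resort or e_rpf logic, and rpf looks like the key variable that decides whether the packet is dropped.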
last_resort:
	if (rpf)
		goto e_rpf;
	*itag = 0;
	return 0;

e_inval:
	return -EINVAL;
e_rpf:
	return -EXDEV;
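To make the branches easier to follow, here is a deliberately simplified model of the verdict, written as a shell function rather than kernel code. The name `rpf_verdict` and its yes/no arguments are invented for this sketch, and it leaves out the e_inval path (non-unicast routes) and the no_addr case (interface without any address).

```shell
#!/bin/sh
# Simplified model of the verdict reached in __fib_validate_source.
# $1 rpf:       effective rp_filter mode (0, 1 or 2)
# $2 found:     yes/no, does a route back to the packet's source exist?
# $3 dev_match: yes/no, does that route use the interface the packet came in on?
# Prints "accept" or "EXDEV" (the packet is dropped).
rpf_verdict() {
    rpf=$1
    found=$2
    dev_match=$3
    if [ "$found" = no ]; then
        # last_resort: any non-zero mode rejects a source with no route back.
        if [ "$rpf" -ne 0 ]; then echo EXDEV; else echo accept; fi
        return
    fi
    if [ "$dev_match" = yes ]; then
        echo accept
        return
    fi
    # A route exists but points at another interface: only strict mode rejects.
    if [ "$rpf" -eq 1 ]; then echo EXDEV; else echo accept; fi
}
```

The situation in this post corresponds to `rpf_verdict 1 yes no`, which prints EXDEV; the same packet with mode 0 or 2 is accepted.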
rpf is rp_filter (Reverse Path Filter), which is described in detail in RFC 3704. It is a kernel parameter, documented as follows:
rp_filter - INTEGER
	0 - No source validation.
	1 - Strict mode as defined in RFC3704 Strict Reverse Path
	    Each incoming packet is tested against the FIB and if the interface
	    is not the best reverse path the packet check will fail.
	    By default failed packets are discarded.
	2 - Loose mode as defined in RFC3704 Loose Reverse Path
	    Each incoming packet's source address is also tested against the FIB
	    and if the source address is not reachable via any interface
	    the packet check will fail.

	Current recommended practice in RFC3704 is to enable strict mode
	to prevent IP spoofing from DDos attacks. If using asymmetric routing
	or other complicated routing, then loose mode is recommended.

	The max value from conf/{all,interface}/rp_filter is used
	when doing source validation on the {interface}.

	Default value is 0. Note that some distributions enable it
	in startup scripts.
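rpf is simply rp_filter, so sysctl -a | grep rp_filter
shows the corresponding values.
When it is 0, the source IP is not validated at all.
When it is 1 (strict), the best route back to the source IP must leave through the interface the packet arrived on.
When it is 2 (loose), the source IP only has to be reachable through some interface.
CentOS 6, 7 and 8 default to 1, which is a bit of a trap...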
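Because the kernel uses the larger of conf/all and conf/{interface}, looking at a single value can be misleading. A small sketch of how the effective mode for one interface could be computed follows; the function names `effective_rp_filter` and `iface_rp_filter` are made up for this example, and the /proc paths are the standard locations behind the sysctl keys.

```shell
#!/bin/sh
# Effective mode = max(conf/all, conf/<iface>), per the kernel documentation.
# $1 and $2 are the two integer values; prints the larger one.
effective_rp_filter() {
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Read both values for interface $1 from /proc and print the effective mode.
iface_rp_filter() {
    base=/proc/sys/net/ipv4/conf
    effective_rp_filter "$(cat "$base/all/rp_filter")" "$(cat "$base/$1/rp_filter")"
}
```

For example, `iface_rp_filter eth4` prints 1 when either conf/all or conf/eth4 is 1 and neither is 2, which is why setting only one of them to 0 does not disable the check.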
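Why is a single NIC fine while multiple NICs are not?
With a single NIC, packets can only enter and leave through that one interface, so the check passes whether or not it is enabled.
With multiple NICs, as soon as a packet arrives on an interface other than the one the routing table would use to reach its source (here the preferred default route points at eth0, while the reply arrives on eth4), you run into the rp_filter problem. Remember to turn it off with sysctl -w net.ipv4.conf.all.rp_filter=0 and sysctl -w net.ipv4.conf.eth4.rp_filter=0; both are needed because the larger of the two values applies.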
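Values set with sysctl -w are lost on reboot, and disabling the check entirely is not the only option: loose mode (2) also accepts asymmetric routing. As a hedged sketch, a helper can print the lines for a sysctl configuration file. The function name `rp_filter_conf` and the file name used in the usage note are hypothetical.

```shell
#!/bin/sh
# Print sysctl.conf-style lines that set rp_filter to mode $1 for conf/all
# and for every interface named in the remaining arguments.
rp_filter_conf() {
    mode=$1
    shift
    echo "net.ipv4.conf.all.rp_filter = $mode"
    for ifname in "$@"; do
        echo "net.ipv4.conf.$ifname.rp_filter = $mode"
    done
}
```

As root, something like `rp_filter_conf 2 eth4 > /etc/sysctl.d/90-rp-filter.conf` followed by `sysctl --system` would apply it. Keep in mind that writing 2 to conf/all makes every interface at least loose, whereas mode 0 only takes effect on interfaces whose own value is also 0.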
How to locate this problem quickly
This part follows the [Red Hat Solution](https://access.redhat.com/solutions/53031 "Checking for and resolving rp_filter problems")
- Check whether the counter is increasing
$ nstat -rsz | grep IPReversePathFilter
TcpExtIPReversePathFilter 52537 0.0
$ netstat -s | grep IPReversePathFilter
IPReversePathFilter: 52537
- Check which interface is used to reach the source IP
ip route get <remote IP>
- Check the rp_filter settings
sysctl -a | grep rp_filter
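The first two checks can be wrapped in small filters so the interesting value stands out. This is only a sketch: the function names `rpf_drops` and `route_dev` are invented here, and the parsing assumes the output formats shown above (a `TcpExtIPReversePathFilter` line from nstat, and a `dev <name>` pair in the output of ip route get).

```shell
#!/bin/sh
# Read nstat output on stdin and print the reverse-path-filter drop counter.
rpf_drops() {
    awk '$1 == "TcpExtIPReversePathFilter" { print $2 }'
}

# Read `ip route get` output on stdin and print the outgoing device name.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}
```

Typical use would be `nstat -rsz | rpf_drops`, run twice while the ping is going to see whether the number grows, and `ip route get 177.0.0.1 | route_dev` to compare the result with the interface the packets actually arrive on. With strict mode, a mismatch between the two is exactly the condition that gets the packet dropped.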