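Symptom: During the evening peak (21:00-24:00), zabbix-server was often unable to connect to the zabbix proxy in the US region.
Cause: Investigation showed that traffic between US and TW travels over an international link that passes through the carrier PCCW. Reporting the fault to the carrier and adjusting routes made no noticeable difference.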

[mtr screenshot]

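Because CDS has servers in TW, we eventually worked with the vendor, CDS, and adopted CDS cloud interconnect to solve the problem, so that the network path no longer passes through PCCW. In other words, the US CDS machine and the TW CDS machine form one "large internal network" and communicate over it; the TW CDS machine then forwards the US zabbix proxy traffic to the TW zabbix-server.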

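Troubleshooting process:
The first idea was to forward the traffic with iptables SNAT and DNAT. Unfortunately, that failed.
Next, we tried nginx to forward zabbix proxy traffic to the zabbix server. Testing showed that, because the zabbix proxy packets were too large, the layer-7 nginx reverse proxy could not forward the traffic completely, and repeated parameter tuning did not help. This failed too.
Finally, we used layer-4 haproxy to proxy zabbix proxy traffic to the zabbix server. Success!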
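
To illustrate what "layer 4" means in the last step, here is a minimal Python sketch of a TCP relay that copies bytes in both directions without ever parsing them. It is not how haproxy is implemented; the names `pipe` and `start_relay` and the use of `asyncio` are assumptions made purely for illustration.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, never inspecting them."""
    try:
        while True:
            chunk = await reader.read(65536)
            if not chunk:  # the peer closed its side
                break
            writer.write(chunk)
            await writer.drain()
    except OSError:
        pass  # one side went away; fall through and close the other
    finally:
        writer.close()


async def start_relay(listen_host, listen_port, upstream_host, upstream_port):
    """Listen on one address and relay every connection to the upstream."""

    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port
            )
        except OSError:
            client_writer.close()  # upstream unreachable: drop the client
            return
        # Two independent copies: client -> upstream and upstream -> client.
        await asyncio.gather(
            pipe(client_reader, up_writer),
            pipe(up_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

Because the relay only moves bytes, the size and framing of the Zabbix messages are irrelevant to it, which is the property the haproxy `mode tcp` setup relies on.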

With SNAT and DNAT, the proxy server kept sending large numbers of ARP requests, and this problem remains unresolved.

20:53:19.828856 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:20.330829 ARP, Request who-has 10.123.0.205 tell 10.123.0.10, length 46
20:53:20.376638 ARP, Request who-has 10.123.0.204 tell 10.123.0.10, length 46
20:53:20.830674 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:21.230574 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:21.832715 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:22.232912 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:22.801790 ARP, Request who-has 10.123.0.208 tell 10.123.0.10, length 46
20:53:22.834676 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:23.234728 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:23.804725 ARP, Request who-has 10.123.0.208 tell 10.123.0.10, length 46
20:53:24.236666 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:24.806665 ARP, Request who-has 10.123.0.208 tell 10.123.0.10, length 46

These are the SNAT and DNAT rules configured earlier. The approach may have been wrong, and they did not solve the problem.

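# As recorded, both commands use -D, which deletes a matching rule; adding a rule would use -A (or -I).
iptables -t nat -D POSTROUTING -s 10.123.1.68/32 -o ens192 -j SNAT --to-source 164.52.1.106
iptables -t nat -D PREROUTING -s 124.108.157.186 -j DNAT --to-destination 164.52.1.106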

A route to the proxy server was configured on the zabbix proxy, but zabbix-proxy still could not establish communication with zabbix-server.

111991:20190820:212111.845 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111990:20190820:212111.846 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]. Will retry every 120 second(s)
111992:20190820:212111.942 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]. Will retry every 15 second(s)
111991:20190820:212211.878 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212227.046 Still unable to connect...
111991:20190820:212311.926 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212327.110 Still unable to connect...
111990:20190820:212341.925 Still unable to connect...
111991:20190820:212411.974 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212427.173 Still unable to connect...
111991:20190820:212512.005 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212527.238 Still unable to connect...
111990:20190820:212556.965 Still unable to connect...
111991:20190820:212611.046 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212627.302 Still unable to connect...
111991:20190820:212711.077 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212727.365 Still unable to connect...
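
The `[110] Connection timed out` errors mean the TCP handshake to port 10051 never completed. One way to reproduce that check outside of zabbix-proxy is a plain TCP connect with a timeout. The sketch below is Python; the helper name `can_connect` and the 5-second default timeout are assumptions chosen for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals and unreachable routes
        return False
```

Calling `can_connect("124.108.157.186", 10051)` from the zabbix proxy host would be expected to return `False` while the condition in the log persists.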

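When proxying through nginx, the access log shows that the packet sent by zabbix proxy is very large. nginx could not receive the packet in full, which is what led to the failure.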

10.123.1.68 - - [20/Aug/2019:20:25:04 +0800] "ZBXD\x01\x95\xB4\x11\x00\x00\x00\x00\x00{\x22request\x22:\x22history data\x22,\x22host\x22:\x22US_FF_zabbix-proxy\x22,\x22data\x22:[{\x22host\x22:\x22us-ff-mysql08s2\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-mysql53s1\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-mysql32m\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22mysql_bak03\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-backend02\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if.in[bond0.1711]\x22,\x22clock\x22:1566295977,\x22ns\x22:131031460,\x22value\x22:\x2225911485220\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if.in[bond0]\x22,\x22clock\x22:1566295977,\x22ns\x22:131165849,\x22value\x22:\x2230945671117\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if.in[bond1.1712]\x22,\x22clock\x22:1566295977,\x22ns\x22:131254822,\x22value\x22:\x2219002118429\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if." 400 173 "-" "-" "-"
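
The logged request starts with `ZBXD\x01` instead of an HTTP request line, and nginx answered with status 400. That suggests the HTTP proxy was trying to parse Zabbix's binary TCP protocol as HTTP, which may explain the failure at least as much as the payload size does. As a rough sketch, assuming the header layout of `ZBXD`, one flags byte, and little-endian length fields (the helper name `parse_zbx_header` is made up for illustration), the header bytes from the log can be decoded in Python:

```python
import struct


def parse_zbx_header(header: bytes):
    """Decode the 13-byte Zabbix header and return (flags, payload_length)."""
    if len(header) < 13:
        raise ValueError("need at least 13 header bytes")
    # 4-byte magic, 1 flags byte, 4-byte length, 4 reserved bytes (little-endian).
    magic, flags, datalen, _reserved = struct.unpack("<4sBII", header[:13])
    if magic != b"ZBXD":
        raise ValueError("not a Zabbix protocol header")
    return flags, datalen


# Header bytes copied from the nginx access log entry above.
flags, datalen = parse_zbx_header(b"ZBXD\x01\x95\xB4\x11\x00\x00\x00\x00\x00")
print(flags, datalen)  # 1 1160341
```

Under that assumption, this single "history data" message announces a JSON payload of about 1.16 MB.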

Using haproxy as the proxy:

Aug 20 21:42:55 localhost haproxy[34707]: 10.123.1.68:40592 [20/Aug/2019:21:42:53.425] tcenterd tcenterd/tcenterd 1/74/1829 111 -- 0/0/0/0/0 0/0
Aug 20 21:42:57 localhost haproxy[34707]: 10.123.1.68:40596 [20/Aug/2019:21:42:55.538] tcenterd tcenterd/tcenterd 1/78/1748 111 -- 0/0/0/0/0 0/0
Aug 20 21:42:59 localhost haproxy[34707]: 10.123.1.68:40598 [20/Aug/2019:21:42:57.569] tcenterd tcenterd/tcenterd 1/74/1838 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:01 localhost haproxy[34707]: 10.123.1.68:40600 [20/Aug/2019:21:42:59.692] tcenterd tcenterd/tcenterd 1/74/1742 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:03 localhost haproxy[34707]: 10.123.1.68:40602 [20/Aug/2019:21:43:01.720] tcenterd tcenterd/tcenterd 1/74/1735 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:05 localhost haproxy[34707]: 10.123.1.68:40606 [20/Aug/2019:21:43:03.738] tcenterd tcenterd/tcenterd 1/74/1774 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:07 localhost haproxy[34707]: 10.123.1.68:40608 [20/Aug/2019:21:43:05.800] tcenterd tcenterd/tcenterd 1/73/1720 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:09 localhost haproxy[34707]: 10.123.1.68:40610 [20/Aug/2019:21:43:07.807] tcenterd tcenterd/tcenterd 1/75/1809 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:11 localhost haproxy[34707]: 10.123.1.68:40612 [20/Aug/2019:21:43:09.899] tcenterd tcenterd/tcenterd 1/77/1741 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:13 localhost haproxy[34707]: 10.123.1.68:40614 [20/Aug/2019:21:43:11.924] tcenterd tcenterd/tcenterd 1/82/1783 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:15 localhost haproxy[34707]: 10.123.1.68:40616 [20/Aug/2019:21:43:13.990] tcenterd tcenterd/tcenterd 1/75/1793 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:17 localhost haproxy[34707]: 10.123.1.68:40618 [20/Aug/2019:21:43:16.066] tcenterd tcenterd/tcenterd 1/77/1737 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:19 localhost haproxy[34707]: 10.123.1.68:40620 [20/Aug/2019:21:43:18.086] tcenterd tcenterd/tcenterd 1/74/1788 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:21 localhost haproxy[34707]: 10.123.1.68:40622 [20/Aug/2019:21:43:20.158] tcenterd tcenterd/tcenterd 1/74/1761 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:23 localhost haproxy[34707]: 10.123.1.68:40624 [20/Aug/2019:21:43:22.199] tcenterd tcenterd/tcenterd 1/77/1771 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:26 localhost haproxy[34707]: 10.123.1.68:40626 [20/Aug/2019:21:43:24.256] tcenterd tcenterd/tcenterd 1/74/1845 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:28 localhost haproxy[34707]: 10.123.1.68:40628 [20/Aug/2019:21:43:26.383] tcenterd tcenterd/tcenterd 1/77/1791 111 -- 0/0/0/0/0 0/0
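
These entries appear to follow haproxy's TCP log layout, in which the slash-separated triple is the queue wait, the connect time to the backend server, and the total session time, all in milliseconds, followed by a byte count and a termination state. A small Python sketch for extracting those fields follows; the regular expression and the name `parse_tcplog` are illustrative assumptions, not part of the original setup.

```python
import re

# Matches: client:port [accept date] frontend backend/server Tw/Tc/Tt bytes state
TCPLOG = re.compile(
    r"(?P<client>\d+\.\d+\.\d+\.\d+):(?P<port>\d+) "
    r"\[(?P<accepted>[^\]]+)\] "
    r"(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) "
    r"(?P<tw>-?\d+)/(?P<tc>-?\d+)/(?P<tt>\+?\d+) "
    r"(?P<bytes>\+?\d+) (?P<state>\S+)"
)


def parse_tcplog(line):
    """Return the timing fields of one log line as a dict, or None if no match."""
    m = TCPLOG.search(line)
    if m is None:
        return None
    return {
        "client": m["client"],
        "frontend": m["frontend"],
        "server": m["server"],
        "queue_ms": int(m["tw"]),
        "connect_ms": int(m["tc"]),
        "total_ms": int(m["tt"]),
        "bytes": int(m["bytes"]),
        "state": m["state"],
    }


line = (
    "Aug 20 21:42:55 localhost haproxy[34707]: 10.123.1.68:40592 "
    "[20/Aug/2019:21:42:53.425] tcenterd tcenterd/tcenterd "
    "1/74/1829 111 -- 0/0/0/0/0 0/0"
)
entry = parse_tcplog(line)
print(entry["connect_ms"], entry["total_ms"], entry["bytes"])  # 74 1829 111
```

Read this way, the first entry shows a 74 ms connect time to the server and a session of roughly 1.8 s that ended with the `--` state, meaning no error flag was recorded.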

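With haproxy in place, zabbix proxy can exchange data with zabbix-server normally again, although the log still shows an occasional Broken pipe when sending history data.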

114075:20190820:212810.030 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:212915.579 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213021.450 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213126.427 received configuration data from server at "10.123.0.2", datalen 9492921
114077:20190820:213133.145 cannot send history data to server at "10.123.0.2": ZBX_TCP_WRITE() failed: [32] Broken pipe
114075:20190820:213234.122 received configuration data from server at "10.123.0.2", datalen 9492921
114077:20190820:213245.932 cannot send history data to server at "10.123.0.2": ZBX_TCP_WRITE() failed: [32] Broken pipe
114075:20190820:213348.664 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213453.962 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213603.570 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213713.600 received configuration data from server at "10.123.0.2", datalen 9492921

haproxy configuration:

$ cat /etc/haproxy/haproxy.cfg
global
    log 127.0.0.1 local2
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 400000
    user haproxy
    group haproxy
    daemon
    nbproc 30
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats
defaults
    mode http
    log global
    option dontlognull
    option http-server-close
    option redispatch
    retries 3
    timeout http-request 100s
    timeout queue 10m
    timeout connect 100s
    timeout client 10m
    timeout server 10m
    timeout http-keep-alive 100s
    timeout check 100s
    maxconn 300000
frontend us_to_tw_proxy
    bind *:10051
    mode tcp
    default_backend zabbix-server
    maxconn 5000000
backend zabbix-server
    mode tcp
    server tcenterd 124.108.157.186:10051 maxconn 100
listen stats :1936
    stats enable
    stats uri /
    stats realm Haproxy\ Statistics
    stats auth admin:test1

The zabbix proxy needs a static route that changes the network path from zabbix proxy to zabbix server, so that the traffic is forwarded through the proxy server.

/usr/sbin/route add -host 124.108.157.186 gw 10.123.0.3

Then change the zabbix server setting in the zabbix proxy configuration to the proxy server's IP address.

$ grep -iv '^$' zabbix_proxy_tw.conf | grep -iv '^#'
Server=10.123.0.3


Source: "Jack Wang Blog": http://www.yfshare.vip/2021/03/17/代理转zabbix proxy/