Fixing slow DNS lookups in k8s

2019-12-30 · Original

The symptom: every DNS lookup made from a docker container took five to six seconds.


The original production network had 5 application nodes. To migrate to k8s, 3 new nodes were created and formed into a k8s cluster.

The 3 new nodes are named master, node1, and node2.

On the original nodes, /etc/resolv.conf was not the one provided by Azure; it had been edited by hand and contained:

nameserver 8.8.8.8

On the new nodes node1 and node2, the DNS configuration is the one provided by Azure:

# Generated by NetworkManager
search klhjewf2u3orhfu2oihfn3i4fj3.hx.internal.cloudapp.net
nameserver 134.33.122.76

After business traffic was switched over to k8s, 3 of the original 5 machines were left idle and the other 2 joined the k8s cluster as node3 and node4.

Containers started on node1 and node2 get this /etc/resolv.conf:

nameserver 10.96.0.10
search example-test.svc.cluster.local svc.cluster.local cluster.local klhjewf2u3orhfu2oihfn3i4fj3.hx.internal.cloudapp.net
options ndots:5

Containers started on node3 and node4 get this /etc/resolv.conf:

nameserver 10.96.0.10
search example-test.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

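options ndots:5 means that any queried name with fewer than 5 dots is first tried against each domain in the search list, in order; only if none of them returns a result is the name itself queried as-is. The resolv.conf(5) man page describes both settings: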

search Search list for host-name lookup.

The search list is normally determined from the local domain name; by default, it contains only the local domain name. This may be changed by listing the desired domain search path following the search keyword with spaces or tabs separating the names. Resolver queries having fewer than ndots dots (default is 1) in them will be attempted using each component of the search path in turn until a match is found. For environments with multiple subdomains please read options ndots:n below to avoid man-in-the-middle attacks and unnecessary traffic for the root-dns-servers. Note that this process may be slow and will generate a lot of network traffic if the servers for the listed domains are not local, and that queries will time out if no server is available for one of the domains.
The search list is currently limited to six domains with a total of 256 characters.

options option …

where option is one of the following:

ndots:n

sets a threshold for the number of dots which must appear in a name given to res_query(3) (see resolver(3)) before an initial absolute query will be made. The default for n is 1, meaning that if there are any dots in a name, the name will be tried first as an absolute name before any search list elements are appended to it. The value for this option is silently capped to 15.
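
To make the interaction between search and ndots concrete, here is a small Python sketch that lists the names a resolver would try, in order. It is a simplified model of the behaviour described in the man page excerpt above, not the resolver's actual code; the function name candidate_names is made up for this illustration, and other resolver implementations may differ in the details.

```python
def candidate_names(name, search, ndots=1):
    """Return the names a stub resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: the search list is skipped.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [name] + expanded
    # Too few dots: walk the search list first, the bare name last.
    return expanded + [name]


# Search list taken from the containers on node1 and node2.
search = [
    "example-test.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "klhjewf2u3orhfu2oihfn3i4fj3.hx.internal.cloudapp.net",
]
for candidate in candidate_names("mysql.domain.com", search, ndots=5):
    print(candidate)
```

With ndots:5, mysql.domain.com has only two dots, so the four search-list variants are tried before the bare name, and the cloudapp.net variant is one of them.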

A quick test:

bash-5.0# host -v mysql.domain.com
Trying "mysql.domain.com.example-test.svc.cluster.local"
Trying "mysql.domain.com.svc.cluster.local"
Trying "mysql.domain.com.cluster.local"
Trying "mysql.domain.com.klhjewf2u3orhfu2oihfn3i4fj3.hx.internal.cloudapp.net"
Host mysql.domain.com.klhjewf2u3orhfu2oihfn3i4fj3.hx.internal.cloudapp.net not found: 2(SERVFAIL)
bash-5.0# host -v mysql.domain.com
Trying "mysql.domain.com.example-test.svc.cluster.local"
Trying "mysql.domain.com.svc.cluster.local"
Trying "mysql.domain.com.cluster.local"
Trying "mysql.domain.com.klhjewf2u3orhfu2oihfn3i4fj3.hx.internal.cloudapp.net"
Trying "mysql.domain.com"
... ...

It turned out that the lookup stalls on the cloudapp.net name: when no record is found, the DNS server takes five to six seconds to respond.
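
One way to confirm the delay from application code rather than from host -v is to time a lookup through the system resolver. The sketch below is only an illustration; timed_lookup is a hypothetical helper, and the resolve parameter exists just so the function can be exercised without a network.

```python
import socket
import time


def timed_lookup(name, resolve=socket.getaddrinfo):
    """Resolve `name` once and return (elapsed_seconds, succeeded)."""
    start = time.monotonic()
    try:
        # getaddrinfo goes through the system resolver, so it honours
        # the search list and ndots from /etc/resolv.conf.
        resolve(name, None)
        ok = True
    except socket.gaierror:
        ok = False
    return time.monotonic() - start, ok
```

Calling timed_lookup("mysql.domain.com") a few times inside an affected container should show roughly the same delay on every call, which is what you would expect if nothing along the way remembers the failed answer.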

When everything lines up, this whole sequence works fine. The problem only appears when all three of the following conditions hold:

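  1. By default coredns forwards to the servers in the host's /etc/resolv.conf, and the DNS server listed there cannot resolve the name being queried
  2. The container does not cache DNS results; in particular, a "not found" answer is not cached
  3. The container needs to look up a name that cannot be resolved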

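In other words, the container's /etc/resolv.conf makes it query a name that coredns cannot resolve. That query takes five to six seconds, and because no record is found the result is not cached, so the full wait is paid on every lookup.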
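
The effect of the missing "not found" cache can be shown with a toy model. Everything in this sketch is hypothetical (SlowUpstream and NegativeCache are illustration-only classes, and the delay is shortened to 0.05 s); it does not reproduce how coredns or any real resolver caches, it only shows why remembering a miss removes the repeated wait.

```python
import time


class SlowUpstream:
    """Stand-in for a DNS server that is slow to report 'not found'."""

    def __init__(self, delay):
        self.delay = delay
        self.calls = 0

    def lookup(self, name):
        self.calls += 1
        time.sleep(self.delay)
        return None  # None means "no record"


class NegativeCache:
    """Remembers 'not found' answers for `ttl` seconds."""

    def __init__(self, upstream, ttl):
        self.upstream = upstream
        self.ttl = ttl
        self.missing = {}  # name -> time at which the cached miss expires

    def lookup(self, name):
        expiry = self.missing.get(name)
        if expiry is not None and expiry > time.monotonic():
            return None  # cached miss: the slow upstream is not contacted
        answer = self.upstream.lookup(name)
        if answer is None:
            self.missing[name] = time.monotonic() + self.ttl
        return answer


upstream = SlowUpstream(delay=0.05)  # five to six seconds in the real incident
cached = NegativeCache(upstream, ttl=30)
for _ in range(3):
    cached.lookup("mysql.domain.com.klhjewf2u3orhfu2oihfn3i4fj3.hx.internal.cloudapp.net")
print(upstream.calls)  # prints 1: only the first lookup reached the slow server
```

Without the cache layer, the same three lookups would hit the slow upstream three times, which matches what the containers were experiencing.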

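Concretely, the 8.8.8.8 server cannot resolve the cloudapp.net domain assigned by Azure, yet containers try to resolve it on every lookup and wait five to six seconds each time.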


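At first coredns ran on node1 and node2 and could resolve the cloudapp.net domain. Later it moved to node3 and node4, whose upstream is 8.8.8.8, and could no longer resolve it.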

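Containers on node3 and node4 never query the unresolvable domain, since it is not in their search list; containers on the new nodes (node1 and node2) do.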


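There are two ways to fix this:

stop the containers from querying the unresolvable domain

or

let them query it, but have the answer come back immediately.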
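
For the first option, one possible approach (an assumption on my part, not necessarily what was done here) is to give the pod an explicit DNS configuration, so the cloudapp.net entry never ends up in its search list. The Pod spec fields dnsPolicy and dnsConfig exist for this. The sketch below builds them as a Python dict and prints JSON that can be merged into a manifest; pod_dns_override is a hypothetical helper, and the nameserver and search domains are copied from the resolv.conf shown earlier.

```python
import json


def pod_dns_override(nameserver, searches, ndots):
    """Build the DNS-related fields of a Pod spec as a plain dict."""
    return {
        "dnsPolicy": "None",  # ignore the DNS settings inherited from the node
        "dnsConfig": {
            "nameservers": [nameserver],
            "searches": list(searches),
            # Option values are strings in the Pod spec.
            "options": [{"name": "ndots", "value": str(ndots)}],
        },
    }


spec_fields = pod_dns_override(
    nameserver="10.96.0.10",
    searches=[
        "example-test.svc.cluster.local",
        "svc.cluster.local",
        "cluster.local",
    ],
    ndots=5,
)
print(json.dumps(spec_fields, indent=2))
```

A lighter-weight variant of the same idea is to query the name with a trailing dot (mysql.domain.com.), which marks it as absolute so the search list is skipped for that lookup.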
