🈺 🤶🏿 👩🏽‍🤝‍👨🏾 监控Kubernetes主机之间的ping是我们的食谱 🎒 👞 👩🏻‍⚖️

在诊断Kubernetes集群中的问题时，我们经常会注意到有时集群节点之一会细雨*，当然，这是罕见且奇怪的。因此，我们需要一种可以在每个节点之间进行ping操作并以Prometheus度量形式呈现其工作结果的工具。我们只需要在Grafana中绘制图形并快速定位失败的节点即可（并且，如有必要，从中删除所有节点，然后进行相应的工作**）...

*通过“ NotReady ”，我知道该节点可以进入NotReady状态并突然恢复工作。 或者，例如，pod中的部分流量可能无法到达相邻节点上的pod。

**为什么会出现这种情况？ 常见原因之一可能是数据中心交换机上的网络问题。 例如，曾经在Hetzner上配置过vswitch，但在一个美妙的时刻，其中一个节点不再可以在此vswitch端口上访问：因此，事实证明该节点在本地网络上完全不可访问。

另外，我们想直接在Kubernetes中启动这样的服务 ，以便整个部署通过使用Helm-chart进行。（预想的问题-如果使用相同的Ansible，我们将不得不为各种环境编写角色：AWS，GCE，裸机...）在Internet上搜索了用于该任务的现成工具时，我们找不到任何合适的工具。因此，他们做了自己的。

脚本和配置

因此，我们解决方案的主要组件是一个脚本，该脚本监视.status.addresses字段中任何节点的更改，并且如果某个字段的某个节点发生了更改（即已添加了新节点），则使用以下命令将Helm值发送到图表ConfigMap形式的节点列表：

 --- apiVersion: v1 kind: ConfigMap metadata: name: node-ping-config namespace: kube-prometheus data: nodes.json: > {{ .Values.nodePing.nodes | toJson }}

Python脚本本身：

 #!/usr/bin/env python3 import subprocess import prometheus_client import re import statistics import os import json import glob import better_exchook import datetime better_exchook.install() FPING_CMDLINE = "/usr/sbin/fping -p 1000 -A -C 30 -B 1 -q -r 1".split(" ") FPING_REGEX = re.compile(r"^(\S*)\s*: (.*)$", re.MULTILINE) CONFIG_PATH = "/config/nodes.json" registry = prometheus_client.CollectorRegistry() prometheus_exceptions_counter = \ prometheus_client.Counter('kube_node_ping_exceptions', 'Total number of exceptions', [], registry=registry) prom_metrics = {"sent": prometheus_client.Counter('kube_node_ping_packets_sent_total', 'ICMP packets sent', ['destination_node', 'destination_node_ip_address'], registry=registry), "received": prometheus_client.Counter( 'kube_node_ping_packets_received_total', 'ICMP packets received', ['destination_node', 'destination_node_ip_address'], registry=registry), "rtt": prometheus_client.Counter( 'kube_node_ping_rtt_milliseconds_total', 'round-trip time', ['destination_node', 'destination_node_ip_address'], registry=registry), "min": prometheus_client.Gauge('kube_node_ping_rtt_min', 'minimum round-trip time', ['destination_node', 'destination_node_ip_address'], registry=registry), "max": prometheus_client.Gauge('kube_node_ping_rtt_max', 'maximum round-trip time', ['destination_node', 'destination_node_ip_address'], registry=registry), "mdev": prometheus_client.Gauge('kube_node_ping_rtt_mdev', 'mean deviation of round-trip times', ['destination_node', 'destination_node_ip_address'], registry=registry)} def validate_envs(): envs = {"MY_NODE_NAME": os.getenv("MY_NODE_NAME"), "PROMETHEUS_TEXTFILE_DIR": os.getenv("PROMETHEUS_TEXTFILE_DIR"), "PROMETHEUS_TEXTFILE_PREFIX": os.getenv("PROMETHEUS_TEXTFILE_PREFIX")} for k, v in envs.items(): if not v: raise ValueError("{} environment variable is empty".format(k)) return envs @prometheus_exceptions_counter.count_exceptions() def compute_results(results): computed = {} matches = FPING_REGEX.finditer(results) for match in matches: ip = match.group(1) ping_results = match.group(2) if "duplicate" in ping_results: continue splitted = ping_results.split(" ") if len(splitted) != 30: raise ValueError("ping returned wrong number of results: \"{}\"".format(splitted)) positive_results = [float(x) for x in splitted if x != "-"] if len(positive_results) > 0: computed[ip] = {"sent": 30, "received": len(positive_results), "rtt": sum(positive_results), "max": max(positive_results), "min": min(positive_results), "mdev": statistics.pstdev(positive_results)} else: computed[ip] = {"sent": 30, "received": len(positive_results), "rtt": 0, "max": 0, "min": 0, "mdev": 0} if not len(computed): raise ValueError("regex match\"{}\" found nothing in fping output \"{}\"".format(FPING_REGEX, results)) return computed @prometheus_exceptions_counter.count_exceptions() def call_fping(ips): cmdline = FPING_CMDLINE + ips process = subprocess.run(cmdline, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, universal_newlines=True) if process.returncode == 3: raise ValueError("invalid arguments: {}".format(cmdline)) if process.returncode == 4: raise OSError("fping reported syscall error: {}".format(process.stderr)) return process.stdout envs = validate_envs() files = glob.glob(envs["PROMETHEUS_TEXTFILE_DIR"] + "*") for f in files: os.remove(f) labeled_prom_metrics = [] while True: with open("/config/nodes.json", "r") as f: config = json.loads(f.read()) if labeled_prom_metrics: for node in config: if (node["name"], node["ipAddress"]) not in [(metric["node_name"], metric["ip"]) for metric in labeled_prom_metrics]: for k, v in prom_metrics.items(): v.remove(node["name"], node["ipAddress"]) labeled_prom_metrics = [] for node in config: metrics = {"node_name": node["name"], "ip": node["ipAddress"], "prom_metrics": {}} for k, v in prom_metrics.items(): metrics["prom_metrics"][k] = v.labels(node["name"], node["ipAddress"]) labeled_prom_metrics.append(metrics) out = call_fping([prom_metric["ip"] for prom_metric in labeled_prom_metrics]) computed = compute_results(out) for dimension in labeled_prom_metrics: result = computed[dimension["ip"]] dimension["prom_metrics"]["sent"].inc(computed[dimension["ip"]]["sent"]) dimension["prom_metrics"]["received"].inc(computed[dimension["ip"]]["received"]) dimension["prom_metrics"]["rtt"].inc(computed[dimension["ip"]]["rtt"]) dimension["prom_metrics"]["min"].set(computed[dimension["ip"]]["min"]) dimension["prom_metrics"]["max"].set(computed[dimension["ip"]]["max"]) dimension["prom_metrics"]["mdev"].set(computed[dimension["ip"]]["mdev"]) prometheus_client.write_to_textfile( envs["PROMETHEUS_TEXTFILE_DIR"] + envs["PROMETHEUS_TEXTFILE_PREFIX"] + envs["MY_NODE_NAME"] + ".prom", registry)

它在每个节点上运行 ，并每秒两次将ICMP数据包发送到所有其他Kubernetes集群实例，然后将结果写入文本文件。

该脚本包含在Docker映像中 ：

 FROM python:3.6-alpine3.8 COPY rootfs / WORKDIR /app RUN pip3 install --upgrade pip && pip3 install -r requirements.txt && apk add --no-cache fping ENTRYPOINT ["python3", "/app/node-ping.py"]

另外，创建了一个ServiceAccount并为其指定了一个角色，该角色仅允许接收节点列表（以便知道其地址）：

 --- apiVersion: v1 kind: ServiceAccount metadata: name: node-ping namespace: kube-prometheus --- kind: ClusterRole apiVersion: rbac.authorization.k8s.io/v1 metadata: name: kube-prometheus:node-ping rules: - apiGroups: [""] resources: ["nodes"] verbs: ["list"] --- kind: ClusterRoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: kube-prometheus:kube-node-ping subjects: - kind: ServiceAccount name: node-ping namespace: kube-prometheus roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: kube-prometheus:node-ping

最后，您需要DaemonSet ，它可以在集群的所有实例上运行：

 --- apiVersion: extensions/v1beta1 kind: DaemonSet metadata: name: node-ping namespace: kube-prometheus labels: tier: monitoring app: node-ping version: v1 spec: updateStrategy: type: RollingUpdate template: metadata: labels: name: node-ping spec: terminationGracePeriodSeconds: 0 tolerations: - operator: "Exists" serviceAccountName: node-ping priorityClassName: cluster-low containers: - resources: requests: cpu: 0.10 image: private-registry.flant.com/node-ping/node-ping-exporter:v1 imagePullPolicy: Always name: node-ping env: - name: MY_NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName - name: PROMETHEUS_TEXTFILE_DIR value: /node-exporter-textfile/ - name: PROMETHEUS_TEXTFILE_PREFIX value: node-ping_ volumeMounts: - name: textfile mountPath: /node-exporter-textfile - name: config mountPath: /config volumes: - name: textfile hostPath: path: /var/run/node-exporter-textfile - name: config configMap: name: node-ping-config imagePullSecrets: - name: antiopa-registry

文字摘要：

Python脚本结果-即放在主机上/var/run/node-exporter-textfile textfile目录中的文本文件进入DaemonSet node-exporter。运行它的参数表示--collector.textfile.directory /host/textfile ，其中/host/textfile是/var/run/node-exporter-textfile上的hostPath。 （您可以在node-exporter中了解有关文本文件收集器的信息。）
结果，node-exporter读取了这些文件，Prometheus从node-exporter收集了所有数据。