Monitoring a TLS Kubernetes Cluster with Prometheus
Once a Kubernetes cluster is deployed, we need to collect logs from and monitor the Pods running in it. With N servers under Kubernetes management, Pods are created and destroyed automatically, which makes it hard to get a timely view of the state and resource consumption of every Pod and server. It feels like driving a sports car down the highway with no dashboard: unnerving.
In past work I have used monitoring tools such as Nagios, Cacti, and Zabbix, but none of them is a good fit for a Kubernetes cluster. We therefore introduce a new monitoring tool: Prometheus.
About Prometheus
Prometheus is a monitoring system open-sourced by SoundCloud. Its design draws on Google's internal monitoring systems, which makes it a natural fit for Kubernetes, another project with Google roots. Prometheus integrates data collection, storage, and alerting into a single, complete solution. For large cluster environments it introduced a pull-based collection model, a multi-dimensional data model, and service discovery.
Compared with traditional monitoring tools, Prometheus can use service discovery to learn which scrape endpoints the cluster already exposes and then actively pull all the metrics. With this architecture we only need to deploy a single Prometheus instance into the Kubernetes cluster: it queries the apiserver for cluster state and then scrapes Pod metrics from every kubelet that exposes Prometheus metrics. If we also want host-level metrics, we run the companion node-exporter on every server via a DaemonSet, and Prometheus picks up the new data automatically.
This dynamic-discovery architecture suits a Kubernetes environment, where neither servers nor workloads are fixed, and it greatly reduces the operational burden.
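As a rough illustration of the pull model: any process can expose metrics in Prometheus's plain-text exposition format, and Prometheus scrapes that text over HTTP. The sketch below (metric and label names are illustrative, not taken from this deployment) renders one sample the way an exporter would serve it at `/metrics`:

```python
def render_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format,
    e.g. metric_name{label="x"} 42 -- the line format Prometheus
    pulls from each target's /metrics endpoint."""
    label_str = ",".join('%s="%s"' % (k, v) for k, v in sorted(labels.items()))
    return "%s{%s} %s" % (name, label_str, value)

# One free-memory sample for a hypothetical node-exporter target
line = render_metric("node_memory_MemFree", {"instance": "192.168.1.198:9100"}, 1048576)
print(line)
```

Real exporters add `# HELP` and `# TYPE` comment lines as well; this sketch only shows the sample lines themselves.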
Prometheus website: https://prometheus.io/
Prometheus downloads: https://prometheus.io/download/
Prometheus documentation: https://prometheus.io/docs/introduction/overview/
Environment
- Prometheus v2.2.0
- node-exporter v0.15.2
- Kubernetes v1.8.2
- CentOS 7.4
| Role | IP | Notes |
|---|---|---|
| k8s master | 192.168.1.195 | k8s master |
| k8s node | 192.168.1.198 | k8s node, Prometheus, node-exporter |
| k8s node | 192.168.1.199 | k8s node, Prometheus, node-exporter |
Deploying node-exporter
node-exporter collects hardware- and OS-level metrics from the underlying servers.
The official description:
Prometheus exporter for hardware and OS metrics exposed by *NIX kernels, written in Go with pluggable metric collectors.
The WMI exporter is recommended for Windows users.
node-exporter on GitHub: https://github.com/prometheus/node_exporter
To collect metrics from every node, we deploy the Pods as a DaemonSet.
node-exporter.yaml:

```yaml
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-ops
  labels:
    k8s-app: node-exporter
spec:
  template:
    metadata:
      labels:
        k8s-app: node-exporter
    spec:
      containers:
      - image: prom/node-exporter:latest
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          protocol: TCP
          name: http
        volumeMounts:
        - name: time
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: time
        hostPath:
          path: /etc/localtime
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: node-exporter
  name: node-exporter
  namespace: kube-ops
spec:
  ports:
  - name: http
    port: 9100
    targetPort: 9100
    protocol: TCP
  selector:
    k8s-app: node-exporter
```
```shell
[root@localhost prometheus]# kubectl create namespace kube-ops
[root@localhost prometheus]# kubectl apply -f node-exporter.yaml
```
```shell
[root@localhost prometheus]# kubectl get pod -o wide -n kube-ops
NAME                  READY     STATUS    RESTARTS   AGE       IP            NODE
node-exporter-8d66t   1/1       Running   0          1h        172.30.41.5   192.168.1.199
node-exporter-xn5ss   1/1       Running   0          1h        172.30.57.6   192.168.1.198
[root@localhost prometheus]#
```
```shell
[root@localhost prometheus]# kubectl get svc -o wide -n kube-ops
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE       SELECTOR
node-exporter   ClusterIP   172.16.152.14   <none>        9100/TCP   1h        k8s-app=node-exporter
[root@localhost prometheus]#
```
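To sanity-check that node-exporter is serving data, you can `curl http://<node-ip>:9100/metrics` and inspect the returned text. A minimal, illustrative parser for one line of that exposition format (not a full implementation; it ignores HELP/TYPE comments and escaped label values):

```python
import re

def parse_sample(line):
    """Parse one Prometheus exposition line, e.g.
    node_load1{instance="x"} 0.5, into (name, labels-dict, value)."""
    m = re.match(r'([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$', line)
    name, label_body, value = m.group(1), m.group(2) or "", float(m.group(3))
    labels = dict(re.findall(r'(\w+)="([^"]*)"', label_body))
    return name, labels, value

# Sample line of the kind node-exporter emits (value is illustrative)
name, labels, value = parse_sample('node_filesystem_free{device="rootfs"} 1.23e+10')
```

This is handy when scripting quick checks against the 9100 endpoints before wiring Prometheus up to them.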
Deploying the Service Account
Kubernetes enables RBAC from v1.8.0 onward, so we must first grant Prometheus the necessary permissions via RBAC; otherwise its requests to the Kubernetes API server will be rejected and it will be unable to connect.
Reference: https://kubernetes.io/docs/admin/authorization/rbac/
prometheus-service-account.yml:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-ops
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: kube-ops
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  namespace: kube-ops
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-ops
```
```shell
[root@localhost prometheus]# kubectl apply -f prometheus-service-account.yml
```

```shell
[root@localhost prometheus]# kubectl get ServiceAccount -n kube-ops
NAME         SECRETS   AGE
default      1         1d
prometheus   1         55m
[root@localhost prometheus]#
```
Deploying the Alertmanager configuration
We use a ConfigMap to hold the Alertmanager configuration.
Reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
prometheus-alertmanager-config.yml:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-ops
data:
  config.yml: |-
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'user1@example.com'
      smtp_auth_username: 'user1@example.com'
      smtp_auth_password: 'password'
      smtp_require_tls: false
      resolve_timeout: 5m
    templates:
    - '/etc/alertmanager/*.tmpl'
    route:
      receiver: email
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10d
      group_by: [alertname]
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: email
      email_configs:
      - send_resolved: true
        to: 'user2@example.com,user3@example.com'
```
- `repeat_interval` sets how long to wait before re-sending a notification for an alert that is still firing.
- `to: 'user2@example.com,user3@example.com'` lists multiple recipients; separate each address with a comma.
```shell
[root@localhost prometheus]# kubectl apply -f prometheus-alertmanager-config.yml
```

```shell
[root@localhost prometheus]# kubectl get ConfigMap -n kube-ops
```
- Update the `global.smtp_*` settings and the email addresses under `receivers` (`email_configs`) to match your own mail account.
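The routing in the config above can be pictured as a simple label match: an alert's labels are compared against each child route's `match` clause, falling back to the top-level receiver. A toy model of that dispatch (a deliberate simplification, not Alertmanager's actual tree-walking implementation):

```python
def pick_receiver(alert_labels, routes, default_receiver):
    """Return the receiver for an alert: the first child route whose
    `match` labels are all present on the alert wins, else the root
    route's receiver applies."""
    for route in routes:
        if all(alert_labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver

# Mirrors the config: one child route matching team=node -> email
routes = [{"receiver": "email", "match": {"team": "node"}}]
r = pick_receiver({"alertname": "NodeCPUUsage", "team": "node"}, routes, "email")
```

In this configuration both the child route and the root use the `email` receiver, so every alert ends up there; the child route only changes `group_wait` for `team=node` alerts.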
Deploying the Prometheus configuration
prometheus-config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 30s
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["192.168.1.198:9093"]
    rule_files:
    - "rules.yml"
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-nodes'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: 192.168.1.195:6443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: 192.168.1.195:6443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-node-exporter'
      scheme: http
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
  rules.yml: |
    groups:
    - name: rule
      rules:
      - alert: NodeFilesystemUsage
        expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{ $labels.instance }}: High Filesystem usage detected"
          description: "{{ $labels.instance }}: Filesystem usage is above 80% (current value is: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{ $labels.instance }}: High Memory usage detected"
          description: "{{ $labels.instance }}: Memory usage is above 80% (current value is: {{ $value }})"
      - alert: NodeCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{ $labels.instance }}: High CPU usage detected"
          description: "{{ $labels.instance }}: CPU usage is above 80% (current value is: {{ $value }})"
```
- In the `kubernetes-node-exporter` job, the target port is rewritten to 9100, the port node-exporter exposes (`targetPort: 9100` in node-exporter.yaml above); fill this in according to your own setup.
- `kubernetes.default.svc:443` is the Kubernetes API address; if your cluster was installed without the default DNS, change it manually.
- A Prometheus Alertmanager target has been added: update the IP in `alerting.alertmanagers.static_configs.targets`. Here Prometheus and Alertmanager run as two Docker containers, so the IP is that of the host running Alertmanager.
- Alerting rules have been added via `rule_files`, with three rules in `rules.yml`: node filesystem usage, node memory usage, and node CPU usage. When usage exceeds 80%, an alert labeled `team=node` is triggered and routed to the receiver configured for that label in Alertmanager; with the configuration above, that matches the `email` receiver.
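The `__address__` rewrite in the node-exporter job is an anchored regex substitution: Prometheus matches the relabel `regex` against the whole source value and substitutes the capture group into `replacement`. Its effect can be reproduced in Python (purely illustrative; Prometheus uses RE2 full-match semantics internally):

```python
import re

def relabel_address(address, regex=r'(.*):10250', replacement=r'\1:9100'):
    """Mimic the relabel rule: if the whole address matches the regex
    (kubelet port 10250), rewrite the port to node-exporter's 9100;
    otherwise leave the address unchanged."""
    m = re.fullmatch(regex, address)
    return m.expand(replacement) if m else address

print(relabel_address("192.168.1.198:10250"))  # kubelet address -> node-exporter target
print(relabel_address("192.168.1.198:9100"))   # no :10250 suffix, left unchanged
```

Testing a relabel rule this way before reloading Prometheus saves a round of trial and error against live targets.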
```shell
[root@localhost prometheus]# kubectl apply -f prometheus-config.yaml
```

```shell
[root@localhost prometheus]# kubectl get ConfigMap -n kube-ops
```
Deploying Prometheus
We deploy Prometheus itself as a Deployment.
Reference: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
Create a node label:

```shell
[root@localhost ~]# kubectl label node 192.168.1.198 "appNodes=pro-00-monitor"
node "192.168.1.198" labeled
[root@localhost ~]# kubectl get node -a -l "appNodes=pro-00-monitor"
NAME            STATUS    ROLES     AGE       VERSION
192.168.1.198   Ready     <none>    24m       v1.9.2
[root@localhost ~]#
[root@localhost ~]# mkdir -p /data/monitor    # create the data mount directory
```
prometheus-deploy.yaml:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus
  namespace: kube-ops
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      nodeSelector:
        appNodes: pro-00-monitor
      securityContext:
        runAsUser: 0
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.2.0
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=15d"
        ports:
        - containerPort: 9090
          hostPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
          subPath: prometheus/data
        - mountPath: "/etc/prometheus"
          name: config-volume
        - mountPath: "/etc/localtime"
          name: time
          readOnly: true
        resources:
          requests:
            cpu: 1
            memory: 1Gi
          limits:
            cpu: 1
            memory: 2Gi
      - image: prom/alertmanager:v0.14.0
        name: alertmanager
        args:
        - "--config.file=/etc/alertmanager/config.yml"
        - "--storage.path=/alertmanager"
        ports:
        - containerPort: 9093
          hostPort: 9093
          protocol: TCP
          name: http
        volumeMounts:
        - name: alertmanager-config-volume
          mountPath: /etc/alertmanager
        resources:
          requests:
            memory: 500Mi
          limits:
            memory: 1024Mi
      volumes:
      - name: data
        hostPath:
          path: "/data/monitor"
      - name: time
        hostPath:
          path: "/etc/localtime"
      - configMap:
          name: prometheus-config
        name: config-volume
      - name: alertmanager-config-volume
        configMap:
          name: alertmanager
```
```shell
[root@localhost prometheus]# kubectl apply -f prometheus-deploy.yaml
```

```shell
[root@localhost prometheus]# kubectl get pod -o wide -n kube-ops
prometheus-fc7685cc7-rwlc7   1/1       Running   0          34s       172.30.57.7   192.168.1.198
[root@localhost prometheus]#
```

```shell
[root@localhost ~]# netstat -tunlp |egrep '9090|9093'
tcp6       0      0 :::9090      :::*      LISTEN      4023/docker-proxy
tcp6       0      0 :::9093      :::*      LISTEN      3983/docker-proxy
[root@localhost ~]#
```
Accessing Prometheus
Once Prometheus is up, open the Prometheus dashboard at http://ip:9090/graph and click Status -> Targets.
You should see that Prometheus has successfully reached the Kubernetes API server and is collecting metrics.
Accessing Alertmanager
Once Alertmanager is up, open its dashboard at http://ip:9093 .
The Alertmanager endpoint is also visible in the Prometheus dashboard, at the bottom of Status -> Runtime & Build Information.
Alerting rules
Once the rules defined in Prometheus take effect, they appear under Status -> Rules.
Clicking a rule's `expr` jumps straight to the Prometheus graph page with that query loaded; when writing alerting rules, it helps to test the expression in Prometheus first.
The Alerts page in Prometheus shows the state of each triggered rule.
Currently three hosts have triggered the rules.
An alert goes through three possible states during its lifecycle:

- inactive: the alert is neither pending nor firing.
- pending: the expression has been true for less than the configured threshold duration. Prometheus waits (roughly three minutes here), aggregating matching alert entries before sending them to Alertmanager in one batch.
- firing: the expression has been true for longer than the threshold duration. Prometheus sends the alert to Alertmanager; this is the terminal state.
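The pending-to-firing transition driven by the `for: 2m` clause can be sketched as a small state machine (a simplification of Prometheus's actual rule-evaluation loop, which re-evaluates on each interval):

```python
def alert_state(breach_times, now, for_duration=120):
    """Return the alert state given the timestamps (seconds) at which
    the rule expression has been continuously true. `for_duration`
    mirrors the rule's `for: 2m` clause (120 seconds)."""
    if not breach_times:
        return "inactive"
    # The streak started at the first breach timestamp; the alert only
    # fires once the expression has held for the full `for` window.
    if now - breach_times[0] >= for_duration:
        return "firing"   # held long enough: sent to Alertmanager
    return "pending"      # active, but still inside the `for` window

s1 = alert_state([], now=300)                 # never breached
s2 = alert_state([250, 280, 300], now=300)    # true for only 50s
s3 = alert_state([100, 160, 300], now=300)    # true for 200s
```

This is why a brief CPU spike above 80% never produces an email: it resolves while still pending.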
Once a rule fires, the alert also appears on the Alertmanager dashboard.
Finally, a screenshot of the alert email we successfully received from Alertmanager.
Querying metrics
Prometheus exposes an HTTP API for data queries, and the same query language can be used for complex query tasks.
Click Graph.
To query each Pod's CPU usage, enter: `sum by (pod_name)( rate(container_cpu_usage_seconds_total{image!="", pod_name!=""}[1m] ) )`
More on querying:
https://prometheus.io/docs/prometheus/latest/querying/basics/
https://prometheus.io/docs/prometheus/latest/querying/api/
https://prometheus.io/docs/prometheus/latest/querying/examples/
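A query like the one above can also be issued over HTTP at `/api/v1/query`; the JSON that comes back nests samples under `data.result`. The sketch below parses a canned instant-vector response (the payload values are illustrative, but the envelope follows the documented API format):

```python
import json

# Illustrative body, shaped like GET /api/v1/query?query=... output
payload = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"pod_name": "node-exporter-8d66t"},
       "value": [1523456789.0, "0.042"]}
    ]
  }
}
""")

def vector_to_dict(payload, label):
    """Map each series' chosen label to its float sample value.
    Note the API returns sample values as strings."""
    assert payload["status"] == "success"
    return {s["metric"][label]: float(s["value"][1])
            for s in payload["data"]["result"]}

usage = vector_to_dict(payload, "pod_name")
```

In a live cluster you would fetch the payload with `urllib.request.urlopen("http://ip:9090/api/v1/query?query=...")` instead of the canned string.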
Q&A
Question:
Prometheus fails to start with a "permission denied" error on its data directory.
Answer:
Add the following under spec.spec. in prometheus-deploy.yaml:

```yaml
securityContext:
  runAsUser: 0
```

See the full prometheus-deploy.yaml above for the complete configuration.
Reference: https://github.com/prometheus/prometheus/issues/2939
Question:

```
level=error ts=2018-04-23T13:08:34.417214948Z caller=notify.go:303 component=dispatcher msg="Error on notify" err="dial tcp 14.18.245.164:25: getsockopt: connection timed out"
level=error ts=2018-04-23T13:08:34.417316796Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=3 err="dial tcp 14.18.245.164:25: getsockopt: connection timed out"
```
Answer:
If you see the error above, first check that the Docker container has network access, then verify connectivity to the SMTP server:

```shell
telnet smtp.qq.com 25
```

During testing we found that Tencent mail (both personal QQ mail and Exmail) only supports SMTP over SSL on port 465 and does not support TLS. The `smtp_require_tls` option defaults to true, so it must be set to `smtp_require_tls: false` here. SMTP settings for other mail providers need to be tested individually.
Monitoring an HTTP Kubernetes Cluster with Prometheus
Environment:
- Prometheus v1.0.1
- node-exporter v0.15.2
- Kubernetes v1.8.2

The servers are the same as above.
```shell
# Get the kube-apiserver address and port
```

```shell
[root@localhost prometheus]# kubectl apply -f node-exporter.yaml
```

```shell
[root@localhost prometheus]# kubectl get pod -n kube-ops -o wide
```

```shell
[root@localhost prometheus]# kubectl get svc -n kube-ops -o wide
```

```shell
[root@localhost prometheus]# kubectl get ConfigMap -n kube-ops -o wide
```
Everything else is the same as above; the configuration files for monitoring an HTTP Kubernetes cluster with Prometheus are in the attachments.
Note: in testing, only Prometheus v1.0.1 worked for this setup; the other versions we tried reported errors.
References:
https://blog.qikqiak.com/post/kubernetes-monitor-prometheus-grafana/
https://blog.qikqiak.com/post/update-prometheus-2-in-kubernetes/
https://github.com/cnych/k8s-repo/tree/master/prometheus
https://blog.qikqiak.com/post/alertmanager-of-prometheus-in-practice/
https://blog.csdn.net/qq_21398167/article/details/76008594?locationnum=10&fps=1
https://segmentfault.com/a/1190000008695463
https://prometheus.io/docs/alerting/overview/
Attachments:
Prometheus监控TLS K8S配置文件.zip
Prometheus监控K8S HTTP配置文件.zip
This work is licensed under the Creative Commons Attribution 2.5 China Mainland License. Reposting is welcome, but please credit Jack Wang Blog and keep the reposted article intact. All copyright-related rights reserved.
This article originally appeared on "Jack Wang Blog": http://www.yfshare.vip/2018/03/14/Prometheus%E7%9B%91%E6%8E%A7TLS-Kubernetes%E9%9B%86%E7%BE%A4/