In the previous post we looked at extending the native apiserver with a custom apiserver through the k8s APIService resource; for a refresher, see https://www.cnblogs.com/qiuhom-1874/p/14279850.html. Today we'll talk about monitoring a k8s cluster.
In the previous post we used a custom apiserver, metrics-server, to extend the native apiserver so that kubectl top node/pod can report CPU and memory metrics for nodes and for pods in a given namespace. Those metrics give us a fairly clear picture of resource usage on a node or pod, which is itself a form of monitoring. But metrics-server only collects CPU and memory data, which is not enough when we want to understand other aspects of a node or pod, so we need a dedicated monitoring system for the cluster's nodes and pods. Prometheus is a high-performance monitoring program built around three main components: the retrieval component is responsible for collecting data and can work with other, external programs to do so; the TSDB component, a time-series database, stores the metric data; and the HTTP server component exposes a RESTful API that clients use for queries, listening on port 9090 by default.
Overall topology of the Prometheus monitoring system
Tip: the figure above shows the topology of the Prometheus monitoring system. Pushgateway acts as a kind of proxy for the Prometheus retrieval component: it collects metrics from pods that push their data actively. Prometheus distinguishes between push-based and pull-based monitoring: in push mode the monitored target actively pushes its data to the server, while in pull mode the target passively waits for the server to come and scrape it. By default Prometheus works in pull mode, i.e. the server actively scrapes the monitored targets. Node-level metrics can be collected with node-exporter (container-level metrics come from the kubelet's built-in cAdvisor, which the Prometheus configuration later in this post also scrapes). Alertmanager provides alerting for the Prometheus monitoring system, and the Prometheus web UI provides a web page for querying it.
Prometheus monitoring system components
kube-state-metrics: provides counts and state metrics for objects in the k8s cluster, e.g. how many nodes there are, the number of pods, and so on;
node-exporter: collects metrics from the node it runs on;
alertmanager: provides alerting for the Prometheus monitoring system;
prometheus-server: stores and processes metric data and exposes a RESTful API for user queries;
Annotations that control whether Prometheus scrapes a pod
prometheus.io/scrape: declares whether the pod allows its metrics to be scraped; true allows it, false forbids it;
prometheus.io/path: the URL path used to scrape metrics, usually /metrics;
prometheus.io/port: the port used to scrape metrics; a minimal example of all three follows.
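As an illustration, a hypothetical pod that opts in to scraping could carry all three annotations. Everything below (pod name, image, port) is made up for the example:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app                      # hypothetical pod, for illustration only
  annotations:
    prometheus.io/scrape: "true"      # allow Prometheus to scrape this pod
    prometheus.io/path: "/metrics"    # path serving the metrics
    prometheus.io/port: "8080"        # port serving the metrics
spec:
  containers:
  - name: demo-app
    image: demo-app:latest            # assumed to expose Prometheus metrics on :8080
    ports:
    - containerPort: 8080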
Deploying the Prometheus monitoring system
1. Deploy kube-state-metrics
Create the kube-state-metrics RBAC manifest
[root@master01 kube-state-metrics]# cat kube-state-metrics-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list","watch"]
- apiGroups: ["extensions","apps"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list","watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list","watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list","watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list","watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions","apps"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get","update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
[root@master01 kube-state-metrics]#
Tip: the manifest above creates a ServiceAccount plus two roles (a ClusterRole and a Role) and binds the ServiceAccount to them, so the ServiceAccount ends up with the permissions those roles define;
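A quick way to spot-check the resulting permissions is kubectl's impersonation support; a sketch:

kubectl auth can-i list pods --as=system:serviceaccount:kube-system:kube-state-metrics      # expect yes
kubectl auth can-i watch nodes --as=system:serviceaccount:kube-system:kube-state-metrics    # expect yes
kubectl auth can-i delete pods --as=system:serviceaccount:kube-system:kube-state-metrics    # expect no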
Create the kube-state-metrics Service manifest
[root@master01 kube-state-metrics]# cat kube-state-metrics-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
[root@master01 kube-state-metrics]#
Create the kube-state-metrics Deployment manifest
[root@master01 kube-state-metrics]# cat kube-state-metrics-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.0.0-beta
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v2.0.0-beta
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v2.0.0-beta
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v2.0.0-beta
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: k8s.gcr.io/addon-resizer:1.8.7
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 30Mi
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        command:
        - /pod_nanny
        - --config-dir=/etc/config
        - --container=kube-state-metrics
        - --cpu=100m
        - --extra-cpu=1m
        - --memory=100Mi
        - --extra-memory=2Mi
        - --threshold=5
        - --deployment=kube-state-metrics
      volumes:
      - name: config-volume
        configMap:
          name: kube-state-metrics-config
---
# Config map for resource configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
[root@master01 kube-state-metrics]#
Apply the three manifests above to deploy the kube-state-metrics component
[root@master01 kube-state-metrics]# ls
kube-state-metrics-deployment.yaml  kube-state-metrics-rbac.yaml  kube-state-metrics-service.yaml
[root@master01 kube-state-metrics]# kubectl apply -f .
deployment.apps/kube-state-metrics created
configmap/kube-state-metrics-config created
serviceaccount/kube-state-metrics created
clusterrole.rbac.authorization.k8s.io/kube-state-metrics created
role.rbac.authorization.k8s.io/kube-state-metrics-resizer created
clusterrolebinding.rbac.authorization.k8s.io/kube-state-metrics created
rolebinding.rbac.authorization.k8s.io/kube-state-metrics created
service/kube-state-metrics created
[root@master01 kube-state-metrics]#
Verify: were the corresponding pod and service created successfully?
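For example (the label and names are taken from the manifests above):

kubectl get pods -l "k8s-app=kube-state-metrics" -n kube-system
kubectl get svc kube-state-metrics -n kube-system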
Tip: both the pod and the svc were created normally;
Verify: can we reach the metrics on port 8080 of the service, URL path /metrics?
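For example, from any node in the cluster (substitute the ClusterIP that kubectl get svc reports for kube-state-metrics; the ports come from the Service manifest above):

curl -s http://<cluster-ip>:8080/metrics | head -20     # object-state metrics
curl -s http://<cluster-ip>:8081/metrics | head -5      # kube-state-metrics' own telemetry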
Tip: requesting /metrics on port 8080 of the service returns data, which means the kube-state-metrics component is installed and working;
2. Deploy node-exporter
Create the node-exporter Service manifest
[root@master01 node_exporter]# cat node-exporter-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "NodeExporter"
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9100
    protocol: TCP
    targetPort: 9100
  selector:
    k8s-app: node-exporter
[root@master01 node_exporter]#
Create the node-exporter DaemonSet manifest
[root@master01 node_exporter]# cat node-exporter-ds.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    k8s-app: node-exporter
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.0.1
spec:
  selector:
    matchLabels:
      k8s-app: node-exporter
      version: v1.0.1
  updateStrategy:
    type: OnDelete
  template:
    metadata:
      labels:
        k8s-app: node-exporter
        version: v1.0.1
    spec:
      priorityClassName: system-node-critical
      containers:
      - name: prometheus-node-exporter
        image: "prom/node-exporter:v1.0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        ports:
        - name: metrics
          containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        resources:
          limits:
            memory: 50Mi
          requests:
            cpu: 100m
            memory: 50Mi
      hostNetwork: true
      hostPID: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
[root@master01 node_exporter]#
Tip: the manifest above runs the node-exporter pods under a DaemonSet controller, shares the host's network and PID namespaces with the pods, and adds a toleration for the master node's taint. This way node-exporter runs one pod on every node in the cluster and uses that pod to collect the node's metrics;
Apply the two manifests above to deploy node-exporter
[root@master01 node_exporter]# ls
node-exporter-ds.yml  node-exporter-service.yaml
[root@master01 node_exporter]# kubectl apply -f .
daemonset.apps/node-exporter created
service/node-exporter created
[root@master01 node_exporter]#
Verify: were the pod and svc created normally?
[root@master01 node_exporter]# kubectl get pods -l "k8s-app=node-exporter" -n kube-system
NAME                  READY   STATUS    RESTARTS   AGE
node-exporter-6zgkz   1/1     Running   0          107s
node-exporter-9mvxr   1/1     Running   0          107s
node-exporter-jbll7   1/1     Running   0          107s
node-exporter-s7vvt   1/1     Running   0          107s
node-exporter-xmrjh   1/1     Running   0          107s
[root@master01 node_exporter]# kubectl get svc -n kube-system
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
kube-dns             ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   39d
kube-state-metrics   ClusterIP   10.110.110.216   <none>        8080/TCP,8081/TCP        20m
metrics-server       ClusterIP   10.98.59.116     <none>        443/TCP                  46h
node-exporter        ClusterIP   None             <none>        9100/TCP                 116s
[root@master01 node_exporter]#
Verify: can we reach the metrics on port 9100 of any node, URL path /metrics?
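Because the DaemonSet shares the host network, port 9100 is bound on every node's own IP; a sketch (replace <node-ip> with any node's address):

curl -s http://<node-ip>:9100/metrics | grep '^node_load'     # e.g. the load-average gauges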
Tip: the /metrics URL on that port returns data, which means the node-exporter component was deployed successfully;
3. Deploy alertmanager
Create the alertmanager PVC manifest
[root@master01 alertmanager]# cat alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  # storageClassName: standard
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
[root@master01 alertmanager]#
Create the PVs. Since the PVC above has its storageClassName commented out, it relies on pre-provisioned PVs to bind against; the manifest below creates three NFS-backed PVs:
[root@master01 ~]# cat pv-demo.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-v1
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes: ["ReadWriteOnce","ReadWriteMany","ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /data/v1
    server: 192.168.0.99
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-v2
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes: ["ReadWriteOnce","ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /data/v2
    server: 192.168.0.99
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-v3
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes: ["ReadWriteOnce","ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /data/v3
    server: 192.168.0.99
[root@master01 ~]#
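Before applying it, it may be worth confirming that the exports referenced by the manifest actually exist on the NFS server (showmount is part of the standard NFS client utilities):

showmount -e 192.168.0.99     # should list /data/v1, /data/v2 and /data/v3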
Apply the manifest to create the PVs
[root@master01 ~]# kubectl apply -f pv-demo.yaml
persistentvolume/nfs-pv-v1 created
persistentvolume/nfs-pv-v2 created
persistentvolume/nfs-pv-v3 created
[root@master01 ~]# kubectl get pv
NAME        CAPACITY   ACCESS MODES    RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
nfs-pv-v1   5Gi        RWO,ROX,RWX     Retain           Available                                   4s
nfs-pv-v2   5Gi        RWO,ROX         Retain           Available                                   4s
nfs-pv-v3   5Gi        RWO,ROX         Retain           Available                                   4s
[root@master01 ~]#
Create the alertmanager Service manifest
[root@master01 alertmanager]# cat alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 9093
    nodePort: 30093
  selector:
    k8s-app: alertmanager
  type: "NodePort"
[root@master01 alertmanager]#
Create the alertmanager ConfigMap manifest
[root@master01 alertmanager]# cat alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global: null
    receivers:
    - name: default-receiver
    route:
      group_interval: 5m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 3h
[root@master01 alertmanager]#
Create the alertmanager Deployment manifest
[root@master01 alertmanager]# cat alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
    spec:
      priorityClassName: system-cluster-critical
      containers:
      - name: prometheus-alertmanager
        image: "prom/alertmanager:v0.14.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/alertmanager.yml
        - --storage.path=/data
        - --web.external-url=/
        ports:
        - containerPort: 9093
        readinessProbe:
          httpGet:
            path: /#/status
            port: 9093
          initialDelaySeconds: 30
          timeoutSeconds: 30
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: storage-volume
          mountPath: "/data"
          subPath: ""
        resources:
          limits:
            cpu: 10m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
#      - name: prometheus-alertmanager-configmap-reload
#        image: "jimmidyson/configmap-reload:v0.1"
#        imagePullPolicy: "IfNotPresent"
#        args:
#        - --volume-dir=/etc/config
#        - --webhook-url=http://localhost:9093/-/reload
#        volumeMounts:
#        - name: config-volume
#          mountPath: /etc/config
#          readOnly: true
#        resources:
#          limits:
#            cpu: 10m
#            memory: 10Mi
#          requests:
#            cpu: 10m
#            memory: 10Mi
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config
      - name: storage-volume
        persistentVolumeClaim:
          claimName: alertmanager
[root@master01 alertmanager]#
Apply the four manifests above to deploy alertmanager
[root@master01 alertmanager]# ls
alertmanager-configmap.yaml  alertmanager-deployment.yaml  alertmanager-pvc.yaml  alertmanager-service.yaml
[root@master01 alertmanager]# kubectl apply -f .
configmap/alertmanager-config created
deployment.apps/alertmanager created
persistentvolumeclaim/alertmanager created
service/alertmanager created
[root@master01 alertmanager]#
Verify: were the pod and svc created normally?
[root@master01 alertmanager]# kubectl get pods -l "k8s-app=alertmanager" -n kube-system
NAME                            READY   STATUS    RESTARTS   AGE
alertmanager-6546bf7676-lt9jq   1/1     Running   0          85s
[root@master01 alertmanager]# kubectl get svc -n kube-system
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
alertmanager     NodePort    10.99.246.148   <none>        80:30093/TCP             92s
kube-dns         ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   39d
metrics-server   ClusterIP   10.98.59.116    <none>        443/TCP                  47h
node-exporter    ClusterIP   None            <none>        9100/TCP                 13m
[root@master01 alertmanager]#
Verify: can alertmanager be reached on port 30093 of any node?
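A command-line sketch, assuming this alertmanager version serves the v1 HTTP API (replace <node-ip> with any node's address):

curl -s http://<node-ip>:30093/api/v1/status     # returns config and version info as JSON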
Tip: the page above is reachable on that port, which means alertmanager was deployed successfully;
4. Deploy prometheus-server
Create the Prometheus RBAC manifest
[root@master01 prometheus-server]# cat prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
[root@master01 prometheus-server]#
Create the Prometheus Service manifest
[root@master01 prometheus-server]# cat prometheus-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
    nodePort: 30090
  selector:
    k8s-app: prometheus
  type: NodePort
[root@master01 prometheus-server]#
Create the Prometheus ConfigMap manifest
[root@master01 prometheus-server]# cat prometheus-configmap.yaml
# Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name

    alerting:
      alertmanagers:
      - kubernetes_sd_configs:
        - role: pod
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          regex: kube-system
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_k8s_app]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          regex:
          action: drop
[root@master01 prometheus-server]#
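The scrape jobs and relabel rules above follow the upstream Kubernetes Prometheus addon configuration. As a sanity check, once the ConfigMap has been applied you can pull the embedded config back out and lint it with promtool (shipped with Prometheus, assuming it is available on the host); this is a sketch, and promtool may complain about the in-cluster certificate/token paths when run outside a pod:

kubectl get cm prometheus-config -n kube-system -o jsonpath='{.data.prometheus\.yml}' > /tmp/prometheus.yml
promtool check config /tmp/prometheus.yml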
Create the Prometheus StatefulSet manifest
[root@master01 prometheus-server]# cat prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.24.0
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
    type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown","-R","65534:65534","/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
#      - name: prometheus-server-configmap-reload
#        image: "jimmidyson/configmap-reload:v0.1"
#        imagePullPolicy: "IfNotPresent"
#        args:
#        - --volume-dir=/etc/config
#        - --webhook-url=http://localhost:9090/-/reload
#        volumeMounts:
#        - name: config-volume
#          mountPath: /etc/config
#          readOnly: true
#        resources:
#          limits:
#            cpu: 10m
#            memory: 10Mi
#          requests:
#            cpu: 10m
#            memory: 10Mi
      - name: prometheus-server
        image: "prom/prometheus:v2.24.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        # based on 10 running nodes with 30 pods each
        resources:
          limits:
            cpu: 200m
            memory: 1000Mi
          requests:
            cpu: 200m
            memory: 1000Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      # storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "5Gi"
[root@master01 prometheus-server]#
Tip: before applying the manifest above, make sure there is enough PV capacity available; the volumeClaimTemplates in it requests 5Gi;
Apply the four manifests above to deploy the Prometheus server
[root@master01 prometheus-server]# ls
prometheus-configmap.yaml  prometheus-rbac.yaml  prometheus-service.yaml  prometheus-statefulset.yaml
[root@master01 prometheus-server]# kubectl apply -f .
configmap/prometheus-config created
serviceaccount/prometheus created
clusterrole.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
service/prometheus created
statefulset.apps/prometheus created
[root@master01 prometheus-server]#
Verify: were the pod and svc created successfully?
[root@master01 prometheus-server]# kubectl get pods -l "k8s-app=prometheus" -n kube-system
NAME           READY   STATUS    RESTARTS   AGE
prometheus-0   1/1     Running   0          2m20s
[root@master01 prometheus-server]# kubectl get svc -n kube-system
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
alertmanager     NodePort    10.99.246.148   <none>        80:30093/TCP             10m
kube-dns         ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   39d
metrics-server   ClusterIP   10.98.59.116    <none>        443/TCP                  47h
node-exporter    ClusterIP   None            <none>        9100/TCP                 22m
prometheus       NodePort    10.111.155.1    <none>        9090:30090/TCP           2m27s
[root@master01 prometheus-server]#
Verify: can Prometheus be reached on port 30090 of any node?
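A command-line sketch using the same health endpoints the StatefulSet's probes rely on (replace <node-ip> with any node's address):

curl -s http://<node-ip>:30090/-/healthy     # reports that Prometheus is healthy
curl -s http://<node-ip>:30090/-/ready       # reports that Prometheus is ready to serve traffic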
Tip: the page above is reachable, which means the Prometheus server deployment is fine;
Viewing metric data through the web UI above
Tip: pick the metric you want to inspect and click Execute, and the corresponding graph is rendered; the same queries can also be issued over the HTTP API, as sketched below. With that, the Prometheus monitoring system is deployed. Next we deploy grafana and configure it to display the monitoring data using Prometheus as its data source;
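A sketch of the HTTP query API (node_load1 is one of the gauges node-exporter exposes; replace <node-ip>):

curl -s 'http://<node-ip>:30090/api/v1/query?query=up'            # one sample per scraped target, 1 = scrape succeeded
curl -s 'http://<node-ip>:30090/api/v1/query?query=node_load1'    # 1-minute load average per node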
Deploy grafana
Create the grafana deployment manifest
[root@master01 grafana]# cat grafana.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      task: monitoring
      k8s-app: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
#        - name: INFLUXDB_HOST
#          value: monitoring-influxdb
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: monitoring-grafana
  name: monitoring-grafana
  namespace: kube-system
spec:
  ports:
  - port: 80
    targetPort: 3000
  selector:
    k8s-app: grafana
  type: "NodePort"
[root@master01 grafana]#
Apply the manifest to deploy grafana
[root@master01 grafana]# ls
grafana.yaml
[root@master01 grafana]# kubectl apply -f .
deployment.apps/monitoring-grafana created
service/monitoring-grafana created
[root@master01 grafana]#
Verify: were the pod and svc both created?
[root@master01 grafana]# kubectl get pods -l "k8s-app=grafana" -n kube-system
NAME                                  READY   STATUS    RESTARTS   AGE
monitoring-grafana-6c74ccc5dd-grjzf   1/1     Running   0          87s
[root@master01 grafana]# kubectl get svc -n kube-system
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
alertmanager         NodePort    10.99.246.148   <none>        80:30093/TCP             82m
kube-dns             ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   39d
metrics-server       ClusterIP   10.98.59.116    <none>        443/TCP                  2d
monitoring-grafana   NodePort    10.100.230.71   <none>        80:30196/TCP             92s
node-exporter        ClusterIP   None            <none>        9100/TCP                 94m
prometheus           NodePort    10.111.155.1    <none>        9090:30090/TCP           74m
[root@master01 grafana]#
Verify: can the pod be reached through the port exposed by the grafana service?
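A command-line sketch against grafana's health endpoint (replace <node-ip>; 30196 is the NodePort shown by kubectl get svc above):

curl -s http://<node-ip>:30196/api/health     # returns grafana's version and database status as JSON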
Configure grafana
1. Configure Prometheus as grafana's data source (this can also be done through grafana's HTTP API, sketched below)
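Because the deployment above enables anonymous access with the Admin role, the data source can also be created through grafana's HTTP API instead of the web form. A sketch, using the in-cluster DNS name of the prometheus service as the data source URL (replace <node-ip>):

curl -s -X POST http://<node-ip>:30196/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus.kube-system.svc.cluster.local:9090","access":"proxy","isDefault":true}'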
2. Create a monitoring dashboard
Tip: go to grafana.com and download a dashboard template;
Tip: select the downloaded template file, choose the matching data source, and click Import. If panels show no data, it is usually because the metric names in the template differ from the metric names in your Prometheus; you can edit the template file to match the names your own environment actually exposes, as sketched below.
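To find out which metric names your Prometheus actually exposes before editing the template, you can list them through the label-values API; a sketch (replace <node-ip>):

curl -s http://<node-ip>:30090/api/v1/label/__name__/values | tr ',' '\n' | grep node_     # metric names exported by node-exporter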