k8s descheduler

principle

kube-scheduler uses its scheduling algorithms to pick the best node for a Pod, but it only makes that decision at the moment a new Pod is scheduled, based on the state of the Kubernetes cluster at that time. The cluster keeps changing afterwards: for example, when a node is drained for maintenance, its Pods are evicted onto other nodes, and once the maintenance is finished they do not automatically move back. The cluster therefore ends up unbalanced, and the descheduler can be used to rebalance it.
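
A typical way this imbalance appears is the drain/uncordon cycle sketched below (the node name is a placeholder, not from this cluster):

# Draining a node for maintenance evicts its Pods onto the remaining nodes.
kubectl drain node-1 --ignore-daemonsets
# ... perform the maintenance ...
# Uncordoning makes the node schedulable again, but the evicted Pods stay where
# they landed; the node sits mostly empty until something reschedules them.
kubectl uncordon node-1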

install descheduler
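
The manifest below deploys descheduler v0.19.0 as a CronJob in kube-system: a ConfigMap holding the DeschedulerPolicy (RemoveDuplicates, RemovePodsViolatingInterPodAntiAffinity, RemovePodsHavingTooManyRestarts, LowNodeUtilization, plus a disabled PodLifeTime strategy), the CronJob itself, and the ServiceAccount, ClusterRole and ClusterRoleBinding it needs to list nodes and evict Pods.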


apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
        enabled: true
        params:
          removeDuplicates:
            excludeOwnerKinds:
              - "DaemonSet"
              - "StatefulSet"
          namespaces:
            include:
              - uat
              - uat2
              - uat3
              - aux
              - preprod
              - production
              - staging
              - qa
      "RemovePodsViolatingInterPodAntiAffinity":
        enabled: true
      "RemovePodsHavingTooManyRestarts":
        enabled: true
        params:
          podsHavingTooManyRestarts:
            podRestartThreshold: 100
            includingInitContainers: true
      "LowNodeUtilization":
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              "cpu": 20
              "memory": 20
              "pods": 20
            targetThresholds:
              "cpu": 70
              "memory": 70
              "pods": 40
      "PodLifeTime":
        enabled: false
        params:
          podLifeTime:
            maxPodLifeTimeSeconds: 86400
            podStatusPhases:
              - "Pending"
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
spec:
  schedule: "*/2 * * * *"
  concurrencyPolicy: "Forbid"
  successfulJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        metadata:
          name: descheduler-pod
        spec:
          priorityClassName: system-cluster-critical
          containers:
            - name: descheduler
              image: k8s.gcr.io/descheduler/descheduler:v0.19.0
              volumeMounts:
                - mountPath: /policy-dir
                  name: policy-volume
              command:
                - "/bin/descheduler"
              args:
                - "--policy-config-file"
                - "/policy-dir/policy.yaml"
                - "--v"
                - "3"
          restartPolicy: "Never"
          serviceAccountName: descheduler-sa
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy-configmap
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list", "delete"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: ["scheduling.k8s.io"]
    resources: ["priorityclasses"]
    verbs: ["get", "watch", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
  - name: descheduler-sa
    kind: ServiceAccount
    namespace: kube-system
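
Save the manifest above to a file and apply it; the CronJob then runs the descheduler every 2 minutes. The file name and the job name below are my own choices, not from the original post:

kubectl apply -f descheduler.yaml

# Optionally trigger a one-off run instead of waiting for the next schedule:
kubectl create job descheduler-manual --from=cronjob/descheduler-cronjob -n kube-system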

result

➜  descheduler git:(master) k get cronjob  -A
NAMESPACE     NAME                  SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
kube-system   descheduler-cronjob   */2 * * * *   True      0        3d2h            8d


➜ descheduler git:(master) k get pod -A | grep desc
kube-system descheduler-cronjob-1606710120-np5xc 0/1 Completed 0 3d2h

➜ descheduler git:(master) k logs -f descheduler-cronjob-1606710120-np5xc -n kube-system
I1130 04:22:07.533753 1 node.go:45] node lister returned empty list, now fetch directly
I1130 04:22:07.541476 1 toomanyrestarts.go:73] Processing node: cn-hongkong.10.0.3.20
I1130 04:22:07.565399 1 toomanyrestarts.go:73] Processing node: cn-hongkong.10.0.3.21
I1130 04:22:07.577738 1 duplicates.go:73] Processing node: "cn-hongkong.10.0.3.20"
I1130 04:22:07.593279 1 duplicates.go:73] Processing node: "cn-hongkong.10.0.3.21"
I1130 04:22:07.632923 1 lownodeutilization.go:203] Node "cn-hongkong.10.0.3.20" is appropriately utilized with usage: api.ResourceThresholds{"cpu":32.5, "memory":29.3980074559352, "pods":17.1875}
I1130 04:22:07.632997 1 lownodeutilization.go:203] Node "cn-hongkong.10.0.3.21" is appropriately utilized with usage: api.ResourceThresholds{"cpu":30, "memory":23.303233323623655, "pods":15.625}
I1130 04:22:07.633016 1 lownodeutilization.go:101] Criteria for a node under utilization: CPU: 20, Mem: 20, Pods: 20
I1130 04:22:07.633028 1 lownodeutilization.go:105] No node is underutilized, nothing to do here, you might tune your thresholds further
I1130 04:22:07.633042 1 pod_antiaffinity.go:72] Processing node: "cn-hongkong.10.0.3.20"
I1130 04:22:07.647199 1 pod_antiaffinity.go:72] Processing node: "cn-hongkong.10.0.3.21"
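
In this run both nodes sit above the 20% underutilization thresholds (roughly 30% CPU, 23-29% memory, 16-17% pods), so LowNodeUtilization finds no underutilized node and evicts nothing, and none of the other strategies log any evictions either.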