Kubernetes: why is it so important to configure system resource management?

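As a rule, any application needs a dedicated pool of resources in order to run correctly and stably. But what if several applications share the same capacity at the same time? How do you guarantee each of them the minimum resources it needs? How do you limit resource consumption? How do you distribute the load across nodes correctly? How do you make sure horizontal scaling kicks in when the application load grows?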

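We need to start with the basic types of resources that exist in the system: CPU time and RAM, of course. In k8s manifests, these resource types are measured in the following units:

CPU: in cores
RAM: in bytes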
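To make the units concrete, here is a minimal Python sketch that converts the quantity strings used in this article (for example `200m`, `1Gi`, `500M`) into millicores and bytes. The function names are hypothetical, and only a subset of the suffixes Kubernetes accepts is handled.

```python
from decimal import Decimal

# Multipliers for the memory suffixes used in this article (subset only).
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_to_millicores(value: str) -> int:
    """Convert a CPU quantity such as '200m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(Decimal(value) * 1000)


def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity such as '1Gi' or '500M' to bytes."""
    # Check two-letter binary suffixes before one-letter decimal ones.
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            number = Decimal(value[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    return int(value)


print(cpu_to_millicores("200m"))  # 200
print(memory_to_bytes("1Gi"))     # 1073741824
```

Note the difference between `1Gi` (2^30 bytes) and `1G` (10^9 bytes): both forms appear in manifests and in `kubectl describe` output.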

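In addition, two kinds of requirements can be set for each resource: requests and limits. Requests describe the minimum amount of free resources a node must have in order to run the container (and the pod as a whole), while limits set a hard cap on the resources available to the container.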

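It is important to understand that a manifest does not have to define both types explicitly. The behavior is as follows:

If only the limits of a resource are set explicitly, the requests for that resource automatically take the same value as the limits (you can verify this by calling describe on the entity). In other words, the container is capped at exactly the amount of resources it requires to start.
If only the requests of a resource are set explicitly, no upper limit is set for that resource, i.e. the container is limited only by the resources of the node itself.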
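The two rules above can be sketched in a few lines of Python. This is an illustration only: `effective_resources` is a hypothetical helper, and it ignores any namespace-level defaults.

```python
def effective_resources(requests: dict, limits: dict) -> tuple:
    """Return (requests, limits) after the defaulting described above."""
    effective_requests = dict(requests)
    for resource, limit in limits.items():
        # A limit without a request: the request takes the limit's value.
        effective_requests.setdefault(resource, limit)
    # A request without a limit stays unlimited (it is absent from limits).
    return effective_requests, dict(limits)


# A container that requests memory only and limits CPU only.
requests, limits = effective_resources({"memory": "1Gi"}, {"cpu": "200m"})
print(requests)  # {'memory': '1Gi', 'cpu': '200m'}
print(limits)    # {'cpu': '200m'}
```

Here the CPU request is filled in from the CPU limit, while memory has a request but no upper bound.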

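Resource management can be configured not only at the level of a specific container, but also at the namespace level, using the following entities:

LimitRange: describes the limit policy at the container/pod level within a namespace. It is used to define default limits for containers/pods, to prevent the creation of obviously oversized containers/pods (or, conversely, undersized ones), to limit their number, and to define the allowed ratio between limits and requests
ResourceQuota: describes the limit policy for all containers in a namespace as a whole. It is typically used to divide resources between environments (useful when environments are not strictly separated at the node level)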

Below are examples of manifests that set resource limits:

At the level of a specific container:
```
containers:
- name: app-nginx
  image: nginx
  resources:
    requests:
      memory: 1Gi
    limits:
      cpu: 200m
```
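I.e., in this case, to start the nginx container the node must have at least 1G of free RAM and 0.2 CPU, while at most the container can consume 0.2 CPU and all of the available RAM on the node.
At the level of the namespace as a whole: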
```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nxs-test
spec:
  hard:
    requests.cpu: 300m
    requests.memory: 1Gi
    limits.cpu: 700m
    limits.memory: 2Gi
```
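I.e., the sum of the requests of all containers in the default namespace cannot exceed 300m for CPU and 1G for RAM, and the sum of all limits cannot exceed 700m for CPU and 2G for RAM.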
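As a rough illustration of how such a quota behaves, the sketch below checks whether a new container would still fit under the hard caps of the nxs-test quota. The `used` figures and the `fits_quota` helper are hypothetical; CPU is expressed in millicores and memory in MiB.

```python
# Hard caps from the nxs-test ResourceQuota: CPU in millicores, memory in MiB.
QUOTA = {
    "requests.cpu": 300,
    "requests.memory": 1024,
    "limits.cpu": 700,
    "limits.memory": 2048,
}


def fits_quota(used: dict, new: dict, quota: dict = QUOTA) -> bool:
    """True if adding `new` to `used` keeps every tracked total within quota."""
    return all(
        used.get(key, 0) + new.get(key, 0) <= cap
        for key, cap in quota.items()
    )


# Hypothetical totals already consumed in the namespace.
used = {"requests.cpu": 200, "requests.memory": 512,
        "limits.cpu": 400, "limits.memory": 1024}

print(fits_quota(used, {"requests.cpu": 100, "limits.cpu": 300}))  # True
print(fits_quota(used, {"requests.cpu": 150, "limits.cpu": 300}))  # False
```

The second container is rejected because its CPU request would push the namespace total to 350m, above the 300m cap.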
Default limits for containers in a namespace:
```
apiVersion: v1
kind: LimitRange
metadata:
  name: nxs-limit-per-container
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 1Gi
    default:
      cpu: 1
      memory: 2Gi
    min:
      cpu: 50m
      memory: 500Mi
    max:
      cpu: 2
      memory: 4Gi
```
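I.e., for all containers in the default namespace, the request defaults to 100m for CPU and 1G for RAM, and the limit defaults to 1 CPU and 2G. At the same time, the values allowed in requests/limits are bounded for CPU (50m ≤ x ≤ 2) and for RAM (500M ≤ x ≤ 4G).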
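The effect of this LimitRange on a single container can be sketched as follows. This is a simplified model, not the admission controller's actual code: `apply_limit_range` is a hypothetical helper, CPU is in millicores and memory in MiB, and the only checks performed are request ≥ min, limit ≤ max and request ≤ limit.

```python
# Values from the nxs-limit-per-container LimitRange
# (CPU in millicores, memory in MiB).
DEFAULT_REQUEST = {"cpu": 100, "memory": 1024}
DEFAULT_LIMIT = {"cpu": 1000, "memory": 2048}
MIN = {"cpu": 50, "memory": 500}
MAX = {"cpu": 2000, "memory": 4096}


def apply_limit_range(requests: dict, limits: dict) -> tuple:
    """Fill in defaults, then reject values outside the allowed bounds."""
    req, lim = dict(requests), dict(limits)
    for resource in ("cpu", "memory"):
        if resource not in req:
            # An explicit limit without a request makes the request equal
            # to it; otherwise the namespace default request applies.
            req[resource] = lim.get(resource, DEFAULT_REQUEST[resource])
        lim.setdefault(resource, DEFAULT_LIMIT[resource])
        if req[resource] < MIN[resource]:
            raise ValueError(f"{resource} request is below the minimum")
        if lim[resource] > MAX[resource]:
            raise ValueError(f"{resource} limit is above the maximum")
        if req[resource] > lim[resource]:
            raise ValueError(f"{resource} request exceeds its limit")
    return req, lim


print(apply_limit_range({}, {}))
# ({'cpu': 100, 'memory': 1024}, {'cpu': 1000, 'memory': 2048})
```

A container that declares nothing receives the namespace defaults, while a container asking for, say, a 20m CPU request would be rejected.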

Pod-level limits in a namespace:

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: nxs-limit-pod
    spec:
      limits:
      - type: Pod
        max:
          cpu: 4
          memory: 1Gi

I.e., each pod in the default namespace is capped at 4 vCPU and 1G.

Now let's look at the benefits that setting these limits gives us.

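Load balancing mechanism between nodes

As you know, the k8s component responsible for placing pods on nodes is the scheduler, which works according to a specific algorithm. When choosing the best node to run a pod on, this algorithm goes through two stages:

Filtering
Ranking

I.e., following the configured policy, the scheduler first selects the nodes on which the pod can run, based on a set of predicates (including PodFitsResources, which checks whether the node has enough resources to run the pod). Then each of these nodes is awarded points according to a set of priorities (including LeastResourceAllocation / LeastRequestedPriority / BalancedResourceAllocation: the more free resources a node has, the more points it gets), and the pod is scheduled on the node with the most points (if several nodes meet this condition at once, one of them is chosen at random).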
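The two stages can be illustrated with a toy model in Python. It is not the scheduler's real implementation: the node data, the `schedule` function and the scoring rule (average share of unrequested resources) are assumptions made for the example.

```python
import random


def free_share(node: dict) -> float:
    """Average fraction of the node's capacity that is not yet requested."""
    shares = [node["free"][r] / node["capacity"][r] for r in node["capacity"]]
    return sum(shares) / len(shares)


def schedule(pod_request: dict, nodes: dict, rng=random):
    """Pick a node name for the pod, or None if no node passes filtering."""
    # Stage 1 (filtering): keep nodes whose free resources cover the request.
    feasible = {
        name: node for name, node in nodes.items()
        if all(node["free"][r] >= amount for r, amount in pod_request.items())
    }
    if not feasible:
        return None
    # Stage 2 (ranking): more free resources means more points.
    scores = {name: free_share(node) for name, node in feasible.items()}
    best = max(scores.values())
    # Ties are broken at random.
    return rng.choice([name for name, s in scores.items() if s == best])


# Hypothetical nodes: CPU in millicores, memory in MiB.
capacity = {"cpu": 4000, "memory": 8192}
nodes = {
    "node-a": {"capacity": capacity, "free": {"cpu": 100, "memory": 4096}},
    "node-b": {"capacity": capacity, "free": {"cpu": 2000, "memory": 4096}},
    "node-c": {"capacity": capacity, "free": {"cpu": 3000, "memory": 6144}},
}
print(schedule({"cpu": 200, "memory": 512}, nodes))  # node-c
```

In this example node-a is filtered out (only 100m of CPU is free), and node-c wins the ranking because a larger share of its capacity is unrequested.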

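At the same time, you need to understand that when the scheduler evaluates a node's available resources, it relies on the data stored in etcd, i.e. on the amount of requested/limit resources declared by each pod running on that node, and not on actual resource consumption. This information can be found in the output of the kubectl describe node $NODE command, for example: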

    # kubectl describe nodes nxs-k8s-s1
    ..
    Non-terminated Pods:         (9 in total)
      Namespace         Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
      ---------         ----                                         ------------  ----------  ---------------  -------------  ---
      ingress-nginx     nginx-ingress-controller-754b85bf44-qkt2t    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
      kube-system       kube-flannel-26bl4                           150m (0%)     300m (1%)   64M (0%)         500M (1%)      233d
      kube-system       kube-proxy-exporter-cb629                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
      kube-system       kube-proxy-x9fsc                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
      kube-system       nginx-proxy-k8s-worker-s1                    25m (0%)      300m (1%)   32M (0%)         512M (1%)      233d
      nxs-monitoring    alertmanager-main-1                          100m (0%)     100m (0%)   425Mi (1%)       25Mi (0%)      233d
      nxs-logging       filebeat-lmsmp                               100m (0%)     0 (0%)      100Mi (0%)       200Mi (0%)     233d
      nxs-monitoring    node-exporter-v4gdq                          112m (0%)     122m (0%)   200Mi (0%)       220Mi (0%)     233d
    Allocated resources:
      (Total limits may be over 100 percent, ie, overcommitted.)
      Resource           Requests           Limits
      --------           --------           ------
      cpu                487m (3%)          822m (5%)
      memory             15856217600 (2%)   749976320 (3%)
      ephemeral-storage  0 (0%)             0 (0%)

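Here we see all the pods running on a particular node, as well as the resources each pod requests. And this is what the scheduler log looks like when the cronjob-cron-events-1573793820-xt6q9 pod is started (this information appears in the scheduler log when log level 10 is set with the --v=10 startup argument):

Log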

    I1115 07:57:21.637791 1 scheduling_queue.go:908] About to try and schedule pod nxs-stage/cronjob-cron-events-1573793820-xt6q9
    I1115 07:57:21.637804 1 scheduler.go:453] Attempting to schedule pod: nxs-stage/cronjob-cron-events-1573793820-xt6q9
    I1115 07:57:21.638285 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s5 is allowed, Node is running only 16 out of 110 Pods.
    I1115 07:57:21.638300 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s6 is allowed, Node is running only 20 out of 110 Pods.
    I1115 07:57:21.638322 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s3 is allowed, Node is running only 20 out of 110 Pods.
    I1115 07:57:21.638322 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s4 is allowed, Node is running only 17 out of 110 Pods.
    I1115 07:57:21.638334 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, Node is running only 16 out of 110 Pods.
    I1115 07:57:21.638365 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s12 is allowed, Node is running only 9 out of 110 Pods.
    I1115 07:57:21.638334 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s11 is allowed, Node is running only 11 out of 110 Pods.
    I1115 07:57:21.638385 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s1 is allowed, Node is running only 19 out of 110 Pods.
    I1115 07:57:21.638402 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s2 is allowed, Node is running only 21 out of 110 Pods.
    I1115 07:57:21.638383 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, Node is running only 16 out of 110 Pods.
    I1115 07:57:21.638335 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, Node is running only 18 out of 110 Pods.
    I1115 07:57:21.638408 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s13 is allowed, Node is running only 8 out of 110 Pods.
    I1115 07:57:21.638478 1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s10 is allowed, existing pods anti-affinity terms satisfied.
    I1115 07:57:21.638505 1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s8 is allowed, existing pods anti-affinity terms satisfied.
    I1115 07:57:21.638577 1 predicates.go:1369] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s9 is allowed, existing pods anti-affinity terms satisfied.
    I1115 07:57:21.638583 1 predicates.go:829] Schedule Pod nxs-stage/cronjob-cron-events-1573793820-xt6q9 on Node nxs-k8s-s7 is allowed, Node is running only 25 out of 110 Pods.
    I1115 07:57:21.638932 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 9
    I1115 07:57:21.638946 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 2343 millicores 9640186880 memory bytes, score 8
    I1115 07:57:21.638961 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: BalancedResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 9
    I1115 07:57:21.638971 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: BalancedResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7
    I1115 07:57:21.638975 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: LeastResourceAllocation, capacity 39900 millicores 66620170240 memory bytes, total request 4107 millicores 11307422720 memory bytes, score 8
    I1115 07:57:21.638990 1 resource_allocation.go:78] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: LeastResourceAllocation, capacity 39900 millicores 66620178432 memory bytes, total request 5847 millicores 24333637120 memory bytes, score 7
    I1115 07:57:21.639022 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: TaintTolerationPriority, Score: (10)
    I1115 07:57:21.639030 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: TaintTolerationPriority, Score: (10)
    I1115 07:57:21.639034 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: TaintTolerationPriority, Score: (10)
    I1115 07:57:21.639041 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: NodeAffinityPriority, Score: (0)
    I1115 07:57:21.639053 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: NodeAffinityPriority, Score: (0)
    I1115 07:57:21.639059 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: NodeAffinityPriority, Score: (0)
    I1115 07:57:21.639061 1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: InterPodAffinityPriority, Score: (0)
    I1115 07:57:21.639063 1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)
    I1115 07:57:21.639073 1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: InterPodAffinityPriority, Score: (0)
    I1115 07:57:21.639077 1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)
    I1115 07:57:21.639085 1 interpod_affinity.go:237] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: InterPodAffinityPriority, Score: (0)
    I1115 07:57:21.639088 1 selector_spreading.go:146] cronjob-cron-events-1573793820-xt6q9 -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)
    I1115 07:57:21.639103 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s10: SelectorSpreadPriority, Score: (10)
    I1115 07:57:21.639109 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s8: SelectorSpreadPriority, Score: (10)
    I1115 07:57:21.639114 1 generic_scheduler.go:726] cronjob-cron-events-1573793820-xt6q9_nxs-stage -> nxs-k8s-s9: SelectorSpreadPriority, Score: (10)
    I1115 07:57:21.639127 1 generic_scheduler.go:781] Host nxs-k8s-s10 => Score 100037
    I1115 07:57:21.639150 1 generic_scheduler.go:781] Host nxs-k8s-s8 => Score 100034
    I1115 07:57:21.639154 1 generic_scheduler.go:781] Host nxs-k8s-s9 => Score 100037
    I1115 07:57:21.639267 1 scheduler_binder.go:269] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10"
    I1115 07:57:21.639286 1 scheduler_binder.go:279] AssumePodVolumes for pod "nxs-stage/cronjob-cron-events-1573793820-xt6q9", node "nxs-k8s-s10": all PVCs bound and nothing to do
    I1115 07:57:21.639333 1 factory.go:733] Attempting to bind cronjob-cron-events-1573793820-xt6q9 to nxs-k8s-s10

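Here we see that the scheduler first performs filtering and builds a list of 3 nodes on which the pod can run (nxs-k8s-s8, nxs-k8s-s9, nxs-k8s-s10). It then calculates points for each of these nodes on several parameters (including BalancedResourceAllocation and LeastResourceAllocation) to determine the most suitable node. Finally, the pod is scheduled on the node with the most points (here two nodes have the same score of 100037, so one of them is chosen at random: nxs-k8s-s10).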
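The LeastResourceAllocation scores in the log can be reproduced with a few lines of integer arithmetic. The sketch below assumes that each resource is scored as (capacity - requested) * 10 / capacity and that the CPU and memory scores are then averaged; this is an assumption about the scheduler version used here, checked only against the three log lines above.

```python
MAX_PRIORITY = 10


def least_requested_score(requested: int, capacity: int) -> int:
    """Score one resource: the larger the unrequested share, the higher."""
    if capacity == 0 or requested > capacity:
        return 0
    return (capacity - requested) * MAX_PRIORITY // capacity


def node_score(cpu_req: int, cpu_cap: int, mem_req: int, mem_cap: int) -> int:
    """Average the CPU and memory scores using integer arithmetic."""
    cpu = least_requested_score(cpu_req, cpu_cap)
    mem = least_requested_score(mem_req, mem_cap)
    return (cpu + mem) // 2


# Capacity and "total request" values copied from the log above.
print(node_score(2343, 39900, 9640186880, 66620178432))   # nxs-k8s-s10: 8
print(node_score(4107, 39900, 11307422720, 66620170240))  # nxs-k8s-s9: 8
print(node_score(5847, 39900, 24333637120, 66620178432))  # nxs-k8s-s8: 7
```

Note that the inputs are the summed requests of the pods on each node, not their measured consumption, which is exactly why pods without requests distort the picture.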

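Conclusion: if a node runs pods for which no requests/limits are set, then for k8s (from the point of view of resource consumption) this is equivalent to the node having no such pods at all. So if you have, say, a pod with a resource-hungry process (for example, wowza) and no limits are set for it, you can end up in a situation where the pod has in fact eaten all of the node's resources, yet k8s considers the node unloaded. During ranking it will receive the same number of points (specifically, the points that assess available resources) as a node with no running pods, which ultimately leads to an uneven load distribution between nodes.

Pod eviction

As you know, each pod is assigned one of 3 QoS classes:

Guaranteed: assigned when requests and limits for both memory and cpu are set for every container in the pod, and these values match
Burstable: at least one container in the pod has a request or a limit, but the pod does not meet the Guaranteed criteria (for example, requests < limits)
BestEffort: when no container in the pod has any resource restrictions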
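The three rules can be written down as a small classifier. This is a simplified sketch that looks only at cpu and memory and represents each container as a plain dict; `qos_class` is a hypothetical helper, not a Kubernetes API.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' requests/limits (cpu and memory)."""
    resources = ("cpu", "memory")
    # BestEffort: no container declares any request or limit.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    # Guaranteed: every container has cpu and memory limits, and each request
    # is either omitted (it then defaults to the limit) or equal to the limit.
    guaranteed = all(
        r in c.get("limits", {})
        and c.get("requests", {}).get(r, c["limits"][r]) == c["limits"][r]
        for c in containers
        for r in resources
    )
    return "Guaranteed" if guaranteed else "Burstable"


full = {"requests": {"cpu": "100m", "memory": "1Gi"},
        "limits": {"cpu": "100m", "memory": "1Gi"}}
print(qos_class([full]))                             # Guaranteed
print(qos_class([{"requests": {"cpu": "60m"}}, {}]))  # Burstable
print(qos_class([{}, {}]))                            # BestEffort
```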

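At the same time, when a node runs short of resources (disk, memory), kubelet starts ranking and evicting pods according to a specific algorithm that takes into account the pod's priority and its QoS class. For example, in the case of RAM, points are awarded based on the QoS class as follows:

Guaranteed: -998
BestEffort: 1000
Burstable: min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)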
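The Burstable formula is easier to read with numbers plugged in. The sketch below is a direct transcription of the three rules above into Python; the 4 GiB node size is a hypothetical example.

```python
def memory_score(qos: str, memory_request_bytes: int = 0,
                 machine_memory_capacity_bytes: int = 1) -> int:
    """Return the score for a pod according to the three rules listed above."""
    if qos == "Guaranteed":
        return -998
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the memory request, the lower the score.
    raw = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
    return min(max(2, raw), 999)


GIB = 2**30
print(memory_score("Burstable", 1 * GIB, 4 * GIB))  # 750
print(memory_score("Burstable", 0, 4 * GIB))        # 999
print(memory_score("Burstable", 4 * GIB, 4 * GIB))  # 2
```

A Burstable pod therefore always lands between Guaranteed and BestEffort, and a larger memory request moves it closer to the Guaranteed end.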

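I.e., with equal priority, kubelet will first evict pods with the BestEffort QoS class from the node.

Conclusion: if you want to reduce the likelihood of a needed pod being evicted from a node that runs short of resources, then along with priority you also need to take care of setting requests/limits for it.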

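Horizontal autoscaling of application pods (HPA)

When the task is to automatically increase and decrease the number of pods depending on resource usage (system resources such as CPU/RAM, or custom metrics such as rps), a k8s entity called HPA (Horizontal Pod Autoscaler) can help. Its algorithm is as follows: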

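The current reading of the observed resource is determined (currentMetricValue)
The desired value for the resource is determined (desiredMetricValue); for system resources it is set relative to requests
The current number of replicas is determined (currentReplicas)
The desired number of replicas (desiredReplicas) is calculated with the following formula:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

However, scaling does not happen when the ratio (currentMetricValue / desiredMetricValue) is close to 1 (the allowed tolerance can be configured; it defaults to 0.1).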
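As a minimal sketch, the formula and the tolerance check can be expressed like this. The function name and the sample numbers are illustrative, and clamping to minReplicas/maxReplicas is left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     desired_metric: float, tolerance: float = 0.1) -> int:
    """Apply the HPA formula, skipping scaling when the ratio is close to 1."""
    ratio = current_metric / desired_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, 45, 30))  # 5 (ratio 1.5, rounded up from 4.5)
print(desired_replicas(4, 32, 30))  # 4 (ratio is within the tolerance)
print(desired_replicas(4, 10, 30))  # 2 (scale down)
```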

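Let's look at how hpa works using the app-test application (described as a Deployment), where the number of replicas needs to change depending on CPU consumption:

Application manifest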

    kind: Deployment
    # apps/v1beta2 was removed in Kubernetes 1.16; apps/v1 is the stable version
    apiVersion: apps/v1
    metadata:
      name: app-test
    spec:
      selector:
        matchLabels:
          app: app-test
      replicas: 2
      template:
        metadata:
          labels:
            app: app-test
        spec:
          containers:
          - name: nginx
            image: registry.nixys.ru/generic-images/nginx
            imagePullPolicy: Always
            resources:
              requests:
                cpu: 60m
            ports:
            - name: http
              containerPort: 80
          - name: nginx-exporter
            image: nginx/nginx-prometheus-exporter
            resources:
              requests:
                cpu: 30m
            ports:
            - name: nginx-exporter
              containerPort: 9113
            args:
            - -nginx.scrape-uri
            - http://127.0.0.1:80/nginx-status

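I.e., we see that the application pod initially runs in two instances, each containing two containers, nginx and nginx-exporter, and each container has requests for CPU.

HPA manifest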

    # autoscaling/v2beta2 matches the cluster version used in this article;
    # on Kubernetes 1.23 and later use autoscaling/v2
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: app-test-hpa
    spec:
      maxReplicas: 10
      minReplicas: 2
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app-test
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 30

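I.e., we created an hpa that monitors the app-test Deployment and adjusts the number of application pods based on the cpu metric (we expect a pod to consume 30% of the CPU it requests), with the number of replicas ranging from 2 to 10.

Now let's look at how hpa operates when load is applied to one of the pods: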

    # kubectl top pod
    NAME                         CPU(cores)   MEMORY(bytes)
    app-test-78559f8f44-pgs58    101m         243Mi
    app-test-78559f8f44-cj4jz    4m           240Mi

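In total, we have the following:

Desired value (desiredMetricValue): according to the hpa settings, it is 30%
Current value (currentMetricValue): to calculate it, the controller manager computes the average resource consumption in %, i.e. roughly does the following:
1. Gets the absolute values of the pod metrics from the metrics server, i.e. 101m and 4m
2. Calculates the average absolute value, i.e. (101m + 4m) / 2 ≈ 53m
3. Gets the absolute value of the desired resource consumption (for this, the requests of all containers are summed): 60m + 30m = 90m
4. Calculates the average CPU consumption as a percentage of the pod's requests, i.e. 53m / 90m * 100% ≈ 59%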
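Steps 1 to 4 can be followed in code. This is a sketch of the arithmetic as described in the list, not of the controller manager's implementation; `average_utilization` is a hypothetical helper. Without the intermediate rounding of the average to 53m, the result is about 58.3% rather than 59%, which does not change the scaling decision.

```python
def average_utilization(pod_usage_millicores: list,
                        container_requests_millicores: list) -> float:
    """Average pod CPU usage as a percentage of one pod's summed requests."""
    # Steps 1-2: average the absolute usage across pods.
    average_usage = sum(pod_usage_millicores) / len(pod_usage_millicores)
    # Step 3: sum the requests of all containers in a pod.
    pod_request = sum(container_requests_millicores)
    # Step 4: express the average usage relative to the request.
    return average_usage / pod_request * 100


utilization = average_utilization([101, 4], [60, 30])
print(round(utilization, 1))  # 58.3
```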

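Now we have everything we need to determine whether the number of replicas should change. To do this, we calculate the ratio:

ratio = 59% / 30% ≈ 1.97

I.e., the number of replicas should be roughly doubled, to [2 * 1.97] = 4 (rounded up).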

Conclusion: as you can see, for this mechanism to work, requests must be set for all containers in the observed pod.

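Horizontal autoscaling of nodes (Cluster Autoscaler)

A configured hpa alone is not enough to neutralize the negative impact of load spikes on the system. For example, according to the hpa settings, the controller manager decides that the number of replicas needs to be doubled, but the nodes have no free resources to run that many pods (i.e. no node can provide the resources requested by the pods), and these pods go into the Pending state.

In this case, if the provider offers a suitable IaaS/PaaS (for example, GKE/GCE, AKS, EKS, etc.), a tool such as Node Autoscaler can help. It lets you set the maximum and minimum number of nodes in the cluster and automatically adjusts the current number of nodes (by calling the cloud provider API to order/remove nodes) when the cluster runs short of resources and pods cannot be scheduled (are in the Pending state).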
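The scale-up side of this behavior can be pictured with a deliberately simple model. It is not how any real autoscaler is implemented: `target_node_count` and the `pods_per_node` parameter (how many of the Pending pods fit on one new node, judged by their requests) are assumptions made for the example.

```python
def target_node_count(current_nodes: int, pending_pods: int,
                      pods_per_node: int, min_nodes: int,
                      max_nodes: int) -> int:
    """Toy model: add enough nodes for the Pending pods, within [min, max]."""
    extra = -(-pending_pods // pods_per_node)  # ceiling division
    return max(min_nodes, min(max_nodes, current_nodes + extra))


print(target_node_count(3, 0, 4, 2, 10))  # 3 (nothing is Pending)
print(target_node_count(3, 5, 4, 2, 10))  # 5 (two more nodes for 5 pods)
print(target_node_count(9, 9, 4, 2, 10))  # 10 (capped at the maximum)
```

Even in this toy form the dependency is visible: without requests on the pods there is no way to estimate how many of them fit on a node.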

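Conclusion: to be able to autoscale nodes, requests must be set in the pod containers so that k8s can correctly assess the load on the nodes and, accordingly, report that the cluster has no resources to start the next pod.

Conclusion

It should be noted that setting resource limits for containers is not a prerequisite for successfully running an application, but it is still better to do so for the following reasons:

For more accurate scheduler operation in terms of load balancing between k8s nodes
To reduce the likelihood of pod eviction events
For horizontal autoscaling of application pods (HPA)
For horizontal autoscaling of nodes (Cluster Autoscaling) with cloud providers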
