
K8S: A Self-Inflicted "False Alarm" (Human Error Made a Namespace Impossible to Delete)

Author: 不背锅运维
Published: 2023-05-07 22:07:04

Problem Background

Here is the background: I have a test Kubernetes cluster, and I noticed that a namespace could no longer be deleted. It just sat in the Terminating state, and even a forced delete did not work. I then created another namespace by hand, named "test-b", and it could not be deleted either. So I started digging. In the end it turned out to be a completely unglamorous, self-inflicted "false alarm". The result does not matter much; what I want to share is the troubleshooting process.

Troubleshooting Process

  1. A normal delete of the namespace blocks forever and can only be interrupted with Ctrl+C:
[root@k8s-b-master ~]# kubectl delete ns test-b
namespace "test-b" deleted
  2. Its status stays stuck in Terminating:
[root@k8s-b-master ~]# kubectl get ns
test-b            Terminating   18h
  3. A forced delete blocks in exactly the same way and also has to be interrupted with Ctrl+C:
[root@k8s-b-master ~]# kubectl delete namespace test-b --grace-period=0 --force
  4. Describing the namespace shows the following:
[root@k8s-b-master ~]# kubectl describe ns test-b
Name:         test-b
Labels:       kubernetes.io/metadata.name=test-b
Annotations:  <none>
Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                  Message
  ----                                         ------  ------------------               ------                  -------
  NamespaceDeletionDiscoveryFailure            True    Fri, 05 May 2023 14:06:52 +0800  DiscoveryFailed         Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
  NamespaceDeletionGroupVersionParsingFailure  False   Fri, 05 May 2023 14:06:52 +0800  ParsedGroupVersions     All legacy kube types successfully parsed
  NamespaceDeletionContentFailure              False   Fri, 05 May 2023 14:06:52 +0800  ContentDeleted          All content successfully deleted, may be waiting on finalization
  NamespaceContentRemaining                    False   Fri, 05 May 2023 14:06:52 +0800  ContentRemoved          All content successfully removed
  NamespaceFinalizersRemaining                 False   Fri, 05 May 2023 14:06:52 +0800  ContentHasNoFinalizers  All content-preserving finalizers finished

No resource quota.

No LimitRange resource.

The problem is right here: DiscoveryFailed: Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Could the API Server itself be broken? Let's keep digging.
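
A quick way to cross-check an error like this is to look at the aggregated APIService object that the message names; its AVAILABLE column and status conditions usually point straight at the failing backend. A minimal sketch (generic commands, not output from this incident):

# List all registered API services and their availability
kubectl get apiservices

# Inspect the group named in the error; the status conditions explain
# why it is unavailable (e.g. no endpoints behind the backing Service)
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml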

  5. Check whether the control-plane components are running normally:
[root@k8s-b-master ~]# kubectl get componentstatus
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE                         ERROR
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true","reason":""}
scheduler            Healthy   ok

From this output, controller-manager, scheduler and the etcd member are all in the Healthy state.
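
Since componentstatus is deprecated anyway, another way to ask the API Server directly how it is doing is to hit its health endpoints; the verbose variants list every individual check. A minimal sketch:

# Per-check readiness of the API Server
kubectl get --raw='/readyz?verbose'

# Liveness checks, same idea
kubectl get --raw='/livez?verbose'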

  6. Next, check the API Server logs for errors or anything unusual:
# Get the API Server Pod name:
[root@k8s-b-master ~]# kubectl get pods -n kube-system -l component=kube-apiserver -o name
pod/kube-apiserver-k8s-b-master

# View the API Server logs:
[root@k8s-b-master ~]# kubectl logs -n kube-system kube-apiserver-k8s-b-master
...
...
W0506 01:00:12.965627       1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:00:12.965711       1 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0506 01:00:12.965722       1 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0506 01:00:12.968678       1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:00:12.968709       1 controller.go:113] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService
I0506 01:00:12.968719       1 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I0506 01:00:43.794546       1 controller.go:616] quota admission added evaluator for: endpointslices.discovery.k8s.io
I0506 01:00:44.023629       1 controller.go:616] quota admission added evaluator for: endpoints
W0506 01:01:12.965985       1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:01:12.966062       1 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0506 01:01:12.966069       1 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0506 01:01:12.969496       1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:01:12.969527       1 controller.go:113] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService
...
...
# The rest of the log repeats the same pattern
...
...

Judging from the logs, the problem is tied to metrics.k8s.io/v1beta1, the API group used to serve resource metrics for the cluster. It looks as though the backend serving that API, normally the metrics-server, is failing, so the API Server cannot complete requests that involve this group.
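
The same failure is visible from the client side: anything that goes through the metrics API fails while the rest of the cluster behaves normally. For example (illustrative only; both calls go through metrics.k8s.io/v1beta1 and would be expected to fail here with a ServiceUnavailable / "Metrics API not available" style error):

kubectl top nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1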

  7. But Kubernetes doesn't ship the metrics-server component by default. Let's check anyway:
[root@k8s-b-master ~]# kubectl get pods -n kube-system -l k8s-app=metrics-server
No resources found in kube-system namespace.

There is no Pod labeled k8s-app=metrics-server in the kube-system namespace, which is perfectly normal: Kubernetes does not install Metrics Server out of the box. So why is something suddenly depending on it?
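
A more direct question to ask is which Service actually backs the v1beta1.metrics.k8s.io APIService. A one-line sketch; in a kube-prometheus installation this typically points at monitoring/prometheus-adapter rather than kube-system/metrics-server:

# Print namespace/name of the Service behind the aggregated metrics API
kubectl get apiservice v1beta1.metrics.k8s.io \
  -o jsonpath='{.spec.service.namespace}/{.spec.service.name}{"\n"}'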

Then it hit me: a while back I had deployed kube-prometheus on this cluster. At the time, kube-state-metrics failed to pull its image and never came up properly; since this was just a temporary test environment I left it alone, and after a while I simply forgot about it.

A quick look confirmed it:

[root@k8s-b-master ~]# kubectl get pod -n monitoring
NAME                                   READY   STATUS             RESTARTS       AGE
alertmanager-main-0                    2/2     Running            14 (30m ago)   5d19h
alertmanager-main-1                    2/2     Running            14 (29m ago)   5d19h
alertmanager-main-2                    2/2     Running            14 (29m ago)   5d19h
blackbox-exporter-69f4d86566-wn6q7     3/3     Running            21 (30m ago)   5d19h
grafana-56c4977497-2rjmt               1/1     Running            7 (29m ago)    5d19h
kube-state-metrics-56f8746666-lsps6    2/3     ImagePullBackOff   14 (29m ago)   5d19h  # not running because the image pull failed
node-exporter-d8c5k                    2/2     Running            14 (30m ago)   5d19h
node-exporter-gvfx2                    2/2     Running            14 (30m ago)   5d19h
node-exporter-gxccx                    2/2     Running            14 (29m ago)   5d19h
node-exporter-h292z                    2/2     Running            14 (30m ago)   5d19h
node-exporter-mztj6                    2/2     Running            14 (29m ago)   5d19h
node-exporter-rvfz6                    2/2     Running            14 (29m ago)   5d19h
node-exporter-twg9q                    2/2     Running            13 (29m ago)   5d19h
prometheus-adapter-77f56b865b-76nzk    0/1     ImagePullBackOff   0              5d19h
prometheus-adapter-77f56b865b-wbcwl    0/1     ImagePullBackOff   0              5d19h
prometheus-k8s-0                       2/2     Running            14 (30m ago)   5d19h
prometheus-k8s-1                       2/2     Running            14 (29m ago)   5d19h
prometheus-operator-788dd7cb76-85zwj   2/2     Running            14 (29m ago)   5d19h
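
If you wanted to dig into why the image pulls keep failing (unreachable registry, wrong tag, missing pull credentials, and so on), the Pod events are the place to look. For example, using the pod names from the listing above:

# The Events section at the bottom of the describe output shows the exact pull error
kubectl describe pod -n monitoring kube-state-metrics-56f8746666-lsps6

# Or review recent events for the whole namespace
kubectl get events -n monitoring --sort-by=.lastTimestamp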

Since I no longer needed this environment anyway, I just tore the whole thing down:

[root@k8s-b-master ~]# kubectl delete -f kube-prometheus-main/manifests/
[root@k8s-b-master ~]# kubectl delete -f kube-prometheus-main/manifests/setup/

Checking the namespaces again, test-b was finally gone along with everything else. Problem solved:

[root@k8s-b-master ~]# kubectl get ns
NAME              STATUS   AGE
default           Active   5d22h
kube-node-lease   Active   5d22h
kube-public       Active   5d22h
kube-system       Active   5d22h
[root@k8s-b-master ~]#
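
For the record, tearing down the whole stack is the sledgehammer option. Had the monitoring stack still been needed, fixing the prometheus-adapter image, or temporarily unregistering just the broken aggregated API, would have unblocked namespace deletion as well, since the APIService object is what API discovery trips over:

# Remove only the broken aggregated API registration; the namespace
# controller can then finish discovery and complete the deletion
kubectl delete apiservice v1beta1.metrics.k8s.io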

Final Takeaway

Reflecting on this afterwards against the official docs and my own experience: in this cluster the metrics.k8s.io/v1beta1 API was not served by the standard Metrics Server at all, but by the prometheus-adapter component of kube-prometheus, which registers an aggregated APIService for that group and answers it with data from Prometheus. Because the prometheus-adapter Pods (along with kube-state-metrics) never started, the aggregated API had no healthy backend. Plenty of things consume this API: the HPA, kubectl top, and, crucially, API discovery itself. When a namespace is deleted, the namespace controller has to enumerate every registered API group to verify that nothing in the namespace is left behind; with one aggregated API permanently unavailable, that discovery step keeps failing (the DiscoveryFailed condition seen earlier), and the namespace can never finish terminating. In short, a half-deployed monitoring stack broke the API Server's aggregated API layer, and that in turn made namespace deletion hang.
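
One last aside: the popular "remove the finalizer" trick also gets a namespace out of Terminating, but it only hides a problem like this one instead of fixing it, so treat it strictly as a last resort. A sketch, assuming jq is installed:

# Force-finalize the namespace by clearing spec.finalizers via the
# finalize subresource (last resort - fix the failing APIService first)
kubectl get namespace test-b -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/test-b/finalize" -f -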
