
Shard recovery reached the maximum number of retries

用户7442844
Modified 2023-07-19 18:00:35
Included in the column: ES自助排障 (ES Self-Service Troubleshooting)

Symptoms

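Run GET /_cluster/allocation/explain to view the allocation details for the affected index. The deciders section of the response typically looks like one of the following three cases.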

Failed to obtain the shard lock (failed to obtain in-memory shard lock)

		"deciders": [{
			"decider": "max_retry",
			"decision": "NO",
			"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-02-27T06:48:04.340Z], failed_attempts[5], failed_nodes[[iOKq3oMXReCl1EcdcM3OEQ]], delayed=false, details[failed shard on node [iOKq3oMXReCl1EcdcM3OEQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[myIndex][5]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
		}]

Circuit breaker tripped (Data too large)

		"deciders": [{
			"decider": "max_retry",
			"decision": "NO",
			"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-12-27T06:04:04.013Z], failed_attempts[5], delayed=false, details[failed shard on node [aHN6ZO4dSDOJPJCfdnGyAQ]: failed recovery, failure RecoveryFailedException[[.triggered_watches][0]: Recovery failed from {1618925137004890932}{ZsOg8qa1Qn6_NpWV3m-FVA}{q4nJNlZYQgujXDLmD6ol0g}{x.x.x.x}{x.x.x.x:9300}{ml.machine_memory=3929833472, rack=cvm_8_800003, xpack.installed=true, set=800003, ip=x.x.x.x, temperature=hot, ml.max_open_jobs=20, ml.enabled=true, region=8} into {1591263882001002432}{aHN6ZO4dSDOJPJCfdnGyAQ}{UoLH-_YfQcWuzWO8eVOe9w}{x.x.x.x}{x.x.x.x:28905}{ml.machine_memory=3929833472, rack=cvm_8_800003, xpack.installed=true, set=800003, ip=x.x.x.x, temperature=hot, ml.max_open_jobs=20, ml.enabled=true, region=8}]; nested: RemoteTransportException[[1618925137004890932][x.x.x.x:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1435813322/1.3gb], which is larger than the limit of [1433862144/1.3gb], real usage: [1435810872/1.3gb], new bytes reserved: [2450/2.3kb]]; ], allocation_status[no_attempt]]]"
		}]

Disk full (No space left on device)

		"deciders": [{
			"decider": "max_retry",
			"decision": "NO",
			"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-06-01T11:16:21.428Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[im_session_log][1]: Recovery failed from {1637914038007031232}{ITpemyeASn-4AKdB_hmmGA}{RLe4brDyR5CszH6WhUla2w}{x.x.x.x}{x.x.x.x:9300}{temperature=hot, rack=cvm_8_800007, set=800007, region=8, ip=x.x.x.x} into {1637914038007031032}{xANHj5XtQreIJPuqbSSydg}{2nSl8dV6Tam0T61u2-CMsw}{x.x.x.x}{x.x.x.x:9300}{temperature=hot, rack=cvm_8_800007, set=800007, region=8, ip=x.x.x.x}]; nested: RemoteTransportException[[1637914038007031232][x.x.x.x:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [134] files with total size of [10gb]]; nested: RemoteTransportException[[1637914038007031032][x.x.x.x:9300][internal:index/shard/recovery/file_chunk]]; nested: IOException[No space left on device]; ], allocation_status[no_attempt]]]"
		}]

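If a decider in the response is "max_retry", search its explanation for the three common keywords above (failed to obtain in-memory shard lock, Data too large, No space left on device) to identify the underlying cause.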
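The keyword filtering described above can be sketched in Python. This is a minimal illustration, not part of the original article: the function names and the cause labels (`shard_lock`, `circuit_breaker`, `disk_full`) are hypothetical, and the walker makes no assumption about where the deciders array sits in the response; it simply searches the parsed JSON recursively.

```python
from typing import Any, Iterator

# Keyword -> cause label. The keywords come from the three examples above;
# the labels are hypothetical names chosen for this sketch.
KEYWORDS = {
    "failed to obtain in-memory shard lock": "shard_lock",
    "Data too large": "circuit_breaker",
    "No space left on device": "disk_full",
}


def iter_max_retry_explanations(node: Any) -> Iterator[str]:
    """Walk a parsed explain response and yield every max_retry explanation."""
    if isinstance(node, dict):
        if node.get("decider") == "max_retry" and "explanation" in node:
            yield node["explanation"]
        for value in node.values():
            yield from iter_max_retry_explanations(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_max_retry_explanations(item)


def classify(explanation: str) -> str:
    """Return the label of the first keyword found, or "unknown"."""
    for keyword, label in KEYWORDS.items():
        if keyword in explanation:
            return label
    return "unknown"
```

Given the parsed JSON body of the explain call as `response`, `[classify(e) for e in iter_max_retry_explanations(response)]` yields one label per max_retry decider found.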

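Shard lock failures and circuit breaker trips usually occur because a node has just joined the cluster or because the cluster is under heavy load, so the allocation attempts fail. In these cases, manually trigger a retry of the failed allocations, either right away or once the cluster load has dropped.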

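For a full disk, first delete historical data or expand the disk capacity. Once disk usage is below the low disk watermark, manually trigger a retry of the failed allocations.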
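The watermark condition can be checked with simple arithmetic. The sketch below assumes the low watermark is expressed as a percentage and uses 85%, the Elasticsearch default for `cluster.routing.allocation.disk.watermark.low`; a cluster may override this value or configure it as an absolute amount of free space, which this sketch does not handle. The function name is hypothetical.

```python
def below_low_watermark(used_bytes: int, total_bytes: int,
                        low_watermark_pct: float = 85.0) -> bool:
    """Return True if disk usage is strictly below the low watermark.

    low_watermark_pct defaults to 85.0, assumed here to match the
    cluster's setting; pass the real value if it has been changed.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    return used_pct < low_watermark_pct
```

For example, a node using 800 GB of a 1000 GB disk is at 80% and passes the check, while one using 900 GB does not.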

Solution

Manually trigger a retry of the failed shard allocations

POST _cluster/reroute?retry_failed=true
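When the retry needs to be scripted rather than sent from a console, the same call can be issued from Python's standard library. This is a sketch under assumptions: the base URL is a placeholder, the helper names are hypothetical, and authentication and TLS settings, which most clusters require, are omitted.

```python
import json
import urllib.request


def build_retry_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for _cluster/reroute?retry_failed=true."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


def retry_failed_allocations(base_url: str, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed JSON response.

    No credentials are attached; add them if the cluster requires
    authentication.
    """
    request = build_retry_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `retry_failed_allocations("http://localhost:9200")` (address assumed) sends the request. Running GET /_cluster/allocation/explain again afterwards shows whether the shard is still unassigned.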

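Original content statement: This article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to have it removed.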
