Monitoring Virtual Machine Status with Prometheus

Original

软件书桌

Last modified 2024-04-30 15:04:42

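This post records a first hands-on use of Prometheus through one small case: monitor the status of a virtual machine and send an alert email when it goes down.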

Deploying Prometheus

# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.25.0/prometheus-2.25.0.linux-amd64.tar.gz

tar xf  prometheus-2.25.0.linux-amd64.tar.gz -C /usr/local

cd /usr/local
mv prometheus-2.25.0.linux-amd64/ prometheus

# Create a systemd unit and start Prometheus
cd /usr/lib/systemd/system

vim prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --web.listen-address=:9090

[Install]
WantedBy=multi-user.target


systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
systemctl status prometheus

# Open the Prometheus web UI
http://178.104.163.109:9090
http://178.104.163.109:9090/metrics
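
Right after `systemctl start`, the web UI can take a moment to answer. As a rough sketch, a small retry helper can poll Prometheus's `/-/healthy` endpoint until it responds. The `retry` function is a hypothetical helper written for this post, not part of Prometheus, and the attempt count and delay are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS runs out.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here): up to 10 tries, 2 seconds apart.
# retry 10 2 curl -fsS http://178.104.163.109:9090/-/healthy
```

The same helper works for the Grafana and Alertmanager ports used later in this post.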

Deploying Grafana

# Install Grafana
wget https://mirrors.tuna.tsinghua.edu.cn/grafana/yum/rpm/grafana-9.3.2-1.x86_64.rpm

yum install -y initscripts fontconfig
yum install -y grafana-9.3.2-1.x86_64.rpm


# Start Grafana
systemctl start grafana-server.service
systemctl status grafana-server.service 
systemctl enable grafana-server.service


# Open the Grafana web UI (default login: admin / admin)
http://178.104.163.109:3000/login
admin / admin
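
Before a dashboard can show anything, Grafana needs a Prometheus data source. It can be added in the web UI; the sketch below does the same through Grafana's `POST /api/datasources` HTTP API. The function names are made up for this example, and it assumes the default admin / admin login is still in place and that Prometheus is reachable at the address used above.

```shell
# Print the JSON body for Grafana's POST /api/datasources endpoint.
prometheus_datasource_json() {
  url=$1
  printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy","isDefault":true}\n' "$url"
}

# Register the data source (assumes the default admin / admin credentials).
add_prometheus_datasource() {
  grafana=$1
  prom=$2
  curl -fsS -u admin:admin -H 'Content-Type: application/json' \
    -X POST "$grafana/api/datasources" \
    -d "$(prometheus_datasource_json "$prom")"
}

# Example (not run here):
# add_prometheus_datasource http://178.104.163.109:3000 http://178.104.163.109:9090
```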

Deploying node-exporter

[root@desktop-a853 ~]# cat /usr/local/prometheus/prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configuration: Prometheus itself, plus the node-exporter on the monitored VM.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  # A separate job named 'node', so that the alert rule's up{job="node"} selector matches.
  - job_name: 'node'
    static_configs:
    - targets: ['178.104.163.105:9100']
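
The config above expects a node-exporter listening on port 9100 of the monitored VM, but the install itself is not shown. A minimal sketch of one way to do it on that VM follows. The version `1.1.2`, the install path, and the log path are assumptions picked for illustration, and the two function names are hypothetical; running it under systemd, as with Prometheus above, would be the more durable choice.

```shell
# Build the GitHub release URL for a node_exporter version (arch defaults to amd64).
node_exporter_url() {
  version=$1
  arch=${2:-amd64}
  echo "https://github.com/prometheus/node_exporter/releases/download/v${version}/node_exporter-${version}.linux-${arch}.tar.gz"
}

# Download, unpack, and start node_exporter in the background.
# It listens on :9100 by default.
install_node_exporter() {
  version=$1
  wget "$(node_exporter_url "$version")" || return 1
  tar xf "node_exporter-${version}.linux-amd64.tar.gz" -C /usr/local || return 1
  mv "/usr/local/node_exporter-${version}.linux-amd64" /usr/local/node_exporter || return 1
  nohup /usr/local/node_exporter/node_exporter >/var/log/node_exporter.log 2>&1 &
}

# Example (not run here; the version is an assumption):
# install_node_exporter 1.1.2
```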

Import a node-exporter dashboard into Grafana.

# Point Prometheus at the alert rule files and at Alertmanager
vi prometheus.yml

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
  - static_configs:
    - targets:       # where alerts are sent: the Alertmanager instance
      - 192.168.1.20:9093     # Alertmanager address

  
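# Add an alert rule (the rules directory sits next to prometheus.yml)
mkdir -p rules
vi ./rules/node_rule.yml

groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="node"} == 0
    for: 10s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 10s"
      description: hello world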
      
# Restart Prometheus
systemctl restart prometheus
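
A typo in `prometheus.yml` or in a rule file will stop Prometheus from coming back up, so it can be worth validating the files before the restart. The sketch below wraps `promtool check config`, which also checks the rule files that the config references. `promtool` ships in the same tarball as the `prometheus` binary; the wrapper function and the `PROMTOOL` variable are hypothetical conveniences for this post.

```shell
# Path to promtool; override PROMTOOL if it lives somewhere else.
PROMTOOL="${PROMTOOL:-/usr/local/prometheus/promtool}"

# Validate the main config and the rule files it references.
check_prometheus_config() {
  config=${1:-/usr/local/prometheus/prometheus.yml}
  if "$PROMTOOL" check config "$config"; then
    echo "config OK"
  else
    echo "config INVALID"
    return 1
  fi
}

# Example (not run here): restart only when the check passes.
# check_prometheus_config && systemctl restart prometheus
```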

Deploying Alertmanager

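# Download and unpack Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.21.0.linux-amd64.tar.gz

# Copy the binaries into place and set their permissions
install -m 0755 alertmanager-0.21.0.linux-amd64/{alertmanager,amtool} /usr/bin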

Configuring Alert Emails

# Create the alertmanager.yml configuration file
cat > alertmanager.yml <<EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25' # SMTP server (host:port)
  smtp_from: 'demo*@163.com' # sender address
  smtp_auth_username: 'demo*@163.com' # SMTP account name
  smtp_auth_password: 'QNHPB***XBRMWCB' # mailbox password or authorization code
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: '*@*.com'
EOF

# Install the config file with its permissions set
install -m 0644 -D alertmanager.yml /etc/alertmanager/alertmanager.yml
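
The `amtool` binary installed earlier can check this file before the service is started, using its `check-config` subcommand. A small sketch follows; the wrapper function and the `AMTOOL` variable are hypothetical names used only here.

```shell
# Path to amtool; override AMTOOL if it was installed elsewhere.
AMTOOL="${AMTOOL:-/usr/bin/amtool}"

# Returns amtool's exit status: 0 when the config parses, non-zero otherwise.
check_alertmanager_config() {
  config=${1:-/etc/alertmanager/alertmanager.yml}
  "$AMTOOL" check-config "$config"
}

# Example (not run here):
# check_alertmanager_config && echo "alertmanager.yml looks valid"
```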

# Create the systemd unit
cat > alertmanager.service <<EOF
[Unit]
Description=Alertmanager handles alerts sent by client applications such as the Prometheus server.
Documentation=https://prometheus.io/docs/alerting/alertmanager/
After=network.target
 
[Service]
User=root
ExecStart=/usr/bin/alertmanager \\
  --config.file=/etc/alertmanager/alertmanager.yml \\
  --storage.path=/var/lib/alertmanager \\
  --cluster.advertise-address=0.0.0.0:9093
ExecReload=/bin/kill -HUP \$MAINPID
Restart=on-failure
 
[Install]
WantedBy=multi-user.target
EOF


# Install the unit file with its permissions set
install -m 0644 alertmanager.service /etc/systemd/system

# Start the service
systemctl daemon-reload
systemctl start alertmanager
systemctl status alertmanager
systemctl enable alertmanager

# Open the Alertmanager web UI
http://178.104.163.109:9093

Shutting down the monitored virtual machine, or stopping the node-exporter running on it, now triggers an alert email.
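
To test the email path without taking a machine down, an alert can also be posted straight to Alertmanager's v2 API (`POST /api/v2/alerts`). The following is a sketch under assumptions: the function names, the default alert name `manual-test`, and the label values are invented for illustration, and the address is the one used above.

```shell
# Print a one-alert JSON array in the shape the v2 alerts endpoint accepts.
test_alert_json() {
  name=$1
  printf '[{"labels":{"alertname":"%s","severity":"1"},"annotations":{"summary":"manual test alert"}}]\n' "$name"
}

# Post the test alert; the alert name defaults to "manual-test".
send_test_alert() {
  alertmanager=$1
  curl -fsS -H 'Content-Type: application/json' \
    -X POST "$alertmanager/api/v2/alerts" \
    -d "$(test_alert_json "${2:-manual-test}")"
}

# Example (not run here):
# send_test_alert http://178.104.163.109:9093
```

With the route configured above, the email should go out after the 10s `group_wait`.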

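With this basic environment in place, any other Prometheus feature can be tried out here later.

Whatever new technology you are learning, building a minimal working (MVP) environment first seems to be an essential step.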

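Original work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to have it removed.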
