Building an Enterprise-Grade Monitoring and Alerting System
📋 Table of Contents
I. Monitoring Overview and Design Principles
1. Why Monitoring Matters
In modern IT systems, monitoring has gone from "nice to have" to "must have." A well-built monitoring system delivers:
Fast problem detection
Catch anomalies in real time, before small issues grow into major outages.
Data for performance tuning
Use historical data to identify bottlenecks and guide capacity planning and optimization.
Post-incident analysis
Keep a record of system state over time, making root-cause analysis and accountability possible.
SLA assurance
Quantify availability and performance to back up SLA (Service Level Agreement) commitments.
In the experience of Google SRE (Site Reliability Engineering), effective monitoring can substantially reduce MTTR (mean time to repair) and markedly improve the odds of catching failures before they escalate.
2. Design Principles (Google's Four Golden Signals)
The four golden signals proposed by the Google SRE team are the core guiding principles of monitoring design:
Latency: the time it takes to serve a request
Traffic: the volume of requests or concurrency the system is handling
Errors: the fraction of requests that fail or return errors
Saturation: how "full" the system's resources are
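The four golden signals map naturally onto PromQL. A sketch, assuming the standard `http_requests_total` counter and `http_request_duration_seconds` histogram used later in this article (metric and label names are conventions, not guaranteed by every application):

```promql
# Latency: 95th-percentile request duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: CPU busy fraction per instance (Node Exporter)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```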
3. Monitoring Layers
A complete monitoring system should cover four layers:
1. Business metrics: success rate of key business flows, user activity, order volume, etc.
2. Application metrics: response time, error rate, throughput, JVM/GC state, etc.
3. System metrics: CPU, memory, disk, network utilization, process state, etc.
4. Network metrics: latency, packet loss, bandwidth utilization, connection counts, etc.
II. Monitoring System Architecture
Design goals:
1. Scalability: support monitoring thousands of nodes
2. High availability: the monitoring system itself must not become a single point of failure
3. Low latency: metric collection and alerting delays measured in seconds
4. Usability: simple configuration, friendly visualization, easy problem diagnosis
1. Overall Architecture
An enterprise monitoring system is typically layered as follows:
# Enterprise monitoring system architecture
"""
Data Collection layer
├── Node Exporter: host-level metrics
├── MySQL Exporter: database monitoring
├── Nginx Exporter: web server monitoring
├── JMX Exporter: Java application monitoring
├── Blackbox Exporter: network probing
└── Custom exporters: business metrics
Storage & Processing layer
├── Prometheus Server: metric scraping, storage, querying
├── Prometheus Alertmanager: alert management
├── Thanos/Cortex: long-term storage and clustering (optional)
└── Time-series databases: VictoriaMetrics/InfluxDB (alternatives)
Visualization layer
├── Grafana: dashboards
├── Custom dashboards: business wallboards
└── Reporting system: periodic report generation
Alerting & Notification layer
├── Email: SMTP integration
├── Instant messaging: DingTalk / WeCom / Slack
├── SMS: cloud provider APIs
└── Phone alerts: automated calls for emergencies
Auxiliary Components
├── Service discovery: automatic discovery of monitoring targets
├── Configuration management: automated deployment with Ansible/Terraform
├── Access control: LDAP/OAuth2 integration
└── Log integration: correlated analysis with Loki/ELK Stack
"""
2. Technology Comparison
A comparison of mainstream monitoring solutions:
| Feature | Prometheus | Zabbix | Nagios | DataDog |
|---|---|---|---|---|
| License | Open source | Open source | Open source | Commercial |
| Data model | Multi-dimensional time series | Key-value | Status checks | Multi-dimensional time series |
| Query language | PromQL | Limited | None | Proprietary |
| Service discovery | Native | Limited | None | Automatic |
| Visualization | Requires Grafana | Built-in | Requires plugins | Built-in |
| Community/ecosystem | Strong | Strong | Strong | Commercial support |
| Cost | Free | Free | Free | Expensive |
3. Metric Design Conventions
# Metric naming conventions
# Format: metric_name{label1="value1",label2="value2",...}
"""
Naming rules:
1. Separate words with underscores: http_requests_total
2. Basic pattern: <namespace>_<name>_<unit>, with standard suffixes:
- _total: cumulative counter value
- _count: observation count of a histogram/summary
- _sum: observation sum of a histogram/summary
- _bucket: histogram buckets
- _info: metadata
3. Standardize units (Prometheus base units):
- Time: seconds
- Memory: bytes
- Disk: bytes
- Network: bytes (e.g. node_network_receive_bytes_total)
Label design rules:
1. Identifying labels (required):
- instance: instance identifier (IP:Port)
- job: job/service name
- env: environment (prod/staging/dev)
2. Dimension labels (optional):
- region: geographic region (e.g. north/east)
- az: availability zone
- team: owning team
- version: application version
3. Label designs to avoid:
- no high-cardinality labels (such as user IDs)
- avoid label values that change dynamically
- keep the label count modest (typically 5-10)
Example metrics:
# System metrics
node_cpu_seconds_total{mode="idle", instance="192.168.1.100:9100", job="node"}
node_memory_MemFree_bytes{instance="192.168.1.100:9100", job="node"}
# Application metrics
http_requests_total{method="POST", endpoint="/api/users", status="200", job="user-service"}
http_request_duration_seconds_bucket{method="GET", endpoint="/api/products", le="0.1"}
# Business metrics
orders_total{type="new", payment_method="alipay", env="production"}
user_sessions_active{region="north", platform="mobile"}
"""
III. Prometheus Deployment and Configuration
Why Prometheus:
1. Multi-dimensional data model: metric name plus key-value labels
2. Powerful query language: PromQL, for flexible querying and aggregation
3. No distributed storage dependency: each node is self-contained
4. HTTP pull model: actively scrapes metrics from targets
5. Multiple service discovery mechanisms: Kubernetes, Consul, and more
1. Installing Prometheus
#!/bin/bash
# install_prometheus.sh
# One-shot Prometheus installation script
PROMETHEUS_VERSION="2.45.0"
PROMETHEUS_USER="prometheus"
INSTALL_DIR="/opt/prometheus"
DATA_DIR="/var/lib/prometheus"
CONFIG_DIR="/etc/prometheus"
# Create user and directories
useradd --no-create-home --shell /bin/false $PROMETHEUS_USER
mkdir -p $INSTALL_DIR $DATA_DIR $CONFIG_DIR
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $INSTALL_DIR $DATA_DIR $CONFIG_DIR
# Download and unpack Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar xzf prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
cd prometheus-$PROMETHEUS_VERSION.linux-amd64
# Install binaries
cp prometheus promtool $INSTALL_DIR/
chown $PROMETHEUS_USER:$PROMETHEUS_USER $INSTALL_DIR/{prometheus,promtool}
chmod +x $INSTALL_DIR/{prometheus,promtool}
# Copy console templates (referenced by the --web.console.* flags below)
cp -r consoles console_libraries $INSTALL_DIR/
# Copy the default configuration file
cp prometheus.yml $CONFIG_DIR/
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $CONFIG_DIR
# Create the systemd service unit
cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target

[Service]
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/prometheus \
  --config.file=$CONFIG_DIR/prometheus.yml \
  --storage.tsdb.path=$DATA_DIR \
  --storage.tsdb.retention.time=30d \
  --web.console.templates=$INSTALL_DIR/consoles \
  --web.console.libraries=$INSTALL_DIR/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://prometheus.example.com \
  --web.enable-lifecycle \
  --web.enable-admin-api
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Create the configuration directory layout
mkdir -p $CONFIG_DIR/{rules,rules.d,files_sd,targets}
cat > $CONFIG_DIR/prometheus.yml << 'EOF'
# Global configuration
global:
  scrape_interval: 15s          # default scrape interval
  evaluation_interval: 15s      # rule evaluation interval
  external_labels:              # labels attached to outbound data
    region: 'north'
    env: 'production'

# Alerting rule files
rule_files:
  - "rules/*.yml"
  - "rules.d/*.yml"

# Scrape configuration
scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'monitoring'

  # All Node Exporters
  - job_name: 'node'
    scrape_interval: 30s
    file_sd_configs:
      - files:
          - 'targets/node_*.yml'
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # All MySQL instances
  - job_name: 'mysql'
    scrape_interval: 30s
    static_configs:
      - targets: ['mysql-1:9104', 'mysql-2:9104']
        labels:
          database: 'mysql'

  # All Nginx instances
  - job_name: 'nginx'
    scrape_interval: 30s
    static_configs:
      - targets: ['nginx-1:9113', 'nginx-2:9113']
        labels:
          service: 'web'

  # Service discovery via Consul
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        # relabel regexes are fully anchored, so allow surrounding tags
        regex: '.*,(production|staging|dev),.*'
        target_label: env
        replacement: '$1'

# Remote read/write (optional)
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      capacity: 5000
      max_shards: 200
remote_read:
  - url: "http://thanos-query:10902/api/v1/read"
    read_recent: true
EOF
# Create an example target file
cat > $CONFIG_DIR/targets/node_servers.yml << 'EOF'
- targets:
    - '192.168.1.100:9100'
    - '192.168.1.101:9100'
    - '192.168.1.102:9100'
  labels:
    datacenter: 'dc1'
    rack: 'rack-a'
EOF
# Create alerting rule files
cat > $CONFIG_DIR/rules/node_alerts.yml << 'EOF'
groups:
  - name: node_alerts
    interval: 30s
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage (instance {{ $labels.instance }})"
          description: "CPU usage has been above 80% for 5 minutes. Current value: {{ $value }}%"
          runbook: "https://runbook.example.com/high-cpu"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High memory usage (instance {{ $labels.instance }})"
          description: "Memory usage has been above 85% for 5 minutes. Current value: {{ $value }}%"
      - alert: DiskSpaceCritical
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 90
        for: 2m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Disk space critically low (instance {{ $labels.instance }})"
          description: "Root filesystem usage has been above 90% for 2 minutes. Current value: {{ $value }}%"
          runbook: "https://runbook.example.com/disk-space"
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Node down (instance {{ $labels.instance }})"
          description: "Node {{ $labels.instance }} has been unreachable for more than 1 minute"
EOF
# Fix permissions
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $CONFIG_DIR
find $CONFIG_DIR -type f -name '*.yml' -exec chmod 644 {} \;
chmod 755 $CONFIG_DIR $CONFIG_DIR/{rules,rules.d,files_sd,targets}
# Start the service
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
# Check status
sleep 3
systemctl status prometheus --no-pager
echo "Prometheus installation complete!"
echo "Web UI: http://$(hostname -I | awk '{print $1}'):9090"
echo "Data directory: $DATA_DIR"
echo "Config directory: $CONFIG_DIR"
2. Prometheus Configuration in Depth
# Advanced Prometheus configuration examples
# 1. Remote storage (VictoriaMetrics)
remote_write:
  - url: "http://victoria-metrics:8428/api/v1/write"
    write_relabel_configs:
      - action: keep
        regex: "node.*|prometheus.*"
        source_labels: [__name__]
    queue_config:
      max_shards: 10
      min_shards: 2
      max_samples_per_send: 500
      capacity: 10000
      batch_send_deadline: "5s"
      min_backoff: "100ms"
      max_backoff: "5s"
# 2. Service discovery (Kubernetes)
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Take the scrape path from the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Take the scrape port from the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Map pod labels onto the target
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Record the pod name
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Record the namespace
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Record the node
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
# 3. Static targets
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets:
          - 'app-1.example.com:8080'
          - 'app-2.example.com:8080'
          - 'app-3.example.com:8080'
        labels:
          environment: 'production'
          region: 'us-east-1'
          application: 'user-service'
# 4. File-based service discovery
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):(\d+)'
        replacement: '${1}'
        target_label: host
      - source_labels: [__address__]
        regex: '(.*):(\d+)'
        replacement: '${2}'
        target_label: port
# 5. Relabeling rules
scrape_configs:
  - job_name: 'example'
    static_configs:
      - targets: ['example.com:80']
    metric_relabel_configs:
      # Drop unwanted metrics
      - action: drop
        regex: 'go_.*'
        source_labels: [__name__]
      # Rename metrics
      - source_labels: [__name__]
        regex: 'http_requests_(\w+)'
        replacement: 'http_${1}'
        target_label: __name__
      # Derive a hostname label from instance
      - source_labels: [instance]
        regex: '([^:]+):\d+'
        replacement: '${1}'
        target_label: hostname
      # Bucket status codes into a group label
      - source_labels: [status_code]
        regex: '5..'
        replacement: 'server_error'
        target_label: status_group
# 6. Grouped alerting rules
groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # System-level alerts
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          domain: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for 5 minutes"
          runbook: "/runbooks/instance-down.md"
      # Resource-level alerts
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 10m
        labels:
          severity: warning
          domain: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage has been above 90% for 10 minutes"
          runbook: "/runbooks/high-memory.md"
  - name: application_alerts
    interval: 15s
    rules:
      # Application-level alerts
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
        for: 2m
        labels:
          severity: warning
          domain: application
        annotations:
          summary: "High request latency on {{ $labels.service }}"
          description: "95th-percentile request latency is above 0.5 seconds"
          runbook: "/runbooks/high-latency.md"
      # Business-level alerts
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
          domain: business
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate has been above 5% for 5 minutes"
          runbook: "/runbooks/high-error-rate.md"
IV. Deploying Exporters
Prometheus has a rich exporter ecosystem covering almost every common service and system: hundreds of official and community-maintained exporters span infrastructure, middleware, databases, and applications.
1. Node Exporter (host monitoring)
#!/bin/bash
# install_node_exporter.sh
# One-shot Node Exporter installation script
NODE_EXPORTER_VERSION="1.6.0"
NODE_EXPORTER_USER="node_exporter"
INSTALL_DIR="/opt/node_exporter"
# Create user and install directory
useradd --no-create-home --shell /bin/false $NODE_EXPORTER_USER
mkdir -p $INSTALL_DIR
# Download and unpack
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
tar xzf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
cd node_exporter-$NODE_EXPORTER_VERSION.linux-amd64
# Install the binary
cp node_exporter $INSTALL_DIR/
chown $NODE_EXPORTER_USER:$NODE_EXPORTER_USER $INSTALL_DIR/node_exporter
chmod +x $INSTALL_DIR/node_exporter
# Create the systemd service unit
# (note: recent node_exporter releases renamed unit-whitelist to unit-include)
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
User=$NODE_EXPORTER_USER
Group=$NODE_EXPORTER_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.systemd.unit-include="(docker|ssh|nginx|mysql)\.service" \
  --collector.processes \
  --collector.tcpstat \
  --collector.netdev \
  --collector.netstat \
  --collector.diskstats \
  --collector.filesystem \
  --collector.meminfo \
  --collector.loadavg \
  --collector.stat \
  --collector.vmstat \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --log.level="info"
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Create the textfile collector directory
mkdir -p /var/lib/node_exporter/textfile_collector
chown -R $NODE_EXPORTER_USER:$NODE_EXPORTER_USER /var/lib/node_exporter
# Create a custom metrics collection script
cat > /usr/local/bin/custom_node_metrics.sh << 'EOF'
#!/bin/bash
# Custom node metrics for the textfile collector
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/custom_metrics.prom"
# 1. System uptime
echo '# HELP node_system_uptime_seconds System uptime in seconds' > $OUTPUT_FILE
echo '# TYPE node_system_uptime_seconds gauge' >> $OUTPUT_FILE
echo "node_system_uptime_seconds $(awk '{print $1}' /proc/uptime)" >> $OUTPUT_FILE
# 2. Logged-in users
LOGIN_USERS=$(who | wc -l)
echo '# HELP node_login_users Number of logged in users' >> $OUTPUT_FILE
echo '# TYPE node_login_users gauge' >> $OUTPUT_FILE
echo "node_login_users $LOGIN_USERS" >> $OUTPUT_FILE
# 3. Zombie processes (match the Z state at the start of the STAT column)
ZOMBIE_PROCESSES=$(ps -eo stat= | grep -c '^Z')
echo '# HELP node_zombie_processes Number of zombie processes' >> $OUTPUT_FILE
echo '# TYPE node_zombie_processes gauge' >> $OUTPUT_FILE
echo "node_zombie_processes $ZOMBIE_PROCESSES" >> $OUTPUT_FILE
# 4. File handle usage
FILE_HANDLES=$(awk '{print $1}' /proc/sys/fs/file-nr)
FILE_HANDLES_MAX=$(cat /proc/sys/fs/file-max)
FILE_HANDLES_PERCENT=$(echo "scale=2; $FILE_HANDLES * 100 / $FILE_HANDLES_MAX" | bc)
echo '# HELP node_file_handles_used File handles used' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_used gauge' >> $OUTPUT_FILE
echo "node_file_handles_used $FILE_HANDLES" >> $OUTPUT_FILE
echo '# HELP node_file_handles_max Maximum file handles' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_max gauge' >> $OUTPUT_FILE
echo "node_file_handles_max $FILE_HANDLES_MAX" >> $OUTPUT_FILE
echo '# HELP node_file_handles_percent File handles usage percent' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_percent gauge' >> $OUTPUT_FILE
echo "node_file_handles_percent $FILE_HANDLES_PERCENT" >> $OUTPUT_FILE
# 5. System load (15-minute average)
LOAD_15=$(awk '{print $3}' /proc/loadavg)
echo '# HELP node_load15 System load average for 15 minutes' >> $OUTPUT_FILE
echo '# TYPE node_load15 gauge' >> $OUTPUT_FILE
echo "node_load15 $LOAD_15" >> $OUTPUT_FILE
# 6. Inode usage on the root filesystem
DISK_INODES=$(df -i / | awk 'NR==2 {print $5}' | sed 's/%//')
echo '# HELP node_disk_inode_usage_percent Disk inode usage percent for root' >> $OUTPUT_FILE
echo '# TYPE node_disk_inode_usage_percent gauge' >> $OUTPUT_FILE
echo "node_disk_inode_usage_percent $DISK_INODES" >> $OUTPUT_FILE
# 7. Established TCP connections
TCP_ESTABLISHED=$(ss -tan state established | tail -n +2 | wc -l)
echo '# HELP node_network_tcp_established Established TCP connections' >> $OUTPUT_FILE
echo '# TYPE node_network_tcp_established gauge' >> $OUTPUT_FILE
echo "node_network_tcp_established $TCP_ESTABLISHED" >> $OUTPUT_FILE
# 8. NTP synchronization status
NTP_SYNC=0
if chronyc tracking 2>/dev/null | grep -q "Leap status.*Normal"; then
  NTP_SYNC=1
elif ntpq -p 2>/dev/null | grep -q "^\*"; then
  NTP_SYNC=1
fi
echo '# HELP node_ntp_synchronized NTP synchronization status (1=synchronized, 0=not synchronized)' >> $OUTPUT_FILE
echo '# TYPE node_ntp_synchronized gauge' >> $OUTPUT_FILE
echo "node_ntp_synchronized $NTP_SYNC" >> $OUTPUT_FILE
# Fix ownership so node_exporter can read the file
# (hardcoded: variables from the installer are not available at cron time)
chown node_exporter:node_exporter $OUTPUT_FILE
chmod 644 $OUTPUT_FILE
EOF
chmod +x /usr/local/bin/custom_node_metrics.sh
# Schedule it via cron (every 30 minutes)
echo "*/30 * * * * root /usr/local/bin/custom_node_metrics.sh" > /etc/cron.d/node_exporter_custom_metrics
# Start the service
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
# Check status
sleep 2
systemctl status node_exporter --no-pager
echo "Node Exporter installation complete!"
echo "Endpoint: http://$(hostname -I | awk '{print $1}'):9100"
echo "Metrics: http://$(hostname -I | awk '{print $1}'):9100/metrics"
2. MySQL Exporter
#!/bin/bash
# install_mysql_exporter.sh
# MySQL Exporter installation and configuration
MYSQL_EXPORTER_VERSION="0.15.0"
MYSQL_EXPORTER_USER="mysql_exporter"
INSTALL_DIR="/opt/mysql_exporter"
# Create user and install directory
useradd --no-create-home --shell /bin/false $MYSQL_EXPORTER_USER
mkdir -p $INSTALL_DIR
# Download and unpack
cd /tmp
wget https://github.com/prometheus/mysqld_exporter/releases/download/v$MYSQL_EXPORTER_VERSION/mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64.tar.gz
tar xzf mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64.tar.gz
cd mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64
# Install the binary
cp mysqld_exporter $INSTALL_DIR/
chown $MYSQL_EXPORTER_USER:$MYSQL_EXPORTER_USER $INSTALL_DIR/mysqld_exporter
chmod +x $INSTALL_DIR/mysqld_exporter
# Create the monitoring user in MySQL
mysql -u root -p << 'EOF'
-- Create the monitoring user
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ExporterPassword123!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
-- Verify the grants
SHOW GRANTS FOR 'exporter'@'localhost';
EOF
# Create the exporter credentials file
cat > /etc/mysql_exporter.cnf << EOF
[client]
user=exporter
password=ExporterPassword123!
host=localhost
port=3306
EOF
chown $MYSQL_EXPORTER_USER:$MYSQL_EXPORTER_USER /etc/mysql_exporter.cnf
chmod 600 /etc/mysql_exporter.cnf
# Create the systemd service unit
cat > /etc/systemd/system/mysql_exporter.service << EOF
[Unit]
Description=MySQL Exporter
Documentation=https://github.com/prometheus/mysqld_exporter
After=network.target mysql.service

[Service]
User=$MYSQL_EXPORTER_USER
Group=$MYSQL_EXPORTER_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/mysqld_exporter \
  --web.listen-address=":9104" \
  --config.my-cnf=/etc/mysql_exporter.cnf \
  --collect.global_status \
  --collect.global_variables \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.processlist \
  --collect.info_schema.tables \
  --collect.info_schema.tablestats \
  --collect.info_schema.userstats \
  --collect.perf_schema.eventswaits \
  --collect.perf_schema.file_events \
  --collect.perf_schema.indexiowaits \
  --collect.perf_schema.tableiowaits \
  --collect.slave_status \
  --collect.auto_increment.columns \
  --collect.binlog_size \
  --collect.info_schema.query_response_time \
  --collect.engine_innodb_status \
  --log.level="info"
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Start the service
systemctl daemon-reload
systemctl enable mysql_exporter
systemctl start mysql_exporter
# Check status
sleep 2
systemctl status mysql_exporter --no-pager
echo "MySQL Exporter installation complete!"
echo "Endpoint: http://$(hostname -I | awk '{print $1}'):9104"
echo "Metrics: http://$(hostname -I | awk '{print $1}'):9104/metrics"
# Key MySQL monitoring metrics
cat << 'EOF'
=== Key MySQL monitoring metrics ===
1. Connections:
mysql_global_status_threads_connected      # current connections
mysql_global_status_max_used_connections   # historical max connections
mysql_global_variables_max_connections     # max_connections limit
2. Query performance:
mysql_global_status_questions              # total statements
mysql_global_status_slow_queries           # slow queries
rate(mysql_global_status_questions[1m])    # QPS
3. InnoDB:
mysql_global_status_innodb_buffer_pool_pages_total  # buffer pool pages, total
mysql_global_status_innodb_buffer_pool_pages_free   # buffer pool pages, free
mysql_global_status_innodb_row_lock_time_avg        # average row-lock wait time
4. Replication:
mysql_slave_status_slave_io_running        # IO thread state
mysql_slave_status_slave_sql_running       # SQL thread state
mysql_slave_status_seconds_behind_master   # replication lag in seconds
5. Tables:
mysql_info_schema_table_size_bytes         # table size
mysql_info_schema_table_rows               # table row count
=== Common alerting rules ===
# Too many connections
- alert: MySQLTooManyConnections
  expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
  for: 5m
# Too many slow queries
- alert: MySQLSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) > 10
  for: 2m
# Replication lag
- alert: MySQLReplicationLag
  expr: mysql_slave_status_seconds_behind_master > 30
  for: 5m
# Low InnoDB buffer pool hit rate
- alert: InnoDBLowBufferPoolHitRate
  expr: (1 - (mysql_global_status_innodb_buffer_pool_reads / mysql_global_status_innodb_buffer_pool_read_requests)) * 100 < 90
  for: 10m
EOF
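The `rate()` calls above turn raw counters like `mysql_global_status_questions` into per-second rates such as QPS. Conceptually this is just the increase between two samples divided by the elapsed time, with counter resets (e.g. a mysqld restart) handled specially. A simplified sketch with hypothetical sample values:

```python
def per_second_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Per-second increase of a monotonically increasing counter
    between two samples (value, unix timestamp)."""
    if t1 <= t0:
        raise ValueError("samples must be time-ordered")
    delta = v1 - v0
    if delta < 0:
        # Counter reset detected (e.g. mysqld restart):
        # PromQL assumes the counter restarted from zero.
        delta = v1
    return delta / (t1 - t0)

# mysql_global_status_questions sampled 60 seconds apart (made-up values)
qps = per_second_rate(1_200_000, 0, 1_206_000, 60)
print(qps)  # 100.0
```

Real PromQL `rate()` also extrapolates to the ends of the range window and averages over all sample pairs, so treat this as the intuition rather than the exact algorithm.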
3. Other Common Exporters
Blackbox Exporter
Network probing: checks availability of HTTP, HTTPS, DNS, TCP, ICMP, and other services.
PostgreSQL Exporter
PostgreSQL monitoring: connections, query performance, locks, and more.
Cloud exporters
AWS, Azure, and GCP monitoring: cloud resource usage and cost.
JMX Exporter
Java application monitoring: JVM performance and application business metrics via JMX.
HAProxy Exporter
Load balancer monitoring: connections, request rates, backend server state, and more.
cAdvisor
Container monitoring: Docker container resource usage and performance metrics.
V. Grafana Visualization and Dashboards
Why Grafana:
1. Rich visualizations: graphs, tables, gauges, heatmaps, geomaps, and more
2. Broad data source support: Prometheus, MySQL, PostgreSQL, Elasticsearch, and more
3. Flexible alerting: visual alert configuration with multiple notification channels
4. Team collaboration: folder permissions, shared dashboards, version history
1. Installing and Configuring Grafana
#!/bin/bash
# install_grafana.sh
# One-shot Grafana installation script
GRAFANA_VERSION="10.0.3"
GRAFANA_USER="grafana"
INSTALL_DIR="/opt/grafana"
DATA_DIR="/var/lib/grafana"
CONFIG_DIR="/etc/grafana"
LOG_DIR="/var/log/grafana"
# Download Grafana
cd /tmp
wget https://dl.grafana.com/oss/release/grafana-$GRAFANA_VERSION.linux-amd64.tar.gz
tar xzf grafana-$GRAFANA_VERSION.linux-amd64.tar.gz
mv grafana-$GRAFANA_VERSION $INSTALL_DIR
# Create user and directories
useradd --no-create-home --shell /bin/false $GRAFANA_USER
mkdir -p $DATA_DIR $CONFIG_DIR $LOG_DIR
chown -R $GRAFANA_USER:$GRAFANA_USER $INSTALL_DIR $DATA_DIR $CONFIG_DIR $LOG_DIR
# Create the systemd service unit
cat > /etc/systemd/system/grafana.service << EOF
[Unit]
Description=Grafana
Documentation=https://grafana.com/docs/
After=network.target

[Service]
User=$GRAFANA_USER
Group=$GRAFANA_USER
Type=simple
Restart=always
RestartSec=5
WorkingDirectory=$INSTALL_DIR
EnvironmentFile=-$CONFIG_DIR/grafana.conf
ExecStart=$INSTALL_DIR/bin/grafana-server \\
  --config=$CONFIG_DIR/grafana.ini \\
  --homepath=$INSTALL_DIR \\
  cfg:default.paths.logs=$LOG_DIR \\
  cfg:default.paths.data=$DATA_DIR \\
  cfg:default.paths.plugins=$INSTALL_DIR/plugins \\
  cfg:default.paths.provisioning=$CONFIG_DIR/provisioning
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Create the main configuration file
cat > $CONFIG_DIR/grafana.ini << 'EOF'
[server]
# Listen address and port
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = false

# Logging
[log]
mode = console file
level = info
format = console

# Database (SQLite by default)
[database]
type = sqlite3
path = grafana.db
max_idle_conn = 2
max_open_conn = 0
conn_max_lifetime = 14400

# Security
[security]
admin_user = admin
admin_password = admin
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = false
data_source_proxy_whitelist =

# Authentication
[auth]
disable_login_form = false
disable_signout_menu = false

# Anonymous access
[auth.anonymous]
enabled = false
org_name = Main Org.
org_role = Viewer

# Basic auth
[auth.basic]
enabled = true

# Email (for alert notifications)
[smtp]
enabled = true
host = smtp.example.com:465
user = alert@example.com
password = YourPassword
from_address = alert@example.com
from_name = Grafana Alert

# Users
[users]
allow_sign_up = false
auto_assign_org = true
auto_assign_org_role = Viewer

# Sessions (legacy section, ignored by recent Grafana versions)
[session]
provider = file
provider_config = sessions
cookie_secure = false
session_life_time = 86400

# Analytics
[analytics]
reporting_enabled = true
check_for_updates = true

# Paths
[paths]
data = /var/lib/grafana
logs = /var/log/grafana
plugins = /opt/grafana/plugins
provisioning = /etc/grafana/provisioning

# Snapshots
[snapshots]
external_enabled = true
external_snapshot_url = https://snapshots.example.com
external_snapshot_name = Grafana Snapshots

# Internal metrics (Grafana monitoring itself)
[metrics]
enabled = true
interval_seconds = 10
EOF
# Provision the data sources
mkdir -p $CONFIG_DIR/provisioning/datasources
cat > $CONFIG_DIR/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      prometheusVersion: 2.45.0
      cacheLevel: 'High'
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
    secureJsonData:
      tlsAuth: false
      tlsAuthWithCACert: false
  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://localhost:9093
    editable: true
    jsonData:
      implementation: prometheus
      handleGrafanaManagedAlerts: true
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: true
    jsonData:
      maxLines: 1000
  - name: Tempo
    type: tempo
    access: proxy
    url: http://localhost:3200
    editable: true
    jsonData:
      nodeGraph:
        enabled: true
      tracesToLogs:
        datasourceUid: 'loki'
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags: ['job', 'instance', 'pod', 'namespace']
        filterByTraceID: true
        filterBySpanID: true
EOF
# Provision dashboard loading
mkdir -p $CONFIG_DIR/provisioning/dashboards
cat > $CONFIG_DIR/provisioning/dashboards/dashboards.yml << 'EOF'
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
EOF
# Create the dashboard directory
mkdir -p /etc/grafana/dashboards
# Start the service
systemctl daemon-reload
systemctl enable grafana
systemctl start grafana
# Check status
sleep 5
systemctl status grafana --no-pager
echo "Grafana installation complete!"
echo "Web UI: http://$(hostname -I | awk '{print $1}'):3000"
echo "Default username: admin"
echo "Default password: admin"
echo ""
echo "Log in and change the admin password immediately!"
2. Designing Grafana Dashboards
{
"dashboard": {
"title": "Node Exporter Full",
"tags": ["templated", "node-exporter"],
"style": "dark",
"timezone": "browser",
"panels": [
{
"datasource": "Prometheus",
"description": "Overall CPU usage",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 2,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "9.3.2",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"interval": "",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"title": "CPU Usage",
"type": "gauge"
},
{
"datasource": "Prometheus",
"description": "Memory usage breakdown",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "normal"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Used"
},
"properties": [
{
"id": "color",
"value": {
"fixedColor": "red",
"mode": "fixed"
}
}
]
},
{
"matcher": {
"id": "byName",
"options": "Cached"
},
"properties": [
{
"id": "color",
"value": {
"fixedColor": "yellow",
"mode": "fixed"
}
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 3,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"expr": "node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes",
"interval": "",
"legendFormat": "Used",
"refId": "A"
},
{
"expr": "node_memory_Buffers_bytes",
"hide": false,
"interval": "",
"legendFormat": "Buffers",
"refId": "B"
},
{
"expr": "node_memory_Cached_bytes",
"hide": false,
"interval": "",
"legendFormat": "Cached",
"refId": "C"
}
],
"title": "Memory Usage",
"type": "timeseries"
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"templating": {
"list": [
{
"current": {
"selected": false,
"text": "All",
"value": "$__all"
},
"datasource": "Prometheus",
"definition": "label_values(node_cpu_seconds_total, instance)",
"hide": 0,
"includeAll": true,
"multi": true,
"name": "instance",
"options": [],
"query": {
"query": "label_values(node_cpu_seconds_total, instance)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"type": "query"
},
{
"current": {
"selected": false,
"text": "1m",
"value": "1m"
},
"hide": 0,
"includeAll": false,
"label": "Interval",
"multi": false,
"name": "interval",
"options": [
{
"selected": true,
"text": "1m",
"value": "1m"
},
{
"selected": false,
"text": "5m",
"value": "5m"
},
{
"selected": false,
"text": "10m",
"value": "10m"
},
{
"selected": false,
"text": "30m",
"value": "30m"
},
{
"selected": false,
"text": "1h",
"value": "1h"
},
{
"selected": false,
"text": "6h",
"value": "6h"
}
],
"query": "1m,5m,10m,30m,1h,6h",
"queryValue": "",
"refresh": 2,
"skipUrlSync": false,
"type": "interval"
}
]
},
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"refresh": "10s",
"schemaVersion": 37,
"version": 1,
"uid": "node-exporter-full"
},
"folderUid": "general",
"message": "Updated dashboard",
"overwrite": true
}
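The JSON above is the payload shape accepted by Grafana's `POST /api/dashboards/db` endpoint, so dashboards can be upserted from scripts as well as provisioned from disk. A sketch that builds (but does not send) such a request; the base URL and token are placeholder assumptions:

```python
import json
import urllib.request

def build_dashboard_import_request(base_url: str, api_token: str, dashboard: dict):
    """Build the HTTP request that would upsert a dashboard via
    Grafana's POST /api/dashboards/db endpoint (not sent here)."""
    payload = {
        "dashboard": dashboard,  # needs "uid" and "title"; "id": None creates a new one
        "folderUid": "general",
        "overwrite": True,
        "message": "Provisioned by script",
    }
    return urllib.request.Request(
        url=f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",  # a Grafana service account token
        },
        method="POST",
    )

req = build_dashboard_import_request(
    "http://grafana.example.com:3000",  # assumed address
    "glsa_example_token",               # hypothetical token
    {"uid": "node-exporter-full", "title": "Node Exporter Full", "id": None},
)
print(req.full_url)
```

Sending it is then just `urllib.request.urlopen(req)`; file-based provisioning (shown in the install script) is usually preferable for dashboards kept in version control, while the API suits one-off imports.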
VI. Alerting Rules and Notification
Alerting principles:
1. Tiered severity: classify alerts by impact (critical, major, warning)
2. Noise reduction: avoid alert storms with sensible silencing and grouping rules
3. Clear and actionable: alerts should state the concrete problem and suggest a fix
4. Multi-channel delivery: important alerts should go out over more than one channel
5. Closed-loop handling: tie alerts to the full incident, response, and postmortem cycle
1. Configuring Alertmanager
# Alertmanager configuration (alertmanager.yml)
global:
  # SMTP settings
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'YourPassword'
  smtp_require_tls: true
  # Slack settings
  slack_api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
  # WeCom (WeChat Work) settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'your-wechat-secret'
  wechat_api_corp_id: 'your-corp-id'
# Routing - how alerts reach receivers
route:
  # Default route
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  # Sub-routes
  routes:
    # Route by severity
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 5m
      continue: true
    # Route by team
    - match_re:
        team: ^(infra|platform).*
      receiver: 'infra-team'
      continue: false
    - match_re:
        team: ^(dev|app).*
      receiver: 'dev-team'
      continue: false
    # Route by service
    - match:
        service: mysql
      receiver: 'dba-team'
      continue: false
    - match:
        service: nginx
      receiver: 'web-team'
      continue: false
    # Business-hours routing (Alertmanager has no time-based label matchers;
    # the route is activated via active_time_intervals, defined below)
    - receiver: 'work-hours-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      continue: true
      active_time_intervals:
        - work-hours

# Named time windows referenced by active_time_intervals
time_intervals:
  - name: work-hours
    time_intervals:
      - weekdays: ['monday:friday']   # Monday through Friday
        times:
          - start_time: '09:00'
            end_time: '18:00'
# Inhibition rules - suppress redundant alerts
inhibit_rules:
  # A NodeDown alert suppresses all other critical alerts for that node
  - source_match:
      alertname: NodeDown
      severity: critical
    target_match:
      severity: critical
    equal: ['instance', 'cluster']
  # A cluster-level outage suppresses node-level alerts
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: 'NodeDown|HighCpuUsage|HighMemoryUsage'
    equal: ['cluster']
  # Network partition suppression
  - source_match:
      alertname: NetworkPartition
    target_match:
      severity: warning
    equal: ['zone']

# Silences - temporarily mute specific alerts
# (configured through the web UI or the API, not in this file)
# Receivers - define how alerts are delivered
receivers:
  # Default receiver
  - name: 'default-receiver'
    email_configs:
      - to: 'alerts@example.com'
        send_resolved: true
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        html: |
          <h2>{{ .GroupLabels.alertname }}</h2>
          <p>Status: {{ .Status | toUpper }}</p>
          <p>Started: {{ (index .Alerts 0).StartsAt }}</p>
          <p>Ended: {{ (index .Alerts 0).EndsAt }}</p>
          <p>Summary: {{ .CommonAnnotations.summary }}</p>
          <p>Description: {{ .CommonAnnotations.description }}</p>
          <h3>Alert details</h3>
          <table>
            <tr><th>Label</th><th>Value</th></tr>
            {{ range .GroupLabels.SortedPairs }}
            <tr><td>{{ .Name }}</td><td>{{ .Value }}</td></tr>
            {{ end }}
          </table>
    webhook_configs:
      - url: 'http://alert-webhook.example.com/alerts'
        send_resolved: true
  # Critical-alert receiver
  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com, manager@example.com'
        send_resolved: true
    # Slack notification
    slack_configs:
      - channel: '#alerts-critical'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Labels:*
          {{ range .Labels.SortedPairs }} • {{ .Name }}: {{ .Value }}
          {{ end }}
          {{ end }}
        send_resolved: true
        color: 'danger'  # red
    # WeCom notification
    wechat_configs:
      - agent_id: '1000002'
        to_user: '@all'
        to_party: '2'
        message: '{{ template "wechat.default.message" . }}'
        send_resolved: true
    # Phone call (via a third-party service)
    webhook_configs:
      - url: 'http://phone-alert-service.example.com/call'
        send_resolved: false
  # Infrastructure team
  - name: 'infra-team'
    email_configs:
      - to: 'infra-team@example.com'
    slack_configs:
      - channel: '#infra-alerts'
  # Development team
  - name: 'dev-team'
    email_configs:
      - to: 'dev-team@example.com'
    slack_configs:
      - channel: '#dev-alerts'
  # DBA team
  - name: 'dba-team'
    email_configs:
      - to: 'dba-team@example.com'
    webhook_configs:
      - url: 'http://dba-alert.example.com/webhook'
  # Business-hours receiver
  - name: 'work-hours-receiver'
    email_configs:
      - to: 'work-hours-team@example.com'
    slack_configs:
      - channel: '#work-hours-alerts'
  # DingTalk robot. Upstream Alertmanager has no dingtalk_configs; send
  # through a bridge such as prometheus-webhook-dingtalk (default port 8060),
  # which posts to https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
  # and renders a markdown template on the bridge side, for example:
  #   {{ range .Alerts }}
  #   ## [{{ .Status | toUpper }}] {{ .Labels.alertname }}
  #   **Started**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
  #   **Instance**: {{ .Labels.instance }}
  #   **Summary**: {{ .Annotations.summary }}
  #   **Description**: {{ .Annotations.description }}
  #   {{ if .Annotations.runbook }}**Runbook**: [view]({{ .Annotations.runbook }}){{ end }}
  #   ---
  #   {{ end }}
  - name: 'dingtalk-receiver'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/webhook1/send'
        send_resolved: true
  # WeCom (Enterprise WeChat) receiver
  - name: 'wechat-work-receiver'
    wechat_configs:
      - api_secret: 'your-secret'
        corp_id: 'your-corp-id'
        agent_id: '1000002'
        message: '{{ template "wechat.default.message" . }}'
        send_resolved: true
  # SMS (via Alibaba Cloud, Tencent Cloud, etc.)
  - name: 'sms-receiver'
    webhook_configs:
      - url: 'http://sms-gateway.example.com/send'
        send_resolved: false

# Notification template files
templates:
  - '/etc/alertmanager/templates/*.tmpl'
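Before shipping a routing tree like the one above, it helps to reason about which receivers a given alert will actually reach. The sketch below simulates the match / match_re / continue semantics in plain Python; it is a simplified model (real Alertmanager fully anchors regexes and nests routes more subtly), not an Alertmanager library.

```python
import re

# A route: label matchers, a receiver, child routes, and a "continue" flag
# (if True, keep evaluating sibling routes after this one matches).
def route(receiver=None, match=None, match_re=None, routes=None, cont=False):
    return {"receiver": receiver, "match": match or {},
            "match_re": match_re or {}, "routes": routes or [], "continue": cont}

def matches(r, labels):
    for k, v in r["match"].items():
        if labels.get(k) != v:
            return False
    for k, pattern in r["match_re"].items():
        if not re.match(pattern, labels.get(k, "")):
            return False
    return True

def resolve(node, labels, default):
    """Return the receivers an alert with `labels` would be routed to."""
    receivers = []
    for child in node["routes"]:
        if not matches(child, labels):
            continue
        receivers += resolve(child, labels, child["receiver"] or default)
        if not child["continue"]:
            return receivers or [default]
    return receivers or [default]

# A miniature version of the tree above
tree = route(receiver="default-receiver", routes=[
    route("critical-receiver", match={"severity": "critical"}, cont=True),
    route("infra-team", match_re={"team": r"^(infra|platform).*"}),
    route("dba-team", match={"service": "mysql"}),
])

print(resolve(tree, {"severity": "critical", "team": "infra"}, "default-receiver"))
# ['critical-receiver', 'infra-team'] - continue: true fans out to both
```

A critical infrastructure alert reaches both the critical receiver and the team receiver precisely because of `continue: true` on the severity route; flipping it to `false` would stop evaluation after the first match.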
2. Alert Rule Best Practices

| Alert level | Response time | Notification channels | Example scenarios |
|---|---|---|---|
| Critical | within 5 minutes | phone + SMS + DingTalk + email | core service unavailable, database down, major security incident |
| Major | within 30 minutes | DingTalk + email + Slack | severe performance degradation, low disk space, high CPU usage |
| Warning | within 2 hours | email + WeCom | disk usage warning, memory usage warning, service restart |
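The table maps naturally onto a dispatch table in code; a small illustrative sketch (channel names and SLA values mirror the table, nothing more):

```python
# The severity table as a dispatch map (channel names are placeholders).
CHANNELS = {
    "critical": ["phone", "sms", "dingtalk", "email"],
    "major":    ["dingtalk", "email", "slack"],
    "warning":  ["email", "wechat"],
}
RESPONSE_SLA_MINUTES = {"critical": 5, "major": 30, "warning": 120}

def dispatch(alert):
    """Return the notification channels for an alert based on its severity."""
    severity = alert.get("severity", "warning")
    return CHANNELS.get(severity, CHANNELS["warning"])

print(dispatch({"severity": "critical"}))  # ['phone', 'sms', 'dingtalk', 'email']
```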
# Enterprise alert rule examples
groups:
  # ============ Infrastructure alerts ============
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "Node down: {{ $labels.instance }}"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute"
          runbook: "https://runbook.example.com/node-down"
          dashboard: "https://grafana.example.com/d/node-overview"
      # High CPU usage
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          # $value is already a percentage, so use humanize; humanizePercentage
          # would multiply by 100 a second time
          description: "CPU usage has been above 80% for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-cpu"
          dashboard: "https://grafana.example.com/d/node-cpu"
      # High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage has been above 85% for 10 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-memory"
          dashboard: "https://grafana.example.com/d/node-memory"
      # Critically low disk space
      - alert: DiskSpaceCritical
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 90
        for: 2m
        labels:
          severity: critical
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "Critically low disk space on {{ $labels.instance }}"
          description: "Root filesystem usage has been above 90% for 2 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/disk-space"
          dashboard: "https://grafana.example.com/d/node-disk"
      # Disk space warning
      - alert: DiskSpaceWarning
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
        for: 10m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "Disk space warning on {{ $labels.instance }}"
          description: "Root filesystem usage has been above 80% for 10 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/disk-space"
          dashboard: "https://grafana.example.com/d/node-disk"
      # High system load
      - alert: HighSystemLoad
        # counting the mode="system" series per instance yields the CPU core count
        expr: node_load1 > count by(instance) (node_cpu_seconds_total{mode="system"}) * 1.5
        for: 5m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "1-minute load has exceeded 1.5x the CPU core count for 5 minutes. Current value: {{ $value | humanize }}"
          runbook: "https://runbook.example.com/high-load"
          dashboard: "https://grafana.example.com/d/node-load"
  # ============ Application alerts ============
  - name: application_alerts
    interval: 15s
    rules:
      # Service down
      - alert: ServiceDown
        # `up == 0` covers every scrape job; in practice, narrow it with a
        # job matcher so it does not overlap with NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: development
          domain: application
        annotations:
          summary: "Service down: {{ $labels.job }}"
          description: "Service {{ $labels.job }} (instance {{ $labels.instance }}) has been down for more than 1 minute"
          runbook: "https://runbook.example.com/service-down"
          dashboard: "https://grafana.example.com/d/service-overview"
      # High request latency
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, instance)) > 1
        for: 2m
        labels:
          severity: warning
          team: development
          domain: application
        annotations:
          summary: "High request latency on {{ $labels.job }}"
          description: "95th-percentile latency has been above 1 second for 2 minutes. Current value: {{ $value | humanize }}s"
          runbook: "https://runbook.example.com/high-latency"
          dashboard: "https://grafana.example.com/d/service-latency"
      # High error rate
      - alert: HighErrorRate
        # aggregate away the status label, otherwise the division has no
        # matching label sets on either side
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: development
          domain: application
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "HTTP 5xx error rate has been above 5% for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-error-rate"
          dashboard: "https://grafana.example.com/d/service-errors"
      # Low request volume (the service may be broken without throwing errors)
      - alert: LowRequestRate
        expr: rate(http_requests_total[10m]) < 10
        for: 5m
        labels:
          severity: warning
          team: development
          domain: application
        annotations:
          summary: "Low request volume on {{ $labels.job }}"
          description: "Request rate has been below 10 req/s for 5 minutes. Current value: {{ $value | humanize }} req/s"
          runbook: "https://runbook.example.com/low-request-rate"
          dashboard: "https://grafana.example.com/d/service-traffic"
  # ============ Database alerts ============
  - name: database_alerts
    interval: 30s
    rules:
      # Too many MySQL connections
      - alert: MySQLHighConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "High MySQL connection count on {{ $labels.instance }}"
          description: "Connections have exceeded 80% of max_connections for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/mysql-connections"
          dashboard: "https://grafana.example.com/d/mysql-overview"
      # MySQL replication lag
      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: critical
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "MySQL replication lag on {{ $labels.instance }}"
          description: "Replica has been more than 30 seconds behind the source for 5 minutes. Current value: {{ $value | humanize }}s"
          runbook: "https://runbook.example.com/mysql-replication"
          dashboard: "https://grafana.example.com/d/mysql-replication"
      # Low InnoDB buffer pool hit rate
      - alert: MySQLInnoDBBufferPoolHitRateLow
        expr: (1 - (mysql_global_status_innodb_buffer_pool_reads / mysql_global_status_innodb_buffer_pool_read_requests)) * 100 < 90
        for: 10m
        labels:
          severity: warning
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "Low InnoDB buffer pool hit rate on {{ $labels.instance }}"
          description: "InnoDB buffer pool hit rate has been below 90% for 10 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/mysql-innodb"
          dashboard: "https://grafana.example.com/d/mysql-innodb"
  # ============ Business-metric alerts ============
  - name: business_alerts
    interval: 1m
    rules:
      # Abnormal drop in order volume
      - alert: OrderRateAbnormalDrop
        # compare the current 10-minute rate with the same window 40 minutes ago
        expr: rate(orders_total[10m]) < rate(orders_total[10m] offset 40m) * 0.5
        for: 5m
        labels:
          severity: critical
          team: business
          domain: business
        annotations:
          summary: "Abnormal drop in order volume"
          description: "The 10-minute order rate is more than 50% below its level 40 minutes ago. Current rate: {{ $value | humanize }} orders/s"
          runbook: "https://runbook.example.com/order-drop"
          dashboard: "https://grafana.example.com/d/business-orders"
      # High payment failure rate
      - alert: HighPaymentFailureRate
        expr: sum(rate(payment_attempts_total{status="failed"}[10m])) / sum(rate(payment_attempts_total[10m])) * 100 > 10
        for: 5m
        labels:
          severity: critical
          team: business
          domain: business
        annotations:
          summary: "High payment failure rate"
          description: "Payment failure rate has been above 10% for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/payment-failure"
          dashboard: "https://grafana.example.com/d/business-payments"
      # Drop in user activity
      - alert: UserActivityDrop
        expr: active_users_total < (active_users_total offset 1d) * 0.7
        for: 1h
        labels:
          severity: warning
          team: business
          domain: business
        annotations:
          summary: "Drop in user activity"
          description: "Active users are more than 30% below the same time yesterday. Current value: {{ $value | humanize }}"
          runbook: "https://runbook.example.com/user-activity"
          dashboard: "https://grafana.example.com/d/business-users"
  # ============ Blackbox probe alerts ============
  - name: blackbox_alerts
    interval: 30s
    rules:
      # HTTP probe failed
      - alert: HTTPProbeFailed
        expr: probe_success{job="blackbox-http"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
          domain: availability
        annotations:
          summary: "HTTP service unavailable: {{ $labels.instance }}"
          description: "HTTP probe of {{ $labels.instance }} has been failing for 1 minute"
          runbook: "https://runbook.example.com/http-probe-failed"
          dashboard: "https://grafana.example.com/d/blackbox-http"
      # SSL certificate expiring soon
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox-https"} - time() < 86400 * 30  # expires within 30 days
        for: 0m
        labels:
          severity: warning
          team: infrastructure
          domain: security
        annotations:
          summary: "SSL certificate expiring soon on {{ $labels.instance }}"
          description: "The SSL certificate expires in {{ $value | humanizeDuration }}"
          runbook: "https://runbook.example.com/ssl-cert-expiring"
          dashboard: "https://grafana.example.com/d/blackbox-ssl"
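All of the rules above lean on the `for:` clause: the expression must stay true for the entire window before the alert leaves the pending state, which filters out momentary spikes. A small Python sketch of that pending→firing logic, assuming evenly spaced evaluations (this models the idea, not Prometheus internals):

```python
# Simulate PromQL's `for:` clause: an alert fires only after its expression
# has been continuously true for the configured duration.
def firing(samples, threshold, for_seconds, step=60):
    """Return the alert state after each evaluation (one sample per step)."""
    held = 0          # how long the condition has been continuously true
    states = []
    for value in samples:
        held = held + step if value > threshold else 0
        if held >= for_seconds:
            states.append("firing")
        elif held > 0:
            states.append("pending")
        else:
            states.append("inactive")
    return states

# Six evaluations above an 85% threshold with `for: 5m` (300s) at 60s steps:
print(firing([90, 90, 90, 90, 90, 90], 85, 300))
# ['pending', 'pending', 'pending', 'pending', 'firing', 'firing']

# A single dip below the threshold resets the timer:
print(firing([90, 90, 50, 90, 90], 85, 180))
# ['pending', 'pending', 'inactive', 'pending', 'pending']
```

This is why a flapping metric with `for: 5m` may never fire at all: every dip resets the held duration to zero.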
VII. High Availability and Production Hardening
Production requirements for the monitoring stack itself:
1. High availability: the monitoring system must not become a single point of failure
2. Scalability: support thousands of monitored nodes and petabyte-scale storage
3. Performance: second-level query latency with bounded resource consumption
4. Security: access control, encryption in transit, audit logging
5. Maintainability: automated deployment, configuration management, self-healing
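Running two identical replicas (the standard HA pattern below) means every series exists twice; the query layer must deduplicate by a replica label, which is what Thanos Query does with `--query.replica-label`. A minimal Python sketch of that idea (the data shapes here are illustrative, not a Thanos API):

```python
# Replica-label deduplication: two series that differ only in the replica
# label collapse into one at query time.
def deduplicate(series, replica_label="replica"):
    seen = {}
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        seen.setdefault(key, value)  # keep the first replica's sample
    return list(seen.values())

series = [
    ({"instance": "node-01", "replica": "A"}, 0.42),
    ({"instance": "node-01", "replica": "B"}, 0.42),
    ({"instance": "node-02", "replica": "A"}, 0.17),
]
print(deduplicate(series))  # [0.42, 0.17]
```

The same principle explains why the two replicas must carry distinct `external_labels` (`replica: 'A'` / `replica: 'B'`): without them, the query layer cannot tell the copies apart.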
1. Prometheus High-Availability Architecture
# Prometheus high-availability configuration examples
# 1. Run two identical Prometheus replicas
# prometheus-a.yml and prometheus-b.yml are identical except for external_labels
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    replica: 'A'  # replica A uses 'A', replica B uses 'B'

# 2. Service discovery (so both replicas scrape the same targets)
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      # relabel regexes are fully anchored, so allow surrounding tags
      - source_labels: [__meta_consul_tags]
        regex: '.*,(production|staging|dev),.*'
        target_label: env
# 3. Thanos Sidecar (runs alongside each Prometheus instance)
# Note: Thanos components are actually configured with command-line flags;
# the YAML below is an illustrative layout whose keys map to flags such as
# --grpc-address, --http-address, --prometheus.url, --tsdb.path and
# --objstore.config-file (the objstore block is the real objstore.yml format).
# thanos-sidecar.yml
prometheus:
  external_url: "http://prometheus-a.example.com:9090"
  # or "http://prometheus-b.example.com:9090"
thanos:
  sidecar:
    grpc_address: "0.0.0.0:10901"
    http_address: "0.0.0.0:10902"
    prometheus_url: "http://localhost:9090"
    tsdb_path: "/var/lib/prometheus"
  objstore:
    type: S3
    config:
      bucket: "thanos-metrics"
      endpoint: "s3.example.com"
      access_key: "YOUR_ACCESS_KEY"
      secret_key: "YOUR_SECRET_KEY"
      insecure: false
      signature_version2: false
      put_user_metadata: {}
      http_config:
        idle_conn_timeout: 90s
        response_header_timeout: 2m
    trace:
      enable: true
  compactor:
    data_dir: "/var/lib/thanos/compactor"
    retention: 30d
  query:
    http_address: "0.0.0.0:10903"
    grpc_address: "0.0.0.0:10904"
    store:
      - "prometheus-a.example.com:10901"
      - "prometheus-b.example.com:10901"
      - "thanos-store.example.com:10901"
# 4. Thanos Query front end (same caveat: expressed as flags in practice)
# thanos-query.yml
http:
  address: "0.0.0.0:10905"
  grace_period: 2m
grpc:
  address: "0.0.0.0:10906"
query:
  # deduplicate across the two replicas by dropping these labels
  replica_labels:
    - "replica"
    - "prometheus_replica"
  auto_downsampling: true
  partial_response: true
  default_evaluation_interval: 1m
stores:
  - "prometheus-a.example.com:10901"
  - "prometheus-b.example.com:10901"
  - "thanos-store.example.com:10901"
# 5. Load balancing (Nginx)
# nginx.conf
upstream prometheus {
    zone prometheus 64k;
    server prometheus-a.example.com:9090 max_fails=3 fail_timeout=30s;
    server prometheus-b.example.com:9090 max_fails=3 fail_timeout=30s;
    keepalive 16;
}
upstream thanos_query {
    zone thanos_query 64k;
    server thanos-query.example.com:10905 max_fails=3 fail_timeout=30s;
    keepalive 16;
}
server {
    listen 80;
    server_name prometheus.example.com;
    location / {
        proxy_pass http://prometheus;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Failover behavior
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;
    }
}
server {
    listen 80;
    server_name thanos.example.com;
    location / {
        proxy_pass http://thanos_query;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
# 6. Target sharding (when one server cannot scrape everything)
# Hash each target onto a shard; each Prometheus instance keeps one shard
scrape_configs:
  - job_name: 'node-shard-0'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      # shard by a hash of the service ID
      - source_labels: [__meta_consul_service_id]
        action: hashmod
        modulus: 2
        target_label: __tmp_hash
      - source_labels: [__tmp_hash]
        action: keep
        regex: ^0$  # this instance only scrapes targets hashing to 0
  - job_name: 'node-shard-1'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service_id]
        action: hashmod
        modulus: 2
        target_label: __tmp_hash
      - source_labels: [__tmp_hash]
        action: keep
        regex: ^1$  # this instance only scrapes targets hashing to 1
# 7. Remote write (ship data to a VictoriaMetrics cluster)
remote_write:
  - url: "http://vminsert:8480/insert/0/prometheus/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      capacity: 100000
      max_shards: 30
    write_relabel_configs:
      # only keep the important metrics
      - action: keep
        regex: "up|node_.*|process_.*|prometheus_.*"
        source_labels: [__name__]
# 8. Resource limits
# Constrain resource usage through systemd (cgroups)
[Service]
MemoryMax=8G      # MemoryLimit= is the deprecated cgroup-v1 spelling
CPUQuota=200%
IOWeight=100
TasksMax=10000
# 9. Data retention policy
# Local retention is set with command-line flags, not in prometheus.yml:
#   --storage.tsdb.retention.time=15d   # keep 15 days locally
#   --storage.tsdb.retention.size=500GB # optional size-based cap
# prometheus.yml itself only carries the out-of-order window (Prometheus >= 2.39):
storage:
  tsdb:
    out_of_order_time_window: 1h
# Longer retention lives in remote storage
remote_write:
  - url: "http://long-term-storage:9090/api/v1/write"
    remote_timeout: 30s
    # add write_relabel_configs here if only a subset of metrics should be kept
# 10. Backup strategy
#!/bin/bash
# backup_prometheus.sh - back up Prometheus via the TSDB snapshot API
# (requires --web.enable-admin-api; avoids stopping the server)
BACKUP_DIR="/backup/prometheus"
DATE=$(date +%Y%m%d_%H%M%S)

# Trigger a snapshot; the response contains the snapshot directory name
SNAPSHOT=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | sed -n 's/.*"name":"\([^"]*\)".*/\1/p')

# Archive the snapshot and the configuration
tar czf "$BACKUP_DIR/prometheus_data_$DATE.tar.gz" "/var/lib/prometheus/snapshots/$SNAPSHOT"
tar czf "$BACKUP_DIR/prometheus_config_$DATE.tar.gz" /etc/prometheus/

# Remove the snapshot to reclaim disk space
rm -rf "/var/lib/prometheus/snapshots/$SNAPSHOT"

echo "Backup complete: $BACKUP_DIR/prometheus_data_$DATE.tar.gz"
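The hashmod sharding in step 6 can be sanity-checked offline before committing to a shard count. The sketch below mimics the idea of hashing each target onto a shard; Prometheus internally folds an MD5 sum into a uint64, and this sketch follows the same idea but not the exact bit layout, so it illustrates the distribution rather than predicting Prometheus's exact assignment:

```python
import hashlib

# Approximate the `hashmod` relabel action: hash the joined source-label
# value and take it modulo the shard count.
def shard_of(value, modulus=2):
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus

targets = ["node-01:9100", "node-02:9100", "node-03:9100", "node-04:9100"]
for shard in range(2):
    kept = [t for t in targets if shard_of(t) == shard]
    print(f"shard {shard} scrapes: {kept}")
```

The key properties to verify are that every target lands on exactly one shard and that the assignment is deterministic, so both the `node-shard-0` and `node-shard-1` jobs agree on who scrapes what.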
2. Performance Tuning Suggestions
- Storage: keep TSDB data on SSDs, enable compression, and clean up expired data regularly.
- Queries: precompute common queries with recording rules, optimize PromQL, and avoid high-cardinality queries.
- Collection: choose sensible scrape intervals, use connection pooling, enable HTTP/2, and cut network round trips.
- Memory: tune block sizes, watch memory usage, lean on memory-mapped files, and guard against OOM.
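Since high-cardinality labels are the most common cause of memory blow-ups and slow queries, it is worth auditing them periodically. A sketch that computes per-label cardinality from a list of series (series are modeled here as plain label dicts; in practice you would pull them from the `/api/v1/series` endpoint):

```python
from collections import defaultdict

# Compute per-label cardinality over a set of series. Labels whose value
# count approaches the series count (user IDs, request IDs, full URLs) are
# the high-cardinality offenders to relabel away or drop.
def label_cardinality(series):
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}

series = [
    {"__name__": "http_requests_total", "instance": "a", "user_id": "u1"},
    {"__name__": "http_requests_total", "instance": "a", "user_id": "u2"},
    {"__name__": "http_requests_total", "instance": "b", "user_id": "u3"},
]
card = label_cardinality(series)
print(card)  # {'__name__': 1, 'instance': 2, 'user_id': 3} -> user_id is the offender
```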
VIII. Summary and Best Practices
What a mature monitoring system delivers:
1. Full coverage: infrastructure, applications, and business all monitored
2. Timely alerting: detection time drops from hours to minutes
3. Fast diagnosis: MTTR (mean time to repair) falls significantly
4. Data-driven decisions: monitoring data feeds capacity planning and performance tuning
5. Team enablement: development, operations, and business teams can all use the data
1. Implementation Roadmap

| Phase | Duration | Main tasks | Key deliverables |
|---|---|---|---|
| Phase 1 | 1-2 weeks | infrastructure monitoring, basic alerting | server monitoring, baseline alert rules |
| Phase 2 | 2-4 weeks | application and business monitoring | application performance monitoring, key business metrics |
| Phase 3 | 4-8 weeks | high availability, automation, tuning | HA monitoring stack, automated deployment |
| Phase 4 | ongoing | intelligent analysis, predictive alerting | anomaly detection, capacity forecasting, AIOps |
2. Monitoring System Checklist
# Enterprise monitoring system checklist
"""
1. Data collection
□ Node Exporter deployed on every server
□ Dedicated exporters for key services (MySQL, Nginx, ...)
□ Application-level business metrics collected
□ Network probing (blackbox monitoring) configured
□ Log-derived metrics collected (via Loki or ELK)
2. Storage and processing
□ Prometheus deployed for high availability
□ Sensible retention policy (local + remote)
□ Alert rules clearly categorized
□ Recording rules speed up common queries
□ Remote write configured correctly
3. Visualization
□ Grafana dashboards cover every monitoring dimension
□ Dashboards clearly categorized and organized
□ Key metrics visualized in real time
□ Historical data available for retrospective analysis
□ Permissions configured correctly
4. Alerting and notification
□ Sensible severity tiers (critical/major/warning)
□ Full channel coverage (email/DingTalk/SMS)
□ Reasonable inhibition rules
□ Disciplined silence management
□ Clear alert-handling workflow
5. High availability
□ Multiple Prometheus replicas
□ Alertmanager running as a cluster
□ Grafana backed by persistent storage
□ Load balancing configured correctly
□ Backup and restore procedure verified
6. Performance
□ Query latency at the second level
□ Scrape intervals reasonable (no pressure on targets)
□ Memory usage under control (no OOM risk)
□ Sufficient disk space (with early warning)
□ Adequate network bandwidth
7. Security
□ Access control on the monitoring stack
□ Data encrypted in transit (HTTPS)
□ Authentication and authorization configured
□ Audit logging enabled
□ Secrets and sensitive data protected
8. Operations
□ Configuration under version control
□ Automated deployment scripts
□ The monitoring system monitors itself
□ Capacity planning and forecasting
□ Regular recovery drills
9. Documentation
□ Complete architecture design docs
□ Deployment and operations manual
□ Alert runbooks
□ Troubleshooting guide
□ Training material
10. Compliance
□ Data retention meets regulatory requirements
□ Audit logs satisfy compliance
□ Access control follows security policy
□ Alert notification meets SLAs
□ Standardized incident-handling process
"""
# Monitoring maturity assessment
MATURITY_LEVELS = {
    "Level 1 - Basic": "basic server metrics, manual alerting",
    "Level 2 - Standard": "application and business monitoring, automated alerting",
    "Level 3 - Advanced": "full-stack monitoring, predictive alerting",
    "Level 4 - Intelligent": "AIOps, automated remediation, business impact analysis"
}
3. Common Problems and Solutions
Problem: storage grows too fast and queries slow down
Solutions:
1. Tune block sizes and retention time
2. Precompute common queries with recording rules
3. Cap label cardinality (avoid high-cardinality labels)
4. Enable sharding and remote storage
Problem: alert storms overwhelm the notification channels
Solutions:
1. Configure inhibition rules sensibly
2. Set reasonable alert intervals and wait times
3. Group alerts (by service, by instance)
4. Implement alert escalation and de-escalation
Problem: metrics are inconsistent or missing across instances
Solutions:
1. Synchronize clocks everywhere (NTP)
2. Configure sensible scrape timeouts
3. Monitor the health of the exporters themselves
4. Run data consistency checks
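The grouping advice for alert storms is easy to picture in code: many firing alerts collapse into one notification per group key, which is the effect of `group_by` in the Alertmanager route tree. An illustrative sketch:

```python
from collections import defaultdict

# Alertmanager-style grouping: collapse many firing alerts into one
# notification per group key, the same effect as `group_by` in the routes.
def group_alerts(alerts, group_by=("alertname", "cluster")):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# 50 nodes breach the CPU threshold at once, plus one node-down alert
alerts = [
    {"alertname": "HighCpuUsage", "cluster": "prod", "instance": f"node-{i}"}
    for i in range(50)
] + [{"alertname": "NodeDown", "cluster": "prod", "instance": "node-7"}]

groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "notifications")  # 51 alerts -> 2 notifications
```

Fifty simultaneous CPU alerts become a single grouped notification, which is what keeps the on-call phone usable during a cluster-wide incident.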
4. Future Trends
AIOps
Machine-learning anomaly detection, root-cause analysis, automated remediation.
Observability
Metrics, logs, and traces treated as one discipline, with end-to-end tracing.
Cloud-native monitoring
Kubernetes-native monitoring, Service Mesh monitoring, Serverless monitoring.
Business observability
Business-metric monitoring, user-experience monitoring, business impact analysis.
Parting advice:
1. Start small: begin with core business services and expand coverage step by step
2. Iterate continuously: a monitoring system needs ongoing tuning and evolution
3. Culture first: build a data-driven operations culture
4. Tools are means: the goal is solving problems, not running tools
5. Everyone participates: monitoring is not just an ops concern; development, QA, and business teams must all take part
📊 Monitoring is the eyes of operations, and alerting is its ears!
A solid enterprise monitoring and alerting system is the foundation of stable business operations. With the material above, you should be able to build a complete pipeline from data collection through storage and processing to visualization and alert notification. Remember: the goal of monitoring is not to collect data, but to use data to find, fix, and prevent problems.
Questions and suggestions are welcome in the comments!