
Building an Enterprise-Grade Monitoring and Alerting System

This article walks through building a complete enterprise-grade monitoring and alerting system from scratch with Prometheus and Grafana. It covers monitoring architecture design, Prometheus installation and configuration, Exporter deployment, Grafana dashboard design, alerting rules, notification-channel integration (email, DingTalk, WeCom, and more), and production tuning advice. By the end, you will be able to build a highly available, scalable monitoring stack that meets the observability needs of enterprise IT systems.


I. Monitoring Overview and Design Principles

1. Why Monitoring Matters

In modern IT systems, monitoring has moved from "nice to have" to "must have". A well-built monitoring system enables you to:

Detect problems fast

Spot anomalies in real time, before small issues grow into major outages.

Ground performance optimization

Analyze historical data to find bottlenecks and guide capacity planning and tuning.

Support post-incident analysis

Keep a record of historical system state for root-cause analysis and accountability.

Back SLA commitments

Quantify availability and performance, providing the data behind SLAs (Service Level Agreements).

📊 The value of monitoring data:

According to Google SRE (Site Reliability Engineering) experience, effective monitoring can cut MTTR (mean time to repair) by more than 70% and raise failure-prediction accuracy above 85%.

2. Design Principles (Google's Four Golden Signals)

The four golden signals proposed by Google's SRE team are the core guiding principles of monitoring design:

- Latency: the time it takes the service to handle a request
- Traffic: the request volume or concurrency the system carries
- Errors: the fraction of requests that fail or return errors
- Saturation: how heavily the system's resources are utilized
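Expressed as Prometheus recording rules, and assuming a service that exposes the conventional `http_requests_total` counter and `http_request_duration_seconds` histogram (hypothetical metric names, matching the naming conventions used later in this article), the four golden signals might be sketched like this:

```yaml
groups:
  - name: golden_signals
    rules:
      # Latency: 95th-percentile request duration over the last 5 minutes
      - record: service:http_request_latency_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      # Traffic: requests per second
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      # Errors: fraction of requests returning 5xx
      - record: service:http_errors:ratio_rate5m
        expr: >
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
      # Saturation: fraction of CPU time spent non-idle, per instance
      - record: instance:cpu_busy:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```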

3. Monitoring Layers

A comprehensive monitoring system should cover the following four layers:

- Business monitoring
- Application monitoring
- System monitoring
- Network monitoring

🔍 Monitoring coverage checklist:

1. Business metrics: success rate of key business flows, user activity, order volume, etc.
2. Application metrics: response time, error rate, throughput, JVM/GC state, etc.
3. System metrics: CPU, memory, disk, network utilization, process state, etc.
4. Network metrics: latency, packet loss, bandwidth utilization, connection counts, etc.
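As a sketch, one representative PromQL query per layer (the business counter `orders_total` is hypothetical; the others are standard Node Exporter / Blackbox Exporter metrics):

```promql
# Business: orders per minute
sum(rate(orders_total[5m])) * 60

# Application: 95th-percentile HTTP request latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# System: memory usage ratio per node
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Network: round-trip probe duration measured by Blackbox Exporter
probe_duration_seconds{job="blackbox"}
```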

II. Monitoring System Architecture

🎯 Architecture design goals:

1. Scalability: support monitoring thousands of nodes
2. High availability: the monitoring system itself must not become a single point of failure
3. Low latency: metric collection and alerting delays measured in seconds
4. Usability: simple configuration, friendly visualization, easy troubleshooting

1. Overall Architecture

An enterprise-grade monitoring system typically uses the following architecture:

# Enterprise monitoring system architecture

"""
Data Collection layer
├── Node Exporter: host-level metrics
├── MySQL Exporter: database monitoring
├── Nginx Exporter: web-server monitoring
├── JMX Exporter: Java application monitoring
├── Blackbox Exporter: network probing
└── Custom exporters: business-metric collection

Storage & Processing layer
├── Prometheus Server: metric scraping, storage, querying
├── Prometheus Alertmanager: alert management
├── Thanos/Cortex: long-term storage and clustering (optional)
└── Time-series databases: VictoriaMetrics/InfluxDB (alternatives)

Visualization layer
├── Grafana: dashboards
├── Custom dashboards: business wallboards
└── Reporting system: periodic report generation

Alerting & Notification layer
├── Email: SMTP integration
├── Instant messaging: DingTalk/WeCom/Slack
├── SMS: cloud-provider APIs
└── Phone calls: automatic calls for emergencies

Auxiliary Components
├── Service discovery: automatic discovery of monitoring targets
├── Configuration management: Ansible/Terraform automation
├── Access control: LDAP/OAuth2 integration
└── Log integration: correlation with Loki/ELK Stack
"""

2. Technology Selection

Comparison of mainstream monitoring solutions:

| Feature | Prometheus | Zabbix | Nagios | DataDog |
| --- | --- | --- | --- | --- |
| License | Open source | Open source | Open source | Commercial |
| Data model | Multi-dimensional time series | Key-value | Status checks | Multi-dimensional time series |
| Query language | PromQL | Limited | None | Proprietary |
| Service discovery | Native | Limited | Not supported | Automatic |
| Visualization | Via Grafana | Built-in | Via plugins | Built-in |
| Ecosystem | Strong community | Strong community | Strong community | Commercial support |
| Cost | Free | Free | Free | Expensive |
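If you want to evaluate the Prometheus + Grafana combination before committing to the full installation walkthrough below, a minimal docker-compose sketch is enough to get both running locally (image tags and the mounted `prometheus.yml` path are assumptions; adjust to your environment):

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.45.0
    ports:
      - "9090:9090"
    volumes:
      # Minimal config: scrape Prometheus itself
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
  grafana:
    image: grafana/grafana-oss:10.0.3
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # change after first login
    depends_on:
      - prometheus
volumes:
  prom-data: {}
```

Browse to http://localhost:3000 and add http://prometheus:9090 as a data source, and you have the core of the stack described above.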

3. Metric Design Conventions

# Metric naming convention
# Format: {__name__}{label1="value1",label2="value2",...}

"""
Naming rules:
1. Separate words with underscores: http_requests_total
2. Standard suffixes:
   - _total: cumulative counter value
   - _count: histogram/summary count
   - _sum: histogram/summary sum
   - _bucket: histogram bucket
   - _info: metadata
3. Standardize units:
   - Time: seconds
   - Memory: bytes
   - Disk: bytes
   - Network: bits/sec

Label design principles:
1. Identifying labels (required):
   - instance: instance identifier (IP:Port)
   - job: job/service name
   - env: environment (prod/staging/dev)

2. Dimensional labels (optional):
   - region: geographic region (e.g. north/east)
   - az: availability zone
   - team: owning team
   - version: application version

3. Label designs to avoid:
   - No high-cardinality labels (e.g. user IDs)
   - Avoid label values that change dynamically
   - Keep the label count modest (typically 5-10)

Example metrics:
# System metrics
node_cpu_seconds_total{mode="idle", instance="192.168.1.100:9100", job="node"}
node_memory_MemFree_bytes{instance="192.168.1.100:9100", job="node"}

# Application metrics
http_requests_total{method="POST", endpoint="/api/users", status="200", job="user-service"}
http_request_duration_seconds_bucket{method="GET", endpoint="/api/products", le="0.1"}

# Business metrics
orders_total{type="new", payment_method="alipay", env="production"}
user_sessions_active{region="north", platform="mobile"}
"""

III. Deploying and Configuring Prometheus

🚀 Prometheus highlights:

1. Multi-dimensional data model: metric name + key-value labels
2. Powerful query language: PromQL for flexible querying and aggregation
3. No dependence on distributed storage: single nodes are self-contained
4. HTTP pull model: actively scrapes metrics from targets
5. Multiple service-discovery mechanisms: Kubernetes, Consul, and more

1. Installing Prometheus

#!/bin/bash
# install_prometheus.sh
# One-shot Prometheus installation script

PROMETHEUS_VERSION="2.45.0"
PROMETHEUS_USER="prometheus"
INSTALL_DIR="/opt/prometheus"
DATA_DIR="/var/lib/prometheus"
CONFIG_DIR="/etc/prometheus"

# Create user and directories
useradd --no-create-home --shell /bin/false $PROMETHEUS_USER
mkdir -p $INSTALL_DIR $DATA_DIR $CONFIG_DIR
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $INSTALL_DIR $DATA_DIR $CONFIG_DIR

# Download and unpack Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar xzf prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
cd prometheus-$PROMETHEUS_VERSION.linux-amd64

# Copy the binaries
cp prometheus promtool $INSTALL_DIR/
chown $PROMETHEUS_USER:$PROMETHEUS_USER $INSTALL_DIR/{prometheus,promtool}
chmod +x $INSTALL_DIR/{prometheus,promtool}

# Copy the default configuration file
cp prometheus.yml $CONFIG_DIR/
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $CONFIG_DIR

# Create the systemd service file
cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target

[Service]
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/prometheus \
    --config.file=$CONFIG_DIR/prometheus.yml \
    --storage.tsdb.path=$DATA_DIR \
    --storage.tsdb.retention.time=30d \
    --web.console.templates=$INSTALL_DIR/consoles \
    --web.console.libraries=$INSTALL_DIR/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.external-url=http://prometheus.example.com \
    --web.enable-lifecycle \
    --web.enable-admin-api

ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF

# Create the configuration directory structure
mkdir -p $CONFIG_DIR/{rules,rules.d,files_sd,targets}
cat > $CONFIG_DIR/prometheus.yml << 'EOF'
# Global configuration
global:
  scrape_interval: 15s      # default scrape interval
  evaluation_interval: 15s  # rule evaluation interval
  external_labels:          # labels attached to all outgoing series
    region: 'north'
    env: 'production'

# Alerting rule files
rule_files:
  - "rules/*.yml"
  - "rules.d/*.yml"

# Scrape configuration
scrape_configs:
  # Monitor Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'monitoring'

  # All Node Exporters
  - job_name: 'node'
    scrape_interval: 30s
    file_sd_configs:
      - files:
        - 'targets/node_*.yml'
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__meta_dns_name]
        target_label: hostname

  # All MySQL instances
  - job_name: 'mysql'
    scrape_interval: 30s
    static_configs:
      - targets: ['mysql-1:9104', 'mysql-2:9104']
        labels:
          database: 'mysql'

  # All Nginx instances
  - job_name: 'nginx'
    scrape_interval: 30s
    static_configs:
      - targets: ['nginx-1:9113', 'nginx-2:9113']
        labels:
          service: 'web'

  # Consul service discovery
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        regex: ',(production|staging|dev),'
        target_label: env
        replacement: '$1'

# Remote write/read (optional)
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      capacity: 5000
      max_shards: 200

remote_read:
  - url: "http://thanos-query:10902/api/v1/read"
    read_recent: true
EOF

# Create an example target file
cat > $CONFIG_DIR/targets/node_servers.yml << 'EOF'
- targets:
  - '192.168.1.100:9100'
  - '192.168.1.101:9100'
  - '192.168.1.102:9100'
  labels:
    datacenter: 'dc1'
    rack: 'rack-a'
EOF

# Create the alerting rule file
cat > $CONFIG_DIR/rules/node_alerts.yml << 'EOF'
groups:
  - name: node_alerts
    interval: 30s
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "高CPU使用率 (实例 {{ $labels.instance }})"
          description: "CPU使用率超过80%已经5分钟。当前值: {{ $value }}%"
          runbook: "https://runbook.example.com/high-cpu"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "高内存使用率 (实例 {{ $labels.instance }})"
          description: "内存使用率超过85%已经5分钟。当前值: {{ $value }}%"

      - alert: DiskSpaceCritical
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 90
        for: 2m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "磁盘空间严重不足 (实例 {{ $labels.instance }})"
          description: "根分区使用率超过90%已经2分钟。当前值: {{ $value }}%"
          runbook: "https://runbook.example.com/disk-space"

      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "节点宕机 (实例 {{ $labels.instance }})"
          description: "节点 {{ $labels.instance }} 已经宕机超过1分钟"
EOF

# Set permissions
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $CONFIG_DIR
chmod -R 644 $CONFIG_DIR/*.yml
chmod 755 $CONFIG_DIR $CONFIG_DIR/{rules,rules.d,files_sd,targets}

# Start the service
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus

# Check status
sleep 3
systemctl status prometheus --no-pager

echo "Prometheus安装完成!"
echo "访问地址: http://$(hostname -I | awk '{print $1}'):9090"
echo "数据目录: $DATA_DIR"
echo "配置目录: $CONFIG_DIR"

2. Prometheus Configuration in Depth

# Advanced Prometheus configuration examples

# 1. Remote storage (VictoriaMetrics)
remote_write:
  - url: "http://victoria-metrics:8428/api/v1/write"
    write_relabel_configs:
      - action: keep
        regex: "node.*|prometheus.*"
        source_labels: [__name__]
    queue_config:
      max_shards: 10
      min_shards: 2
      max_samples_per_send: 500
      capacity: 10000
      batch_send_deadline: "5s"
      min_backoff: "100ms"
      max_backoff: "5s"

# 2. Service discovery (Kubernetes)
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      
      # Take the metrics path from the pod annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      
      # Take the scrape port from the pod annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      
      # Map all pod labels onto the target
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      
      # Record the pod name
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      
      # Record the namespace
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      
      # Record the node name
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node

# 3. Static configuration example
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets:
        - 'app-1.example.com:8080'
        - 'app-2.example.com:8080'
        - 'app-3.example.com:8080'
        labels:
          environment: 'production'
          region: 'us-east-1'
          application: 'user-service'

# 4. File-based service discovery
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
        - '/etc/prometheus/targets/*.json'
        - '/etc/prometheus/targets/*.yml'
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):(\d+)'
        replacement: '${1}'
        target_label: host
      - source_labels: [__address__]
        regex: '(.*):(\d+)'
        replacement: '${2}'
        target_label: port

# 5. Relabeling rules
scrape_configs:
  - job_name: 'example'
    static_configs:
      - targets: ['example.com:80']
    metric_relabel_configs:
      # Drop metrics we don't need
      - action: drop
        regex: 'go_.*'
        source_labels: [__name__]
      
      # Rename metrics
      - source_labels: [__name__]
        regex: 'http_requests_(\w+)'
        replacement: 'http_${1}'
        target_label: __name__
      
      # Derive a hostname label from instance
      - source_labels: [instance]
        regex: '([^:]+):\d+'
        replacement: '${1}'
        target_label: hostname
      
      # Group 5xx status codes under one label value
      - source_labels: [status_code]
        regex: '5..'
        replacement: 'server_error'
        target_label: status_group

# 6. Alerting rule groups (these go in a separate rule file, not prometheus.yml)
groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # System-level alerts
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          domain: infrastructure
        annotations:
          summary: "实例 {{ $labels.instance }} 宕机"
          description: "{{ $labels.instance }} 已经5分钟无法访问"
          runbook: "/runbooks/instance-down.md"
      
      # Resource-level alerts
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 10m
        labels:
          severity: warning
          domain: infrastructure
        annotations:
          summary: "内存使用率过高 {{ $labels.instance }}"
          description: "内存使用率超过90%已经10分钟"
          runbook: "/runbooks/high-memory.md"

  - name: application_alerts
    interval: 15s
    rules:
      # Application-level alerts
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
        for: 2m
        labels:
          severity: warning
          domain: application
        annotations:
          summary: "高请求延迟 {{ $labels.service }}"
          description: "95%的请求延迟超过0.5秒"
          runbook: "/runbooks/high-latency.md"
      
      # Business-level alerts
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
          domain: business
        annotations:
          summary: "高错误率 {{ $labels.service }}"
          description: "错误率超过5%已经5分钟"
          runbook: "/runbooks/high-error-rate.md"

IV. Deploying Exporters

📊 The Exporter ecosystem:

Prometheus has a rich exporter ecosystem that can monitor almost every common service and system. Hundreds of officially and community-maintained exporters cover infrastructure, middleware, databases, applications, and more.

1. Node Exporter (host monitoring)

#!/bin/bash
# install_node_exporter.sh
# One-shot Node Exporter installation script

NODE_EXPORTER_VERSION="1.6.0"
NODE_EXPORTER_USER="node_exporter"
INSTALL_DIR="/opt/node_exporter"

# Create user and install directory
useradd --no-create-home --shell /bin/false $NODE_EXPORTER_USER
mkdir -p $INSTALL_DIR

# Download and unpack
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
tar xzf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
cd node_exporter-$NODE_EXPORTER_VERSION.linux-amd64

# Copy the binary
cp node_exporter $INSTALL_DIR/
chown $NODE_EXPORTER_USER:$NODE_EXPORTER_USER $INSTALL_DIR/node_exporter
chmod +x $INSTALL_DIR/node_exporter

# Create the systemd service
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
User=$NODE_EXPORTER_USER
Group=$NODE_EXPORTER_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.systemd.unit-include="(docker|ssh|nginx|mysql).service" \
  --collector.processes \
  --collector.tcpstat \
  --collector.netdev \
  --collector.netstat \
  --collector.diskstats \
  --collector.filesystem \
  --collector.meminfo \
  --collector.loadavg \
  --collector.stat \
  --collector.vmstat \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --log.level="info"

ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF

# Create the textfile collector directory
mkdir -p /var/lib/node_exporter/textfile_collector
chown -R $NODE_EXPORTER_USER:$NODE_EXPORTER_USER /var/lib/node_exporter

# Create a custom metric-collection script
cat > /usr/local/bin/custom_node_metrics.sh << 'EOF'
#!/bin/bash
# Collect custom node metrics for the textfile collector

OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/custom_metrics.prom"

# 1. System uptime
echo '# HELP node_system_uptime_seconds System uptime in seconds' > $OUTPUT_FILE
echo '# TYPE node_system_uptime_seconds gauge' >> $OUTPUT_FILE
echo "node_system_uptime_seconds $(awk '{print $1}' /proc/uptime)" >> $OUTPUT_FILE

# 2. Logged-in users
LOGIN_USERS=$(who | wc -l)
echo '# HELP node_login_users Number of logged in users' >> $OUTPUT_FILE
echo '# TYPE node_login_users gauge' >> $OUTPUT_FILE
echo "node_login_users $LOGIN_USERS" >> $OUTPUT_FILE

# 3. Zombie processes
ZOMBIE_PROCESSES=$(ps aux | awk '{print $8}' | grep -c Z)
echo '# HELP node_zombie_processes Number of zombie processes' >> $OUTPUT_FILE
echo '# TYPE node_zombie_processes gauge' >> $OUTPUT_FILE
echo "node_zombie_processes $ZOMBIE_PROCESSES" >> $OUTPUT_FILE

# 4. File-handle usage
FILE_HANDLES=$(cat /proc/sys/fs/file-nr | awk '{print $1}')
FILE_HANDLES_MAX=$(cat /proc/sys/fs/file-max)
FILE_HANDLES_PERCENT=$(echo "scale=2; $FILE_HANDLES * 100 / $FILE_HANDLES_MAX" | bc)
echo '# HELP node_file_handles_used File handles used' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_used gauge' >> $OUTPUT_FILE
echo "node_file_handles_used $FILE_HANDLES" >> $OUTPUT_FILE

echo '# HELP node_file_handles_max Maximum file handles' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_max gauge' >> $OUTPUT_FILE
echo "node_file_handles_max $FILE_HANDLES_MAX" >> $OUTPUT_FILE

echo '# HELP node_file_handles_percent File handles usage percent' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_percent gauge' >> $OUTPUT_FILE
echo "node_file_handles_percent $FILE_HANDLES_PERCENT" >> $OUTPUT_FILE

# 5. 15-minute load average
LOAD_15=$(awk '{print $3}' /proc/loadavg)
echo '# HELP node_load15 System load average for 15 minutes' >> $OUTPUT_FILE
echo '# TYPE node_load15 gauge' >> $OUTPUT_FILE
echo "node_load15 $LOAD_15" >> $OUTPUT_FILE

# 6. Inode usage on the root filesystem
DISK_INODES=$(df -i / | awk 'NR==2 {print $5}' | sed 's/%//')
echo '# HELP node_disk_inode_usage_percent Disk inode usage percent for root' >> $OUTPUT_FILE
echo '# TYPE node_disk_inode_usage_percent gauge' >> $OUTPUT_FILE
echo "node_disk_inode_usage_percent $DISK_INODES" >> $OUTPUT_FILE

# 7. Established TCP connections
TCP_ESTABLISHED=$(ss -s | awk '/^TCP:/ {gsub(/,/, "", $4); print $4}')
echo '# HELP node_network_tcp_established Established TCP connections' >> $OUTPUT_FILE
echo '# TYPE node_network_tcp_established gauge' >> $OUTPUT_FILE
echo "node_network_tcp_established $TCP_ESTABLISHED" >> $OUTPUT_FILE

# 8. NTP synchronization status
NTP_SYNC=0
if chronyc tracking 2>/dev/null | grep -q "Leap status.*Normal"; then
    NTP_SYNC=1
elif ntpq -p 2>/dev/null | grep -q "^\*"; then
    NTP_SYNC=1
fi
echo '# HELP node_ntp_synchronized NTP synchronization status (1=synchronized, 0=not synchronized)' >> $OUTPUT_FILE
echo '# TYPE node_ntp_synchronized gauge' >> $OUTPUT_FILE
echo "node_ntp_synchronized $NTP_SYNC" >> $OUTPUT_FILE

# Set file permissions
chown $NODE_EXPORTER_USER:$NODE_EXPORTER_USER $OUTPUT_FILE
chmod 644 $OUTPUT_FILE
EOF

chmod +x /usr/local/bin/custom_node_metrics.sh

# Run every 30 minutes via cron
echo "*/30 * * * * root /usr/local/bin/custom_node_metrics.sh" > /etc/cron.d/node_exporter_custom_metrics

# Start the service
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

# Check status
sleep 2
systemctl status node_exporter --no-pager

echo "Node Exporter安装完成!"
echo "访问地址: http://$(hostname -I | awk '{print $1}'):9100"
echo "Metrics地址: http://$(hostname -I | awk '{print $1}'):9100/metrics"

2. MySQL Exporter

#!/bin/bash
# install_mysql_exporter.sh
# MySQL Exporter installation and configuration

MYSQL_EXPORTER_VERSION="0.15.0"
MYSQL_EXPORTER_USER="mysql_exporter"
INSTALL_DIR="/opt/mysql_exporter"

# Create user and install directory
useradd --no-create-home --shell /bin/false $MYSQL_EXPORTER_USER
mkdir -p $INSTALL_DIR

# Download and unpack
cd /tmp
wget https://github.com/prometheus/mysqld_exporter/releases/download/v$MYSQL_EXPORTER_VERSION/mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64.tar.gz
tar xzf mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64.tar.gz
cd mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64

# Copy the binary
cp mysqld_exporter $INSTALL_DIR/
chown $MYSQL_EXPORTER_USER:$MYSQL_EXPORTER_USER $INSTALL_DIR/mysqld_exporter
chmod +x $INSTALL_DIR/mysqld_exporter

# Create the monitoring user in MySQL
mysql -u root -p << 'EOF'
-- Create the monitoring user
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ExporterPassword123!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';

-- Verify the grants
SHOW GRANTS FOR 'exporter'@'localhost';
EOF

# Create the credentials file
cat > /etc/mysql_exporter.cnf << EOF
[client]
user=exporter
password=ExporterPassword123!
host=localhost
port=3306
EOF

chown $MYSQL_EXPORTER_USER:$MYSQL_EXPORTER_USER /etc/mysql_exporter.cnf
chmod 600 /etc/mysql_exporter.cnf

# Create the systemd service
cat > /etc/systemd/system/mysql_exporter.service << EOF
[Unit]
Description=MySQL Exporter
Documentation=https://github.com/prometheus/mysqld_exporter
After=network.target mysql.service

[Service]
User=$MYSQL_EXPORTER_USER
Group=$MYSQL_EXPORTER_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/mysqld_exporter \
  --web.listen-address=":9104" \
  --config.my-cnf=/etc/mysql_exporter.cnf \
  --collect.global_status \
  --collect.global_variables \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.processlist \
  --collect.info_schema.tables \
  --collect.info_schema.tablestats \
  --collect.info_schema.userstats \
  --collect.perf_schema.eventswaits \
  --collect.perf_schema.file_events \
  --collect.perf_schema.indexiowaits \
  --collect.perf_schema.tableiowaits \
  --collect.slave_status \
  --collect.auto_increment.columns \
  --collect.binlog_size \
  --collect.info_schema.query_response_time \
  --collect.engine_innodb_status \
  --log.level="info"

ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF

# Start the service
systemctl daemon-reload
systemctl enable mysql_exporter
systemctl start mysql_exporter

# Check status
sleep 2
systemctl status mysql_exporter --no-pager

echo "MySQL Exporter安装完成!"
echo "访问地址: http://$(hostname -I | awk '{print $1}'):9104"
echo "Metrics地址: http://$(hostname -I | awk '{print $1}'):9104/metrics"

# Notes on key MySQL metrics
cat << 'EOF'

=== Key MySQL Metrics ===

1. Connections:
   mysql_global_status_threads_connected      # current connections
   mysql_global_status_max_used_connections   # historical peak connections
   mysql_global_variables_max_connections     # configured connection limit

2. Query performance:
   mysql_global_status_questions              # total queries
   mysql_global_status_slow_queries           # slow queries
   rate(mysql_global_status_questions[1m])    # QPS

3. InnoDB:
   mysql_global_status_innodb_buffer_pool_pages_total     # total buffer-pool pages
   mysql_global_status_innodb_buffer_pool_pages_free      # free buffer-pool pages
   mysql_global_status_innodb_row_lock_time_avg           # average row-lock wait time

4. Replication:
   mysql_slave_status_slave_io_running        # IO thread state
   mysql_slave_status_slave_sql_running       # SQL thread state
   mysql_slave_status_seconds_behind_master   # replication lag in seconds

5. Tables:
   mysql_info_schema_table_size_bytes         # table size
   mysql_info_schema_table_rows               # table row count

=== Common Alerting Rules ===

# Connection usage too high
- alert: MySQLHighConnectionUsage
  expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
  for: 5m

# Too many slow queries
- alert: MySQLTooManySlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) > 10
  for: 2m

# Replication lag
- alert: MySQLReplicationLag
  expr: mysql_slave_status_seconds_behind_master > 30
  for: 5m

# Low InnoDB buffer-pool hit rate
- alert: MySQLLowBufferPoolHitRate
  expr: (1 - (mysql_global_status_innodb_buffer_pool_reads / mysql_global_status_innodb_buffer_pool_read_requests)) * 100 < 90
  for: 10m
EOF

3. Other Common Exporters

Blackbox Exporter

Network probing: monitors the availability of HTTP, HTTPS, DNS, TCP, ICMP, and other services.

PostgreSQL Exporter

PostgreSQL monitoring: connection counts, query performance, locks, and more.

Cloud exporters

AWS, Azure, and GCP monitoring: cloud-resource usage and cost.

JMX Exporter

Java application monitoring: JVM performance metrics and application business metrics via JMX.

HAProxy Exporter

Load-balancer monitoring: connection counts, request rates, backend server state, and more.

cAdvisor

Container monitoring: resource usage and performance metrics for Docker containers.
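Of these, Blackbox Exporter is configured "in reverse": Prometheus scrapes the exporter's /probe endpoint and passes the real target as a URL parameter. A typical scrape job, assuming the exporter runs at `blackbox:9115` with an `http_2xx` module defined in its blackbox.yml:

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]              # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://www.example.com
          - https://api.example.com/health
    relabel_configs:
      # Pass the original target as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Point the actual scrape at the Blackbox Exporter itself
      - target_label: __address__
        replacement: blackbox:9115
```

The resulting `probe_success` metric (1 or 0) is then a natural basis for availability alerts.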

V. Visualization and Dashboards with Grafana

📈 Why Grafana:

1. Rich visualizations: graphs, tables, gauges, heatmaps, geomaps, and more
2. Broad data-source support: Prometheus, MySQL, PostgreSQL, Elasticsearch, and more
3. Flexible alerting: visual alert configuration with many notification channels
4. Team collaboration: folder permissions, shared dashboards, versioning

1. Installing and Configuring Grafana

#!/bin/bash
# install_grafana.sh
# One-shot Grafana installation script

GRAFANA_VERSION="10.0.3"
GRAFANA_USER="grafana"
INSTALL_DIR="/opt/grafana"
DATA_DIR="/var/lib/grafana"
CONFIG_DIR="/etc/grafana"
LOG_DIR="/var/log/grafana"

# Download Grafana
cd /tmp
wget https://dl.grafana.com/oss/release/grafana-$GRAFANA_VERSION.linux-amd64.tar.gz
tar xzf grafana-$GRAFANA_VERSION.linux-amd64.tar.gz
mv grafana-$GRAFANA_VERSION $INSTALL_DIR

# Create user and directories
useradd --no-create-home --shell /bin/false $GRAFANA_USER
mkdir -p $DATA_DIR $CONFIG_DIR $LOG_DIR
chown -R $GRAFANA_USER:$GRAFANA_USER $INSTALL_DIR $DATA_DIR $CONFIG_DIR $LOG_DIR

# Create the systemd service
cat > /etc/systemd/system/grafana.service << EOF
[Unit]
Description=Grafana
Documentation=https://grafana.com/docs/
After=network.target

[Service]
User=$GRAFANA_USER
Group=$GRAFANA_USER
Type=simple
Restart=always
RestartSec=5
WorkingDirectory=$INSTALL_DIR
EnvironmentFile=-$CONFIG_DIR/grafana.conf
ExecStart=$INSTALL_DIR/bin/grafana-server \\
  --config=$CONFIG_DIR/grafana.ini \\
  --homepath=$INSTALL_DIR \\
  --packaging=docker \\
  cfg:default.paths.logs=$LOG_DIR \\
  cfg:default.paths.data=$DATA_DIR \\
  cfg:default.paths.plugins=$INSTALL_DIR/plugins \\
  cfg:default.paths.provisioning=$CONFIG_DIR/provisioning

ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF

# Create the main configuration file
cat > $CONFIG_DIR/grafana.ini << 'EOF'
[server]
# Listen address and port
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = false

# Logging
[log]
mode = console file
level = info
format = console

# Database (SQLite by default)
[database]
type = sqlite3
path = grafana.db
max_idle_conn = 2
max_open_conn = 0
conn_max_lifetime = 14400

# Security
[security]
admin_user = admin
admin_password = admin
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = false
data_source_proxy_whitelist = 

# Authentication
[auth]
disable_login_form = false
disable_signout_menu = false

# Anonymous access
[auth.anonymous]
enabled = false
org_name = Main Org.
org_role = Viewer

# Basic auth
[auth.basic]
enabled = true

# Email (alert notifications)
[smtp]
enabled = true
host = smtp.example.com:465
user = alert@example.com
password = YourPassword
from_address = alert@example.com
from_name = Grafana Alert

# Users
[users]
allow_sign_up = false
auto_assign_org = true
auto_assign_org_role = Viewer

# Sessions
[session]
provider = file
provider_config = sessions
cookie_secure = false
session_life_time = 86400

# Analytics
[analytics]
reporting_enabled = true
check_for_updates = true

# Paths
[paths]
data = /var/lib/grafana
logs = /var/log/grafana
plugins = /opt/grafana/plugins
provisioning = /etc/grafana/provisioning

# Snapshots
[snapshots]
external_enabled = true
external_snapshot_url = https://snapshots.example.com
external_snapshot_name = Grafana Snapshots

# Metrics (Grafana self-monitoring)
[metrics]
enabled = true
interval_seconds = 10
EOF

# Provision data sources
mkdir -p $CONFIG_DIR/provisioning/datasources
cat > $CONFIG_DIR/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      prometheusVersion: 2.45.0
      cacheLevel: 'High'
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
    secureJsonData:
      tlsAuth: false
      tlsAuthWithCACert: false

  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://localhost:9093
    editable: true
    jsonData:
      implementation: prometheus
      handleGrafanaManagedAlerts: true

  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: true
    jsonData:
      maxLines: 1000

  - name: Tempo
    type: tempo
    access: proxy
    url: http://localhost:3200
    editable: true
    jsonData:
      nodeGraph:
        enabled: true
      tracesToLogs:
        datasourceUid: 'loki'
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags: ['job', 'instance', 'pod', 'namespace']
        filterByTraceID: true
        filterBySpanID: true
EOF

# Provision dashboard loading
mkdir -p $CONFIG_DIR/provisioning/dashboards
cat > $CONFIG_DIR/provisioning/dashboards/dashboards.yml << 'EOF'
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
EOF

# Create the dashboard directory
mkdir -p /etc/grafana/dashboards

# Start the service
systemctl daemon-reload
systemctl enable grafana
systemctl start grafana

# Check status
sleep 5
systemctl status grafana --no-pager

echo "Grafana安装完成!"
echo "访问地址: http://$(hostname -I | awk '{print $1}'):3000"
echo "默认用户名: admin"
echo "默认密码: admin"
echo ""
echo "请登录后立即修改管理员密码!"

2. Designing Grafana Dashboards

{
  "dashboard": {
    "title": "Node Exporter Full",
    "tags": ["templated", "node-exporter"],
    "style": "dark",
    "timezone": "browser",
    "panels": [
      {
        "datasource": "Prometheus",
        "description": "总体CPU使用率",
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "mappings": [],
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {
                  "color": "green",
                  "value": null
                },
                {
                  "color": "red",
                  "value": 80
                }
              ]
            },
            "unit": "percent"
          },
          "overrides": []
        },
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 0,
          "y": 0
        },
        "id": 2,
        "options": {
          "orientation": "auto",
          "reduceOptions": {
            "calcs": [
              "lastNotNull"
            ],
            "fields": "",
            "values": false
          },
          "showThresholdLabels": false,
          "showThresholdMarkers": true
        },
        "pluginVersion": "9.3.2",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "interval": "",
            "legendFormat": "{{instance}}",
            "refId": "A"
          }
        ],
        "title": "CPU Usage",
        "type": "gauge"
      },
      {
        "datasource": "Prometheus",
        "description": "内存使用情况",
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "palette-classic"
            },
            "custom": {
              "axisLabel": "",
              "axisPlacement": "auto",
              "barAlignment": 0,
              "drawStyle": "line",
              "fillOpacity": 10,
              "gradientMode": "none",
              "hideFrom": {
                "legend": false,
                "tooltip": false,
                "viz": false
              },
              "lineInterpolation": "linear",
              "lineWidth": 1,
              "pointSize": 5,
              "scaleDistribution": {
                "type": "linear"
              },
              "showPoints": "never",
              "spanNulls": false,
              "stacking": {
                "group": "A",
                "mode": "normal"
              },
              "thresholdsStyle": {
                "mode": "off"
              }
            },
            "mappings": [],
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {
                  "color": "green",
                  "value": null
                },
                {
                  "color": "red",
                  "value": 80
                }
              ]
            },
            "unit": "bytes"
          },
          "overrides": [
            {
              "matcher": {
                "id": "byName",
                "options": "Used"
              },
              "properties": [
                {
                  "id": "color",
                  "value": {
                    "fixedColor": "red",
                    "mode": "fixed"
                  }
                }
              ]
            },
            {
              "matcher": {
                "id": "byName",
                "options": "Cached"
              },
              "properties": [
                {
                  "id": "color",
                  "value": {
                    "fixedColor": "yellow",
                    "mode": "fixed"
                  }
                }
              ]
            }
          ]
        },
        "gridPos": {
          "h": 8,
          "w": 12,
          "x": 12,
          "y": 0
        },
        "id": 3,
        "options": {
          "legend": {
            "calcs": [],
            "displayMode": "list",
            "placement": "bottom",
            "showLegend": true
          },
          "tooltip": {
            "mode": "single",
            "sort": "none"
          }
        },
        "targets": [
          {
            "expr": "node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes",
            "interval": "",
            "legendFormat": "Used",
            "refId": "A"
          },
          {
            "expr": "node_memory_Buffers_bytes",
            "hide": false,
            "interval": "",
            "legendFormat": "Buffers",
            "refId": "B"
          },
          {
            "expr": "node_memory_Cached_bytes",
            "hide": false,
            "interval": "",
            "legendFormat": "Cached",
            "refId": "C"
          }
        ],
        "title": "Memory Usage",
        "type": "timeseries"
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "timepicker": {
      "refresh_intervals": [
        "5s",
        "10s",
        "30s",
        "1m",
        "5m",
        "15m",
        "30m",
        "1h",
        "2h",
        "1d"
      ],
      "time_options": [
        "5m",
        "15m",
        "1h",
        "6h",
        "12h",
        "24h",
        "2d",
        "7d",
        "30d"
      ]
    },
    "templating": {
      "list": [
        {
          "current": {
            "selected": false,
            "text": "All",
            "value": "$__all"
          },
          "datasource": "Prometheus",
          "definition": "label_values(node_cpu_seconds_total, instance)",
          "hide": 0,
          "includeAll": true,
          "multi": true,
          "name": "instance",
          "options": [],
          "query": {
            "query": "label_values(node_cpu_seconds_total, instance)",
            "refId": "StandardVariableQuery"
          },
          "refresh": 1,
          "regex": "",
          "skipUrlSync": false,
          "sort": 0,
          "type": "query"
        },
        {
          "current": {
            "selected": false,
            "text": "1m",
            "value": "1m"
          },
          "hide": 0,
          "includeAll": false,
          "label": "Interval",
          "multi": false,
          "name": "interval",
          "options": [
            {
              "selected": true,
              "text": "1m",
              "value": "1m"
            },
            {
              "selected": false,
              "text": "5m",
              "value": "5m"
            },
            {
              "selected": false,
              "text": "10m",
              "value": "10m"
            },
            {
              "selected": false,
              "text": "30m",
              "value": "30m"
            },
            {
              "selected": false,
              "text": "1h",
              "value": "1h"
            },
            {
              "selected": false,
              "text": "6h",
              "value": "6h"
            }
          ],
          "query": "1m,5m,10m,30m,1h,6h",
          "queryValue": "",
          "refresh": 2,
          "skipUrlSync": false,
          "type": "interval"
        }
      ]
    },
    "annotations": {
      "list": [
        {
          "builtIn": 1,
          "datasource": {
            "type": "grafana",
            "uid": "-- Grafana --"
          },
          "enable": true,
          "hide": true,
          "iconColor": "rgba(0, 211, 255, 1)",
          "name": "Annotations & Alerts",
          "target": {
            "limit": 100,
            "matchAny": false,
            "tags": [],
            "type": "dashboard"
          },
          "type": "dashboard"
        }
      ]
    },
    "refresh": "10s",
    "schemaVersion": 37,
    "version": 1,
    "uid": "node-exporter-full"
  },
  "folderUid": "general",
  "message": "Updated dashboard",
  "overwrite": true
}
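上面的JSON正是 Grafana HTTP API(POST /api/dashboards/db)的请求体结构,可以用脚本批量导入仪表板。下面是一个最小示意(Grafana 地址与 API Key 均为占位假设,实际导入时再调用 urlopen 发送):

```python
import json
import urllib.request

def build_import_request(grafana_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """构造仪表板导入请求(Grafana HTTP API: POST /api/dashboards/db)。"""
    return urllib.request.Request(
        url=f"{grafana_url}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # API Key 为占位假设
        },
        method="POST",
    )

# 请求体结构与上文JSON一致: dashboard 本体 + folderUid/message/overwrite
payload = {
    "dashboard": {"uid": "node-exporter-full", "title": "Node Exporter", "panels": []},
    "folderUid": "general",
    "message": "Updated dashboard",
    "overwrite": True,
}
req = build_import_request("http://grafana.example.com:3000", "YOUR_API_KEY", payload)
# 实际导入: resp = urllib.request.urlopen(req)
```

`overwrite: true` 表示按 uid 覆盖已有仪表板,适合把仪表板JSON纳入版本库后由CI统一下发。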

六、告警规则与通知配置

🚨 告警设计原则:

1. 分级分类:根据严重程度划分告警级别(紧急、重要、警告)
2. 静默降噪:避免告警风暴,合理设置静默规则
3. 明确可操作:告警信息应包含具体问题和解决建议
4. 多渠道通知:重要告警应通过多种渠道通知
5. 闭环管理:告警应关联事件、处理、复盘全过程

1. Alertmanager配置

# Alertmanager配置 (alertmanager.yml)

global:
  # SMTP配置
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'YourPassword'
  smtp_require_tls: true
  
  # Slack配置
  slack_api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
  
  # 微信企业号配置
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'your-wechat-secret'
  wechat_api_corp_id: 'your-corp-id'

# 路由配置 - 定义告警如何路由到接收器
route:
  # 默认路由
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  
  # 子路由
  routes:
    # 按严重程度路由
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 5m
      continue: true
    
    # 按团队路由
    - match_re:
        team: ^(infra|platform).*
      receiver: 'infra-team'
      continue: false
    
    - match_re:
        team: ^(dev|app).*
      receiver: 'dev-team'
      continue: false
    
    # 按服务路由
    - match:
        service: mysql
      receiver: 'dba-team'
      continue: false
    
    - match:
        service: nginx
      receiver: 'web-team'
      continue: false
    
    # 工作时间路由(matchers 不支持时间条件,需用 time_intervals 定义)
    - receiver: 'work-hours-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      continue: true
      active_time_intervals:
        - work-hours

# 时间区间定义(Alertmanager 0.24+,供路由的 active_time_intervals 引用)
time_intervals:
  - name: work-hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']  # 周一到周五

# 告警抑制规则 - 避免重复告警
inhibit_rules:
  # 当有节点宕机告警时,抑制该节点上的所有其他告警
  - source_match:
      alertname: NodeDown
      severity: critical
    target_match:
      severity: critical
    equal: ['instance', 'cluster']
  
  # 当有集群级别故障时,抑制所有节点级别告警
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: 'NodeDown|HighCpuUsage|HighMemoryUsage'
    equal: ['cluster']
  
  # 网络分区告警抑制
  - source_match:
      alertname: NetworkPartition
    target_match:
      severity: warning
    equal: ['zone']

# 静默配置 - 临时关闭特定告警
# 可以通过Web UI或API配置

# 接收器配置 - 定义告警发送方式
receivers:
  # 默认接收器
  - name: 'default-receiver'
    email_configs:
      - to: 'alerts@example.com'
        send_resolved: true
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        html: |
          <h2>{{ .GroupLabels.alertname }}</h2>
          <p><b>状态:</b> {{ .Status | toUpper }}</p>
          <p><b>开始时间:</b> {{ (index .Alerts 0).StartsAt }}</p>
          <p><b>结束时间:</b> {{ (index .Alerts 0).EndsAt }}</p>
          <p><b>摘要:</b> {{ .CommonAnnotations.summary }}</p>
          <p><b>描述:</b> {{ .CommonAnnotations.description }}</p>
          <h3>告警详情</h3>
          <table>
            <tr><th>标签</th><th>值</th></tr>
            {{ range .GroupLabels.SortedPairs }}
            <tr><td>{{ .Name }}</td><td>{{ .Value }}</td></tr>
            {{ end }}
          </table>
          <p><b>运行手册:</b> {{ .CommonAnnotations.runbook }}</p>
    webhook_configs:
      - url: 'http://alert-webhook.example.com/alerts'
        send_resolved: true

  # 紧急告警接收器
  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com, manager@example.com'
        send_resolved: true
    # Slack通知
    slack_configs:
      - channel: '#alerts-critical'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Labels:*
          {{ range .Labels.SortedPairs }} • {{ .Name }}: {{ .Value }}
          {{ end }}
          {{ end }}
        send_resolved: true
        color: 'danger'  # 红色
    # 微信通知
    wechat_configs:
      - agent_id: '1000002'
        to_user: '@all'
        to_party: '2'
        message: '{{ template "wechat.default.message" . }}'
        send_resolved: true
    # 电话告警(通过第三方服务)
    webhook_configs:
      - url: 'http://phone-alert-service.example.com/call'
        send_resolved: false

  # 基础设施团队
  - name: 'infra-team'
    email_configs:
      - to: 'infra-team@example.com'
    slack_configs:
      - channel: '#infra-alerts'

  # 开发团队
  - name: 'dev-team'
    email_configs:
      - to: 'dev-team@example.com'
    slack_configs:
      - channel: '#dev-alerts'

  # DBA团队
  - name: 'dba-team'
    email_configs:
      - to: 'dba-team@example.com'
    webhook_configs:
      - url: 'http://dba-alert.example.com/webhook'

  # 工作时间接收器
  - name: 'work-hours-receiver'
    email_configs:
      - to: 'work-hours-team@example.com'
    slack_configs:
      - channel: '#work-hours-alerts'

  # 钉钉机器人接收器
  # 注意: dingtalk_configs 并非原生 Alertmanager 字段,需配合
  # prometheus-webhook-dingtalk 等中转组件或支持钉钉的发行版使用
  - name: 'dingtalk-receiver'
    dingtalk_configs:
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx'
        message: |
          {{ range .Alerts }}
          ## [{{ .Status | toUpper }}] {{ .Labels.alertname }}
          **开始时间**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
          **实例**: {{ .Labels.instance }}
          **摘要**: {{ .Annotations.summary }}
          **描述**: {{ .Annotations.description }}
          {{ if .Annotations.runbook }}
          **运行手册**: [查看]({{ .Annotations.runbook }})
          {{ end }}
          ---
          {{ end }}
        send_resolved: true

  # 企业微信接收器
  - name: 'wechat-work-receiver'
    wechat_configs:
      - api_secret: 'your-secret'
        corp_id: 'your-corp-id'
        agent_id: '1000002'
        message: '{{ template "wechat.default.message" . }}'
        send_resolved: true

  # 短信接收器(通过阿里云、腾讯云等短信网关)
  - name: 'sms-receiver'
    webhook_configs:
      - url: 'http://sms-gateway.example.com/send'
        send_resolved: false

# 模板配置
templates:
  - '/etc/alertmanager/templates/*.tmpl'
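上面的路由树按"自上而下匹配 + continue 决定是否继续"的语义工作。下面用几行 Python 模拟这一匹配过程,便于在改动路由前预判告警的去向(仅示意 match / match_re 两类匹配器,并非 Alertmanager 的实现):

```python
import re

# 与上文 route 配置对应的简化路由树(仅示意)
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "critical-receiver", "continue": True},
    {"match_re": {"team": r"^(infra|platform).*"}, "receiver": "infra-team", "continue": False},
    {"match_re": {"team": r"^(dev|app).*"}, "receiver": "dev-team", "continue": False},
    {"match": {"service": "mysql"}, "receiver": "dba-team", "continue": False},
    {"match": {"service": "nginx"}, "receiver": "web-team", "continue": False},
]

def route(labels: dict, default: str = "default-receiver") -> list:
    """返回一条告警命中的接收器列表(遇到 continue: False 即停止)。"""
    hit = []
    for r in ROUTES:
        # match: 标签值精确相等; match_re: 正则匹配
        eq = all(labels.get(k) == v for k, v in r.get("match", {}).items())
        rx = all(re.match(p, labels.get(k, "")) for k, p in r.get("match_re", {}).items())
        if eq and rx:
            hit.append(r["receiver"])
            if not r["continue"]:
                return hit
    return hit or [default]

print(route({"severity": "critical", "service": "mysql"}))
# → ['critical-receiver', 'dba-team']
```

可以看到 critical 子路由因 `continue: true` 不会终止匹配,所以一条 MySQL 的紧急告警会同时送达值班与DBA两个接收器。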

2. 告警规则最佳实践

告警级别 响应时间 通知渠道 示例场景
紧急 5分钟内 电话+短信+钉钉+邮件 核心服务不可用、数据库宕机、重大安全事件
重要 30分钟内 钉钉+邮件+Slack 性能严重下降、磁盘空间不足、CPU使用率过高
警告 2小时内 邮件+企业微信 磁盘使用率预警、内存使用率预警、服务重启
# 企业级告警规则示例

groups:
  # ============ 基础设施告警 ============
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # 节点宕机
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "节点宕机 {{ $labels.instance }}"
          description: "节点 {{ $labels.instance }} 已经宕机超过1分钟"
          runbook: "https://runbook.example.com/node-down"
          dashboard: "https://grafana.example.com/d/node-overview"
      
      # CPU使用率过高
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "高CPU使用率 {{ $labels.instance }}"
          description: "CPU使用率超过80%已经5分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-cpu"
          dashboard: "https://grafana.example.com/d/node-cpu"
      
      # 内存使用率过高
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "高内存使用率 {{ $labels.instance }}"
          description: "内存使用率超过85%已经10分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-memory"
          dashboard: "https://grafana.example.com/d/node-memory"
      
      # 磁盘空间严重不足
      - alert: DiskSpaceCritical
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 90
        for: 2m
        labels:
          severity: critical
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "磁盘空间严重不足 {{ $labels.instance }}"
          description: "根分区使用率超过90%已经2分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/disk-space"
          dashboard: "https://grafana.example.com/d/node-disk"
      
      # 磁盘空间预警
      - alert: DiskSpaceWarning
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
        for: 10m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "磁盘空间预警 {{ $labels.instance }}"
          description: "根分区使用率超过80%已经10分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/disk-space"
          dashboard: "https://grafana.example.com/d/node-disk"
      
      # 系统负载过高
      - alert: HighSystemLoad
        expr: node_load1 > count by(instance) (node_cpu_seconds_total{mode="system"}) * 1.5
        for: 5m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "高系统负载 {{ $labels.instance }}"
          description: "1分钟系统负载超过CPU核心数1.5倍已经5分钟。当前值: {{ $value }}"
          runbook: "https://runbook.example.com/high-load"
          dashboard: "https://grafana.example.com/d/node-load"

  # ============ 应用服务告警 ============
  - name: application_alerts
    interval: 15s
    rules:
      # 服务宕机
      - alert: ServiceDown
        expr: up{job=~".*"} == 0
        for: 1m
        labels:
          severity: critical
          team: development
          domain: application
        annotations:
          summary: "服务宕机 {{ $labels.job }}"
          description: "服务 {{ $labels.job }} (实例 {{ $labels.instance }}) 已经宕机超过1分钟"
          runbook: "https://runbook.example.com/service-down"
          dashboard: "https://grafana.example.com/d/service-overview"
      
      # 高请求延迟
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, instance)) > 1
        for: 2m
        labels:
          severity: warning
          team: development
          domain: application
        annotations:
          summary: "高请求延迟 {{ $labels.job }}"
          description: "95%的请求延迟超过1秒已经2分钟。当前值: {{ $value }}秒"
          runbook: "https://runbook.example.com/high-latency"
          dashboard: "https://grafana.example.com/d/service-latency"
      
      # 高错误率
      - alert: HighErrorRate
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: development
          domain: application
        annotations:
          summary: "高错误率 {{ $labels.job }}"
          description: "HTTP 5xx错误率超过5%已经5分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-error-rate"
          dashboard: "https://grafana.example.com/d/service-errors"
      
      # 低请求量(可能服务有问题但没报错)
      - alert: LowRequestRate
        expr: sum by(job) (rate(http_requests_total[10m])) < 10
        for: 5m
        labels:
          severity: warning
          team: development
          domain: application
        annotations:
          summary: "低请求量 {{ $labels.job }}"
          description: "请求率低于10次/秒已经5分钟。当前值: {{ $value }}次/秒"
          runbook: "https://runbook.example.com/low-request-rate"
          dashboard: "https://grafana.example.com/d/service-traffic"

  # ============ 数据库告警 ============
  - name: database_alerts
    interval: 30s
    rules:
      # MySQL连接数过高
      - alert: MySQLHighConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "MySQL连接数过高 {{ $labels.instance }}"
          description: "MySQL连接数超过最大连接数的80%已经5分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/mysql-connections"
          dashboard: "https://grafana.example.com/d/mysql-overview"
      
      # MySQL复制延迟
      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: critical
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "MySQL复制延迟 {{ $labels.instance }}"
          description: "MySQL从库复制延迟超过30秒已经5分钟。当前值: {{ $value }}秒"
          runbook: "https://runbook.example.com/mysql-replication"
          dashboard: "https://grafana.example.com/d/mysql-replication"
      
      # InnoDB缓冲池命中率低
      - alert: MySQLInnoDBBufferPoolHitRateLow
        expr: (1 - (mysql_global_status_innodb_buffer_pool_reads / mysql_global_status_innodb_buffer_pool_read_requests)) * 100 < 90
        for: 10m
        labels:
          severity: warning
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "InnoDB缓冲池命中率低 {{ $labels.instance }}"
          description: "InnoDB缓冲池命中率低于90%已经10分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/mysql-innodb"
          dashboard: "https://grafana.example.com/d/mysql-innodb"

  # ============ 业务指标告警 ============
  - name: business_alerts
    interval: 1m
    rules:
      # 订单量异常下降
      - alert: OrderRateAbnormalDrop
        expr: rate(orders_total[10m]) < rate(orders_total[10m] offset 40m) * 0.5
        for: 5m
        labels:
          severity: critical
          team: business
          domain: business
        annotations:
          summary: "订单量异常下降"
          description: "最近10分钟订单量比40分钟前下降超过50%。当前速率: {{ $value }} 订单/秒"
          runbook: "https://runbook.example.com/order-drop"
          dashboard: "https://grafana.example.com/d/business-orders"
      
      # 支付失败率过高
      - alert: HighPaymentFailureRate
        expr: rate(payment_attempts_total{status="failed"}[10m]) / rate(payment_attempts_total[10m]) * 100 > 10
        for: 5m
        labels:
          severity: critical
          team: business
          domain: business
        annotations:
          summary: "支付失败率过高"
          description: "支付失败率超过10%已经5分钟。当前值: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/payment-failure"
          dashboard: "https://grafana.example.com/d/business-payments"
      
      # 用户活跃度下降
      - alert: UserActivityDrop
        expr: active_users_total < (active_users_total offset 1d) * 0.7
        for: 1h
        labels:
          severity: warning
          team: business
          domain: business
        annotations:
          summary: "用户活跃度下降"
          description: "当前活跃用户数比昨天同时段下降超过30%。当前值: {{ $value }}"
          runbook: "https://runbook.example.com/user-activity"
          dashboard: "https://grafana.example.com/d/business-users"

  # ============ 黑盒监控告警 ============
  - name: blackbox_alerts
    interval: 30s
    rules:
      # HTTP探测失败
      - alert: HTTPProbeFailed
        expr: probe_success{job="blackbox-http"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
          domain: availability
        annotations:
          summary: "HTTP服务不可用 {{ $labels.instance }}"
          description: "HTTP服务 {{ $labels.instance }} 探测失败已经1分钟"
          runbook: "https://runbook.example.com/http-probe-failed"
          dashboard: "https://grafana.example.com/d/blackbox-http"
      
      # SSL证书即将过期
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox-https"} - time() < 86400 * 30  # 30天内过期
        for: 0m
        labels:
          severity: warning
          team: infrastructure
          domain: security
        annotations:
          summary: "SSL证书即将过期 {{ $labels.instance }}"
          description: "SSL证书将在{{ $value | humanizeDuration }}后过期"
          runbook: "https://runbook.example.com/ssl-cert-expiring"
          dashboard: "https://grafana.example.com/d/blackbox-ssl"
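上述规则上线前可以用 promtool 做单元测试,先验证表达式和注解渲染是否符合预期。下面是一个针对 NodeDown 规则的测试文件示意(假设规则保存为 alert_rules.yml,运行 promtool test rules alert_rules_test.yml 执行):

```yaml
# alert_rules_test.yml —— promtool 规则单元测试示意
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="10.0.0.1:9100"}'
        values: '1 0 0 0 0 0'   # 第2个采样点起宕机
    alert_rule_test:
      - eval_time: 5m
        alertname: NodeDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              domain: infrastructure
              service: node
              job: node
              instance: 10.0.0.1:9100
            exp_annotations:
              summary: "节点宕机 10.0.0.1:9100"
              description: "节点 10.0.0.1:9100 已经宕机超过1分钟"
              runbook: "https://runbook.example.com/node-down"
              dashboard: "https://grafana.example.com/d/node-overview"
```

把此类测试纳入CI,可以在规则改动引入回归时及时发现。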

七、高可用与生产优化

🏗️ 生产环境要求:

1. 高可用性:监控系统自身不能成为单点故障
2. 可扩展性:支持监控上千节点,PB级数据存储
3. 性能优化:查询响应时间在秒级,资源消耗可控
4. 安全性:访问控制、数据加密、审计日志
5. 可维护性:自动化部署、配置管理、故障自愈

1. Prometheus高可用方案

负载均衡器
Prometheus A Prometheus B
Thanos Query
对象存储 长期存储
# Prometheus高可用配置示例

# 1. 多副本Prometheus配置
# prometheus-a.yml 和 prometheus-b.yml 配置相同,但external_labels不同

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    replica: 'A'  # 副本A使用'A',副本B使用'B'

# 2. 服务发现配置(确保两个副本抓取相同的targets)
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        regex: ',(production|staging|dev),'
        target_label: env

# 3. Thanos Sidecar配置(与每个Prometheus实例一起运行)
# 注意: Thanos 组件实际通过命令行参数配置,以下YAML仅为参数归纳示意
# thanos-sidecar.yml
prometheus:
  external_url: "http://prometheus-a.example.com:9090"
  # 或 "http://prometheus-b.example.com:9090"

thanos:
  sidecar:
    grpc_address: "0.0.0.0:10901"
    http_address: "0.0.0.0:10902"
    prometheus_url: "http://localhost:9090"
    tsdb_path: "/var/lib/prometheus"
  
  objstore:
    type: S3
    config:
      bucket: "thanos-metrics"
      endpoint: "s3.example.com"
      access_key: "YOUR_ACCESS_KEY"
      secret_key: "YOUR_SECRET_KEY"
      insecure: false
      signature_version2: false
      put_user_metadata: {}
      http_config:
        idle_conn_timeout: 90s
        response_header_timeout: 2m
      trace:
        enable: true
  
  compactor:
    data_dir: "/var/lib/thanos/compactor"
    retention: 30d
  
  query:
    http_address: "0.0.0.0:10903"
    grpc_address: "0.0.0.0:10904"
    store:
      - "prometheus-a.example.com:10901"
      - "prometheus-b.example.com:10901"
      - "thanos-store.example.com:10901"

# 4. Thanos Query前端配置(同样为参数归纳示意,实际以命令行参数传入)
# thanos-query.yml
http:
  address: "0.0.0.0:10905"
  grace_period: 2m

grpc:
  address: "0.0.0.0:10906"

query:
  replica_labels:
    - "replica"
    - "prometheus_replica"
  
  auto_downsampling: true
  partial_response: true
  default_evaluation_interval: 1m

stores:
  - "prometheus-a.example.com:10901"
  - "prometheus-b.example.com:10901"
  - "thanos-store.example.com:10901"

# 5. 负载均衡配置(Nginx)
# nginx.conf
upstream prometheus {
    zone prometheus 64k;
    server prometheus-a.example.com:9090 max_fails=3 fail_timeout=30s;
    server prometheus-b.example.com:9090 max_fails=3 fail_timeout=30s;
    keepalive 16;
}

upstream thanos_query {
    zone thanos_query 64k;
    server thanos-query.example.com:10905 max_fails=3 fail_timeout=30s;
    keepalive 16;
}

server {
    listen 80;
    server_name prometheus.example.com;
    
    location / {
        proxy_pass http://prometheus;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # 健康检查
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;
    }
}

server {
    listen 80;
    server_name thanos.example.com;
    
    location / {
        proxy_pass http://thanos_query;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

# 6. 监控目标分片配置(当监控目标太多时)
# 根据标签将目标分片到不同的Prometheus实例

scrape_configs:
  - job_name: 'node-shard-0'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      # 根据实例名称哈希分片
      - source_labels: [__meta_consul_service_id]
        action: hashmod
        modulus: 2
        target_label: __tmp_hash
      - source_labels: [__tmp_hash]
        action: keep
        regex: ^0$  # 这个实例只抓取hash为0的目标
  
  - job_name: 'node-shard-1'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service_id]
        action: hashmod
        modulus: 2
        target_label: __tmp_hash
      - source_labels: [__tmp_hash]
        action: keep
        regex: ^1$  # 这个实例只抓取hash为1的目标
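hashmod 的分片结果可以离线预判。Prometheus 对拼接后的标签值取 MD5,用其后8字节按大端序转为整数再取模(细节以官方实现为准),下面用 Python 模拟该行为(目标地址均为示例):

```python
import hashlib
import struct

def hashmod(value: str, modulus: int) -> int:
    """模拟 Prometheus relabel 的 hashmod 动作:
    对标签值(多个 source_labels 时默认以 ';' 拼接)取 MD5,
    用摘要的后8字节按大端序转为 uint64 后取模。"""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    (h,) = struct.unpack(">Q", digest[8:16])
    return h % modulus

# 示例目标: 预判各实例落在哪个分片
targets = ["node-a:9100", "node-b:9100", "node-c:9100", "node-d:9100"]
for t in targets:
    print(t, "-> shard", hashmod(t, 2))
```

据此可以在扩容 modulus 前评估目标在各 Prometheus 分片间的分布是否均衡。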

# 7. 远程写配置(将数据写入VictoriaMetrics集群)
remote_write:
  - url: "http://vminsert:8480/insert/0/prometheus/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      capacity: 100000
      max_shards: 30
    write_relabel_configs:
      # 只保留重要的指标
      - action: keep
        regex: "up|node_.*|process_.*|prometheus_.*"
        source_labels: [__name__]

# 8. 资源限制配置
# 通过cgroups或systemd限制资源使用

[Service]
MemoryLimit=8G
CPUQuota=200%
IOWeight=100
TasksMax=10000

# 9. 数据保留策略
# 注意: 本地保留时长不是 prometheus.yml 字段,需通过启动参数设置:
#   --storage.tsdb.retention.time=15d
# prometheus.yml 中可配置的是乱序写入窗口(Prometheus 2.39+):
storage:
  tsdb:
    out_of_order_time_window: 1h

# 远程存储保留更长时间
remote_write:
  - url: "http://long-term-storage:9090/api/v1/write"
    remote_timeout: 30s
    write_relabel_configs:
      - action: keep
        regex: ".*"
        source_labels: [__name__]

# 10. 备份策略
# backup_prometheus.sh
#!/bin/bash
# 需要以 --web.enable-admin-api 参数启动 Prometheus
BACKUP_DIR="/backup/prometheus"
DATE=$(date +%Y%m%d_%H%M%S)

# 通过TSDB快照API创建一致性快照(无需停止服务)
SNAPSHOT=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | \
  sed -n 's/.*"name":"\([^"]*\)".*/\1/p')

# 备份快照数据
tar czf "$BACKUP_DIR/prometheus_data_$DATE.tar.gz" \
  "/var/lib/prometheus/snapshots/$SNAPSHOT"

# 备份配置文件
tar czf "$BACKUP_DIR/prometheus_config_$DATE.tar.gz" /etc/prometheus/

# 删除快照,释放磁盘空间
rm -rf "/var/lib/prometheus/snapshots/$SNAPSHOT"

echo "备份完成: $BACKUP_DIR/prometheus_data_$DATE.tar.gz"

2. 性能优化建议

存储优化

使用SSD存储TSDB数据,启用压缩,定期清理过期数据。

查询优化

使用Recording Rules预计算常用查询,优化PromQL,避免高基数查询。

网络优化

配置合理的抓取间隔,使用连接池,启用HTTP/2,减少网络往返。

内存优化

调整块大小,监控内存使用,启用内存映射文件,避免OOM。
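以查询优化中提到的 Recording Rules 为例,下面是一段预计算规则的示意(规则名与表达式均为示例,保存后通过 rule_files 加载):

```yaml
# recording_rules.yml —— 预计算常用查询,降低仪表板与告警的实时查询开销
groups:
  - name: node_recording_rules
    interval: 30s
    rules:
      # 预计算各实例CPU使用率,仪表板直接查询 instance:node_cpu_usage:percent
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      # 预计算各实例内存使用率
      - record: instance:node_memory_usage:percent
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
```

预计算后的指标名遵循 level:metric:operations 的惯例,告警规则也可以直接引用它们,避免在每次求值时重复昂贵的聚合。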

八、总结与最佳实践

✅ 监控体系建设成功标志:

1. 全面覆盖:基础设施、应用、业务全面监控
2. 及时告警:问题发现时间从小时级降低到分钟级
3. 快速定位:MTTR(平均修复时间)显著降低
4. 数据驱动:监控数据用于容量规划和性能优化
5. 团队赋能:开发、运维、业务团队都能使用监控数据

1. 实施路线图

阶段 时间 主要任务 关键成果
第一阶段 1-2周 基础设施监控、基础告警 服务器监控、基础告警规则
第二阶段 2-4周 应用监控、业务监控 应用性能监控、关键业务指标
第三阶段 4-8周 高可用、自动化、优化 监控系统高可用、自动化部署
第四阶段 持续优化 智能分析、预测告警 异常检测、容量预测、AIOps

2. 监控系统检查清单

# 企业级监控系统检查清单

"""
1. 数据采集检查
    □ 所有服务器部署Node Exporter
    □ 关键服务有专用Exporter(MySQL、Nginx等)
    □ 应用层业务指标采集
    □ 网络探测(黑盒监控)配置
    □ 日志指标采集(通过Loki或ELK)

2. 存储与处理检查
    □ Prometheus配置高可用
    □ 数据保留策略合理(本地+远程)
    □ 告警规则分类清晰
    □ 记录规则优化查询性能
    □ 远程写配置正确

3. 可视化检查
    □ Grafana仪表板覆盖所有监控维度
    □ 仪表板有清晰的分类和组织
    □ 关键指标有实时可视化
    □ 历史数据可回溯分析
    □ 权限控制配置正确

4. 告警通知检查
    □ 告警分级合理(紧急/重要/警告)
    □ 通知渠道覆盖全面(邮件/钉钉/短信)
    □ 告警抑制规则配置合理
    □ 静默规则管理规范
    □ 告警处理流程明确

5. 高可用检查
    □ Prometheus多副本部署
    □ Alertmanager集群部署
    □ Grafana配置持久化存储
    □ 负载均衡配置正确
    □ 备份恢复方案验证

6. 性能检查
    □ 查询响应时间在秒级
    □ 抓取间隔合理(不造成目标压力)
    □ 内存使用可控(无OOM风险)
    □ 磁盘空间充足(有预警机制)
    □ 网络带宽足够

7. 安全检查
    □ 监控系统访问控制
    □ 数据传输加密(HTTPS)
    □ 认证授权配置
    □ 审计日志开启
    □ 敏感信息保护

8. 运维检查
    □ 配置版本管理
    □ 自动化部署脚本
    □ 监控系统自身监控
    □ 容量规划预测
    □ 定期演练恢复

9. 文档检查
    □ 架构设计文档完整
    □ 部署运维手册
    □ 告警处理手册(Runbook)
    □ 故障排查指南
    □ 培训材料

10. 合规检查
    □ 数据保留符合法规要求
    □ 审计日志满足合规
    □ 访问控制符合安全政策
    □ 告警通知符合SLA
    □ 故障处理流程规范
"""

# 监控系统成熟度评估
MATURITY_LEVELS = {
    "Level 1 - 基础监控": "服务器基础指标监控,手动告警",
    "Level 2 - 标准监控": "应用和业务监控,自动化告警",
    "Level 3 - 高级监控": "全链路监控,预测性告警",
    "Level 4 - 智能监控": "AIOps,自动化修复,业务影响分析"
}

3. 常见问题与解决方案

❓ 问题:Prometheus内存占用过高

解决方案
1. 调整块大小和保留时间
2. 使用Recording Rules预计算
3. 限制标签基数(避免高基数标签)
4. 启用数据分片和远程存储

❓ 问题:告警风暴

解决方案
1. 合理配置告警抑制规则
2. 设置合理的告警间隔和等待时间
3. 使用告警分组(按服务、按实例)
4. 实现告警升级和降级机制

❓ 问题:监控数据不一致

解决方案
1. 统一时钟同步(NTP)
2. 配置合理的抓取超时时间
3. 监控Exporter健康状况
4. 实现数据一致性检查

4. 未来发展趋势

AIOps智能运维

机器学习异常检测、根因分析、自动化修复。

可观测性

Metrics、Logs、Traces三位一体,全链路追踪。

云原生监控

Kubernetes原生监控,Service Mesh监控,Serverless监控。

业务可观测性

业务指标监控,用户体验监控,业务影响分析。

🎯 最后建议:

1. 从小处着手:从核心业务开始,逐步扩大监控范围
2. 持续迭代:监控系统需要持续优化和演进
3. 文化先行:建立数据驱动的运维文化
4. 工具为辅:工具是手段,解决问题才是目的
5. 全员参与:监控不仅是运维的事,需要开发、测试、业务共同参与

📊 监控是运维的眼睛,告警是运维的耳朵!

一个完善的企业级监控告警系统是保障业务稳定运行的基石。通过本文的介绍,您应该能够构建一套从数据采集、存储处理、可视化展示到告警通知的完整监控体系。记住,监控的最终目标不是收集数据,而是通过数据发现问题、解决问题、预防问题。

如有问题或建议,欢迎在评论区留言交流!

标签: 监控系统 Prometheus Grafana 运维实战
最后更新:2025-12-08