Building an Enterprise-Grade Monitoring and Alerting System
📋 Table of Contents
I. Monitoring Overview and Design Principles
1. Why Monitoring Matters
In modern IT systems, monitoring has gone from "nice to have" to "must have." A well-built monitoring system delivers:
Fast problem detection
Catch anomalies in real time, before small issues grow into major outages.
Data for performance tuning
Use historical data to identify bottlenecks and guide capacity planning and optimization.
Post-incident analysis
Keep a record of system state over time, making root-cause analysis and accountability possible.
SLA assurance
Quantify availability and performance to back up SLA (Service Level Agreement) commitments.
In the experience of Google SRE (Site Reliability Engineering), effective monitoring can substantially reduce MTTR (mean time to repair) and markedly improve the odds of catching failures before they escalate.
2. Design Principles (Google's Four Golden Signals)
The four golden signals proposed by the Google SRE team are the core guiding principles of monitoring design:
Latency: the time it takes to serve a request
Traffic: the volume of requests or concurrency the system is handling
Errors: the fraction of requests that fail or return errors
Saturation: how "full" the system's resources are
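The four golden signals map naturally onto PromQL. A sketch, assuming the standard `http_requests_total` counter and `http_request_duration_seconds` histogram used later in this article (metric and label names are conventions, not guaranteed by every application):

```promql
# Latency: 95th-percentile request duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: CPU busy fraction per instance (Node Exporter)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```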
3. Monitoring Layers
A complete monitoring system should cover four layers:
1. Business metrics: success rate of key business flows, user activity, order volume, etc.
2. Application metrics: response time, error rate, throughput, JVM/GC state, etc.
3. System metrics: CPU, memory, disk, network utilization, process state, etc.
4. Network metrics: latency, packet loss, bandwidth utilization, connection counts, etc.
II. Monitoring System Architecture
Design goals:
1. Scalability: support monitoring thousands of nodes
2. High availability: the monitoring system itself must not become a single point of failure
3. Low latency: metric collection and alerting delays measured in seconds
4. Usability: simple configuration, friendly visualization, easy problem diagnosis
1. Overall Architecture
An enterprise monitoring system is typically layered as follows:
# Enterprise monitoring system architecture
"""
Data Collection layer
├── Node Exporter: host-level metrics
├── MySQL Exporter: database monitoring
├── Nginx Exporter: web server monitoring
├── JMX Exporter: Java application monitoring
├── Blackbox Exporter: network probing
└── Custom exporters: business metrics
Storage & Processing layer
├── Prometheus Server: metric scraping, storage, querying
├── Prometheus Alertmanager: alert management
├── Thanos/Cortex: long-term storage and clustering (optional)
└── Time-series databases: VictoriaMetrics/InfluxDB (alternatives)
Visualization layer
├── Grafana: dashboards
├── Custom dashboards: business wallboards
└── Reporting system: periodic report generation
Alerting & Notification layer
├── Email: SMTP integration
├── Instant messaging: DingTalk / WeCom / Slack
├── SMS: cloud provider APIs
└── Phone alerts: automated calls for emergencies
Auxiliary Components
├── Service discovery: automatic discovery of monitoring targets
├── Configuration management: automated deployment with Ansible/Terraform
├── Access control: LDAP/OAuth2 integration
└── Log integration: correlated analysis with Loki/ELK Stack
"""
2. Technology Comparison
A comparison of mainstream monitoring solutions:
| Feature | Prometheus | Zabbix | Nagios | DataDog |
|---|---|---|---|---|
| License | Open source | Open source | Open source | Commercial |
| Data model | Multi-dimensional time series | Key-value | Status checks | Multi-dimensional time series |
| Query language | PromQL | Limited | None | Proprietary |
| Service discovery | Native | Limited | None | Automatic |
| Visualization | Requires Grafana | Built-in | Requires plugins | Built-in |
| Community/ecosystem | Strong | Strong | Strong | Commercial support |
| Cost | Free | Free | Free | Expensive |
3. Metric Design Conventions
# Metric naming conventions
# Format: metric_name{label1="value1",label2="value2",...}
"""
Naming rules:
1. Separate words with underscores: http_requests_total
2. Basic pattern: <namespace>_<name>_<unit>, with standard suffixes:
- _total: cumulative counter value
- _count: observation count of a histogram/summary
- _sum: observation sum of a histogram/summary
- _bucket: histogram buckets
- _info: metadata
3. Standardize units (Prometheus base units):
- Time: seconds
- Memory: bytes
- Disk: bytes
- Network: bytes (e.g. node_network_receive_bytes_total)
Label design rules:
1. Identifying labels (required):
- instance: instance identifier (IP:Port)
- job: job/service name
- env: environment (prod/staging/dev)
2. Dimension labels (optional):
- region: geographic region (e.g. north/east)
- az: availability zone
- team: owning team
- version: application version
3. Label designs to avoid:
- no high-cardinality labels (such as user IDs)
- avoid label values that change dynamically
- keep the label count modest (typically 5-10)
Example metrics:
# System metrics
node_cpu_seconds_total{mode="idle", instance="192.168.1.100:9100", job="node"}
node_memory_MemFree_bytes{instance="192.168.1.100:9100", job="node"}
# Application metrics
http_requests_total{method="POST", endpoint="/api/users", status="200", job="user-service"}
http_request_duration_seconds_bucket{method="GET", endpoint="/api/products", le="0.1"}
# Business metrics
orders_total{type="new", payment_method="alipay", env="production"}
user_sessions_active{region="north", platform="mobile"}
"""
III. Prometheus Deployment and Configuration
Why Prometheus:
1. Multi-dimensional data model: metric name plus key-value labels
2. Powerful query language: PromQL, for flexible querying and aggregation
3. No distributed storage dependency: each node is self-contained
4. HTTP pull model: actively scrapes metrics from targets
5. Multiple service discovery mechanisms: Kubernetes, Consul, and more
1. Installing Prometheus
#!/bin/bash
# install_prometheus.sh
# One-shot Prometheus installation script
PROMETHEUS_VERSION="2.45.0"
PROMETHEUS_USER="prometheus"
INSTALL_DIR="/opt/prometheus"
DATA_DIR="/var/lib/prometheus"
CONFIG_DIR="/etc/prometheus"
# Create user and directories
useradd --no-create-home --shell /bin/false $PROMETHEUS_USER
mkdir -p $INSTALL_DIR $DATA_DIR $CONFIG_DIR
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $INSTALL_DIR $DATA_DIR $CONFIG_DIR
# Download and unpack Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar xzf prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
cd prometheus-$PROMETHEUS_VERSION.linux-amd64
# Install binaries
cp prometheus promtool $INSTALL_DIR/
chown $PROMETHEUS_USER:$PROMETHEUS_USER $INSTALL_DIR/{prometheus,promtool}
chmod +x $INSTALL_DIR/{prometheus,promtool}
# Copy console templates (referenced by the --web.console.* flags below)
cp -r consoles console_libraries $INSTALL_DIR/
# Copy the default configuration file
cp prometheus.yml $CONFIG_DIR/
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $CONFIG_DIR
# Create the systemd service unit
cat > /etc/systemd/system/prometheus.service << EOF
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target

[Service]
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/prometheus \
  --config.file=$CONFIG_DIR/prometheus.yml \
  --storage.tsdb.path=$DATA_DIR \
  --storage.tsdb.retention.time=30d \
  --web.console.templates=$INSTALL_DIR/consoles \
  --web.console.libraries=$INSTALL_DIR/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://prometheus.example.com \
  --web.enable-lifecycle \
  --web.enable-admin-api
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Create the configuration directory layout
mkdir -p $CONFIG_DIR/{rules,rules.d,files_sd,targets}
cat > $CONFIG_DIR/prometheus.yml << 'EOF'
# Global configuration
global:
  scrape_interval: 15s          # default scrape interval
  evaluation_interval: 15s      # rule evaluation interval
  external_labels:              # labels attached to outbound data
    region: 'north'
    env: 'production'

# Alerting rule files
rule_files:
  - "rules/*.yml"
  - "rules.d/*.yml"

# Scrape configuration
scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          service: 'monitoring'

  # All Node Exporters
  - job_name: 'node'
    scrape_interval: 30s
    file_sd_configs:
      - files:
          - 'targets/node_*.yml'
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # All MySQL instances
  - job_name: 'mysql'
    scrape_interval: 30s
    static_configs:
      - targets: ['mysql-1:9104', 'mysql-2:9104']
        labels:
          database: 'mysql'

  # All Nginx instances
  - job_name: 'nginx'
    scrape_interval: 30s
    static_configs:
      - targets: ['nginx-1:9113', 'nginx-2:9113']
        labels:
          service: 'web'

  # Service discovery via Consul
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        # relabel regexes are fully anchored, so allow surrounding tags
        regex: '.*,(production|staging|dev),.*'
        target_label: env
        replacement: '$1'

# Remote read/write (optional)
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 1000
      capacity: 5000
      max_shards: 200
remote_read:
  - url: "http://thanos-query:10902/api/v1/read"
    read_recent: true
EOF
# Create an example target file
cat > $CONFIG_DIR/targets/node_servers.yml << 'EOF'
- targets:
    - '192.168.1.100:9100'
    - '192.168.1.101:9100'
    - '192.168.1.102:9100'
  labels:
    datacenter: 'dc1'
    rack: 'rack-a'
EOF
# Create alerting rule files
cat > $CONFIG_DIR/rules/node_alerts.yml << 'EOF'
groups:
  - name: node_alerts
    interval: 30s
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage (instance {{ $labels.instance }})"
          description: "CPU usage has been above 80% for 5 minutes. Current value: {{ $value }}%"
          runbook: "https://runbook.example.com/high-cpu"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High memory usage (instance {{ $labels.instance }})"
          description: "Memory usage has been above 85% for 5 minutes. Current value: {{ $value }}%"
      - alert: DiskSpaceCritical
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 90
        for: 2m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Disk space critically low (instance {{ $labels.instance }})"
          description: "Root filesystem usage has been above 90% for 2 minutes. Current value: {{ $value }}%"
          runbook: "https://runbook.example.com/disk-space"
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Node down (instance {{ $labels.instance }})"
          description: "Node {{ $labels.instance }} has been unreachable for more than 1 minute"
EOF
# Fix permissions
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $CONFIG_DIR
find $CONFIG_DIR -type f -name '*.yml' -exec chmod 644 {} \;
chmod 755 $CONFIG_DIR $CONFIG_DIR/{rules,rules.d,files_sd,targets}
# Start the service
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
# Check status
sleep 3
systemctl status prometheus --no-pager
echo "Prometheus installation complete!"
echo "Web UI: http://$(hostname -I | awk '{print $1}'):9090"
echo "Data directory: $DATA_DIR"
echo "Config directory: $CONFIG_DIR"
2. Prometheus Configuration in Depth
# Advanced Prometheus configuration examples
# 1. Remote storage (VictoriaMetrics)
remote_write:
  - url: "http://victoria-metrics:8428/api/v1/write"
    write_relabel_configs:
      - action: keep
        regex: "node.*|prometheus.*"
        source_labels: [__name__]
    queue_config:
      max_shards: 10
      min_shards: 2
      max_samples_per_send: 500
      capacity: 10000
      batch_send_deadline: "5s"
      min_backoff: "100ms"
      max_backoff: "5s"
# 2. Service discovery (Kubernetes)
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Take the scrape path from the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Take the scrape port from the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Map pod labels onto the target
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Record the pod name
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Record the namespace
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Record the node
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
# 3. Static targets
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets:
          - 'app-1.example.com:8080'
          - 'app-2.example.com:8080'
          - 'app-3.example.com:8080'
        labels:
          environment: 'production'
          region: 'us-east-1'
          application: 'user-service'
# 4. File-based service discovery
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 5m
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):(\d+)'
        replacement: '${1}'
        target_label: host
      - source_labels: [__address__]
        regex: '(.*):(\d+)'
        replacement: '${2}'
        target_label: port
# 5. Relabeling rules
scrape_configs:
  - job_name: 'example'
    static_configs:
      - targets: ['example.com:80']
    metric_relabel_configs:
      # Drop unwanted metrics
      - action: drop
        regex: 'go_.*'
        source_labels: [__name__]
      # Rename metrics
      - source_labels: [__name__]
        regex: 'http_requests_(\w+)'
        replacement: 'http_${1}'
        target_label: __name__
      # Derive a hostname label from instance
      - source_labels: [instance]
        regex: '([^:]+):\d+'
        replacement: '${1}'
        target_label: hostname
      # Bucket status codes into a group label
      - source_labels: [status_code]
        regex: '5..'
        replacement: 'server_error'
        target_label: status_group
# 6. Grouped alerting rules
groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # System-level alerts
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          domain: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for 5 minutes"
          runbook: "/runbooks/instance-down.md"
      # Resource-level alerts
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 10m
        labels:
          severity: warning
          domain: infrastructure
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage has been above 90% for 10 minutes"
          runbook: "/runbooks/high-memory.md"
  - name: application_alerts
    interval: 15s
    rules:
      # Application-level alerts
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
        for: 2m
        labels:
          severity: warning
          domain: application
        annotations:
          summary: "High request latency on {{ $labels.service }}"
          description: "95th-percentile request latency is above 0.5 seconds"
          runbook: "/runbooks/high-latency.md"
      # Business-level alerts
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100 > 5
        for: 5m
        labels:
          severity: critical
          domain: business
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate has been above 5% for 5 minutes"
          runbook: "/runbooks/high-error-rate.md"
IV. Deploying Exporters
Prometheus has a rich exporter ecosystem covering almost every common service and system: hundreds of official and community-maintained exporters span infrastructure, middleware, databases, and applications.
1. Node Exporter (host monitoring)
#!/bin/bash
# install_node_exporter.sh
# One-shot Node Exporter installation script
NODE_EXPORTER_VERSION="1.6.0"
NODE_EXPORTER_USER="node_exporter"
INSTALL_DIR="/opt/node_exporter"
# Create user and install directory
useradd --no-create-home --shell /bin/false $NODE_EXPORTER_USER
mkdir -p $INSTALL_DIR
# Download and unpack
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
tar xzf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
cd node_exporter-$NODE_EXPORTER_VERSION.linux-amd64
# Install the binary
cp node_exporter $INSTALL_DIR/
chown $NODE_EXPORTER_USER:$NODE_EXPORTER_USER $INSTALL_DIR/node_exporter
chmod +x $INSTALL_DIR/node_exporter
# Create the systemd service unit
# (note: recent node_exporter releases renamed unit-whitelist to unit-include)
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
User=$NODE_EXPORTER_USER
Group=$NODE_EXPORTER_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.systemd.unit-include="(docker|ssh|nginx|mysql)\.service" \
  --collector.processes \
  --collector.tcpstat \
  --collector.netdev \
  --collector.netstat \
  --collector.diskstats \
  --collector.filesystem \
  --collector.meminfo \
  --collector.loadavg \
  --collector.stat \
  --collector.vmstat \
  --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
  --log.level="info"
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Create the textfile collector directory
mkdir -p /var/lib/node_exporter/textfile_collector
chown -R $NODE_EXPORTER_USER:$NODE_EXPORTER_USER /var/lib/node_exporter
# Create a custom metrics collection script
cat > /usr/local/bin/custom_node_metrics.sh << 'EOF'
#!/bin/bash
# Custom node metrics for the textfile collector
OUTPUT_FILE="/var/lib/node_exporter/textfile_collector/custom_metrics.prom"
# 1. System uptime
echo '# HELP node_system_uptime_seconds System uptime in seconds' > $OUTPUT_FILE
echo '# TYPE node_system_uptime_seconds gauge' >> $OUTPUT_FILE
echo "node_system_uptime_seconds $(awk '{print $1}' /proc/uptime)" >> $OUTPUT_FILE
# 2. Logged-in users
LOGIN_USERS=$(who | wc -l)
echo '# HELP node_login_users Number of logged in users' >> $OUTPUT_FILE
echo '# TYPE node_login_users gauge' >> $OUTPUT_FILE
echo "node_login_users $LOGIN_USERS" >> $OUTPUT_FILE
# 3. Zombie processes (match the Z state at the start of the STAT column)
ZOMBIE_PROCESSES=$(ps -eo stat= | grep -c '^Z')
echo '# HELP node_zombie_processes Number of zombie processes' >> $OUTPUT_FILE
echo '# TYPE node_zombie_processes gauge' >> $OUTPUT_FILE
echo "node_zombie_processes $ZOMBIE_PROCESSES" >> $OUTPUT_FILE
# 4. File handle usage
FILE_HANDLES=$(awk '{print $1}' /proc/sys/fs/file-nr)
FILE_HANDLES_MAX=$(cat /proc/sys/fs/file-max)
FILE_HANDLES_PERCENT=$(echo "scale=2; $FILE_HANDLES * 100 / $FILE_HANDLES_MAX" | bc)
echo '# HELP node_file_handles_used File handles used' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_used gauge' >> $OUTPUT_FILE
echo "node_file_handles_used $FILE_HANDLES" >> $OUTPUT_FILE
echo '# HELP node_file_handles_max Maximum file handles' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_max gauge' >> $OUTPUT_FILE
echo "node_file_handles_max $FILE_HANDLES_MAX" >> $OUTPUT_FILE
echo '# HELP node_file_handles_percent File handles usage percent' >> $OUTPUT_FILE
echo '# TYPE node_file_handles_percent gauge' >> $OUTPUT_FILE
echo "node_file_handles_percent $FILE_HANDLES_PERCENT" >> $OUTPUT_FILE
# 5. System load (15-minute average)
LOAD_15=$(awk '{print $3}' /proc/loadavg)
echo '# HELP node_load15 System load average for 15 minutes' >> $OUTPUT_FILE
echo '# TYPE node_load15 gauge' >> $OUTPUT_FILE
echo "node_load15 $LOAD_15" >> $OUTPUT_FILE
# 6. Inode usage on the root filesystem
DISK_INODES=$(df -i / | awk 'NR==2 {print $5}' | sed 's/%//')
echo '# HELP node_disk_inode_usage_percent Disk inode usage percent for root' >> $OUTPUT_FILE
echo '# TYPE node_disk_inode_usage_percent gauge' >> $OUTPUT_FILE
echo "node_disk_inode_usage_percent $DISK_INODES" >> $OUTPUT_FILE
# 7. Established TCP connections
TCP_ESTABLISHED=$(ss -tan state established | tail -n +2 | wc -l)
echo '# HELP node_network_tcp_established Established TCP connections' >> $OUTPUT_FILE
echo '# TYPE node_network_tcp_established gauge' >> $OUTPUT_FILE
echo "node_network_tcp_established $TCP_ESTABLISHED" >> $OUTPUT_FILE
# 8. NTP synchronization status
NTP_SYNC=0
if chronyc tracking 2>/dev/null | grep -q "Leap status.*Normal"; then
  NTP_SYNC=1
elif ntpq -p 2>/dev/null | grep -q "^\*"; then
  NTP_SYNC=1
fi
echo '# HELP node_ntp_synchronized NTP synchronization status (1=synchronized, 0=not synchronized)' >> $OUTPUT_FILE
echo '# TYPE node_ntp_synchronized gauge' >> $OUTPUT_FILE
echo "node_ntp_synchronized $NTP_SYNC" >> $OUTPUT_FILE
# Fix ownership so node_exporter can read the file
# (hardcoded: variables from the installer are not available at cron time)
chown node_exporter:node_exporter $OUTPUT_FILE
chmod 644 $OUTPUT_FILE
EOF
chmod +x /usr/local/bin/custom_node_metrics.sh
# Schedule it via cron (every 30 minutes)
echo "*/30 * * * * root /usr/local/bin/custom_node_metrics.sh" > /etc/cron.d/node_exporter_custom_metrics
# Start the service
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
# Check status
sleep 2
systemctl status node_exporter --no-pager
echo "Node Exporter installation complete!"
echo "Endpoint: http://$(hostname -I | awk '{print $1}'):9100"
echo "Metrics: http://$(hostname -I | awk '{print $1}'):9100/metrics"
2. MySQL Exporter
#!/bin/bash
# install_mysql_exporter.sh
# MySQL Exporter installation and configuration
MYSQL_EXPORTER_VERSION="0.15.0"
MYSQL_EXPORTER_USER="mysql_exporter"
INSTALL_DIR="/opt/mysql_exporter"
# Create user and install directory
useradd --no-create-home --shell /bin/false $MYSQL_EXPORTER_USER
mkdir -p $INSTALL_DIR
# Download and unpack
cd /tmp
wget https://github.com/prometheus/mysqld_exporter/releases/download/v$MYSQL_EXPORTER_VERSION/mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64.tar.gz
tar xzf mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64.tar.gz
cd mysqld_exporter-$MYSQL_EXPORTER_VERSION.linux-amd64
# Install the binary
cp mysqld_exporter $INSTALL_DIR/
chown $MYSQL_EXPORTER_USER:$MYSQL_EXPORTER_USER $INSTALL_DIR/mysqld_exporter
chmod +x $INSTALL_DIR/mysqld_exporter
# Create the monitoring user in MySQL
mysql -u root -p << 'EOF'
-- Create the monitoring user
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'ExporterPassword123!' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
-- Verify the grants
SHOW GRANTS FOR 'exporter'@'localhost';
EOF
# Create the exporter credentials file
cat > /etc/mysql_exporter.cnf << EOF
[client]
user=exporter
password=ExporterPassword123!
host=localhost
port=3306
EOF
chown $MYSQL_EXPORTER_USER:$MYSQL_EXPORTER_USER /etc/mysql_exporter.cnf
chmod 600 /etc/mysql_exporter.cnf
# Create the systemd service unit
cat > /etc/systemd/system/mysql_exporter.service << EOF
[Unit]
Description=MySQL Exporter
Documentation=https://github.com/prometheus/mysqld_exporter
After=network.target mysql.service

[Service]
User=$MYSQL_EXPORTER_USER
Group=$MYSQL_EXPORTER_USER
Type=simple
Restart=always
RestartSec=5
ExecStart=$INSTALL_DIR/mysqld_exporter \
  --web.listen-address=":9104" \
  --config.my-cnf=/etc/mysql_exporter.cnf \
  --collect.global_status \
  --collect.global_variables \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.processlist \
  --collect.info_schema.tables \
  --collect.info_schema.tablestats \
  --collect.info_schema.userstats \
  --collect.perf_schema.eventswaits \
  --collect.perf_schema.file_events \
  --collect.perf_schema.indexiowaits \
  --collect.perf_schema.tableiowaits \
  --collect.slave_status \
  --collect.auto_increment.columns \
  --collect.binlog_size \
  --collect.info_schema.query_response_time \
  --collect.engine_innodb_status \
  --log.level="info"
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Start the service
systemctl daemon-reload
systemctl enable mysql_exporter
systemctl start mysql_exporter
# Check status
sleep 2
systemctl status mysql_exporter --no-pager
echo "MySQL Exporter installation complete!"
echo "Endpoint: http://$(hostname -I | awk '{print $1}'):9104"
echo "Metrics: http://$(hostname -I | awk '{print $1}'):9104/metrics"
# Key MySQL monitoring metrics
cat << 'EOF'
=== Key MySQL monitoring metrics ===
1. Connections:
mysql_global_status_threads_connected      # current connections
mysql_global_status_max_used_connections   # historical max connections
mysql_global_variables_max_connections     # max_connections limit
2. Query performance:
mysql_global_status_questions              # total statements
mysql_global_status_slow_queries           # slow queries
rate(mysql_global_status_questions[1m])    # QPS
3. InnoDB:
mysql_global_status_innodb_buffer_pool_pages_total  # buffer pool pages, total
mysql_global_status_innodb_buffer_pool_pages_free   # buffer pool pages, free
mysql_global_status_innodb_row_lock_time_avg        # average row-lock wait time
4. Replication:
mysql_slave_status_slave_io_running        # IO thread state
mysql_slave_status_slave_sql_running       # SQL thread state
mysql_slave_status_seconds_behind_master   # replication lag in seconds
5. Tables:
mysql_info_schema_table_size_bytes         # table size
mysql_info_schema_table_rows               # table row count
=== Common alerting rules ===
# Too many connections
- alert: MySQLTooManyConnections
  expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
  for: 5m
# Too many slow queries
- alert: MySQLSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) > 10
  for: 2m
# Replication lag
- alert: MySQLReplicationLag
  expr: mysql_slave_status_seconds_behind_master > 30
  for: 5m
# Low InnoDB buffer pool hit rate
- alert: InnoDBLowBufferPoolHitRate
  expr: (1 - (mysql_global_status_innodb_buffer_pool_reads / mysql_global_status_innodb_buffer_pool_read_requests)) * 100 < 90
  for: 10m
EOF
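The `rate()` calls above turn raw counters like `mysql_global_status_questions` into per-second rates such as QPS. Conceptually this is just the increase between two samples divided by the elapsed time, with counter resets (e.g. a mysqld restart) handled specially. A simplified sketch with hypothetical sample values:

```python
def per_second_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Per-second increase of a monotonically increasing counter
    between two samples (value, unix timestamp)."""
    if t1 <= t0:
        raise ValueError("samples must be time-ordered")
    delta = v1 - v0
    if delta < 0:
        # Counter reset detected (e.g. mysqld restart):
        # PromQL assumes the counter restarted from zero.
        delta = v1
    return delta / (t1 - t0)

# mysql_global_status_questions sampled 60 seconds apart (made-up values)
qps = per_second_rate(1_200_000, 0, 1_206_000, 60)
print(qps)  # 100.0
```

Real PromQL `rate()` also extrapolates to the ends of the range window and averages over all sample pairs, so treat this as the intuition rather than the exact algorithm.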
3. Other Common Exporters
Blackbox Exporter
Network probing: checks availability of HTTP, HTTPS, DNS, TCP, ICMP, and other services.
PostgreSQL Exporter
PostgreSQL monitoring: connections, query performance, locks, and more.
Cloud exporters
AWS, Azure, and GCP monitoring: cloud resource usage and cost.
JMX Exporter
Java application monitoring: JVM performance and application business metrics via JMX.
HAProxy Exporter
Load balancer monitoring: connections, request rates, backend server state, and more.
cAdvisor
Container monitoring: Docker container resource usage and performance metrics.
V. Grafana Visualization and Dashboards
Why Grafana:
1. Rich visualizations: graphs, tables, gauges, heatmaps, geomaps, and more
2. Broad data source support: Prometheus, MySQL, PostgreSQL, Elasticsearch, and more
3. Flexible alerting: visual alert configuration with multiple notification channels
4. Team collaboration: folder permissions, shared dashboards, version history
1. Installing and Configuring Grafana
#!/bin/bash
# install_grafana.sh
# One-shot Grafana installation script
GRAFANA_VERSION="10.0.3"
GRAFANA_USER="grafana"
INSTALL_DIR="/opt/grafana"
DATA_DIR="/var/lib/grafana"
CONFIG_DIR="/etc/grafana"
LOG_DIR="/var/log/grafana"
# Download Grafana
cd /tmp
wget https://dl.grafana.com/oss/release/grafana-$GRAFANA_VERSION.linux-amd64.tar.gz
tar xzf grafana-$GRAFANA_VERSION.linux-amd64.tar.gz
mv grafana-$GRAFANA_VERSION $INSTALL_DIR
# Create user and directories
useradd --no-create-home --shell /bin/false $GRAFANA_USER
mkdir -p $DATA_DIR $CONFIG_DIR $LOG_DIR
chown -R $GRAFANA_USER:$GRAFANA_USER $INSTALL_DIR $DATA_DIR $CONFIG_DIR $LOG_DIR
# Create the systemd service unit
cat > /etc/systemd/system/grafana.service << EOF
[Unit]
Description=Grafana
Documentation=https://grafana.com/docs/
After=network.target

[Service]
User=$GRAFANA_USER
Group=$GRAFANA_USER
Type=simple
Restart=always
RestartSec=5
WorkingDirectory=$INSTALL_DIR
EnvironmentFile=-$CONFIG_DIR/grafana.conf
ExecStart=$INSTALL_DIR/bin/grafana-server \\
  --config=$CONFIG_DIR/grafana.ini \\
  --homepath=$INSTALL_DIR \\
  cfg:default.paths.logs=$LOG_DIR \\
  cfg:default.paths.data=$DATA_DIR \\
  cfg:default.paths.plugins=$INSTALL_DIR/plugins \\
  cfg:default.paths.provisioning=$CONFIG_DIR/provisioning
ExecReload=/bin/kill -HUP \$MAINPID
LimitNOFILE=65536
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target
EOF
# Create the main configuration file
cat > $CONFIG_DIR/grafana.ini << 'EOF'
[server]
# Listen address and port
http_addr = 0.0.0.0
http_port = 3000
domain = grafana.example.com
root_url = %(protocol)s://%(domain)s:%(http_port)s/
serve_from_sub_path = false

# Logging
[log]
mode = console file
level = info
format = console

# Database (SQLite by default)
[database]
type = sqlite3
path = grafana.db
max_idle_conn = 2
max_open_conn = 0
conn_max_lifetime = 14400

# Security
[security]
admin_user = admin
admin_password = admin
secret_key = SW2YcwTIb9zpOOhoPsMm
disable_gravatar = false
data_source_proxy_whitelist =

# Authentication
[auth]
disable_login_form = false
disable_signout_menu = false

# Anonymous access
[auth.anonymous]
enabled = false
org_name = Main Org.
org_role = Viewer

# Basic auth
[auth.basic]
enabled = true

# Email (for alert notifications)
[smtp]
enabled = true
host = smtp.example.com:465
user = alert@example.com
password = YourPassword
from_address = alert@example.com
from_name = Grafana Alert

# Users
[users]
allow_sign_up = false
auto_assign_org = true
auto_assign_org_role = Viewer

# Sessions (legacy section, ignored by recent Grafana versions)
[session]
provider = file
provider_config = sessions
cookie_secure = false
session_life_time = 86400

# Analytics
[analytics]
reporting_enabled = true
check_for_updates = true

# Paths
[paths]
data = /var/lib/grafana
logs = /var/log/grafana
plugins = /opt/grafana/plugins
provisioning = /etc/grafana/provisioning

# Snapshots
[snapshots]
external_enabled = true
external_snapshot_url = https://snapshots.example.com
external_snapshot_name = Grafana Snapshots

# Internal metrics (Grafana monitoring itself)
[metrics]
enabled = true
interval_seconds = 10
EOF
# Provision the data sources
mkdir -p $CONFIG_DIR/provisioning/datasources
cat > $CONFIG_DIR/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      prometheusVersion: 2.45.0
      cacheLevel: 'High'
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
    secureJsonData:
      tlsAuth: false
      tlsAuthWithCACert: false
  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://localhost:9093
    editable: true
    jsonData:
      implementation: prometheus
      handleGrafanaManagedAlerts: true
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    editable: true
    jsonData:
      maxLines: 1000
  - name: Tempo
    type: tempo
    access: proxy
    url: http://localhost:3200
    editable: true
    jsonData:
      nodeGraph:
        enabled: true
      tracesToLogs:
        datasourceUid: 'loki'
        spanStartTimeShift: '-1h'
        spanEndTimeShift: '1h'
        tags: ['job', 'instance', 'pod', 'namespace']
        filterByTraceID: true
        filterBySpanID: true
EOF
# Provision dashboard loading
mkdir -p $CONFIG_DIR/provisioning/dashboards
cat > $CONFIG_DIR/provisioning/dashboards/dashboards.yml << 'EOF'
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
EOF
# Create the dashboard directory
mkdir -p /etc/grafana/dashboards
# Start the service
systemctl daemon-reload
systemctl enable grafana
systemctl start grafana
# Check status
sleep 5
systemctl status grafana --no-pager
echo "Grafana installation complete!"
echo "Web UI: http://$(hostname -I | awk '{print $1}'):3000"
echo "Default username: admin"
echo "Default password: admin"
echo ""
echo "Log in and change the admin password immediately!"
2. Designing Grafana Dashboards
{
"dashboard": {
"title": "Node Exporter Full",
"tags": ["templated", "node-exporter"],
"style": "dark",
"timezone": "browser",
"panels": [
{
"datasource": "Prometheus",
"description": "Overall CPU usage",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 2,
"options": {
"orientation": "auto",
"reduceOptions": {
"calcs": [
"lastNotNull"
],
"fields": "",
"values": false
},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"pluginVersion": "9.3.2",
"targets": [
{
"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"interval": "",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"title": "CPU Usage",
"type": "gauge"
},
{
"datasource": "Prometheus",
"description": "Memory usage breakdown",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "normal"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "bytes"
},
"overrides": [
{
"matcher": {
"id": "byName",
"options": "Used"
},
"properties": [
{
"id": "color",
"value": {
"fixedColor": "red",
"mode": "fixed"
}
}
]
},
{
"matcher": {
"id": "byName",
"options": "Cached"
},
"properties": [
{
"id": "color",
"value": {
"fixedColor": "yellow",
"mode": "fixed"
}
}
]
}
]
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"id": 3,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"tooltip": {
"mode": "single",
"sort": "none"
}
},
"targets": [
{
"expr": "node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes",
"interval": "",
"legendFormat": "Used",
"refId": "A"
},
{
"expr": "node_memory_Buffers_bytes",
"hide": false,
"interval": "",
"legendFormat": "Buffers",
"refId": "B"
},
{
"expr": "node_memory_Cached_bytes",
"hide": false,
"interval": "",
"legendFormat": "Cached",
"refId": "C"
}
],
"title": "Memory Usage",
"type": "timeseries"
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"templating": {
"list": [
{
"current": {
"selected": false,
"text": "All",
"value": "$__all"
},
"datasource": "Prometheus",
"definition": "label_values(node_cpu_seconds_total, instance)",
"hide": 0,
"includeAll": true,
"multi": true,
"name": "instance",
"options": [],
"query": {
"query": "label_values(node_cpu_seconds_total, instance)",
"refId": "StandardVariableQuery"
},
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"type": "query"
},
{
"current": {
"selected": false,
"text": "1m",
"value": "1m"
},
"hide": 0,
"includeAll": false,
"label": "Interval",
"multi": false,
"name": "interval",
"options": [
{
"selected": true,
"text": "1m",
"value": "1m"
},
{
"selected": false,
"text": "5m",
"value": "5m"
},
{
"selected": false,
"text": "10m",
"value": "10m"
},
{
"selected": false,
"text": "30m",
"value": "30m"
},
{
"selected": false,
"text": "1h",
"value": "1h"
},
{
"selected": false,
"text": "6h",
"value": "6h"
}
],
"query": "1m,5m,10m,30m,1h,6h",
"queryValue": "",
"refresh": 2,
"skipUrlSync": false,
"type": "interval"
}
]
},
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"target": {
"limit": 100,
"matchAny": false,
"tags": [],
"type": "dashboard"
},
"type": "dashboard"
}
]
},
"refresh": "10s",
"schemaVersion": 37,
"version": 1,
"uid": "node-exporter-full"
},
"folderUid": "general",
"message": "Updated dashboard",
"overwrite": true
}
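The JSON above is the payload shape accepted by Grafana's `POST /api/dashboards/db` endpoint, so dashboards can be upserted from scripts as well as provisioned from disk. A sketch that builds (but does not send) such a request; the base URL and token are placeholder assumptions:

```python
import json
import urllib.request

def build_dashboard_import_request(base_url: str, api_token: str, dashboard: dict):
    """Build the HTTP request that would upsert a dashboard via
    Grafana's POST /api/dashboards/db endpoint (not sent here)."""
    payload = {
        "dashboard": dashboard,  # needs "uid" and "title"; "id": None creates a new one
        "folderUid": "general",
        "overwrite": True,
        "message": "Provisioned by script",
    }
    return urllib.request.Request(
        url=f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",  # a Grafana service account token
        },
        method="POST",
    )

req = build_dashboard_import_request(
    "http://grafana.example.com:3000",  # assumed address
    "glsa_example_token",               # hypothetical token
    {"uid": "node-exporter-full", "title": "Node Exporter Full", "id": None},
)
print(req.full_url)
```

Sending it is then just `urllib.request.urlopen(req)`; file-based provisioning (shown in the install script) is usually preferable for dashboards kept in version control, while the API suits one-off imports.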
VI. Alerting Rules and Notification
Alerting principles:
1. Tiered severity: classify alerts by impact (critical, major, warning)
2. Noise reduction: avoid alert storms with sensible silencing and grouping rules
3. Clear and actionable: alerts should state the concrete problem and suggest a fix
4. Multi-channel delivery: important alerts should go out over more than one channel
5. Closed-loop handling: tie alerts to the full incident, response, and postmortem cycle
1. Configuring Alertmanager
# Alertmanager configuration (alertmanager.yml)
global:
  # SMTP settings
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'YourPassword'
  smtp_require_tls: true
  # Slack settings
  slack_api_url: 'https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX'
  # WeCom (WeChat Work) settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'your-wechat-secret'
  wechat_api_corp_id: 'your-corp-id'
# Routing - how alerts reach receivers
route:
  # Default route
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default-receiver'
  # Sub-routes
  routes:
    # Route by severity
    - match:
        severity: critical
      receiver: 'critical-receiver'
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 5m
      continue: true
    # Route by team
    - match_re:
        team: ^(infra|platform).*
      receiver: 'infra-team'
      continue: false
    - match_re:
        team: ^(dev|app).*
      receiver: 'dev-team'
      continue: false
    # Route by service
    - match:
        service: mysql
      receiver: 'dba-team'
      continue: false
    - match:
        service: nginx
      receiver: 'web-team'
      continue: false
    # Business-hours routing (Alertmanager has no time-based label matchers;
    # the route is activated via active_time_intervals, defined below)
    - receiver: 'work-hours-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      continue: true
      active_time_intervals:
        - work-hours

# Named time windows referenced by active_time_intervals
time_intervals:
  - name: work-hours
    time_intervals:
      - weekdays: ['monday:friday']   # Monday through Friday
        times:
          - start_time: '09:00'
            end_time: '18:00'
# Inhibition rules - suppress redundant alerts
inhibit_rules:
  # A NodeDown alert suppresses all other critical alerts for that node
  - source_match:
      alertname: NodeDown
      severity: critical
    target_match:
      severity: critical
    equal: ['instance', 'cluster']
  # A cluster-level outage suppresses node-level alerts
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: 'NodeDown|HighCpuUsage|HighMemoryUsage'
    equal: ['cluster']
  # Network partition suppression
  - source_match:
      alertname: NetworkPartition
    target_match:
      severity: warning
    equal: ['zone']

# Silences - temporarily mute specific alerts
# (configured through the web UI or the API, not in this file)
# Receivers - define how alerts are delivered
receivers:
  # Default receiver
  - name: 'default-receiver'
    email_configs:
      - to: 'alerts@example.com'
        send_resolved: true
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        html: |
          <h2>{{ .GroupLabels.alertname }}</h2>
          <p>Status: {{ .Status | toUpper }}</p>
          <p>Started: {{ (index .Alerts 0).StartsAt }}</p>
          <p>Ended: {{ (index .Alerts 0).EndsAt }}</p>
          <p>Summary: {{ .CommonAnnotations.summary }}</p>
          <p>Description: {{ .CommonAnnotations.description }}</p>
          <h3>Alert details</h3>
          <table>
            <tr><th>Label</th><th>Value</th></tr>
            {{ range .GroupLabels.SortedPairs }}
            <tr><td>{{ .Name }}</td><td>{{ .Value }}</td></tr>
            {{ end }}
          </table>
    webhook_configs:
      - url: 'http://alert-webhook.example.com/alerts'
        send_resolved: true
  # Critical-alert receiver
  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com, manager@example.com'
        send_resolved: true
    # Slack notification
    slack_configs:
      - channel: '#alerts-critical'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Labels:*
          {{ range .Labels.SortedPairs }} • {{ .Name }}: {{ .Value }}
          {{ end }}
          {{ end }}
        send_resolved: true
        color: 'danger'  # red
    # WeCom notification
    wechat_configs:
      - agent_id: '1000002'
        to_user: '@all'
        to_party: '2'
        message: '{{ template "wechat.default.message" . }}'
        send_resolved: true
    # Phone call (via a third-party service)
    webhook_configs:
      - url: 'http://phone-alert-service.example.com/call'
        send_resolved: false
  # Infrastructure team
  - name: 'infra-team'
    email_configs:
      - to: 'infra-team@example.com'
    slack_configs:
      - channel: '#infra-alerts'
  # Development team
  - name: 'dev-team'
    email_configs:
      - to: 'dev-team@example.com'
    slack_configs:
      - channel: '#dev-alerts'
  # DBA team
  - name: 'dba-team'
    email_configs:
      - to: 'dba-team@example.com'
    webhook_configs:
      - url: 'http://dba-alert.example.com/webhook'
  # Business-hours receiver
  - name: 'work-hours-receiver'
    email_configs:
      - to: 'work-hours-team@example.com'
    slack_configs:
      - channel: '#work-hours-alerts'
  # DingTalk robot. Upstream Alertmanager has no dingtalk_configs; send
  # through a bridge such as prometheus-webhook-dingtalk (default port 8060),
  # which posts to https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
  # and renders a markdown template on the bridge side, for example:
  #   {{ range .Alerts }}
  #   ## [{{ .Status | toUpper }}] {{ .Labels.alertname }}
  #   **Started**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
  #   **Instance**: {{ .Labels.instance }}
  #   **Summary**: {{ .Annotations.summary }}
  #   **Description**: {{ .Annotations.description }}
  #   {{ if .Annotations.runbook }}**Runbook**: [view]({{ .Annotations.runbook }}){{ end }}
  #   ---
  #   {{ end }}
  - name: 'dingtalk-receiver'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/webhook1/send'
        send_resolved: true
  # WeCom (Enterprise WeChat) receiver
  - name: 'wechat-work-receiver'
    wechat_configs:
      - api_secret: 'your-secret'
        corp_id: 'your-corp-id'
        agent_id: '1000002'
        message: '{{ template "wechat.default.message" . }}'
        send_resolved: true
  # SMS (via Alibaba Cloud, Tencent Cloud, etc.)
  - name: 'sms-receiver'
    webhook_configs:
      - url: 'http://sms-gateway.example.com/send'
        send_resolved: false

# Notification template files
templates:
  - '/etc/alertmanager/templates/*.tmpl'
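Before shipping a routing tree like the one above, it helps to reason about which receivers a given alert will actually reach. The sketch below simulates the match / match_re / continue semantics in plain Python; it is a simplified model (real Alertmanager fully anchors regexes and nests routes more subtly), not an Alertmanager library.

```python
import re

# A route: label matchers, a receiver, child routes, and a "continue" flag
# (if True, keep evaluating sibling routes after this one matches).
def route(receiver=None, match=None, match_re=None, routes=None, cont=False):
    return {"receiver": receiver, "match": match or {},
            "match_re": match_re or {}, "routes": routes or [], "continue": cont}

def matches(r, labels):
    for k, v in r["match"].items():
        if labels.get(k) != v:
            return False
    for k, pattern in r["match_re"].items():
        if not re.match(pattern, labels.get(k, "")):
            return False
    return True

def resolve(node, labels, default):
    """Return the receivers an alert with `labels` would be routed to."""
    receivers = []
    for child in node["routes"]:
        if not matches(child, labels):
            continue
        receivers += resolve(child, labels, child["receiver"] or default)
        if not child["continue"]:
            return receivers or [default]
    return receivers or [default]

# A miniature version of the tree above
tree = route(receiver="default-receiver", routes=[
    route("critical-receiver", match={"severity": "critical"}, cont=True),
    route("infra-team", match_re={"team": r"^(infra|platform).*"}),
    route("dba-team", match={"service": "mysql"}),
])

print(resolve(tree, {"severity": "critical", "team": "infra"}, "default-receiver"))
# ['critical-receiver', 'infra-team'] - continue: true fans out to both
```

A critical infrastructure alert reaches both the critical receiver and the team receiver precisely because of `continue: true` on the severity route; flipping it to `false` would stop evaluation after the first match.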
2. Alert Rule Best Practices

| Alert level | Response time | Notification channels | Example scenarios |
|---|---|---|---|
| Critical | within 5 minutes | phone + SMS + DingTalk + email | core service unavailable, database down, major security incident |
| Major | within 30 minutes | DingTalk + email + Slack | severe performance degradation, low disk space, high CPU usage |
| Warning | within 2 hours | email + WeCom | disk usage warning, memory usage warning, service restart |
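The table maps naturally onto a dispatch table in code; a small illustrative sketch (channel names and SLA values mirror the table, nothing more):

```python
# The severity table as a dispatch map (channel names are placeholders).
CHANNELS = {
    "critical": ["phone", "sms", "dingtalk", "email"],
    "major":    ["dingtalk", "email", "slack"],
    "warning":  ["email", "wechat"],
}
RESPONSE_SLA_MINUTES = {"critical": 5, "major": 30, "warning": 120}

def dispatch(alert):
    """Return the notification channels for an alert based on its severity."""
    severity = alert.get("severity", "warning")
    return CHANNELS.get(severity, CHANNELS["warning"])

print(dispatch({"severity": "critical"}))  # ['phone', 'sms', 'dingtalk', 'email']
```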
# Enterprise alert rule examples
groups:
  # ============ Infrastructure alerts ============
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # Node down
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "Node down: {{ $labels.instance }}"
          description: "Node {{ $labels.instance }} has been down for more than 1 minute"
          runbook: "https://runbook.example.com/node-down"
          dashboard: "https://grafana.example.com/d/node-overview"
      # High CPU usage
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          # $value is already a percentage, so use humanize; humanizePercentage
          # would multiply by 100 a second time
          description: "CPU usage has been above 80% for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-cpu"
          dashboard: "https://grafana.example.com/d/node-cpu"
      # High memory usage
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 10m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage has been above 85% for 10 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-memory"
          dashboard: "https://grafana.example.com/d/node-memory"
      # Critically low disk space
      - alert: DiskSpaceCritical
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 90
        for: 2m
        labels:
          severity: critical
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "Critically low disk space on {{ $labels.instance }}"
          description: "Root filesystem usage has been above 90% for 2 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/disk-space"
          dashboard: "https://grafana.example.com/d/node-disk"
      # Disk space warning
      - alert: DiskSpaceWarning
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
        for: 10m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "Disk space warning on {{ $labels.instance }}"
          description: "Root filesystem usage has been above 80% for 10 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/disk-space"
          dashboard: "https://grafana.example.com/d/node-disk"
      # High system load
      - alert: HighSystemLoad
        # counting the mode="system" series per instance yields the CPU core count
        expr: node_load1 > count by(instance) (node_cpu_seconds_total{mode="system"}) * 1.5
        for: 5m
        labels:
          severity: warning
          team: infrastructure
          domain: infrastructure
          service: node
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "1-minute load has exceeded 1.5x the CPU core count for 5 minutes. Current value: {{ $value | humanize }}"
          runbook: "https://runbook.example.com/high-load"
          dashboard: "https://grafana.example.com/d/node-load"
  # ============ Application alerts ============
  - name: application_alerts
    interval: 15s
    rules:
      # Service down
      - alert: ServiceDown
        # `up == 0` covers every scrape job; in practice, narrow it with a
        # job matcher so it does not overlap with NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
          team: development
          domain: application
        annotations:
          summary: "Service down: {{ $labels.job }}"
          description: "Service {{ $labels.job }} (instance {{ $labels.instance }}) has been down for more than 1 minute"
          runbook: "https://runbook.example.com/service-down"
          dashboard: "https://grafana.example.com/d/service-overview"
      # High request latency
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job, instance)) > 1
        for: 2m
        labels:
          severity: warning
          team: development
          domain: application
        annotations:
          summary: "High request latency on {{ $labels.job }}"
          description: "95th-percentile latency has been above 1 second for 2 minutes. Current value: {{ $value | humanize }}s"
          runbook: "https://runbook.example.com/high-latency"
          dashboard: "https://grafana.example.com/d/service-latency"
      # High error rate
      - alert: HighErrorRate
        # aggregate away the status label, otherwise the division has no
        # matching label sets on either side
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
          team: development
          domain: application
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "HTTP 5xx error rate has been above 5% for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/high-error-rate"
          dashboard: "https://grafana.example.com/d/service-errors"
      # Low request volume (the service may be broken without throwing errors)
      - alert: LowRequestRate
        expr: rate(http_requests_total[10m]) < 10
        for: 5m
        labels:
          severity: warning
          team: development
          domain: application
        annotations:
          summary: "Low request volume on {{ $labels.job }}"
          description: "Request rate has been below 10 req/s for 5 minutes. Current value: {{ $value | humanize }} req/s"
          runbook: "https://runbook.example.com/low-request-rate"
          dashboard: "https://grafana.example.com/d/service-traffic"
  # ============ Database alerts ============
  - name: database_alerts
    interval: 30s
    rules:
      # Too many MySQL connections
      - alert: MySQLHighConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "High MySQL connection count on {{ $labels.instance }}"
          description: "Connections have exceeded 80% of max_connections for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/mysql-connections"
          dashboard: "https://grafana.example.com/d/mysql-overview"
      # MySQL replication lag
      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: critical
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "MySQL replication lag on {{ $labels.instance }}"
          description: "Replica has been more than 30 seconds behind the source for 5 minutes. Current value: {{ $value | humanize }}s"
          runbook: "https://runbook.example.com/mysql-replication"
          dashboard: "https://grafana.example.com/d/mysql-replication"
      # Low InnoDB buffer pool hit rate
      - alert: MySQLInnoDBBufferPoolHitRateLow
        expr: (1 - (mysql_global_status_innodb_buffer_pool_reads / mysql_global_status_innodb_buffer_pool_read_requests)) * 100 < 90
        for: 10m
        labels:
          severity: warning
          team: dba
          domain: database
          service: mysql
        annotations:
          summary: "Low InnoDB buffer pool hit rate on {{ $labels.instance }}"
          description: "InnoDB buffer pool hit rate has been below 90% for 10 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/mysql-innodb"
          dashboard: "https://grafana.example.com/d/mysql-innodb"
  # ============ Business-metric alerts ============
  - name: business_alerts
    interval: 1m
    rules:
      # Abnormal drop in order volume
      - alert: OrderRateAbnormalDrop
        # compare the current 10-minute rate with the same window 40 minutes ago
        expr: rate(orders_total[10m]) < rate(orders_total[10m] offset 40m) * 0.5
        for: 5m
        labels:
          severity: critical
          team: business
          domain: business
        annotations:
          summary: "Abnormal drop in order volume"
          description: "The 10-minute order rate is more than 50% below its level 40 minutes ago. Current rate: {{ $value | humanize }} orders/s"
          runbook: "https://runbook.example.com/order-drop"
          dashboard: "https://grafana.example.com/d/business-orders"
      # High payment failure rate
      - alert: HighPaymentFailureRate
        expr: sum(rate(payment_attempts_total{status="failed"}[10m])) / sum(rate(payment_attempts_total[10m])) * 100 > 10
        for: 5m
        labels:
          severity: critical
          team: business
          domain: business
        annotations:
          summary: "High payment failure rate"
          description: "Payment failure rate has been above 10% for 5 minutes. Current value: {{ $value | humanize }}%"
          runbook: "https://runbook.example.com/payment-failure"
          dashboard: "https://grafana.example.com/d/business-payments"
      # Drop in user activity
      - alert: UserActivityDrop
        expr: active_users_total < (active_users_total offset 1d) * 0.7
        for: 1h
        labels:
          severity: warning
          team: business
          domain: business
        annotations:
          summary: "Drop in user activity"
          description: "Active users are more than 30% below the same time yesterday. Current value: {{ $value | humanize }}"
          runbook: "https://runbook.example.com/user-activity"
          dashboard: "https://grafana.example.com/d/business-users"
  # ============ Blackbox probe alerts ============
  - name: blackbox_alerts
    interval: 30s
    rules:
      # HTTP probe failed
      - alert: HTTPProbeFailed
        expr: probe_success{job="blackbox-http"} == 0
        for: 1m
        labels:
          severity: critical
          team: infrastructure
          domain: availability
        annotations:
          summary: "HTTP service unavailable: {{ $labels.instance }}"
          description: "HTTP probe of {{ $labels.instance }} has been failing for 1 minute"
          runbook: "https://runbook.example.com/http-probe-failed"
          dashboard: "https://grafana.example.com/d/blackbox-http"
      # SSL certificate expiring soon
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox-https"} - time() < 86400 * 30  # expires within 30 days
        for: 0m
        labels:
          severity: warning
          team: infrastructure
          domain: security
        annotations:
          summary: "SSL certificate expiring soon on {{ $labels.instance }}"
          description: "The SSL certificate expires in {{ $value | humanizeDuration }}"
          runbook: "https://runbook.example.com/ssl-cert-expiring"
          dashboard: "https://grafana.example.com/d/blackbox-ssl"
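All of the rules above lean on the `for:` clause: the expression must stay true for the entire window before the alert leaves the pending state, which filters out momentary spikes. A small Python sketch of that pending→firing logic, assuming evenly spaced evaluations (this models the idea, not Prometheus internals):

```python
# Simulate PromQL's `for:` clause: an alert fires only after its expression
# has been continuously true for the configured duration.
def firing(samples, threshold, for_seconds, step=60):
    """Return the alert state after each evaluation (one sample per step)."""
    held = 0          # how long the condition has been continuously true
    states = []
    for value in samples:
        held = held + step if value > threshold else 0
        if held >= for_seconds:
            states.append("firing")
        elif held > 0:
            states.append("pending")
        else:
            states.append("inactive")
    return states

# Six evaluations above an 85% threshold with `for: 5m` (300s) at 60s steps:
print(firing([90, 90, 90, 90, 90, 90], 85, 300))
# ['pending', 'pending', 'pending', 'pending', 'firing', 'firing']

# A single dip below the threshold resets the timer:
print(firing([90, 90, 50, 90, 90], 85, 180))
# ['pending', 'pending', 'inactive', 'pending', 'pending']
```

This is why a flapping metric with `for: 5m` may never fire at all: every dip resets the held duration to zero.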
VII. High Availability and Production Hardening
Production requirements for the monitoring stack itself:
1. High availability: the monitoring system must not become a single point of failure
2. Scalability: support thousands of monitored nodes and petabyte-scale storage
3. Performance: second-level query latency with bounded resource consumption
4. Security: access control, encryption in transit, audit logging
5. Maintainability: automated deployment, configuration management, self-healing
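Running two identical replicas (the standard HA pattern below) means every series exists twice; the query layer must deduplicate by a replica label, which is what Thanos Query does with `--query.replica-label`. A minimal Python sketch of that idea (the data shapes here are illustrative, not a Thanos API):

```python
# Replica-label deduplication: two series that differ only in the replica
# label collapse into one at query time.
def deduplicate(series, replica_label="replica"):
    seen = {}
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        seen.setdefault(key, value)  # keep the first replica's sample
    return list(seen.values())

series = [
    ({"instance": "node-01", "replica": "A"}, 0.42),
    ({"instance": "node-01", "replica": "B"}, 0.42),
    ({"instance": "node-02", "replica": "A"}, 0.17),
]
print(deduplicate(series))  # [0.42, 0.17]
```

The same principle explains why the two replicas must carry distinct `external_labels` (`replica: 'A'` / `replica: 'B'`): without them, the query layer cannot tell the copies apart.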
1. Prometheus High-Availability Architecture
# Prometheus high-availability configuration examples
# 1. Run two identical Prometheus replicas
# prometheus-a.yml and prometheus-b.yml are identical except for external_labels
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    replica: 'A'  # replica A uses 'A', replica B uses 'B'

# 2. Service discovery (so both replicas scrape the same targets)
scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      # relabel regexes are fully anchored, so allow surrounding tags
      - source_labels: [__meta_consul_tags]
        regex: '.*,(production|staging|dev),.*'
        target_label: env
# 3. Thanos Sidecar (runs alongside each Prometheus instance)
# Note: Thanos components are actually configured with command-line flags;
# the YAML below is an illustrative layout whose keys map to flags such as
# --grpc-address, --http-address, --prometheus.url, --tsdb.path and
# --objstore.config-file (the objstore block is the real objstore.yml format).
# thanos-sidecar.yml
prometheus:
  external_url: "http://prometheus-a.example.com:9090"
  # or "http://prometheus-b.example.com:9090"
thanos:
  sidecar:
    grpc_address: "0.0.0.0:10901"
    http_address: "0.0.0.0:10902"
    prometheus_url: "http://localhost:9090"
    tsdb_path: "/var/lib/prometheus"
  objstore:
    type: S3
    config:
      bucket: "thanos-metrics"
      endpoint: "s3.example.com"
      access_key: "YOUR_ACCESS_KEY"
      secret_key: "YOUR_SECRET_KEY"
      insecure: false
      signature_version2: false
      put_user_metadata: {}
      http_config:
        idle_conn_timeout: 90s
        response_header_timeout: 2m
    trace:
      enable: true
  compactor:
    data_dir: "/var/lib/thanos/compactor"
    retention: 30d
  query:
    http_address: "0.0.0.0:10903"
    grpc_address: "0.0.0.0:10904"
    store:
      - "prometheus-a.example.com:10901"
      - "prometheus-b.example.com:10901"
      - "thanos-store.example.com:10901"
# 4. Thanos Query front end (same caveat: expressed as flags in practice)
# thanos-query.yml
http:
  address: "0.0.0.0:10905"
  grace_period: 2m
grpc:
  address: "0.0.0.0:10906"
query:
  # deduplicate across the two replicas by dropping these labels
  replica_labels:
    - "replica"
    - "prometheus_replica"
  auto_downsampling: true
  partial_response: true
  default_evaluation_interval: 1m
stores:
  - "prometheus-a.example.com:10901"
  - "prometheus-b.example.com:10901"
  - "thanos-store.example.com:10901"
# 5. Load balancing (Nginx)
# nginx.conf
upstream prometheus {
    zone prometheus 64k;
    server prometheus-a.example.com:9090 max_fails=3 fail_timeout=30s;
    server prometheus-b.example.com:9090 max_fails=3 fail_timeout=30s;
    keepalive 16;
}
upstream thanos_query {
    zone thanos_query 64k;
    server thanos-query.example.com:10905 max_fails=3 fail_timeout=30s;
    keepalive 16;
}
server {
    listen 80;
    server_name prometheus.example.com;
    location / {
        proxy_pass http://prometheus;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # Failover behavior
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_connect_timeout 2s;
        proxy_read_timeout 30s;
        proxy_send_timeout 30s;
    }
}
server {
    listen 80;
    server_name thanos.example.com;
    location / {
        proxy_pass http://thanos_query;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
# 6. Target sharding (when one server cannot scrape everything)
# Hash each target onto a shard; each Prometheus instance keeps one shard
scrape_configs:
  - job_name: 'node-shard-0'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      # shard by a hash of the service ID
      - source_labels: [__meta_consul_service_id]
        action: hashmod
        modulus: 2
        target_label: __tmp_hash
      - source_labels: [__tmp_hash]
        action: keep
        regex: ^0$  # this instance only scrapes targets hashing to 0
  - job_name: 'node-shard-1'
    consul_sd_configs:
      - server: 'consul.example.com:8500'
        services: ['node-exporter']
    relabel_configs:
      - source_labels: [__meta_consul_service_id]
        action: hashmod
        modulus: 2
        target_label: __tmp_hash
      - source_labels: [__tmp_hash]
        action: keep
        regex: ^1$  # this instance only scrapes targets hashing to 1
# 7. Remote write (ship data to a VictoriaMetrics cluster)
remote_write:
  - url: "http://vminsert:8480/insert/0/prometheus/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      capacity: 100000
      max_shards: 30
    write_relabel_configs:
      # only keep the important metrics
      - action: keep
        regex: "up|node_.*|process_.*|prometheus_.*"
        source_labels: [__name__]
# 8. Resource limits
# Constrain resource usage through systemd (cgroups)
[Service]
MemoryMax=8G      # MemoryLimit= is the deprecated cgroup-v1 spelling
CPUQuota=200%
IOWeight=100
TasksMax=10000
# 9. Data retention policy
# Local retention is set with command-line flags, not in prometheus.yml:
#   --storage.tsdb.retention.time=15d   # keep 15 days locally
#   --storage.tsdb.retention.size=500GB # optional size-based cap
# prometheus.yml itself only carries the out-of-order window (Prometheus >= 2.39):
storage:
  tsdb:
    out_of_order_time_window: 1h
# Longer retention lives in remote storage
remote_write:
  - url: "http://long-term-storage:9090/api/v1/write"
    remote_timeout: 30s
    # add write_relabel_configs here if only a subset of metrics should be kept
# 10. Backup strategy
#!/bin/bash
# backup_prometheus.sh - back up Prometheus via the TSDB snapshot API
# (requires --web.enable-admin-api; avoids stopping the server)
BACKUP_DIR="/backup/prometheus"
DATE=$(date +%Y%m%d_%H%M%S)

# Trigger a snapshot; the response contains the snapshot directory name
SNAPSHOT=$(curl -s -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | sed -n 's/.*"name":"\([^"]*\)".*/\1/p')

# Archive the snapshot and the configuration
tar czf "$BACKUP_DIR/prometheus_data_$DATE.tar.gz" "/var/lib/prometheus/snapshots/$SNAPSHOT"
tar czf "$BACKUP_DIR/prometheus_config_$DATE.tar.gz" /etc/prometheus/

# Remove the snapshot to reclaim disk space
rm -rf "/var/lib/prometheus/snapshots/$SNAPSHOT"

echo "Backup complete: $BACKUP_DIR/prometheus_data_$DATE.tar.gz"
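The hashmod sharding in step 6 can be sanity-checked offline before committing to a shard count. The sketch below mimics the idea of hashing each target onto a shard; Prometheus internally folds an MD5 sum into a uint64, and this sketch follows the same idea but not the exact bit layout, so it illustrates the distribution rather than predicting Prometheus's exact assignment:

```python
import hashlib

# Approximate the `hashmod` relabel action: hash the joined source-label
# value and take it modulo the shard count.
def shard_of(value, modulus=2):
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus

targets = ["node-01:9100", "node-02:9100", "node-03:9100", "node-04:9100"]
for shard in range(2):
    kept = [t for t in targets if shard_of(t) == shard]
    print(f"shard {shard} scrapes: {kept}")
```

The key properties to verify are that every target lands on exactly one shard and that the assignment is deterministic, so both the `node-shard-0` and `node-shard-1` jobs agree on who scrapes what.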
2. Performance Tuning Suggestions
- Storage: keep TSDB data on SSDs, enable compression, and clean up expired data regularly.
- Queries: precompute common queries with recording rules, optimize PromQL, and avoid high-cardinality queries.
- Collection: choose sensible scrape intervals, use connection pooling, enable HTTP/2, and cut network round trips.
- Memory: tune block sizes, watch memory usage, lean on memory-mapped files, and guard against OOM.
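Since high-cardinality labels are the most common cause of memory blow-ups and slow queries, it is worth auditing them periodically. A sketch that computes per-label cardinality from a list of series (series are modeled here as plain label dicts; in practice you would pull them from the `/api/v1/series` endpoint):

```python
from collections import defaultdict

# Compute per-label cardinality over a set of series. Labels whose value
# count approaches the series count (user IDs, request IDs, full URLs) are
# the high-cardinality offenders to relabel away or drop.
def label_cardinality(series):
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}

series = [
    {"__name__": "http_requests_total", "instance": "a", "user_id": "u1"},
    {"__name__": "http_requests_total", "instance": "a", "user_id": "u2"},
    {"__name__": "http_requests_total", "instance": "b", "user_id": "u3"},
]
card = label_cardinality(series)
print(card)  # {'__name__': 1, 'instance': 2, 'user_id': 3} -> user_id is the offender
```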
VIII. Summary and Best Practices
What a mature monitoring system delivers:
1. Full coverage: infrastructure, applications, and business all monitored
2. Timely alerting: detection time drops from hours to minutes
3. Fast diagnosis: MTTR (mean time to repair) falls significantly
4. Data-driven decisions: monitoring data feeds capacity planning and performance tuning
5. Team enablement: development, operations, and business teams can all use the data
1. Implementation Roadmap

| Phase | Duration | Main tasks | Key deliverables |
|---|---|---|---|
| Phase 1 | 1-2 weeks | infrastructure monitoring, basic alerting | server monitoring, baseline alert rules |
| Phase 2 | 2-4 weeks | application and business monitoring | application performance monitoring, key business metrics |
| Phase 3 | 4-8 weeks | high availability, automation, tuning | HA monitoring stack, automated deployment |
| Phase 4 | ongoing | intelligent analysis, predictive alerting | anomaly detection, capacity forecasting, AIOps |
2. Monitoring System Checklist
# Enterprise monitoring system checklist
"""
1. Data collection
□ Node Exporter deployed on every server
□ Dedicated exporters for key services (MySQL, Nginx, ...)
□ Application-level business metrics collected
□ Network probing (blackbox monitoring) configured
□ Log-derived metrics collected (via Loki or ELK)
2. Storage and processing
□ Prometheus deployed for high availability
□ Sensible retention policy (local + remote)
□ Alert rules clearly categorized
□ Recording rules speed up common queries
□ Remote write configured correctly
3. Visualization
□ Grafana dashboards cover every monitoring dimension
□ Dashboards clearly categorized and organized
□ Key metrics visualized in real time
□ Historical data available for retrospective analysis
□ Permissions configured correctly
4. Alerting and notification
□ Sensible severity tiers (critical/major/warning)
□ Full channel coverage (email/DingTalk/SMS)
□ Reasonable inhibition rules
□ Disciplined silence management
□ Clear alert-handling workflow
5. High availability
□ Multiple Prometheus replicas
□ Alertmanager running as a cluster
□ Grafana backed by persistent storage
□ Load balancing configured correctly
□ Backup and restore procedure verified
6. Performance
□ Query latency at the second level
□ Scrape intervals reasonable (no pressure on targets)
□ Memory usage under control (no OOM risk)
□ Sufficient disk space (with early warning)
□ Adequate network bandwidth
7. Security
□ Access control on the monitoring stack
□ Data encrypted in transit (HTTPS)
□ Authentication and authorization configured
□ Audit logging enabled
□ Secrets and sensitive data protected
8. Operations
□ Configuration under version control
□ Automated deployment scripts
□ The monitoring system monitors itself
□ Capacity planning and forecasting
□ Regular recovery drills
9. Documentation
□ Complete architecture design docs
□ Deployment and operations manual
□ Alert runbooks
□ Troubleshooting guide
□ Training material
10. Compliance
□ Data retention meets regulatory requirements
□ Audit logs satisfy compliance
□ Access control follows security policy
□ Alert notification meets SLAs
□ Standardized incident-handling process
"""
# Monitoring maturity assessment
MATURITY_LEVELS = {
    "Level 1 - Basic": "basic server metrics, manual alerting",
    "Level 2 - Standard": "application and business monitoring, automated alerting",
    "Level 3 - Advanced": "full-stack monitoring, predictive alerting",
    "Level 4 - Intelligent": "AIOps, automated remediation, business impact analysis"
}
3. Common Problems and Solutions
Problem: storage grows too fast and queries slow down
Solutions:
1. Tune block sizes and retention time
2. Precompute common queries with recording rules
3. Cap label cardinality (avoid high-cardinality labels)
4. Enable sharding and remote storage
Problem: alert storms overwhelm the notification channels
Solutions:
1. Configure inhibition rules sensibly
2. Set reasonable alert intervals and wait times
3. Group alerts (by service, by instance)
4. Implement alert escalation and de-escalation
Problem: metrics are inconsistent or missing across instances
Solutions:
1. Synchronize clocks everywhere (NTP)
2. Configure sensible scrape timeouts
3. Monitor the health of the exporters themselves
4. Run data consistency checks
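The grouping advice for alert storms is easy to picture in code: many firing alerts collapse into one notification per group key, which is the effect of `group_by` in the Alertmanager route tree. An illustrative sketch:

```python
from collections import defaultdict

# Alertmanager-style grouping: collapse many firing alerts into one
# notification per group key, the same effect as `group_by` in the routes.
def group_alerts(alerts, group_by=("alertname", "cluster")):
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# 50 nodes breach the CPU threshold at once, plus one node-down alert
alerts = [
    {"alertname": "HighCpuUsage", "cluster": "prod", "instance": f"node-{i}"}
    for i in range(50)
] + [{"alertname": "NodeDown", "cluster": "prod", "instance": "node-7"}]

groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "notifications")  # 51 alerts -> 2 notifications
```

Fifty simultaneous CPU alerts become a single grouped notification, which is what keeps the on-call phone usable during a cluster-wide incident.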
4. Future Trends
AIOps
Machine-learning anomaly detection, root-cause analysis, automated remediation.
Observability
Metrics, logs, and traces treated as one discipline, with end-to-end tracing.
Cloud-native monitoring
Kubernetes-native monitoring, Service Mesh monitoring, Serverless monitoring.
Business observability
Business-metric monitoring, user-experience monitoring, business impact analysis.
Parting advice:
1. Start small: begin with core business services and expand coverage step by step
2. Iterate continuously: a monitoring system needs ongoing tuning and evolution
3. Culture first: build a data-driven operations culture
4. Tools are means: the goal is solving problems, not running tools
5. Everyone participates: monitoring is not just an ops concern; development, QA, and business teams must all take part
📊 Monitoring is the eyes of operations, and alerting is its ears!
A solid enterprise monitoring and alerting system is the foundation of stable business operations. With the material above, you should be able to build a complete pipeline from data collection through storage and processing to visualization and alert notification. Remember: the goal of monitoring is not to collect data, but to use data to find, fix, and prevent problems.
Questions and suggestions are welcome in the comments!