ZooKeeper Monitor Guide
New Metrics System
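The New Metrics System has been available since 3.6.0. It provides a rich set of metrics to help users monitor ZooKeeper, covering znodes, network, disk, quorum, leader election, clients, security, failures, watches/sessions, the requestProcessor, and more.
Metrics
All the metrics are defined in ServerMetrics.java.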
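Prometheus
- Running a Prometheus monitoring service is the easiest way to ingest and record ZooKeeper's metrics.
- Prerequisites:
  - Enable the Prometheus MetricsProvider by setting metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider in zoo.cfg.
  - Optionally, configure the port by setting metricsProvider.httpPort (default: 7000).
  - Install Prometheus: go to the download page on the official website and download the latest release.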
- Point the Prometheus scraper at the ZooKeeper cluster endpoints:

      cat > /tmp/test-zk.yaml <<EOF
      global:
        scrape_interval: 10s
      scrape_configs:
        - job_name: test-zk
          static_configs:
          - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
      EOF
      cat /tmp/test-zk.yaml

- Start the Prometheus process:

      nohup /tmp/prometheus \
          --config.file /tmp/test-zk.yaml \
          --web.listen-address ":9090" \
          --storage.tsdb.path "/tmp/test-zk.data" >> /tmp/test-zk.log 2>&1 &

- Prometheus will now scrape the ZooKeeper metrics every 10 seconds.
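
To check what Prometheus will see on each scrape, it can help to read a server's metrics endpoint directly. The following is a minimal sketch, not part of ZooKeeper itself: it assumes the provider serves the Prometheus text format over plain HTTP at the path /metrics on metricsProvider.httpPort, and the helper names parse_metrics and fetch_metrics are hypothetical.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format lines into {sample_name: float}.

    Assumes samples carry no trailing timestamp, so the value is the
    last space-separated field on each line.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if not name:
            continue
        try:
            samples[name] = float(value)
        except ValueError:
            # Ignore lines whose last field is not a number.
            continue
    return samples


def fetch_metrics(host, port=7000, timeout=5):
    """Fetch and parse metrics from one server (the /metrics path is an assumption)."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Illustrative sample text, not captured from a real server.
SAMPLE = """\
# HELP znode_count znode_count
# TYPE znode_count gauge
znode_count 5.0
num_alive_connections 1.0
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```

Calling fetch_metrics("192.168.10.32") against one of the targets configured above should return the same samples Prometheus scrapes, provided the endpoint is reachable from the machine running the script.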
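Alerting with Prometheus
- We recommend reading the official Prometheus alerting documentation to learn the basic principles of alerting.
- We recommend using Prometheus Alertmanager, which makes it easier to receive alerts by email or instant message (via webhook).
- The alerting example below covers metrics that deserve special attention. Note: it is for reference only; adjust the thresholds to your actual workload and resource environment.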

      # Use ./promtool check rules rules/zk.yml to check the correctness of the config file
      cat rules/zk.yml

      groups:
      - name: zk-alert-example
        rules:
        - alert: ZooKeeper server is down
          expr: up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Instance {{ $labels.instance }} ZooKeeper server is down"
            description: "{{ $labels.instance }} of job {{$labels.job}} ZooKeeper server is down: [{{ $value }}]."

        - alert: create too many znodes
          expr: znode_count > 1000000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many znodes"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many znodes: [{{ $value }}]."

        - alert: create too many connections
          expr: num_alive_connections > 50 # suppose we use the default maxClientCnxns: 60
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} create too many connections"
            description: "{{ $labels.instance }} of job {{$labels.job}} create too many connections: [{{ $value }}]."

        - alert: znode total occupied memory is too big
          expr: approximate_data_size /1024 /1024 > 1 * 1024 # more than 1024 MB (1 GB)
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} znode total occupied memory is too big"
            description: "{{ $labels.instance }} of job {{$labels.job}} znode total occupied memory is too big: [{{ $value }}] MB."

        - alert: set too many watch
          expr: watch_count > 10000
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} set too many watch"
            description: "{{ $labels.instance }} of job {{$labels.job}} set too many watch: [{{ $value }}]."

        - alert: a leader election happens
          expr: increase(election_time_count[5m]) > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} a leader election happens"
            description: "{{ $labels.instance }} of job {{$labels.job}} a leader election happens: [{{ $value }}]."

        - alert: open too many files
          expr: open_file_descriptor_count > 300
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} open too many files"
            description: "{{ $labels.instance }} of job {{$labels.job}} open too many files: [{{ $value }}]."

        - alert: fsync time is too long
          expr: rate(fsynctime_sum[1m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} fsync time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} fsync time is too long: [{{ $value }}]."

        - alert: take snapshot time is too long
          expr: rate(snapshottime_sum[5m]) > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} take snapshot time is too long"
            description: "{{ $labels.instance }} of job {{$labels.job}} take snapshot time is too long: [{{ $value }}]."

        - alert: avg latency is too high
          expr: avg_latency > 100
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "Instance {{ $labels.instance }} avg latency is too high"
            description: "{{ $labels.instance }} of job {{$labels.job}} avg latency is too high: [{{ $value }}]."

        - alert: JvmMemoryFillingUp
          expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "JVM memory filling up (instance {{ $labels.instance }})"
            description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }} value = {{ $value }}\n"

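
To make the threshold arithmetic in the rules above concrete, here is a small sketch that applies the simple gauge comparisons to one set of samples. It is only an illustration: it ignores the for: duration and the rate()/increase() rules, which Prometheus evaluates over time, and the names THRESHOLDS, data_size_mb and breached are hypothetical.

```python
# Gauge thresholds copied from the example rules; adjust them for your cluster.
THRESHOLDS = {
    "znode_count": 1_000_000,
    "num_alive_connections": 50,
    "watch_count": 10_000,
    "open_file_descriptor_count": 300,
    "avg_latency": 100,
}


def data_size_mb(approximate_data_size_bytes):
    """Convert bytes to MB, mirroring `approximate_data_size /1024 /1024`."""
    return approximate_data_size_bytes / 1024 / 1024


def breached(samples):
    """Return the sorted names of metrics whose current value exceeds its threshold."""
    names = [
        name
        for name, limit in THRESHOLDS.items()
        if samples.get(name, 0) > limit
    ]
    # The data-size rule compares megabytes against 1 * 1024 MB (1 GB).
    if data_size_mb(samples.get("approximate_data_size", 0)) > 1 * 1024:
        names.append("approximate_data_size")
    return sorted(names)


if __name__ == "__main__":
    # Illustrative values only: 2 GB of znode data and 2 million znodes.
    example = {
        "znode_count": 2_000_000,
        "num_alive_connections": 10,
        "approximate_data_size": 2 * 1024 ** 3,
    }
    print(breached(example))
```

A check like this can be handy when choosing thresholds, but the alerting itself is best left to Prometheus and Alertmanager, which also handle the for: window and notification routing.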
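Grafana
- Grafana has built-in Prometheus support; just add a Prometheus data source:

      Name:   test-zk
      Type:   Prometheus
      Url:    http://127.0.0.1:9090
      Access: proxy

- Then download and import the default ZooKeeper dashboard template and customize it.
- Users who have good improvements to contribute can request a Grafana dashboard account by sending an email to [email protected].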
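
Before building dashboard panels, it can be useful to confirm that the Prometheus instance behind the data source actually returns ZooKeeper data. The sketch below uses the Prometheus HTTP API instant-query endpoint (/api/v1/query). The base URL assumes the Prometheus instance started earlier on port 9090, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instant_vector_to_dict(payload):
    """Map each result's `instance` label to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each result has the form {"metric": {...labels}, "value": [timestamp, "value"]}.
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in data["result"]
    }


def query(base_url, promql, timeout=5):
    """Run an instant query and return {instance: value}."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return instant_vector_to_dict(json.load(resp))


if __name__ == "__main__":
    # Assumes Prometheus is listening locally on port 9090.
    print(query("http://127.0.0.1:9090", "avg_latency"))
```

The same PromQL expressions used here (for example avg_latency or znode_count) can be pasted into a Grafana panel query once the data source is confirmed to work.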
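InfluxDB
InfluxDB is an open source time series database that is often used to store metrics from ZooKeeper. You can download the open source version or create a free account on InfluxDB Cloud. In either case, configure the Apache Zookeeper Telegraf plugin to start collecting metrics from your ZooKeeper clusters and storing them in your InfluxDB instance. There is also an Apache Zookeeper InfluxDB template that includes the Telegraf configuration and a dashboard, so you can get set up right away.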
JMX
More details can be found here.
Four letter words
More details can be found here.
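
As a rough illustration of how a four letter word is used for monitoring, the sketch below sends mntr to a server's client port and parses the tab-separated key/value lines it returns. It assumes the command is enabled on the server (four letter words may need to be allowed through 4lw.commands.whitelist) and that the client port is the default 2181. The helper names are hypothetical.

```python
import socket


def parse_mntr(text):
    """Parse `mntr` output, one "key<TAB>value" pair per line, into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines without a tab, such as error messages
            stats[key] = value.strip()
    return stats


def four_letter_word(host, port=2181, command="mntr", timeout=5):
    """Send a four letter word and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        # The server closes the connection after replying, so read until EOF.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Assumes a ZooKeeper server is reachable on localhost:2181.
    print(parse_mntr(four_letter_word("127.0.0.1")))
```

Values are kept as strings because mntr mixes numeric fields with text fields such as the server version and state.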