ZooKeeper Monitor Guide

New Metrics System
JMX
Four letter words

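New Metrics System

The New Metrics System has been available since 3.6.0. It provides a rich set of metrics that help users monitor ZooKeeper, covering znodes, network, disk, quorum, leader election, clients, security, failures, watches/sessions, the requestProcessor, and more.

Metrics

All the metrics are defined in ServerMetrics.java.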

Prometheus

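Running a Prometheus monitoring service is the easiest way to ingest and record ZooKeeper's metrics.
Prerequisites
Enable the Prometheus MetricsProvider by setting metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider in zoo.cfg.
The port can also be configured by setting metricsProvider.httpPort (default: 7000).
Install Prometheus: go to the download page of the official website and download the latest release.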
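
Once the provider is enabled, each server should serve its metrics over HTTP in the Prometheus text format. The following is a minimal Python sketch of reading that output, not an official client. It assumes the endpoint path is /metrics on the configured port; the function names and the sample text are hypothetical and only illustrate the format.

```python
import urllib.request


def fetch_metrics_text(host, port=7000, timeout=5.0):
    """Fetch the raw metrics text from one server.

    Assumption: the provider serves the text format at /metrics.
    """
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse unlabelled samples ("name value") into a dict of floats.

    Comment lines (# HELP / # TYPE) and labelled samples are skipped
    to keep the sketch short.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


# Hypothetical sample of what a server might return.
SAMPLE_EXPOSITION = """\
# HELP znode_count znode_count
# TYPE znode_count gauge
znode_count 5.0
# TYPE num_alive_connections gauge
num_alive_connections 2.0
avg_latency 0.5
"""

metrics = parse_metrics(SAMPLE_EXPOSITION)
print(metrics["znode_count"], metrics["num_alive_connections"])  # prints: 5.0 2.0
```

Against a live server, parse_metrics(fetch_metrics_text("192.168.10.32")) would return the same kind of dictionary, provided the assumed path is correct for your version.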

Point the Prometheus scraper at the ZooKeeper cluster endpoints

cat > /tmp/test-zk.yaml <<EOF
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: test-zk
    static_configs:
    - targets: ['192.168.10.32:7000','192.168.10.33:7000','192.168.10.34:7000']
EOF
cat /tmp/test-zk.yaml

Start the Prometheus server

nohup /tmp/prometheus \
    --config.file /tmp/test-zk.yaml \
    --web.listen-address ":9090" \
    --storage.tsdb.path "/tmp/test-zk.data" >> /tmp/test-zk.log  2>&1 &

Prometheus will now scrape the ZooKeeper metrics every 10 seconds.
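
To confirm that the scrape targets are actually being reached, one option is to ask Prometheus for the built-in up metric through its HTTP query API. The sketch below is a hedged illustration in Python: it builds an instant-query URL for /api/v1/query and lists the instances whose up value is 0. The helper names and the sample response are hypothetical; only the response shape follows the Prometheus query API.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url, promql, timeout=5.0):
    """Run the query against a live Prometheus and return the raw JSON text."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def instances_down(response_text):
    """Return the sorted instances whose sample value is 0 in a vector result."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # value is a string in the API
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "unknown"))
    return sorted(down)


# Hypothetical response for the query up{job="test-zk"}.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "192.168.10.32:7000", "job": "test-zk"},
             "value": [1700000000.0, "1"]},
            {"metric": {"instance": "192.168.10.33:7000", "job": "test-zk"},
             "value": [1700000000.0, "0"]},
        ],
    },
})

print(instances_down(SAMPLE_RESPONSE))  # prints: ['192.168.10.33:7000']
```

With the server started as above, instances_down(run_query("http://localhost:9090", 'up{job="test-zk"}')) would perform the same check against real data, assuming Prometheus is reachable on localhost.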

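Alerting with Prometheus

We recommend reading the official Prometheus alerting documentation to learn the basic principles of alerting.
We recommend using Prometheus Alertmanager, which makes it more convenient to receive alerts by email or instant message (through a webhook).

The following alerting example covers the metrics that deserve special attention. Note: it is for reference only; adjust the thresholds to your actual workload and resource environment.

Use ./promtool check rules rules/zk.yml to check the correctness of the rules file.
cat rules/zk.yml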

groups:
- name: zk-alert-example
  rules:
  - alert: ZooKeeper server is down
    expr:  up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} ZooKeeper server is down"
      description: "{{ $labels.instance }} of job {{$labels.job}} ZooKeeper server is down: [{{ $value }}]."

  - alert: create too many znodes
    expr: znode_count > 1000000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} create too many znodes"
      description: "{{ $labels.instance }} of job {{$labels.job}} create too many znodes: [{{ $value }}]."

  - alert: create too many connections
    expr: num_alive_connections > 50 # suppose we use the default maxClientCnxns: 60
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} create too many connections"
      description: "{{ $labels.instance }} of job {{$labels.job}} create too many connections: [{{ $value }}]."

  - alert: znode total occupied memory is too big
    expr: approximate_data_size /1024 /1024 > 1 * 1024 # more than 1024 MB(1 GB)
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} znode total occupied memory is too big"
      description: "{{ $labels.instance }} of job {{$labels.job}} znode total occupied memory is too big: [{{ $value }}] MB."

  - alert: set too many watch
    expr: watch_count > 10000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} set too many watch"
      description: "{{ $labels.instance }} of job {{$labels.job}} set too many watch: [{{ $value }}]."

  - alert: a leader election happens
    expr: increase(election_time_count[5m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} a leader election happens"
      description: "{{ $labels.instance }} of job {{$labels.job}} a leader election happens: [{{ $value }}]."

  - alert: open too many files
    expr: open_file_descriptor_count > 300
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open too many files"
      description: "{{ $labels.instance }} of job {{$labels.job}} open too many files: [{{ $value }}]."

  - alert: fsync time is too long
    expr: rate(fsynctime_sum[1m]) > 100
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} fsync time is too long"
      description: "{{ $labels.instance }} of job {{$labels.job}} fsync time is too long: [{{ $value }}]."

  - alert: take snapshot time is too long
    expr: rate(snapshottime_sum[5m]) > 100
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} take snapshot time is too long"
      description: "{{ $labels.instance }} of job {{$labels.job}} take snapshot time is too long: [{{ $value }}]."

  - alert: avg latency is too high
    expr: avg_latency > 100
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} avg latency is too high"
      description: "{{ $labels.instance }} of job {{$labels.job}} avg latency is too high: [{{ $value }}]."

  - alert: JvmMemoryFillingUp
    expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "JVM memory filling up (instance {{ $labels.instance }})"
      description: "JVM memory is filling up (> 80%)\n labels: {{ $labels }}  value = {{ $value }}\n"
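
The gauge-based rules above boil down to simple threshold comparisons, which can be helpful to reason about before tuning them. The following Python sketch mirrors only those comparisons under stated assumptions: it ignores the rate() and increase() rules and the "for: 1m" hold time, and the function name and the snapshot values are hypothetical. Note that the approximate_data_size rule divides bytes by 1024 twice and compares against 1024 MB, which is equivalent to comparing the raw byte value against 1024 * 1024 * 1024.

```python
# Gauge thresholds copied from the example rules. The rate()/increase()
# rules and the "for: 1m" hold time are intentionally not modelled here.
THRESHOLDS = {
    "znode_count": 1_000_000,
    "num_alive_connections": 50,
    "watch_count": 10_000,
    "open_file_descriptor_count": 300,
    "avg_latency": 100,
    "approximate_data_size": 1024 * 1024 * 1024,  # 1024 MB expressed in bytes
}


def breached(samples):
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if name in samples and samples[name] > limit
    )


# Hypothetical snapshot of one server's metrics.
snapshot = {
    "znode_count": 1_200_000,
    "num_alive_connections": 12,
    "approximate_data_size": 2 * 1024 ** 3,  # 2 GB
}

print(breached(snapshot))  # prints: ['approximate_data_size', 'znode_count']
```

Like the rules, the comparisons are strict: a value exactly equal to the threshold does not count as a breach.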

Grafana

Grafana has built-in Prometheus support; just add a Prometheus data source:
```
Name:   test-zk
Type:   Prometheus
Url:    http://<prometheus-host>:9090
Access: proxy
```
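Then download and import the default ZooKeeper dashboard template and customize it.
If you have improvements to contribute to the dashboard, email dev@zookeeper.apache.org to request a Grafana dashboard account.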

InfluxDB

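InfluxDB is an open source time series database that is often used to store metrics from ZooKeeper. You can download the open source version or create a free account on InfluxDB Cloud. In either case, configure the Apache Zookeeper Telegraf plugin to start collecting metrics from your ZooKeeper clusters and storing them in your InfluxDB instance. There is also an Apache Zookeeper InfluxDB template that includes the Telegraf configuration and a dashboard to get you set up right away.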

JMX

More details can be found in the ZooKeeper JMX documentation.

Four letter words

More details can be found in the four letter words section of the ZooKeeper Administrator's Guide.
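
As an illustration of how four letter words can be used for monitoring, the sketch below sends the mntr command to a server's client port over a plain TCP socket and parses the tab-separated key/value lines it returns. This is a minimal sketch, not an official client: it assumes the default client port 2181 and that mntr is allowed by the server's 4lw.commands.whitelist setting. The function names and the sample output are hypothetical.

```python
import socket


def send_four_letter_word(host, port=2181, command="mntr", timeout=5.0):
    """Send a four letter word and return the server's full reply as text.

    Assumption: the command is permitted by 4lw.commands.whitelist.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Parse "key<TAB>value" lines into a dict of strings."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            result[key.strip()] = value.strip()
    return result


# Hypothetical, shortened mntr output.
SAMPLE_MNTR = (
    "zk_version\t3.6.0\n"
    "zk_avg_latency\t0\n"
    "zk_znode_count\t5\n"
    "zk_server_state\tstandalone\n"
)

stats = parse_mntr(SAMPLE_MNTR)
print(stats["zk_server_state"], stats["zk_znode_count"])  # prints: standalone 5
```

Against a live server, parse_mntr(send_four_letter_word("192.168.10.32")) would return the same kind of dictionary. Values are kept as strings because some of them, such as the version, are not numeric.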