Elasticsearch Monitoring | Jack Wang Blog

Elasticsearch monitoring APIs

Monitoring APIs

Cluster health

[root@ELK-test ~]# curl -XGET -u elastic:password http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "Julend_ES-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5929,
  "active_shards" : 11859,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
[root@ELK-test ~]#

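Meaning of status:

green: all shards are running normally; the cluster is healthy.
yellow: all primary shards are running normally, but some replica shards are missing. Elasticsearch is still serving requests, but the data is at some risk.
red: at least one primary shard is missing, so that part of the data is completely unavailable. Because Elasticsearch routes each write to a shard with a simple hash-modulo calculation, writes routed to the missing shard will keep failing.

Other fields:

number_of_nodes: total number of nodes in the cluster
number_of_data_nodes: total number of data nodes in the cluster
active_primary_shards: total number of active primary shards across all indices in the cluster
active_shards: total number of active shards (primaries and replicas) across all indices in the cluster
relocating_shards: number of shards currently being relocated
initializing_shards: number of shards currently initializing
unassigned_shards: number of shards not yet assigned to any node
delayed_unassigned_shards: number of shards whose assignment to a node has been delayed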
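The status values and shard counters described above can be turned into a simple check. The following is only a minimal sketch: the function name summarize_health is hypothetical, and it assumes the _cluster/health response has already been fetched and parsed into a Python dict.

```python
def summarize_health(health):
    """Return (status, problems) for a parsed _cluster/health response."""
    problems = []
    status = health.get("status")
    if status == "red":
        problems.append("at least one primary shard is missing")
    elif status == "yellow":
        problems.append("all primaries are active but some replicas are missing")
    # Any non-zero value here means shards are moving or waiting for a node.
    for key in ("relocating_shards", "initializing_shards",
                "unassigned_shards", "delayed_unassigned_shards"):
        count = health.get(key, 0)
        if count > 0:
            problems.append(f"{key} = {count}")
    return status, problems


# Values copied from the sample response shown earlier.
sample = {
    "status": "green",
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
}
print(summarize_health(sample))
```

An empty problem list together with a green status means nothing needs attention; anything else can be forwarded to whatever alerting channel is in use.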

The level request parameter

Adding level=indices breaks the health report down per index:

curl -XGET -u elastic:admin@julend http://localhost:9200/_cluster/health?level=indices

Node stats

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42846  100 42846    0     0   5123      0  0:00:08  0:00:08 --:--:-- 12874
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "Julend_ES-cluster",
  "nodes" : {
    "mXtk8dQeQh-amui-d5pVoA" : {
      "timestamp" : 1509868787579,
      "name" : "Julend_es1",
      "transport_address" : "192.168.1.41:9300",
      "host" : "192.168.1.41",
      "ip" : "192.168.1.41:9300",
      "roles" : [
        "master",
        "data",
        "ingest"
      ],
...

Index information

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
        "store" : {
          "size_in_bytes" : 38046324034,
          "throttle_time_in_millis" : 0
        },
        "indexing" : {
          "index_total" : 21895145,
          "index_time_in_millis" : 4513185,
          "index_current" : 0,
          "index_failed" : 0,
          "delete_total" : 0,
          "delete_time_in_millis" : 0,
          "delete_current" : 0,
          "noop_update_total" : 0,
          "is_throttled" : false,
          "throttle_time_in_millis" : 0
        },
...

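docs.count is the total number of documents stored on the node.
store.size_in_bytes is the actual disk space used by the data stored on the node.
store.throttle_time_in_millis is the total time the Elasticsearch process was throttled on disk I/O while merging segments. If throttling messages show up frequently in the Elasticsearch log, this value will be large as well.
indexing.index_total is a monotonically increasing counter of the index (write) operations completed by the node.
indexing.delete_total is the number of documents deleted.
indexing.is_throttled is a flag added in Elasticsearch 2.0, the release in which Elasticsearch began managing throttling automatically.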
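Because indexing.index_total is cumulative, a write rate has to be derived from two samples taken some time apart. The sketch below assumes each sample is a (timestamp in milliseconds, index_total) pair collected by the caller; the function name indexing_rate and the second sample are hypothetical.

```python
def indexing_rate(prev, curr):
    """Index operations per second between two (timestamp_ms, index_total) samples."""
    (t0, n0), (t1, n1) = prev, curr
    elapsed_s = (t1 - t0) / 1000.0
    if elapsed_s <= 0 or n1 < n0:
        # Samples out of order, or the counter was reset (e.g. node restart).
        return None
    return (n1 - n0) / elapsed_s


# First sample: timestamp and index_total from the output above.
# Second sample: hypothetical values taken 60 seconds later.
first = (1509868787579, 21895145)
second = (1509868847579, 21898145)
print(indexing_rate(first, second))
```

Returning None on a counter reset keeps a node restart from showing up as a large negative rate.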

Get performance

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
        "get" : {
          "total" : 167,
          "time_in_millis" : 18,
          "exists_total" : 167,
          "exists_time_in_millis" : 18,
          "missing_total" : 0,
          "missing_time_in_millis" : 0,
          "current" : 0
        },
...

get shows statistics for reads that fetch a document directly by _id.

Search performance

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
        "search" : {
          "open_contexts" : 0,
          "query_total" : 10560,
          "query_time_in_millis" : 12189,
          "query_current" : 0,
          "fetch_total" : 7078,
          "fetch_time_in_millis" : 3383,
          "fetch_current" : 0,
          "scroll_total" : 1150,
          "scroll_time_in_millis" : 6697,
          "scroll_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0
        },
...

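search.open_contexts is the number of search contexts currently open, i.e. searches in progress.
search.query_total is the number of queries completed since the node started.
search.query_time_in_millis is the cumulative time spent on those queries. The larger search.query_time_in_millis divided by search.query_total is, the worse the performance; use the Elasticsearch slowlog to find the specific queries and optimize them.
search.fetch_total and the related metrics have analogous meanings. Because Elasticsearch searches are query-then-fetch by default, the fetch phase is usually small and fast. If search.fetch_time_in_millis > search.query_time_in_millis, someone is probably paginating with a large size parameter; use the slowlog to catch the exact request and, where appropriate, convert it to a scan-style (scroll) search.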
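The two ratios described above are easy to compute from the search section of the node stats. This is a rough sketch only: the function name search_latency is hypothetical, and the input is assumed to be the already-parsed "search" object of one node.

```python
def search_latency(search):
    """Return (avg query ms, avg fetch ms, fetch_heavy) for a node's search stats."""
    q_total = search["query_total"]
    f_total = search["fetch_total"]
    avg_query = search["query_time_in_millis"] / q_total if q_total else 0.0
    avg_fetch = search["fetch_time_in_millis"] / f_total if f_total else 0.0
    # True when the fetch phase costs more in total than the query phase,
    # the situation the text attributes to large "size" pagination.
    fetch_heavy = search["fetch_time_in_millis"] > search["query_time_in_millis"]
    return avg_query, avg_fetch, fetch_heavy


# Values copied from the sample output above.
sample_search = {
    "query_total": 10560,
    "query_time_in_millis": 12189,
    "fetch_total": 7078,
    "fetch_time_in_millis": 3383,
}
avg_query, avg_fetch, fetch_heavy = search_latency(sample_search)
print(round(avg_query, 2), round(avg_fetch, 2), fetch_heavy)
```

Since the counters are cumulative since node start, the averages are lifetime averages; computing them over the difference between two samples gives a more current picture.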

Segment merge performance

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
        "merges" : {
          "current" : 0,
          "current_docs" : 0,
          "current_size_in_bytes" : 0,
          "total" : 15675,
          "total_time_in_millis" : 8422758,
          "total_docs" : 142493991,
          "total_size_in_bytes" : 75031786583,
          "total_stopped_time_in_millis" : 0,
          "total_throttled_time_in_millis" : 222454,
          "total_auto_throttle_in_bytes" : 839578885310
          },
...

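The merges statistics fall into two groups: fields starting with current describe merges happening right now, and fields starting with total are historical totals. In an ELK deployment the workload is dominated by writes, so the merge figures tend to stand out.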

Filter cache

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
"filter_cache": {
    "memory_size_in_bytes": 48,
    "evictions": 0
},
...

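filter_cache.memory_size_in_bytes is the memory used by the filter cache. (Elasticsearch 2.0 and later report this cache as query_cache.)
filter_cache.evictions is the number of cache entries evicted because the cache was full. A large value means the filter cache is too small, or that the filters themselves are poorly suited to caching.
Note: the filter cache is built per segment, and a search over the current day's logs touches many segments of widely varying size. One eviction on a 5 GB segment and one on a 2 MB segment affect search performance very differently, but the counter in the node stats cannot reflect that difference. So keep this value as low as you can; if searches do not feel slow, though, these numbers are not a concern.

ID cache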

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
"id_cache": {
    "memory_size_in_bytes": 0
},
...

id_cache is the memory used by parent/child mappings. ELK deployments rarely use this feature, so the value here is usually 0.

fielddata

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
        "fielddata" : {
          "memory_size_in_bytes" : 32480,
          "evictions" : 0
        },
...

This shows the memory used by fielddata, which is used for aggregations, sorting, and similar operations.

segments

# curl -XGET -u elastic:password http://localhost:9200/_nodes/stats?pretty=true
...
        "segments" : {
          "count" : 18741,
          "memory_in_bytes" : 193556533,
          "terms_memory_in_bytes" : 152266264,
          "stored_fields_memory_in_bytes" : 17067832,
          "term_vectors_memory_in_bytes" : 0,
          "norms_memory_in_bytes" : 13141184,
          "points_memory_in_bytes" : 2426489,
          "doc_values_memory_in_bytes" : 8654764,
          "index_writer_memory_in_bytes" : 0,
          "version_map_memory_in_bytes" : 0,
          "fixed_bit_set_memory_in_bytes" : 1552,
          "max_unsafe_auto_id_timestamp" : 1509866960317,
          "file_sizes" : { }
        },
...

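segments.count is the total number of segments across all indices on the node. As a rule of thumb an index has 50-150 segments; many more than that hurts performance noticeably (merging may not be keeping up with the rate at which new segments appear).
segments.memory_in_bytes is the memory used by the segments' own low-level data structures, such as the inverted index, the term dictionary, and the bloom filter (disabled by default since Elasticsearch 1.4), so an excessive number of segments makes this value grow quickly.

Task management

Pending tasks

The master node only maintains cluster state, so this task list is normally empty.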

# curl -XGET -u elastic:password http://localhost:9200/_cluster/pending_tasks?pretty
{
  "tasks" : [ ]
}

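When the cluster is in trouble, for example with frequent index updates, data recovery, or repeated failures during shard allocation or initialization, tasks will show up here.

# The cluster has failed and many tasks are queued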
# curl -XGET -u elastic:password http://localhost:9200/_cluster/pending_tasks?pretty
{
  "tasks" : [
    {
      "insert_order" : 1959,
      "priority" : "URGENT",
      "source" : "shard-started shard id [[192.168.1.24-thirdparty_errorlog-2017.11.02][2]], allocation id [rUOqVHjmTZiqLCOvSaYDnA], primary term [0], message [after existing recovery]",
      "executing" : true,
      "time_in_queue_millis" : 833,
      "time_in_queue" : "833ms"
    },
    {
      "insert_order" : 1960,
      "priority" : "URGENT",
      "source" : "shard-started shard id [[192.168.1.21-thirdparty_log-2017.11.02][4]], allocation id [kmyECWLXRSKM4YDlsXmRcA], primary term [0], message [after existing recovery]",
      "executing" : false,
      "time_in_queue_millis" : 809,
      "time_in_queue" : "809ms"
    },
...

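When a cluster stores long-term data, the index mapping metadata can grow so large that the master node no longer has enough memory to process it quickly.
Depending on the situation, the options are:

Too many indices: give the master more memory
Too many fields per index: switch to nested objects to reduce the number of fields
So many indices that memory still runs out: move part of the data to a separate cluster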
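Before picking one of these remedies it can help to quantify the backlog on the master. The sketch below summarizes a parsed _cluster/pending_tasks response by priority and longest queue time; the function name summarize_pending is hypothetical, and the sample holds only the two tasks visible in the truncated output above.

```python
from collections import Counter


def summarize_pending(response):
    """Return (task count per priority, longest time in queue in ms)."""
    tasks = response.get("tasks", [])
    by_priority = Counter(task["priority"] for task in tasks)
    longest_wait = max((task["time_in_queue_millis"] for task in tasks), default=0)
    return dict(by_priority), longest_wait


# The two tasks shown in the pending_tasks output above (other fields omitted).
sample_pending = {
    "tasks": [
        {"insert_order": 1959, "priority": "URGENT", "time_in_queue_millis": 833},
        {"insert_order": 1960, "priority": "URGENT", "time_in_queue_millis": 809},
    ]
}
print(summarize_pending(sample_pending))
```

A queue that stays non-empty across several polls, or a steadily growing longest wait, is the signal that the master is falling behind.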

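New-style task management

The newer task mechanism has no separate submission API: each individual search, snapshot, reindex, or similar operation automatically becomes a task. The task list is available from the /_tasks or /_cat/tasks endpoint. As with other operations, use cat for manual work and the JSON API when writing programs.

Cancelling a task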

# curl -XGET -u elastic:admin@julend http://localhost:9200/_cat/tasks?v   # list current tasks
action                         task_id                       parent_task_id                type      start_time    timestamp running_time ip           node
indices:monitor/stats          Go1e5drSSum7gbrKpIef8g:175591 -                             transport 1509884620079 20:23:40  487ms        192.168.1.43 Julend_es3
indices:monitor/stats[n]       VQf-ph7NRTeiEVHqAT4pRw:140954 Go1e5drSSum7gbrKpIef8g:175591 netty     1509884619871 20:23:39  410.2ms      192.168.1.42 Julend_es2
cluster:monitor/tasks/lists    mXtk8dQeQh-amui-d5pVoA:145705 -                             transport 1509884621290 20:23:41  17.3ms       192.168.1.41 Julend_es1
cluster:monitor/tasks/lists[n] VQf-ph7NRTeiEVHqAT4pRw:140959 mXtk8dQeQh-amui-d5pVoA:145705 netty     1509884620249 20:23:40  32ms         192.168.1.42 Julend_es2
cluster:monitor/tasks/lists[n] Go1e5drSSum7gbrKpIef8g:175595 mXtk8dQeQh-amui-d5pVoA:145705 netty     1509884620544 20:23:40  21.4ms       192.168.1.43 Julend_es3
cluster:monitor/tasks/lists[n] mXtk8dQeQh-amui-d5pVoA:145706 mXtk8dQeQh-amui-d5pVoA:145705 direct    1509884621306 20:23:41  1.3ms        192.168.1.41 Julend_es1

# curl -XPOST 'http://localhost:9200/_tasks/Go1e5drSSum7gbrKpIef8g:175591/_cancel'
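The identifier in the cancel request is the full value of the task_id column, which has the form node_id:task_number. A small sketch of building the request path from that value follows; the helper name cancel_path is hypothetical and the function only formats a string, it does not contact the cluster.

```python
def cancel_path(task_id):
    """Build the _cancel path for a task id of the form node_id:task_number."""
    node_id, sep, number = task_id.rpartition(":")
    if not sep or not node_id or not number.isdigit():
        raise ValueError(f"expected node_id:task_number, got {task_id!r}")
    return f"/_tasks/{task_id}/_cancel"


# Task id taken from the first row of the _cat/tasks output above.
print(cancel_path("Go1e5drSSum7gbrKpIef8g:175591"))
```

Validating the shape first avoids sending a bare task number, which does not identify a task on its own.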

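Search tasks differ from reindex tasks. Elasticsearch has supported cancelling running search tasks since version 5.1.1, but the cancellation does not take effect immediately.
By default, search tasks are managed at segment granularity: the search stops only after it finishes the segment it is currently processing. If historical indices have been optimized with the forcemerge API so that each shard holds a single segment, the cancel is effectively a no-op.
For this case Elasticsearch offers a finer-grained but more resource-intensive option.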

# curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "persistent": {
        "search.low_level_cancellation": true
    }
}'

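With this setting, every search task periodically checks whether it has been cancelled. This may also make slow searches take even longer.

Using the cat API from the command line

For day-to-day operations, the following endpoints expose all kinds of monitoring data: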

/_cat/nodes
/_cat/shards
/_cat/shards/{index}
/_cat/aliases
/_cat/aliases/{alias}
/_cat/tasks
/_cat/master
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/pending_tasks
/_cat/count
/_cat/count/{index}
/_cat/snapshots/{repository}
/_cat/recovery
/_cat/recovery/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/nodeattrs
/_cat/allocation
/_cat/repositories
/_cat/health
/_cat/indices
/_cat/indices/{index}

# curl -XGET -u elastic:password http://localhost:9200/_cat/nodes?help
id           | id,nodeId      | unique node id           
pid          | p              | process id               
ip           | i              | ip address               
port         | po             | bound transport port     
http_address | http           | bound http address       
version      | v              | es version
...
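When cat output does need to be consumed by a script, the header row printed with the v parameter can be used to label each column. The following is a naive sketch: the function name parse_cat is hypothetical, and it assumes no cell is empty or contains spaces, which holds for the sample rows but not for every cat endpoint.

```python
def parse_cat(text):
    """Turn whitespace-separated _cat output (with a header row) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Two rows taken from the _cat/tasks?v output earlier in this article.
sample_cat = """\
action task_id parent_task_id type start_time timestamp running_time ip node
indices:monitor/stats Go1e5drSSum7gbrKpIef8g:175591 - transport 1509884620079 20:23:40 487ms 192.168.1.43 Julend_es3
cluster:monitor/tasks/lists mXtk8dQeQh-amui-d5pVoA:145705 - transport 1509884621290 20:23:41 17.3ms 192.168.1.41 Julend_es1
"""
rows = parse_cat(sample_cat)
print(rows[0]["node"], rows[1]["running_time"])
```

For anything beyond quick scripts, the JSON endpoints remain the safer choice, as noted above.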

bigdesk

URL: https://github.com/hlstudio/bigdesk
bigdesk is an open-source real-time monitoring tool for Elasticsearch performance.

zabbix trapper

URL: https://github.com/Wprosdocimo/Elasticsearch-zabbix
https://github.com/untergeek/zabbix-grab-bag/blob/master/Elasticsearch/es_stats_zabbix.README.md

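Percolator API

Alerting approaches in the ELK stack:

For match-based alerts, use the Elasticsearch Percolator API to trigger the alert.
For time-series statistics, run a scheduled job that sends an Elasticsearch aggs request and analyzes the response body to decide whether to alert.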
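The second approach comes down to inspecting the buckets of an aggregation response. The sketch below assumes a bucket aggregation such as a date histogram, whose buckets carry key and doc_count fields; the aggregation name per_minute, the threshold, and the sample data are all hypothetical.

```python
def buckets_over_threshold(response, agg_name, threshold):
    """Return (key, doc_count) for every bucket whose doc_count exceeds threshold."""
    buckets = (response.get("aggregations", {})
                       .get(agg_name, {})
                       .get("buckets", []))
    return [(b["key"], b["doc_count"]) for b in buckets if b["doc_count"] > threshold]


# Hypothetical response body: error counts per minute.
sample_aggs = {
    "aggregations": {
        "per_minute": {
            "buckets": [
                {"key": 1509884580000, "doc_count": 12},
                {"key": 1509884640000, "doc_count": 240},
            ]
        }
    }
}
print(buckets_over_threshold(sample_aggs, "per_minute", 100))
```

The scheduled job would run the aggs request, pass the parsed body to a function like this, and raise an alert when the returned list is not empty.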

Percolator was substantially reworked in Elasticsearch 5.0: it is now a mapping field type that must be defined in advance when the index is created.

# curl -XPUT http://localhost:9200/syslog -d '{
    "mappings": {
        "syslog": {
            "properties": {
                "message": {
                    "type": "text"
                },
                "severity": {
                    "type": "long"
                },
                "program": {
                    "type": "keyword"
                }
            }
        },
        "queries": {
            "properties": {
                "query": {
                    "type": "percolator"
                }
            }
        }
    }
}'

Then register two percolator query rules under syslog/queries:

# curl -XPUT http://localhost:9200/syslog/queries/memory -d '{
    "query": {
        "query_string": {
            "default_field": "message",
            "default_operator": "OR",
            "query": "mem DMA segfault page allocation AND severity:>2 AND program:kernel"
        }
    }
}'

# curl -XPUT http://localhost:9200/syslog/queries/disk -d '{
    "query": {
        "query_string": {
            "default_field": "message",
            "default_operator": "OR",
            "query": "scsi sata hdd sda AND severity:>2 AND program:kernel"
        }
    }
}'

Then, instead of sending the document through a standard index request, send it through the search API:

# curl -XPOST http://localhost:9200/syslog/_search -d '{
    "query": {
        "percolate": {
            "field": "query",
            "document_type": "syslog",
            "document": {
                "program": "kernel",
                "severity": 3,
                "message": "swapper/0: page allocation failure: order:4, mode:0x4020"
            }
        }
    }
}'

The result looks like this:

{
    ...
    "hits": [
        {
            "_index": "syslog",
            "_type": "queries",
            "_id": "memory",
            ...
        }
    ]
}

This syslog entry matched the memory rule, so it can now be forwarded to the alerting system.
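Extracting the matched rule names from the response is a short step. The sketch below assumes the standard search response layout, in which the matching percolator queries are listed under hits.hits; the function name matched_rules is hypothetical and the sample keeps only the fields shown in the abbreviated result above.

```python
def matched_rules(response):
    """Return the _id of every registered percolator query that matched."""
    return [hit["_id"] for hit in response.get("hits", {}).get("hits", [])]


# Abbreviated response, limited to the fields shown in the result above.
sample_percolate = {
    "hits": {
        "hits": [
            {"_index": "syslog", "_type": "queries", "_id": "memory"}
        ]
    }
}
print(matched_rules(sample_percolate))
```

Each returned id corresponds to one registered rule, so the list can be used directly to choose which alert to send.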
If the syslog index already contains data, existing documents can be percolated again. For example, to run a document previously written to http://localhost:9200/syslog/cisco/1234567 through the percolator once more:

# curl -XPOST http://localhost:9200/syslog/_search -d '{
    "query": {
        "percolate": {
            "field": "query",
            "document_type": "syslog",
            "index": "syslog",
            "type": "cisco",
            "id": "1234567"
        }
    }
}'

For a percolator example using a more complex query DSL, see: https://www.elastic.co/blog/using-percolator-geo-tagging

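Alerting

Watcher alerting

Watcher is now part of X-Pack; installing X-Pack is all that is needed.
See: https://www.elastic.co/cn/products/x-pack/alerting

ElastAlert

A product of the same kind as Watcher.
See: http://elastalert.readthedocs.io/en/latest/

Etsy's Kale anomaly detection

Kale is a monitoring and analysis system made up of two parts: skyline and oculus.
See: https://codeascraft.com/2013/06/11/introducing-kale/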

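This work is licensed under the Creative Commons Attribution 2.5 China Mainland license. Reposting is welcome, but please credit Jack Wang Blog and keep the reposted article complete. The author reserves all copyright-related rights.

This article is from "Jack Wang Blog": http://www.yfshare.vip/2017/11/14/Elasticsearch%E7%9B%91%E6%8E%A7%E7%9B%B8%E5%85%B3/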