Jack Wang Blog

Fixing the Prometheus "too many open files" Error

2022-11-27T06:06:35.000Z

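Prometheus recently started reporting discovery=file msg="Error adding file watcher" err="too many open files" errors. In addition, changes to the Prometheus file-discovery JSON files were no longer picked up promptly; they took a long time to take effect.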

The error output is as follows:

$ systemctl status prometheus.service  -l
● prometheus.service - Prometheus Voice
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-03 13:45:32 +08; 1 day 3h ago
     Docs: https://github.com/prometheus/prometheus
 Main PID: 281503 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─281503 /data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus --config.file=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.retention=60d --web.enable-lifecycle --web.external-url=http://143.92.123.121:9090 --query.max-samples=500000000 --query.timeout=20m --query.max-concurrency=200
 
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"

Read literally, the error means that too many files are open, so the first step was to check the open files limits.

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2056770
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ cat /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Voice
Documentation=https://github.com/prometheus/prometheus
After=network.target
 
[Service]
WorkingDirectory=/data1/prometheus/prometheus-2.16.0.linux-amd64
ExecStart=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus --config.file=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.retention=60d --web.enable-lifecycle --web.external-url=http://143.92.123.121:9090 --query.max-samples=500000000 --query.timeout=20m --query.max-concurrency=200
ExecReload=/usr/bin/curl -X POST http://localhost:9090/-/reload
Restart=on-failure
RestartSec=5
Type=simple
LimitNOFILE=65536
 
[Install]
WantedBy=multi-user.target
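
Note that ulimit -a reports the limits of the current shell, while the service gets its own limit from LimitNOFILE in the unit file. The limit a running process actually has can be read from /proc/<pid>/limits. A minimal sketch of that check follows; the function name soft_nofile_limit is made up for illustration and is not part of any tool.

```shell
#!/usr/bin/env bash
# soft_nofile_limit is a hypothetical helper name used only in this sketch.
# It prints the soft "Max open files" value from a file in /proc/<pid>/limits format.
soft_nofile_limit() {
  local limits_file="$1"
  # Columns are: "Max open files" <soft> <hard> <units>, so the soft limit is field 4.
  awk '/^Max open files/ { print $4 }' "$limits_file"
}
```

For the service above, it could be called as soft_nofile_limit /proc/281503/limits, where 281503 is the Main PID shown by systemctl status.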

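The open files limits turned out to be set correctly, so they do not explain the problem.
Further research showed that the real cause is the default value of fs.inotify.max_user_instances, which is too small. This setting is the upper limit on the number of inotify instances that each real user ID can create, and it defaults to 128. When the limit is reached, inotify_init fails with EMFILE, which is reported as "too many open files" even though the file descriptor limit has not been hit.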

$ cat /proc/sys/fs/inotify/max_user_instances
128
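
To see how close a host is to that limit, the inotify instances held by each process can be counted: every instance shows up as a file descriptor whose link target is anon_inode:inotify. The sketch below is only an illustration; the function name count_inotify_instances is an assumption, and the limit itself applies to the sum over all processes of one user, not to a single process.

```shell
#!/usr/bin/env bash
# count_inotify_instances is a hypothetical helper name used only in this sketch.
# It prints "<pid> <count>" for every process that holds at least one inotify instance.
# The optional argument is the /proc directory to scan (defaults to /proc).
count_inotify_instances() {
  local proc_root="${1:-/proc}"
  local fd_dir pid fd count
  for fd_dir in "$proc_root"/[0-9]*/fd; do
    [ -d "$fd_dir" ] || continue
    # Extract the PID from ".../<pid>/fd".
    pid="${fd_dir%/fd}"
    pid="${pid##*/}"
    count=0
    for fd in "$fd_dir"/*; do
      # An inotify instance appears as a descriptor linking to anon_inode:inotify.
      if [ "$(readlink "$fd" 2>/dev/null)" = "anon_inode:inotify" ]; then
        count=$((count + 1))
      fi
    done
    if [ "$count" -gt 0 ]; then
      echo "$pid $count"
    fi
  done
}
```

Run as root so that the descriptors of all processes are readable, for example count_inotify_instances | sort -k2 -nr | head.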

$ tail -1 /etc/sysctl.conf
fs.inotify.max_user_instances=65000
$ sysctl -p
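
Appending the line by hand works, but repeating it leaves duplicate entries in the file. The following sketch sets a key in a sysctl.conf-style file so that it ends up there exactly once; the function name set_sysctl_conf is an assumption made for illustration, and the function only edits the file (sysctl -p still has to be run afterwards).

```shell
#!/usr/bin/env bash
# set_sysctl_conf is a hypothetical helper name used only in this sketch.
# Usage: set_sysctl_conf <key> <value> <file>
set_sysctl_conf() {
  local key="$1" value="$2" file="$3" tmp
  tmp="$(mktemp)"
  # Copy every line except existing assignments of the key ("key=..." or "key = ...").
  awk -F= -v k="$key" '{
    name = $1
    gsub(/[ \t]/, "", name)
    if (name != k) print
  }' "$file" > "$tmp"
  # Append the new assignment as the last line.
  echo "$key=$value" >> "$tmp"
  # Overwrite the content in place so the file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

With the values from this article, the call would be set_sysctl_conf fs.inotify.max_user_instances 65000 /etc/sysctl.conf.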

Problem solved:

$ systemctl status prometheus.service -l
● prometheus.service - Prometheus Voice
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-06-03 13:45:32 +08; 1 day 3h ago
     Docs: https://github.com/prometheus/prometheus
 Main PID: 281503 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─281503 /data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus --config.file=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.retention=60d --web.enable-lifecycle --web.external-url=http://143.92.123.121:9090 --query.max-samples=500000000 --query.timeout=20m --query.max-concurrency=200
 
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 16:22:06 sg-prometheus-base01 prometheus[281503]: level=error ts=2020-06-04T08:22:06.523Z caller=file.go:225 component="discovery manager scrape" discovery=file msg="Error adding file watcher" err="too many open files"
Jun 04 17:00:04 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:04.829Z caller=main.go:747 msg="Loading configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:00:04 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:04.864Z caller=main.go:775 msg="Completed loading of configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:00:29 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:29.603Z caller=main.go:747 msg="Loading configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:00:29 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:00:29.634Z caller=main.go:775 msg="Completed loading of configuration file" filename=/data1/prometheus/prometheus-2.16.0.linux-amd64/prometheus.yml
Jun 04 17:03:07 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:03:07.125Z caller=compact.go:496 component=tsdb msg="write block" mint=1591250400000 maxt=1591257600000 ulid=01E9Z8XZN7M530V866W4WFC3Y6 duration=3m7.085617933s
Jun 04 17:03:18 sg-prometheus-base01 prometheus[281503]: level=info ts=2020-06-04T09:03:18.535Z caller=head.go:661 component=tsdb msg="head GC completed" duration=7.757127512s

References:
https://blog.csdn.net/weiguang1017/article/details/54381439
https://groups.google.com/forum/#!topic/prometheus-users/OQzEYeggnpw

This article is from "Jack Wang Blog": http://www.yfshare.vip/2022/11/27/解决Prometheus too many open files问题/

Deploying a MinIO Server That Does Not Split Files

2022-11-07T07:48:33.000Z

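MinIO is a high-performance object store built natively for Kubernetes deployments. MinIO also provides an Amazon Web Services S3 compatible API and supports all core S3 features.

MinIO without file splitting

Deploy MinIO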

$ mkdir -p /data1/release /data/minio /data2/minio-data
$ cd /data1/release
$ wget -c 'https://dl.min.io/server/minio/release/linux-amd64/archive/minio.RELEASE.2021-02-14T04-01-33Z'
$ wget -c "https://dl.min.io/client/mc/release/linux-amd64/mc"
$ mv minio.RELEASE.2021-02-14T04-01-33Z /data/minio/minio
$ mv mc /data/minio/
 
$ cd /data/minio
$ chmod +x minio mc

Set ownership and permissions

$ groupadd -r minio-user
$ useradd -M -r -g minio-user minio-user
$ chown -R minio-user:minio-user /data/minio /data2/minio-data
$ chmod u+rxw /data2/minio-data

MinIO configuration file

$ cd /data/minio
$ cat minio.env
# Root user for the server.
MINIO_ROOT_USER="minioroot"
# Root secret for the server.
MINIO_ROOT_PASSWORD="password"
   
# Standard storage class
MINIO_STORAGE_CLASS_STANDARD="4"
# Reduced redundancy storage class (RRS)
MINIO_STORAGE_CLASS_RRS="2"
   
# Following environment variables are used in systemd file.
   
# Volume to be used for MinIO server.
# Use this form for a cluster deployment:
#MINIO_VOLUMES="http://minio-test1/data4/minio"
# Use this form for a single node:
MINIO_VOLUMES="/data4/minio-data"
   
# Use if you want to run MinIO on a custom port.
MINIO_OPTS="--address 10.73.9.3:9199"

systemd unit file for minio-server

$ cat /etc/systemd/system/minio-server.service
[Unit]
Description=MinIO
Documentation=https://docs.min.io
Wants=network-online.target
After=network-online.target
AssertFileIsExecutable=/data/minio/minio
   
[Service]
WorkingDirectory=/data/minio/
   
User=minio-user
Group=minio-user
   
EnvironmentFile=/data/minio/minio.env
ExecStartPre=/bin/bash -c "if [ -z \"${MINIO_VOLUMES}\" ]; then echo \"Variable MINIO_VOLUMES not set in /data/minio/minio.env\"; exit 1; fi"
ExecStart=/data/minio/minio server $MINIO_OPTS $MINIO_VOLUMES
   
# Let systemd restart this service always
Restart=always
   
# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=1048576
   
# Specifies the maximum number of threads this process can create
TasksMax=infinity
   
# Disable timeout logic and wait until process is stopped
TimeoutStopSec=infinity
SendSIGKILL=no
   
[Install]
WantedBy=multi-user.target

Start minio-server

$ systemctl daemon-reload
$ systemctl start minio-server
$ systemctl status minio-server -l

Configure HAProxy

frontend http-sgim-minio-data
    bind *:9800
    option forwardfor except 127.0.0.1
    option http-server-close
    option httplog
    option dontlognull
    default_backend sgim-minio-data
    maxconn 5000000
 
backend sgim-minio-data
    balance leastconn
    server minio-test1 10.73.9.3:9199 maxconn 1000000

Set up the MinIO Client (mc)

$ cd /data/minio
$ ./mc ls  # initialize mc
$ ./mc alias set minioTest http://10.73.9.3:9800 minioroot password --api s3v4
Added `minioTest` successfully.
 
$ ./mc mb minioTest/test   # create a MinIO bucket
$ ./mc cp file minioTest/test  # upload a file to MinIO
$ ./mc ls minioTest/test  # list the files in the test bucket

Verify whether files are split

Log in to the host running the MinIO server

$ cd /data4/minio-data
$ tree test/
test/
└── telegraf-1.22.4-1.x86_64.rpm
 
0 directories, 1 file

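Access the MinIO server web UI:
http://ip:9800/minio/login
The username and password are the values of MINIO_ROOT_USER and MINIO_ROOT_PASSWORD set in the minio.env configuration file.

MinIO access management

MinIO Console → Access → Create Policy → Write Policy
The following policy only grants control over the test bucket.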

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::test/*"
            ]
        }
    ]
}
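
When several buckets each need their own policy of this kind, the JSON can be generated instead of edited by hand. A minimal sketch, assuming a made-up helper name bucket_policy; it prints the same statement as above with the bucket name substituted and does not talk to MinIO itself.

```shell
#!/usr/bin/env bash
# bucket_policy is a hypothetical helper name used only in this sketch.
# It prints the policy shown above, with the bucket name taken from the first argument.
bucket_policy() {
  local bucket="$1"
  cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListAllMyBuckets",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::${bucket}/*"
            ]
        }
    ]
}
EOF
}
```

For example, bucket_policy test > test-policy.json writes the policy for the test bucket to a file (the file name is arbitrary) that can then be pasted into the console.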

Older MinIO binaries can be downloaded from: https://dl.min.io/server/minio/release/linux-amd64/archive/
Official site: http://www.minio.org.cn/download.shtml#/linux

This article is from "Jack Wang Blog": http://www.yfshare.vip/2022/11/07/部署文件不切片的Minio-Server/

Fixing Grafana "query processing would load too many samples into memory in query exec"

2022-10-27T04:00:30.000Z

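Symptom

After a version release, Grafana could no longer display Prometheus data correctly.

Problem description

When the Grafana time range is widened even slightly, Grafana shows the error: query processing would load too many samples into memory in query execution.

[Error screenshot]

Analysis

The release probably changed the workload and caused a sudden increase in Prometheus metrics, so the data Grafana pulls now exceeds a Prometheus limit. Searching the available material and the Prometheus GitHub repository explains why: a PromQL query can load a very large number of samples, which drives Prometheus memory and CPU usage too high. To keep complex PromQL queries from consuming excessive resources, the Prometheus authors put a limit in the code. Since Prometheus 2.5.0, this limit can be adjusted with the --query.max-samples flag to fit heavier workloads.
Prometheus 2.5.0 release notes on GitHub:
Release: https://github.com/prometheus/prometheus/releases/tag/v2.5.0
Pull request: https://github.com/prometheus/prometheus/pull/4513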

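Solution

Because the GS Prometheus instance was upgraded during its migration, the default limit has to be looked up in the source code of the matching Prometheus version.
Prometheus source: https://github.com/prometheus/prometheus/blob/v2.15.2/cmd/prometheus/main.go

The source shows that the default upper limit is 50000000 samples per query. The GS Grafana dashboards load more samples than this limit, so the limit has to be raised with a startup flag.
Flag descriptions: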

--query.timeout=2m
Maximum time a query may take before being aborted.
--query.max-concurrency=20
Maximum number of queries executed concurrently.
--query.max-samples=50000000
Maximum number of samples a single query can load into memory. Note that queries will fail if they try to load more samples than this into memory, so this also limits the number of samples a query can return.
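
Since the limit also caps what a query can return, a rough lower bound can be estimated before raising it: a range query returns one sample per series per evaluation step. The sketch below only does that arithmetic; the function names are assumptions made for illustration, and the real number of samples loaded can be higher (for example with range selectors such as rate(...[5m])).

```shell
#!/usr/bin/env bash
# returned_samples and exceeds_limit are hypothetical helper names used only in this sketch.

# Lower-bound estimate of the samples a range query returns:
# one sample per series for every step in the time range (both ends included).
returned_samples() {
  local series="$1" range_seconds="$2" step_seconds="$3"
  echo $(( series * (range_seconds / step_seconds + 1) ))
}

# Succeeds (exit code 0) when the estimate is above the limit.
# The limit defaults to 50000000, the default of --query.max-samples.
exceeds_limit() {
  local samples="$1" limit="${2:-50000000}"
  [ "$samples" -gt "$limit" ]
}
```

For example, 10000 series over 7 days (604800 seconds) at a 60-second step give 10000 × 10081 = 100810000 samples, which is above the default of 50000000 but below a limit of 500000000.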

After discussion, the GS Prometheus instance now runs with the following startup parameters:

/data1/prometheus_gs_all/prometheus-2.15.2.linux-amd64/prometheus --config.file=/data1/prometheus_gs_all/prometheus-2.15.2.linux-amd64/prometheus.yml --web.listen-address=0.0.0.0:9090 --storage.tsdb.retention=90d --web.enable-lifecycle --web.external-url=http://143.92.123.123:9090 --query.max-samples=500000000 --query.timeout=20m --query.max-concurrency=200

References:
http://blog.kankanan.com/article/query-processing-would-load-too-many-samples-into-memory-in-query-execution.html
https://github.com/prometheus/prometheus/blob/v2.15.2/cmd/prometheus/main.go
https://github.com/prometheus/prometheus/releases/tag/v2.5.0
https://github.com/prometheus/prometheus/pull/4513

This article is from "Jack Wang Blog": http://www.yfshare.vip/2022/10/27/解决Grafana query processing would load too many samples into memory in query exec/

Redis-Sentinel

2022-08-27T15:40:54.000Z

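Sentinel is a tool for managing Redis instances. It provides monitoring, notification, and automatic failover for Redis. Sentinel continuously checks whether the Redis instances are working properly and reports their state to other programs through an API. If the Redis master stops working, Sentinel automatically starts a failover: it promotes one of the slaves to master and reconfigures the other slaves to replicate from the new master.

Introduction to Sentinel

Redis Sentinel is a distributed architecture consisting of several Sentinel nodes and Redis data nodes. Each Sentinel node monitors the data nodes and the other Sentinel nodes, and marks a node as down when it finds the node unreachable.
If the node marked down is the master, the Sentinel node also "negotiates" with the other Sentinel nodes. When a majority of the Sentinel nodes agree that the master is unreachable, they elect one Sentinel node to carry out the automatic failover and notify the applications using Redis of the change.
The whole process is fully automatic and needs no manual intervention, so it is a good solution to Redis high availability.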
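
Two thresholds are involved in that negotiation: the quorum (the last argument of sentinel monitor) is the number of Sentinels that must agree the master is down, and the failover itself must be authorized by a majority of all Sentinels. The arithmetic can be sketched as follows; the function names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# sentinel_majority and can_failover are hypothetical helper names used only in this sketch.

# Majority of all known Sentinels, needed to authorize a failover.
sentinel_majority() {
  local total="$1"
  echo $(( total / 2 + 1 ))
}

# Succeeds (exit code 0) when the reachable Sentinels can both agree that the
# master is down (at least quorum) and authorize the failover (at least majority).
can_failover() {
  local total="$1" reachable="$2" quorum="$3"
  [ "$reachable" -ge "$quorum" ] && [ "$reachable" -ge "$(sentinel_majority "$total")" ]
}
```

With 3 Sentinels and a quorum of 2, can_failover 3 2 2 succeeds, so the deployment tolerates the loss of one Sentinel; can_failover 3 1 2 fails.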

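Redis master-slave replication
Redis replication synchronizes the master's data to the slaves. A slave then serves two purposes:

- As a backup of the master, it can take over as soon as the master goes down
- It extends the master's read capacity and takes read load off the master

Redis Sentinel provides the following functions:

- Monitoring: Sentinel nodes periodically check whether the Redis data nodes and the other Sentinel nodes are reachable
- Notification: Sentinel nodes notify applications of failovers
- Master failover: a slave is promoted to master and the correct master-slave relationships are maintained afterwards
- Configuration provider: in a Redis Sentinel setup, clients connect to the set of Sentinel nodes at initialization and obtain the master's address from them

Environment:
   CentOS 7.9.2009
   Redis version: 5.0.13

The Redis Sentinel deployment consists of 3 Sentinel nodes, 1 master, and 2 slaves:

| hostname | ip | redis port | sentinel port |
| --- | --- | --- | --- |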
| node1 | 10.21.248.7 | 16379 | 26379 |
| node2 | 10.21.248.179 | 16379 | 26379 |
| node3 | 10.21.108.47 | 16379 | 26379 |

Install Redis

$ mkdir -p /data/release
$ cd /data/release
$ wget -c 'https://download.redis.io/releases/redis-5.0.13.tar.gz'
$ tar -xf redis-5.0.13.tar.gz
$ cd /data/release/redis-5.0.13/
$ make && make install

Deploy Redis Sentinel

Node1 Redis configuration

$ mkdir -p /data/redis/{conf,data/16379,logs,sentinel}
$ cd /data/redis/conf
$ cat redis-16379.conf 
daemonize yes
bind 10.21.248.7
masterauth "YWY0MGIx@201"
requirepass "YWY0MGIx@201"
replica-read-only yes
maxclients 10000
maxmemory 1gb
pidfile "/data/redis/redis_16379.pid"
port 16379
tcp-backlog 511
timeout 600
tcp-keepalive 0
loglevel verbose
logfile "/data/redis/logs/redis-16379.log"
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
dir "/data/redis/data/16379"
replica-serve-stale-data yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
replica-priority 100
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes

Node1 Sentinel configuration
Note: the sentinel myid must be unique on every node

$ cd /data/redis/conf
$ cat sentinel-26379.conf 
port 26379
bind 10.21.248.7
daemonize yes
dir "/data/redis/sentinel"
logfile "/data/redis/logs/sentinel-26379.log"
sentinel myid 525e0e47f39d28d46b707401054b0680057c800a
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 10.21.248.7 16379 2
sentinel down-after-milliseconds mymaster 5000
sentinel auth-pass mymaster YWY0MGIx@201
sentinel config-epoch mymaster 59
sentinel leader-epoch mymaster 59

Node1 startup script

$ cd /data/redis
$ cat redis_sentinel.sh
#!/bin/bash
num="$1"
function start() {
  #redis
  /usr/local/bin/redis-server /data/redis/conf/redis-16379.conf 
  
  #sentinel
  /usr/local/bin/redis-sentinel /data/redis/conf/sentinel-26379.conf
}
function stop() {
  #redis
  /usr/local/bin/redis-cli -h 10.21.248.7 -p 16379 -a YWY0MGIx@201 shutdown
  #sentinel
  /usr/local/bin/redis-cli -h 10.21.248.7 -p 26379 -a YWY0MGIx@201 shutdown
}
function status() {
  ps -ef |grep -i 'redis-' | grep -iv 'grep'
}
case $num in
  start)
     start;;
  stop)
     stop;;
  restart)
     stop
     start;;
  status)
     status;;
  *)
     echo "Usage: start|stop|restart|status";;
esac

Node2 Redis configuration

A slave node needs the additional replicaof setting pointing at the master

$ mkdir -p /data/redis/{conf,data/16379,logs,sentinel}
$ cd /data/redis/conf
$ cat redis-16379.conf
daemonize yes
bind 10.21.248.179
masterauth "YWY0MGIx@201"
requirepass "YWY0MGIx@201"
replica-read-only yes
maxclients 10000
maxmemory 1gb
pidfile "/data/redis/redis_16379.pid"
port 16379
tcp-backlog 511
timeout 600
tcp-keepalive 0
loglevel verbose
logfile "/data/redis/logs/redis-16379.log"
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
dir "/data/redis/data/16379"
replica-serve-stale-data yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
replica-priority 100
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
# Generated by CONFIG REWRITE
replicaof 10.21.248.7 16379

Node2 Sentinel configuration
Note: the sentinel myid must be unique on every node

$ cd /data/redis/conf
$ cat sentinel-26379.conf  
port 26379
bind 10.21.248.179
daemonize yes
dir "/data/redis/sentinel"
logfile "/data/redis/logs/sentinel-26379.log"
sentinel myid 525e0e47f39d28d46b707401054b0680057c800b
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 10.21.248.7 16379 2
sentinel down-after-milliseconds mymaster 5000
sentinel auth-pass mymaster YWY0MGIx@201
sentinel config-epoch mymaster 59
sentinel leader-epoch mymaster 59

Node2 startup script

$ cd /data/redis
$ cat redis_sentinel.sh 
#!/bin/bash
num="$1"
function start() {
  #redis
  /usr/local/bin/redis-server /data/redis/conf/redis-16379.conf 
  
  #sentinel
  /usr/local/bin/redis-sentinel /data/redis/conf/sentinel-26379.conf
}
function stop() {
  #redis
  /usr/local/bin/redis-cli -h 10.21.248.179 -p 16379 -a YWY0MGIx@201 shutdown
  #sentinel
  /usr/local/bin/redis-cli -h 10.21.248.179 -p 26379 -a YWY0MGIx@201 shutdown
}
function status() {
  ps -ef |grep -i 'redis-' | grep -iv 'grep'
}
case $num in
  start)
     start;;
  stop)
     stop;;
  restart)
     stop
     start;;
  status)
     status;;
  *)
     echo "Usage: start|stop|restart|status";;
esac

Node3 Redis configuration

A slave node needs the additional replicaof setting pointing at the master

$ mkdir -p /data/redis/{conf,data/16379,logs,sentinel}
$ cd /data/redis/conf
$ cat redis-16379.conf
daemonize yes
bind 10.21.108.47
masterauth "YWY0MGIx@201"
requirepass "YWY0MGIx@201"
replica-read-only yes
maxclients 10000
maxmemory 1gb
pidfile "/data/redis/redis_16379.pid"
port 16379
tcp-backlog 511
timeout 600
tcp-keepalive 0
loglevel verbose
logfile "/data/redis/logs/redis-16379.log"
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename "dump.rdb"
dir "/data/redis/data/16379"
replica-serve-stale-data yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
replica-priority 100
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
replicaof 10.21.248.7 16379

Node3 Sentinel configuration
Note: the sentinel myid must be unique on every node

$ cat sentinel-26379.conf 
port 26379
bind 10.21.108.47
daemonize yes
dir "/data/redis/sentinel"
logfile "/data/redis/logs/sentinel-26379.log"
sentinel myid 525e0e47f39d28d46b707401054b0680057c800c
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 10.21.248.7 16379 2
sentinel down-after-milliseconds mymaster 5000
sentinel auth-pass mymaster YWY0MGIx@201
sentinel config-epoch mymaster 59
sentinel leader-epoch mymaster 59

Node3 startup script

$ cd /data/redis
$ cat redis_sentinel.sh
#!/bin/bash
num="$1"
function start() {
  #redis
  /usr/local/bin/redis-server /data/redis/conf/redis-16379.conf 
  
  #sentinel
  /usr/local/bin/redis-sentinel /data/redis/conf/sentinel-26379.conf
}
function stop() {
  #redis
  /usr/local/bin/redis-cli -h 10.21.108.47 -p 16379 -a YWY0MGIx@201 shutdown
  #sentinel
  /usr/local/bin/redis-cli -h 10.21.108.47 -p 26379 -a YWY0MGIx@201 shutdown
}
function status() {
  ps -ef |grep -i 'redis-' | grep -iv 'grep'
}
case $num in
  start)
     start;;
  stop)
     stop;;
  restart)
     stop
     start;;
  status)
     status;;
  *)
     echo "Usage: start|stop|restart|status";;
esac

Start Redis Sentinel on every node

$ cd /data/redis
$ sh redis_sentinel.sh start

View the Redis Sentinel log

$ cd /data/redis/logs
$ cat sentinel-26379.log
9897:X 14 Aug 2022 21:41:04.503 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
9897:X 14 Aug 2022 21:41:04.504 # Redis version=5.0.13, bits=64, commit=00000000, modified=0, pid=9897, just started
9897:X 14 Aug 2022 21:41:04.504 # Configuration loaded
9901:X 14 Aug 2022 21:41:04.505 * Running mode=sentinel, port=26379.
9901:X 14 Aug 2022 21:41:04.506 # Sentinel ID is 525e0e47f39d28d46b707401054b0680057c800a
9901:X 14 Aug 2022 21:41:04.506 # +monitor master mymaster 10.21.248.7 16379 quorum 2
9901:X 14 Aug 2022 21:41:09.032 * +sentinel sentinel 525e0e47f39d28d46b707401054b0680057c800b 10.21.248.179 26379 @ mymaster 10.21.248.7 16379
9901:X 14 Aug 2022 21:42:34.765 * +slave slave 10.21.248.179:16379 10.21.248.179 16379 @ mymaster 10.21.248.7 16379
9901:X 14 Aug 2022 21:42:34.766 * +slave slave 10.21.108.47:16379 10.21.108.47 16379 @ mymaster 10.21.248.7 16379
9901:X 14 Aug 2022 21:42:35.889 * +sentinel sentinel 525e0e47f39d28d46b707401054b0680057c800c 10.21.108.47 26379 @ mymaster 10.21.248.7 16379

Log in to the Sentinel on any node to check the Redis Sentinel status

$ redis-cli -h 10.21.248.7 -p 26379
10.21.248.7:26379> info sentinel  # show the Redis Sentinel status
10.21.248.7:26379> sentinel masters  # show information about the monitored masters
10.21.248.7:26379> sentinel slaves mymaster # show information about the slaves of mymaster

Simulate stopping Redis and Sentinel on the master

$ cd /data/redis
$ sh redis_sentinel.sh stop

Sentinel log on node2

$ cd /data/redis/logs
$ cat sentinel-26379.log
15383:X 14 Aug 2022 21:45:57.635 # +sdown master mymaster 10.21.248.7 16379
15383:X 14 Aug 2022 21:45:57.635 # +sdown sentinel 525e0e47f39d28d46b707401054b0680057c800a 10.21.248.7 26379 @ mymaster 10.21.248.7 16379
15383:X 14 Aug 2022 21:45:57.745 # +new-epoch 58
15383:X 14 Aug 2022 21:45:57.756 # +vote-for-leader 525e0e47f39d28d46b707401054b0680057c800c 58
15383:X 14 Aug 2022 21:45:58.142 # +config-update-from sentinel 525e0e47f39d28d46b707401054b0680057c800c 10.21.108.47 26379 @ mymaster 10.21.248.7 16379
15383:X 14 Aug 2022 21:45:58.142 # +switch-master mymaster 10.21.248.7 16379 10.21.248.179 16379
15383:X 14 Aug 2022 21:45:58.142 * +slave slave 10.21.108.47:16379 10.21.108.47 16379 @ mymaster 10.21.248.179 16379
15383:X 14 Aug 2022 21:45:58.142 * +slave slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379
15383:X 14 Aug 2022 21:46:03.189 # +sdown slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379
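
The +switch-master line carries the master name, the old address, and the new address, so the most recent master can be read straight from the log. A minimal sketch, assuming the log format shown above; the function name current_master_from_log is made up for illustration.

```shell
#!/usr/bin/env bash
# current_master_from_log is a hypothetical helper name used only in this sketch.
# Usage: current_master_from_log <sentinel-log> [master-name]
# Prints "<ip>:<port>" of the master named in the last +switch-master event,
# or nothing if the log contains no such event.
current_master_from_log() {
  local log_file="$1" master_name="${2:-mymaster}"
  awk -v name="$master_name" '
    {
      # Event layout: +switch-master <name> <old-ip> <old-port> <new-ip> <new-port>
      for (i = 1; i + 5 <= NF; i++) {
        if ($i == "+switch-master" && $(i + 1) == name) {
          addr = $(i + 4) ":" $(i + 5)
        }
      }
    }
    END { if (addr != "") print addr }
  ' "$log_file"
}
```

For the log above, current_master_from_log /data/redis/logs/sentinel-26379.log prints 10.21.248.179:16379.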

After the old Redis master is started again, it automatically rejoins the Redis Sentinel deployment as a slave

$ cd /data/redis/logs
$ cat sentinel-26379.log
323096:X 14 Aug 2022 21:47:09.917 # -sdown slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:47:10.684 # -sdown sentinel 525e0e47f39d28d46b707401054b0680057c800a 10.21.248.7 26379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:47:19.833 * +convert-to-slave slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379

Simulate a crash of the server hosting the master Redis and Sentinel

The shutdown/reboot commands are omitted.
Sentinel log on node3

$ cd /data/redis/logs
$ cat sentinel-26379.log
323096:X 14 Aug 2022 21:48:45.033 # +sdown sentinel 525e0e47f39d28d46b707401054b0680057c800b 10.21.248.179 26379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:45.641 # +sdown master mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:45.693 # +odown master mymaster 10.21.248.179 16379 #quorum 2/2
323096:X 14 Aug 2022 21:48:45.693 # +new-epoch 59
323096:X 14 Aug 2022 21:48:45.693 # +try-failover master mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:45.694 # +vote-for-leader 525e0e47f39d28d46b707401054b0680057c800c 59
323096:X 14 Aug 2022 21:48:45.696 # 525e0e47f39d28d46b707401054b0680057c800a voted for 525e0e47f39d28d46b707401054b0680057c800c 59
323096:X 14 Aug 2022 21:48:45.795 # +elected-leader master mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:45.795 # +failover-state-select-slave master mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:45.848 # +selected-slave slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:45.848 * +failover-state-send-slaveof-noone slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:45.907 * +failover-state-wait-promotion slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:46.708 # +promoted-slave slave 10.21.248.7:16379 10.21.248.7 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:46.709 # +failover-state-reconf-slaves master mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:46.807 * +slave-reconf-sent slave 10.21.108.47:16379 10.21.108.47 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:47.737 * +slave-reconf-inprog slave 10.21.108.47:16379 10.21.108.47 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:47.737 * +slave-reconf-done slave 10.21.108.47:16379 10.21.108.47 16379 @ mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:47.837 # -odown master mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:47.837 # +failover-end master mymaster 10.21.248.179 16379
323096:X 14 Aug 2022 21:48:47.838 # +switch-master mymaster 10.21.248.179 16379 10.21.248.7 16379
323096:X 14 Aug 2022 21:48:47.838 * +slave slave 10.21.108.47:16379 10.21.108.47 16379 @ mymaster 10.21.248.7 16379
323096:X 14 Aug 2022 21:48:47.838 * +slave slave 10.21.248.179:16379 10.21.248.179 16379 @ mymaster 10.21.248.7 16379
323096:X 14 Aug 2022 21:48:52.858 # +sdown slave 10.21.248.179:16379 10.21.248.179 16379 @ mymaster 10.21.248.7 16379

After the old Redis master is started again, it automatically rejoins the Redis Sentinel deployment as a slave

$ cd /data/redis/logs
$ cat sentinel-26379.log
323096:X 14 Aug 2022 21:50:40.487 # -sdown sentinel 525e0e47f39d28d46b707401054b0680057c800b 10.21.248.179 26379 @ mymaster 10.21.248.7 16379
323096:X 14 Aug 2022 21:50:40.965 # -sdown slave 10.21.248.179:16379 10.21.248.179 16379 @ mymaster 10.21.248.7 16379

This article is from "Jack Wang Blog": http://www.yfshare.vip/2022/08/27/Redis-Sentinel/

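Deploying NSQ

2022-05-07T03:31:47.000Z

NSQ is a distributed real-time message queue middleware written in Go. It is released under the MIT license and was open-sourced by bitly as a simple, easy-to-use messaging middleware. It has the following characteristics:

- Distributed: a decentralized topology with no single point of failure, fault tolerance, high availability, and reliable message delivery
- Scalable: built-in service discovery and load balancing; capacity can be scaled out or in online
- Operations-friendly: deploying the binaries is enough, there are no external dependencies, and an admin UI for managing the cluster is included
- Well integrated: official Go and Python clients are available, and the community provides clients for most other languages

NSQ consists of the following 3 daemons:

Nsqd
- The daemon that receives, queues, and delivers messages to clients. nsqd keeps the topic and channel queues in memory; the queue size is set with the mem-queue-size parameter, and messages beyond that size are written to disk. nsqd can also run standalone, but it is usually used together with nsqlookupd so that clients in the cluster can dynamically discover registered nsqd instances.
Nsqlookupd
- The daemon that manages cluster topology information and provides service discovery. Multiple instances can be deployed; the instances do not communicate with each other and the data is eventually consistent. It receives the topic and channel information broadcast by nsqd nodes, and clients query nsqlookupd to discover the producers of a given topic. It has two interfaces:
  - a TCP interface, which nsqd uses to broadcast
  - an HTTP interface, which clients use for discovery and administration
Nsqadmin
- A web UI for monitoring the NSQ cluster in real time and performing administrative tasks such as pausing, emptying, or deleting a topic or channel. For the meaning of the metrics, see the nsqadmin documentation.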
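
The two HTTP interfaces can be exercised with plain curl: nsqd accepts a message on /pub?topic=<topic>, and nsqlookupd answers /lookup?topic=<topic> with the nsqd nodes that produce the topic. The sketch below only builds those URLs; the helper names are assumptions made for illustration, and the addresses passed in depend on the deployment.

```shell
#!/usr/bin/env bash
# nsqd_pub_url and nsqlookupd_lookup_url are hypothetical helper names.
# They only build URLs and do not contact any server.

# URL for publishing one message to a topic through the nsqd HTTP interface.
nsqd_pub_url() {
  local http_address="$1" topic="$2"
  echo "http://${http_address}/pub?topic=${topic}"
}

# URL for asking nsqlookupd which nsqd nodes produce a topic.
nsqlookupd_lookup_url() {
  local http_address="$1" topic="$2"
  echo "http://${http_address}/lookup?topic=${topic}"
}
```

A message could then be published with curl -d 'hello' "$(nsqd_pub_url 10.71.5.67:4152 test)" and looked up with curl "$(nsqlookupd_lookup_url 10.21.248.7:4161 test)", where test is an example topic name and the addresses are the HTTP addresses of an nsqd and an nsqlookupd instance.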

Deploy NSQ

Download and extract NSQ

$ cd /data/
$ wget -c 'https://github.com/nsqio/nsq/releases/download/v1.2.1/nsq-1.2.1.linux-amd64.go1.16.6.tar.gz'
$ tar xf nsq-1.2.1.linux-amd64.go1.16.6.tar.gz
$ tree nsq-1.2.1.linux-amd64.go1.16.6
nsq-1.2.1.linux-amd64.go1.16.6
└── bin
    ├── nsqadmin
    ├── nsqd
    ├── nsqlookupd
    ├── nsq_stat
    ├── nsq_tail
    ├── nsq_to_file
    ├── nsq_to_http
    ├── nsq_to_nsq
    └── to_nsq
 
1 directory, 9 files
$ mv nsq-1.2.1.linux-amd64.go1.16.6 nsq
$ mkdir -p /data/nsq/data/{node1,node2}
$ mkdir -p /data/nsq/logs

Deploy supervisor

#!/bin/bash
# Install supervisor 3.3.5 and create a base configuration, unless one already exists.
install_supervisor ()
{
    if [ ! -f "/etc/supervisord.conf" ]; then
        yum install -y python-setuptools 
        easy_install "supervisor==3.3.5"
        if [ $? -ne 0 ]; then
            echo "supervisor installation failed"
            exit 1
        fi
        echo_supervisord_conf > /etc/supervisord.conf
        echo -e "[include]\nfiles = /etc/supervisor/*.conf" >> /etc/supervisord.conf
        # Directory for the per-program configuration files included above
        if [ ! -d "/etc/supervisor" ]; then
            mkdir /etc/supervisor
        fi
        supervisord -c /etc/supervisord.conf
    else
        echo "supervisord.conf already exists; please check whether supervisord is already installed"
    fi
}
install_supervisor

nsq1

$ cd /etc/supervisor
$ cat nsqd1.conf
[program:nsqd1]
command=/data/nsq/bin/nsqd -broadcast-address=10.71.5.67 -http-address=10.71.5.67:4152 -node-id=1 -tcp-address=10.71.5.67:4153 -lookupd-tcp-address=10.71.5.67:4160 -lookupd-tcp-address=10.71.5.67:4163 -data-path=/data/nsq/data/node1 -mem-queue-size 100000
;process_name=%(program_name)s
numprocs=1
directory=/data/nsq/
umask=022
;priority=999
autostart=false
autorestart=true
startsecs=5
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=10
stopasgroup=false
killasgroup=false
user=root
redirect_stderr=true
stdout_logfile=/data/nsq/logs/nsqd1.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_capture_maxbytes=10MB
stdout_events_enabled=false
stderr_logfile=/data/nsq/logs/nsqd1_stderr.log
stderr_logfile_maxbytes=10MB
stderr_logfile_backups=10
stderr_capture_maxbytes=10MB
stderr_events_enabled=false
;environment=A="1",B="2"
serverurl=AUTO

nsq2

$ cat nsqd2.conf
[program:nsqd2]
command=/data/nsq/bin/nsqd -broadcast-address=10.71.5.67 -http-address=10.71.5.67:4154 -node-id=2 -tcp-address=10.71.5.67:4155 -lookupd-tcp-address=10.71.5.67:4160 -lookupd-tcp-address=10.71.5.67:4163 -data-path=/data/nsq/data/node2 -mem-queue-size 100000
;process_name=%(program_name)s
numprocs=1
directory=/data/nsq/
umask=022
;priority=999
autostart=false
autorestart=true
startsecs=5
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=10
stopasgroup=false
killasgroup=false
user=root
redirect_stderr=true
stdout_logfile=/data/nsq/logs/nsqd2.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_capture_maxbytes=10MB
stdout_events_enabled=false
stderr_logfile=/data/nsq/logs/nsqd2_stderr.log
stderr_logfile_maxbytes=10MB
stderr_logfile_backups=10
stderr_capture_maxbytes=10MB
stderr_events_enabled=false
;environment=A="1",B="2"
serverurl=AUTO

Deploying nsqlookupd

nsqlookup1

$ cat nsqlookupd1.conf
[program:nsqlookupd1]
command=/data/nsq/bin/nsqlookupd -tcp-address=10.21.248.7:4160 -http-address=10.21.248.7:4161 -broadcast-address=10.21.248.7
;process_name=%(program_name)s
numprocs=1
directory=/data/nsq/
umask=022
;priority=999
autostart=false
autorestart=true
startsecs=5
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=10
stopasgroup=false
killasgroup=false
user=root
redirect_stderr=true
stdout_logfile=/data/nsq/logs/nsqlookupd1.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_capture_maxbytes=10MB
stdout_events_enabled=false
stderr_logfile=/data/nsq/logs/nsqlookupd1_stderr.log
stderr_logfile_maxbytes=10MB
stderr_logfile_backups=10
stderr_capture_maxbytes=10MB
stderr_events_enabled=false
;environment=A="1",B="2"
serverurl=AUTO

nsqlookup2

$ cat nsqlookupd2.conf
[program:nsqlookupd2]
command=/data/nsq/bin/nsqlookupd -tcp-address=10.21.248.7:4163 -http-address=10.21.248.7:4162 -broadcast-address=10.21.248.7
;process_name=%(program_name)s
numprocs=1
directory=/data/nsq/
umask=022
;priority=999
autostart=false
autorestart=true
startsecs=5
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=10
stopasgroup=false
killasgroup=false
user=root
redirect_stderr=true
stdout_logfile=/data/nsq/logs/nsqlookupd2.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_capture_maxbytes=10MB
stdout_events_enabled=false
stderr_logfile=/data/nsq/logs/nsqlookupd2_stderr.log
stderr_logfile_maxbytes=10MB
stderr_logfile_backups=10
stderr_capture_maxbytes=10MB
stderr_events_enabled=false
;environment=A="1",B="2"
serverurl=AUTO

Deploying nsqadmin

nsqadmin

$ cat nsqadmin.conf
[program:nsqadmin]
command=/data/nsq/bin/nsqadmin --lookupd-http-address=10.21.248.7:4161 --lookupd-http-address=10.21.248.7:4162
;process_name=%(program_name)s
numprocs=1
directory=/data/nsq/
umask=022
;priority=999
autostart=true
autorestart=false
startsecs=10
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=10
stopasgroup=false
killasgroup=false
user=root
redirect_stderr=true
stdout_logfile=/data/nsq/logs/nsqadmin.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
stdout_capture_maxbytes=10MB
stdout_events_enabled=false
stderr_logfile=/data/nsq/logs/nsqadmin_stderr.log
stderr_logfile_maxbytes=10MB
stderr_logfile_backups=10
stderr_capture_maxbytes=10MB
stderr_events_enabled=false
;environment=A="1",B="2"
serverurl=AUTO

Starting the services

$ supervisorctl reread
$ supervisorctl update
$ supervisorctl start nsqd1 nsqd2 nsqlookupd1 nsqlookupd2 nsqadmin
$ supervisorctl status |grep -i 'runn'
nsqadmin                         RUNNING   pid 210281, uptime 0:00:40
nsqd1                            RUNNING   pid 208306, uptime 0:13:09
nsqd2                            RUNNING   pid 208344, uptime 0:13:04
nsqlookupd1                      RUNNING   pid 208255, uptime 0:13:24
nsqlookupd2                      RUNNING   pid 208282, uptime 0:13:19

Accessing nsqadmin
http://ip:4171

Source: "Jack Wang Blog": http://www.yfshare.vip/2022/05/07/部署nsq/

Monitoring Docker with Prometheus

2022-04-22T05:18:22.000Z

Monitoring Docker with Prometheus in a non-K8S environment

Environment

CentOS 7.4
Docker version 18.09.0

Environment setup

[root@1-206 ~]# yum install -y yum-utils device-mapper-persistent-data lvm2
[root@1-206 ~]# yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
[root@1-206 ~]# yum install -y docker-ce
[root@1-206 ~]# systemctl start docker && systemctl enable docker

Pull the container images

[root@1-206 ~]# docker pull nginx
[root@1-206 ~]# docker pull google/cadvisor
[root@1-206 ~]# docker pull yfshare/node-exporter:0.17.0
[root@1-206 ~]# docker pull yfshare/prometheus:2.5.0
[root@1-206 ~]# docker pull prom/alertmanager
[root@1-206 ~]# docker pull prom/blackbox-exporter:v0.13.0
[root@1-206 ~]# docker pull prom/snmp-exporter
[root@1-206 ~]# docker pull grafana/grafana

Start the containers

[root@1-206 ~]# docker run -it -d -p 80:80 -v /data1/tmp:/tmp/:rw --name nginx nginx
[root@1-206 ~]# docker run -d -p 8080:8080 -v /data1/tmp:/tmp:rw --name cadvisor google/cadvisor
[root@1-206 ~]# docker run -d -p 9100:9100 -v /data1/tmp:/tmp:rw --name node-exporter yfshare/node-exporter:0.17.0
[root@1-206 ~]# docker run -d -p 9090:9090 -v /data1/tmp:/tmp:rw --name prometheus yfshare/prometheus:2.5.0
[root@1-206 ~]# docker run -d -p 9093:9093 -v /data1/tmp:/tmp:rw --name alertmanager prom/alertmanager
[root@1-206 ~]# docker run -d -p 9115:9115 -v /data1/tmp:/tmp:rw --name blackbox-exporter prom/blackbox-exporter:v0.13.0
[root@1-206 ~]# docker run -d -p 9116:9116 -v /data1/tmp:/tmp:rw --name snmp-exporter prom/snmp-exporter
[root@1-206 ~]# docker run -d -p 3000:3000 -v /data1/tmp:/tmp:rw --name grafana grafana/grafana

Check container status

[root@1-206 ~]# docker ps 
CONTAINER ID    IMAGE                      COMMAND                  CREATED         STATUS          PORTS                            NAMES
c66f56e6f287    yfshare/grafana:5.4.2      "/docker-entrypoint.…"   6 days ago      Up 8 seconds    22/tcp, 0.0.0.0:3000->3000/tcp   grafana
ba97caebb358    prom/snmp-exporter         "/bin/snmp_exporter …"   6 days ago      Up 7 seconds    0.0.0.0:9116->9116/tcp           snmp-exporter
7dc28a01a329    nginx                      "nginx -g 'daemon of…"   6 days ago      Up 6 seconds    0.0.0.0:80->80/tcp               nginx
3e8ff1dc1d23    prom/blackbox-exporter:v0.13.0   "/bin/blackbox_expor…"   6 days ago      Up 6 seconds    0.0.0.0:9115->9115/tcp     blackbox-exporter
4af8de4eb4ef    prom/alertmanager          "/bin/alertmanager -…"   6 days ago      Up 5 seconds    0.0.0.0:9093->9093/tcp           alertmanager
3ad0edf76f32    yfshare/prometheus:2.5.0   "./prometheus --conf…"   6 days ago      Up 5 seconds    22/tcp, 0.0.0.0:9090->9090/tcp   prometheus
c7f8365c3c05    yfshare/node-exporter:0.17.0   "./node_exporter --w…"   18 minutes ago  Up 18 minutes 22/tcp, 0.0.0.0:9100->9100/tcp node-exporter
b7ea0dc8ade6    google/cadvisor            "/usr/bin/cadvisor -…"   6 days ago      Up 2 seconds    0.0.0.0:8080->8080/tcp           cadvisor
[root@1-206 ~]#

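Nice and clean!

First, prepare a test page.

When this problem appears, the cause is that the clock on the host running Prometheus is wrong, so synchronize the time first:

[root@1-206 ~]# ntpdate -u time.windows.com
[root@1-206 ~]# /sbin/hwclock --systohc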

By default, Prometheus monitors itself on port 9090.

Write the prometheus.yml configuration file

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.  
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["192.168.1.206:9093"]
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=` to any timeseries scraped from this config.  
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.1.206:9090','192.168.1.206:9100','192.168.1.206:8080']
  - job_name: 'snmp'
    static_configs:
      - targets:
        - 192.168.1.206
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.1.206:9116
# Website monitoring
  - job_name: 'instance-web-monitor'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://192.168.1.206:9090/metrics
        - http://192.168.1.206/yfshare
        labels:
            city: '上海'
            env: 'test'
            inhibit: 'on'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.1.206:9115
#### Business monitoring
  - job_name: 'yfshare-web-node'
    static_configs:
    - targets: ['192.168.1.206:9100']
      labels:
        city: '上海'
        env: 'DEV'
    metrics_path: '/metrics'
  - job_name: 'yfshare-web-cadvisor'
    static_configs:
    - targets: ['192.168.1.206:8080']
      labels:
        city: '上海'
        env: 'DEV'
    metrics_path: '/metrics'
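
The relabel_configs in the snmp and instance-web-monitor jobs are easier to follow when stepped through by hand. The sketch below is a simplified model of the three rules (it is not how Prometheus implements relabeling); `apply_probe_relabel` is a hypothetical helper, and the target and exporter address are taken from the config above.

```python
def apply_probe_relabel(target: str, exporter: str) -> dict:
    """Mimic the three relabel rules used for the blackbox/snmp exporter jobs."""
    labels = {"__address__": target}
    # Rule 1: copy __address__ into __param_target, which becomes ?target=... on the scrape URL.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: copy __param_target into instance, so the series is labelled with the probed URL.
    labels["instance"] = labels["__param_target"]
    # Rule 3: overwrite __address__ so Prometheus scrapes the exporter, not the probed site.
    labels["__address__"] = exporter
    return labels


relabelled = apply_probe_relabel("http://192.168.1.206/yfshare", "192.168.1.206:9115")
print(relabelled)
```

The net effect is that Prometheus requests /probe on the exporter with the original target passed as a query parameter, while the stored series still carry the probed URL in the instance label.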

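A custom-built image is used here because the official image does not reload the Prometheus configuration through the API out of the box (the /-/reload endpoint is only enabled when Prometheus is started with --web.enable-lifecycle). Imagine having to restart Prometheus after every configuration change; that would be rather tedious.

[root@3ad0edf76f32 prometheus]# curl -X POST http://127.0.0.1:9090/-/reload

Once the Prometheus configuration has been added correctly, the monitored keys become visible.

A custom-built image is used here as well, for the same reason.
Add the Prometheus DataSource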

Grafana dashboard

Install commonly used Grafana plugins for drawing charts.

[root@c66f56e6f287 grafana]# grafana-cli plugins install btplc-status-dot-panel
[root@c66f56e6f287 grafana]# grafana-cli plugins install grafana-clock-panel
[root@c66f56e6f287 grafana]# grafana-cli plugins install grafana-piechart-panel
[root@c66f56e6f287 grafana]# grafana-cli plugins install grafana-worldmap-panel
[root@c66f56e6f287 grafana]# grafana-cli plugins install michaeldmoore-annunciator-panel
[root@c66f56e6f287 grafana]# grafana-cli plugins install vonage-status-panel

Grafana must be restarted after the plugins are installed.

[root@1-206 ~]# docker exec -it grafana /bin/bash
[root@c66f56e6f287 grafana]# grafana-cli plugins ls
installed plugins:
btplc-status-dot-panel @ 0.2.3 
grafana-clock-panel @ 1.0.2 
grafana-piechart-panel @ 1.3.3 
grafana-worldmap-panel @ 0.1.2 
michaeldmoore-annunciator-panel @ 1.0.0 
vonage-status-panel @ 1.0.9 
Restart grafana after installing plugins . 
[root@c66f56e6f287 grafana]#

The finished Prometheus dashboard

Prometheus alerts

Two things are monitored here: changes to the password files and website probing.
URL test

# Website monitoring
  - job_name: 'instance-web-monitor'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://192.168.1.206:9090/metrics
        - http://192.168.1.206/yfshare
        - http://192.168.1.206/yfshare/aaa.html
        labels:
            city: '上海'
            env: 'test'
            inhibit: 'on'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.1.206:9115

[root@c7f8365c3c05 keys]# pwd
/usr/local/node_exporter/keys
[root@c7f8365c3c05 keys]# chmod +x check_file_md5.sh
[root@c7f8365c3c05 keys]# cat check_md5_file.txt 
/etc/passwd
/etc/shadow
[root@c7f8365c3c05 keys]# sh check_file_md5.sh
[root@c7f8365c3c05 keys]# ls
check_file_md5.sh  check_md5_file.txt  check_md5.prom
[root@c7f8365c3c05 keys]# cat check_md5.prom 
check_md5 {check_file="/etc/passwd",md5="93dfcbcaf36cddd4fa8a162bda2c98e3"} 0
check_md5 {check_file="/etc/shadow",md5="c48673bf1c829d4979bcd090649c3cbf"} 0
[root@c7f8365c3c05 keys]#
[root@c7f8365c3c05 keys]# useradd test
[root@c7f8365c3c05 keys]# sh check_file_md5.sh
[root@c7f8365c3c05 keys]# cat check_md5.prom 
check_md5 {check_file="/etc/passwd",md5="93dfcbcaf36cddd4fa8a162bda2c98e3"} 1
check_md5 {check_file="/etc/shadow",md5="c48673bf1c829d4979bcd090649c3cbf"} 1
[root@c7f8365c3c05 keys]# crontab -l
* * * * * /bin/bash /usr/local/node_exporter/keys/check_file_md5.sh
[root@c7f8365c3c05 keys]#

A value of 0 means normal; a value of 1 triggers an alert.
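
The check_file_md5.sh script itself is only provided as an attachment, so the following is a guess at the idea rather than the author's script: compare each file's current MD5 against a recorded baseline and write 0 or 1 in the textfile format shown in check_md5.prom. The function names and the temporary demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of(path: str) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def render_check_md5(baseline: dict) -> str:
    """baseline maps file path -> MD5 recorded on the first run."""
    lines = []
    for path, recorded in baseline.items():
        changed = 0 if md5_of(path) == recorded else 1  # 0 = unchanged, 1 = changed
        lines.append('check_md5{{check_file="{}",md5="{}"}} {}'.format(path, recorded, changed))
    return "\n".join(lines) + "\n"


# Demo on a throwaway file instead of /etc/passwd.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("root:x:0:0\n")
demo_path = tmp.name
baseline = {demo_path: md5_of(demo_path)}
before = render_check_md5(baseline)
with open(demo_path, "a") as handle:
    handle.write("test:x:1000:1000\n")  # simulates `useradd test`
after = render_check_md5(baseline)
os.remove(demo_path)
print(before, after, sep="")
```

Keeping the baseline MD5 in the label (rather than the current one) matches the sample output above, where the md5 label stays the same while the value flips from 0 to 1.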

[root@1-206 ~]# docker exec -it prometheus /bin/bash
[root@3ad0edf76f32 prometheus]# pwd
/usr/local/prometheus
[root@3ad0edf76f32 prometheus]# cat check.rules 
groups:
- name: base
  rules:
  - alert: PasswordFileChanged
    expr: check_md5 == 1
    for: 1m
    labels:
      CITY: ALL 
      info: Password file change alert
      severity: Warning
      resolved: OK
    annotations:
      summary: "The MD5 of file {{ $labels.check_file }} on server {{ $labels.instance }} has changed, please check."
      description: "Hostname: {{ $labels.hostname }}; file: {{ $labels.check_file }}" 
  - alert: WebsiteStatusCode
    expr: count_code{request="201"} >= 100 or count_code{request="403"} >= 100 or count_code{request="409"} >= 100 or count_code{request="404"} >= 100 or count_code{request="500"} >= 100 or count_code{request="502"} >= 100 or count_code{request="503"} >= 100
    for: 1m
    labels:
      CITY: ALL 
      info: Website status code alert
      severity: Warning
      resolved: OK
    annotations:
      summary: "Website status code {{ $labels.request }} alert on server {{ $labels.instance }}" 
      description: "Hostname: {{ $labels.hostname }}; status code source: {{ $labels.source }}; status code: {{ $labels.request }}"
[root@3ad0edf76f32 prometheus]#

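Duplicate alerts appear here because the test environment has only one host; the jobs show that Prometheus monitors it more than once.

Attachments: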
Grafana_templates.tar.gz
check_md5.zip
check_http.zip
check_code.zip

Source: "Jack Wang Blog": http://www.yfshare.vip/2022/04/22/Prometheus监控docker/

Displaying Prometheus data in Grafana

2022-01-14T03:07:05.000Z

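Grafana is a dashboard and graph editor for Graphite and InfluxDB.
Grafana is an open-source, full-featured metrics dashboard and graph editor that supports Graphite, InfluxDB, and OpenTSDB.
Main features of Grafana: flexible and rich graphing options; mixed styles; light and dark themes; multiple data sources; query editors for Graphite and InfluxDB; and more.

Deploying Grafana

After Prometheus has collected monitoring data from the k8s cluster, a more capable visualization tool is needed because Prometheus's built-in graphs are not very friendly. Prometheus also works in UTC (+0) and does not support changing that, whereas Grafana lets you choose a time zone.

Grafana website: https://grafana.com/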

[root@localhost Prometheus]# cat prometheus-grafana.yaml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: kube-ops
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: grafana
        task: monitoring
    spec:
      containers:
      - name: grafana
        image: yfshare/heapster-grafana-amd64:v4.4.3
        ports:
        - containerPort: 3000
          hostPort: 3000
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: ca-certificates
          mountPath: /etc/ssl/certs
          readOnly: true
        - name: grafana-data
          mountPath: /var
          subPath: grafana
        env:
        - name: INFLUXDB_HOST
          value: influxdb
        - name: INFLUXDB_SERVICE_URL
          value: http://influxdb.kube-ops.svc.cluster.local:8086
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/proxy/namespaces/kube-system/services/monitoring-grafana/
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
      - name: grafana-data
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: grafana
  name: grafana
  namespace: kube-ops
spec:
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    k8s-app: grafana
[root@localhost Prometheus]#

[root@localhost Prometheus]# kubectl apply -f prometheus-grafana.yaml

[root@localhost ~]# kubectl get pod -n kube-ops -o wide
NAME                          READY     STATUS    RESTARTS   AGE       IP            NODE
grafana-694786cf95-ph5pr      1/1       Running   0          1d        172.30.57.8   192.168.1.198
[root@localhost ~]#

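Accessing the Grafana dashboard

Open http://192.168.1.198:3000 to reach the Grafana web UI.

Adding data sources

Click Add Data Source to add a data source.

Here the influxdb-datasource is being edited.
Run kubectl get svc --all-namespaces -o wide | grep -i influxdb to get the cluster IP and port of influxdb.

Add the Prometheus data source to Grafana
Choose Prometheus as the data source type.
Run kubectl get pod --all-namespaces -o wide | grep -i prometheus to get the IP of Prometheus.

Adding a dashboard
A Prometheus dashboard template is available on the Grafana website (see https://grafana.com/dashboards/162) and can be imported into Grafana.

Import the downloaded kubernetes-pod-monitoring_rev1.json into Grafana.

Upload the JSON file and select the data source.

The import succeeded. The template initially shows its default monitoring data, which should be adjusted to your actual environment.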

Source: "Jack Wang Blog": http://www.yfshare.vip/2022/01/14/Grafana展示Prometheus数据/

Prometheus query syntax

2021-06-02T09:51:23.000Z

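Fundamentally, Prometheus stores all data as time series: streams of timestamped values that belong to the same metric and the same set of labeled dimensions. Besides stored time series, Prometheus may generate temporary derived time series as the result of queries.

Prometheus provides a functional expression language that lets the user select and aggregate time series data in real time. The result of an expression can be shown as a graph, viewed as tabular data in the Prometheus expression browser, or consumed by external systems through the HTTP API.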

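Expression language data types

In Prometheus's expression language, an expression or sub-expression evaluates to one of four types:

Instant vector: a set of time series containing a single sample for each time series, all sharing the same timestamp
Range vector: a set of time series containing a range of data points over time for each time series
Scalar: a simple numeric floating-point value
String: a simple string value (currently unused)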

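Time series selectors

Instant vector selectors
An instant vector selector selects a set of time series and a single sample value for each at a given timestamp (instant). For example, container_cpu_usage_seconds_total selects all time series that have the metric name container_cpu_usage_seconds_total.

Query the CPU load of a pod aggregated over one minute (the queries below use the pod kubernetes-dashboard-666fbbf977-v9vsh)

# List all pods in the current K8S cluster
kubectl get pod --all-namespaces

sum by (container_name)(rate(container_cpu_usage_seconds_total{image!="",container_name!="POD",pod_name="kubernetes-dashboard-666fbbf977-v9vsh"}[1m]))

is equivalent to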

sum by (container_name)(rate(container_cpu_usage_seconds_total{image="k8scn/kubernetes-dashboard-amd64@sha256:811a618089732441fa2aaaac0c860956e61410cec4b5715bbb328d0d4a685393",container_name="kubernetes-dashboard",pod_name="kubernetes-dashboard-666fbbf977-v9vsh"}[1m]))

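sum: summation. See the aggregation operators section.
by: keeps only the listed labels in the result vector and drops all other labels.
rate: calculates the per-second average rate of increase of the time series in the range vector. See rate() in the functions section.
container_cpu_usage_seconds_total: the cumulative CPU time consumed by the container, in seconds; it is a time series (metric) name.
{}: instant vector selector. It selects a set of time series and one sample value for each at a given timestamp (instant). In the simplest form only a metric name is specified, which yields an instant vector containing elements for all time series with that metric name. Label matchers inside the braces narrow the selection further.
{image!="",container_name!="POD",pod_name="kubernetes-dashboard-666fbbf977-v9vsh"} are all labels of the time series container_cpu_usage_seconds_total: image identifies the container image, container_name the container, and pod_name the pod.
image!="" means the image label is not empty; container_name!="POD" means container_name is not equal to POD; pod_name="kubernetes-dashboard-666fbbf977-v9vsh" means pod_name is equal to kubernetes-dashboard-666fbbf977-v9vsh.
[1m]: range vector selector. Syntactically, a range duration is appended in square brackets ([]) at the end of a vector selector to specify how far back in time values should be fetched for each resulting range vector element.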
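
To make the sum by (...)(rate(...[1m])) pattern concrete, here is a minimal sketch of what the two steps compute. It is a simplification: the real rate() also extrapolates to the window boundaries, which is omitted here. The sample values and the cpu label are invented for illustration.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase over (timestamp, value) samples, oldest first."""
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset, so the new value counts from zero.
        increase += (value - previous) if value >= previous else value
        previous = value
    return increase / (samples[-1][0] - samples[0][0])


def sum_by(series, label):
    """Sum the per-series rates, grouping by one label."""
    totals = defaultdict(float)
    for labels, samples in series:
        rate = simple_rate(samples)
        if rate is not None:
            totals[labels[label]] += rate
    return dict(totals)


# Two hypothetical series (one per CPU core) within a 1-minute window.
series = [
    ({"container_name": "kubernetes-dashboard", "cpu": "cpu00"}, [(0, 10.0), (30, 13.0), (60, 16.0)]),
    ({"container_name": "kubernetes-dashboard", "cpu": "cpu01"}, [(0, 4.0), (30, 5.5), (60, 7.0)]),
]
cpu_by_container = sum_by(series, "container_name")
print(cpu_by_container)
```

Taking the rate per series first and summing afterwards is also what keeps counter resets detectable, since a reset is only visible within a single series.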

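The details of this container_name can be looked up in the Prometheus UI, as follows:
Enter any time series name; Prometheus lists all the monitoring data currently in K8s by default, and you can then search with Ctrl+F.
![Prometheus time series](https://note.youdao.com/yws/api/personal/file/F458416FAFC341B0BBDCD5D0C29DE7DE?method=download&shareKey=ab0ca193e6ec09a12f3cf779e2d43da3)

The two queries above select the same monitoring targets.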

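Range vector selectors
Range vector literals work like instant vector literals, except that they select a range of samples back from the current instant. Syntactically, a time range is appended in square brackets ([]) after a vector selector.
The duration is given as a number followed by one of the following units: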

s - seconds
m - minutes
h - hours
d - days
w - weeks
y - years

See the example above.

Offset modifier
The offset modifier changes the time offset of individual instant and range vectors in a query.

The following expression returns the 5-minute rate that container_cpu_usage_seconds_total had one day ago:

(rate(container_cpu_usage_seconds_total{image!="",container_name!="POD",pod_name="kubernetes-dashboard-666fbbf977-v9vsh"}[5m] offset 1d))

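Operators

4.1. Binary operators
Prometheus's query language supports basic logical and arithmetic operators. For operations between two instant vectors, the matching behavior can be modified.

4.2. Arithmetic binary operators

+ addition
- subtraction
* multiplication
/ division
% modulo
^ exponentiation

Binary arithmetic operators are defined between scalar/scalar, vector/scalar, and vector/vector value pairs.
Between two scalars, the result is another scalar.

Between an instant vector and a scalar, the operator is applied to the value of every sample in the vector. For example, if an instant vector is divided by 2, the result is a new instant vector in which every sample value of the original vector has been divided by 2; the metric name is dropped.
Between two instant vectors, a binary arithmetic operator is applied to each entry in the left-hand vector and its matching element in the right-hand vector (see vector matching). The result is written to an instant vector without a metric name.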

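4.3. Comparison binary operators
Prometheus has the following comparison binary operators:

== equal
!= not equal
> greater than
< less than
>= greater than or equal
<= less than or equal

Comparison operators are defined between scalar/scalar, vector/scalar, and vector/vector value pairs. By default they filter; when the bool modifier is written after the operator, they return 0 or 1 instead.

Between two scalars, the bool modifier must be provided, and the result is another scalar that is either 0 (false) or 1 (true).
Between an instant vector and a scalar, the operator is applied to the value of every sample in the vector: elements for which the comparison is false are dropped from the result, and elements for which it is true are kept.

Between two instant vectors, each entry in the left-hand vector is compared with its matching element in the right-hand vector. Entries with no match, or for which the comparison is false, are dropped from the result; the remaining entries from the left-hand vector are written to the result instant vector with their metric name and labels.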
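
The difference between a filtering comparison and one using the bool modifier can be sketched as follows. This models only the vector/scalar case, with an instant vector represented as a plain dict from an instance name to a value; the instance names and numbers are made up.

```python
def greater_than(vector, threshold, use_bool=False):
    """Model `vector > threshold` and `vector > bool threshold`."""
    if use_bool:
        # bool modifier: keep every element, replace its value with 1 or 0.
        return {key: (1.0 if value > threshold else 0.0) for key, value in vector.items()}
    # Default: act as a filter and keep the original values.
    return {key: value for key, value in vector.items() if value > threshold}


load = {"host-a": 0.4, "host-b": 2.5, "host-c": 1.0}
filtered = greater_than(load, 0.9)
as_bool = greater_than(load, 0.9, use_bool=True)
print(filtered)
print(as_bool)
```

Alert expressions such as check_md5 == 1 rely on the filtering form: the alert fires for exactly the series that survive the comparison.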

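4.4. Logical/set binary operators
The logical/set binary operators are only defined between instant vectors:

and (intersection)
or (union)
unless (complement)

vector1 and vector2 results in a vector consisting of the elements of vector1 for which there are elements in vector2 with exactly matching label sets; other elements are dropped. The metric name and values are carried over from the left-hand vector.
vector1 or vector2 results in a vector that contains all elements of vector1 plus all elements of vector2 that have no matching label set in vector1.
vector1 unless vector2 (set difference) results in a vector consisting of the elements of vector1 for which there are no elements in vector2 with exactly matching label sets.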
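
The three set operators map closely onto dictionary operations if an instant vector is modelled as a dict keyed by its label set. This is only a sketch with hypothetical data; it ignores metric names and the on/ignoring modifiers.

```python
def v_and(left, right):
    """Elements of left whose label set also occurs in right (left values kept)."""
    return {labels: value for labels, value in left.items() if labels in right}


def v_or(left, right):
    """All of left, plus elements of right whose label set is not in left."""
    result = dict(left)
    for labels, value in right.items():
        result.setdefault(labels, value)
    return result


def v_unless(left, right):
    """Elements of left whose label set does not occur in right."""
    return {labels: value for labels, value in left.items() if labels not in right}


# Keys are label sets written as tuples of (name, value) pairs.
vector1 = {(("instance", "a"),): 1.0, (("instance", "b"),): 2.0}
vector2 = {(("instance", "b"),): 20.0, (("instance", "c"),): 30.0}

and_result = v_and(vector1, vector2)
or_result = v_or(vector1, vector2)
unless_result = v_unless(vector1, vector2)
print(and_result, or_result, unless_result, sep="\n")
```

Note that for instance "b" both and and or keep the left-hand value 2.0, never the right-hand 20.0.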

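Vector matching
Operations between vectors attempt to find a matching element in the right-hand vector for each entry in the left-hand vector. There are two basic types of matching behavior (one-to-one and many-to-one/one-to-many); only one-to-one is covered here:

One-to-one finds a unique pair of entries from each side of the operation. In the default case, an operation has the form vector1 <operator> vector2, and two entries match if they have exactly the same set of labels and corresponding values. The ignoring keyword allows the listed labels to be ignored when matching, while the on keyword restricts matching to the listed labels. The format is <vector expr> <bin-op> ignoring(<label list>) <vector expr>, or <vector expr> <bin-op> on(<label list>) <vector expr>.
See the vector matching section of the Prometheus operators documentation.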
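
A small sketch of one-to-one matching with ignoring and on, using division as the operator. The label names (method, code) and the numbers are illustrative assumptions, and the model skips details such as the error Prometheus raises when a match is not unique.

```python
def match_key(labels, on=None, ignoring=()):
    """Build the label signature two elements must share in order to match."""
    if on is not None:
        return tuple(sorted((k, v) for k, v in labels.items() if k in on))
    return tuple(sorted((k, v) for k, v in labels.items() if k not in ignoring))


def divide(left, right, on=None, ignoring=()):
    """One-to-one `left / right`; unmatched left-hand elements are dropped."""
    right_index = {match_key(labels, on, ignoring): value for labels, value in right}
    result = []
    for labels, value in left:
        key = match_key(labels, on, ignoring)
        if key in right_index:
            result.append((dict(key), value / right_index[key]))
    return result


errors = [({"method": "get", "code": "500"}, 24.0), ({"method": "post", "code": "500"}, 6.0)]
requests = [({"method": "get"}, 600.0), ({"method": "post"}, 120.0), ({"method": "del"}, 34.0)]

# errors / ignoring(code) requests
error_ratio = divide(errors, requests, ignoring=("code",))
# errors / on(method) requests gives the same pairs for this data
error_ratio_on = divide(errors, requests, on=("method",))
print(error_ratio)
```

Without ignoring(code) nothing would match here, because the left-hand elements carry a code label that the right-hand elements lack.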

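Aggregation operators
Prometheus supports the following built-in aggregation operators. They aggregate the elements of a single instant vector into a new vector with fewer elements and aggregated values.

sum (calculate sum over dimensions)
max (select maximum over dimensions)
min (select minimum over dimensions)
avg (calculate the average over dimensions)
stddev (calculate population standard deviation over dimensions)
stdvar (calculate population variance over dimensions)
count (count the number of elements in the vector)
count_values (count the number of elements with the same value)
bottomk (smallest k elements by sample value)
topk (largest k elements by sample value)
quantile (calculate the φ-quantile (0 ≤ φ ≤ 1) over dimensions)

These operators can either aggregate over all label dimensions or preserve distinct dimensions through a without or by clause. The general form is <aggr-op>([parameter,] <vector expression>) [without|by (<label list>)].
parameter is only required for count_values, quantile, topk, and bottomk. without removes the listed labels from the result vector while all other labels are preserved in the output. by does the opposite and drops labels that are not listed in the by clause, even if their label values are identical between all elements of the vector. The keep_common clause allowed extra labels to be kept (labels that are identical between elements but not in the by clause); it was removed in Prometheus 2.0.

count_values outputs one time series per unique sample value. Each series has an additional label whose name is given by the aggregation parameter and whose value is the unique sample value. The value of each time series is the number of times that sample value was present.
topk and bottomk differ from the other aggregators in that a subset of the input samples, including the original labels, is returned in the result vector. by and without are only used to bucket the input vector.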

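For example, if the metric http_requests_total has time series that fan out by the application, instance, and group labels, the total number of HTTP requests per application and group over all instances can be calculated with:

sum(http_requests_total) without (instance)

To get the total number of HTTP requests across all applications:

sum(http_requests_total)

To count the number of binaries running each build version:

count_values("version", build_version)

To get the 5 largest HTTP request counts across all instances:

topk(5, http_requests_total)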
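
The aggregation examples above can be mirrored in a few lines of plain Python. This is an illustrative model only: the label values and sample numbers are invented, and count_values here returns a plain mapping rather than labelled series.

```python
from collections import Counter, defaultdict


def sum_without(vector, dropped):
    """Model `sum(v) without (dropped...)`: group by all remaining labels."""
    totals = defaultdict(float)
    for labels, value in vector:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        totals[key] += value
    return dict(totals)


def topk(vector, k):
    """Model `topk(k, v)`: the k elements with the largest values, labels intact."""
    return sorted(vector, key=lambda item: item[1], reverse=True)[:k]


def count_values(vector):
    """Model `count_values(...)`: how many elements share each sample value."""
    return dict(Counter(value for _, value in vector))


http_requests_total = [
    ({"application": "api", "group": "canary", "instance": "0"}, 10.0),
    ({"application": "api", "group": "canary", "instance": "1"}, 30.0),
    ({"application": "web", "group": "production", "instance": "0"}, 5.0),
]
per_app = sum_without(http_requests_total, {"instance"})
top_two = topk(http_requests_total, 2)
build_version = [({"instance": "0"}, 2.0), ({"instance": "1"}, 2.0), ({"instance": "2"}, 3.0)]
versions = count_values(build_version)
print(per_app, top_two, versions, sep="\n")
```

Unlike sum, topk returns the original elements with all their labels, which is why its result still shows the instance label.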

Binary operator precedence
Binary operators in Prometheus, from highest to lowest precedence:

^
*,/,%
+,-
==,!=,<=,<,>=,>
and,unless
or

Reference: https://github.com/1046102779/prometheus/blob/master/prometheus/querying/operators.md

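Functions
Some functions have default arguments, for example year(v=vector(time()) instant-vector). Here v is the argument name, instant-vector is the argument type, and vector(time()) is the default value.

abs()
abs(v instant-vector) returns the input vector with all sample values converted to their absolute value.
absent()
absent(v instant-vector) returns an empty vector if the vector passed to it has any elements, and a 1-element vector with the value 1 and no metric name if the vector passed to it has no elements. Where possible, the labels are derived from the equality matchers of the selector.
This is useful for alerting on the case where no time series exist for a given metric name and label combination.
absent(nonexistent{job="myjob"}) # => {job="myjob"} 1
absent(nonexistent{job="myjob", instance=~".*"}) # => {job="myjob"} 1
absent(sum(nonexistent{job="myjob"})) # => {} 1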
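ceil()
ceil(v instant-vector) rounds the sample values of all elements in v up to the nearest integer.
changes()
changes(v range-vector) returns, for each input time series, the number of times its value has changed within the provided time range.
clamp_max()
clamp_max(v instant-vector, max scalar) clamps the sample values of all elements in v to an upper limit of max: values greater than max become max, and all others are unchanged.
clamp_min()
clamp_min(v instant-vector, min scalar) clamps the sample values of all elements in v to a lower limit of min: values smaller than min become min, and all others are unchanged.
count_scalar()
count_scalar(v instant-vector) returns the number of elements in the instant vector as a scalar. The count() aggregation operator, by contrast, always returns a vector, which can be grouped with a by clause. count_scalar() was removed in Prometheus 2.0.
day_of_month()
day_of_month(v=vector(time()) instant-vector) returns the day of the month for each of the given times in UTC. Returned values are from 1 to 31.
day_of_week()
day_of_week(v=vector(time()) instant-vector) returns the day of the week for each of the given times in UTC. Returned values are from 0 to 6, where 0 means Sunday.
days_in_month()
days_in_month(v=vector(time()) instant-vector) returns the number of days in the month for each of the given times in UTC. Returned values are from 28 to 31.
delta()
delta(v range-vector) calculates the difference between the first and last value of each time series element in the range vector v, returning an instant vector with the given deltas and equivalent labels.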

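The following expression returns the difference in CPU temperature between now and two hours ago:

delta(cpu_temp_celsius{host="zeus"}[2h])

delta should only be used with gauges.

deriv()
deriv(v range-vector) calculates the per-second derivative of each time series in the range vector v, using simple linear regression.
deriv should only be used with gauges.
drop_common_labels()
drop_common_labels(instant-vector) takes an instant vector and drops every label that has the same name and value across all of its elements. It was removed in Prometheus 2.0.
For example, given:
http_requests_total{code="200", host="127.0.0.1:9090", method="get"} : 4,
http_requests_total{code="200", host="127.0.0.1:9090", method="post"} : 5,
the result is: http_requests_total{method="get"} : 4, http_requests_total{method="post"} : 5
exp()
exp(v instant-vector) calculates the exponential function e^x for all sample values in v. Special cases:
Exp(+Inf) = +Inf
Exp(NaN) = NaN
floor()
floor(v instant-vector) is the opposite of ceil(): it rounds the sample values down to the nearest integer, so 4.3 becomes 4.
histogram_quantile()
histogram_quantile(φ float, b instant-vector) calculates the φ-quantile (0 ≤ φ ≤ 1) from the buckets b of a histogram.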
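
As a rough picture of how a quantile can be estimated from cumulative buckets, the sketch below interpolates linearly inside the bucket that contains the target rank. It assumes the buckets of one histogram are given as (upper bound, cumulative count) pairs ending with a +Inf bucket and that the lowest bucket starts at 0; the bucket values are invented, and real-world edge cases are not covered.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == lower_count:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical request-duration buckets (seconds): le=0.1, 0.5, 1.0, +Inf.
duration_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
median = histogram_quantile(0.5, duration_buckets)
p95 = histogram_quantile(0.95, duration_buckets)
print(median, p95)
```

Because of the even-spread assumption, the accuracy of the estimate depends on how well the bucket boundaries fit the real distribution.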
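holt_winters()
holt_winters(v range-vector, sf scalar, tf scalar) produces a smoothed value for the time series based on the range in v. The lower the smoothing factor sf, the more importance is given to old data. The higher the trend factor tf, the more trends in the data are considered. Both sf and tf must be between 0 and 1. holt_winters should only be used with gauges.
hour()
hour(v=vector(time()) instant-vector) returns the hour of the day for each of the given times in UTC. Returned values are from 0 to 23.
idelta()
idelta(v range-vector) returns, for each time series in the range vector, the difference between its last two samples.
increase()
increase(v range-vector) returns, for each time series in the range vector, the increase between its first and last value. Breaks in monotonicity (such as counter resets when a service instance restarts) are adjusted for automatically. Unlike delta(), which returns a plain difference that may be negative and is meant for gauges, increase() is meant for counters.
The following expression returns the number of HTTP requests measured over the last 5 minutes, per time series in the range vector:
increase(http_requests_total{job="api-server"}[5m])

irate()
irate(v range-vector) calculates the per-second instant rate of increase of the time series in the range vector: (last value - previous value) / difference between their timestamps. It is based on the last two data points, and breaks in monotonicity (such as counter resets when a service instance restarts) are adjusted for automatically.
The following expression returns the per-second rate of HTTP requests, looking up to 5 minutes back for the two most recent data points of each time series in the range vector:
irate(http_requests_total{job="api-server"}[5m])
irate should only be used when graphing volatile, fast-moving counters. Use rate for alerts and slow-moving counters, because brief changes in the rate can reset the FOR clause, and graphs consisting entirely of rare spikes are hard to read.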
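
The contrast between increase() and irate() shows up clearly on a counter that resets in the middle of the window. The sketch below is simplified (no extrapolation to the window boundaries, which the real increase() performs), and the samples are made up.

```python
def increase(samples):
    """Total increase over (timestamp, value) samples, treating drops as resets."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        total += (current - previous) if current >= previous else current
    return total


def irate(samples):
    """Per-second rate based only on the last two samples."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    delta = (v_last - v_prev) if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


# The counter restarts between t=15 and t=30 (130 -> 5).
samples = [(0, 100.0), (15, 130.0), (30, 5.0), (45, 20.0)]
total_increase = increase(samples)
instant_rate = irate(samples)
print(total_increase, instant_rate)
```

A plain last-minus-first difference would give -80 here; counting the post-reset value from zero gives 30 + 5 + 15 = 50 instead.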
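label_replace()
For each time series in v, label_replace(v instant-vector, dst_label string, replacement string, src_label string, regex string) matches the regular expression regex against the value of the label src_label. If it matches, the time series is returned with the label dst_label set to the expansion of replacement, in which $1 is replaced by the first matching subgroup, $2 by the second, and so on. If the regular expression does not match, the time series is returned unchanged.

Put another way: label_replace takes an instant vector and, for each element, matches the value of src_label against regex. Elements that do not match are left untouched; for elements that match, the value of the dst_label label is replaced with replacement. The following example returns a vector in which each time series has a foo label with the value a added to it: label_replace(up{job="api-server", service="a:c"}, "foo", "$1", "service", "(.*):.*")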
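
The same behaviour can be imitated with Python's re module. This is only an approximation: Prometheus uses RE2 syntax with fully anchored matching, which is modelled here with re.fullmatch, and the Go-style $1 references are translated to Python's \g<1> form; replacement strings containing backslashes are not handled.

```python
import re


def label_replace(labels, dst_label, replacement, src_label, regex):
    """Return a copy of labels with dst_label set if regex matches src_label's value."""
    match = re.fullmatch(regex, labels.get(src_label, ""))
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # Translate $1, $2, ... into \g<1>, \g<2>, ... and expand against the match.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[dst_label] = match.expand(template)
    return result


series_labels = {"job": "api-server", "service": "a:c"}
replaced = label_replace(series_labels, "foo", "$1", "service", "(.*):.*")
untouched = label_replace(series_labels, "foo", "$1", "service", "no-colon")
print(replaced)
```

A common use is copying part of one label (for example the host part of an address) into a new label so that two metrics can be joined on it.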
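ln()
ln(v instant-vector) calculates the natural logarithm of all sample values in v. Special cases:
ln(+Inf) = +Inf
ln(0) = -Inf
ln(x < 0) = NaN
ln(NaN) = NaN
log2()
log2(v instant-vector) calculates the binary logarithm of all sample values in v.
log10()
log10(v instant-vector) calculates the decimal logarithm of all sample values in v. The special cases are equivalent to those in ln().
minute()
minute(v=vector(time()) instant-vector) returns the minute of the hour for each of the given times in UTC. Returned values are from 0 to 59.
month()
month(v=vector(time()) instant-vector) returns the month of the year for each of the given times in UTC. Returned values are from 1 to 12, where 1 means January.
predict_linear()
predict_linear(v range-vector, t scalar) predicts the value of each time series t seconds from now, based on the range vector v, using simple linear regression. The result carries the labels of the series but no metric name.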
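
The linear-regression idea behind predict_linear() can be sketched with an ordinary least-squares fit. The sketch assumes that "now" is the timestamp of the last sample (Prometheus uses the evaluation time), and the disk-space samples are invented.

```python
def predict_linear(samples, t):
    """Fit a line through (timestamp, value) samples and evaluate it t seconds ahead."""
    now = samples[-1][0]  # assumption: evaluate at the last sample's timestamp
    xs = [timestamp - now for timestamp, _ in samples]
    ys = [value for _, value in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x  # fitted value at "now"
    return intercept + slope * t


# Free disk space (GiB) shrinking by 10 GiB per minute.
free_space = [(0, 100.0), (60, 90.0), (120, 80.0)]
in_ten_minutes = predict_linear(free_space, 600)
print(in_ten_minutes)
```

A negative prediction like this one is what alert rules of the form predict_linear(...) < 0 look for: the disk is expected to be full before the horizon is reached.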
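rate()
rate(v range-vector) calculates the per-second average rate of increase of the time series in the range vector, roughly (last value - first value) / elapsed time; the result carries the labels of the series but no metric name. Breaks in monotonicity (such as counter resets) are adjusted for automatically.
rate(http_requests_total[5m])
rate should only be used with counters. It is best suited for alerting and for graphing slow-moving counters.
Note: when combining rate() with an aggregation operator, always take the rate first and then aggregate; otherwise rate() cannot detect counter resets when a service instance restarts.
resets()
resets(v range-vector) returns, for each input time series, the number of counter resets within the provided time range. Any decrease in value between two consecutive samples is interpreted as a counter reset.
resets(http_requests_total[5m]). resets should only be used with counters.
round()
round(v instant-vector, to_nearest=1 scalar) is similar to ceil() and floor(): it rounds the sample values of all elements in v to the nearest multiple of to_nearest. If to_nearest is not given it defaults to 1, so values are rounded to the nearest integer.
scalar()
scalar(v instant-vector) returns the sample value of a single-element input vector as a scalar. If the input vector does not have exactly one element, scalar returns NaN.
sort()
sort(v instant-vector) returns the vector elements sorted by their sample values, in ascending order.
sort_desc()
sort_desc(v instant-vector) returns the vector elements sorted by their sample values, in descending order.
sqrt()
sqrt(v instant-vector) calculates the square root of all sample values in v.
time()
time() returns the number of seconds since 1970-01-01 UTC. Note that it returns a Unix timestamp rather than a formatted current time.
vector()
vector(s scalar) returns the scalar s as a vector with no labels.
year()
year(v=vector(time()) instant-vector) returns the year for each of the given times in UTC.
<aggregation>_over_time()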
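The following functions take a range vector and return an instant vector with a per-series aggregation over time:
- avg_over_time(range-vector): the average value of all points in the range, per series
- min_over_time(range-vector): the minimum value of all points in the range, per series
- max_over_time(range-vector): the maximum value of all points in the range, per series
- sum_over_time(range-vector): the sum of all values in the range, per series
- count_over_time(range-vector): the number of samples in the range, per series
- quantile_over_time(scalar, range-vector): the φ-quantile (0 ≤ φ ≤ 1) of the values in the range, per series
- stddev_over_time(range-vector): the population standard deviation of the values in the range, per series
- stdvar_over_time(range-vector): the population variance of the values in the range, per series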
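
For a single series, these functions reduce to ordinary statistics over the sample values inside the window. The sketch below shows a few of them; the quantile uses linear interpolation between neighbouring sorted values, which is an assumption about the exact method, and the latency values are invented.

```python
import math


def avg_over_time(values):
    return sum(values) / len(values)


def stddev_over_time(values):
    """Population standard deviation (divide by n, not n - 1)."""
    mean = avg_over_time(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def quantile_over_time(q, values):
    """q-quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Sample values of one series inside the range window.
latencies = [1.0, 2.0, 3.0, 4.0, 10.0]
print(avg_over_time(latencies), max(latencies), len(latencies))
print(quantile_over_time(0.5, latencies), quantile_over_time(0.9, latencies))
print(stddev_over_time(latencies))
```

The single outlier (10.0) pulls the average to 4.0 while the median stays at 3.0, which is a common reason to prefer quantile_over_time over avg_over_time for latency-like data.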

References: https://github.com/1046102779/prometheus/blob/master/prometheus/querying/functions.md
https://www.kancloud.cn/huyipow/prometheus/525003

Source: "Jack Wang Blog": http://www.yfshare.vip/2021/06/02/Prometheus查询语法/

Forwarding zabbix proxy traffic through a proxy

2021-03-17T06:06:35.000Z

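Symptom: during the evening peak (21:00 to 24:00), zabbix-server was frequently unable to connect to the zabbix-proxy in the US region.
Cause: investigation showed that international traffic between the US and TW passes through the carrier PCCW; reporting the fault to the carrier and adjusting routes did not help noticeably.

mtr screenshot

Because CDS has servers in TW, the problem was eventually solved, after discussion with the vendor CDS, by using CDS Cloud Interconnect so that the network path no longer passes through PCCW. In other words, the US CDS machine and the TW CDS machine form one "large internal network" and communicate over it, and the TW CDS machine then forwards the US zabbix proxy traffic to the TW zabbix-server.

How it was solved:
The first idea was to forward traffic with iptables SNAT and DNAT. Unfortunately, that failed.
The next attempt used nginx to forward the zabbix proxy traffic to the zabbix server. Testing showed that the zabbix proxy packets are too large for the layer-7 nginx reverse proxy to forward in full, and repeated parameter tuning did not fix it. That failed too.
Finally, a layer-4 haproxy was used to proxy the zabbix proxy traffic to the zabbix server. That worked.

With SNAT and DNAT, the proxy server kept sending large numbers of ARP packets, and the problem remained unsolved.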

20:53:19.828856 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:20.330829 ARP, Request who-has 10.123.0.205 tell 10.123.0.10, length 46
20:53:20.376638 ARP, Request who-has 10.123.0.204 tell 10.123.0.10, length 46
20:53:20.830674 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:21.230574 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:21.832715 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:22.232912 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:22.801790 ARP, Request who-has 10.123.0.208 tell 10.123.0.10, length 46
20:53:22.834676 ARP, Request who-has 10.123.0.207 tell 10.123.0.10, length 46
20:53:23.234728 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:23.804725 ARP, Request who-has 10.123.0.208 tell 10.123.0.10, length 46
20:53:24.236666 ARP, Request who-has 10.123.0.203 tell 10.123.0.10, length 46
20:53:24.806665 ARP, Request who-has 10.123.0.208 tell 10.123.0.10, length 46

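These are the SNAT and DNAT rules configured earlier; perhaps they were set up incorrectly, as they did not solve the problem.

iptables -t nat -D POSTROUTING -s 10.123.1.68/32 -o ens192 -j SNAT --to-source 164.52.1.106
iptables -t nat -D PREROUTING -s 124.108.157.186 -j DNAT --to-destination 164.52.1.106

A route to the proxy server was configured on the zabbix proxy, but zabbix-proxy still could not establish communication with zabbix-server.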

111991:20190820:212111.845 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111990:20190820:212111.846 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]. Will retry every 120 second(s)
111992:20190820:212111.942 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]. Will retry every 15 second(s)
111991:20190820:212211.878 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212227.046 Still unable to connect...
111991:20190820:212311.926 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212327.110 Still unable to connect...
111990:20190820:212341.925 Still unable to connect...
111991:20190820:212411.974 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212427.173 Still unable to connect...
111991:20190820:212512.005 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212527.238 Still unable to connect...
111990:20190820:212556.965 Still unable to connect...
111991:20190820:212611.046 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212627.302 Still unable to connect...
111991:20190820:212711.077 Unable to connect to the server [124.108.157.186]:10051 [cannot connect to [[124.108.157.186]:10051]: [110] Connection timed out]
111992:20190820:212727.365 Still unable to connect...

With the nginx proxy, the packets sent by the zabbix proxy turn out to be very large; nginx could not receive the whole packet, which led to the failure.

10.123.1.68 - - [20/Aug/2019:20:25:04 +0800] "ZBXD\x01\x95\xB4\x11\x00\x00\x00\x00\x00{\x22request\x22:\x22history data\x22,\x22host\x22:\x22US_FF_zabbix-proxy\x22,\x22data\x22:[{\x22host\x22:\x22us-ff-mysql08s2\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-mysql53s1\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-mysql32m\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22mysql_bak03\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-backend02\x22,\x22key\x22:\x22icmppingloss[{HOST.IP}, 10]\x22,\x22clock\x22:1566295973,\x22ns\x22:143328430,\x22value\x22:\x220.000000\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if.in[bond0.1711]\x22,\x22clock\x22:1566295977,\x22ns\x22:131031460,\x22value\x22:\x2225911485220\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if.in[bond0]\x22,\x22clock\x22:1566295977,\x22ns\x22:131165849,\x22value\x22:\x2230945671117\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if.in[bond1.1712]\x22,\x22clock\x22:1566295977,\x22ns\x22:131254822,\x22value\x22:\x2219002118429\x22},{\x22host\x22:\x22us-ff-backend34\x22,\x22key\x22:\x22net.if." 400 173 "-" "-" "-"
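
The size of that packet can be read from the first bytes captured in the nginx log line. The sketch below decodes them, assuming the standard Zabbix protocol header layout: the literal ZBXD, one flags byte, a 4-byte little-endian data length, and 4 reserved bytes. `parse_zabbix_header` is a hypothetical helper name.

```python
import struct


def parse_zabbix_header(raw: bytes):
    """Return (flags, data_length) from the 13-byte Zabbix protocol header."""
    protocol, flags, length, _reserved = struct.unpack("<4sBII", raw[:13])
    if protocol != b"ZBXD":
        raise ValueError("not a Zabbix protocol packet")
    return flags, length


# Header bytes as they appear at the start of the logged request.
header = b"ZBXD\x01\x95\xB4\x11\x00\x00\x00\x00\x00"
flags, data_length = parse_zabbix_header(header)
print(flags, data_length)
```

Under that assumption the proxy announced a payload of 1160341 bytes (about 1.1 MiB) in this one message. The 400 status at the end of the log line also suggests that nginx tried to parse the binary stream as an HTTP request, which a layer-4 proxy such as haproxy in TCP mode does not attempt.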

Proxying with haproxy

Aug 20 21:42:55 localhost haproxy[34707]: 10.123.1.68:40592 [20/Aug/2019:21:42:53.425] tcenterd tcenterd/tcenterd 1/74/1829 111 -- 0/0/0/0/0 0/0
Aug 20 21:42:57 localhost haproxy[34707]: 10.123.1.68:40596 [20/Aug/2019:21:42:55.538] tcenterd tcenterd/tcenterd 1/78/1748 111 -- 0/0/0/0/0 0/0
Aug 20 21:42:59 localhost haproxy[34707]: 10.123.1.68:40598 [20/Aug/2019:21:42:57.569] tcenterd tcenterd/tcenterd 1/74/1838 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:01 localhost haproxy[34707]: 10.123.1.68:40600 [20/Aug/2019:21:42:59.692] tcenterd tcenterd/tcenterd 1/74/1742 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:03 localhost haproxy[34707]: 10.123.1.68:40602 [20/Aug/2019:21:43:01.720] tcenterd tcenterd/tcenterd 1/74/1735 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:05 localhost haproxy[34707]: 10.123.1.68:40606 [20/Aug/2019:21:43:03.738] tcenterd tcenterd/tcenterd 1/74/1774 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:07 localhost haproxy[34707]: 10.123.1.68:40608 [20/Aug/2019:21:43:05.800] tcenterd tcenterd/tcenterd 1/73/1720 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:09 localhost haproxy[34707]: 10.123.1.68:40610 [20/Aug/2019:21:43:07.807] tcenterd tcenterd/tcenterd 1/75/1809 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:11 localhost haproxy[34707]: 10.123.1.68:40612 [20/Aug/2019:21:43:09.899] tcenterd tcenterd/tcenterd 1/77/1741 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:13 localhost haproxy[34707]: 10.123.1.68:40614 [20/Aug/2019:21:43:11.924] tcenterd tcenterd/tcenterd 1/82/1783 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:15 localhost haproxy[34707]: 10.123.1.68:40616 [20/Aug/2019:21:43:13.990] tcenterd tcenterd/tcenterd 1/75/1793 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:17 localhost haproxy[34707]: 10.123.1.68:40618 [20/Aug/2019:21:43:16.066] tcenterd tcenterd/tcenterd 1/77/1737 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:19 localhost haproxy[34707]: 10.123.1.68:40620 [20/Aug/2019:21:43:18.086] tcenterd tcenterd/tcenterd 1/74/1788 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:21 localhost haproxy[34707]: 10.123.1.68:40622 [20/Aug/2019:21:43:20.158] tcenterd tcenterd/tcenterd 1/74/1761 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:23 localhost haproxy[34707]: 10.123.1.68:40624 [20/Aug/2019:21:43:22.199] tcenterd tcenterd/tcenterd 1/77/1771 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:26 localhost haproxy[34707]: 10.123.1.68:40626 [20/Aug/2019:21:43:24.256] tcenterd tcenterd/tcenterd 1/74/1845 111 -- 0/0/0/0/0 0/0
Aug 20 21:43:28 localhost haproxy[34707]: 10.123.1.68:40628 [20/Aug/2019:21:43:26.383] tcenterd tcenterd/tcenterd 1/77/1791 111 -- 0/0/0/0/0 0/0

With haproxy in place, the zabbix proxy can send data to zabbix-server normally again.

114075:20190820:212810.030 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:212915.579 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213021.450 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213126.427 received configuration data from server at "10.123.0.2", datalen 9492921
114077:20190820:213133.145 cannot send history data to server at "10.123.0.2": ZBX_TCP_WRITE() failed: [32] Broken pipe
114075:20190820:213234.122 received configuration data from server at "10.123.0.2", datalen 9492921
114077:20190820:213245.932 cannot send history data to server at "10.123.0.2": ZBX_TCP_WRITE() failed: [32] Broken pipe
114075:20190820:213348.664 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213453.962 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213603.570 received configuration data from server at "10.123.0.2", datalen 9492921
114075:20190820:213713.600 received configuration data from server at "10.123.0.2", datalen 9492921

haproxy configuration:

$ cat /etc/haproxy/haproxy.cfg
global
    log         127.0.0.1 local2
 
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     400000
    user        haproxy
    group       haproxy
    daemon
    nbproc 30
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats
 
defaults
    mode                    http
    log                     global
    option                  dontlognull
    option http-server-close
    option                  redispatch
    retries                 3
    timeout http-request    100s
    timeout queue           10m
    timeout connect         100s
    timeout client          10m
    timeout server          10m
    timeout http-keep-alive 100s
    timeout check           100s
    maxconn                 300000
 
frontend us_to_tw_proxy
    bind *:10051
    mode tcp
    default_backend zabbix-server
    maxconn 5000000
 
backend zabbix-server
    mode tcp
    server tcenterd 124.108.157.186:10051  maxconn 100
 
listen stats :1936
    stats enable
    stats uri /
    stats realm Haproxy\ Statistics
    stats auth admin:test1

The zabbix proxy needs a static route that changes its network path to the zabbix server, so that the traffic is forwarded through the proxy server.

/usr/sbin/route add -host 124.108.157.186 gw 10.123.0.3

Then change the zabbix server setting on the zabbix proxy to the proxy server's IP address.

$ grep -iv '^$' zabbix_proxy_tw.conf | grep -iv '^#'
Server=10.123.0.3
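
Before restarting the zabbix proxy, it can help to confirm that the proxy server's port is reachable over the new route. The following is a minimal sketch using only the standard library. The function name can_connect is an illustration, and 10.123.0.3:10051 is taken from the configuration in this post.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable routes, and timeouts.
        return False
```

Calling can_connect("10.123.0.3", 10051) on the zabbix proxy host should return True once haproxy is listening and the static route is in place.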

This article is from "Jack Wang Blog": http://www.yfshare.vip/2021/03/17/代理转zabbix proxy/

Reading MySQL with Python to generate Excel (XLSX)

2020-10-16T05:17:50.000Z

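Pulling report data out of MySQL used to be simple: a few short shell scripts did the job. The reports were for my own use, so the format did not matter, and I always used .csv, which is comma-separated and easy for scripts to process. This time is different. The request is still to export data from MySQL, but the output file must be .xlsx, the native format of Microsoft Excel, which contains a special file header, so the old approach no longer works. After a lot of trial and error, and some help from others, the requirement was finally met.

Environment:
    CentOS 7.2.1511
    Python 2.7.5 -> 3.5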

[root@localhost ~]# python -V
Python 2.7.5
[root@localhost ~]#
# Install the easy_install command
[root@localhost ~]# wget https://bootstrap.pypa.io/ez_setup.py -O - | python
# Install the pip command (the Python package manager)
[root@localhost ~]# wget https://pypi.python.org/packages/11/b6/abcb525026a4be042b486df43905d6893fb04f05aac21c32c638e939e447/pip-9.0.1.tar.gz#md5=35f01da33009719497f01a4ba69d63c9
[root@localhost ~]# tar -zxf pip-9.0.1.tar.gz 
[root@localhost ~]# cd pip-9.0.1
[root@localhost pip-9.0.1]# python setup.py install
# Install the Python modules
[root@localhost ~]# python select3.py 
Traceback (most recent call last):
  File "select3.py", line 3, in <module>
    from sqlalchemy import create_engine
ImportError: No module named sqlalchemy
[root@localhost ~]#
# Install the sqlalchemy module
[root@localhost ~]# pip install sqlalchemy
Collecting pymysql
  Downloading PyMySQL-0.7.11-py2.py3-none-any.whl (78kB)
    100% |████████████████████████████████| 81kB 48kB/s 
Installing collected packages: pymysql
Successfully installed pymysql-0.7.11
[root@localhost ~]#
# Run python select3.py again and install each module it reports as missing
[root@localhost ~]# pip install pymysql
[root@localhost ~]# pip install pandas
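# Under Python 2.7 the script fails on the Chinese text in the SQL string: sql is a byte str there, so sql.encode('gb18030') first decodes it implicitly as ASCII and raises the error below. Upgrading Python is needed; Python 3.5.4 was tested and works.
[root@localhost ~]# python select3.py 
Traceback (most recent call last):
  File "select3.py", line 10, in <module>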
    sql.encode('gb18030')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 20: ordinal not in range(128)
[root@localhost ~]#

Upgrading Python 2.7.5 to Python 3.5
Leave the system Python alone; otherwise some system features (yum, for example) will stop working.

[root@localhost ~]# wget https://www.python.org/ftp/python/3.5.4/Python-3.5.4.tgz
[root@localhost ~]# yum install openssl-devel -y
[root@localhost ~]# tar -zxf Python-3.5.4.tgz
[root@localhost ~]# mkdir -p /usr/local/Python-3.5.4
[root@localhost ~]# cd Python-3.5.4
[root@localhost Python-3.5.4]# ./configure --prefix=/usr/local/Python-3.5.4/
[root@localhost Python-3.5.4]# make && make install
[root@localhost ~]# /usr/local/Python-3.5.4/bin/python3.5 -V
Python 3.5.4
[root@localhost ~]#
# Two versions of Python now coexist on the system, so the modules the script needs must be installed again for the new interpreter. Python 3.4 and later ship with pip and setuptools, so use the bundled pip3 to install them
[root@localhost ~]# /usr/local/Python-3.5.4/bin/python3.5 select3.py 
Traceback (most recent call last):
  File "select3.py", line 3, in <module>
    from sqlalchemy import create_engine
ImportError: No module named 'sqlalchemy'
[root@localhost ~]#
# Install the required Python modules
[root@localhost ~]# /usr/local/Python-3.5.4/bin/pip3.5 install sqlalchemy
[root@localhost ~]# /usr/local/Python-3.5.4/bin/pip3.5 install pymysql
[root@localhost ~]# /usr/local/Python-3.5.4/bin/pip3.5 install pandas     # this one is slow
[root@localhost ~]# /usr/local/Python-3.5.4/bin/pip3.5 install openpyxl
[root@localhost ~]# cat /etc/profile.d/Python.sh 
#!/bin/sh
Python3_5_home='/usr/local/Python-3.5.4'
export PATH=${Python3_5_home}/bin:$PATH
[root@localhost ~]# source /etc/profile.d/Python.sh
[root@localhost ~]# /usr/local/Python-3.5.4/bin/python3.5 select3.py
[root@localhost ~]# ll | grep output.xlsx 
-rw-r--r--   1 root root      4854 Oct 15 06:37 output.xlsx
[root@localhost ~]#

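If the error below appears, it can look like the SQL query timed out, but the traceback points elsewhere: the script is named select.py, so the `import select` inside the standard library's selectors.py picks up the script instead of the standard select module, and socket is left only partly initialized. Rename the script (for example to select3.py, the name used in the rest of this post) or run it from a directory that does not contain a file with that name, delete the __pycache__ folder that was generated next to it, and then run the script again.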

Traceback (most recent call last):
  File "/usr/local/Python-3.5.4/lib/python3.5/site-packages/pandas/compat/__init__.py", line 47, in <module>
    import __builtin__ as builtins
ImportError: No module named '__builtin__'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "select.py", line 4, in <module>
    import pymysql
  File "/usr/local/Python-3.5.4/lib/python3.5/site-packages/pymysql/__init__.py", line 92, in <module>
    from . import connections as _orig_conn
  File "/usr/local/Python-3.5.4/lib/python3.5/site-packages/pymysql/connections.py", line 13, in <module>
    import socket
  File "/usr/local/Python-3.5.4/lib/python3.5/socket.py", line 52, in <module>
    import os, sys, io, selectors
  File "/usr/local/Python-3.5.4/lib/python3.5/selectors.py", line 11, in <module>
    import select
  File "/data/soros/job/report/jinrongban_data_back/test/select.py", line 5, in <module>
    import pandas as pd
  File "/usr/local/Python-3.5.4/lib/python3.5/site-packages/pandas/__init__.py", line 23, in <module>
    from pandas.compat.numpy import *
  File "/usr/local/Python-3.5.4/lib/python3.5/site-packages/pandas/compat/__init__.py", line 60, in <module>
    import http.client as httplib
  File "/usr/local/Python-3.5.4/lib/python3.5/http/client.py", line 739, in <module>
    class HTTPConnection:
  File "/usr/local/Python-3.5.4/lib/python3.5/http/client.py", line 749, in HTTPConnection
    def __init__(self, host, port=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
AttributeError: module 'socket' has no attribute '_GLOBAL_DEFAULT_TIMEOUT'
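
The traceback passes through a script named select.py while the standard library is importing select, which suggests a name clash between the script and a standard module. A quick way to spot this kind of clash is sketched below. The helper name shadows_stdlib is an illustration, and sys.stdlib_module_names only exists on Python 3.10 and later, so on older interpreters the check falls back to the much shorter list of built-in modules and may miss some names.

```python
import sys
from pathlib import Path


def shadows_stdlib(script_path: str) -> bool:
    """Return True if the script's file name matches a standard-library module."""
    # Python 3.10+ exposes the full list; older versions only know the built-ins.
    known = getattr(sys, "stdlib_module_names", frozenset(sys.builtin_module_names))
    return Path(script_path).stem in known


# The path from the traceback versus the name used elsewhere in this post.
print(shadows_stdlib("/data/soros/job/report/jinrongban_data_back/test/select.py"))
print(shadows_stdlib("select3.py"))
```

On Python 3.10 or later the first call reports a clash and the second does not.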

Python script:

[root@localhost ~]# cat select3.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Export the result of a MySQL query to an .xlsx file.
import pandas as pd
from sqlalchemy import create_engine

# URL query parameters are separated with '&'; pool_timeout is an argument
# of create_engine(), not part of the URL.
engine = create_engine(
    'mysql+pymysql://username:password@192.168.1.1/database_name?charset=gbk',
    pool_timeout=3000,
)
sql = "select count(*) as '总数' from database.table1;"
df = pd.read_sql(sql, engine)
# The context manager writes the workbook on exit (openpyxl must be installed).
with pd.ExcelWriter('/root/output.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
[root@localhost ~]#
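
The connection URL is the easiest part of the script to get wrong, because the password may contain characters such as @ or / and the query parameters must be joined with &. The following is a minimal sketch of building the URL with the standard library only. The helper name build_mysql_url and the sample password are made up for illustration.

```python
from urllib.parse import quote_plus, urlencode


def build_mysql_url(user, password, host, database, **query):
    """Build a mysql+pymysql URL; extra keyword arguments become query parameters."""
    url = "mysql+pymysql://{}:{}@{}/{}".format(
        quote_plus(user), quote_plus(password), host, database
    )
    if query:
        # urlencode joins several parameters with '&'.
        url += "?" + urlencode(query)
    return url


print(build_mysql_url("username", "p@ss/word", "192.168.1.1", "database_name", charset="gbk"))
# mysql+pymysql://username:p%40ss%2Fword@192.168.1.1/database_name?charset=gbk
```

The resulting string can be passed to create_engine() in place of the hand-written URL.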

Attachments:
Python-3.5.4.tar.gz
select3.py
numpy-1.13.3-cp35-cp35m-manylinux1_x86_64.whl

This article is from "Jack Wang Blog": http://www.yfshare.vip/2020/10/16/Python-读取Mysql生成EXCEL（XLSX）/

Horizontal scaling of a k8s cluster (HPA)

2019-01-28T14:51:38.000Z

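Horizontal Pod Autoscaling (HPA) is the Kubernetes feature that scales the number of Pods horizontally and automatically.

Introduction

A K8S cluster can scale a service out or in through the scale mechanism of the Replication Controller, which makes the service elastic.
Scaling in K8S comes in two forms:

scale: manual scaling. See the post "k8s rolling upgrade (RollingUpdate)".
autoscale: automatic scaling. See HPA.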

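Automatic scaling falls into two kinds:

Horizontal scaling (scale out): increasing or decreasing the number of instances.
Vertical scaling (scale up): increasing or decreasing the resources a single instance can use, for example adding CPU or memory.

HPA belongs to the former. It automatically scales the number of Pods based on CPU utilization or application-defined custom metrics (it supports replication controller, deployment, and replica set).

There are two ways to obtain metrics:

Heapster: heapster provides the metrics service, but the v1 API (autoscaling/v1) supports only CPU as the scaling metric. Others, such as memory, network traffic, and QPS, are currently in beta (autoscaling/v2beta1).
Custom: also in beta (autoscaling/v2beta1). It requires developing a custom REST API, so it is more complex, and when data comes from custom monitoring only an absolute value can be set as the target, not a utilization percentage.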

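Workflow

1. Create the HPA resource and set the target CPU utilization and the maximum/minimum number of replicas. The Pods must declare the resource parameter requests; otherwise HPA will not work.
2. Every 30s (adjustable through --horizontal-pod-autoscaler-sync-period in kube-controller-manager.service), the controller manager queries the resource usage from metrics.
3. It compares the result with the target set at creation time (sum of the averages / limit) and computes the desired number of replicas.
4. The desired number cannot go beyond the maximum/minimum set in step 1. If it is within the bounds, the target is scaled to that number; if it exceeds the maximum, the target is scaled only up to the maximum.
5. Steps 2-4 repeat.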

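Autoscaling algorithm

The HPA Controller adjusts the number of replicas so that CPU utilization moves as close as possible to the desired value, without having to equal it exactly. The designers also took into account that a scaling decision can take a while to have an effect: for example, when Pods are under heavy CPU load, CPU usage may keep climbing while a new Pod is still being created. So for a period after each decision, no further scaling decision is made. That period is 3 minutes for scaling up and 5 minutes for scaling down (adjustable through --horizontal-pod-autoscaler-upscale-delay and --horizontal-pod-autoscaler-downscale-delay).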

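The HPA Controller has a notion of tolerance, which allows usage to fluctuate within a certain range. The default is currently 0.1, again for the sake of system stability. For example, with an HPA policy whose CPU utilization target is 50%, scaling is triggered only when utilization rises above 55% or falls below 45%, and HPA tries to keep Pod utilization within that range.
The number of Pods after each scale-up or scale-down is computed as: Ceil((currently observed utilization / user-defined target utilization) * number of Pods).
A single scale-up is capped at 2 times the current number of replicas. The controller applies a floor of 4 to that cap, so a very small deployment can more than double in one step.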
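
The rules above (tolerance, the Ceil formula, the per-step cap, and the min/max bounds) can be combined into one function. This is a minimal sketch for illustration, not the controller's actual code. The function and parameter names are made up, and the floor of 4 on the scale-up cap is an assumption based on the controller source of that period.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas,
                     tolerance=0.1, scale_up_factor=2, scale_up_floor=4):
    """Sketch of the HPA replica calculation described in the text."""
    ratio = current_util / target_util
    # Inside the tolerance band nothing changes.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(ratio * current_replicas)
    # One step may not exceed max(factor * current, floor); the floor is an assumption.
    step_cap = max(scale_up_factor * current_replicas, scale_up_floor)
    desired = min(desired, step_cap)
    # Finally clamp to the bounds configured on the HPA object.
    return max(min_replicas, min(desired, max_replicas))


print(desired_replicas(2, 52, 50, 1, 5))    # 2: 52% is within 10% of the 50% target
print(desired_replicas(2, 90, 50, 1, 10))   # 4: ceil(1.8 * 2)
print(desired_replicas(4, 10, 50, 1, 10))   # 1: ceil(0.2 * 4)
```

The cooldown windows (3 minutes up, 5 minutes down) are left out of the sketch; they only decide when this calculation is allowed to take effect.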

Environment

Role	IP	OS version
master	192.168.1.201	centos 7.4
etcd1	192.168.1.201	centos 7.4
etcd2	192.168.1.202	centos 7.4
etcd3	192.168.1.203	centos 7.4
node1	192.168.1.204	centos 7.4
node2	192.168.1.205	centos 7.4

Environment	Software version
kubectl server	v1.9.2
kubectl client	v1.9.2
Go	go1.9.2
etcdctl	3.2.15
etcd	3.2.15
flanneld	v0.10.0
cfssl	1.2.0
docker	18.09.1-beta1

[root@master ~]# kubectl cluster-info
Kubernetes master is running at https://192.168.1.201:6443
Heapster is running at https://192.168.1.201:6443/api/v1/namespaces/kube-system/services/heapster/proxy
monitoring-grafana is running at https://192.168.1.201:6443/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
monitoring-influxdb is running at https://192.168.1.201:6443/api/v1/namespaces/kube-system/services/monitoring-influxdb/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
[root@master ~]#

[root@master ~]#  kubectl -s http://192.168.1.201:8080 get componentstatuses 
NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok                   
etcd-2               Healthy   {"health": "true"}   
etcd-1               Healthy   {"health": "true"}   
scheduler            Healthy   ok                   
etcd-0               Healthy   {"health": "true"}   
[root@master ~]#

[root@master ~]# kubectl get nodes
NAME            STATUS    ROLES     AGE       VERSION
192.168.1.204   Ready     <none>    21h       v1.9.2
192.168.1.205   Ready     <none>    21h       v1.9.2
[root@master ~]#

Deploying HPA

First prepare a K8S cluster; its deployment is not covered here.

Create the nginx Deployment (the Pod application)

[root@master ~]# cat nginx.yml 
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx-hpa
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
        resources:
          requests:
            cpu: 0.01
            memory: 25Mi
          limits:
            cpu: 0.05
            memory: 60Mi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx-hpa
spec:
  selector:
    app: nginx-hpa
  type: NodePort
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 80
    nodePort: 30080
[root@master ~]#

[root@master ~]# kubectl apply -f nginx.yml
[root@master ~]# kubectl get pod -o wide
NAME                     READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-5dcf548595-bk9cr   1/1       Running   1          14h       172.30.94.2   192.168.1.205
[root@master ~]#

Create the HPA for the nginx application

[root@master ~]# cat nginx-hpa-cpu.yml 
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70
[root@master ~]#

[root@master ~]# kubectl apply -f nginx-hpa-cpu.yml
[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   <unknown> / 70%   1         5         1          14h
[root@master ~]#

Q1

At this point nginx-hpa cannot obtain the current CPU usage (TARGETS). After waiting a few minutes, kubectl describe hpa shows the following HPA errors:

[root@master ~]# kubectl describe hpa
Name:                                                  nginx-hpa
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"autoscaling/v1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"nginx-hpa","namespace":"default"},"spec":{"maxReplic...
CreationTimestamp:                                     Sat, 26 Jan 2019 22:23:08 +0800
Reference:                                             Deployment/nginx
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 70%
Min replicas:                                          1
Max replicas:                                          5
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: unable to get metrics for resource cpu: unable to fetch metrics from API: the server could not find the requested resource (get pods.metrics.k8s.io)
Events:
  Type     Reason                        Age               From                       Message
  ----     ------                        ----              ----                       -------
  Warning  FailedComputeMetricsReplicas  1m (x12 over 3m)  horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from API: the server could not find the requested resource (get pods.metrics.k8s.io)
  Warning  FailedGetResourceMetric       1m (x13 over 3m) horizontal-pod-autoscaler  unable to get metrics for resource cpu: unable to fetch metrics from API: the server could not find the requested resource (get pods.metrics.k8s.io)
[root@master ~]#

In short, HPA cannot fetch the metrics values through the API.
Fix:
Add the --horizontal-pod-autoscaler-use-rest-clients=false flag to the /etc/systemd/system/kube-controller-manager.service unit file, then restart the kube-controller-manager service.

The default value of kube-controller-manager's --horizontal-pod-autoscaler-use-rest-clients parameter is true in k8s 1.9.0, while in k8s 1.8.x it is false. Change it to false and it works.

[root@master ~]# cat /etc/systemd/system/kube-controller-manager.service 
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
[Service]
ExecStart=/usr/local/k8s/bin/kube-controller-manager \
  --address=127.0.0.1 \
  --master=http://192.168.1.201:8080 \
  --allocate-node-cidrs=true \
  --service-cluster-ip-range=172.16.0.0/16 \
  --cluster-cidr=172.30.0.0/16 \
  --cluster-name=kubernetes \
  --cluster-signing-cert-file=/etc/kubernetes/ssl/ca.pem \
  --cluster-signing-key-file=/etc/kubernetes/ssl/ca-key.pem \
  --service-account-private-key-file=/etc/kubernetes/ssl/ca-key.pem \
  --root-ca-file=/etc/kubernetes/ssl/ca.pem \
  --leader-elect=true \
  --horizontal-pod-autoscaler-use-rest-clients=false \
  --v=2
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
[root@master ~]#

[root@master ~]# systemctl daemon-reload
[root@master ~]# systemctl restart kube-controller-manager

Q2

After configuring and restarting the kube-controller-manager service, recreate the HPA with kubectl delete -f nginx-hpa-cpu.yml and kubectl apply -f nginx-hpa-cpu.yml. A new error appears:

[root@master ~]# kubectl describe hpa
Name:                                                  nginx-hpa
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"autoscaling/v1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"nginx-hpa","namespace":"default"},"spec":{"maxRepl...
CreationTimestamp:                                     Sun, 27 Jan 2019 00:18:02 +0800
Reference:                                             Deployment/nginx
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 70%
Min replicas:                                          1
Max replicas:                                          5
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: unable to get metrics for resource cpu: failed to get pod resource metrics: an error on the server ("Error: 'dial tcp 172.30.9.4:8082: getsockopt: connection timed out'\nTrying to reach: 'http://172.30.9.4:8082/apis/metrics/v1alpha1/namespaces/default/pods?labelSelector=app%3Dnginx-hpa'") has prevented the request from succeeding (get services http:heapster:)
Events:
  Type     Reason                        Age               From                       Message
  ----     ------                        ----              ----                       -------
  Warning  FailedUpdateStatus            2m                horizontal-pod-autoscaler  Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "nginx-hpa": the object has been modified; please apply your changes to the latest version and try again
  Warning  FailedGetResourceMetric       24s (x3 over 4m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: failed to get pod resource metrics: an error on the server ("Error: 'dial tcp 172.30.9.4:8082: getsockopt: connection timed out'\nTrying to reach: 'http://172.30.9.4:8082/apis/metrics/v1alpha1/namespaces/default/pods?labelSelector=app%3Dnginx-hpa'") has prevented the request from succeeding (get services http:heapster:)
  Warning  FailedComputeMetricsReplicas  24s (x3 over 4m)  horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: failed to get pod resource metrics: an error on the server ("Error: 'dial tcp 172.30.9.4:8082: getsockopt: connection timed out'\nTrying to reach: 'http://172.30.9.4:8082/apis/metrics/v1alpha1/namespaces/default/pods?labelSelector=app%3Dnginx-hpa'") has prevented the request from succeeding (get services http:heapster:)
[root@master ~]#

This means HPA cannot connect to the heapster service, so the next step is to check whether heapster itself is misbehaving.

[root@master ~]# kubectl get pod -o wide -n kube-system
NAME                                   READY     STATUS    RESTARTS   AGE         IP           NODE
heapster-6d5c495969-2rgcr              1/1       Running   2          20h         172.30.9.4   192.168.1.204
kubernetes-dashboard-cbbf9945c-bkvbk   1/1       Running   2          20h         172.30.9.3   192.168.1.204
monitoring-grafana-67d68bf9c6-zv928    1/1       Running   2          20h         172.30.9.2   192.168.1.204
monitoring-influxdb-7c4c46745f-kbxgb   1/1       Running   0             172.30.9.5   192.168.1.204
[root@master ~]#

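The kube-dashboard shows that CPU and memory information for the Pods is available through heapster, which means heapster is working normally.

Next, curl the failing URL by hand from a node. From node1 the request succeeds.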

[root@node1 ~]# curl 'http://172.30.9.4:8082/apis/metrics/v1alpha1/namespaces/default/pods?labelSelector=app%3Dnginx-hpa'
{
  "metadata": {},
  "items": [
   {
    "metadata": {
     "name": "nginx-5dcf548595-bk9cr",
     "namespace": "default",
     "creationTimestamp": "2019-01-27T07:29:43Z"
    },
    "timestamp": "2019-01-27T07:29:00Z",
    "window": "1m0s",
    "containers": [
     {
      "name": "nginx",
      "usage": {
       "cpu": "0",
       "memory": "2820Ki"
      }
     }
    ]
   }
  ]
 }
 [root@node1 ~]#

Then run the same test from kube-master: the master, where the HPA controller runs, cannot reach heapster.

[root@master ~]# curl 'http://172.30.9.4:8082/apis/metrics/v1alpha1/namespaces/default/pods?labelSelector=app%3Dnginx-hpa'
curl: (7) Failed connect to 172.30.9.4:8082; Connection timed out
[root@master ~]#

Now test the network: kube-master cannot ping the heapster Pod address.

[root@master ~]# ping 172.30.9.4
PING 172.30.9.4 (172.30.9.4) 56(84) bytes of data.
^C
--- 172.30.9.4 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1002ms
[root@master ~]# telnet 172.30.9.4 8082
Trying 172.30.9.4...
telnet: connect to address 172.30.9.4: Connection timed out
[root@master ~]#

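The tests show that the network is unreachable. The fix is to install the flannel network on kube-master.
If the flannel interface loses its IP address, restarting flannel with systemctl restart flanneld resolves it.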

[root@localhost ~]# ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33:  mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:48:f6:1d brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.201/24 brd 192.168.1.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::22d8:9dda:6705:ec09/64 scope link 
       valid_lft forever preferred_lft forever
3: flannel.1:  mtu 1450 qdisc noqueue state UNKNOWN 
    link/ether 6e:05:c0:9c:34:3f brd ff:ff:ff:ff:ff:ff
    inet 172.30.13.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::6c05:c0ff:fe9c:343f/64 scope link 
       valid_lft forever preferred_lft forever
[root@localhost ~]#
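
A quick sanity check is that the master's new flannel.1 address and the heapster Pod address both fall inside the cluster Pod network. This is a minimal sketch using the standard ipaddress module. The helper name in_pod_network is made up, and the addresses are the ones shown in this post (--cluster-cidr=172.30.0.0/16, flannel.1 at 172.30.13.0, heapster at 172.30.9.4).

```python
import ipaddress


def in_pod_network(address: str, cluster_cidr: str = "172.30.0.0/16") -> bool:
    """Return True if the address belongs to the cluster's Pod network."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cluster_cidr)


print(in_pod_network("172.30.13.0"))    # True: flannel.1 on the master
print(in_pod_network("172.30.9.4"))     # True: the heapster Pod
print(in_pod_network("192.168.1.201"))  # False: the master's host address
```

Before flannel was installed, the master only had the 192.168.1.201 host address, so it had no interface or route into the Pod network.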

Test the network from kube-master to the heapster Pod again:

[root@master ~]# ping 172.30.9.4 -c 4
PING 172.30.9.4 (172.30.9.4) 56(84) bytes of data.
64 bytes from 172.30.9.4: icmp_seq=1 ttl=63 time=2.15 ms
64 bytes from 172.30.9.4: icmp_seq=2 ttl=63 time=1.27 ms
64 bytes from 172.30.9.4: icmp_seq=3 ttl=63 time=1.30 ms
64 bytes from 172.30.9.4: icmp_seq=4 ttl=63 time=1.66 ms
--- 172.30.9.4 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 1.277/1.599/2.150/0.354 ms
[root@master ~]# telnet 172.30.9.4 8082
Trying 172.30.9.4...
telnet: connect to address 172.30.9.4: Connection refused
[root@master ~]#

Re-apply the nginx-hpa-cpu.yml file, then wait a few minutes...

[root@localhost ~]# kubectl delete -f nginx-hpa-cpu.yml 
horizontalpodautoscaler "nginx-hpa" deleted
[root@localhost ~]# 
[root@localhost ~]# kubectl apply -f nginx-hpa-cpu.yml 
horizontalpodautoscaler "nginx-hpa" created
[root@localhost ~]#

OK, HPA now connects to heapster successfully.

[root@localhost ~]# kubectl get hpa
NAME        REFERENCE          TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   0% / 70%   1         5         1          39s
[root@localhost ~]# 
[root@localhost ~]# kubectl describe hpa
Name:                                                  nginx-hpa
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"autoscaling/v1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"nginx-hpa","namespace":"default"},"spec":{"maxRepl...
CreationTimestamp:                                     Sun, 27 Jan 2019 01:04:25 +0800
Reference:                                             Deployment/nginx
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (0) / 70%
Min replicas:                                          1
Max replicas:                                          5
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound  the HPA was able to succesfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is increasing faster than the maximum scale rate
Events:           <none>
[root@localhost ~]#

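HPA test

So far, HPA is available in three API versions: autoscaling/v1, autoscaling/v2beta1, and autoscaling/v2beta2. autoscaling/v1 supports only CPU as the scaling metric; autoscaling/v2beta1 adds support for custom metrics; autoscaling/v2beta2 adds support for external metrics.
For a detailed description, see the official documentation, which notes that fetching HPA metrics from heapster is deprecated as of k8s 1.11: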

The HorizontalPodAutoscaler normally fetches metrics from a series of aggregated APIs (metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io). The metrics.k8s.io API is usually provided by metrics-server, which needs to be launched separately. See metrics-server for instructions. The HorizontalPodAutoscaler can also fetch metrics directly from Heapster.
Note:
FEATURE STATE: Kubernetes 1.11 deprecated
Fetching metrics from Heapster is deprecated as of Kubernetes 1.11.

autoscaling/v1

[root@master ~]# cat nginx-hpa-cpu.yml 
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70
[root@master ~]#

Only the CPU-based HPA is load-tested here.
Load-test command

[root@node1 ~]# cat test.sh 
while true
do
	wget -q -O- http://192.168.1.204:30080
done
[root@node1 ~]# sh test.sh

Watch the current HPA load and the Pods

[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   0% / 70%   1         5         1          14h
[root@master ~]#

[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   14% / 70%   1         5         1          14h
[root@master ~]#

When the load spikes, HPA starts creating new Pod replicas according to the defined rule (the CPU target for the Pods is 70%).

[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   180% / 70%   1         5         3          14h
[root@master ~]#
[root@master ~]# kubectl get pod -o wide
NAME                     READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-5dcf548595-bk9cr   1/1       Running   1          15h       172.30.94.2   192.168.1.205
nginx-5dcf548595-pdndb   1/1       Running   0          1m        172.30.94.4   192.168.1.205
nginx-5dcf548595-z9d6h   1/1       Running   0          1m        172.30.94.3   192.168.1.205
[root@master ~]#

As the load test continues, the number of Pod replicas keeps growing (REPLICAS goes from 3 to 5).

[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   139% / 70%   1         5         5          14h
[root@master ~]#
[root@master ~]# kubectl get pod -o wide
NAME                     READY     STATUS              RESTARTS   AGE       IP            NODE
nginx-5dcf548595-9gmqf   0/1       ContainerCreating   0          39s       <none>        192.168.1.204
nginx-5dcf548595-bk9cr   1/1       Running             1          15h       172.30.94.2   192.168.1.205
nginx-5dcf548595-pdndb   1/1       Running             0          10m       172.30.94.4   192.168.1.205
nginx-5dcf548595-r7n4b   1/1       Running             0          39s       172.30.94.5   192.168.1.205
nginx-5dcf548595-z9d6h   1/1       Running             0          10m       172.30.94.3   192.168.1.205
[root@master ~]#
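
The two jumps seen so far (1 to 3 replicas, then 3 to 5) are consistent with the Ceil formula from the algorithm section. The arithmetic below is a sketch under two assumptions: the 180% reading was taken while 1 replica was running, and the 139% reading while 3 were running. The variable names are illustrative.

```python
import math

target = 70        # targetCPUUtilizationPercentage from nginx-hpa-cpu.yml
max_replicas = 5   # maxReplicas from nginx-hpa-cpu.yml

# First step: 1 replica at 180% utilization.
first_step = min(math.ceil(1 * 180 / target), max_replicas)
# Second step: 3 replicas at 139% utilization; the formula asks for 6,
# but maxReplicas limits the result.
second_step = min(math.ceil(3 * 139 / target), max_replicas)

print(first_step)   # 3
print(second_step)  # 5
```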

Once REPLICAS reaches the defined maximum, it does not increase any further, even if CPU pressure is still high.

[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   112% / 70%   1         5         5          14h
[root@master ~]#
[root@master ~]# kubectl get pod -o wide
NAME                     READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-5dcf548595-9gmqf   1/1       Running   0          2m        172.30.9.6    192.168.1.204
nginx-5dcf548595-bk9cr   1/1       Running   1          15h       172.30.94.2   192.168.1.205
nginx-5dcf548595-pdndb   1/1       Running   0          12m       172.30.94.4   192.168.1.205
nginx-5dcf548595-r7n4b   1/1       Running   0          2m        172.30.94.5   192.168.1.205
nginx-5dcf548595-z9d6h   1/1       Running   0          12m       172.30.94.3   192.168.1.205
[root@master ~]#

Stop the load test. As the CPU load drops, HPA automatically reduces the number of Pods.

[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   40% / 70%   1         5         3          14h
[root@master ~]#
[root@master ~]# kubectl get pod -o wide
NAME                     READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-5dcf548595-pdndb   1/1       Running   0          16m       172.30.94.4   192.168.1.205
nginx-5dcf548595-r7n4b   1/1       Running   0          6m        172.30.94.5   192.168.1.205
nginx-5dcf548595-z9d6h   1/1       Running   0          16m       172.30.94.3   192.168.1.205
[root@master ~]#

The HPA keeps scaling down gradually until the minimum Pod count (MINPODS) is reached.

[root@master ~]# kubectl get hpa
NAME        REFERENCE          TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   0% / 70%   1         5         1          15h
[root@master ~]#
[root@master ~]# kubectl get pod -o wide
NAME                     READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-5dcf548595-z9d6h   1/1       Running   0          1h        172.30.94.3   192.168.1.205
[root@master ~]#

The whole process can also be watched in kube-dashboard.

The HPA's events show how it scaled up and down.

[root@master ~]# kubectl describe hpa
Name:                                                  nginx-hpa
Namespace:                                             default
Labels:                                                
Annotations:                                           kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"autoscaling/v1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"nginx-hpa","namespace":"default"},"spec":{"maxRepl...
CreationTimestamp:                                     Sun, 27 Jan 2019 01:04:25 +0800
Reference:                                             Deployment/nginx
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (0) / 70%
Min replicas:                                          1
Max replicas:                                          5
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     False   BackoffDownscale  the time since the previous scale is still within the downscale forbidden window
  ScalingActive   True    ValidMetricFound  the HPA was able to succesfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is increasing faster than the maximum scale rate
Events:
  Type    Reason             Age               From                       Message
  ----    ------             ----              ----                       -------
  Normal  SuccessfulRescale  41m (x2 over 1h)  horizontal-pod-autoscaler  New size: 5; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  29m (x2 over 1h)  horizontal-pod-autoscaler  New size: 3; reason: All metrics below target
  Normal  SuccessfulRescale  17m               horizontal-pod-autoscaler  New size: 2; reason: All metrics below target
  Normal  SuccessfulRescale  8m (x2 over 1h)   horizontal-pod-autoscaler  New size: 3; reason: cpu resource utilization (percentage of request) above target
  Normal  SuccessfulRescale  3m (x2 over 12m)  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
[root@master ~]#

autoscaling/v2beta1

autoscaling/v2beta1 adds support for custom metrics (and, as used below, memory as a resource metric).

[root@master ~]# cat nginx-hpa-v2beta1.yml 
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: memory
        targetAverageUtilization: 70
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 70
[root@master ~]#

[root@master ~]# kubectl apply -f nginx-hpa-v2beta1.yml

After a few minutes...
In the TARGETS column, the first value (10%) is memory utilization and the second (0%) is CPU utilization.

[root@master ~]# kubectl get hpa nginx-hpa
NAME        REFERENCE          TARGETS               MINPODS   MAXPODS   REPLICAS   AGE
nginx-hpa   Deployment/nginx   10% / 70%, 0% / 70%   1         5         1          51s
[root@master ~]#

[root@master ~]# kubectl describe hpa nginx-hpa
Name:                                                     nginx-hpa
Namespace:                                                default
Labels:                                                   
Annotations:                                              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"nginx-hpa","namespace":"default"},"spec":{"ma...
CreationTimestamp:                                        Mon, 28 Jan 2019 22:22:01 +0800
Reference:                                                Deployment/nginx
Metrics:                                                  ( current / target )
  resource memory on pods  (as a percentage of request):  10% (2670592) / 70%
  resource cpu on pods  (as a percentage of request):     0% (0) / 70%
Min replicas:                                             1
Max replicas:                                             5
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive   True    ValidMetricFound    the HPA was able to succesfully calculate a replica count from memory resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:           
[root@master ~]#

autoscaling/v2beta2

Testing showed that k8s 1.9.2 does not yet support the autoscaling/v2beta2 API version.

[root@master ~]# kubectl get hpa.v2beta2.autoscaling -o yaml
the server doesn't have a resource type "hpa" in group "v2beta2.autoscaling"
[root@master ~]#

References:
http://blog.51cto.com/ylw6006/2113848
https://blog.frognew.com/2017/01/kubernetes-pod-scale.html
https://k8smeetup.github.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
https://blog.csdn.net/qq_17016649/article/details/79297796
https://github.com/kubernetes/kubernetes/issues/57673

Attachment:
HPA测试配置文件.zip (HPA test configuration files)

Source: "Jack Wang Blog": http://www.yfshare.vip/2019/01/28/k8s集群水平扩展-HPA/

Loki, Grafana's log aggregation tool

2019-01-20T08:22:01.000Z

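Loki is the latest open-source project from the Grafana Labs team: a horizontally scalable, highly available, multi-tenant log aggregation system. It is designed to be very cost-effective and easy to operate, because it does not index the contents of the logs but only a set of labels for each log stream, and it is optimized for Prometheus and Kubernetes users. The project is inspired by Prometheus; it is like Prometheus, but for logs.

Introduction

A first look at Loki
Compared with other log aggregation systems, Loki has the following characteristics:

It does not do full-text indexing of logs. By storing compressed, unstructured logs and indexing only metadata, Loki is simpler to operate and cheaper to run.
It indexes and groups log streams using the same labels as Prometheus, which makes scaling and operating logs more efficient.
It is especially well suited to storing Kubernetes Pod logs; metadata such as Pod labels is automatically scraped and indexed.
It has native support in Grafana.

Loki consists of the following 3 components:

loki is the main server, responsible for storing logs and processing queries.
promtail is the agent, responsible for collecting logs and sending them to loki.
Grafana provides the UI.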

Environment

Component	Version
centos	v7.4
docker	v18.09.0
docker-compose	v1.23.2
grafana	v5.5.0
node-exporter	v0.17.0
prometheus	v2.5.0

Deployment

Install docker-compose and clone the Loki source

[root@1-206 ~]# curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
[root@1-206 ~]# yum install -y git
[root@1-206 ~]# git clone https://github.com/grafana/loki.git
[root@1-206 ~]# cd loki/production/

The official docker-compose.yaml can be run as-is with: docker-compose -f docker-compose.yaml up -d
The docker-compose.yaml used here is shown below. Its grafana image is one I built myself; the official one is image: grafana/grafana:master

[root@1-206 production]# cat docker-compose.yaml 
version: "3"
networks:
  loki:
services:
  loki:
    image: grafana/loki:master
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki
  promtail:
    image: grafana/promtail:master
    volumes:
      - /var/log:/var/log
    command: 
      -config.file=/etc/promtail/docker-config.yaml
    networks:
      - loki
      
  grafana:
    image: yfshare/grafana:5.5.0_beta1
    ports:
      - "3000:3000"
    environment:
      GF_EXPLORE_ENABLED: "true"
    networks:
      - loki
[root@1-206 production]#

[root@1-206 production]# docker-compose -f docker-compose.yaml up -d
[root@1-206 production]# docker-compose ps
        Name                       Command               State               Ports             
-----------------------------------------------------------------------------------------------
production_grafana_1    /docker-entrypoint.sh            Up      22/tcp, 0.0.0.0:3000->3000/tcp
production_loki_1       /bin/loki -config.file=/et ...   Up      0.0.0.0:3100->3100/tcp, 80/tcp
production_promtail_1   /usr/bin/promtail -config. ...   Up                                    
[root@1-206 production]#

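Trying it out

After deployment, open Grafana at http://ip:3000. The default credentials are `admin:admin`.
Loki is only supported in Grafana v5.5.0 and later.

Add Prometheus and Loki as data sources.

Browse the Loki logs by label.
Testing showed that with promtail's log path set to /var/log, logs in subdirectories such as /var/log/grafana/grafana.log are not read.

The job label varlogs used here is defined in the promtail configuration file, shown below: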

[root@1-206 production]# docker exec -it production_promtail_1 /bin/sh
/ # cat /etc/promtail/docker-config.yaml
server:
  http_listen_port: 0
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
client:
  url: http://loki:3100/api/prom/push
scrape_configs:
- job_name: system
  entry_parser: raw
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /var/log
/ #
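
The client url in the configuration above points promtail at Loki's push endpoint. As a rough sketch of what a client sends there, the snippet below only builds a request body and does not contact any server. The JSON shape (a `streams` list with a `labels` string and `entries` holding `ts` and `line`) is an assumption based on the legacy `/api/prom/push` API of that period, and `build_push_payload` is a made-up helper name; check the Loki API documentation for the version you run.

```python
import json
from datetime import datetime, timezone


def build_push_payload(labels, lines):
    """Build a JSON body in the (assumed) legacy /api/prom/push format."""
    # Labels are serialized Prometheus-style, e.g. {job="varlogs"}
    label_str = "{" + ",".join(
        '{}="{}"'.format(key, value) for key, value in sorted(labels.items())
    ) + "}"
    now = datetime.now(timezone.utc).isoformat()
    entries = [{"ts": now, "line": line} for line in lines]
    return json.dumps({"streams": [{"labels": label_str, "entries": entries}]})


payload = build_push_payload({"job": "varlogs"}, ["hello from a test client"])
print(payload)
```

The label set mirrors the `job: varlogs` label from the promtail configuration, which is what makes the stream selectable in Grafana.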

Prometheus items (metrics) can also be viewed from the Loki view.

References:
Grafana log aggregation tool Loki
Loki official site
Deploy Loki to Kubernetes
Configuring the Loki Datasource in Grafana
Loki API

Source: "Jack Wang Blog": http://www.yfshare.vip/2019/01/20/Grafana日志聚合工具Loki/

Scraping Douyin with ten lines of Python

2018-11-05T16:07:32.000Z

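Today I came across an article on using Python to scrape short videos and music from Douyin. It caught my interest, so let's see whether it actually works.

First, a look at the result:

Screenshot of the result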

Environment

Environment:
    python 3.7.1
    centos 7.4
    pip 10.0.1

Deployment

[root@localhost ~]# python3.7 --version
Python 3.7.1
[root@localhost ~]#

[root@localhost ~]# pip3 install douyin

The installation sometimes fails because of network problems; just rerun the command above until it completes.

Import the douyin module

[root@localhost ~]# python3.7
>>> import douyin
>>>

If the import fails, the douyin module was probably not installed successfully.

Now let's start scraping Douyin videos and music.

[root@localhost douyin]# python3.7 dou.py

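A few minutes later, let's look at what was scraped.
The music for each video is saved as an mp3 file, and the Douyin videos are saved as mp4 files.

Not bad.

The Python script

According to the author, the library can scrape the videos under every trending topic and piece of music on Douyin, download them, and also download each video's background music separately. In addition, all of the video metadata (publisher, like count, comment count, publish time, publish location, and so on) is scraped and stored in MongoDB.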

import douyin
from douyin.structures import Topic, Music
# Define the handlers for video download, music download, and MongoDB storage
video_file_handler = douyin.handlers.VideoFileHandler(folder='./videos')
music_file_handler = douyin.handlers.MusicFileHandler(folder='./musics')
#mongo_handler = douyin.handlers.MongoHandler()
# Define the downloader and pass the handlers to it (three when MongoDB is enabled)
#downloader = douyin.downloaders.VideoDownloader([mongo_handler, video_file_handler, music_file_handler])
downloader = douyin.downloaders.VideoDownloader([video_file_handler, music_file_handler])
# Iterate over Douyin's trending list, then download and store the results
for result in douyin.hot.trend():
    for item in result.data:
        # Scrape the videos under each trending topic or piece of music, at most 10 per topic or track.
        downloader.download(item.videos(max=10))

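I don't have MongoDB here, so the MongoDB-related lines are commented out.
The author's GitHub repository: https://github.com/Python3WebSpider/DouYin

==== The following is quoted from the author ====

Code walkthrough

The library depends on these other libraries:

aiohttp: asynchronous downloads, which speeds up downloading
dateparser: parsing dates in arbitrary formats
motor: asynchronous MongoDB storage, which speeds up storage
requests: basic HTTP requests
tqdm: progress bars

Data structure definitions
When building a library, one important step is to define structures for the key pieces of information and to encapsulate them as objects in an object-oriented way. Scraping Douyin is no exception.

Douyin has many kinds of objects, such as videos, music, topics, users, and comments, and they are connected by relationships. For example, a video uses a piece of background music, so the video and the music have a "uses" relationship; a user publishes a video, so the user and the video have a "publishes" relationship. Each object can be encapsulated in an object-oriented way. A video, for instance, can be defined like this: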

class Video(Base):
    def __init__(self, **kwargs):
        """
        init video object
        :param kwargs:
        """
        super().__init__()
        self.id = kwargs.get('id')
        self.desc = kwargs.get('desc')
        self.author = kwargs.get('author')
        self.music = kwargs.get('music')
        self.like_count = kwargs.get('like_count')
        self.comment_count = kwargs.get('comment_count')
        self.share_count = kwargs.get('share_count')
        self.hot_count = kwargs.get('hot_count')
        ...
        self.address = kwargs.get('address')
    def __repr__(self):
        """
        video to str
        :return: str
        """
        return '<Video: <%s, %s>>' % (self.id, self.desc[:10].strip() if self.desc else None)

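Here some key attributes are defined as part of the Video class, including the id, the desc description, the author, the music, and so on. author and music are not plain strings; they are separately defined data structures. For example, author is an object of type User, and User is defined as follows: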

class User(Base):
    def __init__(self, **kwargs):
        """
        init user object
        :param kwargs:
        """
        super().__init__()
        self.id = kwargs.get('id')
        self.gender = kwargs.get('gender')
        self.name = kwargs.get('name')
        self.create_time = kwargs.get('create_time')
        self.birthday = kwargs.get('birthday')
        ...
    def __repr__(self):
        """
        user to str
        :return:
        """
        return '<User: <%s, %s>>' % (self.alias, self.name)

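So by linking attributes we can tie the different objects together. This keeps the logical structure clear, and we no longer have to maintain separate dictionaries to store everything. It is similar to how Item is defined in Scrapy.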
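
As a minimal illustration of linking objects through attributes, here is a simplified sketch using dataclasses. These two classes are hypothetical stand-ins with only a few fields, not the library's actual Video and User definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: str
    name: str


@dataclass
class Video:
    id: str
    desc: str
    author: Optional[User] = None  # a nested object, not a plain string


video = Video(id="v1", desc="demo clip", author=User(id="u1", name="alice"))
# Related objects are reached through attributes instead of nested dict lookups.
print(video.author.name)
```

Compared with nested dictionaries, a misspelled attribute fails immediately with an AttributeError, and each type documents its own fields.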

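Requests and retries
There is not much to say about the scraping itself; it relies on the simplest packet-capture technique, using Charles to capture the traffic. Once captured, the API requests can be inspected and then simulated.

That raises a question: do I have to write one request method per API endpoint, and configure headers, timeouts, and so on each time? That would be tedious, so the request logic is wrapped in a separate method. Here I defined a _fetch method: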

def _fetch(url, **kwargs):
    """
    fetch api response
    :param url: fetch url
    :param kwargs: other requests params
    :return: json of response
    """
    response = requests.get(url, **kwargs)
    if response.status_code != 200:
        raise requests.ConnectionError('Expected status code 200, but got {}'.format(response.status_code))
    return response.json()

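This method has one required parameter, url. All other settings are left as kwargs, so anything can be passed in and is forwarded to the requests call. A status check is also done here: a non-200 response raises an exception, and a successful request returns the parsed JSON result.

With this method defined, other callers only need to call _fetch and no longer have to care about error handling or the return type.

Now that requests are defined, what if a request fails? The usual approach is to wrap the call in another function, count how many times _fetch has failed, and call it again. A handier option is a library called retrying, which lets us add retries with a decorator.

For example, _fetch can be decorated with retry like this: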

from retrying import retry
@retry(stop_max_attempt_number=retry_max_number, wait_random_min=retry_min_random_wait,
           wait_random_max=retry_max_random_wait, retry_on_exception=need_retry)
def _fetch(url, **kwargs):
    pass

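Four decorator parameters are used here:

stop_max_attempt_number: the maximum number of attempts; once it is reached, retrying stops
wait_random_min: the minimum random wait before the next retry (in milliseconds)
wait_random_max: the maximum random wait before the next retry (in milliseconds)
retry_on_exception: decides which exceptions trigger a retry

Here retry_on_exception is given a function called need_retry, defined as follows: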

def need_retry(exception):
    """
    need to retry
    :param exception:
    :return:
    """
    result = isinstance(exception, (requests.ConnectionError, requests.ReadTimeout))
    if result:
        print('Exception', type(exception), 'occurred, retrying...')
    return result

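Here, if the exception is a requests ConnectionError or ReadTimeout, need_retry returns True and the call is retried; any other exception is not retried.

So that gives us request wrapping plus automatic retries. Pretty Pythonic, isn't it?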
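
To make the retry mechanism concrete without installing anything, here is a standard-library sketch of the same idea. It is not the retrying package: the decorator name, its parameters, and `flaky_fetch` are all invented for this illustration, and the waits are in seconds rather than milliseconds.

```python
import functools
import random
import time


def retry(max_attempts, wait_min, wait_max, should_retry):
    """Retry the wrapped function while should_retry(exc) returns True."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-retryable errors.
                    if attempt == max_attempts or not should_retry(exc):
                        raise
                    time.sleep(random.uniform(wait_min, wait_max))
        return wrapper
    return decorator


calls = []


@retry(max_attempts=3, wait_min=0.0, wait_max=0.01,
       should_retry=lambda exc: isinstance(exc, ConnectionError))
def flaky_fetch():
    """Hypothetical stand-in for _fetch: fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return {"ok": True}


result = flaky_fetch()
print(result, "after", len(calls), "calls")
```

The `should_retry` callback plays the role of need_retry: only the exception types it accepts lead to another attempt, and everything else propagates immediately.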

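Designing the download processor
To download videos, we need a download processor that fetches the video links already scraped. Its input is batches of video links; the downloader receives them, downloads them, stores the videos in the right location, and can also perform some information-storage operations.

The design has two requirements: fast downloads and good extensibility. Each is addressed in turn:
Fast downloads. This can be done with multiple threads or processes, or with asynchronous downloads; the latter clearly has the advantage.
Extensibility. The download processor must be able to download both audio and video and also support storage backends such as a database. To keep things decoupled, video download, audio download, and database storage are split out as independent pieces, and the download processor only handles the main logic and dispatching for the video links.

For fast downloads we can use the aiohttp library. Asynchronous downloads should not fetch too much at once, though, or network load fluctuates too much, so downloads are done in batches to avoid a flood of simultaneous requests and network congestion. The main download function is: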

def download(self, inputs):
    """
    download video or video lists
    :param inputs: a generator, a list, or a single video object
    :return:
    """
    if isinstance(inputs, types.GeneratorType):
        temps = []
        for result in inputs:
            print('Processing', result, '...')
            temps.append(result)
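            if len(temps) == self.batch:
                self.process_items(temps)
                temps = []
        # flush the final partial batch, which the loop above would otherwise drop
        if temps:
            self.process_items(temps)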
    else:
        inputs = inputs if isinstance(inputs, list) else [inputs]
        self.process_items(inputs)

The download method accepts several input types: a generator, a single video object, or a list of video objects. It then calls process_items to download asynchronously, implemented as follows:

def process_items(self, objs):
    """
    process items
    :param objs: objs
    :return:
    """
    # define progress bar
    with tqdm(total=len(objs)) as self.bar:
        # init event loop
        loop = asyncio.get_event_loop()
        # get num of batches
        total_step = int(math.ceil(len(objs) / self.batch))
        # for every batch
        for step in range(total_step):
            start, end = step * self.batch, (step + 1) * self.batch
            print('Processing %d-%d of files' % (start + 1, end))
            # get batch of objs
            objs_batch = objs[start: end]
            # define tasks and run loop
            tasks = [asyncio.ensure_future(self.process_item(obj)) for obj in objs_batch]
            for task in tasks:
                task.add_done_callback(self.update_progress)
            loop.run_until_complete(asyncio.wait(tasks))

This uses asyncio for asynchronous processing, processes the video links in batches to keep traffic steady, and uses tqdm to display a progress bar.
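
The batching idea can be reduced to a few lines. The sketch below is a simplification under assumptions: `process_item` is a dummy coroutine standing in for a real download, there is no progress bar, and it uses asyncio.gather with asyncio.run (Python 3.7+) rather than the explicit event loop shown above.

```python
import asyncio


async def process_item(obj):
    """Hypothetical stand-in for a download; just yields control once."""
    await asyncio.sleep(0)
    return obj * 2


async def process_in_batches(objs, batch_size):
    """Run at most batch_size coroutines concurrently, one batch after another."""
    results = []
    for start in range(0, len(objs), batch_size):
        batch = objs[start:start + batch_size]
        results.extend(await asyncio.gather(*(process_item(o) for o in batch)))
    return results


results = asyncio.run(process_in_batches(list(range(7)), batch_size=3))
print(results)
```

Seven items with a batch size of 3 are handled as batches of 3, 3, and 1, so no more than three requests would be in flight at any time.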

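As we can see, the method that actually does the work is process_item. It calls the components for video download, audio download, and database storage. Because asyncio is used, process_item must itself be a coroutine, defined as follows: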

async def process_item(self, obj):
    """
    process item
    :param obj: single obj
    :return:
    """
    if isinstance(obj, Video):
        print('Processing', obj, '...')
        for handler in self.handlers:
            if isinstance(handler, Handler):
                await handler.process(obj)

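Here the real processing logic lives in the individual handlers. Each function is split out into its own Handler, which gives good decoupling: to add or disable a feature, we only need to configure a different set of Handlers rather than change the code. This is a decoupling idea from design patterns, similar to the factory pattern.

Handler design
As mentioned, a Handler implements one concrete function, such as video download, audio download, or data storage, so each is defined as a separate Handler. Video and audio downloads are both file downloads, so inheritance can be used to design a file-download Handler, defined as follows: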

from os.path import join, exists
from os import makedirs
from douyin.handlers import Handler
from douyin.utils.type import mime_to_ext
import aiohttp
class FileHandler(Handler):
    def __init__(self, folder):
        """
        init save folder
        :param folder:
        """
        super().__init__()
        self.folder = folder
        if not exists(self.folder):
            makedirs(self.folder)
    async def _process(self, obj, **kwargs):
        """
        download to file
        :param url: resource url
        :param name: save name
        :param kwargs:
        :return:
        """
        print('Downloading', obj, '...')
        kwargs.update({'ssl': False})
        kwargs.update({'timeout': 10})
        async with aiohttp.ClientSession() as session:
            async with session.get(obj.play_url, **kwargs) as response:
                if response.status == 200:
                    extension = mime_to_ext(response.headers.get('Content-Type'))
                    full_path = join(self.folder, '%s.%s' % (obj.id, extension))
                    with open(full_path, 'wb') as f:
                        f.write(await response.content.read())
                    print('Downloaded file to', full_path)
                else:
                    print('Cannot download %s, response status %s' % (obj.id, response.status))
    async def process(self, obj, **kwargs):
        """
        process obj
        :param obj:
        :param kwargs:
        :return:
        """
        return await self._process(obj, **kwargs)

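aiohttp is used again here because the download processor needs Handlers that support asynchronous operation. The download simply requests the file URL, determines the file type, and saves the file.

The video download Handler only needs to inherit from FileHandler: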

from douyin.handlers import FileHandler
from douyin.structures import Video
class VideoFileHandler(FileHandler):
    async def process(self, obj, **kwargs):
        """
        process video obj
        :param obj:
        :param kwargs:
        :return:
        """
        if isinstance(obj, Video):
            return await self._process(obj, **kwargs)

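This just adds a type check to make sure the data type matches; audio download works the same way.

Asynchronous MongoDB storage
The Handlers for video and audio have been covered; one more remains, the MongoDB storage Handler. We might normally use PyMongo for storage, but to speed things up we need asynchronous operation, so a library called Motor is used. Usage is much the same: the MongoDB connection object is Motor's AsyncIOMotorClient instead of PyMongo's MongoClient, and the rest of the configuration is basically similar.

Storage uses the update_one method with the upsert option enabled, which updates a record if it exists and inserts it if it does not, so data is not duplicated.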
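
The upsert semantics can be illustrated without a database. The sketch below mimics "update if present, insert if absent" on a plain dict; it is only an analogy for update_one with upsert enabled and a `$set` update, and the `upsert` function here is a made-up helper, not a PyMongo or Motor API.

```python
def upsert(collection, key, document):
    """Insert document under key if absent, otherwise merge the new fields in.

    Returns True when a new record was inserted, False when one was updated.
    """
    inserted = key not in collection
    collection.setdefault(key, {}).update(document)
    return inserted


videos = {}
first = upsert(videos, "v1", {"id": "v1", "like_count": 10})
second = upsert(videos, "v1", {"like_count": 25})
print(first, second, len(videos), videos["v1"]["like_count"])
```

Running the same scrape twice therefore refreshes the stored counters instead of creating a second record for the same video id.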

The full MongoDB storage Handler is defined as follows:

from douyin.handlers import Handler
from motor.motor_asyncio import AsyncIOMotorClient
from douyin.structures import *
class MongoHandler(Handler):
    def __init__(self, conn_uri=None, db='douyin'):
        """
        init save folder
        :param folder:
        """
        super().__init__()
        if not conn_uri:
            conn_uri = 'localhost'
        self.client = AsyncIOMotorClient(conn_uri)
        self.db = self.client[db]
    async def process(self, obj, **kwargs):
        """
        download to file
        :param url: resource url
        :param name: save name
        :param kwargs:
        :return:
        """
        collection_name = 'default'
        if isinstance(obj, Video):
            collection_name = 'videos'
        elif isinstance(obj, Music):
            collection_name = 'musics'
        collection = self.db[collection_name]
        # save to mongodb
        print('Saving', obj, 'to mongodb...')
        if await collection.update_one({'id': obj.id}, {'$set': obj.json()}, upsert=True):
            print('Saved', obj, 'to mongodb successfully')
        else:
            print('Error occurred while saving', obj)

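The class creates an AsyncIOMotorClient and exposes the conn_uri connection string and the db database name, so the MongoDB address and database name can be specified when instantiating MongoHandler.
In the same process method, update_one is awaited, which completes the asynchronous MongoDB storage.
That covers the key parts of the douyin library. It should help in understanding the core implementation, and may also be useful for design patterns, object-oriented thinking, and a few practical libraries.

References: https://github.com/Python3WebSpider/DouYin
Scraping Douyin in under 10 lines of code

Attachments:
douyin.py
Python-3.7.1.tar.xz

Source: "Jack Wang Blog": http://www.yfshare.vip/2018/11/06/Python十行代码爬抖音/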

Upgrading Python

2018-11-03T16:20:00.000Z

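Python is an interpreted, object-oriented, dynamically typed high-level programming language.

Build and install Python 3.7.1 from source with ssl module support

Environment

Environment:
    python 3.7.1
    centos 7.4

Install dependencies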

[root@localhost ~]# yum install -y openssl openssl-devel libffi-devel zlib*

Build and install Python

[root@localhost ~]# wget -c 'https://www.python.org/ftp/python/3.7.1/Python-3.7.1.tar.xz'
[root@localhost ~]# tar -xf Python-3.7.1.tar.xz
[root@localhost ~]# cd Python-3.7.1
[root@localhost Python-3.7.1]# ./configure --prefix=/usr/local/python3.7 --with-ssl
[root@localhost Python-3.7.1]# make && make install

Set environment variables

[root@localhost ~]# cat /etc/profile.d/python37.sh 
#!/bin/bash
python37="/usr/local/python3.7"
export PATH=$PATH:${python37}/bin
[root@localhost ~]# source /etc/profile.d/python37.sh

Test Python

[root@localhost ~]# python3.7 --version
Python 3.7.1
[root@localhost ~]#
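
Since the point of this build is an interpreter with working ssl support, a quick check from inside Python can confirm that the optional extension modules were actually built. This is a small sketch; the helper name and the default list of modules to probe are chosen for this post's dependencies, not taken from any official tool.

```python
import importlib


def missing_modules(names=("ssl", "_ctypes", "zlib")):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("missing modules:", missing_modules() or "none")
```

Run it with the new interpreter (python3.7 here). If ssl or _ctypes shows up as missing, install the corresponding -devel package and rebuild.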

Question

If libffi-devel is not installed, the build fails with the following error

File "/root/Python-3.7.0/Lib/ctypes/__init__.py", line 7, in <module>
    from _ctypes import Union, Structure, Array
ModuleNotFoundError: No module named '_ctypes'
make: *** [install] Error 1

If openssl is not installed, using the ssl module fails with an error

pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Collecting douyin
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/douyin/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/douyin/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/douyin/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/douyin/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.")': /simple/douyin/
  Could not fetch URL https://pypi.org/simple/douyin/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/douyin/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping
  Could not find a version that satisfies the requirement douyin (from versions: )
No matching distribution found for douyin
pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
Could not fetch URL https://pypi.org/simple/pip/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/pip/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.")) - skipping

...
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

[root@localhost ~]# python3.7 -m pip install --upgrade pip

Source: "Jack Wang Blog": http://www.yfshare.vip/2018/11/04/Python升级/

Custom keys with node_exporter

2018-09-29T14:04:34.000Z

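After deploying k8s and monitoring it with Prometheus, I ran into a question: can I define my own keys, as in Zabbix, to monitor my own services? The officially provided keys can't cover everything.

The official documentation shows that Prometheus does support custom keys.
Official git repository: https://github.com/prometheus/node_exporter

Background

Official description: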

There is varying support for collectors on each operating system. The tables below list all existing collectors and the supported systems.
Collectors are enabled by providing a --collector.<name> flag. Collectors that are enabled by default can be disabled by providing a --no-collector.<name> flag.

That is, custom keys are implemented through the collectors that node_exporter provides.
The individual collectors are not listed here; see the collector tables in the official node_exporter repository.

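Environment

Environment:
    Prometheus version 2.2.0
    node_exporter version 0.15.2
    Centos 7.4

This example monitors the HTTP status code of an nginx page, using the textfile collector

Monitoring script

Write a monitoring script similar to a Zabbix one and have it output a value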

[root@localhost ~]# cat node_nginx_testpage_status
#!/bin/bash
#Monitor nginx test page.
NGINX_PATH='/etc/nginx'
if [ -d "${NGINX_PATH}" ];then 
NGINX_PORT='80'
#NGINX_PORT=`grep -v "#"  /etc/nginx/nginx.conf | grep listen |awk -F ";" '{print $1}' |awk '{print $2}'`
/usr/bin/curl -o /dev/null --retry 1 --max-time 3 -w %{http_code} -s "http://127.0.0.1:${NGINX_PORT}"| grep -c '200'
fi
[root@localhost ~]#

[root@localhost ~]# sh node_nginx_testpage_status
1
[root@localhost ~]#

[root@localhost ~]# cat /etc/profile.d/node_exporter.sh
#!/bin/bash
node_exporter_HOME='/usr/local/node_exporter'
PATH=$node_exporter_HOME:$PATH
[root@localhost ~]#

Write a script that generates the key

[root@localhost ~]# cat zabbix_runner 
#!/bin/bash
# Runs a textfile collector.
textfile_dir=$(dirname $0)
source /etc/profile.d/node_exporter.sh
metric="$1"
shift
script="$textfile_dir/$metric"
prom_file="$textfile_dir/$metric".prom
if [[ ! -x "$script" || -d "$script" ]]; then
  echo "ERROR: Can't find script for '$metric'. Aborting."
  exit 1
fi
VALUE=`"$script" "$@"`
if [[ ! -n $VALUE ]]; then
    exit 0
    # echo "ERROR: Can't get value for '$metric'. Aborting."
    # exit 1
else
    echo "# TYPE ${metric} gauge"> "$prom_file".$$
    echo "${metric} ${VALUE}" >> "$prom_file".$$ && mv "$prom_file".$$ "$prom_file"
fi 
[root@localhost ~]#

[root@localhost ~]# chmod +x node_nginx_testpage_status
[root@localhost ~]# sh zabbix_runner node_nginx_testpage_status

The zabbix_runner script generates the file node_nginx_testpage_status.prom, which records the current state of the monitored metric

[root@localhost ~]# cat node_nginx_testpage_status.prom
# TYPE node_nginx_testpage_status gauge
node_nginx_testpage_status 1
[root@localhost ~]#
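
For reference, the write-then-rename step that zabbix_runner performs can be expressed in Python as well. This is a hedged sketch, not a replacement for the script: `write_prom_file` is an invented helper, and it writes into a throwaway temporary directory instead of the real textfile directory.

```python
import os
import tempfile


def write_prom_file(directory, metric, value):
    """Write <metric>.prom atomically: temp file first, then rename."""
    final_path = os.path.join(directory, metric + ".prom")
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=metric + ".")
    with os.fdopen(fd, "w") as f:
        f.write("# TYPE {} gauge\n".format(metric))
        f.write("{} {}\n".format(metric, value))
    # mkstemp creates the file as 0600; make it readable for node_exporter.
    os.chmod(tmp_path, 0o644)
    # Rename within the same directory so readers never see a partial file.
    os.replace(tmp_path, final_path)
    return final_path


demo_dir = tempfile.mkdtemp()
path = write_prom_file(demo_dir, "node_nginx_testpage_status", 1)
with open(path) as f:
    content = f.read()
print(content)
```

Writing to a temporary name and renaming matters because node_exporter may read the directory at any moment; a rename inside one filesystem swaps the file in a single step.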

Start node_exporter

[root@localhost ~]# /usr/local/node_exporter/node_exporter --web.listen-address=:9100 --collector.textfile.directory=/root/

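--collector.textfile.directory sets the directory that the textfile collector reads from. According to the official documentation, the textfile collector reads files ending in .prom

Open the Prometheus dashboard
http://ip:9090

First confirm that Prometheus is connected to node_exporter

Enter the key generated in node_nginx_testpage_status.prom

That completes the custom node_exporter key

Next, just put it in a scheduled job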

[root@localhost ~]# cat node_exporter_key.sh 
#!/bin/bash
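# All custom node_exporter keys are collected here
# Directory holding the monitoring scripts
scripts='/root'
# Directory holding zabbix_runner
zabbix_runner_dir='/root'
# Note: the command must be run from the monitoring script directory, otherwise it reports "ERROR: Can't get value for '$metric'. Aborting.", meaning the script file cannot be found because the script directory is wrong
# Monitor the nginx test page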
cd $scripts
/bin/bash $zabbix_runner_dir/zabbix_runner node_nginx_testpage_status
[root@localhost ~]#

[root@localhost ~]# crontab -l
# Collect the custom node_exporter keys
*/1 * * * * /bin/bash /root/node_exporter_key.sh
[root@localhost ~]#

Screenshot of the result

Attachments:
zabbix_runner
node_nginx_testpage_status
node_exporter_key.sh

Source: "Jack Wang Blog": http://www.yfshare.vip/2018/09/29/node-exporter自定义key/

Python operators

2018-08-04T11:44:54.000Z

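Precedence among the arithmetic operators: + and - have the lowest precedence; *, /, // and % are higher; unary + and - are higher still; exponentiation is the highest

Environment

Environment:
    python 2.7
    centos 7.5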

>>> print -2 * 4 + 3 ** 2
1
>>>

Arithmetic operators

The plus sign (+) concatenates strings.
If either the divisor or the dividend is a float, the result is also a float.

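Operator	Description	Example
+	Adds two objects	>>>print 'abc'+'def' abcdef >>>print 123+234 357 >>>
-	Subtracts one object from another	>>>print 123-456 -333 >>>
*	Multiplies two numbers, or returns a sequence repeated a number of times	>>>print 123*456 56088 >>>print 'abc'*2 abcabc >>>
/	Divides one number by another	>>>print 2/3 0 >>>print 2/3.0 0.666666666667 >>>print 2.0/3 0.666666666667 >>>
//	Floor division; returns the integer part of the quotient	>>>print 3//2 1 >>>print 3//2.0 1.0 >>>
%	Modulo; returns the remainder of the division	>>>print 4%3 1 >>>print 4%3.0 1.0 >>>print 4%2.0 0.0 >>>print 4%2 0 >>>
**	Exponentiation	>>>print 2**3 8 >>>print 2**3.0 8.0 >>>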
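
The table above was produced on Python 2.7. As a cross-check, here is the same set of operations in Python 3 syntax (an assumption, since the post itself targets Python 2): print is a function, and / always performs true division, so 2/3 no longer yields 0.

```python
# Python 3 syntax; variable names are only for illustration.
concat = "abc" + "def"          # string concatenation
repeated = "abc" * 2            # sequence repetition
true_div = 2 / 3                # true division in Python 3 (0 in Python 2)
floor_div = (3 // 2, 3 // 2.0)  # floor division keeps the operand type
remainder = (4 % 3, 4 % 3.0)    # modulo
power = (2 ** 3, 2 ** 3.0)      # exponentiation

print(concat, repeated)
print(true_div)
print(floor_div, remainder, power)
```

To get the Python 2 integer result for 2/3 in Python 3, use 2 // 3.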

Assignment operators

Operator	Description	Usage	Example
=	Simple assignment	a = 5, b = 3, c = a - b	>>>a = 5 >>>b = 3 >>>c = a - b >>>print 'value of c:',c value of c: 2 >>>
+=	Addition assignment	a += b is equivalent to a = a + b	>>>a = 5 >>>b = 3 >>>a += b >>>print 'value of a:',a value of a: 8 >>>
-=	Subtraction assignment	a -= b is equivalent to a = a - b	>>>a = 5 >>>b = 3 >>>a -= b >>>print 'value of a:',a value of a: 2 >>>
*=	Multiplication assignment	a *= b is equivalent to a = a * b	>>>a = 5 >>>b = 3 >>>a *= b >>>print 'value of a:',a value of a: 15 >>>
/=	Division assignment	a /= b is equivalent to a = a / b	>>>a = 5 >>>b = 3 >>>a /= b >>>print 'value of a:',a value of a: 1 >>>
//=	Floor division assignment	a //= b is equivalent to a = a // b	>>>a = 5 >>>b = 3 >>>a //= b >>>print 'value of a:',a value of a: 1 >>>
%=	Modulo assignment	a %= b is equivalent to a = a % b	>>>a = 5 >>>b = 3 >>>a %= b >>>print 'value of a:',a value of a: 2 >>>
**=	Exponentiation assignment	a **= b is equivalent to a = a ** b	>>>a = 5 >>>b = 3 >>>a **= b >>>print 'value of a:',a value of a: 125 >>>

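Python does not support C's increment and decrement operators. Because + and - are also unary operators, Python parses --n as -(-n), which yields n; likewise, ++n also yields n.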
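
A short check of this behaviour (the variable names are only for illustration):

```python
n = 5
double_minus = --n   # parsed as -(-n), so n is unchanged
double_plus = ++n    # parsed as +(+n), so n is unchanged
n += 1               # the Python way to increment
print(double_minus, double_plus, n)
```

Neither --n nor ++n modifies n; only the augmented assignment does.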

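Comparison operators

Operator	Description	Example
<	Less than. Comparison operators return 1 for true and 0 for false, which are equivalent to the special values True and False	>>>a=21 >>>b=10 >>>a<b False >>>
<=	Less than or equal to	>>>a=21 >>>b=31 >>>a<=b True >>>
>	Greater than	>>>a=21 >>>b=31 >>>a>b False >>>
>=	Greater than or equal to	>>>a=31 >>>b=21 >>>a>=b True >>>
==	Equal to	>>>a=31 >>>b=21 >>>a==b False >>>
!=	Not equal to	>>>a=31 >>>b=21 >>>a!=b True >>>
is	Tests whether two identifiers refer to the same object	>>>x = [4,5,6] >>>y = [7,8,9] >>>x is y False >>>
is not	Tests whether two identifiers refer to different objects	>>>x = [4,5,6] >>>y = [7,8,9] >>>x is not y True >>>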

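Logical operators

Truth value testing

Any object in Python can be tested for its truth value. Put another way, any Python object can be converted to a Boolean, and truth value testing is the process of obtaining that Boolean.
Only the following values are False in Python:

None
False
Zero of any numeric type, e.g. 0, 0.0, 0j
Any empty sequence, e.g. '', (), []
Any empty mapping, e.g. {}
An instance of a user-defined class that defines a __bool__() method (__nonzero__() in Python 2) or a __len__() method, when that method returns the integer 0 or the Boolean False

All other values are True, so objects of many types are always true. Truth value testing can be used in if and while conditions and as operands of Boolean operations.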
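
The rules above can be checked directly with bool(). The `Basket` class below is a hypothetical example of a user-defined class whose truth value follows its __len__().

```python
class Basket:
    """Hypothetical container whose truth value follows its length."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)


falsy = [None, False, 0, 0.0, 0j, "", (), [], {}, Basket([])]
truthy = [1, "a", [0], Basket(["apple"])]
print([bool(v) for v in falsy])
print([bool(v) for v in truthy])
```

Note that [0] is true: a list is tested by its length, not by the values it contains.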

Boolean operations

Logical operations in Python are called Boolean operations; the operators are and, or, and not.

Operator	Expression	Description	Example
and	x and y	AND: the result is True only when both Boolean values are True	>>>x = True >>>y = False >>>x and y False >>>
or	x or y	OR: the result is True as long as one of the Boolean values is True	>>>x = True >>>y = False >>>x or y True >>>
not	not x	NOT: turns True into False and False into True	>>>x = True >>>not x False >>>

Bitwise operators

Bitwise operations convert the numbers to binary and operate on the individual bits.

# Decimal on the left, binary on the right
60, 0011 1100
13, 0000 1101

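Operator	Description	Usage	Example
&	Bitwise AND of the two operands: a result bit is 1 if both corresponding bits are 1, otherwise 0	a&b is 0000 1100 in binary, 12 in decimal	>>>a=60 >>>b=13 >>>c=a & b >>>print c 12 >>>
|	Bitwise OR: a result bit is 1 if either of the two corresponding bits is 1	(a|b) gives 61, binary 0011 1101	>>>a=60 >>>b=13 >>>c=a|b >>>print c 61 >>>
^	Bitwise XOR: a result bit is 1 when the two corresponding bits differ	(a^b) gives 49, binary 0011 0001	>>>a=60 >>>b=13 >>>c=a^b >>>print c 49 >>>
~	Bitwise NOT: flips every bit, turning 1 into 0 and 0 into 1. ~x is equivalent to -x-1	(~a) gives -61, binary 1100 0011, the two's complement form of a signed binary number	>>>a=60 >>>b=~a >>>print b -61 >>>
<<	Left shift: shifts all bits of the operand left by the number of positions given on the right of <<, discarding high bits and filling low bits with 0	a << 2 gives 240, binary 1111 0000	>>>a=60 >>>b=a<<2 >>>print b 240 >>>
>>	Right shift: shifts all bits of the operand on the left of >> to the right by the number of positions given on the right of >>	a>>2 gives 15, binary 0000 1111	>>>a=60 >>>b=a>>2 >>>print b 15 >>>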
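
The values in the table can be reproduced in a few lines (Python 3 syntax, as an assumption). One point worth knowing: Python integers are arbitrary precision, so a left shift never actually discards high bits; 60 << 2 simply happens to fit in eight bits.

```python
a, b = 60, 13
# Show both operands as 8-bit binary strings
print(format(a, "08b"), format(b, "08b"))

bit_and = a & b      # 0000 1100
bit_or = a | b       # 0011 1101
bit_xor = a ^ b      # 0011 0001
bit_not = ~a         # equals -a - 1
shift_left = a << 2
shift_right = a >> 2
print(bit_and, bit_or, bit_xor, bit_not, shift_left, shift_right)
```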

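Python membership operators

Operator	Description	Usage
in	Returns True if the value is found in the given sequence, otherwise False	x in y returns True if x is in the sequence y
not in	Returns True if the value is not found in the given sequence, otherwise False	x not in y returns True if x is not in the sequence y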

#!/usr/bin/python
#-*- coding: UTF-8 -*-
a=10
b=20
list=[1,2,3,4,5]
if a in list:
   print 'a is in the given list'
else:
   print 'a is not in the given list'
if b not in list:
   print 'b is not in the given list'
else:
   print 'b is in the given list'

# python in.py
a is not in the given list
b is not in the given list
#

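Python operator precedence

Operator	Description
`**`	Exponentiation (highest precedence)
`~`,`+`,`-`	Bitwise NOT, unary plus and minus
`*`,`/`,`%`,`//`	Multiplication, division, modulo and floor division
`+`,`-`	Addition and subtraction
`>>`,`<<`	Right and left shift
`&`	Bitwise AND
`^`,`|`	Bitwise XOR, then bitwise OR (XOR binds tighter)
`<=`,`<`,`>`,`>=`	Comparison operators
`<>`,`==`,`!=`	Equality operators, same level as the comparison operators (`<>` exists only in Python 2)
`is`,`is not`	Identity operators, same level as the comparison operators
`in`,`not in`	Membership operators, same level as the comparison operators
`not`,`and`,`or`	Logical operators, in decreasing order of precedence
`=`,`%=`,`/=`,`//=`,`-=`,`+=`,`**=`,`*=`	Assignment (a statement rather than an expression operator, so it binds last)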
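
A quick way to convince yourself of the ordering is to compare an unparenthesized expression with its explicitly parenthesized form. The expressions below are examples chosen for this sketch, not taken from the original post.

```python
# Each entry: (source text, value Python computes, value with explicit parentheses)
cases = [
    ("2 + 3 * 4 ** 2", 2 + 3 * 4 ** 2, 2 + (3 * (4 ** 2))),
    ("-2 ** 2", -2 ** 2, -(2 ** 2)),
    ("1 << 2 + 1", 1 << 2 + 1, 1 << (2 + 1)),
    ("6 & 3 == 2", 6 & 3 == 2, (6 & 3) == 2),
    ("not 1 == 2", not 1 == 2, not (1 == 2)),
    ("True or False and False", True or False and False, True or (False and False)),
]

for text, implicit, explicit in cases:
    print(f"{text:26} -> {implicit!r:5} (matches parenthesized form: {implicit == explicit})")
```

When in doubt, adding the parentheses costs nothing and makes the intent obvious to the next reader.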

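Unary operators

Operators that take a single operand, also called monadic operators. They include:

Logical NOT: !
Bitwise NOT: ~
Increment and decrement: ++, --
Negation: -
Type cast: (type)
Pointer dereference and address-of: *, &
Size: sizeof, and so on

Binary operators

Operators that take two operands

Primary operators
- Subscript: []
- Member access through a pointer: -> (points to a struct member)
- Struct member access: .
Arithmetic operators
- Multiplication: *
- Division: /
- Remainder: %
- Addition: +
- Subtraction: -
Relational operators
- Equal to: ==
- Not equal to: !=
- Ordering: <, >, <=, >=
Logical operators
- Logical AND: &&
- Logical OR: ||
- Logical NOT: !
Bitwise operators
- Bitwise AND: &
- Bitwise XOR: ^
- Bitwise OR: |
- Left shift: <<
- Right shift: >>
Assignment operators
- Simple assignment: =
- Addition assignment: +=
- Subtraction assignment: -=
- Multiplication assignment: *=
- Division assignment: /=
- Modulo assignment: %=
- Left-shift assignment: <<= (shifts the variable's bits, then assigns the result back to the variable)
- Right-shift assignment: >>= (same as above)
- Bitwise AND assignment: &=; with Python's operator module, a = iand(a, b) is equivalent to a &= b
- Bitwise OR assignment: |=; a |= 2 is equivalent to a = a | 2, and a = ior(a, b) is equivalent to a |= b
- Bitwise XOR assignment: ^=; a = ixor(a, b) is equivalent to a ^= b
Comma operator: ,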

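Ternary operator

A ternary operator takes three operands. The conditional operator, an important part of the C language, is the only operator with three operands, which is why it is often simply called the ternary operator. In general it is right-associative.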
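
Python has no ?: operator; its conditional expression is written value_if_true if condition else value_if_false. The sketch below chains two of them to show the right-associative grouping. The function name sign is an illustrative choice.

```python
def sign(n):
    """Classify n using a chained conditional expression.

    The chain groups to the right, i.e. it is read as
    "negative" if n < 0 else ("zero" if n == 0 else "positive").
    """
    return "negative" if n < 0 else "zero" if n == 0 else "positive"


labels = [sign(n) for n in (-5, 0, 3)]
print(labels)
```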

This article is from "Jack Wang Blog": http://www.yfshare.vip/2018/08/04/Python运算符/

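Operations knowledge system: the operating system layer

2018-07-27T14:03:19.000Z

I have been working for several years now. Once, while listening to one of Zhao Banzhang's (赵班长) open classes, I found that the longer I listened the more lost I became, with a feeling that I knew nothing at all.. -_-!!!
Yet I had used all of these things at work, only in bits and pieces. Some of the terms are very familiar, but if you asked me about the details behind them, a blank stare is all you would get, haha.
Over the past few days I took the table that Zhao Banzhang compiled, gathered material from the web, and put together the article below. Corrections are welcome. Writing it up was also a way to review what I know. Let's keep working toward what we care about!

Operations knowledge system: the operating system layer
Reference: https://www.unixhot.com/page/ops

Architecture layer	Ops perspective	Description / key technologies	Monitoring	Automation / DevOps	Cloud computing
Operating system layer	CPU	CPU privilege levels, CPU management (process management, taskset, Intel VT-x), utilization, context switching, run queue, process scheduling, system calls	mpstat, strace	Virtualization	Public cloud elastic compute products
	Memory	Virtual memory, swap in/out, memory addressing, memory management (Buffer Cache, HugePages, ksmd, EPT)	vmstat, free
	I/O (disk)	Page faults, IOPS (sequential I/O, random I/O), I/O management (I/O schedulers, virtio), VFS	iostat, iotop
	I/O (network)	TCP/IP (three-way handshake, four-way close, state transitions, TCP queues), I/O models, Bonding, Bridge, network management (iftop, tcpdump)	iftop
	Kernel / Shell	Kernel customization, memory parameter tuning, scripting (AWK, Sed, Shell, Python, PHP, Perl, Ruby, Lua)	System monitoring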

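CPU

Reference

CPU privilege levels
1. Intel x86 processors enforce access control through ring levels. There are four: RING0, RING1, RING2 and RING3. Windows uses only two of them, RING0 and RING3. RING0 has the highest privilege and RING3 the lowest.
2. In Intel's original design, applications run in RING3 and can access only RING3 data, the operating system runs in RING0 and can access data at every level, and drivers sit in RING1 and RING2; each ring can access its own data and the data of less privileged rings. The rings were meant to separate system privileges from programs so that the OS can manage system resources better and the system is more stable.
3. Application code runs at the least privileged level, ring3, and cannot perform controlled operations. To do something such as accessing the disk or writing a file, it must make a system call. During the system call the CPU switches from ring3 to ring0 and jumps to the kernel code for that call, the kernel performs the device access on the program's behalf, and control then returns from ring0 to ring3. This is also called switching between user mode and kernel mode.
4. In practice drivers run in ring0; otherwise they could not drive the device.
Intel VT-x
1. Virtualization hits a problem here: the host operating system runs in ring0, so the guest operating system cannot. The guest still executes the same instructions, though, and without the privilege they fail. Typically, when the guest executes a privileged instruction it triggers an exception (the CPU raises one for any instruction the current level is not allowed to run); the VMM catches the exception, translates and emulates the instruction, and returns to the guest, which believes its privileged instruction worked and carries on. The performance cost is very high: what used to be a single instruction now needs a full round of exception handling.
2. Later, CPU vendors added virtualization support. A CPU with Intel VT has two modes, VMX root operation and VMX non-root operation, and both support all four levels, Ring 0 to Ring 3. The VMM runs in VMX root operation and the guest OS runs in VMX non-root operation. Because the hardware now makes the distinction, full virtualization no longer needs some of the "trap, translate, emulate" machinery.
Process management
1. A process is a running instance of a program together with all the system resources it holds, such as CPU, I/O, memory and network resources. The difference between a process and a program: a program is a set of instructions, a static description, while a process is one dynamic execution of it.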

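taskset

A Linux command that pins a program to one or more specific CPUs. See the article 性能调优攻略 (a performance tuning guide).

# taskset -p 31100
pid 31100's current affinity mask: f
# The f in the output is a hexadecimal bitmask: its four low bits are all 1 in binary, and each 1 stands for one CPU, so the process may run on 4 CPUs

# taskset -pc 2 31100
pid 31100's current affinity list: 0-3
pid 31100's new affinity list: 2
# 2 means the process will now run only on the third CPU (numbering starts at 0)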
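
The relationship between the hexadecimal mask and the CPU list can be sketched in a few lines. The helper names mask_to_cpus and cpus_to_mask are made up for this example; os.sched_getaffinity is the standard-library call for reading the current affinity and is only available on some platforms (Linux among them), hence the hasattr guard.

```python
import os


def mask_to_cpus(mask):
    """Return the CPU indexes whose bits are set in an affinity bitmask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def cpus_to_mask(cpus):
    """Build an affinity bitmask from an iterable of CPU indexes."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


print(mask_to_cpus(0xF))       # the mask "f" printed by taskset -p
print(hex(cpus_to_mask([2])))  # the mask for "CPU 2 only"

if hasattr(os, "sched_getaffinity"):
    # CPUs the current Python process is allowed to run on.
    print(sorted(os.sched_getaffinity(0)))
```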

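CPU utilization
1. Reflects how busy the CPU currently is. It fluctuates because a process that has been consuming CPU time may be about to wait on I/O but has not yet given up the CPU and entered the wait state.
2. Load average is the average number of processes that are using the CPU or waiting for it over a period of time. On Linux the figure also counts tasks in uninterruptible sleep (state D, usually waiting on disk I/O); tasks in ordinary interruptible sleep are not counted.
3. The run queue of each CPU should ideally not exceed 3, so a dual-core system should not exceed 6. If the queue stays above 3 for long periods, no process can get the CPU promptly when it becomes runnable, and a CPU upgrade may be worth considering. For a fully loaded CPU, a healthy split is roughly 65%-70% user time, 30% system time and 0%-5% idle.
4. top

Field	Description
%us	CPU time spent in user-space programs (not reniced)
%sy	CPU time spent in kernel space, mainly kernel code
%ni	CPU time spent in user-space programs whose priority was changed with nice
%id	Idle CPU time
%wa	Time the CPU spent idle while waiting for I/O
%hi	CPU time spent handling hardware interrupts
%si	CPU time spent handling software interrupts
%st	CPU time stolen from this virtual machine by the hypervisor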
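
The "no more than 3 per CPU" rule of thumb from item 3 can be turned into a small check. This is a sketch under that assumption only; the threshold of 3.0 comes from the text, and the helper names load_per_cpu and is_overloaded are invented here. os.getloadavg and os.cpu_count are standard-library calls; getloadavg is not available on every platform, so it is wrapped in a try block.

```python
import os


def load_per_cpu(load, cpu_count):
    """Normalize a load average by the number of CPUs."""
    return load / max(cpu_count, 1)


def is_overloaded(load, cpu_count, per_cpu_limit=3.0):
    """Apply the rule of thumb: more than per_cpu_limit runnable tasks per CPU."""
    return load_per_cpu(load, cpu_count) > per_cpu_limit


try:
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-minute load {one_min:.2f} on {cpus} CPUs "
          f"-> {load_per_cpu(one_min, cpus):.2f} per CPU, "
          f"overloaded: {is_overloaded(one_min, cpus)}")
except (AttributeError, OSError):
    print("load average is not available on this platform")
```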

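vmstat

Field	Description
r	Run queue (processes that are running or waiting for CPU time). When the value exceeds the number of CPUs, the CPU becomes a bottleneck. It is related to the load: as a rule of thumb a load above 3 is fairly high, above 5 is very high, and above 10 is abnormal and dangerous for the server. The load shown by top is similar to the run queue sampled over time; a long queue means the CPU is very busy and usually goes with high CPU utilization.
b	Blocked processes
swpd	Amount of virtual memory (swap) in use. A value above 0 means physical memory has been insufficient
free	Idle physical memory
buff	Buffers (mainly for block devices)
cache	Cache (cached file data)
si	Amount of memory swapped in from disk per second. A value above 0 suggests insufficient physical memory or a memory leak
so	Amount of memory swapped out to disk per second. A value above 0 suggests insufficient physical memory or a memory leak
bi	Blocks received from block devices per second (reads). Block devices are all the disks and other block devices on the system; the default block size is 1024 bytes
bo	Blocks sent to block devices per second (writes); writing files makes bo greater than 0. bi and bo are normally close to 0, otherwise I/O is too frequent
in	Interrupts per second, including the clock interrupt
cs	Context switches per second. Calling a system function and switching between threads or processes both involve a switch. The smaller this value the better; if it is too large, consider reducing the number of threads or processes. Every system call takes the code into kernel space, which is expensive.
us	User CPU time
sy	System CPU time; a high value means a lot of time is spent in system calls
id	Idle CPU time; roughly, id + us + sy = 100
wa	CPU time spent waiting for I/O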

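sar
sar takes its interval and count arguments the same way vmstat does

Command	Field	Description
sar -q 1 5	runq-sz	Run queue length (number of processes waiting to run)
	plist-sz	Number of processes and threads in the task list
	ldavg-1	System load average over the last minute
	ldavg-5	System load average over the past 5 minutes
	ldavg-15	System load average over the past 15 minutes

mpstat

Command	Field	Description
mpstat 1 5	%user	Percentage of CPU time spent running user processes
	%nice	Percentage of CPU time spent running user processes with a nice priority
	%system	Percentage of CPU time spent in the kernel
	%iowait	Percentage of time the CPU was idle while waiting for I/O
	%irq	Percentage of CPU time spent servicing hardware interrupts
	%soft	Percentage of CPU time spent servicing software interrupts
	%idle	Percentage of time the CPU was idle
	intr/s	Total number of interrupts received by the CPU per second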

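pidstat

Command	Field	Description
pidstat	%user	Percentage of CPU used by the task at user level
	%system	Percentage of CPU used by the task at kernel level
	%guest	Percentage of CPU the task spent running a virtual processor
	%CPU	Total percentage of CPU used by the task
	CPU	Number of the CPU the task is running on
	Command	Command name of the task
pidstat -r	PID	Process ID
	Minflt/s	Minor faults per second: faults that did not require loading a page from disk
	Majflt/s	Major faults per second: faults that required loading a page from disk
	VSZ	Virtual size: virtual memory used, in KB
	RSS	Resident set size: non-swapped physical memory used, in KB
	Command	Command name of the task
pidstat -d	PID	Process ID
	kB_rd/s	KB read from disk per second
	kB_wr/s	KB written to disk per second
	kB_ccwr/s	KB of disk writes cancelled by the task; this can happen when the task truncates dirty page cache
	Command	Command name of the task
pidstat -w -p PID	PID	Process ID
	Cswch/s	Voluntary context switches per second
	Nvcswch/s	Involuntary context switches per second
	Command	Command name of the task
pidstat -t -p PID	TGID	ID of the thread group leader
	TID	Thread ID
	%user	Percentage of CPU used by the task at user level
	%system	Percentage of CPU used by the task at kernel level
	%guest	Percentage of CPU the task spent running a virtual processor
	%CPU	Total percentage of CPU used by the task
	CPU	Number of the CPU the task is running on
	Command	Command name of the task
pidstat -T ALL -p PID	PID	Process ID
	Usr-ms	Milliseconds the task and its children spent at user level
	System-ms	Milliseconds the task and its children spent at system level
	Guest-ms	Milliseconds the task and its children spent running a virtual processor
	Command	Command name of the task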

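Context switching
Reference
Linux is mostly preemptive today. The CPU gives each task a certain amount of service time; when the time slice rotates, the state of the current task has to be saved and the next task loaded. This process is called a context switch. Time slicing makes it possible for several tasks to share one CPU, but saving and restoring state also costs performance.
On a preemptive operating system, the main causes of a context switch are:
- The current task has used up its time slice and the scheduler moves on to the next task as normal
- The current task blocks on I/O, so the scheduler suspends it and continues with the next task
- Several tasks compete for a lock; the current task fails to get it, is suspended by the scheduler, and the next task continues
- User code suspends the current task and gives up its CPU time
- Hardware interrupts
When monitoring a Linux application, if CPU utilization is very high but throughput will not go up, it is worth watching thread/process switches to see whether the overhead from context switching is too high. Useful commands: pidstat, vmstat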
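
Besides pidstat -w, the per-process counters can be read straight from /proc/<pid>/status on Linux, which exposes voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The parser below is a small sketch; parse_ctxt_switches is a name chosen for this example and the sample text is made-up data in the same format.

```python
def parse_ctxt_switches(status_text):
    """Extract the context-switch counters from /proc/<pid>/status content."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counts[key] = int(value.strip())
    return counts


# Made-up sample in the same "key:<tab>value" layout the kernel uses.
sample = (
    "Name:\tpython3\n"
    "voluntary_ctxt_switches:\t150\n"
    "nonvoluntary_ctxt_switches:\t12\n"
)
sample_counts = parse_ctxt_switches(sample)
print(sample_counts)

try:
    # Counters for the running interpreter itself (Linux only).
    with open("/proc/self/status") as fh:
        print(parse_ctxt_switches(fh.read()))
except OSError:
    print("/proc/self/status is not available on this system")
```

Sampling the counters twice and dividing the difference by the interval gives per-second rates comparable to what pidstat reports.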
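Run queue
Reference
1. run-queue: the number of active (running) and queued processes. Each CPU maintains a run queue, and ideally the scheduler keeps the processes in it running. A process is either sleeping or runnable. If the CPU is overloaded, the scheduler cannot keep up with demand and runnable processes fill the queue. The longer the queue, the longer programs take to execute.
2. The run queue of each CPU should ideally not exceed 3, so a dual-core system should not exceed 6. If the queue stays above 3 for long periods, no process can get the CPU promptly when it becomes runnable, and a CPU upgrade may be worth considering. For a fully loaded CPU, a healthy split is roughly 65%-70% user time, 30% system time and 0%-5% idle.
Process scheduling
Reference 1, Reference 2
1. CPU scheduling is also called process scheduling. It is divided into short-term, medium-term and long-term scheduling.
2. CPU scheduling decisions happen when a process:
  - switches from running to waiting
  - switches from running to ready
  - switches from waiting to ready
  - terminates
3. Nonpreemptive versus preemptive scheduling. With the former, a program keeps the CPU until it terminates or gives it up by blocking; the latter allows other programs to preempt the one currently running.
4. Context switching.
System calls
Reference 1, Reference 2, Reference 3
When a user-mode process makes a system call, the CPU switches from user mode to kernel mode and starts executing a kernel function. On 32-bit x86, Linux traditionally makes this transition through a programmed exception with vector 128 (0x80); newer CPUs also provide the faster sysenter and syscall instructions.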

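Memory

Virtual memory
Reference 1, Reference 2
1. Every process has its own independent memory space, and the memory spaces of different processes share a similar layout.
2. Linux memory management is page-based and uses multi-level page tables; the dynamic address translation hardware, main memory and secondary storage together implement virtual memory.
3. When a new process is created it gets its own memory space, and its data, code and so on are copied from disk into that space. Which data lives where is recorded in the task_struct in the process control table. The task_struct holds a linked list describing how the address space is allocated: which addresses hold data, which do not, which are readable and which are writable.
4. The memory already allocated to each process is mapped to corresponding space on disk.
5. On a 32-bit system a pointer is 4 bytes, so the virtual address space is 0 to 2^32, that is 0 to 4 GB; on a 64-bit system a pointer is 8 bytes, so the virtual address space is in theory 0 to 2^64, that is 0 to 16 EB. This address space is virtual and does not physically exist.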
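
The sizes in item 5 follow directly from the pointer width, as this short calculation shows. The function name address_space_bytes is introduced for the example; the result for 64 bits is the theoretical limit, and real CPUs implement fewer address bits.

```python
GIB = 2 ** 30  # bytes in one gibibyte
EIB = 2 ** 60  # bytes in one exbibyte


def address_space_bytes(pointer_bits):
    """Number of distinct byte addresses a pointer of the given width can hold."""
    return 2 ** pointer_bits


size_32 = address_space_bytes(32)
size_64 = address_space_bytes(64)

print(size_32 // GIB, "GiB")  # 32-bit address space
print(size_64 // EIB, "EiB")  # theoretical 64-bit address space
```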
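Swap in / swap out
Reference
Memory pages are divided into user pages and kernel pages.
For sizing swap, Oracle officially recommends:

RAM	Swap Space
Up to 512 MB	2 times the size of RAM
Between 1024 MB and 2048 MB	1.5 times the size of RAM
Between 2049 MB and 8192 MB	Equal to the size of RAM
More than 8192 MB	0.75 times the size of RAM

Memory addressing
Reference 1, Reference 2
Memory addressing refers to the amount of memory the CPU is able to address. Dual-channel memory is a memory control and management technique that relies on the memory controller in the chipset; in theory it doubles the bandwidth provided by two memory modules of the same specification. Computers manage memory in two basic ways: segmentation and paging.
Memory management
Reference
Memory management is the set of techniques for allocating and using memory while software runs. Its main goal is to allocate memory quickly and efficiently and to release and reclaim it at the right time.
Buffer Cache
Reference
Both cache (disk cache) and buffer (memory buffer) are temporary storage areas. The differences:
- A buffer lives mainly in RAM and serves as an area where the CPU temporarily stores data
- A cache is a high-speed storage area that can be part of main memory or of another independent storage area such as a hard disk
HugePages
Reference
HugePages is a Linux kernel feature that replaces the traditional 4K pages with larger memory pages. Advantages:
- Huge pages are not swapped
- Less pressure on the TLB
- Less page table overhead
- Better memory performance and lower CPU load
ksmd
Reference
KSM is a memory sharing mechanism in the kernel, introduced in kernel 2.6.32. It merges identical pages to reduce page redundancy. A KSM daemon in the kernel, ksmd, periodically scans the memory areas that users have registered with it, looks for identical pages, and replaces them with a single write-protected page. When a process tries to write to it, a new page is allocated automatically; this is classic COW (copy-on-write).
EPT
Reference
For virtualized environments, Intel CPUs added processor-level support for memory virtualization, namely Extended Page Tables (EPT). AMD has a similar feature called NPT.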

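I/O (disk)

Page faults
Reference
A page fault occurs when the page being accessed is not in main memory and the operating system has to bring it in before the access can proceed. At that point a memory-mapped file effectively acts as a paging file.
- A soft page fault is one where the page is already loaded in memory but is not registered with the MMU. The operating system only has to register the page's physical address in the MMU.
- A hard page fault is one where the page is not loaded in memory when the fault occurs. The operating system then has to:
  1. Find a free page, or write another in-use page to disk (if it has changed since it was last written) and remove its record from the MMU;
  2. Read the data into the chosen page
  3. Register the page with the MMU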
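
On Unix systems a process can read its own soft and hard fault counters with resource.getrusage (fields ru_minflt and ru_majflt). The sketch below touches a freshly allocated buffer and prints how the counters moved; the 4 KiB page size and the 16 MiB buffer are assumptions made for the example, and the actual numbers depend on the system.

```python
import resource


def fault_counts():
    """Return (minor, major) page-fault counts for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


minor_before, major_before = fault_counts()

# Allocate 16 MiB and write one byte per assumed 4 KiB page, so every page of
# the buffer has to be mapped. This normally shows up as soft (minor) faults.
buffer = bytearray(16 * 1024 * 1024)
for offset in range(0, len(buffer), 4096):
    buffer[offset] = 1

minor_after, major_after = fault_counts()

print("minor faults during allocation:", minor_after - minor_before)
print("major faults during allocation:", major_after - major_before)
```

Hard faults only appear when data must come from disk, for example the first read of a memory-mapped file that is not yet in the page cache.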
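IOPS
Reference 1, Reference 2
IOPS (Input/Output Operations Per Second) is the number of I/O operations (reads and writes) per second and one of the main measures of disk performance. It is the number of I/O requests the system can handle per unit of time, normally expressed per second.
- Sequential I/O means that reads and writes access data at adjacent addresses, one logical block after another. With sequential access the seek time of an HDD drops significantly, because the read/write head reaches the next block with minimal movement.
  - Workloads such as data backup and logging are sequential I/O.
- Random I/O means that reads and writes are continuous in time but the addresses accessed are not contiguous; they are scattered across the address space of the disk LUN.
  - Workloads that generate random I/O include OLTP services, SQL, instant messaging and so on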

I/O scheduling
Reference 1, Reference 2

# List the I/O schedulers supported by the running system
dmesg | grep -i scheduler
[    0.852570] io scheduler noop registered
[    0.852575] io scheduler deadline registered (default)
[    0.852614] io scheduler cfq registered
[    0.852619] io scheduler mq-deadline registered
[    0.852624] io scheduler kyber registered
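
The scheduler actually in use for a given disk is exposed in /sys/block/<device>/queue/scheduler, where the active one is shown in square brackets. The parser below is a minimal sketch; parse_scheduler is a name chosen for this example, and the sample line is illustrative rather than taken from the machine above.

```python
import glob


def parse_scheduler(line):
    """Split a sysfs scheduler line into (active, available).

    Example input: "noop [deadline] cfq"
    """
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


example = parse_scheduler("noop [deadline] cfq")
print(example)

# On a Linux host, show the scheduler of every block device that exposes one.
for path in sorted(glob.glob("/sys/block/*/queue/scheduler")):
    try:
        with open(path) as fh:
            print(path, parse_scheduler(fh.read()))
    except OSError:
        continue
```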

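I/O schedulers
- CFQ (Completely Fair Queuing I/O scheduler)
  Characteristics:
  - Kernels and distributions of the 2.6 era generally chose CFQ as the default I/O scheduler.
  - CFQ tries to spread access to I/O bandwidth evenly, avoids starving processes and achieves low latency; it is a compromise between the deadline and as schedulers.
  - CFQ is the best choice for multimedia applications (video, audio) and desktop systems.
  - CFQ assigns a priority to I/O requests. I/O priority is independent of process priority, so reads and writes from a high-priority process do not automatically inherit a high I/O priority.
  How it works:
  - CFQ creates a separate queue for every process/thread to manage the requests it generates, one queue per process. The queues are scheduled with time slices so that every process gets a fair share of I/O bandwidth. The scheduler serves 4 requests from a process at a time.
- NOOP (elevator scheduler)
  Characteristics:
  - The scheduler of Linux 2.4 and earlier, when it was the only I/O scheduling algorithm.
  - NOOP implements a simple FIFO queue and organizes I/O requests the way an elevator works: when a new request arrives, it is merged behind the nearest request so that requests to the same medium stay together.
  - NOOP tends to starve reads in favor of writes.
  - NOOP is the best choice for flash devices, RAM and embedded systems.
  Why the elevator algorithm starves reads:
  - Because write requests are easier to issue than read requests.
  - Writes go through the filesystem cache and do not have to wait for one write to finish before the next begins; write requests are merged and pile up in the I/O queue.
  - A read has to wait until all the reads in front of it have completed before the next read can start. There are a few milliseconds between reads, and the writes that arrive in those gaps starve the reads behind them.
- Deadline (deadline scheduler)
  Characteristics:
  - Sorts requests by time and by disk region; the sorting and merging are similar to the noop scheduler.
  - Deadline guarantees that a request is served within a deadline. The deadline is tunable, and by default the read deadline is shorter than the write deadline, which keeps reads from being starved by writes.
  - Deadline is the best choice for database environments (ORACLE RAC, MYSQL and so on).
- AS (anticipatory I/O scheduler)
  Characteristics:
  - Essentially the same as Deadline, but after the last read it waits 6 ms before scheduling other I/O requests.
  - It can anticipate a new read request from the application, improving reads at the expense of some writes.
  - It inserts new I/O operations in each 6 ms window and merges small write streams into one large write stream, trading write latency for maximum write throughput.
  - AS suits write-heavy environments such as file servers.
  - AS performs poorly in database environments.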
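virtio
Reference
An I/O paravirtualization solution: a general framework for virtualizing I/O devices and an abstraction over a set of common I/O devices in a paravirtualized hypervisor. It provides a communication framework and programming interface between upper-layer applications and the virtualized devices of each hypervisor (KVM, Xen, VMware and so on), which reduces cross-platform compatibility problems and greatly improves driver development efficiency.
VFS
Reference 1, Reference 2
VFS (Virtual File System), also called the Virtual Filesystem Switch, lets standard Unix system calls read and write different filesystems on different physical media; it gives all filesystems one uniform operating interface and programming API. VFS is the glue layer that lets system calls such as open(), read() and write() work without caring about the underlying storage medium or filesystem type.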

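iostat

Command	Field	Description
iostat -x	rrqm/s	Read requests merged per second for this device (when a system call needs to read data, the VFS passes the request to the filesystem; if the filesystem finds that different read requests target the same block, it merges them)
	wrqm/s	Write requests merged per second for this device
	r/s	Read requests completed per second
	w/s	Write requests completed per second
	rkB/s	Kilobytes read per second; half of rsect/s, because a sector is 512 bytes (derived value)
	wkB/s	Kilobytes written per second; half of wsect/s (derived value)
	avgrq-sz	Average size of each device I/O request, in sectors: delta(rsect+wsect)/delta(rio+wio)
	avgqu-sz	Average I/O queue length: delta(aveq)/s/1000 (aveq is in milliseconds)
	await	Average wait time of each device I/O request, in milliseconds: delta(ruse+wuse)/delta(rio+wio)
	r_await	Average time, in milliseconds, for read requests issued to the device to be served. This includes the time spent in the queue and the time spent servicing the requests.
	w_await	Average time, in milliseconds, for write requests issued to the device to be served. This includes the time spent in the queue and the time spent servicing the requests.
	svctm	Average service time of each device I/O request, in milliseconds
	%util	Percentage of each second during which the device was busy handling I/O requests (device utilization, not CPU usage)

iotop (installed with yum)
- Left/right arrows: change the sort column; the default is to sort by I/O
- r: reverse the sort order
- o: show only processes that are actually doing I/O
- p: toggle between process and thread display
- a: show accumulated usage
- q: quit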

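I/O (network)

TCP three-way handshake and four-way close
Reference

At the start, both client and server are in the CLOSED state. The side that actively opens the connection is the client; the side that passively opens it is the server.
1. The TCP server process first creates a transmission control block (TCB) and stands ready to accept connection requests from client processes. The server is now in the LISTEN state;
2. The TCP client process also creates a TCB first, then sends a connection request segment to the server. The synchronization bit in the header is set, SYN=1, and the client picks an initial sequence number seq=x. The client process enters the SYN-SENT state. TCP specifies that a SYN segment (one with SYN=1) cannot carry data but does consume one sequence number.
3. When the TCP server receives the request and agrees to connect, it sends an acknowledgment with ACK=1, SYN=1 and acknowledgment number ack=x+1, and also picks its own initial sequence number seq=y. The server process enters the SYN-RCVD state. This segment cannot carry data either, but it too consumes one sequence number.
4. When the TCP client process receives the acknowledgment, it must acknowledge the server in turn, with ACK=1, ack=y+1 and its own sequence number seq=x+1. The TCP connection is now established and the client enters the ESTABLISHED state. TCP specifies that an ACK segment may carry data; if it carries none, it consumes no sequence number.
5. When the server receives the client's acknowledgment it also enters ESTABLISHED, and the two sides can start communicating.
Once data transfer is complete, either side can release the connection. At the start, client and server are both in the ESTABLISHED state; the client then closes actively and the server closes passively.
1. The client process sends a connection release segment and stops sending data. The header has FIN=1 and sequence number seq=u (the sequence number of the last byte of data already sent, plus 1). The client enters the FIN-WAIT-1 state. TCP specifies that a FIN segment consumes one sequence number even if it carries no data.
2. The server receives the release segment and sends an acknowledgment with ACK=1, ack=u+1 and its own sequence number seq=v. The server enters the CLOSE-WAIT state and notifies the application above it that the client-to-server direction has been released. The connection is now half-closed: the client has no more data to send, but if the server sends data the client must still accept it. This state lasts for some time, namely for as long as CLOSE-WAIT lasts.
3. When the client receives the server's acknowledgment it enters the FIN-WAIT-2 state and waits for the server's connection release segment (before that, it still has to accept any final data from the server).
4. After sending its last data, the server sends the client a connection release segment with FIN=1, ACK=1 and ack=u+1. Because the server has probably sent more data while half-closed, assume its sequence number is now seq=w. The server enters the LAST-ACK state and waits for the client's acknowledgment.
5. When the client receives the server's release segment it must acknowledge it with ACK=1, ack=w+1 and its own sequence number seq=u+1. The client enters the TIME-WAIT state. Note that the TCP connection is not yet released: only after 2*MSL (maximum segment lifetime) has passed and the client has removed its TCB does it enter the CLOSED state.
6. As soon as the server receives the client's acknowledgment it enters the CLOSED state; once it has removed its TCB, this TCP connection is over. As you can see, the server finishes the TCP connection a little earlier than the client.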
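
The kernel performs the handshake and teardown on the application's behalf: connect() and accept() return once the three-way handshake has completed, and close() sends the FIN. The following sketch runs a tiny server and client over the loopback interface to show which call corresponds to which step. The function name serve_one_client and the message b"ping" are illustrative choices; the port is picked by the OS.

```python
import socket
import threading


def serve_one_client(server_sock, log):
    """Accept one connection, echo one message, then wait for the client's FIN."""
    conn, _addr = server_sock.accept()   # returns after the three-way handshake
    with conn:
        data = conn.recv(1024)
        log.append(data)
        conn.sendall(data)
        log.append(conn.recv(1024))      # b"" once the client's FIN has arrived
    # Leaving the with-block closes conn: the server sends its own FIN (LAST-ACK).


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))            # port 0: let the OS pick a free port
server.listen(1)                         # server side enters LISTEN
port = server.getsockname()[1]

server_log = []
worker = threading.Thread(target=serve_one_client, args=(server, server_log))
worker.start()

client = socket.create_connection(("127.0.0.1", port))  # SYN, SYN+ACK, ACK
client.sendall(b"ping")
reply = client.recv(1024)
client.close()                           # active close: client sends FIN first

worker.join()
server.close()
print(reply, server_log)
```

Because the client closes first here, it is the client's socket that passes through TIME-WAIT; on Linux this can be observed with ss -tan right after the script runs.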
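TCP state transitions
Reference
- Three-way handshake:
  - LISTEN: a server SOCKET is listening and ready to accept connections.
  - SYN_SENT: when a client SOCKET connects to the server, it first sends a SYN segment and enters SYN_SENT, waiting for the server's SYN+ACK segment.
  - SYN_RECV: the server has received the client's SYN segment and sent a SYN+ACK segment back, after which it enters SYN_RECV.
  - ESTABLISHED: the connection is established. A client in SYN_SENT that receives the server's SYN+ACK performs the third step of the handshake by sending an ACK and enters ESTABLISHED; a server in SYN_RECV that receives the client's ACK also enters ESTABLISHED, and the connection is up.
- Four-way close:
  - FIN_WAIT_1: the client SOCKET wants to close the connection actively, so it sends a FIN segment to the server and enters FIN_WAIT_1.
  - FIN_WAIT_2: the client has received the server's ACK and enters FIN_WAIT_2. The client's side of the connection is closed, but the server's side is not, so the server can still send data to the client.
  - CLOSING: both sides happen to be closing at the same time. The client sent a FIN and entered FIN_WAIT_1, but instead of the server's ACK it received the server's FIN, which means client and server started closing simultaneously; the client then enters CLOSING.
  - TIME_WAIT: the client has received the server's FIN and sent its ACK; it enters TIME_WAIT and moves to CLOSED after waiting 2MSL.
  - CLOSE_WAIT: the server has received the client's FIN and sent an ACK, after which it enters CLOSE_WAIT.
  - LAST_ACK: the passively closing server has sent its own FIN to the client and enters LAST_ACK; once it receives the client's ACK it enters CLOSED.
Why TIME_WAIT has to last 2MSL before returning to CLOSED:
On an unreliable network there is no guarantee that the client's final ACK reaches the server. A server in LAST_ACK may therefore time out without the ACK and resend its FIN. TIME_WAIT exists so that the client can resend an ACK that may have been lost.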
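TCP queues
Reference
- Half-open (SYN) queue: holds connections in the SYN_RECV state. Its length is set by net.ipv4.tcp_max_syn_backlog
- accept queue: holds connections in the ESTABLISHED state. Its length is min(net.core.somaxconn, backlog), where backlog is the parameter given when creating ServerSocket(int port, int backlog); it is ultimately passed to the listen call.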
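
The min(net.core.somaxconn, backlog) rule means a large backlog passed to listen() has no effect until the sysctl is raised as well. A minimal sketch, with helper names effective_accept_backlog and read_somaxconn invented for this example:

```python
def effective_accept_backlog(requested_backlog, somaxconn):
    """Accept-queue length the kernel will use: the smaller of the two limits."""
    return min(requested_backlog, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read net.core.somaxconn on Linux; return None where it is unavailable."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


print(effective_accept_backlog(1024, 128))  # capped by somaxconn
print(effective_accept_backlog(50, 128))    # capped by the application

current = read_somaxconn()
if current is not None:
    print("this host: listen(1024) is limited to", effective_accept_backlog(1024, current))
```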
I/O models
Reference
- blocking IO
- nonblocking IO
- IO multiplexing
- signal driven IO (rarely used)
- asynchronous IO
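
As an illustration of the third model, the standard-library selectors module wraps the platform's multiplexing call (epoll, kqueue or select) behind one interface. The sketch below registers one end of a socket pair, shows that select() reports nothing while no data is pending, and then reports the socket as readable after the other end writes. The variable names and the message are illustrative.

```python
import selectors
import socket

selector = selectors.DefaultSelector()
left, right = socket.socketpair()
left.setblocking(False)
right.setblocking(False)

# Ask to be told when "right" has data to read.
selector.register(right, selectors.EVENT_READ)

# Nothing has been written yet, so a zero-timeout poll reports no ready sockets.
idle_events = selector.select(timeout=0)

left.send(b"hello")

# Now the selector reports "right" as readable, and recv() will not block.
ready_events = selector.select(timeout=1)
messages = [key.fileobj.recv(1024) for key, _mask in ready_events]

selector.unregister(right)
selector.close()
left.close()
right.close()

print(idle_events, messages)
```

One thread can watch many sockets this way, which is the point of multiplexing compared with one blocking thread per connection.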
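NIC bonding modes
Reference
- Mode 0 (balance-rr): round-robin policy; provides load balancing and fault tolerance
- Mode 1 (active-backup): active-backup policy; only one NIC in the bond is active and the others stand by
- Mode 2 (balance-xor): XOR policy; the slave NIC is chosen by XOR-ing the source and destination MAC addresses
- Mode 3 (broadcast): broadcast; every packet is transmitted on all NICs
- Mode 4 (802.3ad): IEEE 802.3ad dynamic link aggregation; creates aggregation groups that share the same speed and duplex settings
- Mode 5 (balance-tlb): adaptive transmit load balancing
- Mode 6 (balance-alb): adaptive load balancing
Bridge
Reference
A computer usually has a system bus that connects all of its internal hardware devices. A typical system bus is PCI (Peripheral Component Interconnect). Other widely used buses include ISA, EISA, MCA, SCSI and USB.
A computer has several buses of different types, linked together by bridges. Two high-speed buses handle data transfers to and from the memory chips:
1. Front-side bus (FSB): connects the CPU to the RAM controller
2. Back-side bus: connects the CPU to the external hardware cache
Network management
- iftop (installed with yum)
  - TX: traffic sent
  - RX: traffic received
  - TOTAL: total traffic
  - Cumm: total traffic since iftop was started
  - peak: peak traffic
  - rates: average traffic over the past 2 s, 10 s and 40 s
- tcpdump
  Reference
  - Capture packets on the loopback interface: tcpdump -i lo
  - Avoid truncating packets: tcpdump -s0
  - Show hosts and ports as numbers: tcpdump -n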

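Kernel / Shell

Kernel customization
Reference 1, Reference 2, Reference 3
Linux boot sequence: POST (BIOS) --> if there are several disks, choose the boot disk in the BIOS --> load the MBR (bootloader) --> load the initrd file --> run the init process --> show the welcome screen
Purpose and benefits of customizing a Linux system:
- A smaller system
- Better real-time behavior
- Support for special hardware
- Higher system reliability

Linux kernel parameter tuning
Reference

sysctl -a lists all system variables
The kernel files under /proc/sys correspond one to one with the variables in the configuration file sysctl.conf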
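
On Linux the correspondence is mechanical: each dot in a sysctl name becomes a directory separator under /proc/sys. The helpers below sketch that mapping; the names sysctl_to_proc_path and read_sysctl are made up for this example, and the simple replace does not handle the rare case of network interface names that themselves contain dots.

```python
def sysctl_to_proc_path(name):
    """Map a sysctl variable name to its file under /proc/sys."""
    return "/proc/sys/" + name.replace(".", "/")


def read_sysctl(name):
    """Return the current value as text, or None if the file cannot be read."""
    try:
        with open(sysctl_to_proc_path(name)) as fh:
            return fh.read().strip()
    except OSError:
        return None


for variable in ("net.ipv4.tcp_syncookies", "net.core.somaxconn"):
    print(variable, "->", sysctl_to_proc_path(variable), "=", read_sysctl(variable))
```

Writing a value into the same file (as root) changes the setting immediately, while an entry in sysctl.conf makes it persistent across reboots.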

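# Note: the names in this list (net.inet.*, kern.*, security.bsd.*) are FreeBSD sysctl variables; the Linux equivalents are the net.ipv4.* and net.core.* entries in the reference values further down
# Maximum buffer space for outgoing TCP data
net.inet.tcp.sendspace=65536
# Maximum buffer space for incoming TCP data
net.inet.tcp.recvspace=65536
# Maximum UDP send buffer size
net.inet.udp.sendspace=65535
# Maximum size of an outgoing UDP datagram
net.inet.udp.maxdgram=65535
# Send space for local socket connections
net.local.stream.sendspace=65535
# Protocol extensions that improve network performance
net.inet.tcp.rfc1323=1
net.inet.tcp.rfc1644=1
net.inet.tcp.rfc3042=1
net.inet.tcp.rfc3390=1
# Maximum socket buffer size
kern.ipc.maxsockbuf=2097152
# Maximum number of files allowed in the system
kern.maxfiles=65536
# Maximum number of files each process can have open at once
kern.maxfilesperproc=32768
# When a host initiates a TCP connection, the system answers with an ACK packet. This option controls whether the ACK is delayed and sent together with a packet that carries data. On a fast network under low load this improves performance slightly, but on a poor connection the peer gets no answer and keeps retrying the connection request, which lowers performance
net.inet.tcp.delayed_ack=0
# Disable ICMP redirects
net.inet.icmp.drop_redirect=1
net.inet.icmp.log_redirect=1
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0
# Guard against ICMP broadcast storms
net.inet.icmp.bmcastecho=0
net.inet.icmp.maskrepl=0
# Limit the rate at which the system sends ICMP
net.inet.icmp.icmplim=100
# Security settings; available only if the kernel was built with options TCP_DROP_SYNFIN
net.inet.icmp.icmplim_output=0
net.inet.tcp.drop_synfin=1
# Setting this to 1 helps the system clear TCP connections that were not closed properly. It uses a little more bandwidth, but dead connections are eventually detected and removed. Dead TCP connections are a particular problem on systems accessed by dial-up users, who often disconnect the modem without closing active connections properly
net.inet.tcp.always_keepalive=1
# If net.inet.ip.intr_queue_drops keeps growing, raise net.inet.ip.intr_queue_maxlen; a drop count of 0 is best
net.inet.ip.intr_queue_maxlen=1000
# Protection against DoS attacks; the default is 30000
net.inet.tcp.msl=7500
# Drop every packet that arrives for a closed port; with a value of 1 this applies only to TCP packets
net.inet.tcp.blackhole=2
# Drop every UDP packet that arrives for a closed port
net.inet.udp.blackhole=1
# Provide buffering for network data connections
net.inet.tcp.inflight.enable=1
# If enabled, after the first successful forward to a destination its data is recorded in the routing table and ARP table, saving route computation time, but a lot of kernel memory is needed to hold the routing table
net.inet.ip.fastforwarding=0
# Requires a kernel built with options POLLING; use under high load, not recommended under low load; SMP cannot be combined with polling
#kern.polling.enable=1
# Number of concurrent connections; the default is 128 and 1024-4096 is recommended; larger values use more memory
kern.ipc.somaxconn=32768
# Prevent users from seeing other users' processes
security.bsd.see_other_uids=0
# Kernel security level
kern.securelevel=0
# Log every TCP connection attempt
net.inet.tcp.log_in_vain=1
# Log every UDP connection attempt
net.inet.udp.log_in_vain=1
# Protect against attacks using malformed UDP packets
net.inet.udp.checksum=1
# Protection against DoS attacks
net.inet.tcp.syncookies=1
# Back shared memory with physical memory only; needs more than 256 MB of RAM
kern.ipc.shm_use_phys=1
# Maximum size of a shared memory segment, in bytes
kern.ipc.shmmax=67108864
# Maximum total amount of shared memory, in pages
kern.ipc.shmall=32768
# Do not write a core dump when a program crashes
kern.coredump=0
# Receive and send space for local (lo) data streams
net.local.stream.recvspace=65536
net.local.dgram.maxdgram=16384
net.local.dgram.recvspace=65536
# Default segment size of a packet; 1452 for ADSL
net.inet.tcp.mssdflt=1460
# Provide buffering for network data connections
net.inet.tcp.inflight_enable=1
# Minimum segment size of a packet; 1452 for ADSL
net.inet.tcp.minmss=1460
# Maximum size of a raw datagram
net.inet.raw.maxdgram=65536
# Receive space for raw sockets
net.inet.raw.recvspace=65536
# Number of dynamic ipfw firewall rules; the default is 4096. Raising it keeps some viruses that open huge numbers of TCP connections from blocking normal connections
net.inet.ip.fw.dyn_max=65535
# Idle timeout that the ipf firewall keeps for TCP connections; default 8640000 (120 hours)
net.inet.ipf.fr_tcpidletimeout=864000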

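Reference values (adjust to the hardware of the actual system)

/proc/sys/net/core/wmem_max
# Maximum socket write buffer; suggested tuned value: 873200
/proc/sys/net/core/rmem_max
# Maximum socket read buffer; suggested tuned value: 873200
/proc/sys/net/ipv4/tcp_wmem
# TCP write buffer; suggested tuned values: 8192 436600 873200
/proc/sys/net/ipv4/tcp_rmem
# TCP read buffer; suggested tuned values: 32768 436600 873200
/proc/sys/net/ipv4/tcp_mem
# Also three values, meaning:
# net.ipv4.tcp_mem[0]: below this value TCP is under no memory pressure.
# net.ipv4.tcp_mem[1]: at this value TCP enters the memory pressure stage.
# net.ipv4.tcp_mem[2]: above this value TCP refuses to allocate sockets.
# These values are in pages, not bytes. Suggested tuned values: 786432 1048576 1572864
/proc/sys/net/core/netdev_max_backlog
# Maximum device queue for incoming packets. The default on older kernels is 300, which is too low for a heavily loaded server; it can be raised to 1000.
/proc/sys/net/core/somaxconn
# Default limit for listen(): the maximum number of pending requests. The default is 128. On a busy server raising it helps network performance; it can be set to 256.
/proc/sys/net/core/optmem_max
# Maximum initial size of a socket buffer; default 10K.
/proc/sys/net/ipv4/tcp_max_syn_backlog
# Maximum queue of incoming SYN requests. Default 1024. Raising it clearly helps a heavily loaded server; it can be set to 2048.
/proc/sys/net/ipv4/tcp_retries2
# Number of TCP retransmissions before giving up. The default of 15 means 15 retransmissions before the connection is abandoned. It can be lowered to 5 to release kernel resources sooner.
/proc/sys/net/ipv4/tcp_keepalive_time
/proc/sys/net/ipv4/tcp_keepalive_intvl
/proc/sys/net/ipv4/tcp_keepalive_probes
# These three parameters control TCP KeepAlive. The defaults are:
# tcp_keepalive_time = 7200 seconds (2 hours)
# tcp_keepalive_probes = 9
# tcp_keepalive_intvl = 75 seconds
# That is, the kernel starts probing only after a TCP connection has been idle for 2 hours, and gives up and treats the connection as dead only after 9 failed probes (75 seconds apart). For a server these values are clearly too large. They can be changed to:
# /proc/sys/net/ipv4/tcp_keepalive_time 1800
# /proc/sys/net/ipv4/tcp_keepalive_intvl 30
# /proc/sys/net/ipv4/tcp_keepalive_probes 3
/proc/sys/net/ipv4/ip_local_port_range
# The local port range. The default, 32768 61000, is already large enough.
net.ipv4.tcp_syncookies = 1
# Enables SYN cookies. When the SYN queue overflows, cookies are used to handle requests, which fends off small SYN floods. The default is 0 (off)
net.ipv4.tcp_tw_reuse = 1
# Enables reuse: TIME-WAIT sockets may be reused for new TCP connections. The default is 0 (off)
net.ipv4.tcp_tw_recycle = 1
# Enables fast recycling of TIME-WAIT sockets. The default is 0 (off). This option breaks connections from clients behind NAT and was removed in Linux 4.12
net.ipv4.tcp_fin_timeout = 30
# If the socket is closed by the local end, this determines how long it stays in the FIN-WAIT-2 state
net.ipv4.tcp_keepalive_time = 1200
# How often TCP sends keepalive messages when keepalive is enabled. The default is 2 hours; changed to 20 minutes
net.ipv4.ip_local_port_range = 1024 65000
# The port range for outgoing connections. The default is small, 32768 to 61000; changed to 1024 to 65000
net.ipv4.tcp_max_syn_backlog = 8192
# Length of the SYN queue. The default is 1024; raising it to 8192 accommodates more connections waiting to be established
net.ipv4.tcp_max_tw_buckets = 5000
# Maximum number of TIME_WAIT sockets the system keeps at once. Beyond this number, TIME_WAIT sockets are cleared immediately and a warning is printed. The default is 180000; changed to 5000. For servers such as Apache and Nginx the parameters in the lines above reduce the number of TIME_WAIT sockets nicely, but for Squid they do little. This parameter caps the number of TIME_WAIT sockets and keeps a Squid server from being dragged down by them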
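
To see what the three keepalive parameters above mean in practice, the time needed to declare an idle, dead peer can be computed as the idle time plus one interval per failed probe. The function name keepalive_detection_seconds is introduced for this sketch; the numbers are the defaults and the suggested values quoted in the list.

```python
def keepalive_detection_seconds(idle_time, probe_interval, probe_count):
    """Seconds from the last traffic until the kernel gives up on a dead peer."""
    return idle_time + probe_interval * probe_count


default_delay = keepalive_detection_seconds(7200, 75, 9)
tuned_delay = keepalive_detection_seconds(1800, 30, 3)

for label, seconds in (("default", default_delay), ("tuned", tuned_delay)):
    minutes, rest = divmod(seconds, 60)
    print(f"{label}: {seconds} s ({minutes} min {rest} s)")
```

With the defaults a dead connection lingers for a little over two hours; the suggested values bring that down to about half an hour.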

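Scripting
- awk reference
- sed reference
- shell reference
- python reference
- php reference
- perl reference
- ruby reference
- lua reference
System monitoring
At work I use Zabbix for system monitoring; see the official zabbix website

This article is from "Jack Wang Blog": http://www.yfshare.vip/2018/07/27/运维知识体系之操作系统层/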

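Deploying projects with Jenkins

2018-07-04T14:56:08.000Z

I have set up the Jenkins continuous integration tool quite a few times but rarely share this kind of document. Today I took the time to tidy up my notes and share them, haha

Recall our previous release process:

Package the project locally
Upload it to the server with FTP or a similar tool
Stop tomcat and put the packaged project into the webapps directory
Start tomcat
Doing this every time is slow and error-prone

With Jenkins, the release process becomes:

A user clicks a button in jenkins to release
jenkins receives the release command
jenkins downloads the source code from Git
jenkins builds the package with the mvn command you configured
jenkins transfers the packaged war/jar to tomcat's webapps directory
tomcat starts

Developers only need to do two things:

Configure a project's deployment process in jenkins
Click the project's deploy button in jenkins for one-click deployment

Environment

Environment:
　　　Centos 7.5
　　　nexus 3.12.0
　　　jenkins 2.128
　　　maven 3.5.3

Item	IP	Notes
jenkins	192.168.1.60	Deployed with docker
maven	192.168.1.60	Integrated with jenkins
nexus3_oss	192.168.1.61	Deployed with docker

Environment deployment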

Install docker

[root@localhost ~]# yum install -y https://mirrors.aliyun.com/epel/epel-release-latest-7.noarch.rpm
[root@localhost ~]# yum install -y docker-ce

docker-compose

jenkins

[root@localhost ~]# cat jenkins-docker-compose.yml 
version: '3'
services:
  jenkins:
    image: yfshare/jenkins:2.128
    container_name: jenkins
    ports:
      - 8080:8080
      - 50000:50000
    hostname: jenkins.example.com
    environment:
      JAVA_OPTS: '-Xms1500m -Xmx1500m -XX:MaxPermSize=512m -Djava.awt.headless=true'
    extra_hosts:
      jenkins.example.com: 127.0.0.1
    volumes:
      - /data/docker_mount/jenkins_home:/var/jenkins_home
      - /data/docker_mount/maven_repository:/data/maven/repository
      - /etc/localtime:/etc/localtime:ro
    restart: always
[root@localhost ~]#

nexus3_oss

[root@localhost ~]# cat nexus3-docker-compose.yml
version: '3'
services:
  nexus3:
    image: sonatype/nexus3:3.12.0
    container_name: nexus3
    ports:
      - 8081:8081
    volumes:
      - /data/docker_mount/nexus-data:/nexus-data
      - /etc/localtime:/etc/localtime:ro
    restart: always
[root@localhost ~]#

Deploy the applications

[root@localhost ~]# docker-compose -f jenkins-docker-compose.yml up -d
[root@localhost ~]# docker-compose -f nexus3-docker-compose.yml up -d

[root@localhost ~]# docker ps -a
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS                  PORTS                                              NAMES
56e49974e39d        yfshare/jenkins:2.128     "/sbin/tini -- /usr/…"   2 days ago          Up 42 hours             0.0.0.0:8080->8080/tcp, 0.0.0.0:50000->50000/tcp   jenkins
[root@localhost ~]#

[root@localhost ~]# docker exec -it jenkins /bin/bash
root@jenkins:/# ps -ef |grep -i jenkins | grep -iv grep
root         1     0  0 Jul05 ?        00:00:01 /sbin/tini -- /usr/local/bin/jenkins.sh
root         5     1  0 Jul05 ?        00:02:04 java -Duser.home=/var/jenkins_home -Xms1500m -Xmx1500m -XX:MaxPermSize=512m -Djava.awt.headless=true -jar /usr/share/jenkins/jenkins.war
root@jenkins:/#

[root@localhost ~]# docker ps -a
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                    NAMES
28751b5a7cdc        sonatype/nexus3:3.12.0   "sh -c ${SONATYPE_DI…"   20 hours ago        Up 19 hours         0.0.0.0:8081->8081/tcp   nexus3
[root@localhost ~]#

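Accessing the dashboards

Open nexus3_oss at http://192.168.1.61:8081
The default account and password are admin/admin123

Online - Remote Available means the nexus repository is downloading public dependencies from the remote central repository
Online - Ready to Connect means the nexus repository currently has no download task (from the remote central repository)

Open jenkins at http://192.168.1.60:8080
The login password can be found in the logs; it is also needed when initializing jenkins

Install the Maven Integration plugin in jenkins to configure global Maven options
"Manage Jenkins" -- "Configure System" -- "Maven Project Configuration"
-Xms1024m -Xmx1024m -XX:MaxPermSize=512m

Building the project

After logging in to jenkins, click "New Item" and configure the job

Add an account in Git; Jenkins uses it to clone the code for the build

Once Jenkins is configured, the project can be built

Q & A

All sorts of dependency problems come up during a build, and you need to ask the developers for the missing dependency packages
Some of the errors encountered:

Dependencies downloading normally from the private Maven repository (nexus)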

[INFO] Scanning for projects...
Downloading from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/org/springframework/boot/spring-boot-starter-parent/1.5.1.RELEASE/spring-boot-starter-parent-1.5.1.RELEASE.pom
Downloaded from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/org/springframework/boot/spring-boot-starter-parent/1.5.1.RELEASE/spring-boot-starter-parent-1.5.1.RELEASE.pom (7.4 kB at 986 B/s)
Downloading from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/org/springframework/boot/spring-boot-dependencies/1.5.1.RELEASE/spring-boot-dependencies-1.5.1.RELEASE.pom
Downloaded from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/org/springframework/boot/spring-boot-dependencies/1.5.1.RELEASE/spring-boot-dependencies-1.5.1.RELEASE.pom (89 kB at 12 kB/s)
Downloading from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/com/fasterxml/jackson/jackson-bom/2.8.6/jackson-bom-2.8.6.pom
Downloaded from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/com/fasterxml/jackson/jackson-bom/2.8.6/jackson-bom-2.8.6.pom (10 kB at 7.8 kB/s)

A dependency error during the build

# mvn -X package -P pre clean install -Dmaven.test.skip=true
[INFO] Scanning for projects...
[INFO] 
[INFO] ---------------------< com.recycle:recycle-common >---------------------
[INFO] Building recycle-common 1.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[WARNING] The POM for fakepath:json-lib:jar:jdk15:2.4 is missing, no dependency information available
Downloading from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/ctc-smscloud/jsonhttp/1.0/jsonhttp-1.0.pom
Downloaded from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/ctc-smscloud/jsonhttp/1.0/jsonhttp-1.0.pom (391 B at 1.8 kB/s)
Downloading from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/ctc-smscloud/jsonhttp/1.0/jsonhttp-1.0.jar
Downloaded from deploymentRepo: http://192.168.1.61:8081/repository/maven-public/ctc-smscloud/jsonhttp/1.0/jsonhttp-1.0.jar (5.1 kB at 212 kB/s)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.857 s
[INFO] Finished at: 2018-06-28T09:58:07Z
[INFO] ------------------------------------------------------------------------
[WARNING] The requested profile "pre" could not be activated because it does not exist.
[ERROR] Failed to execute goal on project recycle-common: Could not resolve dependencies for project com.recycle:recycle-common:jar:1.0-SNAPSHOT: Failure to find fakepath:json-lib:jar:jdk15:2.4 in http://192.168.1.61:8081/repository/maven-public/ was cached in the local repository, resolution will not be reattempted until the update interval of deploymentRepo has elapsed or updates are forced -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

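Ask the developers for the missing dependency and upload it to the nexus repository

Command for uploading to nexus
Maven must be installed and settings.xml (maven/conf/settings.xml) configured first
For the settings.xml configuration, see the reference

From the error we can see that the dependency json-lib-2.4-jdk15.jar is missing, and that its path in the repository is fakepath/json-lib/2.4/json-lib-2.4-jdk15.jar
A manual upload to nexus has to spell out the jar's coordinates through the following parameters:

-Dclassifier: omit this parameter for jars that have no classifier
-DgroupId: the directory path of the jar; a multi-level path is written with dots, each dot standing for one directory level
-Durl: the nexus repository URL to upload to
-DrepositoryId: deploymentRepo is the id of the server entry in Maven's settings.xml that holds the credentials for the nexus repository; see settings.xml for details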

mvn deploy:deploy-file -DgroupId=fakepath -DartifactId=json-lib -Dversion=2.4 -Dclassifier=jdk15 -Dpackaging=jar -Dfile=/data/code/json-lib-2.4-jdk15.jar -Durl=http://192.168.1.61:8081/repository/maven-releases/ -DrepositoryId="deploymentRepo"
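
How the coordinates passed to deploy:deploy-file turn into the repository path quoted above can be sketched as follows. The function artifact_path is a hypothetical helper written for this explanation, not part of Maven; it follows the standard repository layout of groupId directories, artifactId, version, and a file name built from artifactId, version and the optional classifier.

```python
def artifact_path(group_id, artifact_id, version, classifier=None, packaging="jar"):
    """Relative path of an artifact in a Maven repository (standard layout)."""
    filename = f"{artifact_id}-{version}"
    if classifier:
        filename += f"-{classifier}"
    filename += f".{packaging}"
    # Dots in the groupId become directory levels.
    return "/".join([group_id.replace(".", "/"), artifact_id, version, filename])


missing_jar = artifact_path("fakepath", "json-lib", "2.4", classifier="jdk15")
parent_pom = artifact_path(
    "org.springframework.boot", "spring-boot-starter-parent", "1.5.1.RELEASE",
    packaging="pom",
)

print(missing_jar)
print(parent_pom)
```

The second call reproduces the path that appears in the download log earlier, which is a handy way to check that the coordinates for an upload are right before running mvn.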

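After uploading to the private nexus repository, delete the directory of the failed dependency from the local maven repository and build again

# cd /data/docker_mount/maven_repository
# rm -rf fakepath/json-lib/

If other jar dependencies are missing in later builds, handle them the same way: get the jar from the developers and upload it to nexus. Ideally the developers' pom.xml would deploy these artifacts to nexus automatically, so that nobody has to upload them by hand

To test for dependency problems without going through jenkins, configure maven on the server and run the following commands by hand

cd $WORKSPACE/recycle-back/
mvn clean -P pre
mvn package -P pre clean install -Dmaven.test.skip=true

A screenshot of a failed jenkins build:

Build succeeded

After many trials and tribulations, the jenkins build finally succeeded. The result:

In the jenkins $WORKSPACE directory, the jar files produced by the build can be found inside each project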
# ls recycle-*/target/*.jar | grep -iv 'dubbo-privider'
recycle-activity/target/recycle-activity.jar
recycle-common/target/recycle-common-1.0-SNAPSHOT.jar
recycle-front/target/recycle-front.jar
recycle-message/target/recycle-message.jar
recycle-product/target/recycle-product.jar
recycle-risk/target/recycle-risk.jar
recycle-task/target/recycle-task.jar
recycle-trade/target/recycle-trade.jar
recycle-user/target/recycle-user.jar
#

Releasing the project

#!/bin/bash
BUILD_ID=DONTKILLME
cd $WORKSPACE/king-back/
mvn clean -P pre
mvn package -P pre clean install -Dmaven.test.skip=true
result=$?
if [ $result == '0' ];then
find . -type f -iname "king*.jar" | egrep -iv 'common|dubbo-privider|api' | xargs -i scp {} user@ip:/data/king/project/
find . -type f -iname "king*.war" | egrep -iv 'common|dubbo-privider|api' | xargs -i scp {} user@ip:/data/king/project/
ssh -fn user@ip "ps -ef | grep 'king-front.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-mis.war' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-task.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-trade.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-risk.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-user.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-activity.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-product.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "ps -ef | grep 'king-message.jar' |grep -v grep |awk '{print \$2}' |xargs -i kill {}"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-message.jar >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-user.jar >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-trade.jar >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-risk.jar >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-task.jar >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-product.jar >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-activity.jar >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms600m -Xmx600m /data/king/project/king-mis.war >/dev/null 2>&1 &"
ssh -fn user@ip "nohup java -jar -Xms350m -Xmx350m /data/king/project/king-front.jar >/dev/null 2>&1 &"
ssh -fn user@ip "ps -ef | grep -i king | egrep -iv 'grep|color'"
else
    echo 'Build failed'
    exit 1
fi

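This article is from "Jack Wang Blog": http://www.yfshare.vip/2018/07/04/部署jenkins项目/

Deploying Zabbix 3.4

2018-06-30T12:30:00.000Z

I have set up zabbix many times, yet each new deployment still has me searching all over. Today I took the time to write it down, so next time I can just copy and paste, haha

Environment

Environment:
　　centos 7.4
　　zabbix 3.4.11
　　mysql 5.6
　　php 5.4
　　nginx 12.2

zabbix 3.4 requirements:
Reference: zabbix requirements

MySQL 5.0.3 - 5.7.x
PHP 5.4.0 or later

Install the yum repositories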

[root@localhost ~]# wget https://mirrors.aliyun.com/epel/epel-release-latest-7.noarch.rpm
[root@localhost ~]# wget https://repo.mysql.com//mysql80-community-release-el7-1.noarch.rpm
# If the network is unreliable, use the php 5.4 that comes with the epel repository instead
[root@localhost ~]# wget https://mirror.webtatic.com/yum/el7/webtatic-release.rpm
[root@localhost ~]# wget https://jaist.dl.sourceforge.net/project/zabbix/ZABBIX%20Latest%20Stable/3.4.11/zabbix-3.4.11.tar.gz
[root@localhost ~]# yum -y install epel-release-latest-7.noarch.rpm mysql80-community-release-el7-1.noarch.rpm webtatic-release.rpm

Install dependencies

[root@localhost ~]# yum -y install yum-utils gcc gcc-c++ make libxml2 libxml2-devel libjpeg-devel libpng-devel bzip2-devel libcurl-devel gd-devel net-snmp-devel libevent-devel

Deploy PHP

If the webtatic repository is reachable, a newer PHP can be installed from it. In this environment the webtatic yum repository could not be used. -_-!!!

# Install PHP 5.4
[root@localhost ~]# yum -y install php php-mysql php-gd libjpeg* php-imap php-odbc libmcrypt libmcrypt-devel php-fpm php-cli php-bcmath php-mbstring php-xml
# Install php70w (PHP 7.0 from webtatic)
[root@localhost ~]# yum install php70w php70w-cli php70w-common php70w-fpm php70w-gd php70w-devel php70w-imap php70w-mysql php70w-odbc php70w-bcmath php70w-mbstring php70w-xml

# Configure PHP
[root@localhost ~]# sed -i '/max_execution_time/s/30/300/' /etc/php.ini
[root@localhost ~]# sed -i '/post_max_size/s/8/16/' /etc/php.ini
[root@localhost ~]# sed -i '/max_input_time/s/60/300/' /etc/php.ini
[root@localhost ~]# sed -i '/^;date.timezone/adate.timezone \= \"Asia\/Shanghai\"' /etc/php.ini
# If PHP was installed via yum, the php-fpm settings below do not need to be changed
[root@localhost ~]# sed -i '/run\/php-fpm.pid/s/^;//g' /etc/php-fpm.d/www.conf
[root@localhost ~]# sed -i '/^;pm.min_spare_servers/s/^;//g' /etc/php-fpm.d/www.conf
[root@localhost ~]# sed -i '/^;pm.max_spare_servers/s/^;//g' /etc/php-fpm.d/www.conf
[root@localhost ~]# sed -i '/^;pm.start_servers/s/^;//g' /etc/php-fpm.d/www.conf
[root@localhost ~]# systemctl enable php-fpm
[root@localhost ~]# systemctl start php-fpm
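
Because the `sed` edits above match on patterns, it is worth confirming afterwards that each setting actually has the intended value. This is a small sketch under the assumption that settings use the usual `key = value` form; `ini_value` is a hypothetical helper, not part of PHP.

```shell
# ini_value KEY FILE: print the value of the last uncommented "KEY = value" line.
ini_value() {
    awk -F= -v key="$1" '
        /^[ \t]*;/ { next }                      # skip settings that are commented out
        {
            k = $1
            gsub(/[ \t]/, "", k)                 # key name without surrounding blanks
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim the value
                val = v
            }
        }
        END { if (val != "") print val }
    ' "$2"
}
```

After the edits above, `ini_value max_execution_time /etc/php.ini` should print `300`, `post_max_size` should print `16M`, and `date.timezone` should print `"Asia/Shanghai"`.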

Deploy MySQL 5.6

This section also relocates the MySQL data directory.

[root@localhost ~]# yum-config-manager --disable mysql80-community
[root@localhost ~]# yum-config-manager --enable mysql56-community
[root@localhost ~]# yum install -y mysql-community-server.x86_64 mysql-community-devel.x86_64 mysql-community-common.x86_64 mysql-community-client.x86_64
[root@localhost ~]# systemctl enable mysqld
[root@localhost ~]# systemctl start mysqld

Relocate the MySQL data directory

[root@localhost ~]# mysql -u root
mysql> use mysql
mysql> update user set password=password('123456') where user='root';
mysql> SELECT Host,User,password_expired FROM user;
mysql> update user set host='192.168.%.%',password_expired='N' where host='127.0.0.1';
mysql> grant all privileges on *.* to 'root'@'localhost' identified by '123456' with grant option;
mysql> grant all privileges on *.* to 'root'@'192.168.%.%' identified by '123456' with grant option;
mysql> flush privileges;
mysql> commit;
mysql> show global variables like "%datadir%";
+---------------+-----------------+
| Variable_name | Value           |
+---------------+-----------------+
| datadir       | /var/lib/mysql/ |
+---------------+-----------------+
1 row in set (0.00 sec)
mysql> quit
[root@localhost ~]# systemctl stop mysqld
[root@localhost ~]# mkdir -p /data
[root@localhost ~]# mv /var/lib/mysql /data/mysql_data
[root@localhost ~]# grep -iv '#' /etc/my.cnf | grep -iv '^$'
[mysqld]
# changed to the new directory
datadir=/data/mysql_data
socket=/data/mysql_data/mysql.sock
symbolic-links=0
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
# added
[client]
socket=/data/mysql_data/mysql.sock
[root@localhost ~]# systemctl start mysqld
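
After moving the data directory, the `[mysqld]` and `[client]` sections must name the same socket, otherwise the `mysql` client cannot connect locally. A quick consistency check could look like the sketch below; `cnf_value` and `sockets_match` are hypothetical helpers, and the parser assumes `key=value` lines without spaces around `=`, as in the file shown above.

```shell
# cnf_value SECTION KEY FILE: print KEY's value inside [SECTION].
cnf_value() {
    awk -F= -v section="[$1]" -v key="$2" '
        /^\[/ { in_section = ($0 == section); next }   # track the current section
        in_section && $1 == key { print $2; exit }
    ' "$3"
}

# sockets_match FILE: succeed when [mysqld] and [client] use the same socket path.
sockets_match() {
    server_sock=$(cnf_value mysqld socket "$1")
    client_sock=$(cnf_value client socket "$1")
    [ -n "$server_sock" ] && [ "$server_sock" = "$client_sock" ]
}
```

`sockets_match /etc/my.cnf && echo OK` is then enough to catch a forgotten `[client]` section.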

Deploy Zabbix

[root@localhost ~]# useradd -s /sbin/nologin zabbix
[root@localhost ~]# tar -zxf zabbix-3.4.11.tar.gz
[root@localhost ~]# cd zabbix-3.4.11
[root@localhost zabbix-3.4.11]# ./configure --prefix=/usr/local/zabbix --enable-server --enable-agent --with-mysql --with-net-snmp --with-libcurl --with-libxml2
[root@localhost zabbix-3.4.11]# make && make install

# When building zabbix-server with `--enable-java`, the following dependency must be installed first
[root@localhost ~]# yum install -y java*

Initialize the database

[root@localhost ~]# mysql -uroot -p123456
mysql> create database zabbix default charset utf8;
mysql> grant all privileges on zabbix.* to zbxuser@localhost identified by 'zbxpass';
mysql> grant all privileges on zabbix.* to zbxuser@'192.168.%.%' identified by 'zbxpass';
mysql> flush privileges;
mysql> commit;
mysql> quit

[root@localhost ~]# cd zabbix-3.4.11
[root@localhost zabbix-3.4.11]# mysql -uzbxuser -pzbxpass zabbix < database/mysql/schema.sql
[root@localhost zabbix-3.4.11]# mysql -uzbxuser -pzbxpass zabbix < database/mysql/images.sql
[root@localhost zabbix-3.4.11]# mysql -uzbxuser -pzbxpass zabbix < database/mysql/data.sql
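
The three files have to be loaded in the order shown (schema, then images, then data), and a failed import should stop the sequence. One way to script that is sketched below; `import_zabbix_sql` and the `MYSQL_CMD` override are names invented for this example, and the default command simply reuses the credentials from above.

```shell
# import_zabbix_sql DIR: load schema.sql, images.sql and data.sql from DIR in order.
# MYSQL_CMD is a hypothetical override for the client command (useful for a dry run).
import_zabbix_sql() {
    src_dir="$1"
    # Check that all three files are readable before touching the database.
    for f in schema.sql images.sql data.sql; do
        [ -r "$src_dir/$f" ] || { echo "missing: $src_dir/$f" >&2; return 1; }
    done
    # Import in order and stop at the first failure.
    for f in schema.sql images.sql data.sql; do
        echo "importing $f"
        ${MYSQL_CMD:-mysql -uzbxuser -pzbxpass zabbix} < "$src_dir/$f" || return 1
    done
}
```

From the source directory this would be called as `import_zabbix_sql database/mysql`.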

Configure Zabbix

[root@localhost ~]# sed -i '/LogFile/s/tmp/var\/log\/zabbix/' /usr/local/zabbix/etc/zabbix_server.conf
[root@localhost ~]# sed -i '/^DBUser\=/s/root/zbxuser/' /usr/local/zabbix/etc/zabbix_server.conf
[root@localhost ~]# sed -i '/# DBPassword/aDBPassword=zbxpass' /usr/local/zabbix/etc/zabbix_server.conf
[root@localhost ~]# sed -i '/# DBPort/aDBPort=3306' /usr/local/zabbix/etc/zabbix_server.conf
[root@localhost ~]# sed -i '/^Timeout/s/4/30/' /usr/local/zabbix/etc/zabbix_server.conf
# If the database data directory was moved, the Zabbix server configuration must point at the new socket as well
[root@localhost ~]# sed -i '/^# DBSocket/aDBSocket=/data/mysql_data/mysql.sock' /usr/local/zabbix/etc/zabbix_server.conf

[root@localhost ~]# sed -i '/^Server/s/127.0.0.1/192.168.1.202/g' /usr/local/zabbix/etc/zabbix_agentd.conf
[root@localhost ~]# sed -i "/^ServerActive/s/127.0.0.1/192.168.1.202:10051/g" /usr/local/zabbix/etc/zabbix_agentd.conf
[root@localhost ~]# sed -i "/^Hostname/s/Zabbix server/192.168.1.202/g" /usr/local/zabbix/etc/zabbix_agentd.conf
[root@localhost ~]# sed -i '/^LogFile/s/tmp/var\/log\/zabbix/' /usr/local/zabbix/etc/zabbix_agentd.conf

[root@localhost ~]# ln -s /usr/local/zabbix/sbin/* /usr/local/sbin/
[root@localhost ~]# ln -s /usr/local/zabbix/bin/* /usr/local/sbin/
[root@localhost ~]# mkdir -p /var/log/zabbix
[root@localhost ~]# touch /var/log/zabbix/{zabbix_agentd.log,zabbix_server.log}
[root@localhost ~]# chown zabbix:zabbix /var/log/zabbix/ -R
[root@localhost ~]# /usr/local/zabbix/sbin/zabbix_server -c /usr/local/zabbix/etc/zabbix_server.conf
[root@localhost ~]# /usr/local/zabbix/sbin/zabbix_agentd -c /usr/local/zabbix/etc/zabbix_agentd.conf
# Alternatively, start them with the init scripts shipped in the Zabbix source tree
[root@localhost ~]# cp -a zabbix-3.4.11/misc/init.d/fedora/core/zabbix_* /etc/init.d/
[root@localhost ~]# chmod +x /etc/init.d/zabbix_*
[root@localhost ~]# /etc/init.d/zabbix_server restart
[root@localhost ~]# /etc/init.d/zabbix_agentd restart
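
To confirm that both daemons came up, one can check that the server port 10051 and the agent port 10050 (the Zabbix defaults) are listening. The helper below is a sketch that reads `ss -lnt` style output on stdin; `port_listening` is a name made up for this example.

```shell
# port_listening PORT: read `ss -lnt` output on stdin and succeed
# when some LISTEN socket has a local address ending in :PORT.
port_listening() {
    awk -v suffix=":$1" '
        $1 == "LISTEN" && substr($4, length($4) - length(suffix) + 1) == suffix { found = 1 }
        END { exit !found }
    '
}
```

Typical use would be `ss -lnt | port_listening 10051 || echo "zabbix_server is not listening"`; if it fails, /var/log/zabbix/zabbix_server.log is the first place to look.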

Deploy Nginx

[root@localhost ~]# yum install -y nginx
[root@localhost ~]# mkdir /usr/share/nginx/html/zabbix
[root@localhost ~]# cd zabbix-3.4.11
[root@localhost zabbix-3.4.11]# cp -a frontends/php/* /usr/share/nginx/html/zabbix/
[root@localhost ~]# chown zabbix:zabbix /usr/share/nginx/html/zabbix/ -R
[root@localhost ~]# chmod o+w /usr/share/nginx/html/zabbix/conf -R

[root@localhost ~]# cat /etc/nginx/conf.d/zabbix.conf 
server {
    listen 80;
    server_name 192.168.1.202;
    root /usr/share/nginx/html;
    access_log /var/log/nginx/zabbix_access.log main;
    error_log /var/log/nginx/zabbix_error.log warn;
    location /zabbix {
        index index.php;
        try_files $uri $uri/ /index.php?$args;
    }
    location ~ ^(.+\.php)(.*)$ {
        fastcgi_buffer_size 128k;
        fastcgi_buffers 32 32k;
        fastcgi_split_path_info ^(.+\.php)(.*)$;
        include fastcgi.conf;
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_index index.php;
        fastcgi_param PATH_INFO $fastcgi_path_info;
    }
}
[root@localhost ~]#
[root@localhost ~]# systemctl enable nginx
[root@localhost ~]# systemctl start nginx

Result

If clicking "Next step" does nothing, check the permissions on /var/lib/php/session; if the directory does not exist, create it and set its mode to 777.
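
The session-directory fix can be scripted as follows. This is only a sketch: `ensure_session_dir` is a hypothetical helper, and the default mode of 777 mirrors the advice above. A tighter mode, with the directory owned by the account php-fpm runs as, is likely sufficient, but that is an assumption not tested here.

```shell
# ensure_session_dir DIR [MODE]: create DIR if it is missing, then set its mode.
# MODE defaults to 777, as suggested in the text above.
ensure_session_dir() {
    dir="$1"
    mode="${2:-777}"
    [ -d "$dir" ] || mkdir -p "$dir" || return 1
    chmod "$mode" "$dir"
}
```

On the Zabbix host this would be `ensure_session_dir /var/lib/php/session`, followed by reloading the setup page.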

Attachments:
epel-release-latest-7.noarch
mysql80-community-release-el7-1.noarch
webtatic-release
zabbix-3.4.11.tar.gz

This article comes from "Jack Wang Blog": http://www.yfshare.vip/2018/06/30/部署Zabbix-3-4/

Deploying a Jenkins Environment with Ansible

2018-06-21T15:00:00.000Z

This article describes batch deployment of a Jenkins environment with Ansible.

Environments it can deploy:

Docker-based Jenkins base environment
Maven
Nodejs
Docker-based Nexus3 base environment
Docker base environment

Toolkit test environment

Environment	IP
Jenkins	192.168.1.201
Maven	192.168.1.202
Nodejs	192.168.1.203
Nexus3	192.168.1.206

Tool versions

Tool	Version	Notes
centos	7.4	CentOS 7 offers better Docker support and performance
Ansible	2.5.2	Ansible 2.4 or later is required because some newer syntax is used
docker-ce	latest	-
Jenkins	2.128	-
Maven	3.5.3	-
Nodejs	v8.11.2	-
Nexus3_oss	3.12.0	-

Contents of the Ansible toolkit

[root@localhost ~]# tree jenkins
jenkins
├── 01-install-jenkins.yml
├── 02-clean-jenkins.yml
├── 03-install-maven.yml
├── 04-install-nodejs.yml
├── 05-install-nexus3.yml
├── 06-clean-nexus3.yml
├── 07-install-docker.yml
├── hosts
├── pics
│   ├── jenkins_dashboard.png
│   └── nexus3_dashboard.png
├── README.md
└── roles
    ├── clean-jenkins
    │   ├── files
    │   │   └── jenkins-docker-compose.yml.j2
    │   └── tasks
    │       └── main.yml
    ├── clean-nexus3
    │   ├── files
    │   │   └── nexus3-docker-compose.yml.j2
    │   └── tasks
    │       └── main.yml
    ├── common
    │   ├── files
    │   │   └── epel-release-latest-7.noarch.rpm
    │   ├── tasks
    │   │   └── main.yml
    │   └── templates
    │       ├── 20-nproc.conf
    │       └── limits.conf
    ├── docker
    │   ├── files
    │   │   ├── daemon.json
    │   │   ├── docker-ce.repo
    │   │   ├── docker-compose
    │   │   └── docker.service
    │   └── tasks
    │       └── main.yml
    ├── jdk
    │   ├── files
    │   │   ├── java.sh
    │   │   └── jdk-8u171-linux-x64.tar.gz
    │   └── tasks
    │       └── main.yml
    ├── jenkins
    │   ├── files
    │   │   ├── jenkins_2.128.tar.gz
    │   │   └── jenkins-docker-compose.yml.j2
    │   └── tasks
    │       └── main.yml
    ├── maven
    │   ├── files
    │   │   ├── apache-maven-3.5.3-bin.tar.gz
    │   │   └── maven.sh
    │   └── tasks
    │       └── main.yml
    ├── nexus3
    │   ├── files
    │   │   ├── nexus3-docker-compose.yml.j2
    │   │   └── nexus_oss_3.12.0.tar.gz
    │   └── tasks
    │       └── main.yml
    └── nodejs
        ├── files
        │   ├── nodejs.sh
        │   └── node-v8.11.2-linux-x64.tar.xz
        └── tasks
            └── main.yml
30 directories, 39 files
[root@localhost ~]#

Toolkit size:

[root@localhost ~]# du -sh jenkins
1.5G	jenkins
[root@localhost ~]#

Toolkit README

This toolkit was tested on: centos 7.4
Author: Jack_wang
Blog: http://www.yfshare.vip

Prerequisites

Set up root SSH key trust from the Ansible server to itself.
Set up root SSH key trust between the Ansible server and every other node.
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub root@ip
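
With several nodes, the key distribution can be done in a loop. A minimal sketch, assuming the key pair already exists at ~/.ssh/id_rsa; `push_key` and the `COPY_CMD` override are hypothetical names introduced here.

```shell
# push_key HOST...: copy the local public key to root on each host, stopping on the first failure.
# COPY_CMD is a hypothetical override; COPY_CMD=echo previews the commands.
push_key() {
    for host in "$@"; do
        ${COPY_CMD:-ssh-copy-id} -i "$HOME/.ssh/id_rsa.pub" "root@$host" || return 1
    done
}
```

For the test environment listed earlier this would be `push_key 192.168.1.201 192.168.1.202 192.168.1.203 192.168.1.206`.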

Note: Ansible 2.4 or later is required, because some of the syntax used is not supported by earlier versions.
The Ansible version used here is 2.5.2.

Installation steps

yum -y install https://mirrors.aliyun.com/epel/epel-release-latest-7.noarch.rpm
yum -y install ansible

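Note: when installing ansible-2.5.2, the yum repositories are missing the python-babel-0.9.6-8.el7.noarch.rpm package. It can be downloaded from https://pkgs.org/download/python-babel, and a copy of the dependency is also provided in the current directory.
Edit the parameters in the hosts file as required.
Steps: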

ansible-playbook -i hosts 01-install-jenkins.yml # deploy the Jenkins environment; Maven is bundled in the Docker image
ansible-playbook -i hosts 03-install-maven.yml # deploy the Maven environment
ansible-playbook -i hosts 04-install-nodejs.yml # deploy the Nodejs environment
ansible-playbook -i hosts 05-install-nexus3.yml # deploy the Nexus3 OSS environment
ansible-playbook -i hosts 07-install-docker.yml # deploy the docker-ce base environment
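
When more than one of these playbooks is needed, a small wrapper can run them in sequence and stop at the first failure. This is a sketch, not part of the toolkit: `run_playbooks` and the `ANSIBLE_CMD` override are hypothetical names.

```shell
# run_playbooks INVENTORY PLAYBOOK...: run each playbook in order, stopping at the first failure.
# ANSIBLE_CMD is a hypothetical override for the command, handy for a dry run.
run_playbooks() {
    inventory="$1"; shift
    for playbook in "$@"; do
        echo ">>> $playbook"
        ${ANSIBLE_CMD:-ansible-playbook} -i "$inventory" "$playbook" || {
            echo "failed: $playbook" >&2
            return 1
        }
    done
}
```

For example, `run_playbooks hosts 01-install-jenkins.yml 05-install-nexus3.yml`. Before a real run, `ansible-playbook -i hosts --syntax-check 01-install-jenkins.yml` catches syntax errors without touching any host.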

After a successful deployment, verify the result as follows:
Jenkins: open http://ip:8080

docker ps -a | grep -i 'jenkins'

Maven:
Run mvn -v

Nodejs:
Run node -v

nexus3: open http://ip:8081

docker ps -a | grep -i 'nexus'

Uninstall Jenkins

Edit the parameters in the hosts file as required.
Steps:

ansible-playbook -i hosts 02-clean-jenkins.yml # uninstall Jenkins

Uninstall nexus3_oss

Edit the parameters in the hosts file as required.
Steps:

ansible-playbook -i hosts 06-clean-nexus3.yml # uninstall Nexus3

Attachment:
jenkins_ansible_2.128.tar.gz (Baidu Netdisk, password: dxng)

This article comes from "Jack Wang Blog": http://www.yfshare.vip/2018/06/21/Ansible部署Jenkins环境/