ELK Deployment and Common Errors

Deploy a 3-node Elasticsearch cluster with Docker Compose and enable security authentication.

Environment

  • Rocky9 Linux
  • Elastic Stack 8.12

Three Rocky 9 Linux servers, each with 4 CPUs and 16 GB RAM. Internal addresses and hostnames:

  • 172.31.29.164 vp-elk-1
  • 172.31.24.61 vp-elk-2
  • 172.31.25.106 vp-elk-3

Adjust the kernel parameters in /etc/sysctl.conf ; Elasticsearch requires:

/etc/sysctl.conf
vm.max_map_count = 262144

Run sysctl -p to apply the change, and verify it with sysctl vm.max_map_count

Create the ELK directories

mkdir -p /data/elk
cd /data/elk
mkdir -p data/{elasticsearch,kibana} config/certs

The resulting directory structure:

/data/elk
├── config
│   └── certs
├── data
│   ├── elasticsearch
│   └── kibana
└── docker-compose.yml

Because cluster security (X-Pack Security, xpack.security.enabled: true ) is enabled, TLS encryption must be configured for the transport layer used for inter-node communication; otherwise ES refuses to start in production mode.

Certificates can be generated in either of two ways:

  1. Generate cluster certificates in PEM format (this only needs to be done on one node). Use the certutil tool bundled with ES to generate a CA and node certificates; each node gets its own certificate.

    cd /data/elk

    # First generate the CA certificate
    docker run --rm \
    -v $(pwd)/config/certs:/certs \
    docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
    bin/elasticsearch-certutil ca \
    --silent \
    --pem \
    --out /certs/ca.zip

    # Unzip the CA bundle to obtain ca.crt and ca.key
    cd config/certs
    unzip ca.zip

    Create an instance definition file, e.g. config/certs/instances.yml , describing the node certificates to generate:

    config/certs/instances.yml
    instances:
      - name: vp-elk-1
        ip:
          - 172.31.29.164

      - name: vp-elk-2
        ip:
          - 172.31.24.61

      - name: vp-elk-3
        ip:
          - 172.31.25.106

    Use the instance file config/certs/instances.yml to generate the node certificates:

    docker run --rm \
    -v $(pwd)/config/certs:/certs \
    docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
    bin/elasticsearch-certutil cert \
    --silent \
    --pem \
    --in /certs/instances.yml \
    --ca-cert /certs/ca/ca.crt \
    --ca-key /certs/ca/ca.key \
    --out /certs/certs.zip

    Unzip the certificate bundle to obtain the node certificates; the resulting layout is:

    # tree
    .
    ├── ca
    │   ├── ca.crt
    │   └── ca.key
    ├── ca.zip
    ├── certs.zip
    ├── instances.yml
    ├── vp-elk-1
    │   ├── vp-elk-1.crt
    │   └── vp-elk-1.key
    ├── vp-elk-2
    │   ├── vp-elk-2.crt
    │   └── vp-elk-2.key
    └── vp-elk-3
        ├── vp-elk-3.crt
        └── vp-elk-3.key

    4 directories, 11 files

  2. Generate p12 node certificates

    Create config/instances.yml

    config/instances.yml
    instances:
      - name: vp-elk-1
        dns:
          - vp-elk-1
          - localhost
        ip:
          - 172.31.29.164
          - 127.0.0.1

      - name: vp-elk-2
        dns:
          - vp-elk-2
          - localhost
        ip:
          - 172.31.24.61
          - 127.0.0.1

      - name: vp-elk-3
        dns:
          - vp-elk-3
          - localhost
        ip:
          - 172.31.25.106
          - 127.0.0.1

    Generate the CA:

    cd /data/elk

    docker run --rm -v ./config/certs:/certs docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
    bin/elasticsearch-certutil ca --out /certs/elastic-ca.p12 --pass ""

    This produces the password-less CA file config/certs/elastic-ca.p12 . Next, use the CA to generate the node certificates, with config/instances.yml defining the SANs embedded in each certificate:

    docker run --rm -v ./config/certs:/certs docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
    bin/elasticsearch-certutil cert --ca /certs/elastic-ca.p12 --ca-pass "" \
    --in /certs/instances.yml \
    --out /certs/elastic-certificates.p12 --pass ""

    This produces ./config/certs/elastic-certificates.p12 . Despite the name, it is actually a zip archive; unzip it to obtain the per-node .p12 certificates.

    The SANs in a certificate can be inspected as follows:

    # cd config/certs/

    # file elastic-certificates.p12
    elastic-certificates.p12: Zip archive data, at least v2.0 to extract

    # unzip elastic-certificates.p12
    Archive: elastic-certificates.p12
    creating: vp-elk-1/
    inflating: vp-elk-1/vp-elk-1.p12
    creating: vp-elk-2/
    inflating: vp-elk-2/vp-elk-2.p12
    creating: vp-elk-3/
    inflating: vp-elk-3/vp-elk-3.p12


    # openssl pkcs12 -in vp-elk-1/vp-elk-1.p12 -nodes -passin pass: \
    | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
    X509v3 Subject Alternative Name:
    IP Address:172.31.29.164, IP Address:127.0.0.1, DNS:localhost, DNS:vp-elk-1

After generating the certificates, distribute them to the other two nodes.
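
A minimal sketch of the distribution step. The hostnames are the ones from this document; the SSH user and destination path are assumptions, so adjust them to your environment. The loop prints the commands as a dry run; pipe its output to `sh` to actually copy:

```shell
# Dry run: print the scp commands that would copy the certs to the other nodes.
# root@ and the destination path are assumptions; hostnames are from this doc.
for host in vp-elk-2 vp-elk-3; do
  printf 'scp -r /data/elk/config/certs root@%s:/data/elk/config/\n' "$host"
done
```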

Elasticsearch cluster configuration file ./config/elasticsearch.yml . Every node needs this file; on each node, set node.name , network.host , and the certificate paths to that node's values.

./config/elasticsearch.yml
cluster.name: vp-elk-cluster

node.name: vp-elk-1 # set to this node's name

network.host: 172.31.29.164 # set to this node's IP
http.port: 9200
transport.port: 9300

discovery.seed_hosts:
- 172.31.29.164
- 172.31.24.61
- 172.31.25.106

cluster.initial_master_nodes:
- vp-elk-1
- vp-elk-2
- vp-elk-3


xpack.security.enabled: true
xpack.security.enrollment.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12 # set to this node's certificate
xpack.security.transport.ssl.truststore.path: /usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12 # set to this node's certificate

xpack.security.http.ssl.enabled: false

bootstrap.memory_lock: true
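
Once the cluster is running, you can verify that memory locking actually took effect; if mlockall reports false, re-check the memlock ulimits in the compose file:

```
# GET /_nodes?filter_path=**.mlockall
```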

Kibana configuration file config/kibana.yml ; this only needs to exist on the one node that runs Kibana.

config/kibana.yml
server.name: kibana
server.host: "0.0.0.0"

server.ssl.enabled: false # Kibana talks to ES over plain HTTP, since ES is configured with xpack.security.http.ssl.enabled: false

elasticsearch.hosts:
- http://172.31.29.164:9200
- http://172.31.24.61:9200
- http://172.31.25.106:9200

elasticsearch.username: "kibana_system" # the elastic superuser is not allowed here
elasticsearch.password: "YourStrongPassword"

monitoring.ui.container.elasticsearch.enabled: true

Docker Compose file docker-compose.yml . All three ES nodes can use the same compose file; the per-node ES settings live in each node's config/elasticsearch.yml . Kibana only needs to be deployed on one of the servers.

docker-compose.yml
services:

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.2
    container_name: elasticsearch
    network_mode: host
    environment:
      - ES_JAVA_OPTS=-Xms4g -Xmx4g

    volumes:
      - ./data/elasticsearch:/usr/share/elasticsearch/data
      - ./config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
      - ./config/certs:/usr/share/elasticsearch/config/certs:ro

    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: always
    mem_limit: 8g

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.2
    container_name: kibana
    restart: always
    network_mode: host
    depends_on:
      - elasticsearch

    volumes:
      - ./config/kibana.yml:/usr/share/kibana/config/kibana.yml
      - ./config/certs:/usr/share/kibana/config/certs:ro
      - ./data/kibana:/usr/share/kibana/data

Start

docker compose up -d

Setting passwords for elastic and the other built-in users

# docker compose exec -it elasticsearch bin/elasticsearch-setup-passwords interactive
******************************************************************************
Note: The 'elasticsearch-setup-passwords' tool has been deprecated. This command will be removed in a future release.
******************************************************************************

Initiating the setup of passwords for reserved users elastic,apm_system,kibana,kibana_system,logstash_system,beats_system,remote_monitoring_user.
You will be prompted to enter passwords as the process progresses.
Please confirm that you would like to continue [y/N]y


Enter password for [elastic]:
Reenter password for [elastic]:
Enter password for [apm_system]:
Reenter password for [apm_system]:
Enter password for [kibana_system]:
Reenter password for [kibana_system]:
...
Changed password for user [beats_system]
Changed password for user [remote_monitoring_user]
Changed password for user [elastic]


Resetting the elastic user's password

# docker compose exec elasticsearch bin/elasticsearch-reset-password -u elastic
This tool will reset the password of the [elastic] user to an autogenerated value.
The password will be printed in the console.
Please confirm that you would like to continue [y/N]y


Password for the [elastic] user successfully reset.
New value: GjJadE-ihZJ+Ddb5SvKs

Routine Elasticsearch cluster checks

  1. First, confirm that every node is up and has joined the cluster (run this against each node and check that cluster_name and cluster_uuid match).

    # GET /
    {
      "name": "vp-elk-1",
      "cluster_name": "vp-elk-cluster",
      "cluster_uuid": "Wvi6Vl5mTsKGlDUTp12xhQ",
      "version": {
        "number": "8.12.2",
        "build_flavor": "default",
        "build_type": "docker",
        "build_hash": "48a287ab9497e852de30327444b0809e55d46466",
        "build_date": "2024-02-19T10:04:32.774273190Z",
        "build_snapshot": false,
        "lucene_version": "9.9.2",
        "minimum_wire_compatibility_version": "7.17.0",
        "minimum_index_compatibility_version": "7.0.0"
      },
      "tagline": "You Know, for Search"
    }
  2. Check cluster health

    # GET /_cluster/health?pretty
    {
      "cluster_name": "vp-elk-cluster",
      "status": "green",
      "timed_out": false,
      "number_of_nodes": 3,
      "number_of_data_nodes": 3,
      "active_primary_shards": 29,
      "active_shards": 59,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 0,
      "delayed_unassigned_shards": 0,
      "number_of_pending_tasks": 0,
      "number_of_in_flight_fetch": 0,
      "task_max_waiting_in_queue_millis": 0,
      "active_shards_percent_as_number": 100
    }
  3. Check node status

    # GET /_cat/nodes?v
    ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
    172.31.25.106 8 59 0 0.00 0.00 0.00 cdfhilmrstw - vp-elk-3
    172.31.24.61 12 60 0 0.03 0.02 0.00 cdfhilmrstw * vp-elk-2
    172.31.29.164 25 60 1 0.02 0.02 0.01 cdfhilmrstw - vp-elk-1
  4. Check node roles

    # GET /_cat/nodes?v&h=name,ip,node.role,master
    name ip node.role master
    vp-elk-3 172.31.25.106 cdfhilmrstw -
    vp-elk-2 172.31.24.61 cdfhilmrstw *
    vp-elk-1 172.31.29.164 cdfhilmrstw -

    • m: master-eligible
    • d: data
    • i: ingest
  5. Check shard distribution

    # GET /_cat/shards?v
    index shard prirep state docs store dataset ip node
    .kibana_analytics_8.12.2_001 0 p STARTED 5 2.3mb 2.3mb 172.31.24.61 vp-elk-2
    .kibana_analytics_8.12.2_001 0 r STARTED 5 2.3mb 2.3mb 172.31.29.164 vp-elk-1
    .internal.alerts-observability.apm.alerts-default-000001 0 p STARTED 0 249b 249b 172.31.24.61 vp-elk-2
    .internal.alerts-observability.apm.alerts-default-000001 0 r STARTED 0 249b 249b 172.31.29.164 vp-elk-1
    .ds-.kibana-event-log-ds-2026.03.13-000001 0 p STARTED 1 6.3kb 6.3kb 172.31.25.106 vp-elk-3
  6. Check index status

    # GET /_cat/indices?v
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
    green open .internal.alerts-observability.logs.alerts-default-000001 QQ1ALFIwTS6Cr1IUjp384w 1 1 0 0 498b 249b 249b
    green open .internal.alerts-observability.threshold.alerts-default-000001 UzpYLZbzTMyK2yCYnmayKw 1 1 0 0 498b 249b 249b
    green open .kibana-observability-ai-assistant-kb-000001 yV9I-sIMQgyf0edNxw1kPA 1 1 0 0 498b 249b 249b
    green open .internal.alerts-observability.apm.alerts-default-000001 0HOJ4bCgT_2X4c29FryFCw 1 1 0 0 498b 249b 249b
    green open .internal.alerts-stack.alerts-default-000001 NJCmcGptQOWG6rd0I259Uw 1 1 0 0 498b 249b 249b
    green open .internal.alerts-observability.slo.alerts-default-000001 rSSFxAYfR0O9L9XXBIpZlA 1 1 0 0 498b 249b 249b
    green open .internal.alerts-ml.anomaly-detection.alerts-default-000001 fT9FJoirRiSMVXx1V3F7dQ 1 1 0 0 498b 249b 249b
    green open .internal.alerts-observability.metrics.alerts-default-000001 E9vjU7WETSebjg8Y_ddPHw 1 1 0 0 498b 249b 249b
  7. Check master election

    # GET /_cat/master?v
    id host ip node
    5hG_mSEjRd6Ov-rClowAoQ 172.31.24.61 172.31.24.61 vp-elk-2

    Exactly one master is the normal state.

  8. Check JVM heap usage

    # GET /_cat/nodes?v&h=name,heap.percent
    name heap.percent
    vp-elk-3 20
    vp-elk-2 48
    vp-elk-1 57
  9. Check disk usage

    # GET /_cat/allocation?v
    shards disk.indices disk.used disk.avail disk.total disk.percent host ip node node.role
    20 2.8mb 14.1gb 1009.7gb 1023.9gb 1 172.31.29.164 172.31.29.164 vp-elk-1 cdfhilmrstw
    20 754.3kb 11gb 1012.8gb 1023.9gb 1 172.31.25.106 172.31.25.106 vp-elk-3 cdfhilmrstw
    19 2.9mb 11gb 1012.8gb 1023.9gb 1 172.31.24.61 172.31.24.61 vp-elk-2 cdfhilmrstw

  10. Check thread pools

    # GET /_cat/thread_pool?v
    node_name name active queue rejected
    vp-elk-3 analyze 0 0 0
    vp-elk-3 auto_complete 0 0 0
    vp-elk-3 azure_event_loop 0 0 0
    vp-elk-3 ccr 0 0 0
    vp-elk-3 cluster_coordination 0 0 0
    ...

    Watch the queue and rejected columns; any value greater than 0 suggests the cluster is overloaded.

  11. Check pending tasks

    # GET /_cluster/pending_tasks?pretty
    {
      "tasks": []
    }
  12. Check certificates

    # GET /_ssl/certificates
    [
      {
        "path": "/usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12",
        "format": "PKCS12",
        "alias": "ca",
        "subject_dn": "CN=Elastic Certificate Tool Autogenerated CA",
        "serial_number": "825a4f350b8815940e60d557036edbe205f68a93",
        "has_private_key": false,
        "expiry": "2029-03-11T13:21:08.000Z",
        "issuer": "CN=Elastic Certificate Tool Autogenerated CA"
      },
      {
        "path": "/usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12",
        "format": "PKCS12",
        "alias": "vp-elk-1",
        "subject_dn": "CN=vp-elk-1",
        "serial_number": "220b64fa6c86b2145a86c274eb914f9e3b299350",
        "has_private_key": true,
        "expiry": "2029-03-12T01:29:26.000Z",
        "issuer": "CN=Elastic Certificate Tool Autogenerated CA"
      },
      {
        "path": "/usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12",
        "format": "PKCS12",
        "alias": "vp-elk-1",
        "subject_dn": "CN=Elastic Certificate Tool Autogenerated CA",
        "serial_number": "825a4f350b8815940e60d557036edbe205f68a93",
        "has_private_key": false,
        "expiry": "2029-03-11T13:21:08.000Z",
        "issuer": "CN=Elastic Certificate Tool Autogenerated CA"
      }
    ]
  13. View the cluster-wide settings

    # GET /_cluster/settings?include_defaults

    This outputs every Elasticsearch cluster setting, including the defaults.

  14. View index templates

    # GET /_index_template?pretty

    Here you can inspect number_of_shards and number_of_replicas ; both default to 1.

Common Errors

Elasticsearch exited unexpectedly, with exit code 137

In Docker this is almost always the kernel OOM killer (out of memory). Check mem_limit and ES_JAVA_OPTS together: the JVM heap ( -Xms / -Xmx ) should be no more than about half the container's memory limit, e.g. -Xms4g -Xmx4g with mem_limit: 8g . Setting -Xms8g -Xmx8g under mem_limit: 8g leaves no room for off-heap memory and will get the container killed.
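
The rest of the container limit covers off-heap usage (Lucene mmaps, direct buffers, metaspace). A quick sketch of the sizing rule for the 8g limit used in this document:

```shell
# Heap sizing sketch: give the JVM at most half of the container's mem_limit.
mem_limit_g=8                  # docker compose mem_limit
heap_g=$((mem_limit_g / 2))    # leave the other half for off-heap memory
echo "ES_JAVA_OPTS=-Xms${heap_g}g -Xmx${heap_g}g"   # → ES_JAVA_OPTS=-Xms4g -Xmx4g
```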

Summary of issues uploading data from filebeat to Elasticsearch

filebeat reports an error when uploading data to Elasticsearch

Applicable versions

  • filebeat 7
  • elasticsearch 7

filebeat 7.5.2 reports the following error when uploading data to Elasticsearch:

# journalctl -f -u filebeat
{"type":"illegal_argument_exception","reason":"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [6924]/[3000] maximum shards open;"}

This error means the number of open shards in the cluster has exceeded the cluster-wide maximum. In Elasticsearch, every index consists of one or more shards, and the cluster enforces a limit on the total number of open shards to prevent the performance problems caused by oversharding.

The message above shows the cluster already holds 6924 shards against a limit of 3000.
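
The 3000 figure is not arbitrary: the cluster-wide cap is cluster.max_shards_per_node (default 1000) multiplied by the number of data nodes, which matches a 3-data-node cluster at the default setting:

```shell
# Cluster-wide shard cap = cluster.max_shards_per_node × number of data nodes.
max_shards_per_node=1000   # Elasticsearch default
data_nodes=3               # implied by the 3000 cap in the error above
echo $((max_shards_per_node * data_nodes))   # → 3000
```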

To resolve this, consider the following options:

  1. Raise the Elasticsearch cluster's maximum shard limit

    You can increase the limit by changing the cluster settings. Note that running with far more shards can itself cause performance problems, especially with limited hardware resources.

    This is done by modifying the cluster.max_shards_per_node setting:

    PUT /_cluster/settings
    {
      "persistent": {
        "cluster.max_shards_per_node": "<new-per-node-limit>"
      }
    }

    To check the current limit, fetch the cluster settings including defaults:

    curl -X GET "http://[your_elasticsearch_host]:9200/_cluster/settings?include_defaults=true&pretty"

  2. Delete unneeded indices: if some indices are no longer required, deleting them reduces the shard count.

    curl -X DELETE "localhost:9200/my_index"
    curl -X DELETE "localhost:9200/logstash-2021.11.*"
  3. Merge small indices: if there are many small indices, consider consolidating them into larger ones to reduce the total shard count.

  4. Optimize the sharding strategy of existing indices, e.g. by reducing the number of primary shards per index.
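
For the last option, the _shrink API can reduce an existing index's primary shard count (the target count must be a factor of the source count). The node name below is hypothetical, and my_index is the example index used above. The source index must first be made read-only with all of its shards allocated to a single node:

```
PUT /my_index/_settings
{
  "index.routing.allocation.require._name": "node-1",
  "index.blocks.write": true
}

POST /my_index/_shrink/my_index-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}
```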

filebeat alias conflict error

filebeat reports an error when configured to upload data to Elasticsearch

Applicable versions

  • filebeat 7
  • elasticsearch 7

With the following filebeat configuration file:

/etc/filebeat/filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /home/logs/laravel-2023*
    tags: ["admin-log"]
    close_timeout: 3h
    clean_inactive: 72h
    ignore_older: 70h
    close_inactive: 5m

output.elasticsearch:
  hosts: ["1.56.219.122:9200", "1.57.115.214:9200", "1.52.53.31:9200"]
  username: "elastic"
  password: "passwd"
  index: "logstash-admin-%{+yyyy.MM.dd}"

setup.template.enabled: true
setup.template.name: "logstash-admin"
setup.template.pattern: "logstash-admin-*"

After startup, filebeat reports an error and the expected index is never created in Elasticsearch. The key error message is: Failed to connect to backoff(elasticsearch(http://1.57.115.214:9200)): Connection marked as failed because the onConnect callback failed: resource 'filebeat-7.5.2' exists, but it is not an alias

journalctl -f -u filebeat
INFO [index-management] idxmgmt/std.go:269 ILM policy successfully loaded.
ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(http://1.57.115.214:9200)): Connection marked as failed because the onConnect callback failed: resource 'filebeat-7.5.2' exists, but it is not an alias
INFO pipeline/output.go:93 Attempting to reconnect to backoff(elasticsearch(http://1.57.115.214:9200)) with 3 reconnect attempt(s)
INFO elasticsearch/client.go:753 Attempting to connect to Elasticsearch version 7.6.2
INFO [index-management] idxmgmt/std.go:256 Auto ILM enable success.
INFO [index-management.ilm] ilm/std.go:138 do not generate ilm policy: exists=true, overwrite=false
INFO [index-management] idxmgmt/std.go:269 ILM policy successfully loaded.
ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(http://1.56.219.122:9200)): Connection marked as failed because the onConnect callback failed: resource 'filebeat-7.5.2' exists, but it is not an alias
INFO pipeline/output.go:93 Attempting to reconnect to backoff(elasticsearch(http://1.56.219.122:9200)) with 3 reconnect attempt(s)

This indicates that Filebeat cannot connect properly to the Elasticsearch cluster. The main possible causes are:

  • Index/alias conflict: Filebeat tries to create or use filebeat-7.5.2 as an alias, but a resource with that name already exists in Elasticsearch and is not an alias. The fix is to delete or rename the conflicting index.

  • ILM configuration issue

    With the configuration above, after resolving the index/alias conflict, filebeat runs normally, but Elasticsearch never creates the configured logstash-admin-* indices; the data goes into filebeat-7.5.2-* instead. This is caused by ILM, which takes precedence over the custom index setting. Disable ILM ( setup.ilm.enabled: false ) as in the following configuration:

    /etc/filebeat/filebeat.yml
    filebeat.inputs:
      - type: log
        paths:
          - /home/logs/laravel-2023*
        tags: ["admin-log"]
        close_timeout: 3h
        clean_inactive: 72h
        ignore_older: 70h
        close_inactive: 5m

    output.elasticsearch:
      hosts: ["1.56.219.122:9200", "1.57.115.214:9200", "1.52.53.31:9200"]
      username: "elastic"
      password: "passwd"
      index: "logstash-admin-%{+yyyy.MM.dd}"

    setup.ilm.enabled: false
    setup.template.enabled: true
    setup.template.name: "logstash-admin"
    setup.template.pattern: "logstash-admin-*"
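
After disabling ILM and restarting filebeat, confirm that data now lands in the configured indices rather than in filebeat-7.5.2-* :

```
# GET /_cat/indices/logstash-admin-*?v
```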