ELK Deployment and Common Errors
Deploy a 3-node Elasticsearch cluster with Docker Compose and enable security authentication
Environment:
- Rocky Linux 9
- Elastic Stack 8.12
Three Rocky Linux 9 servers, each with 4 CPUs and 16 GB of RAM. Their private IP addresses and hostnames:
172.31.29.164  vp-elk-1
172.31.24.61   vp-elk-2
172.31.25.106  vp-elk-3
Adjust the kernel parameters in /etc/sysctl.conf; Elasticsearch requires:
vm.max_map_count = 262144
Run sysctl -p to apply the change, and verify it with sysctl vm.max_map_count.
Create the ELK directory
mkdir -p /data/elk
The directory structure looks like this:
/data/elk
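The tree output above is truncated; based on the files referenced later in this post, the layout presumably ends up as:

```
/data/elk
├── config
│   ├── certs/              # CA and node certificates, generated below
│   ├── elasticsearch.yml
│   └── kibana.yml
└── docker-compose.yml
```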
Because cluster security (X-Pack Security, xpack.security.enabled: true) is enabled, TLS encryption must be configured on the transport layer for node-to-node communication; otherwise ES refuses to start in production mode.
Certificates can be generated in either of two ways.
Option 1: generate PEM cluster certificates (this only needs to be done on one node). Use the certutil tool bundled with ES to generate a CA and node certificates; each node gets its own certificate.
cd /data/elk
First, generate the CA certificate:
docker run --rm \
-v $(pwd)/config/certs:/certs \
docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
bin/elasticsearch-certutil ca \
--silent \
--pem \
--out /certs/ca.zip
Unzip the CA archive to obtain ca.crt and ca.key:
cd config/certs
unzip ca.zip

Create an instances file, for example config/certs/instances.yml, describing the certificate to generate for each node:

instances:
  - name: vp-elk-1
    ip:
      - 172.31.29.164
  - name: vp-elk-2
    ip:
      - 172.31.24.61
  - name: vp-elk-3
    ip:
      - 172.31.25.106

Use the instances file to generate the node certificates:

docker run --rm \
-v $(pwd)/config/certs:/certs \
docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
bin/elasticsearch-certutil cert \
--silent \
--pem \
--in /certs/instances.yml \
--ca-cert /certs/ca/ca.crt \
--ca-key /certs/ca/ca.key \
  --out /certs/certs.zip

Unzip the archive to obtain the node certificates; the resulting file layout:
tree
.
├── ca
│   ├── ca.crt
│   └── ca.key
├── ca.zip
├── certs.zip
├── instances.yml
├── vp-elk-1
│   ├── vp-elk-1.crt
│   └── vp-elk-1.key
├── vp-elk-2
│   ├── vp-elk-2.crt
│   └── vp-elk-2.key
└── vp-elk-3
    ├── vp-elk-3.crt
    └── vp-elk-3.key
4 directories, 11 files
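A PEM node certificate's SAN entries can be sanity-checked with openssl (a quick check, assuming you are still in config/certs; expect the node's IP from instances.yml to appear):

```bash
openssl x509 -in vp-elk-1/vp-elk-1.crt -noout -text \
  | grep -A1 "Subject Alternative Name"
```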
Option 2: generate PKCS#12 (p12) node certificates.

Create config/certs/instances.yml (this variant also includes DNS names in the SANs):

instances:
  - name: vp-elk-1
    dns:
      - vp-elk-1
      - localhost
    ip:
      - 172.31.29.164
      - 127.0.0.1
  - name: vp-elk-2
    dns:
      - vp-elk-2
      - localhost
    ip:
      - 172.31.24.61
      - 127.0.0.1
  - name: vp-elk-3
    dns:
      - vp-elk-3
      - localhost
    ip:
      - 172.31.25.106
      - 127.0.0.1

Generate the CA:
cd /data/elk
docker run --rm -v ./config/certs:/certs docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
  bin/elasticsearch-certutil ca --out /certs/elastic-ca.p12 --pass ""

This produces a password-less CA certificate at config/certs/elastic-ca.p12. Next, use the CA root certificate to generate the node certificates, with config/certs/instances.yml defining the SANs embedded in each certificate:

docker run --rm -v ./config/certs:/certs docker.elastic.co/elasticsearch/elasticsearch:8.12.2 \
bin/elasticsearch-certutil cert --ca /certs/elastic-ca.p12 --ca-pass "" \
--in /certs/instances.yml \
  --out /certs/elastic-certificates.p12 --pass ""

This produces ./config/certs/elastic-certificates.p12. Despite the .p12 extension it is actually a zip archive; unzip it to obtain each node's certificate. The SAN information in a certificate can be checked like this:
# cd config/certs/
file elastic-certificates.p12
elastic-certificates.p12: Zip archive data, at least v2.0 to extract
unzip elastic-certificates.p12
Archive: elastic-certificates.p12
creating: vp-elk-1/
inflating: vp-elk-1/vp-elk-1.p12
creating: vp-elk-2/
inflating: vp-elk-2/vp-elk-2.p12
creating: vp-elk-3/
inflating: vp-elk-3/vp-elk-3.p12
openssl pkcs12 -in vp-elk-1/vp-elk-1.p12 -nodes -passin pass: \
| openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
X509v3 Subject Alternative Name:
IP Address:172.31.29.164, IP Address:127.0.0.1, DNS:localhost, DNS:vp-elk-1
After generating the certificates, distribute the certificate files to the other two nodes.
Elasticsearch cluster configuration file ./config/elasticsearch.yml. It must be configured on every node: set node.name to that node's name and point the certificate paths at that node's certificates.
cluster.name: vp-elk-cluster
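The file is truncated above; a minimal sketch of what each node's elasticsearch.yml might contain, assuming the p12 certificates from the second option (only cluster.name, xpack.security.enabled, and the certificate path are taken from this post; the remaining settings are standard cluster bootstrap values):

```yaml
cluster.name: vp-elk-cluster
node.name: vp-elk-1            # vp-elk-2 / vp-elk-3 on the other nodes
network.host: 0.0.0.0
discovery.seed_hosts: ["172.31.29.164", "172.31.24.61", "172.31.25.106"]
cluster.initial_master_nodes: ["vp-elk-1", "vp-elk-2", "vp-elk-3"]
xpack.security.enabled: true
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/vp-elk-1/vp-elk-1.p12     # this node's certificate
  truststore.path: certs/vp-elk-1/vp-elk-1.p12
```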
Kibana configuration file config/kibana.yml; it only needs to be configured on one node.
server.name: kibana
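This file is also truncated; a minimal sketch, where everything except server.name is an assumption (note that Kibana 8 refuses the elastic superuser as elasticsearch.username, so a kibana_system credential or a service account token is needed):

```yaml
server.name: kibana
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://172.31.29.164:9200"]
elasticsearch.username: "kibana_system"
elasticsearch.password: "<kibana_system password>"
```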
Docker Compose file docker-compose.yml. All three ES nodes can use the same compose file; each node's ES settings live in its own config/elasticsearch.yml. Kibana only needs to be deployed on one server.
services:
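The compose file is truncated above; a minimal sketch of one node's docker-compose.yml under the assumptions of this post (the service name elasticsearch matches the exec commands below; the kibana service goes only on the node that runs it):

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.2
    container_name: elasticsearch
    environment:
      # keep the heap well below mem_limit; see the exit-code-137 note below
      - ES_JAVA_OPTS=-Xms4g -Xmx4g
    mem_limit: 8g
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9200:9200"
      - "9300:9300"
    volumes:
      - ./config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
      - ./config/certs:/usr/share/elasticsearch/config/certs
      - ./data:/usr/share/elasticsearch/data
    restart: unless-stopped

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.2
    ports:
      - "5601:5601"
    volumes:
      - ./config/kibana.yml:/usr/share/kibana/config/kibana.yml
    restart: unless-stopped
```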
Start the stack
docker compose up -d
Set initial passwords for the elastic and other built-in users (elasticsearch-setup-passwords is deprecated in 8.x):
docker compose exec -it elasticsearch bin/elasticsearch-setup-passwords interactive
Or reset just the elastic user's password with the current tool:
docker compose exec elasticsearch bin/elasticsearch-reset-password -u elastic
Routine Elasticsearch cluster checks
First, confirm that all nodes have joined the cluster.
GET /
{
"name": "vp-elk-1",
"cluster_name": "vp-elk-cluster",
"cluster_uuid": "Wvi6Vl5mTsKGlDUTp12xhQ",
"version": {
"number": "8.12.2",
"build_flavor": "default",
"build_type": "docker",
"build_hash": "48a287ab9497e852de30327444b0809e55d46466",
"build_date": "2024-02-19T10:04:32.774273190Z",
"build_snapshot": false,
"lucene_version": "9.9.2",
"minimum_wire_compatibility_version": "7.17.0",
"minimum_index_compatibility_version": "7.0.0"
},
"tagline": "You Know, for Search"
}

Check cluster health
GET /_cluster/health?pretty
{
"cluster_name": "vp-elk-cluster",
"status": "green",
"timed_out": false,
"number_of_nodes": 3,
"number_of_data_nodes": 3,
"active_primary_shards": 29,
"active_shards": 59,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100
}

Check node status
GET /_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.31.25.106 8 59 0 0.00 0.00 0.00 cdfhilmrstw - vp-elk-3
172.31.24.61 12 60 0 0.03 0.02 0.00 cdfhilmrstw * vp-elk-2
172.31.29.164 25 60 1 0.02 0.02 0.01 cdfhilmrstw - vp-elk-1

Check node roles
GET /_cat/nodes?v&h=name,ip,node.role,master
name ip node.role master
vp-elk-3 172.31.25.106 cdfhilmrstw -
vp-elk-2 172.31.24.61 cdfhilmrstw *
vp-elk-1 172.31.29.164 cdfhilmrstw -

In the node.role column: m = master-eligible, d = data, i = ingest. The asterisk in the master column marks the elected master.
Check shard distribution
GET /_cat/shards?v
index shard prirep state docs store dataset ip node
.kibana_analytics_8.12.2_001 0 p STARTED 5 2.3mb 2.3mb 172.31.24.61 vp-elk-2
.kibana_analytics_8.12.2_001 0 r STARTED 5 2.3mb 2.3mb 172.31.29.164 vp-elk-1
.internal.alerts-observability.apm.alerts-default-000001 0 p STARTED 0 249b 249b 172.31.24.61 vp-elk-2
.internal.alerts-observability.apm.alerts-default-000001 0 r STARTED 0 249b 249b 172.31.29.164 vp-elk-1
.ds-.kibana-event-log-ds-2026.03.13-000001 0 p STARTED 1 6.3kb 6.3kb 172.31.25.106 vp-elk-3

Check index status
GET /_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open .internal.alerts-observability.logs.alerts-default-000001 QQ1ALFIwTS6Cr1IUjp384w 1 1 0 0 498b 249b 249b
green open .internal.alerts-observability.threshold.alerts-default-000001 UzpYLZbzTMyK2yCYnmayKw 1 1 0 0 498b 249b 249b
green open .kibana-observability-ai-assistant-kb-000001 yV9I-sIMQgyf0edNxw1kPA 1 1 0 0 498b 249b 249b
green open .internal.alerts-observability.apm.alerts-default-000001 0HOJ4bCgT_2X4c29FryFCw 1 1 0 0 498b 249b 249b
green open .internal.alerts-stack.alerts-default-000001 NJCmcGptQOWG6rd0I259Uw 1 1 0 0 498b 249b 249b
green open .internal.alerts-observability.slo.alerts-default-000001 rSSFxAYfR0O9L9XXBIpZlA 1 1 0 0 498b 249b 249b
green open .internal.alerts-ml.anomaly-detection.alerts-default-000001 fT9FJoirRiSMVXx1V3F7dQ 1 1 0 0 498b 249b 249b
green open .internal.alerts-observability.metrics.alerts-default-000001 E9vjU7WETSebjg8Y_ddPHw 1 1 0 0 498b 249b 249b

Check master election
GET /_cat/master?v
id host ip node
5hG_mSEjRd6Ov-rClowAoQ 172.31.24.61 172.31.24.61 vp-elk-2

Exactly one master is normal.
Check JVM heap
GET /_cat/nodes?v&h=name,heap.percent
name heap.percent
vp-elk-3 20
vp-elk-2 48
vp-elk-1 57

Check disk usage
GET /_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node node.role
20 2.8mb 14.1gb 1009.7gb 1023.9gb 1 172.31.29.164 172.31.29.164 vp-elk-1 cdfhilmrstw
20 754.3kb 11gb 1012.8gb 1023.9gb 1 172.31.25.106 172.31.25.106 vp-elk-3 cdfhilmrstw
19 2.9mb 11gb 1012.8gb 1023.9gb 1 172.31.24.61 172.31.24.61 vp-elk-2 cdfhilmrstw

Check thread pools
GET /_cat/thread_pool?v
node_name name active queue rejected
vp-elk-3 analyze 0 0 0
vp-elk-3 auto_complete 0 0 0
vp-elk-3 azure_event_loop 0 0 0
vp-elk-3 ccr 0 0 0
vp-elk-3 cluster_coordination 0 0 0
...

Watch the queue and rejected columns: values greater than 0 indicate the cluster is overloaded.
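To focus on the pools that matter most under load, _cat/thread_pool accepts a pool list and a column filter (a handy variant, not shown in the original output above):

```
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected
```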
Check pending tasks

GET /_cluster/pending_tasks?pretty
{
"tasks": []
}

Check certificates
GET /_ssl/certificates
[
{
"path": "/usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12",
"format": "PKCS12",
"alias": "ca",
"subject_dn": "CN=Elastic Certificate Tool Autogenerated CA",
"serial_number": "825a4f350b8815940e60d557036edbe205f68a93",
"has_private_key": false,
"expiry": "2029-03-11T13:21:08.000Z",
"issuer": "CN=Elastic Certificate Tool Autogenerated CA"
},
{
"path": "/usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12",
"format": "PKCS12",
"alias": "vp-elk-1",
"subject_dn": "CN=vp-elk-1",
"serial_number": "220b64fa6c86b2145a86c274eb914f9e3b299350",
"has_private_key": true,
"expiry": "2029-03-12T01:29:26.000Z",
"issuer": "CN=Elastic Certificate Tool Autogenerated CA"
},
{
"path": "/usr/share/elasticsearch/config/certs/vp-elk-1/vp-elk-1.p12",
"format": "PKCS12",
"alias": "vp-elk-1",
"subject_dn": "CN=Elastic Certificate Tool Autogenerated CA",
"serial_number": "825a4f350b8815940e60d557036edbe205f68a93",
"has_private_key": false,
"expiry": "2029-03-11T13:21:08.000Z",
"issuer": "CN=Elastic Certificate Tool Autogenerated CA"
}
]

View the cluster's global settings
GET /_cluster/settings?include_defaults
This prints all of the Elasticsearch cluster's settings, including the defaults.
View index templates
GET /_index_template?pretty
Here you can inspect number_of_shards and number_of_replicas; both default to 1.
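To override those defaults for your own indices, a composable index template can set them per pattern (a sketch; the name and pattern here mirror the Filebeat example later in this post):

```
PUT /_index_template/logstash-admin
{
  "index_patterns": ["logstash-admin-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}
```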
Common errors
Elasticsearch exited unexpectedly, with exit code 137
In Docker, this is almost always (roughly 90% of cases) the system OOM killer terminating the process for lack of memory. Check mem_limit: 8g against ES_JAVA_OPTS=-Xms8g -Xmx8g: the JVM heap must stay well below the container's memory limit, because Elasticsearch also needs off-heap memory.
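A corrected pairing might look like this (the roughly-50% heap rule of thumb is general Elasticsearch guidance, not from the original post):

```yaml
services:
  elasticsearch:
    mem_limit: 8g
    environment:
      # Heap at ~50% of the container limit leaves room for off-heap
      # memory; -Xms8g -Xmx8g inside an 8g mem_limit invites exit code 137.
      - ES_JAVA_OPTS=-Xms4g -Xmx4g
```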
Summary of issues when Filebeat ships data to Elasticsearch
Filebeat errors when shipping data to Elasticsearch
Applicable versions:
- filebeat 7
- elasticsearch 7
Filebeat 7.5.2 reports the following error when shipping data to Elasticsearch:
journalctl -f -u filebeat
The cause is that the number of open shards in the Elasticsearch cluster has exceeded the cluster's maximum shard limit. Each index consists of one or more shards, and the cluster enforces a cap on the total number of open shards to prevent performance problems from shard sprawl.
The error message {"type":"illegal_argument_exception","reason":"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [6924]/[3000] maximum shards open;"} shows that the cluster already has 6924 open shards, exceeding the limit of 3000 (the default cluster.max_shards_per_node of 1000 multiplied by the 3 data nodes).
Several options can resolve this:
Adjust the Elasticsearch cluster settings to raise the maximum shard limit
The limit can be raised via the Elasticsearch settings. Note, however, that more shards can cause performance problems, especially if hardware resources are limited.
This is done by changing the cluster.max_shards_per_node setting:

PUT /_cluster/settings
{
"persistent": {
"cluster.max_shards_per_node": "新的分片数限制"
}
}

To read the cluster's current maximum shard limit:
curl -X GET "http://[your_elasticsearch_host]:9200/_cluster/settings?include_defaults=true&pretty"
Delete unnecessary indices: if some indices are no longer needed, deleting them reduces the shard count.
curl -X DELETE "localhost:9200/my_index"
curl -X DELETE "localhost:9200/logstash-2021.11.*"合并一些小索引:如果有很多小的索引,可以考虑将它们合并为更大的索引,以减少总分片数。
Optimize the sharding strategy of existing indices, for example by reducing each index's number of primary shards; see the shrink sketch below.
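A sketch of that last option using the _shrink API (my_index and node-1 are placeholders; the source index must first be made read-only with a copy of every shard on one node, and the new primary count must be a factor of the old one):

```
PUT /my_index/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "node-1"
}

POST /my_index/_shrink/my_index-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}
```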
Filebeat errors
Filebeat reports an error when configured to ship data to Elasticsearch
Applicable versions:
- filebeat 7
- elasticsearch 7
Using the following Filebeat configuration file
filebeat.inputs:
Filebeat fails after startup and the corresponding index is never created in Elasticsearch. The key error: Failed to connect to backoff(elasticsearch(http://1.57.115.214:9200)): Connection marked as failed because the onConnect callback failed: resource 'filebeat-7.5.2' exists, but it is not an alias
journalctl -f -u filebeat
This indicates that Filebeat cannot connect to the Elasticsearch cluster properly. The main possible causes are:
- Index/alias conflict: Filebeat tries to create or use an index or alias named filebeat-7.5.2, but a resource with that name already exists in Elasticsearch and is not an alias. The fix is to delete or rename the conflicting index, as shown below.
- ILM configuration problem
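A hedged example of clearing the conflict (destructive; first confirm the filebeat-7.5.2 index holds nothing you need):

```
curl -X GET "localhost:9200/_cat/indices/filebeat-7.5.2?v"
curl -X DELETE "localhost:9200/filebeat-7.5.2"
```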
With this configuration, after the index/alias conflict is resolved, Filebeat runs normally, but Elasticsearch never creates the configured logstash-admin-* index; the data is written to filebeat-7.5.2-* instead. This is caused by ILM, which overrides the output index; disable ILM to fix it. Reference configuration with ILM disabled (setup.ilm.enabled: false), /etc/filebeat/filebeat.yml:

filebeat.inputs:
- type: log
paths:
- /home/logs/laravel-2023*
tags: ["admin-log"]
close_timeout: 3h
clean_inactive: 72h
ignore_older: 70h
close_inactive: 5m
output.elasticsearch:
hosts: ["1.56.219.122:9200", "1.57.115.214:9200", "1.52.53.31:9200"]
username: "elastic"
password: "passwd"
index: "logstash-admin-%{+yyyy.MM.dd}"
setup.ilm.enabled: false
setup.template.enabled: true
setup.template.name: "logstash-admin"
setup.template.pattern: "logstash-admin-*"
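After restarting Filebeat with this configuration, the expected daily indices can be verified (a quick check using the credentials from the config above):

```
curl -u elastic:passwd "http://1.56.219.122:9200/_cat/indices/logstash-admin-*?v"
```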