# Kubernetes Log Management Best Practices with EFK

### 1. What logs do Kubernetes and Docker produce?

![](https://user-gold-cdn.xitu.io/2019/10/7/16da579193c24644?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)

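* All of the logs above are defaults: this is what you get when logging is not configured at all.
* `kubectl logs` and `docker logs` are equivalent: both show the logs of the application running inside the container.
* The stdout and stderr output of an application in a container is always captured by the container engine. If the application code also saves logs to a file inside the container, there will be two copies: the raw log (generated by the code, inside the project) and a processed copy, for example the one the json-file driver stores in /var/lib/docker/containers/<container-id>/<container-id>-json.log.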
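
To make the json-file format more concrete, here is a minimal Python sketch that decodes one such line. The sample line is the one quoted in the comments of the fluentd ConfigMap later in this article; the function name `parse_json_file_line` is hypothetical and only serves the illustration.

```python
import json
from datetime import datetime, timezone


def parse_json_file_line(line: str) -> dict:
    """Decode one line written by Docker's json-file logging driver.

    Each line is a JSON object with three keys: "log" (the raw text the
    application wrote), "stream" (stdout or stderr) and "time" (UTC timestamp).
    """
    record = json.loads(line)
    # Docker emits nanosecond precision; datetime only keeps microseconds,
    # so trim the fractional part to six digits before building the value.
    date_part, _, frac = record["time"].rstrip("Z").partition(".")
    stamp = datetime.strptime(date_part, "%Y-%m-%dT%H:%M:%S")
    micros = int(frac[:6].ljust(6, "0")) if frac else 0
    record["time"] = stamp.replace(microsecond=micros, tzinfo=timezone.utc)
    return record


sample = ('{"log":"2014/09/25 21:15:03 Got request with path wombat\\n",'
          '"stream":"stderr","time":"2014-09-25T21:15:03.499185026Z"}')
parsed = parse_json_file_line(sample)
print(parsed["stream"], repr(parsed["log"]), parsed["time"].isoformat())
```

The application's own line survives unchanged in the `log` field; the engine only adds the `stream` and `time` envelope around it.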

### 2. Kubernetes log management best practices with EFK (collection)

![](https://user-gold-cdn.xitu.io/2019/10/7/16da4f61f51ff106?imageView2/0/w/1280/h/960/format/webp/ignore-error/1)

The manifests for the three EFK components on Kubernetes:

```
https://github.com/kubernetes/kubernetes/tree/v1.16.1/cluster/addons/fluentd-elasticsearch
```

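In fact, you only need to install fluentd-es in your k8s cluster. Elasticsearch and Kibana can be deployed separately, and if the budget allows you can simply buy a managed offering from a cloud vendor. In that case, modify the output section of the fluentd-es ConfigMap: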

```
output.conf: |-
    <match **>
      @id elasticsearch
      @type elasticsearch
      ...
      ...
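      host elasticsearch-logging # replace with the host of the Elasticsearch outside the cluster
      port 9200
      user elastic # only if account authentication is enabled
      password elastic # only if account authentication is enabled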
      ...
      ...
    </match>
```

> Reference:\
> [www.cnblogs.com/cocowool/p/…](https://link.juejin.im?target=https%3A%2F%2Fwww.cnblogs.com%2Fcocowool%2Fp%2FDocker_Kubernetes_Log_Location.html)

### 3. Troubleshooting

**1. The fluentd-es buffer**

Check the `fluentd-es` pod logs; there is a `[warn]` entry:

```
[elasticsearch] failed to write data into buffer by buffer overflow action=:block\n
```

Consider the collection flow:

[Figure: log collection flow]

We can modify `buffer.chunk_limit_size` in the `fluentd-es` ConfigMap:

```
output.conf: |-
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      type_name _doc
      include_tag_key true
      host es-cn-36200000.public.elasticsearch.aliyuncs.com
      port 9200
      user elastic
      password elastic
      logstash_format true
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        ...
        chunk_limit_size 100M # raised from the original 2M to 100M; 2M is usually enough, this is only for testing
        ...
        overflow_action block # this `block` is the overflow action reported in the warning
      </buffer>
    </match>
```

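> *If a large backlog of unprocessed logs builds up in the buffer, the most likely cause is a performance bottleneck in `elasticsearch`: it cannot ingest new logs in real time and needs to be tuned.*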
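
As a rough back-of-the-envelope check, the amount of data the file buffer can hold before `overflow_action` applies can be estimated from the buffer settings. The Python sketch below assumes the limit is `chunk_limit_size` multiplied by `queue_limit_length` (the Fluentd documentation describes `queue_limit_length` as a backward-compatible way of expressing `total_limit_size`; verify this against your Fluentd version). The value 8 comes from the demo ConfigMap at the end of this article, and the function names are hypothetical.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a Fluentd-style size such as '2M' or '100M' to bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def buffer_capacity(chunk_limit_size: str, queue_limit_length: int) -> int:
    """Estimated buffer size in bytes (assumption: chunk size x queue length)."""
    return parse_size(chunk_limit_size) * queue_limit_length


for size in ("2M", "100M"):
    mib = buffer_capacity(size, 8) // 1024 ** 2
    print(f"chunk_limit_size {size}: about {mib} MiB before overflow")
```

Under that assumption, a larger chunk size mainly buys time: if Elasticsearch stays slower than the incoming log rate, the bigger buffer also fills up eventually, and it occupies correspondingly more disk space under `/var/log/fluentd-buffers`.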

**2. The fluentd-es pod repeatedly collects its own logs**

Check the `fluentd-es` pod logs; there are many `[warn]` entries:

[Figure: screenshot of the `[warn]` entries]

They contain many `\\\` escape sequences, which suggests that the same log file has been collected multiple times. Why would it be collected more than once? Looking closely at the entry, the log being recorded is fluentd-es's own log. fluentd-es collects its own log, then writes what it collected back into its own log, and the cycle continues until memory is exhausted and the pod dies.

We need to modify the `fluentd-es` ConfigMap again and use `exclude_path` to exclude this log file:

```
containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      exclude_path /var/log/containers/fluentd-es*.log # exclude fluentd-es's own logs from collection
      ...
      <parse>
        ..
      </parse>
    </source>
```
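
As for why the `\\\` sequences pile up in this feedback loop: a plausible explanation is that every time an already JSON-encoded line is wrapped into a new JSON log record, each quote and backslash is escaped once more. The Python sketch below only illustrates that effect; it is not fluentd's actual code, and the sample message and function name are made up.

```python
import json


def backslashes_per_round(message: str, rounds: int) -> list:
    """Re-wrap a log line in a JSON record `rounds` times and count backslashes."""
    counts = []
    line = message
    for _ in range(rounds):
        # Each pass mimics the line being written to a json-file log record,
        # read back by the collector, and printed into its own log again.
        line = json.dumps({"log": line})
        counts.append(line.count("\\"))
    return counts


# Hypothetical warning text containing two double quotes.
message = 'dump an error event: error="log does not exist"'
counts = backslashes_per_round(message, 4)
print(counts)
```

The count grows faster than linearly with each round, which matches the symptom of ever longer runs of escape characters until the pod runs out of memory.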

**3. Cloud-vendor Elasticsearch**

Check the `fluentd-es` pod logs; there are many `[warn]` entries:

```
dump an error event: error_class=ArgumentError error=\"log does not exist\"
```

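Managed `elasticsearch` offerings from cloud vendors usually ship with the "allow automatic index creation" option disabled; it needs to be enabled.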
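
Vendors typically offer a console switch for this. If yours also exposes the Elasticsearch cluster settings API, automatic index creation is governed by the `action.auto_create_index` setting. The Python sketch below only builds the request and does not send it. The host, credentials, and function name are placeholders, and whether your vendor permits changing this setting through the API is an assumption to check against their documentation.

```python
import base64
import json
import urllib.request


def build_auto_create_index_request(host: str, user: str, password: str,
                                    port: int = 9200) -> urllib.request.Request:
    """Build (but do not send) a PUT /_cluster/settings request."""
    body = json.dumps({"persistent": {"action.auto_create_index": "true"}})
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"http://{host}:{port}/_cluster/settings",
        data=body.encode(),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values; replace them with your own endpoint and credentials.
req = build_auto_create_index_request("elasticsearch-logging", "elastic", "elastic")
print(req.get_method(), req.full_url)
# To actually apply the setting: urllib.request.urlopen(req)
```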

**4. Troubleshooting any other problem**

Set `@log_level` to `debug`:

```
output.conf: |-
    <match **>
      @id elasticsearch
      @type elasticsearch
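      @log_level debug # switch back to info once testing shows no problems
      ...
      ...
      host elasticsearch-logging # replace with the host of the Elasticsearch outside the cluster
      port 9200
      user elastic # only if account authentication is enabled
      password elastic # only if account authentication is enabled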
      ...
      ...
    </match>
```

Then check the `fluentd-es` pod logs and fix the problems according to the debug output.

### 4. fluentd-es ConfigMap demo

```
kind: ConfigMap
apiVersion: v1
metadata:
  name: fluentd-es-config-v0.2.0
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
data:
  system.conf: |-
    <system>
      root_dir /tmp/fluentd-buffers/
    </system>

  containers.input.conf: |-
    # This configuration file for Fluentd / td-agent is used
    # to watch changes to Docker log files. The kubelet creates symlinks that
    # capture the pod name, namespace, container name & Docker container ID
    # to the docker logs for pods in the /var/log/containers directory on the host.
    # If running this fluentd configuration in a Docker container, the /var/log
    # directory should be mounted in the container.
    #
    # These logs are then submitted to Elasticsearch which assumes the
    # installation of the fluent-plugin-elasticsearch & the
    # fluent-plugin-kubernetes_metadata_filter plugins.
    # See https://github.com/uken/fluent-plugin-elasticsearch &
    # https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter for
    # more information about the plugins.
    #
    # Example
    # =======
    # A line in the Docker log file might look like this JSON:
    #
    # {"log":"2014/09/25 21:15:03 Got request with path wombat\n",
    #  "stream":"stderr",
    #   "time":"2014-09-25T21:15:03.499185026Z"}
    #
    # The time_format specification below makes sure we properly
    # parse the time format produced by Docker. This will be
    # submitted to Elasticsearch and should appear like:
    # $ curl 'http://elasticsearch-logging:9200/_search?pretty'
    # ...
    # {
    #      "_index" : "logstash-2014.09.25",
    #      "_type" : "fluentd",
    #      "_id" : "VBrbor2QTuGpsQyTCdfzqA",
    #      "_score" : 1.0,
    #      "_source":{"log":"2014/09/25 22:45:50 Got request with path wombat\n",
    #                 "stream":"stderr","tag":"docker.container.all",
    #                 "@timestamp":"2014-09-25T22:45:50+00:00"}
    #    },
    # ...
    #
    # The Kubernetes fluentd plugin is used to write the Kubernetes metadata to the log
    # record & add labels to the log record if properly configured. This enables users
    # to filter & search logs on any metadata.
    # For example a Docker container's logs might be in the directory:
    #
    #  /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b
    #
    # and in the file:
    #
    #  997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
    #
    # where 997599971ee6... is the Docker ID of the running container.
    # The Kubernetes kubelet makes a symbolic link to this file on the host machine
    # in the /var/log/containers directory which includes the pod name and the Kubernetes
    # container name:
    #
    #    synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #    ->
    #    /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
    #
    # The /var/log directory on the host is mapped to the /var/log directory in the container
    # running this instance of Fluentd and we end up collecting the file:
    #
    #   /var/log/containers/synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #
    # This results in the tag:
    #
    #  var.log.containers.synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #
    # The Kubernetes fluentd plugin is used to extract the namespace, pod name & container name
    # which are added to the log message as a kubernetes field object & the Docker container ID
    # is also added under the docker field object.
    # The final tag is:
    #
    #   kubernetes.var.log.containers.synthetic-logger-0.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log
    #
    # And the final log record looks like:
    #
    # {
    #   "log":"2014/09/25 21:15:03 Got request with path wombat\n",
    #   "stream":"stderr",
    #   "time":"2014-09-25T21:15:03.499185026Z",
    #   "kubernetes": {
    #     "namespace": "default",
    #     "pod_name": "synthetic-logger-0.25lps-pod",
    #     "container_name": "synth-lgr"
    #   },
    #   "docker": {
    #     "container_id": "997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b"
    #   }
    # }
    #
    # This makes it easier for users to search for logs by pod name or by
    # the name of the Kubernetes container regardless of how many times the
    # Kubernetes pod has been restarted (resulting in several Docker container IDs).

    # Json Log Example:
    # {"log":"[info:2016-02-16T16:04:05.930-08:00] Some log text here\n","stream":"stdout","time":"2016-02-17T00:04:05.931087621Z"}
    # CRI Log Example:
    # 2016-02-17T00:04:05.931087621Z stdout F [info:2016-02-16T16:04:05.930-08:00] Some log text here
    <source>
      @id fluentd-containers.log
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      exclude_path /var/log/containers/fluentd-es*.log
      tag raw.kubernetes.*
      read_from_head true
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>

    # Detect exceptions in the log output and forward them as one log entry.
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>

    # Concatenate multi-line logs
    <filter **>
      @id filter_concat
      @type concat
      key message
      multiline_end_regexp /\n$/
      separator ""
    </filter>

    # Enriches records with Kubernetes metadata
    <filter kubernetes.**>
      @id filter_kubernetes_metadata
      @type kubernetes_metadata
    </filter>

    # Fixes json fields in Elasticsearch
    <filter kubernetes.**>
      @id filter_parser
      @type parser
      key_name log
      reserve_data true
      remove_key_name_field true
      <parse>
        @type multi_format
        <pattern>
          format json
        </pattern>
        <pattern>
          format none
        </pattern>
      </parse>
    </filter>

  system.input.conf: |-
    # Example:
    # 2015-12-21 23:17:22,066 [salt.state       ][INFO    ] Completed state [net.ipv4.ip_forward] at time 23:17:22.066081
    <source>
      @id minion
      @type tail
      format /^(?<time>[^ ]* [^ ,]*)[^\[]*\[[^\]]*\]\[(?<severity>[^ \]]*) *\] (?<message>.*)$/
      time_format %Y-%m-%d %H:%M:%S
      path /var/log/salt/minion
      pos_file /var/log/salt.pos
      tag salt
    </source>

    # Example:
    # Dec 21 23:17:22 gke-foo-1-1-4b5cbd14-node-4eoj startupscript: Finished running startup script /var/run/google.startup.script
    # <source>
      # @id startupscript.log
      # @type tail
      # format syslog
      # path /var/log/startupscript.log
      # pos_file /var/log/es-startupscript.log.pos
      # tag startupscript
    # </source>

    # Examples:
    # time="2016-02-04T06:51:03.053580605Z" level=info msg="GET /containers/json"
    # time="2016-02-04T07:53:57.505612354Z" level=error msg="HTTP Error" err="No such image: -f" statusCode=404
    # TODO(random-liu): Remove this after cri container runtime rolls out.
    # <source>
      # @id docker.log
      # @type tail
      # format /^time="(?<time>[^)]*)" level=(?<severity>[^ ]*) msg="(?<message>[^"]*)"( err="(?<error>[^"]*)")?( statusCode=(?<status_code>\d+))?/
      # path /var/log/docker.log
      # pos_file /var/log/es-docker.log.pos
      # tag docker
    # </source>

    # Example:
    # 2016/02/04 06:52:38 filePurge: successfully removed file /var/etcd/data/member/wal/00000000000006d0-00000000010a23d1.wal
    # <source>
      # @id etcd.log
      # @type tail
      # # Not parsing this, because it doesn't have anything particularly useful to
      # # parse out of it (like severities).
      # format none
      # path /var/log/etcd.log
      # pos_file /var/log/es-etcd.log.pos
      # tag etcd
    # </source>

    # Multi-line parsing is required for all the kube logs because very large log
    # statements, such as those that include entire object bodies, get split into
    # multiple lines by glog.

    # Example:
    # I0204 07:32:30.020537    3368 server.go:1048] POST /stats/container/: (13.972191ms) 200 [[Go-http-client/1.1] 10.244.1.3:40537]
    # <source>
      # @id kubelet.log
      # @type tail
      # format multiline
      # multiline_flush_interval 5s
      # format_firstline /^\w\d{4}/
      # format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      # time_format %m%d %H:%M:%S.%N
      # path /var/log/kubelet.log
      # pos_file /var/log/es-kubelet.log.pos
      # tag kubelet
    # </source>

    # Example:
    # I1118 21:26:53.975789       6 proxier.go:1096] Port "nodePort for kube-system/default-http-backend:http" (:31429/tcp) was open before and is still needed
    # <source>
      # @id kube-proxy.log
      # @type tail
      # format multiline
      # multiline_flush_interval 5s
      # format_firstline /^\w\d{4}/
      # format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      # time_format %m%d %H:%M:%S.%N
      # path /var/log/kube-proxy.log
      # pos_file /var/log/es-kube-proxy.log.pos
      # tag kube-proxy
    # </source>

    # Example:
    # I0204 07:00:19.604280       5 handlers.go:131] GET /api/v1/nodes: (1.624207ms) 200 [[kube-controller-manager/v1.1.3 (linux/amd64) kubernetes/6a81b50] 127.0.0.1:38266]
    # <source>
      # @id kube-apiserver.log
      # @type tail
      # format multiline
      # multiline_flush_interval 5s
      # format_firstline /^\w\d{4}/
      # format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      # time_format %m%d %H:%M:%S.%N
      # path /var/log/kube-apiserver.log
      # pos_file /var/log/es-kube-apiserver.log.pos
      # tag kube-apiserver
    # </source>

    # Example:
    # I0204 06:55:31.872680       5 servicecontroller.go:277] LB already exists and doesn't need update for service kube-system/kube-ui
    # <source>
      # @id kube-controller-manager.log
      # @type tail
      # format multiline
      # multiline_flush_interval 5s
      # format_firstline /^\w\d{4}/
      # format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      # time_format %m%d %H:%M:%S.%N
      # path /var/log/kube-controller-manager.log
      # pos_file /var/log/es-kube-controller-manager.log.pos
      # tag kube-controller-manager
    # </source>

    # Example:
    # W0204 06:49:18.239674       7 reflector.go:245] pkg/scheduler/factory/factory.go:193: watch of *api.Service ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [2578313/2577886]) [2579312]
    # <source>
      # @id kube-scheduler.log
      # @type tail
      # format multiline
      # multiline_flush_interval 5s
      # format_firstline /^\w\d{4}/
      # format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      # time_format %m%d %H:%M:%S.%N
      # path /var/log/kube-scheduler.log
      # pos_file /var/log/es-kube-scheduler.log.pos
      # tag kube-scheduler
    # </source>

    # Example:
    # I0603 15:31:05.793605       6 cluster_manager.go:230] Reading config from path /etc/gce.conf
    # <source>
      # @id glbc.log
      # @type tail
      # format multiline
      # multiline_flush_interval 5s
      # format_firstline /^\w\d{4}/
      # format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      # time_format %m%d %H:%M:%S.%N
      # path /var/log/glbc.log
      # pos_file /var/log/es-glbc.log.pos
      # tag glbc
    # </source>

    # Example:
    # I0603 15:31:05.793605       6 cluster_manager.go:230] Reading config from path /etc/gce.conf
    # <source>
      # @id cluster-autoscaler.log
      # @type tail
      # format multiline
      # multiline_flush_interval 5s
      # format_firstline /^\w\d{4}/
      # format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<message>.*)/
      # time_format %m%d %H:%M:%S.%N
      # path /var/log/cluster-autoscaler.log
      # pos_file /var/log/es-cluster-autoscaler.log.pos
      # tag cluster-autoscaler
    # </source>

    # Logs from systemd-journal for interesting services.
    # TODO(random-liu): Remove this after cri container runtime rolls out.
    <source>
      @id journald-docker
      @type systemd
      matches [{ "_SYSTEMD_UNIT": "docker.service" }]
      <storage>
        @type local
        persistent true
        path /var/log/journald-docker.pos
      </storage>
      read_from_head true
      tag docker
    </source>

    <source>
      @id journald-container-runtime
      @type systemd
      matches [{ "_SYSTEMD_UNIT": "{{ fluentd_container_runtime_service }}.service" }]
      <storage>
        @type local
        persistent true
        path /var/log/journald-container-runtime.pos
      </storage>
      read_from_head true
      tag container-runtime
    </source>

    <source>
      @id journald-kubelet
      @type systemd
      matches [{ "_SYSTEMD_UNIT": "kubelet.service" }]
      <storage>
        @type local
        persistent true
        path /var/log/journald-kubelet.pos
      </storage>
      read_from_head true
      tag kubelet
    </source>

    <source>
      @id journald-node-problem-detector
      @type systemd
      matches [{ "_SYSTEMD_UNIT": "node-problem-detector.service" }]
      <storage>
        @type local
        persistent true
        path /var/log/journald-node-problem-detector.pos
      </storage>
      read_from_head true
      tag node-problem-detector
    </source>

    <source>
      @id kernel
      @type systemd
      matches [{ "_TRANSPORT": "kernel" }]
      <storage>
        @type local
        persistent true
        path /var/log/kernel.pos
      </storage>
      <entry>
        fields_strip_underscores true
        fields_lowercase true
      </entry>
      read_from_head true
      tag kernel
    </source>

  forward.input.conf: |-
    # Takes the messages sent over TCP
    <source>
      @id forward
      @type forward
    </source>

  monitoring.conf: |-
    # Prometheus Exporter Plugin
    # input plugin that exports metrics
    <source>
      @id prometheus
      @type prometheus
    </source>

    <source>
      @id monitor_agent
      @type monitor_agent
    </source>

    # input plugin that collects metrics from MonitorAgent
    <source>
      @id prometheus_monitor
      @type prometheus_monitor
      <labels>
        host ${hostname}
      </labels>
    </source>

    # input plugin that collects metrics for output plugin
    <source>
      @id prometheus_output_monitor
      @type prometheus_output_monitor
      <labels>
        host ${hostname}
      </labels>
    </source>

    # input plugin that collects metrics for in_tail plugin
    <source>
      @id prometheus_tail_monitor
      @type prometheus_tail_monitor
      <labels>
        host ${hostname}
      </labels>
    </source>

  output.conf: |-
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level debug
      type_name doc
      include_tag_key true
      host es-cn-11111.public.elasticsearch.aliyuncs.com
      port 9200
      user elastic
      password 111111
      logstash_format true
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 100M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
```
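
The long comment in `containers.input.conf` describes how the kubelet encodes the pod name, namespace, container name, and container ID in each symlink under `/var/log/containers`. As a rough illustration of that naming scheme (not the actual code of the kubernetes_metadata filter), the Python sketch below splits the example file name from that comment. The regular expression and the function name are assumptions made for this example.

```python
import re

# <pod>_<namespace>_<container>-<64-hex-char container id>.log
SYMLINK_RE = re.compile(
    r"^(?P<pod_name>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container_name>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_container_log_name(filename: str) -> dict:
    """Split a /var/log/containers file name into its metadata fields."""
    match = SYMLINK_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected log file name: {filename}")
    return match.groupdict()


name = ("synthetic-logger-0.25lps-pod_default_synth-lgr-"
        "997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log")
fields = parse_container_log_name(name)
print(fields["namespace"], fields["pod_name"], fields["container_name"])
```

These are the same fields that end up under the `kubernetes` and `docker` keys of the final record shown in the comment, which is what makes searching by pod or container name possible.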

DaemonSet

```
https://raw.githubusercontent.com/kubernetes/kubernetes/v1.16.1/cluster/addons/fluentd-elasticsearch/fluentd-es-ds.yaml
```

