Kafka Cluster Broker-Side Parameter Settings and Tuning

Distributed streaming platform

Apache Kafka® is a distributed streaming platform. What exactly does that mean?
A streaming platform has three key capabilities:
   -  Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
   -  Store streams of records in a fault-tolerant durable way.
   -  Process streams of records as they occur.

Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:

- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them 
is effectively acting as a storage system for the in-flight messages. What is 
different about Kafka is that it is a very good storage system.

- Data written to Kafka is written to disk and replicated for fault-tolerance. 
Kafka allows producers to wait on acknowledgement so that a write isn't considered
complete until it is fully replicated and guaranteed to persist even if the server 
written to fails.

- The disk structures Kafka uses scale well. Kafka will perform the same whether you 
have 50 KB or 50 TB of persistent data on the server.

- As a result of taking storage seriously and allowing the clients to control 
their read position, you can think of Kafka as a kind of special purpose 
distributed filesystem dedicated to high-performance, low-latency commit 
log storage, replication, and propagation.

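The Secret Behind Kafka's High Throughput

Suppose a user program needs to send the contents of a file over the network. The program runs in user space, while the file and the network socket are hardware resources; kernel space sits between the two.

Since Linux kernel 2.2, a system call mechanism known as "zero-copy" has been available. It skips the copy through the user buffer by establishing a direct mapping between disk space and memory, so the data is no longer copied into a user-space buffer.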
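
As a minimal sketch of the idea, the Python snippet below sends a temporary file through a local socket pair with `socket.sendfile`, which delegates to `os.sendfile` where the platform supports it and otherwise falls back to ordinary `send` calls. The helper names, the payload, and the socket pair are illustrative assumptions. Kafka itself runs on the JVM and relies on Java's `FileChannel.transferTo` for the same purpose, so this shows the mechanism rather than Kafka's code.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile uses os.sendfile when available, so the kernel moves
        # the bytes without staging them in a user-space buffer.
        return sock.sendfile(f)


def demo():
    payload = b"kafka-log-segment-bytes" * 100  # stand-in for a log segment
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()  # hypothetical local "network"
    try:
        sent = send_file_zero_copy(path, sender)
        sender.close()  # signal end of stream to the receiver
        received = bytearray()
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received.extend(chunk)
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
    return sent, len(payload), bytes(received) == payload


if __name__ == "__main__":
    print(demo())
```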

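A Kafka queue (topic) is divided into multiple partitions, and each partition is further divided into multiple segments, so the messages of one queue are actually stored across many segment files. Because of this segmentation, every file operation works on a small file, which increases the capacity for parallel processing.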
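
The segment layout can be sketched as a lookup from a message offset to the file that holds it. Kafka names each segment file after the first offset it contains, zero-padded to 20 digits; the function name and the sample offsets below are hypothetical.

```python
import bisect


def segment_file_for(offset, base_offsets):
    """Return the name of the segment file that holds `offset`.

    `base_offsets` is the sorted list of the first offset of each segment.
    """
    idx = bisect.bisect_right(base_offsets, offset) - 1
    if idx < 0:
        raise ValueError(f"offset {offset} is older than the first segment")
    # Segment files are named after their base offset, padded to 20 digits.
    return f"{base_offsets[idx]:020d}.log"


base_offsets = [0, 1000, 2500]  # hypothetical partition with three segments
print(segment_file_for(1500, base_offsets))
```

A binary search over the base offsets keeps the lookup cheap even when a partition has many segments.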

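Kafka allows messages to be sent in batches: messages are first buffered in memory and then sent out together in a single request. For example, the producer can send once the buffered messages reach a certain amount, or once they have been buffered for a fixed time, such as every 100 messages or every 5 seconds. This strategy greatly reduces the number of I/O operations on the server side.
Kafka also supports compressing message sets. The producer can compress a message set with GZIP, Snappy, or LZ4. The benefit of compression is a smaller volume of data to transfer, which eases the load on the network.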
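
To illustrate why compressing a whole batch pays off, the sketch below compares gzip applied to each message separately with gzip applied to the batch as one unit. It uses only the Python standard library and a made-up set of 100 similar events; it is not the Kafka producer. On a real producer, batching and compression are governed by the `batch.size`, `linger.ms`, and `compression.type` settings.

```python
import gzip
import json


def compare_compression(messages):
    """Return (raw, per_message_gzip, batch_gzip) sizes in bytes."""
    encoded = [json.dumps(m).encode("utf-8") for m in messages]
    raw = sum(len(e) for e in encoded)
    # Compressing each message alone pays the gzip overhead every time.
    per_message = sum(len(gzip.compress(e)) for e in encoded)
    # Compressing the batch lets gzip exploit repetition across messages.
    batch = len(gzip.compress(b"\n".join(encoded)))
    return raw, per_message, batch


# Hypothetical batch of 100 similar events.
messages = [{"user_id": i, "event": "page_view", "page": "/home"} for i in range(100)]
raw, per_message, batch = compare_compression(messages)
print(raw, per_message, batch)
```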

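Kafka Cluster Broker-Side Global Parameter Settings

broker.id

A unique integer that identifies each broker. It must not conflict with any other broker; starting from 0 is recommended.

log.dirs <= throughput improvement

Make sure the directory has plenty of disk space. To specify multiple directories, separate them with commas, for example /xin/kafka1,/xin/kafka2. The benefit is that Kafka tries to spread partition data evenly across the directories. If the directories are mounted on separate disks, multiple disk heads can write at the same time, which can raise throughput substantially.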

zookeeper.connect

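This parameter has no default value at all and must be configured. It can also be a comma-separated list, for example zk1:2181,zk2:2181,zk3:2181/kafka. Note the trailing /kafka: it is the ZooKeeper chroot, an optional setting. If it is omitted, the ZooKeeper root path is used.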
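
The structure of the connection string can be sketched as a small parser. The function name is hypothetical and the parsing is deliberately simplified; it only illustrates how the host list and the optional chroot relate.

```python
def parse_zookeeper_connect(value):
    """Split a zookeeper.connect string into (hosts, chroot)."""
    hosts_part, slash, chroot = value.partition("/")
    hosts = [h.strip() for h in hosts_part.split(",") if h.strip()]
    # Without an explicit chroot, the ZooKeeper root path is used.
    return hosts, ("/" + chroot if slash else "/")


print(parse_zookeeper_connect("zk1:2181,zk2:2181,zk3:2181/kafka"))
```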

listeners

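Supported protocols include PLAINTEXT, SSL, and SASL_SSL. The format is [protocol]://[hostname]:[port],[[protocol]://[hostname]:[port]]. This parameter defines the listening endpoints that the broker exposes to clients. Recommended settings:

PLAINTEXT://hostname:port (security authentication not enabled)
SSL://hostname:port (security authentication enabled)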

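unclean.leader.election.enable <= data integrity guarantee

This parameter covers the case where the ISR is empty and the leader has also gone down. How should a new leader be chosen then? As of Kafka 1.0.0 the parameter defaults to false, meaning that a broker outside the ISR may not be elected leader. Between high availability and data integrity, Kafka chose the latter.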

delete.topic.enable

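Simply put, this controls whether topics may be deleted. Since 0.9.0.0 introduced ACL-based authorization, accidental deletion can largely be ruled out.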

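log.retention.{hours|minutes|ms} <= time dimension

The ms setting takes precedence, then minutes, then hours. The default retention period is 7 days. How expiry is determined:

Newer versions: based on the timestamps in the messages. Older versions: based on the last-modified time of the log file.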
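
The precedence among the three settings can be sketched as follows. The function and argument names are hypothetical; the default of 168 hours corresponds to the 7-day retention mentioned above.

```python
def effective_retention_ms(ms=None, minutes=None, hours=168):
    """Resolve the retention period in milliseconds.

    Precedence: ms first, then minutes, then hours.
    """
    if ms is not None:
        return ms
    if minutes is not None:
        return minutes * 60 * 1000
    return hours * 60 * 60 * 1000


print(effective_retention_ms())                    # only the hours default applies
print(effective_retention_ms(minutes=5, hours=1))  # minutes overrides hours
print(effective_retention_ms(ms=1000, minutes=5))  # ms overrides everything
```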

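log.retention.bytes <= space dimension

Kafka periodically deletes the oldest log segments once a partition's log grows beyond this value. The default is -1, meaning Kafka never deletes logs based on size.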

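min.insync.replicas <= used together with acks=-1

A durability setting that specifies the minimum number of replicas that must be in sync. It only has an effect when acks=all (or -1). min.insync.replicas sets the minimum number of replicas that must acknowledge a write request. If this cannot be met, the producer throws a NotEnoughReplicas or NotEnoughReplicasAfterAppend exception. The parameter is used to achieve stronger message durability.

For example:

5 brokers, acks=-1, min.insync.replicas=3

This means at least 3 replicas must be in sync before the broker accepts a write; otherwise an exception is thrown. If 3 brokers go down, the write still fails even when the remaining 2 are fully in sync and would satisfy acks=-1 on their own.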
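
The rule in the example can be sketched as a simple check. The exception class below is a stand-in defined for this illustration, not the Kafka client class, and the function is a model of the decision rather than broker code. It assumes a partition replicated on all 5 brokers.

```python
class NotEnoughReplicasError(Exception):
    """Stand-in for the error a producer would see; not the Kafka class."""


def check_write(acks, in_sync_replicas, min_insync_replicas):
    """Return True if the write is accepted, else raise NotEnoughReplicasError."""
    # min.insync.replicas is only enforced when the producer asks for acks=all/-1.
    if acks in ("all", -1) and in_sync_replicas < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"{in_sync_replicas} in-sync replicas, need {min_insync_replicas}"
        )
    return True


print(check_write(acks=-1, in_sync_replicas=5, min_insync_replicas=3))
try:
    # 3 of 5 brokers are down, so only 2 replicas remain in sync.
    check_write(acks=-1, in_sync_replicas=2, min_insync_replicas=3)
except NotEnoughReplicasError as exc:
    print("rejected:", exc)
```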

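num.network.threads <= number of request-forwarding threads

Defaults to 3. These threads mainly forward the requests sent by other brokers and by clients. Note that they only forward. In a real environment, monitor the NetworkProcessorAvgIdlePercent JMX metric; if it falls below 0.3, consider raising this value.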

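num.io.threads <= number of threads that actually process requests

Defaults to 8, that is, the broker backend has 8 threads that continuously poll for the forwarded network requests and process them in real time.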

message.max.bytes

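The largest message size the broker will accept. The default is about 977 KB, so keep in mind that production environments often need a higher value.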
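
Pulling the parameters above together, the sketch below renders a set of broker settings as server.properties text. The helper name is hypothetical, and every value is illustrative only: the defaults and examples come from the text above, while port 9092, min.insync.replicas=3, and the raised message.max.bytes are assumptions to adapt to your own cluster.

```python
def render_server_properties(settings):
    """Render a dict of broker settings as server.properties text."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Illustrative values only; tune them for the actual workload.
broker_settings = {
    "broker.id": "0",
    "log.dirs": "/xin/kafka1,/xin/kafka2",
    "zookeeper.connect": "zk1:2181,zk2:2181,zk3:2181/kafka",
    "listeners": "PLAINTEXT://hostname:9092",  # port is an assumption
    "unclean.leader.election.enable": "false",
    "delete.topic.enable": "true",
    "log.retention.hours": "168",
    "log.retention.bytes": "-1",
    "min.insync.replicas": "3",          # assumed, pairs with acks=-1
    "num.network.threads": "3",
    "num.io.threads": "8",
    "message.max.bytes": "2097152",      # assumed raised limit (2 MiB)
}

print(render_server_properties(broker_settings))
```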

Kafka Cluster Broker-Side Topic-Level Parameter Settings

delete.topic.enable
message.max.bytes
log.retention.bytes

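Operating System Parameter Settings

OS page cache flush interval <= throughput improvement

The default is 5 seconds, which is too short an interval. Increasing it moderately can greatly improve the performance of the OS's physical writes. LinkedIn sets it to 2 minutes to improve throughput.

File system choice

In the official tests, write time on the XFS file system was about 160 ms, versus roughly 250 ms on Ext4. XFS is the recommended file system for production.

OS file descriptor limit

The OS caps the number of file descriptors that can be open. For example, suppose a Kafka cluster has 3 replicas and 50 partitions, each partition holds 10 GB of data, and the log segment size within a partition is 1 GB. Each partition then has about 10 segments, so a broker needs to maintain roughly 3 × 50 × 10 = 1500 file descriptors. Set the limit according to your needs: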

ulimit -n 100000
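
The estimate above can be reproduced with a short calculation. The function name is hypothetical, and the model counts one descriptor per log segment only. Each segment also has index files and the broker holds descriptors for network connections, so the real figure is higher, which is one reason to set the limit generously.

```python
def estimate_open_segments(replicas, partitions, partition_size_gb, segment_size_gb):
    """Estimate open log segment files from the figures in the example."""
    # Ceiling division: a partially filled segment still needs a descriptor.
    segments_per_partition = -(-partition_size_gb // segment_size_gb)
    return replicas * partitions * segments_per_partition


print(estimate_open_segments(replicas=3, partitions=50,
                             partition_size_gb=10, segment_size_gb=1))
```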

OS buffer settings (not yet determined)