Apache Kafka

Kafka™用于构建实时数据流水线和流媒体应用。「Kafka™ is used for building real-time data pipelines and streaming apps.

Github stars Tracking Chart

Apache Kafka

Kafka™ 用于构建实时数据流水线和流媒体应用。 它具有水平可扩展性,容错性,出奇得快,并且在成千上万的公司中运行。

Apache Kafka™是一个分布式流媒体平台。这是什么意思?(原文:http://kafka.apache.org/documentation/#introductio...
我们认为流媒体平台有三个关键功能:
  • 它允许您发布和订阅记录流。在这方面,它类似于消息队列或企业消息系统。
  • 它允许您以容错方式存储记录流。
  • 它可以让您处理记录流。
什么是Kafka?
它被用于两大类应用程序:
  • 构建可在系统或应用程序之间可靠获取数据的实时流数据流水线
  • 构建对数据流进行变换或反应的实时流应用程序
要了解Kafka如何做这些事情,让我们从下而上潜入和探索Kafka的能力。
首先几个概念:
  • Kafka 作为一个或多个服务器上的集群运行。
  • Kafka 群集将名为主题的类别的记录流存储起来。
  • 每个记录由一个键,一个值和一个时间戳组成。
Kafka 有四个核心API:
  • Producer API 允许应用程序将记录流发布到一个或多个 Kafka 主题。
  • Consumer API 允许应用程序订阅一个或多个主题并处理为其生成的记录流。
  • Streams API 允许应用程序充当流处理器,从一个或多个主题消耗输入流,并产生输出流到一个或多个输出主题,有效地将输入流转换为输出流。
  • 连接器 API 允许构建和运行将 Kafka 主题与现有应用程序或数据系统相连接的可重复使用的生产者或消费者。例如,关系数据库的连接器可能会捕获表的每个更改。

Kafka Streams(原文段落:http://kafka.apache.org/0102/documentation/streams...

Kafka 流是用于处理和分析存储在 Kafka 中的数据的客户端库,并将所得数据写回 Kafka 或将最终输出发送到外部系统。它基于重要的流处理概念,如适当区分事件时间和处理时间,窗口支持以及简单而有效的管理应用程序状态。

Kafka 流进入门槛较低:您可以在单个机器上快速编写和运行一个小规模的概念验证;并且您只需要在多台计算机上运行其他应用程序实例,以扩展到大批量生产工作负载。 Kafka 流透过 Kafka 的并行模型透明地处理同一应用程序的多个实例的负载平衡。

Kafka 流的一些亮点:
  • 设计为一个简单而轻量级的客户端库,可以轻松地嵌入到任何 Java 应用程序中,并与任何现有的包装,部署和操作工具集成,用户可以使用它们的流媒体应用程序。
  • 没有对 Apache Kafka 本身以外的系统的内部消息层的外部依赖;特别地,它使用Kafka的分区模型来水平扩展处理,同时保持强大的排序保证。
  • 支持容错的本地状态,可实现非常快速和高效的状态操作,如窗口加入和聚合。
  • 采用一次一次的处理,以实现毫秒级的处理延迟,并支持基于事件时间的窗口操作,记录迟到。
  • 提供必要的流处理原语,以及高级 Streams DSL 和低级 Processor API。


Overview

Name With Ownerapache/kafka
Primary LanguageJava
Program languageShell (Language Count: 9)
PlatformLinux, Solaris, Unix-like, Mac
License:Apache License 2.0
Release Count239
Last Release Name3.6.2 (Posted on 2024-04-04 21:52:55)
First Release Namekafka-0.7.0-incubating-candidate-9 (Posted on 2012-11-28 08:41:58)
Created At2011-08-15 18:06:16
Pushed At2024-04-21 18:05:15
Last Commit At
Stargazers Count27.3k
Watchers Count1.1k
Fork Count13.5k
Commits Count12.7k
Has Issues Enabled
Issues Count0
Issue Open Count0
Pull Requests Count8095
Pull Requests Open Count1089
Pull Requests Close Count6586
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private

Apache Kafka

See our web site for details on the project.

You need to have Java installed.

We build and test Apache Kafka with Java 8, 11 and 17. We set the release parameter in javac and scalac
to 8 to ensure the generated binaries are compatible with Java 8 or higher (independently of the Java version
used for compilation). Java 8 support has been deprecated since Apache Kafka 3.0 and will be removed in Apache
Kafka 4.0 (see KIP-750 for more details).

Scala 2.12 and 2.13 are supported and 2.13 is used by default. Scala 2.12 support has been deprecated since
Apache Kafka 3.0 and will be removed in Apache Kafka 4.0 (see KIP-751
for more details). See below for how to use a specific Scala version or all of the supported Scala versions.

Build a jar and run it

./gradlew jar

Follow instructions in https://kafka.apache.org/quickstart

Build source jar

./gradlew srcJar

Build aggregated javadoc

./gradlew aggregatedJavadoc

Build javadoc and scaladoc

./gradlew javadoc
./gradlew javadocJar # builds a javadoc jar for each module
./gradlew scaladoc
./gradlew scaladocJar # builds a scaladoc jar for each module
./gradlew docsJar # builds both (if applicable) javadoc and scaladoc jars for each module

Run unit/integration tests

./gradlew test # runs both unit and integration tests
./gradlew unitTest
./gradlew integrationTest

Force re-running tests without code change

./gradlew cleanTest test
./gradlew cleanTest unitTest
./gradlew cleanTest integrationTest

Running a particular unit/integration test

./gradlew clients:test --tests RequestResponseTest

Running a particular test method within a unit/integration test

./gradlew core:test --tests kafka.api.ProducerFailureHandlingTest.testCannotSendToInternalTopic
./gradlew clients:test --tests org.apache.kafka.clients.MetadataTest.testMetadataUpdateWaitTime

Running a particular unit/integration test with log4j output

Change the log4j setting in either clients/src/test/resources/log4j.properties or core/src/test/resources/log4j.properties

./gradlew clients:test --tests RequestResponseTest

Specifying test retries

By default, each failed test is retried once up to a maximum of five retries per test run. Tests are retried at the end of the test task. Adjust these parameters in the following way:

./gradlew test -PmaxTestRetries=1 -PmaxTestRetryFailures=5

See Test Retry Gradle Plugin for more details.

Generating test coverage reports

Generate coverage reports for the whole project:

./gradlew reportCoverage -PenableTestCoverage=true -Dorg.gradle.parallel=false

Generate coverage for a single module, i.e.:

./gradlew clients:reportCoverage -PenableTestCoverage=true -Dorg.gradle.parallel=false

Building a binary release gzipped tar ball

./gradlew clean releaseTarGz

The release file can be found inside ./core/build/distributions/.

Building auto generated messages

Sometimes it is only necessary to rebuild the RPC auto-generated message data when switching between branches, as they could
fail due to code changes. You can just run:

./gradlew processMessages processTestMessages

Running a Kafka broker in ZooKeeper mode

./bin/zookeeper-server-start.sh config/zookeeper.properties
./bin/kafka-server-start.sh config/server.properties

Running a Kafka broker in KRaft (Kafka Raft metadata) mode

See config/kraft/README.md.

Cleaning the build

./gradlew clean

Running a task with one of the Scala versions available (2.12.x or 2.13.x)

Note that if building the jars with a version other than 2.13.x, you need to set the SCALA_VERSION variable or change it in bin/kafka-run-class.sh to run the quick start.

You can pass either the major version (eg 2.12) or the full version (eg 2.12.7):

./gradlew -PscalaVersion=2.12 jar
./gradlew -PscalaVersion=2.12 test
./gradlew -PscalaVersion=2.12 releaseTarGz

Running a task with all the scala versions enabled by default

Invoke the gradlewAll script followed by the task(s):

./gradlewAll test
./gradlewAll jar
./gradlewAll releaseTarGz

Running a task for a specific project

This is for core, examples and clients

./gradlew core:jar
./gradlew core:test

Streams has multiple sub-projects, but you can run all the tests:

./gradlew :streams:testAll

Listing all gradle tasks

./gradlew tasks

Building IDE project

Note that this is not strictly necessary (IntelliJ IDEA has good built-in support for Gradle projects, for example).

./gradlew eclipse
./gradlew idea

The eclipse task has been configured to use ${project_dir}/build_eclipse as Eclipse's build directory. Eclipse's default
build directory (${project_dir}/bin) clashes with Kafka's scripts directory and we don't use Gradle's build directory
to avoid known issues with this configuration.

Publishing the jar for all versions of Scala and for all projects to maven

The recommended command is:

./gradlewAll publish

For backwards compatibility, the following also works:

./gradlewAll uploadArchives

Please note for this to work you should create/update ${GRADLE_USER_HOME}/gradle.properties (typically, ~/.gradle/gradle.properties) and assign the following variables

mavenUrl=
mavenUsername=
mavenPassword=
signing.keyId=
signing.password=
signing.secretKeyRingFile=

Publishing the streams quickstart archetype artifact to maven

For the Streams archetype project, one cannot use gradle to upload to maven; instead the mvn deploy command needs to be called at the quickstart folder:

cd streams/quickstart
mvn deploy

Please note for this to work you should create/update user maven settings (typically, ${USER_HOME}/.m2/settings.xml) to assign the following variables

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                       https://maven.apache.org/xsd/settings-1.0.0.xsd">
...                           
<servers>
   ...
   <server>
      <id>apache.snapshots.https</id>
      <username>${maven_username}</username>
      <password>${maven_password}</password>
   </server>
   <server>
      <id>apache.releases.https</id>
      <username>${maven_username}</username>
      <password>${maven_password}</password>
    </server>
    ...
 </servers>
 ...

Installing the jars to the local Maven repository

The recommended command is:

./gradlewAll publishToMavenLocal

For backwards compatibility, the following also works:

./gradlewAll install

Building the test jar

./gradlew testJar

Determining how transitive dependencies are added

./gradlew core:dependencies --configuration runtime

Determining if any dependencies could be updated

./gradlew dependencyUpdates

Running code quality checks

There are two code quality analysis tools that we regularly run, spotbugs and checkstyle.

Checkstyle

Checkstyle enforces a consistent coding style in Kafka.
You can run checkstyle using:

./gradlew checkstyleMain checkstyleTest

The checkstyle warnings will be found in reports/checkstyle/reports/main.html and reports/checkstyle/reports/test.html files in the
subproject build directories. They are also printed to the console. The build will fail if Checkstyle fails.

Spotbugs

Spotbugs uses static analysis to look for bugs in the code.
You can run spotbugs using:

./gradlew spotbugsMain spotbugsTest -x test

The spotbugs warnings will be found in reports/spotbugs/main.html and reports/spotbugs/test.html files in the subproject build
directories. Use -PxmlSpotBugsReport=true to generate an XML report instead of an HTML one.

JMH microbenchmarks

We use JMH to write microbenchmarks that produce reliable results in the JVM.

See jmh-benchmarks/README.md for details on how to run the microbenchmarks.

Common build options

The following options should be set with a -P switch, for example ./gradlew -PmaxParallelForks=1 test.

  • commitId: sets the build commit ID as .git/HEAD might not be correct if there are local commits added for build purposes.
  • mavenUrl: sets the URL of the maven deployment repository (file://path/to/repo can be used to point to a local repository).
  • maxParallelForks: limits the maximum number of processes for each task.
  • ignoreFailures: ignore test failures from junit
  • showStandardStreams: shows standard out and standard error of the test JVM(s) on the console.
  • skipSigning: skips signing of artifacts.
  • testLoggingEvents: unit test events to be logged, separated by comma. For example ./gradlew -PtestLoggingEvents=started,passed,skipped,failed test.
  • xmlSpotBugsReport: enable XML reports for spotBugs. This also disables HTML reports as only one can be enabled at a time.
  • maxTestRetries: the maximum number of retries for a failing test case.
  • maxTestRetryFailures: maximum number of test failures before retrying is disabled for subsequent tests.
  • enableTestCoverage: enables test coverage plugins and tasks, including bytecode enhancement of classes required to track said
    coverage. Note that this introduces some overhead when running tests and hence why it's disabled by default (the overhead
    varies, but 15-20% is a reasonable estimate).
  • scalaOptimizerMode: configures the optimizing behavior of the scala compiler, the value should be one of none, method, inline-kafka or
    inline-scala (the default is inline-kafka). none is the scala compiler default, which only eliminates unreachable code. method also
    includes method-local optimizations. inline-kafka adds inlining of methods within the kafka packages. Finally, inline-scala also
    includes inlining of methods within the scala library (which avoids lambda allocations for methods like Option.exists). inline-scala is
    only safe if the Scala library version is the same at compile time and runtime. Since we cannot guarantee this for all cases (for example, users
    may depend on the kafka jar for integration tests where they may include a scala library with a different version), we don't enable it by
    default. See https://www.lightbend.com/blog/scala-inliner-optimizer for more details.

Dependency Analysis

The gradle dependency debugging documentation mentions using the dependencies or dependencyInsight tasks to debug dependencies for the root project or individual subprojects.

Alternatively, use the allDeps or allDepInsight tasks for recursively iterating through all subprojects:

./gradlew allDeps

./gradlew allDepInsight --configuration runtimeClasspath --dependency com.fasterxml.jackson.core:jackson-databind

These take the same arguments as the builtin variants.

Running system tests

See tests/README.md.

Running in Vagrant

See vagrant/README.md.

Contribution

Apache Kafka is interested in building the community; we would welcome any thoughts or patches. You can reach us on the Apache mailing lists.

To contribute follow the instructions here:

To the top