Summingbird

Streaming MapReduce with Scalding and Storm.

Summingbird is a library that lets you write MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms, including Storm and Scalding.

While a word-counting aggregation in pure Scala might look like this:

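  import scala.collection.mutable.{ Map => MutableMap }

  // store.get(k) returns an Option, so default missing keys to 0L before adding.
  def wordCount(source: Iterable[String], store: MutableMap[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }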

Counting words in Summingbird looks like this:

  def wordCount[P <: Platform[P]]
    (source: Producer[P, String], store: P#Store[String, Long]) =
      source.flatMap { sentence =>
        toWords(sentence).map(_ -> 1L)
      }.sumByKey(store)

The logic is exactly the same, and the code is almost the same. The main difference is that you can execute the Summingbird program in "batch mode" (using Scalding), in "realtime mode" (using Storm), or on both Scalding and Storm in a hybrid batch/realtime mode that offers your application very attractive fault-tolerance properties.
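
To make the comparison concrete, the following is a minimal, self-contained sketch of the pure-Scala version. `toWords` is not defined in the snippets above, so the tokenizer here (lower-casing and splitting on non-letter characters) is an assumption, as are the `WordCountSketch` object and its `merge` and `count` helpers. The `merge` helper illustrates, in plain Scala only and without any Summingbird API, one reason the same logic can be split between a batch layer and a realtime layer: counts computed over disjoint slices of the input can be added together to give the same totals as a single pass.

```scala
import scala.collection.mutable.{ Map => MutableMap }

object WordCountSketch {
  // Hypothetical tokenizer: the original snippets leave toWords undefined.
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Pure-Scala word count, mirroring the first snippet above.
  def wordCount(source: Iterable[String], store: MutableMap[String, Long]): Unit =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.foreach { case (k, v) => store.update(k, store.getOrElse(k, 0L) + v) }

  // Adds two count maps key by key. Addition on Long is associative,
  // so partial results can be combined in any grouping.
  def merge(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    (a.keySet ++ b.keySet).map { k =>
      k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))
    }.toMap

  // Convenience wrapper returning an immutable snapshot of the counts.
  def count(source: Iterable[String]): Map[String, Long] = {
    val store = MutableMap.empty[String, Long]
    wordCount(source, store)
    store.toMap
  }
}
```

For example, `WordCountSketch.count(Seq("a b", "b c"))` yields a count of 1 for "a", 2 for "b" and 1 for "c", and merging the counts of any two halves of an input gives the same map as counting the whole input at once.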

Summingbird provides you with the primitives you need to build rock solid production systems.

Getting Started: Word Count with Twitter

The summingbird-example project allows you to run the wordcount program above on a sample of Twitter data using a local Storm topology and memcache instance. You can find the actual job definition in ExampleJob.scala.

First, make sure you have memcached installed locally. If you don't and you're on OS X, you can get it by installing Homebrew and running this command in a shell:

brew install memcached

When this is finished, run the memcached command in a separate terminal.

Now you'll need to set up access to the Twitter Streaming API. This blog post has a great walkthrough, so open that page, head over to https://dev.twitter.com/ and get your various keys and tokens. Once you have these, clone the Summingbird repository:

git clone https://github.com/twitter/summingbird.git
cd summingbird

Then open StormRunner.scala in your editor and replace the dummy values under the config variable with your auth tokens:

lazy val config = new ConfigurationBuilder()
    .setOAuthConsumerKey("mykey")
    .setOAuthConsumerSecret("mysecret")
    .setOAuthAccessToken("token")
    .setOAuthAccessTokenSecret("tokensecret")
    .setJSONStoreEnabled(true) // required for JSON serialization
    .build
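
Hard-coding tokens in a source file makes them easy to commit by accident. One possible alternative, sketched below, is to read them from environment variables before handing them to the builder. The variable names (`TWITTER_CONSUMER_KEY` and so on), the `Credentials` case class and the `TwitterCredentials` object are all assumptions made for illustration; none of them are part of summingbird-example.

```scala
object TwitterCredentials {
  // Hypothetical container for the four OAuth values the config block needs.
  final case class Credentials(
    consumerKey: String,
    consumerSecret: String,
    accessToken: String,
    accessTokenSecret: String
  )

  // Hypothetical environment variable names; pick whatever suits your setup.
  val Names: Seq[String] = Seq(
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET"
  )

  // Returns Right(credentials) when every variable is set and non-empty,
  // otherwise Left with the names of the missing variables.
  def fromEnv(env: Map[String, String] = sys.env): Either[Seq[String], Credentials] = {
    val missing = Names.filterNot(n => env.get(n).exists(_.trim.nonEmpty))
    if (missing.nonEmpty) Left(missing)
    else Right(Credentials(env(Names(0)), env(Names(1)), env(Names(2)), env(Names(3))))
  }
}
```

The resulting fields could then be passed to the setters shown above, for example `.setOAuthConsumerKey(creds.consumerKey)`.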

You're all ready to go! Now it's time to unleash Storm on your Twitter stream. Make sure the memcached terminal is still open, then start Storm from the summingbird directory:

./sbt "summingbird-example/run --local"

Storm should puke out a bunch of output, then stabilize and appear to hang. This is expected: it means that Storm is updating your local memcache instance with counts of every word that it sees in each tweet.

To query the aggregate results in Memcached, you'll need to open an SBT repl in a new terminal:

./sbt summingbird-example/console

At the launched repl, run the following:

scala> import com.twitter.summingbird.example._
import com.twitter.summingbird.example._

scala> StormRunner.lookup("i")
<memcache store loading elided>
res0: Option[Long] = Some(5)

scala> StormRunner.lookup("i")
res1: Option[Long] = Some(52)

Boom. Counts for the word "i" are growing in realtime.
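
If you would rather watch the count grow than re-run the lookup by hand, a small polling helper is one option. The sketch below is hypothetical and not part of summingbird-example: `CountWatcher`, `poll` and `deltas` are illustrative names, and the helper takes the lookup as a plain `String => Option[Long]` function, which assumes `StormRunner.lookup` accepts a single word as the session above suggests.

```scala
object CountWatcher {
  // Calls `lookup` for `word` `times` times, sleeping `delayMillis` between
  // calls, and returns every observed value in order.
  def poll(
    lookup: String => Option[Long],
    word: String,
    times: Int,
    delayMillis: Long
  ): List[Option[Long]] =
    (1 to times).toList.map { i =>
      val seen = lookup(word)
      if (i < times) Thread.sleep(delayMillis)
      seen
    }

  // Differences between consecutive defined observations.
  def deltas(observed: List[Option[Long]]): List[Long] = {
    val defined = observed.flatten
    defined.zip(defined.drop(1)).map { case (prev, next) => next - prev }
  }
}
```

At the repl this might be used as `CountWatcher.poll(w => StormRunner.lookup(w), "i", 5, 1000L)`, which samples the count for "i" five times, one second apart.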

See the wiki page for a more detailed explanation of the configuration required to get this job up and running and some ideas for where to go next.

Documentation

To learn more and find links to tutorials and information around the web, check out the Summingbird Wiki.

The latest ScalaDocs are hosted on Summingbird's GitHub Project Page.

Contact

Discussion occurs primarily on the Summingbird mailing list. Issues should be reported on the GitHub issue tracker. Simpler issues appropriate for first-time contributors looking to help out are tagged "newbie".

IRC: freenode channel #summingbird

Follow @summingbird on Twitter for updates.

Please feel free to use the beautiful Summingbird logo artwork anywhere.

Get Involved + Code of Conduct

Pull requests and bug reports are always welcome!

We use a lightweight form of project governance inspired by the one used by Apache projects.
Please see Contributing and Committership for our code of conduct and our pull request review process.
The TL;DR is send us a pull request, iterate on the feedback + discussion, and get a +1 from a Committer in order to get your PR accepted.

The current list of active committers (who can +1 a pull request) can be found here: Committers

A list of contributors to the project can be found here: Contributors

Maven

Summingbird modules are published on Maven Central. The current groupId and version for all modules are, respectively, "com.twitter" and 0.9.1.

Current published artifacts are

  • summingbird-core_2.11
  • summingbird-core_2.10
  • summingbird-batch_2.11
  • summingbird-batch_2.10
  • summingbird-client_2.11
  • summingbird-client_2.10
  • summingbird-storm_2.11
  • summingbird-storm_2.10
  • summingbird-scalding_2.11
  • summingbird-scalding_2.10
  • summingbird-builder_2.11
  • summingbird-builder_2.10

The suffix denotes the Scala version.
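
As a hedged illustration of how these coordinates might be used from sbt, the fragment below pulls in two of the modules listed above. The choice of modules and of Scala 2.11.12 is an assumption for the example, and the sketch does not cover any extra resolvers that transitive dependencies may need.

```scala
// build.sbt (sketch): depend on the core and Storm modules.
// %% appends the Scala binary version suffix (for example _2.11), so
// scalaVersion must be a 2.10.x or 2.11.x release to match a published artifact.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "com.twitter" %% "summingbird-core" % "0.9.1",
  "com.twitter" %% "summingbird-storm" % "0.9.1"
)
```

Writing `"com.twitter" % "summingbird-core_2.11" % "0.9.1"` with a single `%` names the same artifact explicitly.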

Authors (alphabetically)

License

Copyright 2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Key Metrics

Overview
Name and owner: twitter/summingbird
Primary language: Scala
Languages: Scala (3 languages in total)
Platforms: Linux, Mac, Windows
License: Apache License 2.0

Owner Activity
Created: 2012-09-25 22:38:35
Last push: 2022-01-19 17:31:02
Last commit: 2016-02-05 13:53:04
Releases: 45
Latest release: twitter/20190422
First release: 0.0.1 (released 2013-05-21 12:46:04)

User Engagement
Stars: 2.1k
Watchers: 288
Forks: 266
Commits: 1.8k
Issues: 282
Open issues: 148
Pull requests: 433
Open pull requests: 15
Closed pull requests: 46