
h2. bashreduce : mapreduce in a bash script

bashreduce lets you apply your favorite unix tools in a mapreduce fashion across multiple machines/cores. There's no installation, administration, or distributed filesystem. You'll need:

  • "br":http://github.com/erikfrey/bashreduce/blob/master/br somewhere handy in your path
  • vanilla unix tools: sort, awk, ssh, netcat, pv
  • password-less ssh to each machine you plan to use

h2. Configuration

Edit @/etc/br.hosts@ and enter the machines you wish to use as workers. Or specify your machines at runtime:
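
For instance, a sketch assuming @br@ takes an @-m@ flag with a quoted, space-separated host list:

<pre>br -m "host1 host2 host3"</pre>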

To take advantage of multiple cores, repeat the host name.
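
With the same assumed @-m@ flag, running two workers on each of two hosts might look like:

<pre>br -m "host1 host1 host2 host2"</pre>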

h2. Examples

h3. sorting
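
A minimal example: with no reduce command given, br just sorts. The @-i@/@-o@ flags (the same ones used in the performance tables below) name the input and output files:

<pre>br -i 4gb_file -o 4gb_file_sorted</pre>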

h3. word count
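
A sketch, assuming an @-r@ flag that supplies the reduce command each worker runs over its partition: since lines arrive at the reducer sorted, piping through @uniq -c@ turns them into counts.

<pre>br -r "uniq -c" -i words_file -o counts_file</pre>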

h3. great big join
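
Same assumed @-r@ flag; this sketch supposes every worker keeps a copy of the smaller table at a hypothetical @/tmp/join_data@, pre-sorted on the join key, and joins it against the partition arriving on stdin:

<pre>br -r "join /tmp/join_data -" -i input_file -o joined_file</pre>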

h2. Performance

h3. big honkin' local machine

Let's start with a simpler scenario: I have a machine with multiple cores, and with normal unix tools I'm relegated to using just one core. How does br help us here? Here's br on an 8-core machine, essentially operating as a poor man's multi-core sort:

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 11m3.111s | 6.18 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 7m13.695s | 9.44 MBps |

The job completely saturates i/o, but it's still a reasonable gain!

h3. many cheap machines

Here lies the promise of mapreduce: rather than use my big honkin' machine, I have a bunch of cheaper machines lying around that I can distribute my work to. How does br behave when I add four cheaper 4-core machines into the mix?

|_. command |_. using |_. time |_. rate |
| sort -k1,1 -S2G 4gb_file > 4gb_file_sorted | coreutils | 30m32.078s | 2.24 MBps |
| br -i 4gb_file -o 4gb_file_sorted | coreutils | 8m30.652s | 8.02 MBps |
| br -i 4gb_file -o 4gb_file_sorted | brp/brm | 4m7.596s | 16.54 MBps |

We have a new bottleneck: we're limited by how quickly we can partition/pump our dataset out to the nodes. awk and sort begin to show their limitations (our clever awk script is a bit cpu bound, and @sort -m@ can only merge so many files at once). So we use two little helper programs written in C (yes, I know! it's cheating! if you can think of a better partition/merge using core unix tools, contact me) to partition the data and merge it back.
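
To make that bottleneck concrete, here's a hypothetical round-robin partitioner in plain awk (br's actual awk script and its brp/brm helpers differ in detail): it deals input lines out to one chunk per worker, and @sort -m@ later merges the workers' sorted results.

<pre>
# hypothetical sketch: deal lines round-robin into one chunk per worker
awk -v n=4 '{ print > ("/tmp/chunk." (NR % n)) }' 4gb_file
# ...each worker sorts its own chunk, then the sorted pieces are merged:
sort -m /tmp/chunk.*.sorted > 4gb_file_sorted
</pre>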

h2. Future work

I've tested this on ubuntu/debian, but not on other distros. According to Daniel Einspanjer, netcat has different parameters on Redhat.

br has a poor man's dfs like so:
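
A sketch of the idea, assuming the same @-r@ reduce flag as above: the reduce command simply writes each worker's partition to a hypothetical local file instead of sending it back.

<pre>br -r "cat > /tmp/my_slice" -i input_file</pre>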

But this breaks if you specify the same host multiple times. Maybe some kind of very basic virtualization is in order. Maybe.

Other niceties would be to more closely mimic the options presented in sort (numeric, reverse, etc).
