graphchi-cpp

GraphChi's C++ version. Big Data - small machine.

  • Owner: GraphChi/graphchi-cpp
  • Platform:
  • License::
  • Category::
  • Topic:
  • Like:
    0
      Compare:

Github stars Tracking Chart

GraphChi - disk-based large-scale graph computation

NOTE: This project has been recently moved from Google Code, and some of the wiki pages might be partly broken.

MIT Technology Review article about GraphChi: "Your laptop can now analyze big data"

NEW: Improved performance.

(October 21, 2013)
In-edges are now loaded in parallel, improving performance on multicore machines significantly.

NEW: Graph contraction algorithms

Read about graph contraction technique, which we used to implemented efficient minimum spanning forest computation Graph Contraction Algorithms .

Discussion group

http://groups.google.com/group/graphchi-discuss

License

GraphChi is licensed under the Apache License, Version 2.0.
Each source code file has full license information.

Introduction

GraphChi is a spin-off of the GraphLab ( http://www.graphlab.org ) -project from the Carnegie Mellon University. It is based on research by Aapo Kyrola ( http://www.cs.cmu.edu/~akyrola/) and his advisors.

GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google's Pregel. GraphChi runs vertex-centric programs asynchronously (i.e changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph. Section 'Performance' contains some examples of applications implemented for GraphChi and their running times on GraphChi.

The promise of GraphChi is to bring web-scale graph computation, such as analysis of social networks, available to anyone with a modern laptop. It saves you from the hassle and costs of working with a distributed cluster or cloud services. We find it much easier to debug applications on a single computer than trying to understand how a distributed algorithm is executed.

In some cases GraphChi can solve bigger problems in reasonable time than many other available distributed frameworks. GraphChi also runs efficiently on servers with plenty of memory, and can use multiple disks in parallel by striping the data.

Even if you do require the processing power of high-performance clusters, GraphChi can be an excellent tool for developing and debugging your algorithms prior to deploying them to the cluster. For high-performance graph computation in the distributed setting, we direct you to GraphLab's new version (v2.1), which can now handle large graphs in astonishing speed. GraphChi supports also most of the new GraphLab v2.1 API (with some restrictions), making the transition easy.

GraphChi is implemented in plain C++, and available as open-source under the flexible Apache License 2.0.

Java version

Java-version of GraphChi: https://github.com/GraphChi/graphchi-java

Publication

GraphChi is part of the OSDI'12 proceedings. PDF of the paper can be downloaded here: https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf

Slides (OSDI talk): http://www.cs.cmu.edu/~akyrola/files/osditalk-graphchi.pptx

Collaborative Filtering Toolkit

Danny Bickson has ported several collaborative filtering algorithms from GraphLab to GraphChi:
http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html

Features

  • Vertex-centric computation model (similar to GraphLab, Pregel or Giraph)

  • Asynchronous, parallel execution, with (optional) deterministic scheduling (see semantics section at Creating-GraphChi-Applications )

  • Can run graphs with billions of edges, with linear scalability, on a standard consumer grade machine
    ** Can also utilize large amounts of memory by preloading (caching), making it competitive on large servers: UsingMoreMemory

  • Multidisk striping - RAID-style operation, see MultipleDisksSupports

  • Works well on both hard-drive and SSD.

  • Evolving graphs, streaming graph updates (can add and delete edges): https://github.com/GraphChi/graphchi-cpp/wiki/Evolving-And-StreamingGraphs

  • Easy to install, headers-only, no dependencies.

  • Tested on Mac OS X and Linux

Getting Started

Best way to get started is to start from the https://github.com/GraphChi/graphchi-cpp/wiki/Example-Apps page.
Prior to that, you need to download the source code (no configuration
or installation is required).

For an introduction on writing your own applications, read Creating-GraphChi-Applications.

Problems compiling on Mac

First, try:
xcode-select --install

If compiler complains about missing "omp.h" (OpenMP library), here is a way you can install it:

A) Direct installation of binary (on Yosemite)

See intructions in https://wiki.helsinki.fi/display/HUGG/Installing+the+GNU+compilers+on+Mac+OS+X#InstallingtheGNUcompilersonMacOSX-InstructionsforMacOS10.10(Yosemite)withXcode6

B) Using Homebrew
(Contributed by Jose Pablo Gonzalez):

  1. Install homebrew ( http://brew.sh )
  2. brew tap homebrew/versions
  3. Install a compiler:
    brew install apple-gcc42
  4. Modify the Makefile to use the new compiler:
    CPP = g++-4.2
  5. make

NOTE: you might want to use newer compiler version.

How GraphChi works

GraphChi is based on the Parallel Sliding Windows method which allows efficient asynchronous processing of mutable graphs from disk. See Introduction-To-GraphChi for description.

Performance

NOTE: Following performance figures are somewhat outdated. Current performance should be better.

In the table below, we have picked some recent running time results for large-scale graph problems from the literature, and run the same experiment using GraphChi. For GraphChi, we used a Mac Mini (2012 model), with 8 gigabytes of RAM and 256 gigabyte SSD drive.

While distributed clusters can solve the same problems faster than GraphChi on a single computer, for many purposes GraphChi's performance should be adequate. The numbers below do not include time for transferring the input to cloud or cluster, and usually do not include the graph loading time. GraphChi's running times include loading the graph and saving the results, but not preprocessing time. Preprocessing needs to be done only once per graph (you can run many different algorithms on the same preprocessed graph). The preprocessing times are listed in a separate table below.

Performance on hard drive: We repeated the same experiments on hard-drive, and the running times are roughly double compared to SSD. The hard-drive performance is thus sufficient for many purposes.

Configuration:
membudget_mb 3000 execthreads 2 loadthreads 2 niothreads 2 io.blocksize 1048576

Preprocessing times

Main metrics

Overview
Name With OwnerGraphChi/graphchi-cpp
Primary LanguageC++
Program languageMakefile (Language Count: 9)
Platform
License:
所有者活动
Created At2013-07-24 13:36:43
Pushed At2019-01-02 20:58:16
Last Commit At2019-01-02 12:58:14
Release Count1
Last Release Namebeforemerge (Posted on )
First Release Namebeforemerge (Posted on )
用户参与
Stargazers Count806
Watchers Count105
Fork Count311
Commits Count1k
Has Issues Enabled
Issues Count41
Issue Open Count25
Pull Requests Count4
Pull Requests Open Count8
Pull Requests Close Count7
项目设置
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private