Smile

Statistical Machine Intelligence & Learning Engine

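Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-the-art performance. Smile is well documented; see the project website for programming guides and more information.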

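Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.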

Smile implements the following major machine learning algorithms:

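  • Classification: Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.
  • Regression: Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, ElasticNet, Ridge Regression.
  • Feature Selection: Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, TreeSHAP, Signal Noise ratio, Sum Squares ratio.
  • Clustering: BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.
  • Association Rule & Frequent Itemset Mining: FP-growth mining algorithm.
  • Manifold Learning: IsoMap, LLE, Laplacian Eigenmap, t-SNE, UMAP, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA.
  • Multi-Dimensional Scaling: Classical MDS, Isotonic MDS, Sammon Mapping.
  • Nearest Neighbor Search: BK-Tree, Cover Tree, KD-Tree, SimHash, LSH.
  • Sequence Learning: Hidden Markov Model, Conditional Random Field.
  • Natural Language Processing: Sentence Splitter and Tokenizer, Bigram Statistical Test, Phrase Extractor, Keyword Extractor, Stemmer, POS Tagging, Relevance Ranking.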

You can use the libraries through the Maven Central repository by adding the following to your project's pom.xml file.

<dependency>
    <groupId>com.github.haifengl</groupId>
    <artifactId>smile-core</artifactId>
    <version>2.6.0</version>
</dependency>

For NLP, use the artifactId smile-nlp.

For the Scala API, use

libraryDependencies += "com.github.haifengl" %% "smile-scala" % "2.6.0"

For the Kotlin API, add the following to the dependencies section of your Gradle build script:

implementation("com.github.haifengl:smile-kotlin:2.6.0")

For the Clojure API, add the following dependency to your project or build file:

[org.clojars.haifengl/smile "2.6.0"]

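Some algorithms rely on BLAS and LAPACK (e.g. manifold learning, some clustering algorithms, Gaussian process regression, MLP, etc.). To use these algorithms, you should include OpenBLAS for optimized matrix computation.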

libraryDependencies ++= Seq(
    "org.bytedeco" % "javacpp"   % "1.5.4"        classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
    "org.bytedeco" % "openblas"  % "0.3.10-1.5.4" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
    "org.bytedeco" % "arpack-ng" % "3.7.0-1.5.4"  classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le"
)

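In this example, we include all supported 64-bit platforms and filter out 32-bit platforms. Users should include only the platforms they need, to save space.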
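To include only the platform you are building on, one option is to derive the classifier from the JVM's os.name and os.arch properties. The sketch below is a minimal illustration under stated assumptions: the class and method names are hypothetical, it is not part of Smile or JavaCPP, and it recognizes only the desktop and server classifiers listed in the build definition above (Android and iOS are left out).

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Smile or JavaCPP): maps JVM system
 *  properties to one of the classifier strings used in the build above. */
final class PlatformClassifier {
    /** Desktop and server classifiers that appear in the build definition above. */
    static final Set<String> LISTED = Set.of(
        "macosx-x86_64", "windows-x86_64", "linux-x86_64", "linux-arm64", "linux-ppc64le");

    private PlatformClassifier() {}

    /** Returns a listed classifier, or throws if the combination is not listed. */
    static String classifier(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.startsWith("mac")) {
            osPart = "macosx";
        } else if (os.startsWith("windows")) {
            osPart = "windows";
        } else if (os.startsWith("linux")) {
            osPart = "linux";
        } else {
            throw new IllegalArgumentException("Unsupported OS: " + osName);
        }

        String archPart;
        switch (arch) {
            case "amd64":
            case "x86_64":
                archPart = "x86_64";
                break;
            case "aarch64":
            case "arm64":
                archPart = "arm64";
                break;
            case "ppc64le":
                archPart = "ppc64le";
                break;
            default:
                throw new IllegalArgumentException("Unsupported architecture: " + osArch);
        }

        String result = osPart + "-" + archPart;
        if (!LISTED.contains(result)) {
            throw new IllegalArgumentException("Not in the listed classifiers: " + result);
        }
        return result;
    }

    public static void main(String[] args) {
        try {
            System.out.println(classifier(
                System.getProperty("os.name"), System.getProperty("os.arch")));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The printed value can then be used as the single classifier on each of the dependencies, instead of the full list.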

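If you prefer another BLAS implementation, you can use any library found on "java.library.path" or on the class path by specifying it with the "org.bytedeco.openblas.load" system property. For example, to use the BLAS library from the Accelerate framework on Mac OS X, pass options such as -Djava.library.path=/usr/lib/ -Dorg.bytedeco.openblas.load=blas.

For a default installation of MKL, that would be -Dorg.bytedeco.openblas.load=mkl_rt. Alternatively, you can simply include the smile-mkl module in your project, which bundles the MKL binaries. With the smile-mkl module on the class path, Smile automatically switches to MKL.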

libraryDependencies += "com.github.haifengl" %% "smile-mkl" % "2.6.0"
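The -D options mentioned above are ordinary JVM system properties, so a quick way to confirm what a process was actually started with is to print them. The following is a small sketch for that purpose only; the class name is hypothetical, and it reads the properties without loading any native library or touching Smile.

```java
/** Prints the BLAS-related JVM settings discussed above. It only reads
 *  standard system properties; it does not load any native library. */
final class BlasSettings {
    /** Name of the system property mentioned in the text. */
    static final String LOAD_PROPERTY = "org.bytedeco.openblas.load";

    private BlasSettings() {}

    /** Describes which library name was requested on the command line, if any. */
    static String describe(String requested) {
        return (requested == null || requested.isEmpty())
            ? "no override set"
            : "override requested: " + requested;
    }

    public static void main(String[] args) {
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
        System.out.println(LOAD_PROPERTY + ": " + describe(System.getProperty(LOAD_PROPERTY)));
    }
}
```

Running it with -Dorg.bytedeco.openblas.load=mkl_rt on the java command line should report the override; running it without the option should report that none is set.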

Shell

Smile comes with interactive shells for Java, Scala, and Kotlin. Download the pre-packaged Smile from the releases page. In the home directory of Smile, type

./bin/smile

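to enter the Scala shell. You can run any valid Scala expression in the shell. In the simplest case, you can use it as a calculator. In addition, all high-level Smile operators are predefined in the shell. By default, the shell uses up to 75% of memory. If you need more memory to handle large data, use the option -J-Xmx or -XX:MaxRAMPercentage. For example,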

./bin/smile -J-Xmx30G

You can also modify the configuration file ./conf/smile.ini for the memory and other JVM settings.
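To check that a memory setting took effect, you can ask the running JVM for its maximum heap size. The snippet below is a minimal sketch using only the standard Runtime API; the class name is hypothetical, and the same Runtime call can be evaluated directly in a JShell session.

```java
/** Reports the maximum heap the current JVM may use, to confirm that
 *  -Xmx or -XX:MaxRAMPercentage took effect. */
final class HeapCheck {
    private HeapCheck() {}

    /** Converts a byte count to gibibytes. */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // maxMemory() is the upper bound the JVM will attempt to use for the heap.
        long max = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %.2f GiB%n", toGiB(max));
    }
}
```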

To use Java's JShell, type

./bin/jshell.sh

which has Smile's jars on the classpath. Similarly, run

./bin/kotlin.sh

to enter the Kotlin REPL.

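Model Serialization

Most models support the Java Serializable interface (all classifiers do), so you can use them in Spark. For reading and writing models in non-Java code, we suggest [XStream](https://github.com/x-stream/xstream) to serialize the trained models.

XStream is a simple library that serializes objects to XML and back again. XStream is easy to use and requires no mappings (in fact, no modifications to objects). Protostuff is a nice alternative that supports forward-backward compatibility (schema evolution) and validation. Beyond XML, Protostuff supports many other formats such as JSON, YAML, protobuf, etc.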
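As an illustration of the Serializable route, the sketch below round-trips an object through Java's built-in object streams. The LinearModel class is a hypothetical stand-in for a trained model, not a Smile class, and the helper names are assumptions made for this example; any object that implements Serializable would follow the same pattern.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationSketch {
    private SerializationSketch() {}

    /** Hypothetical stand-in for a trained model; not a Smile class. */
    static final class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        final double bias;

        LinearModel(double[] weights, double bias) {
            this.weights = weights.clone();
            this.bias = bias;
        }

        /** Computes bias + sum of weights[i] * x[i]. */
        double predict(double[] x) {
            double sum = bias;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * x[i];
            }
            return sum;
        }
    }

    /** Serializes any Serializable object to a byte array. */
    static byte[] toBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    /** Reads an object back and casts it to the expected type. */
    static <T> T fromBytes(byte[] bytes, Class<T> type)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return type.cast(in.readObject());
        }
    }

    public static void main(String[] args) throws Exception {
        LinearModel model = new LinearModel(new double[] {2.0, -1.0}, 0.5);
        LinearModel copy = fromBytes(toBytes(model), LinearModel.class);
        double[] x = {3.0, 4.0};
        System.out.println("original: " + model.predict(x) + ", restored: " + copy.predict(x));
    }
}
```

Writing to a file instead of a byte array only changes the underlying stream. Java serialization is tied to the JVM, which is why a text format such as XML is suggested above for non-Java consumers.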

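Visualization

Smile provides a Swing-based data visualization library, SmilePlot, which offers scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, QQ plot, contour plot, surface, and wireframe.

To use SmilePlot, add the following to your dependencies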

<dependency>
    <groupId>com.github.haifengl</groupId>
    <artifactId>smile-plot</artifactId>
    <version>2.6.0</version>
</dependency>

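Smile also supports declarative data visualization. With the smile.plot.vega package, we can create a specification that describes a visualization as a mapping from data to properties of graphical marks (e.g., points or bars). The specification is based on Vega-Lite. The Vega-Lite compiler automatically produces visualization components, including axes, legends, and scales. It then determines the properties of these components based on a set of carefully designed rules.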
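To make the idea of a declarative specification concrete, the sketch below holds a plain Vega-Lite specification as JSON text. It is standard Vega-Lite rather than the smile.plot.vega API (whose builder methods are not shown here), and the class name, field names, and data values are made up for illustration. The "mark" entry picks the graphical mark, and "encoding" maps data fields to that mark's position and color; axes and the legend are left for the Vega-Lite compiler to derive.

```java
/** Holds a plain Vega-Lite specification as JSON text. This is standard
 *  Vega-Lite, not the smile.plot.vega API; the data values are made up. */
final class VegaLiteSketch {
    private VegaLiteSketch() {}

    // Text block syntax requires Java 15 or later.
    static final String SPEC = """
        {
          "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
          "data": {
            "values": [
              {"x": 1.0, "y": 2.1, "label": "a"},
              {"x": 2.0, "y": 3.9, "label": "b"},
              {"x": 3.0, "y": 6.2, "label": "a"}
            ]
          },
          "mark": "point",
          "encoding": {
            "x": {"field": "x", "type": "quantitative"},
            "y": {"field": "y", "type": "quantitative"},
            "color": {"field": "label", "type": "nominal"}
          }
        }
        """;

    public static void main(String[] args) {
        // Print the specification so it can be pasted into a Vega-Lite viewer.
        System.out.print(SPEC);
    }
}
```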

Gallery

The example images are omitted here; please refer to the README.

(First version translated and edited by vz on 2021.05.15)

Overview

Name With Owner: haifengl/smile
Primary Language: Java
Program Language: Java (Language Count: 15)
Platform: Linux, Mac, Unix-like, Windows
License: Other
Release Count: 32
Last Release Name: v3.1.0
First Release Name: v1.1.0
Created At: 2014-11-20 16:28:12
Pushed At: 2024-04-20 14:02:34
Last Commit At: 2024-04-20 09:58:34
Stargazers Count: 5.9k
Watchers Count: 271
Fork Count: 1.1k
Commits Count: 3.5k
Issues Count: 592
Issue Open Count: 10
Pull Requests Count: 109
Pull Requests Open Count: 0
Pull Requests Close Count: 65

Smile

Join the chat at https://gitter.im/haifengl/smile

Smile (Statistical Machine Intelligence and Learning Engine) is
a fast and comprehensive machine learning, NLP, linear algebra,
graph, interpolation, and visualization system in Java and Scala.
With advanced data structures and algorithms,
Smile delivers state-of-the-art performance.

Smile covers every aspect of machine learning, including classification,
regression, clustering, association rule mining, feature selection,
manifold learning, multidimensional scaling, genetic algorithms,
missing value imputation, efficient nearest neighbor search, etc.

Smile is well documented; please check out the
project website
for programming guides and more information.

You can use the libraries through the Maven Central repository by adding the following to your project's pom.xml file.

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-core</artifactId>
      <version>2.1.0</version>
    </dependency>

For NLP, use the artifactId smile-nlp.

For Scala API, please use

    libraryDependencies += "com.github.haifengl" %% "smile-scala" % "2.1.0"

To enable machine-optimized matrix computation, add the
smile-netlib dependency:

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-netlib</artifactId>
      <version>2.1.0</version>
    </dependency>

and also make their machine-optimized libblas3 (CBLAS) and liblapack3 (Fortran)
available as shared libraries at runtime. This module employs the highly efficient
netlib-java library.

OS X

Apple OS X requires no further setup as it ships with the veclib framework.

Linux

Generically-tuned ATLAS and OpenBLAS are available with most distributions
and must be enabled explicitly using the package manager. For example,

  • sudo apt-get install libatlas3-base libopenblas-base
  • sudo update-alternatives --config libblas.so
  • sudo update-alternatives --config libblas.so.3
  • sudo update-alternatives --config liblapack.so
  • sudo update-alternatives --config liblapack.so.3

However, these are only generic pre-tuned builds. If you have Intel MKL installed,
you could also create symbolic links from libblas.so.3 and liblapack.so.3 to libmkl_rt.so
or use Debian's alternatives system.

Windows

The native_system builds expect to find libblas3.dll and liblapack3.dll
on the %PATH% (or current working directory). Smile ships a prebuilt
OpenBLAS.
Users can also install vendor-supplied implementations, which may
offer better performance.

Shell

Smile comes with an interactive shell. Download pre-packaged Smile from the releases page.
In the home directory of Smile, type

    ./bin/smile

to enter the shell, which is based on Ammonite-REPL. You can run any valid Scala expressions in the shell.
In the simplest case, you can use it as a calculator. Besides, all high-level Smile operators are predefined
in the shell. By default, the shell uses up to 4GB memory. If you need more memory to handle large data,
use the option -J-Xmx. For example,

    ./bin/smile -J-Xmx8192M

You can also modify the configuration file ./conf/smile.ini for the memory and other JVM settings.
For detailed help, check out the project website.

Smile implements the following major machine learning algorithms:

  • Classification
    Support Vector Machines, Decision Trees, AdaBoost, Gradient Boosting, Random Forest, Logistic Regression, Neural Networks, RBF Networks, Maximum Entropy Classifier, KNN, Naïve Bayesian, Fisher/Linear/Quadratic/Regularized Discriminant Analysis.

  • Regression
    Support Vector Regression, Gaussian Process, Regression Trees, Gradient Boosting, Random Forest, RBF Networks, OLS, LASSO, ElasticNet, Ridge Regression.

  • Feature Selection
    Genetic Algorithm based Feature Selection, Ensemble Learning based Feature Selection, Signal Noise ratio, Sum Squares ratio.

  • Clustering
    BIRCH, CLARANS, DBSCAN, DENCLUE, Deterministic Annealing, K-Means, X-Means, G-Means, Neural Gas, Growing Neural Gas, Hierarchical Clustering, Sequential Information Bottleneck, Self-Organizing Maps, Spectral Clustering, Minimum Entropy Clustering.

  • Association Rule & Frequent Itemset Mining
    FP-growth mining algorithm.

  • Manifold learning
    IsoMap, LLE, Laplacian Eigenmap, t-SNE, PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA.

  • Multi-Dimensional Scaling
    Classical MDS, Isotonic MDS, Sammon Mapping.

  • Nearest Neighbor Search
    BK-Tree, Cover Tree, KD-Tree, LSH.

  • Sequence Learning
    Hidden Markov Model, Conditional Random Field.

  • Natural Language Processing
    Sentence Splitter and Tokenizer, Bigram Statistical Test, Phrase Extractor, Keyword Extractor, Stemmer, POS Tagging, Relevance Ranking.

Model Serialization

Most models support the Java Serializable interface (all classifiers do), so
you can use them in Spark. For reading and writing models in non-Java code, we suggest XStream to serialize the trained models.
XStream is a simple library to serialize objects to XML and back again. XStream is easy to use and doesn't require mappings
(actually requires no modifications to objects). Protostuff is a
nice alternative that supports forward-backward compatibility (schema evolution) and validation.
Beyond XML, Protostuff supports many other formats such as JSON, YAML, protobuf, etc. For some predictive models,
we look forward to supporting PMML (Predictive Model Markup Language), an XML-based file format developed by the Data Mining Group.

Smile Scala API provides read(), read.xstream(), write(), and write.xstream() functions in package smile.io.

SmilePlot

Smile also has a Swing-based data visualization library SmilePlot, which provides scatter plot, line plot, staircase plot, bar plot, box plot, histogram, 3D histogram, dendrogram, heatmap, hexmap, QQ plot, contour plot, surface, and wireframe. The class PlotCanvas provides builtin functions such as zoom in/out, export, print, customization, etc.

SmilePlot requires the SwingX library for JXTable. If your environment cannot use SwingX, the dependency is easy to remove by using JTable instead.

To use SmilePlot, add the following to dependencies

    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-plot</artifactId>
      <version>2.1.0</version>
    </dependency>

Demo Gallery