GRF

Generalized Random Forests.

Owner: grf-labs/grf
Platforms: Linux, Mac, Windows
License: GNU General Public License v3.0
Categories: C/C++, Machine Learning
Topics: machine-learning, statistics, random-forest, econometrics, causal-inference, causal-machine-learning

Key metrics

Overview

Name and owner	grf-labs/grf
Primary language	C++
Languages	CMake (language count: 5)
Platforms	Linux, Mac, Windows
License	GNU General Public License v3.0

Owner activity

Created	2016-08-12 13:17:37
Last pushed	2025-06-24 21:21:05
Last commit	2025-06-25 07:21:02
Releases	24
Latest release	v2.4.0 (released 2024-11-15 22:27:44)
First release	v0.9.0 (released 2019-10-12 21:31:13)

User engagement

Stars	1k
Watchers	46
Forks	262
Commits	2.1k
Issues	530
Open issues	60
Pull requests	923
Open pull requests	1
Closed pull requests	50

generalized random forests

A pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods for least-squares regression, quantile regression, survival regression, and treatment effect estimation (optionally using instrumental variables), with support for missing values.

In addition, GRF supports 'honest' estimation (where one subset of the data is used for choosing splits, and another for populating the leaves of the tree), and confidence intervals for least-squares regression and treatment effect estimation.
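
The honest estimation and confidence intervals described above can be sketched for a regression forest as follows. This is a minimal illustration on simulated data, not an excerpt from the GRF documentation: the seed, sample sizes, and test grid are assumptions made for the example, and the 1.96 multiplier assumes a normal approximation for a pointwise 95% interval. Honesty is already enabled by default in regression_forest; it is spelled out here only to make the option visible.

```r
library(grf)

# Simulated data (assumed for illustration): only the first covariate matters.
set.seed(1)
n <- 500
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)

# honesty = TRUE splits each subsample in two: one part chooses the splits,
# the other populates the leaves. honesty.fraction is the share of the
# subsample used for choosing splits.
forest <- regression_forest(X, Y, honesty = TRUE, honesty.fraction = 0.5)

# Request variance estimates so that pointwise intervals can be formed.
X.test <- matrix(0, 11, p)
X.test[, 1] <- seq(-1, 1, length.out = 11)
pred <- predict(forest, X.test, estimate.variance = TRUE)

# Pointwise 95% intervals under a normal approximation.
sigma.hat <- sqrt(pred$variance.estimates)
ci <- cbind(lower = pred$predictions - 1.96 * sigma.hat,
            estimate = pred$predictions,
            upper = pred$predictions + 1.96 * sigma.hat)
print(round(ci, 2))
```

Setting honesty = FALSE would use all of each subsample for both steps, which may fit small data sets more closely but gives up the guarantees that the intervals rely on.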

Some helpful links for getting started:

The R package documentation contains usage examples and method reference.
The GRF reference gives a detailed description of the GRF algorithm and includes troubleshooting suggestions.
For community questions and answers about usage, see GitHub issues labelled 'question'.

The repository first started as a fork of the ranger repository -- we owe a great deal of thanks to the ranger authors for their useful and free package.

Installation

The latest release of the package can be installed through CRAN:

install.packages("grf")

conda users can install from the conda-forge channel:

conda install -c conda-forge r-grf

The current development version can be installed from source using devtools:

devtools::install_github("grf-labs/grf", subdir = "r-package/grf")

Note that to install from source, a compiler that implements C++11 is required (clang 3.3 or higher, or g++ 4.8 or higher). If installing on Windows, the RTools toolchain is also required.
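
In scripts that may run on machines where GRF is not yet installed, the CRAN installation step can be made conditional. The sketch below is one possible pattern rather than a recommended procedure from the GRF authors; the CRAN mirror URL is an assumption, added because non-interactive R sessions often have no mirror configured.

```r
# Install grf from CRAN only when it is missing, then confirm that it loads.
if (!requireNamespace("grf", quietly = TRUE)) {
  # The mirror below is an assumption; any CRAN mirror should work.
  install.packages("grf", repos = "https://cloud.r-project.org")
}

library(grf)

# Record the installed version, which is useful when reporting issues.
grf.version <- packageVersion("grf")
print(grf.version)
```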

Usage Examples

The following script demonstrates how to use GRF for heterogeneous treatment effect estimation. For examples of how to use other types of forest, such as those for quantile regression and for causal effect estimation using instrumental variables, please consult the R documentation on the relevant forest methods (quantile_forest, instrumental_forest, etc.).

library(grf)

# Generate data.
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
X.test <- matrix(0, 101, p)
X.test[, 1] <- seq(-2, 2, length.out = 101)

# Train a causal forest.
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
Y <- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)
tau.forest <- causal_forest(X, Y, W)

# Estimate treatment effects for the training data using out-of-bag prediction.
tau.hat.oob <- predict(tau.forest)
hist(tau.hat.oob$predictions)

# Estimate treatment effects for the test sample.
tau.hat <- predict(tau.forest, X.test)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 2)

# Estimate the conditional average treatment effect on the full sample (CATE).
average_treatment_effect(tau.forest, target.sample = "all")

# Estimate the conditional average treatment effect on the treated sample (CATT).
average_treatment_effect(tau.forest, target.sample = "treated")

# Add confidence intervals for heterogeneous treatment effects; growing more trees is now recommended.
tau.forest <- causal_forest(X, Y, W, num.trees = 4000)
tau.hat <- predict(tau.forest, X.test, estimate.variance = TRUE)
sigma.hat <- sqrt(tau.hat$variance.estimates)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions + 1.96 * sigma.hat, tau.hat$predictions - 1.96 * sigma.hat, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], tau.hat$predictions + 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], tau.hat$predictions - 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 1)

# In some examples, pre-fitting models for Y and W separately may
# be helpful (e.g., if different models use different covariates).
# In some applications, one may even want to get Y.hat and W.hat
# using a completely different method (e.g., boosting).

# Generate new data.
n <- 4000
p <- 20
X <- matrix(rnorm(n * p), n, p)
TAU <- 1 / (1 + exp(-X[, 3]))
W <- rbinom(n, 1, 1 / (1 + exp(-X[, 1] - X[, 2])))
Y <- pmax(X[, 2] + X[, 3], 0) + rowMeans(X[, 4:6]) / 2 + W * TAU + rnorm(n)

forest.W <- regression_forest(X, W, tune.parameters = "all")
W.hat <- predict(forest.W)$predictions

forest.Y <- regression_forest(X, Y, tune.parameters = "all")
Y.hat <- predict(forest.Y)$predictions

forest.Y.varimp <- variable_importance(forest.Y)

# Note: Forests may have a hard time when trained on very few variables
# (e.g., ncol(X) = 1, 2, or 3). We recommend not being too aggressive
# in selection.
selected.vars <- which(forest.Y.varimp / mean(forest.Y.varimp) > 0.2)

tau.forest <- causal_forest(X[, selected.vars], Y, W,
                            W.hat = W.hat, Y.hat = Y.hat,
                            tune.parameters = "all")

# Check whether causal forest predictions are well calibrated.
test_calibration(tau.forest)
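
The introduction to this section points to quantile_forest and instrumental_forest for other forest types. A minimal sketch of both is given below on simulated data. The data-generating processes, seed, and sample sizes are assumptions made for illustration, and the code assumes a GRF 2.x release, where predict on a quantile forest returns a list with a predictions matrix (earlier releases may return the matrix directly). The package documentation remains the authoritative reference.

```r
library(grf)

set.seed(42)
n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
X.test <- matrix(0, 21, p)
X.test[, 1] <- seq(-2, 2, length.out = 21)

# 1) Quantile forest: the spread of Y grows with |X[, 1]|, so the
#    conditional quantiles fan out away from zero.
Y <- X[, 1] * rnorm(n)
q.forest <- quantile_forest(X, Y, quantiles = c(0.1, 0.5, 0.9))
q.hat <- predict(q.forest, X.test, quantiles = c(0.1, 0.5, 0.9))
q.pred <- q.hat$predictions  # one row per test point, one column per quantile

# 2) Instrumental forest: Z is a binary instrument and Q an unobserved
#    compliance indicator that also shifts the outcome, which confounds
#    a direct comparison of treated and untreated units.
Z <- rbinom(n, 1, 0.5)
Q <- rbinom(n, 1, 0.5)
W <- Q * Z
tau <- pmax(X[, 1], 0)
Y.iv <- tau * W + Q + X[, 2] + rnorm(n)
iv.forest <- instrumental_forest(X, Y.iv, W, Z)
iv.hat <- predict(iv.forest, X.test)$predictions

# Compare the estimated quantile band and the estimated effects on the grid.
print(round(cbind(x = X.test[, 1], q.pred, tau.hat = iv.hat), 2))
```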

Developing

In addition to providing out-of-the-box forests for quantile regression and causal effect estimation, GRF provides a framework for creating forests tailored to new statistical tasks. If you'd like to develop using GRF, please consult the algorithm reference and development guide.

Funding

Development of GRF is supported by the National Science Foundation, the Sloan Foundation, the Office of Naval Research (Grant N00014-17-1-2131) and Schmidt Futures.

References

Susan Athey and Stefan Wager. Estimating Treatment Effects with Causal Forests: An Application. Observational Studies, 5, 2019.

Susan Athey, Julie Tibshirani and Stefan Wager. Generalized Random Forests. Annals of Statistics, 47(2), 2019.

Rina Friedberg, Julie Tibshirani, Susan Athey, and Stefan Wager. Local Linear Forests. Journal of Computational and Graphical Statistics, 2020.

Imke Mayer, Erik Sverdrup, Tobias Gauss, Jean-Denis Moyer, Stefan Wager and Julie Josse. Doubly Robust Treatment Effect Estimation with Missing Attributes. Annals of Applied Statistics, 14(3), 2020.

Stefan Wager and Susan Athey. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 113(523), 2018.