GRF

Generalized Random Forests



Key metrics

Overview
Name and owner: grf-labs/grf
Primary language: C++
Languages: CMake (language count: 5)
Platforms: Linux, Mac, Windows
License: GNU General Public License v3.0
Owner activity
Created: 2016-08-12 13:17:37
Last pushed: 2025-04-26 05:27:14
Last commit: 2025-04-26 15:27:11
Releases: 24
Latest release: v2.4.0 (published 2024-11-15 22:27:44)
First release: v0.9.0 (published 2019-10-12 21:31:13)
User engagement
Stars: 1k
Watchers: 46
Forks: 259
Commits: 2.1k
Issues: 522
Open issues: 55
Pull requests: 921
Open pull requests: 0
Closed pull requests: 50

generalized random forests

A pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods for least-squares regression, quantile regression, survival regression, and treatment effect estimation (optionally using instrumental variables), with support for missing values.
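As a rough illustration of three of the forest types listed above, the sketch below fits a regression forest on covariates containing missing values, a quantile forest, and an instrumental forest. The simulated data, variable names, and the particular quantiles are assumptions made for this example, not part of the package documentation. The handling of the quantile prediction's return type is defensive: recent grf releases appear to return a list with a `predictions` matrix, while older releases may return the matrix directly.

```r
library(grf)

set.seed(1)
n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)

# Least-squares regression with missing covariate values:
# blank out 10% of the first column and fit without imputing.
Y <- X[, 1] + rnorm(n)
X.miss <- X
X.miss[sample(n, n / 10), 1] <- NA
r.forest <- regression_forest(X.miss, Y)
Y.hat <- predict(r.forest)$predictions  # out-of-bag predictions

# Quantile regression: the noise scale depends on the sign of X[, 1],
# so the conditional quantiles spread out when X[, 1] is positive.
Y.q <- rnorm(n) * (1 + (X[, 1] > 0))
q.forest <- quantile_forest(X, Y.q, quantiles = c(0.1, 0.5, 0.9))
q.hat <- predict(q.forest, X[1:5, ], quantiles = c(0.1, 0.5, 0.9))
# Accept either a list with a `predictions` matrix or a bare matrix.
q.mat <- if (is.list(q.hat)) q.hat$predictions else q.hat

# Treatment effect estimation with an instrument Z (hypothetical setup):
# U confounds W and Y, while Z shifts W but affects Y only through W.
Z <- rbinom(n, 1, 0.5)
U <- rnorm(n)
W <- as.numeric(Z + U + rnorm(n) > 0.5)
Y.iv <- W * pmax(X[, 1], 0) + U + rnorm(n)
iv.forest <- instrumental_forest(X, Y.iv, W, Z)
tau.iv <- predict(iv.forest)$predictions
```

Each row of `q.mat` holds the estimated 10%, 50%, and 90% conditional quantiles for one test point, and `tau.iv` holds out-of-bag estimates of the conditional treatment effect.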

In addition, GRF supports 'honest' estimation (where one subset of the data is used for choosing splits, and another for populating the leaves of the tree), and confidence intervals for least-squares regression and treatment effect estimation.
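A minimal sketch of how honesty and confidence intervals might be combined for least-squares regression is shown below. The data-generating process and the choice of 2000 trees are assumptions for illustration; `honesty = TRUE` is passed explicitly only to make the setting visible. The intervals use a normal approximation with the forest's variance estimates.

```r
library(grf)

set.seed(42)
n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- sin(X[, 1]) + rnorm(n)

# Test points that vary only along the first covariate.
X.test <- matrix(0, 50, p)
X.test[, 1] <- seq(-2, 2, length.out = 50)

# Honest forest: within each tree, one subsample chooses the splits
# and a separate subsample populates the leaves.
r.forest <- regression_forest(X, Y, honesty = TRUE, num.trees = 2000)

# Request variance estimates alongside the point predictions.
pred <- predict(r.forest, X.test, estimate.variance = TRUE)
sigma.hat <- sqrt(pred$variance.estimates)

# Pointwise 95% confidence intervals under a normal approximation.
ci <- cbind(
  lower    = pred$predictions - 1.96 * sigma.hat,
  estimate = pred$predictions,
  upper    = pred$predictions + 1.96 * sigma.hat
)
head(ci)
```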

Some helpful links for getting started:

The repository first started as a fork of the ranger repository -- we owe a great deal of thanks to the ranger authors for their useful and free package.

Installation

The latest release of the package can be installed through CRAN:

install.packages("grf")

conda users can install from the conda-forge channel:

conda install -c conda-forge r-grf

The current development version can be installed from source using devtools:

devtools::install_github("grf-labs/grf", subdir = "r-package/grf")

Note that to install from source, a compiler that implements C++11 is required (clang 3.3 or higher, or g++ 4.8 or higher). If installing on Windows, the RTools toolchain is also required.

Usage Examples

The following script demonstrates how to use GRF for heterogeneous treatment effect estimation. For examples of how to use other types of forest, such as those for quantile regression and for causal effect estimation using instrumental variables, please consult the R documentation on the relevant forest methods (quantile_forest, instrumental_forest, etc.).

library(grf)

# Generate data.
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
X.test <- matrix(0, 101, p)
X.test[, 1] <- seq(-2, 2, length.out = 101)

# Train a causal forest.
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
Y <- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)
tau.forest <- causal_forest(X, Y, W)

# Estimate treatment effects for the training data using out-of-bag prediction.
tau.hat.oob <- predict(tau.forest)
hist(tau.hat.oob$predictions)

# Estimate treatment effects for the test sample.
tau.hat <- predict(tau.forest, X.test)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 2)

# Estimate the conditional average treatment effect on the full sample (CATE).
average_treatment_effect(tau.forest, target.sample = "all")

# Estimate the conditional average treatment effect on the treated sample (CATT).
average_treatment_effect(tau.forest, target.sample = "treated")

# Add confidence intervals for heterogeneous treatment effects; growing more trees is now recommended.
tau.forest <- causal_forest(X, Y, W, num.trees = 4000)
tau.hat <- predict(tau.forest, X.test, estimate.variance = TRUE)
sigma.hat <- sqrt(tau.hat$variance.estimates)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions + 1.96 * sigma.hat, tau.hat$predictions - 1.96 * sigma.hat, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], tau.hat$predictions + 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], tau.hat$predictions - 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 1)

# In some examples, pre-fitting models for Y and W separately may
# be helpful (e.g., if different models use different covariates).
# In some applications, one may even want to get Y.hat and W.hat
# using a completely different method (e.g., boosting).

# Generate new data.
n <- 4000
p <- 20
X <- matrix(rnorm(n * p), n, p)
TAU <- 1 / (1 + exp(-X[, 3]))
W <- rbinom(n, 1, 1 / (1 + exp(-X[, 1] - X[, 2])))
Y <- pmax(X[, 2] + X[, 3], 0) + rowMeans(X[, 4:6]) / 2 + W * TAU + rnorm(n)

forest.W <- regression_forest(X, W, tune.parameters = "all")
W.hat <- predict(forest.W)$predictions

forest.Y <- regression_forest(X, Y, tune.parameters = "all")
Y.hat <- predict(forest.Y)$predictions

forest.Y.varimp <- variable_importance(forest.Y)

# Note: Forests may have a hard time when trained on very few variables
# (e.g., ncol(X) = 1, 2, or 3). We recommend not being too aggressive
# in selection.
selected.vars <- which(forest.Y.varimp / mean(forest.Y.varimp) > 0.2)

tau.forest <- causal_forest(X[, selected.vars], Y, W,
                            W.hat = W.hat, Y.hat = Y.hat,
                            tune.parameters = "all")

# Check whether causal forest predictions are well calibrated.
test_calibration(tau.forest)
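The calls to average_treatment_effect in the script above return a point estimate together with a standard error. As a small self-contained sketch (the simulated data and the 95% level are assumptions for illustration), these can be turned into a normal-approximation confidence interval:

```r
library(grf)

set.seed(2024)
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)

cf <- causal_forest(X, Y, W)

# Named numeric vector with elements "estimate" and "std.err".
ate <- average_treatment_effect(cf, target.sample = "all")

# 95% confidence interval under a normal approximation.
ate.ci <- ate[["estimate"]] + c(-1, 1) * qnorm(0.975) * ate[["std.err"]]
print(ate)
print(ate.ci)
```

In this simulation the true average effect is E[max(X1, 0)] = 1 / sqrt(2 * pi), roughly 0.4, which gives a reference value to compare the interval against.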

Developing

In addition to providing out-of-the-box forests for quantile regression and causal effect estimation, GRF provides a framework for creating forests tailored to new statistical tasks. If you'd like to develop using GRF, please consult the algorithm reference and development guide.

Funding

Development of GRF is supported by the National Science Foundation, the Sloan Foundation, the Office of Naval Research (Grant N00014-17-1-2131) and Schmidt Futures.

References

Susan Athey and Stefan Wager.
Estimating Treatment Effects with Causal Forests: An Application.
Observational Studies, 5, 2019.

Susan Athey, Julie Tibshirani and Stefan Wager.
Generalized Random Forests.
Annals of Statistics, 47(2), 2019.

Rina Friedberg, Julie Tibshirani, Susan Athey, and Stefan Wager.
Local Linear Forests.
Journal of Computational and Graphical Statistics, 2020.

Imke Mayer, Erik Sverdrup, Tobias Gauss, Jean-Denis Moyer, Stefan Wager and Julie Josse.
Doubly Robust Treatment Effect Estimation with Missing Attributes.
Annals of Applied Statistics, 14(3), 2020.

Stefan Wager and Susan Athey.
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.
Journal of the American Statistical Association, 113(523), 2018.