featran

Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and various output formats for feature representation.

Introduction

Most feature transformation logic requires two steps: a global aggregation that summarizes the data, followed by an element-wise mapping that transforms each record. For example:

- Min-Max Scaler
  - Aggregation: global min & max
  - Mapping: scale each value to [min, max]
- One-Hot Encoder
  - Aggregation: distinct labels
  - Mapping: convert each label to a binary vector

We can implement this in a naive way using reduce and map.

case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}
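To make the result concrete, here is the same naive approach as a self-contained sketch with the output it produces for the three sample points. The object name `NaiveFeatures` and the names `lo`, `hi` and `labels` are introduced here only for illustration.

```scala
object NaiveFeatures {
  case class Point(score: Double, label: String)
  val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

  // Aggregation: one pass that collects the global min, max and distinct labels.
  val (lo, hi, labels) = data
    .map(p => (p.score, p.score, Set(p.label)))
    .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

  // Mapping: the scaled score followed by one binary column per sorted label.
  // Note: this divides by zero when all scores are equal (hi == lo).
  val features: Seq[List[Double]] = data.map { p =>
    (p.score - lo) / (hi - lo) :: labels.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = features.foreach(println)
  // List(0.0, 1.0, 0.0, 0.0)
  // List(0.5, 0.0, 1.0, 0.0)
  // List(1.0, 0.0, 0.0, 1.0)
}
```

Each row has one scaled score and three one-hot columns, so adding a feature means touching the tuple shape, the reducer and the mapping at the same time.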

But the naive approach quickly becomes unmanageable for complex feature sets. The same logic can be expressed concisely in Featran.

import com.spotify.featran._
import com.spotify.featran.transformers._

val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))

val fe = fs.extract(data)
val names = fe.featureNames
val features = fe.featureValues[Seq[Double]]
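The following is a toy model of why a declarative specification scales better than hand-written tuples. It is not Featran's API or implementation: `Feature`, `minMax`, `oneHot` and `extract` are hypothetical names, and the `name_label` column naming is an assumption made for this sketch. Each feature bundles its own aggregation and mapping, so features can be listed independently and combined by a generic driver.

```scala
// Toy model only: NOT Featran's API or implementation.
object ToySpec {
  // A feature takes the whole data set (aggregation step) and returns its
  // column names plus a per-record function (mapping step).
  final case class Feature[T](fit: Seq[T] => (Seq[String], T => Seq[Double]))

  def minMax[T](name: String)(f: T => Double): Feature[T] = Feature[T] { data =>
    val xs = data.map(f)
    val (lo, hi) = (xs.min, xs.max)
    // Divides by zero when all values are equal; ignored in this sketch.
    (Seq(name), (t: T) => Seq((f(t) - lo) / (hi - lo)))
  }

  def oneHot[T](name: String)(f: T => String): Feature[T] = Feature[T] { data =>
    val labels = data.map(f).distinct.sorted
    (labels.map(l => s"${name}_$l"), (t: T) => labels.map(l => if (l == f(t)) 1.0 else 0.0))
  }

  // Generic driver: fit every feature once, then concatenate names and values.
  def extract[T](features: Seq[Feature[T]], data: Seq[T]): (Seq[String], Seq[Seq[Double]]) = {
    val fitted = features.map(_.fit(data))
    val names = fitted.flatMap(_._1)
    val rows = data.map(t => fitted.flatMap(_._2(t)))
    (names, rows)
  }

  case class Point(score: Double, label: String)
  val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

  val (names, rows) = extract(
    Seq(minMax[Point]("min-max")(_.score), oneHot[Point]("one-hot")(_.label)),
    data
  )
  // names: min-max, one-hot_a, one-hot_b, one-hot_c
  // rows:  (0.0, 1.0, 0.0, 0.0), (0.5, 0.0, 1.0, 0.0), (1.0, 0.0, 0.0, 1.0)
}
```

Unlike this sketch, which scans the data once per feature, the two-step description in the introduction implies a single global aggregation.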

Featran also supports the following:

- Extract from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
- Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf, XGBoost LabeledPoint and NumPy .npy file
- Import aggregation from a previous extraction for training, validation and test sets
- Compose feature specifications and separate outputs
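Importing a previous aggregation matters because validation and test data should be transformed with the statistics learned from the training set, not with their own. The idea can be sketched without Featran as follows; `MinMax`, `fit` and `scale` are hypothetical names for this illustration and do not correspond to the library's API.

```scala
object ReuseAggregation {
  // The saved result of the aggregation step.
  final case class MinMax(min: Double, max: Double) {
    def scale(x: Double): Double = (x - min) / (max - min)
  }

  // Aggregation, run on the training set only.
  def fit(xs: Seq[Double]): MinMax = MinMax(xs.min, xs.max)

  val train = Seq(1.0, 2.0, 3.0)
  val test = Seq(1.5, 4.0)

  val settings = fit(train)                  // MinMax(1.0, 3.0)
  val scaledTest = test.map(settings.scale)  // Seq(0.25, 1.5)
}
```

In this sketch a test value outside the training range (4.0) maps outside [0, 1]; fitting on the test set instead would hide that and make the two sets incomparable.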

See Examples (source) for detailed examples. See transformers package for a complete list of available feature transformers.

See ScalaDocs for current API documentation.

Presentations

Featran - Type safe and generic feature transformation in Scala - NABD Conf Palo Alto 2017 talk

Artifacts

Featran includes the following artifacts:

featran-core - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
featran-java - Java interface, see JavaExample.java
featran-flink - support for extraction from Flink DataSet
featran-scalding - support for extraction from Scalding TypedPipe
featran-scio - support for extraction from Scio SCollection
featran-spark - support for extraction from Spark RDD
featran-tensorflow - support for output as TensorFlow Example Protobuf
featran-xgboost - support for output as XGBoost LabeledPoint
featran-numpy - support for output as NumPy .npy file
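As a sketch, the artifacts could be added to an sbt build as shown below. The `com.spotify` group ID is inferred from the `com.spotify.featran` package name and the version `0.8.0` is an assumption; check the project's releases for the current coordinates.

```scala
// build.sbt
// Assumed version; replace with the release you intend to use.
val featranVersion = "0.8.0"

libraryDependencies ++= Seq(
  "com.spotify" %% "featran-core" % featranVersion,       // always needed
  "com.spotify" %% "featran-tensorflow" % featranVersion  // optional output format
)
```

Only `featran-core` is needed for Scala collections; the other modules are added per input or output backend.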

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Name and owner	spotify/featran
Primary language	Scala
Languages	Scala (5 languages in total)
License	Apache License 2.0

Created	2017-05-08 17:20:27
Last pushed	2025-02-07 19:39:26
Releases	39
Latest release	v0.8.0 (released 2023-01-18 13:54:25)
First release	v0.1.0 (released 2017-05-23 22:36:05)

Stars	469
Watchers	26
Forks	68
Commits	798
Issues	85
Open issues	10
Pull requests	494
Open pull requests	1
Closed pull requests	104
