NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit
Introduction
NeuralClassifier is designed for quick implementation of neural models for hierarchical multi-label classification task, which is more challenging and common in real-world scenarios. A salient feature is that NeuralClassifier currently provides a variety of text encoders, such as FastText, TextCNN, TextRNN, RCNN, VDCNN, DPCNN, DRNN, AttentiveConvNet and Transformer encoder, etc. It also supports other text classification scenarios, including binary-class and multi-class classification. It is built on PyTorch. Experiments show that models built in our toolkit achieve comparable performance with reported results in the literature.
Support tasks
- Binary-class text classifcation
- Multi-class text classification
- Multi-label text classification
- Hiearchical (multi-label) text classification (HMC)
Support text encoders
- TextCNN (Kim, 2014)
- RCNN (Lai et al., 2015)
- TextRNN (Liu et al., 2016)
- FastText (Joulin et al., 2016)
- VDCNN (Conneau et al., 2016)
- DPCNN (Johnson and Zhang, 2017)
- AttentiveConvNet (Yin and Schutze, 2017)
- DRNN (Wang, 2018)
- Region embedding (Qiao et al., 2018)
- Transformer encoder (Vaswani et al., 2017)
- Star-Transformer encoder (Guo et al., 2019)
Requirement
- Python 3
- PyTorch 0.4+
- Numpy 1.14.3+
System Architecture
Usage
Training
python train.py conf/train.json
Detail configurations and explanations see Configuration.
The training info will be outputted in standard output and log.logger_file.
Evaluation
python eval.py conf/train.json
- if eval.is_flat = false, hierarchical evaluation will be outputted.
- eval.model_dir is the model to evaluate.
- data.test_json_files is the input text file to evaluate.
The evaluation info will be outputed in eval.dir.
Prediction
python predict.py conf/train.json data/predict.json
- predict.json should be of json format, while each instance has a dummy label like "其他" or any other label in label map.
- eval.model_dir is the model to predict.
- eval.top_k is the number of labels to output.
- eval.threshold is the probability threshold.
The predict info will be outputed in predict.txt.
Input Data Format
JSON example:
{
"doc_label": ["Computer--MachineLearning--DeepLearning", "Neuro--ComputationalNeuro"],
"doc_token": ["I", "love", "deep", "learning"],
"doc_keyword": ["deep learning"],
"doc_topic": ["AI", "Machine learning"]
}
"doc_keyword" and "doc_topic" are optional.
Performance
0. Dataset
- RCV1: Lewis et al., 2004
- Yelp: Yelp
1. Compare with state-of-the-art
- HR-DGCNN: Peng et al., 2018
- HMCN: Wehrmann et al., 2018
2. Different text encoders
3. Hierarchical vs Flat
Acknowledgement
Some public codes are referenced by our toolkit:
- https://pytorch.org/docs/stable/
- https://github.com/jadore801120/attention-is-all-you-need-pytorch/
- https://github.com/Hsuxu/FocalLoss-PyTorch
- https://github.com/Shawn1993/cnn-text-classification-pytorch
- https://github.com/ailias/Focal-Loss-implement-on-Tensorflow/
- https://github.com/brightmart/text_classification
- https://github.com/NLPLearn/QANet
- https://github.com/huggingface/pytorch-pretrained-BERT
Update
- 2019-04-29, init version