StanfordNLP

[Deprecated] This library has been renamed to "Stanza". Latest development at: https://github.com/stanfordnlp/stanza

Owner: stanfordnlp/stanfordnlp
Platform: Linux, Mac, Windows
License: Other
Categories: Python, Natural Language Processing
Topics: universal-dependencies

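All development, issues, ongoing maintenance, and support have moved to our new GitHub repository, because the toolkit has been renamed Stanza since version 1.0.0. Please visit our new website for more information. You can still download stanfordnlp via pip, but newer versions of this package will be made available as stanza. This repository is kept for archival purposes.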

（The first version translated by vz on 2020.12.12）

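Key metrics

Overview

Name and owner	stanfordnlp/stanfordnlp
Primary programming language	Python
Programming languages	Shell (number of languages: 5)
Platform	Linux, Mac, Windows
License	Other

Owner activity

Created at	2020-03-16 22:26:27
Pushed at	2023-09-12 15:39:50
Last commit at	2023-09-12 08:39:50
Releases	3
Latest release	v0.2.0
First release	v0.1.0

User engagement

Stars	120
Watchers	14
Forks	29
Commits	867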
Issues	10
Open issues	2
Pull requests	0
Open pull requests	0
Closed pull requests	1

StanfordNLP: A Python NLP Library for Many Human Languages

The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server. For detailed information please visit our official website.

References

If you use our neural pipeline including the tokenizer, the multi-word token expansion model, the lemmatizer, the POS/morphological features tagger, or the dependency parser in your research, please kindly cite our CoNLL 2018 Shared Task system description paper:

@inproceedings{qi2018universal,
 address = {Brussels, Belgium},
 author = {Qi, Peng  and  Dozat, Timothy  and  Zhang, Yuhao  and  Manning, Christopher D.},
 booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
 month = {October},
 pages = {160--170},
 publisher = {Association for Computational Linguistics},
 title = {Universal Dependency Parsing from Scratch},
 url = {https://nlp.stanford.edu/pubs/qi2018universal.pdf},
 year = {2018}
}

The PyTorch implementation of the neural pipeline in this repository is due to Peng Qi and Yuhao Zhang, with help from Tim Dozat and Jason Bolton.

This release is not the same as Stanford's CoNLL 2018 Shared Task system. The tokenizer, lemmatizer, morphological features, and multi-word term systems are cleaned-up versions of the shared task code, but in the competition we used a TensorFlow version of the tagger and parser by Tim Dozat, which has been approximately reproduced in PyTorch (though with a few deviations from the original) for this release.

If you use the CoreNLP server, please cite the CoreNLP software package and the respective modules as described here ("Citing Stanford CoreNLP in papers"). The CoreNLP client is mostly written by Arun Chaganty, and Jason Bolton spearheaded merging the two projects together.

Issues and Usage Q&A

To ask questions, report issues or request features, please use the GitHub Issue Tracker.

Setup

StanfordNLP supports Python 3.6 or later. We strongly recommend that you install StanfordNLP from PyPI. If you already have pip installed, simply run:

pip install stanfordnlp

This should also help resolve all of the dependencies of StanfordNLP, for instance PyTorch 1.0.0 or above.

If you currently have a previous version of stanfordnlp installed, use:

pip install stanfordnlp -U

Alternatively, you can install from the source of this git repository, which will give you more flexibility in developing on top of StanfordNLP and training your own models. For this option, run:

git clone https://github.com/stanfordnlp/stanfordnlp.git
cd stanfordnlp
pip install -e .

Running StanfordNLP

Getting Started with the neural pipeline

To run your first StanfordNLP pipeline, simply follow these steps in your Python interactive interpreter:

>>> import stanfordnlp
>>> stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
# IMPORTANT: The above line prompts you before downloading, which doesn't work well in a Jupyter notebook.
# To avoid a prompt when using notebooks, instead use: >>> stanfordnlp.download('en', force=True)
>>> nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

The last command prints each word in the first sentence of the input string (or Document, as it is represented in StanfordNLP), together with the index of the word that governs it in the Universal Dependencies parse of that sentence (its "head") and the dependency relation between the two words. The output should look like:

('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')
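
To make this output easier to read, the head indices can be resolved to the words they point at. The following is a minimal sketch in plain Python that works on the tuples printed above and does not call StanfordNLP. The helper name `resolve_heads` is hypothetical, and the sketch assumes that head indices are 1-based positions within the sentence, with 0 marking the root.

```python
# Tuples copied from the example output above: (word, head index, relation).
dependencies = [
    ("Barack", "4", "nsubj:pass"),
    ("Obama", "1", "flat"),
    ("was", "4", "aux:pass"),
    ("born", "0", "root"),
    ("in", "6", "case"),
    ("Hawaii", "4", "obl"),
    (".", "4", "punct"),
]


def resolve_heads(deps):
    """Replace each head index with the head word (hypothetical helper).

    Assumes head indices are 1-based positions in the sentence and
    that index 0 denotes the artificial root.
    """
    words = [word for word, _, _ in deps]
    resolved = []
    for word, head, relation in deps:
        head_index = int(head)
        head_word = "ROOT" if head_index == 0 else words[head_index - 1]
        resolved.append((word, head_word, relation))
    return resolved


for word, head_word, relation in resolve_heads(dependencies):
    print(f"{word} --{relation}--> {head_word}")
```

Read this way, "born" is the root of the sentence, and "Barack", "was", "Hawaii" and the final period all attach to it.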

Note: If you run into issues like OSError: [Errno 22] Invalid argument, it is very likely that you are affected by a known Python issue. We recommend Python 3.6.8 or later in the 3.6 series, or Python 3.7.2 or later.
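
A quick way to check an interpreter against this recommendation is sketched below. The function name `avoids_known_oserror` is hypothetical, and the sketch rests on one interpretation of the note: the 3.6 series needs 3.6.8 or later, the 3.7 series needs 3.7.2 or later, and any later series is treated as unaffected.

```python
import sys


def avoids_known_oserror(version_info=sys.version_info):
    """Return True if the interpreter meets the recommendation above.

    Interpretation (an assumption): the 3.6 series needs 3.6.8+, the
    3.7 series needs 3.7.2+, and later series are treated as fine.
    """
    version = tuple(version_info[:3])
    if version < (3, 6, 8):
        return False
    if (3, 7, 0) <= version < (3, 7, 2):
        return False
    return True


if not avoids_known_oserror():
    print("Consider upgrading Python to avoid OSError: [Errno 22].")
```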

We also provide a multilingual demo script that demonstrates how to use StanfordNLP in languages other than English, for example Chinese (traditional):

python demo/pipeline_demo.py -l zh

See our getting started guide for more details.

Access to Java Stanford CoreNLP Server

Aside from the neural pipeline, this project also includes an official wrapper for accessing the Java Stanford CoreNLP Server with Python code.

There are a few initial setup steps.

Download Stanford CoreNLP and models for the language you wish to use
Put the model jars in the distribution folder
Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05
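
Before starting the client, it can help to confirm that the setup steps above took effect. The sketch below uses only the Python standard library; `find_corenlp_jars` is a hypothetical helper, not part of StanfordNLP, and it assumes the jars sit directly inside the folder named by CORENLP_HOME.

```python
import os
from pathlib import Path


def find_corenlp_jars(env=os.environ):
    """Return the sorted jar file names found under CORENLP_HOME.

    Hypothetical helper: raises a descriptive error if the variable is
    unset or does not point to an existing directory.
    """
    home = env.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    folder = Path(home)
    if not folder.is_dir():
        raise FileNotFoundError(f"CORENLP_HOME is not a directory: {folder}")
    return sorted(path.name for path in folder.glob("*.jar"))


try:
    print(find_corenlp_jars())
except (RuntimeError, FileNotFoundError) as error:
    print(f"CoreNLP setup problem: {error}")
```

An empty list means the folder exists but contains no jars, which usually points to a wrong path or to model jars placed elsewhere.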

We provide another demo script that shows how one can use the CoreNLP client and extract various annotations from it.

Online Colab Notebooks

To get you started, we also provide interactive Jupyter notebooks in the demo folder. You can also open these notebooks and run them interactively on Google Colab. To view all available notebooks, follow these steps:

Go to the Google Colab website
Navigate to File -> Open notebook, and choose GitHub in the pop-up menu
Note that you do not need to give Colab access permission to your GitHub account
Type stanfordnlp/stanfordnlp in the search bar, and press Enter

Trained Models for the Neural Pipeline

We currently provide models for all of the treebanks in the CoNLL 2018 Shared Task. You can find instructions for downloading and using these models here.

Batching To Maximize Pipeline Speed

To maximize speed performance, it is essential to run the pipeline on batches of documents. Running a for loop
on one sentence at a time will be very slow. The best approach at this time is to concatenate documents together,
with each document separated by a blank line (i.e., two line breaks \n\n). The tokenizer will recognize blank lines as sentence breaks.
We are actively working on improving multi-document processing.
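
The concatenation step can be sketched as follows. `join_documents` is a hypothetical helper written for illustration; it drops empty documents and trims surrounding whitespace so that exactly one blank line separates consecutive documents. The pipeline call is shown only in comments because it requires the downloaded models.

```python
def join_documents(documents):
    """Concatenate documents, separating them with one blank line."""
    cleaned = [doc.strip() for doc in documents if doc.strip()]
    return "\n\n".join(cleaned)


documents = [
    "Barack Obama was born in Hawaii.",
    "He was elected president in 2008.",
]
batch = join_documents(documents)
print(repr(batch))

# With the models downloaded, a single call would then process the batch:
#   nlp = stanfordnlp.Pipeline()
#   doc = nlp(batch)
```

Because the pipeline is called once on the joined string, it returns one Document, so mapping sentences back to their source documents is left to the caller.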

Training your own neural pipelines

All neural modules in this library, including the tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer and the dependency parser, can be trained with your own CoNLL-U format data. Currently, we do not support model training via the Pipeline interface. Therefore, to train your own models, you need to clone this git repository and set up from source.
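
For readers unfamiliar with the data format, CoNLL-U stores one token per line with ten tab-separated columns, with comment lines starting with `#` and a blank line between sentences. The sketch below parses a single sentence block in plain Python. It is independent of StanfordNLP's own data loaders, `parse_conllu_sentence` is a hypothetical helper, and the two-token sample is hand-written for illustration.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)

# A hand-written two-token sample; the annotations are illustrative only.
sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
)


def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    return tokens


for token in parse_conllu_sentence(sample):
    print(token["id"], token["form"], token["head"], token["deprel"])
```

A check like this can catch malformed lines, such as spaces used in place of tabs, before a training run starts.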

For detailed step-by-step guidance on how to train and evaluate your own models, please visit our training documentation.

LICENSE

StanfordNLP is released under the Apache License, Version 2.0. See the LICENSE file for more details.