talon

用于提取邮件引文和签名的 Mailgun 库。「Mailgun library to extract message quotations and signatures.」

Github星跟踪图

talon

Mailgun library to extract message quotations and signatures.

If you ever tried to parse message quotations or signatures you know that absence of any formatting standards in this area could make this task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON - multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in a number of hostile environments. That’s what a good quotations and signature parser should be like :smile:

Usage

Here’s how you initialize the library and extract a reply from a text
message:

.. code:: python

import talon
from talon import quotations

talon.init()

text =  """Reply

-----Original Message-----

Quote"""

reply = quotations.extract_from(text, 'text/plain')
reply = quotations.extract_from_plain(text)
# reply == "Reply"

To extract a reply from html:

.. code:: python

html = """Reply
<blockquote>

  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
  </div>

  <div>
    Quote
  </div>

</blockquote>"""

reply = quotations.extract_from(html, 'text/html')
reply = quotations.extract_from_html(html)
# reply == "<html><body><p>Reply</p></body></html>"

Often the best way is the easiest one. Here’s how you can extract
signature from email message without any
machine learning fancy stuff:

.. code:: python

from talon.signature.bruteforce import extract_signature


message = """Wow. Awesome!
--
Bob Smith"""

text, signature = extract_signature(message)
# text == "Wow. Awesome!"
# signature == "--\nBob Smith"

Quick and works like a charm 90% of the time. For other 10% you can use
the power of machine learning algorithms:

.. code:: python

import talon
# don't forget to init the library first
# it loads machine learning classifiers
talon.init()

from talon import signature


message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

John Doe
via mobile"""

text, signature = signature.extract(message, sender='john.doe@example.com')
# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
# signature == "John Doe\nvia mobile"

For machine learning talon currently uses the scikit-learn_ library to build SVM
classifiers. The core of machine learning algorithm lays in
talon.signature.learning package. It defines a set of features to
apply to a message (featurespace.py), how data sets are built
(dataset.py), classifier’s interface (classifier.py).

Currently the data used for training is taken from our personal email
conversations and from ENRON_ dataset. As a result of applying our set
of features to the dataset we provide files classifier and
train.data that don’t have any personal information but could be
used to load trained classifier. Those files should be regenerated every
time the feature/data set is changed.

To regenerate the model files, you can run

.. code:: sh

python train.py

or

.. code:: python

from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
from talon.signature.learning.classifier import train, init
train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

Open-source Dataset

Recently we started a forge_ project to create an open-source, annotated dataset of raw emails. In the project we
used a subset of ENRON_ data, cleansed of private, health and financial information by EDRM_. At the moment over 190
emails are annotated. Any contribution and collaboration on the project are welcome. Once the dataset is ready we plan to
start using it for talon.

.. _scikit-learn: http://scikit-learn.org
.. _ENRON: https://www.cs.cmu.edu/~enron/
.. _EDRM: http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set
.. _forge: https://github.com/mailgun/forge

Training on your dataset

talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the forge_ project does. Then do:

.. code:: python

from talon.signature.learning.dataset import build_extraction_dataset
from talon.signature.learning import classifier as c 

build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")

Note that for signature extraction you need just the folder with the positive samples with annotated signature lines (P folder).

.. _forge: https://github.com/mailgun/forge

Research

The library is inspired by the following research papers and projects:

主要指标

概览
名称与所有者pusher/pusher-http-php
主编程语言PHP
编程语言Python (语言数: 1)
平台
许可证
所有者活动
创建于2012-07-31 17:21:19
推送于2025-01-06 11:03:22
最后一次提交2025-01-06 12:03:22
发布数57
最新版本名称7.2.7 (发布于 )
第一版名称v2.1.2 (发布于 2012-11-18 10:51:58)
用户参与
星数1.4k
关注者数53
派生数310
提交数549
已启用问题?
问题数147
打开的问题数0
拉请求数202
打开的拉请求数1
关闭的拉请求数49
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?