ark-twokenize-py

Python port of the Twokenize class of ark-tweet-nlp

  • Owner: myleott/ark-twokenize-py
  • License: GNU General Public License v3.0


This is a crude Python port of the Twokenize class from ark-tweet-nlp.

It produces nearly identical output to the original Java tokenizer, except in a
few infrequent situations. In particular, Python's regular expressions did not
support partial case-insensitivity when this port was written (scoped inline
flags such as (?i:...) were only added in Python 3.6), and this causes some
tokenization differences for "Eastern"-style emoticons, particularly when the
left and right halves are of different cases. For example:

Java (original): v.V
Python (port): v . V
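
To illustrate the mechanism, here is a minimal sketch that assumes a heavily simplified emoticon pattern. FACE, REST, STRICT, SCOPED and tokenize are hypothetical names made up for this example and are not taken from the port or from ark-tweet-nlp. The sketch contrasts a case-sensitive pattern with one wrapped in a scoped (?i:...) group, which needs Python 3.6 or later:

```python
import re

# Hypothetical, heavily simplified "Eastern" face: an eye, a centre
# character, then the same eye again via a back-reference.
FACE = r"([ovtx])[._-]\1"

# Fallback alternatives: a run of word characters, or any single
# non-space character.
REST = r"|\w+|\S"

# No flag: the back-reference must repeat the eye in exactly the same case.
STRICT = re.compile(r"(?:" + FACE + r")" + REST)

# Scoped flag (Python 3.6+): only the face is matched case-insensitively.
SCOPED = re.compile(r"(?i:" + FACE + r")" + REST)


def tokenize(pattern, text):
    """Return the non-overlapping matches of pattern in text, left to right."""
    return [m.group(0) for m in pattern.finditer(text)]


print(" ".join(tokenize(STRICT, "v.V")))  # v . V
print(" ".join(tokenize(SCOPED, "v.V")))  # v.V
```

In this toy setting the case-sensitive pattern splits the mixed-case face into three tokens, while the scoped version keeps it whole. The real tokenizer's patterns are considerably larger, so this only shows the general effect.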

Emoticons of this kind appear to be rare. Nevertheless, I have included
a fix for one special case:

Java (original): o.O
Python (port, w/o fix): o . O
Python (port, w/ fix): o.O
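
One way a special case like this could be handled without any case-insensitivity flag is to list both cases explicitly in a character class. The sketch below is an assumption about the general approach, not the port's actual code; O_FACE, TOKEN and tokenize are hypothetical names:

```python
import re

# Hypothetical special case: an "o" or "O" on either side of a centre
# character, with both cases spelled out so no flag is needed.
O_FACE = r"[oO][._-][oO]"

# Try the special case first, then fall back to a run of word characters
# or any single non-space character.
TOKEN = re.compile(O_FACE + r"|\w+|\S")


def tokenize(text):
    """Split text into tokens, keeping o.O-style faces in one piece."""
    return TOKEN.findall(text)


print(" ".join(tokenize("o.O")))  # o.O
print(" ".join(tokenize("v.V")))  # v . V
```

Because only the o/O pair is enumerated, other mixed-case faces such as v.V are still split in this sketch.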

Evaluation

A comparison on 1 million tweets found 83 instances (0.0083%) where tokenization
differed between the original Java version and this Python port. The differences
were primarily related to the emoticon issue discussed above, and it was not
clear in general which output was more desirable. For example:

Text:
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Java (original):
Profi t-T aking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Python (port):
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
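
A comparison of this kind can be sketched as follows, assuming each tokenizer has written one space-separated tokenization per tweet, in the same order. The function name count_differences and the in-memory lists are illustrative assumptions, not the script used for the evaluation above:

```python
def count_differences(reference_lines, port_lines):
    """Return (differing lines, total lines, percentage differing).

    Two lines count as equal when they split into the same token
    sequence, so runs of extra whitespace are ignored.
    """
    pairs = list(zip(reference_lines, port_lines))
    differing = sum(1 for ref, port in pairs if ref.split() != port.split())
    total = len(pairs)
    percent = 100.0 * differing / total if total else 0.0
    return differing, total, percent


# Tiny made-up sample: the first pair differs, the second is identical.
java_out = ["Profi t-T aking Hits Nikkei", "RT @WSJmarkets"]
python_out = ["Profit-Taking Hits Nikkei", "RT @WSJmarkets"]
print(count_differences(java_out, python_out))  # (1, 2, 50.0)
```

With the figures reported above, 83 differing tweets out of 1,000,000 gives 100 * 83 / 1,000,000 = 0.0083%.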

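Key metrics

Overview
  • Name and owner: myleott/ark-twokenize-py
  • Primary language: Python
  • Languages: Python (1 language)
  • License: GNU General Public License v3.0

Owner activity
  • Created: 2013-04-29 20:15:32
  • Last push: 2018-05-04 16:45:54
  • Last commit: 2018-05-04 09:32:21
  • Releases: 0

User engagement
  • Stars: 142
  • Watchers: 11
  • Forks: 60
  • Commits: 5
  • Issues: 1
  • Open issues: 0
  • Pull requests: 1
  • Open pull requests: 0
  • Closed pull requests: 3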