ark-twokenize-py

Python port of the Twokenize class of ark-tweet-nlp

  • Owner: myleott/ark-twokenize-py
  • License: GNU General Public License v3.0


ark-twokenize-py

This is a crude Python port of the Twokenize class from ark-tweet-nlp.

It produces nearly identical output to the original Java tokenizer, except in a
few infrequent situations. In particular, Python (before version 3.6) does not
support partial case-insensitivity in regular expressions, and this causes some
tokenization differences for "Eastern"-style emoticons, particularly when the
left and right halves differ in case. For example:

Java (original): v.V
Python (port): v . V
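
The difference can be reproduced with a toy pattern. The sketch below is not the tokenizer's actual emoticon regex: `FACE` and both compiled variants are hypothetical, reduced to an eye character, a dot, and a backreference to the same eye. With case-sensitive matching the backreference has to repeat the eye exactly, so `v.V` is not recognized as one unit; with case-insensitive matching it is. The scoped inline flag `(?i:...)` assumes Python 3.6 or later.

```python
import re

# Hypothetical, heavily simplified "Eastern" emoticon: eye, dot, same eye again.
FACE = r"([ov])\.\1"

# Case-sensitive: the backreference must match the captured eye exactly.
strict = re.compile(FACE)

# Case-insensitive only inside the scoped group (Python 3.6+).
scoped = re.compile(r"(?i:" + FACE + r")")


def is_single_emoticon(pattern, text):
    """Return True if the whole text matches the pattern as one emoticon."""
    return pattern.fullmatch(text) is not None
```

Here `is_single_emoticon(strict, "v.V")` is false while `is_single_emoticon(scoped, "v.V")` is true, which mirrors the Java/Python split shown above.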

Emoticons of this kind are seemingly pretty rare. Nevertheless, I have included
a fix for one special case:

Java (original): o.O
Python (port, w/o fix): o . O
Python (port, w/ fix): o.O
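
One way such a special case could be handled is to try an explicit alternative for the emoticon before the general rules. This is only a sketch under that assumption, not the code used in this port: `BASE`, `WITH_FIX`, and `tokenize` are hypothetical names, and the toy tokenizer knows nothing about URLs, mentions, or other emoticons.

```python
import re

# Hypothetical toy tokenizer: runs of word characters, or single punctuation marks.
BASE = r"\w+|[^\w\s]"

# Same rules, but an explicit o.O-style alternative is tried first so it stays whole.
WITH_FIX = r"[oO]\.[oO]|" + BASE


def tokenize(text, pattern=WITH_FIX):
    """Split text into tokens using the given alternation pattern."""
    return re.findall(pattern, text)
```

With `BASE` alone, `o.O` falls apart into three tokens; with `WITH_FIX` it is kept together, while `v.V` is still split because no special case covers it.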

Evaluation

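A comparison on 1 million tweets found 83 instances (0.0083%) where tokenization
differed between the original Java version and this Python port. The differences
were primarily related to the emoticon issue discussed above, and it was not
clear in general which output was more desirable. In the example below, the Java
tokenizer apparently treats the "t-T" inside "Profit-Taking" as an emoticon and
splits the word around it, while the Python port leaves the word intact: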

Text:
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Java (original):
Profi t-T aking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Python (port):
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
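
A comparison of this kind can be sketched as a line-by-line diff of the two tokenizers' outputs. The helper below is an illustration only, not the script used for the evaluation above; `mismatch_report` is a hypothetical name, and it assumes each tokenized tweet is stored as one space-joined string.

```python
def mismatch_report(java_outputs, python_outputs):
    """Compare two equally long lists of tokenized lines.

    Returns the differing (java, python) pairs and the mismatch rate in percent.
    """
    if len(java_outputs) != len(python_outputs):
        raise ValueError("both lists must cover the same tweets")
    diffs = [(j, p) for j, p in zip(java_outputs, python_outputs) if j != p]
    rate = 100.0 * len(diffs) / len(java_outputs) if java_outputs else 0.0
    return diffs, rate
```

The same arithmetic applied to 83 differing lines out of 1,000,000 gives the 0.0083% reported above.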

Main metrics

Overview
Name With Owner: myleott/ark-twokenize-py
Primary Language: Python
Program language: Python (Language Count: 1)
License: GNU General Public License v3.0
Owner Activity
Created At: 2013-04-29 20:15:32
Pushed At: 2018-05-04 16:45:54
Last Commit At: 2018-05-04 09:32:21
Release Count: 0
User Engagement
Stargazers Count: 142
Watchers Count: 11
Fork Count: 60
Commits Count: 5
Issues Count: 1
Issue Open Count: 0
Pull Requests Count: 1
Pull Requests Open Count: 0
Pull Requests Close Count: 3