ark-twokenize-py

This is a crude Python port of the Twokenize class from ark-tweet-nlp.

It produces nearly identical output to the original Java tokenizer, except in a
few infrequent situations. In particular, Python's re module did not support
partial case-insensitivity in regular expressions before Python 3.6 (scoped
inline flags), and this causes some tokenization differences for "Eastern"
style emoticons, particularly when the left and right halves are of different
cases. For example:

Java (original): v.V
Python (port): v . V

Emoticons of this kind appear to be rare. Nevertheless, I have included a fix
for one special case:

Java (original): o.O
Python (port, w/o fix): o . O
Python (port, w/ fix): o.O
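
To make the case-sensitivity issue concrete, here is a minimal sketch. It assumes a deliberately simplified face pattern (a left "eye", a center character, and a backreference to the left eye). The names FACE, STRICT, LOOSE and find_faces are hypothetical and are not taken from the port or from ark-tweet-nlp. The real patterns are considerably larger.

```python
import re

# Hypothetical, simplified face pattern (NOT the pattern used by the port):
# a left "eye", a center character, and a right eye that repeats the left one.
FACE = r"([ovt^])[._-]\1"

# Case-sensitive: the backreference must match the left eye exactly.
STRICT = re.compile(FACE)
# Whole-pattern case-insensitivity: the backreference also matches the
# other case of the left eye.
LOOSE = re.compile(FACE, re.IGNORECASE)


def find_faces(pattern, text):
    """Return every substring of `text` matched by `pattern`."""
    return [m.group(0) for m in pattern.finditer(text)]


print(find_faces(STRICT, "v.V"))           # []
print(find_faces(LOOSE, "v.V"))            # ['v.V']
print(find_faces(LOOSE, "Profit-Taking"))  # ['t-T']
```

The last call shows that a case-insensitive backreference can also match inside an ordinary word, so neither behaviour is obviously the better one.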

Evaluation

A comparison on 1 million tweets found 83 instances (0.0083%) where tokenization
differed between the original Java version and this Python port. The differences
were primarily related to the emoticon issue discussed above, and it was not
clear in general which output was more desirable. In the example below, the Java
tokenizer treats the "t-T" inside "Profit-Taking" as an emoticon and splits the
word, while the Python port leaves the word intact:

Text:
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Java (original):
Profi t-T aking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets

Python (port):
Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets
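
A comparison of this kind can be sketched as follows. This is only an illustration, not the script used for the evaluation above: the function name compare_tokenizations is hypothetical, and it assumes each tokenizer's output is available as a list of strings, one whitespace-separated tokenization per tweet, in the same order.

```python
def compare_tokenizations(java_lines, python_lines):
    """Return (mismatches, percent) for two parallel lists of tokenized tweets.

    Each element is one tweet's tokens joined by whitespace. The two lists
    are assumed to be aligned tweet-by-tweet.
    """
    if len(java_lines) != len(python_lines):
        raise ValueError("inputs must contain the same number of tweets")

    # Compare token lists rather than raw strings so that differences in
    # spacing alone are not counted as tokenization differences.
    mismatches = [
        (index, java, python)
        for index, (java, python) in enumerate(zip(java_lines, python_lines))
        if java.split() != python.split()
    ]
    percent = 100.0 * len(mismatches) / len(java_lines) if java_lines else 0.0
    return mismatches, percent


java_out = [
    "Profi t-T aking Hits Nikkei RT @WSJmarkets",
    "hello world",
]
python_out = [
    "Profit-Taking Hits Nikkei RT @WSJmarkets",
    "hello world",
]
mismatches, percent = compare_tokenizations(java_out, python_out)
print(len(mismatches), percent)  # 1 50.0
```

With the figures reported above, 83 mismatches in 1,000,000 tweets correspond to 100 * 83 / 1000000 = 0.0083 percent.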

Name and owner	myleott/ark-twokenize-py
Primary language	Python
Languages	Python (language count: 1)
Platform
License	GNU General Public License v3.0

Created at	2013-04-29 20:15:32
Pushed at	2018-05-04 16:45:54
Last commit	2018-05-04 09:32:21
Releases	0

Stars	142
Watchers	11
Forks	60
Commits	5
Issues enabled?
Issues	1
Open issues	0
Pull requests	1
Open pull requests	0
Closed pull requests	3
