nlp_thai_resources

A collection of more than 50 Thai Natural Language Processing libraries. Updated daily.


Thai Natural Language Processing (Thai NLP) Resource

A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.

Thai NLP Libraries/Services

Thai Character Cluster

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
JTCC, Thai Character Cluster, Java, GPL-3.0, Wittawat
TCC, Thai Character Cluster, Python, Apache 2.0, Wannaphong

Thai Sentiment Analysis

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
sentiment_analysis_thai, JagerV3

Thai Soundex

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
LK82 + Udom83, Thai Soundex, Python, Korakot

Word Segmentation

Part of Speech Tagging (POS Tagging)

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Jitar+NAiST, A simple Trigram HMM part-of-speech tagger, Java, Veer66, Jitar + NAiST, 1 + NAiST, 2
SynThai, Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM., Python, 0.9163 F-measure. RNN. LSTM, MIT, KenjiroAI, github
Chart-POS, Thai POS Tagger, C, All rights reserved, AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), Thodsaporn C., Demo at iApp

Named Entity Recognition

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Named Entity Tagging (Thai NEST), Thai Named Entity tagging Specification and Tools, GPL, KINDML, SIIT, AIAT
ThaiNER, Thai Named Entity Recognition for PyThaiNLP, Python, Apache 2.0 (code) & CC BY 3.0 (Dataset), ThaiNER

News Structure Tagging

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
News Structure Tagging Program, Thai News Structure Tagging Program, Metadata tagging, Structure tagging, Automatic News Title Generation, GPL, AIAT

Syntactic Parsing & Tools

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Chart-parser, Extract Syntactic Structure from POS Tagged Sentence., C, All rights reserved, AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), Thodsaporn C., Demo at iApp
Grammar Processing, Labelled Brackets -> Context Free Grammars (CFGs), Python, Transform and compute probability, Thodsaporn C.

Thai Word Embedding

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
kobkrit-word-embedding, Tensorflow implementation of Thai word embedding, Python, Source code, Example, Word distance graph, LGPL, Kobkrit V.

Thai Question Answering (Machine Comprehension)

Service, Description, License, Author & Link
---, ---, ---, ---
Thai Machine Comprehension (ThaiMC), Bidirectional Attention Flow, Copyright (As the service), iApp-AI

Thai Emojification

Service, Description, License, Author & Link
---, ---, ---, ---
Thai Emojification, LSTM, GNU General Public License v3.0, Demo at iApp-AI and Source, Github

Dictionaries / Translation Pairs

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
Transliteration Corpus, 31K pairs, Thai-Eng Translation Pair, CC BY-NC-SA 3.0 TH, NECTEC
LEXiTRON, Thai<->English Dictionary, TH->EN, EN->TH, LEXiTRON License, NECTEC
Yaitron, LEXiTRON in machine readable format (XML), TH->EN, EN->TH, LEXiTRON License, Veer66 Schema, Data & Conversion Code

Downloadable Text Corpus

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
ORCHID, 30K sent., Word Seg., POS Tagged., CC BY-NC-SA 3.0 TH, NECTEC
THAI-NEST, Thai-NEST: Thai Named Entity tagging Specification and Tools, 45K+ Named Entity tokens, Named Entity tagged, GNU Lesser General Public License 2.1, KINDML
InterBEST 2009/2010, 5M words, Word Seg., CC BY-NC-SA 3.0 TH, NECTEC
Thai Wikipedia, Formal Articles, 1.49GB (~213.1 MB compressed), XML, GFDL, WIKIPEDIA
TNC Top-5000 Words, Word frequency, 5,000 words, Frequency of Thai words in various genres, EXCEL, All rights reserved, CHULA
Click Bait Sentences, Thai Click Bait Sentence, 330 sent. (90.7KB), MIT, Wannaphongcom
Thai Sentimental Word List, Thai Sentimental Words List, 52KB, Separated Words as Adj, V, MIT, Wannaphongcom
Prime Minister 29, Prime Minister 29's Speech Sentences, 338KB, Word segmented, Named Entity tagged, MIT, Wannaphongcom
Thai named entity corpora, named entity corpora by Wirote Aroonmanakun's students, 266KB-1.5MB, syllable seg., word seg., Named Entity tagged, GPLv3 (unconfirmed; tltk uses this license), นัชชา ถิระสาโรช Data ศศิวิมล กาลันสีมา Data ณัฐดาพร เลิศชีวะ Data
Thai WordNet, THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร), WordNet, N/A, ธนนท์ หลีน้อย 2008; ปริศนา อัครพุทธิพร Data 2008
Toxicity in Thai Tweet Corpus, Tokyo Metropolitan University Natural Language Processing Group, Each tweet is labeled as toxic or non-toxic, CC BY-NC 4.0, tmu-nlp
thai-jokes-corpus, Cleaned Thai Jokes Corpus, 457 jokes, GNU-GPL3.0, iApp Technology

Web Query Text Corpus

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
Thai National Corpus 2, 32M words, Query text by genre, domain, All rights reserved, CHULA
Thai Medical Document, 3,594 docs, Document and dynamic keyword map, All rights reserved, KINDML, SIIT
Southeast Asian Languages Library, Thai News, Web Text, Pop Music, Literature, Toponyms, 20M chars, Phrase around a search text, SEALang
HSE Thai Corpus, Modern texts written in Thai language (mostly news websites), 50M tokens, Query by word form, lexeme, translation, grammatical attributes, lexical attributes, HSE School of Linguistics

Parallel Corpus

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
TALPCo, TUFS Asian Language Parallel Corpus, 1327 sent, open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English, CC BY 4.0, TALPCo

Pre-trained Word Vectors

Pre-trained Model, Description, Size, Dimensions, License, Link
---, ---, ---, ---, ---, ---
fastText, Skip-Gram model trained on Wikipedia using fastText, 300, CC BY-SA 3.0, Facebook + Bin & Text + Text Only
thai2fit, ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings., 70MB, 300, MIT, thai2vec / pyThaiNLP
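
As a rough illustration of how pre-trained vectors like these are typically consumed, the sketch below parses a fastText-style text file and compares two words by cosine similarity. It assumes the "Text Only" download follows the usual fastText `.vec` layout (a header line with vocabulary size and dimension, then one word and its values per line). The three-dimensional sample vectors and the helper names `load_vec` and `cosine` are made up for the example and do not come from any of the listed models.

```python
import io
import math


def load_vec(stream):
    """Parse fastText-style .vec text: a 'count dim' header, then 'word v1 ... v_dim' lines."""
    _count, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy data standing in for a real 300-dimensional file.
SAMPLE = "3 3\nแมว 1.0 0.0 0.0\nสุนัข 0.8 0.6 0.0\nรถ 0.0 0.0 1.0\n"

vectors = load_vec(io.StringIO(SAMPLE))
print(cosine(vectors["แมว"], vectors["สุนัข"]))  # higher: the toy vectors point in similar directions
print(cosine(vectors["แมว"], vectors["รถ"]))     # lower: the toy vectors are orthogonal
```

For a real download, the same function could be given an open file handle instead, for example `open(path, encoding="utf-8")` with `path` pointing at wherever the file was saved.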

Thai Text Classification Benchmarks
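
The benchmarks in this section report accuracy, micro-F1 and macro-F1. The sketch below is one common way to compute these scores for single-label classification and is not taken from the benchmark code itself. The labels and predictions are invented for illustration, and `f1_scores` is a hypothetical helper name. Micro-F1 pools the counts over all classes, while macro-F1 averages the per-class F1 scores, so rare classes weigh more heavily in the macro figure.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts across classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Invented toy data, not drawn from any corpus listed here.
y_true = ["pos", "pos", "pos", "neg", "neu", "pos"]
y_pred = ["pos", "pos", "neg", "neg", "pos", "pos"]

micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f} micro_f1={micro:.4f} macro_f1={macro:.4f}")
```

When every example has exactly one label, each error counts once as a false positive and once as a false negative, so micro-F1 equals accuracy. The two can differ on multi-label tasks.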

wongnai-corpus

model, micro_f1_public, micro_f1_private
---, ---, ---
ULMFit, 0.59313, 0.60322
fastText, 0.5145, 0.5109
LinearSVC, 0.5022, 0.4976
Kaggle Score, 0.59139, 0.58139
BERT, 0.56612, 0.57057

prachathai-67k: body_text

Model, Macro-accuracy, Macro-F1
---, ---, ---
fastText, 0.9302, 0.5529
LinearSVC, 0.513277, 0.552801
ULMFit, 0.948737, 0.744875

wisesight-sentiment

Model, Public Accuracy, Private Accuracy
---, ---, ---
Logistic Regression, 0.72781, 0.7499
FastText, 0.63144, 0.6131
ULMFit, 0.71259, 0.74194
ULMFit Semi-supervised, 0.73119, 0.75859
ULMFit Semi-supervised Repeated One Time, 0.73372, 0.75968

truevoice-intent: destination

model, accuracy, micro-F1
---, ---, ---
fastText, 0.384116, 0.384116
LinearSVC, 0.807876, 0.327565
ULMFit, 0.834981, 0.834981

Not found? Try to look at another Thai NLP Awesome List/Resource (Like this one)

https://resources.aiat.or.th/

Acknowledgements

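Key metrics

Overview
Name and owner: kobkrit/nlp_thai_resources
Programming languages (count): 0
Platform: Web browsers
Owner activity
Created: 2017-06-18 03:22:10
Last pushed: 2023-04-09 11:28:14
Last commit: 2021-09-21 23:15:51
Releases: 0
User engagement
Stars: 387
Watchers: 39
Forks: 76
Commits: 133
Issues: 7
Open issues: 5
Pull requests: 28
Open pull requests: 2
Closed pull requests: 1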