nlp_thai_resources

A collection of more than 50 Thai natural language processing libraries. Updated daily.


Thai Natural Language Processing (Thai NLP) Resource

A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.

Thai NLP Libraries/Services

Thai Character Cluster

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
JTCC, Thai Character Cluster, Java, GPL-3.0, Wittawat
TCC, Thai Character Cluster, Python, Apache 2.0, Wannaphong

Thai Sentiment Analysis

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
sentiment_analysis_thai, JagerV3

Thai Soundex

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
LK82 + Udom83, Thai Soundex, Python, Korakot

Word Segmentation

Part of Speech Tagging (POS Tagging)

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Jitar+NAiST, A simple Trigram HMM part-of-speech tagger, Java, Veer66, Jitar + NAiST, 1 + NAiST, 2
SynThai, Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM., Python, 0.9163 F-measure. RNN. LSTM, MIT, KenjiroAI, github
Chart-POS, Thai POS Tagger, C, All rights reserved, AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), Thodsaporn C., Demo at iApp

Named Entity Recognition

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Named Entity Tagging (Thai NEST), Thai Named Entity tagging Specification and Tools, GPL, KINDML, SIIT, AIAT
ThaiNER, Thai Named Entity Recognition for PyThaiNLP, Python, Apache 2.0 (code) & CC BY 3.0 (Dataset), ThaiNER

News Structure Tagging

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
News Structure Tagging Program, Thai News Structure Tagging Program, Metadata tagging, Structure tagging, Automatic News Title Generation, GPL, AIAT

Syntactic Parsing & Tools

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Chart-parser, Extract Syntactic Structure from POS Tagged Sentence., C, All rights reserved, AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), Thodsaporn C., Demo at iApp
Grammar Processing, Labelled Brackets -> Context Free Grammars (CFGs), Python, Transform and compute probability, Thodsaporn C.

Thai Word Embedding

Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
kobkrit-word-embedding, Tensorflow implementation of Thai word embedding, Python, Source code, Example, Word distance graph, LGPL, Kobkrit V.

Thai Question Answering (Machine Comprehension)

Service, Description, License, Author & Link
---, ---, ---, ---
Thai Machine Comprehension (ThaiMC), Bidirectional Attention Flow, Copyright (As the service), iApp-AI

Thai Emojification

Service, Description, License, Author & Link
---, ---, ---, ---
Thai Emotification, LSTM, GNU General Public License v3.0, Demo at iApp-AI and Source, Github

Dictionaries / Translation Pairs

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
Transliteration Corpus, 31K pairs, Thai-Eng Translation Pair, CC BY-NC-SA 3.0 TH, NECTEC
LEXiTRON, Thai<->English Dictionary, TH->EN, EN->TH, LEXiTRON License, NECTEC
Yaitron, LEXiTRON in machine readable format (XML), TH->EN, EN->TH, LEXiTRON License, Veer66 Schema, Data & Conversion Code

Downloadable Text Corpus

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
ORCHID, 30K sent., Word Seg., POS Tagged., CC BY-NC-SA 3.0 TH, NECTEC
THAI-NEST, Thai-NEST: Thai Named Entity tagging Specification and Tools, 45K+ Name Entity Token, Name Entity Tagged, GNU Lesser General Public 2.1, KINDML
InterBEST 2009/2010, 5M words, Word Seg., CC BY-NC-SA 3.0 TH, NECTEC
Thai Wikipedia, Formal Articles, 1.49GB (~213.1 MB compressed), XML, GFDL, WIKIPEDIA
TNC Top-5000 Words, Word frequency, 5,000 words, Frequency of Thai words in various genres, EXCEL, All rights reserved, CHULA
Click Bait Sentences, Thai Click Bait Sentence, 330 sent. (90.7KB), MIT, Wannaphongcom
Thai Sentimental Word List, Thai Sentimental Words List, 52KB, Separated Words as Adj, V, MIT, Wannaphongcom
Prime Minister 29, Prime Minister 29's Speech Sentences, 338KB, Word segged, Name Entity Tagged, MIT, Wannaphongcom
Thai named entity corpora, named entity corpora by Wirote Aroonmanakun's students, 266KB-1.5MB, syllable seg., word seg., Named Entity tagged, GPLv3 (not sure, but tltk is using this license), นัชชา ถิระสาโรช Data; ศศิวิมล กาลันสีมา Data; ณัฐดาพร เลิศชีวะ Data
Thai WordNet, THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร), WordNet, N/A, ธนนท์ หลีน้อย 2008; ปริศนา อัครพุทธิพร Data 2008
Toxicity in Thai Tweet Corpus, Tokyo Metropolitan University Natural Language Processing Group, Each tweet is labeled as toxic or non-toxic, CC BY-NC 4.0, tmu-nlp
thai-jokes-corpus, Cleaned Thai Jokes Corpus, 457 jokes, GNU-GPL3.0, iApp Technology

Web Query Text Corpus

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
Thai National Corpus 2, 32M words, Query text by genre, domain, All rights reserved, CHULA
Thai Medical Document, 3,594 docs, Document and dynamic keyword map, All rights reserved, KINDML, SIIT
Southeast Asian Languages Library, Thai News, Web Text, Pop Music, Literature, Toponyms, 20M chars, Phrase around a search text, SEALang
HSE Thai Corpus, Modern texts written in Thai language (mostly news websites), 50M tokens, Query by word form, lexeme, translation, grammatical attributes, lexical attributes, HSE School of Linguistics

Parallel Corpus

Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
TALPCo, TUFS Asian Language Parallel Corpus, 1327 sent, open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English, CC BY 4.0, TALPCo

Pre-trained Word Vectors

Pre-trained Model, Description, Size, Dimensions, License, Link
---, ---, ---, ---, ---, ---
fastText, Skip-Gram model trained on Wikipedia using fastText, 300, CC BY-SA 3.0, Facebook + Bin & Text + Text Only
thai2fit, ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings., 70MB, 300, MIT, thai2vec / pyThaiNLP
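
Pre-trained vectors such as the fastText entry above are commonly distributed as a plain-text `.vec` file: a header line with the word count and dimension, then one word per line followed by its values. The sketch below shows one way to read that layout and compare two words by cosine similarity. The three-dimensional sample data, the helper names `load_vec` and `cosine`, and the numbers themselves are illustrative assumptions, not values from any of the listed models.

```python
import io
import math


def load_vec(handle):
    """Parse word2vec-style text vectors: a 'count dim' header, then 'word v1 v2 ...' lines."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            continue  # skip malformed lines instead of failing
        vectors[word] = values
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data standing in for a real file (hypothetical values, 3 dimensions instead of 300).
SAMPLE = "3 3\nแมว 1.0 0.0 0.0\nสุนัข 0.9 0.1 0.0\nรถ 0.0 0.0 1.0\n"

vectors = load_vec(io.StringIO(SAMPLE))
cat_dog = cosine(vectors["แมว"], vectors["สุนัข"])
cat_car = cosine(vectors["แมว"], vectors["รถ"])
print(f"cat~dog: {cat_dog:.3f}, cat~car: {cat_car:.3f}")
```

For a real download, replace the `io.StringIO` object with a file opened using `encoding="utf-8"`. A 300-dimensional model can be large, so reading only the first few thousand lines is a reasonable way to experiment.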

Thai Text Classification Benchmarks
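
The benchmarks in this section report accuracy, micro-averaged F1, and macro-averaged F1. As a rough guide to reading them, the sketch below computes all three from scratch, assuming a single-label multiclass task (each example has exactly one gold label and one predicted label). In that setting micro-F1 equals accuracy, while macro-F1 weights every class equally and so drops when rare classes are missed. The function name `f1_scores` and the toy labels are illustrative assumptions, not part of any benchmark's evaluation code.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return accuracy, micro-F1 and macro-F1 for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return {"accuracy": accuracy, "micro_f1": micro, "macro_f1": macro}


# Toy labels (hypothetical): the rare class "neu" is never predicted correctly.
y_true = ["pos", "pos", "pos", "neg", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "pos"]

scores = f1_scores(y_true, y_pred)
print(scores)
```

Multi-label tasks, where one document can carry several tags, do not share the micro-F1 = accuracy property, so this sketch does not apply to them unchanged.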

wongnai-corpus

model, micro_f1_public, micro_f1_private
---, ---, ---
ULMFit, 0.59313, 0.60322
fastText, 0.5145, 0.5109
LinearSVC, 0.5022, 0.4976
Kaggle Score, 0.59139, 0.58139
BERT, 0.56612, 0.57057

prachathai-67k: body_text

Model, Macro-accuracy, Macro-F1
---, ---, ---
fastText, 0.9302, 0.5529
LinearSVC, 0.513277, 0.552801
ULMFit, 0.948737, 0.744875

wisesight-sentiment

Model, Public Accuracy, Private Accuracy
---, ---, ---
Logistic Regression, 0.72781, 0.7499
FastText, 0.63144, 0.6131
ULMFit, 0.71259, 0.74194
ULMFit Semi-supervised, 0.73119, 0.75859
ULMFit Semi-supervised Repeated One Time, 0.73372, 0.75968

truevoice-intent: destination

model, accuracy, micro-F1
---, ---, ---
fastText, 0.384116, 0.384116
LinearSVC, 0.807876, 0.327565
ULMFit, 0.834981, 0.834981

Not found? Try to look at another Thai NLP Awesome List/Resource (Like this one)

https://resources.aiat.or.th/

Acknowledgements
