Thai Natural Language Processing (Thai NLP) Resource
A collection of Thai natural language processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.
Thai NLP Libraries/Services
Thai Character Cluster
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
JTCC, Thai Character Cluster, Java, GPL-3.0, Wittawat
TCC, Thai Character Cluster, Python, Apache 2.0, Wannaphong
Thai Sentiment Analysis
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
sentiment_analysis_thai, JagerV3
Thai Soundex
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
LK82 + Udom83, Thai Soundex, Python, Korakot
Word Segmentation
Part of Speech Tagging (POS Tagging)
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Jitar+NAiST, A simple Trigram HMM part-of-speech tagger, Java, Veer66, Jitar + NAiST, 1 + NAiST, 2
SynThai, Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM., Python, 0.9163 F-measure. RNN. LSTM, MIT, KenjiroAI, github
Chart-POS, Thai POS Tagger, C, All rights reserved, AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), Thodsaporn C., Demo at iApp
Named Entity Recognition
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Named Entity Tagging (Thai NEST), Thai Named Entity tagging Specification and Tools, GPL, KINDML, SIIT, AIAT
ThaiNER, Thai Named Entity Recognition for PyThaiNLP, Python, Apache 2.0 (code) & CC BY 3.0 (Dataset), ThaiNER
News Structure Tagging
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
News Structure Tagging Program, Thai News Structure Tagging Program, Metadata tagging, Structure tagging, Automatic News Title Generation, GPL, AIAT
Syntactic Parsing & Tools
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
Chart-parser, Extract Syntactic Structure from POS Tagged Sentence., C, All rights reserved, AIAT, KINDML, Thanaruk T. (thanaruk@siit.tu.ac.th), Thodsaporn C., Demo at iApp
Grammar Processing, Labelled Brackets -> Context Free Grammars (CFGs), Python, Transform and compute probability, Thodsaporn C.
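
The Grammar Processing entry above turns labelled brackets into context-free grammar rules and computes probabilities. As a rough illustration of that idea (not the listed tool's actual code), the sketch below reads bracketed trees, collects one rule per internal node, and estimates each rule's probability by relative frequency. The function names `parse_brackets`, `productions`, and `estimate_pcfg`, the bracket syntax, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter


def parse_brackets(text):
    """Parse a labelled bracketing such as "(S (NP a) (VP b))" into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first tree")
    return tree


def productions(tree):
    """Yield one (lhs, rhs) rule for every internal node of the tree."""
    label, children = tree
    # Note: terminals and non-terminals are not distinguished in the rhs tuple.
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield label, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


def estimate_pcfg(bracketed_sentences):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for text in bracketed_sentences:
        for lhs, rhs in productions(parse_brackets(text)):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


if __name__ == "__main__":
    # Toy treebank, invented for illustration only.
    treebank = [
        "(S (NP ฉัน) (VP (V กิน) (NP ข้าว)))",
        "(S (NP แมว) (VP (V นอน)))",
    ]
    for (lhs, rhs), prob in sorted(estimate_pcfg(treebank).items()):
        print(lhs, "->", " ".join(rhs), round(prob, 3))
```

Relative-frequency estimation is the simplest choice here; a real treebank would usually also need smoothing and handling for unknown words.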
Thai Word Embedding
Library, Description, Programming Languages, Features, License, Author & Link
---, ---, ---, ---, ---, ---
kobkrit-word-embedding, Tensorflow implementation of Thai word embedding, Python, Source code, Example, Word distance graph, LGPL, Kobkrit V.
Thai Question Answering (Machine Comprehension)
Service, Description, License, Author & Link
---, ---, ---, ---
Thai Machine Comprehension (ThaiMC), Bidirectional Attention Flow, Copyright (As the service), iApp-AI
Thai Emojification
Service, Description, License, Author & Link
---, ---, ---, ---
Thai Emotification, LSTM, GNU General Public License v3.0, Demo at iApp-AI and Source, Github
Dictionaries / Translation Pairs
Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
Transliteration Corpus, 31K pairs, Thai-Eng Translation Pair, CC BY-NC-SA 3.0 TH, NECTEC
LEXiTRON, Thai<->English Dictionary, TH->EN, EN->TH, LEXiTRON License, NECTEC
Yaitron, LEXiTRON in machine readable format (XML), TH->EN, EN->TH, LEXiTRON License, Veer66 Schema, Data & Conversion Code
Downloadable Text Corpus
Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
ORCHID, 30K sent., Word Seg., POS Tagged., CC BY-NC-SA 3.0 TH, NECTEC
THAI-NEST, Thai-NEST: Thai Named Entity tagging Specification and Tools, 45K+ named entity tokens, Named entity tagged, GNU Lesser General Public License 2.1, KINDML
InterBEST 2009/2010, 5M words, Word Seg., CC BY-NC-SA 3.0 TH, NECTEC
Thai Wikipedia, Formal Articles, 1.49GB (~213.1 MB compressed), XML, GFDL, WIKIPEDIA
TNC Top-5000 Words, Word frequency, 5,000 words, Frequency of Thai words in various genres, EXCEL, All rights reserved, CHULA
Click Bait Sentences, Thai Click Bait Sentence, 330 sent. (90.7KB), MIT, Wannaphongcom
Thai Sentimental Word List, Thai Sentimental Words List, 52KB, Separated Words as Adj, V, MIT, Wannaphongcom
Prime Minister 29, Prime Minister 29's Speech Sentences, 338KB, Word segmented, Named entity tagged, MIT, Wannaphongcom
Thai named entity corpora, Named entity corpora by Wirote Aroonmanakun's students, 266KB-1.5MB, Syllable seg., word seg., named entity tagged, GPLv3 (unconfirmed; tltk uses this license), นัชชา ถิระสาโรช Data ศศิวิมล กาลันสีมา Data ณัฐดาพร เลิศชีวะ Data
Thai WordNet, THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร), WordNet, N/A, ธนนท์ หลีน้อย 2008; ปริศนา อัครพุทธิพร Data 2008
Toxicity in Thai Tweet Corpus, Tokyo Metropolitan University Natural Language Processing Group, Each tweet is labeled as toxic or non-toxic, CC BY-NC 4.0, tmu-nlp
thai-jokes-corpus, Cleaned Thai Jokes Corpus, 457 jokes, GNU-GPL3.0, iApp Technology
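
The Thai Wikipedia dump listed above is a single large XML file, so it is usually easier to stream it than to load it whole. The sketch below shows one way to do that with the Python standard library, assuming the usual MediaWiki export layout of `<page>` elements that contain a `<title>` and a `<revision><text>`. The helper names `local_name` and `iter_pages` and the in-memory sample are made up for this example; the real dump may differ in namespace version and extra fields.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the "{namespace}" prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs from a MediaWiki-style XML export, one page at a time."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the page's children to keep memory use low


if __name__ == "__main__":
    # Tiny stand-in for a real dump, invented for illustration only.
    sample = (
        '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        "<page><title>แมว</title><revision><text>แมวเป็นสัตว์</text></revision></page>"
        "<page><title>หมา</title><revision><text>หมาเป็นสัตว์</text></revision></page>"
        "</mediawiki>"
    ).encode("utf-8")
    for page_title, page_text in iter_pages(io.BytesIO(sample)):
        print(page_title, len(page_text))
```

For a compressed dump, the same generator can be fed a binary file object opened with the standard `bz2.open(path, "rb")`; the path itself is whatever location the dump was saved to.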
Web Query Text Corpus
Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
Thai National Corpus 2, 32M words, Query text by genre, domain, All rights reserved, CHULA
Thai Medical Document, 3,594 docs, Document and dynamic keyword map, All rights reserved, KINDML, SIIT
Southeast Asian Languages Library, Thai News, Web Text, Pop Music, Literature, Toponyms, 20M chars, Phrase around a search text, SEALang
HSE Thai Corpus, Modern texts written in the Thai language (mostly news websites), 50M tokens, Query by word form, lexeme, translation, grammatical attributes, lexical attributes, HSE School of Linguistics
Parallel Corpus
Library, Description, Size, Features, License, Link
---, ---, ---, ---, ---, ---
TALPCo, TUFS Asian Language Parallel Corpus, 1,327 sent., Open parallel corpus consisting of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese and English, CC BY 4.0, TALPCo
Pre-trained Word Vectors
Pre-trained Model, Description, Size, Dimensions, License, Link
---, ---, ---, ---, ---, ---
fastText, Skip-Gram model trained on Wikipedia using fastText, 300, CC BY-SA 3.0, Facebook + Bin & Text + Text Only
thai2fit, ULMFiT on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings., 70MB, 300, MIT, thai2vec / PyThaiNLP
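
Text-format word vectors such as the fastText "Text Only" download above can be read without any extra library. The sketch below assumes the common fastText `.vec` layout: a header line with the vocabulary size and dimension, then one line per word followed by its space-separated values. The names `load_vec` and `cosine` and the three-dimensional sample are invented for this example; real files have 300 dimensions, and the `limit` argument is just a convenience for loading a prefix of a large file.

```python
import io
import math


def load_vec(stream, limit=None):
    """Read word vectors from a fastText-style text stream into a dict."""
    header = stream.readline().split()
    if len(header) != 2:
        raise ValueError("expected a '<count> <dim>' header line")
    dim = int(header[1])
    vectors = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so exactly `dim` values are taken as the vector.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real .vec file, invented for illustration only.
    sample = io.StringIO("2 3\nแมว 1.0 0.0 0.0\nหมา 1.0 1.0 0.0\n")
    vectors = load_vec(sample)
    print(cosine(vectors["แมว"], vectors["หมา"]))
```

With a downloaded file, the same function can be called on `open(path, encoding="utf-8")`, where `path` is wherever the vectors were saved.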