Jargon

Jargon is a lemmatizer, useful for recognizing variations on canonical and synonymous terms.

For example, jargon lemmatizes react, React.js, React JS and REACTJS to a canonical reactjs.

Jargon uses Stack Overflow tags & synonyms, and implements “insensitivity” to spaces, dots and dashes.
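
The "insensitivity" idea can be illustrated with a minimal sketch. This is not jargon's actual implementation: the `normalize` helper and the tiny synonym table below are hypothetical, and only show how ignoring case, spaces, dots and dashes lets several spellings reach the same canonical term.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases s and drops spaces, dots and dashes.
// Hypothetical helper for illustration; not part of jargon's API.
func normalize(s string) string {
	replacer := strings.NewReplacer(" ", "", ".", "", "-", "")
	return strings.ToLower(replacer.Replace(s))
}

func main() {
	// Toy synonym table (an assumption); jargon draws its data from
	// Stack Overflow tags and synonyms.
	synonyms := map[string]string{
		"react":   "reactjs",
		"reactjs": "reactjs",
	}

	for _, term := range []string{"react", "React.js", "React JS", "REACTJS"} {
		if canonical, ok := synonyms[normalize(term)]; ok {
			fmt.Printf("%q -> %s\n", term, canonical)
		}
	}
}
```

Normalizing the lookup key rather than the stored text keeps the table small: one entry covers every spacing and punctuation variant.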

Online demo

Command line

go install github.com/clipperhouse/jargon/cmd/jargon@latest

(Assumes a Go installation.)

To display usage, simply type:

jargon

jargon accepts piped UTF8 text from Stdin and pipes lemmatized text to Stdout

  Example: echo "I luv Rails" | jargon

Alternatively, use jargon 'standalone' by passing flags for inputs and outputs:

  -f string
    	Input file path
  -o string
    	Output file path
  -s string
    	A (quoted) string to lemmatize
  -u string
    	A URL to fetch and lemmatize

  Example: jargon -f /path/to/original.txt -o /path/to/lemmatized.txt

In your code

See GoDoc.

Dictionaries

Canonical terms (lemmas) are looked up in dictionaries. Three are available:

Stack Exchange technology tags
- Ruby on Rails → ruby-on-rails
- ObjC → objective-c
Contractions
- Couldn’t → Could not
Simple numbers
- Thirty-five hundred → 3500

To implement your own, see the jargon.Dictionary interface
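
As a rough idea of what a custom dictionary might look like, here is a self-contained sketch. The `lookuper` interface and `mapDictionary` type are hypothetical stand-ins; the real jargon.Dictionary signature may differ, so check the GoDoc before adapting this.

```go
package main

import (
	"fmt"
	"strings"
)

// lookuper is a hypothetical stand-in for a dictionary interface.
// It is NOT copied from jargon; the real signature may differ.
type lookuper interface {
	Lookup(term string) (canonical string, found bool)
}

// mapDictionary maps lowercased terms to their canonical forms.
type mapDictionary map[string]string

// Lookup performs a case-insensitive lookup of term.
func (d mapDictionary) Lookup(term string) (string, bool) {
	canonical, found := d[strings.ToLower(term)]
	return canonical, found
}

func main() {
	// Toy entries, taken from the examples in this README.
	var dict lookuper = mapDictionary{
		"objc":          "objective-c",
		"ruby on rails": "ruby-on-rails",
	}

	for _, term := range []string{"ObjC", "Ruby on Rails", "Haskell"} {
		if canonical, found := dict.Lookup(term); found {
			fmt.Printf("%s -> %s\n", term, canonical)
		} else {
			fmt.Printf("%s -> (no entry)\n", term)
		}
	}
}
```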

Tokenizer

Jargon includes its own tokenizer, with an emphasis on handling technology terms correctly:

C++, ASP.net, and other non-alphanumeric terms are recognized as single tokens
#hashtags and @handles
Simple URLs and email addresses are handled pretty well, though they can be notoriously hard to get right

The tokenizer preserves all tokens verbatim, including whitespace and punctuation, so the original text can be reconstructed with fidelity (“round tripped”).

(It turns out that the above rules work well in structured text such as CSV and JSON.)

Background

When dealing with technology terms in text – say, a job listing or a resume –
it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious
what the canonical term is. Is it React or reactjs or react.js?

This presents a problem when searching for such terms. We know the above terms are synonymous but databases don’t.

A further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents
one technology, but databases naively see two words.

Prior art

Existing tokenizers (such as Treebank) appear not to be round-trippable, i.e., they are destructive. They also take a hard line on punctuation, so “ASP.net” would come out as two tokens instead of one. Of course, I’d like to be corrected or pointed to other implementations.

Search-oriented databases like Elastic handle synonyms with analyzers.

In NLP, it’s handled by stemmers or lemmatizers. There, the goal is to replace variations of a term (manager, management, managing) with a single canonical version.

Recognizing multiple words as a single term (“Ruby on Rails”) is named-entity recognition.

What’s it for?

Recognition of domain terms in text
NLP for unstructured data, when we wish to ensure a consistent vocabulary for statistical analysis.
Search applications, where a search for “Ruby on Rails” is understood as one entity instead of three unrelated words, or where “React”, “reactjs” and “react.js” are handled synonymously.

Name and owner	clipperhouse/jargon
Primary language	Go
Languages	Go (language count: 2)
License	MIT License

Created	2018-05-07 20:38:20
Pushed	2025-09-02 22:11:17
Last commit	2025-09-02 18:11:13
Releases	25
Latest release	v1.0.9 (released 2022-06-11 21:59:20)
First release	v0.9.0 (released 2020-02-20 11:53:02)

Stars	110
Watchers	3
Forks	3
Commits	431
Issues	9
Open issues	0
Pull requests	3
Open pull requests	6
Closed pull requests	0
