Kanji usage frequency

Datasets built from various Japanese-language corpora

https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

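See the scripts section in package.json. Each script named below can be run with npm run followed by the script name, for example npm run aozora:count.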

Aozora:

aozora:download - use crawler/scraper to collect the data
aozora:gaiji:extract - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
aozora:gaiji:replacements - build the gaiji replacements file; this produces only partial results, which may need to be completed manually
aozora:clean - clean the scraped pages (apply gaiji replacements)
aozora:count - create the dataset
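
The clean and count steps can be pictured with a short sketch. This is not the repository's code: the function names, the shape of the replacements map, and the placeholder notation string are all made up for illustration, and "kanji" is approximated here as any character with the Unicode Han script property.

```javascript
// Hypothetical sketch, not the repository's implementation.

// Replace every occurrence of each gaiji notation with a Unicode character.
// Assumed shape of `replacements`: { "<notation>": "<character>" }.
function applyGaijiReplacements(text, replacements) {
  let result = text;
  for (const [notation, character] of Object.entries(replacements)) {
    result = result.split(notation).join(character);
  }
  return result;
}

// Count Han-script characters and return [character, count] pairs,
// most frequent first.
function countKanji(text) {
  const counts = new Map();
  for (const ch of text.match(/\p{Script=Han}/gu) ?? []) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// "※［＃gaiji-1］" is an invented placeholder, not real Aozora notation.
const cleaned = applyGaijiReplacements("森※［＃gaiji-1］外の本と日本", {
  "※［＃gaiji-1］": "鷗",
});
console.log(countKanji(cleaned));
```

The Han script property likely also matches a few characters that are not ideographs in the strict sense (the iteration mark 々, for instance), so a real pipeline may filter more narrowly than this sketch does.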

Wikipedia:

wikipedia:fetch - fetch random pages using MediaWiki API
wikipedia:count - create the dataset
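
As a rough illustration of the fetch step, the MediaWiki API can return random page titles through a list=random query. The sketch below only builds the request URL and reads titles from a response body. The host, the limit, and both function names are assumptions; the README does not say which parameters the actual script uses.

```javascript
// Hypothetical sketch: build a MediaWiki API request for random page titles.
// The host and the limit are assumptions, not values taken from the repository.
function buildRandomPagesUrl(host, limit) {
  const params = new URLSearchParams({
    action: "query",
    list: "random",
    rnnamespace: "0", // main (article) namespace only
    rnlimit: String(limit),
    format: "json",
  });
  return `https://${host}/w/api.php?${params}`;
}

// Pull page titles out of a list=random response body.
function titlesFromResponse(body) {
  return (body?.query?.random ?? []).map((page) => page.title);
}

console.log(buildRandomPagesUrl("ja.wikipedia.org", 10));
```

The resulting URL can be passed to the global fetch available in Node.js 18. Retrieving the text of each page would take a further request, which this sketch leaves out.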

News:

news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
news:count - create the dataset
news:dates - create additional file with dates of articles
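
The README does not describe the format of the dates file. Purely as an illustration of what such a file makes possible, the sketch below tallies articles per year. The input shape (objects with an ISO 8601 date string), the function name, and the sample records are all hypothetical.

```javascript
// Hypothetical sketch: tally articles per year.
// Assumes each record has a `date` string in ISO 8601 form (YYYY-MM-DD).
function countArticlesByYear(articles) {
  const counts = {};
  for (const { date } of articles) {
    const year = date.slice(0, 4);
    counts[year] = (counts[year] ?? 0) + 1;
  }
  return counts;
}

// Invented sample records, not data from the repository.
const sample = [
  { title: "Article A", date: "2020-01-15" },
  { title: "Article B", date: "2020-06-02" },
  { title: "Article C", date: "2021-03-30" },
];
console.log(countArticlesByYear(sample));
```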

Building the website

See Astro docs and the scripts section in package.json.

Name With Owner	scriptin/kanji-frequency
Primary Language	Astro
Program language	HTML (Language Count: 4)
Platform
License:	Creative Commons Attribution 4.0 International

Created At	2016-01-24 01:51:10
Pushed At	2025-10-21 04:40:28
Last Commit At	2025-10-11 00:37:55
Release Count	0

Stargazers Count	150
Watchers Count	3
Fork Count	21
Commits Count	194
Has Issues Enabled
Issues Count	4
Issue Open Count	0
Pull Requests Count	34
Pull Requests Open Count	1
Pull Requests Close Count	0

