Kanji usage frequency

Kanji usage frequency data collected from various sources.

Datasets built from various Japanese language corpora

See https://scriptin.github.io/kanji-frequency/ for the dataset description. This README covers only the technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

See the scripts section in package.json for the available commands, which are grouped below by data source.

Aozora:

  • aozora:download - use a crawler/scraper to collect the data
  • aozora:gaiji:extract - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
  • aozora:gaiji:replacements - build the gaiji replacements file; this produces only partial results, which may need to be completed manually
  • aozora:clean - clean the scraped pages (apply gaiji replacements)
  • aozora:count - create the dataset
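
The cleaning step above applies gaiji replacements to the scraped text. The following is only a minimal sketch of that idea, not the repository's code: the notation shape matched by `GAIJI_NOTATION`, the `replacements` table, and the function name `applyGaijiReplacements` are all assumptions made for illustration. Notations without a known replacement are left in place and reported, which mirrors the note that the replacements file may need manual completion.

```javascript
// Assumed notation shape: ※［＃「description」、source］
// (an assumption for this sketch; the scraped pages may encode gaiji differently).
const GAIJI_NOTATION = /※［＃「([^」]+)」、[^］]*］/g;

// Hypothetical replacement table: gaiji description -> Unicode character.
// The real replacements file may use another format entirely.
const replacements = new Map([['てへん＋劣', '挘']]);

function applyGaijiReplacements(text, table) {
  const unresolved = new Set();
  const cleaned = text.replace(GAIJI_NOTATION, (match, description) => {
    if (table.has(description)) return table.get(description);
    unresolved.add(description); // keep the notation; needs a manual entry
    return match;
  });
  return { cleaned, unresolved: [...unresolved] };
}

// Illustrative input; the second notation is made up to show the unresolved case.
const sample =
  '草を※［＃「てへん＋劣」、第3水準1-84-77］る。※［＃「未登録の例」、出典不明］';
const result = applyGaijiReplacements(sample, replacements);
console.log(result.cleaned);
console.log(result.unresolved);
```

Returning the unresolved descriptions separately makes it easy to see which entries still have to be added by hand before counting.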

Wikipedia:

  • wikipedia:fetch - fetch random pages using the MediaWiki API
  • wikipedia:count - create the dataset
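
Fetching random pages through the MediaWiki API can be sketched as below. This is an illustration under stated assumptions rather than the script the repository actually runs: the host name, the function names `buildRandomPagesUrl` and `fetchRandomTitles`, and the choice of the `list=random` query module are assumptions. It relies on the global `fetch` available in Node.js 18 or later.

```javascript
// Build a MediaWiki API URL that asks for random main-namespace pages.
function buildRandomPagesUrl(host, limit) {
  const url = new URL(`https://${host}/w/api.php`);
  url.search = new URLSearchParams({
    action: 'query',
    format: 'json',
    list: 'random',
    rnnamespace: '0', // namespace 0 = articles
    rnlimit: String(limit),
  }).toString();
  return url.toString();
}

// Fetch the titles of random pages (performs a network request when called).
async function fetchRandomTitles(host, limit) {
  const response = await fetch(buildRandomPagesUrl(host, limit));
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const body = await response.json();
  return body.query.random.map((page) => page.title);
}

console.log(buildRandomPagesUrl('ja.wikipedia.org', 5));
```

Calling `fetchRandomTitles('ja.wikipedia.org', 5)` would return a promise for five page titles. The same request shape should work against other MediaWiki sites by changing the host.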

News:

  • news:wikinews:fetch - fetch random pages from Wikinews using the MediaWiki API
  • news:count - create the dataset
  • news:dates - create an additional file with the dates of the articles
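
Each group above ends with a count script that creates the dataset. A minimal sketch of counting kanji in a set of texts might look like the following. It assumes that "kanji" means any character in the Unicode Han script and that each row holds a character, its count, and its share of the total; the repository's own definition and output format may differ, and `countKanji` is a hypothetical name.

```javascript
// Assumption: treat every character with Script=Han as a kanji.
const HAN = /\p{Script=Han}/gu;

function countKanji(texts) {
  const counts = new Map();
  for (const text of texts) {
    for (const [char] of text.matchAll(HAN)) {
      counts.set(char, (counts.get(char) ?? 0) + 1);
    }
  }
  const total = [...counts.values()].reduce((sum, n) => sum + n, 0);
  // Most frequent first; each row carries the count and its share of all kanji.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([char, count]) => ({ char, count, share: count / total }));
}

const rows = countKanji(['日本語の本', '日曜日']);
console.log(rows);
```

Hiragana such as の is skipped by the pattern, so only the Han characters contribute to the totals.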

Building the website

See Astro docs and the scripts section in package.json.

Main metrics

Overview
Name With Owner: scriptin/kanji-frequency
Primary Language: Astro
Program language: HTML (Language Count: 4)
Platform
License: Creative Commons Attribution 4.0 International
Owner activity
Created At: 2016-01-24 01:51:10
Pushed At: 2025-04-06 15:08:12
Last Commit At: 2025-04-06 17:08:09
Release Count: 0
User engagement
Stargazers Count: 140
Watchers Count: 4
Fork Count: 20
Commits Count: 186
Has Issues Enabled
Issues Count: 4
Issue Open Count: 0
Pull Requests Count: 26
Pull Requests Open Count: 0
Pull Requests Close Count: 0
Project settings
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private