Kanji usage frequency

Datasets built from various Japanese language corpora

For the dataset description, see the website: https://scriptin.github.io/kanji-frequency/. This README covers only the technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

See the scripts section in package.json. Each script listed below is run by name, for example with npm run aozora:download.

Aozora:

  • aozora:download - use crawler/scraper to collect the data
  • aozora:gaiji:extract - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that the documents replace with images because the Shift-JIS encoding cannot represent them
  • aozora:gaiji:replacements - build the gaiji replacements file. The output is only partial and may need to be completed manually
  • aozora:clean - clean the scraped pages (apply gaiji replacements)
  • aozora:count - create the dataset
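
The cleaning step above can be pictured with a small sketch. It is not the repository's implementation: it assumes gaiji show up in the text as annotations of the form ※［＃...］, and the function name, the lookup table, and its single entry are hypothetical, chosen only for illustration. Notations missing from the table are left in place and reported, which mirrors why the replacements file may need manual completion.

```javascript
// Assumed annotation form: ※［＃...］ with full-width brackets.
const GAIJI_PATTERN = /※［＃[^］]*］/g;

// Replace every known gaiji notation; collect the ones with no replacement.
function applyGaijiReplacements(text, replacements) {
  const unresolved = new Set();
  const cleaned = text.replace(GAIJI_PATTERN, (notation) => {
    if (replacements.has(notation)) return replacements.get(notation);
    unresolved.add(notation); // needs a manual entry in the table
    return notation; // leave unknown notations untouched
  });
  return { cleaned, unresolved: [...unresolved] };
}

// Hypothetical, hand-written table with one illustrative entry.
const table = new Map([['※［＃「魚＋花」］', '𩸽']]);

const result = applyGaijiReplacements(
  '焼いた※［＃「魚＋花」］と※［＃「木＋世」］',
  table,
);
console.log(result.cleaned);
console.log(result.unresolved);
```

Returning the unresolved notations alongside the cleaned text makes it easy to see which table entries still have to be added by hand.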

Wikipedia:

  • wikipedia:fetch - fetch random pages using MediaWiki API
  • wikipedia:count - create the dataset

News:

  • news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
  • news:count - create the dataset
  • news:dates - create additional file with dates of articles
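
The count scripts in the lists above produce the datasets. A minimal sketch of that kind of counting follows, under stated assumptions: it treats any character with the Unicode Han script property as kanji (which is broader than some definitions, since it also covers marks such as 々), and both the function name and the [character, count] output shape are hypothetical rather than the repository's actual format.

```javascript
// Unicode property escape: matches a single character in the Han script.
const HAN = /\p{Script=Han}/u;

// Count Han characters across texts; return [char, count] pairs,
// most frequent first, ties ordered by character for stable output.
function countKanji(texts) {
  const counts = new Map();
  for (const text of texts) {
    for (const char of text) { // for...of iterates by code point
      if (HAN.test(char)) counts.set(char, (counts.get(char) ?? 0) + 1);
    }
  }
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1),
  );
}

const ranked = countKanji(['日本語の日', '本を読む']);
console.log(ranked);
```

Iterating with for...of rather than by index keeps characters outside the Basic Multilingual Plane intact, which matters for rare kanji.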

Building the website

See the Astro docs and the scripts section in package.json.

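Key metrics

Overview
Name and owner: scriptin/kanji-frequency
Primary language: Astro
Languages: HTML (4 languages in total)
Platform
License: Creative Commons Attribution 4.0 International
Owner activity
Created: 2016-01-24 01:51:10
Pushed: 2025-04-06 15:08:12
Last commit: 2025-04-06 17:08:09
Releases: 0
User engagement
Stars: 140
Watchers: 4
Forks: 20
Commits: 186
Issues enabled?
Issues: 4
Open issues: 0
Pull requests: 26
Open pull requests: 0
Closed pull requests: 0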