Kanji usage frequency

Kanji usage frequency data collected from various sources.

Datasets built from various Japanese language corpora

See https://scriptin.github.io/kanji-frequency/ for the dataset description. This readme covers only the technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

See the scripts section in package.json.

Aozora:

  • aozora:download - use crawler/scraper to collect the data
  • aozora:gaiji:extract - extract gaiji notation data from scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
  • aozora:gaiji:replacements - build gaiji replacements file - produces only partial results, which may need to be manually completed
  • aozora:clean - clean the scraped pages (apply gaiji replacements)
  • aozora:count - create the dataset
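
To illustrate what a counting step such as aozora:count might do, here is a minimal sketch. It is not taken from the repository's scripts: the function name countKanji is hypothetical, and treating "kanji" as any character with the Unicode Han script property is an assumption that the real datasets may not share.

```javascript
// Hypothetical helper, not the repository's implementation.
// Assumption: a "kanji" is any character whose Unicode script is Han.
function countKanji(texts) {
  const counts = new Map();
  const isHan = /\p{Script=Han}/u;
  for (const text of texts) {
    // for...of iterates by code point, so characters outside the BMP stay intact
    for (const char of text) {
      if (isHan.test(char)) {
        counts.set(char, (counts.get(char) ?? 0) + 1);
      }
    }
  }
  // Sort by descending count; break ties by code point for a stable order
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].codePointAt(0) - b[0].codePointAt(0),
  );
}

const rows = countKanji(['日本語の本', '日曜日']);
console.log(rows);
// → [ [ '日', 3 ], [ '本', 2 ], [ '曜', 1 ], [ '語', 1 ] ]
```

Hiragana such as の is skipped because it does not match the Han script property. The same kind of counting would apply to the Wikipedia and news corpora.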

Wikipedia:

  • wikipedia:fetch - fetch random pages using MediaWiki API
  • wikipedia:count - create the dataset
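
As a rough illustration of fetching random pages through the MediaWiki API, the sketch below builds a query URL using the API's list=random module. The function name buildRandomPagesUrl, the Japanese Wikipedia endpoint, and the limit of 5 are assumptions made for this example; the repository's scripts may use different parameters.

```javascript
// Hypothetical sketch; endpoint and limit are assumptions, not repository values.
function buildRandomPagesUrl(endpoint, limit) {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    list: 'random',
    rnnamespace: '0', // main (article) namespace only
    rnlimit: String(limit),
  });
  return `${endpoint}?${params}`;
}

const url = buildRandomPagesUrl('https://ja.wikipedia.org/w/api.php', 5);
console.log(url);
// → https://ja.wikipedia.org/w/api.php?action=query&format=json&list=random&rnnamespace=0&rnlimit=5
```

On Node.js 18 or later, the resulting URL could be passed to the built-in fetch function. Pointing the same request at a Wikinews endpoint would be the analogous step for the news corpus.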

News:

  • news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
  • news:count - create the dataset
  • news:dates - create additional file with dates of articles

Building the website

See Astro docs and the scripts section in package.json.

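Key metrics

Overview

  • Name and owner: scriptin/kanji-frequency
  • Primary language: Astro
  • Languages: HTML (language count: 4)
  • License: Creative Commons Attribution 4.0 International

Owner activity

  • Created: 2016-01-24 01:51:10
  • Last push: 2025-04-06 15:08:12
  • Last commit: 2025-04-06 17:08:09
  • Releases: 0

User engagement

  • Stars: 139
  • Watchers: 4
  • Forks: 20
  • Commits: 186
  • Issues: 4
  • Open issues: 0
  • Pull requests: 26
  • Open pull requests: 0
  • Closed pull requests: 0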