makemeahanzi

Free, open-source Chinese character data

[Image: Make Me a Hanzi annotation tool]

[Image: Make Me a Hanzi Demo]

New: Inkstone Chinese writing app

New: No more cut-off strokes (due to @chanind)!

Make Me a Hanzi provides dictionary and graphical data for over 9000 of the
most common simplified and traditional Chinese characters. Among other things,
this data includes stroke-order vector graphics for all these characters. You
can see the project output at the demo site
where you can look up characters by drawing them. You can also download the
data for use in your own site or app.

See the project site for general
information and updates on the project.

Make Me a Hanzi data is split into two data files,
dictionary.txt
and graphics.txt,
because the sources that the files are derived from have different licenses.
In addition, we provide an experimental tarball of animated SVGs,
svgs.tar.gz,
which is licensed the same way as graphics.txt.
See the Sources section and the
COPYING
file for more information.

Sources

This project would not have been possible without the generosity of
Arphic Technology, a Taiwanese font forge that
released their work under a permissive license in 1999.

In addition, I would like to thank Gábor Ugray for his thoughtful advice on
the project and for verifying stroke data for most of the traditional
characters in the two data sets. Gábor maintains Zydeo,
a free and open-source Chinese dictionary.

Format

Both dictionary.txt and graphics.txt are '\n'-separated lists of lines, where
each line is a JSON object. They differ in which keys are present, but the
common key, 'character', can be used to join the two data sets. You can also
rely on the fact that the two files will always come in the same order.

dictionary.txt keys:

  • character: The Unicode character for this glyph. Required.

  • definition: A String definition targeted towards second-language
    learners. Optional.

  • pinyin: A comma-separated list of String pronunciations of this character.
    Required, but may be empty.

  • decomposition: An Ideograph Description Sequence
    decomposition of the character. Required, but invalid if it starts with a
    full-width question mark '?'.

    Note that even if the first character is a
    proper IDS symbol, any component within the decomposition may be a wide
    question mark as well. For example, if we have a decomposition of a
    character into a top and bottom component but can only recognize the top
    component, we might have a decomposition like so: '⿱逢?'

  • etymology: An etymology for the character. This field may be null. If
    present, it will always have a "type" field, which will be one of
    "ideographic", "pictographic", or "pictophonetic".
    If the type is one of the first two options, then the etymology will
    always include a string "hint" field explaining its formation.

    If the type is "pictophonetic", then the etymology will contain three
    other fields: "hint", "phonetic", and "semantic", each of which is
    a string and each of which may be null. The etymology should be read as:
    "${semantic} (${hint}) provides the meaning while ${phonetic} provides
    the pronunciation", with allowances for possible null values.

  • radical: The Unicode primary radical for this character. Required.

  • matches:
    A list of mappings from strokes of this character to strokes of its
    components, as indexed in its decomposition tree. Any given entry in
    this list may be null. If an entry is not null, it will be a list of
    indices corresponding to a path down the decomposition tree.

    This schema is a little tricky to explain without an example. Suppose
    that the character '俢' has the decomposition: '⿰亻⿱夂彡'

    The third stroke in that character belongs to the component '夂'.
    Its match would be [1, 0]. That is, if you think of the decomposition as
    a tree, it has '⿰' at its root with two children '亻' and '⿱', and
    '⿱' further has two children '夂' and '彡'. The path down the tree
    to '夂' is to take the second child of '⿰' and the first of '⿱',
    hence, [1, 0].

    This field can be used to generate visualizations marking each component
    within a given character, or potentially for more exotic purposes.
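The path-down-the-tree scheme above can be sketched in code. The following snippet (parseIDS and componentAt are illustrative names, and the arity table is a simplifying assumption covering only the common IDS operators) parses the '俢' decomposition from the example and resolves a match path to its component:

```javascript
// Arity of common Ideograph Description Characters: ⿲ and ⿳ take three
// components; the rest shown here take two. Anything else (including the
// full-width '?' placeholder) is treated as a leaf.
const ARITY = {'⿰': 2, '⿱': 2, '⿴': 2, '⿵': 2, '⿶': 2, '⿷': 2,
               '⿸': 2, '⿹': 2, '⿺': 2, '⿻': 2, '⿲': 3, '⿳': 3};

// Parse an IDS string (given as an array of characters, e.g. via
// Array.from) into a tree of {value, children} nodes.
function parseIDS(characters) {
  const value = characters.shift();
  const children = [];
  for (let i = 0; i < (ARITY[value] || 0); i++) {
    children.push(parseIDS(characters));
  }
  return {value, children};
}

// Follow a 'matches' entry (a list of child indices) down the tree.
function componentAt(tree, path) {
  return path.reduce((node, index) => node.children[index], tree).value;
}

const tree = parseIDS(Array.from('⿰亻⿱夂彡'));
componentAt(tree, [1, 0]);  // → '夂'
```

With this in hand, coloring each stroke by the component it matches is a simple lookup per stroke.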

graphics.txt keys:

  • character: The Unicode character for this glyph. Required.

  • strokes:
    List of SVG path data for each stroke of this character, ordered by
    proper stroke order. Each stroke is laid out on a 1024x1024
    coordinate system where:

    • The upper-left corner is at position (0, 900).
    • The lower-right corner is at position (1024, -124).

    Note that the y-axis DECREASES as you move downwards, which is strange!
    To display these paths properly, you should render them as follows:

    <svg viewBox="0 0 1024 1024">
      <g transform="scale(1, -1) translate(0, -900)">
        <path d="STROKE[0] DATA GOES HERE"></path>
        <path d="STROKE[1] DATA GOES HERE"></path>
        ...
      </g>
    </svg>
    
  • medians:
    A list of stroke medians, in the same coordinate system as the SVG
    paths above. These medians can be used to produce a rough stroke-order
    animation, although it is a bit tricky. Each median is a list of pairs
    of integers. This list will be as long as the strokes list.
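One hedged sketch of using the medians: estimate each stroke's length from its median polyline, which is the number you need for the common stroke-dasharray/stroke-dashoffset CSS animation trick. The function name below is illustrative, not part of the data format:

```javascript
// Approximate a stroke's length by summing the segment lengths of its
// median, a list of [x, y] integer pairs in the same coordinate system
// as the SVG paths. This drives per-stroke animation durations.
function medianLength(median) {
  let total = 0;
  for (let i = 1; i < median.length; i++) {
    const [x0, y0] = median[i - 1];
    const [x1, y1] = median[i];
    total += Math.hypot(x1 - x0, y1 - y0);
  }
  return total;
}

medianLength([[0, 0], [300, 400]]);  // → 500
```

Setting each stroke's animation duration proportional to its median length makes fast and slow strokes animate at a consistent pen speed.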

TODOs and Future Work

  • As an experimental next step, we have produced an animated SVG image for
    each character that we have data for (see the svgs directory). The SVGs are
    named by the Unicode codepoint of the character they correspond to.
    Using JavaScript, you can find the codepoint of a character x by calling
    x.charCodeAt(0). It's easy to embed these SVGs in a website. A minimal
    example is as follows:

    <body><embed src="31119.svg" width="200px" height="200px"/></body>
    

    This feature is experimental because it is still tricky to work with these
    images beyond this basic example. For instance, it's not clear how to
    embed two of these images side-by-side and have the second start animating
    when the first is complete. However, the images are still the easiest way
    to make use of this data.
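The codepoint lookup above can be wrapped in a tiny helper (svgFilename is an illustrative name; codePointAt is used instead of charCodeAt, though the two agree for BMP characters like those in this data set):

```javascript
// Map a character to the name of its animated SVG in the svgs tarball,
// which names each file by the character's decimal Unicode codepoint.
function svgFilename(character) {
  return `${character.codePointAt(0)}.svg`;
}

svgFilename('福');  // → '31119.svg'
```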

There are quite a few clients using the Make Me a Hanzi data. Many of them
have had to do additional preprocessing of it for their use case. If you find
this data useful, please feel free to contact me by email - I may be able
to give tips or suggest algorithms for making use of it.

  • This project is focused on building stroke order diagrams that follow the
    People's Republic of China (PRC) stroke order. Some characters are written
    with different stroke orders in Japan, Taiwan, and elsewhere. I don't have
    the time or knowledge to produce similar data for those orderings, but
    there are other resources that you can try:

    • parsimonhi's animCJK project provides Japanese stroke order data: GitHub and Demo
    • KanjiVG also has Japanese stroke order data, and isn't based on Arphic's font: Website
    • chanind's Hanzi Writer Javascript library supports animations and writing practice: Website
  • There are also some apps and websites that use this data:

    • gugray maintains HanDeDict, a Chinese-German dictionary that uses these animations: GitHub and Website
    • meshonline wrote a free iOS app for learning Chinese characters using this data: GitHub and App Store
    • embermitre uses Make Me a Hanzi animations in Hanping Chinese Dictionary: Lite version and Pro version
