ZHENGMA (郑码)
A Brief Introduction for English Speakers
Learning Chinese Characters by Typing Them on the Computer
By MEGATRON COOKIE
September 2008
    —— • ——
  1. INTRODUCTION
  2. QUICK START
  3. LEGO A CHINESE CHARACTER
  4. DISSECT A CHINESE CHARACTER
  5. FURTHER EXPLORATION
  6. APPENDIX
    —— • ——
  1. INTRODUCTION

    Nowadays Pinyin (拼音) is the most popular method of entering Chinese characters into the computer. This method is based on the pronunciation (Pinyin) of the Chinese character, which can be “spelled out” with the 26 letters of the English alphabet on a regular keyboard.

    The obvious caveat is that many Chinese characters share the same pronunciation, though bearing completely different meanings. For example, the Pinyin “zhi bing” matches the pronunciation of both “治病” and “致病”. The former means “curing of the disease” and the latter means “causing the disease”. (Yes, read the label carefully next time you try some herbal medicine.)

    So this phonetic input method requires the user to interactively select the Chinese character among possible candidates after he/she spells out the Pinyin.

    For example, to type the Chinese character 早 (morning or early) using Pinyin, one types “zao”, and more than 10 possible candidates pop up on the computer screen. Fortunately, candidate number 1 is the one we intend to type, so the user keys in “1” or just presses the spacebar to select that character.

    Fig. 1. Choose from multiple Chinese characters with Pinyin input method

    This extra interaction step slows down the speed at which a professional typist can feed Chinese into the computer, since he/she needs to constantly switch attention between the original paper document and the electronic version on the computer screen.

    So early on, people invented a different category of input methods, called XingMa (stroke-encoding). These methods assign a code (usually 4 letters or fewer) to each Chinese character based on how the character is written (i.e. stroke, root and structure), which leaves almost no ambiguity.

    ZhengMa (郑码) is one of the best XingMa input methods. Fig. 2 shows an example of typing the same character 早 (morning or early) using ZhengMa. Its code is “ked”, which happens to bring up a second character, 旪. You may have noticed that both characters share similar features. That is, they are made of the same parts, 日 and 十 (called roots), but have different structures (i.e. one is top-bottom, the other is left-right).

    Fig. 2. Input a Chinese character with ZhengMa

    By the way, 旪 is such an ancient and obsolete character, it is safe to say that 95% of the Chinese speaking population don’t know how to pronounce it. (Try typing that with Pinyin, if you can.)

    There are some six thousand frequently used Chinese characters that, at first look, all have distinctive shapes. (Just the characters in Fig. 1 can give you an idea.) Uniquely encoding each character isn’t an easy task at all. Remember that this isn’t the same as assigning a number (like a Unicode code point) to each character. A human typist needs to be able to retrieve each code from his/her brain when typing the Chinese character.

    Humans are usually equipped with a small memory but are good at logical deduction. ZhengMa is one of the “easiest” stroke-encoding methods because most of its encodings are highly logical and can be deduced from a few general rules. (We will describe these in detail later.) Still, the codes of some basic building blocks (roots) need to be memorized (e.g. k for 日, ed for 十). Memorizing these root codes poses a steep learning curve, which impedes the wide adoption of almost all stroke-encoding methods.

    It’s not surprising that, over time, the Pinyin method has won the popularity vote, due to its ease of use as well as another important reason: when typing Chinese characters by Pinyin, there are far fewer ambiguities if one chooses to combine the characters into 2-character to 4-character phrases.

    For example, the phrase 计算机 (computer) is typed in Pinyin as “ji suan ji”. There is only one possible candidate to choose from. In fact, one may even reduce the keystrokes to “ji s ji” and the computer will respond with the same characters.

    So in general, the longer the phrase into which the user groups the Chinese characters, the more likely he/she is to get the intended input into the computer. With almost no training, a regular Chinese-speaking computer user can type Chinese characters at a reasonable speed by Pinyin.

    Do all these reasons make the stroke-encoding method completely obsolete? Probably not.

    Traditionally, one argument for the stroke-encoding method is that single characters rather than multi-character phrases are the main constituents of the Chinese language, and Pinyin is a degenerate way of representing the language.[ 1 ] More importantly, new multi-character phrases are being coined, and old ones are being phased out, every day. A Pinyin input method (a phonetic method) that relies on matching phrases “intelligently” would find it difficult to keep up with the fast-evolving language.

    In reality, on the contrary, with techniques for updating the phrase database over the Internet and dynamically adjusting/predicting phrase candidates based on usage, Pinyin turns out to be very effective at keeping pace with the modernizing language. In fact, it has become so successful that popular misuses of phrases that brew on BBSes or in Internet chat rooms often “contaminate” the formal use of the language.

    Just recently (8/14/2008), one article in a Beijing evening newspaper, reporting on a woman Olympic gold medalist, mistakenly used the phrase “故娘” in its title. It should have been 姑娘, meaning “young lady”; 故娘, which has the same pronunciation, could in some contexts mean “deceased lady”. That was probably a mistake caused by the Pinyin input method. (No “spelling check” is possible, since both phrases are spelled exactly the same in Pinyin.)

    Interestingly, on the other hand, since ancient Chinese poems and classical literature do rely heavily on isolated Chinese characters, and these Chinese characters/phrases barely have frequent daily usage in the modern language, they easily fall into the blind spot of Pinyin input method.

    For example, below is the beginning of a long Chinese classical verse. It is at least time consuming, if not error-prone, to type these characters by Pinyin.

    « 浔阳江头夜送客,枫叶荻花秋瑟瑟。»

    (The above characters were typed by ZhengMa.)

    So why is ZhengMa relevant for English speakers who are interested in learning (or playing with) Chinese characters?

    First of all, in addition to listening/speaking Chinese, the ability to read/write Chinese is naturally the next step in learning the language. After all, the beauty of Chinese lies in part in her hieroglyphic characters. Isn’t it fascinating that 山 (mountain) looks like a mountain, 月 (moon) looks like the moon, and 日 (sun) and 月 (moon) together become 明 (bright)?

    By learning the 26+ building blocks (roots) in ZhengMa, one can start building Chinese characters on the computer. Call that the ultimate LEGO experience. It’s going to be a fun way to get to know Chinese characters.

    Secondly, and even better, by knowing its encoding rules, one can use ZhengMa to type a Chinese character without knowing its pronunciation or meaning at first. In fact, with 26 root characters and some logically derived sub-roots, one can, in theory, type 60,000+ Chinese characters into the computer. (Remember that “old” saying: if you can type it into the computer, you can google it.)

    The main objective of this article is not to train you (English speakers) to be a professional typist (at an input rate of 200+ characters/minute, which can nevertheless be achieved with practice). The aim is to reveal the intellectually rewarding fun behind typing Chinese characters with ZhengMa, a really good stroke-encoding method. So it becomes unimportant to memorize the codes for root characters, to discuss speed-typing the characters with shortcut codes, etc. Without these hurdles, it turns out that ZhengMa is no longer difficult to learn at all, even for English speakers.[ 2 ]


    [ 1 ] Literate Chinese speakers who are used to word processing on the computer with Pinyin often find themselves scratching their heads when it comes to writing Chinese characters by hand. The symptom is called 提笔忘字. ZhengMa can help cure such memory losses. (no meditation needed)
    [ 2 ] Partially inspired by and derived from ZhengMa (郑码), Wubi (五笔) is another popular stroke-encoding input method.
    A wonderful tutorial for English speakers can be found at www.yale.edu/chinesemac/wubi/xing.html
    Choosing an input method is like picking a programming language. A religious war can be fanned by arguing over which input method (or, similarly, programming language) is better.
    Suffice it to say, Wubi has more users, and thus more study/help material can be found on the Internet. On the other hand, ZhengMa is more complete and consistent in its encoding. It is also shipped by default with almost every version of the MS-Windows operating system. This article is an effort to fill the gap in introductory material for this wonderful but lesser-known input method.

  2. QUICK START

    ZhengMa is shipped by default with almost every version of the MS-Windows operating system. Before you start, make sure you have added Chinese language support in the Windows International Settings. Here is a test page to see whether your operating system can display Chinese characters properly: en.wikipedia.org/wiki/Chinese_language

    (google: Microsoft windows read Chinese, if you need help)

    ZhengMa can be added as one of the Chinese input methods (Fig. 3.), in addition to, for example, Microsoft Pinyin.

    Fig. 3. Add ZhengMa as a Chinese input method

    (google: Microsoft chinese IME, if you need help)

    The ZhengMa input language bar can then be invoked; it looks like Fig. 4 on an English version of Windows XP.[ 3 ]

    Fig. 4. Language bar for ZhengMa input method

    Now you can open a text editor (e.g. Notepad) and start typing Chinese. As a quick start, let’s begin with the first 7 characters of that ancient Chinese verse: 浔阳江头夜送客 by typing:

    vxds yk vbi tgd snr wug wdrj

    (Press the spacebar between the codes to confirm the Chinese character selection, if necessary. An extra press of the spacebar may accidentally bring up multi-character phrases.)

    Isn’t that fun? The computer seems to be able to respond to these mysterious letters. This is only the beginning of the fun; wait till you know how to compose the codes yourself.[ 4 ]
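    To picture what the input method does with those letters, here is a minimal Python sketch of a code table, assuming only the codes quoted in this article (ked from the introduction and the seven codes above). The names CODE_TABLE, candidates and type_codes are made up for illustration; a real ZhengMa table is far larger, and a code may bring up more candidates than are listed here.

```python
# Toy code table built only from codes quoted in this article.
# Assumption: each verse code is listed with just its intended character.
CODE_TABLE = {
    "ked": ["早", "旪"],   # same roots, different structure (see Fig. 2)
    "vxds": ["浔"],
    "yk": ["阳"],
    "vbi": ["江"],
    "tgd": ["头"],
    "snr": ["夜"],
    "wug": ["送"],
    "wdrj": ["客"],
}


def candidates(code):
    """Return the candidate characters for a code (empty list if unknown)."""
    return CODE_TABLE.get(code.lower(), [])


def type_codes(line):
    """Turn space-separated codes into text, always taking the first candidate."""
    out = []
    for code in line.split():
        found = candidates(code)
        out.append(found[0] if found else "?")
    return "".join(out)


print(type_codes("vxds yk vbi tgd snr wug wdrj"))  # 浔阳江头夜送客
print(candidates("ked"))                           # ['早', '旪']
```

    Taking the first candidate mirrors pressing the spacebar right after a code.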


    [ 3 ] There is fan-made DIY software that attaches the ZhengMa input method to SCIM (an input method framework under Linux). For Mac users, such patching may go through OpenVanilla or FIT. QIM supports ZhengMa, but it isn't free software.
    [ 4 ] A complete root code diagram is attached in the Appendix. It may seem intimidating at first look. But it's just like the many dials in the cockpit of an airplane. Once you know how things are organized, it's not that complex after all.

  3. LEGO A CHINESE CHARACTER

    There are basically five kinds of strokes to write a Chinese character by hand. In ZhengMa, they are coded by five letters: A, I, M, S and Y.

    • A : 一
    • I : 丨
    • M : 丿
    • S : 丶
    • Y : 了

    If you type “ai”, the computer will respond with the Chinese character 丁, which is composed of these two strokes: 一 and 丨.

    One of my primary school classmates has the first name 丁. Like Bart Simpson, when a kid gets into trouble at school, he is forced to write his name repeatedly on the blackboard as punishment. 丁 was never afraid of that because his name only has two simple strokes. (Talk about parents foreseeing a kid’s future.) By the way, Tintin, the hero of the comic series “The Adventures of Tintin”, is named 丁丁 in Chinese.

    Another example: “ya” yields 子, meaning son.

    The really tricky part is knowing the order of the strokes, i.e. whether it is “ay” or “ya”. The general rule is to write from left to right, and from top to bottom. For characters with few strokes or convoluted structures, the rule isn’t always conclusive. However, you don’t need to take a 3-credit-hour Chinese course or hire a Chinese tutor to learn the stroke nuances. ZhengMa on your computer is your best tutor: just use trial and error, and see whether the computer agrees with your stroke order.

    However, these five strokes alone are too simple to be useful, because a Chinese character with 20+ strokes would require 20+ keystrokes to input stroke by stroke like this. That’s definitely too cumbersome. So ZhengMa’s inventor came up with larger “LEGO” bricks, called roots (coded by single letters from A-Z) and sub-roots (coded by 2 alphabetic letters). Both the roots and sub-roots are frequently used building blocks of Chinese characters. With the help of these sub-structures, ZhengMa can code each Chinese character with no more than 4 letters.

    But still these five strokes (A, I, M, S, Y) are the most important elements of any stroke-based encoding system. If you are into computer programming languages, these five strokes are just like “primitives” (or keywords) of a programming language. The various roots are like built-in function libraries (or API calls).

    Of course, the third most important part of a programming language is “the rules of combination”, i.e. how to write your own function/code. In the context of ZhengMa, that means how to compose a Chinese character from these roots (and sub-roots). Here are the complete rules:

      —— • ——
    1. ZhengMa uses no more than 4 alphabetic letters to encode each Chinese character.
    2. Each Chinese character is composed of roots and/or sub-roots. The roots are encoded with single letters; sub-roots are encoded with 2 letters. Example: root 土 (b) and sub-root 工 (bi).
    3. For a character made of 1 or 2 roots/sub-roots, the code won’t exceed 4 letters. Just type in the corresponding letters for its roots/sub-roots, following the order of “left-to-right”, “top-to-bottom”. Example: 木 (f) + 子 (ya) = 李 (fya), 日 (k) + 十 (ed) = 早 (ked), 日 (k) + 月 (q) = 明 (kq)
    4. For a character made of 3 roots/sub-roots, the first root/sub-root is spelled out completely (i.e. 1 or 2 letters), the middle one contributes only 1 letter, and the last root/sub-root contributes 1 or 2 letters to fill the 4-letter slot. So the formula is 1+1+2, 1+1+1 or 2+1+1. Example: 箱 (mfl; 1+1+1), 肝 (qaed; 1+1+2), 教 (bmym; 2+1+1). Notice that 教 (teach) has the parts 耂 (bm), 子 (ya) and 攵 (mo), but to fit the 4-letter slot, only the first letters of the sub-roots 子 and 攵 are included.
    5. For a character made of 4+ roots/sub-roots, as always the first root/sub-root is spelled out completely (i.e. 1 or 2 letters); the second, the second-to-last and the last root/sub-root each contribute 1 letter. When the first one already takes 2 letters, the second is skipped (0 letters) so that the code stays within 4 letters. The formula becomes 1+1+1+1 or 2+0+1+1. Example: 鳞 (rurm; 1+1+1+1), 赢 (shlq; 2+0+0+1+1, a five-part character in which both the second and third parts are skipped).
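    For readers who think in code, rules 1 to 5 can be sketched as one small Python function. This is only an illustration under stated assumptions, not the input method's actual implementation: the name compose is made up, the character is assumed to be already dissected into root/sub-root codes in writing order, and shortcut codes are ignored.

```python
def compose(parts):
    """Build a ZhengMa-style code from root/sub-root codes in writing order.

    parts: list of 1-letter (root) or 2-letter (sub-root) codes.
    """
    if not parts or any(len(p) not in (1, 2) for p in parts):
        raise ValueError("each part must be a 1- or 2-letter code")
    first = parts[0]
    if len(parts) <= 2:
        # Rule 3: spell out every part in full (at most 2 + 2 letters).
        return "".join(parts)
    if len(parts) == 3:
        # Rule 4: first part in full, one letter for the middle part,
        # then as much of the last part as still fits into 4 letters.
        room = 4 - len(first) - 1
        return first + parts[1][0] + parts[2][:room]
    # Rule 5: first part in full, then the first letters of the second,
    # second-to-last and last parts. The second part is skipped when the
    # first part already takes two letters (the 2+0+1+1 pattern).
    second = parts[1][0] if len(first) == 1 else ""
    return first + second + parts[-2][0] + parts[-1][0]


print(compose(["f", "ya"]))         # 李 -> fya
print(compose(["k", "ed"]))         # 早 -> ked
print(compose(["q", "a", "ed"]))    # 肝 -> qaed (split as the 1+1+2 pattern indicates)
print(compose(["bm", "ya", "mo"]))  # 教 -> bmym
```

    The hard part, deciding which roots and sub-roots a character consists of, is left to the human; the function only does the letter bookkeeping.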

    With the complete root/sub-root table in hand (see Appendix), and these 5 rules, you can start building/typing Chinese characters right away. Not every character you build corresponds to an existing Chinese character, but it’s easy to fish out a big one that contains the pieces you like.

    For example, you may like the character 人 (human) because it looks exactly like someone walking. It is listed as a sub-root, od (2 letters), in the table. To add “more people”, try 从 (odod). Not happy, and want more people? How about ododod? Not quite.

    Remember rule 1? The code can’t be longer than 4 letters. Also remember rule 4: for a character with 3 sub-roots the code should be od+o+o, i.e. odoo: 众 (lots of people).

    There is certainly no Chinese character with more people in it, because if there were, the code odoo would bring it up as well. (Remember rule 5?) Exercise: try to type 品, 森, 淼, 垚. (Yes, these are all Chinese characters, but your Chinese professor probably won’t be able to pronounce all of them without the help of a dictionary.)

    One way to familiarize yourself with the roots/sub-roots in the table is to make a game of them by putting them on your “scrabble” tiles. You can play “scrabble” with friends to see who can make the “longest” Chinese character. ZhengMa on your computer is the ideal “scrabble” dictionary for checking the validity of a character.

  4. DISSECT A CHINESE CHARACTER

    If you would like to impress a friend by telling him/her the meaning of his/her Chinese character tattoo, you can type that character into the computer and use Google Translate or another online dictionary to find its meaning.

    This leads to the most difficult section of this article: given a Chinese character, dissect it into parts that match ZhengMa’s roots and sub-roots.

    It is essential to recognize the roots and sub-roots and to realize that these are the basic units that shall not be taken apart further. For example, the character 郭 (sjyy; 1+1+1+1) should be dissected into 亠 (s), 口 (j), 子 (ya) and 阝 (y). According to rule 5, we pick s (for the first root 亠) + jyy (the first letters of the rest of the roots/sub-roots). In this case 子 is a sub-root (coded ya), which should not be further decomposed into 了 (y) and 一 (a). Otherwise, if you thought 郭 were made of five components (亠, 口, 了, 一, 阝), its code would become sjay according to rule 5.

    Just like taking apart your iPod, in order to pry a character open you need to know where to insert the plastic wedge. For many Chinese characters, the structure hints at where the break point should be.
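    The effect of over-dissecting 郭 can be checked with a few lines of Python. This is a minimal sketch of rule 5 only, assuming a character with 4+ parts whose first part is a 1-letter root; the function name rule5_code is made up for illustration.

```python
def rule5_code(parts):
    """Rule 5 for a character dissected into 4+ parts with a 1-letter first root."""
    if len(parts) < 4 or len(parts[0]) != 1:
        raise ValueError("sketch covers only 4+ parts with a 1-letter first root")
    # Keep the first, second, second-to-last and last parts, one letter each.
    picked = [parts[0], parts[1], parts[-2], parts[-1]]
    return "".join(p[0] for p in picked)


correct = ["s", "j", "ya", "y"]          # 亠, 口, 子, 阝
over_split = ["s", "j", "y", "a", "y"]   # 亠, 口, 了, 一, 阝

print(rule5_code(correct))      # sjyy
print(rule5_code(over_split))   # sjay
```

    Splitting 子 changes which part counts as “second-to-last”, and that is exactly why the code comes out wrong.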

    For example: 栖 (ffj) = 木 (f) + 西 (fj) LEFT-RIGHT structure. 西 is already a sub-root, so don’t break it further.

    The structure that is most difficult to identify is called the SANDWICH structure. For example, 巫 (bioo) = 工 (bi) + 人 (od) + 人 (od) and 亘 (bdk) = 二 (bd) + 曰 (k).

    There are also nuances: 失 and 矢 are two different characters made of the same parts. To distinguish between them, 矢 (ma) is made into a sub-root, while 失 is allowed to be broken into mb + od. This is also the case for 牛 (mb) and 午 (maed).

    The general guideline is to break the character into sub-structures until roots/sub-roots are matched. Exceptions can happen when the character is too “slim”.

  5. FURTHER EXPLORATION

    Much can be said about ZhengMa’s roots/sub-roots table. Decades of research effort went into perfecting it.

    Sub-roots and roots are grouped into five categories based on their first stroke, which is among 一, 丨, 丿, 丶, 了 as discussed in section 3.

      —— • ——
    1. A to H encode roots/sub-roots with the first stroke 一.
    2. I to L encode roots/sub-roots with the first stroke 丨.
    3. M to R encode roots/sub-roots with the first stroke 丿.
    4. S to W encode roots/sub-roots with the first stroke 丶.
    5. X to Z encode roots/sub-roots with the first stroke 乛, 了, 乚 respectively.

    Once you recognize the first stroke of a character (or its sub-structure) you can start looking for the corresponding root/sub-root in the table (see Appendix). If there is no proper match, it means that you need to take the character further apart.
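    The five groups can also be read in the other direction: the first letter of any root/sub-root code tells you which stroke that root starts with. Here is a minimal Python sketch of that lookup, using only the letter ranges listed above; the names FIRST_STROKE_GROUPS, TURNING_STROKES and first_stroke are made up for illustration.

```python
# Letter ranges taken from the five groups listed above.
FIRST_STROKE_GROUPS = [
    ("A", "H", "一"),
    ("I", "L", "丨"),
    ("M", "R", "丿"),
    ("S", "W", "丶"),
]
# X, Y and Z each stand for their own turning stroke.
TURNING_STROKES = {"X": "乛", "Y": "了", "Z": "乚"}


def first_stroke(code):
    """Return the first stroke implied by a root/sub-root code."""
    if not code:
        raise ValueError("empty code")
    letter = code[0].upper()
    if letter in TURNING_STROKES:
        return TURNING_STROKES[letter]
    for low, high, stroke in FIRST_STROKE_GROUPS:
        if low <= letter <= high:
            return stroke
    raise ValueError(f"not a ZhengMa letter: {code!r}")


print(first_stroke("ed"))  # 十 (ed) starts with 一
print(first_stroke("k"))   # 日 (k) starts with 丨
print(first_stroke("od"))  # 人 (od) starts with 丿
print(first_stroke("ya"))  # 子 (ya) starts with 了
```

    This is handy as a sanity check: if the code you looked up starts with a letter from the wrong group, you have probably matched the wrong root.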

    Even though sub-roots are coded with 2 letters, it is obvious, according to the encoding rule, that the second letter is left out most of the time.

    Also the way in which sub-roots are encoded isn’t arbitrary at all. For example, 子 (ya) is a sub-root under root 了 (y). Its code comes from 了 (y) + 一 (a). Similarly 田 (ki) comes from 曰 (k) + 丨 (i).

    In fact, most of the sub-roots can be deduced as add-ons to a root character. Once you are familiar with the 26 root characters, this certainly helps you look up and remember the 2-letter codes.

    So why do we need 2-letter sub-root? Why not just use 1-letter roots to code the characters?

    Well, for one thing, the code can become longer than 4 letters in some cases.

    In order to stick with 4-letter codes, one then has to deal with the ambiguities caused by packing multiple roots onto each letter. By the way, the Wubi input method is a 1-letter root encoding system. It relies on a complicated isolation-key method to fight the ambiguities. Still, its limited number of roots makes its character space quite small compared to ZhengMa’s.

    As we mentioned in section 1, the Pinyin method encourages users to type multi-character phrases in order to resolve the ambiguities.

    The same multi-character strategy can apply to ZhengMa. For example, 计算机 (smlf) = 讠 (s) + 算 (ml) + 木 (f). The code is composed of 1 letter from the first root/sub-root of the first character 计, 2 letters from the first two roots/sub-roots of the middle character 算 (one letter each), and 1 letter from the first root/sub-root of the last character 机. It is as if we were coding each character in a multi-character phrase with a total of only 1 or 2 letters.
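    The 计算机 example can be written out as a short Python sketch. It covers only the three-character pattern described above (1 + 2 + 1 letters); the function name three_char_phrase_code is made up, and each character is given just by the leading root codes that the rule actually uses, since the remaining roots do not matter here.

```python
def three_char_phrase_code(first, middle, last):
    """Code a 3-character phrase from per-character lists of root/sub-root codes.

    Takes 1 letter from the first character, 1 letter each from the first
    two parts of the middle character, and 1 letter from the last character.
    """
    if not first or len(middle) < 2 or not last:
        raise ValueError("need 1+ parts, 2+ parts and 1+ parts respectively")
    return first[0][0] + middle[0][0] + middle[1][0] + last[0][0]


ji = ["s"]          # 计: leading root 讠 (s)
suan = ["m", "l"]   # 算: first two parts, coded m and l
ji2 = ["f"]         # 机: leading root 木 (f)

print(three_char_phrase_code(ji, suan, ji2))  # smlf
```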

    To go in the other direction, why don’t we define more 2-letter or 3-letter sub-roots? After all, 2-letter codes alone offer 26 × 26 = 676 slots, and ZhengMa’s code table (see Appendix) uses far fewer than that.

    More sub-roots would make typing single characters faster, at the price of more memorization. (The good old “speed” vs. “memory space” trade-off.)

    In fact, ZhengMa does have 2-letter and 3-letter “short-cut” codes that aim for speed-typing by professional typists.

  6. APPENDIX

    (ZhengMa Roots/sub-roots Table)