Monday, June 21, 2010

CJK support in LaTeX

(Oh gosh. This is one post I should have written a loooong time ago.)

Najmi has previously posted about Jawi support in LaTeX. Well then, what about CJK (Chinese, Japanese, Korean) characters?

(The process may be simpler using XƎLaTeX, but I personally use LaTeX more, so this post won’t touch on XƎLaTeX.)

Short snippets

If you need only short CJK snippets, use the CJK package. (While you’re at it, you may as well grab cjk-fonts and wadalab for the fonts.) On Debian-based systems, just install latex-cjk-all and you should be good. Or, if you don’t want the whole bundle (it’s huge), install only what you need from latex-cjk-chinese, latex-cjk-japanese or latex-cjk-korean (plus the relevant font packages).

Here’s a basic example for Chinese:

\usepackage{CJK}
...
%% if your file is saved as GB simplified encoding
... as we say in Chinese,
\begin{CJK}{GB}{gbsn}子曰:有朋自远方来,不亦乐乎?\end{CJK}

%% if your file is saved as Big5 traditional encoding
... as we say in Chinese,
\begin{CJK}{Bg5}{bsmi}子曰:有朋自遠方來,不亦樂乎?\end{CJK}


But if you're saving as UTF-8 then you need CJKutf8.sty (included in CJK package):

\usepackage{CJKutf8}
...
as we say in Chinese,
\begin{CJK}{UTF8}{gbsn}子曰:有朋自远方来,不亦乐乎?\end{CJK}
or \begin{CJK}{UTF8}{bsmi}子曰:有朋自遠方來,不亦樂乎?\end{CJK}
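
Put together as a complete file, the UTF-8 variant might look like the minimal sketch below. It assumes the file is saved as UTF-8, the gbsn font is installed, and the file is compiled with latex or pdflatex.

```latex
% Minimal sketch: a short Chinese snippet inside an English document.
% Assumes UTF-8 file encoding and an installed gbsn font.
\documentclass{article}
\usepackage{CJKutf8}

\begin{document}

As we say in Chinese,
\begin{CJK}{UTF8}{gbsn}子曰:有朋自远方来,不亦乐乎?\end{CJK}

\end{document}
```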

You have a few font choices (make sure you get the latex-cjk-chinese-arphic-* files!):
  • gbsn (简体宋体, simplified Chinese)
  • gkai (简体楷体, simplified Chinese)
  • bsmi (繁体细上海宋体, traditional Chinese)
  • bkai (繁体标楷体, traditional Chinese)
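
To mix these fonts inside a single CJK environment, the CJK package provides \CJKfamily. The sketch below assumes both gbsn and gkai are installed and the file is saved as UTF-8.

```latex
% Sketch: switching between two simplified Chinese fonts.
% Assumes the gbsn and gkai fonts are both installed.
\documentclass{article}
\usepackage{CJKutf8}

\begin{document}

\begin{CJK}{UTF8}{gbsn}
有朋自远方来 % typeset in gbsn (Song)

\CJKfamily{gkai}% switch the font family for the text that follows
不亦乐乎 % typeset in gkai (Kai)
\end{CJK}

\end{document}
```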

Japanese and Korean text is typeset in much the same way. If you save everything as UTF-8, then it’s just a matter of knowing which fonts to invoke:

\usepackage{CJKutf8}
...
%% Japanese
\begin{CJK}{UTF8}{min}
露の世は 露の世ながら さりながら
\end{CJK}

%% Korean
\begin{CJK}{UTF8}{mj}
편편황조 자웅상의 염아지독 수기여귀
\end{CJK}


The Japanese fonts are from the wadalab packages (latex-cjk-japanese-wadalab-*):
  • min (明朝 Mincho)
  • goth (ゴシック Gothic)
  • maru (丸ゴシック Maru Gothic)

As for Korean, well I’ve only been able to get mj (明朝体 MyongJu) working so far.

Entire Document in Chinese

On the other hand, if your entire document is going to be in Chinese, you might be better off using the ctexart document class (in the ctex package):

\documentclass[UTF8]{ctexart}

\begin{document}

\section{论语}
子曰:有朋自远方来,不亦乐乎?

\end{document}


There is a caveat, though. You’ll need to copy some Windows Chinese font files to your $localtexmf/fonts/truetype/... directory (don’t forget to run texhash!) to use ctex properly (font name in CJK/ctexart in brackets). These are all for simplified Chinese characters:
  • simsun.ttc 宋体 (song, default)
  • simfang.ttf 仿宋 (fs)
  • simkai.ttf 楷书 (kai)
  • simhei.ttf 黑体 (hei)
  • simli.ttf 隶书 (li)
  • simyou.ttf 幼圆 (you)
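
Once the fonts are installed, a small test file like the sketch below can confirm that ctex finds them. It uses the font-switching commands \songti, \heiti, \kaishu and \fangsong that ctex defines. Which commands are available depends on your ctex version and font setup, so treat this as a starting point rather than a guaranteed recipe.

```latex
% Sketch: checking that ctexart can see the installed fonts.
% Assumes a UTF-8 file and that ctex is configured with the fonts above.
\documentclass[UTF8]{ctexart}

\begin{document}

{\songti 宋体} % Song, the default
{\heiti 黑体} % Hei
{\kaishu 楷书} % Kai
{\fangsong 仿宋} % Fangsong

\end{document}
```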

In any case, for more help with the ctex package and ctexart.cls, your best bet is the CTEX forum. (The language there is predominantly Mandarin Chinese.) I’m not aware of similar classes for Japanese or Korean, though.

Pinyin and Ruby

Younger children learning Chinese characters (Hanzi/Kanji/Hanja) would often have the pronunciations annotated alongside/above/beneath the characters. For Chinese pinyin pronunciations, you would invoke

\usepackage{pinyin}
...\dian4 \deng1

to get diàn dēng.
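
As a complete file, that might look like the sketch below. It assumes pinyin.sty (shipped with the CJK package) is installed. The syllables \ni3 and \hao3 are extra examples picked for illustration, following the same syllable-plus-tone-number pattern.

```latex
% Sketch: pinyin with tone marks, no CJK characters needed.
% Each macro is a syllable name followed by its tone number (1-4).
\documentclass{article}
\usepackage{pinyin}

\begin{document}

\dian4 \deng1 % the example from the post

\ni3 \hao3 % illustrative extra syllables

\end{document}
```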

To cite Martin Duerst:
Ruby are small characters used for annotations of a text, at the right side for vertical text, and atop for horizontal text, to indicate the reading (pronounciation) of ideographic characters.
And you can produce them with the ruby package:

\usepackage{CJKutf8,pinyin}
\usepackage[overlap,CJK]{ruby}
...

%% By convention, the pinyin would be *under* the Hanzi
%% so change the \rubysep to move it under

\begin{CJK}{UTF8}{gbsn}
\renewcommand\rubysep{-1.4em}
\ruby{电}{\dian4}\ruby{灯}{\deng1}
\end{CJK}

%% I find the default \rubysep (-0.5ex) too tight, so
%% let's enlarge it a little.

\renewcommand\rubysep{-0.2ex}

%% Shonen manga readers would get the "written as
%% rival, pronounced as friend" reference

%% (CORRECTED June 22)
\begin{CJK}{UTF8}{min}
\ruby{素敵}{ともだち}
\end{CJK}

%% Disclaimer: I'm actually unsure where the
%% ruby should be placed for Korean Hanja

\begin{CJK}{UTF8}{mj}
\ruby{南}{남}\ruby{宮}{궁}
\end{CJK}

The output of which looks something like this:

11 comments:

  1. Here is my version for CJK. I am using CJK, UTF8 and Cyberbit TTF font.


    \documentclass[12pt,a4paper]{article}
    \usepackage[encapsulated]{CJK}
    \usepackage[utf8]{inputenc}
    \usepackage{arabtex}
    \newcommand{\cjktext}[1]{\begin{CJK}{UTF8}{cyberbit}#1\end{CJK}}
    \renewcommand\rmdefault{phv}

    \begin{document}
    \setmalay
    \novocalize
    [code]
    \noindent Arab\TeX\ \\
    \cjktext{
    日本語とアラブ文字を一緒に書けましょう。
    }

    \end{document}

    p/s: How to create the boxes in the article :)

  2. @bahathir, cyberbit is cool in that being a unicode font, it can be used for Chinese, Japanese *and* Korean characters. However I got fed up with having to generate the font files from the ttf every time I upgrade my computer etc. Plus I really dislike how some of the characters look (to me some of the strokes and composition are just wrong, especially for Chinese, I dare say it's just me though).

    Oh the "code snippet" boxes? Just some CSS trickery. Here's how I did it:

    In the .css (blogger now lets you define custom CSS):

    .codesnippet {
    border: 1pt solid;
    font-family: Courier, monospace;
    padding: 6pt;
    }

    .codecomment {
    color: #666699;
    font-style: italic;
    }


    And in your HTML:

    <div class="codesnippet">
    \documentclass{article}

    <span class="codecomment">% This be a comment!</span>
    \begin{document}
    blah blah...
    \end{document}
    </div>

  3. @bahathir, @najmi, @rizal
    I forgot to mention that I've already added the custom css to this site, so when writing your posts you just need to do the <div class="codesnippet">...</div> to get the boxed code snippet.

    Also (if you don't mind the extra typing) to get the LaTeX logo:
    <span class="latex">L<sup>a</sup>T<sub>e</sub>X</span>

    I learned the trick from here: http://nitens.org/taraborelli/texlogo

  4. Thanks for the "codesnippet" tip, but I got this error:

    "Your HTML cannot be accepted: Tag is not allowed: DIV"

    Actually I wanted to use the PRE HTML tag, but got the same error :) Hmm... it seems that this blog cannot accept any standard HTML tags.

    p/s: I post using my Google Account id.

  5. @bahathir, it's only in the comments section that you can't use HTML tags. You can use whatever HTML tags you like (including div, span etc.) when writing a new blog post (as opposed to a comment on a post).

    In my previous post comment I used &lt; and &gt; to "simulate" the HTML open and end tags.

  6. Hi! Thanks for the article :D

    I was wondering though--how would it work under a Windows system? Is it basically the same concept, or do I need to do extra work to make it work right?

    Thanks :)

  7. @瑞歌, assuming you're using MikTeX on Windows, grab the cjk, cjk-fonts and cjkpunct packages. (Also ctex if you prefer that approach.) I use the exact same methods as outlined in this post on Windows, GNU/Linux (tried on Debian, Ubuntu, Fedora, Slackware) and Mac and they all work. ;-)

    Make sure you have some kind of Chinese input method installed (you need to install/enable one from the Windows installation disk), and remember to save your .tex files as UTF-8, Big5 or GB (your choice), taking care to pass the matching encoding parameter to \begin{CJK}{UTF8}{...} etc.

  8. Did you mean
    \usepackage{CJKutf8}
    by
    \documentclass{CJKutf8}
    in the 3rd block, in the Japanese and Korean example?

    Replies
    1. Oh yes, you're right! Thanks for pointing it out; post is now updated.

      In addition, in recent versions \usepackage{CJK} would automatically know if UTF-8 is in use.

  9. Thanks so much!!

    I have done all of the above and Chinese works but

    %% Japanese
    \begin{CJK}{UTF8}{min}
    露の世は 露の世ながら さりながら
    \end{CJK}

    %% Korean
    \begin{CJK}{UTF8}{mj}
    편편황조 자웅상의 염아지독 수기여귀
    \end{CJK}

    returns the following error:

    \begin{CJK}{UTF8}{mj} 편
    편황조 자웅상의 염아지독 수기여...
    I wasn't able to read the size data for this font,
    so I will ignore the font specification.
    [Wizards can fix TFM files using TFtoPL/PLtoTF.]
    You might try inserting a different font spec;
    e.g., type `I\font='.


    Missing character: There is no � in font nullfont!
    ! Font C70/mj/m/n/12/d6=uwmjd6 at 12.0pt not loadable: Metric (TFM) file not fo

    Replies
    1. Hi, it sounds like you're missing the font files for only the Korean mj fonts. Which is weird... what platform are you using (MikTeX, TeXLive or MacTeX)?
