
The "Academia Sinica Balanced Corpus of Modern Chinese", abbreviated as Sinica Corpus, is the first balanced corpus of Modern Chinese with part-of-speech tagging. A small-scale preliminary version was opened to the academic community in 1994, mainly to obtain feedback. In 1997 we released Sinica Corpus 3.0, with 5 million words and a user-friendly search interface. The new version, Sinica Corpus 4.0, which targets 10 million words, became available for licensing in 2010, and its web search interface opened to the public in 2013.

In addition to data collection and data cleaning, the construction of a balanced Chinese corpus raises three further concerns: 1) balancing and classifying the collected data, 2) Chinese word segmentation, and 3) the design of the POS tagset (Chen 1994).

1. Data extraction and classification for a Balanced Corpus
Topical distribution of the Sinica corpus:

Topics      Philosophy  Literature  Life  Society  Science  Art
Percentage  8%          13%         28%   38%      8%       5%
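
As an illustration of what these shares mean at the 10-million-word target size, the following minimal sketch converts each percentage into a word quota. It assumes the percentages apply to word counts (the table does not say whether they count words or texts), and the names `TOPIC_SHARE` and `topic_quotas` are hypothetical, introduced only for this example.

```python
# Topic shares (in percent) as listed in the topical distribution table.
TOPIC_SHARE = {
    "Philosophy": 8,
    "Literature": 13,
    "Life": 28,
    "Society": 38,
    "Science": 8,
    "Art": 5,
}


def topic_quotas(total_words: int) -> dict[str, int]:
    """Split a target corpus size across topics by percentage share.

    Assumption: shares are interpreted as proportions of the word count.
    """
    # The shares must cover the whole corpus.
    assert sum(TOPIC_SHARE.values()) == 100
    # Integer arithmetic avoids floating-point rounding.
    return {topic: total_words * pct // 100 for topic, pct in TOPIC_SHARE.items()}


# 10 million words is the stated target size of Sinica Corpus 4.0.
quotas = topic_quotas(10_000_000)
for topic, words in quotas.items():
    print(f"{topic:<12}{words:>12,}")
```

Under that assumption, Society would account for 3.8 million words and Art for 0.5 million.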

2. Issues of Chinese word segmentation:
The word segmentation standard for Chinese information processing issued by the Central Standards Bureau was adopted as the guideline for segmenting words in the Sinica corpus.

3. The Part-of-Speech tagging system and its interpretation:
Based on the tagset of 178 syntactic categories in the CKIP lexicon (CKIP 1993), the Sinica Corpus uses a reduced tagset of 46 tags (43 tags plus 3 features).
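
To show how segmented and tagged text can be processed, here is a minimal sketch that reads tokens and counts how often each tag occurs. It assumes a `word(TAG)` notation with whitespace between tokens; the sample sentence, its tags and the names `TOKEN` and `parse_tagged` are illustrative assumptions, not an excerpt from the corpus or a description of its distribution format.

```python
import re
from collections import Counter

# Assumed token notation: a word immediately followed by its tag in parentheses.
TOKEN = re.compile(r"(\S+?)\(([A-Za-z_0-9]+)\)")


def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Return (word, tag) pairs found in one line of tagged text."""
    return TOKEN.findall(line)


# Illustrative sentence only; tags are given for demonstration purposes.
sample = "我(Nh) 喜歡(VK) 語言學(Na)"
tokens = parse_tagged(sample)

# Count how many tokens carry each tag.
tag_counts = Counter(tag for _, tag in tokens)

for word, tag in tokens:
    print(word, tag)
```

Counting tags per text in this way is the basis for simple frequency comparisons across parts of a corpus.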

4. Part-of-speech analysis: Technical Report no. 93-05. This technical report includes a detailed POS analysis and the corresponding argument structures.

  • The Sinica corpus, a Balanced Corpus of Modern Chinese with 10 million words:

    • 10 million words collected, primarily since 1996.
    • Texts in the corpus are being collected from different areas and classified according to five criteria: genre, style, mode, topic, and source.
    • Every text is segmented, and each segmented word is tagged with its part of speech.
    • The Sinica Corpus web-interface is designed for statistical comparison according to users' specification of topics, genres, etc.
    • The web-interface address for Sinica Corpus:

Huang, Chu-Ren, Keh-jiann Chen and Li-Li Chang, 1996. Segmentation Standard for Chinese Natural Language Processing. Proceedings of the 1996 International Conference on Computational Linguistics (COLING 96). August. Copenhagen, Denmark.

Chang, Li-ping and Keh-jiann Chen, 1995. The CKIP Part-of-speech Tagging System for Modern Chinese Texts. Proceedings of ICCPOL'95, Hawaii.

Hsu, Hui-li and Chu-Ren Huang, 1995. Design Criteria for a Balanced Modern Chinese Corpus. Proceedings of ICCPOL'95, Hawaii.

Chen, Keh-jiann, Shing-huan Liu, Li-ping Chang and Yeh-Hao Chin, 1994. A Practical Tagger for Chinese Corpora. Proceedings of ROCLING VII, pp.111-126.

Huang, Chu-Ren, 1994. Corpus-based Studies of Mandarin Chinese: Foundational Issues and Preliminary Results. In Matthew Chen and Ovid Tzeng, Eds. In Honor of William S-Y. Wang: Interdisciplinary Studies on Language and Language Change. pp. 165-186. Taipei: Pyramid.

Huang, Chu-Ren and Keh-jiann Chen, 1992. A Chinese Corpus for Linguistics Research. In the Proceedings of the 1992 International Conference on Computational Linguistics (COLING-92). pp.1214-1217. Nantes, France.

Huang, Chu-Ren and Ruo-ping Mo, 1992. Mandarin Ditransitive Constructions and the Category of Gei. In the Proceedings of the Berkeley Linguistics Society Annual Meeting (BLS 18), pp. 109-122. Berkeley: BLS.

Chen, Keh-jiann and Shing-huan Liu, 1992. Word Identification for Mandarin Chinese Sentences. Proceedings of COLING'92, pp. 54-59.

Huang, Chu-Ren, 1987. Mandarin Chinese NP de: A Comparative Study of Current Grammatical Theories. Special Publications No.93 of the Institute of History & Philology, Academia Sinica, Taipei.
