"Academia Sinica Balanced Corpus of Modern Chinese", simplified as Sinica Corpus, is the first Balanced Modern Chinese Corpus with part-of-speech tagging. The preliminary version of Sinica Corpus was developed on a small-scale and opened to the academic community in 1994 with the major purpose of obtaining feedback. The present corpus (Sinica 3.0), which was completed in 1997, has 5 million words. The new version (Sinica 5.0) will target 10 million words and is expected to be completed before 2006.
In addition to data-collection and data cleaning in the construction of a Chinese Balanced Corpus, we are also concerned with: 1) balancing and classifying collected data, 2) Chinese word segmentation, and 3) the design of pos-tag sets (Chen 1994).
1. Data extraction and classification for a Balanced Corpus
Topical distribution of the Sinica Corpus:
2. Issues of Chinese word segmentation:
The word segmentation standard for Chinese information processing issued by the Central Standards Bureau was adopted as the guideline for segmenting words in the Sinica Corpus.
3. The part-of-speech tagging system and its interpretation:
Based on the tagset of 178 syntactic categories in the CKIP lexicon (CKIP 1993), the Sinica Corpus applies a reduced tagset of 46 tags (43 tags plus 3 features).
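The reduction from a fine-grained tagset to a smaller one can be sketched as a many-to-one lookup. The following is only a minimal illustration: the tag labels in FINE_TO_REDUCED and the helper names reduce_tag and reduce_tagged_words are hypothetical placeholders, not entries or procedures taken from the CKIP lexicon or the Sinica Corpus documentation.

```python
from collections import Counter

# Hypothetical fine-to-reduced mapping. The labels are illustrative
# placeholders only; they are NOT copied from the CKIP lexicon.
FINE_TO_REDUCED = {
    "Naa": "Na",
    "Nab": "Na",
    "VA11": "VA",
    "VA12": "VA",
    "VC31": "VC",
}


def reduce_tag(fine_tag, mapping=FINE_TO_REDUCED):
    """Return the reduced tag for a fine-grained tag.

    Unknown tags raise KeyError so that gaps in the mapping are noticed
    instead of being silently passed through.
    """
    try:
        return mapping[fine_tag]
    except KeyError:
        raise KeyError(f"no reduced tag defined for {fine_tag!r}") from None


def reduce_tagged_words(tagged_words, mapping=FINE_TO_REDUCED):
    """Map a list of (word, fine_tag) pairs to (word, reduced_tag) pairs."""
    return [(word, reduce_tag(tag, mapping)) for word, tag in tagged_words]


# Placeholder words stand in for segmented corpus tokens.
sample = [("w1", "Naa"), ("w2", "VC31"), ("w3", "Nab")]
reduced = reduce_tagged_words(sample)

# Frequency of each reduced tag in the sample.
counts = Counter(tag for _, tag in reduced)
print(reduced)
print(counts)
```

Several fine-grained categories collapse onto one reduced tag, so the reduced tagset trades syntactic detail for tags that are easier to assign and to search consistently.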
4. Part-of-speech analysis: Technical Report No. 93-05. This technical report includes a detailed part-of-speech analysis and the corresponding argument structures.
Huang, Chu-Ren, Keh-jiann Chen and Li-Li Chang, 1996. Segmentation Standard for Chinese Natural Language Processing. Proceedings of the 1996 International Conference on Computational Linguistics (COLING 96). August. Copenhagen, Denmark.
Chang, Li-ping and Keh-jiann Chen, 1995. The CKIP Part-of-speech Tagging System for Modern Chinese Texts. Proceedings of ICCPOL'95, Hawaii.
Hsu, Hui-li and Chu-Ren Huang, 1995. Design Criteria for a Balanced Modern Chinese Corpus. Proceedings of ICCPOL'95, Hawaii.
Chen, Keh-jiann, Shing-huan Liu, Li-ping Chang and Yeh-Hao Chin, 1994. A Practical Tagger for Chinese Corpora. Proceedings of ROCLING VII, pp.111-126.
Huang, Chu-Ren, 1994. Corpus-based Studies of Mandarin Chinese: Foundational Issues and Preliminary Results. In Matthew Chen and Ovid Tzeng (Eds.), In Honor of William S-Y. Wang: Interdisciplinary Studies on Language and Language Change. pp. 165-186. Taipei: Pyramid.
Huang, Chu-Ren and Keh-jiann Chen, 1992. A Chinese Corpus for Linguistics Research. In the Proceedings of the 1992 International Conference on Computational Linguistics (COLING-92). pp.1214-1217. Nantes, France.
Huang, Chu-Ren and Ruo-ping Mo, 1992. Mandarin Ditransitive Constructions and the Category of Gei. In the Proceedings of the Berkeley Linguistics Society Annual Meeting (BLS 18), pp. 109-122. Berkeley: BLS.
Chen, Keh-jiann and Shing-huan Liu, 1992. Word Identification for Mandarin Chinese Sentences. Proceedings of COLING'92, pp. 54-59.
Huang, Chu-Ren, 1987. Mandarin Chinese NP de: A Comparative Study of Current Grammatical Theories. Special Publications No.93 of the Institute of History & Philology, Academia Sinica, Taipei.