Projects

Mapping from Written Chinese to Cantonese

Proposal in Support of Continued Affiliation with the Fairbank Center of East Asian Studies, Harvard University, 2001-02

Grace Wiersma, MIT Libraries

One concrete result of my work as an independent researcher during the last two years has been completion of a survey article on the Bai language, "Yunnan Bai in Brief," to appear in The Sino-Tibetan Languages, edited by Graham Thurgood and Randy LaPolla (Curzon Language Family Series). For fun, I have recently resumed work on a translation from Chinese to English of a contemporary historical novel. The novel follows a young Hong Kong boy from his childhood under Japanese occupation through young adulthood, conveying his experiences and sensibility, and finally allows him a mature and ironic glimpse of the transition from colonial rule to Chinese sovereignty. I might also mention that within the past year I contributed substantially to a grant proposal submitted to the National Science Foundation and the National Endowment for the Humanities, for funding in support of a Bai reference grammar, to be completed jointly with Dr. Fu Jingqi of St. Mary's College of Maryland. Our proposal was not funded for the current cycle, but resubmission in the near future is under discussion.

I plan to pursue three projects during the current year. Two of these are designed to present samples of my research data on Yunnan Bai and written Cantonese on the Web, and to address problems of data representation related to their delivery as HTML documents. The two Web projects, and one more traditional study related to my current pursuit of a qualification in library and information science, are briefly described below. The snapshot here illustrates a sample of the hand-marked data that forms the basis of the project on alternate views of written Chinese text (II).


I Pilot Online Corpus of Yunnan Bai

Beginning in the late 19th century and throughout the 20th, the genetic affiliation of Yunnan Bai has been the subject of continuing scholarly speculation, and the ongoing controversy has given rise to a considerable literature. It is interesting that while modern linguists in China consolidated their view of Bai as belonging to the Tibeto-Burman language family either as an isolate or as a member of the Yi (Loloish) group, Bai-speaking literati of the early to mid-20th century relied on their knowledge of Chinese philology to uncover archaic Chinese etymologies for the basic terms of their language. Western scholars discovered such etymologies independently only within the last two decades, and the situation at present is that scholarly opinion within China, as well as in the wider international community, is sharply divided as to whether Bai is historically closer to Sinitic or to Tibeto-Burman.

The pilot online corpus of Yunnan Bai is planned as a Web-accessible sample of Jianchuan Bai text data with interlinear analysis displayed on the top page, enhanced by links to a relational mini-glossary that can be called by clicking on structured parts of the main text display. One technological problem I hope to address in implementing this project is how to display special characters from the International Phonetic Alphabet on the Web. This is an important issue for research on endangered languages for which only scant information is yet available. The online Oxford English Dictionary is one illustration of successful display of phonetic character fonts on the Web, and I plan to investigate the OED's success as a preliminary step.
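One common way to approach the display problem, offered here only as a minimal sketch and not as the method this project will adopt, is to write each IPA symbol as a numeric character reference, so that the page source stays pure ASCII whatever encoding the page declares. The function names, the sample tokens, and the glossary ids below are hypothetical placeholders; they are not actual Jianchuan Bai data.

```python
from html import escape, unescape


def to_char_refs(text):
    """Escape HTML metacharacters, then replace every non-ASCII
    character with a decimal numeric character reference."""
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


def interlinear_row(tokens):
    """Render one table row of an interlinear display.

    tokens is a list of (form, gloss, glossary_id) tuples. Each form
    links to an anchor named glossary_id, standing in for an entry in
    the mini-glossary.
    """
    cells = []
    for form, gloss, glossary_id in tokens:
        cells.append(
            '<td><a href="#{0}">{1}</a><br>{2}</td>'.format(
                escape(glossary_id), to_char_refs(form), to_char_refs(gloss)
            )
        )
    return "<tr>" + "".join(cells) + "</tr>"


# Placeholder tokens only: arbitrary IPA symbols with dummy glosses.
row = interlinear_row([("ŋɔ", "GLOSS-1", "entry-1"), ("ʔa", "GLOSS-2", "entry-2")])
print(row)
```

Character references solve only the encoding half of the problem: the reader's browser still needs a font that contains the IPA glyphs.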


II Alternate Views of Written Chinese on the Web

As part of my study of the interface between written standard Chinese and public discourse conducted in the spoken Cantonese vernacular, I collected two corpora. The first consists of tape-recorded floor debates of the Hong Kong Legislative Council during its final sessions in 1996-97. The second is a separate corpus of Chinese newspaper readings from the Hong Kong broadcast media, where skilled "animators" adapted a written Chinese text "on the fly" for oral delivery to a Cantonese-speaking audience. Data for the second corpus were captured by marking up the newspaper text with written Cantonese edits, as documented by the recorded broadcast. I presented a preliminary analysis of this material in December 1998 to the Annual Research Forum of the Hong Kong Linguistic Society. The perspective I sought to convey was that such performances are an important tool of what I termed "Cantonese literacy."

The display of alternate views of a written Chinese text on the Web is planned as an experimental illustration of the above corpus, to be viewed at first as a graphic image (a webmap) with links from particular image quadrants to alternate displays of the same text segment with written Cantonese replacements. A technological issue to be addressed by this project is the display of a traditional Chinese font ("standard Chinese" characters) along with selected common characters from the local Hong Kong character set, within the same HTML document. The Unicode initiatives currently underway seem unlikely to superannuate all legacy encodings for CJK data, because of the need to preserve access to databases and other structured resources that have been built prior to Unicode implementation. However, recent progress in application of the Unicode standard, and its expansion to include a superset of Chinese characters including the common representations of spoken Cantonese forms, does offer the hope that I may be able to carry this experiment forward from a graphics display to a display of alternate linked texts, using a Unicode font for both standard and dialect representation.
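The encoding question raised above can be explored with a short script. The following is a sketch under stated assumptions, not a description of the project's actual tooling: it uses the `big5`, `big5hkscs`, and `utf-8` codecs that ship with Python, and the helper names and the sample string are hypothetical. The sample pairs two standard characters with 嘅 and 咗, two characters commonly used in written Cantonese; the script reports what each codec can represent without presuming the answer.

```python
from typing import List


def unencodable(text: str, codec: str) -> List[str]:
    """Return the distinct characters of text that codec cannot
    represent, in order of first appearance."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def legacy_to_utf8(data: bytes, legacy_codec: str = "big5hkscs") -> bytes:
    """Transcode bytes stored in a legacy encoding to UTF-8."""
    return data.decode(legacy_codec).encode("utf-8")


# Hypothetical sample: standard characters plus two Cantonese ones.
SAMPLE = "中文 嘅 咗"
for codec in ("big5", "big5hkscs", "utf-8"):
    print(codec, unencodable(SAMPLE, codec))
```

An empty list for a given codec means the whole sample can be stored in that encoding. A legacy database can then be read with its original codec and served as UTF-8 through `legacy_to_utf8`, which leaves the stored data untouched.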


III Survey of Split Collections and Local Classification Schemes for Chinese Library Materials

In many distinguished libraries that include East Asian collections, local classification systems have traditionally been used to catalog Chinese and other East Asian language materials. In effect, this partitions these collections from the rest of the library, where non-CJK materials have been cataloged using the Library of Congress (LC) scheme or Dewey. I propose to study the situation in selected libraries in North America and East Asia, using bibliographic reference works available in print as well as selected bibliographic records drawn from the Web-accessible online public access catalogs of the target libraries, in order to assess the current state of split collections worldwide. As an extension of this, I will study possible correspondences across schemes within selected classes. This means investigating how a local classification scheme has been mapped to either LC or Dewey, where this has occurred, and exploring the feasibility of developing an original crosswalk and bilingual thesaurus for selected classes.

I plan to make use of existing materials on LC-to-Dewey conversion to establish an understanding of how to proceed, and to contact librarians who have directed relevant conversion projects for advice. The Harvard-Yenching Library's collection provides a useful point of reference for this project, because its Chinese (and other CJK) materials have historically been split between those classed in a local scheme and more recent acquisitions classed according to LC. The goals of this study are to identify the characteristics of various local schemes, to assess the legacy of split collections due to local classification, and to master the theory and technology of designing a conversion from local to LC classification.
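To make the idea of a crosswalk concrete, the sketch below models one as a table of ranges in a numeric local scheme, each mapped to a class in a target scheme, with a lookup that also reports when a number falls outside every range. This is an illustration under stated assumptions only: the `Rule` type, the `crosswalk` function, and every class number and target label are invented placeholders, not values from the Harvard-Yenching scheme, LC, or Dewey.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    start: int   # first local class number covered (inclusive)
    end: int     # last local class number covered (inclusive)
    target: str  # class label in the target scheme


# Placeholder rules only; not taken from any real classification scheme.
RULES = sorted([
    Rule(100, 199, "TARGET-A"),
    Rule(200, 249, "TARGET-B"),
    Rule(300, 399, "TARGET-C"),
])
STARTS = [rule.start for rule in RULES]


def crosswalk(local_number: int) -> Optional[str]:
    """Return the target class for a local class number, or None if
    no rule covers it (a gap that would need manual review)."""
    i = bisect_right(STARTS, local_number) - 1
    if i >= 0 and RULES[i].start <= local_number <= RULES[i].end:
        return RULES[i].target
    return None


for number in (150, 249, 250):
    print(number, crosswalk(number))
```

Returning `None` for uncovered numbers keeps gaps in the mapping visible instead of silently assigning them to a neighboring class.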



---

http://web.simmons.edu/~wiersma/projects.html
Last revised Dec. 6, 2001

Contact gwiersma@eclectronics.net
Text and images herein contained are solely owned by the author, with the exception of bibliographic screen captures subject to the terms of the ADEC Fair Use Guidelines for Educational Multimedia (permission pending)