Lim, Kong Hua (2024) Text Segmentation in Code-Mixed Chinese-English Social Media Messages Using Conditional Random Fields. Doctoral thesis, Tunku Abdul Rahman University of Management and Technology.
Abstract
This research investigates the impact of associative, contextual, and statistical features on the segmentation of Chinese-English code-mixed messages by exploring three goodness measures, Pointwise Mutual Information (PMI), Accessor Variety (AV) and Good-Turing (GT), as features of a Conditional Random Field (CRF) statistical model. These techniques are studied and evaluated on text segments from 2 million messages crawled from e-commerce platforms, social networking platforms and online news platforms. The research identifies four problems in Chinese-English code-mixed text segments: (1) English Roman alphabet and proper phrase, (2) alphanumeric and token trigram, (3) proper tokens including repeating tokens, and (4) proper token and proper token. It designs a PMI feature (association measure), a GT feature (smoothing), an AV feature (association measure) and an AV-PMI feature to address each of the four problems, respectively. All four techniques achieve better overall results than the base CRF approach. In terms of F1 score, the proposed AV-PMI solution improves by about 0.24%, 0.09%, 0.15% and 0.44% over GT, AV, PMI and the base CRF, respectively. In conclusion, the goodness measures provided an exponential output score that improved prediction.
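As an illustration of one goodness measure named in the abstract, the following is a minimal Python sketch of PMI over adjacent token pairs. It uses the standard textbook definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), and is not taken from the thesis: the function name `pmi_table`, the toy messages, and the pre-tokenised input are all assumptions made for the example. How the thesis actually turns PMI into a CRF feature is not stated in this abstract.

```python
import math
from collections import Counter


def pmi_table(messages):
    """Return PMI (base 2) for every adjacent token pair in `messages`.

    `messages` is a list of token lists. This is a hypothetical helper
    using the standard PMI definition, not the thesis implementation.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in messages:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the pair
        p_x = unigrams[x] / n_uni    # relative frequency of each token
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy code-mixed messages, invented for illustration only.
messages = [
    ["我", "要", "买", "iphone"],
    ["我", "要", "去", "kl"],
    ["买", "iphone", "case"],
]
scores = pmi_table(messages)
```

In this toy data the pair ("我", "要") occurs twice and scores higher than ("要", "买"), which occurs once. A higher score suggests the two tokens are more strongly associated, which is the kind of signal a segmenter can use when deciding whether a boundary falls between them.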
Item Type: | Thesis / Dissertation (Doctoral) |
---|---|
Subjects: | Science > Computer Science |
Faculties: | Faculty of Computing and Information Technology > Doctor of Philosophy (Computer Science) |
Depositing User: | Library Editor |
Date Deposited: | 31 Dec 2024 08:12 |
Last Modified: | 31 Dec 2024 08:13 |
URI: | https://eprints.tarc.edu.my/id/eprint/31434 |