Light Verb Construction Identification in Code-Mixed Malay-English Social Media Sentences Using Conditional Random Fields

Tan, Kathleen Swee Neo (2024) Light Verb Construction Identification in Code-Mixed Malay-English Social Media Sentences Using Conditional Random Fields. Doctoral thesis, Tunku Abdul Rahman University of Management and Technology.

Abstract

Analysis of social media text is a rapidly growing focus area, as such text is an abundant and valuable source of data from which organisations can obtain insights for strategic and timely decision-making. One interesting and increasingly common phenomenon in social media is the use of code-mixed sentences, which combine words from multiple languages in a single sentence. Multiword expression (MWE) identification has been recognised as a key problem in the development of linguistically sound natural language processing (NLP) applications, and it is equally applicable to code-mixed social media text processing. A light verb construction (LVC) is a type of MWE that consists of a light verb and a predicative noun (e.g., make a decision, take a walk). LVC identification is important because it ensures that the components of an LVC are treated as a single semantic unit, which in turn contributes to more accurate results in downstream NLP tasks. LVC identification is, however, a complex task, as LVCs allow discontinuity between their components and a high degree of variability in the construction itself. Carrying out LVC identification on code-mixed social media sentences adds further complexity, as the sentences do not adhere to the grammatical rules of a single language. In addition, social media text presents the challenge of out-of-vocabulary (OOV) words, which may be either intentional forms of expression or unintentional misspellings. Existing studies on LVC identification have focused only on monolingual text. The aim of this research was to propose a model to identify LVCs in code-mixed Malay-English social media sentences. The main challenges faced in this work were the representation of non-linear dependencies and the handling of OOV words in code-mixed social media sentences.
As there is a lack of NLP tools for code-mixed Malay-English, this research aimed to address these challenges by identifying methods that do not rely on lemmatisers, part-of-speech (POS) taggers or dependency parsers. To this end, we propose models based on Conditional Random Fields (CRF) with unsupervised features generated using Brown clustering, k-skip n-grams and word embeddings with subword information. To train and evaluate the proposed models, data collection and annotation were conducted to create datasets comprising code-mixed Malay-English social media sentences with LVC annotations. The proposed CRFUF-EMB model is a CRF model that utilises only four unsupervised features, which are derived from word prefixes, k-skip bigrams, word embeddings and, as a novel feature, the Brown clustering bit-strings of k-skip bigrams. Results showed that CRFUF-EMB outperformed the baseline model by 50.8 percentage points in F1-score, thus providing evidence of the efficacy of the novel class-based k-skip n-grams feature in improving the representation of non-linear dependency relationships between the words in a sentence. In addition, memoization was introduced in the design of the CRF featurizer. Memoization is a technique for reducing the running time and memory consumption of algorithms by storing and reusing the results of function calls. The proposed memoized featurizer reduced the pipeline model’s memory consumption by 82.3% and its training time by 66.9%, thus improving both the space and time efficiency of the pipelined CRF model. This research provides a timely and necessary study of the LVC identification task in the challenging context of code-mixed social media sentences. As all the features of the CRFUF-EMB model are unsupervised, the model can be adapted for other sequence labelling tasks, code-mixed languages and MWE types.
In addition, the featurizer algorithms and empirical evidence presented in this thesis add to the understanding of how the CRF featurizer may be implemented using memoization in order to improve its space and time efficiency. The proposed approach may be adapted for other domains, sequence labelling tasks, MWE types, code-mixed languages and even under-resourced languages.
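
To make two of the ideas in the abstract concrete, the following is a minimal Python sketch of k-skip bigram extraction and of a memoized per-word feature function. It is not the thesis implementation: the function names (k_skip_bigrams, word_features, featurize), the prefix length of 3 and the example tokens are hypothetical, and the Brown clustering and word embedding features are omitted. A k-skip bigram is taken here to be an ordered pair of tokens with at most k tokens skipped between them, and memoization is done with the standard library's functools.lru_cache.

```python
from functools import lru_cache
from typing import Dict, List, Tuple


def k_skip_bigrams(tokens: List[str], k: int) -> List[Tuple[str, str]]:
    """Return ordered token pairs with at most k tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # j ranges over the next k + 1 positions, bounded by the sentence end.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


@lru_cache(maxsize=None)
def word_features(word: str, prefix_len: int = 3) -> Tuple[Tuple[str, str], ...]:
    """Features that depend only on the word form.

    Memoized (assumption: illustrative only), so each distinct word is
    processed once and the cached result is reused on later occurrences.
    """
    lowered = word.lower()
    return (("lower", lowered), ("prefix", lowered[:prefix_len]))


def featurize(tokens: List[str], k: int = 1) -> List[Dict[str, object]]:
    """Build one feature dictionary per token, in the style CRF toolkits accept."""
    features = []
    for i, token in enumerate(tokens):
        token_features: Dict[str, object] = dict(word_features(token))
        # Add the k-skip bigrams that start at this token.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            key = f"skip_bigram:{token.lower()}_{tokens[j].lower()}"
            token_features[key] = True
        features.append(token_features)
    return features


# Hypothetical code-mixed example: "buat ... decision" is discontinuous,
# yet with k = 1 the pair ("buat", "decision") is still captured.
sentence = ["kena", "buat", "final", "decision"]
skip_pairs = k_skip_bigrams(sentence, k=1)
sentence_features = featurize(sentence, k=1)
```

In this sketch the pair ("buat", "decision") appears among the skip bigrams even though another token sits between the two words, which is the kind of non-adjacent relationship the abstract refers to. The cache on word_features illustrates only the general principle of storing and reusing function results; the savings reported in the thesis come from its own featurizer design.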

Item Type: Thesis / Dissertation (Doctoral)
Subjects: Science > Computer Science
Faculties: Faculty of Computing and Information Technology > Doctor of Philosophy (Computer Science)
Depositing User: Library Staff
Date Deposited: 26 Aug 2024 11:03
Last Modified: 26 Aug 2024 11:03
URI: https://eprints.tarc.edu.my/id/eprint/29909