CSE256 Assignment 3: Language Modeling

Wed, 27 Apr 2022 15:52:59 +0800

In this report, we discuss ways of building probabilistic language models, specifically N-gram models. Using the given corpora from three different domains, we first evaluate the reference Unigram implementation provided in the starter code in Section 2. We then propose our Trigram approach in Section 3, with the implementation details explained in Section 3.1. In Sections 3.2 and 3.3, we show that our approach outperforms the Unigram baseline in almost every performance metric. Next, we explore adapting our language model from one corpus to another and demonstrate a significant improvement in perplexity in Section 4. Finally, we conclude the report in Section 5.

For the Content-aware Language Model experiments, we implemented a generic N-gram model with the following optimizations:

Interpolation: The probability estimates from the N-gram down to the unigram are mixed with weights λ (3.1.1), and the weights are tuned dynamically using the EM algorithm (3.1.2).
Smoothing: Instead of Laplace (add-1) smoothing, we added a hyper-parameter k and implemented add-k smoothing, with k tuned on a dev set (3.1.3).
Low-frequency cut-off: Given a parameter min_freq, we remove all rare items from the vocabulary and treat them as “UNK”.
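
The three optimizations above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the report: the function names, the `<s>`/`</s>` padding symbols, and the rule that a word is rare when it occurs fewer than min_freq times are all hypothetical. The weights λ are fixed by hand here instead of being tuned with EM.

```python
from collections import Counter

UNK, BOS, EOS = "UNK", "<s>", "</s>"  # padding symbols are an assumption

def build_vocab(sentences, min_freq):
    # Assumption: a word is "rare" if it occurs fewer than min_freq times.
    counts = Counter(w for s in sentences for w in s)
    return {w for w, c in counts.items() if c >= min_freq} | {UNK}

def map_rare(sentence, vocab):
    # Replace every out-of-vocabulary word with the UNK token.
    return [w if w in vocab else UNK for w in sentence]

def count_ngrams(sentences, n):
    # Count every gram of order 1..n together with its context (the gram
    # minus its last word). The empty context () holds the token total.
    grams, contexts = Counter(), Counter()
    for s in sentences:
        padded = [BOS] * (n - 1) + s + [EOS]
        for i in range(n - 1, len(padded)):
            for order in range(1, n + 1):
                gram = tuple(padded[i - order + 1 : i + 1])
                grams[gram] += 1
                contexts[gram[:-1]] += 1
    return grams, contexts

def addk_prob(gram, grams, contexts, k, vocab_size):
    # Add-k smoothing: (count + k) / (context count + k * |V|).
    return (grams[gram] + k) / (contexts[gram[:-1]] + k * vocab_size)

def interpolated_prob(word, context, lambdas, grams, contexts, k, vocab_size):
    # lambdas[j] weights the estimate that conditions on the last j words,
    # so lambdas[0] is the unigram weight. The weights should sum to 1.
    prob = 0.0
    for j, lam in enumerate(lambdas):
        ctx = tuple(context[len(context) - j:])
        prob += lam * addk_prob(ctx + (word,), grams, contexts, k, vocab_size)
    return prob

# Toy usage with hand-picked (not EM-tuned) weights.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = build_vocab(corpus, min_freq=2)
train = [map_rare(s, vocab) for s in corpus]
grams, contexts = count_ngrams(train, n=3)
V = len(vocab) + 1  # predicted events: vocabulary words plus EOS
p = interpolated_prob("sat", ("the", "cat"), (0.2, 0.3, 0.5),
                      grams, contexts, k=0.1, vocab_size=V)
```

Because each add-k estimate is a proper distribution over the vocabulary plus the end-of-sentence symbol, any mixture whose weights sum to 1 is also a proper distribution, which is what makes perplexity comparisons between settings of k and λ meaningful.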

CSE256 Assignment 1: Text Classification

Mon, 11 Apr 2022 15:52:59 +0800

In this report, we discuss various approaches to data pre-processing and feature engineering for a text classification task. We start with an overview of the classification task, the model used, and the given baseline implementation in Section 2. We then iterate on that version, guided by the project documentation, using TF-IDF for token weighting to achieve better accuracy, as detailed in Section 3. Next, we present our approaches to feature extraction and pre-processing, such as BPE [2] and Word2Vec [1], in Section 4. We discuss the accuracy and other performance metrics of these approaches in Section 5 and conclude the paper in Section 6.
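
As a rough illustration of the TF-IDF token weighting mentioned above, the sketch below computes one common textbook variant (relative term frequency times the log of inverse document frequency). The exact variant, smoothing, and normalization used in the report are not stated here, so the formula and the function name `tfidf` are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of tokenized documents (each a list of strings).
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return weights

docs = [["good", "movie"], ["bad", "movie"]]
w = tfidf(docs)
```

A token that appears in every document, such as "movie" in this toy input, gets weight zero, while tokens that distinguish documents keep a positive weight.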

NLP | L.E.R Academic
