Machine Learning | L.E.R Academic

CSE256 Assignment 3: Language Modeling

Wed, 27 Apr 2022 15:52:59 +0800

In this report, we discuss various ways of building probabilistic language models, specifically N-grams. Using the given corpora from three different domains, we first evaluate the reference Unigram implementation provided in the starter code in Section 2. We then propose our Trigram approach in Section 3, with the implementation details explained in Section 3.1. In Sections 3.2 and 3.3, we show that our approach outperforms the Unigram baseline in almost every performance metric. Next, we explore adapting our language model from one corpus to another and demonstrate a significant improvement in perplexity in Section 4. Finally, we conclude our report in Section 5.

For the Content-aware Language Model experiments, we implemented a generic N-gram model with three optimizations:

Interpolation: The probability estimates from N-gram down to unigram are mixed with weights (3.1.1), and the weights λ are dynamically tuned using the EM algorithm (3.1.2).
Smoothing: Instead of using Laplace Smoothing (add-1), we added a hyper-parameter k and implemented Add-k Smoothing, with k tuned on a dev set (3.1.3).
Low-frequency cut-off: Given a parameter min_freq, we remove all rare items from the vocabulary and treat them as “UNK”.
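
The three optimizations above can be sketched together in a few functions. This is a minimal illustration rather than the report's actual code: the function names (build_vocab, count_ngrams, addk_prob, interp_prob, em_lambdas, perplexity), the "<s>" and "</s>" padding markers, and the default values of k and iters are all assumptions made for the example.

```python
import math
from collections import Counter

def build_vocab(sentences, min_freq=2):
    """Keep tokens seen at least min_freq times; everything else maps to UNK."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c >= min_freq}
    return vocab | {"UNK", "</s>"}

def pad(sentence, vocab, n):
    """Map rare tokens to UNK, add n-1 start markers and one end marker."""
    body = [tok if tok in vocab else "UNK" for tok in sentence]
    return ["<s>"] * (n - 1) + body + ["</s>"]

def count_ngrams(sentences, vocab, n=3):
    """Count (context + word) tuples and their contexts for every order 1..n."""
    gram, ctx = Counter(), Counter()
    for sent in sentences:
        toks = pad(sent, vocab, n)
        for i in range(n - 1, len(toks)):
            for order in range(1, n + 1):
                context = tuple(toks[i - order + 1:i])
                gram[context + (toks[i],)] += 1
                ctx[context] += 1
    return gram, ctx

def addk_prob(gram, ctx, context, word, k, vocab_size):
    """Add-k estimate of P(word | context)."""
    return (gram[context + (word,)] + k) / (ctx[context] + k * vocab_size)

def interp_prob(gram, ctx, history, word, lambdas, k, vocab_size):
    """Mix the unigram..N-gram estimates; lambdas[j] weights order j+1."""
    history = tuple(history)
    return sum(
        lam * addk_prob(gram, ctx, history[len(history) - j:], word, k, vocab_size)
        for j, lam in enumerate(lambdas)
    )

def em_lambdas(gram, ctx, dev, vocab, n=3, k=0.1, iters=20):
    """Tune the interpolation weights on held-out sentences with EM."""
    lambdas, size = [1.0 / n] * n, len(vocab)
    for _ in range(iters):
        expected = [0.0] * n
        for sent in dev:
            toks = pad(sent, vocab, n)
            for i in range(n - 1, len(toks)):
                hist = tuple(toks[i - n + 1:i])
                parts = [
                    lambdas[j] * addk_prob(gram, ctx, hist[len(hist) - j:], toks[i], k, size)
                    for j in range(n)
                ]
                z = sum(parts)
                # E-step: responsibility of each order for this token.
                for j in range(n):
                    expected[j] += parts[j] / z
        # M-step: renormalize the expected counts into new weights.
        total = sum(expected)
        lambdas = [e / total for e in expected]
    return lambdas

def perplexity(gram, ctx, sentences, vocab, lambdas, k):
    """Per-token perplexity under the interpolated model."""
    n, size = len(lambdas), len(vocab)
    log_sum, count = 0.0, 0
    for sent in sentences:
        toks = pad(sent, vocab, n)
        for i in range(n - 1, len(toks)):
            p = interp_prob(gram, ctx, toks[i - n + 1:i], toks[i], lambdas, k, size)
            log_sum += math.log(p)
            count += 1
    return math.exp(-log_sum / count)
```

In this sketch, k would be tuned by evaluating perplexity on a dev set for a handful of candidate values and keeping the best one.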

CSE256 Assignment 1: Text Classification

Mon, 11 Apr 2022 15:52:59 +0800

In this report, we discuss various ways of data pre-processing and feature engineering for a text classification task. We start by giving an overview of the classification task, the model used, and the given baseline implementation in Section 2. Guided by the project documentation, we then iterate on top of that version, using TF-IDF for token weighting to achieve better accuracy, as detailed in Section 3. Next, we present our various approaches for feature extraction and pre-processing, such as BPE [2] and Word2Vec [1], in Section 4. We discuss the accuracy and other performance metrics of these approaches in Section 5, and conclude the paper in Section 6.
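
As a rough illustration of the TF-IDF token weighting mentioned above, the sketch below computes L2-normalized TF-IDF vectors for tokenized documents. The function name tfidf_vectors and the particular smoothed IDF formula are assumptions for the example; the report may use a different variant.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # One common smoothed IDF variant (an assumption, not from the report).
    idf = {tok: math.log((1 + n_docs) / (1 + c)) + 1.0 for tok, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {tok: count * idf[tok] for tok, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({tok: v / norm for tok, v in vec.items()})
    return vectors
```

Tokens that appear in many documents receive a lower weight than tokens concentrated in a few documents, which is the effect the weighting is meant to achieve.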

CSE203B: Linear Regression under Interval Truncation

Tue, 15 Mar 2022 15:52:59 +0800

In traditional linear regression, we try to recover a hidden model parameter $\vec w^*$ from samples $(\vec x, y)$ of the form $y = \vec{w}^{*T} \vec x + \epsilon$, where $\epsilon$ is sampled from some noise distribution. Classical results show that $\vec w^*$ can be recovered within $\ell_2$-reconstruction error $O(\sqrt{k/n})$, where $n$ is the number of truncated/observable samples and $k$ is the dimension of $\vec w^*$. However, this classical technique does not apply to partially observable data, namely the truncated setting. Learning from truncated samples is a major challenge in practice, because truncation occurs whenever samples outside some bound go unobserved. Such cases are common in biological research, social science, business, and economics, due either to limitations of data-collection devices or to inherent flaws in the sampling process. If we ignore the truncation, as various experiments show, the regression result generalizes poorly in regions where data is missing. Recently, a series of works (See) has led to theoretically sound truncated linear regression with optimal sample complexity since the challenge was introduced in 1958. While these are all polynomial-time algorithms, their run-time is far from practical due to complicated projection steps that rely heavily on the ellipsoid method. Moreover, these works often assume that the data truncation is arbitrary and that only oracle access to the truncation set is provided, whereas in practice the censoring setting, where the truncation set is convex, is more common.

Our Contributions. In this work, we give an efficient truncated linear-regression algorithm tailored for the censoring setting. The run-time of the algorithm is bounded by $O(n^3)$ in the worst case, where $n$ is the input data size, and is better on average when the data follows certain structured distributions, showing that truncated linear regression can be practical.
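
The effect of ignoring truncation, described above, can be seen in a small one-dimensional simulation. This is only a toy illustration and not the algorithm proposed in this work: the function names, the true slope, the noise level, and the truncation interval [low, high] are all hypothetical choices. Samples are kept only when $y$ falls inside the interval, and ordinary least squares is then fitted to the surviving samples.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least-squares line (with intercept) through the points."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate(n=5000, w_true=2.0, low=-1.0, high=1.0, seed=0):
    """Compare OLS on all samples with OLS on samples surviving truncation."""
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [w_true * x + rng.gauss(0.0, 1.0) for x in xs]
    # Truncation: a sample is observed only if y lies in [low, high].
    kept = [(x, y) for x, y in zip(xs, ys) if low <= y <= high]
    full = ols_slope(xs, ys)
    truncated = ols_slope([x for x, _ in kept], [y for _, y in kept])
    return full, truncated

if __name__ == "__main__":
    full, truncated = simulate()
    print(f"slope on all samples:       {full:.3f}")
    print(f"slope on truncated samples: {truncated:.3f}")
```

In this setup the slope fitted on the truncated samples is pulled toward zero, because the surviving samples at large $|x|$ are exactly those whose noise happened to push $y$ back into the interval.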

CSE251A Project 2: Coordinate Descent for Logistic Regression

Tue, 22 Feb 2022 15:52:59 +0800

In this report, we discuss our attempt to use a better strategy than random selection for picking the direction of coordinate descent at each step. At each iteration, we pick the coordinate whose gradient component has the largest absolute value, and show that our method outperforms random coordinate descent in terms of convergence speed.
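
The selection rule above can be sketched as follows for logistic regression with labels in {0, 1}. This is a minimal illustration under assumptions: the fixed step size lr, the number of steps, and the function names are hypothetical, and the report may use a different update (for example, a line search along the chosen coordinate).

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, X, y):
    """Mean logistic loss, written as log(1 + exp(z)) - y * z."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = dot(w, xi)
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
    return total / len(X)

def gradient(w, X, y):
    """Gradient of the mean logistic loss with respect to w."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(dot(w, xi)) - yi
        for j, xij in enumerate(xi):
            g[j] += err * xij
    return [v / len(X) for v in g]

def coordinate_descent(X, y, steps=200, lr=0.5, greedy=True, seed=0):
    """Update one coordinate per step: largest |gradient| or a random one."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        g = gradient(w, X, y)
        if greedy:
            j = max(range(d), key=lambda i: abs(g[i]))
        else:
            j = rng.randrange(d)
        w[j] -= lr * g[j]
    return w
```

Calling coordinate_descent with greedy=False gives the random baseline, so the two strategies can be compared by tracking loss over the same number of steps.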

CSE251A Project 1: K-Means Clustering based Prototype Selection for Nearest Neighbor

Sat, 29 Jan 2022 15:52:59 +0800

In this report, we discuss our attempt to choose better prototypes for nearest neighbor classification than random selection. We present a K-Means-based prototype selection method that clearly outperforms naive random selection in all our experiments.
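
One plausible reading of K-Means-based prototype selection is sketched below: run K-Means separately on the points of each class and use the resulting centroids, tagged with that class label, as the prototypes for a 1-nearest-neighbor classifier. Clustering per class, the plain Lloyd iteration, and all function names are assumptions for illustration; the report's exact procedure may differ.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iteration; returns k centroids."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers

def select_prototypes(X, y, per_class):
    """Cluster each class separately and keep the labeled centroids."""
    prototypes = []
    for label in sorted(set(y)):
        pts = [x for x, lab in zip(X, y) if lab == label]
        for center in kmeans(pts, min(per_class, len(pts))):
            prototypes.append((center, label))
    return prototypes

def predict(prototypes, x):
    """1-nearest-neighbor prediction against the selected prototypes."""
    return min(prototypes, key=lambda p: sq_dist(x, p[0]))[1]
```

Note that centroids are synthetic points rather than members of the training set; if prototypes must be actual training samples, each centroid could be replaced by its nearest training point of the same class.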