CSE203B: Linear Regression under Interval Truncation

Tue, 15 Mar 2022 15:52:59 +0800

In traditional linear regression, we try to recover a hidden model parameter $\vec w*$ with samples $(\vec x, y)$ of the form $y = \vec {w}^{*T} \vec x + \epsilon$, where $\epsilon$ is sampled from some noise distribution. Classical results show that $\vec w*$ can be recovered within the $\ell_2$-reconstruction error $O(\sqrt{k/n})$, where $n$ is the number of truncated/observable samples, and $k$ is the dimension of $\vec w^*$. However, this kind of classic technique does not apply to partially observable data, namely, the truncated setting. Analysis from truncated samples is one of the biggest challenge in today’s life cause truncation always happened whenever samples not in the bound are not observed. This kind of case is very common in biological search, social science, business and economics field either due to the limitation of data collecting device or inherit flaws of sampling processes. If we ignore the truncation, as shown in various experiments, the regression result shows very limited generalization property under regions where data is missing. Recently, a series of work (See) has lead to theoretically sound truncated linear regression with optimal sample complexity since the challenge was introduced in 1958. While these are all polynomial-time algorithms, their run-time is far-from being practical due to some complicated projection steps that rely heavily on the ellipsoid methods. On the other hand, these works often assume that the data truncation is arbitrary and only oracle access to the truncation set is provided while, in practice, the censoring setting, where the truncation set is convex, is more ubiquitous.

Our Contributions. In this work, we give an efficient truncated linear-regression algorithm tailored for the censoring setting. The run-time of the algorithm is in the worst case bounded by $O(n^3)$, where $n$ is the input data-size, and is on-average better when the data follows certain structured distributions, showing that truncated linear regression can be practical.

CSE251A Project 2: Coordinate Descent for Logistic Regression

Tue, 22 Feb 2022 15:52:59 +0800

In this report, we discuss our attempt to use a better strategy to pick the direction of coordinate descent at each step rather than random selection. At each iteration, we pick the gradient direction with the largest absolute value, and shows that our method outperforms the random coordinate descent in terms of convergence speed.

CSE251A Project 1: K-Means Clustering based Prototype Selection for Nearest Neighbor

Sat, 29 Jan 2022 15:52:59 +0800

In the report, we discuss our attempt to choose a better prototype for nearest neighbor other than random selection.We present a KMeans based prototype selection method that clearly outperforms the naive random selection in all our experiments.

Statistics | L.E.R Academic

CSE203B: Linear Regression under Interval Truncation

CSE251A Project 2: Coordinate Descent for Logistic Regression

CSE251A Project 1: K-Means Clustering based Prototype Selection for Nearest Neighbor