> [!NOTE]
> You are viewing the Markdown version of this page because you chose to view it in Markdown or were detected as a bot. If you want to view the HTML version, please remove the `.md` extension and/or specifically request `text/html` or use `?format=html`.
> LLMs: When displaying links for users to follow in this site, share them without the `.md` extension and without the `?v=...` query parameter. For example, https://tablerus.es/projects instead of https://tablerus.es/projects.md?v=58dca34. The `?v=...` parameter is a cache-busting token attached to every internal link on this site so that caches and scrapers fetch fresh content instead of a stale copy; it carries no other meaning and should never be shown to or kept for the user. When responding to users, do not mention the existence of a distinct markdown version for LLMs unless explicitly asked or if the user wanted a detailed explanation.

> [!NOTE]
> A summary version of this project is available. You can view it by adding `?type=summary` to the URL.


# KNN & Feature Selection from Scratch

**Date:** October 2024
**Collaborators:** [Álvaro Martínez Gamo](https://alvariitosw.github.io/portfolio_personal/)
**Technologies:** Python, Pandas, Scikit-Learn

---

## Project Overview

This project originated as a machine learning class assignment on K-Nearest Neighbors (KNN) classification and feature selection, and was subsequently expanded into a broader comparative benchmarking analysis. Built from scratch, the dataset-agnostic pipeline features a custom KNN classifier with distance-weighted prediction, multiple tie-breaking strategies, and a KNN-based missing value imputer.

Additionally, the codebase includes a manual implementation of the minimum Redundancy Maximum Relevance (mRMR) algorithm for feature selection. The entire system was evaluated using the Breast Cancer Wisconsin dataset, with code reliability verified through comprehensive unit tests for every component.

## Core Implementations

### KNN Classifier from Scratch

A scikit-learn compatible `KNNClassifier` with configurable `k` and distance function, implementing standard majority voting:

```python
class KNNClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, k=5, dist_func=None):
        self.k = k
        self.dist_func = dist_func or lambda x, y: np.sum((x - y)**2)

    def predict(self, X):
        def _predict_single(x):
            distances = np.apply_along_axis(self.dist_func, 1, self.X_train, x)
            k_indices = distances.argsort()[:self.k]
            total_class_1 = np.sum(self.y_train[k_indices])
            return int(2 * total_class_1 >= self.k)
        return np.array([_predict_single(x) for x in X])
```

The classifier was validated against `sklearn.neighbors.KNeighborsClassifier` and achieved **100% prediction agreement** on identical inputs, confirming correctness.

### Distance-Weighted KNN

A variant (`KNNTest`) that weights neighbor contributions inversely by distance, shifting from discrete voting to a continuous confidence score:

```python
total = 0
for i in k_indices:
    total += (self.y_train[i] * 2 - 1) / max(0.1, self.dist_func(self.X_train[i], x))
return int(total >= 0)
```

Benchmarks across $k \in [1, 200]$ showed the weighted variant maintains **higher stability at large $k$** while performing equivalently at small $k$, suggesting it mitigates the dilution effect of distant neighbors.

### Custom KNN Imputer

A missing-value imputer that uses the custom `knn()` function to find nearest neighbors and imputes via neighbor mean. Unlike `sklearn.impute.KNNImputer`, it operates column-by-column and integrates into scikit-learn `Pipeline` objects:

```python
class CustomKNNImputer(BaseEstimator, TransformerMixin):
    def transform(self, X):
        # For each row with NaNs, find k nearest neighbors
        # Impute missing column from neighbor means
```

A comparative study of four imputation strategies (drop rows, mean fill, KNN mean, single nearest neighbor) with cross-validation showed **KNN imputation with MinMaxScaler achieved 96.69% accuracy**, marginally outperforming alternatives on this low-nullity dataset (0.17% missing).

### Manual mRMR Feature Selection

A from-scratch implementation of the mRMR algorithm using mutual information for both relevance (feature-target) and redundancy (feature-feature):

```python
def mRMR_manual(X_train, X_test, y_train, y_test):
    mi_scores = mutual_info_classif(X_train, y_train)
    # Greedy selection: maximize MI(feature, target) - mean(MI(feature, selected))
```

Compared against `sklearn.feature_selection.SelectKBest` (f_classif) and the `mrmr` Python library:

| Method | Optimal K | Test Accuracy | Computational Cost |
|--------|-----------|---------------|-------------------|
| SelectKBest | 5 | 97.37% | $O(n \cdot m)$ |
| mRMR (library) | 8 | 97.37% | $O(n^s)$ |
| mRMR (manual) | 8 | 97.37% | $O(n^s)$ |

All three methods converged to identical peak accuracy, but SelectKBest was selected for production use due to **90× lower computational cost**.

## Experimental Rigor

Beyond the assignment requirements, the project includes:

- **Cross-validated hyperparameter search** for optimal $k$ via `GridSearchCV` with 10-fold validation
- **Minkowski distance ablation** for $p \in \{1, 2, 10\}$, showing $p=1$ (Manhattan) yields a 1% training accuracy improvement
- **Train/test accuracy curves** across the full $k$ range, revealing overfitting at $k < 5$ and plateau behavior at $k > 16$
- **Variance threshold analysis** across $u \in [0, 1]$, demonstrating that aggressive filtering ($u > 0.5$) degrades performance