Real Estate Price Predictor · Kaggle Rebuild

Kaggle competition landing page — the source task for the rebuild

01 // Project Goals

Explore how the real-estate market is broken down by price, square footage, and story count.
Quantify how individual house features affect sale price.
Predict sale price from features and location.

Correlation heatmap — feature selection starting point

Pairplot — top-correlated features vs. sale price

Sale price — raw distribution (skewed right)

Sale price — log-transformed for modeling

02 // Approach

The task came from a Kaggle competition; the goal was to rebuild an end-to-end pipeline that had placed in the top 0.3% — understanding every step rather than copying the notebook.

EDA — heatmap of correlations against SalePrice; highest- and lowest-correlated features plotted pairwise for visual inspection.
Missing data — missing values (e.g. lot square footage) imputed with the mean of the observed values in the same zip code rather than a naive global mean.
Target transform — log-transform on SalePrice to reduce skew and stabilize residuals.
Feature transforms — Box-Cox on high-skew numeric features.
Modeling — multiple models trained independently; final prediction is the average of their outputs (simple ensemble).
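The imputation, target-transform, and averaging steps above can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not the notebook's code: the field names (`zip`, `lot_sqft`), the toy rows, the global-mean fallback, and the two stand-in prediction lists are all hypothetical. It also assumes the models were fit on the log-transformed target, so their outputs are averaged in log space before being converted back to prices.

```python
import math
from collections import defaultdict
from statistics import fmean


def impute_by_group(rows, group_key, field):
    """Fill missing `field` values with the mean of observed values in the same group.

    Falls back to the global mean when a group has no observed values.
    Returns new dicts; the input rows are left untouched.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[field] is not None:
            observed[row[group_key]].append(row[field])
    global_mean = fmean(v for vals in observed.values() for v in vals)
    group_mean = {g: fmean(vals) for g, vals in observed.items()}
    return [
        {**row, field: group_mean.get(row[group_key], global_mean)}
        if row[field] is None
        else dict(row)
        for row in rows
    ]


def to_log(prices):
    """log(1 + price): compresses the long right tail of sale prices."""
    return [math.log1p(p) for p in prices]


def average_ensemble(log_predictions):
    """Average per-house predictions across models, then undo the log transform."""
    return [math.expm1(fmean(preds)) for preds in zip(*log_predictions)]


# Hypothetical toy data: one gap inside a populated zip, one zip with no data.
rows = [
    {"zip": "50010", "lot_sqft": 8000.0},
    {"zip": "50010", "lot_sqft": None},
    {"zip": "50010", "lot_sqft": 10000.0},
    {"zip": "50014", "lot_sqft": None},
    {"zip": "50011", "lot_sqft": 12000.0},
]
filled = impute_by_group(rows, "zip", "lot_sqft")

# Stand-ins for two trained models' outputs, already in log space.
model_a = to_log([200000.0, 150000.0])
model_b = to_log([220000.0, 150000.0])
blended = average_ensemble([model_a, model_b])
```

Here the gap in zip `50010` is filled from its two neighbors in the same zip, while the row in `50014` has nothing local to draw on and takes the global mean. Averaging in log space gives a geometric-style blend, so the first blended price lands between the two model estimates. With a DataFrame, the same grouping is more commonly written with pandas `groupby(...).transform("mean")`.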

Missing data audit — imputation strategy by zip

Numeric features — stabilized after Box-Cox
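As a rough illustration of the Box-Cox step, the sketch below measures skew, applies the transform with a fixed lambda, and measures skew again. The sample values, the skew threshold, and the lambda are assumptions chosen for the example, not numbers from the project; a library routine such as `scipy.stats.boxcox` can estimate lambda from the data instead.

```python
import math


def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def boxcox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Hypothetical right-skewed feature: most homes are modest, one is very large.
living_area = [800.0, 950.0, 1100.0, 1200.0, 1400.0, 1600.0, 2500.0, 5600.0]

SKEW_THRESHOLD = 0.5  # assumed cutoff for "high-skew"
LAMBDA = 0.15         # assumed fixed lambda, not fitted to the data

skew_before = skewness(living_area)
if abs(skew_before) > SKEW_THRESHOLD:
    transformed = boxcox(living_area, LAMBDA)
else:
    transformed = list(living_area)
skew_after = skewness(transformed)
```

Comparing `skew_before` with `skew_after` shows the long right tail being pulled in. Features containing zeros would need a shift (for example, transforming `1 + x`) before this form of Box-Cox applies.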

Final scores — competitive vs. the top-0.3% notebook

03 // Outcome

The rebuilt pipeline scored competitively against the original notebook. A simplified variant gave up a small amount of leaderboard score in exchange for significantly less model surface area. The bigger win was internalizing the rhythm of real ML work: clean, transform, validate, ensemble, in that order, never skipping a step.