01 // Project Goals
- Explore how the real-estate market is broken down by price, square footage, and story count.
- Quantify how individual house features affect sale price.
- Predict sale price from features and location.
02 // Approach
The task came from a Kaggle competition. The goal was to rebuild an end-to-end pipeline that had placed in the top 0.3%, understanding every step rather than copying the notebook.
- EDA: heatmap of correlations against SalePrice; highest- and lowest-correlated features plotted pairwise for visual inspection.
- Missing data: imputed by averaging missing features (e.g. lot square footage) within zip code rather than a naive global mean.
- Target transform: log-transform on SalePrice to reduce skew and stabilize residuals.
- Feature transforms: Box-Cox on high-skew numeric features.
- Modeling: multiple models trained independently; final prediction is the average of their outputs (simple ensemble).
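
As a rough illustration of the imputation, transform, and averaging steps listed above, here is a minimal standard-library Python sketch. It is not the original notebook's code. The toy rows, the field names `zip` and `lot_sqft`, the helper names, the use of `log1p` (log of 1 + x) for the target, the shifted Box-Cox form, the lambda value, and the two sets of model outputs are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import fmean


def impute_by_group(rows, group_key, field):
    """Fill missing (None) values of `field` with the mean of that field
    within the row's group. Assumes at least one observed value overall."""
    by_group = defaultdict(list)
    for row in rows:
        if row[field] is not None:
            by_group[row[group_key]].append(row[field])
    # Fallback for groups where every value is missing.
    global_mean = fmean(v for vals in by_group.values() for v in vals)
    filled = []
    for row in rows:
        value = row[field]
        if value is None:
            group_vals = by_group.get(row[group_key])
            value = fmean(group_vals) if group_vals else global_mean
        filled.append({**row, field: value})
    return filled


def boxcox1p(x, lam):
    """Box-Cox transform applied to 1 + x (assumed variant, tolerates x == 0).
    Reduces to log1p(x) when lam == 0."""
    if lam == 0:
        return math.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam


def average_ensemble(predictions_per_model):
    """Equal-weight average of per-sample predictions across models."""
    return [fmean(sample) for sample in zip(*predictions_per_model)]


# Toy data, invented for this sketch.
houses = [
    {"zip": "A", "lot_sqft": 8000.0, "SalePrice": 200000.0},
    {"zip": "A", "lot_sqft": None, "SalePrice": 210000.0},
    {"zip": "A", "lot_sqft": 10000.0, "SalePrice": 250000.0},
    {"zip": "B", "lot_sqft": 4000.0, "SalePrice": 120000.0},
    {"zip": "B", "lot_sqft": None, "SalePrice": 130000.0},
]

# 1. Missing data: mean within zip code instead of a global mean.
houses = impute_by_group(houses, "zip", "lot_sqft")

# 2. Target transform: model log(1 + SalePrice) rather than the raw price.
log_target = [math.log1p(h["SalePrice"]) for h in houses]

# 3. Feature transform: Box-Cox with a hypothetical lambda of 0.5.
lot_transformed = [boxcox1p(h["lot_sqft"], 0.5) for h in houses]

# 4. Modeling: hypothetical log-space outputs of two independently trained models.
model_a_log = [12.20, 12.25, 12.40, 11.70, 11.78]
model_b_log = [12.22, 12.27, 12.44, 11.68, 11.76]
blended_log = average_ensemble([model_a_log, model_b_log])

# Undo the target transform to get prices back in the original units.
blended_price = [math.expm1(p) for p in blended_log]
```

In this sketch the averaging happens in log space and `expm1` is applied only at the end, which is one reasonable ordering rather than the only one. With pandas and SciPy available, the same steps are more commonly written with a group-wise transform and `scipy.special.boxcox1p`.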
03 // Outcome
The rebuilt pipeline scored competitively with the original. A simplified variant traded a small amount of leaderboard score for significantly less model surface area. The bigger win was internalizing the rhythm of real ML work: clean, transform, validate, ensemble, in that order, never skipping a step.