Predicting House Prices
This notebook is part of the CaTabRa GitHub repository.
This short example demonstrates how to create a model for predicting house prices with CaTabRa.
Familiarity with CaTabRa’s main data analysis workflow is assumed. A step-by-step introduction can be found in CaTabRa Workflow.
Prerequisites
[1]:
from catabra.util import io
[2]:
# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'house_sales'
Prepare Data
[3]:
# load dataset
from sklearn.datasets import fetch_openml
X, y = fetch_openml(data_id=44066, return_X_y=True, as_frame=True)
[4]:
# add target labels to DataFrame
X['price'] = y
[5]:
# split into training and test sets by adding a Boolean column
# the column name is arbitrary; CaTabRa tries to "guess" which samples belong to which set from the column name and its values
X['train'] = X['date_year'] == '0' # temporal split
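Before handing the table to CaTabRa, it can help to confirm that the split column separates the samples as intended. The following is a minimal sketch on a hypothetical toy table. The values and the helper name `split_sizes` are made up for illustration and are not part of CaTabRa. It mirrors the comparison above under the assumption that `date_year` holds the strings '0' and '1'.

```python
import pandas as pd

# Hypothetical toy table; only the column names mirror the house-sales data.
toy = pd.DataFrame({
    'sqft_living': [1180.0, 2570.0, 770.0, 1960.0, 1680.0],
    'date_year': ['0', '0', '1', '0', '1'],
})

# Same idea as above: samples from year '0' form the training set.
toy['train'] = toy['date_year'] == '0'


def split_sizes(df: pd.DataFrame, column: str) -> dict:
    """Count how many rows fall on each side of a Boolean split column."""
    counts = df[column].value_counts()
    return {bool(key): int(n) for key, n in counts.items()}


sizes = split_sizes(toy, 'train')
print(sizes)
```

If one of the two counts is missing or zero, the comparison probably does not match the column's data type (for example, integers instead of strings).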
[6]:
X.head()
[6]:
| | bedrooms | bathrooms | sqft_living | sqft_lot | waterfront | grade | sqft_above | sqft_basement | yr_built | yr_renovated | lat | long | sqft_living15 | sqft_lot15 | date_year | date_month | date_day | price | train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 1.00 | 1180.0 | 5650.0 | 0 | 7.0 | 1180.0 | 0.0 | 1955.0 | 0.0 | 47.5112 | -122.257 | 1340.0 | 5650.0 | 0 | 10.0 | 13.0 | 12.309987 | True |
| 1 | 3.0 | 2.25 | 2570.0 | 7242.0 | 0 | 7.0 | 2170.0 | 400.0 | 1951.0 | 1991.0 | 47.7210 | -122.319 | 1690.0 | 7639.0 | 0 | 12.0 | 9.0 | 13.195616 | True |
| 2 | 2.0 | 1.00 | 770.0 | 10000.0 | 0 | 6.0 | 770.0 | 0.0 | 1933.0 | 0.0 | 47.7379 | -122.233 | 2720.0 | 8062.0 | 1 | 2.0 | 25.0 | 12.100718 | False |
| 3 | 4.0 | 3.00 | 1960.0 | 5000.0 | 0 | 7.0 | 1050.0 | 910.0 | 1965.0 | 0.0 | 47.5208 | -122.393 | 1360.0 | 5000.0 | 0 | 12.0 | 9.0 | 13.311331 | True |
| 4 | 3.0 | 2.00 | 1680.0 | 8080.0 | 0 | 8.0 | 1680.0 | 0.0 | 1987.0 | 0.0 | 47.6168 | -122.045 | 1800.0 | 7503.0 | 1 | 2.0 | 18.0 | 13.142168 | False |
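The values in the price column (roughly 12 to 13) are far too small to be dollar amounts; they look like natural logarithms of the sale prices. This is an assumption based on the values shown, not something the notebook states. Under that assumption, a minimal sketch for converting a target value back to the original scale is:

```python
import math

# Assumption: 'price' stores the natural logarithm of the sale price.
log_price = 12.309987  # first row of the table above

# Undo the logarithm to obtain a price on the original scale.
price = math.exp(log_price)
print(round(price))
```

If the assumption holds, absolute errors on the logarithmic scale correspond to multiplicative errors on the original scale.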
Analyze Data and Train Model
[7]:
from catabra.analysis import analyze
analyze(
    X,                  # table to analyze; can also be the path to a CSV/Excel/HDF5 file
    regress='price',    # name of column containing regression target
    split='train',      # name of column containing information about the train-test split (optional)
    time=3,             # time budget for hyperparameter tuning, in minutes (optional)
    jobs=2,             # number of parallel jobs
    out=output_dir
)
[CaTabRa] ### Analysis started at 2023-04-19 14:53:07.415165
[CaTabRa warning] 2 columns appear to contain IDs, but are used as features: 'sqft_lot', 'sqft_lot15'
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend auto-sklearn for regression
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 1.0 (regression not supported by 2.0).
/home/amaletzk/miniconda3/envs/catabra/lib/python3.9/site-packages/autosklearn/metalearning/metalearning/meta_base.py:68: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
self.metafeatures = self.metafeatures.append(metafeatures)
/home/amaletzk/miniconda3/envs/catabra/lib/python3.9/site-packages/autosklearn/metalearning/metalearning/meta_base.py:72: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
self.algorithm_runs[metric].append(runs)
[CaTabRa] New model #1 trained:
val_r2: 0.891634
val_mean_absolute_error: 0.123526
val_mean_squared_error: 0.030498
train_r2: 0.984437
type: random_forest
total_elapsed_time: 00:30
[CaTabRa] New model #2 trained:
val_r2: 0.904801
val_mean_absolute_error: 0.116970
val_mean_squared_error: 0.026792
train_r2: 0.976546
type: gradient_boosting
total_elapsed_time: 00:35
[CaTabRa] New model #3 trained:
val_r2: 0.902060
val_mean_absolute_error: 0.119535
val_mean_squared_error: 0.027564
train_r2: 0.981697
type: gradient_boosting
total_elapsed_time: 00:37
[CaTabRa] New model #4 trained:
val_r2: 0.897691
val_mean_absolute_error: 0.122485
val_mean_squared_error: 0.028793
train_r2: 0.994704
type: gradient_boosting
total_elapsed_time: 00:48
[CaTabRa] New model #5 trained:
val_r2: 0.768186
val_mean_absolute_error: 0.187598
val_mean_squared_error: 0.065241
train_r2: 1.000000
type: k_nearest_neighbors
total_elapsed_time: 00:55
[CaTabRa] New model #6 trained:
val_r2: 0.900880
val_mean_absolute_error: 0.119753
val_mean_squared_error: 0.027896
train_r2: 0.996603
type: gradient_boosting
total_elapsed_time: 01:18
[CaTabRa] New model #7 trained:
val_r2: 0.825280
val_mean_absolute_error: 0.163811
val_mean_squared_error: 0.049173
train_r2: 0.840148
type: mlp
total_elapsed_time: 02:04
[CaTabRa] New model #8 trained:
val_r2: 0.879164
val_mean_absolute_error: 0.131235
val_mean_squared_error: 0.034008
train_r2: 0.939784
type: adaboost
total_elapsed_time: 02:04
[CaTabRa] New model #9 trained:
val_r2: 0.885336
val_mean_absolute_error: 0.130923
val_mean_squared_error: 0.032270
train_r2: 0.889781
type: gradient_boosting
total_elapsed_time: 02:06
[CaTabRa] New model #10 trained:
val_r2: 0.729445
val_mean_absolute_error: 0.215322
val_mean_squared_error: 0.076144
train_r2: 0.724252
type: ard_regression
total_elapsed_time: 02:07
[CaTabRa] New model #11 trained:
val_r2: -0.000007
val_mean_absolute_error: 0.415429
val_mean_squared_error: 0.281438
train_r2: -0.000000
type: ard_regression
total_elapsed_time: 02:08
[CaTabRa] New model #12 trained:
val_r2: 0.607176
val_mean_absolute_error: 0.238048
val_mean_squared_error: 0.110555
train_r2: 1.000000
type: k_nearest_neighbors
total_elapsed_time: 02:11
[CaTabRa] Final training statistics:
n_models_trained: 12
ensemble_val_r2: 0.9081625850582595
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-04-19 14:56:55.414768
[CaTabRa] ### Elapsed time: 0 days 00:03:47.999603
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales
[CaTabRa] ### Evaluation started at 2023-04-19 14:56:55.443873
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
r2: 0.9366044548544811
mean_absolute_error: 0.10082619789893818
mean_squared_error: 0.017566229157146635
[CaTabRa] Evaluation results for not_train:
r2: 0.845543453616034
mean_absolute_error: 0.1517555994192618
mean_squared_error: 0.04293557006018874
[CaTabRa] ### Evaluation finished at 2023-04-19 14:57:01.911073
[CaTabRa] ### Elapsed time: 0 days 00:00:06.467200
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales/eval
Evaluate Model
The model was automatically evaluated after training, because we specified a train-test split. We can inspect the results:
[8]:
metrics = io.read_df(output_dir + '/eval/not_train/metrics.xlsx')
[9]:
metrics
[9]:
| | Unnamed: 0 | n | r2 | mean_absolute_error | mean_squared_error | root_mean_squared_error | mean_squared_log_error | median_absolute_error | mean_absolute_percentage_error | max_error | explained_variance | mean_poisson_deviance | mean_gamma_deviance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | price | 6980 | 0.845543 | 0.151756 | 0.042936 | 0.207209 | 0.000215 | 0.110974 | 0.011558 | 1.19069 | 0.857237 | 0.003273 | 0.00025 |
| 1 | __overall__ | 6980 | 0.845543 | 0.151756 | 0.042936 | 0.207209 | 0.000215 | 0.110974 | 0.011558 | 1.19069 | 0.857237 | 0.003273 | 0.00025 |
Also check out eval/not_train/static_plots/price.pdf in the output directory, which shows a scatter plot of ground-truth vs. predicted house prices.
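To make the reported numbers more concrete, here is a minimal sketch of how the headline metrics are commonly defined, written in plain Python on made-up values (not the notebook's data). The function name `regression_metrics` is hypothetical; CaTabRa computes its metrics internally and may differ in details.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MSE and RMSE for two equally long sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # R^2 = 1 - residual sum of squares / total sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot
    return {
        'r2': r2,
        'mean_absolute_error': mae,
        'mean_squared_error': mse,
        'root_mean_squared_error': math.sqrt(mse),
    }


# Made-up ground truth and predictions, for illustration only.
m = regression_metrics([12.0, 13.0, 14.0, 13.0], [12.5, 13.0, 13.5, 13.0])
print(m)
```

The relation between the last two metrics can be checked against the table: the square root of the reported mean_squared_error (0.042936) is about 0.2072, which matches root_mean_squared_error.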
Explain Model
[10]:
from catabra.explanation import explain
explain(
    X,
    folder=output_dir,              # directory containing trained model (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain',
    explainer='permutation'         # can be omitted to use SHAP instead, but SHAP takes very long in this case
)
[CaTabRa] ### Explanation started at 2023-04-19 14:57:20.962098
[CaTabRa] *** Split train
Features: 100%|########################################| 17/17 [00:33<00:00, 1.94s/it]
[CaTabRa] *** Split not_train
Features: 100%|########################################| 17/17 [00:13<00:00, 1.28it/s]
[CaTabRa] ### Explanation finished at 2023-04-19 14:58:23.904994
[CaTabRa] ### Elapsed time: 0 days 00:01:02.942896
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales/explain
[11]:
importance = io.read_df(output_dir + '/explain/not_train/__ensemble__.h5')
[12]:
importance.sort_values('r2', ascending=False)
[12]:
| | r2 | mean_absolute_error | mean_squared_error | r2 std | mean_absolute_error std | mean_squared_error std |
|---|---|---|---|---|---|---|
| lat | 0.550714 | 0.199182 | 0.153086 | 0.005703 | 0.001403 | 0.001585 |
| sqft_living | 0.201067 | 0.082459 | 0.055892 | 0.003280 | 0.001340 | 0.000912 |
| grade | 0.150816 | 0.060628 | 0.041924 | 0.004245 | 0.001382 | 0.001180 |
| long | 0.075237 | 0.034909 | 0.020914 | 0.000940 | 0.000536 | 0.000261 |
| sqft_living15 | 0.025162 | 0.011922 | 0.006994 | 0.001850 | 0.000653 | 0.000514 |
| sqft_lot | 0.020695 | 0.012332 | 0.005753 | 0.001152 | 0.000607 | 0.000320 |
| waterfront | 0.013390 | 0.003592 | 0.003722 | 0.000506 | 0.000205 | 0.000141 |
| bathrooms | 0.010384 | 0.005163 | 0.002887 | 0.000498 | 0.000217 | 0.000139 |
| yr_built | 0.008597 | 0.004961 | 0.002390 | 0.000567 | 0.000129 | 0.000158 |
| sqft_lot15 | 0.005792 | 0.003533 | 0.001610 | 0.000316 | 0.000200 | 0.000088 |
| sqft_above | 0.004771 | 0.002596 | 0.001326 | 0.000181 | 0.000122 | 0.000050 |
| yr_renovated | 0.000813 | 0.000337 | 0.000226 | 0.000155 | 0.000082 | 0.000043 |
| sqft_basement | 0.000198 | 0.000451 | 0.000055 | 0.000073 | 0.000109 | 0.000020 |
| date_month | 0.000170 | 0.000065 | 0.000047 | 0.000053 | 0.000039 | 0.000015 |
| date_year | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| date_day | -0.000208 | -0.000136 | -0.000058 | 0.000150 | 0.000083 | 0.000042 |
| bedrooms | -0.001534 | -0.000674 | -0.000426 | 0.000127 | 0.000187 | 0.000035 |
Also check out explain/not_train/static_plots/ in the output directory for visualizations of the permutation importance.
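Permutation importance measures how much a metric degrades when the values of a single feature are shuffled, which breaks that feature's relationship with the target. The sketch below illustrates the general idea in plain Python with a fixed, hypothetical stand-in model and synthetic data. It is not CaTabRa's implementation, and the names `predict`, `r2_score`, and `permutation_importance` are chosen for illustration only.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def predict(rows):
    """Hypothetical stand-in model: uses feature 0, ignores feature 1."""
    return [3.0 * row[0] for row in rows]


def permutation_importance(rows, y, feature, n_repeats=5, seed=0):
    """Mean drop in R^2 when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = r2_score(y, predict(rows))
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:feature] + [value] + row[feature + 1:]
                    for row, value in zip(rows, column)]
        drops.append(baseline - r2_score(y, predict(shuffled)))
    return sum(drops) / n_repeats


# Synthetic data: the target depends on feature 0 only.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] + 0.1 * data_rng.gauss(0.0, 1.0) for row in rows]

importances = [permutation_importance(rows, y, f) for f in range(2)]
print(importances)
```

A feature whose shuffling leaves the predictions unchanged gets an importance of exactly zero in this sketch. That parallels date_year in the table above, presumably because the split was defined on that column, so it is constant within each split and shuffling it changes nothing.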