Predicting House Prices


This notebook is part of the CaTabRa GitHub repository.

This short example demonstrates how to create a model for predicting house prices with CaTabRa.

Familiarity with CaTabRa’s main data analysis workflow is assumed. A step-by-step introduction can be found in CaTabRa Workflow.

Prerequisites

[1]:
from catabra.util import io
[2]:
# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'house_sales'

Prepare Data

[3]:
# load dataset
from sklearn.datasets import fetch_openml
X, y = fetch_openml(data_id=44066, return_X_y=True, as_frame=True)
[4]:
# add target labels to DataFrame
X['price'] = y
[5]:
# split into train and test set by adding a column with the corresponding values
# the column name is arbitrary; CaTabRa tries to "guess" which samples belong to which set based on the column's name and values
X['train'] = X['date_year'] == '0'    # temporal split
[6]:
X.head()
[6]:
bedrooms bathrooms sqft_living sqft_lot waterfront grade sqft_above sqft_basement yr_built yr_renovated lat long sqft_living15 sqft_lot15 date_year date_month date_day price train
0 3.0 1.00 1180.0 5650.0 0 7.0 1180.0 0.0 1955.0 0.0 47.5112 -122.257 1340.0 5650.0 0 10.0 13.0 12.309987 True
1 3.0 2.25 2570.0 7242.0 0 7.0 2170.0 400.0 1951.0 1991.0 47.7210 -122.319 1690.0 7639.0 0 12.0 9.0 13.195616 True
2 2.0 1.00 770.0 10000.0 0 6.0 770.0 0.0 1933.0 0.0 47.7379 -122.233 2720.0 8062.0 1 2.0 25.0 12.100718 False
3 4.0 3.00 1960.0 5000.0 0 7.0 1050.0 910.0 1965.0 0.0 47.5208 -122.393 1360.0 5000.0 0 12.0 9.0 13.311331 True
4 3.0 2.00 1680.0 8080.0 0 8.0 1680.0 0.0 1987.0 0.0 47.6168 -122.045 1800.0 7503.0 1 2.0 18.0 13.142168 False
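
The comparison in cell [5] is against the string `'0'`, so it would silently produce an all-`False` column if `date_year` were stored as a number instead of a string category. A quick sanity check before calling `analyze()` is to count the rows on each side of the split. The sketch below runs on a small hypothetical frame (`toy` and its values are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Hypothetical miniature stand-in for X; only the columns needed for the split.
toy = pd.DataFrame({
    'date_year': pd.Categorical(['0', '0', '1', '0', '1']),
    'price': [12.31, 13.20, 12.10, 13.31, 13.14],
})

# Same temporal split as in cell [5]: rows from year '0' form the training set.
toy['train'] = toy['date_year'] == '0'

# Count both sides of the split; neither should be empty.
n_train = int(toy['train'].sum())
n_test = len(toy) - n_train
print(f'train: {n_train}, test: {n_test}')
assert n_train > 0 and n_test > 0, 'split column is degenerate'
```

Running the same two counting lines on `X['train']` instead of `toy['train']` checks the real split.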

Analyze Data and Train Model

[7]:
from catabra.analysis import analyze

analyze(
    X,                        # table to analyze; can also be the path to a CSV/Excel/HDF5 file
    regress='price',          # name of column containing regression target
    split='train',            # name of column containing information about the train-test split (optional)
    time=3,                   # time budget for hyperparameter tuning, in minutes (optional)
    jobs=2,                   # number of parallel jobs
    out=output_dir            # output directory for all generated artifacts
)
[CaTabRa] ### Analysis started at 2023-04-19 14:53:07.415165
[CaTabRa warning] 2 columns appear to contain IDs, but are used as features: 'sqft_lot', 'sqft_lot15'
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend auto-sklearn for regression
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 1.0 (regression not supported by 2.0).
/home/amaletzk/miniconda3/envs/catabra/lib/python3.9/site-packages/autosklearn/metalearning/metalearning/meta_base.py:68: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  self.metafeatures = self.metafeatures.append(metafeatures)
/home/amaletzk/miniconda3/envs/catabra/lib/python3.9/site-packages/autosklearn/metalearning/metalearning/meta_base.py:72: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  self.algorithm_runs[metric].append(runs)
[CaTabRa] New model #1 trained:
    val_r2: 0.891634
    val_mean_absolute_error: 0.123526
    val_mean_squared_error: 0.030498
    train_r2: 0.984437
    type: random_forest
    total_elapsed_time: 00:30
[CaTabRa] New model #2 trained:
    val_r2: 0.904801
    val_mean_absolute_error: 0.116970
    val_mean_squared_error: 0.026792
    train_r2: 0.976546
    type: gradient_boosting
    total_elapsed_time: 00:35
[CaTabRa] New model #3 trained:
    val_r2: 0.902060
    val_mean_absolute_error: 0.119535
    val_mean_squared_error: 0.027564
    train_r2: 0.981697
    type: gradient_boosting
    total_elapsed_time: 00:37
[CaTabRa] New model #4 trained:
    val_r2: 0.897691
    val_mean_absolute_error: 0.122485
    val_mean_squared_error: 0.028793
    train_r2: 0.994704
    type: gradient_boosting
    total_elapsed_time: 00:48
[CaTabRa] New model #5 trained:
    val_r2: 0.768186
    val_mean_absolute_error: 0.187598
    val_mean_squared_error: 0.065241
    train_r2: 1.000000
    type: k_nearest_neighbors
    total_elapsed_time: 00:55
[CaTabRa] New model #6 trained:
    val_r2: 0.900880
    val_mean_absolute_error: 0.119753
    val_mean_squared_error: 0.027896
    train_r2: 0.996603
    type: gradient_boosting
    total_elapsed_time: 01:18
[CaTabRa] New model #7 trained:
    val_r2: 0.825280
    val_mean_absolute_error: 0.163811
    val_mean_squared_error: 0.049173
    train_r2: 0.840148
    type: mlp
    total_elapsed_time: 02:04
[CaTabRa] New model #8 trained:
    val_r2: 0.879164
    val_mean_absolute_error: 0.131235
    val_mean_squared_error: 0.034008
    train_r2: 0.939784
    type: adaboost
    total_elapsed_time: 02:04
[CaTabRa] New model #9 trained:
    val_r2: 0.885336
    val_mean_absolute_error: 0.130923
    val_mean_squared_error: 0.032270
    train_r2: 0.889781
    type: gradient_boosting
    total_elapsed_time: 02:06
[CaTabRa] New model #10 trained:
    val_r2: 0.729445
    val_mean_absolute_error: 0.215322
    val_mean_squared_error: 0.076144
    train_r2: 0.724252
    type: ard_regression
    total_elapsed_time: 02:07
[CaTabRa] New model #11 trained:
    val_r2: -0.000007
    val_mean_absolute_error: 0.415429
    val_mean_squared_error: 0.281438
    train_r2: -0.000000
    type: ard_regression
    total_elapsed_time: 02:08
[CaTabRa] New model #12 trained:
    val_r2: 0.607176
    val_mean_absolute_error: 0.238048
    val_mean_squared_error: 0.110555
    train_r2: 1.000000
    type: k_nearest_neighbors
    total_elapsed_time: 02:11
[CaTabRa] Final training statistics:
    n_models_trained: 12
    ensemble_val_r2: 0.9081625850582595
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-04-19 14:56:55.414768
[CaTabRa] ### Elapsed time: 0 days 00:03:47.999603
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales
[CaTabRa] ### Evaluation started at 2023-04-19 14:56:55.443873
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
    r2: 0.9366044548544811
    mean_absolute_error: 0.10082619789893818
    mean_squared_error: 0.017566229157146635
[CaTabRa] Evaluation results for not_train:
    r2: 0.845543453616034
    mean_absolute_error: 0.1517555994192618
    mean_squared_error: 0.04293557006018874
[CaTabRa] ### Evaluation finished at 2023-04-19 14:57:01.911073
[CaTabRa] ### Elapsed time: 0 days 00:00:06.467200
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales/eval
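
The training log above can be read as a small leaderboard. As an illustrative sketch (not part of CaTabRa), the snippet below copies the logged `val_r2` and `train_r2` values into a dictionary, picks the best single model by validation score, and uses the difference between training and validation score as a rough indicator of overfitting. Treating that difference as an overfitting measure is an interpretation added here, not something the log itself reports.

```python
# model id -> (type, val_r2, train_r2), copied from the training log above
models = {
    1: ('random_forest', 0.891634, 0.984437),
    2: ('gradient_boosting', 0.904801, 0.976546),
    3: ('gradient_boosting', 0.902060, 0.981697),
    4: ('gradient_boosting', 0.897691, 0.994704),
    5: ('k_nearest_neighbors', 0.768186, 1.000000),
    6: ('gradient_boosting', 0.900880, 0.996603),
    7: ('mlp', 0.825280, 0.840148),
    8: ('adaboost', 0.879164, 0.939784),
    9: ('gradient_boosting', 0.885336, 0.889781),
    10: ('ard_regression', 0.729445, 0.724252),
    11: ('ard_regression', -0.000007, -0.000000),
    12: ('k_nearest_neighbors', 0.607176, 1.000000),
}

# Best single model according to the validation R^2
best_id = max(models, key=lambda k: models[k][1])

# Gap between training and validation R^2; a large gap hints at overfitting
gaps = {k: train - val for k, (_, val, train) in models.items()}
most_overfit_id = max(gaps, key=gaps.get)

print(f'best single model: #{best_id} ({models[best_id][0]})')
print(f'largest train/val gap: #{most_overfit_id} ({models[most_overfit_id][0]})')
```

The best single model is a gradient boosting model with a validation R² of 0.904801, slightly below the ensemble's 0.908163. The two k-nearest-neighbors models fit the training data perfectly but score much lower on validation data.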

Evaluate Model

The model was evaluated automatically after training because we specified a train-test split. The results can be inspected as follows:

[8]:
metrics = io.read_df(output_dir + '/eval/not_train/metrics.xlsx')
[9]:
metrics
[9]:
Unnamed: 0 n r2 mean_absolute_error mean_squared_error root_mean_squared_error mean_squared_log_error median_absolute_error mean_absolute_percentage_error max_error explained_variance mean_poisson_deviance mean_gamma_deviance
0 price 6980 0.845543 0.151756 0.042936 0.207209 0.000215 0.110974 0.011558 1.19069 0.857237 0.003273 0.00025
1 __overall__ 6980 0.845543 0.151756 0.042936 0.207209 0.000215 0.110974 0.011558 1.19069 0.857237 0.003273 0.00025

Also check out eval/not_train/static_plots/price.pdf in the output directory, which shows a scatter plot of ground-truth vs. predicted house prices.
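
The `price` values in the data preview lie around 12 to 13, which suggests that the target is log-transformed. The notebook does not state this, so the sketch below treats it as an assumption (natural logarithm). Under that assumption the error metrics in the table are on the log scale: an absolute error of `e` corresponds to a multiplicative factor of `exp(e)` on the original price scale. The snippet also confirms that the reported root mean squared error is the square root of the mean squared error.

```python
import math

# Metrics for the not_train split, copied from the table above
mse = 0.042936
rmse_reported = 0.207209
mae = 0.151756

# RMSE is the square root of MSE
rmse = math.sqrt(mse)

# ASSUMPTION: the target is ln(price). An absolute error e on the log scale
# then corresponds to a multiplicative factor exp(e) on the original scale.
typical_factor = math.exp(mae)

# Back-transform of the first target value shown in X.head()
first_price = math.exp(12.309987)

print(f'RMSE recomputed from MSE: {rmse:.6f}')
print(f'rough multiplicative error: {typical_factor:.3f}')
print(f'first price on original scale: {first_price:,.0f}')
```

If the assumption holds, a mean absolute error of about 0.15 means that predictions are off by a factor of roughly 1.16, i.e. about 16 percent, in typical cases. This is only a rough reading, since the exponential of a mean is not the mean of the exponentials.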

Explain Model

[10]:
from catabra.explanation import explain

explain(
    X,
    folder=output_dir,       # directory containing trained model (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain',
    explainer='permutation'  # omit to use SHAP instead; SHAP takes very long in this case
)
[CaTabRa] ### Explanation started at 2023-04-19 14:57:20.962098
[CaTabRa] *** Split train
Features: 100%|########################################| 17/17 [00:33<00:00, 1.94s/it]
[CaTabRa] *** Split not_train
Features: 100%|########################################| 17/17 [00:13<00:00, 1.28it/s]
[CaTabRa] ### Explanation finished at 2023-04-19 14:58:23.904994
[CaTabRa] ### Elapsed time: 0 days 00:01:02.942896
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales/explain
[11]:
importance = io.read_df(output_dir + '/explain/not_train/__ensemble__.h5')
[12]:
importance.sort_values('r2', ascending=False)
[12]:
                      r2  mean_absolute_error  mean_squared_error    r2 std  mean_absolute_error std  mean_squared_error std
lat             0.550714             0.199182            0.153086  0.005703                 0.001403                0.001585
sqft_living     0.201067             0.082459            0.055892  0.003280                 0.001340                0.000912
grade           0.150816             0.060628            0.041924  0.004245                 0.001382                0.001180
long            0.075237             0.034909            0.020914  0.000940                 0.000536                0.000261
sqft_living15   0.025162             0.011922            0.006994  0.001850                 0.000653                0.000514
sqft_lot        0.020695             0.012332            0.005753  0.001152                 0.000607                0.000320
waterfront      0.013390             0.003592            0.003722  0.000506                 0.000205                0.000141
bathrooms       0.010384             0.005163            0.002887  0.000498                 0.000217                0.000139
yr_built        0.008597             0.004961            0.002390  0.000567                 0.000129                0.000158
sqft_lot15      0.005792             0.003533            0.001610  0.000316                 0.000200                0.000088
sqft_above      0.004771             0.002596            0.001326  0.000181                 0.000122                0.000050
yr_renovated    0.000813             0.000337            0.000226  0.000155                 0.000082                0.000043
sqft_basement   0.000198             0.000451            0.000055  0.000073                 0.000109                0.000020
date_month      0.000170             0.000065            0.000047  0.000053                 0.000039                0.000015
date_year       0.000000             0.000000            0.000000  0.000000                 0.000000                0.000000
date_day       -0.000208            -0.000136           -0.000058  0.000150                 0.000083                0.000042
bedrooms       -0.001534            -0.000674           -0.000426  0.000127                 0.000187                0.000035

Also check out explain/not_train/static_plots/ in the output directory for visualizations of the permutation importance.
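
To make the numbers in the importance table easier to interpret, here is a minimal, self-contained sketch of the general idea behind permutation importance: shuffle one feature column, re-evaluate the model, and record how much the score changes. This is not CaTabRa's implementation. The toy data, the `predict` stand-in, the function names, and the convention "baseline score minus permuted score" are all assumptions made for illustration.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, rows, y, column, rng, n_repeats=5):
    """Mean drop in R^2 when the values of one column are shuffled."""
    baseline = r2_score(y, [predict(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the permuted column
        permuted = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - r2_score(y, [predict(r) for r in permuted]))
    return sum(drops) / len(drops)

rng = random.Random(0)

# Hypothetical toy data: the target depends on 'lat' only; 'year' is constant.
rows = [{'lat': rng.uniform(47.0, 48.0), 'year': 1} for _ in range(200)]
y = [2.0 * r['lat'] + rng.gauss(0.0, 0.05) for r in rows]

def predict(row):
    # Stand-in for a fitted model
    return 2.0 * row['lat']

imp_lat = permutation_importance(predict, rows, y, 'lat', rng)
imp_year = permutation_importance(predict, rows, y, 'year', rng)
print(f'lat: {imp_lat:.3f}, year: {imp_year:.3f}')
```

Shuffling a constant column leaves every row unchanged, so its importance is exactly zero. This would explain the all-zero row for `date_year` in the table above, assuming that column takes a single value within the `not_train` split, which is plausible because the split was derived from it. Slightly negative values, as for `bedrooms`, can occur when shuffling a nearly irrelevant feature happens to improve the score by chance.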