Predicting House Prices

This notebook is part of the CaTabRa GitHub repository.

This short example demonstrates how to create a model for predicting house prices with CaTabRa:

prepare data,
train a regression model,
evaluate the model, and
explain the model.

Familiarity with CaTabRa’s main data analysis workflow is assumed. A step-by-step introduction can be found in CaTabRa Workflow.

Prerequisites

[1]:

from catabra.util import io

[2]:

# output directory (where all generated artifacts, like statistics, models, etc. are saved)
output_dir = 'house_sales'

Prepare Data

[3]:

# load dataset
from sklearn.datasets import fetch_openml
X, y = fetch_openml(data_id=44066, return_X_y=True, as_frame=True)

[4]:

# add the regression target (price) to the DataFrame
X['price'] = y

[5]:

# split into training and test sets by adding a column that marks the training samples
# the column name is arbitrary; CaTabRa tries to infer which samples belong to which set from the column's name and values
X['train'] = X['date_year'] == '0'    # temporal split: rows with date_year '0' form the training set
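
The split column can be sanity-checked on a small table before the data is handed to CaTabRa. The following is a minimal sketch, not part of the original notebook: the three-row DataFrame and its values are made up, and it assumes (as the comparison with '0' above suggests) that `date_year` stores its values as strings.

```python
import pandas as pd

# Hypothetical toy data; only the column name `date_year` mirrors the real table.
toy = pd.DataFrame({
    'sqft_living': [1180.0, 2570.0, 770.0],
    'date_year': pd.Categorical(['0', '0', '1']),
})

# Same comparison as in the notebook: rows with date_year '0' are training samples.
toy['train'] = toy['date_year'] == '0'

# Count the rows on each side of the split.
n_train = int(toy['train'].sum())
n_test = int((~toy['train']).sum())
print(f"train: {n_train}, test: {n_test}")
```

On the real table, the same two counts show at a glance whether the temporal split leaves enough samples on both sides.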

[6]:

X.head()

[6]:

	bedrooms	bathrooms	sqft_living	sqft_lot	grade	sqft_above	sqft_basement	yr_built	yr_renovated	lat	long	sqft_living15	sqft_lot15	date_year	date_month	date_day	price	train
0	3.0	1.00	1180.0	5650.0	7.0	1180.0	0.0	1955.0	0.0	47.5112	-122.257	1340.0	5650.0	0	10.0	13.0	12.309987	True
1	3.0	2.25	2570.0	7242.0	7.0	2170.0	400.0	1951.0	1991.0	47.7210	-122.319	1690.0	7639.0	0	12.0	9.0	13.195616	True
2	2.0	1.00	770.0	10000.0	6.0	770.0	0.0	1933.0	0.0	47.7379	-122.233	2720.0	8062.0	1	2.0	25.0	12.100718	False
3	4.0	3.00	1960.0	5000.0	7.0	1050.0	910.0	1965.0	0.0	47.5208	-122.393	1360.0	5000.0	0	12.0	9.0	13.311331	True
4	3.0	2.00	1680.0	8080.0	8.0	1680.0	0.0	1987.0	0.0	47.6168	-122.045	1800.0	7503.0	1	2.0	18.0	13.142168	False
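
The `price` values in the preview (roughly 12 to 13) suggest that the target is stored on a natural-log scale rather than in dollars. This is an inference from the magnitudes, not something the notebook states. Under that assumption, a value can be mapped back to the original scale with the exponential function, as sketched below.

```python
import math

# Value copied from the first row of the preview above.
log_price = 12.309987

# ASSUMPTION: the target is ln(price), so exp() recovers the price in original units.
price = math.exp(log_price)
print(f"{price:,.0f}")
```

If the assumption holds, all error metrics reported later in this notebook are errors on the log scale, not in dollars.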

Analyze Data and Train Model

[7]:

from catabra.analysis import analyze

analyze(
    X,                        # table to analyze; can also be the path to a CSV/Excel/HDF5 file
    regress='price',          # name of column containing regression target
    split='train',            # name of column containing information about the train-test split (optional)
    time=3,                   # time budget for hyperparameter tuning, in minutes (optional)
    jobs=2,                   # number of parallel jobs
    out=output_dir            # directory where all generated artifacts are saved
)

[CaTabRa] ### Analysis started at 2023-04-19 14:53:07.415165

[CaTabRa warning] 2 columns appear to contain IDs, but are used as features: 'sqft_lot', 'sqft_lot15'

[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Using AutoML-backend auto-sklearn for regression
[CaTabRa] Successfully loaded the following auto-sklearn add-on module(s): xgb
[CaTabRa] Using auto-sklearn 1.0 (regression not supported by 2.0).

/home/amaletzk/miniconda3/envs/catabra/lib/python3.9/site-packages/autosklearn/metalearning/metalearning/meta_base.py:68: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  self.metafeatures = self.metafeatures.append(metafeatures)
/home/amaletzk/miniconda3/envs/catabra/lib/python3.9/site-packages/autosklearn/metalearning/metalearning/meta_base.py:72: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  self.algorithm_runs[metric].append(runs)

[CaTabRa] New model #1 trained:
    val_r2: 0.891634
    val_mean_absolute_error: 0.123526
    val_mean_squared_error: 0.030498
    train_r2: 0.984437
    type: random_forest
    total_elapsed_time: 00:30
[CaTabRa] New model #2 trained:
    val_r2: 0.904801
    val_mean_absolute_error: 0.116970
    val_mean_squared_error: 0.026792
    train_r2: 0.976546
    type: gradient_boosting
    total_elapsed_time: 00:35
[CaTabRa] New model #3 trained:
    val_r2: 0.902060
    val_mean_absolute_error: 0.119535
    val_mean_squared_error: 0.027564
    train_r2: 0.981697
    type: gradient_boosting
    total_elapsed_time: 00:37
[CaTabRa] New model #4 trained:
    val_r2: 0.897691
    val_mean_absolute_error: 0.122485
    val_mean_squared_error: 0.028793
    train_r2: 0.994704
    type: gradient_boosting
    total_elapsed_time: 00:48
[CaTabRa] New model #5 trained:
    val_r2: 0.768186
    val_mean_absolute_error: 0.187598
    val_mean_squared_error: 0.065241
    train_r2: 1.000000
    type: k_nearest_neighbors
    total_elapsed_time: 00:55
[CaTabRa] New model #6 trained:
    val_r2: 0.900880
    val_mean_absolute_error: 0.119753
    val_mean_squared_error: 0.027896
    train_r2: 0.996603
    type: gradient_boosting
    total_elapsed_time: 01:18
[CaTabRa] New model #7 trained:
    val_r2: 0.825280
    val_mean_absolute_error: 0.163811
    val_mean_squared_error: 0.049173
    train_r2: 0.840148
    type: mlp
    total_elapsed_time: 02:04
[CaTabRa] New model #8 trained:
    val_r2: 0.879164
    val_mean_absolute_error: 0.131235
    val_mean_squared_error: 0.034008
    train_r2: 0.939784
    type: adaboost
    total_elapsed_time: 02:04
[CaTabRa] New model #9 trained:
    val_r2: 0.885336
    val_mean_absolute_error: 0.130923
    val_mean_squared_error: 0.032270
    train_r2: 0.889781
    type: gradient_boosting
    total_elapsed_time: 02:06
[CaTabRa] New model #10 trained:
    val_r2: 0.729445
    val_mean_absolute_error: 0.215322
    val_mean_squared_error: 0.076144
    train_r2: 0.724252
    type: ard_regression
    total_elapsed_time: 02:07
[CaTabRa] New model #11 trained:
    val_r2: -0.000007
    val_mean_absolute_error: 0.415429
    val_mean_squared_error: 0.281438
    train_r2: -0.000000
    type: ard_regression
    total_elapsed_time: 02:08
[CaTabRa] New model #12 trained:
    val_r2: 0.607176
    val_mean_absolute_error: 0.238048
    val_mean_squared_error: 0.110555
    train_r2: 1.000000
    type: k_nearest_neighbors
    total_elapsed_time: 02:11
[CaTabRa] Final training statistics:
    n_models_trained: 12
    ensemble_val_r2: 0.9081625850582595
[CaTabRa] Creating shap explainer
[CaTabRa] Initialized out-of-distribution detector of type BinsDetector
[CaTabRa] Fitting out-of-distribution detector...
[CaTabRa] Out-of-distribution detector fitted.
[CaTabRa] ### Analysis finished at 2023-04-19 14:56:55.414768
[CaTabRa] ### Elapsed time: 0 days 00:03:47.999603
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales
[CaTabRa] ### Evaluation started at 2023-04-19 14:56:55.443873
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Saving descriptive statistics completed
[CaTabRa] Predicting out-of-distribution samples.
[CaTabRa] Evaluation results for train:
    r2: 0.9366044548544811
    mean_absolute_error: 0.10082619789893818
    mean_squared_error: 0.017566229157146635
[CaTabRa] Evaluation results for not_train:
    r2: 0.845543453616034
    mean_absolute_error: 0.1517555994192618
    mean_squared_error: 0.04293557006018874
[CaTabRa] ### Evaluation finished at 2023-04-19 14:57:01.911073
[CaTabRa] ### Elapsed time: 0 days 00:00:06.467200
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales/eval

Evaluate Model

Because we specified a train-test split, the model was automatically evaluated on both splits ("train" and "not_train") right after training. We can inspect the results on the test split:

[8]:

metrics = io.read_df(output_dir + '/eval/not_train/metrics.xlsx')

[9]:

metrics

[9]:

	Unnamed: 0	n	r2	mean_absolute_error	mean_squared_error	root_mean_squared_error	mean_squared_log_error	median_absolute_error	mean_absolute_percentage_error	max_error	explained_variance	mean_poisson_deviance	mean_gamma_deviance
0	price	6980	0.845543	0.151756	0.042936	0.207209	0.000215	0.110974	0.011558	1.19069	0.857237	0.003273	0.00025
1	__overall__	6980	0.845543	0.151756	0.042936	0.207209	0.000215	0.110974	0.011558	1.19069	0.857237	0.003273	0.00025
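
Two of the numbers in this table can be related to each other and to the scale of the target with a few lines of arithmetic. The sketch below is an illustration, not CaTabRa code. The first part only uses the definition of the root mean squared error. The second part rests on the assumption that `price` is a natural-log price (suggested by values of about 12 to 13 in the data preview, but not stated in the notebook).

```python
import math

# Values copied from the metrics table above (test split "not_train").
mse = 0.042936
mae = 0.151756

# The root mean squared error is the square root of the mean squared error.
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.6f}")

# ASSUMPTION: the target is ln(price). An absolute error of `mae` on the log scale
# then corresponds to a multiplicative factor of exp(mae) on the price itself.
relative_error = math.exp(mae) - 1
print(f"typical relative error: {relative_error:.1%}")
```

The computed RMSE should agree with the `root_mean_squared_error` column up to rounding.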

Also check out eval/not_train/static_plots/price.pdf in the output directory, which shows a scatter plot of ground-truth vs. predicted house prices.

Explain Model

[10]:

from catabra.explanation import explain

explain(
    X,
    folder=output_dir,       # directory containing trained model (= output directory of previous call to `analyze()`)
    from_invocation=output_dir + '/invocation.json',
    out=output_dir + '/explain',
    explainer='permutation'  # omit to use SHAP instead, which takes very long in this case
)

[CaTabRa] ### Explanation started at 2023-04-19 14:57:20.962098
[CaTabRa] *** Split train
Features: 100%|########################################| 17/17 [00:33<00:00, 1.94s/it]
[CaTabRa] *** Split not_train
Features: 100%|########################################| 17/17 [00:13<00:00, 1.28it/s]
[CaTabRa] ### Explanation finished at 2023-04-19 14:58:23.904994
[CaTabRa] ### Elapsed time: 0 days 00:01:02.942896
[CaTabRa] ### Output saved in /mnt/c/Users/amaletzk/Documents/CaTabRa/catabra/examples/house_sales/explain

[11]:

importance = io.read_df(output_dir + '/explain/not_train/__ensemble__.h5')

[12]:

importance.sort_values('r2', ascending=False)

[12]:

	r2	mean_absolute_error	mean_squared_error	r2 std	mean_absolute_error std	mean_squared_error std
lat	0.550714	0.199182	0.153086	0.005703	0.001403	0.001585
sqft_living	0.201067	0.082459	0.055892	0.003280	0.001340	0.000912
grade	0.150816	0.060628	0.041924	0.004245	0.001382	0.001180
long	0.075237	0.034909	0.020914	0.000940	0.000536	0.000261
sqft_living15	0.025162	0.011922	0.006994	0.001850	0.000653	0.000514
sqft_lot	0.020695	0.012332	0.005753	0.001152	0.000607	0.000320
waterfront	0.013390	0.003592	0.003722	0.000506	0.000205	0.000141
bathrooms	0.010384	0.005163	0.002887	0.000498	0.000217	0.000139
yr_built	0.008597	0.004961	0.002390	0.000567	0.000129	0.000158
sqft_lot15	0.005792	0.003533	0.001610	0.000316	0.000200	0.000088
sqft_above	0.004771	0.002596	0.001326	0.000181	0.000122	0.000050
yr_renovated	0.000813	0.000337	0.000226	0.000155	0.000082	0.000043
sqft_basement	0.000198	0.000451	0.000055	0.000073	0.000109	0.000020
date_month	0.000170	0.000065	0.000047	0.000053	0.000039	0.000015
date_year	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
date_day	-0.000208	-0.000136	-0.000058	0.000150	0.000083	0.000042
bedrooms	-0.001534	-0.000674	-0.000426	0.000127	0.000187	0.000035

Also check out explain/not_train/static_plots/ in the output directory for visualizations of the permutation importance.
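
To read the importance table, it helps to recall how permutation importance works in general: the values of one feature are shuffled, the model is scored again, and the resulting change in the score is recorded. Values near zero or slightly negative (as for `bedrooms` and `date_day`) indicate no detectable contribution. The exact zero for `date_year` is expected on the test split, because that column is constant there and shuffling a constant column changes nothing. The sketch below illustrates the technique with scikit-learn's `permutation_importance` on synthetic data. It is not CaTabRa's implementation, and the data, the model, and the feature names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: the target depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2.
X_toy = rng.normal(size=(500, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out data and record the drop in R^2.
result = permutation_importance(
    model, X_te, y_te, scoring='r2', n_repeats=5, random_state=0
)

for name, mean, std in zip(['strong', 'weak', 'noise'],
                           result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

As in the table above, the mean and the standard deviation over the repeated shuffles are reported per feature, and the feature that drives the target most receives the largest value.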