Quick Start
Installation
Clone the repository and install the package with Poetry. Set up a new Python environment with Python >=3.9, <3.11 (e.g. using conda), activate it, and then run
pip install poetry
(unless Poetry has been installed already) and
git clone https://github.com/risc-mi/catabra.git
cd catabra
poetry install
The project is installed in editable mode by default. This is useful if you plan to make changes to CaTabRa’s code.
IMPORTANT: CaTabRa currently only runs on Linux, because auto-sklearn only runs on Linux. On Windows, you can use a virtual machine, like WSL 2, and install CaTabRa there. If you want to use Jupyter, install Jupyter on the virtual machine as well and launch it with the --no-browser flag.
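The two requirements above (Python >=3.9, <3.11 and a Linux system) can be checked before installing. The following is a minimal sketch using only the standard library; the function name environment_problems is hypothetical and not part of CaTabRa.

```python
import platform
import sys


def environment_problems(version_info=sys.version_info, system=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    system = platform.system() if system is None else system
    problems = []
    # Version range stated in the installation instructions: >=3.9, <3.11.
    if not ((3, 9) <= tuple(version_info[:2]) < (3, 11)):
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} is outside >=3.9, <3.11"
        )
    # auto-sklearn only runs on Linux; inside WSL 2, platform.system() reports "Linux".
    if system != "Linux":
        problems.append(f"{system} is not supported; use Linux, e.g. via WSL 2")
    return problems
```

Calling environment_problems() without arguments inspects the running interpreter; printing the returned list before running poetry install shows whether the environment matches the stated requirements.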
Usage Mode 1: Command-Line
python -m catabra analyze example_data/breast_cancer.csv --classify diagnosis --split train --out breast_cancer_result
This command analyzes breast_cancer.csv and trains a prediction model for classifying the samples according to column "diagnosis". Column "train" is used for splitting the data into a training set and a test set, which means that the final model is automatically evaluated on the test set after training. All results are saved in directory breast_cancer_result.
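The example file already contains a "train" column. For your own data, such a column could be added beforehand. The sketch below uses only the standard library and assumes that the split column holds True for training samples and False for test samples. That encoding, the helper names add_split_column and add_split_to_csv, and the default test fraction are assumptions made for illustration; check the CaTabRa documentation for the encodings it actually accepts.

```python
import csv
import random


def add_split_column(rows, column="train", test_fraction=0.2, seed=0):
    """Return copies of `rows` (a list of dicts) with a True/False split column added."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_indices = set(indices[:n_test])
    # True marks a training sample, False a test sample (assumed encoding).
    return [{**row, column: i not in test_indices} for i, row in enumerate(rows)]


def add_split_to_csv(src, dst, **kwargs):
    """Read a non-empty CSV file, add the split column, and write the result to `dst`."""
    with open(src, newline="") as f:
        rows = list(csv.DictReader(f))
    rows = add_split_column(rows, **kwargs)
    with open(dst, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

The resulting file can then be passed to the analyze command with --split train, as in the command above.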
python -m catabra explain breast_cancer_result --on example_data/breast_cancer.csv --out breast_cancer_result/expl
This command explains the classifier trained in the previous command by computing SHAP feature importance scores for every sample. The results are saved in directory breast_cancer_result/expl. Depending on the type of the trained models, this command may take several minutes to complete.
Usage Mode 2: Python
The two commands above translate to the following Python code:
from catabra.analysis import analyze
from catabra.explanation import explain
analyze("example_data/breast_cancer.csv", classify="diagnosis", split="train", out="breast_cancer_result")
explain("example_data/breast_cancer.csv", "breast_cancer_result", out="breast_cancer_result/expl")
Results
Invoking the two commands generates a range of results, most notably
the trained classifier
descriptive statistics of the underlying data
performance metrics in tabular and graphical form
feature importance scores in tabular and graphical form
… and many more.
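To get an overview of what was written, the output directory can simply be listed. The sketch below makes no assumption about the names of the files CaTabRa produces; it only walks the directory given as --out. The helper name list_results is hypothetical.

```python
from pathlib import Path


def list_results(out_dir):
    """Return the relative paths of all files below `out_dir`, sorted alphabetically."""
    root = Path(out_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

For example, list_results("breast_cancer_result") returns every file created by the two commands, including those in the expl subdirectory.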