Analysis

Analyze a table by creating descriptive statistics and training models for predicting one or more columns from the remaining ones. Wrapper for Analyzer.__call__.

Parameters:

*table (str | Path | DataFrame) – The table(s) to analyze. If multiple are given, their columns are merged into a single table.
classify (Iterable[str | Path | pd.DataFrame], optional) – Column(s) to classify. If more than one, a multilabel classification problem is solved, which means that each of these columns can take on only two distinct values. Must be None if regress is given.
regress (Iterable[str | Path | pd.DataFrame], optional) – Column(s) to regress. Must have numerical or time-like data type. Must be None if classify is given.
group (str, optional) – Column used for grouping samples for internal (cross) validation. If not specified or set to “”, and the row index of the given table has a name, group by row index.
split (str, optional) – Column used for splitting the data into training and test sets. If specified and not “”, descriptive statistics, OOD-detectors and prediction models are generated based exclusively on the training split and then automatically evaluated on the test split. The name and/or values of the column must contain the string “train”, “test” or “val”, to clearly indicate which rows are training data and which are test data.
sample_weight (str, optional) – Column with sample weights. If specified and not “”, must have numeric data type. Sample weights are used both for training and evaluating prediction models.
ignore (Iterable[str], optional) – List of columns to ignore when training prediction models. Automatically includes group and split, but may contain further columns.
create_stats (bool, optional) – Whether to generate and save descriptive statistics of the given data table.
calibrate (str, optional) – Value in column split defining the subset to calibrate the trained classifier on. If None, no calibration happens. Ignored in regression tasks or if split is not specified.
time (int, optional) – Time budget for model training, in minutes. Some AutoML backends require a fixed budget, others might not. Overwrites the time_limit config param.
out (str | Path, optional) – Directory where to save all generated artifacts. Defaults to a directory located in the parent directory of table, with a name following a fixed naming pattern. If out already exists, the user is prompted to specify whether it should be replaced; otherwise, it is automatically created.
config (dict | str | Path, optional) – Configuration dict or path to JSON file containing such a dict. Merged with the default configuration specified via default_config. Empty string means that the default configuration is used.
default_config (str, optional) – Default configuration to use, one of full, “”, basic, interpretable or None.
monitor (str, optional) – Training monitor to use.
jobs (int) – Number of jobs to use. Overwrites the “jobs” config param.
from_invocation (dict | str | Path, optional) – dict or path to an invocation.json file. All arguments of this function not explicitly specified are taken from this dict; this also includes the table to analyze.
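The constraints listed above can be checked before starting a potentially long training run. The following is a minimal sketch, not part of the library: the helpers build_analyze_kwargs and looks_like_split_column, and the column names used in the example, are hypothetical. It only encodes two rules stated in the parameter descriptions (classify and regress are mutually exclusive, and a split column should mention “train”, “test” or “val” in its name or values); the library's own validation may be stricter or differ in detail.

```python
from typing import Iterable, Optional, Union

# Markers that the documentation requires in the name and/or values of `split`.
SPLIT_MARKERS = ("train", "test", "val")


def _as_list(columns: Union[str, Iterable[str], None]) -> Optional[list]:
    """Normalize a single column name or an iterable of names to a list."""
    if columns is None:
        return None
    if isinstance(columns, str):
        return [columns]
    return list(columns)


def looks_like_split_column(name: str, values: Iterable) -> bool:
    """Rough check (assumption): name or any value mentions train/test/val."""
    texts = [name] + [str(v) for v in values]
    return any(marker in text.lower() for text in texts for marker in SPLIT_MARKERS)


def build_analyze_kwargs(classify=None, regress=None, group=None,
                         split=None, ignore=None, time=None) -> dict:
    """Hypothetical helper: collect keyword arguments, dropping unset ones."""
    classify, regress = _as_list(classify), _as_list(regress)
    if classify and regress:
        raise ValueError("`classify` and `regress` are mutually exclusive.")
    kwargs = {
        "classify": classify,
        "regress": regress,
        "group": group,
        "split": split,
        "ignore": _as_list(ignore),
        "time": time,  # time budget in minutes
    }
    return {key: value for key, value in kwargs.items() if value is not None}


# Example with made-up column names.
kwargs = build_analyze_kwargs(classify="diagnosis", split="train", time=5)
print(kwargs)
# The resulting dict could then be passed on to the function documented above,
# e.g. analyze("data.csv", **kwargs), assuming it is imported from the library.
```

Keeping the checks in a small pure function makes them easy to test without loading any data or AutoML backend.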

class CaTabRaAnalysis(invocation: str | Path | dict | None = None)

Bases: CaTabRaBase

static plot_training_history(hist: DataFrame | str | Path, interactive: bool = False) → dict

Plot the evolution of performance scores during model training.

Parameters:

hist (DataFrame | str | Path) – The history to plot, as saved in “training_history.xlsx”.
interactive (bool, default=False) – If True, create interactive plotly plots; otherwise, create static Matplotlib plots.

Returns:

Dict with single key “training_history”, which is mapped to a Matplotlib or plotly figure object. The sole reason for returning a dict is consistency with other plotting functions.

Return type:

dict
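Because the result is a dict of figures, it can be handled in the same way as the output of other plotting functions. The sketch below shows one way to write such a dict to disk. The helper save_figures is hypothetical and not part of the library; it relies only on the savefig method of Matplotlib figures and the write_html method of plotly figures, and picks between them by duck typing.

```python
from pathlib import Path


def save_figures(figures: dict, directory) -> list:
    """Hypothetical helper: save each figure in `figures` under `directory`.

    Matplotlib figures are written as PNG, plotly figures as HTML.
    Returns the list of written paths.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for name, fig in figures.items():
        if hasattr(fig, "savefig"):  # Matplotlib figure
            target = directory / f"{name}.png"
            fig.savefig(target)
        elif hasattr(fig, "write_html"):  # plotly figure
            target = directory / f"{name}.html"
            fig.write_html(str(target))
        else:
            raise TypeError(f"Unsupported figure type for {name!r}: {type(fig)}")
        written.append(target)
    return written


# Intended usage, assuming the class documented above has been imported:
# figures = CaTabRaAnalysis.plot_training_history("training_history.xlsx")
# save_figures(figures, "plots")
```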