Analysis

analyze(*table: str | Path | DataFrame, classify: Iterable[DataFrame | str | Path] | None = None, regress: Iterable[DataFrame | str | Path] | None = None, group: str | None = None, split: str | None = None, sample_weight: str | None = None, ignore: Iterable[str] | None = None, create_stats: bool | None = None, calibrate: str | None = None, time: int | None = None, out: str | Path | None = None, config: str | Path | dict | None = None, default_config: str | None = None, monitor: str | None = None, jobs: int | None = None, from_invocation: str | Path | dict | None = None)

Analyze a table by creating descriptive statistics and training models for predicting one or more columns from the remaining ones. Wrapper for Analyzer.__call__.

Parameters:
  • *table (str | Path | DataFrame) – The table(s) to analyze. If multiple are given, their columns are merged into a single table.

  • classify (Iterable[str | Path | pd.DataFrame], optional) – Column(s) to classify. If more than one column is given, a multilabel classification problem is solved, which means that each of these columns can take on only two distinct values. Must be None if regress is given.

  • regress (Iterable[str | Path | pd.DataFrame], optional) – Column(s) to regress. Must have numerical or time-like data type. Must be None if classify is given.

  • group (str, optional) – Column used for grouping samples for internal (cross) validation. If not specified or set to “”, and the row index of the given table has a name, group by row index.

  • split (str, optional) – Column used for splitting the data into training and test sets. If specified and not “”, descriptive statistics, OOD-detectors and prediction models are generated based exclusively on the training split and then automatically evaluated on the test split. The name and/or values of the column must contain the string “train”, “test” or “val”, to clearly indicate which rows are training data and which are test data.

  • sample_weight (str, optional) – Column with sample weights. If specified and not “”, must have numeric data type. Sample weights are used both for training and evaluating prediction models.

  • ignore (Iterable[str], optional) – List of columns to ignore when training prediction models. Automatically includes group and split, but may contain further columns.

  • create_stats (bool, optional) – Whether to generate and save descriptive statistics of the given data table.

  • calibrate (str, optional) – Value in column split defining the subset to calibrate the trained classifier on. If None, no calibration happens. Ignored in regression tasks or if split is not specified.

  • time (int, optional) – Time budget for model training, in minutes. Some AutoML backends require a fixed budget, others might not. Overwrites the time_limit config param.

  • out (str | Path, optional) – Directory where to save all generated artifacts. Defaults to a directory located in the parent directory of table, with a name following a fixed naming pattern. If out already exists, the user is prompted to specify whether it should be replaced; otherwise, it is automatically created.

  • config (dict | str | Path, optional) – Configuration dict or path to JSON file containing such a dict. Merged with the default configuration specified via default_config. Empty string means that the default configuration is used.

  • default_config (str, optional) – Default configuration to use, one of full, “”, basic, interpretable or None.

  • monitor (str, optional) – Training monitor to use.

  • jobs (int, optional) – Number of jobs to use. Overwrites the “jobs” config param.

  • from_invocation (dict | str | Path, optional) – dict or path to an invocation.json file. All arguments of this function not explicitly specified are taken from this dict; this also includes the table to analyze.
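The parameters above can be combined as in the following minimal sketch. It is an illustration under stated assumptions, not a reference invocation: the import path catabra.analysis, the CSV input format, the column names (age, bmi, diagnosis, split) and the helper functions are all hypothetical choices made for this example. The table is far too small to train a useful model; it only shows how the arguments fit together.

```python
import csv
import tempfile
from pathlib import Path


def write_demo_table(directory: Path) -> Path:
    """Write a tiny CSV table with a target column and a train/test split column."""
    rows = [
        {"age": 34, "bmi": 22.1, "diagnosis": "healthy", "split": "train"},
        {"age": 61, "bmi": 29.4, "diagnosis": "ill", "split": "train"},
        {"age": 47, "bmi": 25.0, "diagnosis": "healthy", "split": "test"},
        {"age": 58, "bmi": 31.2, "diagnosis": "ill", "split": "test"},
    ]
    path = directory / "demo.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path


def build_analyze_kwargs(out_dir: Path) -> dict:
    """Collect keyword arguments for analyze(); names match the documented signature."""
    return dict(
        classify=["diagnosis"],  # classification task, so `regress` is left unset
        split="split",           # values contain "train" / "test", as required
        time=1,                  # time budget in minutes
        out=out_dir,             # does not exist yet, so no overwrite prompt is expected
        jobs=1,
    )


def run_analysis(table: Path, kwargs: dict) -> None:
    """Call analyze(); requires CaTabRa to be installed."""
    # Assumption: analyze is importable from catabra.analysis.
    from catabra.analysis import analyze

    analyze(table, **kwargs)


tmp = Path(tempfile.mkdtemp())
table_path = write_demo_table(tmp)
kwargs = build_analyze_kwargs(tmp / "analysis_out")
# run_analysis(table_path, kwargs)  # uncomment once CaTabRa is installed
```

Since group and split are added to ignore automatically, the split column does not need to be listed there explicitly.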

class CaTabRaAnalysis(invocation: str | Path | dict | None = None)

Bases: CaTabRaBase

static plot_training_history(hist: DataFrame | str | Path, interactive: bool = False) -> dict

Plot the evolution of performance scores during model training.

Parameters:
  • hist (DataFrame | str | Path) – The history to plot, as saved in “training_history.xlsx”.

  • interactive (bool, default=False) – Whether to create interactive plotly plots (True) or static Matplotlib plots (False).

Returns:

Dict with single key “training_history”, which is mapped to a Matplotlib or plotly figure object. The sole reason for returning a dict is consistency with other plotting functions.

Return type:

dict
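A possible way to use the returned dict is sketched below. It assumes that CaTabRaAnalysis is importable from catabra.analysis and that “training_history.xlsx” sits directly in the output directory of a finished analysis; both are assumptions of this example, and the helper names are hypothetical.

```python
from pathlib import Path


def history_file(out_dir) -> Path:
    """Path of the training history file (location inside out_dir is an assumption)."""
    return Path(out_dir) / "training_history.xlsx"


def save_training_history_plot(out_dir, target="training_history.png") -> Path:
    """Plot the training history with Matplotlib and save it as an image file."""
    # Assumption: CaTabRaAnalysis is importable from catabra.analysis.
    from catabra.analysis import CaTabRaAnalysis

    figures = CaTabRaAnalysis.plot_training_history(
        history_file(out_dir), interactive=False
    )
    fig = figures["training_history"]  # the single documented key
    target_path = Path(out_dir) / target
    fig.savefig(target_path)  # Matplotlib figure, since interactive=False
    return target_path
```

With interactive=True the dict would hold a plotly figure instead, which has to be saved or displayed with plotly's own methods rather than savefig.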