Processing Longitudinal Data

This notebook is part of the CaTabRa GitHub repository.

This example demonstrates how longitudinal data (also called panel data) can be effectively used with CaTabRa. In particular, it shows how such data can be brought into a form that can be input into CaTabRa’s data analysis workflow.

Longitudinal data are similar to time series data, but have another dimension: measurements are taken for more than one individual. They frequently appear in medicine and econometrics.

Table of contents:

Create Synthetic Dataset
Example Use-Case
Resample
Closing Remarks

Prerequisites

[1]:

import numpy as np
import pandas as pd

Create Synthetic Dataset

For demonstration purposes we create a synthetic dataset of laboratory tests at a hospital over a period of 10 years. Each test is characterized by the subject (patient) it belongs to, a timestamp, the name of the measured parameter, and the measured value. Laboratory tests are usually performed manually, so we cannot rely on a specific measurement frequency.

[2]:

rng = np.random.RandomState(seed=1234)

[3]:

# create subjects; subjects are characterized by unique identifier and hospital admission time
subjects = pd.DataFrame(index=pd.RangeIndex(10000, name='subject_id'))
subjects['admission_date'] = \
    pd.to_timedelta(rng.randint(10 * 365, size=len(subjects)), unit='d') + pd.Timestamp('2010-01-01')

[4]:

subjects.head()

[4]:

	admission_date
subject_id
0	2017-11-03
1	2011-12-25
2	2013-08-11
3	2018-07-23
4	2018-12-21

[5]:

# create laboratory test results
labs = pd.DataFrame(index=pd.RangeIndex(10 ** 7, name='lab_id'))
labs['subject_id'] = rng.choice(subjects.index, size=len(labs))
labs['timestamp'] = \
    subjects.reindex(labs['subject_id'])['admission_date'].values + \
    pd.to_timedelta(rng.randint(10 * 24 * 60, size=len(labs)), unit='m')
labs['parameter'] = \
    rng.choice(
        ['creatinine', 'hemoglobin', 'red blood cells', 'white blood cells', 'platelets', 'oxygen saturation'],
        size=len(labs)
    )

[6]:

# assign measured values
labs['value'] = np.nan
labs.loc[labs['parameter'] == 'creatinine', 'value'] = \
    rng.uniform(0.1, 5, size=(labs['parameter'] == 'creatinine').sum())
labs.loc[labs['parameter'] == 'hemoglobin', 'value'] = \
    rng.uniform(3, 20, size=(labs['parameter'] == 'hemoglobin').sum())
labs.loc[labs['parameter'] == 'red blood cells', 'value'] = \
    rng.uniform(0, 10.2, size=(labs['parameter'] == 'red blood cells').sum())
labs.loc[labs['parameter'] == 'white blood cells', 'value'] = \
    rng.uniform(0, 1000, size=(labs['parameter'] == 'white blood cells').sum())
labs.loc[labs['parameter'] == 'platelets', 'value'] = \
    rng.uniform(0, 2000, size=(labs['parameter'] == 'platelets').sum())
labs.loc[labs['parameter'] == 'oxygen saturation', 'value'] = \
    rng.uniform(0, 100, size=(labs['parameter'] == 'oxygen saturation').sum())

[7]:

labs.head()

[7]:

	subject_id	timestamp	parameter	value
lab_id
0	4645	2010-11-14 02:57:00	creatinine	2.546876
1	1360	2017-05-23 13:44:00	platelets	138.391543
2	3236	2010-01-18 16:31:00	creatinine	3.261040
3	8222	2015-06-27 13:32:00	red blood cells	5.872180
4	3909	2010-08-23 22:31:00	hemoglobin	17.747439

[8]:

len(labs)

[8]:

10000000

[9]:

labs['subject_id'].nunique()

[9]:

Example Use-Case

Consider a use-case where certain events are recorded during each patient’s hospital stay, e.g., clinical interventions such as blood transfusions, transfer to another care unit, etc. Each event has an outcome associated to it, and our goal is to predict the outcome based on data available until the time of the event, which, in particular, include the laboratory tests we’ve just created.

The events may happen at random times during the hospital stay, and there can be an arbitrary number for every patient. Hence, we create a synthetic table of events with associated outcomes. The type of the outcomes is arbitrary, but for the sake of simplicity we assume binary outcomes (e.g., whether an intervention was successful or not).

[10]:

events = pd.DataFrame(index=pd.RangeIndex(100000, name='event_id'))
events['subject_id'] = rng.choice(subjects.index, size=len(events))
events['timestamp'] = \
    subjects.reindex(events['subject_id'])['admission_date'].values + \
    pd.to_timedelta(rng.randint(2 * 24 * 60, 8 * 24 * 60, size=len(events)), unit='m')
events['outcome'] = rng.choice([False, True], size=len(events), p=[0.7, 0.3])

[11]:

events.head()

[11]:

	subject_id	timestamp	outcome
event_id
0	9961	2014-10-17 10:21:00	False
1	2921	2018-06-01 13:06:00	False
2	8244	2019-06-17 18:24:00	True
3	4061	2014-03-28 00:42:00	True
4	8048	2012-05-22 03:34:00	False

CaTabRa does not accept longitudinal data as input to its main data analysis workflow (function `catabra.analysis.analyze() <https://github.com/risc-mi/catabra/tree/main/catabra/analysis/main.py>`__). Instead, it expects data in the \(samples\times attributes\) format, where every row corresponds to one single sample and every column corresponds to either a feature or a target. So, before we can start building prediction models we have to transform our data into the required form. This typically proceeds by eliminating the temporal dimension of our longitudinal data by aggregating multiple measurements at different times into one (or any other fixed number) feature values. Aggregations can be as simple as taking the mean of all values, or the first/last value, or computing complex time-series features with libraries like tsfresh. This process is called resampling.

Before we can resample our data we have to specify a table of time windows over which we want to aggregate data. Each window is characterized by a subject-ID, a start time and a stop time. Both times are optional, as long as at least one is specified. In our case, every event in events corresponds to a window. Windows stop at the time of the event (because that’s where we stop looking for data as input for our model). We set their start time to 2 days before the event, because we believe laboratory tests are only valid for 2 days and we don’t want to use “outdated” information.

[12]:

windows = pd.DataFrame(
    index=events.index,
    data=dict(
        subject_id=events['subject_id'],
        start=events['timestamp'] - pd.Timedelta('2 days'),
        stop=events['timestamp'],
        label=events['outcome']
    )
)
windows.columns = pd.MultiIndex.from_tuples([['subject_id', ''], ['timestamp', 'start'], ['timestamp', 'stop'], ['label', '']])

[13]:

windows.head()

[13]:

	subject_id	timestamp		label
		start	stop
event_id
0	9961	2014-10-15 10:21:00	2014-10-17 10:21:00	False
1	2921	2018-05-30 13:06:00	2018-06-01 13:06:00	False
2	8244	2019-06-15 18:24:00	2019-06-17 18:24:00	True
3	4061	2014-03-26 00:42:00	2014-03-28 00:42:00	True
4	8048	2012-05-20 03:34:00	2012-05-22 03:34:00	False

Note that windows has two column index levels, and that the naming of the columns follows a certain pattern:

the column containing subject-IDs has the same name as in labs (i.e., "subject_id") on the first level, and "" on the second level,
the columns containing start- and stop times have the same name as the timestamp-column in labs (i.e., "timestamp") on the first level, and "start" and "stop", respectively, on the second level.

Furthermore, an arbitrary number of additional columns may be present, which in our case is the target ("label").

NOTE By construction, some of the time windows (of the same subject) may overlap. This is no problem, CaTabRa can handle such cases without ado.

Resample

Let’s compute some simple built-in aggregations of the measured laboratory parameters for each time window. The aggreagtions may differ for each laboratory parameter.

The main function is `catabra.util.longitudinal.resample_eav() <https://github.com/risc-mi/catabra/tree/main/catabra/util/longitudinal.py>`__ (“eav” standing for “entity-attribute-value”):

[15]:

from catabra.util.longitudinal import resample_eav

[16]:

tic = pd.Timestamp.now()
resampled = resample_eav(
    labs,                                            # longitudinal data to resample
    windows,                                         # time windows
    agg={                                            # aggregations
        'creatinine': 'mean',                                # mean of creatinine values
        'hemoglobin': ['min', 'max'],                        # minimum and maximum of hemoglobin values
        'red blood cells': ['mean', 'std'],                  # mean and std. dev. of red blood cells
        'white blood cells': ['r0', 'r-1', 't0', 't-1'],     # first and last value of white blood cells, plus corresponding measurement times
        'platelets': 'count',                                # total number of platelets measurements
        'oxygen saturation': ['p10', 'p90']                  # 10th and 90th percentile of oxygen saturation values
    },
    entity_col='subject_id',                         # name of column with entity-IDs
    time_col='timestamp',                            # name of column with times
    attribute_col='parameter',                       # name of column with attribute names
    value_col='value'                                # name of column with values to aggregate
)
toc = pd.Timestamp.now()

It took less than a minute to resample our 10 million rows of laboratory data of 10,000 distinct subjects and 100,000 time windows by computing both simple and non-standard aggregations:

[48]:

toc - tic

[48]:

Timedelta('0 days 00:00:46.314381')

The output DataFrame is exactly like windows, but with additional columns containing the requested aggregations:

[49]:

resampled.head()

[49]:

	subject_id	timestamp		label	creatinine	hemoglobin		red blood cells		white blood cells				platelets	oxygen saturation
		start	stop		mean	min	max	mean	std	r0	r-1	t0	t-1	count	p10	p90
event_id
0	9961	2014-10-15 10:21:00	2014-10-17 10:21:00	False	1.994618	3.139749	19.764566	5.396520	2.742879	126.278639	931.995807	2014-10-15 10:54:00	2014-10-17 09:55:00	32.0	16.937398	78.116129
1	2921	2018-05-30 13:06:00	2018-06-01 13:06:00	False	2.481352	3.611202	19.850998	5.230297	3.286836	485.478178	341.270608	2018-05-30 16:25:00	2018-06-01 11:28:00	28.0	10.712035	77.491403
2	8244	2019-06-15 18:24:00	2019-06-17 18:24:00	True	2.529095	3.636045	19.040233	5.887074	2.811139	817.979464	200.394162	2019-06-15 18:51:00	2019-06-17 16:21:00	38.0	20.086898	78.725686
3	4061	2014-03-26 00:42:00	2014-03-28 00:42:00	True	2.056492	3.112208	19.824033	5.146293	2.964574	22.554426	999.518133	2014-03-26 02:01:00	2014-03-27 22:52:00	28.0	10.041334	86.814089
4	8048	2012-05-20 03:34:00	2012-05-22 03:34:00	False	2.352913	4.084322	19.743881	5.728578	3.097373	634.744263	404.023845	2012-05-20 03:48:00	2012-05-22 03:22:00	30.0	11.319649	89.229113

[50]:

(resampled.index == events.index).all()

[50]:

True

[51]:

(resampled[('subject_id', '')] == events['subject_id']).all()

[51]:

True

resampled can now be used as input to CaTabRa’s automatic data analysis workflow.

Closing Remarks

Time Windows

One of the strengths of CaTabRa’s resample_eav() function is that it can handle arbitrary time windows: they may overlap, they may have unequal lengths (even for the same entity), and they may be infinite (with only start- or stop time, but not both).

Time windows can be constructed manually, as above, or conveniently using function make_windows() in module catabra.util.longitudinal. See the documentation of function make_windows() for more information.

tsfresh Integration

tsfresh is a library for computing sophisticated time-series features from tabular data in a very similar format as the one expected by function resample_eav() (“long” or “stacked” DataFrame). The main difference is that tsfresh does not accept time windows but computes features either for all observations per entity-attribute pair, or in a rolling/forecasting fashion. CaTabRa handles time windows natively and efficiently.

tsfresh can be integrated into resample_eav() by passing a callable to the list of desired aggregations; see the docstring of resample_eav() for details. Note that the input format expected by tsfresh and output format returned by it match almost exactly the input/output format of the callable.

Dask Support

resample_eav() accepts Dask DataFrames as input. This applies both to the table containing the observations (labs in our example) and the table with the time windows. Therefore, even large amounts of data can be efficiently processed.

Time Period Observations

The example presented in this notebook illustrates how isolated observations/measurements can be aggregated to resample longitudinal data in entity-attribute-value format. A similar albeit subtly different data modality are data where observations/measurements do not happen at specific points in time, but over a time period. One simple example, again from the clinical domain, are infusions of medications: they are characterized by start- and end times, and the administered medication amount.

Function `catabra.util.longitudinal.resample_interval() <https://github.com/risc-mi/catabra/tree/main/catabra/util/longitudinal.py>`__ can be used to resample data of that kind. Its API bears close resemblance to that of resample_eav(), with the main difference that the only supported aggregation is summing all observed values in each time window, taking the size of the intersection of the time window with the observation interval into account. See the documentation of resample_interval() for more information.