Processing Longitudinal Data


This notebook is part of the CaTabRa GitHub repository.

This example demonstrates how longitudinal data (also called panel data) can be effectively used with CaTabRa. In particular, it shows how such data can be brought into a form that can be input into CaTabRa’s data analysis workflow.

Longitudinal data are similar to time series data, but have another dimension: measurements are taken for more than one individual. They frequently appear in medicine and econometrics.

Table of contents:

  1. Create Synthetic Dataset

  2. Example Use-Case

  3. Resample

  4. Closing Remarks

Prerequisites

[1]:
import numpy as np
import pandas as pd

Create Synthetic Dataset

For demonstration purposes we create a synthetic dataset of laboratory tests at a hospital over a period of 10 years. Each test is characterized by the subject (patient) it belongs to, a timestamp, the name of the measured parameter, and the measured value. Laboratory tests are usually performed manually, so we cannot rely on a specific measurement frequency.

[2]:
rng = np.random.RandomState(seed=1234)
[3]:
# create subjects; subjects are characterized by unique identifier and hospital admission time
subjects = pd.DataFrame(index=pd.RangeIndex(10000, name='subject_id'))
subjects['admission_date'] = \
    pd.to_timedelta(rng.randint(10 * 365, size=len(subjects)), unit='d') + pd.Timestamp('2010-01-01')
[4]:
subjects.head()
[4]:
admission_date
subject_id
0 2017-11-03
1 2011-12-25
2 2013-08-11
3 2018-07-23
4 2018-12-21
[5]:
# create laboratory test results
labs = pd.DataFrame(index=pd.RangeIndex(10 ** 7, name='lab_id'))
labs['subject_id'] = rng.choice(subjects.index, size=len(labs))
labs['timestamp'] = \
    subjects.reindex(labs['subject_id'])['admission_date'].values + \
    pd.to_timedelta(rng.randint(10 * 24 * 60, size=len(labs)), unit='m')
labs['parameter'] = \
    rng.choice(
        ['creatinine', 'hemoglobin', 'red blood cells', 'white blood cells', 'platelets', 'oxygen saturation'],
        size=len(labs)
    )
[6]:
# assign measured values
labs['value'] = np.nan
labs.loc[labs['parameter'] == 'creatinine', 'value'] = \
    rng.uniform(0.1, 5, size=(labs['parameter'] == 'creatinine').sum())
labs.loc[labs['parameter'] == 'hemoglobin', 'value'] = \
    rng.uniform(3, 20, size=(labs['parameter'] == 'hemoglobin').sum())
labs.loc[labs['parameter'] == 'red blood cells', 'value'] = \
    rng.uniform(0, 10.2, size=(labs['parameter'] == 'red blood cells').sum())
labs.loc[labs['parameter'] == 'white blood cells', 'value'] = \
    rng.uniform(0, 1000, size=(labs['parameter'] == 'white blood cells').sum())
labs.loc[labs['parameter'] == 'platelets', 'value'] = \
    rng.uniform(0, 2000, size=(labs['parameter'] == 'platelets').sum())
labs.loc[labs['parameter'] == 'oxygen saturation', 'value'] = \
    rng.uniform(0, 100, size=(labs['parameter'] == 'oxygen saturation').sum())
[7]:
labs.head()
[7]:
subject_id timestamp parameter value
lab_id
0 4645 2010-11-14 02:57:00 creatinine 2.546876
1 1360 2017-05-23 13:44:00 platelets 138.391543
2 3236 2010-01-18 16:31:00 creatinine 3.261040
3 8222 2015-06-27 13:32:00 red blood cells 5.872180
4 3909 2010-08-23 22:31:00 hemoglobin 17.747439
[8]:
len(labs)
[8]:
10000000
[9]:
labs['subject_id'].nunique()
[9]:
10000

Example Use-Case

Consider a use-case where certain events are recorded during each patient’s hospital stay, e.g., clinical interventions such as blood transfusions, transfer to another care unit, etc. Each event has an outcome associated to it, and our goal is to predict the outcome based on data available until the time of the event, which, in particular, include the laboratory tests we’ve just created.

The events may happen at random times during the hospital stay, and there can be an arbitrary number for every patient. Hence, we create a synthetic table of events with associated outcomes. The type of the outcomes is arbitrary, but for the sake of simplicity we assume binary outcomes (e.g., whether an intervention was successful or not).

[10]:
events = pd.DataFrame(index=pd.RangeIndex(100000, name='event_id'))
events['subject_id'] = rng.choice(subjects.index, size=len(events))
events['timestamp'] = \
    subjects.reindex(events['subject_id'])['admission_date'].values + \
    pd.to_timedelta(rng.randint(2 * 24 * 60, 8 * 24 * 60, size=len(events)), unit='m')
events['outcome'] = rng.choice([False, True], size=len(events), p=[0.7, 0.3])
[11]:
events.head()
[11]:
subject_id timestamp outcome
event_id
0 9961 2014-10-17 10:21:00 False
1 2921 2018-06-01 13:06:00 False
2 8244 2019-06-17 18:24:00 True
3 4061 2014-03-28 00:42:00 True
4 8048 2012-05-22 03:34:00 False

CaTabRa does not accept longitudinal data as input to its main data analysis workflow (function `catabra.analysis.analyze() <https://github.com/risc-mi/catabra/tree/main/catabra/analysis/main.py>`__). Instead, it expects data in the \(samples\times attributes\) format, where every row corresponds to one single sample and every column corresponds to either a feature or a target. So, before we can start building prediction models we have to transform our data into the required form. This typically proceeds by eliminating the temporal dimension of our longitudinal data by aggregating multiple measurements at different times into one (or any other fixed number) feature values. Aggregations can be as simple as taking the mean of all values, or the first/last value, or computing complex time-series features with libraries like tsfresh. This process is called resampling.

Before we can resample our data we have to specify a table of time windows over which we want to aggregate data. Each window is characterized by a subject-ID, a start time and a stop time. Both times are optional, as long as at least one is specified. In our case, every event in events corresponds to a window. Windows stop at the time of the event (because that’s where we stop looking for data as input for our model). We set their start time to 2 days before the event, because we believe laboratory tests are only valid for 2 days and we don’t want to use “outdated” information.

[12]:
windows = pd.DataFrame(
    index=events.index,
    data=dict(
        subject_id=events['subject_id'],
        start=events['timestamp'] - pd.Timedelta('2 days'),
        stop=events['timestamp'],
        label=events['outcome']
    )
)
windows.columns = pd.MultiIndex.from_tuples([['subject_id', ''], ['timestamp', 'start'], ['timestamp', 'stop'], ['label', '']])
[13]:
windows.head()
[13]:
subject_id timestamp label
start stop
event_id
0 9961 2014-10-15 10:21:00 2014-10-17 10:21:00 False
1 2921 2018-05-30 13:06:00 2018-06-01 13:06:00 False
2 8244 2019-06-15 18:24:00 2019-06-17 18:24:00 True
3 4061 2014-03-26 00:42:00 2014-03-28 00:42:00 True
4 8048 2012-05-20 03:34:00 2012-05-22 03:34:00 False

Note that windows has two column index levels, and that the naming of the columns follows a certain pattern:

  • the column containing subject-IDs has the same name as in labs (i.e., "subject_id") on the first level, and "" on the second level,

  • the columns containing start- and stop times have the same name as the timestamp-column in labs (i.e., "timestamp") on the first level, and "start" and "stop", respectively, on the second level.

Furthermore, an arbitrary number of additional columns may be present, which in our case is the target ("label").

NOTE By construction, some of the time windows (of the same subject) may overlap. This is no problem, CaTabRa can handle such cases without ado.

Resample

Let’s compute some simple built-in aggregations of the measured laboratory parameters for each time window. The aggreagtions may differ for each laboratory parameter.

The main function is `catabra.util.longitudinal.resample_eav() <https://github.com/risc-mi/catabra/tree/main/catabra/util/longitudinal.py>`__ (“eav” standing for “entity-attribute-value”):

[15]:
from catabra.util.longitudinal import resample_eav
[16]:
tic = pd.Timestamp.now()
resampled = resample_eav(
    labs,                                            # longitudinal data to resample
    windows,                                         # time windows
    agg={                                            # aggregations
        'creatinine': 'mean',                                # mean of creatinine values
        'hemoglobin': ['min', 'max'],                        # minimum and maximum of hemoglobin values
        'red blood cells': ['mean', 'std'],                  # mean and std. dev. of red blood cells
        'white blood cells': ['r0', 'r-1', 't0', 't-1'],     # first and last value of white blood cells, plus corresponding measurement times
        'platelets': 'count',                                # total number of platelets measurements
        'oxygen saturation': ['p10', 'p90']                  # 10th and 90th percentile of oxygen saturation values
    },
    entity_col='subject_id',                         # name of column with entity-IDs
    time_col='timestamp',                            # name of column with times
    attribute_col='parameter',                       # name of column with attribute names
    value_col='value'                                # name of column with values to aggregate
)
toc = pd.Timestamp.now()

It took less than a minute to resample our 10 million rows of laboratory data of 10,000 distinct subjects and 100,000 time windows by computing both simple and non-standard aggregations:

[48]:
toc - tic
[48]:
Timedelta('0 days 00:00:46.314381')

The output DataFrame is exactly like windows, but with additional columns containing the requested aggregations:

[49]:
resampled.head()
[49]:
subject_id timestamp label creatinine hemoglobin red blood cells white blood cells platelets oxygen saturation
start stop mean min max mean std r0 r-1 t0 t-1 count p10 p90
event_id
0 9961 2014-10-15 10:21:00 2014-10-17 10:21:00 False 1.994618 3.139749 19.764566 5.396520 2.742879 126.278639 931.995807 2014-10-15 10:54:00 2014-10-17 09:55:00 32.0 16.937398 78.116129
1 2921 2018-05-30 13:06:00 2018-06-01 13:06:00 False 2.481352 3.611202 19.850998 5.230297 3.286836 485.478178 341.270608 2018-05-30 16:25:00 2018-06-01 11:28:00 28.0 10.712035 77.491403
2 8244 2019-06-15 18:24:00 2019-06-17 18:24:00 True 2.529095 3.636045 19.040233 5.887074 2.811139 817.979464 200.394162 2019-06-15 18:51:00 2019-06-17 16:21:00 38.0 20.086898 78.725686
3 4061 2014-03-26 00:42:00 2014-03-28 00:42:00 True 2.056492 3.112208 19.824033 5.146293 2.964574 22.554426 999.518133 2014-03-26 02:01:00 2014-03-27 22:52:00 28.0 10.041334 86.814089
4 8048 2012-05-20 03:34:00 2012-05-22 03:34:00 False 2.352913 4.084322 19.743881 5.728578 3.097373 634.744263 404.023845 2012-05-20 03:48:00 2012-05-22 03:22:00 30.0 11.319649 89.229113
[50]:
(resampled.index == events.index).all()
[50]:
True
[51]:
(resampled[('subject_id', '')] == events['subject_id']).all()
[51]:
True

resampled can now be used as input to CaTabRa’s automatic data analysis workflow.

Closing Remarks

Time Windows

One of the strengths of CaTabRa’s resample_eav() function is that it can handle arbitrary time windows: they may overlap, they may have unequal lengths (even for the same entity), and they may be infinite (with only start- or stop time, but not both).

Time windows can be constructed manually, as above, or conveniently using function make_windows() in module catabra.util.longitudinal. See the documentation of function make_windows() for more information.

tsfresh Integration

tsfresh is a library for computing sophisticated time-series features from tabular data in a very similar format as the one expected by function resample_eav() (“long” or “stacked” DataFrame). The main difference is that tsfresh does not accept time windows but computes features either for all observations per entity-attribute pair, or in a rolling/forecasting fashion. CaTabRa handles time windows natively and efficiently.

tsfresh can be integrated into resample_eav() by passing a callable to the list of desired aggregations; see the docstring of resample_eav() for details. Note that the input format expected by tsfresh and output format returned by it match almost exactly the input/output format of the callable.

Dask Support

resample_eav() accepts Dask DataFrames as input. This applies both to the table containing the observations (labs in our example) and the table with the time windows. Therefore, even large amounts of data can be efficiently processed.

Time Period Observations

The example presented in this notebook illustrates how isolated observations/measurements can be aggregated to resample longitudinal data in entity-attribute-value format. A similar albeit subtly different data modality are data where observations/measurements do not happen at specific points in time, but over a time period. One simple example, again from the clinical domain, are infusions of medications: they are characterized by start- and end times, and the administered medication amount.

Function `catabra.util.longitudinal.resample_interval() <https://github.com/risc-mi/catabra/tree/main/catabra/util/longitudinal.py>`__ can be used to resample data of that kind. Its API bears close resemblance to that of resample_eav(), with the main difference that the only supported aggregation is summing all observed values in each time window, taking the size of the intersection of the time window with the observation interval into account. See the documentation of resample_interval() for more information.