Processing Longitudinal Data
This notebook is part of the CaTabRa GitHub repository.
This example demonstrates how longitudinal data (also called panel data) can be effectively used with CaTabRa. In particular, it shows how such data can be brought into a form that can be input into CaTabRa’s data analysis workflow.
Longitudinal data are similar to time series data, but have another dimension: measurements are taken for more than one individual. They frequently appear in medicine and econometrics.
Table of contents:
Prerequisites
[1]:
import numpy as np
import pandas as pd
Create Synthetic Dataset
For demonstration purposes we create a synthetic dataset of laboratory tests at a hospital over a period of 10 years. Each test is characterized by the subject (patient) it belongs to, a timestamp, the name of the measured parameter, and the measured value. Laboratory tests are usually performed manually, so we cannot rely on a specific measurement frequency.
[2]:
rng = np.random.RandomState(seed=1234)
[3]:
# create subjects; subjects are characterized by unique identifier and hospital admission time
subjects = pd.DataFrame(index=pd.RangeIndex(10000, name='subject_id'))
subjects['admission_date'] = \
pd.to_timedelta(rng.randint(10 * 365, size=len(subjects)), unit='d') + pd.Timestamp('2010-01-01')
[4]:
subjects.head()
[4]:
| admission_date | |
|---|---|
| subject_id | |
| 0 | 2017-11-03 |
| 1 | 2011-12-25 |
| 2 | 2013-08-11 |
| 3 | 2018-07-23 |
| 4 | 2018-12-21 |
[5]:
# create laboratory test results
labs = pd.DataFrame(index=pd.RangeIndex(10 ** 7, name='lab_id'))
labs['subject_id'] = rng.choice(subjects.index, size=len(labs))
labs['timestamp'] = \
subjects.reindex(labs['subject_id'])['admission_date'].values + \
pd.to_timedelta(rng.randint(10 * 24 * 60, size=len(labs)), unit='m')
labs['parameter'] = \
rng.choice(
['creatinine', 'hemoglobin', 'red blood cells', 'white blood cells', 'platelets', 'oxygen saturation'],
size=len(labs)
)
[6]:
# assign measured values
labs['value'] = np.nan
labs.loc[labs['parameter'] == 'creatinine', 'value'] = \
rng.uniform(0.1, 5, size=(labs['parameter'] == 'creatinine').sum())
labs.loc[labs['parameter'] == 'hemoglobin', 'value'] = \
rng.uniform(3, 20, size=(labs['parameter'] == 'hemoglobin').sum())
labs.loc[labs['parameter'] == 'red blood cells', 'value'] = \
rng.uniform(0, 10.2, size=(labs['parameter'] == 'red blood cells').sum())
labs.loc[labs['parameter'] == 'white blood cells', 'value'] = \
rng.uniform(0, 1000, size=(labs['parameter'] == 'white blood cells').sum())
labs.loc[labs['parameter'] == 'platelets', 'value'] = \
rng.uniform(0, 2000, size=(labs['parameter'] == 'platelets').sum())
labs.loc[labs['parameter'] == 'oxygen saturation', 'value'] = \
rng.uniform(0, 100, size=(labs['parameter'] == 'oxygen saturation').sum())
[7]:
labs.head()
[7]:
| subject_id | timestamp | parameter | value | |
|---|---|---|---|---|
| lab_id | ||||
| 0 | 4645 | 2010-11-14 02:57:00 | creatinine | 2.546876 |
| 1 | 1360 | 2017-05-23 13:44:00 | platelets | 138.391543 |
| 2 | 3236 | 2010-01-18 16:31:00 | creatinine | 3.261040 |
| 3 | 8222 | 2015-06-27 13:32:00 | red blood cells | 5.872180 |
| 4 | 3909 | 2010-08-23 22:31:00 | hemoglobin | 17.747439 |
[8]:
len(labs)
[8]:
10000000
[9]:
labs['subject_id'].nunique()
[9]:
10000
Example Use-Case
Consider a use-case where certain events are recorded during each patient’s hospital stay, e.g., clinical interventions such as blood transfusions, transfer to another care unit, etc. Each event has an outcome associated to it, and our goal is to predict the outcome based on data available until the time of the event, which, in particular, include the laboratory tests we’ve just created.
The events may happen at random times during the hospital stay, and there can be an arbitrary number for every patient. Hence, we create a synthetic table of events with associated outcomes. The type of the outcomes is arbitrary, but for the sake of simplicity we assume binary outcomes (e.g., whether an intervention was successful or not).
[10]:
events = pd.DataFrame(index=pd.RangeIndex(100000, name='event_id'))
events['subject_id'] = rng.choice(subjects.index, size=len(events))
events['timestamp'] = \
subjects.reindex(events['subject_id'])['admission_date'].values + \
pd.to_timedelta(rng.randint(2 * 24 * 60, 8 * 24 * 60, size=len(events)), unit='m')
events['outcome'] = rng.choice([False, True], size=len(events), p=[0.7, 0.3])
[11]:
events.head()
[11]:
| subject_id | timestamp | outcome | |
|---|---|---|---|
| event_id | |||
| 0 | 9961 | 2014-10-17 10:21:00 | False |
| 1 | 2921 | 2018-06-01 13:06:00 | False |
| 2 | 8244 | 2019-06-17 18:24:00 | True |
| 3 | 4061 | 2014-03-28 00:42:00 | True |
| 4 | 8048 | 2012-05-22 03:34:00 | False |
CaTabRa does not accept longitudinal data as input to its main data analysis workflow (function `catabra.analysis.analyze() <https://github.com/risc-mi/catabra/tree/main/catabra/analysis/main.py>`__). Instead, it expects data in the \(samples\times attributes\) format, where every row corresponds to one single sample and every column corresponds to either a feature or a target. So, before we can start building prediction models we have to transform our data into the required form. This
typically proceeds by eliminating the temporal dimension of our longitudinal data by aggregating multiple measurements at different times into one (or any other fixed number) feature values. Aggregations can be as simple as taking the mean of all values, or the first/last value, or computing complex time-series features with libraries like tsfresh. This process is called resampling.
Before we can resample our data we have to specify a table of time windows over which we want to aggregate data. Each window is characterized by a subject-ID, a start time and a stop time. Both times are optional, as long as at least one is specified. In our case, every event in events corresponds to a window. Windows stop at the time of the event (because that’s where we stop looking for data as input for our model). We set their start time to 2 days before the event, because we believe
laboratory tests are only valid for 2 days and we don’t want to use “outdated” information.
[12]:
windows = pd.DataFrame(
index=events.index,
data=dict(
subject_id=events['subject_id'],
start=events['timestamp'] - pd.Timedelta('2 days'),
stop=events['timestamp'],
label=events['outcome']
)
)
windows.columns = pd.MultiIndex.from_tuples([['subject_id', ''], ['timestamp', 'start'], ['timestamp', 'stop'], ['label', '']])
[13]:
windows.head()
[13]:
| subject_id | timestamp | label | ||
|---|---|---|---|---|
| start | stop | |||
| event_id | ||||
| 0 | 9961 | 2014-10-15 10:21:00 | 2014-10-17 10:21:00 | False |
| 1 | 2921 | 2018-05-30 13:06:00 | 2018-06-01 13:06:00 | False |
| 2 | 8244 | 2019-06-15 18:24:00 | 2019-06-17 18:24:00 | True |
| 3 | 4061 | 2014-03-26 00:42:00 | 2014-03-28 00:42:00 | True |
| 4 | 8048 | 2012-05-20 03:34:00 | 2012-05-22 03:34:00 | False |
Note that windows has two column index levels, and that the naming of the columns follows a certain pattern:
the column containing subject-IDs has the same name as in
labs(i.e.,"subject_id") on the first level, and""on the second level,the columns containing start- and stop times have the same name as the timestamp-column in
labs(i.e.,"timestamp") on the first level, and"start"and"stop", respectively, on the second level.
Furthermore, an arbitrary number of additional columns may be present, which in our case is the target ("label").
NOTE By construction, some of the time windows (of the same subject) may overlap. This is no problem, CaTabRa can handle such cases without ado.
Resample
Let’s compute some simple built-in aggregations of the measured laboratory parameters for each time window. The aggreagtions may differ for each laboratory parameter.
The main function is `catabra.util.longitudinal.resample_eav() <https://github.com/risc-mi/catabra/tree/main/catabra/util/longitudinal.py>`__ (“eav” standing for “entity-attribute-value”):
[15]:
from catabra.util.longitudinal import resample_eav
[16]:
tic = pd.Timestamp.now()
resampled = resample_eav(
labs, # longitudinal data to resample
windows, # time windows
agg={ # aggregations
'creatinine': 'mean', # mean of creatinine values
'hemoglobin': ['min', 'max'], # minimum and maximum of hemoglobin values
'red blood cells': ['mean', 'std'], # mean and std. dev. of red blood cells
'white blood cells': ['r0', 'r-1', 't0', 't-1'], # first and last value of white blood cells, plus corresponding measurement times
'platelets': 'count', # total number of platelets measurements
'oxygen saturation': ['p10', 'p90'] # 10th and 90th percentile of oxygen saturation values
},
entity_col='subject_id', # name of column with entity-IDs
time_col='timestamp', # name of column with times
attribute_col='parameter', # name of column with attribute names
value_col='value' # name of column with values to aggregate
)
toc = pd.Timestamp.now()
It took less than a minute to resample our 10 million rows of laboratory data of 10,000 distinct subjects and 100,000 time windows by computing both simple and non-standard aggregations:
[48]:
toc - tic
[48]:
Timedelta('0 days 00:00:46.314381')
The output DataFrame is exactly like windows, but with additional columns containing the requested aggregations:
[49]:
resampled.head()
[49]:
| subject_id | timestamp | label | creatinine | hemoglobin | red blood cells | white blood cells | platelets | oxygen saturation | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| start | stop | mean | min | max | mean | std | r0 | r-1 | t0 | t-1 | count | p10 | p90 | |||
| event_id | ||||||||||||||||
| 0 | 9961 | 2014-10-15 10:21:00 | 2014-10-17 10:21:00 | False | 1.994618 | 3.139749 | 19.764566 | 5.396520 | 2.742879 | 126.278639 | 931.995807 | 2014-10-15 10:54:00 | 2014-10-17 09:55:00 | 32.0 | 16.937398 | 78.116129 |
| 1 | 2921 | 2018-05-30 13:06:00 | 2018-06-01 13:06:00 | False | 2.481352 | 3.611202 | 19.850998 | 5.230297 | 3.286836 | 485.478178 | 341.270608 | 2018-05-30 16:25:00 | 2018-06-01 11:28:00 | 28.0 | 10.712035 | 77.491403 |
| 2 | 8244 | 2019-06-15 18:24:00 | 2019-06-17 18:24:00 | True | 2.529095 | 3.636045 | 19.040233 | 5.887074 | 2.811139 | 817.979464 | 200.394162 | 2019-06-15 18:51:00 | 2019-06-17 16:21:00 | 38.0 | 20.086898 | 78.725686 |
| 3 | 4061 | 2014-03-26 00:42:00 | 2014-03-28 00:42:00 | True | 2.056492 | 3.112208 | 19.824033 | 5.146293 | 2.964574 | 22.554426 | 999.518133 | 2014-03-26 02:01:00 | 2014-03-27 22:52:00 | 28.0 | 10.041334 | 86.814089 |
| 4 | 8048 | 2012-05-20 03:34:00 | 2012-05-22 03:34:00 | False | 2.352913 | 4.084322 | 19.743881 | 5.728578 | 3.097373 | 634.744263 | 404.023845 | 2012-05-20 03:48:00 | 2012-05-22 03:22:00 | 30.0 | 11.319649 | 89.229113 |
[50]:
(resampled.index == events.index).all()
[50]:
True
[51]:
(resampled[('subject_id', '')] == events['subject_id']).all()
[51]:
True
resampled can now be used as input to CaTabRa’s automatic data analysis workflow.
Closing Remarks
Time Windows
One of the strengths of CaTabRa’s resample_eav() function is that it can handle arbitrary time windows: they may overlap, they may have unequal lengths (even for the same entity), and they may be infinite (with only start- or stop time, but not both).
Time windows can be constructed manually, as above, or conveniently using function make_windows() in module catabra.util.longitudinal. See the documentation of function make_windows() for more information.
tsfresh Integration
tsfresh is a library for computing sophisticated time-series features from tabular data in a very similar format as the one expected by function resample_eav() (“long” or “stacked” DataFrame). The main difference is that tsfresh does not accept time windows but computes features either for all observations per entity-attribute pair, or in a rolling/forecasting fashion. CaTabRa handles time windows natively and efficiently.
tsfresh can be integrated into resample_eav() by passing a callable to the list of desired aggregations; see the docstring of resample_eav() for details. Note that the input format expected by tsfresh and output format returned by it match almost exactly the input/output format of the callable.
Dask Support
resample_eav() accepts Dask DataFrames as input. This applies both to the table containing the observations (labs in our example) and the table with the time windows. Therefore, even large amounts of data can be efficiently processed.
Time Period Observations
The example presented in this notebook illustrates how isolated observations/measurements can be aggregated to resample longitudinal data in entity-attribute-value format. A similar albeit subtly different data modality are data where observations/measurements do not happen at specific points in time, but over a time period. One simple example, again from the clinical domain, are infusions of medications: they are characterized by start- and end times, and the administered medication amount.
Function `catabra.util.longitudinal.resample_interval() <https://github.com/risc-mi/catabra/tree/main/catabra/util/longitudinal.py>`__ can be used to resample data of that kind. Its API bears close resemblance to that of resample_eav(), with the main difference that the only supported aggregation is summing all observed values in each time window, taking the size of the intersection of the time window with the observation interval into account. See the documentation of
resample_interval() for more information.