Dynamic treatment regimes

In medical research, a dynamic treatment regime is a set of sequential decision rules defining what actions should be taken to treat a patient based on information observed up to that point. Also referred to as adaptive treatment strategies, dynamic treatment regimes attempt to individualize treatment for patients while operationalizing clinical practice dealing with chronic illnesses. The goal of a dynamic treatment regime is to define the sequence of treatment actions which result in the most favorable clinical outcome possible.

History
Historically, medical research and the practice of medicine tended to rely on an acute care model for the treatment of all medical problems, including chronic illness. More recently the medical field has begun to look at long term care plans to treat patients with a chronic illness. This shift in ideology, coupled with increased demand for evidence based medicine and individualized care led to the application of sequential decision making research to medical problems and the formulation of dynamic treatment regimes.

Notation and Formulation
For a series of decision time points, $$t = 1, \ldots, T$$, define $$ A_t $$ to be the treatment action taken at time point $$t$$, and define $$ O_t $$ to be all clinical observations taken at time $$t$$, prior to treatment action $$ A_t $$. Then a dynamic treatment regime, $$ \pi = (\pi_1,...,\pi_t) $$ consists of a set of rules for each time point $$t$$ for choosing treatment $$ A_t $$, based clinical observations $$ O_t $$. Thus $$ \pi_t(o_1, a_1, ..., o_t,a_t) $$, is a function of the past and current observations, $$ (o_1, ..., o_t) $$ and past actions $$ (a_1, ..., a_{t-1}) $$, which returns a choice of the current action, $$ a_t $$. The goal of the dynamic treatment regime is to make decisions that result in the best possible clinical outcome or response, $$ R $$.

For example, the goal of cancer treatment is to achieve and maintain remission if possible or to prolong and improve quality of life when remission is not possible. The treatment for cancer is often complex and may require many different decisions throughout the process. Some of these decisions may include: A particular dynamic treatment regime might answer some of these questions as follows: Here treatment actions, $$ A_t $$, are listed in the boxes branching out from the box labeled 'CANCER' and clinical observations, $$ O_t $$, are listed next to the arrows between boxes. An example of a dynamic treatment regime that has been suggested for use in clinical practice is ...
 * Should the patient have surgery to remove the cancer?
 * Would the patient benefit from chemo/radiation therapy?
 * How long should the patient be on chemo/radiation therapy?
 * How will side effects of treatment be managed?
 * What type of maintenance care should the patient have following chemo/radiation therapy?

Optimization
For a dynamic treatment regime, $$ \pi $$, to achieve the most favorable clinical outcome possible it must maximize the average outcome. In other words, the goal is to find a dynamic regime $$ \pi^* $$ for which:

$$ \pi^* = \max_{\pi}{E[R |\text{ actions are chosen according to }\pi]}$$

The quantity $$ E[R | \text{ actions are chosen according to }\pi]$$ is often referred to as the Value of $$ \pi $$.

Methods for developing dynamic treatment regimes
The development of dynamic treatment regimes that result in optimal clinical outcomes is a process which involves the consideration of many different issues. A brief review of some of this process is given below.

Delayed Effects
To find an optimal dynamic treatment regimes it might seem reasonable just to collect data from individual trials over different time points, finding the optimal treatment at each time point and then combine these treatment steps to create a dynamic treatment regime. However, this type of data collection and analysis is shortsighted and will often result in an inferior dynamic treatment regime. Many treatments can have effects that do not occur until after the immediate treatment outcome has been measured, such as improving the effect of a future treatment or long term side effects which prevent a patient from being able to use an alternate useful treatment in the future. For example, cognitive behavioral therapy may not result in the best short term outcome of firs-tline monotherapies for drug addiction. However it may lead to dramatic improvements in future outcomes when followed by a combination drug and therapy treatment. This could happen if the cognitive behavioral therapy was useful for helping patients learn how to best use future behavioral therapies, but was not a strong first-line monotherapy treatment. This issue is often referred to as 'delayed effects' and is an important concept to consider when trying to find optimal dynamic treatment regimes. Data collection and analysis should always encompass the treatment regime, not just the individual treatments within the regime.

Data
Data analysis is needed to find optimal dynamic treatment regimes. The data used to find optimal dynamic treatment regimes should consist of the sequence of observations and actions $$(o_1,a_1,o_2,a_2,..,o_T,a_T)_i $$, for multiple patients $$ i=1,...,n$$ along with the patients outcomes $$ R_j $$.

While this type of data can be obtained through careful observation, it is often preferable to collect data through experimentation if possible. The use of experimental data, where treatments have been randomly assigned, is preferred due to the fact that we are typically only able to see a patient's outcome under the treatment sequence they have been given and not under alternate treatment sequences. Randomizing treatments helps eliminate bias caused by confounding variables which affect both the choice of the treatment and the clinical outcome. This is especially important when dealing with sequential treatments since these biases can compound over time.

Experimental design
Experimental design for dynamic treatment regimes basically involves an initial randomization of patients to treatments, followed by re-randomizations at each subsequent time point of some or all of the patients to another treatment. The re-randomizations at each subsequent time point my depend on information collected after previous treatments, but prior to assigning the new treatment, such as how successful the prior treatment was. These types of trials were introduced and developed in, and  and are often referred to as SMART trials (Sequential Multiple Assignment Randomized Trail). Some examples of SMART trials are the CATIE trial for treatment of Alzheimer's and the STAR*D trial for treatment of depression.

SMART trials attempt to conform better with the way clinical practice actually occurs, but still retain the advantages of experimentation over observation. They are clearly more complicated and high dimensional than just individually testing a set of treatments for a given time point. However, they are necessary for dealing with the problem of delayed effects. Several suggestions have been made to attempt to reduce complexity and resources needed such as combining data over same treatment sequences within different treatment regimes, splitting up trials into screening, refining and confirmatory trials rather than trying to attack the problem in a single trial, using fractional factorial designs rather than a full factorial design , and targeting primary analysis to simple regime comparisons.

Reward construction
A critical part of finding the best dynamic treatment regime is the construction of a meaningful and comprehensive outcome variable, $$ R $$. To construct a useful outcome variable, the goals of the treatment need to be well defined and quantifiable. The goals of the treatment can include multiple aspects of a patient's health and welfare, such as degree of symptoms, treatment side effects, time until treatment response, quality of life and cost. However, quantifying the various aspects of a successful treatment with single function can be difficult. The outcome variable should reflect how successful the treatment regime was in achieving the overall goals for each patient.

Variable selection and feature construction
Analysis is often improved by the collection of any variables that might be related to the illness or the treatment. This is especially important when data is collected by observation, to avoid bias in the analysis due to unmeasured confounders. Subsequently more observation variables are collected than are actually needed to estimate optimal dynamic treatment regimes. Thus variable selection is often required as a preprocessing step on the data before algorithms used to find the best dynamic treatment regime are employed.

Algorithms and Inference
Given an experimental data set, the optimal dynamic treatment regime can be estimated from the data using a number of different algorithms. Inference can also be done to determine if the estimated optimal dynamic treatment regime results in significant improvements in the expected outcome over an alternated treatment regime.

Several algorithms exist for estimating optimal dynamic treatment regimes from data. Many of these algorithms were developed in the field of computer science to help robots and computers make optimal decisions in an interactive environment. These types of algorithms are often referred to as reinforcement learning methods . The most popular of these methods used to estimate dynamic treatment regimes is called q-learning. In q-learning models are fit sequentially to estimate the Value of the treatment regime used to collect the data and then the models are optimized with respect to the treatment actions to find the best dynamic treatment regime. Many variations of this algorithm exist including modeling only portions of the Value of the treatment regime.