What are we really predicting with fMRI in epilepsy surgery?

While memory and language functional magnetic resonance imaging (fMRI) paradigms are becoming ever more refined, the outcome measures they predict following epilepsy surgery tend to remain single scores on pencil-and-paper tests that were developed decades ago and that have repeatedly been shown to bear little relation to patients’ subjective reports of memory problems in the real world. In the fMRI epilepsy surgery literature, the growing imbalance between the increasing sophistication of the predictive paradigms on the one hand and the vintage outcome measures on the other threatens the clinical relevance of studies employing these technologies. This paper examines some of the core principles of assessing neuropsychological outcomes following epilepsy surgery and explores how these may be adapted and applied in fMRI study designs to maximize the clinical relevance of these studies. © 2023 The Author(s)


Introduction
Task-based functional magnetic resonance imaging (fMRI) is gradually making the transition from a research method to a clinical tool in epilepsy surgery centers around the world [1,2]. In a recent systematic review and meta-analysis of the value of fMRI paradigms in predicting cognitive outcomes following epilepsy surgery, Crow et al. (2023) [3] concluded that fMRI is a modest predictor of language and memory outcomes following left temporal lobe surgery, although sex, level of education, age at seizure onset, and disease duration can modify outcomes. The authors examined a wide range of clinical and demographic factors and also analyzed technical variables associated with fMRI paradigms, including the field strength of the magnet and the software used in the analyses.
The meta-analyses from Crow et al. are the latest in a series of impressive and comprehensive reviews of the fMRI paradigms used in the prediction of cognitive outcomes.
No comparable review or meta-analysis of the outcome measures used in these studies has been published. Original studies and literature reviews focus heavily on the characteristics of the paradigms used to predict cognitive outcomes following surgery. Even a cursory skim through the fMRI literature gives the impression that the measure of cognitive outcome being predicted in each study is well established as a 'gold standard'.
In the discussion sections of their papers, authors invariably ponder the changes and modifications that could be made to their fMRI paradigms, analyses, regions of interest, and technical specifications, yet, with few exceptions, their choice of outcome measure rarely merits a sentence. Although shortcomings in the outcome measures may be acknowledged as a limitation, very few studies offer any useful discussion of how representative single scores from individual tests really are of postoperative change following epilepsy surgery.
In this commentary, I will argue that in neglecting to carefully consider the outcome measures used in these research designs, the fMRI outcome prediction literature in epilepsy surgery currently suffers from a very significant imbalance, one which increasingly threatens to limit the clinical applications of the studies employing these technologies.

Basic principles for assessing neuropsychological outcomes following epilepsy surgery
There are a number of basic principles for assessing change in neuropsychological function following epilepsy surgery, or indeed following any clinical intervention in this patient population that may affect cognitive function. These principles should be considered in all research designs examining predictors of outcome following epilepsy surgery if the findings are to have any useful application in a clinical setting. Many factors can influence a person's performance on a neuropsychological test at any given time, including stress, anxiety, fatigue, and even the position of the task within the test battery [4][5][6]. In epilepsy, additional considerations include the proximity of the most recent seizure to the assessment and subclinical epileptiform activity [4]. Scores on performance validity tests (PVTs), which are designed to establish the validity of other neuropsychological test scores, show substantial failure rates in some samples of people with epilepsy, particularly those with temporal lobe epilepsy [5]. Failure on a PVT does not necessarily indicate poor motivation; it simply indicates that the person's level of function at the time of testing is such that their other test scores may not be a reliable or valid indicator of their optimal level of function. These factors mean that test scores must be interpreted with caution in this population. This is routinely done in the clinical setting. However, when these scores are used in research paradigms, these clinical caveats are rarely, if ever, acknowledged or controlled for. For a significant number of patients enrolled in these studies, the baseline measure of function will therefore reflect the influence of several of these factors (and their interactions) in addition to underlying structural and functional integrity, before surgery has even been conducted.

Defining a postoperative decline
A lower score on a neuropsychological test following surgery does not necessarily indicate a postoperative decline in cognitive function. No neuropsychological test has perfect test-retest reliability and, as described above, many factors can influence test performance at any given time. When longitudinal change is assessed, practice effects can also mask deterioration in test-retest designs. To control for natural variation in test performance and for practice effects, genuine changes in performance should be determined using reliable change indices (RCIs) or standardized regression-based norms [6,7]. Simply subtracting the postoperative score from the preoperative one to identify decline, and dichotomizing patients on this basis, is not meaningful in any clinical sense. Such a dichotomy places someone whose score dropped by just one point after surgery in the same group as someone whose score dropped by 20 points or more. Clearly, the clinical outcome is not the same for these patients.
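To make the contrast between raw subtraction and a reliable change approach concrete, the Python sketch below computes one common form of the RCI: the Jacobson-Truax standard error of the difference, with a Chelune-style correction for the mean practice effect. It is illustrative only and is not necessarily the formulation used in [6,7]. The function names, the 90% two-tailed cut-off (z = 1.645), and the normative values (baseline SD, test-retest correlation, mean practice gain) are hypothetical assumptions; in practice they would come from a published test-retest sample for the specific test.

```python
import math


def reliable_change_index(pre: float, post: float, sd_baseline: float,
                          test_retest_r: float,
                          mean_practice_effect: float = 0.0) -> float:
    """Practice-adjusted RCI (hypothetical helper).

    SEM = SD * sqrt(1 - r); SE_diff = sqrt(2 * SEM^2).
    The observed change is reduced by the mean practice gain seen in a
    retest sample before it is divided by SE_diff.
    """
    sem = sd_baseline * math.sqrt(1.0 - test_retest_r)
    se_diff = math.sqrt(2.0 * sem ** 2)
    return ((post - pre) - mean_practice_effect) / se_diff


def classify_change(rci: float, z_crit: float = 1.645) -> str:
    """Label change as reliable only if it exceeds the assumed cut-off."""
    if rci <= -z_crit:
        return "reliable decline"
    if rci >= z_crit:
        return "reliable improvement"
    return "no reliable change"


if __name__ == "__main__":
    # Hypothetical norms: SD = 10, test-retest r = 0.80, mean practice gain = 2.
    for label, (pre, post) in {"A": (50, 49), "B": (50, 30)}.items():
        raw_decline = post < pre  # the naive subtraction-based dichotomy
        rci = reliable_change_index(pre, post, 10, 0.80, 2)
        print(label, raw_decline, round(rci, 2), classify_change(rci))
    # A True -0.47 no reliable change
    # B True -3.48 reliable decline
```

Both hypothetical patients count as "decliners" under raw subtraction, but under these assumed norms only patient B shows a change larger than would be expected from measurement error and practice alone.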
Subtracting the preoperative score from the postoperative score may be useful in research designs that employ a multivariate approach to predict the magnitude of change, but even in these designs, researchers should be cognizant of what these numbers actually mean. If most of the predicted 'declines' are smaller than those identified by reliable change indices, including these patients in the models may introduce a significant amount of noise into the data. The fact that fMRI paradigms can predict change in single scores does not negate this point. First, the amount of variance explained by these models could be improved if these factors were taken into account. Second, the fact that an fMRI paradigm can predict change in a score does not make it a clinically relevant prediction. The fMRI literature in epilepsy is important for what it tells us about brain function and the brain's response to surgery, but ultimately it is the clinical application of these predictions that matters to the patients who participate in these studies (see below).
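The point about sub-threshold 'declines' can be checked directly by rearranging an RCI of the form RCI = (change − practice) / SE_diff, with SE_diff = √2 · SD · √(1 − r), to give the largest raw change (post minus pre) that still counts as a reliable decline, and then asking what share of raw decreases in a sample fall short of it. The sketch below is hypothetical throughout: the function names, the 1.645 cut-off, the norms (SD 10, r 0.80, practice gain 2), and the list of raw changes are invented for illustration and are not drawn from any study.

```python
import math


def raw_decline_cutoff(sd_baseline: float, test_retest_r: float,
                       mean_practice_effect: float = 0.0,
                       z_crit: float = 1.645) -> float:
    """Raw post-minus-pre change at or below which a decline counts as reliable.

    Solves (change - practice) / SE_diff <= -z_crit for `change`.
    """
    sem = sd_baseline * math.sqrt(1.0 - test_retest_r)
    se_diff = math.sqrt(2.0) * sem
    return mean_practice_effect - z_crit * se_diff


def share_of_declines_below_threshold(raw_changes, cutoff: float) -> float:
    """Among raw decreases (change < 0), the fraction too small to be reliable."""
    declines = [c for c in raw_changes if c < 0]
    if not declines:
        return 0.0
    return sum(c > cutoff for c in declines) / len(declines)


if __name__ == "__main__":
    cutoff = raw_decline_cutoff(10, 0.80, 2)  # hypothetical norms
    changes = [-1, -2, -3, -5, -12, -20, 4]   # hypothetical post - pre scores
    print(round(cutoff, 1))                   # -8.4
    print(round(share_of_declines_below_threshold(changes, cutoff), 2))  # 0.67
```

Under these assumed norms a drop of roughly 8.4 points is needed before change exceeds measurement error and practice, and four of the six raw decreases in the toy list fall short of it.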

Capacity to decline
Up to one in two people with temporal lobe epilepsy demonstrate impairments on tests of language or memory function prior to surgery [8]. Many of these people score below the 2nd percentile on standardized tests of cognitive function and are effectively functioning at or below the 'floor' of the test. Every standardized test of cognitive function has a floor and a ceiling. The numeric floor of a test is the lowest possible score that can be obtained (usually zero). The psychometric floor of a standardized test is the score at the 2nd percentile: the point at which the test is no longer sensitive to change, even if the patient can physically obtain a lower score. These are the limits beyond which the test cannot measure function. While someone functioning at the floor of a standardized test before surgery may experience cognitive decline as a result of surgery, the test is not sensitive to this decline. Failure to account for patients who are already functioning at the floor of a test before surgery, and who therefore cannot deteriorate on the outcome measure used in fMRI prediction paradigms, introduces a significant confound in the majority of these studies.
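Capacity to decline can be checked before any outcome analysis. The sketch below flags patients whose baseline falls at or below the psychometric floor (2nd percentile) on a normally distributed standard-score metric, then shows how excluding them changes the observed rate of reliable decline. The normative mean and SD (100 and 15), the field names, and the six-patient cohort are hypothetical assumptions; many tests have skewed norms, in which case published percentile tables would be needed instead of a normal approximation.

```python
from statistics import NormalDist

FLOOR_PERCENTILE = 2.0  # psychometric floor as defined in the text


def baseline_percentile(score: float, norm_mean: float, norm_sd: float) -> float:
    """Percentile of a baseline score, assuming normally distributed norms."""
    return 100.0 * NormalDist(norm_mean, norm_sd).cdf(score)


def at_floor(score: float, norm_mean: float, norm_sd: float) -> bool:
    """True if the test can no longer register further decline for this patient."""
    return baseline_percentile(score, norm_mean, norm_sd) <= FLOOR_PERCENTILE


def reliable_decline_rate(patients, norm_mean, norm_sd, exclude_floor):
    """Proportion with reliable decline, optionally excluding floor cases."""
    eligible = [p for p in patients
                if not (exclude_floor and at_floor(p["baseline"], norm_mean, norm_sd))]
    if not eligible:
        return float("nan")
    return sum(p["reliable_decline"] for p in eligible) / len(eligible)


if __name__ == "__main__":
    # Hypothetical cohort on a mean-100, SD-15 metric.
    cohort = [
        {"baseline": 65, "reliable_decline": False},   # at floor
        {"baseline": 68, "reliable_decline": False},   # at floor
        {"baseline": 90, "reliable_decline": True},
        {"baseline": 95, "reliable_decline": False},
        {"baseline": 105, "reliable_decline": True},
        {"baseline": 110, "reliable_decline": False},
    ]
    print(round(reliable_decline_rate(cohort, 100, 15, exclude_floor=False), 2))  # 0.33
    print(reliable_decline_rate(cohort, 100, 15, exclude_floor=True))             # 0.5
```

A patient at the floor may well have declined; the point is that the test cannot show it, so their "no decline" label carries no information and dilutes the denominator.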
For example, in our surgical series as a whole, approximately one in four patients demonstrate a significant decline (defined by reliable change indices) on a list learning task, one year after surgery. This rises to one in three of those who undergo a left temporal lobe resection. If those who are already functioning at the floor of the test prior to the surgery are excluded, approximately one in two patients who undergo a left temporal lobe resection demonstrate a significant decline in list learning at one year, determined by reliable change indices. Failure to account for the capacity to decline in the outcome measure in predictive studies will lead to classification errors, which reflect the insensitivity of the outcome measure to change rather than stable neuropsychological function.

Group analyses vs individual trajectories
On a related issue, research designs that compare pre- and postoperative scores on a group basis fail to account for important differences in individual trajectories of cognitive function following surgery [9]. It is these trajectories that both patients and clinicians need to be able to predict. Potential surgical candidates are not interested in the group mean following surgery, particularly if the postoperative group mean on a test reflects people who improved, people who experienced no change, and people who experienced a significant decline. Rather, they need to know which of these outcomes they are most likely to experience if they are to make an informed decision about surgery. See Fig. 1.
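A toy example shows how a group mean can hide divergent individual outcomes. The sketch takes a list of per-patient RCIs (invented values, not data) and compares the group-level summary with a per-patient classification, using an assumed cut-off of 1.645, the conventional value for a 90% interval.

```python
from collections import Counter
from statistics import mean


def classify(rci: float, z_crit: float = 1.645) -> str:
    """Per-patient label based on an assumed reliable change cut-off."""
    if rci <= -z_crit:
        return "reliable decline"
    if rci >= z_crit:
        return "reliable improvement"
    return "no reliable change"


def summarize(rcis):
    """Return the group-mean RCI and the count of each individual outcome."""
    return mean(rcis), Counter(classify(r) for r in rcis)


if __name__ == "__main__":
    rcis = [2.1, 1.9, 0.2, -0.3, 0.1, -2.2, -2.4]  # hypothetical patients
    group_mean, counts = summarize(rcis)
    print(round(group_mean, 2), classify(group_mean))  # -0.09 no reliable change
    print(dict(counts))
    # {'reliable improvement': 2, 'no reliable change': 3, 'reliable decline': 2}
```

The group mean sits close to zero and would be read as "no change", while four of the seven hypothetical patients show reliable change, in opposite directions.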

Measures of cognitive outcome
A single score on a single test is a poor reflection of cognitive outcome. In a clinical setting, we would never define 'language' or 'memory' function by one score on a single test, let alone changes in these domains that may be attributed to a treatment effect. A single score from a single test is limited because it does not reflect all of the factors that could have influenced it. It is not that the single score means nothing; it is that it could mean many things. A neuropsychological test score always needs to be interpreted in the context of the person's wider neuropsychological profile. For example, a score of 15/30 on a naming test may represent an island of strength in someone with an IQ of 75 or a focal deficit in someone with an IQ of 120. It may have no lateralizing or localizing significance in someone functioning in the average range intellectually. It may represent the impact of low mood in someone who would otherwise have scored 20/30. Changes in these scores following surgery may reflect secondary changes in other domains or the impact of depression or anxiety. Failure to account for the clinical context and significance of a test score representing 'outcome' in fMRI prediction paradigms introduces noise into the data and reduces the clinical utility of the resulting prediction models.
We have long known about 'task specificity' [10,11], the dissociation of scores on tasks that ostensibly test the same underlying construct. Even in the language domain, there appears to be some variation in performance between the Boston Naming Test and other confrontation naming tests that, at face value, appear to differ only in the choice of stimuli for the line drawings [12]. In the memory domain, the situation is even more complex. For example, scores on verbal memory tests may reflect the ability to take in information embedded in a prose structure, the ability to commit 15 unrelated words to memory over multiple trials, or accelerated forgetting over a short period of time. All of these tests ostensibly assess 'verbal memory', but scores often vary between them.

Ecological validity of neuropsychological test scores
In the context of predicting 'outcome', it is noteworthy that performance on most of these tasks does not correlate well with subjective reports of memory function, either before or after surgery [13]. More generally, subjective complaints of cognitive impairment correlate only weakly with performance on standardized tests [14]. In a series of 1,186 consecutive patients with confirmed epilepsy seen in our department, 80% of those who reported on a subjective memory questionnaire that their memory was a significant nuisance in their everyday life did not demonstrate severe impairments on formal tests [13].
The dissociation between formal test results and subjective complaints reflects a number of factors including differences in the language used by clinicians and people with epilepsy when describing cognitive impairments; the nature of the tests, many of which are deliberately constructed not to replicate real-life function (for example, on standardized list learning tasks, people are not permitted to make written notes); and varying levels of insight, particularly in people with executive dysfunction. (For a full discussion of these issues see Baxendale, 2023).
Although studies show a relationship between quality of life (QOL) and cognitive decline in epilepsy, suggesting that test scores are related to real-world experience [15], it is unclear how this relationship is mediated by the impact of low mood on both variables, particularly given the diffuse nature of the cognitive domains affected.
Concerns about the ecological validity of test scores do not outweigh the value of neuropsychological scores as objective measures of cognitive outcome, nor do they negate their inclusion in predictive models of outcome in fMRI studies. However, they do shape the narrative with respect to the clinical applications of these studies. These caveats are by no means limited to fMRI studies; they apply to all predictions of cognitive outcome used in presurgical counseling about the likely cognitive costs of surgery [16].

Timing of the postoperative assessment
At a group level, cognitive function changes over the first 12 months following surgery, with a dip immediately after the surgical insult and a gradual recovery to 12 months, after which function tends to plateau. As noted above, however, individual trajectories of recovery can vary significantly in relation to a variety of factors, including mood [17], postoperative seizure control [18], and biomarkers of brain plasticity and possible reorganization [19][20][21]. The timing of the 'outcome' assessment within the postsurgical timeframe can affect which variables emerge as significant predictors in fMRI studies. This is neatly illustrated in a recent paper by Binding et al. [22], who report changes in language function on the Graded Naming Test, assessed with reliable change indices, at 3 months and one year after surgery. Surgical damage to the arcuate and inferior fronto-occipital tracts was associated with deterioration on the naming task at 3 months after surgery but not at the one-year assessment. The patterns of apparent deterioration in language function reported in this study may well reflect trajectories of recovery rather than the ultimate outcome. As with traumatic brain injury, the dynamic nature of cognitive recovery following surgery means that, in any outcome study, all participants should be assessed at the same point in their recovery, and the significance of that point in the recovery trajectory must be considered when interpreting and discussing the results.

Measuring outcome: current practice
In their thorough meta-analyses, Crow et al. (2023) [3] identified 18 studies that used fMRI to predict cognitive outcomes and reported original data. Many of these studies were correlational in design and used regression analyses to predict the magnitude of the difference between pre- and postoperative scores on single measures of cognitive function. None of these studies considered, in their design, the validity of their baseline measure of function or whether their participants were functioning at or close to the floor of the test before surgery. It is noteworthy that the majority of the studies predicting language outcome measured outcomes at or before 6 months after surgery [23][24][25][26], a time when neuropsychological recovery is likely to be incomplete. See Table 1. Given that predictors of outcome at three months after surgery are not stable predictors of change measured at one year [22], the clinical relevance of studies based on outcomes assessed before one year is unclear. Of the three studies in the meta-analyses that included patients whose language outcomes were assessed beyond one year, none used a standardized follow-up period; participants were assessed at intervals ranging from less than one month to more than five years [3,27,28], again introducing a significant confound into the design.
It is difficult to plot the trajectories of cognitive recovery following epilepsy surgery accurately because of the practice effects associated with repeated testing. However, a number of sources indicate that cognitive outcomes may not have stabilized by 6 months. Alpherts et al. (2006) [29] examined cognitive performance at 6 months, 2 years, and 6 years after epilepsy surgery and found dynamic changes in verbal memory function in some of their patients for up to two years after surgery. Although different mechanisms are at play in neurosurgical patients, in patients with brain injury dynamic recovery is expected over twelve months, and an assessment before this point would not be expected to capture the full extent of recovery (Rabinowitz et al., 2018).

Conclusions
There is a clear consensus that epilepsy surgery is an effective treatment for some people with medically intractable epilepsy [30]. Nevertheless, it is an elective procedure with potentially significant and irreversible effects on neuropsychological function [16]. It is therefore imperative that prospective surgical candidates are given as much information as possible about likely postoperative outcomes before surgery, so that they are able to make an informed decision. Multivariate models predicting cognitive outcomes that incorporate fMRI measures of structural and functional integrity address many of the methodological shortcomings in the literature. In some ways, however, these designs add even further to the imbalance described above. Functional MRI and statistical methods are becoming ever more sophisticated, but there will be diminishing returns if the 'outcome' they are predicting remains a single test score. Functional MRI techniques have the potential to provide valuable prognostic data in this respect, but a tendency to treat single scores on single tests, obtained at relatively early points in postoperative recovery, as gold-standard measures of outcome limits the clinical value and applications of these studies. A number of new initiatives in neuropsychology could help researchers interpret outcome scores in a more clinically sensitive way. These include the International Classification of Cognitive Disorders in Epilepsy (IC-CoDE) [31], innovative methods of transforming scores that reference baseline function [32], and adjustments for the influence of mood and other factors, such as educational background and socioeconomic status, on test performance [33][34][35].
The purpose of this editorial is not to discount or discredit the impressive fMRI literature in epilepsy surgery to date. The fact that these studies have established a sound and growing evidence base for the prediction of cognitive outcomes, despite the noise in the outcome data, is a testament to the powerful clinical potential of these methods. The literature tells us much about the ways in which function can be reorganized following surgery and how these changes might underpin some of the declines in memory and language function that we see in the clinic and that patients report after surgery. This commentary is not written to dismiss this literature but rather to highlight the huge untapped potential of prediction models using these techniques. Incorporating some of the basic principles of neuropsychological outcome research could considerably strengthen the direct clinical applications of findings in the fMRI epilepsy surgery literature in the future.