Prediction of the clinical and naming status after anterior temporal lobe resection in patients with epilepsy

By assessing the cognitive capital, neuropsychological evaluation (NPE) plays a vital role in the perioperative workup of patients with refractory focal epilepsy. In this retrospective study, we used cutting-edge statistical approaches to examine a group of 47 patients with refractory temporal lobe epilepsy (TLE), who underwent standard anterior temporal lobectomy (ATL). Our objective was to determine whether NPE may represent a robust predictor of the postoperative status, two years after surgery. Specifically, based on pre- and postsurgical neuropsychological data, we estimated the sensitivity of cognitive indicators to predict and to disentangle phenotypes associated with more or less favorable outcomes. Engel (ENG) scores were used to assess clinical outcome, and picture naming (NAM) performance to estimate naming status. Two methods were applied: (a) machine learning (ML) to explore cognitive sensitivity to postoperative outcomes; and (b) graph theory (GT) to assess network properties reflecting favorable vs. less favorable phenotypes after surgery. Specific neuropsychological indices assessing language, memory, and executive functions can globally predict outcomes. Interestingly, preoperative cognitive networks associated with poor postsurgical outcome already exhibit an atypical, highly modular and less densely interconnected configuration. We provide statistical and clinical tools to anticipate the condition after surgery and achieve a more personalized clinical management. Our results also shed light on possible mechanisms put in place for cognitive adaptation after acute injury of central nervous system in relation with surgery.


Introduction
The central role of neuropsychology in epilepsy is historically rooted, notably in the context of neurosurgery for refractory epilepsy. Major discoveries such as Penfield's homunculus [1], the functional specialization of the hippocampal-temporal lobe complex in memory [2,3], or the functional lateralization of certain cognitive functions as observed via the Wada test [4] or surgical callosotomy and "split brain" patients [5] have been documented in this particular context [6]. Neuropsychology and neurosurgery have thus been, and still remain, reciprocally beneficial in the understanding of neurocognitive mechanisms [7].
Cognitive impairment, as measured by neuropsychological assessment, is in fact the most common comorbidity in patients with epilepsy, whether generalized or focal [8]. It is now well established that focal epilepsy is a systemic neurological pathology that disrupts large-scale functional and structural brain networks and is accompanied by long-term persistent cognitive symptoms (e.g., [9][10][11][12] for recent studies). The degree of cognitive impairment reportedly worsens over time, suggesting that chronic epileptic insult (i.e., the recurrence of seizures) causes it, as evidenced by the term "epileptic dementia" introduced in the early 19th century to characterize progressive cognitive decline [13]. However, cognitive impairments actually occur very early and are frequently already detectable at epilepsy onset. It is estimated that 70% of adult patients with newly diagnosed epilepsy have at least one proven cognitive deficit well before the introduction of medication [14].
Despite the reported progressive deterioration, the intensity of cognitive symptoms generally remains mild to moderate in focal epilepsy [15]. Regarding the most common form of focal epilepsy in adults, temporal lobe epilepsy (TLE), lasting interictal language and memory deficits of varying degrees of severity are commonly noticed [16][17][18][19]. The temporal lobe is known to support language and memory networks (some authors even propose that it is the "language-and-memory interface"; see also [21]), which may explain the vulnerability of these functions in TLE. However, in addition to language and memory, the temporal lobe is also involved in large-scale functional circuitry and other cognitive processes, potentially inducing various cognitive deficits (executive functioning, social cognition, or face recognition [22,23]). Recent studies have, in this respect, stressed a variety of cognitive phenotypes in patients with TLE [24,25], moderating the conjunction of language-memory deficits traditionally associated with patients with TLE. Undoubtedly, there is no single cognitive profile associated with TLE, but a panel of phenotypes whose expression depends on many factors [26].
Neuropsychological evaluation (NPE) is an essential tool for clinical diagnosis and management of patients with epilepsy presenting with suboptimal cognitive functioning. Routine objective screening is indeed essential considering that there is a substantial underestimation of cognitive deficits in epilepsy when their assessment is based solely on subjective complaints (this relates to language, memory, and executive performance; [14] for an estimate). As an indicator of cognitive status before invasive neurosurgery, NPE can also help to detect, localize, and lateralize brain dysfunctions and to formulate hypotheses about the epileptogenic networks involved. Consistent with other studies [27,28], we have indeed previously demonstrated the usefulness of NPE in diagnosing epilepsy localization and the demonstrated sensitivity of certain neuropsychological indices to epilepsy lateralization and localization [29]. Even more, NPE performed in epilepsy surgery as an objective control of preoperative cognitive status can be used to prognosticate postsurgical cognitive and functional outcome, allowing for timely initiation of tailored cognitive remediation if necessary [30].
However, although several studies have sought to identify clinical and cognitive factors that are useful in predicting patients' postoperative status ( [31,32], for two recent studies on large patient samples), there is still a lack of benchmark about the most relevant cognitive indices (i.e., in terms of sensitivity, reliability, and validity) for predicting postoperative outcomes. The main objective of this study was therefore to provide information regarding this important consideration.
To this end, we first identified relevant presurgical neuropsychological predictors of (1) clinical and (2) naming long-term outcomes after temporal lobe resection (2 years after anterior temporal lobectomy; ATL) in patients with drug-resistant TLE. Clinical outcome was determined using the Engel (ENG) score [33,34], which quantifies the success of neurosurgery in the context of epilepsy [35]. Concerning the postoperative naming outcome, we focused on picture naming ability assessed by the DO80 [36] (a French equivalent of the Boston Naming Test: BNT [37]). Despite some psychometric limitations (mentioned in the discussion section), this task presents substantial clinical advantages. Firstly, lexical access involved in object naming is a major determinant of patients' quality of life and return to work after surgery (e.g., [38] in patients with low-grade glioma). Secondly, naming tests, and picture naming in particular, are a gold standard [39], widely used in temporal epilepsy [17]. Lastly, ATL can produce cognitive "side effects", and naming deficits in particular were shown to persist (persistent dysnomia can be noted in up to 60% of cases depending on the study [40], for a systematic review; in one third of left temporal patients [41], for a review using weighted average estimate of naming decline). For these reasons and in this context, it is one of the most important cognitive indicators to predict.
We performed state-of-the-art, data-driven analyses. We used machine learning (ML) algorithms to identify latent relationships among our predictors and to make predictions regarding selected outcomes. We applied model-agnostic methods to facilitate the interpretation of ML results (black box insight [42]) and cognitive phenotypes. Finally, and as a complementary approach to identify factors related to the different outcomes, we modeled the presurgical neuropsychological networks of patients whose surgery was effective versus less effective. On these networks we performed graph theory (GT) analyses to define phenotypes and characterize fundamental differences regarding the preoperative neuropsychological architecture according to postoperative outcome. Graph theory applied to neuropsychological scores indeed offers an optimal viewing angle, adapted to the current vision of the cognitive system as a scaffold of interactive links between cognitive domains [43][44][45].

Population
Forty-seven patients with unilateral and drug-resistant TLE diagnosed in accordance with the clinical criteria described in ILAE committee report [46] were included in the study. Patient inclusion criteria were as follows: (1) older than 18 years of age; (2) had undergone standard ATL resection; (3) had complete data for all variables of interest. Patients with neurological comorbidities such as stroke, tumor, or neurodegenerative disease and/or a prior neurosurgery were systematically excluded.

Clinical and neuropsychological evaluations
Presurgical assessments, including a comprehensive neurological examination and an NPE for each patient, were performed between 2014 and 2019 with the same protocol. Neuropsychological evaluations were conducted by a specialized neuropsychologist in the epilepsy unit of the Grenoble Alpes University Hospital. Clinical information and performance on 32 neuropsychological indices providing a comprehensive overview of cognitive functioning before neurosurgery were collected (Table S1 includes all collected data; see Appendix S1 for details and explanations about clinical and neuropsychological variables). A postoperative evaluation was also systematically performed two years after surgery (M = 2.06 years after neurosurgery; SD = 0.22). Two main postoperative factors were selected as the long-term neurosurgery outcomes to be predicted: (ENG) the Engel score and (NAM) the postoperative object naming score accounting for the baseline (i.e., the preoperative naming performance). We determined two respective classes for each of these postoperative factors, namely:
- (ENG+) optimal postoperative clinical outcome: Engel score of I, no seizures observed 2 years after surgery;
- (ENG−) suboptimal postoperative clinical outcome: Engel score of II-IV, persistence of more or less frequent and more or less severe seizures 2 years after surgery;
- (NAM+) optimal postoperative naming outcome: the naming score at 2 years after surgery did not decline or even improved compared to the presurgical baseline;
- (NAM−) suboptimal postoperative naming outcome: the naming score at 2 years after surgery has decreased compared to the presurgical baseline.
Note that to determine whether the change in individual NAM performance was meaningful, we calculated a Reliable Change Index (RCI) according to Chelune's methodology ([47]; see also [48] for a description), based on the DO80 naming scores. RCIs below −1.28 (α = 0.1, i.e., a 90% one-sided confidence interval) were considered as a reliable NAM decline (NAM−). Otherwise, patients' performance was considered to be without significant NAM decline (and therefore categorized in the NAM+ condition).
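As an illustration of the classification rule above, the following sketch computes a practice-adjusted RCI in one common formulation (difference score minus the mean practice effect, divided by the standard error of the difference) and applies the −1.28 cutoff. The function names, the test-retest reliability, and all numeric values are hypothetical; they are not taken from the study data or from the DO80 norms.

```python
import math

def reliable_change_index(pre, post, sd_baseline, retest_reliability,
                          practice_effect=0.0):
    """Practice-adjusted RCI (one common formulation, assumed here).

    SEM     = SD * sqrt(1 - r_xx)   (standard error of measurement)
    SE_diff = sqrt(2) * SEM         (standard error of the difference)
    RCI     = (post - pre - practice_effect) / SE_diff
    """
    sem = sd_baseline * math.sqrt(1.0 - retest_reliability)
    se_diff = math.sqrt(2.0) * sem
    return (post - pre - practice_effect) / se_diff

def naming_class(rci, cutoff=-1.28):
    """Return 'NAM-' for a reliable decline (RCI below the cutoff), else 'NAM+'."""
    return "NAM-" if rci < cutoff else "NAM+"

# Hypothetical patient: 78/80 before surgery, 72/80 two years after.
rci = reliable_change_index(pre=78, post=72, sd_baseline=3.0,
                            retest_reliability=0.8)
print(round(rci, 2), naming_class(rci))
```

With a one-sided cutoff, only declines are flagged: a patient whose score improves is always assigned to NAM+, which matches the class definitions given earlier.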
Patients belonging to the different classes (ENG−/ENG+ and NAM−/NAM+) were matched by age, education, duration of epilepsy, number of AEDs, age at surgery, and distance of postoperative evaluations from surgery date (Table S2 for statistical reports). Table 1 below summarizes the proportion of patients associated with the different classes based on the main and binarized clinical variables.

Method
We applied two complementary methods, in parallel, to both the postoperative clinical outcome (ENG) and the postoperative naming outcome (NAM): a predictive machine learning approach (ML), followed by a network analysis using GT. A graphical and general outline of the method is presented below (Fig. 1). Code and data are openly available in the online repository: https://github.com/ltor-lpnc/neuropsy-2021.git.

Predictive ML approach
We used an approach similar to the one we previously employed [29] to estimate the ability of neuropsychological scores to predict the postoperative status and to identify the most discriminating scores and their interactions. To this end, we developed several machine learning workflows, each time trying three algorithms belonging to different families: (a) a classical Support Vector Machine (SVM) algorithm [49] with a Radial Basis Function (RBF) kernel, (b) the XGBoost algorithm [50], and (c) an L2-penalized logistic regression. Different algorithms were evaluated to avoid sub-optimal results and because we had no reason to favor a particular algorithm. In practice, a binary classification was conducted on the full dataset (n = 47 patients; no missing values) with a repeated cross-validation (CV) scheme including a feature selection step in each training set. The prediction performance of the most relevant and stable features was then measured once again with the same CV scheme. We then applied model-agnostic methods to gain insight into the features and their interactions in the prediction.

Binary classification
Our purpose was to predict two kinds of outcomes separately (ENG and NAM) with the help of neuropsychological indices. In order to make robust predictions, we restricted the number of features to 9 main neuropsychological indices (as used in our previous work) to take into account: (1) the patients' sample size and (2) the curse of dimensionality [51]. We first standardized the 9 features of interest, as required by some algorithms.
In practice, we trained a model to assign each patient to the correct group (ENG+/ENG− or NAM+/NAM−). We used a classical 5-fold CV scheme repeated 100 times. The CV was stratified, i.e., samples were randomly chosen so that each fold preserved the ratio of classes that existed in the original dataset. We chose the balanced accuracy (BAcc; [52]) as a measure of performance to quantify the quality of predictions. The BAcc has the advantage of dealing with imbalanced datasets and of offering an easy-to-use metric from a clinical perspective. The BAcc formula is as follows:

$$\mathrm{BAcc} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives. The best value is 100% and the worst 0%. BAcc is equal to the arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate). For balanced datasets, the score is equal to classical accuracy.
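To make the metric concrete, here is a minimal sketch of the balanced accuracy as the arithmetic mean of sensitivity and specificity. The function name and the toy labels are illustrative assumptions, and the code assumes that both classes are present in the true labels.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity (TPR) and specificity (TNR) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (sensitivity + specificity)

# Imbalanced toy example: 8 positives, 2 negatives, and a classifier
# that always predicts the majority class.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

In this toy case the plain accuracy looks good (0.8) although the classifier never identifies the minority class, whereas the balanced accuracy falls to chance level (0.5). This is the property that motivates the metric for imbalanced classes.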

Feature selection
From a clinical perspective, it is useful to simplify the model to the most relevant neuropsychological indices (i.e., to perform feature selection) in order to make the ML approach easier to interpret and to use. Penalized linear models are popular for obtaining sparse solutions. Thus, we applied an L2-norm penalty on our dataset. In practice, feature selection was done in each training set with the default threshold implemented in scikit-learn v. 0.24 (i.e., the mean of the feature importances [53]). The selected features were then used to train the algorithm before measuring the performance on the held-out fold. This was repeated 500 times (5-fold cross-validation repeated 100 times) to get a good estimate of feature stability and performance. In addition, we estimated the stability of the selection, which is a very important aspect of the reproducibility of results. It is indeed crucial to know whether the machine learning workflow and the scientific conclusions can be trusted when the feature selection is randomly repeated. To this end, we used as metrics the selection frequencies of the features and the stability measure $\hat{\Phi}$ proposed by Nogueira et al. [54].
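The two stability metrics mentioned above can be sketched as follows. The code computes per-feature selection frequencies and the stability estimator of Nogueira et al. as we understand its published definition: one minus the average unbiased variance of the selection indicators, normalized by the variance expected under random selection of the same average number of features. The function name and the toy selection matrices are hypothetical.

```python
def selection_stability(selections):
    """Selection frequencies and stability of repeated feature selection.

    selections: list of M runs, each a list of d binary indicators
                (1 = feature selected in that run).
    Returns (frequencies, stability). Stability is 1 when every run selects
    the same subset. It is undefined (division by zero) if every run selects
    all features or none.
    """
    m = len(selections)
    d = len(selections[0])
    freqs = [sum(run[f] for run in selections) / m for f in range(d)]
    k_bar = sum(freqs)  # average number of selected features per run
    # Unbiased sample variance of each feature's selection indicator.
    variances = [m / (m - 1) * p * (1 - p) for p in freqs]
    denom = (k_bar / d) * (1 - k_bar / d)
    stability = 1.0 - (sum(variances) / d) / denom
    return freqs, stability

# Three hypothetical CV runs over four features, all selecting the same two.
print(selection_stability([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]))
# Two runs that disagree completely.
print(selection_stability([[1, 0], [0, 1]]))
```

In the study, M would correspond to the 500 training sets and d to the 9 candidate indices.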

Black box insight
Model-agnostic methods are not tied to a specific model and offer interpretability to any machine learning algorithm [55]. We chose the Partial Dependence Plot (PDP) to see whether the relationship between the target and one or two features is linear, monotonic, or more complex. A PDP allows this by showing the marginal effect that one or two features have on the prediction. The method averages the model's predictions over the observed values of all features except the feature of interest F, and measures the changes in prediction for different values of F.
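The averaging step behind a one-feature PDP can be sketched in a few lines. For each grid value of the feature of interest, that value is imposed on every sample, the model is queried, and the predictions are averaged. The `toy_model` below is a hypothetical stand-in (a logistic function with arbitrary coefficients), not the model fitted in the study.

```python
import math

def partial_dependence(predict, X, feature_idx, grid):
    """Average prediction when feature `feature_idx` is forced to each grid value.

    predict: callable taking one sample (list of floats) and returning a number.
    X:       list of samples used to marginalize over the other features.
    """
    curve = []
    for value in grid:
        preds = []
        for row in X:
            modified = list(row)          # copy, so X is left untouched
            modified[feature_idx] = value
            preds.append(predict(modified))
        curve.append(sum(preds) / len(preds))
    return curve

def toy_model(row):
    """Hypothetical classifier: probability from two standardized scores."""
    z = 0.8 * row[0] - 0.5 * row[1]
    return 1.0 / (1.0 + math.exp(-z))

X = [[0.0, 0.0], [1.0, 2.0], [-1.0, 1.0]]
print(partial_dependence(toy_model, X, feature_idx=0, grid=[-1.0, 0.0, 1.0]))
```

A monotonically increasing curve, as obtained here for the first feature, would indicate that higher values of that feature push the prediction toward the positive class regardless of the other feature.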
Considering the dataset reduced to the selected features, we performed a random partition (90% for the training set).
In a similar way to the PDPs, nomograms are a pictorial representation of a model and provide complementary and useful information about the prediction. Clinical nomograms use variables or features to graphically depict a prognostic model that generates the probability of an event, in our case the clinical and the naming postoperative status. We developed clinical nomograms based on a logistic regression applied to the ML-selected cognitive indices, using the rms R package. The quality of the validation was assessed by bootstrapping and measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve.

Networks and GT approach
Graph theory is a well-established theory using many mathematical methods to study networks. We present here a dual approach based on GT to take a broader view of the relationships in-between (1) patients (networks of individuals) and then (2) neuropsychological indices (cognitive networks). Graphs are defined as a set of nodes connected by edges which represent a relation between the connected nodes. Since we were interested in similarity/proximity relations between the nodes, our networks were composed of weighted edges that are not directed (undirected weighted graphs). The whole workflow was realized with iGraph library [56].
Patients' networks
General overview.
We first considered patients as nodes to identify similarities/dissimilarities between their neuropsychological profiles. The link between nodes (i.e., the edge) was obtained from the neuropsychological indices selected in the ML feature selection analyses. We therefore constructed two independent networks: one based on the neuropsychological tests selected for the surgery outcome (ENG patients' network), and a second based on the neuropsychological tests selected for the naming outcome (NAM patients' network). Patients with similar characteristics tend to cluster together, and community analyses applied to the patients' networks inform about the shared characteristics by comparing the detected communities to the clinical data.

Graph construction.
As with other studies (e.g., [24]), a proximity measure (based on Euclidean distance) was used to construct the patients' networks. To prevent any single variable (here, a neuropsychological index) from creating unrealistically large inter-sample differences, we standardized the dataset. We then implemented the proximity measure as:

$$\mathrm{proximity}(\mathrm{Node}_A, \mathrm{Node}_B) = \frac{\lceil \mathrm{distance}_{\max} \rceil - \mathrm{distance}(\mathrm{Node}_A, \mathrm{Node}_B)}{\lceil \mathrm{distance}_{\max} \rceil - \mathrm{distance}_{\min}}$$

where $\lceil \mathrm{distance}_{\max} \rceil$ is the ceiling of the maximal standardized Euclidean distance observed between two nodes in the dataset; $\mathrm{distance}_{\min}$ is the minimal standardized Euclidean distance observed; and $\mathrm{distance}(\mathrm{Node}_A, \mathrm{Node}_B)$ is the standardized Euclidean distance between the two nodes/patients of interest.
The proximity maximal value is 1 and it corresponds to the minimal distance and the minimal value is close to zero and it corresponds to the maximal distance. This standardized Euclidean distance was calculated with the neuropsychological features selected in the ML workflow each for the ENG (4 features) and then for the NAM condition (4 features also), independently.
We obtained in this way two complete weighted graphs (ENG patients' graph and NAM patients' graph), with no thresholding assumption. Every pair of nodes is connected but with edges of different weights (weighted graphs), according to the proximity. A higher weight means a smaller distance between the patients, this way a community of patients detected in the graph reveals a proximity of their neuropsychological profiles composed with the 4 selected indices.
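The graph construction described above can be sketched as follows, assuming a linear rescaling of distances in which the minimal observed distance maps to a proximity of 1 and the ceiling of the maximal observed distance maps to 0. This rescaling is our reading of the description, not a confirmed reproduction of the authors' code. The patient identifiers and scores are hypothetical.

```python
import math
from itertools import combinations

def proximity_graph(profiles):
    """Complete weighted graph of patients from standardized score profiles.

    profiles: dict mapping a patient id to a list of standardized scores.
    Returns a dict {(id_a, id_b): proximity} with one entry per pair.
    Assumes the ceiling of the maximal distance differs from the minimum.
    """
    dist = {
        (a, b): math.dist(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }
    d_max = math.ceil(max(dist.values()))  # ceiling of the largest distance
    d_min = min(dist.values())
    return {pair: (d_max - d) / (d_max - d_min) for pair, d in dist.items()}

# Three hypothetical patients described by two standardized indices.
profiles = {"A": [0.0, 0.0], "B": [0.0, 1.0], "C": [3.0, 4.5]}
for pair, weight in proximity_graph(profiles).items():
    print(pair, round(weight, 3))
```

Because the ceiling is used for the upper bound, the most distant pair keeps a small positive weight, so the graph stays complete without any thresholding.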
Graph statistics.
Based on the respective patients' graphs, we used a state-of-the-art community detection method to detect clusters of patients: the Louvain/Multilevel algorithm [57,58]. Communities appear in the weighted network if there are groups of nodes with strong internal connections and weak external (between-group) connections. For both graphs we detected exactly two communities, whose membership was compared to binary clinical variables (such as hemispherical laterality, epilepsy severity, etc.; the variables described in Table 1) with a Simple Matching Coefficient (SMC). The SMC is a simple and intuitive way to give the ratio of coincidence between the binary labels: 0% means that the labels have nothing in common and 100% that they have identical sequences. The SMC is the ratio of the number of mutual presences and mutual absences to the length of the binary sequence.
[59]. This procedure allowed us to obtain 4 sparse graphs or networks according to the postsurgery condition (ENG+/ENG−; NAM+/NAM−).
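The Simple Matching Coefficient used to compare community membership with binary clinical variables reduces to the proportion of positions at which two binary label sequences agree. The sketch below uses hypothetical labels for five patients.

```python
def simple_matching_coefficient(labels_a, labels_b):
    """Share of positions where two binary sequences agree (0.0 to 1.0)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)

# Hypothetical example: detected community (0/1) versus a binary
# clinical variable coded 0/1 for the same five patients.
community = [1, 1, 0, 0, 1]
clinical = [1, 0, 0, 0, 1]
print(simple_matching_coefficient(community, clinical))
```

One design point to keep in mind is that community labels produced by a detection algorithm are arbitrary, so the coding of the two communities has to be fixed before the coefficient is interpreted. How this was handled in the study is not stated here.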

Graph statistics.
We applied both global and local measurements on the cognitive graphs. Community detection was performed with Multilevel/Louvain algorithm. We reported the (global) modularity index showing how strongly separated the different clusters are from each other. Density was also reported. This measure shows how connected the network is compared to a full graph by calculating the ratio of the number of edges and the number of possible edges. Finally, we estimated the average degree which is simply the average number of edges per node in the graph.
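The three global measures can be written compactly for an undirected graph stored as a dictionary of weighted edges. This is a sketch with hypothetical function names and a toy graph. The modularity function evaluates a given partition and does not perform the Louvain optimization itself.

```python
from collections import defaultdict

def density(n_nodes, edges):
    """Number of edges divided by the number of possible edges."""
    return len(edges) / (n_nodes * (n_nodes - 1) / 2)

def average_degree(n_nodes, edges):
    """Average number of edges per node (each edge touches two nodes)."""
    return 2 * len(edges) / n_nodes

def modularity(edges, membership):
    """Newman's Q for a weighted undirected graph and a given partition.

    edges:      dict {(u, v): weight}
    membership: dict {node: community label}
    """
    total = sum(edges.values())
    intra = defaultdict(float)          # weight inside each community
    comm_strength = defaultdict(float)  # summed node strength per community
    for (u, v), w in edges.items():
        comm_strength[membership[u]] += w
        comm_strength[membership[v]] += w
        if membership[u] == membership[v]:
            intra[membership[u]] += w
    return sum(intra[c] / total - (comm_strength[c] / (2 * total)) ** 2
               for c in comm_strength)

# Toy graph: two triangles joined by a single bridge edge.
toy_edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
             (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0, (2, 3): 1.0}
toy_membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(density(6, toy_edges), average_degree(6, toy_edges),
      modularity(toy_edges, toy_membership))
```

Density and average degree are tied by a simple identity: the average degree equals the density multiplied by (n − 1), where n is the number of nodes.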
Regarding the local measures computed at the nodal level, we deployed a bootstrap strategy to make inferences about the populations. 1000 resampling iterations were done on each patient sub-group. Each time, a graph was built and two complementary centralities were measured on every node: the strength and the clustering coefficient. The former represents the sum of the weights of the edges adjacent to each node, and the latter quantifies how close the node's neighbors are to forming a complete graph. In other words, the clustering coefficient measures the local density in a network, i.e., the tendency to form a highly connected neighborhood.
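The two nodal measures can be illustrated on a small weighted graph. This sketch uses the unweighted definition of the local clustering coefficient and returns 0.0 for nodes with fewer than two neighbors. Both choices are assumptions made for simplicity and may differ from the options used with the igraph library in the study. Node names and weights are hypothetical.

```python
from itertools import combinations

def node_strength(edges, node):
    """Sum of the weights of all edges adjacent to `node`."""
    return sum(w for (u, v), w in edges.items() if node in (u, v))

def local_clustering(edges, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    neighbors = {v if u == node else u for (u, v) in edges if node in (u, v)}
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for nodes with < 2 neighbors
    linked = sum(1 for a, b in combinations(neighbors, 2)
                 if (a, b) in edges or (b, a) in edges)
    return 2 * linked / (k * (k - 1))

# Hypothetical cognitive graph: nodes are indices, weights are edge strengths.
edges = {("A", "B"): 0.9, ("A", "C"): 0.4, ("B", "C"): 0.7, ("C", "D"): 0.2}
for node in "ABCD":
    print(node, round(node_strength(edges, node), 2),
          round(local_clustering(edges, node), 2))
```

In a bootstrap setting such as the one described above, these two functions would be evaluated on the graph rebuilt at each resampling iteration, yielding a distribution of values per node.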

Predictive power and validity of the neuropsychological indices
Among the three different algorithms tested on our dataset, the best performance was obtained each time by the linear algorithm: the L2 logistic regression. The following results are thus obtained using this algorithm.

Prediction of the clinical outcome
We obtained a very high classification performance (BAcc) for the prediction of the clinical outcome (ENG score): mean performance = 82.5% ± 13.7%; precision = 93.1% ± 11.8%; recall = 77.3% ± 16.7% (see Appendix S2 for the distribution of performances and Fig. 2 Panel A for the confusion matrix). The feature selection approach (cf. Fig. 2 Panel B) shows a very good stability of $\hat{\Phi}$ = 88% with k = 4 features selected (in terms of selection frequencies: VMI = 99%, AMI = 98.2%, NAM = 94% and TMT = 81.6%). The prediction of ENG based on the 4 selected features is further improved: BAcc = 84.4% ± 3.7%, with a remarkable level of prediction for the class corresponding to a worse postoperative clinical prognosis in particular (ENG− = 89.9%; Fig. 2 Panel A). The nomogram below shows the influence of the selected features in the prediction and can be used as a tool to predict a patient's risk of postoperative clinical deficit (individual level; Fig. 2 Panel C and associated captions provide detailed explanations). The PDP visualizations (Appendix S2) also show, as a complement, how the values of selected features change the prediction and how they interact.

Prediction of the naming outcome
Classification performance for predicting the naming outcome (NAM score) with all the cognitive indices is moderate: mean BAcc performance = 59.1% ± 14.6%; precision = 68.5% ± 16%; recall = 60.2% ± 21.2% (see Appendix S2 for the distribution of performances and Fig. 3 Panel A for the confusion matrix). However, we observed a clear improvement of the performance when the classification was restricted to the selected features (BAcc = 65.3% ± 3.4%), in particular for the NAM− class, representing postoperative naming decline (NAM− = 68.4%; Fig. 3).

Cognitive networks.
Regarding the networks generated on cognitive indices (i.e., the cognitive networks), the network related to ENG− is more fragmented (6 communities and a modularity index of 0.66) than the one associated with ENG+ (4 communities and a modularity index of 0.29; Fig. 4 Panel B). Appendix S3 shows the evolution of the modularity index as a function of the graph thresholding. The ENG− graph is less densely interconnected (density = 0.05 versus 0.21 for ENG+). The mean degree associated with ENG+ is indeed much higher than that of the ENG− condition (d = 6.57 versus d = 1.63, respectively).

Fig. 3. Panel A. [...] of each NAM class, before and after feature selection (FS). Panel B. Feature frequency and stability in classification. VCI was less frequently selected than the other 3 cognitive features, but still contributed significantly to the prediction. Panel C. Nomogram designed to predict postoperative naming outcome and in particular the risk of experiencing a significant decline in naming abilities following neurosurgery (AUC = 72%, 1000 bootstrap). The procedure for using the nomogram is the same as that detailed in the legend of Fig. 2 Panel C. Following the nomogram, the cognitive profile of P15 on the relevant indices to predict NAM leads to a 100% risk of significant postoperative decline (red flag).
At the local level, there is a significant and systematic difference in node strength and clustering coefficient between ENG+ and ENG− networks (for all nodes; Appendix S3). The absolute ENG+/ENG− difference in node strength is significantly larger for the ML-selected indices than for the non-selected features (t = 18.525, df = 1997.7, p < .001), meaning that the overall centrality of ML nodes changes more between the ENG+ and ENG− conditions than that of the other nodes. Concerning the local centrality (clustering coefficient), the absolute ENG+/ENG− difference is significantly lower for the ML-selected indices compared to the others (t = −9.1225, df = 1916.2, p < .001; Fig. 4).
(Fig. 5 Panel A). As for ENG, the hemisphere involved in epileptic seizures shows a remarkable SMC of 70% with the detected communities. Based on the cognitive indices selected for the prediction of NAM (see Fig. 3), patients with LTLE show a similar neuropsychological profile to that of the community predominantly represented by NAM− (the reciprocal is also valid for RTLE and NAM+). To a lesser extent, manual laterality, presence of hippocampal sclerosis, gender, and thymic score present an SMC with the communities greater than or equal to 60% (SMC_HDS = 0.64; SMC_HS = 0.62; SMC_GEN = 0.6; SMC_THY = 0.6).

Cognitive network.
The cognitive networks of NAM+ and NAM− have the same number of communities (n = 5), but NAM− is slightly more modular and less dense (M = 0.37; density = 0.11) than NAM+ (M = 0.28; density = 0.2; Fig. 5 Panel B and see Appendix S3 for the modularity coefficient as a function of the graph thresholds). The average degree of connection of NAM− (d = 3.38) is also lower than that of NAM+ (d = 5.81).
Nodal strength and clustering coefficient of NAM+ and NAM− conditions are significantly different (p < .001), for almost all nodes (Appendix S3). We observe the same general pattern of difference in terms of centrality as observed for the ENG cognitive networks. We find a more modest but significant difference between the nodal strength of the ML-selected nodes versus the unselected ones (t = 2.23, df = 1932.4, p = 0.01). This difference is related to greater changes between NAM+ and NAM− for ML-selected nodes. The clustering coefficient difference is significantly lower for the ML-selected nodes, compared to the others (t = −14.94, df = 1990.6, p < .001; Fig. 5 Panel C).

Discussion
The main objective of this data-driven study was to determine whether neuropsychological scores can be useful predictors of postoperative long-term outcome after temporal lobe resection (ATL in patients with drug-resistant TLE), and if so, to identify the most relevant indicators. We assessed postsurgical outcome through two important markers reflecting either the long-term clinical (ENG) or naming (NAM) result of the neurosurgery. The findings of this study clearly demonstrate the validity of some NPE composite scores in predicting clinical and naming outcome and detecting patients at risk (red flag). Reliable prediction of long-term clinical outcome (ENG) is achieved with a combination of preoperative scores from 4 neuropsychological indices in particular: VMI and AMI (visual and auditory memory indices of the WMS-IV [60]), NAM (DO80 naming task [36]), and TMT (Trail Making Test B-A score [61]). Using these specific predictors, 8 out of 10 patients were correctly classified, and the performance further increases to accurately identify 9 out of 10 patients among those who will have a less favorable clinical outcome (i.e., not seizure-free at 2 years after the surgery; ENG−). The power of these NPE indices to target at-risk profiles is therefore excellent (Fig. 2), implying that the initial cognitive state is clinically decisive. Concomitant observation of the entire cognition-brain-clinical sphere in this respect will help improve our understanding of the mechanisms underlying this relationship and will help identify its mediators. It should be noted, however, that we binarized the Engel clinical variable to distinguish complete/optimal surgical success (seizure-free: type I) from partial surgical success or therapeutic failure (persistent but more occasional seizures: types II-III; or no noteworthy improvement: type IV [33]; Table S1).
Future work considering different scenarios and refining this categorization is needed to separate patients with occasional seizures after surgery (i.e., substantial improvement in their clinical condition) from patients with no clinical improvement, for example.
Long-term postoperative naming outcome (NAM) is, however, less easily predictable from preoperative cognitive scores. By relying on the four robustly selected neuropsychological indices, namely AMI (auditory memory index of the WMS-IV [60]), NAM (DO80 naming task [36]), SFL (semantic verbal fluency [61]), and VCI (verbal comprehension index of the WAIS-IV [62]), about two-thirds of the patients can be correctly classified. The predictive power of this formula is nevertheless improved for the prediction of at-risk profiles, i.e., patients who show a significant degradation of naming efficiency at 2 years of follow-up (NAM−; Fig. 3). The moderate level of prediction obtained for NAM may be related to the coding method used to define improvement or deterioration of performance. With the objective of limiting the omission of at-risk patients (red flag profiles), we chose a rather permissive threshold (RCI, 10% unilateral), favoring the inclusion of patients without strong decline in the NAM− group. A more restrictive RCI of 95% (cutoffs below −1.65 SD) and/or other empirically based techniques for identifying change from baseline (e.g., standardized regression-based change scores, changes estimated from standardized rather than raw scores [63]) may improve the predictive power of neuropsychological indices for estimating postoperative naming change. In addition, the RCI is based on parameters related to the task's performance distributions (in particular, the estimation of the standard deviation [64]). The overall internal consistency and test-retest stability of the naming task are good, making it a valid and psychometrically reliable task [65]. However, this task has limited discriminatory power [66] and is particularly sensitive to individuals with moderate to severe naming disorders (due to a "ceiling effect"; [67]). The performances of the normative sample as well as of the patients included in this study (see Appendix S2) indeed show skewed or bimodal distributions. Since truncated or non-normal distributions may bias the interpretation of change scores [68], applying an additional correction to the distributions or to the RCI estimate (e.g., [69]) could result in (1) a more accurate approximation of change and therefore (2) a better predictive performance for the NAM+/NAM− conditions.
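To make the role of the threshold concrete, the following minimal sketch computes a reliable change index and applies a permissive and a more restrictive one-tailed cutoff. It assumes the classical Jacobson-Truax form of the RCI (change divided by the standard error of the difference); the exact estimator used in this study is not restated here, and the scores, normative standard deviation, and test-retest reliability below are hypothetical values chosen for illustration only.

```python
import math


def reliable_change_index(pre, post, sd_norm, r_xx):
    """Jacobson-Truax style RCI (assumed form): change / standard error of the difference."""
    sem = sd_norm * math.sqrt(1.0 - r_xx)  # standard error of measurement
    se_diff = math.sqrt(2.0) * sem         # standard error of the difference score
    return (post - pre) / se_diff


def classify_change(rci, z_cutoff):
    """Flag a decline when the RCI falls below -z_cutoff (one-tailed test)."""
    return "decline" if rci < -z_cutoff else "no reliable decline"


# Hypothetical patient: naming score drops from 76 to 71;
# hypothetical normative SD (5.0) and test-retest reliability (0.80).
rci = reliable_change_index(pre=76, post=71, sd_norm=5.0, r_xx=0.80)

permissive = classify_change(rci, z_cutoff=1.28)   # one-tailed 10% threshold
restrictive = classify_change(rci, z_cutoff=1.65)  # one-tailed 5% threshold (-1.65 SD)

print(round(rci, 2), permissive, restrictive)
```

In this illustrative case the same raw change is coded as a decline under the permissive cutoff but not under the restrictive one, which is the trade-off between sensitivity to at-risk profiles and purity of the NAM− group discussed above.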
Overall, this study highlights the high sensitivity and validity of specific neuropsychological tests in the context of ATL in temporal lobe epilepsy. Specifically, the tests derived from the Wechsler scales appear relevant and valid in these patients, who may present subtle cognitive disorders (as mentioned in the Introduction; see also Table S1 for the average z scores) that are difficult to estimate with less sophisticated and comprehensive testing. We previously reported their clinical efficiency in a study aiming to lateralize epilepsy in drug-resistant patients [29], and the robustness of the WMS compared with other tests has also been highlighted by other studies in patients with epilepsy [70]. Furthermore, these neuropsychological indices are associated with the functional disturbances in resting-state brain connectivity exhibited by patients with TLE [11], confirming their clinical efficiency and potentially explaining why they are selected as good predictors of postsurgical status after temporal lobectomy. Thus, these tests appear specifically relevant and fine-grained to target epilepsy-related disorders following ATL surgery. A future step may be to determine whether some sub-indicators of these composite indices are better predictors, thereby refining our observations and gaining a better understanding of the processes at work.
Recent studies have also tackled the prediction of postoperative naming performance in patients with refractory TLE. Busch and colleagues [31], for example, based their predictions on 10 predictors associated a priori with the change in naming performance following temporal lobe resection (clinical variables selected on the basis of evidence available in the literature, such as sex, education, age at surgery, age at epilepsy onset, duration of epilepsy, side of surgery, etc.). They observed very good prediction of naming scores (6-12 months after surgery), and two clinical factors were particularly important for predicting decline: the age at epilepsy onset (also found in [71-73]) and the side of surgery (consistent with [74,75]). By employing an individual graph approach (patients' graphs) and in accordance with the body of work on material-specific hemispheric specialization, we similarly observed in this study that the hemisphere involved in epilepsy and ATL resection was the most important clinical variable (>70% matching with the respective neuropsychological profiles associated with ENG or NAM; Figs. 3-4, Panels A). Thus, the development of combined models, including relevant information from both neurological and neuropsychological examinations, could be beneficial in portraying "red flag" profiles. Moreover, the inclusion of biomarkers of postsurgical outcome in the model, such as resting-state fMRI GT measures [76] or an estimate of the resection volume, could also improve predictions. Although ATL is a standard procedure, the extent of resection is commonly larger in RTLE than in LTLE [77]. Variations in the amount of tissue resected and/or in surgical techniques (e.g., standard ATL, Spencer-type lobectomies, lobectomies sparing the superior temporal gyrus, the hippocampus or the lateral neocortex, selective amygdalohippocampectomy, and lesionectomies) did not appear to have a major effect on cognitive outcome, except for naming [41], and thus remain a factor to consider in future predictions.
In addition, predicting a broader range of neuropsychological indices that assess cognitive functions problematic in the daily lives of patients with TLE is essential. Naming latency could be considered instead of naming score, since it might be more sensitive in identifying finer-grained but troublesome everyday language impairments [38,78,79]. Other naming-related measures could also be highly sensitive to TLE and ATL and thus be used to predict postoperative language status in future work, in particular the naming of specific categories, naming in the auditory modality (on verbal input such as a definition), or naming in a more natural and ecological context such as spontaneous speech (see [40] for a systematic review). The decline in episodic memory (or learning performance) is another outcome that should be predicted, given the high incidence of memory disorders [41] and because memory decline is often observed in association with language deficits in patients with TLE [80]. Recently, Busch and collaborators [32] reported reliable prediction of postoperative verbal memory decline by including the side of surgery, the baseline memory score, and the educational level or the hippocampal resection (depending on the memory score to be predicted). These results are promising. Predicting varied cognitive scores, including functions more traditionally associated with the right (or "non-dominant") hemisphere such as visuo-spatial abilities, and using a wide spectrum of neuropsychological indices as input would also be valuable for the development of comprehensive neuropsychological tools to assist clinicians in preoperative decision making and patient counseling.
Some sophisticated methods and algorithms derived from artificial intelligence are not easy to comprehend (black box models), which hinders clinical translation. Estimating classification/prediction performance is essential to determine the relevance of the model tested, but it does not describe the relationships learned. Considering the "accuracy versus interpretability trade-off" [81], statistical and visualization techniques now make it possible to estimate feature importance, feature interactions, or model internals (structure and/or weights of learned attributes [55]). Model-agnostic interpretation tools such as Partial Dependence Plots (PDPs) provide, for example, threshold scores modulating prediction and reveal relationships between features (Appendix S2 and [29] for a previous application in patients with TLE). In the same way, nomograms make it possible to (1) directly and graphically visualize the influence of the predictors; and (2) perform individual predictions by locating a given patient. Overall, these tools facilitate progress towards personalized neuropsychology (i.e., precision neuropsychology).
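As an illustration of how a partial dependence curve is obtained, the sketch below forces one feature to each value of a grid for every patient and averages the model's predictions. The classifier `toy_model`, its coefficients, and the patient scores are hypothetical stand-ins and not the model fitted in this study; only the averaging procedure is the point.

```python
import math


def toy_model(row):
    """Hypothetical fitted classifier: P(unfavorable outcome) from two index scores."""
    ami, vci = row
    z = 0.08 * (90 - ami) + 0.03 * (95 - vci)
    return 1.0 / (1.0 + math.exp(-z))


def partial_dependence(model, rows, feature_idx, grid):
    """Mean prediction when one feature is set to each grid value for all patients."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_idx] = value  # override the feature of interest
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve


# Hypothetical (AMI, VCI) index scores for five patients
patients = [(85, 100), (102, 95), (78, 88), (95, 110), (110, 105)]
grid = [70, 80, 90, 100, 110]

pd_ami = partial_dependence(toy_model, patients, feature_idx=0, grid=grid)
for value, risk in zip(grid, pd_ami):
    print(value, round(risk, 3))
```

Plotting `pd_ami` against `grid` gives the PDP for the first feature; the score at which the curve crosses a chosen risk level is the kind of threshold referred to above.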
Furthermore, GT applied to neuropsychological performance makes explicit the interdependence between cognitive domains and has become a valuable tool in the study of systemic pathologies such as epilepsy. In the past few years, efforts have been made to model the cognitive maps of patients with epilepsy through networks (in pediatric focal and generalized epilepsies [82], or in TLE presenting diverse cognitive phenotypes [83,84]). Compared with those of control populations, the cognitive networks of patients with epilepsy systematically show a significant disruption, resulting in decreased interconnectivity and a more severe and disorganized modular fragmentation. Our study, carried out in the perioperative context, shows that this ultra-modular and sparse organization of connections is particularly exaggerated in patients presenting a poor long-term clinical or naming outcome after surgery. Optimal cognitive functioning thus emanates from the synergy between indicators (or cognitive processes), and sustained support of the entire cognitive network after surgical injury is essential. In graph terms, this translates into a higher network density and a lower modularity, as observed for conditions with favorable postsurgical outcomes (ENG+ and NAM+). Interestingly, the connectivity of the ML-selected NPE indices differs markedly depending on the postsurgical outcome, which may explain the greater predictive power of these specific features. Memory scores such as AMI and general intellectual abilities (e.g., VCI and associated subtests) are important connector hubs in the cognitive landscape of patients with TLE (i.e., they are highly integrated nodes; Figs. 3-4, Panels B). Their connectivity, or centrality within the whole network, undergoes significant changes between the ENG+/ENG− and NAM+/NAM− conditions (Figs. 3-4, Panels C). They may thus play a protective role against pre/postsurgical decline, echoing the concept of cognitive reserve [85].
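The two graph metrics contrasted above can be sketched as follows for an undirected, unweighted network with a given partition into modules. This is an illustration under simplifying assumptions: the toy graphs and the partition are hypothetical, and the sketch does not reproduce the matrix construction, thresholding, or community detection used in this study.

```python
def density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    return 2.0 * len(edges) / (n_nodes * (n_nodes - 1))


def modularity(edges, communities):
    """Newman modularity Q for an undirected, unweighted graph and a fixed partition."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for community in communities:
        members = set(community)
        internal = sum(1 for u, v in edges if u in members and v in members)
        total_degree = sum(degree.get(node, 0) for node in members)
        q += internal / m - (total_degree / (2.0 * m)) ** 2
    return q


# Hypothetical six-node cognitive networks: nodes stand for test scores
modules = [[0, 1, 2], [3, 4, 5]]
triangles = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]

sparse_net = triangles + [(2, 3)]                          # one bridge between modules
dense_net = triangles + [(2, 3), (0, 3), (1, 4), (2, 5)]   # several cross-module links

for name, net in (("sparse", sparse_net), ("dense", dense_net)):
    print(name, round(density(6, net), 3), round(modularity(net, modules), 3))
```

The sparsely bridged network has the lower density and the higher modularity, which is the configuration described above for the less favorable outcome groups; adding cross-module links raises density and lowers modularity.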
Finally, the development and application of the techniques used in this study require special precautions. In particular, the sample must be sufficiently large and as homogeneous as possible to detect robust patterns. The main pitfall in ML analyses, which is related to small samples, is the risk of overfitting. There are, however, several ways to control the reliability of the prediction, which we applied in this study to support the generalizability of the results: adopting a multi-algorithm approach, performing CVs, iterating, and estimating the stability of selected features are examples [86]. Regarding GT analyses, the method and thresholding of the matrices as well as the number of network nodes influence the results (discussed, for example, by [87-89] in the case of brain networks). To limit these biases, different solutions can be adopted, such as comparing the results obtained with those of random networks [90], estimating the evolution of metrics as a function of the threshold (see Appendix S3 in our case), or applying corrections for multiple comparisons. Optimally, the generalizability of results should be systematically assessed using a fully independent sample. This last point is far from trivial in clinical practice, given the high variability of neuropsychological practices across time, practitioners, and/or clinical sites [91].
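One way to estimate the stability of selected features is to repeat a k-fold cross-validation many times and record how often each feature is retained when selection is performed on the training folds only. The sketch below illustrates this idea on synthetic data; the selection criterion (absolute difference of class means), the number of folds and repetitions, and the data themselves are assumptions made for illustration and do not correspond to the algorithms or settings used in this study.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def top_features(X, y, train_idx, n_keep):
    """Rank features by absolute difference of class means, on training data only."""
    scores = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in train_idx if y[i] == 1]
        neg = [X[i][j] for i in train_idx if y[i] == 0]
        scores.append((abs(sum(pos) / len(pos) - sum(neg) / len(neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_keep]]


def selection_frequency(X, y, k=5, repeats=20, n_keep=2, seed=0):
    """Fraction of training folds in which each feature is selected."""
    rng = random.Random(seed)
    counts, n_fits = Counter(), 0
    for _ in range(repeats):
        for held_out in k_fold_indices(len(X), k, rng):
            held = set(held_out)
            train_idx = [i for i in range(len(X)) if i not in held]
            counts.update(top_features(X, y, train_idx, n_keep))
            n_fits += 1
    return {j: counts[j] / n_fits for j in range(len(X[0]))}


# Synthetic data: features 0 and 1 carry signal, features 2 and 3 are noise
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(3.0 * label, 1.0), data_rng.gauss(2.5 * label, 1.0),
      data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for label in y]

freq = selection_frequency(X, y)
print(freq)
```

Features whose selection frequency stays close to 1 across folds and repetitions can be considered robustly selected, whereas features retained only occasionally are more likely to reflect sampling noise in a small cohort.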

Conclusion
This data-driven research supports the view that neuropsychology can serve as a relevant predictor of both clinical and long-term naming outcomes after traditional temporal lobectomy in patients with epilepsy. In this context, machine learning and GT methods are complementary and mutually beneficial. They offer insightful screening of (1) the profiles likely to experience a suboptimal postsurgical outcome; and (2) the cognitive landscape of the patients concerned. This study promotes evidence-based neuropsychology and emphasizes the concept of a neuropsychological "canvas", which highlights the role of interactions between cognitive functions at the origin of the phenotypic expression of cognition. The prediction visualizations and techniques proposed in this research attempt to bridge the gap between fundamental and clinical research by providing concrete applications (e.g., nomograms). Beyond diagnosis, the tools and evidence provided can guide neuropsychologists in setting up an early or even anticipatory cognitive rehabilitation ("cognitive prehabilitation") and thus in planning the most appropriate neuropsychological follow-up in case of a risk profile (red flags).

Ethical Statement
Patients provided written informed consent to participate in the study, which was approved by the local ethics committee (CPP: 09-CHUG-14/ANSM (ID RCB) 2009-A00632-55).