Introduction
The long-term clinical consequences of COVID-19 are still poorly understood and are collectively termed post-acute sequelae of SARS-CoV-2 infection, known as long COVID.
At this time, this disease is referred to by a number of terms that may or may not represent the same constellation of signs and symptoms; here, we consider post-acute sequelae of SARS-CoV-2 infection synonymous with long COVID. Long COVID can be broadly defined as persistent or new symptoms more than 4 weeks after severe, mild, or asymptomatic SARS-CoV-2 infection.
Characterising, diagnosing, treating, and caring for patients with long COVID has been challenging due to heterogeneous signs and symptoms that evolve over long trajectories.
The effect of long COVID on patients’ quality of life and ability to work can be profound.
- McCorkell L
- Assaf GS
- Davis HE
- Wei H
- Akrami A
which conducted deep longitudinal characterisation of long COVID symptoms and trajectories in patients with suspected and confirmed COVID-19 who reported illness lasting more than 28 days.
Evaluation and harmonisation of patient-reported and clinically reported long COVID features using the Human Phenotype Ontology also revealed heterogeneous signs and symptoms, supporting the hypothesis that a complex collection of patient-reported and clinically reported features is necessary to correctly classify and manage patients with long COVID.
WHO recently published its own case definition of post COVID-19 condition (WHO’s term) that includes 12 criteria, which similarly require a wide variety of patient-declared and clinical information.
A clinical case definition of post COVID-19 condition by a Delphi consensus, 6 October 2021.
Evidence before this study
Initial characterisation of patients with long COVID has contributed to an emerging clinical understanding, but the substantial heterogeneity of disease features makes diagnosing and treating this new disease challenging. This challenge is urgent to address, as many patients report that long COVID symptoms are debilitating and severely affect their ability to engage in activities of daily life. No formal literature review was done. Few studies have used large-scale databases to understand concordance of clinical patterns and generate data-driven definitions of long COVID. The US National Institutes of Health’s RECOVER programme has invested in electronic health record studies to understand the risk factors for, and mechanisms behind, long COVID, accurately identify individuals with long COVID, and prevent and treat long COVID.
Added value of this study
The National COVID Cohort Collaborative (N3C) harmonises patient-level electronic health record data from over 8 million demographically diverse and geographically distributed patients. Here, we describe highly accurate XGBoost machine learning models that use N3C to identify patients with potential long COVID, trained using electronic health record data from patients who attended a long COVID specialty clinic at least once. The most powerful predictors in these models are outpatient clinic utilisation after acute COVID-19, patient age, dyspnoea, and other diagnosis and medication features that are readily available in the electronic health record. The model is transparent and reproducible, and can be widely deployed in individual health-care systems to enable local research recruitment or secondary data analysis.
Implications of all the available evidence
N3C’s longitudinal data for patients with COVID-19 provides a comprehensive foundation for the development of machine learning models to identify patients with potential long COVID. Such models enable efficient study recruitment that, in turn, deepens our understanding of long COVID and offers opportunities for hypothesis generation. Moreover, as more patients are diagnosed with long COVID and more data become available, our models can be refined and retrained so that the algorithm evolves with the evidence.
The RECOVER (Researching COVID to Enhance Recovery) programme aims to recruit thousands of participants in the USA to answer critical research questions about long COVID, such as understanding pregnancy risk factors, cognitive impairment and mental health, and outcome disparities and comorbidities. Efficient recruitment of cohorts of this size and scope often entails leveraging computable phenotypes (ie, electronic cohort definitions) to find sufficient numbers of patients meeting a study’s inclusion criteria. Poor cohort definition can result in poor study outcomes.
For long COVID, as with other novel conditions, the absence of an unambiguous consensus definition and the heterogeneity of the condition’s presentation pose a substantial challenge to cohort identification. Machine learning can help to address this challenge by using the rich longitudinal data available in electronic health records to algorithmically identify patients similar to those in a long COVID gold standard.
The National COVID Cohort Collaborative (N3C) offers a data-driven solution to quantifying the features of long COVID and an appropriate hypothesis-testing scenario for a machine learning approach.
N3C is an NIH National Center for Advancing Translational Sciences (NCATS)-sponsored data and analytic environment which compiles and harmonises longitudinal electronic health record data from 65 sites in the USA and over 8 million patients who have tested positive for SARS-CoV-2 infection; have symptoms that are consistent with a COVID-19 diagnosis; or are demographically matched controls who have tested negative for SARS-CoV-2 infection (and have never tested positive) to support comparative studies.
NIH COVID-19 data warehouse data transfer agreement.
We aimed to build a foundation for a robust clinical definition of long COVID by linking curated lists of patients who have attended a long COVID clinic from three N3C sites with data in the N3C repository. We used the linked dataset to train and test three machine learning models and applied those models to define a nationwide US cohort of potential patients with long COVID, and to derive a list of prominent clinical features shared among that cohort to help to identify patients for research studies and target features for further investigation.
Results
Table: Characteristics of the three-site cohort used for model training and testing
Data are n (%) unless otherwise stated. All patients shown had acute COVID-19. Diabetes was not separated by type.
Figure 2: Machine learning model performance in identifying potential long COVID in patients
ROC curves, with 5-fold cross-validation and five repeats, identifying the ability of each of the three models (non-hospitalised, hospitalised, and all patients) to classify patients with long COVID as the discrimination threshold is varied. To emphasise recall of patients with potential long COVID, all models use a predicted probability threshold of 0·45 to generate the precision, recall, and F-score. The threshold can be adjusted to emphasise precision or recall, depending on the use case. AUROC=area under the receiver operating characteristic curve. ROC=receiver operating characteristic.
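The effect of the probability threshold described in this caption can be illustrated with a minimal sketch. The labels and predicted probabilities below are hypothetical, not study data; the use of the F1 variant of the F-score and of an inclusive (greater than or equal to) comparison at the threshold are assumptions made for illustration only.

```python
def classification_metrics(y_true, y_prob, threshold=0.45):
    """Precision, recall, and F1 at a given predicted-probability threshold."""
    # Assumption: a patient is flagged when the probability is >= threshold.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels (1 = long COVID clinic patient) and model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.90, 0.48, 0.30, 0.46, 0.10, 0.20, 0.70, 0.55]

at_045 = classification_metrics(y_true, y_prob, threshold=0.45)
at_060 = classification_metrics(y_true, y_prob, threshold=0.60)
print(at_045)  # lower threshold: higher recall, lower precision
print(at_060)  # higher threshold: higher precision, lower recall
```

In this toy example, lowering the threshold from 0.60 to 0.45 raises recall from 0.5 to 0.75 at the cost of precision, which is the trade-off the caption refers to.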
The three models were validated against an independent dataset from a fourth site. When tested against the patient population of this site qualifying for our base criteria (n=32 411, 125 of whom were long COVID clinic patients, without sampling to address the class imbalance), the AUROCs were 0·82 for the all-patients model, 0·79 for the hospitalised model, and 0·78 for the non-hospitalised model.
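Because the validation set was not resampled, it helps to recall how AUROC behaves under class imbalance. The sketch below computes AUROC from its rank-based definition (the probability that a randomly chosen positive is scored above a randomly chosen negative) and shows that the value is unchanged when negatives are replicated with the same score distribution. The scores are hypothetical and the function is an illustrative implementation, not the code used in the study.

```python
def auroc(y_true, y_score):
    """AUROC as P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 2 positives among 10 patients.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_score = [0.8, 0.3, 0.6, 0.1, 0.5, 0.2, 0.4, 0.7, 0.05, 0.15]
balanced_auc = auroc(y_true, y_score)

# Replicate the negatives nine more times to mimic a much rarer positive class.
neg_scores = [s for t, s in zip(y_true, y_score) if t == 0]
imbalanced_auc = auroc(y_true + [0] * (len(neg_scores) * 9),
                       y_score + neg_scores * 9)
print(balanced_auc, imbalanced_auc)
```

Precision, by contrast, does depend on prevalence, so threshold-based metrics from an unsampled population are not directly comparable with those from a balanced training set.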

Figure 3: Most important model features associated with visits to a long COVID clinic
The top 20 features for each model are shown. Each point on the plot is a Shapley (importance) value for a single patient. The colour of each point represents the magnitude and direction of the value of that feature for that patient. The point’s position on the horizontal axis represents the importance and direction of that feature for the prediction for that patient. Some features are important predictors in all models (eg, outpatient utilisation, dyspnoea, and COVID-19 vaccine), whereas others are specific to one or two of the models (eg, dyssomnia or dexamethasone). Conditions labelled as chronic were diagnosed in patients before their COVID-19 index. Diabetes was not separated by type. dx=diagnosis. med=medication.
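To make the notion of a per-patient Shapley value concrete, the following sketch computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all feature orderings. The scoring function, feature names, and the all-zero baseline are hypothetical; tree-based explainers used with gradient-boosted models rely on faster approximations over the training data rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every ordering of the features."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference patient
        previous = predict(current)
        for name in order:
            current[name] = x[name]       # switch one feature to the patient's value
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(f):
    # Hypothetical score with one interaction term; not the study's model.
    return (0.8 * f["outpatient_visits"] + 1.5 * f["dyspnoea"]
            + 0.5 * f["outpatient_visits"] * f["dyspnoea"]
            - 0.3 * f["vaccinated"])


patient = {"outpatient_visits": 2, "dyspnoea": 1, "vaccinated": 1}
baseline = {"outpatient_visits": 0, "dyspnoea": 0, "vaccinated": 0}
phi = shapley_values(toy_score, patient, baseline)
print(phi)
```

The values sum to the difference between the patient's score and the baseline score, and the interaction term is split equally between the two features involved.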

Figure 4: Univariate odds ratios for important model features
Shown are the relative feature importance and univariate odds ratios for the top features (union of the 20 most important features) in each model. Regardless of importance, some features are significantly more prominent in the long COVID clinic population, while others are more prominent in the non-long COVID clinic population. ·· denotes that the feature was not in the top 20 features for the model in that column. Conditions labelled chronic were associated with patients before their COVID-19 index. Diabetes was not separated by type. dx=diagnosis. med=medication. *Odds ratios exclude age, which has a non-linear relationship with long COVID.
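A univariate odds ratio of the kind summarised in this figure can be derived from a 2×2 table of feature presence by clinic status. The sketch below is illustrative only: the counts are hypothetical, and the Wald 95% confidence interval is an assumed method, since the caption does not state how significance was assessed.

```python
import math


def univariate_odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: long COVID clinic patients with the feature
    b: long COVID clinic patients without the feature
    c: other patients with the feature
    d: other patients without the feature
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this estimator")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for a single feature (eg, a dyspnoea diagnosis).
odds_ratio, lower, upper = univariate_odds_ratio(a=40, b=60, c=20, d=80)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 indicates that the feature is significantly more (or less) prominent in the long COVID clinic population under this assumed test.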

Figure 5: Example paths taken by the machine learning models to classify patients with potential long COVID
Force plots showing the contribution of individual features to the final predicted probability of long COVID, as generated for individual patients by the all-patients model (A), hospitalised model (B), and non-hospitalised model (C). Features in red increase the predicted probability of long COVID classification by the model, whereas features in blue decrease that probability. The length of the bar for a given feature is proportional to the effect that feature has on the prediction for that patient. The final predicted probability is shown in bold. GERD=gastroesophageal reflux disease.
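The arithmetic behind a force plot can be sketched as follows, assuming that the per-feature contributions are expressed in log-odds and added to a base value before conversion to a probability (a common convention for gradient-boosted classifiers, but an assumption here). The base value and contributions are hypothetical.

```python
import math


def force_plot_summary(base_log_odds, contributions):
    """Combine additive contributions into a probability and split them by direction."""
    total_log_odds = base_log_odds + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-total_log_odds))
    # Longest bars first, mirroring how a force plot orders features.
    ordered = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    pushes_up = [name for name, value in ordered if value > 0]     # drawn in red
    pushes_down = [name for name, value in ordered if value < 0]   # drawn in blue
    return probability, pushes_up, pushes_down


# Hypothetical contributions for one patient.
contributions = {
    "outpatient_utilisation": 1.2,
    "dyspnoea": 0.9,
    "propofol": -0.6,
    "age": 0.3,
}
probability, pushes_up, pushes_down = force_plot_summary(-1.0, contributions)
print(round(probability, 2), pushes_up, pushes_down)
```

The absolute size of each contribution corresponds to the bar length in the plot, and the sign determines whether the bar raises or lowers the predicted probability.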
Discussion
A confounding factor that prioritises these features might be that the long COVID clinics at two of the three sites that contributed long COVID clinic patients are based in the pulmonary department. However, given that SARS-CoV-2 is primarily a respiratory virus, it is not surprising that long-term respiratory symptoms were observed. Similar long-term respiratory symptomatology is well described with respiratory viral syndromes, including those from severe acute respiratory syndrome, respiratory syncytial virus, influenza, and COVID-19.
The high proportion of albuterol use and use of inhaled steroids is consistent with the expected high prevalence of post-viral reactive airways disease. Examples of the most important features include dyspnoea or difficulty breathing, cough, albuterol, guaifenesin, and hypoxaemia.
The features in this group include symptoms and mitigating treatments, for example dyssomnia, chest pain, and malaise, and treatment with lorazepam, melatonin, and polyethylene glycol 3350.
Science Brief: evidence used to update the list of underlying medical conditions associated with higher risk for severe COVID-19.
Fourth, proxies for hospitalisation. Features that are representative of standard hospital admission orders probably contributed to the model as proxies for hospitalisation in general, rather than being individually meaningful. These features were most prominent in patients without long COVID (true negatives), suggesting that the model is correctly differentiating between acute illness requiring hospitalisation and long COVID. Example features include the use of glucose, ketorolac, propofol, and naloxone.
Rates of outpatient and inpatient utilisation are important features in all three models. This finding can be interpreted in a number of ways. For example, patients who continue to feel unwell long after acute COVID-19 might be more likely to visit their providers repeatedly than patients who fully recover. Because diagnosing and treating the heterogeneous symptoms of long COVID is a challenge, these patients could also be referred to one or more specialists, further increasing their utilisation.
Electronic health records were the source of all features used by our model. Although electronic health records contain rich clinical features, these data are also a proxy for health-care utilisation and can be interpreted through that lens. Diagnoses coded in the electronic health record are not representative of the whole patient, but rather are focused on the specific reasons the patient has visited a health-care site on that day. Moreover, the absence of electronic health record data about a patient does not equate to the absence of a disease; it merely represents the absence of a patient seeking care for that disease.
Even as a proxy for health-care utilisation, electronic health record data is well suited to the task of cohort definition by way of computable phenotyping, especially when the end goal is study recruitment. Although there are other methods of identifying potential study participants, a computable phenotype allows us to efficiently narrow the recruitment pool down from everyone available to patients who are likely to qualify, easily eliminating large numbers of patients who do not qualify and ascertaining patients who might elude human curation.
There are additional advantages to using electronic health record data to identify patients with long COVID. With an evolving definition and no gold standard to compare with, the electronic health record allows us to define proxies for a condition and select on those—in this case, a patient’s visit to a long COVID specialty clinic. However, rather than settling for a restrictive criterion of at least one visit to a long COVID specialty clinic, our machine learning models allow us to decouple patients’ utilisation patterns from the clinic visit, meaning that we can use the models to identify similar patients who might not have access to a long COVID clinic.
This study has several limitations. Electronic health record data is skewed towards patients who make more use of health-care systems, and is further skewed towards high utilisers, patients with more severe symptoms, and hospital inpatients. When researchers train models on N3C’s electronic health record data, it is essential to acknowledge whose data is less likely to be represented; for example, uninsured patients, patients with restricted access to or ability to pay for care, or patients seeking care at small practices or community hospitals with scarce data exchange capabilities. Moreover, for patients included in our models, clinic visits and hospitalisations that occur outside of the health-care system (ie, N3C site) for that patient are generally absent from our data. Finally, because our models require an index date for the execution of temporal logic, we cannot make use of cases without a positive indicator (test or diagnosis code) recorded in the electronic health record. This approach excludes the analysis of patients who had COVID-19 early in the pandemic and were not able to be tested.
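The dependence on an index date can be illustrated with a minimal sketch. The event kinds, record layout, and the choice of the earliest positive indicator as the index are hypothetical assumptions for illustration, not the N3C phenotype definition.

```python
from datetime import date

# Assumption: these event kinds count as a positive indicator (test or diagnosis code).
POSITIVE_KINDS = {"positive_test", "covid_diagnosis"}


def covid_index_date(events):
    """Earliest date of a positive indicator, or None if the record has none."""
    positive_dates = [e["date"] for e in events if e["kind"] in POSITIVE_KINDS]
    return min(positive_dates) if positive_dates else None


def days_from_index(events):
    """Express every event as days relative to the index; None if no index exists."""
    index = covid_index_date(events)
    if index is None:
        return None  # temporal features cannot be built, so the patient is excluded
    return [(e["kind"], (e["date"] - index).days) for e in events]


# Hypothetical patients: one with a recorded positive test, one never tested.
tested_patient = [
    {"kind": "outpatient_visit", "date": date(2020, 11, 20)},
    {"kind": "positive_test", "date": date(2020, 12, 1)},
    {"kind": "outpatient_visit", "date": date(2021, 2, 9)},
]
untested_patient = [
    {"kind": "outpatient_visit", "date": date(2020, 3, 15)},
    {"kind": "outpatient_visit", "date": date(2020, 6, 2)},
]
print(days_from_index(tested_patient))
print(days_from_index(untested_patient))
```

The second patient, who has visits but no recorded test or diagnosis, yields no index date and so cannot contribute time-relative features, which mirrors the exclusion described above.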
We did not include race and ethnicity as model features, because we did not believe our three-site sample of long COVID clinic patients to be appropriately representative. As more data on patients with long COVID become available over time, we will be able to balance the cohort based on demographics and, critically, carefully account for race and ethnicity in future iterations of the model.
Beyond identifying cohorts for research studies, the models presented here can be used in various applications and could be enhanced in several ways. Specifically, in future studies, it will be necessary to use a large sample size of patients with long COVID to validate hypotheses relating to social determinants of health and demographics, comorbidities, and treatment implications, and to understand the relationship between acute COVID-19 severity and specific long COVID signs and symptoms and their longitudinal progression. The influence of vaccination in such trajectories will also need to be explored.
It is plausible that long COVID will not have a single definition, and it might be better described as a set of related conditions with their own symptoms, trajectories, and treatments. Thus, as larger cohorts of patients with long COVID are established, future research should identify sub-phenotypes of long COVID by clustering patients with long COVID with similar electronic health record data fingerprints. Such fingerprints might be enhanced by natural language processing of clinical notes, which often include descriptions of signs and symptoms not recorded in structured diagnosis data. Future iterations of our models could discern among these clusters given N3C’s large sample size and recurring data feeds.
Carolyn Bramante, David Dorr, Michele Morris, Ann M Parker, Hythem Sidky, Ken Gersing, Stephanie Hong, and Emily Niehaus.
ERP, ATG, KK, CGC, and MAH curated the data. ERP, ATG, MGK, KK, and CGC integrated the data. ERP, ATG, MGK, and CGC handled data quality assurance. ERP, KK, and CGC defined the N3C phenotype. ERP, ATG, TDB, MGK, KK, and CGC provided clinical data model expertise. TDB, RRD, and SEJ provided clinical subject matter expertise. ATG, AB, and JPD did the statistical analysis. ATG, AB, JPD, JAM, and MAH were responsible for data visualisation. ERP, ATG, TDB, IMB, RRD, SEJ, MGK, JAM, RM, AW, KK, CGC, and MAH critically revised the manuscript. ERP, ATG, RRD, SEJ, JAM, CGC, and MAH drafted the manuscript. JAM, AW, CGC, and MAH were responsible for governance and regulatory oversight. ERP and ATG accessed and verified all underlying data for these analyses. Authors were not precluded from accessing data in the study, and they accept responsibility to submit for publication.