Introduction

An adverse drug reaction (ADR) is a harmful or unpleasant response to a medicinal product that predicts hazard from future administration and warrants prevention, specific treatment, dose adjustment, or product withdrawal1. ADRs are among the leading causes of death in the United States; serious ADRs occur in approximately 6.7% of hospitalized patients, and fatal ADRs in 0.32%2. In a prospective study of 5000 patients hospitalized at three tertiary hospitals in Korea, ADRs were observed in 10.2% of cases, and patients who developed ADRs had a 5-day longer hospital stay and significantly higher medical expenses than the control group3. ADRs worsen clinical outcomes and increase medical costs because of prolonged hospital stays and additional visits3,4,5.

Among the organs affected by ADRs, the skin is the most commonly involved, with manifestations ranging from mild rashes to life-threatening diseases such as Stevens–Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN). Antibiotics are among the most common triggers of cutaneous adverse drug reactions (CADRs)6,7. In a prospective study from an Italian university hospital, ADRs occurred in 4% of cases, of which 28% were CADRs8. In a French general hospital, the prevalence of CADR was 3.6 per 1000 patients, 34% of whom were classified as severe, with antibiotics, particularly penicillin, being the most commonly implicated drugs9. Serious CADRs, including Drug Rash with Eosinophilia and Systemic Symptoms (DRESS) syndrome, SJS, and TEN, are rare but potentially life-threatening drug hypersensitivity reactions with a reported mortality rate of 20–40%10,11. Thus, predicting CADRs and preventing deterioration is important; however, a consistent and standardized method for predicting CADRs has not yet been established.

The adoption of electronic health records (EHRs) has grown significantly worldwide, leading to a substantial increase in healthcare data12,13,14. Subsequently, deep learning-based EHR analysis models have been developed using the large-scale patient data, and representations of patient trajectories have emerged as a promising area of research15,16,17,18,19,20. As individual medical codes and sequential sets of medical codes correspond to words and sentences, several studies have proposed EHR foundation models (FMs) based on the pretraining paradigms used in language models (LMs). For pretraining, although most studies have utilized the masked LM21, autoregressive LMs have also been used22. Notably, previous studies have achieved excellent performance in predicting several diseases, such as pancreatic cancer18, heart failure19, non-accidental trauma15, and various ADRs23 using EHR FMs and patient trajectories.

The present study aimed to develop a CADR prediction model using EHR FMs and patient trajectories consisting of diagnosis, measurement, prescription, and procedure records from EHRs. We compared the performance of several machine learning, deep learning, and EHR foundation models, including BEHRT17, CLMBR-t-base22, and CDM-BERT23. Specifically, we demonstrated the effectiveness of domain embedding (DE) in CDM-BERT, which has exhibited superior performance in predicting various ADRs. Skin rash records were extracted from nursing records (including statements and reports) and integrated with the EHR data. For internal and external validation, we used data from three tertiary hospitals in the Republic of Korea (hereafter, Korea). Unlike prior research, this study demonstrates generalizability, with stable performance across institutions without additional fine-tuning on external datasets. Moreover, we classified CADRs into immediate and delayed types for a detailed sub-analysis and interpreted the results based on the clinical background. With a rigorously defined practical cohort for CADR prediction, this study represents a progression in ADR prediction from proof-of-concept to clinically implementable solutions.

Results

Study population

We used the records of adult patients who were hospitalized for at least 3 days and were prescribed antibiotics during the stay. Since we used nursing statements and reports for CADR record extraction, only the data from inpatients were used because accurate labeling of antibiotic-associated CADRs in outpatients was not feasible due to the lack of nursing records. Patients from three tertiary hospitals in Korea—Seoul National University Hospital (SNUH; n = 366,434), Seoul National University Bundang Hospital (SNUBH; n = 325,396), and Chungbuk National University Hospital (CNUH; n = 110,301)—were included in this study. The prediction for all patients who were prescribed antibiotics was performed during their last visit. The exclusion criteria for patients with antibiotic-associated skin rash were as follows: <20 records, absence of skin rash during antibiotic treatment, and continuation of antibiotic treatment after the occurrence of skin rash. Given the need for continuous antibiotic treatment, patients who were prescribed medication for skin rash within 1 day after the occurrence were included, even if they were treated with antibiotics continuously. Finally, 1906, 3070, and 524 patients with CADR and 216,649, 311,631, and 34,357 patients without CADR from SNUH, SNUBH, and CNUH, respectively, were included in the study cohort (Fig. 1). The antibiotics and the drugs used to treat skin rash are summarized in Supplementary Table 1.

Fig. 1: Participant enrollment flowchart.

An internal dataset from Seoul National University Hospital (SNUH) was used for training and internal validation of the model. External datasets from Seoul National University Bundang Hospital (SNUBH) and Chungbuk National University Hospital (CNUH) were used for external validation.

Data from SNUH were used for training and internal validation of the CADR prediction model and data from SNUBH and CNUH were used for external validation (Table 1). The incidence ratio of CADR was approximately 1%, consistent with that reported in previous studies24,25,26. The incidence ratio of CADR for each antibiotic is summarized in Supplementary Table 2. No significant differences were observed for age and sex between the CADR and non-CADR groups. However, the days of hospitalization and the number of codes per patient were higher in the CADR groups than in the non-CADR groups for all hospitals. The prevalence rates of all comorbidities were higher in the CADR group of SNUH; however, no evident pattern of comorbidities was observed for SNUBH and CNUH. The overall distribution of data varied between hospitals, and we demonstrated the validity of our model using these external datasets. For a more detailed analysis, we classified CADRs into immediate and delayed types. The baseline characteristics of patients with immediate and delayed CADR types are summarized in Supplementary Table 3.

Table 1 Baseline characteristics of the included patients

Prediction performance

For training, the prediction timepoint was set as the time of the latest antibiotic treatment before CADR. The models were designed to predict whether CADR would occur if the patient was prescribed specific antibiotics. The last prescribed antibiotics and the records preceding them were used as inputs for the model. We evaluated the CADR prediction performance using eight models: two tree-based models (random forest27 and gradient boosting machine (GBM)28), three recurrent neural network (RNN)-based models (long short-term memory (LSTM)29, gated recurrent unit (GRU)30, and RETAIN31), and three EHR FMs (BEHRT17, CLMBR-t-base22, and CDM-BERT23). For BEHRT and CDM-BERT, we also evaluated variants trained without pretraining. As sequence data (patient trajectory) cannot be used as input for the random forest and GBM, we transformed each sequence into a count-based one-dimensional vector32. For the remaining models, we used the sequence data with age and segment embeddings. We randomly split the patient data from SNUH into development (80%) and hold-out internal validation (20%) datasets. CDM-BERT demonstrated the best prediction performance for internal and external validations, with area under the receiver operating characteristic (ROC) curve (AUROC) values of 0.975, 0.928, and 0.893 for SNUH, SNUBH, and CNUH, respectively (Table 2). Notably, the performance was maintained even without additional fine-tuning for external validation. Meanwhile, BEHRT and CLMBR-t-base predicted CADR poorly, even after pretraining. Despite the low incidence of CADR (approximately 1%), CDM-BERT achieved precision values of 15.8%, 7.8%, and 12.6% for SNUH, SNUBH, and CNUH, respectively, several-fold higher than the incidence. The ROC and precision-recall (PR) curves of all models are summarized in Supplementary Fig. 1.

Table 2 Performance for predicting CADR

Feature importance

Feature importance was assessed using fine-tuned CDM-BERT, which demonstrated the best prediction performance (Table 2). To evaluate whether the model focuses on clinically relevant features, we extracted attention scores from the model and compared averaged scores between CADR-related and unrelated concepts. CADR-related diagnosis concepts included cancer, chronic kidney disease, and chronic liver disease, while CADR-unrelated diagnosis concepts included obesity, cataract, hypertension, and fracture. Similarly, CADR-related drug concepts included antibiotics, anticonvulsants, nonsteroidal anti-inflammatory drugs, and antitubercular drugs, while CADR-unrelated drug concepts included vitamins, normal saline, and dextrose. As shown in Fig. 2a, b, attention scores of CADR-related concepts were higher than unrelated concepts for both diagnosis and drug domains across all institutions. Calculating attention scores for drug concepts of CNUH was not feasible because the drug concepts of CNUH were not compatible with those of SNUH. Figure 2c shows the 40 most important features of the model for each hospital. For SNUH, drugs (such as famotidine) and antibiotics (such as ceftriaxone and piperacillin) were the most important variables. For SNUBH, midazolam, famotidine, and fluid-related drugs and chest CT procedure items were the most important variables. For CNUH, measurements such as protein, albumin, and liver enzyme levels, and conditions such as chronic kidney disease and acute kidney injury were the most important variables. No drugs appeared in the top 40 features for CNUH because it used drug codes different from those of SNUH. Accordingly, when predicting the CADRs in the CNUH cohort, the model could not utilize the drug records because they were replaced with the [UNK] code, which represents codes not found in the vocabulary. The number of codes and shared codes for each domain is summarized in Supplementary Table 4.

Fig. 2: Feature importance analysis.

a Comparison of attention scores between CADR-related and unrelated diagnosis concepts. b Comparison of attention scores between CADR-related and unrelated drug concepts. Drug concept analysis for CNUH was not possible due to different RxNorm concept classes used (see “Discussion”). c Top 40 most important features for predicting CADRs. The bar color indicates the domain of the medical code (condition, drug, measurement, and procedure). SNUH Seoul National University Hospital, SNUBH Seoul National University Bundang Hospital, CNUH Chungbuk National University Hospital.

Sub-analysis of immediate and delayed CADR types

For a more detailed analysis, the CADRs were classified into immediate and delayed types. Furthermore, the prediction time point for the delayed type was divided into two categories: immediately before antibiotic treatment and immediately before CADR occurrence to assess the impact of cumulative antibiotic treatment history on CADR. Across all cohorts, the model demonstrated greater confidence in predicting the delayed type CADRs compared with the immediate type. The detailed performance metrics of the models are summarized in Fig. 3 and Supplementary Table 5. The ROC and PR curves of the models are presented in Supplementary Fig. 2. The feature importance results for each group are presented in Supplementary Figs. 3–5.

Fig. 3: Risk comparison of patients without CADR and with immediate or delayed CADRs.

The green line in each boxplot indicates the median, and the length of each box indicates the interquartile range (IQR) of the logits for each group. The ends of the box correspond to the 25th (Q1) and 75th (Q3) percentiles, and the lines extend to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. SNUH Seoul National University Hospital, SNUBH Seoul National University Bundang Hospital, CNUH Chungbuk National University Hospital, CADR cutaneous adverse drug reaction, Abx antibiotics, Tx treatment.

For uncertainty analysis, we reported the results of conformal prediction analysis in Table 3. Considering the class imbalance of our cohort, we adopted Mondrian conformal prediction33, which is class-conditional, with a significance level of α = 0.05. The average set size was smaller when the model predicted delayed-type CADR compared to immediate-type CADR. For delayed-type CADR, the fraction of uncertain predictions was 0 in SNUH and SNUBH, and 0.144 in CNUH. The overall and negative coverages were acceptable and close to 0.95, while positive coverage showed slight variability (0.915–0.950). This phenomenon is consistent with previous findings that the minority class tends to be undercovered in imbalanced datasets34,35.
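The class-conditional procedure described above can be illustrated with a minimal Python sketch of Mondrian conformal prediction for a binary classifier. The nonconformity score (one minus the predicted probability of the candidate label), the function name, and the toy calibration data are assumptions made for illustration and are not taken from the study code.

```python
def mondrian_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """Class-conditional (Mondrian) conformal prediction for binary labels.

    cal_probs / test_probs: predicted probability of the positive class.
    Assumed nonconformity score for candidate label y: 1 - p(y).
    """
    def score(p_pos, y):
        return 1.0 - p_pos if y == 1 else p_pos

    # Calibration scores are kept separately per class (the Mondrian taxonomy).
    cal_scores = {0: [], 1: []}
    for p, y in zip(cal_probs, cal_labels):
        cal_scores[y].append(score(p, y))

    prediction_sets = []
    for p in test_probs:
        labels = set()
        for y in (0, 1):
            s = score(p, y)
            n_at_least = sum(1 for c in cal_scores[y] if c >= s)
            p_value = (n_at_least + 1) / (len(cal_scores[y]) + 1)
            if p_value > alpha:  # keep labels that cannot be rejected
                labels.add(y)
        prediction_sets.append(labels)
    return prediction_sets


# Toy calibration set: 40 negatives and 20 positives with overlapping scores.
neg_probs = [i / 50 for i in range(40)]
pos_probs = [0.3 + 0.035 * i for i in range(20)]
cal_probs = neg_probs + pos_probs
cal_labels = [0] * len(neg_probs) + [1] * len(pos_probs)

sets = mondrian_prediction_sets(cal_probs, cal_labels, [0.01, 0.5, 0.99])
avg_set_size = sum(len(s) for s in sets) / len(sets)
uncertain_fraction = sum(1 for s in sets if len(s) == 2) / len(sets)
```

In this sketch, a prediction set containing both labels corresponds to an uncertain prediction, and the average set size and uncertain fraction are computed in the same spirit as the quantities reported in Table 3.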

Table 3 Conformal prediction analysis of immediate and delayed CADR types

Discussion

In this study, we developed, externally validated, and qualitatively analyzed CADR prediction models using patient trajectories and EHR FMs as well as other baselines. To extract high-quality patient records of skin rashes, we integrated nursing records with structured EHR data. CDM-BERT outperformed all other baselines, confirming the effectiveness of DE for predicting CADR. Our model showed remarkable generalizability, maintaining consistent performance across three different hospitals without requiring institution-specific finetuning. By analyzing CADR subtypes (immediate and delayed types), we found distinct predictive patterns that enhance applicability to clinical practice with subdivided use cases. These findings demonstrate the effectiveness of EHR FMs as a CADR prediction model that can directly contribute to patient care.

We assumed that the model pretrained using dense medical records from inpatients would reflect the variation in medical status before and after various treatments, including medication. Several studies have predicted ADRs using patient records from EHR data. One previous study developed a prediction model using a random forest algorithm, with clinical code data (prescription drugs and diagnoses), and measurements extracted from the EHR database36. Another study also aggregated diagnosis, drug, and measurement records to develop prediction models for various adverse drug events using several machine learning methods37. Nevertheless, only a few studies have used EHR FMs for ADR prediction23. In addition, to the best of our knowledge, CADRs have not been explored through EHR FMs.

Nursing records are important for developing an ADR prediction model, as they include detailed and regular observations of clinical signs and symptoms, such as skin rash, itching, nausea, and vomiting. Despite the abundance of information in nursing records, utilizing them from the EHR is challenging, as not all records are converted into diagnosis or observation codes. Many items in nursing records are difficult to match to existing diagnostic codes because they vary in type, severity, and duration. Although some diagnostic records for skin conditions were present in our EHR databases, their integrity could not be guaranteed because the patients with skin rash in the nursing records and those with coded skin diagnoses in the EHR rarely overlapped. In this study, we combined nursing records and EHR data to fully utilize the CADR records. Subsequently, we constructed reliable prediction models that captured changes in the skin condition of the patients before and after antibiotic treatment.

Many drugs were included in the top 40 most important features of SNUH and SNUBH; however, none were included in those of CNUH (Fig. 2c), as CNUH uses different drug codes. Although all hospitals in the study operated Observational Medical Outcomes Partnership (OMOP) common data model (CDM)-based EHR databases and used the same terminologies for medical codes, the specific concept classes of the drug codes differed. SNUH and SNUBH used drug codes in the “Clinical Drug” class of RxNorm, whereas CNUH used those in the “Quant Clinical Drug” class of RxNorm. For example, the RxNorm concept ID “19027493” indicates “famotidine 10 MG/ML Injectable Solution” under the “Clinical Drug” class, whereas the RxNorm concept ID “35606552” indicates “2 ML famotidine 10 MG/ML Injection” under the “Quant Clinical Drug” class. These codes are almost identical in meaning, but the model treats them as different tokens. Nevertheless, although drug information from CNUH was not available for external validation, our model still achieved an AUROC of approximately 0.9. Instead of drug information, the model classified patients with CADRs primarily based on measurement items, a pattern rarely observed in SNUH and SNUBH. We also conducted additional experiments using CDM-BERT after replacing drug codes with “unknown” codes; the model achieved AUROCs of 0.971 and 0.928 for SNUH and SNUBH, respectively, demonstrating minimal performance degradation compared with the original setting (0.975 and 0.928). This implies that our model does not rely solely on a specific set of medical codes but comprehensively considers the overall patient information.
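The effect of the code mismatch can be illustrated with a minimal Python sketch in which codes absent from the training vocabulary are mapped to an unknown token. The two famotidine concept IDs are those quoted above; the function name, the toy vocabulary, and the other code strings are hypothetical placeholders for illustration only.

```python
UNK = "[UNK]"

def encode_trajectory(codes, vocab):
    """Replace any code missing from the training-site vocabulary with [UNK]."""
    return [code if code in vocab else UNK for code in codes]

# Toy vocabulary built at the training site. It contains the "Clinical Drug"
# concept for famotidine but not the "Quant Clinical Drug" concept.
training_vocab = {"19027493", "hypothetical_condition_code", "hypothetical_lab_code"}

external_trajectory = ["hypothetical_lab_code", "35606552", "hypothetical_condition_code"]
encoded = encode_trajectory(external_trajectory, training_vocab)
unknown_fraction = encoded.count(UNK) / len(encoded)
```

Here the external-site drug record is reduced to `[UNK]`, so only the measurement and condition records remain informative, which mirrors the behavior described for the CNUH cohort.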

Our feature importance analysis revealed that our model assigns higher attention scores to clinically relevant concepts, such as cancer, chronic kidney disease, and chronic liver disease, than to unrelated concepts (Fig. 2a, b). This finding suggests that the model bases its decisions more on clinically relevant concepts. However, attention scores indicate how much the corresponding records influenced the model’s decision, and high attention scores do not directly indicate clinically meaningful records. In addition, as described in the preceding paragraph, the EHR FM makes decisions by comprehensively analyzing all given concepts, rather than relying solely on specific “meaningful” concepts or domains. This makes the interpretation of EHR FMs difficult despite their superior performance.

Although skin tests can be performed for patients with suspected β-lactam allergy, they are not routinely recommended before antibiotic administration in patients without a history of hypersensitivity. While standardized protocols exist for β-lactam allergy testing, the diagnostic sensitivity varies depending on factors such as the time elapsed since the reaction and the specific drug involved38,39,40. In a study evaluating the clinical utility of β-lactam antibiotic skin testing, in which skin tests were performed with five penicillin and cephalosporin antibiotics in patients with no history of allergy to β-lactam antibiotics, 97 (6.8%) of the 1421 patients tested positive for at least one drug. None of these patients showed hypersensitivity to cephalosporins, whereas 4 of the 1324 patients with negative test results exhibited hypersensitivity following cephalosporin injection, indicating that skin testing before the use of penicillin and cephalosporin may not be clinically useful41. A study evaluating the effectiveness of skin testing prior to cefazolin use in 13,153 cases also failed to prove the clinical effectiveness of skin tests. Of the 13,153 individuals who underwent skin tests, 184 (1.4%) exhibited positive results. Of these, 174 avoided cefazolin use, and 10 underwent a stepwise challenge test with cefazolin, all of which were negative42. A meta-analysis of penicillin skin tests reported 30.7% sensitivity and 96.8% specificity of the skin tests, with an AUROC value of 0.68643.

Delayed-type CADRs comprise a broad spectrum of diseases that occur a considerable time after antibiotic exposure and have various pathophysiological and clinical risk factors. These delayed-type reactions can sometimes lead to severe and potentially life-threatening conditions. Specifically, systemic manifestations occur in 30% of CADR cases, with serious reactions such as SJS, TEN, and generalized exfoliative dermatitis in 5.2% of cases4. Furthermore, delayed CADRs are clinically difficult to predict because, unlike for immediate reactions, no test is available for predicting them. In this context, our findings showed that EHR FMs with DE demonstrated superior and more stable prediction performance for delayed reactions than for immediate reactions. This suggests that baseline patient information holds greater predictive value for delayed-type reactions and may include generalized risk factors relevant across hospitals. In addition, the performance gap between delayed and immediate CADRs is noteworthy, as we did not provide the model with separate information on immediate and delayed types. The model itself captured the disparity between CADR types, and this pattern held across all three cohorts. This finding indicates that our model can derive meaningful insights from patient medical history, even without explicit subtype information.

Among delayed-type CADRs, severe CADRs such as DRESS syndrome, SJS, and TEN are rare but potentially life-threatening. The incidence of SJS and TEN is estimated to be 2 per 1 million people, whereas the incidence of DRESS syndrome is estimated to be 1 per 1000 to 1 per 10,000 people44. The diagnosis of severe CADRs depends on the clinical presentation of the rash, its duration, associated symptoms (e.g., fever, pruritus, lymphadenopathy), and the time from drug administration to symptom onset. Direct immunofluorescence testing of the blistering rash and skin biopsy are necessary to rule out other possible causes. Clinically, it would be very useful to identify and predict which patients with delayed CADRs are at risk of progressing to severe reactions. However, the training data for our model were insufficient to distinguish the severity and phenotype of skin rashes, which made it impossible to identify and predict severe CADRs. To the best of our knowledge, this is the first study to use AI to predict antibiotic-associated CADRs; future research may develop an AI model that can distinguish and predict severe CADRs.

This study has some limitations. First, the code disparity between hospitals was not fundamentally addressed. Instead of matching different codes, we simply replaced out-of-vocabulary codes with the [UNK] token. Although this approach did not significantly reduce the performance, a more robust strategy to handle code disparity is imperative to enhance the compatibility of the model. Second, while our feature importance analysis demonstrated that our model focuses more on clinically relevant features, further interpretation of the complex attention patterns is needed to validate the utility of attention score analysis in EHR FMs. Third, CADRs are usually classified clinically using more complex criteria than those in our setting. CADRs are classified into immediate and delayed reactions based on the 1–6 h window after drug administration45,46. Immediate reactions manifest as urticaria and angioedema, whereas delayed reactions mainly manifest as maculopapular rashes. A more detailed analysis of these subtypes and the corresponding application strategies is suggested for future studies. Fourth, this study involved hospitals from a single country, and external validation using a global cohort is necessary for model generalization. Fifth, we extracted the CADR records using nursing reports and rule-based regular expressions. This approach may have missed some records and may not be applicable to other institutions, as nursing reports are not strictly structured and their format varies among institutions. Therefore, leveraging large LMs could enhance the CADR record extraction process. Sixth, this study focused on mild CADRs, including skin lesions, rashes, urticaria, and eruptions. A more comprehensive study design that covers diverse CADRs is required. Given the demonstrated effectiveness of EHR FMs for predicting mild CADRs, predicting severe CADRs, such as SJS and TEN, and identifying new factors causing those CADRs are suggested for future work. Seventh, our study compared patients who developed rashes during antibiotic treatment to those who did not, rather than comparing them with patients who developed rashes without antibiotic exposure. Future studies comparing “antibiotic with rash” versus “rash without antibiotic exposure” could help distinguish antibiotic-associated reactions from general rash susceptibility. Eighth, our current model predicts overall CADR risk rather than identifying specific culprit antibiotics or drug classes. Since drug allergy typically develops to specific drugs or drug classes rather than to antibiotics in general, future studies need to develop drug class-specific prediction models to provide more valuable clinical guidance for antibiotic selection. Nevertheless, the current model can serve as a risk stratification tool to identify patients who may benefit from enhanced monitoring during antibiotic treatment.

Methods

Data curation

This study used data from SNUH between December 2012 and April 2023, SNUBH between April 2003 and December 2022, and CNUH between July 2020 and September 2024. All included hospitals operate an EHR database based on the OMOP CDM47. We collected data from four domains (tables) of the CDM schema: condition, drug, measurement, and procedure. Terminologies for diagnosis and procedure, prescription, and measurement were based on SNOMED CT48, RxNorm and RxNorm Extension49, and LOINC50, respectively. In addition to EHR data, we combined nursing statements and reports to process information regarding skin rashes.

For the internal dataset (SNUH), we randomly split the data into training (70%), validation (10%), and hold-out test (20%) datasets. Training and validation datasets were used to develop the models, including LM pretraining. External datasets were used to validate the trained models. As data sharing across hospitals is strictly prohibited, we only exported the weights of the trained models, and external validation was performed in each hospital.

The institutional review boards of the SNUH (approval no. E-2403-088-1521), SNUBH (approval no. X-2403-890-901), and CNUH (approval no. 2024-06-022-001) approved the study and waived the requirement for informed consent, as only retrospective and observational EHR data was used. The study adhered to the principles outlined in the Declaration of Helsinki, the Korean Bioethics and Safety Act (Law No. 16372), and the Human Research Protection Program–Standard Operating Procedure of the Seoul National University Hospital.

Nursing statements and reports

To obtain high-quality skin rash records, we used nursing statements and reports from each hospital. Note that “nursing statements” and “nursing reports” are separate terms in this study. The nursing records of SNUH and SNUBH consist of nursing statements, which follow a code system. With this system, nurses directly enter the codes based on the status or condition of patients, instead of writing free-text nursing reports. Examples of nursing statements include “rash is present,” “urticaria is present,” “no rash is observed,” and “skin lesion is present,” and each statement has an assigned code. We extracted all statements including the keywords “skin lesion,” “rash,” “urticaria,” and “eruption” and excluded inappropriate nursing statements including “diaper rash,” “abrasion,” and “lacerated,” as these conditions are not likely to be associated with antibiotic-associated CADRs. Five allergists confirmed that the included nursing statements, such as “rash is present,” “urticaria is present,” and “skin lesion is present,” were appropriate.
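The keyword-based selection of nursing statements can be sketched in Python as follows. The include and exclude keywords are those listed above; the handling of negated statements such as “no rash is observed”, the regular expressions, and the function name are assumptions made for illustration and do not reproduce the rules actually used in the study.

```python
import re

# Keywords taken from the text above.
INCLUDE = re.compile(r"\b(?:skin lesion|rash|urticaria|eruption)\b", re.IGNORECASE)
EXCLUDE = re.compile(r"\b(?:diaper rash|abrasion|lacerated)\b", re.IGNORECASE)
# Assumed rule for this sketch: a keyword directly preceded by "no" is a negative finding.
NEGATED = re.compile(r"\bno\s+(?:skin lesion|rash|urticaria|eruption)\b", re.IGNORECASE)

def is_rash_statement(statement):
    """Return True if a nursing statement records a skin rash finding."""
    if not INCLUDE.search(statement):
        return False
    if EXCLUDE.search(statement) or NEGATED.search(statement):
        return False
    return True

statements = [
    "rash is present",
    "urticaria is present",
    "no rash is observed",
    "diaper rash is present",
    "skin lesion is present",
]
flags = [is_rash_statement(s) for s in statements]
```

Any rule set of this kind would still require expert review of the selected statements, as was done by the allergists in this study.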

In contrast, CNUH does not have a structured nursing statements system and only has natural language-based nursing reports. As the nursing reports were written in plain Korean text and were less structured, we developed rule-based regular expression algorithms. The detailed rules are summarized in Supplementary Table 6. We randomly selected more than 1000 samples that were classified as having skin rash or not, and five allergists confirmed that all sample reports were accurate.

Prediction timepoint

To train the CADR prediction model in a practical scenario, we considered the most recent inpatient visit where antibiotic treatment was provided, and the prediction time point was set as the antibiotic prescription immediately preceding the CADR occurrence. The reasons for the prediction timepoint definition are as follows: First, we considered predicting the first CADR for each patient, but the prediction performance deteriorated because of fewer records. Second, we attempted to rule out potential bias due to the conditions for the prediction timepoint. Accordingly, we designed the model to adapt to various situations, whether patient history records are abundant or not. We also considered three additional scenarios to validate the CADR prediction model: predicting the immediate and delayed CADRs at the initiation of the antibiotic treatment and predicting the delayed CADRs just before the CADR occurrence. The CADR prediction process and timepoints are presented in Fig. 4.

Fig. 4: CADR prediction process and prediction timepoints.

a CDM-based EHR data and nursing records (nursing reports and statements) were merged and used to construct patient trajectories. Patient records until the last antibiotics prescription before CADR occurrence were used to predict CADR within 1 day. b Prediction timepoint for immediate CADRs. c Prediction timepoints for delayed CADRs. Prediction timepoints 1 and 2 involved the use of the records until the first antibiotic prescription and the last antibiotic prescription before CADR occurrence.

Model training

To train the tree models as baselines, we transformed the patient trajectories into count-based one-dimensional vectors, referring to previous studies. Time binning counts the occurrences of a code in different time intervals. We used intervals of 0–1 day, 1–7 days, and 7+ days from the prediction timepoint. The count-based vector was then concatenated with age and sex information. Hyperparameter tuning was performed using grid search (Supplementary Table 7). Medical codes that were not in the SNUH vocabulary were omitted from the external validation. The tree models were trained and validated using the Scikit-learn package (version 1.0.2) in Python (version 3.8.10).
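The count-based transformation with time binning can be sketched as follows. This is a minimal Python illustration that assumes half-open intervals of [0, 1), [1, 7), and 7 or more days before the prediction timepoint; the function names, the boundary handling, and the toy codes are hypothetical and are not taken from the study code.

```python
from collections import Counter

BIN_NAMES = ("0-1 day", "1-7 days", "7+ days")

def bin_index(days_before):
    """Map the time before the prediction timepoint (in days) to a bin index."""
    if days_before < 1:
        return 0
    if days_before < 7:
        return 1
    return 2

def count_vector(records, vocab, age, sex):
    """Build a flat feature vector from a patient trajectory.

    records: iterable of (code, days_before_prediction) pairs.
    vocab:   ordered list of codes known at the training site.
    Returns len(vocab) * 3 counts followed by age and sex.
    """
    position = {code: i for i, code in enumerate(vocab)}
    counts = Counter()
    for code, days_before in records:
        if code not in position:
            continue  # codes outside the training vocabulary are omitted
        counts[(position[code], bin_index(days_before))] += 1
    features = [counts[(i, b)] for i in range(len(vocab)) for b in range(len(BIN_NAMES))]
    return features + [age, sex]

vocab = ["famotidine", "ceftriaxone", "albumin"]
records = [("ceftriaxone", 0.2), ("ceftriaxone", 3), ("albumin", 30), ("unseen_code", 2)]
vector = count_vector(records, vocab, age=63, sex=1)
```

The resulting vector has a fixed length regardless of trajectory length, which is what allows tree-based models to consume it.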

In addition to tree-based machine learning models, we used several deep learning models, including RNN-based models (LSTM and GRU), modified RNN-based networks with attention mechanisms (RETAIN), and EHR FMs (BEHRT, CLMBR-t-base, and CDM-BERT). BEHRT is a basic BERT-based EHR foundation model, and CLMBR-t-base is an open-weight EHR foundation model that generates patient representations, which are subsequently fine-tuned using LightGBM51. CDM-BERT is a modified BEHRT with DE for the OMOP CDM format. All EHR FMs were based on a pretraining and fine-tuning framework. All deep learning models were based on the same embedding of patient records and shared the same embedding layers but had different subsequent layers. The embedding sizes of the token, age, segment, and domain were set to 50,000, 180, 1024, and 20, respectively. All deep learning models used six RNN or self-attention layers with a hidden dimension of 256. The BERT-based LM used eight attention heads. To train the CADR prediction models, we added a feed-forward neural network (FFNN), a hyperbolic tangent (Tanh) activation function, dropout, and another FFNN to the representation of the first token to yield a vector of size two, and binary cross-entropy was used as the loss function. We calculated the AUROC values for every 1000 batches, and early stopping was implemented if the AUROC did not improve for 20 cycles. To address class imbalance, we randomly oversampled the patients with CADRs to one-tenth of those without CADRs. The batch size, learning rate, and dropout rate were 16, 5e–5, and 0.1, respectively. We used the pretrained models from our previous study and confirmed that the hold-out test dataset was entirely independent of the pretraining dataset. The hyperparameter tuning for LightGBM in CLMBR-t-base was performed using the same grid search approach as described for random forest and GBM (Supplementary Table 7). For the deep learning and BERT implementation, we used PyTorch (version 1.12.0) and the HuggingFace package (version 4.41.2)52 in Python (version 3.8.10). The code implementation for the model training was based on the GitHub repository (https://github.com/kicarussays/CDM-BERT).
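The random oversampling step can be illustrated with a short Python sketch that duplicates minority-class patients until they number one-tenth of the majority class. The function name, the sampling-with-replacement strategy, and the toy identifiers are assumptions made for illustration; the implementation in the cited repository may differ.

```python
import random

def oversample_to_ratio(positives, negatives, negatives_per_positive=10, seed=0):
    """Randomly duplicate positives until there is 1 positive per
    `negatives_per_positive` negatives (no change if already above that)."""
    rng = random.Random(seed)
    target = len(negatives) // negatives_per_positive
    if not positives or len(positives) >= target:
        return list(positives) + list(negatives)
    # Draw the missing positives with replacement from the original positives.
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    return list(positives) + extra + list(negatives)

positives = [f"cadr_{i}" for i in range(5)]
negatives = [f"control_{i}" for i in range(1000)]
training_set = oversample_to_ratio(positives, negatives)
n_positive = sum(1 for pid in training_set if pid.startswith("cadr_"))
```

Oversampling of this kind should be applied only to the training split, so that validation and test sets keep the original class ratio.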

Feature importance

We conducted two feature importance analyses: (1) comparing attention scores between CADR-related and unrelated diagnosis and drug concepts, and (2) identifying the top 40 most important features. For the attention score comparison, we extracted and averaged attention scores for predefined concept groups. To identify important features, the calculation involved four steps: First, we averaged the attention matrices of eight attention heads from the last self-attention layer of the LM, yielding a simplified 2048 by 2048 attention matrix \(A\). For each attention matrix \(A\), an attention score \(A_{ij}\) indicates how closely the \(i^{\text{th}}\) and \(j^{\text{th}}\) tokens are related53. Note that \(\sum_{j} A_{ij} = 1 \;\; \forall i \in (0, \ldots, 2048)\). Second, we averaged the simplified attention matrix across rows, resulting in a one-dimensional vector of length 2048, representing the importance of each token. Third, for each patient, we identified the top 10% of tokens with the highest average attention scores. Finally, for each variable, we counted the number of patients who included that variable in the top 40 features. We used count-based feature importance to improve interpretability, instead of the trajectory attention vector proposed in our previous study.
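The four steps can be sketched in plain Python on toy attention matrices. The sketch reflects one reading of the procedure: "averaging across rows" is taken as the mean attention each token receives over all query positions, and the final count is taken as the number of patients whose top-10% tokens include a given code. The function names and toy data are hypothetical.

```python
import math
from collections import Counter

def token_importance(head_matrices):
    """Steps 1-2: average attention over heads, then over query rows.

    head_matrices: list of square matrices (lists of rows), each row summing to 1.
    Returns one importance score per token (column).
    """
    n_heads = len(head_matrices)
    n = len(head_matrices[0])
    averaged = [
        [sum(head[i][j] for head in head_matrices) / n_heads for j in range(n)]
        for i in range(n)
    ]
    return [sum(averaged[i][j] for i in range(n)) / n for j in range(n)]

def top_codes(codes, scores, fraction=0.10):
    """Step 3: codes of the top `fraction` of tokens by importance score."""
    k = max(1, math.ceil(len(codes) * fraction))
    ranked = sorted(range(len(codes)), key=lambda j: scores[j], reverse=True)
    return {codes[j] for j in ranked[:k]}

def count_importance(patients, fraction=0.10):
    """Step 4: number of patients whose top tokens include each code."""
    counts = Counter()
    for codes, head_matrices in patients:
        counts.update(top_codes(codes, token_importance(head_matrices), fraction))
    return counts

patient_a = (
    ["[CLS]", "ceftriaxone", "albumin"],
    [
        [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.6, 0.1]],
        [[0.4, 0.5, 0.1], [0.3, 0.6, 0.1], [0.1, 0.8, 0.1]],
    ],
)
patient_b = (
    ["[CLS]", "famotidine", "ceftriaxone"],
    [[[0.1, 0.2, 0.7], [0.2, 0.2, 0.6], [0.3, 0.2, 0.5]]],
)
importance = count_importance([patient_a, patient_b])
```

Ranking the codes by these counts and keeping the 40 most frequent would then give a list comparable in form to Fig. 2c.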

Statistical analysis

Characteristics (age, sex, and comorbidities) of the CADR and non-CADR groups in all cohorts were compared by calculating P-values using Student’s t test for age and Fisher’s exact test for other variables. We reported the AUROC, AUPRC, sensitivity, specificity, precision, and F1-score as performance metrics. The confidence intervals of AUROC and AUPRC were calculated using DeLong’s method54, whereas those of the F1-score, sensitivity, specificity, and precision were calculated using Wilson’s method55. Statistical significance was set at α = 0.05. All statistical analyses were performed using Scikit-learn (version 1.0.2) in Python (version 3.8.10).
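For reference, the Wilson score interval for a proportion such as sensitivity or precision can be computed as in the following minimal Python sketch. The function name and the example counts (30 successes out of 200) are hypothetical and are not results from this study.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, alpha=0.05):
    """Two-sided Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical example: 30 true positives among 200 positive predictions.
low, high = wilson_interval(successes=30, n=200)
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and behaves reasonably for proportions near 0 or 1, which is relevant when the positive class is rare.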