Abstract
Cutaneous adverse drug reactions (CADRs) are the most common form of adverse drug reactions, ranging from mild rashes to life-threatening diseases, such as Stevens–Johnson syndrome and toxic epidermal necrolysis. However, there is no effective tool to predict antibiotic-associated CADRs. In this study, we propose an antibiotic-associated CADR prediction model using electronic health record (EHR) foundation models (FMs). EHR FMs are based on the pretraining-finetuning paradigm of language models, treating medical codes and their sequences as analogous to words and sentences. We included 802,131 inpatients across three tertiary hospitals in Korea, combining EHR data with nursing statements and reports to extract skin rash records. Our approach achieved the best predictive performance among all baseline models across all datasets. To enhance clinical relevance, we classified CADRs into immediate and delayed types and conducted a detailed sub-analysis. We found that properly configured EHR FMs can effectively predict the risk of developing antibiotic-associated CADRs, particularly for delayed-type reactions where predictive testing options are limited.
Introduction
An adverse drug reaction (ADR) is a harmful or unpleasant response to a medicinal product that predicts hazard for future administration, requiring prevention, specific treatment, dose adjustment, or product withdrawal1. ADRs are among the leading causes of death in the United States, and serious ADRs occur in approximately 6.7% of hospitalized patients, with fatal ADRs reported in 0.32% of cases2. In a prospective study of 5000 patients hospitalized at three tertiary hospitals in Korea, ADRs were observed in 10.2% of cases, and those who developed ADRs had a 5-day longer hospital stay and significantly higher medical expenses compared to the control group3. ADRs worsen clinical outcomes and increase medical costs because of prolonged hospital stays and additional visits3,4,5.
Among the various manifestations, the skin is the organ most commonly affected by ADRs, with reactions ranging from mild rashes to life-threatening diseases, such as Stevens–Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN). Antibiotics are among the most common triggers for cutaneous adverse drug reactions (CADRs)6,7. In a prospective study from an Italian university hospital, ADRs occurred in 4% of cases, of which 28% were CADRs8. In a French general hospital, the prevalence of CADR was 3.6 per 1000 patients, 34% of whom were classified as severe, with antibiotics, particularly penicillin, being the most commonly implicated drug9. Serious CADRs are rare but potentially life-threatening drug hypersensitivity reactions including Drug Rash with Eosinophilia and Systemic Symptoms (DRESS) syndrome, SJS, and TEN, which have a reported mortality rate of 20–40%10,11. Thus, predicting CADRs and preventing deterioration is important; however, a consistent and standardized method for predicting CADRs has not yet been established.
The adoption of electronic health records (EHRs) has grown significantly worldwide, leading to a substantial increase in healthcare data12,13,14. Subsequently, deep learning-based EHR analysis models have been developed using the large-scale patient data, and representations of patient trajectories have emerged as a promising area of research15,16,17,18,19,20. As individual medical codes and sequential sets of medical codes correspond to words and sentences, several studies have proposed EHR foundation models (FMs) based on the pretraining paradigms used in language models (LMs). For pretraining, although most studies have utilized the masked LM21, autoregressive LMs have also been used22. Notably, previous studies have achieved excellent performance in predicting several diseases, such as pancreatic cancer18, heart failure19, non-accidental trauma15, and various ADRs23 using EHR FMs and patient trajectories.
The present study aims to develop a CADR prediction model using EHR FMs and patient trajectories consisting of diagnosis, measurement, prescription, and procedure records from EHR. We compared the performance of several machine learning, deep learning, and EHR foundation models, including BEHRT17, CLMBR-t-base22, and CDM-BERT23. Specifically, we demonstrated the effectiveness of domain embedding (DE) in CDM-BERT, which exhibited superior performance in predicting various ADRs. Skin rash records were extracted from nursing records (including statements and reports) and integrated with the EHR data. For internal and external validation, we utilize data from three tertiary hospitals in the Republic of Korea (hereafter, Korea). Unlike prior research, this study demonstrates true generalizability with stable performance across institutions without additional finetuning on external datasets. Moreover, we classify CADRs into immediate and delayed types for a detailed sub-analysis, and interpret the results based on the clinical background. With a rigorously defined practical cohort for CADR prediction, this study represents a progression in ADR prediction from proof-of-concept to clinically implementable solutions.
Results
Study population
We used the records of adult patients who were hospitalized for at least 3 days and were prescribed antibiotics during the stay. Since we used nursing statements and reports for CADR record extraction, only the data from inpatients were used because accurate labeling of antibiotic-associated CADRs in outpatients was not feasible due to the lack of nursing records. Patients from three tertiary hospitals in Korea—Seoul National University Hospital (SNUH; n = 366,434), Seoul National University Bundang Hospital (SNUBH; n = 325,396), and Chungbuk National University Hospital (CNUH; n = 110,301)—were included in this study. The prediction for all patients who were prescribed antibiotics was performed during their last visit. The exclusion criteria for patients with antibiotic-associated skin rash were as follows: <20 records, absence of skin rash during antibiotic treatment, and continuation of antibiotic treatment after the occurrence of skin rash. Given the need for continuous antibiotic treatment, patients who were prescribed medication for skin rash within 1 day after the occurrence were included, even if they were treated with antibiotics continuously. Finally, 1906, 3070, and 524 patients with CADR and 216,649, 311,631, and 34,357 patients without CADR from SNUH, SNUBH, and CNUH, respectively, were included in the study cohort (Fig. 1). The antibiotics and the drugs used to treat skin rash are summarized in Supplementary Table 1.
An internal dataset from Seoul National University Hospital (SNUH) was used for training and internal validation of the model. External datasets from Seoul National University Bundang Hospital (SNUBH) and Chungbuk National University Hospital (CNUH) were used for external validation.
Data from SNUH were used for training and internal validation of the CADR prediction model and data from SNUBH and CNUH were used for external validation (Table 1). The incidence ratio of CADR was approximately 1%, consistent with that reported in previous studies24,25,26. The incidence ratio of CADR for each antibiotic is summarized in Supplementary Table 2. No significant differences were observed for age and sex between the CADR and non-CADR groups. However, the days of hospitalization and the number of codes per patient were higher in the CADR groups than in the non-CADR groups for all hospitals. The prevalence rates of all comorbidities were higher in the CADR group of SNUH; however, no evident pattern of comorbidities was observed for SNUBH and CNUH. The overall distribution of data varied between hospitals, and we demonstrated the validity of our model using these external datasets. For a more detailed analysis, we classified CADRs into immediate and delayed types. The baseline characteristics of patients with immediate and delayed CADR types are summarized in Supplementary Table 3.
Prediction performance
For training, the prediction timepoint was set as the time of the latest antibiotic treatment before CADR. The models were designed to predict whether CADR would occur if the patient was prescribed specific antibiotics. The last prescribed antibiotics and the records preceding them were used as inputs for the model. We evaluated the CADR prediction performance using eight models: two tree-based models (random forest27 and gradient boosting machine (GBM)28), three recurrent neural network (RNN)-based models (long short-term memory (LSTM)29, gated recurrent unit (GRU)30, and RETAIN31), and three EHR FMs (BEHRT17, CLMBR-t-base22, and CDM-BERT23). For BEHRT and CDM-BERT, we also evaluated versions trained without pretraining. As sequence data (patient trajectory) cannot be used as input for the random forest and GBM, we transformed each sequence into a count-based one-dimensional vector32. For the remaining models, we used the sequence data with age and segment embeddings. We randomly split the patient data from SNUH into development (80%) and hold-out internal validation (20%) datasets. CDM-BERT demonstrated the best prediction performance for internal and external validations, with area under the receiver operating characteristic (ROC) curve (AUROC) values of 0.975, 0.928, and 0.893 for SNUH, SNUBH, and CNUH, respectively (Table 2). Notably, the performance was maintained even without additional fine-tuning for external validation. Meanwhile, BEHRT and CLMBR-t-base poorly predicted the CADR, even after pretraining. Despite the low incidence ratio of CADR (approximately 1%), CDM-BERT achieved high precision of 15.8%, 7.8%, and 12.6% for SNUH, SNUBH, and CNUH, respectively. The ROC and precision-recall (PR) curves of all models are summarized in Supplementary Fig. 1.
Feature importance
Feature importance was assessed using fine-tuned CDM-BERT, which demonstrated the best prediction performance (Table 2). To evaluate whether the model focuses on clinically relevant features, we extracted attention scores from the model and compared averaged scores between CADR-related and unrelated concepts. CADR-related diagnosis concepts included cancer, chronic kidney disease, and chronic liver disease, while CADR-unrelated diagnosis concepts included obesity, cataract, hypertension, and fracture. Similarly, CADR-related drug concepts included antibiotics, anticonvulsants, nonsteroidal anti-inflammatory drugs, and antitubercular drugs, while CADR-unrelated drug concepts included vitamins, normal saline, and dextrose. As shown in Fig. 2a, b, attention scores of CADR-related concepts were higher than unrelated concepts for both diagnosis and drug domains across all institutions. Calculating attention scores for drug concepts of CNUH was not feasible because the drug concepts of CNUH were not compatible with those of SNUH. Figure 2c shows the 40 most important features of the model for each hospital. For SNUH, drugs (such as famotidine) and antibiotics (such as ceftriaxone and piperacillin) were the most important variables. For SNUBH, midazolam, famotidine, and fluid-related drugs and chest CT procedure items were the most important variables. For CNUH, measurements such as protein, albumin, and liver enzyme levels, and conditions such as chronic kidney disease and acute kidney injury were the most important variables. No drugs appeared in the top 40 features for CNUH because it used drug codes different from those of SNUH. Accordingly, when predicting the CADRs in the CNUH cohort, the model could not utilize the drug records because they were replaced with the [UNK] code, which represents codes not found in the vocabulary. The number of codes and shared codes for each domain is summarized in Supplementary Table 4.
a Comparison of attention scores between CADR-related and unrelated diagnosis concepts. b Comparison of attention scores between CADR-related and unrelated drug concepts. Drug concept analysis for CNUH was not possible due to different RxNorm concept classes used (see “Discussion”). c Top 40 most important features for predicting CADRs. The bar color indicates the domain of the medical code (condition, drug, measurement, and procedure). SNUH Seoul National University Hospital, SNUBH Seoul National University Bundang Hospital, CNUH Chungbuk National University Hospital.
Sub-analysis of immediate and delayed CADR types
For a more detailed analysis, the CADRs were classified into immediate and delayed types. Furthermore, the prediction time point for the delayed type was divided into two categories: immediately before antibiotic treatment and immediately before CADR occurrence to assess the impact of cumulative antibiotic treatment history on CADR. Across all cohorts, the model demonstrated greater confidence in predicting the delayed type CADRs compared with the immediate type. The detailed performance metrics of the models are summarized in Fig. 3 and Supplementary Table 5. The ROC and PR curves of the models are presented in Supplementary Fig. 2. The feature importance results for each group are presented in Supplementary Figs. 3–5.
The green line in each boxplot indicates the median, and the length of each box indicates the interquartile range (IQR) of the logits for each group. The ends of the box correspond to the 25th (Q1) and 75th (Q3) percentiles, and the lines extend to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. SNUH Seoul National University Hospital, SNUBH Seoul National University Bundang Hospital, CNUH Chungbuk National University Hospital, CADR cutaneous adverse drug reaction, Abx antibiotics, Tx treatment.
For uncertainty analysis, we reported the results of conformal prediction analysis in Table 3. Considering the class imbalance of our cohort, we adopted Mondrian conformal prediction33, which is class-conditional, with a significance level of α = 0.05. The average set size was smaller when the model predicted delayed-type CADR compared to immediate-type CADR. For delayed-type CADR, the fraction of uncertain predictions was 0 in SNUH and SNUBH, and 0.144 in CNUH. The overall and negative coverages were acceptable and close to 0.95, while positive coverage showed slight variability (0.915–0.950). This phenomenon is consistent with previous findings that the minority class tends to be undercovered in imbalanced datasets34,35.
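The class-conditional calibration described above can be illustrated with a minimal sketch. It assumes a nonconformity score of one minus the predicted probability of the candidate class, which is a common choice but is not stated in the text; the function names `calibrate` and `prediction_set` are hypothetical and do not come from the study code.

```python
from collections import defaultdict

def calibrate(probs, labels):
    """Group nonconformity scores by true class (the Mondrian, class-conditional step).

    probs: list of [p_class0, p_class1]; labels: list of 0/1 true labels.
    Assumption: nonconformity = 1 - probability assigned to the true class.
    """
    scores = defaultdict(list)
    for p, y in zip(probs, labels):
        scores[y].append(1.0 - p[y])
    return scores

def prediction_set(p, scores, alpha=0.05):
    """Return every label whose class-conditional p-value exceeds alpha."""
    kept = []
    for y, cal in scores.items():
        s = 1.0 - p[y]
        # Share of calibration scores of class y at least as nonconforming as s.
        p_value = (sum(c >= s for c in cal) + 1) / (len(cal) + 1)
        if p_value > alpha:
            kept.append(y)
    return sorted(kept)
```

Under this sketch, a prediction set containing both labels (or neither) would count as an uncertain prediction, and the average set size is the mean length of the returned lists. Because each class is calibrated only against its own examples, the rare positive class is not swamped by the majority class.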
Discussion
In this study, we developed, externally validated, and qualitatively analyzed CADR prediction models using patient trajectories and EHR FMs as well as other baselines. To extract high-quality patient records of skin rashes, we integrated nursing records with structured EHR data. CDM-BERT outperformed all other baselines, confirming the effectiveness of DE for predicting CADR. Our model showed remarkable generalizability, maintaining consistent performance across three different hospitals without requiring institution-specific finetuning. By analyzing CADR subtypes (immediate and delayed types), we found distinct predictive patterns that enhance applicability to clinical practice with subdivided use cases. These findings demonstrate the effectiveness of EHR FMs as a CADR prediction model that can directly contribute to patient care.
We assumed that the model pretrained using dense medical records from inpatients would reflect the variation in medical status before and after various treatments, including medication. Several studies have predicted ADRs using patient records from EHR data. One previous study developed a prediction model using a random forest algorithm, with clinical code data (prescription drugs and diagnoses), and measurements extracted from the EHR database36. Another study also aggregated diagnosis, drug, and measurement records to develop prediction models for various adverse drug events using several machine learning methods37. Nevertheless, only a few studies have used EHR FMs for ADR prediction23. In addition, to the best of our knowledge, CADRs have not been explored through EHR FMs.
Nursing records are important for developing an ADR prediction model, as they include detailed and regular observations of clinical signs and symptoms, such as skin rash, itching, nausea, and vomiting. Despite the abundance of information in nursing records, utilizing them from the EHR is challenging, as not all records are typically converted into diagnosis or observation codes. Many items in nursing records are difficult to match to existing diagnostic codes because they vary in type, severity, and duration. Although some diagnostic records for skin conditions are present in our EHR databases, their integrity could not be guaranteed because the patients identified in the nursing records and those identified in the EHR hardly overlapped. In this study, we combined nursing records and EHR data to fully utilize the CADR records. Subsequently, we constructed reliable prediction models that captured changes in the skin condition of the patients before and after antibiotic treatment.
Many drugs were included in the top 40 most important features of SNUH and SNUBH; however, none were included in those of CNUH (Fig. 2c), as CNUH uses different drug codes. Although all hospitals in the study operated Observational Medical Outcomes Partnership (OMOP) common data model (CDM)-based EHR databases and used the same terminologies for medical codes, the specific classes for drug codes were different. SNUH and SNUBH used drug codes in the “clinical drug” class of Rx-Norm, whereas CNUH used those in the “quant clinical drug” class of Rx-Norm. For example, the Rx-Norm concept ID “19027493” indicates “famotidine 10 MG/ML Injectable Solution” under the “Clinical Drug” class, whereas the Rx-Norm concept ID “35606552” indicates “2 ML famotidine 10 MG/ML Injection” under the “Quant Clinical Drug” class. These codes are almost identical, but the model recognizes them as different. Nevertheless, although drug information from CNUH was not available for external validation, our model still achieved an AUROC of approximately 0.9. Instead of drug information, the model classified patients with CADRs primarily based on the measurement items, a pattern that was hardly observed in SNUH and SNUBH. We also conducted additional experiments using CDM-BERT after replacing drug codes with “unknown” codes, and the model achieved AUROCs of 0.971 and 0.928 for SNUH and SNUBH, respectively, demonstrating minimal performance degradation compared with the original setting (0.975 and 0.928). This implies that our model does not rely solely on a specific set of medical codes but comprehensively considers the overall patient information.
Our feature importance analysis revealed that our model assigns higher attention scores to clinically relevant concepts such as cancer, chronic kidney disease, and chronic liver disease compared to unrelated concepts (Fig. 2a, b). This finding suggests that our model bases its decisions more on clinically relevant concepts. However, it is important to note that attention scores indicate how much the corresponding records influenced the model's decision, and high attention scores do not directly indicate clinically meaningful records. In addition, as described in the preceding paragraph, the EHR FM makes decisions by comprehensively analyzing all given concepts, rather than relying solely on specific “meaningful” concepts or domains. This makes the interpretation of EHR FMs difficult despite their superior performance.
Although skin tests can be performed for patients with suspected β-lactam allergy, they are not routinely recommended before antibiotic administration in patients without a history of hypersensitivity. While standardized protocols exist for β-lactam allergy testing, the diagnostic sensitivity varies depending on factors such as the time elapsed since the reaction and the specific drug involved38,39,40. In a study evaluating the clinical utility of β-lactam antibiotic skin testing, where skin tests were performed with five penicillin and cephalosporin antibiotics in patients with no history of allergy to β-lactam antibiotics, 97 (6.8%) of the 1421 patients tested positive for at least one drug. However, none of these patients showed hypersensitivity to cephalosporins. Conversely, 4 of the 1324 patients with negative test results exhibited hypersensitivity following cephalosporin injection, indicating that skin testing before the use of penicillin and cephalosporin may not be clinically useful41. A study evaluating the effectiveness of skin testing prior to cefazolin use in 13,153 cases also failed to prove the clinical effectiveness of skin tests. Of the 13,153 individuals who underwent skin tests, 184 (1.4%) exhibited positive results. Of these, 174 avoided cefazolin use, and 10 underwent a stepwise challenge test with cefazolin, all of which were negative42. A meta-analysis of penicillin skin tests reported 30.7% sensitivity and 96.8% specificity of the skin tests, with an AUROC value of 0.68643.
Delayed-type CADRs are a broad spectrum of diseases that occur a considerable period of time after antibiotic exposure and have various pathophysiological and clinical risk factors. These delayed-type reactions can sometimes lead to severe and potentially life-threatening conditions. Specifically, systemic manifestations occur in 30% of CADR cases, with serious reactions such as SJS, TEN, and general exfoliative dermatitis in 5.2% of cases4. Furthermore, delayed CADRs are clinically difficult to predict because, unlike immediate reactions, there is no testing method for predicting them. Addressing this problem, our findings showed that EHR FMs with DE demonstrated superior and more stable prediction performance for delayed reactions than for immediate reactions. This suggests that baseline patient information holds greater predictive value for delayed-type reactions and may include generalized risk factors relevant across hospitals. In addition, the performance gap between delayed and immediate CADR is noteworthy, as we did not provide separate information on immediate and delayed types. Our model itself recognized the disparity between different CADR types and demonstrated validity across all three cohorts. This finding indicates that our model can derive meaningful insights based on patient medical history, even when such information is not explicitly provided.
Among delayed-type CADRs, severe CADRs such as DRESS syndrome, SJS, and TEN are rare but potentially life-threatening. The incidence of SJS and TEN is estimated to be 2 per 1 million people, whereas the incidence of DRESS syndrome is estimated to be 1 per 1000 to 1 per 10,000 people44. The diagnosis of severe CADRs depends on the clinical presentation of the rash, its duration, associated symptoms (e.g., fever, pruritus, lymphadenopathy), and the time from drug administration to symptom onset. Direct immunofluorescence testing of the blistering rash and skin biopsy are necessary to rule out other possible causes. Clinically, it would be very useful to be able to identify and predict those at risk of progressing to severe ADRs among severe delayed CADRs. However, the training data for our AI model is insufficient to distinguish between the severity and phenotype of skin rashes, making it impossible to identify and predict severe ADRs. In this study, we are the first to conduct research using AI to predict antibiotic-associated CADRs, and we may also conduct future research to develop an AI model that can distinguish and predict severe CADRs.
This study has some limitations. First, the code disparity between hospitals was not fundamentally addressed. Instead of matching different codes, we simply replaced the out-of-vocabulary codes with the [UNK] token. Although this approach did not significantly reduce the performance, a more robust strategy to handle code disparity is imperative to enhance the compatibility of the model. Second, while our feature importance analysis demonstrated that our model focuses more on clinically relevant features, further interpretation of the complex attention patterns is needed to validate the utility of attention score analysis in EHR FMs. Third, CADRs are usually clinically classified by more complicated conditions than our settings. CADRs are classified into immediate and delayed reactions based on the 1–6 h window after drug administration45,46. Immediate reactions manifest as urticaria and angioedema, whereas delayed reactions mainly manifest as maculopapular rashes. A more detailed analysis of these subtypes and the corresponding application strategies is suggested for future studies. Fourth, this study involved hospitals from a single country, and external validation using a global cohort is necessary for model generalization. Fifth, we extracted the CADR records using nursing reports and rule-based regular expressions. This approach may have missed some records and may not be applicable to other institutions, as nursing reports are not strictly structured and their format varies among institutions. Therefore, leveraging large LMs could enhance the CADR record extraction process. Sixth, this study focused on mild CADRs, including skin lesions, rashes, urticaria, and eruptions. A more comprehensive study design that covers diverse CADRs is required. Given the demonstrated effectiveness of EHR FMs for predicting mild CADRs, predicting severe CADRs, such as SJS and TEN, and identifying new factors causing those CADRs are suggested for future work.
Seventh, our study compared patients who developed rashes during antibiotic treatment to those who did not, rather than comparing them with patients who developed rashes without antibiotic exposure. Future studies comparing “antibiotic with rash” versus “rash without antibiotic exposure” could help distinguish antibiotic-associated reactions from general rash susceptibility. Eighth, our current model predicts overall CADR risk rather than identifying specific culprit antibiotics or drug classes. Since drug allergy typically develops to specific drugs or drug classes rather than to antibiotics in general, future studies need to develop drug class-specific prediction models to provide more valuable clinical guidance for antibiotic selection. Nevertheless, the current model can serve as a risk stratification tool to identify patients who may benefit from enhanced monitoring during antibiotic treatment.
Methods
Data curation
This study used data from SNUH between December 2012 and April 2023, SNUBH between April 2003 and December 2022, and CNUH between July 2020 and September 2024. All included hospitals operate an EHR database based on the OMOP CDM47. We collected data from four domains (tables) of the CDM schema: condition, drug, measurement, and procedure. Terminologies for diagnosis and procedure, prescription, and measurement were based on SNOMED CT48, Rx-Norm and Rx-Norm Extension49, and LOINC50, respectively. In addition to EHR data, we combined nursing statements and reports to process information regarding skin rashes.
For the internal dataset (SNUH), we randomly split the data into training (70%), validation (10%), and hold-out test (20%) datasets. Training and validation datasets were used to develop the models, including LM pretraining. External datasets were used to validate the trained models. As data sharing across hospitals is strictly prohibited, we only exported the weights of the trained models, and external validation was performed in each hospital.
The institutional review boards of the SNUH (approval no. E-2403-088-1521), SNUBH (approval no. X-2403-890-901), and CNUH (approval no. 2024-06-022-001) approved the study and waived the requirement for informed consent, as only retrospective and observational EHR data was used. The study adhered to the principles outlined in the Declaration of Helsinki, the Korean Bioethics and Safety Act (Law No. 16372), and the Human Research Protection Program–Standard Operating Procedure of the Seoul National University Hospital.
Nursing statements and reports
To obtain high-quality skin rash records, we used nursing statements and reports from each hospital. Note that “nursing statements” and “nursing reports” are separate terms in this study. The nursing records of SNUH and SNUBH are based on nursing statements, which are based on a code system. With this system, nurses directly enter the codes based on the status or condition of patients, instead of writing free-text nursing reports. The examples of the nursing statements are “rash is present,” “urticaria is present,” “no rash is observed,” and “skin lesion is present,” and each statement has an assigned code. We extracted all statements including the keywords “skin lesion,” “rash,” “urticaria,” and “eruption” and excluded inappropriate nursing statements including “diaper rash,” “abrasion,” and “lacerated,” as these conditions are not likely to be associated with antibiotic-associated CADRs. Five allergists confirmed that included nursing statements like “rash is present,” “urticaria is present,” and “skin lesion is present” were appropriate.
In contrast, CNUH does not have a structured nursing statements system and only has natural language-based nursing reports. As the nursing reports were written in plain Korean text and were less structured, we developed rule-based regular expression algorithms. The detailed rules are summarized in Supplementary Table 6. We randomly selected more than 1000 samples that were classified as having skin rash or not, and five allergists confirmed that all sample reports were accurate.
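The include/exclude keyword logic for nursing statements can be sketched as follows. This is only an illustration using the English keywords quoted above; the actual rules for the Korean free-text reports are those in Supplementary Table 6 and are not reproduced here. The negation check and the function name `is_rash_statement` are assumptions added so that statements such as “no rash is observed” are not counted.

```python
import re

# Keywords quoted in the text; matching is case-insensitive.
INCLUDE = re.compile(r"skin lesion|rash|urticaria|eruption", re.IGNORECASE)
EXCLUDE = re.compile(r"diaper rash|abrasion|lacerated", re.IGNORECASE)
# Assumption: a simple negation filter, not described in the source.
NEGATED = re.compile(r"\bno\b|\bnot\b", re.IGNORECASE)

def is_rash_statement(text):
    """Return True if a statement indicates a skin rash relevant to CADR labeling."""
    if not INCLUDE.search(text):
        return False
    if EXCLUDE.search(text) or NEGATED.search(text):
        return False
    return True
```

A rule set of this kind is transparent and easy for clinicians to review, which matches the manual confirmation step by allergists, but it will miss phrasings that the patterns do not anticipate.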
Prediction timepoint
To train the CADR prediction model in a practical scenario, we considered the most recent inpatient visit where antibiotic treatment was provided, and the prediction time point was set as the antibiotic prescription immediately preceding the CADR occurrence. The reasons for the prediction timepoint definition are as follows: First, we considered predicting the first CADR for each patient, but the prediction performance deteriorated because of fewer records. Second, we attempted to rule out potential bias due to the conditions for the prediction timepoint. Accordingly, we designed the model to adapt to various situations, whether patient history records are abundant or not. We also considered three additional scenarios to validate the CADR prediction model: predicting the immediate and delayed CADRs at the initiation of the antibiotic treatment and predicting the delayed CADRs just before the CADR occurrence. The CADR prediction process and timepoints are presented in Fig. 4.
a CDM-based EHR data and nursing records (nursing reports and statements) were merged and used to construct patient trajectories. Patient records until the last antibiotics prescription before CADR occurrence were used to predict CADR within 1 day. b Prediction timepoint for immediate CADRs. c Prediction timepoints for delayed CADRs. Prediction timepoints 1 and 2 involved the use of the records until the first antibiotic prescription and the last antibiotic prescription before CADR occurrence.
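As a rough sketch of how the prediction timepoints could be derived for one patient, the snippet below selects the first and last antibiotic prescriptions before the CADR and truncates the trajectory at the chosen timepoint. The function names and the `(timestamp, code)` record layout are hypothetical; they are not taken from the study code.

```python
from datetime import datetime

def prediction_timepoints(antibiotic_times, cadr_time):
    """Return the first and last antibiotic prescriptions preceding the CADR.

    "first" corresponds to prediction at the initiation of antibiotic treatment,
    "last" to the prescription immediately preceding the CADR occurrence.
    """
    before = sorted(t for t in antibiotic_times if t < cadr_time)
    if not before:
        return None  # no antibiotic exposure before the reaction
    return {"first": before[0], "last": before[-1]}

def truncate(records, timepoint):
    """Keep only (timestamp, code) records observed up to the prediction timepoint."""
    return [(t, code) for t, code in records if t <= timepoint]
```

Truncating at the timepoint ensures that no record made after the prediction moment, including the rash itself, leaks into the model input.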
Model training
To train the tree models as baselines, we transformed the patient trajectories into count-based one-dimensional vectors, referring to previous studies. Time binning counts the occurrences of a code in different time intervals. We used intervals of 0–1 day, 1–7 days, and 7+ days from the prediction timepoint. The count-based vector was then concatenated with age and sex information. Hyperparameter tuning was performed using grid search (Supplementary Table 7). Medical codes that were not in the SNUH vocabulary were omitted from the external validation. The tree models were trained and validated using the Scikit-learn package (version 1.0.2) in Python (version 3.8.10).
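The time-binned count representation can be sketched as below. The three intervals follow the text; the handling of the interval boundaries (a record exactly 1 or 7 days old) and the names `bin_label` and `count_vector` are assumptions made for illustration.

```python
from collections import Counter

BINS = ("0-1d", "1-7d", "7d+")

def bin_label(days_before):
    """Map days elapsed before the prediction timepoint to a time bin."""
    if days_before <= 1:   # assumption: boundary days fall in the earlier bin
        return "0-1d"
    if days_before <= 7:
        return "1-7d"
    return "7d+"

def count_vector(records, vocabulary, age, sex):
    """Build a flat feature vector from (code, days_before) records.

    vocabulary: ordered list of known codes; codes outside it are omitted,
    mirroring the treatment of non-SNUH codes in external validation.
    """
    known = set(vocabulary)
    counts = Counter()
    for code, days in records:
        if code in known:
            counts[(code, bin_label(days))] += 1
    vector = [counts[(code, b)] for code in vocabulary for b in BINS]
    return vector + [age, sex]
```

Each code contributes three columns, one per interval, so the vector length is three times the vocabulary size plus two for age and sex.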
In addition to tree-based machine learning models, we used several deep learning models, including RNN-based models (LSTM and GRU), modified RNN-based networks with attention mechanisms (RETAIN), and EHR FMs (BEHRT, CLMBR-t-base, and CDM-BERT). BEHRT is a basic BERT-based EHR foundation model, and CLMBR-t-base is an open-weight EHR foundation model that generates patient representations, which are subsequently finetuned by LightGBM51. CDM-BERT is a modified BEHRT with DE for the OMOP CDM format. All EHR FMs were based on the pretraining-finetuning framework. All deep learning models were based on the same embedding of patient records and shared the same embedding layers but had different subsequent layers. The embedding sizes of the token, age, segment, and domain were set to 50,000, 180, 1024, and 20, respectively. All deep learning models used six RNN or self-attention layers with a hidden dimension of 256. The BERT-based LM used eight attention heads. To train the CADR prediction models, we added a feed-forward neural network (FFNN), a hyperbolic tangent (Tanh) activation function, dropout, and another FFNN to the representation of the first token to yield a vector of size two, and binary cross-entropy was used as the loss function. We calculated the AUROC values for every 1000 batches, and early stopping was implemented if the AUROC did not improve for 20 cycles. To address class imbalance, we randomly oversampled the patients with CADRs to one-tenth of those without CADRs. The batch size, learning rate, and dropout rate were 16, 5e–5, and 0.1, respectively. We used the pretrained models from our previous study and confirmed that the hold-out test dataset was entirely independent of the pretraining dataset. The hyperparameter tuning for LightGBM in CLMBR-t-base was performed using the same grid search approach as described for random forest and GBM (Supplementary Table 7).
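The oversampling step can be illustrated with a short sketch. It assumes sampling with replacement from the CADR patients until they number one-tenth of the non-CADR patients; the text does not specify the sampling scheme, and the function name `oversample` and the fixed seed are hypothetical.

```python
import random

def oversample(positives, negatives, ratio=0.1, seed=0):
    """Randomly duplicate positive samples until they reach ratio * len(negatives).

    Assumption: duplicates are drawn with replacement; the originals are kept.
    """
    rng = random.Random(seed)
    target = int(len(negatives) * ratio)
    extra = max(0, target - len(positives))
    return positives + rng.choices(positives, k=extra), negatives
```

With roughly 1% incidence, this raises the share of CADR patients in the training data to about 1 in 11, which is applied to the training split only so that validation metrics reflect the true incidence.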
For the deep learning and BERT implementation, we used PyTorch (version 1.12.0) and the HuggingFace package (version 4.41.2)52 in Python (version 3.8.10). The code implementation for the model training was based on the GitHub repository (https://github.com/kicarussays/CDM-BERT).
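The early-stopping rule described above for the deep learning models (AUROC evaluated at fixed intervals, training halted after a set number of evaluations without improvement) can be sketched as a small helper. The class name and the short patience used in the demonstration are illustrative assumptions; the study reports a patience of 20 cycles.

```python
class EarlyStopping:
    """Signal a stop when the monitored score has not improved for `patience` consecutive evaluations."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0  # evaluations since the last improvement

    def step(self, score):
        """Record one evaluation; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


# Hypothetical validation AUROC values, one per evaluation cycle.
stopper = EarlyStopping(patience=3)
history = [0.70, 0.74, 0.73, 0.74, 0.72, 0.75]
stopped_at = next((i for i, auroc in enumerate(history) if stopper.step(auroc)), None)
print(stopped_at, stopper.best)  # 4 0.74
```

In a training loop, `step` would be called once per evaluation cycle, and the checkpoint with the best AUROC would be kept.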
Feature importance
We conducted two feature importance analyses: (1) comparing attention scores between CADR-related and unrelated diagnosis concepts, and (2) identifying the top 40 most important features. For the attention score comparison, we extracted and averaged attention scores for predefined concept groups. To identify important features, the calculation involved four steps. First, we averaged the attention matrices of the eight attention heads from the last self-attention layer of the LM, yielding a simplified 2048 by 2048 attention matrix \(A\). For each attention matrix \(A\), an attention score \(A_{ij}\) indicates how closely the \(i^{\text{th}}\) and \(j^{\text{th}}\) tokens are related53. Note that \(\sum_{j} A_{ij} = 1 \;\; \forall i \in (0, \ldots, 2048)\). Second, we averaged the simplified attention matrix across rows, resulting in a one-dimensional vector of length 2048 that represents the importance of each token. Third, for each patient, we identified the top 10% of tokens with the highest average attention scores. Finally, for each variable, we counted the number of patients who included that variable in the top 40 features. We used count-based feature importance to improve interpretability, instead of the trajectory attention vector proposed in our previous study.
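The four steps above can be sketched in plain Python on a toy example. Two readings are assumed here and are not confirmed by the source. First, "averaged across rows" is taken as the mean over the row index \(i\), since each row of \(A\) sums to 1 and a within-row mean would be identical for every token. Second, the final count uses each patient's top-10% token set from step three. All function names and the tiny 3-token matrices are hypothetical.

```python
from collections import Counter


def token_importance(head_matrices):
    """Steps 1-2: average attention over heads, then over rows, giving one score per token."""
    n_heads = len(head_matrices)
    n = len(head_matrices[0])
    # Step 1: element-wise mean over heads gives the simplified matrix A.
    avg = [[sum(h[i][j] for h in head_matrices) / n_heads for j in range(n)] for i in range(n)]
    # Step 2: mean over the row index i, i.e. the average attention token j receives.
    return [sum(avg[i][j] for i in range(n)) / n for j in range(n)]


def top_tokens(tokens, scores, fraction=0.10):
    """Step 3: distinct codes among the top `fraction` of tokens by score (at least one token)."""
    k = max(1, int(len(tokens) * fraction))
    ranked = sorted(range(len(tokens)), key=lambda j: scores[j], reverse=True)
    return {tokens[j] for j in ranked[:k]}


def count_patients(per_patient_top):
    """Step 4: for each code, count the patients whose top tokens include it."""
    counts = Counter()
    for codes in per_patient_top:
        counts.update(codes)
    return counts


# Toy attention from two heads over three tokens; every row sums to 1.
head1 = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.3, 0.6, 0.1]]
head2 = [[0.4, 0.5, 0.1], [0.3, 0.6, 0.1], [0.1, 0.8, 0.1]]
scores = token_importance([head1, head2])

patients = [
    ["cefazolin", "urticaria", "CRP"],
    ["vancomycin", "urticaria", "ALT"],
]
counts = count_patients(top_tokens(tokens, scores) for tokens in patients)
print(counts["urticaria"])  # 2
```

Counting patients rather than summing raw attention keeps the ranking from being dominated by a few patients with unusually concentrated attention.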
Statistical analysis
Characteristics (age, sex, and comorbidities) of the CADR and non-CADR groups were compared in all cohorts by calculating P-values, using Student’s t test for age and Fisher’s exact test for the other variables. We reported the AUROC, AUPRC, sensitivity, specificity, precision, and F1-score as performance metrics. The confidence intervals of the AUROC and AUPRC were calculated using DeLong’s method54, whereas those of the F1-score, sensitivity, specificity, and precision were calculated using Wilson’s method55. Statistical significance was set at α = 0.05. All statistical analyses were performed using Scikit-learn (version 1.0.2) in Python (version 3.8.10).
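For reference, the Wilson score interval for a binomial proportion such as sensitivity or specificity can be computed as in the sketch below. This is a generic textbook formulation, not code from the study; the function name and the example counts (80 true positives among 100 positives) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (the default z gives about 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical example: 80 of 100 CADR cases correctly flagged.
low, high = wilson_interval(80, 100)
print(f"sensitivity 0.80, 95% CI ({low:.3f}, {high:.3f})")  # (0.711, 0.867)
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and behaves well for proportions near 0 or 1, which matters for rare outcomes.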
Data availability
The raw data used in this study are not publicly available to preserve participant privacy. The data generated and analyzed during the study are available from the corresponding author upon reasonable request.
Code availability
Source code for the experiments is publicly available at https://github.com/kicarussays/CDM-BERT.
References
Edwards, I. R. & Aronson, J. K. Adverse drug reactions: Definitions, diagnosis, and management. Lancet 356, 1255–1259 (2000).
Lazarou, J., Pomeranz, B. H. & Corey, P. N. Incidence of adverse drug reactions in hospitalized patients: A meta-analysis of prospective studies. JAMA 279, 1200–1205 (1998).
Seo, B. et al. Incidence and economic burden of adverse drug reactions in hospitalization: A prospective study in Korea. J. Korean Med. Sci. 38, e56 (2023).
Gomes, E. R. & Demoly, P. Epidemiology of hypersensitivity drug reactions. Curr. Opin. Allergy Clin. Immunol. 5, 309–316 (2005).
Thong, B. Y., Leong, K. P., Tang, C. Y. & Chng, H. H. Drug allergy in a general hospital: Results of a novel prospective inpatient reporting system. Ann. Allergy Asthma Immunol. 90, 342–347 (2003).
Frey, N. et al. The epidemiology of Stevens-Johnson syndrome and toxic epidermal necrolysis in the UK. J. Invest. Dermatol. 137, 1240–1247 (2017).
Lee, E. Y., Knox, C. & Phillips, E. J. Worldwide prevalence of antibiotic-associated Stevens-Johnson syndrome and toxic epidermal necrolysis: A systematic review and meta-analysis. JAMA Dermatol 159, 384–392 (2023).
Singh, M. P. et al. Antimicrobial utilization in a paediatric intensive care unit in India: A step towards strengthening antimicrobial stewardship practices. PLOS ONE 19, e0310515 (2024).
Rossi, G., da Silva Cartell, A. & Marchiori Bakos, R. Dermoscopic aspects of cutaneous adverse drug reactions. Dermatol. Pract. Concept. 11, e2021136 (2021).
Finkelstein, Y., Macdonald, E. M., Li, P., Hutson, J. R. & Juurlink, D. N. Recurrence and mortality following severe cutaneous adverse reactions. JAMA 311, 2231–2232 (2014).
Duong, T. A., Valeyrie-Allanore, L., Wolkenstein, P. & Chosidow, O. Severe cutaneous adverse reactions to drugs. Lancet 390, 1996–2011 (2017).
Parasrampuria, S. & Henry, J. Hospitals’ use of electronic health records data, 2015–2017. ONC Data Brief. 46, 13 (2019).
Liang, J. et al. Adoption of electronic health records (EHRs) in China during the past 10 years: consecutive survey data analysis and comparison of Sino-American challenges and experiences. J. Med. Internet Res. 23, e24813 (2021).
Lee, K. et al. Digital health profile of South Korea: A cross sectional study. Int. J. Environ. Res. Public Health 19, 6329 (2022).
Huang, D., Cogill, S., Hsia, R. Y., Yang, S. & Kim, D. Development and external validation of a pretrained deep learning model for the prediction of non-accidental trauma. npj Digital Med. 6, 131 (2023).
Shang, J., Ma, T., Xiao, C. & Sun, J. In 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, 5953–5959 (International Joint Conferences on Artificial Intelligence, 2019).
Li, Y. et al. BEHRT: Transformer for electronic health records. Sci. Rep. 10, 7155 (2020).
Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Med. 4, 86 (2021).
Pang, C. et al. In Proceedings of Machine Learning for Health Vol. 158 (eds Roy Subhrajit et al.) 239–260 (PMLR, Proceedings of Machine Learning Research, 2021).
Li, Y. et al. Hi-BEHRT: Hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J. Biomed. Health Inform. 27, 1106–1117 (2023).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. In Proc. 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186 (2019).
Wornow, M., Thapa, R., Steinberg, E., Fries, J. & Shah, N. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. Adv. Neural Inf. Process. Syst. 36, 67125–67137 (2023).
Kim, J. et al. Pretrained patient trajectories for adverse drug event prediction using common data model-based electronic health records. Commun. Med. 5, 232 (2025).
Fiszenson-Albala, F. et al. A 6-month prospective survey of cutaneous drug reactions in a hospital setting. Br. J. Dermatol 149, 1018–1022 (2003).
Hernández-Salazar, A. et al. Epidemiology of adverse cutaneous drug reactions. A prospective study in hospitalized patients. Arch. Med. Res. 37, 899–902 (2006).
Park, C. S. et al. The use of an electronic medical record system for mandatory reporting of drug hypersensitivity reactions has been shown to improve the management of patients in the university hospital in Korea. Pharmacoepidemiol Drug Saf. 17, 919–925 (2008).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
Graves, A. in Supervised Sequence Labelling with Recurrent Neural Networks (ed Graves A.) 37–45 (Springer Berlin Heidelberg, 2012).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
Choi, E. et al. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. Advances in neural information processing systems 29 https://doi.org/10.48550/arXiv.1608.05745 (2016).
Guo, L. L. et al. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci. Rep. 13, 3767 (2023).
Vovk, V. Cross-conformal predictors. Ann. Math. Artif. Intell. 74, 9–28 (2015).
Sun, J. et al. Applying Mondrian cross-conformal prediction to estimate prediction confidence on large imbalanced bioactivity data sets. J. Chem. Inf. Model. 57, 1591–1598 (2017).
Ding, T., Angelopoulos, A., Bates, S., Jordan, M. & Tibshirani, R. J. Class-conditional conformal prediction with many classes. Adv. Neural Inf. Process. Syst. 36, 64555–64576 (2023).
Zhao, J., Henriksson, A., Asker, L. & Boström, H. Predictive modeling of structured electronic health records for adverse drug event detection. BMC Med Inf. Decis. Mak. 15, S1 (2015).
Bampa, M. & Papapetrou, P. Aggregate-eliminate-predict: Detecting adverse drug events from heterogeneous electronic health records. ArXiv abs/1907.06058 (2019).
Khan, D. A. et al. Drug allergy: A 2022 practice parameter update. J. Allergy Clin. Immunol. 150, 1333–1393 (2022).
Broyles, A. D. et al. Practical guidance for the evaluation and management of drug hypersensitivity: Specific drugs. J. Allergy Clin. Immunol. Pract. 8, S16–S116 (2020).
Doña, I. et al. An algorithm for the diagnosis of beta-lactam allergy, 2024 update. Allergy 80, 633–637 (2025).
Yoon, S. Y. et al. Validation of the cephalosporin intradermal skin test for predicting immediate hypersensitivity: A prospective study with drug challenge. Allergy 68, 938–944 (2013).
Kwon, J. W. et al. Results of intradermal skin testing with cefazolin according to a history of hypersensitivity to antibiotics. J. Korean Med. Sci. 34, e319 (2019).
Sousa-Pinto, B. et al. Accuracy of penicillin allergy diagnostic tests: A systematic review and meta-analysis. J. Allergy Clin. Immunol. 147, 296–308 (2021).
Kardaun, S. H. et al. Drug reaction with eosinophilia and systemic symptoms (DRESS): an original multisystem adverse drug reaction. Results from the prospective RegiSCAR study. Br. J. Dermatol 169, 1071–1080 (2013).
Broyles, A. D., Banerji, A. & Castells, M. Practical guidance for the evaluation and management of drug hypersensitivity: General concepts. J. Allergy Clin. Immunol. Pract. 8, S3–S15 (2020).
Brockow, K. et al. EAACI position paper on how to classify cutaneous manifestations of drug hypersensitivity. Allergy 74, 14–27 (2019).
Biedermann, P. et al. Standardizing registry data to the OMOP Common Data Model: experience from three pulmonary hypertension databases. BMC Med. Res. Methodol. 21, 238 (2021).
Benson, T. Principles of health interoperability HL7 and SNOMED. (Springer Science & Business Media, 2012).
Liu, S., Wei, M., Moore, R., Ganesan, V. & Nelson, S. RxNorm: prescription for electronic drug information exchange. IT Professional 7, 17–23 (2005).
McDonald, C. J. et al. LOINC, a universal standard for identifying laboratory observations: A 5-year update. Clin. Chem. 49, 624–633 (2003).
Ke, G. et al. in Proceedings of the 31st International Conference on Neural Information Processing Systems 3149–3157 (Curran Associates Inc., Long Beach, California, USA, 2017).
Wolf, T. et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 https://doi.org/10.48550/arXiv.1910.03771 (2019).
Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44, 837–845 (1988).
Wilson, E. B. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212 (1927).
Acknowledgements
This work was supported by the Korea Institute of Drug Safety & Risk Management and the AI Institute at Seoul National University. S.H.K. receives funding from the Korea Institute of Drug Safety & Risk Management, and J.K. is supported by the fellowship program of the AI Institute at Seoul National University. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Author information
Contributions
S.H.K. and J.K. contributed to the conceptualization and design of the study. J.K. handled and mainly analyzed the research data, and all authors interpreted the results. J.K. constructed the machine learning and deep learning models and conducted statistical analysis. J.K. and K.K. wrote the original draft of the paper. S.H.K. and K.K. reviewed the clinical evidence of the study. K.S.K., S.Y., and M.G.K. provided the data, and J.K. verified the quality of the data. J.K., S.H.K., K.S.K., M.G.K., and S.Y. had full access to the raw data. All authors had the final responsibility to submit for publication.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kim, J., Kim, K., Yun, JE. et al. Prediction of antibiotic-associated cutaneous adverse drug reactions using electronic health record foundation models. npj Digit. Med. 9, 311 (2026). https://doi.org/10.1038/s41746-026-02503-x
DOI: https://doi.org/10.1038/s41746-026-02503-x






