BMC Bioinformatics. 2019 Aug 22;20(1):437.
doi: 10.1186/s12859-019-3028-6.

Influence of batch effect correction methods on drug induced differential gene expression profiles

Wei Zhou et al. BMC Bioinformatics.

Abstract

Background: Most studies of computational drug repositioning based on gene expression signatures have not accounted for batch effects, and it is unknown how batch effect removal methods affect the results of signature-based drug repositioning. We conducted differential expression analyses on the Connectivity Map (CMAP) database using several batch effect correction methods, both to evaluate their influence on computational drug repositioning with microarray data and to compare the methods with one another.

Results: Average signature size differed across the methods applied. The gene signatures identified by the Latent Effect Adjustment after Primary Projection (LEAPP) method and by the methods fitted with the Linear Models for Microarray Data (limma) software showed little agreement. The external validity of the gene signatures was evaluated by connectivity mapping between the CMAP database and the Library of Integrated Network-based Cellular Signatures (LINCS) database. The connectivity mapping results indicate that the identified genes were not reliable for drugs with a total sample size (drug + control samples) smaller than 40, irrespective of the batch effect correction method applied. With a total sample size larger than 40, the methods correcting for batch effects produced significantly better results than the method with no batch effect correction. In a simulation study, power was generally low for simulated data with a sample size smaller than 40. The best performance was observed with the limma method correcting for two principal components.

Conclusion: When the sample size is large enough to contain sufficient information, batch effect correction methods strongly affect differential gene expression analysis and thus the downstream drug repositioning. We recommend including two or three principal components as covariates when fitting models with limma, provided the sample size is sufficient (more than 40 drug and control samples combined).
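The recommendation above, adding the largest principal components as covariates in a per-gene linear model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses ordinary least squares in NumPy rather than limma's moderated statistics, the function names `top_pcs` and `treatment_t_stats` are hypothetical, and the expression data are simulated here purely for demonstration.

```python
import numpy as np


def top_pcs(expr, k):
    """Scores of the k largest principal components, shape (samples, k).

    expr is a genes x samples expression matrix.
    """
    centered = expr - expr.mean(axis=1, keepdims=True)  # center each gene
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return (vt[:k] * s[:k, None]).T


def treatment_t_stats(expr, is_drug, k=2):
    """Per-gene t-statistic for drug vs control, adjusting for k PCs.

    Plain OLS (an assumption of this sketch); limma would additionally
    shrink the gene-wise variances with an empirical Bayes step.
    """
    n_samples = expr.shape[1]
    design = np.column_stack(
        [np.ones(n_samples), np.asarray(is_drug, dtype=float), top_pcs(expr, k)]
    )
    # Fit every gene at once: design @ beta approximates expr.T
    beta, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    resid = expr.T - design @ beta
    dof = n_samples - design.shape[1]
    sigma2 = (resid ** 2).sum(axis=0) / dof
    # Standard error of the treatment coefficient (column 1 of the design)
    se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[1, 1])
    return beta[1] / se


# Simulated example (made-up data): 200 genes, 40 samples, two batches,
# a batch shift on every gene and a drug effect on the first 3 genes.
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 40
batch = np.repeat([-1.0, 1.0], 20)
is_drug = np.tile([0, 1], 20)  # drug and control balanced within each batch
expr = rng.normal(size=(n_genes, n_samples))
expr += rng.normal(size=(n_genes, 1)) * batch
expr[:3] += 1.5 * is_drug
t_stats = treatment_t_stats(expr, is_drug, k=2)
```

One caveat of this design: if the drug effect itself dominates the variance, a leading principal component can capture it and absorb the signal, so the number of components is a tuning choice rather than a fixed rule.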

Keywords: Batch effect; Drug repositioning; Microarray.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Overall workflow of the study. a, workflow of real data analysis. CMAP and LINCS datasets are analyzed by principal component analysis, followed by differential expression analysis with several batch effect correction methods, which were then evaluated by connectivity mapping (the procedure of connectivity mapping is illustrated by Additional file 2: Figure S2); b, workflow of simulation analysis. Expression data were simulated from CMAP dataset, and the optimal number of largest principal components being corrected for was assessed
Fig. 2
Summary plots of sample sizes in CMAP dataset. a, total (drug + control) sample size distribution in CMAP dataset. b, scatter plot of the relationship between control sample size and drug sample size for CMAP dataset. Note: the total number of drugs in CMAP dataset is 1309. In plot b, the drug trichostatin A (128 drug samples and 709 control samples) was not plotted because its particularly large sample size would prevent a zoomed-in view of the other drugs
Fig. 3
Results of principal component analysis on expression matrices for CMAP dataset. a, Median variance accounted for by the four largest principal components grouped by total sample size. b-e, Score plots of the first two principal components for four typical drugs; colors indicate batch (plate id) and shapes indicate drug or control status
Fig. 4
Results of differential expression analysis on CMAP dataset. a, percentage of drugs having signature size greater than or equal to 10 for each gene expression analysis method plotted against FDR cutoff. b, average signature size resulting from each gene expression analysis method plotted against FDR cutoff; the y-axis was transformed to log-10 scale
Fig. 5
Results of connectivity score analysis with a fixed number of 15 genes with lowest FDR. a, Boxplot of the ranks of the same drug in connectivity mapping between CMAP and LINCS dataset. b, The proportion of drugs having the same drug ranked within top 3 in connectivity mapping between shared genes of CMAP and LINCS dataset. The x-axes are grouped by the total sample size in CMAP dataset. The colors indicate the differential gene expression analysis methods
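
The top-3 summary described for Fig. 5b can be illustrated with a short sketch. It assumes each drug is represented by a pair (total CMAP sample size, rank of the same drug in the LINCS connectivity mapping); the function names, the cutoff of 40 as the only grouping boundary, the handling of a size of exactly 40, and all values in `records` are hypothetical and chosen only for illustration.

```python
def top_k_proportion(ranks, k=3):
    """Fraction of drugs whose same-drug rank is within the top k."""
    if not ranks:
        return float("nan")
    return sum(rank <= k for rank in ranks) / len(ranks)


def proportion_by_sample_size(records, cutoff=40, k=3):
    """Split drugs by total sample size and summarize each group.

    records: iterable of (total_sample_size, same_drug_rank) pairs.
    Sizes equal to the cutoff go to the "large" group (an assumption).
    """
    small = [rank for size, rank in records if size < cutoff]
    large = [rank for size, rank in records if size >= cutoff]
    return {
        "small": top_k_proportion(small, k),
        "large": top_k_proportion(large, k),
    }


# Made-up example: three drugs below the cutoff, three at or above it.
records = [(12, 57), (18, 2), (25, 140), (44, 1), (60, 3), (90, 8)]
result = proportion_by_sample_size(records)
```

In this toy input, one of the three small-sample drugs and two of the three large-sample drugs rank within the top 3.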
