Fasterius E, Al-Khalili Szigyarto C
Sci Rep 8 (1) 11226 [2018-07-25; online 2018-07-25]
Meta-analysis of datasets available in public repositories are used to gather and summarise experiments performed across laboratories, as well as to explore consistency of scientific findings. As data quality and biological equivalency across samples may obscure such analyses and consequently their conclusions, we investigated the comparability of 85 public RNA-seq cell line datasets. Thousands of pairwise comparisons of single nucleotide variants in 139 samples revealed variable genetic heterogeneity of the eight cell line populations analysed as well as variable data quality. The H9 and HCT116 cell lines were found to be remarkably stable across laboratories (with median concordances of 99.2% and 98.5%, respectively), in contrast to the highly variable HeLa cells (89.3%). We show that the genetic heterogeneity encountered greatly affects gene expression between same-cell comparisons, highlighting the importance of interrogating the biological equivalency of samples when comparing experimental datasets. Both the number of differentially expressed genes and the expression levels negatively correlate with the genetic heterogeneity. Finally, we demonstrate how comparing genetically heterogeneous datasets affect gene expression analyses and that high dissimilarity between same-cell datasets alters the expression of more than 300 cancer-related genes, which are often the focus of studies using cell lines.