|
Post by djoser-xyyman on May 10, 2019 5:19:35 GMT -5
That suspicious 1000 Genomes dataset...I told you so
Legacy Data Confounds Genomics Studies. Luke Anderson-Trocmé et al.
Abstract: Recent reports have identified differences in the mutational spectra across human populations. While some of these reports have been replicated in other cohorts, MOST have been reported **ONLY** in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data is used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and suspicious GWAS associations. Lower-quality data from the early phases of the 1kGP thus continues to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.
xyyman comment: Remember, I was always suspicious of the 1KG dataset. I prefer HapMap. These Europeans have been skewing datasets from the outset.
For those who don't know, the choice of dataset can skew results. That is why my dataset shows that the ABUSIR (Edit) is closer to SSA than the initial study shows.
|
|
|
Post by djoser-xyyman on May 10, 2019 5:24:39 GMT -5
What is the study telling us? We cannot trust research that uses the 1000G dataset, like the Abusir study and most others. The authors are calling for this legacy data to be retired or upgraded. Remember, D. Shriner showed that, depending on the reference used, the Natufians can be as much as 29% African, while the original paper has the Natufians carrying 0% African DNA.
Europeans have been cheating from the get-go.
That is why I prefer NOT to use the 1KG dataset in my ADMIXTURE charts. Remember, 1KG, Simons, HapMap, etc. use different SNPs in their datasets, which gives different results.
|
|
|
Post by zarahan on May 11, 2019 0:18:53 GMT -5
Hmm, something to keep in mind. They sure have been cheating. Like the CRANID database, which some researchers use as "representative" of ancient Egypt, using samples drawn from cemeteries in the far north, near the Mediterranean, and downplaying or excluding the historic south.
|
|
|
Post by djoser-xyyman on May 11, 2019 5:51:17 GMT -5
quote:
"Batch Effects in Aging Reference Cohort Data
The last 5 years have seen a drastic increase in the amount and quality of human genome sequence data. Reference cohorts such as the International HapMap Project (International HapMap Consortium, 2005), the 1000 Genomes Project (1kGP) (1000 Genomes Project Consortium, 2010, 2012; Consortium et al., 2015), and the Simons Diversity project (Mallick et al., 2016), for example, have made thousands of genome sequences publicly available for population and medical genetic analyses. Many more genomes are available indirectly through servers providing imputation services (McCarthy et al., 2016) or summary statistics for variant frequency estimation (Lek et al., 2016). The first genomes in the 1kGP were sequenced 10 years ago (van Dijk et al., 2014). Since then, sequencing platforms have rapidly improved. The second phase of the 1kGP implemented multiple technological and analytical improvements over its earlier phases (1000 Genomes Project Consortium, 2012; Consortium et al., 2015), leading to heterogeneous sample preparations and data quality over the course of the project. Yet, because of the extraordinary value of freely available data, early data from the 1kGP is still widely used to impute untyped variants, to estimate allele frequencies, and to answer a wide range of medical and evolutionary questions. This raises the question of whether and how such legacy data should be included in contemporary analyses alongside more recent cohorts. Here we point out how large and previously unreported batch effects in the early phases of the 1kGP still lead to incorrect genetic conclusions through population genetic analyses and spurious GWAS associations as a result of imputation using the 1kGP as a reference."
from above quote:
the 1kGP still lead to incorrect genetic conclusions
|
|
|
Post by djoser-xyyman on May 11, 2019 6:01:17 GMT -5
Is it a conspiracy??? Is there collusion? Here is what the authors have to say:
Quote: Conclusion
On a technical front, we were surprised that strong association between variants and technical covariates in the 1kGP project had not been identified before. The genome-wide logistic regression analysis of genotype on quality metric is straightforward, and should probably be a standard in a variety of genomic studies. The logistic factor analysis is more computationally demanding but produces more robust results (Song et al., 2015). Both approaches produce comparable results. More generally, to improve the quality of genomic reference datasets, we can proceed by addition of new and better data and by better curation of existing data. Given rapid technological progress, the focus of genomic research is naturally on the data generation side. However, cleaning up existing databases is also important to avoid generating spurious results. The present findings suggest that a substantial fraction of data from the final release of the 1kGP project is overdue for retirement or re-sequencing.
a substantial fraction of data from the final release of the 1kGP project is overdue for retirement
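For anyone wondering what the "logistic regression of genotype on quality metric" in that conclusion might look like in practice, here is a minimal sketch. It is not the authors' code. It assumes one standardized quality score per sample and a 0/1 carrier status per variant, and it runs on simulated data only. The names `fit_logistic`, `association_z`, `quality`, `artifact` and `clean` are made up for illustration. The idea is that a variant whose calls track sample quality gets a large z-score, which flags it as a possible batch artifact.

```python
import math
import random


def fit_logistic(x, y, iters=50):
    """Fit y ~ sigmoid(b0 + b1*x) by Newton-Raphson; return (b0, b1, se_b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate the gradient (g) and information matrix (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system h * delta = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    # Standard error of the slope from the inverse information matrix.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


def association_z(quality, carrier):
    """Wald z-score for the slope of carrier status on the quality metric."""
    _, b1, se_b1 = fit_logistic(quality, carrier)
    return b1 / se_b1


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Simulated data only (hypothetical, NOT 1kGP): 500 samples, standardized quality score.
random.seed(0)
quality = [random.gauss(0.0, 1.0) for _ in range(500)]
# "artifact" variant: lower quality makes a (spurious) call more likely.
artifact = [1 if random.random() < sigmoid(-1.0 - 1.5 * q) else 0 for q in quality]
# "clean" variant: carrier status is independent of quality.
clean = [1 if random.random() < 0.3 else 0 for _ in quality]

print("artifact variant z =", round(association_z(quality, artifact), 2))
print("clean variant z    =", round(association_z(quality, clean), 2))
```

In a real analysis this test would be repeated for every variant genome-wide, with a multiple-testing threshold, and would probably use an established statistics library instead of a hand-written fit.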
|
|