That suspicious 1000 Genomes dataset...I told you so

djoser-xyyman
Vizier

Without data you are just another person with an opinion - Deming

Posts: 3,268


Post by djoser-xyyman on May 10, 2019 5:19:35 GMT -5

Legacy Data Confounds Genomics Studies
Luke Anderson-Trocmé

Abstract
Recent reports have identified differences in the mutational spectra across human
populations. While some of these reports have been replicated in other cohorts, MOST have been
reported **ONLY** in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative
population stratification within the Japanese population, we identified a previously unreported
batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population
stratification. Because the 1kGP data is used extensively, we find that the batch effects also lead to
incorrect imputation by leading imputation servers and suspicious GWAS associations.
Lower-quality data from the early phases of the 1kGP thus continues to contaminate modern
studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.

xyyman comment: Remember, I was always suspicious of the 1KG dataset. I prefer HapMap. These Europeans have been skewing datasets from the outset.

For those who don't know, the choice of dataset can skew results. That is why my dataset shows that the ABUSIR(Edit) is closer to SSA than the initial study shows.

Last Edit: May 10, 2019 12:59:07 GMT -5 by djoser-xyyman

Post by djoser-xyyman on May 10, 2019 5:24:39 GMT -5

What is the study telling us? We cannot trust research that uses the 1000G dataset, like the Abusir study and most others. The author is calling for a new dataset to be used in studies. Remember, D. Shriner showed that, depending on the reference used, the Natufians can be as much as 29% African, while the original paper has the Natufians carrying 0% African DNA.

Europeans have been cheating from the get-go.

That is why I prefer NOT to use the 1KG dataset in my ADMIXTURE charts. Remember, 1KG, Simons, HapMap, etc. use different SNPs in their datasets, giving different results.

Last Edit: May 10, 2019 13:01:39 GMT -5 by djoser-xyyman

zarahan
Nomarch

Global Moderator

Posts: 2,098

Post by zarahan on May 11, 2019 0:18:53 GMT -5

Hmm, something to keep in mind. They sure have been cheating, like with the CRANID database
that some researchers use as "representative" of ancient Egypt, using samples drawn from
cemeteries in the far north, near the Mediterranean, and downplaying or excluding the
historic south.

Last Edit: May 11, 2019 0:19:05 GMT -5 by zarahan

Note: I am not an "Egyptologist" as claimed by some still bitter, defeated, trolls creating fake profiles and posts elsewhere. You still fail..

Post by djoser-xyyman on May 11, 2019 5:51:17 GMT -5

quote:

"Batch Effects in Aging Reference Cohort Data

The last 5 years have seen a drastic increase in the amount and quality of human genome sequence
data. Reference cohorts such as the International HapMap Project (International HapMap Consortium, 2005), the 1000 Genomes Project (1kGP) (1000 Genomes Project Consortium, 2010, 2012;
Consortium et al., 2015), and the Simons Diversity project (Mallick et al., 2016), for example, have
made thousands of genome sequences publicly available for population and medical genetic analyses. Many more genomes are available indirectly through servers providing imputation services
(McCarthy et al., 2016) or summary statistics for variant frequency estimation (Lek et al., 2016).
The first genomes in the 1kGP were sequenced 10 years ago (van Dijk et al., 2014). Since
then, sequencing platforms have rapidly improved. The second phase of the 1kGP implemented
multiple technological and analytical improvements over its earlier phases (1000 Genomes Project
Consortium, 2012; Consortium et al., 2015), leading to heterogeneous sample preparations and
data quality over the course of the project.
Yet, because of the extraordinary value of freely available data, early data from the 1kGP is still
widely used to impute untyped variants, to estimate allele frequencies, and to answer a wide range
of medical and evolutionary questions. This raises the question of whether and how such legacy
data should be included in contemporary analyses alongside more recent cohorts. Here we point
out how large and previously unreported batch effects in the early phases of the 1kGP still lead to
incorrect genetic conclusions through population genetic analyses and spurious GWAS associations
as a result of imputation using the 1kGP as a reference."

from above quote:

the 1kGP still lead to incorrect genetic conclusions

Last Edit: May 11, 2019 5:53:59 GMT -5 by djoser-xyyman

Post by djoser-xyyman on May 11, 2019 6:01:17 GMT -5

Is it a conspiracy? Is there collusion? Here is what the authors have to say:

Quote:
Conclusion

On a technical front, we were surprised that strong association between variants and technical
covariates in the 1kGP project had not been identified before. The genome-wide logistic regression
analysis of genotype on quality metric is straightforward, and should probably be a standard in
a variety of genomic studies. The logistic factor analysis is more computationally demanding but
produces more robust results (Song et al., 2015). Both approaches produce comparable results.
More generally, to improve the quality of genomic reference datasets, we can proceed by
addition of new and better data and by better curation of existing data. Given rapid technological
progress, the focus of genomic research is naturally on the data generation side. However, cleaning
up existing databases is also important to avoid generating spurious results. The present findings
suggest that a substantial fraction of data from the final release of the 1kGP project is overdue for
retirement or re-sequencing.

a substantial fraction of data from the final release of the 1kGP project is overdue for
retirement
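
For anyone wondering what the "logistic regression analysis of genotype on quality metric" in the quoted conclusion might look like in practice, here is a minimal sketch. It is NOT the authors' code. The data are simulated, the per-sample quality score is assumed to be standardized, the Wald z-score is my own choice of test statistic, and every name (fit_logistic, wald_z, batch_variant, null_variant) is hypothetical. The idea is only that a variant whose carriers cluster among low-quality samples gets a large z-score, while a real variant unrelated to quality does not.

```python
import math
import random


def fit_logistic(x, y, n_iter=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson.

    Returns the slope b1 and its approximate standard error.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        # Gradient (g) and Fisher information (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^-1 * g, written out for the 2x2 case.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Var(b1) is the slope entry of H^-1, taken from the last iteration.
    return b1, math.sqrt(h00 / det)


def wald_z(quality, carrier):
    """Wald z-score for the association of carrier status with quality."""
    b1, se = fit_logistic(quality, carrier)
    return b1 / se


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Simulated cohort (assumption): 500 samples with a standardized quality score.
rng = random.Random(0)
quality = [rng.gauss(0.0, 1.0) for _ in range(500)]

# A "batch-effect" variant: called more often in low-quality samples.
batch_variant = [1 if rng.random() < sigmoid(-1.0 - 1.5 * q) else 0
                 for q in quality]
# A control variant: carrier status unrelated to quality.
null_variant = [1 if rng.random() < 0.3 else 0 for _ in quality]

z_batch = wald_z(quality, batch_variant)
z_null = wald_z(quality, null_variant)
print("batch-effect variant z:", round(z_batch, 2))
print("control variant z:", round(z_null, 2))
```

In a real genome-wide scan you would run the same regression for every variant and flag those with extreme z-scores for a closer look. A library such as statsmodels would normally replace the hand-written fit; it is spelled out here only so the sketch runs without extra packages.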

Last Edit: May 11, 2019 14:19:19 GMT -5 by djoser-xyyman
