Post by djoser-xyyman on Dec 13, 2018 11:37:05 GMT -5
Re-Thinking the human Reference Genome and other lies
The presence and impact of reference bias on population genomic studies of prehistoric human populations -
Torsten Günther & Carl Nettelblad
December 5, 2018
Abstract
High quality reference genomes are an important resource in genomic research projects. A
consequence is that DNA fragments carrying the reference allele will be more likely to map successfully,
or receive higher quality scores. This reference bias can have effects on downstream
population genomic analysis when heterozygous sites are falsely considered homozygous for
the reference allele.
In palaeogenomic studies of human populations, mapping against the human reference
genome is used to identify endogenous human sequences. Ancient DNA studies usually operate
with low sequencing coverages and fragmentation of DNA molecules causes a large proportion
of the sequenced fragments to be shorter than 50 bp – reducing the amount of accepted
mismatches, and increasing the probability of multiple matching sites in the genome. These
ancient DNA specific properties are potentially exacerbating the impact of reference bias on
downstream analyses, especially since most studies of ancient human populations use pseudohaploid
data, i.e. they randomly sample only one sequencing read per site.
We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric
humans with some differences between individual genomic regions. We illustrate that
the strength of reference bias is negatively correlated with fragment length. Reference bias
can cause differences in the results of downstream analyses such as population affinities, heterozygosity
estimates and estimates of archaic ancestry. These spurious results highlight how
important it is to be aware of these technical artifacts and that we need strategies to mitigate
the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference
bias which help to reduce its impact substantially.
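To make the mechanism in the abstract concrete, here is a minimal sketch (not the authors' pipeline) of how unequal mapping success at a truly heterozygous site skews pseudohaploid calls toward the reference allele. The mapping probabilities `p_map_ref` and `p_map_alt` and both function names are hypothetical, chosen only for illustration.

```python
import random


def expected_ref_fraction(p_map_ref: float, p_map_alt: float) -> float:
    """Expected share of reference-allele calls at a true heterozygous site.

    Assumes reads carry either allele with probability 0.5 before mapping,
    and that one mapped read is then sampled at random (pseudohaploid call).
    """
    return p_map_ref / (p_map_ref + p_map_alt)


def simulate_pseudohaploid_calls(n_sites, depth, p_map_ref, p_map_alt, seed=0):
    """Simulate pseudohaploid calls at heterozygous sites; return the ref fraction."""
    rng = random.Random(seed)
    ref_calls = 0
    called = 0
    for _ in range(n_sites):
        mapped = []
        for _ in range(depth):
            # Each sequenced fragment carries ref or alt with equal probability.
            allele = "ref" if rng.random() < 0.5 else "alt"
            # Hypothetical assumption: alt-carrying fragments map less often.
            p_map = p_map_ref if allele == "ref" else p_map_alt
            if rng.random() < p_map:
                mapped.append(allele)
        if mapped:
            called += 1
            # Pseudohaploid call: keep one randomly chosen mapped read.
            if rng.choice(mapped) == "ref":
                ref_calls += 1
    return ref_calls / called if called else float("nan")


if __name__ == "__main__":
    print(expected_ref_fraction(0.9, 0.6))
    print(simulate_pseudohaploid_calls(20000, 2, 0.9, 0.6))
```

With equal mapping probabilities the expected fraction is 0.5; any advantage for reference-carrying fragments pushes it above 0.5, which is how a heterozygous site can end up looking homozygous for the reference allele.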
Quote:
"In general, we observe a deviation from zero in most cases highlighting the effect of reference bias
on these statistics (Figure 3). Surprisingly, the directions of this bias differ between the HO data
and the SGDP data, which suggests that different reference data sets are also affected by reference
bias at different degrees. This represents a potential batch effect which also needs to be considered
when merging different reference data sets. Affinities to populations of different geographic origin
vary in their sensitivity to reference bias but little general trends are observable. Western Eurasian
populations show a strong deviation from 0 in all tests. Notably, African populations show the
strongest deviation in the short versus long comparison in the SGDP data set while they exhibit
almost no bias in the same comparison using the HO data. As the biases do not seem to show
a consistent tendency, we cannot directly conclude that recent ancient DNA papers have been
systematically biased in some direction. The shifts appear to be dataset and test specific so some
results could still be driven by spurious affinities due to reference bias."
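For readers unfamiliar with the "deviation from zero" language, here is a rough sketch of a D statistic in the standard four-population form of Patterson et al. (2012). It is not necessarily the exact test configuration used in the paper, and the function name and toy allele frequencies are assumptions for illustration.

```python
def d_statistic(w, x, y, z):
    """Four-population D statistic from per-site allele frequencies.

    D = sum((w - x) * (y - z)) / sum((w + x - 2wx) * (y + z - 2yz))

    Each argument is a sequence of allele frequencies (0..1) for one
    population, aligned by site. For pseudohaploid data the values are 0 or 1.
    """
    numerator = 0.0
    denominator = 0.0
    for wi, xi, yi, zi in zip(w, x, y, z):
        numerator += (wi - xi) * (yi - zi)
        denominator += (wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
    if denominator == 0:
        raise ValueError("no informative sites")
    return numerator / denominator


if __name__ == "__main__":
    # Toy data: two sites supporting one pairing, one supporting the other.
    print(d_statistic([1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]))
```

Under a symmetric tree the two site patterns are expected to balance and D sits near zero. A systematic excess of reference alleles in one of the compared data sets can tip that balance without any real difference in ancestry, which is the concern raised in the quote.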
Quote:
"The human reference genome sequence is a mosaic of the genomes of different individuals. The
geographic origin of the specific segments should have an impact on the population genetic affinities
as the reference allele will more likely be found in specific geographic regions. We obtained
information on the local ancestry of the human reference genome from Green et al. (2010). According
to this estimate 15.6 % of the reference genome can be assigned to African, 5.0 % to East
Asian and **30.0 % to European origin** while the origin for 49.4 % is **uncertain**. We re-calculate
D statistics for the different parts of the genome separately, restricting the analysis to the SGDP
data. The impact of reference bias differs between the different ancestries (Figure 4). Generally
reference bias is weakest for reference segments of African origin. Notably, African populations
show the strongest deviations from 0 in this case. Sequences mapping to the European segments of
the reference show a strong reference bias with slight differences between continental populations.
Reference bias at the East Asian segments of the reference genome seems intermediate but the D
statistics also show large variation which may be due to the only small proportion of the reference
genome that could confidently be assigned to an East Asian origin (Green et al., 2010)."
"We show that reference bias is able to lead to significant differences between estimates of population genetic parameters (heterozygosity), overestimated levels
of archaic ancestry as well as to cause spurious affinities to certain populations.
Additionally, strong differences of fragment size distributions between different individuals may
cause spurious affinities due to reference bias."
"Conclusion
Our analysis highlights that reference bias is **pervasive** in ancient DNA data used to study prehistoric
populations. While the strength of the effect differs between applications and data set,
it is clear that reference bias has the potential to create spurious results in population genomic
analyses. Furthermore, even when the overall presence of bias is limited, it is important to assess
whether subsets of variants are prone to strong systematic bias, including the possible presence of alternative bias."
Models of archaic admixture and recent history from two-locus statistics -
Aaron P. Ragsdale and Simon Gravel
Abstract
We learn about population history and underlying evolutionary biology through patterns of genetic
polymorphism. Many approaches to reconstruct evolutionary histories focus on a limited number of
informative statistics describing distributions of allele frequencies or patterns of linkage disequilibrium.
We show that many commonly used statistics are part of a broad family of two-locus moments whose
expectation can be computed jointly and rapidly under a wide range of scenarios, including complex
multi-population demographies with continuous migration and admixture events. A full inspection of
these statistics reveals that widely used models of human history fail to predict simple patterns of
linkage disequilibrium. To jointly capture the information contained in classical and novel statistics,
we implemented a tractable likelihood-based inference framework for demographic history. Using this
approach, we show that human evolutionary models that include archaic admixture in Africa, Asia, and
Europe provide a much better description of patterns of genetic diversity across the human genome. We
estimate that individuals in two African populations have 6–8% ancestry through admixture from an
unidentified archaic population that diverged from the ancestors of modern humans 500 thousand years
ago.
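As a point of reference for "two-locus statistics", here is a minimal sketch of the classical linkage disequilibrium coefficient for a pair of biallelic sites. Note this D is a different quantity from the D statistic in the reference bias paper. The function name and toy haplotypes are assumptions, and this is not the authors' inference framework, which works with expectations of a whole family of such moments.

```python
def two_locus_ld(haplotypes):
    """Classical LD between two biallelic loci from phased haplotypes.

    `haplotypes` is a list of (a, b) tuples, with 1 for the derived allele
    and 0 for the ancestral allele at locus A and locus B respectively.
    Returns (D, D_squared, r_squared), where D = f_AB - f_A * f_B.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    f_a = sum(a for a, _ in haplotypes) / n
    f_b = sum(b for _, b in haplotypes) / n
    f_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = f_ab - f_a * f_b
    variance_product = f_a * (1 - f_a) * f_b * (1 - f_b)
    # r^2 is undefined when either locus is monomorphic in the sample.
    r_squared = d * d / variance_product if variance_product > 0 else float("nan")
    return d, d * d, r_squared


if __name__ == "__main__":
    print(two_locus_ld([(1, 1), (1, 1), (0, 0), (0, 0)]))  # perfectly linked
    print(two_locus_ld([(1, 1), (1, 0), (0, 1), (0, 0)]))  # no association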
Quote:
"H. erectus had emerged around 1.8 million years ago, and had long been present, in various subspecies throughout Eurasia. The divergence time between the Neanderthal and archaic Homo sapiens lineages is estimated at between 800,000 and 400,000 years ago."