|
Post by djoser-xyyman on Nov 9, 2018 8:42:54 GMT -5
2.21 %
|
|
|
Post by djoser-xyyman on Nov 19, 2018 16:53:06 GMT -5
CODIS STR Repeats and location - template for LOBSTR
==============
chr11 2192318 2192345 4 7 TH01
chr13 82722160 82722203 4 11 D13S317
chr15 97374245 97374269 5 5 PentaE
chr16 86386308 86386351 4 11 D16S539
chr18 60948900 60948971 4 18 D18S51
chr21 20554291 20554417 4 29 D21S11
chr21 45056086 45056150 5 13 PentaD
chr2 1493425 1493456 4 8 TPOX
chr3 45582231 45582294 4 16 D3S1358
chr4 155508888 155508975 4 22 FGA
chr5 123111250 123111293 4 11 D5S818
chr5 149455887 149455938 4 13 CSF1PO
chr7 83789542 83789593 4 13 D7S820
chr8 125907115 125907158 4 13 D8S1179
=============================
What are you looking at? Take TPOX for example. We know it was removed from the Abusir JK488!! I haven't manually checked every single CODIS STR, but the few that I have checked are missing. The chromosome number is self-explanatory. The start and end positions of the region follow. For TPOX, the 4 works out as the length of the repeat unit in bases and the 8 as the number of repeats in the reference file I pulled from lobSTR (the region spans 32 bases = 8 x 4), so these two numbers are not the alleles from mommy and daddy. And of course the last column is the "name" of the CODIS STR.
Now... if we can manually look at the location, we can do a physical count of the repeats... no computer needed.
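One way to check how columns 4 and 5 of the table relate to the coordinates is to compute the span of each region and divide it by column 4. This is only a sketch: the helper name span_in_repeats is made up for illustration, and it assumes both the start and end positions count as part of the region.

```shell
# Hypothetical helper (not part of lobSTR or BEDTools).
# Reads locus lines: chrom start end col4 col5 name
# Prints: name, span in bases, span divided by column 4.
# Assumption: start and end are both inside the region (span = end - start + 1).
span_in_repeats() {
  awk 'NF >= 6 { span = $3 - $2 + 1; printf "%s %d %.2f\n", $6, span, span / $4 }'
}

span_in_repeats <<'EOF'
chr2 1493425 1493456 4 8 TPOX
chr11 2192318 2192345 4 7 TH01
chr15 97374245 97374269 5 5 PentaE
EOF
# prints:
# TPOX 32 8.00
# TH01 28 7.00
# PentaE 25 5.00
```

For these three rows the result equals column 5, which is what you would expect if column 4 is the repeat unit length and column 5 the repeat count in the reference.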
|
|
|
Post by djoser-xyyman on Nov 19, 2018 17:25:07 GMT -5
Quote from the Abusir paper
-------------
“Nuclear data analysis: genotyping. We called genotypes from the UDG treated data for the three individuals by sampling a **random** read per SNP in the SNP-capture panel, using a custom tool **‘pileupCaller’**, available at github.com/stschiff/sequenceTools. The resulting genotypes were merged with data from two other data sets: First, 2,367 modern individuals genotyped on the Affymetrix Human Origins Array (refs. 34, 35); second, 294 ancient genomes (ref. 36).”
--------------
How can it be random when the SNPs are selected through pileupCaller? Huh? Anyways…
Quote on pileupCaller:
--------------------------
pileupCaller: The main tool in this repository is the program pileupCaller to sample alleles from low coverage sequence data. The first step is to generate a “pileup” file at all positions you wish to genotype. To do that, here is a typical command line, which restricts to mapping and base quality of 30:
----------------------
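As I read the quote, the list of SNP positions is fixed in advance and the "random" part is which read gets used at each of those positions. Here is a toy sketch of that idea. It is not pileupCaller's code; the helper name sample_one_read and the three-column input (chromosome, position, string of bases seen in the reads) are made up for illustration and are not the real pileup format.

```shell
# Hypothetical helper, NOT the pileupCaller implementation.
# Input lines (assumed toy format): chrom position observed_bases
# For each listed position, print one base picked at random from the reads.
# The optional first argument is the random seed (default 1).
sample_one_read() {
  awk -v seed="${1:-1}" '
    BEGIN { srand(seed) }
    length($3) > 0 {
      n = length($3)
      i = int(rand() * n) + 1      # random index between 1 and n
      print $1, $2, substr($3, i, 1)
    }'
}

# Two listed positions: one with mixed reads, one where all reads agree.
printf 'chr1 1000 AAAG\nchr1 2000 CC\n' | sample_one_read 42
```

The second position always comes out as C. The first comes out as A or G depending on the draw, but the position itself is never chosen at random.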
|
|
|
Post by djoser-xyyman on Nov 19, 2018 17:25:28 GMT -5
In other words Reich Labs "selected" the SNPs they wanted to do the comparison with. tsk! tsk! tsk! naughty! naughty!
In other words:
1. These are NOT the complete genomes of the 3 Abusir.
2. The relevant SNPs were probably removed by pileupCaller.
|
|
|
Post by djoser-xyyman on Nov 21, 2018 10:03:50 GMT -5
OK. I think I can conclude with certainty that the CODIS STR genome sections were removed from the Abusir. Initially I did a 'visual' check with an Excel dump and knew with certainty that TPOX was removed. I finally have BEDTools running and now I can look across the entire genome within seconds... yes seconds, not minutes!!! I can process BAM files almost instantaneously. And yes, there are "zero" intersects for CODIS STRs using BEDTools. The CODIS STRs were removed from Abusir JK2134. See attached.
The script (with -u) will dump any Abusir reads that overlap a CODIS STR, if they are present. None were found with STRaitRazor, and now none are found with BEDTools. The same script found "intersects" on the reference "hg18" chr21.
Now why would the researchers remove EVERY single CODIS STR? The Amarnas come to mind. That was a cluster-fughgk on their part. Lol!
|
|
|
Post by djoser-xyyman on Nov 21, 2018 10:04:53 GMT -5
To the newbies reading this. What they have done is remove all the CODIS STRs that would have identified the Abusirs as Sub-Saharan Africans. ALL OF THE CODIS STRs!!!!!!!!! Now why would they do that? SMH. Obviously if the CODIS identified the Abusir as "European" they would have left them in the dataset. They essentially left in preselected SNPs that would identify the Abusir as Eurasians. But ALL Africans carry "Eurasian" SNPs!!!! It is called a stacked deck. Naughty! Naughty! Europeans!
@ Sage - I needed to get past the CODIS STR dilemma. I have been working on this for the last couple of weeks. I will start playing around with ADMIXTURE and other software. Maybe I can pull a rabbit out of the hat with what little data they decided to publish/upload. MALDER/TreeMix etc. look promising.
To those newbies trying to process the Abusir data using genetic software, here are the steps:
1. Download the (3) Abusir genomes.
2. Download and install the software BEDTools. This is the most difficult part. Many, many, many libraries may be needed.
3. Download the hg19 reference BED file for CODIS STRs from the lobSTR site.
4. Download test files and verify BEDTools is working properly.
5. Convert the Abusir BAM file to a BED file using BEDTools.
6. Run BEDTools "intersect" to pull out the CODIS STRs from the Abusir... if they are there.
Here is the script. If there are any CODIS STRs they will be dumped on the screen, or you can "pipe" the output into a file.
$ bedtools intersect -a JK2134udg.fixedRG-2.bed -b lobSTR_codis_hg19.bed -u
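For anyone who wants to see what that command is checking, here is a rough awk stand-in for the overlap test. It is a sketch, not BEDTools itself: the helper name overlap_any and the read names are made up, and it assumes both files use the same coordinate convention (end position not included). As far as I know chromosome names are compared as plain strings, so "chr2" in one file and "2" in the other never match; it is worth checking that both files use the same naming.

```shell
# Hypothetical helper (rough stand-in for `bedtools intersect -a A -b B -u`).
# Prints each line of file A that overlaps at least one interval in file B
# on the same chromosome name. Columns assumed: chrom start end ...
# Assumption: end coordinates are exclusive in both files.
overlap_any() {
  awk 'FILENAME == ARGV[1] {            # first file given to awk is B (the STR regions)
         n[$1]++
         s[$1, n[$1]] = $2 + 0
         e[$1, n[$1]] = $3 + 0
         next
       }
       {                                # second file is A (the reads)
         for (i = 1; i <= n[$1]; i++)
           if ($2 + 0 < e[$1, i] && $3 + 0 > s[$1, i]) { print; break }
       }' "$2" "$1"
}

# Made-up demo data: read1 overlaps TPOX, read2 does not,
# read3 has the same coordinates but a different chromosome name ("2").
reads=$(mktemp); strs=$(mktemp)
printf 'chr2\t1493400\t1493440\tread1\nchr2\t5000\t5050\tread2\n2\t1493400\t1493440\tread3\n' > "$reads"
printf 'chr2\t1493425\t1493456\t4\t8\tTPOX\n' > "$strs"
overlap_any "$reads" "$strs"
rm -f "$reads" "$strs"
```

In the demo only read1 is printed; read3 is skipped purely because of the chromosome name.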
Keep in mind you have to be familiar with UNIX or Ubuntu. Unlike STRaitRazor or the BAMAnalysis kit, this is not Windows-based. I used Ubuntu.
Enjoy!
|
|
|
Post by djoser-xyyman on Nov 21, 2018 10:12:18 GMT -5
In short, I confirmed by three different methods that the CODIS STRs were removed from the Abusir mummies before the dataset was uploaded.
1. STRaitRazor, which is MS Excel VBA/macro based, pulled zero CODIS STRs.
2. BAMAnalysis Kit, via a VCF/text conversion process, showed the regions for TPOX and other STRs were missing.
3. BEDTools showed no "intersect" for CODIS STRs, i.e. zero CODIS STRs.
|
|
|
Post by djoser-xyyman on Nov 21, 2018 10:14:58 GMT -5
Now we cannot catch the Europeans in their lies!!!! Slick!
|
|
|
Post by kel on Nov 22, 2018 15:19:10 GMT -5
Have you tried reaching out to the researchers and asking them to explain their behavior?
Clearly they rely on the fact that most aren't literate in the scientific details like you are.
|
|
|
Post by djoser-xyyman on Nov 29, 2018 5:36:25 GMT -5
Got ADMIXTURE running. I have yet to figure out how to incorporate the Abusir into my dataset. The dataset I chose is below. I created the *.bim, *.bed and *.fam files using a larger population dataset from Razib Khan. I don't like it but that is what I had to work with so far. His population set only has 30,000 SNPs, which is really small compared to other datasets that can reach > 300,000,000 SNPs!!!! And consider that the human genome runs to billions of base pairs!!! But I am getting my feet wet. Razib Khan has a Perl script that runs PCA, ADMIXTURE and TreeMix in one shot. Here is the lowdown so far.
Use the "grep" command to choose your population set. This is what I chose, see below. Then use PLINK to create the files to run in ADMIXTURE. ADMIXTURE will kick out the *.Q and *.P files, which are used to plot the cluster chart for each K value. R is the program used to create the color cluster charts. My PC is a quad core with 4 GB RAM, so you will see 4 "threads". Still have a lot of work to do. Pulling the Abusir into ADMIXTURE is not done yet, but it is coming along. The only problem is the 30,000 SNPs used in the dataset. I have to find test samples with a lot more SNPs than what is provided by Razib Khan.
I chose these populations out of the more than 100 that Razib Khan had available. I will try HGDP soon but I need to figure out how to create the "*.bim, *.bed and *.fam" files from the HGDP.
Anyways. Here goes.....more to come
grep "AfricanBarbados\|EsanNigeria\|EthiopianJews\|Gambian\|Ethiopians\|Yoruba\|Mbuti_Pygmies\|San\|Mandenka\|Bantu_NE\|Luhya\|Biaka_Pygmies\|Bedouin\|Egyptans\|Cypriots\|Druze\|Palestinian\|AshkenazyJews\|Saudis\|SephardicJews\|Syrians\|Moroccans\|Mozabite\|Mende\|LibyaJew\|TunisiaJew\|Yemenese\|Tuscan\|Sardinian\|Sicily\|Greek\|Spaniard\|IranJew\|IraqiJews\|ItalyJew\|Jordanians\|Sindhi\|Samaritians\|Bengali\|Gond\|Makrani\|Papuan\|NAN_Melanesian\|UtahWhite\|Finn\|French_Basque\|GreatBritain\|Mongol\|HanBeijing\|Han_S\|Dai\|Vietnamese\|Karitiana\|Maya\|MumbaiJews\|Peruvian" Est1000HGDP.fam > main-sample-set.txt
./plink --bfile Est1000HGDP --keep main-sample-set.txt --make-bed --out main-sample-admixture-test-set
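One thing to watch with a plain grep pattern is that a short label such as "San" also matches any line that merely contains those letters. A possible alternative is to keep the labels in a text file and match them exactly against one column. This is a sketch under assumptions: the helper name select_populations and the label "Sandawe" are made up for the demo, and I am assuming the population label sits in the first column of the *.fam file.

```shell
# Hypothetical helper: exact-match population filter for a .fam file.
# $1 = text file with one population label per line
# $2 = .fam file
# Prints the first two columns (family ID, individual ID) of matching rows,
# which is the two-column layout PLINK's --keep reads.
# Assumption: the population label is stored in column 1 of the .fam file.
select_populations() {
  awk 'FILENAME == ARGV[1] { if (NF) keep[$1] = 1; next }
       ($1 in keep)         { print $1, $2 }' "$1" "$2"
}

# Made-up demo: "Sandawe" contains "San" but is not in the label list.
labels=$(mktemp); fam=$(mktemp)
printf 'San\nYoruba\n' > "$labels"
printf 'San S1 0 0 1 -9\nSandawe D1 0 0 2 -9\nYoruba Y1 0 0 0 -9\nFinn F1 0 0 0 -9\n' > "$fam"
select_populations "$labels" "$fam"
rm -f "$labels" "$fam"
```

The output can be redirected into main-sample-set.txt and passed to --keep the same way as before.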
|
|
|
Post by djoser-xyyman on Nov 29, 2018 5:37:03 GMT -5
PLINK v1.90p 64-bit (2 Jun 2015) www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to main-sample-admixture-test-set.log.
Options in effect:
--bfile Est1000HGDP
--keep main-sample-set.txt
--make-bed
--out main-sample-admixture-test-set
3368 MB RAM detected; reserving 1684 MB for main workspace.
135056 variants loaded from .bim file.
2447 people (513 males, 260 females, 1674 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink.nosex .
Using up to 4 threads (change this with --threads).
Before main variant filters, 2447 founders and 0 nonfounders present.
Calculating allele frequencies... done.
135056 variants and 2447 people pass filters and QC.
Note: No phenotypes present.
Relationship matrix calculation complete.
--pca: Results saved to plink.eigenval and plink.eigenvec .
**** ADMIXTURE Version 1.22 *****
**** Copyright 2008-2012 *****
**** David Alexander, John Novembre, Ken Lange *****
**** Please cite our paper! *****
**** Information at www.genetics.ucla.edu/software/admixture *****
|
|
|
Post by djoser-xyyman on Nov 29, 2018 5:37:41 GMT -5
I will provide info as I go along. Also create a tutorial when needed. But in short.
1. To run ADMIXTURE you need the “input” genomes in a specific format. Three input files are needed: *.fam, *.bed and *.bim. The *.fam file lists the individuals and their populations. You need to run PLINK to create these files. You can downsize from the original “large” dataset by selecting populations from the *.fam file; I chose the populations listed above. This can be done from the command line using “grep”; no special software is needed.
2. Next, use the resulting list in PLINK to “pull” those genomes from the original *.bed and *.bim files.
3. Now that you have created YOUR sample set, you can run it in the ADMIXTURE software, which will kick out the *.Q and *.P files. These can be plotted.
4. The chart plotting can be done in “R”. “R” is a free statistics and plotting language that runs in Ubuntu. Most researchers use “R”. I run “R” in the Ubuntu/UNIX environment, so get familiar with this operating system environment!
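Before plotting anything, the ADMIXTURE *.Q output can be summarized from the command line. The sketch below averages each cluster fraction per population. The helper name mean_q_by_population and the demo numbers are made up; it relies on the rows of the *.Q file being in the same order as the *.fam file, and it assumes the population label is in the first column of the *.fam file.

```shell
# Hypothetical helper: mean cluster fraction per population.
# $1 = .fam file (6 columns), $2 = ADMIXTURE .Q file (K columns per row)
# Assumptions: .Q rows follow .fam row order; column 1 of .fam is the population label.
mean_q_by_population() {
  paste -d ' ' "$1" "$2" | awk '
    {
      pop = $1
      count[pop]++
      k = NF - 6                         # number of clusters (K)
      for (j = 1; j <= k; j++) sum[pop, j] += $(6 + j)
    }
    END {
      for (pop in count) {
        line = pop
        for (j = 1; j <= k; j++)
          line = line sprintf(" %.3f", sum[pop, j] / count[pop])
        print line
      }
    }' | sort
}

# Made-up demo with K=2.
fam=$(mktemp); q=$(mktemp)
printf 'Yoruba Y1 0 0 0 -9\nYoruba Y2 0 0 0 -9\nFinn F1 0 0 0 -9\n' > "$fam"
printf '0.9 0.1\n0.7 0.3\n0.2 0.8\n' > "$q"
mean_q_by_population "$fam" "$q"
rm -f "$fam" "$q"
```

Each output line is a population followed by its K averages, which is a quick sanity check before building the color chart.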
I am in good standing because I have done programming in college…been awhile but it is coming back to me.
One issue you will encounter is that many of the ancient genomes come in different data formats, so conversion and/or alignment is needed. The main challenge is converting between formats while keeping the alignment to the reference correct. That is why they standardized on hg19 etc. But hg19 etc. are European based and lacking many alleles. In other words they are skewed towards Europeans. The African reference will soon be coming out.
But here is what the Simons project states. I am looking into using the Simons Genome Diversity Project (SGDP) instead of HGDP to do my analysis. But you need a lot of computing power. They are talking about >70 TB of data. SHIIIIIT!!!! If I can pull a small fraction I will be good. Lol!
Quote: “ We report the Simons Genome Diversity Project (SGDP) dataset: high quality genomes from 300 individuals from 142 diverse populations. These genomes include at least ***5.8 million base pairs that are NOT present in the human reference genome***. Our analysis reveals key features of the landscape of human genome variation, including that the rate of accumulation of mutations has accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the ancestors of some pairs of present-day human populations were substantially separated by 100,000 years ago, well before the archaeologically attested onset of behavioral modernity. We also demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial ancestry from an early dispersal of modern humans; instead, their modern human ancestry is consistent with coming from the same source as that in other non-Africans.
To obtain a complete picture of human diversity, it is necessary to sequence the genomes of many individuals from diverse locations. To date, the largest whole-genome sequencing survey, the 1000 Genomes Project, analyzed 26 populations of European, East Asian, South Asian, American, and sub-Saharan African ancestry”
|
|
|
Post by djoser-xyyman on Nov 29, 2018 5:38:14 GMT -5
The SGDP site talks about a small/light version for the layman to download and work with, but that link is dead. Damn!!!! That would have been ideal since I don’t have room for >70 TB of data!!!!
To those who missed it..quote from SGDP:
“These genomes include at least ***5.8 million base pairs that are NOT present in the human reference genome***. ”
The SGDP contains at least 5.8 million MORE base pairs than the reference. So the question is why use HGDP and hg19? We also now found out that Africans have so many more unaccounted-for genes/proteins NOT found in Europeans. So why even use a European-based reference panel?
|
|
|
Post by djoser-xyyman on Nov 29, 2018 5:38:48 GMT -5
Finally completed my first ADMIXTURE chart.
grep "AfricanBarbados\|EsanNigeria\|EthiopianJews\|Gambian\|Ethiopians\|Yoruba\|Mbuti_Pygmies\|San\|Mandenka\|Bantu_NE\|Luhya\|Biaka_Pygmies\|Bedouin\|Egyptans\|Cypriots\|Druze\|Palestinian\|AshkenazyJews\|Saudis\|SephardicJews\|Syrians\|Moroccans\|Mozabite\|Mende\|LibyaJew\|TunisiaJew\|Yemenese\|Tuscan\|Sardinian\|Sicily\|Greek\|Spaniard\|IranJew\|IraqiJews\|ItalyJew\|Jordanians\|Sindhi\|Samaritians\|Bengali\|Gond\|Makrani\|Papuan\|NAN_Melanesian\|UtahWhite\|Finn\|French_Basque\|GreatBritain\|Mongol\|HanBeijing\|Han_S\|Dai\|Vietnamese\|Karitiana\|Maya\|MumbaiJews\|Peruvian" Est1000HGDP.fam > main-sample-set.txt
./plink --bfile Est1000HGDP --keep main-sample-set.txt --make-bed --out main-sample-admixture-test-set
./admixture -j4 main-sample-admixture-test-set.bed 10
I chose to run to K=10.
PLINK v1.90p 64-bit (2 Jun 2015) www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to main-sample-admixture-test-set.log.
Options in effect:
--bfile Est1000HGDP
--keep main-sample-set.txt
--make-bed
--out main-sample-admixture-test-set
---------
Summary:
Converged in 52 iterations (16355.8 sec)
Loglikelihood: -306822873.494210
Fst divergences between estimated populations:
      Pop0   Pop1   Pop2   Pop3   Pop4   Pop5   Pop6   Pop7   Pop8
Pop0
Pop1  0.142
Pop2  0.137  0.225
Pop3  0.055  0.157  0.150
Pop4  0.172  0.259  0.064  0.185
Pop5  0.066  0.174  0.144  0.042  0.180
Pop6  0.086  0.127  0.170  0.112  0.204  0.121
Pop7  0.138  0.215  0.206  0.162  0.238  0.168  0.141
Pop8  0.139  0.227  0.017  0.153  0.063  0.146  0.172  0.208
Pop9  0.131  0.221  0.048  0.143  0.093  0.139  0.165  0.204  0.054
Writing output files.
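The Fst block in that output is a lower triangle, so finding which two estimated clusters sit closest together means scanning every row. A small awk sketch can do it. The helper name closest_pair is made up, and it assumes the rows look like "PopN value value ..." with the values in Pop0, Pop1, ... order, as in the output above.

```shell
# Hypothetical helper: find the pair of clusters with the smallest Fst.
# Reads the lower-triangle rows from stdin ("PopN v0 v1 ..."), skips the
# header row and the empty Pop0 row, and prints: row label, column label, value.
# Assumption: the j-th value on a row is the Fst against Pop(j-1).
closest_pair() {
  awk '$1 ~ /^Pop[0-9]+$/ && $2 ~ /^[0-9.]+$/ {
         for (i = 2; i <= NF; i++)
           if (best == "" || $i + 0 < best) {
             best = $i + 0; row = $1; col = "Pop" (i - 2); raw = $i
           }
       }
       END { if (best != "") print row, col, raw }'
}

# First four rows of the table above as a demo.
closest_pair <<'EOF'
      Pop0   Pop1   Pop2
Pop0
Pop1  0.142
Pop2  0.137  0.225
Pop3  0.055  0.157  0.150
EOF
# prints: Pop3 Pop0 0.055
```

Pasting the whole table in the same way works too; the smallest value in the full output above is the 0.017 between Pop8 and Pop2.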
|
|
|
Post by kel on Nov 29, 2018 14:08:17 GMT -5
"We also now found out that Africans have so much more unaccounted for genes/proteins NOT found in Europeans. So why even use a European based Reference panel."
I think we all know the answer to this question..................
|
|