Recurrent SNPs

General discussions regarding DNA and its uses in genealogy research

Posts: 2410
Joined: Sun Mar 18, 2012 7:08 am
Location: Pisa (Italy)
YDNA:
R- Z2110 (KV7Y2)
MtDNA:
K1a1b1e/HQ176413
Posted: Tue Dec 22, 2015 5:02 pm
This seems more reliable:



ChrY position:
20723243 (+strand)
Reads:
7
Position data:
7T
Weight for T:
1.0
Probability of error:
0.0 (0<->1)
Sample allele:
T
Reference (hg19) allele:
T
Known SNPs at this position:
F552 (T->C)
Reference sequence (100bp):
AAAATTGGGAGAAGCTGTCTTCTGTGTAGGACGGCTATTTCAAAATATTT
 T 
AGTGTTTTGTTTCCTGTTGCATGACAGATTTAACTTTTTTTTTTTTTTTC



This isn't classified yet:




ChrY position:
19048392 (+strand)
Reads:
2
Position data:
2G
Weight for G:
1.0
Probability of error:
0.0 (0<->1)
Sample allele:
G
Reference (hg19) allele:
T
Reference sequence (100bp):
ATGTCTATTATATATATATATATATATATCTAGACATATATATATATATA
 T 
AGAGAGAGAGAGAGAGAGACAGAGAGAGAGATGGGTTCTTTTTGTGTTGC
(19048341-19048442)

Posts: 2233
Joined: Fri Mar 16, 2012 5:43 pm

MtDNA:
U5b2b
Posted: Mon Aug 22, 2016 5:57 pm
Gioiello wrote: I wrote this to a friend of mine on FB this morning. Perhaps it could be useful to you too:

A recurrent SNP is a SNP that is found in numerous hgs, thus it is a fast-mutating one, and, even though it may be useful within a haplogroup among closely linked haplotypes, it isn't useful for the Y-tree. Don't mind your recurrent SNPs, if you have them. The test is clear about the non-recurrent ones, and your situation is clear.


Search in BAM file
ChrY position: 16375977 (+strand)
Reads: 14
Position data: 14A
Weight for A: 1.0
Probability of error: 0.0 (0<->1)
Sample allele: A
Reference (hg19) allele: G
Known SNPs at this position: A9845 (G->A)
M8402 (G->T)

Here is an example of two different SNPs at the same position, and of a recurrent SNP as well. I am A9845+ and it is a new branch of Z156. Some of these are good-quality SNPs. ISOGG will not accept a SNP with fewer than 4 reads. Furthermore, I know that A9845 is a top-of-the-range SNP because a tester who matches me at a GD of 7 at 67 markers has the same mutation, and only two of us out of thousands of R1b testers have this SNP.

I have heard that for the SNP M12124, which was linked to RISE560 BB, the result was G>A for him, but there is another mutation, G>C, at the same position in some modern R1b samples. It would mean RISE560 belonged to DF27 and downstream, because not everyone who is DF27 has the mutation.

Posts: 2410
Joined: Sun Mar 18, 2012 7:08 am
Location: Pisa (Italy)
YDNA:
R- Z2110 (KV7Y2)
MtDNA:
K1a1b1e/HQ176413
Posted: Mon Aug 22, 2016 7:58 pm
dartraighe wrote: Search in BAM file
ChrY position: 16375977 (+strand)
Reads: 14
Position data: 14A
Weight for A: 1.0
Probability of error: 0.0 (0<->1)
Sample allele: A
Reference (hg19) allele: G
Known SNPs at this position: A9845 (G->A)
M8402 (G->T)

Here is an example of two different SNPs at the same position, and of a recurrent SNP as well. I am A9845+ and it is a new branch of Z156. Some of these are good-quality SNPs. ISOGG will not accept a SNP with fewer than 4 reads. Furthermore, I know that A9845 is a top-of-the-range SNP because a tester who matches me at a GD of 7 at 67 markers has the same mutation, and only two of us out of thousands of R1b testers have this SNP.

I have heard that for the SNP M12124, which was linked to RISE560 BB, the result was G>A for him, but there is another mutation, G>C, at the same position in some modern R1b samples. It would mean RISE560 belonged to DF27 and downstream, because not everyone who is DF27 has the mutation.


Your SNP is very good. This is my read:
Sample: #YF02873 (R-Z2110)
ChrY position: 16375977 (+strand)
Reads: 15
Position data: 15G
Weight for G: 1.0
Probability of error: 0.0 (0<->1)
Sample allele: G
Reference (hg19) allele: G
Known SNPs at this position: A9845 (G->A)
M8402 (G->T)
Reference sequence (100bp):
GGCCAACTTGTGAAACCCTATCTCTACCAAAAAATACAAAAATTTGCCAG
 G 
CATGATGGTACATGCCTATAATCCCAGCTACTCATGAGGCTGAGGTGAGA
(16375926-16376027)
The problem isn't the number of reads, but the test. I have many very good SNPs with only one read in my Full Genome. Other tests are bad even with hundreds of reads. Anyone who considers a SNP invalid below 4 reads, without considering the firm that ran the test, is out of his mind.
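
The two reads above can be compared with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes that "Position data" is a read count followed by a base (for example "14A"), possibly with several such tokens separated by spaces. It does not reproduce how any BAM viewer actually computes its "Weight" or "Probability of error" fields.

```python
import re
from collections import Counter

def parse_position_data(text):
    """Turn a readout such as '14A' into base counts.
    Assumption: mixed reads would look like '13A 1G'."""
    counts = Counter()
    for n, base in re.findall(r"(\d+)([ACGT])", text.upper()):
        counts[base] += int(n)
    return counts

def call_known_snps(position_data, known_snps):
    """Compare the majority allele with each known SNP (ancestral, derived)."""
    counts = parse_position_data(position_data)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads at this position")
    allele, reads = counts.most_common(1)[0]
    calls = {}
    for name, (ancestral, derived) in known_snps.items():
        if allele == derived:
            calls[name] = "derived (+)"
        elif allele == ancestral:
            calls[name] = "ancestral (-)"
        else:
            calls[name] = "other allele"
    return {"allele": allele, "reads": total, "weight": reads / total, "calls": calls}

# The two SNPs listed at ChrY 16375977 in the posts above
KNOWN = {"A9845": ("G", "A"), "M8402": ("G", "T")}

print(call_known_snps("14A", KNOWN))  # dartraighe's read
print(call_known_snps("15G", KNOWN))  # the R-Z2110 read
```

With these inputs the first read comes out derived for A9845, and the second comes out ancestral for both SNPs, which matches what the two posts say. Read depth is reported but deliberately not used as a pass/fail threshold, since whether 4 reads is a sensible cut-off is exactly the point under dispute.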

Posts: 2233
Joined: Fri Mar 16, 2012 5:43 pm

MtDNA:
U5b2b
Posted: Sat Jan 28, 2017 7:04 pm
This is a good post by T.K.
"We (well, at least many of us) think of SNPs as "stable" and that testing positive for a SNP "proves" (which I'm intentionally using loosely) a certain descendancy. But if these SNPs happen in numerous branches, I think that says you can't just take a SNP-positive test as a strong claim of descendancy.

At yseq, for example, you don't need to do any STR testing - you can just test the SNP you're interested in.

But you could test positive for the SNP and be in an entirely different branch of the human tree. So, once in a while, that apparently gives the wrong answer.

One solution is it happens so infrequently, we won't worry about it. Already certain enough.

Another answer is an isolated SNP-positive test needs to be put together with some other piece of data (what data?) before proclaiming anything.

Another answer is more than one (how many?) SNPs in a given tree need to be tested to be sure enough.

Yet another gray area is what if the SNP back-mutates (is that possible?) Then I suppose a tester who was trying to pick off a single SNP would back all the way up to the beginning and eventually arrive at the correct answer by testing many unrelated SNPs and ruling out all sorts of other branches.

All this begs the question: do we have any best practices for zeroing in on the answer? Maybe it involves some combination of SNP and STR testing, or maybe it's diving into an NGS test before doing anything else? Or a backbone test? Or...

I know, "it varies" but still, there should be *some* rules of thumb, right?
The usual approach to answering the question of whether one SNP is enough is Bayes' theorem.
https://en.wikipedia.org/wiki/Bayes'_theorem

The first part is to figure out the prior probability. This is of course difficult and depends on many factors such as the paper trail, how close those people live together, whether they share the same surname, etc.

The prior probability has a huge influence on the outcome, but statisticians tend to define it as 1 if they have no access to all this information.
So let's assume that Pr, the prior probability, is 1 (though we know it isn't).

The formula as it is shown in Wikipedia is

P(A | B) = ( P(B | A) * P(A) ) / P(B)

P(A) is the probability that the two people are related to each other because they have inherited a genetic marker from a common ancestor.
P(B) is the probability that the two people just randomly match at a marker. This is the frequency of this mutation in the general population.

P(B|A) is the conditional probability that the persons share a marker, but they are in fact not related. This could happen through a parallel mutation, so this numeric value is really the mutation frequency of that marker.

To estimate some numerical values we'll need to focus on a distinct example. Numbers will change depending on the SNP we choose and on what population we consider. Let's select U106 as an example since we are in the R1b-U106 discussion group.

To estimate the population frequency of U106 I simplify by defining it as "YSEQ customers". You'll certainly find a much better definition, but the "YSEQ customer" statistic is easy to look up. (Of course we know it is very biased because only persons who have a reason to test U106 will order that test at YSEQ). The Ybrowse U106 details view shows us:

Name: U106
Type: snp
Source: point
Position: ChrY:8796078..8796078 (+ strand)
Length: 1
allele_anc: C
allele_der: T
comment: aka M405 S21
count_derived: 2824
count_tested: 9097
isogg_haplogroup: R1b1a1a2a1a1
load_id: U106
mutation: C to T
primer_f: U106_F GCTCTGGTGCATAGGGATTC
primer_r: U106_R AGTCTGAACTCTTGGGAGATGG
ref: Sims et al. 2007
ycc_haplogroup: R1b1a2a1a1a
primary_id: 204851
gbrowse_dbid: chrY:database


The important lines are:
count_derived: 2824
count_tested: 9097

So our heavily over-estimated population frequency for U106 is 2824 / 9097 = 0.31 = 31%
You will certainly find a better approach to measure this, but at least now we have a number to enter in the Bayes formula.

How do we find the mutation frequency in the "YSEQ customers" population? We have all kinds of haplogroups, from A00 to those of European, African, Asian and Indian origin. However, we know that we're all related back 200,000 years ago when A00 split off. We also know that U106 has mutated exactly once within those 200ky, since I'm not aware of any sample in a different haplogroup being confirmed U106+. (This may drastically change with a marker like PF6069 above.) So for U106 we can estimate the mutation frequency as 1 / 200000 years = 0.000005 per year.

This leaves us with the P(A) parameter. When a father has a son, then the probability that he passes down his Y chromosome including the U106 marker is 100% or 1. With autosomal markers this would be 50%, but here we know that U106 is on the Y chromosome and any male descendant must have inherited the same Y chromosome. So 1 it is.

Enter the values into the Bayes formula:

P(A|B) = (0.000005 per year * 1) / 0.31 = 0.0000161 per year

So what does this mean?
The likelihood that we are observing a SNP in two persons even though they are in fact not related is very small. On average one such event is expected every 1 / 0.0000161 = 62112 years, so we'd need to wait 62112 / 2 = 31 kyears (round trip to the common ancestor in the tree and back) to expect that such an event will happen in a population that has a similar structure to the "YSEQ customers". Considering that R1b-U106 is 4500 ... 5300 years old (according to http://www.jb.man.ac.uk/~mcdonald/genet ... 6-age.html), the probability that this happened within the existence of U106 is 7 ... 8.5%.

Now feel free to use your own numbers and populations and markers and don't forget to factor in the prior probability if you have a reason for it."
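
For anyone who wants to follow that last suggestion, the arithmetic in the quoted post can be retraced with a short Python sketch. The variable names are mine, and the script only repeats the post's own simplifications (prior of 1, "YSEQ customers" as the population, one U106 mutation in 200,000 years). It is not a proper relatedness model.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem as written in the post: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Figures from the YBrowse U106 record quoted in the post
count_derived = 2824
count_tested = 9097
p_b = count_derived / count_tested        # population frequency, about 0.31

mutation_rate = 1 / 200_000               # one U106 mutation in 200,000 years
p_a = 1.0                                 # Y chromosome passed father to son

per_year = bayes(mutation_rate, p_a, p_b) # about 0.0000161 per year
wait_years = 1 / per_year                 # about 62,000 years
round_trip = wait_years / 2               # about 31,000 years

print(f"P(B) = {p_b:.3f}")
print(f"per year = {per_year:.7f}")
print(f"wait = {wait_years:.0f} years, round trip = {round_trip:.0f} years")

# Chance of such an event within the estimated age of U106
for age in (4500, 5300):
    print(f"{age} years: {100 * age * per_year:.1f}%")
```

Using the unrounded frequency gives a waiting time of roughly 62,090 years rather than the 62112 in the post, which comes from the rounded 0.0000161. The 7 ... 8.5% range is simply the age in years multiplied by the per-year figure. Swap in your own counts, time depth and prior to see how strongly the result depends on them.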

Posts: 2233
Joined: Fri Mar 16, 2012 5:43 pm

MtDNA:
U5b2b
Posted: Tue Feb 21, 2017 7:41 pm
There are around 59 million or so base pairs on the Y chromosome and about 3.5 billion males in the world, and that is one of the reasons that we have so many recurrent SNPs. That is from an expert. The SNPs that are important to us are those that are unique to our Y lines. I have lots of recurrent one-star-rated SNPs from my YFull analysis of my BAM file.
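
To put rough numbers on that argument, here is a minimal Python sketch. The per-base mutation rate is NOT given in this thread. The value used below is only an illustrative order-of-magnitude assumption, and it treats every position as equally mutable, which is a simplification. Only the 59 million and 3.5 billion figures come from the post.

```python
Y_LENGTH_BP = 59_000_000        # from the post: base pairs on the Y chromosome
LIVING_MALES = 3_500_000_000    # from the post: males in the world

# ASSUMPTION (not from the thread): illustrative mutation rate per base pair
# per generation, treated as uniform along the chromosome.
ASSUMED_RATE = 3e-8

def new_mutations_per_male(length_bp, rate):
    """Expected new Y mutations in one father-to-son transmission."""
    return length_bp * rate

def hits_per_position(males, rate):
    """Expected number of living males carrying a fresh mutation
    at any one given position."""
    return males * rate

per_male = new_mutations_per_male(Y_LENGTH_BP, ASSUMED_RATE)
per_site = hits_per_position(LIVING_MALES, ASSUMED_RATE)

print(f"new mutations per male: about {per_male:.2f}")
print(f"fresh hits per position across all males: about {per_site:.0f}")
```

Under that assumed rate each male carries only one or two new mutations, but with billions of males every single position is expected to be hit many times over in one generation. That is the sense in which recurrence is unavoidable. The exact figures move in proportion to whatever rate you plug in.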

Posts: 2233
Joined: Fri Mar 16, 2012 5:43 pm

MtDNA:
U5b2b
Posted: Thu May 18, 2017 6:13 pm
Dr McDonald:

"Recurrent SNPs can and do happen all over the haplotree.

Theoretically, the BigY tests about 10 million positions. There are 235805 unique variants in the current YBrowse file. Each new U106 test brings in a median of nine new variants, so the chance of a new test coming along with a previously listed variant is around 1-(1- 235805/10000000)^9 = 19% (or about 21% using the simpler product 9 x 235805/10000000).

Even within U106, we have 15,856 base pairs which mutate, giving a rough probability of 1-(1- 15856/10000000)^9 = 1.4% that any new test will come in with a genuinely recurrent SNP. One of the major issues I am now facing with the age analysis is determining what is a genuinely recurrent SNP, and what is a problematic call in the FTDNA data.

In reality, not all positions in the Big Y test are created equally. Different mutation rates are known in the palindromic arms from the rest of the chromosome, and strongly recurrent regions (e.g. DYZ19) have much faster mutation rates, perhaps by a factor of 10. That means that certain regions are much more prone to having recurrent mutations. One of my ongoing tasks is to try to categorise this mutation rate as a function of position in the chromosome and try to determine how the regions we use affect the age analysis."

- Iain.
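
The two probabilities in that quote are easy to check in Python. This is a sketch of the formula as quoted, 1-(1-k/N)^n, with function and parameter names of my own choosing. Like the quote's "theoretical" case, it assumes every tested position is equally likely to mutate, which the last quoted paragraph says is not true in reality.

```python
def chance_of_overlap(known_positions, tested_positions, new_variants):
    """Chance that at least one of `new_variants` lands on an already
    listed position, assuming all positions are equally likely."""
    p_single = known_positions / tested_positions
    return 1 - (1 - p_single) ** new_variants

TESTED = 10_000_000   # positions covered by the BigY, per the quote
NEW_PER_TEST = 9      # median new variants per U106 test, per the quote

any_known = chance_of_overlap(235805, TESTED, NEW_PER_TEST)   # all of YBrowse
within_u106 = chance_of_overlap(15856, TESTED, NEW_PER_TEST)  # U106 only

# Simple product, valid only while the probability is small
linear = NEW_PER_TEST * 235805 / TESTED

print(f"any listed variant: {100 * any_known:.1f}%")
print(f"within U106:        {100 * within_u106:.1f}%")
print(f"linear shortcut:    {100 * linear:.1f}%")
```

The exact formula gives a little over 19% for the first case and about 1.4% for the second. The 21% figure corresponds to the linear shortcut 9 x 235805/10000000, which slightly overstates the chance once the probability is no longer small.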

Posts: 2233
Joined: Fri Mar 16, 2012 5:43 pm

MtDNA:
U5b2b
Posted: Thu Aug 16, 2018 5:57 pm
One of my recurrent YSNPs has been found in an African branch of the B haplogroup. This specific YSNP is estimated to be around 1,300 years old in my Y line and is found in only two Irish samples of U106>Z156. This one valid recurrent YSNP is enough to identify my subgroup. This YSNP is not found in any tester outside of Ireland, so it must be specific to the region my ancestors are from. The TMRCA of this region-specific YSNP shows us how long the branch has been in Ireland. It has taken 12 years to establish this fact.


The recurrent YSNP that I have is found among these and this is one huge bottlenecked African branch.

https://www.yfull.com/tree/B-P6/