Big Y upgrade

General discussions regarding DNA and its uses in genealogy research

Posts: 1842
Joined: Fri Mar 16, 2012 5:43 pm

MtDNA:
U5b2b
Posted: Thu Nov 02, 2017 1:12 pm
FTDNA has an ongoing Big Y upgrade that should give some testers new YSNPs. The new VCF, with lower thresholds, should yield some SNPs much like those found in the .BAM file analysis done by Yfull.com. The upgrade is taking a lot longer than intended, but it will benefit everyone who has taken the Big Y test.

My Big Y upgrade is in and I have lost some SNPs in the process. I haven't got the VCF yet.

Posted: Wed Nov 08, 2017 9:21 pm
My VCF was uploaded today; the file size is 19.5 MB, with 429,044 variants. There were a few errors among my YSNPs, which I looked up and deleted. It is going to take me some time to go through the VCF, as there is a lot of information in it.

FTDNA did not lower the threshold for the number of reads. I will be sticking with the Yfull YSNPs that were found in my .BAM file. I do not see anything wrong with the Yfull team's analysis.
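For anyone who wants to check the variant count in their own VCF, here is a minimal Python sketch. It assumes a plain-text (uncompressed) VCF laid out per the VCF specification, with header lines starting with '#' and tab-separated records whose seventh column is FILTER. The function name and the sample records are made up for illustration; they are not FTDNA data, and whether every record in a Big Y VCF is a true variant depends on how the file was produced.

```python
import io
from collections import Counter


def summarize_vcf(lines):
    """Count records in a VCF and tally them by FILTER value.

    `lines` is any iterable of text lines (an open file works too).
    """
    total = 0
    by_filter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # not a well-formed record
        total += 1
        by_filter[fields[6]] += 1  # column 7 is FILTER
    return total, by_filter


# Made-up sample records, not real Big Y data.
sample = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chrY\t2781500\t.\tA\tG\t60\tPASS\t.\n"
    "chrY\t2781600\t.\tC\tT\t12\tLowQual\t.\n"
    "chrY\t2781700\t.\tG\tA\t55\tPASS\t.\n"
)
total, by_filter = summarize_vcf(sample)
print(total, dict(by_filter))
```

To run it on a real file, pass an open file handle instead of the sample, for example `summarize_vcf(open("my_bigy.vcf"))`, where the file name is only a placeholder.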


>Does the number of reads determine the validity of an SNP whenever two complete strangers have the same allele change at the same position on the Y chromosome?

Yes, although having two people with the same allele greatly improves the chances of it being real.

Posted: Thu Nov 09, 2017 7:53 am
A Yfull tester cannot search the .BAM file with the new FTDNA Hg38 positions. That needs to change. The Hg19 positions are irrelevant now. One can search at ybrowse for the unnamed Hg38 YSNPs. My original .BAM file still shows as available for download even though I have my Big Y upgrade. I thought that they were going to upgrade the .BAM file as well.

Posted: Fri Nov 10, 2017 3:10 pm
Anyone who has a BAM file at Yfull can now search for their unnamed variants in Hg38 mode. I have checked some of them already and I am negative for all.

FTDNA:
"The Big Y 2.0 Upgrade will need to be complete for all users before we can begin generating BAM files. We will send you an email when we begin taking requests again."

Posted: Sat Nov 11, 2017 11:17 pm
A post by Dr. Iain McDonald:
Here are some comments about interpreting scientific results more generally.

"SNPs aren't Pokemon. The upgrade isn't to catch them all, it's to improve the underlying quality of the results. Some bad results will be removed as their quality decreases past some threshold; some good results are inserted as their quality improves. Since the underlying changes stem from a more accurate understanding of the reference sequence, more people will gain high-quality SNPs than lose them, but that won't be true for everyone.

In fact, it's impossible to catch them all. Any complex scientific result isn't black and white, it's some shade of grey. Vince made a very good summary of the different quality criteria in his recent post. Since the underlying data haven't been re-run, the principal change from Build 37 to Build 38 for most SNPs will be in the mapping quality (MQ), which determines how reliably that segment of DNA is aligned to the reference chromosome, for the combined set of reads.

Not only is everything some shade of grey (i.e. has some uncertainty), but there is uncertainty in that uncertainty. For example, each base pair on the Build 38 reference has its own probability of being incorrect, and not every base pair out of the reference DNA is mapped. So we can assign a mapping quality against the rest of the genome that we know about (most of it) but we can't assign a quality against those parts that we haven't yet sequenced. The other quality factors (GC, HL, HR) go some way to helping with this, but the numbers are inexact. "Quality" (BQ and MQ) should be thought of as a reasonable guess, not an exact number.

This is true of any scientific analysis, including my age analysis and anything you see in the published literature. For any probability of correctness, +/- uncertainty, or confidence range, the data provider will have done their best job to try to quantify the likely sources of uncertainty - the "known unknowns".

But there are always "unknown known unknowns" where they've had to guess at the effect on the data, or claim it is small enough to be ignored (e.g. in my case, the rate of false positives/negatives in the phylogenetically filtered results). And there will be the "unknown unknowns": things that no-one has yet thought of (perhaps something like unidentified stray light problems in a fluorescence test chamber) or unexpected problems (like cross-contamination between samples, which does sometimes occur, but is normally very obvious).

The complexities of this process mean it's hard to determine what's real and what isn't, or even give an accurate probability. Let's take the example of two BigY tests carried out on a father and a son. (We don't normally encourage this, since the Y-DNA results should be almost identical, but the data allow us to make useful checks on the results.) Caveat emptor: I'm speaking as a general scientist, not as someone who knows the technical detail of how these tests work.

Firstly, FTDNA don't keep track of these relationships (the Family Tree element applies only to autosomal results). So they have no way of differentiating a father-son pair from a son-father pair, or a pair of moderately distant cousins. Any SNP legitimately found in one test is never expected in another.

Secondly, there are about 10,000,000 base pairs called. Even things that should only happen one-in-a-million times will happen ten times in every BigY test. So it's hard to get rare things

Finally, we need to think about how quality improves with successive reads. An individual read may have a base quality and mapping quality of 33 (BQ=33;MQ=33; i.e. a 1-in-2000 chance of each being wrong, or a 1-in-1000 chance of either being wrong). How does that scale if we get more reads? If we have two reads, does that decrease to a one-in-a-million chance? That depends if some spurious random result is causing that base pair to be read poorly, or if it is a systematic issue with that base pair in the genome. Hence (although I don't know) I would expect base quality to improve with more reads, faster than mapping quality would.

So if someone has one read and their clade mate has 100 reads at the same SNP, should you believe that one read? That will depend on the base and mapping quality of those reads, whether they are in a reliable part of the chromosome, how many novel variants are present in each test, what you know about the relationship between those two people, and a whole bunch of poorly quantified uncertainties and "unknown unknowns" that we can't properly consider.

The kinds of uncertainties here can get complex, so a lay analogy might be with driving. Your chances of dying in a car crash are about 1-in-10,000 every year (cf. a quality score of 40). But your real chance of dying depends on a number of factors, including the number of miles you drive (cf. the number of base pairs in the test), the safety features in your car (cf. the number of reads in your test), how good a driver you are (cf. the base quality of your sample), how good the other drivers are around you (cf. the homopolymer rate and rate of STR-like structures) and whether you regularly drive on dangerous stretches of road (cf. the mapping quality of your sample). If the typical driver drives around 10,000 miles per year, that's a death rate of once per 100 million miles, so if each read had BQ=43,MQ=43 then you might expect one error every 10 BigY tests. But not everyone drives equally, and not all reads or base pairs are the same, and properly assessing a probability for any individual person/position that takes all of these factors into account can be very difficult.

So if you are going to go into low quality SNPs to identify whether they are correct or not, even in very closely related tests, then good luck to you. It's going to take a skill set better than mine to work out whether individual SNPs are believable at that level, and there are always going to be SNPs which are dubious. And obviously, if FTDNA are going to keep down the test price, they can't afford to pay a large team of people to sit over every SNP and make that kind of human judgement call. There's a balance between providing and checking useful data, and creating a test that's affordable. Compared to other companies, you've paid for a cheap test, so you can expect cheap-quality results.

But at the end of the day, individual SNPs don't really matter. They're only a means to an end. You haven't bought a test to find your SNPs, you've bought a test to help discover your family history. Unless you are particularly interested in uncovering very fine family structure, there's no need to go into the technical detail, accuracy and reproduction of every SNP. They serve two primary purposes: (1) to place you in the haplotree and (2) to be counted to provide the age of a relationship. Provided you have enough SNPs to do (1) and provided you are self-consistent in the way you count (2), whether or not individual SNPs are reproducible doesn't matter. What matters is ensuring you only use good-quality SNPs in your phylogeny and your calculations, not exactly what those SNPs are."

- Iain.
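The arithmetic in the quoted post can be checked with a short Python sketch. It assumes BQ and MQ are Phred-scaled, i.e. error probability = 10^(-Q/10), which agrees with the 1-in-2000 figure given for a score of 33. It also assumes errors are independent, which the post itself warns may not hold for mapping quality. The function and variable names are only for illustration.

```python
def phred_to_error(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def either_wrong(p_a, p_b):
    """Probability that at least one of two independent checks is wrong."""
    return 1 - (1 - p_a) * (1 - p_b)


# One read with BQ=33 and MQ=33.
p_bq = phred_to_error(33)
p_mq = phred_to_error(33)
p_read = either_wrong(p_bq, p_mq)

# Two reads, if (and only if) their errors are independent.
p_two_reads = p_read ** 2

# Rare events across a whole test.
base_pairs_called = 10_000_000   # figure quoted in the post
expected_rare_events = base_pairs_called * 1e-6

# Driving analogy: annual risk spread over annual mileage,
# then one base pair treated like one mile.
risk_per_mile = (1 / 10_000) / 10_000
errors_per_test = base_pairs_called * risk_per_mile

print(f"BQ=33 alone: about 1 in {1 / p_bq:.0f}")
print(f"BQ or MQ wrong on one read: about 1 in {1 / p_read:.0f}")
print(f"Two independent reads both wrong: about 1 in {1 / p_two_reads:.0f}")
print(f"One-in-a-million events per test: about {expected_rare_events:.0f}")
print(f"Tests per error at the per-mile rate: about {1 / errors_per_test:.0f}")
```

The independence assumption is the weak point: if a position is systematically hard to map, extra reads will not square the error away, which is the reason the post gives for expecting base quality to improve with more reads faster than mapping quality.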

Posted: Sun Nov 12, 2017 3:29 pm
http://cruwys.blogspot.ie/2017/11/ftdna ... netic.html
Last edited by dartraighe on Mon Nov 13, 2017 6:21 pm, edited 1 time in total.

Posted: Sun Nov 12, 2017 6:38 pm
I have 37 YSNPs under U106, and I know for sure that I am positive for all the YSNPs above U106, so I do not need to know about them. I am interested in the accuracy of the YSNPs under U106, and I did not need a VCF with a load of unreliable YSNPs. I would be happy enough if FTDNA gave me a file with only the 37 YSNPs in it. Those YSNPs under U106 are important to me, and I think everyone else in U106 feels the same.

FTDNA does not help us by selling 12-marker YSTR tests, but they could help us all if they promoted the Big Y to everyone in their database. They should now sell the Big Y as a standalone test and as their main product. This is not 2006. Some of us have been waiting for years for other testers to catch up, and FTDNA does not help. Anyone who is serious about YDNA testing and matches should know that YSNP tests are the way to go. A lot of R1b testers have similar YSTR haplotypes, and M269 is just not good enough. This is Stone Age status, and the majority of us R1b in western Europe are M269.

The Big Y upgrade will be a waste of time and effort if testers do not discover any new YSNPs, and some will probably lose a few. Is that what FTDNA calls progress?
Last edited by dartraighe on Tue Nov 14, 2017 10:21 pm, edited 1 time in total.

Posted: Sun Nov 12, 2017 11:23 pm
This was posted on the U106 forum today.

A few tidbits.
"1. FTDNA is planning to generate TMRCAs for the Y SNPs on their Y SNP tree, but it likely won't happen until next year at the earliest.
2. In January you will be able to buy the BigY as a first time purchase at FTDNA without needing to buy a YSTR test first.
3. FTDNA will be generating STR data for a minimum of 300 or so STRs and possibly as many as 500 or so STRs for all people who buy the BigY. They will generate this data for everyone who has previously done the BigY. Exactly when this data will appear on our BigY results page at FTDNA is uncertain, but it will probably be sometime next year. The STR results will be given to everyone who has done the BigY, no matter when they ordered the test. FTDNA has hired Dr. Caleb Davis, who has a PhD in bioinformatics, to head up the research into which STRs can be reliably called and which cannot."
Last edited by dartraighe on Fri Nov 17, 2017 1:47 pm, edited 1 time in total.

Posted: Tue Nov 14, 2017 7:38 am
I think that YSTR testing can be consigned to history. Having lots of 12 and 25 marker YSTR matches does nothing for me or for a lot of other testers. There are thousands of testers in the FTDNA database who never went past this stage of testing, and they may as well throw all of those YSTR results in the bin, especially those who are not willing to SNP test. NGS is the way to go, and YSNPs are more reliable than YSTRs for anyone who wants to verify their relatedness to another DNA tester. There are lots of examples of close 67 marker matches not being related within the last 2,000 years, so why even bother wasting money on this test? For 49 dollars Yfull will extract 400 or more YSTRs from your NGS .BAM file, and a lot more YSNPs than will be reported by FTDNA, so why pay lots of dollars for them?

I compared my YSTR results from Yfull with my FTDNA 12, 25, 37, 67 and 111 marker results. They match exactly at the first 12 and at 25, then 35/37, 58/67 and 101/111, so I would not have needed a 12 or 25 marker test. If I were a newbie I would wait until January and buy the Big Y as my first and only test; everything one needs for a YDNA research project will be in it.
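A comparison like the one above can be done by hand, but here is a rough Python sketch of the counting. It assumes each set of results is a simple mapping from marker name to repeat count and counts only exact matches, which is how the 35/37 style figures are read here. The repeat values below are invented for illustration and are not my results.

```python
def count_matches(results_a, results_b, panel):
    """Return (matching, compared) for panel markers present in both results."""
    shared = [m for m in panel if m in results_a and m in results_b]
    matching = sum(results_a[m] == results_b[m] for m in shared)
    return matching, len(shared)


# Invented repeat counts for a handful of markers.
ftdna = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11, "DYS388": 12}
yfull = {"DYS393": 13, "DYS390": 23, "DYS19": 14, "DYS391": 11, "DYS388": 12}
panel = ["DYS393", "DYS390", "DYS19", "DYS391", "DYS388"]

matching, compared = count_matches(ftdna, yfull, panel)
print(f"{matching}/{compared}")
```

Markers missing from either set are left out of the comparison rather than counted as mismatches, since an STR that could not be called from NGS data is not evidence of a difference.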



FTDNA's new Big Y matching threshold is 30 YSNPs which leaves me with just one match. That is a lot better than having 100 Neolithic matches.
