mthap: an mtDNA haplogroup analysis tool

Any discussions regarding mt-DNA markers, results or questions.

Posts: 48
Joined: Sat Mar 17, 2012 5:38 am
Posted: Tue Apr 10, 2012 1:02 pm
OK, we have a first cut of mthap with PhyloTree Build 14 support, but this is just the first version that seems to mostly work. There are likely many bugs that we'll need to work out, so please try it out and let me know of anything that seems odd. There are 1030 new haplogroups in the new build, so making sure everything works will be a challenge. Be sure to cross-check with the PhyloTree website to make sure that what mthap comes up with is reasonable.

http://vps1.jameslick.com/dna/mthap-new/mthap.cgi

Known problems: Builds 1-2 are completely broken. Builds 3-5 are broken for some groups. Builds 6-13 seem to be OK. Build 14 has a quick hack for the new .XC insertion notation, and haplogroups L3e1c and M38e are not working properly yet. There are a few other issues, mostly minor or cosmetic. If you find anything else, let me know. I'll probably go through a few updates over the next several days while we get the kinks out.

Posts: 324
Joined: Thu Mar 15, 2012 1:14 am

YDNA:
R1b-Z12*
MtDNA:
I3b (FMS)
Posted: Tue Apr 10, 2012 2:30 pm
Using my FASTA file, mthap predicts my "1)" result spot on and as a "good match".

Using 23andMe data it gave me 4 "imperfect matches" and my correct hg appears as "2)". Also, it gives me two "3)" results -- not sure if that's expected behavior or not.

Posts: 48
Joined: Sat Mar 17, 2012 5:38 am
Posted: Tue Apr 10, 2012 2:51 pm
Yes, it is normal to have multiple matches with the same rank. That just means that there is not a significant difference between the scores, so they are tied for 3rd place in your example.
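
As a rough illustration of how tied ranks like that can arise, here is a minimal sketch. It is not mthap's actual scoring code: the function name, the tolerance, the haplogroup labels and the scores are all made up for the example.

```python
def rank_with_ties(scored, tolerance=0.01):
    """Rank (name, score) pairs best-first; near-equal scores share a rank."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    ranked = []
    for index, (name, score) in enumerate(ordered):
        if ranked and abs(ranked[-1][2] - score) <= tolerance:
            rank = ranked[-1][0]   # no significant difference: reuse the previous rank
        else:
            rank = index + 1       # otherwise use the 1-based position in the list
        ranked.append((rank, name, score))
    return ranked

# Hypothetical labels and scores, for illustration only.
example = [("hgA", 0.95), ("hgB", 0.90), ("hgC", 0.80), ("hgD", 0.80)]
for rank, name, _score in rank_with_ties(example):
    print(f"{rank}) {name}")
```

With these made-up scores the last two entries are both listed as "3)".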

Here are some interesting things about the new build:

Build 14 has 3550 haplogroups, up from 2520 in Build 13, an increase of 1030 haplogroups, or about 41%. This means that a lot of you will have a new haplogroup in this new build.
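
For anyone who wants to check the arithmetic, a quick sketch using only the two haplogroup counts quoted above (the variable names are just for illustration):

```python
build13 = 2520  # haplogroups in Build 13, as quoted above
build14 = 3550  # haplogroups in Build 14, as quoted above

added = build14 - build13
growth_percent = 100 * added / build13

print(added)                     # 1030
print(round(growth_percent, 1))  # 40.9
```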

rCRS is now H2a2a1; previously it was H2a2a.

Some haplogroups have C repeat insertions of widely varying length. Previous builds made an attempt to get the right number of insertions for haplogroups that have them, but there's often too much variation to accurately reflect in the haplogroup tree. Build 14 introduces a new .XC notation which indicates an arbitrary number of insertions (e.g. 573.XC instead of 571.1C 571.2C ... or 573.1CC ...). (As previously mentioned, mthap doesn't fully support this yet.)
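
One way the .XC notation could be handled is sketched below. It reflects a reading of the notation as described above, not mthap's implementation, and the function name `matches_xc` is hypothetical. The idea is that a tree marker such as 573.XC is satisfied by any observed insertion of one or more C's at that position.

```python
import re

def matches_xc(tree_marker, observed_markers):
    """Return True if any observed insertion satisfies a '.XC' tree marker."""
    parsed = re.fullmatch(r"(\d+)\.XC", tree_marker)
    if parsed is None:
        raise ValueError(f"not an .XC marker: {tree_marker!r}")
    position = parsed.group(1)
    # Accept forms such as 573.1C, 573.2C or 573.1CC: same position, only C's inserted.
    pattern = re.compile(rf"{position}\.\d+C+")
    return any(pattern.fullmatch(marker) for marker in observed_markers)

print(matches_xc("573.XC", ["73G", "573.1C", "573.2C"]))  # True
print(matches_xc("573.XC", ["73G", "263G"]))              # False
```

Using a full match rather than a prefix match keeps an insertion at, say, position 5730 from being mistaken for one at 573.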

This build also marks the first step in transitioning from the Revised Cambridge Reference Sequence (rCRS) to the new Reconstructed Sapiens Reference Sequence (RSRS) which better represents a common ancestor rather than an arbitrary human rather far down the phylogenetic tree. After the Build 14 issues are worked out, I'll be looking at how to support this new convention.

What else have you found that's different?
Posts: 150
Joined: Fri Mar 30, 2012 4:31 am

YDNA:
G P303+
MtDNA:
C4a1
Posted: Tue Apr 10, 2012 2:53 pm
Hi James,
Thanks a bunch! I tried it and it works just fine with my FASTA file. Because my classification has not changed (still C4a1), I guess I may not be a good test subject.
By the way, do you plan to switch the interface over to showing the path from RSRS instead of rCRS?

Edit: Looks like you just answered my question!

Posts: 324
Joined: Thu Mar 15, 2012 1:14 am

YDNA:
R1b-Z12*
MtDNA:
I3b (FMS)
Posted: Tue Apr 10, 2012 2:57 pm
I should add that my hg changed (i.e. has been refined) in Build 14.

Posts: 48
Joined: Sat Mar 17, 2012 5:38 am
Posted: Tue Apr 10, 2012 3:03 pm
The bulk of the changes appear to be in European haplogroups, as the vast majority of new sequences are from FTDNA, whose customer base is largely those of European ancestry. Those with Asian, African or Native American haplogroups have less chance of any change in designation. That said, my haplogroup did not change either, though that is to be expected as I have no FGS matches and only one private mutation.

RSRS support will require extensive changes, but it is something I plan to work on as it would make some things much more logical in the long run.
Posts: 14
Joined: Sun Apr 01, 2012 6:48 pm
Posted: Wed Apr 11, 2012 12:31 pm
Thank you, James, for your continuous work! The move to RSRS is a logical path and should be made by everyone.
Here is a result from mthap version 0.18pre2 (2012-04-10); haplogroup data version PhyloTree Build 14 (2012-04-05) +mods for an actual 23andMe v3 raw data test file:
Found 2441 markers at 2440 positions covering 14.7% of mtDNA.
Markers found (shown as differences to rCRS):
HVR2: 73G 150T 263G
CR: 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G
HVR1: 16263G 16270T

Best mtDNA Haplogroup Matches:
1) U5b1b1
Defining Markers for haplogroup U5b1b1:
HVR2: 73G 150T 263G
CR: 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G
HVR1: 16189C 16192T 16270T

Marker path from rCRS to haplogroup U5b1b1 (plus extra markers):
H2a2a1(rCRS) ⇨ 263G ⇨ H2a2a ⇨ 8860G 15326G ⇨ H2a2 ⇨ 750G ⇨ H2a ⇨ 4769G ⇨ H2 ⇨ 1438G ⇨ H ⇨ 2706G 7028T ⇨ HV ⇨ 14766T ⇨ R0 ⇨ 73G 11719A ⇨ R ⇨ 11467G 12308G 12372A ⇨ U ⇨ 3197C 9477A 13617C 16192T 16270T ⇨ U5 ⇨ 150T 7768G 14182C ⇨ U5b ⇨ 5656G ⇨ U5b1 ⇨ 16189C ⇨ U5b1(T16189C) ⇨ 12618A ⇨ U5b1b ⇨ 7385G 10927C ⇨ U5b1b1 ⇨ 16263G

Imperfect Match. Your results contained differences with this haplogroup:
Matches(25): 73G 150T 263G 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G 8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C 14766T 15326G 16270T
Extras(1): 16263G
Untested(2): 16189 16192

For this raw data there is no change in haplogroup assignment with the update.
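
The Matches / Extras / Untested breakdown in the report above can be reproduced with simple set operations. This is only a sketch under assumptions, not mthap's code: the function names are hypothetical, and the set of tested positions is approximated by the positions of the markers found, since the full list of 2440 chip positions is not shown in the report.

```python
import re

def position(marker):
    """Extract the numeric position from a marker such as '16270T'."""
    return int(re.match(r"\d+", marker).group())

def compare(found, defining, tested_positions):
    """Split markers into matches, extras, mismatches and untested positions."""
    found, defining = set(found), set(defining)
    matches = found & defining
    extras = found - defining
    missing = defining - found
    untested = {position(m) for m in missing if position(m) not in tested_positions}
    mismatches = {m for m in missing if position(m) in tested_positions}
    return matches, extras, mismatches, untested

# Markers copied from the report above.
shared = ("73G 150T 263G 750G 1438G 2706G 3197C 4769G 5656G 7028T 7385G 7768G "
          "8860G 9477A 10927C 11467G 11719A 12308G 12372A 12618A 13617C 14182C "
          "14766T 15326G 16270T").split()
found = shared + ["16263G"]
defining = shared + ["16189C", "16192T"]
tested = {position(m) for m in found}  # assumption: stands in for the real chip coverage

matches, extras, mismatches, untested = compare(found, defining, tested)
print(len(matches), sorted(extras), sorted(untested))
# 25 ['16263G'] [16189, 16192]
```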
Posts: 43
Joined: Wed Mar 14, 2012 3:28 pm
Location: Edmonton, AB
YDNA:
R-U106* L199+
MtDNA:
U5a1a1 152C
Posted: Thu Apr 12, 2012 2:29 am
Looks good so far:
mthap version 0.18pre2 (2012-04-10); haplogroup data version PhyloTree Build 14 (2012-04-05) +mods
raw data source FASTA63831.fasta (16KB)

FASTA format was uploaded. Based on the markers found, assuming the following regions were completely sequenced: HVR1 (16001~16569) HVR2 (1~574) CR (575~16000).

Found 16569 markers at 16569 positions covering 100.0% of mtDNA.

Markers found (shown as differences to rCRS):

HVR2: 73G 152C 263G (315.1C)
CR: 750G 1438G 1700C 2706G 3197C 4769G 5495C 7028T 8860G 9477A 11467G 11719A 12308G 12372A 13617C 14766T 14793G 15218G 15326G 15924G
HVR1: 16223T 16256T 16270T 16294T 16399G

Best mtDNA Haplogroup Matches:

1) U5a1a1(T152C)

Defining Markers for haplogroup U5a1a1(T152C):
HVR2: 73G 152C 263G
CR: 750G 1438G 1700C 2706G 3197C 4769G 5495C 7028T 8860G 9477A 11467G 11719A 12308G 12372A 13617C 14766T 14793G 15218G 15326G 15924G
HVR1: 16256T 16270T 16399G

Marker path from rCRS to haplogroup U5a1a1(T152C) (plus extra markers):
H2a2a1(rCRS) ⇨ 263G ⇨ H2a2a ⇨ 8860G 15326G ⇨ H2a2 ⇨ 750G ⇨ H2a ⇨ 4769G ⇨ H2 ⇨ 1438G ⇨ H ⇨ 2706G 7028T ⇨ HV ⇨ 14766T ⇨ R0 ⇨ 73G 11719A ⇨ R ⇨ 11467G 12308G 12372A ⇨ U ⇨ 3197C 9477A 13617C 16192T 16270T ⇨ U5 ⇨ 14793G 16256T ⇨ U5a ⇨ 15218G 16399G ⇨ U5a1 ⇨ 16192C ⇨ U5a1(T16192C) ⇨ 1700C ⇨ U5a1a ⇨ 5495C 15924G ⇨ U5a1a1 ⇨ 152C ⇨ U5a1a1(T152C) ⇨ (315.1C) 16223T 16294T

Good Match! Your results also had extra markers for this haplogroup:
Matches(27): 73G 152C 263G 750G 1438G 1700C 2706G 3197C 4769G 5495C 7028T 8860G 9477A 11467G 11719A 12308G 12372A 13617C 14766T 14793G 15218G 15326G 15924G 16192C 16256T 16270T 16399G
Extras(2): (315.1C) 16223T 16294T


My private 16223T is actually a back-mutation from the original at hg R, and someone else in the U5a1a1 FGS Project shares 152C 16294T, so hopefully that will make it into a future build if the other donor authorized Dr. Behar to use his/her sequence in this paper. I'm anxiously waiting for GenBank to release their latest dataset to find out if it's actually there. The phylogeny provided by http://www.mtdnacommunity.org currently lists this haplogroup [U5a1a1 T152C!] as "TBD" with no associated GenBank accession numbers. T152C! of course indicates that my 152C is a back mutation from the RSRS (the original C152T mutation took place at hg L2'3'4'5'6, so here's to vindication :D ), and of course there's also the C16192T / T16192C back-mutation from hg U5 and between hgs U5a1 and U5a1a. What can I say, my maternal line prefers the old ways. :lol:
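
To show why a back mutation such as T16192C drops out of the list of differences from rCRS, here is a small sketch that walks part of the marker path from the report above. It is an illustration only, not how mthap works internally: `apply_path` is a hypothetical helper, and the reference base at 16192 is taken from the C16192T / T16192C notation.

```python
import re

def apply_path(steps, reference):
    """Accumulate differences from the reference along a list of mutation steps."""
    differences = {}
    for markers in steps:
        for marker in markers.split():
            pos, base = re.fullmatch(r"(\d+)([ACGT])", marker).groups()
            pos = int(pos)
            if reference.get(pos) == base:
                differences.pop(pos, None)  # back mutation: same as the reference again
            else:
                differences[pos] = base
    return differences

# Steps copied from the U -> U5a1(T16192C) part of the marker path above.
steps = [
    "3197C 9477A 13617C 16192T 16270T",  # U -> U5
    "14793G 16256T",                     # U5 -> U5a
    "15218G 16399G",                     # U5a -> U5a1
    "16192C",                            # U5a1 -> U5a1(T16192C)
]
reference = {16192: "C"}  # only the position needed for this example

diffs = apply_path(steps, reference)
print(16192 in diffs)  # False
print(sorted(diffs))   # [3197, 9477, 13617, 14793, 15218, 16256, 16270, 16399]
```

This is consistent with 16192 being absent from the HVR1 defining markers listed for U5a1a1(T152C).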

Posts: 116
Joined: Tue Mar 20, 2012 3:38 am
Posted: Thu Apr 12, 2012 1:05 pm
Alpeu wrote: I hope this is one of the major steps to a fine graded tree and the resolution will be enough for all major uses in the future. Look at this statement in Behar, Oven et al. (2012):
Approaching a Perfect Phylogeny
[...] First, an almost final level of resolution for a number of western Eurasian clades was achieved, and the nodes of ancestral and derived haplogroups are often differentiated by a single mutation.


I expect that Build 14 will include most of the terminal nodes for younger haplogroups such as the major daughters of H, but for the older haplogroups, e.g., daughters of U, I think there is a lot of structure that remains to be discovered. It appears that a large portion of the FTDNA customers did not authorize the use of their data, so I believe we still have many unnamed nodes in the data in several of the U projects.

Vince - about 152, it seems to be a frequent mutation site, so I'm not sure how reliable it is for specifying a new haplogroup. I'm guessing that is why it has not been included in Build 14 or in earlier versions.

Gail

Posts: 61
Joined: Wed Mar 14, 2012 4:29 pm
Posted: Thu Apr 12, 2012 1:35 pm
Hi Gail and all,

I don't like to be disagreeable (hah!), but I think there's still a lot of structure to be discovered. If you go back to Behar's original K tree from 2007, there were NO sequences on it from Ireland. From FTDNA, there are now lots of Irish sequences, and their haplotypes are getting somewhat predictable. But I have several K Project members from the Middle East. Their sequences tend to be unique each time. Had one from Turkey yesterday whose nearest match was from Italy. The Italian one has three additional coding-region mutations, which represents thousands of years. Think of all the structure to be found between those two. I think it's way too early to close up my discovering-new-subclades business. There is a limit to the structure; if you test everybody, the structure ends - until you test their newborns.

Not sure why you say 152C has not been included in the PhyloTree. It certainly is in the definitions of several K subclades, usually in conjunction with other mutations. "Recurrent" does not mean "unstable." (Trying out that phrase.)