How to Mine 23andMe Raw Data with Python: Part 2
Creating A Family Tree With 23andMe Raw Data:
If you missed part one, here is the link. It's probably a good idea to peruse that post before moving onto this one. In this post I am going to see if I can create a family tree using the same 23andMe raw datasets I used in my last post.
Again, the module is freely available for download on github.
A Note About The Algorithm:
Computer Scientists call this hierarchical clustering. Biologists know this as creating a phylogenetic tree. These are essentially the same thing. However, I am going to simplify and refine the algorithm to suit my needs. It may even be apt to call this a "quick and dirty," or "pseudo" implementation. Perhaps I will program a full implementation in the future.
Setting Up Metrics And Controls:
In my last post, I used % identity between individuals to determine how closely related one person is to another. Although it is a rather casual measure of similarity, I am going to continue using this metric because it is easy to implement. I encourage you to create and implement your own metrics as this really helps you get a better understanding of your data.
Since this is the second time I've used this metric, I went ahead and created a function for it. Continuing from the coding example in my last post:
>> Datasets.identity()
>> len(Datasets.identity(Datasets.intersectionData['Nikhil.txt'],Datasets.intersectionData['Person3.txt']))
>> 267194
I've also added a function to find the genotypes that are not both identical, but only share one identical SNP.
>> Datasets.halfIdentity()
Since I am creating a tree, it is probably a good idea to use a few control data -- these controls should be on the outermost branches since they will be the least related to my folks and I. I've decided to use two -- Mikolaj Habryn (available on SNPedia) and Manu Sporny (yes, the guy who published his 23andMe data on github). The SNPedia link contains a few more sample datasets (and there are several more scattered across the internet). I chose to use Mikolaj's published data because it has been said that he was the first person to make his 23andMe raw data publicly available. I'd like to ensure handling "old" raw data won't be a problem.
Building The Tree:
I've added the phylogeny function to my ParseToDict class.
>> Datasets.phylogeny()
The output attempts to simulate a phylogenetic tree. The tree is created by comparing each of the datasets to each other and counting the number of identical genotypes that are shared between them. These results are organized and printed to a list (as shown below).
>> [[[[['Person2.txt', 'Person1.txt'], 'Nikhil.txt'], 'Person3.txt'], 'ManuSporny.txt'], 'MikolajHabryn.txt']
The closer the files are coupled together, the higher level of similarity between them. The reason I call this a quick and dirty implementation is that similarity is calculated with respect to the two most similar raw data inputs. Person1 and Person2 are most closely related. I am second most closely related (to Person1 and Person2). Person3 is third most closely related (to Person1, Person2, and I). I'm sure you get the idea. Keeping this in mind, the tree looks exactly as I expect it to.
Cool! What's Next?:
Now we can ask an interesting question. Can we plot the phylogenetic tree with the person's genotype "tupled" with their name -- perhaps revealing a pattern of inheritance?
To do this I've added an optional argument for the phylogeny function, rsid. I feel like randomly choosing 'rs6904200' for analysis today.
>> Datasets.phylogeny('rs6904200')
>> [[[[[('Person2.txt', 'AG'), ('Person1.txt', 'AG')], ('Nikhil.txt', 'AG')], ('Person3.txt', 'AG')], ('ManuSporny.txt', 'GG')], ('MikolajHabryn.txt', 'AA')]
It seems all of my family members and I have exactly the same genotype for this SNP (heterozygous 'AG'). The controls are homozygous for 'GG' and 'AA'. Thus far, the pattern of inheritance for this SNP merely illustrates the fact that the controls are not part of my family and that everyone in my family share the same genotype for this SNP.
Not So Fast...:
Please remember that a shared genotype between individuals does not guarantee that the individuals are related -- especially when only viewing one SNP!
To illustrate why, let's look at one of the SNPs I mentioned in my last post, rs3754777.
>> Datasets.phylogeny('rs3754777')
>> [[[[[('Person2.txt', 'CT'), ('Person1.txt', 'CT')], ('Nikhil.txt', 'CC')], ('Person3.txt', 'CC')], ('ManuSporny.txt', 'CT')], ('MikolajHabryn.txt', 'CC')]