How to Mine 23andMe Raw Data with Python: Part 1
Background:
Last Christmas, my company offered a substantial discount on 23andMe kits. Since I had already gotten myself sequenced about a year earlier and found the data interesting, I decided to purchase kits for my family. Their data have recently been posted and I collected their raw data files.
I enjoy looking at the metrics 23andMe provides, but I am curious as to what I would find if I mined the data myself. I've spent maybe an hour or so everyday for the past week exploring the data of my family members.
Getting Started:
I've uploaded my 23andMe python module to github so that folks can download it and play with their own data -- trust me it's interesting!
https://github.com/ngopal/23andMe
The classes and functions in the module can easily be used in scripts, but I built it to be used interactively in a python interpreter.
If you have data to play with pull the repository down and follow along:
>> import 23andMe as Me
>> files = ['Nikhil.txt', 'Person1.txt', 'Person2.txt']
>> Dataset = Me.ParseToDict(files)
The code above should be straight-forward. The import statement imports my module into the environment. The files data structure is a list with the name/location of the raw data files. The Dataset data structure is basically an initialized class.
If you look at the source code, you may notice another class called ParseToDB. Initially, I loaded each raw data file into a sqlite database as a table. However, python seems to have some sort of bottleneck issue with the sqlite3 module -- the query time for a simple join command is absolutely unbearable. Thus, I reverted back to using python dictionaries. However, I left the functionality in the script with the hope that someone will find it handy.
I was genotyped on 23andMe's V2 chip where as my family was genotyped with the new V3 chip. Consequently, each member of my family ends up with 73% more data than me (quickly estimated via raw data file sizes). To compare myself to each member, I need to make sure I am comparing a SNP that everyone in the comparison set has in common. Enter the intersection function.
>> Dataset.Intersection
This returns a list of SNPs that we all have in common. In my case, this is on the order 540,000 SNPs. My raw data file is the limiting factor as everyone else has close to a million genotyped SNPs in their raw data files.
Let's Start Playing:
To start scratching the surface, I performed a quick search for identical genotypes. I came up with these results:
Person1 and I have 67.5% identical genotypes.
Person1 and Person2 have 67.8% identical genotypes.
Person2 and I have 60.8% identical genotypes.
Keep in mind that these numbers are derived from whatever SNP data I've extracted.
23andMe reports 84% for person1 and 78% for person2 in terms of "percent similarity" (each being compared to me).
Since I am playing with a relatively small number of files here (three with myself included), my options are limited. A good, large dataset can go a long way.
Moving Along:
There are a number of published SNP association studies available via NCBI and NHGRI. Let's pick a SNP and see what my analysis says and compare that to my 23andMe report.
How about hypertension? There is a study published here by Padmanabhan et al. It was published October 28, 2010 in the journal PLoS Genetics.
I've sifted through the papers and extracted the RSids, the associated nucleotide base, and the p-values for the trait above.
Hypertension
RSid: rs13333226
Base: A
P-value: 4x10^-11
Before I go any further, a p-value is basically a confidence value. The lower this number, the better. It essentially reflects the probability we would reach this same conclusion by chance. 4x10^-11 is a very small number -- so we can be confident about this association.
I used the searchSNP function in my module to go through all of the loaded files and print data to screen if it is available:
>> Datasets.searchSNP('rs13333226')
This is the output (a list of tuples):
[('Nikhil.txt', 'GG'), ('Person1.txt', 'GG'), ('Person2.txt', 'GG')]
We are all homozygous G for this SNP and none of us seem to be predisposed to hypertension. The 'A' base is the base that would predispose us.
23andMe does report traits for hypertension (high blood pressure). Thankfully, they also list the studies they used to make their conclusions.
23andMe looked at SNPs rs3754777 and rs5370. I'm going to ignore rs5370 at the moment because it has to do with hypertension in physically unfit individuals. The SNP I just pushed through python does not take this into account...
However, I am going to consider rs3754777 because it has to do with a study on hypertension in general. Please note that it is a different SNP from a different study. According to my 23andMe report:
I am 'CC' which basically means, "Subjects had no increase in blood pressure."
Person1 and Person2 are 'CT' which means, "Subjects have an average increase of about 2mm Hg SBP and 1mm Hg DBP."
I'm 2/2 in finding associate evidence that I am not genetically predisposed to hypertension. However, the other two family members seem to be 1/2.
Wait a second. What does this mean? Who should I trust?
About Confidence:
23andMe uses a "4-star" scale to report their confidence. These stars correspond to how many independent experiments with large sample sizes have been conducted and resulted in similar findings. 4-stars is the highest score.
The 23andMe hypertension report is rated at 3-stars ("Preliminary Research. More than 750 people with the condition were studied, but the findings still need to be confirmed by the scientific community in an independent study of similar size.").
The study I chose had sample size of 1,621 Swedish cases and 1,699 Swedish controls. This would put the SNP I evaluated at about their highest confidence level (barring the part of it being confirmed in an independent study).
By these standards, I think I can trust my analyses and say that neither I, nor the two other family members analyzed, are genetically predisposed to hypertension. However, I would be much more confident in my evaluation if an independent study is conducted of rs13333226 and reports similar findings.
Hopefully, I will be able to do some more interesting analytics next week.