Every now and then when I catch up with some of my colleagues working at other companies or groups, I hear that they are absolutely miserable in their data science jobs! And it seems like most of the time, it isn’t the nature of the job itself that is the cause of misery, but some of the situations that arise during the course of the job.
Below are two specific anti-patterns I’ve seen in the context of analytics-related work. I’ve proposed some remedies as well, but obviously YMMV depending on your specific circumstances.
1. Data Science As An Insurance Policy
This is a scenario where you are asked to do some data science work to help “verify” a decision that has already been made. The typical offender is a “decision-maker” of some sort. In my eyes, this basically amounts to an insurance policy for the decision-maker. Here’s why:
When the analysis aligns with their decision, they have data to support their choice in case their decision pans out the way they expected.
When the analysis doesn’t align with their decision, they can say that the data quality was poor, the analysis was faulty/inadequate, [insert reason here], etc. In short, they may dismiss the analysis as non-helpful.
Remedy: There is likely more than one remedy for this. My typical playbook is to ask up front about the actions the decision-maker will take based on the analysis. If there is no measurable impact to the business, the analysis will likely be de-prioritized on my todo list.
In fact, this leads me to the second anti-pattern.
2. No Attribution for Insights
Imagine you did some excellent analytical work you were asked to do. In fact, some pivotal, positive business decisions were made as a result of that analysis. Will people remember you were the one that did that analysis 6 months later? As a mini thought experiment, have you heard the statistic that you have a higher chance of being in a car accident than in an airplane crash? Do you remember the exact statistic? Do you remember where that information came from?
My point is that the insight itself is more memorable than the source that generated it. To further complicate the situation, imagine your work environment is fast-paced and that the decision-maker(s) using the information is focused on demonstrating impact/relevance and moving forward as quickly as possible. In two weeks time, will they have a reason to remember the person behind the new insight? How about after 6 months?
Remedy: My proposed remedy for this is directly asking for appropriate recognition, giving talks around your company, or incorporating the "story" of how you discovered the insight into those talks, so your colleagues know you are the source of the insight.
Based on polling my own professional circle, it seems most analysts either will encounter these situations or have encountered them already. These are obviously challenging situations, and hopefully these proposed remedies help!
I’ve been asked a few times to speak about my experiences, specifically about my journey from academic researcher to data scientist. I hope you read this, perhaps find inspiration to make the jump yourself, and ideally avoid the traps that I fell into along the way. Although I share my experiences as an interviewer later in the post, I’m describing my overall experience over the years and do not cover company-specific details.
Who This Is For
Academics contemplating making the switch to industry, and anyone interested in learning about landing their first data science role in general.
Why I Left Academia
I’ll keep this section brief. I actually enjoyed my life as an academic, but in short, I felt I could be making more meaningful contributions outside of the ivory tower.
In a sense, academia can also be viewed through the lens of a business model, although many professors don’t look at it this way. Papers → Grants → Overhead for university. Institutions that provide grants also need to “invest” funds among a list of potential researchers they think will eventually yield substantial positive scientific returns (i.e. scientific breakthroughs). One of the primary ways these grant institutions seem to evaluate applicants is through their expertise in the research field, and that is typically demonstrated via past grants and research publications. This is why academic departments use publication records (and ultimately grant records) to assess tenure.
I’ve previously discussed the pitfalls of focusing too heavily on short-term productivity metrics like paper publications. Even Nobel laureates have spoken out against the use of impact factors.
Combine all of this with decreased year-over-year funding for the sciences for three political administrations in a row, and we have an environment where it behooves researchers to seek new routes.
There seem to be endless articles about this topic, and I am actually not writing this blog post to highlight that in particular.
This post, actually, is about data science 🙂
What I Enjoy About Working For A Company
I can work on projects that are unique to the company I’m working at – unique data, unique resources, unique problems to solve, unique opportunities. Although I personally weigh this aspect quite heavily when I’m searching for a new role, I realize that not everyone shares the same interest or prioritization.
Many of the questions I’ve received about being a data scientist have to do in some form or another with compensation. And yes, compensation can be leaps and bounds better than a postdoc salary. A ballpark data scientist salary range would be 2-3x a postdoc salary, not including signing bonuses, stock, retirement matching, and additional benefits (transportation, discounts, etc.), all of which factor into overall compensation. There are plenty of resources that collate this information – glassdoor.com is a reasonable place to start.
My Skills and Experience Prior To Switching
Another reason I’m writing this is because I’ve seen a few articles about how to “get into” data science, but those seem to be intended for professionals who work with data (software engineers, data engineers, data analysts, etc). This post will likely be most helpful to those who are in the same situation I was in, where I had the majority of the skills and knowledge required for the role, but I was missing a couple of key attributes.
When I made the switch from Illumina to graduate school, I essentially traded Python 2 for R, Javascript, and Java. Since most of my apps were biological data visualizations, I made thorough use of D3.js – so much so that I did a video set for O’Reilly, helping those with a background in Python learn how to use Javascript and D3.js.
A very quick note on writing technical books or programs: I’ve done two of these now, and I can say that they obsolesce very quickly! If I do more technical books or programs in the future, they will certainly be focused on first principles, business cases, and methodology, rather than technology.
ML/AI/Stats are skills I learned through classes over the years, from high school through graduate school. I’ve also competed in a handful of machine learning bake-offs, and have had the privilege of working with ML/AI researchers who collectively have taught me quite a bit!
Naturally, I also had a background in biology, which isn’t so useful for general data science so I won’t harp too much on it. Suffice it to say, there are some very interesting computer applications/methodologies in the field of biomedicine, so coding on those projects helped me learn and grow my software engineering skills.
What Helped Me Along The Way
Insight Data Science:
If you fit the same profile I did prior to landing a data science role, then perhaps like me, it could be worth investing in SQL skills, and knowledge of business models. A book I found to be enormously helpful for the “business side” was Lean Analytics. I feel this book provides the scaffolding/roadmap necessary to “speak the language,” if you will (KPIs, SaaS, CLV, churn, etc.). This might seem unnecessary, but I assure you it is vital. Put yourself in the shoes of a hiring manager or employer – if you ran a business, would you want to hire someone with little to no business training to run an essential part of your company?
As far as SQL goes, it seems every business I’ve ever interacted with uses SQL in some capacity, and as a data scientist, you need to be able to retrieve, format, transform, and clean data. Bioinformatics may be somewhat unique in that it is an NLP-heavy domain, working primarily with text data sometimes stored as inconsistent flat files, and because of this, SQL was not a skill I was accustomed to using.
There are obviously more data science incubators/camps than just Insight Data Science and Thinkful Inc, but I’m not as experienced with the others. Off the top of my head, in the Seattle area, we also have Galvanize, General Assembly, and Metis. If you are considering one of these, please be sure to ask about their business model, how they make money, and whether you are required to “find a job” through them. One of the reasons I like Insight is because they work like recruiters (a bit more about that below) and they seem to value developing and supporting a strong data science community more than making a profit.
Recruiters (mutually incentivized to find you a good fit)
If you are looking for a data science role, it helps to work with a recruiter. I actually built quite a decent relationship with a couple of recruiters, each of whom provided great advice and support along the way. Anecdotally, the incentivization structure seems quite healthy – recruiters get paid if they place you into a role you love, perform well in, and stay in for a guaranteed period of time. The company you are hired at makes the recruiting fee it paid back multi-fold in productivity and efficiency. The downside is that you will find a large number of recruiters who will try to fit you into a role you are not suitable for, at borderline harassment levels.
Note: Since some people have asked, one of the recruiters I had a wonderful experience with was Richard Marion from CyberCoders.
Another Note: I registered for a few recruiting services, such as Hired, TheLadders, and LinkedIn Premium. I didn’t have much luck with Hired or TheLadders, but LinkedIn Premium seemed to do reasonably well at connecting me with the right recruiters. More importantly than that, LinkedIn Premium provides some interesting information on which roles I may be a particularly good fit for and why. Personally, I think this is worth the value of the free trial 😀
Practice, Practice, Practice! Especially To Perform Under Pressure!
Think about a concept you have read about – one that is fundamental, and you think you might know like the back of your hand. If you need ideas, try the central limit theorem, or stochastic gradient descent. Got one? Good. Now explain this concept succinctly, but accurately, in less than 60 seconds. If you are reading this on your phone, or in the office, and you feel self-conscious, whisper it under your breath. I think you’ll find it’s more difficult than you thought.
Here is the interesting part: You very likely actually do know this concept. Whether it’s over the phone, Skype, or in-person, perhaps you just need some practice explaining concepts well!
Since my own background is in a technical field like biomedical informatics, as I started preparing for interviews, most of my time went into reviewing and practicing technical concepts. However, because I have a quantitative PhD, this was actually not what I needed to practice, and in hindsight, may have been more of a crutch. The two skills I should have been investing in were SQL skills and product sense.
Note: This reminds me of an excerpt from the book about Bayes’ theorem, “The Theory That Would Not Die.” When planes returning from WWII missions were analyzed for bullet holes, the statistician advising the military counter-intuitively suggested armoring the parts of the plane without bullet holes. This is selection (survivorship) bias: planes hit in the unmarked areas rarely made it back to be counted, so the areas riddled with holes on the surviving planes were evidently the ones that could absorb damage, and were less in need of armor than the parts without bullet holes!
A Challenging Job Search
The range of interview questions I was asked and my overall experience looked very different at each company. Depending on the company’s business model and current challenges, certain skill sets may be more or less useful than others. My current team at Microsoft is a team of specialists: each data scientist has a specialty – natural language processing, time series forecasting, classification, clustering and matrix factorization. We all overlap in each other’s areas of expertise, and ultimately each of us is able to make valuable contributions in a particular area.
I won’t be going into specifics about each and every interview I went through. I will say, however, that I felt like the whole interview process was rather unfair.
I have a particular dislike for “take-home” interviews. I imagine these are useful for a few reasons: (1) to mitigate bias towards extroverts in face-to-face interviews, (2) to mitigate any biases that may be subconscious on the part of the interviewer, and (3) to save time (the reviewer of the notebook need only spend a few minutes judging, rather than a whole hour). As the interviewee, these homework assignments are full-day investments, and more often than not, I feel like I’m being manipulated into doing unpaid work. My personal advice is to withdraw from any interview pipelines that make you waste time with these notebooks – you’ll likely not receive much in the way of feedback anyway. But obviously, to each their own.
Interestingly, I think you can tell what data science might look like at a company from the interview questions you receive. For instance, if you feel you are receiving ridiculous programming interview questions, there is a good chance that role is actually a software engineering role disguised as a data science role. If you feel you are receiving many business-oriented questions, then perhaps that role is less technical, which may or may not be what you want.
Questions You Should Ask
This is one of the most underrated parts of an interview, and at least some of the onus of finding the right fit is on you, not just the companies you are interviewing at. People have varying opinions on this, but personally, I like to ask the questions below. I think they are relevant, tough, and provide a reasonable approximation of the culture of the company.
What is the gender ratio of the group, and of the office?
Who was the last person that was fired from the group and why?
Do people typically eat lunch at their desks?
What was the last data science / engineering mishap you had and how do you hope to prevent it from happening again?
What percent of employees leave the company within 1 year? 2 years? 4 years?
It may feel scary to ask these important questions, but I can assure you that as an interviewer, it’s also scary to receive them. Furthermore, there are a number of trite, softball questions I’ve been asked as an interviewer, and many of these can be found in web pages scattered across the Internet. For instance, if you ask anyone, “Why do you love working here?”, I’d say there is a decent chance the reply might be, “the people!” I’d suggest asking something more specific in the same vein, such as, “Could you tell me about a time you felt like you were overextended, and your team came to your rescue?”
I Applied For 57 Jobs In A 12-Month Period
I used an Excel sheet to track my job search in 2017. Most of these companies did not provide feedback along with their decision. There was only one company I interviewed with that did provide constructive feedback, which reflected very positively on them (General Assembly). My most negative interview experiences were with large companies like Facebook and KPMG.
When I was at Insight, the program directors had me keep track of the companies in my pipeline in Trello, which seemed to work well. My board was quite sparse though, as I converged on a good fit rather quickly.
Data science is still a very young field (although I remember first hearing about it around 2009, a few years before HBR called it the sexiest job of the 21st century). Moreover, data science looks different at every company, which makes it difficult to anticipate whether you are actually a good fit for a data science role at company X. Because of this, I think, I experienced a wide range of interviews with over 30 companies, and still hadn’t converged on anything I would consider “signal.” In hindsight, I think this was partly on me, because I wasn’t telling companies “no” enough. I now use phone interviews, especially the initial phone calls with someone from the group I’m interviewing with, as an opportunity to assess how I can bring value to their team. This means understanding a few aspects of the group: their technology stack, problem-solving approach, team bond, etc. Your goal for this phone call is to get a feel for whether you would accept an offer from this group/company. If it doesn’t seem like a good fit, it’s best to be honest with yourself, acknowledge that the place you are interviewing at will likely not be a good fit, and remove yourself from the remainder of the interview process.
Skills Gap
Every year, additional methodologies and case studies are accepted into the corpus of data science skills. Although many of the fundamentals remain unchanged, specific examples that look obvious to those who have studied a phenomenon may not seem so obvious to those who haven’t. In short, there is an evolving culture in the field, and it is difficult to point to any one resource you can follow to stay up to date.
Job Boards
Here is a list of data science job boards I collected. It’s obviously not exhaustive, and likely has some cross-posted requisitions, but it was useful for me when I was searching for the right data science role:
In fact, if you want, you can sign up for the very same MailChimp list I set up for myself. It lassos all of these sources together and then creates an email that is sent out every Thursday morning:
A Short Selection of What Data Science “Looks Like” At Various Companies
I’ll leave some brief notes here on my data science experiences at some of the companies where I feel I gained substantial data science skills.
Bioinformatics at Illumina (i.e. Biological Data Science)
For those who don’t know what a bioinformatics scientist is, it is basically a data scientist who works with molecular biology data. The role I had from 2009 to 2012 has substantial overlap with my current role in terms of fundamental skills. However, domain expertise is a significant factor, and I would say it is less of a challenge for a bioinformatician to transition into data science than it is for a data scientist to transition into bioinformatics (unless you already have a background in biology, that is).
Data Science at RealSelf
This role was very business-focused, with product-management and data-engineering attributes mixed in. This was a super-fun place to work, and I miss being there. I would say it was my first “true” data science role, in that the work I did directly affected business decisions. Colleagues have asked me how my current role at Microsoft is different from RealSelf. The short of it is, at RealSelf, I needed to be a bit more of a data engineer and product manager in addition to my data science role.
Data Science at Microsoft
My role at Microsoft is also very business-focused. However, I would say 80% of it is data science. There are still some data engineering and PM tasks I need to dive into, but I have the support of a dedicated data engineer and project manager on my team. Since we all work well together, this leads to a very productive and happy work environment.
Data Science Feels Inaccessible to Newcomers
Data science still seems inaccessible, although I do think democratization of data science is the future. I’ve thought about why this might be the case, beyond just “hiring is broken,” and one reason is that there are very few entry-level data science roles. I think this is because the stakes are too high. Imagine putting a multi-million dollar business decision in the hands of a freshly-minted data science grad. Imagine allowing an entry-level data scientist to advise a multi-million dollar client. The margin of error is quite small, and in a sense, the jobs of your co-workers, and the revenue that families depend on, may be affected as a consequence of your work and decisions.
Another reason may be that it feels too draconian to label an entry-level data scientist “incompetent.” After all, they are just starting their career! However, the performance of a model, or the deliverables produced as a result of it, are products we are accountable for. If your forecast is off by a percentage point, you will get questions about why it was wrong, what the cost of the error was, and perhaps why the company should trust your work at all. In harsh cases you may even find that those affected are able to attach dollar values to the consequences of your decision. These are tough questions, and they can catch even experienced data scientists a little off guard. Not to scare you off, but don’t be surprised if you feel like you are betting your badge on your work.
“How can I succeed in data science if I don’t have a PhD?”
At this point, I would suggest the following: pick 1-2 areas you can become an expert in (classification, A/B testing, etc), and make that your primary skill. Then find a startup job that needs that skill in particular. Why a startup? Because they generally have a difficult time recruiting. Why 1-2 areas you can be an expert in? Because you have to compete with experienced data scientists.
Every now and then, people reach out to me on LinkedIn, alumni networks, etc. and ask me how they can get into data science. One bit of advice I’ll give that you are unlikely to hear anywhere else is to start with a data science role that is “low stakes.” If you see a data science job posting at a company and you can already tell it is going to be an impactful role, then that might not be for you – at least not just yet. The stakes can be quite high with data science projects. An inflated error rate can mean millions of dollars in lost revenue; being unable to properly interpret a model may lead to a fundamentally poor business decision; and so on. Think about this carefully – this means people’s jobs, people’s retirement funds, etc. Many tutorials make data science seem like the job is primarily “painting with data” and that we can pull off-the-shelf scikit-learn solutions down to solve business problems, but I promise you that is the very small tip of a very large iceberg.
The question I ask myself before I deploy a model into production is: Am I willing to bet my badge on this? And sometimes you will have to regardless due to circumstances outside of your control.
It might be worth asking yourself the question about whether you truly want to do the work of a data scientist. I’ve noticed many roles that have a skills overlap with data scientist roles – quantitative analyst, data analyst, etc. I know people with these titles, I consider them my peers, learn frequently from them, and they are very happy doing what they are doing because they are doing what they love! Today, data scientists may be called data scientists, and tomorrow they may be rebranded as something else. One way to determine this might be the following: Would you continue in the role you have right now if your title was changed to data scientist overnight?
More Advice:
A useful bit of advice is to think about the economics of the company you are interviewing at. For instance, startups tend to have short financial runways, so the likelihood of a startup taking a chance on someone who has solid fundamentals but lacks practical experience may be low, because it is risky for them. A larger company, on the other hand, may appreciate the idea of hiring someone completely new and teaching them its best practices. At the same time, recruiting is difficult for startups, so you are likely to get further down the interview process more quickly than at larger companies.
Also realize that as a job seeker, startups are risky. It’s very possible the startup you are going to work for may not receive their Series A, or end up being acquired by a larger company, which means you may be job searching yet again, and much sooner than anticipated.
What It’s Like Being On The Other Side (Being The Interviewer):
I can appreciate now how difficult interviewing is for the interviewer as well. Imagine having to decide, in 30-60 minutes, whether someone will be a good fit for the group, will have the opportunity to grow, is truly interested in the job, is competent, is reliable, can work under pressure, etc. Most interviewers aren’t going to doubt your ability, but will want to see you demonstrate it so they can follow your thought process – we are sharing work with each other after all, and we need to have mutual confidence in each other’s work.
Hiring a Data Science Manager
After interviewing no fewer than 7 data science managers, I’ve found that they generally fall into 4 buckets. This may be true for those seeking data scientist positions too. I’ll summarize these in four personas:
The Machinist – Someone who is technically strong, but has little to no business acumen or product sense
The Visionary – Someone who has a strong product sense / business acumen, but is not technically strong
The Consultant – Someone who has had exposure to a wide breadth of experience
The Professor – A domain specialist who has been working on the same problem for 10-20 years. Think Depth.
Hopefully all of this information is helpful! I’m sure some of it will change over time, so please use your best judgement. Best of luck on your search.
Sign Up For Updates:
Congratulations on making it this far! You got through approximately 4,246 words (25,946 characters). This is perhaps 10x longer than most blog posts on the internet :-D. Please consider signing up so I can send you email updates when I make more blog posts. You can hit reply and get in touch with me directly as well.
Stack Overflow is awesome. Some of the world’s most brilliant programmers frequent the website and answer tough questions. Wouldn’t that make it a great place to learn from? Yeah, I think so too.
This is why I’ve collated The Guerilla Cookbook for R. It’s basically a number of Stack Overflow links organized and ordered in a way to help R programmers learn their way to the next level. If you are proficient in R, I hope these resources will help you get closer to being amazing. If you are just getting started with R, I’d suggest adding this page to your bookmarks and returning when you are familiar with the basics of R programming.
The cool thing is, this "book" essentially writes itself since most of the experts (and peer-reviewers) are answering the questions. Most of the questions are "real-world" and are asked by novice or intermediate programmers. We can easily add/remove/reorganize the contents as necessary.
How Was The Content Selected?:
I personally searched through Stack Overflow to find my favorite questions and shared them here.
The table of contents will have to be updated/reorganized over time as links are added and removed. But use whatever you can for now!
I don’t know about you guys, but I’ve really been enjoying Spotify. Being one of the first American users has been a real treat. Making the switch from iTunes wasn’t painful at all. Sorry iTunes, sometimes the best things in life are free…
A few weeks ago, I used this applescript tutorial to set up iTunes as my morning alarm clock. I love this. Nothing makes it easier to get out of bed than waking up to a song that energizes you.
I’ve found a way to use Spotify instead of iTunes in the alarm clock script, and this post is going to be a quick show and tell. It’s not as easy as just replacing "iTunes" with "Spotify" in the iTunes alarm clock script.
I would suggest following the tutorial in the link I provided earlier, substituting the script provided for the one I provide in step 1 below. Otherwise, you can follow my attempt at a walkthrough.
Open up Spotlight and launch "AppleScript Editor." Copy the code below into your editor and save it (with an easy-to-remember name).
Replace my username with yours.
Replace my playlist "muzic" with the name of your own playlist.
Press the Run button to make sure it actually works.
Update 8/13/2012: I’ve added the code in the image below to github as a gist.
Next, pop open your terminal window and type in "crontab -e" without the quotes
Now you will be in the vi editor, so press "i" on your keyboard to "insert," and type in the following line (the one line at the bottom is all you need). Make sure the path points to the script you just made in step 1. Also note that a pound sign at the start of a line "comments out" that line, so it is ignored.
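For reference, a minimal sketch of what that crontab entry can look like is below (the script path and file name are placeholders – point it at wherever you saved your own script, and adjust the minute/hour fields to your wake-up time):

# minute hour day-of-month month day-of-week command (this first line is a comment and is ignored by cron)
1 6 * * * osascript /Users/yourname/Documents/spotify_alarm.scpt

This example fires at 6:01AM every day; osascript is the command-line tool that runs a saved AppleScript.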
After you are done typing in the line, press "esc" to leave insert mode, then type ":wq!" (a colon will show up in the bottom left corner of the screen as you type it) to save and quit the editor.
As a note, my alarm configuration (shared in the image above) is set to wake me up at 6:01AM. I would look here to find out how to edit the cron job to awaken you at a time of your own choosing.
Next, you have to make sure your computer is on maybe 5 minutes before your alarm goes off. This can be done in the "System Preferences" application.
Click on the "Energy Saver" option
You don’t have to mess with your sleep and display configuration, but you do need to click the "Schedule…" button in the bottom right-hand corner
Set your mac to wake up or start up 5 minutes before your cron job kicks off. You’re all done!
Congratulations, you’ve just setup your own Spotify alarm clock!
I’m going to cheat a little bit. Taking my own advice from my post, "Bioinformatics Programming Like Experts," I’ve found it much simpler to answer my next few questions using R. R has a number of complicated statistical tests built-in — performing them on data is trivial.
What I’ve Done: Principal Component Analysis:
I’ve performed principal component analysis on my family’s 23andMe data. In a nutshell, principal component analysis transforms multi-dimensional data into a set of components ordered by how much of the data’s variance each one captures. Thus, the first principal component accounts for the most variation in the dataset and the last principal component accounts for the least. The process of obtaining these numbers by hand is very involved.
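The math is involved, but R hides it behind a single function call. Here is a simplified sketch of the idea (not my exact script – it assumes the genotypes have already been encoded numerically, e.g. as 0/1/2 copies of an allele, in a matrix with one row per person, read from a hypothetical CSV file):

# hypothetical input: one row per person, one numeric column per SNP
geno <- read.csv("family_genotypes.csv", row.names = 1)
pca <- prcomp(geno, center = TRUE)
plot(pca$x[, 1], pca$x[, 2],
     xlab = "Principal Component 1",
     ylab = "Principal Component 2")

prcomp does the heavy lifting, and pca$x holds each person’s coordinates along the principal components – exactly what gets plotted below.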
What Are We Looking At?:
To get to the punchline and share what I’ve posted in simple terms, we can plot "principal component 1" values against "principal component 2" values and the similar data points will "cluster."
Since I didn’t include a legend in the image above, here is who each data point corresponds to (and rough coordinates for folks who can’t see color too well):
Red = Me (-700, 1000)
Purple = Sister (-550, 100)
Pink = Mom (-1750, -1000)
Green = Dad (1300, 1000)
Blue = Grandfather (1700, -1500)
Brief Explanation of the Image:
My sister and I are each roughly half of my mother and half of my father, so our data points fall between the mom and dad data points. Based on the location of the grandpa data point, it’s obvious whether grandpa is paternal or maternal.
It’s well known that asymptotic notation is used to convey the speed and efficiency of code blocks in computer programs. I haven’t used them very much while working with Python, so I needed to refresh my memory before trying to use this great tool.
Cardinal Rule: Focus primarily on the largest term in the time-complexity expression. All other terms in the expression are essentially trumped.
O(n^4 + n^2 + n^3 + nm + 100) ~= O(n^4) (assuming m is linear, i.e. m grows no faster than n).
Trump Rules for Time Complexity:
Notes
Remember that we care most about the upper bound and are not so concerned with the lower (in general)
The smaller the upper bound number the better (and consequently, faster)
The Ladder
Constants are less than logarithms
Logarithms are less than polynomials
Polynomials are less than exponentials
Exponentials are less than factorials
Notation and Hierarchy (Smaller Is Better):
Constant Θ(1)
Logarithmic Θ(lg n)
Linear Θ(n)
Loglinear Θ(n lg n)
Quadratic Θ(n^2)
Cubic Θ(n^3)
Polynomial O(n^k)
Exponential O(k^n)
Factorial Θ(n!)
Quick Examples:
[i for i in list] {linear}
Functions that generally operate on lists or generators (sum, map, filter, reduce, min, max, etc) tend to be linear in time complexity
[i+k for i in list for k in list] {quadratic}
[i+k for i in list1 for k in list2] O(len(list1) × len(list2)) {quadratic, since it’s linear × linear}
Add 1 to the exponent for each nested loop. For example, [j+i+k+n for j in list1 for i in list1 for k in list1 for n in list1] would have a time complexity of O(n^4)
Note: Some programmers shave down the cost of nested loops over sorted lists by ensuring that calculations aren’t performed more than once. Consequently, less and less has to be evaluated on each iteration of the outer loop, so the code block runs faster than the naive version. For example:
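A simple illustrative sketch of the trick (my own example, not a snippet from real production code) – start the inner loop just past the outer index so no pair is evaluated twice:

def pairwise_sums(values):
    # each (i, j) pair is visited exactly once; the inner loop shrinks on every pass,
    # so roughly n*(n-1)/2 additions are performed instead of n*n
    sums = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            sums.append(values[i] + values[j])
    return sums

The overall complexity is still O(n^2), but the constant factor is roughly halved.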
If you missed part one, here is the link. It’s probably a good idea to peruse that post before moving on to this one. In this post I am going to see if I can create a family tree using the same 23andMe raw datasets I used in my last post.
Again, the module is freely available for download on github.
A Note About The Algorithm:
Computer Scientists call this hierarchical clustering. Biologists know this as creating a phylogenetic tree. These are essentially the same thing. However, I am going to simplify and refine the algorithm to suit my needs. It may even be apt to call this a "quick and dirty," or "pseudo" implementation. Perhaps I will program a full implementation in the future.
Setting Up Metrics And Controls:
In my last post, I used % identity between individuals to determine how closely related one person is to another. Although it is a rather casual measure of similarity, I am going to continue using this metric because it is easy to implement. I encourage you to create and implement your own metrics as this really helps you get a better understanding of your data.
Since this is the second time I’ve used this metric, I went ahead and created a function for it. Continuing from the coding example in my last post:
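A simplified sketch of what that helper can look like is below (the attribute names mirror the description that follows, but the real function in the module may differ slightly):

def percentIdentity(self, personA, personB):
    # compare genotypes only at the SNPs shared by every loaded file
    a = self.intersectionData[personA]
    b = self.intersectionData[personB]
    matches = sum(1 for rsid, genotype in a.items() if b.get(rsid) == genotype)
    return 100.0 * matches / len(a)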
Just to be clear, Datasets.intersectionData is a dictionary identical to Datasets.Data. The difference is that Datasets.intersectionData only contains data for the SNPs in common between all of the files (the same SNPs contained in the Datasets.Intersection list).
Since I am creating a tree, it is probably a good idea to use a few controls — these controls should be on the outermost branches since they will be the least related to my folks and me. I’ve decided to use two — Mikolaj Habryn (available on SNPedia) and Manu Sporny (yes, the guy who published his 23andMe data on github). The SNPedia link contains a few more sample datasets (and there are several more scattered across the internet). I chose to use Mikolaj’s published data because it has been said that he was the first person to make his 23andMe raw data publicly available, and I’d like to ensure that handling "old" raw data won’t be a problem.
Building The Tree:
I’ve added the phylogeny function to my ParseToDict class.
>> Datasets.phylogeny()
The output attempts to simulate a phylogenetic tree. The tree is created by comparing each of the datasets to each other and counting the number of identical genotypes that are shared between them. These results are organized and printed to a list (as shown below).
The closer the files are coupled together, the higher the level of similarity between them. The reason I call this a quick and dirty implementation is that similarity is calculated with respect to the two most similar raw data inputs. Person1 and Person2 are most closely related. I am second most closely related (to Person1 and Person2). Person3 is third most closely related (to Person1, Person2, and me). I’m sure you get the idea. Keeping this in mind, the tree looks exactly as I expect it to.
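To make the "quick and dirty" label concrete, the logic is roughly the sketch below (illustrative only, not the module’s actual code; it leans on the percentIdentity helper sketched earlier): score every pair, seed the tree with the most similar pair, then repeatedly attach whoever is most similar to someone already placed.

from itertools import combinations

def build_tree(datasets, names):
    # rank every pair of people by percent identity, most similar first
    scores = sorted(((datasets.percentIdentity(a, b), a, b)
                     for a, b in combinations(names, 2)), reverse=True)
    _, first, second = scores[0]
    tree = [first, second]  # the two most similar people form the core
    remaining = [n for n in names if n not in tree]
    while remaining:
        # attach whoever is most similar to anyone already in the tree
        best = max(remaining,
                   key=lambda person: max(datasets.percentIdentity(person, t) for t in tree))
        tree.append(best)
        remaining.remove(best)
    return tree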
Cool! What’s Next?:
Now we can ask an interesting question. Can we plot the phylogenetic tree with the person’s genotype "tupled" with their name — perhaps revealing a pattern of inheritance?
To do this I’ve added an optional argument to the phylogeny function, rsid. I feel like randomly choosing 'rs6904200' for analysis today.
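In code, the call looks something like this (assuming the same Datasets object from earlier):

>> Datasets.phylogeny(rsid='rs6904200')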
It seems all of my family members and I have exactly the same genotype for this SNP (heterozygous 'AG'). The controls are homozygous, 'GG' and 'AA' respectively. Thus far, the pattern of inheritance for this SNP merely illustrates that the controls are not part of my family and that everyone in my family shares the same genotype for this SNP.
Not So Fast…:
Please remember that a shared genotype between individuals does not guarantee that the individuals are related — especially when only viewing one SNP!
To illustrate why, let’s look at one of the SNPs I mentioned in my last post, rs3754777.
Notice that although I am closely related to Person1, Person2, and Person3, we have a variety of genotypes between us. In fact, Mikolaj has the same genotype as Person3 and me even though he has no relation to us. This function may help shed some light on which genotype came from where, but the context really needs to be taken into consideration. The point here is to be careful not to trick yourself.
JSON – The Other New Feature:
I recently added in functionality to convert 23andMe raw data to JSON format. The input method is exactly the same, but a 23andMe.json file will appear in the current working directory. The implementation isn’t fancy. It’s really just a convenient wrapper over the built-in JSON module.
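Under the hood, the wrapper is not much more than the sketch below (illustrative – the method and attribute names here are my shorthand rather than necessarily what the module uses):

import json

def toJSON(self, outfile='23andMe.json'):
    # dump the parsed {filename: {rsid: genotype}} dictionaries straight to disk
    with open(outfile, 'w') as handle:
        json.dump(self.Data, handle, indent=2)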
A few readers asked me this question after my first post. There are many sources available, and I’m sure someone in the bioinformatics/genomics community has made a great post about this subject. However, to get started I would suggest three sources:
Last Christmas, my company offered a substantial discount on 23andMe kits. Since I had already gotten myself sequenced about a year earlier and found the data interesting, I decided to purchase kits for my family. Their data have recently been posted and I collected their raw data files.
I enjoy looking at the metrics 23andMe provides, but I was curious what I would find if I mined the data myself. I’ve spent maybe an hour or so every day for the past week exploring my family members’ data.
Getting Started:
I’ve uploaded my 23andMe python module to github so that folks can download it and play with their own data — trust me it’s interesting!
The classes and functions in the module can easily be used in scripts, but I built it to be used interactively in a python interpreter.
If you have data to play with pull the repository down and follow along:
>> import importlib
>> Me = importlib.import_module('23andMe')  # the module name starts with a digit, so a plain import statement won't parse
>> files = ['Nikhil.txt', 'Person1.txt', 'Person2.txt']
>> Dataset = Me.ParseToDict(files)
The code above should be straightforward. The first two lines load my module into the environment (via importlib, since a module whose name begins with a digit can’t be pulled in with a bare import statement). The files data structure is a list with the names/locations of the raw data files. The Dataset data structure is an instance of the ParseToDict class.
If you look at the source code, you may notice another class called ParseToDB. Initially, I loaded each raw data file into a sqlite database as a table. However, python seems to have some sort of bottleneck issue with the sqlite3 module — the query time for a simple join command is absolutely unbearable. Thus, I reverted back to using python dictionaries. However, I left the functionality in the script with the hope that someone will find it handy.
I was genotyped on 23andMe’s V2 chip, whereas my family was genotyped with the new V3 chip. Consequently, each member of my family ends up with 73% more data than me (quickly estimated via raw data file sizes). To compare myself to each member, I need to make sure I am comparing SNPs that everyone in the comparison set has in common. Enter the intersection function.
>> Dataset.Intersection
This returns a list of SNPs that we all have in common. In my case, this is on the order of 540,000 SNPs. My raw data file is the limiting factor, as everyone else has close to a million genotyped SNPs in their raw data files.
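Conceptually, the intersection is just a set operation over the rsids in each file. A rough sketch of the idea (not the module’s exact code):

def intersection(self):
    # keep only the rsids that appear in every loaded raw data file
    common = None
    for genotypes in self.Data.values():
        common = set(genotypes) if common is None else common & set(genotypes)
    return sorted(common)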
Let’s Start Playing:
To start scratching the surface, I performed a quick search for identical genotypes. I came up with these results:
Person1 and I have 67.5% identical genotypes. Person1 and Person2 have 67.8% identical genotypes. Person2 and I have 60.8% identical genotypes.
Keep in mind that these numbers are derived from whatever SNP data I’ve extracted.
23andMe reports 84% for Person1 and 78% for Person2 in terms of "percent similarity" (each being compared to me).
Since I am playing with a relatively small number of files here (three with myself included), my options are limited. A good, large dataset can go a long way.
Moving Along:
There are a number of published SNP association studies available via NCBI and NHGRI. Let’s pick a SNP and see what my analysis says and compare that to my 23andMe report.
How about hypertension? There is a study published here by Padmanabhan et al. It was published October 28, 2010 in the journal PLoS Genetics.
I’ve sifted through the papers and extracted the RSids, the associated nucleotide base, and the p-values for the trait above.
Hypertension
RSid: rs13333226
Base: A
P-value: 4×10^-11
Before I go any further: a p-value essentially reflects the probability that we would see an association at least this strong purely by chance, if there were no real association. The lower this number, the better. 4×10^-11 is a very small number — so we can be quite confident about this association.
I used the searchSNP function in my module to go through all of the loaded files and print data to screen if it is available:
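The call itself is a one-liner along these lines (output omitted here):

>> Dataset.searchSNP('rs13333226')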
We are all homozygous G for this SNP and none of us seem to be predisposed to hypertension. The ‘A’ base is the base that would predispose us.
23andMe does report traits for hypertension (high blood pressure). Thankfully, they also list the studies they used to make their conclusions.
23andMe looked at SNPs rs3754777 and rs5370. I’m going to ignore rs5370 at the moment because it has to do with hypertension in physically unfit individuals. The SNP I just pushed through python does not take this into account…
However, I am going to consider rs3754777 because it has to do with a study on hypertension in general. Please note that it is a different SNP from a different study. According to my 23andMe report:
I am ‘CC’ which basically means, "Subjects had no increase in blood pressure."
Person1 and Person2 are ‘CT’ which means, "Subjects have an average increase of about 2mm Hg SBP and 1mm Hg DBP."
I’m 2/2 in finding evidence that I am not genetically predisposed to hypertension. However, the other two family members seem to be 1/2.
Wait a second. What does this mean? Who should I trust?
About Confidence:
23andMe uses a "4-star" scale to report their confidence. These stars correspond to how many independent experiments with large sample sizes have been conducted and resulted in similar findings. 4-stars is the highest score.
The 23andMe hypertension report is rated at 3-stars ("Preliminary Research. More than 750 people with the condition were studied, but the findings still need to be confirmed by the scientific community in an independent study of similar size.").
The study I chose had a sample size of 1,621 Swedish cases and 1,699 Swedish controls. This would put the SNP I evaluated at about their highest confidence level (barring the part about it being confirmed in an independent study).
By these standards, I think I can trust my analyses and say that neither I nor the two other family members analyzed are genetically predisposed to hypertension. However, I would be much more confident in my evaluation if an independent study of rs13333226 were conducted and reported similar findings.
Hopefully, I will be able to do some more interesting analytics next week.