5 Things I Learned as a Data Science Intern

August 10, 2012

Editor’s Note: This is part of a series of blog posts by LinkedIn’s rockstar summer interns. Today, we hear from Nihit Desai, who is working toward his Computer Science degree at the University of Illinois at Urbana-Champaign.

I don’t remember exactly how my fascination with data began. I suspect it was when I read about how data drives products like Amazon’s Book Recommendation Engine or LinkedIn’s ‘People You May Know’. Whatever the origin, it has stayed with me ever since, and here I am, interning with LinkedIn’s Data Science team.

Simply put, Data Science enables the use of data to solve problems and create data-driven products like ‘People You May Know’ or ‘Jobs You May Be Interested In’. The amount of data we produce and replicate is estimated to double every two years. The ability to take raw data, clean it, process it, extract value from it and derive a narrative from it is therefore becoming an extremely important skill as we try to make sense of the world around us. In the context of consumer Internet companies, I believe the most important use of data will be in understanding how users interact with the product. In his book “The Startup of You”, Reid Hoffman writes about how companies today are in “permanent beta”: their products are never ‘finished’ but are constantly improved through iteration. In such a scenario, data science plays a crucial role in deriving insights from data and providing a direction for the product’s next iteration.

I’ve had a phenomenal internship so far: a chance to work on exciting projects, meet amazing people and have a lot of fun at intern events. Here are five things I’ve learned during my internship:

  1. Spend time cleaning your data. Data is your starting point, so make sure it is clean. Doing so makes every later step simpler and your results more reliable.
  2. Start simple and start with a vision. As a data scientist at LinkedIn, you have access to petabytes of data (1 petabyte is as much data as is transferred when viewing HDTV for about 13.5 years). It can be overwhelming if you try to make sense of it all at once. I have found it far more useful to start out simple, get initial results and then iteratively improve my models. It is also absolutely essential to define what it is that you are trying to accomplish with your project when you start. This helps you direct your efforts and better evaluate tradeoffs between accuracy and computation time/cost.
  3. Getting results is just the beginning. As a data scientist, an equally important part of your work is to interpret those results, understand what they mean and explain those results to others on the team.
  4. The breadth of skills required to handle the various steps, from parsing data to interpreting the final results, is very wide. Data Science is really a blend of Computer Science, Statistics, Machine Learning and some domain expertise depending on the specific application (Sociology, Economics, Physics and the like). Know your strengths and don’t be afraid to ask for help when you need it. But try to learn something new every time you ask for help.
  5. Have fun along the way. Very often, as a data scientist, you work on problems that haven’t been solved before. Wrong decisions and missed opportunities are all a part of the process as you try out new methods to solve them. Learn quickly and don’t be afraid to take the path less trodden. That’s where the treasure often lies.
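
As a rough sanity check of the petabyte comparison in point 2, the short Python sketch below works out the average video bitrate that comparison implies. It assumes a decimal petabyte (10^15 bytes) and 365.25-day years, neither of which is stated above, and the function name is made up for illustration.

```python
# Back-of-the-envelope check: what bitrate does "1 PB over 13.5 years" imply?
# Assumptions (not from the post): 1 PB = 10**15 bytes, 1 year = 365.25 days.

PETABYTE_BYTES = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def implied_bitrate_mbps(total_bytes: float, years: float) -> float:
    """Return the average bitrate, in megabits per second, needed to
    transfer total_bytes over the given number of years."""
    seconds = years * SECONDS_PER_YEAR
    bits = total_bytes * 8
    return bits / seconds / 1e6


if __name__ == "__main__":
    rate = implied_bitrate_mbps(PETABYTE_BYTES, 13.5)
    print(f"Implied bitrate: {rate:.1f} Mbit/s")
```

Under these assumptions the result comes out a little under 19 Mbit/s, which is in the range commonly quoted for a broadcast HD video stream, so the comparison is plausible.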

In a world that is changing at an accelerating pace, data will play a very important role in enabling companies to evolve faster. At LinkedIn, I have had a chance to witness this every day as my colleagues make decisions driven by data. The future clearly belongs to companies that figure out a way to use their data successfully.