Data Scientists: Finding Patterns in LinkedIn Data

April 20, 2010

Ed. note: We spend a lot of time talking about our cool products and features on this blog, so it makes sense to give some attention to the cool people joining us who make it happen.

Pete Skomoroch is on our Analytics Team. You may have read about his work building an open-source site that shows which Wikipedia articles are getting the most page views.

Here’s Pete on his true passion - data!

It may not be true for everyone, but finding patterns that reveal new information about the world is addictive for me. My work focuses on finding these patterns in large datasets and using statistics to make predictions from them. As the world becomes more heavily instrumented and easier to log, there is a massive deluge of real-time data points. Data Scientists can apply machine learning and data processing at scale to this information, combining and remixing it with their own datasets to learn more and help solve real-world problems.

There’s an art to combining these datasets and learning from them, and this is where tools like Hadoop, Amazon Mechanical Turk, and machine learning come to our aid. As an example, consider trying to standardize the user locations found within Twitter profiles. Twitter users can enter any text string as their location, so how do you link these to cities, states, and countries? The trick I applied for my O’Reilly Where 2.0 workshop earlier this month was to aggregate a number of attributes for each distinct location string using Hadoop. These attributes were things like the most frequent time zone, common phrases in Tweets at that location, languages, etc. This context was then presented to Mechanical Turk workers along with the location string and the GeoNames database of standardized locations. Several Mechanical Turk workers would choose the best location from the database for each string. Then we used statistical techniques on the results to find the most likely standard location for each user.
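To make the two steps concrete, here is a rough sketch in plain Python rather than Hadoop. It is not the workshop code: the function names, the sample records, and the use of a simple majority vote in place of the unspecified "statistical techniques" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def aggregate_context(records):
    """Group (location_string, time_zone) pairs by normalized location
    string and keep the most frequent time zone for each one.
    (Hypothetical helper; the original job ran on Hadoop.)"""
    zones = defaultdict(Counter)
    for location, time_zone in records:
        zones[location.strip().lower()][time_zone] += 1
    return {loc: counts.most_common(1)[0][0] for loc, counts in zones.items()}


def majority_vote(worker_choices):
    """Pick the answer most workers agreed on for one location string.
    Returns the winning choice and the share of votes it received.
    (Assumed stand-in for the statistical step described above.)"""
    counts = Counter(worker_choices)
    choice, votes = counts.most_common(1)[0]
    return choice, votes / len(worker_choices)


# Made-up sample data: free-text locations with the profile's time zone.
records = [
    ("NYC", "Eastern Time (US & Canada)"),
    ("nyc", "Eastern Time (US & Canada)"),
    ("NYC ", "Pacific Time (US & Canada)"),
]
context = aggregate_context(records)

# Made-up answers from three workers for the string "nyc".
choices = [
    "New York City, NY, US",
    "New York City, NY, US",
    "New York, NY state, US",
]
best, share = majority_vote(choices)
print(context["nyc"], "|", best, "|", round(share, 2))
```

The vote share gives a crude confidence signal: strings where workers disagree can be sent back for more judgments instead of being accepted outright.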

At LinkedIn, we are doing the same sorts of remixing with our internal datasets. You can think of each dataset as a slice of user information or behavior. One slice is work history, another is connection data (who is connected to whom), another is search queries or words used in recommendations, and so on. With 65 million professionals, all of whom have an incentive to fill out their profiles completely and honestly, these data slices are clean and robust.

A single slice alone might tell us something high level, but combined, they are much more powerful. And we can still add in more slices from external sources. Like a CAT scan, each slice lines up to create a three-dimensional portrait of our user base.

The trick here is keeping in mind the end goal—using data to create products that will make a difference in people's lives. In fact, this is one of the reasons I came to LinkedIn; there is a focus on one domain here, and with that focus comes novel, rich data about professionals and what influences them. Controlling scope and coming up with well-framed problems are difficult to do, and at LinkedIn, we always have the focus on the professional in the back of our minds. From there we are free to go attack the hardest and most valuable problems. As a result we now have the data to help answer questions that used to be decided by anecdote, hunches, or conjecture—questions like “Where should I go next in my career?” or “Who can help me learn more about that?”

LinkedIn is really committed to great data, so when I walked in the door six months ago, there were virtually no barriers to me getting at what I needed. The Data Platform team had built thoughtful event tracking and user interfaces optimized for collecting quality data. In addition, the supporting tools for rapidly building data products — Hadoop, Pig, Python, R, Azkaban and Voldemort — were all in place. Every product built at LinkedIn is required to give back to our data ecosystem, so our ability to learn from users grows with each new launch.

It's this growth of our knowledge that excites me the most: we've done so much, but really it’s just the tip of the iceberg. The potential here for new types of collaboration, advice and knowledge sharing, events, classes, mentoring and professional development, all powered by the data, is beyond the scope of anything that’s been done so far. And that’s what I’m looking forward to tackling every day.

Wanna build the next big thing at LinkedIn? Check out our Careers Page to learn more.