LinkedIn Safety Series: What is scraping?
July 15, 2021
For our members to have the best possible experience, we want to keep them safe. We work every day to protect our members’ data and their ability to control the information they post on LinkedIn. Our Global Trust Teams create, deploy, and maintain models that detect and prevent abuse, stop attacks, thwart scams and generally keep the bad stuff that exists on the internet from ever reaching our members. Much of the detail on how companies do this has stayed behind the scenes, but we want to peel back the curtain with this first post in a series on safety topics.
Let’s get right into it, starting with one of the most challenging areas around: scraping.
What is scraping?
Scraping has been around since the start of the internet, but it’s grown dramatically in scale and sophistication. Today, the scraping we hear most about is unauthorized scraping, which uses code and automated collection methods to make as many as thousands of queries per second and evade technical blocks in order to take data without permission. Scraped data can be gathered from multiple sites, combined, and sold in large batches, to be used for phishing and other campaigns designed to trick you into sharing private information.
To be clear, scraping isn’t always bad. Search engines, for example, are expressly authorized to scrape so they can collect and index information across the internet. When people search and find links with snippets of information, that kind of scraping ultimately benefits both the websites and the users of search services. Scraping becomes nefarious when it’s done without permission. When this happens, you have no way to track where your data has gone or how it is being used. This can happen across many types of public-facing websites, including ecommerce sites, news sites and social networks. When your data is taken without permission and used in ways you haven’t agreed to, that’s not okay. On LinkedIn, our members trust us with their information, which is why we prohibit unauthorized scraping on our platform.
What isn’t scraping?
Unauthorized scraping by itself is not a breach or a hack. It can seem that way, because hackers often tout that they have hot data from a company. But scraping does not mean an attacker has been able to get inside secure systems, subvert firewalls or access protected network information. What it can mean is that bad actors collect a lot of data and use it in ways that you didn’t expect. Even without getting into a network, unauthorized scraping can be highly abusive, so we use our entire toolkit, including AI and legal methods, to stop this behavior and hold perpetrators responsible. Simply put, “hack” and “breach” are not synonyms for scraping. We’ll get into these topics in a separate post later in our series.
What are we doing to stop scraping?
Our teams at LinkedIn create, deploy, and maintain models and rules that detect and prevent abuse, including unauthorized scraping. Let’s define some terms we use so you can understand a couple of the ways we protect against different types of scraping. When we say public profile scraping, we mean scraping of information that is viewable on LinkedIn without logging in to an account, such as a member’s public profile. And when we say logged-in scraping, we mean scraping of information that is viewable when logged in to a member account.
To detect public profile scraping, our models look for signs of automated viewing of profiles. Due to the adversarial nature of unauthorized scraping, our models are retrained and automatically deployed several times per day to quickly adapt to new signals. Our abuse detection runs at scale, and our infrastructure is designed to help protect our members and their data without adversely affecting member experience on LinkedIn. In addition, we’ll be incorporating advanced signals into our machine learning models, retraining them more frequently to help adapt to evolving attack patterns.
We have models to defend against logged-in scraping as well. For this, we look for signals of bot-like activity. We employ deep learning to classify sequences of user behavior as automated, and we also use outlier detection to find activity that appears to be non-human. We have open sourced the code we use for outlier detection so that other companies can also use it to detect abuse. When we detect that a member is scraping, we give them information on how to correct this behavior.
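To give a feel for what outlier detection means in practice, here is a minimal sketch in Python. It is not the code LinkedIn uses or open sourced; it assumes a much simpler technique (a modified z-score based on the median absolute deviation), and the function name `find_outliers`, the `views_per_hour` input and the threshold of 3.5 are all hypothetical choices made for illustration.

```python
from statistics import median


def find_outliers(views_per_hour, threshold=3.5):
    """Return account ids whose activity is far above the typical level.

    views_per_hour: dict mapping a (hypothetical) account id to the number
    of profile views it made in one hour.
    threshold: modified z-score cutoff; 3.5 is a common rule of thumb,
    not a value taken from the article.
    """
    counts = list(views_per_hour.values())
    mid = median(counts)
    # Median absolute deviation: a spread measure that a single extreme
    # value cannot distort the way it distorts a standard deviation.
    mad = median(abs(c - mid) for c in counts)
    if mad == 0:
        # Every account behaves almost identically, so nothing stands out.
        return []
    flagged = []
    for account, count in views_per_hour.items():
        score = 0.6745 * (count - mid) / mad  # modified z-score
        if score > threshold:
            flagged.append(account)
    return flagged


sample = {"a": 12, "b": 9, "c": 15, "d": 11, "e": 4200}
print(find_outliers(sample))  # ['e']
```

In this toy data, four accounts view a handful of profiles per hour and one views thousands, so only the last is flagged. A production system would combine many more signals than a single count.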
In addition to rate limits, which cap how many requests can be made in a given period, we employ a funnel of further defenses that detect and take down fake accounts engaged in scraping at multiple stages. We aim to catch fake accounts as quickly as possible to prevent harm to our members.
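For readers curious about how a rate limit works, the following is a minimal sketch of one common approach, a sliding-window limiter. It is an illustration only, not a description of LinkedIn's systems; the class name `SlidingWindowRateLimiter`, the client id and the limit of three requests per second are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of each client's recently allowed requests, oldest first.
        self._history = defaultdict(deque)

    def allow(self, client_id, now):
        """Return True if a request arriving at time `now` should be served."""
        history = self._history[client_id]
        # Forget requests that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False  # over the limit: reject or challenge the request
        history.append(now)
        return True


# Hypothetical limit: three requests per second for each client.
limiter = SlidingWindowRateLimiter(max_requests=3, window_seconds=1.0)
decisions = [limiter.allow("client-1", t) for t in (0.0, 0.1, 0.2, 0.3, 1.05)]
print(decisions)  # [True, True, True, False, True]
```

The fourth request is refused because three requests already fall inside the one-second window, and the fifth is allowed once the oldest request has aged out. Timestamps are passed in explicitly here to keep the example deterministic; a real service would read a clock instead.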
What can members do to protect themselves?
We want members to have the clearest picture of what information they’re making available on LinkedIn. We protect you and the data on our platform every day, with a full arsenal of evolving techniques. Spend some time looking at what information you’ve added, from contact details to work history, and get familiar with your settings. In addition, take a look at your public profile page to understand what information might be public, and make sure it’s exactly what you want to be viewable to search engines and other off-LinkedIn services. You can limit or adjust those choices if you’d like. From there, it’s our job to enforce your choices to help keep you and your data safe.