Guest post by Alexandra Datsenko.
Though web scraping is growing rapidly in popularity and being deployed widely, it’s vital to understand the legal and ethical requirements.
Although web scraping itself is generally legal, that does not mean absolutely any content can be collected and actively used. It has to stay within certain limits.
This article presents a look at the etiquette of web scraping and what ethical rules and legal measures exist regarding automatic web data collection.
But first, let’s briefly define what web scraping is, and how it is used, for those unfamiliar with the concept.
Introduction to Web Scraping
Web scraping is an automated method of obtaining data from websites. Most of the retrieved information is unstructured data in HTML format, which is then converted into structured data in a spreadsheet or database.
There are many different ways to perform web scraping to retrieve content. They include using online services, APIs, or even creating your own code from scratch. Large websites like Google, Twitter, and Facebook expose APIs to access their data in a structured format.
Web scraping tools are software, i.e. bots, specifically developed to sift through databases and extract information. Different types of bots are used, many of which are fully customizable to recognize unique HTML structures of sites, extract and transform content, or extract data from APIs.
Web scraping service providers offer data extraction and export services for businesses and individuals. Such services allow businesses to refer their data extraction needs to experts who will accurately sort web pages, databases, documents, images, and folders.
How does web scraping work?
The general process of web scraping is as follows:
- Identifying the target site.
- Gathering the URLs of the pages from which we want to retrieve data.
- Sending a GET request to the server and getting a response in the form of web content.
- Parsing the HTML code of the website, following a tree structure path.
- Saving the data in a JSON or CSV file or some other structured format.
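As a minimal sketch, the steps above might look like this in Python using only the standard library. The target URL and the choice of `<h2>` elements as the data to extract are illustrative assumptions, not a prescription:

```python
import json
import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Step 4: walk the HTML tree and collect the text of every <h2> element."""
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.titles.append(data.strip())

def fetch(url: str) -> str:
    """Step 3: send a GET request and return the response body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_titles(html: str) -> list:
    """Parse raw HTML and return the structured list of headings."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles

def save(titles: list, path: str) -> None:
    """Step 5: save the structured data in JSON format."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"titles": titles}, f, indent=2)
```

In practice, most scrapers use libraries such as `requests` and `BeautifulSoup` instead of hand-rolled parsers, but the pipeline of fetch, parse, and save stays the same.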
Building and maintaining scrapers that handle all of these steps reliably takes effort, which is why many companies prefer to outsource their web data scraping projects or use ready-made tools.
How is web scraping used?
Access to data, having methods to analyze it, and making smart decisions based on analysis can make a huge difference in the success and growth of businesses in today’s world.
Here are some of the many ways web scraping can be used:
- In finance, scraping helps extract financial statements and insights from SEC filings, estimate company fundamentals, and monitor financial news.
- In marketing, web scraping is used to generate leads, compile lists of phone numbers and email addresses for cold calls, monitor reputation, and create content.
- In real estate, web scraping is used to get information about properties and agents/owners, monitor vacancy rates, estimate rental yields, and understand market direction.
- In data science, web scraping helps collect training and test data for machine learning projects, make predictive analyses, and process natural language.
- In retail, web scraping helps in monitoring MAP compliance, competitor prices, and consumer sentiment.
What are the Rules of Etiquette for Web Scrapers?
An individual approach to each site is, unfortunately, not possible for those who want to scrape data from hundreds of sites at once, but a few general rules of etiquette apply everywhere.
Before scraping a site, study its robots.txt. This file tells crawlers which parts of the site they are or are not allowed to access.
Does robots.txt have specific instructions regarding crawlers? If so, you should certainly follow them.
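Checking robots.txt can be automated with Python's standard `urllib.robotparser` module. This sketch parses the rules from a string for illustration; in real use you would point the parser at the live file with `set_url(...)` and `read()`:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    rp = RobotFileParser()
    # For a live site: rp.set_url("https://example.com/robots.txt"); rp.read()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

If `can_fetch` says no, the polite (and safe) choice is simply not to request that page.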
Also, read the terms and conditions. When you log in and agree to the terms and conditions, you are “signing” a contract with the site owner(s). This is how you agree to their rules regarding web scraping. They may clearly state that you are not allowed to scrape any data on the site.
Make sure it’s legal.
The extraction and use of data must meet certain criteria. First, the data should be collected only for personal purposes and must not be made public. Second, its collection and use must not cause financial or reputational damage to its owners.
Web scraping is legal and ethical if you are extracting data only for personal use and analysis. If you want to publish the collected data, you'll need to request the permission of the data subjects and check the site policy; otherwise, you risk violating data protection laws.
Don’t violate copyright law.
When scraping the site, make sure that you are not collecting data that is protected by copyright.
However, there are situations where exemptions may apply for all or part of the data, allowing it to be legally scraped without infringing on copyrights. For example, facts in copyrighted materials are often not covered by copyright laws.
Note that different countries have their exceptions to copyright law. You must be sure that the exception applies in the jurisdiction in which you are working.
Don't overload the site's servers.
Web scraping is performed by software that can put a heavy load on web servers. This means that the volume and frequency of the requests you make should not overload the site's servers or interfere with its operation.
This can be accomplished in the following ways:
- Schedule requests to run during off-peak hours of the site.
- Limit the number of simultaneous requests to the same site from the same IP.
- Observe the delay that scrapers should keep between requests. The robots.txt file usually specifies the data collection delay parameters. If not, you should stick to the average scraping speed: 1 request every 10-15 seconds.
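The delay rules above can be enforced in code. This sketch reads a site's Crawl-delay from robots.txt via the standard library and falls back to the conservative default mentioned above; the `fetch` callable and user-agent name are placeholders:

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 10.0  # seconds; matches the 10-15 s average suggested above

def crawl_delay(robots_txt: str, user_agent: str) -> float:
    """Use the site's declared Crawl-delay if present, else a safe default."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY

def polite_fetch(urls, fetch, robots_txt, user_agent="MyScraper"):
    """Fetch URLs one at a time, sleeping between consecutive requests."""
    delay = crawl_delay(robots_txt, user_agent)
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the very first request
            time.sleep(delay)
        results.append(fetch(url))
    return results
```

Fetching sequentially from a single worker also satisfies the "limit simultaneous requests per IP" rule by construction.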
Don’t cross the line of personal information.
You may only collect information that does not include personal information and that does not violate the site's Terms of Service (ToS).
The ToS, usually linked in the page footer, describe which data you can be penalized for scraping without the owner's permission.
Personal data is any data that can identify an individual, such as:
- Phone number
- Bank or credit card information
- Health information
- Biometric data
It is also strictly forbidden to collect protected information such as usernames, passwords, and access codes.
If you don’t have a legitimate reason to scrape and store this data, you’re breaking the law. The most common legitimate grounds for collecting such information are legitimate interest and consent.
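When personal data has no legitimate reason to be stored, one defensive practice is to scrub it from scraped text before anything is saved. The patterns below are deliberately simple illustrations; real PII detection requires far more care than two regular expressions:

```python
import re

# Illustrative patterns only -- not a complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact_pii(text: str) -> str:
    """Mask e-mail addresses and phone-number-like strings before storage."""
    text = EMAIL_RE.sub("[email removed]", text)
    return PHONE_RE.sub("[phone removed]", text)
```

Redacting at ingestion time means the sensitive values never reach your database in the first place, which is easier to defend than deleting them later.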
Always remember that the data does not belong to you. Before scraping a site, it pays to be polite and ask if you can collect this data.
You can identify your web scraper with a legitimate user agent string. Create a page that informs site owners of your activity, its purpose, and your organization, and include a link back to that page in your user agent string. This shows respect to the site owner.
Doing your web scraping responsibly and respectfully in this way can also help you avoid legal problems later.
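Setting such an identifying user agent takes one line with the standard library. The scraper name and contact URL here are hypothetical placeholders for your own:

```python
import urllib.request

# Hypothetical identity; replace with your scraper's name and a real page
# describing who you are and why you are crawling.
USER_AGENT = "ExampleScraper/1.0 (+https://example.com/about-our-scraper)"

def build_request(url: str) -> urllib.request.Request:
    """Attach an identifying User-Agent so site owners know who is crawling."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Site owners who see this string in their logs can visit the linked page, and can contact you instead of simply blocking your IP.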
The Rules of Scraping Websites Legally
Data Use Agreement
A Data Use Agreement (DUA) is a contractual document required by the HIPAA Privacy Rule. It is used to transfer data developed by nonprofit, government, or private organizations when the data is not publicly available or has restrictions on its use.
There are also various kinds of laws and regulations about collecting information and the consequences for violating them. Different regions and countries have slightly different rules.
The GDPR
Data scraping itself is not illegal, but the use of personal information is restricted. Under the EU's General Data Protection Regulation (GDPR), companies can only collect personal information and use it for various purposes if consumers consent. It's especially important to understand the implications of the GDPR in marketing.
However, the GDPR does not apply to data that has been properly anonymized.
In the event of a data breach, the GDPR requires notification of both consumers and data processors. Companies must describe the exact nature of the breach and the actions taken in response.
All companies, regardless of their worldwide location, must comply with the GDPR if they collect PII from EEA residents.
The U.S. Privacy Act
There is no single set of federal privacy laws in the United States. However, there are many different state laws, some of which, such as California's CCPA, are viewed by the U.S. Congress as test cases for possible federal legislation.
The CCPA is the most comprehensive Internet-oriented law in America and lists what constitutes PII. The CCPA includes browsing history, geolocation, biometrics, email, and so on as PII.
There are also several consumer-oriented federal health laws, such as the Health Insurance Portability and Accountability Act (HIPAA). In the financial field, it is the Gramm-Leach-Bliley Act of 1999 (GLBA).
What Should You Consider When Looking for a Web Scraper Service?
To choose a responsible and honest data provider, check the following points when evaluating a web scraping service:
The quality of the collected data should be a main consideration when choosing a web scraping service. Collected content should be accurate and up to date, but its collection shouldn't violate the site's policies and usage rules.
A good web scraping service must respect all terms and agreements when collecting data and ensure you get content on time.
Before choosing a web scraping provider, check the quality of their service. A provider should have clear pricing with specific rates so you can accurately predict your future costs.
Pay attention to customer service. Only consider a web scraping service that has a working customer service team. Make sure the provider uses up-to-date customer support systems, as this should be one of the top priorities of the service you choose.
Make sure that the web scraping provider you choose can handle problems that arise. For example, when data volume increases, their performance will not degrade. The web scraping service should be scalable.
Also consider the format in which the provider delivers the data. Web scraping services should be able to supply data in the format you want, such as CSV or JSON.
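Converting the same records into either of these delivery formats is straightforward, which is why there is little excuse for a provider not to support both. A minimal sketch with the standard library:

```python
import csv
import io
import json

def to_csv(records: list) -> str:
    """Serialize a list of uniform dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def to_json(records: list) -> str:
    """Serialize the same records as pretty-printed JSON."""
    return json.dumps(records, indent=2)
```

CSV suits spreadsheet users and flat tables; JSON preserves nesting and types, so the right choice depends on who consumes the data downstream.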
Web scraping has a great future ahead of it. Slowly but surely, scraping is being accepted as a useful and ethical tool for gathering information. In the vast majority of cases, the information companies collect is completely clean and legitimate.
To summarize, scraping should be discreet, comply with site terms of service, check the robots.txt protocol, and avoid scraping personal data and secret information.
If you’re ordering web scraping services, make sure the provider complies with all applicable laws and regulations.
And don’t forget that the main purpose of the collected data is analysis, not republishing or selling.
Alexandra Datsenko is a content creator who loves to write about web scraping and techniques for extracting information from websites. Here she shares some of her favorite tools, tips, and tricks for data extraction and how to use them in your business.