Scraping the Internet
As Large Language Models (LLMs) continue to grow in popularity, questions around how they work are becoming more frequent.
Curiosity, of course, is not a crime. But it should come as no surprise that tools that can switch seamlessly from writing SEO-enhanced articles to the perfect dismissal letter draw on knowledge they do not necessarily own.
As a result, OpenAI, the creator of ChatGPT, has been sued in the state of California, where it stands accused of harvesting “massive amounts of personal data from the internet” in order to create the system millions are talking about every day.
Of course, OpenAI is not the only business that could be scraping our data, and ChatGPT is not the only generative AI system on the market. Google, for example, has confirmed that its AI services, including Bard and Cloud AI, may be trained on publicly available data obtained from the web, and it won’t be the only company with this practice in place.
The practice of “data scraping” is somewhat of a controversial one. Rather than using official channels like APIs, automated programs collect data by crawling and harvesting information from websites across the internet.
As Yair Adato, founder and CEO of Bria, says: “In data scraping, information and content are downloaded from the internet not in an official manner. An automatic program harvests all of the content and data from websites over the internet.
“Unlike data scraping, an API is the official way for a website to allow a data transfer. APIs are controlled, monitored, and secured, and are the official way to download data, unlike scraping.
“Everything can be scraped. Wikipedia, Stack Overflow, Twitter, social media, stock repositories like Getty Images or Shutterstock, e-commerce, personal information, everything can be scraped.”
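To make the mechanics concrete, here is a minimal sketch of the “harvesting” side in Python, using only the standard library. The page markup is invented for illustration; a real scraper would fetch it over HTTP and follow the links it finds.

```python
from html.parser import HTMLParser

class LinkHarvester(HTMLParser):
    """Collects every hyperlink and heading from a page, the way a
    scraper reads markup directly instead of calling an API."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        elif tag in ("h1", "h2", "h3"):
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading and data.strip():
            self.headings.append(data.strip())

# A scraper would download this markup over HTTP; a literal string
# keeps the sketch self-contained.
page = """
<html><body>
  <h1>Stock photos</h1>
  <a href="/image/1.jpg">Sunset</a>
  <a href="/image/2.jpg">Harbour</a>
</body></html>
"""
harvester = LinkHarvester()
harvester.feed(page)
print(harvester.headings)  # ['Stock photos']
print(harvester.links)     # ['/image/1.jpg', '/image/2.jpg']
```

The point of the sketch is that nothing has to be offered to the scraper: the program simply reads whatever the page exposes, with no contract or consent involved.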
It would seem then that the vast majority of software connected to the internet can be accessed through data scraping techniques. But while the mind instantly drifts to concerns of privacy and consent – mine did anyway – the practice doesn’t necessarily have to be intrusive.
Recent start-ups like Polly, an AI research tracker, have been developed with a ‘tech-for-good’ ethos. Its primary goal is to reduce national suicide rates in Canada by using AI to identify online trends and find patterns of suicide-related behaviour.
“Sometimes using data scraping can be for good reasons,” said Adato. “For instance, Google scrapes the internet in order to provide a good search engine. The software can scrape retail and e-commerce websites in order to compare prices.”
While scraping can be a powerful tool, it raises important ethical and legal questions.
The data revolution of the last decade has been accompanied by controversial headlines, from tampering with elections to influencing teenagers and, funnily enough, harvesting our personal information.
However, data scraping without consent is where the waters get murky, with Adato saying that the practice is “equivalent to stealing.”
“Let’s assume that I own a website with content, and then someone takes this content from the website and uses it for commercial purposes without paying me. It’s equivalent to stealing my IP.
“Here’s a very clear example, let’s assume that I scrape Spotify, download all of the music, and then put this music on a free website and put ads on this website. Now I can give all of the music for free but the website violates the IP and the rights of the music owner and Spotify.
“Obviously, this is not acceptable. Whether they use Bard or the other generative AI engines to process the data before they commercialise it, is irrelevant. It’s the same concept, you can’t take someone else’s data without permission.”
Now that permission has been brought up, regulation is never far away. Within the EU in particular, personal data is protected by the General Data Protection Regulation, commonly known as GDPR.
After all, back in those crazy days when the regulation was being introduced, I’m not sure I got an email from Google checking it was okay to harvest my data.
The reality, of course, will be that T&Cs will have that sort of thing covered, but as our lives are increasingly played out on social media, the idea of this data being scraped does not sit well with the likes of Adato.
“It becomes even more problematic when talking about privacy. Private information is more sensitive than general information. Taking my personal data without a clear understanding of how it’s going to be used, where it’s going to be used, or who is going to use it is a clear violation of GDPR. It means my data ends up being used in places I cannot predict.
“It’s a real problem that equates to money. Taking data without permission is taking money without permission. That should be clearly defined to the user in the terms and conditions, and the website should make it very clear what you can and cannot do with this data.”
While GDPR rules may be something to be wary of, it’s clear that the practice of data scraping is going nowhere.
So what is the impact? According to Twitter, sorry, X owner Elon Musk, these practices have serious effects on the functionality of a website, especially one with as much data on it as X.
This summer, Musk took such exception to the practice that he decided to impose restrictions on the features users with free access could take advantage of.
In a post, he said, “To address extreme levels of data scraping & system manipulation, we’ve applied the following temporary limits: verified accounts are limited to reading 6000 posts/day, unverified accounts to 600 posts/day, new unverified accounts to 300/day.”
However, Adato contradicted the SpaceX founder, saying: “As long as the data scraping is done correctly, it should not affect the performance of the website, and still, in many cases, what you need to do is use the API if possible.”
Now is probably not the time or the place to discuss Elon’s potential alternative motivations for restricting free usage of his platform. As Adato says, if it is done effectively, data scraping can be harmless to the health of a website.
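The kind of quota Musk describes can be enforced server-side, or respected client-side, with a simple counter over a rolling window. A minimal sketch, with an invented cap of three requests for brevity:

```python
import time

class DailyReadBudget:
    """Throttle that stops allowing requests once a daily cap is
    reached -- the kind of limit X applied (e.g. 600 posts/day read
    for unverified accounts)."""
    def __init__(self, limit_per_day, clock=time.time):
        self.limit = limit_per_day
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def allow(self):
        now = self.clock()
        if now - self.window_start >= 86_400:  # start a new 24h window
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        return False

budget = DailyReadBudget(limit_per_day=3)
print([budget.allow() for _ in range(5)])  # [True, True, True, False, False]
```

A production limiter would track budgets per account or per IP address and persist them across restarts, but the principle is the same: the scraper hits the cap, not the database behind the site.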
What may be a more pressing issue is that of privacy. It’s safe to say that only the most naive of us would think the Google, Apple, Microsoft, and Facebook conglomerates of this world only harvest our data when we want or allow them to. But the point still stands: privacy is essential.
But as you look into this further, you start to wonder: is anyone safe? According to Adato, erm… no.
“It should be clear for those who scrape, or want to scrape, a website, but it should also be very clear for the user, how their data is going to be used. This is how the data on the website is protected.
“Many sites are free and open, so a paywall is not really an option for them. They may be able to find some cybersecurity solution to help, but the threat is endless and someone will always find a way to download data using scraping even if you have the right cyber solution. On top of this, it’s expensive.”
If I know anything, that expense won’t be viewed as essential by management either. The state of cyber security in the enterprise market will tell you that the very idea of protecting anything more than you have to is anathema to those who stand to gain from the profits.
What’s the answer then? Adato says that regulation and public opinion are key if the data hoarders are going to toe the line, adding that the brazen admission of data scraping is “very wrong”.
“What we need to do is to have a very clear policy in the terms and conditions, but also in public opinion and regulation.
“It should be very clear what type of data scraping can be done, and what cannot be done. The type of announcements being made by Google are very wrong and the reaction should be very strong.
“You cannot take someone else’s data without very clear permission. That is simply wrong.”
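One machine-readable precedent for “what can and cannot be done” already exists: robots.txt, which polite crawlers consult before fetching a page. A sketch using Python’s standard library, with made-up rules standing in for a real site’s policy:

```python
from urllib.robotparser import RobotFileParser

# A polite scraper checks robots.txt before fetching anything.
# These rules are an invented example of what a site might publish.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rules = RobotFileParser()
rules.parse(robots_txt)

print(rules.can_fetch("*", "https://example.com/articles/1"))    # True
print(rules.can_fetch("*", "https://example.com/private/data"))  # False
```

The catch, of course, is that robots.txt is only a request: nothing in the protocol stops a scraper from ignoring it, which is exactly why Adato argues that regulation and public opinion have to carry the enforcement.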