OpenAI has launched a new web crawler called GPTBot, likely built to improve its GPT-4 large language model and possibly to gather training data for GPT-5. The crawler will access data from websites, except those that are behind paywalls or that opt out of the process.
Reportedly, the idea is to use only sources that are freely available, comply with OpenAI’s policies, and do not collect any personal information from users. By allowing GPTBot to crawl their websites, publishers will be contributing their data to OpenAI’s existing and future models that power its AI chatbots. That may raise privacy and security concerns, but publishers would also be contributing to the overall advancement of AI.
However, if publishers are not comfortable with sharing their data with an AI system, OpenAI offers a simple way to opt out. They just need to add a short rule to their website’s robots.txt file. The exact rule can be found in the official documentation for the bot. Publishers can also specify which parts of their website are accessible to the crawler and which are not.
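As a rough illustration of the opt-out described above, the Python sketch below checks two example robots.txt files with the standard library's urllib.robotparser module. The `GPTBot` user-agent token is the crawler name given in this article; the example.com URLs, the /blog/ and /members/ paths, and the helper name `is_allowed` are hypothetical and chosen only for illustration. The official documentation remains the authority on the exact rule.

```python
from urllib.robotparser import RobotFileParser

# Full opt-out: a rule telling GPTBot not to crawl any path on the site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Partial access: hypothetical paths, one opened to GPTBot and one closed.
PARTIAL = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /members/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # GPTBot is blocked everywhere by the full opt-out.
    print(is_allowed(BLOCK_ALL, "GPTBot", "https://example.com/news/story"))
    # Crawlers not named in the file are unaffected by a GPTBot-only rule.
    print(is_allowed(BLOCK_ALL, "Googlebot", "https://example.com/news/story"))
    # With partial access, the result depends on the path.
    print(is_allowed(PARTIAL, "GPTBot", "https://example.com/blog/post"))
    print(is_allowed(PARTIAL, "GPTBot", "https://example.com/members/profile"))
```

This only simulates how a compliant crawler would read the file. robots.txt is advisory, so the opt-out depends on the crawler choosing to honour it.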
OpenAI recently filed a trademark application for the term “GPT-5,” suggesting that it is working on a next-generation large language model, an improvement over the current GPT-4 LLM that powers ChatGPT. Now, the company has given websites an option to decline to provide data for this potential next-gen model. Has OpenAI already started training it? It’s possible.
However, the launch of GPTBot is not without concerns. On one hand, ChatGPT, whose training data largely ends in September 2021 and which is therefore unaware of later events, needs fresher data to improve. On the other hand, websites gain little from being crawled by GPTBot. Unlike Google, which drives traffic to a website after crawling it by showing search results to billions of users, ChatGPT only summarises data from across the web without giving any citations, so it is hard to trace the source of the information it provides.
AI systems are expanding, but so are the ethical questions around data collection, copyright, and consent.