I guess this blog was found by the scrapers for the AI-companies

Just based on pure stats.

As you can see below, there was apparently a period of time when someone or, more likely, something was reading a lot of this blog.

The lighter green is the number of page visits and the darker green is the number of visitors. In general, when someone visits this blog, they just read that one post and move on. Sometimes people see something interesting recommended to them and follow the link, and every once in a while someone reads through a bunch of posts in quick succession, but this last one doesn’t happen often. Maybe once a month. Here, it happened ten days in a row. Very unlikely to happen naturally.
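For the curious, the reasoning above could be sketched roughly like this: compare page visits to visitors per day and look for the longest run of days where the ratio is unusually high. The numbers and the threshold of three pages per visitor are made up for illustration; they are not this blog’s actual stats, and `longest_streak` is just a hypothetical helper name.

```python
def longest_streak(page_views, visitors, threshold=3.0):
    """Return the longest run of consecutive days on which
    pages per visitor exceeds the (assumed) threshold."""
    longest = current = 0
    for views, people in zip(page_views, visitors):
        # Guard against days with no visitors at all.
        ratio = views / people if people else 0.0
        if ratio > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Made-up daily numbers, for illustration only.
page_views = [12, 15, 90, 110, 95, 14]
visitors = [10, 12, 11, 12, 10, 11]

print(longest_streak(page_views, visitors))
```

A streak of one day every now and then would match the occasional binge reader; a long unbroken streak is what looks suspicious.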

Search engines know to tag their spiders in a specific way, so that they don’t get counted as visits, so my assumption is that this was someone scraping the whole blog or large parts of it (no, I didn’t check exactly what was read) to serve as fodder for training their models.

Which is kind of funny. There’s a lot of content here which is no longer applicable, because it was very time-sensitive. In many other cases it’s just me sharing weird ideas. But these models are hungry for content. They need huge amounts of it and each new step needs exponentially more data. However, we don’t have an exponentially growing amount of quality data, so apparently they are scraping the bottom of the barrel, like this blog full of very personal takes on topics that often don’t have a very wide appeal.

On the other hand, models like OpenAI’s are very bad at understanding anything regarding TTRPGs, for example, so maybe they could actually get something out of data such as this blog. At the same time, can OpenAI (just as an example, since I don’t really know if it’s them scraping this content) really claim they aren’t plagiarizing these texts if there aren’t that many other sources to base the results on? Often they can avoid the plagiarism accusation simply by claiming that they are not outright copying anyone specific, because there are so many different sources they are basing the results on.

But what about something like our Fiasco playsets? If someone asks for a playset for a specific topic which happens to be like ours, the model is probably just going to regurgitate what we made.

I don’t really care. I think we even have a statement somewhere saying that you can use our content pretty freely. I haven’t given this a lot of thought in a long time. However, what I would be worried about is this: by using data such as this blog, they are using huge amounts of additional energy to train their models, but at the same time, this kind of content is not going to be meaningful in that training. Quite the opposite. We do have some very high-quality and very interesting stuff in this blog, but a lot of it is meaningless in the bigger picture, so these kinds of texts are actually making the models worse. So, yay, let’s kill the planet so that some misanthropic company can claim to have reached some arbitrary milestone.

One thought on “I guess this blog was found by the scrapers for the AI-companies”

  1. Now, in recent days, there’s been a huge number of visits from China, but they are a bit different. They don’t come from one specific IP address, but instead from various addresses, and they are looking at different pages. This can’t be just a coincidence, so I would assume some Chinese company is now scraping my website, but they are trying to obfuscate it a little bit. That is hard though, since I don’t have that many daily visitors in general, so those kinds of peaks are going to show.
