Those files are kind of a nightmare to navigate in their bare state. And the datasets are huge. I doubt anyone training AI would knowingly let them through, unless it was specifically a police-investigation and case-law focused AI designed to process and categorize that kind of data.
Most AI are designed for functional discussion and factual data processing. It’s not a great idea to just feed in random trash.
I had to use Cloudflare to stop AI crawlers from using like 60% of the 16-core server that runs this instance. They were spending that much time pulling fediverse content, multiple bots with no wait time between requests. You really think they'd reject the Epstein files but seek out our combined output?
They scrape data indiscriminately; I’m sure any Epstein files publicly accessible on the internet have been added to their databases. Perhaps they’d be filtered out before being used to train models but I’m skeptical they take that level of care with the data.
Which would stop them.