Social network Bluesky has exploded in popularity since Twitter users jumped ship en masse after the US election. We’ve been on Bluesky since it was invite-only and we can assure you: Bluesky users…
This is a bit of a nuanced issue though. The person merely published a dataset made from publicly available data that anyone can re-create themselves using the Bluesky Firehose API. Could it be used to train a model? Yes, but that isn’t the only use case, and the person who posted it has no control over what other people use it for. If someone does train a model using it, then that’s their legal issue to work out, not the publisher’s.
It’s the same argument billionaires were using to justify silencing people who posted the movement of their private jets. The billionaires argued that this data could be used to harass them, but the posters argued the data is public and they aren’t responsible for what other people do with it.
The legal system is the perfect place for working out nuanced issues like this.
If I were a lawyer bringing this lawsuit, I would argue that “publicly available” does not mean “public domain”, and that without acquiring usage rights for the data you don’t have the right to use it.
If the courts ruled against an argument like this, it would mean that any website hosting material that can be accessed without an account must provide that material free of charge to anyone who accesses it, which is a gigantic consequence for such a nuanced issue.
My point is that you can’t talk about usage rights of a dataset without talking about a specific use case. The suggested use case was to provide a static test dataset for systems developed to use the firehose API, but the dataset could be used for literally anything, from making funny memes (fair use) to training an LLM (arguably not fair use). Does the existence of an illegal use case automatically mean the dataset itself should be illegal though?
As a corollary, a photocopier can be used to create unauthorized reproductions of copyrighted works. Should making and distributing photocopiers be illegal because they are capable of, and used in, violating copyright law? Or should we accept that the photocopier, absent a use case, isn’t breaking any laws, and go after the people who use them to illegally create unauthorized reproductions?
I hope Bluesky sues them.