Migrating from old!SneerClub to New

@blakestacey · 1 year ago

Anyone know how to manipulate compressed JSON (.zst) files? I was able to snarf the SneerClub data from a torrent there that goes up to December 2022.

@blakestacey · 1 year ago

Wake up babe, .zst files just dropped: comments and submissions.

To get the data from this year, I tried the straightforward thing of just doing save-full-webpage in Firefox for each post. This was tedious, but I didn’t feel like figuring out how to get any automated downloading tool to work with my login details so that it could grab the NSFW posts. The result is ~2 gigs, most of which is probably redundant infrastructure. An oddity: trying to save a thread always failed on the first attempt but worked when I clicked “retry download”.

@self · 1 year ago

nice! I’ll grab the archives and see how well they combine with the output of this tool: https://github.com/aliparlakci/bulk-downloader-for-reddit

@blakestacey · 1 year ago

Sounds like a good plan.

@self · 1 year ago

some work in progress on this is available here. the SneerClub directory is the output of the bulk downloader for all 1000 (deduplicated) posts it could grab from each of SneerClub’s hot, top, new, rising, and controversial tabs, and the jsonl files are just the ones you posted decompressed for convenience. so far I’m just using jq to process the data sets

SneerClub has 1940 posts with nested comments and attached media where the downloader could parse it; the archive team files have 3851 posts and 100149 comments in a (much less convenient) flattened format without media. both sets have a few posts from 2015, so I’ll need to do more looking to see how much we’ve salvaged overall
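For anyone who would rather do this step in Python than jq, here is a rough sketch of regrouping the flattened comments under their posts and counting posts per year to see what date range was salvaged. It assumes the decompressed files are JSON Lines, that comments carry a `link_id` of the form `t3_<post id>`, and that posts carry `created_utc` in epoch seconds. Those field names are assumptions based on the usual Reddit dump layout and have not been checked against these particular files. The function names are made up for illustration.

```python
import json
from collections import Counter, defaultdict
from datetime import datetime, timezone


def read_jsonl(lines):
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_comments(comments):
    """Map post id -> list of comments belonging to that post.

    Assumption: each comment has a link_id like "t3_<post id>".
    """
    by_post = defaultdict(list)
    for comment in comments:
        post_id = comment["link_id"].removeprefix("t3_")
        by_post[post_id].append(comment)
    return by_post


def posts_per_year(posts):
    """Count posts by the UTC year of created_utc.

    Assumption: created_utc is epoch seconds, as a number or a numeric string.
    """
    years = Counter()
    for post in posts:
        ts = float(post["created_utc"])
        years[datetime.fromtimestamp(ts, tz=timezone.utc).year] += 1
    return years
```

To run it on real files, pass an open file handle to `read_jsonl` and feed the result to either function. This only groups comments by post; rebuilding the full reply tree would also need each comment's `parent_id`, which is left out here.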

@self · 1 year ago

oh yeah I think that’s just zstandard! it’s fairly easy to decompress if you’ve got access to a Linux machine or similar, where it’s just unzstd if you’ve got the zstd package for your distro installed
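If you want to script the decompression, here is a minimal sketch that shells out to the `zstd` command-line tool (`zstd -d` does the same job as `unzstd`). The helper names and the example file name are hypothetical. The `long_window` option adds `--long=31`, which some large dumps need if they were compressed with a big window. Whether these particular files need it is an assumption, so try without it first.

```python
import shutil
import subprocess
from pathlib import Path


def zstd_command(src, long_window=False):
    """Build the zstd command line and output path for a .zst file.

    The output path is the input path with the .zst suffix removed.
    """
    src = Path(src)
    if src.suffix != ".zst":
        raise ValueError(f"expected a .zst file, got {src}")
    dest = src.with_suffix("")
    cmd = ["zstd", "-d"]
    if long_window:
        # Allow a large decompression window (assumption: only needed
        # for archives that were compressed with a long window).
        cmd.append("--long=31")
    cmd += [str(src), "-o", str(dest)]
    return cmd, dest


def decompress(src, long_window=False):
    """Decompress src with the zstd CLI and return the output path."""
    if shutil.which("zstd") is None:
        raise RuntimeError("zstd not found on PATH; install the zstd package")
    cmd, dest = zstd_command(src, long_window)
    subprocess.run(cmd, check=True)
    return dest
```

Calling `decompress("example_comments.zst")` would write `example_comments` next to the input file. Building the command separately from running it makes it easy to print and check the command before anything touches the disk.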

any chance they’ve got the script they used available? we could use it to grab everything from this year and complete the archive

@blakestacey · 1 year ago

I don’t think the script is available (and it may be nonfunctional now, going by the terse notes at the above-linked wiki page).