Now that the software is running with (at least for me) a low level of jank, it seems worth considering what we do with the years of accumulated sneer-strata over at the old place. Just speaking for myself, I think it would be nice if we had a static-site backup of the whole shindig. Unfortunately, since I’m a physicist by trade, anything I do with webstuff tends to involve starting from scratch with compass, straightedge and wget. There’s got to be a better method of archiving.

The other, not-mutually-exclusive option I can think of is to manually rerun “SneerClub classics”, the posts that one way or another helped define what sneering is all about.

N.B. Some of the test posts made today lean more toward the serious-discussion side and have accordingly been marked NSFW.

  • AcausalRobotGod · 3 points · 1 year ago

    You realize you must work ceaselessly to get this up and running or else, IYKWIM?

    • @blakestacey (OP) · 2 points · 1 year ago

      Anyone know how to manipulate compressed JSON (.zst) files? I was able to snarf the SneerClub data from a torrent there that goes up to December 2022.

      • @blakestacey (OP) · 3 points · 1 year ago

        Wake up babe, .zst files just dropped: comments and submissions.

        To get the data from this year, I tried the straightforward thing of just doing save-full-webpage in Firefox for each post. This was tedious, but I didn’t feel like figuring out how to get any automated downloading tool to work with my login details so that it could grab the NSFW posts. The result is ~2 gigs, most of which is probably redundant infrastructure. An oddity: trying to save a thread always failed on the first attempt but worked when I clicked “retry download”.

            • @self · 2 points · 1 year ago

              some work in progress on this is available here. the SneerClub directory is the output of the bulk downloader for all 1000 (deduplicated) posts it could grab from each of SneerClub’s hot, top, new, rising, and controversial tabs, and the jsonl files are just the ones you posted decompressed for convenience. so far I’m just using jq to process the data sets

              SneerClub has 1940 posts with nested comments and attached media where the downloader could parse it; the archive team files have 3851 posts and 100149 comments in a (much less convenient) flattened format without media. both sets have a few posts from 2015, so I’ll need to do more looking to see how much we’ve salvaged overall
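
              as a rough sketch of that overlap check, something like the following should work over the decompressed files. the filenames and the bulk downloader's per-post JSON layout here are guesses rather than the real structure:

              ```python
              # compare post IDs between the bulk-downloader output and the
              # archive-team dump. paths and field names are assumptions.
              import json
              from pathlib import Path

              def ids_from_jsonl(path):
                  """Post IDs from a newline-delimited JSON dump."""
                  with open(path, encoding="utf-8") as fh:
                      return {json.loads(line)["id"] for line in fh if line.strip()}

              def ids_from_downloader(directory):
                  """Post IDs from per-post JSON files written by the downloader."""
                  return {json.loads(p.read_text())["id"]
                          for p in Path(directory).glob("*.json")}

              archive_ids = ids_from_jsonl("SneerClub_submissions.jsonl")
              scraped_ids = ids_from_downloader("SneerClub")

              print("archive only:", len(archive_ids - scraped_ids))
              print("scraped only:", len(scraped_ids - archive_ids))
              print("in both:", len(archive_ids & scraped_ids))
              ```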

      • @self · 2 points · 1 year ago

        oh yeah, I think that’s just zstandard! it’s fairly easy to decompress if you’ve got access to a Linux machine or similar: it’s just unzstd, as long as the zstd package for your distro is installed
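
        if Python is handier, the zstandard package can do the same thing; here's a minimal sketch, with placeholder filenames and an oversized decompression window in case the dump was made with zstd's long-match mode (which these big archive dumps often are):

        ```python
        # minimal sketch: stream-decompress a .zst dump to newline-delimited JSON.
        # assumes `pip install zstandard`; filenames are placeholders.
        import shutil
        import zstandard

        def decompress_zst(src="SneerClub_submissions.zst",
                           dst="SneerClub_submissions.jsonl"):
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                # large window in case the dump was compressed with `zstd --long`
                dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
                with dctx.stream_reader(fin) as reader:
                    shutil.copyfileobj(reader, fout)

        decompress_zst()
        ```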

        any chance they’ve got the script they used available? we could use it to grab everything from this year and complete the archive

        • @blakestacey (OP) · 3 points · 1 year ago

          I don’t think the script is available (and it may be nonfunctional now, going by the terse notes at the above-linked wiki page).

  • @foldl · 3 points · 1 year ago

    Maybe have a “Sneerclub classic” community for the reposts?

  • @self · 2 points · 1 year ago

    > Now that the software is running with (at least for me) a low level of jank

    this is actually a big relief. I’m still monitoring in the background to see if anything’s silently broken but other than lemmy really wanting access to a mail server, everything seems good on this end too

    > Unfortunately, since I’m a physicist by trade, anything I do with webstuff tends to involve starting from scratch with compass, straightedge and wget. There’s got to be a better method of archiving.

    that’s not a bad way to do it. one thing that’d be cool is if we could archive sneers in a form that could be cited, which seemed like a pretty common ask back on reddit. some options for that are:

    • if we want a prebuilt automated system, this thing from the internet archive seems promising, but is fairly vague on how it actually works and usage is by invitation only (though I’d argue we’ve got a valid use case that the archive might be interested in)
    • we write a script (or modify an existing one) to scrape reddit and output comments to something like a set of JSON files (a rough sketch of what that could look like follows below this list). then a static site generator could reconstruct a sneerclub archive from that JSON into a rendered site, which could be hosted someplace free and permanent like github
    • same as above, but have the bot output to rationalwiki. not sure if this would flood the wiki or particularly match up with its formatting style though
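
    for the second option, here's a very rough sketch of the submissions half, using reddit's public listing endpoint: no API key needed, but listings cap out at roughly 1000 posts and won't include the NSFW-marked posts without an authenticated session. the output path and User-Agent string are placeholders:

    ```python
    # rough sketch: dump a subreddit listing to newline-delimited JSON.
    # comments would need a second pass (each post's permalink + ".json").
    import json
    import time
    import requests

    def scrape_listing(subreddit, listing="new", out_path="sneerclub_posts.jsonl"):
        headers = {"User-Agent": "sneerclub-archive-sketch/0.1"}
        after = None
        with open(out_path, "w", encoding="utf-8") as out:
            while True:
                url = f"https://www.reddit.com/r/{subreddit}/{listing}.json"
                resp = requests.get(url, headers=headers,
                                    params={"limit": 100, "after": after})
                resp.raise_for_status()
                data = resp.json()["data"]
                for child in data["children"]:
                    out.write(json.dumps(child["data"]) + "\n")
                after = data["after"]
                if after is None:  # listing exhausted (~1000 posts max)
                    break
                time.sleep(2)  # don't hammer the API

    scrape_listing("SneerClub")
    ```

    the static site generator would then work off those JSON files; comments could be pulled the same way by fetching each post's permalink with .json appended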

    for accessibility, I imagine having a sneerclub archive here could be a good thing too. that might be fairly easy to do; we’d need to set up a dedicated community and account for it so main sneerclub doesn’t get flooded, but then we’d just have an existing crosspost bot run and grab everything in sneerclub and post it here

    this may become much harder once reddit’s API closes, which gives me some anxiety. it might make sense to speedrun an archival script before that happens

    • @blakestacey (OP) · 4 points · 1 year ago

      RationalWiki runs on MediaWiki, which is kind of awful for discussion threads.

      I will try to have more thoughts about this later (and do a bit more research into pre-existing scraping tools and such).

      • @self · 1 point · 1 year ago

        https://github.com/toonvandeputte/reddit_archive might be adaptable into something that archives a subreddit instead of a single user’s posts. there may be a slight complication if we’re dealing with more than 1000 posts, but since we’ve got an archive already that has everything up to december, we really only need this year’s posts
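
        and since everything up to december is already in the torrent archive, whatever we scrape only needs to keep this year's posts; here's a small sketch of that filter, with placeholder filenames, deduplicating across the listing tabs:

        ```python
        # small sketch: keep only posts from 2023 onward and dedupe by id
        # across the hot/top/new/rising/controversial listings.
        import json
        from datetime import datetime, timezone

        CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp()

        def newer_posts(paths):
            seen = set()
            for path in paths:
                with open(path, encoding="utf-8") as fh:
                    for line in fh:
                        post = json.loads(line)
                        if post["id"] in seen or float(post["created_utc"]) < CUTOFF:
                            continue
                        seen.add(post["id"])
                        yield post

        listings = [f"sneerclub_{tab}.jsonl"
                    for tab in ("hot", "top", "new", "rising", "controversial")]
        with open("sneerclub_2023.jsonl", "w", encoding="utf-8") as out:
            for post in newer_posts(listings):
                out.write(json.dumps(post) + "\n")
        ```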

    • David Gerard · 4 points · 1 year ago

      gawd, definitely not suitable to dump on RW