• @daq@lemmy.sdf.org
    English
    2
    3 days ago

    I’m not sure how they actually implemented it, but you can easily block ML crawlers via Cloudflare. Isn’t just about every small site/service behind CF anyway?

    • @grysbok@lemmy.sdf.org
      English
      6
      3 days ago

      Last I checked, Cloudflare requires the user to have JavaScript and cookies enabled. My institution doesn’t want to require those because it would likely impact legitimate users as well as bots.

      • @daq@lemmy.sdf.org
        English
        1
        3 days ago

        Huh? I can reach my site via curl, which has neither. How did you come up with this random set of requirements?

        • @grysbok@lemmy.sdf.org
          English
          0
          2 days ago

          Odd. I just tried

          curl https://www.scrapingcourse.com/cloudflare-challenge

          and got

          Enable JavaScript and cookies to continue

          I’m clearly not on the same setup as you are, but my off-the-cuff guess is that your curl command was issued from a system that Cloudflare already recognized (IP whitelist, cookies, I dunno).

          Anyways, I’m reading through this blog post on using cURL with Cloudflare-protected sites and I’m finding it interesting.
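
          For anyone who wants to script the same check, here is a rough sketch of how a client could tell a challenge page apart from a normal response. It is not how Cloudflare itself decides anything. The function name and the status-code list are my own assumptions; the marker text is the message quoted above, and the `cf-mitigated: challenge` header is, as far as I know, the signal Cloudflare documents for challenge responses.

```python
from typing import Mapping

# Message seen in the curl output quoted above.
CHALLENGE_MARKER = "Enable JavaScript and cookies to continue"


def looks_like_cloudflare_challenge(
    status: int, headers: Mapping[str, str], body: str
) -> bool:
    """Heuristic (hypothetical helper): does this response look like a challenge page?"""
    # Header names are case-insensitive, so normalise before looking anything up.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # Strongest signal: the response header Cloudflare sets on challenge pages.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Fallback (assumption): a Cloudflare-served error status whose body
    # carries the challenge message.
    served_by_cloudflare = "cloudflare" in lowered.get("server", "")
    return served_by_cloudflare and status in (403, 503) and CHALLENGE_MARKER in body
```

          Feed it the status, headers, and body from whatever HTTP client you use. A site with challenges disabled should come back `False` even though it sits behind Cloudflare.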

          • @daq@lemmy.sdf.org
            English
            1
            2 days ago

            Of course their challenge requires those things. How else could they implement it? Most users will never be presented with a challenge, though, and it is trivial to disable if you don’t want to ever challenge anyone. I was just saying CF blocks ML crawlers.