• @Soyweiser
    1 month ago

    The latter test fails if they write a specific bit of code to put out the ‘LLMs fail the river crossing’ fire, btw. Still a good test.

    • @diz
      1 month ago

      It would have to be more than just river crossings, yeah.

      Although I’m also dubious that their LLM is good enough for universal river crossing puzzle solving using a tool. It’s not that simple: the constraints have to be translated into the format that the tool understands, and the answer has to be translated back. I was told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I don’t think it was doing anything quite that general.
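To make the “translate the constraints into the tool’s format” point concrete, here is a minimal sketch of a generic breadth-first river crossing solver in Python. The names `solve` and `classic_unsafe` are hypothetical, and the sketch assumes the puzzle fits one narrow shape: a farmer who must be in the boat, a fixed boat capacity, and a single “these things can’t be left alone together” rule. A variant with different rules needs its own encoding, and that encoding step is exactly where a tool-using model can go wrong.

```python
from collections import deque
from itertools import combinations


def solve(items, capacity, is_unsafe):
    """Breadth-first search over bank states (hypothetical helper).

    Assumes the farmer rows every trip and is not counted in `items`.
    items: names of everything that must cross.
    capacity: max number of items the farmer may carry per trip.
    is_unsafe: predicate on the frozenset of items left unattended on a bank.
    Returns the shortest list of moves (tuples of carried items), or None.
    """
    everything = frozenset(items)
    start = (everything, "left")  # (items on the left bank, farmer's side)
    goal = (frozenset(), "right")
    parent = {start: None}        # state -> (previous state, cargo)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            # Walk the parent links back to the start to recover the moves.
            moves = []
            while parent[state] is not None:
                state, cargo = parent[state]
                moves.append(cargo)
            return moves[::-1]
        left, side = state
        here = left if side == "left" else everything - left
        for k in range(capacity + 1):  # k == 0 means the farmer crosses alone
            for cargo in combinations(sorted(here), k):
                cargo_set = frozenset(cargo)
                if side == "left":
                    new_left, new_side = left - cargo_set, "right"
                else:
                    new_left, new_side = left | cargo_set, "left"
                # The bank the farmer just left is now unattended.
                unattended = new_left if new_side == "right" else everything - new_left
                if is_unsafe(unattended):
                    continue
                nxt = (new_left, new_side)
                if nxt not in parent:
                    parent[nxt] = (state, cargo)
                    queue.append(nxt)
    return None


def classic_unsafe(bank):
    """Constraint encoding for the classic wolf/goat/cabbage puzzle."""
    return {"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank


if __name__ == "__main__":
    print(solve(["wolf", "goat", "cabbage"], 1, classic_unsafe))
```

For the classic setup this returns a seven-crossing plan that starts and ends with the goat. Changing the puzzle means rewriting `classic_unsafe` (and possibly the state model itself), not just the list of items, which is why a model that only pattern-matches the classic version won’t generalize by calling a solver like this.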