Got the pointer to this from Allison Parrish who says it better than I could:

it’s a very compelling paper, with a super clever methodology, and (i’m paraphrasing/extrapolating) shows that “alignment” strategies like RLHF only work to ensure that it never seems like a white person is saying something overtly racist, rather than addressing the actual prejudice baked into the model.

  • @antifuchsOP
    link
    English
    1419 days ago

    I don’t think anyone is surprised, but brace yourself for the next round of OpenAI and peers claiming to fix this issue.