Artificial Generalized Incompetence

  • DarkCloud@lemmy.world
    2 days ago

    “Data gathering” and “training data” are just what they’ve tricked you into calling it (just like they tried to trick people into calling it an “intelligence”).

    It’s not data gathering, it’s stealing. It’s not training data, it’s our original work.

    It’s not creating anything, it’s searching and selectively remixing the human creative work of the internet.

    • MartianSands@sh.itjust.works
      2 days ago

      You’re putting words in my mouth, and inventing arguments I never made.

      I didn’t say anything about whether the training data is stolen or not. I also didn’t say a single word about intelligence, or originality.

      I haven’t been tricked into using one piece of language over another. I’m a software engineer and know enough about how these systems actually work to reach my own conclusions.

      There is no database tucked away anywhere in the LLM which you could search through to find the phrases it was trained on; it simply doesn’t exist.

      That isn’t to say it’s completely impossible for an LLM to spit out something which formed part of the training data, but it’s pretty rare. 99% of what it generates doesn’t come from anywhere in particular, and you wouldn’t find it in any of the sources which were fed to the model in training.
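
A toy illustration of the point above, offered as a rough sketch rather than a description of how any real LLM is built: the bigram model below is a deliberately tiny stand-in (an assumption for illustration, not a transformer), and the corpus and function names are invented. It keeps only next-word counts, not the training sentences, yet it readily produces sentences that appear nowhere in its training text. Every individual word-to-word step, however, was observed in training.

```python
import random
from collections import defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

def train_bigrams(sentences):
    """Count which word follows which; the sentences themselves are not kept."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, rng, max_words=12):
    """Sample one word at a time from the stored next-word counts."""
    word, out = "<s>", []
    while len(out) < max_words:
        followers = counts[word]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

model = train_bigrams(corpus)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [generate(model, rng) for _ in range(200)]

# Sentences the model produced that are not verbatim in the training text.
novel = sorted({s for s in samples if s not in corpus})
print(len(novel), "distinct novel sentences, for example:", novel[:3])
```

The only thing stored after training is the table of counts, so there is nothing to look a sentence up in. Real models are vastly larger and work on learned weights rather than explicit counts, so this sketch says nothing about how often they reproduce training text.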

      • DarkCloud@lemmy.world
        2 days ago

        It’s searched in training and tagged for use/topic, then that info is processed and filtered through layers. So it’s pre-searched, if you will. Like meta tags in the early internet.

        Then the data is processed into cells which queries flow through during generation.

        “99% of what it generates doesn’t come from anywhere in particular, and you wouldn’t find it in any of the sources which were fed to the model in training.”

        Yes it does. The fact that you in particular can’t recognize where it comes from doesn’t matter. It’s still using copyrighted works.

        Anyways, you’re an AI stan, and defending theft. You can deny it all day, but it’s what you’re doing. “It’s okay, I’m a software engineer, I’m allowed to defend it.”

        …as if being a software engineer stops you from also being a dumbass. Of course it doesn’t.

        • MartianSands@sh.itjust.works
          2 days ago

          You’re still putting words in my mouth.

          I never said they weren’t stealing the data.

          I didn’t comment on that at all, because it’s not relevant to the point I was actually making: treating the output of an LLM as if it were derived from any factual source at all is really problematic, because it isn’t.

          • DarkCloud@lemmy.world
            1 day ago

            Our discussion was never about factuality. You’ve just now raised that term for the first time in this discussion. You said search engine. They are in fact searching and reconstructing data based on a probabilistic data space.

            …and there are plenty of examples of search engines being sued for the types of data they’ve explored or digitized.

            …also, the inference that search engines are “accurate”, or don’t serve up misinformation and manipulated data, is foolish.