@pi_crew@chaos.social Aside from the serious (unsolvable) reliability issues, a huge problem with this technology is that it is fundamentally based on mass exploitation of labour. That also applies to the "local" tools.
We should not consider this an acceptable thing to promote in the community.
@twoolie @pi_crew@chaos.social There is no such thing as "local-only LLMs".
LLMs (it's in the name!) by their nature are trained on an input corpus so large that you cannot really acquire it through ethical means, let alone train it entirely locally. The technology simply cannot work with that little input data.
(That doesn't mean that local-only *autocompletion* cannot be a thing, but that wouldn't be an LLM.)
By "local-only" I mean models that run context adaptatin, inference, and fine-tuning entirely on the user's machine.
LLMs are indeed trained in datacenters on a large corpus, but that does not immediately make them unethical. Specifically for code, open source datasets are available to train from, and as long as we train the model not to reproduce large chunks of the input corpus, the result abides by the terms of the license.
It's ethically no different than if I were to read all your source code and use what I learned to write code for my application.
TabbyML (and the underlying SantaCoder models) *are LLMs* that run entirely on the user's machine and perform local, syntax-aware autocompletion without reproducing large chunks of their training data unattributed.
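(For concreteness, here's a rough sketch of what "runs entirely on the user's machine" means: the published SantaCoder weights can be downloaded once and then queried offline, e.g. via the Hugging Face transformers library. This isn't Tabby's own serving stack, just an illustration of local inference with the bigcode/santacoder checkpoint.)

```python
# Minimal sketch: local code completion with the bigcode/santacoder checkpoint.
# Assumes the `transformers` and `torch` packages are installed; after the
# one-off weight download, inference runs offline on the local machine.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# SantaCoder ships custom model code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_new_tokens=40)
print(tokenizer.decode(outputs[0]))
```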
LLMs are just an architecture; it's not productive to say "ALL LLM BAD!" without articulating where the mass exploitation happens. We seem to fundamentally disagree here: you think the mass exploitation happens at the corpus-collection and training step, whereas I think it happens at time of use, when corps sell our knowledge back to us with attribution removed.
@twoolie I have already told you exactly where the exploitation happens, and it's trivial to find the documentation on e.g. the exploitation of the Kenyan workers involved. That you don't want to hear it is not my problem.
Also, get out of here with this "technology is neutral" rhetoric.
@twoolie Many of these "open-source datasets" are scraped non-consensually. The ones that aren't result in useless LLMs, because there isn't enough training data - people have tried.
So yes, it *does* immediately make them unethical, when *in practice* it is impossible to ethically acquire sufficient training material to make the technology work.
Tabby is trained on non-consensually collected data.
Yes, people have tried, and succeeded! That's what SantaCoder (the default Tabby model) is! Maybe you should Read The Fucking Paper before regurgitating talking points like a Hacker News comment thread.
https://arxiv.org/abs/2301.03988
SantaCoder is proof positive that an ethically acquired dataset can be used for training and produce good results, no Kenyan exploitation required.
Technology *is* neutral, dipshit. A nuke can be used to blow up a city, or put out a gas well fire. https://interestingengineering.com/science/soviet-engineers-detonated-a-nuke-miles-underground-to-put-out-a-gas-well-fire
LLMs can be created and used responsibly; we know how to do it. Let's not tar an entire field of study with the same brush, ok?
@twoolie The BigCode dataset was collected without consent.
@joepie91 @pi_crew I feel that local-only coding LLMs are no more exploitative than spending a week reading a bunch of docco and tutorials to get OK at a new language/library. The ML model just automates this away. It doesn't magically make you a good programmer.
What I find exploitative is when it's configured to reproduce large sections of its input corpus while stripping attribution, and sold as a service.
If it's a local-only solution, trained to produce local auto-completions or suggest the best method usages, then that seems well within the spirit of the OSS software it was trained on.