There are other ways it might work. For example, a compression method might be discovered that cuts RAM and compute requirements by 2-3 orders of magnitude. Models considered very large today (100-300 billion parameters at full quality) could then run effectively on a single 32GB GPU that costs a few thousand dollars.
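A rough sketch of the arithmetic shows why that would take orders of magnitude, not just aggressive quantization. This is back-of-envelope only (weights alone, ignoring activations, KV cache, and framework overhead), and the function name is just illustrative:

```python
# Back-of-envelope memory footprint for model weights at different precisions.
# Illustrative arithmetic only; real runtime memory also includes activations,
# the KV cache, and framework overhead.

def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for bits in (16, 8, 4, 2):
    gb = weight_memory_gb(100, bits)
    print(f"100B params @ {bits}-bit: {gb:.0f} GB")
```

Even at an aggressive 4 bits per parameter, a 100B model needs roughly 50 GB for weights alone, well over a 32GB card, which is why fitting one comfortably would require gains beyond today's standard quantization.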
You might want to check in on how well distilled / quantized models are doing, compared to gigundo datacenter versions.