I'm no expert, and I'm biased because of my current work, but fine-tuning (especially LoRA) is one of the most powerful and reliable ways to get LLMs to fit your custom needs. Better than prompt engineering and even embedding-based approaches. That's all I can say.
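Part of why LoRA works so well in practice is the parameter math. A quick sketch (the 1024-dim layer and rank 8 are just illustrative numbers, not from any specific model):

```python
# LoRA idea: freeze the base weight W (d_out x d_in) and learn a low-rank
# update delta_W = B @ A, where A is (r x d_in) and B is (d_out x r), r << d.
# Trainable parameters drop from d_out * d_in to r * (d_in + d_out).

d_in, d_out, r = 1024, 1024, 8   # hypothetical layer size and rank

full_params = d_in * d_out            # full fine-tune of this layer
lora_params = r * (d_in + d_out)      # LoRA adapter for the same layer

print(full_params)                 # 1048576
print(lora_params)                 # 16384
print(full_params / lora_params)   # 64.0x fewer trainable params
```

At rank 8 on a square 1024-dim layer you train ~1.5% of the weights, which is why you can often tune on a single consumer GPU.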
There are good mathematical rules of thumb for estimating FLOPs per token for both inference and training. Are there any for fine-tuning? I haven't come across any.
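For reference, the standard rules of thumb are ~2N FLOPs/token for inference and ~6N for training (N = parameter count). For LoRA specifically I haven't seen an established rule either, but here's a hand-wavy estimate, clearly labeled as an assumption: the forward pass and the activation-gradient backward pass still go through the frozen base weights, while the weight-gradient pass is only needed for the tiny adapter.

```python
def flops_per_token(n_params, mode="inference", n_adapter_params=0):
    """Back-of-envelope FLOPs per token.

    'inference' and 'train' use the standard 2N / 6N rules of thumb.
    'lora' is a rough guess, NOT an established rule: 2N forward +
    2N activation-grad backward through the frozen base, plus a full
    6x pass over the (small) trainable adapter weights.
    """
    if mode == "inference":
        return 2 * n_params
    if mode == "train":
        return 6 * n_params
    if mode == "lora":
        return 4 * n_params + 6 * n_adapter_params
    raise ValueError(f"unknown mode: {mode}")

n = 7e9    # hypothetical 7B-parameter model
a = 20e6   # hypothetical ~20M LoRA adapter parameters

print(flops_per_token(n))             # 14000000000.0  (2N)
print(flops_per_token(n, "train"))    # 42000000000.0  (6N)
print(flops_per_token(n, "lora", a))  # 28120000000.0  (~4N + adapter)
```

The upshot of this estimate: LoRA's big savings are in optimizer memory and trainable-parameter count, not raw per-token compute, which stays the same order of magnitude as full fine-tuning.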
Have you considered the arguments presented here: https://warpcast.com/abhishek1point0/0x2c431a
My pipeline has been:
- test wild prompts on the largest (340B+) model I can find
- take the best outputs and test them on 24Bs
- generate 100+ examples and see if I can get it tuned on something much smaller

And yes, LoRA has proven to be incredible for a lot of my use cases
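The last step of a pipeline like this usually means dumping the curated (prompt, best-output) pairs into a training file. A minimal sketch, assuming a common JSONL prompt/completion format (the filename, field names, and example pair are all hypothetical):

```python
import json

# Hypothetical curated pairs: prompts plus the best outputs picked
# from the large model, destined for a small-model LoRA run.
pairs = [
    ("Summarize this ticket in one sentence.",
     "User reports login timeouts after the v2.3 deploy."),
]

with open("train.jsonl", "w") as f:
    for prompt, completion in pairs:
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```

One record per line keeps the file streamable, and most fine-tuning toolkits accept something close to this shape (field names vary by framework).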
For the prompt engineering crowd, I wonder how they're measuring the quality of their prompts...