I'm no expert, and I'm biased because of my current work, but fine-tuning (especially LoRA) is one of the most powerful and reliable ways to get LLMs to fit your custom needs. Better than prompt engineering and even embedding-based approaches. That's all I can say.
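Part of why LoRA works so well in practice is the parameter math. A quick sketch (the 1024-dim layer and rank 8 are just illustrative numbers, not from any specific model):

```python
# LoRA idea: freeze the base weight W (d_out x d_in) and learn a low-rank
# update delta_W = B @ A, where A is (r x d_in) and B is (d_out x r), r << d.
# Trainable parameters drop from d_out * d_in to r * (d_in + d_out).

d_in, d_out, r = 1024, 1024, 8   # hypothetical layer size and rank

full_params = d_in * d_out            # full fine-tune of this layer
lora_params = r * (d_in + d_out)      # LoRA adapter for the same layer

print(full_params)                 # 1048576
print(lora_params)                 # 16384
print(full_params / lora_params)   # 64.0x fewer trainable params
```

At rank 8 on a square 1024-dim layer you train ~1.5% of the weights, which is why you can often tune on a single consumer GPU.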
There are good mathematical rules of thumb for estimating FLOPs per token for both inference and training. Are there any for fine-tuning? I haven't come across any.
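For reference, the standard rules of thumb are ~2N FLOPs/token for inference and ~6N for training (N = parameter count). For LoRA specifically I haven't seen an established rule either, but here's a hand-wavy estimate, clearly labeled as an assumption: the forward pass and the activation-gradient backward pass still go through the frozen base weights, while the weight-gradient pass is only needed for the tiny adapter.

```python
def flops_per_token(n_params, mode="inference", n_adapter_params=0):
    """Back-of-envelope FLOPs per token.

    'inference' and 'train' use the standard 2N / 6N rules of thumb.
    'lora' is a rough guess, NOT an established rule: 2N forward +
    2N activation-grad backward through the frozen base, plus a full
    6x pass over the (small) trainable adapter weights.
    """
    if mode == "inference":
        return 2 * n_params
    if mode == "train":
        return 6 * n_params
    if mode == "lora":
        return 4 * n_params + 6 * n_adapter_params
    raise ValueError(f"unknown mode: {mode}")

n = 7e9    # hypothetical 7B-parameter model
a = 20e6   # hypothetical ~20M LoRA adapter parameters

print(flops_per_token(n))             # 14000000000.0  (2N)
print(flops_per_token(n, "train"))    # 42000000000.0  (6N)
print(flops_per_token(n, "lora", a))  # 28120000000.0  (~4N + adapter)
```

The upshot of this estimate: LoRA's big savings are in optimizer memory and trainable-parameter count, not raw per-token compute, which stays the same order of magnitude as full fine-tuning.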
Have you considered the arguments presented here: https://warpcast.com/abhishek1point0/0x2c431a
My pipeline has been:
- test wild prompts on the largest (340B+) model I can find
- take the best outputs and test them on 24Bs
- generate 100+ examples and see if I can get it tuned on something much smaller

And yes, LoRA has proven to be incredible for a lot of my use cases
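The last step of a pipeline like this usually means dumping the curated (prompt, best-output) pairs into a training file. A minimal sketch, assuming a common JSONL prompt/completion format (the filename, field names, and example pair are all hypothetical):

```python
import json

# Hypothetical curated pairs: prompts plus the best outputs picked
# from the large model, destined for a small-model LoRA run.
pairs = [
    ("Summarize this ticket in one sentence.",
     "User reports login timeouts after the v2.3 deploy."),
]

with open("train.jsonl", "w") as f:
    for prompt, completion in pairs:
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```

One record per line keeps the file streamable, and most fine-tuning toolkits accept something close to this shape (field names vary by framework).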
For the prompt engineering crowd, I wonder how they're measuring the quality of their prompts...