With ChatGPT sparking public interest in large language models (LLMs) like GPT, many big tech companies are racing to release their own pre-trained LLMs (Bard from Google, LLaMA from Meta). These LLMs are trained on a massive corpus of human language and can perform many tasks out of the box, like text generation, summarization, and categorization. But these pre-trained foundation models are very expensive to train, so only a few big players in the industry can afford to build them.

You might wonder: how could smaller companies ride this wave of AI if they can't train their own foundation models from scratch? The answer is to build on top of foundation LLMs by tailoring them to the specific tasks their customers want to accomplish. For this customization, we have several techniques: fine-tuning, prompt engineering, and prompt-tuning.

As an entrepreneur or product manager, you probably care about the ROI of using these techniques and the feasibility of spinning the AI/data flywheel. In this article, we will explore and compare the different techniques for building on top of foundation LLMs for specific downstream tasks, so that you can leverage AI to create valuable products for your customers.

Understanding pre-trained LLMs & techniques to improve them

To understand the techniques, we first need a high-level understanding of how an LLM, or any neural network, works. You can think of an LLM as a digital brain with many layers of nodes in it, just like a human brain with layers of neurons. The digital brain takes in some signals (the input sequence), transforms them through the layers of nodes, and outputs the result. The training process is just like how human babies learn: they see a lot of examples (from the corpus) and their neurons are wired to adapt to those examples (the mathematical expressions/parameters of the nodes are updated). Once trained, they can perform tasks similar to what they saw during training.
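If you're curious what "signals passing through layers of nodes" looks like in code, here is a toy sketch. It's just a random two-layer network in numpy, nothing remotely like a real LLM, but it shows the basic mechanic of input flowing through weighted layers:

```python
# Toy illustration of a neural network's forward pass: an input vector is
# transformed by successive layers of weighted nodes into an output vector.
# Drastically simplified sketch; real LLMs have billions of parameters.
import numpy as np

def layer(x, weights, bias):
    # Each node sums its weighted inputs and applies a nonlinearity.
    return np.tanh(weights @ x + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # the "input signals"
w1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
w2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)

output = layer(layer(x, w1, b1), w2, b2)  # signals pass through two layers
print(output)
```

Training simply means adjusting all those weights and biases until the outputs match the examples the network is shown.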

(In this post, I’m simplifying technical concepts for non-technical readers. If you want a slightly more technical and more in-depth explanation of LLMs like GPT, you can read this article: How ChatGPT really works, explained for non-technical people)

Because pre-trained LLMs are huge, with billions of parameters, and are trained on a massive corpus, they have learned how language works and picked up some general knowledge. They can do many tasks right away, but the quality might not be ideal because they are not so “familiar” with the specific tasks you want them to do. At this stage, these pre-trained LLMs with general knowledge are called foundation models, just like someone who has completed a general education but hasn't been trained professionally.

To improve the performance of these pre-trained LLMs on specific tasks, we can use fine-tuning, prompt engineering, or prompt-tuning.

Fine-tuning is a widely used method in AI. It means that we update some of the model's parameters with additional data to adapt it from a general-purpose model to a specific task [1]. Because most of the parameters change only slightly from the foundation model, the fine-tuned model retains much of the “knowledge” from the pre-training process. Taking the generally educated person as an example: if you show them many examples of joke-writing, they will learn to write jokes, drawing on what they learned about language and writing in general. But there is a risk of overfitting: after seeing and writing too many jokes, the person may no longer be able to write anything other than jokes well.
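To make this concrete, here is a minimal sketch of a fine-tuning step using the open-source Hugging Face transformers library, with the small GPT-2 model standing in for a foundation LLM. The joke example is made up, and a real run would need a full dataset, batching, and many epochs:

```python
# Minimal fine-tuning sketch: nudge a pre-trained model's parameters on
# task-specific examples. Illustrative only, not a production recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

examples = [
    "Write a joke. Knock knock. Who's there? Lettuce. Lettuce who? "
    "Lettuce in, it's cold out here!",
]
model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    # For causal LMs, passing labels=input_ids computes the next-token loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()       # gradients update the pre-trained parameters
    optimizer.step()
    optimizer.zero_grad()
```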

Another technique to guide pre-trained LLMs to perform a specific downstream task is prompt engineering. This approach aims to design natural-language prompts and task descriptions that elicit high-quality responses without updating the LLM's parameters [1][2]. With the most recent LLMs, researchers found that specific prompt engineering techniques like few-shot prompting or chain-of-thought prompting can substantially improve output quality, even without fine-tuning [3][4]. It's like telling the person exactly what they should do, providing them with context, and giving them one or a few examples. For example, instead of “write me a joke”, you specify “write a joke starting with ‘Knock knock. Who’s there?’” to get the exact type of joke you want.
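Here is what a one-shot prompt might look like in code, using OpenAI's completion API as it looked in early 2023 (the model name and API shape may have changed since). Note that nothing in the model is updated; all the work happens in the prompt text:

```python
# Prompt engineering sketch: no parameters are trained, we only craft the
# input text. Uses OpenAI's early-2023 completion API; subject to change.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = """Write a knock-knock joke about winter.

Example:
Knock knock. Who's there? Lettuce. Lettuce who? Lettuce in, it's cold out here!

Now write a new one:
Knock knock."""

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=60,
)
print(response["choices"][0]["text"])
```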

(If you want to learn more about the specific techniques in prompt engineering, you can read this article: How to use ChatGPT in product management)

Researchers have also explored prompt-tuning, which is a compromise between fine-tuning and prompt engineering. In prompt-tuning, we only update a small set of prompt-embedding parameters for a particular task (i.e. we tune the prompt itself, represented as vectors) without retraining the model or updating its parameters [5]. Embeddings translate human text into vectors that the LLM can understand. Since we don't need to update the parameters of the huge LLM, this saves a lot of resources compared to fine-tuning. Compared to prompt engineering, it can provide better-quality outputs and, most of the time, saves the human labor of crafting prompts. It's like letting an educated person figure out how to write jokes, so you don't have to teach them yourself.
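Here is a minimal sketch of the idea behind prompt-tuning, following Lester et al. [5]: the LLM itself is frozen, and only a small block of "soft prompt" vectors prepended to the input embeddings is trained. Again, GPT-2 stands in for a real foundation model, and the example is illustrative rather than a complete training loop:

```python
# Prompt-tuning sketch: freeze the LLM, learn only a small "soft prompt"
# of embedding vectors prepended to the input (after Lester et al. [5]).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # the LLM's own parameters stay fixed

n_virtual = 20  # length of the learned prompt, in "virtual tokens"
dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(1, n_virtual, dim) * 0.02)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)  # train only these

batch = tokenizer("Knock knock. Who's there?", return_tensors="pt")
token_embeds = model.get_input_embeddings()(batch["input_ids"])
inputs_embeds = torch.cat([soft_prompt, token_embeds], dim=1)

# Mask the virtual tokens out of the loss (-100 is the ignore index).
labels = torch.cat(
    [torch.full((1, n_virtual), -100), batch["input_ids"]], dim=1
)
loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()   # gradients flow only into soft_prompt
optimizer.step()
```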

Comparison of fine-tuning, prompt engineering, and prompt-tuning

Now you know there are different techniques to make a pre-trained LLM help you with specific tasks, and you may be wondering how to choose the appropriate one. Let's compare these techniques by the data input needed, training cost, technical difficulty, and response quality.

For the data input required for these techniques to work, fine-tuning >> prompt-tuning > prompt engineering. Fine-tuning requires a high-quality dataset of input/output pairs to learn how to do your specific task; the size of the dataset depends on task complexity and model size. For example, you need roughly 500 of these pairs to fine-tune text completion on OpenAI's GPT-3 (see the sketch below). And remember that an old saying in machine learning still holds true: “garbage in, garbage out”. The quality of your dataset has to be very good for fine-tuning to produce the expected outcome. For prompt-tuning, you need some labeled examples corresponding to the different tasks you want the LLM to perform, in order to update the prompt embeddings. For prompt engineering, you just need to write a prompt with a few examples (few-shot prompting), some context, a task description, and so on.
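As an illustration, OpenAI's fine-tuning endpoint (as of early 2023) expected those input/output pairs in a JSONL format, which you might prepare like this. The file name and joke examples are invented for the sketch:

```python
# Sketch of preparing a fine-tuning dataset in OpenAI's early-2023
# JSONL prompt/completion format. Examples invented for illustration.
import json

pairs = [
    {"prompt": "Write a joke about cats ->",
     "completion": " Knock knock. Who's there? ..."},
    {"prompt": "Write a joke about rain ->",
     "completion": " Knock knock. Who's there? ..."},
    # ...roughly 500 high-quality pairs in total
]

with open("jokes.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```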

For the training cost, fine-tuning >> prompt-tuning > prompt engineering. As mentioned earlier, since foundation LLMs are massive, it's really costly to fine-tune them and update the billions of parameters. For prompt-tuning, since you don't need to update the entire LLM, just the prompt embeddings, training is much cheaper. For prompt engineering, each modification of your human-written prompt is just normal usage with no parameter updates, so the cost is quite low.

For technical difficulty, prompt-tuning > fine-tuning >> prompt engineering. Note that this statement holds as of early 2023, and I'm writing from a third-party business owner's point of view. Because LLM providers offer wrapped APIs for fine-tuning, the technical difficulty of using those APIs is not very high. For prompt-tuning, since it's still a young research method, there are no readily accessible APIs to use, but this might change in the future. Prompt engineering, on the other hand, does not require technical skills like coding or machine learning, though you'll benefit from a general understanding of how neural networks and LLMs work.
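For example, with OpenAI's wrapped API as it looked in early 2023 (the endpoints may have changed since), launching a fine-tune boiled down to a couple of calls:

```python
# Sketch of launching a fine-tune through OpenAI's wrapped API
# (early-2023 interface; endpoint names may have changed since).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Upload the JSONL dataset prepared earlier.
upload = openai.File.create(
    file=open("jokes.jsonl", "rb"), purpose="fine-tune"
)

# Kick off fine-tuning on a base model; training runs on OpenAI's side.
job = openai.FineTune.create(training_file=upload["id"], model="davinci")
print(job["id"])
```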

For the response quality, it's generally true that fine-tuning > prompt-tuning > prompt engineering, but it depends on many factors like task complexity, model size, and dataset quality. Fine-tuning can give you more consistent output tailored to your specific use case, imitating your fine-tuning dataset, but creativity may be reduced and you run the risk of overfitting. Prompt engineering, on the other hand, offers a lot of flexibility and creativity, but the output may not be as consistent.

Conclusions

I hope this article helps you understand the different techniques for getting pre-trained foundation models to do specific tasks for your business. Before deciding which technique to use, you should evaluate the requirements of the downstream task, the amount of labeled data you have, and the resources available to you (including knowledgeable employees, money, and time). Since prompt engineering is easy and non-technical but can produce decent results, I recommend starting with prompt engineering before moving on to prompt-tuning and fine-tuning.

References:

[1] “What Is a Large Language Model (LLM)?” MLQ.Ai, 15 Dec. 2022, https://www.mlq.ai/what-is-a-large-language-model-llm/.

[2] “Prompting: Better Ways of Using Language Models for NLP Tasks.” Princeton NLP, 28 June 2021, https://princeton-nlp.github.io/prompting/.

[3] Brown, Tom B., et al. Language Models Are Few-Shot Learners. arXiv, 22 July 2020. arXiv.org, https://doi.org/10.48550/arXiv.2005.14165.

[4] Wei, Jason, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv, 10 Jan. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2201.11903.

[5] Lester, Brian, et al. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv, 2 Sept. 2021. arXiv.org, https://doi.org/10.48550/arXiv.2104.08691.