Sed ut perspiciatis unde.
Like everyone else, this is the first time I am experiencing closed research. Since I was in college, all frontier research has been open and peer-reviewed, until recently. And I believe openness ultimately advances science more than closedness.
If we aim to match the performance of ChatGPT through open source, I believe we need to start taking training data more seriously. A substantial part of ChatGPT’s effectiveness might not come from, say, specific ML architecture, fine-tuning techniques, or frameworks. But more likely, it’s from the breadth, scale and quality of the instruction data.
To put it bluntly, fine-tuning large language models on mediocre instruction data is a waste of compute. Let’s take a look at what has changed in the training data and learning paradigm—how we are now formatting the training data differently and therefore learning differently than in past large-scale pre-training.
RLHF stands for Reinforcement Learning from Human Feedback. It has two main components:
- Reinforcement Learning (RL)
- Human Feedback (HF)