Introduction
When training large language models (LLMs), we are frequently constrained by compute budgets. Such a constraint leads to a fundamental trade-off: if you fix a compute budget, increasing the model size means you must reduce the amount of data you can train on, and vice versa. So you are left asking the question:
Should we allocate more compute to a model with more parameters, or should we train it on more data?
In particular, LLMs' performance and efficiency are largely determined by this trade-off. It is thus essential to find an optimal balance between the number of parameters of a model and the number of training tokens.
The total training compute of a transformer roughly scales as C ∝ N × D, where
- N is the number of model parameters.
- D is the number of training tokens.
- C is the compute budget, which we hold fixed.
It is easy to see that for a fixed C, N and D are inversely proportional to each other.
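As a quick illustration (the numbers here are arbitrary and only meant to build intuition):
# For a fixed compute budget C = N * D, doubling the parameter count N
# halves the number of tokens D we can afford (arbitrary units).
C = 1e15
for N in (1e7, 2e7, 4e7):
    D = C / N
    print(f"N = {N:.0e} params -> D = {D:.0e} tokens")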
Earlier studies (Kaplan et al., 2020; Hoffmann et al., 2022) have found that the training loss of machine learning models follows a power law in compute, L(C) ∝ C^{−α}, and that the optimal model size and dataset size scale with compute as N_opt ∝ C^a and D_opt ∝ C^b for some positive exponents a and b.
In this article, we will use tiny Transformers to explore how to balance N and D under a fixed compute budget C.
Experiment Setup
We design a minimal transformer model, which we call the "tiny transformer", with the following configurable properties that determine the model's parameter count (a sketch of a possible implementation follows the list):
- Model dimension (d_model)
- MLP dimension (d_mlp)
- Number of layers (n_layers)
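The exact architecture is not central to the experiment; a minimal sketch of what such a tiny transformer could look like (the head count, positional embedding, and other details are assumptions) is:
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """A minimal decoder-only language model (sketch; details are assumptions)."""
    def __init__(self, vocab_size, d_model, d_mlp, n_layers, n_heads=2, seq_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(seq_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_mlp,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        _, t = input_ids.shape
        pos = torch.arange(t, device=input_ids.device)
        x = self.embed(input_ids) + self.pos(pos)
        # causal mask so each position only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(input_ids.device)
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)

def count_params(model):
    # total number of trainable parameters, used as N below
    return sum(p.numel() for p in model.parameters() if p.requires_grad)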
We train transformers of various configurations on tokenized sequences of length 64 from the WikiText-2 dataset, as sketched below.
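The data preparation is not shown in the original post; a minimal sketch, assuming a GPT-2 tokenizer and the Hugging Face datasets library, could look like this (the names SEQ_LEN, tokenized_dataset, and collate_fn are reused in the training loop later):
from datasets import load_dataset
from transformers import AutoTokenizer

SEQ_LEN = 64

# Tokenizer choice is an assumption; any small tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=SEQ_LEN)

tokenized_dataset = raw.map(tokenize_fn, batched=True, remove_columns=["text"])

def collate_fn(examples):
    # pad every sequence in the batch to SEQ_LEN and return the token ids
    batch = tokenizer.pad(
        examples, padding="max_length", max_length=SEQ_LEN, return_tensors="pt"
    )
    return batch["input_ids"]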
To test the effect of scaling, we define a grid of models ranging from very small (16 hidden units, 1 layer) to relatively large (128 hidden units, 4 layers) and combine them with token budgets ranging from 5k to 1M. See the code below:
model_configs = [
{"d_model": 16, "d_mlp": 64, "n_layers": 1},
{"d_model": 24, "d_mlp": 96, "n_layers": 1},
{"d_model": 32, "d_mlp": 128, "n_layers": 2},
{"d_model": 48, "d_mlp": 192, "n_layers": 2},
{"d_model": 64, "d_mlp": 256, "n_layers": 3},
{"d_model": 96, "d_mlp": 384, "n_layers": 3},
{"d_model": 128, "d_mlp": 512, "n_layers": 4},
]
# number of tokens (D) we train on, simulated via steps x batch x seq_len
token_budgets = [5e3, 1e4, 3e4, 5e4, 1e5, 3e5, 5e5, 1e6] # small for demo
By approximating the compute cost as C ≈ N × D, our idea is to measure the final training loss for each (N, D) pair and find the pair that reaches the minimum loss for a given C: this is the balance we are looking for.
Implementation and Observations
We use the code below to train the model for a fixed number of steps with different (N, D) pairs and record the results.
import torch
from torch.utils.data import DataLoader

results = []
device = "cuda" if torch.cuda.is_available() else "cpu"

for cfg in model_configs:
    model = TinyTransformer(vocab_size=len(tokenizer), **cfg)
    N_params = count_params(model)
    for D in token_budgets:
        steps = int(D // (SEQ_LEN * 16))  # assuming batch_size=16
        dataloader = DataLoader(
            tokenized_dataset["train"].shuffle(seed=0),
            batch_size=16,
            collate_fn=collate_fn,
        )
        avg_loss = train_one(model, dataloader, steps=steps, device=device)
        compute = N_params * D
        results.append({
            "N": N_params,
            "D": D,
            "C": compute,
            "loss": avg_loss,
        })
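The helper train_one is not shown in the post; a minimal sketch, assuming it runs a standard next-token-prediction loop with AdamW and returns the loss averaged over the last few steps, might look like this:
import torch.nn.functional as F

def train_one(model, dataloader, steps, device, lr=3e-4):
    # Assumed helper: train for a fixed number of steps, return a recent average loss.
    model.to(device)
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    losses, it = [], iter(dataloader)
    for _ in range(max(steps, 1)):
        try:
            input_ids = next(it)
        except StopIteration:
            it = iter(dataloader)
            input_ids = next(it)
        input_ids = input_ids.to(device)
        logits = model(input_ids)
        # shift so that position t predicts token t + 1
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    tail = losses[-10:]
    return sum(tail) / len(tail)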
We then plot the final loss against the compute (N × D); a sketch of the plotting code is given below.
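A minimal plotting sketch, assuming matplotlib and pandas and using the results list collected above (the original figure itself is not reproduced here):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(results)
plt.figure(figsize=(6, 4))
for n, group in df.groupby("N"):
    group = group.sort_values("C")
    plt.plot(group["C"], group["loss"], marker="o", label=f"N={n:,}")
plt.xscale("log")
plt.xlabel("Compute C = N x D")
plt.ylabel("Final training loss")
plt.legend(fontsize=7)
plt.tight_layout()
plt.show()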
We have the following key observations:
- For small compute budgets, small models trained on most of the available data perform better than larger models trained on very little data.
- For large compute budgets, larger models become better once enough data is available.
- The optimal model size does not grow linearly with the compute budget. For example, doubling the compute does not simply double the optimal number of parameters.
The plot below gives the efficient frontier across model sizes, that is, the set of model sizes that achieve the lowest loss for a given compute.

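For reference, one possible way to extract this frontier from the results collected above (the exact construction used for the original plot is an assumption) is:
import pandas as pd

# Sort by compute and keep the points where the loss reaches a new minimum;
# these points form the efficient frontier used for the fits below.
df = pd.DataFrame(results).sort_values("C").reset_index(drop=True)
frontier = df[df["loss"] == df["loss"].cummin()].copy()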
"Best" Model
To determine the "best" model, we pick the pair of model size and number of tokens that minimizes the loss at a fixed budget.
We assume both follow a power-law relationship, N_opt ∝ C^α and D_opt ∝ C^β, and we estimate the unknown exponents α and β with the following steps:
- Take the logarithm of both quantities: log(N_opt) = α log(C) + const, log(D_opt) = β log(C) + const.
- Fit a linear regression in log-log space. The slope of the regression is exactly the power-law exponent.
The following code performs such a regression:
import numpy as np
import scipy.stats as st

# Fit log-log linear regressions on the frontier points
a_slope, a_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.N))
b_slope, b_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.D))
In our toy experiment, we found that N_opt ∝ C^0.14 and D_opt ∝ C^0.86. This result may not reveal the whole picture, because we ran the experiment with a simplified model and configurations. But we can still see that growth in compute leads to an increase in the optimal model size, albeit at a diminishing rate; most of the additional budget should go to more training tokens.
Moreover, the fit above implies that the optimal ratio scales as N_opt/D_opt ∝ C^{−0.72}. This suggests that when you increase compute, you should add more training tokens rather than mostly increasing model size.
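The ratio exponent can be read off directly from the two fitted slopes:
# N_opt / D_opt ∝ C^(α − β); with α ≈ 0.14 and β ≈ 0.86 this gives ≈ −0.72.
ratio_exponent = a_slope - b_slope
print(f"N_opt / D_opt scales roughly as C^{ratio_exponent:.2f}")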
Practical Takeaways
From this experiment, although a toy case, we can extract several insights:
- For a fixed budget, a medium-sized model trained on more data can outperform a very large model trained on limited data.
- Optimal model size and data size both grow with compute. Don't train a model with many parameters if you have a small budget.
- When the budget increases, first consider the optimal ratio N_opt/D_opt to determine whether you should increase the model size or add more training data.
Conclusion
In this blog post, we studied the trade-off between model size and data under a fixed compute budget for LLMs with a toy case. The experiment shows that we can find the optimal pair of model size and token count that achieves the best model performance for a given budget, helping researchers and practitioners design LLMs wisely and get the best results.
References
[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models.
[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models.
