While training language models (LMs), data is drawn from a variety of domains. For example, the publicly accessible Pile dataset combines online data (24%), Wikipedia (9%), GitHub (4%), and other sources. The composition of this mixture influences how well an LM performs, yet it is not obvious how much of each domain to include to create a model that performs well across a range of downstream tasks. In practice, domain weights (the sampling probabilities for each domain) are set either by intuition or by tuning against a series of downstream tasks; the Pile uses heuristically selected domain weights, which may not be an ideal choice.
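Concretely, domain weights are just a categorical distribution over data sources that governs how training examples are sampled. Here is a minimal sketch; the domain names and weights are illustrative, not the Pile's exact mixture:

```python
import numpy as np

# Illustrative domain weights (sampling probabilities); they must sum to 1.
domains = ["web", "wikipedia", "github", "other"]
weights = np.array([0.24, 0.09, 0.04, 0.63])

rng = np.random.default_rng(0)

def sample_batch_domains(batch_size: int) -> list[str]:
    """Pick a source domain for each example in a batch.

    Each training example is drawn from domain i with probability
    weights[i], so the mixture seen by the model is controlled
    entirely by this one vector.
    """
    idx = rng.choice(len(domains), size=batch_size, p=weights)
    return [domains[i] for i in idx]

print(sample_batch_domains(8))
```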
Researchers from Google and Stanford set out to identify domain weights that make models perform well on all domains. Rather than optimizing domain weights against a range of downstream tasks, they minimize the worst-case loss over domains.
Each domain has its own entropy, and hence its own optimal achievable loss, so the worst case is measured in terms of excess loss, the gap between the model's loss on a domain and that optimum. The technique, DoReMi (Domain Reweighting with Minimax Optimization), applies distributionally robust optimization (DRO) without any knowledge of the tasks the model will later perform.
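The quantity being minimized can be sketched as follows, assuming we have per-domain losses for the model being trained and per-domain baseline losses approximating each domain's optimum (both arrays are hypothetical inputs):

```python
import numpy as np

def worst_case_excess_loss(model_losses: np.ndarray,
                           baseline_losses: np.ndarray) -> float:
    """Worst-case excess loss over domains.

    model_losses[i]    : current model's loss on domain i
    baseline_losses[i] : estimate of the best achievable loss on
                         domain i (its entropy), which DoReMi
                         approximates with a trained reference model

    Excess loss is clipped at zero in this sketch, since a model
    cannot meaningfully beat a domain's entropy.
    """
    excess = np.maximum(model_losses - baseline_losses, 0.0)
    return float(excess.max())
```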
DoReMi begins by training a small reference model (280M parameters), whose per-domain losses stand in for each domain's optimal loss. It then trains a small distributionally robust LM (DRO-LM) of the same size as a proxy model, minimizing the worst-case excess loss relative to the reference. The domain weights produced by DRO during this run are then used to train the final model.
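A rough sketch of the weight update inside the proxy-model training loop, in the spirit of an exponentiated-gradient (multiplicative-weights) step; the step size and smoothing constant here are illustrative, not the paper's tuned values:

```python
import numpy as np

def update_domain_weights(alpha: np.ndarray,
                          excess_losses: np.ndarray,
                          step_size: float = 1.0,
                          smoothing: float = 1e-3) -> np.ndarray:
    """One multiplicative-weights update on the domain weights.

    Domains where the proxy's excess loss (proxy loss minus reference
    loss) is large get upweighted, so the proxy trains harder on them;
    mixing with the uniform distribution keeps every domain's weight
    bounded away from zero.
    """
    k = len(alpha)
    alpha = alpha * np.exp(step_size * excess_losses)  # exponentiated gradient step
    alpha = alpha / alpha.sum()                        # renormalize to a distribution
    return (1 - smoothing) * alpha + smoothing * np.ones(k) / k

# Toy usage: start uniform over 4 domains, apply one update.
alpha = np.ones(4) / 4
excess = np.array([0.10, 0.02, 0.30, 0.05])
alpha = update_domain_weights(alpha, excess)
print(alpha)  # the domain with the highest excess loss gets the largest weight
```

Domains where the proxy lags furthest behind the reference are upweighted at each step, and the domain weights averaged across the proxy run are what DoReMi returns for training the final model.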
To optimize domain weights on the Pile and the GLaM dataset, they run DoReMi with 280M-parameter proxy and reference models.