The principle
Given a task that decomposes into independent sub-tasks, a system of small models specialized to each sub-task returns more correct answers than a single general model of equal aggregate parameter count.
The argument is straightforward. Specialization concentrates the loss landscape. A model trained only to write in one voice has fewer wrong answers available to it. A model trained only to extract entities from contracts cannot drift into prose. Compose them in a workflow that knows which one to call, and the system as a whole is more reliable than any one model could be.