The paper shows that the embedding-layer learning rate is the main reason μP transfers better than standard parameterization.