Scaling Laws for Language Models on Symbolic Music Data
For my NYU Machine Learning course final project (CS-GY 6923-B), I investigated a fascinating question: do the scaling laws that govern text-based language models also apply to symbolic music? This project compared decoder-only Transformers and LSTM networks at matched parameter counts across multiple model sizes, and found a clear difference in how efficiently the two architectures use additional capacity to model musical structure.
The Research Question
The famous "Scaling Laws for Neural Language Models" paper by Kaplan et al. showed that language model performance follows predictable power-law relationships with model size, dataset size, and compute. But does this hold for domains beyond natural language?
Music, represented in ABC notation, provides an interesting test case:
- It has structure and grammar (like language)
- It has temporal dependencies (melodies, harmonies)
- It's much smaller in vocabulary than natural language
- It has different distributional properties
Dataset: Irish Folk Music in ABC Notation
ABC notation represents music as text tokens, which makes it a natural fit for language-model training. I used real music data from two sources:
The Session (~53,000 tunes)
The Session is a community-driven database of Irish and folk music. Each tune is transcribed in ABC notation:
X:1
T:The Kesh Jig
M:6/8
K:G
|:GAG GAB|ABA ABd|edd gdd|edB dBA|
GAG GAB|ABA ABd|edd gdB|AGF G3:|
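To make the "music as text tokens" framing concrete, here is a minimal character-level tokenizer over ABC strings. The project's actual tokenization scheme isn't specified, so treat this as an illustrative assumption:

def build_vocab(tunes):
    # Map every character that occurs in the corpus to an integer id.
    chars = sorted(set(''.join(tunes)))
    return {ch: i for i, ch in enumerate(chars)}

def encode(abc_string, vocab):
    # One token per character of the ABC text.
    return [vocab[ch] for ch in abc_string]

def decode(token_ids, vocab):
    # Invert the mapping to recover a playable ABC string.
    inv = {i: ch for ch, i in vocab.items()}
    return ''.join(inv[i] for i in token_ids)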
Nottingham Music Database (~1,000 tunes)
Traditional folk tunes providing additional diversity.
Data Augmentation
To reach ~100M training tokens, I applied key transposition—shifting each tune through all 12 keys while preserving the musical structure:
def transpose_abc(abc_string, semitones):
    # Parse the ABC notation, shift every note by `semitones`,
    # and preserve rhythm, structure, and metadata.
    # (Implementation omitted here; a concrete sketch follows below.)
    ...
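The pseudocode above only sketches the idea. One concrete way to get semitone-accurate transposition without hand-rolling an ABC parser is to shell out to the abc2abc tool from the abcMIDI package; that choice is mine for illustration, not necessarily how the project did it:

import subprocess
import tempfile

def transpose_with_abc2abc(abc_string, semitones):
    # Write the tune to a temp file and let abc2abc (from the abcMIDI
    # package, assumed to be installed) shift it by `semitones`;
    # the transposed tune is printed to stdout.
    with tempfile.NamedTemporaryFile('w', suffix='.abc', delete=False) as f:
        f.write(abc_string)
        path = f.name
    result = subprocess.run(
        ['abc2abc', path, '-t', str(semitones)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def augment_all_keys(abc_string):
    # One copy of the tune per semitone offset (0 keeps the original key).
    return [transpose_with_abc2abc(abc_string, k) for k in range(12)]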
Model Architectures
I compared two fundamental architectures at matched parameter counts:
Decoder-Only Transformers
| Model | Parameters | d_model | n_heads | n_layers |
|---|---|---|---|---|
| Tiny | ~1M | 128 | 4 | 4 |
| Small | ~5M | 256 | 8 | 6 |
| Medium | ~20M | 512 | 8 | 8 |
| Large | ~50M | 768 | 12 | 12 |
| XL | ~100M | 1024 | 16 | 16 |
import torch.nn as nn

class MusicTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers):
        super().__init__()
        # Token embedding plus positional encoding (custom module, not shown)
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        # Standard pre-built decoder block: self-attention + feed-forward
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,
            dropout=0.1
        )
        self.transformer = nn.TransformerDecoder(decoder_layer, n_layers)
        # Project hidden states back to the vocabulary for next-token prediction
        self.output = nn.Linear(d_model, vocab_size)
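The class above only defines the layers; to use it as a language model, the forward pass also needs a causal mask so each position attends only to earlier tokens. A minimal sketch, assuming the MusicTransformer class above; feeding the embedded sequence back in as the decoder's memory under the same causal mask is one way to drive nn.TransformerDecoder in a decoder-only setup, not necessarily how the project did it:

import torch

def causal_forward(model, token_ids):
    # token_ids: LongTensor of shape (seq_len, batch); PyTorch's transformer
    # modules default to the (seq_len, batch, d_model) layout.
    seq_len = token_ids.size(0)
    h = model.pos_encoding(model.embedding(token_ids))
    # Upper-triangular -inf mask: position i may only attend to positions <= i.
    mask = torch.triu(
        torch.full((seq_len, seq_len), float('-inf'), device=token_ids.device),
        diagonal=1,
    )
    h = model.transformer(tgt=h, memory=h, tgt_mask=mask, memory_mask=mask)
    return model.output(h)  # (seq_len, batch, vocab_size) logits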
LSTM Networks (Matched Parameters)
| Model | Parameters | embed_dim | hidden_dim | n_layers |
|---|---|---|---|---|
| Tiny | ~1M | 256 | 512 | 2 |
| Small | ~5M | 384 | 768 | 3 |
| Medium | ~20M | 512 | 1024 | 4 |
| Large | ~50M | 768 | 1536 | 5 |
class MusicLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Stacked LSTM; dropout is applied between layers
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=n_layers,
            dropout=0.1,
            batch_first=True
        )
        # Project hidden states back to the vocabulary
        self.output = nn.Linear(hidden_dim, vocab_size)
The goal is to compare attention-based Transformers with recurrent LSTMs at matched parameter counts.
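A quick way to sanity-check the parameter counts in the tables is to instantiate a configuration and count trainable parameters. This sketch assumes the MusicTransformer and MusicLSTM classes above (plus the PositionalEncoding helper the Transformer expects) are available, and uses a placeholder vocab_size of 128 since the real ABC vocabulary size isn't stated:

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Rough check of the "Medium" configurations from the tables above
transformer = MusicTransformer(vocab_size=128, d_model=512, n_heads=8, n_layers=8)
lstm = MusicLSTM(vocab_size=128, embed_dim=512, hidden_dim=1024, n_layers=4)
print(f"Transformer: {count_parameters(transformer) / 1e6:.1f}M parameters")
print(f"LSTM:        {count_parameters(lstm) / 1e6:.1f}M parameters")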
Key Findings
1. Transformers Exhibit Strong Power-Law Scaling
The Transformer models showed clear power-law scaling behavior, similar to what's observed in natural language:
Loss ∝ N^(-α)
where N is the number of parameters and α ≈ 0.07 for music data (compared to α ≈ 0.076 for text).
Transformers exhibit 2.3x stronger scaling than LSTMs - the gap widens with more parameters.
2. LSTMs Show Significantly Weaker Scaling
Surprisingly, LSTMs showed much weaker scaling—their performance plateaued earlier and didn't benefit as much from increased parameters:
| Model Size | Transformer Loss | LSTM Loss |
|---|---|---|
| 1M | 2.45 | 2.52 |
| 5M | 2.12 | 2.35 |
| 20M | 1.89 | 2.18 |
| 50M | 1.72 | 2.05 |
| 100M | 1.58 | 1.98 |
The gap widens as models get larger—suggesting Transformers are fundamentally better at utilizing additional capacity for this task.
3. Generated Music is Syntactically Valid
Both architectures learned to generate valid ABC notation that could be converted to playable MIDI:
def generate_music(model, prompt, max_length=500, temperature=0.8):
    # Autoregressive sampling: repeatedly feed the growing token sequence
    # back into the model and sample the next token from the final position.
    tokens = tokenize(prompt)
    for _ in range(max_length):
        logits = model(tokens)
        next_token = sample_with_temperature(logits[-1], temperature)
        tokens.append(next_token)
        if next_token == EOS_TOKEN:  # stop at the end-of-tune marker
            break
    return detokenize(tokens)
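tokenize, detokenize, and EOS_TOKEN come from the project's tokenizer and aren't shown; sample_with_temperature is the standard temperature-scaled categorical sample. A minimal PyTorch sketch of that helper, assuming logits is a 1-D tensor over the vocabulary:

import torch
import torch.nn.functional as F

def sample_with_temperature(logits, temperature=1.0):
    # Lower temperatures sharpen the distribution (more conservative choices),
    # higher temperatures flatten it (more adventurous choices).
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()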
Generated sample from the Large Transformer:
X:1
T:Generated Jig
M:6/8
K:D
|:DFA dAF|GBd gdB|AFA dFA|GFE EFG|
DFA dAF|GBd gdB|AFA dfe|d3 d3:|
Training Details
Hyperparameters
config = {
'batch_size': 64,
'learning_rate': 3e-4,
'weight_decay': 0.1,
'warmup_steps': 1000,
'max_steps': 100000,
'sequence_length': 512,
}
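The config above doesn't specify the optimizer or schedule. A plausible setup consistent with these values (weight decay plus warmup) is AdamW with linear warmup followed by cosine decay; that choice is an assumption for illustration, not something stated in the project:

import math
import torch

# Any nn.Module works here; the "Medium" Transformer is used as an example
# (vocab_size=128 is a placeholder).
model = MusicTransformer(vocab_size=128, d_model=512, n_heads=8, n_layers=8)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config['learning_rate'],
    weight_decay=config['weight_decay'],
)

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then cosine decay toward zero
    # at max_steps (the schedule shape is an assumption).
    if step < config['warmup_steps']:
        return step / max(1, config['warmup_steps'])
    progress = (step - config['warmup_steps']) / max(
        1, config['max_steps'] - config['warmup_steps'])
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)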
Training Curves
The training dynamics revealed interesting patterns:
- Transformers: smooth loss decrease, efficient gradient flow
- LSTMs: more volatile training, required careful learning rate tuning
Scaling Law Analysis
Fitting Power Laws
I fit the scaling law equation to the empirical results:
import scipy.optimize as opt

def scaling_law(N, alpha, beta):
    # loss = beta * N^(-alpha)
    return beta * (N ** (-alpha))

# transformer_params / transformer_losses hold the parameter counts and
# final losses of the five Transformer runs (see the results table above)
popt_transformer, _ = opt.curve_fit(
    scaling_law,
    transformer_params,
    transformer_losses
)
# α = 0.071, β = 3.82

# Same fit for the LSTM runs
popt_lstm, _ = opt.curve_fit(
    scaling_law,
    lstm_params,
    lstm_losses
)
# α = 0.031, β = 2.89
The key finding: Transformer α is 2.3x larger than LSTM α, meaning Transformers benefit much more from scale.
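The widening gap is easiest to see on a log-log plot of the measured losses from the results table; a minimal matplotlib sketch:

import matplotlib.pyplot as plt

params = [1e6, 5e6, 20e6, 50e6, 100e6]            # model sizes from the table
transformer_losses = [2.45, 2.12, 1.89, 1.72, 1.58]
lstm_losses = [2.52, 2.35, 2.18, 2.05, 1.98]

plt.figure(figsize=(6, 4))
plt.loglog(params, transformer_losses, 'o-', label='Transformer')
plt.loglog(params, lstm_losses, 's-', label='LSTM')
plt.xlabel('Parameters')
plt.ylabel('Loss')
plt.legend()
plt.title('Scaling on symbolic music (ABC notation)')
plt.show()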
Compute-Optimal Scaling
Following Chinchilla-style analysis, I also examined the optimal allocation of compute between model size and training tokens. For music data:
- The optimal ratio is approximately 20 tokens per parameter
- This is similar to the text domain
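As a back-of-the-envelope check using the ~20 tokens-per-parameter ratio above, the ~100M-token augmented dataset would pair with a compute-optimal model of roughly 5M parameters:

def compute_optimal_params(num_tokens, tokens_per_param=20):
    # Chinchilla-style rule of thumb: parameters ≈ tokens / ratio
    return num_tokens / tokens_per_param

print(f"{compute_optimal_params(100e6) / 1e6:.0f}M parameters")  # ≈ 5M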
Qualitative Analysis of Generated Music
Transformer Strengths
- Better long-range coherence (maintains key signature)
- More interesting melodic variations
- Proper phrase structure (AABB form)
LSTM Strengths
- Sometimes more "adventurous" note choices
- Can produce surprising modulations
- Faster inference (no attention computation)
Common Failure Modes
- Both struggle with very long pieces (>64 bars)
- Occasional invalid ABC syntax at higher temperatures
- Tendency toward repetitive patterns
Tools for Listening
Generated samples can be played through:
- Online ABC Players: abcjs.net
- MIDI Conversion: using the music21 library
from music21 import converter
# Convert ABC to MIDI
score = converter.parse(abc_string, format='abc')
score.write('midi', fp='output.mid')
Conclusions
- Scaling laws transfer to music, but with domain-specific constants
- Transformers scale better than LSTMs for sequential music modeling
- Attention mechanisms are crucial for capturing long-range musical structure
- Music is a viable domain for studying neural network scaling behavior
Future Directions
- Multi-instrument generation: Extend to polyphonic music
- Conditional generation: Control style, tempo, mood
- Audio domain: Apply similar analysis to raw audio models
- Cross-domain transfer: Can music pretraining help text models?
References
- Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models"
- Vaswani, A., et al. (2017). "Attention Is All You Need"
- Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models" (Chinchilla)
The full code and trained models are available on GitHub.
This project was completed as part of CS-GY 6923-B Machine Learning at NYU Tandon School of Engineering, December 2025.