Scaling Laws for Language Models on Symbolic Music Data
For my NYU Machine Learning course final project (CS-GY 6923-B), I investigated a fascinating question: do the scaling laws that govern text-based language models also apply to symbolic music? This project compared decoder-only Transformers and LSTM networks at matched parameter counts across multiple model sizes, and found a clear difference in how efficiently the two architectures use additional capacity to model musical structure.
The Research Question
The famous "Scaling Laws for Neural Language Models" paper by Kaplan et al. showed that language model performance follows predictable power-law relationships with model size, dataset size, and compute. But does this hold for domains beyond natural language?
Music, represented in ABC notation, provides an interesting test case:
- It has structure and grammar (like language)
- It has temporal dependencies (melodies, harmonies)
- It's much smaller in vocabulary than natural language
- It has different distributional properties
Dataset: Irish Folk Music in ABC Notation
ABC notation represents music as text tokens, which makes it a natural fit for language-model training. I used real music data from two sources:
The Session (~53,000 tunes)
The Session is a community-driven database of Irish and folk music. Each tune is transcribed in ABC notation:
X:1
T:The Kesh Jig
M:6/8
K:G
|:GAG GAB|ABA ABd|edd gdd|edB dBA|
GAG GAB|ABA ABd|edd gdB|AGF G3:|
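To make the "music as text tokens" framing concrete, here is a minimal character-level tokenizer over ABC strings. The project's actual tokenization scheme isn't specified, so treat this as an illustrative assumption:

def build_vocab(tunes):
    # Map every character that occurs in the corpus to an integer id.
    chars = sorted(set(''.join(tunes)))
    return {ch: i for i, ch in enumerate(chars)}

def encode(abc_string, vocab):
    # One token per character of the ABC text.
    return [vocab[ch] for ch in abc_string]

def decode(token_ids, vocab):
    # Invert the mapping to recover a playable ABC string.
    inv = {i: ch for ch, i in vocab.items()}
    return ''.join(inv[i] for i in token_ids)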
Nottingham Music Database (~1,000 tunes)
Traditional folk tunes providing additional diversity.
Data Augmentation
To reach ~100M training tokens, I applied key transposition—shifting each tune through all 12 keys while preserving the musical structure:
def transpose_abc(abc_string, semitones):
    # Parse the ABC notation, shift every note by `semitones`,
    # and preserve rhythm, structure, and metadata.
    # (Implementation omitted here; a concrete sketch follows below.)
    ...
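The pseudocode above only sketches the idea. One concrete way to get semitone-accurate transposition without hand-rolling an ABC parser is to shell out to the abc2abc tool from the abcMIDI package; that choice is mine for illustration, not necessarily how the project did it:

import subprocess
import tempfile

def transpose_with_abc2abc(abc_string, semitones):
    # Write the tune to a temp file and let abc2abc (from the abcMIDI
    # package, assumed to be installed) shift it by `semitones`;
    # the transposed tune is printed to stdout.
    with tempfile.NamedTemporaryFile('w', suffix='.abc', delete=False) as f:
        f.write(abc_string)
        path = f.name
    result = subprocess.run(
        ['abc2abc', path, '-t', str(semitones)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def augment_all_keys(abc_string):
    # One copy of the tune per semitone offset (0 keeps the original key).
    return [transpose_with_abc2abc(abc_string, k) for k in range(12)]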
Model Architectures
I compared two fundamental architectures at matched parameter counts:
Decoder-Only Transformers
| Model | Parameters | d_model | n_heads | n_layers |
|---|---|---|---|---|
| Tiny | ~1M | 128 | 4 | 4 |
| Small | ~5M | 256 | 8 | 6 |
| Medium | ~20M | 512 | 8 | 8 |
| Large | ~50M | 768 | 12 | 12 |
| XL | ~100M | 1024 | 16 | 16 |
import torch.nn as nn

class MusicTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers):
        super().__init__()
        # Token embedding plus positional encoding (custom module, not shown)
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        # Standard pre-built decoder block: self-attention + feed-forward
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,
            dropout=0.1
        )
        self.transformer = nn.TransformerDecoder(decoder_layer, n_layers)
        # Project hidden states back to the vocabulary for next-token prediction
        self.output = nn.Linear(d_model, vocab_size)
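The class above only defines the layers; to use it as a language model, the forward pass also needs a causal mask so each position attends only to earlier tokens. A minimal sketch, assuming the MusicTransformer class above; feeding the embedded sequence back in as the decoder's memory under the same causal mask is one way to drive nn.TransformerDecoder in a decoder-only setup, not necessarily how the project did it:

import torch

def causal_forward(model, token_ids):
    # token_ids: LongTensor of shape (seq_len, batch); PyTorch's transformer
    # modules default to the (seq_len, batch, d_model) layout.
    seq_len = token_ids.size(0)
    h = model.pos_encoding(model.embedding(token_ids))
    # Upper-triangular -inf mask: position i may only attend to positions <= i.
    mask = torch.triu(
        torch.full((seq_len, seq_len), float('-inf'), device=token_ids.device),
        diagonal=1,
    )
    h = model.transformer(tgt=h, memory=h, tgt_mask=mask, memory_mask=mask)
    return model.output(h)  # (seq_len, batch, vocab_size) logits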
LSTM Networks (Matched Parameters)
| Model | Parameters | embed_dim | hidden_dim | n_layers |
|---|---|---|---|---|
| Tiny | ~1M | 256 | 512 | 2 |
| Small | ~5M | 384 | 768 | 3 |
| Medium | ~20M | 512 | 1024 | 4 |
| Large | ~50M | 768 | 1536 | 5 |
class MusicLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, n_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Stacked LSTM; dropout is applied between layers
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=n_layers,
            dropout=0.1,
            batch_first=True
        )
        # Project hidden states back to the vocabulary
        self.output = nn.Linear(hidden_dim, vocab_size)
The goal is to compare attention-based Transformers with recurrent LSTMs at matched parameter counts.
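A quick way to sanity-check the parameter counts in the tables is to instantiate a configuration and count trainable parameters. This sketch assumes the MusicTransformer and MusicLSTM classes above (plus the PositionalEncoding helper the Transformer expects) are available, and uses a placeholder vocab_size of 128 since the real ABC vocabulary size isn't stated:

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Rough check of the "Medium" configurations from the tables above
transformer = MusicTransformer(vocab_size=128, d_model=512, n_heads=8, n_layers=8)
lstm = MusicLSTM(vocab_size=128, embed_dim=512, hidden_dim=1024, n_layers=4)
print(f"Transformer: {count_parameters(transformer) / 1e6:.1f}M parameters")
print(f"LSTM:        {count_parameters(lstm) / 1e6:.1f}M parameters")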
Key Findings
1. Transformers Exhibit Strong Power-Law Scaling
The Transformer models showed clear power-law scaling behavior, similar to what's observed in natural language:
Loss ∝ N^(-α)
where N is the number of parameters and α ≈ 0.07 for music data (compared to α ≈ 0.076 for text).
Transformers exhibit 2.3x stronger scaling than LSTMs - the gap widens with more parameters.
2. LSTMs Show Significantly Weaker Scaling
Surprisingly, LSTMs showed much weaker scaling—their performance plateaued earlier and didn't benefit as much from increased parameters:
| Model Size | Transformer Loss | LSTM Loss |
|---|---|---|
| 1M | 2.45 | 2.52 |
| 5M | 2.12 | 2.35 |
| 20M | 1.89 | 2.18 |
| 50M | 1.72 | 2.05 |
| 100M | 1.58 | 1.98 |
The gap widens as models get larger—suggesting Transformers are fundamentally better at utilizing additional capacity for this task.
3. Generated Music is Syntactically Valid
Both architectures learned to generate valid ABC notation that could be converted to playable MIDI:
def generate_music(model, prompt, max_length=500, temperature=0.8):
    # Autoregressive sampling: repeatedly feed the growing token sequence
    # back into the model and sample the next token from the final position.
    tokens = tokenize(prompt)
    for _ in range(max_length):
        logits = model(tokens)
        next_token = sample_with_temperature(logits[-1], temperature)
        tokens.append(next_token)
        if next_token == EOS_TOKEN:  # stop at the end-of-tune marker
            break
    return detokenize(tokens)
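tokenize, detokenize, and EOS_TOKEN come from the project's tokenizer and aren't shown; sample_with_temperature is the standard temperature-scaled categorical sample. A minimal PyTorch sketch of that helper, assuming logits is a 1-D tensor over the vocabulary:

import torch
import torch.nn.functional as F

def sample_with_temperature(logits, temperature=1.0):
    # Lower temperatures sharpen the distribution (more conservative choices),
    # higher temperatures flatten it (more adventurous choices).
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()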
Generated sample from the Large Transformer:
X:1
T:Generated Jig
M:6/8
K:D
|:DFA dAF|GBd gdB|AFA dFA|GFE EFG|
DFA dAF|GBd gdB|AFA dfe|d3 d3:|
Training Details
Hyperparameters
config = {
'batch_size': 64,
'learning_rate': 3e-4,
'weight_decay': 0.1,
'warmup_steps': 1000,
'max_steps': 100000,
'sequence_length': 512,
}
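The config above doesn't specify the optimizer or schedule. A plausible setup consistent with these values (weight decay plus warmup) is AdamW with linear warmup followed by cosine decay; that choice is an assumption for illustration, not something stated in the project:

import math
import torch

# Any nn.Module works here; the "Medium" Transformer is used as an example
# (vocab_size=128 is a placeholder).
model = MusicTransformer(vocab_size=128, d_model=512, n_heads=8, n_layers=8)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config['learning_rate'],
    weight_decay=config['weight_decay'],
)

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then cosine decay toward zero
    # at max_steps (the schedule shape is an assumption).
    if step < config['warmup_steps']:
        return step / max(1, config['warmup_steps'])
    progress = (step - config['warmup_steps']) / max(
        1, config['max_steps'] - config['warmup_steps'])
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)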
Training Curves
The training dynamics revealed interesting patterns:
- Transformers: smooth loss decrease, efficient gradient flow
- LSTMs: more volatile training, required careful learning rate tuning
Scaling Law Analysis
Fitting Power Laws
I fit the scaling law equation to the empirical results:
import scipy.optimize as opt

def scaling_law(N, alpha, beta):
    # loss = beta * N^(-alpha)
    return beta * (N ** (-alpha))

# transformer_params / transformer_losses hold the parameter counts and
# final losses of the five Transformer runs (see the results table above)
popt_transformer, _ = opt.curve_fit(
    scaling_law,
    transformer_params,
    transformer_losses
)
# α = 0.071, β = 3.82

# Same fit for the LSTM runs
popt_lstm, _ = opt.curve_fit(
    scaling_law,
    lstm_params,
    lstm_losses
)
# α = 0.031, β = 2.89
The key finding: Transformer α is 2.3x larger than LSTM α, meaning Transformers benefit much more from scale.
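The widening gap is easiest to see on a log-log plot of the measured losses from the results table; a minimal matplotlib sketch:

import matplotlib.pyplot as plt

params = [1e6, 5e6, 20e6, 50e6, 100e6]            # model sizes from the table
transformer_losses = [2.45, 2.12, 1.89, 1.72, 1.58]
lstm_losses = [2.52, 2.35, 2.18, 2.05, 1.98]

plt.figure(figsize=(6, 4))
plt.loglog(params, transformer_losses, 'o-', label='Transformer')
plt.loglog(params, lstm_losses, 's-', label='LSTM')
plt.xlabel('Parameters')
plt.ylabel('Loss')
plt.legend()
plt.title('Scaling on symbolic music (ABC notation)')
plt.show()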
Compute-Optimal Scaling
Following Chinchilla-style analysis, I also examined the optimal allocation of compute between model size and training tokens. For music data:
- The optimal ratio is approximately 20 tokens per parameter
- This is similar to the text domain
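As a back-of-the-envelope check using the ~20 tokens-per-parameter ratio above, the ~100M-token augmented dataset would pair with a compute-optimal model of roughly 5M parameters:

def compute_optimal_params(num_tokens, tokens_per_param=20):
    # Chinchilla-style rule of thumb: parameters ≈ tokens / ratio
    return num_tokens / tokens_per_param

print(f"{compute_optimal_params(100e6) / 1e6:.0f}M parameters")  # ≈ 5M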
Qualitative Analysis of Generated Music
Transformer Strengths
- Better long-range coherence (maintains key signature)
- More interesting melodic variations
- Proper phrase structure (AABB form)
LSTM Strengths
- Sometimes more "adventurous" note choices
- Can produce surprising modulations
- Faster inference (no attention computation)
Common Failure Modes
- Both struggle with very long pieces (>64 bars)
- Occasional invalid ABC syntax at higher temperatures
- Tendency toward repetitive patterns
Tools for Listening
Generated samples can be played through:
- Online ABC Players: abcjs.net
- MIDI Conversion: using the music21 library
from music21 import converter
# Convert ABC to MIDI
score = converter.parse(abc_string, format='abc')
score.write('midi', fp='output.mid')
Conclusions
- Scaling laws transfer to music, but with domain-specific constants
- Transformers scale better than LSTMs for sequential music modeling
- Attention mechanisms are crucial for capturing long-range musical structure
- Music is a viable domain for studying neural network scaling behavior
Future Directions
- Multi-instrument generation: Extend to polyphonic music
- Conditional generation: Control style, tempo, mood
- Audio domain: Apply similar analysis to raw audio models
- Cross-domain transfer: Can music pretraining help text models?
References
- Kaplan, J., et al. (2020). "Scaling Laws for Neural Language Models"
- Vaswani, A., et al. (2017). "Attention Is All You Need"
- Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models" (Chinchilla)
The full code and trained models are available on GitHub.
This project was completed as part of CS-GY 6923-B Machine Learning at NYU Tandon School of Engineering, December 2025.