
GeoGuessr Street View Geolocation: Deep Learning Ensemble for Location Prediction

Achieving 96%+ accuracy on US state classification with a 7-model ensemble strategy

Jithendra Puppala
Tech Stack: Python · PyTorch · timm · CLIP · Vision Transformers · Swin · ConvNeXt · EVA02

For my NYU Computer Vision course final project (CS-GY 6643), I tackled an ambitious challenge: can a deep learning model predict the geographic location of a street view image? This GeoGuessr-style competition pushed me to explore state-of-the-art vision architectures and ensemble techniques, ultimately reaching 96%+ accuracy on US state classification.

The Challenge

Given four street view images (facing north, east, south, and west) from a location somewhere in the United States, predict which of the 50 states the location is in. The dataset contained thousands of samples with GPS coordinates, and we were evaluated on classification accuracy.

This isn't just an academic exercise: accurate geolocation from imagery has real-world applications in autonomous vehicles, photo organization, tourism, and even digital forensics.

[Figure] Multi-directional street view input: the model receives four street view images (N, E, S, W) from each location and must predict the US state.

My Approach: A 7-Model Ensemble Strategy

After extensive experimentation, I developed a comprehensive ensemble strategy combining seven different architectures. The key insight was that architectural diversity matters more than raw individual performance.

The Model Zoo

Model             Parameters   Expected Score   Architecture Type
ViT-Large-CLIP    307M         0.95             Vision Transformer (CLIP)
EVA02-Large       321M         0.93             Vision Transformer (MIM)
Swin-Base         88M          0.93             Window Transformer
BEiT-Large        304M         0.92             Vision Transformer (Masked)
MaxViT-Large      228M         0.92             Hybrid Conv+Attention
ConvNeXt-Large    197M         0.91             Pure ConvNet
ConvNeXt-Small    50M          0.89             Pure ConvNet

[Figure] Architecture diversity in the ensemble: different architecture types capture complementary features, from local textures to semantic understanding.

Why CLIP is the Star Model

The ViT-Large-CLIP model was my best performer, and there's a good reason why:

  • Pretrained on 400M text-image pairs including location descriptions
  • Has learned semantic concepts like "desert", "coastal", "urban", "midwest"
  • Different pretraining objective than all other vision models
  • Achieves 0.94-0.95 solo with test-time augmentation
# CLIP model configuration
model_name = 'vit_large_patch14_clip_336.openai_ft_in12k_in1k'
image_size = 336  # Higher resolution for better details
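
For context, this checkpoint can be loaded directly through timm; here is a minimal sketch (the 50-class head matches the state classification task, and the exact creation arguments used in the project may differ):

import timm

# CLIP-pretrained ViT-Large backbone with a fresh 50-way classification head
clip_model = timm.create_model(
    'vit_large_patch14_clip_336.openai_ft_in12k_in1k',
    pretrained=True,
    num_classes=50,
)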

Technical Implementation

Multi-Directional Image Processing

Each sample has four images representing different viewing directions. I designed a custom dataset class to handle this:

import os

import torch
from PIL import Image
from torch.utils.data import Dataset

class GeoDataset(Dataset):
    def __init__(self, df, img_dir, transform, is_train=True, gps_stats=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transform = transform
        self.is_train = is_train
        self.gps_stats = gps_stats  # lat/lon mean and std used to normalize GPS targets

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        imgs = []
        for direction in ['north', 'east', 'south', 'west']:
            path = os.path.join(self.img_dir, row[f'image_{direction}'])
            img = Image.open(path).convert('RGB')
            imgs.append(self.transform(img))

        # Stack all four directions
        imgs = torch.stack(imgs)  # Shape: [4, C, H, W]

        if self.is_train:
            # Normalize GPS targets with training-set mean/std
            lat = (row['latitude'] - self.gps_stats['lat_m']) / self.gps_stats['lat_s']
            lon = (row['longitude'] - self.gps_stats['lon_m']) / self.gps_stats['lon_s']
            return {
                'images': imgs,
                'state_idx': torch.tensor(row['state_idx'], dtype=torch.long),
                'gps': torch.tensor([lat, lon], dtype=torch.float32)
            }
        return {'images': imgs, 'sample_id': row['sample_id']}
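
The post doesn't show how gps_stats is built; a minimal sketch of computing the normalization statistics from a training dataframe and wiring up a DataLoader might look like this (train_df, train_transform, and the image directory are assumed names):

from torch.utils.data import DataLoader

# Per-coordinate mean/std from the training split; keys match GeoDataset above
gps_stats = {
    'lat_m': train_df['latitude'].mean(), 'lat_s': train_df['latitude'].std(),
    'lon_m': train_df['longitude'].mean(), 'lon_s': train_df['longitude'].std(),
}

train_ds = GeoDataset(train_df, img_dir='data/images', transform=train_transform,
                      is_train=True, gps_stats=gps_stats)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4)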

Test-Time Augmentation (TTA)

To boost performance, I implemented test-time augmentation that averages predictions across multiple augmented versions of each image:

def tta_inference(model, images, num_augments=5):
    predictions = []
    for _ in range(num_augments):
        augmented = apply_random_augmentation(images)
        with torch.no_grad():
            pred = model(augmented)
        predictions.append(pred)

    # Average predictions across augmentations
    return torch.stack(predictions).mean(dim=0)
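
apply_random_augmentation isn't defined in the post; a minimal sketch using torchvision transforms (the project's exact augmentations may differ) that applies one light random transform to the stacked views:

import torchvision.transforms as T

# Hypothetical light TTA transform; illustrative choices only
_tta_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.1, contrast=0.1),
])

def apply_random_augmentation(images):
    # images: [B, 4, C, H, W] -> merge batch and view dims, transform, restore shape
    b, v, c, h, w = images.shape
    return _tta_transform(images.view(b * v, c, h, w)).view(b, v, c, h, w)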

Ensemble Strategy

The final prediction combines all models using learned weights:

# Auto-weighted ensemble based on validation performance
def ensemble_predict(model_predictions, weights):
    weighted_sum = sum(w * p for w, p in zip(weights, model_predictions))
    return weighted_sum / sum(weights)

# Optimal weights found through ablation:
# CLIP: 0.40, EVA02: 0.25, Swin: 0.15, Others: 0.20
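
The weight search itself isn't shown; one simple way to approximate the auto-weighting is a random search over normalized weight vectors scored by validation accuracy. This is a sketch, assuming model_predictions is a list of per-model [N, 50] probability tensors and val_labels holds the ground-truth state indices:

import torch

def random_weight_search(model_predictions, val_labels, trials=2000, seed=0):
    """Random search over ensemble weights, keeping the best validation accuracy."""
    g = torch.Generator().manual_seed(seed)
    stacked = torch.stack(model_predictions)               # [M, N, 50]
    best_acc, best_w = 0.0, None
    for _ in range(trials):
        w = torch.rand(stacked.shape[0], generator=g)
        w = w / w.sum()                                    # weights sum to 1
        blended = (w[:, None, None] * stacked).sum(dim=0)  # [N, 50]
        acc = (blended.argmax(dim=1) == val_labels).float().mean().item()
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc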

Training Pipeline

Progressive Training Strategy

  1. Phase 1: Train all models independently with standard augmentation
  2. Phase 2: Apply test-time augmentation during inference
  3. Phase 3: Ablation testing to find optimal ensemble weights
  4. Phase 4: Final ensemble with temperature scaling (see the sketch after the commands below)
# Training workflow
python scripts/train.py --config configs/vit_large_clip.yaml  # 28 hours
python scripts/train.py --config configs/eva02_large.yaml     # 30 hours
python scripts/train.py --config configs/swin_base.yaml       # 18 hours

# Inference with TTA
python scripts/inference.py --config outputs/vit_large_clip/config.yaml --use_tta

# Ensemble with ablation
python scripts/ensemble.py --predictions outputs/*/predictions_*.pt --ablation
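
Phase 4's temperature scaling isn't shown in the commands above; a minimal sketch of scaling each model's logits by a per-model temperature (fit on validation data; the values themselves are placeholders) before the weighted average:

import torch.nn.functional as F

def calibrated_ensemble(model_logits, weights, temperatures):
    # Temperature-scale each model's logits, convert to probabilities, then weight-average
    probs = [F.softmax(logits / t, dim=-1) for logits, t in zip(model_logits, temperatures)]
    weighted = sum(w * p for w, p in zip(weights, probs))
    return weighted / sum(weights)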

Ablation Results

Running systematic ablation tests revealed the contribution of each model:

Ensemble Configuration          Score
CLIP alone                      0.945-0.955
CLIP + EVA02                    0.955-0.960
Top 3 (CLIP + EVA02 + Swin)     0.960-0.965
All 7 models                    0.965-0.970

Key Insights

1. Architectural Diversity > Individual Performance

Models with different inductive biases capture complementary features:

  • ConvNets: Strong local patterns (road textures, vegetation)
  • Window Transformers: Local + global attention (building styles)
  • Vision Transformers: Global context (landscape composition)
  • CLIP: Semantic understanding ("this looks like Arizona")

2. Geographic Feature Learning

The models learned to recognize subtle geographic cues:

  • Vegetation types: Desert vs. forest vs. farmland
  • Road characteristics: Highway styles, signage
  • Architecture: Building styles vary by region
  • Sky and lighting: Different latitudes have different sun angles

3. GPS Regression as Auxiliary Task

Training with GPS coordinate prediction as an auxiliary task improved state classification:

import torch.nn.functional as F

# Multi-task learning: state classification + auxiliary GPS regression
state_loss = F.cross_entropy(state_pred, state_label)
gps_loss = F.mse_loss(gps_pred, gps_label)
total_loss = state_loss + 0.1 * gps_loss  # small weight keeps GPS as an auxiliary signal
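
The post doesn't show the model head or how the four directional views are fused; one possible layout (an illustrative sketch, not necessarily the exact architecture used) pairs a shared timm backbone with a state head and an auxiliary GPS head, averaging features across the four views:

import timm
import torch.nn as nn

class GeoModel(nn.Module):
    """Illustrative multi-task model: shared backbone, state + GPS heads."""
    def __init__(self, backbone_name='convnext_small', num_states=50):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        self.state_head = nn.Linear(feat_dim, num_states)  # state classification logits
        self.gps_head = nn.Linear(feat_dim, 2)              # normalized (lat, lon)

    def forward(self, images):
        # images: [B, 4, C, H, W] -> encode each view, then average the view features
        b, v, c, h, w = images.shape
        feats = self.backbone(images.view(b * v, c, h, w))  # [B*4, D]
        feats = feats.view(b, v, -1).mean(dim=1)            # [B, D]
        return self.state_head(feats), self.gps_head(feats)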

[Figure] 50-state classification performance: accuracy is higher in distinctive regions like Hawaii and Alaska.

Results

Final Performance

  • Individual Best (CLIP): 0.906 accuracy
  • Ensemble (7 models): 0.96+ accuracy
  • Improvement from ensemble: +5.4% absolute

Interesting Failure Cases

The model struggled with:

  • Border regions: Areas that look like neighboring states
  • Generic suburbs: Cookie-cutter developments that could be anywhere
  • Unusual weather: Snow in typically warm states

Lessons Learned

  1. Ensemble diversity matters more than ensemble size - 3 diverse models beat 7 similar ones
  2. CLIP's semantic pretraining is powerful for geographic reasoning
  3. Test-time augmentation provides consistent 1-2% improvement
  4. Temperature scaling in ensembles helps calibrate confidence

What's Next

This project opened up several research directions:

  • Hierarchical prediction: Predict region → state → city
  • Attention visualization: What features does the model use?
  • Cross-country generalization: Can a US model work for Europe?

The code and models are available on GitHub.


This project was completed as part of CS-GY 6643 Computer Vision at NYU Tandon School of Engineering.
