
GeoGuessr Street View Geolocation: Deep Learning Ensemble for Location Prediction

Achieving 96%+ accuracy on US state classification with a 7-model ensemble strategy

Jithendra Puppala
Tech Stack: Python · PyTorch · timm · CLIP · Vision Transformers · Swin · ConvNeXt · EVA02

For my NYU Computer Vision course final project (CS-GY 6643), I tackled an ambitious challenge: can a deep learning model predict the geographic location of a street view image? This GeoGuessr-style competition pushed me to explore state-of-the-art vision architectures and ensemble techniques, ultimately reaching 96%+ accuracy on US state classification.

The Challenge

Given four street view images (facing north, east, south, and west) from a location somewhere in the United States, predict which of the 50 states the location is in. The dataset contained thousands of samples with GPS coordinates, and we were evaluated on classification accuracy.

This isn't just an academic exercise: accurate geolocation from imagery has real-world applications in autonomous vehicles, photo organization, tourism, and even digital forensics.

[Figure] Multi-directional street view input: the model receives four street view images (N, E, S, W) from each location and must predict the US state.

My Approach: A 7-Model Ensemble Strategy

After extensive experimentation, I developed a comprehensive ensemble strategy combining seven different architectures. The key insight was that architectural diversity matters more than raw individual performance.

The Model Zoo

Model             Parameters   Expected Score   Architecture Type
ViT-Large-CLIP    307M         0.95             Vision Transformer (CLIP)
EVA02-Large       321M         0.93             Vision Transformer (MIM)
Swin-Base         88M          0.93             Window Transformer
BEiT-Large        304M         0.92             Vision Transformer (Masked)
MaxViT-Large      228M         0.92             Hybrid Conv+Attention
ConvNeXt-Large    197M         0.91             Pure ConvNet
ConvNeXt-Small    50M          0.89             Pure ConvNet

[Figure] Architecture diversity in the ensemble: different architecture types capture complementary features, from local textures to semantic understanding.

Why CLIP is the Star Model

The ViT-Large-CLIP model was my best performer, and there's a good reason why:

  • Pretrained on 400M text-image pairs including location descriptions
  • Has learned semantic concepts like "desert", "coastal", "urban", "midwest"
  • Different pretraining objective than all other vision models
  • Achieves 0.94-0.95 solo with test-time augmentation
# CLIP model configuration
model_name = 'vit_large_patch14_clip_336.openai_ft_in12k_in1k'
image_size = 336  # Higher resolution for better details
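
For context, this checkpoint can be loaded directly through timm; here is a minimal sketch (the 50-class head matches the state classification task, and the exact creation arguments used in the project may differ):

import timm

# CLIP-pretrained ViT-Large backbone with a fresh 50-way classification head
clip_model = timm.create_model(
    'vit_large_patch14_clip_336.openai_ft_in12k_in1k',
    pretrained=True,
    num_classes=50,
)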

Technical Implementation

Multi-Directional Image Processing

Each sample has four images representing different viewing directions. I designed a custom dataset class to handle this:

import os

import torch
from PIL import Image
from torch.utils.data import Dataset

class GeoDataset(Dataset):
    def __init__(self, df, img_dir, transform, is_train=True, gps_stats=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transform = transform
        self.is_train = is_train
        self.gps_stats = gps_stats  # lat/lon mean and std used to normalize GPS targets

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        imgs = []
        for direction in ['north', 'east', 'south', 'west']:
            path = os.path.join(self.img_dir, row[f'image_{direction}'])
            img = Image.open(path).convert('RGB')
            imgs.append(self.transform(img))

        # Stack all four directions
        imgs = torch.stack(imgs)  # Shape: [4, C, H, W]

        if self.is_train:
            # Normalize GPS targets with training-set mean/std
            lat = (row['latitude'] - self.gps_stats['lat_m']) / self.gps_stats['lat_s']
            lon = (row['longitude'] - self.gps_stats['lon_m']) / self.gps_stats['lon_s']
            return {
                'images': imgs,
                'state_idx': torch.tensor(row['state_idx'], dtype=torch.long),
                'gps': torch.tensor([lat, lon], dtype=torch.float32)
            }
        return {'images': imgs, 'sample_id': row['sample_id']}
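
The post doesn't show how gps_stats is built; a minimal sketch of computing the normalization statistics from a training dataframe and wiring up a DataLoader might look like this (train_df, train_transform, and the image directory are assumed names):

from torch.utils.data import DataLoader

# Per-coordinate mean/std from the training split; keys match GeoDataset above
gps_stats = {
    'lat_m': train_df['latitude'].mean(), 'lat_s': train_df['latitude'].std(),
    'lon_m': train_df['longitude'].mean(), 'lon_s': train_df['longitude'].std(),
}

train_ds = GeoDataset(train_df, img_dir='data/images', transform=train_transform,
                      is_train=True, gps_stats=gps_stats)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True, num_workers=4)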

Test-Time Augmentation (TTA)

To boost performance, I implemented test-time augmentation that averages predictions across multiple augmented versions of each image:

def tta_inference(model, images, num_augments=5):
    predictions = []
    for _ in range(num_augments):
        augmented = apply_random_augmentation(images)
        with torch.no_grad():
            pred = model(augmented)
        predictions.append(pred)

    # Average predictions across augmentations
    return torch.stack(predictions).mean(dim=0)
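
apply_random_augmentation isn't defined in the post; a minimal sketch using torchvision transforms (the project's exact augmentations may differ) that applies one light random transform to the stacked views:

import torchvision.transforms as T

# Hypothetical light TTA transform; illustrative choices only
_tta_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.1, contrast=0.1),
])

def apply_random_augmentation(images):
    # images: [B, 4, C, H, W] -> merge batch and view dims, transform, restore shape
    b, v, c, h, w = images.shape
    return _tta_transform(images.view(b * v, c, h, w)).view(b, v, c, h, w)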

Ensemble Strategy

The final prediction combines all models using learned weights:

# Auto-weighted ensemble based on validation performance
def ensemble_predict(model_predictions, weights):
    weighted_sum = sum(w * p for w, p in zip(weights, model_predictions))
    return weighted_sum / sum(weights)

# Optimal weights found through ablation:
# CLIP: 0.40, EVA02: 0.25, Swin: 0.15, Others: 0.20
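
The weight search itself isn't shown; one simple way to approximate the auto-weighting is a random search over normalized weight vectors scored by validation accuracy. This is a sketch, assuming model_predictions is a list of per-model [N, 50] probability tensors and val_labels holds the ground-truth state indices:

import torch

def random_weight_search(model_predictions, val_labels, trials=2000, seed=0):
    """Random search over ensemble weights, keeping the best validation accuracy."""
    g = torch.Generator().manual_seed(seed)
    stacked = torch.stack(model_predictions)               # [M, N, 50]
    best_acc, best_w = 0.0, None
    for _ in range(trials):
        w = torch.rand(stacked.shape[0], generator=g)
        w = w / w.sum()                                    # weights sum to 1
        blended = (w[:, None, None] * stacked).sum(dim=0)  # [N, 50]
        acc = (blended.argmax(dim=1) == val_labels).float().mean().item()
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc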

Training Pipeline

Progressive Training Strategy

  1. Phase 1: Train all models independently with standard augmentation
  2. Phase 2: Apply test-time augmentation during inference
  3. Phase 3: Ablation testing to find optimal ensemble weights
  4. Phase 4: Final ensemble with temperature scaling (see the sketch after the commands below)
# Training workflow
python scripts/train.py --config configs/vit_large_clip.yaml  # 28 hours
python scripts/train.py --config configs/eva02_large.yaml     # 30 hours
python scripts/train.py --config configs/swin_base.yaml       # 18 hours

# Inference with TTA
python scripts/inference.py --config outputs/vit_large_clip/config.yaml --use_tta

# Ensemble with ablation
python scripts/ensemble.py --predictions outputs/*/predictions_*.pt --ablation
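
Phase 4's temperature scaling isn't shown in the commands above; a minimal sketch of scaling each model's logits by a per-model temperature (fit on validation data; the values themselves are placeholders) before the weighted average:

import torch.nn.functional as F

def calibrated_ensemble(model_logits, weights, temperatures):
    # Temperature-scale each model's logits, convert to probabilities, then weight-average
    probs = [F.softmax(logits / t, dim=-1) for logits, t in zip(model_logits, temperatures)]
    weighted = sum(w * p for w, p in zip(weights, probs))
    return weighted / sum(weights)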

Ablation Results

Running systematic ablation tests revealed the contribution of each model:

Ensemble Configuration          Score
CLIP alone                      0.945-0.955
CLIP + EVA02                    0.955-0.960
Top 3 (CLIP + EVA02 + Swin)     0.960-0.965
All 7 models                    0.965-0.970

Key Insights

1. Architectural Diversity > Individual Performance

Models with different inductive biases capture complementary features:

  • ConvNets: Strong local patterns (road textures, vegetation)
  • Window Transformers: Local + global attention (building styles)
  • Vision Transformers: Global context (landscape composition)
  • CLIP: Semantic understanding ("this looks like Arizona")

2. Geographic Feature Learning

The models learned to recognize subtle geographic cues:

  • Vegetation types: Desert vs. forest vs. farmland
  • Road characteristics: Highway styles, signage
  • Architecture: Building styles vary by region
  • Sky and lighting: Different latitudes have different sun angles

3. GPS Regression as Auxiliary Task

Training with GPS coordinate prediction as an auxiliary task improved state classification:

import torch.nn.functional as F

# Multi-task learning: state classification + auxiliary GPS regression
state_loss = F.cross_entropy(state_pred, state_label)
gps_loss = F.mse_loss(gps_pred, gps_label)
total_loss = state_loss + 0.1 * gps_loss  # small weight keeps GPS as an auxiliary signal
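
The post doesn't show the model head or how the four directional views are fused; one possible layout (an illustrative sketch, not necessarily the exact architecture used) pairs a shared timm backbone with a state head and an auxiliary GPS head, averaging features across the four views:

import timm
import torch.nn as nn

class GeoModel(nn.Module):
    """Illustrative multi-task model: shared backbone, state + GPS heads."""
    def __init__(self, backbone_name='convnext_small', num_states=50):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        self.state_head = nn.Linear(feat_dim, num_states)  # state classification logits
        self.gps_head = nn.Linear(feat_dim, 2)              # normalized (lat, lon)

    def forward(self, images):
        # images: [B, 4, C, H, W] -> encode each view, then average the view features
        b, v, c, h, w = images.shape
        feats = self.backbone(images.view(b * v, c, h, w))  # [B*4, D]
        feats = feats.view(b, v, -1).mean(dim=1)            # [B, D]
        return self.state_head(feats), self.gps_head(feats)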

[Figure] 50-state classification performance: accuracy is higher in distinctive regions like Hawaii and Alaska.

Results

Final Performance

  • Individual Best (CLIP): 0.906 accuracy
  • Ensemble (7 models): 0.96+ accuracy
  • Improvement from ensemble: +5.4% absolute

Interesting Failure Cases

The model struggled with:

  • Border regions: Areas that look like neighboring states
  • Generic suburbs: Cookie-cutter developments that could be anywhere
  • Unusual weather: Snow in typically warm states

Lessons Learned

  1. Ensemble diversity matters more than ensemble size - 3 diverse models beat 7 similar ones
  2. CLIP's semantic pretraining is powerful for geographic reasoning
  3. Test-time augmentation provides consistent 1-2% improvement
  4. Temperature scaling in ensembles helps calibrate confidence

What's Next

This project opened up several research directions:

  • Hierarchical prediction: Predict region → state → city
  • Attention visualization: What features does the model use?
  • Cross-country generalization: Can a US model work for Europe?

The code and models are available on GitHub.


This project was completed as part of CS-GY 6643 Computer Vision at NYU Tandon School of Engineering.
