Real-Time Object Detection and Tracking with YOLO and DeepSORT
When I started this project, I wanted to solve a practical problem: how do you track multiple people across video frames in real-time while maintaining their identities? This is crucial for applications like crowd monitoring, retail analytics, and surveillance systems.
The Challenge
Object detection is relatively straightforward with modern deep learning models like YOLO. But tracking objects across frames while maintaining their identities? That's where things get interesting. You need to:
- Detect objects in each frame
- Associate detections across frames
- Maintain stable IDs even when objects temporarily disappear
- Do all this in real-time
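The steps above can be sketched as a single per-frame loop. This is a minimal sketch: `detect` and `associate` are hypothetical stand-ins for the detection and association components described below, and track deletion/re-identification after a disappearance is omitted for brevity.

```python
def run_tracking(frames, detect, associate):
    """Minimal per-frame loop: detect, associate, keep stable IDs."""
    tracks = {}          # track_id -> last known box
    next_id = 0
    history = []
    for frame in frames:
        detections = detect(frame)                          # 1. detect
        matches, unmatched = associate(tracks, detections)  # 2. associate
        for track_id, det in matches:
            tracks[track_id] = det                          # 3. stable IDs
        for det in unmatched:
            tracks[next_id] = det                           # new object, new ID
            next_id += 1
        history.append(dict(tracks))
    return history
```

Everything that follows is about making `detect` and `associate` fast and accurate.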
Technical Architecture
Detection: YOLO v8 and v10
I implemented support for both YOLOv8 and YOLOv10 models. Why both?
- YOLOv8 provides robust, well-tested performance
- YOLOv10 offers improved speed with comparable accuracy
The models are pretrained on the COCO dataset and fine-tuned for person detection. Here's what makes YOLO a great fit for this task:
```python
# YOLOv8 provides clean, simple inference
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # any YOLOv8/v10 checkpoint works here
results = model(frame)
detections = results[0].boxes.data  # rows of [x1, y1, x2, y2, conf, class]
```
Tracking: DeepSORT
DeepSORT (Simple Online and Realtime Tracking with a Deep Association Metric) is where the magic happens. It combines:
- Kalman Filter: Predicts object positions between frames
- Deep Appearance Features: Uses a CNN (mars-small128) to extract appearance features
- Hungarian Algorithm: Optimally matches detections to tracks
The key insight of DeepSORT is using both motion and appearance for data association. This means even if someone is occluded briefly, we can re-identify them based on their appearance.
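As a rough illustration of that idea, here is a NumPy-only sketch that scores track/detection pairs by appearance (cosine distance) and gates them with a simple center-distance motion check. It uses greedy matching as a simplified stand-in for DeepSORT's actual Mahalanobis gating and Hungarian assignment; all names and thresholds are illustrative.

```python
import numpy as np

def associate(track_feats, det_feats, track_centers, det_centers,
              max_dist=0.4, gate=50.0):
    """Greedy match on appearance cosine distance, gated by motion distance.

    Simplified stand-in for DeepSORT's Mahalanobis gating + Hungarian step.
    Returns a list of (track_index, detection_index) pairs."""
    t = np.asarray(track_feats, dtype=float)
    d = np.asarray(det_feats, dtype=float)
    # Cosine-distance cost matrix (tracks x detections)
    t_n = t / np.linalg.norm(t, axis=1, keepdims=True)
    d_n = d / np.linalg.norm(d, axis=1, keepdims=True)
    cost = 1.0 - t_n @ d_n.T
    # Motion gate: forbid pairs whose centers are too far apart
    tc = np.asarray(track_centers, dtype=float)
    dc = np.asarray(det_centers, dtype=float)
    dists = np.linalg.norm(tc[:, None, :] - dc[None, :, :], axis=2)
    cost[dists > gate] = np.inf
    # Greedy assignment: repeatedly take the cheapest remaining pair
    matches = []
    while np.isfinite(cost).any() and cost.min() < max_dist:
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        matches.append((int(i), int(j)))
        cost[i, :] = np.inf
        cost[:, j] = np.inf
    return matches
```

The appearance term is what lets a briefly occluded person be re-claimed by their old track instead of being handed a fresh ID.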
Implementation Details
1. Detection Pipeline
```python
def detect_objects(self, frame):
    """Run YOLO detection on a frame, keeping confident person boxes."""
    results = self.model(frame)
    boxes = []
    for det in results[0].boxes.data:
        x1, y1, x2, y2, conf, cls = det.tolist()
        if conf > self.conf_threshold and int(cls) == 0:  # COCO class 0 = person
            boxes.append([x1, y1, x2, y2, conf])
    return boxes
```
2. Feature Extraction
The deep association metric uses a CNN to extract 128-dimensional feature vectors:
```python
def extract_features(self, frame, boxes):
    """Extract deep appearance features for each detected box."""
    features = []
    for box in boxes:
        x1, y1, x2, y2 = (int(v) for v in box[:4])
        crop = frame[y1:y2, x1:x2]
        crop_resized = cv2.resize(crop, (64, 128))  # encoder's expected input size
        feature = self.encoder(crop_resized)        # 128-dim appearance vector
        features.append(feature)
    return features
```
3. Tracking and ID Assignment
DeepSORT maintains tracks and assigns IDs:
```python
# Update tracker with new detections
tracks = self.tracker.update(detections, features)
for track in tracks:
    if track.is_confirmed():
        track_id = track.track_id
        bbox = track.to_tlbr()  # [x1, y1, x2, y2]
        # Draw bounding box with stable ID
```
Deployment on Azure
One of the project's key achievements is deploying it as a web application on Azure. The architecture:
- Frontend: Flask web app for video upload and result viewing
- Backend: Python service running YOLO + DeepSORT
- Storage: Azure Blob Storage for input/output videos
- Compute: Azure App Service (free tier)
Challenges in Cloud Deployment
Challenge 1: No GPU on Free Tier
- Solution: Optimized the model to run on CPU, used smaller YOLO variants
- Preprocessing: Resize frames to reduce computation
Challenge 2: Processing Time
- Solution: Async task queue with progress tracking
- Users get notified when processing completes
Challenge 3: Memory Constraints
- Solution: Process video in chunks, stream results
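The chunking idea itself is simple; a minimal sketch of a generator that keeps only one fixed-size batch of frames in memory at a time (the names are illustrative, and the frame source can be any iterator, such as a video decoder):

```python
def iter_chunks(frames, chunk_size):
    """Yield fixed-size chunks so only one chunk is in memory at a time."""
    chunk = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:            # don't drop the trailing partial chunk
        yield chunk
```

Each chunk is processed and its annotated frames are streamed out before the next chunk is decoded.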
Results and Performance
On the MOTChallenge dataset:
- MOTA (Multiple Object Tracking Accuracy): 62.3%
- IDF1 (ID F1 Score): 58.7%
- FPS: ~15 on CPU, ~45 on GPU
Real-world performance:
- Stable IDs across 95% of video duration
- Handles up to 20 people simultaneously
- Robust to partial occlusions
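An ID-stability figure like the one above can be computed from per-frame track logs. A minimal sketch, assuming each entry in `track_ids` is the ID assigned to one ground-truth person on one frame:

```python
from collections import Counter

def id_stability(track_ids):
    """Fraction of frames on which the person's most common ID was assigned."""
    if not track_ids:
        return 0.0
    counts = Counter(track_ids)
    return counts.most_common(1)[0][1] / len(track_ids)
```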
Key Learnings
- Model Selection Matters: YOLOv10 was 30% faster than v8 with minimal accuracy loss
- Feature Quality > Speed: Using deeper appearance features improved ID consistency by 25%
- Kalman Filter Tuning: Process noise and measurement noise need careful tuning for different scenarios
- Production != Research: Deploying on limited resources requires significant optimization
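To make the Kalman-tuning point concrete, here is a minimal 1-D constant-velocity predict/update cycle in NumPy showing exactly where process noise (`q`) and measurement noise (`r`) enter; the values are illustrative, not the project's actual tuning.

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-2, r=1.0):
    """One predict/update cycle of a 1-D constant-velocity Kalman filter.

    Larger q = trust the motion model less; larger r = trust detections less."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
    H = np.array([[1.0, 0.0]])              # we only measure position
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P
```

With a low `r`, the estimate snaps toward each detection; with a high `r`, it coasts on the motion model, which is what smooths tracks through jittery or missed detections.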
Live Demo
Check out the live demo (Note: Azure free tier means slower processing).
The code is open source on GitHub.
What's Next?
I'm exploring:
- Multi-camera tracking (track people across multiple camera feeds)
- Action recognition (what are tracked people doing?)
- Edge deployment on Raspberry Pi
This project taught me that production ML is 10% model development and 90% engineering: deployment, optimization, and robustness.