
Real-Time Object Detection and Tracking with YOLO and DeepSORT

Building a production-ready object detection pipeline deployed on Azure

Jithendra Puppala
Tech Stack: Python, PyTorch, YOLOv8, YOLOv10, DeepSORT, OpenCV, Flask, Azure

When I started this project, I wanted to solve a practical problem: how do you track multiple people across video frames in real time while maintaining their identities? This is crucial for applications like crowd monitoring, retail analytics, and surveillance systems.

The Challenge

Object detection is relatively straightforward with modern deep learning models like YOLO. But tracking objects across frames while maintaining their identities? That's where things get interesting. You need to:

  1. Detect objects in each frame
  2. Associate detections across frames
  3. Maintain stable IDs even when objects temporarily disappear
  4. Do all this in real time
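Put together, those four steps form one per-frame loop. Here's a minimal sketch; detector, tracker, and draw are hypothetical stand-ins for the YOLO and DeepSORT components described below:

import cv2

cap = cv2.VideoCapture("input.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    detections = detector(frame)         # 1. detect objects in this frame
    tracks = tracker.update(detections)  # 2-3. associate detections, keep stable IDs
    for track in tracks:
        draw(frame, track)               # render box + ID on the frame
cap.release()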

Technical Architecture

Detection: YOLOv8 and YOLOv10

I implemented support for both YOLOv8 and YOLOv10 models. Why both?

  • YOLOv8 provides robust, well-tested performance
  • YOLOv10 offers improved speed with comparable accuracy

The models are pretrained on the COCO dataset and fine-tuned for people detection. Here's what makes YOLO perfect for this task:

# YOLOv8 provides clean, simple inference
results = model(frame)
detections = results[0].boxes.data  # [x1, y1, x2, y2, conf, class]
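For reference, both detectors load through the same ultralytics interface. A sketch, with the caveat that the yolov10n.pt weight name assumes an ultralytics release that ships YOLOv10 weights:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")     # robust, well-tested baseline
# model = YOLO("yolov10n.pt")  # drop-in swap for the faster YOLOv10 (assumed weight name)
results = model(frame)         # frame: a BGR numpy array from OpenCV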

Tracking: DeepSORT

DeepSORT (Simple Online and Realtime Tracking with a Deep Association Metric) is where the magic happens. It combines:

  • Kalman Filter: Predicts object positions between frames
  • Deep Appearance Features: Uses a CNN (mars-small128) to extract appearance features
  • Hungarian Algorithm: Optimally matches detections to tracks

The key insight of DeepSORT is using both motion and appearance for data association. This means even if someone is occluded briefly, we can re-identify them based on their appearance.
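To make the data-association idea concrete, here's a minimal sketch of blending motion and appearance costs and solving the matching with the Hungarian algorithm. The lambda weight and the pre-computed cost matrices are illustrative, not DeepSORT's exact internals:

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(motion_cost, appearance_cost, lam=0.5):
    """Blend motion and appearance costs, then match tracks to detections."""
    cost = lam * motion_cost + (1 - lam) * appearance_cost
    track_idx, det_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(track_idx, det_idx))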

Implementation Details

1. Detection Pipeline

def detect_objects(self, frame):
    """Run YOLO detection on a frame and keep confident person boxes."""
    results = self.model(frame)
    boxes = []
    for det in results[0].boxes.data:
        # Convert the detection tensor to plain Python floats
        x1, y1, x2, y2, conf, cls = det.tolist()
        if conf > self.conf_threshold and int(cls) == 0:  # class 0 = person in COCO
            boxes.append([x1, y1, x2, y2, conf])
    return boxes
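One practical detail: many DeepSORT implementations expect boxes as [x, y, width, height] from the top-left corner, while YOLO returns corner coordinates, so a small conversion helper (hypothetical name) bridges the two:

def xyxy_to_tlwh(box):
    """Convert [x1, y1, x2, y2, conf] to ([x, y, w, h], conf)."""
    x1, y1, x2, y2, conf = box
    return [x1, y1, x2 - x1, y2 - y1], conf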

2. Feature Extraction

The deep association metric uses a CNN to extract 128-dimensional feature vectors:

def extract_features(self, frame, boxes):
    """Extract deep appearance features for each detection crop."""
    features = []
    for box in boxes:
        # Clip to integer pixel coordinates before slicing
        x1, y1, x2, y2 = map(int, box[:4])
        crop = frame[y1:y2, x1:x2]
        # The mars-small128 encoder expects 64x128 (width x height) crops
        crop_resized = cv2.resize(crop, (64, 128))
        feature = self.encoder(crop_resized)
        features.append(feature)
    return features
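Appearance matching then reduces to comparing these 128-dimensional vectors, typically by cosine distance. A minimal sketch:

import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two normalized feature vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))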

3. Tracking and ID Assignment

DeepSORT maintains tracks and assigns IDs:

# Update tracker with new detections
tracks = self.tracker.update(detections, features)

for track in tracks:
    if track.is_confirmed():
        track_id = track.track_id
        bbox = track.to_tlbr()  # [x1, y1, x2, y2]
        # Draw bounding box with stable ID
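Filling in that final drawing step with OpenCV looks roughly like this (a sketch, continuing from bbox and track_id above):

import cv2

x1, y1, x2, y2 = map(int, bbox)
cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(frame, f"ID {track_id}", (x1, y1 - 6),
            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)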

Deployment on Azure

One of the project's key achievements was deploying it as a web application on Azure. The architecture:

  • Frontend: Flask web app for video upload and result viewing
  • Backend: Python service running YOLO + DeepSORT
  • Storage: Azure Blob Storage for input/output videos
  • Compute: Azure App Service (free tier)
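The Flask side of this is small. Here's a minimal sketch of the upload endpoint, where enqueue_job is a hypothetical helper that pushes work onto the async processing queue:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    video = request.files["video"]            # uploaded video file
    job_id = enqueue_job(video)               # hypothetical: queue for async processing
    return jsonify({"job_id": job_id}), 202   # client polls for progress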

Challenges in Cloud Deployment

Challenge 1: No GPU on Free Tier

  • Solution: optimized the model to run on CPU, used smaller YOLO variants
  • Preprocessing: resize frames to reduce computation

Challenge 2: Processing Time

  • Solution: async task queue with progress tracking
  • Users get notified when processing completes

Challenge 3: Memory Constraints

  • Solution: process video in chunks, stream results
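As an illustration of the chunked approach, here's a minimal sketch (function name and chunk size are hypothetical) that reads a video in fixed-size frame batches so only one chunk sits in memory at a time:

import cv2

def process_in_chunks(video_path, process_batch, chunk_size=64):
    """Read a video in fixed-size frame batches to bound memory use."""
    cap = cv2.VideoCapture(video_path)
    batch = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        batch.append(frame)
        if len(batch) == chunk_size:
            yield process_batch(batch)  # e.g. run detection + tracking, stream results
            batch = []
    if batch:  # flush the final partial chunk
        yield process_batch(batch)
    cap.release()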

Results and Performance

On the MOTChallenge dataset:

  • MOTA (Multiple Object Tracking Accuracy): 62.3%
  • IDF1 (ID F1 Score): 58.7%
  • FPS: ~15 on CPU, ~45 on GPU

Real-world performance:

  • Stable IDs across 95% of video duration
  • Handles up to 20 people simultaneously
  • Robust to partial occlusions

Key Learnings

  1. Model Selection Matters: YOLOv10 was 30% faster than v8 with minimal accuracy loss
  2. Feature Quality > Speed: Using deeper appearance features improved ID consistency by 25%
  3. Kalman Filter Tuning: Process noise and measurement noise need careful tuning for different scenarios (see the sketch after this list)
  4. Production != Research: Deploying on limited resources requires significant optimization
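As a concrete example of point 3: with a filterpy-style Kalman filter, the tuning boils down to scaling the process-noise matrix Q and the measurement-noise matrix R. The values here are illustrative, not the ones I shipped:

from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=8, dim_z=4)  # state: box + velocities; measurement: box
kf.Q *= 0.01   # low process noise: trust the constant-velocity motion model
kf.R *= 10.0   # high measurement noise: trust jittery detections less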

Live Demo

Check out the live demo (note: processing is slower on the Azure free tier).

The code is open source on GitHub.

What's Next?

I'm exploring:

  • Multi-camera tracking: track people across multiple camera feeds
  • Action recognition: what are tracked people doing?
  • Edge deployment on Raspberry Pi

This project taught me that production ML is 10% model development and 90% engineering: deployment, optimization, and robustness.
