Real-Time Object Detection and Tracking with YOLO and DeepSORT
When I started this project, I wanted to solve a practical problem: how do you track multiple people across video frames in real-time while maintaining their identities? This is crucial for applications like crowd monitoring, retail analytics, and surveillance systems.
The Challenge
Object detection is relatively straightforward with modern deep learning models like YOLO. But tracking objects across frames while maintaining their identities? That's where things get interesting. You need to:
- Detect objects in each frame
- Associate detections across frames
- Maintain stable IDs even when objects temporarily disappear
- Do all this in real-time
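The steps above can be sketched as a single per-frame loop. This is a minimal sketch: `detect` and `associate` are hypothetical stand-ins for the detection and association components described below, and track deletion/re-identification after a disappearance is omitted for brevity.

```python
def run_tracking(frames, detect, associate):
    """Minimal per-frame loop: detect, associate, keep stable IDs."""
    tracks = {}          # track_id -> last known box
    next_id = 0
    history = []
    for frame in frames:
        detections = detect(frame)                          # 1. detect
        matches, unmatched = associate(tracks, detections)  # 2. associate
        for track_id, det in matches:
            tracks[track_id] = det                          # 3. stable IDs
        for det in unmatched:
            tracks[next_id] = det                           # new object, new ID
            next_id += 1
        history.append(dict(tracks))
    return history
```

Everything that follows is about making `detect` and `associate` fast and accurate.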
Technical Architecture
Detection: YOLO v8 and v10
I implemented support for both YOLOv8 and YOLOv10 models. Why both?
- YOLOv8 provides robust, well-tested performance
- YOLOv10 offers improved speed with comparable accuracy
The models are pretrained on the COCO dataset and fine-tuned for person detection. Here's what makes YOLO a great fit for this task:
```python
# YOLOv8 provides clean, simple inference
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # any YOLOv8/v10 checkpoint works here
results = model(frame)
detections = results[0].boxes.data  # rows of [x1, y1, x2, y2, conf, class]
```
Tracking: DeepSORT
DeepSORT (Simple Online and Realtime Tracking with a Deep Association Metric) is where the magic happens. It combines:
- Kalman Filter: Predicts object positions between frames
- Deep Appearance Features: Uses a CNN (mars-small128) to extract appearance features
- Hungarian Algorithm: Optimally matches detections to tracks
The key insight of DeepSORT is using both motion and appearance for data association. This means even if someone is occluded briefly, we can re-identify them based on their appearance.
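As a rough illustration of that idea, here is a NumPy-only sketch that scores track/detection pairs by appearance (cosine distance) and gates them with a simple center-distance motion check. It uses greedy matching as a simplified stand-in for DeepSORT's actual Mahalanobis gating and Hungarian assignment; all names and thresholds are illustrative.

```python
import numpy as np

def associate(track_feats, det_feats, track_centers, det_centers,
              max_dist=0.4, gate=50.0):
    """Greedy match on appearance cosine distance, gated by motion distance.

    Simplified stand-in for DeepSORT's Mahalanobis gating + Hungarian step.
    Returns a list of (track_index, detection_index) pairs."""
    t = np.asarray(track_feats, dtype=float)
    d = np.asarray(det_feats, dtype=float)
    # Cosine-distance cost matrix (tracks x detections)
    t_n = t / np.linalg.norm(t, axis=1, keepdims=True)
    d_n = d / np.linalg.norm(d, axis=1, keepdims=True)
    cost = 1.0 - t_n @ d_n.T
    # Motion gate: forbid pairs whose centers are too far apart
    tc = np.asarray(track_centers, dtype=float)
    dc = np.asarray(det_centers, dtype=float)
    dists = np.linalg.norm(tc[:, None, :] - dc[None, :, :], axis=2)
    cost[dists > gate] = np.inf
    # Greedy assignment: repeatedly take the cheapest remaining pair
    matches = []
    while np.isfinite(cost).any() and cost.min() < max_dist:
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        matches.append((int(i), int(j)))
        cost[i, :] = np.inf
        cost[:, j] = np.inf
    return matches
```

The appearance term is what lets a briefly occluded person be re-claimed by their old track instead of being handed a fresh ID.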
Implementation Details
1. Detection Pipeline
```python
def detect_objects(self, frame):
    """Run YOLO detection on a frame, keeping confident person boxes."""
    results = self.model(frame)
    boxes = []
    for det in results[0].boxes.data:
        x1, y1, x2, y2, conf, cls = det.tolist()
        if conf > self.conf_threshold and int(cls) == 0:  # COCO class 0 = person
            boxes.append([x1, y1, x2, y2, conf])
    return boxes
```
2. Feature Extraction
The deep association metric uses a CNN to extract 128-dimensional feature vectors:
```python
def extract_features(self, frame, boxes):
    """Extract deep appearance features for each detected box."""
    features = []
    for box in boxes:
        x1, y1, x2, y2 = (int(v) for v in box[:4])
        crop = frame[y1:y2, x1:x2]
        crop_resized = cv2.resize(crop, (64, 128))  # encoder's expected input size
        feature = self.encoder(crop_resized)        # 128-dim appearance vector
        features.append(feature)
    return features
```
3. Tracking and ID Assignment
DeepSORT maintains tracks and assigns IDs:
```python
# Update tracker with new detections
tracks = self.tracker.update(detections, features)
for track in tracks:
    if track.is_confirmed():
        track_id = track.track_id
        bbox = track.to_tlbr()  # [x1, y1, x2, y2]
        # Draw bounding box with stable ID
```
Deployment on Azure
One of the project's key achievements is deploying it as a web application on Azure. The architecture:
- Frontend: Flask web app for video upload and result viewing
- Backend: Python service running YOLO + DeepSORT
- Storage: Azure Blob Storage for input/output videos
- Compute: Azure App Service (free tier)
Challenges in Cloud Deployment
Challenge 1: No GPU on Free Tier
- Solution: Optimized the model to run on CPU, used smaller YOLO variants
- Preprocessing: Resize frames to reduce computation
Challenge 2: Processing Time
- Solution: Async task queue with progress tracking
- Users get notified when processing completes
Challenge 3: Memory Constraints
- Solution: Process video in chunks, stream results
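The chunking idea itself is simple; a minimal sketch of a generator that keeps only one fixed-size batch of frames in memory at a time (the names are illustrative, and the frame source can be any iterator, such as a video decoder):

```python
def iter_chunks(frames, chunk_size):
    """Yield fixed-size chunks so only one chunk is in memory at a time."""
    chunk = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:            # don't drop the trailing partial chunk
        yield chunk
```

Each chunk is processed and its annotated frames are streamed out before the next chunk is decoded.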
Results and Performance
On the MOTChallenge dataset:
- MOTA (Multiple Object Tracking Accuracy): 62.3%
- IDF1 (ID F1 Score): 58.7%
- FPS: ~15 on CPU, ~45 on GPU
Real-world performance:
- Stable IDs across 95% of video duration
- Handles up to 20 people simultaneously
- Robust to partial occlusions
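An ID-stability figure like the one above can be computed from per-frame track logs. A minimal sketch, assuming each entry in `track_ids` is the ID assigned to one ground-truth person on one frame:

```python
from collections import Counter

def id_stability(track_ids):
    """Fraction of frames on which the person's most common ID was assigned."""
    if not track_ids:
        return 0.0
    counts = Counter(track_ids)
    return counts.most_common(1)[0][1] / len(track_ids)
```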
Key Learnings
- Model Selection Matters: YOLOv10 was 30% faster than v8 with minimal accuracy loss
- Feature Quality > Speed: Using deeper appearance features improved ID consistency by 25%
- Kalman Filter Tuning: Process noise and measurement noise need careful tuning for different scenarios
- Production != Research: Deploying on limited resources requires significant optimization
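To make the Kalman-tuning point concrete, here is a minimal 1-D constant-velocity predict/update cycle in NumPy showing exactly where process noise (`q`) and measurement noise (`r`) enter; the values are illustrative, not the project's actual tuning.

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-2, r=1.0):
    """One predict/update cycle of a 1-D constant-velocity Kalman filter.

    Larger q = trust the motion model less; larger r = trust detections less."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
    H = np.array([[1.0, 0.0]])              # we only measure position
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P
```

With a low `r`, the estimate snaps toward each detection; with a high `r`, it coasts on the motion model, which is what smooths tracks through jittery or missed detections.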
Live Demo
Check out the live demo (Note: Azure free tier means slower processing).
The code is open source on GitHub.
What's Next?
I'm exploring:
- Multi-camera tracking (track people across multiple camera feeds)
- Action recognition (what are tracked people doing?)
- Edge deployment on Raspberry Pi
This project taught me that production ML is 10% model development and 90% engineering: deployment, optimization, and robustness.