AI DJ System




AI DJ System – a smart music box that reads the room

Hi! I’m Maksym Chekhun, a CT&AI student at Howest University of Applied Sciences.

I’ve built a tiny “AI DJ” that watches what I’m doing (choosing from a set of predefined activities) and automatically switches the music on my Mac’s Spotify to match the vibe.


The brain lives on a Raspberry Pi 5 with a webcam and a YOLOv8 model that I trained in Roboflow. Every few seconds the Pi snaps a photo, recognises the activity, and sends both the label and a snapshot to a Flask server on my Mac. The server starts the matching playlist via AppleScript in the Spotify desktop app and logs each event (activity, time, playlist, and snapshot name) to a MySQL database.


A 16 × 2 I²C LCD on the Pi shows what’s playing in real time, so anyone in the room can see why the music changed. The whole unit is housed in a small wooden box I call a “radio” case, with space for the Pi, camera, and—potentially—speakers.


Supplies




  1. Raspberry Pi 5 (8 GB) Starter Pack – board, PSU, 32 GB micro-SD, plastic case. Brain of the project; runs YOLOv8 and drives the LCD. ≈ €130
  2. Active cooler / fan for Pi 5 – keeps the CPU < 80 °C during inference. ≈ €6
  3. USB webcam (1080 p, 30 fps) – supplies one frame every few seconds for detection. ≈ €12
  4. 16 × 2 I²C LCD module (+ jumper wires) – displays current activity / playlist. ≈ €8
  5. 7-in-1 USB-C hub – adds extra USB-A ports for webcam, keyboard, etc. ≈ €10
  6. Small wooden “radio” box (DIY / laser-cut) – enclosure for Pi, webcam, LCD. ≈ €10 (free if you’ve got scrap plywood or an old speaker cabinet)
  7. Pair of tiny 3 W laptop speakers (optional) – local playback. ≈ €10
  8. Mac or PC running Spotify Desktop + Flask server – receives labels, starts playlists, logs to MySQL. (existing hardware – €0)
  9. MySQL / MariaDB server – stores activity, timestamp, URI, snapshot. (open source – €0)
  10. Python libraries – Python 3.11, Ultralytics YOLOv8, Flask, PyMySQL (all via pip). (free)
  11. Basic hand tools – small Phillips screwdriver, wire-cutters / strippers, hot-glue gun or double-sided tape. (workshop staples – €0–€15)



Budget note: If a Pi 5 is hard to find, a Raspberry Pi 4 (4 GB) works too; inference is only about 20 % slower, perfectly fine for a snapshot every few seconds.


Total hardware cost if you’re starting from scratch: roughly €190–€200. Re-use a webcam, speakers or an old wooden box and you can shave that below €150.

Collecting Data

First things first: the project needed a reliable action-detection model, and that meant training my own. I chose YOLO, but of course YOLO is only as good as the data you feed it. My target was a compact dataset—at least 500 images—showing real people performing my five key activities.

I began by scouring Kaggle for public datasets. Plenty looked interesting at first glance, yet most either lacked the exact actions I needed or were shot in artificial studio settings that wouldn’t match a dorm room webcam. So I cherry-picked only the images that fit my scene and moved on.

Next stop was Google Images. Using Creative-Commons filters, I pulled down dozens of candid photos that were both royalty-free and privacy-safe. Still short on a few classes, I turned the camera on my own environment: I corralled classmates, handed out snacks, and snapped them reading, typing on laptops and generally acting natural. The mix of web finds and home-grown shots gave me a balanced, realistic pool of images.

With the pictures collected, the final (and most tedious) task was manual labelling. Every photo went through Roboflow, where I drew a bounding box around the whole scene—person plus the relevant object—and assigned the correct label. Only after that marathon click-fest did I finally have a clean, custom dataset ready for training.



Annotating the Data



With a folder full of raw images, I rolled up my sleeves for the click-fest that makes the magic possible: annotation. Rather than wrangle COCO JSON or Pascal-VOC XML by hand, I opened Roboflow, dragged every picture into the browser, and let the workflow guide me.


One box, one label.

For each image I drew a single, generous bounding box that framed the whole scene—both the person and the key prop that tells the story: a paperback for reading, a glowing laptop for working, a steaming bowl of noodles for eating, a chessboard for playing chess, or a blanket-cocoon for sleeping. This “big-context” box gives YOLO enough visual clues without drowning it in tiny, hard-to-learn parts.


Automatic splits.

With the last box drawn, Roboflow handled the dull bits for me: shuffling images into train/validation/test folders and renaming everything to a YOLO-friendly format. One export click later and I had neatly paired /images and /labels directories, perfectly organised for the next stage.
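
For reference, the dataset config that the training step points at looks roughly like this. It’s a sketch of a typical Roboflow YOLOv8 export, not my exact file; your paths and class order may differ.

# data.yaml – sketch of a Roboflow YOLOv8 export
train: ../train/images
val: ../valid/images
test: ../test/images

nc: 5
names: ["eating", "playing_chess", "reading", "sleeping", "working"]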


Painful on the wrist? A little. Worth it? Absolutely—because clean labels are the difference between a model that just “sort of” works and one that nails the vibe every single time.

Model Training


With my dataset boxed, labelled and neatly zipped, it was finally time to turn pixels into predictions. I did the number-crunching on my desktop PC and MacBook; training took a while, but the extra horsepower let me experiment with several YOLO flavours until I found the sweet spot.


First, I installed a CUDA-enabled build of PyTorch so the model could run on the GPU rather than plod along on the CPU. (If you plan to do the same, make sure you grab the PyTorch version that matches your CUDA toolkit—details are on the official site.)

from ultralytics import YOLO
import torch  # the CUDA-enabled build; lets you check torch.cuda.is_available()

model = YOLO("yolov8s.pt")  # load a pretrained YOLOv8-small backbone
model.to("cuda")  # push the network onto the GPU
results = model.train(data="data.yaml", epochs=100)  # data.yaml comes from the Roboflow export


Three lines—and the marathon begins.

epochs=100 means the network will see the entire dataset a hundred times. You can tweak dozens of other hyper-parameters, but epochs, image size and batch size are the big levers.
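
To show what pulling those levers looks like in practice, here is a sketch of a run with an explicit image size and batch size. The values are illustrative defaults, not the exact settings I used.

from ultralytics import YOLO

model = YOLO("yolov8s.pt")
results = model.train(
    data="data.yaml",   # Roboflow export: train/valid/test paths plus the five class names
    epochs=100,         # how many times the network sees the whole dataset
    imgsz=640,          # input resolution – bigger helps small details, costs GPU memory
    batch=16,           # images per step – lower this if you hit out-of-memory errors
    device=0,           # first CUDA GPU
)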


Training results scroll by in real time: loss curves, precision/recall, mAP on the validation set. I kept an eye on the chess and reading classes (my trickiest pair) and stopped only when their curves levelled out. Whenever the metrics plateaued too low, I tweaked an augmentation or added a handful of extra images, then fired up a new run—patience and iteration beat one-shot perfection every time.


So yes: start the training, queue up a movie, grab coffee with friends. When you get back, your GPU will (hopefully) have forged a model that’s ready to steer Spotify like a pro DJ.

Coding


After a marathon of coffee-fueled debugging, the project finally left Jupyter notebooks and terminal windows and stepped into the real world. The coding phase had one clear mission: teach a Raspberry Pi to “see” my activity and teach my Mac to “hear” the Pi and make Spotify sing on cue. Everything else—hardware, dataset, even the shiny confusion matrix—was useless until those two machines could hold that conversation.


I started with the Pi client. The script opens innocently enough: import cv2, from ultralytics import YOLO, a handful of GPIO and LCD libraries, and—most importantly—the tiny eight-megabyte TorchScript file that contains all the model’s hard-won wisdom. Every five to twelve seconds the Pi grabs a frame from the USB webcam, pushes it through YOLO, and earns back a single word: reading, working, eating, sleeping, or playing_chess. A short block of logic smooths rough edges—ignoring shaky low-confidence detections, counting empty frames, defaulting to a nature-sounds playlist if the room goes dead quiet. Whatever label survives is splashed across the 16 × 2 I²C LCD bolted to the front of the wooden “radio” box, and then the Pi fires off a tiny JSON packet to the Mac:

{"label": "reading", "snapshot_path": "reading_20250619_142501.jpg"}


The JSON packet lands in a tiny, single-route Flask server on my Mac. In barely fifty lines of code, it plays maître d’ to my musical moods: looks up the label in a dictionary, grabs the matching Spotify URI, and whispers one line of AppleScript—tell application “Spotify” to play track…. If that playlist is already spinning, the server politely does nothing. Then it logs the label, URI, timestamp, IP and snapshot into MySQL so I can later chart how often chess trumps studying.
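
A minimal sketch of that server is below, assuming hypothetical playlist URIs, a hypothetical events table, and placeholder MySQL credentials; the “already playing, do nothing” check and the IP logging from the real script are left out for brevity.

import datetime
import subprocess
from flask import Flask, request, jsonify
import pymysql

app = Flask(__name__)

# Placeholder playlist map – adding a new activity is one extra line here
PLAYLISTS = {
    "reading":       "spotify:playlist:XXXXXXXXXXXXXXXXXXXXXX",
    "working":       "spotify:playlist:XXXXXXXXXXXXXXXXXXXXXX",
    "eating":        "spotify:playlist:XXXXXXXXXXXXXXXXXXXXXX",
    "playing_chess": "spotify:playlist:XXXXXXXXXXXXXXXXXXXXXX",
    "sleeping":      "spotify:playlist:XXXXXXXXXXXXXXXXXXXXXX",
}

def play(uri):
    # Ask the Spotify desktop app (macOS only) to start the playlist via AppleScript
    subprocess.run(["osascript", "-e",
                    f'tell application "Spotify" to play track "{uri}"'])

@app.route("/activity", methods=["POST"])
def activity():
    data = request.get_json()
    label = data.get("label")
    uri = PLAYLISTS.get(label)
    if uri is None:
        return jsonify(error="unmapped"), 404
    play(uri)
    # Log the event – table name and credentials are placeholders
    conn = pymysql.connect(host="localhost", user="dj", password="secret", database="ai_dj")
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO events (activity, uri, snapshot, ts) VALUES (%s, %s, %s, %s)",
            (label, uri, data.get("snapshot_path"), datetime.datetime.now()),
        )
    conn.commit()
    conn.close()
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)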


The first end-to-end test felt like wizardry. I opened a paperback—LCD said Reading; lo-fi study beats drifted in. Laptop up—Working; synthwave kicked on. A few blank frames while I paced the room and the fallback Nature playlist of rain ambience faded in seamlessly. Only glitch: one typoed URI that threw a quick “404 unmapped.” Fix the key, hit save—pipeline perfect.


So now a webcam, a Pi, and a few hundred lines of Python curate my soundtrack hands-free. The code is compact enough to skim over coffee, yet flexible enough to add a new activity with just a handful of images and one line in the playlist map. Two days of wiring, and my dorm has a living, breathing AI DJ.

Extra: LCD Display


For an extra flourish I decided to bolt a tiny 16 × 2 I²C LCD onto the Raspberry Pi so the AI DJ could literally announce each vibe change as it happened. The wiring was kindergarten-simple—just four jumper leads from 5 V, GND, SDA and SCL—but the effect is pure sci-fi: the moment the camera sees reading, the screen flashes “Activity: Reading” and lo-fi beats seep from the speakers; flick open a laptop and a heartbeat later it proclaims “Working” while synthwave takes over; leave the room long enough and the fallback “Nature mode” scrolls across the display as rain ambience fades in.

Under the hood a pocket-sized helper class (a semester-two relic I wrote for our Sensors & Interfacing course) wraps the RPLCD library and hides the I²C gymnastics—one call to lcd_show() is all it takes. The Pi was already POSTing JSON labels to my Mac, so I simply print the same label locally whenever it changes; no need for extra sockets or MQTT detours. It’s a fifteen-line addition to the client script, but it transforms the box from a mysterious mood-shifting appliance into a jukebox that politely tells the whole room why the soundtrack just flipped.
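
A sketch of what that helper boils down to is below. The 0x27 address and PCF8574 expander are the usual defaults for these backpack modules, not necessarily yours; run i2cdetect to confirm, and treat the class and method names as illustrative.

from RPLCD.i2c import CharLCD

class LcdDisplay:
    def __init__(self, address=0x27):
        # 16 × 2 character LCD behind a PCF8574 I²C backpack
        self.lcd = CharLCD("PCF8574", address, cols=16, rows=2)

    def lcd_show(self, line1, line2=""):
        self.lcd.clear()
        self.lcd.write_string(line1[:16])    # top row
        self.lcd.cursor_pos = (1, 0)         # jump to the second row
        self.lcd.write_string(line2[:16])

# Usage inside the client loop, whenever the label changes:
# display = LcdDisplay()
# display.lcd_show("Activity:", label.capitalize())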

Time to Turn on the Music!

🎶🎶🎶 Everything is wired, trained, and humming—time for the best part: letting the AI DJ take over the room. Lean back, crack open a book, start typing a report, or set up a quick chess match and watch the music shift with you in real time. It’s not just about flawless detection stats; it’s about enjoying the vibe and celebrating all the late-night photo shoots, annotation marathons, and debugging sprints that made this wooden “radio” box come alive.

Happy listening, and may your soundtrack always match your mood! 🎶🎶🎶