The Lab
1. Methodology
Telling AI-generated music from human-performed music is not a solved problem in 2026. Rather than pretending otherwise, we split the task into three pillars and tell the user which one produced their verdict:
1. Platform identification
Given a URL from an AI-only platform (Suno, Udio, ElevenLabs, AIVA, Soundraw, Mubert, Boomy, Loudly, Beatoven), the app returns an instant 100%-certain verdict. Those platforms only host AI output, so no model is needed (a sketch of this check follows the list below).
2. Audio-signal model
For anything without a platform URL — Apple Music links and mic scans — audio is sent to our Hugging Face Space. Today that runs an off-the-shelf speech-trained classifier; the v2 replacement below is in training.
3. Stem-level analysis
For ambiguous verdicts we’ll separate audio into vocals / drums / bass / other and run the detector on each stem. Research-only today; see the stem-separation hypothesis note below.
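A minimal sketch of how the platform check in pillar 1 might look. The domain list, helper name, and return shape are illustrative assumptions, not the app's actual code.

```python
from urllib.parse import urlparse

# Domains that only host AI-generated output (illustrative; real hostnames may differ).
AI_ONLY_PLATFORMS = {
    "suno.com", "udio.com", "elevenlabs.io", "aiva.ai", "soundraw.io",
    "mubert.com", "boomy.com", "loudly.com", "beatoven.ai",
}

def platform_verdict(url: str):
    """Return an instant AI verdict for AI-only platform links, else None."""
    host = (urlparse(url).hostname or "").removeprefix("www.")
    if host in AI_ONLY_PLATFORMS or any(host.endswith("." + d) for d in AI_ONLY_PLATFORMS):
        return {"verdict": "AI", "confidence": 1.0, "source": "platform"}
    return None  # fall through to the audio-signal model
```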
2. Model versions
Every call to the detector goes through one of these:
v1 · production (current)
Off-the-shelf mo-thecreator/Deepfake-audio-detection, a Wav2Vec2 base fine-tuned on speech-deepfake datasets.
Limit we’re honest about: this model was trained to distinguish TTS / voice-cloned speech from real speech. On music, it skews heavily toward the “AI” class — Kanye tracks, Queen tracks, anything with vocals over live instruments tends to score around 0.7. That’s flagged in the app on every experimental result.
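For reference, a minimal sketch of what the v1 path amounts to: scoring a clip with that checkpoint through the standard transformers audio-classification pipeline. The label names shown in the comment are assumptions; the model card defines the real ones.

```python
from transformers import pipeline

# Off-the-shelf speech-deepfake classifier used by v1.
clf = pipeline("audio-classification", model="mo-thecreator/Deepfake-audio-detection")

def v1_score(wav_path: str) -> dict:
    """Score one clip; returns per-label scores, e.g. {"fake": 0.71, "real": 0.29}."""
    return {r["label"]: r["score"] for r in clf(wav_path)}
```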
v2 · in training
Our own music-native detector, fine-tuned from MERT-v1-95M — a self-supervised music encoder pre-trained on 160 000 hours of music.
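A minimal sketch of the shape v2 is taking: the MERT encoder with a small classification head on top, fine-tuned on labelled clips. The pooling, head, and label order here are illustrative assumptions, not a settled training recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, Wav2Vec2FeatureExtractor

MERT_ID = "m-a-p/MERT-v1-95M"  # self-supervised music encoder, expects 24 kHz audio

class MusicAIDetector(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MERT_ID, trust_remote_code=True)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_values).last_hidden_state  # (batch, time, dim)
        return self.head(hidden.mean(dim=1))                   # logits: [human, ai]

# The matching feature extractor handles resampling and normalisation of raw audio.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MERT_ID, trust_remote_code=True)
```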
v3 · planned
Stem separation (via Spleeter or Open-Unmix, CPU-only for free HF Spaces) plus the v2 detector run on each stem. On the roadmap for after v2 ships, once we see where it is weakest.
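A minimal sketch of how that could be wired, assuming Spleeter's 4-stem model and whatever detector is current at the time; the function name and output layout are illustrative.

```python
from pathlib import Path
from spleeter.separator import Separator

def per_stem_scores(clip_path: str, detector, out_dir: str = "stems") -> dict:
    """Split a clip into vocals/drums/bass/other and score each stem separately."""
    Separator("spleeter:4stems").separate_to_file(clip_path, out_dir)
    stem_dir = Path(out_dir) / Path(clip_path).stem  # Spleeter writes one folder per input
    return {wav.stem: detector(str(wav)) for wav in sorted(stem_dir.glob("*.wav"))}
```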
3. Where we are right now
What that 0.003 means: across a 7-clip benchmark (Queen, Coldplay, Michael Jackson, Ed Sheeran, The Weeknd, Eagles, Suno-generated track) the mean wav2vec2 score on AI clips was 1.000 and on real clips was 0.997 — a difference of 0.003. The model isn’t discriminating; it is simply saying “AI” to every piece of music with high confidence. Replacing this is the single most valuable thing we can do, and it’s the focus of the v2 work.
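For clarity, the separation figure is just the gap between the mean AI-score on AI-labelled clips and on human-labelled clips. A trivial sketch (the inputs are placeholders, not the benchmark's raw scores):

```python
def separation(ai_scores: list[float], real_scores: list[float]) -> float:
    """Mean AI-probability on AI clips minus mean AI-probability on real clips."""
    return sum(ai_scores) / len(ai_scores) - sum(real_scores) / len(real_scores)

# v1 on the 7-clip benchmark: means of 1.000 (AI clips) and 0.997 (real clips) -> 0.003.
```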
4. Lab notes
Reverse-chronological. No cadence commitment; notes go up whenever there is a change worth recording.
Refactored AdConfig so debug builds always serve Google’s canonical test creatives and release builds serve real production IDs, switched automatically via kReleaseMode. Also registered test-device hashes to protect the AdMob account from self-click invalid traffic when dogfooding release builds.
TestFlight surfaced how hard we were hammering the backend: 18-track albums were firing 18 sequential detector calls, tripping the 5/min rate limit and cascading into “null” errors. Now we auto-scan the first 3 tracks and mark the rest as “not scanned” with a clear explainer banner. This also saves the user’s daily free-scan quota.
Renamed the app to “revAIl”: the “AI” sits literally inside the name. That is more honest about the research-in-progress positioning than “CheckAI”, which implied a finished consumer product we don’t yet have. Logo, splash, store listing, and this website moved over in one pass.
Ran a 7-clip benchmark against two popular Hugging Face deepfake models — both returned the same ~1.0 score on Suno-generated audio and on Queen’s Bohemian Rhapsody. Separation 0.000. This is why we committed to training our own music-native classifier rather than picking a different existing model.
Pasted Suno / Udio / ElevenLabs links now return an instant 100% verdict without touching the audio model or the backend. These platforms only host AI content, so the audio-path inference was waste. Speeds up verdicts and preserves the user’s daily scan quota for ambiguous cases.
5. Open questions we’re thinking about
- Is per-stem analysis worth the CPU? Spleeter runs around 10 seconds per clip on the HF free tier. Architecture sketched in backend/ but not wired up until v2 proves itself.
- How do we stay ahead of new AI generators? A model trained on Suno / Udio / ElevenLabs output today will be outdated the moment a new generator lands with a different artefact signature. Looking at continuous retraining loops and user-submitted fixtures.
- Is there a signature-based approach we’re missing? Some AI generators embed metadata or inaudible watermarks. We skipped these because they’re easy to strip, but a hybrid (watermark + audio model) may deserve revisiting.
6. How to reach us
If you’re researching this problem space or have a corpus of labelled AI-music clips we could partner on: msquaregiza@gmail.com.
RevailLab