Learn how to effectively prepare a dataset, transform audio into vector embeddings, and establish a robust vector search system for audio queries.
Audio fingerprinting is a technology that enables the identification and retrieval of audio content by creating a unique representation, or "fingerprint," of an audio signal. This process involves converting audio data into vector embeddings, which are dense numerical representations that capture the essential features of the audio.
As digital audio usage continues to grow, efficient audio processing techniques are becoming more critical. Audio fingerprinting can identify audio clips even with some distortions or noise. By transforming audio samples into vector embeddings, you can achieve precise audio matching and facilitate applications such as speaker identification, music recognition, and voice verification.
In this guide, you will learn how to prepare a dataset, convert audio into vector embeddings, and set up a vector search for audio queries, making it possible to match audio samples efficiently and accurately.
Before diving into the code and setup, ensure you have a basic understanding of the following tools:
Wav2Vec 2.0 is a state-of-the-art embedding model by Facebook AI, specifically designed for speech recognition. It uses a convolutional neural network to extract features from raw audio and map them into vector embeddings. Wav2Vec 2.0 is pre-trained on large-scale datasets, allowing it to capture meaningful representations of audio data. For tasks such as audio fingerprinting, these embeddings provide a high-quality numerical representation that can distinguish between different audio samples.
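To get a feel for what the model produces, here is a minimal sketch (using the same checkpoint this guide relies on later) that runs one second of dummy audio through Wav2Vec 2.0 and inspects the output shape:

import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# One second of silence at 16 kHz, purely to inspect output shapes
dummy = np.zeros(16000, dtype=np.float32)
inputs = processor(dummy, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state
print(frames.shape)  # torch.Size([1, 49, 768]): ~50 frames per second, each a 768-dim vector

The model emits a sequence of frame-level vectors rather than a single embedding, which is why the pipeline below pools them into one vector per clip.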
You can also experiment with other audio embedding models like VGGish by Google, which is trained on YouTube sound data and can be a suitable choice for general audio tasks.
ChromaDB is a vector database designed to support high-speed similarity searches. It enables efficient storage, indexing, and retrieval of vector embeddings, making it ideal for applications requiring fast query times and scalability.
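To see the retrieval mechanic in isolation, here is a toy sketch with made-up 2-D vectors (not audio embeddings): vectors pointing in similar directions come back first when queried.

import chromadb

client = chromadb.Client()  # in-memory instance
demo = client.get_or_create_collection(name="demo")
# Three toy vectors; "a" and "c" point in nearly the same direction
demo.add(embeddings=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], ids=["a", "b", "c"])
results = demo.query(query_embeddings=[[1.0, 0.0]], n_results=2)
print(results["ids"])  # [['a', 'c']]: nearest neighbours by distance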
Ensure you have Python installed, along with the following libraries:
!pip install librosa pydub
!pip install transformers torch
!pip install chromadb
!pip install numpy scipy
A high-quality dataset is essential for accurate audio fingerprinting. For this project, we have selected a small set of audio files to illustrate the steps involved in fingerprinting: an original track (Christina Perri - A Thousand Years (Lyrics).wav), an acoustic cover of the same song (A_Thousand_Years_Christina_Perri_Boyce_Avenue_acoustic_cover_on.wav), and an unrelated track (Benson Boone - Beautiful Things (Official Music Video).wav).
To prepare these files for embedding and vector storage, we preprocess them by normalizing the amplitude, resampling to a common rate, and padding or trimming each clip to a standard duration:
import librosa
import numpy as np

TARGET_DURATION = 60  # seconds

def preprocess_audio(file_path, target_sr=16000, target_duration=TARGET_DURATION):
    # Load the file and resample to the target rate in one step
    y, sr = librosa.load(file_path, sr=target_sr)
    # Normalize the amplitude to [-1, 1]
    y = librosa.util.normalize(y)
    # Pad with silence or trim so every clip has the same length
    target_length = target_sr * target_duration
    if len(y) < target_length:
        padding = target_length - len(y)
        y = np.pad(y, (0, padding), mode='constant')
    else:
        y = y[:target_length]
    return y, sr
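As a quick sanity check (the file path here is hypothetical), every preprocessed clip should come back as exactly 60 seconds of 16 kHz audio:

y, sr = preprocess_audio("example.wav")  # hypothetical local file
print(sr, len(y) / sr)  # 16000 60.0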
To represent audio files numerically, we convert them into embeddings:
from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def get_audio_embedding(file_path):
    audio, sr = preprocess_audio(file_path)
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)
    with torch.no_grad():
        # Mean-pool the frame-level hidden states into one vector per clip
        embeddings = model(**inputs).last_hidden_state.mean(dim=1)
    return embeddings.squeeze().numpy()
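For the base checkpoint, this yields a single 768-dimensional vector per file, regardless of the clip's length (the path is again hypothetical):

embedding = get_audio_embedding("example.wav")  # hypothetical local file
print(embedding.shape)  # (768,): mean-pooling collapses the time axis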
With embeddings ready, we need a database for efficient querying. ChromaDB is ideal because it supports similarity-based search and can handle metadata.
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection(name="audio_embeddings")

def add_to_database(embedding, audio_id):
    # Store the vector with metadata; the file path doubles as the unique ID
    collection.add(
        embeddings=[embedding],
        metadatas=[{"audio_id": audio_id, "location": audio_id}],
        ids=[audio_id]
    )
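By default, ChromaDB ranks results by squared-L2 distance. If you would rather use cosine distance (a common choice for pooled neural embeddings), you can set the index space when creating the collection. This is an optional variant, not what the rest of this guide uses:

collection = client.get_or_create_collection(
    name="audio_embeddings_cosine",
    metadata={"hnsw:space": "cosine"}  # "l2" (default), "ip", or "cosine"
)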
To find matches for a new audio file:
def search_audio(query_file, top_k=5):
    # Embed the query clip and retrieve the nearest stored embeddings
    query_embedding = get_audio_embedding(query_file)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["ids"][0], results["distances"][0]
def print_search_results(query_file):
    ids, distances = search_audio(query_file)
    for i, (audio_id, dist) in enumerate(zip(ids, distances)):
        print(f"Rank {i + 1}: Audio ID = {audio_id}, distance = {dist}")
With the core system in place, we can test it with actual audio files.
# Index the original track, then query with an acoustic cover of the same song
embedding = get_audio_embedding("/home/ml/projects/audio-fingerprinting/dataset/Christina Perri - A Thousand Years (Lyrics).wav")
add_to_database(embedding, "/home/ml/projects/audio-fingerprinting/dataset/Christina Perri - A Thousand Years (Lyrics).wav")

print_search_results("/home/ml/projects/audio-fingerprinting/dataset/A_Thousand_Years_Christina_Perri_Boyce_Avenue_acoustic_cover_on.wav")
Number of requested results 5 is greater than number of elements in index 2, updating n_results = 2
Rank 1: Audio ID = /home/ml/projects/audio-fingerprinting/dataset/Christina Perri - A Thousand Years (Lyrics).wav, distance = 2.059918165206909
Rank 2: Audio ID = /home/ml/projects/audio-fingerprinting/dataset/Benson Boone - Beautiful Things (Official Music Video).wav, distance = 2.201109647750854
Note that ChromaDB reports distances, so lower values mean closer matches. Rank 1, the original Christina Perri - A Thousand Years (Lyrics), has the smallest distance to the acoustic-cover query, indicating a strong match, while Rank 2, an unrelated track, sits noticeably further away.
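In practice you often want a yes/no answer rather than a ranking. A simple heuristic is a distance threshold that sits between typical match and non-match distances; the cutoff below is illustrative only, chosen to separate the 2.06 match from the 2.20 non-match above, and should be tuned on your own data:

THRESHOLD = 2.1  # illustrative cutoff; tune on a validation set
ids, distances = search_audio("/home/ml/projects/audio-fingerprinting/dataset/A_Thousand_Years_Christina_Perri_Boyce_Avenue_acoustic_cover_on.wav")
matches = [(audio_id, dist) for audio_id, dist in zip(ids, distances) if dist < THRESHOLD]
print(matches if matches else "No confident match")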
Above, we demonstrated the process of building an audio fingerprinting system using Wav2Vec 2.0 for embedding extraction and ChromaDB for vector search. By following these steps, you can transform audio files into embeddings, store them in a vector database, and efficiently retrieve matches based on similarity scores. This approach is scalable and can handle large datasets by simply adding more audio embeddings to ChromaDB. The process also illustrates the versatility of embedding models like Wav2Vec 2.0 in handling diverse audio sources, from original songs to covers.
For future enhancements, consider experimenting with alternative embedding models such as VGGish, expanding the indexed dataset, and fine-tuning Wav2Vec 2.0 on audio from your own domain.
By implementing these improvements, the fingerprinting system can achieve higher accuracy and greater flexibility, allowing it to handle more complex audio tasks in real-world applications.
Audio fingerprinting technology is likely to impact several industries by enhancing how audio content is identified, tracked, and utilized. Below are the key industries that will experience substantial changes.
The music industry is perhaps the most directly affected sector. Audio fingerprinting allows for:
In advertising, audio fingerprinting can be utilized to:
The broadcasting industry can benefit from audio fingerprinting through:
In the broader entertainment sector, including film and television:
Platforms that host user-generated content will also see significant impact:
In gaming, audio fingerprinting can enhance user experiences by:
As audio fingerprinting technology continues to evolve, its applications will expand across various sectors beyond those mentioned. The ability to accurately identify and manage audio content will not only streamline operations but also create new business models centered around data-driven insights into consumer behavior.
If you want to learn more about the potential of audio fingerprinting for your business, connect with us for a demo.