Clip to source: contextualizing Instagram reels by building a search engine

2026-02-22

I have lost count of the number of times I have been scrolling Instagram reels and come across politicians debating. Sadly, the clips are almost always posted by the politicians themselves, and only show their side of the conversation. I hate this. It completely ruins the video and undermines their own arguments. The whole point of a debate is for me, the viewer, to see both perspectives and decide for myself. Otherwise it’s a complete waste of everyone’s time.

To combat this, I made the only sane decision: I am going to build a search engine to find the source of any debate (in the Swedish parliament) posted on Instagram.

When I first set off, I came up with the following TODOs:

  1. Download the video
  2. Transcribe using locally hosted speech2text
  3. Find the transcribed content in protocols from the Swedish parliament

Downloading content

After some devtooling on Instagram’s public interface (the one you get to if you are not logged in), I quickly realized I’m too lazy to reverse engineer their API - someone else must’ve already done this. Two minutes later I found a cool Python package called “instaloader”1, which lets me download Instagram content via a simple CLI. And thus my bash script was born:

#!/usr/bin/env bash
VIDEO_ID="$1"
# "-- -<shortcode>" tells instaloader to download a single post by its shortcode
instaloader --dirname-pattern 'downloads/ig/{shortcode}' --filename-pattern='{typename}' -- "-$VIDEO_ID"

Transcribing

Yet again, I set off to find the optimal way to transcribe the videos. Most resources pointed towards BigAI™ selling their APIs. I did however find that I could self-host OpenAI’s whisper model2 using a C++ implementation called whisper.cpp3. This too had a nice CLI that I could use in my bash script - I just had to first extract the audio and convert it to the correct encoding (a 16 kHz mono, 16-bit WAV file). My script continues:

WHISPER=./whisper.cpp/build/bin/whisper-cli
WHISPER_MODEL=small

# $VIDEO_FP is the downloaded video file, $VIDEO_DIR its directory (set earlier in the script)
ffmpeg -n -i "$VIDEO_FP" -ar 16000 -ac 1 -c:a pcm_s16le "$VIDEO_DIR/processed-audio.wav"
$WHISPER -otxt -of "$VIDEO_DIR/transcription" -l swedish -f "$VIDEO_DIR/processed-audio.wav" -m "whisper.cpp/models/ggml-$WHISPER_MODEL.bin"
# Outputs a text file $VIDEO_DIR/transcription.txt with the transcribed content

Searching

Nice! We have our reel, we have an audio file, and we have a transcription of what was said. The only step that’s left is to find who said it, when and where. Easy!

To keep things simple, I’d like to avoid as much of a backend as possible (I was kind of planning to host this as CGI scripts; that didn’t happen), so hosting a search engine with a database is my absolute last resort. I started looking into “static” search engines and found Pagefind4 - an open source search engine that doesn’t require any specialized infrastructure. It works by indexing all of your website’s content during the build step, generating a bunch of static files. The client then downloads only the relevant parts in order to perform a search. This actually works insanely well considering how simple it is, and the Pagefind index was only about 90 MB (of which the client only downloads a small fraction!). However, it didn’t work very well for my long but almost perfect quotes. I even tried tweaking the weighting of the search results, but that barely helped at all. This was a bummer, but it made sense: Pagefind is aimed at searching a website like a blog or some documentation, not finding quotes in protocols.

Building a search engine?

What more can I say - I have no other option but to do the very complex task of building my own search engine (“oh no, I’m devastated”). I came up with the following requirements for version 0.1 of my search engine:

  1. Optimized for long search queries with almost exact quotes
  2. Hosted as a set of static files while also performant (aka usable on bad internet, and not too wasteful on my bandwidth)
  3. Possible to extend without rebuilding the whole index

I took some inspiration from Pagefind, from the offline search engines I found earlier, and from the data structures and algorithms course I took half a year ago, and decided to use n-grams. I looked into the advantages and disadvantages of different approaches: n-grams over characters vs. words, reasonable values for n, and how one would go about indexing the n-grams. I also looked into how fuzzy searching works, but realized it wasn’t worth the effort, especially considering my long search queries, where the words are almost always correctly spelled (although far from always correctly transcribed).

N-grams and searching

According to Wikipedia, an n-gram is “a sequence of n adjacent symbols in a particular order”5. Dividing a sentence into n-grams would thus look a bit like this:

n = 3
text = "The quick brown fox jumps over the lazy dog"
words = text.split()
output = [words[i:i + n] for i in range(len(words) - n + 1)]
# output = [
#     ["The", "quick", "brown"],
#     ["quick", "brown", "fox"],
#     ["brown", "fox", "jumps"],
#     ["fox", "jumps", "over"],
#     ["jumps", "over", "the"],
#     ["over", "the", "lazy"],
#     ["the", "lazy", "dog"]
# ]

By then converting our search query into similar n-grams it is possible to find matches:

query = "quick brown fox and the lazy dog something"
query_words = query.split()
query_grams = [query_words[i:i + n] for i in range(len(query_words) - n + 1)]
# query_grams = [
#     ["quick", "brown", "fox"],
#     ["brown", "fox", "and"],
#     ["fox", "and", "the"],
#     ["and", "the", "lazy"],
#     ["the", "lazy", "dog"],
#     ["lazy", "dog", "something"]
# ]

score = 0
for qg in query_grams:
    if qg in output:
        score += 1 / len(query_grams)

# => score = 2/6

I found that 3-grams work quite well for searching almost completely correct quotes - especially when the quotes are at least a few sentences long. It is also possible to use n-grams over characters instead, but that seems to work better on shorter inputs, and especially on misspelled, inexact queries. Character n-grams would also mean many more texts sharing the same n-grams, making it harder to eliminate candidate sources. It would additionally mean that a search query of around 130 words (776 characters) would yield ~774 n-grams instead of ~128 - and this really matters in the next step.
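To sanity-check those counts: a sequence of length L yields L - n + 1 n-grams (a trivial helper, not part of the project):

```python
def ngram_count(length, n=3):
    # number of n-grams in a sequence of the given length
    return max(length - n + 1, 0)

print(ngram_count(130))  # 128 word 3-grams for a 130-word query
print(ngram_count(776))  # 774 character 3-grams for the same query
```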

Indexing and client-side search

In order to perform a search, we first compute the n-grams our query yields. That’s fairly easy, and you can probably imagine what the code looks like (lowercase -> split on non-alphabetic characters -> n-grams). Then we just have to find the sources that contain the largest number of our n-grams.
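That pipeline might look something like this sketch (the exact character set and tokenization are assumptions, not the project’s actual code):

```python
import re

def query_ngrams(text, n=3):
    # lowercase -> split on non-alphabetic characters (here assuming a-z
    # plus the Swedish letters åäö) -> n-grams
    words = [w for w in re.split(r"[^a-zåäö]+", text.lower()) if w]
    # tuples, so the n-grams can later be used as dict/set keys
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

query_ngrams("Quick, brown fox!")  # -> [("quick", "brown", "fox")]
```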

We could do this by indexing all files, and generating a huge JSON file describing what n-grams exist in what files:

{
    "a-b-c": ["file1", "file2", "file3"],
    "b-c-d": ["file2", "file3"],
    "c-d-e": ["file1", "file2"]
}
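A sketch of how such an inverted index could be built (toy data, of course - the real sources are full parliament protocols, not single letters):

```python
import json
from collections import defaultdict

def build_index(files, n=3):
    # files maps a source name to its list of (normalized) words
    index = defaultdict(list)
    for name, words in files.items():
        # use a set so each n-gram is recorded at most once per file
        grams = {"-".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        for gram in sorted(grams):
            index[gram].append(name)
    return dict(index)

files = {
    "file1": ["a", "b", "c", "d", "e"],
    "file2": ["a", "b", "c", "d"],
}
print(json.dumps(build_index(files), indent=4))
```

Each key ends up mapping to every file that contains that n-gram.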

This approach works very well - especially for small sites. The documentation for the requests package for Python, for example, uses this method to power its search functionality6. Its search index JavaScript file is around 25 kB, and even smaller compressed. It does not, however, work very well for my use case. The ~450 files of official transcripts from the Swedish parliament that I am currently indexing generate 7 517 080 unique n-grams, each of which points to multiple files. An index over this dataset is just way too large to send to the client, even when using a custom format far more compact than JSON, plus gzip. I needed a way to split the index up, so that the client doesn’t have to download its irrelevant parts.

One way to do this is to store each n-gram in a separate file. Then, when searching, the client can decide exactly which n-grams it wants to look at, and from that decide which sources are relevant. Nice, we are transferring the minimum amount of data possible! Well, we are also making over 100 requests just to search for a 130-word quote - which takes way too long to complete, even over HTTP/2.

I started researching ways to group n-grams into logical groups. At first it seemed reasonable that there must be a nice and simple way to split my index into files such that each query only needs a couple of requests. After all, if I am looking at the n-gram (b, c, d) there is a high probability I am also interested in (a, b, c) and (c, d, e), and maybe even (c, d, f). Sadly, I never found a good way to do this without reading the whole index into memory, doing some magic statistical analysis, and then sharding based on that. While my dataset is small enough to do that every time I add data, it goes against my third requirement: extending the index without rebuilding it.

In the end I settled on a simpler approach: I hash each n-gram and put it in one of 4k bins (a bit like a hash map over HTTP). Sadly, this still means roughly one request per query n-gram - but at least I don’t have to burden my file system with 7.5 million files, and caching actually becomes kind of possible (although still not optimal). I wrote my indexing program in Go and I am quite pleased with the results: 490 files indexed in ~14 seconds, with a total index size of ~474 MB (~100 MB gzipped). I could probably improve on this, but for now I’m happy.
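The binning idea can be illustrated like this (a minimal sketch assuming 4096 bins, SHA-1, and a made-up file layout - not necessarily what the Go indexer actually does):

```python
import hashlib

NUM_BINS = 4096

def bin_for(gram: str) -> str:
    # hash the n-gram and map it to one of NUM_BINS bucket files;
    # hashlib (unlike Python's built-in hash) is stable across processes,
    # so the indexer and every client agree on the mapping
    h = hashlib.sha1(gram.encode("utf-8")).digest()
    bucket = int.from_bytes(h[:4], "big") % NUM_BINS
    return f"index/{bucket:04d}.bin"

bin_for("the-lazy-dog")
```

Because the mapping is deterministic, each query n-gram costs at most one request, and the bucket files are cacheable.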

Conclusions

I finished up the project by writing a little Go server to queue video processing requests, run my bash scripts, and host the static files, since that was easier than configuring a bunch of CGI scripts. While technically usable, it is not 100% finished yet. The UX is terrible (finding where in the source a quote appears is not implemented, for example), and I would love to include more than just the parliament’s transcripts in the future. However, in the end I am happy with the results, and it was quite the learning experience!


The completed project can be found here: clip2src.mrks.se

Almost everything is, of course, in Swedish - but still, try these two reels and you’ll see why this tool is useful:


  1. https://instaloader.github.io/  ↩︎

  2. https://github.com/openai/whisper  ↩︎

  3. https://github.com/ggml-org/whisper.cpp  ↩︎

  4. https://pagefind.app/  ↩︎

  5. https://en.wikipedia.org/wiki/N-gram  ↩︎

  6. https://requests.readthedocs.io/en/latest/searchindex.js  ↩︎