The awakening of slowness

You’ve been working on your data science project for a while. You’ve got twenty notebooks, a few images, and the typical folder structure that seemed like a good idea three months ago.

You run git status to see what you’ve changed and… you wait. And wait. And while you wait, you have time to wonder if the computer froze or is just meditating.

Spoiler: it’s not meditating. It’s suffering.

The problem has a name (and a surname)

Git isn’t slow. Your repo is.

When you run git status, git has to do two things that seem simple but aren’t:

  1. Scan the entire file tree to see what’s changed
  2. Compare each file with what it has stored
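To make those two steps concrete, here is a toy sketch in Python. It is not how git is implemented: the `snapshot` and `toy_status` helpers and the file names are made up for illustration, and the "index" is just a dictionary of file sizes and modification times.

```python
import os
import tempfile

def snapshot(root):
    """Record (size, mtime_ns) for every file under root, like a toy index."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            index[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return index

def toy_status(root, index):
    """Step 1: walk the whole tree. Step 2: compare each file with the stored data."""
    current = snapshot(root)
    modified = sorted(p for p in current if p in index and current[p] != index[p])
    untracked = sorted(p for p in current if p not in index)
    deleted = sorted(p for p in index if p not in current)
    return modified, untracked, deleted

with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "analysis.ipynb"), "w") as f:
        f.write("{}")
    index = snapshot(repo)  # what the toy "git" remembers

    with open(os.path.join(repo, "analysis.ipynb"), "w") as f:
        f.write('{"cells": []}')  # the size changes, so the stat data differs
    with open(os.path.join(repo, "scratch.ipynb"), "w") as f:
        f.write("{}")  # a file the index has never seen

    result = toy_status(repo, index)
    print(result)
```

Note that the whole tree gets walked even though only two files are interesting. That cost grows with the number of files, not with the number of changes.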

In a normal repo this is instantaneous. But Jupyter notebooks are JSON disguised as documents. And not just any JSON: one that includes code, outputs, base64-encoded images, kernel metadata, and basically everything the person who designed the format could think of.

A “small” notebook with a few charts can weigh several megabytes. Multiply that by twenty notebooks and you have a repo that groans every time you look at it.
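A quick way to see where the weight comes from is to measure how much of a notebook's JSON is outputs. The sketch below builds a minimal notebook-shaped dictionary by hand; the 200 KB of zero bytes is only a stand-in for a real chart image, and the numbers will differ for your own notebooks.

```python
import base64
import json

# Stand-in for a PNG chart: 200 KB of raw bytes, base64-encoded as Jupyter would store it.
fake_png = base64.b64encode(bytes(200_000)).decode("ascii")

notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["df.plot()"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {"image/png": fake_png},
                }
            ],
        }
    ],
}

total = len(json.dumps(notebook))
outputs_only = sum(len(json.dumps(c.get("outputs", []))) for c in notebook["cells"])
share = outputs_only / total
print(f"{total} bytes total, {share:.0%} of it outputs")
```

To try it on a real file, load the notebook with `json.load` instead of building the dictionary by hand.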

And if you rename a folder on top of that… until you stage the move, git status reports it as “you deleted 50 files and created 50 new ones.” Fun guaranteed.

Solution 1: Put a doorman on your repo

The first solution is so elegant it’s annoying not to have known about it before.

It’s called FSMonitor and it works like this: instead of git scanning the entire repo every time you do something, the operating system tells it which files have changed.

In plain English: it’s like having a doorman at the entrance who tells you “only Juan came in” instead of having to check the entire guest list every time.
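The doorman idea fits in a few lines of Python. This is only a toy model: `ToyMonitor` and `files_to_check` are invented names, and the real mechanism between git and the operating system is more involved than a set of paths.

```python
class ToyMonitor:
    """Stands in for the OS notification service: it collects paths that changed."""

    def __init__(self):
        self._changed = set()

    def notify(self, path):
        self._changed.add(path)

    def drain(self):
        """Return the paths reported since the last call and start a fresh list."""
        changed, self._changed = self._changed, set()
        return changed

def files_to_check(all_files, monitor=None):
    """Without a monitor, every file is a suspect; with one, only the reported ones."""
    if monitor is None:
        return sorted(all_files)
    return sorted(monitor.drain())

all_files = [f"notebooks/nb_{i:03}.ipynb" for i in range(400)]

monitor = ToyMonitor()
monitor.notify("notebooks/nb_007.ipynb")  # "only Juan came in"

without_doorman = files_to_check(all_files)
with_doorman = files_to_check(all_files, monitor)
print(len(without_doorman), len(with_doorman))
```

The work now scales with what changed rather than with the size of the repo.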

To activate it:

git config core.fsmonitor true
git config core.untrackedcache true

Done. That’s it.

The first git status after activating it takes just as long (or longer, because it has to initialize the cache). But from the second run on… magic.

In my 400+ file repo, git status went from 2-3 seconds to being instantaneous. Not “fast.” Instantaneous.
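If you want numbers for your own repo rather than mine, a small timing helper is enough. The sketch below is a rough measurement, not a benchmark suite; `time_command` is a made-up helper, and the command it times here is a harmless stand-in so the script runs anywhere.

```python
import statistics
import subprocess
import sys
import time

def time_command(cmd, runs=5):
    """Run cmd several times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in command so the sketch runs anywhere.
# Inside your repo, replace it with ["git", "status"].
cmd = [sys.executable, "-c", "pass"]

durations = time_command(cmd, runs=3)
print(f"first: {durations[0]:.3f}s, median: {statistics.median(durations):.3f}s")
```

Run it once before and once after enabling the two settings. Expect the first run after activation to be the slow one, since the cache is still being built.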

Does it work everywhere?

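  • macOS: Yes, uses FSEvents
  • Linux: Not out of the box. The built-in daemon (Git 2.37+) ships for macOS and Windows only, at least in the versions I've checked; on Linux you point core.fsmonitor at a hook such as Watchman, which uses inotify
  • Windows: Yes, uses ReadDirectoryChangesW

So: yes on macOS and Windows, and on Linux with a bit of extra setup.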

Solution 2: Save the recipe, not the photo of the dish

FSMonitor speeds up scanning, but there’s another problem: notebooks are still huge. And every time you execute a cell and save, even if the code is the same, the file changes because the outputs are different.

This means:

  • Unreadable diffs (who wants to review a base64-encoded image?)
  • Bloated commits
  • Merges from hell

The solution is called nbstripout and does exactly what its name indicates: it strips the output from notebooks before committing.

It’s like saving the recipe without the photos of the finished dish. The code is there; you regenerate the results whenever you want.
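Conceptually, stripping a notebook means emptying the outputs and execution counts of its code cells and leaving everything else alone. The sketch below shows that idea on a hand-built notebook dictionary. It is not nbstripout's actual code, which handles more cases; `strip_outputs` is a name made up for this example.

```python
import copy

def strip_outputs(notebook):
    """Return a copy of the notebook with code-cell outputs and execution counts cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped

notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Sales analysis"]},
        {
            "cell_type": "code",
            "execution_count": 12,
            "metadata": {},
            "source": ["df.describe()"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 12,
                    "metadata": {},
                    "data": {"text/plain": ["...a big table..."]},
                }
            ],
        },
    ],
}

stripped = strip_outputs(notebook)
print(stripped["cells"][1])
```

The recipe (the `source` of each cell) survives untouched, which is why two runs of the same code produce the same stripped file.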

To install it with uv:

uv add nbstripout
uv run nbstripout --install

Or with pip, if you’re old school:

pip install nbstripout
nbstripout --install

The --install automatically configures git, for this clone, to use nbstripout as a filter. From that moment on, whenever you stage a notebook, git stores it without outputs, while the file on your disk keeps them.

To verify it’s working:

git config --get filter.nbstripout.clean

If it returns a command that mentions nbstripout, you’re good.

What if I need to save the outputs?

Good question. Sometimes you want to commit an already-executed notebook, for example for documentation or so someone can see it without having to run it.

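nbstripout can leave the outputs alone for specific notebooks or cells: as far as I know, it skips anything whose metadata contains "keep_output": true.

To bypass the filter for a single git add, you can also override it for just that one command (report.ipynb is a placeholder name):

git -c filter.nbstripout.clean=cat add report.ipynb

Or you can configure exceptions in your .gitattributes. But in general, my advice is: don’t save outputs. Notebooks are code, not final documents.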
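One way to exempt a notebook without touching git at all is through its metadata. To my understanding of the nbstripout README, the tool skips cells (or whole notebooks) whose metadata contains `"keep_output": true`; treat that as an assumption and check the version you have installed. The `mark_keep_output` helper below is made up for illustration.

```python
import json

def mark_keep_output(notebook, cell_index=None):
    """Set keep_output on one cell, or on the whole notebook if cell_index is None."""
    target = notebook if cell_index is None else notebook["cells"][cell_index]
    target.setdefault("metadata", {})["keep_output"] = True
    return notebook

notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": ["plot_results()"], "outputs": []},
    ],
}

mark_keep_output(notebook, cell_index=0)  # keep the output of the first cell only
print(json.dumps(notebook["cells"][0]["metadata"]))
```

For a real file, load it with `json.load`, call the helper, and write it back with `json.dump`. Jupyter's own cell metadata editor does the same job by hand.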

Honorable mention: git-lfs

If besides notebooks you have images or videos that change frequently, there’s git-lfs (Large File Storage).

The idea is simple: instead of saving the fat binary in the repo, you save a pointer and the actual file lives on a separate server.

It’s like leaving your bags at the airport’s luggage storage and carrying only the ticket.
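The "receipt" is a tiny text file. The sketch below builds one in the layout described by the git-lfs pointer spec (a version line, a SHA-256 object id, and the size in bytes). It only illustrates the format: `lfs_pointer` is a made-up helper, the 5 MB of zero bytes is a stand-in for a real image, and git-lfs itself does this for you.

```python
import hashlib

def lfs_pointer(content: bytes) -> str:
    """Build the text of a git-lfs style pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )

image = bytes(5_000_000)  # stand-in for a 5 MB image
pointer = lfs_pointer(image)
print(pointer)
print(len(pointer), "bytes in the repo instead of", len(image))
```

This is also what a collaborator sees in place of the image if they clone without git-lfs installed.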

git lfs install
git lfs track "*.png"
git lfs track "*.mp4"

Why is it only an honorable mention and not the main solution? Because it adds complexity. You need an LFS server (GitHub and GitLab offer it, but they have limits), and if someone clones the repo without git-lfs installed, they get the pointers instead of the files.

For notebooks, FSMonitor + nbstripout are usually enough. git-lfs is for when you have real datasets or multimedia assets.

The winning combo

My configuration for any repo with notebooks:

# Speed
git config core.fsmonitor true
git config core.untrackedcache true

# Clean notebooks
uv add nbstripout
uv run nbstripout --install

Four commands and your repo goes from arthritic grandpa to Olympic athlete.

Verify everything works

# FSMonitor active
git config --get core.fsmonitor
# Should return: true

# nbstripout configured
git config --get filter.nbstripout.clean
# Should return a command that mentions nbstripout

If both return what’s expected, you’re done. Enjoy your instantaneous git status and readable diffs.
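If you set up repos often, the same checks can live in a small script. This is a sketch under the assumptions of this post: the `git_config` and `check` helpers are made up, and the only git command used is `git config --get`. It treats "true" as the expected FSMonitor value and any command mentioning nbstripout as a valid filter.

```python
import subprocess

def git_config(key):
    """Return the value of a git config key, or None if it is unset or git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "config", "--get", key],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # git is not installed or not on PATH
        return None
    return result.stdout.strip() if result.returncode == 0 else None

def check(key, value):
    """Decide whether a config value matches what this post sets up."""
    if value is None:
        return False
    if key == "filter.nbstripout.clean":
        return "nbstripout" in value
    return value == "true"

for key in ("core.fsmonitor", "core.untrackedcache", "filter.nbstripout.clean"):
    value = git_config(key)
    label = "OK     " if check(key, value) else "MISSING"
    print(f"{label} {key} = {value}")
```

Run it from inside the repo. Anything reported as MISSING points at the command you still need to run.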

Closing

The next time someone tells you “git is slow,” you know what to answer: git isn’t slow, your repo is misconfigured.

FSMonitor so the operating system does the dirty work. nbstripout so notebooks don’t weigh like they’re carrying ballast.

And if after all this it’s still slow… maybe it’s time to consider whether you really need 200 notebooks in the same repo. But that’s another story.