
Hunting Disk Hogs on Ubuntu: A Shell Script for Finding the Largest Files

Published • 7 min read

Why this script exists

If you've ever watched your free disk space quietly shrink over a few weeks of active development, you know the feeling: yesterday you had plenty of headroom, today your IDE is yelling about low disk space, and you have no idea what ate the difference. Active Node.js and Python projects are especially good at this: node_modules, build caches, .next directories, virtual environments, and compiled artifacts accumulate silently with every install and every build.

This article walks through a bash script, find_largest_files.sh, that scans a directory tree and writes the largest files to a timestamped text report. It's designed to be a first diagnostic tool when you're trying to answer the question "where did all my disk space go?"

The script at a glance

The full script is in find_largest_files.sh. Here's what it does, step by step:

  1. Takes three optional arguments: search directory, number of results, and output filename.

  2. Validates that the search directory exists and that the count is a positive integer.

  3. Writes a header to the output file with timestamp, host, user, and scan parameters.

  4. Uses find to list every regular file under the target directory, along with its size in bytes.

  5. Sorts the list numerically by size (largest first), takes the top N, and converts byte counts into human-readable units (K/M/G/T).

  6. Appends the formatted list to the report and prints it to your terminal.
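Those six steps boil down to one pipeline. Here's a condensed sketch of the core logic (my reconstruction for illustration; the real script adds argument validation, the prune list, and the report header):

```shell
#!/usr/bin/env bash
# Condensed sketch of the core of find_largest_files.sh:
# list files with byte sizes, sort largest first, take the top N,
# then convert bytes to human-readable units.
largest_files() {
  local dir="${1:-.}" top="${2:-20}"
  find "$dir" -type f -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn \
    | head -n "$top" \
    | awk -F'\t' '{
        size = $1
        split("B K M G T", u, " ")
        i = 1
        while (size >= 1024 && i < 5) { size /= 1024; i++ }
        printf "%8.1f%s  %s\n", size, u[i], $2
      }'
}

# Usage: largest_files ~/some_project 30 >> report.txt
```

The function prints size and path in two columns; redirecting its output with `>>` reproduces the append-to-report behavior described in step 6.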

Key design decisions

Using find -printf instead of ls or du

find "$SEARCH_DIR" -type f -printf '%s\t%p\n'

find -printf outputs size (%s) in bytes and the full path (%p), tab-separated. This matters for three reasons: byte-level precision means sorting stays accurate; tab separation survives filenames with spaces; and restricting to -type f means we report on actual files, not directory aggregates the way du would.
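The tab separation is easy to verify with an awkward filename (a throwaway temporary directory, used purely for illustration):

```shell
# A filename containing spaces still parses cleanly: the size is
# everything before the first tab, the path is everything after it.
tmp=$(mktemp -d)
printf 'hello' > "$tmp/file with spaces.txt"
find "$tmp" -type f -printf '%s\t%p\n'   # 5<TAB>.../file with spaces.txt
rm -rf "$tmp"
```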

Pruning pseudo-filesystems

\( -path /proc -o -path /sys -o -path /dev -o -path /run -o -path /snap \) -prune

/proc, /sys, /dev, and /run are kernel-provided virtual filesystems. They contain "files" whose reported sizes are often meaningless (/proc/kcore, for example, can appear to be 128 TB). /snap is pruned because snap mount points produce duplicate entries. Skipping all five keeps the report focused on real files on your actual disk.
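In context, the prune expression sits in front of the file test, and the -o after -prune is what lets everything else fall through to printing. A reconstruction of how the pieces fit together, run against / and trimmed to the first few results:

```shell
# Pruned directories are skipped wholesale; every other path falls
# through to the -type f branch and is printed as size<TAB>path.
find / \
  \( -path /proc -o -path /sys -o -path /dev -o -path /run -o -path /snap \) -prune \
  -o -type f -printf '%s\t%p\n' 2>/dev/null \
  | head -n 5
```

Without the trailing `-o`, the expression would prune and then print nothing at all, which is a classic find footgun.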

Silencing permission errors

2>/dev/null

On a full / scan, find will hit directories your user can't read and print "Permission denied" for every one of them - noise that can bury the real output. Redirecting stderr to /dev/null cleans that up. Run the script with sudo if you want complete coverage.
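If you'd rather know how much the scan skipped than discard the errors entirely, a small variation (not in the script) counts the permission failures while still keeping them out of the report:

```shell
# 2>&1 >/dev/null swaps the streams: stderr goes into the pipe and
# stdout is discarded, so grep counts only the error messages.
skipped=$(find /root -type f 2>&1 >/dev/null | grep -c 'Permission denied' || true)
echo "suppressed $skipped permission errors"
```

A count of zero either means full access or that you ran it as root.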

Human-readable sizes in awk, not find

We sort the raw byte counts first, then format them in awk. If we formatted early (e.g. via find ... | sort -h), we'd either give up precision or depend on how different sort implementations parse unit suffixes. Keeping bytes for sorting and converting afterwards is simpler and more portable.
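A two-line demonstration of why the order matters: two files that format identically still sort correctly when the raw byte counts are compared (sizes below are hypothetical):

```shell
# 1 GiB + 1 byte vs exactly 1 GiB: both display as "1.0G", but sorting
# on the byte counts still puts the genuinely larger file first.
printf '1073741825\t/big/a\n1073741824\t/big/b\n' \
  | sort -rn \
  | awk -F'\t' '{ printf "%.1fG\t%s\n", $1 / 1024^3, $2 }'
```

Had we formatted first, sort -h would have seen two identical "1.0G" keys and had nothing meaningful to compare.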

Running it against your scenario

You mentioned roughly 30 GB of disk disappeared recently while you've been building mortgage_system and mortgage_frontend with Claude. Those kinds of projects are classic sources of silent disk bloat. Here's how I'd approach the investigation.

Step 1: Get the big picture

Start at root to confirm whether the missing space is actually inside your project folders, or somewhere else entirely (logs, Docker, snap revisions, trash, etc.).

sudo ./find_largest_files.sh / 50 full_scan.txt

This gives you the 50 biggest files system-wide. If most of them are under /home/you/mortgage_system/... or /home/you/mortgage_frontend/..., your instinct was right. If the top entries are elsewhere (/var/lib/docker, /var/log, ~/.cache, snap backups), the real culprit is somewhere you weren't looking.

Step 2: Focus on the suspects

Once you've confirmed the project folders are the problem, narrow the scan:

./find_largest_files.sh ~/mortgage_system 30 mortgage_system_report.txt
./find_largest_files.sh ~/mortgage_frontend 30 mortgage_frontend_report.txt

Step 3: Check directory-level size too

Individual largest files tell one story; directory totals tell another. A million tiny files in node_modules won't show up in a largest-files report, but they'll still eat gigabytes. Pair the script with du:

du -h --max-depth=1 ~/mortgage_system | sort -rh | head -20
du -h --max-depth=1 ~/mortgage_frontend | sort -rh | head -20

Common culprits in active dev folders

Based on the stack you're likely using, here are the usual suspects, roughly in order of how often they're the answer:

  • node_modules/ - routinely 500 MB to 2 GB per project, and a heavy Next.js/React dependency tree can push a single project to 3–5 GB.

  • .next/ or dist/ or build/ - production builds and incremental build caches. Next.js's .next/cache in particular can grow to several GB over weeks of npm run dev.

  • .git/ - if you've committed large binaries or have a long history, .git/objects can be surprisingly fat. git gc --aggressive helps.

  • Python __pycache__/ and .venv/ - virtual environments with ML/data dependencies (torch, tensorflow, pandas) are often 3–8 GB each.

  • Docker layers - /var/lib/docker is the single most common "where did my disk go?" answer on dev machines. docker system df shows the breakdown; docker system prune -a reclaims it.

  • Log files - /var/log/journal/, application logs, and PM2 logs can grow indefinitely if no rotation is configured.

  • Snap revisions - Ubuntu keeps old snap versions by default. sudo snap set system refresh.retain=2 caps retention at two revisions.

  • Trash - ~/.local/share/Trash/ is easy to forget.

  • Browser caches - ~/.cache/google-chrome, ~/.cache/mozilla, and similar can hit several GB.
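You can sweep several of these locations in one pass (the paths below are common defaults, not guaranteed on every install; run with sudo to cover the system directories):

```shell
# Print the total size of each common disk-bloat location that exists.
# Missing paths are skipped; unreadable ones just report what they can.
for p in ~/.cache ~/.local/share/Trash /var/log /var/lib/docker /var/lib/snapd; do
  if [ -d "$p" ]; then
    du -sh "$p" 2>/dev/null || true
  fi
done
```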

For Node projects specifically, a quick sanity check:

du -sh ~/mortgage_system/node_modules ~/mortgage_frontend/node_modules 2>/dev/null

If each is over a gigabyte, the picture starts to fill in: between the two projects, their build caches, and a .git folder or two, a sizeable chunk of that ~30 GB is already accounted for.

Useful companion commands

# Overall disk usage at a glance
df -h

# Top-level directories sorted by size (run from /)
sudo du -h --max-depth=1 / 2>/dev/null | sort -rh | head -20

# What's eating your home directory
du -h --max-depth=1 ~ | sort -rh | head -20

# Docker-specific
docker system df
docker system prune -a --volumes  # aggressive, frees everything unused

# Clean npm cache
npm cache clean --force

# Clean pip cache
pip cache purge

# Clear systemd journal older than 7 days
sudo journalctl --vacuum-time=7d

Going further: ncdu for interactive exploration

The script gives you a static report, which is great for archiving and diffing over time. For interactive drilling, install ncdu:

sudo apt install ncdu
ncdu ~

It gives you a terminal UI where you can navigate directories by size, delete things on the spot, and generally understand disk usage faster than any CLI combination. It's the tool I reach for once the script points me to the neighbourhood and I need to find the exact house.

Setting up ongoing monitoring

If you want to catch disk bloat as it happens instead of after the fact, schedule the script via cron and diff the reports:

# Edit your crontab
crontab -e

# Add: run every Sunday at 2 AM, save to a reports directory
0 2 * * 0 /home/you/find_largest_files.sh / 50 /home/you/disk_reports/scan_$(date +\%Y\%m\%d).txt

A week later, diff two reports to see what grew.
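Concretely, new entries show up as > lines in the diff. With two example reports (the filenames here are stand-ins for whatever your cron job produced):

```shell
# Fabricated example reports standing in for two weekly cron outputs.
printf '4.2G\t/home/you/proj/node_modules/big.bin\n1.1G\t/var/log/syslog.1\n' > scan_week1.txt
printf '4.2G\t/home/you/proj/node_modules/big.bin\n3.0G\t/home/you/proj/.next/cache/pack.bin\n1.1G\t/var/log/syslog.1\n' > scan_week2.txt

# '>' lines exist only in the newer report: files that appeared, or grew
# enough to enter the top N, since the previous scan.
diff scan_week1.txt scan_week2.txt | grep '^>'
rm -f scan_week1.txt scan_week2.txt
```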


Summary

If the script points at node_modules and .next inside your two mortgage projects, that's consistent with the 30 GB of lost disk: two mature JS/TS codebases with their build artifacts can plausibly reach that range. The usual remediation is rm -rf node_modules .next in each project, followed by a fresh npm install only in the one you're actively working on. If instead the biggest files live under /var/lib/docker or /var/log, the fix is completely different, which is exactly why running a scan first beats guessing.