Regex Link Extractor

A lightweight Chrome extension for extracting links, images, and videos from any page using regex filters. Built for manual browsing and bulk collection — with a handy side effect of feeding clean data into LLM workflows. This article walks through the why, the build, and what comes next.

Chrome Extension | 1.0.0 | Regex | JavaScript | LLM | JSON

Introduction

Sometimes you’re browsing a page and you want everything on it — every PDF link, every image from a specific domain, every video URL — without opening DevTools or writing a scraper. That’s what this extension is for.

It started as a personal tool for manual browsing and bulk collection. Type a regex, hit extract, get a clean filtered list of links, images, or videos you can download or save.

The LLM angle came later, almost by accident. The JSON export format turned out to be exactly what you’d want to paste into a model prompt — structured, typed, with page context included. That’s a nice bonus, but it’s not why the extension exists.

Why build the extension

The idea came from a simple frustration. When I’m browsing I often want to bulk-collect something from a page — all the download links, all the images from a specific source, all the external URLs — and there’s no good lightweight tool for it. DevTools works but it’s overkill and verbose. Writing a one-off script works but takes time.

What I wanted was a popup: type a pattern, see matching assets, grab what I need. No server, no authentication, no data leaving the browser.

Building it as a Chrome extension was the obvious fit. It runs entirely client-side, has direct access to the page DOM, and stays out of the way until you need it.

Initial Design

The extension has three files at its core: manifest.json, popup.html, and popup.js.

Manifest (MV3)

The extension uses Manifest V3, Chrome’s current extension standard. The only permissions required are activeTab — which grants temporary access to the current tab when the user opens the popup — and scripting, which allows injecting the extraction function into the page context. No broad host permissions, no background data collection.
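As a rough illustration, the manifest described above could contain something like the following. It is written here as a JavaScript object and serialized, so the output is what would sit in manifest.json. The two permissions, the version, and the popup file name come from this article; the exact name string and the use of action.default_popup to wire up the popup are assumptions, not a copy of the real file.

```javascript
// Hypothetical manifest contents, sketched from the description in this article.
const manifest = {
  manifest_version: 3,
  name: "Regex Link Extractor", // assumed display name
  version: "1.0.0",
  // activeTab: temporary access to the current tab once the user opens the popup.
  // scripting: needed to inject the extraction function into the page.
  permissions: ["activeTab", "scripting"],
  // Assumed wiring: the toolbar icon opens popup.html.
  action: { default_popup: "popup.html" },
};

// Serialize to the JSON text that would be saved as manifest.json.
const manifestJson = JSON.stringify(manifest, null, 2);
console.log(manifestJson);
```

Note what is absent: no host_permissions entry and no background service worker, which matches the "no broad host permissions, no background data collection" claim.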

Extraction logic

The core of the extension is a single function, extractPageAssets, that gets injected into the active tab via chrome.scripting.executeScript. It queries the DOM for three asset types:

  • <a href>: links, with their visible text trimmed to 50 characters
  • <img src>: images, with alt text carried through
  • <video> and child <source> elements: since most modern players put the real URL on a <source> tag rather than the <video> element itself

All three are filtered by the user’s regex pattern, deduplicated by URL, and returned as a structured payload with the page title, source URL, and a timestamp.
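The flow described in this section can be sketched roughly as follows. This is not the extension's actual source: the payload key names (title, sourceUrl, extractedAt, items), the runExtraction helper, and the doc / pageUrl / api parameters are assumptions added so the sketch can be exercised outside a browser. Deduplication here uses a single set across all three asset types, which is one reading of "deduplicated by URL".

```javascript
// Runs inside the page. `doc` and `pageUrl` default to the live page (assumed parameters).
function extractPageAssets(pattern, doc = document, pageUrl = location.href) {
  const regex = new RegExp(pattern); // throws SyntaxError on an invalid pattern
  const seen = new Set();
  const items = [];

  // Keep an item only if its URL is non-empty, new, and matches the pattern.
  const add = (url, type, extra) => {
    if (!url || seen.has(url) || !regex.test(url)) return;
    seen.add(url);
    items.push({ url, type, ...extra });
  };

  for (const a of doc.querySelectorAll("a[href]")) {
    add(a.href, "link", { text: (a.textContent || "").trim().slice(0, 50) });
  }
  for (const img of doc.querySelectorAll("img[src]")) {
    add(img.src, "image", { alt: img.alt || "" });
  }
  for (const video of doc.querySelectorAll("video")) {
    const poster = video.poster || "";
    add(video.src, "video", { poster });
    // Many players put the real URL on child <source> elements.
    for (const source of video.querySelectorAll("source[src]")) {
      add(source.src, "video", { poster });
    }
  }

  return {
    title: doc.title,
    sourceUrl: pageUrl,
    extractedAt: new Date().toISOString(),
    items,
  };
}

// Runs in the popup. `api` defaults to the global chrome object (assumed parameter).
async function runExtraction(pattern, api = chrome) {
  const [tab] = await api.tabs.query({ active: true, currentWindow: true });
  const [injection] = await api.scripting.executeScript({
    target: { tabId: tab.id },
    func: extractPageAssets,
    args: [pattern],
  });
  return injection.result;
}
```

Because the injected function is serialized and run in the page, it cannot rely on variables from the popup; everything it needs has to arrive through args. An invalid pattern makes the injected call throw, so a popup would probably want to try new RegExp(pattern) itself first and show the error before injecting anything.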

Why regex?

Regex gives you precision without a UI for every use case. Want all YouTube links? youtube\.com. Want only PDFs? \.pdf. Want links from a specific subdomain? You write the pattern. (The backslash matters: an unescaped . matches any character, so youtube.com still finds YouTube links but would also match a URL containing youtube-com.) It keeps the tool small and hands control to the user, which also makes it a natural pairing with an LLM, since you can ask the model to write the pattern for you if you’re unsure.

If you’ve never written a regex before, regex101.com is the best place to start — paste a pattern and some test URLs and it explains exactly what each part does. Make sure the flavor is set to JavaScript, which is what the extension uses.
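To make the pattern behaviour concrete, here is a small sketch that runs a few patterns against a handful of made-up URLs using the same JavaScript regex engine the extension relies on. The URLs and the matching helper are illustrative only.

```javascript
// Made-up URLs for illustration.
const urls = [
  "https://www.youtube.com/watch?v=abc",
  "https://example.com/files/report.pdf",
  "https://example.com/files/report.pdf?download=1",
  "https://example.com/pdf-guide.html",
  "https://docs.example.com/api",
];

// Return the URLs that a pattern matches anywhere in the string.
const matching = (pattern) => urls.filter((u) => new RegExp(pattern).test(u));

const youtube = matching("youtube\\.com");     // the YouTube URL only
const pdfLoose = matching("\\.pdf");           // both report.pdf URLs, not pdf-guide.html
const pdfStrict = matching("\\.pdf$");         // only the URL that ends in .pdf
const docsHost = matching("^https://docs\\."); // only the docs subdomain

console.log({ youtube, pdfLoose, pdfStrict, docsHost });
```

The doubled backslashes are only needed because the patterns are written as JavaScript string literals here; typed into an input field, the pattern is simply \.pdf.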

Export formats

Results can be exported as JSON (structured, with full metadata) or Markdown (a clean linked list, useful for dropping straight into a document or LLM prompt). Both are generated entirely in the browser using a blob URL — no server involved, no upload.
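A minimal sketch of how both exports could be produced and downloaded client-side. The function names, the payload keys (title, sourceUrl, items), and the exact Markdown layout are assumptions for illustration; only the blob URL approach and the two formats come from the description above.

```javascript
// JSON export: the payload as-is, pretty-printed.
function toJson(payload) {
  return JSON.stringify(payload, null, 2);
}

// Markdown export: a heading, the source page, then one linked bullet per item.
function toMarkdown(payload) {
  const lines = [`# ${payload.title}`, "", `Source: ${payload.sourceUrl}`, ""];
  for (const item of payload.items) {
    // Fall back to the URL when there is no link text or alt text.
    const label = item.text || item.alt || item.url;
    lines.push(`- [${label}](${item.url})`);
  }
  return lines.join("\n");
}

// Browser-only: hand the text to the user as a file via a temporary blob URL.
function downloadText(text, filename, mimeType) {
  const blob = new Blob([text], { type: mimeType });
  const blobUrl = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = blobUrl;
  link.download = filename;
  link.click();
  URL.revokeObjectURL(blobUrl); // release the blob once the download has been triggered
}
```

In the popup this might be called as downloadText(toJson(payload), "links.json", "application/json") or downloadText(toMarkdown(payload), "links.md", "text/markdown"); the file names are placeholders.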

Usage Instructions

For everyday users

  1. Install the extension from the Chrome Web Store
  2. Navigate to any page you want to extract from
  3. Click the extension icon in your toolbar
  4. Type a pattern into the regex field: for example, youtube\.com to find YouTube links, or leave it broad with . to match everything
  5. Click Find Links
  6. Check or uncheck individual results, use Select All / Deselect All as needed
  7. Choose your export format (JSON or Markdown) and which asset types to include
  8. Click Download

For developers using it as an LLM feed

The JSON export is designed to be pasted directly into a model prompt. It includes the source URL, page title, timestamp, and a typed array of items — each with a url, type, and either text (for links) or alt / poster (for images and videos). A prompt like:

Here is a JSON list of links extracted from a documentation page. 
Identify which ones are likely API reference pages versus guides.

…works well out of the box with the exported format. The Markdown export is better suited for summarisation tasks or when you want the model to reason about link text rather than raw URLs.
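For a sense of what that looks like in practice, here is a hypothetical export and a helper that glues it to the prompt above. The item fields (url, type, text, alt) follow the description in this section; the top-level key names, the sample values, and buildPrompt itself are made up for illustration.

```javascript
// Hypothetical export; real key names in the extension's output may differ.
const exported = {
  title: "Example API Docs",
  sourceUrl: "https://docs.example.com/",
  extractedAt: "2024-01-01T00:00:00.000Z",
  items: [
    { url: "https://docs.example.com/api/users", type: "link", text: "Users API" },
    { url: "https://docs.example.com/guides/auth", type: "link", text: "Authentication guide" },
    { url: "https://docs.example.com/img/flow.png", type: "image", alt: "Auth flow diagram" },
  ],
};

// Put the instruction first, then the JSON, separated by a blank line.
function buildPrompt(instruction, payload) {
  return `${instruction}\n\n${JSON.stringify(payload, null, 2)}`;
}

const prompt = buildPrompt(
  "Here is a JSON list of links extracted from a documentation page. " +
    "Identify which ones are likely API reference pages versus guides.",
  exported
);
console.log(prompt);
```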

Future plans

A few things on the roadmap:

  • Right-click context menu — right-clicking a link or image on any page will pre-fill the regex field with a pattern derived from that URL, so you can immediately find similar assets
  • Regex history — remember the last few patterns used, selectable from a dropdown
  • CSV export — a third format option for tabular workflows and spreadsheet tools
  • Pattern library — a small set of one-click presets for common cases (PDFs, images, YouTube, external links)
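None of this is built yet, but the context-menu idea needs some way to turn a clicked URL into a pattern. One possible derivation, purely as a sketch, is to escape the URL's hostname so it matches literally. Both function names are hypothetical, and the real feature may derive patterns differently (by file extension or path prefix, say).

```javascript
// Escape characters that have special meaning in a JavaScript regex.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: derive a "same host" pattern from a clicked URL.
function patternFromUrl(url) {
  return escapeRegex(new URL(url).hostname);
}

const derived = patternFromUrl("https://cdn.example.com/img/a.png");
console.log(derived); // cdn\.example\.com
```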

Closing

The extension is small by design. It does one thing — gets assets off a page fast — and stays out of your way otherwise. If you regularly find yourself manually hunting down links or bulk-saving resources while browsing, it’s worth a try. And if you happen to use an LLM for content work, the export format will slot straight in.

The full source is available on GitHub. If you run into a bug or have a feature request, open an issue. I read them.