Domain skill
wayback-machine
Markdown synced from browser-harness domain skills.
- Host: wayback-machine
- Files: 1
Agent prompt
Use this skill
Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.
Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are. Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `wayback-machine` domain skill from `agent-workspace/domain-skills/wayback-machine/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/wayback-machine/scraping.md

Use those domain-skill notes to complete my task for `wayback-machine` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.
Skill contents
What the agent will read
CDX API & Snapshot Retrieval
scraping.md
https://web.archive.org — all public data, no auth or API key required. Everything here is pure http_get — no browser needed.
NOTE: A comprehensive Internet Archive skill (covering CDX, item metadata, and search) already exists at domain-skills/archive-org/scraping.md. This file is a focused, CDX-first quick-reference for Wayback Machine snapshot work specifically.
Start here: CDX API
The CDX (Capture/Crawl Index) API is the single fastest way to query the Wayback Machine. It returns structured JSON and supports filtering, collapsing, pagination, and nearest-date lookups.
import json
# Find all snapshots of a URL — the minimal starting query
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&output=json&limit=10"
    "&fl=timestamp,original,statuscode,mimetype,length",
    timeout=40.0 # CDX is slow; never use less than 40s
)
rows = json.loads(r)
# rows[0] is ALWAYS the header row; slice rows[1:] for data
for ts, orig, status, mime, length in rows[1:]:
    print(f"https://web.archive.org/web/{ts}/{orig} [{status}]")
All CDX values are strings, even numeric ones (status='200', length='4821'). Cast explicitly with int() / float().
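The casting advice can be sketched as a small helper that pairs each data row with the header row and converts the numeric columns. The function name `parse_cdx_rows` and the choice to map non-numeric placeholders (such as `-`) to `None` are assumptions made for this illustration, not part of the CDX API.

```python
def parse_cdx_rows(rows, int_fields=("statuscode", "length")):
    """Turn CDX JSON output (header row + data rows) into typed dicts."""
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    parsed = []
    for row in data:
        record = dict(zip(header, row))
        for name in int_fields:
            value = record.get(name)
            # Assumption: anything that is not purely digits becomes None.
            if value is not None:
                record[name] = int(value) if value.isdigit() else None
        parsed.append(record)
    return parsed


# Hand-written sample in the shape returned by output=json
sample = [
    ["timestamp", "original", "statuscode", "length"],
    ["20230601114925", "https://example.com/", "200", "4821"],
    ["20230702093011", "https://example.com/", "-", "512"],
]
records = parse_cdx_rows(sample)
print(records[0]["statuscode"] + 1)  # arithmetic works once the value is an int
```

Keying by the header row means the helper keeps working when the `fl=` list changes.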
Core CDX patterns
Nearest snapshot to a target date
import json
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&output=json&limit=1"
    "&fl=timestamp,original,statuscode"
    "&closest=20230601120000&sort=closest",
    timeout=60.0
)
rows = json.loads(r)
ts, orig, status = rows[1] # rows[0] is header
snap_url = f"https://web.archive.org/web/{ts}/{orig}"
# Timestamp format: 14-digit YYYYMMDDHHMMSS
# Prefix shorthand: '20230601' (day), '202306' (month), '2023' (year)
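For working with these timestamps client-side, a minimal sketch using only the standard library is shown here. The helper names are hypothetical, and padding a prefix to the earliest matching instant is a convention chosen for this example, not necessarily how the server expands a prefix.

```python
from datetime import datetime


def parse_wayback_timestamp(ts):
    """Parse a full 14-digit YYYYMMDDHHMMSS timestamp into a naive datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def expand_prefix(prefix):
    """Pad a year/month/day prefix to 14 digits (earliest matching instant)."""
    template = "00000101000000"  # month and day default to 01, time to 000000
    return prefix + template[len(prefix):]


print(parse_wayback_timestamp("20230601114925").isoformat())
print(expand_prefix("202306"))
```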
One snapshot per month (collapsed)
import json
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&output=json"
    "&collapse=timestamp:6" # :6 = one per YYYYMM
    "&from=20220101&to=20230101"
    "&fl=timestamp,original,statuscode",
    timeout=60.0
)
rows = json.loads(r)
for ts, orig, status in rows[1:]:
    print(f"{ts[:4]}-{ts[4:6]} https://web.archive.org/web/{ts}/{orig}")
# collapse=timestamp:N — collapse by first N timestamp digits:
# :4 = one per year
# :6 = one per month (most common)
# :8 = one per day
# :10 = one per hour
# Keeps the FIRST capture of each period — not the last.
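When the last capture of each period is needed instead, one option is to fetch the rows without `collapse` and group them client-side. The sketch below assumes the timestamp is the first column and that rows arrive oldest first; `last_capture_per_period` is a hypothetical helper name.

```python
def last_capture_per_period(data_rows, digits=6):
    """Keep the last row per timestamp prefix; expects rows sorted oldest first."""
    latest = {}
    for row in data_rows:
        # Later rows overwrite earlier ones that share the same prefix.
        latest[row[0][:digits]] = row
    return list(latest.values())


# Hand-written data rows (header already sliced off)
sample_rows = [
    ["20220103120000", "https://example.com/", "200"],
    ["20220128080000", "https://example.com/", "200"],
    ["20220214090000", "https://example.com/", "301"],
]
monthly = last_capture_per_period(sample_rows, digits=6)
yearly = last_capture_per_period(sample_rows, digits=4)
print(monthly)
```

This trades a larger download for control over which capture is kept, so it is best paired with `from=`/`to=` bounds or `limit=`.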
All pages under a domain or path
import json
# matchType=prefix — all URLs starting with the given path
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com/blog/&matchType=prefix&output=json"
    "&limit=20&fl=timestamp,original,statuscode"
    "&filter=statuscode:200", # only successful captures
    timeout=60.0
)
rows = json.loads(r)
for row in rows[1:]:
    print(row)
# matchType options:
#   exact (default): this URL only
#   prefix: URL + all subpaths
#   host: every URL on this exact host (no subdomains)
#   domain: host + all subdomains (broadest)
Filter by status code or MIME type
import json
# Only successful HTML captures — combine multiple filters
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&output=json"
    "&filter=statuscode:200"
    "&filter=mimetype:text/html"
    "&fl=timestamp,original,length"
    "&limit=10",
    timeout=40.0
)
rows = json.loads(r)
# filter= uses regex. Examples:
# &filter=statuscode:200 exact match
# &filter=!statuscode:200 negation (all non-200)
# &filter=statuscode:[23].. 2xx and 3xx only
# &filter=mimetype:text/html HTML only
# &filter=original:.*\.pdf URLs ending in .pdf (write "\\." in a non-raw Python string)
# Multiple &filter= params are ANDed together.
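Hand-concatenating query strings gets error-prone once filters contain regex characters. A minimal sketch of a URL builder follows, using `urllib.parse.urlencode` with a list of pairs so that `filter` can repeat. `build_cdx_url` and `CDX_ENDPOINT` are names invented for this example, and it assumes the server decodes percent-encoded parameter values in the usual way.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, filters=(), **params):
    """Assemble a CDX query URL; each entry in `filters` becomes its own filter= pair."""
    query = [("url", url), ("output", "json")]
    query += [(name, str(value)) for name, value in params.items()]
    query += [("filter", f) for f in filters]
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


cdx_url = build_cdx_url(
    "example.com",
    filters=["statuscode:200", "mimetype:text/html"],
    fl="timestamp,original,length",
    limit=10,
)
print(cdx_url)
# Round-trip the query string to confirm both filters survived encoding
print(parse_qsl(urlsplit(cdx_url).query))
```

Because `from` is a Python keyword, a date range would have to be passed as `**{"from": "20220101", "to": "20230101"}`.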
CDX field reference
| Field | Description | Example value |
|---|---|---|
| urlkey | SURT-format URL (reversed domain) | com,example)/ |
| timestamp | Capture time, 14-digit YYYYMMDDHHMMSS | 20230601114925 |
| original | Original crawled URL (includes port if non-standard) | https://example.com/ |
| mimetype | Content-Type at crawl time | text/html |
| statuscode | HTTP status at crawl time (string) | 200 |
| digest | SHA-1 of body, base32-encoded | I4YBMQ6PHPWE2TD6TIXNWHZB6MXRNTSR |
| length | Content-length in bytes (string) | 4821 |
Default fl= when omitted: all 7 fields above in that order.
Availability API (DO NOT USE as primary)
import json
# WARNING: This API is BROKEN — returns empty archived_snapshots
# for URLs that ARE in the archive. Confirmed broken 2026-04-18.
# Use CDX with ?sort=closest&limit=1 instead (see above).
# Left here for reference only — do not rely on it:
r = http_get(
    "https://archive.org/wayback/available"
    "?url=example.com&timestamp=20240101",
    timeout=20.0
)
data = json.loads(r)
# Returns: {"url": "example.com", "archived_snapshots": {}}
# Even for well-archived URLs. Do not trust empty results.
# CORRECT replacement:
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&output=json&limit=1"
    "&fl=timestamp,original,statuscode"
    "&closest=20240101000000&sort=closest",
    timeout=60.0
)
rows = json.loads(r)
if len(rows) > 1:
    ts, orig, status = rows[1]
    snap_url = f"https://web.archive.org/web/{ts}/{orig}"
Paginate large result sets
import json
from urllib.parse import quote
def cdx_all_snapshots(url, fl="timestamp,original,statuscode", page_size=500):
    """Yield all CDX rows for a URL, page by page."""
    base = (
        "https://web.archive.org/cdx/search/cdx"
        f"?url={quote(url, safe='')}&output=json"
        f"&fl={fl}&limit={page_size}&showResumeKey=true"
    )
    resume_key = None
    while True:
        endpoint = base if resume_key is None else f"{base}&resumeKey={quote(resume_key)}"
        rows = json.loads(http_get(endpoint, timeout=60.0))
        # With showResumeKey=true, last two rows are [] and ['<key>']
        has_resume = len(rows) >= 2 and rows[-2] == [] and rows[-1] != []
        data_rows = rows[1:-2] if has_resume else rows[1:]
        for row in data_rows:
            yield row
        if not has_resume:
            break
        resume_key = rows[-1][0]

for ts, orig, status in cdx_all_snapshots("example.com"):
    snap_url = f"https://web.archive.org/web/{ts}/{orig}"
    # process...
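The sentinel-row handling is the part most worth testing offline. Below is a sketch that isolates it as a pure function so it can be checked against hand-written pages without touching the network. `split_resume_key` is a hypothetical name and the sample resume key is made up.

```python
def split_resume_key(rows):
    """Return (data_rows, resume_key); resume_key is None on the final page."""
    has_resume = len(rows) >= 2 and rows[-2] == [] and rows[-1] != []
    if has_resume:
        return rows[1:-2], rows[-1][0]
    return rows[1:], None


header = ["timestamp", "original", "statuscode"]
capture = ["20230601114925", "https://example.com/", "200"]

# A page that has more results: data, then [] and ['<key>']
page_with_key = [header, capture, [], ["made-up-resume-key"]]
# The final page: header and data only
final_page = [header, capture]

print(split_resume_key(page_with_key))
print(split_resume_key(final_page))
```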
Retrieve the archived page
# Direct snapshot URL: /web/{14-digit-timestamp}/{original-url}
snap_url = "https://web.archive.org/web/20230601114925/https://example.com/"
content = http_get(snap_url, timeout=30.0)
# Returns archived HTML with Wayback toolbar injected inside:
# <!-- BEGIN WAYBACK TOOLBAR INSERT --> ... <!-- END WAYBACK TOOLBAR INSERT -->
# Strip those comments + their contents if you want the original HTML.
# A timestamp followed by * is NOT a shortcut for the latest snapshot:
calendar_url = "https://web.archive.org/web/20240101000000*/example.com"
# The * suffix returns a calendar page (HTML), not the archived page itself.
# Use CDX to find the real timestamp, then fetch the direct URL.
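Stripping the toolbar can be sketched with a single regular expression over the two marker comments. This removes only the block between those markers; any other changes Wayback makes to the page are left untouched. The function name and the sample HTML are made up for illustration.

```python
import re

# Non-greedy match across newlines, from the BEGIN marker to the END marker
TOOLBAR_RE = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.DOTALL,
)


def strip_wayback_toolbar(html):
    """Remove the injected toolbar block, markers included."""
    return TOOLBAR_RE.sub("", html)


sample_html = (
    "<html><body>"
    "<!-- BEGIN WAYBACK TOOLBAR INSERT -->\n<div>toolbar</div>\n"
    "<!-- END WAYBACK TOOLBAR INSERT -->"
    "<h1>Example Domain</h1></body></html>"
)
cleaned = strip_wayback_toolbar(sample_html)
print(cleaned)
```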
Advanced CDX: deduplicate by content digest
import json
# Find only snapshots where the content CHANGED — dedup by SHA-1 digest
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&output=json"
    "&collapse=digest" # merges adjacent captures that share a body hash
    "&fl=timestamp,original,digest,length"
    "&filter=statuscode:200",
    timeout=60.0
)
rows = json.loads(r)
# rows[1:] holds one row per run of identical content; collapse only merges
# adjacent rows, so a digest can reappear if the page later reverts
# Useful for detecting when a page actually changed, vs. being re-crawled identically
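The CDX server documentation describes `collapse` as merging adjacent rows only, so a page that reverts to an earlier version can list the same digest more than once. A minimal client-side sketch that keeps just the first capture of each digest follows. It assumes the `fl=timestamp,original,digest,length` column order used above, and `first_seen_by_digest` is a hypothetical helper name.

```python
def first_seen_by_digest(data_rows, digest_index=2):
    """Keep the first row for each digest, preserving the original order."""
    seen = set()
    unique = []
    for row in data_rows:
        digest = row[digest_index]
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


# Hand-written rows with shortened, made-up digests: the page changes, then reverts
sample_rows = [
    ["20210101000000", "https://example.com/", "AAAA", "100"],
    ["20220101000000", "https://example.com/", "BBBB", "120"],
    ["20230101000000", "https://example.com/", "AAAA", "100"],
]
versions = first_seen_by_digest(sample_rows)
print(len(versions))
```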
CDX summary/count query
import json
# showNumPages=true returns total page count, not records
# Use for estimating result size before a full fetch
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=example.com&matchType=prefix"
    "&showNumPages=true",
    timeout=30.0
)
page_count = int(r.strip()) # returns plain integer, not JSON
# 1 page ~ 150,000 records by default
# Combine with &page=N for manual page-based pagination:
# ?url=...&output=json&page=0, ?url=...&output=json&page=1, etc.
Rate limits & timeouts
| API | Typical latency | Safe timeout | Notes |
|---|---|---|---|
| CDX search | 5–40s | 60s | Intermittently slow; retry on timeout |
| Snapshot fetch (/web/) | 2–10s | 30s | Reliable |
| Metadata (/metadata/) | <1s | 20s | Fast, stable |
| Advanced search | <1s | 20s | Fast, stable |
No API key required. No documented rate limit. Be respectful: add time.sleep(1) between CDX calls in loops. 3 rapid sequential CDX calls (~10s) complete fine; 10+ rapid calls produce timeouts.
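The retry-on-timeout and pacing advice can be combined in a small wrapper. The sketch below takes the fetch function as an argument so it can be exercised with a stand-in. It assumes the fetcher raises an exception on timeout, which may not match how `http_get` reports failures, so check `helpers.py` before relying on it. `fetch_with_retry` and `flaky_fetch` are hypothetical names.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0, sleep=time.sleep, **kwargs):
    """Call fetch(url, **kwargs), pausing a little longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url, **kwargs)
        except Exception as exc:  # Assumption: the fetcher raises on timeout
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay * (attempt + 1))
    raise last_error


# Stand-in fetcher that fails twice, then returns an empty JSON array
calls = []

def flaky_fetch(url, timeout=60.0):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated CDX timeout")
    return "[]"

pauses = []  # record the pauses instead of actually sleeping
body = fetch_with_retry(
    flaky_fetch,
    "https://web.archive.org/cdx/search/cdx?url=example.com&output=json",
    sleep=pauses.append,
    timeout=60.0,
)
print(body, pauses)
```

In the harness, `http_get` would be passed in place of `flaky_fetch` and the default `time.sleep` left in place.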
Gotchas
- CDX is slow: always set `timeout=60.0` for CDX calls. 40s minimum, 60s recommended. Metadata and search APIs are fine at 20s. CDX slowness is server-side and unpredictable.
- Availability API (`/wayback/available`) is broken. Returns `{"archived_snapshots": {}}` even for URLs with thousands of captures. Tested 2026-04-18; do not use. Replacement: CDX with `?sort=closest&limit=1`.
- `rows[0]` is always the header when `output=json`. Always slice `rows[1:]` for data. Forgetting this causes silent type errors because you're destructuring column names, not values.
- `output=json` must be explicit. Omitting it returns space-separated text. There is no default JSON mode.
- All CDX values are strings. `statuscode='200'` not `200`, `length='4821'` not `4821`. Cast: `int(row[4])`, `int(row[6])`.
- `original` preserves non-standard ports. Old crawls captured `http://www.example.com:80/`; the `:80` is part of the `original` field. Build playback URLs verbatim: `f"https://web.archive.org/web/{ts}/{orig}"` works correctly with the port.
- `from=`/`to=` timestamps are exclusive at the boundary. `to=20231231` means `to=20231231000000`: it excludes captures from Dec 31 itself. Use `to=20240101` to include all of 2023.
- `collapse=timestamp:6` keeps the FIRST capture of each period. Not the most recent. Reverse the result set or filter client-side if you need the last.
- CDX `matchType=domain` can return millions of rows for popular sites. Always add `&limit=` or `&showNumPages=true` first to estimate size.
- `showResumeKey=true` appends two sentinel rows. The second-to-last row is `[]` (empty separator), the last row is `['<resume_key_string>']`. Slice `rows[1:-2]` for data rows when a resume key is present.
- Wayback toolbar is injected into every archived HTML page. The injection is wrapped in `<!-- BEGIN WAYBACK TOOLBAR INSERT -->` / `<!-- END WAYBACK TOOLBAR INSERT -->` comments. Strip them if you need original HTML fidelity.