Domain skill
archive-org
Markdown synced from browser-harness domain skills.
- Host
- archive-org
- Files
- 1
Agent prompt
Use this skill
Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.
Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are. Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `archive-org` domain skill from `agent-workspace/domain-skills/archive-org/`. Read every markdown file for this domain before inventing an approach: - agent-workspace/domain-skills/archive-org/scraping.md Use those domain-skill notes to complete my task for `archive-org` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.
Skill contents
What the agent will read
Scraping & Data Extraction
scraping.md
https://archive.org / https://web.archive.org — all public data, no auth required. Every workflow here is pure http_get — no browser needed.
Do this first
Use the CDX API for anything Wayback-related — it is the reliable workhorse. The Wayback Availability API (/wayback/available) is known to return empty archived_snapshots even for well-archived URLs and should not be used as a primary mechanism.
import json
# Find snapshots of any URL: the primary entry point for Wayback data
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=iana.org&output=json&limit=5"
    "&fl=timestamp,original,statuscode,mimetype,length",
    timeout=40.0
)
rows = json.loads(r)
headers = rows[0]  # ['timestamp', 'original', 'statuscode', 'mimetype', 'length']
for row in rows[1:]:
    ts, orig, status, mime, length = row
    snap_url = f"https://web.archive.org/web/{ts}/{orig}"
    print(f"{ts} {status} {snap_url}")
For item metadata (books, video, audio, software), go straight to:
# identifier is the item's ID, e.g. "HardWonWisdomTrailer"
data = json.loads(http_get(f"https://archive.org/metadata/{identifier}", timeout=30.0))
Common workflows
Find the nearest archived snapshot to a target date
import json
# CDX sort=closest returns the single snapshot nearest to the given timestamp
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=iana.org&output=json&limit=1"
    "&fl=timestamp,original,statuscode"
    "&closest=20230601120000&sort=closest",
    timeout=60.0  # CDX can be slow; always use timeout >= 40s
)
rows = json.loads(r)
# rows[0] = header, rows[1] = closest snapshot
ts, orig, status = rows[1]
snap_url = f"https://web.archive.org/web/{ts}/{orig}"
# Result: ts='20230601114925', orig='https://www.iana.org/', status='200'
# snap_url: https://web.archive.org/web/20230601114925/https://www.iana.org/
Timestamp format is always 14-digit YYYYMMDDHHMMSS. Pass any prefix — 20230601 (day), 202306 (month), 2023 (year) — and CDX will match.
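When you need full timestamps on the client side (for building playback URLs or comparing captures), a small helper pair can pad a prefix and parse the 14-digit form. This is a minimal sketch: `pad_wayback_ts` and `parse_wayback_ts` are hypothetical names, padding to the start of the period is a choice made here rather than something CDX requires, and treating timestamps as UTC is an assumption worth confirming for your use case.

```python
from datetime import datetime, timezone

# Lower-bound padding template: month 01, day 01, time 00:00:00.
_TS_FLOOR = "00000101000000"


def pad_wayback_ts(prefix: str) -> str:
    """Pad a YYYY[MM[DD[HH[MM[SS]]]]] prefix to the full 14 digits."""
    if not prefix.isdigit() or len(prefix) not in (4, 6, 8, 10, 12, 14):
        raise ValueError(f"not a timestamp prefix: {prefix!r}")
    return prefix + _TS_FLOOR[len(prefix):]


def parse_wayback_ts(ts: str) -> datetime:
    """Convert a 14-digit timestamp to an aware datetime (UTC assumed)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
```

For example, `parse_wayback_ts(pad_wayback_ts("202306"))` gives the first instant of June 2023.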
List all monthly snapshots for a URL (collapsed)
import json
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=iana.org&output=json"
    "&collapse=timestamp:6"  # :6 = dedupe by YYYYMM (one per month)
    "&from=20230101&to=20231231"
    "&fl=timestamp,original",
    timeout=60.0
)
rows = json.loads(r)
# rows[0] = header ['timestamp', 'original']
# rows[1:] = one row per month:
#   ['20230101103807', 'https://www.iana.org/']
#   ['20230201144829', 'https://www.iana.org/']
#   ...12 rows for 2023
for ts, orig in rows[1:]:
    print(f"{ts[:4]}-{ts[4:6]} https://web.archive.org/web/{ts}/{orig}")
`collapse=timestamp:N` deduplicates by the first N digits of the timestamp: `:4` = one per year, `:6` = one per month, `:8` = one per day.
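Because `collapse` keeps the first capture in each bucket, getting the last capture per month (or day, or year) means fetching uncollapsed rows and filtering client-side. The sketch below shows one way to do that; `last_capture_per_period` is a hypothetical helper, and it assumes the header row has already been removed and that the timestamp sits at `ts_index`.

```python
def last_capture_per_period(rows, digits=6, ts_index=0):
    """Keep the latest capture for each timestamp prefix of length `digits`.

    `rows` are CDX data rows (header already removed). 14-digit timestamps
    compare correctly as strings, so no parsing is needed.
    """
    latest = {}
    for row in rows:
        key = row[ts_index][:digits]
        if key not in latest or row[ts_index] > latest[key][ts_index]:
            latest[key] = row
    # Return buckets in chronological order.
    return [latest[key] for key in sorted(latest)]
```

Pass `digits=8` for the last capture per day or `digits=4` for the last per year.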
List snapshots for an entire domain (all pages)
import json
# matchType=domain captures all URLs under that domain
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=iana.org&matchType=domain&output=json"
    "&limit=10&fl=timestamp,original,statuscode"
    "&collapse=timestamp:8",  # one capture per URL per day
    timeout=60.0
)
rows = json.loads(r)
for row in rows[1:]:
    print(row)
# ['19971210061738', 'http://www.iana.org:80/', '200']
# ['19980211065537', 'http://www.iana.org:80/', '200']
# ...
matchType options: exact (default), prefix (the URL plus everything under its path), host (every URL on that exact host), domain (the host plus all its subdomains).
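The query strings above are assembled by hand. If you build many CDX queries, a small URL builder keeps the parameters in one place and rejects a misspelled `matchType` early. This is only a sketch: `build_cdx_url` and its defaults are hypothetical, and it uses only parameters that already appear in these notes.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
MATCH_TYPES = {"exact", "prefix", "host", "domain"}


def build_cdx_url(url, match_type="exact",
                  fl=("timestamp", "original", "statuscode"), extra=None):
    """Build a CDX JSON query URL; `extra` holds further params as a dict."""
    if match_type not in MATCH_TYPES:
        raise ValueError(f"unknown matchType: {match_type!r}")
    params = {"url": url, "output": "json", "fl": ",".join(fl)}
    if match_type != "exact":  # exact is the server default
        params["matchType"] = match_type
    params.update(extra or {})
    # Keep ',', ':' and '/' readable instead of percent-encoding them.
    return f"{CDX_ENDPOINT}?{urlencode(params, safe=',:/')}"
```

For example, `build_cdx_url("iana.org", "domain", extra={"collapse": "timestamp:8", "limit": 10})` reproduces the domain query shown earlier.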
Filter snapshots by prefix path
import json
# All archived pages under /domains/ path
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=iana.org/domains/&matchType=prefix&output=json"
    "&limit=5&fl=timestamp,original,statuscode",
    timeout=40.0
)
rows = json.loads(r)
for row in rows[1:]:
    print(row)
# ['20080509121811', 'http://www.iana.org/domains/', '200']
# ['20080704174537', 'http://iana.org/domains/', '200']
Paginate CDX results with resumeKey
import json
from urllib.parse import quote

def cdx_all_snapshots(url, fl="timestamp,original,statuscode", page_size=500):
    """Iterate all CDX records for a URL, yielding rows (excluding header)."""
    base = (
        f"https://web.archive.org/cdx/search/cdx"
        f"?url={quote(url, safe='')}&output=json"
        f"&fl={fl}&limit={page_size}&showResumeKey=true"
    )
    resume_key = None
    while True:
        endpoint = base if resume_key is None else f"{base}&resumeKey={quote(resume_key)}"
        rows = json.loads(http_get(endpoint, timeout=60.0))
        # rows structure with showResumeKey=true:
        # [header, row1, row2, ..., [], [resume_key_string]]
        # The second-to-last row is [] (separator), last row is [resume_key]
        has_resume = len(rows) >= 2 and rows[-1] != [] and rows[-2] == []
        data_rows = rows[1:-2] if has_resume else rows[1:]
        for row in data_rows:
            yield row
        if not has_resume:
            break
        resume_key = rows[-1][0]

for row in cdx_all_snapshots("iana.org", fl="timestamp,original"):
    ts, orig = row
    # process...
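The page-splitting step is the part most likely to go wrong, so it can help to isolate it as a pure function that is testable without network access. The sketch below assumes the row layout described in the comments above (`[header, rows..., [], [resume_key]]`); `split_cdx_page` is a hypothetical name.

```python
def split_cdx_page(rows):
    """Split one parsed CDX JSON page into (header, data_rows, resume_key).

    resume_key is None when the page carries no resume trailer,
    which signals the final page.
    """
    if not rows:
        return [], [], None
    # Trailer layout: an empty separator row followed by [resume_key].
    if len(rows) >= 2 and rows[-2] == [] and rows[-1]:
        return rows[0], rows[1:-2], rows[-1][0]
    return rows[0], rows[1:], None
```

A loop would call `http_get`, parse the JSON, pass it through `split_cdx_page`, and stop when the returned key is `None`.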
Retrieve the actual archived page
# Direct snapshot URL: /web/{14-digit-timestamp}/{original-url}
snap_url = "https://web.archive.org/web/19971210061738/http://www.iana.org:80/"
content = http_get(snap_url, timeout=30.0)
# Returns the archived HTML with Wayback toolbar injected at top
# The toolbar is inside <!-- BEGIN WAYBACK TOOLBAR INSERT --> comments
# The calendar view URL pattern (for browser navigation, not http_get):
# https://web.archive.org/web/20230101000000*/python.org
# The * tells Wayback to show the calendar — returns HTML, not raw page
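If you want the page without the injected toolbar, two approaches are sketched below. Both rest on assumptions not confirmed in these notes: that the toolbar block ends with a matching `<!-- END WAYBACK TOOLBAR INSERT -->` comment, and that appending the `id_` modifier to the timestamp makes Wayback serve the capture unmodified. Verify both against a real snapshot before relying on them; the helper names are hypothetical.

```python
import re

# Assumed markers: the BEGIN comment is noted above, the END comment is a guess
# at its counterpart.
_TOOLBAR_RE = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.DOTALL,
)


def strip_wayback_toolbar(html: str) -> str:
    """Remove the injected toolbar block, if present."""
    return _TOOLBAR_RE.sub("", html)


def snapshot_url(ts: str, original: str, raw: bool = False) -> str:
    """Build a playback URL; raw=True appends the id_ modifier (assumption)."""
    modifier = "id_" if raw else ""
    return f"https://web.archive.org/web/{ts}{modifier}/{original}"
```

Stripping the toolbar does not undo other playback rewriting such as rewritten links, so treat the result as cleaned HTML rather than the original bytes.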
Item metadata (books, video, audio, software, collections)
import json
from urllib.parse import quote
identifier = "HardWonWisdomTrailer"
data = json.loads(http_get(f"https://archive.org/metadata/{identifier}", timeout=30.0))
# Top-level keys:
# alternate_locations, created, d1, d2, dir, files, files_count,
# is_collection, item_last_updated, item_size, metadata, server, uniq, workable_servers
meta = data['metadata']
# Common metadata fields (not all present on every item):
print(meta.get('identifier')) # 'HardWonWisdomTrailer'
print(meta.get('title')) # 'Hard Won Wisdom Trailer'
print(meta.get('mediatype')) # 'movies' | 'texts' | 'audio' | 'software' | 'collection'
print(meta.get('creator')) # 'jakemauz'
print(meta.get('date')) # '2017-02-18'
print(meta.get('description')) # HTML string — strip tags if needed
print(meta.get('subject')) # str OR list of str depending on item
print(meta.get('publicdate')) # '2017-02-18 11:51:16'
print(meta.get('collection')) # parent collection identifier
files = data['files']
# Each file entry:
# name, source ('original'|'derivative'|'metadata'), format, size (bytes as str),
# md5, sha1, crc32, mtime
# For video/audio: length (seconds as str), height, width
# For derivative: original (name of source file)
# Find the primary original file
orig_files = [f for f in files if f.get('source') == 'original']
# orig_files[0]: {'name': 'Hard-won wisdom trailer.mp4', 'source': 'original',
# 'format': 'MPEG4', 'size': '7532153', 'length': '94.13',
# 'height': '360', 'width': '640', 'md5': 'aaeebe0481...', ...}
# Build download URL — two equivalent forms:
server = data['server'] # 'ia601405.us.archive.org'
dir_path = data['dir'] # '/2/items/HardWonWisdomTrailer'
fname = orig_files[0]['name']
from urllib.parse import quote as urlquote
# Form 1: direct storage server (fastest)
url1 = f"https://{server}{dir_path}/{urlquote(fname)}"
# Form 2: standard redirect URL (always works, resolved by CDN)
url2 = f"https://archive.org/download/{identifier}/{urlquote(fname)}"
# Both confirmed status 200, Content-Type: video/mp4
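The steps above (filter for `source == 'original'`, cast the string `size`, build the `/download/` URL) can be folded into one function over the parsed metadata response. This is a sketch under the field layout shown above; `original_downloads` is a hypothetical helper and it only emits the stable `archive.org/download/` form.

```python
from urllib.parse import quote


def original_downloads(data):
    """Return (name, size_bytes, url) for each original file of an item.

    `data` is the parsed JSON from /metadata/{identifier}.
    """
    identifier = data["metadata"]["identifier"]
    results = []
    for entry in data.get("files", []):
        if entry.get("source") != "original":
            continue
        # size arrives as a string; it may be absent on some entries.
        size = int(entry["size"]) if entry.get("size") else None
        url = f"https://archive.org/download/{quote(identifier)}/{quote(entry['name'])}"
        results.append((entry["name"], size, url))
    return results
```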
Search items (books, audio, video, software)
import json
# advancedsearch.php is the correct API; /search returns HTML
r = http_get(
    "https://archive.org/advancedsearch.php"
    "?q=artificial+intelligence+AND+mediatype:texts"
    "&fl[]=identifier&fl[]=title&fl[]=creator&fl[]=date&fl[]=downloads"
    "&rows=5&sort[]=downloads+desc&output=json",
    timeout=30.0
)
data = json.loads(r)
# data['responseHeader']['status'] = 0 (success)
# data['responseHeader']['QTime'] = query time ms
# data['response']['numFound'] = 25911 (total matches)
# data['response']['start'] = 0 (offset)
# data['response']['docs'] = list of item dicts
resp = data['response']
print(f"Total: {resp['numFound']}, showing: {len(resp['docs'])}")
for doc in resp['docs']:
    print(f" {doc['identifier']} {doc.get('title', '')[:50]}")
# doc fields are only present if they have values; always use .get()
Pagination: use start= offset (not page=). Max rows= is not documented but 100 works reliably.
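A paging loop over `start=` might look like the sketch below. `search_url` and `iter_search` are hypothetical helpers; `fetch` stands in for `http_get` (any callable that takes a URL and returns the response body as a string), and `max_docs` is a safety cap added here, not an API limit.

```python
import json
from urllib.parse import urlencode


def search_url(query, fields, rows=100, start=0, sort="downloads desc"):
    """Build an advancedsearch.php URL with repeated fl[] parameters."""
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("start", start), ("sort[]", sort), ("output", "json")]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def iter_search(fetch, query, fields, rows=100, max_docs=500):
    """Yield result docs page by page until numFound or max_docs is reached."""
    start = 0
    while start < max_docs:
        page_size = min(rows, max_docs - start)
        body = fetch(search_url(query, fields, page_size, start))
        resp = json.loads(body)["response"]
        docs = resp["docs"]
        if not docs:
            break
        yield from docs
        start += len(docs)
        if start >= resp["numFound"]:
            break
```

In the harness this would be called as `iter_search(lambda u: http_get(u, timeout=30.0), "mediatype:texts", ["identifier", "title"])`.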
Search with all supported parameters
import json
r = http_get(
    "https://archive.org/advancedsearch.php"
    "?q=machine+learning+AND+mediatype:texts"  # Lucene query syntax
    "&fl[]=identifier&fl[]=title&fl[]=date&fl[]=year"
    "&fl[]=creator&fl[]=subject&fl[]=description&fl[]=downloads"
    "&rows=3"
    "&start=0"  # pagination offset
    "&sort[]=date+desc"  # sort field + direction
    "&output=json",
    timeout=30.0
)
data = json.loads(r)
# Confirmed fields in fl[]:
# identifier, title, date, year, creator, subject, description,
# downloads, mediatype, collection, language, avg_rating, num_reviews
# mediatype values: texts, audio, movies, software, image, etree, data, collection, account
# Sort fields: date, downloads, avg_rating, num_reviews, publicdate, addeddate
API reference
| Endpoint | What it returns | Auth |
|---|---|---|
| `web.archive.org/cdx/search/cdx?url=...&output=json` | Snapshot index: all captures of a URL | None |
| `archive.org/wayback/available?url=...` | Nearest snapshot (DEGRADED; see Gotchas) | None |
| `archive.org/metadata/{identifier}` | Item metadata + files list | None |
| `archive.org/advancedsearch.php?q=...&output=json` | Full-text + metadata search | None |
| `archive.org/download/{identifier}/{filename}` | Direct file download | None |
| `web.archive.org/web/{timestamp}/{url}` | Archived page HTML | None |
CDX field reference
The CDX API returns a JSON array of arrays. The first row is always the header when output=json.
| Field | Description | Example |
|---|---|---|
| `urlkey` | SURT-format URL (reversed domain, path in parens) | `org,iana)/` |
| `timestamp` | Capture time, 14-digit YYYYMMDDHHMMSS | `19971210061738` |
| `original` | Original crawled URL (exact, including port) | `http://www.iana.org:80/` |
| `mimetype` | Content-Type of the archived response | `text/html` |
| `statuscode` | HTTP status at crawl time | `200` |
| `digest` | SHA-1 of response body, base32-encoded | `I4YBMQ6PHPWE2TD6TIXNWHZB6MXRNTSR` |
| `length` | Content length in bytes (as string) | `1418` |
Default fl= when omitted: urlkey,timestamp,original,mimetype,statuscode,digest,length (all 7 fields in that order).
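Since every field arrives as a string and positional unpacking breaks whenever `fl=` changes, one option is to zip each row against the header and cast the numeric fields once. The sketch below does that; `cdx_records` is a hypothetical helper, and mapping non-numeric values to `None` is a defensive choice made here, not documented CDX behavior.

```python
def _to_int(value):
    """Cast a CDX string field to int; non-numeric values become None."""
    return int(value) if value.isdigit() else None


def cdx_records(rows):
    """Turn parsed CDX JSON (header row first) into a list of dicts.

    statuscode and length are cast to int when present in the header.
    """
    if not rows:
        return []
    header = rows[0]
    records = []
    for row in rows[1:]:
        if len(row) != len(header):
            continue  # skip rows whose length does not match the header
        record = dict(zip(header, row))
        for key in ("statuscode", "length"):
            if key in record:
                record[key] = _to_int(record[key])
        records.append(record)
    return records
```

This works with any `fl=` selection because the field names come from the response itself.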
Rate limits
No auth, no API key. In practice:
- CDX API: intermittently slow. Individual queries time out at 20s and succeed at 40–60s. Always use `timeout=40.0` minimum. 3 rapid sequential CDX calls in ~10s completed; 10 rapid calls produced 3 timeouts.
- Metadata API: fast and reliable. 5 sequential calls completed in 3.0s with no errors.
- Search API: fast, typically responds in 30–65ms (`QTime` in the JSON `responseHeader`).
- No documented per-second or per-day limits. Archive.org's policy is to be respectful: add `time.sleep(1)` between CDX calls in loops.
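Given the intermittent CDX timeouts, a retry wrapper with a growing delay is one reasonable way to follow the guidance above. This is a sketch, not part of browser-harness: `fetch_with_retry` is a hypothetical helper, `fetch` stands in for a call such as `lambda u: http_get(u, timeout=60.0)`, and catching bare `Exception` is a placeholder because the exception type `http_get` raises on timeout is not stated in these notes.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this once the real exception type is known
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use leave it at the default.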
Gotchas
- CDX times out: always set `timeout=40.0` or higher. The default 20s is often too short for CDX. Metadata and search APIs are fine at 20–30s. CDX slowness is backend-side and unpredictable; add retry logic for production use.
- Wayback Availability API is unreliable. `GET /wayback/available?url=iana.org` returns `{"url": "iana.org", "archived_snapshots": {}}` even for URLs confirmed archived via CDX. Tested 2026-04-18 across many URLs and timestamp combinations: consistently empty. Use CDX `?sort=closest&limit=1` instead (confirmed working).
- CDX first row is always the header when `output=json`. `rows[0]` is `['timestamp', 'original', ...]`, not a data row. Always slice `rows[1:]` for data. When `showResumeKey=true`, the last two rows are `[]` (separator) and `['<resume_key_string>']`.
- CDX `fl=` must match exactly what you iterate. If you request `&fl=timestamp,original` you get 2-element rows; forgetting a field breaks destructuring. When in doubt, omit `fl=` entirely and get all 7 fields.
- `output=json` is required; there is no default JSON mode. Omitting `output=json` returns space-separated text. `output=text` also works and is slightly faster for simple queries.
- `timestamp` is a string, not an integer. Even in JSON, CDX returns all fields as strings: `'1418'` not `1418`, `'200'` not `200`. Cast explicitly: `int(row[4])`, `int(row[6])`.
- The `original` field preserves port numbers. Old crawls captured `http://www.iana.org:80/`, and the `:80` is part of the URL. When building a playback URL, use `original` verbatim: `f"https://web.archive.org/web/{ts}/{orig}"` works correctly with the port included.
- Metadata `{}` means the item doesn't exist or is private. `http_get("https://archive.org/metadata/nonexistent")` returns `'{}'` (2-byte response) with HTTP 200. Always check `if not data` or `if not data.get('metadata')` before accessing fields.
- Metadata `subject` can be a string or a list. When a single subject tag is set, the API returns `"subject": "short film"`. When multiple, it returns `"subject": ["short film", "spoken word"]`. Normalize with: `subjects = [meta['subject']] if isinstance(meta.get('subject'), str) else meta.get('subject', [])`.
- File `size` and `length` are strings, not numbers. `files[0]['size']` is `'7532153'` (bytes). `files[0]['length']` is `'94.13'` (seconds for video/audio). Cast with `int()` and `float()` respectively.
- Use `archive.org/download/`, not the raw storage server URL, for reliability. The raw URL (`ia601405.us.archive.org/2/items/...`) is faster but server-specific. `archive.org/download/{id}/{file}` redirects to the correct storage node and remains stable as items migrate.
- `/search?output=json` returns HTML, not JSON. The `/search` endpoint is a React SPA and ignores `output=json`. Always use `advancedsearch.php` for programmatic access.
- `collapse=timestamp:6` gives one row per month, but it keeps the FIRST capture of that month. If you want the last, you'd need to reverse and re-collapse, or fetch all and filter client-side. The `collapse` parameter de-duplicates by truncating the timestamp to N digits and keeping the first matching row.
- CDX `from=`/`to=` accept partial timestamps. `from=20230101` means `20230101000000`. `to=20231231` means `20231231000000` (exclusive). To include all of 2023, use `to=20240101`.
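The two metadata gotchas (empty response for missing items, `subject` as string or list) can be handled once at the parsing boundary. The sketch below uses hypothetical helper names; it treats any empty JSON value as "no item", which covers `{}` as described above and would also cover an empty list.

```python
import json


def parse_item(raw):
    """Parse a /metadata/{identifier} response; return None for a missing item."""
    data = json.loads(raw)
    if not data or not data.get("metadata"):
        return None
    return data


def as_list(value):
    """Normalize a field that may be absent, a single string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)
```

Usage would be along the lines of `item = parse_item(http_get(...))`, then `subjects = as_list(item["metadata"].get("subject"))` once `item` is known not to be `None`.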