Domain skill
substack
Markdown synced from browser-harness domain skills.
- Host
- substack
- Files
- 1
Agent prompt
Use this skill
Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.
Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are. Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `substack` domain skill from `agent-workspace/domain-skills/substack/`. Read every markdown file for this domain before inventing an approach: - agent-workspace/domain-skills/substack/scraping.md Use those domain-skill notes to complete my task for `substack` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.
Skill contents
What the agent will read
Data Extraction
scraping.md
- Field-tested against multiple Substack publications on 2026-04-27. No authentication required for any approach documented here. All endpoints work via httpget without a browser.
- ---
- Substack exposes a clean public REST API at {publication}.substack.com/api/v1/. Every publication hosted on Substack (custom domain or {name}.substack.com) responds to the same API paths. No API key, no login, no...
- What you can do:
- List all posts from any publication (/api/v1/posts)
Show full markdown
Field-tested against multiple Substack publications on 2026-04-27.
No authentication required for any approach documented here.
All endpoints work via http_get without a browser.
TL;DR
Substack exposes a clean public REST API at {publication}.substack.com/api/v1/.
Every publication hosted on Substack (custom domain or {name}.substack.com)
responds to the same API paths. No API key, no login, no browser required.
What you can do:
- List all posts from any publication (
/api/v1/posts) - Fetch full post content by slug (
/api/v1/posts/{slug}) - Fetch post comments (
/api/v1/post/{post_id}/comments) - Read the RSS feed (
/feed) for title/date/link/description metadata
Limitations:
- Paid-only post bodies return a truncated HTML preview for
body_html(not the full article) - No cross-publication search API accessible without a logged-in session
- Comment endpoint uses
post_id(integer), not slug
Approach 1 (Recommended): Publication Post List
GET https://{subdomain}.substack.com/api/v1/posts?limit=N&offset=N
Works for any Substack publication. Returns posts sorted newest-first.
from helpers import http_get
import json
def substack_list_posts(publication_url, limit=20, offset=0):
"""List posts from a Substack publication.
Args:
publication_url: Base URL of the publication, e.g.
'https://www.slowboring.com' or
'https://simonwillison.substack.com'
limit: Number of posts to return (max observed: 100)
offset: Pagination offset
Returns list of post dicts with keys: title, subtitle, slug,
canonical_url, post_date, audience, wordcount, reactions, restacks.
audience is 'everyone' (free) or 'only_paid' (paywalled).
"""
url = f"{publication_url.rstrip('/')}/api/v1/posts?limit={limit}&offset={offset}"
posts = json.loads(http_get(url))
return [
{
"title": p.get("title"),
"subtitle": p.get("subtitle"),
"slug": p.get("slug"),
"url": p.get("canonical_url"),
"post_date": p.get("post_date"),
"audience": p.get("audience"), # 'everyone' or 'only_paid'
"wordcount": p.get("wordcount"),
"reactions": p.get("reactions"), # e.g. {"❤": 221}
"restacks": p.get("restacks"),
"cover_image": p.get("cover_image"),
"post_id": p.get("id"),
}
for p in posts
]
posts = substack_list_posts("https://www.slowboring.com", limit=10)
# [
# {
# "title": "What to make of the generic ballot",
# "subtitle": "Plus ties, Mamdani, the Obama legacy, and fundraising's diminishing returns",
# "slug": "what-to-make-of-the-generic-ballot",
# "url": "https://www.slowboring.com/p/what-to-make-of-the-generic-ballot",
# "post_date": "2026-04-24T10:03:26.581Z",
# "audience": "everyone",
# "wordcount": 4369,
# "reactions": {"❤": 221},
# "restacks": 10,
# "post_id": 194950421,
# },
# ...
# ]
# Filter for free (non-paywalled) posts only
free_posts = [p for p in posts if p["audience"] == "everyone"]
Pagination
def substack_all_posts(publication_url, max_posts=200):
"""Fetch all posts from a publication via paginated API."""
all_posts = []
offset = 0
batch_size = 50
while len(all_posts) < max_posts:
batch = substack_list_posts(publication_url, limit=batch_size, offset=offset)
if not batch:
break
all_posts.extend(batch)
if len(batch) < batch_size:
break # last page
offset += batch_size
return all_posts[:max_posts]
Approach 2: Full Post Content by Slug
GET https://{subdomain}.substack.com/api/v1/posts/{slug}
Returns the full post including body_html for free posts. Paywalled posts
return a truncated HTML preview for body_html (not the full article).
from helpers import http_get
import json, re
def substack_get_post(publication_url, slug):
"""Fetch full content of a single Substack post by slug.
Returns title, body as plain text, body_html, author, date,
and metadata. body_html is a truncated preview for paywalled posts.
"""
url = f"{publication_url.rstrip('/')}/api/v1/posts/{slug}"
post = json.loads(http_get(url))
body_html = post.get("body_html")
body_text = None
if body_html:
# Strip HTML tags for plain text
body_text = re.sub(r'<[^>]+>', ' ', body_html)
body_text = re.sub(r'\s+', ' ', body_text).strip()
return {
"title": post.get("title"),
"subtitle": post.get("subtitle"),
"slug": post.get("slug"),
"url": post.get("canonical_url"),
"post_date": post.get("post_date"),
"audience": post.get("audience"),
"wordcount": post.get("wordcount"),
"reactions": post.get("reactions"),
"restacks": post.get("restacks"),
"body_html": body_html, # full article if free; truncated preview if paywalled
"body_text": body_text, # full plain text if free; truncated if paywalled
"truncated_preview": post.get("truncated_body_text"), # always present
"post_id": post.get("id"),
"publication_id": post.get("publication_id"),
}
post = substack_get_post(
"https://www.slowboring.com",
"what-to-make-of-the-generic-ballot"
)
# Free post (audience == "everyone"):
# {
# "title": "What to make of the generic ballot",
# "audience": "everyone",
# "wordcount": 4369,
# "body_html": "<p>I suppose this isn't a huge surprise...</p>...", # ~40KB full article
# "body_text": "I suppose this isn't a huge surprise ...", # ~25KB plain text
# "post_id": 194950421,
# }
# Paywalled post (audience == "only_paid"):
# post["body_html"] -> truncated HTML preview (a few hundred bytes, not the full article)
# post["body_text"] -> truncated plain text (stripped from truncated HTML)
# post["truncated_preview"] -> short plaintext excerpt (separate, always present)
# Use audience == "everyone" as the reliable signal for full content availability.
Approach 3: Post Comments
GET https://{subdomain}.substack.com/api/v1/post/{post_id}/comments?limit=N
Note: uses integer post_id, not slug. Get post_id from the post list
or post detail responses.
from helpers import http_get
import json
def substack_get_comments(publication_url, post_id, limit=50):
"""Fetch top-level comments for a Substack post.
Args:
publication_url: Base URL of the publication
post_id: Integer post ID (from post list or post detail)
limit: Max comments to return
Returns list of comment dicts.
"""
url = f"{publication_url.rstrip('/')}/api/v1/post/{post_id}/comments?limit={limit}"
data = json.loads(http_get(url))
comments = data.get("comments", [])
return [
{
"comment_id": c.get("id"),
"author": c.get("name"),
"author_handle": c.get("handle"),
"body": c.get("body"),
"date": c.get("date"),
"reaction_count": c.get("reaction_count"), # e.g. {"❤": 99}
"children_count": c.get("children_count"), # reply count
"restacks": c.get("restacks"),
}
for c in comments
if not c.get("deleted")
]
comments = substack_get_comments("https://www.slowboring.com", 194950421, limit=10)
# [
# {
# "comment_id": 248392394,
# "author": "John from FL",
# "body": "Sam asks: \"don't they kind of have a point...\"",
# "date": "2026-04-24T10:20:21.997Z",
# "reaction_count": {"❤": 99},
# "children_count": 3,
# },
# ...
# ]
Approach 4: RSS Feed (Lightweight Metadata)
GET https://{subdomain}.substack.com/feed
Returns an RSS 2.0 feed. Useful when you only need title/date/link/description without hitting the JSON API. Works as a quick check without parsing JSON.
from helpers import http_get
import re
def substack_rss(publication_url, max_items=20):
"""Fetch recent post metadata via RSS feed.
Lighter than the JSON API — only returns title, link, pubDate,
and description (short excerpt). Does not include body_html or wordcount.
"""
rss = http_get(f"{publication_url.rstrip('/')}/feed")
items = re.findall(
r'<item>(.*?)</item>',
rss,
re.DOTALL
)[:max_items]
results = []
for item in items:
title = re.search(r'<title><!\[CDATA\[(.*?)\]\]></title>', item)
link = re.search(r'<link>(https?://[^<]+)</link>', item)
date = re.search(r'<pubDate>(.*?)</pubDate>', item)
desc = re.search(r'<description><!\[CDATA\[(.*?)\]\]></description>', item, re.DOTALL)
results.append({
"title": title.group(1) if title else None,
"link": link.group(1) if link else None,
"pub_date": date.group(1) if date else None,
"description": desc.group(1).strip() if desc else None,
})
return results
feed = substack_rss("https://www.slowboring.com", max_items=5)
# [
# {
# "title": "Sunday Mailbag + Thread",
# "link": "https://www.slowboring.com/p/sunday-mailbag-thread-48b",
# "pub_date": "Sun, 26 Apr 2026 17:02:04 GMT",
# "description": "Ask your questions below.",
# },
# ...
# ]
Publication URL Formats
Substack publications use one of two URL formats:
# Format 1: native subdomain (older or simpler publications)
"https://simonwillison.substack.com"
# Format 2: custom domain (larger publications, purchased domain)
"https://www.slowboring.com" # Matthew Yglesias — Slow Boring
"https://unchartedterritories.tomaspueyo.com" # Tomas Pueyo
# Both formats use identical API paths:
# {base_url}/api/v1/posts
# {base_url}/api/v1/posts/{slug}
# {base_url}/api/v1/post/{post_id}/comments
# {base_url}/feed
If you only know a publication's Substack handle (e.g., matthewyglesias),
the canonical subdomain URL is https://matthewyglesias.substack.com. Custom
domain URLs are listed on the publication's about page or in the RSS feed's
<link> element.
Gotchas
- Paywalled post
body_htmlis a truncated preview, notnull— the API returns a short HTML excerpt (typically a few hundred to a few KB). It is nevernull. The reliable way to detect full content availability isaudience == "everyone". For paywalled posts, comparelen(body_html)towordcount * ~7(average bytes per word) — a large gap means truncation.truncated_body_text(plaintext) is always present regardless of audience. - Comments endpoint uses integer
post_id, not slug —/api/v1/post/{id}/commentsis correct./api/v1/posts/{slug}/commentsreturns 404. reactionsfield is a dict with emoji keys, e.g.{"❤": 221}— not a plain integer. Sum the values for total reaction count:total = sum(post["reactions"].values()).limiton post list is not strictly capped — values up to at least 100 work; beyond that behavior is untested.- Custom domains and
{name}.substack.comare interchangeable — use whichever you have. Thex-subresponse header always reflects the internal publication handle. audiencevalues: only"everyone"and"only_paid"observed. A third value"founding"exists in Substack's data model but is rare.- No unauthenticated cross-publication search —
substack.com/api/v1/searchreturns HTML (a React page), not JSON. To find publications, use external search engines (site:substack.com {query}) or the RSS discovery approach. - Podcast posts have
type == "podcast"andpodcast_urlset; theirbody_htmlmay be a show-notes HTML block. Checktypeto distinguish newsletter posts from podcast episodes.