Domain skill
medium
Markdown synced from browser-harness domain skills.
- Host: medium
- Files: 2
Agent prompt
Use this skill
Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.
Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are. Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `medium` domain skill from `agent-workspace/domain-skills/medium/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/medium/article-hydration.md
- agent-workspace/domain-skills/medium/scraping.md
Use those domain-skill notes to complete my task for `medium` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.
Skill contents
What the agent will read
Article Body via DOM
article-hydration.md
Extract a Medium article's body as clean markdown using the logged-in browser. Use this when API paths in scraping.md are blocked or truncated:
- Cloudflare challenge on the `?format=json` endpoint ("Performing security verification")
- Member-only post that the API returns locked (`isSubscriptionLocked=True`) but the logged-in browser can render in full
- JS-only variant where the article is gated behind a client-side paywall modal
If the article is free and the API works, prefer scraping.md — it's faster and doesn't need a visible tab.
URL patterns
- Canonical: `https://medium.com/@<author>/<slug>-<id>`
- Publication: `https://<pub>.medium.com/<slug>-<id>` or `https://medium.com/<pub>/<slug>-<id>`
- Custom domain: some publications proxy Medium under their own domain, and the same DOM extractor works there. `towardsdatascience.com` was one such domain, but scraping.md notes it has since moved to WordPress.
All variants render the article body inside a single <article> element.
Site structure
- The article body lives under the page's single `<article>` element.
- Block-level content: `h1`–`h4`, `p`, `pre`, `blockquote`, `ul`, `ol`, `figure`.
- Images are always wrapped in `<figure>` with a `<figcaption>` sibling; the real resolution lives on `miro.medium.com/v2/resize:fit:<N>/...`.
- Code blocks are `<pre>`; no language class is exposed in the DOM, so emit plain fenced blocks.
- Pull quotes render as `<blockquote>` with nested `<p>`.
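Since the image width is encoded in the `resize:fit:<N>` path segment, a different size can probably be requested by rewriting that segment. The sketch below assumes the CDN accepts an arbitrary `<N>`, which the notes above do not confirm; the function name and the sample image path are hypothetical.

```python
import re


def resize_miro_url(url: str, width: int) -> str:
    """Swap the width in a 'resize:fit:<N>' path segment (assumed to be adjustable)."""
    # Only the first occurrence is rewritten; URLs without the segment are returned unchanged.
    return re.sub(r"resize:fit:\d+", f"resize:fit:{width}", url, count=1)


# Hypothetical image path, used for illustration only
sample = "https://miro.medium.com/v2/resize:fit:700/1*abc.png"
print(resize_miro_url(sample, 1400))
```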
Cruft to strip
Medium injects engagement UI inside <article>. The text "6 2 Listen Share More" at the top is the clap/comment/listen/share button row, not content. Also expect a follow button near the author's name and sometimes a "Help" / "Status" footer.
Safe pattern: take the extracted markdown, then drop leading paragraphs that are shorter than ~12 characters until you hit the first real block (the "Last updated" line, the H1, or the first long paragraph).
Extractor
browser-harness <<'PY'
new_tab("https://medium.com/@user/slug-abc123")
wait_for_load()
wait(2.0)  # Medium hydrates more UI after readyState=complete
md = js(r"""
(()=>{
  const article = document.querySelector('article');
  if(!article) return null;
  const blocks = article.querySelectorAll('h1, h2, h3, h4, p, pre, blockquote, ul, ol, figure');
  const out = [];
  const seen = new Set();
  for(const el of blocks){
    // Skip blocks nested inside a block that was already visited
    let skip = false;
    for(const s of seen){ if(s.contains(el) && s !== el){ skip=true; break; } }
    if(skip) continue;
    seen.add(el);
    const tag = el.tagName;
    const txt = (el.innerText || '').trim();
    if(!txt && tag !== 'FIGURE') continue;
    if(tag === 'H1') out.push('# ' + txt);
    else if(tag === 'H2') out.push('## ' + txt);
    else if(tag === 'H3') out.push('### ' + txt);
    else if(tag === 'H4') out.push('#### ' + txt);
    else if(tag === 'PRE') out.push('```\n' + txt + '\n```');
    else if(tag === 'BLOCKQUOTE') out.push(txt.split('\n').map(l=>'> '+l).join('\n'));
    else if(tag === 'UL' || tag === 'OL'){
      const items = [...el.querySelectorAll(':scope > li')].map((li,i)=>{
        const t = li.innerText.trim();
        return (tag==='OL' ? (i+1)+'. ' : '- ') + t;
      });
      out.push(items.join('\n'));
    }
    else if(tag === 'FIGURE'){
      const img = el.querySelector('img');
      const cap = el.querySelector('figcaption');
      if(img && img.src){
        // Prefer alt text, fall back to the caption
        const alt = img.alt || (cap ? cap.innerText.trim() : '');
        out.push('![' + alt + '](' + img.src + ')');
      }
    }
    else if(tag === 'P') out.push(txt);
  }
  return out.join('\n\n');
})()
""")
# Strip engagement-button cruft from the top
paras = (md or '').split('\n\n')  # md is None when no <article> was found
while paras and len(paras[0]) < 12:
    paras.pop(0)
md = '\n\n'.join(paras)
print(md)
PY
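The `seen` set avoids double-emitting nested blocks: a `<p>` inside a `<blockquote>` or list item, or a nested `<ul>`, also matches the block query, but it is skipped because an already-visited ancestor contains it.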
Waits
- `wait_for_load()` is necessary but not sufficient: Medium continues to hydrate author-card and clap widgets after `readyState=complete`. An additional `wait(2.0)` avoids cases where the article outer frame exists but the first few paragraphs are still skeleton `<div>`s.
- For member-only articles, if `<article>` renders but text length is suspiciously short (<500 chars), the paywall modal intercepted. Confirm the tab is on your logged-in profile and retry.
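As an alternative to a fixed sleep, the readiness check can be polled until it passes or a deadline expires. This is a generic sketch, not part of browser-harness: `wait_until` is a hypothetical helper, and the commented usage assumes `js()` returns the evaluated number.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible usage inside a browser-harness script (assumes js() returns a number):
# ok = wait_until(lambda: js("(document.querySelector('article')?.innerText.length || 0)") > 500)
```

If the poll times out, the short-text case described above applies: check the logged-in profile and retry.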
Paywall / login detection
state = js("""
(()=>{
  const art = document.querySelector('article');
  const len = art ? art.innerText.length : 0;
  const hasPaywall = !!document.querySelector('[data-testid*="paywall"], [aria-label*="Sign in" i]');
  return {len, hasPaywall};
})()
""")
If hasPaywall is true or len < 500, fall back to scraping.md API paths (the article may simply be locked for this account).
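The fallback rule can be kept in one small function so the 500-character threshold is not repeated across scripts. A minimal sketch, assuming `state` is the dict returned by the snippet above (or `None` if the evaluation failed); the function name is hypothetical.

```python
from typing import Optional


def needs_api_fallback(state: Optional[dict], min_len: int = 500) -> bool:
    """True when the DOM state suggests the paywall intercepted the article."""
    if not state:
        # Nothing usable came back from the page, so treat it as blocked.
        return True
    return bool(state.get("hasPaywall")) or state.get("len", 0) < min_len


print(needs_api_fallback({"len": 12000, "hasPaywall": False}))  # False
print(needs_api_fallback({"len": 320, "hasPaywall": False}))    # True
```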
Traps
- Don't use `article.innerText` alone. It drops structure: code blocks lose their fences, lists lose their markers, figures disappear. The block walker above preserves each element kind.
- Don't rely on CSS class names. Medium's class names are hashed (`pw-post-body-paragraph`, etc.) and rotate; select by tag instead.
- `<figure>` caption text is often also repeated as `<img alt>`. Prefer `alt`, fall back to `figcaption`, so you don't emit both.
- The article sometimes ends before the "About the Author" card and sometimes includes it. The walker captures the card when it is present, which is fine for archival. If you need body-only, cut at the last `h2`/`h3` before a `<hr>`-equivalent divider, or trim by known footer strings (`Follow`, `More from`, `Written by`).
- Tab marker. `new_tab()` prepends 🟢 to the title. Don't include `document.title` in the emitted markdown; use the article's `<h1>` instead.
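Trimming by footer strings can be sketched as a post-processing step on the extracted markdown. The matching rules here are assumptions: `Follow` must be the whole paragraph (so body text such as "Follow the steps below" survives), the other two markers match as prefixes, and the first few paragraphs are skipped because a follow button also appears near the top. `trim_footer` and `skip_head` are hypothetical names.

```python
def _is_footer(text: str) -> bool:
    # "Follow" only counts as a footer when it is the entire paragraph.
    return text == "Follow" or text.startswith(("Written by", "More from"))


def trim_footer(md: str, skip_head: int = 5) -> str:
    """Cut the markdown at the first footer-looking paragraph after the head."""
    paras = md.split("\n\n")
    for i, para in enumerate(paras):
        if i < skip_head:
            continue  # the author row near the top can contain "Follow" too
        if _is_footer(para.lstrip("# ").strip()):
            return "\n\n".join(paras[:i])
    return md
```

Text with no footer marker is returned unchanged, so the function is safe to apply to every extraction.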
Data Extraction
scraping.md
https://medium.com — blogging platform. Three access paths tested and validated: the undocumented ?format=json endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.
Do this first: pick your access path
| Goal | Best approach | Latency |
|---|---|---|
| Article metadata + full body | ?format=json on article URL | ~400ms |
| Article metrics only (claps, visibility) | GraphQL post(id:) | ~275ms |
| Author profile + follower count | GraphQL user(username:) | ~220ms |
| Recent posts for a user (up to 10) | ?format=json on profile URL | ~240ms |
| Recent posts for a publication | ?format=json on publication URL | ~300ms |
| Recent post list (latest 10, no pagination) | RSS feed | ~260ms |
| Full article body as HTML | RSS content:encoded field | ~260ms |
| Publication subscriber count | ?format=json on publication URL | ~300ms |
Never use a browser for read-only Medium tasks. All article content, metadata, and metrics are available over HTTP. Browser is only needed for authenticated actions (clapping, posting, account management).
The XSSI prefix
Every ?format=json response starts with the anti-hijacking prefix ])}while(1);</x> before the JSON. Strip it before parsing. The helper below handles this.
import urllib.request, gzip, json, re

def medium_json(url):
    """Fetch any Medium URL with ?format=json and return parsed dict.
    Strips the XSSI prefix ])}while(1);</x> automatically.
    Works on: article URLs, user profile URLs, publication URLs.
    Does NOT work on: search pages, /latest, profile stream API.
    """
    sep = '&' if '?' in url else '?'
    req = urllib.request.Request(
        url + sep + 'format=json',
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Accept": "application/json, */*",
            "Accept-Encoding": "gzip",
        }
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        raw = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    text = raw.decode()
    # Strip everything before the first {
    return json.loads(re.sub(r'^[^\{]+', '', text))
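The stripping step can be checked offline, without touching the network. The sketch below is a variant rather than the helper's own method: it removes only the exact known prefix instead of everything before the first `{`. The function name and the sample body are hypothetical.

```python
import json

XSSI_PREFIX = "])}while(1);</x>"


def parse_medium_json(text: str) -> dict:
    """Parse a ?format=json response body, tolerating the XSSI prefix."""
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


# Hypothetical, heavily shortened response body
sample = '])}while(1);</x>{"payload": {"value": {"title": "Software 2.0"}}}'
print(parse_medium_json(sample)["payload"]["value"]["title"])
```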
Path 1: ?format=json — article metadata + body (fastest for articles)
Append `?format=json` to any article URL. The response contains full metadata, `virtuals` (metrics), and the article body in a structured `bodyModel`. In testing, no auth was required for either public or subscriber-locked articles: metadata and the full body came back in both cases, whereas a browser without member access shows paywalled bodies truncated. If the endpoint is blocked or returns a locked or truncated body, use the DOM route in article-hydration.md.
data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
payload = data['payload']
val = payload['value'] # article fields
refs = payload['references'] # User, Social, SocialStats dicts keyed by ID
# --- Article fields ---
title = val['title'] # "Software 2.0"
article_id = val['id'] # "a64152b37c35"
creator_id = val['creatorId'] # "ac9d9a35533e"
slug = val['uniqueSlug'] # "software-2-0-a64152b37c35"
url = val['canonicalUrl'] # "https://medium.com/@karpathy/..."
first_pub = val['firstPublishedAt'] # unix ms: 1510438733751
last_pub = val['latestPublishedAt'] # unix ms: 1615659523264
visibility = val['visibility'] # 0=public, 2=subscriber-locked
is_locked = val['isSubscriptionLocked'] # True if paywalled
locked_src = val['lockedPostSource'] # 0=free, 1=Medium Partner Program
# --- Metrics (in val['virtuals']) ---
virtuals = val['virtuals']
clap_count = virtuals['totalClapCount'] # 60865 (all claps, including multi-clap)
recommends = virtuals['recommends'] # 8846 (unique clappers)
read_time = virtuals['readingTime'] # 8.79811320754717 (minutes, float)
word_count = virtuals['wordCount'] # 2146
# --- Tags ---
tags = [t['slug'] for t in virtuals['tags']]
# ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']
# --- Author (from references) ---
user = refs['User'][creator_id]
author_name = user['name'] # "Andrej Karpathy"
author_handle = user['username'] # "karpathy"
author_bio = user['bio'] # "I like to train deep neural nets on large datasets."
author_twitter = user['twitterScreenName'] # "karpathy"
# --- Follower count (from SocialStats) ---
ss = refs['SocialStats'][creator_id]
follower_count = ss['usersFollowedByCount'] # 60027
following_count = ss['usersFollowedCount'] # 183
Detect paywall
# Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
# Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
is_paywalled = val['isSubscriptionLocked'] # True / False
Confirmed on real TDS articles: paywalled articles return isSubscriptionLocked=True, visibility=2, lockedPostSource=1. Free articles: all three are False/0.
Article body
The full body is in val['content']['bodyModel']['paragraphs'] — a list of dicts:
paragraphs = val['content']['bodyModel']['paragraphs']
# Paragraph types (confirmed for this article):
# type=1 -> body text (P)
# type=3 -> heading (H1/H2)
# type=4 -> image (text is empty; metadata has image ID)
# Reconstruct plain text:
text_paras = [p['text'] for p in paragraphs if p.get('text')]
full_text = '\n\n'.join(text_paras)
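The paragraph types can also drive a rough markdown reconstruction instead of plain text. This is a sketch under stated assumptions: only types 1, 3 and 4 are confirmed above, the source does not say how H1 and H2 are told apart (so every heading becomes `##` here), and the `id` key inside `metadata` is an assumed name for the image ID.

```python
def paragraphs_to_markdown(paragraphs):
    """Render bodyModel paragraphs as markdown using the confirmed type codes."""
    out = []
    for p in paragraphs:
        kind = p.get("type")
        text = (p.get("text") or "").strip()
        if kind == 3 and text:
            out.append("## " + text)  # heading level is an assumption
        elif kind == 4:
            # Assumed key name for the image ID inside metadata
            image_id = (p.get("metadata") or {}).get("id")
            if image_id:
                out.append(f"[image: {image_id}]")
        elif text:
            out.append(text)  # type 1 and any unconfirmed type fall back to plain text
    return "\n\n".join(out)
```

Unconfirmed type codes fall through to plain text, so nothing is silently dropped while the mapping is incomplete.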
Path 2: GraphQL API — targeted metric lookups
POST https://medium.com/_/graphql with a JSON body. No auth, no CSRF token required.
Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.
import json, urllib.request, gzip

def gql(query):
    body = json.dumps({"query": query}).encode()
    req = urllib.request.Request(
        "https://medium.com/_/graphql",
        data=body,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Accept-Encoding": "gzip",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        raw = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
    return json.loads(raw.decode())
Fetch article metrics (fastest)
result = gql("""
{
  post(id: "a64152b37c35") {
    title
    id
    firstPublishedAt
    latestPublishedAt
    visibility
    uniqueSlug
    canonicalUrl
    mediumUrl
    isLocked
    clapCount
    readingTime
    wordCount
  }
}
""")
post = result['data']['post']
# post['visibility'] -> "PUBLIC" | "LOCKED" (string, not numeric)
# post['isLocked'] -> False | True
# post['clapCount'] -> 60865 (same as totalClapCount in format=json)
# post['readingTime'] -> 8.79811320754717 (minutes)
# post['wordCount'] -> 2146
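Because GraphQL reports visibility as a string while `?format=json` uses integers, a small normalizer keeps downstream code from comparing the wrong type. The sketch covers only the four values confirmed in these notes and raises on anything else rather than guessing; `is_locked` is a hypothetical name.

```python
# Only values confirmed in these notes: 0/2 from ?format=json, strings from GraphQL
_KNOWN_VISIBILITY = {0: False, 2: True, "PUBLIC": False, "LOCKED": True}


def is_locked(visibility) -> bool:
    """Map either API's visibility value to a boolean lock status."""
    try:
        return _KNOWN_VISIBILITY[visibility]
    except KeyError:
        raise ValueError(f"unrecognised visibility value: {visibility!r}") from None


print(is_locked("PUBLIC"), is_locked(2))  # False True
```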
Confirmed working post() fields: title, id, createdAt, updatedAt, firstPublishedAt, latestPublishedAt, visibility, uniqueSlug, canonicalUrl, mediumUrl, isLocked, clapCount, readingTime, wordCount
Nested objects that work: topics { name slug }, creator { name username }, collection { name id slug description domain creator { name username } }
Fields that return HTTP 400 (not available): tags, author, recommends, content, publication, responses, sequence
Fetch author profile
result = gql("""
{
  user(username: "karpathy") {
    name
    username
    id
    bio
    imageId
    twitterScreenName
    mediumMemberAt
    socialStats {
      followerCount
      followingCount
    }
  }
}
""")
user = result['data']['user']
# user['name'] -> "Andrej Karpathy"
# user['id'] -> "ac9d9a35533e"
# user['bio'] -> "I like to train deep neural nets on large datasets."
# user['twitterScreenName'] -> "karpathy"
# user['socialStats']['followerCount'] -> 60028
# user['mediumMemberAt'] -> 0 (not a member); nonzero = unix ms join date
Confirmed working user() fields: name, username, id, bio, imageId, twitterScreenName, mediumMemberAt, socialStats { followerCount followingCount }
Fields that return HTTP 400: followerCount (top-level), followingCount (top-level), postCount
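Since an unknown field fails the whole request with HTTP 400, it can help to check field names against the confirmed lists before sending anything. A minimal sketch: `build_query` is a hypothetical helper, it handles flat scalar fields only (nested selections such as `socialStats { ... }` still need a hand-written query), and the allow-lists simply copy the confirmed fields listed above.

```python
import json

CONFIRMED_FIELDS = {
    "post": {"title", "id", "createdAt", "updatedAt", "firstPublishedAt",
             "latestPublishedAt", "visibility", "uniqueSlug", "canonicalUrl",
             "mediumUrl", "isLocked", "clapCount", "readingTime", "wordCount"},
    "user": {"name", "username", "id", "bio", "imageId",
             "twitterScreenName", "mediumMemberAt"},
}
ARG_NAME = {"post": "id", "user": "username"}


def build_query(root, key, fields):
    """Build a flat post()/user() query, rejecting fields not confirmed to work."""
    unknown = [f for f in fields if f not in CONFIRMED_FIELDS[root]]
    if unknown:
        raise ValueError(f"unconfirmed {root}() fields: {unknown}")
    # json.dumps yields a double-quoted, escaped string literal
    return "{ %s(%s: %s) { %s } }" % (root, ARG_NAME[root], json.dumps(key), " ".join(fields))


print(build_query("post", "a64152b37c35", ["title", "clapCount"]))
```

The resulting string can be passed straight to `gql()`.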
Fetch collection (publication) by ID
The GraphQL collection() query only accepts id, not slug. Get the ID from ?format=json on the publication page.
# TDS Archive id: 7f60cf5620c9 (from medium.com/towards-data-science?format=json)
result = gql("""
{
  collection(id: "7f60cf5620c9") {
    name
    id
    slug
    description
    domain
    creator { name username }
  }
}
""")
coll = result['data']['collection']
# coll['name'] -> "TDS Archive"
# coll['slug'] -> "data-science"
Path 3: RSS feeds (best for recent posts list + article bodies)
Works with plain http_get. Returns up to 10 most recent posts. Full article HTML is in content:encoded. No clap count or visibility info in RSS.
import re
from helpers import http_get

def parse_rss_items(rss_xml):
    """Extract items from Medium RSS feed. Returns list of dicts."""
    def cdata(tag, text):
        m = re.search(rf'<{tag}[^>]*><!\[CDATA\[(.*?)\]\]></{tag}>', text, re.DOTALL)
        return m.group(1).strip() if m else None
    items = []
    for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
        # link is plain text (not CDATA)
        link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
        items.append({
            'title': cdata('title', raw),
            'link': link_m.group(1).strip() if link_m else None,
            'pubDate': cdata('pubDate', raw),
            'creator': cdata('dc:creator', raw),
            'tags': re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
            'body_html': cdata('content:encoded', raw),  # full article HTML
        })
    return items
# User feed (up to 10 latest posts)
rss = http_get("https://medium.com/feed/@karpathy")
posts = parse_rss_items(rss)
# posts[0]['title'] -> "Software 2.0"
# posts[0]['pubDate'] -> "Sat, 11 Nov 2017 22:18:53 GMT"
# posts[0]['creator'] -> "Andrej Karpathy"
# posts[0]['tags'] -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
# posts[0]['link'] -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
# posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)
# Publication feed (up to 10 latest posts)
rss_pub = http_get("https://medium.com/feed/towards-data-science")
pub_posts = parse_rss_items(rss_pub)
RSS limitations:
- RSS does not include clap count, view count, or paywall status.
- `body_html` contains the full article body as HTML, including `<p>`, `<strong>`, `<a>`, `<img>` tags.
- Pagination is not supported: RSS always returns the 10 most recent posts.
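When plain text is wanted from `body_html`, the standard-library HTML parser decodes entities and keeps block boundaries as line breaks, which a bare tag-stripping regex does not. A minimal sketch; the choice of which tags start a new line is an assumption about what reads well, not something Medium specifies.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    # Tags assumed to start a new line in the plain-text output
    BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "li", "pre", "blockquote", "figure", "br"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(body_html: str) -> str:
    """Convert an RSS content:encoded HTML string to newline-separated plain text."""
    parser = _TextExtractor()
    parser.feed(body_html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<h3>Title</h3><p>Tom &amp; Jerry <strong>rock</strong>.</p>"))
```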
Path 4: ?format=json on user profile — recent posts with metrics
Better than RSS when you need clap counts alongside post list. Returns up to limit posts (default 10) plus full author metadata.
data = medium_json("https://medium.com/@karpathy?limit=10")
payload = data['payload']
user = payload['user']
# user['name'] -> "Andrej Karpathy"
# user['username'] -> "karpathy"
# user['bio'] -> "I like to train deep neural nets on large datasets."
refs = payload['references']
ss = refs['SocialStats'][user['userId']]
# ss['usersFollowedByCount'] -> 60028 (followers)
# ss['usersFollowedCount'] -> 183 (following)
posts = refs.get('Post', {}) # dict keyed by post ID
for pid, p in posts.items():
    v = p['virtuals']
    print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))
# Paginate: use paging['next'] from payload
paging = payload['paging']
next_params = paging['next']
# next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
# Append as query params to the same profile URL to get next page
next_url = (
f"https://medium.com/@{user['username']}"
f"?limit={next_params['limit']}&to={next_params['to']}"
f"&source={next_params['source']}&page={next_params['page']}"
)
data2 = medium_json(next_url)
# Note: karpathy has only 8 total posts — pagination returns same refs on page 2
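The next-page URL can also be built with `urllib.parse.urlencode`, which handles escaping and works for whatever keys `paging['next']` happens to contain. In this sketch `next_page_url` is a hypothetical helper; it drops non-scalar values such as `ignoredIds`, mirroring the hand-built URL above.

```python
from urllib.parse import urlencode


def next_page_url(base_url: str, next_params: dict) -> str:
    """Append the scalar entries of paging['next'] to base_url as query params."""
    # Lists such as ignoredIds are left out, as in the manual URL construction.
    query = {k: v for k, v in next_params.items() if isinstance(v, (str, int))}
    return f"{base_url}?{urlencode(query)}"


next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
print(next_page_url("https://medium.com/@karpathy", next_params))
```

The result can be passed to `medium_json()`, which appends `&format=json` because the URL already contains a `?`.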
Path 5: ?format=json on publication page
Returns publication metadata and recent posts with metrics.
data = medium_json("https://medium.com/towards-data-science")
payload = data['payload']
coll = payload['collection']
# coll['name'] -> "TDS Archive"
# coll['slug'] -> "data-science"
# coll['description'] -> full description string
# coll['subscriberCount'] -> 828527
# coll['metadata']['followerCount'] -> 828527
# coll['tags'] -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]
posts = payload['references'].get('Post', {})
for pid, p in posts.items():
    v = p['virtuals']
    print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
# Also includes: p['visibility'] (0=free, 2=paywalled)
# Paginate (same pattern as user profile)
paging = payload['paging']
# paging['next'] = {'to': '1738573325936', 'page': 3}
Retrieving the article ID from a URL
The id is the last 12 hex chars of a Medium article URL slug:
import re
url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
# Drop the query string first, then any trailing slash, before matching
match = re.search(r'-([a-f0-9]{12})$', url.split('?')[0].rstrip('/'))
article_id = match.group(1) if match else None  # "a64152b37c35"
This ID is the same across all URL forms (medium.com/@user/slug, user.medium.com/slug, medium.com/publication/slug).
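A function form of the same extraction, checked against the three URL shapes, might look like the sketch below. It parses the URL with `urllib.parse.urlparse` so that query strings and fragments are ignored; `extract_article_id` is a hypothetical name and the publication URL in the example is made up for illustration.

```python
import re
from urllib.parse import urlparse

_ID_RE = re.compile(r"-([a-f0-9]{12})$")


def extract_article_id(url: str):
    """Return the 12-hex-char article ID from a Medium article URL, or None."""
    path = urlparse(url).path.rstrip("/")  # query string and fragment are dropped
    m = _ID_RE.search(path)
    return m.group(1) if m else None


for u in (
    "https://medium.com/@karpathy/software-2-0-a64152b37c35",
    "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-abc",
    "https://medium.com/some-publication/software-2-0-a64152b37c35/",  # hypothetical URL
):
    print(extract_article_id(u))
```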
Gotchas
- HTTP 403 on plain `http_get`. The default `http_get` helper sends `User-Agent: Mozilla/5.0`, which Medium accepts for most endpoints, but article HTML pages (without `?format=json`) return 403. Always use `?format=json` for article and profile pages.
- `?format=json` works; profile stream API does not. `https://medium.com/_/api/users/{id}/profile/stream` returns HTTP 403 for unauthenticated requests. Use `?format=json` on the profile URL instead.
- `?format=json` on search pages returns 403 or broken JSON. `medium.com/search?q=...&format=json` and `medium.com/search/posts?q=...&format=json` both fail. Search is not available without auth.
- GraphQL `collection()` requires ID, not slug. `collection(slug: "towards-data-science")` returns HTTP 400. You must use the hex ID (e.g. `"7f60cf5620c9"`). Get it from `?format=json` on the publication page: `payload['collection']['id']`.
- GraphQL `tags` field on `post()` returns HTTP 400. Use `topics { name slug }` instead. Topics are a subset of tags but work without auth.
- GraphQL visibility is a string, not a number. `post().visibility` returns `"PUBLIC"` or `"LOCKED"` (string). The `?format=json` `value.visibility` field uses integers: `0`=public, `2`=locked. Both agree on the lock status.
- `totalClapCount` vs `recommends`. `totalClapCount` (60865) counts all claps (Medium allows up to 50 claps per reader). `recommends` (8846) counts unique clappers. The GraphQL `clapCount` field equals `totalClapCount`, not `recommends`.
- RSS returns at most 10 items, no clap counts. RSS is best for getting recent article links + full HTML body. Use `?format=json` on the profile if you need metrics.
- RSS link contains tracking params. `posts[0]['link']` includes `?source=rss-{userId}------2`. Strip with `.split('?')[0]` if you need a clean URL.
- `content:encoded` in RSS is full HTML, not plaintext. Strip HTML tags if you want plaintext: `re.sub(r'<[^>]+>', '', body_html)`.
- Medium subdomains. Some users have custom subdomains (`karpathy.medium.com`). Both `medium.com/@karpathy/...` and `karpathy.medium.com/...` resolve to the same article; `?format=json` works on both.
- towardsdatascience.com is no longer Medium. TDS moved to its own WordPress site. `towardsdatascience.com/article-slug?format=json` returns full WordPress HTML, not Medium JSON. Use `medium.com/towards-data-science` for the archived Medium publication.
- No public search API. Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
- Timestamps are unix milliseconds. `firstPublishedAt`, `createdAt`, `latestPublishedAt` are all in milliseconds. Convert: `datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc)`.
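The millisecond conversion can be done with integer arithmetic so no precision is lost to float division. A minimal sketch using the `firstPublishedAt` value quoted earlier; `ms_to_utc` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms: int) -> datetime:
    """Convert a Medium unix-millisecond timestamp to an aware UTC datetime."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(ms_to_utc(1510438733751).isoformat())  # 2017-11-11T22:18:53.751000+00:00
```

The result lines up with the RSS `pubDate` shown for the same article (11 Nov 2017 22:18:53 GMT).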