Domain skill

medium

Markdown synced from browser-harness domain skills.

Host: medium
Files: 2

Agent prompt

Use this skill

Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.

Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are.

Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `medium` domain skill from `agent-workspace/domain-skills/medium/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/medium/article-hydration.md
- agent-workspace/domain-skills/medium/scraping.md

Use those domain-skill notes to complete my task for `medium` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.

Skill contents

What the agent will read

Article Body via DOM

article-hydration.md

Extract a Medium article's body as clean markdown using the logged-in browser. Use this when API paths in scraping.md are blocked or truncated:

Cloudflare challenge on the ?format=json endpoint ("Performing security verification")
Member-only post that the API returns locked (isSubscriptionLocked=True) but the logged-in browser can render in full
JS-only variant where the article is gated behind a client-side paywall modal

If the article is free and the API works, prefer scraping.md — it's faster and doesn't need a visible tab.

URL patterns

Canonical: https://medium.com/@<author>/<slug>-<id>
Publication: https://<pub>.medium.com/<slug>-<id> or https://medium.com/<pub>/<slug>-<id>
Custom domain: some publications (e.g. towardsdatascience.com) proxy Medium; the same DOM extractor works there.

All variants render the article body inside a single <article> element.

Site structure

The article body lives under the page's single <article> element.
Block-level content: h1–h4, p, pre, blockquote, ul, ol, figure.
Images are wrapped in <figure>, usually with a <figcaption> inside it next to the <img>; the full-resolution file is served from miro.medium.com/v2/resize:fit:<N>/....
Code blocks are <pre> — no language class is exposed in the DOM, so emit plain fenced blocks.
Pull quotes render as <blockquote> with nested <p>.

Cruft to strip

Medium injects engagement UI inside <article>. The text "6 2 Listen Share More" at the top is the clap/comment/listen/share button row, not content. Also expect a follow button near the author's name and sometimes a "Help" / "Status" footer.

Safe pattern: take the extracted markdown, then drop leading paragraphs that are shorter than ~12 characters until you hit the first real block (the "Last updated" line, the H1, or the first long paragraph).

Extractor

bash

browser-harness <<'PY'
new_tab("https://medium.com/@user/slug-abc123")
wait_for_load()
wait(2.0)  # Medium hydrates more UI after readyState=complete

md = js(r"""
(()=>{
  const article = document.querySelector('article');
  if(!article) return null;
  const blocks = article.querySelectorAll('h1, h2, h3, h4, p, pre, blockquote, ul, ol, figure');
  const out = [];
  const seen = new Set();
  for(const el of blocks){
    let skip = false;
    for(const s of seen){ if(s.contains(el) && s !== el){ skip=true; break; } }
    if(skip) continue;
    seen.add(el);
    const tag = el.tagName;
    const txt = (el.innerText || '').trim();
    if(!txt && tag !== 'FIGURE') continue;
    if(tag === 'H1') out.push('# ' + txt);
    else if(tag === 'H2') out.push('## ' + txt);
    else if(tag === 'H3') out.push('### ' + txt);
    else if(tag === 'H4') out.push('#### ' + txt);
    else if(tag === 'PRE') out.push('```\n' + txt + '\n```');
    else if(tag === 'BLOCKQUOTE') out.push(txt.split('\n').map(l=>'> '+l).join('\n'));
    else if(tag === 'UL' || tag === 'OL'){
      const items = [...el.querySelectorAll(':scope > li')].map((li,i)=>{
        const t = li.innerText.trim();
        return (tag==='OL' ? (i+1)+'. ' : '- ') + t;
      });
      out.push(items.join('\n'));
    }
    else if(tag === 'FIGURE'){
      const img = el.querySelector('img');
      const cap = el.querySelector('figcaption');
      if(img && img.src){
        const alt = img.alt || (cap ? cap.innerText.trim() : '');
        out.push('![' + alt + '](' + img.src + ')');
      }
    }
    else if(tag === 'P') out.push(txt);
  }
  return out.join('\n\n');
})()
""")

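# Strip engagement-button cruft from the top.
# js() returns None when no <article> was found, so guard before splitting.
paras = (md or '').split('\n\n')
# Stop at the first heading so a short H1 is not mistaken for cruft.
while paras and len(paras[0]) < 12 and not paras[0].startswith('#'):
    paras.pop(0)
md = '\n\n'.join(paras)
print(md)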
PY

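The seen set avoids double-emitting nested matches: when a matched block sits inside another matched block (the <p> inside a <blockquote>, or a nested <ul> inside a list), only the outer element is emitted.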

Waits

wait_for_load() is necessary but not sufficient — Medium continues to hydrate author-card and clap widgets after readyState=complete. An additional wait(2.0) avoids cases where the article outer frame exists but the first few paragraphs are still skeleton <div>s.
For member-only articles, if <article> renders but its text is suspiciously short (<500 chars), the paywall modal has probably intercepted the render. Confirm the tab is on your logged-in profile and retry.
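
The fixed wait(2.0) can be swapped for a short poll. The sketch below is one way to do that, assuming js() returns the evaluated value and wait() sleeps for the given number of seconds, as they are used in the extractor. The name wait_for_article and its parameters are hypothetical, and the two harness functions are passed in as arguments so the helper makes no assumption about how they are imported.

```python
def wait_for_article(js, wait, min_chars=500, timeout=10.0, step=0.5):
    """Poll until <article> holds at least min_chars of text or timeout passes.

    js and wait are the harness helpers used in the extractor
    (assumed: js(code) returns the evaluated value, wait(s) sleeps s seconds).
    Returns the last observed text length.
    """
    probe = "(()=>{const a=document.querySelector('article');return a?a.innerText.length:0;})()"
    waited = 0.0
    length = js(probe) or 0
    while length < min_chars and waited < timeout:
        wait(step)
        waited += step
        length = js(probe) or 0
    return length
```

A return value still under 500 after the timeout is the suspiciously-short case described above.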

Paywall / login detection

python

state = js("""
(()=>{
  const art = document.querySelector('article');
  const len = art ? art.innerText.length : 0;
  const hasPaywall = !!document.querySelector('[data-testid*="paywall"], [aria-label*="Sign in" i]');
  return {len, hasPaywall};
})()
""")

If hasPaywall is true or len < 500, fall back to scraping.md API paths (the article may simply be locked for this account).
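
The decision rule above can be written as a small predicate. This is a minimal sketch: needs_api_fallback is a hypothetical name, and it assumes state is the {len, hasPaywall} dict returned by the js() probe.

```python
def needs_api_fallback(state, min_chars=500):
    """Return True when the DOM render looks gated or truncated.

    state is the {len, hasPaywall} dict from the probe; a missing
    result is treated as "not rendered".
    """
    if not state:
        return True
    return bool(state.get("hasPaywall")) or state.get("len", 0) < min_chars
```

For example, needs_api_fallback({"len": 8000, "hasPaywall": False}) is False, so the DOM extractor's output can be used as-is.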

Traps

Don't use article.innerText alone. It drops structure — code blocks lose their fences, lists lose their markers, figures disappear. The block walker above preserves each element kind.
Don't rely on CSS class names. Most of Medium's class names are build-generated and rotate, and even readable ones (pw-post-body-paragraph, etc.) are not guaranteed to stay; select by tag instead.
<figure> caption text is often also repeated as <img alt>. Prefer alt, fall back to figcaption, so you don't emit both.
The "About the Author" card sometimes falls inside <article> and sometimes outside it, so the walker may capture it. That is fine for archival. If you need body-only, cut at the last h2/h3 before a <hr>-equivalent divider, or trim by known footer strings (Follow, More from, Written by).
Tab marker. new_tab() prepends 🟢 to the title. Don't include document.title in the emitted markdown — use the article's <h1> instead.
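
Trimming by footer strings, as suggested in the traps above, might look like the sketch below. It works on the markdown the extractor returns; trim_footer and min_body_blocks are hypothetical names, and the prefix matching is an assumption about how those strings appear in the output, so check it against a few real articles before relying on it.

```python
def trim_footer(md, min_body_blocks=3):
    """Cut extracted markdown at the first block that looks like footer UI.

    The first min_body_blocks blocks are never treated as footer, so a
    "Follow" button near the byline does not truncate the whole article.
    """
    blocks = md.split("\n\n")
    for i, block in enumerate(blocks):
        # Ignore heading markers so "## Written by X" also matches.
        text = block.strip().lstrip("# ").strip()
        is_footer = text == "Follow" or text.startswith(("Written by", "More from"))
        if i >= min_body_blocks and is_footer:
            return "\n\n".join(blocks[:i])
    return md
```

"Follow" is matched exactly rather than as a prefix, because body paragraphs can legitimately begin with that word.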

Data Extraction

scraping.md

https://medium.com — blogging platform. Three access paths tested and validated: the undocumented ?format=json endpoint (fastest for article + publication data), the undocumented GraphQL API (best for targeted metric lookups), and RSS feeds (best for recent posts lists without auth). No browser needed for any read-only task.

Do this first: pick your access path

| Goal | Best approach | Latency |
| --- | --- | --- |
| Article metadata + full body | `?format=json` on article URL | ~400ms |
| Article metrics only (claps, visibility) | GraphQL `post(id:)` | ~275ms |
| Author profile + follower count | GraphQL `user(username:)` | ~220ms |
| Recent posts for a user (10 per page, paginated via `paging['next']`) | `?format=json` on profile URL | ~240ms |
| Recent posts for a publication | `?format=json` on publication URL | ~300ms |
| Recent post list without metrics (latest 10 only, no pagination) | RSS feed | ~260ms |
| Full article body as HTML | RSS `content:encoded` field | ~260ms |
| Publication subscriber count | `?format=json` on publication URL | ~300ms |

Avoid a browser for read-only Medium tasks: article content, metadata, and metrics are all available over HTTP. A browser is needed for authenticated actions (clapping, posting, account management), and as a fallback when these HTTP paths are blocked or return a locked body (see article-hydration.md).

The XSSI prefix

Every ?format=json response starts with the anti-hijacking prefix ])}while(1);</x> before the JSON. Strip it before parsing. The helper below handles this.

python

import urllib.request, gzip, json, re

def medium_json(url):
    """Fetch any Medium URL with ?format=json and return parsed dict.
    Strips the XSSI prefix ])}while(1);</x> automatically.
    Works on: article URLs, user profile URLs, publication URLs.
    Does NOT work on: search pages, /latest, profile stream API.
    """
    sep = '&' if '?' in url else '?'
    req = urllib.request.Request(
        url + sep + 'format=json',
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Accept": "application/json, */*",
            "Accept-Encoding": "gzip",
        }
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        raw = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
        text = raw.decode()
    # Strip everything before the first {
    return json.loads(re.sub(r'^[^\{]+', '', text))

Path 1: `?format=json` — article metadata + body (fastest for articles)

Append ?format=json to any article URL. The response contains full metadata, virtuals (metrics), and the article body in a structured bodyModel. No auth is required: in testing, public and subscriber-locked articles alike returned metadata and the full body, whereas a logged-out browser shows paywalled bodies truncated. If the endpoint is blocked by a Cloudflare challenge or the body comes back locked, use article-hydration.md instead.

python

data = medium_json("https://medium.com/@karpathy/software-2-0-a64152b37c35")
payload = data['payload']
val     = payload['value']        # article fields
refs    = payload['references']   # User, Social, SocialStats dicts keyed by ID

# --- Article fields ---
title       = val['title']                              # "Software 2.0"
article_id  = val['id']                                 # "a64152b37c35"
creator_id  = val['creatorId']                          # "ac9d9a35533e"
slug        = val['uniqueSlug']                         # "software-2-0-a64152b37c35"
url         = val['canonicalUrl']                       # "https://medium.com/@karpathy/..."
first_pub   = val['firstPublishedAt']                   # unix ms: 1510438733751
last_pub    = val['latestPublishedAt']                  # unix ms: 1615659523264
visibility  = val['visibility']                         # 0=public, 2=subscriber-locked
is_locked   = val['isSubscriptionLocked']               # True if paywalled
locked_src  = val['lockedPostSource']                   # 0=free, 1=Medium Partner Program

# --- Metrics (in val['virtuals']) ---
virtuals    = val['virtuals']
clap_count  = virtuals['totalClapCount']                # 60865 (all claps, including multi-clap)
recommends  = virtuals['recommends']                    # 8846 (unique clappers)
read_time   = virtuals['readingTime']                   # 8.79811320754717 (minutes, float)
word_count  = virtuals['wordCount']                     # 2146

# --- Tags ---
tags = [t['slug'] for t in virtuals['tags']]
# ['machine-learning', 'artificial-intelligence', 'programming', 'software-development', 'future']

# --- Author (from references) ---
user = refs['User'][creator_id]
author_name = user['name']          # "Andrej Karpathy"
author_handle = user['username']    # "karpathy"
author_bio  = user['bio']           # "I like to train deep neural nets on large datasets."
author_twitter = user['twitterScreenName']  # "karpathy"

# --- Follower count (from SocialStats) ---
ss = refs['SocialStats'][creator_id]
follower_count  = ss['usersFollowedByCount']   # 60027
following_count = ss['usersFollowedCount']     # 183

Detect paywall

python

# Paywalled (Medium Partner Program): isSubscriptionLocked=True, visibility=2, lockedPostSource=1
# Free: isSubscriptionLocked=False, visibility=0, lockedPostSource=0
is_paywalled = val['isSubscriptionLocked']   # True / False

Confirmed on real TDS articles: paywalled articles return isSubscriptionLocked=True, visibility=2, lockedPostSource=1. Free articles: all three are False/0.

Article body

The full body is in val['content']['bodyModel']['paragraphs'] — a list of dicts:

python

paragraphs = val['content']['bodyModel']['paragraphs']

# Paragraph types (confirmed for this article):
# type=1  -> body text (P)
# type=3  -> heading (H1/H2)
# type=4  -> image (text is empty; metadata has image ID)

# Reconstruct plain text:
text_paras = [p['text'] for p in paragraphs if p.get('text')]
full_text   = '\n\n'.join(text_paras)
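
To keep headings and image positions instead of flattening everything to text, the paragraph types can be mapped to markdown. The sketch below covers only the three types confirmed above. The function name is hypothetical, the metadata['id'] key for the image ID is an assumption the notes do not confirm, and the "## " heading level is a guess because type 3 does not distinguish H1 from H2.

```python
def paragraphs_to_markdown(paragraphs):
    """Render bodyModel paragraphs as markdown using the three confirmed types.

    type 1 -> paragraph, type 3 -> heading, type 4 -> image.
    Any other type falls back to its plain text, if it has any.
    """
    out = []
    for p in paragraphs:
        kind = p.get("type")
        text = (p.get("text") or "").strip()
        if kind == 3 and text:
            out.append("## " + text)
        elif kind == 4:
            # Assumption: the image ID sits under metadata['id'].
            image_id = (p.get("metadata") or {}).get("id")
            if image_id:
                out.append("[image: " + image_id + "]")
        elif text:
            out.append(text)
    return "\n\n".join(out)
```

Images are emitted as a placeholder carrying the ID rather than a guessed URL.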

Path 2: GraphQL API — targeted metric lookups

POST https://medium.com/_/graphql with a JSON body. No auth, no CSRF token required. Returns HTTP 200 with JSON even for unauthenticated queries. Invalid fields return HTTP 400 — do not assume a field exists without testing first.

python

import json, urllib.request, gzip

def gql(query):
    body = json.dumps({"query": query}).encode()
    req  = urllib.request.Request(
        "https://medium.com/_/graphql",
        data=body,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Accept-Encoding": "gzip",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=20) as r:
        raw = r.read()
        if r.headers.get("Content-Encoding") == "gzip":
            raw = gzip.decompress(raw)
        return json.loads(raw.decode())

Fetch article metrics (fastest)

python

result = gql("""
{
  post(id: "a64152b37c35") {
    title
    id
    firstPublishedAt
    latestPublishedAt
    visibility
    uniqueSlug
    canonicalUrl
    mediumUrl
    isLocked
    clapCount
    readingTime
    wordCount
  }
}
""")
post = result['data']['post']
# post['visibility']  -> "PUBLIC" | "LOCKED"  (string, not numeric)
# post['isLocked']    -> False | True
# post['clapCount']   -> 60865  (same as totalClapCount in format=json)
# post['readingTime'] -> 8.79811320754717  (minutes)
# post['wordCount']   -> 2146

Confirmed working post() fields: title, id, createdAt, updatedAt, firstPublishedAt, latestPublishedAt, visibility, uniqueSlug, canonicalUrl, mediumUrl, isLocked, clapCount, readingTime, wordCount

Nested objects that work: topics { name slug }, creator { name username }, collection { name id slug description domain creator { name username } }

Fields that return HTTP 400 (not available): tags, author, recommends, content, publication, responses, sequence
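
Because urllib raises urllib.error.HTTPError on a 400 response, one way to test a field before depending on it is to probe fields one at a time. This is a minimal sketch: probe_post_fields is a hypothetical helper, and run_query stands for any callable that behaves like gql() above.

```python
import urllib.error

def probe_post_fields(run_query, post_id, fields):
    """Query each candidate field on its own and sort it by outcome.

    run_query is a callable like gql(). HTTP 400 marks a field as
    rejected; any other HTTP error is re-raised.
    """
    accepted, rejected = [], []
    for field in fields:
        query = '{ post(id: "%s") { %s } }' % (post_id, field)
        try:
            run_query(query)
            accepted.append(field)
        except urllib.error.HTTPError as err:
            if err.code != 400:
                raise
            rejected.append(field)
    return accepted, rejected
```

Usage would be along the lines of probe_post_fields(gql, "a64152b37c35", ["title", "tags"]); each field costs one request, so this is for exploration, not routine fetching.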

Fetch author profile

python

result = gql("""
{
  user(username: "karpathy") {
    name
    username
    id
    bio
    imageId
    twitterScreenName
    mediumMemberAt
    socialStats {
      followerCount
      followingCount
    }
  }
}
""")
user = result['data']['user']
# user['name']                       -> "Andrej Karpathy"
# user['id']                         -> "ac9d9a35533e"
# user['bio']                        -> "I like to train deep neural nets on large datasets."
# user['twitterScreenName']          -> "karpathy"
# user['socialStats']['followerCount'] -> 60028
# user['mediumMemberAt']             -> 0 (not a member); nonzero = unix ms join date

Confirmed working user() fields: name, username, id, bio, imageId, twitterScreenName, mediumMemberAt, socialStats { followerCount followingCount }

Fields that return HTTP 400: followerCount (top-level), followingCount (top-level), postCount

Fetch collection (publication) by ID

The GraphQL collection() query only accepts id, not slug. Get the ID from ?format=json on the publication page.

python

# TDS Archive id: 7f60cf5620c9  (from medium.com/towards-data-science?format=json)
result = gql("""
{
  collection(id: "7f60cf5620c9") {
    name
    id
    slug
    description
    domain
    creator { name username }
  }
}
""")
coll = result['data']['collection']
# coll['name'] -> "TDS Archive"
# coll['slug'] -> "data-science"

Path 3: RSS feeds (best for recent posts list + article bodies)

Works with plain http_get. Returns up to 10 most recent posts. Full article HTML is in content:encoded. No clap count or visibility info in RSS.

python

import re
from helpers import http_get

def parse_rss_items(rss_xml):
    """Extract items from Medium RSS feed. Returns list of dicts."""
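    def cdata(tag, text):
        # Accept both CDATA-wrapped and plain-text bodies; some elements
        # (pubDate, for example) may not be wrapped in CDATA.
        m = re.search(rf'<{tag}[^>]*>(?:<!\[CDATA\[)?(.*?)(?:\]\]>)?</{tag}>', text, re.DOTALL)
        return m.group(1).strip() if m else None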

    items = []
    for raw in re.findall(r'<item>(.*?)</item>', rss_xml, re.DOTALL):
        # link is plain text (not CDATA)
        link_m = re.search(r'<link>(.*?)</link>', raw, re.DOTALL)
        items.append({
            'title':    cdata('title', raw),
            'link':     link_m.group(1).strip() if link_m else None,
            'pubDate':  cdata('pubDate', raw),
            'creator':  cdata('dc:creator', raw),
            'tags':     re.findall(r'<category><!\[CDATA\[(.*?)\]\]></category>', raw),
            'body_html': cdata('content:encoded', raw),   # full article HTML
        })
    return items

# User feed (up to 10 latest posts)
rss = http_get("https://medium.com/feed/@karpathy")
posts = parse_rss_items(rss)
# posts[0]['title']    -> "Software 2.0"
# posts[0]['pubDate']  -> "Sat, 11 Nov 2017 22:18:53 GMT"
# posts[0]['creator']  -> "Andrej Karpathy"
# posts[0]['tags']     -> ['programming', 'software-development', 'artificial-intelligence', 'future', 'machine-learning']
# posts[0]['link']     -> "https://karpathy.medium.com/software-2-0-a64152b37c35?source=rss-..."
# posts[0]['body_html'] -> full article body as HTML string (~15KB for this article)

# Publication feed (up to 10 latest posts)
rss_pub = http_get("https://medium.com/feed/towards-data-science")
pub_posts = parse_rss_items(rss_pub)

RSS limitations:

RSS does not include clap count, view count, or paywall status.
body_html contains the full article body as HTML, including <p>, <strong>, <a>, <img> tags.
Pagination is not supported — RSS always returns the 10 most recent posts.

Path 4: `?format=json` on user profile — recent posts with metrics

Better than RSS when you need clap counts alongside post list. Returns up to limit posts (default 10) plus full author metadata.

python

data = medium_json("https://medium.com/@karpathy?limit=10")
payload = data['payload']

user = payload['user']
# user['name']     -> "Andrej Karpathy"
# user['username'] -> "karpathy"
# user['bio']      -> "I like to train deep neural nets on large datasets."

refs = payload['references']
ss   = refs['SocialStats'][user['userId']]
# ss['usersFollowedByCount'] -> 60028 (followers)
# ss['usersFollowedCount']   -> 183   (following)

posts = refs.get('Post', {})  # dict keyed by post ID
for pid, p in posts.items():
    v = p['virtuals']
    print(p['title'], v['totalClapCount'], round(v['readingTime'], 1))

# Paginate: use paging['next'] from payload
paging = payload['paging']
next_params = paging['next']
# next_params = {'limit': 10, 'to': '1495652975362', 'source': 'overview', 'page': 2, 'ignoredIds': []}
# Append as query params to the same profile URL to get next page
next_url = (
    f"https://medium.com/@{user['username']}"
    f"?limit={next_params['limit']}&to={next_params['to']}"
    f"&source={next_params['source']}&page={next_params['page']}"
)
data2 = medium_json(next_url)
# Note: karpathy has only 8 total posts — pagination returns same refs on page 2
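
The hand-built next_url above can be generalized so the same code serves the shorter paging['next'] dict that publication pages return. This is a sketch under assumptions: next_page_url is a hypothetical name, and treating a missing 'next' key as "no more pages" is a guess, not confirmed behavior.

```python
from urllib.parse import urlencode

def next_page_url(base_url, paging):
    """Build the next-page URL from payload['paging'], or None if there is no 'next'.

    Empty values such as ignoredIds=[] are dropped. format=json is not
    added here because medium_json() appends it.
    """
    nxt = (paging or {}).get("next")
    if not nxt:
        return None
    params = {k: v for k, v in nxt.items() if v not in (None, "", [])}
    return base_url + "?" + urlencode(params, doseq=True)
```

A paging loop should also stop when a page yields no new post IDs, since the note above shows page 2 can repeat the same references.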

Path 5: `?format=json` on publication page

Returns publication metadata and recent posts with metrics.

python

data = medium_json("https://medium.com/towards-data-science")
payload = data['payload']

coll = payload['collection']
# coll['name']            -> "TDS Archive"
# coll['slug']            -> "data-science"
# coll['description']     -> full description string
# coll['subscriberCount'] -> 828527
# coll['metadata']['followerCount'] -> 828527
# coll['tags']            -> ['DATA SCIENCE', 'MACHINE LEARNING', ...]

posts = payload['references'].get('Post', {})
for pid, p in posts.items():
    v = p['virtuals']
    print(p['title'], v['totalClapCount'], p['isSubscriptionLocked'])
# Also includes: p['visibility'] (0=free, 2=paywalled)

# Paginate (same pattern as user profile)
paging = payload['paging']
# paging['next'] = {'to': '1738573325936', 'page': 3}

Retrieving the article ID from a URL

The id is the last 12 hex chars of a Medium article URL slug:

python

import re

url = "https://medium.com/@karpathy/software-2-0-a64152b37c35"
article_id = re.search(r'-([a-f0-9]{12})$', url.rstrip('/').split('?')[0])
if article_id:
    article_id = article_id.group(1)   # "a64152b37c35"

This ID is the same across all URL forms (medium.com/@user/slug, user.medium.com/slug, medium.com/publication/slug).
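
The same extraction wrapped as a function, so it can be applied to any of those URL forms and to RSS links that carry a tracking query. This is a minimal sketch; extract_article_id is a hypothetical name, and the publication URL in the usage example is made up for illustration.

```python
import re
from urllib.parse import urlsplit

def extract_article_id(url):
    """Return the trailing 12-hex-char article ID from a Medium URL, or None."""
    # urlsplit drops the query string and fragment before matching.
    path = urlsplit(url).path.rstrip("/")
    match = re.search(r"-([a-f0-9]{12})$", path)
    return match.group(1) if match else None
```

For example, extract_article_id("https://medium.com/some-publication/software-2-0-a64152b37c35") and the karpathy.medium.com form both give "a64152b37c35", while a profile or publication URL gives None.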

Gotchas

HTTP 403 on plain http_get — The default http_get helper sends User-Agent: Mozilla/5.0 which Medium accepts for most endpoints, but article HTML pages (without ?format=json) return 403. Always use ?format=json for article and profile pages.
?format=json works; profile stream API does not — https://medium.com/_/api/users/{id}/profile/stream returns HTTP 403 for unauthenticated requests. Use ?format=json on the profile URL instead.
?format=json on search pages returns 403 or broken JSON — medium.com/search?q=...&format=json and medium.com/search/posts?q=...&format=json both fail. Search is not available without auth.
GraphQL collection() requires ID, not slug: collection(slug: "towards-data-science") returns HTTP 400. You must use the hex ID (e.g. "7f60cf5620c9"). Get it from ?format=json on the publication page: payload['collection']['id'].
GraphQL tags field on post() returns HTTP 400 — Use topics { name slug } instead. Topics are a subset of tags but work without auth.
GraphQL visibility is a string, not a number — post().visibility returns "PUBLIC" or "LOCKED" (string). The ?format=json value.visibility field uses integers: 0=public, 2=locked. Both agree on the lock status.
totalClapCount vs recommends — totalClapCount (60865) counts all claps (Medium allows up to 50 claps per reader). recommends (8846) counts unique clappers. The GraphQL clapCount field equals totalClapCount, not recommends.
RSS returns at most 10 items, no clap counts — RSS is best for getting recent article links + full HTML body. Use ?format=json profile if you need metrics.
RSS link contains tracking params — posts[0]['link'] includes ?source=rss-{userId}------2. Strip with .split('?')[0] if you need a clean URL.
content:encoded in RSS is full HTML, not plaintext — Strip HTML tags if you want plaintext: re.sub(r'<[^>]+>', '', body_html).
Medium subdomains — Some users have custom subdomains (karpathy.medium.com). Both medium.com/@karpathy/... and karpathy.medium.com/... resolve to the same article; ?format=json works on both.
towardsdatascience.com is no longer Medium — TDS moved to its own WordPress site. towardsdatascience.com/article-slug?format=json returns full WordPress HTML, not Medium JSON. Use medium.com/towards-data-science for the archived Medium publication.
No public search API — Medium has no Algolia equivalent. Finding articles by keyword requires either a browser, or fetching a user/publication feed and filtering locally.
Timestamps are unix milliseconds — firstPublishedAt, createdAt, latestPublishedAt are all in milliseconds. Convert: datetime.fromtimestamp(val['firstPublishedAt'] / 1000, tz=timezone.utc).
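
Three of the gotchas above (millisecond timestamps, RSS tracking params, HTML in content:encoded) reduce to small helpers. The names below are hypothetical, and html_to_text is a rough pass that also decodes entities, which the bare re.sub in the note leaves in place.

```python
import html
import re
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a Medium unix-millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def clean_link(link):
    """Drop the ?source=rss-... tracking query from an RSS item link."""
    return link.split("?")[0]

def html_to_text(body_html):
    """Strip tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", body_html)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()
```

Tags are replaced with a space rather than removed so adjacent paragraphs do not run together; the cost is that inline tags in the middle of a word would split it.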
