Domain skill
hackernews
Markdown synced from browser-harness domain skills.
- Host
- hackernews
- Files
- 1
Agent prompt
Use this skill
Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.
Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are. Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness.

Use the `hackernews` domain skill from `agent-workspace/domain-skills/hackernews/`. Read every markdown file for this domain before inventing an approach:

- agent-workspace/domain-skills/hackernews/scraping.md

Use those domain-skill notes to complete my task for `hackernews` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.
Skill contents
What the agent will read
Data Extraction
scraping.md
https://news.ycombinator.com is Y Combinator's link aggregator. Three access paths were tested: `http_get` DOM scraping, the Algolia search API, and the official HN Firebase API. All work without a browser.
Do this first: pick your access path
| Goal | Best approach | Latency |
|---|---|---|
| Current front page (30 stories, real-time) | http_get + regex | ~170ms |
| Historical / keyword search | Algolia search API | ~400ms |
| Full comment tree (nested) | Algolia items API | ~300ms |
| Specific item by ID | Firebase API | ~200ms |
| 500 ranked story IDs | Firebase topstories | ~200ms (+ ~190ms/item after) |
Never use a browser for read-only HN tasks. Everything is accessible over HTTP with no auth, no JS rendering needed.
Path 1: http_get front page (fastest for real-time data)
The front page HTML is ~34KB. Story order matches Firebase /topstories.json exactly — confirmed identical on 2026-04-18.
import re, html as htmllib

page = http_get("https://news.ycombinator.com")

# Extract all 30 story IDs (in rank order)
story_ids = re.findall(r'<tr class="athing submission" id="(\d+)">', page)

# Extract titles + URLs (same order as IDs)
titles_urls = re.findall(
    r'class="titleline"[^>]*><a href="([^"]*)"[^>]*>(.*?)</a>', page
)

# Extract scores keyed by story ID (job posts have no score row)
scores_by_id = {
    m.group(1): int(m.group(2))
    for m in re.finditer(
        r'<span class="score" id="score_(\d+)">(\d+) points</span>', page
    )
}

# Extract authors keyed by story ID (anchor on score span)
authors_by_id = {}
for m in re.finditer(
    r'<span class="score" id="score_(\d+)">\d+ points</span>'
    r'.*?class="hnuser">(.*?)</a>',
    page, re.DOTALL
):
    authors_by_id[m.group(1)] = m.group(2)

# Extract comment counts keyed by story ID
comments_by_id = {
    m.group(1): int(m.group(2))
    for m in re.finditer(
        r'href="item\?id=(\d+)">(\d+) comments</a>', page
    )
}

stories = []
for i, sid in enumerate(story_ids):
    url, raw_title = titles_urls[i] if i < len(titles_urls) else ('', '')
    stories.append({
        'rank': i + 1,
        'id': sid,
        'title': htmllib.unescape(raw_title),  # MUST unescape: titles contain &#x27; etc.
        'url': url,
        'score': scores_by_id.get(sid),         # None for job posts
        'author': authors_by_id.get(sid),
        'comments': comments_by_id.get(sid, 0),
    })
Gotchas:
- Titles contain HTML entities (`&#x27;`, `&amp;`, `&quot;`, `&gt;`). Always call `html.unescape()`.
- `<tr class="athing submission" id="...">`: the class is `athing submission`, not just `athing`. The `athing comtr` class is for comment rows.
- Job/hiring posts (YC ads) appear in the list but have no score or author. `scores_by_id.get(sid)` returns `None` for them, so check before comparing.
- `re.DOTALL` multi-line patterns can cross story boundaries. Use ID-anchored patterns (as above) instead of positional zip for score/author.
- The page only serves page 1 (30 items). Pages 2–4 exist at `?p=2` etc. but require a login cookie for page 3+.
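The `None` scores on job posts matter as soon as stories are sorted or filtered by score. The following is a minimal sketch, assuming a list shaped like the `stories` list built above; `top_by_score` is a hypothetical helper name and the sample values are made up.

```python
def top_by_score(stories, min_score=0):
    """Return scored stories sorted by score, highest first.

    Job posts carry score=None, and comparing None with an int raises
    TypeError in Python 3, so they are dropped before sorting.
    """
    scored = [s for s in stories if s.get('score') is not None]
    kept = [s for s in scored if s['score'] >= min_score]
    return sorted(kept, key=lambda s: s['score'], reverse=True)


# Sample data in the same shape as the `stories` list (values are made up).
sample = [
    {'rank': 1, 'id': '1', 'title': 'A', 'score': 120, 'author': 'x', 'comments': 4},
    {'rank': 2, 'id': '2', 'title': 'Hiring', 'score': None, 'author': None, 'comments': 0},
    {'rank': 3, 'id': '3', 'title': 'B', 'score': 300, 'author': 'y', 'comments': 9},
]
ranked = top_by_score(sample, min_score=100)
```

Filtering before sorting keeps the rank field untouched, so the original front-page position is still available on each returned story.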
Path 2: Algolia search API (best for historical / keyword search)
No rate limiting observed. Returns up to 1000 hits per query (hitsPerPage max is capped at ~1000 per Algolia plan).
import json
# Keyword search — sorted by relevance
data = json.loads(http_get(
"https://hn.algolia.com/api/v1/search"
"?query=llm&tags=story&hitsPerPage=20"
))
# Date-sorted (most recent first)
data = json.loads(http_get(
"https://hn.algolia.com/api/v1/search_by_date"
"?tags=story&hitsPerPage=20"
))
# Paginate: add &page=N (0-indexed), up to data['nbPages']-1
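The pagination rule can be sketched as a small generator. This is an illustrative helper, not part of browser-harness: `iter_hits`, its `fetch` parameter, and the `max_pages` cap are assumptions. `fetch` stands for any callable that takes a URL and returns the response body as a string, such as `http_get`.

```python
import json
from urllib.parse import urlencode

ALGOLIA_SEARCH = "https://hn.algolia.com/api/v1/search"


def iter_hits(fetch, params, endpoint=ALGOLIA_SEARCH, max_pages=5):
    """Yield hits page by page until nbPages or max_pages is reached.

    `params` is a dict of query parameters (query, tags, hitsPerPage, ...).
    Pages are 0-indexed, so the last valid page is nbPages - 1.
    """
    page = 0
    while page < max_pages:
        # urlencode percent-encodes characters such as '>', '(' and ','
        query = urlencode({**params, "page": page})
        data = json.loads(fetch(f"{endpoint}?{query}"))
        yield from data.get("hits", [])
        page += 1
        if page >= data.get("nbPages", 0):
            break
```

A call such as `list(iter_hits(http_get, {"query": "llm", "tags": "story", "hitsPerPage": 20}))` would then collect up to five pages. Any `tags` or `numericFilters` value can be passed in the same dict.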
Fields returned per story hit:
objectID, title, url, author, points, num_comments, created_at (ISO 8601), created_at_i (unix ts), story_id, children (list of comment IDs — flat, not tree), _tags, _highlightResult
Fields returned per comment hit:
objectID, comment_text, author, story_id, story_title, story_url, parent_id, created_at, created_at_i, points
Note: comment hits use comment_text, NOT text. Story hits use story_text for self-post body.
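Because the body lives under a different key for each hit type, a small accessor avoids scattering that distinction through calling code. This is a sketch only: `hit_body` is a hypothetical name and the sample hits are made up and trimmed to the relevant fields.

```python
def hit_body(hit):
    """Return the body text of a search hit, or '' when it has none.

    Comment hits carry `comment_text`; self-post stories carry `story_text`;
    plain link stories have neither.
    """
    return hit.get('comment_text') or hit.get('story_text') or ''


# Made-up hits, trimmed to the fields that matter here.
comment_hit = {'objectID': '2', 'comment_text': 'Nice write-up.', 'story_id': 1}
ask_hit = {'objectID': '3', 'title': 'Ask HN: example', 'story_text': 'Question body'}
link_hit = {'objectID': '4', 'title': 'Some link', 'url': 'https://example.com'}
bodies = [hit_body(h) for h in (comment_hit, ask_hit, link_hit)]
```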
Tag filters
Tags are AND by default, OR with parentheses:
# Story types
"tags=story" # regular link/self posts
"tags=show_hn" # Show HN
"tags=ask_hn" # Ask HN
"tags=poll" # polls
"tags=job" # job posts
# Combined AND
"tags=story,front_page" # currently on front page
"tags=story,author_pg" # stories submitted by pg
# OR
"tags=(ask_hn,show_hn),story" # Ask OR Show HN
# By story ID (gets story + all its comments)
"tags=story_47806725"
Numeric filters
# Date range (unix timestamps)
"numericFilters=created_at_i>1745000000"
"numericFilters=created_at_i>1700000000,created_at_i<1750000000"
# Point threshold
"numericFilters=points>100"
"numericFilters=points>500,points<1000"
Full Algolia items API (nested comment tree)
import json
thread = json.loads(http_get(
"https://hn.algolia.com/api/v1/items/47806725"
))
# thread['children'] = list of top-level comment objects
# Each comment: author, text (HTML), created_at, children (nested replies)
# Recursively walk children for full thread
# Total comment count (recursive walk with stack):
stack = list(thread.get('children', []))
total = 0
while stack:
    node = stack.pop()
    total += 1
    stack.extend(node.get('children', []))
Confirmed: Algolia items returns 653 total comments for a 659-comment thread (some deleted). text field in items API is HTML with <p> tags and <a> links — may need to strip tags.
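Walking the tree and stripping tags can be combined in one pass. The sketch below assumes the node shape described above (`author`, `text`, `children`) and that a deleted comment may have a missing or null `text`. `flatten_thread` is a hypothetical name, and the regex tag removal is deliberately crude rather than a real HTML parser.

```python
import html
import re


def flatten_thread(node, depth=0):
    """Yield (depth, author, plain_text) for every comment under `node`.

    Depth 0 is a top-level comment. An explicit stack is used instead of
    recursion so very deep threads cannot hit Python's recursion limit.
    """
    stack = [(child, depth) for child in reversed(node.get('children', []))]
    while stack:
        comment, d = stack.pop()
        raw = comment.get('text') or ''        # assumed: deleted comments have no text
        text = re.sub(r'<[^>]+>', ' ', raw)    # crude tag stripping
        text = ' '.join(html.unescape(text).split())
        yield d, comment.get('author'), text
        # Push replies in reverse so they pop in their original order.
        stack.extend((c, d + 1) for c in reversed(comment.get('children', [])))


# Made-up thread in the shape returned by the items API.
sample_thread = {
    'children': [
        {'author': 'a', 'text': '<p>Hi &amp; bye</p>', 'children': [
            {'author': 'b', 'text': 'reply <a href="x">link</a>', 'children': []},
        ]},
        {'author': 'c', 'text': None, 'children': []},
    ]
}
rows = list(flatten_thread(sample_thread))
```

The traversal is pre-order, so each reply follows its parent, and the depth value is enough to re-indent the thread for display.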
Path 3: Official HN Firebase API
Clean JSON, no scraping. Use for fetching specific items or building live feeds.
import json
# Ranked story ID lists (no metadata — just IDs)
top = json.loads(http_get("https://hacker-news.firebaseio.com/v0/topstories.json")) # 500 IDs
new = json.loads(http_get("https://hacker-news.firebaseio.com/v0/newstories.json")) # 500 IDs
best = json.loads(http_get("https://hacker-news.firebaseio.com/v0/beststories.json")) # 200 IDs
ask = json.loads(http_get("https://hacker-news.firebaseio.com/v0/askstories.json")) # ~32 IDs
show = json.loads(http_get("https://hacker-news.firebaseio.com/v0/showstories.json")) # ~119 IDs
jobs = json.loads(http_get("https://hacker-news.firebaseio.com/v0/jobstories.json")) # ~31 IDs
# Fetch a single item
item = json.loads(http_get(
"https://hacker-news.firebaseio.com/v0/item/47806725.json"
))
# Fields: id, type, by, title, url, score, descendants (total comment count),
# time (unix ts), kids (list of top-level comment IDs), text (self-post body)
# Fetch a user profile
user = json.loads(http_get(
"https://hacker-news.firebaseio.com/v0/user/pg.json"
))
# Fields: id, karma, created (unix ts), about (HTML), submitted (list of item IDs)
# Highest current item ID (useful for polling new items)
maxid = json.loads(http_get("https://hacker-news.firebaseio.com/v0/maxitem.json"))
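Polling with `maxitem` can be sketched as follows. The helper name `poll_new_items`, the `fetch` parameter, and the `limit` cap are assumptions made for illustration. `fetch` is any callable that maps a URL to a response body string, such as `http_get`.

```python
import json

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"
MAXITEM_URL = "https://hacker-news.firebaseio.com/v0/maxitem.json"


def poll_new_items(fetch, last_seen, limit=50):
    """Fetch items with IDs in (last_seen, maxitem], capped at `limit` per call.

    Returns (items, new_last_seen). Pass new_last_seen back in on the
    next poll to continue where this one stopped.
    """
    max_id = json.loads(fetch(MAXITEM_URL))
    upper = min(max_id, last_seen + limit)
    items = []
    for item_id in range(last_seen + 1, upper + 1):
        item = json.loads(fetch(ITEM_URL.format(item_id)))
        if item is not None:        # the API answers null for IDs with no data
            items.append(item)
    return items, max(upper, last_seen)
```

New IDs cover every item type, comments included, so callers that only want stories should filter on the `type` field.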
Firebase vs Algolia tradeoff:
- Firebase `topstories` gives you 500 IDs in one call but then requires one HTTP call per item (~190ms each). Fetching all 500 items sequentially would take ~100 seconds.
- Algolia returns full story data (title, points, author, comments) in one call for up to ~1000 results.
- For "top 30 stories with full metadata": use `http_get` front page scrape (170ms total). For "top 500 stories with full metadata": use Algolia with `tags=front_page` or loop pages.
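When Firebase items are still needed in bulk, the per-item cost can be overlapped rather than paid sequentially. The sketch below uses a thread pool. It assumes the supplied `fetch` callable (for example `http_get`) is safe to call from several threads and that the API tolerates this level of concurrency, neither of which was tested here. `fetch_items` and the worker count are illustrative choices.

```python
import json
from concurrent.futures import ThreadPoolExecutor

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_items(fetch, item_ids, max_workers=16):
    """Fetch many Firebase items in parallel, preserving input order."""
    def load(item_id):
        return json.loads(fetch(ITEM_URL.format(item_id)))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of item_ids,
        # regardless of which request finishes first.
        return list(pool.map(load, item_ids))
```

Because order is preserved, passing the first N IDs from `topstories` yields items already in rank order.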
Comment thread HTML (item page)
For a large thread, the item page HTML (~1MB for 659 comments) loads ALL comments flat in a single request — no pagination, no JS required.
import re, html as htmllib
page = http_get("https://news.ycombinator.com/item?id=47806725")
# Count all comment IDs
comment_ids = re.findall(r'<tr class="athing comtr" id="(\d+)">', page)
# len(comment_ids) matches total comment count
# Extract comment texts (careful: text spans multiple lines with <p> tags)
# Use Algolia items API instead for structured access
For structured comment access prefer Algolia items API — it returns a proper nested tree. The HTML item page is useful only when you need approximate comment count without an API call.
Do NOT use a browser for HN
All data is in plain HTML or JSON APIs. goto_url() + wait_for_load() takes 3–8 seconds; http_get takes 170–400ms. The JS querySelectorAll approach works (tested, returns correct data) but is 20–50x slower with no benefit.