← Back to skills

Domain skill

hackernews

Markdown synced from browser-harness domain skills.

Host
hackernews
Files
1

Agent prompt

Use this skill

Copy this prompt into your coding agent to make it enable browser-harness domain skills and read this exact domain folder before automating.

Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are.

Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `hackernews` domain skill from `agent-workspace/domain-skills/hackernews/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/hackernews/scraping.md

Use those domain-skill notes to complete my task for `hackernews` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.

Skill contents

What the agent will read

Data Extraction

scraping.md

Source
  • https://news.ycombinator.com — YCombinator's link aggregator. Three access paths tested: httpget DOM scraping, Algolia search API, and the official HN Firebase API. All work without a browser.
  • Never use a browser for read-only HN tasks. Everything is accessible over HTTP with no auth, no JS rendering needed.
  • ---
  • The front page HTML is 34KB. Story order matches Firebase /topstories.json exactly — confirmed identical on 2026-04-18.
Show full markdown

https://news.ycombinator.com — YCombinator's link aggregator. Three access paths tested: http_get DOM scraping, Algolia search API, and the official HN Firebase API. All work without a browser.

Do this first: pick your access path

GoalBest approachLatency
Current front page (30 stories, real-time)http_get + regex~170ms
Historical / keyword searchAlgolia search API~400ms
Full comment tree (nested)Algolia items API~300ms
Specific item by IDFirebase API~200ms
500 ranked story IDsFirebase topstories~200ms (+ ~190ms/item after)

Never use a browser for read-only HN tasks. Everything is accessible over HTTP with no auth, no JS rendering needed.


Path 1: http_get front page (fastest for real-time data)

The front page HTML is ~34KB. Story order matches Firebase /topstories.json exactly — confirmed identical on 2026-04-18.

python
import re, html as htmllib

page = http_get("https://news.ycombinator.com")

# Extract all 30 story IDs (in rank order)
story_ids = re.findall(r'<tr class="athing submission" id="(\d+)">', page)

# Extract titles + URLs (same order as IDs)
titles_urls = re.findall(
    r'class="titleline"[^>]*><a href="([^"]*)"[^>]*>(.*?)</a>', page
)

# Extract scores keyed by story ID (job posts have no score row)
scores_by_id = {
    m.group(1): int(m.group(2))
    for m in re.finditer(
        r'<span class="score" id="score_(\d+)">(\d+) points</span>', page
    )
}

# Extract authors keyed by story ID (anchor on score span)
authors_by_id = {}
for m in re.finditer(
    r'<span class="score" id="score_(\d+)">\d+ points</span>'
    r'.*?class="hnuser">(.*?)</a>',
    page, re.DOTALL
):
    authors_by_id[m.group(1)] = m.group(2)

# Extract comment counts keyed by story ID
comments_by_id = {
    m.group(1): int(m.group(2))
    for m in re.finditer(
        r'href="item\?id=(\d+)">(\d+)&nbsp;comments</a>', page
    )
}

stories = []
for i, sid in enumerate(story_ids):
    url, raw_title = titles_urls[i] if i < len(titles_urls) else ('', '')
    stories.append({
        'rank': i + 1,
        'id': sid,
        'title': htmllib.unescape(raw_title),   # MUST unescape — titles contain &#x27; etc.
        'url': url,
        'score': scores_by_id.get(sid),          # None for job posts
        'author': authors_by_id.get(sid),
        'comments': comments_by_id.get(sid, 0),
    })

Gotchas:

  • Titles contain HTML entities (&#x27; &amp; &quot; &gt;). Always call html.unescape().
  • <tr class="athing submission" id="..."> — the class is athing submission, not just athing. The athing comtr class is for comment rows.
  • Job/hiring posts (YC ads) appear in the list but have no score or author. scores_by_id.get(sid) returns None for them — check before comparing.
  • re.DOTALL multi-line patterns can cross story boundaries. Use ID-anchored patterns (as above) instead of positional zip for score/author.
  • The page only serves page 1 (30 items). Pages 2–4 exist at ?p=2 etc. but require a login cookie for page 3+.

Path 2: Algolia search API (best for historical / keyword search)

No rate limiting observed. Returns up to 1000 hits per query (hitsPerPage max is capped at ~1000 per Algolia plan).

python
import json

# Keyword search — sorted by relevance
data = json.loads(http_get(
    "https://hn.algolia.com/api/v1/search"
    "?query=llm&tags=story&hitsPerPage=20"
))

# Date-sorted (most recent first)
data = json.loads(http_get(
    "https://hn.algolia.com/api/v1/search_by_date"
    "?tags=story&hitsPerPage=20"
))

# Paginate: add &page=N (0-indexed), up to data['nbPages']-1

Fields returned per story hit:

code
objectID, title, url, author, points, num_comments,
created_at (ISO 8601), created_at_i (unix ts), story_id,
children (list of comment IDs — flat, not tree),
_tags, _highlightResult

Fields returned per comment hit:

code
objectID, comment_text, author, story_id, story_title, story_url,
parent_id, created_at, created_at_i, points

Note: comment hits use comment_text, NOT text. Story hits use story_text for self-post body.

Tag filters

Tags are AND by default, OR with parentheses:

python
# Story types
"tags=story"           # regular link/self posts
"tags=show_hn"         # Show HN
"tags=ask_hn"          # Ask HN
"tags=poll"            # polls
"tags=job"             # job posts

# Combined AND
"tags=story,front_page"          # currently on front page
"tags=story,author_pg"           # stories submitted by pg

# OR
"tags=(ask_hn,show_hn),story"    # Ask OR Show HN

# By story ID (gets story + all its comments)
"tags=story_47806725"

Numeric filters

python
# Date range (unix timestamps)
"numericFilters=created_at_i>1745000000"
"numericFilters=created_at_i>1700000000,created_at_i<1750000000"

# Point threshold
"numericFilters=points>100"
"numericFilters=points>500,points<1000"

Full Algolia items API (nested comment tree)

python
import json

thread = json.loads(http_get(
    "https://hn.algolia.com/api/v1/items/47806725"
))
# thread['children'] = list of top-level comment objects
# Each comment: author, text (HTML), created_at, children (nested replies)
# Recursively walk children for full thread

# Total comment count (recursive walk with stack):
stack = list(thread.get('children', []))
total = 0
while stack:
    node = stack.pop()
    total += 1
    stack.extend(node.get('children', []))

Confirmed: Algolia items returns 653 total comments for a 659-comment thread (some deleted). text field in items API is HTML with <p> tags and <a> links — may need to strip tags.


Path 3: Official HN Firebase API

Clean JSON, no scraping. Use for fetching specific items or building live feeds.

python
import json

# Ranked story ID lists (no metadata — just IDs)
top   = json.loads(http_get("https://hacker-news.firebaseio.com/v0/topstories.json"))  # 500 IDs
new   = json.loads(http_get("https://hacker-news.firebaseio.com/v0/newstories.json"))  # 500 IDs
best  = json.loads(http_get("https://hacker-news.firebaseio.com/v0/beststories.json")) # 200 IDs
ask   = json.loads(http_get("https://hacker-news.firebaseio.com/v0/askstories.json"))  # ~32 IDs
show  = json.loads(http_get("https://hacker-news.firebaseio.com/v0/showstories.json")) # ~119 IDs
jobs  = json.loads(http_get("https://hacker-news.firebaseio.com/v0/jobstories.json"))  # ~31 IDs

# Fetch a single item
item = json.loads(http_get(
    "https://hacker-news.firebaseio.com/v0/item/47806725.json"
))
# Fields: id, type, by, title, url, score, descendants (total comment count),
#         time (unix ts), kids (list of top-level comment IDs), text (self-post body)

# Fetch a user profile
user = json.loads(http_get(
    "https://hacker-news.firebaseio.com/v0/user/pg.json"
))
# Fields: id, karma, created (unix ts), about (HTML), submitted (list of item IDs)

# Highest current item ID (useful for polling new items)
maxid = json.loads(http_get("https://hacker-news.firebaseio.com/v0/maxitem.json"))

Firebase vs Algolia tradeoff:

  • Firebase topstories gives you 500 IDs in one call but then requires one HTTP call per item (~190ms each). Fetching all 500 items sequentially would take ~100 seconds.
  • Algolia returns full story data (title, points, author, comments) in one call for up to ~1000 results.
  • For "top 30 stories with full metadata": use http_get front page scrape (170ms total). For "top 500 stories with full metadata": use Algolia with tags=front_page or loop pages.

Comment thread HTML (item page)

For a large thread, the item page HTML (~1MB for 659 comments) loads ALL comments flat in a single request — no pagination, no JS required.

python
import re, html as htmllib

page = http_get("https://news.ycombinator.com/item?id=47806725")

# Count all comment IDs
comment_ids = re.findall(r'<tr class="athing comtr" id="(\d+)">', page)
# len(comment_ids) matches total comment count

# Extract comment texts (careful: text spans multiple lines with <p> tags)
# Use Algolia items API instead for structured access

For structured comment access prefer Algolia items API — it returns a proper nested tree. The HTML item page is useful only when you need approximate comment count without an API call.


Do NOT use a browser for HN

All data is in plain HTML or JSON APIs. goto_url() + wait_for_load() takes 3–8 seconds; http_get takes 170–400ms. The JS querySelectorAll approach works (tested, returns correct data) but is 20–50x slower with no benefit.