Domain skill

pubmed

Markdown synced from browser-harness domain skills.

Agent prompt

Use this skill

Copy this prompt into your coding agent so that it enables browser-harness domain skills and reads this exact domain folder before automating.

Set up https://github.com/browser-use/browser-harness for me if it is not already installed. If setup is needed, read `install.md` first to install and connect it to my real browser. Then read `SKILL.md` for normal usage and always read `helpers.py` because that is where the browser-harness functions are.

Enable domain skills if they are not already enabled by setting `BH_DOMAIN_SKILLS=1` for browser-harness. Use the `pubmed` domain skill from `agent-workspace/domain-skills/pubmed/`. Read every markdown file for this domain before inventing an approach:
- agent-workspace/domain-skills/pubmed/scraping.md

Use those domain-skill notes to complete my task for `pubmed` in my real browser. When you open a setup, verification, or task tab, activate it so I can see the active browser tab.

Skill contents

What the agent will read

Scraping & Data Extraction

scraping.md

https://pubmed.ncbi.nlm.nih.gov — 37 M+ biomedical citations. Never use the browser for PubMed. All data is reachable via http_get using the NCBI E-utilities REST API. No API key required; a free key raises the rate limit from 3 to 10 req/s.

Do this first

ESearch → ESummary is the fastest pipeline for most tasks — two calls, JSON responses, no XML parsing.

python
import json
from helpers import http_get

# Step 1: search → get PMIDs
search = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    "?db=pubmed&term=deep+learning+radiology&retmax=10&retmode=json"
))
pmids = search['esearchresult']['idlist']   # e.g. ['41999029', '41998456', ...]
count = search['esearchresult']['count']    # total hits across all pages

# Step 2: fetch lightweight metadata for all PMIDs in one call
summary = json.loads(http_get(
    f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    f"?db=pubmed&id={','.join(pmids)}&retmode=json"
))
result = summary['result']
for uid in result['uids']:
    art = result[uid]
    print(uid, art['pubdate'], art['source'])
    print("  ", art['title'][:80])
    print("  authors:", [a['name'] for a in art['authors'][:3]])
# Confirmed output (2026-04-18):
# 41999029 2026 Apr 18 Med Sci Monit
#    Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
#    authors: ['Kesimal U', 'Akkaya HE', 'Polat Ö']
# 41998456 2026 Apr 17 Sci Rep
#    ...

Use EFetch XML when you need: full abstract text, MeSH terms, complete author names (not just "Last I"), structured abstract labels, or the DOI from within the article record.

Common workflows

Search PubMed (ESearch)

python
import json
from helpers import http_get

data = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    "?db=pubmed"
    "&term=large+language+models+clinical"
    "&retmax=5"
    "&retmode=json"
    "&sort=pub+date"                          # newest first; default is relevance
    "&datetype=pdat"                          # filter by publication date
    "&mindate=2024/01/01&maxdate=2024/12/31"  # YYYY/MM/DD format
))
result = data['esearchresult']
print("Total hits:", result['count'])         # '24160' — note: string, not int
print("PMIDs:", result['idlist'])
print("Query translation:", result['querytranslation'])
# Confirmed output (2026-04-18):
# Total hits: 24160
# PMIDs: ['41996895', '41996722', '41996006', '41995888', '41995759']
# Query translation: "large language models"[MeSH Terms] OR ...

ESearch field tags (append to term)

code
machine learning[MeSH Terms]        MeSH controlled vocabulary
Hinton GE[Author]                   author last + initials
attention is all you need[Title]    title words
Nature[Journal]                     journal name
2024[pdat]                          publication year

Boolean operators: AND, OR, NOT. Phrase search: "exact phrase"[Title].
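
Field-tagged terms contain spaces, quotes, and brackets, so they need percent-encoding before they go into the query string. The following is a minimal sketch using only the standard library; `build_term` and `esearch_url` are hypothetical helper names for illustration and are not part of `helpers.py`.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_term(*clauses, op="AND"):
    """Join field-tagged clauses with one Boolean operator (hypothetical helper)."""
    return f" {op} ".join(f"({c})" for c in clauses)

def esearch_url(term, retmax=10, **extra):
    """Return an ESearch URL with every parameter percent-encoded."""
    params = {"db": "pubmed", "term": term, "retmax": retmax,
              "retmode": "json", **extra}
    return f"{ESEARCH}?{urlencode(params)}"

term = build_term('"deep learning"[Title]', "radiology[MeSH Terms]", "2024[pdat]")
url = esearch_url(term, retmax=5)
print(term)
print(url)
```

The resulting URL can be passed to `http_get` exactly like the hand-written ones in the examples above; wrapping each clause in parentheses keeps operator precedence explicit when AND and OR are mixed.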

Sort options (sort=)

| Value | Effect |
|---|---|
| (omit) | Relevance (default) |
| pub+date | Most recent publication first |
| Author | First author alphabetical |
| JournalName | Journal alphabetical |

Lightweight metadata — ESummary (JSON, no XML)

python
import json
from helpers import http_get

data = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    "?db=pubmed&id=41999029,41998456,41997837&retmode=json"
))
result = data['result']
for uid in result['uids']:
    art = result[uid]
    # Key fields available:
    title         = art['title']            # full title string
    source        = art['source']           # abbreviated journal name
    fulljournalname = art['fulljournalname']
    pubdate       = art['pubdate']          # e.g. '2026 Apr 18'
    epubdate      = art['epubdate']         # e-pub ahead of print date (may be empty)
    authors       = art['authors']          # list of {'name': 'Last I', 'authtype': ...}
    volume        = art['volume']
    issue         = art['issue']
    pages         = art['pages']
    pubtype       = art['pubtype']          # list: ['Journal Article', 'Review', ...]
    # Extract DOI from elocationid or articleids:
    doi_field     = art['elocationid']      # e.g. 'doi: 10.12659/MSM.951157'
    article_ids   = {x['idtype']: x['value'] for x in art['articleids']}
    doi           = article_ids.get('doi')
    pmc_id        = article_ids.get('pmc')  # PMC ID if open access
    print(uid, pubdate, source)
    print(" ", title[:70])
    print("  doi:", doi, "| pmc:", pmc_id)
# Confirmed output (2026-04-18):
# 41999029 2026 Apr 18 Med Sci Monit
#    Use of Deep Learning Models in the Diagnosis of Proptosis Through Orbi
#   doi: 10.12659/MSM.951157 | pmc: None
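
For bulk work it can help to reduce each ESummary entry to a flat record and to fall back to `elocationid` when `articleids` carries no DOI. The sketch below assumes the field shapes shown above; `flatten_summary` is a hypothetical helper, and the sample dict is hand-written from values quoted in this document (with a placeholder title), not a live response.

```python
def flatten_summary(uid, art):
    """Reduce one ESummary article dict to a flat record (hypothetical helper)."""
    ids = {x.get("idtype"): x.get("value") for x in art.get("articleids", [])}
    doi = ids.get("doi")
    if not doi:
        # Fall back to elocationid, which looks like 'doi: 10.12659/MSM.951157'
        eloc = art.get("elocationid", "")
        if eloc.lower().startswith("doi:"):
            doi = eloc.split(":", 1)[1].strip()
    return {
        "pmid": uid,
        "title": art.get("title", ""),
        "journal": art.get("source", ""),
        "pubdate": art.get("pubdate", ""),
        "authors": [a.get("name", "") for a in art.get("authors", [])],
        "doi": doi or None,
        "pmc": ids.get("pmc"),
    }

# Hand-written sample: articleids has no 'doi' entry, so the fallback is exercised
sample = {
    "title": "Placeholder title",
    "source": "Med Sci Monit",
    "pubdate": "2026 Apr 18",
    "authors": [{"name": "Kesimal U"}, {"name": "Akkaya HE"}],
    "elocationid": "doi: 10.12659/MSM.951157",
    "articleids": [{"idtype": "pubmed", "value": "41999029"}],
}
record = flatten_summary("41999029", sample)
print(record["doi"], record["pmc"], record["authors"])
```

Using `.get()` throughout means a record with a missing field yields an empty value instead of raising `KeyError` in the middle of a batch.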

Full article metadata — EFetch XML

Use this for full abstracts, complete author names, MeSH terms, structured abstract sections.

python
import xml.etree.ElementTree as ET
from helpers import http_get

raw = http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    "?db=pubmed&id=41999029,36328784&retmode=xml&rettype=abstract"
)
root = ET.fromstring(raw)

for art in root.findall('.//PubmedArticle'):
    mc      = art.find('MedlineCitation')
    pmid    = mc.find('PMID').text
    article = mc.find('Article')

    # Title — use itertext() to handle embedded tags like <i>, <sub>
    title   = ''.join(article.find('ArticleTitle').itertext()).strip()

    # Abstract — plain or structured (BACKGROUND / METHODS / RESULTS / CONCLUSION)
    abstract_el = article.find('Abstract')
    if abstract_el is not None:
        sections = []
        for t in abstract_el.findall('AbstractText'):
            label = t.get('Label', '')          # e.g. 'BACKGROUND', 'METHODS'
            text  = ''.join(t.itertext()).strip()
            sections.append(f"[{label}] {text}" if label else text)
        abstract = ' '.join(sections)
    else:
        abstract = ''                           # ~15% of articles have no abstract

    # Journal + year
    journal  = article.find('Journal')
    j_title  = journal.find('Title').text if journal is not None else ''
    pub_date = journal.find('.//PubDate') if journal is not None else None
    year = ''                                       # default, so the print below never sees an undefined or stale value
    if pub_date is not None:
        year_el    = pub_date.find('Year')
        medline_el = pub_date.find('MedlineDate')   # fallback for old/seasonal dates
        # <Season> (e.g. 'Jul-Aug', 'Oct-Dec') may replace <Month>; <Year> is still present
        if year_el is not None:
            year = year_el.text
        elif medline_el is not None:
            year = medline_el.text[:4]              # e.g. '1995 Fall' -> '1995'

    # DOI
    doi_el = next(
        (e for e in article.findall('ELocationID') if e.get('EIdType') == 'doi'),
        None
    )
    doi = doi_el.text if doi_el is not None else ''

    # Authors — handle CollectiveName (consortium/group authors)
    author_list = article.find('AuthorList')
    authors = []
    if author_list is not None:
        for a in author_list.findall('Author'):
            collective = a.find('CollectiveName')
            last       = a.find('LastName')
            fore       = a.find('ForeName')
            initials   = a.find('Initials')
            if collective is not None:
                authors.append(collective.text)
            elif last is not None:
                full = last.text
                if fore is not None:
                    full += f", {fore.text}"
                authors.append(full)

    # MeSH controlled vocabulary terms
    mesh_list = mc.find('MeshHeadingList')
    mesh_terms = []
    if mesh_list is not None:
        mesh_terms = [
            mh.find('DescriptorName').text
            for mh in mesh_list.findall('MeshHeading')
            if mh.find('DescriptorName') is not None
        ]

    print(f"PMID={pmid} ({year}) {j_title}")
    print(f"  Title: {title[:70]}")
    print(f"  Authors: {authors[:3]}")
    print(f"  DOI: {doi}")
    print(f"  MeSH: {mesh_terms[:4]}")
    print(f"  Abstract: {abstract[:120]}")
# Confirmed output (2026-04-18):
# PMID=41999029 (2026) Medical science monitor : international medical...
#   Title: Use of Deep Learning Models in the Diagnosis of Proptosis Thro
#   Authors: ['Kesimal, Uğur', 'Akkaya, Habip Eser', 'Polat, Önder']
#   DOI: 10.12659/MSM.951157
#   MeSH: ['Humans', 'Deep Learning', 'Exophthalmos', 'Magnetic Resonance Imaging']
#   Abstract: BACKGROUND Proptosis is a common manifestation of orbital disease...
# PMID=36328784 (...)
#   Abstract: [OBJECTIVES] Physical inactivity and sedentary behaviour...  ← structured
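
The parsing rules above can be checked offline before any API call is made. The sketch below runs them against a small hand-written XML fragment; the record, the author "Doe, Jane", and "Example Study Group" are invented for illustration and do not correspond to a real PMID.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment: mixed-content title, a group author, and a MedlineDate
SAMPLE = """<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID>1</PMID><Article>
<Journal><JournalIssue><PubDate><MedlineDate>1995 Fall</MedlineDate></PubDate></JournalIssue></Journal>
<ArticleTitle>Biofilms of <i>Staphylococcus aureus</i> under CO<sub>2</sub></ArticleTitle>
<AuthorList>
<Author><LastName>Doe</LastName><ForeName>Jane</ForeName><Initials>J</Initials></Author>
<Author><CollectiveName>Example Study Group</CollectiveName></Author>
</AuthorList>
</Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE)
article = root.find(".//Article")

title_el = article.find("ArticleTitle")
naive_title = title_el.text                           # stops at the first child tag
full_title = "".join(title_el.itertext()).strip()     # walks through <i> and <sub>

authors = []
for a in article.findall("AuthorList/Author"):
    collective = a.find("CollectiveName")
    if collective is not None:                        # group author: no LastName
        authors.append(collective.text)
        continue
    last, fore = a.find("LastName"), a.find("ForeName")
    if last is not None:
        authors.append(last.text + (f", {fore.text}" if fore is not None else ""))

pub_date = article.find(".//PubDate")
year_el = pub_date.find("Year") if pub_date is not None else None
medline_el = pub_date.find("MedlineDate") if pub_date is not None else None
year = (year_el.text if year_el is not None
        else medline_el.text[:4] if medline_el is not None else "")

print(repr(naive_title))
print(full_title)
print(authors, year)
```

Comparing `naive_title` with `full_title` shows why `.text` alone is not enough for mixed-content titles.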

Large result sets — usehistory + WebEnv

When count exceeds retmax (max 10 000), use server-side history to paginate EFetch without re-running ESearch on every page.

python
import json, xml.etree.ElementTree as ET
from helpers import http_get

# Step 1: ESearch with usehistory=y — NCBI holds result set on server
search = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    "?db=pubmed&term=CRISPR+gene+editing&retmax=0&retmode=json&usehistory=y"
))
webenv    = search['esearchresult']['webenv']       # server-side session token
query_key = search['esearchresult']['querykey']     # result set ID within session
total     = int(search['esearchresult']['count'])
print(f"Total: {total}, WebEnv: {webenv[:30]}..., query_key: {query_key}")
# Confirmed output (2026-04-18):
# Total: 24160, WebEnv: MCID_69e4203757db89391008d6f1..., query_key: 1

# Step 2: EFetch pages using WebEnv (no re-searching)
batch_size = 200
for start in range(0, min(total, 1000), batch_size):  # cap at 1000 for demo
    raw = http_get(
        f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
        f"?db=pubmed&query_key={query_key}&WebEnv={webenv}"
        f"&retstart={start}&retmax={batch_size}&retmode=xml&rettype=abstract"
    )
    root = ET.fromstring(raw)
    articles = root.findall('.//PubmedArticle')
    print(f"  Fetched {len(articles)} articles (start={start})")
    # process articles here...
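
The loop above always requests `batch_size` records and lets the server return a short final page. When the expected size of each page is needed, for example to verify that no records were dropped, the offsets can be computed up front. This is a small sketch; `page_offsets` is a hypothetical helper name.

```python
def page_offsets(total, batch_size=200, limit=None):
    """Yield (retstart, retmax) pairs covering `total` records, optionally capped."""
    end = total if limit is None else min(total, limit)
    for start in range(0, end, batch_size):
        yield start, min(batch_size, end - start)

pages = list(page_offsets(450, batch_size=200))
print(pages)  # [(0, 200), (200, 200), (400, 50)]

capped = list(page_offsets(24160, batch_size=200, limit=1000))
print(len(capped))  # 5
```

Each pair maps directly onto the `retstart` and `retmax` parameters of the EFetch URL, and the second value is the number of `PubmedArticle` elements that page is expected to contain.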

EInfo — list available NCBI databases

python
import json
from helpers import http_get

data = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?retmode=json"
))
dbs = data['einforesult']['dblist']
print(f"Total databases: {len(dbs)}")   # Confirmed: 39 (2026-04-18)
print(dbs[:10])
# ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure',
#  'genome', 'annotinfo', 'assembly', 'bioproject']

Get PubMed-specific metadata (field list, link list):

python
import json
from helpers import http_get

data = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed&retmode=json"
))
db_info = data['einforesult']['dbinfo'][0]
print("DB name:", db_info['dbname'])
print("Record count:", db_info['count'])    # total PubMed records
link_names = [l['name'] for l in db_info.get('linklist', [])]
print(f"Link types ({len(link_names)}):", link_names[:5])
# Confirmed (2026-04-18):
# DB name: pubmed
# Record count: 37620453
# Link types (48): ['pubmed_assembly', 'pubmed_bioproject', ...]

ELink — cross-database linking

ELink connects a PubMed record to associated data in other NCBI databases. The pubmed_pubmed "related articles" linkname relies on a similarity server that is intermittently unavailable (returns "Couldn't resolve #exLinkSrv2, the address table is empty."). Other linknames such as pubmed_pmc have been observed to fail with the same error, so check the ERROR field of every ELink response before reading linksets.

python
import json
from helpers import http_get

# Link a PMID to its free full-text in PMC (if open access)
# linkname=pubmed_pmc — may also hit the server outage; check error field
data = json.loads(http_get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"
    "?dbfrom=pubmed&id=38325330&linkname=pubmed_pmc&retmode=json"
))
error = data.get('ERROR', '')
if error:
    print("ELink error:", error)   # 'Couldn't resolve #exLinkSrv2...' — NCBI server issue
else:
    for ls in data.get('linksets', []):
        for lsdb in ls.get('linksetdbs', []):
            print(lsdb['linkname'], "→", lsdb['links'][:5])

Available ELink linknames from pubmed (48 total):

| linkname | Target |
|---|---|
| pubmed_pmc | Free full text in PMC |
| pubmed_pubmed_citedin | Articles citing this paper |
| pubmed_pubmed_refs | References cited by this paper |
| pubmed_gene | Related Gene records |
| pubmed_clinvar | Clinical variants associated with publication |
| pubmed_gds | Related GEO datasets |

Practical alternative: If ELink is down, extract DOI from EFetch/ESummary and use https://doi.org/{doi} directly for the full-text link.
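
The fallback described above can be expressed as one function that prefers a PMC link when ELink answered and otherwise uses the DOI resolver. This is a sketch under assumptions: it mirrors the `ERROR` / `linksets` / `linksetdbs` / `links` shape used in the ELink example, it assumes `pubmed_pmc` links are numeric PMC IDs without the `PMC` prefix, and `best_fulltext_url` is a hypothetical helper. Both sample responses are hand-written.

```python
def best_fulltext_url(elink_json, doi=None):
    """Return a PMC URL if ELink supplied one, else a DOI URL, else None."""
    if not elink_json.get("ERROR"):
        for ls in elink_json.get("linksets", []):
            for lsdb in ls.get("linksetdbs", []):
                if lsdb.get("linkname") == "pubmed_pmc" and lsdb.get("links"):
                    # Assumption: links hold numeric PMC IDs, so the prefix is added here
                    return f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{lsdb['links'][0]}/"
    return f"https://doi.org/{doi}" if doi else None

# Hand-written responses: one simulating the outage, one simulating success
down = {"ERROR": "Couldn't resolve #exLinkSrv2, the address table is empty."}
ok = {"linksets": [{"linksetdbs": [{"linkname": "pubmed_pmc", "links": ["1234567"]}]}]}

print(best_fulltext_url(down, doi="10.12659/MSM.951157"))
print(best_fulltext_url(ok, doi="10.12659/MSM.951157"))
```

Returning `None` when neither source is available lets the caller decide whether to fall back to the PubMed page for the PMID.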

URL and parameter reference

E-utilities base URLs

code
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi   # search → PMIDs
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi  # PMIDs → JSON summary
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi    # PMIDs → full XML
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi     # cross-db links
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi     # DB metadata

ESearch parameters

| Parameter | Values | Notes |
|---|---|---|
| db | pubmed | Always pubmed for PubMed |
| term | query string | Supports field tags like [Author], [Title], [MeSH Terms] |
| retmax | integer, max 10000 | Results returned per call |
| retmode | json | JSON output |
| sort | pub+date, Author, JournalName | Default is relevance |
| datetype | pdat (pub), edat (entrez), mdat (modified) | |
| mindate, maxdate | YYYY/MM/DD or YYYY | Requires datetype |
| usehistory | y | Store results on server; returns webenv + querykey |

EFetch parameters

| Parameter | Values | Notes |
|---|---|---|
| db | pubmed | |
| id | 38000000,37999999 | Comma-separated PMIDs; max ~200 per call |
| query_key + WebEnv | from ESearch usehistory=y | Alternative to id for large sets |
| retstart | integer | Offset for pagination with WebEnv |
| retmax | integer, max 10000 | Batch size |
| retmode | xml | Use XML for EFetch (JSON not available for full records) |
| rettype | abstract | Returns abstract + core metadata |

PubMed article URL construction

python
pmid = "41999029"
pubmed_url  = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
doi         = "10.12659/MSM.951157"
doi_url     = f"https://doi.org/{doi}"        # resolves to publisher page
pmc_id      = "PMC9876543"                    # from ESummary articleids
pmc_url     = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"

Gotchas

  • count is a string, not int. search['esearchresult']['count'] returns '24160', not 24160. Always cast with int() before arithmetic.

  • EFetch retmode must be xml for full records. Unlike ESearch and ESummary, EFetch with retmode=json returns flat text (the MEDLINE citation text format), not structured JSON. Parse EFetch responses with xml.etree.ElementTree.

  • ArticleTitle may contain embedded XML tags. Titles with italics (<i>Staphylococcus aureus</i>) or math (<sub>2</sub>) are mixed-content nodes. Always use ''.join(el.itertext()) instead of el.text, which silently drops everything after the first child tag.

  • ~15% of articles have no abstract. article.find('Abstract') returns None for short communications, editorials, letters, and older records. Always guard with if abstract_el is not None.

  • Author names vary in structure, so always handle CollectiveName. Consortium papers list a group name ('GeKeR Study Group', 'Breast Cancer Association Consortium') under <CollectiveName> instead of <LastName>/<ForeName>. Individual authors have <LastName> plus, optionally, <ForeName> and <Initials>. Check CollectiveName first: for a group author, find('LastName') returns None, and reading .text from it raises AttributeError.

    • Confirmed real examples (2026-04-18): PMID 37586835 (GeKeR Study Group), PMID 36328784 (Breast Cancer Association Consortium)
  • PubDate has three possible structures. Most articles have <Year> + optional <Month> + optional <Day>. Seasonal journals use <Season> (e.g. Jul-Aug, Oct-Dec) instead of <Month>. A minority of older records use <MedlineDate> (e.g. 1995 Fall) with no <Year>. Safe extraction pattern:

    python
    pub_date = journal.find('.//PubDate')
    year_el    = pub_date.find('Year')    if pub_date is not None else None
    medline_el = pub_date.find('MedlineDate') if pub_date is not None else None
    year = (year_el.text if year_el is not None
            else medline_el.text[:4] if medline_el is not None else '')
    
  • Batch EFetch: keep IDs to ~200 per call. The API accepts comma-separated IDs in id=, but very large batches (500+) occasionally time out or return truncated XML. For >200 articles, iterate in chunks or use usehistory + WebEnv.

  • ELink pubmed_pubmed (related articles) is intermittently broken. The NCBI similarity server returns "Couldn't resolve #exLinkSrv2, the address table is empty." This was a recurring server-side issue as of 2026-04-18, not a rate-limit error, so retrying after a short pause will not necessarily help. Other linknames (pubmed_gene, pubmed_pmc, pubmed_clinvar) fail with the same error. When ELink fails, build the publisher full-text link from the DOI (https://doi.org/{doi}) instead.

  • Rate limits: 3 req/s without API key, 10 req/s with free key. Exceeding 3 req/s returns HTTP 429. Insert time.sleep(0.34) between sequential calls without a key. Get a free API key at https://www.ncbi.nlm.nih.gov/account/ and append &api_key=YOUR_KEY to all URLs.

  • retmax upper bound is 10 000 for ESearch. To retrieve more than 10 000 PMIDs for a search, use usehistory=y and page through EFetch with retstart offsets. EFetch itself also accepts retmax up to 10 000 per call.

  • retmax=0 in ESearch returns only the count, not IDs — useful for counting. Combine with usehistory=y to store the result for later paging without fetching IDs upfront:

    python
    search = json.loads(http_get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
        "?db=pubmed&term=cancer&retmax=0&retmode=json&usehistory=y"
    ))
    total  = int(search['esearchresult']['count'])   # e.g. 4800000
    webenv = search['esearchresult']['webenv']
    
  • ESummary authors field uses abbreviated names (Last I), not full names. Use EFetch XML to get ForeName (e.g. 'Kesimal, Uğur' vs ESummary 'Kesimal U'). For bulk tasks where full names are not needed, ESummary is faster.

  • querytranslation shows how NCBI interpreted your term. The ESearch response includes esearchresult.querytranslation — a MeSH-expanded version of your query. Inspect it to verify the search matched what you intended.
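
The batching and rate-limit gotchas above can be combined into one loop. The sketch below is an illustration, not part of `helpers.py`: `chunked`, `min_interval`, `efetch_url`, and `fetch_in_batches` are hypothetical names, the 0.11 s pause for keyed access is an assumed counterpart to the 0.34 s figure quoted above, and the PMIDs are placeholders. The fetch and sleep functions are injected so the demo runs without network access; in real use, pass `http_get` as `fetch`.

```python
import time

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def chunked(ids, size=200):
    """Split a list of PMIDs into batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def min_interval(api_key=None):
    """Pause between calls: 0.34 s without a key; 0.11 s with one (assumed margin)."""
    return 0.11 if api_key else 0.34

def efetch_url(batch, api_key=None):
    """Build an EFetch URL for one batch, appending api_key when supplied."""
    url = f"{EFETCH}?db=pubmed&id={','.join(batch)}&retmode=xml&rettype=abstract"
    return f"{url}&api_key={api_key}" if api_key else url

def fetch_in_batches(pmids, fetch, api_key=None, sleep=time.sleep):
    """Call `fetch` once per batch, pausing between calls to respect the rate limit."""
    results = []
    for n, batch in enumerate(chunked(pmids)):
        if n:                                   # no pause before the first call
            sleep(min_interval(api_key))
        results.append(fetch(efetch_url(batch, api_key)))
    return results

# Offline demo: echo the URL instead of fetching, record pauses instead of sleeping
pmids = [str(n) for n in range(1, 451)]         # placeholder IDs for illustration only
pauses = []
urls = fetch_in_batches(pmids, fetch=lambda u: u, sleep=pauses.append)
print(len(urls), pauses)
```

With 450 IDs this produces three requests of 200, 200, and 50 IDs and two pauses between them.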