Content Parsers

IntelliScraper provides an extensible parser hierarchy for extracting structured data from scraped HTML.

Parser Hierarchy

BaseParser (ABC)
└── HTMLParser (general purpose)
    └── YourCustomParser (extend for your site)
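
The three-level layout can be illustrated with a small stand-in. This is only a sketch of the general pattern (abstract base, general-purpose implementation, site-specific subclass); the `Sketch*` classes and the `is_empty` property are hypothetical and do not reflect the actual interface of IntelliScraper's BaseParser.

```python
from abc import ABC, abstractmethod


class SketchBaseParser(ABC):
    """Hypothetical abstract base; not IntelliScraper's BaseParser."""

    def __init__(self, url: str, html: str) -> None:
        self.url = url
        self.html = html

    @property
    @abstractmethod
    def text(self) -> str:
        """Subclasses must provide the extracted text."""


class SketchHTMLParser(SketchBaseParser):
    """General-purpose layer: implements the abstract interface."""

    @property
    def text(self) -> str:
        # Deliberately naive: a real parser would strip the markup.
        return self.html.strip()


class SketchCustomParser(SketchHTMLParser):
    """Site-specific layer: inherits everything and adds one field."""

    @property
    def is_empty(self) -> bool:
        return not self.text


parser = SketchCustomParser(url="https://example.com", html="  <p>Hi</p>  ")
print(parser.text)
print(parser.is_empty)
```

The abstract base cannot be instantiated directly, which is what pushes shared behaviour into the general-purpose layer and site-specific fields into subclasses.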

HTMLParser

HTMLParser is the default, general-purpose parser. It extracts plain text, links, and Markdown from an HTML page.

from intelliscraper.parsers import HTMLParser

# html_content is the page's raw HTML as a string
parser = HTMLParser(url="https://example.com", html=html_content)

# Plain text
print(parser.text)

# All absolute URLs from <a> tags
print(parser.links)

# Classified internal/external links with metadata
print(parser.navigable_links)

# Full Markdown
print(parser.markdown)

# Cleaned Markdown optimised for LLM input
print(parser.markdown_for_llm)
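
Since navigable_links is a list of dictionaries (keys are listed in the Properties table), ordinary Python is enough to split the results by category. The sketch below groups entries by their `link_type` key. The sample data, the `"internal"`/`"external"` values, and the shape of `rel` are assumptions made for illustration, not confirmed library output; `group_by_link_type` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_link_type(links: list[dict]) -> dict[str, list[str]]:
    """Group link hrefs by their ``link_type`` value."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for link in links:
        # Fall back to "unknown" if an entry lacks the key.
        grouped[link.get("link_type", "unknown")].append(link["href"])
    return dict(grouped)


# Hypothetical sample shaped like parser.navigable_links; all values are made up.
sample_links = [
    {"href": "https://example.com/about", "text": "About",
     "title": "", "link_type": "internal", "rel": []},
    {"href": "https://other.example.org/", "text": "Partner",
     "title": "", "link_type": "external", "rel": ["nofollow"]},
    {"href": "https://example.com/contact", "text": "Contact",
     "title": "", "link_type": "internal", "rel": []},
]

grouped = group_by_link_type(sample_links)
print(grouped["internal"])
print(grouped["external"])
```

With a real parser, `group_by_link_type(parser.navigable_links)` would be the equivalent call, provided the actual `link_type` values match what your code expects.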

Properties

| Property | Type | Description |
| --- | --- | --- |
| text | str | Plain text with newline separators |
| links | list[str] | Deduplicated absolute HTTP/HTTPS URLs |
| navigable_links | list[dict] | Classified links with href, text, title, link_type, rel |
| markdown | str | Full-page Markdown (standard preprocessing) |
| markdown_for_llm | str | Cleaned Markdown (nav, ads, forms stripped) |

Each property is computed lazily on first access and cached for subsequent reads.
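
To make the caching behaviour concrete, here is a minimal stand-in built on `functools.cached_property`, the same decorator the custom-parser example uses. The `LazyDoc` class and its `words` property are hypothetical and not part of IntelliScraper; how HTMLParser implements its own caching is not shown here.

```python
from functools import cached_property


class LazyDoc:
    """Hypothetical stand-in showing lazy, cached properties."""

    def __init__(self, html: str) -> None:
        self.html = html
        self.compute_count = 0  # counts how often the expensive work runs

    @cached_property
    def words(self) -> list[str]:
        # Runs on first access only; the result is then stored on the instance.
        self.compute_count += 1
        return self.html.split()


doc = LazyDoc("<p>hello world</p>")
assert doc.compute_count == 0   # nothing is computed at construction time

first = doc.words
second = doc.words
assert first is second          # the same cached object is returned
assert doc.compute_count == 1   # the computation ran exactly once
```

If HTMLParser behaves the same way, one practical consequence is that a parser instance should be treated as a snapshot: when the HTML changes, build a new parser rather than reusing the old one.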

Creating Custom Parsers

Extend HTMLParser to add site-specific extraction logic:

from functools import cached_property
from intelliscraper.parsers import HTMLParser

class ProductPageParser(HTMLParser):
    """Parser for an e-commerce product page."""

    @cached_property
    def product_title(self) -> str | None:
        tag = self.soup.select_one("h1.product-title")
        return tag.get_text(strip=True) if tag else None

    @cached_property
    def price(self) -> str | None:
        tag = self.soup.select_one("span.price")
        return tag.get_text(strip=True) if tag else None

    @cached_property
    def description(self) -> str | None:
        tag = self.soup.select_one("div.product-description")
        return tag.get_text(separator="\n", strip=True) if tag else None
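
The `price` property above returns the element's raw text, which typically still carries a currency symbol. If a numeric value is needed, a small helper can normalise it. The sketch below is an illustration under stated assumptions: `parse_price` is a hypothetical helper, not part of IntelliScraper, and it assumes `.` is the decimal separator and `,` a thousands separator.

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Convert text such as "$1,299.00" to a Decimal, or None if no number is found."""
    if not raw:
        return None
    # Find the first run of digits, allowing thousands commas and a decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return Decimal(match.group().replace(",", ""))


print(parse_price("$1,299.00"))
print(parse_price("Call for price"))
```

Decimal is used instead of float so that monetary values are not subject to binary rounding. Sites that format prices differently (for example `1.299,00`) would need a different rule.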

Usage

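import asyncio

# Assumed import path for the scraper classes; adjust to match your installation.
from intelliscraper import AsyncScraper, ScrapStatus


async def main() -> None:
    async with AsyncScraper() as scraper:
        response = await scraper.scrape("https://shop.example.com/product/123")

        # Parse only when the scrape succeeded.
        if response.status == ScrapStatus.SUCCESS:
            parser = ProductPageParser(
                url=response.scrape_request.url,
                html=response.scrap_html_content,
            )
            print(f"Title: {parser.product_title}")
            print(f"Price: {parser.price}")


asyncio.run(main())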