Parsers

Content parsers for extracting structured data from HTML.

BaseParser (ABC)

Abstract base class for content parsers.

All parsers in IntelliScraper extend BaseParser to provide a consistent interface for extracting text, links, and Markdown from scraped HTML content.

To create a site-specific parser, subclass HTMLParser (which already implements BaseParser) and add your custom extraction logic as @cached_property methods.

Example:

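from functools import cached_property

from intelliscraper.parsers.html_parser import HTMLParser

class MyCustomParser(HTMLParser):
    @cached_property
    def product_title(self) -> str | None:
        # Return the stripped text of the first matching heading, or None if absent.
        tag = self.soup.select_one("h1.product-title")
        return tag.get_text(strip=True) if tag else None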

class intelliscraper.parsers.base_parser.BaseParser(url, html)

Bases: ABC

Abstract base for all content parsers.

Defines the minimum interface that every parser must implement. Concrete parsers should extend HTMLParser rather than this class directly, unless they need a fundamentally different parsing engine.

Parameters:

url (str) – The source URL of the scraped page. Used for normalising relative links.
html (str) – Raw HTML string to parse.

abstractmethod __init__(url, html)

Initialize the parser with URL and HTML content.

Parameters:

url (str)
html (str)

Return type:

None

property text: str

Extract plain text from the HTML content.

property links: list[str]

Extract all normalised href values from <a> tags.

property markdown: str

Convert the HTML to Markdown.
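The interface described above can be sketched with the standard abc module. This is a minimal illustration and not the library's source: the names ParserInterface and EchoParser are hypothetical, and declaring the three properties as abstract is an assumption about how BaseParser enforces its contract.

```python
from abc import ABC, abstractmethod


class ParserInterface(ABC):
    """Hypothetical stand-in for BaseParser (not the library's source)."""

    @abstractmethod
    def __init__(self, url: str, html: str) -> None:
        self.url = url
        self.html = html

    @property
    @abstractmethod
    def text(self) -> str: ...

    @property
    @abstractmethod
    def links(self) -> list[str]: ...

    @property
    @abstractmethod
    def markdown(self) -> str: ...


class EchoParser(ParserInterface):
    """Toy concrete parser that returns the raw HTML unchanged."""

    def __init__(self, url: str, html: str) -> None:
        super().__init__(url, html)

    @property
    def text(self) -> str:
        return self.html

    @property
    def links(self) -> list[str]:
        return []

    @property
    def markdown(self) -> str:
        return self.html
```

In this sketch, instantiating ParserInterface directly raises TypeError, while EchoParser can be created because it overrides every abstract member.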

HTMLParser

General-purpose HTML parser.

Parses raw HTML content and provides access to plain text, links, Markdown, and LLM-optimised Markdown (with boilerplate stripped).

This is the default parser used by IntelliScraper and the recommended base class for site-specific parsers.

Example:

from intelliscraper.parsers.html_parser import HTMLParser

html_string = "<h1>Hello</h1><a href='/about'>About</a>"  # any non-empty HTML string
parser = HTMLParser(url="https://example.com", html=html_string)
print(parser.text)               # plain text
print(parser.links)              # list of absolute URLs
print(parser.markdown)           # full Markdown
print(parser.markdown_for_llm)   # cleaned Markdown for LLM input
print(parser.navigable_links)    # classified internal/external links

class intelliscraper.parsers.html_parser.HTMLParser(url, html, html_parser_type=HTMLParserType.HTML5LIB)

Bases: BaseParser

General-purpose HTML parser with text, link, and Markdown extraction.

Wraps BeautifulSoup for DOM querying and html-to-markdown for Markdown conversion. Provides both standard and LLM-optimised Markdown outputs.

All properties are lazily computed and cached on first access.
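One common way to get lazy, cached properties in Python is functools.cached_property, which is also what the subclassing guidance on this page recommends. The sketch below shows the behaviour on a hypothetical LazyDoc class; whether HTMLParser uses this exact mechanism internally is an assumption.

```python
from functools import cached_property


class LazyDoc:
    """Hypothetical class used only to illustrate lazy, cached properties."""

    def __init__(self, html: str) -> None:
        self.html = html
        self.compute_count = 0  # counts how often the expensive step runs

    @cached_property
    def word_count(self) -> int:
        self.compute_count += 1
        return len(self.html.split())


doc = LazyDoc("one two three")
first = doc.word_count   # computed here, on first access
second = doc.word_count  # served from the per-instance cache
```

Nothing is computed when the object is constructed, and repeated access does not run the method body again.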

Parameters:

url (str) – The source URL of the page (used for link normalisation).
html (str) – Raw HTML string. Must be a non-empty string.
html_parser_type (HTMLParserType) – BeautifulSoup parser backend. Defaults to HTMLParserType.HTML5LIB.

Raises:

HTMLParserInputError – If html is empty or not a string.

Example:

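from intelliscraper.parsers.html_parser import HTMLParser

# `response` is assumed to be a scrape result obtained elsewhere.
parser = HTMLParser(
    url="https://example.com/page",
    html=response.scrap_html_content,
)
print(parser.text)
print(parser.links)
print(parser.markdown_for_llm)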

property text: str

Plain text extracted from the HTML.

Uses BeautifulSoup’s get_text() with newline separators and whitespace stripping.

property links: list[str]

All normalised href values from <a> tags.

Relative URLs are resolved against the source URL. Duplicates and fragment-only links are removed.

Returns:

A deduplicated list of absolute HTTP/HTTPS URLs.
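The normalisation rules described for links can be approximated with the standard library. The following is a rough sketch and not the library's implementation: normalise_links is a hypothetical helper, and it takes a plain list of href strings instead of a parsed document.

```python
from urllib.parse import urljoin, urlparse


def normalise_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url, keeping unique HTTP/HTTPS URLs in order."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # skip empty and fragment-only links
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Keeping a separate seen set preserves the order in which links first appear, which a plain set() conversion would not.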

property navigable_links: list[dict]

Internal and external page links, classified and normalised.

Skips anchors (#fragment), mailto:, tel:, javascript:, and resource links (CSS/JS).

Returns:

A list of dicts, each with keys:

href – absolute URL
text – visible link label
title – title attribute or None
link_type – "Internal" or "External"
rel – list of rel values (e.g. ["nofollow"])

Return type:

list[dict]
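To make the shape of each entry concrete, here is a small sketch that builds one such dict. The classify_link helper is hypothetical, and deciding "Internal" versus "External" by comparing hosts is an assumption; the library may apply a different rule, and the resource-link (CSS/JS) filter is omitted here.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

# Prefixes the documentation says are skipped (resource links are not handled here).
SKIPPED_PREFIXES = ("#", "mailto:", "tel:", "javascript:")


def classify_link(
    page_url: str,
    href: str,
    text: str = "",
    title: Optional[str] = None,
    rel: Optional[list] = None,
) -> Optional[dict]:
    """Return a navigable-link dict, or None if the href should be skipped."""
    href = href.strip()
    if not href or href.lower().startswith(SKIPPED_PREFIXES):
        return None
    absolute = urljoin(page_url, href)
    # Assumption: a link is internal when its host matches the page's host.
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return {
        "href": absolute,
        "text": text,
        "title": title,
        "link_type": "Internal" if same_host else "External",
        "rel": rel or [],
    }
```

A caller could then filter the resulting list on link_type to crawl only same-site pages.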

property markdown: str

Full-page Markdown with standard preprocessing.

Preserves navigation, forms, and page structure.

property markdown_for_llm: str

Markdown with nav, ads, forms, and boilerplate stripped.

Optimised for use as LLM input: elements that add noise without informational value are removed.
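The idea of dropping noisy elements before conversion can be illustrated with the standard library alone. This sketch is not how the library works internally: NoiseStripper and strip_noise are hypothetical names, the set of tags treated as noise is an assumption, and the output is plain text, not Markdown.

```python
from html.parser import HTMLParser as StdHTMLParser

# Assumed set of elements treated as boilerplate for this illustration.
NOISE_TAGS = {"nav", "form", "aside", "footer", "script", "style"}


class NoiseStripper(StdHTMLParser):
    """Collects text that is not nested inside any noise element."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # how many noise elements currently enclose the cursor
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_noise(html: str) -> str:
    """Return the visible text of html with noise elements removed."""
    stripper = NoiseStripper()
    stripper.feed(html)
    stripper.close()
    return "\n".join(stripper.chunks)
```

The depth counter lets nested noise elements (for example a form inside a nav) be skipped as one block.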