Parsers
Content parsers for extracting structured data from HTML.
BaseParser (ABC)
Abstract base class for content parsers.
All parsers in IntelliScraper extend BaseParser to provide a
consistent interface for extracting text, links, and Markdown from
scraped HTML content.
To create a site-specific parser, subclass HTMLParser (which
already implements BaseParser) and add your custom extraction
logic as @cached_property methods.
Example:
from functools import cached_property
from intelliscraper.parsers.html_parser import HTMLParser

class MyCustomParser(HTMLParser):
    @cached_property
    def product_title(self) -> str | None:
        tag = self.soup.select_one("h1.product-title")
        return tag.get_text(strip=True) if tag else None
- class intelliscraper.parsers.base_parser.BaseParser(url, html)
Bases: ABC
Abstract base for all content parsers.
Defines the minimum interface that every parser must implement. Concrete parsers should extend HTMLParser rather than this class directly, unless they need a fundamentally different parsing engine.
- Parameters:
HTMLParser
General-purpose HTML parser.
Parses raw HTML content and provides access to plain text, links, Markdown, and LLM-optimised Markdown (with boilerplate stripped).
This is the default parser used by IntelliScraper and the recommended base class for site-specific parsers.
Example:
parser = HTMLParser(url="https://example.com", html=html_string)
print(parser.text) # plain text
print(parser.links) # list of absolute URLs
print(parser.markdown) # full Markdown
print(parser.markdown_for_llm) # cleaned Markdown for LLM input
print(parser.navigable_links) # classified internal/external links
- class intelliscraper.parsers.html_parser.HTMLParser(url, html, html_parser_type=HTMLParserType.HTML5LIB)
Bases: BaseParser
General-purpose HTML parser with text, link, and Markdown extraction.
Wraps BeautifulSoup for DOM querying and html-to-markdown for Markdown conversion. Provides both standard and LLM-optimised Markdown outputs.
All properties are lazily computed and cached on first access.
- Parameters:
url (str) – The source URL of the page (used for link normalisation).
html (str) – Raw HTML string. Must be a non-empty string.
html_parser_type (HTMLParserType) – BeautifulSoup parser backend. Defaults to HTMLParserType.HTML5LIB.
- Raises:
HTMLParserInputError – If html is empty or not a string.
Example:
parser = HTMLParser(
    url="https://example.com/page",
    html=response.scrap_html_content,
)
print(parser.text)
print(parser.links)
print(parser.markdown_for_llm)
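The class description says every property is lazily computed and cached on first access. A minimal sketch of that pattern using the standard library's functools.cached_property follows. LazyDocument, word_count, and compute_count are hypothetical names chosen for illustration and are not part of IntelliScraper.

```python
from functools import cached_property


class LazyDocument:
    """Hypothetical stand-in for a parser; not an IntelliScraper class."""

    def __init__(self, html: str) -> None:
        self.html = html
        self.compute_count = 0  # how many times the expensive step has run

    @cached_property
    def word_count(self) -> int:
        # Runs only on first access; the result is then stored on the instance.
        self.compute_count += 1
        return len(self.html.split())


doc = LazyDocument("<p>hello lazy world</p>")
assert doc.compute_count == 0  # nothing is computed at construction time

first = doc.word_count   # triggers the computation
second = doc.word_count  # served from the cache
assert first == second
assert doc.compute_count == 1
```

A property that is never read costs nothing under this pattern, which presumably matters when only one of several outputs is needed for a given page.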
- property text: str
Plain text extracted from the HTML.
Uses BeautifulSoup’s get_text() with newline separators and whitespace stripping.
- property links: list[str]
All normalised href values from <a> tags.
Relative URLs are resolved against the source URL. Duplicates and fragment-only links are removed.
- Returns:
A deduplicated list of absolute HTTP/HTTPS URLs.
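The normalisation described for links (resolve relative URLs, drop fragment-only links, deduplicate, keep only HTTP/HTTPS) can be approximated with the standard library alone. The sketch below is an illustration under that assumption, not IntelliScraper's implementation; extract_links and _HrefCollector are hypothetical names.

```python
from html.parser import HTMLParser as StdlibHTMLParser
from urllib.parse import urljoin, urlparse


class _HrefCollector(StdlibHTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url: str, html: str) -> list:
    """Return deduplicated absolute HTTP/HTTPS URLs in document order."""
    collector = _HrefCollector()
    collector.feed(html)

    seen = set()
    links = []
    for href in collector.hrefs:
        if href.startswith("#"):
            continue  # fragment-only link
        absolute = urljoin(base_url, href)  # resolve relative URLs
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # e.g. mailto: or tel:
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


sample = (
    '<a href="/a">A</a><a href="/a">A again</a>'
    '<a href="#top">Top</a><a href="https://other.org/x">X</a>'
    '<a href="mailto:me@example.com">Mail</a>'
)
result = extract_links("https://example.com/page", sample)
assert result == ["https://example.com/a", "https://other.org/x"]
```

Keeping a separate seen set alongside the list preserves document order while deduplicating.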
- property navigable_links
Internal and external page links, classified and normalised.
Skips anchors (#fragment), mailto:, tel:, javascript:, and resource links (CSS/JS).
- Returns:
A list of dicts, each with keys:
href – absolute URL
text – visible link label
title – title attribute or None
link_type – "Internal" or "External"
rel – list of rel values (e.g. ["nofollow"])
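The documentation does not say how a link is judged "Internal" or "External". One plausible rule, assumed here purely for illustration, is an exact host comparison against the page URL. classify_link and SKIPPED_PREFIXES are hypothetical names, and the sketch expects href to be already absolute; resource links (CSS/JS) are not handled.

```python
from typing import Optional
from urllib.parse import urlparse

# Schemes and anchors the documentation says are skipped.
SKIPPED_PREFIXES = ("#", "mailto:", "tel:", "javascript:")


def classify_link(page_url: str, href: str) -> Optional[str]:
    """Return "Internal", "External", or None for skipped links.

    Assumption: a link is internal when its host matches the page's
    host exactly. The library's real rule may differ (for example, it
    might treat subdomains as internal).
    """
    if href.lower().startswith(SKIPPED_PREFIXES):
        return None
    same_host = urlparse(href).netloc == urlparse(page_url).netloc
    return "Internal" if same_host else "External"


page = "https://example.com/page"
assert classify_link(page, "https://example.com/about") == "Internal"
assert classify_link(page, "https://other.org/") == "External"
assert classify_link(page, "mailto:me@example.com") is None
```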