Extractor API¶

Extraction¶

wpextract.WPExtractor ¶

WPExtractor(
    json_root: Path,
    scrape_root: Optional[Path] = None,
    json_prefix: Optional[str] = None,
    translation_pickers: Optional[PickerListType] = None,
)

Manages the extraction of data from a WordPress site.

PARAMETER	DESCRIPTION
`json_root`	Path to directory of JSON files TYPE: `Path`
`scrape_root`	Path to scrape directory TYPE: `Optional[Path]` DEFAULT: `None`
`json_prefix`	Prefix of files in `json_root` TYPE: `Optional[str]` DEFAULT: `None`
`translation_pickers`	Supply a custom list of translation pickers TYPE: `Optional[PickerListType]` DEFAULT: `None`

categories `instance-attribute` ¶

categories: Optional[DataFrame]

DataFrame of extracted categories.

link_registry `instance-attribute` ¶

link_registry: LinkRegistry = LinkRegistry()

Registry of known URLs and their corresponding data items.

media `instance-attribute` ¶

media: Optional[DataFrame]

DataFrame of extracted media.

pages `instance-attribute` ¶

pages: Optional[DataFrame]

DataFrame of extracted pages.

posts `instance-attribute` ¶

posts: Optional[DataFrame]

DataFrame of extracted posts.

tags `instance-attribute` ¶

tags: Optional[DataFrame]

DataFrame of extracted tags.

users `instance-attribute` ¶

users: Optional[DataFrame]

DataFrame of extracted users.

export ¶

export(out_dir: Path) -> None

Save the extraction results to the output directory.

PARAMETER	DESCRIPTION
`out_dir`	Path to output directory TYPE: `Path`

extract ¶

extract() -> None

Perform the extraction.
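The constructor, `extract` and `export` are typically used together. The following is a minimal sketch of that flow, based only on the signatures documented above. The helper names `run_extraction` and `row_counts` are hypothetical, and `wpextract` is imported lazily so that the counting helper can be tried without the package installed.

```python
from pathlib import Path
from typing import Optional

# Attribute names documented on WPExtractor; each is typed Optional[DataFrame].
TABLES = ("posts", "pages", "media", "tags", "categories", "users")


def run_extraction(json_root: Path, out_dir: Path, scrape_root: Optional[Path] = None):
    """Extract a site from json_root and write the results to out_dir."""
    # Lazy import: only this function needs wpextract to be installed.
    from wpextract import WPExtractor

    extractor = WPExtractor(json_root=json_root, scrape_root=scrape_root)
    extractor.extract()
    extractor.export(out_dir)
    return extractor


def row_counts(extractor) -> dict:
    """Return the number of rows per table, or None where a table is unset."""
    counts = {}
    for name in TABLES:
        table = getattr(extractor, name, None)
        counts[name] = None if table is None else len(table)
    return counts
```

A call such as `row_counts(run_extraction(Path("site_json"), Path("out")))` would then summarise what was extracted; both directory names here are placeholders.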

Extraction Data¶

wpextract.extractors.data.links.Link `dataclass` ¶

Link(text: Optional[str], href: Optional[str])

A link to a URL.

wpextract.extractors.data.links.LinkRegistry ¶

LinkRegistry()

A collection of all known links on the site.

wpextract.extractors.data.links.LinkRegistry.add_linkable ¶

add_linkable(
    url: str,
    data_type: str,
    idx: str,
    _refresh_cache: bool = True,
) -> None

Add a single linkable item to the registry.

The URL is later compared against the links that need to be resolved; when a link matches, the data type and IDX are returned.

Data types should be unique. IDXes should be unique within one or more data types.

PARAMETER	DESCRIPTION
`url`	The URL of the destination TYPE: `str`
`data_type`	A unique identifier for this type of item. TYPE: `str`
`idx`	A unique identifier within the data type. TYPE: `str`
`_refresh_cache`	Whether the link cache should be updated. Should be left as True unless multiple links are being added together. TYPE: `bool` DEFAULT: `True`

wpextract.extractors.data.links.LinkRegistry.add_linkables ¶

add_linkables(
    data_type: str, links: list[str], idxes: list[str]
) -> None

Add multiple linkable items at once.

PARAMETER	DESCRIPTION
`data_type`	The data type for all items. TYPE: `str`
`links`	A list of links. Must be the same length as idxes. TYPE: `list[str]`
`idxes`	A list of IDs. Must be the same length as links. TYPE: `list[str]`

RAISES	DESCRIPTION
`ValueError`	If the `links` and `idxes` lists are not the same length.
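Because `add_linkables` takes two parallel lists, one way to avoid the length mismatch is to keep each URL and its ID together until the call is made. The sketch below assumes only the documented `add_linkables(data_type, links, idxes)` signature; the helper name `register_pairs` is hypothetical.

```python
def register_pairs(registry, data_type, pairs):
    """Register (url, idx) pairs with a LinkRegistry-like object.

    Both lists are derived from the same pairs, so they always have
    equal length when passed to add_linkables.
    """
    links = [url for url, _ in pairs]
    idxes = [idx for _, idx in pairs]
    registry.add_linkables(data_type, links, idxes)
```

With a real registry this might be called as `register_pairs(LinkRegistry(), "post", [("https://example.org/a/", "1")])`, where the URL and ID are placeholders.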

wpextract.extractors.data.links.LinkRegistry.query_link ¶

query_link(href: str) -> Optional[Linkable]

Find a linkable item by the URL in the registry.

Returns None if no URL matches.

PARAMETER	DESCRIPTION
`href`	A URL to search TYPE: `str`

RETURNS	DESCRIPTION
`Optional[Linkable]`	A matching linkable
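Since `query_link` returns `None` when nothing matches, callers need to handle both outcomes. A minimal sketch, assuming only the documented return type and the `data_type` and `idx` fields of `Linkable`; the helper name `resolve_hrefs` is hypothetical.

```python
def resolve_hrefs(registry, hrefs):
    """Map each href to (data_type, idx), or to None if it is not registered."""
    resolved = {}
    for href in hrefs:
        linkable = registry.query_link(href)  # None when no URL matches
        if linkable is None:
            resolved[href] = None
        else:
            resolved[href] = (linkable.data_type, linkable.idx)
    return resolved
```

Keeping unresolved links in the result as `None`, rather than dropping them, makes it easy to see which URLs point outside the extracted site.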

wpextract.extractors.data.links.Linkable `dataclass` ¶

Linkable(link: str, data_type: str, idx: str)

An item which can be linked to.

wpextract.extractors.data.links.ResolvableLink `dataclass` ¶

ResolvableLink(
    text: Optional[str],
    href: Optional[str],
    destination: Optional[Linkable],
)

Bases: Link

A link to a URL which can be looked up against known links.

Multilingual Extraction¶

wpextract.parse.translations.LangPicker ¶

LangPicker(page_doc: BeautifulSoup)

Bases: ABC

Abstract class of a language picker style.

Support for a new language picker can be added by creating a new class inheriting from this one.

current_language `instance-attribute` ¶

current_language: Language

The current language of the page, populated by calling LangPicker.set_current_lang within LangPicker.extract.

page_doc `instance-attribute` ¶

page_doc: BeautifulSoup = page_doc

The document to extract the language picker from.

root_el `instance-attribute` ¶

root_el: Tag

The root element of the language picker, populated if LangPicker.matches is successful.

translations `instance-attribute` ¶

translations: list[TranslationLink] = []

A list of translation links, populated by calling LangPicker.add_translation within LangPicker.extract.

add_translation ¶

add_translation(href: str, lang: str) -> None

Add a translation from the picker.

PARAMETER	DESCRIPTION
`href`	The link to the translated page. TYPE: `str`
`lang`	The provided language code. TYPE: `str`

extract `abstractmethod` ¶

extract() -> None

Extract the current language and translations from the doc.

Instead of directly selecting on root_el, consider using the helper methods _root_select and _root_select_one to extract elements. These are equivalent to calling select or select_one directly, but raise a formatted error if the element is not found. Don't use these helpers where finding no results is an expected outcome, e.g. a post may have no translations.

If using other selectors, you can construct the exception using the helper _build_extraction_fail_err.

RAISES	DESCRIPTION
`ExtractionFailedError`	If the picker is unable to find an element it expects to be present.

get_root `abstractmethod` ¶

get_root() -> Optional[Tag]

Retrieve the root element of the translation picker.

Using the LangPicker.page_doc attribute (a bs4.BeautifulSoup object representing the whole page), the root element of the picker should be found and returned.

RETURNS	DESCRIPTION
`Optional[Tag]`	The root element, or None if this picker is not found on the page.

matches ¶

matches() -> bool

Checks if this picker can extract from the document.

RETURNS	DESCRIPTION
`bool`	Whether the page uses this type of picker.

RAISES	DESCRIPTION
`TypeError`	If the root element that has been retrieved is not a tag, or has 0 children. This may happen if it accidentally retrieves a text node.

set_current_lang ¶

set_current_lang(lang: str) -> None

Set the language of this doc.

PARAMETER	DESCRIPTION
`lang`	The locale string TYPE: `str`
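Putting the pieces of this class together, a custom picker might look like the sketch below. The page markup it targets is an assumption made for illustration: a `ul.lang-switcher` list whose links carry a `lang` attribute, with the current language marked by `li.current`. The names `build_list_picker` and `ListPicker` are hypothetical, and the class is built inside a function only so that `wpextract` is imported lazily.

```python
def build_list_picker():
    """Return a LangPicker subclass for a hypothetical list-style picker."""
    # Lazy import: defining the subclass requires wpextract to be installed.
    from wpextract.parse.translations import LangPicker

    class ListPicker(LangPicker):
        def get_root(self):
            # Returning None signals that this picker is not on the page.
            return self.page_doc.select_one("ul.lang-switcher")

        def extract(self):
            # The current-language entry is expected to exist, so use the
            # helper that raises ExtractionFailedError when it is missing.
            current = self._root_select_one("li.current a")
            self.set_current_lang(current["lang"])

            # A page may have no translations, so select directly on
            # root_el instead of using the raising helper.
            for link in self.root_el.select("li:not(.current) a"):
                self.add_translation(link["href"], link["lang"])

    return ListPicker
```

The returned class, rather than an instance, could then be supplied in the `translation_pickers` list of `WPExtractor`, whose type is a list of `LangPicker` subclasses.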

_root_select_one ¶

_root_select_one(selector: str) -> Tag

Helper to extract an element from the root element.

PARAMETER	DESCRIPTION
`selector`	a CSS selector to be passed to `Tag.select_one` TYPE: `str`

RAISES	DESCRIPTION
`ExtractionFailedError`	If the element was not found. This indicates this picker was activated when it should not have been.

RETURNS	DESCRIPTION
`Tag`	The element found by the selector.

_root_select ¶

_root_select(selector: str) -> Sequence[Tag]

Helper to extract elements from the root element.

PARAMETER	DESCRIPTION
`selector`	a CSS selector to be passed to `Tag.select` TYPE: `str`

RAISES	DESCRIPTION
`ExtractionFailedError`	If no matching elements were found. This indicates this picker was activated when it should not have been.

RETURNS	DESCRIPTION
`Sequence[Tag]`	The elements found by the selector.

_build_extraction_fail_err ¶

_build_extraction_fail_err(
    selector: str,
) -> ExtractionFailedError

Create an error for when an expected element is missing.

PARAMETER	DESCRIPTION
`selector`	a string describing the attempted selection criteria TYPE: `str`

RETURNS	DESCRIPTION
`ExtractionFailedError`	An instance of the exception to be raised.

wpextract.parse.translations.PickerListType `module-attribute` ¶

PickerListType = list[type[LangPicker]]

wpextract.parse.translations.TranslationLink `dataclass` ¶

TranslationLink(
    text: Optional[str],
    href: Optional[str],
    destination: Optional[Linkable],
    lang: Union[str, Language],
)

Bases: ResolvableLink

A link to an alternative version of this article in a different language.

destination `instance-attribute` ¶

destination: Optional[Linkable]

href `instance-attribute` ¶

href: Optional[str]

lang `instance-attribute` ¶

lang: Union[str, Language]

Raw language code, or existing language object if derived from another source.

language `property` ¶

language: Language

Parsed and normalized language. Populated automatically post-init.
