Extractor API

Extraction

wpextract.WPExtractor

WPExtractor(
    json_root: Path,
    scrape_root: Optional[Path] = None,
    json_prefix: Optional[str] = None,
    translation_pickers: Optional[PickerListType] = None,
)

Manages the extraction of data from a WordPress site.

PARAMETER DESCRIPTION
json_root

Path to directory of JSON files

TYPE: Path

scrape_root

Path to scrape directory

TYPE: Optional[Path] DEFAULT: None

json_prefix

Prefix of files in json_root

TYPE: Optional[str] DEFAULT: None

translation_pickers

Supply a custom list of translation pickers

TYPE: Optional[PickerListType] DEFAULT: None

categories instance-attribute

categories: Optional[DataFrame]

DataFrame of extracted categories.

link_registry: LinkRegistry = LinkRegistry()

Registry of known URLs and their corresponding data items.

media instance-attribute

DataFrame of extracted media.

pages instance-attribute

DataFrame of extracted pages.

posts instance-attribute

DataFrame of extracted posts.

tags instance-attribute

DataFrame of extracted tags.

users instance-attribute

DataFrame of extracted users.

export

export(out_dir: Path) -> None

Save the extraction results to the output directory.

PARAMETER DESCRIPTION
out_dir

Path to output directory

TYPE: Path

extract

extract() -> None

Perform the extraction.
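A minimal sketch of how the constructor, extract and export described above might be combined. The function name run_extraction is hypothetical, and the None check assumes posts is typed Optional[DataFrame] in the same way categories is. The import sits inside the function so the sketch can be loaded where wpextract is not installed.

```python
from pathlib import Path
from typing import Optional


def run_extraction(
    json_root: Path, out_dir: Path, scrape_root: Optional[Path] = None
) -> None:
    """Extract a site from json_root and write the results to out_dir."""
    # Deferred import: only needed when the function is actually called.
    from wpextract import WPExtractor

    extractor = WPExtractor(json_root=json_root, scrape_root=scrape_root)
    extractor.extract()

    # Assumption: posts may be None, as documented for categories.
    if extractor.posts is not None:
        print(f"Extracted {len(extractor.posts)} posts")

    # Precaution only: this page does not say whether export creates out_dir.
    out_dir.mkdir(parents=True, exist_ok=True)
    extractor.export(out_dir)
```

It could then be called as run_extraction(Path("site_json"), Path("out")), where both paths are placeholders.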

Extraction Data

Link(text: Optional[str], href: Optional[str])

A link to a URL.

wpextract.extractors.data.links.LinkRegistry

LinkRegistry()

A collection of all known links on the site.

wpextract.extractors.data.links.LinkRegistry.add_linkable

add_linkable(
    url: str,
    data_type: str,
    idx: str,
    _refresh_cache: bool = True,
) -> None

Add a single linkable item to the registry.

The URL will be compared later against a list of links that need to be resolved and the data type and IDX will be returned.

Data types should be unique. IDXes should be unique within one or more data types.

PARAMETER DESCRIPTION
url

The URL of the destination

TYPE: str

data_type

A unique identifier for this type of item.

TYPE: str

idx

A unique identifier within the data type.

TYPE: str

_refresh_cache

Whether the link cache should be updated. Should be left as True unless multiple links are being added together.

TYPE: bool DEFAULT: True

wpextract.extractors.data.links.LinkRegistry.add_linkables

add_linkables(
    data_type: str, links: list[str], idxes: list[str]
) -> None

Add multiple linkable items at once.

PARAMETER DESCRIPTION
data_type

The data type for all items.

TYPE: str

links

A list of links. Must be the same length as idxes.

TYPE: list[str]

idxes

A list of IDs. Must be the same length as links.

TYPE: list[str]

RAISES DESCRIPTION
ValueError

If the links and idxes lists are not the same length.

query_link(href: str) -> Optional[Linkable]

Find a linkable item by the URL in the registry.

Returns None if no URL matches.

PARAMETER DESCRIPTION
href

A URL to search

TYPE: str

RETURNS DESCRIPTION
Optional[Linkable]

A matching linkable
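To illustrate the contract of add_linkables and query_link described above, here is a simplified stand-in written in plain Python. It is not the library's implementation: SketchRegistry and SketchLinkable are hypothetical names, and matching URLs by exact string equality is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchLinkable:
    """Stand-in mirroring the documented Linkable fields."""

    link: str
    data_type: str
    idx: str


class SketchRegistry:
    """Stand-in registry: maps each URL to the item it points at."""

    def __init__(self) -> None:
        self._by_url: dict[str, SketchLinkable] = {}

    def add_linkables(
        self, data_type: str, links: list[str], idxes: list[str]
    ) -> None:
        # Documented behaviour: mismatched lengths raise ValueError.
        if len(links) != len(idxes):
            raise ValueError("links and idxes must be the same length")
        for link, idx in zip(links, idxes):
            self._by_url[link] = SketchLinkable(link, data_type, idx)

    def query_link(self, href: str) -> Optional[SketchLinkable]:
        # Documented behaviour: None when no URL matches.
        # Assumption: exact string comparison, with no URL normalization.
        return self._by_url.get(href)


registry = SketchRegistry()
registry.add_linkables(
    "post",
    ["https://example.org/hello/", "https://example.org/about/"],
    ["1", "2"],
)
match = registry.query_link("https://example.org/about/")
missing = registry.query_link("https://example.org/nope/")
```

Here match carries the data type "post" and idx "2", while missing is None because that URL was never registered.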

wpextract.extractors.data.links.Linkable dataclass

Linkable(link: str, data_type: str, idx: str)

An item which can be linked to.

ResolvableLink(
    text: Optional[str],
    href: Optional[str],
    destination: Optional[Linkable],
)

Bases: Link

A link to a URL which can be looked up against known links.

Multilingual Extraction

wpextract.parse.translations.LangPicker

LangPicker(page_doc: BeautifulSoup)

Bases: ABC

Abstract class of a language picker style.

Support for a new language picker can be added by creating a new class inheriting from this one.

PARAMETER DESCRIPTION
page_doc

The document to extract a language picker from.

TYPE: BeautifulSoup

current_language instance-attribute

current_language: Language

The current language of the page, populated by calling LangPicker.set_current_lang within LangPicker.extract.

page_doc instance-attribute

page_doc: BeautifulSoup = page_doc

The document to extract the language picker from.

root_el instance-attribute

root_el: Tag

The root element of the language picker, populated if LangPicker.matches is successful.

translations instance-attribute

translations: list[TranslationLink] = []

A list of translation links, populated by calling LangPicker.add_translation within LangPicker.extract.

add_translation

add_translation(href: str, lang: str) -> None

Add a translation from the picker.

PARAMETER DESCRIPTION
href

The link to the translated page.

TYPE: str

lang

The provided language code.

TYPE: str

extract abstractmethod

extract() -> None

Extract the current language and translations from the doc.

Instead of selecting directly on root_el, consider using the helper methods _root_select and _root_select_one to extract elements. These are equivalent to calling select or select_one directly, but raise a formatted error if the element is not found. Don't use these methods where an empty result is an expected outcome, e.g. a post may have no translations.

If using other selectors, you can construct the exception using the helper _build_extraction_fail_err.

RAISES DESCRIPTION
ExtractionFailedError

If the picker is unable to find an element it expects to be present.

get_root abstractmethod

get_root() -> Optional[Tag]

Retrieve the root element of the translation picker.

Using the LangPicker.page_doc attribute (a bs4.BeautifulSoup object representing the whole page), the root element of the picker should be found and returned.

RETURNS DESCRIPTION
Optional[Tag]

The root element, or None if this picker is not found on the page.

matches

matches() -> bool

Checks if this picker can extract from the document.

RETURNS DESCRIPTION
bool

Whether the page uses this type of picker.

RAISES DESCRIPTION
TypeError

If the root element that has been retrieved is not a tag, or has 0 children. This may happen if it accidentally retrieves a text node.

set_current_lang

set_current_lang(lang: str) -> None

Set the language of this doc.

PARAMETER DESCRIPTION
lang

The locale string

TYPE: str

_root_select_one

_root_select_one(selector: str) -> Tag

Helper to extract an element from the root element.

PARAMETER DESCRIPTION
selector

A CSS selector to be passed to Tag.select_one

TYPE: str

RAISES DESCRIPTION
ExtractionFailedError

If the element was not found. This indicates this picker was activated when it should not have been.

RETURNS DESCRIPTION
Tag

The element found by the selector.

_root_select

_root_select(selector: str) -> Sequence[Tag]

Helper to extract elements from the root element.

PARAMETER DESCRIPTION
selector

A CSS selector to be passed to Tag.select

TYPE: str

RAISES DESCRIPTION
ExtractionFailedError

If no matching elements were found. This indicates this picker was activated when it should not have been.

RETURNS DESCRIPTION
Sequence[Tag]

The elements found by the selector.

_build_extraction_fail_err

_build_extraction_fail_err(
    selector: str,
) -> ExtractionFailedError

Create an error for when an expected element is missing.

PARAMETER DESCRIPTION
selector

A string describing the attempted selection criteria

TYPE: str

RETURNS DESCRIPTION
ExtractionFailedError

An instance of the exception to be raised.
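Putting the methods above together, a custom picker might look like the following sketch. The markup it targets (a ul.lang-switcher list whose current entry is li.current and whose links carry a lang attribute) is entirely hypothetical, as are the names build_switcher_picker and SwitcherPicker. The class is built inside a function so that wpextract and bs4 are only imported when it is called.

```python
from typing import Optional


def build_switcher_picker():
    """Return a LangPicker subclass for a hypothetical ul.lang-switcher."""
    # Deferred imports: the sketch loads even without these packages.
    from bs4 import Tag
    from wpextract.parse.translations import LangPicker

    class SwitcherPicker(LangPicker):
        def get_root(self) -> Optional[Tag]:
            # Returning None signals that this picker is not on the page.
            return self.page_doc.select_one("ul.lang-switcher")

        def extract(self) -> None:
            # The current-language entry must exist, so use the helper
            # that raises ExtractionFailedError when nothing is found.
            current = self._root_select_one("li.current a")
            self.set_current_lang(current["lang"])

            # A page may have no translations, so select on root_el
            # directly instead of through the raising helper.
            for link in self.root_el.select("li:not(.current) a"):
                self.add_translation(link["href"], link["lang"])

    return SwitcherPicker
```

Since PickerListType is a list of classes rather than instances, the result would be passed as translation_pickers=[build_switcher_picker()] when constructing WPExtractor.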

wpextract.parse.translations.PickerListType module-attribute

PickerListType = list[type[LangPicker]]

TranslationLink(
    text: Optional[str],
    href: Optional[str],
    destination: Optional[Linkable],
    lang: Union[str, Language],
)

Bases: ResolvableLink

A link to an alternative version of this article in a different language.

destination instance-attribute

destination: Optional[Linkable]

href instance-attribute

href: Optional[str]

lang instance-attribute

lang: Union[str, Language]

Raw language code, or existing language object if derived from another source.

language property

language: Language

Parsed and normalized language. Populated automatically post-init.

See Also

langcodes documentation