Extracting Links

reqwest select cat-network-programming cat-web-programming

Use reqwest::get⮳ to perform an HTTP GET request, then use select::document::Document::from_read⮳ to parse the response into an HTML document. Calling select::document::Document::find⮳ with select::predicate::Name⮳ set to "a" retrieves all links. Calling std::iter::Iterator::filter_map⮳ on the resulting select::selection::Selection⮳ then retrieves the URL from each link that has an "href" attribute, read with select::node::Node::attr⮳.

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
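
The role of filter_map in this recipe can be seen in isolation with a std-only sketch. The `Anchor` type, its `attr` method, and the `hrefs` function are hypothetical stand-ins for a parsed `<a>` node and are not the select crate's API. The point is only that anchors without an "href" are skipped instead of being unwrapped.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed `<a>` element (not the `select` API).
struct Anchor {
    attrs: HashMap<String, String>,
}

impl Anchor {
    /// Build an anchor from `(name, value)` attribute pairs.
    fn new(pairs: &[(&str, &str)]) -> Self {
        let attrs = pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Anchor { attrs }
    }

    /// Return the attribute value, or `None` if the attribute is absent.
    fn attr(&self, name: &str) -> Option<&str> {
        self.attrs.get(name).map(String::as_str)
    }
}

/// Collect the `href` of every anchor that has one, skipping the rest.
fn hrefs(anchors: &[Anchor]) -> Vec<&str> {
    anchors.iter().filter_map(|a| a.attr("href")).collect()
}
```

Because `attr` returns an `Option`, `filter_map` drops the `None` cases and unwraps the `Some` cases in a single pass, so a named anchor such as `<a name="top">` never produces an empty URL.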

reqwest select url cat-network-programming cat-web-programming

Call get_base_url to retrieve the base URL. If the document has a base tag, use its href attribute, read with select::node::Node::attr⮳. Otherwise, the original URL truncated at url::Position::BeforePath acts as the default.
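
The fallback logic can be sketched with std alone. Both functions below are hypothetical simplifications: `before_path` only approximates what a real URL parser does and assumes a `scheme://authority/...` shape, and `choose_base_url` takes the base tag's href as an already-extracted `Option`.

```rust
/// Return the part of `url` before its path, e.g. `https://host:port`.
/// A simplified, hypothetical stand-in for truncating a parsed URL before
/// its path; it assumes a `scheme://authority/...` shape.
fn before_path(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")? + 3;
    let rest = &url[scheme_end..];
    // The authority ends at the first '/', '?' or '#', or at end of input.
    let authority_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    if authority_len == 0 {
        return None; // no host, e.g. "http:///path"
    }
    Some(&url[..scheme_end + authority_len])
}

/// Prefer the `<base href>` value when the document has one; otherwise
/// fall back to the page URL truncated before its path.
fn choose_base_url(base_tag_href: Option<&str>, page_url: &str) -> Option<String> {
    match base_tag_href {
        Some(href) => Some(href.to_string()),
        None => before_path(page_url).map(str::to_string),
    }
}
```

With no base tag, a page fetched from `https://www.rust-lang.org/en-US/` would resolve relative links against `https://www.rust-lang.org` in this sketch.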

Iterate through the links in the document and, for each one, create a task with tokio::task::spawn⮳ that parses the link against the base URL using url::ParseOptions⮳. Each task requests its link with reqwest⮳ and checks the reqwest::StatusCode⮳ of the response. The program then awaits⮳ the completion of all tasks before exiting.

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
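
The spawn-then-wait shape of this recipe can be illustrated without an async runtime. The sketch below swaps in `std::thread::spawn` and `join` for tokio tasks and `await`, and `check_link` is a hypothetical stub that does no network I/O. It is meant to show the structure only, not the recipe's actual implementation.

```rust
use std::thread;

/// Hypothetical stand-in for the HTTP check: a real version would request
/// `link` and inspect the response status. Here a link counts as "ok"
/// when it uses http or https, so the sketch runs without a network.
fn check_link(link: &str) -> bool {
    link.starts_with("http://") || link.starts_with("https://")
}

/// Spawn one thread per link, then join them all and collect the results
/// in the original order.
fn check_all(links: Vec<String>) -> Vec<(String, bool)> {
    // First pass: start every check before waiting on any of them.
    let handles: Vec<_> = links
        .into_iter()
        .map(|link| {
            thread::spawn(move || {
                let ok = check_link(&link);
                (link, ok)
            })
        })
        .collect();

    // Second pass: wait for each check to finish.
    handles
        .into_iter()
        .map(|h| h.join().expect("link-check thread panicked"))
        .collect()
}
```

Collecting the handles before joining is what lets the checks overlap. Joining inside the first `map` would run them one at a time.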

reqwest regex cat-network-programming cat-web-programming

Pull the source of a MediaWiki page using reqwest::get⮳, then find all internal and external links with regex::Regex::captures_iter⮳. Using std::borrow::Cow⮳ avoids excessive std::string::String⮳ allocations.

MediaWiki link syntax is described here⮳.

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
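
How Cow saves allocations can be shown with a std-only sketch. The functions `normalize` and `internal_links` are hypothetical, and the hand-rolled scan stands in for the regex so that the example needs no external crates. It handles only `[[target]]` and `[[target|label]]` links and ignores external links and nested brackets.

```rust
use std::borrow::Cow;
use std::collections::HashSet;

/// Lower-case a link target, allocating only when it contains an
/// upper-case character; otherwise borrow the input unchanged.
fn normalize(target: &str) -> Cow<'_, str> {
    if target.chars().any(char::is_uppercase) {
        Cow::Owned(target.to_lowercase())
    } else {
        Cow::Borrowed(target)
    }
}

/// Collect the unique targets of `[[target]]` / `[[target|label]]` links.
/// A simple scan used in place of a regex so the sketch needs only std.
fn internal_links(wikitext: &str) -> HashSet<Cow<'_, str>> {
    let mut links = HashSet::new();
    let mut rest = wikitext;
    while let Some(start) = rest.find("[[") {
        let after = &rest[start + 2..];
        let end = match after.find("]]") {
            Some(end) => end,
            None => break, // unterminated link: stop scanning
        };
        let inner = &after[..end];
        // Text after '|' is the display label, not part of the target.
        let target = inner.split('|').next().unwrap_or(inner).trim();
        if !target.is_empty() {
            links.insert(normalize(target));
        }
        rest = &after[end + 2..];
    }
    links
}
```

Targets that are already lower-case stay borrowed slices of the page source, so a `String` is allocated only for targets that actually need rewriting. Borrowed and owned values with the same text compare and hash equal, which is why the set still deduplicates `[[Rust]]` and `[[rust]]`.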