# Extracting Links
| Recipe | Crates | Categories |
|---|---|---|
| Extract all Links from the HTML of a Webpage | | |
| Check a Webpage for Broken Links | | |
| Extract all Unique Links from a MediaWiki Markup | | |
## Extract all Links from the HTML of a Webpage
Use reqwest::get⮳ to perform an HTTP GET request, then use select::document::Document::from_read⮳ to parse the response into an HTML document. Calling select::document::Document::find⮳ with the select::predicate::Name⮳ criterion set to "a" retrieves all links. Calling std::iter::Iterator::filter_map⮳ on the resulting select::selection::Selection⮳ extracts the URLs of the links that have an "href" attribute (select::node::Node::attr⮳).
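A minimal sketch of this recipe, assuming reqwest (with a tokio runtime for the async reqwest::get) and select as dependencies; the URL is only an example:

```rust
use select::document::Document;
use select::predicate::Name;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Perform the HTTP GET request and read the body as text.
    let body = reqwest::get("https://www.rust-lang.org/en-US/")
        .await?
        .text()
        .await?;

    // Parse the response into an HTML document, keep every <a> element
    // that has an "href" attribute, and print the link targets.
    Document::from_read(body.as_bytes())?
        .find(Name("a"))
        .filter_map(|node| node.attr("href"))
        .for_each(|link| println!("{link}"));

    Ok(())
}
```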
## Check a Webpage for Broken Links
Call get_base_url to retrieve the base URL. If the document has a base tag, get its href select::node::Node::attr⮳; otherwise, the original URL acts as the default. Iterate through the links in the document and create a tokio::task::spawn⮳ task for each one that parses the individual link against the base URL with url::ParseOptions⮳. The task makes a request to the link with reqwest⮳ and verifies the reqwest::StatusCode⮳. Finally, the tasks await⮳ completion before the program ends.
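The sketch below follows that outline, assuming reqwest, select, url, and tokio as dependencies. get_base_url is the helper named above; check_link and the example URL are illustrative choices, not library APIs.

```rust
use std::collections::HashSet;

use reqwest::StatusCode;
use select::document::Document;
use select::predicate::Name;
use url::{Position, Url};

// If the document has a <base> tag, use its href; otherwise fall back to
// the original URL, truncated before its path.
fn get_base_url(url: &Url, doc: &Document) -> Result<Url, url::ParseError> {
    let base_tag_href = doc.find(Name("base")).filter_map(|n| n.attr("href")).next();
    base_tag_href.map_or_else(|| Url::parse(&url[..Position::BeforePath]), Url::parse)
}

// Request the link and report whether the server answered with something
// other than 200 OK (treated here as "broken").
async fn check_link(url: &Url) -> Result<bool, reqwest::Error> {
    let res = reqwest::get(url.as_ref()).await?;
    Ok(res.status() != StatusCode::OK)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = Url::parse("https://www.rust-lang.org/en-US/")?;
    let body = reqwest::get(url.as_ref()).await?.text().await?;
    let document = Document::from_read(body.as_bytes())?;

    // Resolve every href against the base URL and deduplicate.
    let base_url = get_base_url(&url, &document)?;
    let base_parser = Url::options().base_url(Some(&base_url));
    let links: HashSet<Url> = document
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .filter_map(|link| base_parser.parse(link).ok())
        .collect();

    // Spawn one task per link and check its status code concurrently.
    let mut tasks = vec![];
    for link in links {
        tasks.push(tokio::spawn(async move {
            if check_link(&link).await.unwrap_or(true) {
                println!("{link} is broken.");
            }
        }));
    }

    // Wait for all checks to finish before ending the program.
    for task in tasks {
        task.await?;
    }
    Ok(())
}
```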
## Extract all Unique Links from a MediaWiki Markup
Pull the source of a MediaWiki page using reqwest::get⮳, then look for all entries of internal and external links with regex::Regex::captures_iter⮳. Using std::borrow::Cow⮳ avoids excessive std::string::String⮳ allocations.
MediaWiki link syntax is described here⮳.
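A sketch along those lines, assuming regex, reqwest, and tokio as dependencies; the Wikipedia URL and the link-matching regular expression are illustrative:

```rust
use std::borrow::Cow;
use std::collections::HashSet;

use regex::Regex;

// Match both [[internal links]] and external URLs in MediaWiki markup.
fn extract_links(content: &str) -> HashSet<Cow<'_, str>> {
    let wiki_regex = Regex::new(
        r"(?x)
            \[\[(?P<internal>[^\[\]|]*)[^\[\]]*\]\]     # internal links
            |
            (url=|URL\||\[)(?P<external>http.*?)[\s|}]  # external links
        ",
    )
    .unwrap();

    wiki_regex
        .captures_iter(content)
        .map(|c| match (c.name("internal"), c.name("external")) {
            // Internal titles are normalized to lowercase, which requires an
            // owned String, hence Cow::Owned.
            (Some(val), None) => Cow::from(val.as_str().to_lowercase()),
            // External URLs are borrowed from the page source: no allocation.
            (None, Some(val)) => Cow::from(val.as_str()),
            _ => unreachable!(),
        })
        .collect()
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch the raw MediaWiki source of a page.
    let content = reqwest::get(
        "https://en.wikipedia.org/w/index.php?title=Rust_(programming_language)&action=raw",
    )
    .await?
    .text()
    .await?;

    println!("{:#?}", extract_links(&content));
    Ok(())
}
```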