Extracting Links

Recipes:
Extract all Links from the HTML of a Webpage
Check a Webpage for Broken Links
Extract all Unique Links from a MediaWiki Markup

Extract all Links from the HTML of a Webpage

Use reqwest::get to perform an HTTP GET request, then pass the response body to select::document::Document::from to parse it into an HTML document. Calling select::document::Document::find with the predicate select::predicate::Name("a") yields every anchor element. Applying std::iter::Iterator::filter_map to that iterator with select::node::Node::attr("href") keeps only the anchors that carry an href attribute and returns their URLs.

//! This example demonstrates how to extract all the links from a web page.
//!
//! `select` is a library to extract useful data from HTML documents, suitable
//! for web scraping.

use anyhow::Result;
use select::document::Document;
use select::predicate::Name;

#[tokio::main]
async fn main() -> Result<()> {
    let res = reqwest::get("https://www.rust-lang.org/en-US/")
        .await?
        .text()
        .await?;

    Document::from(res.as_str())
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .for_each(|x| println!("{}", x));

    Ok(())
}
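
The filtering step can be looked at in isolation. The following is a minimal sketch using only the standard library; `Element` and `hrefs` are hypothetical stand-ins for the nodes and the iterator chain in the recipe, not part of the `select` crate.

```rust
/// Hypothetical, minimal stand-in for a parsed HTML element.
/// The real recipe uses nodes from the `select` crate instead.
struct Element<'a> {
    name: &'a str,
    href: Option<&'a str>,
}

/// Returns the `href` of every `<a>` element that has one.
fn hrefs<'a>(elements: &[Element<'a>]) -> Vec<&'a str> {
    elements
        .iter()
        // Keep anchors only, like `find(Name("a"))` in the recipe.
        .filter(|e| e.name == "a")
        // `filter_map` drops the `None` values and unwraps the `Some` ones,
        // so anchors without an `href` disappear from the result.
        .filter_map(|e| e.href)
        .collect()
}
```

Because `filter_map` combines the test and the extraction, no separate check for a missing attribute is needed before reading it.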

Check a Webpage for Broken Links

Call get_base_url to determine the base URL. If the document contains a base tag, its href attribute (read with select::node::Node::attr) is used. Otherwise the original URL, truncated at url::Position::BeforePath, acts as the default.
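
The fallback rule can be sketched without any crates. The helpers `before_path` and `pick_base` below are hypothetical names, and the string slicing is only a rough approximation of what the `url` crate does with a fully parsed URL; it assumes an absolute URL containing `://`.

```rust
/// Returns the part of `url` that precedes the path (scheme plus authority),
/// or `None` if the input does not look like an absolute URL.
/// This is a simplified approximation, not the `url` crate's parser.
fn before_path(url: &str) -> Option<&str> {
    let authority_start = url.find("://")? + 3;
    let rest = &url[authority_start..];
    // The authority ends at the first path, query, or fragment delimiter.
    let authority_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    if authority_len == 0 {
        return None;
    }
    Some(&url[..authority_start + authority_len])
}

/// Prefers the `href` of a `<base>` tag and falls back to the page URL
/// truncated before its path.
fn pick_base<'a>(base_href: Option<&'a str>, page_url: &'a str) -> Option<&'a str> {
    base_href.or_else(|| before_path(page_url))
}
```

In this sketch, `pick_base(None, "https://www.rust-lang.org/en-US/")` yields `Some("https://www.rust-lang.org")`, while a present base href is returned unchanged.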

Each href found in the document is resolved against the base URL with url::ParseOptions and collected into a HashSet, which removes duplicates. For every link, tokio::task::spawn creates a task that requests the link with reqwest and inspects the reqwest::StatusCode; a 404 (Not Found) response marks the link as broken. The program awaits all tasks before exiting.

//! This example demonstrates how to check for broken links on a website.
//!
//! `select` is a library to extract useful data from HTML documents, suitable
//! for web scraping.

use std::collections::HashSet;

use anyhow::Result;
use reqwest::StatusCode;
use select::document::Document;
use select::predicate::Name;
use url::Position;
use url::Url;

/// Extracts the base URL from a document.
///
/// If the HTML document contains a `<base>` tag, its `href` attribute is used
/// as the base URL. Otherwise, the base URL is derived from the URL of the
/// document itself, up to the path component.
async fn get_base_url(url: &Url, doc: &Document) -> Result<Url> {
    let base_tag_href =
        doc.find(Name("base")).filter_map(|n| n.attr("href")).next();
    let base_url = base_tag_href
        .map_or_else(|| Url::parse(&url[..Position::BeforePath]), Url::parse)?;
    Ok(base_url)
}

/// Checks if a link is broken.
///
/// This function sends a GET request to the given URL and checks the HTTP
/// status code. If the status code is 404 (Not Found), the link is
/// considered broken. Otherwise, it's considered OK.
async fn check_link(url: &Url) -> Result<bool> {
    let res = reqwest::get(url.as_ref()).await?;
    Ok(res.status() != StatusCode::NOT_FOUND)
}

#[tokio::main]
async fn main() -> Result<()> {
    let url = Url::parse("https://www.rust-lang.org/en-US/")?;
    let res = reqwest::get(url.as_ref()).await?.text().await?;
    let document = Document::from(res.as_str());
    let base_url = get_base_url(&url, &document).await?;
    let base_parser = Url::options().base_url(Some(&base_url));
    let links: HashSet<Url> = document
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .filter_map(|link| base_parser.parse(link).ok())
        .collect();
    let mut tasks = vec![];

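    for link in links {
        tasks.push(tokio::spawn(async move {
            // Report request failures instead of panicking inside the task.
            match check_link(&link).await {
                Ok(true) => println!("{} is OK", link),
                Ok(false) => println!("{} is Broken", link),
                Err(e) => println!("{} could not be checked: {}", link, e),
            }
        }));
    }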

    for task in tasks {
        task.await?
    }

    Ok(())
}
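
The spawn-then-await structure of `main` is a fan-out/join pattern. As a rough illustration that runs without a network or an async runtime, the sketch below uses `std::thread` in place of `tokio::spawn`; `fake_check` and `check_all` are hypothetical names, and the "check" is a placeholder rather than a real HTTP request.

```rust
use std::thread;

/// Hypothetical stand-in for a network check: treats any link containing
/// "missing" as broken. A real checker would issue an HTTP request.
fn fake_check(link: &str) -> bool {
    !link.contains("missing")
}

/// Starts one worker per link, then waits for all of them and returns
/// `(link, is_ok)` pairs in the original order.
fn check_all(links: Vec<String>) -> Vec<(String, bool)> {
    // Fan out: `collect` forces every worker to start before any is joined.
    let handles: Vec<_> = links
        .into_iter()
        .map(|link| {
            thread::spawn(move || {
                let ok = fake_check(&link);
                (link, ok)
            })
        })
        .collect();

    // Join: wait for each worker in turn. A worker that panicked is skipped
    // instead of aborting the whole run.
    handles
        .into_iter()
        .filter_map(|handle| handle.join().ok())
        .collect()
}
```

Returning the results to the caller, rather than printing inside each worker, keeps the output order stable and makes the function easy to test.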

Extract all Unique Links from a MediaWiki Markup

Fetch the source of a MediaWiki page with reqwest::get, then find every internal and external link with regex::Regex::captures_iter. Returning std::borrow::Cow lets external links borrow from the page content instead of allocating a new std::string::String; internal links are lowercased, so they are stored as owned strings.

MediaWiki link syntax is described in the MediaWiki documentation.

//! This example demonstrates how to extract unique links from a Wikipedia page.
//!
//! It uses the `reqwest` crate to fetch the page content, the `regex` crate to
//! find links, and the `lazy_static` crate to compile the regex only once.

use std::borrow::Cow;
use std::collections::HashSet;

use anyhow::Result;
use lazy_static::lazy_static;
use regex::Regex;

fn extract_links(content: &str) -> HashSet<Cow<str>> {
    lazy_static! {
        static ref WIKI_REGEX: Regex = Regex::new(
            r"(?x)
                \[\[(?P<internal>[^\[\]|]*)[^\[\]]*\]\]    # internal links
                |
                # \x20 is a literal space: verbose mode ignores bare whitespace
                (url=|URL\||\[)(?P<external>http.*?)[\x20\|}] # external links
            "
        )
        .unwrap();
    }

    let links: HashSet<_> = WIKI_REGEX
        .captures_iter(content)
        .map(|c| match (c.name("internal"), c.name("external")) {
            (Some(val), None) => Cow::from(val.as_str().to_lowercase()),
            (None, Some(val)) => Cow::from(val.as_str()),
            _ => unreachable!(),
        })
        .collect();

    links
}

#[tokio::main]
async fn main() -> Result<()> {
    let content = reqwest::get(
        "https://en.wikipedia.org/w/index.php?title=Rust_(programming_language)&action=raw",
    )
    .await?
    .text()
    .await?;

    println!("{:#?}", extract_links(content.as_str()));

    Ok(())
}
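
To show how Cow keeps allocations down while still deduplicating, here is a small standard-library sketch. `normalize` and `unique_titles` are hypothetical helpers, and unlike the recipe, which lowercases every internal link, this variant allocates only when a title actually contains an uppercase character (a simplification that ignores rarer Unicode case forms).

```rust
use std::borrow::Cow;
use std::collections::HashSet;

/// Lowercases `title`, borrowing the input when no uppercase character is
/// present and allocating a new `String` only when a change is required.
fn normalize(title: &str) -> Cow<'_, str> {
    if title.chars().any(|c| c.is_uppercase()) {
        Cow::Owned(title.to_lowercase())
    } else {
        Cow::Borrowed(title)
    }
}

/// Collects the normalized titles into a set. Borrowed and owned values
/// with the same text hash and compare equal, so duplicates collapse.
fn unique_titles<'a>(titles: &[&'a str]) -> HashSet<Cow<'a, str>> {
    titles.iter().copied().map(normalize).collect()
}
```

For the input `["Rust", "rust", "cargo"]` the set holds two entries, because the owned "rust" produced from "Rust" is equal to the borrowed "rust".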
