Text Processing
Deal with the complexities of human language when expressed in textual form.
This table outlines common text processing tasks and relevant Rust crates.
Topic | Rust Crates | Notes |
---|---|---|
String Concatenation | std::string, joinery | joinery ⮳ is a small crate for generically joining iterators with a separator. |
String Manipulation | std::string, heck, textwrap | std::string provides basic string operations. heck ⮳ is a case conversion library. textwrap ⮳ provides word wrapping, indenting, and dedenting of strings. |
Regular Expressions | regex ⮳, fancy-regex ⮳ | regex ⮳ is the de facto standard regular expression crate. fancy-regex ⮳ adds features that regex ⮳ omits, such as backreferences and look-around. |
String Search | aho-corasick ⮳, memchr ⮳ | String search can be done with regular expressions or with dedicated algorithms: aho-corasick ⮳ matches many patterns at once, and memchr ⮳ provides optimized byte and substring search. |
Fuzzy Matching | fuzzy-matcher ⮳, strsim | fuzzy-matcher ⮳ provides fuzzy string matching. strsim ⮳ implements string similarity metrics. |
Diffing & Patching | diff ⮳, similar ⮳ | These crates calculate differences between text files. |
OS, C, and other strings | std::ffi, bstr | std::ffi provides types for platform-native strings and C-style, NUL-terminated strings. bstr ⮳ offers a string type that is not required to be valid UTF-8. |
Unicode handling | unicode-segmentation ⮳ | unicode-segmentation ⮳ correctly handles Unicode graphemes and word boundaries. |
Key Considerations
- Always be mindful of Unicode when processing text. Use crates like unicode-segmentation ⮳ to handle graphemes and word boundaries correctly.
- Choose the crate that is most appropriate for your specific text processing task. Don't use a full-fledged parsing library if you only need basic string manipulation.
- For performance-critical tasks, consider using crates optimized for speed, such as memchr ⮳.
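As a small illustration of the first point, the sketch below uses only the standard library to show that a string's byte length and its number of Unicode scalar values can differ. The helper name `byte_and_char_counts` is hypothetical. Neither count is the number of user-perceived characters (graphemes); for that, a crate such as unicode-segmentation is needed.

```rust
/// Returns (length in bytes, number of Unicode scalar values) for `s`.
/// Hypothetical helper, for illustration only.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one
    // character, stored as two scalar values and three UTF-8 bytes.
    let decomposed = "e\u{301}";
    let (bytes, chars) = byte_and_char_counts(decomposed);
    println!("bytes = {}, chars = {}", bytes, chars);
}
```

Code that indexes by byte offset or counts `char`s may therefore split what a reader sees as a single character.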
Code Examples
Concatenate Strings
Recipe | Crates | Categories |
---|---|---|
Concatenate Strings |
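A minimal sketch of common ways to concatenate strings with the standard library alone. The helper names `join_with` and `greet` are hypothetical; joinery is not used here.

```rust
/// Joins `parts` with `sep` using the standard library's slice `join`.
fn join_with(parts: &[&str], sep: &str) -> String {
    parts.join(sep)
}

/// Builds a greeting with `format!`, which borrows its arguments.
fn greet(name: &str) -> String {
    format!("Hello, {}!", name)
}

fn main() {
    // `+` takes ownership of the left-hand `String` and appends a `&str`.
    let hello = String::from("Hello");
    let plus = hello + ", " + "world";

    // `push_str` appends in place to an existing `String`.
    let mut buf = String::with_capacity(16);
    buf.push_str("Hello");
    buf.push_str(", world");

    assert_eq!(plus, buf);
    println!("{}", join_with(&["a", "b", "c"], "-"));
    println!("{}", greet("Rust"));
}
```

`format!` is usually the most readable choice when mixing several values, while `push_str` avoids intermediate allocations when building a string in a loop.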
Manipulate Strings
FIXME
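Until a full recipe is written, here is a hedged sketch of basic manipulation using only `str` and `String` methods. Both helper functions are hypothetical names chosen for illustration. Case conversion between naming styles and word wrapping are better left to heck and textwrap.

```rust
/// Collapses runs of whitespace into single spaces and trims both ends.
fn normalize_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Uppercases the first character and leaves the rest unchanged.
fn capitalize_first(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        // `to_uppercase` yields an iterator because one character may
        // map to several characters when uppercased.
        Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
        None => String::new(),
    }
}

fn main() {
    let raw = "  hello   big\tworld \n";
    let clean = normalize_whitespace(raw);
    println!("{}", capitalize_first(&clean));
    println!("{}", clean.replace("big", "small").to_uppercase());
}
```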
Find, Extract, and Replace Text Using Regular Expressions
FIXME
Search for Strings (incl. Fuzzy Matching)
Recipe | Crates | Categories |
---|---|---|
aho-corasick | aho-corasick | |
fuzzy-matcher | fuzzy-matcher | |
memchr | memchr | |
strsim | strsim | |
FIXME
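As a std-only sketch of the two ideas in this section: `find_all` performs exact substring search with `str::match_indices`, and `levenshtein` computes one classic string similarity metric. Both function names are hypothetical, and neither reflects how aho-corasick, memchr, fuzzy-matcher, or strsim are implemented. Those crates are the better choice for real workloads.

```rust
/// Returns the byte offsets of all non-overlapping matches of `needle`.
fn find_all(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.match_indices(needle).map(|(idx, _)| idx).collect()
}

/// Computes the Levenshtein edit distance between `a` and `b`, counting
/// insertions, deletions, and substitutions of `char`s.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();

    for (i, ca) in a.chars().enumerate() {
        let mut curr = Vec::with_capacity(b_chars.len() + 1);
        curr.push(i + 1);
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b_chars.len()]
}

fn main() {
    println!("{:?}", find_all("abracadabra", "abra"));
    println!("{}", levenshtein("kitten", "sitting"));
}
```

A smaller distance means the strings are more similar, so a simple fuzzy lookup can pick the candidate with the lowest distance to the query.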
Parse Strings
Recipe | Crates | Categories |
---|---|---|
Implement the FromStr Trait for a Custom struct |
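A minimal sketch of implementing `std::str::FromStr` for a custom struct. The `Point` type, the `"x,y"` input format, and the `ParsePointError` enum are assumptions made for illustration.

```rust
use std::num::ParseIntError;
use std::str::FromStr;

#[derive(Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

#[derive(Debug, PartialEq)]
enum ParsePointError {
    /// The input did not contain a comma separating the two coordinates.
    MissingComma,
    /// One of the coordinates was not a valid `i32`.
    BadNumber(ParseIntError),
}

impl FromStr for Point {
    type Err = ParsePointError;

    /// Parses input of the assumed form "x,y", allowing surrounding spaces.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (x, y) = s.split_once(',').ok_or(ParsePointError::MissingComma)?;
        let x = x.trim().parse::<i32>().map_err(ParsePointError::BadNumber)?;
        let y = y.trim().parse::<i32>().map_err(ParsePointError::BadNumber)?;
        Ok(Point { x, y })
    }
}

fn main() {
    // Implementing `FromStr` makes `str::parse` work for the type.
    let p: Result<Point, _> = "3, -4".parse();
    println!("{:?}", p);
}
```

Returning a dedicated error type, rather than a `String`, lets callers distinguish a malformed layout from a bad number.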
Diff Text
FIXME
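To show what a line diff computes, here is a naive sketch based on the longest common subsequence (LCS) of lines, using only the standard library. The `Change` enum and `diff_lines` function are hypothetical names. This quadratic-time approach is an assumption chosen for brevity and is not a description of how the diff or similar crates work internally.

```rust
#[derive(Debug, PartialEq)]
enum Change<'a> {
    Same(&'a str),
    Removed(&'a str),
    Added(&'a str),
}

/// Produces a line-by-line edit script that turns `old` into `new`.
fn diff_lines<'a>(old: &'a str, new: &'a str) -> Vec<Change<'a>> {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();

    // lcs[i][j] = length of the LCS of a[i..] and b[j..].
    let mut lcs = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in (0..a.len()).rev() {
        for j in (0..b.len()).rev() {
            lcs[i][j] = if a[i] == b[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }

    // Walk the table from the start, emitting changes as we go.
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(Change::Same(a[i]));
            i += 1;
            j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            out.push(Change::Removed(a[i]));
            i += 1;
        } else {
            out.push(Change::Added(b[j]));
            j += 1;
        }
    }
    // Whatever remains on either side is purely removed or added.
    out.extend(a[i..].iter().copied().map(Change::Removed));
    out.extend(b[j..].iter().copied().map(Change::Added));
    out
}

fn main() {
    for change in diff_lines("a\nb\nc", "a\nc\nd") {
        match change {
            Change::Same(l) => println!("  {}", l),
            Change::Removed(l) => println!("- {}", l),
            Change::Added(l) => println!("+ {}", l),
        }
    }
}
```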
Work with Unicode
Recipe | Crates | Categories |
---|---|---|
Collect Unicode Graphemes |
FIXME
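A std-only sketch of one common Unicode pitfall: slicing a `&str` at an arbitrary byte offset panics if the offset falls inside a multi-byte character. The hypothetical helper `truncate_chars` uses `char_indices` to cut on a `char` boundary instead. It counts Unicode scalar values, not graphemes, so it can still separate a base letter from a combining mark; splitting on grapheme clusters requires unicode-segmentation.

```rust
/// Returns the longest prefix of `s` that contains at most `max_chars`
/// Unicode scalar values, always cutting on a valid `char` boundary.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // `byte_idx` is the byte offset where the first excluded char starts.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string has `max_chars` chars or fewer: keep all of it.
        None => s,
    }
}

fn main() {
    println!("{}", truncate_chars("héllo", 2));
    println!("{}", truncate_chars("日本語", 1));
}
```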
Create and Use OS, C, and Non-UTF8 Strings
Recipe | Crates | Categories |
---|---|---|
bstr | ||
CString and CStr | ||
OsString and OsStr |
FIXME
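A hedged sketch of the `std::ffi` string types listed above. The helper names `to_c_string` and `os_to_lossy` are hypothetical, and bstr is not covered here.

```rust
use std::ffi::{CStr, CString, NulError, OsStr, OsString};

/// Converts a Rust string to an owned, NUL-terminated C string.
/// Fails if `s` contains an interior NUL byte.
fn to_c_string(s: &str) -> Result<CString, NulError> {
    CString::new(s)
}

/// Converts an OS string to UTF-8 text, replacing any invalid
/// sequences with U+FFFD REPLACEMENT CHARACTER.
fn os_to_lossy(s: &OsStr) -> String {
    s.to_string_lossy().into_owned()
}

fn main() {
    // `CString` owns its buffer; `CStr` is the borrowed counterpart.
    let owned = to_c_string("hello").expect("no interior NUL");
    let borrowed: &CStr = owned.as_c_str();
    println!("{:?} has {} bytes before the NUL", borrowed, borrowed.to_bytes().len());

    // `OsString` / `OsStr` hold platform-native strings such as file names.
    let os: OsString = OsString::from("file.txt");
    println!("{}", os_to_lossy(&os));
}
```

`CString` and `CStr` are meant for passing strings across a C FFI boundary, while `OsString` and `OsStr` represent whatever the operating system uses for paths, arguments, and environment variables.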
Related Topics
Topic | Rust Crates | Notes |
---|---|---|
Logging | log ⮳, env_logger ⮳, tracing ⮳ | Logging often involves formatting and processing text. |
Markdown Processing | pulldown-cmark ⮳, comrak ⮳ | These crates parse and render Markdown. |
Natural Language Processing (NLP) | (Many crates for specific tasks) | NLP tasks often use the crates mentioned here, along with specialized crates for things like part-of-speech tagging, named entity recognition, etc. See Deep Learning. |
Parsing | nom ⮳, pest ⮳, lalrpop ⮳, combine ⮳ | These crates offer different approaches to parsing, from combinators (nom ⮳, combine ⮳) to parser generators (pest ⮳, lalrpop ⮳). |
CSV Parsing | csv ⮳ | This crate provides efficient CSV parsing. |
HTML Parsing | scraper ⮳, kuchiki ⮳ | These crates parse HTML documents. |
XML Parsing | xmltree ⮳, quick-xml ⮳ | These crates parse XML documents. |
Command-Line Argument Parsing | clap ⮳, structopt ⮳ | These crates parse command-line arguments, a task that often involves text processing. structopt ⮳ is in maintenance mode; its derive-based functionality is now part of clap ⮳. |
Serialization/Deserialization | serde ⮳, serde_json ⮳, serde_yml ⮳, toml ⮳ | serde ⮳ is a powerful framework for serialization and deserialization, often used with text-based formats like JSON, YAML, and TOML. |
Text Encoding/Decoding | encoding ⮳, iconv ⮳ (bindings) | These crates handle different character encodings. encoding ⮳ is a pure Rust solution, while iconv ⮳ provides bindings to the iconv library. |
Text Formatting & Templating | minijinja ⮳, tera ⮳, handlebars ⮳, askama ⮳ | These crates are used for generating text-based output with dynamic content. |
Tokenization | tokenizers ⮳ | tokenizers ⮳ provides tools for breaking text into tokens. |
Stemming & Lemmatization | rust-stemmers ⮳, lingua ⮳ | rust-stemmers ⮳ provides stemming algorithms. lingua ⮳ is a natural language detection library, suitable for short text and mixed-language text. |