Text Processing

Deal with the complexities of human language when expressed in textual form.

This table outlines common text processing tasks and relevant Rust crates.

Topic	Rust Crates	Notes
String Concatenation	`std::string`, `joinery`	`joinery`⮳ is a small crate for generically joining iterators with a separator.
String Manipulation	`std::string`, `heck`, `textwrap`	`std::string` provides basic string operations. `heck`⮳ is a case conversion library. `textwrap`⮳ wraps, indents, and dedents text.
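Regular Expressions	`regex`⮳, `fancy-regex`⮳	`regex`⮳ is the de facto standard regular expression crate. It guarantees linear-time matching and therefore omits backreferences and lookaround. `fancy-regex`⮳ adds those features by backtracking.
String Search	`aho-corasick`⮳, `memchr`⮳	String search can be done with regular expressions or with dedicated algorithms. `aho-corasick`⮳ searches for many patterns at once. `memchr`⮳ provides optimized byte and substring search.
Fuzzy Matching	`fuzzy-matcher`⮳, `strsim`	`fuzzy-matcher`⮳ provides fuzzy string matching. `strsim`⮳ implements string similarity metrics.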
Diffing & Patching	`diff`⮳, `similar`⮳	These crates calculate differences between text files.
OS, C, and other strings	`std::ffi`, `bstr`	`std::ffi` provides types for platform-native strings and C-style, NUL-terminated strings. `bstr`⮳ offers a string type that is not required to be valid UTF-8.
Unicode handling	`unicode-segmentation`⮳	`unicode-segmentation`⮳ correctly handles Unicode graphemes and word boundaries.

Key Considerations

Always be mindful of Unicode when processing text. Use crates like `unicode-segmentation`⮳ to handle graphemes and word boundaries correctly.
Choose the crate that is most appropriate for your specific text processing task. Don't use a full-fledged parsing library if you only need basic string manipulation.
For performance-critical tasks, consider using crates optimized for speed, such as `memchr`⮳.
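
The Unicode consideration above can be illustrated with the standard library alone. The sketch below is only an illustration, not a recipe from this book: the helper name `byte_and_char_counts` is hypothetical. It contrasts byte length with the number of `char`s (Unicode scalar values). Counting user-perceived characters (grapheme clusters) still requires a crate such as `unicode-segmentation`.

```rust
/// Returns `(byte length, number of chars)` for a string slice.
/// Hypothetical helper, introduced only for this illustration.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    // `len()` counts UTF-8 bytes; `chars()` iterates Unicode scalar values.
    (s.len(), s.chars().count())
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    // A reader sees one character, but it is two chars and three bytes.
    let decomposed = "e\u{301}";
    let (bytes, chars) = byte_and_char_counts(decomposed);
    println!("{decomposed}: {bytes} bytes, {chars} chars");
}
```

Because neither count matches what a reader perceives as one character, slicing or truncating by byte or `char` index can split a visible character apart.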

Code Examples

Concatenate Strings

Recipe	Crates	Categories
Concatenate Strings
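
As a rough sketch of what concatenation looks like with the standard library only (the helper names `build_greeting` and `join_with_comma` are hypothetical, chosen for this illustration), the following shows `push_str`, `format!`, and slice `join`:

```rust
/// Builds a string incrementally with `push_str` and `push`.
/// Hypothetical helper for illustration.
fn build_greeting(name: &str) -> String {
    let mut s = String::from("Hello, ");
    s.push_str(name); // append a string slice
    s.push('!'); // append a single char
    s
}

/// Joins string slices with a separator using the slice `join` method.
/// Hypothetical helper for illustration.
fn join_with_comma(parts: &[&str]) -> String {
    parts.join(", ")
}

fn main() {
    println!("{}", build_greeting("world"));
    println!("{}", join_with_comma(&["a", "b", "c"]));

    // `format!` is often the most readable option for mixed content.
    let formatted = format!("{}-{}", "left", "right");
    println!("{formatted}");
}
```

Slice `join` only works on slices of string-like items. A crate such as `joinery` may be a better fit when joining arbitrary iterators.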

Manipulate Strings

Recipe	Crates	Categories
`heck`
`indoc`
`textwrap`

Find, Extract, and Replace Text Using Regular Expressions

Recipe	Crates	Categories
Verify and Extract Login from an Email Address
Extract a list of Unique #hashtags from a Text
Extract Phone Numbers from Text
Filter a log File by Matching Multiple Regular Expressions
Replace all Occurrences of one text Pattern with Another Pattern
Longer Regex Example
Use Regular Expressions with Back-references and Lookarounds

Search for Strings (incl. Fuzzy Matching)

Recipe	Crates	Categories
`aho-corasick`
`fuzzy-matcher`
`memchr`
`strsim`

Parse Strings

Recipe	Crates	Categories
Implement the `FromStr` Trait for a Custom `struct`
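
A minimal sketch of such an implementation follows. The `Rgb` struct, the `ParseRgbError` type, and the `#rrggbb` input format are assumptions made for this illustration. Implementing `FromStr` is what makes `str::parse::<Rgb>()` available.

```rust
use std::str::FromStr;

/// Hypothetical color type used only for this illustration.
#[derive(Debug, PartialEq)]
struct Rgb {
    r: u8,
    g: u8,
    b: u8,
}

/// Hypothetical error type returned when the input is not `#rrggbb`.
#[derive(Debug, PartialEq)]
struct ParseRgbError;

impl FromStr for Rgb {
    type Err = ParseRgbError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Require a leading '#' followed by exactly six hex digits.
        let hex = s.strip_prefix('#').ok_or(ParseRgbError)?;
        if hex.len() != 6 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(ParseRgbError);
        }
        // Indexing by byte offset is safe here: all six characters are ASCII.
        let component =
            |i: usize| u8::from_str_radix(&hex[i..i + 2], 16).map_err(|_| ParseRgbError);
        Ok(Rgb {
            r: component(0)?,
            g: component(2)?,
            b: component(4)?,
        })
    }
}

fn main() {
    // `parse` dispatches to the `FromStr` implementation above.
    let color: Result<Rgb, _> = "#ff8000".parse();
    println!("{color:?}");
}
```

The explicit hex-digit check matters because `u8::from_str_radix` also accepts a leading `+` sign, which would otherwise let malformed input through.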

Diff Text

Recipe	Crates	Categories
`diff`
`similar`

Work with Unicode

Recipe	Crates	Categories
Collect Unicode Graphemes

Create and Use OS, C, and Non-UTF8 Strings

Recipe	Crates	Categories
`bstr`
`CString` and `CStr`
`OsString` and `OsStr`
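
The `std::ffi` types listed above can be sketched with the standard library alone. The helper names `to_c_string` and `os_to_display` are hypothetical, introduced for this illustration. `bstr` is an external crate and is not shown here.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};

/// Converts a Rust string into an owned, NUL-terminated C string.
/// Returns `None` if the input contains an interior NUL byte.
/// Hypothetical helper for illustration.
fn to_c_string(s: &str) -> Option<CString> {
    CString::new(s).ok()
}

/// Renders a platform string for display, replacing any invalid
/// Unicode with U+FFFD. Hypothetical helper for illustration.
fn os_to_display(os: &OsStr) -> String {
    os.to_string_lossy().into_owned()
}

fn main() {
    let owned = to_c_string("hello").expect("no interior NUL");
    // `CStr` is the borrowed counterpart of `CString`, as `str` is to `String`.
    let borrowed: &CStr = owned.as_c_str();
    println!(
        "{:?} occupies {} bytes including the terminator",
        borrowed,
        owned.as_bytes_with_nul().len()
    );

    // `OsString` holds whatever the platform uses for paths and arguments.
    let os = OsString::from("file.txt");
    println!("{}", os_to_display(&os));
}
```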

Related Topics

Topic	Rust Crates	Notes
Logging	`log`⮳, `env_logger`⮳, `tracing`⮳	Logging often involves formatting and processing text.
Markdown Processing	`pulldown-cmark`⮳, `comrak`⮳	These crates parse and render Markdown.
Natural Language Processing (NLP)	(Many crates for specific tasks)	NLP tasks often use the crates mentioned here, along with specialized crates for things like part-of-speech tagging, named entity recognition, etc. See Deep Learning.
Parsing	`nom`⮳, `pest`⮳, `lalrpop`⮳, `combine`⮳	These crates offer different approaches to parsing, from combinators (`nom`⮳, `combine`⮳) to parser generators (`pest`⮳, `lalrpop`⮳).
CSV Parsing	`csv`⮳	This crate provides efficient CSV parsing.
HTML Parsing	`scraper`⮳, `kuchiki`⮳	These crates parse HTML documents.
XML Parsing	`xmltree`⮳, `quick-xml`⮳	These crates parse XML documents.
Command-Line Argument Parsing	`clap`⮳, `structopt`⮳	These crates parse command-line arguments, which often involves text processing. `structopt`⮳ is in maintenance mode; its derive API was merged into `clap`⮳ as of version 3.
Serialization/Deserialization	`serde`⮳, `serde_json`⮳, `serde_yml`⮳, `toml`⮳	`serde`⮳ is a powerful framework for serialization and deserialization, often used with text-based formats like JSON, YAML, and TOML.
Text Encoding/Decoding	`encoding`⮳, `iconv`⮳ (bindings)	These crates handle different character encodings. `encoding`⮳ is a pure Rust solution, while `iconv`⮳ provides bindings to the iconv library.
Text Formatting & Templating	`minijinja`⮳, `tera`⮳, `handlebars`⮳, `askama`⮳	These crates are used for generating text-based output with dynamic content.
Tokenization	`tokenizers`⮳	`tokenizers`⮳ provides tools for breaking text into tokens.
Stemming & Lemmatization	`rust-stemmers`⮳, `lingua`⮳	`rust-stemmers`⮳ provides stemming algorithms. `lingua`⮳ is a natural language detection library, suitable for short text and mixed-language text.