Text Processing

Deal with the complexities of human language when expressed in textual form.

This table outlines common text processing tasks and relevant Rust crates.

| Topic | Rust Crates | Notes |
|---|---|---|
| String Concatenation | std::string, joinery | joinery is a small crate for generically joining iterators with a separator. |
| String Manipulation | std::string, heck, textwrap | std::string provides basic string operations. heck is a case conversion library. textwrap provides word wrapping, indenting, and dedenting of strings. |
| Regular Expressions | regex, fancy-regex | regex is the standard regular expression crate. fancy-regex adds features that regex leaves out, such as backreferences and look-around. |
| String Search | aho-corasick, memchr | String search can be done with regular expressions or with dedicated algorithms such as Aho-Corasick, which searches for many patterns at once. memchr provides optimized byte and substring search. |
| Fuzzy Matching | fuzzy-matcher, strsim | fuzzy-matcher provides fuzzy string matching. strsim implements string similarity metrics. |
| Diffing & Patching | diff, similar | These crates calculate differences between texts. |
| OS, C, and other strings | std::ffi, bstr | std::ffi provides types for platform-native strings and C-style, NUL-terminated strings. bstr offers a string type that is not required to be valid UTF-8. |
| Unicode handling | unicode-segmentation | unicode-segmentation splits text on grapheme cluster, word, and sentence boundaries according to the Unicode rules. |

Key Considerations

  • Always be mindful of Unicode when processing text. Use crates like unicode-segmentation to handle graphemes and word boundaries correctly.
  • Choose the crate that is most appropriate for your specific text processing task. Don't use a full-fledged parsing library if you only need basic string manipulation.
  • For performance-critical tasks, consider crates optimized for speed, such as memchr for byte and substring search.
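
The first point can be seen with the standard library alone. The sketch below (the helper name `length_views` is made up for this illustration) shows that `len()` counts UTF-8 bytes and `chars()` counts Unicode scalar values; neither corresponds to user-perceived characters, which is the gap that unicode-segmentation is meant to fill.

```rust
/// Returns (byte length, number of Unicode scalar values) for a string.
/// `length_views` is a hypothetical helper used only for this illustration.
fn length_views(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character, two scalar values, three UTF-8 bytes.
    let decomposed = "e\u{301}";
    let (bytes, scalars) = length_views(decomposed);
    println!("bytes = {}, scalars = {}", bytes, scalars);

    // Reversing by `char` detaches the accent from its base letter,
    // so the result no longer renders as the original word reversed.
    let reversed: String = "cafe\u{301}".chars().rev().collect();
    println!("{:?}", reversed);
}
```

Code that slices, truncates, or reverses text by `char` can therefore split a grapheme cluster; working on grapheme boundaries avoids that.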

Code Examples

Concatenate Strings

| Recipe | Crates | Categories |
|---|---|---|
| Concatenate Strings | std | cat-text-processing |
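
As a quick orientation, here is a minimal sketch of the concatenation options in the standard library. It is not the linked recipe; `join_with` is a hypothetical helper name chosen for this example.

```rust
/// Joins `parts` with `sep` using only the standard library.
/// `join_with` is a hypothetical helper name for this sketch.
fn join_with(parts: &[&str], sep: &str) -> String {
    parts.join(sep)
}

fn main() {
    // `+` consumes the left-hand `String` and borrows the right-hand `&str`.
    let greeting = String::from("Hello, ") + "world";

    // `format!` builds a new `String` without consuming its arguments.
    let name = "Rust";
    let formatted = format!("{}! Welcome to {}.", greeting, name);

    // `push_str` and `push` append in place to an existing buffer.
    let mut buffer = String::with_capacity(16);
    buffer.push_str("abc");
    buffer.push('d');

    // `concat` and `join` work on slices of string-like values.
    let glued = ["a", "b", "c"].concat();
    let listed = join_with(&["a", "b", "c"], ", ");

    println!("{}\n{}\n{}\n{}", formatted, buffer, glued, listed);
}
```

For a handful of pieces, `format!` is usually the most readable choice; `push_str` on a pre-sized buffer tends to suit loops that append many fragments.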

Manipulate Strings

| Recipe | Crates | Categories |
|---|---|---|
| heck | heck | cat-text-processing |
| indoc | indoc | cat-text-processing |
| textwrap | textwrap | cat-text-processing |
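
Before reaching for the crates above, it may help to see what the standard library already covers. The following is a minimal std-only sketch; `normalize_whitespace` is a hypothetical helper name, and the example does not use heck, indoc, or textwrap.

```rust
/// Trims the input and collapses every run of whitespace into a single space.
/// `normalize_whitespace` is a hypothetical helper name for this sketch.
fn normalize_whitespace(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let raw = "  The quick\t brown   fox \n";

    // `trim` returns a borrowed slice; no allocation happens.
    let trimmed = raw.trim();

    // `replace` and the case methods allocate a new `String`.
    let swapped = trimmed.replace("quick", "slow");
    let shouted = swapped.to_uppercase();

    // `split_whitespace` yields the words, skipping any amount of whitespace.
    let word_count = raw.split_whitespace().count();

    println!("{:?}", normalize_whitespace(raw));
    println!("{} ({} words)", shouted, word_count);
}
```

Case conversion between identifier styles (snake_case, CamelCase, and so on) and paragraph-level wrapping are where the dedicated crates become worthwhile.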

Find, Extract, and Replace Text Using Regular Expressions

Search for Strings (incl. Fuzzy Matching)

| Recipe | Crates | Categories |
|---|---|---|
| aho-corasick | aho-corasick | cat-text-processing |
| fuzzy-matcher | fuzzy-matcher | cat-text-processing |
| memchr | memchr | cat-text-processing |
| strsim | strsim | cat-text-processing |
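
To give a sense of what a string similarity metric computes, here is a hand-rolled sketch of Levenshtein (edit) distance using only the standard library. It is an illustration, not the code or API of strsim or fuzzy-matcher; `levenshtein` and `closest` are names chosen for this example.

```rust
/// Computes the Levenshtein (edit) distance between two strings, counting
/// insertions, deletions, and substitutions of `char`s.
/// Hand-rolled for illustration; this is not the API of any crate listed above.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // `prev[j]` is the distance between the processed prefix of `a`
    // and the first `j` chars of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();

    for (i, ca) in a.chars().enumerate() {
        let mut curr = Vec::with_capacity(b_chars.len() + 1);
        curr.push(i + 1);
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + if ca == *cb { 0 } else { 1 };
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b_chars.len()]
}

/// Returns the candidate with the smallest edit distance to `query`,
/// or `None` if `candidates` is empty. Hypothetical helper for this sketch.
fn closest<'a>(query: &str, candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .min_by_key(|candidate| levenshtein(query, candidate))
}

fn main() {
    println!("{}", levenshtein("kitten", "sitting"));
    println!("{:?}", closest("serch", &["search", "merge", "fetch"]));
}
```

This version compares `char`s, so it inherits the grapheme caveat mentioned under Key Considerations, and it runs in time proportional to the product of the two lengths.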

Parse Strings

Diff Text

| Recipe | Crates | Categories |
|---|---|---|
| diff | diff | cat-text-processing |
| similar | similar | cat-text-processing |
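
The idea behind a line diff can be sketched with a longest-common-subsequence table. The code below is a simplified, std-only illustration and not the implementation or API of diff or similar; the `Change` enum and `diff_lines` function are names invented for this example.

```rust
/// One step of a line-based diff. Hypothetical type for this sketch.
#[derive(Debug, PartialEq)]
enum Change<'a> {
    Same(&'a str),
    Removed(&'a str),
    Added(&'a str),
}

/// Computes a line diff from `old` to `new` using a longest-common-subsequence table.
fn diff_lines<'a>(old: &'a str, new: &'a str) -> Vec<Change<'a>> {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();

    // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
    let mut lcs = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in (0..a.len()).rev() {
        for j in (0..b.len()).rev() {
            lcs[i][j] = if a[i] == b[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }

    // Walk the table from the start, emitting one change per step.
    let (mut i, mut j) = (0, 0);
    let mut changes = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            changes.push(Change::Same(a[i]));
            i += 1;
            j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            changes.push(Change::Removed(a[i]));
            i += 1;
        } else {
            changes.push(Change::Added(b[j]));
            j += 1;
        }
    }
    // Whatever is left on either side is a pure removal or addition.
    changes.extend(a[i..].iter().map(|&line| Change::Removed(line)));
    changes.extend(b[j..].iter().map(|&line| Change::Added(line)));
    changes
}

fn main() {
    for change in diff_lines("a\nb\nc", "a\nc\nd") {
        match change {
            Change::Same(line) => println!("  {}", line),
            Change::Removed(line) => println!("- {}", line),
            Change::Added(line) => println!("+ {}", line),
        }
    }
}
```

The table needs memory proportional to the product of the two line counts, which is one reason to prefer a dedicated crate for large inputs.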

Work with Unicode

| Recipe | Crates | Categories |
|---|---|---|
| Collect Unicode Graphemes | unicode-segmentation | cat-text-processing |

Create and Use OS, C, and Non-UTF8 Strings

| Recipe | Crates | Categories |
|---|---|---|
| bstr | bstr | cat-text-processing |
| CString and CStr | std | cat-text-processing |
| OsString and OsStr | std | cat-text-processing |
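
For the two std rows, a minimal sketch of the owned and borrowed types in `std::ffi` might look like the following. The helper names `c_round_trip` and `os_to_lossy` are hypothetical, and no actual foreign function is called.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};

/// Converts a Rust string to a NUL-terminated C string and reads it back.
/// Returns `None` if `s` contains an interior NUL byte.
/// `c_round_trip` is a hypothetical helper name for this sketch.
fn c_round_trip(s: &str) -> Option<String> {
    // `CString::new` appends the trailing NUL and rejects interior NULs.
    let owned: CString = CString::new(s).ok()?;
    // `CStr` is the borrowed counterpart, as `str` is to `String`.
    let borrowed: &CStr = owned.as_c_str();
    borrowed.to_str().ok().map(|text| text.to_string())
}

/// Converts an OS string to a `String`, replacing any invalid
/// sequences with U+FFFD. Hypothetical helper for this sketch.
fn os_to_lossy(s: &OsStr) -> String {
    s.to_string_lossy().into_owned()
}

fn main() {
    println!("{:?}", c_round_trip("hello"));
    println!("{:?}", c_round_trip("he\0llo"));

    // The stored bytes include the terminator that C code expects.
    let c = CString::new("hi").expect("no interior NUL");
    println!("{:?}", c.as_bytes_with_nul());

    // `OsString` holds platform-native text, such as file names.
    let os = OsString::from("report.txt");
    println!("{}", os_to_lossy(&os));
}
```

When passing a `CString` to C, the owning value has to outlive the raw pointer obtained from `as_ptr()`.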

| Topic | Rust Crates | Notes |
|---|---|---|
| Logging | log, env_logger, tracing | Logging often involves formatting and processing text. |
| Markdown Processing | pulldown-cmark, comrak | These crates parse and render Markdown. |
| Natural Language Processing (NLP) | (Many crates for specific tasks) | NLP tasks often use the crates mentioned here, along with specialized crates for tasks such as part-of-speech tagging and named entity recognition. See Deep Learning. |
| Parsing | nom, pest, lalrpop, combine | These crates offer different approaches to parsing, from parser combinators (nom, combine) to parser generators (pest, lalrpop). |
| CSV Parsing | csv | This crate provides efficient CSV parsing. |
| HTML Parsing | scraper, kuchiki | These crates parse HTML documents. |
| XML Parsing | xmltree, quick-xml | These crates parse XML documents. |
| Command-Line Argument Parsing | clap, structopt | These crates parse command-line arguments, which often involves text processing. structopt is in maintenance mode now that its derive API is part of clap. |
| Serialization/Deserialization | serde, serde_json, serde_yml, toml | serde is a framework for serialization and deserialization, often used with text-based formats such as JSON, YAML, and TOML. |
| Text Encoding/Decoding | encoding, iconv (bindings) | These crates handle different character encodings. encoding is a pure Rust solution, while iconv provides bindings to the iconv library. |
| Text Formatting & Templating | minijinja, tera, handlebars, askama | These crates generate text-based output with dynamic content. |
| Tokenization | tokenizers | tokenizers provides tools for breaking text into tokens. |
| Stemming & Language Detection | rust-stemmers, lingua | rust-stemmers provides stemming algorithms. lingua is a natural language detection library, suitable for short text and mixed-language text. |