6g44. |
GitHub - huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://github.com/huggingface/tokenizers
Saved on 2023-08-08 [19577 edays] via github.com
Modified 2023-08-08 [19577 edays]
ai rust
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncates, pads, and adds the special tokens your model needs.
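
The last two points (alignment tracking and pre-processing) can be illustrated with a small standard-library sketch. This is not the tokenizers library's API or implementation. The function names, the regex-based splitting, the `[CLS]`/`[SEP]`/`[PAD]` strings, and the `(0, 0)` span used for special tokens are all assumptions chosen for illustration. The idea is that every token carries the character span it came from, so normalization (here, lowercasing) never loses the link back to the original text.

```python
import re

# Hypothetical special tokens, chosen for illustration only.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def tokenize_with_offsets(text):
    """Split into words and punctuation, lowercase each piece,
    and keep its (start, end) character span in the original text."""
    return [
        (m.group().lower(), (m.start(), m.end()))
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]


def encode(text, max_length):
    """Truncate, add special tokens, and pad to exactly max_length.
    Assumes max_length >= 2 so both special tokens fit."""
    pieces = tokenize_with_offsets(text)
    # Truncate, leaving room for the two special tokens.
    pieces = pieces[: max_length - 2]
    # Special and padding tokens have no source text, so they get an empty span.
    pieces = [(CLS, (0, 0))] + pieces + [(SEP, (0, 0))]
    pieces += [(PAD, (0, 0))] * (max_length - len(pieces))
    tokens = [tok for tok, _ in pieces]
    offsets = [span for _, span in pieces]
    return tokens, offsets


text = "Hello, World!"
tokens, offsets = encode(text, max_length=8)

# Recover the original (un-normalized) text behind the token "world".
start, end = offsets[tokens.index("world")]
original_piece = text[start:end]
```

Here `original_piece` holds the capitalized substring from the input even though the token itself was lowercased, which is the property the alignment-tracking bullet describes.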