
sentencepiece

Description

An unsupervised text tokenizer and detokenizer.

Software Architecture

Software architecture description

Installation

  1. Python module

    SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

    % pip install sentencepiece
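
    The Python wrapper covers the same training and segmentation workflow as the command line tools described below. A minimal sketch, assuming a reasonably recent sentencepiece release (keyword-argument API) and a hypothetical one-sentence-per-line corpus file named corpus.txt:

    import sentencepiece as spm

    # Train a model; this writes m.model and m.vocab to the current directory.
    spm.SentencePieceTrainer.train(input='corpus.txt', model_prefix='m', vocab_size=1000)

    # Load the trained model and segment text into pieces and ids.
    sp = spm.SentencePieceProcessor(model_file='m.model')
    print(sp.encode('This is a test', out_type=str))  # subword pieces
    print(sp.encode('This is a test'))                # vocabulary ids

    # Decode ids back to text.
    print(sp.decode(sp.encode('This is a test')))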

  2. Build and install SentencePiece command line tools from C++ source

    The following tools and libraries are required to build SentencePiece:

    • cmake

    • C++11 compiler

    • gperftools library (optional, 10-40% performance improvement can be obtained.)

    On Ubuntu, the build tools can be installed with apt-get:

    % sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

    Then, you can build and install the command line tools as follows.

    % git clone https://github.com/google/sentencepiece.git

    % cd sentencepiece

    % mkdir build

    % cd build

    % cmake ..

    % make -j $(nproc)

    % sudo make install

    % sudo ldconfig -v

    On OSX/macOS, replace the last command with sudo update_dyld_shared_cache.

  3. Build and install using vcpkg

    You can download and install sentencepiece using the vcpkg dependency manager:

    git clone https://github.com/Microsoft/vcpkg.git

    cd vcpkg

    ./bootstrap-vcpkg.sh

    ./vcpkg integrate install

    ./vcpkg install sentencepiece

    The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Instructions

  1. Train SentencePiece Model

    % spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>

    • --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.

    • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.

    • --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000

    • --character_coverage: amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.

    • --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.
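
    The same flags can be passed verbatim to the Python wrapper, which accepts a single flag string; a minimal sketch, assuming a recent sentencepiece release and a placeholder corpus file data.txt:

    import sentencepiece as spm

    # Equivalent of the spm_train invocation above; flags are passed as one string.
    spm.SentencePieceTrainer.train(
        '--input=data.txt --model_prefix=mymodel --vocab_size=8000 '
        '--character_coverage=1.0 --model_type=unigram')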

  2. Encode raw text into sentence pieces/ids

    % spm_encode --model=<model_file> --output_format=piece < input > output

    % spm_encode --model=<model_file> --output_format=id < input > output

    Use the --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

    % spm_encode --extra_options=eos (add </s> only)

    % spm_encode --extra_options=bos:eos (add <s> and </s>)

    % spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

    SentencePiece supports nbest segmentation and segmentation sampling with --output_format=(nbest|sample)_(piece|id) flags.

    % spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output

    % spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
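
    The Python wrapper exposes the same sampling and nbest segmentation (available only with the unigram model); a minimal sketch, assuming a trained model file m.model and a recent sentencepiece release:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file='m.model')

    # Sample one segmentation from the full lattice (nbest_size=-1)
    # with smoothing parameter alpha, as in --output_format=sample_piece.
    print(sp.sample_encode_as_pieces('New York', -1, 0.5))

    # 10-best segmentations, as in --output_format=nbest_piece.
    print(sp.nbest_encode_as_pieces('New York', 10))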

  3. Decode sentence pieces/ids into raw text

    % spm_decode --model=<model_file> --input_format=piece < input > output

    % spm_decode --model=<model_file> --input_format=id < input > output

    Use the --extra_options flag to decode the text in reverse order.

    % spm_decode --extra_options=reverse < input > output

  4. End-to-End Example

    % spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000

    unigram_model_trainer.cc(494) LOG(INFO) Starts training with :

    input: "../data/botchan.txt"

    ...

    unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091

    trainer_interface.cc(272) LOG(INFO) Saving model: m.model

    trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

    % echo "I saw a girl with a telescope." | spm_encode --model=m.model

    ▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

    % echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id

    9 459 11 939 44 11 4 142 82 8 28 21 132 6

    % echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id

    I saw a girl with a telescope.

    You can find that the original input sentence is restored from the vocabulary id sequence.
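
    The same round trip can be reproduced from Python with the m.model file generated above; a minimal sketch:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file='m.model')
    text = 'I saw a girl with a telescope.'
    ids = sp.encode(text)           # vocabulary ids, e.g. [9, 459, 11, ...]
    assert sp.decode(ids) == text   # decoding restores the original sentence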

  5. Export vocabulary list

    % spm_export_vocab --model=<model_file> --output=<output file>

    <output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
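
    The same information can be read through the Python wrapper; a minimal sketch that prints one piece and its emission log probability per line, in vocabulary id order:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file='m.model')
    for i in range(sp.get_piece_size()):
        print(sp.id_to_piece(i), sp.get_score(i), sep='\t')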

  6. Redefine special meta tokens

    By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

    % spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

    When setting -1 id, e.g., bos_id=-1, this special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) as --pad_id=3.
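
    From Python, the ids assigned to these meta tokens can be checked after loading the model; a minimal sketch:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file='m.model')
    # Disabled tokens are reported as -1; pad_id() is -1 unless --pad_id was set.
    print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())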

  7. Vocabulary restriction

    spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency).

    The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get the resulting vocabulary for each:

    % cat {train_file}.L1 {train_file}.L2 | shuffle > train

    % spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995

    % spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1

    % spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

    The shuffle command is used just in case, because spm_train loads the first 10M lines of the corpus by default.

    Then segment the train/test corpus with the --vocabulary option:

    % spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1

    % spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
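
    The same restricted segmentation can be done from the Python wrapper, which exposes the C++ LoadVocabulary/ResetVocabulary calls; a minimal sketch, assuming vocab.L1 stands in for {vocab_file}.L1 and that these methods are available in your installed version:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file='spm.model')
    # Only produce pieces that appear at least 50 times in the L1 vocabulary.
    sp.load_vocabulary('vocab.L1', 50)
    print(sp.encode('an L1 sentence here', out_type=str))
    sp.reset_vocabulary()  # back to the unrestricted vocabulary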

Contribution

  1. Fork the repository
  2. Create Feat_xxx branch
  3. Commit your code
  4. Create Pull Request

Gitee Feature

  1. You can use Readme_XXX.md to support different languages, such as Readme_en.md, Readme_zh.md
  2. Gitee blog blog.gitee.com
  3. Explore open source project https://gitee.com/explore
  4. The most valuable open source project GVP
  5. The manual of Gitee https://gitee.com/help
  6. The most popular members https://gitee.com/gitee-stars/