From 5c09745aafa151be7ed5d9a9101f3e8c79a8758b Mon Sep 17 00:00:00 2001
From: stephantul <stephantul@gmail.com>
Date: Thu, 1 Oct 2020 12:49:13 +0200
Subject: [PATCH 3/7] Create options.md

---
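Reviewer note, not part of the commit: since the file added here explains that a plain `pip install` may not provide `spm_train`, the same flags can presumably be supplied through the Python bindings. The sketch below is an illustration under assumptions: `build_spm_args` is a hypothetical helper, `corpus.txt` and `m` are placeholder names, and it assumes `sentencepiece.SentencePieceTrainer.train` accepts one flag string in the `--name=value` form used by the CLI.

```python
def build_spm_args(options):
    """Render a dict of training options as a --name=value flag string."""
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Booleans are written lowercase, as in the help text defaults.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Comma-separated flags such as --input may be given as a list.
            value = ",".join(str(v) for v in value)
        parts.append(f"--{name}={value}")
    return " ".join(parts)


args = build_spm_args({
    "input": ["corpus.txt"],   # placeholder path
    "model_prefix": "m",       # placeholder prefix
    "vocab_size": 8000,
    "model_type": "unigram",
    "split_digits": False,
})
print(args)

# With the sentencepiece package installed, the string could be passed on:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(args)
```

The helper does no quoting, so values that contain spaces would need separate handling.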
doc/options.md | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
create mode 100644 doc/options.md

diff --git a/doc/options.md b/doc/options.md
new file mode 100644
index 0000000..7861fdc
--- /dev/null
+++ b/doc/options.md
@@ -0,0 +1,51 @@
+# Training options
+
+The training options for `spm_train` can be listed using `spm_train --help`. Since the standard `pip install` of sentencepiece does not necessarily install `spm_train`, the options are also listed here.
+
+```
+--help (show help) type: bool default: false
+--version (show version) type: bool default: false
+--minloglevel (Messages logged at a lower level than this don't actually get logged anywhere) type: int default: 0
+--input (comma separated list of input sentences) type: std::string default: ""
+--input_format (Input format. Supported format is `text` or `tsv`.) type: std::string default: ""
+--model_prefix (output model prefix) type: std::string default: "" --model_type (model algorithm: unigram, bpe, word or char) type: std::string default: "unigram"
+--vocab_size (vocabulary size) type: int32 default: 8000
+--accept_language (comma-separated list of languages this model can accept) type: std::string default: ""
+--self_test_sample_size (the size of self test samples) type: int32 default: 0
+--character_coverage (character coverage to determine the minimum symbols) type: double default: 0.9995
+--input_sentence_size (maximum size of sentences the trainer loads) type: int32 default: 0
+--shuffle_input_sentence (Randomly sample input sentences in advance. Valid when --input_sentence_size > 0) type: bool default: true
+--seed_sentencepiece_size (the size of seed sentencepieces) type: int32 default: 1000000
+--shrinking_factor (Keeps top shrinking_factor pieces with respect to the loss) type: double default: 0.75
+--num_threads (number of threads for training) type: int32 default: 16
+--num_sub_iterations (number of EM sub-iterations) type: int32 default: 2
+--max_sentencepiece_length (maximum length of sentence piece) type: int32 default: 16
+--max_sentence_length (maximum length of sentence in byte) type: int32 default: 4192
+--split_by_unicode_script (use Unicode script to split sentence pieces) type: bool default: true
+--split_by_number (split tokens by numbers (0-9)) type: bool default: true
+--split_by_whitespace (use a white space to split sentence pieces) type: bool default: true
+--split_digits (split all digits (0-9) into separate pieces) type: bool default: false
+--treat_whitespace_as_suffix (treat whitespace marker as suffix instead of prefix.) type: bool default: false
+--control_symbols (comma separated list of control symbols) type: std::string default: ""
+--user_defined_symbols (comma separated list of user defined symbols) type: std::string default: ""
+--required_chars (UTF8 characters in this flag are always used in the character set regardless of --character_coverage) type: std::string default: ""
+--byte_fallback (decompose unknown pieces into UTF-8 byte pieces) type: bool default: false
+--vocabulary_output_piece_score (Define score in vocab file) type: bool default: true
+--normalization_rule_name (Normalization rule name. Choose from nfkc or identity) type: std::string default: "nmt_nfkc"
+--normalization_rule_tsv (Normalization rule TSV file. ) type: std::string default: ""
+--denormalization_rule_tsv (Denormalization rule TSV file.) type: std::string default: ""
+--add_dummy_prefix (Add dummy whitespace at the beginning of text) type: bool default: true
+--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace) type: bool default: true
+--hard_vocab_limit (If set to false, --vocab_size is considered as a soft limit.) type: bool default: true
+--use_all_vocab (If set to true, use all tokens as vocab. Valid for word/char models.) type: bool default: false
+--unk_id (Override UNK (<unk>) id.) type: int32 default: 0
+--bos_id (Override BOS (<s>) id. Set -1 to disable BOS.) type: int32 default: 1
+--eos_id (Override EOS (</s>) id. Set -1 to disable EOS.) type: int32 default: 2
+--pad_id (Override PAD (<pad>) id. Set -1 to disable PAD.) type: int32 default: -1
+--unk_piece (Override UNK (<unk>) piece.) type: std::string default: "<unk>"
+--bos_piece (Override BOS (<s>) piece.) type: std::string default: "<s>"
+--eos_piece (Override EOS (</s>) piece.) type: std::string default: "</s>"
+--pad_piece (Override PAD (<pad>) piece.) type: std::string default: "<pad>"
+--unk_surface (Dummy surface string for <unk>. In decoding <unk> is decoded to `unk_surface`.) type: std::string default: " ⁇ "
+--train_extremely_large_corpus (Increase bit depth for unigram tokenization.) type: bool default: false
+```
--
2.18.0.huawei.25