sentencepiece/Removed-codes-where-Zero-Width-Joiner-replaced-with-.patch
2021-11-02 11:38:21 +08:00

From 82b8b6f61403fcfcef673ee49ed2dfe475ba4cf2 Mon Sep 17 00:00:00 2001
From: Sarubi <stsarut@gmail.com>
Date: Tue, 23 Feb 2021 20:47:25 +0530
Subject: [PATCH] Removed codes where Zero Width Joiner replaced with
whitespace.
---
data/nmt_nfkc.tsv | 3 +--
data/nmt_nfkc_cf.tsv | 3 +--
src/builder.cc | 1 -
3 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/data/nmt_nfkc.tsv b/data/nmt_nfkc.tsv
index 1ce2b71..5c8b48b 100644
--- a/data/nmt_nfkc.tsv
+++ b/data/nmt_nfkc.tsv
@@ -57263,8 +57263,7 @@ FB9 F90 FB5 # ྐྵ => ྐྵ
200A 20 # =>
200B 20 # =>
200C 20 # =>
-200D 20 # =>
-200E 20 # =>
+200E 20 # =>
200F 20 # =>
2011 2010 # =>
2017 20 333 # ‗ => ̳
diff --git a/data/nmt_nfkc_cf.tsv b/data/nmt_nfkc_cf.tsv
index 2178882..0d0e708 100644
--- a/data/nmt_nfkc_cf.tsv
+++ b/data/nmt_nfkc_cf.tsv
@@ -57980,8 +57980,7 @@ FB9 F90 FB5 # ྐྵ => ྐྵ
200A 20 # =>
200B 20 # =>
200C 20 # =>
-200D 20 # =>
-200E 20 # =>
+200E 20 # =>
200F 20 # =>
2011 2010 # =>
2017 20 333 # ‗ => ̳
diff --git a/src/builder.cc b/src/builder.cc
index d9442d3..9f47aac 100644
--- a/src/builder.cc
+++ b/src/builder.cc
@@ -366,7 +366,6 @@ util::Status Builder::BuildNmtNFKCMap(CharsMap *chars_map) {
nfkc_map[{0xFEFF}] = {0x20}; // ZERO WIDTH NO-BREAK
nfkc_map[{0xFFFD}] = {0x20}; // REPLACEMENT CHARACTER
nfkc_map[{0x200C}] = {0x20}; // ZERO WIDTH NON-JOINER
- nfkc_map[{0x200D}] = {0x20}; // ZERO WIDTH JOINER
// Ascii Control characters
nfkc_map[{0x0001}] = {};
--