
The tokenizers decoders module

Module overview

The decoders module is responsible for converting ids back into readable text. A Decoder from the decoders module mainly undoes the special characters introduced by the PreTokenizer from the pre_tokenizers module. For example, the Metaspace pre-tokenizer replaces spaces with the "▁" character, and the Metaspace decoder turns "▁" back into spaces; likewise, the ByteLevel pre-tokenizer replaces spaces with the symbol "Ġ", and the corresponding ByteLevel decoder decodes "Ġ" back into spaces.

The decoders module provides subclasses of Decoder. The official documentation describes Decoder as follows; in short, a Decoder is in charge of mapping a tokenized input back to the original string, and it is usually chosen to match the PreTokenizer used earlier.

Decoding: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according to the PreTokenizer we used previously.
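A decoder can also be called directly on a list of token strings through its decode method, which makes this pairing easy to see. The snippet below is a minimal sketch with hand-written tokens (not produced by a trained tokenizer), showing the Metaspace and ByteLevel decoders undoing the markers their pre-tokenizers introduce:

>>> from tokenizers import decoders

>>> # Metaspace pre-tokenization marks spaces with "▁"; the decoder turns them back into spaces
>>> decoders.Metaspace().decode(["▁this", "▁is", "▁a", "▁text"])
'this is a text'
>>> # ByteLevel pre-tokenization marks spaces with "Ġ"; the decoder restores them as well
>>> decoders.ByteLevel().decode(["this", "Ġis", "Ġa", "Ġtext"])
'this is a text'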

Module usage

1. BPEDecoder

tokenizers.decoders.BPEDecoder(suffix='</w>')

The BPEDecoder merges subwords back into words and converts the suffix "</w>" that is appended to the end of each word into a space.
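Before the full training example below, a token-level sketch (with hand-written subword strings) shows what the decoder itself does: every "</w>" marks the end of a word and becomes a space, while subwords without the suffix are merged with what follows.

>>> from tokenizers import decoders

>>> # "te" has no suffix, so it is merged with "xt</w>" into "text"
>>> decoders.BPEDecoder(suffix="</w>").decode(["this</w>", "is</w>", "a</w>", "te", "xt</w>"])
'this is a text'

This is also why the decoded output of the full example keeps a space before the "!": the pre-tokenizer splits it off as a separate word, so it carries its own "</w>".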

>>> from datasets import load_dataset
>>> from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, processors, decoders

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
>>> tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)
>>> tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
>>> special_tokens = ["<unk>"]
>>> trainer = trainers.BpeTrainer(special_tokens=special_tokens,
                                  end_of_word_suffix="</w>")
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> tokenizer.decoder = decoders.BPEDecoder()
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text !

2. ByteLevel

tokenizers.decoders.ByteLevel()

The ByteLevel decoder is used together with the ByteLevel pre-tokenizer: while converting ids back into text, it also turns the symbol "Ġ" back into spaces.

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")

>>> tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
>>> tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
>>> special_tokens = ["<unk>", "<pad>", "<s>", "</s>", "<mask>"]
>>> trainer = trainers.BpeTrainer(special_tokens=special_tokens)
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> tokenizer.post_processor = processors.RobertaProcessing(sep=("</s>", tokenizer.token_to_id("</s>")),
	                                                        cls=("<s>", tokenizer.token_to_id("<s>")),
	                                                        trim_offsets=True,
	                                                        add_prefix_space=False)
>>> tokenizer.decoder = decoders.ByteLevel()

>>> tokenizer.encode("this is a text!").ids
[2, 256, 202, 305, 176, 4452, 5, 3]
>>> tokenizer.pre_tokenizer.pre_tokenize_str("this is a text!")
[('this', (0, 4)), ('Ġis', (4, 7)), ('Ġa', (7, 9)), ('Ġtext', (9, 14)), ('!', (14, 15))]
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text!

3. CTC

tokenizers.decoders.CTC(pad_token='<pad>', word_delimiter_token='|', cleanup=True)

The CTC decoder is meant for the output of CTC-style models (for example speech models such as Wav2Vec2), where words are separated by the "|" token: it collapses repeated consecutive tokens, removes the pad token, and converts the word delimiter back into spaces.
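Since no training pipeline is involved here, the decoder can simply be tried on a list of token strings. The tokens below are a hand-written sketch of typical CTC output (repeated characters, pad tokens, "|" as the word boundary), not the output of a real acoustic model:

>>> from tokenizers import decoders

>>> ctc = decoders.CTC(pad_token="<pad>", word_delimiter_token="|", cleanup=True)
>>> # consecutive duplicates are collapsed, "<pad>" is dropped, and "|" becomes a space
>>> ctc.decode(["h", "h", "e", "l", "<pad>", "l", "o", "|", "w", "o", "r", "l", "d"])
'hello world'

Note the "<pad>" between the two "l" tokens: without it, the duplicate would be collapsed and the result would be "helo world".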

4. Metaspace

tokenizers.decoders.Metaspace(replacement="▁", add_prefix_space=True)

The Metaspace decoder is used together with the Metaspace pre-tokenizer: it converts ids back into text and turns the "▁" character back into spaces.

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> tokenizer = Tokenizer(models.Unigram())
>>> tokenizer.normalizer = normalizers.Sequence(
	    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"'), normalizers.Lowercase()]
	)
>>> tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
>>> special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
>>> trainer = trainers.UnigramTrainer(special_tokens=special_tokens, unk_token="[UNK]")
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> cls_token_id = tokenizer.token_to_id("[CLS]")
>>> sep_token_id = tokenizer.token_to_id("[SEP]")
>>> tokenizer.post_processor = processors.TemplateProcessing(
	    single="[CLS]:0 $A:0 [SEP]:0",
	    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
	    special_tokens=[
	        ("[CLS]", cls_token_id),
	        ("[SEP]", sep_token_id),
	    ],
	)
>>> tokenizer.decoder = decoders.Metaspace()
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text!

5. WordPiece

tokenizers.decoders.WordPiece(prefix='##', cleanup=True)

The WordPiece decoder handles the "##" prefix on subwords and merges the subwords back into words.
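A token-level sketch (hand-written subwords) of this behavior: tokens starting with "##" are glued to the previous token, the others are separated by spaces, and cleanup removes the space before punctuation.

>>> from tokenizers import decoders

>>> decoders.WordPiece(prefix="##").decode(["this", "is", "a", "te", "##xt", "!"])
'this is a text!'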

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
>>> tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
>>> special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
>>> trainer = trainers.WordPieceTrainer(special_tokens=special_tokens)
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> tokenizer.post_processor = processors.BertProcessing(sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
                                                   	 	 cls=("[CLS]", tokenizer.token_to_id("[CLS]")))
>>> tokenizer.decoder = decoders.WordPiece()
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text!
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids, skip_special_tokens=False)
[CLS] this is a text! [SEP]
