Tokenizing Chinese text with keras.preprocessing.text.Tokenizer

keras.preprocessing.text.Tokenizer does not handle Chinese text correctly. How can I modify it so that it works with Chinese text?

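from keras.preprocessing.text import Tokenizer

def fit_get_tokenizer(data, max_words):
    # Fit a word-level tokenizer that strips the listed punctuation characters
    tokenizer = Tokenizer(num_words=max_words, filters='!"#%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
    tokenizer.fit_on_texts(data)
    return tokenizer

# df is not defined in the question; it is assumed to be a pandas DataFrame
# with a `sentence` column holding the raw sentences
tokenizer = fit_get_tokenizer(df.sentence, max_words=150000)
print('Total number of words: ', len(tokenizer.word_index))

# Invert word_index so each integer index maps back to its word
vocabulary_inv = {index: word for word, index in tokenizer.word_index.items()}
print(vocabulary_inv)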

Posted 6 days ago

Best answer:

Since I cannot post Chinese text on SO, I will demonstrate with English sentences, but the same approach applies to Chinese:

import tensorflow as tf
text = ['This is a chinese sentence',
        'This is another chinese sentence']
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=50, char_level=False)
tokenizer.fit_on_texts(text)
print(tokenizer.word_index)

{'this': 1, 'is': 2, 'chinese': 3, 'sentence': 4, 'a': 5, 'another': 6}

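Make sure you pass a list of space-separated Chinese sentences, and it should work correctly. Passing input in a different shape, such as a list of lists, can lead to unexpected behavior.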
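Chinese text normally has no spaces between words, so the sentences need to be made space-separated before they reach the tokenizer. The following is a minimal sketch of one way to do that in plain Python, assuming character-level splitting is acceptable. The helper name `space_separate_chinese`, the sample sentences, and the choice of the `\u4e00-\u9fff` range are illustrative assumptions, not part of the original answer. For word-level tokens, a dedicated segmenter such as jieba could be used in place of this helper.

```python
import re

# Matches a single character in the CJK Unified Ideographs block.
# Assumption: this range covers common Chinese characters, but not
# every Unicode extension block.
CJK_CHAR = re.compile(r'([\u4e00-\u9fff])')

def space_separate_chinese(sentence):
    """Hypothetical helper: surround each Chinese character with spaces
    so that whitespace-based splitting sees one token per character."""
    spaced = CJK_CHAR.sub(r' \1 ', sentence)
    # Collapse the runs of whitespace created by the substitution.
    return ' '.join(spaced.split())

sentences = ['我喜欢深度学习', '我喜欢Keras']
prepared = [space_separate_chinese(s) for s in sentences]
print(prepared)
# ['我 喜 欢 深 度 学 习', '我 喜 欢 Keras']
```

The resulting `prepared` list has the shape the answer describes (a flat list of space-separated sentences), so it can be passed to `tokenizer.fit_on_texts(prepared)`.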
