How to fix keras - AttributeError: 'numpy.ndarray' object has no attribute 'lower'
I'm not sure what I'm doing wrong:
I have a dataset (all text), and fitting a Tokenizer on it fails.
from keras.preprocessing.text import Tokenizer as Tok
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 50000
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100
labels = training_data['group_name']
features = training_data.drop('group_name',axis='columns')
tokenizer = Tok(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(features.values)
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
I have already cleaned the text as follows:
import re
from nltk.corpus import stopwords

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()  # lowercase the text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace symbols matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols matched by BAD_SYMBOLS_RE
    text = text.replace('x', '')  # drop every 'x' character
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords
    return text

print('clean text')
training_data = training_data.applymap(lambda x: clean_text(x))
...so I can't see where the numpy.ndarray is coming from.
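A likely source of the array: `features` still holds several columns, so `features.values` is a 2-D array, and iterating over it yields one `numpy.ndarray` per row rather than one string per document. With `lower=True` the tokenizer then tries to lowercase each of those rows. The sketch below reproduces only the data shape, using a small made-up frame (the column names `title` and `body` are assumptions, not part of the real dataset):

```python
import pandas as pd

# Hypothetical two-column frame standing in for `features`.
features = pd.DataFrame({
    "title": ["printer broken", "vpn down"],
    "body": ["paper jam again", "cannot connect"],
})

# Iterating a 2-D array yields one ndarray per row, not one string per cell.
rows = list(features.values)
first_row = rows[0]
row_type = type(first_row).__name__
has_lower = hasattr(first_row, "lower")

# A single column, by contrast, yields plain Python strings.
first_cell = features["title"].values[0]
cell_has_lower = hasattr(first_cell, "lower")

print(row_type, has_lower, cell_has_lower)
```

Each element handed to the tokenizer has to be a string (or a list of tokens), which is why passing a single text column works while passing the whole frame does not.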
Update:
I was able to understand the problem and work around it in an ugly way:
Merge all the columns into one:
labels = df['group_name']
features = df.drop('group_name',axis='columns')
tmp = pd.DataFrame()
tmp['txt'] = features[features.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(str)), axis=1
)
Now it gets past the problematic step:
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tok(num_words=MAX_NB_WORDS, lower=True)
tokenizer.fit_on_texts(tmp['txt'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(tmp['txt'].values)
X = pad_sequences(X,maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:',X.shape)
Y = pd.get_dummies(labels).values
print('Shape of label tensor:',Y.shape)
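However, I would still like to keep the original columns rather than putting all the data into a single column.
How can I do that (i.e. get all the values from all the columns without iterating over the nested numpy arrays)?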
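One possible approach, offered as a sketch rather than a definitive answer: flatten every cell into a single flat list of strings just for fitting the vocabulary, and keep each column as its own list for later conversion. The frame and its column names below are made up for illustration, and the Keras calls are left as comments because they assume the `tokenizer`, `pad_sequences` and `MAX_SEQUENCE_LENGTH` from the question:

```python
import pandas as pd

# Hypothetical frame standing in for `features`; one cell is missing.
features = pd.DataFrame({
    "title": ["printer broken", "vpn down"],
    "body": ["paper jam again", None],
})

# Replace missing cells with empty strings and make sure every cell is a str.
cleaned = features.fillna("").astype(str)

# ravel() flattens the 2-D array row by row, so no explicit nested loop is needed.
all_texts = cleaned.values.ravel().tolist()

# Assumed usage with the tokenizer from the question:
# tokenizer.fit_on_texts(all_texts)

# The original columns remain separate, one list of strings per column.
per_column = {col: cleaned[col].tolist() for col in cleaned.columns}

# Assumed usage: one padded sequence matrix per column.
# X_by_col = {
#     col: pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_SEQUENCE_LENGTH)
#     for col, texts in per_column.items()
# }

print(all_texts)
print(list(per_column))
```

Fitting on the flattened list gives all columns a shared word index, while converting column by column keeps the per-column structure. Whether a shared vocabulary is appropriate depends on the model that consumes the sequences.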