
keras - AttributeError: 'numpy.ndarray' object has no attribute 'lower'

I'm not sure what I'm doing wrong:

I have a dataset (all text), and fitting a Tokenizer on it fails.

from keras.preprocessing.text import Tokenizer as Tok

# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 50000
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100

labels = training_data['group_name']
features = training_data.drop('group_name',axis='columns')

tokenizer = Tok(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(features.values)  # raises the AttributeError below

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

I cleaned the text as follows:

import re
from nltk.corpus import stopwords

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))


def clean_text(text):
    text = text.lower()  # lowercase the text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace symbols matched by REPLACE_BY_SPACE_RE with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols matched by BAD_SYMBOLS_RE
    text = text.replace('x', '')  # note: removes every 'x', including those inside words

    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords
    return text

print('clean text')
training_data = training_data.applymap(lambda x: clean_text(x))

...so I can't see where the numpy.ndarray is coming from.
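For context, the ndarray most likely comes from `features.values` itself: on a multi-column DataFrame it is a 2-D array, and iterating over a 2-D array yields whole rows (each one an ndarray) rather than strings. The Tokenizer then tries to lowercase each "text" and fails. A minimal sketch of that behaviour without Keras, assuming a hypothetical two-column frame (the column names `title` and `body` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned `features` frame (two text columns).
features = pd.DataFrame({
    "title": ["printer broken", "vpn down"],
    "body": ["cannot print from laptop", "remote access fails"],
})

values = features.values         # 2-D array: one row per sample, one column per field
first_item = next(iter(values))  # iterating a 2-D array yields rows, not cells

print(values.shape)
print(type(first_item).__name__)
print(hasattr(first_item, "lower"))  # a row has no str methods such as .lower()

# Taking a single column gives plain strings, which is what the tokenizer expects.
first_text = features["title"].tolist()[0]
print(type(first_text).__name__)
```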

Update:

I was able to understand the problem and work around it in an ugly way:

Merge all the columns into one:

labels = df['group_name']
features = df.drop('group_name',axis='columns')
tmp = pd.DataFrame()
tmp['txt'] = features[features.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),axis=1
)

Now it gets past the problematic step:

from keras.preprocessing.sequence import pad_sequences

tokenizer = Tok(num_words=MAX_NB_WORDS, lower=True)
tokenizer.fit_on_texts(tmp['txt'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(tmp['txt'].values)
X = pad_sequences(X,maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:',X.shape)
Y = pd.get_dummies(labels).values
print('Shape of label tensor:',Y.shape)

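Still, I would like to keep the original columns rather than putting all the data into a single one.

How can I do that (i.e. get all the values from all the columns without iterating over nested numpy arrays)?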
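One possible approach, offered as a sketch rather than a definitive answer: flatten the 2-D array with `ravel()` so the tokenizer sees one string per cell when building a shared vocabulary, then convert each column separately so the columns stay distinct. The helper names `fit_on_all_columns` and `sequences_per_column` are hypothetical, and `tokenizer` is assumed to be any object with Keras-style `fit_on_texts` and `texts_to_sequences` methods. Missing cells are treated as empty strings, which is also an assumption.

```python
import pandas as pd


def fit_on_all_columns(tokenizer, features):
    """Fit one shared vocabulary on every cell of every text column."""
    # ravel() flattens the 2-D array to 1-D, so each item is a single string.
    all_texts = features.fillna("").astype(str).values.ravel().tolist()
    tokenizer.fit_on_texts(all_texts)
    return tokenizer


def sequences_per_column(tokenizer, features):
    """Return {column name: list of token-id sequences}, one entry per column."""
    return {
        col: tokenizer.texts_to_sequences(features[col].fillna("").astype(str).tolist())
        for col in features.columns
    }
```

With the code from the question this would be called as `fit_on_all_columns(Tok(num_words=MAX_NB_WORDS, lower=True), features)`, followed by `pad_sequences` on each column's sequences. That yields one padded array per column; whether those are fed to a multi-input model or concatenated afterwards depends on the model design, which the question does not specify.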
