Removing unsupported Unicode characters using a list comprehension

How do I remove unsupported Unicode characters using a list comprehension?

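I am trying to write an algorithm that removes non-ASCII characters from a list of text strings. I built the list by scraping paragraphs from a web page and appending them to it. To do the cleanup, I wrote a nested for loop that iterates over each element of the list of strings and then over the characters of each string. Here is an example of the list of strings I am using: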

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda,is a bear[6] native to South Central China','It is characterised by large,black patches around its eyes,over the ears,and across its round body']

My final step is then to replace every character whose ord() value is greater than 128, like this:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    for i in range(len(text_list)):
        # for each string in the text list
        for char in text_list[i]:
            # for each character in the individual string
            if ord(char) > 128:
              text_list[i] = text_list[i].replace(char,'')

    return text_list

This works fine as a nested for loop. But because I want to scale it up, I thought I would write it as a list comprehension, like this:

def remove_utf_chars(text_list):
    """
    :param text_list: a list of lines of text for each element in the list
    :return: a list of lines of text without utf characters
    """

    scrubbed_text = [text_list[i].replace(char,'') for i in range(len(text_list))
                     for char in text_list[i] if ord(char) > 128]

    return scrubbed_text

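But for some reason this does not work. At first I thought it might be related to the method I was using in the expression to remove the Unicode characters, since text_list is a list while text_list[i] is a string, so I changed the method from .strip() to .replace(). That did not help. Then I thought it might be related to where I had placed .replace(), so I moved it around inside the list comprehension, with no change. So I am at a loss. I suspect the problem lies in converting this particular nested for loop, which filters Unicode characters, into a comprehension, because not all for loops can be written as list comps, but all list comps can be written as for loops.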

Solution

There is a simpler way to remove non-ASCII characters: encode the string as ASCII and specify errors='ignore' to drop them. For example:

text = ['The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo),[5] also known as the panda bear or simply the panda,is a bear[6] native to South Central China','It is characterised by large,black patches around its eyes,over the ears,and across its round body']

>>> text[0].encode('ascii',errors='ignore')
b'The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda,is a bear[6] native to South Central China'

This gives you a byte string, i.e. the result is of type bytes. You can convert it back to a Python string using decode():

>>> text[0].encode('ascii',errors='ignore').decode()
'The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda,is a bear[6] native to South Central China'

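You could be pedantic and specify .decode('ascii'), but the default codec (UTF-8 in Python 3) already covers this, since ASCII is a subset of UTF-8.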
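As a quick check of the point about the default codec, here is a minimal sketch that compares the two decode calls on a short string. The name `sample` and its contents are made up for illustration; the non-ASCII letters are written as escapes so that the precomposed forms are unambiguous.

```python
# Hypothetical sample: "pinyin: dàxióngmāo" with precomposed accented letters.
sample = "pinyin: d\u00e0xi\u00f3ngm\u0101o"

# Drop every character that cannot be encoded as ASCII.
ascii_bytes = sample.encode("ascii", errors="ignore")

# bytes.decode() defaults to UTF-8 in Python 3; bytes that are pure ASCII
# decode to the same string under either codec.
default_decoded = ascii_bytes.decode()
explicit_decoded = ascii_bytes.decode("ascii")

print(default_decoded == explicit_decoded)  # True
print(default_decoded)                      # pinyin: dxingmo
```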

To do this as a list comprehension:

def remove_non_ascii_chars(text_list):
    return [s.encode('ascii',errors='ignore').decode('ascii') for s in text_list]

>>> remove_non_ascii_chars(text)
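['The giant panda (Ailuropoda melanoleuca; Chinese: ; pinyin: dxingmo),[5] also known as the panda bear or simply the panda,is a bear[6] native to South Central China', 'It is characterised by large,black patches around its eyes,over the ears,and across its round body']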

You could also write the function to return a generator, which in many cases scales better, depending on how the strings are used in later code:

def remove_non_ascii_chars(text_list):
    return (s.encode('ascii',errors='ignore').decode('ascii') for s in text_list)
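One way the generator version might be consumed is sketched below. The list `paragraphs` is a hypothetical stand-in for the scraped text; the point is only that each string is cleaned when it is requested, and that a generator can be iterated once.

```python
def remove_non_ascii_chars(text_list):
    # Generator expression: each string is cleaned lazily, on demand.
    return (s.encode("ascii", errors="ignore").decode("ascii") for s in text_list)

# Hypothetical input; the first item contains three CJK characters.
paragraphs = ["Chinese: \u5927\u718a\u732b;", "plain ASCII text"]

cleaned = remove_non_ascii_chars(paragraphs)

# No string has been processed yet; iterating drives the work item by item.
for line in cleaned:
    print(repr(line))

# The generator is now exhausted. If the results are needed more than once,
# build a list from a fresh generator instead.
cleaned_list = list(remove_non_ascii_chars(paragraphs))
```

If the cleaned strings are only streamed onward (written to a file, fed to another step), the generator avoids holding a second full copy of the text in memory.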

You need either an outer loop or a second comprehension: one level walks the list, and the inner level walks the characters of each string:

def remove_utf_chars(text_list):
    scrubbed_text = ["".join([y for y in x if ord(y) < 128]) for x in text_list]
    return scrubbed_text
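To see why the comprehension in the question behaves differently from the nested loop, it may help to expand it into the loops it is equivalent to. The sketch below uses a made-up two-item list. The flat comprehension emits one output string per (string, non-ASCII character) pair, each with only that one character removed, and emits nothing for strings that are already pure ASCII. The original loop worked because it reassigned text_list[i] on every hit, so the removals accumulated.

```python
# Hypothetical sample: "dàxióng" (two non-ASCII letters) and a pure-ASCII string.
text = ["d\u00e0xi\u00f3ng", "plain"]

# The comprehension from the question, unchanged apart from the variable name.
broken = [text[i].replace(char, "") for i in range(len(text))
          for char in text[i] if ord(char) > 128]

# The explicit loops that comprehension is equivalent to: it appends a new
# string for every offending character instead of updating text[i].
broken_loop = []
for i in range(len(text)):
    for char in text[i]:
        if ord(char) > 128:
            broken_loop.append(text[i].replace(char, ""))

# Nested comprehension: the inner part filters characters, the outer part
# produces exactly one cleaned string per input string.
fixed = ["".join(c for c in s if ord(c) < 128) for s in text]

print(broken == broken_loop)  # True
print(len(broken))            # 2, one entry per non-ASCII character
print(fixed)                  # ['dxing', 'plain']
```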
