How do I decode this text as UTF-8?
The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete. \n \n
I need to decode all of this as UTF-8, except for the "\n". So I want this output:
Original :The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete. \n \n
Decoded : The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete. \n \n
Solution
Your input must be a byte string before it can be decoded. Assuming it is, use bytes.decode():
>>> s = b'The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete. \n \n'
>>> type(s)
<class 'bytes'>
>>> s2 = s.decode('utf8')
>>> type(s2)
<class 'str'>
>>> s2
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete. \n \n'
The session above shows a byte string (class bytes) being decoded into a Unicode string (class str).
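If the bytes may not be entirely valid UTF-8, the errors argument of bytes.decode() controls what happens. The following is a minimal sketch of the three most common settings; the sample bytes and variable names are made up for illustration and are not part of the original question.

```python
# Sample bytes (an assumption for illustration): a valid UTF-8 em dash
# followed by a stray byte 0xff, which is never valid in UTF-8.
raw = b'Base\xe2\x80\x94now\xff'

# errors='strict' is the default and raises UnicodeDecodeError on bad input.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode('utf-8', errors='replace')

# errors='ignore' silently drops undecodable bytes.
ignored = raw.decode('utf-8', errors='ignore')

print(strict_failed)
print(repr(replaced))
print(repr(ignored))
```

Whether replacing or dropping bad bytes is acceptable depends on the data; for text you intend to keep, failing loudly with the default is usually the safer choice.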
Use rstrip() to strip the trailing whitespace, including the newlines:
>>> s2.rstrip()
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete.'
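If your data comes from a file or another stream, you can decode it as it is read by specifying the encoding when you open the file or stream: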
with open('file.txt', encoding='utf8') as f:
    for line in f:
        print(line)
This decodes the incoming data from UTF-8, so your code only ever handles strings, not byte strings. See the documentation for open() for details.
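To try this end to end without an existing file, here is a small self-contained sketch. It writes UTF-8 bytes to a temporary file (a stand-in for file.txt, chosen only for this example) and reads them back as text.

```python
import os
import tempfile

# Sample text for illustration; \u2014 is the em dash from the question.
text = 'Army Base\u2014now the Training Center\n'

# Write the sample as raw UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(text.encode('utf-8'))
    path = tmp.name

try:
    # Opening in text mode with an explicit encoding decodes while reading.
    with open(path, encoding='utf-8') as f:
        lines = [line.rstrip('\n') for line in f]
finally:
    os.remove(path)

print(lines)
```

Stripping the newline from each line avoids the blank lines that print(line) would otherwise produce, since print adds its own newline.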
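If the text is already a str whose characters are really UTF-8 bytes that were mis-decoded as Latin-1 (mojibake), you can repair this specific case by encoding it back to bytes and decoding it properly: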
>>> s = 'The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete. \n \n'
>>> s.encode('latin1').decode('utf-8')
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete. \n \n'
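The round trip through latin1 fails on text that is not actually mojibake, so it can help to wrap it. Below is a rough sketch of such a wrapper; fix_mojibake is a hypothetical helper name, not a standard library function, and it assumes the original mis-decoding was Latin-1.

```python
def fix_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were mis-decoded as Latin-1.

    Returns the text unchanged when the round trip is not possible.
    """
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # UnicodeEncodeError: a character is outside the Latin-1 range.
        # UnicodeDecodeError: the resulting bytes are not valid UTF-8.
        return text


garbled = 'Army Base\xe2\x80\x94now'
repaired = fix_mojibake(garbled)

# Already-correct text contains U+2014, which Latin-1 cannot encode,
# so it comes back unchanged.
already_ok = fix_mojibake('Army Base\u2014now')

# A genuine Latin-1 character (e with acute accent) yields invalid UTF-8,
# so it is also left alone.
accented = fix_mojibake('caf\xe9')

print(repr(repaired))
print(repr(already_ok))
print(repr(accented))
```

This is only a heuristic: some strings are valid under both interpretations, so it is best applied to data known to have been mis-decoded.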