How do I create a new list from a list, but without duplicates?
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/damselfly">damselfly</a>','<a href="/lyric/37311114/Loyle+Carner/damselfly">damselfly</a>','<a href="/lyric/37360958/Loyle+Carner/damselfly">damselfly</a>','<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>','<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
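Now I want to get rid of the duplicate items. The problem is that the duplicates are not identical strings: they differ in the ID near the start and only match from a certain point onward, i[38:].
My idea was to write a for loop: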
new_list = []
for i in carner_list:
    if i[38:] in new_list:
        print("found")
    else:
        new_list = new_list + [i]
        print("not")
But this doesn't work.
Is the syntax wrong, or am I completely on the wrong track?
Best, Russell
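Solution
I wrote a small function called listContains that I think solves your problem. Your code doesn't work because you search new_list for the value i[38:], but you add the whole value of i to new_list.
So you should also apply the [38:] rule to every value already in the list.
I think the following code explains what I mean better: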
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>','<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>','<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>','<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>','<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
new_list = []

def listContains(myList, toSearch):
    # Compare the tail of every stored string with the tail we are looking for.
    for val in myList:
        if val[38:] == toSearch:
            return True
    return False

for i in carner_list:
    if listContains(new_list, i[38:]):
        print("found")
    else:
        new_list.append(i)
        print("not")
print(new_list)
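If you want to test it, you can start from here.
,
The part of the string you use to identify duplicates (from index 38 to the end) is not what you actually store in the list, so the in operator will not work.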
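To make the mismatch concrete, here is a minimal sketch. The two-item list `items` and the names `stored` and `suffix` are made up for illustration; only the `[38:]` offset comes from the question.

```python
# Shortened stand-in for carner_list: two entries that differ only in the numeric ID.
items = [
    '<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
]

stored = [items[0]]      # the list holds the full string
suffix = items[1][38:]   # but the lookup uses only the tail

print(suffix)                                  # Damselfly">Damselfly</a>
print(suffix in stored)                        # False: no element equals the tail
print(any(suffix == s[38:] for s in stored))   # True: tail compared with tail
```

List membership with `in` tests for equality with a whole element, so a tail never matches a full string.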
Instead, you can use a dict to store the deduplicated strings, keyed by the part of the string you care about, so that the in operator works:
new = {}
for i in carner_list:
    key = i[38:]
    if key not in new:
        new[key] = i
print(list(new.values()))
This outputs:
['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
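The same idea can be wrapped in a reusable helper. This is only a sketch: `dedupe_by_key` and its `key` parameter are hypothetical names, not part of any library, and the `[38:]` offset still assumes every ID has the same length.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def dedupe_by_key(items: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Return items in their original order, keeping the first item per key."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:   # set lookup, so each check is cheap
            seen.add(k)
            result.append(item)
    return result


carner_list = [
    '<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
    '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>',
]

unique = dedupe_by_key(carner_list, key=lambda s: s[38:])
print(len(unique))  # 3
```

Passing the key as a function keeps the deduplication logic separate from the rule that decides what counts as a duplicate.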
,
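The way you currently search checks whether the substring is equal to anything in new_list. Because it is only a substring of the stored values, it never will be.
You can use filter with a lambda to keep only the entries of the new list that contain the substring. Then convert the result to a list and check whether its length is not 0.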
len(list(filter(lambda x: i[38:] in x,new_list))) != 0
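Final code
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
new_list = []
for i in carner_list:
    # Keep the stored strings that contain the tail of i; any hit means a duplicate.
    if len(list(filter(lambda x: i[38:] in x, new_list))) != 0:
        print("found")
    else:
        new_list.append(i)
        print("not")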
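The same membership test can also be written with the built-in any(), which stops at the first match instead of building a whole list. This is a sketch under the same fixed-offset assumption (index 38); the variable name `tail` is introduced here for illustration.

```python
carner_list = [
    '<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
    '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>',
]

new_list = []
for i in carner_list:
    tail = i[38:]
    # any() short-circuits on the first stored string that contains the tail.
    if any(tail in x for x in new_list):
        continue
    new_list.append(i)

print(len(new_list))  # 3
```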
,
Use BeautifulSoup to parse the HTML, then check the link text.
For example:
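from bs4 import BeautifulSoup
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
new_list = []
check_val = set()
for i in carner_list:
    s = BeautifulSoup(i, "html.parser")
    if s.text not in check_val:  # check the visible link text
        new_list.append(i)
        check_val.add(s.text)
print(new_list)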
Output:
['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
,
Why not use a regular expression?
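import re
carner_list = ['<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>', '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>']
# Key each link by the last path segment of its href; later duplicates overwrite earlier ones.
print({re.findall(r'"([^"]*)"', x)[0].split("/")[4]: x for x in carner_list})
# Below is the output generated
'''
{'Damselfly': '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>', 'The+Isle+of+Arran': '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>', 'Mean+It+in+the+Morning': '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>'}
'''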
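Note that a dict comprehension keeps the last link seen for each key (here the one with ID 37360958). If the first occurrence should win instead, one possible variation uses dict.setdefault. The pattern `TITLE_RE` and the name `first_seen` are assumptions made for this sketch, not part of the answer above.

```python
import re

carner_list = [
    '<a href="/lyric/34808442/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/37311114/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/37360958/Loyle+Carner/Damselfly">Damselfly</a>',
    '<a href="/lyric/33661937/Loyle+Carner/The+Isle+of+Arran">The Isle of Arran</a>',
    '<a href="/lyric/33661936/Loyle+Carner/Mean+It+in+the+Morning">Mean It in the Morning</a>',
]

# Hypothetical pattern: capture the last path segment of the href value.
TITLE_RE = re.compile(r'href="[^"]*/([^"/]+)"')

first_seen = {}
for link in carner_list:
    match = TITLE_RE.search(link)
    if match is None:
        continue  # skip strings that do not look like the expected anchor tag
    # setdefault stores the link only if the key is not present yet.
    first_seen.setdefault(match.group(1), link)

print(list(first_seen))  # ['Damselfly', 'The+Isle+of+Arran', 'Mean+It+in+the+Morning']
```

Because the key comes from the URL path rather than a fixed offset, this variant does not depend on every ID having the same number of digits.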