Scraping NetEase Cloud Music Comments with Python

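Preface

As mentioned at the end of the previous post on scraping Ximalaya FM audio, this time we are scraping the hot comments and regular comments on NetEase Cloud Music. I have used NetEase Cloud Music for quite a while and like it a lot... Enough chatter; follow my line of thinking to see how to scrape its hot comments and regular comments.

Goal

The target this time: the hot comments and regular comments on NetEase Cloud Music songs.

NetEase Cloud Music has a large number of playlists, so the idea is to start from the playlists: go through the playlists, then go through the songs in each one.

Here I chose the newest playlists. A quick look showed about 100 pages of them, with 35 playlists per page.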

https://music.163.com/#/discover/playlist/?order=new

Next, let's analyze a single playlist:

https://music.163.com/#/playlist?id=2294381226

We click one of the songs in it:

https://music.163.com/#/song?id=26075485

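Since we want the comments on a song, let's use the browser's developer tools to see where those comments come from.

From experience, we find these dynamically loaded comments under XHR.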

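We can see that the response to R_SO_4_26075485?csrf_token= contains comments and hotComments, which correspond to the latest comments and the hot comments respectively.

These comments are obtained by sending a POST request to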

https://music.163.com/weapi/v1/resource/comments/R_SO_4_26075485?csrf_token=

with two parameters passed along: params and encSecKey.

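In other words, we can get the comments simply by imitating the browser and sending a POST request to the NetEase Cloud Music server!

Note the POST URL as well: the string of digits after R_SO_4_ is the id of the song. The parameters that have to be passed also need careful analysis (covered later).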
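To make the point about the song id concrete, here is a minimal sketch that builds the POST URL from an id. The helper name build_comment_url is hypothetical and not part of the original code; the URL pattern is the one observed in the developer tools.

```python
# Hypothetical helper (not from the original post): builds the comment
# endpoint for a song from its numeric id, following the URL seen under XHR.
COMMENT_API = 'https://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token='


def build_comment_url(song_id):
    """Return the comment POST URL for the given song id."""
    return COMMENT_API.format(song_id)


print(build_comment_url(26075485))
```

Keeping the pattern in one constant means only the id changes from song to song.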

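So the plan is now: find all the newest playlists -> for each playlist, go through all its songs and pull the song ids out of the page source -> for each song, POST to the server using its id (with the parameters) to get the comments we want.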

Getting to Work

Step 1

The code is as follows:

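import random
import re
import time

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}
baseUrl = 'https://music.163.com'


def getHtml(url):
    # Fetch a page and return its HTML source
    r = requests.get(url, headers=headers)
    return r.text


def getUrl():
    # Start from the newest playlists
    startUrl = 'https://music.163.com/discover/playlist/?order=new'
    html = getHtml(startUrl)
    # NOTE: the HTML tags in this pattern were lost when the post was extracted.
    # As shown it captures only (title, href), so the author fields below
    # cannot be filled until the missing capture groups are restored.
    pattern = re.compile('.*?.*?span class="s-fc4".*?title="(.*?)".*?href="(.*?)".*?', re.S)
    result = re.findall(pattern, html)
    # Total number of playlist pages (pattern lost in extraction, left empty as in the source)
    pageNum = re.findall(r'', html, re.S)[0]
    info = []
    # Collect the wanted fields for the playlists on the first page
    for i in result:
        data = {}
        data['title'] = i[0]
        url = baseUrl + i[1]
        print(url)
        data['url'] = url
        # data['author'] = i[2]               # needs the lost author groups
        # data['authorUrl'] = baseUrl + i[3]  # needs the lost author groups
        info.append(data)
        # Fetch the songs in this playlist
        getSongSheet(url)
        time.sleep(random.randint(1, 10))
        # Only the first playlist on the first page for now, hence the break
        break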

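This should be easy to follow: it pulls the playlist information out of the page source. Note, however, that a direct GET of

https://music.163.com/#/discover/playlist/?order=new

does not return the playlist information. This is one of NetEase Cloud Music's quirks: when scraping, you have to remove the # from the URL:

https://music.163.com/discover/playlist/?order=new

With that URL, the playlist information shows up in the page source.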
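The URL fix described above can be sketched as a small function. The name strip_hash_route is made up for illustration; it only assumes that the browser-style URL contains a single "/#/" segment, as in the examples in this post.

```python
def strip_hash_route(url):
    """Turn a browser-style URL containing '/#/' into the directly fetchable one."""
    # Replace only the first occurrence so the rest of the URL is untouched.
    return url.replace('/#/', '/', 1)


print(strip_hash_route('https://music.163.com/#/discover/playlist/?order=new'))
print(strip_hash_route('https://music.163.com/#/playlist?id=2294381226'))
```

A URL that has no "/#/" segment passes through unchanged, so the function is safe to apply to every link before fetching.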

Step 2

def getSongSheet(url):
    # Get the id of every song in the playlist; the id is the key to the later POST
    html = getHtml(url)
    # NOTE: the pattern was lost when the post was extracted (left empty as in
    # the source). It must capture (song id, song title) for the loop to work.
    result = re.findall(r'', html, re.S)
    result.pop()
    musicList = []
    for i in result:
        data = {}
        headers1 = {
            'Referer': 'https://music.163.com/song?id={}'.format(i[0]),
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/67.0.3396.87 Safari/537.36'
        }
        musicUrl = baseUrl + '/song?id=' + i[0]
        print(musicUrl)
        # Song URL
        data['musicUrl'] = musicUrl
        # Song title
        data['title'] = i[1]
        musicList.append(data)
        postUrl = 'https://music.163.com/weapi/v1/resource/comments/R_SO_4_{}?csrf_token='.format(i[0])
        param = {
            'params': get_params(1),
            'encSecKey': get_encSecKey()
        }
        r = requests.post(postUrl, data=param, headers=headers1)
        # Total number of comments
        total = int(r.json()['total'])
        # Base page count at 20 comments per page
        comment_TatalPage = total // 20
        print(comment_TatalPage)
        # A remainder means one extra page; an exact multiple needs none
        if total % 20 != 0:
            comment_TatalPage = comment_TatalPage + 1
        comment_data, hotComment_data = getMusicComments(comment_TatalPage, postUrl, headers1)
        # If a duplicate-ID error appears while saving, check whether only one record was scraped
        savetoMongoDB(i[1], comment_data, hotComment_data)
        print('End!')
        time.sleep(random.randint(1, 10))
        break

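The purpose of this step is to get the ids of the songs in the playlist and, for each song (that is, each id), record the song URL and the song title.

From the id we build postUrl and POST for the first page (how to POST so that the wanted data comes back is explained later) to get the total number of comments and, from it, the total number of pages.

The step then calls the function that fetches the song's comments.

There is also a check here: dividing the total number of comments by 20 per page and looking at the remainder gives the final number of comment pages. We can also see that hot comments appear only on the first page.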
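The page-count rule is easy to check in isolation. Below is a minimal sketch; the function name total_pages and the constant PAGE_SIZE are illustrative names, with 20 per page taken from the post.

```python
PAGE_SIZE = 20  # comments per page, as observed in the post


def total_pages(total_comments, page_size=PAGE_SIZE):
    """Number of pages needed to hold total_comments items."""
    pages = total_comments // page_size
    # A remainder means one more, partially filled page.
    if total_comments % page_size != 0:
        pages += 1
    return pages


for n in (0, 19, 20, 21, 45):
    print(n, total_pages(n))
```

Integer division (//) matters here: with plain / on Python 3 the result would be a float and range() would reject it.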

Step 3

def getMusicComments(comment_TatalPage, postUrl, headers1):
    commentinfo = []
    hotcommentinfo = []
    # For each page of comments
    for j in range(1, comment_TatalPage + 1):
        r = getPostApi(j, postUrl, headers1)
        result = r.json()
        # Regular comments appear on every page
        for i in result['comments']:
            com_info = {}
            com_info['content'] = i['content']
            com_info['author'] = i['user']['nickname']
            com_info['likedCount'] = i['likedCount']
            commentinfo.append(com_info)
        # Hot comments can only be fetched from the first page
        if j == 1:
            for i in result['hotComments']:
                hot_info = {}
                hot_info['content'] = i['content']
                hot_info['author'] = i['user']['nickname']
                hot_info['likedCount'] = i['likedCount']
                hotcommentinfo.append(hot_info)
        print(u'Page ' + str(j) + u' scraped...')
        time.sleep(random.randint(1, 10))
    print(commentinfo)
    print('\n-----------------------------------------------------------\n')
    print(hotcommentinfo)
    return commentinfo, hotcommentinfo

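The function takes three parameters, comment_TatalPage, postUrl and headers1: the total number of comment pages, the POST URL, and the request headers.

The first page yields both hot comments and regular comments; every other page yields regular comments only. The content, author nickname and like count of each comment are added to the lists.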
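The field extraction can be tried without touching the network. The sketch below runs the same logic over a hand-written dictionary; the sample data is made up, and only the key names (comments, hotComments, content, user, nickname, likedCount) come from the code above. The function name extract_comments is hypothetical.

```python
def extract_comments(payload, first_page):
    """Pull the three fields of interest out of one page of the JSON response."""
    def simplify(item):
        return {
            'content': item['content'],
            'author': item['user']['nickname'],
            'likedCount': item['likedCount'],
        }

    comments = [simplify(c) for c in payload.get('comments', [])]
    # Hot comments are only read from the first page.
    hot = [simplify(c) for c in payload.get('hotComments', [])] if first_page else []
    return comments, hot


# Made-up sample in the shape described in this post (not real API output).
sample = {
    'total': 2,
    'comments': [{'content': 'nice song', 'user': {'nickname': 'alice'}, 'likedCount': 3}],
    'hotComments': [{'content': 'classic', 'user': {'nickname': 'bob'}, 'likedCount': 99}],
}
comments, hot = extract_comments(sample, first_page=True)
print(comments)
print(hot)
```

Using payload.get with a default keeps the function from raising if a page comes back without one of the lists.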

Step 4

Now let's look at the headache-inducing POST part!

import base64

import requests
from Crypto.Cipher import AES

# offset is (page number - 1) * 20; total is "true" for the first page and "false" for the rest
# first_param = '{rid:"",offset:"0",total:"true",limit:"20",csrf_token:""}'  # first parameter
# Second parameter
second_param = "010001"
# Third parameter
third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
# Fourth parameter
forth_param = "0CoJUm6Qyw8W8jud"


# Build params
def get_params(page):  # page is the page number
    iv = "0102030405060708"
    first_key = forth_param
    second_key = 16 * 'F'
    if page == 1:  # first page
        first_param = '{rid:"",offset:"0",total:"true",limit:"20",csrf_token:""}'
    else:
        offset = str((page - 1) * 20)
        first_param = '{rid:"",offset:"%s",total:"%s",limit:"20",csrf_token:""}' % (offset, 'false')
    # Encrypt twice: first with the fixed key, then with second_key
    h_encText = AES_encrypt(first_param, first_key, iv)
    h_encText = AES_encrypt(h_encText, second_key, iv)
    return h_encText


# Get encSecKey (a fixed value that matches second_key above)
def get_encSecKey():
    encSecKey = "257348aecb5e556c066de214e531faadd1c55d814f9be95fd06d6bff9f4c7a41f831f6394d5a3fd2e3881736d94a02ca919d952872e7d0a50ebfa1769a7a62d512f5f1ca21aec60bc3819a9c3ffca5eca9a0dba6d6f7249b06f5965ecfff3695b54e1c28f3f624750ed39e7de08fc8493242e26dbc4484a01c76f739e135637c"
    return encSecKey


# Encryption: pad to a multiple of 16, AES-CBC encrypt, then base64-encode
def AES_encrypt(text, key, iv):
    pad = 16 - len(text) % 16
    text = text + pad * chr(pad)
    # Encoding to bytes keeps this working on Python 3, where AES requires bytes
    encryptor = AES.new(key.encode('utf-8'), AES.MODE_CBC, iv.encode('utf-8'))
    encrypt_text = encryptor.encrypt(text.encode('utf-8'))
    return base64.b64encode(encrypt_text).decode('utf-8')


# Get the JSON returned by the POST
def getPostApi(j, postUrl, headers1):
    param = {
        # params for the requested page
        'params': get_params(j),
        'encSecKey': get_encSecKey()
    }
    r = requests.post(postUrl, data=param, headers=headers1)
    return r

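The getPostApi function takes three arguments: the page number (because the params value sent with the POST differs for each page), the POST URL, and the request headers.

Here data=param carries the required parameters.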
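Why params differs per page becomes clearer when the plaintext is built on its own, before any encryption. The sketch below follows the offset and total rule from the comment at the top of the code; build_first_param is a hypothetical name, and the string layout is copied from the commented-out first_param example.

```python
PAGE_SIZE = 20  # comments per page, as in the post


def build_first_param(page):
    """Plaintext that is later encrypted into params, for a 1-based page number."""
    offset = (page - 1) * PAGE_SIZE
    # total is "true" only on the first page
    total = 'true' if page == 1 else 'false'
    return '{rid:"",offset:"%d",total:"%s",limit:"%d",csrf_token:""}' % (offset, total, PAGE_SIZE)


for page in (1, 2, 3):
    print(build_first_param(page))
```

Since only offset and total change between pages, every page produces a different ciphertext even though the keys and the IV stay the same.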

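The params value sent with the request has clearly been encrypted. Here is a brief account of how I dealt with it.

My first idea was to capture the traffic with Fiddler, but that showed nothing different. I then consulted

http://www.cnblogs.com/lyrichu/p/6635798.html

as well as the answer by the Zhihu user 平胸小仙女 that its author mentions,

https://www.zhihu.com/question/36081767/answer/140287795

and what the user 路人甲 shared in

https://www.zhihu.com/question/21471960

I downloaded core.js locally and analyzed it with Notepad++. A Notepad++ plugin that formats JavaScript is worth recommending here: https://blog.csdn.net/u011037869/article/details/47170653

Then I located the two parameters we need.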

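Next I redirected core.js in Fiddler and edited the local copy of core.js so that it would print the parameters above. The first time, the printed values showed up in the console, but after that it kept throwing errors...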

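After that came analyzing the JavaScript code. Here I simply reused an existing method for generating the parameters... (I really do need to learn JavaScript properly!)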
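One part of the parameter generation can be examined without any crypto library: the padding applied before AES and the base64 encoding applied after it. The sketch below shows only those two steps under that assumption; it does not perform AES itself, and the names pkcs7_pad and pkcs7_unpad are illustrative.

```python
import base64

BLOCK = 16  # AES block size in bytes


def pkcs7_pad(text):
    """Append N copies of chr(N) so the length becomes a multiple of BLOCK."""
    pad = BLOCK - len(text) % BLOCK
    return text + pad * chr(pad)


def pkcs7_unpad(padded):
    """Remove the padding by reading the count from the last character."""
    return padded[:-ord(padded[-1])]


padded = pkcs7_pad('abc')
print(len(padded), repr(padded[-1]))

# base64 turns arbitrary bytes into ASCII text that is safe to put in a form field.
encoded = base64.b64encode(padded.encode('utf-8')).decode('ascii')
print(encoded)
```

Note that an input whose length is already a multiple of 16 still gets a full extra block of padding, which is what lets the padding be removed unambiguously.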

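And don't think that was the end of it! Next came the most frustrating problem: after adding the import

from Crypto.Cipher import AES

an error is raised:

ImportError: No module named Crypto.Cipher

I then ran pip install Crypto, which succeeded, but this time the error was

ImportError: No module named Cipher

In the end I went through a lot of material, so here is a summary of how to solve this problem.

In most cases, after pip install Crypto you only need to rename the crypto folder under C:\Python27\Lib\site-packages to Crypto (this did not work for me).
What finally worked for me was installing pycrypto myself, following the method in
https://blog.csdn.net/teloy1989/article/details/72862108
(note that the package I installed is pycrypto).
At first I did not have Microsoft Visual C++ 9.0 installed.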

The installation failed with an error. After downloading Microsoft Visual C++ 9.0 as described in the post above, I installed pycrypto again.

Success at last! After this, from Crypto.Cipher import AES imports and runs normally.
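When chasing this kind of import error, it can help to ask Python which spelling of the package it can actually find before importing anything. The sketch below is only a diagnostic aid and assumes Python 3 (for importlib.util); the helper name has_module is made up for illustration.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with exactly this name can be found."""
    return importlib.util.find_spec(name) is not None


# Module names are case-sensitive, so 'Crypto' and 'crypto' are checked separately.
for name in ('Crypto', 'crypto'):
    print(name, has_module(name))
```

If only the lowercase name is reported as found, the installed package is not the one that provides Crypto.Cipher.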

Step 5

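import pymongo

def savetoMongoDB(musicName, comment_data, hotComment_data):
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['Music163']
    test = db[musicName]
    # insert_many rejects an empty list, so skip empty results
    if hotComment_data:
        test.insert_many(hotComment_data)
    if comment_data:
        test.insert_many(comment_data)
    print(musicName + u' saved to the database...')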

The last step is storing the data in MongoDB. If you are interested, you can also try storing it in MySQL.

if __name__ == '__main__':
    getUrl()

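Here I store everything in MongoDB in one go after the scraping is finished, which may be a heavy load. You could also try storing each page as soon as it is scraped.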
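One way the page-by-page idea might look is sketched below. It is not the code used in this post: crawl_and_store, fetch_page and store_page are hypothetical names, and the network call and database write are replaced by stand-ins so the flow can be run on its own.

```python
def crawl_and_store(total_pages, fetch_page, store_page):
    """Fetch pages 1..total_pages and hand each page to store_page right away."""
    stored = 0
    for page in range(1, total_pages + 1):
        comments = fetch_page(page)
        if comments:  # skip empty pages
            store_page(page, comments)
            stored += len(comments)
    return stored


# Stand-ins for the real network call and database write (made-up data).
fake_pages = {1: [{'content': 'a'}, {'content': 'b'}], 2: [{'content': 'c'}], 3: []}
saved = []
count = crawl_and_store(3, lambda p: fake_pages[p], lambda p, docs: saved.append((p, docs)))
print(count, [p for p, _ in saved])
```

In a real run, store_page would wrap a MongoDB insert, so memory use stays at one page of comments and an interrupted crawl keeps what was already saved.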
