Scraping Novels from 笔趣阁 (Biquge) with Python

Source: 企鹅号 - Miku无形

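Today we will build a crawler for novels hosted on 笔趣阁.

Scraping novels from 笔趣阁 is fairly easy, as long as we leave the site's search feature out of scope.

We will use requests and XPath (through lxml) to build this small crawler.

First, the imports: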

```python
import requests
import time
from lxml import etree
```

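Next, we write two helper functions: one runs an XPath query against a page, the other requests a page.

Wrapping these steps in functions keeps the rest of the program short and readable.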

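```python
def get_tag(response, tag):
    """Parse an HTML string and return the results of the XPath query `tag`."""
    html = etree.HTML(response)
    ret = html.xpath(tag)
    return ret


def parse_url(url):
    """Download `url` and return its body as text."""
    response = requests.get(url)
    # The pages are treated as GBK-encoded; set this before reading .text
    response.encoding = 'gbk'
    return response.text
```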
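
To see why `parse_url` sets the encoding by hand, here is a minimal standalone sketch. It assumes the server really sends GBK bytes, as the code above implies; `decode_page` is a hypothetical helper used only for this illustration and is not part of the crawler.

```python
def decode_page(raw: bytes, encoding: str = 'gbk') -> str:
    """Decode raw page bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors='replace')


# Stand-in for the bytes a GBK page would deliver (assumption for the demo)
raw = '笔趣阁'.encode('gbk')

print(decode_page(raw))                      # decoded with the right codec
print(decode_page(raw, 'utf-8') == '笔趣阁')  # a wrong codec does not round-trip
```

If a page contains characters outside GBK, `gb18030` is a superset that may be worth trying in place of `'gbk'`.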

We start crawling from the novel's table-of-contents page, so the first step is to collect the URLs of all chapters listed there.

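```python
def find_url(response):
    """Collect every chapter URL from the table-of-contents HTML."""
    # url_list and start_url are module-level globals (see the full listing)
    chapter = get_tag(response, '//*[@id="list"]/dl/dd/a/@href')
    # print(chapter)
    for i in chapter:
        # The hrefs are treated as relative to the table-of-contents URL
        url_list.append(start_url + i)
        # url_list.append('https://www.52bqg.com/book_187/' + i)
    # print(url_list)
```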
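
Plain string concatenation only works while every `href` is relative to the table-of-contents URL. A possible alternative, sketched here with the standard library's `urljoin`, also copes with root-relative and absolute links. `build_chapter_urls` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin


def build_chapter_urls(base_url, hrefs):
    """Resolve each href against the table-of-contents URL."""
    return [urljoin(base_url, href) for href in hrefs]


base = 'https://www.52bqg.com/book_187/'
# Hypothetical hrefs: one relative, one root-relative
for url in build_chapter_urls(base, ['1.html', '/book_187/2.html']):
    print(url)
```

Inside `find_url`, this would amount to replacing `start_url + i` with `urljoin(start_url, i)`.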

Then we visit each chapter URL and extract the chapter's title and text.

```python
def find_content(url):
    """Download one chapter and append it to the novel's text file."""
    response = parse_url(url)
    chapter = get_tag(response, '//*[@id="box_con"]/div[2]/h1/text()')[0]
    content = get_tag(response, '//*[@id="content"]/text()')
    print('Fetching', chapter)
    # title is a global set in main(); 'at' appends, so earlier chapters are kept
    with open('{}.txt'.format(title), 'at', encoding='utf-8') as j:
        j.write(chapter)
        for i in content:
            if i == '\r\n':  # skip text nodes that hold only a line break
                continue
            j.write(i)
    print(chapter, 'saved')
    time.sleep(2)  # pause between requests to go easy on the server
```
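
The loop above writes each text node as-is and only skips nodes equal to `'\r\n'`. If the chapter text turns out to carry non-breaking-space padding or stray whitespace (an assumption about the page markup, not something the article states), a small cleaning step along these lines may help. `clean_lines` and the sample nodes are hypothetical.

```python
def clean_lines(text_nodes):
    """Trim whitespace (including non-breaking spaces) and drop empty nodes."""
    lines = []
    for node in text_nodes:
        line = node.replace('\xa0', ' ').strip()
        if line:
            lines.append(line)
    return lines


# Made-up text nodes resembling what an XPath text() query might return
sample = ['\r\n', '\xa0\xa0\xa0\xa0第一段。', '\r\n', '  第二段。\r\n']
print('\n'.join(clean_lines(sample)))
```

Writing `'\n'.join(clean_lines(content))` to the file would then put each paragraph on its own line.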

Finally, the main function ties everything together:

```python
def main():
    global title, start_url
    # start_url = 'https://www.52bqg.com/book_187/'
    start_url = input("Enter the novel's table-of-contents URL (e.g. https://www.52bqg.com/book_187/): ")
    response = parse_url(start_url)
    # print(response)
    # The novel's title doubles as the output file name
    title = get_tag(response, '//*[@id="info"]/h1/text()')[0]
    # print(title)
    find_url(response)
    # print(1)
    for url in url_list:
        find_content(url)
```
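
Because the scraped title is used directly as a file name, a title containing characters such as `/`, `:` or `?` could make `open()` fail on some systems. The following is a minimal sketch of one way to guard against that; `safe_filename` and its fallback name `'untitled'` are assumptions introduced here, not part of the original crawler.

```python
import re


def safe_filename(title, replacement='_'):
    """Replace characters that common file systems reject in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    return cleaned or 'untitled'


print(safe_filename('Book: Part 1?'))
```

In `find_content`, the file could then be opened as `open('{}.txt'.format(safe_filename(title)), ...)`.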

The complete code:

```python
import requests
import time
from lxml import etree

# Chapter URLs collected from the table-of-contents page
url_list = []


def get_tag(response, tag):
    """Parse an HTML string and return the results of the XPath query `tag`."""
    html = etree.HTML(response)
    ret = html.xpath(tag)
    return ret


def parse_url(url):
    """Download `url` and return its body as text."""
    response = requests.get(url)
    # The pages are treated as GBK-encoded; set this before reading .text
    response.encoding = 'gbk'
    return response.text


def find_url(response):
    """Collect every chapter URL from the table-of-contents HTML."""
    chapter = get_tag(response, '//*[@id="list"]/dl/dd/a/@href')
    # print(chapter)
    for i in chapter:
        # The hrefs are treated as relative to the table-of-contents URL
        url_list.append(start_url + i)
        # url_list.append('https://www.52bqg.com/book_187/' + i)
    # print(url_list)


def find_content(url):
    """Download one chapter and append it to the novel's text file."""
    response = parse_url(url)
    chapter = get_tag(response, '//*[@id="box_con"]/div[2]/h1/text()')[0]
    content = get_tag(response, '//*[@id="content"]/text()')
    print('Fetching', chapter)
    # title is a global set in main(); 'at' appends, so earlier chapters are kept
    with open('{}.txt'.format(title), 'at', encoding='utf-8') as j:
        j.write(chapter)
        for i in content:
            if i == '\r\n':  # skip text nodes that hold only a line break
                continue
            j.write(i)
    print(chapter, 'saved')
    time.sleep(2)  # pause between requests to go easy on the server


def main():
    global title, start_url
    # start_url = 'https://www.52bqg.com/book_187/'
    start_url = input("Enter the novel's table-of-contents URL (e.g. https://www.52bqg.com/book_187/): ")
    response = parse_url(start_url)
    # print(response)
    # The novel's title doubles as the output file name
    title = get_tag(response, '//*[@id="info"]/h1/text()')[0]
    # print(title)
    find_url(response)
    # print(1)
    for url in url_list:
        find_content(url)


if __name__ == '__main__':
    main()
```
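
As written, the listing stops at the first exception, so one failed request ends the whole crawl. One way to make a long run more forgiving is a small retry wrapper, sketched below. `retry` and `flaky` are hypothetical names, and the linear back-off is just one possible policy.

```python
import time


def retry(func, attempts=3, delay=2.0, exceptions=(Exception,)):
    """Call func() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer after each failure


# Demo with a stand-in that fails twice before succeeding
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated network failure')
    return 'ok'


print(retry(flaky, delay=0))
```

In `main`, the loop body could become `retry(lambda: find_content(url), exceptions=(requests.RequestException,))`, so that only network errors are retried.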

[Figure: screenshot of the result]

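As you can see, a crawler for 笔趣阁 novels is fairly easy to write.

In the next article, we will add a GUI to this crawler.

Let's learn Python together. Beginners are welcome: send me a private message for guidance and shared tutorials.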

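Published: 2020-12-31 16:32:47
Original link: https://kuaibao.qq.com/s/20201231A0AZ1700?refer=cp_1026
腾讯云开发者社区 (Tencent Cloud Developer Community) is one of the distribution channels for accounts on 腾讯内容开放平台 (企鹅号); this content is republished under the 《腾讯内容开放平台服务协议》.
In case of infringement, contact cloudcommunity@tencent.com for removal.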
