Scraping a Python tutorial with Python and turning it into a PDF

Source: Python [illegible]

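Turning the tutorial into a PDF takes three main steps:

1. Scrape the HTML of every tutorial chapter, put each chapter into its own div, and combine all chapters into one HTML file (BeautifulSoup).

2. Convert the HTML to PDF (wkhtmltopdf).

3. The site's anti-scraping measures are fairly good, so the scraping process also needs proxy IPs (free or paid).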
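The first step can be sketched as follows. This is a minimal illustration, not the original author's code: the `merge_chapters` helper and the sample chapter data are hypothetical, and the chapter fragments are assumed to have been downloaded already.

```python
from bs4 import BeautifulSoup


def merge_chapters(chapters):
    """Wrap each chapter's HTML fragment in its own <div> inside one page."""
    page = BeautifulSoup(
        "<html><head><meta charset='utf-8'></head><body></body></html>",
        "html.parser",
    )
    for title, fragment in chapters:
        div = page.new_tag("div")
        heading = page.new_tag("h1")
        heading.string = title
        div.append(heading)
        # Parse the fragment on its own, then move its nodes into the div.
        parsed = BeautifulSoup(fragment, "html.parser")
        for node in list(parsed.contents):
            div.append(node)
        page.body.append(div)
    return str(page)


# Hypothetical sample data standing in for scraped chapters.
chapters = [("Intro", "<p>Hello</p>"), ("Install", "<p>Run the installer</p>")]
merged_html = merge_chapters(chapters)
```

The resulting string can be written to a single `.html` file and handed to the PDF converter in step 2.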

Getting started

Pass a document to the BeautifulSoup constructor to get an object representing that document. The input can be a string or an open file handle.

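First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.

Then Beautiful Soup picks the most suitable parser to parse the document. If a parser is specified manually, Beautiful Soup uses that parser instead.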
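A short sketch of both ways of constructing the object. The sample markup is made up for illustration, and `io.StringIO` stands in for a real opened file; `html.parser` is the parser bundled with Python.

```python
import io

from bs4 import BeautifulSoup

html = "<html><head><title>Sacr&eacute; bleu</title></head><body><p>Hi</p></body></html>"

# From a string, naming the parser explicitly.
soup_from_string = BeautifulSoup(html, "html.parser")

# From a file-like object; with a real file this would be open("index.html").
soup_from_file = BeautifulSoup(io.StringIO(html), "html.parser")

# The &eacute; entity has been converted to a Unicode character.
title_text = soup_from_string.title.string
```

Naming the parser explicitly keeps the result the same on every machine, since the automatic choice depends on which parsers are installed.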

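Object types

Beautiful Soup converts a complex HTML document into a complex tree structure. Every node is a Python object, and all objects fall into 4 kinds: Tag, NavigableString, BeautifulSoup, Comment.

Tag: put simply, one of the tags in the HTML, such as div or p.

NavigableString: the text inside a tag, for example soup.p.string.

BeautifulSoup: represents the entire content of a document.

Comment: a Comment object is a special type of NavigableString whose output does not include the comment markers.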
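The four kinds can be observed directly. The snippet below is a small sketch on made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text</p><b><!--a comment--></b>", "html.parser")

tag = soup.p              # Tag
text = soup.p.string      # NavigableString
comment = soup.b.string   # Comment, a NavigableString subclass

kinds = (
    type(soup).__name__,
    type(tag).__name__,
    type(text).__name__,
    type(comment).__name__,
)
```

Because a Comment is also a NavigableString, code that extracts text may want to check `isinstance(node, Comment)` before treating `.string` as visible page text.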

Tag

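A Tag is one tag in the HTML. BeautifulSoup can extract the content of a Tag with the form soup.name, where name is a tag in the HTML. For example:

print(soup.title) prints the content of the title tag, including the tag itself.

print(soup.head) prints the content of the head tag.

If several tags match when accessing a Tag this way, only the first matching tag is returned.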
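A runnable sketch of this behavior. The sample document is an assumption modeled on the attribute values quoted in this article:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# The tag itself is part of the result, not only its text.
title_markup = str(soup.title)

# Only the first <p> is returned, although the document contains two.
first_p = soup.p
```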

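Tag attributes

Every Tag has two important attributes, name and attrs:

name: the name of a Tag is the tag name itself, so soup.p.name is p.

attrs: a dictionary of attribute-value pairs. For example, print(soup.p.attrs) prints {'class': ['title'], 'name': 'dromouse'}. A single value can be read as well: print(soup.p.attrs['class']) prints ['title'], which is a list because one attribute may hold several values. The get method also returns an attribute, for example print(soup.p.get('class')), and so does direct indexing with print(soup.p['class']).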
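The different ways of reading attributes can be compared side by side. This sketch assumes a `<p>` tag carrying the same attributes as quoted above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>',
    "html.parser",
)
p = soup.p

tag_name = p.name             # the tag name, not the HTML "name" attribute
attributes = p.attrs          # dictionary of all attributes
class_by_index = p["class"]   # raises KeyError if the attribute is missing
class_by_get = p.get("class")
missing = p.get("id")         # get() returns None instead of raising
```

Note that `p.name` is the tag name, while the HTML attribute called `name` is reached with `p["name"]`.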

get

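The get method returns the value of an attribute on a tag. It is an important method that is useful in many situations. For example, to get the image URL from an img tag, use soup.img.get('src'). For example:

    # Get the class attribute of the first p tag
    print(soup.p.get("class"))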
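A sketch of the image case mentioned above. The markup and the base URL `https://example.com/wiki/` are made up; resolving relative paths with `urljoin` is an addition here, on the assumption that images should still load when the saved HTML is rendered elsewhere.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://example.com/wiki/"  # hypothetical page address

soup = BeautifulSoup(
    '<div><img src="/static/logo.png" alt="logo"><img alt="no source"></div>',
    "html.parser",
)

# get() returns None for a tag without the attribute instead of raising.
sources = [img.get("src") for img in soup.find_all("img")]

# Turn relative paths into absolute URLs, skipping images without src.
absolute = [urljoin(BASE_URL, src) for src in sources if src]
```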

string

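Returns the text content of a tag. Content is returned only when the tag has no child tags or exactly one child tag; otherwise the result is None. For example:

    # The p tag in the text above has no child tags, so its text is returned correctly
    print(soup.p.string)
    # This returns None, because the html tag contains many child tags
    print(soup.html.string)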

get_text()

Returns all the text inside a tag, including the text of its descendant nodes. This is the most commonly used method.
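The difference between string and get_text() shows up on nested markup. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>one</p><p>two <b>bold</b></p><i><u>deep</u></i></div>",
    "html.parser",
)

only_text = soup.p.string          # tag with a single text child
single_child = soup.i.string       # exactly one child tag: its text is returned
several_children = soup.div.string # more than one child: None

all_text = soup.div.get_text()                   # text of all descendants
spaced_text = soup.div.get_text(" ", strip=True) # separator and whitespace stripping
```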

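Searching the document tree

BeautifulSoup is mainly used to traverse child nodes and their attributes. Accessing a tag as an attribute, such as soup.p, returns only the first matching tag in the document. To get all the p tags, or to fetch more than one tag by name, use find_all().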

find_all(name, attrs, recursive, text, **kwargs )

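find_all searches for all nodes that match the filter conditions.

name parameter: the name of a Tag, such as p, div, title.

    # 1. Tag name
    print(soup.find_all('p'))
    # 2. Regular expression (requires import re)
    print(soup.find_all(re.compile('^p')))
    # 3. List
    print(soup.find_all(['p', 'a']))

The attrs parameter can also be used as a filter condition, and the limit parameter caps the number of results returned.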
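The filters described above can be tried on a small document. The markup below is a made-up sample:

```python
import re

from bs4 import BeautifulSoup

html = """<p class="title"><b>Title</b></p>
<a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister" id="link2" href="/lacie">Lacie</a>
<a class="sister" id="link3" href="/tillie">Tillie</a>"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                                  # every <a> tag
by_regex = [t.name for t in soup.find_all(re.compile("^p"))]  # tag names starting with p
by_list = [t.name for t in soup.find_all(["p", "b"])]         # either name
by_attrs = soup.find_all("a", attrs={"id": "link2"})          # attribute filter
limited = soup.find_all("a", limit=2)                         # stop after two matches
```

When only one result is needed, `soup.find(...)` takes the same filters and returns the first match or None.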

CSS selectors

Tags can also be found using CSS syntax as the matching rule. This uses the select() function, which returns a list. It is used as follows:

    # 1. By tag name
    print(soup.select('head'))
    # 2. By id
    print(soup.select('#link1'))
    # 3. By class
    print(soup.select('.sister'))
    # 4. By attribute
    print(soup.select('p[name=dromouse]'))
    # 5. Combined selectors
    print(soup.select("body p"))
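To see what each selector matches, the same five queries can be run against a small document. The markup is an assumed sample built around the ids, classes, and attributes used in the selectors:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>T</title></head><body>
<p class="title" name="dromouse"><b>T</b></p>
<p class="story"><a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister" id="link2" href="/lacie">Lacie</a></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Number of matches per selector.
counts = {
    selector: len(soup.select(selector))
    for selector in ["head", "#link1", ".sister", "p[name=dromouse]", "body p"]
}

# select_one() returns the first match directly, or None.
first_link = soup.select_one("a.sister")
```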

wkhtmltopdf

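wkhtmltopdf is mainly used to generate PDFs from HTML.

pdfkit is a Python wrapper around wkhtmltopdf. It supports converting URLs, local files, and text content to PDF, and in the end it still calls the wkhtmltopdf command.

Installation

Install wkhtmltopdf first, then install pdfkit.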

https://wkhtmltopdf.org/downloads.html

pdfkit

    pip3 install pdfkit
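Since pdfkit only drives the wkhtmltopdf command, it can help to confirm that the executable is reachable before converting anything. The helpers below are a hypothetical sketch; `pdfkit.configuration(wkhtmltopdf=...)` is the pdfkit call for pointing at a specific binary.

```python
import shutil


def find_wkhtmltopdf():
    """Return the path of the wkhtmltopdf executable, or None if not on PATH."""
    return shutil.which("wkhtmltopdf")


def make_config():
    """Build a pdfkit configuration bound to the executable that was found."""
    path = find_wkhtmltopdf()
    if path is None:
        raise RuntimeError("wkhtmltopdf is not installed or not on PATH")
    import pdfkit  # imported lazily so the check above works without pdfkit

    return pdfkit.configuration(wkhtmltopdf=path)


wkhtmltopdf_path = find_wkhtmltopdf()
```

The returned object can then be passed along as `pdfkit.from_file('index.html', 'out.pdf', configuration=make_config())`.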

Converting a url/file/string

    import pdfkit

    pdfkit.from_url('http://google.com', 'out.pdf')
    pdfkit.from_file('index.html', 'out.pdf')
    pdfkit.from_string('Hello!', 'out.pdf')

Converting a list of urls or files

    pdfkit.from_url(['google.com', 'baidu.com'], 'out.pdf')
    pdfkit.from_file(['file1.html', 'file2.html'], 'out.pdf')

Converting an opened file

    with open('file.html') as f:
        pdfkit.from_file(f, 'out.pdf')

Custom options
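The original settings are not preserved here, so the following is only an assumed example of how options are passed. pdfkit accepts a dictionary whose keys are wkhtmltopdf command-line options without the leading dashes; the specific values and the `html_to_pdf` helper are illustrative.

```python
# Keys mirror wkhtmltopdf command-line flags; the values are illustrative.
PDF_OPTIONS = {
    "page-size": "A4",
    "margin-top": "0.75in",
    "margin-right": "0.75in",
    "margin-bottom": "0.75in",
    "margin-left": "0.75in",
    "encoding": "UTF-8",
}


def html_to_pdf(html_path, pdf_path, options=None):
    """Convert a local HTML file to PDF with the given wkhtmltopdf options."""
    import pdfkit  # requires pdfkit and the wkhtmltopdf executable

    return pdfkit.from_file(html_path, pdf_path, options=options or PDF_OPTIONS)
```

Setting `encoding` to UTF-8 is a common choice for pages that contain non-ASCII text.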

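Using proxy IPs

After scraping a dozen or so tutorial chapters, the site's anti-scraping measures kicked in.

The site's anti-scraping is evidently done well, so the only option was to use proxy IPs. Between free and paid proxies I chose a paid one; its response speed and stability seemed OK.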

Results

[Screenshot: the script while running]

[Screenshot: the generated PDF]

[Code listing: not preserved in the source]

Published: 2019-01-11 10:54:02
Original link: https://kuaibao.qq.com/s/20190111A0EKPU00?refer=cp_1026