UUBlog

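Preface

For extracting the content, I went back and forth between BeautifulSoup and plain regular-expression matching. BeautifulSoup is simple and clear but less flexible; regular expressions are the opposite.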

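Main text

Locating the information

For a network-disk listing, the information we want to extract is mainly the sharer ID, resource name, network-disk URL, resource size, creation time and so on. Working out where each of these lives in the page is not the focus of this post, so I assume the locations are already known and go straight to extraction, using the sharer ID, resource name and network-disk URL as the demonstration.

Take the resource 莽荒纪.zip as an example. Its URL is http://www.sobaidupan.com/file-106010793.html, and from the HTML we can get the following:

2082813876 is the site-internal ID on sobaidupan.com and also the Baidu cloud-disk user ID, which makes things easy.
The resource URL, however, only becomes known after additionally loading http://sbdp.baidudaquan.com/down.asp?id=16166237&token=301efbbe2c138d150b41b5813a3d4077.
The source of that page is as follows: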

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<div style="margin:0 auto;margin-top:10%;width:600px;border: 1px solid #ff0000;line-height:30px;padding:10px 10px 10px 10px; ">提示:亲,正在为您跳转,请稍等2秒.....
<meta http-equiv='refresh' content='2;URL=http://pan.baidu.com/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097'></div>

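The http://pan.baidu.com/share/link?shareid=3994307345&uk=2755655514&fid=45639734040097 in that source is exactly the resource we want.

In other words, to extract everything about the 莽荒纪 resource, at least two URLs have to be loaded.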

  • First load: http://www.sobaidupan.com/file-106010793.html
    This gives the resource name, the sharer ID and the site-internal address of the download page, http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176

  • Second load: http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176
    This gives the real network-disk address.

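Extracting the information

Getting the page source

The previous post covered how to fetch the page source. I put that code in a file called yzyPublic.py, so it has to be imported before it can be used.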

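import yzyPublic  # helper module from the previous post

# Fetch the resource page; get_web_source returns the decoded HTML as text.
res = yzyPublic.get_web_source('http://www.sobaidupan.com/file-106010793.html')
print(res)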

The content of res is as follows:

<!DOCTYPE html>
<html xmlns=http://www.w3.org/1999/xhtml>
<head>
<meta http-equiv=X-UA-Compatible content="IE=edge,chrome=1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="style.css" />

<title>莽荒纪.zip_zgh*****1617_百度云盘下载 - 搜百度盘</title>
<meta name="keywords" content="莽荒纪.zip" />
<meta name="description" content="小说/修真/莽荒纪.zip" />
<style type="text/css">
<!--
.f_color {
color: #FFFFFF;
font-weight: bold;
}
-->
</style>
</head>
<body>
<div class="headtop">
<div class="headtop_f"><B>搜百度盘(SoBaiduPan.com)</B>&nbsp;是基于百度云搜索,最大的百度云盘资源搜索中心,千万级大数据量,让您一网打尽所有的百度网盘资源.</div>
</div>
<div class="site_head w c">
<div class="sitelogo"><a href="/"><img src="image/logo.gif" border="0" title="SoBaiduPan.com"></a></div>
<div class="top_allsite" id="top_allsite"><ul>
<script type="text/javascript" src="top_txtad.asp"></script>
</ul></div>
</div>
<div class="menu w c">
<ul>
<li><a href="http://www.sobaidupan.com">首 页</a></li>
<li><a href="list-1-1.html">最新资源</a></li>
<li><a href="zhuan-1-1.html">影视目录</a></li>
<li><a href="zhuan-2-1.html">小说目录</a></li>
<li><a href="list-28-1.html">影视资源</a></li>
<li><a href="list-30-1.html">动漫资源</a></li>
<li><a href="list-29-1.html">小说资源</a></li>
<li><a href="zhuan-3-1.html">综合资源</a></li>
<li><a href="http://soft.sobaidupan.com" target="_blank" title="百度云下载器">云下载器</a></li>
<li><a href="m.asp" title="移动端访问">手机专版</a></li>
<li><a href="http://weipan.sobaidupan.com" title="新浪微盘资源搜索" target="_blank">新浪微盘</a></li>
<li><a href="about.asp?id=2" title="在线发布共享资源">发布资源</a></li>
<li><a href="http://bbs.sobaidupan.com" title="建议留言" target="_blank"><font color="#FFFF00">建议留言</font></a></li>
</ul>
</div>

<div class="smenu c">
<div class="smenu_nav">
<a href="list-3-1.html">torrent</a><a href="list-5-1.html">rmvb</a><a href="list-4-1.html">mp4</a><a href="list-7-1.html">mp3</a><a href="list-9-1.html">avi</a><a href="list-8-1.html">epub</a><a href="list-10-1.html">mkv</a><a href="list-11-1.html">flv</a><a href="list-12-1.html">pdf</a><a href="list-13-1.html">pps</a><a href="list-15-1.html">psd</a><a href="list-16-1.html">iso</a><a href="list-17-1.html">ghost</a><a href="list-19-1.html">exe</a><a href="list-20-1.html">txt</a><a href="list-21-1.html">apk</a><a href="list-22-1.html">ipa</a><a href="list-24-1.html">wps</a><a href="list-25-1.html">rtf</a><a href="list-26-1.html">vob</a><a href="list-13-1.html">ppt/pptx</a><a href="list-27-1.html">xls/xlsx</a><a href="list-14-1.html">doc/docx</a><a href="list-18-1.html">rar/zip</a>
</div>
</div>

<div class="search w c">
<table width="100%" height="90" border="0" align="center" cellpadding="0" cellspacing="1">
<tr>
<td>

<script type="text/javascript" src="ad/top1_580x90.js"></script>

</td>
<td>
<a href="adgo.asp?id=30" target="_blank"><img src="ad/ad2.jpg"></a>
</td>
</tr>
</table>
<div class="fgx"></div>
<form id="form1" name="form1" method="get" action="search.asp" ><img src="image/s.png" width="32" height="32" align="absmiddle">&nbsp;请您输入搜索内容:
<input name="wd" id="wd" placeholder="共108,789,857个资源,今日已更新2382..." type="text" size="30" value="" autocomplete="off" />
<input type="submit" id="Su" tabindex="2" value="网盘搜索" style="cursor:hand;">&nbsp;<img src="image/soso.gif" width="23" height="21" align="absmiddle"><a href="about.asp?id=1" target="_blank"><font color="red"><b>点击打赏本站</b></font>&nbsp;&nbsp;<a href="http://koubei.baidu.com/s/www.sobaidupan.com" target="_blank"><b>点击支持本站</b></a>&nbsp;<img src="image/new.gif" width="22" height="14" align="absmiddle">&nbsp;<a href="http://soft.sobaidupan.com" target="_blank"><font color="red"><b>百度云搜索器</b></font></a>
</form>
</div>
<script type="text/javascript" charset="gbk" src="opensug.js"></script>
<script type="text/javascript">
var txtObj = document.getElementById("alertSpan");
function show(str){
window.location.href="search.asp?r=0&wd="+encodeURIComponent(str);
}
var params = {
"XOffset":0,
"YOffset":0,
"width":204,
"fontColor":"#f70",
"fontColorHI":"#FFF",
"fontSize":"15px",
"fontFamily":"宋体",
"borderColor":"gray",
"bgcolorHI":"#03c",
"sugSubmit":false
};
BaiduSuggestion.bind("wd",params,show);
</script>

<div class="main w c">
<div class="art_bt_box w c"><ul><li><h1>莽荒纪.zip</h1></li></ul></div>
<div class="art_box">
<table border="0">
<tr>
<td width="250" valign="top" ><table width="250" border="0" cellpadding="0" cellspacing="1" bordercolor="#3E92CF" bgcolor="#3E92CF">
<tr>
<td width="250" height="119" bgcolor="#FFFFFF" ><div align="center"><a href="user-2082813876-1.html"><img src="http://himg.bdimg.com/sys/portrait/item/797c6b21.jpg" width="100" height="100" border="0"></a></div></td>
</tr>
<tr>
<td height="40" bgcolor="#FFFFFF" ><div align="center">用户名:zgh*****1617</div></td>
</tr>
<tr>
<td height="40" bgcolor="#FFFFFF" ><div align="center"><a href="user-2082813876-1.html"><img src="image/jrzy.gif" width="89" height="24" border="0"></a></div></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<script src="ad/250x250.js" type="text/javascript"></script></div>
</td>
</tr>
<tr>
<td height="35" bgcolor="#3E92CF" >&nbsp;<span class="f_color">Ta 分享的其它资源:</span></td>
</tr>
<tr>
<td height="40" bgcolor="#FFFFFF">
<ul>
<li>&nbsp;<a href="file-1266183.html" title=网游——屠龙巫师.zip>网游——屠龙巫师.zip</a></li><li>&nbsp;<a href="file-1266216.html" title=网游-梦幻现实.zip>网游-梦幻现实.zip</a></li><li>&nbsp;<a href="file-1266234.html" title=神也玩转网游.zip>神也玩转网游.zip</a></li><li>&nbsp;<a href="file-1668670.html" title=魔兽英雄.zip>魔兽英雄.zip</a></li><li>&nbsp;<a href="file-1668832.html" title=阿亚罗克年代记.zip>阿亚罗克年代记.zip</a></li><li>&nbsp;<a href="file-1668883.html" title=重生之福星道士.zip>重生之福星道士.zip</a></li><li>&nbsp;<a href="file-1668930.html" title=重生之极限风流.zip>重生之极限风流.zip</a></li><li>&nbsp;<a href="file-1669255.html" title=英雄无敌之大航海时代.zip>英雄无敌之大航海时代.zip</a></li><li>&nbsp;<a href="file-1674467.html" title=网游之霸世神偷.zip>网游之霸世神偷.zip</a></li><li>&nbsp;<a href="file-2013963.html" title=霸王怒.zip>霸王怒.zip</a></li>
</ul>
</td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<script src="ad/250x250-2.js" type="text/javascript"></script></td>
</tr>
<tr>
<td height="35" bgcolor="#3E92CF" >&nbsp;<span class="f_color">其它网友正在下载的资源:</span></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<ul>
<li>&nbsp;<a href="file-830.html" title=橄榄油 - 副本5.psd>橄榄油 - 副本5.psd</a></li><li>&nbsp;<a href="file-829.html" title=百度云管家 v4.8.0 绿色版 i2i2.cn.rar>百度云管家 v4.8.0 绿色版 i2i2.cn.rar</a></li><li>&nbsp;<a href="file-828.html" title=百度云管家 v4.8.0 单文件版 i2i2.cn.rar>百度云管家 v4.8.0 单文件版 i2i2.cn.rar</a></li><li>&nbsp;<a href="file-827.html" title=第1天上午.5.mp3>第1天上午.5.mp3</a></li><li>&nbsp;<a href="file-826.html" title=第2天下午.8.mp3>第2天下午.8.mp3</a></li><li>&nbsp;<a href="file-825.html" title=第2天上午.7.mp3>第2天上午.7.mp3</a></li><li>&nbsp;<a href="file-824.html" title=第1天下午.5.mp3>第1天下午.5.mp3</a></li><li>&nbsp;<a href="file-823.html" title=第1天上午.4.mp3>第1天上午.4.mp3</a></li><li>&nbsp;<a href="file-822.html" title=第2天下午.6.mp3>第2天下午.6.mp3</a></li><li>&nbsp;<a href="file-821.html" title=第1天下午.7.mp3>第1天下午.7.mp3</a></li>
</ul>
</td>
</tr>
</table>

</td>
<td height="61" align="left" valign="top" >
<table width="100%" border="0" align="left" cellpadding="0" cellspacing="0" bordercolor="#3E92CF" bgcolor="#3E92CF">
<tr>
<td bgcolor="#FFFFFF" >
<script type='text/javascript' src='http://m1.sobaidupan.com/fr3a1ec292ffc2f63fdb146392acb024e057e3d4002ef230ec51322bda.js'></script></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" ><div class="fgx"></div></td>
</tr>
<tr>
<td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left">&nbsp;<B>资源名称:</B>莽荒纪.zip</div></td>
</tr>
<tr>
<td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left">&nbsp;<B>资源类别:</B>小说/修真</div></td>
</tr>
<tr>
<td style="line-height: 30px" bgcolor="#FFFFFF" ><div align="left">&nbsp;<B>资源大小:</B>3.83 MB&nbsp;<b>资料扩展名:</b>.zip&nbsp;<b>访问/下载次数</b>:10/9&nbsp;<b>分享日期:</b>2016/9/5 11:13:00</div></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" ><div class="fgx"></div></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<table width="100%" border="0" align="left">
<tr>
<td width="155">

<div align="center">
<a href="http://sbdp.baidudaquan.com/down.asp?id=106010793&token=c4e0d8de4bf94fe0d86a6b4f675fe176" title="莽荒纪.zip -百度网盘下载" target="_blank"><img src="image/wpdown.gif" width="137" height="34" border="0"></a></div></td>
<td width="152" bgcolor="#FFFFFF" ><div align="center"><a href="#" onclick="javascript:alert('违法信息举报信箱:sobaidupan@126.com')"><img src="image/zaixjb.gif" width="137" height="34" border="0" title="举报资源" style="cursor:pointer" id="police" ></a></div></td>
<td width="497" bgcolor="#FFFFFF" > <div class="bdsharebuttonbox"><a href="#" class="bds_more" data-cmd="more">分享到:</a><a href="#" class="bds_qzone" data-cmd="qzone" title="分享到QQ空间">QQ空间</a><a href="#" class="bds_tieba" data-cmd="tieba" title="分享到百度贴吧">百度贴吧</a><a href="#" class="bds_weixin" data-cmd="weixin" title="分享到微信">微信</a><a href="#" class="bds_tsina" data-cmd="tsina" title="分享到新浪微博">新浪微博</a><a href="#" class="bds_douban" data-cmd="douban" title="分享到豆瓣网">豆瓣网</a></div>

</td>
</tr>
</table></td>
</tr>

<tr>
<td bgcolor="#FFFFFF" ><div class="fgx"></div>
<script src="ad/728x90_2.js" type="text/javascript"></script>
</td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<div id="hm_t_97521"></div></td>
</tr>

<tr>
<td bgcolor="#FFFFFF" ><div class="fgx"></div><div align="left">
<script src="ad/336x280.js" type="text/javascript"></script>
</div></td>
</tr>
<tr>
<td height="35" bgcolor="#3E92CF" >&nbsp;<span class="f_color">相关资源:</span></td>
</tr>
<tr>
<td height="40" bgcolor="#FFFFFF" >
<ul>
<li>&nbsp;<a href="file-12334474.html" title=仙符问道.zip>仙符问道.zip</a></li><li>&nbsp;<a href="file-12335167.html" title=随身副本闯仙界.zip>随身副本闯仙界.zip</a></li><li>&nbsp;<a href="file-12335453.html" title=齐宇问道.zip>齐宇问道.zip</a></li><li>&nbsp;<a href="file-12335876.html" title=猫行天下.zip>猫行天下.zip</a></li><li>&nbsp;<a href="file-12336124.html" title=极品修真邪少.zip>极品修真邪少.zip</a></li><li>&nbsp;<a href="file-12424570.html" title=极品丹师.zip>极品丹师.zip</a></li><li>&nbsp;<a href="file-12744895.html" title=重生之唯我独仙.zip>重生之唯我独仙.zip</a></li><li>&nbsp;<a href="file-14281154.html" title=仙缘五行.zip>仙缘五行.zip</a></li><li>&nbsp;<a href="file-15903276.html" title=与狐仙双修的日子.zip>与狐仙双修的日子.zip</a></li><li>&nbsp;<a href="file-15903375.html" title=修真之位面交易系统.zip>修真之位面交易系统.zip</a></li><li>&nbsp;<a href="file-15903925.html" title=拜师八戒.zip>拜师八戒.zip</a></li><li>&nbsp;<a href="file-15904006.html" title=重生在白蛇的世界里.zip>重生在白蛇的世界里.zip</a></li><li>&nbsp;<a href="file-15904154.html" title=巫也是道.zip>巫也是道.zip</a></li><li>&nbsp;<a href="file-15979622.html" title=僵尸问道.zip>僵尸问道.zip</a></li><li>&nbsp;<a href="file-16005591.html" title=大地之皇.zip>大地之皇.zip</a></li><li>&nbsp;<a href="file-16484435.html" title=猪八戒重生记.zip>猪八戒重生记.zip</a></li><li>&nbsp;<a href="file-16484613.html" title=至神传说.zip>至神传说.zip</a></li><li>&nbsp;<a href="file-16484713.html" title=星空战神.zip>星空战神.zip</a></li><li>&nbsp;<a href="file-16484798.html" title=现代封神榜.zip>现代封神榜.zip</a></li><li>&nbsp;<a href="file-16735997.html" title=仙侠世界之天才掌门.zip>仙侠世界之天才掌门.zip</a></li><li>&nbsp;<a href="file-16888626.html" title=物理高材修仙记.zip>物理高材修仙记.zip</a></li><li>&nbsp;<a href="file-16889125.html" title=灵枢.zip>灵枢.zip</a></li><li>&nbsp;<a href="file-17136845.html" title=极品仙君.zip>极品仙君.zip</a></li><li>&nbsp;<a href="file-17175592.html" title=将修仙进行到底.zip>将修仙进行到底.zip</a></li><li>&nbsp;<a href="file-17175765.html" title=合成修仙传.zip>合成修仙传.zip</a></li><li>&nbsp;<a href="file-17257619.html" title=我做许仙的日子.zip>我做许仙的日子.zip</a></li><li>&nbsp;<a href="file-17349180.html" 
title=少年武仙在都市.zip>少年武仙在都市.zip</a></li><li>&nbsp;<a href="file-17349336.html" title=超级修仙之旅.zip>超级修仙之旅.zip</a></li><li>&nbsp;<a href="file-17349557.html" title=娇美仙妻爱上我.zip>娇美仙妻爱上我.zip</a></li><li>&nbsp;<a href="file-18057326.html" title=极品仙商.zip>极品仙商.zip</a></li>
</ul></td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<div class="fgx"></div>
<!-- UJian Button BEGIN -->
<div class="ujian-hook"></div>
<script type="text/javascript">var ujian_config = {num:16,target:1,picSize:72,textHeight:45,hoverTextColor:'#FA1B02'};</script>
<script type="text/javascript" src="http://v1.ujian.cc/code/ujian.js?uid=2087333"></script>
<a href="http://www.ujian.cc" style="border:0;"><img src="http://img.ujian.cc/pixel.png" alt="友荐云推荐" style="border:0;padding:0;margin:0;" /></a>
<!-- UJian Button END -->
</td>
</tr>
<tr>
<td bgcolor="#FFFFFF" >
<div class="fgx"></div>

</td>
</tr>
<tr>
<td height="40" bgcolor="#3E92CF" >&nbsp;<span class="f_color">相关说明:</span></td>
</tr>
<tr>
<td height="40" bgcolor="#FFFFFF" ><div class="art_foot">莽荒纪.zip为搜百度盘收集整理的结果,下载地址直接跳转到百度网盘进行下载,该文件的安全性和完整性需要您自行判断。感谢您对本站的支持.</div> </td>
</tr>
<tr>
<td height="80" bgcolor="#FFFFFF" >
&nbsp;上一个:<a href="file-106010792.html" title="netplan.zip">netplan.zip</a>
<div class="fgx"></div>
&nbsp;下一个:<a href="file-106010794.html" title="斗战西游.zip">斗战西游.zip</a> </td>
</tr>

</table></td>
<td width="200" align="left" valign="top" >
<script src="ad/200x200.js" type="text/javascript"></script>

<div class="art_left_bt"><img src="image/hot.gif" width="22" height="11">&nbsp;您可能需要的资源:</div>

<ul>
<li>&nbsp;<a href="file-23821718.html" title=重生之婚后试爱.txt>重生之婚后试爱.txt</a></li><li>&nbsp;<a href="file-23827473.html" title=时光,浓淡相宜.txt>时光,浓淡相宜.txt</a></li><li>&nbsp;<a href="file-25264047.html" title=[书包网]亲爱的爱情(重生演艺圈).txt>[书包网]亲爱的爱情(重生演艺圈).txt</a></li><li>&nbsp;<a href="file-25650524.html" title=[古装言情]《二货娘子》作者:雾矢翊(晋江VIP2014-03-17完结)金牌高积分.txt>[古装言情]《二货娘子》作者:雾矢翊(晋江VIP2014-03-17完结)金牌高积分.txt</a></li><li>&nbsp;<a href="file-25651309.html" title=系统之宠妃.txt>系统之宠妃.txt</a></li><li>&nbsp;<a href="file-25651440.html" title=后宫翻身记(重生) .txt>后宫翻身记(重生) .txt</a></li><li>&nbsp;<a href="file-29456136.html" title=重生之汤圆儿.txt>重生之汤圆儿.txt</a></li><li>&nbsp;<a href="file-29456254.html" title=《重生之换我疼你》作者:森中一小妖.txt>《重生之换我疼你》作者:森中一小妖.txt</a></li><li>&nbsp;<a href="file-29717792.html" title=《宠妃》作者:月非娆.txt>《宠妃》作者:月非娆.txt</a></li><li>&nbsp;<a href="file-30877984.html" title=[网游]舍我娶谁.txt>[网游]舍我娶谁.txt</a></li>
</ul>
<script src="ad/160x600.js" type="text/javascript"></script>
</td>
</tr>

</table>
</div>
</div>

<script charset='gbk' src='http://p.tanx.com/ex?i=mm_113468001_12740314_57802967'></script>
<div class="cl"></div>
<div class="fgx"></div>
<div class="foot">
<p><img src="image/wj.png" width="36" height="43" align="absmiddle">&nbsp;&nbsp;搜百度盘(<a href="http://www.sobaidupan.com" title="搜百度盘">www.sobaidupan.com</a>&nbsp;2015-2018 All Rights Reserved&nbsp;<a href="zhaoshang.asp" title="广告合作及投放">广告合作</a>&nbsp;<a href="about.asp" title="关于本站">关于本站</a> &nbsp;QQ群:<a href="http://jq.qq.com/?_wv=1027&k=a2uzxT" target="_blank">385379281</a></p>
<p>本站仅提供百度网盘资源搜索和百度网盘资源下载的网站,本站只抓取百度网盘的链接而不保存任何资源. <script>
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "//hm.baidu.com/hm.js?f9d133598d63eabee77f59430aefa2ab";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
</script>
<script type="text/javascript">var cnzz_protocol = (("https:" == document.location.protocol) ? " https://" : " http://");document.write(unescape("%3Cspan id='cnzz_stat_icon_1254604262'%3E%3C/span%3E%3Cscript src='" + cnzz_protocol + "s11.cnzz.com/stat.php%3Fid%3D1254604262' type='text/javascript'%3E%3C/script%3E"));</script> <a href="setxml.asp">sitemap.xml</a>
</p>
<p>本站所有资源均来自互联网,本站只负责技术收集和整理,均不承担任何法律责任,如有侵权违规等其它行为请联系我们. <img src="image/e.jpg" width="163" height="20" align="absmiddle"></p>
</div>
<br />

<script>window._bd_share_config={"common":{"bdSnsKey":{},"bdText":"","bdMini":"2","bdMiniList":["mshare","qzone","tsina","bdysc","weixin","tieba","douban","sqq","qq","hi","baidu","share189","fx","mail","copy"],"bdPic":"","bdStyle":"0","bdSize":"16"},"share":{"bdSize":16}};with(document)0[(getElementsByTagName('head')[0]||body).appendChild(createElement('script')).src='http://bdimg.share.baidu.com/static/api/js/share.js?v=89860593.js?cdnversion='+~(-new Date()/36e5)];</script>
</body>
</html>
<script src="count.asp?id=106010793" type="text/javascript"></script>

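Extracting the user ID, resource name and network-disk URL

After thinking it over for a long while, I decided to use BeautifulSoup and re regular expressions together to extract the information.
Personally I lean towards using regular expressions alone; nearly all the other scrapers I have written do the extraction that way. BeautifulSoup is added here for the sake of learning it.

Import the relevant modules, BeautifulSoup and re: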

from bs4 import BeautifulSoup
import re

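Extracting the title

On this site the title always sits inside the h1 tag, so it can be extracted like this:

# Parse the page source fetched earlier with Python's built-in HTML parser.
soup = BeautifulSoup(res, "html.parser")
print(soup.h1.text)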

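res is the page source fetched earlier. 'html.parser' names the parser to use, which can be thought of as telling BeautifulSoup what language the page is written in. lxml is another commonly used parser.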

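Extracting the UID

For the UID I used a regular expression, which seemed simpler. With BeautifulSoup I would still need a regular expression, so both approaches are shown below.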

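  • Method 1: match directly with a regular expression

    # Raw string so that \d and \. reach the regex engine unchanged.
    uid = re.search(r'user-(\d*)-1\.html', res)
    print(uid.group(1))

  • Method 2: BeautifulSoup plus a regular expression to find the matching href attribute

    uid2 = soup.find(href=re.compile(r'user-\d*-1\.html'))['href']
    print(uid2.split('-')[1])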
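Both methods above raise an exception when the page contains no profile link, because re.search and soup.find return None in that case. A minimal sketch of a None-safe variant of Method 1, assuming Python 3; the names UID_PATTERN and extract_uid are made up for illustration, and the pattern uses \d+ instead of \d* so that an empty ID is not accepted:

```python
import re

# Same idea as Method 1: the numeric ID inside "user-<id>-1.html".
# \d+ (one or more digits) is an assumption; the post itself uses \d*.
UID_PATTERN = re.compile(r'user-(\d+)-1\.html')


def extract_uid(html):
    """Return the sharer ID as a string, or None when no profile link is found."""
    match = UID_PATTERN.search(html)
    return match.group(1) if match else None


sample = '<a href="user-2082813876-1.html"><img src="image/jrzy.gif"></a>'
print(extract_uid(sample))
print(extract_uid('<p>no profile link here</p>'))
```

Returning None leaves it to the caller to decide whether a missing ID means "skip this page" or "stop and investigate".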

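Extracting the network-disk URL

As mentioned earlier in this post, this takes two steps: first extract the site-internal download address, load its source, and then extract the Baidu network-disk address from that.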

Extracting the site-internal download URL

# Lazy match up to the closing quote of the href attribute.
rurl = re.search(r'href="(http://sbdp\.baidudaquan\.com/down\.asp\?id=.+?)"', res)

print(rurl.group(1))

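Extracting the Baidu network-disk address

# Load the intermediate download page, then read the target of its meta refresh.
dres = yzyPublic.get_web_source(rurl.group(1))
purl = re.search(r"URL=(http://pan\.baidu\.com/share/link\?shareid=.+?)'", dres)
print(purl.group(1))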
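Once the share link has been captured, its query string can be split into fields with the standard library instead of more regular expressions. This is a rough sketch assuming Python 3 and the meta-refresh layout shown earlier; parse_share_link and REFRESH_PATTERN are illustrative names, not part of yzyPublic:

```python
import re
from urllib.parse import urlparse, parse_qs

# Capture the redirect target up to the closing quote of the content attribute.
REFRESH_PATTERN = re.compile(r"URL=(http://pan\.baidu\.com/share/link\?[^'\"]+)")


def parse_share_link(redirect_html):
    """Return the share URL plus its shareid and uk fields, or None if absent."""
    match = REFRESH_PATTERN.search(redirect_html)
    if match is None:
        return None
    url = match.group(1)
    query = parse_qs(urlparse(url).query)  # each value is a list of strings
    return {
        'url': url,
        'shareid': query.get('shareid', [None])[0],
        'uk': query.get('uk', [None])[0],
    }


sample = ("<meta http-equiv='refresh' content='2;URL=http://pan.baidu.com/share/link"
          "?shareid=3994307345&uk=2755655514&fid=45639734040097'></div>")
info = parse_share_link(sample)
print(info['url'])
print(info['shareid'], info['uk'])
```

Keeping the fields separate would make it easier to store them in individual database columns later.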

Wrapping it in a function for reuse

Do this however suits your own habits; I won't go into detail.
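For readers who want a starting point, here is one possible shape, offered as a sketch rather than the way this scraper actually does it. It assumes Python 3, swaps the BeautifulSoup title lookup for a regular expression so that it needs only the standard library, and takes the page loader as a parameter (fetch is a hypothetical argument; yzyPublic.get_web_source could be passed in for real use):

```python
import re


def extract_resource(page_html, fetch):
    """Collect title, sharer ID and pan URL; `fetch` maps a URL to its HTML."""
    title = re.search(r'<h1>(.*?)</h1>', page_html, re.S)
    uid = re.search(r'user-(\d+)-1\.html', page_html)
    inner = re.search(
        r'href="(http://sbdp\.baidudaquan\.com/down\.asp\?id=.+?)"', page_html)

    pan_url = None
    if inner:
        # Second load: the intermediate page carries the real address.
        redirect_html = fetch(inner.group(1))
        pan = re.search(
            r"URL=(http://pan\.baidu\.com/share/link\?shareid=.+?)'", redirect_html)
        pan_url = pan.group(1) if pan else None

    return {
        'title': title.group(1).strip() if title else None,
        'uid': uid.group(1) if uid else None,
        'pan_url': pan_url,
    }


# Offline demonstration with made-up pages instead of real network requests.
def fake_fetch(url):
    return ("<meta http-equiv='refresh' "
            "content='2;URL=http://pan.baidu.com/share/link?shareid=1&uk=2'>")


page = ('<h1>demo.zip</h1><a href="user-42-1.html">u</a>'
        '<a href="http://sbdp.baidudaquan.com/down.asp?id=7&token=abc">d</a>')
result = extract_resource(page, fake_fetch)
print(result)
```

Passing the loader in as an argument keeps the extraction logic testable without touching the network.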

Follow the WeChat official account: 尹安灿

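Preface

With nothing to do after the Chinese New Year holiday, I wanted to learn some Python and, thinking practically, decided to learn by building something. After a good deal of thought, a scraper seemed the best choice.
The goal is to scrape the content of www.sobaidupan.com into a database. Since I am a beginner and there is a lot I don't understand, everything is kept simple: getting it working comes first, performance second.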

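Main text

To scrape anything, the first step is to get the page source. The urllib and requests modules are the most widely used for this, and judging by its API, requests is the friendlier of the two, so I chose requests. It is a third-party module, however, so it has to be installed first.

Installing the requests module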

pip install requests

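Getting the page source

Import the requests module and call its get method. If the HTTP methods get, post, put, delete and so on are unfamiliar, look up the HTTP protocol.
In short, fetching page content nearly always uses get, while submitting information mostly uses post. Note that this is "nearly always", not "always".
Below is a snippet that fetches the source of the www.baidu.com home page. It is almost absurdly easy to use.

Fetching the source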

import requests

# Send a GET request and print the body decoded with whatever encoding requests guessed.
res = requests.get('http://www.baidu.com')
print(res.text)

The result is as follows:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

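Fetching and decoding the source

We have the source now, but the Chinese text has turned into mojibake. The page is encoded in UTF-8, so the encoding has to be specified explicitly; only then does the program know which encoding to decode with, and only correct decoding yields the content we want. The code therefore becomes: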

import requests

res = requests.get('http://www.baidu.com')
# Tell requests to decode the body as UTF-8 before reading res.text.
res.encoding = 'utf-8'
print(res.text)

The result is as follows:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

Now the Chinese text finally displays correctly.
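The garbled title in the first result is what UTF-8 bytes look like when they are decoded with a single-byte encoding. As far as I know, requests falls back to ISO-8859-1 for text responses whose headers declare no charset, which would explain it, though I have not checked what Baidu's server actually sent. The effect can be reproduced offline; a small sketch assuming Python 3:

```python
title = '百度一下,你就知道'

raw = title.encode('utf-8')        # the bytes the server sends
garbled = raw.decode('latin-1')    # wrong guess: every byte becomes one character
print(garbled == title)            # False

# Undo the wrong decoding, then decode with the right encoding.
recovered = garbled.encode('latin-1').decode('utf-8')
print(recovered == title)          # True
```

Setting res.encoding before reading res.text avoids the wrong guess in the first place, so no round trip like this is needed.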

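Wrapping it in a function

To make it reusable, I wrapped it in a function, say get_web_source, so that different URLs and encodings can be passed in as arguments and the source is fetched correctly. This is what I ended up with: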

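import requests

# Fetch a page and decode it with the given encoding (UTF-8 by default).
def get_web_source(url, encode="utf-8"):
    res = requests.get(url)
    res.encoding = encode
    return res.text

# Test the function by printing the source. The second argument defaults to UTF-8, so it is omitted here; pass it for GBK, GB2312 or other encodings.
print(get_web_source('http://www.baidu.com'))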

The result is as follows:

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

The result is correct. Done!
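The second argument of get_web_source has to be chosen by hand. When a server does declare its charset in the Content-Type header, that value could be used instead. The helper below is only a sketch under that assumption: charset_from_content_type is a made-up name, Python 3 is assumed, and it relies on the standard library's email.message.Message to parse the header.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Return the lower-cased charset from a Content-Type value, else `default`."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset() or default


print(charset_from_content_type('text/html; charset=GBK'))
print(charset_from_content_type('text/html'))
```

The returned name could then be passed as the encode argument, with UTF-8 kept as the fallback for pages that declare nothing.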


Given $f(\sqrt{x}+1)=x+2\sqrt{x}$, find the expression for $f(x)$.

  1. Substitution: let $t=\sqrt{x}+1$, so that $x=(t-1)^2$ with $t\geq 1$. Then $f(t)=(t-1)^2+2(t-1)=t^2-1$ $(t\geq 1)$, and therefore $f(x)=x^2-1$ $(x\geq 1)$.
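As a cross-check on the substitution result (a sketch, not part of the original solution), the right-hand side can also be rewritten by completing the square:

$$x+2\sqrt{x}=(\sqrt{x}+1)^2-1 \quad\Longrightarrow\quad f(u)=u^2-1,\qquad u=\sqrt{x}+1\geq 1,$$

which agrees with $f(x)=x^2-1$ for $x\geq 1$.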

Sources

Markdown formula guide (Markdown 公式指导手册)

Special Symbols

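Preface

This is more or less the first publication of the lyrics to It's You. Back in 2010 or 2011 I asked its author, Sodeep Lama, for them. I once posted them on my QQ Zone, but access to it was later restricted, so hardly anyone could see them and the point of sharing was lost. Recently I got hooked on NetEase Cloud Music and wanted to contribute something. I happened to notice that this song had no lyrics there, so I put together a copy and uploaded it. It passed review just yesterday.

Lyrics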

[ti:It'S You]
[ar:Pari B]
[lr:Pari B]
[by:YinYongYou]
[00:03.24]Girl I'm just thinkin' about you
[00:06.30]fantasizin', memorizin' moments you spended with me (yeah)
[00:12.10]Girl I know thing can change
[00:15.10]about me lately got me thinking
[00:18.10]Nomatter what we're meant to be
[00:19.30]You like my superstar( Your name on my guitar )
[00:24.10]You're my celebrity( Your picture on my wall )
[00:29.40]Just the way you are ( I love it how you are )
[00:33.20]You and me ,meant to be, you should see
[00:38.01]Girl it's you you you you you
[00:42.10]Who changed my heart turn old into new
[00:46.05]Girl it's you you you you
[00:50.00]Who changed my life love story true
[00:50.00]Hold up for a second how can I thank god
[00:59.00]sending me this angel from above
[01:03.30]You're so so perfect to crash in my life
[01:08.10]Shawty you appear every night
[01:11.10]Your body's calling me
[01:13.40]Don't worry I'll be there girl
[01:16.40]I'm all yours shawty you should know
[01:20.10]Don't be scared girl I ain't gonna hurt you
[01:25.10]Gently , let me lead you
[01:29.01]Girl it's you you you you you
[01:33.10]Who changed my heart turn old into new
[01:37.05]Girl it's you you you you
[01:42.00]Who changed my life love story true
[01:47.00]Shawty you're my dime love suffocate
[01:49.00]But I'll give you my time
[01:51.00]Yah girl let me intrest you
[01:53.00]We equal better math when me plus you
[01:55.10]Let me hold you girl I need your presence here
[01:57.10]from this whole wide world
[01:59.00]I can live without money anything
[02:02.00]But how I can live without you
[02:04.10]Girl I know I make mistakes
[02:06.10]and I'll make you mine whateva it takes
[02:07.10]and it hurts you the most but it hurts me too
[02:09.10]When you are mad and sad baby give me a clue so
[02:11.40]Let's forget about the past you and me
[02:12.40]Let's make this last fasten
[02:14.10]Seat belt girl you ready ready ready
[02:20.00]Roger that
[02:21.01]Girl it's you you you you you
[02:24.10]Who changed my heart turn old into new
[02:29.05]Girl it's you you you you
[02:32.00]Who changed my life love story true
[02:37.01]Girl it's you you you you you
[02:42.10]Who changed my heart turn old into new
[02:46.05]Girl it's you you you you
[02:50.00]Who changed my life love story true

Thanks

Sodeep Lama, for typing out the lyrics
郑晓昕, for helping to tidy them up
