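A Small Vertical Crawler

Today's subject is a small vertical crawler that scrapes the 工大要闻 (university headline news) section of my university's site (I love my university). It still has plenty of shortcomings, but with it in hand the whole Internet feels a little more transparent.

This little vertical crawler is implemented as follows.

The idea is simple. Starting from the main function, here is the overall flow. Step 1: collect the URLs to crawl. For the container I chose ConcurrentLinkedQueue, a non-blocking queue built on CAS operations through Unsafe; what I want from it is thread safety.

The main function: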

static String url = "http://www.qlu.edu.cn/38/list.htm";

// Add the URL tasks: list1.htm through list19.htm
public static ConcurrentLinkedQueue<String> add(ConcurrentLinkedQueue<String> queue) {
    String prefix = StringUtils.substringBefore(url, ".htm");
    for (int i = 1; i <= 19; i++) {
        queue.add(prefix + i + ".htm");
    }
    return queue;
}

public static void main(String[] args) throws IOException {
    ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    queue.add(url);
    add(queue);
    // Download and parse on multiple threads
    TPoolForDownLoadRootUrl.downLoadRootTaskPool(queue);
}
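
To make the thread-safety point concrete, here is a minimal sketch that uses only the JDK. The class and method names (QueueDrainSketch, seedQueue, drain) are made up for illustration and are not part of the crawler. Several workers poll the same ConcurrentLinkedQueue until it is empty, with no explicit locking.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class QueueDrainSketch {

    // Seeds the queue like the main function above: list.htm plus list1.htm .. listN.htm.
    static ConcurrentLinkedQueue<String> seedQueue(String firstPage, int pages) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add(firstPage);
        String prefix = firstPage.substring(0, firstPage.lastIndexOf(".htm"));
        for (int i = 1; i <= pages; i++) {
            queue.add(prefix + i + ".htm");
        }
        return queue;
    }

    // Lets several workers poll the queue until it is empty; returns the URLs they handled.
    static Set<String> drain(ConcurrentLinkedQueue<String> queue, int workers)
            throws InterruptedException {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                String url;
                // poll() atomically removes the head and returns null once the queue is empty.
                while ((url = queue.poll()) != null) {
                    seen.add(url); // a real worker would download and parse here
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<String> queue = seedQueue("http://www.qlu.edu.cn/38/list.htm", 19);
        Set<String> seen = drain(queue, 4);
        System.out.println(seen.size() + " URLs handled, queue empty: " + queue.isEmpty());
    }
}
```

Because poll() hands each element to exactly one caller, the workers never process the same URL twice, and a null return is the natural signal to stop.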

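Step 2: hand the URL list to a thread pool:

The pool I use is newCachedThreadPool, which creates threads on demand according to the tasks submitted and reuses idle ones.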
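
To see what "on demand" means in practice, the sketch below builds a ThreadPoolExecutor with the configuration that the Executors.newCachedThreadPool() documentation describes (no core threads, an unbounded maximum, idle threads kept for 60 seconds, a SynchronousQueue hand-off) and counts how many threads it creates when every task is busy at once. CachedPoolSketch and threadsCreatedFor are hypothetical names used only for this illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CachedPoolSketch {

    // Submits `tasks` tasks that all block at once and reports how many threads were created.
    static int threadsCreatedFor(int tasks) throws InterruptedException {
        // Equivalent configuration to Executors.newCachedThreadPool().
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch allStarted = new CountDownLatch(tasks);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                allStarted.countDown();
                try {
                    release.await(); // stay busy so this thread cannot be reused
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        allStarted.await();                      // every task is now running
        int created = pool.getLargestPoolSize(); // one thread per concurrently busy task
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return created;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads created for 20 busy tasks: " + threadsCreatedFor(20));
    }
}
```

With 20 list pages this is harmless, but the thread count grows with the number of simultaneously busy tasks, which is one reason a fixed-size pool can be the gentler choice when crawling someone else's server.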

Inside the pool, each task does several things. First, it downloads the source HTML:

/**
 * Business logic for downloading HTML
 * @Author: Changwu
 * @Date: 2019/3/24 11:13
 */
public class downLoadHtml {
    public static Logger logger = Logger.getLogger(downLoadHtml.class);

    /**
     * Download the page source for the given URL
     * @param url the page to fetch
     * @return the response body decoded as UTF-8
     */
    public static String downLoadHtmlByUrl(String url) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        // Set a browser-like User-Agent request header
        httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

        // try-with-resources closes the client and the response, so connections are not leaked
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(httpGet)) {
            logger.info("Request " + url + " returned status " + response.getStatusLine().getStatusCode());
            HttpEntity entity = response.getEntity();
            return EntityUtils.toString(entity, "utf-8");
        }
    }
}

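Next, it parses the root (list) page to get the URL of each news page, because that is where the article body lives. A RootBean is populated as the parsing goes: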

/**
 * Parse the list-page HTML, wrap each entry in a first-level bean, and return the list
 *
 * @param sourceHtml HTML of one list page
 * @return one RootBean per news entry
 */
public static List<RootBean> getRootBeanList(String sourceHtml) {
    LinkedList<RootBean> rootBeanList = new LinkedList<>();
    Document doc = Jsoup.parse(sourceHtml);
    Elements elements = doc.select("#wp_news_w6 ul li");
    String rootUrl = "http://www.qlu.edu.cn";

    for (Element element : elements) {
        RootBean rootBean = new RootBean();
        // Read the relative link; it is joined with rootUrl below
        String href = element.child(0).child(0).attr("href");
        // Read the visible text of the entry: the title, then the post date
        String title = element.text();
        String[] split = title.split("\\s+");
        System.out.println(title);

        if (split.length >= 2) {
            // Cut the post time out of the news_meta span
            String s = element.outerHtml();
            String regex = "class=\"news_meta\">.*";
            Pattern compile = Pattern.compile(regex);
            Matcher matcher = compile.matcher(s);
            if (matcher.find()) {
                String group = matcher.group(0);
                // Skip the 18 characters of class="news_meta">
                String ss = StringUtils.substring(group, 18);
                ss = StringUtils.substringBefore(ss, "</span> </li>");
                rootBean.setPostTime(ss);
            }
        }

        // Populate the bean
        rootBean.setTitle(split[0]);
        rootBean.setUrl(rootUrl + href);

        rootBeanList.add(rootBean);
    }
    return rootBeanList;
}
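
The URL-joining step can be made a little more forgiving. Concatenating rootUrl + href works as long as every href is site-relative and starts with a slash. As a hedged alternative, the JDK's java.net.URI can resolve the link against the URL of the page it was found on, which also copes with absolute and page-relative links. The class name LinkResolveSketch and the example paths are invented for illustration; they are not taken from the real site.

```java
import java.net.URI;

class LinkResolveSketch {

    // Resolves an href found on pageUrl into an absolute URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String listPage = "http://www.qlu.edu.cn/38/list.htm";
        // Site-relative link (hypothetical path): keeps scheme and host, replaces the path.
        System.out.println(resolve(listPage, "/2019/0324/page.htm"));
        // Page-relative link: resolved against the directory of the list page.
        System.out.println(resolve(listPage, "detail.htm"));
        // Absolute link: returned unchanged.
        System.out.println(resolve(listPage, "http://news.example.com/a.htm"));
    }
}
```

Note that URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, an href containing spaces), so a real crawler would want to catch that and skip the entry.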

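The second-level tasks are handled in a similar way. This part uses regular expressions, which I never studied properly, so I was completely lost at first, but I slowly got there. The work is mostly a matter of studying the source HTML, picking jsoup selectors that fit its structure, then cutting and joining strings until the wanted content comes out, and finally populating the bean.

Why call it a vertical crawler? Because it only works for my university's news. As the code below shows, there is no way around piecing strings together. The worst part: out of 100 news items, 99 keep the title inside one element, and there is always one that puts it inside a different element. When that happens, the rule I just wrote has to be changed.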

/**
 * Parse a news page and populate the second-level beans
 *
 * @param htmlSouce HTML of one news page
 * @param bean      the first-level bean that linked to this page
 * @return the parsed beans
 */
public static List<PojoBean> getPojoBeanByHtmlSource(String htmlSouce, RootBean bean) {

    LinkedList<PojoBean> list = new LinkedList<>();

    // Parse
    Document doc = Jsoup.parse(htmlSouce);

    // Metadata line: author, source, editor
    Elements elements1 = doc.select(".arti_metas");

    for (Element element : elements1) {
        // Create a fresh bean per element; reusing one instance would add the same object repeatedly
        PojoBean pojoBean = new PojoBean();

        String text = element.text();

        // Editor
        String regex = "(责任编辑:.*)";
        Pattern compile = Pattern.compile(regex);
        Matcher matcher = compile.matcher(text);
        String editor = null;
        if (matcher.find()) {
            editor = matcher.group(1);
            // Drop the 5-character label
            editor = StringUtils.substring(editor, 5);
        }

        // Author
        regex = "(作者:.*出处)";
        compile = Pattern.compile(regex);
        matcher = compile.matcher(text);
        String author = null;
        if (matcher.find()) {
            author = matcher.group(1);
            // Drop the 3-character label and everything from the next label on
            author = StringUtils.substring(author, 3);
            author = StringUtils.substringBefore(author, "出处");
        }

        // Source
        regex = "(出处:.*责任编辑)";
        compile = Pattern.compile(regex);
        matcher = compile.matcher(text);
        String source = null;
        if (matcher.find()) {
            source = matcher.group(1);
            source = StringUtils.substring(source, 3);
            source = StringUtils.substringBefore(source, "责任编辑");
        }

        // Body; first() returns null when the selector matches nothing
        Elements EBody = doc.select(".wp_articlecontent");
        Element firstBody = EBody.first();
        String body = firstBody == null ? null : firstBody.text();

        // Populate the bean
        pojoBean.setAuthor(author);
        pojoBean.setBody(body);
        pojoBean.setEditor(editor);
        pojoBean.setSource(source);
        pojoBean.setUrl(bean.getUrl());
        pojoBean.setPostTime(bean.getPostTime());
        pojoBean.setTitle(bean.getTitle());
        list.add(pojoBean);
    }
    return list;
}
}
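
The match-then-substring pattern above can be condensed with capture groups. The following is a sketch under assumptions, not the crawler's actual code: it assumes the metadata line always lists the three labels in the order author, source, editor, and the sample names in main are invented. The class name MetaRegexSketch is hypothetical. The character class [::] accepts both the ASCII and the full-width colon, since either may appear in page text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRegexSketch {

    // Three capture groups: author, source, editor. (.*?) is lazy so it stops at the next label.
    private static final Pattern META = Pattern.compile(
            "作者[::]\\s*(.*?)\\s*出处[::]\\s*(.*?)\\s*责任编辑[::]\\s*(.*)");

    // Returns {author, source, editor}, or null when the line does not have the expected shape.
    static String[] parseMeta(String text) {
        Matcher m = META.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3).trim() };
    }

    public static void main(String[] args) {
        // Invented sample line; the real pages may carry extra fields.
        String[] meta = parseMeta("作者:张三 出处:宣传部 责任编辑:李四");
        System.out.println("author=" + meta[0] + ", source=" + meta[1] + ", editor=" + meta[2]);
    }
}
```

One pattern compiled once replaces three compile calls per page, and the groups remove the need to count label lengths by hand. The trade-off is that a page with a missing or reordered label yields no match at all, so the caller has to handle null.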

Persistence uses plain low-level JDBC:

/**
 * Persist a single pojo
 * @param pojo the bean to insert
 */
public static void insertOnePojo(PojoBean pojo) throws ClassNotFoundException, SQLException {
    // Register the driver
    Class.forName("com.mysql.jdbc.Driver");
    String sql = "insert into qluspider (title,url,post_time,insert_time,author,source,editor,body) values (?,?,?,?,?,?,?,?)";
    // try-with-resources closes the statement and the connection even if the insert fails
    try (Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/spider", "root", "root");
         PreparedStatement ps = connection.prepareStatement(sql)) {
        // Fill in the SQL parameters
        ps.setString(1, pojo.getTitle());
        ps.setString(2, pojo.getUrl());
        // Convert the date string to a timestamp
        ps.setTimestamp(3, new java.sql.Timestamp(SpiderUtil.stringToDate(pojo.getPostTime()).getTime()));
        ps.setTimestamp(4, new java.sql.Timestamp(new Date().getTime()));
        ps.setString(5, pojo.getAuthor());
        ps.setString(6, pojo.getSource());
        ps.setString(7, pojo.getEditor());
        ps.setString(8, pojo.getBody());

        ps.execute();
    }
}
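
SpiderUtil.stringToDate is not shown in this post. As a rough idea of what such a helper might do, here is a sketch that turns a date string into a java.sql.Timestamp using java.time. It assumes the post time looks like yyyy-MM-dd, which is a guess about the site's format, and the names PostTimeSketch and toTimestamp are made up. It returns null for a missing or unparseable value instead of throwing, which matters because the post time is only set when the list-page regex matched.

```java
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class PostTimeSketch {

    // Assumed format of the scraped date, e.g. 2019-03-24.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // Converts a scraped date string to a Timestamp at midnight, or null if it cannot be parsed.
    static Timestamp toTimestamp(String postTime) {
        if (postTime == null || postTime.trim().isEmpty()) {
            return null;
        }
        try {
            LocalDate day = LocalDate.parse(postTime.trim(), DAY);
            return Timestamp.valueOf(day.atStartOfDay());
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toTimestamp("2019-03-24"));
        System.out.println(toTimestamp("not a date"));
    }
}
```

PreparedStatement.setTimestamp accepts a null Timestamp and writes SQL NULL, so a helper shaped like this would let an entry without a date still be stored, provided the post_time column allows nulls.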

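The URLs extracted from the list pages are what I call second-level tasks. The thread pool ties the steps together: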

public static Logger logger = Logger.getLogger(TPoolForDownLoadRootUrl.class);

/**
 * Thread pool that downloads and parses the root URLs
 */
public static void downLoadRootTaskPool(ConcurrentLinkedQueue<String> queue) {
    ExecutorService executor = Executors.newCachedThreadPool();
    //ExecutorService executor = Executors.newFixedThreadPool(5);
    // Read the size once: workers poll the queue while this loop is still submitting,
    // so re-reading queue.size() on every iteration could submit too few tasks
    int taskCount = queue.size();
    for (int i = 1; i <= taskCount; i++) {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    logger.info("Pool 1 started; about to download and parse a root task");
                    // Take one root task URL
                    String url = queue.poll();

                    logger.info("root URL==" + url);
                    if (StringUtils.isNotBlank(url)) {
                        // Download the list page for this URL
                        String sourceHtml = downLoadHtml.downLoadHtmlByUrl(url);
                        // Parse every RootBean out of the list page
                        List<RootBean> rootBeanList = parseHtmlByJsoup.getRootBeanList(sourceHtml);
                        // Second-level tasks start here
                        for (RootBean rootBean : rootBeanList) {
                            logger.info(this + " entering second-level task");
                            String subUrl = rootBean.getUrl();
                            // Download the news page
                            String htmlSouce = downLoadHtml.downLoadHtmlByUrl(subUrl);
                            // Parse and populate the beans
                            List<PojoBean> pojoList = parseHtmlByJsoup.getPojoBeanByHtmlSource(htmlSouce, rootBean);
                            // Persist
                            logger.info(this + " persisting second-level task from " + subUrl);
                            Persistence.insertPojoListToDB(pojoList);
                            logger.info("Persistence finished.......");
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
    }
    // Stop accepting new tasks so the JVM can exit once the submitted work finishes
    executor.shutdown();
}
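
One detail is worth care when submitting one task per queued URL: if the loop bound re-reads queue.size() while the queue is already being polled, the bound shrinks as the counter grows. The sketch below shows the extreme case deterministically on a single thread, where every iteration polls immediately. With real worker threads the shortfall depends on timing, so treat this as an illustration of the hazard rather than a measurement of the crawler. LoopBoundSketch and its method names are hypothetical.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

class LoopBoundSketch {

    // Builds a queue holding the numbers 1..n.
    static ConcurrentLinkedQueue<Integer> numbers(int n) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 1; i <= n; i++) {
            queue.add(i);
        }
        return queue;
    }

    // Re-reads size() on every iteration while elements are being removed.
    static int pollWithLiveSize(ConcurrentLinkedQueue<Integer> queue) {
        int polled = 0;
        for (int i = 1; i <= queue.size(); i++) {
            queue.poll();
            polled++;
        }
        return polled;
    }

    // Reads size() once before the loop starts.
    static int pollWithSnapshot(ConcurrentLinkedQueue<Integer> queue) {
        int total = queue.size();
        int polled = 0;
        for (int i = 1; i <= total; i++) {
            queue.poll();
            polled++;
        }
        return polled;
    }

    public static void main(String[] args) {
        System.out.println("live bound polled " + pollWithLiveSize(numbers(20)));   // 10
        System.out.println("snapshot polled " + pollWithSnapshot(numbers(20)));     // 20
    }
}
```

With a live bound only half of the 20 entries are taken, because the counter and the shrinking size meet in the middle. ConcurrentLinkedQueue.size() also walks the whole queue on each call, so reading it once is cheaper as well as safer.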