1 Review
- io
  - File
    - Represents files and directories and the operations on them
  - FileInputStream/FileOutputStream
    - File streams
  - ObjectInputStream/ObjectOutputStream
    - Object serialization
    - The class of a serialized object must implement Serializable
    - writeObject()
    - readObject()
  - InputStreamReader/OutputStreamWriter
    - Encoding conversion streams
    - Java (in memory) - Unicode
    - UTF-8
    - GBK
  - text - BufferedReader, PrintWriter
  - properties - Properties
  - xml - DOM4J
  - json - Jackson
  - yaml - Jackson
- Threads
Creating a thread
- Extend Thread
- Implement Runnable
Methods
- Thread.currentThread()
- Thread.sleep()
- Thread.yield()
- getName(), setName()
- start()
- interrupt()
- join()
- setDaemon(true)
- setPriority(priority)
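Synchronization: synchronized
- Threads take turns running the guarded code, so shared data is not corrupted
- synchronized (object) { } acquires the lock of the given object
- synchronized void f() { } acquires the lock of the current instance (this)
- static synchronized void f() { } acquires the lock of the "class object"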
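The three forms listed above can be compared side by side. The following is a minimal sketch, not code from the course: the class name SynchronizedForms and its counters are made up for illustration. Each counter is protected by a different lock (an explicit lock object, this, and the class object), and each still ends up with the exact total.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class SynchronizedForms {
    private final Object lock = new Object();
    private int a;
    private int b;
    private static int c;

    void incrementA() {
        synchronized (lock) { // acquires the lock of the given object
            a++;
        }
    }

    synchronized void incrementB() { // acquires the lock of this
        b++;
    }

    static synchronized void incrementC() { // acquires the lock of SynchronizedForms.class
        c++;
    }

    static synchronized void resetC() {
        c = 0;
    }

    static synchronized int readC() {
        return c;
    }

    // Starts the given number of threads; each one increments all three counters.
    static int[] runDemo(int threads, int perThread) {
        resetC();
        SynchronizedForms s = new SynchronizedForms();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    s.incrementA();
                    s.incrementB();
                    incrementC();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // join() also makes the workers' writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new int[] { s.a, s.b, readC() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runDemo(4, 10_000))); // [40000, 40000, 40000]
    }
}
```

Removing any one of the synchronized keywords would let the matching counter lose updates under contention.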
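Producer-consumer model
- A collection in the middle passes data from producers to consumers
- Decouples the two sides
Waiting and notification
- wait()
- notify()
- notifyAll()
- Must be called inside synchronized
- The object you wait on and notify must be the object whose lock is held
- wait() should always sit inside a loop that re-checks the condition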
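The producer-consumer and wait/notify points above fit together in one small example. This is a sketch under assumptions: the Buffer class, its capacity of 2, and the transfer method are invented here for illustration. It shows a shared collection in the middle, wait() inside a while loop, and wait()/notifyAll() called on the same object whose lock is held.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
class WaitNotifyDemo {

    // A bounded buffer: the collection that sits between producer and consumer.
    static class Buffer {
        private final LinkedList<Integer> items = new LinkedList<>();
        private final int capacity;

        Buffer(int capacity) {
            this.capacity = capacity;
        }

        synchronized void put(int value) throws InterruptedException {
            while (items.size() == capacity) { // loop: re-check after every wake-up
                wait();                        // releases the lock of this while waiting
            }
            items.addLast(value);
            notifyAll();                       // wake up threads waiting on this
        }

        synchronized int take() throws InterruptedException {
            while (items.isEmpty()) {
                wait();
            }
            int value = items.removeFirst();
            notifyAll();
            return value;
        }
    }

    // One producer sends 0..count-1, one consumer collects them in order.
    static List<Integer> transfer(int count) {
        Buffer buffer = new Buffer(2);
        List<Integer> received = new ArrayList<>();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    buffer.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    received.add(buffer.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(transfer(5)); // [0, 1, 2, 3, 4]
    }
}
```

The while loop matters because a thread can wake up without its condition being true, for example when notifyAll() wakes a producer although the buffer is still full.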
Lock
- Optimistic locking (lock acquisition is attempted with CAS internally)
- Lock
- ReentrantLock
- ReentrantReadWriteLock
Tools for creating and controlling threads
- Thread pools: ExecutorService/Executors
  - Executors.newFixedThreadPool(5)
  - Executors.newCachedThreadPool()
  - Executors.newSingleThreadExecutor()
  - pool.execute(runnableTask)
- Callable/Future
  - Future future = pool.submit(callableTask);
    Object r = future.get();
- ThreadLocal
  - Binds data to a thread
  - Treat the thread as a pipeline: upstream code puts data in, downstream code reads it
  - threadLocal.set(data)
  - threadLocal.get()
  - threadLocal.remove()
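The thread pool and Callable/Future entries above can be combined as follows. This is a minimal sketch: the class PoolDemo and the squares method are hypothetical, chosen only to give the tasks something to compute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; the names are illustrative only.
class PoolDemo {

    // Computes 1*1, 2*2, ..., n*n on a fixed pool of 5 threads.
    static List<Integer> squares(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final int x = i;
                Callable<Integer> task = () -> x * x; // a task that returns a value
                futures.add(pool.submit(task));       // submit() returns immediately
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());            // get() blocks until the task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // without this the pool threads keep the JVM alive
        }
    }

    public static void main(String[] args) {
        System.out.println(squares(4)); // [1, 4, 9, 16]
    }
}
```

Results come back in submission order because the futures are read in the order they were stored, regardless of which task finishes first.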
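2 Day 18: Hands-on: crawling JD.com
<mark>First, set up the Jsoup library</mark>:
- Jsoup 1.11.3 on its own: https://download.csdn.net/download/LawssssCat/11997750
- All commonly used libraries: https://download.csdn.net/download/LawssssCat/11996025
2.1 The HTTP protocol
HTTP request data sent to the server: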
GET / HTTP/1.1
Host: www.tedu.cn
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Data returned by the server:
HTTP/1.1 200 OK
Date: Tue, 24 Sep 2019 15:30:45 GMT
Content-Type: text/html
Content-Length: 275688
Connection: keep-alive
Server: tarena
Last-Modified: Tue, 24 Sep 2019 01:14:40 GMT
ETag: "5d896e00-434e8"
Accept-Ranges: bytes
Age: 7092
X-Via: 1.1 PShbsjzsxqo180:5 (Cdn Cache Server V2.0), 1.1 PSjlbswt4dm34:3 (Cdn Cache Server V2.0), 1.1 bdwt64:8 (Cdn Cache Server V2.0)
<!DOCTYPE html>
......
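The response shown above has a fixed shape: a status line, header lines of the form name: value, an empty line, and then the body. As a rough sketch of how that head could be taken apart by hand (the class HttpHeadParser and its methods are hypothetical, and real HTTP parsing has more cases than this handles):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; a simplification, not a full HTTP parser.
class HttpHeadParser {

    // "HTTP/1.1 200 OK" -> 200
    static int statusCode(String statusLine) {
        String[] parts = statusLine.trim().split(" ", 3);
        return Integer.parseInt(parts[1]);
    }

    // Reads the header lines that follow the status line, stopping at the empty line.
    static Map<String, String> headers(String response) {
        Map<String, String> map = new LinkedHashMap<>();
        String[] lines = response.split("\r?\n");
        for (int i = 1; i < lines.length; i++) { // index 0 is the status line
            String line = lines[i];
            if (line.isEmpty()) {
                break; // the empty line separates the headers from the body
            }
            int colon = line.indexOf(':'); // split at the first colon only
            if (colon > 0) {
                map.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return map;
    }

    public static void main(String[] args) {
        String response = "HTTP/1.1 200 OK\r\n"
                + "Date: Tue, 24 Sep 2019 15:30:45 GMT\r\n"
                + "Content-Type: text/html\r\n"
                + "Content-Length: 275688\r\n"
                + "\r\n"
                + "<!DOCTYPE html>";
        System.out.println(statusCode(response.split("\r?\n")[0])); // 200
        System.out.println(headers(response).get("Content-Type"));  // text/html
    }
}
```

Splitting at the first colon keeps values such as the Date header intact, since the time itself contains colons. Header names are case-insensitive in HTTP, which this sketch does not account for.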
Jsoup test:
@Test
public void test1() throws Exception {
    // www.tarena.cn
    String body = Jsoup.connect("http://www.jd.cn").execute().body();
    System.out.println(body);
}
Or use traditional, hand-written socket I/O:
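package cn.edut.com.tarena;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class Test0 {
    public static void main(String[] args) throws Exception {
        /* Connect */
        String host = "item.jd.com";
        int port = 80;
        // try-with-resources closes the socket when we are done
        try (Socket socket = new Socket(host, port)) {
            System.out.println("Connected - " + host + ":" + port);
            // Stop waiting if the server sends nothing for 5 seconds
            socket.setSoTimeout(5000);
            /* Send the request: HTTP lines end with CRLF, and an empty line ends the headers */
            String http = "GET / HTTP/1.1\r\n" +
                    "Host: " + host + "\r\n" +
                    // "close" asks the server to close the connection after the response,
                    // so readLine() eventually returns null instead of blocking
                    "Connection: close\r\n" +
                    "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36\r\n" +
                    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" +
                    "Accept-Language: zh-CN,zh;q=0.9\r\n\r\n";
            OutputStream out = socket.getOutputStream();
            out.write(http.getBytes(StandardCharsets.UTF_8));
            out.flush();
            System.out.println("HTTP request sent ...");
            /* Receive the data */
            System.out.println("\nReceived data:");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            try {
                String line;
                // readLine() returns null once the server has closed the connection
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (SocketTimeoutException e) {
                // No data arrived within the timeout: treat the response as finished
            }
            System.out.println("---- finished receiving data ----");
        }
    }
}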
[Figure: console output]
[Figure: the same response as rendered by a browser]
2.2 HTML and CSS
Structure of CSS
<html>
  <head>
    <style>
      div {
        ...
      }
      #id1 {
        font-size: 50px
      }
      .c1 {
        ....
      }
      div.c0 .c1 {
        ...
      }
    </style>
  </head>
  <body>
    <div id="id1">
      <a href="http://www.tedu.cn">Click to visit Tarena</a>
    </div>
    <div class="c0">
      <div class="c1">xxx</div>
      <div class="c1">xxx</div>
    </div>
    <div>
      <div class="c1">xxx</div>
      <div class="c1">xxx</div>
    </div>
  </body>
</html>
How Jsoup names the parts of an HTML document
 * DOM tree
 * / -------------------------------- Document type
 *  |- <html> ----------------------- Element type
 *      |- <head> ------------------- Element
 *      |- <body> ------------------- Element
 *          |- <div> ---------------- Element
 *              |- class="c1" ------- Attribute type
 *          |- <div> ---------------- ...
 *              |- <div> ------------ ...
 *                  |- <p>
 *          |- <div>
 *          |- <div>
 *          |- <div>
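2.3 The crawler
Jsoup is a third-party open-source API that makes it easy to send HTTP requests, handle the responses, and extract the content you need from HTML.
Step 1 Get the title
Analyzing the page structure
Code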
/**
 * DOM tree
 * / -------------------------------- Document type
 *  |- <html> ----------------------- Element type
 *      |- <head> ------------------- Element
 *      |- <body> ------------------- Element
 *          |- <div> ---------------- Element
 *              |- class="c1" ------- Attribute type
 *          |- <div> ---------------- ...
 *              |- <div> ------------ ...
 *                  |- <p>
 *          |- <div>
 *          |- <div>
 *          |- <div>
 */
@Test
public void test3() throws Exception {
    String url = "https://item.jd.com/100004286349.html";
    String title = getTitle(url);
    System.out.println(title);
}
private String getTitle(String url) throws Exception {
    // Get the root node of the HTML DOM tree
    Document document = Jsoup.connect(url).get();
    // Equivalent: document.select("div.sku-name").get(0);
    Element element = document.selectFirst("div.sku-name"); // pick the element with a CSS selector
    return element.text(); // the text contained inside the element
}
Result
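Step 2 Get the price
A problem caused by JavaScript
Problem description:
Looking at the static HTML, we find that the price element contains no data; the price is filled in dynamically by a JavaScript script.
So how do we get the price data?
Network
In the Network panel of the browser's developer tools, find the response body that carries the price data, along with the request address and its parameters.
Extract the request parameters that matter
This boils down to the following request:
https://p.3.cn/prices/mgets?skuIds=J_100004286349
Code implementation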
@Test
public void test4() throws Exception {
    String id = "J_100004286349";
    double price = getPrice(id);
    System.out.println(price);
}
private double getPrice(String id) throws Exception {
    String url = "https://p.3.cn/prices/mgets?skuIds=" + id;
    // User-Agent header, so the request looks like it comes from a browser
    String userAgent = "Mozilla/5.0 (Windows NT 5.1; zh-CN) AppleWebKit/535.12 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/535.12";
    String body = Jsoup.connect(url)
            .userAgent(userAgent)
            .ignoreContentType(true) // the response is JSON, not HTML
            .execute().body();
    // Parse the JSON array and pick out the "p" (price) field
    ObjectMapper m = new ObjectMapper();
    // JsonNode node = m.readTree(body);
    List<Map<String,String>> list = m.readValue(body, new TypeReference<List<Map<String,String>>>() {});
    String p = list.get(0).get("p");
    return Double.parseDouble(p);
}
Step 3 Get the full list of product categories
https://www.jd.com/allSort.aspx
/** Get the links to all product list pages */
@Test
public void test6() throws Exception {
    /*
     * {
     *   "http://...."
     *   "http://...."
     *   "http://...."
     *   "http://...."
     *   "http://...."
     * }
     */
    List<String> list = getAllLink();
    for (String s : list) {
        System.out.println(s);
    }
    System.out.println(list.size());
}
private List<String> getAllLink() throws Exception {
    String url = "https://www.jd.com/allSort.aspx";
    Document doc = Jsoup.connect(url).get();
    LinkedList<String> list = new LinkedList<String>();
    // Elements nested inside elements, so the selector parts are separated by spaces
    // dt holds the heading, dd holds the content
    Elements els = doc.select("div dl dd a");
    for (Element e : els) {
        // Read the href attribute of the link
        String href = e.attr("href");
        if (href.startsWith("//list.jd.com")) {
            String text = e.text();
            String data = e.data();
            String s = "text=" + text + ", data=" + data + ", href=http:" + href;
            list.add(s);
        }
    }
    return list;
}
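The links collected above start with // (for example //list.jd.com/...), so the code prepends a scheme by hand. As an alternative sketch that uses only the JDK, java.net.URI can resolve such a protocol-relative href against the URL of the page it was found on. The class name LinkResolver is hypothetical.

```java
import java.net.URI;

// Hypothetical helper; the name is illustrative only.
class LinkResolver {

    // Resolves an href found on pageUrl into an absolute URL.
    // A href starting with "//" inherits only the scheme of the page.
    static String absolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://www.jd.com/allSort.aspx";
        String href = "//list.jd.com/list.html?cat=12379,13302,13303";
        System.out.println(absolute(page, href));
        // https://list.jd.com/list.html?cat=12379,13302,13303
    }
}
```

URI.create throws IllegalArgumentException when the href contains characters that are not legal in a URI, such as spaces, so a crawler would need to catch that for malformed links.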
Step 4 Get the highest page number of one list
/** Get the highest page number */
@Test
public void test7() throws Exception {
    String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
    int n = getMaxPage(url); // 3
    System.out.println(n);
}
private int getMaxPage(String url) throws Exception {
    Document doc = Jsoup.connect(url).get();
    Elements e = doc.select("div.f-pager i");
    String n = e.text();
    return Integer.parseInt(n);
}
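getMaxPage above passes the selected text straight to Integer.parseInt, which throws NumberFormatException if the text is empty or contains anything besides digits, for instance when the page layout changes. A more forgiving parse could look like the sketch below. The class PageCount, the method parseMaxPage, and the idea of a fallback value are assumptions added here, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and the fallback behavior are assumptions.
class PageCount {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits in text as an int, or fallback if there is none.
    static int parseMaxPage(String text, int fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = DIGITS.matcher(text);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Integer.parseInt(m.group());
        } catch (NumberFormatException e) {
            return fallback; // the digit run does not fit in an int
        }
    }

    public static void main(String[] args) {
        System.out.println(parseMaxPage("3", 1));   // 3
        System.out.println(parseMaxPage(" 3 ", 1)); // 3
        System.out.println(parseMaxPage("", 1));    // 1
    }
}
```

With this helper, getMaxPage could return parseMaxPage(e.text(), 1), so that a category without a pager is treated as a single page.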
Step 5 Get all pages for the current product category
/** Get every page of one list */
//@Test
public void test08() throws Exception {
    String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
    List<String> list = getItemSort_AllPageList(url);
    for (String s : list) {
        System.out.println(s);
    }
}
private List<String> getItemSort_AllPageList(String url) throws Exception {
    LinkedList<String> list = new LinkedList<String>();
    int n = getMaxPage(url); // 3
    for (int i = 1; i <= n; i++) {
        list.add(url + "&page=" + i);
    }
    return list;
}
Step 6 Get all item links on the current page
/** All item links on one page of a category */
//@Test
public void test09() throws Exception {
    String url = "https://list.jd.com/list.html?cat=12379,13302,13303&page=1";
    List<String> list = getItemSort_InOnePage_AllItemList(url);
    for (String s : list) {
        System.out.println(s);
    }
    System.out.println(list.size());
}
private List<String> getItemSort_InOnePage_AllItemList(String url) throws Exception {
    LinkedList<String> list = new LinkedList<String>();
    Document doc = Jsoup.connect(url).get();
    Elements eles = doc.select("div#plist div.p-img a");
    for (Element e : eles) {
        String href = e.attr("href");
        if (href.startsWith("//item.jd.com")) {
            list.add("https:" + href);
        }
    }
    return list;
}
Putting it together: fetch the data for every item in every product category
package cn.edut.com.tarena;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class Test1 {
    String userAgent = "Mozilla/5.0 (Windows NT 5.1; zh-CN) AppleWebKit/535.12 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/535.12";
    //@Test
    public void test1() throws Exception {
        // www.tarena.cn
        String body = Jsoup.connect("http://www.jd.cn").execute().body();
        System.out.println("\nbody:-----------\n" + body);
    }
    //@Test
    public void test2() throws Exception {
        String url = "https://item.jd.com/100004286349.html";
        String body = Jsoup.connect(url).execute().body();
        System.out.println("\nbody:-----------\n" + body);
    }
    /**
     * DOM tree
     * / -------------------------------- Document type
     *  |- <html> ----------------------- Element type
     *      |- <head> ------------------- Element
     *      |- <body> ------------------- Element
     *          |- <div> ---------------- Element
     *              |- class="c1" ------- Attribute type
     *          |- <div> ---------------- ...
     *              |- <div> ------------ ...
     *                  |- <p>
     *          |- <div> ---------------- ...
     *          |- <div> ---------------- ...
     *          |- <div> ---------------- ...
     */
    //@Test
    public void test3() throws Exception {
        String url = "https://item.jd.com/100004286349.html";
        String title = getTitle(url);
        System.out.println("\nTitle: " + title);
    }
    private String getTitle(String url) throws Exception {
        // Get the root node of the HTML DOM tree
        Document document = Jsoup.connect(url).get();
        // Equivalent: document.select("div.sku-name").get(0);
        // sku-name is a class, so the selector uses a dot: div.sku-name
        Element element = document.selectFirst("div.sku-name"); // pick the element with a CSS selector
        return element.text(); // the text contained inside the element
    }
    /**
     * TODO: this part needs another look
     * @throws Exception
     */
    //@Test
    public void test4() throws Exception {
        String id = "J_100004286349";
        double price = getPrice(id);
        System.out.println("\nPrice: " + price);
    }
    /**
     * Process the data:
     *
     * [{"cbf":"0",
     *   "id":"J_100004286349",
     *   "m":"20000.00",
     *   "op":"6599.00",
     *   "p":"6599.00"}]
     *
     * Note: execute() returns a Response whose body() is the raw response text,
     * whereas get() parses the response into a Document.
     */
    private double getPrice(String id) throws Exception {
        String url = "https://p.3.cn/prices/mgets?skuIds=" + id;
        String body = Jsoup.connect(url)
                .userAgent(userAgent)    // make the server believe the client is a browser
                .ignoreContentType(true) // tell Jsoup not to treat the data as HTML
                .execute().body();
        // Parse the JSON array and pick out the "p" (price) field
        ObjectMapper m = new ObjectMapper();
        // JsonNode node = m.readTree(body);
        List<Map<String,String>> list = m.readValue(body, new TypeReference<List<Map<String,String>>>() {});
        String p = list.get(0).get("p");
        return Double.parseDouble(p);
    }
    /**
     * Get the product description (content).
     *
     * Not finished yet.
     */
    //@Test
    public void test5() throws Exception {
        String id = "3882469";
        // TODO
        String content = getContent(id);
        System.out.println("\nDescription: " + content);
    }
    private String getContent(String id) throws Exception {
        String url = "http://d.3.cn/desc/" + id;
        String body = Jsoup.connect(url)
                .userAgent(userAgent)
                .ignoreContentType(true)
                .execute()
                .body();
        // Strip the "showdesc(" prefix and the trailing ")" that wrap the JSON
        body = body.substring("showdesc(".length(), body.length() - 1);
        System.out.println(body);
        ObjectMapper map = new ObjectMapper();
        List<Map<String,String>> list = map.readValue(body, new TypeReference<List<Map<String,String>>>() {});
        for (Map<String, String> map2 : list) {
            Set<Entry<String, String>> entry = map2.entrySet();
            for (Entry<String, String> e : entry) {
                System.out.println(e.getKey() + ":" + e.getValue());
            }
        }
        String content = list.get(0).get("content");
        return content;
    }
    /** Get the links to all product list pages */
    //@Test
    public void test6() throws Exception {
        /*
         * {
         *   "http://...."
         *   "http://...."
         *   "http://...."
         *   "http://...."
         *   "http://...."
         * }
         */
        List<String> list = getAllSortList();
        for (String s : list) {
            System.out.println(s);
        }
        System.out.println(list.size());
    }
    private List<String> getAllSortList() throws Exception {
        String url = "https://www.jd.com/allSort.aspx";
        Document doc = Jsoup.connect(url).get();
        LinkedList<String> list = new LinkedList<String>();
        // Elements nested inside elements, so the selector parts are separated by spaces
        // dt holds the heading, dd holds the content
        Elements els = doc.select("div dl dd a");
        for (Element e : els) {
            // Read the href attribute of the link
            String href = e.attr("href");
            if (href.startsWith("//list.jd.com")) {
                /*
                 * String text = e.text();
                 * String data = e.data();
                 * String s = "text="+text+", data="+data+", href=http:"+href;
                 */
                list.add("http:" + href);
            }
        }
        return list;
    }
    /** Get the highest page number */
    //@Test
    public void test7() throws Exception {
        String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
        int n = getMaxPage(url); // 3
        System.out.println(n);
    }
    private int getMaxPage(String url) throws Exception {
        Document doc = Jsoup.connect(url).get();
        Elements e = doc.select("div.f-pager i");
        String n = e.text();
        return Integer.parseInt(n);
    }
    /** Get every page of one list */
    //@Test
    public void test08() throws Exception {
        String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
        List<String> list = getItemSort_AllPageList(url);
        for (String s : list) {
            System.out.println(s);
        }
    }
    private List<String> getItemSort_AllPageList(String url) throws Exception {
        LinkedList<String> list = new LinkedList<String>();
        int n = getMaxPage(url); // 3
        for (int i = 1; i <= n; i++) {
            list.add(url + "&page=" + i);
        }
        return list;
    }
    /** All item links on one page of a category */
    //@Test
    public void test09() throws Exception {
        String url = "https://list.jd.com/list.html?cat=12379,13302,13303&page=1";
        List<String> list = getItemSort_InOnePage_AllItemList(url);
        for (String s : list) {
            System.out.println(s);
        }
        System.out.println(list.size());
    }
    private List<String> getItemSort_InOnePage_AllItemList(String url) throws Exception {
        LinkedList<String> list = new LinkedList<String>();
        Document doc = Jsoup.connect(url).get();
        Elements eles = doc.select("div#plist div.p-img a");
        for (Element e : eles) {
            String href = e.attr("href");
            if (href.startsWith("//item.jd.com")) {
                list.add("https:" + href);
            }
        }
        return list;
    }
    /**
     * Get all lists;
     * for each list, get all of its pages;
     * for each page, get all of its links.
     *
     * => every link on every page of every list (JD is toast)
     */
    @Test
    public void test10() throws Exception {
        // Iterate over all categories
        List<String> list = getAllSortList();
        // Handle one category at a time, walking through all of its pages
        for (String sortUrl : list) {
            handleOneSort(sortUrl);
        }
    }
    private void handleOneSort(String url) throws Exception {
        // URLs of all pages of one product category
        List<String> list = getItemSort_AllPageList(url);
        for (String allPageUrl : list) {
            handlePage(allPageUrl);
        }
    }
    private void handlePage(String url) throws Exception {
        List<String> list = getItemSort_InOnePage_AllItemList(url);
        for (String itemUrl : list) {
            handleItem(itemUrl);
        }
    }
    private void handleItem(String url) {
        // http://item.jd.com/23423423423.html
        //                    |          |
        //                   from        to
        // Some items may fail to load, so wrap the work in try
        try {
            int from = url.lastIndexOf("/") + 1;
            int to = url.lastIndexOf(".");
            String id = url.substring(from, to);
            String title = getTitle(url);
            double price = getPrice(id);
            System.out.println("title:" + title);
            System.out.println("price:" + price);
            System.out.println("----------------------");
        } catch (Exception e) {
            // Ignore the items that fail, for now
        }
    }
}
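handleItem cuts the item id out of the URL with lastIndexOf, which assumes every URL has exactly the expected shape. As an alternative sketch, a regular expression can check the shape and extract the id in one step. The class SkuId and the method fromItemUrl are hypothetical names, and the pattern only covers the item.jd.com/<digits>.html form seen in this article.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and the URL pattern are assumptions.
class SkuId {
    // Accepts http://, https:// and protocol-relative // item URLs.
    private static final Pattern ITEM_URL =
            Pattern.compile("^(?:https?:)?//item\\.jd\\.com/(\\d+)\\.html");

    // Returns the numeric id, or an empty Optional if the URL has another shape.
    static Optional<String> fromItemUrl(String url) {
        Matcher m = ITEM_URL.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromItemUrl("https://item.jd.com/100004286349.html").orElse("none"));
        // 100004286349
        System.out.println(fromItemUrl("https://list.jd.com/list.html?cat=1").orElse("none"));
        // none
    }
}
```

One detail to check when wiring this up: test4 passes the id with a J_ prefix (J_100004286349), while handleItem passes the bare digits. Whether the price endpoint accepts both forms is not something the listing above confirms.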