1 Review

  • io
    • File
      • Object for file and directory operations
    • FileInputStream/FileOutputStream
      • File streams
    • ObjectInputStream/ObjectOutputStream
      • Object serialization
      • A serialized object must implement Serializable
      • writeObject()
      • readObject()
    • InputStreamReader/OutputStreamWriter
      • Encoding conversion streams
      • Java - Unicode
      • UTF-8
      • GBK
    • text - BufferedReader, PrintWriter
    • properties - Properties
    • xml - DOM4J
    • json - Jackson
    • yaml - Jackson
  • Threads
    • Creation
      • Extend Thread
      • Implement Runnable
    • Methods
      • Thread.currentThread()
      • Thread.sleep()
      • Thread.yield()
      • getName(),setName()
      • start()
      • interrupt()
      • join()
      • setDaemon(true)
      • setPriority(priority)
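    • Synchronization: synchronized
      • Threads run in step with each other, so shared data does not get corrupted
      • synchronized(object) {
        }
        acquires the lock of the specified object
      • synchronized void f() {
        }
        acquires the lock of the current instance (this)
      • static synchronized void f() {
        }
        acquires the lock of the "class object"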
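    • Producer-consumer model
      • A collection in the middle passes the data
      • Decoupling
    • Wait and notify
      • wait()
      • notify()
      • notifyAll()
      • Must be called inside synchronized
      • The object used to wait and notify must be the locked object
      • wait() should always be wrapped in a loop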
    • Lock
      • Optimistic locking
      • Lock
        • ReentrantLock
        • ReentrantReadWriteLock
    • Utilities that help create and control threads
      • Thread pools: ExecutorService/Executors
        • Executors.newFixedThreadPool(5)
        • Executors.newCachedThreadPool()
        • Executors.newSingleThreadExecutor()
        • pool.execute(Runnable task)
      • Callable/Future
        • Future future = pool.submit(Callable task)
          Object r = future.get();
      • ThreadLocal
        • Binds data to a thread
        • Treat the thread as a pipeline: upstream code puts data in, downstream code reads it
        • threadLocal.set(data)
        • threadLocal.get()
        • threadLocal.remove()
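
The wait/notify rules in the review above (call inside synchronized, wait on the locked object, wrap wait() in a loop) can be sketched with a small bounded buffer shared by one producer and one consumer. This is only an illustration: the class names `BoundedBuffer` and `ProducerConsumerDemo` are made up for this example and do not come from the course code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical example class: the collection in the middle that passes data.
class BoundedBuffer<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBuffer(int capacity) {
        this.capacity = capacity;
    }

    // synchronized method: the lock is "this", and wait()/notifyAll() are called on "this".
    public synchronized void put(T item) throws InterruptedException {
        while (items.size() == capacity) { // a loop, not an if: re-check after every wake-up
            wait();
        }
        items.addLast(item);
        notifyAll(); // wake up consumers waiting for data
    }

    public synchronized T take() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        T item = items.removeFirst();
        notifyAll(); // wake up producers waiting for free space
        return item;
    }
}

class ProducerConsumerDemo {
    // Starts one producer and one consumer; returns the sum of the values the consumer received.
    static int run(int count) {
        BoundedBuffer<Integer> buffer = new BoundedBuffer<>(2);
        int[] sum = {0};
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    buffer.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    sum[0] += buffer.take();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        try {
            // join() makes the consumer's writes to sum visible to this thread
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(run(10)); // prints 55 (1 + 2 + ... + 10)
    }
}
```

The producer and the consumer never reference each other, only the buffer, which is the decoupling the outline mentions.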

2 Day 18: Hands-on: Crawling JD.com

<mark>First, add the Jsoup library to the project.</mark>

2.1 The HTTP protocol

HTTP request data sent to the server

GET / HTTP/1.1
Host: www.tedu.cn
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
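
The request above is plain text: a request line, one "Name: value" line per header, and an empty line that ends the headers, with every line terminated by \r\n. A minimal sketch of assembling such text follows; the class name `HttpRequestText` and the method `buildGet` are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
class HttpRequestText {
    // Builds the text of a GET request: request line, header lines, then an empty line.
    static String buildGet(String path, Map<String, String> headers) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        for (Map.Entry<String, String> h : headers.entrySet()) {
            sb.append(h.getKey()).append(": ").append(h.getValue()).append("\r\n");
        }
        sb.append("\r\n"); // the empty line marks the end of the headers
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the headers in insertion order
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Host", "www.tedu.cn");
        headers.put("Connection", "keep-alive");
        System.out.print(buildGet("/", headers));
    }
}
```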

Data returned by the server

HTTP/1.1 200 OK
Date: Tue, 24 Sep 2019 15:30:45 GMT
Content-Type: text/html
Content-Length: 275688
Connection: keep-alive
Server: tarena
Last-Modified: Tue, 24 Sep 2019 01:14:40 GMT
ETag: "5d896e00-434e8"
Accept-Ranges: bytes
Age: 7092
X-Via: 1.1 PShbsjzsxqo180:5 (Cdn Cache Server V2.0), 1.1 PSjlbswt4dm34:3 (Cdn Cache Server V2.0), 1.1 bdwt64:8 (Cdn Cache Server V2.0)

<!DOCTYPE html>
......
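
The response has the same shape as the request: a status line, header lines, an empty line, then the body. As a rough sketch of reading the head of such a response (the class name `HttpResponseHead` is hypothetical, and real responses need more careful handling than this):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
class HttpResponseHead {
    final int status;
    final Map<String, String> headers = new LinkedHashMap<>();

    // Parses the status line and the header lines; stops at the empty line before the body.
    HttpResponseHead(String raw) {
        String[] lines = raw.split("\r?\n");
        // The status line looks like "HTTP/1.1 200 OK": version, status code, reason
        status = Integer.parseInt(lines[0].split(" ")[1]);
        for (int i = 1; i < lines.length && !lines[i].isEmpty(); i++) {
            int colon = lines[i].indexOf(':'); // first colon separates name from value
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim(),
                        lines[i].substring(colon + 1).trim());
            }
        }
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
                + "Content-Type: text/html\r\n"
                + "Content-Length: 275688\r\n"
                + "\r\n"
                + "<!DOCTYPE html>";
        HttpResponseHead head = new HttpResponseHead(raw);
        System.out.println(head.status);                        // prints 200
        System.out.println(head.headers.get("Content-Length")); // prints 275688
    }
}
```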

Jsoup test:

	@Test
	public void test1() throws Exception{
		//www.tarena.cn
		String body = Jsoup.connect("http://www.jd.cn").execute().body();
		System.out.println(body);
	}

Or do it the traditional way, with hand-written I/O streams over a socket

package cn.edut.com.tarena;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class Test0 {
	public static void main(String[] args) throws Exception {
		String host = "item.jd.com";
		int port = 80;
		// Connect; try-with-resources closes the socket at the end
		try (Socket socket = new Socket(host, port)) {
			System.out.println("Connected - " + host + ":" + port);
			// Send the request: HTTP lines end with \r\n, and an empty line ends the headers
			String http = "GET / HTTP/1.1\r\n" +
					"Host: " + host + "\r\n" +
					"Connection: keep-alive\r\n" +
					"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36\r\n" +
					"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" +
					"Accept-Language: zh-CN,zh;q=0.9\r\n\r\n";
			OutputStream out = socket.getOutputStream();
			out.write(http.getBytes(StandardCharsets.UTF_8));
			out.flush();
			System.out.println("HTTP request sent ... ");
			// Receive data. With keep-alive the server may leave the connection open,
			// so a 5-second read timeout ends the loop.
			socket.setSoTimeout(5000);
			System.out.println("\nReceived data:");
			BufferedReader in = new BufferedReader(
					new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
			String line;
			try {
				// readLine() returns null once the server closes the connection
				while ((line = in.readLine()) != null) {
					System.out.println(line);
				}
			} catch (SocketTimeoutException e) {
				// No more data arrived within the timeout
			}
			System.out.println("---- all data received -----");
		}
	}
}


2.2 HTML and CSS

CSS structure

<html>
    <head>
       <style>
           div {
              ...
           }

           #id1 {
              font-size: 50px
           }

           .c1 {
              ....
           }

           div.c0 .c1 {
              ...
           }
       </style>
    </head>
    <body>
       <div id="id1">
           <a href="http://www.tedu.cn">Click to visit Tedu</a>
       </div>

       <div class="c0">
           <div class="c1">xxx</div>
           <div class="c1">xxx</div>
       </div>
       <div>
           <div class="c1">xxx</div>
           <div class="c1">xxx</div>
       </div>
    </body>
</html>

How Jsoup names the parts of this structure


	 * DOM tree
	 *   /	-------------------------------- Document type
	 *    |- <html> ------------------------ Element type
	 *    		|- <head> ------------------ Element
	 *    		|- <body> ------------------ Element
	 *    			|- <div> --------------- Element
	 *    				|- class="c1" ------ Attribute type
	 *    			|- <div> --------------- ...
	 *    				|- <div> ----------- ... 
	 *    				|- <p> 
	 *    			|- <div> 
	 *    			|- <div> 
	 *    			|- <div> 



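2.3 The crawler

Jsoup is a third-party open-source API. It makes it easy to execute HTTP requests, handle the responses, and extract the content you need from HTML.

Step 1 Get the title


Analyze the page structure: the title is the text of the div with class sku-name.
Code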

	@Test
	public void test3() throws Exception {
		String url = "https://item.jd.com/100004286349.html";
		String title = getTitle(url);
		System.out.println(title);
	}

	private String getTitle(String url) throws Exception {
		// Fetch the page and get the root node of the HTML DOM tree
		Document document = Jsoup.connect(url).get();
		// Equivalent: document.select("div.sku-name").get(0);
		// Select the element with a CSS selector
		Element element = document.selectFirst("div.sku-name");
		// Return the text contained in the element
		return element.text();
	}


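Step 2 Get the price

A problem caused by JavaScript

Problem description:
In the static HTML, the price element holds no data; the price is generated dynamically by a JavaScript script.
So how do we get the price data?

Network
In the browser's Network panel, find the response body that carries the price data, along with the address and parameters of the request that fetched it.

Extract the useful request parameters

This boils down to the following request: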

https://p.3.cn/prices/mgets?skuIds=J_100004286349
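
The request above only needs the SKU number from the item URL, prefixed with J_. A small sketch of deriving the price URL from an item URL, using only string operations; the class name `PriceUrl` and its method names are made up for this illustration.

```java
// Hypothetical helper for illustration only.
class PriceUrl {
    // Extracts the SKU from an item URL such as https://item.jd.com/100004286349.html
    static String skuOf(String itemUrl) {
        int from = itemUrl.lastIndexOf('/') + 1; // first character after the last slash
        int to = itemUrl.lastIndexOf('.');       // the dot before "html"
        return itemUrl.substring(from, to);
    }

    // Builds the price request shown above for the given item URL.
    static String priceUrlFor(String itemUrl) {
        return "https://p.3.cn/prices/mgets?skuIds=J_" + skuOf(itemUrl);
    }

    public static void main(String[] args) {
        System.out.println(priceUrlFor("https://item.jd.com/100004286349.html"));
        // prints https://p.3.cn/prices/mgets?skuIds=J_100004286349
    }
}
```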



Implementation

	@Test
	public void test4() throws Exception {
		String id = "J_100004286349";
		double price = getPrice(id);
		System.out.println(price);
	}

	private double getPrice(String id) throws Exception {
		String url = "https://p.3.cn/prices/mgets?skuIds=" + id;
		// User-Agent string, so that the server treats the client as a browser
		String userAgent = "Mozilla/5.0 (Windows NT 5.1; zh-CN) AppleWebKit/535.12 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/535.12";
		String body = Jsoup.connect(url)
				.userAgent(userAgent)
				.ignoreContentType(true) // accept a non-HTML content type (the response is JSON)
				.execute()
				.body();

		// Extract the price: the body is a JSON array of objects, and "p" holds the price
		ObjectMapper m = new ObjectMapper();
		//JsonNode node = m.readTree(body);
		List<Map<String,String>> list = m.readValue(body, new TypeReference<List<Map<String,String>>>() {});
		String p = list.get(0).get("p");

		return Double.parseDouble(p);
	}
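
To make the response shape concrete without any library, here is a dependency-free sketch that pulls the "p" field out of the sample response with a regular expression. This is a swapped-in technique for illustration, not a replacement for the Jackson parsing above: a regex is not a real JSON parser. The class name `PriceField` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class PriceField {
    // Matches "p":"<value>". The opening quote right before p keeps it from matching "op".
    private static final Pattern P = Pattern.compile("\"p\"\\s*:\\s*\"([^\"]+)\"");

    static double parsePrice(String json) {
        Matcher m = P.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no \"p\" field in: " + json);
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        String sample = "[{\"cbf\":\"0\",\"id\":\"J_100004286349\","
                + "\"m\":\"20000.00\",\"op\":\"6599.00\",\"p\":\"6599.00\"}]";
        System.out.println(parsePrice(sample)); // prints 6599.0
    }
}
```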



Step 3 Get the full list of product categories

https://www.jd.com/allSort.aspx

	/** Get the links to all product lists */
	@Test
	public void test6() throws Exception {
		/*
		 * {
		 *   "http://...."
		 *   "http://...."
		 *   "http://...."
		 *   "http://...."
		 *   "http://...."
		 * }
		 */
		List<String> list = getAllLink();
		for (String s : list) {
			System.out.println(s);
		}
		System.out.println(list.size());
	}

	private List<String> getAllLink() throws Exception {
		String url = "https://www.jd.com/allSort.aspx" ; 
		Document doc = Jsoup.connect(url).get();
		
		LinkedList<String> list = new LinkedList<String>();
		// Levels nested inside levels, hence the spaces (descendant selectors)
		// dt holds the title, dd holds the content
		Elements els = doc.select("div dl dd a");
		for (Element e : els) {
			// Read the link target
			String href = e.attr("href");
			if(href.startsWith("//list.jd.com")) {
				String text = e.text();
				String data = e.data();
				String s = "text="+text+", data="+data+", href=http:"+href;
				list.add(s);
			}
		}
		return list;
	}
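
The hrefs on the category page are protocol-relative (they start with //), so the code above filters by prefix and prepends a scheme. The same idea can be tried on plain strings, without fetching anything. In this sketch the class name `HrefFilter` is hypothetical, and the de-duplication step is an added assumption rather than something the original code does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only.
class HrefFilter {
    // Keeps hrefs that start with the given protocol-relative prefix,
    // prepends "https:", and drops duplicates while preserving order.
    static List<String> toAbsolute(List<String> hrefs, String prefix) {
        Set<String> unique = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href.startsWith(prefix)) {
                unique.add("https:" + href);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "//list.jd.com/list.html?cat=12379,13302,13303",
                "//channel.jd.com/example.html", // made-up non-list link, filtered out
                "//list.jd.com/list.html?cat=12379,13302,13303");
        System.out.println(toAbsolute(hrefs, "//list.jd.com"));
        // prints [https://list.jd.com/list.html?cat=12379,13302,13303]
    }
}
```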



Step 4 Get the highest page number of one list

	/** Get the highest page number */
	@Test
	public void test7() throws Exception {
		String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
		int n = getMaxPage(url); //3 
		System.out.println(n);
	}

	private int getMaxPage(String url) throws Exception {
		Document doc = Jsoup.connect(url).get();
		Elements e = doc.select("div.f-pager i");
		String n = e.text();
		return Integer.parseInt(n);
	}
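
One thing to watch in getMaxPage: if the selector matches nothing, the selected text is an empty string and Integer.parseInt throws a NumberFormatException. A defensive variant could take the first run of digits and fall back to a default otherwise. The sketch below shows that idea on plain strings; the class name `PageNumber` and the fallback value are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class PageNumber {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits in the text as an int, or the fallback when there is none.
    static int parseOr(String text, int fallback) {
        Matcher m = DIGITS.matcher(text);
        return m.find() ? Integer.parseInt(m.group()) : fallback;
    }

    public static void main(String[] args) {
        System.out.println(parseOr("3", 1)); // prints 3
        System.out.println(parseOr("", 1));  // prints 1 (nothing matched, so the fallback is used)
    }
}
```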

Step 5 Get all pages for the current product category

	/** Get all page URLs of one list */
	//@Test
	public void test08() throws Exception {
		String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
		List<String> list = getItemSort_AllPageList(url); 
		for (String s : list) {
			System.out.println(s);
		}
	}

	private List<String> getItemSort_AllPageList(String url) throws Exception {
		LinkedList<String> list = new LinkedList<String>();
		int n = getMaxPage(url); //3 
		for (int i = 1; i <= n; i++) {
			list.add(url+"&page="+i);
		}
		return list;
	}

Step 6 Get all item links on the current page

	/** All item links on one page of a category */
	//@Test
	public void test09() throws Exception {
		String url = "https://list.jd.com/list.html?cat=12379,13302,13303&page=1" ; 
		List<String> list = getItemSort_InOnePage_AllItemList(url);
		for (String s : list) {
			System.out.println(s);
		}
		System.out.println(list.size());
	}

	private List<String> getItemSort_InOnePage_AllItemList(String url) throws Exception {
		LinkedList<String> list = new LinkedList<String>();
		Document doc = Jsoup.connect(url).get();
		Elements eles = doc.select("div#plist div.p-img a");
		for (Element e : eles) {
			String href = e.attr("href");
			if(href.startsWith("//item.jd.com")) {
				list.add("https:"+href);
			}
		}
		return list;
	}



Putting it together: fetch the data of every item in every product category

package cn.edut.com.tarena;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;


public class Test1 {
	String userAgent = "Mozilla/5.0 (Windows NT 5.1; zh-CN) AppleWebKit/535.12 (KHTML, like Gecko) Chrome/22.0.1229.79 Safari/535.12";
	
	//@Test
	public void test1() throws Exception{
		//www.tarena.cn
		String body = Jsoup.connect("http://www.jd.cn").execute().body();
		System.out.println("\nbody:-----------\n"+body);
	}
	
	//@Test
	public void test2() throws Exception {
		String url = "https://item.jd.com/100004286349.html";
		String body = Jsoup.connect(url).execute().body();
		System.out.println("\nbody:-----------\n"+body);
	}
	
	/**
	 * DOM tree
	 *   / -------------------------------- Document type
	 *    |- <html> ------------------------ Element type
	 *        |- <head> ------------------ Element
	 *        |- <body> ------------------ Element
	 *            |- <div> --------------- Element
	 *                |- class="c1" ------ Attribute type
	 *            |- <div> --------------- ...
	 *                |- <div> ----------- ...
	 *                |- <p>
	 *            |- <div> --------------- ...
	 *            |- <div> --------------- ...
	 *            |- <div> --------------- ...
	 */
	//@Test
	public void test3() throws Exception {
		String url = "https://item.jd.com/100004286349.html";
		String title = getTitle(url);
		System.out.println("\nTitle: "+title);
	}

	private String getTitle(String url) throws Exception {
		// Fetch the page and get the root node of the HTML DOM tree
		Document document = Jsoup.connect(url).get();
		// Equivalent: document.select("div.sku-name").get(0);
		// sku-name is a class, so the selector uses a dot: div.sku-name
		Element element = document.selectFirst("div.sku-name");
		// Return the text contained in the element
		return element.text();
	}
	
	/**
	 * This part needs revisiting
	 * @throws Exception
	 */
	//@Test
	public void test4() throws Exception {
		String id = "J_100004286349";
		double price = getPrice(id);
		System.out.println("\nPrice: "+price);
	}

	/**
	 * Data to process:
	 *
	 * [{"cbf":"0",
	 *   "id":"J_100004286349",
	 *   "m":"20000.00",
	 *   "op":"6599.00",
	 *   "p":"6599.00"}]
	 *
	 * Note: execute() returns a Response, and Response.body() is the response
	 * body as plain text; get() would parse it into a Document instead.
	 */
	private double getPrice(String id) throws Exception {
		String url = "https://p.3.cn/prices/mgets?skuIds="+id;
		String body = Jsoup.connect(url)
				.userAgent(userAgent) // make the server believe the client is a browser
				.ignoreContentType(true) // tell Jsoup not to treat the data as HTML
				.execute()
				.body();
		
		// Extract the price: "p" holds the price
		ObjectMapper m = new ObjectMapper();
		//JsonNode node = m.readTree(body);
		List<Map<String,String>> list = m.readValue(body,new TypeReference<List<Map<String,String>>>() {});
		String p = list.get(0).get("p");
		
		return Double.parseDouble(p);
	}
	
	
	/**
	 * Get the item description (content)
	 *
	 * Not finished yet
	 */
	//@Test
	public void test5() throws Exception {
		String id  = "3882469" ;
		//TODO
		String content = getContent(id);
		System.out.println("\nDescription: "+content);
	}

	private String getContent(String id) throws Exception {
		String url = "http://d.3.cn/desc/" + id; 
		String body = Jsoup.connect(url).
		userAgent(userAgent).
		ignoreContentType(true).
		execute().
		body();
		body = body.substring("showdesc(".length(), body.length()-1) ; 
		
		System.out.println(body);
		
		ObjectMapper map = new ObjectMapper();
		List<Map<String,String>> list = map.readValue(body,new TypeReference<List<Map<String,String>>>() {}) ; 
		for (Map<String, String> map2 : list) {
			Set<Entry<String, String>> entry = map2.entrySet();
			for (Entry<String, String> e : entry) {
				System.out.println(e.getKey()+":"+e.getValue());
			}
		}
		
		String content = list.get(0).get("content");
		
		return content;
	}
	
	/** Get the links to all product lists */
	//@Test
	public void test6() throws Exception {
		/*
		 * {
		 *   "http://...."
		 *   "http://...."
		 *   "http://...."
		 *   "http://...."
		 *   "http://...."
		 * }
		 */
		List<String> list = getAllSortList();
		for (String s : list) {
			System.out.println(s);
		}
		System.out.println(list.size());
	}

	private List<String> getAllSortList() throws Exception {
		String url = "https://www.jd.com/allSort.aspx" ; 
		Document doc = Jsoup.connect(url).get();
		
		LinkedList<String> list = new LinkedList<String>();
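		// Levels nested inside levels, hence the spaces (descendant selectors)
		// dt holds the title, dd holds the content
		Elements els = doc.select("div dl dd a");
		for (Element e : els) {
			// Read the link target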
			String href = e.attr("href");
			if(href.startsWith("//list.jd.com")) {
				/* String text = e.text(); String data = e.data(); String s = "text="+text+", data="+data+", href=http:"+href; */
				list.add("http:"+href);
			}
		}
		return list;
	}
	
	/** Get the highest page number */
	//@Test
	public void test7() throws Exception {
		String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
		int n = getMaxPage(url); //3 
		System.out.println(n);
	}

	private int getMaxPage(String url) throws Exception {
		Document doc = Jsoup.connect(url).get();
		Elements e = doc.select("div.f-pager i");
		String n = e.text();
		return Integer.parseInt(n);
	}
	
	/** Get all page URLs of one list */
	//@Test
	public void test08() throws Exception {
		String url = "https://list.jd.com/list.html?cat=12379,13302,13303";
		List<String> list = getItemSort_AllPageList(url); 
		for (String s : list) {
			System.out.println(s);
		}
	}

	private List<String> getItemSort_AllPageList(String url) throws Exception {
		LinkedList<String> list = new LinkedList<String>();
		int n = getMaxPage(url); //3 
		for (int i = 1; i <= n; i++) {
			list.add(url+"&page="+i);
		}
		return list;
	}
	
	/** All item links on one page of a category */
	//@Test
	public void test09() throws Exception {
		String url = "https://list.jd.com/list.html?cat=12379,13302,13303&page=1" ; 
		List<String> list = getItemSort_InOnePage_AllItemList(url);
		for (String s : list) {
			System.out.println(s);
		}
		System.out.println(list.size());
	}

	private List<String> getItemSort_InOnePage_AllItemList(String url) throws Exception {
		LinkedList<String> list = new LinkedList<String>();
		Document doc = Jsoup.connect(url).get();
		Elements eles = doc.select("div#plist div.p-img a");
		for (Element e : eles) {
			String href = e.attr("href");
			if(href.startsWith("//item.jd.com")) {
				list.add("https:"+href);
			}
		}
		return list;
	}
	
	
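	/**
	 * Get all lists
	 * For each list, get all pages
	 * For each page, get all links
	 *
	 * => all links on all pages of all lists => JD is done for
	 */
	@Test
	public void test10() throws Exception {
		// Get all categories
		List<String> list = getAllSortList();
		// Handle one category at a time, walking through all of its pages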
		for (String sortUrl : list) {
			handleOneSort(sortUrl);
		}
	}

	private void handleOneSort(String url) throws Exception {
		// All page URLs of one product category
		List<String> list = getItemSort_AllPageList(url);
		for (String allPageUrl : list) {
			handlePage(allPageUrl);
		}
	}

	private void handlePage(String url) throws Exception {
		List<String> list = getItemSort_InOnePage_AllItemList(url);
		for (String itemUrl : list) {
			handleItem(itemUrl);
		}
	}

	private void handleItem(String url) {
		//http://item.jd.com/23423423423.html
		//                   |          |
		//                   from       to
		// Some items may fail to load, so wrap the work in try
		try {
			int from = url.lastIndexOf("/")+1;
			int to = url.lastIndexOf(".");
			String id = url.substring(from, to);
			String title = getTitle(url);
			double price = getPrice(id);
			System.out.println("title:"+title);
			System.out.println("price:"+price);
			System.out.println("----------------------");
		}catch(Exception e) {
			// Ignore those items for now
		}
	}
	
}
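
The overall traversal (categories, then pages, then items) can be tried without touching the network by passing the two lookups in as functions. The following is a sketch under stated assumptions: `CrawlSkeleton`, `pagesOf` and `itemsOf` are hypothetical names, the lambdas in main return made-up data, and catching the failure per page is one possible design rather than what the class above does (it catches per item).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical skeleton for illustration only.
class CrawlSkeleton {
    // Walks categories -> pages -> items using caller-supplied lookups and collects item URLs.
    static List<String> crawl(List<String> categories,
                              Function<String, List<String>> pagesOf,
                              Function<String, List<String>> itemsOf) {
        List<String> items = new ArrayList<>();
        for (String category : categories) {
            for (String page : pagesOf.apply(category)) {
                try {
                    items.addAll(itemsOf.apply(page));
                } catch (RuntimeException e) {
                    // One failing page should not stop the whole crawl
                    System.err.println("skipped " + page + ": " + e.getMessage());
                }
            }
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> result = crawl(
                Arrays.asList("A", "B"),
                c -> Arrays.asList(c + "&page=1", c + "&page=2"),
                p -> {
                    if (p.equals("B&page=1")) {
                        throw new IllegalStateException("simulated failure");
                    }
                    return Collections.singletonList(p + "#item");
                });
        System.out.println(result);
        // prints [A&page=1#item, A&page=2#item, B&page=2#item]
    }
}
```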
