HttpClient爬虫使用

  Java本身提供了关于网络访问的包,在java.net中,然后它不够强大。于是Apache基金会发布了开源的http请求的包,即HttpClient,这个包提供了非常多的网络访问的功能。

一、HttpClient简单示例

maven工程中导包

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>    
</dependencies>

一个简单示例

   String url = "";    //请求路径

    //构造路径参数
    List<NameValuePair> nameValuePairList = Lists.newArrayList();
    nameValuePairList.add(new BasicNameValuePair("username","test"));
    nameValuePairList.add(new BasicNameValuePair("password","password"));

    //构造请求路径,并添加参数
    URI uri = new URIBuilder(url).addParameters(nameValuePairList).build();

    //构造Headers
    List<Header> headerList = Lists.newArrayList();
    headerList.add(new BasicHeader(HttpHeaders.ACCEPT_ENCODING,"gzip, deflate"));
    headerList.add(new BasicHeader(HttpHeaders.CONNECTION, "keep-alive"));

    //构造HttpClient
    HttpClient httpClient = HttpClients.custom().setDefaultHeaders(headerList).build();

    //构造HttpGet请求
    HttpUriRequest httpUriRequest = RequestBuilder.get().setUri(uri).build();

    //获取结果
    HttpResponse httpResponse = httpClient.execute(httpUriRequest);

    //获取返回结果中的实体
    HttpEntity entity = httpResponse.getEntity();

    //查看页面内容结果
    String rawHTMLContent = EntityUtils.toString(entity);

    System.out.println(rawHTMLContent);

    //关闭HttpEntity流
    EntityUtils.consume(entity);

  爬虫的第一步需要构建一个客户端,即请求端,我们这里使用HttpClient作为我们的请求端,然后确定使用哪种方式请求什么网址,伪造请求头,携带参数等。再然后使用HttpResponse获取请求的地址对应的结果即可。最后取出HttpEntity转换一下就可以得到我们请求的网址对应的内容了。

二、HttpClient封装使用

  Apache不仅开发了httpClient包,而且针对常用的一些网络爬虫技术做了封装,使之用来更为方便,commons-httpclient包是针对上述的封装开发便于客户端使用。

maven工程中导包

  <dependencies>
    <dependency>
	    <groupId>apache-httpclient</groupId>
	    <artifactId>commons-httpclient</artifactId>
	    <version>3.1</version>
	</dependency>
  </dependencies>

1、实现get请求

	HttpClient client = new HttpClient();
		//设置***服务器和端口
// client.getHostConfiguration().setProxy("proxyHost", "proxyPort");
		//使用get方法,如果服务器需要通过HTTPS链接,那只需要奖下面的url中的http换成https
		HttpMethod method = new GetMethod("http://www.bjkgjlu.com/64621hnb/328224105.html");
		//使用post方法
// HttpMethod method = new PostMethod("http://www.bjkgjlu.com/64621hnb/328224105.html");
		client.executeMethod(method);
		//打印服务器返回的状态
		System.out.println(method.getStatusLine());
		//打印返回的信息
		System.out.println(method.getResponseBodyAsString());
		//释放链接
		method.releaseConnection();

2、处理重定向

    在jsp/Servlet编程中的response.sendRedirect方法就是i使用HTTP协议中的重定向机制。它与JSP中的jsp:forward的区别在于后者是在服务器中实现页面的跳转,也就是说应用容器加载了索要跳转的页面内容并返回给客户端。而前者是返回一个状态码,这些状态码可能值见下表, 然后客户端读取需要跳转到的页面的URL并重新加载新的页面。就是这样一个过程,所以我们编程的时候就要通过HttpMethod.getStatusCode()方法判断返回值是否为下表中的某个值来判断是否需要跳转。如果已经确认需要进行页面跳转了,那么可以通过读取HTTP头中的location属性来获取新的地址。

HttpClient client = new HttpClient();
		HttpMethod post = new PostMethod(url);
		
		client.executeMethod(post);
		System.out.println(post.getStatusLine().toString());
		post.releaseConnection();
		// 检查是否重定向
		int statuscode = post.getStatusCode();
		if ((statuscode == HttpStatus.SC_MOVED_TEMPORARILY) ||
            (statuscode == HttpStatus.SC_MOVED_PERMANENTLY) || 
				(statuscode ==HttpStatus.SC_SEE_OTHER) || 
            (statuscode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
			// 读取新的 URL 地址 
			Header header=post.getResponseHeader("location");
			if (header!=null){
				String newuri=header.getValue();
				if((newuri==null)||(newuri.equals("")))
					newuri="/";
				//请求新的地址
				GetMethod redirect=new GetMethod(newuri);
				client.executeMethod(redirect);
		   		System.out.println("Redirect:"
                              +redirect.getStatusLine().toString());
				redirect.releaseConnection();
			}else 
				System.out.println("Invalid redirect");
		}

3、模仿登陆

3.1 分析登陆请求表单

  图为分析登陆表单得到的action地址,就是将登陆账号密码提交的地址,一般是提交到服务器验证成功之后,设置客户端cookie并产生重定向。

3.2 分析登陆重定向

  该网络请求为重定向后的请求,获取了uid等cookie得到服务器的认可。

3.3 Java源码

    public static String loginurl = "https://security.kaixin001.com/login/login_post.php";
    static Cookie[] cookies = {};

    static HttpClient httpClient = new HttpClient();
    
    static String email = "524235428@qq.com";//你的email
    static String psw = "123456";//你的密码
    // 消息发送的action
    String url = "http://www.kaixin001.com/home/";

    public static void getUrlContent()
            throws Exception {

        HttpClientParams httparams = new HttpClientParams();
        httparams.setSoTimeout(30000);
        httpClient.setParams(httparams);

        httpClient.getHostConfiguration().setHost("www.kaixin001.com", 80);

        httpClient.getParams().setParameter(
                HttpMethodParams.HTTP_CONTENT_CHARSET, "UTF-8");

        PostMethod login = new PostMethod(loginurl);
        login.addRequestHeader("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");

        NameValuePair Email = new NameValuePair("loginemail", email);// 邮箱
        NameValuePair password = new NameValuePair("password", psw);// 密码
        // NameValuePair code = new NameValuePair( "code"
        // ,"????");//有时候需要验证码,暂时未解决

        NameValuePair[] data = { Email, password };
        login.setRequestBody(data);

        httpClient.executeMethod(login);
        int statuscode = login.getStatusCode();
        System.out.println(statuscode + "-----------");
        String result = login.getResponseBodyAsString();
        System.out.println(result+"++++++++++++");

        cookies = httpClient.getState().getCookies();
        System.out.println("==========Cookies============");
        int i = 0;
        for (Cookie c : cookies) {
            System.out.println(++i + ": " + c);
        }
        httpClient.getState().addCookies(cookies);

        // 当state为301或者302说明登陆页面跳转了,登陆成功了
        if ((statuscode == HttpStatus.SC_MOVED_TEMPORARILY)
                || (statuscode == HttpStatus.SC_MOVED_PERMANENTLY)
                || (statuscode == HttpStatus.SC_SEE_OTHER)
                || (statuscode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
            // 读取新的 URL 地址
            Header header = login.getResponseHeader("location");
            // 释放连接
            login.releaseConnection();
            System.out.println("获取到跳转header>>>" + header);
            if (header != null) {
                String newuri = header.getValue();
                if ((newuri == null) || (newuri.equals("")))
                    newuri = "/";
                GetMethod redirect = new GetMethod(newuri);
                // ////////////
                redirect.setRequestHeader("Cookie", cookies.toString());
                httpClient.executeMethod(redirect);
                System.out.println("Redirect:"
                        + redirect.getStatusLine().toString());
                redirect.releaseConnection();

            } else
                System.out.println("Invalid redirect");
        } else {
            // 用户名和密码没有被提交,当登陆多次后需要验证码的时候会出现这种未提交情况
            System.out.println("用户没登陆");
            System.exit(1);
        }

    }

参考:https://www.cnblogs.com/ITtangtang/p/3968093.html#a6