Parsing doc and docx Documents in Java with Apache POI

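Background: A few days ago my company needed to build a wiki-style documentation project. Part of it involved converting Word doc and docx documents into HTML files on top of Spring Boot.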

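Formats: doc is the binary format Microsoft used by default in Word up to and including Word 2003; docx is the format Microsoft introduced with Word 2007 and has used since. Despite the similar extension, docx is not an extension of doc: it is a ZIP package of XML parts (Office Open XML), which is also why a docx file is usually smaller than the equivalent doc. I recommend using docx wherever possible: it saves space and spares developers a lot of trouble.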

Design approach:

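Depending on the format (doc or docx), converting Word to HTML needs a different implementation. In both cases the HTML is returned to the front end as a String.
docx is Microsoft's current Word format, and its compatibility and stability are good. In my experience the conversion to HTML produced no garbled text on either Windows or Linux.
My ability is limited and I did not fully understand the code that converts docx to HTML, so I followed the stock approach: write the generated HTML file to a path relative to the project, read it back and return it to the front end as a String, then delete the temporary HTML file.
doc is an older Microsoft format, and converting it to HTML ran into all kinds of problems, above all many different kinds of garbled text (some of which amounted to a corrupted file). It took a long time to solve. Pay particular attention to every place in the doc code where I wrote UTF-8: get one of them wrong and garbled text shows up when you least expect it.
The doc conversion took a long time, but it was worth it: I came to understand most of the doc-to-HTML process and was able to return the HTML to the front end directly as a String, instead of writing it to a relative path, reading it back, and deleting it.
During parsing, POI exposes the images in a Word document as binary data. For this project I uploaded those images to Alibaba Cloud OSS and substituted the resulting URLs for the image links in the HTML (that part of the code has been removed from this article).
To customize the image links, override the FileImageExtractor and BasicURIResolver classes (docx) or supply your own implementation through setPicturesManager (doc).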
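
As a rough illustration of the first point, the format can be told apart from the first bytes of the upload rather than from the file extension. The sketch below is not taken from the project: the class `WordFormatSniffer` and its names are hypothetical, and it relies only on the fact that a binary doc is an OLE2 compound file while a docx is a ZIP package. The signatures identify the container, so other OLE2 or ZIP files (an xls or a plain zip archive, for example) would match as well.

```java
final class WordFormatSniffer {

    /** The two Word formats discussed in this article, plus a fallback. */
    enum WordFormat { DOC, DOCX, UNKNOWN }

    // OLE2 compound files (binary .doc) start with D0 CF 11 E0
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0};

    // ZIP packages (.docx) start with 'P' 'K' 03 04
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file by the leading bytes of its content. */
    static WordFormat detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return WordFormat.DOC;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return WordFormat.DOCX;
        }
        return WordFormat.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare the byte as an unsigned value
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The result would decide whether the upload goes to the doc or the docx conversion method. If the bytes are read from a stream, use one that supports mark/reset so it can be rewound before parsing.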

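POM file: Getting the dependencies right is critical, because incompatible versions will break the conversion. The versions I use are not the latest, but they work together without conflicts. Note that the list mixes 3.12 and 3.10-FINAL POI artifacts; that is simply the combination that worked for me, and if you change versions, keeping all POI artifacts on one release is the safer choice.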

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>3.10-FINAL</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.10-FINAL</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.12</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>xdocreport</artifactId>
            <version>2.0.1</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>fr.opensagres.xdocreport.document</artifactId>
            <version>2.0.1</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>org.apache.poi.xwpf.converter.core</artifactId>
            <version>1.0.6</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>org.apache.poi.xwpf.converter.pdf</artifactId>
            <version>1.0.6</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
            <version>1.0.6</version>
        </dependency>
        <!-- Apache POI dependencies: everything above -->

Implementation:

docx format:

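    // Library imports for this method (project-specific classes are omitted)
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;

    /**
     * Parses a DOCX document.
     *
     * @param inputStream input stream of the document
     * @throws Exception if parsing fails
     */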
    private AnalysisDTO docxToHtml(InputStream inputStream) throws Exception {

        OutputStreamWriter outputStreamWriter = null;

        BufferedReader bf = null;

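        // Thread-safe list
        List<AnalysisPicMsgDTO> ossList = Collections.synchronizedList(new ArrayList<AnalysisPicMsgDTO>());

        // ThreadLocal gives each thread its own copy of the value (it isolates data per thread; it does not share it between threads)
        ThreadLocal<String> threadLocal = new ThreadLocal<String>();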

        Long htmlId = IdWorkerUtil.getId();

        try {
            XWPFDocument document = new XWPFDocument(inputStream);

            XHTMLOptions options = XHTMLOptions.create();

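            // Image folder; the path is only a placeholder because this class was overridden in the project
            // (the backslash must be escaped in a Java string literal)
            options.setExtractor(new FileImageExtractor(new File("E:\\000000000")));

            // Path used for image links in the HTML
            options.URIResolver(new BasicURIResolver(new File("E:\\000000000")));

            // Write the output file into the project directory
            outputStreamWriter = new OutputStreamWriter(new FileOutputStream(htmlPath(htmlId)), "UTF-8");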

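            // Get the converter instance
            XHTMLConverter xhtmlConverter = (XHTMLConverter) XHTMLConverter.getInstance();

            // Convert to XHTML
            xhtmlConverter.convert(document, outputStreamWriter, options);

            // Close the writer so everything is flushed to disk before the file is read back
            outputStreamWriter.close();

            // Read the HTML file back from the project directory
            StringBuilder buffer = new StringBuilder();

            // Read with the same encoding the file was written in, not the platform default
            bf = new BufferedReader(new InputStreamReader(new FileInputStream(htmlPath(htmlId)), "UTF-8"));

            String s = null;

            // Read one line at a time with readLine
            while ((s = bf.readLine()) != null) {
                buffer.append(s.trim());
            }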

            // Strip styles (removes the style markup from the HTML; delete this line if you do not need it)
            String content = CommonUtil.delDangerHTMLTag(buffer.toString());

            // Delete the temporary file
            deleteTemporaryHtmlFile(htmlPath(htmlId));

            AnalysisDTO dto = new AnalysisDTO();

            dto.setHtmlFile(content);
            dto.setOssKeyList(ossList);

            // Return the HTML that was read back, wrapped in the DTO
            return dto;
        } catch (IOException e) {
            LOG.error(EventLog.cast(LogType.DOCUMENT, "IO exception while converting [docx] document to HTML"), e);
            throw new ServiceException(ErrorCodes.ANALYSIS_DOCX_IO_EXCEPTION);
        } catch (Exception e) {
            LOG.error(EventLog.cast(LogType.DOCUMENT, "General exception while converting [docx] document to HTML"), e);
            throw new ServiceException(ErrorCodes.ANALYSIS_DOCX_EXCEPTION);
        } finally {
            if (outputStreamWriter != null) {
                outputStreamWriter.close();
            }
            if (null != bf) {
                bf.close();
            }
            // Delete the temporary file again, in case an exception skipped the first delete
            deleteTemporaryHtmlFile(htmlPath(htmlId));

            // Clear the thread-local value
            threadLocal.remove();
        }
    } 
    

    /** Builds the relative path of the temporary HTML file. */
    private String htmlPath(Long htmlId) throws Exception {

        StringBuffer sb = new StringBuffer();

        sb.append("src/main/resources/temporaryHtml/");
        sb.append(htmlId);
        sb.append(".html");

        return sb.toString();
    }

    /** Deletes the temporary file (under the relative path). */
    private void deleteTemporaryHtmlFile(String oppositeFilePath) throws Exception {

        File file = new File(oppositeFilePath);

        if (file.isFile() && file.exists()) {
            file.delete();
        }
    }
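
The write, read back, and delete cycle above can also be sketched with java.nio, which keeps the encoding explicit on the read side and puts the cleanup in a single finally block. This is only an illustration under assumptions: `TempHtmlRoundTrip` and the `HtmlFileWriter` callback are hypothetical names, the callback stands in for the converter call that writes the HTML, and the file goes to the system temp directory rather than the project path used above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempHtmlRoundTrip {

    /** Hypothetical callback standing in for the converter call that writes the HTML file. */
    interface HtmlFileWriter {
        void writeTo(Path target) throws IOException;
    }

    /** Lets the writer fill a temporary file, reads it back as UTF-8, and always deletes it. */
    static String renderViaTempFile(HtmlFileWriter writer) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("word-", ".html");
            writer.writeTo(tmp);
            // Decode with an explicit charset instead of the platform default
            return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // Best-effort cleanup; the result or the original exception matters more
                }
            }
        }
    }
}
```

Because the delete sits in finally, it runs on both the success path and the failure path, so a second explicit delete is not needed.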

doc format:

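    // Library imports for this method (project-specific classes are omitted)
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.w3c.dom.Document;

    /**
     * Parses a DOC document.
     *
     * @param inputStream input stream of the document
     * @throws Exception if parsing fails
     */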
    private AnalysisDTO docToHtml(InputStream inputStream) throws Exception {

        BufferedReader bf = null;

        List<AnalysisPicMsgDTO> ossList = Collections.synchronizedList(new ArrayList<>());

        try {

            HWPFDocument wordDocument = new HWPFDocument(inputStream);

            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(document);

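            /*
             * 1. Save the picture and return its path (done asynchronously).
             * 2. content: the binary image data
             * 3. pictureType: the image type (file extension)
             * 4. name: the file name
             */
            wordToHtmlConverter.setPicturesManager((content, pictureType, name, width, height) -> {

                String attachmentUrl = "";
                // NOTE: the OSS upload call that produces `ret` was removed from this article, so `ret` is undefined here
                // The upload to OSS succeeded
                if (ret.isSuccess()) {
                    attachmentUrl = (String) ret.getResult().getData();
                }
                // Return the Alibaba Cloud OSS URL, which replaces the image link in the document
                return attachmentUrl;
            });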

            wordToHtmlConverter.processDocument(wordDocument);

            Document htmlDocument = wordToHtmlConverter.getDocument();

            ByteArrayOutputStream outStream = new ByteArrayOutputStream();

            DOMSource domSource = new DOMSource(htmlDocument);

            StreamResult streamResult = new StreamResult(outStream);

            // Get a TransformerFactory instance
            TransformerFactory tf = TransformerFactory.newInstance();

            Transformer serializer = tf.newTransformer();

            // Set the output method, encoding, and indentation, then run the transformation
            setLayOutMsg(serializer, domSource, streamResult, "UTF-8");

            outStream.close();
            String content = new String(outStream.toByteArray(), "UTF-8");

            // Strip styles (removes the style markup from the HTML; delete this line if you do not need it)
            String strContent = CommonUtil.delDangerHTMLTag(content);

            AnalysisDTO dto = new AnalysisDTO();
            dto.setHtmlFile(strContent);
            dto.setOssKeyList(ossList);

            return dto;
        } catch (FileNotFoundException e) {

            LOG.error(EventLog.cast(LogType.DOCUMENT, "Document not found while parsing [doc] document"), e);
            throw new ServiceException(ErrorCodes.NOT_FIND_DOC);
        } catch (IOException e) {

            LOG.error(EventLog.cast(LogType.DOCUMENT, "IO exception while converting [doc] document to HTML"), e);
            throw new ServiceException(ErrorCodes.ANALYSIS_DOC_IO_EXCEPTION);
        } catch (ParserConfigurationException e) {

            LOG.error(EventLog.cast(LogType.DOCUMENT, "Parser configuration exception while converting [doc] document to HTML"), e);
            throw new ServiceException(ErrorCodes.PARSER_CONFIGURATION_EXCEPTION);
        } catch (TransformerConfigurationException e) {

            LOG.error(EventLog.cast(LogType.DOCUMENT, "Transformer configuration exception while converting [doc] document to HTML"), e);
            throw new ServiceException(ErrorCodes.TRANSFORMER_CONFIGURATION_EXCEPTION);
        } catch (TransformerFactoryConfigurationError e) {

            LOG.error(EventLog.cast(LogType.DOCUMENT, "Transformer factory configuration error while converting [doc] document to HTML"), e);
            throw new ServiceException(ErrorCodes.PARSER_CONFIGURATION_EXCEPTION);
        } catch (TransformerException e) {

            LOG.error(EventLog.cast(LogType.DOCUMENT, "Transformer exception while converting [doc] document to HTML"), e);
            throw new ServiceException(ErrorCodes.PARSER_CONFIGURATION_EXCEPTION);
        } catch (Exception e) {

            LOG.error(EventLog.cast(LogType.DOCUMENT, "General exception while converting [doc] document to HTML"), e);
            throw new ServiceException(ErrorCodes.ANALYSIS_DOC_EXCEPTION);
        } finally {

            if (null != bf) {
                bf.close();
            }
            if (null != inputStream) {
                inputStream.close();
            }
        }
    }



    /** Sets the output method, encoding, and indentation, then runs the transformation. */
    private void setLayOutMsg(Transformer serializer, DOMSource domSource, StreamResult streamResult, String fileCode) throws Exception {

        String defaultCode = serializer.getOutputProperties().getProperty(OutputKeys.ENCODING);

        LOG.info("Transformer default output encoding: " + defaultCode);

        // Encoding
        serializer.setOutputProperty(OutputKeys.ENCODING, fileCode);

        // Indentation
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");

        // Output method
        serializer.setOutputProperty(OutputKeys.METHOD, "html");

        serializer.transform(domSource, streamResult);
    }
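
The in-memory part of the doc path (DOM tree to bytes to String, with no temporary file) can be tried in isolation using only the JDK. The following is a minimal sketch and not the project's code: `DomToHtmlString` and its methods are hypothetical, and a small hand-built DOM stands in for the document that WordToHtmlConverter produces. The point it illustrates is that the encoding given to the Transformer and the charset used to decode the bytes are the same.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomToHtmlString {

    /** Builds a tiny html/body/p tree; a stand-in for the converter's DOM output. */
    static Document sampleDocument(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent(text);
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build the sample document", e);
        }
    }

    /** Serializes the DOM to UTF-8 bytes in memory and decodes them with the same charset. */
    static String serialize(Document doc) {
        try {
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            serializer.transform(new DOMSource(doc), new StreamResult(out));

            // Decode with the charset that was set as the output encoding
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize the document", e);
        }
    }
}
```

Calling `DomToHtmlString.serialize(DomToHtmlString.sampleDocument("hello"))` returns the markup as a String that can be handed straight to the front end.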

Summary

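This article only explains the approach and shows the key code; the code cannot be run as-is.
When parsing doc documents, pay close attention to encoding; I ran into garbled text many times.
If you are unsure which dependencies to import, you can copy the POM section above as-is.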
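
To make the encoding warning concrete, the small sketch below reproduces the kind of garbling described above using only the JDK. `CharsetMismatchDemo` is a hypothetical name, and this shows one common cause (bytes written in one charset and decoded in another), not necessarily every kind of garbling encountered in the project.

```java
import java.nio.charset.Charset;

final class CharsetMismatchDemo {

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }
}
```

With the same charset on both sides (for example `StandardCharsets.UTF_8` twice) the text survives unchanged. Writing as UTF-8 and reading as `StandardCharsets.ISO_8859_1` turns every byte of a multi-byte character into a separate wrong character, which is why a single mismatched UTF-8 setting is enough to break the output.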