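[Java] Extracting Word text directly from the data stream in Java
Author: 脚本爱好者 / Published 2012/6/25 / 513
The code in this post extracts text directly from the file's raw byte stream. It supports Word 97 through Word 2003 (.doc). Later versions (.docx) are actually XML, so extracting text from them is very simple.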
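To illustrate the remark about the later, XML-based versions: a .docx file is a ZIP archive whose main body is stored in the entry `word/document.xml`. The following is a minimal sketch, not part of the original post. The class name `DocxTextSketch` and its methods are hypothetical, the entry is assumed to be UTF-8, and the tag stripping is deliberately crude (tabs, tables, headers and footnotes are ignored).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not from the original post.
class DocxTextSketch {

    // Reduces WordprocessingML to plain text with simple regex stripping.
    static String xmlToText(String xml) {
        return xml
                .replaceAll("</w:p>", "\n")   // end of paragraph becomes a newline
                .replaceAll("<[^>]+>", "")    // drop every remaining tag
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");       // decode &amp; last
    }

    // Reads word/document.xml from the archive (assumed UTF-8) and strips the markup.
    static String extract(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return xmlToText(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(extract(args[0]));
        }
    }
}
```

Run it with the path of a .docx file as the only argument. A real XML parser would be more robust than regular expressions; the regex version is used here only to keep the sketch short.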
// Word 97-2003 (.doc) text extractor. FileUtil and Stream are the author's
// helper classes and are not included in the post.
import java.util.List;

public class WordExtractor {
    public static StringBuilder logBytes = new StringBuilder();

    // fc == 0: 16-bit little-endian characters; otherwise one byte per character.
    public static String bytesToString(byte[] ogiBytes, int start, int length, int fc) {
        StringBuilder content = new StringBuilder();
        byte[] bytes = new byte[length];
        System.arraycopy(ogiBytes, start, bytes, 0, length);
        if (fc == 0) {
            for (int i = 0; i + 1 < bytes.length; i += 2) {
                content.append((char) (((bytes[i + 1] & 0xFF) << 8) | (bytes[i] & 0xFF)));
            }
        } else {
            for (byte b : bytes) {
                content.append((char) (b & 0xFF));
            }
        }
        return content.toString();
    }

    public static void bytesToString(byte[] ogiBytes, StringBuilder content, int start, int length, int fc) {
        content.append(bytesToString(ogiBytes, start, length, fc));
    }

    // Writes a hex dump, 16 bytes per line, for debugging.
    public static void printLogBytes(List<Byte> legaled) throws Exception {
        logBytes = new StringBuilder();
        logBytes.append("\n========================================================");
        for (int a = 0; a < legaled.size(); a++) {
            if (a % 16 == 0) {
                logBytes.append("\n");
            }
            logBytes.append(Integer.toHexString(legaled.get(a) & 0xFF) + " ");
        }
        logBytes.append("\n========================================================");
        FileUtil.writeAscFile("E:\\bytes.txt", logBytes.toString());
    }

    // Scans 8 directory entries (128 bytes each) for the "1Table" stream; returns 0 if absent.
    public static int getOneTable(byte[] ogiBytes, Stream stream, int dirSect1) {
        for (int i = 0; i < 8; i++) {
            int offsetEntry = (dirSect1 + 1) * 512 + i * 128;
            if (bytesToString(ogiBytes, offsetEntry, 64, 0).contains("1Table")) {
                return offsetEntry;
            }
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // The backslashes in the file paths were lost on the page and are restored here.
        byte[] ogiBytes = FileUtil.readBinFile("D:\\tools\\oletest\\test-old.doc");
        System.out.println("Total bytes: " + ogiBytes.length);

        // Compound-file signature: D0 CF 11 E0 A1 B1 1A E1.
        int[] magic = {208, 207, 17, 224, 161, 177, 26, 225};
        boolean isDoc = ogiBytes.length >= 8;
        for (int i = 0; isDoc && i < 8; i++) {
            isDoc = (ogiBytes[i] & 0xFF) == magic[i];
        }
        if (!isDoc) {
            System.out.println("Not a doc file!");
            return;
        }

        StringBuilder content = new StringBuilder();
        Stream stream = new Stream(ogiBytes);
        int[] offset = new int[1];

        offset[0] = 48; // file header: first directory sector
        int dirSect1 = stream.getInteger(offset);
        int oneTable = getOneTable(ogiBytes, stream, dirSect1);
        if (oneTable == 0) {
            System.out.println("1Table stream not found!");
            return;
        }
        offset[0] = oneTable + 116; // directory entry: start sector of the stream
        int startSect = stream.getInteger(offset);
        int tableStream = (startSect + 1) * 512;

        offset[0] = 930; // 512 + 0x1A2: fcClx field of the FIB
        int fcClx = stream.getInteger(offset);
        if (fcClx == -1) {
            System.out.println("This version of doc cannot be parsed!");
            return;
        }

        // Assumes the Clx starts with the piece table: 1-byte marker, then 4-byte length.
        int offsetClx = tableStream + fcClx;
        offset[0] = offsetClx + 1;
        int lcb = stream.getInteger(offset);
        int countPcd = (lcb - 4) / 12;
        int countCp = (lcb - countPcd * 8) / 4;
        int offsetPlcpcd = offsetClx + 5;

        for (int i = 0; i < countPcd; i++) {
            int offsetPcd = offsetPlcpcd + countCp * 4 + i * 8;
            offset[0] = offsetPcd + 2;
            int start = stream.getInteger(offset);
            int fc = start >> 30;        // bit 30 set: 8-bit text
            start = (start << 2) >> 2;   // clear the two flag bits
            offset[0] = offsetPlcpcd + i * 4;
            int cpPre = stream.getInteger(offset);  // getInteger advances offset[0]
            int cpNext = stream.getInteger(offset);
            int length = cpNext - cpPre - 1;
            if (fc == 0) {
                length *= 2;
            } else {
                start = start / 2;
            }
            start += 512; // skip the 512-byte file header
            bytesToString(ogiBytes, content, start, length, fc);
            System.out.println(start + ", " + length);
        }
        FileUtil.writeAscFile("E:\\output.txt", content.toString(), false);
        System.out.println("Done!");
    }
}
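The listing depends on two helper classes, `Stream` and `FileUtil`, that the post does not show. The following is a minimal sketch of what they might look like, inferred only from how they are called. Two points are assumptions: that `getInteger` reads a little-endian 32-bit value and advances `offset[0]` by 4 (the two back-to-back reads of `cpPre` and `cpNext` suggest this), and that the third argument of `writeAscFile` is an append flag. The text encoding used for output (UTF-8 here) is also a guess.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for the author's Stream class.
class Stream {
    private final byte[] data;

    Stream(byte[] data) {
        this.data = data;
    }

    // Reads a little-endian 32-bit int at offset[0], then advances offset[0] by 4.
    int getInteger(int[] offset) {
        int p = offset[0];
        int value = (data[p] & 0xFF)
                | (data[p + 1] & 0xFF) << 8
                | (data[p + 2] & 0xFF) << 16
                | (data[p + 3] & 0xFF) << 24;
        offset[0] = p + 4;
        return value;
    }
}

// Hypothetical stand-in for the author's FileUtil class.
class FileUtil {
    static byte[] readBinFile(String path) throws IOException {
        return Files.readAllBytes(Paths.get(path));
    }

    static void writeAscFile(String path, String text) throws IOException {
        writeAscFile(path, text, false);
    }

    // Assumption: the boolean means "append to the file instead of overwriting it".
    static void writeAscFile(String path, String text, boolean append) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (append) {
            Files.write(Paths.get(path), bytes,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } else {
            Files.write(Paths.get(path), bytes); // creates or truncates the file
        }
    }
}
```

Passing the position as a one-element `int[]` lets `getInteger` move the read cursor for the caller, which is why the listing can read consecutive fields without recomputing the offset.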
Copyright © 2004 - 2024 dezai.cn. All Rights Reserved