Converting a docx Word Document to a txt File

Time: 2021-06-21 09:38:47

Converting docx to txt

Converting doc to docx first
Working with Word documents using the python-docx package
Garbled text
Splitting text on given delimiters while keeping the delimiters

Converting doc to docx first

For this step, see my other blog post: doc to docx.

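    import os
    # Assumed import: the original snippet uses `wc` without showing where it comes from.
    # win32com is part of pywin32 and needs Windows with Microsoft Word installed.
    import win32com.client as wc


    def doc_to_docx(file_dir):
        # Collect every .doc file under file_dir (absolute paths, which Word expects)
        docfiles = []
        for root, dirs, files in os.walk(os.path.abspath(file_dir)):
            for file in files:
                if os.path.splitext(file)[1] == '.doc':
                    docfiles.append(os.path.join(root, file))
        word = wc.Dispatch("Word.Application")  # start the Word application
        for docfile in docfiles:
            doc = word.Documents.Open(docfile)  # open the .doc file
            doc.SaveAs('{}x'.format(docfile), 12)  # save with a ".docx" suffix; 12 is the docx file format
            doc.Close()  # close the original document
            os.remove(docfile)  # delete the original .doc file
        word.Quit()
        print("Done!")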
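The file-collection step of the function above can be tried on its own, without Word installed. The following is a minimal sketch, not the author's code: `find_doc_files` is a hypothetical helper name, and the two extra rules (matching the extension case-insensitively and skipping the `~$` lock files that Word leaves next to open documents) are my own assumptions about what is usually wanted.

```python
import os


def find_doc_files(file_dir):
    """Return the absolute paths of the .doc files under file_dir, sorted."""
    found = []
    for root, _dirs, files in os.walk(os.path.abspath(file_dir)):
        for name in files:
            # Assumption: skip the "~$..." lock files Word creates for open documents
            if name.startswith("~$"):
                continue
            # Assumption: compare the extension case-insensitively so "A.DOC" matches
            if os.path.splitext(name)[1].lower() == ".doc":
                found.append(os.path.join(root, name))
    return sorted(found)
```

The returned list could then be passed to the Word automation loop in place of `docfiles`.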

Working with Word documents using the python-docx package

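First, install the package by running pip install python-docx on the command line. The module is then imported under the name docx. (Note that pip install docx installs a different, older package, not python-docx.)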

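Next, read the text and table content of the Word document. (These are the two kinds of content I mainly deal with for now, so I only looked into how to extract them.)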

    import re
    from docx import Document  # provided by the python-docx package

    # filename must be defined beforehand; the author used a full absolute path
    document = Document(filename)

    # Read the text of every paragraph
    l = [paragraph.text.encode('utf-8') for paragraph in document.paragraphs]

    # Print and inspect the result; the text can also be processed in other ways
    pattern = r'(。|！|？|；)'
    for i in l:
        seg = i.decode('utf-8')
        seg = re.split(pattern, seg)
        seg.append("")
        seg = ["".join(pair) for pair in zip(seg[0::2], seg[1::2])]
        for word in seg:
            print(word)

    # Read the table content and print it
    tables = [table for table in document.tables]
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                print(cell.text.encode('utf-8').decode('utf-8'), '\t')
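The paragraph and table loops above can be folded into one generator. This is a minimal sketch under stated assumptions: `iter_document_text` is a hypothetical helper, it relies only on the `paragraphs`, `tables`, `rows`, `cells` and `text` attributes used in the code above, and it emits all paragraphs first and all tables afterwards rather than in document order. The `fake_doc` object is a stand-in built for illustration so the example runs without a .docx file.

```python
from types import SimpleNamespace


def iter_document_text(document):
    """Yield paragraph texts, then one tab-joined line per table row."""
    for paragraph in document.paragraphs:
        if paragraph.text.strip():  # skip empty paragraphs
            yield paragraph.text
    for table in document.tables:
        for row in table.rows:
            yield "\t".join(cell.text for cell in row.cells)


# Stand-in object exposing the same attributes (illustration only);
# with python-docx you would pass docx.Document(path) instead.
fake_doc = SimpleNamespace(
    paragraphs=[SimpleNamespace(text="第一段。"), SimpleNamespace(text="")],
    tables=[SimpleNamespace(rows=[
        SimpleNamespace(cells=[SimpleNamespace(text="a"), SimpleNamespace(text="b")]),
    ])],
)
lines = list(iter_document_text(fake_doc))
```

Joining the cells of a row with tabs keeps each table row on one line, which tends to be easier to post-process than one cell per line.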

Garbled text

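At first I ran into garbled text. Searching online suggested it was an encoding problem, and that the fix is to specify the UTF-8 encoding explicitly; I will look into the details later. I mainly relied on calling decode('utf-8') on the variables that hold text so that the Chinese characters are decoded correctly (if the variable has no decode method, call encode('utf-8') first and then decode('utf-8')). For example, in the code above: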

    seg = i.decode('utf-8')
    cell.text.encode('utf-8').decode('utf-8')
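To see what these calls actually do, here is a small sketch that assumes Python 3, where `str` is already Unicode text. Under that assumption, encoding and then decoding with the same codec gives back an equal string, and garbling typically shows up when bytes are decoded with a different codec than the one that produced them (for example when a file is read or written without an explicit `encoding`). The sample string and the choice of `latin-1` as the "wrong" codec are mine, for illustration only.

```python
text = "中文文本。"

# Same codec in both directions: the round trip returns an equal string
round_trip = text.encode("utf-8").decode("utf-8")

# Mismatched codecs: UTF-8 bytes read as latin-1 come out garbled
garbled = text.encode("utf-8").decode("latin-1")

# The garbling is reversible as long as no bytes were lost
recovered = garbled.encode("latin-1").decode("utf-8")
```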

Splitting text on given delimiters while keeping the delimiters

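When extracting the text, I need to split long passages into individual sentences on the characters I specify, while keeping each delimiter at the end of its sentence. The usual split method drops the delimiter, which does not meet this requirement (I could not think of a way to make plain split keep it). I found a method online and am recording it here for reference. Blog link: keeping the delimiter at the end of the sentence.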

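    pattern = r'(。|！|？|；)'  # delimiters to split on; the parentheses make re.split keep them
    seg = re.split(pattern, seg)  # split first
    seg.append("")  # pad so the last piece also has a delimiter slot
    seg = ["".join(i) for i in zip(seg[0::2], seg[1::2])]  # join each piece with its delimiter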
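The same technique can be wrapped in a function and tried on a sample string. This is a sketch rather than the linked post's code: `split_keep_delimiters` and its `delimiters` parameter are hypothetical names, and dropping empty results is my own addition (without it, text that ends with a delimiter produces a trailing empty string).

```python
import re


def split_keep_delimiters(text, delimiters="。！？；"):
    """Split text after each delimiter, keeping the delimiter on its sentence."""
    # Build the same kind of pattern as above, e.g. (。|！|？|；)
    pattern = "(" + "|".join(re.escape(d) for d in delimiters) + ")"
    parts = re.split(pattern, text)  # pieces and delimiters alternate
    parts.append("")                 # pad so the last piece has a delimiter slot
    sentences = ["".join(pair) for pair in zip(parts[0::2], parts[1::2])]
    return [s for s in sentences if s]  # drop empty strings


result = split_keep_delimiters("你好。今天天气好吗？很好")
# result == ['你好。', '今天天气好吗？', '很好']
```

With two delimiters in a row, the second one ends up as a sentence of its own: "什么？！" gives ['什么？', '！'].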

Writing to a txt file

    # The with statement closes the file even if writing fails
    with open(filename, 'w', encoding='utf-8') as output:
        for sentence in seg:
            output.write(sentence + '\n')
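The snippet above writes whatever `seg` currently holds, which is the sentences of a single paragraph. A minimal sketch of writing every paragraph in one pass could look like the following; `write_sentences` and `SENTENCE_END` are hypothetical names, and the input is assumed to be a plain list of paragraph strings such as the texts collected earlier.

```python
import re

SENTENCE_END = r"(。|！|？|；)"


def write_sentences(paragraph_texts, out_path):
    """Write one sentence per line to out_path; return the number of lines written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as output:
        for text in paragraph_texts:
            parts = re.split(SENTENCE_END, text)
            parts.append("")
            # Pair each piece with the delimiter that followed it
            for piece, delimiter in zip(parts[0::2], parts[1::2]):
                sentence = piece + delimiter
                if sentence:  # skip empty results
                    output.write(sentence + "\n")
                    count += 1
    return count
```

Opening the file once and passing `encoding="utf-8"` explicitly keeps the output independent of the platform's default encoding.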
