NOTE: Recipes have moved! Please visit GitHub.com/activestate/code for the current versions.

Here's a script to save all Word documents in and below a given directory to text.

import fnmatch, os, pythoncom, sys, win32com.client

wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")

try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in fnmatch.filter(files, '*.doc'):
            # COM needs an absolute path; relative names such as ".\mydoc.doc" fail.
            doc = os.path.abspath(os.path.join(path, filename))
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)
            # Swap the extension with splitext so that upper-case ".DOC" names work too.
            docastxt = os.path.splitext(doc)[0] + '.txt'
            wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
            wordapp.ActiveWindow.Close()
finally:
    wordapp.Quit()

Requires the Python for Windows extensions, and MS Word.

Shows how simple COM scripting can be in Python!

15 comments

Prakash Balraj 13 years, 6 months ago

ImportError: No module named pythoncom. I tried running the .py file and it throws the following error (this is Python 2.3.3 on Win 2000):

Traceback (most recent call last):
  File "C:\Documents and Settings\e176636\Desktop\docs_hem\New Folder\convert.py", line 1, in ?
    import fnmatch, os, pythoncom, sys, win32com.client
ImportError: No module named pythoncom

Simon Brunning (author) 13 years, 6 months ago

You don't have Python for Windows. Sounds like you don't have the Python for Windows extensions. Get these from http://starship.python.net/crew/mhammond/

Evan Sokolski 13 years, 4 months ago

Error when using this code.

Traceback (most recent call last):
  File "C:\Python23\text.py", line 7, in -toplevel-
    wordapp.Documents.Open(doc)
  File "C:\Python23\lib\site-packages\win32com\gen_py\00020905-0000-0000-C000-000000000046x0x8x2\Documents.py", line 79, in Open
    ret = self._oleobj_.InvokeTypes(18, LCID, 1, (13, 0), ((16396, 1), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17)),FileName, ConfirmConversions, ReadOnly, AddToRecentFiles, PasswordDocument, PasswordTemplate, Revert, WritePasswordDocument, WritePasswordTemplate, Format, Encoding, Visible, OpenAndRepair, DocumentDirection, NoEncodingDialog)
com_error: (-2147352567, 'Exception occurred.', (0, 'Microsoft Word', 'This file could not be found.\nTry one or more of the following:\n* Check the spelling of the name of the document.\n* Try a different file name.\r (Doca.doc)', 'C:\Program Files\Microsoft Office\Office10\1033\wdmain10.chm', 24654, -2146823114), None)

I get this error every time I run the code. Please help me.

Remco Boerma 13 years, 2 months ago

Error. It seems MS Word tells you something about the error:

'Microsoft Word', 'This file could not be found.\nTry one or more of the following:\n* Check the spelling of the name of the document.\n* Try a different file name.\r (Doca.doc)'

So you might check that the file 'Doca.doc' exists and that it is not a directory. It could also be a locking problem: if Word or any other program has this document open, Word may well be unable to open it. This also sometimes happens when an Explorer window is open at the same time with the focus on this file (I guess Explorer reads some details using file locking).
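The first two checks suggested above can be sketched in a few lines of Python. This is only an illustration: describe_path is a hypothetical helper name, and it cannot detect the locking problem, only a missing file or a directory.

```python
import os


def describe_path(path):
    """Classify a path before handing it to Word: 'missing', 'directory' or 'file'."""
    full = os.path.abspath(path)
    if not os.path.exists(full):
        return "missing"
    if os.path.isdir(full):
        return "directory"
    return "file"
```

Calling describe_path on the name from the error message, from the directory the script was started in, shows whether that name resolves to a real file there.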

Simon Brunning (author) 13 years, 2 months ago

Fixed! Were you, perhaps, passing in "." as the required root directory? COM doesn't seem to like file names in the form ".\mydoc.doc", even though from the command line you can open a document like this.

I've added a call to os.path.abspath to sort this out. I've also put in a try/finally block to make sure that Word gets closed down properly.
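The path handling described here can be tried without Word installed. The sketch below is an assumption about how one might isolate that step, not the recipe's own code: absolute_doc_paths is a hypothetical name, and it only walks the tree and returns absolute file names.

```python
import fnmatch
import os


def absolute_doc_paths(root, pattern="*.doc"):
    """Yield the absolute path of every file under root whose name matches pattern."""
    for path, dirs, files in os.walk(root):
        for filename in fnmatch.filter(files, pattern):
            # abspath turns ".\mydoc.doc" style names into full paths that COM accepts.
            yield os.path.abspath(os.path.join(path, filename))
```

Passing "." as root then yields full paths rather than names that start with ".\".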

pavel kosina 13 years, 2 months ago

win32com. I had some trouble using this script. I don't want you to solve it (some RPC Server error); I would just like to ask a general question. Where can I find documentation for all these win32 functions? I think it is a matter of the Windows API, so probably somewhere at MS. Could you help? A PDF, downloadable HTML, or CHM would be appreciated. Thank you, Pavel

pavel kosina 13 years, 2 months ago

SOLVED: Replying to myself: to make full power I have to change:

wordapp = win32com.client.Dispatch("Word.Application")

FileFormat=win32com.client.constants.wdFormatText

wordapp.ActiveDocument.Close()

Otherwise there were too many errors on different kinds of docs ... Pavel
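The three changes above could be combined into one helper, roughly as follows. This is a sketch under stated assumptions, not the recipe's code: save_as_text and WD_FORMAT_TEXT are hypothetical names, the literal 2 is assumed to be the value of wdFormatText in Word's WdSaveFormat enumeration (written out because win32com.client.constants may not be populated when plain Dispatch is used), and the helper closes the document object returned by Documents.Open instead of going through ActiveDocument.

```python
import os

# Assumption: 2 is the numeric value of wdFormatText in Word's WdSaveFormat enumeration.
WD_FORMAT_TEXT = 2


def save_as_text(wordapp, doc_path):
    """Open doc_path in the given Word application object and save it as plain text."""
    doc_path = os.path.abspath(doc_path)
    txt_path = os.path.splitext(doc_path)[0] + ".txt"
    document = wordapp.Documents.Open(doc_path)
    try:
        document.SaveAs(txt_path, FileFormat=WD_FORMAT_TEXT)
    finally:
        # Close the document itself, even if saving failed.
        document.Close()
    return txt_path
```

On a Windows machine with Word and the Python for Windows extensions, wordapp would come from win32com.client.Dispatch("Word.Application"), and wordapp.Quit() would still belong in a finally block around the loop over documents.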

John Pywtorak 12 years, 6 months ago

Replying to: "to make full power I have to change". Running PythonWin's COM Makepy utility makes the change unnecessary. I believe this is one of the first steps when working with COM objects, though it may not be absolutely necessary. That is how the type library gets into the gencache.

I believe FileFormat=win32com.client.constants.wdFormatText and the original above are both valid. I also believe wordapp.ActiveDocument.Close() is more appropriate than closing the active window, though I think the same effect is achieved.

John Pywtorak 12 years, 6 months ago

ActiveState Python 2.4.1 and Word 2003, pywin32 build 204 did not work for me. Using COM Makepy utility from PythonWin would fail right away. Not sure why, but you can use it as a test before trying the code.

Once I installed ActiveState 2.3.5, the steps below worked as advertised. In PythonWin, choose Tools->COM Makepy utility and select Microsoft Word 11.0 Object Library (8.3):

>>> Generating to E:\Python23\lib\site-packages\win32com\gen_py\00020905-0000-0000-C000-000000000046x0x8x3.py
>>> import fnmatch, os, pythoncom, sys, win32com.client
>>> wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
>>> wordapp.Documents.Open("c:\\a.doc")
>>> wordapp.ActiveDocument.SaveAs("c:\\a.txt", FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
>>> wordapp.ActiveWindow.Close()
>>> wordapp.Quit()

Mustafa Görmezer 12 years, 4 months ago

A smaller solution. What about this small solution?

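import win32com.client

app = win32com.client.Dispatch('Word.Application')
try:
    doc = app.Documents.Open('c:\\files\\mydocument.doc')
    # Content.Text holds the body of the document as plain text.
    print(doc.Content.Text)
finally:
    # Quit even if opening fails, so no hidden Word process is left running.
    app.Quit()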

More examples for Office and win32com: http://www.win32com.de

Jeff Miller 12 years, 1 month ago

Nice. Nice script -- worked great the first time. Only bug was an error message I got once when executing wordapp.Quit(). But it hasn't happened again.

ccpizza 6 years, 4 months ago

I had excellent results using antiword:

import os
import sys

# Output name: the input name with its last three characters ("doc") replaced by "txt".
os.system('antiword ' + sys.argv[1] + ' > ' + sys.argv[1][:-3] + 'txt')

Antiword does an especially good job with table formatting. And it works perfectly on non-windows systems.
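For file names that contain spaces, the shell redirection above can be replaced by subprocess, which passes the name as a single argument. This is a minimal sketch, assuming antiword is on the PATH and writes the converted text to standard output; antiword_to_txt and its program parameter are hypothetical names introduced for illustration.

```python
import os
import subprocess


def antiword_to_txt(doc_path, program="antiword"):
    """Run program on doc_path and write its standard output next to it as .txt.

    Returns the output path and the exit status of the program.
    """
    txt_path = os.path.splitext(doc_path)[0] + ".txt"
    with open(txt_path, "wb") as out:
        # No shell is involved, so spaces in doc_path need no quoting.
        status = subprocess.call([program, doc_path], stdout=out)
    return txt_path, status
```

A non-zero status means the converter reported a failure, in which case the .txt file may be empty.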