ActiveState Code

Recipe 279003: Converting Word documents to text


Here's a script to save all Word documents in and below a given directory to text.

Python
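import fnmatch, os, pythoncom, sys, win32com.client

# Start Word through COM. EnsureDispatch also generates the makepy support
# that win32com.client.constants relies on.
wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")

try:
    for path, dirs, files in os.walk(sys.argv[1]):
        for filename in fnmatch.filter(files, '*.doc'):
            # COM wants an absolute path; names like ".\mydoc.doc" fail.
            doc = os.path.abspath(os.path.join(path, filename))
            print("processing %s" % doc)
            wordapp.Documents.Open(doc)
            # splitext swaps the extension whatever its case (rstrip would not).
            docastxt = os.path.splitext(doc)[0] + '.txt'
            wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
            wordapp.ActiveWindow.Close()
finally:
    # Always shut Word down, even if a document fails to convert.
    wordapp.Quit()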

Discussion

Requires the Python for Windows extensions, and MS Word.
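
A quick way to check for the extensions before running the recipe is sketched below. It assumes Python 3.4 or later (for `importlib.util.find_spec`), and `missing_modules` is a hypothetical helper name, not something pywin32 provides.

```python
import importlib.util


def missing_modules(names=("pythoncom", "win32com")):
    """Return the top-level modules from `names` that cannot be found."""
    # find_spec returns None when a top-level module is not installed,
    # so nothing has to be imported (or fail to import) to find out.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules()` on a machine without the extensions should return the names that need installing; an empty list suggests the imports at the top of the recipe will succeed. It says nothing about whether MS Word itself is installed.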

Shows how simple COM scripting can be in Python!
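
The file-handling half of the recipe can be tried without Word installed. The sketch below only walks a directory tree and pairs each Word document with the text file it would be saved as. `iter_conversions` is a hypothetical name chosen for illustration, and the lower-casing of file names is an assumption made so that the match behaves the same on every platform.

```python
import fnmatch
import os


def iter_conversions(root):
    """Yield (doc_path, txt_path) pairs for every .doc file in and below root."""
    for path, dirs, files in os.walk(root):
        for filename in sorted(files):
            # Lower-case the name so that FILE.DOC matches on any platform.
            if fnmatch.fnmatch(filename.lower(), "*.doc"):
                # Build an absolute path, since that is what gets handed to COM.
                doc_path = os.path.abspath(os.path.join(path, filename))
                # splitext removes only the final extension, whatever its case.
                txt_path = os.path.splitext(doc_path)[0] + ".txt"
                yield doc_path, txt_path
```

`os.path.splitext` is used rather than string stripping because `str.rstrip` removes a set of trailing characters, not a suffix. Note that the `*.doc` pattern does not match `.docx` files.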

Comments

  1. At 11:19 a.m. on 27 Apr 2004, Prakash Balraj said:

    ImportError: No module named pythoncom. I tried running the .py file and it throws the following error (Python 2.3.3 on Windows 2000):

    > Traceback (most recent call last):
    >   File "C:\Documents and Settings\e176636\Desktop\docs_hem\New Folder\convert.py", line 1, in ?
    >     import fnmatch, os, pythoncom, sys, win32com.client
    > ImportError: No module named pythoncom

  2. At 2:11 a.m. on 28 Apr 2004, Simon Brunning (the author) said:

    You don't have Python for Windows. Sounds like you don't have the Python for Windows extensions. Get these from http://starship.python.net/crew/mhammond/

  3. At 11:23 a.m. on 30 Jun 2004, Evan Sokolski said:

    Error when using this code.

    Traceback (most recent call last):
      File "C:\Python23\text.py", line 7, in -toplevel-
        wordapp.Documents.Open(doc)
      File "C:\Python23\lib\site-packages\win32com\gen_py\00020905-0000-0000-C000-000000000046x0x8x2\Documents.py", line 79, in Open
        ret = self._oleobj_.InvokeTypes(18, LCID, 1, (13, 0), ((16396, 1), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17)),FileName, ConfirmConversions, ReadOnly, AddToRecentFiles, PasswordDocument, PasswordTemplate, Revert, WritePasswordDocument, WritePasswordTemplate, Format, Encoding, Visible, OpenAndRepair, DocumentDirection, NoEncodingDialog)
    com_error: (-2147352567, 'Exception occurred.', (0, 'Microsoft Word', 'This file could not be found.\nTry one or more of the following:\n* Check the spelling of the name of the document.\n* Try a different file name.\r (Doca.doc)', 'C:\Program Files\Microsoft Office\Office10\1033\wdmain10.chm', 24654, -2146823114), None)

    I get this error every time I run the code. Please help me.

  6. At 11:32 a.m. on 30 Jun 2004, Evan Sokolski said:

    ERROR!!!!

    Traceback (most recent call last):
      File "C:\Python23\text.py", line 7, in -toplevel-
        wordapp.Documents.Open(doc)
      File "C:\Python23\lib\site-packages\win32com\gen_py\00020905-0000-0000-C000-000000000046x0x8x2\Documents.py", line 79, in Open
        ret = self._oleobj_.InvokeTypes(18, LCID, 1, (13, 0), ((16396, 1), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17)),FileName, ConfirmConversions, ReadOnly, AddToRecentFiles, PasswordDocument, PasswordTemplate, Revert, WritePasswordDocument, WritePasswordTemplate, Format, Encoding, Visible, OpenAndRepair, DocumentDirection, NoEncodingDialog)
    com_error: (-2147352567, 'Exception occurred.', (0, 'Microsoft Word', 'This file could not be found.\nTry one or more of the following:\n* Check the spelling of the name of the document.\n* Try a different file name.\r (Doca.doc)', 'C:\Program Files\Microsoft Office\Office10\1033\wdmain10.chm', 24654, -2146823114), None)

  7. At 4:17 a.m. on 25 Aug 2004, Remco Boerma said:

    Error. It seems MS Word tells you something about the error:

    'Microsoft Word', 'This file could not be found.\nTry one or more of the following:\n* Check the spelling of the name of the document.\n* Try a different file name.\r (Doca.doc)'
    

    So you might check that the file 'Doca.doc' exists and that it's not a directory. It could also be a locking problem: if Word or any other program has this document open, Word may well be unable to open it. This also sometimes happens when an Explorer window is open at the same time with the focus on this file. [I guess Explorer reads some details using file locking.]

  8. At 5:33 a.m. on 2 Sep 2004, Simon Brunning (the author) said:

    Fixed! Were you, perhaps, passing in "." as the required root directory? COM doesn't seem to like file names in the form ".\mydoc.doc", even though from the command line you can open a document like this.

    I've added a call to os.path.abspath to sort this out. I've also put in a try/finally block to make sure that Word gets closed down properly.

  9. At 6:39 a.m. on 14 Sep 2004, pavel kosina said:

    win32com. I had some trouble using this script. I don't want you to solve it (some RPC server error); I'd just like to ask a general question. Where can I find documentation for all these functions in win32? I think it is a matter of the Windows API, so it is probably somewhere at MS. Could you help? A PDF, downloadable HTML or CHM would be appreciated. Thank you, Pavel

  10. At 11:03 p.m. on 16 Sep 2004, pavel kosina said:

    SOLVED: Replying to myself: to get it working fully I had to change:

    wordapp = win32com.client.Dispatch("Word.Application")
    
    FileFormat=win32com.client.constants.wdFormatText
    
    wordapp.ActiveDocument.Close()
    

    Otherwise there were too many errors on different kinds of docs ... Pavel

  11. At 5:45 p.m. on 5 May 2005, John Pywtorak said:

    Replying to comment 10. Using PythonWin's COM Makepy utility makes the change unnecessary. I believe this is one of the first steps when working with COM objects, though it may not be absolutely necessary. That is how the object gets into the gencache.

    I believe FileFormat=win32com.client.constants.wdFormatText and the original above are both valid. I also believe wordapp.ActiveDocument.Close() is more appropriate than closing the active window, although I think the same effect is achieved.

  12. At 5:50 p.m. on 5 May 2005, John Pywtorak said:

    ActiveState Python 2.4.1 and Word 2003, pywin32 build 204 did not work for me. Using the COM Makepy utility from PythonWin would fail right away. Not sure why, but you can use it as a test before trying the code.

    Once I installed ActiveState 2.3.5, the steps below worked as advertised. In PythonWin:

    Tools->COM Makepy utility

    select Microsoft Word 11.0 Object Library (8.3)

    >>> Generating to E:\Python23\lib\site-packages\win32com\gen_py\00020905-0000-0000-C000-000000000046x0x8x3.py
    >>> import fnmatch, os, pythoncom, sys, win32com.client
    >>> wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
    >>> wordapp.Documents.Open("c:\\a.doc")
    >>> wordapp.ActiveDocument.SaveAs("c:\\a.txt", FileFormat=win32com.client.constants.wdFormatTextLineBreaks)
    >>> wordapp.ActiveWindow.Close()
    >>> wordapp.Quit()

  13. At 12:36 a.m. on 13 Jul 2005, Mustafa Görmezer said:

    A smaller solution. What about this small solution?

    import win32com.client
    
    app = win32com.client.Dispatch('Word.Application')
    doc = app.Documents.Open('c:\\files\\mydocument.doc')
    print(doc.Content.Text)
    app.Quit()
    

    More examples for Office and win32com: http://www.win32com.de

  14. At 5:56 p.m. on 3 Oct 2005, Jeff Miller said:

    Nice. Nice script -- worked great the first time. Only bug was an error message I got once when executing wordapp.Quit(). But it hasn't happened again.
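
Several comments above turn on which FileFormat constant to pass and whether `win32com.client.constants` is available. One way to sidestep the question, sketched here rather than offered as the author's method, is to pass the numeric values from Word's WdSaveFormat enumeration directly (wdFormatText is 2, wdFormatTextLineBreaks is 3). `save_as_text` is a hypothetical helper, and `document` is assumed to be a Word Document COM object such as the one returned by `wordapp.Documents.Open(...)`.

```python
# Values from Word's WdSaveFormat enumeration.
WD_FORMAT_TEXT = 2
WD_FORMAT_TEXT_LINE_BREAKS = 3
# Value from Word's WdSaveOptions enumeration.
WD_DO_NOT_SAVE_CHANGES = 0


def save_as_text(document, txt_path, line_breaks=True):
    """Save an open Word document as plain text, then close it."""
    file_format = WD_FORMAT_TEXT_LINE_BREAKS if line_breaks else WD_FORMAT_TEXT
    document.SaveAs(txt_path, FileFormat=file_format)
    # Close the document itself rather than the active window, and tell
    # Word not to prompt about saving changes again.
    document.Close(WD_DO_NOT_SAVE_CHANGES)
    return file_format
```

Because the helper only calls `SaveAs` and `Close` on whatever it is given, it can be exercised with a stand-in object on a machine without Word, which makes the surrounding loop easier to test.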
