Markdown for Microsoft Word

August 17, 2019

If you prefer text-based documentation (markdown, etc.) over WYSIWYG (MS Word), then this is for you! If you work in modern corporate America, Microsoft Word is unavoidable. Even though I wish it weren’t so, most engineering organizations use it for technical documentation.

I want to use markdown as the source and then automatically render Microsoft Word. Most folks online recommend pandoc for this kind of thing. The problem is that pandoc renders docx using its own style.

But there’s usually a standard corporate template that you’re required to use.

At this point I usually stare blankly at my screen for a couple of minutes, then begrudgingly load up the corporate template in Word and start copy-n-pasting and manually formatting. But… no… longer.

I came across a little hack that works pretty well. If you insert unstyled HTML into a Word template using Insert->Object->Text from File, Word will automatically apply the template styles.

So I generate HTML from markdown (pandoc -s my.md -o my.html), insert the HTML using the Text from File menu item in Word, and voilà! I have a perfectly formatted Word docx in the corporate template.
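If you would rather drive the pandoc step from Python too, here is a minimal sketch. The function names pandoc_html_command and render_html are my own, not part of pandoc, and it assumes pandoc is on your PATH.

```python
import subprocess


def pandoc_html_command(md_path, html_path):
    """Build the pandoc command line that renders standalone HTML."""
    # -s asks pandoc for a standalone document (a full HTML page),
    # -o names the output file; pandoc infers the format from ".html".
    return ["pandoc", "-s", md_path, "-o", html_path]


def render_html(md_path, html_path):
    """Run pandoc; raises CalledProcessError if pandoc exits non-zero."""
    subprocess.run(pandoc_html_command(md_path, html_path), check=True)
```

Calling render_html("my.md", "my.html") should be equivalent to typing the command above by hand.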

However, it’s still a hassle to have to perform these manual menu selections every time I want to re-render the docx. So here’s a little Python script to automate that part using COM:

import sys
import os
import argparse

import win32com.client as win32
from win32com.client import constants as c


def main(argv=None):
    if argv is None:
        argv = sys.argv

    parser = argparse.ArgumentParser(description="convert html to docx")
    parser.add_argument("htmlfile", help="the html file path for input")
    parser.add_argument("docxfile", help="the docx file path for output")
    parser.add_argument("-t", "--template", default="template.dotx", help="the template file")
    args = parser.parse_args(argv[1:])

    template_path = os.path.realpath(args.template)
    html_path = os.path.realpath(args.htmlfile)
    docx_path = os.path.realpath(args.docxfile)

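    # EnsureDispatch generates the type library wrappers that populate `constants`.
    word = win32.gencache.EnsureDispatch('Word.Application')
    try:
        doc = word.Documents.Open(template_path)
        word.Visible = True
        # Word collections are 1-based. This assumes the template has a second
        # section that receives the body; adjust the index for your template.
        section = doc.Sections(2)
        # Same operation as Insert->Object->Text from File in the UI.
        section.Range.InsertFile(html_path)
        doc.TablesOfContents(1).Update()
        # Break the link on each inserted image so it is embedded in the docx.
        for i in range(1, doc.InlineShapes.Count + 1):
            doc.InlineShapes(i).LinkFormat.BreakLink()
        doc.SaveAs2(docx_path, FileFormat=c.wdFormatXMLDocument)
    finally:
        word.Quit()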


if __name__ == '__main__':
    sys.exit(main())

It also regenerates the table of contents and embeds the images after the text is inserted. I hope this helps any other text-based documentation lovers out there.
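One caveat with the image loop: it calls BreakLink on every inline shape, which I would expect to fail if the template already contains an embedded picture (a logo, say) that has no link. Here is a slightly more defensive sketch, not what the script above does. The helper name embed_linked_images is hypothetical, and the constant is the value of wdInlineShapeLinkedPicture as I understand Word's WdInlineShapeType enumeration.

```python
# Assumed value of wdInlineShapeLinkedPicture in Word's WdInlineShapeType enum.
WD_INLINE_SHAPE_LINKED_PICTURE = 4


def embed_linked_images(doc):
    """Break the link on each linked inline picture; return how many were embedded."""
    embedded = 0
    # Word collections are 1-based, hence range(1, Count + 1).
    for i in range(1, doc.InlineShapes.Count + 1):
        shape = doc.InlineShapes(i)
        # Only linked pictures have a link to break; skip everything else.
        if shape.Type == WD_INLINE_SHAPE_LINKED_PICTURE:
            shape.LinkFormat.BreakLink()
            embedded += 1
    return embedded
```

In the script this would replace the for loop, called as embed_linked_images(doc) just before SaveAs2.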

2019/9/13 Update: We’re now embedding all the images instead of leaving them as links.

2020/6/15 Update: It turns out that this method doesn’t work in the general case. It happens to work with my template because the template fonts are close enough to the default Word fonts that they match.