singledigitalpresence.vic.gov.au

Tips on converting Word documents to HTML

If you follow these steps to clean up your source document before using an online HTML converter, you will end up with much cleaner HTML.

This procedure was written to assist content editors who have to convert a report provided in Word into HTML.

First, clean up the Word document as much as possible.

1. Remove the automated table of contents

Word documents created with an automatic table of contents will have anchor links on all headings. On the published page these headings will behave like links (underline on hover, clickable).

Removing these links afterwards is a fiddly manual process, so it is easier to remove the automated table of contents from the Word document before you start working on the HTML conversion.

2. Delete extra spaces and breaks

There are a few things you can quickly and easily clean up in Word using Find and replace:

  • section breaks
  • line breaks
  • extra paragraph breaks
  • extra spaces

Use find and replace to remove extra paragraph breaks

  1. Press Ctrl + H to open the Find and replace popup.
  2. Click the cursor into the 'Find' field.
  3. Click the More button and then the Special button at the bottom.
  4. Click on paragraph mark. Click again so there are two (which looks like ^p^p).
  5. Click the cursor into the 'Replace' field.
  6. Add a single paragraph mark here (looks like ^p).
  7. MAKE SURE there are no other characters in the Find and Replace fields, not even a space. Each field should contain only the paragraph marks!
  8. Replace all. Do this a few times until there are no more instances.

Section breaks

Replace section breaks with a paragraph mark (replace ^b with ^p).

Section breaks are found in the Special dropdown on the Find and replace window.

Line breaks

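Replace manual line breaks (^l) with a paragraph mark (^p) or a space, depending on whether the break separates paragraphs or only wraps a line. Manual line breaks are found in the Special dropdown on the Find and replace window.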

Extra spaces

Replace 2 spaces with 1 space (press the space bar twice in the Find field and once in the Replace field). Keep clicking Replace all until there are none left.
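The reason for clicking Replace all more than once can be sketched in a few lines of Python. This is only a plain-text stand-in for what Word does, not Word itself: the function name `collapse_repeats` is hypothetical, and "\n" stands in for Word's paragraph mark (^p).

```python
def collapse_repeats(text: str) -> str:
    """Mimic clicking 'Replace all' until there is nothing left to replace."""
    # A single pass of "two spaces -> one space" still leaves a pair behind
    # wherever there were three or more spaces in a row, so repeat the
    # replacement until no pair remains.
    while "  " in text:
        text = text.replace("  ", " ")
    # Same idea for doubled paragraph breaks (^p^p -> ^p in Word,
    # "\n\n" -> "\n" in this stand-in).
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    return text


sample = "Annual    report\n\n\n\nSection  one"
print(repr(collapse_repeats(sample)))
```

Each pass roughly halves a long run of spaces or breaks, which is why a few clicks are usually enough.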

3. Remove extraneous bold formatting

If headings are formatted with bold, use Find and replace to replace bold formatting (Ctrl + B in the Find field) with the Heading 2 style (via the Format dropdown).

Make sure the Find and Replace fields have nothing in them before you click Replace on the formatting changes! (There might be a space in there that you can't see.)

Important: don't choose 'Replace all' for this, as there are usually stray instances of bold used for emphasis in the document that we may want to keep.

Go from top to bottom of the document twice to make sure you didn't miss any.

We do not use italics for book, report or legislation titles online. If italics is used for this purpose, change to no italics but keep title case.

If italics is used for emphasis, change it to bold.

There should be no underlining applied. Online, underline implies a clickable hyperlink. 

4. Fix or apply heading formatting

On the web, page headings are always Heading 1, so your publication content should start with Heading 2 and cascade from there. (H2 can be followed by another H2 or an H3. You can't skip a level.) 

Correct heading formatting is important for accessibility. 
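As a rough illustration of the "start at H2, never skip a level" rule, the following Python sketch scans converted HTML for headings that break it. The function name `find_skipped_headings` is hypothetical, and the simple pattern matching assumes ordinary `<h2>`-style tags as produced by a converter.

```python
import re


def find_skipped_headings(html: str) -> list[str]:
    """Return one message for every heading that breaks the level rule."""
    problems = []
    previous = 1  # the page title is the H1, so content starts below it
    for match in re.finditer(r"<h([1-6])\b", html, flags=re.IGNORECASE):
        level = int(match.group(1))
        if level == 1:
            problems.append("h1 found: content headings should start at h2")
        elif level > previous + 1:
            problems.append(f"h{level} follows h{previous}: skipped a level")
        previous = level
    return problems


print(find_skipped_headings("<h2>Overview</h2><h4>Detail</h4><h3>Scope</h3>"))
```

Moving back up (for example from H4 to H2) is fine; only a jump down by more than one level is reported.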

Use the inbuilt Word heading styles. These are in the Styles section of the toolbar when the Home tab is active.

You can click on the small arrow in the bottom right of the Styles section to display all the styles in a pane on the side of your screen. If it's not displaying Heading 3 and below, you can change the settings. At the bottom of this pane, click the Options button and open the 'Select styles to show' dropdown. Choose All styles and click OK.

Using find and replace to fix heading levels

If your document uses the wrong heading levels, you can use Find and replace to quickly fix them.

If your document uses Headings 1 to 3, you need to change them to Headings 2 to 4.

You need to start with the lowest heading level in the document. 

  1. Press Ctrl + H to open the Find and replace popup.
  2. Click the cursor into the Find field.
  3. Click the More button to reveal more options.
  4. Click the Format dropdown to see options.
  5. Click on Style. A find style popup will appear. Type H to jump to the section on the list where headings are listed.
  6. Click on Heading 3 (the lowest heading level in this example).
  7. Click the cursor into the 'Replace' field.
  8. Repeat steps 4 and 5 and choose Heading 4.
  9. Click Replace all. Do it a couple of times to make sure you caught them all. (Find and replace works from your cursor position down to the bottom of the document, so you may need to start again from the top of the document.)
  10. Repeat for each remaining heading level, working upwards (Heading 2 to Heading 3, then Heading 1 to Heading 2).
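Why the order matters can be shown with a small Python sketch that does the same shift on HTML headings. It is an illustration only: the function name `shift_headings_down` and its `lowest` parameter are hypothetical, and the example assumes a document that uses Headings 1 to 3.

```python
import re


def shift_headings_down(html: str, lowest: int = 3) -> str:
    """Demote every heading by one level, starting with the lowest level."""
    # Work from the lowest level up: h3 -> h4, then h2 -> h3, then h1 -> h2.
    # Going the other way would turn h1 into h2 first, and the next pass
    # would then sweep those new h2s into h3 along with the original h2s.
    for level in range(lowest, 0, -1):
        html = re.sub(
            rf"<(/?)h{level}\b",      # opening or closing tag of this level
            rf"<\1h{level + 1}",      # same tag, one level lower
            html,
            flags=re.IGNORECASE,
        )
    return html


print(shift_headings_down("<h1>Report</h1><h2>Part</h2><h3>Detail</h3>"))
```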

5. Convert your cleaned-up Word content into HTML

Now your document is clean! You can use an online HTML conversion tool to convert the Word-formatted content into clean-ish HTML.

Press Ctrl + A to select all the content and Ctrl + C to copy it. Go to the tool and press Ctrl + V to paste it.

Once the content has been converted, select all and copy it so you can paste it into the CMS.

6. Clean up the HTML code

The above process is great for getting a Word doc with lots of formatting into HTML but the resulting code will still probably have:

  • empty span tags
  • language span tags
  • strong tags (bold) on the headings, if any bold heading formatting was missed in Word

Links to websites are OK.

Sometimes there are links to documents, so search for "<a " to find them. They will look OK on the live page, but the links most likely won't work. If the documents are needed, you'll have to download them manually and add them in the CMS.

Copy the source code into Notepad++ or similar and use find and replace to clean out these extraneous tags.
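If you are comfortable with a script, the search for document links can be sketched in Python using the standard library's HTML parser. The names `LinkCollector` and `find_document_links` are hypothetical, and the list of file extensions is an assumption about what counts as a document; adjust it to suit your content.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def find_document_links(html, extensions=(".doc", ".docx", ".pdf", ".xls", ".xlsx")):
    """Return hrefs that point at a file type in `extensions` (an assumed list)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    # Ignore any query string when checking the file extension.
    return [h for h in collector.hrefs if h.lower().split("?")[0].endswith(extensions)]


page = '<a href="https://www.vic.gov.au">Site</a> <a href="files/report.PDF">Report</a>'
print(find_document_links(page))
```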
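The same clean-up can be sketched with regular expressions in Python, and similar patterns should work in a text editor's regex find and replace. This is a minimal sketch, not a general HTML cleaner: the function name `clean_converted_html` is hypothetical, and it assumes spans are not nested and that language spans carry only a `lang` attribute.

```python
import re


def clean_converted_html(html: str) -> str:
    """Strip the extraneous tags a Word-to-HTML conversion tends to leave."""
    flags = re.IGNORECASE | re.DOTALL
    # 1. Drop spans with nothing inside them.
    html = re.sub(r"<span\b[^>]*></span>", "", html, flags=flags)
    # 2. Unwrap language spans, keeping their text (assumes no nested spans).
    html = re.sub(r'<span\s+lang="[^"]*"\s*>(.*?)</span>', r"\1", html, flags=flags)
    # 3. Remove strong tags that wrap the whole text of a heading.
    html = re.sub(
        r"(<h[1-6]\b[^>]*>)\s*<strong>(.*?)</strong>\s*(</h[1-6]>)",
        r"\1\2\3",
        html,
        flags=flags,
    )
    return html


dirty = '<h2><strong>Overview</strong></h2><p><span lang="EN-AU">Hello</span><span></span> world</p>'
print(clean_converted_html(dirty))
```

Check the result against the source document afterwards, since a pattern that is too broad can remove formatting you meant to keep.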

Check and fix tables HTML

If a table has a caption, it should come straight after the <table> tag, surrounded by code like this: <caption>your text</caption>.

Sometimes there's no <thead> section and it's all <tbody>. You can change this in the WYSIWYG editor in the CMS: right-click anywhere in the table to open the Table properties popup and set the header there. Usually your table header is the first row. This applies the table heading formatting so the table displays well, but it is also important for accessibility.

And sometimes there is a <thead> section but the cells are <td> instead of <th>. Change these cells to <th>.
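The three table checks can be sketched as a small Python script built on the standard library's HTML parser. The names `TableChecker` and `check_table` are hypothetical, and the sketch assumes it is given one table at a time with no tables nested inside it.

```python
from html.parser import HTMLParser


class TableChecker(HTMLParser):
    """Flag the table problems described above (one un-nested table)."""

    def __init__(self):
        super().__init__()
        self.problems = []
        self.previous_tag = None   # the most recent opening tag seen
        self.in_thead = False
        self.has_thead = False

    def handle_starttag(self, tag, attrs):
        if tag == "caption" and self.previous_tag != "table":
            self.problems.append("caption is not straight after <table>")
        if tag == "thead":
            self.in_thead = True
            self.has_thead = True
        if tag == "td" and self.in_thead:
            self.problems.append("<td> inside <thead>: use <th>")
        self.previous_tag = tag

    def handle_endtag(self, tag):
        if tag == "thead":
            self.in_thead = False
        if tag == "table" and not self.has_thead:
            self.problems.append("table has no <thead>")


def check_table(html):
    checker = TableChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


print(check_table("<table><tbody><tr><td>Year</td></tr></tbody></table>"))
```

An empty list means none of the three problems was found; it does not confirm the table is otherwise accessible.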

7. Paste your HTML into the CMS and preview

Go to your CMS page. Add or open a basic text component.

Click on the Source button.

Paste in the HTML you converted and cleaned up.

Click the Source button again to toggle back to WYSIWYG view.

Save.

Click on the preview link and carefully cross-check your source document against the previewed page to check that:

  • headings are correctly applied
  • list formatting is correct
  • other special formatting such as callouts and tables look right
  • hyperlinks are correctly applied and working

 

Reviewed 18 August 2021

Vic Gov digital guide
