
PDF To Word: Extremely Important Conversion Tips

by Bo Bennett, PhD, Founder of eBookIt
posted Tuesday Jun 28, 2016 12:27 PM

About Bo Bennett, PhD

I started eBookIt.com back in 2010, because, as an author, I was frustrated with the lack of options for e-publishing. We have helped thousands of clients publish their books over the years, and we are looking forward to helping thousands more.

There is a reason business is booming for us and has been for years: automated ebook conversions suck big time. Okay, maybe "big time" is pushing it a bit too far. After all, the automation does still save countless hours of manual labor. The problem is that until PDF to Word converters adopt fairly good artificial intelligence, a human being will be needed to review the entire document and make the corrections a computer cannot. In this article, I will show you what you need to know when converting your PDF file to a Word document (.doc or .docx).

This article assumes that you already have a Word document that was created from a .pdf file. If you are not sure, read on!

PDF to Word: Relatively Painless Experience... or Nightmare

Perhaps you convert PDF files to Word documents on occasion and have never had much of a problem. If this is the case, I can virtually guarantee that the PDF files you are working with were made from editable document files (such as Word) with very few advanced layout features (e.g., callouts or wrapped images) and not from scanned images. When you save a Word doc as a PDF file, far less information is lost, so converting that PDF back to a Word document will still produce some issues, but none that are too difficult to address. That is the relatively painless experience.

But creating a PDF from a scanned book is like taking a photograph of each page. The software sees the page as an image, not text. To turn the image into text, OCR (optical character recognition) software must be run on it. Assuming a clean scan of the pages, even the best OCR software at 99.9% accuracy will screw up 1 out of 1,000 words. In a 100,000-word book, this means you will have 100 messed-up words! Not very professional, and quite a nightmare.
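To put numbers on that, here is a minimal Python sketch of the arithmetic. The function name is made up for illustration, and it assumes the accuracy figure is measured per word, as in the example above.

```python
def expected_misread_words(word_count: int, accuracy: float) -> float:
    """Estimate how many words OCR gets wrong at a given per-word accuracy."""
    return word_count * (1.0 - accuracy)


# The example from the text: a 100,000-word book at 99.9% accuracy.
errors = expected_misread_words(100_000, 0.999)
print(round(errors))  # 100
```

Plug in your own word count and the accuracy your OCR software claims to get a rough idea of how much proofing lies ahead.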

[Image: PDF to Word conversion example]

Why Machines Fail and Humans Are Needed

At the time of this writing, the OCR software used to convert scans into text does not contain enough AI (artificial intelligence) to have a good contextual understanding of words. Therefore, if a "w" in the image looks like "iv" to the software, it will be interpreted as "iv" even when the context makes the right reading obvious, as in "We ivill succeed and we will prosper!" This is not a real brain-buster for humans, not even for an 8-year-old. Yet machines struggle and usually fail. Fortunately, this is an error that any decent spell checker would pick up, since "ivill" is not a recognized word. But many errors produce recognized words, or they occur in names that the spell checker ignores.
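Here is a rough Python sketch of why a spell checker catches some OCR errors and misses others. The tiny word list and the function name are hypothetical, and the "modem" for "modern" mix-up is my own illustration of an error that yields a real word; it is not how any particular spell checker works.

```python
import re

# Hypothetical tiny word list, for illustration only.
# A real spell checker would use a full dictionary.
DICTIONARY = {"we", "will", "succeed", "and", "prosper", "the", "modem", "modern", "world"}


def unknown_words(text: str) -> list[str]:
    """Return the words in text that are not in the dictionary."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() not in DICTIONARY]


# "ivill" is not a word, so it gets flagged.
print(unknown_words("We ivill succeed and we will prosper!"))  # ['ivill']

# "modem" (a misread of "modern") is a real word, so nothing is flagged.
print(unknown_words("The modem world"))  # []
```

The second case is the one that only a human reading every word will catch.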

[Image: Captcha showing how PDF to DOC converters fail]

Another reason machines fail is poor-quality scans or images, small text, unorthodox fonts, and a generally limited library of knowledge on how to recognize letters. This is where the human mind excels. This failure on the machine's part is the reason that anti-spam form software (often referred to as "Captcha") works so well. It is (usually) easy for the human eye to detect the characters but virtually impossible for machines.

Proofing Your PDF to Word Conversion

Now that you have your Word document that was created from a PDF, here is what you need to do in addition to the standard formatting that you would otherwise do for a Word document before converting it to an ebook. Let me stress that you should read every word in the document to ensure it is correct. If you were scanning hundreds of books for free public access, this level of proofing would clearly be overkill, but if this is your book that you are selling online (i.e., people are paying money for it), you owe it to your readers to ensure they are buying an error-free (or virtually error-free) book.

  • Look for incorrect words. Often OCR and even the standard PDF to Word conversion algorithms will misinterpret two letters close to each other that look like another letter. For example, "Li" can be seen as "U". Once you find one of these errors, it might be worth it to do a global search and replace. So you might want to replace all instances of "Ught" with "Light" (since "Ught" is not a word).
  • Fix line breaks. PDF to Word converters are notorious for not knowing where line breaks are supposed to go, and putting them in places where they don't belong. One of the best ways to detect these line breaks is by turning on the "show invisibles" option, or changing the font size.
  • Fix hyphenated words. If a word was hyphenated because it was split across two lines, the PDF to Word software generally does not know whether the hyphen needs to be there or not, so it keeps it. So a word like "insti-tution" might appear on one line, which is not something you want.
  • Fix multiple spaces. You will find words separated by multiple spaces all throughout the document. To get rid of these, use find and replace. Start with finding 20 spaces and replacing with one space, then 19, then 18, and so on.
  • Missing formatting. OCR often misses bold and italic formatting, as well as mixed upper and lower case.
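If you are comfortable with a little scripting, several of these fixes can be roughed out in Python on the plain text of your document. This is only a sketch: the function names and the mini word list are made up for illustration, the space fix uses a single pattern instead of the 20-19-18 countdown, and you still need to read every word afterward.

```python
import re

# Hypothetical mini word list; a real pass would load a full dictionary.
KNOWN_WORDS = {"institution", "light"}


def fix_known_misread(text: str, wrong: str, right: str) -> str:
    """Global search and replace that matches whole words only."""
    return re.sub(rf"\b{re.escape(wrong)}\b", right, text)


def collapse_spaces(text: str) -> str:
    """Replace any run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def join_split_words(text: str) -> str:
    """Remove a hyphen only when the joined result is a known word."""
    def repl(match: re.Match) -> str:
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in KNOWN_WORDS else match.group(0)

    return re.sub(r"\b([A-Za-z]+)-([a-z]+)\b", repl, text)


sample = "The  Ught   from the insti-tution was well-known."
cleaned = join_split_words(collapse_spaces(fix_known_misread(sample, "Ught", "Light")))
print(cleaned)  # The Light from the institution was well-known.
```

Note that "well-known" keeps its hyphen because "wellknown" is not in the word list. That check is what stops the script from gluing together words that are supposed to be hyphenated.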

Go Nuclear

If the document is a real mess, we often use what we call the "nuclear" option to remove all the formatting. We call it this because it's like nuking a city and starting over from scratch. What you will have is a plain text document with all of the words and none of the formatting (you still need to fix the errors with the incorrect words). Here is the process:

  1. Open up your Word document, choose "Select All" from the "Edit" menu, and copy the selection.
  2. Open up a plain text file using Notepad, TextEdit, or other plain text editor.
  3. Paste all into the plain text editor.
  4. If you clearly have many line breaks where they should not be, do a global search and replace for all line breaks and replace them with a space. Depending on your OS and text editor, the way to do this will vary (google it!).
  5. Reconstruct your document using the physical book or PDF scanned source as a visual guide.
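For step 4, here is one possible Python sketch. It is a slight variation on the step as written: it assumes real paragraph breaks show up as blank lines and leaves those alone, replacing only the single line breaks inside a paragraph. The function name is made up for illustration.

```python
import re


def unwrap_lines(text: str) -> str:
    """Replace stray single line breaks with spaces, keeping blank-line paragraph breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A lone newline (no newline directly before or after it) becomes a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Clean up any double spaces the replacement created.
    return re.sub(r" {2,}", " ", text)


sample = "It was a dark\nand stormy night.\n\nThe rain fell\r\nin torrents."
print(unwrap_lines(sample))
```

If your converted document has no blank lines between paragraphs, this will not help you tell paragraphs apart, and you are back to using the book or the scanned PDF as your guide.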

PDF to Word conversions do not have to be a nightmare, even when the source is a scan. They do take time, however. If you are willing to put in the time, you can have a wonderful-looking, working document ready to be converted to an ebook. If you're not willing to put in the time or deal with the many issues that can arise from a PDF to Word conversion and would rather pay someone to deal with this, well, that is why we're in business!

 Copyright 2017, Archieboy Holdings, LLC. 