Re: converting PDF and retaining paragraph format

Flor Lynch

So it appears. But when you replace the zzz (or a similarly exotic string of characters) that was used to replace the ^n line breaks, which are not distinguished in the original PDF from the ^l or soft line breaks, in the manner I outlined, you will get an approximation of the original paragraph formatting. Remember also that the original formatting HAS BEEN LOST when the PDF was converted into another file format. There are some PDF-to-HTML converters around, but they end up adding extra lines to the HTML output. Adobe Reader has its downsides, too, and loss of formatting is one. I don’t know if Adobe Professional does a better job; perhaps it does. OCR on a PDF image might also do a good job. But none of them (so far as I know) retains the original formatting perfectly.

From: Gene
Sent: Sunday, September 06, 2015 7:19 PM
Subject: Re: [TechTalk] converting PDF and retaining paragraph format
But it appears that your procedure deletes all hard returns. That is not what is wanted. Paragraph divisions are still wanted. The problem is that every soft return is made into a hard return. The point is not to delete all hard returns. The point is to have soft returns except where paragraphs begin.
If you convert any document from a word processor or html format to a txt format, all soft returns will be converted into hard returns.  There is no soft return in the txt character set, if that is the correct term.  But if you convert the file to a word processor format and the same thing occurs, that is, all soft returns are converted to hard returns, then the subject should be discussed further.  If a word processor format retains soft returns, that should solve the problem. 
----- Original Message -----
From: Flor Lynch
Sent: Sunday, September 06, 2015 1:02 PM
Subject: Re: [TechTalk] converting PDF and retaining paragraph format
One might also want to restore hard returns after right parentheses, apostrophes, and some other punctuation marks that come just before them, by replacing: )zzz with )^n; ’zzz with ’^n; !zzz with !^n; ?zzz with ?^n (in step 5 of the scheme below).
From: Flor Lynch
Sent: Sunday, September 06, 2015 5:51 PM
Subject: Re: [TechTalk] converting PDF and retaining paragraph format
Hi Dean,
This is what I did, and it seemed to work:
1. Converted the pdf to text.
2. Select All in the text file, copy to clipboard.
3. Open a blank document in Microsoft Word. 
3A: Paste the clipboard contents into the Word document.
4. Now we’re going to do Find and Replace (use the More... button to reach the Special Characters list). Find: ^n . Replace with: zzz . REPLACE ALL button, and ENTER.
5. Find: .zzz . Replace With: .^n . Replace All.
5A: (If necessary, also Find and Replace All . zzz with . ^n .)
6. Now, Find and Replace All zzz with nothing, leaving the replacement field empty.
7. You may require more finding and replacing; use it judiciously.
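
For anyone who would rather script steps 4 to 6 than do them in Word, here is a rough Python sketch of the same placeholder idea. The function name `rejoin_paragraphs` and its parameters `enders` and `marker` are made up for illustration. It also departs from step 6 in one respect: leftover markers become a single space rather than nothing, on the assumption that the converted text has no space at the end of each line. Like the manual procedure, it only approximates the paragraphs: any line that happens to end in a full stop is treated as a paragraph end, and the marker must not already occur in the text.

```python
import re


def rejoin_paragraphs(text, enders=".", marker="zzz"):
    """Approximate paragraph breaks in text where every line ends in a hard return.

    enders: punctuation characters assumed to end a paragraph when they
            are the last thing on a line.
    marker: placeholder string assumed not to occur anywhere in the text.
    """
    # Step 4: turn every line break into the placeholder.
    marked = text.replace("\r\n", "\n").replace("\n", marker)

    # Steps 5 and 5A: put a break back where the placeholder follows
    # closing punctuation, with or without a space in between.
    pattern = "([" + re.escape(enders) + "]) *" + re.escape(marker)
    marked = re.sub(pattern, r"\1\n", marked)

    # Step 6 (adapted): join the remaining lines with one space.
    return re.sub(" *" + re.escape(marker) + " *", " ", marked)


if __name__ == "__main__":
    sample = (
        "This is a line\n"
        "that wraps. Still the same\n"
        "paragraph ends here.\n"
        "New paragraph starts\n"
        "and ends.\n"
    )
    print(rejoin_paragraphs(sample))
```

Passing something like `enders=".)!?"` treats other closing punctuation as a paragraph end as well.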
HTH, and Good Luck!
Sent: Sunday, September 06, 2015 3:55 PM
Subject: [TechTalk] converting PDF and retaining paragraph format

I like to convert PDF documents to other formats, usually text but could be any of several others.  When I use the conversion routine built into Adobe Reader, all the line breaks are converted into hard line breaks, making it hard to find the actual paragraph boundaries.  Does anybody have a solution to this?





