From Static Vectors to Editable Text
Published by PDFteq Engineering Team
A PDF file is like a printed sheet of paper: it knows where each piece of text sits on the page, but not what that text means. It has no concept of a "paragraph". Converting it to Word therefore requires layout reconstruction.
The Challenge: "Broken Lines"
If you open the output of a cheap PDF converter, you often find a "hard return" (Enter key) at the end of every line. The text no longer reflows when you edit it, so even a small change means fixing line breaks by hand.
The Solution: Sigma-Reflow Algorithm
Our engine measures the vertical distance ($D$) between each pair of consecutive lines. If $D$ is smaller than 1.2 times the font height, we assume the second line belongs to the same paragraph as the first and merge the two. Otherwise, we start a new paragraph.
IF (Line_Gap < Font_Height * 1.2) THEN
    Merge()  // Same Paragraph
ELSE
    Break()  // New Paragraph
END IF
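
The rule above can be sketched in Python as follows. This is a minimal illustration, not the engine's actual implementation. The Line class, its fields, the reflow function, and the choice to measure the gap from the bottom of one line to the top of the next are all assumptions made for the example; only the 1.2 factor comes from the pseudocode.

```python
from dataclasses import dataclass

# Threshold taken from the pseudocode; everything else here is hypothetical.
GAP_FACTOR = 1.2


@dataclass
class Line:
    text: str
    top: float          # y of the line's top edge, increasing down the page (assumed)
    bottom: float       # y of the line's bottom edge, same units as top
    font_height: float  # font height, same units as top and bottom


def reflow(lines: list[Line]) -> list[str]:
    """Group lines (already sorted top to bottom) into paragraph strings."""
    paragraphs: list[list[str]] = []
    previous = None
    for line in lines:
        if previous is None:
            same_paragraph = False  # the first line always opens a paragraph
        else:
            gap = line.top - previous.bottom
            same_paragraph = gap < previous.font_height * GAP_FACTOR
        if same_paragraph:
            paragraphs[-1].append(line.text.strip())  # Merge()
        else:
            paragraphs.append([line.text.strip()])    # Break()
        previous = line
    # Lines within a paragraph are joined by a space instead of a hard return.
    return [" ".join(parts) for parts in paragraphs]


sample = [
    Line("A PDF stores glyphs at", top=100, bottom=112, font_height=12),
    Line("fixed positions.", top=114, bottom=126, font_height=12),
    Line("A new paragraph starts here.", top=150, bottom=162, font_height=12),
]
print(reflow(sample))
```

In the sample, the gap between the first two lines is 2 units, which is below the 14.4-unit threshold (12 * 1.2), so they merge. The 24-unit gap before the third line exceeds the threshold and starts a new paragraph.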