⚠️ Data Mining: PDF is a display format, not a data format. Extracting text requires decoding the content stream, which contains drawing instructions rather than plain strings.
⚙️ The Parsing Algorithm
1. Decompressing Streams
Most PDF streams are compressed with FlateDecode. Our engine first inflates these streams to reveal the raw PDF operators.
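As a rough illustration of the inflate step, FlateDecode data is a zlib stream, so Python's standard `zlib` module can decompress it. This is a minimal sketch, not the engine's actual code: the helper name `inflate_stream` is hypothetical, and it assumes the stream bytes have already been located in the file and that no predictor (`/DecodeParms`) is applied.

```python
import zlib


def inflate_stream(raw: bytes) -> bytes:
    """Inflate the body of a stream whose /Filter is /FlateDecode.

    Assumption: `raw` holds exactly the bytes between the `stream` and
    `endstream` keywords, and no /DecodeParms predictor is in use.
    """
    return zlib.decompress(raw)


# Round-trip demo with a tiny hand-written content stream.
operators = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(operators)
print(inflate_stream(compressed).decode("ascii"))
```

The inflated bytes are the operators themselves (`BT`, `Tf`, `Td`, `Tj`, `ET`), which is what the later steps work on.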
2. Glyph to Unicode Mapping
PDF text strings hold font-specific character codes (often glyph IDs), not Unicode characters. To get readable text, we look up the ToUnicode CMap embedded in the file, which translates, for example, code 45 into the letter "A".
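The lookup can be sketched as follows. This is a simplified, hypothetical parser rather than the engine's implementation: it reads only `beginbfchar` sections of a ToUnicode CMap, ignores `beginbfrange` sections, and assumes two-byte character codes. The names `parse_bfchar` and `decode_codes` and the sample CMap are made up for illustration.

```python
import re

# Hypothetical ToUnicode fragment: code 0x2D (decimal 45) maps to "A".
SAMPLE_CMAP = """
2 beginbfchar
<002D> <0041>
<002E> <0042>
endbfchar
"""


def parse_bfchar(cmap: str) -> dict[int, str]:
    """Build a code -> text table from the bfchar sections only."""
    mapping: dict[int, str] = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE encoded hex strings.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes: bytes, mapping: dict[int, str], width: int = 2) -> str:
    """Translate a string of fixed-width character codes into text.

    Assumption: every code is `width` bytes wide; unmapped codes become U+FFFD.
    """
    out = []
    for i in range(0, len(codes), width):
        code = int.from_bytes(codes[i:i + width], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


table = parse_bfchar(SAMPLE_CMAP)
print(table[45])
print(decode_codes(bytes.fromhex("002D002E"), table))
```

Fonts that ship without a ToUnicode CMap need a different fallback (for example, the font's encoding or glyph names), which this sketch does not cover.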
3. Spatial Reconstruction
Since PDF text can be drawn in any order, our algorithm sorts the extracted strings by their X/Y coordinates (top to bottom, then left to right) to reconstruct the logical reading order.
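One way to sketch the sort, under stated assumptions: a single-column, unrotated page, PDF's default coordinate system (origin at the bottom-left, Y increasing upward), and fragments on the same line having baselines within a small tolerance. The `Fragment` type, the `reading_order` function, and the `line_tolerance` value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float   # left edge of the string, in PDF user-space units
    y: float   # baseline of the string; larger means higher on the page
    text: str


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> list[str]:
    """Group fragments into lines by Y, then order each line by X."""
    lines: list[list[Fragment]] = []
    # Highest Y first, because the top of the page has the largest Y value.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)   # same line as the previous fragment
        else:
            lines.append([frag])     # start a new line
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x))
            for line in lines]


page = [
    Fragment(150, 700.0, "world"),
    Fragment(72, 680.0, "Second"),
    Fragment(72, 700.5, "Hello"),
    Fragment(120, 680.0, "line"),
]
for line in reading_order(page):
    print(line)
```

The tolerance matters because strings on the same visual line rarely share an exact Y value. Multi-column layouts and rotated text need extra handling that a plain coordinate sort does not provide.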