[poppler] text extraction in raw order + text attributes

Discussion:

Richard Wossal

2013-12-06 16:51:24 UTC

Hi!

I'm trying to use poppler to extract text from PDFs, and I've found
empirically that using the "raw order" option gives better results (I can
supply example files where non-raw order returns mangled text, if needed).

This option is only exposed for the C++ bindings, not the GLib ones.
I could use either binding, but I also need something like poppler-glib's
"poppler_page_get_text_attributes".

As far as I can see, I could either:

* hack something so I can extract text in raw order using the GLib bindings
(I'd prefer staying C-only, but I don't see how this would be possible,
except by adding it to the bindings)

* or re-implement poppler_page_get_text_attributes in C++, using poppler's
private API (or take poppler's implementation)

What do you think would be the best way to go about that?

Thanks!

Richard

PS:

My use case, in case there's an even better way to do that: I'm trying to
heuristically extract titles and authors of PDFs without usable metadata.
The backend has a bunch of rules like "the thing with the biggest font
size is probably the title". This works surprisingly well - except for said
PDFs where poppler_page_get_text only returns garbage, obviously.
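
For illustration, the "biggest font size" rule could be sketched in C as
below. This is only a minimal sketch: struct text_run and guess_title are
hypothetical names, not poppler types or functions. It assumes the page
text has already been split into runs with one font size each (poppler-glib
reports a font size per attribute run in PopplerTextAttributes, but mapping
those runs to strings is left out here).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one run of text set in a single font size.
 * This is an illustration only, not a poppler type. */
struct text_run {
    const char *text;   /* the run's text, NUL-terminated */
    double font_size;   /* font size of the run */
};

/* Return the run with the largest font size, or NULL if n == 0.
 * On ties the earliest run wins, since titles tend to come first. */
const struct text_run *
guess_title(const struct text_run *runs, size_t n)
{
    const struct text_run *best = NULL;
    size_t i;

    assert(runs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* Strict '>' keeps the earlier run when sizes are equal. */
        if (best == NULL || runs[i].font_size > best->font_size)
            best = &runs[i];
    }
    return best;
}
```

The rule is only as good as the text in each run: if the extracted text is
mangled, the run with the biggest font is still found, but its text is not
usable as a title.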

Carlos Garcia Campos

2013-12-07 11:43:24 UTC

Post by Richard Wossal
Hi!
I'm trying to use poppler to extract text from PDFs, and I've found
empirically that using the "raw order" option gives better results (I can
supply example files where non-raw order returns mangled text, if needed).

Yes, please; it would help to see some of those examples.

Post by Richard Wossal
This option is only exposed for the C++ bindings, not the GLib ones.
I could use either binding, but I also need something like poppler-glib's
"poppler_page_get_text_attributes".

poppler_page_get_text, get_text_layout and get_text_attributes return
the text in reading order, using heuristics to follow columns and
tables. It's not perfect, of course, since it's based on heuristics.

Post by Richard Wossal
* hack something so I can extract text in raw order using the GLib bindings
(I'd prefer staying C-only, but I don't see how this would be possible,
except by adding it to the bindings)
* or re-implement poppler_page_get_text_attributes in C++, using poppler's
private API (or take poppler's implementation)
What do you think would be the best way to go about that?

If you really need to get the text in raw order, we can add new methods to
the API for that. I'm thinking that maybe we could add a more generic
text iteration API with options like area, order and even the break
iterator (so that you can iterate over characters, words and lines).
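
To make the "break iterator" part of that idea concrete, here is a rough
sketch in plain C. Nothing in it exists in poppler: enum text_break,
struct text_iter, text_iter_init and text_iter_next are all hypothetical
names, and the sketch walks an already extracted string rather than a page.
The area and order options are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical granularity option; not part of any poppler API. */
enum text_break { BREAK_CHAR, BREAK_WORD, BREAK_LINE };

struct text_iter {
    const char *pos;       /* next unread byte of the text */
    enum text_break brk;   /* unit returned by each step */
};

void text_iter_init(struct text_iter *it, const char *text,
                    enum text_break brk)
{
    assert(it != NULL && text != NULL);
    it->pos = text;
    it->brk = brk;
}

/* Is c a separator for the given granularity? */
static int is_sep(char c, enum text_break brk)
{
    if (brk == BREAK_LINE)
        return c == '\n';
    if (brk == BREAK_WORD)
        return c == ' ' || c == '\t' || c == '\n';
    return 0;   /* BREAK_CHAR: every byte is its own unit */
}

/* Store the next unit in *start and *len (not NUL-terminated).
 * Returns 1 if a unit was found, 0 at the end of the text.
 * Empty words and empty lines are skipped. BREAK_CHAR steps over
 * bytes, so multi-byte UTF-8 characters are not kept together. */
int text_iter_next(struct text_iter *it, const char **start, size_t *len)
{
    const char *p = it->pos;

    while (*p != '\0' && is_sep(*p, it->brk))
        p++;
    if (*p == '\0') {
        it->pos = p;
        return 0;
    }
    *start = p;
    if (it->brk == BREAK_CHAR) {
        p++;
    } else {
        while (*p != '\0' && !is_sep(*p, it->brk))
            p++;
    }
    *len = (size_t)(p - *start);
    it->pos = p;
    return 1;
}
```

A caller would initialise the iterator once with the wanted granularity and
then call text_iter_next in a loop until it returns 0. A real API would
presumably hand back attributes and positions along with each unit.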

Post by Richard Wossal
Thanks!
Richard
My use case, in case there's an even better way to do that: I'm trying to
heuristically extract titles and authors of PDFs without usable metadata.
The backend has a bunch of rules like "the thing with the biggest font
size is probably the title". This works surprisingly well - except for said
PDFs where poppler_page_get_text only returns garbage, obviously.

What exactly is garbage?

Regards,

--
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462

Richard Wossal

2013-12-09 11:05:47 UTC

Post by Carlos Garcia Campos

Yes, please; it would help to see some of those examples.

Here are some samples:

To reproduce, save the following Google Doc as a PDF (File->Download as):
https://docs.google.com/document/d/1U6SsDnTIce3IH-GhdKpx_uStQQSzSCsACoPkvmZtqTc/edit?usp=sharing

$ pdftotext -v
pdftotext version 0.18.4
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
$ pdftotext ~/Downloads/sample.pdf - | head
This is a title
This is a subtitle

T iin r l x
h oma t t
ss

e
This is underlined text

$ pdftotext -raw ~/Downloads/sample.pdf - | head
This is a title
This is a subtitle
This is normal text
This is underlined text
This is a Heading
Here’s some nonascii stuff: öäüß§

Similar effects can be observed for the title page of
http://www.farmworkerjustice.org/sites/default/files/documents/7.2.a.6%20fwj.pdf

While looking at it more closely now, it appears that sometimes
non-raw reading order gives better results, as with
http://win.niddk.nih.gov/publications/pdfs/teenblackwhite3.pdf

$ pdftotext 'pdfs/teenblackwhite3.pdf' - | head
A Guide for
Teenagers!

Take

C h a rg e
of

Your

$ pdftotext -raw 'pdfs/teenblackwhite3.pdf' - | head
TakeTake
Charge
o f
Your Health!
A Guide for
Teenagers!

A GuideT fe oen r
TakeTake
agers!
Charge

(Just to give some sense of the magnitude: the last two are from
a random sample of 100 PDFs my users threw at me. The Google Doc I
wrote myself, as a test case. So it's not exactly a huge problem.)
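
Since neither order wins every time, one conceivable workaround is to
extract the text both ways and keep whichever looks less mangled. The
sketch below is only an idea, not something poppler offers: count_tokens
and looks_cleaner are hypothetical helpers, and "share of one-byte tokens"
is an assumed, crude measure based on how the broken output above
falls apart into single letters.

```c
#include <assert.h>
#include <stddef.h>

/* Count whitespace-separated tokens in text. Tokens of at most
 * short_max bytes are additionally counted in *n_short. */
size_t count_tokens(const char *text, size_t short_max, size_t *n_short)
{
    size_t total = 0, len = 0;
    const char *p;

    assert(text != NULL && n_short != NULL);
    *n_short = 0;
    for (p = text; ; p++) {
        int sep = (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\0');

        if (!sep) {
            len++;
        } else if (len > 0) {
            /* A token just ended: record it and start over. */
            total++;
            if (len <= short_max)
                (*n_short)++;
            len = 0;
        }
        if (*p == '\0')
            break;
    }
    return total;
}

/* Return 1 if a has a strictly lower share of one-byte tokens than b,
 * 0 otherwise. The shares sa/ta and sb/tb are compared by
 * cross-multiplication, so no division by zero can occur. */
int looks_cleaner(const char *a, const char *b)
{
    size_t sa, sb;
    size_t ta = count_tokens(a, 1, &sa);
    size_t tb = count_tokens(b, 1, &sb);

    return sa * tb < sb * ta;
}
```

This would favour the -raw output for the Google Doc sample, but it is easy
to fool: doubled or interleaved words, as in the raw output of the last
sample, contain few single letters and would still score well.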

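Post by Carlos Garcia Campos

I'm thinking that maybe we could add a more generic
text iteration API with options like area, order and even the break
iterator (so that you can iterate over characters, words and lines).
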
Being able to iterate over basically some kind of AST of the PDF
(say, chars+attributes) would be pretty nice indeed.

For myself, I've decided to go ahead with poppler-glib's
page_get_text_* for now. The failure rate is low enough for my
application. I was initially stumped that my simple Google Doc test
case wouldn't parse correctly, but it doesn't seem to be such a big
problem with PDFs in the wild.

Thanks!

Richard