Discussion:
[poppler] Python - PDF - highlighted text (not annotation popup) - How to extract its text
bruno gallart
2014-11-09 09:38:08 UTC
Hello,

I read many PDF texts. I don't create popup annotations; I only highlight
text in yellow. I want to extract this text (with Python) so that I can
index it with Whoosh afterwards for my studies. I saw that when text is
highlighted, the object created in the PDF file is:

20 0 obj
<<
/C [1 1 0]
/F 4
/M (D:20141107203743+01'00')
/P 7 0 R
/T (bruno)
/AP <<
/N 31 0 R
>>
/NM (38048b89-6e9f-4434-9cae2b25dfc8c8a2)
/Rect [112.707338 807.385499 164.672639 816.770264]
/Subj (Surligner)
/Subtype /Highlight
/QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
807.385508 162.809979 807.385508]
/CreationDate (D:20141107203743+01'00')
>>
endobj

Unlike a classic annotation, there is no /Contents key here, and that is
my problem. I have tried pdfMiner, pyPDF, PyPDF2 and now pyPoppler, but I
am not very experienced and cannot find a way to extract the line I want.
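
As a rough illustration of what the object above contains, the two coordinate arrays can be pulled out of the raw object text with a regular expression. This is only a sketch: the helper name read_number_array is hypothetical, the object text is abridged, and a real PDF library should be preferred over regex matching on raw PDF syntax.

```python
import re

# Abridged raw text of the annotation object quoted above.
ANNOT = """
/Rect [112.707338 807.385499 164.672639 816.770264]
/Subtype /Highlight
/QuadPoints [114.570002 816.770274 162.809979 816.770274 114.570002
807.385508 162.809979 807.385508]
"""


def read_number_array(obj_text, key):
    """Return the numbers stored under `/key [ ... ]`, or None if the key is absent.

    Hypothetical helper: it only handles flat numeric arrays such as
    /Rect and /QuadPoints, not general PDF syntax.
    """
    match = re.search(r"/" + re.escape(key) + r"\s*\[([^\]]*)\]", obj_text)
    if match is None:
        return None
    return [float(token) for token in match.group(1).split()]


rect = read_number_array(ANNOT, "Rect")
quad_points = read_number_array(ANNOT, "QuadPoints")
contents = read_number_array(ANNOT, "Contents")  # absent in a plain highlight

print(rect)
print(quad_points)
print(contents)
```

Neither array holds the highlighted text itself; both only describe where on the page the highlight sits.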

My question:

Can the /QuadPoints key lead me to the highlighted text? Or can the
/Rect key do this?

If somebody can give me some advice, I will be happy.

Thanks for your patience

Bruno
Albert Astals Cid
2014-11-09 15:48:01 UTC
Post by bruno gallart
Can the /QuadPoints key lead me to the highlighted text? Or can the
/Rect key do this?
They are both "the same"; in this case Rect seems to have a bit more
"padding", but they describe the same area.

Yes, you should be able to use that rect to get the text in there.
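
The relationship between the two keys can be checked numerically with the values from the original post. The sketch below is not part of poppler; quad_boxes is a hypothetical helper that treats every 8 numbers of /QuadPoints as one quadrilateral and takes its bounding box, then compares that box with /Rect.

```python
def quad_boxes(quad_points):
    """Split a flat /QuadPoints list into one bounding box per quadrilateral.

    Each quadrilateral is 8 numbers (four x, y corner pairs). The result is
    a list of (x_min, y_min, x_max, y_max) tuples in PDF user space.
    """
    if len(quad_points) % 8 != 0:
        raise ValueError("QuadPoints length must be a multiple of 8")
    boxes = []
    for start in range(0, len(quad_points), 8):
        quad = quad_points[start:start + 8]
        xs, ys = quad[0::2], quad[1::2]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Values copied from the annotation object in the original post.
rect = (112.707338, 807.385499, 164.672639, 816.770264)
quad_points = [114.570002, 816.770274, 162.809979, 816.770274,
               114.570002, 807.385508, 162.809979, 807.385508]

box = quad_boxes(quad_points)[0]
# Absolute difference between /Rect and the quad's bounding box, per edge.
padding = [abs(r - b) for r, b in zip(rect, box)]

print(box)
print(padding)
```

For this annotation the difference is roughly 1.86 points on the left and right edges and negligible on the vertical edges. A highlight that spans several lines would normally carry several quadrilaterals, in which case extracting text per quadrilateral is likely to be more precise than using the single /Rect.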

Cheers,
Albert
bruno gallart
2014-11-09 16:41:49 UTC
Good day Albert, and thank you for your reply
(I am not Catalan but Languedocian, from Béziers, and my Catalan is quite distant).

Thanks for your response, Albert.

I have read the poppler API documentation, but I cannot find the object
and the method for this (/Rect ---> extract text by x,y coordinates). My
question may be tedious, but do you know which object I must use to do
this extraction?

Thanks a lot

Bruno
_______________________________________________
poppler mailing list
http://lists.freedesktop.org/mailman/listinfo/poppler
Albert Astals Cid
2014-11-09 17:24:49 UTC
Post by bruno gallart
I have read the poppler API documentation, but I cannot find the object
and the method for this (/Rect ---> extract text by x,y coordinates). My
question may be tedious, but do you know which object I must use to do
this extraction?
Using the Qt4 frontend, I'd use Poppler::Page::text(rect).

Cheers,
Albert
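
One detail to watch when passing a rectangle to Poppler::Page::text(rect): PDF user space has its origin at the bottom-left of the page, while Qt rectangles are usually given as (x, y, width, height) from the top-left. The sketch below shows only that coordinate flip in plain Python. It assumes the frontend expects top-left-origin coordinates in points, which should be checked against the binding in use; to_top_left_rect is a hypothetical helper, and the page height of 842.0 is an illustrative value, not taken from the thread.

```python
def to_top_left_rect(pdf_rect, page_height):
    """Convert a PDF rectangle to (x, y, width, height) with a top-left origin.

    pdf_rect is (x_min, y_min, x_max, y_max) in PDF user space, where y grows
    upwards from the bottom of the page. page_height is in the same units.
    """
    x_min, y_min, x_max, y_max = pdf_rect
    return (x_min, page_height - y_max, x_max - x_min, y_max - y_min)


# /Rect from the original post; the page height is an assumed value and
# would normally come from the page object of the PDF library.
rect = (112.707338, 807.385499, 164.672639, 816.770264)
assumed_page_height = 842.0

x, y, width, height = to_top_left_rect(rect, assumed_page_height)
print(x, y, width, height)
```

The four resulting numbers are what one would use to build the rectangle argument, for example a QRectF in the Qt bindings.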
bruno gallart
2014-11-09 17:38:27 UTC
Poppler::Page::text(rect) with the rect's coordinates: I have read the API and I am going to try it.
I am going to have a very good evening of programming with pyPoppler now. Thanks, Albert.

Cheers

Bruno