pdftojson

Using XPDF, pdftojson extracts text from PDF files as JSON, including word bounding boxes.

  • Owner: ldenoue/pdftojson
  • License: GNU General Public License v2.0

Compile

./configure
make

On macOS, you might need to specify the libpng and libfreetype locations, e.g.

./configure --with-libpng-library=/usr/local/Cellar/libpng/1.6.16/lib/  --with-libpng-includes=/usr/local/Cellar/libpng/1.6.16/include/ --with-freetype2-library=/usr/local/lib/ --with-freetype2-includes=/usr/local/include/freetype2/

After the build, you will find pdftojson inside the directory xpdf/pdftojson.

Usage

pdftojson <input.pdf> <output.json>
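
The command above can also be driven from a script. The following is a minimal Python sketch, not part of pdftojson itself. The names `DEFAULT_BINARY`, `build_command`, and `convert` are hypothetical, and the binary path is an assumption based on the build directory mentioned in the Compile section; adjust it to wherever your build placed the executable. It also assumes that pdftojson exits with a non-zero status on failure, which the README does not state.

```python
import subprocess
from pathlib import Path

# Assumption: path of the compiled binary relative to the source tree.
DEFAULT_BINARY = "xpdf/pdftojson/pdftojson"


def build_command(pdf_path, json_path, binary=DEFAULT_BINARY):
    """Return the argument list for: pdftojson <input.pdf> <output.json>."""
    return [str(binary), str(pdf_path), str(json_path)]


def convert(pdf_path, json_path=None, binary=DEFAULT_BINARY):
    """Run pdftojson on one PDF and return the requested output path.

    If json_path is omitted, the output is written next to the PDF
    with a .json suffix (a convention of this sketch, not of the tool).
    """
    pdf_path = Path(pdf_path)
    if json_path is None:
        json_path = pdf_path.with_suffix(".json")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_command(pdf_path, json_path, binary), check=True)
    return Path(json_path)
```

Calling `convert("paper.pdf")` would then run the tool and return `Path("paper.json")`, provided the binary exists at the assumed location.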

File format

The JSON produced looks like:
[
{ "pages":14,
"number":1,
"width":612,
"height":792,
"text":[
[115,162,41,14,0,"What "],
...
]
},
{ "pages":14,
"number":2,
"width":612,
"height":792,
"text":[
[115,162,41,14,0,"Here "],
...
]
},
...
];

For each page, every entry in the text array describes one word as: [top,left,width,height,0,text]
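
To illustrate the format, here is a minimal Python sketch that parses such output and turns each entry into a named record. `Word`, `parse_pages`, and `page_words` are hypothetical names introduced for this example. The sketch follows the field order documented above and skips the fifth element, whose meaning is not described. Stripping a trailing `;` is an assumption based on the sample, since a strict JSON parser would reject it. The inline sample is a shortened copy of the example with the `...` elisions left out.

```python
import json
from typing import NamedTuple


class Word(NamedTuple):
    top: float
    left: float
    width: float
    height: float
    text: str


def parse_pages(raw):
    """Parse pdftojson output into a list of page dictionaries."""
    raw = raw.strip()
    # Assumption: drop a trailing ";" as shown in the sample, if present.
    if raw.endswith(";"):
        raw = raw[:-1]
    return json.loads(raw)


def page_words(page):
    """Convert one page's text array into Word records.

    Field order as documented: [top, left, width, height, 0, text].
    """
    return [Word(e[0], e[1], e[2], e[3], e[5]) for e in page["text"]]


sample = """[
{ "pages":14, "number":1, "width":612, "height":792,
  "text":[[115,162,41,14,0,"What "]] }
];"""

pages = parse_pages(sample)
words = page_words(pages[0])
print(pages[0]["number"], words[0])
```

Keeping the records as named tuples makes later steps, such as sorting words by `top` and `left` to rebuild reading order, easier to read than indexing into raw arrays.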

Main metrics

Overview

  • Name With Owner: ldenoue/pdftojson
  • Primary Language: C++
  • Program language: Makefile (Language Count: 5)
  • License: GNU General Public License v2.0

Owner activity

  • Created At: 2017-02-10 13:47:56
  • Pushed At: 2023-11-04 15:52:12
  • Last Commit At: 2023-11-04 16:52:11
  • Release Count: 0

User engagement

  • Stargazers Count: 146
  • Watchers Count: 10
  • Fork Count: 15
  • Commits Count: 33
  • Issues Count: 8
  • Issue Open Count: 4
  • Pull Requests Count: 2
  • Pull Requests Open Count: 0
  • Pull Requests Close Count: 0