Camelot

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!


Camelot: PDF Table Extraction for Humans


Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note: You can also check out Excalibur, which is a web interface for Camelot!


Here's how you can extract tables from PDF files. Check out the PDF used in this example here.
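A minimal sketch of the extraction described here follows. It relies on Camelot's `read_pdf` function and on the `parsing_report` and `df` attributes of the tables it returns; the helper names `extract_tables` and `summarize` and the path `foo.pdf` are illustrative assumptions, not part of the library.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="1", flavor="lattice"):
    """Return the tables Camelot finds in a text-based PDF (illustrative helper)."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")
    # Imported here so this module still loads when Camelot is not installed.
    import camelot

    # pages takes strings such as "1", "1,3" or "1-end"; flavor is "lattice"
    # (tables with ruling lines) or "stream" (whitespace-separated tables).
    return camelot.read_pdf(str(path), pages=pages, flavor=flavor)


def summarize(tables):
    """Print the parsing report and shape of every extracted table."""
    for table in tables:
        # parsing_report is a dict with accuracy, whitespace, order and page.
        print(table.parsing_report, table.df.shape)
```

With Camelot installed, `tables = extract_tables("foo.pdf")` followed by `tables[0].df` gives the first table as a pandas DataFrame, and `tables[0].to_csv("foo.csv")` writes it to disk.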

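The last four columns report percent fuel savings.

| Cycle Name | KI (1/km) | Distance (mi) | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| --- | --- | --- | --- | --- | --- | --- |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |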

There's a command-line interface too!
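As a rough illustration, the command-line interface can also be driven from a script. The sketch below assumes the `camelot` executable accepts the global options `--pages`, `--format` and `--output` followed by a `lattice` or `stream` sub-command and the PDF path; check `camelot --help` for the options your installed version supports. The function names and file paths are hypothetical.

```python
import subprocess


def camelot_cli_command(pdf_path, output_path, fmt="csv", flavor="lattice", pages="1"):
    """Build the argument list for one run of the camelot command."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor}")
    # Assumed layout: global options first, then the flavor sub-command,
    # then the PDF to process.
    return [
        "camelot",
        "--pages", pages,
        "--format", fmt,
        "--output", str(output_path),
        flavor,
        str(pdf_path),
    ]


def run_camelot_cli(pdf_path, output_path, **options):
    """Run the command and raise CalledProcessError if it exits non-zero."""
    command = camelot_cli_command(pdf_path, output_path, **options)
    return subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to print or log the exact command before anything is executed.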

Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)

Why Camelot?

  • You are in control: unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel, HTML and SQLite.
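To make the point about discarding bad tables concrete, here is a small sketch of filtering on the parsing report. It assumes each table exposes a `parsing_report` dict with `accuracy` and `whitespace` keys; the thresholds, the helper name `keep_good_tables` and the sample values are illustrative assumptions.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Keep tables whose parsing report meets both (assumed) thresholds."""
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-in objects that mimic only the parsing_report attribute of a table,
# so the filter can be tried without a PDF. The values are made up.
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "whitespace": 12.0, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]
good = keep_good_tables(sample)
print(len(good))  # 1
```

The same function should work on the object returned by `camelot.read_pdf`, since it only iterates over the tables and reads their reports. Each surviving table can then be used through its `df` attribute or written out with methods such as `to_csv` or `to_json`.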

See comparison with other PDF table extraction libraries and tools.

Installation

Using conda

The easiest way to install Camelot is with conda, a package manager and environment management system for the Anaconda distribution. The package is published on the conda-forge channel as camelot-py: conda install -c conda-forge camelot-py

Using pip

After installing the dependencies (tk and ghostscript), you can install Camelot with pip: pip install camelot-py[cv]
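Before running pip, it can help to confirm that the two dependencies are visible from Python. The following is only a heuristic sketch: it assumes tk is reachable through the standard `tkinter` module and that Ghostscript ships an executable named `gs` (Linux, macOS) or `gswin64c`/`gswin32c` (Windows). Camelot may look for Ghostscript differently, so treat a missing executable as a hint rather than proof.

```python
import shutil


def check_dependencies():
    """Report whether tk and a Ghostscript executable can be found."""
    try:
        import tkinter  # noqa: F401  (Python's binding to Tk)
        has_tk = True
    except ImportError:
        has_tk = False
    # Executable names are an assumption and vary by platform.
    candidates = ("gs", "gswin64c", "gswin32c")
    has_ghostscript = any(shutil.which(name) for name in candidates)
    return {"tk": has_tk, "ghostscript": has_ghostscript}
```

Calling `check_dependencies()` returns a dict such as `{"tk": True, "ghostscript": False}`, which points at the dependency that still needs to be installed.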

From the source code

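After installing the dependencies, clone the repo using: git clone https://github.com/atlanhq/camelot

Then install Camelot from the cloned directory using pip: cd camelot && pip install ".[cv]"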

Documentation

Great documentation is available at http://camelot-py.readthedocs.io/.

Development

The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.

Source code

You can check out the latest sources with: git clone https://github.com/atlanhq/camelot

Setting up a development environment

You can install the development dependencies using pip: pip install camelot-py[dev]

Testing

After installation, you can run the tests using: python setup.py test

Versioning

Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.
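Because release tags follow Semantic Versioning, they should be compared numerically rather than as strings. The sketch below handles only plain MAJOR.MINOR.PATCH tags with an optional leading `v`, as in v0.7.3; pre-release and build-metadata suffixes are deliberately out of scope, and the function names are illustrative.

```python
import re

# Matches tags such as "v0.7.3" or "0.7.3"; nothing after PATCH is accepted.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag):
    """Turn a tag like "v0.7.3" into the tuple (0, 7, 3)."""
    match = _SEMVER.match(tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def latest(tags):
    """Return the tag with the highest version number."""
    return max(tags, key=parse_version)
```

Comparing tuples of integers avoids the string-ordering trap in which "0.10.0" sorts before "0.9.0".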

License

This project is licensed under the MIT License; see the LICENSE file for details.

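Key metrics

Overview
Name and owner: atlanhq/camelot
Primary language: Python
Languages: Python (2 languages in total)
Platform:
License: Other
Owner activity
Created: 2016-06-18 11:48:49
Last pushed: 2023-01-05 15:25:42
Last commit: 2019-10-15 10:55:40
Releases: 18
Latest release: v0.7.3
First release: v0.1.0 (released 2018-09-24 16:30:41)
User engagement
Stars: 3.7k
Watchers: 80
Forks: 360
Commits: 446
Issues enabled?
Issues: 385
Open issues: 105
Pull requests: 88
Open pull requests: 13
Closed pull requests: 25
Project settings
Wiki enabled?
Archived?
Fork?
Locked?
Mirror?
Private?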