cdQA

An End-To-End Closed Domain Question Answering System.

cdQA is built on top of the HuggingFace transformers library.

cdQA in detail

If you want to understand how the system works and how it is implemented, we wrote an article on Medium with a high-level explanation.

We also gave a presentation at the #9 NLP Breakfast organised by Feedly. You can check it out here.

Installation

With pip

pip install cdqa

From source

git clone https://github.com/cdqa-suite/cdQA.git
cd cdQA
pip install -e .

Hardware Requirements

Experiments have been done with:

  • CPU: AWS EC2 t2.medium Deep Learning AMI (Ubuntu) Version 22.0
  • GPU: AWS EC2 p3.2xlarge Deep Learning AMI (Ubuntu) Version 22.0 + a single Tesla V100 16GB.

Getting started

Preparing your data

Manual

To use cdQA you need to create a pandas dataframe with the following columns:

| title             | paragraphs                                             |
| ----------------- | ------------------------------------------------------ |
| The Article Title | [Paragraph 1 of Article, ... , Paragraph N of Article] |

With converters

The cdQA converters make it easy to create this dataframe from your raw document database. For instance, the pdf_converter can create a cdQA dataframe from a directory containing .pdf files:

from cdqa.utils.converters import pdf_converter

df = pdf_converter(directory_path='path_to_pdf_folder')

You will need to install Java OpenJDK to use this converter. We currently have converters for:

  • pdf
  • markdown

We plan to improve and add more converters in the future. Stay tuned!
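
For the manual route described in the Manual section, a minimal sketch of building the dataframe by hand with pandas could look like the following. The article titles and paragraph texts are made-up placeholders; the only structure taken from this README is the two columns, `title` and `paragraphs`, where each `paragraphs` cell holds a list of strings.

```python
import pandas as pd

# Hypothetical articles: each title maps to a list of paragraph strings.
articles = {
    "First article": [
        "Paragraph 1 of the first article.",
        "Paragraph 2 of the first article.",
    ],
    "Second article": ["Only paragraph of the second article."],
}

# One row per article, using the column names described in the Manual section.
df = pd.DataFrame(
    {"title": list(articles.keys()), "paragraphs": list(articles.values())}
)

print(df)
```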

Downloading pre-trained models and data

You can download the models and data manually from the GitHub releases or use our download functions:

from cdqa.utils.download import download_squad, download_model, download_bnpp_data

directory = 'path-to-directory'

# Downloading data
download_squad(dir=directory)
download_bnpp_data(dir=directory)

# Downloading pre-trained BERT fine-tuned on SQuAD 1.1
download_model('bert-squad_1.1', dir=directory)

# Downloading pre-trained DistilBERT fine-tuned on SQuAD 1.1
download_model('distilbert-squad_1.1', dir=directory)

Training models

Fit the pipeline on your corpus using the pre-trained reader:

import pandas as pd
from ast import literal_eval
from cdqa.pipeline import QAPipeline

df = pd.read_csv('your-custom-corpus-here.csv', converters={'paragraphs': literal_eval})

cdqa_pipeline = QAPipeline(reader='bert_qa.joblib') # use 'distilbert_qa.joblib' for DistilBERT instead of BERT
cdqa_pipeline.fit_retriever(df=df)
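
The `converters={'paragraphs': literal_eval}` argument matters because a CSV file stores each list of paragraphs as its string representation. The sketch below illustrates that round trip with an in-memory buffer and made-up data instead of a real corpus file; the variable names are only for illustration.

```python
import io
from ast import literal_eval

import pandas as pd

# Made-up corpus with the two expected columns.
corpus = pd.DataFrame(
    {
        "title": ["Example article"],
        "paragraphs": [["First paragraph.", "Second paragraph."]],
    }
)

# Writing to CSV turns each list into its string representation.
csv_text = corpus.to_csv(index=False)

# Without a converter, the column comes back as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With literal_eval, each cell is parsed back into a Python list.
parsed = pd.read_csv(io.StringIO(csv_text), converters={"paragraphs": literal_eval})

print(type(raw.loc[0, "paragraphs"]).__name__)
print(type(parsed.loc[0, "paragraphs"]).__name__)
```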

If you want to fine-tune the reader on your custom SQuAD-like annotated dataset:

cdqa_pipeline = QAPipeline(reader='bert_qa.joblib') # use 'distilbert_qa.joblib' for DistilBERT instead of BERT
cdqa_pipeline.fit_reader('path-to-custom-squad-like-dataset.json')

Save the reader model after fine-tuning:

cdqa_pipeline.dump_reader('path-to-save-bert-reader.joblib')

Making predictions

To get the best prediction given an input query:

cdqa_pipeline.predict(query='your question')

To get the N best predictions:

cdqa_pipeline.predict(query='your question', n_predictions=N)

You can also change the weight of the retriever score relative to the reader score in the computation of the final ranking score. The default is 0.35, which was found to be the best weight on the development set of SQuAD 1.1-open:

cdqa_pipeline.predict(query='your question', retriever_score_weight=0.35)
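
To make the role of this weight concrete, here is a rough sketch of how a retriever score and a reader score could be blended into one ranking score. The linear interpolation, the function name `combine_scores`, and the example scores are assumptions for illustration; cdQA's actual computation may differ, for example in how the scores are normalised.

```python
def combine_scores(retriever_score, reader_score, retriever_score_weight=0.35):
    """Blend two scores linearly (an assumed formula, not taken from cdQA's source)."""
    if not 0.0 <= retriever_score_weight <= 1.0:
        raise ValueError("retriever_score_weight must be between 0 and 1")
    return (
        retriever_score_weight * retriever_score
        + (1.0 - retriever_score_weight) * reader_score
    )


# Hypothetical candidates: (answer, retriever score, reader score).
candidates = [
    ("answer A", 0.9, 0.2),
    ("answer B", 0.3, 0.8),
]

# Rank candidates by the blended score, best first.
ranked = sorted(
    candidates,
    key=lambda c: combine_scores(c[1], c[2]),
    reverse=True,
)
print([answer for answer, _, _ in ranked])
```

In this toy example the default weight lets the reader's preferred answer rank first; moving the weight towards 1 would let the retriever score dominate instead.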

Evaluating models

To evaluate models on your custom dataset, you first need to annotate it. Annotation and evaluation can be done in the following steps:

  1. Convert your pandas DataFrame into a json file with SQuAD format:

    from cdqa.utils.converters import df2squad
    
    json_data = df2squad(df=df, squad_version='v1.1', output_dir='.', filename='dataset-name')
    
  2. Use an annotator to add ground truth question-answer pairs:

    Please refer to our cdQA-annotator, a web-based annotator for closed-domain question answering datasets with SQuAD format.

  3. Evaluate the pipeline object:

    from cdqa.utils.evaluation import evaluate_pipeline
    
    evaluate_pipeline(cdqa_pipeline, 'path-to-annotated-dataset.json')
    
    
  4. Evaluate the reader:

    from cdqa.utils.evaluation import evaluate_reader
    
    evaluate_reader(cdqa_pipeline, 'path-to-annotated-dataset.json')
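
For orientation on steps 1 and 2, the sketch below builds the general shape of a SQuAD v1.1-style structure from a dataframe and adds one hand-written question-answer pair. It is not the implementation of `df2squad`; the helper name `to_squad_like`, the sample text, and the `id` value are hypothetical, and only the overall layout (`data`, `paragraphs`, `context`, `qas`, `answers`) follows the public SQuAD v1.1 format.

```python
import json

import pandas as pd


def to_squad_like(df, version="v1.1"):
    """Convert a title/paragraphs dataframe into a SQuAD-like dict with empty qas."""
    data = []
    for _, row in df.iterrows():
        paragraphs = [{"context": text, "qas": []} for text in row["paragraphs"]]
        data.append({"title": row["title"], "paragraphs": paragraphs})
    return {"version": version, "data": data}


sample_df = pd.DataFrame(
    {
        "title": ["Example article"],
        "paragraphs": [["cdQA is built on top of transformers."]],
    }
)
squad = to_squad_like(sample_df)

# An annotator would add entries like this one; answer_start is the
# character offset of the answer text inside the context.
context = squad["data"][0]["paragraphs"][0]["context"]
answer = "transformers"
squad["data"][0]["paragraphs"][0]["qas"].append(
    {
        "id": "example-0",  # hypothetical identifier
        "question": "What is cdQA built on top of?",
        "answers": [{"text": answer, "answer_start": context.index(answer)}],
    }
)

print(json.dumps(squad, indent=2))
```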
    

Notebook Examples

We prepared some notebook examples under the examples directory.

You can also play directly with these notebook examples using Binder or Google Colaboratory:

| Notebook                         | Hardware   | Platform      |
| -------------------------------- | ---------- | ------------- |
| [1] First steps with cdQA        | CPU or GPU | Binder, Colab |
| [2] Using the PDF converter      | CPU or GPU | Binder, Colab |
| [3] Training the reader on SQuAD | GPU        | Colab         |

Binder and Google Colaboratory provide temporary environments and may be slow to start, but we recommend them if you want to get started with cdQA easily.

Deployment

Manual

You can deploy a cdQA REST API by executing:

export dataset_path=path-to-dataset.csv
export reader_path=path-to-reader-model

FLASK_APP=api.py flask run -h 0.0.0.0

You can now make requests to test your API (here using HTTPie):

http localhost:5000/api query=='your question here'
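
The same request can be prepared from Python. The sketch below only builds the URL with the standard library, assuming the host, port, `/api` path, and `query` parameter shown in the HTTPie command; the helper name `build_query_url` is hypothetical, and the format of the API's response is not described here, so the example does not parse it.

```python
from urllib.parse import urlencode


def build_query_url(question, base_url="http://localhost:5000/api"):
    """Return the API URL with the question passed as the `query` parameter."""
    return f"{base_url}?{urlencode({'query': question})}"


url = build_query_url("your question here")
print(url)

# With the Flask server running, the request could then be sent with, e.g.:
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       body = response.read().decode("utf-8")
```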

If you wish to serve a user interface on top of your cdQA system, follow the instructions for cdQA-ui, a web interface developed for cdQA.

Contributing

Read our Contributing Guidelines.

References

| Type      | Title | Author | Year |
| --------- | ----- | ------ | ---- |
| Video     | Stanford CS224N: NLP with Deep Learning Lecture 10 – Question Answering | Christopher Manning | 2019 |
| Paper     | Reading Wikipedia to Answer Open-Domain Questions | Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes | 2017 |
| Paper     | Neural Reading Comprehension and Beyond | Danqi Chen | 2018 |
| Paper     | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | 2018 |
| Paper     | Contextual Word Representations: A Contextual Introduction | Noah A. Smith | 2019 |
| Paper     | End-to-End Open-Domain Question Answering with BERTserini | Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin | 2019 |
| Paper     | Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering | Wei Yang, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin | 2019 |
| Paper     | Passage Re-ranking with BERT | Rodrigo Nogueira, Kyunghyun Cho | 2019 |
| Paper     | MRQA: Machine Reading for Question Answering | Jonathan Berant, Percy Liang, Luke Zettlemoyer | 2019 |
| Paper     | Unsupervised Question Answering by Cloze Translation | Patrick Lewis, Ludovic Denoyer, Sebastian Riedel | 2019 |
| Framework | Scikit-learn: Machine Learning in Python | Pedregosa et al. | 2011 |
| Framework | PyTorch | Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan | 2016 |
| Framework | Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. | Hugging Face | 2018 |

LICENSE

Apache-2.0

Overview

Name With Owner: cdqa-suite/cdQA
Primary Language: Python
Program language: Python (Language Count: 1)
License: Apache License 2.0
Release Count: 6
Last Release Name: bert_qa
First Release Name: bnpp_newsroom_v1.1
Created At: 2019-01-14 10:51:31
Pushed At: 2020-04-30 14:10:11
Last Commit At: 2020-04-30 16:10:09
Stargazers Count: 612
Watchers Count: 38
Fork Count: 191
Commits Count: 249
Issues Count: 214
Issue Open Count: 57
Pull Requests Count: 141
Pull Requests Open Count: 2
Pull Requests Close Count: 12