Simple Question Answering API using spaCy and nmslib
Question answering is a very popular task in NLP. Much of the routine work done by support agents can be handled by such a system. Imagine you have a list of frequently asked questions and their answers. You can then automate much of an operator's job with the simple NLP pipeline described here.
Problem:
Given a list of questions and their answers, our goal is to build an ML model that matches incoming questions to the right answers.
Solution:
We can use text similarity: compute an embedding for every question and index the embeddings in any store that supports vector similarity search. One such library is nmslib. It is very fast and can search on the order of 10^5 samples per second (see benchmark). To compute the text embeddings we can use spaCy, a very popular NLP library. So we embed all the questions and index them with nmslib. When a new question arrives, we compute its embedding and find the closest embedding vector in the nmslib index. We assume the new question means roughly the same as the indexed question whose embedding is closest. If the distance is below some threshold, we return the answer linked to that indexed question.
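The matching rule described above can be sketched without any libraries. This is a minimal brute-force illustration, not the nmslib implementation: the function names `cosine_distance` and `nearest_answer` and the 3-dimensional vectors are made up for the example, and it assumes the distance is 1 minus cosine similarity, which is how the `cosinesimil` space used later is understood to behave.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treating it as dissimilar
        # is a choice made for this sketch.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def nearest_answer(query_vec, question_vecs, answers, max_distance=0.2):
    """Brute-force nearest neighbour with a distance threshold."""
    best_index, best_distance = -1, float("inf")
    for i, vec in enumerate(question_vecs):
        d = cosine_distance(query_vec, vec)
        if d < best_distance:
            best_index, best_distance = i, d
    # Only answer when the closest indexed question is close enough.
    if best_index != -1 and best_distance < max_distance:
        return answers[best_index]
    return "N/A"

# Toy "embeddings", invented for illustration only.
question_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
answers = ["Five", "Red"]

close = nearest_answer([0.9, 0.1, 0.0], question_vecs, answers)
far = nearest_answer([0.5, 0.5, 0.7], question_vecs, answers)
```

Here `close` resolves to the first answer because the query points almost in the same direction as the first indexed vector, while `far` falls back to "N/A" because no indexed vector is within the threshold. A library such as nmslib replaces the linear scan with an approximate index so the lookup stays fast on large collections.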
Example Data:
| | question | answer |
|---|---|---|
| 0 | Carl and the Passions changed band name to what | Beach Boys |
| 1 | How many rings on the Olympic flag | Five |
| 2 | What colour is vermilion a shade of | Red |
| 3 | King Zog ruled which country | Albania |
| 4 | What colour is Spock’s blood | Green |
| 5 | Where in your body is your patella | Knee ( it’s the kneecap ) |
| 6 | Where can you find London bridge today | USA ( Arizona ) |
| 7 | What spirit is mixed with ginger beer in a Moscow mule | Vodka |
| 8 | Who was the first man in space | Yuri Gagarin |
| 9 | What would you do with a Yashmak | Wear it - it’s an Arab veil |
| 10 | Who betrayed Jesus to the Romans | Judas Escariot |
Notebook:
Imports
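# Install the dependencies before importing them
!pip install nmslib
!pip install xlrd
# 'en' is a spaCy 2.x shortcut; on spaCy 3+ download 'en_core_web_sm' instead
!python -m spacy download en
import pandas as pd
import os
import json
import re
import spacy
import nmslib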
Read the xlsx file sheet
data = pd.read_excel('../data/QA_(0-100).xlsx', sheet_name='Sheet1', index_col=0, header=None)
data = data.iloc[:, 0:2].reset_index(drop=True)
data.columns = ['question', 'answer']
data.head(2)
| | question | answer |
|---|---|---|
| 0 | Carl and the Passions changed band name to what | Beach Boys |
| 1 | How many rings on the Olympic flag | Five |
Create QA model class
class QA(object):
    def __init__(self, data):
        self.nlp = spacy.load('en')
        self.questions = data.question.tolist()
        self.answers = data.answer.tolist()

    def to_vectors(self, texts):
        """Convert texts into their vectors"""
        return [self.nlp(item).vector for item in texts]

    def build_nmslib_index(self):
        """Build the nmslib index from the vectors of the question texts"""
        self.index = nmslib.init(method='hnsw', space='cosinesimil')
        self.index.addDataPointBatch(self.to_vectors(self.questions))
        self.index.createIndex({'post': 2}, print_progress=True)

    def search(self, text, max_distance=0.2):
        """
        K-nearest-neighbour search over the indexed question vectors, keeping
        the closest match only if its distance is below the threshold.
        Args:
            text: (str) sample question text
            max_distance: (float) maximum allowed distance for neighbours
        Returns:
            result: (dict) 'index' and 'distance' of the best match,
                or an empty dict if no neighbour is close enough
        """
        result = {}
        vector = self.nlp(text).vector
        if vector is not None:
            ids, distances = self.index.knnQuery(vector)
            if ids is not None and distances is not None:
                # Keep the nearest neighbour(s), and only if under the threshold
                best_indices_mask = (distances == distances.min()) & (distances < max_distance)
                if best_indices_mask.sum() != 0:
                    result = {'index': ids[best_indices_mask][0], 'distance': distances[best_indices_mask][0]}
        return result

    def query(self, question, max_distance=0.2):
        """Return the answer of the closest indexed question, or "N/A" """
        search_result = self.search(question, max_distance)
        index = search_result.get('index', -1)
        result = "N/A"
        if index != -1:
            result = self.answers[index]
        return result
qa = QA(data)
qa.build_nmslib_index()
qa.query('Carl and the Passions day changed band name to what', max_distance=0.05)
'Beach Boys'
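The query above contains an extra word ("day") and still matches. One way to see why: spaCy documents `Doc.vector` as defaulting to an average of the token vectors, so a single extra token shifts the result only slightly. The sketch below is a simplified picture of that averaging, not spaCy's code: `average_vector` and the 2-dimensional `toy_vectors` table are hypothetical, unknown tokens are simply skipped here, and small models without static word vectors may derive the document vector differently.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(component) / len(known) for component in zip(*known)]

# Toy word vectors, invented for illustration only.
toy_vectors = {
    "olympic": [1.0, 0.0],
    "flag": [0.0, 1.0],
    "rings": [1.0, 1.0],
}

vec = average_vector("rings olympic flag".split(), toy_vectors, dim=2)
reordered = average_vector("flag rings olympic".split(), toy_vectors, dim=2)
```

Both `vec` and `reordered` come out the same, which also shows a limitation of averaged embeddings: word order is ignored, so two questions with the same words in a different order are indistinguishable.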
data.head(2)
| | question | answer |
|---|---|---|
| 0 | Carl and the Passions changed band name to what | Beach Boys |
| 1 | How many rings on the Olympic flag | Five |
preds = data.question.apply(lambda x: qa.query(x))
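The predictions computed above can be compared with the known answers to get a quick sanity check. A minimal sketch, assuming exact string equality is an acceptable measure; `exact_match_accuracy` is a hypothetical helper and the sample lists are made up for the example.

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of predictions that are exactly equal to the expected answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(p == a for p, a in zip(predictions, answers))
    return hits / len(answers)

# Made-up sample: the third question was not matched and fell back to "N/A".
sample_preds = ["Beach Boys", "Five", "N/A"]
sample_answers = ["Beach Boys", "Five", "Red"]
score = exact_match_accuracy(sample_preds, sample_answers)
```

With the notebook objects this could be called as `exact_match_accuracy(preds.tolist(), data.answer.tolist())`. Because the queries are the indexed questions themselves, a high score only confirms that the index and threshold work; it says little about how well paraphrased questions are matched.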
Full source code can be found here