This post is part of a series on extracting information from semi-structured text; the series starts here.
Introduction
Blocking and labelling using known words stored in a knowledge base lets us extract known information from a document. However, blocking and labelling fails for words that are new to the knowledge base, such as a new car brand or model in our example domain. To be able to categorise or label new words, we take a probabilistic approach: given a sizeable number of documents, we build a positional model that helps us find an appropriate label for a new word given its position in the document.
Preparing data
Before we start calculating probabilities, we aggregate a sizeable corpus of documents, let's say 50,000 or more, and run our blocking and labelling step. We then search for documents that have "unknown" blocks and remove them from our set, assuming these make up less than 10% of the corpus. If this is not true, the quality of the knowledge base needs to be improved, and we keep iterating the cycle
improve KB → block and label → measure % of documents not labelled completely → improve KB → …
till we reach this goal.
Once we pass this step, we move on to building a representative matrix of documents with each cell denoting the position of a label. For a labelled document like below –
[“sell”, “hyundai”, “santro”, “2007”, “green”, “tyres”, “leather”, “seats”, “price”, “300000”, “contact”, “91xxxxxxx”]
the row vector (of labels by position) in the matrix that denotes this document looks like
[“ad_type”, “brand”, “model”, “year”, “colour”, “features”, “price”, “contact”]
Now, to reduce the size of the matrix in memory, we may represent each label by a unique number, and so our representative row vector looks like
[2,3,4,5,6,7,8,9]
(we reserve 1 for “unknown”)
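The mapping itself can be kept in a small dictionary. A minimal sketch, assuming the label names from the example above (the `LABELS` list and `encode` helper are ours, not the author's exact encoding):

```python
# hypothetical label-to-integer mapping; code 1 is reserved for
# "unknown", and 0 is reserved for padding shorter documents
LABELS = ["unknown", "ad_type", "brand", "model", "year",
          "colour", "features", "price", "contact"]
LABEL_TO_ID = {label: i + 1 for i, label in enumerate(LABELS)}

def encode(labels):
    """Convert a list of label strings to their integer codes.
    Labels missing from the mapping fall back to 1 ("unknown")."""
    return [LABEL_TO_ID.get(label, 1) for label in labels]
```

With this mapping, the example row vector above encodes exactly to [2, 3, 4, 5, 6, 7, 8, 9].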
The length of each document may differ, so we find the longest document, and we pad all documents with 0’s to fit the matrix width. An example matrix of 10 documents across 10 labels may look like –
2 3 4 5 6 7 8 9 0 0
2 3 4 5 6 8 9 0 0 0
2 3 4 6 7 8 9 0 0 0
1 3 4 5 6 7 8 9 0 0
2 3 4 6 5 7 8 9 0 0
2 3 4 5 6 7 8 9 10 0
2 1 4 5 6 8 9 0 0 0
2 3 4 5 6 7 9 0 0 0
2 3 5 4 6 7 8 9 0 0
2 3 4 5 6 7 8 9 0 0
where each row represents a document and each column represents a position in the document, and numbers denote a label.
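Padding and stacking the encoded documents is a few lines of numpy. A sketch, assuming documents are already encoded as integer lists (`pad_to_matrix` is a name we introduce here):

```python
import numpy as np

def pad_to_matrix(encoded_docs):
    """Stack variable-length label sequences into one matrix,
    padding shorter documents with 0 on the right."""
    width = max(len(doc) for doc in encoded_docs)
    matrix = np.zeros((len(encoded_docs), width), dtype=np.int64)
    for i, doc in enumerate(encoded_docs):
        matrix[i, :len(doc)] = doc
    return matrix
```

Each row keeps its original label codes, and trailing cells are 0 up to the width of the longest document.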
Positional Model
To build a vector of positional probabilities for a label, we follow

for each position p in the document:
    P(label at p) = (number of documents with the label at position p) / (total number of documents)
Over 50,000 documents and 12 labels, with a maximum document size of 150 tokens, traditional looping does not perform well. We turn to the numpy Python library to crunch these numbers faster.
First, we derive a matrix for each label, which represents the position of this label only across documents. We call this matrix the label sieve. For the matrix above, choosing label 5, the sieve should look like –
0 0 0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
where 1 denotes the presence of label 5 in a document, and 0 denotes absence.
The following code does this for us
sieve = (matrix == 5) * 1
numpy matrices can be queried like matrix == n, which results in a boolean matrix marking the presence of n. Multiplying the result by 1 just converts the booleans into the 0/1 matrix above.
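A tiny demo of that idiom, on a made-up 2×3 matrix:

```python
import numpy as np

# comparing a matrix to a scalar yields a boolean mask;
# multiplying by 1 casts it to a 0/1 integer matrix
matrix = np.array([[2, 5, 0],
                   [5, 3, 5]])
sieve = (matrix == 5) * 1
print(sieve)
```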
Now it's easy for us to sum all column vectors in the matrix above, where each column represents a position in the document. Again, numpy provides fast calculations, as below
# sum each column (document position) of the sieve
sums = numpy.sum(sieve, axis=0)
and then we just divide by the count of documents.
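Putting the two steps together, the per-label positional vector fits in one small function (the function name is ours):

```python
import numpy as np

def positional_probabilities(matrix, label):
    """Probability of `label` appearing at each position,
    relative to the total number of documents."""
    sieve = (matrix == label) * 1          # 0/1 presence matrix
    return sieve.sum(axis=0) / matrix.shape[0]
```

Calling this once per label code gives the rows of the final positional matrix.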
Our final positional matrix may look like –
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0   (1, unknown)
0.9 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0   (2, ad_type)
0.1 0.8 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0   (3, brand)
0.0 0.1 0.8 0.1 0.0 0.0 0.0 0.0 0.0 0.0   (4, model)
0.0 0.0 0.1 0.7 0.1 0.0 0.0 0.0 0.0 0.0   (5, year)
Extras
We may derive a different model if we calculate the probability as –
for each position p in the document:
    P(label at p) = (number of documents with this label at position p) / (number of documents with any label at position p)
To make this change, we first build a sieve which denotes the presence of any label, and then sum column-wise to get a denominator vector by position.
all_labels_sieve = (matrix > 0) * 1
denominator = numpy.sum(all_labels_sieve, axis=0)
Our final probability calculation looks like
numerator = numpy.sum(sieve, axis=0)
probabilities = numpy.divide(numerator, denominator)
where numpy.divide does an element-wise division.
This calculation produces skewed numbers for columns where very few labels are present. To normalise the probabilities in that case, the previous method is better.
Performance
The traditional looping method to build this model for around 50,000 documents, 14 labels and a maximum size of 144 words (a 50,000 × 144 matrix) takes around 2 seconds with our modest Python skills. Using numpy, as coded above, we can do this in milliseconds, with the added advantage of more expressive code.
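A rough way to reproduce the comparison on random data (absolute timings will vary by machine; the sizes follow the figures above):

```python
import time
import numpy as np

matrix = np.random.randint(0, 15, size=(50000, 144))

# vectorised version: one boolean comparison plus a column sum
start = time.perf_counter()
counts_np = ((matrix == 5) * 1).sum(axis=0)
numpy_time = time.perf_counter() - start

# plain-python loop over every cell
start = time.perf_counter()
counts_loop = [0] * matrix.shape[1]
for row in matrix.tolist():
    for j, value in enumerate(row):
        if value == 5:
            counts_loop[j] += 1
loop_time = time.perf_counter() - start
```

Both versions produce identical counts; the vectorised one is typically one to two orders of magnitude faster.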