Automatically spotting interesting sentences in parliamentary debates
Jake has been volunteering at Full Fact for several weeks now. As part of our work towards more automated factchecking, we showed him the work of Simone Teufel on extracting information from scientific texts and asked whether he could do a proof of concept for a similar annotation system for profiling parliamentary committee debates. We'll leave it to Jake to tell you more...
Aggregating the huge quantity of data that government and politicians produce has the potential to cast new light on the way our political system works.
One way this might be done is by analysing the language used in parliamentary committees and in politicians' speeches.
Taking inspiration from Simone Teufel's PhD work on extracting information from scientific texts, we wondered: might it be possible to apply the same method to parliamentary committee and speech transcripts?
In her work, Teufel identified 7 different sentence categories, based on the typical ways in which arguments are made in scientific articles, which could be assigned by following a decision tree.
For example, the very first question in this decision tree was 'Does this sentence refer to own work?' If the answer is yes, this leads to the question 'Does this sentence contain material that describes the specific aim of the paper?' If the answer to this is also yes, then the sentence falls into the category of AIM. In this way she was able to produce an easy-to-follow decision tree that could be used to annotate an entire paper based on the function of the sentences in it.
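To give a flavour of how such a walk might look in code, here is a minimal Python sketch of those first two questions. Only the AIM category and the two questions come from the description above; the cue phrases and the other labels are placeholders of my own, standing in for rules (or a classifier) that a real system would use.

```python
# Minimal sketch of walking the first two questions of a decision tree.
# Only AIM comes from the description above; the other labels and the
# cue-phrase heuristics are placeholders for illustration.

def refers_to_own_work(sentence: str) -> bool:
    """Placeholder: does the sentence refer to the authors' own work?"""
    return any(cue in sentence.lower() for cue in ("we ", "our ", "this paper"))

def describes_specific_aim(sentence: str) -> bool:
    """Placeholder: does the sentence describe the specific aim of the paper?"""
    return any(cue in sentence.lower() for cue in ("we propose", "we present", "the aim of"))

def annotate(sentence: str) -> str:
    """Walk the decision tree and return a category label."""
    if refers_to_own_work(sentence):
        if describes_specific_aim(sentence):
            return "AIM"
        return "OWN_WORK_OTHER"  # placeholder for the remaining own-work branches
    return "OTHER"               # placeholder for the non-own-work branches

print(annotate("In this paper we propose a new method for parsing debates."))  # AIM
```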
The next step was to have a group of people (some trained in the task, some not) apply this decision tree to multiple papers, with a gold standard used to determine their accuracy. Teufel then attempted to automate the process by analysing clues in the text, such as term frequency or paragraph structure, that might suggest which category a sentence belongs to.
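As a rough illustration of the general idea (this is not Teufel's actual system), a few lines of Python show how term-frequency features could feed a standard classifier once some human-annotated sentences are available. The training sentences and labels here are invented for the example.

```python
# Illustrative only: a term-frequency classifier trained on sentences that
# humans have already annotated. The example data below is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical (sentence, category) pairs from human annotators.
sentences = [
    "In this paper we propose a new method for summarisation.",
    "Previous approaches have relied on hand-built rules.",
]
labels = ["AIM", "BACKGROUND"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(sentences, labels)

print(model.predict(["We present a system that extracts key sentences."]))
```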
In the most accurate system, the automated annotator achieved an accuracy of 71% compared to 76% for an untrained human and 87% for a trained one. Although there are obviously improvements to be made, such a high initial accuracy paints a very optimistic picture for the future.
There are numerous benefits to being able to apply something similar to the machinations of parliament. It opens up the possibility of tracking an MP's position on an issue, analysing the efficacy of parliamentary committees and identifying trends in government.
To test whether it was feasible, I took transcripts from parliamentary committees and tried to see if I could do something similar. To be successful, the scheme would have to (a) classify sentences into distinct categories and (b) produce categories that are of some value.
Although this is just an initial attempt, it does appear that sentences in parliamentary committees can be categorised in much the same way that Teufel categorised sentences in scientific papers. I was able to discern 10 different sentence types, as follows:
- Quoting Amendment — a member of the committee quoting an amendment
- Quoting Bill — a member of the committee quoting the Bill under discussion
- Witness evidence — a member of the committee quoting an outside source
- Argument — an argument made regarding the application of the law or the reasons for it being enforced
- Clarification — a request for clarification on an issue
- Information on current law (as it is) — a sentence describing how the law is currently applied
- Technical aspect of how the new/altered law would work — a sentence that seeks to explain how the changes to the law might be applied
- Justification — a sentence that justifies or explains the principles behind the new law
- Context — a sentence providing context for a statement
- Formality — sentences that are essential but do not impart any information
I then created a potential tree diagram for categorising them.
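As a rough sketch of how such a tree could be walked in code, here is one possible ordering of the questions in Python. The ten category names come from the list above, but the order of the questions and how each one would be answered are hypothetical, not the exact tree I drew.

```python
# Sketch of walking a categorisation tree for committee sentences.
# The category names come from the list above; the question ordering
# and question names are hypothetical.

def categorise(answers: dict) -> str:
    """answers maps question names to True/False for one sentence,
    as judged by a human annotator (or, eventually, a classifier)."""
    if answers.get("is_formality"):
        return "Formality"
    if answers.get("is_quotation"):
        if answers.get("quotes_amendment"):
            return "Quoting Amendment"
        if answers.get("quotes_bill"):
            return "Quoting Bill"
        return "Witness evidence"
    if answers.get("asks_for_clarification"):
        return "Clarification"
    if answers.get("about_current_law"):
        return "Information on current law"
    if answers.get("about_how_new_law_would_work"):
        return "Technical aspect of how the new/altered law would work"
    if answers.get("justifies_new_law"):
        return "Justification"
    if answers.get("makes_argument"):
        return "Argument"
    return "Context"

print(categorise({"is_quotation": True, "quotes_bill": True}))  # Quoting Bill
```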
This is by no means a perfect categorisation; however, when applied to parliamentary committee transcripts it met our criteria for success. The classifications capture all the types of sentence in the transcript without being so broad as to be meaningless, whilst also being distinct enough that, if the annotation scheme could be applied to all transcripts, it would provide a valuable research resource.
Of course I'm not claiming that my decision tree has anywhere near the level of sophistication of Teufel's, but as a proof of concept, I hope it might entice someone with greater expertise to look into the possibility of expanding on what I have done.
If you're interested in helping, get in touch with us at feedback@fullfact.org