Tag Archives: pdfs

Making video and audio interviews searchable: how Pinpoint helped with one investigation

Pinpoint creates a ranking of people, organisations and locations, with the number of times each is mentioned in your uploaded documents.

MA Data Journalism student Tony Jarne spent eight months investigating exempt accommodation, collecting hundreds of documents, audio and video recordings along the way. To manage all this information, he turned to Google’s free tool Pinpoint. In a special guest post for OJB, he explains how it should be an essential part of any journalist’s toolkit.

The use of exempt accommodation — a type of housing for vulnerable people — has rocketed in recent years.

At the end of December, a select committee was set up in Parliament to look into the issue. The committee set a deadline for written evidence, which anyone who wished to could submit.

Organisations, local authorities and citizens submitted more than 125 pieces of written evidence to be taken into account by the committee. Some are only one page — others are 25 pages long.

In addition to the written evidence, I had various reports, news articles, Land Registry titles and company accounts downloaded from Companies House.

I needed a tool to organise all the documentation. I needed Pinpoint.


Telegraph plans to expand MPs database site in build up to election (Q&A)

I asked Tim Rowell, Digital Publisher at Telegraph.co.uk, three questions about how they dealt with the MPs’ expenses story online. The main headline is that the new domain hosting the expenses database – parliament.telegraph.co.uk – will expand in the run-up to the next election, along with the MP expenses database itself.

There are also curious “legal reasons” given for disabling the embed/email option on the PDFs. I’m pushing on that because I don’t see how publication on your site is different from allowing someone to embed it on their own, or email it. If you have any insight on that, let me know. [See response below]

Here are the responses in full:

When the team was going through the expenses and reporting, how was this longer term online strategy incorporated?

From day one, it was agreed that we would work towards the publication of an online database that contained not only the files themselves but also an aggregation of publicly available data (Parliament Parser, They Work for You, Register of Members Interests etc.) with our own unique data analysis.

The publication by Parliament last week of the redacted files has provided a glimpse into the scale of operation required to analyse such a volume of documentation but one has to realise that the full files contain many, many more pages.

The launch yesterday of the database is the first phase. We will, in due course, publish the full uncensored files for all 646 MPs. Crucially, the expenses investigative team of reporters spent a week aggregating and processing the data (the unique 2007/8 analysis of the Additional Costs Allowance) themselves. Integration in action again! The end result of that work is the first accurate breakdown of those ACA figures. We soon realised that this data provided a great basis upon which to build the Complete Expenses Files supplement in last Saturday’s newspaper.

Why Issuu? And why is the ‘email/embed’ option disabled for “secret documents”?

‘Secret documents’ is not our term, it is Issuu’s. We think Issuu is a great product and that it provides a fantastic user experience and have plans to use it more extensively. But for legal reasons we need to be sure that the document cannot be downloaded. By disabling the download function, Issuu automatically restricts email/embed.

[further to that:] How is publication on your site different from allowing someone to embed it on their own, or emailing it?

It is a precautionary measure. In the unlikely event that one of the source documents puts at risk the identity of a supplier or the full postcode of an MP we need to be confident that a) we can amend that file immediately and b) that the file has not been distributed more widely. For that reason, we do not want the files to be downloadable. We’d be very happy for others to embed the files in their pages but if you restrict the download option in Issuu you restrict the ability to embed.

Am I right in thinking the pages on each MP are static and so indexable by search engines, even though they’re generated from a database?

Yes. You may also notice that it is on a new domain, parliament.telegraph.co.uk. We will be enhancing our political resources over the coming months as we build up to the General Election. This application is not just for the Expenses files: we have plans to develop this area into a full service that enables our users to engage more closely with the democratic process.
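As a rough illustration of what “static pages generated from a database” can mean in practice, here is a minimal Python sketch. It is not the Telegraph’s implementation, which is not described in the answers above: the records, the field names and the `render_page` and `build_site` functions are all hypothetical. The idea is simply that each database row is written out once as a plain HTML file, so a search engine sees an ordinary page at a stable address rather than a query result.

```python
import html
from pathlib import Path

# Hypothetical records standing in for rows pulled from an expenses database.
MPS = [
    {"slug": "jane-example", "name": "Jane Example",
     "constituency": "Anytown", "aca_claimed": 12345},
    {"slug": "john-sample", "name": "John Sample",
     "constituency": "Otherville", "aca_claimed": 6789},
]


def render_page(mp: dict) -> str:
    """Return a complete, self-contained HTML page for one record."""
    name = html.escape(mp["name"])
    constituency = html.escape(mp["constituency"])
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{name}</title></head>\n"
        f"<body><h1>{name}</h1>\n"
        f"<p>Constituency: {constituency}</p>\n"
        f"<p>ACA claimed: {mp['aca_claimed']:,}</p>\n"
        "</body></html>\n"
    )


def build_site(records, out_dir):
    """Write one static .html file per record and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for mp in records:
        path = out / f"{mp['slug']}.html"
        path.write_text(render_page(mp), encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    for page in build_site(MPS, "site"):
        print(page)
```

Running the script writes one file per record into a `site` folder. Re-running it after a record changes regenerates the page, which is one way a publisher could amend a file quickly while keeping the URL the same.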

Meanwhile, A.nnotate puts all MPs expense PDFs online for free annotation

On the day that Parliament released MPs’ expenses in their ‘official’ form, I was hawking around on Twitter trying to find a good way to crowdsource analysis of the documents (this was before The Guardian’s crowdsourcing tool went live).

Central to the problem was that the expenses were presented in search-unfriendly PDFs. So I was looking for a place people could upload those PDFs and post comments, tag and annotate them.

Scribd was the obvious option: you can comment and tag – but not annotate. After a number of responses on Twitter (in particular Jen Michaels’ suggestions and Marcelo Soares, who had converted Brazilian parliamentary salaries from PDF to Excel with Able2Extract), I had one from Fred Howell of A.nnotate.

A.nnotate was indeed an ideal candidate – however, the website charges for use, which made it redundant for crowdsourcing purposes. But I was feeling cheeky…

“Perhaps you could let users do #mpexpenses for free as a great bit of PR?” I asked.

Fred saw the potential. Within a couple of hours he had twittered back:

“Put a list of all #mpexpenses pdfs for free shared online annotation at : http://a.nnotate.com/beta/mpexpenses/”

So credit to Fred – and the power of Twitter. Had The Guardian not created their tool, we would have hacked together our own platform within hours – and there lies lesson #1: the power of the web to enable people to mobilise very quickly. It also brought to mind something I said to a group of people at an event the same day: don’t obsess over the tools – the networks are more important, because through the networks you should be able to find someone who knows the tools, or how to use them.

The Guardian meant Fred’s efforts were – this week – to no avail. But in the longer term, I know who to turn to if I need a bunch of PDFs annotated – as will anyone else who saw those tweets. And anyone reading this blog post will know about A.nnotate too. So there is lesson #2: it wasn’t the PR of ‘delivering a message’ but simply ‘doing a good turn’, which in social media is the best PR there is.

But I still wish there was a free online PDF upload service that did annotation.