The future of investigative journalism: databases and algorithms


There’s a great article over at Miller-McCune on investigative journalism and what you might variously call computer-assisted reporting or database journalism. It is worth reading in full, but the really interesting stuff comes further in, and I’ve quoted it at length below:

“Bill Allison, a senior fellow at the Sunlight Foundation and a veteran investigative reporter and editor, summarizes the nonprofit’s aim as “one-click” government transparency, to be achieved by funding online technology that does some of what investigative reporters always have done: gather records and cross-check them against one another, in hopes of finding signs or patterns of problems.

“… Before he came to the Sunlight Foundation, Allison says, the notion that computer algorithms could do a significant part of what investigative reporters have always done seemed “far-fetched.” But there’s nothing far-fetched about the use of data-mining techniques in the pursuit of patterns. Law firms already use data “chewers” to parse the thousands of pages of information they get in the discovery phase of legal actions, Allison notes, looking for key phrases and terms and sorting the probative wheat from the chaff and, in the process, “learning” to be smarter in their further searches.

“Now, in the post-Google Age, Allison sees the possibility that computer algorithms can sort through the huge amounts of databased information available on the Internet, providing public interest reporters with sets of potential story leads they otherwise might never have found. The programs could only enhance, not replace, the reporter, who would still have to cultivate the human sources and provide the context and verification needed for quality journalism. But the data-mining programs could make the reporters more efficient — and, perhaps, a less appealing target for media company bean counters looking for someone to lay off. “I think that this is much more a tool to inform reporters,” Allison says, “so they can do their jobs better.”

“… After he fills the endowed chair for the Knight Professor of the Practice of Journalism and Public Policy Studies, [James] Hamilton hopes the new professor can help him grow an academic field that provides generations of new tools for the investigative journalist and public interest-minded citizen. The investigative algorithms could be based in part on a sort of reverse engineering, taking advantage of experience with previous investigative stories and corruption cases and looking for combinations of data that have, in the past, been connected to politicians or institutions that were incompetent or venal. “The whole idea is that we would be doing research and development in a scalable, open-source way,” he says. “We would try to promote tools that journalists and others could use.”

Hat tip to Nick Booth

8 thoughts on “The future of investigative journalism: databases and algorithms”

Aron Pilhofer January 15, 2009 at 2:47 pm

Interesting piece that kind of misses the mark.

Computer-assisted reporting has been around for decades, and many of the techniques John lumps under the “computational journalism” header were radical in the 1970s and ’80s, but are relatively commonplace today (at least in the States, Canada, and increasingly the U.K.).

Folks like Phil Meyer, Don Barlett and Jim Steele were using advanced analytical techniques such as those mentioned when I was still watching Sesame Street.

The “new” here aren’t the techniques. It’s how and where they are being applied.

The rule of thumb I grew up with professionally was this: The best CAR story is the one without a single number in sight. Do your analysis, crunch your numbers, write your story, but for the love of god, keep the numbers out of sight. Limit discussions of tools and techniques — to the extent that there is any — to a tiny little “nerd box” on an inside page where we can be completely certain no one will read it.

There have been hundreds of stories published based on highly advanced analytical techniques, but the goal was to make this as invisible as possible to the reader. In other words, there’s nothing new at all under the sun there.

The web offers a completely new way of bringing those sorts of analysis into the light, and exposing them in more collaborative and, frankly, useful ways for readers. Sunlight is a perfect example of this. When Bill talks about “one-click” government transparency, he is talking about creating tools that allow users to apply these techniques themselves, to do their own investigating.

The same nerdy analysis that we kept hidden in print can be brought out front-and-center on the web, and it will work. If done right. (And Sunlight does it right.)

The potential I’m most excited about is not finding some magic algorithm to substitute for human judgment, one that culls through mounds of data and tells us what’s interesting about it. (That’s a pipe dream anyway.)

What interests me is that now we can provide advanced tools and technologies to help people cull through data and judge for themselves what’s interesting. We can potentially turn every reader into an investigator.

To me, that’s the “new” in all this.

paulbradshaw January 16, 2009 at 1:16 pm

Agreed. Newspapers and the BBC at the moment are complaining about the government only releasing spreadsheets to them a day before it releases them to the public, because they then won’t have as much time to analyse them. Why they can’t do so in partnership with techy members of their userbase I don’t know.

John Mecklin January 18, 2009 at 1:12 am

I appreciate Mr. Pilhofer’s point of view, but with all due respect, what’s new about the field of computational journalism is the use of complex algorithms to scan and combine information from multiple databases that are now available over the Web but until the recent past were not. It is superficially similar to but more sophisticated and far-ranging than the matching and sorting that have long been done on government records databases obtained by CAR types. (To get all metaphoric on you: It’s the difference between 8-track tapes and the iPod.)

I agree that many of the Sunlight projects are aimed at empowering citizens to follow campaign finances, and this is a worthy effort. But data mining techniques could truly expand the efficiency and reach of the investigative journalists who are fulltime professional watchdogs. Via data mining algorithms, for example, one might deconstruct past corruption cases — let’s say Duke Cunningham’s — to find databased patterns of behavior (even seemingly innocent behavior) that could then serve as flags in the monitoring of other congressmen and -women.

As far as Barlett and Steele go: It’s kind of funny that they are mentioned as examples of how far I am behind the times. If they were doing years ago what I am just now writing about, I imagine that Bill Allison, now at the Sunlight Foundation but a researcher for Barlett and Steele for several years in Philadelphia, might have mentioned same.

Aron Pilhofer January 18, 2009 at 2:46 pm

Hi John, and thanks for the response.

I think you’re missing me here just a tad.

You certainly aren’t wrong to point to Prof. Hamilton’s work (among many others) as part of a new and exciting emerging specialty within the field of journalism. On this we are in complete agreement, though I will say not everyone feels this way. (Of course, as a former “CAR type” who is now deeply committed to this new specialty, my position on this is pretty predictable.)

And just to be clear: I’m also not raising these criticisms to suggest you or anyone else is “behind the times.” Apologies if that’s what you took away from my comment. In fact, I’m delighted to see a story like this, because, again, I do think this is one of the ways traditional journalism can thrive in the digital world. If anything, I’d say you’re ahead of the curve in identifying the significance of this work.

That said, I think what you present as a sudden explosion of innovation is actually more accurately a logical next step along a continuum that began years ago. Indeed, it may surprise you to discover that journalists have been applying these techniques to their reporting for decades, but surprising or not, they have. (And, yes, just to be clear, I’m talking about the “iPod” and not the “8-track” kind.)

As a data person, I hate imprecision. So, let’s talk specifics:

Data mining, as you know, is a blanket term for a number of techniques to extract meaningful patterns from a complex dataset or multiple datasets. An example is regression analysis, which newsrooms (in the States and Canada, anyway) have been using to great effect for decades.

Philip Meyer was among the very first to see how these powerful tools could be used to improve reporting. His seminal work, “Precision Journalism,” was published in 1972 as a direct result of work he did in 1967 following the riots in Detroit.

In 1989, Bill Dedman and the Atlanta Journal-Constitution won a Pulitzer for “Color of Money,” which utilized a sophisticated statistical analysis across multiple datasets to expose redlining practices by banks.

And many, many newspapers have used regression analysis to examine the merits of standardized testing as a means of assessing the quality of education in public schools. The Dallas Morning News recently broke a series of stories about schools that were cheating on standardized tests in order to improve their overall ranking. Again, the story began with a regression analysis that showed these schools (quite literally a needle in a haystack of data) to be outliers.
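To make the outlier idea concrete, here is a minimal sketch in Python. It is not the Dallas Morning News method, which the comment does not describe. It assumes a single hypothetical predictor (`poverty_rate`), invented school names and scores, and an arbitrary cutoff of two standard deviations. It fits a straight line to the data and lists the schools that score far above what the line predicts.

```python
from statistics import mean, pstdev

# Invented figures for illustration only: (name, poverty rate %, average test score).
SCHOOLS = [
    {"name": "School A", "poverty_rate": 10, "avg_score": 85},
    {"name": "School B", "poverty_rate": 20, "avg_score": 80},
    {"name": "School C", "poverty_rate": 30, "avg_score": 76},
    {"name": "School D", "poverty_rate": 40, "avg_score": 70},
    {"name": "School E", "poverty_rate": 50, "avg_score": 64},
    {"name": "School F", "poverty_rate": 60, "avg_score": 61},
    {"name": "School G", "poverty_rate": 70, "avg_score": 55},
    {"name": "School H", "poverty_rate": 80, "avg_score": 50},
    {"name": "School I", "poverty_rate": 90, "avg_score": 45},
    {"name": "School J", "poverty_rate": 80, "avg_score": 88},
]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def flag_outliers(schools, threshold=2.0):
    """Names of schools scoring far above the fitted line (assumed cutoff)."""
    xs = [s["poverty_rate"] for s in schools]
    ys = [s["avg_score"] for s in schools]
    intercept, slope = fit_line(xs, ys)
    # Residual = actual score minus the score the line predicts.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # every school sits exactly on the line
    return [s["name"] for s, r in zip(schools, residuals) if r / spread > threshold]


if __name__ == "__main__":
    print(flag_outliers(SCHOOLS))
```

A flagged school is a lead, not a finding. An unusually good score can have innocent explanations, so the reporting still has to follow. A real analysis would also use more predictors and far more records than this toy table.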

Steve Doig pioneered the use of GIS in 1992 when he layered wind speed data from Hurricane Andrew over millions of records of property data and damage reports. He discovered an undeniable pattern: the most heavily damaged areas were not those that experienced the highest winds. His reporting exposed shoddy building practices, lax regulation and oversight, and a pattern of corruption that kept public officials looking the other way.

(You can read more about this, and a bit of history as well, here: http://www.nieman.harvard.edu/reportsitem.aspx?id=100075.)

Jaimi Dowdell, then at The St. Louis Post-Dispatch, used social network analysis in her reporting. Matt Waite at The St. Petersburg Times earned the equivalent of a master’s degree in order to gain the GIS knowledge necessary for his series on disappearing wetlands in Florida. And, yes, Barlett and Steele have used these techniques for years; I’m actually surprised Bill (who is a friend and former colleague) didn’t mention it.

I myself co-wrote a series in the mid-1990s that used regression analysis to show how over-development was potentially responsible for an increase in the frequency and severity of flooding in central New Jersey.

I could go on, but I think it’s clear enough that this goes beyond simple “matching and sorting” as you put it. The techniques you’re talking about have been part of journalism for a very long time, we just have not (for reasons I noted above) made an overly big deal about it. Readers care little about techniques. They care about the story.

What is new (as I also note above) is the Web, which increases our ability to share data and build new and interesting online tools to help reporters and the public understand complex issues. That’s the promise of computational journalism, and where I think you missed the bullseye just a bit.

A quick note about the limits of data mining itself, which I think you describe in terms that might be a little Pollyannaish.

It would be a wonderful thing indeed for journalism if we could come up with some magic algorithm to expose corruption of the sort that tripped up Randy “Duke” Cunningham. You know, just toss a bunch of data into a funnel and out comes your Pulitzer. Sounds great in theory, but in practice we’re a long, long, long way off (and I have no doubt Prof. Hamilton would agree with me there).

The broader the question, and the more disparate (and dirty) the data sources, the harder it is to derive real meaning from the data. This is why your credit card company is often (but not always) right when they call to verify that it is indeed you charging dinner and drinks in Bermuda. And this is why Amazon’s recommendation engine is so frequently off when it suggests movies or books you might like, and why the U.S. government has such a hard time knowing who’s who on its “no fly” list.

OK, this is getting long so I’ll leave it there. I really did enjoy the piece, and look forward to reading more from you on the subject in the future.

John Mecklin January 18, 2009 at 7:01 pm

Hey Aron

A few quick responses: 1) I appreciate your history of CAR, most of which I was aware of. 2) I didn’t present computational journalism as a sudden explosion; I specifically noted that CAR has a long history. 3) I had and have no doubt that journalists run regressions on data (though usually they get their statistician friends at the local university to do them). 4) I never said that sophisticated data analysis was a magic route to a Pulitzer Prize; I just said it can make investigative reporters more efficient, which (I think you’ll agree) will be a good thing, considering the vast amounts of reporter and IT resources many of the projects you mention absorbed. 5) I considered trying to include the ethical dangers of data mining in the public interest, but the column was already very, very long. False positives will be a problem, but only for bad reporters who do not seek independent verification/explanation of data patterns.

Those minor clarifications aside, thanks for taking so much time to write in at such length and with such intelligence. I genuinely appreciate your thoughtfulness (and the citation of “Precision Journalism,” which I’ve run into but never read; now I will head right to the library to check it out).
