<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Online Journalism Blog &#187; Something for the weekend</title>
	<atom:link href="http://onlinejournalismblog.com/tag/something-for-the-weekend/feed/" rel="self" type="application/rss+xml" />
	<link>http://onlinejournalismblog.com</link>
	<description>A conversation.</description>
	<lastBuildDate>Sat, 11 Feb 2012 12:06:28 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<cloud domain='onlinejournalismblog.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
		<item>
		<title>SFTW: Scraping data with Google Refine</title>
		<link>http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/</link>
		<comments>http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/#comments</comments>
		<pubDate>Fri, 13 Jan 2012 08:27:12 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[data journalism]]></category>
		<category><![CDATA[google refine]]></category>
		<category><![CDATA[grel]]></category>
		<category><![CDATA[parsehtml]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[Something for the weekend]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=15674</guid>
		<description><![CDATA[For the first Something For The Weekend of 2012 I want to tackle a common problem when you&#8217;re trying to scrape a collection of webpages: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity: http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521 http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629 http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823 In this instance, you can see that<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p>For the first <a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">Something For The Weekend</a> of 2012 I want to tackle a common problem when you&#8217;re trying to scrape a collection of webpages: they have some sort of structure in their URLs like this, where part of the URL refers to the name or code of an entity:</p>
<ol>
<li><a href="http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521" onclick="urchinTracker('/outgoing/www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521&amp;referer=');">http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521</a></li>
<li><a href="http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629" onclick="urchinTracker('/outgoing/www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629&amp;referer=');">http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629</a></li>
<li><a href="http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823" onclick="urchinTracker('/outgoing/www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823&amp;referer=');">http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823</a></li>
</ol>
<p>In this instance, you can see that the URLs are identical apart from a 7-digit code at the end: the ID of the school the data refers to.</p>
<p>There are a number of ways you could scrape this data. You could <a title="using Google Docs and the =importXML formula" href="http://onlinejournalismblog.com/2011/10/14/scraping-data-from-a-list-of-webpages-using-google-docs/">use Google Docs and the =importXML formula</a>, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit &gt; Paste Special &gt; Values Only and then use the formula a further 50 times if it&#8217;s not too many &#8211; <a href="https://docs.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdEJ2dFF5YVY0Ml9sX3NURUM5YkdKVHc" onclick="urchinTracker('/outgoing/docs.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdEJ2dFF5YVY0Ml9sX3NURUM5YkdKVHc&amp;referer=');">here&#8217;s one I prepared earlier</a>).</p>
<p>And you could use Scraperwiki to write a powerful scraper &#8211; but you need to understand enough coding to do so quickly (<a href="https://scraperwiki.com/scrapers/free_school_meals_scotland/" onclick="urchinTracker('/outgoing/scraperwiki.com/scrapers/free_school_meals_scotland/?referer=');">here&#8217;s a demo I prepared earlier</a>).</p>
<p>A middle option is to use Google Refine, and here&#8217;s how you do it.</p>
<h2>Assembling the ingredients</h2>
<p>With the <strong>basic URL structure</strong> identified, we already have half of our ingredients. What we need next is a list of the ID codes that we&#8217;re going to use to complete each URL.</p>
<p>An <a href="http://www.google.co.uk/webhp?rlz=1C1GPCK_enGB454GB455&amp;sourceid=chrome-instant&amp;ix=heb&amp;ie=UTF-8&amp;ion=1#sclient=psy-ab&amp;hl=en&amp;rlz=1C1GPCK_enGB454GB455&amp;site=webhp&amp;source=hp&amp;q=list+seed+number+scottish+schools+filetype:xls&amp;pbx=1&amp;oq=list+seed+number+scottish+schools+filetype:xls&amp;aq=f&amp;aqi=&amp;aql=&amp;gs_sm=e&amp;gs_upl=74020l77151l0l77535l13l12l0l0l0l0l137l1079l7.5l12l0&amp;bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&amp;fp=f9ef8465024f9e21&amp;biw=1280&amp;bih=856&amp;ion=1" onclick="urchinTracker('/outgoing/www.google.co.uk/webhp?rlz=1C1GPCK_enGB454GB455_amp_sourceid=chrome-instant_amp_ix=heb_amp_ie=UTF-8_amp_ion=1_sclient=psy-ab_amp_hl=en_amp_rlz=1C1GPCK_enGB454GB455_amp_site=webhp_amp_source=hp_amp_q=list+seed+number+scottish+schools+filetype_xls_amp_pbx=1_amp_oq=list+seed+number+scottish+schools+filetype_xls_amp_aq=f_amp_aqi=_amp_aql=_amp_gs_sm=e_amp_gs_upl=74020l77151l0l77535l13l12l0l0l0l0l137l1079l7.5l12l0_amp_bav=on.2_or.r_gc.r_pw.r_cp._cf.osb_amp_fp=f9ef8465024f9e21_amp_biw=1280_amp_bih=856_amp_ion=1&amp;referer=');">advanced search for &#8220;list seed number scottish schools filetype:xls</a>&#8221; brings up a link to <a href="http://www.scotland.gov.uk/stats/sources/adds.xls" onclick="urchinTracker('/outgoing/www.scotland.gov.uk/stats/sources/adds.xls?referer=');">this spreadsheet (XLS)</a> which gives us just that.</p>
<p>The spreadsheet will need editing: <strong>remove any rows you don&#8217;t need.</strong> This will reduce the time that the scraper will take in going through them. For example, if you&#8217;re only interested in one local authority, or one type of school, sort your spreadsheet so that you can delete those above or below them.</p>
<p>Now to combine the ID codes with the base URL.</p>
<h2>Bringing your data into Google Refine</h2>
<p>Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.</p>
<p>At the top of the school ID column click on the drop-down menu and select <strong>Edit column &gt; Add column based on this column&#8230;</strong></p>
<p>In the <em>New column name</em> box at the top call this &#8216;URL&#8217;.</p>
<p>In the <em>Expression</em> box type the following piece of GREL (Google Refine Expression Language):</p>
<p>&quot;http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=&quot;+value</p>
<p>(<em>Type in the quotation marks yourself &#8211; if you&#8217;re copying them from a webpage you may have problems</em>)</p>
<p>The &#8216;value&#8217; bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.</p>
<p>In the <em>Preview</em> window you should see the results &#8211; you can even copy one of the resulting URLs and paste it into a browser to check it works. (<em>On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing &#8216;value&#8217; to </em>value.substring(0,7)<em> &#8211; this extracts the first 7 characters of the ID number, omitting the &#8216;.0&#8217;</em>)</p>
<p>Click <strong>OK</strong> if you&#8217;re happy, and you should have a new column with a URL for each school ID.</p>
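<p>For comparison, the same URL-building step can be sketched in Python. This is only an illustration, not part of the Refine workflow: the function name <em>build_url</em> is made up for the example, and it assumes the IDs may arrive either as text or as numbers with a stray .0 on the end.</p>

```python
# Base URL taken from the examples above; everything after "=" is the school ID.
BASE_URL = ("http://www.ltscotland.org.uk/scottishschoolsonline/"
            "schools/freemealentitlement.asp?iSchoolID=")


def build_url(school_id):
    """Append a school ID to the base URL, dropping a stray '.0' suffix.

    'build_url' is a hypothetical helper name used only for this sketch.
    """
    text = str(school_id).strip()
    if text.endswith(".0"):
        # A spreadsheet import can turn 5237629 into 5237629.0
        text = text[:-2]
    return BASE_URL + text


# The three IDs shown at the top of this post, one of them as a float
urls = [build_url(i) for i in ["5237521", 5237629.0, "5237823"]]
for url in urls:
    print(url)
```

<p>Trimming the .0 suffix does the same job as value.substring(0,7) in GREL, but does not depend on the ID always being 7 characters long.</p>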
<h2>Grabbing the HTML for each page</h2>
<p>Now click on the top of this new URL column and select <strong>Edit column &gt; Add column by fetching URLs&#8230;</strong></p>
<p>In the <em>New column name</em> box at the top call this &#8216;HTML&#8217;.</p>
<p>All you need in the <em>Expression</em> window is &#8216;value&#8217;, so leave that as it is.</p>
<p>Click <strong>OK</strong>.</p>
<p>Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple thousand rows here, this will take a long time &#8211; hours, depending on the speed of your computer and internet connection (it may not work at all if either isn&#8217;t very fast). So leave it running and come back to it later.</p>
<h2>Extracting data from the raw HTML with parseHTML</h2>
<p>When it&#8217;s finished you&#8217;ll have another column where each cell is a bunch of HTML. You&#8217;ll need to create a new column to extract what you need from that, and you&#8217;ll also <a href="http://code.google.com/p/google-refine/wiki/StrippingHTML" onclick="urchinTracker('/outgoing/code.google.com/p/google-refine/wiki/StrippingHTML?referer=');">need some GREL expressions explained here</a>.</p>
<p>First you need to identify what data you want, and where it is in the HTML. To find it, right-click on one of the webpages containing the data, select &#8216;view source&#8217;, and search for a key phrase or figure that you want to extract. Around that data you want to find an HTML tag like &lt;table class=&quot;destinations&quot;&gt; or &lt;div id=&quot;statistics&quot;&gt;. Keep that open in another window while you tweak the expression we come onto below&#8230;</p>
<p>Back in Google Refine, at the top of the HTML column click on the drop-down menu and select <strong>Edit column &gt; Add column based on this column&#8230;</strong></p>
<p>In the <em>New column name</em> box at the top give it a name describing the data you&#8217;re going to pull out.</p>
<p>In the <em>Expression</em> box type the following piece of GREL (Google Refine Expression Language):</p>
<p><a>value.parseHtml().select(&quot;table.destinations&quot;)[0].select(&quot;tr&quot;).toString()</a></p>
<p>(<em>Again, type the quotation marks yourself rather than copying them from here or you may have problems</em>)</p>
<p>I&#8217;ll break down what this is doing:</p>
<p><a>value.parseHtml()</a></p>
<p><em>parse the HTML in each cell (value)</em></p>
<p><a>.select(&quot;table.destinations&quot;)</a></p>
<p><em>find a table with a class (.) of &#8220;destinations&#8221; (in the source HTML this reads &lt;table class=&quot;destinations&quot;&gt;). If it was &lt;div id=&quot;statistics&quot;&gt; then you would write .select(&quot;div#statistics&quot;) &#8211; the hash sign representing an &#8216;id&#8217; and the full stop representing a &#8216;class&#8217;.</em></p>
<p>[0]</p>
<p><em>This zero in square brackets tells Refine to only grab the first table &#8211; a number 1 would indicate the second, and so on. This is because numbering (&#8220;indexing&#8221;) generally begins with zero in programming.</em></p>
<p><a>.select(&quot;tr&quot;)</a></p>
<p><em>Now, within that table, find anything within the tag &lt;tr&gt;</em></p>
<p><a>.toString()</a></p>
<p><em>And convert the results into a string of text</em>.</p>
<p>The results of that expression in the <em>Preview</em> window should look something like this:</p>
<p>&lt;tr&gt; &lt;th&gt;&lt;/th&gt; &lt;th&gt;Abbotswell School&lt;/th&gt; &lt;th&gt;Aberdeen City&lt;/th&gt; &lt;th&gt;Scotland&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;th&gt;Percentage of pupils&lt;/th&gt; &lt;td&gt;25.5%&lt;/td&gt; &lt;td&gt;16.3%&lt;/td&gt; &lt;td&gt;22.6%&lt;/td&gt; &lt;/tr&gt;</p>
<p>This is still HTML, but a much smaller and more manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the <a href="http://excelnotes.posterous.com/splitting-a-vote-or-other-piece-of-data-into" onclick="urchinTracker('/outgoing/excelnotes.posterous.com/splitting-a-vote-or-other-piece-of-data-into?referer=');">=SPLIT formula, for example</a>).</p>
<p>Or you could further tweak your GREL code in Refine to drill further into your data, like so:</p>
<p>value.parseHtml().select(&quot;table.destinations&quot;)[0].select(&quot;td&quot;)[0].toString()</p>
<p>Which would give you this:</p>
<p>&lt;td&gt;25.5%&lt;/td&gt;</p>
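<p>Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long; the two numbers are character positions counted from zero, so check them against the <em>Preview</em> window and adjust if the result is cut off at either end):</p>
<p>value.parseHtml().select(&quot;table.destinations&quot;)[0].select(&quot;td&quot;)[0].toString().substring(5,10)</p>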
<p>When you&#8217;re happy, click <strong>OK</strong> and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.</p>
<p>Then click <strong>Export</strong> in the upper right corner and save as a CSV or Excel file.</p>
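<p>If you want to see the same parse-and-select idea outside Refine, here is a rough Python sketch using only the standard library. It is not what Refine runs internally; the class name <em>DestinationsCells</em> is invented for this example, and the sample HTML is the preview output shown above wrapped in a table tag.</p>

```python
from html.parser import HTMLParser


class DestinationsCells(HTMLParser):
    """Collect the text of each <td> in the first <table class="destinations">.

    A hypothetical helper for illustration; roughly what
    .select("table.destinations")[0].select("td") picks out in GREL.
    """

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # greater than 0 while inside the target table
        self.done = False      # True once the first matching table has closed
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.table_depth or "destinations" in classes:
                self.table_depth += 1
        elif tag == "td" and self.table_depth:
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
            if self.table_depth == 0:
                self.done = True
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and self.table_depth:
            self.cells[-1] += data


SAMPLE = ('<table class="destinations"><tr><th></th><th>Abbotswell School</th>'
          '<th>Aberdeen City</th><th>Scotland</th></tr>'
          '<tr><th>Percentage of pupils</th><td>25.5%</td><td>16.3%</td>'
          '<td>22.6%</td></tr></table>')

parser = DestinationsCells()
parser.feed(SAMPLE)
first_cell = parser.cells[0].strip()   # the equivalent of select("td")[0]
print(first_cell)
```

<p>Collecting the cell text directly avoids having to count characters with .substring, which only works while every value is the same length.</p>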
<p><em><a title="Help Me Investigate Education - Scottish schools free school meals data" href="http://helpmeinvestigate.com/education/2012/01/free-school-meals-in-scottish-primary-schools-data-visualisation/" onclick="urchinTracker('/outgoing/helpmeinvestigate.com/education/2012/01/free-school-meals-in-scottish-primary-schools-data-visualisation/?referer=');">More on how this data was used on Help Me Investigate Education</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Scraping data from a list of webpages using Google Docs</title>
		<link>http://onlinejournalismblog.com/2011/10/14/scraping-data-from-a-list-of-webpages-using-google-docs/</link>
		<comments>http://onlinejournalismblog.com/2011/10/14/scraping-data-from-a-list-of-webpages-using-google-docs/#comments</comments>
		<pubDate>Fri, 14 Oct 2011 09:48:14 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[data journalism]]></category>
		<category><![CDATA[google docs]]></category>
		<category><![CDATA[importxml]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[Something for the weekend]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=15172</guid>
		<description><![CDATA[Quite often when you&#8217;re looking for data as part of a story, that data will not be on a single page, but on a series of pages. To manually copy the data from each one &#8211; or even scrape the data individually &#8211; would take time. Here I explain a way to use Google Docs to grab the data for<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2011/10/14/scraping-data-from-a-list-of-webpages-using-google-docs/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p>Quite often when you&#8217;re looking for data as part of a story, that data will not be on a single page, but on a series of pages. To manually copy the data from each one &#8211; or even scrape the data individually &#8211; would take time. Here I explain a way to use Google Docs to grab the data for you.</p>
<h2>Some basic principles</h2>
<p>Although Google Docs is a pretty clumsy tool to use to scrape webpages, the method used is much the same as if you were writing a scraper in a programming language like Python or Ruby. For that reason, I think this is a good quick way to introduce the basics of certain types of scrapers.</p>
<p>Here&#8217;s how it works:</p>
<p>Firstly, you need a list of links to the pages containing data.</p>
<p>Quite often that list might be on a webpage which links to them all, but if not you should look at whether the links have any common structure, for example &#8220;http://www.country.com/data/australia&#8221; or &#8220;http://www.country.com/data/country2&#8221;. If they do, then you can generate a list by filling in the part of the URL that changes each time (in this case, the country name or number), assuming you have a list to fill it from (i.e. a list of countries, codes or simple addition).</p>
<p>Second, you need the destination pages to have some consistent structure to them. In other words, they should look the same (although looking the same doesn&#8217;t mean they have the same structure &#8211; more on this below).</p>
<p>The scraper then cycles through each link in your list, grabs particular bits of data from each linked page (because it is always in the same place), and saves them all in one place.</p>
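<p>In a programming language, that cycle might look something like the Python sketch below. Treat it as an outline under assumptions rather than a finished scraper: <em>scrape_all</em>, <em>fake_fetch</em> and <em>between_tags</em> are names invented for the example, and the two pages and their contents are stand-ins so that it runs without an internet connection.</p>

```python
def scrape_all(urls, fetch, extract):
    """Fetch each URL in turn, pull out one value, and collect (url, value) rows."""
    rows = []
    for url in urls:
        try:
            page = fetch(url)
        except Exception as error:
            # Keep going if one page fails, but record the problem
            rows.append((url, "ERROR: %s" % error))
            continue
        rows.append((url, extract(page)))
    return rows


# Stand-in pages (invented content) so the sketch runs offline
FAKE_PAGES = {
    "http://www.country.com/data/australia": "<p id='pop'>22m</p>",
    "http://www.country.com/data/country2": "<p id='pop'>5m</p>",
}


def fake_fetch(url):
    """Pretend to download a page by looking it up in FAKE_PAGES."""
    return FAKE_PAGES[url]


def between_tags(page):
    """Grab the text between the first '>' and the next '<'."""
    return page.split(">", 1)[1].split("<", 1)[0]


# Step 1: generate the list of links from the part of the URL that changes
names = ["australia", "country2"]
urls = ["http://www.country.com/data/" + name for name in names]

# Step 2: cycle through them and save everything in one place
rows = scrape_all(urls, fake_fetch, between_tags)
for url, value in rows:
    print(url, value)
```

<p>To run it against real pages you would swap <em>fake_fetch</em> for a function that downloads the URL (for instance with urllib.request.urlopen from the standard library) and <em>between_tags</em> for an extractor that matches the structure of your pages.</p>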
<h2>Scraping with Google Docs using =importXML &#8211; a case study</h2>
<p>If you&#8217;ve not used =importXML before it&#8217;s worth catching up on my previous 2 posts <a rel="bookmark" href="http://onlinejournalismblog.com/2011/07/29/sftw-how-to-scrape-webpages-and-ask-questions-with-google-docs-and-importxml/">How to scrape webpages and ask questions with Google Docs and =importXML</a> and <a rel="bookmark" href="http://onlinejournalismblog.com/2011/08/05/sftw-asking-questions-of-a-webpage-and-finding-out-when-those-answers-change/">Asking questions of a webpage – and finding out when those answers change</a>.</p>
<p>This takes things a little bit further.</p>
<p>In this case I&#8217;m going to scrape some data for a story about local history &#8211; the data for which is helpfully <a href="http://www.dmm.org.uk/mindex.htm" onclick="urchinTracker('/outgoing/www.dmm.org.uk/mindex.htm?referer=');">published by the Durham Mining Museum</a>. Their homepage has a list of local mining disasters, with the date and cause of the disaster, the name and county of the colliery, the number of deaths, and links to the names and to a page about each colliery.</p>
<p>However, there is not enough geographical information here to map the data. That, instead, is <a href="http://www.dmm.org.uk/colliery/h029.htm" onclick="urchinTracker('/outgoing/www.dmm.org.uk/colliery/h029.htm?referer=');">provided on each colliery&#8217;s individual page</a>.</p>
<p>So we need to go through this list of webpages, grab the location information, and pull it all together into a single list.</p>
<h2>Finding the structure in the HTML</h2>
<p>To do this we need to isolate which part of the homepage contains the list. If you right-click on the page to &#8216;view source&#8217; and search for &#8216;Haig&#8217; (the first colliery listed) you can see it&#8217;s in a table that has a beginning tag like so: &lt;table border=0 align=center style=&quot;font-size:10pt&quot;&gt;</p>
<p>We can use =importXML to grab the contents of the table like so:</p>
<p>=Importxml(&quot;http://www.dmm.org.uk/mindex.htm&quot;, &quot;//table[starts-with(@style, 'font-size:10pt')]&quot;)</p>
<p>But we only want the links, so how do we grab just those instead of the whole table contents?</p>
<p>The answer is to add more detail to our request. If we look at the HTML that contains the link, it looks like this:</p>
<p>&lt;td valign=top&gt;&lt;a href=&quot;<a href="http://www.dmm.org.uk/colliery/h029.htm" target="_blank" onclick="urchinTracker('/outgoing/www.dmm.org.uk/colliery/h029.htm?referer=');">http://www.dmm.org.uk/colliery/h029.htm</a>&quot;&gt;Haig&amp;nbsp;Pit&lt;/a&gt;&lt;/td&gt;</p>
<p>So it&#8217;s within a &lt;td&gt; tag &#8211; but <em>all</em> the data in this table is, not surprisingly, contained within &lt;td&gt; tags. The key is to identify which &lt;td&gt; tag we want &#8211; and in this case, it&#8217;s always the fourth one in each row.</p>
<p>So we can add &#8220;//td[4]&#8221; (&#8216;<em>look for the fourth &lt;td&gt; tag&#8217;</em>) to our function like so:</p>
<p>=Importxml(&quot;http://www.dmm.org.uk/mindex.htm&quot;, &quot;//table[starts-with(@style, 'font-size:10pt')]//td[4]&quot;)</p>
<p>Now we should have a list of the collieries &#8211; but we want the actual URL of the page that is linked to with that text. That is contained within the value of the href attribute &#8211; or, put in plain language: it comes after the bit that says href=&quot;.</p>
<p>So we just need to add one more bit to our function: &#8220;//@href&#8221;:</p>
<p>=Importxml(&quot;http://www.dmm.org.uk/mindex.htm&quot;, &quot;//table[starts-with(@style, 'font-size:10pt')]//td[4]//@href&quot;)</p>
<p>So, reading from the far right inwards, this is what it says: &#8220;<em>Grab the value of href, within the fourth &lt;td&gt; tag on every row, of the table that has a style value of font-size:10pt</em>&#8221;</p>
<p>Note: if there was only one link in every row, we wouldn&#8217;t need to include //td[4] to specify the link we needed.</p>
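<p>To show what that query is doing step by step, here is a small Python sketch that walks the same path by hand: find the table by its style, take the fourth cell in each row, and read the href of any link inside it. It makes several assumptions: the sample markup is a tidied, well-formed stand-in for the real page (the first three cells and the whole second row are placeholders), and <em>fourth_cell_links</em> is a name invented for the example.</p>

```python
import xml.etree.ElementTree as ET

# Tidied stand-in for the museum's table. Only the Haig Pit link comes from
# the page described above; the other cells and the second row are placeholders.
SAMPLE = """<table border="0" align="center" style="font-size:10pt">
<tr><td>date</td><td>cause</td><td>county</td>
<td valign="top"><a href="http://www.dmm.org.uk/colliery/h029.htm">Haig Pit</a></td></tr>
<tr><td>date</td><td>cause</td><td>county</td>
<td valign="top"><a href="http://example.com/colliery/second.htm">Second Pit</a></td></tr>
</table>"""


def fourth_cell_links(markup):
    """Return the href of every link in the fourth <td> of each table row."""
    root = ET.fromstring(markup)
    links = []
    for table in root.iter("table"):
        # Same test as starts-with(@style, 'font-size:10pt')
        if not table.get("style", "").startswith("font-size:10pt"):
            continue
        for row in table.iter("tr"):
            cells = row.findall("td")
            if len(cells) >= 4:
                # XPath counts from 1, Python from 0: td[4] is cells[3]
                for anchor in cells[3].iter("a"):
                    href = anchor.get("href")
                    if href:
                        links.append(href)
    return links


links = fourth_cell_links(SAMPLE)
for link in links:
    print(link)
```

<p>Real webpages are rarely well-formed enough for an XML parser, so in practice you would feed the page through an HTML parser first; the logic of narrowing down from table to cell to attribute stays the same.</p>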
<h2>Scraping data from each link in a list</h2>
<p>Now we have a list &#8211; but we still need to scrape some information from each link in that list.</p>
<p>Firstly, we need to identify the location of information that we need on the linked pages. Taking <a href="http://www.dmm.org.uk/colliery/h029.htm" onclick="urchinTracker('/outgoing/www.dmm.org.uk/colliery/h029.htm?referer=');">the first page</a>, view source and search for &#8216;Sheet 89&#8217;, which are the first two words of the &#8216;Map Ref&#8217; line.</p>
<p>The HTML code around that information looks like this:</p>
<p>&lt;td valign=top&gt;(Sheet 89) NX965176, 54&amp;#176; 32&amp;#39; 35&amp;#34; N, 3&amp;#176; 36&amp;#39; 0&amp;#34; W&lt;/td&gt;</p>
<p>Looking a little further up, the table that contains this cell uses HTML like this:</p>
<p>&lt;table border=0 width=&quot;95%&quot;&gt;</p>
<p>So if we needed to scrape this information, we would write a function like this:</p>
<p>=importXML(&quot;http://www.dmm.org.uk/colliery/h029.htm&quot;, &quot;//table[starts-with(@width, '95%')]//tr[2]//td[2]&quot;)</p>
<p>&#8230;And we&#8217;d have to write it for every URL.</p>
<p>But because we have a list of URLs, we can do this much quicker by using cell references instead of the full URL.</p>
<p>So. Let&#8217;s assume that your formula was in cell C2 (<a href="https://docs.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdDZ5RTFXcThPeExLcWt6dVJLZERhLWc&amp;hl=en_GB" onclick="urchinTracker('/outgoing/docs.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdDZ5RTFXcThPeExLcWt6dVJLZERhLWc_amp_hl=en_GB&amp;referer=');">as it is in this example</a>), and the results have formed a column of links going from C2 down to C11. Now we can write a formula that looks at each URL in turn and performs a scrape on it.</p>
<p>In D2 then, we type the following:</p>
<p>=importXML(C2, &quot;//table[starts-with(@width, '95%')]//tr[2]//td[2]&quot;)</p>
<p>If you copy the cell all the way down the column, it will change the function so that it is performed on each neighbouring cell.</p>
<p>In fact, we could simplify things even further by putting the second part of the function in cell D1 &#8211; without the quotation marks &#8211; like so:</p>
<p>//table[starts-with(@width, '95%')]//tr[2]//td[2]</p>
<p>And then in D2 change the formula to this:</p>
<p>=ImportXML(C2,$D$1)</p>
<p>(The dollar signs keep the D1 reference the same even when the formula is copied down, while C2 will change in each cell)</p>
<p>Now it works &#8211; we have the data from each of 8 different pages. Almost.</p>
<h2>Troubleshooting with =IF</h2>
<p>The problem is that the structure of those pages is not as consistent as we thought: the scraper is producing extra cells of data for some, which knocks out the data that should be appearing there from other cells.</p>
<p>So I&#8217;ve used an IF formula to clean that up as follows:</p>
<p>In cell E2 I type the following:</p>
<p>=if(D2=&#8221;", ImportXML(C2,$D$1), D2)</p>
<p>Which says &#8216;<em>If D2 is empty, then run the importXML formula again and put the results here, but if it&#8217;s not empty then copy the values across</em>&#8216;</p>
<p>That formula is copied down the column.</p>
<p>But there&#8217;s still one empty column even now, so the same formula is used again in column F:</p>
<p>=if(E2=&#8221;", ImportXML(C2,$D$1), E2)</p>
<h2>A hack, but an instructive one</h2>
<p>As I said earlier, this isn&#8217;t the best way to write a scraper, but it is a useful way to start to understand how they work, and a quick method if you don&#8217;t have huge numbers of pages to scrape. With hundreds of pages, it&#8217;s more likely you will miss problems &#8211; so watch out for inconsistent structure and data that doesn&#8217;t line up.</p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2011/10/14/scraping-data-from-a-list-of-webpages-using-google-docs/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How to use the CableSearch API to quickly reference names against Wikileaks cables (SFTW)</title>
		<link>http://onlinejournalismblog.com/2011/09/09/how-to-use-the-cablesearch-api-to-quickly-reference-names-against-wikileaks-cables/</link>
		<comments>http://onlinejournalismblog.com/2011/09/09/how-to-use-the-cablesearch-api-to-quickly-reference-names-against-wikileaks-cables/#comments</comments>
		<pubDate>Fri, 09 Sep 2011 12:33:25 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[data journalism]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[cables]]></category>
		<category><![CDATA[cablesearch]]></category>
		<category><![CDATA[google refine]]></category>
		<category><![CDATA[grel]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[Something for the weekend]]></category>
		<category><![CDATA[Wikileaks]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=15134</guid>
		<description><![CDATA[CableSearch is a neat project by the European Centre for Computer Assisted Research and VVOJ (the Dutch-Flemish association for investigative journalists) which aims to make it easier for journalists to interrogate the Wikileaks cables. Although it&#8217;s been around for some time, I&#8217;ve only just noticed the site&#8217;s API, so I thought I&#8217;d show how such an API can be useful as a<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2011/09/09/how-to-use-the-cablesearch-api-to-quickly-reference-names-against-wikileaks-cables/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p><img src="http://searchengineland.com/figz/wp-content/seloads/2010/12/logo-cablesearch.png-PNG-Image-473x172-pixels-300x145.jpg" alt="Cablesearch logo" /></p>
<p><a href="http://cablesearch.org/" onclick="urchinTracker('/outgoing/cablesearch.org/?referer=');">CableSearch</a> is a neat project by the European Centre for Computer Assisted Research and <a href="http://www.vvoj.nl/cms/vvoj-english/contact-us" onclick="urchinTracker('/outgoing/www.vvoj.nl/cms/vvoj-english/contact-us?referer=');">VVOJ</a> (the Dutch-Flemish association for investigative journalists) which aims to make it easier for journalists to interrogate the Wikileaks cables. Although it&#8217;s been around for some time, I&#8217;ve only just noticed the site&#8217;s API, so I thought I&#8217;d show how such an API can be useful as a way to draw on such data sources to complement data of your own.<span id="more-15134"></span></p>
<h2>Example question: &#8220;How many Swedish party leaders are mentioned in the cables?&#8221;</h2>
<p>There&#8217;s no particular reason why I picked Sweden, but this is an exercise you could do with any list &#8211; MPs, cabinet members, organisational heads, etc.</p>
<p>First, you need to grab the list. I did so by <a href="http://excelnotes.posterous.com/scraping-a-table-from-a-webpage-using-importh" onclick="urchinTracker('/outgoing/excelnotes.posterous.com/scraping-a-table-from-a-webpage-using-importh?referer=');">using the =importHTML formula</a> on <a href="http://en.wikipedia.org/wiki/List_of_members_of_the_parliament_of_Sweden,_2010%E2%80%932014" onclick="urchinTracker('/outgoing/en.wikipedia.org/wiki/List_of_members_of_the_parliament_of_Sweden_2010_E2_80_932014?referer=');">this Wikipedia page</a>. You would obviously need to check that. Alternatively, you could <a href="http://onlinejournalismblog.com/2011/07/29/sftw-how-to-scrape-webpages-and-ask-questions-with-google-docs-and-importxml/">use =importXML</a> on <a href="http://www.sweden.gov.se/sb/d/10893/a/109925" onclick="urchinTracker('/outgoing/www.sweden.gov.se/sb/d/10893/a/109925?referer=');">this official Swedish parliament page</a> for a list of ministers.</p>
<p>(I&#8217;m not going to repeat these processes as you can read how to do these by clicking through to the links explaining them above)</p>
<p><a href="https://docs.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdEh3d21WYjF2S1gxNW1ZRGo5eC1qeGc&amp;hl=en_GB" onclick="urchinTracker('/outgoing/docs.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdEh3d21WYjF2S1gxNW1ZRGo5eC1qeGc_amp_hl=en_GB&amp;referer=');">Here are the results</a>. As often happens with Wikipedia tables, the first row is shifted so the headings don&#8217;t quite match the columns below. As we only need a list of names we don&#8217;t have to correct that. (For the =importXML scrape, you&#8217;ll also encounter a problem with accented characters, but this will still be quicker to correct than if we were manually copying the list across)</p>
<p>Now download that spreadsheet as a CSV file, and open up Google Refine.</p>
<h2>Testing with the API</h2>
<p>I&#8217;ve previously explained <a href="http://onlinejournalismblog.com/2011/03/18/getting-full-addresses-for-school-data-in-an-foi-response/">how to use Google Refine with the APIs of Google Maps</a>, <a href="http://onlinejournalismblog.com/2010/12/16/adding-geographical-information-to-a-spreadsheet-based-on-postcodes-google-refine-and-apis/">UK-Postcodes</a>, and <a href="http://onlinejournalismblog.com/2011/07/22/how-to-grab-useful-political-data-with-the-they-work-for-you-api/">They Work For You (UK politics)</a>.</p>
<p>The <a href="http://cablesearch.org/?page_id=242" onclick="urchinTracker('/outgoing/cablesearch.org/?page_id=242&amp;referer=');">CableSearch API page</a> is pretty straightforward if you&#8217;ve followed any of those &#8211; but it&#8217;s key that you test what results Google Refine provides against what you get from a manual search (and make sure you have a test that provides unusual results &#8211; in this case, anything with fewer than 10 results).</p>
<p>In particular, testing reveals that your search term needs to first be formatted in a particular way to avoid you getting the wrong results.</p>
<h2>Formatting your data</h2>
<p>So in our data we have a list of names &#8211; but if we just run them through CableSearch we will get results where those names do not appear together. In other words, a search for John Jones will bring back results where <em>anyone </em>called John and<em> anyone</em> called Jones is mentioned.</p>
<p>The normal solution is to <strong>put quotation marks around the search term</strong>, to ensure that only results containing that exact phrase are returned, i.e. &#8220;John Jones&#8221;.</p>
<p>With an API where we are constructing a URL, however, that space can cause problems because a URL cannot contain a space. <strong>We need to replace it with a code for a space: %20</strong> (if you do a search for anything containing a space, you will notice that %20 will sometimes appear in the URL for the results in its place; at other times a + sign will replace the space)</p>
<p>So, here&#8217;s how to reformat the text accordingly:</p>
<ol>
<li>Click on the arrow at the top of your column of names, and select <strong>Edit Column &gt; Add column based on this column&#8230;</strong></li>
<li>In the window that appears type the following code: <strong>&#39;&quot;&#39;+value.split(&quot; &quot;).join(&quot;%20&quot;)+&#39;&quot;&#39;</strong></li>
<li>Give the column a name and click OK.</li>
</ol>
<p>The start and end may be difficult to see, so here it is with spaces in between:</p>
<p><strong>&#39; &quot; &#39;</strong></p>
<p>You&#8217;ll see that it&#8217;s a single inverted comma followed by double inverted commas and a further single inverted comma. That adds double inverted commas at the start and end of our new data.</p>
<p>The rest of the code splits the original data wherever there is a space (&quot; &quot;) and joins the resulting fragments together with &quot;%20&quot;.</p>
<p>And so John Jones becomes &quot;John%20Jones&quot; &#8211; which will work in the API (one cell has 2 names, however, which you will need to clean up).</p>
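<p>If you want to check the formatting step outside Google Refine, the same transformation can be sketched in a few lines of Python. This is only an illustration: the function name <code>format_name</code> is made up here and is not part of Refine or the CableSearch API.</p>

```python
def format_name(name):
    """Mirror the GREL expression: wrap the name in double quotes
    and replace each space with %20."""
    return '"' + "%20".join(name.split(" ")) + '"'


print(format_name("John Jones"))  # prints "John%20Jones", double quotes included
```

<p>As in Refine, the text is split on single spaces and rejoined with %20, so a cell containing two names will still need cleaning up by hand first.</p>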
<h2>Grabbing from the API</h2>
<p>Now that we have properly formatted text we can ask the CableSearch API for the information it has on each name. Here&#8217;s how:</p>
<ol>
<li>Click on the arrow at the top of the newly created column of formatted names, and select <strong>Edit Column &gt; Add column by fetching URLs</strong></li>
<li>In the window that appears type the following code: <strong>&quot;http://cablesearch.org/cable/api/search?q=&quot;+value</strong></li>
<li>Give the column a name and click OK.</li>
</ol>
<p>It will now go and fetch data for each name, which may take a few minutes (or more, depending how many names you have).</p>
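<p>As a rough sketch of what Refine is doing for each row, the Python below builds the same URL from a formatted name. It does not fetch anything. The function names are hypothetical, and the second function is an alternative that this tutorial does not use: it lets the standard library encode the whole phrase, quotation marks included. Whether the CableSearch API treats that form identically is an assumption you would need to test.</p>

```python
from urllib.parse import quote

# Base address taken from the step above.
API_BASE = "http://cablesearch.org/cable/api/search?q="


def search_url(formatted_name):
    """Same as the Refine expression: base address plus the cell value."""
    return API_BASE + formatted_name


def search_url_encoded(raw_name):
    """Alternative (not the tutorial's method): percent-encode the whole
    quoted phrase, so the quotation marks become %22 and spaces %20."""
    return API_BASE + quote('"' + raw_name + '"')


print(search_url('"John%20Jones"'))
print(search_url_encoded("John Jones"))
```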
<p>When it&#8217;s finished you should have a column of cells containing JSON data. It will be very hard to look at (<a href="http://onlinejournalismblog.com/2011/04/14/data-for-journalists-json-for-beginners/">more on how to read JSON here</a>) but that&#8217;s OK because we&#8217;re going to create a final column to extract the piece of data we want.</p>
<h2>Extracting from the JSON</h2>
<p>The process should be familiar by now:</p>
<ol>
<li>Click on the arrow at the top of the newly created column of JSON data, and select <strong>Edit Column &gt; Add column based on this column&#8230;</strong></li>
<li>In the window that appears type the following code: <strong>value.parseJson().info.items</strong></li>
<li>Give the column a name and click OK.</li>
</ol>
<p>This will create a new column which just tells you how many results there are for each name. Where it says &#8216;10&#8217; there are probably more (that&#8217;s the maximum value &#8211; sadly the API doesn&#8217;t return any information on total records, although <a href="http://cablesearch.org/?page_id=242" onclick="urchinTracker('/outgoing/cablesearch.org/?page_id=242&amp;referer=');">the API page</a> details one way you can continue to cycle through pages of results beyond the first 10).</p>
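<p>The extraction step can be sketched in Python too. The sample response below is hypothetical: only the <code>info.items</code> field comes from this tutorial, and the sketch assumes it holds a number. The helper names are made up for illustration.</p>

```python
import json

MAX_ITEMS = 10  # the first page of results holds at most 10


def count_results(response_text):
    """Equivalent of value.parseJson().info.items in GREL."""
    return json.loads(response_text)["info"]["items"]


def describe(count):
    """Flag counts that hit the cap, since the true total may be higher."""
    return str(count) + "+" if count >= MAX_ITEMS else str(count)


# Hypothetical response text; a real one contains far more detail.
sample = '{"info": {"items": 10}, "results": []}'
print(describe(count_results(sample)))  # prints 10+
```

<p>Marking capped counts with a plus sign is one way to remind yourself which names need a manual follow-up search.</p>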
<p>This enables you to take a list of names and quickly find out which ones are mentioned in the cables at all, and which ones have been mentioned just a few times &#8211; saving you lots of searches, and time, and allowing you to narrow the focus of your work.</p>
<p>A more powerful API would allow you to narrow your focus further: by date range, for example, or source, urgency or classification. The broader point is: this is why APIs are useful. Knowing how to use them (and <a href="http://www.programmableweb.com/apis/directory/1?sort=date&amp;pagesize=25" onclick="urchinTracker('/outgoing/www.programmableweb.com/apis/directory/1?sort=date_amp_pagesize=25&amp;referer=');">which ones there are</a>) simply gives you another way to do a job better.</p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2011/09/09/how-to-use-the-cablesearch-api-to-quickly-reference-names-against-wikileaks-cables/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>SFTW: 9 data journalism tools</title>
		<link>http://onlinejournalismblog.com/2011/08/19/sftw-9-data-journalism-tools/</link>
		<comments>http://onlinejournalismblog.com/2011/08/19/sftw-9-data-journalism-tools/#comments</comments>
		<pubDate>Fri, 19 Aug 2011 10:26:18 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[data journalism]]></category>
		<category><![CDATA[buzzdata]]></category>
		<category><![CDATA[cleaning]]></category>
		<category><![CDATA[data wrangler]]></category>
		<category><![CDATA[datamarket]]></category>
		<category><![CDATA[google news scraper]]></category>
		<category><![CDATA[impure]]></category>
		<category><![CDATA[junar]]></category>
		<category><![CDATA[metadata extraction tool]]></category>
		<category><![CDATA[roambi]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[Something for the weekend]]></category>
		<category><![CDATA[tools]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[zanran]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=15048</guid>
		<description><![CDATA[There have been quite a few tools springing up over the past few months that I&#8217;ve not had time to blog about, so here&#8217;s a roundup post on all of them &#8211; a bumper Something For The Weekend (let me know how you find these). 1. Junar &#8211; for scraping websites and sharing data Junar presents a much easier way<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2011/08/19/sftw-9-data-journalism-tools/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p>There have been quite a few tools springing up over the past few months that I&#8217;ve not had time to blog about, so here&#8217;s a roundup post on all of them &#8211; a bumper <a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">Something For The Weekend</a> (let me know how you find these).</p>
<h2>1. Junar &#8211; for scraping websites and sharing data</h2>
<p><a href="http://www.junar.com/" onclick="urchinTracker('/outgoing/www.junar.com/?referer=');">Junar</a> presents a much easier way to scrape data from online tables with its &#8216;<a href="http://www.junar.com/datastreams/create" onclick="urchinTracker('/outgoing/www.junar.com/datastreams/create?referer=');">Collect Data</a>&#8217; tool &#8211; and the team behind it tell me they have plans to build functionality allowing users to scrape linked pages, as well as the ability to scrape PDFs.<span id="more-15048"></span></p>
<h2>2. BuzzData &#8211; for sharing data</h2>
<p><a href="http://buzzdata.com/" onclick="urchinTracker('/outgoing/buzzdata.com/?referer=');">BuzzData</a> is a platform for sharing data &#8211; essentially a social network where you can follow other data journalists or datasets, tag and license your data, and &#8211; importantly &#8211; add visualisations, articles and attachments. When someone else builds on your data, it tells you, which is nice.</p>
<h2>3. DataMarket &#8211; for finding data</h2>
<p><a href="http://datamarket.com/" onclick="urchinTracker('/outgoing/datamarket.com/?referer=');">DataMarket</a> is exactly what it says on the tin: a market for data from organisations including the UN, BP, Eurostat, the IMF, USGS, and various other acronyms. You can access the data for free, or pay for extra functionality such as exporting to Excel.</p>
<h2>4. Google News Scraper &#8211; for grabbing data on news coverage</h2>
<p><a href="https://tools.issuecrawler.net/beta/googleNews/" onclick="urchinTracker('/outgoing/tools.issuecrawler.net/beta/googleNews/?referer=');">This scraper</a> will allow you to gather data on coverage of a particular issue, event or person. It only gathers the teaser text but the country data may be useful if you want to map coverage, while the URLs can provide a starting point for further scraping experiments.</p>
<h2>5. Metadata extraction tool &#8211; a first step for searching document dumps?</h2>
<p><a href="http://meta-extractor.sourceforge.net/" onclick="urchinTracker('/outgoing/meta-extractor.sourceforge.net/?referer=');">This</a> is aimed at file preservation activities, but it has a few possible applications for journalists. Firstly, it has a Windows interface for exploring the metadata of a bunch of files, making it possible to sort in different ways to more quickly look for information you&#8217;re seeking. Secondly, the generation of an XML file will give some structure which could allow you to, for example, plot your documents on a timeline, spotting patterns or outliers.</p>
<h2>6. Roambi &#8211; data visualisation on your iPhone</h2>
<p>Sadly, <a href="http://www.roambi.com/" onclick="urchinTracker('/outgoing/www.roambi.com/?referer=');">it&#8217;s only <em>your</em> iPhone</a>, not anyone else&#8217;s, so this is more if you&#8217;re on the move but want to go through some private data visualisations which might hide a story.</p>
<h2>7. Data Wrangler &#8211; web-based data cleaning tool</h2>
<p><a href="http://vis.stanford.edu/wrangler/" onclick="urchinTracker('/outgoing/vis.stanford.edu/wrangler/?referer=');">This</a> looks pretty powerful, if not pretty full stop. Here&#8217;s a video:</p>
<p>http://vimeo.com/19185801</p>
<h2>8. Impure &#8211; visual programming language</h2>
<p>From the About page:</p>
<p>&#8220;<a href="http://www.impure.com/" onclick="urchinTracker('/outgoing/www.impure.com/?referer=');">Impure</a> is a visual programming language aimed to gather, process and visualize information. With impure is possible to obtain information from very different sources; from user owned data to diverse feeds in internet, including social media data, real time or historical financial information, images, news, search queries and many more. Impure is a tool to be in touch with data around internet, to deeply understand it. Within a modular logic interface you can quickly link information to operators, controls and visualization methods, bringing all the power of the comprehension of information and knowledge to the not programmers that want to work with information in a professional way.&#8221;</p>
<h2>9. Zanran &#8211; PDF/spreadsheet/table search engine</h2>
<p><a href="http://www.zanran.com/q/" onclick="urchinTracker('/outgoing/www.zanran.com/q/?referer=');">This looks a very useful tool</a> for narrowing down searches to PDFs, spreadsheets, and tables within webpages (the advanced search allows further narrowing by filetype, date, server location and site). Clever stuff behind it &#8211; particularly in the way it looks at images and decides if they&#8217;re charts. The site says they plan to add Word documents and PowerPoint presentations soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2011/08/19/sftw-9-data-journalism-tools/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>SFTW: How to scrape webpages and ask questions with Google Docs and =importXML</title>
		<link>http://onlinejournalismblog.com/2011/07/29/sftw-how-to-scrape-webpages-and-ask-questions-with-google-docs-and-importxml/</link>
		<comments>http://onlinejournalismblog.com/2011/07/29/sftw-how-to-scrape-webpages-and-ask-questions-with-google-docs-and-importxml/#comments</comments>
		<pubDate>Fri, 29 Jul 2011 08:24:51 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[data journalism]]></category>
		<category><![CDATA[google docs]]></category>
		<category><![CDATA[importxml]]></category>
		<category><![CDATA[openlylocal]]></category>
		<category><![CDATA[Something for the weekend]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xpath]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=14943</guid>
		<description><![CDATA[Here&#8217;s another Something for the Weekend post. Last week I wrote a post on how to use the =importFeed formula in Google Docs spreadsheets to pull an RSS feed (or part of one) into a spreadsheet, and split it into columns. Another formula which performs a similar function more powerfully is =importXML. There are at least 2 distinct journalistic uses<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2011/07/29/sftw-how-to-scrape-webpages-and-ask-questions-with-google-docs-and-importxml/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<div class="wp-caption alignnone" style="width: 301px"><a href="http://www.flickr.com/photos/dullhunk/3448804778/" onclick="urchinTracker('/outgoing/www.flickr.com/photos/dullhunk/3448804778/?referer=');"><img src="http://farm4.static.flickr.com/3663/3448804778_6fc1876655_o.png" alt="XML puzzle cube" width="291" height="300" /></a><p class="wp-caption-text">Image by dullhunk on Flickr</p></div>
<p>Here&#8217;s another <a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">Something for the Weekend</a> post. Last week I wrote a post on <a href="http://onlinejournalismblog.com/2011/07/20/how-to-collaborate-or-crowdsource-by-combining-delicious-and-google-docs/">how to use the =importFeed formula in Google Docs spreadsheets</a> to pull an RSS feed (or part of one) into a spreadsheet, and split it into columns.  Another formula which performs a similar function more powerfully is <strong>=importXML</strong>.</p>
<p>There are at least 2 distinct journalistic uses for =importXML:</p>
<ol>
<li>You have found information that is only available in XML format and need to put it into a standard spreadsheet to interrogate it or combine it with other data.</li>
<li>You want to extract some information from a webpage &#8211; perhaps on a regular basis &#8211; and put that in a structured format (a spreadsheet) so you can more easily ask questions of it.</li>
</ol>
<p>The first task is the easiest, so I&#8217;ll explain how to do that in this post. I&#8217;ll use a separate post to explain the latter.<span id="more-14943"></span></p>
<h2>Converting an XML feed into a table</h2>
<p>If you have some information in XML format it helps if you have some understanding of how XML is structured. A backgrounder on how to understand XML is covered in this post explaining <a href="http://onlinejournalismblog.com/2011/04/11/data-for-journalists-understanding-xml-and-rss/">XML for journalists</a>.</p>
<p>It also helps if you are using a browser which is good at displaying XML pages: Chrome, for example, not only staggers and indents different pieces of information, but also allows you to expand or collapse parts of that, and colours elements, values and attributes (which we&#8217;ll come on to below) differently.</p>
<p>Say, for example, you wanted a spreadsheet of UK council data, including latitude, longitude, CIPFA code, and so on &#8211; and you found the data, but it was in XML format at a page like this:  <a href="http://openlylocal.com/councils/all.xml" onclick="urchinTracker('/outgoing/openlylocal.com/councils/all.xml?referer=');">http://openlylocal.com/councils/all.xml</a></p>
<p>To pull that into a neatly structured spreadsheet in Google Docs, type the following into the cell where you want the import to begin (try typing in cell A2, leaving the first row free for you to add column headers):</p>
<p><strong>=ImportXML(&quot;http://openlylocal.com/councils/all.xml&quot;, &quot;//council&quot;)</strong></p>
<p>The formula (or, more accurately, function) needs two pieces of information, which are contained in the parentheses and separated by a comma: a web address (URL), and a query. Or, put another way:</p>
<p>=importXML(&quot;theURLinQuotationMarks&quot;, &quot;theBitWithinTheURLthatYouWant&quot;)</p>
<p>The URL is relatively easy &#8211; it is the address of the XML file you are reading (it should end in .xml). The query needs some further explanation.</p>
<p>The query tells Google Docs which bit of the XML you want to pull out. It uses a language called <strong>XPath</strong> &#8211; but don&#8217;t worry, you will only need to note down a few queries for most purposes.</p>
<p>Here&#8217;s an example of part of that XML file shown in the Chrome browser:</p>
<p><a href="http://onlinejournalismblog.com/wp-content/uploads/2011/07/Picture-12.png"><img class="alignnone size-full wp-image-14968" title="XML from OpenlyLocal" src="http://onlinejournalismblog.com/wp-content/uploads/2011/07/Picture-12.png" alt="XML from OpenlyLocal" width="471" height="190" /></a></p>
<p>The indentation and triangles indicate the way the data is structured. So, the &lt;councils&gt; tag contains at least one item called &lt;council&gt; (if you scrolled down, or clicked on the triangle to collapse &lt;council&gt; you would see there are a few hundred).</p>
<p>And each &lt;council&gt; contains an &lt;address&gt;, &lt;authority-type&gt;, and many other pieces of information.</p>
<p>If you wanted to grab every &lt;council&gt; from this XML file, then, you use the query &#8220;//council&#8221; as shown above. Think of the // as a replacement for the &lt; in a tag &#8211; you are saying: &#8216;grab the contents of every item that begins &lt;council&gt;&#8217;.</p>
<p>You&#8217;ll notice that in your spreadsheet where you have typed the formula above, it gathers the contents (called a value) of each tag within &lt;council&gt;, each tag&#8217;s value going into its own column &#8211; giving you dozens of columns.</p>
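<p>To see roughly what the //council query is doing, here is a minimal Python sketch using the standard library. The XML is a hypothetical, much-shortened sample in the shape described above (made-up council names, only two tags each), not the real OpenlyLocal file.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file has a few hundred councils
# and many more tags inside each one.
SAMPLE = """
<councils>
  <council>
    <name>Council A</name>
    <authority-type>Unitary</authority-type>
  </council>
  <council>
    <name>Council B</name>
    <authority-type>District</authority-type>
  </council>
</councils>
"""

root = ET.fromstring(SAMPLE)

# ".//council" is how ElementTree spells the //council query.
# Each council becomes a row; each tag inside it becomes a column.
rows = [[child.text for child in council] for council in root.findall(".//council")]

for row in rows:
    print(row)
```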
<p>You can continue this logic to look for tags within tags. For example, if you wanted to grab the &lt;name&gt; value from within each &lt;council&gt; tag, you could use:</p>
<p><strong>=ImportXML(&quot;http://openlylocal.com/councils/all.xml&quot;, &quot;//council//name&quot;)</strong></p>
<p>You would then only have one column, containing the names of all the councils &#8211; if that&#8217;s all you wanted. You could of course adapt the formula again in cell B2 to pull another piece of information. However, you may <a href="https://spreadsheets0.google.com/spreadsheet/pub?hl=en_GB&amp;hl=en_GB&amp;key=0ApTo6f5Yj1iJdGFuMVlsZzZySHNqZGhkdjNPMENtRHc&amp;single=true&amp;gid=10&amp;output=html" onclick="urchinTracker('/outgoing/spreadsheets0.google.com/spreadsheet/pub?hl=en_GB_amp_hl=en_GB_amp_key=0ApTo6f5Yj1iJdGFuMVlsZzZySHNqZGhkdjNPMENtRHc_amp_single=true_amp_gid=10_amp_output=html&amp;referer=');">end up with a mismatch of data</a> where that information is missing &#8211; so it&#8217;s always better to grab all the XML once, then clean it up on a copy.</p>
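<p>The mismatch problem is easier to see in a small sketch. The Python below uses a hypothetical sample in which one council has no &lt;address&gt; tag: pulling names and addresses with two separate queries gives lists of different lengths, whereas reading each council as a unit keeps its values together. The sample data and variable names are made up for illustration.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second council has no <address> tag.
SAMPLE = """
<councils>
  <council><name>Council A</name><address>1 High Street</address></council>
  <council><name>Council B</name></council>
  <council><name>Council C</name><address>3 Low Road</address></council>
</councils>
"""

root = ET.fromstring(SAMPLE)

# Two separate queries, like two separate =importXML columns.
names = [el.text for el in root.findall(".//council/name")]
addresses = [el.text for el in root.findall(".//council/address")]
print(len(names), len(addresses))  # the two columns differ in length

# Reading each council as a unit keeps name and address on the same row,
# with an empty string where the address is missing.
aligned = [
    (council.findtext("name"), council.findtext("address", default=""))
    for council in root.findall(".//council")
]
for row in aligned:
    print(row)
```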
<p>If the XML is more complex then you can ask more complex questions &#8211; which I&#8217;ll cover in the second part of this post. You can also put the URL and/or query in other cells to simplify matters, e.g.</p>
<p><strong>=ImportXML(A1, B1)</strong></p>
<p>Where cell A1 contains <strong>http://openlylocal.com/councils/all.xml</strong> and B1 contains <strong>//council</strong> (note the lack of quotation marks). You then only need to change the contents of A1 or B1 to change the results, rather than having to edit the formula directly.</p>
<p>If you&#8217;ve any other examples, ideas or corrections, let me know. Meanwhile, <a href="https://spreadsheets.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdGFuMVlsZzZySHNqZGhkdjNPMENtRHc&amp;hl=en_GB" onclick="urchinTracker('/outgoing/spreadsheets.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdGFuMVlsZzZySHNqZGhkdjNPMENtRHc_amp_hl=en_GB&amp;referer=');">I&#8217;ve published an example spreadsheet demonstrating all the above techniques here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2011/07/29/sftw-how-to-scrape-webpages-and-ask-questions-with-google-docs-and-importxml/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>SFTW: How to grab useful political data with the They Work For You API</title>
		<link>http://onlinejournalismblog.com/2011/07/22/how-to-grab-useful-political-data-with-the-they-work-for-you-api/</link>
		<comments>http://onlinejournalismblog.com/2011/07/22/how-to-grab-useful-political-data-with-the-they-work-for-you-api/#comments</comments>
		<pubDate>Fri, 22 Jul 2011 08:35:47 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[data journalism]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[constituencies]]></category>
		<category><![CDATA[google refine]]></category>
		<category><![CDATA[grel]]></category>
		<category><![CDATA[guardian api]]></category>
		<category><![CDATA[Politics]]></category>
		<category><![CDATA[Something for the weekend]]></category>
		<category><![CDATA[they work for you]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=14930</guid>
		<description><![CDATA[It&#8217;s been over 2 years since I stopped doing the &#8216;Something for the Weekend&#8217; series. I thought I would revive it with a tutorial on They Work For You and Google Refine&#8230; If you want to add political context to a spreadsheet – say you need to know what political parties a list of constituencies voted for, or the MPs<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2011/07/22/how-to-grab-useful-political-data-with-the-they-work-for-you-api/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p><img src="http://www.theyworkforyou.com/images/logo.png" alt="They Work For You" /></p>
<p><em>It&#8217;s been over 2 years since I stopped doing the &#8216;<a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">Something for the Weekend&#8217; series</a>. I thought I would revive it with a tutorial on They Work For You and Google Refine&#8230;<br />
</em><br />
If you want to add political context to a spreadsheet – say you need to know what political parties a list of constituencies voted for, or the MPs for those constituencies – the <a href="http://www.theyworkforyou.com/api/" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/?referer=');">They Work For You API</a> can save you hours of fiddling &#8211; if you know how to use it.</p>
<p>An API is – for the purposes of journalists – a way of asking questions of reams of data. For example, you can use an API to ask “What constituency is each of these postcodes in?” or “When did these politicians enter office?” or even “Can you show me an image of these people?”</p>
<p>The They Work For You API will give answers to a range of UK political questions on subjects including Lords, MLAs (Members of the Legislative Assembly in Northern Ireland), MPs, MSPs (Members of the Scottish Parliament), select committees, debates, written answers, statements and constituencies.</p>
<p>When you combine that API with <strong>Google Refine</strong> you can fill a whole spreadsheet with additional political data, allowing you to answer questions you might otherwise not be able to.</p>
<p>I’ve written before on <a href="http://onlinejournalismblog.com/2011/03/18/getting-full-addresses-for-school-data-in-an-foi-response/">how to use Google Refine to pull data into a spreadsheet from the Google Maps API</a> and <a href="http://onlinejournalismblog.com/2010/12/16/adding-geographical-information-to-a-spreadsheet-based-on-postcodes-google-refine-and-apis/">the UK Postcodes API</a>, but this post takes things a bit further because the They Work For You API requires something called a ‘key’. This is quite common with APIs so knowing how to use them is &#8211; well &#8211; <em>key</em>. If you need extra help, try those tutorials first.<span id="more-14930"></span></p>
<h2>The They Work For You API key</h2>
<p>Unlike the previous APIs I’ve written about, the They Work For You API requires you to register for a ‘key’ to use it. If you don’t understand how this works the <a href="http://www.theyworkforyou.com/api/" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/?referer=');">instructions on the TWFY website</a> can be a little confusing. So here’s how it works:</p>
<p>The key is a password of sorts, used when you ask the API a question.</p>
<p>As your ‘question’ takes the form of a web address (URL) then that key needs to be included at a particular part of that URL.</p>
<p>You’ll see how that works when we get to asking the API questions. But first, go to <a href="http://www.theyworkforyou.com/api/key" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/key?referer=');">http://www.theyworkforyou.com/api/key</a> to get a key.</p>
<p>Got it? OK, now copy it into a text document – or just keep this window open. You’ll need to paste it later.</p>
<h2>Using the TWFY key</h2>
<p>The API has a number of pre-set questions, called ‘functions’. These are listed in the right hand column, and include <a href="http://www.theyworkforyou.com/api/docs/getMPs" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/docs/getMPs?referer=');">getMPs</a>, <a href="http://www.theyworkforyou.com/api/docs/getLord" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/docs/getLord?referer=');">getLord</a>, <a href="http://www.theyworkforyou.com/api/docs/getDebates" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/docs/getDebates?referer=');">getDebates</a> and so on. If you click on any of these you will be given information on how they work, and you can also test the function with the ‘Explorer’.</p>
<p>To demonstrate how to use these functions, <a href="http://www.theyworkforyou.com/api/docs/getConstituency" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/docs/getConstituency?referer=');">click on getConstituency</a>.</p>
<p>If you use the ‘Explorer’ to test it (in this case with &#8216;Edinburgh South&#8217;) you will be shown a bunch of results at a URL like this:</p>
<p><strong>http://www.theyworkforyou.com/api/docs/getConstituency?name=edinburgh+south&amp;postcode=&amp;output=js#output</strong></p>
<p>Now you could manually use the Explorer to get information for each of the cells in a spreadsheet, but it&#8217;s much, much quicker to use the API to automate the process instead.</p>
<p>On that front the Explorer can be a little misleading. Because although it shows you the information you might get from the API, this is not the URL that you will need.</p>
<p>The URL you really need is shown above the results, and below the word ‘<strong><em>Output</em></strong>’ like so:</p>
<p><strong>http://www.theyworkforyou.com/api/getConstituency?name=edinburgh+south&amp;output=js</strong></p>
<p>If you copy and paste that URL into your browser you will get the following warning:</p>
<p><strong>{</strong><br />
<strong> error: &#8220;No API key provided. Please see http://www.theyworkforyou.com/api/key for more information.&#8221;</strong><br />
<strong> }</strong></p>
<p>So now we need that key.</p>
<h2>Using your key</h2>
<p>Assuming you still have your API key copied somewhere, or still open in another window, you can find instructions on how to use it at <a href="http://www.theyworkforyou.com/api/" onclick="urchinTracker('/outgoing/www.theyworkforyou.com/api/?referer=');">http://www.theyworkforyou.com/api/</a></p>
<p>Here you are told to use the key as part of the following structure:</p>
<p><strong>http://www.theyworkforyou.com/api/function?key=key&amp;output=output&amp;other_variables</strong></p>
<p>The important bit is where it says <strong>key=key&amp;</strong></p>
<p>That is where you need to add your own key, so that that part of the URL looks <em>something </em>like</p>
<p><strong>key=aTh0jklerJaHui7&amp;</strong></p>
<p>(where that random assortment of characters is your key, copied earlier, followed by the <strong>&amp;</strong> sign)</p>
<p>Going back for a moment to the URL that wasn’t working without a key, we can see that it can be split into two parts:</p>
<p><strong>http://www.theyworkforyou.com/api/getConstituency?</strong></p>
<p><em>and</em></p>
<p><strong>name=edinburgh+south&amp;output=js</strong></p>
<p>Adding in the <em>key</em> in the middle makes up a <em>third</em> part, like so:</p>
<p><strong>http://www.theyworkforyou.com/api/getConstituency?</strong></p>
<p><em>and</em></p>
<p><strong>key=key&amp;</strong></p>
<p><em>and</em></p>
<p><strong>name=edinburgh+south&amp;output=js</strong></p>
<p>So, you now need to <em>edit the output URL to include your API key</em>. It should then look something like this:</p>
<p>http://www.theyworkforyou.com/api/getConstituency?<strong>key=AHdajHUShajshaJ&#038;</strong>name=edinburgh+south&#038;output=js</p>
<p><em>UPDATE: Matthew Somerville points out that the key can be used anywhere after the ? so you can tag it on the end if that&#8217;s easier.</em></p>
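<p>If you would rather script this step than edit the URL by hand, the same assembly can be sketched in a few lines of Python. This is only an illustration: the function name <strong>build_twfy_url</strong> is my own invention, and the key is the made-up example from this post rather than a working one. Only the URL structure comes from the TWFY instructions.</p>

```python
from urllib.parse import urlencode

API_BASE = "http://www.theyworkforyou.com/api/"


def build_twfy_url(function, key, **params):
    """Return the API URL for a function, with the key placed first."""
    # The key goes first here, but it can sit anywhere after the "?".
    query = {"key": key}
    query.update(params)
    # urlencode joins the parts and turns spaces into plus signs.
    return API_BASE + function + "?" + urlencode(query)


# "AHdajHUShajshaJ" is the made-up example key from this post, not a real one.
print(build_twfy_url("getConstituency", "AHdajHUShajshaJ",
                     name="edinburgh south", output="js"))
```

<p>Swap in your own key and the printed URL should match the one you built by hand above.</p>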
<h2>The URL broken down further</h2>
<p>Just to clarify, these are the parts:</p>
<p><strong>http://www.theyworkforyou.com/</strong></p>
<p>(The website hosting the API)</p>
<p><strong>api/</strong></p>
<p>(The API)</p>
<p><strong>getConstituency?</strong></p>
<p>(The function – or question being asked)</p>
<p><strong>key=AHdajHUShajshaJ</strong></p>
<p>(Our API key – or password)</p>
<p><strong>&amp;name=edinburgh+south</strong></p>
<p>(and the constituency name that we are asking the API for information on)</p>
<p><strong>&amp;output=js</strong></p>
<p>(and the format we want the answer in &#8211; JSON, in this case)</p>
<p>You should now get a page of JSON code giving data for the question. If your browser doesn&#8217;t display it particularly well, try Chrome or Firefox.</p>
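<p>The same breakdown can be checked in code. The sketch below uses Python&#8217;s standard URL tools to pull the example URL apart again. The helper name <strong>split_twfy_url</strong> and the labels in the result are assumptions made for this illustration, and the key is the made-up one used above.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def split_twfy_url(url):
    """Break an API URL into the parts described in this post."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "host": parts.netloc,                       # the website hosting the API
        "function": parts.path.rsplit("/", 1)[-1],  # the question being asked
        "key": query["key"][0],                     # the API key
        "name": query["name"][0],                   # the constituency name
        "output": query["output"][0],               # the answer format
    }


# Build the example URL first (made-up key), then take it apart again.
example = "http://www.theyworkforyou.com/api/getConstituency?" + urlencode(
    {"key": "AHdajHUShajshaJ", "name": "edinburgh south", "output": "js"}
)
for label, part in split_twfy_url(example).items():
    print(label, "=", part)
```

<p>Note that the plus sign in the name is turned back into a space when the URL is split, which is the reverse of what happens when the URL is built.</p>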
<h2>Using with Google Refine to get a bunch of results</h2>
<p>Great. But we could get one result by using the ‘Explorer’, so why did we need to do all that? Because we can now use Google Refine to automate the process of asking the same question hundreds of times.</p>
<p>To demonstrate this, <a href="https://spreadsheets.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdDA3V010RUlqTjhYalN6ejh0T2ZGN0E&amp;hl=en_GB" onclick="urchinTracker('/outgoing/spreadsheets.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdDA3V010RUlqTjhYalN6ejh0T2ZGN0E_amp_hl=en_GB&amp;referer=');">here&#8217;s a spreadsheet with 4 constituencies</a>. Open it, and select <strong>File &gt; Download as&#8230; &gt; CSV </strong></p>
<p>Open Google Refine (<a href="http://code.google.com/p/google-refine/wiki/Downloads?tm=2" onclick="urchinTracker('/outgoing/code.google.com/p/google-refine/wiki/Downloads?tm=2&amp;referer=');">download here</a>) and create a new project with that spreadsheet. Create a new column from the one you have by clicking on the arrow at the top of the column and selecting <strong>Edit Column &gt; Add Column by fetching URLs</strong></p>
<p>In the window that appears adapt the following piece of Google Refine Expression Language (GREL) with your own API key (shown in bold):</p>
<div id="left-panel">
<div>
<div id="refine-tabs-history">
<div>
<div><a>&quot;http://www.theyworkforyou.com/api/getConstituency?<strong>key=Gr7jUUlKdhB3fsihFnHzab&amp;</strong>name=&quot;+value+&quot;&amp;output=js&quot;</a></div>
</div>
<div>This generates a URL in each cell based on the value of the original column: the start and end of the URL are in quotation marks; the value is inserted in the middle where it says +value+</div>
</div>
</div>
</div>
<p>(NOTE: Avoid copying and pasting this expression, as curly quotation marks may cause you problems. Try typing it in yourself instead &#8211; this also helps you remember it.)</p>
<p>Give the column a name and click <strong>OK</strong>. It will now run &#8211; this test example only has 4 rows so you can see the results quickly.</p>
<p>You&#8217;ll see that only one row has actually worked &#8211; Tatton. The others have failed. Why? Because they have more than one word.</p>
<p>Take another look at that URL that the API returned earlier with the test of Edinburgh South:</p>
<p>http://www.theyworkforyou.com/api/getConstituency?key=AHdajHUShajshaJ&#038;<strong>name=edinburgh+south</strong>&#038;output=js</p>
<p>When a constituency has two words the space between them is represented by a plus sign &#8211; so we need to format our data in the same way for it to work.</p>
<h2>Formatting data for the API</h2>
<p>You could use Find and Replace in Excel to replace all spaces in that column with a plus sign but you will still hit problems with unusual constituency names. But this is how to do it in Google Refine:</p>
<p><del>Click on the arrow at the top of the constituency column and select <strong>Edit Column &gt; Add column based on this column&#8230;</strong></del></p>
<p><del>In the window that appears type the following GREL:</del></p>
<p><del>value.split(&quot; &quot;).join(&quot;+&quot;)</del></p>
<p><del>To explain:</del></p>
<p><del><em>&#8216;value&#8217; is the value in each cell.</em></del></p>
<p><del><em>&#8216;.split(&quot; &quot;)&#8217; splits each value where there is a space (&quot; &quot;).</em></del></p>
<p><del><em>&#8216;.join(&quot;+&quot;)&#8217; then joins the resulting items together, with a plus sign.</em></del></p>
<p><del>Give it a name and click <strong>OK</strong>. You&#8217;ll see a new column with plus signs replacing the spaces.</del> <em>[see comment from Matthew Somerville for explanation]</em></p>
<p>Create a new column from the one you have by clicking on the arrow at the top of the column and selecting <strong>Edit Column &gt; Add Column by fetching URLs</strong></p>
<p>In the window that appears adapt the following piece of Google Refine Expression Language (GREL) with your own API key (shown in bold):</p>
<p>&quot;http://www.theyworkforyou.com/api/getConstituency?name=&quot; + escape(value, &quot;url&quot;) + &quot;<strong>&amp;key=Gr7jUUlKdhB3fsihFnHzab&amp;</strong>output=js&quot;</p>
<p>The key part here is between the + signs. Whereas before we simply inserted the value of each cell, here we <em>escape</em> that value at the same time so that it will work in a URL.</p>
<p>This will change Edinburgh South to &#8220;Edinburgh+South&#8221; but also Normanton, Pontefract and Castleford to &#8220;Normanton%2C+Pontefract+and+Castleford&#8221;, and it will handle any other unforeseen characters in a similar way.</p>
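<p>To see what this kind of escaping does outside Refine, here is a small Python sketch. It uses Python&#8217;s own <strong>quote_plus</strong> function, which I am assuming behaves like Refine&#8217;s escape for names such as these; it is a stand-in for illustration, not the code Refine runs.</p>

```python
from urllib.parse import quote_plus

names = ["Tatton", "Edinburgh South", "Normanton, Pontefract and Castleford"]

for name in names:
    # Spaces become plus signs and the comma becomes %2C.
    print(name, "becomes", quote_plus(name))
```

<p>A single-word name like Tatton comes out unchanged, which is why it was the only row that worked before escaping.</p>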
<p>Give this new column a name, click <strong>OK</strong> and watch your new column populate itself with the JSON from each URL.</p>
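<p>For comparison, this is roughly what Refine is doing for you when it fetches each URL, sketched in Python. The function names are hypothetical, and the sketch assumes that the API answers with valid JSON encoded as UTF-8, which you should check against a real response. You would also need your own key in place of the placeholder.</p>

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://www.theyworkforyou.com/api/getConstituency"


def fetch_text(url):
    """Download a URL and return the body as text (needs a connection)."""
    with urlopen(url) as response:
        # Assumption: the response is UTF-8 encoded.
        return response.read().decode("utf-8")


def fetch_constituencies(names, key, fetch=fetch_text):
    """Ask the API about each name; return a dict of name to parsed JSON."""
    results = {}
    for name in names:
        # urlencode escapes each name, just as escape(value, "url") does.
        query = urlencode({"key": key, "name": name, "output": "js"})
        results[name] = json.loads(fetch(API_URL + "?" + query))
    return results


# Example call (commented out because it needs a real key and a connection):
# data = fetch_constituencies(["Tatton", "Edinburgh South"], "YOUR_KEY_HERE")
```

<p>The <strong>fetch</strong> argument is there so the download step can be swapped for a stand-in when you want to try the loop without calling the API.</p>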
<h2>Creating new columns from the JSON</h2>
<p>Now we can populate new columns with data taken from that JSON as follows:</p>
<p>Click on the arrow at the top of the <em>new</em> JSON column and select <strong>Edit Column &gt; Add column based on this column&#8230;</strong></p>
<p>Type this GREL:</p>
<p>value.parseJson().bbc_constituency_id</p>
<p><em>(This looks in the JSON in each cell and pulls out the value that follows bbc_constituency_id.)</em> Then click <strong>OK</strong>.</p>
<p>Repeat the process for further columns as follows:</p>
<p>value.parseJson().guardian_election_results</p>
<p>value.parseJson().pa_id</p>
<p>value.parseJson().guardian_id</p>
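<p>The same extraction can be sketched in Python to show what parseJson() is doing. The field names below are the ones used in this post, but the sample response and its values are invented for illustration; a real response will contain different values and more fields.</p>

```python
import json

# A made-up response: the field names come from this post, the values do not.
sample = '{"bbc_constituency_id": "000", "guardian_id": "000", "pa_id": "000"}'

FIELDS = ("bbc_constituency_id", "guardian_election_results",
          "pa_id", "guardian_id")


def extract_fields(raw_json, fields=FIELDS):
    """Parse one JSON response and keep only the named fields."""
    data = json.loads(raw_json)
    # .get() gives None for a missing field instead of raising an error.
    return {field: data.get(field) for field in fields}


for field, value in extract_fields(sample).items():
    print(field, "=", value)
```

<p>Each entry in the result corresponds to one of the new columns you created in Refine.</p>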
<h2>Going further</h2>
<p>That&#8217;s just a demonstration of how to use a small part of the They Work For You API &#8211; there are lots of other functions that you can use to get other information. Have a play with those.</p>
<p>Meanwhile, what about those IDs? Well, the Guardian ID <a href="http://www.guardian.co.uk/open-platform/politics-api/getting-started" onclick="urchinTracker('/outgoing/www.guardian.co.uk/open-platform/politics-api/getting-started?referer=');">will allow you to play with The Guardian&#8217;s API</a> &#8211; which gives lots more information on each constituency. For an example see http://www.guardian.co.uk/politics/api/constituency/664/json</p>
<p>Based on that URL you can repeat the process above to grab more data.</p>
<p><em>Is this useful? Anything you can add? Or other data problems?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2011/07/22/how-to-grab-useful-political-data-with-the-they-work-for-you-api/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Quote Twitter conversations with QuoteURL (Something for the Weekend #15)</title>
		<link>http://onlinejournalismblog.com/2009/03/19/quote-twitter-conversations-with-quoteurl-something-for-the-weekend-15/</link>
		<comments>http://onlinejournalismblog.com/2009/03/19/quote-twitter-conversations-with-quoteurl-something-for-the-weekend-15/#comments</comments>
		<pubDate>Thu, 19 Mar 2009 21:41:28 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[online journalism]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[quoteurl]]></category>
		<category><![CDATA[Something for the weekend]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=2447</guid>
		<description><![CDATA[Following on from the previous Something for the Weekend, Twickie, which allows you to collect responses to a question posted on Twitter, this tool allows you to present a conversation &#8211; with impressive control.  QuoteURL allows you to drag and drop (or copy and paste) Twitter tweet URLs to reconstruct a conversation. Here&#8217;s one I prepared earlier: dirkthecow great article on<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2009/03/19/quote-twitter-conversations-with-quoteurl-something-for-the-weekend-15/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p>Following on from the <a href="http://onlinejournalismblog.com/2009/02/20/twickie-easily-blog-responses-to-a-twitter-question-something-for-the-weekend-14/">previous</a> <a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">Something for the Weekend</a>, <a href="http://onlinejournalismblog.com/2009/02/20/twickie-easily-blog-responses-to-a-twitter-question-something-for-the-weekend-14/">Twickie</a>, which allows you to collect responses to a question posted on Twitter, this tool allows you to present a conversation &#8211; with impressive control.</p>
<p><a href="http://www.quoteurl.com/" onclick="urchinTracker('/outgoing/www.quoteurl.com/?referer=');">QuoteURL</a> allows you to drag and drop (or copy and paste) Twitter tweet URLs to reconstruct a conversation.<span id="more-2447"></span></p>
<p>Here&#8217;s <a href="http://www.quoteurl.com/k9m1y" onclick="urchinTracker('/outgoing/www.quoteurl.com/k9m1y?referer=');">one I prepared earlier</a>:</p>
<p><!-- QuoteURL styled embed start --></p>
<blockquote class="quoteurl-block">
<ol class="quoteurl-quote" style="background-color: #fff;color: #000;padding: .4em;border: 1px solid #888;width: 90%;margin: auto">
<li class="hentry status u-dirkthecow">
<div class="thumb vcard author" style="float:left;margin-right:1em;margin-left:.5em"><a class="url" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');"><img class="photo fn" style="border:none" src="http://s3.amazonaws.com/twitter_production/profile_images/69001581/Dirk_normal.png" alt="Dirk Singer" width="48" height="48" /></a></div>
<div class="status-body" style="margin-right:30px;padding-right:1em"><a class="author" title="Dirk Singer" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');">dirkthecow</a> <span class="entry-content" style="font-style:normal">great article on the &#8220;daily me&#8221; or how the Internet is narrowing our viewpoints, from Nicholas Kristof in the NYT <a rel="nofollow" href="http://bit.ly/lsH3" onclick="urchinTracker('/outgoing/bit.ly/lsH3?referer=');">http://bit.ly/lsH3</a></span> <span class="meta entry-meta" style="color:#888;font-family:georgia;font-size:0.8em;font-style:italic"> <a class="entry-date" rel="bookmark" href="http://twitter.com/dirkthecow/status/1355935966" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow/status/1355935966?referer=');"> <span class="published" title="2009-03-19 18:41:35">19 Mar 2009</span> </a> <span>from <a href="http://www.tweetdeck.com/" onclick="urchinTracker('/outgoing/www.tweetdeck.com/?referer=');">TweetDeck</a></span> </span></div>
</li>
<li class="hentry status u-paulbradshaw">
<div class="thumb vcard author" style="float:left;margin-right:1em;margin-left:.5em"><a class="url" href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');"><img class="photo fn" style="border:none" src="http://s3.amazonaws.com/twitter_production/profile_images/79160848/paulbradshaw_twitterprofile_bigger_normal.jpg" alt="Paul Bradshaw" width="48" height="48" /></a></div>
<div class="status-body" style="margin-right:30px;padding-right:1em"><a class="author" title="Paul Bradshaw" href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');">paulbradshaw</a> <span class="entry-content" style="font-style:normal"><a href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');">@dirkthecow</a> I disagree. He has no evidence the Internet will make that worse.</span> <span class="meta entry-meta" style="color:#888;font-family:georgia;font-size:0.8em;font-style:italic"> <a class="entry-date" rel="bookmark" href="http://twitter.com/paulbradshaw/status/1355955160" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw/status/1355955160?referer=');"> <span class="published" title="2009-03-19 18:44:58">19 Mar 2009</span> </a> <span>from <a href="http://m.slandr.net" onclick="urchinTracker('/outgoing/m.slandr.net?referer=');">m.slandr.net</a></span> </span></div>
</li>
<li class="hentry status u-dirkthecow">
<div class="thumb vcard author" style="float:left;margin-right:1em;margin-left:.5em"><a class="url" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');"><img class="photo fn" style="border:none" src="http://s3.amazonaws.com/twitter_production/profile_images/69001581/Dirk_normal.png" alt="Dirk Singer" width="48" height="48" /></a></div>
<div class="status-body" style="margin-right:30px;padding-right:1em"><a class="author" title="Dirk Singer" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');">dirkthecow</a> <span class="entry-content" style="font-style:normal"><a href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');">@paulbradshaw</a> agree he gives no evidence,but my own personal experience mirrors what he says,I  follow and read stuff from people &#8216;like me&#8217;</span> <span class="meta entry-meta" style="color:#888;font-family:georgia;font-size:0.8em;font-style:italic"> <a class="entry-date" rel="bookmark" href="http://twitter.com/dirkthecow/status/1355981031" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow/status/1355981031?referer=');"> <span class="published" title="2009-03-19 18:49:30">19 Mar 2009</span> </a> <span>from <a href="http://www.tweetdeck.com/" onclick="urchinTracker('/outgoing/www.tweetdeck.com/?referer=');">TweetDeck</a></span> <a href="http://twitter.com/paulbradshaw/status/1355969235" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw/status/1355969235?referer=');">in reply to paulbradshaw</a> </span></div>
</li>
<li class="hentry status u-paulbradshaw">
<div class="thumb vcard author" style="float:left;margin-right:1em;margin-left:.5em"><a class="url" href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');"><img class="photo fn" style="border:none" src="http://s3.amazonaws.com/twitter_production/profile_images/79160848/paulbradshaw_twitterprofile_bigger_normal.jpg" alt="Paul Bradshaw" width="48" height="48" /></a></div>
<div class="status-body" style="margin-right:30px;padding-right:1em"><a class="author" title="Paul Bradshaw" href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');">paulbradshaw</a> <span class="entry-content" style="font-style:normal"><a href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');">@dirkthecow</a> but we do that offline too -he&#8217;s suggesting the net makes it worse. Evidence doesn&#8217;t back that up.</span> <span class="meta entry-meta" style="color:#888;font-family:georgia;font-size:0.8em;font-style:italic"> <a class="entry-date" rel="bookmark" href="http://twitter.com/paulbradshaw/status/1356018987" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw/status/1356018987?referer=');"> <span class="published" title="2009-03-19 18:58:20">19 Mar 2009</span> </a> <span>from <a href="http://www.atebits.com/software/tweetie/" onclick="urchinTracker('/outgoing/www.atebits.com/software/tweetie/?referer=');">Tweetie</a></span> <a href="http://twitter.com/dirkthecow/status/1355981031" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow/status/1355981031?referer=');">in reply to dirkthecow</a> </span></div>
</li>
<li class="hentry status u-dirkthecow">
<div class="thumb vcard author" style="float:left;margin-right:1em;margin-left:.5em"><a class="url" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');"><img class="photo fn" style="border:none" src="http://s3.amazonaws.com/twitter_production/profile_images/69001581/Dirk_normal.png" alt="Dirk Singer" width="48" height="48" /></a></div>
<div class="status-body" style="margin-right:30px;padding-right:1em"><a class="author" title="Dirk Singer" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');">dirkthecow</a> <span class="entry-content" style="font-style:normal"><a href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');">@paulbradshaw</a> but isn&#8217;t it so that when we all watched the evening news we were forced to hear different opinions, now we filter them more?</span> <span class="meta entry-meta" style="color:#888;font-family:georgia;font-size:0.8em;font-style:italic"> <a class="entry-date" rel="bookmark" href="http://twitter.com/dirkthecow/status/1356047496" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow/status/1356047496?referer=');"> <span class="published" title="2009-03-19 19:03:27">19 Mar 2009</span> </a> <span>from <a href="http://www.atebits.com/software/tweetie/" onclick="urchinTracker('/outgoing/www.atebits.com/software/tweetie/?referer=');">Tweetie</a></span> <a href="http://twitter.com/paulbradshaw/status/1356018987" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw/status/1356018987?referer=');">in reply to paulbradshaw</a> </span></div>
</li>
<li class="hentry status u-paulbradshaw">
<div class="thumb vcard author" style="float:left;margin-right:1em;margin-left:.5em"><a class="url" href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');"><img class="photo fn" style="border:none" src="http://s3.amazonaws.com/twitter_production/profile_images/79160848/paulbradshaw_twitterprofile_bigger_normal.jpg" alt="Paul Bradshaw" width="48" height="48" /></a></div>
<div class="status-body" style="margin-right:30px;padding-right:1em"><a class="author" title="Paul Bradshaw" href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');">paulbradshaw</a> <span class="entry-content" style="font-style:normal"><a href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');">@dirkthecow</a> one stat sticks out in my mind: 46% of people come across news stories while searching for something else</span> <span class="meta entry-meta" style="color:#888;font-family:georgia;font-size:0.8em;font-style:italic"> <a class="entry-date" rel="bookmark" href="http://twitter.com/paulbradshaw/status/1356077809" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw/status/1356077809?referer=');"> <span class="published" title="2009-03-19 19:09:02">19 Mar 2009</span> </a> <span>from <a href="http://www.atebits.com/software/tweetie/" onclick="urchinTracker('/outgoing/www.atebits.com/software/tweetie/?referer=');">Tweetie</a></span> <a href="http://twitter.com/dirkthecow/status/1356047496" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow/status/1356047496?referer=');">in reply to dirkthecow</a> </span></div>
</li>
<li class="hentry status u-dirkthecow">
<div class="thumb vcard author" style="float:left;margin-right:1em;margin-left:.5em"><a class="url" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');"><img class="photo fn" style="border:none" src="http://s3.amazonaws.com/twitter_production/profile_images/69001581/Dirk_normal.png" alt="Dirk Singer" width="48" height="48" /></a></div>
<div class="status-body" style="margin-right:30px;padding-right:1em"><a class="author" title="Dirk Singer" href="http://twitter.com/dirkthecow" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow?referer=');">dirkthecow</a> <span class="entry-content" style="font-style:normal"><a href="http://twitter.com/paulbradshaw" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw?referer=');">@paulbradshaw</a> another interesting piece is this one on &#8216;homophily&#8217; by Oliver Burkeman in The Guardian  <a rel="nofollow" href="http://bit.ly/5qPXJ" onclick="urchinTracker('/outgoing/bit.ly/5qPXJ?referer=');">http://bit.ly/5qPXJ</a></span> <span class="meta entry-meta" style="color:#888;font-family:georgia;font-size:0.8em;font-style:italic"> <a class="entry-date" rel="bookmark" href="http://twitter.com/dirkthecow/status/1356308689" onclick="urchinTracker('/outgoing/twitter.com/dirkthecow/status/1356308689?referer=');"> <span class="published" title="2009-03-19 19:56:00">19 Mar 2009</span> </a> <span>from <a href="http://www.tweetdeck.com/" onclick="urchinTracker('/outgoing/www.tweetdeck.com/?referer=');">TweetDeck</a></span> <a href="http://twitter.com/paulbradshaw/status/1356077809" onclick="urchinTracker('/outgoing/twitter.com/paulbradshaw/status/1356077809?referer=');">in reply to paulbradshaw</a> </span></div>
</li>
</ol>
</blockquote>
<p> &#8212; <a href="http://www.quoteurl.com/k9m1y" onclick="urchinTracker('/outgoing/www.quoteurl.com/k9m1y?referer=');">this quote</a> was brought to you by <a href="http://www.quoteurl.com" onclick="urchinTracker('/outgoing/www.quoteurl.com?referer=');">quoteurl</a> <br class="quoteurl-end" /> <!-- QuoteURL embed end --></p>
<p>The tool allows you to include up to 4 tweets without registering, 10 if you have registered, and more if you pay for an account (not available at the moment).</p>
<p>There are a few weaknesses to the service &#8211; when I tried it you couldn&#8217;t actually see any more than 4 tweets &#8211; although they were clearly being stored: you have to work &#8216;blind&#8217; so to speak.</p>
<p>And it seems you cannot have more than 1 consecutive tweet by the same person.</p>
<p>What is nice is that you can have more than 2 people involved in the conversation, and tweets seem to be arranged chronologically, so you can drag them in in any order.</p>
<p>If you have a play with it, let me know how you get on &#8211; or any uses you can see.</p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2009/03/19/quote-twitter-conversations-with-quoteurl-something-for-the-weekend-15/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Twickie: easily blog responses to a Twitter question (Something for the Weekend #14)</title>
		<link>http://onlinejournalismblog.com/2009/02/20/twickie-easily-blog-responses-to-a-twitter-question-something-for-the-weekend-14/</link>
		<comments>http://onlinejournalismblog.com/2009/02/20/twickie-easily-blog-responses-to-a-twitter-question-something-for-the-weekend-14/#comments</comments>
		<pubDate>Fri, 20 Feb 2009 11:39:23 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[online journalism]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[archive]]></category>
		<category><![CDATA[Something for the weekend]]></category>
		<category><![CDATA[twickie]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=2131</guid>
		<description><![CDATA[This week&#8217;s Something for the Weekend tool review continues the Twitter theme with a simple tool which helps bridge the Twitter-blog divide. If you&#8217;ve ever posted a question on Twitter and followed it up with a blog post discussing the responses, you&#8217;ll have probably been frustrated by the inability to present those responses in the blog post &#8211; you either have<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2009/02/20/twickie-easily-blog-responses-to-a-twitter-question-something-for-the-weekend-14/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<div id="attachment_2132" class="wp-caption alignnone" style="width: 435px"><a href="http://onlinejournalismblog.com/wp-content/uploads/2009/02/twickie.gif"><img class="size-full wp-image-2132" src="http://onlinejournalismblog.com/wp-content/uploads/2009/02/twickie.gif" alt="Twickie" width="425" height="219" /></a><p class="wp-caption-text">Twickie</p></div>
<p>This week&#8217;s <a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">Something for the Weekend tool review</a> continues the Twitter theme with a simple tool which helps bridge the Twitter-blog divide.</p>
<p>If you&#8217;ve ever posted a question on Twitter and followed it up with a blog post discussing the responses, you&#8217;ll have probably been frustrated by the inability to present those responses in the blog post &#8211; you either have to link to each one, or copy and paste them from <a href="http://search.twitter.com" onclick="urchinTracker('/outgoing/search.twitter.com?referer=');">Twitter Search</a> (which means ugly table-based HTML and irrelevant messages, newest-first).</p>
<p><a href="http://twickie.pirillo.com/" onclick="urchinTracker('/outgoing/twickie.pirillo.com/?referer=');">Twickie</a> is a cute solution to that problem. You log on with your Twitter username and password, browse through your recent tweets to find the question you posted, and click on &#8216;<strong>Get @s</strong>&#8217; to see the replies ordered oldest- or newest-first.<span id="more-2131"></span></p>
<p>At that point you are also given some HTML you can then copy and paste into a blog post &#8211; and this is not embedded code so will not disappear if Twickie does.</p>
<p>I&#8217;ve used this <a href="http://onlinejournalismblog.com/2009/02/16/why-should-student-journalists-use-twitter/">twice</a> <a href="http://onlinejournalismblog.com/2009/02/17/teaching-journalism-students-to-twitter-the-twentoring-project/">already</a> this week &#8211; but a word of advice: Twickie only allows you to look at replies to your most recent tweets, so if you leave it too late you might not be able to find the tweet you want. If this is the case, you&#8217;ll have to delete some of your more recent tweets and refresh Twickie until the tweet in question is within the last few dozen.</p>
<p>Also, if you tweet again too soon after posing the question, Twickie will not be able to find all the responses to it, so make sure you leave a generous amount of time before tweeting again.</p>
<p><em>UPDATE: </em><a href="http://twitter.com/stef" onclick="urchinTracker('/outgoing/twitter.com/stef?referer=');">Stefan Lewandowski</a> informs me that the formatting means tweets don&#8217;t show up when the post is viewed on an iPhone.</p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2009/02/20/twickie-easily-blog-responses-to-a-twitter-question-something-for-the-weekend-14/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Twitter/mobile bookmarking with Tagthis (Something for the Weekend #13)</title>
		<link>http://onlinejournalismblog.com/2008/12/18/twittermobile-bookmarking-with-tagthis-something-for-the-weekend-13/</link>
		<comments>http://onlinejournalismblog.com/2008/12/18/twittermobile-bookmarking-with-tagthis-something-for-the-weekend-13/#comments</comments>
		<pubDate>Thu, 18 Dec 2008 08:09:42 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[online journalism]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[bookmarking]]></category>
		<category><![CDATA[Something for the weekend]]></category>
		<category><![CDATA[tagthis]]></category>
		<category><![CDATA[tools]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=1968</guid>
		<description><![CDATA[It&#8217;s been a while since I did a Something for the Weekend tool review, but Twitter bookmarking service TagThis is such a great tool it needed covering. TagThis allows you to bookmark any URL you see on Twitter to your own account on Delicious or Magnolia. This is particularly useful if, like me, you use Twitter on a mobile phone<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2008/12/18/twittermobile-bookmarking-with-tagthis-something-for-the-weekend-13/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p>It&#8217;s been a while since I did a <a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">Something for the Weekend tool review</a>, but Twitter bookmarking service TagThis is such a great tool it needed covering.</p>
<p><a href="http://tagth.is/index.php" onclick="urchinTracker('/outgoing/tagth.is/index.php?referer=');">TagThis</a> allows you to bookmark any URL you see on Twitter to your own account on Delicious or Magnolia. This is particularly useful if, like me, you use Twitter on a mobile phone or iPod, and often see useful links on Twitter that you&#8217;d like to come back to later or &#8216;file&#8217; for reference.<span id="more-1968"></span></p>
<p>Here&#8217;s how it works:</p>
<ul>
<li>Follow <a href="http://twitter.com/tagthis" onclick="urchinTracker('/outgoing/twitter.com/tagthis?referer=');">@tagthis</a> on Twitter</li>
<li>Register on the site at <a href="http://tagth.is/index.php" onclick="urchinTracker('/outgoing/tagth.is/index.php?referer=');">tagth.is</a> and enter your Twitter and Delicious or Magnolia login details.</li>
<li>When you find a tweet with a useful link, retweet it @tagthis. Shortened URLs will be &#8216;unzipped&#8217; and, by default, all words in the tweet will be used as tags. If you don&#8217;t want that to happen, put a hash sign (#) before the words you want to be used as tags, or add hashed tags of your own.</li>
<li>If you don&#8217;t want everyone to see what you&#8217;ve bookmarked, you can also send it as a direct message.</li>
<li>That&#8217;s it.</li>
</ul>
<p>There are a number of mobile Twitter apps that make retweeting easy &#8211; <a href="http://m.Slandr.net" onclick="urchinTracker('/outgoing/m.Slandr.net?referer=');">m.Slandr.net</a> is one, while <a href="http://twitstat.com/m/" onclick="urchinTracker('/outgoing/twitstat.com/m/?referer=');">Twitstat Mobile</a> has a &#8216;tagthis&#8217; button which automatically prefixes your tweet with @tagthis.</p>
<p>If you want to find the URL yourself, <a href="http://tweetcrunch.com/2008/09/24/twitfire-released-for-the-iphone/" onclick="urchinTracker('/outgoing/tweetcrunch.com/2008/09/24/twitfire-released-for-the-iphone/?referer=');">Twitfire</a> is an iPhone/iPod app that lets you search for a webpage and then inserts a tinyurl in your tweet. Alternatively you can install the <a href="http://www.quickonlinetips.com/archives/2005/02/posticious-quicker-posts-in-delicious/" onclick="urchinTracker('/outgoing/www.quickonlinetips.com/archives/2005/02/posticious-quicker-posts-in-delicious/?referer=');">Post to Delicious bookmarklet</a> in Safari and sync it with your iPod/iPhone.</p>
<p>Any other tips for bookmarking webpages on the move or through Twitter?</p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2008/12/18/twittermobile-bookmarking-with-tagthis-something-for-the-weekend-13/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What are your most useful online tools? (Something for the Weekend #12)</title>
		<link>http://onlinejournalismblog.com/2008/08/08/something-for-the-weekend-12-now-its-your-turn/</link>
		<comments>http://onlinejournalismblog.com/2008/08/08/something-for-the-weekend-12-now-its-your-turn/#comments</comments>
		<pubDate>Fri, 08 Aug 2008 08:00:46 +0000</pubDate>
		<dc:creator>Paul Bradshaw</dc:creator>
				<category><![CDATA[Blogging]]></category>
		<category><![CDATA[computer aided reporting]]></category>
		<category><![CDATA[Something for the weekend]]></category>

		<guid isPermaLink="false">http://onlinejournalismblog.com/?p=1234</guid>
		<description><![CDATA[I&#8217;ve looked at a number of tools in this series, often very new with potential applications for journalism that haven&#8217;t been realised. This time I want to turn the spotlight onto tools that you&#8217;re using every day, which may not be flashy, but which do a simple job very well &#8211; for example: in managing or filtering information, identifying leads,<br /><span class="read_more"><a href="http://onlinejournalismblog.com/2008/08/08/something-for-the-weekend-12-now-its-your-turn/">Read more...</a></span>]]></description>
			<content:encoded><![CDATA[
<p>I&#8217;ve looked at a number of tools in <a href="http://onlinejournalismblog.com/tag/something-for-the-weekend/">this series</a>, often very new ones whose potential applications for journalism haven&#8217;t yet been realised. This time I want to turn the spotlight onto tools that you&#8217;re using every day, which may not be flashy but which do a simple job very well &#8211; for example:</p>
<ul>
<li>in managing or filtering information,</li>
<li>identifying leads, ideas and contacts,</li>
<li>producing news itself,</li>
<li>distributing it,</li>
<li>or allowing users to get involved.</li>
</ul>
<p><strong>What have been the most useful online tools you&#8217;ve used?</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://onlinejournalismblog.com/2008/08/08/something-for-the-weekend-12-now-its-your-turn/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>

