Wednesday, September 14, 2011

Culturomics: Computers that Predict The Future? 3

Libyan revolution predicted?  Bin Laden's location within 200 kilometers known?
We continue with Leetaru's article in which he examines the data in news articles concerning political events and upheavals.

With regard to Egypt, Leetaru states,
The weeks preceding the protests contained the most negative discourse of his nearly 30–year rule. Combined with the graph of tone towards Egypt, this would have suggested to a policy–maker at the time that there could be an increased possibility of unrest in Egypt, possibly even affecting its previously untouchable head of state.
Tone of coverage mentioning Egyptian President Hosni Mubarak, Summary of World Broadcasts, January 1979–March 2011 (January 2011 is 1–24 January). Y–axis is Z–scores (standard deviations from mean).
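For readers unfamiliar with Z-scores, here is a minimal sketch of how a tone series can be rescaled into standard deviations from its mean. The monthly values below are invented for illustration and are not Leetaru's data; the function name `z_scores` is our own.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return each value's distance from the series mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the series
    if sigma == 0:
        raise ValueError("series has no variation, Z-scores are undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical monthly tone values (made up for this example).
monthly_tone = [0.2, 0.1, 0.3, 0.2, 0.1, -1.5]
scores = z_scores(monthly_tone)

# Index of the month with the most negative tone relative to the rest.
most_negative = min(range(len(scores)), key=scores.__getitem__)
print(most_negative, round(scores[most_negative], 2))
```

A month sitting more than two standard deviations below the mean, like the last one here, is the kind of outlier the Mubarak graph highlights.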

In case we have not made it clear, Leetaru fed one main data source into this computer model: the Summary of World Broadcasts (SWB), now known as BBC Monitoring.  In Leetaru's view this service, begun in 1939, offers the best combination of print and online news sources.  He makes it clear that he did not use Facebook or any other social media.  The computer algorithms have a difficult time reading the slang used on Facebook, which varies from region to region and country to country.  Facebook also does not permit the study of people's comments, even by academics.1   As many already know, this is a tremendous limitation of the model.  Increasingly, Facebook and other social media are where people spread news and opinions about events, and leaving them out will, in our view, greatly limit the model's future effectiveness.  There is evidence, however, that this privacy is quickly eroding on Facebook through other means.2

In an attempt to check the accuracy of the computer model's predictions, Leetaru compared the SWB (BBC Monitoring) data to the New York Times index and to English-language, Web-only news.  He states,
All three show a sharp shift towards negativity 1–24 January 2011, but the Times, in keeping with its reputation as the Grey Lady of journalism, shows a more muted response. Yet, the fact that the Times shows a similar overall trend curve to SWB is strong evidence that SWB’s strategy of sampling the global press is not a primary driving force in its results, given that the Times was analyzed in its entirety. Given SWB’s increasing reliance on Web–based news, it is not surprising that it is highly correlated with the Web dataset. This is in spite of SWB’s incorporation of local translated Web content, while the Web collection here consists only of English–language news. The five–year time span of the Web collection means it is not possible to place current events in a historical context to determine the significance of tonal shifts, but the close alignment of its tone with SWB suggests it may become an increasingly–competitive alternative to SWB in the future for some types of analysis.
Here is the graph he posted in the article to bear out his statement.

Tunisia
According to Leetaru, there were substantially fewer articles dealing with Tunisia than with Egypt.  Yet the information was still revealing.
For Tunisia, all mentions of the country, regardless of whether they mentioned specific cities in Tunisia, were counted, resulting in 16,856 articles. This results in a weaker tonal profile that is less selective and refined than that used for the other countries. Nevertheless, the two–week period prior to Tunisian President Ben Ali’s resignation was the sixth–most negative period in the last 30 years, coming after a decade–long plunge towards increasing negativity.
This is the graph he obtained from the model for Tunisia.

Libyan Revolution
Apparently, according to the computer model, Ghadafi is used to criticism and crisis.  The low levels he reached on February 15, 2011 had been reached four times earlier in the last 30 years.  You may study the graph to see this.
The World's "Tone"
These computer models can become much larger in scope.  Leetaru modeled the tone of the entire world.  The point was to count the negative key words in the news services' stories.
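To make the idea of counting key words concrete, here is a minimal sketch of a tone score. The word lists and the formula (positive hits minus negative hits, divided by total words) are our own assumptions for illustration, not Leetaru's actual dictionaries or method.

```python
import re

# Hypothetical word lists, chosen only for this example.
POSITIVE = {"good", "peace", "agreement", "growth"}
NEGATIVE = {"crisis", "protest", "violence", "unrest"}

def tone(text):
    """Score a text as (positive hits - negative hits) / total words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

article = "Protest and unrest spread as the crisis deepened despite a peace agreement"
print(tone(article))  # below zero: more negative than positive key words
```

Averaging such scores over every article in a month would give one point on a tone graph.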
Looking beyond the tone towards a single country, what does the tone of the entire world look like aggregated by month? Is the world as a whole becoming more negative, at least according to the news? Figure 10 shows the average tone of the entire New York Times by month from January 1945 to December 2005. The Times exhibits a strong decade–long trend towards negativity from the early 1960s to the early 1970s, before recovering towards slight negativity, and has trended slightly more negative in recent years up to the 11 September 2001 attacks, which caused news to become sharply more negative in the following four years. The New York Times has a strong U.S. focus, however, so Figure 11 shows the tone of all Summary of World Broadcasts news January 1979 to July 2010 (content after July 2010 was available only for articles mentioning one of the countries above), showing a steady, near linear, march towards negativity. For the period of overlap, January 1979 to December 2005, the two have a Pearson correlation of r=0.55 (n=324), suggesting that news as a whole is becoming more negative.
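The Pearson correlation quoted above measures how closely two series move together, from -1 to +1. A minimal sketch follows; the two short series are invented stand-ins, not the real Times or SWB tone data. The last lines check the month count: January 1979 to December 2005 is 27 years, which matches the n=324 in the quote.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

# Hypothetical monthly tone values for two sources (made up).
source_a = [1, 2, 3, 4, 5]
source_b = [2, 1, 4, 3, 5]
r = pearson_r(source_a, source_b)
print(round(r, 2))

# Months in the overlap period, January 1979 through December 2005.
months_of_overlap = (2005 - 1979 + 1) * 12
print(months_of_overlap)
```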
We include first the New York Times Graph by itself, then the BBC Monitoring one.
New York Times articles 1979-2010

BBC Monitoring 1979-2010

Time and Space
The following animated GIFs, provided by Leetaru, show the geographical hotspots from the perspective of the New York Times first, and then of the BBC Monitoring service.  It is critical that you understand what the colors mean; otherwise these movies will be totally confusing.

A 400-point color-coded scale was used, running from green to red.  Green = high positivity, while red = high negativity.  So by following the changing red areas, one can see where the most negative tone existed.  The New York Times plot points are for cities and cover the years 1945-2010.  The BBC Monitoring plot points are also for cities and cover the years 1979-2010.
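One way such a scale could work is sketched below. The tone range of -1 to +1, the straight-line blend between red and green, and the function name `tone_to_rgb` are all our assumptions; Leetaru's article does not say how his colors were computed.

```python
def tone_to_rgb(tone, lo=-1.0, hi=1.0, steps=400):
    """Map a tone value onto one of `steps` colors running from red (lo) to green (hi)."""
    clamped = max(lo, min(hi, tone))            # keep out-of-range tones on the scale
    index = round((clamped - lo) / (hi - lo) * (steps - 1))
    frac = index / (steps - 1)                  # 0.0 = most negative, 1.0 = most positive
    red = round(255 * (1 - frac))
    green = round(255 * frac)
    return index, (red, green, 0)

for sample in (-1.0, 0.0, 1.0):
    print(sample, tone_to_rgb(sample))
```

The most negative tone lands on step 0 (pure red) and the most positive on step 399 (pure green), with the other 398 shades in between.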

You can see the New York Times animation here.  The BBC Monitoring animation can be seen here.

Bin Laden's Locations
The search for Bin Laden proved less impressive in this computer model, at least to us.  Although narrowing his location down to within 200 kilometers seems impressive on the surface, that radius takes in major metropolitan areas of Pakistan where millions of people live.  We let Leetaru explain.
From his rise in the global media in the late 1990s to the month prior to his capture, Bin Laden has been most commonly associated with Pakistan and in the map below all roads appear to lead to Northern Pakistan. Indeed nearly 49 percent of all articles mentioning Bin Laden included a city in Pakistan and both Islamabad and Peshawar rank in the top five non–Western cities associated with him. The next four most closely associated countries are the United States (38 percent), Iran (33 percent), Afghanistan (28 percent), and the Philippines (20 percent). The city of his capture, Abbottabad, makes only a single appearance in an article on 16 April 2011 regarding the arrest of a terror suspect in the city (Mir, 2011). However, Abbottabad is less than 200 kilometers from both of two most popular cities associated with him, or roughly the radius between Islamabad and Peshawar. 
While far from a definitive lock on Bin Laden’s location, global news content would have suggested Northern Pakistan in a 200 km. radius around Islamabad and Peshawar as his most likely location, and that he was nearly twice as likely to be making his residence in Pakistan as Afghanistan.
Global geocoded tone of all Summary of World Broadcasts content, January 1979–April 2011 mentioning “bin Laden”.
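The 200 kilometer figure is easy to check for yourself. Here is a minimal sketch using the haversine great-circle formula. The city coordinates are approximate values we supplied, not taken from Leetaru's article, and the Earth is treated as a sphere of radius 6371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Approximate city-center coordinates (our assumption, for illustration).
CITIES = {
    "Islamabad": (33.6844, 73.0479),
    "Peshawar": (34.0151, 71.5249),
    "Abbottabad": (34.1688, 73.2215),
}

for other in ("Islamabad", "Peshawar"):
    distance = haversine_km(CITIES["Abbottabad"], CITIES[other])
    print(f"Abbottabad to {other}: {distance:.0f} km")
```

With these coordinates both distances come out under 200 km, which agrees with the quote, but a circle of that size still covers a very large population.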




World Civilizations
The last point that Leetaru makes is the most interesting to us.  He builds a color-coded map of the world indicating its major cultures or civilizations.  Their boundaries do not simply follow geographical regions; they also reflect historical ties of language or conquest.  The mapping is based on a paper published by Vincent Blondel and colleagues in 2008, titled Fast Unfolding of Communities in Large Networks.
Modularity finding organizes a network into clusters of nodes where the nodes within each group are more closely connected to each other than to the rest of the network. In the case of this news relatedness network, modularity finding locates groups of countries that are mentioned together more often with each other than with other countries. The resulting partition finds that the global news media, as captured by SWB, divides the world into six major civilizations, visualized spatially in Figure 16. Overall, these civilizations appear to closely track geographic proximity, which might intuitively make sense, with events in one country involving or being related to those in a neighboring country. Notable outliers include Spain, where colonial ties to South America appear to overcome its affinity to Europe, and France and Portugal, reflecting their ties to Africa. The most geographically diverse cluster is centered on the Middle East, but also includes Canada, Norway, and the United Kingdom. The smallest cluster consists of India and several of its immediate neighbors.
World “civilizations” according to SWB, 1979–2009
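To give a feel for what modularity finding rewards, here is a minimal sketch that scores a given grouping of a small network. It computes only the modularity value Q that methods like the one in the Blondel paper try to maximize; it is not the clustering algorithm itself. The six-node network and its labels are made up, standing in for countries linked by how often they are mentioned together.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: dict mapping (u, v) pairs to a positive weight (each pair listed once).
    community_of: dict mapping every node to its community label.
    """
    total = sum(edges.values())
    if total == 0:
        raise ValueError("graph has no edges")
    inside = defaultdict(float)  # edge weight falling inside each community
    degree = defaultdict(float)  # summed node degrees of each community
    for (u, v), w in edges.items():
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += w
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)

# Hypothetical co-mention network: two tight triangles joined by one link.
edges = {
    ("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 1,
    ("D", "E"): 1, ("E", "F"): 1, ("D", "F"): 1,
    ("C", "D"): 1,
}
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)
q_all = modularity(edges, one_group)
print(round(q_split, 3), round(q_all, 3))
```

Splitting the network into its two triangles scores higher than lumping everything together, which is the sense in which nodes in a cluster are "more closely connected to each other than to the rest of the network."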

In our last installment in this series, we will cover other models that have been devised to be able to predict the actions of large populations.
