Guest post by Nicholas Weller and Kenneth McCubbins
The Global Database of Events, Language, and Tone (GDELT) dataset promises scholars a much more expansive look at global conflict by using media reports to identify conflicts and to collect information about many aspects of each one. Recently, some of the major participants publicly ended their relationship with the project, which has generated controversy around GDELT and may draw attention to some of its flaws. This post has nothing to do with that dispute between the people responsible for the data; our concerns predate the recent controversy surrounding the dataset.[1]
Our concerns about the data arose as we attempted to use GDELT for the data challenge at the 2014 International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction. We entered the data challenge because we were excited by the idea of using GDELT to examine communication about international events across the globe. (Our initial plan was to use GDELT to study communication between countries as a way to better understand patterns of international interaction; a preliminary example is here.) In working with the dataset, however, we encountered a variety of concerns related to the process of data collection and to the poorly defined theoretical constructs that underlie the data. We do not aim to unconditionally discourage use of GDELT or of big data in general, but rather to raise a set of issues that we believe need greater attention.
The first issue concerns ease of use. If reports are collected from multiple news sources, then major events will likely be covered many times by those sources. It would be useful to have an identifier that links the records generated when the same event is mentioned in a different article or by a different news source: the 9/11 terrorist attacks on the World Trade Center, for example, should carry the same event identifier across reports for simplicity and ease of use. There is, however, no such cross-reference in the event records. This makes organizing the data much harder than necessary and makes the data seem inconsistent. Quality-of-life issues matter for third parties seeking to use a dataset, because the inability to understand it leads to uses the authors never intended and may eventually hurt the dataset's reputation when it is used inappropriately.
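In the absence of such a cross-reference, users are left to build an approximate one themselves. A minimal sketch of the kind of grouping we have in mind, assuming a tab-delimited events extract with a header row and columns named SQLDATE, Actor1Code, Actor2Code, and EventCode (the file name is made up, and the raw GDELT exports may require the column names to be supplied separately):

```python
# Approximate a cross-reference by grouping records that share a date,
# actor pair, and event code. Column names and the file name are assumptions.
import pandas as pd

events = pd.read_csv("gdelt_events_sample.tsv", sep="\t")

dedup_key = ["SQLDATE", "Actor1Code", "Actor2Code", "EventCode"]
report_counts = (
    events.groupby(dedup_key, dropna=False)  # keep groups with missing actor codes
          .size()
          .reset_index(name="n_reports")
)

# Candidate duplicates: the same (date, actors, event) reported more than once
print(report_counts[report_counts["n_reports"] > 1].head())
```

A key like this is only a heuristic: two genuinely distinct events between the same actors on the same day would be wrongly collapsed, which is exactly why an authoritative identifier from the data provider would be preferable.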
The second issue stems from the dataset's sources. GDELT is built on automated content analysis of news articles (newspapers, online news sources, etc.). The dataset is large enough to require automation to extract the different pieces of information (Actor1, Actor2, Tone, etc.): the text sources are loaded onto a computer and run through an algorithm designed to extract and code the events they describe. Ideally, the coding decisions would be available to other researchers so that other people, or alternative algorithmic approaches, could check a subset of them. The creators of the GDELT do not allow for third-party verification: they do not release the articles nor do they list the article sources and dates. In an email communication with Kalev Leetaru (January 11, 2014) we were told: “Our licensing restrictions are quite tight on the data and we cannot make the text available.” This struck us as odd given that most of the sources are publicly available news reports, but more importantly if the underlying data cannot be shared it imperils notions of transparency and replication central to science. Even if one were to attempt to recreate the entire dataset, the existing codebooks and papers about GDELT do not identify which text sources are used for each year. The lack of transparency in the data collection process is a significant barrier to independent validation of the constructs, which ultimately affects whether we can trust the data.
The remaining issues all involve the constructs and related measures that underlie the data. Using others' data always carries risks, even if it is often unavoidable given the challenges of original data collection. The most basic requirement of using someone else's data is accepting the constructs and measures that guided its collection, which is particularly difficult when those constructs are poorly defined and the measures are not clearly elaborated. Together, these two problems make it hard to demonstrate construct and measurement validity for the original authors' purposes, or for other researchers to demonstrate the validity of the data for theirs.[2]
One of our concerns about measurement validity relates to the specific coding rules used to generate the dataset. Even assuming the legitimacy of GDELT's primary data sources, the method used to code the data raises additional concerns. Rudyard Kipling famously wrote:
“I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.”
This is one of the standard tenets of journalism, yet much of this basic information is missing from the GDELT dataset. It seems to us that the Actor1 and Actor2 variables, which identify who is performing an action and whom the action is performed on, should rarely, if ever, be missing. While it may not always be apparent who Actor1 is (there could be an unknown attacker, terrorist, or gunman), we should at least have information on Actor2, since who or what was acted upon, and where, is the most basic journalistic information, the kind commonly included in the dateline that follows the author credit in news stories. Yet when we examined the data, we found that information about Actor1 and/or Actor2 is often missing.
For example, in 1979 the data for Actor2 is missing in 26.4% of cases, and both actors are missing in 14.4% of cases. Across all years in the data, not just 1979, both Actor1 and Actor2 are identified in only 39% of observations. To clarify, if Actor2 is missing it means the algorithm did not identify where the event occurred or who was being acted upon; if both actors are missing, it could not identify anybody involved in the act at all, in which case it is hard to see what news there was to report. The prevalence of these missing actors implies that the coding algorithm fails to extract information that seems crucial to the news reports (the automation, in these cases, evidently processed the articles without being able to code certain seemingly common situations), and also that there is no human verification, even a cursory one, to catch important missing data.[3]
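These missingness rates are straightforward to tabulate. A minimal sketch, again assuming a tab-delimited extract with a header row and actor-code columns named Actor1Code and Actor2Code (the file name is made up):

```python
# Tabulate missing actor codes in a GDELT events extract.
# Column names and the file name are assumptions about the file layout.
import pandas as pd

events = pd.read_csv("gdelt_1979.tsv", sep="\t")

a1_missing = events["Actor1Code"].isna()
a2_missing = events["Actor2Code"].isna()

print("Actor2 missing:      {:.1%}".format(a2_missing.mean()))
print("Both actors missing: {:.1%}".format((a1_missing & a2_missing).mean()))
print("Both actors present: {:.1%}".format((~a1_missing & ~a2_missing).mean()))
```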
Another concern about the measures arises from the news sources from which the data are collected. The GDELT documentation lists these sources, and the set consists primarily of US and British outlets, which may bias both which events receive coverage and how they are covered. In addition to biasing the count, location, and timing of events, the sources may also affect the coding of variables that are opinionated in nature (e.g., Tone, Goldstein), the Actors involved (since the Actors are more likely to be local to the news source), and how often particular events are referenced. The data collection process also seems to overcount how many events occur between certain countries. If there are 30 articles referencing a single event, this creates 30 separate “events” in the coding where Actor1 and Actor2 are the same, and there is no way to parse this without looking through the many separate data points that comprise an event. The bias works in the other direction as well. The US, for example, communicates with every country (or very nearly every country) in every year, so every possible dyad in which the USA is Actor1 appears in the data. Yet there are many instances in which the USA is Actor1 and some other country is Actor2, but no record of that country being Actor1 and the USA being Actor2. This suggests that the data are biased towards capturing events in which the USA (or other Western countries) is Actor1, while failing to capture the same event from the perspective of Actor2, which can influence empirical results derived from the data.
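One way to see this asymmetry is to compare the set of countries the USA acts upon with the set that act upon the USA. A rough sketch, assuming actor-code columns named Actor1Code and Actor2Code that begin with CAMEO-style country codes (matching on the prefix "USA" is a simplification, and the file name is made up):

```python
# Compare dyads where the USA is Actor1 against dyads where it is Actor2.
# Column names, the "USA" prefix match, and the file name are assumptions.
import pandas as pd

events = pd.read_csv("gdelt_events_sample.tsv", sep="\t")

usa_a1 = events["Actor1Code"].str.startswith("USA", na=False)
usa_a2 = events["Actor2Code"].str.startswith("USA", na=False)

partners_when_a1 = set(events.loc[usa_a1, "Actor2Code"].dropna())
partners_when_a2 = set(events.loc[usa_a2, "Actor1Code"].dropna())

# Actor codes that appear opposite the USA in only one direction
print(sorted(partners_when_a1 - partners_when_a2))
```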
We are not the first to notice the potential bias in the GDELT data. Adjusting data for known biases is common in public opinion and survey research, for instance through multilevel regression and post-stratification (Gelman and Little 1997; Lax and Phillips 2009, 2012), but there is a fundamental difference between adjusting survey data and adjusting GDELT. Because we do not know the theoretical population of events from which the GDELT sample is drawn, it is impossible either to assess the extent and types of bias or to know whether our corrections improve the data. The inability to identify the theoretical population of interest makes attempts to adjust the data questionable, yet the data in their raw form do not seem appropriate to use either. What is one to do in this situation?
The last issue also concerns how news articles are coded in GDELT. The authors of the dataset have highlighted “tone” as a major and revolutionary feature, and they have provided many examples of how their algorithms generate tone through textual analysis. The theoretical range of the tone variable is -100 to 100, and they state that most values will fall between -10 and 10. Despite this, we could not identify a single negative tone value in the entire dataset. For a construct that is supposed to range from -100 to 100, this is a significant problem that suggests a mismatch between the theoretical construct and the measure.
Missing a full half of the theoretical range of the tone variable suggests that the way tone is measured does not match a theoretical construct that runs from -100 to 100. In an email exchange, one of the project's PIs, Kalev Leetaru, told us that “With tone you rarely use the absolute values, you look at change over time or comparison with other tone values. i.e., you compare the coverage of two countries to see which is more negative, or you look at the tone towards a country over time to see if it is changing, rather than looking at absolute values. This is because the actual mixture of coverage of countries is so varied (even a country like Syria has positive coverage about people being saved, etc).”
While we are grateful for the clarification, it does not really address our concerns about the conceptualization of the tone variable, nor is it clear why it should require an email to the dataset's authors to learn this. The codebooks and papers associated with the GDELT dataset do not make clear that the values reported should not be used as reported, and it is also unclear why looking at changes in the Tone variable is justified if using the absolute values is not. Regardless, our concerns about the Tone variable point to what may be a larger problem: the failure to fully define constructs and then to demonstrate that a particular measure has the characteristics we associate with construct and measurement validity.
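Checking the distribution of tone for oneself is simple. A minimal sketch, assuming the event files expose tone in a column named AvgTone (the file name is made up):

```python
# Summarize the tone variable and count negative values.
# The AvgTone column name and the file name are assumptions.
import pandas as pd

events = pd.read_csv("gdelt_events_sample.tsv", sep="\t")

tone = events["AvgTone"]
print(tone.describe())  # min, max, and quartiles of the reported tone values
print("Share of negative tone values: {:.1%}".format((tone < 0).mean()))
```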
Some of these concerns could be assessed, and potentially resolved, if third-party researchers were able to view not just the coded data but also the underlying articles from which the data were generated. By not allowing people to download and verify this material, the project falls into the trap that many other big datasets do: we simply do not know whether the data are believable or accurate. Finally, there appear to be substantial theoretical problems in the dataset that third-party verification alone cannot solve. We raise these concerns not because we want to argue against the use of big data, or against efforts like GDELT to give us a better understanding of the social world, but because we want these datasets to be useful, which requires that they meet standards of transparency, reliability, and validity.
Nicholas Weller is an assistant professor in the Department of Political Science and School of International Relations at the University of Southern California. His published and working papers are at: dornsife.usc.edu/weller
Kenneth McCubbins received his B.A. in Applied Mathematics from the University of California, San Diego and his M.A. in Economics with an emphasis in Quantitative Economics and Finance from the University of Southern California. He is the CEO/Founder of his startup Reakt Fitness (2012-2014).
[1] This dataset was constructed by Kalev Leetaru (Illinois), John Beieler (Penn State), Philip Schrodt (Parus Analytical Systems and formerly at Penn State) and Patrick Brandt (UT Dallas). However, Beieler, Schrodt and Brandt ended their relationship with GDELT as of January 17, 2014 in a way that has raised questions regarding the legal status of the dataset.
[2] Many of these points may overlap with the beliefs of some of the original participants in the creation of the dataset, but it is difficult to tell. For instance, Phil Schrodt's blog post entitled "7 Remarks on GDELT" could not be found on his blog as of January 30, 2014, which makes it difficult to identify the various cautions the authors may have offered about the data. Regardless, we believe the points we make in this post are useful even if some of them have been stated before, because they are not yet commonly understood and prior evidence of these cautionary notes is hard to find.
[3] Given the desire to produce daily updates to the dataset, it is perhaps inevitable that the data are not verified even cursorily by humans, but the allure of big, fast data should not blind us to its downsides.
13 comments
“The creators of the GDELT do not allow for third-party verification: they do not release the articles nor do they list the article sources and dates. In an email communication with Kalev Leetaru (January 11, 2014) we were told: “Our licensing restrictions are quite tight on the data and we cannot make the text available.” This struck us as odd given that most of the sources are publicly available news reports, but more importantly if the underlying data cannot be shared it imperils notions of transparency and replication central to science.”
I have to agree with Leetaru here. For underlying sources of this size, you typically can’t distribute source content because of licensing agreements with newspapers or wire services. So they’re not exactly “publicly available news reports.”
But I do agree that one should at least cite where the article came from and, if possible, provide a URL. The daily update files of GDELT, I believe, try to do this.
Yes, I agree. If the text can’t be made available a full citation certainly could/should be.
Alex is right about sharing the source texts. Indeed, most of the imbroglio around GDELT is about the very question of rights to source texts. While GDELT does report the source for the daily updates, it doesn’t for previous events and should. Phil Schrodt has a recent post on the legal issues around sharing source texts and data (http://asecondmouse.wordpress.com/2014/02/14/the-legal-status-of-event-data/).
I think you’ve misunderstood GDELT’s event disambiguation. You write, “If there are 30 articles referencing a single event, this creates 30 separate “events” in the coding where Actor1 and Actor2 are the same, and there is no way to parse this without looking through the many separate data points that comprise an event.” Actually, GDELT will consolidate identical actor1-event-actor2-location events into one event. If you look at the NumArticles column, you’ll see many, many events with numbers greater than one, indicating that different sources were combined. Perhaps you were tripped up looking at the URL field: it only lists one source’s URL to save space. NumMentions tallies the total number of mentions in all articles (it’s the same or higher than NumArticles because some articles mention the same event twice).
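A rough way to see the consolidation (this sketch assumes an events extract with a header row and the GLOBALEVENTID, NumArticles, and NumMentions columns; the file name is made up):

```python
# Inspect events that GDELT assembled from more than one article.
# Column names and the file name are assumptions about the extract's layout.
import pandas as pd

events = pd.read_csv("gdelt_events_sample.tsv", sep="\t")

multi_source = events[events["NumArticles"] > 1]
print(multi_source[["GLOBALEVENTID", "NumArticles", "NumMentions"]].head())
```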
Your broader points about black box coding are well taken and raise an extremely important issue for GDELT and future event data systems. TABARI is pretty open (if dense–the manual is here: http://eventdata.parusanalytics.com/tabari.dir/tabari.manual.0.8.4b1.pdf and the dictionaries are here: https://github.com/philip-schrodt/Dictionaries), but things like Tone in GDELT come from who knows where. GKG is much, much worse on all of these counts.
Hopefully, future datasets will be generated in an even more open way and ideally from a system that anyone can download, set up locally, and tinker with.
Some general comments, as a user of GDELT and other event data with the Ward Lab at Duke (http://predictiveheuristics.com/):
There are two issues here. The first is a question of how well the machine coding of news reports works compared to human coding. Comparisons between the two approaches in the past show that machine coding is worse than human coding in terms of accuracy, but not terribly so, and over time it will get better and more consistent, unlike human coding. That many of the fields in GDELT are blank could be a failure of the machine coding, or the information could simply be missing from the source text, although such high percentages do seem suspicious.
While in theory you could do a second pass over machine-coded data with human coders, this just doesn’t scale well. Looking at 10 event codings and the source texts to correct issues is not hard. Doing this for the millions of events that are coded each month is just not feasible. Back-coding this for the quarter billion events in GDELT from 1979 on is near hopeless. It sounds like your interests are more specific, e.g. particular country/countries during relatively narrow time frames, and in this case maybe this is something feasible to do for third-parties.
So in that regard we have a choice here. We can do human coding of a lower number of stories, that will be more accurate given the source text, or we can go with the GDELT/ICEWS/KEDS approach of quantity over quality (for now). The latter is useful in some contexts, and certainly human coding might be better in others.
The second issue is about biases in news coverage and quality. Simply put, the news is not a perfect reflection of events on the ground. Many things are never reported, while other types of events, like suicide bombings or mass killings, receive heavy attention. Coverage varies by country and over time. This is not specific to GDELT, and it applies to human coders as well; it applies to my knowledge of events in Ukraine or Mali, for that matter. And sure, in theory I could go to those places and get a better picture of what is happening on the ground, or look at Twitter feeds, or study satellite images (things that, by the way, potentially introduce their own sets of biases), but one of the things I struggle with is how we can really know what the “ground truth” is.
I think you have a really good point about the sources going into GDELT and data efforts like this. They tend to be English language, so we don’t get the other side of the story that we might see in Spanish, Chinese, Russian, or Arabic media. I’m not sure that the human-coded efforts out there, like ACLED, include sources like this though, so for what it’s worth, this is a problem across the board for “big data” as it currently exists in poli sci.
You are right to highlight problems with GDELT. It’s a particular approach with weaknesses, and I would really shy away from any claim that GDELT represents what actually happens on the ground. As you mention, you really need to transform it in some fashion to account for coverage biases across countries, time, event types, etc., which is, as far as I know, a relatively unexplored area and creates even higher barriers to using the data. The more specifically you dig into the data, the more of a problem these issues become. But at the other end of the scale, when you are looking at very high-level things, e.g. global forecasting models of political conflict, the data are useful.