Guest post by Megan Price and Anita Gohdes
With the publishing of GDELT there has been an exciting increase in discussions about data quality (such as here, here and here, to name but a few), where the new compilation of data is analyzed and compared to other available sources for a chunk of space and time. Critical evaluations of the sources we use to test theories about political violence are highly valuable and it’s great to see the blogosphere providing a venue for such discussions (see also Will Moore’s first and second post on this topic).
An important component of data quality is the completeness of that data, or flipped around, what might be missing from these data. We can imagine two goals when thinking about this component of data quality. First, in an ideal world, we would like to have every single act of political violence documented with date, place, perpetrator, and possibly even further information. We know that this type of ‘ground truth’ is virtually never available – with a few exceptions, such as perhaps the data collected by the HLC on Kosovo. Secondly, if the available dataset is incomplete, we would want it to be representative, i.e. that the violence we observe produces the same patterns as the ground truth, that the observed violence is a good proxy for all of the violence. Ideally, we would want the incompleteness to therefore not depend on any factors we are interested in studying (e.g. geography, time, territorial control, or ethnic group, to name just a few possibilities).
One strategy to figure out gaps in data sources is to compare them to other data sources. Determining degrees of incompleteness requires, however, that one of the sources is either complete (ground truth), or representative (produces the same patterns as ground truth). In all other cases, all we can test is whether our sources are incomplete in different ways or not.
This is a topic our team at the Human Rights Data Analysis Group spends a lot of time thinking about. Collectively, we’ve been comparing data sources on violence in over 30 countries over the past two decades. We work with human rights groups, truth commissions, and the UN to arrive at reliable numbers of deaths during and following conflict. Our mandate is therefore mostly to figure out ‘ground truth’, and we hope that some of our experiences can be informative for this developing discussion on data quality in general. What we’ve learned is that reports of violence – those that end up in aggregated forms in press releases, human rights reports and official statistics – typically represent the reporting process. We have – in general – found that reported violence is not representative of the ground truth we are after.
What does that mean in real life? It means that groups working in Syria collect different amounts of information about violence occurring in government-controlled areas, because it is difficult for their networks to access those communities. Sometimes they also collect different amounts of information about violence occurring in opposition-controlled areas, when electricity is unavailable for long stretches of time, making it impossible for their networks to report in. It means that one person killed in the center of Kiev will be reported by thousands of people, but ten people killed in Boda in the Central African Republic will most likely go unnoticed (see Jay Ulfelder’s blog on the Fog of War). These differences shouldn’t surprise us. Indeed, any researcher with a little field experience can think of countless ways in which different types of violence are more or less likely to be reported.
To give a brief example, we’ve plotted the daily count of uniquely reported killings collected by four well-known data sources (SS, SNHR, VDC, and SCSR) for the Syrian governorate Tartus. All four sources depict a marked increase in violence in May 2013, which corresponds to an alleged massacre in that governorate. Three of the four sources observed relatively few victims outside this single spike in violence. The fourth source, VDC, describes the observed peak in May 2013 as the culmination of steadily increasing reports of violence throughout the preceding year. If we did not have access to the VDC data, we would erroneously conclude that there is consensus among data sources that relatively little violence is occurring in Tartus, and that May 2013 was a relatively isolated event.
The important take-home message here is not that VDC is the ‘most’ complete (which it may or may not be) but rather that each of these datasets is incomplete in different ways, and close examination of those differences is a first step in estimating the full picture of violence in Syria. Imagine what different trends a fifth, sixth, or seventh source might yield.
It is tempting to acknowledge differences between different reports of violence, and then to proceed interpreting observed reporting patterns as if they were representative of ground truth. In our work, we are (almost) always limited to data that are incomplete and unrepresentative. But we do not have to be limited to uncritical interpretations of these data. Several fields have developed a toolkit of methods to model the data generating process of reporting. These methods all build on the basic notion of estimating the probability of a certain event (in our research, victims of killing) being included in one or several reports. For example, one class of methods is called Multiple Systems, or Multiple Recapture, Estimation, which you can read about here (and a non-technical introduction here). Lots of exciting research is going on in this field.
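To make the basic idea concrete, here is a minimal sketch of the simplest two-list case of capture-recapture estimation, using the Chapman (bias-corrected Lincoln-Petersen) estimator. It is an illustration only, not the method used in our analyses: it assumes the two lists are independent, that every victim is equally likely to be reported, and that records have been matched perfectly across lists. Real applications typically need three or more lists and models that relax these assumptions. The function names and the counts in the usage example are hypothetical.

```python
def chapman_estimate(n1: int, n2: int, m: int) -> float:
    """Estimate the total number of victims from two overlapping lists.

    n1: victims documented by list A
    n2: victims documented by list B
    m:  victims documented by both lists (after record matching)

    Assumes independent lists and equal reporting probabilities.
    """
    if min(n1, n2, m) < 0 or m > min(n1, n2):
        raise ValueError("need 0 <= m <= min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def inclusion_probabilities(n1: int, n2: int, m: int) -> tuple[float, float]:
    """Estimated probability that a victim appears on list A and on list B.

    Under independence, the share of list B's victims that list A also
    found (m / n2) estimates list A's coverage, and vice versa.
    """
    if m <= 0 or m > min(n1, n2):
        raise ValueError("need 0 < m <= min(n1, n2)")
    return m / n2, m / n1


if __name__ == "__main__":
    # Hypothetical counts, not taken from any real dataset.
    n1, n2, m = 60, 40, 20
    documented = n1 + n2 - m  # unique victims seen by at least one list
    print("documented:", documented)
    print("estimated total:", round(chapman_estimate(n1, n2, m), 1))
    print("coverage (A, B):", inclusion_probabilities(n1, n2, m))
```

With these made-up counts the two lists together document 80 unique victims, while the estimate is roughly 118, which suggests a sizeable number of victims recorded by neither list. The overlap between the lists is what carries the information about what is missing.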
To summarize, we have experienced the following:
- Having different stories about a conflict from different sources is the rule, not the exception. We have found the argument about which source is “right” to be counter-productive. Focusing on why different sources are telling different stories can be informative, but only in the sense that we’re learning more about reporting.
- In general (with a few exceptions) reported data represent the reporting process, not actual events.
- For data to be representative, they need to be collected via a probabilistic method, such as a survey administered to a random sample.
Things we have found useful when working with incomplete data:
- Collecting meta-data: including information on how the data were collected, coded, organized, etc. enables assessment of the dataset, suggests additional complementary data to seek out, and potentially aids in modeling the data generation process.
- Transparency: being very clear about the limitations of the data and when the trends we’re observing describe the reporting or data collection process rather than ground truth.
- Multiple hypotheses: speculating about different possible causes for observed trends. E.g., perhaps violence truly did gradually increase in Tartus in late 2011 and early 2012. Or perhaps data collection groups’ networks expanded during this time and thus were able to collect increasing numbers of reports of violence.
- Data generating process of reporting before theory testing: we must be certain that the trend we’re observing is in fact ground truth and not simply an artifact of the data collection/generation process. We must formally rule out this possibility before moving forward testing theories about political violence.
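The "multiple hypotheses" point in the list above can be illustrated with a small sketch. It uses entirely made-up numbers (not Tartus data) and a deliberately simple assumption: the expected number of reported killings in a period is the true number multiplied by the share of events the reporting network can reach. The function name and the coverage values are hypothetical.

```python
def expected_reports(true_counts: list[int], coverage_pct: list[int]) -> list[float]:
    """Expected reported counts per period.

    true_counts:  true number of killings in each period
    coverage_pct: percentage (0-100) of killings the network documents
    """
    if len(true_counts) != len(coverage_pct):
        raise ValueError("inputs must cover the same periods")
    if any(p < 0 or p > 100 for p in coverage_pct):
        raise ValueError("coverage must be between 0 and 100")
    return [t * p / 100 for t, p in zip(true_counts, coverage_pct)]


# Hypothesis 1: violence is constant, but the reporting network expands.
constant_violence = [100, 100, 100, 100, 100, 100]
growing_coverage = [10, 20, 30, 40, 50, 60]

# Hypothesis 2: violence truly increases, and coverage stays constant.
rising_violence = [20, 40, 60, 80, 100, 120]
constant_coverage = [50, 50, 50, 50, 50, 50]

observed_1 = expected_reports(constant_violence, growing_coverage)
observed_2 = expected_reports(rising_violence, constant_coverage)

if __name__ == "__main__":
    print(observed_1)
    print(observed_2)
    print("indistinguishable:", observed_1 == observed_2)
```

Both scenarios produce the same rising series of reported counts, so the reported trend alone cannot tell us which hypothesis is true. Distinguishing them requires information about the reporting process itself.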
Anita Gohdes is a PhD Candidate at the University of Mannheim and a Consultant to the Human Rights Data Analysis Group (@ARGohdes).
The suspension of GDELT and re-publishing under different authorship has spurred an entirely different controversy. We have no stake in this debate.
Note that the question of completeness is preceded by the definition of what types of violence are to be included. This preceding question is obviously dependent on the object of inquiry (battle deaths, civilian deaths, one-sided violence, mass killings etc.). If the object of inquiry is restricted to all killings reported by Newspaper X, then the issue of completeness is merely dependent on the coding process.
We consider this an example of event-size bias, where a violent event with many victims is well-documented in multiple sources, but other events, presumably with fewer victims, appear to be under-documented.