I copy parts of the original Mentions dataset from the GDELT project into a MySQL database for further evaluation. My table definition is given below. The field definitions are, of course, identical to the originals. Please also read the original paper:
THE GDELT EVENT DATABASE DATA FORMAT CODEBOOK V2.0
From the paper:
“ … Mentions table that records all mentions of each event. As an event is mentioned across multiple news reports, each of those mentions is recorded in the Mentions table, along with several key indicators about that mention, including the location within the article where the mention appeared (in the lead paragraph versus being buried at the bottom) and the “confidence” of the algorithms in their identification of the event from that specific news report. …”
For lack of a better alternative, I use a hash of GLOBALEVENTID, MentionTimeDate, MentionIdentifier, Actor2CharOffset, and ActionCharOffset (stored as hash_key) as the primary key, together with EventTimeDateMY. Please let me know if there is a better one!
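A minimal Python sketch of how such a key could be computed. The 32-character example value for hash_key is consistent with an MD5 hex digest, but the hash function, the `|` separator, and the helper name `make_hash_key` are assumptions made for illustration, not necessarily what is used for this table.

```python
import hashlib


def make_hash_key(global_event_id, mention_time_date, mention_identifier,
                  actor2_char_offset, action_char_offset, sep="|"):
    """Return a 32-character hex digest over the five key fields.

    Assumption: MD5 over the stringified fields joined with a separator.
    The separator keeps inputs such as ("12", "3") and ("1", "23") from
    collapsing into the same string before hashing.
    """
    parts = (global_event_id, mention_time_date, mention_identifier,
             actor2_char_offset, action_char_offset)
    joined = sep.join(str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Example values taken from the table rows of this page.
key = make_hash_key(
    595726421,
    20171104053000,
    "http://forum.pafoa.org/external.php?s=cf69e072c195f3ceeabf9a28a2318091&type=RSS2&forumids=2",
    23015,
    22997,
)
```

Note that two mentions differing only in Actor1CharOffset or SentenceID would map to the same key under this scheme, so whether the five fields are sufficient depends on the data.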
Column Name | Description | Example |
---|---|---|
GLOBALEVENTID | This is the ID of the event that was mentioned in the article. | 595726421 |
EventTimeDate | This is the 15-minute timestamp (YYYYMMDDHHMMSS) when the event being mentioned was first recorded by GDELT (the DATEADDED field of the original event record). This field can be compared against the next one to identify events being mentioned for the first time (their first mentions) or to identify events of a particular vintage being mentioned now (such as filtering for mentions of events at least one week old). | 20161104081500 |
MentionTimeDate | This is the 15-minute timestamp (YYYYMMDDHHMMSS) of the current update. This is identical for all entries in the update file but is included to make it easier to load the Mentions table into a database. | 20171104053000 |
MentionType | This is a numeric identifier that refers to the source collection the document came from and is used to interpret the MentionIdentifier in the next column. In essence, it specifies how to interpret the MentionIdentifier to locate the actual document. At present, it can hold one of the following values: 1 = WEB (The document originates from the open web and the MentionIdentifier is a fully-qualified URL that can be used to access the document on the web); 2 = CITATIONONLY (The document originates from a broadcast, print, or other offline source in which only a textual citation is available for the document. In this case the MentionIdentifier contains the textual citation for the document); 3 = CORE (The document originates from the CORE archive and the MentionIdentifier contains its DOI, suitable for accessing the original document through the CORE website); 4 = DTIC (The document originates from the DTIC ....). | 1 |
MentionSourceName | This is a human-friendly identifier of the source of the document. For material originating from the open web with a URL this field will contain the top-level domain the page was from. For BBC Monitoring material it will contain “BBC Monitoring” and for JSTOR material it will contain “JSTOR.” This field is intended for human display of major sources as well as for network analysis of information flows by source, obviating the requirement to perform domain or other parsing of the MentionIdentifier field. | pafoa.org |
MentionIdentifier | This is the unique external identifier for the source document. It can be used to uniquely identify the document and access it if you have the necessary subscriptions or authorizations and/or the document is public access. This field can contain a range of values, from URLs of open web resources to textual citations of print or broadcast material to DOI identifiers for various document repositories. For example, if MentionType is equal to 1, this field will contain a fully-qualified URL suitable for direct access. If MentionType is equal to 2, this field will contain a textual citation akin to what would appear in an academic journal article referencing that document (NOTE that the actual citation format will vary (usually between APA, Chicago, Harvard, or MLA) depending on a number of factors and no assumptions should be made on its precise format at this time due .... | http://forum.pafoa.org/external.php?s=cf69e072c195f3ceeabf9a28a2318091&type=RSS2&forumids=2 |
SentenceID | The sentence within the article where the event was mentioned (starting with the first sentence as 1, the second sentence as 2, the third sentence as 3, and so on). This can be used similarly to the CharOffset fields below, but reports the event’s location in the article in terms of sentences instead of characters, which is more amenable to certain measures of the “importance” of an event’s positioning within an article. | 75 |
Actor1CharOffset | The location within the article (in terms of English characters) where Actor1 was found. This can be used in combination with the GKG or other analysis to identify further characteristics and attributes of the actor. NOTE: due to processing performed on each article, this may be slightly offset from the position seen when the article is rendered in a web browser. | -1 |
Actor2CharOffset | The location within the article (in terms of English characters) where Actor2 was found. This can be used in combination with the GKG or other analysis to identify further characteristics and attributes of the actor. NOTE: due to processing performed on each article, this may be slightly offset from the position seen when the article is rendered in a web browser. | 23015 |
ActionCharOffset | The location within the article (in terms of English characters) where the core Action description was found. This can be used in combination with the GKG or other analysis to identify further characteristics and attributes of the actor. NOTE: due to processing performed on each article, this may be slightly offset from the position seen when the article is rendered in a web browser. | 22997 |
InRawText | This records whether the event was found in the original unaltered raw article text (a value of 1) or whether advanced natural language processing algorithms were required to synthesize and rewrite the article text to identify the event (a value of 0). See the discussion on the Confidence field below for more details. Mentions with a value of “1” in this field likely represent strong detail-rich references to an event. | 1 |
MentionDocLen | The length in English characters of the source document (making it possible to filter for short articles focusing on a particular event versus long summary articles that casually mention an event in passing). | 33679 |
Confidence | Percent confidence in the extraction of this event from this article. See the discussion in the codebook at http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf | 100 |
MentionDocTone | The same contents as the AvgTone field in the Events table, but computed for this particular article. NOTE: users interested in emotional measures should use the MentionIdentifier field above to merge the Mentions table with the GKG table to access the complete set of 2,300 emotions and themes from the GCAM system. | -3.44962 |
MentionDocTranslationInfo | This field is internally delimited by semicolons and is used to record provenance information for machine translated documents indicating the original source language and the citation of the translation system used to translate the document for processing. It will be blank for documents originally in English. At this time the field will also be blank for documents translated by a human translator and provided to GDELT in English (such as BBC Monitoring materials) – in future this field may be expanded to include information on human translation pipelines, but at present it only captures information on machine translated materials. An example of the contents of this field might be “srclc:fra; eng:Moses 2.1.1 / MosesCore Europarl fr-en / GT-FRA 1.0”. NOTE: Machine translation is often not as accurate as human translation and users requiring the highest possible confidence levels may wish to exclude events... | |
Extras | This field is currently blank, but is reserved for future use to encode special additional measurements for selected material. | |
inserted | Date and time the record was inserted into Datacoll | 2017-12-29 15:30:00 |
filename | Name of the source file from which this record originates | 20171027144500.mentions.CSV |
hash_key | Hash of GLOBALEVENTID, MentionTimeDate, MentionIdentifier, Actor2CharOffset, and ActionCharOffset; part of the primary key | 000121889d38d6ce9989dbd189209ee9 |
EventTimeDateMY | Like EventTimeDate, but only year and month (YYYYMM); part of the primary key | 201901 |
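
The timestamp and translation fields above lend themselves to a little post-processing. The following Python sketch shows one way to compare EventTimeDate with MentionTimeDate, derive EventTimeDateMY, and split MentionDocTranslationInfo. All function names are hypothetical, treating equal timestamps as a "first mention" is my reading of the codebook text, and returning EventTimeDateMY as a string is an assumption about the column type.

```python
from datetime import datetime, timedelta

GDELT_TS_FORMAT = "%Y%m%d%H%M%S"  # YYYYMMDDHHMMSS, as described in the table


def parse_gdelt_ts(value):
    """Parse a YYYYMMDDHHMMSS value (int or str) into a datetime."""
    return datetime.strptime(str(value), GDELT_TS_FORMAT)


def is_first_mention(event_time_date, mention_time_date):
    """True if the mention falls in the same 15-minute update as the event."""
    return parse_gdelt_ts(event_time_date) == parse_gdelt_ts(mention_time_date)


def is_older_than(event_time_date, mention_time_date, min_age=timedelta(weeks=1)):
    """True if the event was recorded at least min_age before the mention."""
    age = parse_gdelt_ts(mention_time_date) - parse_gdelt_ts(event_time_date)
    return age >= min_age


def event_year_month(event_time_date):
    """Derive the EventTimeDateMY value (YYYYMM) from EventTimeDate."""
    return parse_gdelt_ts(event_time_date).strftime("%Y%m")


def parse_translation_info(value):
    """Split the semicolon-delimited MentionDocTranslationInfo into a dict.

    An empty field (documents originally in English) yields an empty dict.
    """
    info = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition(":")
        if sep:  # skip empty or malformed fragments without a colon
            info[key.strip()] = val.strip()
    return info


# Example values taken from the table rows above.
print(is_first_mention(20161104081500, 20171104053000))
print(is_older_than(20161104081500, 20171104053000))
print(event_year_month(20161104081500))
print(parse_translation_info(
    "srclc:fra; eng:Moses 2.1.1 / MosesCore Europarl fr-en / GT-FRA 1.0"))
```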