Automatic Extraction of Causal Chains from Text

Background. Automatic extraction of causal chains is valuable for discovering previously unknown and hidden connections between events. However, only a handful of works are devoted to automatic extraction of causal chains from text. Objective. To develop a method for automatic extraction of causal chains from text. Method. A new approach based on linguistic templates is suggested for causal chain extraction. It is domain-independent, not restricted to extraction from single sentences, and deployed on big data. For implementation, a sequence of four modules was deployed: verb restriction, part-of-speech tagging, extracting causal relations, and unification and matching events. Results. 14,821 causal chains (of length 2) have been extracted from 100,000 English Wikipedia articles. Contributions. The extracted causal chains can contribute to developing commonsense knowledge bases, reasoning resources, and problem-solving systems, and more generally to discovering previously unknown relationships between entities/events.


INTRODUCTION
In recent years, automatic extraction of causal chains has become increasingly important for discovering previously unknown relationships between entities or events. This type of knowledge could be very valuable (de Silva, Zhibo, Rui, & Kezhi, 2017). Causal chains are particularly useful in medicine and biology, where they can be applied to finding unknown connections between symptoms, diseases, and their drugs (Khoo, Chan, & Niu, 2000). By analyzing chains of causal implication within the medical literature, new hypotheses for causes of rare diseases have been discovered. Some of those have received supporting experimental evidence (Swanson, 1987; Swanson & Smalheiser, 1997). The chain of causation can also benefit problem-solving systems.
Nevertheless, to the best of our knowledge, there are only a handful of works devoted to automatic extraction of causal chains from text (Asghar, 2016). A few works determine the causal chain based on the template NP1 causal-verb NP2 (Sawamaru & Kobayashi, 2012). A number of cases were observed where a single sentence contains two different causal assertions, chained together. To handle multiple instances of causality present in the same sentence, it is split into sub-sentences (Hendrickx et al., 2009).
There is some research on extracting causal chains in narrow domains, for example, to determine the chain of causation of teen drug addiction from the documents for enhancing the warning system on the social Web (Pechsiri & Sukharomana, 2017), or to reveal the connection between initial information about the incident and its causes in aviation investigation reports, based on a graph-based text representation that captures both the structure and the content of the report (Sizov & Ozturk, 2013).
We present an approach to causal chain extraction that is domain-independent, not restricted to single sentences, and deployed on big data.

CAUSAL CHAIN VERSUS CAUSAL RELATION
A causal chain is defined as a sequence of causal relations that lead to some final effect. So, we have something like: event-1 causes event-2 causes event-3, etc. In other words, a causal chain is a sequence of events related by causality. Causal

PROBLEM OF CAUSAL CHAIN EXTRACTION
Extraction of a causal chain assumes that, in a cause-effect relation, the effect can be a follow-up cause. This suggests that the extraction pattern for cause and effect should be the same.
The traditional explicit syntactic patterns for the detection and extraction of causal relations focus on two-member causal extraction (Bethard, Corvey, Klingenstein, & Martin, 2008; Khoo, Kornfilt, Oddy, & Myaeng, 1998; Luo, Zhu, Hwang, & Wang, 2016; O'Gorman, Wright-Bettner, & Palmer, 2016). They cannot be used directly for this purpose, since they either have different sub-patterns for cause and effect (resultative constructions) or rely on implicit patterns, such as causal links or causal cues, if-then conditionals, or adverbs/adjectives, where cause and effect do not have any patterns at all.
The difficulties with causal chain extraction can be illustrated by the following example from (Khoo et al., 1998). Let us assume the following causal relation was extracted: It was raining heavily and because of this, the car failed to brake in time. There are two events here: event A (It was raining heavily) and event B (the car failed to brake in time). To extract a chain, event B, which is the effect in the cause-effect relation with event A, must be considered as a cause for the next step in the chain. So, we need to find an event C that is caused by event B. It is very unlikely, however, that event B will be described by the same seven words (the car failed to brake in time) in both A causes B and B causes C. The problem is therefore how to represent event B for matching.

METHOD OF CAUSAL CHAIN EXTRACTION
Causal chain extraction A causes B causes C assumes (1) finding explicit linguistic patterns (clues) of causality between A&B and B&C, and (2) finding an event B which is the same in both causal relations: A causes B and B causes C. To satisfy both of the above assumptions, we make another two assumptions: (1) the simplest linguistic patterns indicating causality between events are based on to/by; (2) the simplest syntactic structure for event representation and matching is V+NP/Pro, where V is a verb, NP is a noun phrase, and Pro is a pronoun.
Based on the assumptions, the following two linguistic templates are used for causal chain extraction, where V1 is a verb representing a cause-event and V2 a verb representing the effect-event:

V1+NP1+to+V2+NP2
Example: stabbed the guy to kill him; change his name to obtain ownership

V2+NP2+by+V1+NP1
Example: kill the guy by stabbing him; changed his name by dropping the prefix

The two linguistic templates create the unification in a causal chain because V2 can be considered as a representative of a cause in the same way as V1, but for the next step in causality.
It is important to note that the patterns should contain at least one affected object; otherwise the causal link will be too general and abstract.
Similar patterns were suggested in VerbOcean (Chklovski & Patel, 2004) for enablement; for example, Xed * by Ying, where * matches any single word. However, VerbOcean does not include a noun after a verb X in a causal pattern between verbs. The principal difference in our approach is that opened the door by, opened the bottle by and opened the book by assume a different Y.
Extracting causal relations from text using linguistic templates was also explored in (Luo et al., 2016). However, the causal events considered in their work are limited by a single-term representation (for example, typhoon, sunrise, bacteria) and are linked into causal relations, not causal chains. Our work is more general in the sense that an event is described as a verb and a noun phrase, and causal relations are extended into causal chains.
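The two templates can be illustrated with a minimal sketch over POS-tagged tokens. It is not our implementation: the function names (read_np, read_event, extract_relations) are hypothetical, and the noun phrase is simplified here to optional determiners, possessives and adjectives followed by one or more nouns, or a single pronoun, using Penn Treebank tags.

```python
from typing import List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (token, Penn Treebank tag)


def read_np(tagged: Tagged, i: int) -> Optional[int]:
    """Return the index just past a simplified NP/Pro starting at i, or None."""
    if i < len(tagged) and tagged[i][1] == "PRP":
        return i + 1  # a single personal pronoun (Pro)
    j = i
    # Assumption: optional determiners, possessive pronouns and adjectives.
    while j < len(tagged) and tagged[j][1] in {"DT", "PRP$", "JJ"}:
        j += 1
    k = j
    while k < len(tagged) and tagged[k][1].startswith("NN"):
        k += 1
    return k if k > j else None  # at least one noun is required


def read_event(tagged: Tagged, i: int) -> Optional[Tuple[str, int]]:
    """Read V+NP/Pro at i; return (event text, next index) or None."""
    if i >= len(tagged) or not tagged[i][1].startswith("VB"):
        return None
    end = read_np(tagged, i + 1)
    if end is None:
        return None
    return " ".join(tok for tok, _ in tagged[i:end]), end


def extract_relations(tagged: Tagged) -> List[Tuple[str, str]]:
    """Return (cause, effect) pairs matched by the to- and by-templates."""
    relations = []
    for i in range(len(tagged)):
        first = read_event(tagged, i)
        if first is None:
            continue
        left, j = first
        if j >= len(tagged) or tagged[j][0].lower() not in {"to", "by"}:
            continue
        second = read_event(tagged, j + 1)
        if second is None:
            continue
        right = second[0]
        if tagged[j][0].lower() == "to":
            relations.append((left, right))   # V1+NP1+to+V2+NP2
        else:
            relations.append((right, left))   # V2+NP2+by+V1+NP1
    return relations


example = [("stabbed", "VBD"), ("the", "DT"), ("guy", "NN"),
           ("to", "TO"), ("kill", "VB"), ("him", "PRP")]
print(extract_relations(example))  # [('stabbed the guy', 'kill him')]
```

In both templates the pair is returned as (cause, effect), so that the effect of one relation can later be compared with the cause of another.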

IMPLEMENTATION
The flow chart in Figure 1 shows the general approach to causal chain extraction from text. The process is divided into two main steps. In the first step, causal relations are found by matching pre-defined linguistic templates. In the second step, causal chains are constructed by joining the relations using the process of unification and matching. The details of the approach are given below.
English Wikipedia articles are used as raw data. 100,000 articles were selected randomly for automatic extraction of causal chains.
• Verb restriction There are some phrases, such as allow workers to liberate themselves or have the power to remove the Head, that match the templates but are not causal relations. This happens because a modal verb (allow, have) governs the other verb in the phrase. To eliminate the problem, verbs of this type (for example, appear, be, begin, consider) are excluded from the extraction. In total, 85 modal verbs were excluded. This way, the number of false positives is minimized.
• POS tagging This is performed to allow for POS pattern matching in the next step. The POS tagger used is Averaged Perceptron (Collins, 2002) and the tagset is the Penn Treebank tagset (Santorini, 1990).

• Extracting causal relations After POS tagging is done, the causal relations are extracted by matching the two linguistic templates:
V1+NP1+to+V2+NP2
V2+NP2+by+V1+NP1
• Unification and matching events The procedure used to construct causal chains is as follows: for every causal relation, find a second causal relation such that the cause event of the second relation can be unified with the effect event of the first relation. In our current work, only causal chains consisting of two relations are constructed. Longer causal chains can be built in a similar manner and might be explored in future work. In order to unify the events, exact string matching is insufficient and will miss many causal chains, mainly because the same event can be represented by different phrasings. The event unification procedure only considers nouns, pronouns and verbs in the causal relations, ignoring other parts of speech that are not considered significant in event matching. Verbs with different inflections but the same infinitive form are also considered to be the same verb.
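The unification and matching step can be sketched as a join on a normalized event key. This is an illustration under stated assumptions rather than our implementation: the names LEMMAS, event_key and build_chains are hypothetical, the lemma table is a toy stand-in for a real lemmatizer, and possessive pronouns (tag PRP$) are treated as pronouns.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = List[Tuple[str, str]]        # (token, Penn Treebank tag)
Relation = Tuple[Event, Event]       # (cause event, effect event)

# Toy lemma table (assumption); a real system would use a lemmatizer.
LEMMAS: Dict[str, str] = {"raising": "raise", "raised": "raise",
                          "creating": "create"}


def event_key(event: Event) -> Tuple[str, ...]:
    """Keep only verbs (as infinitives), nouns and pronouns, in order."""
    key = []
    for token, tag in event:
        word = token.lower()
        if tag.startswith("VB"):
            key.append(LEMMAS.get(word, word))
        elif tag.startswith("NN") or tag.startswith("PRP"):
            key.append(word)
    return tuple(key)


def build_chains(relations: List[Relation]) -> List[Tuple[Event, Event, Event]]:
    """Join relations whose effect unifies with another relation's cause."""
    by_cause = defaultdict(list)
    for cause, effect in relations:
        by_cause[event_key(cause)].append(effect)
    chains = []
    for cause, effect in relations:
        for final_effect in by_cause.get(event_key(effect), []):
            chains.append((cause, effect, final_effect))
    return chains


r1 = ([("creating", "VBG"), ("the", "DT"), ("foundation", "NN")],
      [("raise", "VB"), ("the", "DT"), ("money", "NN")])
r2 = ([("raising", "VBG"), ("money", "NN")],
      [("refurbish", "VB"), ("the", "DT"), ("memorial", "NN")])
for chain in build_chains([r1, r2]):
    print(" -> ".join(" ".join(tok for tok, _ in ev) for ev in chain))
```

Here raise the money and raising money receive the same key (raise, money), so the two relations are joined into one chain even though their surface strings differ. Indexing the relations by cause key avoids comparing every pair of relations directly.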

RESULTS
Examples of the extracted causal chains are given in Table 1. The dependencies between the number of articles, number of sentences involved, causal relations extracted and causal chains extracted are shown in Table 2.
The graph in Figure 2 shows an almost linear relationship between the number of causal relations extracted and the number of sentences. The graph in Figure 3 shows the relationship between the number of causal chains and the number of sentences. It follows an approximately quadratic dependency, which is as expected since the number of causal chains is proportional to the number of pairs of causal relations.
As mentioned earlier, the extracted causal chains do not follow a linear structure. They form a net, since an event can be caused by many events and can cause many events. For example, using the by-template for the event commit suicide, the following causes were extracted: by dashing her head, by jumping off, by taking overdose, by cutting his throat, by throwing herself, by falling on sword, by gassing herself, by hanging herself, by tying a shoelace, by inhaling the exhaust gas, by leaping, by opening his veins, by slashing his own throat, by stabbing, etc. In turn, using the to-template for the same event as a cause, we got the following effects: avoid capture, defend the honour, avoid assimilation, etc. Combining the set of extracted causal relations with itself, for the same event as an effect (1st causal relation) and as a cause (2nd causal relation), amounts to taking a Cartesian product that yields all possible sequences of causal events as separate chains.
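The Cartesian combination around a shared event can be shown with a short sketch. The function name chains_through is hypothetical, and the inputs are a small subset of the causes and effects listed above.

```python
from itertools import product
from typing import List, Tuple


def chains_through(event: str, causes: List[str],
                   effects: List[str]) -> List[Tuple[str, str, str]]:
    """Pair every cause of the event with every effect of the event."""
    return [(cause, event, effect) for cause, effect in product(causes, effects)]


causes = ["jumping off", "taking overdose", "hanging herself"]
effects = ["avoid capture", "defend the honour"]
for chain in chains_through("commit suicide", causes, effects):
    print(" -> ".join(chain))
```

With m causes and n effects for the same event, m * n separate chains are produced (six in this example), which is consistent with the number of chains growing faster than the number of relations.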

EVALUATION
We obtained 14,821 causal chains (of length 2) from 100,000 English Wikipedia articles. Manual inspection shows that most of the bad chains were caused by two factors: (1) bad POS tagging (for example, gives birth to twin sons, where twin is recognized as the verb; reduce redundant work by taking advantage, with an incomplete NP) and (2) verbs unsuitable for event representation (the verb remain in remains the only foreign-born driver to win the race). Extraction of causal chains is a new task and there are as yet no systematic evaluation measures. Our evaluation was based on a sample of 100 causal chains randomly taken from the ones we extracted.
Due to restrictions on event extraction (V+NP) and causal chain extraction (patterns with by and to only), we cannot extract causal chains comprehensively. As a result, recall is not an appropriate measure for the extraction. Nevertheless, we do not place any restrictions on the verbs (except a very obvious one: see the verb restriction step in IMPLEMENTATION for details), and our method assumes involvement of the whole set of English verbs. The data we used might create the illusion that only causal chains related to human actions were extracted. In reality, this was caused by the nature of the data chosen (Wikipedia), not by the method applied to it. The precision (effectiveness) of causal chain extraction was evaluated by five human judges. They were asked to assign each chain a number from 1 (very bad chain) to 5 (very good chain). They were given the task formulation and the definitions of event, causal relation and causal chain.
After scoring by the five judges, a method of aggregate evaluation was applied. If the deviation in evaluation between all five judges was no more than two points, the average was calculated. If the deviation between four judges was no more than one point, the average was calculated. If there was no deviation between three judges, their evaluation was accepted. If a score falls into more than one category (for example, 5-3-3-4-4 falls into two categories: five judges with a deviation of two points and four judges with a deviation of one point), the average is calculated for each category and the maximum of those values is taken. In all other cases (for example, in the case of 5-3-2-4-1), the judges' evaluations were not counted, since they were too variable. See Table 3 for details. Table 4 shows the number of causal chains (among 100 randomly chosen) that passed the aggregate evaluation. Table 5 shows the number of aggregate chains for each type of deviation with the corresponding average mark. The table also shows the final evaluation over all types of deviation, with an average mark of 3.02. Based on that, one can conclude that the method allows extracting chains that were rated, on average, as not bad. Table 6 provides the distribution of the scores (among 70 cases in total) with an average evaluation of 4 and above, 3-4, and lower than 3. Examples of chains for each cluster are provided. Our observations show that the cluster with an average evaluation lower than 3 contains many relations with bad POS tagging.
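The aggregate evaluation rule can be sketched as follows. Two points are assumptions of this sketch, not statements of the procedure above: "deviation" is read as the difference between the highest and lowest score in a group, and when several groups of four (or three) judges qualify, each qualifying group contributes its own average. The function name aggregate_score is hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Optional, Sequence


def aggregate_score(scores: Sequence[int]) -> Optional[float]:
    """Aggregate five judges' scores (1-5); return None if too variable."""
    candidates = []
    # Category 1: all five judges within two points of each other.
    if max(scores) - min(scores) <= 2:
        candidates.append(mean(scores))
    # Category 2: some four judges within one point of each other.
    # Category 3: some three judges in exact agreement.
    for size, max_spread in ((4, 1), (3, 0)):
        for group in combinations(scores, size):
            if max(group) - min(group) <= max_spread:
                candidates.append(mean(group))
    # If several categories apply, the maximum average is taken.
    return max(candidates) if candidates else None


print(aggregate_score([5, 3, 3, 4, 4]))  # 3.8
print(aggregate_score([5, 3, 2, 4, 1]))  # None
```

For 5-3-3-4-4, the five-judge average is 3.8 and the four-judge average (3, 3, 4, 4) is 3.5, so 3.8 is returned. For 5-3-2-4-1, no category applies and the chain is not counted.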

POTENTIAL APPLICATIONS AND FURTHER WORK
From the first deployment of the method, which includes four modules (verb restriction, POS tagging, extracting causal relations, and unification and matching events), we can conclude that the method is rather effective. It can be tuned further to improve the evaluation results. The weakest module in the whole process, POS tagging, is the only one we cannot improve ourselves, since we took it ready-to-use from outside resources.
The causal chains that were derived from Wikipedia are universal in the sense that they are not related to any specific area. The method is domain-independent, since we use the whole set of English verbs for event extraction. Causal chains form a net that can be used in developing commonsense knowledge bases and reasoning resources, and more generally in discovering previously unknown relationships between entities/events. In particular, it can help when event prediction or decision making is needed. For example, with refurbishing the memorial as the final goal to be accomplished, we can decide how to achieve it by choosing between multiple chains leading to the goal: refurbishing the memorial by raising money; raising money by creating the foundation OR raising money by selling accessories OR raising money by running the race, etc.
We hope to improve the current results by adding synonyms and by scaling from 100,000 articles to the 5.7 million articles in English Wikipedia. One of the primary goals for our future work will be to develop a formal procedure for evaluation.