Extraction of causal links from scientific literature

This page introduces how causal links are extracted from scientific literature in the Upright net impact model.

Content simplified for clarity

The explanation on this page is simplified for clarity, and does not reflect all intricacies of the actual knowledge extraction algorithms.

The primary data source for the Upright net impact model is a database of 200M+ scientific articles. The approach to extracting causal links from scientific literature is to determine the volume of scientific research that studies a particular product and a particular impact, and subsequently how often this research concludes that the given product causes the given impact. Relevant information is automatically extracted in two steps:

  1. Collection of relevant articles

  2. Causality classification

These two steps yield two counts: the number of relevant articles, and the number of those articles that find a causal link. The counts are used to determine the magnitude and certainty of the causal relationship between a product and an impact, namely:

  • The rate at which relevant articles are found to have a causal link is considered to be indicative of the magnitude of the causal relationship: 50 articles with a causal link out of 100 relevant articles implies a stronger relationship than 10 articles out of 100.

  • The absolute volume of relevant articles is considered to be indicative of the certainty of the available information: 50 articles with causal link out of 100 relevant articles is a more reliable assessment than 5 articles out of 10, even if the ratio is the same.
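The two signals above can be illustrated with a minimal sketch. The class and field names (`LinkEvidence`, `relevant_articles`, `causal_articles`) are hypothetical and chosen for illustration only; how the actual model turns these counts into magnitude and certainty scores is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEvidence:
    """Article counts for one (product, impact) pair. Hypothetical structure."""

    relevant_articles: int  # articles studying the product and the impact
    causal_articles: int    # subset that concludes the product causes the impact

    @property
    def causal_rate(self) -> float:
        """Share of relevant articles with a causal link (magnitude signal)."""
        if self.relevant_articles == 0:
            return 0.0
        return self.causal_articles / self.relevant_articles


# The three cases used in the text above.
strong = LinkEvidence(relevant_articles=100, causal_articles=50)
weak = LinkEvidence(relevant_articles=100, causal_articles=10)
small_sample = LinkEvidence(relevant_articles=10, causal_articles=5)

# Same rate as `strong`, but backed by a tenth of the article volume,
# so it would be treated as a less certain assessment.
same_rate = strong.causal_rate == small_sample.causal_rate
less_volume = small_sample.relevant_articles < strong.relevant_articles
```

Keeping the rate and the volume as separate values, rather than collapsing them into one score, mirrors the distinction the text draws between magnitude and certainty.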

Collection of relevant articles

The database of scientific articles is scanned for articles that mention any combination of a product phrase and an impact phrase.

Product phrases originate from the Upright product graph, in which each product is associated with a list of phrases commonly used to refer to it. For apples, for example, these could be apple or malus domestica, the Latin name for the apple.

Impact phrases are phrases used in scientific literature to discuss impacts within Upright's 19 impact categories. For example, the word diabetes is an impact phrase for the impact category Diseases.

Articles identified using this approach are summarized into totals of articles discussing each given product.
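As a rough illustration of this collection step, the sketch below counts articles per product and per product and impact combination using naive phrase matching. The phrase dictionaries, the sample texts, and the helper names (`mentions`, `collect`) are assumptions made for this example; the matching used in the actual model is not described here and is likely more sophisticated.

```python
import re
from collections import Counter

# Hypothetical phrase lists, based on the examples given in the text.
PRODUCT_PHRASES = {"Apples": ["apple", "malus domestica"]}
IMPACT_PHRASES = {"Diseases": ["diabetes"]}


def mentions(text: str, phrases: list[str]) -> bool:
    """Return True if any phrase occurs as a whole word (optional plural 's')."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(phrase) + r"s?\b", lowered)
        for phrase in phrases
    )


def collect(articles: list[str]) -> tuple[Counter, Counter]:
    """Count articles per product, and per (product, impact) co-mention."""
    product_totals: Counter = Counter()
    pair_totals: Counter = Counter()
    for text in articles:
        for product_name, product_phrases in PRODUCT_PHRASES.items():
            if not mentions(text, product_phrases):
                continue
            product_totals[product_name] += 1
            for impact_name, impact_phrases in IMPACT_PHRASES.items():
                if mentions(text, impact_phrases):
                    pair_totals[(product_name, impact_name)] += 1
    return product_totals, pair_totals


# Made-up article texts, for illustration only.
sample_articles = [
    "Apple consumption and the risk of type 2 diabetes.",
    "Effect of weather on the growth of apples.",
    "Diabetes prevalence in urban populations.",
]
product_totals, pair_totals = collect(sample_articles)
```

In this sample, two texts count towards the Apples total, and only the first counts towards the Apples and Diseases combination.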

Causality classification

Scientific articles factor into the quantification of most of the impact categories of the net impact model. To assess causality, the model determines for each relevant article whether the article has found a causal link between the given product phrase and impact phrase.

Articles identified using this approach are summarized into totals of articles that find that a given product (as defined by some product phrases) causes a given impact (as defined by some impact phrases).
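The tallying described above might look like the following sketch. The classifier is passed in as a function because this page does not describe how the Upright causal classifier works; `toy_classifier` is a deliberately crude keyword placeholder and not the actual method. All names and sample texts are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable

# (product, impact, article text) for an article collected in the previous step.
Candidate = tuple[str, str, str]
Classifier = Callable[[str, str, str], bool]


def toy_classifier(product: str, impact: str, text: str) -> bool:
    """Placeholder only: flags texts containing an explicit causal verb."""
    lowered = text.lower()
    return any(cue in lowered for cue in ("causes", "reduces", "increases"))


def tally_causal(
    candidates: Iterable[Candidate], classifier: Classifier
) -> tuple[Counter, Counter]:
    """Return (relevant, causal) article counts per (product, impact) pair."""
    relevant: Counter = Counter()
    causal: Counter = Counter()
    for product, impact, text in candidates:
        key = (product, impact)
        relevant[key] += 1
        if classifier(product, impact, text):
            causal[key] += 1
    return relevant, causal


# Made-up candidate articles, for illustration only.
candidates = [
    ("Apples", "Diseases", "Regular apple intake reduces the risk of diabetes."),
    ("Apples", "Diseases", "We survey viral infections in apple orchards."),
]
relevant, causal = tally_causal(candidates, toy_classifier)
```

Here both texts count as relevant for the Apples and Diseases pair, but only the first is counted as finding a causal link.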

Article examples

Collection and classification of articles can be illustrated by 3 real-world articles:

  • Article with relevant causal links: They say an apple a day keeps the doctor away. The Upright causal classifier has detected this article, among others, as supporting this claim. Such articles are counted as relevant articles towards the positive Health impact.

  • Article without relevant causal links: This article also discusses apples and infections. While infections are a relevant impact phrase for Health, the article only investigates viral infections in apples, not whether apples cause infections. The article is therefore counted only towards the total article count of Apples, not as a relevant article towards the Health impact.

  • Article only discussing the product: While this article discusses fruits, it focuses on the effect of weather on the growth of fruits without discussing any relevant impacts. The article is therefore counted only towards articles relevant to the product.

Mitigation of funding and publication bias

Funding bias refers to the tendency for research outcomes to be influenced by the financial interests of the study's sponsor. Publication bias refers to the tendency for scientific research to be published based on its results rather than its quality or importance. Such biases are present in source material used by Upright, including the CORE database of 200M+ scientific articles.

To mitigate such biases, Upright calibrates its impact results using reliable third-party datasets, such as the WHO's Global Burden of Disease dataset. While such datasets don't provide comprehensive coverage of all impacts of products and services, they provide an effective big-picture mitigation of the biases present in the source material.
