A Constraint-Based Algorithm For Causal Discovery
with Cycles, Latent Variables & Selection Bias
Abstract
Causal processes in nature may contain cycles, and real datasets may violate causal sufficiency as well as contain selection bias. No constraint-based causal discovery algorithm can currently handle cycles, latent variables and selection bias (CLS) simultaneously. I therefore introduce an algorithm called Cyclic Causal Inference (CCI) that makes sound inferences with a conditional independence oracle under CLS, provided that we can represent the cyclic causal process as a non-recursive linear structural equation model with independent errors. Empirical results show that CCI outperforms CCD in the cyclic case and rivals FCI and RFCI in the acyclic case.
1 The Problem
Scientists often infer causation using data collected from randomized controlled experiments. However, randomized experiments can be slow, non-generalizable, unethical or expensive. Consider for example trying to discover the causes of a human illness, a common scenario in modern medical science. Performing interventions with possibly harmful consequences on humans is unethical, so scientists often perform experiments on animals instead, knowing full well that the causal relationships discovered in animals may not generalize to humans. Moreover, many possible causes for an illness often exist, so scientists typically perform numerous animal experiments in order to discover the causes. A lengthy trial and error process therefore ensues at considerable financial expense. We would ideally like to speed up this scientific process by discovering causation directly from human observational data, which we can more easily acquire.
The need for faster causal discovery has motivated many to develop algorithms for inferring causation from observational data. The PC algorithm, for example, represents one of the earliest algorithms for inferring causation using i.i.d. data collected from an underlying acyclic causal process (Spirtes00). PC actually falls within a wider class of causal discovery algorithms called constraint-based (CB) algorithms, which utilize a conditional independence (CI) oracle, or a CI test in the finite sample case, to reconstruct the underlying causal graph. The FCI algorithm is another example of a CB algorithm which extends PC to handle latent variables and selection bias (Spirtes00, Zhang08). Yet another CB algorithm called CCD cannot handle latent variables (and perhaps selection bias) like FCI can, but CCD can infer cyclic causal structure provided that all causal relations are linear (Richardson96, Richardson99). Many other CB algorithms exist, and most of these methods come with some guarantee of soundness in the sense that their outputs are provably correct with a CI oracle.
The aforementioned CB algorithms and many other non-CB algorithms have been successful in inferring causation under their respective assumptions. However, causal processes and datasets encountered in practice may not satisfy the assumptions of the algorithms. In particular, many causal processes are known to contain feedback loops (Sachs05), and datasets may contain latent variables as well as some degree of selection bias (Spirtes95). Few algorithms can handle cycles, latent variables and selection bias (CLS) simultaneously, so scientists often must unwillingly apply other methods knowing that the outputs may introduce unwanted bias (Sachs05, Mooij13). Solving the problem of causal discovery under CLS would therefore provide a much needed basis for justifying the output of causal discovery algorithms when run on real data.
A few investigators have devised non-CB solutions for the problem of causal discovery under CLS. Hyttinen13 introduced the first approach, where CI constraints are fed into a SAT solver which then outputs a graph consistent with the constraints. However, the method can be slow because the SAT solver does not construct efficient test schedules like CB algorithms do. Strobl_thesis17 provided a different solution in the Gaussian case, provided that the cyclic causal process can be decomposed into a set of acyclic ones. The method uses both conditional independence testing and mixture modeling, but the mixture modeling inhibits a straightforward extension of the method to the nonparametric setting even in the linear case. Existing solutions to the problem of causal discovery under CLS thus fall short in either efficiency or generalizability.
The purpose of this paper is to introduce the first CB algorithm that is sound under CLS. The method is efficient because it constructs small test schedules, and it is generalizable to the nonparametric setting because the algorithm only requires a sound CI test. We introduce the new CB algorithm as follows. We first provide background material on causal discovery without cycles in Sections 2 through 4. We then review causal discovery with cycles in Section 5. Section 6 introduces the new notion of a maximal almost ancestral graph (MAAG) for summarizing cyclic graphs with latent variables and selection bias. Next, Section 7 contains an overview of the proposed algorithm, while Section 8 outlines an algorithm trace. Section 9 subsequently lists the details of the proposed CB algorithm. Experimental results are included in Section 10. We finally conclude the paper in Section 11. Most of the proofs are located in the Appendix.
2 Graph Terminology
Let italicized capital letters such as $A$ denote a single variable and bolded as well as italicized capital letters such as $\boldsymbol{A}$ denote a set of variables (unless specified otherwise). We will also use the terms "variables" and "vertices" interchangeably.
A graph consists of a set of vertices and a set of edges between pairs of vertices. The edge set may contain the following six edge types: $\rightarrow$ (directed), $\leftrightarrow$ (bidirected), $-$ (undirected), $\circ\!-$ (partially undirected), $\circ\!\rightarrow$ (partially directed) and $\circ\!-\!\circ$ (nondirected).
We call a graph containing only directed edges a directed graph. We will only consider directed graphs without self-loops in this paper. On the other hand, a mixed graph contains directed, bidirected and undirected edges. We say that $A$ and $B$ are adjacent in a graph if they are connected by an edge, independent of the edge's type. An (undirected) path $\Pi$ between $A$ and $B$ is a set of consecutive edges (also independent of their type) connecting the variables such that no vertex is visited more than once. A directed path from $A$ to $B$ is a set of consecutive directed edges from $A$ to $B$ in the direction of the arrowheads. A cycle occurs when a path exists between $A$ and $B$, and $A$ and $B$ are adjacent. More specifically, a directed path from $A$ to $B$ forms a directed cycle with the directed edge $B \rightarrow A$ and an almost directed cycle with the bidirected edge $B \leftrightarrow A$. We call a directed graph a directed acyclic graph (DAG) if it does not contain directed cycles.
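As a small illustration of these definitions, a directed cycle can be detected with a depth-first search over the directed edges. The adjacency-list representation and function name below are assumptions for the sketch, not notation from the paper:

```python
def has_directed_cycle(children):
    """DFS test for a directed cycle in a graph given as {vertex: set_of_children}."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on the DFS stack / finished
    color = {v: WHITE for v in children}

    def visit(v):
        color[v] = GRAY
        for w in children.get(v, set()):
            if color.get(w, WHITE) == GRAY:
                return True               # back edge: a directed cycle exists
            if color.get(w, WHITE) == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in list(children))
```

For instance, the two-vertex feedback loop `{'A': {'B'}, 'B': {'A'}}` is reported as cyclic, while the chain `{'A': {'B'}, 'B': set()}` is not.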
Three vertices $\langle A, B, C \rangle$ form an unshielded triple if $A$ and $B$ are adjacent, $B$ and $C$ are adjacent, but $A$ and $C$ are not adjacent. On the other hand, the three vertices form a triangle when $A$ and $C$ are also adjacent. We call a non-endpoint vertex $B$ on a path $\Pi$ a collider on $\Pi$ if both the edges immediately preceding and succeeding the vertex have an arrowhead at $B$. Likewise, we refer to a non-endpoint vertex on $\Pi$ which is not a collider as a non-collider. Finally, an unshielded triple involving $\langle A, B, C \rangle$ is more specifically called a v-structure if $B$ is a collider on the subpath $A \ast\!\rightarrow B \leftarrow\!\ast\, C$.
We say that $A$ is an ancestor of $B$ (and $B$ is a descendant of $A$) if and only if there exists a directed path from $A$ to $B$ or $A = B$. We write $A \in \text{An}(B)$ to mean $A$ is an ancestor of $B$ and $A \in \text{De}(B)$ to mean $A$ is a descendant of $B$. We also apply the definitions of an ancestor and descendant to a set of vertices $\boldsymbol{W}$ as follows:

$$\text{An}(\boldsymbol{W}) = \{A : A \in \text{An}(B) \text{ for some } B \in \boldsymbol{W}\}, \qquad \text{De}(\boldsymbol{W}) = \{A : A \in \text{De}(B) \text{ for some } B \in \boldsymbol{W}\}.$$
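As a concrete sketch, ancestor sets of this kind can be computed by a reachability search over parent pointers; every vertex counts as its own ancestor, matching the definition above. The `parents` dictionary representation is an illustrative assumption:

```python
def ancestors(parents, b):
    """An(B): every vertex with a directed path to b, plus b itself.

    `parents` maps each vertex to its set of direct parents. Works for
    cyclic graphs because visited vertices are never re-expanded.
    """
    seen, stack = set(), [b]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, set()))
    return seen

def ancestors_of_set(parents, ws):
    """An(W): the union of An(B) over every vertex B in the set W."""
    return set().union(set(ws), *(ancestors(parents, b) for b in ws))
```

For example, with parent sets encoding A → B → C, the ancestors of C are {A, B, C}.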
3 Causal & Probabilistic Interpretations of DAGs
We will interpret DAGs in a causal fashion (Spirtes00, Pearl09). To do this, we consider a stochastic causal process with a distribution $\mathbb{P}$ over $\boldsymbol{X}$ that satisfies the Markov property. A distribution satisfies the Markov property if it admits a density that "factorizes according to the DAG" as follows:

$$f(\boldsymbol{X}) = \prod_{i=1}^{p} f(X_i \mid \text{Pa}(X_i)), \qquad (1)$$

where $\text{Pa}(X_i)$ denotes the parents of $X_i$, i.e., those vertices with a directed edge into $X_i$.
We can in turn relate the above equation to a graphical criterion called d-connection. Specifically, if $G$ is a directed graph in which $\boldsymbol{X}$, $\boldsymbol{Y}$ and $\boldsymbol{Z}$ are disjoint sets of vertices in $G$, then $\boldsymbol{X}$ and $\boldsymbol{Y}$ are d-connected by $\boldsymbol{Z}$ in the directed graph $G$ if and only if there exists an active, or d-connecting, path between some vertex in $\boldsymbol{X}$ and some vertex in $\boldsymbol{Y}$ given $\boldsymbol{Z}$. An active path $\Pi$ between $\boldsymbol{X}$ and $\boldsymbol{Y}$ given $\boldsymbol{Z}$ refers to an undirected path between some vertex in $\boldsymbol{X}$ and some vertex in $\boldsymbol{Y}$ such that, for any collider on $\Pi$, a descendant of the collider is in $\boldsymbol{Z}$ and no non-collider on $\Pi$ is in $\boldsymbol{Z}$. A path is inactive when it is not active. Now $\boldsymbol{X}$ and $\boldsymbol{Y}$ are d-separated by $\boldsymbol{Z}$ in $G$ if and only if they are not d-connected by $\boldsymbol{Z}$ in $G$. For shorthand, we will write $\boldsymbol{X} \perp_d \boldsymbol{Y} \mid \boldsymbol{Z}$ and $\boldsymbol{X} \not\perp_d \boldsymbol{Y} \mid \boldsymbol{Z}$ when $\boldsymbol{X}$ and $\boldsymbol{Y}$ are d-separated or d-connected given $\boldsymbol{Z}$, respectively. The conditioning set $\boldsymbol{Z}$ is called a minimal separating set if and only if $\boldsymbol{X} \perp_d \boldsymbol{Y} \mid \boldsymbol{Z}$ but $\boldsymbol{X}$ and $\boldsymbol{Y}$ are d-connected given any proper subset of $\boldsymbol{Z}$.
If we have $\boldsymbol{X} \perp_d \boldsymbol{Y} \mid \boldsymbol{Z}$, then $\boldsymbol{X}$ and $\boldsymbol{Y}$ are conditionally independent given $\boldsymbol{Z}$, denoted as $\boldsymbol{X} \perp\!\!\!\perp \boldsymbol{Y} \mid \boldsymbol{Z}$, in any joint density factorizing according to (1) (Lauritzen90); we refer to this property as the global directed Markov property. We also refer to the converse of the global directed Markov property as d-separation faithfulness; that is, if $\boldsymbol{X} \perp\!\!\!\perp \boldsymbol{Y} \mid \boldsymbol{Z}$, then $\boldsymbol{X}$ and $\boldsymbol{Y}$ are d-separated given $\boldsymbol{Z}$. One can in fact show that the factorization in (1) and the global directed Markov property are equivalent, so long as the distribution over $\boldsymbol{X}$ admits a density (Lauritzen90). We will only consider distributions which admit densities in this report, so we will use the terms "distribution" and "density" interchangeably from here on out.
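A d-separation check of this kind can be sketched with the classical moralization criterion: two sets are d-separated given $\boldsymbol{Z}$ in a DAG if and only if they are separated in the moralized graph of the ancestral subgraph of $\boldsymbol{X} \cup \boldsymbol{Y} \cup \boldsymbol{Z}$ (Lauritzen90). The graph representation and function names below are assumptions for the sketch:

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """Check whether the sets xs and ys are d-separated given zs in a DAG.

    `parents` maps each vertex to its set of parents. Uses the classical
    moralization criterion: restrict to the ancestral subgraph of
    xs | ys | zs, moralize (marry parents, drop edge directions),
    delete zs, and test undirected reachability from xs to ys.
    """
    # Ancestral closure of xs | ys | zs.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, set()))
    # Moralize: connect each vertex to its parents and marry the parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = parents.get(v, set()) & relevant
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(sorted(ps), 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete the conditioning set and test reachability.
    seen, stack = set(), [v for v in xs if v not in zs]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(adj[v] - zs)
    return not (seen & ys)
```

For instance, in the collider X → C ← Y, the endpoints are d-separated marginally but become d-connected once we condition on C.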
4 Ancestral Graphs for DAGs
We can associate a directed graph with a mixed graph as follows. For any directed graph $G$ with vertices $\boldsymbol{V}$, we consider the partition $\boldsymbol{V} = \boldsymbol{O} \cup \boldsymbol{L} \cup \boldsymbol{S}$, where $\boldsymbol{O}$, $\boldsymbol{L}$ and $\boldsymbol{S}$ are non-overlapping sets of observable, latent and selection variables, respectively. We then consider a mixed graph $M$ over $\boldsymbol{O}$, where the arrowheads and tails have the following interpretations. If we have the arrowhead $A \ast\!\rightarrow B$, where the asterisk is a meta-symbol denoting either a tail or an arrowhead, then we say that $B$ is not an ancestor of $\{A\} \cup \boldsymbol{S}$ in $G$. On the other hand, if we have the tail $A \ast\!- B$, then we say that $B$ is an ancestor of $\{A\} \cup \boldsymbol{S}$ in $G$. Let $\text{An}(A)$ denote the ancestors of $A$ in $M$. Obviously then, any $M$ constructed from a directed graph $G$ cannot have a directed cycle, where $A \rightarrow B$ is in $M$ and $B \in \text{An}(A)$ in $M$. Similarly, $M$ cannot have an almost directed cycle, where $A \leftrightarrow B$ is in $M$ and $B \in \text{An}(A)$ in $M$.
One can also show that, if $G$ is acyclic, then any mixed graph $M$ constructed from $G$ cannot have an undirected edge with incoming arrowheads at either endpoint (Richardson00). We therefore find it useful to consider a subclass of mixed graphs called ancestral graphs:
Definition 1.
(Ancestral Graphs) A mixed graph is more specifically called an ancestral graph if and only if it satisfies the following three properties:

There is no directed cycle.

There is no almost directed cycle.

For any undirected edge $A - B$, $A$ and $B$ have no incoming arrowheads.
Observe that every mixed graph constructed from a DAG is an ancestral graph.
A maximal ancestral graph (MAG) is an ancestral graph where every missing edge corresponds to a conditional independence relation. One can transform a DAG $G$ into a MAG $M$ as follows. First, for any pair of vertices $O_i, O_j \in \boldsymbol{O}$, make them adjacent in $M$ if and only if there is an inducing path between $O_i$ and $O_j$ in $G$. We define an inducing path as follows:
Definition 2.
(Inducing Path) A path $\Pi$ between $O_i$ and $O_j$ in $G$ is called an inducing path if and only if every collider on $\Pi$ is an ancestor of $\{O_i, O_j\} \cup \boldsymbol{S}$, and every non-collider on $\Pi$ (except for the endpoints) is in $\boldsymbol{L}$.
Note that two observables $O_i$ and $O_j$ are connected by an inducing path if and only if they are d-connected given $\boldsymbol{W} \cup \boldsymbol{S}$ for any $\boldsymbol{W} \subseteq \boldsymbol{O} \setminus \{O_i, O_j\}$ (Spirtes00). Then, for each adjacency in $M$, we have the following edge interpretations:

If we have $O_i \in \text{An}(\{O_j\} \cup \boldsymbol{S})$ in $G$, then the edge has a tail at $O_i$ in $M$.

If we have $O_i \notin \text{An}(\{O_j\} \cup \boldsymbol{S})$ in $G$, then the edge has an arrowhead at $O_i$ in $M$.
The MAG of a DAG is therefore a kind of marginal graph that does not contain the latent or selection variables, but does contain information about the ancestral relations between the observable and selection variables in the DAG. The MAG also has the same d-separation relations as the DAG, specifically among the observable variables conditional on the selection variables (Spirtes96).
5 Directed Cyclic Graphs as Equilibrated Causal Processes
We now allow cycles in a directed graph. Multiple different causal representations of a directed cyclic graph exist in the literature. Examples include dynamic Bayesian networks (Dagum95), structural equation models with feedback (Spirtes95c), chain graphs (Lauritzen02) and mixtures of DAGs (Strobl_thesis17). See (Strobl_thesis17) for a discussion of each of their strengths and weaknesses. In this report, we will only consider structural equation models with feedback.
Recall that the density associated with a DAG obeys the global directed Markov property. However, the density may not obey the property if the directed graph $G$ contains cycles. We therefore must impose certain assumptions on the causal process such that its density does obey the property.
Spirtes95c proposed the following assumptions. We say that a distribution obeys a structural equation model with independent errors (SEM-IE) with respect to $G$ if and only if we can describe each variable as a function of its parents and an error term, $X_i = g_i(\text{Pa}(X_i), \varepsilon_i)$ for all $X_i \in \boldsymbol{X}$, such that $g_i$ is measurable (Evans16). Here, we have a set of jointly independent errors $\boldsymbol{\varepsilon}$, and $\sigma(\cdot)$ refers to the sigma-algebra generated by a random variable. An example of a SEM-IE is illustrated below with an associated directed graph drawn in Figure 0(a):
$$X_i = \sum_{j=1}^{4} B_{ij} X_j + \varepsilon_i, \qquad i = 1, \dots, 4, \qquad (2)$$

where $\boldsymbol{\varepsilon} = \{\varepsilon_1, \dots, \varepsilon_4\}$ denotes a set of jointly independent standard Gaussian error terms, and $B$ is a 4 by 4 coefficient matrix. Notice that the structural equations in (2) are linear structural equations.
We can simulate data from a SEM-IE using the fixed point method (Fisher70). The fixed point method involves two steps per sample. We first sample the error terms according to their independent distributions and initialize the variables to some values. Next, we apply the structural equations iteratively until the values of the random variables converge to values which satisfy the structural equations; in other words, the values converge almost surely to a fixed point.[1] Note that the values of the random variables may not necessarily converge to a fixed point all of the time for every set of structural equations and error distributions, but we will only consider those structural equations and error distributions which do satisfy this property. We call the distribution reached at the fixed points the equilibrium distribution.

[1] We can perform the fixed point method more efficiently in the linear case by first representing the structural equations in matrix format: $\boldsymbol{X} = B\boldsymbol{X} + \boldsymbol{\varepsilon}$. Then, after drawing the values of $\boldsymbol{\varepsilon}$, we can obtain the values of $\boldsymbol{X}$ by solving the following system of equations: $\boldsymbol{X} = (I - B)^{-1}\boldsymbol{\varepsilon}$, where $I$ denotes the identity matrix.
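As a sketch of both simulation routes, assuming a hypothetical 4 by 4 coefficient matrix $B$ with spectral radius below one so that the iteration converges, the fixed point method and the direct linear solve from the footnote agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-by-4 coefficient matrix with a feedback loop between
# the second and third variables; its spectral radius is below one,
# so the fixed point iteration converges.
B = np.array([[0.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.4, 0.0],
              [0.0, 0.3, 0.0, 0.0],
              [0.0, 0.0, 0.6, 0.0]])

eps = rng.standard_normal(4)          # jointly independent Gaussian errors

# Fixed point method: iterate X <- B X + eps until convergence.
x = np.zeros(4)
for _ in range(200):
    x = B @ x + eps

# Direct solve from the footnote: X = (I - B)^{-1} eps.
x_direct = np.linalg.solve(np.eye(4) - B, eps)
assert np.allclose(x, x_direct)
```

Repeating the two steps per sample yields i.i.d. draws from the equilibrium distribution.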
Spirtes95c proved the following regarding linear SEM-IEs, or SEM-IEs with linear structural equations:
Theorem 1.
The equilibrium distribution of a linear SEM-IE satisfies the global directed Markov property with respect to the SEM-IE's directed graph (acyclic or cyclic).
The above theorem provided a basis from which Richardson started constructing the Cyclic Causal Discovery (CCD) algorithm (Richardson96, Richardson99) for causal discovery with feedback.
6 Almost Ancestral Graphs for Directed Graphs
Recall that an ancestral graph satisfies the three properties listed in Definition 1. We now define an almost ancestral graph (AAG), which satisfies only the first two conditions of an ancestral graph. The following result should be obvious:
Proposition 1.
Any mixed graph $M$ constructed from a directed graph (cyclic or acyclic) over $\boldsymbol{V} = \boldsymbol{O} \cup \boldsymbol{L} \cup \boldsymbol{S}$ is an AAG.
Proof.
No directed cycle and no almost directed cycle can exist in $M$ because either would imply that there exists a vertex which is simultaneously an ancestor of another vertex and not an ancestor of that vertex in $G$. ∎
Now an almost ancestral graph is said to be maximal when an edge exists between two vertices $A$ and $B$ if and only if there exists an inducing path between $A$ and $B$. Note that a maximal almost ancestral graph (MAAG) does not necessarily preserve the d-separation relations between the variables in $\boldsymbol{O}$ given $\boldsymbol{S}$ in a directed graph $G$, even though the MAG does so when $G$ is acyclic (Spirtes96). We provide an example of an MAAG in Figure 0(b).
7 Overview of the Proposed Algorithm
We now introduce a new CB algorithm called Cyclic Causal Inference (CCI) for discovering causal relations under CLS. We first provide a bird’s eye view of the algorithm (for more details, see Section 9). We will assume that the reader is familiar with past CB algorithms including PC, FCI, RFCI and CCD; for a brief review of each, see Section 12 in the Appendix.
We have summarized CCI in Algorithm 1. The algorithm first performs FCI's skeleton discovery procedure in Step 1 (see Algorithm 5), which discovers a graph where each adjacency between any two observables corresponds to an inducing path even in the cyclic case (see Lemma 4 of Section 13 for a proof of this statement). The algorithm then orients v-structures in Step 2 using FCI's v-structure discovery procedure (Algorithm 4).
Step 3 of CCI checks for additional long range d-separation relations. Recall that the PC algorithm only checks for short range d-separation relations via v-structures. However, two cyclic directed graphs may agree locally on d-separation relations, but disagree on d-separation relations between distant variables, even if they do not contain any latent variables or selection bias (Richardson94). As a result, Step 3 allows the algorithm to orient additional edges by checking for additional d-separation relations.
Step 4 of CCI discovers non-minimal d-separating sets which were not discovered in the earlier steps. Recall that, in the acyclic case, if we have a v-structure $A \ast\!\rightarrow C \leftarrow\!\ast\, B$ with $A$ and $B$ non-adjacent, then every set d-separating $A$ and $B$ does not contain $C$. However, in the cyclic case, $A$ and $B$ can be d-separated given a set that contains $C$. It turns out that we can infer additional properties about the MAAG if we find d-separating sets which contain $C$. Step 4 of Algorithm 1 therefore discovers these additional non-minimal d-separating sets using Algorithm 2. Steps 5 and 6 in turn utilize the d-separating sets discovered in Steps 1 and 4 in order to orient additional edges. Finally, Step 7 applies the 7 orientation rules described in Section 9. CCI thus ultimately outputs a partially oriented MAAG, or an MAAG with tails, arrowheads and unspecified endpoints denoted by circles.
Now Algorithm 1 is sound due to the following theorem:
Theorem 2.
Consider a DAG or a linear SEM-IE with directed cyclic graph $G$. If d-separation faithfulness holds, then CCI outputs a partially oriented MAAG of $G$.
8 Algorithm Trace
We now illustrate a sample run of the CCI algorithm with a CI oracle. Consider the directed graph in Figure 1(a) containing just one latent variable with MAAG in Figure 1(b). CCI proceeds as follows:

Discovers the skeleton shown in Figure 1(c).

Adds two arrowheads, yielding Figure 1(d).

Does not orient any endpoints.

Discovers an additional d-separating set.

Does not orient any endpoints.

Does not orient any endpoints.

Rule 1 orients two endpoints, Rule 4 then orients two more endpoints, and Rule 1 fires twice again to orient the remaining endpoints.
Now we would expect CCD to output a fairly complete partially oriented MAAG, given that the directed graph contains only one latent variable and no selection variables. However, CCD outputs the graph in Figure 1(e). The output contains one error (an incorrectly inferred non-ancestral relation) and eight unoriented endpoints which were oriented by CCI.
9 Algorithm Details
We present the details of CCI. We claim that statements made herein hold for both cyclic and acyclic directed graphs, unless indicated otherwise. Most of the proofs are located in Section 13.
9.1 Step 1: Skeleton Discovery
We first discover the skeleton of an MAAG by consulting a CI oracle. The following result demonstrates that we can discover the skeleton of an MAAG, if we can search over all possible separating sets:
Lemma 1.
There exists an inducing path between $A$ and $B$ if and only if $A$ and $B$ are d-connected given $\boldsymbol{W} \cup \boldsymbol{S}$ for all possible subsets $\boldsymbol{W} \subseteq \boldsymbol{O} \setminus \{A, B\}$.
Searching over all separating sets is however inefficient. We therefore consider the following sets instead:
Definition 3.
(D-SEP Set) We say that $C \in \text{D-SEP}(A, B)$ in a directed graph $G$ if and only if there exists a sequence of observables from $A$ to $C$ in $G$ such that, for any subpath $\langle V_{i-1}, V_i, V_{i+1} \rangle$ on the sequence, we have an inducing path between $V_{i-1}$ and $V_i$ that is into $V_i$ as well as an inducing path between $V_i$ and $V_{i+1}$ that is into $V_i$.
Notice that $\text{D-SEP}(A, B)$ and $\text{D-SEP}(B, A)$ may not be equivalent. The D-SEP set is important because we can use it to discover inducing paths without searching over all possible separating sets:
Lemma 2.
If there does not exist an inducing path between $A$ and $B$, then $A$ and $B$ are d-separated given $\text{D-SEP}(A, B)$. Likewise, $A$ and $B$ are d-separated given $\text{D-SEP}(B, A)$.
The D-SEP sets are however not computable. We therefore consider computable possible d-separating (PDSEP) sets, which are supersets of the D-SEP sets:
Definition 4.
(Possible D-Separating Set) We say that $C \in \text{PDSEP}(A, B)$ in any partially oriented mixed graph if and only if there exists a path $\Pi$ between $A$ and $C$ in the graph such that, for every subpath $\langle V_{i-1}, V_i, V_{i+1} \rangle$ on $\Pi$, either $\langle V_{i-1}, V_i, V_{i+1} \rangle$ is a v-structure or forms a triangle.
The following lemma shows that we can utilize PDSEP sets in place of D-SEP sets:
Lemma 3.
If there does not exist an inducing path between $A$ and $B$, then $A$ and $B$ are d-separated given some $\boldsymbol{W} \subseteq \text{PDSEP}(A, B)$ in the MAAG $M$. Likewise, $A$ and $B$ are d-separated given some $\boldsymbol{W} \subseteq \text{PDSEP}(B, A)$ in $M$.
Recall that a similar lemma was also proven in the acyclic case (Spirtes00). We conclude that the procedure for discovering the skeleton of an MAAG is equivalent to that of an MAG in the acyclic case.
The justification of Step 1 in Algorithm 1 then follows by generalizing the above lemma to the partially oriented mixed graph discovered by PC's skeleton and FCI's v-structure discovery procedures utilized in Steps 1 and 2:
Lemma 4.
If an inducing path does not exist between $A$ and $B$ in the discovered graph, then $A$ and $B$ are d-separated given some $\boldsymbol{W} \subseteq \text{PDSEP}(A, B)$ computed in that graph. Likewise, $A$ and $B$ are d-separated given some $\boldsymbol{W} \subseteq \text{PDSEP}(B, A)$ computed in that graph.
The above lemma holds because $\text{PDSEP}(A, B)$ formed using the MAAG is a subset of $\text{PDSEP}(A, B)$ formed using the discovered graph; likewise for $\text{PDSEP}(B, A)$.
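The skeleton discovery phase can therefore be sketched as a PC/FCI-style adjacency search that starts from a complete graph and tests conditioning sets of increasing size drawn from the current adjacencies. The `indep` oracle interface and all names below are illustrative assumptions rather than the paper's Algorithm 5:

```python
from itertools import combinations

def skeleton(vertices, indep):
    """PC/FCI-style adjacency search with a CI oracle.

    `indep(a, b, cond)` is assumed to return True when a and b are
    conditionally independent given the set `cond`. Returns the final
    adjacency sets and the minimal separating sets that were found.
    """
    adj = {v: set(vertices) - {v} for v in vertices}
    sepset = {}
    size = 0
    # Grow the conditioning set size until no adjacency offers enough candidates.
    while any(len(adj[a] - {b}) >= size for a in vertices for b in adj[a]):
        for a in vertices:
            for b in list(adj[a]):
                for cond in combinations(sorted(adj[a] - {b}), size):
                    if indep(a, b, set(cond)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = set(cond)
                        break
        size += 1
    return adj, sepset
```

With an oracle encoding the chain A → B → C, for instance, the search removes only the A-C edge and records {B} as its separating set.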
9.2 Steps 2 & 3: Short and Long Range Non-Ancestral Relations
We orient endpoints during v-structure discovery with the following lemma:
Lemma 5.
Consider a set $\boldsymbol{W}$. Now suppose that $A$ and $B$ are d-connected given $\boldsymbol{W}$, and that $B$ and $C$ are d-connected given $\boldsymbol{W}$. If $A$ and $C$ are d-separated given $\boldsymbol{W}$ such that $B \notin \boldsymbol{W}$, then $B$ is not an ancestor of $\boldsymbol{W} \cup \boldsymbol{S}$.
Recall that, if there exists an inducing path between $A$ and $B$ as well as an inducing path between $B$ and $C$, then $A$ and $B$ are d-connected given $\boldsymbol{W}$ and $B$ and $C$ are d-connected given $\boldsymbol{W}$ by Lemma 1. Moreover, if $A$ and $C$ are d-separated given $\boldsymbol{W}$, then an inducing path does not exist between $A$ and $C$. This means that, if we are dealing with the partially oriented MAAG in Figure 2(a), and $A$ and $C$ are d-separated given $\boldsymbol{W}$ such that $B \notin \boldsymbol{W}$, then we orient the endpoints in Figure 2(a) as the arrowheads in Figure 2(b). Lemma 5 therefore justifies the v-structure discovery procedure in Step 2 of CCI for short range non-ancestral relations.
Now Lemma 5 also justifies Step 3 of CCI, because $A$ and $B$ as well as $B$ and $C$ need not be adjacent in the underlying MAAG. In fact, $B$ may be located far from $A$ and $C$. CCI utilizes such long range relations because two cyclic directed graphs may agree "locally" on d-separation relations, but disagree on some d-separation relations between distant variables (Richardson94).
9.3 Step 4: Discovering Non-Minimal D-Separating Sets
The algorithm now utilizes the graph from the earlier steps in order to find additional d-separating sets. The skeleton discovery phase of CCI finds minimal d-separating sets, but non-minimal d-separating sets can also inform the algorithm about the underlying cyclic causal graph. Recall that, in the acyclic case, if we have a v-structure $A \ast\!\rightarrow C \leftarrow\!\ast\, B$ with $A$ and $B$ non-adjacent, then $A$ and $B$ are d-connected given any set containing $C$ (Spirtes96). The same fact is not true however in the cyclic case. Consider for example the graph in Figure 0(a). Here, we know that the two endpoint variables are d-separated given the minimal separating set, but they are also d-separated given a superset containing the collider. Moreover, we have the corresponding v-structure in the MAAG in Figure 0(b).
The above example motivates us to search for additional d-separating sets, which will prove to be important for orienting additional circle endpoints as evidenced in the next section. From Lemma 3, we already know that some subset of $\text{PDSEP}(A, B)$ and some subset of $\text{PDSEP}(B, A)$ d-separate $A$ and $B$ when we additionally condition on the collider. We therefore search for the additional d-separating sets by testing all subsets of $\text{PDSEP}(A, B)$ as well as those of $\text{PDSEP}(B, A)$.
We summarize the details of Step 4 in Algorithm 2. The subprocedure specifically works as follows. For each v-structure $A \ast\!\rightarrow C \leftarrow\!\ast\, B$, the algorithm determines whether $A$ and $B$ are d-separated given specific supersets of the minimal separating set. In particular, Algorithm 2 forms candidate conditioning sets containing the collider $C$ in line 2. The algorithm then consults the CI oracle in line 2. Finally, like the skeleton discovery phase, the algorithm first consults the CI oracle with the smallest subsets and then progresses to larger subsets until such a separating set is found or all subsets have been exhausted.
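The smallest-first superset search just described can be sketched as follows, where `indep` is an assumed CI oracle and `base` would be the minimal separating set augmented with the collider of the v-structure under consideration; all names are illustrative:

```python
from itertools import combinations

def find_superset_sepset(a, b, base, candidates, indep):
    """Search for a separating set of the form base | extra, testing the
    smallest subsets `extra` of `candidates` first.

    `indep(a, b, cond)` is an assumed CI oracle. Returns the first
    separating set found, or None when no subset of `candidates` works.
    """
    pool = sorted(candidates - base)
    for size in range(len(pool) + 1):
        for extra in combinations(pool, size):
            cond = base | set(extra)
            if indep(a, b, cond):
                return cond
    return None
```

Because subsets are enumerated in order of increasing size, the first hit is a smallest separating set of the required form.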
9.4 Step 5: Orienting with Non-Minimal D-Separating Sets
The following lemma justifies Step 5, which utilizes the non-minimal d-separating sets discovered in the previous step:
Lemma 6.
Consider a quadruple of vertices . Suppose that we have:

and nonadjacent;

;

and are dseparated given some with and ;

If , then we have . If in addition we have , then we have .
Notice that the above lemma utilizes the non-minimal d-separating sets discovered in Step 4.
9.5 Step 6: Long Range Ancestral Relations
We can justify Step 6 with the following result:
Lemma 7.
If and are d-separated given , where , and , then and are also d-separated given .
Notice that the above lemma allows us to infer long range ancestral relations because all variables in the conditioning set are ancestors of the endpoints. Step 6 of Algorithm 1 then follows by the contrapositive of Lemma 7:
Corollary 1.
Assume that and are d-separated by with and , but and are d-connected by . Then, is not an ancestor of .
9.6 Step 7: Orientation Rules
We will now describe the orientation rules in Step 7. Notice that the orientation rules are always applied after Step 6 and therefore also after v-structure discovery. This ordering implies that, if and are non-adjacent and we have
9.6.1 First to Third Orientation Rules
Lemma 5 allows us to infer nonancestral relations. The following lemma allows us to infer ancestral relations:
Lemma 8.
Suppose that there is a set and every proper subset d-connects and given . If and are d-separated given where , then is an ancestor of .
The above lemma justifies the following orientation rule:
Lemma 9.
If we have
Proof.
If we have
For example, if we have the structure in Figure 3(a), then we can add an undirected edge as in Figure 3(b).
We may also add an arrowhead at in Figure 3(b) provided that some additional conditions are met. The following lemma is central to causal discovery with cycles:
Lemma 10.
If we have with and non-adjacent, then is in a triangle involving and with and . Moreover, there exists a sequence of undirected edges between and that does not include .
The above statement may appear arcane at first glance, but it justifies multiple orientation rules.
The following definitions are useful towards applying Lemma 10:
Definition 5.
(Potentially Undirected Path) A potentially undirected path $\Pi$ exists between $A$ and $B$ if and only if all endpoints on $\Pi$ are tails or circles.
Definition 6.
(Potential 2-Triangulation) The edge is said to be potentially 2-triangulated w.r.t. another vertex if and only if (1) the edge and that vertex are in a triangle, and (2) the endpoints of the triangle have the required configuration.
We now have three orientation rules that utilize the concept of potential 2-triangulation:
Lemma 11.
The following orientation rules are sound:

If we have

If we have

Suppose that we have with and non-adjacent, and is potentially 2-triangulated w.r.t. . If can be potentially 2-triangulated w.r.t. using only one vertex in the triangle involving , then orient as , as and/or as . Next, if there exists only one potentially undirected path between and , then substitute all circle endpoints on with tail endpoints.
Proof.
The following arguments correspond to their associated orientation rule:

Follows directly from Lemma 10.
∎
For example, the edge is potentially 2-triangulated in Figure 3(a) (there are only three variables), so we may orient the endpoint as an arrowhead according to the first orientation rule. On the other hand, the edge is not potentially 2-triangulated in Figure 3(c), so we cannot orient the circle endpoint as an arrowhead. However, if we additionally have the required condition, then we can apply Rule 3 to orient the endpoint.
9.6.2 Fourth & Fifth Orientation Rules
Lemma 12.
The following orientation rules are sound:

If there exists a path with at least vertices such that we have for all except for only one index where we have , then orient as .

If we have the sequence of vertices such that with , and we have , then orient as .
Proof.
The following arguments correspond to their associated orientation rule:

Suppose for a contradiction that we had . But then is an ancestor of by transitivity of the tails, a contradiction.

Follows by transitivity of the tail.
∎
9.6.3 Sixth & Seventh Orientation Rules
The CCI algorithm has two more orientation rules, which require successive applications of the first orientation rule. We first require the following definition:
Definition 7.
(Non-Potentially 2-Triangulated Path) A path is said to be non-potentially 2-triangulated if the following conditions hold:

If , then the vertices and are non-adjacent for every (i.e., every consecutive triple is non-adjacent), and