Belief Networks

or Bayesian Belief Networks combine probability and graph theory to represent probabilistic models in a structured way.

Read: 12 min

  • translate graphical models to do inference
  • read and interpret graphical models
  • identify dependencies and causal links
  • work through an example


Belief Networks (BN) combine probabilities and their graphical representation to show causal relationships and assumptions about the parameters – e.g. independence between parameters. This is done through nodes and directed links (vertices) between the nodes to form a Directed Acyclic Graph (DAG). This also allows to display operations like conditioning on parameters or marginalizing over parameters.

A DAG is a graph with directed vertices
that does not contain any cycles. 
Fig 1. A basic representation of a Belief Network as a graph.

Use and Properties

In Fig.1 we can see a model with four parameters lets say rain (R), Jake (J), Thomas (T) and sprinkler (S). We observe that R has an effect on our parameters T and J. We can make a distinction between cause and effect – e.g. the rain is causal to Jake being wet.
Further we can model or observe constraints that our model holds, such that we do not have to account for all combination of possible parameters (2^N) and can reduce the space and computational costs.
This graph can represent different independence assumptions through the directions of the vertices (arrows).

Contitional Independence

We know that when conditional independence holds we can express a joint probability as p(x_1,x_2,x_3)=(x_i_1 | x_i_2,x_i_3,)p(x_i_2|x_i_3)p(x_i_3). This gives six possible permutations, so that simply drawing a graph does not work.

Remember there should not be any cycles in the graph. Therefore we have to drop at least two vertices such that we get 4 possible DAGs from the 6 permutations.
Now, when two vertices point to one and the same node we get a collision:

Fig. 2 a) Collisions and induced dependence
Fig. 2b) Model conditioned on T

In case a collision occurs we get either d-separation or a d-connection. For example (Fig. 2) we see that X and Y are independent of each other and are ’cause’ to the effect Z. We can also write p(X,Y,Z)=p(Z|X, Y)p(X)p(Y).
However if we would condition a model on the collision node (see Fig. 2b), we get p(X,Y|T)\neq p(X|T)p(Y|T) and X and Y are not independent anymore.
A connection is introduced on the right model in Fig. 2a where X and Y are now conditionally dependent given Z.

Belief Networks encode conditional independence well, they do not encode dependence well.

Definition: Two nodes (\mathcal{X} and \mathcal{Y}) are d-separated by \mathcal{Z} in a graph G are if and only if they are not d-connected.
Pretty straight forward, right?

A more applicable explanation is: for every variable x in \mathcal{X} and y in \mathcal{Y}, trace every path U between x and y, if all paths are blocked then two nodes are d-separated. For the definition of blocking and descendants of nodes you can find more in this paper.

Remember: X is conditionally independent of Y if p(X,Y)=p(X)p(Y).

Now away from the theory to something more practical:

How to interact with a model

  1. Analyze the problem and/or set of parameters
    1. set a graphical structure for the parameters in a problem with a DAG
    2. OR reconstruct the problem setting from the parameters
  2. Compile a table of all required conditional probabilities – p(x_i|p_a(x_i))
  3. Set and specify parental relations


Recap Fig. 1 – our short example case

Thomas and Jake are neighbors. When Thomas gets out of the house in the morning he observes that the grass is wet (effect). We want to know how likely it is that Thomas forgot to turn of the sprinkler (S). Knowing that it had rained the day before explains away that the grass is wet in the morning. So let us look at the probabilities:

We have 4 boolean (yes/no) variables (see Fig. 1):
R (rain [0,1]), S (sprinkler [0,1]), T (thomas’s-grass [0,1]), J (jake’s-grass [0,1]). So that we can say that Thomas’s grass is wet, given that Jake’s grass is wet, the sprinkler was off and it rained as p(T=1 | J=1, S=0, R=1).
Already in such small example we have a lot of possible states, namely 2^N=2^4=16 values. We already know that our model has constraints, e.g. the sprinkler is independent of the rain and when it rains both Thomas’s and Jake’s grass is wet.

After taking into account our constraints we can factorize our joint probability into
p(T, J, R, S) = p(T|R,S)p(J|R)p(R)p(S) – we say: “The joint probability is computed by the probability that Thomas’s grass is wet, given rain and sprinkler multiplied with the probability that Jake’s grass is wet, given that it rained and the probability that it rained and the probability that the sprinkler was on.” We have now reduced our problem to 4+2+1+1=8 values. Neat!

We use factorization of joint probabilities to reduce the total number of required states and their (conditional) probabilities.

p(J=1 | R=1)1.0
p(J=1 | R=0)*0.2
p(T=1 | R=1, S=0)1.0
p(T=1 | R=1, S=1)1.0
p(T=1 | R=0, S=1)0.9
p(T=1 | R=0, S=0)0.0
Our conditional probability table (CPT)
* sometimes Jake’s grass is simply wet – we don’t know why…

We are still interested if Thomas left the sprinkler on. Therefore compute the posterior probability p(S=1|T=1) = 0.3382 (see below).

    \begin{align*} p(S=1|T=1) &= \frac{p(S=1,T=1)}{p(T=1)}\\ &=\frac{\sum_{J,R}p(T=1, J, R, S=1)}{\sum_{J,R,S}p(T=1, J, R, S)}\\ &=\frac{\sum_R p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= 0.3382 \end{align*}

When we compute the posterior given that Jake’s grass is also wet we get
p(S=1|T=1, J=1) = 0.1604

    \begin{align*} p(S=1|T=1, J=1) &= \frac{p(S=1,T=1, J=1)}{p(T=1, J=1)}\\ &=\frac{\sum_R p(J=1|R)p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(J=1|R)p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= \frac{0.0344}{0.2144}=0.1604 \end{align*}

We have shown that it is less likely that Thomas sprinkler is on, when Jake’s grass is also wet. Where Jake’s grass is extra evidence that affects the likelihood of our observed effect.


Fig. 3 – Uncertainties and unreliable evidence

In case an observed outcome holds uncertainty or we do not trust the source of our values we can account for that. Lets assume the sprinkler malfunctions sometimes, so that instead of a hard [0,1] we get a vector S=[0.2, 0.8] instead, also called soft evidence. We denote this with dashed nodes (see Fig. 2).
Lets assume Jake’s house is quite far away and there is an added unreliability in the evidence that his grass is wet — maybe Jake is a notorious liar about the wetness of his soil. We denote this with a dotted vertex (see Fig.2).
To solve and to account for uncertainties and unreliability is a whole different topic, which we will address in another post.


Some dependency statements cannot be represented structurally in this way; for example marginalized graphs.
One famous example that BNs hold errors when it comes to causal relations is the Simpson Paradoxon . This can be solved with atomic intervention, which is a topic for yet another post.


  • (Bayesian) Belief Networks represent probabilistic models as well as the factorization of distributions into conditional probabilities
  • BNs are directed acyclic graphs (DAGs)
  • We can reduce the amount of computation and space required by taking into account constraints from the model which are expressed structurally
  • conditional independence is expressed as absence of a link in a network


  1. D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press. USA. 2012: pp.29-51
  2. G. Pipa, Neuroinformatics Lecture Script. 2015: pp.19-26